Add cohere-embed-english-v3.0 to msmarco-v1-passage 2cr (#1794)
manveertamber committed Feb 23, 2024
1 parent 0f08ad2 commit 442e7e1
Showing 6 changed files with 192 additions and 9 deletions.
124 changes: 119 additions & 5 deletions docs/2cr/msmarco-v1-passage.html
@@ -6062,16 +6062,16 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
<td class="expand-button"></td>
<td style="min-width: 85px">[14]</td>
<td style="min-width: 400px">BGE-base-en-v1.5: PyTorch</td>
<td>0.4435</td>
<td>0.7065</td>
<td>0.4436</td>
<td>0.7055</td>
<td>0.8472</td>
<td></td>
<td>0.4650</td>
<td>0.4651</td>
<td>0.6780</td>
<td>0.8503</td>
<td></td>
<td>0.3896</td>
<td>0.9796</td>
<td>0.3557</td>
<td>0.9814</td>
</tr>
<tr class="hide-table-padding">
<td></td>
@@ -6173,6 +6173,116 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
</div>
<!-- Tabs content -->

</div></td>
</tr>
<tr><td style="border-bottom: 0"></td></tr>
<!-- Condition: Cohere Embed English v3.0: pre-encoded queries -->
<tr class="accordion-toggle collapsed" id="row54" data-toggle="collapse" data-parent="#row54" href="#collapse54">
<td class="expand-button"></td>
<td style="min-width: 85px"></td>
<td style="min-width: 400px">Cohere Embed English v3.0: pre-encoded queries</td>
<td>0.4884</td>
<td>0.6956</td>
<td>0.8630</td>
<td></td>
<td>0.5067</td>
<td>0.7245</td>
<td>0.8682</td>
<td></td>
<td>0.3660</td>
<td>0.9785</td>
</tr>
<tr class="hide-table-padding">
<td></td>
<td colspan="11">
<div id="collapse54" class="collapse in p-3">

<!-- Tabs navs -->
<ul class="nav nav-tabs mb-3" id="row54-tabs" role="tablist">
<li class="nav-item" role="presentation">
<a class="nav-link active" id="row54-tab1-header" data-mdb-toggle="tab" href="#row54-tab1" role="tab" aria-controls="row54-tab1" aria-selected="true" style="text-transform:none">TREC 2019</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row54-tab2-header" data-mdb-toggle="tab" href="#row54-tab2" role="tab" aria-controls="row54-tab2" aria-selected="false" style="text-transform:none">TREC 2020</a>
</li>
<li class="nav-item" role="presentation">
<a class="nav-link" id="row54-tab3-header" data-mdb-toggle="tab" href="#row54-tab3" role="tab" aria-controls="row54-tab3" aria-selected="false" style="text-transform:none">dev</a>
</li>
</ul>
<!-- Tabs navs -->

<!-- Tabs content -->
<div class="tab-content" id="row54-content">
<div class="tab-pane fade show active" id="row54-tab1" role="tabpanel" aria-labelledby="row54-tab1">
Command to generate run on TREC 2019 queries:

<blockquote class="mycode">
<pre><code>python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--index msmarco-v1-passage.cohere-embed-english-v3.0 \
--topics dl19-passage --encoded-queries cohere-embed-english-v3.0-dl19-passage \
--output run.msmarco-v1-passage.cohere-embed-english-v3.0.dl19.txt
</code></pre></blockquote>
Evaluation commands:

<blockquote class="mycode">
<pre><code>python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dl19.txt
</code></pre>
</blockquote>

</div>
<div class="tab-pane fade" id="row54-tab2" role="tabpanel" aria-labelledby="row54-tab2">
Command to generate run on TREC 2020 queries:

<blockquote class="mycode">
<pre><code>python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--index msmarco-v1-passage.cohere-embed-english-v3.0 \
--topics dl20 --encoded-queries cohere-embed-english-v3.0-dl20 \
--output run.msmarco-v1-passage.cohere-embed-english-v3.0.dl20.txt
</code></pre></blockquote>
Evaluation commands:

<blockquote class="mycode">
<pre><code>python -m pyserini.eval.trec_eval -c -l 2 -m map dl20-passage \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dl20.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl20-passage \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dl20.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl20-passage \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dl20.txt
</code></pre>
</blockquote>

</div>
<div class="tab-pane fade" id="row54-tab3" role="tabpanel" aria-labelledby="row54-tab3">
Command to generate run on dev queries:

<blockquote class="mycode">
<pre><code>python -m pyserini.search.faiss \
--threads 16 --batch-size 512 \
--index msmarco-v1-passage.cohere-embed-english-v3.0 \
--topics msmarco-passage-dev-subset --encoded-queries cohere-embed-english-v3.0-msmarco-passage-dev-subset \
--output run.msmarco-v1-passage.cohere-embed-english-v3.0.dev.txt
</code></pre></blockquote>
Evaluation commands:

<blockquote class="mycode">
<pre><code>python -m pyserini.eval.trec_eval -c -M 10 -m recip_rank msmarco-passage-dev-subset \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dev.txt
python -m pyserini.eval.trec_eval -c -m recall.1000 msmarco-passage-dev-subset \
run.msmarco-v1-passage.cohere-embed-english-v3.0.dev.txt
</code></pre>
</blockquote>

</div>
</div>
<!-- Tabs content -->

</div></td>
</tr>

@@ -6236,6 +6346,10 @@ <h1 class="mb-3">MS MARCO V1 Passage</h1>
<a href="https://dl.acm.org/doi/10.1145/3583780.3615112">Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes.</a>
<i>Proceedings of the 32nd International Conference on Information and Knowledge Management (CIKM 2023)</i>, October 2023, pages 5366–5370, Birmingham, the United Kingdom.</p></li>

<li><p>[14] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff.
<a href="https://arxiv.org/abs/2309.07597">C-Pack: Packaged Resources To Advance General Chinese Embedding.</a>
<i>arXiv:2309.07597</i>, December 2023.</p></li>

</ul>

<div style="padding-top: 20px"/>
5 changes: 4 additions & 1 deletion docs/prebuilt-indexes.md
@@ -1069,7 +1069,10 @@ Detailed configuration information for the pre-built indexes are stored in [`pys
<dd>Faiss FlatIP index of the MS MARCO passage corpus encoded by the tct_colbert-v2-hnp passage encoder
</dd>
<dt></dt><b><code>msmarco-v1-passage.openai-ada2</code></b>
<dd>Faiss FlatIP index of the MS MARCO document corpus encoded by TCT-ColBERT-V2-HNP
<dd>Faiss FlatIP index of the MS MARCO passage corpus encoded by OpenAI ada2
</dd>
<dt></dt><b><code>msmarco-v1-passage.cohere-embed-english-v3.0</code></b>
<dd>Faiss FlatIP index of the MS MARCO passage corpus encoded by Cohere Embed English v3.0
</dd>
<dt></dt><b><code>msmarco-v1-doc.ance-maxp</code></b>
<dd>Faiss FlatIP index of the MS MARCO document corpus encoded by the ANCE MaxP encoder
25 changes: 24 additions & 1 deletion pyserini/2cr/msmarco-v1-passage.yaml
@@ -1200,4 +1200,27 @@ conditions:
scores:
- MAP: 0.4938
nDCG@10: 0.6666
R@1K: 0.8919
R@1K: 0.8919
- name: cohere-embed-english-v3.0
display: "Cohere Embed English v3.0: pre-encoded queries"
display-html: "Cohere Embed English v3.0: pre-encoded queries"
display-row: ""
command: python -m pyserini.search.faiss --threads ${dense_threads} --batch-size ${dense_batch_size} --index msmarco-v1-passage.cohere-embed-english-v3.0 --topics $topics --encoded-queries cohere-embed-english-v3.0-$topics --output $output
topics:
- topic_key: msmarco-passage-dev-subset
eval_key: msmarco-passage-dev-subset
scores:
- MRR@10: 0.3660
R@1K: 0.9785
- topic_key: dl19-passage
eval_key: dl19-passage
scores:
- MAP: 0.4884
nDCG@10: 0.6956
R@1K: 0.8630
- topic_key: dl20
eval_key: dl20-passage
scores:
- MAP: 0.5067
nDCG@10: 0.7245
R@1K: 0.8682
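
The `command` field in the condition above uses `$name` / `${name}` placeholders. As a minimal sketch, assuming these are filled in with Python's `string.Template` (the helper `expand_command` and its default values are hypothetical; the defaults 16 and 512 are taken from the generated HTML commands in this commit), the expansion could look like this:

```python
from string import Template

# Command template copied from the YAML condition above.
COMMAND = Template(
    "python -m pyserini.search.faiss --threads ${dense_threads} "
    "--batch-size ${dense_batch_size} "
    "--index msmarco-v1-passage.cohere-embed-english-v3.0 "
    "--topics $topics --encoded-queries cohere-embed-english-v3.0-$topics "
    "--output $output"
)


def expand_command(topics: str, output: str,
                   dense_threads: int = 16,
                   dense_batch_size: int = 512) -> str:
    """Fill the placeholders for one topic set (hypothetical helper)."""
    return COMMAND.substitute(
        topics=topics,
        output=output,
        dense_threads=dense_threads,
        dense_batch_size=dense_batch_size,
    )


if __name__ == "__main__":
    print(expand_command(
        "dl19-passage",
        "run.msmarco-v1-passage.cohere-embed-english-v3.0.dl19.txt",
    ))
```

Because `$topics` appears twice in the template, the `topic_key` value selects both the topic set and the matching pre-encoded query bundle, which is why the `dl20` condition resolves to `cohere-embed-english-v3.0-dl20`.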
2 changes: 2 additions & 0 deletions pyserini/2cr/msmarco.py
@@ -107,6 +107,8 @@
'cosdpr-distil-pytorch',
'',
'bge-base-en-v1.5-pytorch',
'',
'cohere-embed-english-v3.0',
],

# MS MARCO v1 doc
30 changes: 30 additions & 0 deletions pyserini/encoded_query_info.py
@@ -485,6 +485,36 @@
"total_queries": 6980,
"downloaded": False
},
"cohere-embed-english-v3.0-dl19-passage": {
"description": "TREC DL19 passage queries encoded by Cohere Embed English v3.0.",
"urls": [
"https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-cohere-embed-english-v3.0-dl19-passage-20240216-2154e79.tar.gz",
],
"md5": "04300156fe6be309b8d83270dbb328c6",
"size (bytes)": 141545,
"total_queries": 43,
"downloaded": False
},
"cohere-embed-english-v3.0-dl20": {
"description": "TREC DL20 passage queries encoded by Cohere Embed English v3.0.",
"urls": [
"https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-cohere-embed-english-v3.0-dl20-passage-20240216-2154e79.tar.gz",
],
"md5": "0b12d7049ba46f1ebe1ae07f0e7c1723",
"size (bytes)": 646705,
"total_queries": 200,
"downloaded": False
},
"cohere-embed-english-v3.0-msmarco-passage-dev-subset": {
"description": "MS MARCO passage dev set queries encoded by Cohere Embed English v3.0.",
"urls": [
"https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-cohere-embed-english-v3.0-msmarco-passage-dev-subset-20240216-2154e79.tar.gz",
],
"md5": "7dd0026490117e9e6f6acfc110d6ce83",
"size (bytes)": 22377230,
"total_queries": 6980,
"downloaded": False
},
"atomic-v0.2.1-text-ViT-L-14.laion2b_s32b_b82k-validation": {
"description": "AToMiC text v0.2.1 validation set encoded by ViT-L-14.laion2b_s32b_b82k.",
"urls": [
15 changes: 13 additions & 2 deletions pyserini/prebuilt_index_info.py
@@ -3756,7 +3756,7 @@
"texts": "msmarco-v1-passage"
},
"msmarco-v1-passage.openai-ada2": {
"description": "Faiss FlatIP index of the MS MARCO document corpus encoded by TCT-ColBERT-V2-HNP",
"description": "Faiss FlatIP index of the MS MARCO passage corpus encoded by OpenAI ada2",
"filename": "faiss.msmarco-v1-passage.openai-ada2.20230530.e3a58f.tar.gz",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/faiss.msmarco-v1-passage.openai-ada2.20230530.e3a58f.tar.gz"
@@ -3767,7 +3767,18 @@
"downloaded": False,
"texts": "msmarco-v1-passage"
},

"msmarco-v1-passage.cohere-embed-english-v3.0": {
"description": "Faiss FlatIP index of the MS MARCO passage corpus encoded by Cohere Embed English v3.0",
"filename": "faiss.msmarco-v1-passage.cohere-embed-english-v3.0.20240216.2154e79.tar.gz",
"urls": [
"https://rgw.cs.uwaterloo.ca/pyserini/indexes/faiss.msmarco-v1-passage.cohere-embed-english-v3.0.20240216.2154e79.tar.gz"
],
"md5": "df0d8e2aac71fb3ee8b554bdcf158f95",
"size compressed (bytes)": 21341576907,
"documents": 8841823,
"downloaded": False,
"texts": "msmarco-v1-passage"
},
"msmarco-v1-doc.ance-maxp": {
"description": "Faiss FlatIP index of the MS MARCO document corpus encoded by the ANCE MaxP encoder",
"filename": "faiss.msmarco-v1-doc.ance_maxp.20210304.b2a1b0.tar.gz",
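
Each index and encoded-query entry in this commit records an `md5` value for its tarball. A minimal sketch of checking a downloaded file against such an entry, using only `hashlib` from the standard library, might look like the following. The helper names `md5_of_file` and `matches_entry` are hypothetical and are not taken from Pyserini's own download code.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_entry(path: str, entry: dict) -> bool:
    """Compare a local file against the "md5" field of an info-dict entry."""
    return md5_of_file(path) == entry["md5"]
```

Reading in chunks matters here because the Cohere index tarball is listed at roughly 21 GB compressed, which is too large to load into memory in one call.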