Dense retrieval draft #278
Conversation
Let's see if we can replicate the results (i.e., MRR) in the arxiv paper. How about we create a
sure, sg
Add a driver program for replicating results from the paper? E.g., https://github.com/castorini/pyserini/blob/master/pyserini/search/__main__.py. Not sure if we want it in that file or a different one, though. Depends on the complexity of the code paths.
For example, this works:
Implement something similar for dense retrieval? Maybe
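(The concrete command was elided above. As a sketch, a dense-retrieval driver mirroring pyserini/search/__main__.py might expose an interface like the following; the flag names are assumptions, not the final API.)

```python
# Hypothetical argument interface for a dense-retrieval driver, modeled on
# pyserini/search/__main__.py; all flag names here are assumptions.
import argparse

parser = argparse.ArgumentParser(description='Search an MS MARCO dense index.')
parser.add_argument('--topics', required=True, help='name of the topic/query set')
parser.add_argument('--index', required=True, help='path to a prebuilt dense index')
parser.add_argument('--output', required=True, help='path to the output run file')
parser.add_argument('--batch', type=int, default=36, help='query encoding batch size')
args = parser.parse_args()
```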
Sure. I'm having some issues replicating the result right now (I guess there might be a bug in my HNSW indexing, but I'm still investigating). I'll add it once the scores are replicated successfully on my end.
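(For context, indexing dense vectors with FAISS HNSW generally looks like the minimal sketch below; the parameters and file names are assumptions, not what this PR uses.)

```python
import faiss
import numpy as np

# `passage_embeddings.npy` is a hypothetical file of TCT-ColBERT passage vectors.
embeddings = np.load('passage_embeddings.npy').astype('float32')

d = embeddings.shape[1]                 # 768 for TCT-ColBERT
index = faiss.IndexHNSWFlat(d, 32)      # 32 links per node (an assumption)
index.hnsw.efConstruction = 100         # build-time accuracy/speed trade-off
index.add(embeddings)                   # HNSW needs no separate training step
faiss.write_index(index, 'index.hnsw')  # output name is illustrative
```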
reproduced MRR@10 of tct_colbert
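(For readers following along, MRR@10 is the official MS MARCO passage ranking metric; here is a minimal reference implementation of the metric itself, not the evaluation script used in this thread.)

```python
def mrr_at_10(run, qrels):
    """run: qid -> ranked list of docids; qrels: qid -> set of relevant docids."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank   # reciprocal rank of first relevant hit in top 10
                break
    return total / len(run)          # queries with no hit in the top 10 contribute 0
```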
So we have three modes of operation: brute-force GPU, brute-force CPU, and HNSW CPU - right? What's your proposed design - to have separate classes for each type? There are pros and cons I can see for each alternative...
I didn't implement GPU mode in pyserini; the original tct_colbert experiments used brute-force GPU.
So right now what I have implemented supports four modes of operation (on CPU).
I am assuming we are getting prebuilt indexes for both the brute-force index and the HNSW index, which means in pyserini we only do the search.
Okay, sounds good - a simple class then!
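(A sketch of what that simple class could look like, assuming prebuilt FAISS indexes that ship with an idx-to-docid mapping; the class and file names are illustrative, not the merged API.)

```python
import faiss
import numpy as np

class SimpleDenseSearcher:
    """Illustrative only; not the final pyserini API."""

    def __init__(self, index_dir: str):
        # faiss.read_index handles both flat (brute-force) and HNSW indexes,
        # so a single class can cover all the CPU modes discussed above.
        self.index = faiss.read_index(f'{index_dir}/index')
        with open(f'{index_dir}/docid') as f:           # row i -> docid, one per line
            self.docids = [line.strip() for line in f]

    def search(self, emb: np.ndarray, k: int = 10):
        emb = emb.reshape(1, -1).astype('float32')      # FAISS expects 2D float32
        scores, idxs = self.index.search(emb, k)
        return [(self.docids[i], float(s)) for i, s in zip(idxs[0], scores[0])]
```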
hey @MXueguang I was just thinking about this - the idx to docid mapping should be pretty small, right? can't we just manually include it in the prebuilt index? The common case in the future is that the user will do hybrid dense/sparse - so it doesn't make sense to have two separate copies of the text?
yeah it sounds good. I'll make the prebuilt index contain the idx-to-docid mapping.
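(The mapping itself can be as small as one docid per line, row-aligned with the order vectors were added to FAISS; a sketch of writing it at index-build time, with an illustrative path.)

```python
# Write the idx -> docid mapping next to the FAISS index: line i names the
# document stored at FAISS row i. `docids` is assumed to be in insertion order.
with open('msmarco-passage-tct_colbert/docid', 'w') as f:  # path is illustrative
    for docid in docids:
        f.write(docid + '\n')
```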
I think we're pretty close to an initial merge?
Address these comments, and let's merge?
# encode the title of each topic in batches of size args.batch
topic_keys = sorted(topics.keys())
for i in tqdm(range(0, len(topic_keys), args.batch)):
    topic_key_batch = topic_keys[i: i + args.batch]
    topic_emb_batch = np.array([query_encoder.encode(topics[topic].get('title').strip())
                                for topic in topic_key_batch])
I got lost here reading the code... the query encodings are pre-stored somewhere, right? Where are they loaded?
Should we also have something like .load_encoded_queries() or something like that to load pre-encoded queries?
There is a line query_encoder = QueryEncoder(searcher.index_dir) where I load the pre-encoded queries from the index dir.
I place the pre-encoded queries with the prebuilt index right now, i.e., the registered msmarco-passage-tct_colbert.tar.gz contains the index
+ docid
+ query text
+ query embedding

Will we always have pre-encoded queries for a prebuilt index? If so, I feel we can pack the prebuilt index and pre-encoded queries together, like what I am doing right now?
I see. I think the pre-encoded queries should be separate, because an index can be paired with multiple query sets - for example, for MS MARCO, there are dev queries and test queries.
I think something like .load_encoded_queries() would be clearer.
sure, sg
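(One shape .load_encoded_queries() could take, decoupling query sets from the index so one index can pair with dev, test, etc.; the file layout and names here are assumptions.)

```python
import numpy as np

class QueryEncoder:
    """Sketch only; mirrors the thread's idea of pre-encoded queries."""

    def __init__(self, embeddings_by_text):
        self.embeddings_by_text = embeddings_by_text

    @classmethod
    def load_encoded_queries(cls, query_dir: str):
        # one embedding per query, row-aligned with the query-text file
        embeddings = np.load(f'{query_dir}/embedding.npy')   # hypothetical layout
        with open(f'{query_dir}/query_text') as f:
            texts = [line.strip() for line in f]
        return cls(dict(zip(texts, embeddings)))

    def encode(self, query: str):
        # a lookup instead of running the TCT-ColBERT query encoder
        return self.embeddings_by_text[query]
```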
docs/pypi-replication.md
Outdated
@@ -56,6 +56,37 @@ map all 0.2805
recall_1000 all 0.9470
```
MS MARCO passage ranking task, dense retrieval with TCT-ColBERT, HNSW index.
Can you move this to a new file, like docs/dense-retrieval.md? This feature is still experimental... I don't want to mix it with pretty stable stuff, like the rest of the docs here...
sure
A few more minor tweaks please.
An example of usage: since the dense index doesn't contain raw data, I loaded the corpus separately.
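(End to end, usage could look like the sketch below, using the hypothetical SimpleDenseSearcher and QueryEncoder sketched earlier in this thread; the raw passage text comes from pyserini's sparse prebuilt index, since the dense index stores only vectors.)

```python
from pyserini.search import SimpleSearcher

# SimpleDenseSearcher and QueryEncoder are the hypothetical sketches above,
# not shipped pyserini classes; index and query paths are illustrative.
dense = SimpleDenseSearcher('indexes/msmarco-passage-tct_colbert')
query_encoder = QueryEncoder.load_encoded_queries('queries/msmarco-passage-dev-subset')
corpus = SimpleSearcher.from_prebuilt_index('msmarco-passage')  # for raw passage text

# the query must be one of the pre-encoded queries in the loaded set
emb = query_encoder.encode('example query from the dev set')
for docid, score in dense.search(emb, k=10):
    print(docid, score, corpus.doc(docid).raw()[:80])
```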