Dense retrieval draft #278

Merged
merged 14 commits into castorini:master from dense_retrieval on Jan 4, 2021

Conversation

@MXueguang (Member)

An example of usage. Since the dense index doesn't contain the raw documents, I load the corpus separately:

import numpy as np
from pyserini.search import SimpleDenseSearcher

searcher = SimpleDenseSearcher.from_prebuilt_index('msmarco_passage_0', 'collection.tsv')

query_emb = np.random.random(768).astype('float32')
result = searcher.search(query_emb)

result[0].raw
>> 'Lander, WY Sales Tax Rate. The current total local sales tax rate in Lander, WY is 5.000%. The December 2015 total local sales tax rate was also 5.000%. Lander, WY is in Fremont County. Lander is in the following zip codes: 82520.'

result[0].docid
>> '350921'

result[0].score
>> 0.42547345

searcher.doc('123')
>> Document(docid='123', raw='With a number of condo developments springing up in the city, it can be difficult to narrow down your choices for the perfect Montreal condo for sale. Our skilled agents organize your steps towards meeting your goals with our condo projects located in popular and trendy neighbourhoods.')

@lintool (Member)

lintool commented Dec 18, 2020

Let's see if we can replicate the results (i.e., MRR) in the arXiv paper.

How about we create a QueryEncoder class? Then we can have something like .load_pre_encoded_queries() - it takes a natural language query and looks up the vector representation directly.
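A minimal sketch of the QueryEncoder idea above: map a natural-language query to its pre-computed embedding by lookup, with no model inference at query time. The method name `load_pre_encoded` and the in-memory layout are illustrative assumptions, not pyserini's actual API.

```python
import numpy as np

class QueryEncoder:
    """Sketch: serves pre-computed query embeddings via dictionary lookup."""

    def __init__(self):
        self.cache = {}  # query text -> embedding

    def load_pre_encoded(self, pairs):
        # pairs: iterable of (query_text, embedding) encoded offline
        for text, emb in pairs:
            self.cache[text] = np.asarray(emb, dtype='float32')

    def encode(self, query):
        return self.cache[query]  # direct lookup, no model needed

encoder = QueryEncoder()
encoder.load_pre_encoded([('what is the sales tax in lander wy',
                           np.zeros(768))])
emb = encoder.encode('what is the sales tax in lander wy')
print(emb.shape, emb.dtype)  # (768,) float32
```

A real implementation would read the embeddings from disk, but the lookup-instead-of-inference design is the point here.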

@MXueguang (Member, Author)

sure, sg

@lintool (Member)

lintool commented Dec 20, 2020

Add a driver program for replicating results from the paper? e.g., https://github.com/castorini/pyserini/blob/master/pyserini/search/__main__.py

Not sure if we want it in that file or a different one, though. Depends on the complexity of the code paths.

@lintool (Member)

lintool commented Dec 20, 2020

For example, this works:

python -m pyserini.search --topics msmarco_passage_dev_subset \
 --index msmarco-passage --output runs/run.msmarco-passage.2.txt --bm25 --msmarco &

Implement something similar for dense retrieval? Maybe --sparse as the default, with --dense as an option? Then we can have --hybrid?
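The flag layout floated above could be sketched with argparse roughly as follows; everything beyond the `--sparse`/`--dense`/`--hybrid` names is an assumption for illustration, not the driver that was eventually merged.

```python
import argparse

# Sketch of a retrieval driver's CLI: sparse (BM25) by default, with dense
# and hybrid as mutually exclusive alternatives.
parser = argparse.ArgumentParser(description='retrieval driver (sketch)')
parser.add_argument('--topics', required=True)
parser.add_argument('--index', required=True)
parser.add_argument('--output', required=True)
mode = parser.add_mutually_exclusive_group()
mode.add_argument('--sparse', action='store_true', help='BM25 (default)')
mode.add_argument('--dense', action='store_true', help='dense retrieval')
mode.add_argument('--hybrid', action='store_true', help='dense + sparse fusion')

args = parser.parse_args(['--topics', 'msmarco_passage_dev_subset',
                          '--index', 'msmarco-passage',
                          '--output', 'runs/run.dense.txt', '--dense'])
print(args.dense, args.sparse)  # True False
```

The mutually exclusive group makes `--dense --hybrid` an immediate usage error instead of silently picking one mode.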

@MXueguang (Member, Author)

Sure. I'm having some issues replicating the results right now (I suspect a bug in my HNSW indexing, but I'm still investigating). I'll add it once the scores replicate successfully on my end.

@MXueguang (Member, Author)

Reproduced MRR@10 of TCT-ColBERT:

  • Expected:
MRR @10: 0.3345
  • Using brute force index IndexFlatIP:
MRR @10: 0.33447093737208256
search speed: 10s per query
size: 26G

The current SimpleDenseSearcher does single-query search, which is not multithreaded for the brute-force index. faiss can use multithreading when searching by batch of queries, so we could create a BatchDenseSearcher if we need one.
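The brute-force scoring that batch search parallelizes can be sketched in plain numpy: an inner-product between the query batch and every document embedding, then top-k selection per query. This mirrors what a flat inner-product index does internally (faiss parallelizes the batched case across CPU cores); the corpus size and k below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.random((10_000, 768)).astype('float32')   # document embeddings
queries = rng.random((16, 768)).astype('float32')      # one batch of queries

# Inner-product scores for all (query, document) pairs: shape (16, 10000).
scores = queries @ corpus.T

# Top-k document positions per query, best first.
k = 10
topk = np.argsort(-scores, axis=1)[:, :k]
print(topk.shape)  # (16, 10)
```

The single-query path is the same computation with a (1, 768) query matrix, which leaves little work to spread across threads.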

  • Using hnsw index IndexHNSWFlat:

with store_n=512, ef_search=128, ef_construction=200

MRR @10: 0.3335402737981526
search speed 0.05s per query, using 16 cpu cores
size: 60G
index time: 10+ hours

@MXueguang (Member, Author)

Using batch_search with batch size 16, running on 16 CPU cores:

  • brute force index: 14s per batch
  • hnsw index: 0.2 s per batch

@lintool (Member)

lintool commented Dec 30, 2020

So we have three modes of operation: brute force GPU, brute force CPU, and hnsw CPU - right?

What's your proposed design - to have separate classes for each index type? E.g., BruteForceDenseSearcher and IndexedDenseSearcher? Or to have a single class that knows how to work with different indexes?

There are pros and cons I can see for each alternative...

@MXueguang (Member, Author)

I didn't implement GPU mode in pyserini; the original TCT-ColBERT experiments used brute force on GPU.

but maybe we should support GPU as well?

So right now what I have implemented can have four modes of operation (on CPU):

  1. brute force, single query search
  2. hnsw, single query search
  3. brute force, batch query search
  4. hnsw, batch query search

The efficiency ordering is 4 > 2 > 3 > 1.

I am assuming we get a prebuilt index for both the brute-force and HNSW cases, which means in pyserini we only do load and search. Loading and searching are almost the same for both; I guess that is because faiss wraps all index types behind one interface. So putting everything in one class seems simpler, i.e., SimpleDenseSearcher.
There are some minor differences, such as an HNSW index being able to change its accuracy-efficiency trade-off via efSearch at search time. For these, we can let the corresponding methods figure out which index they are handling.
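The single-class design with index-type-specific knobs might look roughly like this sketch: the searcher accepts any index object and applies the efSearch setting only when the loaded index actually exposes it. Class and method names here are hypothetical, not pyserini's merged API.

```python
class DenseSearcherSketch:
    """Sketch: one searcher class for any faiss index type."""

    def __init__(self, index):
        self.index = index  # e.g. a faiss IndexFlatIP or IndexHNSWFlat

    def set_ef_search(self, ef_search):
        # Only HNSW indexes expose an `hnsw` sub-object with efSearch;
        # for brute-force indexes the knob is a no-op.
        hnsw = getattr(self.index, 'hnsw', None)
        if hnsw is None:
            return False
        hnsw.efSearch = ef_search
        return True

# Tiny stand-in for a faiss HNSW index, just to exercise the dispatch.
class _FakeHNSW:
    class _H:
        efSearch = 16
    hnsw = _H()

s = DenseSearcherSketch(_FakeHNSW())
applied = s.set_ef_search(128)
print(applied, s.index.hnsw.efSearch)  # True 128
```

Duck-typing on the `hnsw` attribute keeps the public class free of isinstance checks against every faiss index type.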

@lintool (Member)

lintool commented Dec 30, 2020

Okay, sounds good - a simple class then!

Review comments on pyserini/search/_dsearcher.py (outdated, resolved).
@lintool (Member)

lintool commented Jan 1, 2021

> Another thing: dense search only gives the index (the position of the document), so we need a way to obtain the idx-to-docid map, even if we eventually use the sparse index to get the raw text. Right now I build the idx-to-docid map from the raw corpus, which contains the docids.

Hey @MXueguang, I was just thinking about this - the idx-to-docid mapping should be pretty small, right? Can't we just include it in the prebuilt index?

The common case in the future is that the user will do hybrid dense/sparse - so it doesn't make sense to have two separate copies of the text?
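The idea of shipping the mapping with the prebuilt index can be sketched as follows: a small sidecar file of docids, one per faiss row position, so search results can be resolved to collection docids without loading the raw corpus. The `docid` file name and helper functions are assumptions for illustration.

```python
import os
import tempfile

def save_docid_map(index_dir, docids):
    # One docid per line; line number == faiss row position.
    with open(os.path.join(index_dir, 'docid'), 'w') as f:
        f.write('\n'.join(docids))

def load_docid_map(index_dir):
    with open(os.path.join(index_dir, 'docid')) as f:
        return [line.rstrip('\n') for line in f]

with tempfile.TemporaryDirectory() as d:
    save_docid_map(d, ['350921', '123', '7187158'])
    idx2docid = load_docid_map(d)
    print(idx2docid[0])  # 350921
```

At a few bytes per document, the mapping adds negligible size next to a multi-gigabyte embedding index.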

@MXueguang (Member, Author)

Yeah, that sounds good. I'll make the prebuilt index contain the faiss index + the idx2docid mapping.

@MXueguang MXueguang marked this pull request as ready for review January 2, 2021 08:47
@MXueguang MXueguang requested a review from lintool January 2, 2021 16:53
@MXueguang (Member, Author)

  • moved the SimpleDenseSearcher class from pyserini.search to pyserini.dsearch
  • added auto-install for the faiss index
  • added PyPI replication for TCT-ColBERT
  • updated requirements

@lintool (Member) left a comment

I think we're pretty close to an initial merge?

Address these comments, and let's merge?

topic_keys = sorted(topics.keys())
for i in tqdm(range(0, len(topic_keys), args.batch)):
    topic_key_batch = topic_keys[i: i+args.batch]
    topic_emb_batch = np.array([query_encoder.encode(topics[topic].get('title').strip())
@lintool (Member)

I got lost here reading the code... the query encodings are pre-stored somewhere right? Where are they loaded?

Should we also have something like .load_encoded_queries() to load pre-encoded queries?

@MXueguang (Member, Author)

There is a line query_encoder = QueryEncoder(searcher.index_dir) where I load the pre-encoded queries from the index dir.
Right now I place the pre-encoded queries with the prebuilt index, i.e., the registered msmarco-passage-tct_colbert.tar.gz contains index + docids + query text + query embeddings.

Will we always have pre-encoded queries for a prebuilt index? If so, I feel we can pack the prebuilt index and pre-encoded queries together, like what I am doing right now?

@lintool (Member)

I see. I think the pre-encoded queries should be separate, because an index can be paired with multiple query sets - for example, for MS MARCO there are dev queries and test queries.

I think something like .load_encoded_queries() would be clearer.
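The separation being proposed can be sketched like this: pre-encoded queries live in their own package, keyed by query text, so one index pairs with any number of query sets (dev, test, ...). The class name, pickle format, and file name below are assumptions for illustration, not the API that was merged.

```python
import os
import pickle
import tempfile

import numpy as np

class EncodedQueries:
    """Sketch: a query set of pre-computed embeddings, loaded separately
    from the index."""

    def __init__(self, embeddings):
        self.embeddings = embeddings  # dict: query text -> np.ndarray

    @classmethod
    def load(cls, path):
        with open(path, 'rb') as f:
            return cls(pickle.load(f))

    def encode(self, query):
        return self.embeddings[query]  # pure lookup, no model needed

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'queries.dev.pkl')
    with open(path, 'wb') as f:
        pickle.dump({'what is the sales tax in lander wy':
                     np.ones(768, dtype='float32')}, f)
    queries = EncodedQueries.load(path)
    print(queries.encode('what is the sales tax in lander wy').shape)  # (768,)
```

With this layout, swapping dev queries for test queries is just loading a different file against the same index.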

@MXueguang (Member, Author)

sure, sg

@@ -56,6 +56,37 @@
map          all  0.2805
recall_1000  all  0.9470

MS MARCO passage ranking task, dense retrieval with TCT-ColBERT, HNSW index.
@lintool (Member)

Can you move this to a new file, like docs/dense-retrieval.md? This feature is still experimental... I don't want to mix it with pretty stable stuff, like the rest of the docs here...

@MXueguang (Member, Author)

sure

@lintool (Member) left a comment

A few more minor tweaks please.

Review comments on pyserini/dsearch/__main__.py and docs/dense-retrieval.md (outdated, resolved).
@lintool lintool merged commit be92ae8 into castorini:master Jan 4, 2021
@MXueguang MXueguang deleted the dense_retrieval branch March 6, 2022 19:35