Dense retrieval draft #278

Merged
merged 14 commits into castorini:master from dense_retrieval on Jan 4, 2021

Conversation

@MXueguang (Member)

An example of usage. Since the dense index doesn't contain the raw documents, I load the corpus separately:

import numpy as np
from pyserini.search import SimpleDenseSearcher

searcher = SimpleDenseSearcher.from_prebuilt_index('msmarco_passage_0', 'collection.tsv')

query_emb = np.random.random(768).astype('float32')
result = searcher.search(query_emb)

result[0].raw
>> 'Lander, WY Sales Tax Rate. The current total local sales tax rate in Lander, WY is 5.000%. The December 2015 total local sales tax rate was also 5.000%. Lander, WY is in Fremont County. Lander is in the following zip codes: 82520.'

result[0].docid
>> '350921'

result[0].score
>> 0.42547345

searcher.doc('123')
>> Document(docid='123', raw='With a number of condo developments springing up in the city, it can be difficult to narrow down your choices for the perfect Montreal condo for sale. Our skilled agents organize your steps towards meeting your goals with our condo projects located in popular and trendy neighbourhoods.')

@lintool (Member)

lintool commented Dec 18, 2020

Let's see if we can replicate the results (i.e., MRR) in the arXiv paper.

How about we create a QueryEncoder class? Then we can have something like .load_pre_encoded_queries() - it takes a natural language query and looks up the vector representation directly.
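A minimal sketch of the QueryEncoder idea above: map a natural-language query to its pre-computed embedding by lookup, with no model inference at query time. The method name `load_pre_encoded` and the in-memory layout are illustrative assumptions, not pyserini's actual API.

```python
import numpy as np

class QueryEncoder:
    """Sketch: serves pre-computed query embeddings via dictionary lookup."""

    def __init__(self):
        self.cache = {}  # query text -> embedding

    def load_pre_encoded(self, pairs):
        # pairs: iterable of (query_text, embedding) encoded offline
        for text, emb in pairs:
            self.cache[text] = np.asarray(emb, dtype='float32')

    def encode(self, query):
        return self.cache[query]  # direct lookup, no model needed

encoder = QueryEncoder()
encoder.load_pre_encoded([('what is the sales tax in lander wy',
                           np.zeros(768))])
emb = encoder.encode('what is the sales tax in lander wy')
print(emb.shape, emb.dtype)  # (768,) float32
```

A real implementation would read the embeddings from disk, but the lookup-instead-of-inference design is the point here.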

@MXueguang (Member, Author)

sure, sg

@lintool (Member)

lintool commented Dec 20, 2020

Add a driver program for replicating results from the paper? e.g., https://github.com/castorini/pyserini/blob/master/pyserini/search/__main__.py

Not sure if we want it in that file or a different one, though. Depends on the complexity of the code paths.

@lintool (Member)

lintool commented Dec 20, 2020

For example, this works:

python -m pyserini.search --topics msmarco_passage_dev_subset \
 --index msmarco-passage --output runs/run.msmarco-passage.2.txt --bm25 --msmarco &

Implement something similar for dense retrieval? Maybe --sparse as the default, with --dense as an option? Then we can have --hybrid?
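The flag layout floated above could be sketched with argparse roughly as follows; everything beyond the `--sparse`/`--dense`/`--hybrid` names is an assumption for illustration, not the driver that was eventually merged.

```python
import argparse

# Sketch of a retrieval driver's CLI: sparse (BM25) by default, with dense
# and hybrid as mutually exclusive alternatives.
parser = argparse.ArgumentParser(description='retrieval driver (sketch)')
parser.add_argument('--topics', required=True)
parser.add_argument('--index', required=True)
parser.add_argument('--output', required=True)
mode = parser.add_mutually_exclusive_group()
mode.add_argument('--sparse', action='store_true', help='BM25 (default)')
mode.add_argument('--dense', action='store_true', help='dense retrieval')
mode.add_argument('--hybrid', action='store_true', help='dense + sparse fusion')

args = parser.parse_args(['--topics', 'msmarco_passage_dev_subset',
                          '--index', 'msmarco-passage',
                          '--output', 'runs/run.dense.txt', '--dense'])
print(args.dense, args.sparse)  # True False
```

The mutually exclusive group makes `--dense --hybrid` an immediate usage error instead of silently picking one mode.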

@MXueguang (Member, Author)

Sure. I'm having some issues replicating the results right now (I suspect a bug in my HNSW indexing, but I'm still investigating). I'll add it once the scores replicate successfully on my end.

@MXueguang (Member, Author)

Reproduced MRR@10 of TCT-ColBERT:

  • Expected:
MRR @10: 0.3345
  • Using brute force index IndexFlatIP:
MRR @10: 0.33447093737208256
search speed: 10s per query
size: 26G

The current SimpleDenseSearcher does single-query search, which is not multithreaded for the brute-force index. faiss can use multithreading when searching by batch of queries, so we could create a BatchDenseSearcher if we need one.
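The brute-force scoring that batch search parallelizes can be sketched in plain numpy: an inner-product between the query batch and every document embedding, then top-k selection per query. This mirrors what a flat inner-product index does internally (faiss parallelizes the batched case across CPU cores); the corpus size and k below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.random((10_000, 768)).astype('float32')   # document embeddings
queries = rng.random((16, 768)).astype('float32')      # one batch of queries

# Inner-product scores for all (query, document) pairs: shape (16, 10000).
scores = queries @ corpus.T

# Top-k document positions per query, best first.
k = 10
topk = np.argsort(-scores, axis=1)[:, :k]
print(topk.shape)  # (16, 10)
```

The single-query path is the same computation with a (1, 768) query matrix, which leaves little work to spread across threads.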

  • Using hnsw index IndexHNSWFlat:

with store_n=512, ef_search=128, ef_construction=200

MRR @10: 0.3335402737981526
search speed 0.05s per query, using 16 cpu cores
size: 60G
index time: 10+ hours

@MXueguang (Member, Author)

Using batch_search with batch size 16, running on 16 CPU cores:

  • brute force index: 14s per batch
  • hnsw index: 0.2 s per batch

@lintool (Member)

lintool commented Dec 30, 2020

So we have three modes of operation: brute force GPU, brute force CPU, and hnsw CPU - right?

What's your proposed design - to have separate classes for each index type? E.g., BruteForceDenseSearcher and IndexedDenseSearcher? Or to have a single class that knows how to work with different indexes?

There are pros and cons I can see for each alternative...

@MXueguang (Member, Author)

I didn't implement GPU mode in pyserini; the original TCT-ColBERT experiments used brute force on GPU.

but maybe we should support GPU as well?

So right now what I have implemented can have four modes of operation (on CPU):

  1. brute force, single query search
  2. hnsw, single query search
  3. brute force, batch query search
  4. hnsw, batch query search

The efficiency ordering is 4 > 2 > 3 > 1.

I am assuming we get a prebuilt index for both the brute-force and HNSW cases, which means in pyserini we only do load and search. Loading and searching are almost the same for both; I guess that is because faiss wraps all index types behind one interface. So putting everything in one class seems simpler, i.e., SimpleDenseSearcher.
There are some minor differences, such as an HNSW index being able to change its accuracy-efficiency trade-off via efSearch at search time. For these, we can let the corresponding methods figure out which index they are handling.
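The single-class design with index-type-specific knobs might look roughly like this sketch: the searcher accepts any index object and applies the efSearch setting only when the loaded index actually exposes it. Class and method names here are hypothetical, not pyserini's merged API.

```python
class DenseSearcherSketch:
    """Sketch: one searcher class for any faiss index type."""

    def __init__(self, index):
        self.index = index  # e.g. a faiss IndexFlatIP or IndexHNSWFlat

    def set_ef_search(self, ef_search):
        # Only HNSW indexes expose an `hnsw` sub-object with efSearch;
        # for brute-force indexes the knob is a no-op.
        hnsw = getattr(self.index, 'hnsw', None)
        if hnsw is None:
            return False
        hnsw.efSearch = ef_search
        return True

# Tiny stand-in for a faiss HNSW index, just to exercise the dispatch.
class _FakeHNSW:
    class _H:
        efSearch = 16
    hnsw = _H()

s = DenseSearcherSketch(_FakeHNSW())
applied = s.set_ef_search(128)
print(applied, s.index.hnsw.efSearch)  # True 128
```

Duck-typing on the `hnsw` attribute keeps the public class free of isinstance checks against every faiss index type.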

@lintool (Member)

lintool commented Dec 30, 2020

Okay, sounds good - a simple class then!

Review comments on pyserini/search/_dsearcher.py (outdated, resolved).
@lintool (Member)

lintool commented Jan 1, 2021

> Another thing: dense search only gives the index (the position of the document), so we need a way to obtain the idx-to-docid map, even if we eventually use the sparse index to get the raw text. Right now I build the idx-to-docid map from the raw corpus, which contains the docids.

Hey @MXueguang, I was just thinking about this - the idx-to-docid mapping should be pretty small, right? Can't we just include it in the prebuilt index?

The common case in the future is that the user will do hybrid dense/sparse - so it doesn't make sense to have two separate copies of the text?
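The idea of shipping the mapping with the prebuilt index can be sketched as follows: a small sidecar file of docids, one per faiss row position, so search results can be resolved to collection docids without loading the raw corpus. The `docid` file name and helper functions are assumptions for illustration.

```python
import os
import tempfile

def save_docid_map(index_dir, docids):
    # One docid per line; line number == faiss row position.
    with open(os.path.join(index_dir, 'docid'), 'w') as f:
        f.write('\n'.join(docids))

def load_docid_map(index_dir):
    with open(os.path.join(index_dir, 'docid')) as f:
        return [line.rstrip('\n') for line in f]

with tempfile.TemporaryDirectory() as d:
    save_docid_map(d, ['350921', '123', '7187158'])
    idx2docid = load_docid_map(d)
    print(idx2docid[0])  # 350921
```

At a few bytes per document, the mapping adds negligible size next to a multi-gigabyte embedding index.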

@MXueguang (Member, Author)

Yeah, that sounds good. I'll make the prebuilt index contain the faiss index + the idx2docid mapping.

@MXueguang MXueguang marked this pull request as ready for review January 2, 2021 08:47
@MXueguang MXueguang requested a review from lintool January 2, 2021 16:53
@MXueguang (Member, Author)

  • moved the SimpleDenseSearcher class from pyserini.search to pyserini.dsearch
  • added auto-install for the faiss index
  • added PyPI replication for TCT-ColBERT
  • updated requirements

@lintool (Member) left a comment

I think we're pretty close to an initial merge?

Address these comments, and let's merge?

topic_keys = sorted(topics.keys())
for i in tqdm(range(0, len(topic_keys), args.batch)):
    topic_key_batch = topic_keys[i: i+args.batch]
    topic_emb_batch = np.array([query_encoder.encode(topics[topic].get('title').strip())
@lintool (Member)

I got lost here reading the code... the query encodings are pre-stored somewhere right? Where are they loaded?

Should we also have something like .load_encoded_queries() to load pre-encoded queries?

@MXueguang (Member, Author)

There is a line query_encoder = QueryEncoder(searcher.index_dir) where I load the pre-encoded queries from the index dir.
Right now I place the pre-encoded queries with the prebuilt index, i.e., the registered msmarco-passage-tct_colbert.tar.gz contains index + docids + query text + query embeddings.

Will we always have pre-encoded queries for a prebuilt index? If so, I feel we can pack the prebuilt index and pre-encoded queries together, like what I am doing right now?

@lintool (Member)

I see. I think the pre-encoded queries should be separate, because an index can be paired with multiple query sets - for example, for MS MARCO there are dev queries and test queries.

I think something like .load_encoded_queries() would be clearer.
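The separation being proposed can be sketched like this: pre-encoded queries live in their own package, keyed by query text, so one index pairs with any number of query sets (dev, test, ...). The class name, pickle format, and file name below are assumptions for illustration, not the API that was merged.

```python
import os
import pickle
import tempfile

import numpy as np

class EncodedQueries:
    """Sketch: a query set of pre-computed embeddings, loaded separately
    from the index."""

    def __init__(self, embeddings):
        self.embeddings = embeddings  # dict: query text -> np.ndarray

    @classmethod
    def load(cls, path):
        with open(path, 'rb') as f:
            return cls(pickle.load(f))

    def encode(self, query):
        return self.embeddings[query]  # pure lookup, no model needed

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'queries.dev.pkl')
    with open(path, 'wb') as f:
        pickle.dump({'what is the sales tax in lander wy':
                     np.ones(768, dtype='float32')}, f)
    queries = EncodedQueries.load(path)
    print(queries.encode('what is the sales tax in lander wy').shape)  # (768,)
```

With this layout, swapping dev queries for test queries is just loading a different file against the same index.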

@MXueguang (Member, Author)

sure, sg

@@ -56,6 +56,37 @@
map          all  0.2805
recall_1000  all  0.9470

MS MARCO passage ranking task, dense retrieval with TCT-ColBERT, HNSW index.
@lintool (Member)

Can you move this to a new file, like docs/dense-retrieval.md? This feature is still experimental... I don't want to mix it with pretty stable stuff, like the rest of the docs here...

@MXueguang (Member, Author)

sure

@lintool (Member) left a comment

A few more minor tweaks please.

Review comments on pyserini/dsearch/__main__.py and docs/dense-retrieval.md (outdated, resolved).
@lintool lintool merged commit be92ae8 into castorini:master Jan 4, 2021
@MXueguang MXueguang deleted the dense_retrieval branch March 6, 2022 19:35