Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add raw() to DenseSearchResult and PRFDenseSearchResult #1876

Closed
wants to merge 8 commits into from

Conversation

DanielKohn1208
Copy link
Contributor

closes #1856 by adding a raw method to both DenseSearchResult and PRFDenseSearchResult

Had to change how DenseSearchResult and PRFDenseSearchResult is initialized as well.

@lintool
Copy link
Member

lintool commented Apr 30, 2024

hey @manveertamber can you take a look at this?

@manveertamber
Copy link
Member

@DanielKohn1208 can you add some documentation to explain how to fetch document content for documents retrieved using a dense index: https://github.com/castorini/pyserini/blob/master/docs/usage-fetch.md. It looks like right now we only explain how to do this for a lucene index

@manveertamber
Copy link
Member

We can call .doc() on a FaissSearcher too, which routes to the lucene index .doc() but the documentation doesn't explain this either.

@manveertamber
Copy link
Member

manveertamber commented May 1, 2024

@DanielKohn1208 looks good! But maybe change the headings in the documentation to Using a Lucene Index and Using a Faiss Index instead of Using a Sparse Representation and Using a Dense Representation. And you should add some example code in the documentation that showcases your changes. Like calling .raw() on some retrieved hits.

@DanielKohn1208
Copy link
Contributor Author

@manveertamber After going through the documentation, I think that I approached this issue in completely the wrong way. As it is now, my fix actually doesn't actually solve any problems. Someone could already get a hit's raw content from a FaissIndex as follows

from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder

DENSE_INDEX = 'beir-v1.0.0-nfcorpus.bge-base-en-v1.5'
encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True)
searcher = FaissSearcher.from_prebuilt_index(DENSE_INDEX, encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')
doc = searcher.doc(hits[0].docid)
raw_content = doc.raw()

Also note that my original assumption about the LuceneSearcher.search() method returning a list of objects with a raw() method was incorrect. Obtaining the raw content of one of these hits would be similar to above code.

Perhaps I am completely misunderstanding the original issue, but I think all that is actually needed is a documentation update?

@manveertamber
Copy link
Member

manveertamber commented May 2, 2024

Yeah I agree. #1548 wanted to get passage text from a DenseSearchResult, but the right way to do this is to just to call searcher.doc(docid), which is possible with a FaissSearcher, we just needed the documentation to reflect this. @lintool is this enough to close #1856?

@DanielKohn1208
Copy link
Contributor Author

Should I close this PR and create a new one at some point to update the docs?

@manveertamber
Copy link
Member

Should I close this PR and create a new one at some point to update the docs?

Yeah that works.

@DanielKohn1208
Copy link
Contributor Author

Closing PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add convenience method to get raw text from dense retrieval for prebuilt indexes
3 participants