Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Add index parameter to TFiDF retriever #1634

Closed
Timoeller opened this issue Oct 22, 2021 · 5 comments · Fixed by #3666
Closed

Feature Request: Add index parameter to TFiDF retriever #1634

Timoeller opened this issue Oct 22, 2021 · 5 comments · Fixed by #3666
Labels
P3 Low priority, leave it in the backlog topic:retriever type:feature New feature or request

Comments

@Timoeller
Copy link
Contributor

Problem
When using the inmemory docstore on a non standard index (e.g. for evaluation) we cannot use the TFiDF Retriever, because you cannot set an index there.

Solution
Lets add the index option to the TFiDF retriever please.

Background
I would like the inmemory store + TFiDF to be used for fast haystack examples (without "complicated" ES or FAISS setup)

@Timoeller Timoeller added the type:feature New feature or request label Oct 22, 2021
@Timoeller Timoeller changed the title Featuer Request: Add index parameter to TFiDF retriever Feature Request: Add index parameter to TFiDF retriever Oct 22, 2021
@ZanSara
Copy link
Contributor

ZanSara commented Oct 22, 2021

Related? #1637

@Timoeller
Copy link
Contributor Author

Nice one! Yes it is. I stumbled upon this exact same error by adding the data in an InMemoryStore at a custom index in combination with TFiDFRetriever.
Though in #1637 I do not see a custom index being used...

@ugm2
Copy link
Contributor

ugm2 commented Sep 23, 2022

What's the status on this? I'm actually facing the same problem 🥲
And also, what would be a workaround?

@anakin87
Copy link
Member

While thinking about #3447, I decided to revamp this issue.

Document store and retrievers

From what I understood, in Haystack:

  • document stores: should store documents and their representations (such as embeddings)
  • retrievers: should be used to query the document stores, in order to provide the most relevant documents
    (currently, in some cases, they also build document representations that are then saved to a document store. See Separate concepts of "Retriever" and "Embedder" #2403)

This holds true for most of the cases, with some exceptions:

  • graphs: in the current implementations, they combine the features of a document store and a retriever
  • TF-IDF retriever: it uses a document store as a source, but it builds an internal representation for documents (based on a DataFrame and a TfidfVectorizer)

How to make the TF-IDF retriever to support different indexes (accept the index parameter)?

I see two alternative options:

  • find a way to store the representation in the document store
    clean from a logical point of view, but hard to implement since this retriever support several different document stores
  • modify the current implementation of TF-IDF to create a representation for every index
    simpler but dirtier

Support for BM25Retriever in InMemoryDocumentStore (#3447)

It seems related to the same topics.
However, unlike TF-IDF, in Haystack BM25 is a pure retriever (it completely relies on document stores to save document representations).
Based on the proposals made to use external libraries (rank-bm25, gensim) it would seem that even in that case, we want to create a specific implementation of BM25 that builds its own internal representation of the documents...

Sorry if I misunderstood something...

@ZanSara @Timoeller WDYT?

@ZanSara
Copy link
Contributor

ZanSara commented Oct 28, 2022

Hey @anakin87! Great analysis, as usual! This is going to be useful even for me to explain the situation to others 🚀

Mandatory premise

So, first things first: I believe the current abstraction of Retrievers is fundamentally wrong. As you noticed, some Retriever rely fully on the docstore for the retrieval steps, others have their own internal representation stored (usually) in memory. The whole thing at some point will need to be re-evaluated and clarified (a topic already raised, as you noticed, in #2403) by assigning to document store, retrievers and embedders consistent and distinct responsibilities.

However, we need to get stuff done with the current architecture for now 😅

How to make the TF-IDF retriever to support different indexes (accept the index parameter)?

  • find a way to store the representation in the document store
    clean from a logical point of view, but hard to implement since this retriever support several different document stores

Technically all docstores should support indices already, so this should not pose too many challenges. However, I have not verified this in practice.

  • modify the current implementation of TF-IDF to create a representation for every index
    simpler but dirtier

In case the first solution proves too tough, this is a viable alternative. it might make TFIDFRetriever slow as a snail on large collections, but honestly it's already slow in such conditions, so I don't see it as a big issue 😅

Support for BM25Retriever in InMemoryDocumentStore (#3447)

It seems related to the same topics.
However, unlike TF-IDF, in Haystack BM25 is a pure retriever (it completely relies on document stores to save document representations).
Based on the proposals made to use external libraries (rank-bm25, gensim) it would seem that even in that case, we want to create a specific implementation of BM25 that builds its own internal representation of the documents...

For as odd as it sounds, I think the main challenge of adding BM25 support to InMemoryDocumentStore is to figure out where to put the BM25 retrieval step. Currently I tend to believe it should be added as a method of InMemoryDocumentStore, which can then store the representation alongside the documents. BM25Retriever might remain almost unaffected. Please tag me early on in the PR if you decide to go for this, so we can review together how the implementation will look like.

I hope this is helpful, let me know if I forgot to address something!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 Low priority, leave it in the backlog topic:retriever type:feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants