Feature Request: Add index parameter to TFiDF retriever #1634

Timoeller · 2021-10-22T10:34:07Z

Problem
When using the in-memory docstore with a non-standard index (e.g. for evaluation), we cannot use the TFiDF Retriever, because it offers no way to set an index.

Solution
Let's add an index option to the TFiDF retriever, please.

Background
I would like the in-memory store + TFiDF to be usable for fast Haystack examples (without a "complicated" ES or FAISS setup).

ZanSara · 2021-10-22T15:24:35Z

Related? #1637

Timoeller · 2021-10-22T15:37:03Z

Nice one! Yes it is. I stumbled upon this exact same error by adding the data in an InMemoryStore at a custom index in combination with TFiDFRetriever.
Though in #1637 I do not see a custom index being used...

ugm2 · 2022-09-23T07:48:21Z

What's the status on this? I'm actually facing the same problem 🥲
And also, what would be a workaround?

anakin87 · 2022-10-27T17:57:09Z

While thinking about #3447, I decided to revamp this issue.

Document store and retrievers

From what I understood, in Haystack:

document stores: should store documents and their representations (such as embeddings)
retrievers: should be used to query the document stores, in order to provide the most relevant documents
(currently, in some cases, they also build document representations that are then saved to a document store. See Separate concepts of "Retriever" and "Embedder" #2403)

This holds true for most of the cases, with some exceptions:

graphs: in the current implementations, they combine the features of a document store and a retriever
TF-IDF retriever: it uses a document store as a source, but it builds an internal representation for documents (based on a DataFrame and a TfidfVectorizer)

How to make the TF-IDF retriever support different indexes (accept the `index` parameter)?

I see two alternative options:

find a way to store the representation in the document store: clean from a logical point of view, but hard to implement since this retriever supports several different document stores
modify the current implementation of TF-IDF to create a representation for every index: simpler but dirtier
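
A minimal sketch of the second option (one representation per index), under stated assumptions. It is not Haystack's implementation: the `PerIndexTfidf` class, its methods, and the dict standing in for a document store are all hypothetical, and a small hand-rolled TF-IDF replaces the DataFrame and `TfidfVectorizer` that the real retriever uses, so that the example has no dependencies.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer, good enough for the illustration.
    return text.lower().split()


class PerIndexTfidf:
    """Hypothetical helper: one TF-IDF representation per index, built lazily."""

    def __init__(self, store):
        # `store` maps an index name to a list of document strings.
        # It is a stand-in for a document store, not Haystack's interface.
        self.store = store
        self.models = {}  # index name -> (idf weights, normalized doc vectors)

    def fit(self, index):
        docs = [tokenize(d) for d in self.store[index]]
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        # Smoothed IDF, so terms present in every document keep a positive weight.
        idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        vectors = [self._vectorize(doc, idf) for doc in docs]
        self.models[index] = (idf, vectors)

    @staticmethod
    def _vectorize(tokens, idf):
        # Terms unknown to this index are ignored.
        counts = Counter(t for t in tokens if t in idf)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def retrieve(self, query, index, top_k=3):
        # Build the representation for this index the first time it is queried.
        if index not in self.models:
            self.fit(index)
        idf, vectors = self.models[index]
        q = self._vectorize(tokenize(query), idf)
        # Cosine similarity between the query and each normalized document vector.
        scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.store[index][i] for i in ranked[:top_k] if scores[i] > 0]


store = {
    "documents": ["haystack supports many document stores",
                  "tf-idf is a sparse retrieval method"],
    "eval_docs": ["evaluation data lives in its own index",
                  "labels are stored separately"],
}
retriever = PerIndexTfidf(store)
top = retriever.retrieve("evaluation index", index="eval_docs", top_k=1)
```

The "dirtier" part is visible here: every index gets its own copy of the statistics, held outside the document store, and the sketch never refits when documents change.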

Support for BM25Retriever in `InMemoryDocumentStore` (#3447)

It seems related to the same topics.
However, unlike TF-IDF, in Haystack BM25 is a pure retriever (it completely relies on document stores to save document representations).
Based on the proposals made to use external libraries (rank-bm25, gensim) it would seem that even in that case, we want to create a specific implementation of BM25 that builds its own internal representation of the documents...

Sorry if I misunderstood something...

@ZanSara @Timoeller WDYT?

ZanSara · 2022-10-28T14:32:31Z

Hey @anakin87! Great analysis, as usual! This is going to be useful even for me to explain the situation to others 🚀

Mandatory premise

So, first things first: I believe the current abstraction of Retrievers is fundamentally wrong. As you noticed, some Retrievers rely fully on the docstore for the retrieval steps, while others keep their own internal representation, (usually) in memory. The whole thing will at some point need to be re-evaluated and clarified (a topic already raised, as you noticed, in #2403) by assigning consistent and distinct responsibilities to document stores, retrievers and embedders.

However, we need to get stuff done with the current architecture for now 😅

How to make the TF-IDF retriever support different indexes (accept the index parameter)?

find a way to store the representation in the document store: clean from a logical point of view, but hard to implement since this retriever supports several different document stores

Technically all docstores should support indices already, so this should not pose too many challenges. However, I have not verified this in practice.

modify the current implementation of TF-IDF to create a representation for every index: simpler but dirtier

In case the first solution proves too tough, this is a viable alternative. It might make TFIDFRetriever slow as a snail on large collections, but honestly it's already slow in such conditions, so I don't see it as a big issue 😅

Support for BM25Retriever in InMemoryDocumentStore (#3447)

It seems related to the same topics.
However, unlike TF-IDF, in Haystack BM25 is a pure retriever (it completely relies on document stores to save document representations).
Based on the proposals made to use external libraries (rank-bm25, gensim) it would seem that even in that case, we want to create a specific implementation of BM25 that builds its own internal representation of the documents...

As odd as it sounds, I think the main challenge of adding BM25 support to InMemoryDocumentStore is figuring out where to put the BM25 retrieval step. Currently I tend to believe it should be added as a method of InMemoryDocumentStore, which can then store the representation alongside the documents. BM25Retriever might remain almost unaffected. Please tag me early on in the PR if you decide to go for this, so we can review together what the implementation will look like.
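
To make the idea concrete, here is a rough sketch of BM25 scoring living inside an in-memory store, with the term statistics kept next to the documents of each index. Everything in it is an assumption made for illustration: `TinyInMemoryStore`, `write_documents` and `query_bm25` as written here are hypothetical and do not mirror Haystack's actual classes, and the scoring uses the common BM25 formula with a `log(1 + ...)` IDF rather than any of the external libraries mentioned above.

```python
import math
from collections import Counter


class TinyInMemoryStore:
    """Hypothetical in-memory store that keeps BM25 statistics per index."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        # index name -> documents, their tokens, and document frequencies
        self.indexes = {}

    def write_documents(self, docs, index="document"):
        data = self.indexes.setdefault(
            index, {"docs": [], "tokens": [], "df": Counter()}
        )
        for text in docs:
            tokens = text.lower().split()
            data["docs"].append(text)
            data["tokens"].append(tokens)
            # Each document counts once per distinct term.
            data["df"].update(set(tokens))

    def query_bm25(self, query, index="document", top_k=3):
        data = self.indexes[index]
        n = len(data["docs"])
        if n == 0:
            return []
        avgdl = sum(len(t) for t in data["tokens"]) / n
        scored = []
        for text, tokens in zip(data["docs"], data["tokens"]):
            tf = Counter(tokens)
            score = 0.0
            for term in query.lower().split():
                if term not in tf:
                    continue
                df = data["df"][term]
                # The "+ 1" inside the log keeps the IDF positive.
                idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
                length_norm = 1 - self.b + self.b * len(tokens) / avgdl
                score += idf * tf[term] * (self.k1 + 1) / (tf[term] + self.k1 * length_norm)
            if score > 0:
                scored.append((score, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = TinyInMemoryStore()
store.write_documents(
    ["the cat sat on the mat",
     "dogs chase cats",
     "bm25 ranks documents by term statistics"],
    index="custom",
)
hits = store.query_bm25("bm25 documents", index="custom", top_k=1)
```

With this layout a retriever would only need to forward the query and the index name to the store, which is why BM25Retriever could stay almost unchanged.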

I hope this is helpful, let me know if I forgot to address something!

Timoeller added the type:feature New feature or request label Oct 22, 2021

Timoeller changed the title ~~Featuer Request: Add index parameter to TFiDF retriever~~ Feature Request: Add index parameter to TFiDF retriever Oct 22, 2021

ugm2 mentioned this issue Sep 23, 2022

Reindexing has unexpected results ugm2/neural-search-demo#8

Closed

2 tasks

anakin87 mentioned this issue Oct 28, 2022

Can't specify index during TF_IDF fit #3187

Closed

ZanSara mentioned this issue Oct 31, 2022

Extend BM25Retriever to work with non-Elasticsearch based DocumentStores #3509

Closed

masci added topic:retriever P3 Low priority, leave it in the backlog labels Nov 28, 2022

anakin87 mentioned this issue Dec 4, 2022

feat: add index parameter to TfidfRetriever #3666

Merged

6 tasks

ZanSara closed this as completed in #3666 Dec 19, 2022
