ensure tf-idf matrix calculation before retrieval #1665

julian-risch · 2021-10-27T14:54:38Z

Bug
The problem is that the TfidfRetriever uses a dataframe df to store paragraphs, and the term frequencies and inverse document frequencies need to be calculated in the fit() method based on the documents stored in the document store. This calculation must be done before any document retrieval step can be executed. To this end, fit() is called in the __init__() method of the TfidfRetriever here:

haystack/haystack/nodes/retriever/sparse.py

Line 134 in 13510aa

self.fit()

However, if there aren't any documents yet, for example when we load the pipeline from a YAML file, the dataframe df remains empty, no scores are calculated, and any retrieval step fails with the exception reported in #1637.

Proposed changes:

When retrieve() is called in TfidfRetriever, we now check whether the dataframe df and the tf-idf matrix needed for retrieval have already been calculated. If not, we run fit() to calculate them. If the dataframe df is still empty after running fit(), we raise an exception because no retrieval can be performed. The most likely reason is an empty document store, which prevents us from calculating document frequencies, etc.
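As a rough illustration of the flow described above (not the actual Haystack code), here is a minimal sketch of a retriever that fits lazily on the first retrieve() call. ToyStore, ToyTfidfRetriever, and the simplified tf-idf scoring are hypothetical stand-ins invented for this example; only the control flow (check, fit, check again, raise) mirrors the proposed change.

```python
import math
from collections import Counter


class ToyStore:
    """Hypothetical stand-in for a document store: a plain list of strings."""

    def __init__(self):
        self.docs = []

    def write_documents(self, docs):
        self.docs.extend(docs)

    def get_all_documents(self):
        return list(self.docs)


class ToyTfidfRetriever:
    """Illustrative only; this is not the Haystack TfidfRetriever."""

    def __init__(self, store):
        self.store = store
        self.paragraphs = None  # plays the role of the dataframe df
        self.idf = None         # plays the role of the tf-idf matrix
        self.fit()              # may find an empty store, e.g. after loading from YAML

    def fit(self):
        docs = self.store.get_all_documents()
        if not docs:
            # Nothing to compute document frequencies from.
            self.paragraphs, self.idf = None, None
            return
        doc_freq = Counter(t for d in docs for t in set(d.lower().split()))
        self.idf = {t: math.log(len(docs) / n) + 1.0 for t, n in doc_freq.items()}
        self.paragraphs = docs

    def retrieve(self, query, top_k=1):
        if self.paragraphs is None:
            # Fit lazily instead of failing right away.
            self.fit()
        if self.paragraphs is None:
            # Still nothing after fit(): most likely an empty document store.
            raise Exception(
                "Retrieval requires dataframe df and tf-idf matrix but fit() "
                "did not calculate them probably due to an empty document store."
            )
        terms = query.lower().split()

        def score(doc):
            tf = Counter(doc.lower().split())
            return sum(tf[t] * self.idf.get(t, 0.0) for t in terms)

        return sorted(self.paragraphs, key=score, reverse=True)[:top_k]
```

With this sketch, a retriever created on an empty store raises on retrieve(), but starts working once documents are written, without an explicit fit() call from the user.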

closes #1637

ZanSara

Code looks good! I would add a test with a YAML pipeline to verify it keeps working properly though 🙂

tholor · 2021-10-28T08:27:14Z

haystack/nodes/retriever/sparse.py

-            raise Exception("fit() needs to called before retrieve()")
+            # run fit() to update self.df and self.tfidf_matrix
+            logger.warning("Fit method needs to be run before first retrieval. Running it now.")
+            self.fit()


I think there is still one case that might cause problems here:

You add n docs

You call retrieve() and therefore trigger fit() automatically

You add further m docs

You call retrieve() => no warning/error is displayed, but I believe the m newer docs will never be retrieved, as they are not part of self.df

How about we add a parameter auto_fit=True to the init, and if it is true, we check before every retrieve whether the number of docs in df is up to date with the docstore? Setting it to false might be helpful if you want to speed up retrieval and avoid this additional check.
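To make the suggestion concrete, here is a minimal sketch of the auto_fit idea, assuming a store that can report its document count. CountingStore, AutoFitRetriever, get_document_count, and fit_calls are hypothetical names chosen for this example, and "fitting" is reduced to taking a snapshot of the store so the staleness check stays in focus.

```python
class CountingStore:
    """Hypothetical in-memory store; only what the sketch needs."""

    def __init__(self):
        self._docs = []

    def write_documents(self, docs):
        self._docs.extend(docs)

    def get_all_documents(self):
        return list(self._docs)

    def get_document_count(self):
        return len(self._docs)


class AutoFitRetriever:
    """Illustrative only; fit() here just snapshots the store."""

    def __init__(self, store, auto_fit=True):
        self.store = store
        self.auto_fit = auto_fit
        self.df = []        # stand-in for the fitted dataframe
        self.fit_calls = 0  # counts how often fit() ran
        self.fit()

    def fit(self):
        self.df = self.store.get_all_documents()
        self.fit_calls += 1

    def retrieve(self, query):
        # Re-fit only when the snapshot size no longer matches the store.
        if self.auto_fit and len(self.df) != self.store.get_document_count():
            self.fit()
        return [d for d in self.df if query.lower() in d.lower().split()]


store = CountingStore()
store.write_documents(["red apple", "green pear"])    # n docs
fresh = AutoFitRetriever(store, auto_fit=True)
stale = AutoFitRetriever(store, auto_fit=False)
store.write_documents(["yellow banana"])              # m further docs
fresh_hits = fresh.retrieve("banana")  # sees the new doc after an automatic re-fit
stale_hits = stale.retrieve("banana")  # still works on the old snapshot
```

One caveat of comparing counts only: if a document is replaced while the total stays the same, the check would not notice the change.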

ZanSara · 2021-10-28T10:11:51Z

haystack/nodes/retriever/sparse.py

        if self.df is None:
-            raise Exception("fit() needs to called before retrieve()")
+            raise Exception("Retrieval requires dataframe df and tf-idf matrix but fit() did not calculate them probably due to an empty document store.")


I have another idea here: how about we make a custom error for this exception? Something like FitNotPerformedException, possibly a name which we could reuse elsewhere. This would also allow us to test for this specific exception and make the test more informative.

@ZanSara Thanks for the feedback 🙏 I added a second test case that checks whether this exact exception is raised, based on the message string. I was hesitant to add a new type of Exception because there is only one defined in haystack so far: https://github.com/deepset-ai/haystack/blob/master/haystack/errors.py Maybe a topic to discuss further.
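For reference, a minimal sketch of what the custom-exception variant discussed above could look like. FitNotPerformedError and the standalone retrieve() function are hypothetical and not part of Haystack; the point is only that a test can then match on the exception type instead of on the message string.

```python
class FitNotPerformedError(Exception):
    """Hypothetical error for retrieval attempted without fitted data."""


def retrieve(df, query):
    # df stands in for the retriever's dataframe; None means fit() produced nothing.
    if df is None:
        raise FitNotPerformedError(
            "Retrieval requires dataframe df and tf-idf matrix but fit() "
            "did not calculate them probably due to an empty document store."
        )
    return [doc for doc in df if query in doc]


def raises_fit_error(df):
    """Return True if retrieve() fails with the specific error type."""
    try:
        retrieve(df, "anything")
    except FitNotPerformedError:
        return True
    return False
```

Because the class derives from Exception, existing handlers that catch Exception would keep working. In a pytest suite, the same check would typically be written with pytest.raises(FitNotPerformedError).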

ZanSara · 2021-10-28T10:14:28Z

test/samples/pipeline/test_pipeline_tfidfretriever.yaml

+    params:
+      document_store: DocumentStore
+  - name: DocumentStore
+    type: ElasticsearchDocumentStore


For the sake of replicating the original issue, I would try with InMemoryDocumentStore (it's probably also a bit more lightweight)

👍 Changed the code accordingly.

ensure tf-idf matrix calculation before retrieval

05db2ea

julian-risch mentioned this pull request Oct 27, 2021

Exception: fit() needs to called before retrieve() #1637

Closed

julian-risch requested a review from ZanSara October 27, 2021 14:58

julian-risch marked this pull request as ready for review October 27, 2021 15:08

ZanSara suggested changes Oct 28, 2021

tholor reviewed Oct 28, 2021

julian-risch and others added 3 commits October 28, 2021 11:18

Run fit() automatically if new documents have been added

9962365

Add latest docstring and tutorial changes

1fc6d59

Fix type error

b1029a5

ZanSara reviewed Oct 28, 2021

julian-risch added 2 commits October 28, 2021 12:01

Add test case for tfidf retriever yaml pipeline

232aa8a

Merge branch 'calculate_tfidf_matrix' of github.com:deepset-ai/haysta…

3c935e6

…ck into calculate_tfidf_matrix

ZanSara reviewed Oct 28, 2021

ZanSara approved these changes Oct 28, 2021

Use InMemoryDocStore and add 2nd test case

6a54dc4

julian-risch merged commit 33b2663 into master Oct 28, 2021

julian-risch deleted the calculate_tfidf_matrix branch October 28, 2021 14:48

tholor added the topic:modeling label Dec 3, 2021
