Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ensure tf-idf matrix calculation before retrieval #1665

Merged
merged 7 commits into from
Oct 28, 2021

Conversation

julian-risch
Copy link
Member

@julian-risch julian-risch commented Oct 27, 2021

Bug
The problem is that the TfidfRetriever uses a dataframe df to store paragraphs and term frequencies and inverse document frequencies that need to be calculated in the fit() method based on documents stored in the document store. This calculation needs to be done before any document retrieval step can be executed. To this end, fit()is called in the init() method of the TfidfRetriever here:


However, if there aren't any documents yet, for example when we load the pipeline from a yaml file, the dataframe df remains empty, no scores are calculated and any retrieval step fails with the exception reported in #1637.

Proposed changes:

  • When retrieve() is called in TfidfRetriever, we now check whether the dataframe df and the tfidf-matrix needed for retrieval have been calculated before. If not, we run fit() to calculate them. If dataframe df is still empty after running fit(), we raise an exception because no retrieval can be performed. Most likely reason is an empty document store, which prevents us from calculating document frequencies, etc.

closes #1637

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good! I would add a test with a YAML pipeline to verify it keeps working properly though 🙂

raise Exception("fit() needs to called before retrieve()")
# run fit() to update self.df and self.tfidf_matrix
logger.warning("Fit method needs to be run before first retrieval. Running it now.")
self.fit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still one case that might cause problems here:

  1. You add n docs
  2. You call retrieve() and therefore trigger fit() automatically
  3. You add further m docs
  4. You call retrieve => no warning/error displayed but I believe the m newer docs will never be retrieved as they are not part of self.df

How about we add a parameter to the init auto_fit=True and if this is true we check before every retrieve if the number of docs in df is up to date with the docstore. Setting to false might be helpful if you want to speed up retrieval and avoid this additional check.

if self.df is None:
raise Exception("fit() needs to called before retrieve()")
raise Exception("Retrieval requires dataframe df and tf-idf matrix but fit() did not calculate them probably due to an empty document store.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got another idea here: how about we make a custom error for this exception? Something like FitNotPerformedException, possibly a name which we could reuse elsewhere. This would allow us to test for this specific exception too and make the test more informative.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZanSara Thanks for the feedback 🙏 I added a second test case that checks whether this exact exception is raised based on the string. I was hesitant to add a new type of Exception because there is only one defined in haystack yet: https://github.com/deepset-ai/haystack/blob/master/haystack/errors.py Maybe a topic to discuss further.

params:
document_store: DocumentStore
- name: DocumentStore
type: ElasticsearchDocumentStore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of replicating the original issue, I would try with InMemoryDocumentStore (it's probably also a bit more lightweight)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Changed the code accordingly.

@julian-risch julian-risch merged commit 33b2663 into master Oct 28, 2021
@julian-risch julian-risch deleted the calculate_tfidf_matrix branch October 28, 2021 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exception: fit() needs to called before retrieve()
3 participants