-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ensure tf-idf matrix calculation before retrieval #1665
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good! I would add a test with a YAML pipeline to verify it keeps working properly though 🙂
haystack/nodes/retriever/sparse.py
Outdated
raise Exception("fit() needs to called before retrieve()") | ||
# run fit() to update self.df and self.tfidf_matrix | ||
logger.warning("Fit method needs to be run before first retrieval. Running it now.") | ||
self.fit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still one case that might cause problems here:
- You add n docs
- You call retrieve() and therefore trigger fit() automatically
- You add further m docs
- You call retrieve => no warning/error displayed but I believe the m newer docs will never be retrieved as they are not part of self.df
How about we add a parameter to the init auto_fit=True
and if this is true we check before every retrieve if the number of docs in df is up to date with the docstore. Setting to false
might be helpful if you want to speed up retrieval and avoid this additional check.
…ck into calculate_tfidf_matrix
if self.df is None: | ||
raise Exception("fit() needs to called before retrieve()") | ||
raise Exception("Retrieval requires dataframe df and tf-idf matrix but fit() did not calculate them probably due to an empty document store.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got another idea here: how about we make a custom error for this exception? Something like FitNotPerformedException
, possibly a name which we could reuse elsewhere. This would allow us to test for this specific exception too and make the test more informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZanSara Thanks for the feedback 🙏 I added a second test case that checks whether this exact exception is raised based on the string. I was hesitant to add a new type of Exception because there is only one defined in haystack yet: https://github.com/deepset-ai/haystack/blob/master/haystack/errors.py Maybe a topic to discuss further.
params: | ||
document_store: DocumentStore | ||
- name: DocumentStore | ||
type: ElasticsearchDocumentStore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of replicating the original issue, I would try with InMemoryDocumentStore
(it's probably also a bit more lightweight)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Changed the code accordingly.
Bug
The problem is that the TfidfRetriever uses a dataframe
df
to store paragraphs and term frequencies and inverse document frequencies that need to be calculated in thefit()
method based on documents stored in the document store. This calculation needs to be done before any document retrieval step can be executed. To this end,fit()
is called in theinit()
method of the TfidfRetriever here:haystack/haystack/nodes/retriever/sparse.py
Line 134 in 13510aa
However, if there aren't any documents yet, for example when we load the pipeline from a yaml file, the dataframe
df
remains empty, no scores are calculated and any retrieval step fails with the exception reported in #1637.Proposed changes:
closes #1637