Proposal to add file similarity retriever to haystack #5629
@mathislucka I agree with you. We decided at our co:work last week that, without "subpipelines" or a similar abstraction, the v2 concept of pipelines is not really usable.
I confirm what Michel said about v2 "subpipelines": we discussed the topic during our co:work and agreed that this abstraction is going to be present in v2 in several situations. So when implementing components we shouldn't worry about splitting them into smaller units, because later they can be grouped back into larger units that contain "subpipelines" made of such smaller components.
Thank you for the input! 💯 In that case, @ZanSara @MichelBartels I can try to implement it as two separate sub-components:
How does that sound?
@elundaeva That sounds good to me.
- file_aggregation_key: default would be "file_id", but can be changed to e.g. "name"
- output: default "top_document", but can also be "file_aggregation_key"
- keep_original_score
- top_k = 5
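To make the four parameters above concrete, here is a minimal sketch in plain Python. It is not Haystack code: `Doc`, `aggregate_documents`, and `rrf_k` are hypothetical names, and the scoring rule (summing reciprocal ranks per file across the result lists) is an assumption, since this thread does not spell out how files are scored.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack document (not the real class)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = None


def aggregate_documents(
    results: List[List[Doc]],
    file_aggregation_key: str = "file_id",
    output: str = "top_document",
    keep_original_score: bool = True,
    top_k: int = 5,
    rrf_k: int = 60,  # assumption: smoothing constant for reciprocal ranks
) -> list:
    """Rank files by the summed reciprocal rank of their documents."""
    if output not in ("top_document", "file_aggregation_key"):
        raise ValueError(f"Unsupported output mode: {output!r}")

    file_scores: Dict[Any, float] = defaultdict(float)
    top_docs: Dict[Any, Doc] = {}
    for ranked in results:  # one ranked list per query
        for rank, doc in enumerate(ranked, start=1):
            key = doc.meta[file_aggregation_key]
            file_scores[key] += 1.0 / (rrf_k + rank)
            best = top_docs.get(key)
            # Remember the highest-scoring document seen for each file.
            if best is None or (doc.score or 0.0) > (best.score or 0.0):
                top_docs[key] = doc

    ranked_keys = sorted(file_scores, key=lambda k: file_scores[k], reverse=True)
    ranked_keys = ranked_keys[:top_k]
    if output == "file_aggregation_key":
        return ranked_keys
    return [
        Doc(
            content=top_docs[k].content,
            meta=dict(top_docs[k].meta),
            score=top_docs[k].score if keep_original_score else file_scores[k],
        )
        for k in ranked_keys
    ]
```

With `output="file_aggregation_key"` the function returns only the ranked key values; with the default it returns one representative document per file, carrying either its original score or the aggregated file score.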
@sjrl @MichelBartels I've updated the Detailed design section & the Basic example sections above - please take a look and let me know if something can be improved 👍
Also, for the implementation, I'm wondering what these sub-components should inherit from. One option could be DocumentFetcher based on BaseRetriever (because in practice it will work similarly to FilterRetriever, for example) and DocumentAggregator can be based on BaseRanker since it sort of ranks the retrieved documents by assessing full file similarity, but I'm not sure... WDYT?
one option could be DocumentFetcher based on BaseRetriever
Yeah, that makes sense to me. If it is going to basically be the FilterRetriever I do wonder if we should just use that node instead of creating a new one. What differences would there be between the FilterRetriever and the DocumentFetcher?
and DocumentAggregator can be based on BaseRanker since it sort of ranks the retrieved documents by assessing full file similarity but not sure
I think that roughly makes sense, since it does eventually rank the output by the score of each file. I would try to inherit from the BaseRanker class, but if that causes too much trouble I would inherit directly from BaseComponent instead.
one option could be DocumentFetcher based on BaseRetriever
Yeah, that makes sense to me. If it is going to basically be the FilterRetriever I do wonder if we should just use that node instead of creating a new one. What differences would there be between the FilterRetriever and the DocumentFetcher?
One small difference would be in how the metadata for filtering is provided: FilterRetriever takes it in the "filters" of the query, while DocumentFetcher would take the meta key for aggregation as a param and receive the value from the query itself.
But this input part isn't as important as the output: DocumentFetcher needs to return a list of strings with the fetched documents' contents (while FilterRetriever outputs Haystack document objects), which can then be provided to the next retriever(s) as "queries" for batch retrieval at the next stage. This could also be solved by adding an "output" param to FilterRetriever that enables returning a list of strings instead of whole documents. What do you think, @sjrl?
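The input/output difference described above can be sketched as follows. This is only an illustration over an in-memory list of dicts; `fetch_document_contents` and the `{"content": ..., "meta": ...}` layout are hypothetical and do not reflect the real FilterRetriever or any document store API.

```python
from typing import Any, Dict, List


def fetch_document_contents(
    documents: List[Dict[str, Any]],
    query: str,
    meta_key: str = "file_id",
) -> List[str]:
    """Return the contents of all documents whose meta[meta_key] equals the query.

    The meta key is a parameter and the value comes from the query itself;
    the result is a list of plain strings, ready to be used as queries for
    the next retriever, rather than a list of document objects.
    """
    return [
        doc["content"]
        for doc in documents
        if doc.get("meta", {}).get(meta_key) == query
    ]


# Usage: fetch every chunk that belongs to the file "report.pdf".
store = [
    {"content": "chunk 1", "meta": {"name": "report.pdf"}},
    {"content": "chunk 2", "meta": {"name": "report.pdf"}},
    {"content": "other", "meta": {"name": "notes.txt"}},
]
queries = fetch_document_contents(store, query="report.pdf", meta_key="name")
```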
@elundaeva Sorry, I could only take a look today because I was on PTO last week. I would agree with Sebastian that it would be nicer if there wasn't a need for a separate DocumentFetcher node. However, I can see that it would have to be ugly either way because of the output type.
Have you also thought about batch file retrieval? It was originally supported by your draft PR, but I can't see how it's possible the way it's described in the current proposal. I also didn't consider batching when I suggested splitting the components. I would guess that if we were to enable batching somehow, it would be even harder to use the FilterRetriever because we would need to flatten the result at some point.
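For illustration, the flattening step mentioned above could look like the sketch below. It assumes each file in the batch yields its own list of query strings; `flatten_with_sizes` and `regroup` are hypothetical helpers, not part of Haystack.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def flatten_with_sizes(nested: Sequence[Sequence[T]]) -> Tuple[List[T], List[int]]:
    """Flatten per-file lists into one list and remember each group's size."""
    flat: List[T] = []
    sizes: List[int] = []
    for group in nested:
        flat.extend(group)
        sizes.append(len(group))
    return flat, sizes


def regroup(flat: Sequence[T], sizes: Sequence[int]) -> List[List[T]]:
    """Inverse of flatten_with_sizes: split a flat list back into per-file groups."""
    groups: List[List[T]] = []
    start = 0
    for size in sizes:
        groups.append(list(flat[start:start + size]))
        start += size
    return groups
```

The flat list can be passed to a batch retrieval call in one go, and the recorded sizes let the per-query results be split back by file before aggregation.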
output = "top_document", # This new param is explained in Detailed design section below
keep_original_score = True
)
I'm a little confused by the full pipeline flow here. Could you provide code examples for the Retriever(s) and how they interact with the DocumentAggregator? Will these components be passed to a wrapper node like FileSimilarityRetriever?
I think how we package it is still an open question (related to this comment): whether it makes sense to create a wrapper node now or, if things change drastically with Haystack 2.0, to do it later and avoid double work 🤔
This proposal is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
Hi,
Summarizing the discussion till now
On Design Principle
On behavior
Compared to the custom
On links with other components
Questions and way forward
I tried creating the pipeline with v2 in this Colab Notebook. For both we'd need a
Assuming we have this fetcher:
Route 1
A bigger scoped
The notebook has a basic version of this.
Route 2
A smaller scoped
I couldn't get this to work. The retrievers expect
Asides
Next Step Questions
Hey @bglearning thanks for your work on this and for exploring the two different options you proposed. Let's start with your step 1 so
but I think you can go ahead and implement it based off of the v1 version. No need for a full proposal since it's a known and understood feature of Haystack v1. In the PR we can revisit something like the name, but my feeling is that we should leave it as
So we would want something like a
I do agree with @sjrl here, no need for a new proposal to add a Also I think it can be generic enough to work with every Document Store. The
@bglearning we're aware of this connection limitation and we're investigating possible solutions. The fastest solution would be changing the input types to accept both, but that complicates the logic a bit. We're trying to come up with something better. 🤔
Hi all, getting back to this with the
I updated the Colab Notebook above based on latest developments. Now, for Route 2, with
As such, leaning towards the first approach (Route 1) mentioned above, i.e. with a
Next: I'll update the proposal to reflect this.
Route 1
Route 2
Yeah I think I am also leaning towards this approach. I think using this it would be easier to perform checks on the incoming documents as well from the FilterRetriever in this case. For example, we will want to enforce that all incoming documents to the MetaDocumentAggregator all share the same value in the provided meta key ( |
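A check like the one described above could be sketched as follows. It uses plain dicts instead of Haystack documents, and `shared_meta_value` is a hypothetical helper, so treat it as an illustration of the validation rather than the planned implementation.

```python
from typing import Any, Dict, List


def shared_meta_value(documents: List[Dict[str, Any]], meta_key: str) -> Any:
    """Return the single value all documents share under meta_key.

    Raises ValueError if the list is empty, if any document lacks the key,
    or if the documents disagree on the value.
    """
    if not documents:
        raise ValueError("No documents were provided.")
    values = set()
    for doc in documents:
        meta = doc.get("meta", {})
        if meta_key not in meta:
            raise ValueError(f"Document is missing meta key {meta_key!r}.")
        values.add(meta[meta_key])  # assumes meta values are hashable
    if len(values) != 1:
        raise ValueError(
            f"Expected one value for {meta_key!r}, got {len(values)} different values."
        )
    return values.pop()
```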
Revisiting this as part of v2 migrations. Considering a third route (also again added to the Colab Notebook): To summarize: We want to do two things based on resultant documents from the filter retriever:
Route 1 above does both things inside a single component. Leaning towards Route 3 now, as it results in smaller and possibly independently reusable components. E.g. "SimilarDocuments to a select group of documents from previous result" or "Aggregate/group the result from a query based on metadata using DocumentMetaAggregator"
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
Related Issues
Link to code PR - #5666
Proposed Changes:
Proposal to add file similarity retriever to haystack