Proposal to add file similarity retriever to haystack #5629

elundaeva · 2023-08-25T13:05:13Z

Related Issues

Link to code PR - #5666

Proposed Changes:

Proposal to add file similarity retriever to haystack

proposals/text/5629-file-similarity-retriever.md

MichelBartels · 2023-09-01T14:20:23Z

@mathislucka I agree with you and we have decided at our co.work last week that without "subpipelines" or a similar abstraction the v2 concept of pipelines is not really usable.
That's also why I was asking. It seems like it's a tradeoff between ease of porting it to v2 and usability while we're still using v1 pipelines.

ZanSara · 2023-09-04T10:48:05Z

I confirm what Michel said about v2 "subpipelines": we have discussed the topic during our co:work and agreed that this abstraction is going to be present in v2 in several situations. So when implementing components we shouldn't worry about splitting them into smaller units, because later they can be grouped back into larger units that contain "subpipelines" made of such smaller components.

elundaeva · 2023-09-04T13:02:01Z

Thank you for the input! 💯 In that case, @ZanSara @MichelBartels I can try to implement it as two separate sub-components:

DocumentFetcher that gets all docs corresponding to a file based on a provided aggregation meta key and returns a list of docs to be used by one or two retrievers as input for batch retrieval.
&
DocumentAggregator placed after the retriever (or the JoinNode) that aggregates the document list(s) into a single ranked list of files (which are put together based on file_ids of retrieved docs) using RRF.

How does it sound?

MichelBartels · 2023-09-04T13:15:58Z

@elundaeva That sounds good to me.
@ZanSara Just to confirm, would we then want some sort of abstraction for v1? Otherwise, it's not very usable until we release v2.

elundaeva · 2023-09-07T10:13:30Z

proposals/text/5629-file-similarity-retriever.md

+  - file_aggregation_key: default would be "file_id", but can be changed to e.g. "name"
+  - output: default "top_document", but can also be "file_aggregation_key"
+  - keep_original_score
+  - top_k = 5


@sjrl @MichelBartels I've updated the Detailed design section & the Basic example sections above - please take a look and let me know if sth can be improved 👍

Also, for the implementation, I'm thinking what these sub-components should inherit from - one option could be DocumentFetcher based on BaseRetriever (bc in practice it will work similarly to FilterRetriever for example) and DocumentAggregator can be based on BaseRanker since it sort of ranks the retrieved documents by assessing full file similarity but not sure... WDYT?

one option could be DocumentFetcher based on BaseRetriever

Yeah, that makes sense to me. If it is going to basically be the FilterRetriever I do wonder if we should just use that node instead of creating a new one. What differences would there be between the FilterRetriever and the DocumentFetcher?

and DocumentAggregator can be based on BaseRanker since it sort of ranks the retrieved documents by assessing full file similarity but not sure

I think that makes rough sense since it does eventually rank the output by the score of each file. I would try and inherit from the BaseRanker class, but if it causes too much trouble I would also directly inherit from BaseComponent.

one option could be DocumentFetcher based on BaseRetriever

Yeah, that makes sense to me. If it is going to basically be the FilterRetriever I do wonder if we should just use that node instead of creating a new one. What differences would there be between the FilterRetriever and the DocumentFetcher?

One small difference would be in how the metadata for filtering is provided - in the "filters" of the query (in FilterRetriever) or the meta key for aggregation is given as a param to DocumentFetcher and the value is received from the query itself.
But this input part isn't as important as the output - DocumentFetcher needs to return a list of strings of fetched documents' contents (while FilterRetriever outputs haystack document objects), which can then be provided to the next retriever(s) as "queries" for batch retrieval at the next stage. Though this can be solved by adding an "output" param for FilterRetriever that enables an option to get a list of strings instead of whole documents returned, what do you think @sjrl ?

@elundaeva Sorry, I could only take a look today because I was on PTO last week. I would agree with Sebastian that it would be nicer if there wasn't a need for a separate DocumentFetcher node. However, I can see that it would have to be ugly either way because of the output type.

Have you also thought about batch file retrieval? It was originally supported by your draft PR, but I can't see how it's possible the way it's described in the current proposal. I also didn't consider batching when I suggested to split the components. I would guess if we were to enable batching somehow, it would be even harder to use the FilterRetriever because we would need to flatten the result at some point.

sjrl · 2023-09-07T10:32:56Z

proposals/text/5629-file-similarity-retriever.md

+        output = "top_document", # This new param is explained in Detailed design section below
+        keep_original_score = True
+    )
+


I'm a little confused by the full pipeline flow here. Could you provide code examples for the Retriever(s) and how they interact with the DocumentAggregator? Will these components be passed to a wrapper node like FileSimilarityRetriever?

I think how we package it is still an open question (related to this comment) - whether it makes sense to make a wrapper node already now or if things change drastically with Haystack 2.0 it could be done later to avoid double work 🤔

proposals/text/5629-file-similarity-retriever.md

github-actions · 2023-10-21T01:45:05Z

This proposal is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

bglearning · 2024-01-16T18:02:04Z

Hi,
Looking to reopen this proposal (or create a new one if necessary) of adding the FileSimilarityRetrieval now that v2 is in beta.

Summarizing the discussion till now

On Design Principle

We'd like to avoid composition of the components and ideally have many smaller components than one or few big ones. As such, the suggestion to create two components (DocumentFetcher, DocumentAggregator) instead of one (FileSimilarityRetriever).
The joining of such smaller components may be handled with subpipelines (details to be discussed)

On behavior

Compared to the custom FileSimilarityRetriever already available internally.

Return value: could be list of agg_key values instead of one doc per agg group
Would be good to pass list of retrievers instead of explicitly passing primary, secondary retrievers.

On links with other components

The FilterRetriever in v1 is very close to the DocumentFetcher being proposed. Differences:
- The way the input is specified (direct meta value as query vs ignored query + full filter syntax)
- The output format (list of docs vs possibly list of strings i.e. doc contents)
The DocumentJoiner could combine the docs from the different retrievers. It doesn't provide aggregation which would need to be another component. Also two undesired behavior:
- It would overwrite rrf score for a document with its last appearance instead of adding them up.
- It would overwrite the original score

Questions and way forward

I tried creating the pipeline with v2 in this Colab Notebook.
There seems to be two main options.

For both we'd need a DocumentFetcher component. We could also name it FilterRetriever. But would need to settle on exact behavior (perhaps converging on/collapsing the differences above). Maybe could be a separate proposal?

Assuming we have this fetcher:

Route 1

A bigger scoped MetaDocumentAggregator that wraps the retrievers (so has component composition) and also performs the aggregation.

The notebook has a basic version of this.

Route 2

A smaller scoped MetaDocumentAggregator that only does aggregation and receives documents from the retrievers.

I couldn't get this to work. The retrievers expect str whilst we produce List[Document] or List[str] (with a shaper-like node). Also depending on the retriever we might want to pass content or embedding. As such this seems harder to implement?

Asides

Seems like this could be a more general (something like MetaDocSimilarityRetriever) rather than FileSimilarityRetriever as the grouping can be things apart from file too.
List of retrievers seem fine for Route 1 (at least in code) (?)

Next Step Questions

First propose and implement a "FilterRetriever"?
Go with Route 1 or figure out a way to make Route 2 work?
Are we okay to have variable output type that depends on init arguments? (List[Document] i.e. one from each meta-key grouping vs List[str] i.e. the meta-key values themselves)

sjrl · 2024-01-17T08:42:26Z

Hey @bglearning thanks for your work on this and for exploring the two different options you proposed.

Let's start with your step 1 so

First propose and implement a "FilterRetriever"?

but I think you can go ahead and implement it based off of the v1 version. No need for a full proposal since it's a known and understood feature of Haystack v1. In the PR we can revisit something like the name, but my feeling is that we should leave it as FilterRetriever and we may need to implement different versions based on the doc store. For example, we have now have an InMemoryEmbeddingRetriever and then a separate OpenSearchEmbeddingRetriever implemented in our haystack-core-integrations repo.

So we would want something like a InMemoryFilterRetriever and a OpenSearchFilterRetriever in the respective packages.

silvanocerza · 2024-01-17T09:43:35Z

I do agree with @sjrl here, no need for a new proposal to add a FilterRetriever.

Also I think it can be generic enough to work with every Document Store. The filter_documents() method is part of the common Document Store interface. So there's no need to have a FilterRetriever for each Document Store.

I couldn't get this to work. The retrievers expect str whilst we produce List[Document] or List[str] (with a shaper-like node). Also depending on the retriever we might want to pass content or embedding. As such this seems harder to implement?

@bglearning we're aware of this connection limitation and we're investigating possible solutions. The fastest solution would be changing the input types to accept both but that complicates a bit the logic. We're trying to come up with something better. 🤔

bglearning · 2024-02-16T13:42:36Z

Hi all, getting back to this with the FilterRetriever (#6836) now available, plus with latest developments around v2.

I updated the Colab Notebook above based on latest developments.

Now, for Route 2, with OutputAdapter (#6936) available there is a way to connect the results from the first (filter) retrieval to the subsequent retrievers. However, given the retrievers take one query, I think there isn't a way to run them on all resulting docs from the first retrieval(?). I only took the first of the documents to make it work here. Aside: it's cool that we can simply connect multiple outputs to a component with Variadic input.

As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers. Please flag any issue that might cause (i.e. having list[<another_component>] inside a component. I think there won't be an issue with (de)-serialization. But can't think of other problems except of course that it goes somewhat against the multiple small components approach.

Next: I'll update the proposal to reflect this.

Route 1

Route 2

sjrl · 2024-02-19T10:45:42Z

As such, leaning towards the first approach (Route1) mentioned above i.e. with a MetaDocumentAggregator that has a list of retrievers.

Yeah I think I am also leaning towards this approach. I think using this it would be easier to perform checks on the incoming documents as well from the FilterRetriever in this case. For example, we will want to enforce that all incoming documents to the MetaDocumentAggregator all share the same value in the provided meta key (agg_key in your colab).

bglearning · 2024-05-10T11:17:32Z

Revisiting this as part of v2 migrations. Considering a third route (also again added to the Colab Notebook):

To summarize:

We want to do two things based on resultant documents from the filter retriever:

Run multiple retrievers for each document to get a set of similar documents.
Group/aggregate the documents from 1 based on an agg_key

Route 1 above does both things inside a single component.
Route 2 creates one component for 2 and tries to run 1 without a wrapper component for the retrievers but this doesn't seem to be possible.
Route 3 creates one component for 1 (SimilarDocumentsRetriever) and one for 2 (DocumentMetaAggregator).

Leaning towards Route3 now as it results in smaller and possibly independently reusable components. E.g. "SimilarDocuments to a select group of documents from previous result" or "Aggregate/group the result from a query based on metadata using DocumentMetaAggregator"

github-actions · 2024-06-10T01:50:38Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions bot added the proposal label Aug 25, 2023

elundaeva commented Aug 29, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

elundaeva mentioned this pull request Aug 29, 2023

[DRAFT] Add FileSimilarityRetriever to haystack #5666

Closed

sjrl reviewed Aug 29, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

sjrl reviewed Aug 29, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

sjrl reviewed Aug 29, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

elundaeva force-pushed the file-similarity-retriever-proposal branch from 2b3761c to 0acb0b6 Compare August 29, 2023 13:37

elundaeva added 7 commits August 29, 2023 15:38

1

7c356bb

2

5b3a76e

proposal

bd9389b

proposal update

128bb19

join node addition

cc1678a

link correction

2963dbf

Seb's feedback implementation

3958905

elundaeva force-pushed the file-similarity-retriever-proposal branch from 0acb0b6 to 3958905 Compare August 29, 2023 13:38

elundaeva marked this pull request as ready for review August 29, 2023 13:39

elundaeva requested review from a team as code owners August 29, 2023 13:39

elundaeva requested review from dfokina and MichelBartels and removed request for a team August 29, 2023 13:39

elundaeva changed the title ~~[DRAFT] Proposal to add file similarity retriever to haystack~~ Proposal to add file similarity retriever to haystack Aug 29, 2023

elundaeva added the proposal:active Proposal is in "Active" state label Aug 29, 2023

sjrl reviewed Aug 30, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

sjrl reviewed Aug 30, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Show resolved Hide resolved

sjrl reviewed Aug 30, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Show resolved Hide resolved

sjrl reviewed Aug 30, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

sjrl reviewed Aug 30, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Outdated Show resolved Hide resolved

mathislucka reviewed Aug 31, 2023

View reviewed changes

elundaeva added 3 commits September 6, 2023 09:58

design update

e0875fe

update of design and params description for sub-components

9df6811

format updates

bb8f902

elundaeva commented Sep 7, 2023

View reviewed changes

sjrl reviewed Sep 7, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Show resolved Hide resolved

sjrl reviewed Sep 7, 2023

View reviewed changes

proposals/text/5629-file-similarity-retriever.md Show resolved Hide resolved

Merge branch 'main' into file-similarity-retriever-proposal

38ccac5

github-actions bot added the stale label Oct 21, 2023

github-actions bot closed this Oct 31, 2023

sjrl reopened this Jan 17, 2024

github-actions bot added the 2.x Related to Haystack v2.0 label Jan 17, 2024

github-actions bot removed the stale label Jan 18, 2024

masci mentioned this pull request Jan 24, 2024

FileSimilarityRetriever - Add readme, fix mypy linter issues #6821

Merged

bglearning mentioned this pull request Jan 26, 2024

Feat: Add FilterRetriever #6836

Merged

bglearning mentioned this pull request May 23, 2024

feat: Add component SimilarDocumentsRetriever #7733

Closed

github-actions bot added the stale label Jun 10, 2024

github-actions bot closed this Jun 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to add file similarity retriever to haystack #5629

Proposal to add file similarity retriever to haystack #5629

elundaeva commented Aug 25, 2023 •

edited

Loading

MichelBartels commented Sep 1, 2023 •

edited

Loading

ZanSara commented Sep 4, 2023

elundaeva commented Sep 4, 2023 •

edited

Loading

MichelBartels commented Sep 4, 2023

elundaeva Sep 7, 2023

sjrl Sep 7, 2023

sjrl Sep 7, 2023

elundaeva Sep 8, 2023 •

edited

Loading

MichelBartels Sep 11, 2023

sjrl Sep 7, 2023

elundaeva Sep 8, 2023

github-actions bot commented Oct 21, 2023

bglearning commented Jan 16, 2024

sjrl commented Jan 17, 2024 •

edited

Loading

silvanocerza commented Jan 17, 2024

bglearning commented Feb 16, 2024

sjrl commented Feb 19, 2024

bglearning commented May 10, 2024 •

edited

Loading

github-actions bot commented Jun 10, 2024

Proposal to add file similarity retriever to haystack #5629

Proposal to add file similarity retriever to haystack #5629

Conversation

elundaeva commented Aug 25, 2023 • edited Loading

Related Issues

Proposed Changes:

MichelBartels commented Sep 1, 2023 • edited Loading

ZanSara commented Sep 4, 2023

elundaeva commented Sep 4, 2023 • edited Loading

MichelBartels commented Sep 4, 2023

elundaeva Sep 7, 2023

Choose a reason for hiding this comment

sjrl Sep 7, 2023

Choose a reason for hiding this comment

sjrl Sep 7, 2023

Choose a reason for hiding this comment

elundaeva Sep 8, 2023 • edited Loading

Choose a reason for hiding this comment

MichelBartels Sep 11, 2023

Choose a reason for hiding this comment

sjrl Sep 7, 2023

Choose a reason for hiding this comment

elundaeva Sep 8, 2023

Choose a reason for hiding this comment

github-actions bot commented Oct 21, 2023

bglearning commented Jan 16, 2024

Summarizing the discussion till now

On Design Principle

On behavior

On links with other components

Questions and way forward

Route 1

Route 2

Asides

Next Step Questions

sjrl commented Jan 17, 2024 • edited Loading

silvanocerza commented Jan 17, 2024

bglearning commented Feb 16, 2024

Route 1

Route 2

sjrl commented Feb 19, 2024

bglearning commented May 10, 2024 • edited Loading

github-actions bot commented Jun 10, 2024

elundaeva commented Aug 25, 2023 •

edited

Loading

MichelBartels commented Sep 1, 2023 •

edited

Loading

elundaeva commented Sep 4, 2023 •

edited

Loading

elundaeva Sep 8, 2023 •

edited

Loading

sjrl commented Jan 17, 2024 •

edited

Loading

bglearning commented May 10, 2024 •

edited

Loading