feat: add `prefix` and `suffix` to `SentenceTransformersDocumentEmbedder` by anakin87 · Pull Request #5745 · deepset-ai/haystack

anakin87 · 2023-09-08T08:57:30Z

Related Issues

fixes prefix and suffix for SentenceTransformersDocumentEmbedder #5741

Proposed Changes:

Add prefix and suffix attributes to SentenceTransformersDocumentEmbedder.
They can be used to add a prefix and suffix to the Document text before embedding it.
This is necessary to take full advantage of some modern embedding models, such as E5.
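
As a rough illustration of what the two attributes do, here is a minimal sketch in plain Python. The `Doc` class and the `texts_to_embed` helper are hypothetical stand-ins and not the Haystack implementation; the `"passage: "` prefix is the one E5 models expect on documents.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; only the text is modeled."""
    text: str


def texts_to_embed(docs, prefix="", suffix=""):
    """Return the strings that would be passed to the embedding model."""
    # The prefix and suffix are concatenated as-is, so any separator
    # (such as the trailing space in "passage: ") must be part of them.
    return [prefix + doc.text + suffix for doc in docs]


docs = [Doc("Bilbo Baggins lives in the Shire.")]
prepared = texts_to_embed(docs, prefix="passage: ")
print(prepared)
```

With both arguments left at their empty defaults, the text is passed through unchanged, which matches the behavior for models that need no special formatting.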

How did you test it?

Updated existing tests, and added a new test.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

mathislucka · 2023-09-08T10:44:35Z

Why not use the prompt builder? That would be really powerful as it would allow to embed metadata too.

anakin87 · 2023-09-08T11:15:20Z

> Why not use the prompt builder? That would be really powerful as it would allow to embed metadata too.

I originally had a similar idea: to create an Embedding template for Documents.

Then we had an internal discussion, in which the following aspects emerged:

- We can already embed a list of metadata.
- We have the intuition that adding the same words to all Documents in a collection (e.g. "The author of this document is: {doc.metadata['author']}") would not have a good effect on the vector representation, diluting its information power.
- (The implementation of the Embedding template can be nontrivial.)

mathislucka · 2023-09-08T13:17:15Z

> We can already embed a list of metadata.

That's not very flexible: a metadata key might not be natural language, and it can't easily be changed because changing the data itself might be difficult.

> We have the intuition that adding the same words to all Documents in a collection
> (e.g. "The author of this document is: {doc.metadata['author']}")
> would not have a good effect on the vector representation, diluting its information power.

Maybe better to do some experiments?

anakin87 · 2023-09-08T14:43:43Z

This is a first implementation that we would like to merge. It unlocks the possibility to properly use powerful embedding models such as E5.

Currently, we haven't noticed significant interest or identified a clear use case for Embedding templates within the community.

If someone is interested in experimenting with this idea, they can easily create a custom component to directly manipulate the Document content.

If evidence emerges that Embedding templates offer substantial benefits, we can implement this feature in a future iteration.
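
For anyone who wants to try the experiment, a custom preprocessing step could look roughly like the sketch below. It assumes a hypothetical `Doc` class with `text` and `metadata` fields and a hypothetical `apply_template` helper built on Python's `str.format`; it is not an existing Haystack component.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a Document with text and metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def apply_template(doc, template):
    """Return a copy of `doc` whose text is rendered from `template`.

    The template can reference `{text}` and `{metadata[key]}`.
    A missing metadata key raises KeyError.
    """
    rendered = template.format(text=doc.text, metadata=doc.metadata)
    return replace(doc, text=rendered)


template = "The author of this document is: {metadata[author]}\n{text}"
doc = Doc("Bilbo Baggins finds a ring.", {"author": "J. R. R. Tolkien"})
print(apply_template(doc, template).text)
```

The original document is left untouched, so the templated text can be embedded while the stored content stays as it was.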

mathislucka · 2023-09-11T08:44:09Z

Not trying to block the merge, only commenting.

I think templates would keep things simple. For example, they would make sense for the ExtractiveReader too, because there's currently no good way to account for metadata (see #5640). If templates could be applied before all inference nodes, users would only need to learn one concept and would be pretty flexible in how they apply it. I'd prefer that over opaque embed_metadata and prefixes or suffixes, but maybe that's just me.

MichelBartels

The code looks good to me. However, I would agree with Mathis that it seems like a less powerful version of the PromptBuilder that we have already.

> We have the intuition that adding the same words to all Documents in a collection
> (e.g. "The author of this document is: {doc.metadata['author']}")
> would not have a good effect on the vector representation, diluting its information power.

I do not share this intuition and also think that it would need to be supported by experiments. I think this is clear if you consider the following (completely made up) example of book reviews where the metadata is author and title:

Title: J. R. R. Tolkien
Author: Humphrey Carpenter
Review: This biography is really great! I would recommend it to everyone.

Title: The Hobbit
Author: J. R. R. Tolkien
Review: Bilbo Baggins is very nice. This book is great!

You could also remove the words that are the same:

J. R. R. Tolkien
Humphrey Carpenter
This biography is really great! I would recommend it to everyone.

The Hobbit
J. R. R. Tolkien
Bilbo Baggins defeats the dragon. This book is great!

If your question was "What characters has J. R. R. Tolkien written about?", you need the structural information carried by those "shared words" to know that the document about Tolkien, rather than written by him, is completely useless for answering it.
Of course, examples like this might occur rarely, or the models might not be able to pick up on this information, but we can't really tell without experiments.

On the other hand, a good reason to handle this in the SentenceTransformersDocumentEmbedder would be if the format was handled by the Embedder automatically. You could argue that the user shouldn't have to worry about those details and that it should be formatted correctly for the most popular models without the user having to care about it. Otherwise, it would also be an easy pitfall for users who don't know about this requirement.
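
One way to picture this "handled automatically" option is a lookup from model name to a default document prefix, which an explicit user setting overrides. The sketch below is only an assumption about how that could work: `DEFAULT_DOCUMENT_PREFIXES` and `resolve_prefix` are hypothetical names, and the single entry reflects the `"passage: "` convention of E5 models.

```python
# Hypothetical mapping; a real one would have to be curated per model.
DEFAULT_DOCUMENT_PREFIXES = {
    "intfloat/e5-base-v2": "passage: ",
}


def resolve_prefix(model_name, user_prefix=None):
    """Pick the prefix to prepend to each Document text.

    An explicit `user_prefix` (even an empty string) always wins;
    otherwise fall back to the model's default, or to no prefix.
    """
    if user_prefix is not None:
        return user_prefix
    return DEFAULT_DOCUMENT_PREFIXES.get(model_name, "")


print(repr(resolve_prefix("intfloat/e5-base-v2")))
print(repr(resolve_prefix("intfloat/e5-base-v2", user_prefix="")))
```

Using `None` as the "not set" value lets a user deliberately disable the default by passing an empty string.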

anakin87 · 2023-09-13T09:36:57Z

Thanks, @MichelBartels!
I see that there are mixed opinions about Embedding Templates, which should be clarified by experimentation.
I opened a dedicated issue for this: #5793

In the meantime, I'm going to merge this PR...

coveralls · 2023-09-13T10:44:06Z

Pull Request Test Coverage Report for Build 6171163319

0 of 0 changed or added relevant lines in 0 files are covered.
2 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.006%) to 48.965%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| preview/components/embedders/sentence_transformers_document_embedder.py | 1 | 97.62% |
| utils/context_matching.py | 1 | 95.7% |

Totals
Change from base Build 6170931279:	0.006%
Covered Lines:	11851
Relevant Lines:	24203

💛 - Coveralls

add prefix and suffix

64ed601

anakin87 requested review from a team as code owners September 8, 2023 08:57

anakin87 requested review from MichelBartels and dfokina and removed request for a team September 8, 2023 08:57

github-actions bot added the topic:tests and type:documentation labels Sep 8, 2023

MichelBartels approved these changes Sep 11, 2023

anakin87 mentioned this pull request Sep 13, 2023

Investigate the utility of Embedding Templates #5793

Closed

Merge branch 'main' into st-doc-embedder-prefix-suffix

9dcd7cd

fix test

d6e1c1b

anakin87 self-assigned this Sep 13, 2023

Merge branch 'main' into st-doc-embedder-prefix-suffix

818e32c

anakin87 merged commit 283ecf2 into main Sep 13, 2023

anakin87 deleted the st-doc-embedder-prefix-suffix branch September 13, 2023 10:55

ZanSara mentioned this pull request Oct 16, 2023

chore: Telemetry for embedder classes #6072

Merged

anakin87 merged 4 commits into main from st-doc-embedder-prefix-suffix
