Skip to content

feat: add prefix and suffix to SentenceTransformersDocumentEmbedder#5745

Merged
anakin87 merged 4 commits intomainfrom
st-doc-embedder-prefix-suffix
Sep 13, 2023
Merged

feat: add prefix and suffix to SentenceTransformersDocumentEmbedder#5745
anakin87 merged 4 commits intomainfrom
st-doc-embedder-prefix-suffix

Conversation

@anakin87
Copy link
Copy Markdown
Member

@anakin87 anakin87 commented Sep 8, 2023

Related Issues

Proposed Changes:

Add prefix and suffix attributes to SentenceTransformersDocumentEmbedder.
They can be used to add a prefix and suffix to the Document text before embedding it.
This is necessary to take full advantage of some modern embedding models, such as E5.

How did you test it?

Updated existing tests, and added a new test.

Checklist

@anakin87 anakin87 requested review from a team as code owners September 8, 2023 08:57
@anakin87 anakin87 requested review from MichelBartels and dfokina and removed request for a team September 8, 2023 08:57
@github-actions github-actions Bot added topic:tests type:documentation Improvements on the docs labels Sep 8, 2023
@mathislucka
Copy link
Copy Markdown
Contributor

Why not use the prompt builder? That would be really powerful as it would allow to embed metadata too.

@anakin87
Copy link
Copy Markdown
Member Author

anakin87 commented Sep 8, 2023

Why not use the prompt builder? That would be really powerful as it would allow to embed metadata too.

I originally had a similar idea: to create an Embedding template for Documents.

Then we had an internal discussion, in which the following aspects emerged:

  • We can already embed a list of metadata.

  • We have the intuition that adding the same words to all Documents in a collection
    (e.g. "The author of this document is: {doc.metadata['author']}")
    would not have a good effect on the vector representation, diluting its information power.

  • (The implementation of the Embedding template can be nontrivial.)

@mathislucka
Copy link
Copy Markdown
Contributor

We can already embed a list of metadata.

That's not very flexible, a metadata key might not be natural language but it also can't be easily changed because changing the data itself might be difficult.

We have the intuition that adding the same words to all Documents in a collection
(e.g. "The author of this document is: {doc.metadata['author']}")
would not have a good effect on the vector representation, diluting its information power.

Maybe better to do some experiments?

@anakin87
Copy link
Copy Markdown
Member Author

anakin87 commented Sep 8, 2023

This is a first implementation that we would like to merge. It unlocks the possibility to properly use powerful embedding models such as E5.

Currently, we haven't noticed significant interest or identified a clear use case for Embedding templates within the community.

If someone is interested in experimenting with this idea, they can easily create a custom component to directly manipulate the Document content.

If evidence emerges that Embedding templates offer substantial benefits, we can implement this feature in a future iteration.

@mathislucka
Copy link
Copy Markdown
Contributor

Not trying to block the merge, only commenting.

I think it would keep things simple. For example, templates would make sense for the ExtractiveReader too because there's currently no good way to account for metadata (see #5640). If templates could be applied before all inference nodes, then the users would only need to learn one concept and they would be pretty flexible in how they apply it. I'd prefer that over opaque embed_metadata and prefixes or suffixes but maybe that's just me.

Copy link
Copy Markdown
Contributor

@MichelBartels MichelBartels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me. However, I would agree with Mathis that it seems like a less powerful version of the PromptBuilder that we have already.

We have the intuition that adding the same words to all Documents in a collection
(e.g. "The author of this document is: {doc.metadata['author']}")
would not have a good effect on the vector representation, diluting its information power.

I do not share this intuition and also think that it would need to be supported by experiments. I think this is clear if you consider the following (completely made up) example of book reviews where the metadata is author and title:

Title: J. R. R. Tolkien
Author: Humphrey Carpenter
Review: This biography is really great! I would recommend it to everyone.
Title: The Hobbit
Author: J. R. R. Tolkien
Review: Bilbo Baggins is very nice. This book is is great!

You could also remove the words that are the same:

J. R. R. Tolkien
Humphrey Carpenter
This biography is really great! I would recommend it to everyone.
The Hobbit
J. R. R. Tolkien
Bilbo Baggins defeats the dragon. This book is is great!

If your question was "What characters has J. R. R. Tolkien written about?", you need the additional structural information through those "shared words" to know that the document about Tolkien and not written by him is completely useless for this.
Of course, examples like this might occur seldomly or the models might not be able to pickup on this information, but we can't really tell without experiments.

On the other hand, a good reason to handle this in the SentenceTransformersDocumentEmbedder would be if the format was handled by the Embedder automatically. You could argue that the user shouldn't have to worry about those details and that it should be formatted correctly for the most popular models without the user having to care about it. Otherwise, it would also be an easy pitfall for users who don't know about this requirement.

@anakin87
Copy link
Copy Markdown
Member Author

Thanks, @MichelBartels!
I see that there are mixed opinions about Embedding Templates, which should be clarified by experimentation.
I opened a dedicated issue for this: #5793

In the meantime, I'm going to merge this PR...

@anakin87 anakin87 self-assigned this Sep 13, 2023
@coveralls
Copy link
Copy Markdown
Collaborator

Pull Request Test Coverage Report for Build 6171163319

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.006%) to 48.965%

Files with Coverage Reduction New Missed Lines %
preview/components/embedders/sentence_transformers_document_embedder.py 1 97.62%
utils/context_matching.py 1 95.7%
Totals Coverage Status
Change from base Build 6170931279: 0.006%
Covered Lines: 11851
Relevant Lines: 24203

💛 - Coveralls

@anakin87 anakin87 merged commit 283ecf2 into main Sep 13, 2023
@anakin87 anakin87 deleted the st-doc-embedder-prefix-suffix branch September 13, 2023 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

prefix and suffix for SentenceTransformersDocumentEmbedder

4 participants