
feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721

Merged
merged 2 commits into deepset-ai:main on Dec 19, 2022
Conversation

vblagoje
Member

@vblagoje vblagoje commented Dec 15, 2022

Proposed Changes:

Enabled text-embedding-ada-002 for EmbeddingRetriever. See https://openai.com/blog/new-and-improved-embedding-model/ for more details.

To enable it, use the model's full name, i.e. text-embedding-ada-002:

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="text-embedding-ada-002",
    batch_size=32,
    api_key=api_key,
    max_seq_len=8191,
)

How did you test it?

Added unit test, ran Colab notebook

Notes for the reviewer

Have a look at https://beta.openai.com/docs/guides/embeddings/what-are-embeddings
We now have to deal with two generations of models. The old generation is handled as before; to use the new model, users specify its full name.

Have a quick look at the unit test, and run the Colab yourself

Checklist

@vblagoje vblagoje requested a review from a team as a code owner December 15, 2022 22:53
@vblagoje vblagoje requested review from ZanSara and removed request for a team December 15, 2022 22:53
@vblagoje vblagoje changed the title Enable text-embedding-ada-002 for EmbeddingRetriever feat: Enable text-embedding-ada-002 for EmbeddingRetriever Dec 15, 2022
@vblagoje vblagoje added this to the 1.12.0 milestone Dec 15, 2022
@vblagoje
Member Author

@ZanSara can you have a look at this one? Please tell me what you think of the decision to require the full model name when using the new model.

Contributor

@ZanSara ZanSara left a comment

Left some thoughts on the implementation

Comment on lines 400 to 410
model_class: str = next(
    (m for m in ["ada", "babbage", "davinci", "curie"] if m in retriever.embedding_model), "babbage"
)
self.query_model_encoder_engine = f"text-search-{model_class}-query-001"
self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
# new generation of embedding models (December 2022), we need to specify the full name
if model_class == "ada" and "text-embedding-ada-002" in retriever.embedding_model:
    self.query_encoder_model = "text-embedding-ada-002"
    self.doc_encoder_model = "text-embedding-ada-002"
else:
    self.query_encoder_model = f"text-search-{model_class}-query-001"
    self.doc_encoder_model = f"text-search-{model_class}-doc-001"

Contributor

I see the issue... I wonder what API we really want here. It's now getting a bit intricate as it is.

I have two ideas in mind: let me know if you like any of these.

No model_class

Why bother? It just makes the API more complex. The user gives the full model name.

This should be quite future proof as we're not assuming anything about the model name. On the other hand, the user needs to know which models are suitable and might provide a wrong model, so we need some more error-checking and better error messages.

One more issue: how to pass both model names to the Encoder.

        self.query_model_encoder_engine = retriever.embedding_model or "text-search-babbage-query-001"
        self.doc_model_encoder_engine = retriever.embedding_model or "text-search-babbage-doc-001"
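The extra error-checking this option would need could look roughly like this (a sketch under the assumption that we validate against a known allow-list; the list contents and the helper name validate_model_name are illustrative, not part of the PR):

```python
# Illustrative sketch of the validation the "no model_class" option would
# need: the user passes a full model name, and we fail fast with a helpful
# message if it is not a known OpenAI embedding model.
KNOWN_MODELS = {
    "text-search-ada-query-001",
    "text-search-babbage-query-001",
    "text-search-curie-query-001",
    "text-search-davinci-query-001",
    "text-embedding-ada-002",
}

def validate_model_name(model_name: str) -> str:
    """Return the name unchanged if known, otherwise raise a clear error."""
    if model_name not in KNOWN_MODELS:
        raise ValueError(
            f"Unknown OpenAI embedding model '{model_name}'. "
            f"Expected one of: {sorted(KNOWN_MODELS)}"
        )
    return model_name
```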

Add model_type and model_codenumber

The user not only gives us the model_class, but also the prefix ("search" or "embedding") and optionally the codenumber ('001', '002').

Less convincing IMHO, as we still assume a lot about what the model name looks like, but it has a cleaner API since the user only needs to give part of the model name.

        model_class: str = next(
            (m for m in ["ada", "babbage", "davinci", "curie"] if m in retriever.embedding_model), "babbage"
        )
        model_type: str = next(
            (m for m in ["search", "embedding"] if m in retriever.embedding_model), "search"
        )
        model_codenumber: str = next(
            (m for m in ["001", "002"] if m in retriever.embedding_model), "001"
        )
        self.query_model_encoder_engine = f"text-{model_type}-{model_class}-query-{model_codenumber}"
        self.doc_model_encoder_engine = f"text-{model_type}-{model_class}-doc-{model_codenumber}"

Member Author

@vblagoje vblagoje Dec 19, 2022

@ZanSara, thanks for the feedback; as you noticed, there are several important axes to consider for this "problem":

  1. we should ideally only use one string as input for the model selection i.e. retriever.embedding_model
  2. older models have doc and query encoder models, but new ones and future ones will have only one
  3. maintain some semblance of shared logic for supporting both old and new models

I would avoid adding new fields. We'll refactor EmbeddingRetriever soon, and new fields would create a mess. Using the model class for older models makes sense, as we retain backward compatibility in code and docs, and we can add support for new models easily. To differentiate new models from the old, it seems logical (at least to me) to specify the full model name. If anyone has a better proposal, let's hear it. Otherwise, admitting my bias, I would claim the current proposal in the PR is still the winner, taking everything into account.

Member

Hey @ZanSara and @vblagoje, I'm commenting here following Vladimir's request to have a look. I agree with Vladimir that it's better to avoid more fields to fill in, but I'd also like to say I prefer the 'no model class' option. The reason is that interacting with these models stays the way people are already used to: just provide the model name. From a usage perspective, it's closer to what we always do with models from HF too. This is just my two cents :) So I'd agree with 'let's just have people specify the full model name'.

Member

@masci masci Dec 19, 2022

AFAIU there won't be any additional "old generation" models, so the number of corner cases we have to deal with won't grow - if that's correct, I'm fine keeping the logic as it is in this PR. We could maybe make the code a bit more explicit, even more verbose, because it took me a while to understand what the issue was (the tests helped me a lot, thanks for those). Maybe you can isolate the code into a _setup_model_class() method of some sort and call it a day.

Member Author

Deal, will do @masci. While I'm at it, I'll see if we can perhaps handle future second-generation models.

@masci masci mentioned this pull request Dec 19, 2022
@vblagoje
Member Author

@julian-risch @bogdankostic @masci @mayankjobanputra @TuanaCelik Sara and I would need a third opinion on this, whenever you have a moment during your break. It'll take you < 10 min to understand the main constraints. We need the best proposal for continued support of the old OpenAI embedding models (ada, babbage, etc.) plus the ability to support the new text-embedding-ada-002, keeping in mind the impact on usability, user-friendliness, docs, etc. Having all that in mind, can you come up with something better than the current proposal?

@vblagoje
Member Author

Thanks all; the updated code commit is here. I am happy with this solution. If you have objections, let us know.

Contributor

@ZanSara ZanSara left a comment

Looks nice! Good to go 👍

@ZanSara ZanSara merged commit 56803e5 into deepset-ai:main Dec 19, 2022
ZanSara pushed a commit that referenced this pull request Dec 20, 2022
* Enable text-embedding-ada-002 for EmbeddingRetriever

* Easier to understand code, more unit tests
@vblagoje vblagoje deleted the open_ai_new_embedding branch February 28, 2023 12:26