
feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721

Merged
merged 2 commits into deepset-ai:main on Dec 19, 2022
Conversation

vblagoje
Member

@vblagoje vblagoje commented Dec 15, 2022

Proposed Changes:

Enabled text-embedding-ada-002 for EmbeddingRetriever. See https://openai.com/blog/new-and-improved-embedding-model/ for more details.

To enable it, use the model's full name, i.e. text-embedding-ada-002:

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="text-embedding-ada-002",
    batch_size=32,
    api_key=api_key,
    max_seq_len=8191,
)

How did you test it?

Added unit test, ran Colab notebook

Notes for the reviewer

Have a look at https://beta.openai.com/docs/guides/embeddings/what-are-embeddings
We now have to deal with two generations of models. The old generation is handled as before; to use the new model, users specify its full name.

Have a quick look at the unit test, and run the Colab yourself

Checklist

@vblagoje vblagoje requested a review from a team as a code owner December 15, 2022 22:53
@vblagoje vblagoje requested review from ZanSara and removed request for a team December 15, 2022 22:53
@vblagoje vblagoje changed the title Enable text-embedding-ada-002 for EmbeddingRetriever feat: Enable text-embedding-ada-002 for EmbeddingRetriever Dec 15, 2022
@vblagoje vblagoje added this to the 1.12.0 milestone Dec 15, 2022
@vblagoje
Member Author

@ZanSara can you have a look at this one? Please tell me what you think of the decision to require the full model name when using the new model.

Contributor

@ZanSara ZanSara left a comment

Left some thoughts on the implementation

Comment on lines 400 to 410
model_class: str = next(
    (m for m in ["ada", "babbage", "davinci", "curie"] if m in retriever.embedding_model), "babbage"
)
self.query_model_encoder_engine = f"text-search-{model_class}-query-001"
self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
# new generation of embedding models (December 2022), we need to specify the full name
if model_class == "ada" and "text-embedding-ada-002" in retriever.embedding_model:
    self.query_encoder_model = "text-embedding-ada-002"
    self.doc_encoder_model = "text-embedding-ada-002"
else:
    self.query_encoder_model = f"text-search-{model_class}-query-001"
    self.doc_encoder_model = f"text-search-{model_class}-doc-001"

Contributor

I see the issue... I wonder what API we really want here. It's now getting a bit intricate as it is.

I have two ideas in mind: let me know if you like any of these.

No model_class

Why bother? It just makes the API more complex. The user gives the full model name.

This should be quite future proof as we're not assuming anything about the model name. On the other hand, the user needs to know which models are suitable and might provide a wrong model, so we need some more error-checking and better error messages.

One more issue: how to pass both model names to the Encoder.

        self.query_model_encoder_engine = retriever.embedding_model or "text-search-babbage-query-001"
        self.doc_model_encoder_engine = retriever.embedding_model or "text-search-babbage-doc-001"
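The extra error-checking this option would need could look roughly like this (a sketch under the assumption that we validate against a known allow-list; the list contents and the helper name validate_model_name are illustrative, not part of the PR):

```python
# Illustrative sketch of the validation the "no model_class" option would
# need: the user passes a full model name, and we fail fast with a helpful
# message if it is not a known OpenAI embedding model.
KNOWN_MODELS = {
    "text-search-ada-query-001",
    "text-search-babbage-query-001",
    "text-search-curie-query-001",
    "text-search-davinci-query-001",
    "text-embedding-ada-002",
}

def validate_model_name(model_name: str) -> str:
    """Return the name unchanged if known, otherwise raise a clear error."""
    if model_name not in KNOWN_MODELS:
        raise ValueError(
            f"Unknown OpenAI embedding model '{model_name}'. "
            f"Expected one of: {sorted(KNOWN_MODELS)}"
        )
    return model_name
```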

Add model_type and model_codenumber

The user not only gives us the model_class, but also the prefix ("search" or "embedding") and optionally the codenumber ('001', '002').

Less convincing IMHO, as we still assume a lot about what the model name looks like, but it has a cleaner API since the user only needs to give part of the model name.

        model_class: str = next(
            (m for m in ["ada", "babbage", "davinci", "curie"] if m in retriever.embedding_model), "babbage"
        )
        model_type: str = next(
            (m for m in ["search", "embedding"] if m in retriever.embedding_model), "search"
        )
        model_codenumber: str = next(
            (m for m in ["001", "002"] if m in retriever.embedding_model), "001"
        )
        self.query_model_encoder_engine = f"text-{model_type}-{model_class}-query-{model_codenumber}"
        self.doc_model_encoder_engine = f"text-{model_type}-{model_class}-doc-{model_codenumber}"

Member Author

@vblagoje vblagoje Dec 19, 2022

@ZanSara, thanks for the feedback; as you noticed, there are several important axes to consider for this "problem":

  1. we should ideally only use one string as input for the model selection i.e. retriever.embedding_model
  2. older models have doc and query encoder models, but new ones and future ones will have only one
  3. maintain some semblance of shared logic for supporting both old and new models

I would avoid adding new fields. We'll refactor EmbeddingRetriever soon, and new fields would create a mess. Using the model class for older models makes sense, as we retain backward compatibility in code and docs, and we can add support for new models easily. To differentiate new models from the old, it seems logical (at least to me) to specify the full model name. If anyone has a better proposal, let's hear it. Otherwise, admitting my bias, I would claim the current proposal in the PR is still the winner, taking everything into account.

Member

Hey @ZanSara and @vblagoje, I'm commenting here following Vladimir's request to have a look. I agree with Vladimir that it's better to avoid more fields to fill in, but I'd also like to say I prefer the 'no model class' option. The reason is that interacting with these models stays the way people are already used to: just provide the model name. From a usage perspective, it's closer to what we always do with models from HF too. This is just my two cents :) So I'd agree with 'let's just have people specify the full model name'.

Member

@masci masci Dec 19, 2022

AFAIU there won't be any additional "old generation" models, so the number of corner cases we have to deal with won't grow - if that's correct, I'm fine keeping the logic as it is in this PR. We could maybe make the code a bit more explicit, even more verbose, because it took me a while to understand what the issue was (the tests helped me a lot, thanks for those). Maybe you can isolate the code into a _setup_model_class() method of some sort and call it a day.

Member Author

Deal, will do @masci. While I'm at it, I'll see if we can perhaps handle future second-generation models.

@masci masci mentioned this pull request Dec 19, 2022
@vblagoje
Member Author

@julian-risch @bogdankostic @masci @mayankjobanputra @TuanaCelik Sara and I would need a third opinion on this, whenever you have a moment during your break. It'll take you < 10 min to understand the main constraints. We need the best proposal for continued support of the old OpenAI embedding models (ada, babbage, etc.) plus the ability to support the new text-embedding-ada-002, keeping in mind the impact on usability, user-friendliness, docs, etc. Having all that in mind, can you come up with something better than the current proposal?

@vblagoje
Member Author

Thanks all; the updated code commit is here. I am happy with this solution. If you have objections, let us know.

Contributor

@ZanSara ZanSara left a comment

Looks nice! Good to go 👍

@ZanSara ZanSara merged commit 56803e5 into deepset-ai:main Dec 19, 2022
ZanSara pushed a commit that referenced this pull request Dec 20, 2022
* Enable text-embedding-ada-002 for EmbeddingRetriever

* Easier to understand code, more unit tests
@vblagoje vblagoje deleted the open_ai_new_embedding branch February 28, 2023 12:26