feat: Enable text-embedding-ada-002 for EmbeddingRetriever #3721
Conversation
@ZanSara can you have a look at this one? Please tell me whether you think the decision to use a new model should require specifying the full model name.
Left some thoughts on the implementation
```diff
 model_class: str = next(
     (m for m in ["ada", "babbage", "davinci", "curie"] if m in retriever.embedding_model), "babbage"
 )
-self.query_model_encoder_engine = f"text-search-{model_class}-query-001"
-self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
+# new generation of embedding models (December 2022), we need to specify the full name
+if model_class == "ada" and "text-embedding-ada-002" in retriever.embedding_model:
+    self.query_encoder_model = "text-embedding-ada-002"
+    self.doc_encoder_model = "text-embedding-ada-002"
+else:
+    self.query_encoder_model = f"text-search-{model_class}-query-001"
+    self.doc_encoder_model = f"text-search-{model_class}-doc-001"
```
I see the issue... I wonder what API we really want here. It's now getting a bit intricate as it is.
I have two ideas in mind: let me know if you like any of these.
No model_class

Why bother? It just makes the API more complex. The user provides the full model name.
This should be quite future-proof, as we're not assuming anything about the model name. On the other hand, the user needs to know which models are suitable and might provide a wrong one, so we would need more error checking and better error messages.
One more issue: how to pass both model names to the Encoder.

```python
self.query_model_encoder_engine = retriever.embedding_model or "text-search-babbage-query-001"
self.doc_model_encoder_engine = retriever.embedding_model or "text-search-babbage-doc-001"
```
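For illustration, a minimal sketch of the extra error checking this option would need. The model list, function name, and error wording below are assumptions for the sketch, not actual Haystack code:

```python
# Hypothetical allow-list of valid OpenAI embedding model names;
# the real set would need to be kept in sync with OpenAI's docs.
VALID_OPENAI_MODELS = {
    "text-search-ada-query-001",
    "text-search-babbage-query-001",
    "text-search-curie-query-001",
    "text-search-davinci-query-001",
    "text-embedding-ada-002",
}


def validate_model_name(embedding_model):
    """Fall back to a default when unset, fail loudly on unknown names."""
    if embedding_model is None:
        return "text-search-babbage-query-001"
    if embedding_model not in VALID_OPENAI_MODELS:
        raise ValueError(
            f"Unknown OpenAI embedding model '{embedding_model}'. "
            f"Choose one of: {sorted(VALID_OPENAI_MODELS)}"
        )
    return embedding_model
```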
Add model_type and model_codenumber

The user gives us not only the model_class, but also the prefix ("search" or "embedding") and optionally the codenumber ("001", "002").
Less convincing imho, as we still assume a lot about what the model name looks like, but it has a cleaner API since the user only needs to give part of the model name.
```python
model_class: str = next(
    (m for m in ["ada", "babbage", "davinci", "curie"] if m in retriever.embedding_model), "babbage"
)
model_type: str = next(
    (m for m in ["search", "embedding"] if m in retriever.embedding_model), "search"
)
model_codenumber: str = next(
    (m for m in ["001", "002"] if m in retriever.embedding_model), "001"
)
self.query_model_encoder_engine = f"text-{model_type}-{model_class}-query-{model_codenumber}"
self.doc_model_encoder_engine = f"text-{model_type}-{model_class}-doc-{model_codenumber}"
```
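As a standalone illustration of the substring matching above (the helper name is made up for the sketch, not Haystack code), this is how a couple of model names would resolve:

```python
def parse_model_name(embedding_model):
    # Same substring-matching idea as the snippet above, with the same defaults.
    model_class = next((m for m in ["ada", "babbage", "davinci", "curie"] if m in embedding_model), "babbage")
    model_type = next((m for m in ["search", "embedding"] if m in embedding_model), "search")
    model_codenumber = next((m for m in ["001", "002"] if m in embedding_model), "001")
    return model_class, model_type, model_codenumber

# "text-embedding-ada-002"        -> ("ada", "embedding", "002")
# "text-search-babbage-query-001" -> ("babbage", "search", "001")
```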
@ZanSara, thanks for the feedback. As you noticed, there are several important axes to consider for this "problem":
- we should ideally use only one string as input for the model selection, i.e. retriever.embedding_model
- older models have separate doc and query encoder models, but new and future ones will have only one
- we should maintain some semblance of logic for supporting both old and new models

I would avoid adding new fields. We'll refactor EmbeddingRetriever soon, and new fields would create a mess. Using the model class for older models makes sense, as we retain backward compatibility for code and docs, and we can add support for new models easily. To differentiate new models from the old, it seems logical (at least to me) to specify the full model name. If anyone has a better proposal, let's hear it. Otherwise, admitting my bias, I would claim the current proposal in the PR is still the winner, all things considered.
Hey @ZanSara and @vblagoje, I'm commenting here following Vladimir's request to have a look. I agree with Vladimir that it would be better to avoid adding more fields to fill in, but I also prefer the 'no model class' option. The reason is that it keeps the interaction with these models the way people are already used to: just provide the model name. From a usage perspective, it's also closer to what we already do with models from HF. This is just my two cents :) so I would go with 'let's just have people specify the full model name'.
AFAIU there won't be any additional "old generation" models, so the number of corner cases we have to deal with won't grow. If that's correct, I'm fine keeping the logic as it is in this PR. We could maybe make the code a bit more explicit, more verbose even, because it took me a while to understand what the issue was (the tests helped me a lot, thanks for those). Maybe you can isolate the code into a _setup_model_class() method of some sort and call it a day.
Deal, will do @masci. While at it, I'll see if we can perhaps handle future second-generation models as well.
@julian-risch @bogdankostic @masci @mayankjobanputra @TuanaCelik Sara and I would need a third opinion on this whenever you have a moment during your break. It'll take you < 10 min to understand the main constraints and come up with the best proposal for continued support of the old OpenAI embedding models (ada, babbage, etc.) and for the new one.
Thanks all; the updated code commit is here. I am happy with this solution. If you have objections, let us know.
Looks nice! Good to go 👍
* Enable text-embedding-ada-002 for EmbeddingRetriever
* Easier-to-understand code, more unit tests
Proposed Changes:
Enabled text-embedding-ada-002 for EmbeddingRetriever. See https://openai.com/blog/new-and-improved-embedding-model/ for more details.
To enable this retriever, use the full model name, i.e. text-embedding-ada-002.
How did you test it?
Added unit test, ran Colab notebook
Notes for the reviewer
Have a look at https://beta.openai.com/docs/guides/embeddings/what-are-embeddings
We now have to deal with two generations of models. The old generation is handled as before; to use the new model, users specify the full model name.
Have a quick look at the unit test, and run the Colab yourself.
Checklist