Question: OpenAI ada-002 embedding #1897

Closed
PhilipMay opened this issue Apr 18, 2023 · 8 comments

Comments

@PhilipMay
Contributor

PhilipMay commented Apr 18, 2023

Hi @nreimers ,

Your blog post about OpenAI embeddings is very interesting:
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9

Now that OpenAI has released ada-002, my question is: did you compare ada-002 with the embedding models provided by SBERT?

Another question: do you know of any company other than OpenAI that provides "multilingual text embeddings as an API call"? :-) How do they compare?

Many thanks
Philip

@nreimers
Member

On MTEB you can find the performance for ada-2:
https://huggingface.co/spaces/mteb/leaderboard

They are good, but not excellent compared to alternative options.

OpenAI embeddings just work for English.

But Cohere provides an embedding model as an API call that works well across 100+ languages:
https://docs.cohere.ai/docs/multilingual-language-models

@PhilipMay
Contributor Author

@nreimers that is exactly what I wanted to know. Very good and many thanks!!

> OpenAI embeddings just work for English.

Very interesting info. Thanks. Is there a reference anywhere? Or do users just have to find that out on their own? :-)

@nreimers
Member

They used to have a section on:
https://platform.openai.com/docs/guides/embeddings/english-only

stating that the model is trained just on English data. In tests on other languages it performs comparably to BM25 on Wikipedia data, so not really great.

@PhilipMay
Contributor Author

And now they deleted the section? Doh!

How would you suggest testing ada-002?
Spearman correlation of cosine similarities on a translated STSb dataset?
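
For illustration, a minimal sketch of that kind of test: compute the cosine similarity of each sentence pair's embeddings, then take the Spearman rank correlation against the gold similarity scores. Everything here is an assumption for the example: the toy 2-d vectors stand in for real embeddings (which would come from whichever model is being evaluated), the gold scores are made up, and the helper names are hypothetical. It uses only the standard library; `scipy.stats.spearmanr` would be the usual choice in practice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sts_spearman(embedding_pairs, gold_scores):
    """embedding_pairs: list of (vec_a, vec_b); gold_scores: human ratings."""
    predicted = [cosine(a, b) for a, b in embedding_pairs]
    return spearman(predicted, gold_scores)

# Toy stand-ins (hypothetical): real vectors would come from the model under test.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-identical sentences
    ([1.0, 0.0], [0.5, 0.5]),  # somewhat related
    ([1.0, 0.0], [0.0, 1.0]),  # unrelated
]
toy_gold = [4.8, 2.5, 0.2]  # made-up gold similarity ratings

score = sts_spearman(toy_pairs, toy_gold)
```

Because Spearman only looks at rank order, the score does not depend on the scale of the gold ratings, only on whether the model orders the pairs the same way the annotators did.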

@nreimers
Member

Yes, sadly it was deleted sometime between March and today. It mainly said that the model was only trained on English data and that they don't expect it to work well on other languages.

We tested ada-002 on some languages of the MIRACL dataset, as we are primarily interested in search:
https://arxiv.org/abs/2210.09984

For English it was OK (it had issues connected to cosine similarity); for other languages performance was not really good (on par with or worse than BM25 from Elasticsearch).
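
As a rough illustration of the lexical baseline mentioned here, this is a minimal sketch of textbook Okapi BM25 scoring. It is not the evaluation code used for the MIRACL tests and not Elasticsearch's exact implementation; the whitespace-tokenized toy documents and the function name are assumptions for the example, and `k1=1.2`, `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Assumes a non-empty corpus with at least one non-empty document.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy corpus, already lowercased and tokenized.
toy_docs = [
    "berlin is the capital of germany".split(),
    "paris is the capital of france".split(),
    "bm25 is a ranking function".split(),
]
toy_scores = bm25_scores("capital germany".split(), toy_docs)
```

BM25 only rewards exact term overlap, which is why it makes a useful floor: an embedding model that cannot beat it on a retrieval benchmark is adding little beyond keyword matching for that language.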

@gabriead

gabriead commented Jun 5, 2023

How does instructorXL compare to Sentence Transformers?

@PhilipMay
Contributor Author

> How does instructorXL compare to Sentence Transformers?

Good question!

@bhavishpahwa

> They used to have a section on:
> https://platform.openai.com/docs/guides/embeddings/english-only
>
> stating that the model is trained just on English data. In tests on other languages it performs comparably to BM25 on Wikipedia data, so not really great.

Yep, you seem to be right @nreimers, the Web Archive snapshots prove it :)

https://web.archive.org/web/20221221004637/http://web.archive.org/screenshot/https://beta.openai.com/docs/guides/embeddings/english-only

By the way, is it legally/ethically OK to disclose the limitations properly and then remove part of that disclosure without changing anything about the embedding model? They also don't explicitly mention non-English or cross-lingual support.
