Question: OpenAI ada-002 embedding #1897

PhilipMay · 2023-04-18T18:05:17Z

Your blog post about OpenAI embeddings is very interesting:
https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9

Now that OpenAI has released ada-2, my question is: Did you compare ada-2 with the embedding models provided by SBERT?

Another question: Do you know of any company other than OpenAI that provides "multilingual text embeddings as an API call"? :-) How do they compare?

Many thanks
Philip

nreimers · 2023-04-18T18:23:03Z

On MTEB you can find the performance for ada-2:
https://huggingface.co/spaces/mteb/leaderboard

They are good, but not excellent compared to alternative options.

OpenAI embeddings just work for English.

But Cohere provides an embedding model as an API call that works well across 100+ languages:
https://docs.cohere.ai/docs/multilingual-language-models

PhilipMay · 2023-04-18T18:51:25Z

@nreimers that is exactly what I wanted to know. Very good and many thanks!!

> OpenAI embeddings just work for English.

Very interesting info. Thanks. Is there a reference anywhere? Or do users just have to find that out on their own? :-)

nreimers · 2023-04-18T19:42:23Z

They used to have a section at:
https://platform.openai.com/docs/guides/embeddings/english-only

stating that the model is trained only on English data. In tests on other languages it performs comparably to BM25 on Wikipedia data, so not really great.

PhilipMay · 2023-04-18T19:57:58Z

And now they deleted the section? Doh!

How would you suggest testing ada-2?
Spearman correlation of cosine similarities on a translated STSb dataset?
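
The evaluation proposed here could be sketched roughly as follows: embed both sentences of each pair, take the cosine similarity, and compute the Spearman rank correlation between those similarities and the human gold scores. This is only a minimal standard-library sketch. The function names are hypothetical, and the vectors would in practice come from whichever embedding model is being tested; nothing here calls a real embedding API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def sts_spearman(embedding_pairs, gold_scores):
    """embedding_pairs: list of (vec1, vec2); gold_scores: human similarity ratings."""
    sims = [cosine_similarity(u, v) for u, v in embedding_pairs]
    return spearman(sims, gold_scores)
```

Because Spearman only looks at ranks, the absolute scale of the cosine similarities does not matter, only whether the model orders the sentence pairs the same way the annotators did.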

nreimers · 2023-04-18T21:53:43Z

Yes, sadly it was deleted sometime between March and today. It mainly said that the model was only trained on English data and that they don't expect it to work well on other languages.

We tested ada-002 on the MIRACL dataset for some languages, as we are primarily interested in search:
https://arxiv.org/abs/2210.09984

For English it was OK (it had issues connected to cosine similarity); for other languages performance was not really good (on par with or worse than BM25 from Elasticsearch).
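
A comparison like this is usually made by scoring each system's ranked results per query with a ranking metric and averaging over queries. The sketch below assumes nDCG@10 as that metric, which is an assumption since the thread does not say which metric was used. The function names, document IDs, and relevance judgements are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc IDs in the order a system returned them.
    qrels: dict mapping doc ID -> relevance grade (unlisted docs count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy judgements and two made-up rankings for a single query.
qrels = {"d1": 1, "d4": 1}
ranking_a = ["d1", "d4", "d2"]  # both relevant docs at the top
ranking_b = ["d2", "d1", "d4"]  # relevant docs pushed down by one position
print(ndcg_at_k(ranking_a, qrels), ndcg_at_k(ranking_b, qrels))
```

Running the same function over the embedding model's rankings and the BM25 rankings for every query, then averaging, gives two numbers that can be compared directly.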

gabriead · 2023-06-05T11:39:30Z

How does instructorXL compare to Sentence Transformers?

PhilipMay · 2023-06-05T12:21:55Z

> How does instructorXL compare to Sentence Transformers?

Good question!

bhavishpahwa · 2023-08-25T13:25:37Z

> They used to have a section at:
> https://platform.openai.com/docs/guides/embeddings/english-only
>
> stating that the model is trained only on English data. In tests on other languages it performs comparably to BM25 on Wikipedia data, so not really great.

Yep, you seem to be right @nreimers, the Web Archive snapshots prove it :)

https://web.archive.org/web/20221221004637/http://web.archive.org/screenshot/https://beta.openai.com/docs/guides/embeddings/english-only

Is this legally/ethically OK, by the way? They disclosed the limitation properly and then removed that part without changing anything about the embedding model. They also don't explicitly mention non-English or cross-lingual support.

bitnom mentioned this issue Apr 22, 2023

Input Length / Accuracy xlang-ai/instructor-embedding#29

Closed

PhilipMay closed this as completed Apr 26, 2023
