feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever (#3356)
Conversation
bogdankostic left a comment:
This already looks pretty good; I left a few comments with some possible improvements.
test/nodes/test_retriever.py (Outdated)

    if isinstance(document_store, WeaviateDocumentStore):
        # Weaviate sets the embedding dimension to 768 as soon as it is initialized.
        # We need 1024 here and therefore initialize a new WeaviateDocumentStore.
        document_store = WeaviateDocumentStore(index="haystack_test", embedding_dim=1024, recreate_index=True)
I think this is not needed, as we specify using only InMemoryDocumentStore in the test parameters.
    self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
    self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def ensure_texts_limit(self, text: str):
I know we are inside a private class here, but I'd still make this method private, as it's not supposed to be used outside of that class.
    tokenized_payload = self.tokenizer(text)
    return self.tokenizer.decode(tokenized_payload["input_ids"][: self.max_seq_len])

    def embed(self, model, text: str) -> np.ndarray:
We should add a type hint for the model argument.
    for doc in docs:
        embedding = self.embed(self.doc_model_encoder_engine, doc.content)
        embeddings.append(embedding)
According to the OpenAI documentation, we can get embeddings for multiple inputs in a single request. My guess is that this would be a bit more efficient than making one request per Document.
Also, we should probably take care of OpenAI's rate limit, given that we usually create embeddings for a large number of Documents. Timo worked on a solution for the OpenAIAnswerGenerator in #3078 (that PR unfortunately went stale). Other than that, you might also want to take a look at this notebook by OpenAI on best practices for rate limit handling.
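To make the suggestion concrete, here is a minimal sketch of the two ideas above: sending one request per batch of texts instead of per Document, and retrying with exponential backoff when the API rate-limits us. The helper names (`retry_with_backoff`, `embed_in_batches`, `request_fn`) are hypothetical, and `RuntimeError` stands in for whatever exception the real client raises on HTTP 429 — this is not the PR's actual implementation.

```python
import time


def retry_with_backoff(func, max_retries=5, base_delay=0.01):
    """Call `func`, retrying with exponential backoff when it raises.

    RuntimeError is a stand-in for the API's rate-limit error; a real
    integration would catch the specific exception raised on HTTP 429.
    """
    def wrapper(*args, **kwargs):
        delay = base_delay
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except RuntimeError:
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)
                delay *= 2  # exponential backoff
    return wrapper


def embed_in_batches(texts, batch_size, request_fn):
    """Send `texts` to the embeddings endpoint in batches, one request per batch."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        embeddings.extend(request_fn(batch))  # request_fn returns one vector per input
    return embeddings
```

A real `request_fn` would POST the whole batch as the `input` field of the embeddings request; the OpenAI API accepts a list of strings there, which is what makes the per-batch approach cheaper than per-Document calls.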
haystack/nodes/retriever/dense.py (Outdated)

    This approach is also used in the TableTextRetriever paper and is likely to improve
    performance if your titles contain meaningful information for retrieval
    (topic, entities etc.).
    :param api_key: The OpenAI API key
The docstring should explain that the OpenAI API key is only needed when using a model hosted by OpenAI, and it could also link to the OpenAI page where users can sign up for an API key.
    self.max_seq_len = retriever.max_seq_len
    self.url = "https://api.openai.com/v1/embeddings"
    self.api_key = retriever.api_key
    model_class: str = next(
Why not just model_class = retriever.embedding_model?
Yeah, good point, but I wanted to handle the case where users accidentally specify the full name of the model. Some might specify "ada", "babbage" etc. and some might specify the full name. This way we handle both use cases properly.
Makes sense. I'm just wondering: what if the user wants to use the text-similarity-ada-001 model, for example? In that case, we would silently use text-search-ada-doc-001 / text-search-ada-query-001 without the user knowing.
We should also probably adapt the docstring of the embedding_model param of the EmbeddingRetriever, what do you think?
@bogdankostic I thought about it too, but that should not happen, as the use case does not match. See https://beta.openai.com/docs/guides/embeddings/similarity-embeddings and https://beta.openai.com/docs/guides/embeddings/text-search-embeddings for the recommended use cases.
Our use case is definitely text search embeddings.
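The substring matching discussed in this thread could look like the sketch below. The helper name `infer_model_class` and the exact lookup are assumptions (the PR's real code only shows the `next(` call), but it illustrates both sides of the discussion: short names and full names are handled, while a text-similarity model name silently maps to its base class.

```python
def infer_model_class(embedding_model: str) -> str:
    """Map a short name ("ada") or a full model name
    ("text-search-ada-doc-001") to its base model class."""
    known_classes = ("ada", "babbage", "curie", "davinci")
    try:
        # next() picks the first known class occurring in the model string
        return next(c for c in known_classes if c in embedding_model)
    except StopIteration:
        raise ValueError(f"Unknown OpenAI embedding model: {embedding_model}")
```

Note that `infer_model_class("text-similarity-ada-001")` also returns "ada", which is exactly the silent-remapping concern raised above; raising on unrecognized names (rather than defaulting) at least catches fully unknown models.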
bogdankostic left a comment:
Almost good to go; I just proposed some minor improvements.
    batch_limited = []
    batch = text[i : i + self.batch_size]
    for content in batch:
        batch_limited.append(self._ensure_text_limit(content))
Let's use a list comprehension here:
Suggested change:

    batch = text[i : i + self.batch_size]
    batch_limited = [self._ensure_text_limit(content) for content in batch]
    self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
    self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def _ensure_text_limit(self, text: str):
Let's add the return type here.
haystack/nodes/retriever/dense.py (Outdated)

    :param api_key: The OpenAI API key. Required if one wants to use OpenAI embeddings. For more
        details see https://beta.openai.com/account/api-keys for more details
"for more details" is doubled here.
Related Issues
Proposed Changes:
Added OpenAIEmbeddingEncoder as a method to create document and query embeddings.
How did you test it?
Added a unit test; the OpenAI API key needs to be injected into the unit tests (as a secret).
Notes for the reviewer
Let me know if anything is unclear.
Checklist