feat: document embeddings#1368

Merged
ahmetmeleq merged 25 commits into main from ahmet/doc-embeddings
Sep 20, 2023

Conversation

@ahmetmeleq
Contributor

@ahmetmeleq commented Sep 11, 2023

Closes #1319, closes #1372

This module:

  • implements EmbeddingEncoder classes, which track embedding-related data
  • implements an embed_documents method, which receives a list of Elements, obtains embeddings for the text within those Elements, updates each Element with an attribute named embeddings, and returns the updated Elements
  • uses langchain to obtain the embeddings

  • The PR additionally fixes a JSON deserialization issue on the metadata fields.

To test the changes, run examples/embed/example.py
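The described flow can be sketched with stand-in types. SimpleElement, embed_documents, and fake_embed below are illustrative names for this sketch, not the actual unstructured or langchain API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SimpleElement:
    """Stand-in for unstructured's Element: text plus an embeddings slot."""
    text: str
    embeddings: List[float] = field(default_factory=list)

def embed_documents(
    elements: List[SimpleElement],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[SimpleElement]:
    """Obtain one embedding per element's text and attach it in place."""
    vectors = embed_fn([e.text for e in elements])
    for element, vector in zip(elements, vectors):
        element.embeddings = vector
    return elements

# Deterministic stand-in for a real embedding backend (e.g. langchain's
# OpenAIEmbeddings), so the sketch runs without API keys.
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in texts]

elements = embed_documents([SimpleElement("hello world")], fake_embed)
print(elements[0].embeddings)  # [11.0, 1.0]
```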

Comment thread unstructured/embed/embedder/example.py Outdated
Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment thread unstructured/embed/embedder/open_ai.py Outdated
"""
from langchain.embeddings.openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(
Collaborator

usually don't want to comment on names at this stage but... I'd propose using encoder instead of embedder. encoder is more commonly used.

Collaborator

I see we have encoder elsewhere... welp

Contributor Author
@ahmetmeleq Sep 12, 2023

> I see we have encoder elsewhere... welp

I hadn't thought about that while naming it, but I agree that it might be confused with ElementEncoder if we name it encoder.

This object manages the requests/responses to/from the OpenAI API - I wanted to name it embedder so that users with less ML experience can also make sense of it, since it's a trendy word in wider circles.

Contributor
@ryannikolaidis Sep 13, 2023

how about we just call this embeddings, per the class name? That seems to be the name they give the instance in their documentation. Or maybe we call this an embeddings_encoder, which would be more accurate and distinct from the element encoder?

Contributor Author

Changed the naming in 499795a; please let me know if we'd like further changes to the names

Comment thread unstructured/embed/embedder/open_ai.py Outdated
logger.debug(f"Reading: {self} - PID: {os.getpid()}")

elements = self.get_elements_from_json()
embeddings = self.session_handle.service.embed_documents([str(e) for e in elements])
Collaborator

this can be a long-running process; is this function call going to be async?

Contributor Author

Per document, yes: each OpenAIEmbeddingsDoc instance will be handled separately when we integrate this into ingest. Each instance corresponds to a file.

However, this implementation does not support sub-document parallelisation. I agree that we might want to support it, considering that we might have long documents.
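Sub-document parallelisation could look something like the following sketch (embed_in_batches and fake_embed are hypothetical names, not part of the PR): since the work is API-bound, element texts can be split into batches and the batches embedded concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
    max_workers: int = 4,
) -> List[List[float]]:
    """Embed batches of texts concurrently, preserving input order."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(embed_fn, batches)  # map preserves batch order
    return [vector for batch in results for vector in batch]

# Stand-in embedding function so the sketch is runnable offline.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vectors = embed_in_batches([f"text {i}" for i in range(250)], fake_embed, batch_size=100)
print(len(vectors))  # 250
```

A thread pool (rather than a process pool) fits here because the per-batch work is an I/O-bound API call.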

Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment on lines +126 to +127
with open(self._output_filename, "w", encoding="utf8") as output_f:
json.dump(self.elements, output_f, ensure_ascii=False, indent=2, cls=ElementEncoder)
Collaborator

we are probably just prototyping here, so writing to JSON is fine, but depending on the embedding model this can blow up the JSON file size very quickly. e.g., the most commonly used models have 768 dimensions, so that is 768 floating-point numbers per embedding.
speaking of which, we should probably record encoder metadata somewhere in the process/output as well, like the vector dimension and whether it is a unit vector (two attributes relevant to information retrieval tasks -> computing distance)

Contributor Author

Agreed on both points. I'll implement the metadata properties soon 👍
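The metadata properties discussed above (vector dimension, unit-vector flag) can be derived from a sample embedding; a minimal sketch, where embedding_metadata is a hypothetical helper name:

```python
import math
from typing import List

def embedding_metadata(sample: List[float], tol: float = 1e-6) -> dict:
    """Record the vector dimension and whether the vector is unit-norm,
    two attributes that matter for downstream distance computations
    (unit vectors make cosine similarity a plain dot product)."""
    norm = math.sqrt(sum(x * x for x in sample))
    return {
        "embedding_dimension": len(sample),
        "is_unit_vector": abs(norm - 1.0) <= tol,
    }

print(embedding_metadata([0.6, 0.8]))
# {'embedding_dimension': 2, 'is_unit_vector': True}
```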

Contributor Author

On saving the embeddings to disk, what are some good alternatives?

What I think is:

  • If we'd like to isolate the task of encoding from the task of writing to a DB, and want an intermediate place to store the embeddings:
    • we can save embeddings to a separate file, again in a basic format; these files can be processed later on to be written into a database
    • we can use a message queue, so the embeddings later get consumed by a job and written into the database
  • Or, we could directly write the embeddings into a database

Contributor

do we actually need to write this to file? or can we just rely on returning the in-memory elements with embeddings?

Contributor Author

Depends on how the downstream jobs want to obtain and use these embeddings. If we'd like to embed large corpora, we'd want to index the embeddings in a DB. To do that, the job that writes the embeddings into the DB can read them from memory, from a message queue, or from disk.

If adequate, we can return the embeddings in memory for now, and open another ticket to bring in persistence.

Contributor
@ryannikolaidis Sep 13, 2023

> Depends on how the downstream jobs want to obtain and use these embeddings

yes, definitely. let's start with the requirements for ingest, what makes the most sense there, and how this interfaces. CC @rbiseck3

aside: let's avoid a message queue

Contributor Author

Update: For the current ticket, we've decided to make this work fully in memory. Inputs are read from memory, and results are passed on in memory. 499795a

Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment on lines +147 to +155
def get_embed_docs(self):
    """Reads all result files to embed them."""
    return [
        OpenAIEmbeddingsDoc(
            self.config,
            filename,
        )
        for filename in self.config.list_of_elements_json_paths.split()
    ]
Contributor

probably for v0 this just takes a single result (a list of Elements) or a JSON file and returns the list of Elements with embeddings? i.e. not sure we need to support iterating over a list of docs (at least for v0)

Contributor

though maybe we need to look at how we want this to slot into ingest here, from a requirements perspective.

Contributor Author

Re-implemented as taking a single list of elements and returning the updated list with embeddings in 499795a

Contributor Author

For fitting into ingest, and a single doc, I think instantiating the class and putting it as self.embedding_encoder.embed(doc) should be possible within those lines:

self.run_partition(docs=docs)
if self.dest_doc_connector:
    self.dest_doc_connector.write(docs=docs)

However, to process multiple docs, we need an async / parallelised solution. Would pool.map be possible like with partition, I wonder:

def run_partition(self, docs):
    if not self.reprocess:
        docs = self._filter_docs_with_outputs(docs)
        if not docs:
            return
    # Debugging tip: use the below line and comment out the mp.Pool loop
    # block to remain in single process
    # self.doc_processor_fn(docs[0])
    logger.info(f"Processing {len(docs)} docs")
    json_docs = [doc.to_json() for doc in docs]
    with mp.Pool(
        processes=self.num_processes,
        initializer=ingest_log_streaming_init,
        initargs=(logging.DEBUG if self.verbose else logging.INFO,),
    ) as pool:
        pool.map(self.doc_processor_fn, json_docs)

A note: since these are API calls and not actually compute-intensive operations, our users will ultimately be bound by the model provider's API (in the current case, OpenAI's). Paid users get good rate limits, so if OpenAI doesn't have hidden limits behind the scenes (i.e. if they allocate more resources to a user when that user's load is high), all types of parallelisation should benefit our users' runtime.
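Because the workload is I/O-bound rather than CPU-bound, the same pool.map pattern could also run on threads, avoiding the pickling overhead of mp.Pool; a sketch where process_doc is a stand-in for the actual doc_processor_fn:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def process_doc(json_doc: str) -> str:
    """Stand-in for doc_processor_fn: parse the doc and report on it.
    A real worker would partition and/or embed the document here."""
    doc = json.loads(json_doc)
    return f"processed {doc['name']}"

json_docs = [json.dumps({"name": f"doc-{i}"}) for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_doc, json_docs))
print(results)  # ['processed doc-0', 'processed doc-1', 'processed doc-2']
```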

@ahmetmeleq
Contributor Author

ahmetmeleq commented Sep 13, 2023

todo: add metadata properties for the embedding model

edit: addressed with 04468f0

Comment thread unstructured/embed/interfaces.py Outdated
Comment thread unstructured/embed/interfaces.py
Comment thread unstructured/embed/openai.py Outdated
def initialize(self):
    self.openai_client = self.get_openai_client()

def embed(self, elements: Optional[List[Element]]) -> List[Element]:

As long as we have this method with the Optional[List[Element]]) -> List[Element] signature to work with, we should be fine in Enterprise to consume this.
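That signature can be pinned down as an abstract interface; a sketch with stand-in names (the PR's real interface lives in unstructured/embed/interfaces.py, and LengthEncoder below is a toy encoder, not a real one):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Element:
    """Stand-in element carrying text and an embeddings attribute."""
    text: str
    embeddings: List[float] = field(default_factory=list)

class BaseEmbeddingEncoder(ABC):
    @abstractmethod
    def embed(self, elements: Optional[List[Element]]) -> List[Element]:
        """Return the elements with their embeddings attribute populated."""

class LengthEncoder(BaseEmbeddingEncoder):
    """Toy encoder: embeds each element as its text length."""
    def embed(self, elements: Optional[List[Element]]) -> List[Element]:
        elements = elements or []  # Optional input: treat None as empty
        for e in elements:
            e.embeddings = [float(len(e.text))]
        return elements

out = LengthEncoder().embed([Element("abc")])
print(out[0].embeddings)  # [3.0]
```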

@ahmetmeleq
Contributor Author

Ready for review, assuming parallelisation with this design is doable/easy in ingest, just like it is in Enterprise

@ahmetmeleq ahmetmeleq marked this pull request as ready for review September 14, 2023 23:04
Comment thread unstructured/embed/openai.py
@ahmetmeleq ahmetmeleq enabled auto-merge (squash) September 15, 2023 15:22
Comment thread CHANGELOG.md Outdated
Comment on lines +70 to +80
* **Fixes a chunking issue via dropping the field "coordinates".** Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into a smaller number of chunks. We discovered that this happens due to metadata-matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk, and any element with "coordinates" effectively had different metadata from the other elements, since each element is located in a different place and has different coordinates. Fix: That is why we have included the key "coordinates" in the list of excluded metadata keys used in the "metadata_matches" comparison. Importance: This change is crucial for being able to chunk by title for documents whose elements include "coordinates" metadata.
* **Fixes a chunking issue via dropping the field "coordinates".**
  * Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into a smaller number of chunks. We discovered that this happens due to metadata-matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk, and any element with "coordinates" effectively had different metadata from the other elements, since each element is located in a different place and has different coordinates.
  * Fix: That is why we have included the key "coordinates" in the list of excluded metadata keys used in the "metadata_matches" comparison.
  * Importance: This change is crucial for being able to chunk by title for documents whose elements include "coordinates" metadata.
Contributor

intentionally updating this older note? ...guessing this will be resolved when you update to latest, but look out for it.

Contributor Author

Yes, this was to modify the changelog item into the multiple paragraph format

Now I've re-modified all changelog items into single paragraph format so this change no longer exists 👍
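The coordinates fix described in this thread boils down to ignoring certain keys when comparing element metadata; a sketch of that comparison, where metadata_matches is a stand-in for the library's actual logic:

```python
from typing import Iterable

def metadata_matches(
    a: dict, b: dict, excluded_keys: Iterable[str] = ("coordinates",)
) -> bool:
    """Two elements can share a chunk only if their metadata agree on
    every key except the excluded ones (e.g. per-element coordinates,
    which differ for every element by construction)."""
    strip = lambda m: {k: v for k, v in m.items() if k not in excluded_keys}
    return strip(a) == strip(b)

a = {"filename": "report.pdf", "coordinates": (1, 2)}
b = {"filename": "report.pdf", "coordinates": (9, 9)}
print(metadata_matches(a, b))  # True
```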

Comment thread CHANGELOG.md Outdated
Comment on lines +10 to +14
* **Adds the embedding module to be able to embed Elements**
* Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library.
* Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings.
* Importance: Ability to embed documents, or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval augmented generation.

Contributor

appreciate adding the importance here. probably should follow the single-paragraph format of all of the other recent changelog items

Contributor Author

Re-modified all changelog items into single paragraph format now ✅

@ahmetmeleq ahmetmeleq merged commit 9e88929 into main Sep 20, 2023
@ahmetmeleq ahmetmeleq deleted the ahmet/doc-embeddings branch September 20, 2023 19:55


Development

Successfully merging this pull request may close these issues.

  • bug: metadata data source deserialization
  • feat: document embeddings

5 participants