feat: document embeddings#1368

Merged
ahmetmeleq merged 25 commits into main from ahmet/doc-embeddings
Sep 20, 2023

Conversation

@ahmetmeleq
Contributor

@ahmetmeleq commented Sep 11, 2023

Closes #1319, closes #1372

This module:

  • implements EmbeddingEncoder classes, which track embedding-related data
  • implements an embed_documents method, which receives a list of Elements, obtains embeddings for the text within those Elements, updates each Element with an attribute named embeddings, and returns the updated Elements
  • uses langchain to obtain the embeddings

  • The PR additionally fixes a JSON deserialization issue on the metadata fields.

To test the changes, run examples/embed/example.py
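The described flow can be sketched with stand-in types. SimpleElement, embed_documents, and fake_embed below are illustrative names for this sketch, not the actual unstructured or langchain API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SimpleElement:
    """Stand-in for unstructured's Element: text plus an embeddings slot."""
    text: str
    embeddings: List[float] = field(default_factory=list)

def embed_documents(
    elements: List[SimpleElement],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[SimpleElement]:
    """Obtain one embedding per element's text and attach it in place."""
    vectors = embed_fn([e.text for e in elements])
    for element, vector in zip(elements, vectors):
        element.embeddings = vector
    return elements

# Deterministic stand-in for a real embedding backend (e.g. langchain's
# OpenAIEmbeddings), so the sketch runs without API keys.
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in texts]

elements = embed_documents([SimpleElement("hello world")], fake_embed)
print(elements[0].embeddings)  # [11.0, 1.0]
```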

Comment thread unstructured/embed/embedder/example.py Outdated
Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment thread unstructured/embed/embedder/open_ai.py Outdated
"""
from langchain.embeddings.openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(
Collaborator

usually don't want to comment on names at this stage but... I'd propose using encoder instead of embedder. encoder is more commonly used.

Collaborator

I see we have encoder elsewhere... welp

Contributor Author
@ahmetmeleq Sep 12, 2023

> I see we have encoder elsewhere... welp

I hadn't thought about that while naming it, but I agree that it might be confused with ElementEncoder if we name it encoder.

This object manages the requests/responses to/from the OpenAI API - I wanted to name it embedder so that users with less ML experience can also make sense of it, since it's a trendy word in wider circles.

Contributor
@ryannikolaidis Sep 13, 2023

how about we just call this embeddings, per the class name? That seems to be the name they give the instance in their documentation. Or maybe we call this an embeddings_encoder, which would be more accurate and distinct from the element encoder?

Contributor Author

Changed the naming in 499795a; please let me know if we'd like further changes to the names

Comment thread unstructured/embed/embedder/open_ai.py Outdated
logger.debug(f"Reading: {self} - PID: {os.getpid()}")

elements = self.get_elements_from_json()
embeddings = self.session_handle.service.embed_documents([str(e) for e in elements])
Collaborator

this can be a long-running process; is this function call going to be async?

Contributor Author

Per document, yes: each OpenAIEmbeddingsDoc instance will be handled separately when we integrate this into ingest. Each instance corresponds to a file.

However, this implementation does not support sub-document parallelisation. I agree that we might want to support it, considering that we might have long documents.
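Sub-document parallelisation could look something like the following sketch (embed_in_batches and fake_embed are hypothetical names, not part of the PR): since the work is API-bound, element texts can be split into batches and the batches embedded concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
    max_workers: int = 4,
) -> List[List[float]]:
    """Embed batches of texts concurrently, preserving input order."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(embed_fn, batches)  # map preserves batch order
    return [vector for batch in results for vector in batch]

# Stand-in embedding function so the sketch is runnable offline.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vectors = embed_in_batches([f"text {i}" for i in range(250)], fake_embed, batch_size=100)
print(len(vectors))  # 250
```

A thread pool (rather than a process pool) fits here because the per-batch work is an I/O-bound API call.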

Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment on lines +126 to +127
with open(self._output_filename, "w", encoding="utf8") as output_f:
json.dump(self.elements, output_f, ensure_ascii=False, indent=2, cls=ElementEncoder)
Collaborator

we are probably just prototyping here, so writing to JSON is fine, but depending on the embedding model this can blow up the JSON file size very quickly. e.g., the most commonly used models have 768 dimensions, so that is 768 floating-point numbers per embedding.
speaking of which, we should probably record encoder metadata somewhere in the process/output as well, like the vector dimension and whether it is a unit vector (two attributes relevant to information retrieval tasks -> computing distance)

Contributor Author

Agreed on both points. I'll implement the metadata properties soon 👍
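The metadata properties discussed above (vector dimension, unit-vector flag) can be derived from a sample embedding; a minimal sketch, where embedding_metadata is a hypothetical helper name:

```python
import math
from typing import List

def embedding_metadata(sample: List[float], tol: float = 1e-6) -> dict:
    """Record the vector dimension and whether the vector is unit-norm,
    two attributes that matter for downstream distance computations
    (unit vectors make cosine similarity a plain dot product)."""
    norm = math.sqrt(sum(x * x for x in sample))
    return {
        "embedding_dimension": len(sample),
        "is_unit_vector": abs(norm - 1.0) <= tol,
    }

print(embedding_metadata([0.6, 0.8]))
# {'embedding_dimension': 2, 'is_unit_vector': True}
```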

Contributor Author

On saving the embeddings to disk, what are some good alternatives?

What I think is:

  • If we'd like to isolate the task of encoding from the task of writing to a DB, and want an intermediate place to store the embeddings:
    • we can save embeddings to a separate file, again in a basic format; these files can be processed later on to be written into a database
    • we can use a message queue, so the embeddings later get consumed by a job and written into the database
  • Or, we could directly write the embeddings into a database

Contributor

do we actually need to write this to file? or can we just rely on returning the in-memory elements with embeddings?

Contributor Author

Depends on how the downstream jobs want to obtain and use these embeddings. If we'd like to embed large corpora, we'd want to index the embeddings in a DB. To do that, the job that writes the embeddings into the DB can read them from memory, from a message queue, or from disk.

If adequate, we can return the embeddings in memory for now, and open another ticket to bring in persistence.

Contributor
@ryannikolaidis Sep 13, 2023

> Depends on how the downstream jobs want to obtain and use these embeddings

yes, definitely. let's start with the requirements for ingest, what makes the most sense there, and how this interfaces. CC @rbiseck3

aside: let's avoid a message queue

Contributor Author

Update: For the current ticket, we've decided to make this work fully in memory. Inputs are read from memory, and results are passed on in memory. 499795a

Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment thread unstructured/embed/embedder/open_ai.py Outdated
Comment on lines +147 to +155
def get_embed_docs(self):
    """Reads all result files to embed them."""
    return [
        OpenAIEmbeddingsDoc(
            self.config,
            filename,
        )
        for filename in self.config.list_of_elements_json_paths.split()
    ]
Contributor

probably for v0 this just takes a single result (a list of Elements) or a JSON file and returns the list of Elements with embeddings? i.e. not sure we need to support iterating over a list of docs (at least for v0)

Contributor

though maybe we need to look at how we want this to slot into ingest here, from a requirements perspective.

Contributor Author

Re-implemented as taking a single list of elements and returning the updated list with embeddings in 499795a

Contributor Author

For fitting into ingest, and a single doc, I think instantiating the class and putting it as self.embedding_encoder.embed(doc) should be possible within those lines:

self.run_partition(docs=docs)
if self.dest_doc_connector:
    self.dest_doc_connector.write(docs=docs)

However, to process multiple docs, we need an async / parallelised solution. Would pool.map be possible like with partition, I wonder:

def run_partition(self, docs):
    if not self.reprocess:
        docs = self._filter_docs_with_outputs(docs)
        if not docs:
            return
    # Debugging tip: use the below line and comment out the mp.Pool loop
    # block to remain in single process
    # self.doc_processor_fn(docs[0])
    logger.info(f"Processing {len(docs)} docs")
    json_docs = [doc.to_json() for doc in docs]
    with mp.Pool(
        processes=self.num_processes,
        initializer=ingest_log_streaming_init,
        initargs=(logging.DEBUG if self.verbose else logging.INFO,),
    ) as pool:
        pool.map(self.doc_processor_fn, json_docs)

A note: since these are API calls and not actually compute-intensive operations, our users will ultimately be bound by the model provider's API (in the current case, OpenAI's). Paid users get good rate limits, so if OpenAI doesn't have hidden limits behind the scenes (i.e. if they allocate more resources to a user when that user's load is high), all types of parallelisation should benefit our users' runtime.
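Because the workload is I/O-bound rather than CPU-bound, the same pool.map pattern could also run on threads, avoiding the pickling overhead of mp.Pool; a sketch where process_doc is a stand-in for the actual doc_processor_fn:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def process_doc(json_doc: str) -> str:
    """Stand-in for doc_processor_fn: parse the doc and report on it.
    A real worker would partition and/or embed the document here."""
    doc = json.loads(json_doc)
    return f"processed {doc['name']}"

json_docs = [json.dumps({"name": f"doc-{i}"}) for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_doc, json_docs))
print(results)  # ['processed doc-0', 'processed doc-1', 'processed doc-2']
```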

@ahmetmeleq
Contributor Author

ahmetmeleq commented Sep 13, 2023

todo: add metadata properties for the embedding model

edit: addressed with 04468f0

Comment thread unstructured/embed/interfaces.py Outdated
Comment thread unstructured/embed/interfaces.py
Comment thread unstructured/embed/openai.py Outdated
def initialize(self):
    self.openai_client = self.get_openai_client()

def embed(self, elements: Optional[List[Element]]) -> List[Element]:

As long as we have this method with the Optional[List[Element]]) -> List[Element] signature to work with, we should be fine in Enterprise to consume this.
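That signature can be pinned down as an abstract interface; a sketch with stand-in names (the PR's real interface lives in unstructured/embed/interfaces.py, and LengthEncoder below is a toy encoder, not a real one):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Element:
    """Stand-in element carrying text and an embeddings attribute."""
    text: str
    embeddings: List[float] = field(default_factory=list)

class BaseEmbeddingEncoder(ABC):
    @abstractmethod
    def embed(self, elements: Optional[List[Element]]) -> List[Element]:
        """Return the elements with their embeddings attribute populated."""

class LengthEncoder(BaseEmbeddingEncoder):
    """Toy encoder: embeds each element as its text length."""
    def embed(self, elements: Optional[List[Element]]) -> List[Element]:
        elements = elements or []  # Optional input: treat None as empty
        for e in elements:
            e.embeddings = [float(len(e.text))]
        return elements

out = LengthEncoder().embed([Element("abc")])
print(out[0].embeddings)  # [3.0]
```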

@ahmetmeleq
Contributor Author

Ready for review, assuming parallelisation with this design is doable/easy in ingest, just like it is in Enterprise

@ahmetmeleq ahmetmeleq marked this pull request as ready for review September 14, 2023 23:04
Comment thread unstructured/embed/openai.py
@ahmetmeleq ahmetmeleq enabled auto-merge (squash) September 15, 2023 15:22
Comment thread CHANGELOG.md Outdated
Comment on lines +70 to +80
* **Fixes a chunking issue via dropping the field "coordinates".** Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into a smaller number of chunks. We discovered that this happens due to metadata-matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk, and any element with "coordinates" effectively had different metadata from the other elements, since each element is located in a different place and has different coordinates. Fix: That is why we have included the key "coordinates" in the list of excluded metadata keys used in the "metadata_matches" comparison. Importance: This change is crucial for being able to chunk by title for documents whose elements include "coordinates" metadata.
* **Fixes a chunking issue via dropping the field "coordinates".**
  * Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into a smaller number of chunks. We discovered that this happens due to metadata-matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk, and any element with "coordinates" effectively had different metadata from the other elements, since each element is located in a different place and has different coordinates.
  * Fix: That is why we have included the key "coordinates" in the list of excluded metadata keys used in the "metadata_matches" comparison.
  * Importance: This change is crucial for being able to chunk by title for documents whose elements include "coordinates" metadata.
Contributor

intentionally updating this older note? ...guessing this will be resolved when you update to latest, but look out for it.

Contributor Author

Yes, this was to modify the changelog item into the multiple paragraph format

Now I've re-modified all changelog items into single paragraph format so this change no longer exists 👍
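The coordinates fix described in this thread boils down to ignoring certain keys when comparing element metadata; a sketch of that comparison, where metadata_matches is a stand-in for the library's actual logic:

```python
from typing import Iterable

def metadata_matches(
    a: dict, b: dict, excluded_keys: Iterable[str] = ("coordinates",)
) -> bool:
    """Two elements can share a chunk only if their metadata agree on
    every key except the excluded ones (e.g. per-element coordinates,
    which differ for every element by construction)."""
    strip = lambda m: {k: v for k, v in m.items() if k not in excluded_keys}
    return strip(a) == strip(b)

a = {"filename": "report.pdf", "coordinates": (1, 2)}
b = {"filename": "report.pdf", "coordinates": (9, 9)}
print(metadata_matches(a, b))  # True
```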

Comment thread CHANGELOG.md Outdated
Comment on lines +10 to +14
* **Adds the embedding module to be able to embed Elements**
* Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library.
* Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings.
* Importance: Ability to embed documents, or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval augmented generation.

Contributor

appreciate adding the importance here. probably should follow the single-paragraph format of all of the other recent changelog items

Contributor Author

Re-modified all changelog items into single paragraph format now ✅

@ahmetmeleq ahmetmeleq merged commit 9e88929 into main Sep 20, 2023
@ahmetmeleq ahmetmeleq deleted the ahmet/doc-embeddings branch September 20, 2023 19:55


Development

Successfully merging this pull request may close these issues.

  • bug: metadata data source deserialization
  • feat: document embeddings

5 participants