Conversation
```python
"""
from langchain.embeddings.openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(
```
usually don't want to comment on names at this stage but... I'd propose using encoder instead of embedder. encoder is more commonly used.
I see we have encoder elsewhere... welp
> I see we have `encoder` elsewhere... welp
I hadn't thought about that while naming it, but I agree that it might be confused with `ElementEncoder` if we name it `encoder`.
This object manages the requests/responses to/from the OpenAI API. I wanted to name it `embedder` so users with less ML experience can also make sense of it (since it's used as a trendy word in wider circles).
how about we just call this `embeddings`, per the class name? That seems to be the name they give to the instance in their documentation. Or maybe we call this `embeddings_encoder`, which would be more accurate and distinct from the element encoder?
Changed the naming with 499795a, please let me know if we'd like further changes on the names
```python
logger.debug(f"Reading: {self} - PID: {os.getpid()}")

elements = self.get_elements_from_json()
embeddings = self.session_handle.service.embed_documents([str(e) for e in elements])
```
this can be a long-running process; is this function call gonna be async?
Per document, yes: each `OpenAIEmbeddingsDoc` instance will be handled separately when we integrate this into ingest. Each instance corresponds to a file.
However, this implementation would not support sub-document parallelisation. I agree that we might want to support it, considering that we might have long documents.
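If we do add sub-document parallelisation later, a rough sketch could look like the following. Note this is only an illustration: `fake_embed` is a stand-in for the real OpenAI call, and the batch size is arbitrary.

```python
import asyncio

def fake_embed(batch):
    # Stand-in for the real embed_documents API call;
    # returns one dummy vector per text.
    return [[float(len(text))] for text in batch]

async def embed_batch(batch):
    # Run the blocking embed call in a worker thread so batches overlap.
    return await asyncio.to_thread(fake_embed, batch)

async def embed_document(texts, batch_size=2):
    # Split one document's elements into batches and embed them concurrently.
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    # gather preserves order, so flattening restores element order.
    return [vec for batch in results for vec in batch]

vectors = asyncio.run(embed_document(["a", "bb", "ccc", "dddd", "eeeee"]))
```

Since the bottleneck is the network round-trip rather than CPU, thread-backed concurrency like this should be enough per document.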
```python
with open(self._output_filename, "w", encoding="utf8") as output_f:
    json.dump(self.elements, output_f, ensure_ascii=False, indent=2, cls=ElementEncoder)
```
we are probably just prototyping here, so writing to json is fine, but depending on the embedding model this can blow up the json file size very quickly. e.g., the most commonly used ones have 768 dimensions, so that is 768 floating-point numbers per embedding.
speaking of which, we should probably record encoder metadata somewhere in the process/output as well, like the vector dimension and whether it is a unit vector (two attributes relevant to information retrieval tasks -> computing distance)
Agreed on both points. I'll implement the metadata properties soon 👍
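A minimal sketch of what those metadata properties might look like; `EmbeddingMetadata` and the unit-vector tolerance are placeholders of mine, not the actual implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class EmbeddingMetadata:
    # The two attributes relevant to distance computation:
    # vector dimension and whether the vector has unit L2 norm.
    dimension: int
    is_unit_vector: bool

def describe_embedding(vector, tol=1e-6):
    norm = math.sqrt(sum(x * x for x in vector))
    return EmbeddingMetadata(
        dimension=len(vector),
        is_unit_vector=abs(norm - 1.0) < tol,
    )

meta = describe_embedding([0.6, 0.8])
```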
On saving the embeddings to the disk, what are some good alternatives?
What I think is:
- If we'd like to isolate the task of encoding from the task of writing to a DB, and want an intermediate place to store the embeddings:
- we can save embeddings to a separate file, again in a basic format. these files can be processed later on to be written into a database
- we can use a message queue so the embeddings later get consumed by a job and be written into the database
- Or, we could directly write the embeddings into a database
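For the "separate file in a basic format" option, a sketch could be as follows; the sidecar filename scheme and helper name are just an illustration, not a proposal for the actual API:

```python
import json
import os
import tempfile

def write_embeddings_sidecar(output_filename, element_ids, embeddings):
    # Store embeddings next to the elements file so a later job can
    # pick them up and index them into a database.
    sidecar = output_filename + ".embeddings.json"
    with open(sidecar, "w", encoding="utf8") as f:
        json.dump({"ids": element_ids, "embeddings": embeddings}, f)
    return sidecar

tmp_dir = tempfile.mkdtemp()
path = write_embeddings_sidecar(
    os.path.join(tmp_dir, "doc.json"), ["e1"], [[0.1, 0.2]]
)
with open(path, encoding="utf8") as f:
    loaded = json.load(f)
```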
do we actually need to write this to file? or can we just rely on returning the in-memory elements with embeddings?
Depends on how the downstream jobs want to obtain and use these embeddings. If we'd like to embed large corpora, we'd want to index the embeddings in a DB. To do that, the job that writes the embeddings into the DB can read them from memory, from a message queue, or from disk.
If adequate, we can return the embeddings in-memory for now, and open another ticket to bring in persistence.
> Depends on how the downstream jobs want to obtain and use these embeddings

yes, definitely. let's start with the requirements for ingest, what makes most sense there, and how this interfaces. CC @rbiseck3
aside: let's avoid a message queue
Update: For the current ticket, we've decided to make this work fully in memory. Inputs are read from memory, results are passed on via memory. 499795a
```python
def get_embed_docs(self):
    """Reads all result files to embed them."""
    return [
        OpenAIEmbeddingsDoc(
            self.config,
            filename,
        )
        for filename in self.config.list_of_elements_json_paths.split()
    ]
```
probably for v0 this just takes a single result (a list of Elements) or a json file and returns the list of Elements with embeddings? i.e., not sure we need to support iterating over a list of docs (at least for v0)
though maybe we need to look at how we want this to slot into ingest here, from a requirements perspective.
Re-implemented as taking a single list of elements and returning the updated list with embeddings in 499795a
For fitting into ingest with a single doc, I think instantiating the class and calling `self.embedding_encoder.embed(doc)` should be possible within these lines:
unstructured/unstructured/ingest/processor.py
Lines 96 to 98 in 2b571eb
However, to process multiple docs, we need an async / parallelised solution. I wonder whether `pool.map` would be possible, like with partition:
unstructured/unstructured/ingest/processor.py
Lines 71 to 87 in 2b571eb
A note: since these are API calls and not actually compute-intensive operations, our users will ultimately be bound by the model provider's (in the current case OpenAI's) API. Paid users get good rate limits, so if OpenAI doesn't have hidden limits behind the scenes (i.e., if they allocate more resources to a user when that user's load is high), all types of parallelisation should benefit our users' runtime.
edit: addressed with 04468f0
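A `pool.map`-style approach could be sketched like this. `embed_one_doc` here is a stand-in for the real per-document embedding call, and threads are used rather than processes since the work is network-bound:

```python
from multiprocessing.pool import ThreadPool

def embed_one_doc(doc_texts):
    # Stand-in worker: the real version would call the OpenAI API
    # for one document's elements.
    return [[float(len(t))] for t in doc_texts]

docs = [["a", "bb"], ["ccc"], ["dddd", "e"]]

# map preserves document order regardless of completion order.
with ThreadPool(processes=3) as pool:
    per_doc_vectors = pool.map(embed_one_doc, docs)
```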
```python
def initialize(self):
    self.openai_client = self.get_openai_client()

def embed(self, elements: Optional[List[Element]]) -> List[Element]:
```
As long as we have this method with the `Optional[List[Element]] -> List[Element]` signature to work with, we should be fine in Enterprise to consume this.
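For reference, consuming that contract could look roughly like this; `FakeEncoder` and `FakeElement` are stand-ins of mine that only mimic the `embed(Optional[List[Element]]) -> List[Element]` interface, not the real classes:

```python
from typing import List, Optional

class FakeElement:
    # Mimics an Element: stringifiable, carries an embeddings property.
    def __init__(self, text):
        self.text = text
        self.embeddings = None

    def __str__(self):
        return self.text

class FakeEncoder:
    def embed(self, elements: Optional[List[FakeElement]]) -> List[FakeElement]:
        # Accepts None per the Optional signature; returns the same
        # elements with the embeddings property filled in.
        elements = elements or []
        for e in elements:
            e.embeddings = [float(len(str(e)))]
        return elements

encoder = FakeEncoder()
out = encoder.embed([FakeElement("hello")])
```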
Ready for review, assuming parallelisation with this design is doable/easy in ingest, just like it is in enterprise
* **Fixes a chunking issue via dropping the field "coordinates".** Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into a smaller number of chunks. We discovered that this happens due to metadata-matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk. At the same time, any element with "coordinates" essentially had different metadata from other elements, due to each element being located in a different place and having different coordinates. Fix: that is why we included the key "coordinates" in a list of excluded metadata keys while doing this "metadata_matches" comparison. Importance: this change is crucial for being able to chunk by title for documents whose elements include "coordinates" metadata.
* **Fixes a chunking issue via dropping the field "coordinates".**
  * Problem: the chunk_by_title function was chunking each element into its own individual chunk when it needed to group elements into a smaller number of chunks. We discovered that this happens due to metadata-matching logic in the chunk_by_title function: elements with different metadata can't be put into the same chunk. At the same time, any element with "coordinates" essentially had different metadata from other elements, due to each element being located in a different place and having different coordinates.
  * Fix: that is why we included the key "coordinates" in a list of excluded metadata keys while doing this "metadata_matches" comparison.
  * Importance: this change is crucial for being able to chunk by title for documents whose elements include "coordinates" metadata.
intentionally updating this older note? ...guessing this will be resolved when you update to latest, but look out for it.
Yes, this was to modify the changelog item into the multiple paragraph format
Now I've re-modified all changelog items into single paragraph format so this change no longer exists 👍
* **Adds the embedding module to be able to embed Elements**
  * Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library.
  * Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings.
  * Importance: The ability to embed documents, or parts of documents, will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval-augmented generation.
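To illustrate the query-embedding capability mentioned above: a stub stands in for the real provider call, and the method name `embed_query` follows langchain's convention, which the module wraps:

```python
class StubQueryEncoder:
    # Stand-in showing the "embed a query string" capability;
    # a real encoder would call the provider's embeddings API.
    def embed_query(self, query: str):
        # Dummy vector derived from the first three characters.
        return [float(ord(c)) for c in query[:3]]

query_vector = StubQueryEncoder().embed_query("cat")
```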
appreciate adding the importance here. probably should follow the single paragraph format of all of the other recent changelog items
Re-modified all changelog items into single paragraph format now ✅
Closes #1319, closes #1372
This module:
To test the changes, run `examples/embed/example.py`