2 changes: 1 addition & 1 deletion api-reference/how-to/embedding.mdx
@@ -68,7 +68,7 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo
- `langchain-huggingface`. [Choose a model](https://huggingface.co/models?other=embeddings), or use the default model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
- `langchain-openai`. [Choose a model](https://platform.openai.com/docs/guides/embeddings/embedding-models), or use the default model `text-embedding-ada-002`.
- `langchain-vertexai`. [Choose a model](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings-api), or use the default model `textembedding-gecko@001`.
- `langchain-voyageai`. [Choose a model](https://docs.voyageai.com/docs/embeddings). No default model is provided.
- `langchain-voyageai`. [Choose a model](https://docs.voyageai.com/docs/embeddings). No default model is provided. Recommended models include `voyage-3.5` (latest general-purpose), `voyage-3.5-lite` (lightweight with higher token limits), `voyage-3-large` (enhanced performance), and `voyage-context-3` (for contextualized embeddings).
- `mixedbread-ai`. [Choose a model](https://www.mixedbread.ai/docs/embeddings/models), or use the default model [mixedbread-ai/mxbai-embed-large-v1](https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-v1).
- `octoai`. [Choose a model](https://octo.ai/blog/supercharge-rag-performance-using-octoai-and-unstructured-embeddings/), or use the default model `thenlper/gte-large`.

82 changes: 77 additions & 5 deletions open-source/core-functionality/embedding.mdx
@@ -207,9 +207,9 @@ print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

## `VoyageAIEmbeddingEncoder`

The `VoyageAIEmbeddingEncoder` class connects to the VoyageAI to obtain embeddings for pieces of text.
The `VoyageAIEmbeddingEncoder` class connects to the VoyageAI API to obtain embeddings for pieces of text using the VoyageAI Python client.

`embed_documents` will receive a list of Elements, and return an updated list which includes the `embeddings` attribute for each Element.
`embed_documents` will receive a list of Elements, and return an updated list which includes the `embeddings` attribute for each Element. The encoder automatically handles batching based on token limits for optimal performance.

`embed_query` will receive a query as a string, and return a list of floats which is the embedding vector for the given query string.
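
The token-based batching mentioned above can be pictured with a minimal sketch. This is not the encoder's actual implementation: `count_tokens` is a stand-in that counts whitespace-separated words, `batch_by_tokens` is a hypothetical helper, and the `max_tokens` and `max_items` values are illustrative rather than VoyageAI's real limits.

```python
from typing import Callable, Iterator


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def batch_by_tokens(
    texts: list[str],
    max_tokens: int,
    max_items: int,
    token_counter: Callable[[str], int] = count_tokens,
) -> Iterator[list[str]]:
    """Yield batches that stay within a token budget and an item count."""
    batch: list[str] = []
    batch_tokens = 0
    for text in texts:
        n = token_counter(text)
        # Close the current batch when adding this text would break either limit.
        if batch and (batch_tokens + n > max_tokens or len(batch) >= max_items):
            yield batch
            batch, batch_tokens = [], 0
        # A single text larger than max_tokens still gets a batch of its own.
        batch.append(text)
        batch_tokens += n
    if batch:
        yield batch


texts = ["one two three", "four five", "six", "seven eight nine ten"]
batches = list(batch_by_tokens(texts, max_tokens=5, max_items=2))
print(batches)
```

With these illustrative limits, the four texts are split into two batches of two. A real encoder would use the model's tokenizer and the provider's documented limits instead.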

@@ -219,31 +219,103 @@ The `VoyageAIEmbeddingEncoder` class connects to the VoyageAI to obtain embeddin

The following code block shows an example of how to use `VoyageAIEmbeddingEncoder`. You will see the updated elements list (with the `embeddings` attribute included for each element), the embedding vector for the query string, and some metadata properties about the embedding model.

To use Voyage AI you will need to pass Voyage AI API Key (obtained from [https://dash.voyageai.com/](https://dash.voyageai.com/)) as the `api_key` parameter.
### Configuration Parameters

To use Voyage AI you will need to pass the following parameters to `VoyageAIEmbeddingConfig`:

- **`api_key`** (required): Voyage AI API Key obtained from [https://dash.voyageai.com/](https://dash.voyageai.com/)
- **`model_name`** (required): The embedding model to use. Available models include:
  - `voyage-3.5` - Latest general-purpose model with 1024 dimensions
  - `voyage-3.5-lite` - Lightweight model with 512 dimensions and higher token limits
  - `voyage-3-large` - Large model with enhanced performance
  - `voyage-context-3` - Contextualized embedding model for document-level context
  - `voyage-3`, `voyage-3-lite` - Previous generation models
  - `voyage-2`, `voyage-02` - Legacy models
  - Additional specialized models: `voyage-code-3`, `voyage-code-2`, `voyage-finance-2`, `voyage-law-2`, `voyage-multilingual-2`, `voyage-large-2`, `voyage-large-2-instruct`

  For the complete list of available models, see [https://docs.voyageai.com/docs/embeddings](https://docs.voyageai.com/docs/embeddings).

- **`show_progress_bar`** (optional, default: `False`): Display a progress bar during batch processing
- **`batch_size`** (optional): Override the default batch size for embedding requests
- **`truncation`** (optional): Enable automatic truncation of inputs that exceed token limits
- **`output_dimension`** (optional): Specify a custom output dimension (model-dependent)
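
One way to keep the required and optional parameters straight is to assemble them in a single place before constructing the config. The sketch below is only an illustration: `build_voyage_config_kwargs` is a hypothetical helper, not part of `unstructured`, and it only uses the parameter names listed above.

```python
import os
from typing import Any, Mapping

# Optional parameter names taken from the list above.
OPTIONAL_KEYS = ("show_progress_bar", "batch_size", "truncation", "output_dimension")


def build_voyage_config_kwargs(
    model_name: str,
    env: Mapping[str, str] = os.environ,
    **optional: Any,
) -> dict[str, Any]:
    """Collect keyword arguments for VoyageAIEmbeddingConfig (hypothetical helper)."""
    api_key = env.get("VOYAGE_API_KEY")
    if not api_key:
        raise ValueError("VOYAGE_API_KEY is not set")
    if not model_name:
        raise ValueError("model_name is required; no default model is provided")
    unknown = set(optional) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    kwargs: dict[str, Any] = {"api_key": api_key, "model_name": model_name}
    # Forward only the options the caller set, so library defaults apply otherwise.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


kwargs = build_voyage_config_kwargs(
    "voyage-3.5",
    env={"VOYAGE_API_KEY": "example-key"},  # placeholder value for the sketch
    truncation=True,
    output_dimension=512,
)
print(kwargs)
```

The resulting dictionary could then be passed as `VoyageAIEmbeddingConfig(**kwargs)`.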

The `model_name` parameter is mandatory, please check the available models at [https://docs.voyageai.com/docs/embeddings](https://docs.voyageai.com/docs/embeddings)
### Basic Example

```python
import os

from unstructured.documents.elements import Text
from unstructured.embed.voyageai import VoyageAIEmbeddingConfig, VoyageAIEmbeddingEncoder

# Basic configuration with required parameters
embedding_encoder = VoyageAIEmbeddingEncoder(
    config=VoyageAIEmbeddingConfig(
        api_key=os.environ["VOYAGE_API_KEY"],
        model_name="voyage-3.5",  # this example previously used "voyage-law-2"
    )
)

# Embed documents
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

# Embed a query
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

# Print results
for e in elements:
    print(e, e.embeddings)
print(query, query_embedding)
print(embedding_encoder.is_unit_vector, embedding_encoder.num_of_dimensions)
```
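
A common next step is to compare the query vector with each element's vector. The sketch below ranks documents by cosine similarity using small stand-in vectors so that it runs without an API key; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and in practice the inputs would be `query_embedding` and `[e.embeddings for e in elements]`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: list[float], doc_vecs: list[list[float]]
) -> list[tuple[int, float]]:
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in vectors (assumption): real embeddings have far more dimensions.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [0.8, 0.6, 0.0], [1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query_vec, doc_vecs)
print([index for index, _ in ranking])
```

The third stand-in vector matches the query exactly, so it ranks first, followed by the second and then the first.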

### Advanced Example with Custom Options

```python
import os

from unstructured.documents.elements import Text
from unstructured.embed.voyageai import VoyageAIEmbeddingConfig, VoyageAIEmbeddingEncoder

# Advanced configuration with optional parameters
embedding_encoder = VoyageAIEmbeddingEncoder(
    config=VoyageAIEmbeddingConfig(
        api_key=os.environ["VOYAGE_API_KEY"],
        model_name="voyage-3.5",
        show_progress_bar=True,  # Display progress during processing
        truncation=True,  # Automatically truncate long texts
        output_dimension=512,  # Use reduced dimensions
    )
)

# Process a larger batch of documents
elements = embedding_encoder.embed_documents(
    elements=[Text(f"Document {i}") for i in range(100)],
)
```
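
After requesting a custom `output_dimension`, it can be useful to confirm what came back. The following sketch checks the dimension and unit length of a list of vectors; `summarize_embeddings` is a hypothetical helper and the sample vectors are stand-ins for `[e.embeddings for e in elements]`.

```python
import math


def summarize_embeddings(
    vectors: list[list[float]], tol: float = 1e-6
) -> tuple[int, bool]:
    """Return (dimension, all_unit_length) for a non-empty list of vectors."""
    if not vectors:
        raise ValueError("no vectors to summarize")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("vectors have inconsistent dimensions")
    # A vector counts as unit length when its Euclidean norm is within tol of 1.
    all_unit = all(
        abs(math.sqrt(sum(x * x for x in v)) - 1.0) <= tol for v in vectors
    )
    return dim, all_unit


# Stand-in vectors (assumption) instead of real API output.
dim, all_unit = summarize_embeddings([[0.6, 0.8], [1.0, 0.0]])
print(dim, all_unit)
```

Comparing `dim` against the requested `output_dimension` would catch a model that does not support the chosen size.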

### Using Contextual Embeddings

The `voyage-context-3` model provides contextualized embeddings that consider document-level context:

```python
import os

from unstructured.documents.elements import Text
from unstructured.embed.voyageai import VoyageAIEmbeddingConfig, VoyageAIEmbeddingEncoder

# Configure for contextual embeddings
embedding_encoder = VoyageAIEmbeddingEncoder(
    config=VoyageAIEmbeddingConfig(
        api_key=os.environ["VOYAGE_API_KEY"],
        model_name="voyage-context-3",
    )
)

# Embed documents with contextual understanding
elements = embedding_encoder.embed_documents(
    elements=[Text("Context-aware sentence 1"), Text("Context-aware sentence 2")],
)
```
14 changes: 10 additions & 4 deletions platform/workflows-automation.mdx
@@ -110,10 +110,16 @@ To create a workflow:

- **Anthropic**: Use Anthropic to generate embeddings. Also choose the embedding model to use, from one of the following:

- **voyage-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-large-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-code-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-lite-02-instruct**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-3.5**: Latest general-purpose model with 1024 dimensions. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-3.5-lite**: Lightweight model with 512 dimensions and higher token limits. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-3-large**: Enhanced performance model. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-3**: General-purpose model (120K token limit). [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-3-lite**: Lightweight model (120K token limit). [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-context-3**: Contextualized embedding model for document-level context. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-2**: Legacy general-purpose model. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-large-2**: Legacy large model. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-code-2**: Legacy code-specialized model. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
- **voyage-lite-02-instruct**: Legacy lightweight instruction-tuned model. [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).

- **Hugging Face**: Use Hugging Face to generate embeddings. Also choose the embedding model to use, from one of the following:
