diff --git a/docs/docs/ai/llm.mdx b/docs/docs/ai/llm.mdx
index 8a1e47fe6..1df420aa7 100644
--- a/docs/docs/ai/llm.mdx
+++ b/docs/docs/ai/llm.mdx
@@ -6,21 +6,63 @@ description: LLMs integrated with CocoIndex for various built-in functions
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
-CocoIndex provides builtin functions (e.g. [`ExtractByLlm`](/docs/ops/functions#extractbyllm)) that process data using LLM.
-You usually need to provide a `LlmSpec`, to configure the LLM integration you want to use and LLM models, etc.
+CocoIndex provides builtin functions integrating with various LLM APIs, for various inference tasks:
+* [Text Generation](#text-generation): use LLM to generate text.
+* [Text Embedding](#text-embedding): embed text into a vector space.
 
+## LLM API Types
 
-## LLM Spec
+We support integrating with LLM with different types of APIs.
+Each LLM API type is specified by a `cocoindex.LlmApiType` enum.
 
-The `cocoindex.LlmSpec` data class is used to configure the LLM integration you want to use and LLM models, etc.
+We support the following types of LLM APIs:
+
+| API Name | `LlmApiType` enum | Text Generation | Text Embedding |
+|----------|---------------------|--------------------|--------------------|
+| [OpenAI](#openai) | `LlmApiType.OPENAI` | ✅ | ✅ |
+| [Ollama](#ollama) | `LlmApiType.OLLAMA` | ✅ | ❌ |
+| [Google Gemini](#google-gemini) | `LlmApiType.GEMINI` | ✅ | ✅ |
+| [Anthropic](#anthropic) | `LlmApiType.ANTHROPIC` | ✅ | ❌ |
+| [Voyage](#voyage) | `LlmApiType.VOYAGE` | ❌ | ✅ |
+| [LiteLLM](#litellm) | `LlmApiType.LITE_LLM` | ✅ | ❌ |
+| [OpenRouter](#openrouter) | `LlmApiType.OPEN_ROUTER` | ✅ | ❌ |
+
+## LLM Tasks
+
+### Text Generation
+
+Generation is used as a building block for certain CocoIndex functions that process data using LLM generation.
+
+We have one builtin functions using LLM generation for now:
+
+* [`ExtractByLlm`](/docs/ops/functions#extractbyllm): it extracts information from input text.
+
+#### LLM Spec
+
+When calling a CocoIndex function that uses LLM generation, you need to provide a `cocoindex.LlmSpec` dataclass, to configure the LLM you want to use in these functions.
 It has the following fields:
 
-* `api_type`: The type of integrated LLM API to use, e.g. `cocoindex.LlmApiType.OPENAI` or `cocoindex.LlmApiType.OLLAMA`.
+* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of integrated LLM API to use, e.g. `cocoindex.LlmApiType.OPENAI` or `cocoindex.LlmApiType.OLLAMA`.
   See supported LLM APIs in the [LLM API integrations](#llm-api-integrations) section below.
-* `model`: The name of the LLM model to use.
-* `address` (optional): The address of the LLM API.
+* `model` (type: `str`, required): The name of the LLM model to use.
+* `address` (type: `str`, optional): The address of the LLM API.
 
+### Text Embedding
+
+Embedding means converting text into a vector space, usually for similarity matching.
+
+We provide a builtin function [`EmbedText`](/docs/ops/functions#embedtext) that converts a given text into a vector space.
+The spec takes the following fields:
+
+* `api_type` (type: `cocoindex.LlmApiType`, required)
+* `model` (type: `str`, required)
+* `address` (type: `str`, optional)
+* `output_dimension` (type: `int`, optional)
+* `task_type` (type: `str`, optional)
+
+See documentation for [`EmbedText`](/docs/ops/functions#embedtext) for more details about these fields.
+
 ## LLM API Integrations
 
 CocoIndex integrates with various LLM APIs for these functions.
@@ -30,7 +72,11 @@ CocoIndex integrates with various LLM APIs for these functions.
 To use the OpenAI LLM API, you need to set the environment variable `OPENAI_API_KEY`.
 You can generate the API key from [OpenAI Dashboard](https://platform.openai.com/api-keys).
 
-A spec for OpenAI looks like this:
+Currently we don't support custom address for OpenAI API.
+
+You can find the full list of models supported by OpenAI [here](https://platform.openai.com/docs/models).
+
+For text generation, a spec for OpenAI looks like this:
@@ -42,9 +88,20 @@ cocoindex.LlmSpec(
 )
 ```
 
-Currently we don't support custom address for OpenAI API.
+
+
 
-You can find the full list of models supported by OpenAI [here](https://platform.openai.com/docs/models).
+For text embedding, a spec for OpenAI looks like this:
+
+
+
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.OPENAI,
+    model="text-embedding-3-small",
+)
+```
@@ -82,7 +139,9 @@ cocoindex.LlmSpec(
 To use the Gemini LLM API, you need to set the environment variable `GEMINI_API_KEY`.
 You can generate the API key from [Google AI Studio](https://aistudio.google.com/apikey).
 
-A spec for Gemini looks like this:
+You can find the full list of models supported by Gemini [here](https://ai.google.dev/gemini-api/docs/models).
+
+For text generation, a spec looks like this:
@@ -97,14 +156,32 @@ cocoindex.LlmSpec(
-You can find the full list of models supported by Gemini [here](https://ai.google.dev/gemini-api/docs/models).
+For text embedding, a spec looks like this:
+
+
+
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.GEMINI,
+    model="text-embedding-004",
+    task_type="SEMANTICS_SIMILARITY",
+)
+```
+
+All supported embedding models can be found [here](https://ai.google.dev/gemini-api/docs/embeddings#embeddings-models).
+Gemini supports task type (optional), which can be found [here](https://ai.google.dev/gemini-api/docs/embeddings#supported-task-types).
+
+
+
+
 
 ### Anthropic
 
 To use the Anthropic LLM API, you need to set the environment variable `ANTHROPIC_API_KEY`.
 You can generate the API key from [Anthropic API](https://console.anthropic.com/settings/keys).
 
-A spec for Anthropic looks like this:
+A text generation spec for Anthropic looks like this:
@@ -121,6 +198,29 @@ cocoindex.LlmSpec(
 
 You can find the full list of models supported by Anthropic [here](https://docs.anthropic.com/en/docs/about-claude/models/all-models).
 
+### Voyage
+
+To use the Voyage LLM API, you need to set the environment variable `VOYAGE_API_KEY`.
+You can generate the API key from [Voyage dashboard](https://dashboard.voyageai.com/organization/api-keys).
+
+A text embedding spec for Voyage looks like this:
+
+
+
+
+```python
+cocoindex.functions.EmbedText(
+    api_type=cocoindex.LlmApiType.VOYAGE,
+    model="voyage-code-3",
+    task_type="document",
+)
+```
+
+
+
+
+Voyage API supports `document` and `query` as task types (optional, a.k.a. `input_type` in Voyage API, see [Voyage API documentation](https://docs.voyageai.com/reference/embeddings-api) for details).
+
 ### LiteLLM
 
 To use the LiteLLM API, you need to set the environment variable `LITELLM_API_KEY`.
diff --git a/docs/docs/ops/functions.md b/docs/docs/ops/functions.md
index 9b583fe71..2e5e76d88 100644
--- a/docs/docs/ops/functions.md
+++ b/docs/docs/ops/functions.md
@@ -105,3 +105,32 @@ Input data:
 * `text` (type: `str`, required): The text to extract information from.
 
 Return type: As specified by the `output_type` field in the spec. The extracted information from the input text.
+
+## EmbedText
+
+`EmbedText` embeds a text into a vector space using various LLM APIs that support text embedding.
+
+The spec takes the following fields:
+
+* `api_type` (type: [`cocoindex.LlmApiType`](/docs/ai/llm#llm-api-types), required): The type of LLM API to use for embedding.
+* `model` (type: `str`, required): The name of the embedding model to use.
+* `address` (type: `str`, optional): The address of the LLM API. If not specified, uses the default address for the API type.
+* `output_dimension` (type: `int`, optional): The expected dimension of the output embedding vector. If not specified, use the default dimension of the model.
+
+  For most API types, the function internally keeps a registry for the default output dimension of known model.
+  You need to explicitly specify the `output_dimension` if you want to use a new model that is not in the registry yet.
+
+* `task_type` (type: `str`, optional): The task type for embedding, used by some embedding models to optimize the embedding for specific use cases.
+
+:::note Supported APIs for Text Embedding
+
+Not all LLM APIs support text embedding. See the [LLM API Types table](/docs/ai/llm#llm-api-types) for which APIs support text embedding functionality.
+
+:::
+
+Input data:
+
+* `text` (type: `str`, required): The text to embed.
+
+Return type: `vector[float32; N]`, where `N` is the dimension of the embedding vector determined by the model.
+
diff --git a/docs/sidebars.ts b/docs/sidebars.ts
index bf645bdd6..4bb8dd428 100644
--- a/docs/sidebars.ts
+++ b/docs/sidebars.ts
@@ -61,4 +61,4 @@ const sidebars: SidebarsConfig = {
   ],
 };
 
-export default sidebars;
+export default sidebars;
\ No newline at end of file
diff --git a/examples/code_embedding/main.py b/examples/code_embedding/main.py
index c9ef9787b..ac1b1297d 100644
--- a/examples/code_embedding/main.py
+++ b/examples/code_embedding/main.py
@@ -24,7 +24,7 @@ def code_to_embedding(
     # You can also switch to Voyage embedding model:
     # return text.transform(
     #     cocoindex.functions.EmbedText(
-    #         api_type=cocoindex.llm.LlmApiType.VOYAGE,
+    #         api_type=cocoindex.LlmApiType.VOYAGE,
     #         model="voyage-code-3",
     #     )
     # )
diff --git a/examples/text_embedding/main.py b/examples/text_embedding/main.py
index 9e84792b7..07e83f031 100644
--- a/examples/text_embedding/main.py
+++ b/examples/text_embedding/main.py
@@ -19,7 +19,7 @@ def text_to_embedding(
     # You can also switch to remote embedding model:
     # return text.transform(
     #     cocoindex.functions.EmbedText(
-    #         api_type=cocoindex.llm.LlmApiType.OPENAI,
+    #         api_type=cocoindex.LlmApiType.OPENAI,
     #         model="text-embedding-3-small",
     #     )
     # )
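
The `output_dimension` rule described in the `EmbedText` documentation in this diff (use the explicit value when given, otherwise fall back to a registry of known models, and require an explicit value for models not yet in the registry) can be illustrated with a small standalone sketch. This is not CocoIndex's implementation: `DEFAULT_DIMENSIONS` and `resolve_output_dimension` are hypothetical names, and the dimensions listed are assumed from the providers' public documentation for the three models named in the diff, so they should be verified before being relied on.

```python
from typing import Dict, Optional

# Hypothetical registry (NOT CocoIndex's actual table): model name -> default
# embedding dimension. Values are assumptions taken from provider docs.
DEFAULT_DIMENSIONS: Dict[str, int] = {
    "text-embedding-3-small": 1536,  # OpenAI
    "text-embedding-004": 768,       # Gemini
    "voyage-code-3": 1024,           # Voyage
}


def resolve_output_dimension(model: str, output_dimension: Optional[int] = None) -> int:
    """Return the embedding dimension to expect for `model`.

    An explicit `output_dimension` always wins; otherwise the registry is
    consulted, and an unknown model without an explicit value is an error.
    """
    if output_dimension is not None:
        return output_dimension
    if model in DEFAULT_DIMENSIONS:
        return DEFAULT_DIMENSIONS[model]
    raise ValueError(
        f"Unknown model {model!r}: specify output_dimension explicitly."
    )
```

Under this sketch, a known model needs no extra argument, while a model missing from the registry must be given an `output_dimension`, which mirrors the requirement stated in the `EmbedText` field description.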