diff --git a/platform/embedding.mdx b/platform/embedding.mdx index 93f91582..1ab97e76 100644 --- a/platform/embedding.mdx +++ b/platform/embedding.mdx @@ -69,6 +69,4 @@ To generate embeddings, choose one of the following embedding providers and mode - **text-embedding-3-large**, with 3072 dimensions. - **Ada 002 (Text)**, with 1536 dimensions. - [Learn more](https://platform.openai.com/docs/guides/embeddings). - -- **Vertex AI**: Use [Vertex AI](https://cloud.google.com/vertex-ai) to generate embeddings by using the [textembedding-gecko@001](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) model, with 768 dimensions. \ No newline at end of file + [Learn more](https://platform.openai.com/docs/guides/embeddings). \ No newline at end of file diff --git a/platform/overview.mdx b/platform/overview.mdx index 9e6fd18c..9326fcfd 100644 --- a/platform/overview.mdx +++ b/platform/overview.mdx @@ -23,9 +23,19 @@ To get your data RAG-ready, the Unstructured Platform moves it through the follo Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation: - - **Fast** is great for when there is extractable text available, like in HTML files or in the Microsoft Office Document format. - - **Hi Res** is best for PDFs and tables and where accurate classification of document elements is critical. - - If you're unsure which strategy to use, choose **Auto**, and the Unstructured Platform will handle the decision for you. + - **Basic** / **Fast** is ideal for simple, text-only documents. + - **Advanced** / **High Res** is best for PDFs, images, and complex file types. + + + During **Advanced** / **High Res** processing, any detected text-based files are processed and billed at the **Basic** / **Fast** rate instead. 
+ + + - **Platinum** / **VLM** is for challenging documents, including scanned and handwritten content. + + + During **Platinum** / **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** / **High Res** or **Basic** / **Fast** rate instead. + Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** / **Fast** rate instead. The other files are processed and billed at the **Advanced** / **High Res** rate instead. + diff --git a/platform/partitioning.mdx b/platform/partitioning.mdx index e5279132..2483d6c6 100644 --- a/platform/partitioning.mdx +++ b/platform/partitioning.mdx @@ -20,11 +20,16 @@ To choose one of these strategies, select one of the **Partition Strategy** opti You can change a workflow's predefined strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. - **Fast**: This strategy is ideal for simple, text-based documents. -- **Hi-Res**: This strategy is best for PDFs, images, and complex file types. -- **VLM**: For your most challenging documents, including scanned and handwritten content, use this strategy, which leverages vision - language models (VLMs). During processing, files that are not PDFs or images are processed by using the **Hi-Res** strategy and are charged - at the **Hi-Res** rate instead. -- **Auto**: This strategy examines each file before processing it. If the file is an image, or if the file is a PDF and at least one embedded table - or image is found in it, **Hi-Res** is used to process that file and charged at the **Hi-Res** rate for that file. Otherwise, **Fast** is used and charged at the - **Fast** rate for that file. +- **High Res**: This strategy is best for PDFs, images, and complex file types. + + + During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead. 
+ + +- **VLM**: For your most challenging documents, including scanned and handwritten content. + + + During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead. + Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead. + diff --git a/platform/workflows.mdx b/platform/workflows.mdx index 2bc05b7d..1962ae72 100644 --- a/platform/workflows.mdx +++ b/platform/workflows.mdx @@ -54,8 +54,17 @@ To create an automatic workflow: - **Basic** Ideal for simple, text-only documents. - **Advanced** Best for PDFs, images, and complex file types. - - **Platinum** For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs). - During processing, files that are not PDFs or images are processed by using the **Advanced** strategy and are charged at the **Advanced** rate instead. + + + During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead. + + + - **Platinum** For your most challenging documents, including scanned and handwritten content. + + + During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead. + Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead. + 9. The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors: @@ -109,12 +118,18 @@ There are two ways to create a custom workflow: 9. In the **Strategy** area, choose one of the following: - **Fast**: Ideal for simple, text-only documents. - - **Hi-Res**: Best for PDFs, images, and complex file types. 
- - **VLM**: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs). - During processing, files that are not PDFs or images are processed by using the **Hi-Res** strategy and are charged at the **Hi-Res** rate instead. - - **Auto**: This strategy examines each file before processing it. If the file is an image, or if the file is a PDF and at least one embedded table - or image is found in it, **Hi-Res** is used to process that file and charged at the **Hi-Res** rate for that file. Otherwise, **Fast** is used and charged at the - **Fast** rate for that file. + - **High Res**: Best for PDFs, images, and complex file types. + + + During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead. + + + - **VLM**: For your most challenging documents, including scanned and handwritten content. + + + During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead. + Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead. + [Learn more](/platform/partitioning). @@ -189,8 +204,6 @@ There are two ways to create a custom workflow: [Learn more](https://platform.openai.com/docs/guides/embeddings). - - **Vertex AI**: Use Vertex AI to generate embeddings by using the `textembedding-gecko@001` model, with 768 dimensions. [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings). - Learn more: - [Embedding overview](/platform/embedding) @@ -266,12 +279,18 @@ There are two ways to create a custom workflow: For **Partition Strategy**, choose one of the following: - **Fast**: Ideal for simple, text-only documents. - - **Hi-Res**: Best for PDFs, images, and complex file types. 
- - **VLM**: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs). - During processing, files that are not PDFs or images are processed by using the **Hi-Res** strategy and are charged at the **Hi-Res** rate instead. - - **Auto**: This strategy examines each file before processing it. If the file is an image, or if the file is a PDF and at least one embedded table - or image is found in it, **Hi-Res** is used to process that file and charged at the **Hi-Res** rate for that file. Otherwise, **Fast** is used and charged at the - **Fast** rate for that file. + - **High Res**: Best for PDFs, images, and complex file types. + + + During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead. + + + - **VLM**: For your most challenging documents, including scanned and handwritten content. + + + During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead. + Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead. + [Learn more](/platform/partitioning). @@ -338,8 +357,6 @@ There are two ways to create a custom workflow: [Learn more](https://platform.openai.com/docs/guides/embeddings). - - **Vertex AI**: Use Vertex AI to generate embeddings by using the `textembedding-gecko@001` model, with 768 dimensions. [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings). - Learn more: - [Embedding overview](/platform/embedding)
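
Reviewer note: the billing fallback rules introduced in this diff can be restated as a small decision function. The sketch below is only an illustration of the wording in the new notes, not the platform's implementation. `Strategy`, `FileKind`, and `effective_strategy` are hypothetical names. It also assumes that every file falls into exactly one of four kinds (PDF, image, text-based, other) and that the file kind is already known; the diff does not say how files are detected or classified.

```python
from enum import Enum


class Strategy(Enum):
    # Strategy names as they appear in the custom workflow settings.
    FAST = "Fast"
    HIGH_RES = "High Res"
    VLM = "VLM"


class FileKind(Enum):
    # Hypothetical, mutually exclusive file categories (an assumption).
    PDF = "pdf"
    IMAGE = "image"
    TEXT_BASED = "text-based"
    OTHER = "other"


def effective_strategy(selected: Strategy, kind: FileKind) -> Strategy:
    """Return the strategy (and billing rate) that applies to one file."""
    # Fast never falls back to anything else.
    if selected is Strategy.FAST:
        return Strategy.FAST
    # Under High Res or VLM, text-based files are billed at the Fast rate.
    if kind is FileKind.TEXT_BASED:
        return Strategy.FAST
    # Under VLM, remaining files that are not PDFs or images
    # are billed at the High Res rate.
    if selected is Strategy.VLM and kind not in (FileKind.PDF, FileKind.IMAGE):
        return Strategy.HIGH_RES
    # Otherwise the selected strategy applies unchanged.
    return selected
```

The same table applies to the automatic workflow labels if **Basic**, **Advanced**, and **Platinum** are read as **Fast**, **High Res**, and **VLM** respectively, which is how the overview page pairs them.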