From 324fbb35cf7ca9d15ebe3cdba53baf307b79fb1f Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Mon, 2 Dec 2024 11:39:10 -0800 Subject: [PATCH] Platform: Preconfigured workflow settings --- platform/chunking.mdx | 2 +- platform/embedding.mdx | 2 +- platform/partitioning.mdx | 2 +- platform/workflows.mdx | 105 ++++++++++++++++++++++++++---- snippets/quickstarts/platform.mdx | 2 +- 5 files changed, 97 insertions(+), 16 deletions(-) diff --git a/platform/chunking.mdx b/platform/chunking.mdx index 9f9b0755..868e29d7 100644 --- a/platform/chunking.mdx +++ b/platform/chunking.mdx @@ -61,7 +61,7 @@ Here are a few examples: The following sections provide information about the available chunking strategies and their settings. -You can change a workflow's predefined strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. +You can change a workflow's preconfigured strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. ## Basic chunking strategy diff --git a/platform/embedding.mdx b/platform/embedding.mdx index 1ab97e76..1792f8b6 100644 --- a/platform/embedding.mdx +++ b/platform/embedding.mdx @@ -61,7 +61,7 @@ on Hugging Face: To generate embeddings, choose one of the following embedding providers and models in the **Providers** section of an **Embedder** node in a workflow: -You can change a workflow's predefined provider only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. +You can change a workflow's preconfigured provider only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. - **OpenAI**: Use [OpenAI](https://openai.com) to generate embeddings. 
Also, choose the model to use: diff --git a/platform/partitioning.mdx b/platform/partitioning.mdx index 2483d6c6..6c000d0d 100644 --- a/platform/partitioning.mdx +++ b/platform/partitioning.mdx @@ -17,7 +17,7 @@ For example, the **Fast** strategy can be about 100 times faster than leading im To choose one of these strategies, select one of the **Partition Strategy** options in the **Partitioner** node of a workflow: -You can change a workflow's predefined strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. +You can change a workflow's preconfigured strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. - **Fast**: This strategy is ideal for simple, text-based documents. - **High Res**: This strategy is best for PDFs, images, and complex file types. diff --git a/platform/workflows.mdx b/platform/workflows.mdx index b07ea01f..26a4003f 100644 --- a/platform/workflows.mdx +++ b/platform/workflows.mdx @@ -50,21 +50,102 @@ To create an automatic workflow: You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. 7. Click **Continue**. -8. In the **Optimize for** section, select the option to choose one of these predefined workflow settings groups: +8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups. Expand any or all + of the following options to learn more about these preconfigured settings: - - **Basic** Ideal for simple, text-only documents. - - **Advanced** Best for PDFs, images, and complex file types. + + + This option is ideal for simple, text-only documents. - - During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead. 
- - - - **Platinum** For your most challenging documents, including scanned and handwritten content. + The **Basic** option uses the following preconfigured workflow settings: - - During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead. - Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead. - + - **Strategy**: Fast + - **Image Summarizer**: None + - **Table Summarizer**: None + - **Include Page Breaks**: No + - **Infer Table Structure**: No + - **Elements to Exclude**: None + - **Chunk**: + + - **Chunker Type**: Chunk By Character + - **Chunk Options**: + + - **Include Original Elements**: No + - **Max Characters**: 2048 + - **New After N Characters**: 1500 + - **Overlap**: 160 + - **Overlap All**: No + + - **Embed**: + + - **Provider**: Azure OpenAI + - **Model**: text-embedding-3-large (3072 dimensions) + + + + This option is best for PDFs, images, and complex file types. + + + During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead. 
+ + + The **Advanced** option uses the following preconfigured workflow settings: + + - **Strategy**: High-Res + - **Image Summarizer**: GPT-4o + - **Table Summarizer**: Claude 3.5 Sonnet + - **Include Page Breaks**: No + - **Infer Table Structure**: No + - **Elements to Exclude**: None + - **Chunk**: + + - **Chunker Type**: Chunk By Title + - **Chunk Options**: + + - **Combine Text Under N Characters**: 0 + - **Include Original Elements**: No + - **Max Characters**: 2048 + - **New After N Characters**: 1500 + - **Overlap**: 160 + - **Overlap All**: No + + - **Embed**: + + - **Provider**: Azure OpenAI + - **Model**: text-embedding-3-large (3072 dimensions) + + + + This option is for your most challenging documents, including scanned and handwritten content. + + + During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead. + Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead. + + + The **Platinum** option uses the following preconfigured workflow settings: + + - **Strategy**: VLM + - **Chunk**: + + - **Chunker Type**: Chunk By Title + - **Chunk Options**: + + - **Combine Text Under N Characters**: 0 + - **Include Original Elements**: No + - **Max Characters**: 2048 + - **Multipage Sections**: No + - **New After N Characters**: 1500 + - **Overlap**: 160 + - **Overlap All**: No + + - **Embed**: + + - **Provider**: Azure OpenAI + - **Model**: text-embedding-3-large (3072 dimensions) + + + 9. 
The **Reprocess all** box applies only to the Amazon S3 and Azure Blob Storage source connectors: diff --git a/snippets/quickstarts/platform.mdx b/snippets/quickstarts/platform.mdx index c4c6e6ce..90a60f73 100644 --- a/snippets/quickstarts/platform.mdx +++ b/snippets/quickstarts/platform.mdx @@ -92,7 +92,7 @@ allowfullscreen You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. 7. Click **Continue**. - 8. In the **Optimize for** section, select the option to choose one of these predefined workflow settings groups: + 8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups: - **Basic**: Ideal for simple, text-only documents. - **Advanced**: Best for PDFs, images, and complex file types.
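
Reviewer note: to make the three preconfigured settings groups easier to compare, the values listed in the `platform/workflows.mdx` hunk can be sketched as plain data. This is only an illustration under stated assumptions. The names `PRESETS`, `EMBED`, `BASE_CHUNK_OPTIONS`, `flatten`, and `differences`, as well as the snake_case keys, are hypothetical and are not part of the platform or any of its APIs. Only the values are taken from the patch.

```python
# Hypothetical sketch: the preconfigured settings from this patch as plain data.
# None of these names are a real platform API; only the values come from the docs.

MISSING = "<not listed>"  # marks a setting the patch does not list for an option

# Embedding settings are identical for all three options.
EMBED = {"provider": "Azure OpenAI", "model": "text-embedding-3-large", "dimensions": 3072}

# Chunk options shared by all three options.
BASE_CHUNK_OPTIONS = {
    "include_original_elements": False,
    "max_characters": 2048,
    "new_after_n_characters": 1500,
    "overlap": 160,
    "overlap_all": False,
}

PRESETS = {
    "Basic": {
        "strategy": "Fast",
        "image_summarizer": None,
        "table_summarizer": None,
        "include_page_breaks": False,
        "infer_table_structure": False,
        "elements_to_exclude": None,
        "chunk": {"chunker_type": "Chunk By Character", "options": dict(BASE_CHUNK_OPTIONS)},
        "embed": dict(EMBED),
    },
    "Advanced": {
        "strategy": "High-Res",
        "image_summarizer": "GPT-4o",
        "table_summarizer": "Claude 3.5 Sonnet",
        "include_page_breaks": False,
        "infer_table_structure": False,
        "elements_to_exclude": None,
        "chunk": {
            "chunker_type": "Chunk By Title",
            "options": {"combine_text_under_n_characters": 0, **BASE_CHUNK_OPTIONS},
        },
        "embed": dict(EMBED),
    },
    "Platinum": {
        "strategy": "VLM",
        "chunk": {
            "chunker_type": "Chunk By Title",
            "options": {
                "combine_text_under_n_characters": 0,
                "multipage_sections": False,
                **BASE_CHUNK_OPTIONS,
            },
        },
        "embed": dict(EMBED),
    },
}


def flatten(settings, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. 'chunk.options.overlap'."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def differences(first, second):
    """Return {dotted_key: (first_value, second_value)} for settings that differ."""
    a, b = flatten(PRESETS[first]), flatten(PRESETS[second])
    return {
        key: (a.get(key, MISSING), b.get(key, MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, MISSING) != b.get(key, MISSING)
    }


if __name__ == "__main__":
    for key, (basic, advanced) in differences("Basic", "Advanced").items():
        print(f"{key}: {basic!r} -> {advanced!r}")
```

Calling `differences("Advanced", "Platinum")` in the same way lists the settings that the patch gives for one option but not the other, such as **Multipage Sections**.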