diff --git a/img/platform/Job-Complete.png b/img/platform/Job-Complete.png index 25875faf..ba49d01a 100644 Binary files a/img/platform/Job-Complete.png and b/img/platform/Job-Complete.png differ diff --git a/platform/overview.mdx b/platform/overview.mdx index 2895a84c..35043107 100644 --- a/platform/overview.mdx +++ b/platform/overview.mdx @@ -36,20 +36,26 @@ flowchart LR Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation: - - **Basic** / **Fast** is ideal for simple, text-only documents. - - **Advanced** / **High Res** is best for PDFs, images, and complex file types. + - **Fast** is ideal for simple, text-only documents. + - **High Res** is best for PDFs, images, and complex file types. - During **Advanced** / **High Res** processing, any detected text-based files are processed and billed at the **Basic** / **Fast** rate instead. + During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead. - - **Platinum** / **VLM** is for challenging documents, including scanned and handwritten content. + - **VLM** is for challenging documents, including scanned and handwritten content. - During **Platinum** / **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** / **High Res** or **Basic** / **Fast** rate instead. - Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** / **Fast** rate instead. The other files are processed and billed at the **Advanced** / **High Res** rate instead. + During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead. 
+ Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead. + - **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else: + + - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing. + - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing. + - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing. + Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more. diff --git a/platform/partitioning.mdx b/platform/partitioning.mdx index 91eb55a1..393866ba 100644 --- a/platform/partitioning.mdx +++ b/platform/partitioning.mdx @@ -36,3 +36,9 @@ To choose one of these strategies, select one of the **Partition Strategy** opti these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images. 
+- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else: + + - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing. + - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing. + - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing. + diff --git a/platform/workflows.mdx b/platform/workflows.mdx index 8b7fb0c9..aa770cd9 100644 --- a/platform/workflows.mdx +++ b/platform/workflows.mdx @@ -14,12 +14,10 @@ Workflows are crucial for establishing a systematic approach to managing data fl ## Create a workflow -![Choose a workflow type](/img/platform/Choose-Workflow-Type.png) - The Unstructured Platform provides two types of workflow builders: -- [Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster. -- [Custom](#create-a-custom-worklow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results. +- [Automatic](#create-an-automatic-workflow) or **Build it for Me** workflows, which use sensible default workflow settings to enable you to get good-quality results faster. +- [Custom](#create-a-custom-workflow) or **Build it Myself** workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results. ### Create an automatic workflow @@ -35,9 +33,9 @@ To create an automatic workflow: 1. On the sidebar, click **Workflows**. 2. Click **New Workflow**. -3. Next to **Build it with Me**, click **Create Workflow**. +3.
Next to **Build it for Me**, click **Create Workflow**. - If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**. + If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**. 4. For **Workflow Name**, enter some unique name for this workflow. 5. In the **Sources** dropdown list, select your source location. @@ -46,118 +44,78 @@ To create an automatic workflow: You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. 7. Click **Continue**. -8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups. Expand any or all - of the following options to learn more about these preconfigured settings: - - - - This option is ideal for simple, text-only documents. - - The **Basic** option uses the following preconfigured workflow settings: - - - **Strategy**: Fast - - **Image Summarizer**: None - - **Table Summarizer**: None - - **Include Page Breaks**: No - - **Infer Table Structure**: No - - **Elements to Exclude**: None - - **Chunk**: +8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors: - - **Chunker Type**: Chunk By Character - - **Chunk Options**: - - - **Include Original Elements**: No - - **Max Characters**: 2048 - - **New After N Characters**: 1500 - - **Overlap**: 160 - - **Overlap All**: No - - - **Embed**: + - Checking this box reprocesses all documents in the source location on every workflow run. + - Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change. 
- - **Provider**: Azure OpenAI - - **Model**: text-embedding-3-large (3072 dimensions) +9. Click **Continue**. +10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. +11. Click **Complete**. - - - This option is best for PDFs, images, and complex file types. +By default, this workflow partitions, chunks, and generates embeddings as follows: - - During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead. - +- **Partitioner**: **Auto** strategy - The **Advanced** option uses the following preconfigured workflow settings: + Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else: - - **Strategy**: High-Res - - **Image Summarizer**: GPT-4o - - **Table Summarizer**: Claude 3.5 Sonnet - - **Include Page Breaks**: No - - **Infer Table Structure**: No - - **Elements to Exclude**: None - - **Chunk**: + - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing. + - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing. + - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing. - - **Chunker Type**: Chunk By Title - - **Chunk Options**: + [Learn about partitioning strategies](/platform/partitioning). 
- - **Combine Text Under N Characters**: 0 - - **Include Original Elements**: No - - **Max Characters**: 2048 - - **New After N Characters**: 1500 - - **Overlap**: 160 - - **Overlap All**: No +- **Chunker**: **Chunk by Title** strategy - - **Embed**: + - **Contextual Chunking**: No (unchecked) + - **Combine Text Under N Characters**: 3000 + - **Include Original Elements**: Yes (checked) + - **Max Characters**: 5500 + - **Multipage Sections**: Yes (checked) + - **New After N Characters**: 3500 + - **Overlap**: 350 + - **Overlap All**: Yes (checked) - - **Provider**: Azure OpenAI - - **Model**: text-embedding-3-large (3072 dimensions) + [Learn about chunking strategies](/platform/chunking). - - - This option is for your most challenging documents, including scanned and handwritten content. +- **Embedder**: - - During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead. - Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead. - - When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when - these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images. - + - **Provider**: Azure OpenAI + - **Model**: text-embedding-3-large, with 3072 dimensions - The **Platinum** option uses the following preconfigured workflow settings: + [Learn about embedding providers and models](/platform/embedding). - - **Strategy**: VLM - - **VLM Provider, Model**: Anthropic, Anthropic Claude 3.5 Sonnet - - **Chunk**: +- **Enrichments**: - - **Chunker Type**: Chunk By Title - - **Chunk Options**: + This workflow contains no enrichments. 
- - **Combine Text Under N Characters**: 0 - - **Include Original Elements**: No - - **Max Characters**: 2048 - = **Multipage Sections**: No - - **New After N Characters**: 1500 - - **Overlap**: 160 - - **Overlap All**: No + [Learn about available enrichments](/platform/enriching/overview). - - **Embed**: +After this workflow is created, you can change any or all of its settings if you want to. This includes the workflow's +source connector, destination connector, partitioning, chunking, and embedding settings. You can also add enrichments +to the workflow if you want to. - - **Provider**: Azure OpenAI - - **Model**: text-embedding-3-large (3072 dimensions) +To change the workflow's default settings or to add enrichments: - - +1. On the sidebar, click **Workflows**. +2. In the list of available workflows, click the workflow that was just created. This opens a visual designer that shows + your workflow as a directed acyclic graph (DAG). This DAG contains a node representing each step in the workflow. + There is one node for the partitioning step, another node for the chunking step, and so on. +3. To learn how to change a node's settings or to add enrichment nodes, click the **FAQ** button in the flyout pane in + the workflow DAG designer. -9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors: +If you did not previously set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now. - - Checking this box reprocesses all documents in the source location on every workflow run. - - Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change. +### Create a custom workflow -10. Click **Continue**. -11.
If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. -12. Click **Complete**. -13. If you did not set the workflow to run on a schedule, you can [run the worklow](#edit-delete-or-run-a-workflow) now. + + If you already have an existing workflow that you want to change, do the following: + + 1. On the sidebar, click **Workflows**. + 2. Click the name of the workflow that you want to change. + 3. Skip ahead to Step 11 in the following procedure. -### Create a custom workflow + You must first have an existing source connector and destination connector to add to the workflow. @@ -281,6 +239,12 @@ To create an automatic workflow: these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images. + - **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else: + + - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing. + - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing. + - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing. + [Learn more](/platform/partitioning). diff --git a/snippets/quickstarts/platform.mdx b/snippets/quickstarts/platform.mdx index e62a07b3..ac957dca 100644 --- a/snippets/quickstarts/platform.mdx +++ b/snippets/quickstarts/platform.mdx @@ -94,9 +94,9 @@ allowfullscreen ![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png) 1. In the sidebar, click **Workflows**. 2. 
Click **New Workflow**. - 3. Next to **Build it with Me**, click **Create Workflow**. + 3. Next to **Build it for Me**, click **Create Workflow**. - If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**. + If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**. 4. For **Workflow Name**, enter some unique name for this workflow. 5. In the **Sources** dropdown list, select your source location from Step 3. @@ -105,26 +105,14 @@ allowfullscreen You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. 7. Click **Continue**. - 8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups: - - - **Basic**: Ideal for simple, text-only documents. - - **Advanced**: Best for PDFs, images, and complex file types. - - **Platinum**: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs). - During processing, files that are not PDFs or images are processed by using the **Advanced** strategy and are charged at the **Advanced** rate instead. - - - When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when - these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images. - - - 9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors: + 8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors: - Checking this box reprocesses all documents in the source location on every workflow run. 
- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again. - - 10. Click **Continue**. - 11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. - 12. Click **Complete**. + + 9. Click **Continue**. + 10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. + 11. Click **Complete**. ![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png)
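
Reviewer note: the **Auto** routing rules that this change adds to `overview.mdx`, `partitioning.mdx`, and `workflows.mdx` can be read as a three-way decision per page (for PDFs) or per document (for everything else). The sketch below is only an illustration of that wording, not Unstructured's implementation. The `Unit` type, the `choose_strategy` function, and the `few` cutoff are all hypothetical; the docs do not define how many tables or images count as "a few", and they do not say what happens to a unit with few items but a non-standard layout or language, so the sketch assumes such units fall through to **VLM**.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One routing unit: a PDF page, or a whole non-PDF document (hypothetical type)."""
    image_count: int
    table_count: int
    standard_layout_and_language: bool = True


def choose_strategy(unit: Unit, few: int = 3) -> str:
    """Return the strategy name, which is also the billing rate, for one unit.

    `few` is an assumed cutoff for "a few"; the documentation does not define it.
    """
    items = unit.image_count + unit.table_count
    if items == 0:
        # No images and (likely) no tables.
        return "Fast"
    if items <= few and unit.standard_layout_and_language:
        # Only a few tables or images, with standard layouts and languages.
        return "High Res"
    # More than a few tables or images.
    # Assumption: few items with a non-standard layout or language also land here.
    return "VLM"


# Four pages of a hypothetical PDF, routed independently of each other.
pages = [
    Unit(image_count=0, table_count=0),
    Unit(image_count=1, table_count=1),
    Unit(image_count=4, table_count=3),
    Unit(image_count=1, table_count=0, standard_layout_and_language=False),
]
strategies = [choose_strategy(page) for page in pages]
print(strategies)
```

Because routing is per page for PDFs, a single PDF can be billed at a mix of rates, which is the practical difference from choosing **Fast**, **High Res**, or **VLM** for the whole workflow.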