Merged
Binary file modified img/platform/Job-Complete.png
18 changes: 12 additions & 6 deletions platform/overview.mdx
@@ -36,20 +36,26 @@ flowchart LR
<Step title="Route">
Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:

- **Basic** / **Fast** is ideal for simple, text-only documents.
- **Advanced** / **High Res** is best for PDFs, images, and complex file types.
- **Fast** is ideal for simple, text-only documents.
- **High Res** is best for PDFs, images, and complex file types.

<Note>
During **Advanced** / **High Res** processing, any detected text-based files are processed and billed at the **Basic** / **Fast** rate instead.
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
</Note>

- **Platinum** / **VLM** is for challenging documents, including scanned and handwritten content.
- **VLM** is for challenging documents, including scanned and handwritten content.

<Note>
During **Platinum** / **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** / **High Res** or **Basic** / **Fast** rate instead.
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** / **Fast** rate instead. The other files are processed and billed at the **Advanced** / **High Res** rate instead.
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
</Note>

- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
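
The billing fallbacks in the **High Res** and **VLM** notes above can be summarized as a small lookup. The following Python is only a sketch of the documented rules: the `Rate` and `FileKind` enums and the `billed_rate` function are hypothetical names invented for illustration, not part of any Unstructured API, and how the platform actually detects a file's kind is not described here.

```python
from enum import Enum


class Rate(Enum):
    FAST = "Fast"
    HIGH_RES = "High Res"
    VLM = "VLM"


class FileKind(Enum):
    # Hypothetical categories; the platform's real file detection is not documented here.
    PDF_OR_IMAGE = "pdf_or_image"
    TEXT_BASED = "text_based"
    OTHER = "other"


def billed_rate(selected: Rate, kind: FileKind) -> Rate:
    """Return the rate a file is billed at, given the strategy selected for the workflow."""
    if selected is Rate.FAST:
        return Rate.FAST
    if kind is FileKind.TEXT_BASED:
        # Under both High Res and VLM, text-based files drop to the Fast rate.
        return Rate.FAST
    if selected is Rate.VLM and kind is FileKind.OTHER:
        # Under VLM, the remaining non-PDF, non-image files drop to the High Res rate.
        return Rate.HIGH_RES
    return selected
```

For example, `billed_rate(Rate.VLM, FileKind.OTHER)` yields `Rate.HIGH_RES`, while a PDF or image under **VLM** stays at the **VLM** rate.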

</Step>
<Step title="Transform">
Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
6 changes: 6 additions & 0 deletions platform/partitioning.mdx
@@ -36,3 +36,9 @@ To choose one of these strategies, select one of the **Partition Strategy** opti
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
</Note>

- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
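
The three **Auto** rules above amount to a per-unit decision, where a unit is one page of a PDF or one whole document of any other type. The sketch below is illustrative only: `UnitProfile`, `choose_strategy`, and especially the `FEW_LIMIT` threshold are assumptions made for this example, because the text does not say what counts as "a few" or how the platform detects tables, images, and layouts. Sending a unit with a non-standard layout to **VLM** is also an assumption; the rules above do not cover that case explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    FAST = "Fast"
    HIGH_RES = "High Res"
    VLM = "VLM"


# Hypothetical cutoff for "only a few" tables or images; the real threshold is not published.
FEW_LIMIT = 3


@dataclass
class UnitProfile:
    """Features of one PDF page, or of one whole non-PDF document."""
    image_count: int
    table_count: int
    standard_layout: bool = True


def choose_strategy(unit: UnitProfile) -> Strategy:
    """Pick the partitioning strategy (and therefore the billing rate) for one unit."""
    visual_items = unit.image_count + unit.table_count
    if visual_items == 0:
        return Strategy.FAST
    if visual_items <= FEW_LIMIT and unit.standard_layout:
        return Strategy.HIGH_RES
    # More than a few tables or images (or, by assumption, a non-standard layout).
    return Strategy.VLM


def route_pdf(pages: list[UnitProfile]) -> list[Strategy]:
    """PDF files are routed page by page, so one file can mix strategies."""
    return [choose_strategy(page) for page in pages]
```

Because PDF files are analyzed page by page, a single PDF can be billed at a mix of rates, one per page.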

154 changes: 59 additions & 95 deletions platform/workflows.mdx
@@ -14,12 +14,10 @@ Workflows are crucial for establishing a systematic approach to managing data fl

## Create a workflow

![Choose a workflow type](/img/platform/Choose-Workflow-Type.png)

The Unstructured Platform provides two types of workflow builders:

- [Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
- [Custom](#create-a-custom-worklow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.
- [Automatic](#create-an-automatic-workflow) or **Build it for Me** workflows, which use sensible default workflow settings to enable you to get good-quality results faster.
- [Custom](#create-a-custom-worklow) or **Build it Myself** workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results.

### Create an automatic workflow

@@ -35,9 +33,9 @@ To create an automatic workflow:

1. On the sidebar, click **Workflows**.
2. Click **New Workflow**.
3. Next to **Build it with Me**, click **Create Workflow**.
3. Next to **Build it for Me**, click **Create Workflow**.

<Note>If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**.</Note>
<Note>If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**.</Note>

4. For **Workflow Name**, enter some unique name for this workflow.
5. In the **Sources** dropdown list, select your source location.
@@ -46,118 +44,78 @@ To create an automatic workflow:
<Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>

7. Click **Continue**.
8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups. Expand any or all
of the following options to learn more about these preconfigured settings:

<AccordionGroup>
<Accordion title="Basic">
This option is ideal for simple, text-only documents.

The **Basic** option uses the following preconfigured workflow settings:

- **Strategy**: Fast
- **Image Summarizer**: None
- **Table Summarizer**: None
- **Include Page Breaks**: No
- **Infer Table Structure**: No
- **Elements to Exclude**: None
- **Chunk**:
8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:

- **Chunker Type**: Chunk By Character
- **Chunk Options**:

- **Include Original Elements**: No
- **Max Characters**: 2048
- **New After N Characters**: 1500
- **Overlap**: 160
- **Overlap All**: No

- **Embed**:
- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.

- **Provider**: Azure OpenAI
- **Model**: text-embedding-3-large (3072 dimensions)
9. Click **Continue**.
10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
11. Click **Complete**.
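
The **Reprocess All** behavior described in the procedure above can be pictured as a simple selection over document identifiers. This is a minimal sketch of the wording on this page, not of the platform's implementation: `select_documents` and its parameters are hypothetical, and documents are tracked here by name only, which matches the statement that previously processed documents are skipped even if their contents change.

```python
def select_documents(
    source_docs: set[str],
    already_processed: set[str],
    reprocess_all: bool,
) -> set[str]:
    """Return the documents that a workflow run would process."""
    if reprocess_all:
        # Box checked: every document in the source location is processed on every run.
        return set(source_docs)
    # Box unchecked: only documents not seen by an earlier run are processed.
    return source_docs - already_processed


source = {"a.pdf", "b.docx", "c.txt"}
seen = {"a.pdf", "b.docx"}

full_run = select_documents(source, seen, reprocess_all=True)
incremental_run = select_documents(source, seen, reprocess_all=False)
```

Here `incremental_run` contains only `c.txt`, whereas `full_run` contains all three documents.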

</Accordion>
<Accordion title="Advanced">
This option is best for PDFs, images, and complex file types.
By default, this workflow partitions, chunks, and generates embeddings as follows:

<Note>
During **Advanced** processing, any detected text-based files are processed and billed at the **Basic** rate instead.
</Note>
- **Partitioner**: **Auto** strategy

The **Advanced** option uses the following preconfigured workflow settings:
Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- **Strategy**: High-Res
- **Image Summarizer**: GPT-4o
- **Table Summarizer**: Claude 3.5 Sonnet
- **Include Page Breaks**: No
- **Infer Table Structure**: No
- **Elements to Exclude**: None
- **Chunk**:
- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.

- **Chunker Type**: Chunk By Title
- **Chunk Options**:
[Learn about partitioning strategies](/platform/partitioning).

- **Combine Text Under N Characters**: 0
- **Include Original Elements**: No
- **Max Characters**: 2048
- **New After N Characters**: 1500
- **Overlap**: 160
- **Overlap All**: No
- **Chunker**: **Chunk by Title** strategy

- **Embed**:
- **Contextual Chunking**: No (unchecked)
- **Combine Text Under N Characters**: 3000
- **Include Original Elements**: Yes (checked)
- **Max Characters**: 5500
Collaborator

Why do we have such massive chunks now???

Collaborator Author

🤷 Ask Crag?

- **Multipage Sections**: Yes (checked)
- **New After N Characters**: 3500
- **Overlap**: 350
- **Overlap All**: Yes (checked)

- **Provider**: Azure OpenAI
- **Model**: text-embedding-3-large (3072 dimensions)
[Learn about chunking strategies](/platform/chunking).

</Accordion>
<Accordion title="Platinum">
This option is for your most challenging documents, including scanned and handwritten content.
- **Embedder**:

<Note>
During **Platinum** processing, any detected files that are not PDFs or images are processed and billed at either the **Advanced** or **Basic** rate instead.
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Basic** rate instead. The other files are processed and billed at the **Advanced** rate instead.

When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
</Note>
- **Provider**: Azure OpenAI
- **Model**: text-embedding-3-large, with 3072 dimensions

The **Platinum** option uses the following preconfigured workflow settings:
[Learn about embedding providers and models](/platform/embedding).

- **Strategy**: VLM
- **VLM Provider, Model**: Anthropic, Anthropic Claude 3.5 Sonnet
- **Chunk**:
- **Enrichments**:

- **Chunker Type**: Chunk By Title
- **Chunk Options**:
This workflow contains no enrichments.

- **Combine Text Under N Characters**: 0
- **Include Original Elements**: No
- **Max Characters**: 2048
- **Multipage Sections**: No
- **New After N Characters**: 1500
- **Overlap**: 160
- **Overlap All**: No
[Learn about available enrichments](/platform/enriching/overview).
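
For reference, the default settings listed above can be collected into one structure. The values below are copied from the list; the class and field names are hypothetical and do not correspond to a documented Unstructured Platform API. The `check` method only verifies that the character limits are mutually consistent, which is an assumption about how such limits normally relate rather than a documented platform rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChunkerDefaults:
    strategy: str = "Chunk by Title"
    contextual_chunking: bool = False
    combine_text_under_n_characters: int = 3000
    include_original_elements: bool = True
    max_characters: int = 5500
    multipage_sections: bool = True
    new_after_n_characters: int = 3500
    overlap: int = 350
    overlap_all: bool = True

    def check(self) -> bool:
        """Soft limit should not exceed the hard limit; overlap should be smaller than a chunk."""
        return (
            self.combine_text_under_n_characters <= self.new_after_n_characters
            and self.new_after_n_characters <= self.max_characters
            and self.overlap < self.max_characters
        )


@dataclass(frozen=True)
class EmbedderDefaults:
    provider: str = "Azure OpenAI"
    model: str = "text-embedding-3-large"
    dimensions: int = 3072


@dataclass(frozen=True)
class AutomaticWorkflowDefaults:
    partition_strategy: str = "Auto"
    chunker: ChunkerDefaults = field(default_factory=ChunkerDefaults)
    embedder: EmbedderDefaults = field(default_factory=EmbedderDefaults)
    enrichments: tuple[str, ...] = ()  # The automatic workflow contains no enrichments.


defaults = AutomaticWorkflowDefaults()
```

Laying the values out this way makes the relationship between the limits easy to see: 3000 (combine) is at most 3500 (new after), which is at most 5500 (max), with an overlap of 350 characters.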

- **Embed**:
After this workflow is created, you can change any or all of its settings if you want to. This includes the workflow's
source connector, destination connector, partitioning, chunking, and embedding settings. You can also add enrichments
to the workflow if you want to.

- **Provider**: Azure OpenAI
- **Model**: text-embedding-3-large (3072 dimensions)
To change the workflow's default settings or to add enrichments:

</Accordion>
</AccordionGroup>
1. On the sidebar, click **Workflows**.
2. In the list of available workflows, click the workflow that was just created. This opens a visual designer that shows
your workflow as a directed acyclic graph (DAG). This DAG contains a node representing each step in the workflow.
There is one node for the partitioning step, another node for the chunking step, and so on.
3. To learn how to change a node's settings or to add enrichment nodes, click the **FAQ** button in the flyout pane in
the workflow DAG designer.

9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
If you did not previously set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.

- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box causes only new documents that are added to the source location since the last workflow run to be processed on future runs. Previously processed documents are not processed again, even if those documents' contents change.
### Create a custom workflow

10. Click **Continue**.
11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
12. Click **Complete**.
13. If you did not set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.
<Tip>
If you already have an existing workflow that you want to change, do the following:

1. On the sidebar, click **Workflows**.
2. Click the name of the workflow that you want to change.
3. Skip ahead to Step 11 in the following procedure.

### Create a custom workflow
</Tip>

<Warning>
You must first have an existing source connector and destination connector to add to the workflow.
@@ -281,6 +239,12 @@ To create an automatic workflow:
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
</Note>

- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.

[Learn more](/platform/partitioning).
</Accordion>
<Accordion title="Chunker node">
26 changes: 7 additions & 19 deletions snippets/quickstarts/platform.mdx
@@ -94,9 +94,9 @@ allowfullscreen
![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png)
1. In the sidebar, click **Workflows**.
2. Click **New Workflow**.
3. Next to **Build it with Me**, click **Create Workflow**.
3. Next to **Build it for Me**, click **Create Workflow**.

<Note>If a radio button appears instead of **Build it with Me**, select it, and then click **Continue**.</Note>
<Note>If a radio button appears instead of **Build it for Me**, select it, and then click **Continue**.</Note>

4. For **Workflow Name**, enter some unique name for this workflow.
5. In the **Sources** dropdown list, select your source location from Step 3.
@@ -105,26 +105,14 @@ allowfullscreen
<Note>You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations.</Note>

7. Click **Continue**.
8. In the **Optimize for** section, select the option to choose one of these preconfigured workflow settings groups:

- **Basic**: Ideal for simple, text-only documents.
- **Advanced**: Best for PDFs, images, and complex file types.
- **Platinum**: For your most challenging documents, including scanned and handwritten content. It uses vision language models (VLMs).
During processing, files that are not PDFs or images are processed by using the **Advanced** strategy and are charged at the **Advanced** rate instead.

<Note>
When you use the **Platinum** strategy for PDF files of 200 or more pages, you might notice some errors when
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
</Note>

9. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:
8. The **Reprocess All** box applies only to blob storage connectors such as the Amazon S3, Azure Blob Storage, and Google Cloud Storage connectors:

- Checking this box reprocesses all documents in the source location on every workflow run.
- Unchecking this box causes new documents that have been added to the source location, as well as existing documents in the source location that have had their contents or titles changed, since the last workflow run to be processed on future runs. Other previously processed documents are not processed again.

10. Click **Continue**.
11. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
12. Click **Complete**.
9. Click **Continue**.
10. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**.
11. Click **Complete**.
</Step>
<Step title="Process the documents">
![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png)