Merged
10 changes: 2 additions & 8 deletions api-reference/how-to/choose-partitioning-strategy.mdx
@@ -42,12 +42,6 @@ See [Changing partition strategy for a PDF](/api-reference/api-services/examples

## Auto partitioning strategy logic

Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a file-by-file basis about which partitioning strategy to use. Specifically:
Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a page-by-page basis about which partitioning strategy to use.

- If the file is an image, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
- If the file is a PDF, the local processing logic or Unstructured tries to detect whether there are any embedded tables or images in that file.

- If no embedded tables or images are detected, the `fast` strategy is used for that file. No high-resolution object detection model is used.
- If at least one embedded table or image is found, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.

- If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
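As a rough illustration of the file-by-file logic listed in the bullet points above (the behavior described before this change), the decision could be sketched as follows. `FileInfo`, `resolve_auto_strategy`, and `effective_strategy` are hypothetical names invented for this sketch, not part of any Unstructured API. The handling of files that are neither images nor PDFs is left unspecified because the text does not cover it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical summary of a file; not an Unstructured type."""
    is_image: bool
    is_pdf: bool
    has_embedded_tables_or_images: bool = False


def resolve_auto_strategy(info: FileInfo) -> Optional[str]:
    """Pick a strategy for one file, following the bullet points above."""
    if info.is_image:
        # Images always use hi_res.
        return "hi_res"
    if info.is_pdf:
        # PDFs use hi_res only if at least one embedded table or image is found.
        return "hi_res" if info.has_embedded_tables_or_images else "fast"
    # Assumption: the text does not describe other file types, so return None.
    return None


def effective_strategy(requested: Optional[str]) -> str:
    """If no strategy is specified, auto is used by default."""
    return requested if requested else "auto"
```

For example, `resolve_auto_strategy(FileInfo(is_image=False, is_pdf=True))` yields `"fast"` because no embedded tables or images were flagged.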
25 changes: 4 additions & 21 deletions platform/overview.mdx
@@ -29,33 +29,16 @@ flowchart LR
Connect-->Route-->Transform-->Chunk-->Enrich-->Embed-->Persist
```

import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';

<Steps>
<Step title="Connect">
The Unstructured Platform offers multiple [source connectors](/platform/sources/overview) to connect to your data in its existing location.
</Step>
<Step title="Route">
Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides four [partitioning](/platform/partitioning) strategies for document transformation, as follows.

- **Fast** is ideal for simple, text-only documents.
- **High Res** is best for PDFs, images, and complex file types.

<Note>
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
</Note>

- **VLM** is for challenging documents, including scanned and handwritten content.

<Note>
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
</Note>

- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
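The three **Auto** rules above could be sketched roughly as follows. This is only an illustration: the function name, its parameters, and especially the `few` threshold are assumptions, since the text says "a few" without giving a number. The sketch also assumes that a page with few tables or images but a non-standard layout or language falls through to **VLM**, which the text does not state explicitly.

```python
def choose_page_strategy(
    image_count: int,
    likely_table_count: int,
    standard_layout: bool = True,
    few: int = 3,  # hypothetical threshold; the docs only say "a few"
) -> str:
    """Return the strategy (and billing rate) for one page or document."""
    visual_elements = image_count + likely_table_count
    if visual_elements == 0:
        # No images and likely no tables.
        return "Fast"
    if visual_elements <= few and standard_layout:
        # Only a few tables or images, with standard layouts and languages.
        return "High Res"
    # More than a few tables or images (or, by assumption, unusual layouts).
    return "VLM"


# Each page is routed, and billed, independently.
pages = [(0, 0), (1, 1), (10, 2)]
rates = [choose_page_strategy(images, tables) for images, tables in pages]
```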

<PlatformPartitioningStrategies />
</Step>
<Step title="Transform">
Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
25 changes: 3 additions & 22 deletions platform/partitioning.mdx
@@ -15,30 +15,11 @@ model-based workflows, which can be slower and costlier because they require a m
When you choose a partitioning strategy for your files, you should be mindful of these speed, cost, and quality trade-offs.
For example, the **Fast** strategy can be about 100 times faster than leading image-to-text models.

To choose one of these strategies, select one of the **Partition Strategy** options in the **Partitioner** node of a workflow:
To choose one of these strategies, select one of the following four **Partition Strategy** options in the **Partitioner** node of a workflow.

<Note>You can change a workflow's preconfigured strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.</Note>

- **Fast**: This strategy is ideal for simple, text-based documents.
- **High Res**: This strategy is best for PDFs, images, and complex file types.
import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';

<Note>
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
</Note>

- **VLM**: For your most challenging documents, including scanned and handwritten content.

<Note>
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.

When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
</Note>

- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
<PlatformPartitioningStrategies />
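The billing fallbacks described in the **High Res** and **VLM** notes above can be summarized in a small lookup. This is a minimal sketch, not Unstructured's implementation: `billed_rate` and the `file_kind` labels (`"pdf_or_image"`, `"text"`, `"other"`) are hypothetical names chosen for illustration.

```python
def billed_rate(strategy: str, file_kind: str) -> str:
    """Return the rate a file is processed and billed at.

    file_kind is one of "pdf_or_image", "text", or "other" (assumed labels).
    """
    if strategy == "VLM":
        if file_kind == "pdf_or_image":
            return "VLM"
        # Non-PDF, non-image files: text-based files drop to Fast,
        # everything else drops to High Res.
        return "Fast" if file_kind == "text" else "High Res"
    if strategy == "High Res":
        # Text-based files are processed and billed at the Fast rate.
        return "Fast" if file_kind == "text" else "High Res"
    if strategy == "Fast":
        return "Fast"
    raise ValueError(f"Unknown strategy: {strategy!r}")
```

For instance, `billed_rate("VLM", "text")` gives `"Fast"`, matching the note that text-based files are billed at the **Fast** rate even when **VLM** is selected.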

24 changes: 5 additions & 19 deletions platform/workflows.mdx
@@ -197,20 +197,15 @@ If you did not previously set the workflow to run on a schedule, you can [run th

#### Custom workflow node types

import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';

<AccordionGroup>
<Accordion title="Partitioner node">
For **Partition Strategy**, choose one of the following:

- **Fast**: Ideal for simple, text-only documents.
- **High Res**: Best for PDFs, images, and complex file types.

<Note>
During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
</Note>
Choose from one of four available partitioning strategies.

- **VLM**: For your most challenging documents, including scanned and handwritten content.
<PlatformPartitioningStrategies />

You must also choose a VLM provider and model. Available choices include:
For **VLM**, you must also choose a VLM provider and model. Available choices include:

- **Anthropic**:

@@ -232,19 +227,10 @@ If you did not previously set the workflow to run on a schedule, you can [run th
- **Meta Llama 3.2 11B Instruct**

<Note>
During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.

When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
</Note>

- **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:

- If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.

[Learn more](/platform/partitioning).
</Accordion>
<Accordion title="Chunker node">
11 changes: 11 additions & 0 deletions snippets/general-shared-text/platform-partitioning-strategies.mdx
@@ -0,0 +1,11 @@
Unstructured recommends that you choose the **Auto** partitioning strategy in most cases. With **Auto**, Unstructured does all
the heavy lifting, optimizing at runtime for the highest quality at the lowest cost page-by-page.

You should consider the following additional strategies only if you are absolutely sure that your documents are of the same
type. Each of the following strategies is best suited for specific situations. Choosing one of these
strategies other than **Auto** for sets of documents of different types could produce undesirable results,
including reduction in transformation quality.

- **VLM**: For the highest-quality transformation of these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
- **High Res**: For all other [supported file types](/platform/supported-file-types), and for the generation of bounding box coordinates.
- **Fast**: For text-only documents.
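One way to read the guidance above is as a small selection helper. The sketch below is an assumption-laden illustration only: `suggest_strategy`, its flags, and the order in which the rules are checked are all hypothetical. Whether a document is "text-only" cannot be inferred from its extension, so the caller supplies that as a flag.

```python
from pathlib import PurePath

# File types listed for VLM in the guidance above.
VLM_EXTENSIONS = {
    ".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp",
}


def suggest_strategy(filenames, text_only=False, need_bounding_boxes=False):
    """Suggest a strategy for a batch of files (hypothetical helper)."""
    extensions = {PurePath(name).suffix.lower() for name in filenames}
    if len(extensions) != 1:
        # Mixed (or unknown) document types: stay with the recommended default.
        return "Auto"
    if need_bounding_boxes:
        # High Res is the strategy named for bounding box coordinates.
        return "High Res"
    if text_only:
        return "Fast"
    (extension,) = extensions
    return "VLM" if extension in VLM_EXTENSIONS else "High Res"
```

Because **Auto** is the recommended choice in most cases, a helper like this would only matter when you are certain that every document in the batch is of the same type.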