From 0e931c7e57e05ede65a64f5cf7152099d1765f36 Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Mon, 24 Feb 2025 15:49:46 -0800
Subject: [PATCH 1/2] Platform: Recommend using Auto partitioning strategy whenever possible
---
 .../how-to/choose-partitioning-strategy.mdx | 10 ++------
 platform/overview.mdx | 25 +++----------------
 platform/partitioning.mdx | 25 +++----------------
 platform/workflows.mdx | 24 ++++--------------
 .../platform-partitioning-strategies.mdx | 12 +++++++++
 5 files changed, 26 insertions(+), 70 deletions(-)
 create mode 100644 snippets/general-shared-text/platform-partitioning-strategies.mdx
diff --git a/api-reference/how-to/choose-partitioning-strategy.mdx b/api-reference/how-to/choose-partitioning-strategy.mdx
index 08ea2540..63af3a76 100644
--- a/api-reference/how-to/choose-partitioning-strategy.mdx
+++ b/api-reference/how-to/choose-partitioning-strategy.mdx
@@ -42,12 +42,6 @@ See [Changing partition strategy for a PDF](/api-reference/api-services/examples
 ## Auto partitioning strategy logic
-Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a file-by-file basis about which partitioning strategy to use. Specifically:
+Setting `--strategy` or `strategy` to `auto` leaves the decision up to Unstructured on a page-by-page basis about which partitioning strategy to use.
-- If the file is an image, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
-- If the file is a PDF, the local processing logic or Unstructured tries to detect whether there are any embedded tables or images in that file.
-
- - If no embedded tables or images are detected, the `fast` strategy is used for that file. No high-resolution object detection model is used.
- - If at least one embedded table or image is found, the `hi_res` strategy is used for that file. The `layout_v1.0.0` high-resolution object detection model is used.
-
- If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
+If `--strategy` or `strategy` is not specified, the `auto` strategy is used by default.
diff --git a/platform/overview.mdx b/platform/overview.mdx
index 35043107..3cb8d49a 100644
--- a/platform/overview.mdx
+++ b/platform/overview.mdx
@@ -29,33 +29,16 @@ flowchart LR
 Connect-->Route-->Transform-->Chunk-->Enrich-->Embed-->Persist
 ```
+import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';
+
 The Unstructured Platform offers multiple [source connectors](/platform/sources/overview) to connect to your data in its existing location.
- Routing determines which strategy Unstructured Platform uses to transforming your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
+ Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides four [partitioning](/platform/partitioning) strategies for document transformation, as follows.
- - **Fast** is ideal for simple, text-only documents.
- - **High Res** is best for PDFs, images, and complex file types.
-
-
- During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
-
-
- - **VLM** is for challenging documents, including scanned and handwritten content.
-
-
- During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
- Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
-
-
- - **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
-
- - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
+
 Your source document is transformed into Unstructured's canonical JSON schema. Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more.
diff --git a/platform/partitioning.mdx b/platform/partitioning.mdx
index 393866ba..51084787 100644
--- a/platform/partitioning.mdx
+++ b/platform/partitioning.mdx
@@ -15,30 +15,11 @@ model-based workflows, which can be slower and costlier because they require a m
 When you choose a partitioning strategy for your files, you should be mindful of these speed, cost, and quality trade-offs. For example, the **Fast** strategy can be about 100 times faster than leading image-to-text models.
-To choose one of these strategies, select one of the **Partition Strategy** options in the **Partitioner** node of a workflow:
+To choose one of these strategies, select one of the following four **Partition Strategy** options in the **Partitioner** node of a workflow.
 You can change a workflow's preconfigured strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.
-- **Fast**: This strategy is ideal for simple, text-based documents.
-- **High Res**: This strategy is best for PDFs, images, and complex file types.
+import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';
-
- During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
-
-
-- **VLM**: For your most challenging documents, including scanned and handwritten content.
-
-
- During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
- Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
-
- When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when
- these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
-
-
-- **Auto**: Unstructured automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
-
- - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
+
diff --git a/platform/workflows.mdx b/platform/workflows.mdx
index bd223fd9..081a6b69 100644
--- a/platform/workflows.mdx
+++ b/platform/workflows.mdx
@@ -197,20 +197,15 @@ If you did not previously set the workflow to run on a schedule, you can [run th
 #### Custom workflow node types
+import PlatformPartitioningStrategies from '/snippets/general-shared-text/platform-partitioning-strategies.mdx';
+
- For **Partition Strategy**, choose one of the following:
-
- - **Fast**: Ideal for simple, text-only documents.
- - **High Res**: Best for PDFs, images, and complex file types.
-
-
- During **High Res** processing, any detected text-based files are processed and billed at the **Fast** rate instead.
-
+ Choose from one of four available partitioning strategies.
- - **VLM**: For your most challenging documents, including scanned and handwritten content.
+
- You must also choose a VLM provider and model. Available choices include:
+ For **VLM**, you must also choose a VLM provider and model. Available choices include:
 - **Anthropic**:
@@ -232,19 +227,10 @@ If you did not previously set the workflow to run on a schedule, you can [run th
 - **Meta Llama 3.2 11B Instruct**
- During **VLM** processing, any detected files that are not PDFs or images are processed and billed at either the **High Res** or **Fast** rate instead.
- Of those non-PDF and non-image files, all text-based files are processed and billed at the **Fast** rate instead. The other files are processed and billed at the **High Res** rate instead.
 When you use the **VLM** strategy with embeddings for PDF files of 200 or more pages, you might notice some errors when these files are processed. These errors typically occur when these larger PDF files have lots of tables and high-resolution images.
- - **Auto** automatically analyzes and processes files on a page-by-page basis (for PDF files) and on a document-by-document basis for everything else:
-
- - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
- - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
- - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
-
 [Learn more](/platform/partitioning).
diff --git a/snippets/general-shared-text/platform-partitioning-strategies.mdx b/snippets/general-shared-text/platform-partitioning-strategies.mdx
new file mode 100644
index 00000000..842bc8ee
--- /dev/null
+++ b/snippets/general-shared-text/platform-partitioning-strategies.mdx
@@ -0,0 +1,12 @@
+Unstructured recommends that you choose the **Auto** partitioning strategy in most cases. With **Auto**, Unstructured does all
+the heavy lifting, optimizing at runtime for the highest quality at the lowest cost page-by-page.
+
+You should consider the following three additional strategies only if you are absolutely sure that your documents are of the same
+composition and complexity. Each of the following strategies are best suited for specific situations. Choosing one of these
+three strategies other than **Auto** for sets of documents of different types could produce undesirable results,
+including reduction in transformation quality.
+
+- **VLM**: Provides the most consistent transformation quality, especially for complex documents.
+- **High Res**: Built for documents with medium to low complexity (for example, simple images and tables).
+- **Fast**: Built for text-only documents with no complexity. This is the cheapest and fastest strategy. \ No newline at end of file From 33f557bd16e4c85cfdc07181c428286bdd8e04dd Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Tue, 25 Feb 2025 09:05:18 -0800 Subject: [PATCH 2/2] Partitioning options wordsmithing --- .../platform-partitioning-strategies.mdx | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/snippets/general-shared-text/platform-partitioning-strategies.mdx b/snippets/general-shared-text/platform-partitioning-strategies.mdx index 842bc8ee..ed55b407 100644 --- a/snippets/general-shared-text/platform-partitioning-strategies.mdx +++ b/snippets/general-shared-text/platform-partitioning-strategies.mdx @@ -1,12 +1,11 @@ Unstructured recommends that you choose the **Auto** partitioning strategy in most cases. With **Auto**, Unstructured does all the heavy lifting, optimizing at runtime for the highest quality at the lowest cost page-by-page. -You should consider the following three additional strategies only if you are absolutely sure that your documents are of the same -composition and complexity. Each of the following strategies are best suited for specific situations. Choosing one of these -three strategies other than **Auto** for sets of documents of different types could produce undesirable results, +You should consider the following additional strategies only if you are absolutely sure that your documents are of the same +type. Each of the following strategies are best suited for specific situations. Choosing one of these +strategies other than **Auto** for sets of documents of different types could produce undesirable results, including reduction in transformation quality. -- **VLM**: Provides the most consistent transformation quality, especially for complex documents. -- **High Res**: Built for documents with medium to low complexity (for example, simple images and tables). 
- This strategy is less costly and speedier than **VLM**.
-- **Fast**: Built for text-only documents with no complexity. This is the cheapest and fastest strategy.
\ No newline at end of file
+- **VLM**: For the highest-quality transformation of these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
+- **High Res**: For all other [supported file types](/platform/supported-file-types), and for the generation of bounding box coordinates.
+- **Fast**: For text-only documents.
\ No newline at end of file