diff --git a/api-reference/workflow/workflows.mdx b/api-reference/workflow/workflows.mdx
index 73574f71..c0d66f55 100644
--- a/api-reference/workflow/workflows.mdx
+++ b/api-reference/workflow/workflows.mdx
@@ -986,8 +986,8 @@ directed acyclic graph (DAG) nodes. These nodes' settings are specified in the `
`workflow_nodes` array.
- A **Destination** node is automatically created when you specify the `destination_id` value outside of the
`workflow_nodes` array.
-- You can specify [Partitioner](#partitioner-node), [Chunker](#chunker-node),
- [Enrichment](#enrichment-node), and [Embedder](#embedder-node) nodes.
+- You can specify [Partitioner](#partitioner-node), [Enrichment](#enrichment-node),
+ [Chunker](#chunker-node), and [Embedder](#embedder-node) nodes.
@@ -1017,6 +1017,11 @@ flowchart LR
Partitioner-->Enrichment-->Chunker-->Embedder
```
+
+<Note>
+    For workflows that use **Chunker** and **Enrichment** nodes together, the **Chunker** node should be placed after all **Enrichment** nodes. Placing the **Chunker** node before any **Enrichment** nodes could cause incomplete or no enrichment results to be generated.
+</Note>
+
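+A minimal sketch of this ordering in the `workflow_nodes` array (the angle-bracketed subtypes are placeholders; `openai_image_description` is one of the enrichment subtypes listed later in this section):
+
+```json
+"workflow_nodes": [
+  { "name": "Partitioner", "type": "partition", "subtype": "<partitioner subtype>", "settings": {} },
+  { "name": "Enrichment", "type": "prompter", "subtype": "openai_image_description", "settings": {} },
+  { "name": "Chunker", "type": "chunk", "subtype": "<chunker subtype>", "settings": {} },
+  { "name": "Embedder", "type": "embed", "subtype": "<embedder subtype>", "settings": {} }
+]
+```
+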
### Partitioner node
A **Partitioner** node has a `type` of `partition`.
@@ -1352,6 +1357,231 @@ Fields for `settings` include:
- `extract_image_block_types`: _Optional_. A list of the Unstructured element types for use in extracting image blocks as Base64 encoded data stored in `metadata` fields. Available values include `Image` and `Table`. The default is `[ 'Image', 'Table' ]`.
- `infer_table_structure`: _Optional_. True to have any table elements extracted from a PDF to include an additional `metadata` field named `text_as_html`, containing an HTML `<table>` transformation. The default is false.
+### Enrichment node
+
+An **Enrichment** node has a `type` of `prompter`.
+
+[Learn about the available enrichments](/ui/enriching/overview).
+
+
+
+#### Image Description task
+
+import EnrichmentImageSummaryHiResOnly from '/snippets/general-shared-text/enrichment-image-summary-hi-res-only.mdx';
+
+<EnrichmentImageSummaryHiResOnly />
+
+<CodeGroup>
+
+```python
+image_description_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="<subtype>",
+    type="prompter",
+    settings={}
+)
+```
+
+
+```json
+{
+  "name": "Enrichment",
+  "type": "prompter",
+  "subtype": "<subtype>",
+  "settings": {}
+}
+```
+
+</CodeGroup>
+
+Allowed values for `<subtype>` include:
+
+- `openai_image_description`
+- `anthropic_image_description`
+- `bedrock_image_description`
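+
+For example, here is the preceding template with a concrete subtype filled in, as a minimal sketch of a node that uses the OpenAI-provided model:
+
+```python
+image_description_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="openai_image_description",  # Use the OpenAI-provided model.
+    type="prompter",
+    settings={}
+)
+```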
+
+#### Table Description task
+
+import EnrichmentTableSummaryHiResOnly from '/snippets/general-shared-text/enrichment-table-summary-hi-res-only.mdx';
+
+<EnrichmentTableSummaryHiResOnly />
+
+<CodeGroup>
+
+```python
+table_description_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="<subtype>",
+    type="prompter",
+    settings={}
+)
+```
+
+
+```json
+{
+  "name": "Enrichment",
+  "type": "prompter",
+  "subtype": "<subtype>",
+  "settings": {}
+}
+```
+
+</CodeGroup>
+
+Allowed values for `<subtype>` include:
+
+- `openai_table_description`
+- `anthropic_table_description`
+- `bedrock_table_description`
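+
+Similarly, a minimal sketch of a table description node that uses the Anthropic-provided model:
+
+```python
+table_description_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="anthropic_table_description",  # Use the Anthropic-provided model.
+    type="prompter",
+    settings={}
+)
+```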
+
+#### Table to HTML task
+
+import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrichment-table-to-html-hi-res-only.mdx';
+
+<EnrichmentTableToHTMLHiResOnly />
+
+<CodeGroup>
+
+```python
+table_to_html_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="openai_table2html",
+    type="prompter",
+    settings={}
+)
+```
+
+
+```json
+{
+  "name": "Enrichment",
+  "type": "prompter",
+  "subtype": "openai_table2html",
+  "settings": {}
+}
+```
+
+</CodeGroup>
+
+#### Named Entity Recognition (NER) task
+
+<CodeGroup>
+
+```python
+ner_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="<subtype>",
+    type="prompter",
+    settings={
+        "prompt_interface_overrides": {
+            "prompt": {
+                "user": "<prompt>"
+            }
+        }
+    }
+)
+```
+
+
+```json
+{
+  "name": "Enrichment",
+  "type": "prompter",
+  "subtype": "<subtype>",
+  "settings": {
+    "prompt_interface_overrides": {
+      "prompt": {
+        "user": "<prompt>"
+      }
+    }
+  }
+}
+```
+
+</CodeGroup>
+
+Fields for `settings` include:
+
+- `prompt_interface_overrides.prompt.user`: _Optional_. Any alternative prompt to use with the underlying NER model. The default is none, which means that Unstructured's internal default prompt is used when calling the NER model.
+ The internal default prompt, which you can override by providing an alternative prompt, is as follows:
+
+ ```text
+ Extract named entities and their relationships from the following text.
+
+ Provide the entities, their corresponding types and relationships as a structured JSON response.
+
+ Entity types:
+ - PERSON
+ - ORGANIZATION
+ - LOCATION
+ - DATE
+ - TIME
+ - EVENT
+ - MONEY
+ - PERCENT
+ - FACILITY
+ - PRODUCT
+ - ROLE
+ - DOCUMENT
+ - DATASET
+
+ Relationship types:
+ - PERSON - ORGANIZATION: works_for, affiliated_with, founded
+ - PERSON - LOCATION: born_in, lives_in, traveled_to
+ - ORGANIZATION - LOCATION: based_in, has_office_in
+ - Entity - DATE: occurred_on, founded_on, died_on, published_in
+ - PERSON - PERSON: married_to, parent_of, colleague_of
+ - PRODUCT - ORGANIZATION: developed_by, owned_by
+ - EVENT - LOCATION: held_in, occurred_in
+ - Entity - ROLE: has_title, acts_as, has_role
+ - DATASET - PERSON: mentions
+ - DATASET - DOCUMENT: located_in
+ - PERSON - DATASET: published
+ - DOCUMENT - DOCUMENT: referenced_in, contains
+ - DOCUMENT - DATE: dated
+ - PERSON - DOCUMENT: published
+
+ [START OF TEXT]
+ {{text}}
+ [END OF TEXT]
+
+
+ Response format json schema: {
+ "items": [
+ { "entity": "Entity name", "type": "Entity type" },
+ { "entity": "Entity name", "type": "Entity type" }
+ ],
+ "relationships": [
+ {"from": "Entity name", "relationship": "Relationship type", "to": "Entity name"},
+ {"from": "Entity name", "relationship": "Relationship type", "to": "Entity name"}
+ ]
+ }
+ ```
+
+ If you provide an alternative prompt, you must provide the entire prompt in the preceding format. For best results, Unstructured strongly recommends that you limit your changes to the following portions of the internal default prompt:
+
+ - Adding, renaming, or deleting items in the list of predefined types (such as `PERSON`, `ORGANIZATION`, `LOCATION`, and so on).
+ - Adding, renaming, or deleting items in the list of predefined relationships (such as `works_for`, `based_in`, `has_role`, and so on).
+ - As needed, adding any clarifying instructions only between these two lines:
+
+ ```text
+ ...
+ Provide the entities, their corresponding types and relationships as a structured JSON response.
+
+ (Add any clarifying instructions here only.)
+
+ [START OF TEXT]
+ ...
+ ```
+
+ Changing any other portions of the internal default prompt could produce unexpected results.
+
+Allowed values for `<subtype>` include:
+
+- `openai_ner`
+- `anthropic_ner`
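+
+For example, a minimal sketch of an NER node that omits `prompt_interface_overrides` and therefore relies on the internal default prompt:
+
+```python
+ner_enrichment_workflow_node = WorkflowNode(
+    name="Enrichment",
+    subtype="openai_ner",  # Use the OpenAI-provided model.
+    type="prompter",
+    settings={}  # No overrides: the internal default prompt is used.
+)
+```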
+
### Chunker node
A **Chunker** node has a `type` of `chunk`.
@@ -1582,231 +1812,6 @@ Fields for `settings` include:
- `contextual_chunking_strategy`: _Optional_. If specified, prepends chunk-specific explanatory context to each chunk. Allowed values include `v1`. The default is none.
- `similarity_threshold`: _Optional_. The minimum similarity that text in consecutive elements must have to be included in the same chunk. This must be a value between `0.0` and `1.0`, exclusive (`0.01` to `0.99`). The default is none.
-### Enrichment node
-
-An **Enrichment** node has a `type` of `prompter`.
-
-[Learn about the available enrichments](/ui/enriching/overview).
-
-
-
-#### Image Description task
-
-import EnrichmentImageSummaryHiResOnly from '/snippets/general-shared-text/enrichment-image-summary-hi-res-only.mdx';
-
-<EnrichmentImageSummaryHiResOnly />
-
-<CodeGroup>
-
-```python
-image_description_enrichment_workflow_node = WorkflowNode(
-    name="Enrichment",
-    subtype="<subtype>",
-    type="prompter",
-    settings={}
-)
-```
-
-
-```json
-{
-  "name": "Enrichment",
-  "type": "prompter",
-  "subtype": "<subtype>",
-  "settings": {}
-}
-```
-
-</CodeGroup>
-
-Allowed values for `<subtype>` include:
-
-- `openai_image_description`
-- `anthropic_image_description`
-- `bedrock_image_description`
-
-#### Table Description task
-
-import EnrichmentTableSummaryHiResOnly from '/snippets/general-shared-text/enrichment-table-summary-hi-res-only.mdx';
-
-<EnrichmentTableSummaryHiResOnly />
-
-<CodeGroup>
-
-```python
-table_description_enrichment_workflow_node = WorkflowNode(
-    name="Enrichment",
-    subtype="<subtype>",
-    type="prompter",
-    settings={}
-)
-```
-
-
-```json
-{
-  "name": "Enrichment",
-  "type": "prompter",
-  "subtype": "<subtype>",
-  "settings": {}
-}
-```
-
-</CodeGroup>
-
-Allowed values for `<subtype>` include:
-
-- `openai_table_description`
-- `anthropic_table_description`
-- `bedrock_table_description`
-
-#### Table to HTML task
-
-import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrichment-table-to-html-hi-res-only.mdx';
-
-<EnrichmentTableToHTMLHiResOnly />
-
-<CodeGroup>
-
-```python
-table_to_html_enrichment_workflow_node = WorkflowNode(
-    name="Enrichment",
-    subtype="openai_table2html",
-    type="prompter",
-    settings={}
-)
-```
-
-
-```json
-{
-  "name": "Enrichment",
-  "type": "prompter",
-  "subtype": "openai_table2html",
-  "settings": {}
-}
-```
-
-</CodeGroup>
-
-#### Named Entity Recognition (NER) task
-
-<CodeGroup>
-
-```python
-ner_enrichment_workflow_node = WorkflowNode(
-    name="Enrichment",
-    subtype="<subtype>",
-    type="prompter",
-    settings={
-        "prompt_interface_overrides": {
-            "prompt": {
-                "user": "<prompt>"
-            }
-        }
-    }
-)
-```
-
-
-```json
-{
-  "name": "Enrichment",
-  "type": "prompter",
-  "subtype": "<subtype>",
-  "settings": {
-    "prompt_interface_overrides": {
-      "prompt": {
-        "user": "<prompt>"
-      }
-    }
-  }
-}
-```
-
-</CodeGroup>
-
-Fields for settings include:
-
-- `prompt_interface_overrides.prompt.user`: _Optional_. Any alternative prompt to use with the underlying NER model. The default is none, which means to rely on using Unstructured's internal default prompt when calling the NER model.
- The internal default prompt is as follows, which you can override by providing an alternative prompt:
-
- ```text
- Extract named entities and their relationships from the following text.
-
- Provide the entities, their corresponding types and relationships as a structured JSON response.
-
- Entity types:
- - PERSON
- - ORGANIZATION
- - LOCATION
- - DATE
- - TIME
- - EVENT
- - MONEY
- - PERCENT
- - FACILITY
- - PRODUCT
- - ROLE
- - DOCUMENT
- - DATASET
-
- Relationship types:
- - PERSON - ORGANIZATION: works_for, affiliated_with, founded
- - PERSON - LOCATION: born_in, lives_in, traveled_to
- - ORGANIZATION - LOCATION: based_in, has_office_in
- - Entity - DATE: occurred_on, founded_on, died_on, published_in
- - PERSON - PERSON: married_to, parent_of, colleague_of
- - PRODUCT - ORGANIZATION: developed_by, owned_by
- - EVENT - LOCATION: held_in, occurred_in
- - Entity - ROLE: has_title, acts_as, has_role
- - DATASET - PERSON: mentions
- - DATASET - DOCUMENT: located_in
- - PERSON - DATASET: published
- - DOCUMENT - DOCUMENT: referenced_in, contains
- - DOCUMENT - DATE: dated
- - PERSON - DOCUMENT: published
-
- [START OF TEXT]
- {{text}}
- [END OF TEXT]
-
-
- Response format json schema: {
- "items": [
- { "entity": "Entity name", "type": "Entity type" },
- { "entity": "Entity name", "type": "Entity type" }
- ],
- "relationships": [
- {"from": "Entity name", "relationship": "Relationship type", "to": "Entity name"},
- {"from": "Entity name", "relationship": "Relationship type", "to": "Entity name"}
- ]
- }
- ```
-
- If you provide an alternative prompt, you must provide the entire alternative prompt in the preceding format. For best results, Unstructured strongly recommends that you limit your changes only to certain portions of the internal default prompt, specifically:
-
- - Adding, renaming, or deleting items in the list of predefined types (such as `PERSON`, `ORGANIZATION`, `LOCATION`, and so on).
- - Adding, renaming, or deleting items in the list of predefined relationships (such as `works_for`, `based_in`, `has_role`, and so on).
- - As needed, adding any clarifying instructions only between these two lines:
-
- ```text
- ...
- Provide the entities and their corresponding types as a structured JSON response.
-
- (Add any clarifying instructions here only.)
-
- [START OF TEXT]
- ...
- ```
-
- - Changing any other portions of the internal default prompt could produce unexpected results.
-
-Allowed values for `<subtype>` include:
-
-- `openai_ner`
-- `anthropic_ner`
-
### Embedder node
An **Embedder** node has a `type` of `embed`.
diff --git a/snippets/quickstarts/single-file-ui.mdx b/snippets/quickstarts/single-file-ui.mdx
index ef441b5b..04475535 100644
--- a/snippets/quickstarts/single-file-ui.mdx
+++ b/snippets/quickstarts/single-file-ui.mdx
@@ -132,17 +132,17 @@ import EnrichmentImagesTablesHiResOnly from '/snippets/general-shared-text/enric
allowfullscreen
>
- - Add a **Chunker** node after the **Partitioner** node, to chunk the partitioned data into smaller pieces for your retrieval-augmented generation (RAG) applications.
- To do this, click the add (**+**) button to the right of the **Partitioner** node, and then click **Enrich > Chunker**. Click the new **Chunker** node and
- specify its settings. For help, click the **FAQ** button in the **Chunker** node's pane. [Learn more about chunking and chunker settings](/ui/chunking).
- - Add an **Enrichment** node after the **Chunker** node, to apply enrichments to the chunked data such as image summaries, table summaries, table-to-HTML transforms, and
- named entity recognition (NER). To do this, click the add (**+**) button to the right of the **Chunker** node, and then click **Enrich > Enrichment**.
+ - Add an **Enrichment** node after the **Partitioner** node, to apply enrichments such as image summaries, table summaries, table-to-HTML transforms, and
+ named entity recognition (NER) to the partitioned data. To do this, click the add (**+**) button to the right of the **Partitioner** node, and then click **Enrich > Enrichment**.
Click the new **Enrichment** node and specify its settings. For help, click the **FAQ** button in the **Enrichment** node's pane. [Learn more about enrichments and enrichment settings](/ui/enriching/overview).
- - Add an **Embedder** node after the **Enrichment** node, to generate vector embeddings for performing vector-based searches. To do this, click the add (**+**) button to the
- right of the **Enrichment** node, and then click **Transform > Embedder**. Click the new **Embedder** node and specify its settings. For help, click the **FAQ** button
+ - Add a **Chunker** node after the **Enrichment** node, to chunk the enriched data into smaller pieces for your retrieval-augmented generation (RAG) applications.
+ To do this, click the add (**+**) button to the right of the **Enrichment** node, and then click **Enrich > Chunker**. Click the new **Chunker** node and
+ specify its settings. For help, click the **FAQ** button in the **Chunker** node's pane. [Learn more about chunking and chunker settings](/ui/chunking).
+ - Add an **Embedder** node after the **Chunker** node, to generate vector embeddings for performing vector-based searches. To do this, click the add (**+**) button to the
+ right of the **Chunker** node, and then click **Transform > Embedder**. Click the new **Embedder** node and specify its settings. For help, click the **FAQ** button
in the **Embedder** node's pane. [Learn more about embedding and embedding settings](/ui/embedding).
2. Each time you add a node or change its settings, you can click **Test** above the **Source** node again to test the current workflow end to end and see the results of the changes, if any.
diff --git a/ui/chunking.mdx b/ui/chunking.mdx
index dba5cf4e..895c7ff7 100644
--- a/ui/chunking.mdx
+++ b/ui/chunking.mdx
@@ -22,6 +22,10 @@ text element that was too big to fit in one chunk and required splitting.
- `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
- `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.
+
+<Note>
+    During chunking, Unstructured removes all detected `Image` elements from the output.
+</Note>
Here are a few examples:
```json
diff --git a/ui/enriching/image-descriptions.mdx b/ui/enriching/image-descriptions.mdx
index f338f3b6..759edf94 100644
--- a/ui/enriching/image-descriptions.mdx
+++ b/ui/enriching/image-descriptions.mdx
@@ -2,7 +2,7 @@
title: Image descriptions
---
-After partitioning and chunking, you can have Unstructured generate text-based summaries of detected images.
+After partitioning, you can have Unstructured generate text-based summaries of detected images.
This summarization is done by using models offered through these providers:
@@ -39,6 +39,12 @@ Line breaks have been inserted here for readability. The output will not contain
}
```
+For workflows that use [chunking](/ui/chunking), note the following changes:
+
+- Each `Image` element is replaced by a `CompositeElement` element.
+- This `CompositeElement` element will contain the image's summary description as part of the element's `text` field.
+- This `CompositeElement` element will not contain an `image_base64` field.
+
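+For illustration, a post-chunking element that carries an image description might look like the following sketch. The field values are illustrative, other fields are omitted, and note that no `image_base64` field is present:
+
+```json
+{
+  "type": "CompositeElement",
+  "element_id": "...",
+  "text": "The image shows a bar chart comparing quarterly revenue across three regions. ..."
+}
+```
+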
Here are three examples of the descriptions for detected images. These descriptions are generated with GPT-4o by OpenAI:

@@ -57,7 +63,9 @@ To generate image descriptions, in an **Enrichment** node in a workflow, specify
You can change a workflow's image description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
-
+
+ For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
+ **Chunker** node before an image descriptions **Enrichment** node could cause incomplete or no image descriptions to be generated.
diff --git a/ui/enriching/ner.mdx b/ui/enriching/ner.mdx
index 1cfa7bdd..cf85c310 100644
--- a/ui/enriching/ner.mdx
+++ b/ui/enriching/ner.mdx
@@ -2,7 +2,7 @@
title: Named entity recognition (NER)
---
-After partitioning and chunking, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).
+After partitioning, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).
You can also have Unstructured generate a list of relationships between the entities that are recognized.
This NER is done by using models offered through these providers:
@@ -144,8 +144,6 @@ To generate a list of recognized entities and their relationships, in an **Enric
You can change a workflow's NER settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
-
- Entities are only recognized when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/ui/partitioning).
1. Select **Text**.
diff --git a/ui/enriching/table-descriptions.mdx b/ui/enriching/table-descriptions.mdx
index 2b8ceef0..5cec4761 100644
--- a/ui/enriching/table-descriptions.mdx
+++ b/ui/enriching/table-descriptions.mdx
@@ -2,7 +2,7 @@
title: Table descriptions
---
-After partitioning and chunking, you can have Unstructured generate text-based summaries of detected tables.
+After partitioning, you can have Unstructured generate text-based summaries of detected tables.
This summarization is done by using models offered through these providers:
@@ -49,8 +49,14 @@ Here are two examples of the descriptions for detected tables. These description

-The generated table's summary will overwrite any previous contents in the `text` field. The table's original content is available
-in the `image_base64` field.
+The generated table's summary will overwrite any text that Unstructured had previously extracted from that table into the `text` field.
+The table's original content is available in the `image_base64` field.
+
+For workflows that use [chunking](/ui/chunking), note the following changes:
+
+- If a `Table` element must be chunked, the `Table` element is replaced by a set of related `TableChunk` elements.
+- Each of these `TableChunk` elements will contain a summary description of only its own portion of the original table, as part of the element's `text` field.
+- These `TableChunk` elements will not contain an `image_base64` field.
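+
+For illustration, one of these post-chunking elements might look like the following sketch (field values are illustrative, and other fields are omitted):
+
+```json
+{
+  "type": "TableChunk",
+  "element_id": "...",
+  "text": "This portion of the table lists the quarterly totals for each region. ..."
+}
+```
+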
Any embeddings that are produced after these summaries are generated will be based on the new `text` field's contents.
@@ -63,6 +69,8 @@ To generate table descriptions, in an **Enrichment** node in a workflow, specify
You can change a workflow's table description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
+ For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
+ **Chunker** node before a table descriptions **Enrichment** node could cause incomplete or no table descriptions to be generated.
diff --git a/ui/enriching/table-to-html.mdx b/ui/enriching/table-to-html.mdx
index e1efbf16..91311ff9 100644
--- a/ui/enriching/table-to-html.mdx
+++ b/ui/enriching/table-to-html.mdx
@@ -2,7 +2,7 @@
title: Tables to HTML
---
-After partitioning and chunking, you can have Unstructured generate representations of each detected table in HTML markup format.
+After partitioning, you can have Unstructured generate representations of each detected table in HTML markup format.
This table-to-HTML output is done by using [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
@@ -60,6 +60,14 @@ Line breaks have been inserted here for readability. The output will not contain
}
```
+For workflows that use [chunking](/ui/chunking), note the following changes:
+
+- If a `Table` element must be chunked, the `Table` element is replaced by a set of related `TableChunk` elements.
+- Each of these `TableChunk` elements will contain HTML table output for only its own portion of the original table.
+- None of these `TableChunk` elements will contain an `image_base64` field.
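+
+For illustration, one of these post-chunking elements might look like the following sketch. The field values are illustrative; placing the HTML output in `metadata.text_as_html` mirrors the partitioner's `infer_table_structure` convention and is an assumption here:
+
+```json
+{
+  "type": "TableChunk",
+  "element_id": "...",
+  "text": "Quarter Revenue Q1 10,000 Q2 12,500 ...",
+  "metadata": {
+    "text_as_html": "<table><tr><td>Quarter</td><td>Revenue</td></tr>...</table>"
+  }
+}
+```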
+
+
+
## Generate table-to-HTML output
import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrichment-table-to-html-hi-res-only.mdx';
@@ -71,6 +79,8 @@ Make sure after you choose this provider and model, that **Table to HTML** is al
You can change a workflow's table description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
+ For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
+ **Chunker** node before a table-to-HTML output **Enrichment** node could cause incomplete or no table-to-HTML output to be generated.
\ No newline at end of file
diff --git a/ui/summarizing.mdx b/ui/summarizing.mdx
index aa1aac34..a8957af0 100644
--- a/ui/summarizing.mdx
+++ b/ui/summarizing.mdx
@@ -2,7 +2,7 @@
title: Summarizing
---
-After partitioning and chunking, _summarizing_ generates text-based summaries of images and tables.
+After partitioning, _summarizing_ generates text-based summaries of images and tables.
This summarization is done by using models offered through these providers:
- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
diff --git a/ui/workflows.mdx b/ui/workflows.mdx
index 23c67490..df0ae3fd 100644
--- a/ui/workflows.mdx
+++ b/ui/workflows.mdx
@@ -157,7 +157,6 @@ If you did not previously set the workflow to run on a schedule, you can [run th
8. The workflow begins with the following layout:
-
```mermaid
flowchart LR
Source-->Partitioner-->Destination
@@ -182,6 +181,11 @@ If you did not previously set the workflow to run on a schedule, you can [run th
Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Destination
```
+
+<Note>
+    For workflows that use **Chunker** and **Enrichment** nodes together, the **Chunker** node should be placed after all **Enrichment** nodes. Placing the **Chunker** node before any **Enrichment** nodes could cause incomplete or no enrichment results to be generated.
+</Note>
+
9. In the pipeline designer, click the **Source** node. In the **Source** pane, select the source location. Then click **Save**.
