459 changes: 232 additions & 227 deletions api-reference/workflow/workflows.mdx

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions snippets/quickstarts/single-file-ui.mdx
@@ -132,17 +132,17 @@ import EnrichmentImagesTablesHiResOnly from '/snippets/general-shared-text/enric
allowfullscreen
></iframe>

- Add a **Chunker** node after the **Partitioner** node, to chunk the partitioned data into smaller pieces for your retrieval-augmented generation (RAG) applications.
To do this, click the add (**+**) button to the right of the **Partitioner** node, and then click **Enrich > Chunker**. Click the new **Chunker** node and
specify its settings. For help, click the **FAQ** button in the **Chunker** node's pane. [Learn more about chunking and chunker settings](/ui/chunking).
- Add an **Enrichment** node after the **Chunker** node, to apply enrichments to the chunked data such as image summaries, table summaries, table-to-HTML transforms, and
named entity recognition (NER). To do this, click the add (**+**) button to the right of the **Chunker** node, and then click **Enrich > Enrichment**.
- Add an **Enrichment** node after the **Partitioner** node, to apply enrichments to the partitioned data such as image summaries, table summaries, table-to-HTML transforms, and
named entity recognition (NER). To do this, click the add (**+**) button to the right of the **Partitioner** node, and then click **Enrich > Enrichment**.
Click the new **Enrichment** node and specify its settings. For help, click the **FAQ** button in the **Enrichment** node's pane. [Learn more about enrichments and enrichment settings](/ui/enriching/overview).

<EnrichmentImagesTablesHiResOnly />

- Add an **Embedder** node after the **Enrichment** node, to generate vector embeddings for performing vector-based searches. To do this, click the add (**+**) button to the
right of the **Enrichment** node, and then click **Transform > Embedder**. Click the new **Embedder** node and specify its settings. For help, click the **FAQ** button
- Add a **Chunker** node after the **Enrichment** node, to chunk the enriched data into smaller pieces for your retrieval-augmented generation (RAG) applications.
To do this, click the add (**+**) button to the right of the **Enrichment** node, and then click **Enrich > Chunker**. Click the new **Chunker** node and
specify its settings. For help, click the **FAQ** button in the **Chunker** node's pane. [Learn more about chunking and chunker settings](/ui/chunking).
- Add an **Embedder** node after the **Chunker** node, to generate vector embeddings for performing vector-based searches. To do this, click the add (**+**) button to the
right of the **Chunker** node, and then click **Transform > Embedder**. Click the new **Embedder** node and specify its settings. For help, click the **FAQ** button
in the **Embedder** node's pane. [Learn more about embedding and embedding settings](/ui/embedding).

2. Each time you add a node or change its settings, you can click **Test** above the **Source** node again to test the current workflow end to end and see the results of the changes, if any.
4 changes: 4 additions & 0 deletions ui/chunking.mdx
@@ -22,6 +22,10 @@ text element that was too big to fit in one chunk and required splitting.
- `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
- `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.

<Note>
During chunking, Unstructured removes all detected `Image` elements from the output.
</Note>

Here are a few examples:

```json
12 changes: 10 additions & 2 deletions ui/enriching/image-descriptions.mdx
@@ -2,7 +2,7 @@
title: Image descriptions
---

After partitioning and chunking, you can have Unstructured generate text-based summaries of detected images.
After partitioning, you can have Unstructured generate text-based summaries of detected images.

This summarization is done by using models offered through these providers:

@@ -39,6 +39,12 @@ Line breaks have been inserted here for readability. The output will not contain
}
```

For workflows that use [chunking](/ui/chunking), note the following changes:

- Each `Image` element is replaced by a `CompositeElement` element.
- This `CompositeElement` element will contain the image's summary description as part of the element's `text` field.
- This `CompositeElement` element will not contain an `image_base64` field.
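
A rough way to see these changes in a workflow's JSON output is sketched below in Python. It assumes each output element is a dict with `type`, `text`, and `metadata` keys, and that `image_base64`, when present, sits under `metadata`. The helper name and the sample elements are hypothetical placeholders, not real output.

```python
from typing import Any


def has_image_payload(element: dict[str, Any]) -> bool:
    """Return True if the element still carries an `image_base64` field.

    Assumption: the field is stored under the element's `metadata` key.
    """
    return "image_base64" in element.get("metadata", {})


# Hypothetical elements: one from a workflow without chunking, one with chunking.
without_chunking = {
    "type": "Image",
    "text": "<image summary>",
    "metadata": {"image_base64": "<base64 data>"},
}
with_chunking = {
    "type": "CompositeElement",
    "text": "<surrounding text plus image summary>",
    "metadata": {},
}

# Print each element's type and whether it still has an image payload.
for element in (without_chunking, with_chunking):
    print(element["type"], has_image_payload(element))
```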

Here are three examples of the descriptions for detected images. These descriptions are generated with GPT-4o by OpenAI:

![Description of an image showing a scatter plot graph](/img/enriching/Image-Description-1.png)
@@ -57,7 +63,9 @@ To generate image descriptions, in an **Enrichment** node in a workflow, specify

<Note>
You can change a workflow's image description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.


For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
**Chunker** node before an image descriptions **Enrichment** node could cause incomplete or no image descriptions to be generated.
</Note>

<EnrichmentImageSummaryHiResOnly />
4 changes: 1 addition & 3 deletions ui/enriching/ner.mdx
@@ -2,7 +2,7 @@
title: Named entity recognition (NER)
---

After partitioning and chunking, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).
After partitioning, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).
You can also have Unstructured generate a list of relationships between the entities that are recognized.

This NER is done by using models offered through these providers:
@@ -144,8 +144,6 @@ To generate a list of recognized entities and their relationships, in an **Enric

<Note>
You can change a workflow's NER settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.

Entities are only recognized when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/ui/partitioning).
</Note>

1. Select **Text**.
14 changes: 11 additions & 3 deletions ui/enriching/table-descriptions.mdx
@@ -2,7 +2,7 @@
title: Table descriptions
---

After partitioning and chunking, you can have Unstructured generate text-based summaries of detected tables.
After partitioning, you can have Unstructured generate text-based summaries of detected tables.

This summarization is done by using models offered through these providers:

@@ -49,8 +49,14 @@ Here are two examples of the descriptions for detected tables. These description

![Description of a table with information about potentiodynamic polarization of stainless steel](/img/enriching/Table-Description-2.png)

The generated table's summary will overwrite any previous contents in the `text` field. The table's original content is available
in the `image_base64` field.
The generated table's summary will overwrite any text that Unstructured had previously extracted from that table into the `text` field.
The table's original content is available in the `image_base64` field.

For workflows that use [chunking](/ui/chunking), note the following changes:

- If a `Table` element must be chunked, the `Table` element is replaced by a set of related `TableChunk` elements.
- Each of these `TableChunk` elements will contain a summary description only for its own element, as part of the element's `text` field.
- These `TableChunk` elements will not contain an `image_base64` field.

Any embeddings that are produced after these summaries are generated will be based on the new `text` field's contents.
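
To make the point about embeddings concrete, here is a minimal Python sketch that collects the `text` values an embedder would see for table elements. It assumes output elements are dicts with `type` and `text` keys. The `table_texts` helper and the sample data are hypothetical.

```python
from typing import Any

TABLE_TYPES = {"Table", "TableChunk"}


def table_texts(elements: list[dict[str, Any]]) -> list[str]:
    """Collect the `text` of every `Table` or `TableChunk` element, in order."""
    return [e["text"] for e in elements if e.get("type") in TABLE_TYPES]


# Hypothetical chunked output: one table split into two `TableChunk` elements.
elements = [
    {"type": "CompositeElement", "text": "<narrative text>"},
    {"type": "TableChunk", "text": "<summary for chunk 1>"},
    {"type": "TableChunk", "text": "<summary for chunk 2>"},
]

texts = table_texts(elements)
```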

@@ -63,6 +69,8 @@ To generate table descriptions, in an **Enrichment** node in a workflow, specify
<Note>
You can change a workflow's table description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.

For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
**Chunker** node before a table descriptions **Enrichment** node could cause incomplete or no table descriptions to be generated.
</Note>

<EnrichmentTableSummaryHiResOnly />
12 changes: 11 additions & 1 deletion ui/enriching/table-to-html.mdx
@@ -2,7 +2,7 @@
title: Tables to HTML
---

After partitioning and chunking, you can have Unstructured generate representations of each detected table in HTML markup format.
After partitioning, you can have Unstructured generate representations of each detected table in HTML markup format.

This table-to-HTML output is done by using [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.

@@ -60,6 +60,14 @@ Line breaks have been inserted here for readability. The output will not contain
}
```

For workflows that use [chunking](/ui/chunking), note the following changes:

- If a `Table` element must be chunked, the `Table` element is replaced by a set of related `TableChunk` elements.
- Each of these `TableChunk` elements will contain HTML table output for only its own element.
- None of these `TableChunk` elements will contain an `image_base64` field.



## Generate table-to-HTML output

import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrichment-table-to-html-hi-res-only.mdx';
@@ -71,6 +79,8 @@ Make sure after you choose this provider and model, that **Table to HTML** is al
<Note>
You can change a workflow's table-to-HTML settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.

For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
**Chunker** node before a table-to-HTML output **Enrichment** node could cause incomplete or no table-to-HTML output to be generated.
</Note>

<EnrichmentTableToHTMLHiResOnly />
2 changes: 1 addition & 1 deletion ui/summarizing.mdx
@@ -2,7 +2,7 @@
title: Summarizing
---

After partitioning and chunking, _summarizing_ generates text-based summaries of images and tables.
After partitioning, _summarizing_ generates text-based summaries of images and tables.
This summarization is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
6 changes: 5 additions & 1 deletion ui/workflows.mdx
@@ -157,7 +157,6 @@ If you did not previously set the workflow to run on a schedule, you can [run th

8. The workflow begins with the following layout:


```mermaid
flowchart LR
Source-->Partitioner-->Destination
@@ -182,6 +181,11 @@ If you did not previously set the workflow to run on a schedule, you can [run th
Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Destination
```

<Note>
For workflows that use **Chunker** and **Enrichment** nodes together, the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
**Chunker** node before any **Enrichment** nodes could cause incomplete or no enrichment results to be generated.
</Note>

9. In the pipeline designer, click the **Source** node. In the **Source** pane, select the source location. Then click **Save**.

![Workflow designer](/img/ui/Workflow-Designer.png)
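
The ordering rule in the note above (the **Chunker** node after all **Enrichment** nodes) can be expressed as a small check. This Python sketch is illustrative only: it models a workflow as a plain list of node type names in left-to-right order, which is a hypothetical representation and not an Unstructured API.

```python
def chunker_follows_enrichments(node_types: list[str]) -> bool:
    """Return True if no Enrichment node appears after a Chunker node.

    `node_types` is a hypothetical left-to-right list of node type names.
    """
    seen_chunker = False
    for node_type in node_types:
        if node_type == "Chunker":
            seen_chunker = True
        elif node_type == "Enrichment" and seen_chunker:
            # An Enrichment node placed after a Chunker node breaks the rule.
            return False
    return True


recommended = ["Source", "Partitioner", "Enrichment", "Chunker", "Embedder", "Destination"]
discouraged = ["Source", "Partitioner", "Chunker", "Enrichment", "Embedder", "Destination"]
```

Here `recommended` mirrors the layout shown in the flowchart, while `discouraged` is the ordering the note warns could produce incomplete or no enrichment results.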