Merged
Binary file added img/platform/Choose-Workflow-Type.png
Binary file removed img/platform/Create-Workflow.png
Binary file not shown.
Binary file modified img/platform/Destinations-Sidebar.png
Binary file added img/platform/GoToPlatform.png
Binary file modified img/platform/Job-Complete.png
Binary file removed img/platform/Job-Details.png
Binary file not shown.
Binary file modified img/platform/Jobs-Sidebar.png
Binary file added img/platform/Node-Usage-Hints.png
Binary file removed img/platform/Run-Job.png
Binary file not shown.
Binary file removed img/platform/Run-Workflow.png
Binary file not shown.
Binary file modified img/platform/Select-Job.png
Binary file modified img/platform/Sources-Sidebar.png
Binary file added img/platform/Start-Screen.png
Binary file removed img/platform/Workflow-Actions.png
Binary file not shown.
Binary file added img/platform/Workflow-Add-Node.png
Binary file added img/platform/Workflow-Designer.png
Binary file added img/platform/Workflow-Details.png
Binary file modified img/platform/Workflows-Sidebar.png
87 changes: 39 additions & 48 deletions platform/chunking.mdx
@@ -7,23 +7,20 @@ the limits of an embedding model and to improve retrieval precision. The goal is
that contain only the information that is relevant to a user's query. You can specify if and how Unstructured chunks
those elements, based on your intended end use.

If you choose something other than **None** for **Chunker Type** in the **Chunk** section of a workflow, Unstructured will attempt to chunk
the partitioned elements.

During chunking, Unstructured uses a basic chunking strategy that attempts to combine two or more consecutive text elements
into each chunk that fits together within **Max Characters**. To determine the best **Max Characters** length, see the documentation
into each chunk that fits together within **Max characters**. To determine the best **Max characters** length, see the documentation
for the embedding model that you want to use.

You can further control this behavior with by-title, by-page, or by-similarity chunking strategies.
In all cases, Unstructured will only split individual elements if they exceed the specified **Max Characters** length.
In all cases, Unstructured will only split individual elements if they exceed the specified **Max characters** length.
After chunking, you will have document elements of only the following types:

- `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a
combination of two or more original text elements that together fit within the **Max Characters** length. It can also be a single
combination of two or more original text elements that together fit within the **Max characters** length. It can also be a single
element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original
text element that was too big to fit in one chunk and required splitting.
- `Table`: A table element is not combined with other elements, and if it fits within **Max Characters** it will remain as is.
- `TableChunk`: Large tables that exceed **Max Characters** are split into special `TableChunk` elements.
- `Table`: A table element is not combined with other elements, and if it fits within **Max characters** it will remain as is.
- `TableChunk`: Large tables that exceed **Max characters** are split into special `TableChunk` elements.

Here are a few examples:

@@ -64,109 +61,103 @@ Here are a few examples:

The following sections provide information about the available chunking strategies and their settings.

<Note>You can change a workflow's predefined strategy only through [Custom](/platform/workflows#custom-workflow-settings) workflow settings.</Note>
<Note>You can change a workflow's predefined strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.</Note>

## Basic chunking strategy

The basic chunking strategy uses only **Max Characters** and **New After N Characters** to combine sequential elements to maximally fill each chunk.
The basic chunking strategy uses only **Max characters** and **New after n characters** to combine sequential elements to maximally fill each chunk.
This strategy does not use section boundaries, page boundaries, or content similarities to determine the chunks' contents.

To use this chunking strategy, choose **Basic** for **Chunker Type** in the **Chunk** section of a workflow.
To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow.
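
As a rough illustration of the behavior described above, here is a minimal sketch in plain Python. It is not Unstructured's implementation: the function name `basic_chunk`, the `separator` argument, and the exact handling of the soft limit are assumptions made for this example, and elements are reduced to plain strings.

```python
from typing import List, Optional


def basic_chunk(
    elements: List[str],
    max_characters: int = 500,
    new_after_n_chars: Optional[int] = None,
    separator: str = "\n\n",
) -> List[str]:
    """Greedily pack consecutive text elements into chunks (illustrative only)."""
    # Assumption: with no soft limit given, the soft limit equals the hard limit.
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks: List[str] = []
    current = ""
    for text in elements:
        if len(text) > max_characters:
            # Only an element that is too large on its own gets split.
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue
        candidate = text if not current else current + separator + text
        if len(candidate) > max_characters:
            # The next element does not fit, so close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
        if len(current) >= soft_max:
            # Soft limit reached: stop adding to this chunk.
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, lowering `new_after_n_chars` below `max_characters` yields smaller chunks without changing the hard ceiling, which is how the two settings are described as interacting.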

## Chunk By Title strategy
## Chunk by title strategy

The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents.
A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.

To use this chunking strategy, choose **Chunk By Title** for **Chunker Type** in the **Chunk** section of a workflow.
To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow.
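
The section-boundary rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the platform's code: elements are modeled as hypothetical `(type, text)` tuples, a `"Title"` element is assumed to mark the start of a section, and splitting of oversized elements is left out.

```python
from typing import List, Tuple


def chunk_by_title_sketch(
    elements: List[Tuple[str, str]],
    max_characters: int = 500,
) -> List[str]:
    """Pack elements into chunks, never mixing two sections (illustrative only)."""
    chunks: List[str] = []
    current = ""
    for kind, text in elements:
        starts_section = kind == "Title"
        candidate = text if not current else current + "\n\n" + text
        if current and (starts_section or len(candidate) > max_characters):
            # A new section closes the chunk even if the element would still fit.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The same shape would apply to by-page chunking if the boundary test compared page numbers instead of checking for a title.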

## Chunk By Page strategy
## Chunk by page strategy

The by-page chunking strategy attempts to preserve page boundaries when determining the chunks' contents.
A single chunk should not contain text that occurred in two different pages. When a new page starts, the existing
chunk is closed and a new one is started, even if the next element would fit in the prior chunk.

To use this chunking strategy, choose **Chunk By Page** for **Chunker Type** in the **Chunk** section of a workflow.
To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow.

## Chunk By Similarity strategy
## Chunk by similarity strategy

The by-similarity chunking strategy uses the
[sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model
to identify topically similar sequential elements and combines them into chunks.

As with the other chunking strategies, chunks will never exceed the absolute maximum chunk size set by **Max Characters**. For this reason,
As with the other chunking strategies, chunks will never exceed the absolute maximum chunk size set by **Max characters**. For this reason,
not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can
guarantee that two elements with low similarity will not be combined in a single chunk.

To use this chunking strategy, choose **Chunk By Similarity** for **Chunker Type** in the **Chunk** section of a workflow.
To use this chunking strategy, choose **Chunk by similarity** in the **Chunkers** section of a **Chunker** node in a workflow.

You can control the level of topic similarity you require for elements to have by setting [Similarity Threshold](#similarity-threshold).
You can control the level of topic similarity you require for elements to have by setting [Similarity threshold](#similarity-threshold).

## Max Characters setting
## Max characters setting

Specifies the absolute maximum number of characters in a chunk.

To specify this setting, enter a number into the **Max Characters** field in the **Chunk** section of a workflow.
To specify this setting, enter a number into the **Max characters** field.

This setting applies to all of the chunking strategies: **Basic**, **Chunk By Title**, **Chunk By Page**, and **Chunk By Similarity**.
This setting applies to all of the chunking strategies.

## Combine Text Under N Characters setting
## Combine text under n characters setting

Combines elements from a section into a chunk until a section reaches a length of this many characters.

To specify this setting, enter a number into the **Combine Text Under N Characters** field in the **Chunk** section of a workflow.

This setting applies only to the chunking strategy **Chunk By Title**.
To specify this setting, enter a number into the **Combine text under n chars** field.

## Include Original Elements setting
This setting applies only to the chunking strategy **Chunk by title**.

If this box is checked, the elements that were used to form a chunk appear in the `metadata` field's `orig_elements` field for that chunk.
## Include original elements setting

The **Include Original Elements** check box is in the **Chunk** section of a workflow.
If the **Include original elements** box is checked, the elements that were used to form a chunk appear in the `metadata` field's `orig_elements` field for that chunk.

This setting applies to all of the chunking strategies: **Basic**, **Chunk By Title**, **Chunk By Page**, and **Chunk By Similarity**.
This setting applies to all of the chunking strategies.

## Multipage Sections setting
## Multipage sections setting

If this box is checked, this allows sections to span multiple pages.
If the **Multipage sections** box is checked, this allows sections to span multiple pages.

The **Multipage Sections** check box is in the **Chunk** section of a workflow.
This setting applies only to the chunking strategy **Chunk by title**.

This setting applies only to the chunking strategy **Chunk By Title**.

## New After N Characters setting
## New after n characters setting

Closes new sections after reaching a length of this many characters. This is an approximate limit.

To specify this setting, enter a number into the **New After N Characters** field in the **Chunk** section of a workflow.
To specify this setting, enter a number into the **New after n characters** field.

This setting applies only to the chunking strategies **Basic**, **Chunk By Title**, and **Chunk By Page**.
This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**.

## Overlap setting

Applies a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting.

To specify this setting, enter a number into the **Overlap** field in the **Chunk** section of a workflow.

This setting applies only to the chunking strategies **Basic**, **Chunk By Title**, and **Chunk By Page**.
To specify this setting, enter a number into the **Overlap** field.

## Overlap All setting
This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**.
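
To make the overlap behavior concrete, the following sketch splits one oversized text into pieces and prefixes each later piece with the trailing characters of the previous one. The function name `split_with_overlap` is hypothetical, and the sketch assumes the overlap counts toward the maximum length of each piece; the platform may handle this detail differently.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split an oversized text so later pieces repeat the prior piece's tail."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    pieces: List[str] = []
    step = max_characters - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the final piece already reaches the end of the text
        start += step
    return pieces
```

With `max_characters=4` and `overlap=1`, the text `"abcdefghij"` becomes `["abcd", "defg", "ghij"]`: each piece after the first begins with the last character of the piece before it.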

If this box is checked, applies overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units.
## Overlap all setting

The **Overlap All** check box is in the **Chunk** section of a workflow.
If the **Overlap all** box is checked, applies overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units.

This setting applies only to the chunking strategies **Basic**, **Chunk By Title**, and **Chunk By Page**.
This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**.

## Similarity Threshold setting
## Similarity threshold setting

Specifies the minimum similarity that text in consecutive elements must have to be included in the same chunk.
This must be a value between `0.0` and `1.0`, exclusive (`0.01` to `0.99`). The default is `0.5` if not otherwise specified.

To specify this setting, enter a number into the **Similarity Threshold** field in the **Chunk** section of a workflow.
To specify this setting, enter a number into the **Similarity threshold** field.

This setting applies only to the chunking strategy **Chunk By Similarity**.
This setting applies only to the chunking strategy **Chunk by similarity**.
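
The effect of the threshold can be sketched with toy vectors. This example does not call the embedding model named above: the two-dimensional vectors are made up, cosine similarity is assumed as the measure, and comparing each element only with its immediate predecessor is an assumption of this sketch rather than a documented detail.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chunk_by_similarity_sketch(
    elements: List[Tuple[str, Sequence[float]]],
    similarity_threshold: float = 0.5,
    max_characters: int = 500,
) -> List[str]:
    """Merge consecutive elements whose vectors are similar enough (illustrative only)."""
    chunks: List[str] = []
    current = ""
    previous: Optional[Sequence[float]] = None
    for text, vector in elements:
        candidate = text if not current else current + "\n\n" + text
        similar = previous is not None and cosine(previous, vector) >= similarity_threshold
        if current and similar and len(candidate) <= max_characters:
            current = candidate
        else:
            # Either the topic changed or the chunk is full: start a new chunk.
            if current:
                chunks.append(current)
            current = text
        previous = vector
    if current:
        chunks.append(current)
    return chunks
```

Raising `similarity_threshold` in this sketch produces more, smaller chunks, because fewer neighboring elements qualify as similar enough to merge.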

## Learn more

14 changes: 8 additions & 6 deletions platform/destinations/astradb.mdx
@@ -12,12 +12,14 @@ import AstraDBPrerequisites from '/snippets/general-shared-text/astradb.mdx';

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Astra DB**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Astra DB**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import AstraDBFields from '/snippets/general-shared-text/astradb-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/azure-cognitive-search.mdx
@@ -12,12 +12,14 @@ import AzureCognitiveSearchPrerequisites from '/snippets/general-shared-text/azu

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Azure Cognitive Search**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Azure Cognitive Search**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import AzureCognitiveSearchFields from '/snippets/general-shared-text/azure-cognitive-search-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/chroma.mdx
@@ -12,12 +12,14 @@ import ChromaPrerequisites from '/snippets/general-shared-text/chroma.mdx';

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Chroma**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Chroma**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import ChromaFields from '/snippets/general-shared-text/chroma-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/databricks.mdx
@@ -12,12 +12,14 @@ import DatabricksPrerequisites from '/snippets/general-shared-text/databricks-vo

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Databricks**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Databricks**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import DatabricksFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/elasticsearch.mdx
@@ -12,12 +12,14 @@ import ElasticsearchPrerequisites from '/snippets/general-shared-text/elasticsea

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Elasticsearch**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Elasticsearch**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import ElasticsearchFields from '/snippets/general-shared-text/elasticsearch-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/milvus.mdx
@@ -22,12 +22,14 @@ import MilvusPrerequisites from '/snippets/general-shared-text/milvus.mdx';

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Milvus**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Milvus**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import MilvusFields from '/snippets/general-shared-text/milvus-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/mongodb.mdx
@@ -12,12 +12,14 @@ import MongoDBPrerequisites from '/snippets/general-shared-text/mongodb.mdx';

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **MongoDB**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **MongoDB**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import MongoDBFields from '/snippets/general-shared-text/mongodb-platform.mdx';

14 changes: 8 additions & 6 deletions platform/destinations/opensearch.mdx
@@ -12,12 +12,14 @@ import OpenSearchPrerequisites from '/snippets/general-shared-text/opensearch.md

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **OpenSearch**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **OpenSearch**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import OpenSearchFields from '/snippets/general-shared-text/opensearch-platform.mdx';

20 changes: 12 additions & 8 deletions platform/destinations/overview.mdx
@@ -5,20 +5,24 @@ description: Destination connectors in the Unstructured Platform are designed to

![Destinations in the sidebar](/img/platform/Destinations-Sidebar.png)

To see your existing destination connectors, on the sidebar, click **Destinations**.
To see your existing destination connectors, on the sidebar, click **Connectors**, and then click **Destinations**.

To create a destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select the connector type that matches your destination.
4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:
1. In the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. For **Name**, enter some unique name for this connector.
5. In the **Provider** area, click the destination location type that matches yours.
6. Click **Continue**.
7. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:

- [Astra DB](/platform/destinations/astradb)
- [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
- [Milvus](/platform/destinations/milvus)
- [MongoDB](/platform/destinations/mongodb)
- [Pinecone](/platform/destinations/pinecone)
- [S3](/platform/destinations/s3)

5. Click **Save and Test**.
6. Click **Close**.
8. If a **Continue** button appears, click it, and fill in any additional settings fields.
9. Click **Save and Test**.
14 changes: 8 additions & 6 deletions platform/destinations/pinecone.mdx
@@ -24,12 +24,14 @@ import PineconePrerequisites from '/snippets/general-shared-text/pinecone.mdx';

To create the destination connector:

1. On the sidebar, click **Destinations**.
2. Click **New Destination**.
3. In the **Type** drop-down list, select **Pinecone**.
4. Fill in the fields as described later on this page.
5. Click **Save and Test**.
6. Click **Close**.
1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Give the connector some unique **Name**.
5. In the **Provider** area, click **Pinecone**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import PineconeFields from '/snippets/general-shared-text/pinecone-platform.mdx';
