diff --git a/img/platform/Choose-Workflow-Type.png b/img/platform/Choose-Workflow-Type.png new file mode 100644 index 00000000..54c24e75 Binary files /dev/null and b/img/platform/Choose-Workflow-Type.png differ diff --git a/img/platform/Create-Workflow.png b/img/platform/Create-Workflow.png deleted file mode 100644 index a43ccf03..00000000 Binary files a/img/platform/Create-Workflow.png and /dev/null differ diff --git a/img/platform/Destinations-Sidebar.png b/img/platform/Destinations-Sidebar.png index d1e00f71..ea3e6ea8 100644 Binary files a/img/platform/Destinations-Sidebar.png and b/img/platform/Destinations-Sidebar.png differ diff --git a/img/platform/GoToPlatform.png b/img/platform/GoToPlatform.png new file mode 100644 index 00000000..3e3b5f33 Binary files /dev/null and b/img/platform/GoToPlatform.png differ diff --git a/img/platform/Job-Complete.png b/img/platform/Job-Complete.png index d1b3e562..765c6fc6 100644 Binary files a/img/platform/Job-Complete.png and b/img/platform/Job-Complete.png differ diff --git a/img/platform/Job-Details.png b/img/platform/Job-Details.png deleted file mode 100644 index 51f4dfd5..00000000 Binary files a/img/platform/Job-Details.png and /dev/null differ diff --git a/img/platform/Jobs-Sidebar.png b/img/platform/Jobs-Sidebar.png index 0b96dd98..80c32cc2 100644 Binary files a/img/platform/Jobs-Sidebar.png and b/img/platform/Jobs-Sidebar.png differ diff --git a/img/platform/Node-Usage-Hints.png b/img/platform/Node-Usage-Hints.png new file mode 100644 index 00000000..5d28fd25 Binary files /dev/null and b/img/platform/Node-Usage-Hints.png differ diff --git a/img/platform/Run-Job.png b/img/platform/Run-Job.png deleted file mode 100644 index 5c1aa787..00000000 Binary files a/img/platform/Run-Job.png and /dev/null differ diff --git a/img/platform/Run-Workflow.png b/img/platform/Run-Workflow.png deleted file mode 100644 index 34f849c3..00000000 Binary files a/img/platform/Run-Workflow.png and /dev/null differ diff --git 
a/img/platform/Select-Job.png b/img/platform/Select-Job.png index 9d78ec93..1e6157df 100644 Binary files a/img/platform/Select-Job.png and b/img/platform/Select-Job.png differ diff --git a/img/platform/Sources-Sidebar.png b/img/platform/Sources-Sidebar.png index 408b6122..ffa6a65b 100644 Binary files a/img/platform/Sources-Sidebar.png and b/img/platform/Sources-Sidebar.png differ diff --git a/img/platform/Start-Screen.png b/img/platform/Start-Screen.png new file mode 100644 index 00000000..9fa3ee3d Binary files /dev/null and b/img/platform/Start-Screen.png differ diff --git a/img/platform/Workflow-Actions.png b/img/platform/Workflow-Actions.png deleted file mode 100644 index b5eb2a70..00000000 Binary files a/img/platform/Workflow-Actions.png and /dev/null differ diff --git a/img/platform/Workflow-Add-Node.png b/img/platform/Workflow-Add-Node.png new file mode 100644 index 00000000..ffa97bb1 Binary files /dev/null and b/img/platform/Workflow-Add-Node.png differ diff --git a/img/platform/Workflow-Designer.png b/img/platform/Workflow-Designer.png new file mode 100644 index 00000000..1d90a4d4 Binary files /dev/null and b/img/platform/Workflow-Designer.png differ diff --git a/img/platform/Workflow-Details.png b/img/platform/Workflow-Details.png new file mode 100644 index 00000000..373a1d78 Binary files /dev/null and b/img/platform/Workflow-Details.png differ diff --git a/img/platform/Workflows-Sidebar.png b/img/platform/Workflows-Sidebar.png index 6c7e7ce3..ebe2b147 100644 Binary files a/img/platform/Workflows-Sidebar.png and b/img/platform/Workflows-Sidebar.png differ diff --git a/platform/chunking.mdx b/platform/chunking.mdx index 17348463..9f9b0755 100644 --- a/platform/chunking.mdx +++ b/platform/chunking.mdx @@ -7,23 +7,20 @@ the limits of an embedding model and to improve retrieval precision. The goal is that contain only the information that is relevant to a user's query. 
You can specify if and how Unstructured chunks those elements, based on your intended end use. -If you choose something other than **None** for **Chunker Type** in the **Chunk** section of a workflow, Unstructured will attempt to chunk -the partitioned elements. - During chunking, Unstructured uses a basic chunking strategy that attempts to combine two or more consecutive text elements -into each chunk that fits together within **Max Characters**. To determine the best **Max Characters** length, see the documentation +into each chunk that fits together within **Max characters**. To determine the best **Max characters** length, see the documentation for the embedding model that you want to use. You can further control this behavior with by-title, by-page, or by-similarity chunking strategies. -In all cases, Unstructured will only split individual elements if they exceed the specified **Max Characters** length. +In all cases, Unstructured will only split individual elements if they exceed the specified **Max characters** length. After chunking, you will have document elements of only the following types: - `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a -combination of two or more original text elements that together fit within the **Max Characters** length. It can also be a single +combination of two or more original text elements that together fit within the **Max characters** length. It can also be a single element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original text element that was too big to fit in one chunk and required splitting. -- `Table`: A table element is not combined with other elements, and if it fits within **Max Characters** it will remain as is. -- `TableChunk`: Large tables that exceed **Max Characters** are split into special `TableChunk` elements. 
+- `Table`: A table element is not combined with other elements, and if it fits within **Max characters** it will remain as is. +- `TableChunk`: Large tables that exceed **Max characters** are split into special `TableChunk` elements. Here are a few examples: @@ -64,109 +61,103 @@ Here are a few examples: The following sections provide information about the available chunking strategies and their settings. -You can change a workflow's predefined strategy only through [Custom](/platform/workflows#custom-workflow-settings) workflow settings. +You can change a workflow's predefined strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. ## Basic chunking strategy -The basic chunking strategy uses only **Max Characters** and **New After N Characters** to combine sequential elements to maximally fill each chunk. +The basic chunking strategy uses only **Max characters** and **New after n characters** to combine sequential elements to maximally fill each chunk. This strategy does not use section boundaries, page boundaries, or content similarities to determine the chunks' contents. -To use this chunking strategy, choose **Basic** for **Chunker Type** in the **Chunk** section of a workflow. +To use this chunking strategy, choose **Chunk by character** in the **Chunkers** section of a **Chunker** node in a workflow. -## Chunk By Title strategy +## Chunk by title strategy The by-title chunking strategy attempts to preserve section boundaries when determining the chunks' contents. A single chunk should not contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. -To use this chunking strategy, choose **Chunk By Title** for **Chunker Type** in the **Chunk** section of a workflow. +To use this chunking strategy, choose **Chunk by title** in the **Chunkers** section of a **Chunker** node in a workflow. 
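The combining behavior these strategies share can be sketched in a few lines. This is a simplified illustration only, not Unstructured's actual implementation — real chunking also carries element metadata, handles tables specially, and supports overlap:

```python
# Simplified sketch of the basic chunking behavior: combine consecutive text
# elements into chunks no longer than max_characters, splitting any single
# element that is itself too big to fit in one chunk.
# Illustration only -- not Unstructured's actual implementation.

def basic_chunk(element_texts, max_characters=500):
    chunks, current = [], ""
    for text in element_texts:
        # Split an oversized element into max_characters-sized fragments.
        while len(text) > max_characters:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(text[:max_characters])
            text = text[max_characters:]
        # Start a new chunk if adding this element would exceed the limit.
        if current and len(current) + 1 + len(text) > max_characters:
            chunks.append(current)
            current = text
        else:
            current = f"{current} {text}" if current else text
    if current:
        chunks.append(current)
    return chunks

chunks = basic_chunk(["First paragraph.", "Second paragraph.", "x" * 60],
                     max_characters=40)
```

Here the two short paragraphs combine into one chunk, while the oversized element is split across two.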
-## Chunk By Page strategy +## Chunk by page strategy The by-page chunking strategy attempts to preserve page boundaries when determining the chunks' contents. A single chunk should not contain text that occurred in two different pages. When a new page starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. -To use this chunking strategy, choose **Chunk By Page** for **Chunker Type** in the **Chunk** section of a workflow. +To use this chunking strategy, choose **Chunk by page** in the **Chunkers** section of a **Chunker** node in a workflow. -## Chunk By Similarity strategy +## Chunk by similarity strategy The by-similarity chunking strategy uses the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combines them into chunks. -As with the other chunking strategies, chunks will never exceed the absolute maximum chunk size set by **Max Characters**. For this reason, +As with the other chunking strategies, chunks will never exceed the absolute maximum chunk size set by **Max characters**. For this reason, not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can guarantee that two elements with low similarity will not be combined in a single chunk. -To use this chunking strategy, choose **Chunk By Similarity** for **Chunker Type** in the **Chunk** section of a workflow. +To use this chunking strategy, choose **Chunk by similarity** in the **Chunkers** section of a **Chunker** node in a workflow. -You can control the level of topic similarity you require for elements to have by setting [Similarity Threshold](#similarity-threshold). +You can control the level of topic similarity that you require elements to have by setting [Similarity threshold](#similarity-threshold). 
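The by-similarity behavior can be sketched as follows. This is an illustration only — the platform uses the multi-qa-mpnet-base-dot-v1 model to produce real embeddings; the toy two-dimensional vectors here are stand-ins:

```python
# Simplified sketch of by-similarity chunking: consecutive elements are
# combined only when their embedding similarity meets the threshold, and a
# chunk never exceeds max_characters.
# Illustration only -- not Unstructured's actual implementation.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def chunk_by_similarity(elements, threshold=0.5, max_characters=500):
    """elements: list of (text, embedding) pairs, in document order."""
    chunks, current_text, prev_vec = [], "", None
    for text, vec in elements:
        fits = len(current_text) + 1 + len(text) <= max_characters
        similar = prev_vec is not None and cosine(prev_vec, vec) >= threshold
        if current_text and similar and fits:
            current_text += " " + text
        else:
            if current_text:
                chunks.append(current_text)
            current_text = text
        prev_vec = vec
    if current_text:
        chunks.append(current_text)
    return chunks

elements = [
    ("Cats purr.", [1.0, 0.0]),
    ("Kittens nap.", [0.9, 0.1]),  # similar to the prior element: combined
    ("GDP rose 3%.", [0.0, 1.0]),  # dissimilar: starts a new chunk
]
chunks = chunk_by_similarity(elements, threshold=0.5)
```

The two pet-related elements land in one chunk; the economics element starts a new one.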
-## Max Characters setting +## Max characters setting Specifies the absolute maximum number of characters in a chunk. -To specify this setting, enter a number into the **Max Characters** field in the **Chunk** section of a workflow. +To specify this setting, enter a number into the **Max characters** field. -This setting applies to all of the chunking strategies: **Basic**, **Chunk By Title**, **Chunk By Page**, and **Chunk By Similarity**. +This setting applies to all of the chunking strategies. -## Combine Text Under N Characters setting +## Combine text under n characters setting Combines elements from a section into a chunk until a section reaches a length of this many characters. -To specify this setting, enter a number into the **Combine Text Under N Characters** field in the **Chunk** section of a workflow. - -This setting applies only to the chunking strategy **Chunk By Title**. +To specify this setting, enter a number into the **Combine text under n chars** field. -## Include Original Elements setting +This setting applies only to the chunking strategy **Chunk by title**. -If this box is checked, the elements that were used to form a chunk appear in the `metadata` field's `orig_elements` field for that chunk. +## Include original elements setting -The **Include Original Elements** check box is in the **Chunk** section of a workflow. +If the **Include original elements** box is checked, the elements that were used to form a chunk appear in the `metadata` field's `orig_elements` field for that chunk. -This setting applies to all of the chunking strategies: **Basic**, **Chunk By Title**, **Chunk By Page**, and **Chunk By Similarity**. +This setting applies to all of the chunking strategies. -## Multipage Sections setting +## Multipage sections setting -If this box is checked, this allows sections to span multiple pages. +If the **Multipage sections** box is checked, this allows sections to span multiple pages. 
-The **Multipage Sections** check box is in the **Chunk** section of a workflow. +This setting applies only to the chunking strategy **Chunk by title**. -This setting applies only to the chunking strategy **Chunk By Title**. - -## New After N Characters setting +## New after n characters setting Closes new sections after reaching a length of this many characters. This is an approximate limit. -To specify this setting, enter a number into the **New After N Characters** field in the **Chunk** section of a workflow. +To specify this setting, enter a number into the **New after n characters** field. -This setting applies only to the chunking strategies **Basic**, **Chunk By Title**, and **Chunk By Page**. +This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**. ## Overlap setting Applies a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. -To specify this setting, enter a number into the **Overlap** field in the **Chunk** section of a workflow. - -This setting applies only to the chunking strategies **Basic**, **Chunk By Title**, and **Chunk By Page**. +To specify this setting, enter a number into the **Overlap** field. -## Overlap All setting +This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**. -If this box is checked, applies overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. +## Overlap all setting -The **Overlap All** check box is in the **Chunk** section of a workflow. +If the **Overlap all** box is checked, applies overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. 
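The overlap behavior on text-split chunks amounts to a sliding window. A minimal sketch, not the actual implementation:

```python
# Simplified sketch of text-splitting with overlap: when an oversized element
# is split, each chunk after the first begins with the trailing characters of
# the previous chunk. Illustration only -- not the actual implementation.

def split_with_overlap(text, max_characters, overlap):
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        # Advance so the next chunk re-reads `overlap` trailing characters.
        start += max_characters - overlap
    return chunks

chunks = split_with_overlap("abcdefghij", max_characters=4, overlap=2)
```

Each chunk after the first repeats the prior chunk's last two characters, preserving context across the split.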
-This setting applies only to the chunking strategies **Basic**, **Chunk By Title**, and **Chunk By Page**. +This setting applies only to the chunking strategies **Chunk by character**, **Chunk by title**, and **Chunk by page**. -## Similarity Threshold setting +## Similarity threshold setting Specifies the minimum similarity that text in consecutive elements must have to be included in the same chunk. This must be a value between `0.0` and `1.0`, exclusive (`0.01` to `0.99`). The default is `0.5` if not otherwise specified. -To specify this setting, enter a number into the **Similarity Threshold** field in the **Chunk** section of a workflow. +To specify this setting, enter a number into the **Similarity threshold** field. -This setting applies only to the chunking strategy **Chunk By Similarity**. +This setting applies only to the chunking strategy **Chunk by similarity**. ## Learn more diff --git a/platform/destinations/astradb.mdx b/platform/destinations/astradb.mdx index c9bd7a6d..dc53de43 100644 --- a/platform/destinations/astradb.mdx +++ b/platform/destinations/astradb.mdx @@ -12,12 +12,14 @@ import AstraDBPrerequisites from '/snippets/general-shared-text/astradb.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Astra DB**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Astra DB**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. 
import AstraDBFields from '/snippets/general-shared-text/astradb-platform.mdx'; diff --git a/platform/destinations/azure-cognitive-search.mdx b/platform/destinations/azure-cognitive-search.mdx index fdb7f09d..73fb5005 100644 --- a/platform/destinations/azure-cognitive-search.mdx +++ b/platform/destinations/azure-cognitive-search.mdx @@ -12,12 +12,14 @@ import AzureCognitiveSearchPrerequisites from '/snippets/general-shared-text/azu To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Azure Cognitive Search**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Azure Cognitive Search**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import AzureCognitiveSearchFields from '/snippets/general-shared-text/azure-cognitive-search-platform.mdx'; diff --git a/platform/destinations/chroma.mdx b/platform/destinations/chroma.mdx index 95b9d5c3..dea3d65f 100644 --- a/platform/destinations/chroma.mdx +++ b/platform/destinations/chroma.mdx @@ -12,12 +12,14 @@ import ChromaPrerequisites from '/snippets/general-shared-text/chroma.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Chroma**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Chroma**. +6. Click **Continue**. +7. 
Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import ChromaFields from '/snippets/general-shared-text/chroma-platform.mdx'; diff --git a/platform/destinations/databricks.mdx b/platform/destinations/databricks.mdx index 90dea652..7a9402aa 100644 --- a/platform/destinations/databricks.mdx +++ b/platform/destinations/databricks.mdx @@ -12,12 +12,14 @@ import DatabricksPrerequisites from '/snippets/general-shared-text/databricks-vo To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Databricks**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Databricks**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import DatabricksFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx'; diff --git a/platform/destinations/elasticsearch.mdx b/platform/destinations/elasticsearch.mdx index 365b3422..cdff3651 100644 --- a/platform/destinations/elasticsearch.mdx +++ b/platform/destinations/elasticsearch.mdx @@ -12,12 +12,14 @@ import ElasticsearchPrerequisites from '/snippets/general-shared-text/elasticsea To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Elasticsearch**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. 
In the **Provider** area, click **Elasticsearch**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import ElasticsearchFields from '/snippets/general-shared-text/elasticsearch-platform.mdx'; diff --git a/platform/destinations/milvus.mdx b/platform/destinations/milvus.mdx index e07a857d..7806fdbb 100644 --- a/platform/destinations/milvus.mdx +++ b/platform/destinations/milvus.mdx @@ -22,12 +22,14 @@ import MilvusPrerequisites from '/snippets/general-shared-text/milvus.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Milvus**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Milvus**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import MilvusFields from '/snippets/general-shared-text/milvus-platform.mdx'; diff --git a/platform/destinations/mongodb.mdx b/platform/destinations/mongodb.mdx index 9d09c3c4..bf006c2e 100644 --- a/platform/destinations/mongodb.mdx +++ b/platform/destinations/mongodb.mdx @@ -12,12 +12,14 @@ import MongoDBPrerequisites from '/snippets/general-shared-text/mongodb.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **MongoDB**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. 
In the **Provider** area, click **MongoDB**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import MongoDBFields from '/snippets/general-shared-text/mongodb-platform.mdx'; diff --git a/platform/destinations/opensearch.mdx b/platform/destinations/opensearch.mdx index 077905a8..dae57340 100644 --- a/platform/destinations/opensearch.mdx +++ b/platform/destinations/opensearch.mdx @@ -12,12 +12,14 @@ import OpenSearchPrerequisites from '/snippets/general-shared-text/opensearch.md To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **OpenSearch**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **OpenSearch**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import OpenSearchFields from '/snippets/general-shared-text/opensearch-platform.mdx'; diff --git a/platform/destinations/overview.mdx b/platform/destinations/overview.mdx index 2b2758d5..e3aa3042 100644 --- a/platform/destinations/overview.mdx +++ b/platform/destinations/overview.mdx @@ -5,20 +5,24 @@ description: Destination connectors in the Unstructured Platform are designed to ![Destinations in the sidebar](/img/platform/Destinations-Sidebar.png) -To see your existing destination connectors, on the sidebar, click **Destinations**. +To see your existing destination connectors, on the sidebar, click **Connectors**, and then click **Destinations**. To create a destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. 
In the **Type** drop-down list, select the connector type that matches your destination. -4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list: +1. In the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. For **Name**, enter some unique name for this connector. +5. In the **Provider** area, click the destination location type that matches yours. +6. Click **Continue**. +7. Fill in the fields according to your connector type. To learn how, click your connector type in the following list: + - [Astra DB](/platform/destinations/astradb) - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search) - [Milvus](/platform/destinations/milvus) - [MongoDB](/platform/destinations/mongodb) - [Pinecone](/platform/destinations/pinecone) - [S3](/platform/destinations/s3) - -5. Click **Save and Test**. -6. Click **Close**. \ No newline at end of file + +8. If a **Continue** button appears, click it, and fill in any additional settings fields. +9. Click **Save and Test**. \ No newline at end of file diff --git a/platform/destinations/pinecone.mdx b/platform/destinations/pinecone.mdx index 60f52cef..61975ced 100644 --- a/platform/destinations/pinecone.mdx +++ b/platform/destinations/pinecone.mdx @@ -24,12 +24,14 @@ import PineconePrerequisites from '/snippets/general-shared-text/pinecone.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Pinecone**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Pinecone**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. 
Click **Save and Test**. import PineconeFields from '/snippets/general-shared-text/pinecone-platform.mdx'; diff --git a/platform/destinations/s3.mdx b/platform/destinations/s3.mdx index e720b2cc..09158d32 100644 --- a/platform/destinations/s3.mdx +++ b/platform/destinations/s3.mdx @@ -12,12 +12,14 @@ import S3Prerequisites from '/snippets/general-shared-text/s3.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Amazon S3**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Amazon S3**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import S3Fields from '/snippets/general-shared-text/s3-platform.mdx'; diff --git a/platform/destinations/weaviate.mdx b/platform/destinations/weaviate.mdx index 1bf6a199..dccdd753 100644 --- a/platform/destinations/weaviate.mdx +++ b/platform/destinations/weaviate.mdx @@ -12,12 +12,14 @@ import WeaviatePrerequisites from '/snippets/general-shared-text/weaviate.mdx'; To create the destination connector: -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select **Weaviate**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Destinations**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Weaviate**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. 
import WeaviateFields from '/snippets/general-shared-text/weaviate-platform.mdx'; diff --git a/platform/embedding.mdx b/platform/embedding.mdx index 0361422c..9485f2ff 100644 --- a/platform/embedding.mdx +++ b/platform/embedding.mdx @@ -59,36 +59,9 @@ on Hugging Face: ## Generate embeddings -To generate embeddings, choose one of the following embedding providers in the **Vendor** drop-down list in the **Embed** section of a workflow: +To generate embeddings, choose one of the following embedding providers in the **Providers** section of an **Embedder** node in a workflow: -You can change a workflow's predefined provider only through [Custom](/platform/workflows#custom-workflow-settings) workflow settings. +You can change a workflow's predefined provider only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. -- **OpenAI**: Use [OpenAI](https://openai.com) to generate embeddings. Also choose the embedding model to use, from one of the following: - - - **text-embedding-3-small** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings). - - **text-embedding-3-large** (3072 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings). - - **Ada 002 (Text)** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings). - -- **Anthropic**: Use [Anthropic](https://www.anthropic.com) to generate embeddings. Also choose the embedding model to use, from one of the following: - - - **voyage-2** (1024 dimensions): [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models). - - **voyage-large-2** (1536 dimensions): [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models). - - **voyage-code-2** (1536 dimensions): [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models). 
- - **voyage-lite-02-instruct** (1024 dimensions): [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models). - -- **Hugging Face**: Use [Hugging Face](https://huggingface.co) to generate embeddings. Also choose the embedding model to use, from one of the following: - - - **nvidia/NV-Embed-v1** (4096 dimensions): [Learn more](https://huggingface.co/nvidia/NV-Embed-v1). - - **voyage-large-2-instruct** (1024 dimensions): [Learn more](https://huggingface.co/voyageai/voyage-large-2-instruct). - - **stella_en_400M_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_400M_v5). - - **stella_en_1.5B_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_1.5B_v5). - - **Alibaba-NLP/gte-Qwen2-7B-instruct** (3584 dimensions): [Learn more](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct). - -- **OctoAI**: Use [OctoAI](https://octo.ai) to generate embeddings. Also choose the embedding model to use, from one of the following: - - - **GTE Large** (1024 dimensions): [Learn more](https://octo.ai/blog/introducing-octoais-embedding-api-to-power-your-rag-needs/). - -- **Vertex AI**: Use [Vertex AI](https://cloud.google.com/vertex-ai) to generate embeddings. Also choose the embedding model to use, from one of the following: - - - **textembedding-gecko@003** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions). - - **text-embedding-004** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions). +- **OpenAI**: Use [OpenAI](https://openai.com) to generate embeddings. +- **Vertex AI**: Use [Vertex AI](https://cloud.google.com/vertex-ai) to generate embeddings. 
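Whichever provider you choose, the dimensionality of the generated vectors must match the vector index at your destination. A quick sanity check along these lines can catch mismatches early (a hypothetical helper, not part of the platform):

```python
# Hypothetical sanity check, not part of the Unstructured Platform: verify
# that every generated embedding has the dimensionality your destination's
# vector index expects.

def validate_embeddings(vectors, expected_dim):
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, expected {expected_dim}"
            )
    return True

ok = validate_embeddings([[0.1] * 1536, [0.2] * 1536], expected_dim=1536)
```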
\ No newline at end of file diff --git a/platform/jobs.mdx b/platform/jobs.mdx index c928df3c..7f98a97a 100644 --- a/platform/jobs.mdx +++ b/platform/jobs.mdx @@ -9,7 +9,7 @@ To view the jobs dashboard, on the sidebar, click **Jobs**. The jobs dashboard provides a centralized view for managing and monitoring the execution of data processing tasks within your workflows. -The jobs dashboard lists each job and its associated **Workflow** name, **ID**, **Status**, and **Job Created Time**. +The jobs dashboard lists each job and its associated **Status**, **Job ID**, **Created** date and time, **Workflow** name, and **Runtime** duration. Each job's status, shown in the **Status** column, can be: @@ -19,12 +19,10 @@ Each job's status, shown in the **Status** column, can be: * **Finished**: The job has completed successfully. -* **Failed**: The job has some errors. +* **Failed**: The job has errors. ## Run a job -![Run a job](/img/platform/Run-Job.png) - You must first have an existing workflow to run a job against. @@ -33,26 +31,24 @@ Each job's status, shown in the **Status** column, can be: To see your existing workflows, on the sidebar, click **Workflows**. -1. On the sidebar, click **Jobs**, and then click **Run Job**. +To run a job, on the sidebar, click **Workflows**, and then click **Run** in the row for the workflow that you want to run. -2. In the **Select a Workflow** dropdown list, select an existing workflow, and then click **Run**. - ## Monitor a job -![Job details](/img/platform/Job-Details.png) - -The **Job Details** page is a comprehensive section for monitoring the specific details of jobs executed within a particular workflow. To access this page, click the specific job's **ID** link on the jobs dashboard. - -The **Job Details** page shows: - -* **Execution Time**: When the job started. +![Completed job](/img/platform/Job-Complete.png) -* **Documents**: The number of documents that were attempted to be processed. 
+The job details pane is a comprehensive section for monitoring the specific details of jobs executed within a particular workflow. To access this pane, click the specific job on the jobs dashboard. -* **Success**: The number of successfully processed documents. +Clicking the **Details** button shows: -* **Failed**: The number of documents that failed to process. +- The job's ID. +- The job's start date. +- The source and destination connectors for the job. +- Information about the files that were processed in each of the job's stages. This information shows how many files are: -* **Status**: The status of each step in the job, including each step's **Ready**, **In Progress**, **Success**, and **Failure** state. + - **Ready**: Waiting to be processed. + - **In Progress**: Being processed. + - **Success**: Successfully processed. + - **Failure**: Failed to process successfully. -To see the job detail's metadata, click **View Payload**. \ No newline at end of file +To see the job's logs, click the **Logs** button. \ No newline at end of file diff --git a/platform/overview.mdx b/platform/overview.mdx index 0933ad9d..e171a77d 100644 --- a/platform/overview.mdx +++ b/platform/overview.mdx @@ -1,25 +1,85 @@ --- title: Overview -description: Destination connectors in the Unstructured Platform are designed to specify the endpoint for data processed within the platform. These connectors ensure that the transformed and analyzed data is securely and efficiently transferred to a storage system for future use, often to a vector database for tasks that involve high-speed retrieval and advanced data analytics operations. --- -![Destinations in the sidebar](/img/platform/Destinations-Sidebar.png) +The Unstructured Platform is a no-code, pay-as-you-go platform for transforming your unstructured data into data that is ready for Retrieval Augmented Generation (RAG). -To see your existing destination connectors, on the sidebar, click **Destinations**.
+![Start screen](/img/platform/Start-Screen.png) -To create a destination connector: +## How does it work? -1. On the sidebar, click **Destinations**. -2. Click **New Destination**. -3. In the **Type** drop-down list, select the connector type that matches your destination. -4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list: +To get your data RAG-ready, the Unstructured Platform moves it through the following process: - - [Astra DB](/platform/destinations/astradb) - - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search) - - [Milvus](/platform/destinations/milvus) - - [MongoDB](/platform/destinations/mongodb) - - [Pinecone](/platform/destinations/pinecone) - - [S3](/platform/destinations/s3) +```mermaid + flowchart LR + Connect[1. Connect]-->Route[2. Route]-->Transform[3. Transform]-->Chunk[4. Chunk]-->Enrich[5. Enrich]-->Embed[6. Embed]-->Persist[7. Persist] +``` + + + The Unstructured Platform offers multiple [source connectors](/platform/sources/overview) to connect to your data in its existing location. + + + Routing determines which strategy the Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation: + + - **Fast** is great for when there is extractable text available, like in HTML files or Microsoft Office documents. + - **Hi Res** is best for PDFs and tables and where accurate classification of document elements is critical. + - If you're unsure which strategy to use, choose **Auto**, and the Unstructured Platform will handle the decision for you. -5. Click **Save and Test**. -6. Click **Close**. \ No newline at end of file + + + Your source document is transformed into Unstructured's canonical JSON schema.
Regardless of the input document, this JSON schema gives you a [standardized output](/platform/document-elements). It contains more than 20 elements, such as `Header`, `Footer`, `Title`, `NarrativeText`, `Table`, `Image`, and many more. Each document is wrapped in extensive metadata so you can understand languages, file types, sources, hierarchies, and much more. + + + The Unstructured Platform provides these [chunking](/platform/chunking) strategies: + + - **Basic** combines sequential elements up to specified size limits. Oversized elements are split, while tables are isolated and divided if necessary. Overlap between chunks is optional. + - **By Title** uses semantic chunking, understands the layout of the document, and makes intelligent splits. + - **By Page** attempts to preserve page boundaries when determining the chunks' contents. + - **By Similarity** uses an embedding model to identify topically similar sequential elements and combines them into chunks. + + + + Images and tables can be optionally summarized. This generates enriched content around the images or tables that were parsed during the transformation process. + + + The Unstructured Platform uses optional third-party [embedding](/platform/embedding) providers such as OpenAI. + + + The Unstructured Platform offers multiple [destination connectors](/platform/destinations/overview), including all major vector databases. + + + +To simplify this process and provide it as a no-code solution, the Unstructured Platform brings together four key concepts: + +```mermaid + flowchart LR + subgraph Workflow[3. Workflow] + direction LR + Source[1. Source Connector] --> Destination[2. Destination Connector] + end + Jobs + Workflow[3. Workflow] --> Jobs[4. Jobs] +``` + + + + [Source connectors](/platform/sources/overview) to ingest your data into the Unstructured Platform for transformation.
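The **Basic** chunking behavior described above — sequential elements are combined to fill each chunk up to a size limit, and any single element larger than the limit is split — can be sketched as follows. This is a simplified illustration of the documented behavior, not the platform's actual implementation.

```python
def basic_chunk(element_texts, max_characters=2048):
    """Combine sequential element texts into chunks of at most max_characters.

    Oversized elements are split into max_characters-sized pieces, which
    approximates the "Basic" strategy described in the docs.
    """
    chunks, current = [], ""
    for text in element_texts:
        # Split an element that is itself larger than the limit.
        while len(text) > max_characters:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(text[:max_characters])
            text = text[max_characters:]
        # Start a new chunk if adding this element would exceed the limit.
        if current and len(current) + 1 + len(text) > max_characters:
            chunks.append(current)
            current = text
        else:
            current = f"{current} {text}" if current else text
    if current:
        chunks.append(current)
    return chunks

print(basic_chunk(["a" * 10, "b" * 10, "c" * 30], max_characters=25))
# → ['aaaaaaaaaa bbbbbbbbbb', 'ccccccccccccccccccccccccc', 'ccccc']
```

Note how the two small elements share one chunk, while the oversized third element is split across chunks — the same trade-off the strategy descriptions above call out.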
+ + + [Destination connectors](/platform/destinations/overview) tell the Unstructured Platform where to write your transformed data to. + + + [Workflows](/platform/workflows) connect sources to destinations and provide chunking, embedding, and scheduling options. + + + [Jobs](/platform/jobs) enable you to monitor data transformation progress. + + + +## What support is there for compliance? + +The platform is designed for global reach with SOC2 Type 1, SOC2 Type 2, and HIPAA compliance. It has support for over 50 languages. + +## How do I get started? + +Skip ahead to the [quickstart](/platform/quickstart). diff --git a/platform/partitioning.mdx b/platform/partitioning.mdx index 47073d14..564eefbb 100644 --- a/platform/partitioning.mdx +++ b/platform/partitioning.mdx @@ -15,9 +15,19 @@ model-based workflows, which can be slower and costlier because they require a m When you choose a partitioning strategy for your files, you should be mindful of these speed, cost, and quality trade-offs. For example, the **Fast** strategy can be about 100 times faster than leading image-to-text models. -To choose one of these strategies, select one of the **Strategy** options in the **Transform** section of a workflow: +To choose one of these strategies, select one of the **Partition Strategy** options in the **Partitioner** node of a workflow: -You can change a workflow's predefined strategy only through [Custom](/platform/workflows#custom-workflow-settings) workflow settings. +You can change a workflow's predefined strategy only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. + +- **Auto**: This strategy leaves the choice of using **High Res** or **Fast** to Unstructured to determine on a file-by-file basis as it goes along. + Unstructured will use **High Res** if it can determine that the current file under analysis is an image file or a PDF file with embedded images or tables. + Otherwise, Unstructured will use **Fast** on the current file. 
+ You should choose this strategy if you know that all of the files are a combination of: + + - At least one image file; or at least one PDF file with embedded images or tables in it; and any number of other kinds of files. + + Choosing **Auto** can be an effective choice with a reasonable balance of speed, cost, and quality + when you have a mixture of these types of files. - **Fast**: This strategy is rule-based. It is faster and cheaper than **High Res** but might provide lower-quality resolution. You should choose this strategy if you know that: @@ -25,19 +35,10 @@ To choose one of these strategies, select one of the **Strategy** options in the - You have only PDF files, and you know that none of them have embedded images or tables in them, or - You have no PDF files or image files at all. -- **Hi Res**: This strategy uses an image-to-text model for inference. It is slower and costlier than **Fast** but can provide +- **Hi-res**: This strategy uses an image-to-text model for inference. It is slower and costlier than **Fast** but can provide higher-quality resolution. You should choose this strategy if you know that: - All of the files are only image files, or - All of the files are only PDF files, and they have embedded images or tables in them, or - All of the files are a combination of only these two kinds of files. -- **Auto**: This strategy leaves the choice of using **High Res** or **Fast** to Unstructured to determine on a file-by-file basis as it goes along. - Unstructured will use **High Res** if it can determine that the current file under analysis is an image file or a PDF file with embedded images or tables. - Otherwise, Unstructured will use **Fast** on the current file. - You should choose this strategy if you know that all of the files are a combination of: - - - At least one image file; or at least one PDF file with embedded images or tables in it; and any number of other kinds of files. 
- - Choosing **Auto** can be an effective choice with a reasonable balance of speed, cost, and quality - when you have a mixture of these types of files. diff --git a/platform/sources/azure-blob-storage.mdx b/platform/sources/azure-blob-storage.mdx index 088b0374..32dc06ea 100644 --- a/platform/sources/azure-blob-storage.mdx +++ b/platform/sources/azure-blob-storage.mdx @@ -12,12 +12,14 @@ import AzurePrerequisites from '/snippets/general-shared-text/azure.mdx'; To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **Azure Blob Storage**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Azure Blob Storage**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import AzureFields from '/snippets/general-shared-text/azure-platform.mdx'; diff --git a/platform/sources/elasticsearch.mdx b/platform/sources/elasticsearch.mdx index cf85ec87..a3d6d7da 100644 --- a/platform/sources/elasticsearch.mdx +++ b/platform/sources/elasticsearch.mdx @@ -12,12 +12,14 @@ import ElasticsearchPrerequisites from '/snippets/general-shared-text/elasticsea To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **Elasticsearch**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Elasticsearch**. +6. Click **Continue**. +7. 
Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import ElasticsearchFields from '/snippets/general-shared-text/elasticsearch-platform.mdx'; diff --git a/platform/sources/google-cloud.mdx b/platform/sources/google-cloud.mdx index 9d6d3c91..a9ca0804 100644 --- a/platform/sources/google-cloud.mdx +++ b/platform/sources/google-cloud.mdx @@ -12,12 +12,14 @@ import GCSPrerequisites from '/snippets/general-shared-text/gcs.mdx'; To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **Google Cloud Storage**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Google Cloud Storage**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import GCSFields from '/snippets/general-shared-text/gcs-platform.mdx'; diff --git a/platform/sources/google-drive.mdx b/platform/sources/google-drive.mdx index 0908e035..45e2ad5b 100644 --- a/platform/sources/google-drive.mdx +++ b/platform/sources/google-drive.mdx @@ -12,12 +12,14 @@ import GoogleDrivePrerequisites from '/snippets/general-shared-text/google-drive To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **Google Drive**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Google Drive**. +6. Click **Continue**. +7. 
Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import GoogleDriveFields from '/snippets/general-shared-text/google-drive-platform.mdx'; diff --git a/platform/sources/onedrive-cloud-storage.mdx b/platform/sources/onedrive-cloud-storage.mdx index 71944ec6..4ebf9e9e 100644 --- a/platform/sources/onedrive-cloud-storage.mdx +++ b/platform/sources/onedrive-cloud-storage.mdx @@ -12,12 +12,14 @@ import OneDrivePrerequisites from '/snippets/general-shared-text/onedrive.mdx'; To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **OneDrive**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **OneDrive**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import OneDriveFields from '/snippets/general-shared-text/onedrive-platform.mdx'; diff --git a/platform/sources/opensearch.mdx b/platform/sources/opensearch.mdx index 0ba0024c..fc52805c 100644 --- a/platform/sources/opensearch.mdx +++ b/platform/sources/opensearch.mdx @@ -12,12 +12,14 @@ import OpenSearchPrerequisites from '/snippets/general-shared-text/opensearch.md To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **OpenSearch**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **OpenSearch**. +6. Click **Continue**. +7. 
Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import OpenSearchFields from '/snippets/general-shared-text/opensearch-platform.mdx'; diff --git a/platform/sources/overview.mdx b/platform/sources/overview.mdx index 41514b83..72eb00a2 100644 --- a/platform/sources/overview.mdx +++ b/platform/sources/overview.mdx @@ -6,18 +6,21 @@ description: Source connectors are essential components in data integration syst ![Sources in the sidebar](/img/platform/Sources-Sidebar.png) -To see your existing source connectors, on the sidebar, click **Sources**. +To see your existing source connectors, on the sidebar, click **Connectors**, and then click **Sources**. To create a source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select the connector type that matches your source. -4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list: +1. In the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. For **Name**, enter a unique name for this connector. +5. In the **Provider** area, click the source location type that matches yours. +6. Click **Continue**. +7. Fill in the fields according to your connector type. To learn how, click your connector type in the following list: - [Azure](/platform/sources/azure-blob-storage) - [S3](/platform/sources/s3) - - [SharePoint](/platform/sources/sharepoint) - -5. Click **Save and Test**. -6. Click **Close**. + - [SharePoint](/platform/sources/sharepoint) + +8. If a **Continue** button appears, click it, and fill in any additional settings fields. +9. Click **Save and Test**.
diff --git a/platform/sources/s3.mdx b/platform/sources/s3.mdx index 7b9c8537..ef935d58 100644 --- a/platform/sources/s3.mdx +++ b/platform/sources/s3.mdx @@ -12,12 +12,14 @@ import S3Prerequisites from '/snippets/general-shared-text/s3.mdx'; To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **Amazon S3**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Amazon S3**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import S3Fields from '/snippets/general-shared-text/s3-platform.mdx'; diff --git a/platform/sources/salesforce.mdx b/platform/sources/salesforce.mdx index c3d9be1e..34741db8 100644 --- a/platform/sources/salesforce.mdx +++ b/platform/sources/salesforce.mdx @@ -12,12 +12,14 @@ import SalesforcePrerequisites from '/snippets/general-shared-text/salesforce.md To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **Salesforce**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **Salesforce**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. 
import SalesforceFields from '/snippets/general-shared-text/salesforce-platform.mdx'; diff --git a/platform/sources/sftp-storage.mdx b/platform/sources/sftp-storage.mdx index 0ca7e752..2e3adf78 100644 --- a/platform/sources/sftp-storage.mdx +++ b/platform/sources/sftp-storage.mdx @@ -12,12 +12,14 @@ import SFTPPrerequisites from '/snippets/general-shared-text/sftp.mdx'; To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **SFTP**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **SFTP**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. import SFTPFields from '/snippets/general-shared-text/sftp-platform.mdx'; diff --git a/platform/sources/sharepoint.mdx b/platform/sources/sharepoint.mdx index ccd8eb4d..4102205a 100644 --- a/platform/sources/sharepoint.mdx +++ b/platform/sources/sharepoint.mdx @@ -12,12 +12,14 @@ import SharePointPrerequisites from '/snippets/general-shared-text/sharepoint.md To create the source connector: -1. On the sidebar, click **Sources**. -2. Click **New Source**. -3. In the **Type** drop-down list, select **SharePoint**. -4. Fill in the fields as described later on this page. -5. Click **Save and Test**. -6. Click **Close**. +1. On the sidebar, click **Connectors**. +2. Click **Sources**. +3. Click **Add new**. +4. Give the connector some unique **Name**. +5. In the **Provider** area, click **SharePoint**. +6. Click **Continue**. +7. Follow the on-screen instructions to fill in the fields as described later on this page. +8. Click **Save and Test**. 
import SharePointFields from '/snippets/general-shared-text/sharepoint-platform.mdx'; diff --git a/platform/summarizing.mdx b/platform/summarizing.mdx index 5b5e588d..b21bdf31 100644 --- a/platform/summarizing.mdx +++ b/platform/summarizing.mdx @@ -70,18 +70,16 @@ Line breaks have been inserted here for readability. The output will not contain ## Summarize images and tables -To summarize images and tables, in the **Transform** section of a workflow, specify the following: +To summarize images and tables, in the **Enrichment model** section of an **Enrichment** node in a workflow, specify the following: -You can change a workflow's summarization settings only through [Custom](/platform/workflows#custom-workflow-settings) workflow settings. +You can change a workflow's summarization settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings. -For image summarization, in the **Image summarization** area, choose one of the following: +For image summarization, choose one of the following: -- **None**: No image summarization. This is the default. -- **GPT-4o**: Use GPT-4o to summarize images. [Learn more](https://openai.com/index/hello-gpt-4o/). -- **Claude 3.5 Sonnet**: Use Claude 3.5 Sonnet to summarize images. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). +- **OpenAI Image Description**: Use GPT-4o to summarize images. [Learn more](https://openai.com/index/hello-gpt-4o/). +- **Anthropic Image Description**: Use Claude 3.5 Sonnet to summarize images. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). -For table summarization, in the **Table summarization** area, choose one of the following: +For table summarization, choose one of the following: -- **None**: No table summarization. This is the default. -- **GPT-4o**: Use GPT-4o to summarize tables. [Learn more](https://openai.com/index/hello-gpt-4o/). -- **Claude 3.5 Sonnet**: Use Claude 3.5 Sonnet to summarize tables. 
[Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). \ No newline at end of file +- **OpenAI Table Description**: Use GPT-4o to summarize tables. [Learn more](https://openai.com/index/hello-gpt-4o/). +- **Anthropic Table Description**: Use Claude 3.5 Sonnet to summarize tables. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). \ No newline at end of file diff --git a/platform/workflows.mdx b/platform/workflows.mdx index b5de259f..4dad9e67 100644 --- a/platform/workflows.mdx +++ b/platform/workflows.mdx @@ -14,144 +14,294 @@ Workflows are crucial for establishing a systematic approach to managing data fl ## Create a workflow +![Choose a workflow type](/img/platform/Choose-Workflow-Type.png) + +The Unstructured Platform provides two types of workflow builders: + +- [Automatic](#create-an-automatic-workflow) workflows, which use sensible default workflow settings to enable you to get good-quality results faster. +- [Custom](#create-a-custom-workflow) workflows, which enable you to fine-tune the workflow settings behind the scenes to get very specific results. + +All Unstructured accounts can create automatic workflows. + +To create custom workflows, you must first request that Unstructured enable your account. [Learn how](#create-a-custom-workflow). + +### Create an automatic workflow + You must first have an existing source connector and destination connector to add to the workflow. If you do not have an existing connector for either your target source (input) or destination (output) location, [create the source connector](/platform/sources/overview), [create the destination connector](/platform/destinations/overview), and then return here. - To see your existing connectors, on the sidebar, click **Sources** or **Destinations**. + To see your existing connectors, on the sidebar, click **Connectors**, and then click **Sources** or **Destinations**. -To create a workflow: +To create an automatic workflow: 1. On the sidebar, click **Workflows**. 2.
Click **New Workflow**. -3. Enter a unique **Name** for this workflow. -4. In the **Connectors** section, in the **Sources** dropdown list, select your source location. -5. In the **Destination** dropdown list, select your destination location. +3. Next to **Build it with me**, click **Create Workflow**. + + If a radio button appears instead of **Build it with me**, select it, and then click **Continue**. + +4. For **Workflow Name**, enter some unique name for this workflow. +5. In the **Sources** dropdown list, select your source location. +6. In the **Destinations** dropdown list, select your destination location. You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. -6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups: +7. Click **Continue**. +8. In the **Optimize for** section, select the option to choose one of these predefined workflow settings groups: - **Basic** is a good choice if you have text-only documents that have no images or tables in them. - **Advanced** is a good choice if you have complex documents that have images or tables or both in them. - - To fine-tune your workflow's settings, click **Custom**. If **Custom** is not available, click **Request Access**, - and wait for Unstructured to enable it. Learn how to define [Custom](#custom-workflow-settings) workflow settings. -7. If you want to run this workflow on a regular basis, in the **Schedule** section, select one of the time periods in the **Schedule Type** list: +9. If you want to overwrite any files in the destination location that might have been previously processed, check the **Reprocess all** box. +10. If you want to retry processing any documents that failed to process, check the **Retry Failed Documents** box. +11. Click **Continue**. +12. 
If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. +13. Click **Complete**. +14. If you did not set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now. - - **Monthly**: This workflow will automatically run once each month. Choose the day of the month and the time on that day to run this workflow. - - **Daily**: This workflow will automatically run once each day for one or more days of each week. Choose the days of the week and the time on each of those days to run this workflow. - - **Hourly**: This workflow will automatically run once each hour. Choose the minute of the hour to run this workflow. - - **Frequently**: This workflow will automatically run once after each specified number of minutes. Choose the time period in minutes to wait until running this workflow again. +### Create a custom workflow -8. Click **Save**. + + You must first have an existing source connector and destination connector to add to the workflow. -## Custom workflow settings + If you do not have an existing connector for either your target source (input) or destination (output) location, [create the source connector](/platform/sources/overview), [create the destination connector](/platform/destinations/overview), and then return here. -To define custom workflow settings, in the **Workflow Settings** section of a workflow, click **Custom**. -If **Custom** is not available, click **Request Access**, and wait for Unstructured to enable it. + To see your existing connectors, on the sidebar, click **Connectors**, and then click **Sources** or **Destinations**. + -The following workflow settings can be customized: +There are two ways to create a custom workflow: - - - 1. For **Strategy**, choose one of the following: +- Through [Build it with me > Custom](#build-it-with-me-custom).
This option enables you to fine-tune the kinds of settings that are in **Basic** and **Advanced**. +- Through [Build it myself](#build-it-myself). This option offers a visual workflow designer with even more fine-tuning than the **Custom** option. - - **Fast**: This strategy uses traditional NLP extraction techniques to quickly pull in all text elements. This strategy is not good for image-based file types. [Learn more](/platform/partitioning). - - **Hi Res**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning). - - **Auto**: This strategy chooses the partitioning strategy based on detected document characteristics. [Learn more](/platform/partitioning). +#### Build it with me - Custom - 2. For **Image summarization**, choose one of the following: +1. On the sidebar, click **Workflows**. +2. Click **New Workflow**. +3. Next to **Build it with me**, click **Create Workflow**. - - **None**: No summarization of images. - - **GPT-4o**: Use GPT-4o to summarize images. [Learn more](https://openai.com/index/hello-gpt-4o/). - - **Claude 3.5 Sonnet**: Use Claude 3.5 Sonnet to summarize images. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). + If a radio button appears instead of **Build it with me**, select it, and then click **Continue**. - [Learn more about image summarization](/platform/summarizing). +4. For **Workflow Name**, enter some unique name for this workflow. +5. In the **Sources** dropdown list, select your source location. +6. In the **Destinations** dropdown list, select your destination location. - 3. For **Table summarization**, choose one of the following: + You can select multiple source and destination locations. 
Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. - - **None**: No summarization of tables. - - **GPT-4o**: Use GPT-4o to summarize tables. [Learn more](https://openai.com/index/hello-gpt-4o/). - - **Claude 3.5 Sonnet**: Use Claude 3.5 Sonnet to summarize tables. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). +7. Click **Continue**. +8. In the **Optimize for** section, click the **Custom** option, and then click **Continue**. - [Learn more about table summarization](/platform/summarizing). + + If the **Custom** option is disabled, inside the **Custom** option click **Notify me**, and follow the on-screen directions to complete the request. + Unstructured will notify you when your account has been enabled with the **Custom** option. After you receive this notification, click the + **Custom** option, and then click **Continue**. + - 4. For **Connector Settings**, check one or more of the following boxes: +9. In the **Strategy** area, choose one of the following: - - **Include Page Breaks**: Include page breaks in the output, if the file type supports it. - - **Infer Table Structure**: If you also set **Strategy** to **Hi Res**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML ``. + - **Fast**: This strategy uses traditional NLP extraction techniques to quickly pull in all text elements. This strategy is not good for image-based file types or files with images or tables in them. + - **Hi-res**: This strategy uses the document layout to gain additional information about document elements. This strategy is good for image-based file types and files with images or tables in them. This strategy is recommended if your use case is highly sensitive to correct classification for document elements. 
+ - **Auto**: This strategy chooses the partitioning strategy on a file-by-file basis, depending on detected document characteristics. - 5. For **Elements to Exclude**, select one or more standard Unstructured element types to not include in the output. [Learn more](/platform/document-elements). - - - For **Chunker Type**, select one of the following: + [Learn more](/platform/partitioning). - - **None**: Do not chunk elements. - - **Basic**: Combine sequential elements to maximally fill each chunk. [Learn more](/platform/chunking). Also specify the following: +10. In the **Image Summarizer** drop-down list, choose one of the following: - - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. - - **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. - - **New After N Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is an approximate limit. - - **Overlap** (_required_): Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. - - **Overlap All**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. + - **None**: Do not provide summaries for any detected images in any of the files. + - **GPT-4o**: Use GPT-4o to provide summaries for any detected images in any of the files. [Learn more](https://openai.com/index/hello-gpt-4o/). + - **Claude 3.5 Sonnet**: Use Claude 3.5 Sonnet to provide summaries for any detected images in any of the files. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). - - **Chunk By Title**: Preserve section boundaries and optionally page boundaries as well.
A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. [Learn more](/api-reference/api-services/chunking). Also specify the following: + [Learn more](/platform/summarizing). + +11. In the **Table Summarizer** drop-down list, choose one of the following: + + - **None**: Do not provide summaries for any detected tables in any of the files. + - **GPT-4o**: Use GPT-4o to provide summaries for any detected tables in any of the files. [Learn more](https://openai.com/index/hello-gpt-4o/). + - **Claude 3.5 Sonnet**: Use Claude 3.5 Sonnet to provide summaries for any detected tables in any of the files. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). + + [Learn more](/platform/summarizing). + +12. Check the **Include Page Breaks** box to include page breaks in the output, if the file type supports it. +13. Check the **Infer Table Structure** box to extract any detected table elements in PDF files as HTML into a `metadata` output field named `text_as_html`. + +14. In the **Elements to Exclude** drop-down list, select any element types to exclude from the output. +15. In the **Chunk** area, for **Chunker Type**, select one of the following: + + - **None**: Do not apply special chunking rules to the output. + - **Chunk by Character** (also known as _basic_ chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following: + + - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max Characters**: Cut off new sections after reaching a length of this many characters. The default is **2048**. + - **New After n chars**: Cut off new sections after reaching a length of this many characters.
This is an approximate limit. The default is **1500**. + - **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**. + - **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked. - - **Combine Text Under N Characters** (_required_): Combine elements until a section reaches a length of this many characters. - - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. - - **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is a strict limit. - - **Multipage Sections**: Check this box to allow sections to span multiple pages. - - **New After N Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is an approximate limit. - - **Overlap** (_required_): Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. - - **Overlap All**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. - - - **Chunk By Page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. [Learn more](/platform/chunking). Also specify the following: + - **Chunk by Page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. 
Also, specify the following: - - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. - - **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is a strict limit. - - **New After N Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is an approximate limit. - - **Overlap** (_required_): Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. - - **Overlap All**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. + - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**. + - **New After n Characters**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **50**. + - **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **30**. + - **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked. + + - **Chunk by Title**: Preserve section boundaries and optionally page boundaries as well. A single chunk will never contain text that occurred in two different sections. 
When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following: + + - **Combine Text Under n Chars**: Combine elements until a section reaches a length of this many characters. The default is **0**. + - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **2048**. + - **Multipage Sections**: Check this box to allow sections to span multiple pages. By default, this box is unchecked. + - **New After n Characters**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**. + - **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**. + - **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked. + + - **Chunk by Similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks. Also, specify the following: + + - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max Characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit.
The default is **500**. + - **Similarity Threshold**: Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). The default is **0.5**. [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061). + + Learn more: - - **Chunk By Similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks. [Learn more](/platform/chunking). Also specify the following: + - [Chunking overview](/platform/chunking) + - [Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices) - - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. - - **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is a strict limit. - - **Similarity Threshold** (_required_): Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consider the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061). +16. In the **Embed** area, for **Provider**, choose one of the following: - [Learn more](https://unstructured.io/blog/chunking-for-rag-best-practices). - - - For **Vendor**, select one of the following: + - **None**: Do not generate embeddings. + - **OpenAI**: Use OpenAI to generate embeddings.
+ - **Vertex AI**: Use Vertex AI to generate embeddings. + + Learn more: + + - [Embedding overview](/platform/embedding) + - [Understanding embedding models: make an informed choice for your RAG](https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag). + +17. Check the **Reprocess all** box if you want to overwrite any files in the destination location that might have been previously processed. +18. Check the **Retry Failed Documents** box if you want to retry processing any documents that failed to process. +19. Click **Continue**. +20. If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. +21. Click **Complete**. +22. If you did not set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now. + +#### Build it myself - - **Off**: Do not generate embeddings. - - **OpenAI**: Use OpenAI to generate embeddings. Also choose the embedding model to use, from one of the following: +1. On the sidebar, click **Workflows**. +2. Click **New Workflow**. +3. Click the **Build it myself** option, and then click **Continue**. + + + If the **Build it myself** option is disabled, inside the **Build it myself** option click **Notify me**, and follow the on-screen directions to complete the request. + Unstructured will notify you when your account has been enabled with the **Build it myself** option. After you receive this notification, click the + **Build it myself** option, and then click **Continue**. + + +4. In the **This workflow** pane, click the **Details** button. + + ![Workflow details](/img/platform/Workflow-Details.png) + +5. Next to **Name**, click the pencil icon, enter a unique name for this workflow, and then click the check mark icon. +6. If you want this workflow to run on a schedule, click the **Schedule** button.
In the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. +7. To overwrite any previously processed files, or to retry any documents that fail to process, click the **Settings** button, and check either or both of the boxes. +8. In the pipeline designer, click the **Source** node. In the **Source** pane, select the source location. Then click **Save**. + + ![Workflow designer](/img/platform/Workflow-Designer.png) + +9. Click the **Destination** node. In the **Destination** pane, select the destination location. Then click **Save**. +10. As needed, add more nodes by clicking the plus icon (recommended) or **Add Node** button: + + ![Add node to workflow](/img/platform/Workflow-Add-Node.png) + + - Click **Connect** to add another **Source** or **Destination** node. You can add multiple source and destination locations. Files will be ingested from all of the source locations, and the processed data will be delivered to all of the destination locations. [Learn more](#custom-workflow-node-types). + - Click **Enrich** to add a **Chunker** or **Enrichment** node. [Learn more](#custom-workflow-node-types). + - Click **Transform** to add a **Partitioner** or **Embedder** node. [Learn more](#custom-workflow-node-types). + + + Make sure to add nodes in the correct order. If you are unsure, see the usage hints in the blue note that appears + in the node's settings pane. + + ![Node usage hints note](/img/platform/Node-Usage-Hints.png) + + + To edit a node, click that node, and then change its settings. + + To delete a node, click that node, and then click the trash can icon above it. + +11. Click **Save**. +12. If you did not set the workflow to run on a schedule, you can [run the workflow](#edit-delete-or-run-a-workflow) now.
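As a mental model for how the chunker settings described above interact, the following Python sketch approximates by-character (basic) chunking with a soft limit (**New After n Characters**), a hard limit (**Max Characters**), and **Overlap** applied to text-split pieces. It is illustrative only, assuming plain-text elements; it is not the Platform's actual implementation.

```python
def chunk_basic(elements, max_characters=2048, new_after_n_chars=1500, overlap=160):
    """Illustrative sketch of by-character ("basic") chunking: combine
    sequential element texts to fill each chunk, close a chunk once the
    soft limit (new_after_n_chars) is reached, and text-split any single
    element longer than the hard limit (max_characters), prefixing each
    later split piece with `overlap` trailing characters of the prior piece."""
    chunks, current = [], ""
    for text in elements:
        if len(text) > max_characters:
            # Oversized element: flush the open chunk, then text-split.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(text[:max_characters])
            rest = text[max_characters:]
            while rest:
                prefixed = chunks[-1][-overlap:] + rest if overlap else rest
                chunks.append(prefixed[:max_characters])
                rest = prefixed[max_characters:]
            continue
        full = len(current) >= new_after_n_chars                       # soft limit
        would_overflow = len(current) + 2 + len(text) > max_characters  # hard limit
        if current and (full or would_overflow):
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks
```

With the defaults, a single 5,000-character element would be split into pieces of at most 2,048 characters, each later piece starting with the last 160 characters of the previous one.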
+ +#### Custom workflow node types + + + + For **Partition Strategy**, choose one of the following: - - **text-embedding-3-small** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings). - - **text-embedding-3-large** (3072 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings). - - **Ada 002 (Text)** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings). + - **Auto**: This strategy chooses the partitioning strategy on a file-by-file basis, depending on detected document characteristics. + - **Fast**: This strategy uses traditional NLP extraction techniques to quickly pull in all text elements. This strategy is not good for image-based file types or files with images or tables in them. + - **Hi-res**: This strategy uses the document layout to gain additional information about document elements. This strategy is good for image-based file types and files with images or tables in them. This strategy is recommended if your use case is highly sensitive to correct classification for document elements. - - **Hugging Face**: Use Hugging Face to generate embeddings. Also choose the embedding model to use, from one of the following: + [Learn more](/platform/partitioning). + + + For **Chunkers**, select one of the following: - - **nvidia/NV-Embed-v1** (4096 dimensions): [Learn more](https://huggingface.co/nvidia/NV-Embed-v1). - - **voyage-large-2-instruct** (1024 dimensions): [Learn more](https://huggingface.co/voyageai/voyage-large-2-instruct). - - **stella_en_400M_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_400M_v5). - - **stella_en_1.5B_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_1.5B_v5). - - **Alibaba-NLP/gte-Qwen2-7B-instruct** (3584 dimensions): [Learn more](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct). + - **Chunk by title**: Preserve section boundaries and optionally page boundaries as well. 
A single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following: + + - **Combine text under n chars**: Combine elements until a section reaches a length of this many characters. The default is **0**. + - **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **2048**. + - **Multipage sections**: Check this box to allow sections to span multiple pages. By default, this box is unchecked. + - **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**. + - **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**. + - **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked. + + - **Chunk by character** (also known as _basic_ chunking): Combine sequential elements to maximally fill each chunk. Also, specify the following: + + - **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max characters**: Cut off new sections after reaching a length of this many characters. The default is **2048**.
+ - **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **1500**. + - **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **160**. + - **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked. + + - **Chunk by page**: Preserve page boundaries. When a new page is detected, the existing chunk is closed and a new one is started, even if the next element would fit in the prior chunk. Also, specify the following: + + - **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**. + - **New after n chars**: Cut off new sections after reaching a length of this many characters. This is an approximate limit. The default is **50**. + - **Overlap**: Apply a prefix of this many trailing characters from the prior text-split chunk to second and later chunks formed from oversized elements by text-splitting. The default is **30**. + - **Overlap all**: Check this box to apply overlap to "normal" chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. By default, this box is unchecked. + + - **Chunk by similarity**: Use the [sentence-transformers/multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) embedding model to identify topically similar sequential elements and combine them into chunks.
Also, specify the following: - - **OctoAI**: Use OctoAI to generate embeddings. Also choose the embedding model to use, from one of the following: + - **Include original elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk. By default, this box is unchecked. + - **Max characters**: Cut off new sections after reaching a length of this many characters. This is a strict limit. The default is **500**. + - **Similarity threshold**: Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). The default is **0.5**. [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061). - - **GTE Large** (1024 dimensions): [Learn more](https://octo.ai/blog/introducing-octoais-embedding-api-to-power-your-rag-needs/). + Learn more: + + - [Chunking overview](/platform/chunking) + - [Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices) - - **Vertex AI**: Use Vertex AI to generate embeddings. Also choose the embedding model to use, from one of the following: + + + For **Enrichment model**, choose one of the following: + + - **OpenAI Image Description** to use GPT-4o to summarize images. [Learn more](https://openai.com/index/hello-gpt-4o/). + - **OpenAI Table Description** to use GPT-4o to summarize tables. [Learn more](https://openai.com/index/hello-gpt-4o/). + - **OpenAI Table to HTML** to use GPT-4o to convert tables to HTML. [Learn more](https://openai.com/index/hello-gpt-4o/). + - **Anthropic Image Description** to use Claude 3.5 Sonnet to summarize images. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
+ - **Anthropic Table Description** to use Claude 3.5 Sonnet to summarize tables. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet). - - **textembedding-gecko@003** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions). - - **text-embedding-004** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions). + [Learn more](/platform/summarizing). + + + For **Providers**, select one of the following: + - **OpenAI**: Use OpenAI to generate embeddings. + - **Vertex AI**: Use Vertex AI to generate embeddings. + Learn more: - [Embedding overview](/platform/embedding) @@ -161,13 +311,13 @@ The following workflow settings can be customized: ## Edit, delete, or run a workflow -![Manage a workflow](/img/platform/Workflow-Actions.png) +To run a workflow: + +1. On the sidebar, click **Workflows**. +2. In the list of workflows, click **Run** in the row for the workflow that you want to run. -For each of the workflows on the **Workflows** list page, the following actions are available by clicking the ellipses (the three dots) next to the respective workflow name: +For each of the workflows on the **Workflows** list page, the following actions are available by clicking the ellipses (the three dots) in the row for the respective workflow: -* **Edit**: Changes the existing configuration of your workflow. This can include changing the source, destination, scheduling, and chunking strategies, among other settings. - +* **Edit via Form**: Changes the existing configuration of your workflow. * **Delete**: Removes the workflow from the platform. Use this action cautiously, as it will permanently delete the workflow and its configurations. - -* **Run**: Manually runs the workflow outside of its scheduled runs. This is particularly useful for testing or ad-hoc data processing needs. 
- +* **Open**: Opens the workflow's settings page. diff --git a/snippets/quickstarts/platform.mdx b/snippets/quickstarts/platform.mdx index 447927c6..8230aa2d 100644 --- a/snippets/quickstarts/platform.mdx +++ b/snippets/quickstarts/platform.mdx @@ -7,70 +7,84 @@ You will need: - A compatible destination (output) location in cloud storage for Unstructured to put the processed data. [See the list of supported destination types](/platform/connectors#destinations). - - [Sign up for the Platform beta](https://unstructured.io/platform). - - - ![Sign in to the Platform](/img/platform/Signin.png) - 1. Use the sign-in URL in the welcome email that Unstructured sends you. + + ![Sign in to your Unstructured account](/img/platform/Signin.png) + 1. Sign in to your Unstructured account, at [https://app.unstructured.io](https://app.unstructured.io). 2. Click **Google** or **GitHub** to sign in with your Google or GitHub account. Or, enter your email address and then click **Sign In**. 3. If you entered your email address, check your email inbox for a message from Unstructured. In that email, click the **Sign In** link. 4. The first time you sign in, read the terms and conditions, and then click **Accept**. + + ![Open your Unstructured Platform dashboard](/img/platform/GoToPlatform.png) + - From your Unstructured account dashboard, in the sidebar, click **API > Platform**. + + The **API > Platform** sidebar option is visible only to users who signed up for an Unstructured API services account before + October 23, 2024. To get access to the Unstructured Platform on or after this date, + [request access](https://unstructured.io/platform), and wait for an access enablement email from Unstructured. + + ![Sources in the sidebar](/img/platform/Sources-Sidebar.png) - 1. In the sidebar, click **Sources**. - 2. Click **New Source**. - 3. In the **Type** dropdown list, select the source location type that matches yours. - 4. Fill in the rest of the fields with the appropriate settings. 
[Learn more](/platform/sources/overview). - 5. Click **Save and Test**. - 6. Click **Close**. + 1. From your Unstructured Platform dashboard, in the sidebar, click **Connectors**. + 2. Click **Sources**. + 3. Click **Add new**. + 4. For **Name**, enter a unique name for this connector. + 5. In the **Provider** area, click the source location type that matches yours. + 6. Click **Continue**. + 7. Fill in the fields with the appropriate settings. [Learn more](/platform/sources/overview). + 8. If a **Continue** button appears, click it, and fill in any additional settings fields. + 9. Click **Save and Test**. ![Destinations in the sidebar](/img/platform/Destinations-Sidebar.png) - 1. In the sidebar, click **Destinations**. - 2. Click **New Destination**. - 3. In the **Type** dropdown list, select the destination location type that matches yours. - 4. Fill in the rest of the fields with the appropriate settings. [Learn more](/platform/destinations/overview). - 5. Click **Save and Test**. - 6. Click **Close**. + 1. In the sidebar, click **Connectors**. + 2. Click **Destinations**. + 3. Click **Add new**. + 4. For **Name**, enter a unique name for this connector. + 5. In the **Provider** area, click the destination location type that matches yours. + 6. Click **Continue**. + 7. Fill in the fields with the appropriate settings. [Learn more](/platform/destinations/overview). + 8. If a **Continue** button appears, click it, and fill in any additional settings fields. + 9. Click **Save and Test**. ![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png) 1. In the sidebar, click **Workflows**. 2. Click **New Workflow**. - 3. Enter a **Name** for the new workflow. - 4. In the **Connectors** section, in the **Sources** dropdown list, select your source location from Step 3. - 5. In the **Destination** dropdown list, select your destination location from Step 4. + 3. Next to **Build it with me**, click **Create Workflow**.
+ + If a radio button appears instead of **Build it with me**, select it, and then click **Continue**. + + 4. For **Workflow Name**, enter a unique name for this workflow. + 5. In the **Sources** dropdown list, select your source location from Step 3. + 6. In the **Destinations** dropdown list, select your destination location from Step 4. You can select multiple source and destination locations. Files will be ingested from all of the selected source locations, and the processed data will be delivered to all of the selected destination locations. - 6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups: + 7. Click **Continue**. + 8. In the **Optimize for** section, select one of these predefined workflow settings groups: + + - **Basic** is a good choice if you have text-only documents that have no images or tables in them. + - **Advanced** is a good choice if you have complex documents that have images or tables or both in them. - - **Basic** is a good choice if you have text-only documents that have no images or tables in them. - - **Advanced** is a good choice if you have complex documents that have images or tables or both in them. - - To fine-tune your workflow's settings, click **Custom**. If **Custom** is not available, click **Request Access**, - and wait for Unstructured to enable it. Learn how to define [Custom](/platform/workflows#custom-workflow-settings) workflow settings. - - 7. If you want to run this workflow on a regular basis, select one of the time periods in the **Schedule Type** list. - 8. Click **Save**. + 9. If you want to overwrite any files in the destination location that might have been previously processed, check the **Reprocess all** box. + 10. If you want to retry processing any documents that failed to process, check the **Retry Failed Documents** box. + 11. Click **Continue**. + 12.
If you want this workflow to run on a schedule, in the **Repeat Run** dropdown list, select one of the scheduling options, and fill in the scheduling settings. Otherwise, select **Don't repeat**. + 13. Click **Complete**. - ![Jobs in the sidebar](/img/platform/Jobs-Sidebar.png) - ![Run a job](/img/platform/Run-Job.png) - 1. If you did not choose to run this workflow on a regular basis in Step 5, you can run the workflow now: on the sidebar, click **Jobs**. - 2. Click **Run Job**. - 3. In the **Select a Workflow** dropdown list, select your workflow from Step 5. - 4. Click **Run**. + ![Workflows in the sidebar](/img/platform/Workflows-Sidebar.png) + 1. If you did not choose to run this workflow on a schedule in Step 5, you can run the workflow now: on the sidebar, click **Workflows**. + 2. Next to your workflow from Step 5, click **Run**. ![Select a job](/img/platform/Select-Job.png) ![Completed job](/img/platform/Job-Complete.png) - 1. In the list of **Jobs**, click the **ID** link for the job that you want to monitor. - 2. Wait for the **Status** to change to **Completed**. - 3. If **Failed** at the top of the screen equals **0** (zero), the workflow was fully successful. Go to the next Step. - 4. If **Failed** at the top of the screen equals **1** (one) or greater, the workflow was not fully successful. + 1. In the sidebar, click **Jobs**. + 2. In the list of jobs, wait for the job's **Status** to change to **Finished**. + 3. Click the row for the job. + 4. If **Overview** displays **Success**, go to the next Step. Go to your destination location to view the processed data.
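The processed data in the destination location is typically a set of JSON files, each containing an array of document elements with `type`, `text`, and `metadata` fields. The following Python sketch inspects output in that shape; the sample elements here are made up for illustration and are not produced by your workflow:

```python
import json
from collections import Counter

# Made-up sample in the shape of Unstructured's processed output:
# a JSON array of elements, each with "type", "text", and "metadata".
sample_json = """[
  {"type": "Title", "text": "Quarterly report", "metadata": {"page_number": 1}},
  {"type": "NarrativeText", "text": "Revenue grew this quarter.", "metadata": {"page_number": 1}},
  {"type": "Table", "text": "Q1 10 Q2 12",
   "metadata": {"page_number": 2,
                "text_as_html": "<table><tr><td>Q1</td><td>10</td></tr></table>"}}
]"""
elements = json.loads(sample_json)

# Tally the detected element types (Title, NarrativeText, Table, ...).
type_counts = Counter(el["type"] for el in elements)
print(type_counts)

# With table structure inference enabled, table elements also carry an
# HTML rendering of the table in metadata["text_as_html"].
for el in elements:
    html = el.get("metadata", {}).get("text_as_html")
    if html:
        print(html)
```

In practice you would load each output file from your destination location instead of the inline sample.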