diff --git a/docs.json b/docs.json
index 9f5b6286..f2f46d0f 100644
--- a/docs.json
+++ b/docs.json
@@ -30,7 +30,8 @@
     {
       "group": "Getting started with the UI",
       "pages": [
-        "ui/quickstart"
+        "ui/quickstart",
+        "ui/walkthrough"
       ]
     },
     {
diff --git a/img/ui/walkthrough/AddChunker.png b/img/ui/walkthrough/AddChunker.png
new file mode 100644
index 00000000..aa42f16c
Binary files /dev/null and b/img/ui/walkthrough/AddChunker.png differ
diff --git a/img/ui/walkthrough/AddEmbedder.png b/img/ui/walkthrough/AddEmbedder.png
new file mode 100644
index 00000000..5e3c3364
Binary files /dev/null and b/img/ui/walkthrough/AddEmbedder.png differ
diff --git a/img/ui/walkthrough/AddEnrichment.png b/img/ui/walkthrough/AddEnrichment.png
new file mode 100644
index 00000000..d2a64cd0
Binary files /dev/null and b/img/ui/walkthrough/AddEnrichment.png differ
diff --git a/img/ui/walkthrough/BuildItMyself.png b/img/ui/walkthrough/BuildItMyself.png
new file mode 100644
index 00000000..2b1acc63
Binary files /dev/null and b/img/ui/walkthrough/BuildItMyself.png differ
diff --git a/img/ui/walkthrough/ChunkByCharacter.png b/img/ui/walkthrough/ChunkByCharacter.png
new file mode 100644
index 00000000..4f229c75
Binary files /dev/null and b/img/ui/walkthrough/ChunkByCharacter.png differ
diff --git a/img/ui/walkthrough/DropFileToTest.png b/img/ui/walkthrough/DropFileToTest.png
new file mode 100644
index 00000000..6d703080
Binary files /dev/null and b/img/ui/walkthrough/DropFileToTest.png differ
diff --git a/img/ui/walkthrough/EnrichedWorkflow.png b/img/ui/walkthrough/EnrichedWorkflow.png
new file mode 100644
index 00000000..886eabd6
Binary files /dev/null and b/img/ui/walkthrough/EnrichedWorkflow.png differ
diff --git a/img/ui/walkthrough/GoToEnrichmentNode.png b/img/ui/walkthrough/GoToEnrichmentNode.png
new file mode 100644
index 00000000..d91e2fbb
Binary files /dev/null and b/img/ui/walkthrough/GoToEnrichmentNode.png differ
diff --git a/img/ui/walkthrough/HighResPartitioner.png b/img/ui/walkthrough/HighResPartitioner.png
new file mode 100644
index 00000000..4149ad48
Binary files /dev/null and b/img/ui/walkthrough/HighResPartitioner.png differ
diff --git a/img/ui/walkthrough/NewWorkflow.png b/img/ui/walkthrough/NewWorkflow.png
new file mode 100644
index 00000000..7aee84e7
Binary files /dev/null and b/img/ui/walkthrough/NewWorkflow.png differ
diff --git a/img/ui/walkthrough/SearchJSON.png b/img/ui/walkthrough/SearchJSON.png
new file mode 100644
index 00000000..9149c3bd
Binary files /dev/null and b/img/ui/walkthrough/SearchJSON.png differ
diff --git a/img/ui/walkthrough/TestLocalFile.png b/img/ui/walkthrough/TestLocalFile.png
new file mode 100644
index 00000000..22028e3b
Binary files /dev/null and b/img/ui/walkthrough/TestLocalFile.png differ
diff --git a/img/ui/walkthrough/TestOutputResults.png b/img/ui/walkthrough/TestOutputResults.png
new file mode 100644
index 00000000..2a6732bc
Binary files /dev/null and b/img/ui/walkthrough/TestOutputResults.png differ
diff --git a/img/ui/walkthrough/WorkflowDesigner.png b/img/ui/walkthrough/WorkflowDesigner.png
new file mode 100644
index 00000000..a8d027f2
Binary files /dev/null and b/img/ui/walkthrough/WorkflowDesigner.png differ
diff --git a/img/ui/walkthrough/WorkflowsSidebar.png b/img/ui/walkthrough/WorkflowsSidebar.png
new file mode 100644
index 00000000..2115437d
Binary files /dev/null and b/img/ui/walkthrough/WorkflowsSidebar.png differ
diff --git a/ui/quickstart.mdx b/ui/quickstart.mdx
index c1ac0a85..a27d07d9 100644
--- a/ui/quickstart.mdx
+++ b/ui/quickstart.mdx
@@ -27,7 +27,9 @@
 import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-si allowfullscreen >
-4. To move beyond local file processing, you can try the following [remote quickstart](#remote-quickstart), or skip over to the [Dropbox source connector quickstart](/ui/sources/dropbox-source-quickstart) instead.
+4. To keep enhancing your workflow, skip ahead to the [walkthrough](/ui/walkthrough).
+
+5. To move beyond local file processing, you can try the following [remote quickstart](#remote-quickstart), or skip over to the [Dropbox source connector quickstart](/ui/sources/dropbox-source-quickstart) instead.
 
 You can also learn about Unstructured [source connectors](/ui/sources/overview), [destination connectors](/ui/destinations/overview), [workflows](/ui/workflows), [jobs](/ui/jobs), and [managing your account](/ui/account/overview).
diff --git a/ui/walkthrough.mdx b/ui/walkthrough.mdx
new file mode 100644
index 00000000..a9d7b368
--- /dev/null
+++ b/ui/walkthrough.mdx
@@ -0,0 +1,289 @@
+---
+title: Unstructured UI walkthrough
+sidebarTitle: Walkthrough
+---
+
+This walkthrough provides you with deep, hands-on experience with the [Unstructured user interface (UI)](/ui/overview). As you follow along, you will learn how to use many of Unstructured's
+features for [partitioning](/ui/partitioning), [enriching](/ui/enriching/overview), [chunking](/ui/chunking), and [embedding](/ui/embedding). These features are optimized for turning
+your source documents and data into information that is well-tuned for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.
+
+This walkthrough uses two sample files to demonstrate how Unstructured identifies and processes content such as images, graphs, complex tables, non-English characters, and handwriting.
+These files, which are available for you to download to your local machine, include:
+
+- Wang, Z., Liu, X., & Zhang, M. (2022, November 23).
+  _Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling_.
+  arXiv.org. https://arxiv.org/pdf/2211.12781. This 12-page PDF file features English and non-English characters, images, graphs, and complex tables.
+  Throughout this walkthrough, this file's title is shortened to "Chinese Characters" for brevity.
+- United States Central Security Service. (2012, January 27). _National Cryptologic Museum Opens New Exhibit on Dr. John Nash_.
+  United States National Security Agency. https://courses.csail.mit.edu/6.857/2012/files/H03-Cryptosystem-proposed-by-Nash.pdf.
+  This PDF file features English handwriting and scanned images of documents.
+  Throughout this walkthrough, this file's title is shortened to "Nash letters" for brevity.
+
+If you are not able to complete any of the following steps, contact Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
+
+## Step 1: Sign up and sign in to Unstructured
+
+import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-simple-ui-only.mdx';
+
+<GetStartedSimpleUIOnly />
+
+## Step 2: Create a custom workflow
+
+In this step, you create a custom [workflow](/ui/workflows) in your Unstructured account. Workflows are
+defined sequences of processes that automate the flow of data from your source documents and data into Unstructured for processing.
+Unstructured then sends its processed data over to your destination file storage locations, databases, and vector stores.
+
+1. After you are signed in to your Unstructured account, on the sidebar, click **Workflows**.
+
+   ![Workflows button on the sidebar](/img/ui/walkthrough/WorkflowsSidebar.png)
+
+2. Click **New Workflow**.
+
+   ![New Workflow button](/img/ui/walkthrough/NewWorkflow.png)
+
+3. With **Build it Myself** already selected, click **Continue**.
+
+   ![Build it Myself workflow option](/img/ui/walkthrough/BuildItMyself.png)
+
+4. The workflow designer appears.
+
+   ![The workflow designer](/img/ui/walkthrough/WorkflowDesigner.png)
+
+## Step 3: Experiment with partitioning
+
+In this step, you use your new workflow to [partition](/ui/partitioning) the sample PDF files that you downloaded earlier.
+Partitioning is the process where Unstructured identifies and extracts content from your source documents and then
+presents this content as a series of contextually rich [document elements and metadata](/ui/document-elements). This step
+shows how well the **High Res** partitioning strategy identifies and extracts content, and how well **VLM** handles
+more complex content such as complex tables, multilingual characters, and handwriting.
+
+1. With the workflow designer active from the previous step, at the bottom of the **Source** node, click **Drop file to test**.
+
+   ![Drop file to test button](/img/ui/walkthrough/DropFileToTest.png)
+
+2. Browse to and select the "Chinese Characters" PDF file that you downloaded earlier.
+3. Click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **High Res**.
+
+   ![Selecting the High Res partitioning strategy](/img/ui/walkthrough/HighResPartitioner.png)
+
+4. Immediately above the **Source** node, click **Test**.
+
+   ![Begin testing the local file](/img/ui/walkthrough/TestLocalFile.png)
+
+5. The PDF file appears in a pane on the left side of the screen, and Unstructured's output appears in a **Test output** pane on the right side of the screen.
+
+   ![Showing the test output results](/img/ui/walkthrough/TestOutputResults.png)
+
+6. Some interesting portions of the output include the following, which you can get to by clicking **Search JSON** above the output:
+
+   ![Searching the JSON output](/img/ui/walkthrough/SearchJSON.png)
+
+   - The Chinese characters on page 3. Search for the text `In StrokeNet, the corresponding`. Notice that the Chinese characters are not interpreted correctly.
+   - The formula on page 5. Search for the text `L= LL + Ln`. Notice that the formula's output diverges quite a bit from the original content.
+   - Table 2 on page 6. Search for the text `Model Parameters Performance (BLEU)`. Notice that the `text_as_html` output diverges slightly from the original content.
+   - Figure 4 on page 8. Search for the text `50 45 40 35`. Notice that the output is not that informative about the original image's content.
+
+   These issues will be addressed later in this step when you change the partitioning strategy to **VLM**, and later in **Step 4** when you add enrichments alongside **High Res** partitioning.
+   (A sketch of the JSON element structure that you are searching through appears at the end of step 7.)
+
+7. Now try changing the partitioning strategy to **VLM** and see how the output changes. To do this:
+
+   a. Click the close (**X**) button above the output on the right side of the screen.
+ b. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **VLM**.
+ c. Under **Select VLM Model**, under **Anthropic**, select **Claude 3.5 Sonnet**.
+ d. Click **Test**.
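+
+   Before moving on to step 8, it helps to know the shape of the JSON that you are searching. Unstructured outputs a list of document elements, where each element is an object with a `type`, an `element_id`, the extracted `text`, and a `metadata` object. The following Python sketch shows a representative element; the values here are placeholders, and the exact set of metadata fields varies by element type and by partitioning strategy:
+
+   ```python
+   # A representative (not literal) Unstructured document element.
+   # All values below are placeholders for illustration only.
+   element = {
+       "type": "Table",
+       "element_id": "0a1b2c...",  # placeholder ID
+       "text": "Model Parameters Performance (BLEU) ...",
+       "metadata": {
+           "page_number": 6,
+           "filename": "2211.12781.pdf",  # placeholder filename
+           "languages": ["eng"],
+           # Table elements also carry an HTML rendering of the table:
+           "text_as_html": "<table>...</table>",
+       },
+   }
+   ```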
+
+8. Notice how the output changes, now that you are using the **VLM** strategy:
+
+   - The Chinese characters on page 3. Search for the text `In StrokeNet, the corresponding`. Notice that the Chinese characters are interpreted correctly.
+   - The formula on page 5. Search for the text `match class`. Notice that the formula's output is closer to the original content.
+   - Table 2 on page 6. Search for the text `Model Parameters Performance (BLEU)`. Notice that the `text_as_html` output is closer to the original content.
+   - Figure 4 on page 8. Search for the text `Graph showing BLEU scores comparison`. Notice the informative description of the figure.
+
+9. Now try looking at the "Nash letters" PDF file's output. To do this:
+
+   a. Click the close (**X**) button above the output on the right side of the screen.
+ b. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **High Res**.
+ c. At the bottom of the **Source** node, click the existing PDF's file name.
+ d. Browse to and select the "Nash letters" file that you downloaded earlier.
+ e. Click **Test**.
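+
+   The UI is all you need for this walkthrough. That said, if you later want to reproduce a **High Res** run programmatically, the open-source Unstructured Python library offers a comparable `hi_res` strategy. The following is a minimal sketch, assuming that you have run `pip install "unstructured[pdf]"` and saved the "Nash letters" file locally under the filename shown (the UI's **VLM** strategy is a platform capability and is not shown here):
+
+   ```python
+   from unstructured.partition.pdf import partition_pdf
+
+   # Partition the "Nash letters" PDF with the hi_res strategy;
+   # infer_table_structure adds text_as_html metadata for any tables.
+   elements = partition_pdf(
+       filename="H03-Cryptosystem-proposed-by-Nash.pdf",  # assumes a local copy
+       strategy="hi_res",
+       infer_table_structure=True,
+   )
+
+   for element in elements:
+       print(element.metadata.page_number, element.category, element.text[:60])
+   ```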
+
+10. Some interesting portions of the output include the following:
+
+    - The handwriting on page 3. Search for the text `Deo Majr`. Notice that the handwriting is not recognized correctly.
+    - The mimeograph on page 11. Search for the text `Technicans at this Agency` (note the typo `Technicans` in the source document).
+      Notice that the mimeograph contains `18 January 1955`, but the output contains only `January 1955`.
+    - The handwritten diagrams on page 13. Search for the text `"page_number": 13`. Notice that no output is generated for the diagrams.
+
+11. Now try changing the partitioning strategy to **VLM** and see how the output changes. To do this:
+
+    a. Click the close (**X**) button above the output on the right side of the screen.
+ b. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **VLM**.
+ c. Under **Select VLM Model**, under **Anthropic**, select **Claude 3.5 Sonnet**.
+ d. Click **Test**.
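+
+    If you copy the JSON from the **Test output** pane into a local file, you can also explore the results programmatically instead of searching in the browser. A small sketch, assuming that you saved the output to a file named `nash-letters-output.json` (a hypothetical filename):
+
+    ```python
+    import json
+
+    # Load a locally saved copy of the test output (hypothetical filename).
+    with open("nash-letters-output.json") as f:
+        elements = json.load(f)
+
+    # Pull every element from page 13, where the VLM strategy returns the
+    # handwritten diagrams as Mermaid markup.
+    page_13 = [el for el in elements if el["metadata"].get("page_number") == 13]
+    for el in page_13:
+        print(el["type"], el["text"][:80])
+    ```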
+
+12. Notice how the output changes, now that you are using the **VLM** strategy:
+
+    - The handwriting on page 3. Search for the text `Dear Major Grosjean`. Notice how well the handwriting is now recognized.
+    - The mimeograph on page 11. Search for the text `Technicians at this Agency` (note that the source document's typo `Technicans` has been corrected to `Technicians`).
+      Notice that the mimeograph contains `18 January 1955`, and the output now also contains `18 January 1955`.
+    - The handwritten diagrams on page 13. Search for the text `graph LR`. Notice that [Mermaid](https://docs.mermaidchart.com/mermaid-oss/intro/syntax-reference.html) representations of the
+      handwritten diagrams are output.
+
+13. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+    the workflow designer for the next step.
+
+## Step 4: Experiment with enriching
+
+In this step, you add several [enrichments](/ui/enriching/overview) to your workflow. These enrichments generate summary descriptions of detected images and tables,
+HTML representations of detected tables, and lists of detected entities (such as people and organizations) along with the inferred relationships among these entities.
+
+1. With the workflow designer active from the previous step, change the **Partitioner** node to use **High Res**.
+2. Between the **Partitioner** and **Destination** nodes, click the add (**+**) icon, and then click **Enrich > Enrichment**.
+
+   ![Adding an enrichment node](/img/ui/walkthrough/AddEnrichment.png)
+
+3. In the node's settings pane's **Details** tab, select **Image** under **Input Type**, and then click **OpenAI (GPT-4o)** under **Model**.
+4. Repeat this process to add three more nodes between the **Partitioner** and **Destination** nodes. To do this, click
+   the add (**+**) icon, and then click **Enrich > Enrichment**, as follows:
+
+   a. Add a **Table** (under **Input Type**) enrichment node with **OpenAI (GPT-4o)** (under **Model**) and **Table Description** (under **Task**) selected.
+ b. Add another **Table** (under **Input Type**) enrichment node with **OpenAI (GPT-4o)** (under **Model**) and **Table to HTML** (under **Task**) selected.
+ c. Add a **Text** (under **Input Type**) enrichment node with **OpenAI (GPT-4o)** (under **Model**) selected.
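+
+   The **Text** enrichment node that you just added performs named entity recognition (NER). As a preview of what you will see in step 7, the following sketch shows the kind of information that this enrichment adds to an element's metadata. The field names and values here are illustrative only, and the exact layout in your output may differ:
+
+   ```python
+   # Illustrative only: field names and values are invented to show the
+   # kind of entity and relationship data the Text enrichment produces.
+   entity_metadata = {
+       "entities": [
+           {"entity": "Zhijun Wang", "type": "PERSON"},
+           {"entity": "Example University", "type": "ORGANIZATION"},  # invented
+       ],
+       "relationships": [
+           {
+               "from": "Zhijun Wang",
+               "relationship": "affiliated_with",
+               "to": "Example University",  # invented
+           },
+       ],
+   }
+   ```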
+
+   The workflow designer should now look like this:
+
+   ![The workflow with enrichments added](/img/ui/walkthrough/EnrichedWorkflow.png)
+
+5. Change the **Source** node to use the "Chinese Characters" PDF file, and then click **Test**.
+6. In the **Test output** pane, make sure that **Enrichment (5 of 5)** is showing. If not, click the right arrow (**>**) until **Enrichment (5 of 5)** appears, which will show the output from the last node in the workflow.
+
+   ![The final Enrichment node's output](/img/ui/walkthrough/GoToEnrichmentNode.png)
+
+7. Some interesting portions of the output include the following:
+
+   - The figures on pages 3, 7, and 8. Search for the seven instances of the text `"type": "Image"`. Notice the summary description for each image.
+   - The tables on pages 6, 7, 8, 9, and 12. Search for the seven instances of the text `"type": "Table"`. Notice the summary description for each of these tables.
+     Also notice the `text_as_html` field for each of these tables.
+   - The identified entities and inferred relationships among them. Search for the text `Zhijun Wang`. Of the eight instances of this name, notice
+     the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice.
+
+8. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+   the workflow designer for the next step.
+
+## Step 5: Experiment with chunking
+
+In this step, you apply [chunking](/ui/chunking) to your workflow. Chunking is the process where Unstructured rearranges
+the resulting document elements into manageable "chunks" to stay within the limits of an embedding model and to improve retrieval precision. For the
+best chunking strategy to apply to your use case, see the documentation for your target embedding model and downstream application toolsets.
+(A short sketch of how the chunker's character limits interact appears at the end of step 8.)
+
+1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Transform > Chunker**.
+
+   ![Adding a chunker node](/img/ui/walkthrough/AddChunker.png)
+
+2. In the node's settings pane's **Details** tab, select **Chunk by Character**.
+3. Under **Chunk by Character**, specify the following settings:
+
+   - Check the box labelled **Include Original Elements**.
+   - Set **Max Characters** to **500**.
+   - Set **New After N Characters** to **400**.
+   - Set **Overlap** to **50**.
+   - Leave **Contextual Chunking** turned off and **Overlap All** unchecked.
+
+   ![Setting up the Chunk by Character strategy](/img/ui/walkthrough/ChunkByCharacter.png)
+
+4. With the "Chinese Characters" PDF file still selected in the **Source** node, click **Test**.
+5. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
+6. To explore the chunker's results, search for the text `"type": "CompositeElement"`.
+7. Try running this workflow again with the **Chunk by Title** strategy, as follows:
+
+   a. Click the close (**X**) button above the output on the right side of the screen.
+ b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Title**.
+   c. Under **Chunk by Title**, specify the following settings:
+
+      - Check the box labelled **Include Original Elements**.
+      - Set **Max Characters** to **500**.
+      - Set **New After N Characters** to **400**.
+      - Set **Overlap** to **50**.
+      - Leave **Contextual Chunking** turned off, leave **Combine Text Under N Characters** blank, and leave **Multipage Sections** and **Overlap All** unchecked.
+
+   d. Click **Test**.
+ e. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
+   f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that some of the chunks that immediately
+      precede titles are shorter than the limits allow, because the chunker starts a new chunk at each detected title.
+
+8. Try running this workflow again with the **Chunk by Page** strategy, as follows:
+
+   a. Click the close (**X**) button above the output on the right side of the screen.
+ b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Page**.
+   c. Under **Chunk by Page**, specify the following settings:
+
+      - Check the box labelled **Include Original Elements**.
+      - Set **Max Characters** to **500**.
+      - Set **New After N Characters** to **400**.
+      - Set **Overlap** to **50**.
+      - Leave **Contextual Chunking** turned off, and leave **Overlap All** unchecked.
+
+   d. Click **Test**.
+ e. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
+   f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that some of the chunks that immediately
+      precede page breaks are shorter than the limits allow, because the chunker starts a new chunk at each page break.
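+
+   Across the character-based strategies, **Max Characters** is a hard ceiling and **New After N Characters** is a soft one: with the settings above, the chunker starts a new chunk once the current one passes 400 characters, never lets a chunk exceed 500, and, when it must split a long run of text, repeats the last 50 characters of one chunk at the start of the next. The following deliberately naive sketch illustrates only the hard limit and the overlap; it is not Unstructured's actual algorithm, which packs whole document elements first and splits only oversized ones:
+
+   ```python
+   def naive_character_chunks(text, max_characters=500, overlap=50):
+       # Naive illustration: split on raw character counts with overlap.
+       # The real chunker respects element boundaries and a soft limit too.
+       chunks = []
+       start = 0
+       while start < len(text):
+           end = min(start + max_characters, len(text))
+           chunks.append(text[start:end])
+           if end == len(text):
+               break
+           start = end - overlap  # carry 50 characters into the next chunk
+       return chunks
+
+   chunks = naive_character_chunks("some long stretch of document text " * 100)
+   print(len(chunks), len(chunks[0]))  # several chunks, each at most 500 characters
+   ```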
+
+9. Try running this workflow again with the **Chunk by Similarity** strategy, as follows:
+
+   a. Click the close (**X**) button above the output on the right side of the screen.
+ b. In the workflow designer, click the **Chunker** node and then, in the node's settings pane's **Details** tab, select **Chunk by Similarity**.
+   c. Under **Chunk by Similarity**, specify the following settings:
+
+      - Check the box labelled **Include Original Elements**.
+      - Set **Max Characters** to **500**.
+      - Set **Similarity Threshold** to **0.99**.
+      - Leave **Contextual Chunking** turned off.
+
+   d. Click **Test**.
+ e. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
+   f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of many of the chunks fall well short of the **Max Characters** limit. This is because a similarity threshold
+      of 0.99 means that only sentences or text segments with a near-perfect semantic match will be grouped together into the same chunk. This is an extremely high threshold, resulting in very short, highly specific chunks of text.
+   g. If you change **Similarity Threshold** to **0.01** and run the workflow again, searching for the text `"type": "CompositeElement"`, many of the chunks will now come closer to the **Max Characters** limit. This is because a similarity threshold
+      of 0.01 tolerates almost any difference between pieces of text, so nearly everything is grouped together.
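+
+   The similarity threshold is a cosine-similarity cutoff applied to vector embeddings of the text, the same kind of vectors that the **Embedder** node in **Step 6** stores alongside your content. Here is a minimal sketch of the comparison being made, using toy three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):
+
+   ```python
+   import numpy as np
+
+   def cosine_similarity(a, b):
+       # Cosine similarity: 1.0 means identical direction, 0.0 unrelated.
+       return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+   # Toy "embeddings" of two similar sentences (invented values).
+   sentence_a = np.array([0.9, 0.1, 0.0])
+   sentence_b = np.array([0.8, 0.2, 0.1])
+
+   score = cosine_similarity(sentence_a, sentence_b)
+   print(round(score, 3))  # ~0.984: grouped at a 0.01 threshold, split at 0.99
+   ```
+
+   At a threshold of 0.99, even these two closely related sentences would land in separate chunks; at 0.01, nearly any pair clears the bar, so chunks grow toward the **Max Characters** limit.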
+
+10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+    the workflow designer for the next step.
+
+## Step 6: Experiment with embedding
+
+In this step, you generate [embeddings](/ui/embedding) for your workflow. Embeddings are vectors of numbers that represent various aspects of the text that is extracted by Unstructured.
+These vectors are stored or "embedded" next to the text itself in a vector store or vector database. Chatbots, agents, and other AI solutions can use
+these vector embeddings to more efficiently and effectively find, analyze, and use the associated text. These vector embeddings are generated by an
+embedding model that is provided by an embedding provider. For the best embedding model to apply to your use case, see the documentation for your target downstream application toolsets.
+
+1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Transform > Embedder**.
+
+   ![Adding an embedder node](/img/ui/walkthrough/AddEmbedder.png)
+
+2. In the node's settings pane's **Details** tab, under **Select Embedding Model**, for **Azure OpenAI**, select **Text Embedding 3 Small [dim 1536]**.
+3. With the "Chinese Characters" PDF file still selected in the **Source** node, click **Test**.
+4. In the **Test output** pane, make sure that **Embedder (7 of 7)** is showing. If not, click the right arrow (**>**) until **Embedder (7 of 7)** appears, which will show the output from the last node in the workflow.
+5. To explore the embeddings, search for the text `"embeddings"`.
+6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
+   the workflow designer so that you can continue refining the workflow later as you see fit.
+
+## Next steps
+
+Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing
+context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.
+
+Right now, your workflow only accepts one local file at a time for input, and it only sends Unstructured's processed data to your screen.
+You can modify your workflow to accept multiple files and data from one or more file storage
+locations, databases, and vector stores, and to send Unstructured's processed data to them as well. To learn how to do this, try one or more of the following quickstarts:
+
+- [Remote quickstart](/ui/quickstart#remote-quickstart)
+- [Dropbox source connector quickstart](/ui/sources/dropbox-source-quickstart)
+- [Pinecone destination connector quickstart](/ui/destinations/pinecone-destination-quickstart)
+
+Unstructured also offers an API and SDKs, which allow you to work with Unstructured programmatically. For details, see:
+
+- [Unstructured API quickstart](/api-reference/workflow/overview#quickstart)
+- [Unstructured Python SDK](/api-reference/workflow/overview#unstructured-python-sdk)
+- [Unstructured API overview](/api-reference/overview)
+
+If you are not able to complete any of the preceding quickstarts, contact Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
\ No newline at end of file
diff --git a/welcome.mdx b/welcome.mdx
index e0dcb411..0218c60f 100644
--- a/welcome.mdx
+++ b/welcome.mdx
@@ -65,6 +65,8 @@ You can use Unstructured through a user interface (UI), an API, or both.
Read on allowfullscreen > +  [Keep enhancing your workflow](/ui/walkthrough). +   [Learn more](/ui/overview). ---