diff --git a/img/ui/walkthrough/GoToEnrichmentNode.png b/img/ui/walkthrough/GoToEnrichmentNode.png index 83440005..122422a3 100644 Binary files a/img/ui/walkthrough/GoToEnrichmentNode.png and b/img/ui/walkthrough/GoToEnrichmentNode.png differ diff --git a/img/ui/walkthrough/VLMPartitioner.png b/img/ui/walkthrough/VLMPartitioner.png new file mode 100644 index 00000000..e4a43737 Binary files /dev/null and b/img/ui/walkthrough/VLMPartitioner.png differ diff --git a/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx b/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx index deeef392..1c0a26ce 100644 --- a/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx +++ b/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx @@ -180,8 +180,7 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine - You can scroll through the original file on the left or, where supported for a given file type, click the up and down arrows to page through the file one page at a time. - You can scroll through Unstructured's JSON output on the right, and you can click **Search JSON** to search for specific text in the JSON output. You will do this next. - **Download Full JSON** allows you to download the full output to your local machine as a JSON file. - - **View JSON at this step** allows you to view the JSON output at each step in the workflow as it was further processed. There's only one step right now (the **Partitioner** step), - but as you add more nodes to the workflow DAG, this can be a useful tool to see how the JSON output changes along the way. + - **View JSON at this step** allows you to view the JSON output at each step in the workflow as it is further processed. - The close (**X**) button returns you to the workflow designer. 
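A note on **Download Full JSON**: the downloaded output is a flat JSON array of document elements, so it is easy to explore locally. A minimal Python sketch that tallies element types (the inline sample stands in for a real downloaded file; in practice you would `json.load` the file you saved):

```python
import json
from collections import Counter

# In practice, load the file saved via "Download Full JSON", for example:
#   with open("output.json", encoding="utf-8") as f:
#       elements = json.load(f)
# Here, a tiny inline sample stands in for the real output.
elements = json.loads("""
[
  {"type": "Title", "text": "Some heading", "metadata": {"page_number": 1}},
  {"type": "Table", "text": "...", "metadata": {"page_number": 6}},
  {"type": "Image", "text": "...", "metadata": {"page_number": 3}}
]
""")

# Tally how many elements of each type (Title, Table, Image, ...) were produced.
type_counts = Counter(el["type"] for el in elements)
for element_type, count in type_counts.most_common():
    print(f"{element_type}: {count}")
```

The same loop works unchanged on the real download, since every element carries a `type`, `text`, and `metadata` field.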
@@ -193,15 +192,12 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine ![Searching the JSON output](/img/ui/walkthrough/SearchJSON.png) - - The Chinese characters on page 1. Search for the text `all have the meaning of acting`. Notice how the Chinese characters are captures correctly. - - The HTML representations of the seven tables on pages 6-9 and 12. Search for the text `"text_as_html":`. - - The descriptions of the four diagrams on page 3. Search for the text `\"diagram\",\n \"description\"`. - - The descriptions of the three graphs on pages 7-8. Search for the text `\"graph\",\n \"description\"`. - - The Base64-encoded, full-fidelity representations of the 14 tables, diagrams, and graphs on pages 3, 6-9, and 12. - Search for the text `"image_base64":`. You can use a web-based tool such as [base64.guru](https://base64.guru/converter/decode/image) - to experiment with decoding these representations back into their original visual representations. + - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice how the Chinese characters are captured correctly. + - The HTML representations of the seven tables on pages 6-9 and 12. Search for the text `text_as_html`. + - The descriptions of the four diagrams on page 3. Search for the text `"diagram` (including the opening quotation mark). + - The descriptions of the three graphs on pages 7-8. Search for the text `"graph` (including the opening quotation mark). -8. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to +7. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to the workflow designer for the next step. ## Step 4: Add more enrichments @@ -209,7 +205,7 @@ In this step, you will test the **High Res** partitioning strategy on the "Chine Your existing workflow already has three **Enrichment** nodes. 
Recall that these nodes perform the following enrichments: - An [image description](ui/enriching/image-descriptions) enrichment, which uses a vision language model (VLM) to provide a text-based summary of the contents of the each detected image. -- A [generative OCR](/ui/enriching/generative-ocr) enrichment, which uses a VLM to improve the accuracy of each block of initially-processed text. +- A [generative OCR](/ui/enriching/generative-ocr) enrichment, which uses a VLM to improve the accuracy of each block of initially-processed text, as needed. - A [table to HTML](/ui/enriching/table-to-html) enrichment, which uses a VLM to provide an HTML-structured representation of each detected table. In this step, you add a few more [enrichments](/ui/enriching/overview) to your workflow, such as generating summary descriptions of detected tables, @@ -222,9 +218,9 @@ and generating detected entities (such as people and organizations) and the infe 2. In the node's settings pane's **Details** tab, click: - **Table** under **Input Type**. - - **Anthropic** under **Provider**. - - **Claude Sonnet 4** under **Model**. - - **Table Description** under **Task**. + - Any available choice under **Provider**. + - Any available choice under **Model**. + - If not already selected, **Table Description** under **Task**. The table description enrichment generates a summary description of each detected table. This can help you to more quickly and easily understand @@ -236,14 +232,14 @@ and generating detected entities (such as people and organizations) and the infe In the node's settings pane's **Details** tab, click: - **Text** under **Input Type**. - - **Anthropic** under **Provider**. - - **Claude Sonnet 4** under **Model**. + - Any available choice under **Provider**. + - Any available choice under **Model**. The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. 
This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner). - The workflow designer should now look like this: + The workflow designer should now look similar to this: ![The workflow with enrichments added](/img/ui/walkthrough/EnrichedWorkflow.png) @@ -251,8 +247,8 @@ and generating detected entities (such as people and organizations) and the infe 5. In the **Test output** pane, make sure that **Enrichment (6 of 6)** is showing. If not, click the right arrow (**>**) until **Enrichment (6 of 6)** appears, which will show the output from the last node in the workflow. 6. Some interesting portions of the output include the following: - - The descriptions of the seven tables on pages 6-9 and 12. Search for the text `## Table Structure Analysis\n\n###`. - - The identified entities and inferred relationships among them. For example, search for the text `Zhijun Wang`. Of the eight instances of this name, notice + - The descriptions of the seven tables on pages 6-9 and 12. Search for the text `"Table"` (including the quotation marks). + - The identified entities and inferred relationships among them. For example, search for the text `Zhijun Wang`. Of the nine instances of this name, notice the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice. 7. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to @@ -303,11 +299,11 @@ the resulting document elements' `text` content into manageable "chunks" to stay _What do each of these chunking settings do?_ - **Contextual Chunking** prepends chunk-specific explanatory context to each chunk, which has been shown to yield significant improvements in downstream retrieval accuracy. [Learn more](/ui/chunking#contextual-chunking). 
- - **Include Original Elements** outputs into each chunk's `metadata` field's `orig_elements` value the elements that were used to form that particular chunk. [Learn more](/ui/chunking#include-original-elements-setting). + - **Include Original Elements** outputs into each chunk's `metadata` field's `orig_elements` value the elements that were used to form that particular chunk. These elements are output in gzip-compressed, Base64-encoded format. To get back to the original content, Base64-decode the bytes, gzip-decompress them, and then decode the result as UTF-8. [Learn more](/ui/chunking#include-original-elements-setting). - **Max Characters** is the "hard" or maximum number of characters that any one chunk can contain. Unstructured cannot exceed this number when forming chunks. [Learn more](/ui/chunking#max-characters-setting). - **New After N Characters**: is the "soft" or approximate number of characters that any one chunk can contain. Unstructured can exceed this number if needed when forming chunks (but still cannot exceed the **Max Characters** setting). [Learn more](/ui/chunking#new-after-n-characters-setting). - **Overlap**, when applied (see **Overlap All**), prepends to the current chunk the specified number of characters from the previous chunk, which can help provide additional context about this chunk relative to the previous chunk. [Learn more](/ui/chunking#overlap-setting) - - **Overlap All** applies the **Overlap** setting (if greater than zero) to all chunks. 
Otherwise, unchecking this box means that the **Overlap** setting (if greater than zero) is applied only in edge cases where "normal" chunks cannot be formed by combining whole elements. Check this box with caution as it can introduce noise into otherwise clean semantic units. [Learn more](/ui/chunking#overlap-all-setting). 4. Immediately above the **Source** node, click **Test**. @@ -409,7 +405,7 @@ embedding model that is provided by an embedding provider. For the best embeddin 2. In the node's settings pane's **Details** tab, under **Select Embedding Model**, for **Azure OpenAI**, select **Text Embedding 3 Small [dim 1536]**. 3. Immediately above the **Source** node, click **Test**. 4. In the **Test output** pane, make sure that **Embedder (8 of 8)** is showing. If not, click the right arrow (**>**) until **Embedder (8 of 8)** appears, which will show the output from the last node in the workflow. -5. To explore the embeddings, search for the text `"embeddings"`. +5. To explore the embeddings, search for the text `"embeddings"` (including the quotation marks). _What do all of these numbers mean?_ diff --git a/ui/chunking.mdx b/ui/chunking.mdx index da32fd39..c8dae896 100644 --- a/ui/chunking.mdx +++ b/ui/chunking.mdx @@ -103,7 +103,7 @@ Here are a few examples: If the option to include original elements is specified, during chunking the `orig_elements` field is added to the `metadata` field of each chunked element. The `orig_elements` field is a list of the original elements that were used to create the current chunked element. This list is output in -compressed Base64 gzipped format. To get back to the original content for this list, Base64-decode the list's bytes, decompress them, and then decode them using UTF-8. +gzip-compressed, Base64-encoded format. To get back to the original content for this list, Base64-decode the list's bytes, gzip-decompress them, and then decode the result as UTF-8. [Learn how](/api-reference/partition/get-chunked-elements). 
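Those decoding steps can be sketched in Python; the round-trip below uses a stand-in payload, and the helper name `decode_orig_elements` is illustrative, not part of any Unstructured SDK:

```python
import base64
import gzip
import json

def decode_orig_elements(orig_elements_b64: str) -> list:
    """Recover the original elements from a chunk's metadata.orig_elements value."""
    compressed = base64.b64decode(orig_elements_b64)        # 1. Base64-decode
    raw_json = gzip.decompress(compressed).decode("utf-8")  # 2. gzip-decompress, then UTF-8-decode
    return json.loads(raw_json)                             # 3. parse the element list

# Round-trip demo with a stand-in payload:
sample = [{"type": "Title", "text": "Example"}]
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")
print(decode_orig_elements(encoded))  # [{'type': 'Title', 'text': 'Example'}]
```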
After chunking, `Image` elements are not preserved in the output. However, diff --git a/ui/walkthrough.mdx b/ui/walkthrough.mdx index 31439ab6..a2cca9ea 100644 --- a/ui/walkthrough.mdx +++ b/ui/walkthrough.mdx @@ -142,36 +142,43 @@ In this step, you use your new workflow to [partition](/ui/partitioning) the sam Partitioning is the process where Unstructured identifies and extracts content from your source documents and then outputs this content as a series of contextually-rich [document elements and metadata](/ui/document-elements), which are well-tuned for RAG, agentic AI, and model fine-tuning. This step -shows how well Unstructured's **High Res** partitioning strategy identifies and extracts content, and how well -Unstructured's **VLM** partitioning strategy handles -more complex content such as complex tables, multilanguage characters, and handwriting. +shows how well Unstructured's **VLM** partitioning strategy handles challenging content such as complex tables, multilanguage characters, and handwriting. 1. With the workflow designer active from the previous step, at the bottom of the **Source** node, click **Drop file to test**. ![Drop file to test button](/img/ui/walkthrough/DropFileToTest.png) 2. Browse to and select the "Chinese Characters" PDF file that you downloaded earlier. -3. Click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **High Res**. +3. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **VLM**.
- ![Selecting the High Res partitioning strategy](/img/ui/walkthrough/HighResPartitioner.png) + + _When would I choose **Auto**, **Fast**, **High Res**, or **VLM**?_ + + - **Auto** is recommended in most cases. It lets Unstructured figure out the best strategy to switch over to for each incoming file (and even for each page if the incoming file is a PDF), so you don't have to! + - **Fast** is only for when you know for certain that none of your files have tables, images, or multilanguage, scanned, or handwritten content in them. It's optimized for partitioning text-only content and is the fastest of all the strategies. It can recognize the text for only a few languages other than English. + - **High Res** is only for when you know for certain that at least one of your files has images or simple tables in it, and that none of your files also have scanned or handwritten content in them. It can recognize the text for more languages than **Fast** but not as many as **VLM**. + - **VLM** is great for any file, but it is best when you know for certain that some of your files have a combination of tables (especially complex ones), images, and multilanguage, scanned, or handwritten content. It's the highest quality but slowest of all the strategies. + + +4. Under **Select VLM Model**, under **Anthropic**, select **Claude Sonnet 4**.
+ + ![Selecting the VLM for partitioning](/img/ui/walkthrough/VLMPartitioner.png) - _When would I choose **Auto**, **Fast**, **High Res**, or **VLM**?_ + _If I choose **VLM**, when would I choose one of these models over another?_ - - **Auto** is recommended in most cases. It lets Unstructured figure out the best strategy to switch over to for each incoming file (and even for each page if the incoming file is a PDF), so you don't have to! - - **Fast** is only for when you know for certain that none of your files have tables, images, or multilanguage, scanned, or handwritten content in them. It's optimized for partitioning text-only content and is the fastest of all the strategies. It can recognize the text for only a few languages other than English. - - **High Res** is only for when you know for certain that at least one of your files has images or simple tables in them, and that none of your files also have scanned or handwritten content in them. It can recognize the text for more languages than **Fast** but not as many as **VLM**. - - **VLM** is great for any file, but it is best when you know for certain that some of your files have a combination of tables (especially complex ones), images, and multilanguage, scanned, or handwritten content. It's the highest quality but slowest of all the strategies. - - In this walkthrough, you switch between **High Res** and **VLM** strategies only to see how each of these strategies works with a combination of - complex tables, images, and multilanguage, scanned, and handwritten content. In practice, for these kinds of files you would likely just want to choose **Auto**. + A _vision language model_ (VLM) is designed to use sophisticated AI techniques and logic to combine advanced image and text understanding, resulting in more accurate and + contextually-rich output. + + As VLMs are constantly being released and improved, Unstructured is always adding to and updating its list of supported VLMs. 
+ If you aren't getting consistent results with one VLM for a particular set of files, switching over to another one might improve your results, depending on that VLM's capabilities and the sample data that it was trained on. -4. Immediately above the **Source** node, click **Test**. +5. Click **Test**.
![Begin testing the local file](/img/ui/walkthrough/TestLocalFile.png) -5. The PDF file appears in a pane on the left side of the screen, and Unstructured's output appears in a **Test output** pane on the right side of the screen. +6. The PDF file appears in a pane on the left side of the screen, and Unstructured's output appears in a **Test output** pane on the right side of the screen. ![Showing the test output results](/img/ui/walkthrough/TestOutputResults.png) @@ -197,72 +204,27 @@ more complex content such as complex tables, multilanguage characters, and handw - The close (**X**) button returns you to the workflow designer.
-6. Some interesting portions of the output include the following, which you can get to be clicking **Search JSON** above the output: +7. Notice the following in the JSON output, which you can get to by clicking **Search JSON** above the output: ![Searching the JSON output](/img/ui/walkthrough/SearchJSON.png) - - The Chinese characters on page 3. Search for the text `In StrokeNet, the corresponding`. Notice that the Chinese characters are not interpreted correctly. - - The formula on page 5. Search for the text `L= LL + Ln`. Notice that the formula's output diverges quite a bit from the original content. - - Table 2 on page 6. Search for the text `Model Parameters Performance (BLEU)`. Notice that the `text_as_html` output diverges slightly from the original content. - - Figure 4 on page 8. Search for the text `50 45 40 35`. Notice that the output is not that informative about the original image's content. - - These quality issues will be addressed later in this step when you change the partitioning strategy to **VLM**, and later in **Step 4** when you add enrichments alongside **High Res** partitioning. + - The Chinese characters on page 1. Search for the text `verbs. The characters`. Notice that the Chinese characters are interpreted correctly. + - The tables on pages 1, 6, 7, 8, 9, and 12. Search for the text `"Table"` (including the quotation marks) to see how the VLM interprets the various tables. We'll see changes to these elements' `text` and `metadata.text_as_html` contents later in Step 4 in the enrichments portion of this walkthrough. + - The images on pages 3, 7, and 8. Search for the text `"Image"` (including the quotation marks) to see how the VLM interprets the various images. We'll see changes to these elements' `text` contents later in Step 4 in the enrichments portion of this walkthrough. -7. Now try changing the partitioning strategy to **VLM** and see how the output changes. To do this: +8. Now try looking at the "Nash letters" PDF file's output. 
To do this: a. Click the close (**X**) button above the output on the right side of the screen.
- b. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **VLM**.
- c. Under **Select VLM Model**, under **Anthropic**, select **Claude Sonnet 4**.
+ b. At the bottom of the **Source** node, click the existing PDF's file name.
+ c. Browse to and select the "Nash letters" file that you downloaded earlier to your local machine.
d. Click **Test**.
- - _When would I choose one of these models over another?_ - - A _vision language model_ (VLM) is designed to use sophisticated AI techniques and logic to combine advanced image and text understanding, resulting in more accurate and - contextually-rich output. - - As VLMs are constantly being released and improved, Unstructured is always adding to and updating its list of supported VLMs. - If you aren't getting consistent results with one VLM for a particular set of files, switching over to another one might improve your results, depending on that VLM's capabilities and the sample data that is was trained on. - - -8. Notice how the quality of the output changes, now that you are using the **VLM** strategy: - - - The Chinese characters on page 3. Search for the text `In StrokeNet, the corresponding`. Notice that the Chinese characters are intepreted correctly. - - The formula on page 5. Search for the text `match class`. Notice that the formula's output is closer to the original content. - - Table 2 on page 6. Search for the text `Model Parameters Performance (BLEU)`. Notice that the `text_as_html` output is closer to the original content. - - Figure 4 on page 8. Search for the text `Graph showing BLEU scores comparison`. Notice the informative description about the figure. +9. Notice the following in the JSON output: -9. Now try looking at the "Nash letters" PDF file's output. To do this: + - The handwriting on page 3. Search for the text `I have written RAND`. Notice how well the handwriting is recognized. + - The mimeograph on page 11. Search for the text `Technicians at this Agency`. Notice how well the mimeographed content is recognized. - a. Click the close (**X**) button above the output on the right side of the screen.
- b. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **High Res**.
- c. At the bottom of the **Source** node, click the existing PDF's file name.
- d. Browse to and select the "Nash letters" file that you downloaded earlier to your local machine.
- e. Click **Test**.
- -10. Some interesting portions of the **High Res** output against this handwritten and scanned content include the following: - - - The handwriting on page 3. Search for the text `Deo Majr`. Notice that the handwriting is not recognized correctly. - - The mimeograph on page 11. Search for the text `Technicans at this Agency` (note the typo `Technicans`). - Notice that the mimeograph contains `18 January 1955`, but the output contains only `January 1955`. - - The handwritten diagrams on page 13. Search for the text `"page_number": 13`. Notice that no output is generated for the diagrams. - -11. Now try changing the partitioning strategy to **VLM** and see how the quality of the output changes. To do this: - - a. Click the close (**X**) button above the output on the right side of the screen.
- b. In the workflow designer, click the **Partitioner** node and then, in the node's settings pane's **Details** tab, select **VLM**.
- c. Under **Select VLM Model**, under **Anthropic**, select **Claude Sonnet 4**.
- d. Click **Test**.
- -12. Notice how the output changes, now that you are using the **VLM** strategy: - - - The handwriting on page 3. Search for the text `Dear Major Grosjean`. Notice how well the handwriting is recognized correctly. - - The mimeograph on page 11. Search for the text `Technicians at this Agency` (note the corrected typo `Technicians`). - Notice that the mimoegraph contains `18 January 1955`, and the output now also contains `18 January 1955`. - - The handwritten diagrams on page 13. Search for the text `graph LR`. Notice that [Mermaid](https://docs.mermaidchart.com/mermaid-oss/intro/syntax-reference.html) representations of the - handwritten diagrams are output. - -13. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to +10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to the workflow designer for the next step. ## Step 4: Experiment with enriching @@ -277,9 +239,10 @@ HTML representations of detected tables, and detected entities (such as people a 3. In the node's settings pane's **Details** tab, click: - - **Image** under **Input Type** - - **OpenAI** under **Provider** - - **(GPT-4o)** under **Model** + - **Image** under **Input Type**. + - Any available choice for **Provider**. + - Any available choice for **Model**. + - If not already selected, **Image Description** under **Task**. The image description enrichment generates a summary description of each detected image. This can help you to more quickly and easily understand @@ -294,9 +257,9 @@ HTML representations of detected tables, and detected entities (such as people a In the node's settings pane's **Details** tab, click: - **Table** under **Input Type**. - - **Anthropic** under **Provider**. - - **Claude Sonnet 4** under **Model**. - - **Table Description** under **Task**. + - Any available choice for **Provider**. + - Any available choice for **Model**. 
+ - If not already selected, **Table Description** under **Task**. The table description enrichment generates a summary description of each detected table. This can help you to more quickly and easily understand @@ -308,8 +271,8 @@ HTML representations of detected tables, and detected entities (such as people a In the node's settings pane's **Details** tab, click: - **Table** under **Input Type**. - - **Anthropic** under **Provider**. - - **Claude Sonnet 4** under **Model**. + - **OpenAI** under **Provider**. + - Any available choice under **Model**. - **Table to HTML** under **Task**. @@ -321,8 +284,8 @@ HTML representations of detected tables, and detected entities (such as people a In the node's settings pane's **Details** tab, click: - **Text** under **Input Type**. - - **Anthropic** under **Provider**. - - **Claude Sonnet 4** under **Model**. + - Any available choice under **Provider**. + - Any available choice under **Model**. The named entity recognition (NER) enrichment generates a list of detected entities (such as people and organizations) and the inferred relationships among these entities. This provides additional context about these entities' types and their relationships for your graph databases, RAG apps, agents, and models. [Learn more](/ui/enriching/ner). @@ -333,15 +296,12 @@ HTML representations of detected tables, and detected entities (such as people a In the node's settings pane's **Details** tab, click: - **Image** under **Input Type**. - - One of the following providers and models: - - - **Anthropic** under **Provider** and any choice under **Model** - - **OpenAI** under **Provider** and any choice under **Model** - + - **Anthropic** or **Amazon Bedrock** under **Provider**. + - Any available choice under **Model**. - **Generative OCR** under **Task**. - The generative OCR enrichment improves the accuracy of text blocks that Unstructured initially processed during its partitioning phase. [Learn more](/ui/enriching/generative-ocr). 
+ The generative OCR enrichment improves, as needed, the accuracy of text blocks that Unstructured initially processed during its partitioning phase. [Learn more](/ui/enriching/generative-ocr). @@ -349,7 +309,7 @@ HTML representations of detected tables, and detected entities (such as people a This is a known issue and will be addressed in a future release. - The workflow designer should now look like this: + The workflow designer should now look similar to this: ![The workflow with enrichments added](/img/ui/walkthrough/EnrichedWorkflow.png) @@ -360,8 +320,8 @@ HTML representations of detected tables, and detected entities (such as people a 7. Some interesting portions of the output include the following: - - The figures on pages 3, 7, and 8. Search for the seven instances of the text `"type": "Image"`. Notice the summary description for each image. - - The tables on pages 6, 7, 8, 9, and 12. Search for the seven instances of the text `"type": "Table"`. Notice the summary description for each of these tables. + - The images on pages 3, 7, and 8. Search for the text `"Image"` (including the quotation marks). Notice the summary description for each image. + - The tables on pages 1, 6, 7, 8, 9, and 12. Search for the text `"Table"` (including the quotation marks). Notice the summary description for each of these tables. Also notice the `text_as_html` field for each of these tables. - The identified entities and inferred relationships among them. Search for the text `Zhijun Wang`. Of the eight instances of this name, notice the author's identification as a `PERSON` three times, the author's `published` relationship twice, and the author's `affiliated_with` relationship twice. @@ -422,8 +382,8 @@ the resulting document elements' `text` content into manageable "chunks" to stay 4. With the "Chinese Characters" PDF file still selected in the **Source** node, click **Test**. -5. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. 
If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow. -6. To explore the chunker's results, search for the text `"type": "CompositeElement"`. +5. In the **Test output** pane, make sure that **Chunker (7 of 7)** is showing. If not, click the right arrow (**>**) until **Chunker (7 of 7)** appears, which will show the output from the last node in the workflow. +6. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). _In the chunked output, where did all of the document elements I saw before, such as `Title`, `Image`, and `Table`, go?_ @@ -455,8 +415,8 @@ the resulting document elements' `text` content into manageable "chunks" to stay d. Click **Test**.
- e. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
- f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of some of the chunks that immediately + e. In the **Test output** pane, make sure that **Chunker (7 of 7)** is showing. If not, click the right arrow (**>**) until **Chunker (7 of 7)** appears, which will show the output from the last node in the workflow.
+ f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of some of the chunks that immediately precede titles might be shortened due to the presence of the title impacting the chunk's size. 8. Try running this workflow again with the **Chunk by Page** strategy, as follows: @@ -472,8 +432,8 @@ the resulting document elements' `text` content into manageable "chunks" to stay - Leave **Contextual Chunking** turned off, and leave **Overlap All** unchecked. d. Click **Test**.
- e. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
- f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of some of the chunks that immediately + e. In the **Test output** pane, make sure that **Chunker (7 of 7)** is showing. If not, click the right arrow (**>**) until **Chunker (7 of 7)** appears, which will show the output from the last node in the workflow.
+ f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of some of the chunks that immediately precede page breaks might be shortened due to the presence of the page break impacting the chunk's size.
9. Try running this workflow again with the **Chunk by Similarity** strategy, as follows: @@ -497,10 +457,10 @@ the resulting document elements' `text` content into manageable "chunks" to stay
d. Click **Test**.
- e. In the **Test output** pane, make sure that **Chunker (6 of 6)** is showing. If not, click the right arrow (**>**) until **Chunker (6 of 6)** appears, which will show the output from the last node in the workflow.
- f. To explore the chunker's results, search for the text `"type": "CompositeElement"`. Notice that the lengths of many of the chunks fall well short of the **Max Characters** limit. This is because a similarity threshold + e. In the **Test output** pane, make sure that **Chunker (7 of 7)** is showing. If not, click the right arrow (**>**) until **Chunker (7 of 7)** appears, which will show the output from the last node in the workflow.
+ f. To explore the chunker's results, search for the text `"CompositeElement"` (including the quotation marks). Notice that the lengths of many of the chunks fall well short of the **Max Characters** limit. This is because a similarity threshold of **0.99** means that only sentences or text segments with a near-perfect semantic match will be grouped together into the same chunk. This is an extremely high threshold, resulting in very short, highly specific chunks of text.
- g. If you change **Similarity Threshold** to **0.01** and run the workflow again, searching for the text `"type": "CompositeElement"`, many of the chunks will now come closer to the **Max Characters** limit. This is because a similarity threshold + g. If you change **Similarity Threshold** to **0.01** and run the workflow again, searching for the text `"CompositeElement"` (including the quotation marks), many of the chunks will now come closer to the **Max Characters** limit. This is because a similarity threshold of **0.01** provides an extreme tolerance of differences between pieces of text, grouping almost anything together.
10. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to @@ -519,8 +479,8 @@ embedding model that is provided by an embedding provider. For the best embeddin 2. In the node's settings pane's **Details** tab, under **Select Embedding Model**, for **Azure OpenAI**, select **Text Embedding 3 Small [dim 1536]**. 3. With the "Chinese Characters" PDF file still selected in the **Source** node, click **Test**. -4. In the **Test output** pane, make sure that **Embedder (7 of 7)** is showing. If not, click the right arrow (**>**) until **Embedder (7 of 7)** appears, which will show the output from the last node in the workflow. -5. To explore the embeddings, search for the text `"embeddings"`. +4. In the **Test output** pane, make sure that **Embedder (8 of 8)** is showing. If not, click the right arrow (**>**) until **Embedder (8 of 8)** appears, which will show the output from the last node in the workflow. +5. To explore the embeddings, search for the text `"embeddings"` (including the quotation marks). _What do all of these numbers mean?_
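One way to build intuition about those numbers: each element's `embeddings` value is a vector of floats, and what matters is not any single number but the vector's direction relative to other vectors, typically compared with cosine similarity. A minimal sketch over toy four-dimensional vectors (real embeddings from this model have 1536 dimensions; the variable names and values are illustrative, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = similar meaning, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real 1536-dimensional embeddings.
chunk_about_translation = [0.9, 0.1, 0.0, 0.2]
chunk_about_characters  = [0.8, 0.2, 0.1, 0.3]
chunk_about_weather     = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(chunk_about_translation, chunk_about_characters))  # close to 1.0
print(cosine_similarity(chunk_about_translation, chunk_about_weather))     # much lower
```

This is the same comparison a vector database performs at retrieval time: a query is embedded into the same vector space, and the chunks whose embeddings point in the most similar direction are returned.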