# Unstructured playground

Use this notebook to try out basic Unstructured features and components, to get familiar with them and understand how they work.

Unstructured: https://unstructured-io.github.io/unstructured/

In [19]:
# Standard library
import json

# Notebook display helper
from IPython.display import JSON

# Unstructured API client
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Local partitioning and staging helpers
from unstructured.partition.auto import partition
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json

In [2]:
filename = "documents/medium_blog.html"
elements = partition_html(filename=filename)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/davideliu/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [4]:
element_dict = [el.to_dict() for el in elements]
example_output = json.dumps(element_dict[11:13], indent=2)
print(example_output)

[
  {
    "type": "Title",
    "element_id": "29887a5ff9846ccc23327565a07e17fa",
    "text": "Share",
    "metadata": {
      "category_depth": 0,
      "last_modified": "2024-04-13T20:07:55",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "file_directory": "documents",
      "filename": "medium_blog.html",
      "filetype": "text/html"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "2bc60d779d5ea8114272b3c498f34643",
    "text": "In the vast digital universe, data is the lifeblood that drives decision-making and innovation. But not all data is created equal. Unstructured data in images and documents often hold a wealth of information that can be challenging to extract and analyze.",
    "metadata": {
      "last_modified": "2024-04-13T20:07:55",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "29887a5ff9846ccc23327565a07e17fa",
      "file_directory": "documents",
      "filename": "medium_blog.html",
   
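The `parent_id` in the metadata above links the `NarrativeText` element to the `Title` it sits under. Here is a minimal sketch of how that link could be used to group text under its title. It uses only the standard library and a hand-copied subset of the fields printed above, with the text shortened to its first sentence. The helper name `group_by_parent` is hypothetical and not part of `unstructured`.

```python
from collections import defaultdict

# Hand-copied subset of the two elements printed above (only the fields needed here;
# the NarrativeText text is shortened to its first sentence).
sample = [
    {
        "type": "Title",
        "element_id": "29887a5ff9846ccc23327565a07e17fa",
        "text": "Share",
        "metadata": {"category_depth": 0},
    },
    {
        "type": "NarrativeText",
        "element_id": "2bc60d779d5ea8114272b3c498f34643",
        "text": "In the vast digital universe, data is the lifeblood that drives "
                "decision-making and innovation.",
        "metadata": {"parent_id": "29887a5ff9846ccc23327565a07e17fa"},
    },
]


def group_by_parent(element_dicts):
    """Map each parent element_id to the texts of the elements that reference it."""
    children = defaultdict(list)
    for el in element_dicts:
        parent_id = el["metadata"].get("parent_id")
        if parent_id is not None:
            children[parent_id].append(el["text"])
    return dict(children)


grouped = group_by_parent(sample)

# Look up the title text for each parent so the grouping is readable.
titles = {el["element_id"]: el["text"] for el in sample if el["type"] == "Title"}
for parent_id, texts in grouped.items():
    print(titles.get(parent_id, "<unknown parent>"), "->", len(texts), "child element(s)")
```

The same function should work on the full `element_dict` list from the cell above, since it only relies on the `metadata` and `text` keys shown in the output.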

In [9]:
filename = "documents/msft_openai.pptx"
elements = partition_pptx(filename=filename)
element_dict = [el.to_dict() for el in elements]
output = json.dumps(element_dict[:3], indent=2)
print(output)

[
  {
    "type": "Title",
    "element_id": "50a4122943273ad2f00ea92bff9c7cb6",
    "text": "ChatGPT",
    "metadata": {
      "category_depth": 1,
      "file_directory": "documents",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-04-13T20:09:24",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "filetype": "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    }
  },
  {
    "type": "ListItem",
    "element_id": "62ec8febb7453876bd797cd6cd38ada4",
    "text": "Chat-GPT: AI Chatbot, developed by OpenAI,\u00a0trained to perform conversational tasks and\u00a0creative tasks",
    "metadata": {
      "category_depth": 0,
      "file_directory": "documents",
      "filename": "msft_openai.pptx",
      "last_modified": "2024-04-13T20:09:24",
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "50a4122943273ad2f00ea92bff9c7cb6",
      "filetype": "application/vnd.openxmlformats-officedoc

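The `ListItem` text above contains `\u00a0` (non-breaking space) characters carried over from the slide. Here is a small standard-library sketch of one way to normalize them before further processing. The helper name `normalize_spaces` is an assumption for illustration and not an `unstructured` function.

```python
# Text copied from the ListItem element printed above; \u00a0 is a non-breaking space.
raw_text = (
    "Chat-GPT: AI Chatbot, developed by OpenAI,\u00a0trained to perform "
    "conversational tasks and\u00a0creative tasks"
)


def normalize_spaces(text):
    """Replace non-breaking spaces with regular spaces and collapse whitespace runs."""
    return " ".join(text.replace("\u00a0", " ").split())


clean_text = normalize_spaces(raw_text)
print(clean_text)
```
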
In [16]:
filename = "documents/CoT.pdf"
elements = partition(filename=filename)
element_dict = [el.to_dict() for el in elements]
output = json.dumps(element_dict, indent=2)
print(output)

This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name


[
  {
    "type": "Title",
    "element_id": "bff1fd0ec25e78f1224ad7309a1e79c4",
    "text": "B All Experimental Results",
    "metadata": {
      "detection_class_prob": 0.9187041521072388,
      "coordinates": {
        "points": [
          [
            298.23516845703125,
            205.790342222222
          ],
          [
            298.23516845703125,
            238.99923111111084
          ],
          [
            713.6279296875,
            238.99923111111084
          ],
          [
            713.6279296875,
            205.790342222222
          ]
        ],
        "system": "PixelSpace",
        "layout_width": 1700,
        "layout_height": 2200
      },
      "last_modified": "2024-04-13T20:07:53",
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "file_directory": "documents",
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "ebf8dfb149bcbbd8c4b7f9a7046900a9",
 
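For PDFs the metadata carries a `coordinates` block: four corner points in `PixelSpace`, plus the layout width and height. As a rough sketch, the corners can be reduced to a bounding box expressed as fractions of the page, which is convenient when comparing pages of different sizes. The values are copied from the `Title` element above. The helper name `to_relative_bbox` is hypothetical, and the code assumes the four points describe an axis-aligned rectangle, as they do in this output.

```python
# Coordinates copied from the Title element printed above (PixelSpace, 1700 x 2200).
coordinates = {
    "points": [
        [298.23516845703125, 205.790342222222],
        [298.23516845703125, 238.99923111111084],
        [713.6279296875, 238.99923111111084],
        [713.6279296875, 205.790342222222],
    ],
    "system": "PixelSpace",
    "layout_width": 1700,
    "layout_height": 2200,
}


def to_relative_bbox(coords):
    """Return (x_min, y_min, x_max, y_max) as fractions of the layout size."""
    xs = [x for x, _ in coords["points"]]
    ys = [y for _, y in coords["points"]]
    width, height = coords["layout_width"], coords["layout_height"]
    return (min(xs) / width, min(ys) / height, max(xs) / width, max(ys) / height)


bbox = to_relative_bbox(coordinates)
print([round(v, 3) for v in bbox])
```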

In [15]:
print("\n\n".join([str(el) for el in elements]))

B All Experimental Results

This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.

For the arithmetic reasoning benchmarks, some chains of thought (along with the equations produced) were correct, except the model performed an arithmetic operation incorrectly. A similar observation was made in Cobbe et al. (2021). Hence, we can further add a Python program as an external calculator (using the Python eval function) to all the equations in the generated chain of thought. When there are multiple equations in a chain of thought, we propagate the external calculator results from one equation to the following equations via string matching. As shown in Table 1, we see that adding a calculator signiﬁcantly boosts performance of chain-of-thought prompting on most tasks.

Table 1: Chain of thought prompting outperforms standard prompting for various large language models on ﬁve arithmeti
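
The text extracted from the PDF keeps typographic ligatures: "signiﬁcantly" and "ﬁve" above contain the single character U+FB01 rather than the letters "f" and "i", which can break plain-text search. One possible cleanup, sketched here with the standard library only (this is not an `unstructured` feature), is Unicode NFKC normalization:

```python
import unicodedata

# Fragment copied from the printed text above; \ufb01 is the "fi" ligature character.
raw = "adding a calculator signi\ufb01cantly boosts performance"

# NFKC normalization replaces compatibility characters such as U+FB01
# with their plain equivalents (here the two letters "f" and "i").
normalized = unicodedata.normalize("NFKC", raw)
print(normalized)
```

Note that NFKC also rewrites other compatibility characters (for example superscript digits), so it may not suit every document.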

In [18]:
filename = "documents/transformer.png"
elements = partition(filename=filename)
element_dict = [el.to_dict() for el in elements]
output = json.dumps(element_dict, indent=2)
print(output)

This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name


[
  {
    "type": "Header",
    "element_id": "d2592b60ee5cdc4d0f8c790be927c679",
    "text": "Nx",
    "metadata": {
      "detection_class_prob": 0.4063887298107147,
      "coordinates": {
        "points": [
          [
            42.20499801635742,
            1228.7130126953125
          ],
          [
            42.20499801635742,
            1286.28515625
          ],
          [
            124.905517578125,
            1286.28515625
          ],
          [
            124.905517578125,
            1228.7130126953125
          ]
        ],
        "system": "PixelSpace",
        "layout_width": 1520,
        "layout_height": 2239
      },
      "last_modified": "2024-04-14T15:25:21",
      "filetype": "image/png",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "file_directory": "documents",
      "filename": "transformer.png"
    }
  },
  {
    "type": "Image",
    "element_id": "d333b59532c27796553fef06ae6cf16c",
    "text": "Add & Norm | Gada. No