# Unstructured Core Concepts

The goal of this notebook is to introduce users to the core concepts in the `unstructured` library. At the conclusion of this notebook, you should be able to do the following:

- [Partition a document.](#partition)
- [Understand how documents are structured in the `unstructured` library.](#elements)
- [Convert the document to a dictionary.](#dict)

In [1]:
import os
import pathlib

DIRECTORY = os.path.abspath("")
EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, "..", "..", "example-docs")

## Partitioning a document <a id="partition"></a>

In this section, we'll cut right to the chase and get to the most important part of the library: partitioning a document. The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections, and extract the text associated with those sections. Depending on the document type, `unstructured` uses different methods for partitioning a document. We'll cover those in a later training notebook. For now, we'll use the simplest API in the library, the `partition` function. The `partition` function will detect the filetype of the source document and route it to the appropriate partitioning function. You can try out the `partition` function by running the cell below.

In [2]:
from unstructured.partition.auto import partition

filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
elements = partition(filename=filename)

In [3]:
elements

[<unstructured.documents.elements.Title at 0x293cca6a0>,
 <unstructured.documents.elements.NarrativeText at 0x103e74e80>,
 <unstructured.documents.elements.ListItem at 0x1041d4730>,
 <unstructured.documents.elements.ListItem at 0x16d9acf10>,
 <unstructured.documents.elements.ListItem at 0x16c7263d0>,
 <unstructured.documents.elements.ListItem at 0x16c7265b0>,
 <unstructured.documents.elements.ListItem at 0x16c7265e0>,
 <unstructured.documents.elements.ListItem at 0x16c726370>,
 <unstructured.documents.elements.NarrativeText at 0x1041d4700>,
 <unstructured.documents.elements.NarrativeText at 0x16c726520>,
 <unstructured.documents.elements.Title at 0x16c726310>,
 <unstructured.documents.elements.NarrativeText at 0x16c7264c0>,
 <unstructured.documents.elements.NarrativeText at 0x16c726460>,
 <unstructured.documents.elements.NarrativeText at 0x16c726040>,
 <unstructured.documents.elements.NarrativeText at 0x16c7260a0>,
 <unstructured.documents.elements.ListItem at 0x16c7260d0>,
 <unstructu

You can also partition a document from a file-like object instead of a filename as follows:

In [4]:
with open(filename, "rb") as f:
    elements = partition(file=f)

In [5]:
elements

[<unstructured.documents.elements.Title at 0x29360e850>,
 <unstructured.documents.elements.NarrativeText at 0x16c436130>,
 <unstructured.documents.elements.ListItem at 0x12f4981c0>,
 <unstructured.documents.elements.ListItem at 0x16bcd21f0>,
 <unstructured.documents.elements.ListItem at 0x293d11280>,
 <unstructured.documents.elements.ListItem at 0x293d11490>,
 <unstructured.documents.elements.ListItem at 0x293d114f0>,
 <unstructured.documents.elements.ListItem at 0x293d11250>,
 <unstructured.documents.elements.NarrativeText at 0x12f498970>,
 <unstructured.documents.elements.NarrativeText at 0x293d11400>,
 <unstructured.documents.elements.Title at 0x293d11220>,
 <unstructured.documents.elements.NarrativeText at 0x293d113d0>,
 <unstructured.documents.elements.NarrativeText at 0x293d11310>,
 <unstructured.documents.elements.NarrativeText at 0x293d11070>,
 <unstructured.documents.elements.NarrativeText at 0x293d11130>,
 <unstructured.documents.elements.ListItem at 0x1767e10a0>,
 <unstructu

#### Troubleshooting Note:

- Filetype detection in the `partition` function relies on the `libmagic` library. If `libmagic` is not installed on your system, `partition` will raise an error.
- For `partition` to work on PDFs and images, you'll need to have installed `unstructured[local-inference]` along with the `detectron2` model. See the `README` for a full list of install instructions.
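
If you are unsure whether these dependencies are present, a quick check along the following lines may help. This is a minimal sketch rather than an official diagnostic: `has_module` is a hypothetical helper, and the sketch assumes the `libmagic` bindings are exposed under the module name `magic` (as the `python-magic` package does).

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


# Assumption: the libmagic bindings are importable as `magic` (python-magic).
for module_name in ("magic", "unstructured"):
    status = "found" if has_module(module_name) else "missing"
    print(f"{module_name}: {status}")
```

Note that finding the `magic` module only shows the Python bindings are installed. The underlying `libmagic` system library may still be missing, so treat a "found" result as a first check only.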

## `unstructured` document elements <a id="elements"></a>

When we partition a document, the output is a list of document `Element` objects. These element objects represent different components of the source document. Currently, the `unstructured` library supports the following element types:
    
- `Element`
    - `Text`
        - `FigureCaption`
        - `NarrativeText`
        - `ListItem`
        - `Title`
        - `Address`
    - `CheckBox`
    - `Image`
    - `PageBreak`
    
Other element types that we will add in the future include tables and figures. Different partitioning functions use different methods for determining the element type and extracting the associated content. Document elements have a `str` representation. You can print them using the snippet below.

In [6]:
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "example-10k.html")
elements = partition(filename=filename)

for element in elements[:5]:
    print(element)
    print("\n")

UNITED STATES


SECURITIES AND EXCHANGE COMMISSION


Washington, D.C. 20549


FORM 10-K


ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
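

Because each element is an instance of one of the classes listed earlier, it can be useful to tally a document's elements by class before deciding what to keep. The sketch below is illustrative only: `count_element_types` is a hypothetical helper, and the two stand-in classes exist solely so the example runs without `unstructured` installed. In the notebook you would pass the `elements` list returned by `partition`.

```python
from collections import Counter


def count_element_types(elements):
    """Tally elements by the name of their class."""
    return Counter(type(element).__name__ for element in elements)


# Stand-in classes (hypothetical) so the sketch is self-contained.
# They only mimic the class names of the real element types.
class Title:
    pass


class NarrativeText:
    pass


sample = [Title(), Title(), NarrativeText()]
counts = count_element_types(sample)

# Print the tally, most frequent class first
for type_name, count in counts.most_common():
    print(f"{type_name}: {count}")
```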




One helpful aspect of document elements is that they allow you to cut a document down to the elements that you need for your particular use case. For example, if you're training a summarization model, you may only want to include narrative text for model training. You'll notice that the output above includes a lot of titles and other content that may not be suitable for a summarization model. The following code shows how you can limit your output to narrative text with more than two sentences. As you can see, the output then contains only narrative text.

In [7]:
from unstructured.documents.elements import NarrativeText
from unstructured.partition.text_type import sentence_count

for element in elements[:100]:
    if isinstance(element, NarrativeText) and sentence_count(element.text) > 2:
        print(element)
        print("\n")

Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report.  ☐


This report contains statements that do not relate to historical or current facts but are “forward-looking” statements. These statements relate to analyses and other information based on forecasts of future results and estimates of amounts not yet determinable. These statements may also relate to future events or trends, our future prospects and proposed new products, services, developments or business strategies, among other things. These statements can generally (although not always) be identified by their use of terms and phrases such as anticipate, appear, believe, could, would, estimate, expect, indicate, intent, may, plan, predict, project, pursu

## Converting to a dictionary <a id="dict"></a>

The final step in the process for most users is to convert the output to JSON. To get there, first convert the document elements to dictionaries, which are straightforward to serialize as JSON, by calling the `.to_dict()` method on the `Element` objects:

In [8]:
output = [el.to_dict() for el in elements]

In [9]:
output[:10]

[{'text': 'UNITED STATES', 'type': 'Title'},
 {'text': 'SECURITIES AND EXCHANGE COMMISSION', 'type': 'Title'},
 {'text': 'Washington, D.C. 20549', 'type': 'Title'},
 {'text': 'FORM 10-K', 'type': 'Title'},
 {'text': 'ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934',
  'type': 'Uncategorized'},
 {'text': 'For the fiscal year ended\xa0December\xa031, 2021',
  'type': 'NarrativeText'},
 {'text': 'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934',
  'type': 'Uncategorized'},
 {'text': 'For the transition period from\xa0\xa0\xa0\xa0\xa0\xa0\xa0to',
  'type': 'Title'},
 {'text': 'Commission file number:\xa0000-30653', 'type': 'Title'},
 {'text': 'Galaxy Gaming, Inc.', 'type': 'Title'}]
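
Since the dictionaries shown above contain only strings, they can be serialized with Python's standard `json` module. The following is a minimal sketch, assuming records shaped like that output. The `records` list is a small sample copied from it so the example runs on its own; in the notebook you would pass `output` instead.

```python
import json

# Sample records copied from the output above (assumption: real elements
# serialize to plain dictionaries like these).
records = [
    {"text": "UNITED STATES", "type": "Title"},
    {
        "text": "For the fiscal year ended\xa0December\xa031, 2021",
        "type": "NarrativeText",
    },
]

# ensure_ascii=False keeps characters such as the non-breaking space
# (\xa0) as-is instead of escaping them.
json_text = json.dumps(records, indent=2, ensure_ascii=False)
print(json_text)

# Round trip: parsing the JSON gives back the original dictionaries.
restored = json.loads(json_text)
print(restored == records)
```

If the JSON needs to be written to disk, `json.dump` accepts the same arguments plus an open file object.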