# Unstructured Core Concepts

The goal of this notebook is to introduce users to the core concepts in the `unstructured` library. At the conclusion of this notebook, you should be able to do the following:

- [Partition a document.](#partition)
- [Understand how documents are structured in the `unstructured` library.](#elements)
- [Convert the document to a dictionary.](#dict)

In [1]:
import os

DIRECTORY = os.path.abspath("")
EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, "..", "..", "example-docs")

## Partitioning a document <a id="partition"></a>

In this section, we'll cut right to the chase and get to the most important part of the library: partitioning a document. The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections, and extract the text associated with those sections. Depending on the document type, `unstructured` uses different methods for partitioning a document. We'll cover those in a later training notebook. For now, we'll use the simplest API in the library, the `partition` function. The `partition` function will detect the filetype of the source document and route it to the appropriate partitioning function. You can try out the `partition` function by running the cell below.

In [2]:
from unstructured.partition.auto import partition

filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
elements = partition(filename=filename)

In [3]:
elements
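
To picture the detect-then-route behavior described above, here is a toy sketch. It is not how `unstructured` implements `partition`. The names `partition_sketch`, `partition_pdf_stub`, `partition_html_stub`, and `ROUTES` are hypothetical, and the sketch guesses the file type from the extension with the standard-library `mimetypes` module rather than using `libmagic`.

```python
import mimetypes
from typing import Callable, Dict, List


# Hypothetical stand-ins for format-specific partitioning functions.
def partition_pdf_stub(filename: str) -> List[str]:
    return [f"<elements parsed from PDF {filename}>"]


def partition_html_stub(filename: str) -> List[str]:
    return [f"<elements parsed from HTML {filename}>"]


# Map a detected MIME type to the function that handles it.
ROUTES: Dict[str, Callable[[str], List[str]]] = {
    "application/pdf": partition_pdf_stub,
    "text/html": partition_html_stub,
}


def partition_sketch(filename: str) -> List[str]:
    """Guess the file type, then hand the file to the matching function."""
    mime_type, _ = mimetypes.guess_type(filename)
    handler = ROUTES.get(mime_type)
    if handler is None:
        raise ValueError(
            f"No partitioner registered for {filename!r} (detected: {mime_type})"
        )
    return handler(filename)


print(partition_sketch("layout-parser-paper-fast.pdf"))
```

The point of the design is that callers only need one entry point, while each document type keeps its own parsing logic behind the routing table.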

You can also partition a document from a file-like object instead of a filename as follows:

In [4]:
with open(filename, "rb") as f:
    elements = partition(file=f)

In [5]:
elements

#### Troubleshooting Note:

- Filetype detection in the `partition` function relies on the `libmagic` library. If `libmagic` is not installed on your system, `partition` will raise an error.
- For `partition` to work on PDFs and images, you'll need to have installed `unstructured[local-inference]` along with the `detectron2` model. See the `README` for a full list of install instructions.
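
If you are unsure whether `libmagic` is available, a quick check along these lines may help. This is a minimal sketch that relies on two assumptions not stated in this notebook: that the Python binding is importable as the `magic` module (as provided by the `python-magic` package), and that the shared library can be located under the name `magic`. The helper name `libmagic_status` is hypothetical.

```python
import ctypes.util
import importlib.util


def libmagic_status() -> dict:
    """Report whether the libmagic binding and shared library can be found."""
    return {
        # Assumption: the Python binding is exposed as the `magic` module.
        "python_binding": importlib.util.find_spec("magic") is not None,
        # Assumption: the system library is discoverable under the name "magic".
        "system_library": ctypes.util.find_library("magic") is not None,
    }


status = libmagic_status()
print(status)
```

If either value is `False`, consult the `README` install instructions before calling `partition`.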

## `unstructured` document elements <a id="elements"></a>

When we partition a document, the output is a list of document `Element` objects. These element objects represent different components of the source document. Currently, the `unstructured` library supports the following element types:
    
- `Element`
    - `Text`
        - `FigureCaption`
        - `NarrativeText`
        - `ListItem`
        - `Title`
        - `Address`
    - `CheckBox`
    - `Image`
    - `PageBreak`
    
Other element types that we will add in the future include tables and figures. Different partitioning functions use different methods for determining the element type and extracting the associated content. Document elements have a `str` representation. You can print them using the snippet below.

In [6]:
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "example-10k.html")
elements = partition(filename=filename)

for element in elements[:5]:
    print(element)
    print("\n")

One helpful aspect of document elements is that they allow you to cut a document down to the elements that you need for your particular use case. For example, if you're training a summarization model, you may only want to include narrative text for model training. You'll notice that the output above includes a lot of titles and other content that may not be suitable for a summarization model. The following code shows how you can limit your output to only narrative text with more than two sentences. As you can see, the output now only contains narrative text.

In [7]:
from unstructured.documents.elements import NarrativeText
from unstructured.partition.text_type import sentence_count

for element in elements[:100]:
    if isinstance(element, NarrativeText) and sentence_count(element.text) > 2:
        print(element)
        print("\n")
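
The `isinstance` filter works because the element types form a class hierarchy. The sketch below illustrates the idea with simplified stand-in classes. They are not the library's own definitions, they cover only part of the hierarchy listed earlier, and the sample strings are made up.

```python
from collections import Counter


# Simplified stand-ins that mirror part of the hierarchy listed earlier.
# These are NOT the classes from unstructured.documents.elements.
class Element:
    pass


class Text(Element):
    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


class Title(Text):
    pass


class NarrativeText(Text):
    pass


class ListItem(Text):
    pass


class PageBreak(Element):
    pass


# Made-up sample elements for illustration only.
sample_elements = [
    Title("Example Heading"),
    NarrativeText("This is a sentence. This is another sentence."),
    ListItem("An example bullet"),
    PageBreak(),
]

# Count how many elements of each type the document contains.
counts = Counter(type(element).__name__ for element in sample_elements)

# Filtering on a leaf class keeps only that type ...
narrative = [e for e in sample_elements if isinstance(e, NarrativeText)]

# ... while filtering on a parent class keeps every subclass as well.
text_like = [e for e in sample_elements if isinstance(e, Text)]

print(counts)
print(len(narrative), len(text_like))
```

Filtering on a parent class such as `Text` is a convenient way to keep all text-bearing elements while dropping structural ones such as page breaks.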

## Converting to a dictionary <a id="dict"></a>

The final step in the process for most users is to convert the output to JSON. You can convert a list of document elements to a list of dictionaries, a form that is straightforward to serialize as JSON, using the `convert_to_dict` function. The workflow for using `convert_to_dict` appears below.

In [8]:
from unstructured.staging.base import convert_to_dict

In [9]:
convert_to_dict(elements)
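
Once you have a list of dictionaries, the standard-library `json` module can produce the JSON text. The sketch below uses a made-up list in place of the real `convert_to_dict` output so that it runs on its own. The keys shown (`type`, `text`) are an assumption for illustration, and the actual output may contain different or additional fields.

```python
import json

# Made-up stand-in for the output of convert_to_dict(elements).
# Assumption: the real dictionaries may use different or additional keys.
element_dicts = [
    {"type": "Title", "text": "Example Heading"},
    {"type": "NarrativeText", "text": "This is a sentence. This is another."},
]

# Serialize to a JSON string; indent=2 makes the result easier to read.
json_text = json.dumps(element_dicts, indent=2)
print(json_text)

# Round-trip check: loading the string gives back an equal structure.
restored = json.loads(json_text)
print(restored == element_dicts)
```

In a real workflow you would pass the result of `convert_to_dict(elements)` to `json.dumps`, or use `json.dump` to write it directly to a file.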