# Intro to Bricks

The goal of this notebook is to introduce you to the concept of bricks. Bricks are functions that live in `unstructured` and are the primary public API for the library. There are three types of bricks in `unstructured`, corresponding to the different stages of document pre-processing: partitioning, cleaning, and staging. At the conclusion of this notebook, you should be able to do the following:

- [Extract content from a document using partitioning bricks](#partition)
- [Remove unwanted content from document elements using cleaning bricks](#cleaning)
- [Prepare data for downstream use cases using staging bricks](#staging)

In [1]:
import os
import pathlib

DIRECTORY = os.path.abspath("")
EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, "..", "..", "example-docs")

## Partitioning bricks  <a id="partition"></a>

Partitioning bricks in `unstructured` allow users to extract structured content from a raw unstructured document. As we covered in the [core concepts notebook](https://github.com/Unstructured-IO/unstructured/blob/main/examples/training/0-Core%20Concepts.ipynb), partitioning bricks break a document down into elements such as `Title`, `NarrativeText`, and `ListItem`, enabling users to decide what content they'd like to keep for their particular application. If you're training a summarization model, for example, you may only be interested in `NarrativeText`.

The easiest way to partition documents in `unstructured` is to use the `partition` brick. When you call `partition`, `unstructured` uses `libmagic` to detect the file type automatically and invokes the appropriate document-specific partitioning function. As shown in the examples below, `partition` accepts both filenames and file-like objects as input. It also accepts some optional kwargs. For example, if you set `include_page_breaks=True`, the output will include `PageBreak` elements if the file type supports them. See the
[`unstructured` documentation](https://unstructured-io.github.io/unstructured/bricks.html#partition) for full details on available options.

In [2]:
from unstructured.partition.auto import partition

filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
elements = partition(filename=filename)

In [3]:
print("\n\n".join([str(el) for el in elements][:10]))

LayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen 1 ( (ea)
 ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5

Allen Institute for AI shannons@allenai.org

Brown University ruochen zhang@brown.edu

Harvard University { melissadell,jacob carlson } @fas.harvard.edu

University of Washington bcgl@cs.washington.edu

University of Waterloo w

li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify deep learning (DL) model development in disciplines like natural lan

In [4]:
with open(filename, "rb") as f:
    elements = partition(file=f, include_page_breaks=True)

In [5]:
print("\n\n".join([str(el) for el in elements][5:15]))

University of Washington bcgl@cs.washington.edu

University of Waterloo w

li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser , an open-source library for streamlining the usage of DL in D

The `unstructured` library also includes partitioning bricks targeted at specific document types. The `partition` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partitioning brick instead of `partition`:

1. If you already know the document type, filetype detection is unnecessary. Using the document-specific brick directly will make your program run faster.
2. Fewer dependencies. You don't need to install `libmagic` for filetype detection if you're only using document-specific bricks.
3. Additional features. The API for `partition` is the least common denominator for all document types. Certain document-specific bricks include extra features that you may want to take advantage of. For example, `partition_html` allows you to pass in a URL so you don't have to store the `.html` file locally.
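
To make the relationship between `partition` and the document-specific bricks concrete, here is a minimal sketch of the dispatch pattern. It is not the library's implementation: it picks a handler from the file extension instead of using `libmagic`, and every name in it (`fake_partition`, `fake_partition_html`, `fake_partition_text`, `HANDLERS`) is hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical stand-ins for document-specific bricks. The real bricks return
# document elements; these return plain strings to keep the sketch small.
def fake_partition_html(filename: str) -> List[str]:
    return [f"html elements from {filename}"]


def fake_partition_text(filename: str) -> List[str]:
    return [f"text elements from {filename}"]


# Assumption: dispatch on the file extension. The real `partition` brick
# detects the file type with libmagic rather than trusting the extension.
HANDLERS: Dict[str, Callable[[str], List[str]]] = {
    ".html": fake_partition_html,
    ".txt": fake_partition_text,
}


def fake_partition(filename: str) -> List[str]:
    """Pick a document-specific handler and delegate to it."""
    suffix = Path(filename).suffix.lower()
    try:
        handler = HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return handler(filename)


print(fake_partition("example.html"))
```

Calling a handler such as `fake_partition_html` directly skips the lookup step, which mirrors the first reason in the list above for using a document-specific brick.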

Currently, the partitioning bricks we support in `unstructured` are:

- `partition_docx`
    - Works on `.docx` files. Does not yet work on older `.doc` files.
- `partition_pptx`
    - Works on `.pptx` files. Does not yet work on older `.ppt` files.
- `partition_html`
    - Works on `.html` files.
    - Can pass in the HTML document as a string using the `text` kwarg.
    - Can pass in the URL for an HTML document using the `url` kwarg.
- `partition_pdf`
    - Works on `.pdf` files. Partitions the document using a document image analysis model.
    - If `url=None`, the model will run locally. If you pass in a URL, the brick will make a network call
      to a hosted model inference API. There is also an optional `token` kwarg for passing in an authentication token.
- `partition_image`
    - Has the same API as `partition_pdf`. Works on `.jpg` and `.png` files.
- `partition_email`
    - Works on `.eml` files. Most common email clients (e.g., Gmail, Microsoft Outlook) allow users to export emails in
      `.eml` format.
    - Parses the `text/html` content from the email by default. If you set `content_source="text/plain"`, the brick will parse the plain text instead.
    - If you set `include_headers=True`, the output will include information from the email header.
    - You can pass in the email as a string using the `text` kwarg.
- `partition_text`
    - Works on plain text files.
    - You can pass in the document as a string using the `text` kwarg.


See the [`unstructured` docs](https://unstructured-io.github.io/unstructured/bricks.html#partition-docx) for a full list of options. The example below shows how to partition a document directly from a URL using the `partition_html` function.


In [6]:
from unstructured.partition.html import partition_html

url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
elements = partition_html(url=url)

In [7]:
print("\n\n".join([str(el) for el in elements]))

CNN
         —

The Empire State Building was lit in green and white to celebrate the Philadelphia Eagles' victory in the NFC Championship game on Sunday — a decision that's sparked a bit of a backlash in the Big Apple.

The Eagles advanced to the Super Bowl for the first time since 2018 after defeating the San Francisco 49ers 31-7, and the Empire State Building later tweeted how it was marking the occasion.

Fly @Eagles Fly! We're going Green and White in honor of the Eagles NFC Championship Victory. pic.twitter.com/RNiwbCIkt7— Empire State Building (@EmpireStateBldg)

January 29, 2023

But given the fierce rivalry between the Eagles and the New York Giants, who the Super Bowl-bound team had comfortably defeated in the previous round of the NFL Playoffs, many were left questioning the move.

Did y'all lose a bet, ESPN contributor Mina Kimes asked in response to the tweet, while Giants running back Matt Breida also expressed his disbelief.

SMHð¤¦ð¾âï¸— Matt Breida (@MattBreid

## Cleaning bricks <a id="cleaning"></a>

When preparing data for an NLP model, you often need to clean it before passing it to the model. Unwanted content in your output can degrade the quality of your NLP model. To help with this, the `unstructured` library includes cleaning bricks that sanitize output before it is sent to downstream applications. You can check out our [documentation](https://unstructured-io.github.io/unstructured/bricks.html#cleaning) for a full list of cleaning bricks.

Some cleaning bricks apply automatically. In the example above, the output `Philadelphia Eaglesâ\x80\x99 victory` automatically gets converted to `Philadelphia Eagles' victory` in `partition_html` using the `replace_unicode_quotes` cleaning brick. You can see how that works in the code snippet below:

In [8]:
from unstructured.cleaners.core import replace_unicode_quotes

replace_unicode_quotes("Philadelphia Eaglesâ\x80\x99 victory")

"Philadelphia Eagles' victory"

Document elements in `unstructured` include an `apply` method that allows you to apply text cleaning to the document element without instantiating a new element. The `apply` method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the `replace_unicode_quotes` cleaning brick using the `apply` method.

In [9]:
from unstructured.documents.elements import Text
element = Text("Philadelphia Eaglesâ\x80\x99 victory")
element.apply(replace_unicode_quotes)
print(element)

Philadelphia Eagles' victory
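
For context on where a sequence like `â\x80\x99` comes from: it is what you get when the UTF-8 bytes of a right single quotation mark (U+2019) are decoded as Latin-1. The sketch below reproduces and reverses that using only the standard library. It illustrates the encoding mismatch and is not a description of how `replace_unicode_quotes` is implemented; the variable names are ours.

```python
# U+2019 (right single quotation mark) is three bytes in UTF-8: E2 80 99.
curly_apostrophe = "\u2019"
mojibake = curly_apostrophe.encode("utf-8").decode("latin-1")
print(repr(mojibake))  # 'â\x80\x99'

# Reversing the mistake: re-encode as Latin-1, then decode as UTF-8.
garbled = "Philadelphia Eagles\u00e2\x80\x99 victory"
repaired = garbled.encode("latin-1").decode("utf-8")

# Optionally normalize the curly apostrophe to a straight one, which matches
# the output shown in the cells above.
straightened = repaired.replace("\u2019", "'")
print(straightened)  # Philadelphia Eagles' victory
```

The round trip only works when every character in the string fits in Latin-1, so a dedicated cleaning brick is the more robust choice for real documents.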


Since a cleaning brick is just a `str -> str` function, users can also easily write their own cleaning bricks for custom data preparation tasks. In the example below, we partition a Russian offensive campaign assessment from the Institute for the Study of War (ISW) and remove the citations, which are not natural language text and which we don't want to include for model training purposes.

In [10]:
url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023"
elements = partition_html(url=url)

In [11]:
from unstructured.documents.elements import NarrativeText

narrative_text = [el for el in elements if isinstance(el, NarrativeText)][2:]

In [12]:
import re

# Raw string avoids invalid-escape warnings for \[ and \d.
def remove_citations(text):
    return re.sub(r"\[\d{1,3}\]", "", text)

In [13]:
narrative_text[0].text

'[1]\xa0Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'

In [14]:
narrative_text[0].apply(remove_citations)
narrative_text[0].text

'\xa0Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'

In [15]:
narrative_text[6].apply(remove_citations)
narrative_text[6].text

'Russian officials continue to propose measures to prepare Russia’s military industry for a protracted war in Ukraine while also likely setting further conditions for sanctions evasion.\xa0Russian Prime Minister Mikhail Mishustin stated on February 8 that the Russian government will subsidize investment projects for the modernization of enterprises operating in the interests of the Russian military and will allocate significant funds for manufacturing new military equipment.\xa0Mishustin also stated that the Russian government would extend benefits to Russian entrepreneurs who support the Russian military, including extended payment periods on rented federal property.\xa0The Kremlin likely intends these measures to augment its overarching effort to gradually prepare Russia’s military industry for a protracted war in Ukraine while avoiding a wider economic mobilization that would create further domestic economic disruptions and corresponding discontent.'

As we can see, the citations have been removed. However, the text still contains non-breaking spaces, which appear as `\xa0`. We can clean those up using the `clean_extra_whitespace` cleaning brick.

In [16]:
from unstructured.cleaners.core import clean_extra_whitespace

narrative_text[0].apply(clean_extra_whitespace)
narrative_text[6].apply(clean_extra_whitespace)

In [17]:
narrative_text[0].text

'Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.'

In [18]:
narrative_text[6].text

'Russian officials continue to propose measures to prepare Russia’s military industry for a protracted war in Ukraine while also likely setting further conditions for sanctions evasion. Russian Prime Minister Mikhail Mishustin stated on February 8 that the Russian government will subsidize investment projects for the modernization of enterprises operating in the interests of the Russian military and will allocate significant funds for manufacturing new military equipment. Mishustin also stated that the Russian government would extend benefits to Russian entrepreneurs who support the Russian military, including extended payment periods on rented federal property. The Kremlin likely intends these measures to augment its overarching effort to gradually prepare Russia’s military industry for a protracted war in Ukraine while avoiding a wider economic mobilization that would create further domestic economic disruptions and corresponding discontent.'

Now the text is clean and formatted how we'd like it for our model training application. The best way to invoke a series of cleaning bricks is to loop over the elements and call `apply` with all of your bricks. For example, we can apply the cleaning bricks to all of the elements from the ISW article with the following code:

In [19]:
for element in narrative_text:
    element.apply(remove_citations)
    element.apply(clean_extra_whitespace)

## Staging bricks <a id="staging"></a>

The final step in the process is to prepare your data for ingestion into downstream systems. We include staging bricks in the `unstructured` package to help with that. Staging bricks accept a list of document elements as input and return data formatted for the target system, such as the list of dictionaries in the example below. There, we prepare our narrative text samples for ingestion into Label Studio using the `stage_for_label_studio` brick. We can take this data and upload it directly into Label Studio to quickly get started with an NLP labeling task.

In [20]:
import json
from unstructured.staging.label_studio import stage_for_label_studio

output = stage_for_label_studio(narrative_text)
print(json.dumps(output[:2], indent=4))

[
    {
        "data": {
            "text": "Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.",
            "ref_id": "c311a941b80429f2ba0b6a2137f7315e"
        }
    },
    {
        "data": {
            "text": "Russian military command additionally appears to have fully committed elements of several conventional divisions to decisive offensive operations along the Svatove-Kreminna line, as ISW previously reported.",
            "ref_id": "79748ec84695bd88f41b13e98eae53be"
        }
    }
]
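
Since the staged output is plain JSON-serializable data, one way to hand it to another tool is to write it to a file. The sketch below does this with the standard library, using one record copied from the output above. The file name `label_studio_tasks.json` and the temporary directory are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# One record copied from the staged output shown above.
records = [
    {
        "data": {
            "text": "Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.",
            "ref_id": "c311a941b80429f2ba0b6a2137f7315e",
        }
    }
]

# Hypothetical destination; in practice, choose any path you like.
output_path = Path(tempfile.mkdtemp()) / "label_studio_tasks.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4)

# Read the file back to confirm that it round-trips unchanged.
with open(output_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded == records)  # True
```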


Currently, `unstructured` supports the following staging bricks:

- `stage_for_argilla`
- `stage_for_transformers`
- `stage_for_label_studio`
- `stage_for_prodigy`
- `stage_for_label_box`
- `stage_for_datasaur`

Also included among the staging bricks are functions for converting a list of document elements to a dictionary, CSV, or dataframe. These helper functions are useful if you just want the text and don't need the data pre-formatted for a particular downstream tool. These functions include:

- `convert_to_isd`
- `convert_to_isd_csv`
- `convert_to_dataframe`

The "ISD" in these functions refers to "initial structured data", our standard dictionary representation of text elements. Here we convert the list of elements to a dictionary and a dataframe.

In [21]:
from unstructured.staging.base import convert_to_isd

isd = convert_to_isd(elements)
print(json.dumps(isd[:2], indent=4))

[
    {
        "text": "Skip to main content",
        "type": "Title"
    },
    {
        "text": "(function(d){\n  var js, id = 'facebook-jssdk'; if (d.getElementById(id)) {return;}\n  js = d.createElement('script'); js.id = id; js.async = true;\n  js.src = \"//connect.facebook.net/en_US/all.js#xfbml=1\";\n  d.getElementsByTagName('head')[0].appendChild(js);\n}(document));",
        "type": "NarrativeText"
    }
]
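
Because ISD is made of ordinary dictionaries with `text` and `type` keys, it can be filtered and summarized with plain Python. The sketch below works on a small hand-written sample assembled from outputs shown in this notebook rather than on the live `isd` variable. The names `isd_sample`, `narrative_only`, and `type_counts` are ours.

```python
from collections import Counter

# Hand-written sample in the ISD shape shown above: dicts with "text" and "type".
isd_sample = [
    {"text": "Skip to main content", "type": "Title"},
    {"text": "Search form", "type": "Title"},
    {"text": "Home", "type": "ListItem"},
    {
        "text": "Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.",
        "type": "NarrativeText",
    },
]

# Keep only the narrative text, e.g. for a summarization dataset.
narrative_only = [
    record["text"] for record in isd_sample if record["type"] == "NarrativeText"
]

# Count how many records of each type the sample contains.
type_counts = Counter(record["type"] for record in isd_sample)

print(narrative_only)
print(type_counts)
```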


In [22]:
from unstructured.staging.base import convert_to_dataframe

df = convert_to_dataframe(elements)
df.head()

|   | type | text |
|---|------|------|
| 0 | Title | Skip to main content |
| 1 | NarrativeText | (function(d){\n var js, id = 'facebook-jssdk'... |
| 2 | Title | Search form |
| 3 | ListItem | Home |
| 4 | ListItem | Who We Are |


If you have data in ISD format (a list of dictionaries), you can convert it back to a list of elements using the `isd_to_elements` function.

In [23]:
from unstructured.staging.base import isd_to_elements

isd_to_elements(isd[:2])

[<unstructured.documents.elements.Title at 0x28bf910a0>,
 <unstructured.documents.elements.NarrativeText at 0x28bf91460>]