# Chipper Local Inference
This notebook demonstrates how to parse a PDF file locally with the `chipper` model through our main libraries. If you run this notebook in Google Colab, note that inference with `chipper` on CPU can take a while; switching the runtime from CPU to a `T4` GPU (or any other available GPU) reduces the runtime and is strongly recommended. You can use the following commands to install the required libraries:

```{python}
!pip install unstructured --quiet
!pip install unstructured-inference --quiet
!apt-get update -y && apt-get install -y poppler-utils --quiet
```
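
Before running the heavier cells, it can help to confirm that the environment is ready. The sketch below is a hypothetical helper, not part of `unstructured`. It assumes that `poppler-utils` exposes the `pdftoppm` binary on `PATH` and that PyTorch, if installed, is the backend whose CUDA visibility matters for `chipper`.

```python
import shutil


def check_environment():
    """Return a small report on the local setup (hypothetical helper)."""
    # Assumption: poppler-utils ships the `pdftoppm` binary, so finding it
    # on PATH is taken as a sign that poppler is installed.
    has_poppler = shutil.which("pdftoppm") is not None

    # Assumption: PyTorch is the backend; report no GPU if it is not installed.
    try:
        import torch
        has_gpu = bool(torch.cuda.is_available())
    except ImportError:
        has_gpu = False

    return {"poppler": has_poppler, "gpu": has_gpu}


report = check_environment()
print(report)
```

If `gpu` is `False` in Colab, the runtime is most likely still set to CPU.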

Initialize the variable `filename` with your file path and `model_name` with the model you want to use, chosen from the `MODEL_TYPES` available in each of the [models](https://github.com/Unstructured-IO/unstructured-inference/tree/203f7ab75b1644b938f6bae1e81c8365d274f35d/unstructured_inference/models) scripts; in this case, `chipper`. For this notebook we will use `DA-1p.pdf` from our [example-docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs):

In [2]:
filename = '../../example-docs/DA-1p.pdf'  # path to the example PDF
model_name = "chipper"

Most of the user experience is going to be through our main Unstructured lib, so the highest-level call for local inference with `chipper` is `unstructured.partition.auto.partition`. This function needs `strategy='hi_res'` and `model_name=model_name` to call `chipper`; the additional kwarg `pdf_image_dpi=300` is **necessary for better performance** of `chipper`. Users should see a `WARNING` saying `chipper` is in beta (*as of 14.08.2023*).

#### [unstructured.partition.auto.partition](https://github.com/Unstructured-IO/unstructured/blob/612f9da6e8e27cffc3e6912928a16daad47903dc/unstructured/partition/auto.py#L73)

In [3]:
from unstructured.partition.auto import partition

In [4]:
%%time
elements = partition(filename=filename, strategy='hi_res', model_name=model_name, pdf_image_dpi=300)



CPU times: user 6min 3s, sys: 17.1 s, total: 6min 20s
Wall time: 6min 35s


Our `chipper` model processes an image input and returns the textual content with a structure defined by the categories (document element types) it was fine-tuned on. Therefore, during a call to `partition`, the PDF document is converted to an image and the output element types are standardized to Unstructured [elements](https://unstructured-io.github.io/unstructured/getting_started.html#document-elements).

*Disclaimer:* The `UncategorizedText` elements returned by the `partition` function will soon reflect the *category/type* identified by `chipper` instead (e.g. `Headline`, `Subheadline`, ...).

In [5]:
# Printing all categories/types
print("number of elements: ", len(elements))
for element in elements:
    print(element.category)

number of elements:  13
UncategorizedText
UncategorizedText
UncategorizedText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
UncategorizedText
NarrativeText
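
For longer documents, a per-category count is often easier to read than one line per element. The following is a minimal sketch, assuming only that each element exposes a `.category` attribute as in the cell above. `summarize_categories` is a hypothetical helper, and the stand-in objects simply mirror the 13 categories printed above.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_categories(elements):
    """Count how many elements fall into each category."""
    return Counter(element.category for element in elements)


# Stand-in objects mirroring the categories printed above; with a real run,
# pass the `elements` list returned by `partition` instead.
categories = (
    ["UncategorizedText"] * 3
    + ["NarrativeText"] * 8
    + ["UncategorizedText", "NarrativeText"]
)
sample = [SimpleNamespace(category=c) for c in categories]

counts = summarize_categories(sample)
print(counts)
```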


In [6]:
# Printing all element(s).text
for element in elements:
    print(element.text)

MAIN GAME
CREATURES
Abomination
"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.
As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.
There, perched atop the spire of the village concurty, stood the mage. But he was human no longer.
We shooted prayers to the Maker and deflcted what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault."
—Transscribed from a tale told 

Internally, this function calls `unstructured.partition.pdf._partition_pdf_or_image_local`, which expects a model definition through `model_name` or an env variable called `UNSTRUCTURED_HI_RES_MODEL_NAME` to partition the file, and which ends up calling `process_file_with_model` | `process_data_with_model` ([1](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L391)|[2](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L361)) from `unstructured_inference.inference.layout.PageLayout`.
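
The two ways of supplying the model name can be pictured with a small lookup. This is only an illustrative sketch, not the library's code: `resolve_model_name` is a hypothetical helper, and the precedence shown (explicit kwarg first, env variable second) is an assumption.

```python
import os


def resolve_model_name(model_name=None, environ=os.environ):
    """Pick the model name from the kwarg, else from the environment.

    Hypothetical helper: the kwarg-first precedence is an assumption.
    """
    if model_name:
        return model_name
    return environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME")


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"UNSTRUCTURED_HI_RES_MODEL_NAME": "chipper"}
print(resolve_model_name(environ=fake_env))   # name taken from the env mapping
print(resolve_model_name("chipper", {}))      # name taken from the kwarg
```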

##### [unstructured.partition.pdf._partition_pdf_or_image_local](https://github.com/Unstructured-IO/unstructured/blob/2e0ab86c6a5c27f14c0f95ea41728bb9a94b7378/unstructured/partition/pdf.py#L215C5-L215C34)

In [12]:
from unstructured.partition.pdf import _partition_pdf_or_image_local

In [13]:
%%time
elements = _partition_pdf_or_image_local(filename=filename, model_name=model_name, pdf_image_dpi=300)  # the `file` parameter could be used here as well



CPU times: user 5min 8s, sys: 8.82 s, total: 5min 17s
Wall time: 5min 18s


In [14]:
# Printing all categories/types
print("number of elements: ", len(elements))
for element in elements:
    print(element.category)

number of elements:  13
UncategorizedText
UncategorizedText
UncategorizedText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
UncategorizedText
NarrativeText


In [15]:
# Printing all element(s).text
for element in elements:
    print(element.text)

MAIN GAME
CREATURES
Abomination
"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.
As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.
There, perched atop the spire of the village chantry, stood the mage. But he was human no longer.
We shooted prayers to the Maker and defIected what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault."
—Transscribed from a tale told 

##### unstructured.partition.auto.partition with env variable

<font color="red">Restart your runtime before executing the cells in this sub-section!

Do not import unstructured.partition.auto.partition before defining your env variables!</font>

Let's now use the model through the Unstructured lib, but instead of passing the kwarg `model_name` we define the env var `UNSTRUCTURED_HI_RES_MODEL_NAME`.

In [3]:
import os

os.environ['UNSTRUCTURED_HI_RES_MODEL_NAME'] = model_name
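
The import-order requirement above can be made explicit with a small guard. This is a sketch under assumptions: `set_model_env` is a hypothetical helper, and checking `sys.modules` is just one way to detect that `unstructured.partition.auto` was already imported in the current runtime.

```python
import os
import sys


def set_model_env(name, module="unstructured.partition.auto"):
    """Set the env var, refusing if `module` was already imported."""
    if module in sys.modules:
        raise RuntimeError(
            f"{module} is already imported; restart the runtime first."
        )
    os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"] = name


set_model_env("chipper")
print(os.environ["UNSTRUCTURED_HI_RES_MODEL_NAME"])
```

Failing loudly here is a design choice: it surfaces a stale runtime at once rather than after a long inference run.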

In [4]:
from unstructured.partition.auto import partition   # we could also use unstructured.partition.pdf._partition_pdf_or_image_local

In [5]:
%%time
elements = partition(filename=filename, strategy='hi_res', pdf_image_dpi=300)   # internally _partition_pdf_or_image_local(filename=filename, pdf_image_dpi=300)



CPU times: user 6min 8s, sys: 17.6 s, total: 6min 25s
Wall time: 6min 49s


In [6]:
# Printing all categories/types
print("number of elements: ", len(elements))
for element in elements:
    print(element.category)

number of elements:  13
UncategorizedText
UncategorizedText
UncategorizedText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
NarrativeText
UncategorizedText
NarrativeText


In [7]:
# Printing all element(s).text
for element in elements:
    print(element.text)

MAIN GAME
CREATURES
Abomination
"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.
As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.
There, perched atop the spire of the village concurty, stood the mage. But he was human no longer.
We shooted prayers to the Maker and deflcted what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault."
—Transscribed from a tale told 

#### [unstructured_inference.inference.layout.DocumentLayout](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L51)

We already know that `partition` from our main library uses `process_file_with_model` | `process_data_with_model` ([1](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L391)|[2](https://github.com/Unstructured-IO/unstructured-inference/blob/15bbc564c67ae1f1b524918978cdb29010f89647/unstructured_inference/inference/layout.py#L361)) from `unstructured_inference.inference.layout.PageLayout`. Let's now create a `DocumentLayout` containing `PageLayout` objects directly via unstructured-inference. For that, we need to pass an Unstructured model object to the `element_extraction_model` param when creating a `DocumentLayout` object with `from_file`. The function `get_model` from `models.base` creates Unstructured model objects from a model name for you:

In [8]:
from unstructured_inference.models.base import get_model
from unstructured_inference.inference.layout import DocumentLayout

In [9]:
model = get_model(model_name) # This can take a while on first run
model

<unstructured_inference.models.chipper.UnstructuredChipperModel at 0x7965c7c9a260>

In [5]:
%%time
layout = DocumentLayout.from_file(filename, element_extraction_model=model, pdf_image_dpi=300)



CPU times: user 4min 58s, sys: 11.5 s, total: 5min 10s
Wall time: 5min 12s


The `layout` object is organized by pages with elements. In this case, the parsed document layout will contain the document element types that our `chipper` model was originally fine-tuned on.

In [6]:
# Printing all categories/types
for page in layout.pages:
    print("number of elements: ", len(page.elements))
    for element in page.elements:
        print(element.type)

number of elements:  13
Headline
Subheadline
Subheadline
Text
Text
Text
Text
Text
Text
Text
Text
Subheadline
Text
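
Comparing this listing with the categories printed by `partition` earlier in the notebook shows how the `chipper` types were standardized for this document. The sketch below pairs the two printed listings position by position, assuming the 13 elements line up one-to-one across runs. It records an observation about this file only, not the library's mapping table.

```python
# Types printed from the DocumentLayout in this section.
chipper_types = (
    ["Headline", "Subheadline", "Subheadline"]
    + ["Text"] * 8
    + ["Subheadline", "Text"]
)
# Categories printed by `partition` earlier in the notebook.
partition_categories = (
    ["UncategorizedText"] * 3
    + ["NarrativeText"] * 8
    + ["UncategorizedText", "NarrativeText"]
)

# Collect every category seen for each chipper type.
observed = {}
for chipper_type, category in zip(chipper_types, partition_categories):
    observed.setdefault(chipper_type, set()).add(category)

for chipper_type, seen in observed.items():
    print(chipper_type, "->", sorted(seen))
```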


In [7]:
# Printing all element(s).text
for page in layout.pages:
    for element in page.elements:
        print(element.text)

MAIN GAME
CREATURES
Abomination
"We arrived in the dead of night. We had been tracking the maleficar for days, and finally had him cornered... or so we thought.
As we approached, a home on the edge of the town exploded, sending splinters of wood and fist-sized chunks of rocks into our ranks. We had but moments to regroup before fire rained from the sky, the sounds of destruction wrapped in a hideous laughter from the center of the village.
There, perched atop the spire of the village chantry, stood the mage. But he was human no longer.
We shooted prayers to the Maker and defIected what magic we could, but as we fought, the creature fought harder. I saw my comrades fall, burned by the flaming sky or crushed by debris. The tomorrows creature, looking as if a demon were wearing a man like a twisted suit of skin, spotted me and grinded. We had forced it to this, I realized; the mage had made this pact, given himself over to the demon to survive our assault."
—Transscribed from a tale told 
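
For multi-page documents, it can be convenient to flatten the page/element nesting into rows. The sketch below assumes only the attributes used in the cells above (`layout.pages`, `page.elements`, `element.type`, `element.text`). `flatten_layout` is a hypothetical helper, and the stand-in layout replaces a real `DocumentLayout` so the example runs without the model.

```python
from types import SimpleNamespace


def flatten_layout(layout):
    """Yield (page_index, element_type, text) for every element in the layout."""
    for page_index, page in enumerate(layout.pages, start=1):
        for element in page.elements:
            yield page_index, element.type, element.text


# Stand-in for a DocumentLayout with one page and two elements.
fake_layout = SimpleNamespace(
    pages=[
        SimpleNamespace(
            elements=[
                SimpleNamespace(type="Headline", text="MAIN GAME"),
                SimpleNamespace(type="Subheadline", text="CREATURES"),
            ]
        )
    ]
)

rows = list(flatten_layout(fake_layout))
print(rows)
```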