## Quick Tour

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

You can also try the `unstructured` library by running a Docker container, which is compatible with both Intel/AMD and Apple Silicon machines. See the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [1]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q "unstructured[all-docs]==0.12.5"
# NOTE: you may also upgrade to the latest version with the command below,
#       though a more recent version of unstructured will not have been tested with this notebook
# %pip install -q --upgrade unstructured

Selecting previously unselected package poppler-utils.
(Reading database ... 123620 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.5_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.5) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../tesseract-ocr_4.1.1-2.1build1_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2.1build1) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...
Setting up poppler-utils (22.02.0-2ubuntu0.5) ...
Setting up tess

In [2]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [3]:
pdf_paths = [
    '/content/Limitation_Act_2005.pdf',
    '/content/Criminal_Code_Act_Compilation_Act_1913.pdf',
    '/content/Adoption_Act_1994.pdf',
    '/content/Births_Deaths_and_Marriages_Registration_Act_1998.pdf',
    '/content/Cat_Act_2011.pdf',
    '/content/Dog_Act_1976.pdf',
    '/content/Family_Violence_Legislation_Reform_Act_2020.pdf',
    '/content/Misuse_Of_Drugs_Act_1981.pdf',
    '/content/Residential_Tenancies_Act_1987.pdf',
    '/content/Road_Traffic_(Vehicles)_Act_2012.pdf',
    '/content/Surrogacy_Act_2008.pdf'
]
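The paths above assume the PDFs have already been uploaded to `/content`. A minimal sketch of a pre-flight check is shown here; `find_missing` is a hypothetical helper, not part of `unstructured`, and the example paths are placeholders rather than files from this notebook.

```python
from pathlib import Path


def find_missing(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Placeholder paths for illustration only; pass pdf_paths in the notebook
example_paths = ['/no_such_dir/a.pdf', '/no_such_dir/b.pdf']
missing = find_missing(example_paths)
if missing:
    print('Missing files:', missing)
```

Running `find_missing(pdf_paths)` before partitioning surfaces upload problems early, instead of partway through the loop.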

In [4]:
def get_file_name(path):
    """Return the file name (the last component) of the given path."""
    return path.split('/')[-1]
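The later cells also need the file name without its extension. As an alternative sketch, the standard library's `pathlib` can return that directly; `get_file_stem` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path


def get_file_stem(path):
    """Return the file name without its directory or final extension."""
    return Path(path).stem


print(get_file_stem('/content/Cat_Act_2011.pdf'))
```

Note that `Path.stem` removes only the last suffix, so a name containing several dots keeps everything before the final one, whereas splitting on `'.'` and taking the first piece would truncate it at the first dot.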

In [7]:
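import json
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.basic import chunk_elements

# Create the output directory up front; open() fails if it does not exist
output_dir = '/content/chunks'
os.makedirs(output_dir, exist_ok=True)

for path in pdf_paths:
    file_name = get_file_name(path)
    base_name = file_name.split('.')[0]  # file name without its extension

    # Partition the PDF with the "fast" strategy, then group the elements into chunks
    elements_fast = partition_pdf(path, strategy="fast")
    chunks = chunk_elements(elements_fast)

    # Collect each chunk's type, source metadata, and text
    data = []
    for c in chunks:
        row = {}
        row['Element Type'] = type(c).__name__
        row['Filename'] = c.metadata.filename
        row['Filetype'] = c.metadata.filetype
        row['Page Number'] = c.metadata.page_number
        row['text'] = c.text
        data.append(row)

    # Give each chunk a 1-based ID of the form "<base_name>-<n>"
    chunk_dictlist = [{'id': f'{base_name}-{i + 1}', 'metadata': chunk}
                      for i, chunk in enumerate(data)]

    # Write one JSON file per PDF
    with open(f'{output_dir}/{base_name}_chunks.json', 'w') as file:
        file.write(json.dumps(chunk_dictlist, indent=2))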
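To check the result, a chunk file can be read back with the standard `json` module. The sketch below is self-contained: `load_chunks` is a hypothetical helper, and the sample record is a made-up placeholder that only mirrors the `id`/`metadata` layout written above, not real output from the PDFs.

```python
import json
import tempfile
from pathlib import Path


def load_chunks(json_path):
    """Load a chunk file and return its list of {'id', 'metadata'} records."""
    with open(json_path, encoding='utf-8') as f:
        return json.load(f)


# Placeholder record with the same keys as the notebook's output
sample = [{
    'id': 'Example-1',
    'metadata': {
        'Element Type': 'CompositeElement',
        'Filename': 'Example.pdf',
        'Filetype': 'application/pdf',
        'Page Number': 1,
        'text': 'Placeholder text.',
    },
}]

# Write the sample to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'Example_chunks.json'
    sample_path.write_text(json.dumps(sample, indent=2), encoding='utf-8')
    records = load_chunks(sample_path)

print(len(records), records[0]['id'])
```

In the notebook itself, `load_chunks` would be pointed at one of the files under `/content/chunks` instead of the temporary sample.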

