# PDF Extract Demo

To use this notebook, upload your credentials (both the JSON file and private.key) to the content directory. Click the folder icon on the left, click the .. entry to show all folders, open the content folder, then click the three dots to the right of it and choose Upload. You can select both files and upload them together.

Then run the first cell to install the SDK. pip will report an error about its dependency resolver. You can ignore that error, but you do need to restart the runtime by clicking the button at the bottom of the pip install output.

In [None]:
!pip install pdfservices-sdk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pdfservices-sdk
  Downloading pdfservices_sdk-1.0.2-py3-none-any.whl (74 kB)
Collecting packaging==20.9
  Downloading packaging-20.9-py2.py3-none-any.whl (40 kB)
Collecting certifi==2020.12.5
  Downloading certifi-2020.12.5-py2.py3-none-any.whl (147 kB)
Collecting build==0.3.0
  Downloading build-0.3.0-py2.py3-none-any.whl (13 kB)
Collecting requests-toolbelt==0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
Collecting cryptography==3.4.6
  Downloading cryptography-3.4.6-cp36-abi3-manylinux2014_x86_64.whl (3.2 MB)
Collecting pypars

Run this cell, then choose a file to upload to the Extract service. If you get an error like this:
```
Cannot read property '_uploadFiles' of undefined
```
then you'll need to enable third-party cookies. On Chrome, go to chrome://settings/content/cookies.

In [None]:
from google.colab import files
# Use a name other than `input` so the Python built-in is not shadowed.
uploaded = files.upload()
filename = list(uploaded.keys())[0]

Saving FOLIODETE_20220811092600.pdf to FOLIODETE_20220811092600.pdf
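
The upload cell takes the first key of the dictionary that `files.upload()` returns, which raises an `IndexError` if that dictionary is empty and silently accepts a non-PDF file. A slightly more defensive selection could look like the sketch below. `pick_pdf` is a hypothetical helper, and the sketch assumes only that the upload result maps file names to file contents; a plain dictionary stands in for it so the code runs outside Colab.

```python
def pick_pdf(uploaded):
    """Return the first uploaded file name ending in .pdf (case-insensitive)."""
    for name in uploaded:
        if name.lower().endswith(".pdf"):
            return name
    raise ValueError("No PDF found among uploaded files: " + repr(list(uploaded)))


# Stand-in for the dictionary returned by the upload dialog (name -> bytes).
example_upload = {"notes.txt": b"", "FOLIODETE_20220811092600.pdf": b"%PDF-"}
print(pick_pdf(example_upload))
```

In the notebook this would replace the last line of the cell with `filename = pick_pdf(files.upload())`.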


Run this cell to send your file to the Extract service. If it succeeds, it creates a zip file with the same basename in the content folder. If you get an error saying that the `adobe` module was not found, reinstall the SDK by rerunning the first cell.

In [None]:
from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import ExtractRenditionsElementType

import os

basename, _ = os.path.splitext(filename)
zip_file = f"{basename}.zip"
print(f'Extract from {filename} ...')


credentials = Credentials.service_account_credentials_builder()\
  .from_file("pdfservices-api-credentials.json") \
  .build()

execution_context = ExecutionContext.create(credentials)
extract_pdf_operation = ExtractPDFOperation.create_new()

# Set operation input from a source file.
source = FileRef.create_from_local_file(filename)
extract_pdf_operation.set_input(source)

# Build ExtractPDF options and set them into the operation
extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
  .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
  .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES, ExtractRenditionsElementType.FIGURES]) \
  .build()

extract_pdf_operation.set_options(extract_pdf_options)

# Execute the operation.
result = extract_pdf_operation.execute(execution_context)

# Save the result to the specified location.
result.save_as(zip_file)
print(f'Saved as {zip_file}')

Extract from FOLIODETE_20220811092600.pdf ...
Saved as FOLIODETE_20220811092600.zip
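
Before parsing the result, it can be useful to look inside the zip and confirm it holds `structuredData.json`, the member the next cell reads. The sketch below uses only the standard `zipfile` module. `list_archive` is a hypothetical helper, and the archive is built in memory with made-up member names so the example runs without calling the service.

```python
import io
import zipfile


def list_archive(zip_source, required="structuredData.json"):
    """Return (member names, whether `required` is among them)."""
    with zipfile.ZipFile(zip_source) as z:
        names = z.namelist()
    return names, required in names


# Stand-in archive; the rendition file name here is purely illustrative.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("structuredData.json", '{"elements": []}')
    z.writestr("tables/example.png", b"")

names, ok = list_archive(buffer)
print(names, ok)
```

In the notebook, `list_archive(zip_file)` would run the same check against the real output.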


The next step reads the JSON from the zip and loops over its elements to collect the text of the PDF. The text is accumulated in a string variable and printed as the cell output.

In [None]:
import zipfile
import json

with zipfile.ZipFile(zip_file) as z:
  raw = z.read('structuredData.json').decode()
  data = json.loads(raw)

text = ''
for element in data["elements"]:
  if "Text" in element:
    text += element["Text"] + "\n"

print(text)

CAMDEN, RAYMOND  
403 ROBINHOOD CIRCLE 
LAFAYETTE LA  70508     
UNITED STATES OF AMERICA 
HAMPTON INN - LAS COLINAS,TX 
820 W. WALNUT HILL LANE 
IRVING, TX  75038     United States of America TELEPHONE 972-753-1232    • FAX 972-550-0300   Reservations 
www.hamptoninn.com or 1 800 HAMPTON 
Room No: 
505/KXTD    
Arrival Date: 
8/9/2022  1:38:00 PM 
Departure Date: 
8/11/2022 9:26:00 AM 
Adult/Child: 
1/0 
Cashier ID: 
VSTITT 
Room Rate: 
156.00 
AL: 
HH # 
538140961 GOLD 
VAT # 
Folio No/Che 
529987 A 
Confirmation Number: 54367540  
HAMPTON INN - LAS COLINAS,TX 8/11/2022 9:25:00 AM 
DATE 
REF NO 
DESCRIPTION 
CHARGES 
8/9/2022 
1754251 
GUEST ROOM 
$156.00 
8/9/2022 
1754251 
STATE TAX 
$9.36 
8/9/2022 
1754251 
CITY TAX 
$14.04 
8/9/2022 
1754251 
STATE RECOVERY FEE 
$0.81 
8/10/2022 
1754504 
GUEST ROOM 
$163.00 
8/10/2022 
1754504 
STATE TAX 
$9.78 
8/10/2022 
1754504 
CITY TAX 
$14.67 
8/10/2022 
1754504 
STATE RECOVERY FEE 
$0.85 
8/11/2022 
1754615 
MC *0354 
**BALANCE**    
($3
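
The printed text is one value per line, and in this folio the charges appear as repeating groups of date, reference number, description, and amount. As a rough sketch of how that could be turned into rows, the hypothetical `parse_charges` helper below scans for that four-line pattern. It assumes the layout seen in this particular document; other PDFs will likely need different rules.

```python
import re

DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")
AMOUNT = re.compile(r"\$\d[\d,]*\.\d{2}")


def parse_charges(text):
    """Collect (date, ref_no, description, amount) from four-line groups."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    rows = []
    for i in range(len(lines) - 3):
        date, ref, desc, amount = lines[i:i + 4]
        # A row is a bare date, a numeric reference, any text, then a $ amount.
        if DATE.fullmatch(date) and ref.isdigit() and AMOUNT.fullmatch(amount):
            value = float(amount.lstrip("$").replace(",", ""))
            rows.append((date, ref, desc, value))
    return rows


# A few lines copied from the output above.
sample = "8/9/2022 \n1754251 \nGUEST ROOM \n$156.00 \n8/9/2022 \n1754251 \nSTATE TAX \n$9.36 \n"
for row in parse_charges(sample):
    print(row)
```

Calling `parse_charges(text)` on the full string from the previous cell would apply the same scan to the whole folio. Lines such as the arrival date, which carry a time after the date, are skipped because the date must match the whole line.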