# PDF Extract Demo

To use this notebook, upload your credentials (both the JSON file and private.key) to the content directory. Click the folder icon on the left, click the .. entry to show all folders, click the content folder, then open the three-dot menu to the right of the content folder and click Upload. You can select both files and upload them in one step.

Then execute the first cell to install the SDK. You'll see an error about pip's dependency resolver. You can ignore this error, but you do need to restart the runtime by clicking the button at the bottom of the pip install output.

In [None]:
!pip install pdfservices-sdk
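
Before moving on, it can help to confirm that both credential files actually landed in the content directory. The following is a minimal sketch and not part of the SDK: `missing_credentials` is a hypothetical helper, the JSON file name is the one the extract cell in this notebook loads, and the check assumes the notebook's working directory is the content folder.

```python
from pathlib import Path

# File names as used in this notebook; adjust them if yours differ.
REQUIRED_FILES = ("pdfservices-api-credentials.json", "private.key")


def missing_credentials(directory=".", required=REQUIRED_FILES):
    """Return the names in `required` that are not files inside `directory`."""
    folder = Path(directory)
    return [name for name in required if not (folder / name).is_file()]


# Hypothetical check against the current working directory.
missing = missing_credentials()
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("Both credential files found.")
```

If anything is reported missing, upload it through the Files tree before running the extract cell.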

Run this cell, then choose a file to upload to the Extract service. If you get an error like this:
```
Cannot read property '_uploadFiles' of undefined
```
then you'll need to enable third-party cookies. On Chrome, go to chrome://settings/content/cookies.

In [None]:
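from google.colab import files

# Opens a file picker and returns a dict mapping each uploaded filename to its contents.
# Named `uploaded` rather than `input` so the Python builtin is not shadowed.
uploaded = files.upload()
filename = list(uploaded.keys())[0]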

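
If the upload is cancelled or no file is chosen, the dictionary returned by `files.upload()` may be empty, and taking its first key then fails with an unhelpful IndexError. A small guard gives a clearer message. This is only a sketch: `first_uploaded_name` is a hypothetical helper, and the empty-dictionary behaviour on cancel is an assumption about Colab rather than something this notebook demonstrates.

```python
def first_uploaded_name(uploaded):
    """Return the first filename in an upload dict, or raise a readable error."""
    if not uploaded:
        raise ValueError("No file was uploaded; run the upload cell again.")
    # Dicts preserve insertion order, so this is the first file that was uploaded.
    return next(iter(uploaded))


# In the notebook this would stand in for the indexing line, for example:
#   filename = first_uploaded_name(files.upload())
# Stand-alone demonstration with a fake upload result:
print(first_uploaded_name({"example.pdf": b"%PDF-1.7"}))
```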

Run this cell to send your file to the Extract service. If it succeeds, it creates a zip file with the same basename in the content folder. If you get an error saying the adobe module was not found, reinstall the SDK by running the first cell again.

In [None]:
from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import ExtractRenditionsElementType

import os

basename, _ = os.path.splitext(filename)
zip_file = f"{basename}.zip"
print(f'Extract from {filename} ...')


# Load the service account credentials uploaded to the content folder.
credentials = Credentials.service_account_credentials_builder() \
  .from_file("pdfservices-api-credentials.json") \
  .build()

execution_context = ExecutionContext.create(credentials)
extract_pdf_operation = ExtractPDFOperation.create_new()

# Set operation input from a source file.
source = FileRef.create_from_local_file(filename)
extract_pdf_operation.set_input(source)

# Build ExtractPDF options and set them into the operation
extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
  .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
  .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES, ExtractRenditionsElementType.FIGURES]) \
  .build()

extract_pdf_operation.set_options(extract_pdf_options)

# Execute the operation.
result = extract_pdf_operation.execute(execution_context)

# Save the result to the specified location.
result.save_as(zip_file)
print(f'Saved as {zip_file}')

Extract from PlanetaryScienceDecadalSurvey.pdf ...
Saved as PlanetaryScienceDecadalSurvey.zip
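
To see what the service returned without downloading anything, you can list the archive's entries with Python's standard `zipfile` module. This is a minimal sketch: `list_archive` and `print_archive` are hypothetical helpers, not SDK functions. The Extract service typically places a structuredData.json file in the archive alongside the requested table and figure renditions, but this notebook does not show the layout, so treat those names as an assumption.

```python
import zipfile


def list_archive(path):
    """Return (name, size_in_bytes) pairs for every entry in a zip archive."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def print_archive(path):
    """Print one line per archive entry: uncompressed size, then name."""
    for name, size in list_archive(path):
        print(f"{size:>12,}  {name}")


# In the notebook, after the extract cell has run:
#   print_archive(zip_file)
```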


You can download the zip file from the content folder yourself (using the three-dot menu in the Files tree) or run this cell. Uploaded files are deleted automatically when the session ends, or you can delete them yourself from the Files tree. (You may have to close and re-open the content folder for the PDF and zip files to show up.)

In [None]:
from google.colab import files

files.download(zip_file)
