# Process PDF files with Amazon Textract or Amazon Bedrock

> This notebook should work well with the **`conda_python3`** kernel in SageMaker Notebook instances.

---

In this notebook, we demonstrate two approaches to process PDF documents:


1. Use [Amazon Textract](https://aws.amazon.com/textract/) through the [Textractor](https://aws-samples.github.io/amazon-textract-textractor/index.html) library. Textract performs OCR to detect text in documents, which can come in different formats such as PDFs and images.

2. Use [Amazon Bedrock](https://aws.amazon.com/bedrock/) by making a `boto3` call that invokes a multi-modal foundation model to perform OCR. For this, we need to convert the document into a set of images.

We will showcase the functionality in this notebook on multiple files, which can be found in the [demo-files](demo-files/) directory. Afterwards, all extracted texts will be stored in the [processed-files](processed-files/) directory. The extracted texts are required for the subsequent generative AI tasks we will perform in [02_extract_attributes.ipynb](02_extract_attributes.ipynb).

---

# 1. PREPARATIONS

In [31]:
import boto3
import os

## Create folders

Define input and output paths as constants. 

Since Amazon Textract expects input files to be uploaded to S3, we will read them from a prepopulated S3 bucket, `s3://information-extraction-workshop`. Copies of the input files are available in the `demo-files` folder for reference.

In [32]:
input_path = "demo-files"
output_path = "processed-files"
s3_bucket = "information-extraction-workshop"

In [33]:
os.makedirs(output_path, exist_ok=True)

## List input documents

List the objects in the S3 bucket, iterate over them, and print their keys (file names).

In [None]:
s3_client = boto3.client("s3")
response = s3_client.list_objects_v2(Bucket=s3_bucket)
file_objects = response["Contents"]

for obj in file_objects:
    print(obj["Key"])
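
`list_objects_v2` returns at most 1,000 keys per call, and the response has no `Contents` entry when nothing matches. A minimal sketch of a more defensive key filter follows, written as a pure function so it can be tried without AWS access. The helper name `pdf_keys` and the sample response are hypothetical and not part of the workshop code.

```python
def pdf_keys(response):
    """Return the keys of all PDF objects in a list_objects_v2-style response."""
    # "Contents" is absent when no objects match, so fall back to an empty list
    contents = response.get("Contents", [])
    return [obj["Key"] for obj in contents if obj["Key"].lower().endswith(".pdf")]


# Hypothetical response, shaped like the dict returned by list_objects_v2
sample_response = {
    "Contents": [
        {"Key": "2302.13971v1.pdf"},
        {"Key": "notes/readme.txt"},
    ]
}

print(pdf_keys(sample_response))
print(pdf_keys({}))
```

Filtering on the `.pdf` extension keeps stray objects in the bucket out of the OCR loop.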

During the workshop, we will use two documents with snippets from scientific papers. We chose scientific papers because they have complex layouts, including figures and tables, as well as specialized terminology.

This is what the first page of one of the documents looks like:

![screenshots/2302.13971v1.png](screenshots/2302.13971v1.png)

Feel free to extend the list of documents afterwards! Now we are all set and can proceed to the OCR.

---

# 2. PROCESSING

We will start by doing OCR with Amazon Textract, and then do OCR with Amazon Bedrock to compare results.

## Option 1: Run Textractor on demo files

If you use the Amazon Textract OCR engine, you can choose between synchronous APIs such as `DetectDocumentText` (called `detect_document_text` in the *Textractor* library) and asynchronous APIs such as `StartDocumentAnalysis` (called `start_document_analysis` in the *Textractor* library).

The former block code execution until the OCR inference completes, while the latter return a job ID that you can use to retrieve the results later. In this notebook, we will use the `start_document_analysis` API.

First, we create the Textractor client:

In [35]:
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.markdown_linearization_config import MarkdownLinearizationConfig

extractor = Textractor(region_name="us-west-2")

Next, we run the Textractor library on all `demo-files`, one by one. The extracted text outputs will be stored in the `processed-files` directory. 

This cell contains your first task! Check out the README in the [Textractor documentation](https://github.com/aws-samples/amazon-textract-textractor) and find the right command to obtain the document object. Pay attention to which features you enable through the arguments, so that you get as much content as possible out of the document.

In [None]:
%%time

for obj in file_objects:
    print(f"Processing {obj['Key']}")
    base_filename = obj['Key']
    s3_filename = f"s3://{s3_bucket}/{base_filename}"
    output_filename = base_filename.removesuffix(".pdf") + "_textract.txt"
    
    #########################################
    #    TASK 1: Add Textractor call here
    #########################################
    
    document = None
    
    #########################################
    #    TASK 1: Add Textractor call here
    #########################################
    
    # Pass a config instance (not the class itself) to linearize the document as Markdown
    document_text = document.get_text(config=MarkdownLinearizationConfig())
    
    with open(os.path.join(output_path, output_filename), "w") as text_file:
        text_file.write(document_text)
    print("---")

* If you have trouble after a few minutes, check out `answers/answers.ipynb`
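
The loop above derives the output file name with `str.removesuffix`, which requires Python 3.9 or later, and it assumes that the S3 key has no folder prefix. The following is a minimal sketch of a more tolerant variant; the helper name `output_name` and the example keys are hypothetical.

```python
from pathlib import PurePosixPath


def output_name(key, suffix="_textract.txt"):
    """Derive a flat output file name from an S3 key."""
    # S3 keys use "/" as separator; keep only the last path component
    name = PurePosixPath(key).name
    # Strip the ".pdf" extension regardless of letter case
    if name.lower().endswith(".pdf"):
        name = name[: -len(".pdf")]
    return name + suffix


print(output_name("2302.13971v1.pdf"))
print(output_name("papers/2302.13971v1.PDF", suffix="_bedrock.txt"))
```

Dropping the prefix means that `os.path.join(output_path, ...)` never points into a subfolder that does not exist yet.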

Note: if the Textractor call fails due to this error:
> PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

This means you need to install [poppler](https://poppler.freedesktop.org/) by running the following command in the terminal:
```bash
sudo yum install -y poppler-utils
```

Check out the output folder and open one of the two `.txt` files! Write some code for opening the saved `.txt` file from the folder you created.

P.S. We recommend opening the PDF and the text file side by side to judge the text extraction quality. How does Amazon Textract handle text in two columns? What about tables and images? Do you see any quality issues?

In [None]:
##################################################
#    TASK 2: Open and print the output text file
##################################################

parsed_text = None

print(parsed_text)

#################################################
#    TASK 2: Open and print the output text file
#################################################

* If you have trouble after a few minutes, check out `answers/answers.ipynb`

## Option 2: Run Amazon Bedrock on demo files

To improve the parsing quality for complex content like tables and images, we can leverage multi-modal foundation models from Amazon Bedrock. First, we will have to convert the PDF into a set of images to be able to send it to models in Amazon Bedrock. Second, we will send a set of images to the LLM to parse them.

Let's start with the conversion. We don't need to save the images. Instead, we keep a list of `base64` strings in memory, one per page of the PDF file.

In [40]:
import base64
import botocore
from io import BytesIO

from pdf2image import convert_from_path
import json


def get_base64_encoded_images_from_pdf(pdf_file_path):
    images = convert_from_path(pdf_file_path)
    base64_img_strs = []
    for image in images:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        img_str = base64.b64encode(buffered.getvalue())
        base64_string = img_str.decode("utf-8")
        base64_img_strs.append(base64_string)
    return base64_img_strs

In [None]:
images = get_base64_encoded_images_from_pdf(f"{input_path}/{base_filename}")
print(len(images))

The list of images contains one base64-encoded string for each page of the PDF file.
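
The round trip between raw image bytes and the base64 string that is sent to the model can be sketched as follows. The byte sequence is a made-up stand-in for a real page image; only its first four bytes follow the usual JPEG header.

```python
import base64

# Stand-in for a real page image: JPEG start-of-image marker plus padding bytes
fake_jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16

# Encode to a text string that is safe to embed in a JSON request body
encoded = base64.b64encode(fake_jpeg_bytes).decode("utf-8")

# Decoding recovers the original bytes
decoded = base64.b64decode(encoded)

print(type(encoded).__name__, encoded[:4])
print(decoded == fake_jpeg_bytes)
```

Base64-encoded JPEG data starts with `/9j/`, which makes for a quick sanity check on the entries of `images`.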

Now, let's instantiate the Bedrock client:

In [42]:
bedrock_runtime = boto3.client("bedrock-runtime")

We will need to put together a prompt that instructs the model to perform OCR. This will be your final task for this notebook!

Think about how to structure the prompt. You want to ask the LLM to convert the document images to text. What should it keep in mind? How should it format the output?

Hint: think about instructions for dealing with figures and tables. How would you like the LLM to represent them?

In [43]:
##################################################
#    TASK 3: Formulate the system prompt
##################################################

system_prompt = """Always reply 'Amazon Web Services'."""

##################################################
#    TASK 3: Formulate the system prompt
##################################################

prompt = """Here is the document."""

* If you have trouble after a few minutes, check out `answers/answers.ipynb`

Next, we combine the text prompt and the images into the list of messages to be sent to the LLM.

In [44]:
messages_content = [
    {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/jpeg",
            "data": image_str,
        },
    }
    for image_str in images
]

messages_content.append({"type": "text", "text": prompt})

messages = [
    {
        "role": "user",
        "content": messages_content,
    }
]

Now, we can finally send our document to Amazon Bedrock!

In [None]:
%%time

modelId = "anthropic.claude-3-haiku-20240307-v1:0"  # (change this to try different model versions)

body = json.dumps(
    {
        "max_tokens": 4_096,
        "system": system_prompt,
        "temperature": 0.0,
        "messages": messages,
        "anthropic_version": "bedrock-2023-05-31",
    }
)


try:
    response = bedrock_runtime.invoke_model_with_response_stream(
        body=body,
        modelId=modelId,
        accept="application/json",
        contentType="application/json",
    )

    output = ""

    stream = response.get("body")

    if stream:
        for event in stream:
            chunk = event.get("chunk")
            if chunk:
                chunk_obj = json.loads(chunk.get("bytes").decode())
                # Only content_block_delta events carry generated text;
                # other events (e.g. message_delta) have a delta without "text"
                delta_obj = chunk_obj.get("delta") or {}
                text = delta_obj.get("text")
                if text:
                    output += text
                    print(text, end="")
    print("")

except botocore.exceptions.ClientError as error:
    if error.response["Error"]["Code"] == "AccessDeniedException":
        print(
            f"\x1b[41m{error.response['Error']['Message']}\
                \nTo troubleshoot this issue please refer to the following resources.\
                \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
                \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n"
        )

    else:
        raise error
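
The stream handling above can also be tried offline. The sketch below extracts the text deltas with a small pure function and feeds it hand-written events that imitate the shape of the response stream. The names `collect_text` and `make_event` are hypothetical, and the fake events are an assumption about the event layout rather than captured output.

```python
import json


def collect_text(events):
    """Concatenate the text deltas found in a sequence of stream events."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue
        payload = json.loads(chunk["bytes"].decode("utf-8"))
        # Events without a text delta (start, stop, metadata) are skipped
        text = (payload.get("delta") or {}).get("text")
        if text:
            parts.append(text)
    return "".join(parts)


def make_event(payload):
    """Wrap a payload dict as a stream event (hypothetical test helper)."""
    return {"chunk": {"bytes": json.dumps(payload).encode("utf-8")}}


# Hand-written events imitating a streamed model response
fake_events = [
    make_event({"type": "message_start"}),
    make_event({"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hello, "}}),
    make_event({"type": "content_block_delta", "delta": {"type": "text_delta", "text": "world"}}),
    make_event({"type": "message_delta", "delta": {"stop_reason": "end_turn"}}),
]

print(collect_text(fake_events))
```

Keeping the parsing in a function makes it possible to test the logic without calling Amazon Bedrock.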

Check out the output text! We recommend opening the PDF, the previous `.txt` file, and the new text file side by side to compare the text extraction quality. How does the parsing quality change when using Amazon Bedrock? Is it better or worse than with Amazon Textract?

Don't forget to export the result as a `.txt` file below:

In [46]:
base_filename = obj["Key"]
s3_filename = f"s3://{s3_bucket}/{base_filename}"
output_filename = base_filename.removesuffix(".pdf") + "_bedrock.txt"

with open(os.path.join(output_path, output_filename), "w") as text_file:
    text_file.write(output)
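
Besides reading the two outputs side by side, a rough similarity score can hint at how much they differ. The following is a minimal sketch using `difflib` from the standard library. The function name `similarity` is hypothetical, and the two snippets are made-up stand-ins for the contents of the `_textract.txt` and `_bedrock.txt` files.

```python
import difflib


def similarity(text_a, text_b):
    """Return a ratio between 0 and 1 for two texts, compared word by word."""
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
    return matcher.ratio()


# Made-up snippets standing in for the two extracted texts
textract_text = "Table 1 shows the results"
bedrock_text = "Table 1: shows the results"

print(f"{similarity(textract_text, bedrock_text):.2f}")
```

A low score only says that the outputs differ, not which one is closer to the PDF, so manual inspection is still needed.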

## Optional tasks

Once you are done, consider experimenting with one of the tasks below:

Amazon Textract:
- Check configuration arguments for `MarkdownLinearizationConfig` to see if you can tweak Textractor parameters to improve OCR quality.

Amazon Bedrock:
- Experiment with the LLM prompt to see if you can further improve the OCR quality/formatting when using Amazon Bedrock.
- Switch to a different multi-modal LLM and compare the OCR quality. Which model performs best?
- Run LLM-based OCR for the second document in the demo folder. How does the performance compare?