This notebook will let you analyze all certificate documents uploaded to [californiascouting.org](https://californiascouting.org) and query for the following information:

| Query |
| --- |
| This certificate is presented to who? |
| For completing the California Child Abuse Mandated Reporter Online Training for what? |
| Hours of education earned? |
| What is the date of completion? |
| What is the certificate number? |

# Create an S3 Bucket
First create an S3 Bucket to store the documents to be analyzed. Go to [Amazon S3 > Buckets](https://s3.console.aws.amazon.com/s3/buckets?region=us-west-2) and create a bucket named `bsa-ca-ab506-training` in the `us-west-2` region.

# Create an IAM Policy

Go to the [IAM > Policies](https://us-east-1.console.aws.amazon.com/iamv2/home#/policies) page and create a Policy named `BSA_CA_AB-506_TextractAnalyzeDocuments` and grant it the following permissions:
- Service: `S3`
    - Actions
        - List: `ListBucket`
    - Resources
        - Bucket name: `bsa-ca-ab506-training`
- Service: `S3`
    - Actions
        - Read: `GetObject`
        - Write: `PutObject`
    - Resources
        - Bucket name: `bsa-ca-ab506-training`
        - Object name: `*` (any)
- Service: `Textract`
    - Actions
        - Read: `AnalyzeDocument`
        - Read: `GetDocumentAnalysis`
        - Write: `StartDocumentAnalysis`
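
For reference, the permissions listed above correspond roughly to the policy document below, which could be pasted into the JSON tab of the policy editor instead of using the visual editor. This is a sketch built with Python's `json` module: the helper name `build_policy` is hypothetical, and the statement layout is one reasonable way to express the list, not necessarily what the visual editor generates.

```python
import json


def build_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting the permissions listed above."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # List the objects in the bucket (bucket-level resource)
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Read and write any object in the bucket (object-level resource)
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
            {
                # Textract actions used for document analysis
                "Effect": "Allow",
                "Action": [
                    "textract:AnalyzeDocument",
                    "textract:GetDocumentAnalysis",
                    "textract:StartDocumentAnalysis",
                ],
                "Resource": "*",
            },
        ],
    }


policy = build_policy("bsa-ca-ab506-training")
print(json.dumps(policy, indent=2))
```

`ListBucket` applies to the bucket itself while `GetObject` and `PutObject` apply to the objects inside it, which is why the list above has two separate `S3` entries.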

# Set up to run this notebook

Choose one of the two options for running this notebook:

## Option 1: Running this notebook in the cloud

This option will run this notebook in a Notebook Instance in AWS SageMaker.

### Step 1: Create an Amazon SageMaker Notebook Instance

Follow instructions on https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html

1. Under "Notebook instance type" choose "ml.t3.medium" or "ml.t2.medium" for the cheapest option.
    1. T3 instances have Unlimited burstable mode enabled automatically. T2 instances are also burstable but use Standard mode by default.
    2. The price is roughly the same: "ml.t3.medium" is $0.05/hour and "ml.t2.medium" is $0.0464/hour.
2. Under IAM Role choose "Create a new role". You will later attach your previously made policy to this new role.

### Step 2: Attach your policy to the new AmazonSageMaker role

1. Go to the [IAM > Roles](https://us-east-1.console.aws.amazon.com/iamv2/home#/roles) page and open the role starting with "AmazonSageMaker".
2. In the "Permissions policies" list check the box for the new policy named `BSA_CA_AB-506_TextractAnalyzeDocuments`.
3. Save the changes.

### Step 3: Import this notebook into the Notebook Instance

1. After the Notebook Instance starts, click "Open JupyterLab" on the Notebook Instances list.
2. Click the upload button (up arrow button above the folder list) to upload this notebook.
3. Now you can run the following code cells (skip "Option 2" steps).

## Option 2: Running this notebook on your computer

This option will run this notebook on your computer with either Visual Studio Code or Azure Data Studio.

### Step 1: Create an IAM User and Configure Boto3

First you need to create an IAM User for Boto3 to use. The user does not need to be unique per project, so if you have already created one you can leave AWS CLI / Boto3 configured to use that same User.

Go to https://docs.aws.amazon.com/textract/latest/dg/setting-up.html and follow the steps to create the user.

Next go to https://docs.aws.amazon.com/textract/latest/dg/setup-awscli-sdk.html and follow the steps to configure Boto3 with the credentials and default region of `us-west-2`.

### Step 2: Create an IAM Role

Go to the [IAM > Roles](https://us-east-1.console.aws.amazon.com/iamv2/home#/roles) page and create a Role. Follow these steps:

#### Select trusted entity
1. Select the "Trusted entity type" of `Custom trust policy`.
2. On the right side under "1. Add actions for STS" leave the default `AssumeRole` option selected.
3. On the right side under "2. Add a principal" click the Add button and choose:
    - Principal type: `IAM users`
    - ARN: Copy and paste the ARN from the User's details page. For example: `arn:aws:iam::840153288643:user/alex`
4. Click Next.
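
The choices above should produce a trust policy along the following lines. This is a sketch of the expected JSON, built in Python so it can be printed and compared with what the console shows; the helper name `build_trust_policy` is hypothetical, and the user ARN is the example ARN from the step above.

```python
import json


def build_trust_policy(user_arn: str) -> dict:
    """Return a trust policy that lets one IAM user assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The IAM user that is allowed to assume this role
                "Principal": {"AWS": user_arn},
                # The STS action left selected in the console
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy = build_trust_policy("arn:aws:iam::840153288643:user/alex")
print(json.dumps(trust_policy, indent=2))
```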

#### Add permissions

1. In the "Permissions policies" list check the box for the new policy named `BSA_CA_AB-506_TextractAnalyzeDocuments`.
2. Click Next.

#### Role details

1. Give the new role a name of `BSA_CA_AB-506_Textract_Role`.
2. Create the role.
3. Copy the ARN from the details page, for example: `arn:aws:iam::840153288643:role/BSA_CA_AB-506_Textract_Role`. You will need this when creating a Profile for Boto3 to use.

### Step 3: Create a Profile in AWS CLI/SDK

Follow the instructions on https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-cli.html to create a profile named `bsa-ca-ab506-textract` using the `role_arn` of `arn:aws:iam::840153288643:role/BSA_CA_AB-506_Textract_Role` that you copied in the previous step. These are the steps:

1. Edit the `~/.aws/config` file in Linux or Mac, or the `C:\Users\USERNAME\.aws\config` file in Windows.
2. Add the following:
    ```
    [profile bsa-ca-ab506-textract]
        role_arn = arn:aws:iam::840153288643:role/BSA_CA_AB-506_Textract_Role
        source_profile = default
    ```
3. Save the file.
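
To double-check the edit, the profile can be read back with Python's standard `configparser` module. The sketch below parses config text rather than the real file so that it runs anywhere; the helper name `profile_settings` and the sample text (including its `[default]` section) are illustrative assumptions.

```python
import configparser


def profile_settings(config_text: str, profile_name: str) -> dict:
    """Return the key/value pairs of a named profile from AWS config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = f"profile {profile_name}"
    if not parser.has_section(section):
        raise KeyError(f"profile {profile_name!r} not found")
    return dict(parser[section])


# Illustrative config text; a real file may contain other profiles and settings.
sample = """\
[default]
region = us-west-2

[profile bsa-ca-ab506-textract]
role_arn = arn:aws:iam::840153288643:role/BSA_CA_AB-506_Textract_Role
source_profile = default
"""

settings = profile_settings(sample, "bsa-ca-ab506-textract")
print(settings["role_arn"])
```

To check the real file, pass in the contents of `~/.aws/config`, for example `pathlib.Path.home().joinpath(".aws", "config").read_text()`.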

### Step 4: Use the new profile in this notebook

Change `boto3.Session()` to `boto3.Session(profile_name='bsa-ca-ab506-textract')` in the first code block (not providing a profile name is meant for use with Amazon SageMaker).

### Step 5: Install Python packages

Install the following Python packages in your Notebook (in Azure Data Studio click on the "Manage Packages" icon at the top right of the notebook):

- boto3
- pandas
- pikepdf
- PyPDF2
- requests

Now you can run the following code cells:

# Script 1: Initialize packages and variables

In [25]:
import os
import io
from urllib.parse import urlparse

import boto3
import pandas as pd
import pikepdf
import requests
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.errors import PdfReadError

input_csv_filename = "bsa-glaac-ca-ab506-certificates-input2.csv"
output_csv_filename = "bsa-glaac-ca-ab506-certificates.csv"
bucket_name = "bsa-ca-ab506-training"
second_page_only_prefix = 'second_page_only/'

# Change the following `boto3.Session()` to `boto3.Session(profile_name='bsa-ca-ab506-textract')` if running this notebook locally
session = boto3.Session()

# Script 2: Upload all URLs to an S3 Bucket

This script will take an input CSV (with the columns "submit_id" and "upload_doc_url") and upload the files to the S3 bucket.

If you re-run this script block it will only upload the URLs remaining to be uploaded.

If the file is a PDF and it has more than one page it will extract the 2nd page only and upload it to the path `second_page_only/`. Otherwise it will upload the file directly.
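
The naming and resume rules described above can be sketched without touching S3. The helper names and URLs below are hypothetical; only the `second_page_only/` prefix and the use of the URL's file name as the object key come from this notebook.

```python
import os
from urllib.parse import urlparse

SECOND_PAGE_ONLY_PREFIX = "second_page_only/"


def document_name(url: str) -> str:
    """File name portion of the URL path, used as the S3 object key."""
    return os.path.basename(urlparse(url).path)


def target_key(url: str, page_count: int) -> str:
    """Key the upload is stored under: prefixed when only page 2 is kept."""
    name = document_name(url)
    return f"{SECOND_PAGE_ONLY_PREFIX}{name}" if page_count > 1 else name


def strip_prefix(key: str) -> str:
    """Remove the second-page prefix so keys compare against document names."""
    if key.startswith(SECOND_PAGE_ONLY_PREFIX):
        return key[len(SECOND_PAGE_ONLY_PREFIX):]
    return key


def remaining_urls(urls, bucket_keys):
    """URLs whose document is not in the bucket yet, with or without the prefix."""
    uploaded = {strip_prefix(key) for key in bucket_keys}
    return [url for url in urls if document_name(url) not in uploaded]


# Hypothetical URLs and bucket contents for illustration
urls = [
    "https://californiascouting.org/uploads/cert-a.pdf",
    "https://californiascouting.org/uploads/cert-b.pdf",
]
bucket_keys = ["second_page_only/cert-a.pdf"]

print(target_key(urls[0], page_count=2))
print(remaining_urls(urls, bucket_keys))
```

Because the prefix is stripped before comparing, a document counts as uploaded whether it was stored whole or as a single extracted page.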

In [31]:
def upload_from_url(url: str, bucket):
    if url.startswith('https://californiascouting.org/'):
        r = requests.get(url)
        if r.status_code == 200:
            # get file name
            file_name = os.path.basename(urlparse(url).path)

            # load file
            raw_data = r.content

            # detect pages
            try:
                reader = PdfReader(io.BytesIO(raw_data))
                if len(reader.pages) > 1:
                    # extracting the 2nd page
                    # (Textract synchronous document analysis only works on one page PDFs)
                    file_name = f"{second_page_only_prefix}{file_name}"
                    page = reader.pages[1]
                    writer = PdfWriter()
                    writer.add_page(page)
                    with io.BytesIO() as bytes_stream:
                        writer.write(bytes_stream)
                        bytes_stream.seek(0)
                        bucket.upload_fileobj(bytes_stream, file_name)
                else:
                    # only one page PDF, saving the file directly to S3
                    bucket.upload_fileobj(io.BytesIO(raw_data), file_name)
            except PdfReadError:
                # not a PDF, saving the file directly to S3
                bucket.upload_fileobj(io.BytesIO(raw_data), file_name)
            except Exception:
                # PyPDF2 could not handle this file, switching to pikepdf
                with pikepdf.Pdf.open(io.BytesIO(raw_data)) as pdf:
                    if len(pdf.pages) > 1:
                        # extracting the 2nd page
                        # (Textract synchronous document analysis only works on one page PDFs)
                        file_name = f"{second_page_only_prefix}{file_name}"
                        page = pdf.pages[1]
                        dst = pikepdf.Pdf.new()
                        dst.pages.append(page)
                        with io.BytesIO() as bytes_stream:
                            dst.save(bytes_stream)
                            bytes_stream.seek(0)
                            bucket.upload_fileobj(bytes_stream, file_name)
                    else:
                        # only one page PDF, saving the file directly to S3
                        bucket.upload_fileobj(io.BytesIO(raw_data), file_name)
            print(f'Uploaded {file_name}')
        else:
            # report failed downloads instead of skipping them silently
            print(f'Skipped {url}: HTTP status {r.status_code}')

def upload_files(bucket_name: str, input_csv_filename: str, session: boto3.Session, testing=False) -> pd.DataFrame:
    s3 = session.resource('s3')
    bucket = s3.Bucket(bucket_name)

    # load full URL list
    df = pd.read_csv(input_csv_filename)
    if testing:
        df.sort_values(by=['upload_doc_url'], ascending=False, ignore_index=True, inplace=True)
    if 'document' not in df:
        df.insert(
            loc=df.columns.get_loc('upload_doc_url') + 1,
            column='document',
            value=df['upload_doc_url'].apply(lambda x: os.path.basename(urlparse(x).path))
        )

    # check which have already been uploaded (we will filter these out when re-running this script)
    objects = [obj.key.replace(second_page_only_prefix, '') for obj in bucket.objects.all()]
    upload_df = df[~df['document'].isin(objects)]
    if testing:
        upload_df = upload_df.head(6)

    # upload each URL to the bucket
    object_count = len(upload_df.index)
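    # count by position: the DataFrame index no longer starts at 0 once uploaded rows are filtered out
    for position, (_, row) in enumerate(upload_df.iterrows(), start=1):
        print(f"{position} of {object_count}")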
        upload_from_url(row['upload_doc_url'], bucket)
    return df

input_df = upload_files(bucket_name, input_csv_filename, session)

# Script 3: Analyze all files in the S3 bucket using Amazon Textract

If you re-run this script block it will only analyze the files remaining to be analyzed. If a file can't be analyzed it will print an "UnsupportedDocumentException" warning and move on to the next file.
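
Textract returns the answer to each query as a separate `QUERY_RESULT` block, linked from its `QUERY` block through an `ANSWER` relationship. The sketch below shows that pairing on a small hand-written response; the helper name `extract_answers` and the sample values are made up for illustration, and real responses contain many other block types.

```python
def extract_answers(blocks: list) -> dict:
    """Map each query alias to (answer text, confidence) via ANSWER relationships."""
    # Index the result blocks by Id so each query can look up its answer
    results = {b["Id"]: b for b in blocks if b["BlockType"] == "QUERY_RESULT"}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"]["Alias"]
        # A query that found nothing has no "Relationships" key
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = results.get(answer_id)
                if result is not None:
                    # Keep the first answer listed for each alias
                    answers.setdefault(alias, (result.get("Text", ""), result["Confidence"]))
    return answers


# Hand-written sample in the shape of an analyze_document response (values are made up)
sample_blocks = [
    {
        "BlockType": "QUERY",
        "Id": "q1",
        "Query": {"Text": "Hours of education earned?", "Alias": "analyzed_certificate_hours"},
        "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
    },
    {
        "BlockType": "QUERY",
        "Id": "q2",
        "Query": {"Text": "What is the scout member id?", "Alias": "analyzed_consent_scout_member_id"},
    },
    {"BlockType": "QUERY_RESULT", "Id": "a1", "Text": "2", "Confidence": 97.5},
]

answers = extract_answers(sample_blocks)
print(answers)
```

Unanswered queries simply do not appear in the result, which is how the cell below tells a certificate from a consent form: it checks whether the training-type query returned anything.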

In [26]:
def analyze_document(bucket_name: str, document: str, textract_client) -> dict:
    try:
        response = textract_client.analyze_document(
            Document={'S3Object': {'Bucket': bucket_name, 'Name': document}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={'Queries': [
                {'Text': 'This certificate is presented to who?', 'Alias': 'analyzed_certificate_name'},
                {'Text': 'For completing the California Child Abuse Mandated Reporter Online Training for what?', 'Alias': 'analyzed_certificate_training_type'},
                {'Text': 'Hours of education earned?', 'Alias': 'analyzed_certificate_hours'},
                {'Text': 'What is the date of completion?', 'Alias': 'analyzed_certificate_date'},
                {'Text': 'What is the certificate number?', 'Alias': 'analyzed_certificate_number'},
                {'Text': 'What is the name (printed)?', 'Alias': 'analyzed_consent_name'},
                {'Text': 'What is the scout member id?', 'Alias': 'analyzed_consent_scout_member_id'},
                {'Text': 'What is the date?', 'Alias': 'analyzed_consent_date'}
            ]}
        )
    except textract_client.exceptions.UnsupportedDocumentException:
        print(f"UnsupportedDocumentException: {document}")
        return

    answers = {}

    for block in response['Blocks']:
        if block['BlockType'] == "QUERY" and "Relationships" in block:
            answer_id = block['Relationships'][0]['Ids'][0]
            if answer_id not in answers:
                answers[answer_id] = {}
            answers[answer_id]['alias'] = block['Query']['Alias']
        elif block['BlockType'] == "QUERY_RESULT":
            answer_id = block['Id']
            if answer_id not in answers:
                answers[answer_id] = {}
            answers[answer_id]['answer'] = block['Text']
            answers[answer_id]['confidence'] = block['Confidence']

    row = {
        'document': document
    }
    for answer in answers.values():
        row[answer['alias']] = answer['answer']
        row[f"{answer['alias']}_confidence_score"] = answer['confidence']

    if row.get('analyzed_certificate_training_type'):
        # is Certificate
        for key in list(row.keys()):
            if key.startswith('analyzed_consent'):
                del row[key]
    else:
        # is Consent
        for key in list(row.keys()):
            if key.startswith('analyzed_certificate'):
                del row[key]

    return row

def scan_bucket(bucket_name: str, csv_filename: str, session: boto3.Session, force_analyze=False, testing=False) -> pd.DataFrame:
    s3 = session.resource('s3')
    textract_client = session.client('textract', region_name='us-west-2')
    bucket = s3.Bucket(bucket_name)
    if testing:
        objects = [obj.key for obj in bucket.objects.limit(4)]
    else:
        objects = [obj.key for obj in bucket.objects.all()]

    # check which have already been analyzed (we will filter these out when re-running this script)
    if not force_analyze and os.path.exists(csv_filename):
        df = pd.read_csv(csv_filename)
        if 'document' in df:
            objects = list(filter(lambda key: key not in df['document'].values, objects))

    object_count = len(objects)
    for index, key in enumerate(objects):
        print(f"{index + 1} of {object_count}")
        row = analyze_document(bucket_name, key, textract_client)
        if row:
            columns = [
                'document',
                'analyzed_certificate_name',
                'analyzed_certificate_name_confidence_score',
                'analyzed_certificate_training_type',
                'analyzed_certificate_training_type_confidence_score',
                'analyzed_certificate_hours',
                'analyzed_certificate_hours_confidence_score',
                'analyzed_certificate_date',
                'analyzed_certificate_date_confidence_score',
                'analyzed_certificate_number',
                'analyzed_certificate_number_confidence_score',
                'analyzed_consent_name',
                'analyzed_consent_name_confidence_score',
                'analyzed_consent_scout_member_id',
                'analyzed_consent_scout_member_id_confidence_score',
                'analyzed_consent_date',
                'analyzed_consent_date_confidence_score'   
            ]
            row_df = pd.DataFrame([row], columns=columns)
            if not os.path.exists(csv_filename):
                row_df.to_csv(csv_filename, index=False)
            else:
                row_df.to_csv(csv_filename, mode='a', index=False, header=False)
            print(row)
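    if not os.path.exists(csv_filename):
        # no document was analyzed successfully, so no CSV was written
        return pd.DataFrame(columns=['document'])
    df = pd.read_csv(csv_filename)
    return df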

output_df = scan_bucket(bucket_name, output_csv_filename, session)

1 of 19
UnsupportedDocumentException: 1644912340-upload-doc-Volunteer-BSA-online-training-cert.pdf
2 of 19
UnsupportedDocumentException: 1645690048-upload-doc-Certificate-00857117.pdf
3 of 19
UnsupportedDocumentException: 1645849648-upload-doc-Certificate-00859724.pdf
4 of 19
UnsupportedDocumentException: 1649096610-upload-doc-BackgroundCkAuthorization03312022.docx
5 of 19
UnsupportedDocumentException: 1651021311-upload-doc2-Background-Check-Consent-Form-2022-for-LiveScan.docx
6 of 19
UnsupportedDocumentException: 1652207203-upload-doc2-kara-scout-consent.pdf
7 of 19
UnsupportedDocumentException: 1653605339-upload-doc-2022-EXPLORER-MANDATED-REPORTING-CERT.docx
8 of 19
UnsupportedDocumentException: 1654449500-upload-doc-Mandated-Reporting-Certificate-2022_MEMB.docx
9 of 19
UnsupportedDocumentException: 1655096778-upload-doc2-IMG_5418.pdf
10 of 19
UnsupportedDocumentException: 1657064076-upload-doc2-Scouts-background-consent.pdf
11 of 19
UnsupportedDocumentException: 1658879476-upload-do
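
Several of the rejected files in the output above are `.docx` documents. Textract accepts PDF, PNG, JPEG and TIFF input, so files with other extensions could be set aside before calling it. The sketch below is one way to do that; the helper name `split_by_extension` is hypothetical, and it checks the extension only, so PDFs that Textract rejects for other reasons would still raise the exception.

```python
import os

# File extensions for the formats Textract accepts (PDF, PNG, JPEG, TIFF)
TEXTRACT_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def split_by_extension(keys):
    """Split object keys into (supported, skipped) based on file extension."""
    supported, skipped = [], []
    for key in keys:
        extension = os.path.splitext(key)[1].lower()
        if extension in TEXTRACT_EXTENSIONS:
            supported.append(key)
        else:
            skipped.append(key)
    return supported, skipped


# Two keys taken from the output above
keys = [
    "1645690048-upload-doc-Certificate-00857117.pdf",
    "1649096610-upload-doc-BackgroundCkAuthorization03312022.docx",
]
supported, skipped = split_by_extension(keys)
print(supported)
print(skipped)
```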

# Script 4: Merge the input and output together

If the output does not include a "submit_id" column, merge the input and output data frames on the "document" column and save the result to a new CSV file named after the output file with a `_merged.csv` suffix.
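
The merge in the cell below uses the default inner join, so input rows without an analysis result are dropped from the merged file. As a variation (not what the cell below does), a left join with `indicator=True` keeps every input row and shows which ones are missing. The sketch uses small made-up data frames with the same column names.

```python
import pandas as pd

second_page_only_prefix = "second_page_only/"

# Made-up stand-ins for input_df and output_df
input_df = pd.DataFrame({
    "submit_id": [1, 2],
    "document": ["a.pdf", "b.pdf"],
})
output_df = pd.DataFrame({
    "document": ["second_page_only/a.pdf"],
    "analyzed_certificate_hours": ["2"],
})

# Strip the prefix so the object keys match the input file names
output_df["document"] = output_df["document"].str.replace(
    second_page_only_prefix, "", regex=False
)

# Keep every input row; the "_merge" column records where each row was found
merged = pd.merge(input_df, output_df, on="document", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "document"].tolist()
print(missing)
```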

In [35]:
if 'submit_id' not in output_df:
    output_df['document'] = output_df['document'].apply(lambda x: x.replace(second_page_only_prefix, ''))
    merged_df = pd.merge(input_df, output_df, on="document")
    merged_df.to_csv(output_csv_filename.replace('.csv', '_merged.csv'), index=False)