# Post-Processing Amazon Textract with Location-Aware Transformers

# Part 1: Introduction and Data Collection

> *This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio*

[**LayoutLM** (2019, Xu et al)](https://arxiv.org/abs/1912.13318) is a BERT-like language model architecture in which the *position embedding* inputs (usually encoding the position of each word/token in the input sequence, as detailed in [this paper](https://openreview.net/forum?id=onxoVA9FxMw)) are modified to encode the **absolute position of the word/token on a page**.

Architectures like this enable us to build ML models which are aware of both *text content* and *page position*: especially useful for analyzing and post-processing OCR results from services like [Amazon Textract](https://aws.amazon.com/textract/), which returns both the detected text and the geometry of each word in the input document.
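
As a minimal sketch of how the two fit together: Amazon Textract reports each word's bounding box as `Left`/`Top`/`Width`/`Height` ratios of the page size, while the Hugging Face LayoutLM implementation expects integer `[x0, y0, x1, y1]` boxes on a 0-1000 scale. The helper name `textract_box_to_layoutlm` and the sample block below are illustrative assumptions, not part of this sample's `util` code.

```python
from typing import Dict, List


def textract_box_to_layoutlm(box: Dict[str, float], scale: int = 1000) -> List[int]:
    """Convert a Textract BoundingBox (page-relative ratios) to [x0, y0, x1, y1] integers."""
    x0 = box["Left"]
    y0 = box["Top"]
    x1 = x0 + box["Width"]
    y1 = y0 + box["Height"]
    # Clamp to the page in case OCR geometry slightly overshoots the edges:
    return [max(0, min(scale, round(v * scale))) for v in (x0, y0, x1, y1)]


# Hand-written example in the shape of a Textract WORD block (values are made up):
word_block = {
    "BlockType": "WORD",
    "Text": "APR",
    "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.125, "Height": 0.03125}},
}
print(textract_box_to_layoutlm(word_block["Geometry"]["BoundingBox"]))
```

Each word's text and its normalized box would then be passed to the model together, so that both content and position inform the prediction.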

Since LayoutLM is based on a standard multi-task text transformer architecture with customizations to the input processing layer, this approach could be generalized to a wide range of task types using both text and position information, like:

- "Self-supervised" pre-training on Textracted but otherwise unlabelled documents
- Document/page/sequence classification
- Entity extraction (token/word classification)
- Span extraction and extractive question answering
- "Translation", generative question answering or other sequence generation

Some of these use cases (notably pre-training, sequence classification and token/word classification) are already supported in the LayoutLM [implementation](https://huggingface.co/transformers/model_doc/layoutlm.html) provided in the popular open source [Hugging Face Transformers library](https://huggingface.co/transformers/).

In this sample we'll review an example use case where Amazon Textract's [built-in functionality](https://aws.amazon.com/textract/features/) for extracting key-value "Forms" data and structured "Tables" data captures some of the fields we need, but misses others due to the complexity of the documents.

This first notebook will focus on preparing and annotating data, before we move on to training, deploying, and integrating models in later notebooks.

## Getting started: Dependencies and configuration

First there are some additional libraries we need to install that aren't present in the SageMaker kernel environments by default:

In [None]:
# Tool for building customized container images and pushing to Amazon ECR:
!pip install sagemaker-studio-image-build

In [None]:
# Helper for interpreting Textract results:
!pip install amazon-textract-response-parser

In [None]:
# Notebook 2 requires sagemaker>=2.49 for Hugging Face container versions:
!pip install "sagemaker>=2.49,<3"

With the required libraries installed, we're ready to import dependencies and set up some basic configuration including which [Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html) and folder/prefix data will be uploaded to.

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
from logging import getLogger
import os
import random
import re
import shutil
from zipfile import ZipFile

# External Dependencies:
import boto3  # AWS SDK for Python
from IPython.display import HTML  # To display rich content in notebook
import pandas as pd  # For tabular data analysis
import sagemaker  # High-level SDK for SageMaker
from tqdm.notebook import tqdm  # Progress bars

# Local Dependencies:
import util

logger = getLogger()

bucket_name = sagemaker.Session().default_bucket()
bucket_prefix = "textract-transformers/"

For some sections below we'll need to reference resources created by the *[AWS CloudFormation](https://aws.amazon.com/cloudformation/) solution stack* you spun up earlier. If you didn't do this step yet, refer to the [README.md](../README.md) in the top level of the repository for instructions.

The solution stack stores these useful variables in [AWS Systems Manager Parameter Store](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html), and we use the `util.project` utility module below to fetch them. This is a transferable pattern you can use to connect from data science notebooks to deployed ML project resources in the cloud by project name/ID.

▶️ **Check** in the [CloudFormation Console](https://console.aws.amazon.com/cloudformation/home?#/stacks) that the `ProjectId` parameter for your OCR Pipeline Stack matches the default `ocr-transformers-demo` value below: Otherwise change the code below to match.

> ⚠️ If you get an **AccessDeniedException** (ClientError) below, it's likely your [SageMaker execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) doesn't have the required `ssm:GetParameters` permission to look up the OCR pipeline stack parameters.
>
> To fix this, you can click your execution role in the [IAM Roles Console](https://console.aws.amazon.com/iamv2/home#/roles) and use the **Attach policies** button to attach the `PipelineDataSciencePolicy` created by the stack.

In [None]:
try:
    config = util.project.init("ocr-transformers-demo")
    print(config)
except Exception:
    try:
        print(f"Your SageMaker execution role is: {sagemaker.get_execution_role()}")
    except Exception:
        print("Couldn't look up your SageMaker execution role")
    raise  # Re-raise the original error with its traceback intact
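
To illustrate the general pattern (not the actual internals of `util.project`), here is a minimal sketch of looking up project-scoped values from Parameter Store with the boto3 `get_parameters` call. The `/{project_id}/{name}` naming scheme and the function name `fetch_project_params` are assumptions made for this example: the real stack may name its parameters differently.

```python
from typing import Dict, Iterable


def fetch_project_params(project_id: str, names: Iterable[str], ssm_client=None) -> Dict[str, str]:
    """Look up '/{project_id}/{name}' parameters (hypothetical naming scheme) from SSM."""
    if ssm_client is None:
        import boto3  # Imported lazily so the function can be exercised with a stub client

        ssm_client = boto3.client("ssm")
    # Map each fully-qualified parameter name back to its short name:
    full_names = {f"/{project_id}/{name}": name for name in names}
    # Note: GetParameters accepts a limited number of names per call (10 at time of writing)
    response = ssm_client.get_parameters(Names=list(full_names))
    missing = response.get("InvalidParameters", [])
    if missing:
        raise ValueError(f"Parameters not found: {missing}")
    return {full_names[p["Name"]]: p["Value"] for p in response["Parameters"]}
```

A lookup like this needs the same `ssm:GetParameters` permission discussed in the warning above, which is why a missing policy surfaces as an `AccessDeniedException`.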

## Fetch the example data

For our example, we'll be exploring (a recent quarter's snapshot of) the [Credit card agreements database](https://www.consumerfinance.gov/credit-cards/agreements/) published by the United States' [Consumer Financial Protection Bureau](https://www.consumerfinance.gov/).

This dataset includes specimen credit card agreement documents from providers across the US, and is interesting for our purposes because the documents are:

- **Diverse** in formatting, as various providers present the required information in different ways
- **Representative of commercial** documents - rather than, for example, academic papers which might have quite different tone and structure
- **Complex** in structure, with common data points in theory (e.g. interest rates, fees, etc) - but a lot of nuances and differences between documents in practice.

Below, we download a recent (approx. 750MB) archive from the dataset and extract the files (approx. 900MB uncompressed):

In [None]:
os.makedirs("data", exist_ok=True)

!wget -O data/CC_Agreements.zip https://files.consumerfinance.gov/a/assets/Credit_Card_Agreements_2020_Q4.zip

In [None]:
valid_file_types = { "jpeg", "jpg", "pdf", "png" }

# Extract the zip:
with ZipFile("data/CC_Agreements.zip", "r") as fzip:
    rel_filepaths_all = sorted([
        f.filename
        for f in fzip.infolist()
        if not (f.is_dir() or "__MACOSX" in f.filename)
    ])
    print(f"Found {len(rel_filepaths_all)} files in archive")
    print("Extracting...")
    fzip.extractall("data/raw")

rel_filepaths = sorted(
    [f for f in rel_filepaths_all if f.lower().rpartition(".")[2] in valid_file_types]
)

# Clean up unneeded files and remap if the folder became nested:
original_root_items = os.listdir("data/raw")
if "__MACOSX" in original_root_items:
    shutil.rmtree("data/raw/__MACOSX")
if len(original_root_items) < 5:
    try:
        folder = next(f for f in original_root_items if f.startswith("Credit_Card_Agreements"))
        print(f"De-nesting folder '{folder}'...")
        for sub in os.listdir(f"data/raw/{folder}"):
            shutil.move(f"data/raw/{folder}/{sub}", f"data/raw/{sub}")
        rel_filepaths = [
            f[len(folder + "/"):] if f.startswith(folder + "/") else f for f in rel_filepaths
        ]
    except StopIteration:
        pass

print(f"Found {len(rel_filepaths)} valid files for OCR")

You can explore these documents in the `data/raw` folder from the file browser - or even pull them through to display inline here in the notebook:

In [None]:
HTML(
    '<iframe src="{}" width=100% height=600 type="application/pdf"></iframe>'.format(
        # Edit the below (e.g. 0, 1, 2) to see different documents:
        "data/raw/" + rel_filepaths[0]
    )
)

For multi-page documents like these PDFs, Amazon Textract [requires us](https://docs.aws.amazon.com/textract/latest/dg/async.html) to use the async APIs and pre-load the documents to S3.

Therefore we'll upload the PDFs to use later:

In [None]:
raw_s3uri = f"s3://{bucket_name}/{bucket_prefix}data/raw"

In [None]:
print(f"Uploading raw PDFs to {raw_s3uri}...")
!aws s3 sync --quiet data/raw $raw_s3uri
print("Done")
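
For reference, the async Textract flow mentioned above looks roughly like the sketch below: start a job against an S3 object, poll until it finishes, then page through the results. This notebook does not call Textract this way (we'll use the deployed pipeline instead, which also handles service quotas), and the function name `textract_document` and simple polling loop are illustrative assumptions.

```python
import time
from typing import Any, Dict, List


def textract_document(
    textract, bucket: str, key: str, poll_secs: float = 5.0
) -> List[Dict[str, Any]]:
    """Start an async Textract analysis job on an S3 object and collect all result blocks."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state:
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_secs)
    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {result['JobStatus']}")

    # Results are paginated: follow NextToken until it is no longer returned
    blocks = list(result.get("Blocks", []))
    while result.get("NextToken"):
        result = textract.get_document_analysis(JobId=job_id, NextToken=result["NextToken"])
        blocks.extend(result.get("Blocks", []))
    return blocks
```

With real credentials, `textract` would be `boto3.client("textract")`. For bulk processing, polling in a loop like this is less robust than completion notifications, which is one reason to prefer a managed pipeline.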

## Defining the challenge

We have our source documents, so what will we try to extract about them?

There are many ways position-aware NLP models might be applied to OCR outputs: For example to generate structured summaries, provide translations, answer questions, or just classify the documents.

A common requirement in document analytics and process automation though, is to extract particular **'fields' of interest**: Known attributes expected to be present in all/most of the documents, which would be interesting to compare between them.

In this example we'll tackle this as an **entity detection** task **via word classification**:

- Defining a list of field/entity types of interest
- Classifying each `WORD` in the document to these types, using a Hugging Face [LayoutLMForTokenClassification](https://huggingface.co/transformers/model_doc/layoutlm.html#layoutlmfortokenclassification) model
- ...And finally grouping individual words together (via simple rule-based post-processing) to detect the entities/fields
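
The final grouping step can be sketched as below: consecutive words that share a predicted class are merged into one entity. This is a deliberately simplified illustration (it ignores page layout and confidence scores), not the post-processing logic shipped with this sample, and the class IDs in the example are arbitrary.

```python
from typing import List, Optional, Tuple


def group_words(
    words: List[str], class_ids: List[Optional[int]]
) -> List[Tuple[int, str]]:
    """Merge runs of consecutive words sharing a class into (class_id, text) entities.

    A class of None means the word was not assigned to any field.
    """
    entities: List[Tuple[int, str]] = []
    current_class: Optional[int] = None
    current_words: List[str] = []
    for word, class_id in zip(words, class_ids):
        if class_id != current_class:
            # The class changed: close off any entity in progress
            if current_class is not None:
                entities.append((current_class, " ".join(current_words)))
            current_class, current_words = class_id, []
        if class_id is not None:
            current_words.append(word)
    if current_class is not None:
        entities.append((current_class, " ".join(current_words)))
    return entities


# Toy example with made-up class IDs (8 and 16):
words = ["Annual", "fee", "is", "$120", "per", "year", "from", "AnyCompany", "Bank"]
classes = [None, None, None, 8, None, None, None, 16, 16]
print(group_words(words, classes))
```

Because every entity is built from specific input words, each result can be traced back to the OCR word blocks it came from.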

Some **benefits** of this approach are:

- Results are traceable all the way back to the detected word blocks from the OCR engine; rather than with a text generation method where the output of the model may not correspond 1:1 with detected word inputs.
- Annotation effort is relatively minimal; since we only need to highlight the documents, rather than typing out custom corrections, answers, etc.

Some **drawbacks** are:

- Since it only tags detected words, this model will not be able to *intelligently correct OCR errors* or *standardize form* (e.g. of dates) like a text generation method could learn to.
- Since the ML component only extends to word classification, we're still relying on rule-based heuristics (which usually work well, with help from Amazon Textract) to group same-type words together to detect multi-word entities.

Below, we'll define the set of fields/entities to be detected and their configuration aspects:

> ⚠️ **Warning:** Although you may **edit** the configuration below, you'll no longer be able to use the pre-annotated data sample we provide in `data/annotations` to accelerate model training (unless the classes are still defined in the same order, and labelled in a consistent way with the previous guidelines)

In [None]:
from util.postproc.config import FieldConfiguration

# For config API details, you can see the docs in the source file or run:
# help(FieldConfiguration)

fields = [
    # (To prevent human error, enter class_id=0 each time and update programmatically below)
    FieldConfiguration(0, "Agreement Effective Date", optional=True, select="first",
        annotation_guidance=(
            "<p>Avoid labeling extraneous dates which are not necessarily the effective date of "
            "the document: E.g. copyright dates/years, or other dates mentioned in text.</p> "
            "<p>Do not include unnecessary qualifiers e.g. 'from 2020/01/01'.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - Introductory", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Use this class (instead of the others) for <em>ANY</em> case where the rate is "
            "offered for a fixed introductory period - regardless of interest rate subtype e.g. "
            "balance transfers, purchases, etc.</p> "
            "<p>Include the term of the introductory period in cases where it's directly listed "
            "(e.g. '20.00% for the first 6 months'). Try to minimize/exclude extraneous "
            "information about the offer (e.g. '20.00% for the first 6 months after account "
            "opening').</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - Balance Transfers", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Use for interest rates which are specific to balance transfers.</p> "
            "<p>Avoid including extraneous information about the terms of balance transfers, or "
            "using for fixed-term introductory rates.</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - Cash Advances", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Use for interest rates which are specific to cash advances.</p> "
            "<p>Avoid including extraneous information about the terms of cash advances, or using "
            "for fixed-term introductory rates.</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - Purchases", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Use for interest rates which are specific to purchases.</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - Penalty", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Use for penalty interest rates applied under certain conditions.</p> "
            "<p><em>Exclude</em> information about the conditions under which the penalty "
            "rate comes into effect: Only include the interest rate which will be applied.</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - General", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Use for interest rates which are general and not specifically tied to a "
            "particular transaction type e.g. purchases / balance transfers.</p> "
            "<p>Avoid using for fixed-term introductory rates.</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "APR - Other", optional=True, select="confidence",
        # TODO: Remove this class
        annotation_guidance=(
            "<p>Use only for interest rates which don't fall in to any other category (including "
            "general or introductory rates). You may not see any examples in the data.</p> "
            "<p>Avoid using for fixed-term introductory rates.</p> "
            "<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>"
        ),
    ),
    FieldConfiguration(0, "Fee - Annual", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Include cases where the document explicitly indicates no fee e.g. 'None'</p> "
            "<p>Avoid any introductory terms e.g. '$0 for the first 6 months' or extraneous "
            "words: Label only the standard fee.</p> "
            "<p>Label only the annual amount of the fee, in cases where other breakdowns are "
            "specified: E.g. '$120', not '$10 per month ($120 per year)'.</p> "
        ),
    ),
    FieldConfiguration(0, "Fee - Balance Transfer", optional=True, select="confidence",
        annotation_guidance=(
            # TODO: Review
            "<p>Try to be concise and exclude extra terms where not necessary</p>"
        ),
    ),
    FieldConfiguration(0, "Fee - Late Payment", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Label only the fee, not the circumstances in which it is payable.</p> "
            "<p>Limits e.g. 'Up to $25' are acceptable (don't just label '$25').</p> "
            "<p>Do <em>NOT</em> include non-specific mentions of pass-through costs (e.g. 'legal "
            "costs', 'reasonable expenses', etc.) incurred in the general collections process.</p>"
        ),
    ),
    FieldConfiguration(0, "Fee - Returned Payment", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Label only the fee, not the circumstances in which it is payable.</p> "
            "<p>Limits e.g. 'Up to $25' are acceptable (don't just label '$25').</p>"
        ),
    ),
    FieldConfiguration(0, "Fee - Foreign Transaction", optional=True, select="shortest",
        annotation_guidance=(
            "<p>Do <em>NOT</em> include explanations of how exchange rates are calculated or "
            "non-specific indications of margins between rates. <em>DO</em> include specific "
            "charges/margins with <em>brief</em> clarifying info where listed e.g. '3% of the US "
            "dollar amount'.</p>"
        ),
    ),
    FieldConfiguration(0, "Fee - Other", ignore=True,
        annotation_guidance=(
            "<p>Common examples include: Minimum interest charge, cash advance fees, and "
            "overlimit fees.</p> "
            "<p>Do <em>NOT</em> include fixed-term introductory rates for fees (e.g. '$0 during "
            "the first year. After the first year...') - only the standard fees</p> "
            "<p><em>DO</em> include qualifying information on the amount and limits of the fee, "
            "e.g. '$5 or 5% of the amount of each transaction, whichever is the greater'.</p> "
            "<p>Do <em>NOT</em> include general information on the nature of the fee and "
            "circumstances under which it is applied: E.g. 'Cash advance fee' or 'If the amount "
            "of interest payable is...'</p>"
        ),
    ),
    FieldConfiguration(0, "Card Name",
        annotation_guidance=(
            "<p>Label instances of the brand name of specific card(s) offered by the provider "
            "under the agreement, e.g. 'Rewards Platinum Card'</p> "
            "<p>Include the ' Card' suffix where available, but also annotate instances without "
            "such as 'Rewards Platinum'</p> "
            "<p><em>Avoid</em> including the Provider Name (use the separate class for this) e.g. "
            "'AnyCompany Rewards Card' unless it's been substantially modified/abbreviated for "
            "the card name (e.g. 'AnyCo Rewards Card') or the company name is different from the "
            "Credit card provider (e.g. AnyBank offering a store credit card for AnyCompany)</p> "
            "<p>Do <em>NOT</em> include fixed-term introductory rates for fees (e.g. '$0 during "
            "the first year. After the first year...') - only the standard fees</p> "
            "<p><em>Avoid</em> labeling generic payment provider names e.g. 'VISA card' or "
            "'Mastercard', except in contexts where the provider clearly uses them as the brand "
            "name for the offered card (e.g. 'VISA Card' from 'AnyCompany VISA Card').</p>"
        ),
    ),
    FieldConfiguration(0, "Provider Address", optional=True, select="confidence",
        annotation_guidance=(
            "<p>Include department or 'attn:' lines where present (but not Provider Name where "
            "used at the start of an address e.g. 'AnyCompany; 100 Main Street...').</p> "
            "<p>Include zip/postcode where present.</p> "
            "<p><em>Avoid</em> labeling addresses for non-provider entities, such as watchdogs, "
            "market regulators, or independent agencies.</p>"
        ),
    ),
    FieldConfiguration(0, "Provider Name", select="longest",
        annotation_guidance=(
            "<p>Label the name of the card provider: Including abbreviated mentions.</p>"
        ),
    ),
    FieldConfiguration(0, "Min Payment Calculation", ignore=True,
        annotation_guidance=(
            "<p>Label clauses describing how the minimum payment is calculated.</p> "
            "<p>Exclude lead-in e.g. 'The minimum payment is calculated as...' and label directly "
            "from e.g. 'the minimum of...'.</p> "
            "<p>Do <em>NOT</em> include clauses from related subjects e.g. how account balance is "
            "calculated</p>"
        ),
    ),
    FieldConfiguration(0, "Local Terms", ignore=True,
        annotation_guidance=(
            "<p>Label full terms specific to residents of certain states/countries, or applying "
            "only in particular jurisdictions.</p> "
            "<p><em>Include</em> the scope of where the terms apply e.g. 'Residents of GA and "
            "VA...'</p> "
            "<p><em>Include</em> locally-applicable interest rates, instead of annotating these "
            "with the 'APR - ' classes</p>"
        ),
    )
]
for ix, cfg in enumerate(fields):
    cfg.class_id = ix

# Save the configuration to file:
with open("data/field-config.json", "w") as f:
    f.write(json.dumps(
        [cfg.to_dict() for cfg in fields],
        indent=2,
    ))

In [None]:
entity_classes = [f.name for f in fields]
print("\n".join(entity_classes))

## Data collection

To efficiently annotate training data for entity extraction on documents, we'll want to work **visually**: Highlighting matches, and perhaps also collecting manual transcription reviews - in case we'd like to extend the model later to support correcting text.

[Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) provides an out-of-the-box annotation UI [for bounding boxes](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-bounding-box.html) which will be useful for this, and which can also be incorporated within [customized annotation UIs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates.html) via the [crowd-bounding-box](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-crowd-bounding-box.html) element.

However, at the time of writing, the bounding box annotation tool supports images but not PDFs.

Therefore to prepare for data annotation we'll need to:

- Run our documents through Amazon Textract
- Extract individual page images from the PDFs to use through the annotation UI
- Collate the page images and Textract results together, ready for annotation

For a significantly sized corpus like this, we'd also benefit from filtering down the data a little - to save time and cost by Textracting and converting only the amount of data we'll need.

### Filter a sample of the document set

To narrow down our input corpus, we'll:

- Apply some basic filename-based rules to try and exclude the few Spanish-language documents in the corpus, since the pre-trained model we'll use later is for English only
- Take a random (but reproducible) sample of N documents, always including any documents that already have annotations

In [None]:
N_DOCS_KEPT = 120

def include_filename(name: str) -> bool:
    """Filter out docs whose filenames suggest they're likely Spanish/non-English"""
    if not name:
        return False  # Empty/missing names are excluded (and we always return a bool)
    name_l = name.lower()
    if "spanish" in name_l:
        return False
    if re.search(r"espa[nñ]ol", name_l):
        return False
    if "tarjeta" in name_l or re.search(r"cr[eé]dito", name_l):
        return False
    if re.search(r"[\[\(]esp?[\]\)]", name_l):
        return False
    return True


def get_preannotated_filepaths(exclude_job_names=()) -> list:
    """List out (alphabetically) the relative filepaths for which annotations already exist"""
    filepaths = set()  # Protect against introducing duplicates
    for job_folder in os.listdir("data/annotations"):
        if job_folder in exclude_job_names:
            logger.info(f"Skipping excluded job {job_folder}")
            continue
        manifest_file = os.path.join(
            "data",
            "annotations",
            job_folder,
            "manifests",
            "output",
            "output.manifest",
        )
        if not os.path.isfile(manifest_file):
            if os.path.isdir(os.path.join("data", "annotations", job_folder)):
                logger.warning(f"Skipping job {job_folder}: No output manifest at {manifest_file}")
            continue
        with open(manifest_file, "r") as f:
            textract_s3keys = [
                json.loads(l)["textract-ref"][len("s3://"):].partition("/")[2] for l in f
            ]
            # S3 keys are like some/prefix/data/textracted/subfolders/file.pdf/consolidated.json
            # We want subfolders/file.pdf
            filepaths.update([
                k.partition("data/textracted/")[2].rpartition("/")[0] for k in textract_s3keys
            ])
    return sorted(filepaths)


preannotated_filepaths = get_preannotated_filepaths()
if N_DOCS_KEPT < len(preannotated_filepaths):
    raise ValueError(
        "Existing annotations cannot be used for model training unless the target documents are "
        "Textracted. To proceed with fewer docs than have already been annotated, you'll need to "
        "`exclude_job_names` per the 'data/annotations' folder (e.g. ['augmentation-1']) AND "
        "remember to not include them in notebook 2 (model training). Alternatively, increase "
        f"your N_DOCS_KEPT. (Got {N_DOCS_KEPT} vs {len(preannotated_filepaths)} prev annotations)."
    )

# Forcibly including the pre-annotated docs *after* the shuffling ensures that the order of
# sampling new docs is independent of what/how many have been pre-annotated:
rel_filepaths_sample = list(filter(include_filename, rel_filepaths))
random.Random(1337).shuffle(rel_filepaths_sample)
rel_filepaths_sample = [f for f in rel_filepaths_sample if f not in preannotated_filepaths]
rel_filepaths_sample = sorted(
    preannotated_filepaths
    + rel_filepaths_sample[:N_DOCS_KEPT - len(preannotated_filepaths)]
)

print(f"Extracted random sample of {len(rel_filepaths_sample)} docs")
rel_filepaths_sample[:5] + ["..."]
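
To see why the ordering of these steps matters, here is a toy re-implementation of the same sampling logic on made-up filenames. The function name `sample_docs` and the example data are assumptions for illustration only. Because the shuffle always runs over the full filtered list with a fixed seed, newly sampled documents are always drawn from the front of the same shuffled order, however many documents were pre-annotated.

```python
import random
from typing import List, Sequence


def sample_docs(
    all_docs: Sequence[str], preannotated: Sequence[str], n: int, seed: int = 1337
) -> List[str]:
    """Pick n docs: every pre-annotated doc, topped up from a fixed shuffled order."""
    if n < len(preannotated):
        raise ValueError("n must be at least the number of pre-annotated docs")
    shuffled = list(all_docs)
    random.Random(seed).shuffle(shuffled)  # Same seed + same input -> same order every time
    already = set(preannotated)
    fresh = [d for d in shuffled if d not in already]
    return sorted(list(preannotated) + fresh[: n - len(preannotated)])


docs = [f"doc{i:02d}.pdf" for i in range(20)]  # Made-up filenames
base = sample_docs(docs, [], n=5)
with_pre = sample_docs(docs, ["doc03.pdf"], n=5)

# Every doc in the second sample is either pre-annotated or also in the first sample:
print(set(with_pre) - {"doc03.pdf"} <= set(base))
```

In practice this means increasing `N_DOCS_KEPT` later only adds documents to the sample, rather than reshuffling which ones are selected.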

### Extract clean input images

To annotate our documents with SageMaker Ground Truth image task UIs, we need **individual page images**, stripped of EXIF rotation metadata (because, at the time of writing, SMGT ignores this rotation for annotation consistency) and converted to compatible formats (since some browsers cannot render certain formats - such as TIFF).

For large corpora this process of splitting PDFs and rotating and converting images may require significant resources, but is easy to parallelize.

Therefore instead of pre-processing the raw documents here in the notebook, this is a good use case for a scalable [SageMaker Processing Job](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html).

First, our processing job will require a base container image to work with. The PDF reading tools we use aren't installed by default in pre-built SageMaker containers and aren't `pip install`able, so below we use the [SageMaker Studio Image Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli) to build a customized image based on the SageMaker PyTorch container and upload it to the [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/):

In [None]:
%%time
ecr_repo_name = "sm-scikit-ocrtools"
ecr_image_tag = "pytorch-1.7-cpu"

base_image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=os.environ["AWS_REGION"],
    instance_type="ml.c5.xlarge",  # (Just used to check whether GPUs/accelerators are used)
    py_version="py3",
    image_scope="training",
    version="1.7",
)

!sm-docker build ./preproc \
    --repository {ecr_repo_name}:{ecr_image_tag} \
    --role {config.sm_image_build_role} \
    --build-arg BASE_IMAGE={base_image_uri}

# Since the above is a shell command, we'll need to reconstruct the built URI here in Python too:
account_id = sagemaker.Session().account_id()
region = os.environ["AWS_REGION"]
ecr_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{ecr_repo_name}:{ecr_image_tag}"

# Check from Python the image was successfully created:
ecr = boto3.client("ecr")
imgs_desc = ecr.describe_images(
    registryId=account_id,
    repositoryName=ecr_repo_name,
    imageIds=[{ "imageTag": ecr_image_tag }],
)
assert len(imgs_desc["imageDetails"]) > 0, f"Couldn't find ECR image {ecr_image_uri} after build"

Next, we need to define the inputs for the processing job.

To process the whole `data/raw` corpus, you could simply pass the whole `data/raw` prefix in S3 as input to the job (as shown in the commented-out *Option 2* below) and scale up the `instance_count` to complete the work quickly.

To process just a sample subset of files for speed in our demo, we'll create a **manifest file** listing just the documents we want.

> ⚠️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance for manifests as used here on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html), and separate information on the "augmented" manifests as used later with SageMaker Ground Truth in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html).
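
As a quick illustration of the non-augmented manifest format: the file is a single JSON array whose first element declares a common S3 prefix, followed by object keys relative to that prefix. The bucket and file names below are hypothetical placeholders.

```python
import json

raw_prefix = "s3://example-bucket/textract-transformers/data/raw/"  # Hypothetical bucket name
relative_keys = ["provider-a/agreement.pdf", "provider-b/terms.pdf"]  # Hypothetical files

# First entry declares the common prefix; the remaining entries are keys relative to it:
manifest = [{"prefix": raw_prefix}] + relative_keys
print(json.dumps(manifest, indent=2))
```

Note the trailing slash on the prefix: each key is appended to it to form the full object URI.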

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

imgs_s3uri = f"s3://{bucket_name}/{bucket_prefix}data/imgs-clean"

#### OPTION 1: For processing the rel_filepaths_sample subset of raw docs only:
# Prepare a manifest file:
os.makedirs("data/preproc", exist_ok=True)
preproc_input_manifest_path = "data/preproc/dataclean-input.manifest.json"
with open(preproc_input_manifest_path, "w") as f:
    f.write(json.dumps(
        [{ "prefix": raw_s3uri + "/" }]
        + rel_filepaths_sample
    ))

# Upload the manifest to S3:
preproc_input_manifest_s3uri = f"s3://{bucket_name}/{bucket_prefix}{preproc_input_manifest_path}"
!aws s3 cp {preproc_input_manifest_path} {preproc_input_manifest_s3uri}

# Set the processing job inputs to reference the manifest:
preproc_inputs = [
    ProcessingInput(
        destination="/opt/ml/processing/input/raw",  # Expected input location, per our script
        input_name="raw",
        s3_data_distribution_type="ShardedByS3Key",  # Distribute between instances, if multiple
        s3_data_type="ManifestFile",
        source=preproc_input_manifest_s3uri,  # Manifest of sample raw documents
    ),
]
print("Selected sample subset of documents")
#### END OPTION 1

#### OPTION 2: For processing the whole data/raw folder:
# preproc_inputs = [
#     ProcessingInput(
#         destination="/opt/ml/processing/input/raw",  # Expected input location, per our script
#         input_name="raw",
#         s3_data_distribution_type="ShardedByS3Key",  # Distribute between instances, if multiple
#         source=raw_s3uri,  # S3 prefix for full raw document collection
#     ),
# ]
# print("Selected whole corpus")
#### END OPTION 2

The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/imgclean.py](preproc/imgclean.py).

The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if multiple are provided - so you should be able to adjust the `instance_type` and `instance_count` of the job if needed to take advantage of what resources you have available. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.

> ⏰ **Note:** In our tests, it took (including job start-up overheads):
>
> - About 9 minutes to process the 120-document sample with 2x `ml.c5.xlarge` instances
> - About 17 minutes to process the full 2,541-document corpus with 5x `ml.c5.4xlarge` instances.

In [None]:
%%time
processor = ScriptProcessor(
    base_job_name="ocr-img-dataclean",
    command=["python3"],
    image_uri=ecr_image_uri,  # As created above
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=60*60,
    role=sagemaker.get_execution_role(),
    volume_size_in_gb=15,
)

processor.run(
    code="preproc/imgclean.py",  # PDF splitting / image conversion script
    inputs=preproc_inputs,  # Either whole corpus or sample, as above
    outputs=[
        ProcessingOutput(
            destination=imgs_s3uri,
            output_name="imgs-clean",
            s3_upload_mode="Continuous",
            source="/opt/ml/processing/output/imgs-clean",  # Output folder, per our script
        )
    ],
)

Once the images have been extracted, we'll also download them locally to the notebook for use in visualizations later:

In [None]:
print(f"Downloading cleaned images from {imgs_s3uri}...")
!aws s3 sync --quiet {imgs_s3uri} data/imgs-clean
print("Done")

### Textract the input documents

Since we need to be mindful of the Amazon Textract service [quotas](https://docs.aws.amazon.com/general/latest/gr/textract.html#limits_textract) when processing large batches of documents, and the OCR pipeline solution stack is already set up - we'll use just the **OCR/Textract portion of the pipeline** to run our documents through Textract in bulk.

> ⏰ **Note:** The code below may take ~6 minutes to run against a 120-document sample set and may encounter occasional rate-limiting errors. **If you see errors in the output, try re-running the cell**: Successfully processed files will be skipped in repeat runs.

> ⚠️ **Note:** Refer to the [Amazon Textract Pricing Page](https://aws.amazon.com/textract/pricing/) for up-to-date guidance before running large extraction jobs.
>
> At the time of writing, the projected cost (in `us-east-1`, ignoring free tier allowances) of analyzing 100 documents with 10 pages on average was approximately \\$67 with `TABLES` and `FORMS` enabled, or \\$2 without. Across the full corpus, we measured the average number of pages per document at approximately 6.7.
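
To get a feel for how the figures above scale to your own batch, here is a rough sketch of the arithmetic. The per-page rates are simply back-calculated from the approximate numbers quoted in this note (about 67 and 2 USD per 1,000 pages), so treat them as illustrative assumptions rather than current pricing. `estimate_textract_cost` is a hypothetical helper, not part of this sample's `util` module.

```python
# Approximate per-page rates implied by the figures quoted above
# (~67 USD and ~2 USD per 1,000 pages): illustrative assumptions, not current pricing.
RATE_WITH_TABLES_FORMS = 67 / 1000
RATE_TEXT_ONLY = 2 / 1000


def estimate_textract_cost(n_docs: int, avg_pages_per_doc: float, rate_per_page: float) -> float:
    """Rough cost estimate in USD, ignoring free tier allowances and any volume discounts."""
    return n_docs * avg_pages_per_doc * rate_per_page


# The 120-document sample, assuming it matches the corpus average of ~6.7 pages per document:
for label, rate in [("TABLES + FORMS", RATE_WITH_TABLES_FORMS), ("text only", RATE_TEXT_ONLY)]:
    print(f"{label}: ~{estimate_textract_cost(120, 6.7, rate):.2f} USD")
```

On these assumptions the sample works out to roughly 54 USD with `TABLES` and `FORMS` enabled, or under 2 USD without. Your sample's actual page count may differ from the corpus average.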

In [None]:
textract_s3uri = f"s3://{bucket_name}/{bucket_prefix}data/textracted"

In [None]:
textract_results = util.preproc.call_textract(
    textract_sfn_arn=config.plain_textract_sfn_arn,
    input_base_s3uri=raw_s3uri,
    # Can instead set rel_filepaths, to process the whole dataset (see cost note above):
    input_relpaths=rel_filepaths_sample,
    # An empty `features` list (as below) skips TABLES and FORMS extraction. You can edit this line
    # to enable them, but note that this could have a significant impact on API costs:
    features=[],
    output_base_s3uri=textract_s3uri,
    skip_existing=True,
)

In [None]:
textract_s3uris = list(filter(lambda s: isinstance(s, str), textract_results))
print(f"{len(textract_s3uris)} of {len(textract_results)} docs textracted successfully")

if len(textract_s3uris) < len(textract_results):
    raise ValueError(
        "Are you sure you want to continue? Consider re-trying to process the failed docs"
    )
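
If some documents did fail, it can help to see exactly which ones before re-running. The sketch below pairs each input path with its result. It assumes (this is not confirmed by the helper's documentation) that the results list is in the same order as the input paths, with a string S3 URI for each success and some other object, such as an error, for each failure. `split_textract_results` is a hypothetical helper and the data shown is a toy stand-in for `rel_filepaths_sample` and `textract_results`.

```python
def split_textract_results(rel_paths, results):
    """Pair each input path with its result and separate successes from failures.

    Assumes `results` is in the same order as `rel_paths`, with a string S3 URI for each
    success and some other object (e.g. an error) for each failure.
    """
    succeeded, failed = {}, {}
    for path, result in zip(rel_paths, results):
        if isinstance(result, str):
            succeeded[path] = result
        else:
            failed[path] = result
    return succeeded, failed


# Toy data standing in for `rel_filepaths_sample` and `textract_results`:
example_paths = ["a.pdf", "b.pdf", "c.pdf"]
example_results = ["s3://bucket/a.json", RuntimeError("Rate exceeded"), "s3://bucket/c.json"]

succeeded, failed = split_textract_results(example_paths, example_results)
for path, err in failed.items():
    print(f"FAILED {path}: {err!r}")
```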

### Collate OCR and image data for annotation

Now we have a filtered corpus of documents with Amazon Textract results, plus cleaned and standardized images for each page - all available on Amazon S3.

To prepare for data annotation and later model training, we'll need to collate these together with a [manifest file](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html#augmented-manifest-format) in JSON-lines format.

Later in the data preparation process we'll have many reasons to slice, dice, and manipulate these manifests via the JSON alone (for example to extract shuffled random subsets for annotation jobs, combine the results of annotation jobs together, etc).
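
As a rough illustration of the kind of JSON-only manipulation described above, the sketch below combines several JSON-lines manifests (dropping repeated pages) and draws a reproducible shuffled subset. The function names are hypothetical and not part of this sample's `util` module. It assumes each record identifies its page image through a `source-ref` field.

```python
import json
import random


def load_manifest(path):
    """Read a JSON-lines manifest into a list of dicts (one per non-empty line)."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


def combine_manifests(paths, key="source-ref"):
    """Concatenate manifests, keeping only the first record seen for each `key` value."""
    seen, combined = set(), []
    for path in paths:
        for record in load_manifest(path):
            if record[key] not in seen:
                seen.add(record[key])
                combined.append(record)
    return combined


def shuffled_subset(records, n, seed=1337):
    """Return a reproducible random subset of up to `n` records."""
    shuffled = list(records)  # Copy, so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]


def save_manifest(records, path):
    """Write records back out in JSON-lines format."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Fixing the random seed means the same input always yields the same subset, which makes it easier to reproduce a job's input later.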

For this **initial cataloguing** linking Textract results to page images though, we'll actually **validate that the artifacts are present on S3** in the expected locations.

> ⏰ Because of these validation checks, the cell below may take a minute or two to run against our 120-document sample set.
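
The validation logic inside the helper isn't shown in this notebook, but one possible way to check that an artifact exists at an exact S3 URI is sketched below. `split_s3uri` and `s3_object_exists` are hypothetical names. The client argument is assumed to behave like `boto3.client("s3")`, and the example URI in the comment is a placeholder.

```python
def split_s3uri(s3uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    if not s3uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {s3uri}")
    bucket, _, key = s3uri[len("s3://"):].partition("/")
    return bucket, key


def s3_object_exists(s3_client, s3uri):
    """Check whether an object exists at exactly `s3uri`.

    Lists at most one object under the key as a prefix: S3 returns keys in sorted order,
    so an exact match (being the shortest key with that prefix) would come back first.
    """
    bucket, key = split_s3uri(s3uri)
    resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in resp.get("Contents", []))


# Example usage (requires AWS credentials, so left commented out here):
# import boto3
# s3_object_exists(boto3.client("s3"), "s3://example-bucket/data/imgs-clean/example.png")
```

Each check is one API request per artifact, which is consistent with the cell taking a minute or two on a sample of this size.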

In [None]:
warnings = util.preproc.build_data_manifest(
    "data/pages-all-sample.manifest.jsonl",
    rel_doc_paths=rel_filepaths_sample,
    textract_s3uri=textract_s3uri,
    imgs_s3uri=imgs_s3uri,
    by="page",
    no_content="omit",
)

if len(warnings):
    raise ValueError(
        f"Manifest incomplete - {len(warnings)} docs failed. Please see `warnings` for details"
    )

Let's briefly explore the catalogue we've created.

Each line of the file is a JSON record identifying a particular page:

In [None]:
with open("data/pages-all-sample.manifest.jsonl", "r") as f:
    for ix, line in enumerate(f):
        print(line, end="")
        if ix >= 2:
            print("...")
            break

The corpus has a very skewed distribution of number of pages per document, with a few outliers dragging up the average significantly.

In our tests on corpus-wide statistics:

- The overall average was **~6.7 pages per document**
- The 25th percentile was 3 pages; the 50th percentile was 6 pages; and the 75th percentile was 11 pages
- The longest document was 402 pages

Your results for sub-sampled sets will likely vary a little - but can be analyzed as below:

In [None]:
with open("data/pages-all-sample.manifest.jsonl", "r") as f:
    manifest_df = pd.DataFrame([json.loads(l) for l in f])
page_counts_by_doc = manifest_df.groupby("textract-ref")["textract-ref"].count()

print("Document page count statistics")
page_counts_by_doc.describe()

From visually inspecting some sample documents, we found that the first page was often most useful for the kinds of fields defined for extraction:

Many documents used the first page for a fact-sheet/summary, followed by subsequent pages of dense legal terms.

Therefore (since our annotation will be image/page-based rather than document-based) we'll aim to include proportionally more first pages when choosing datasets to annotate.

## Annotation infrastructure

To create a labelling job in Amazon SageMaker Ground Truth, we'll need to specify:

- **Who's** doing the labelling - which could be your own internal teams, the public crowd via Amazon Mechanical Turk, or skilled workers supplied by vendors through the AWS Marketplace
- **What** the task will look like - which could be using the [built-in task UIs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) or [custom workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates.html)
- **Where** the input data is sourced from and the results will be saved to (locations on Amazon S3)

### Create a private workteam

For this demo, you'll set up a private work "team" for just yourself to test out the annotation process.

▶️ **Open** the [Amazon SageMaker Ground Truth console, *Labeling Workforces* page](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-workforces)

> ⚠️ **Check** SM Ground Truth opens in the same **AWS Region** where this notebook and your CloudFormation stack are deployed: You may find it defaults to `N. Virginia`. Use the drop-down in the top right of the screen to switch regions.

▶️ **Select** the *Private* tab and click **Create private team**

- Choose an appropriate **name** for your team e.g. `just-me`
- (If you get the option) select to **Invite new workers via email** and enter your email address (you'll need access to this address to log in and annotate the data)
- And leave the other (Cognito, SNS, etc) parameters as default.

▶️ **If you didn't get the option** to add workers during team creation (typically because your account is already set up for SageMaker Ground Truth), then after the team is created you can:

- Click **Invite new workers** to add your email address to the workforce, and then
- Click on your **team name** to open the team details, then navigate to the *Workers* tab to add yourself to the team

▶️ **Copy** the *name* of your workteam and paste it into the cell below, to store it:

In [None]:
workteam_name = "just-me"  # TODO: Update this to match yours, if different

workteam_arn = util.smgt.workteam_arn_from_name(workteam_name)

Finally:

▶️ **Check your email** for an invitation and log in to the labelling portal. You'll be asked to configure a password on first login.


Your completed setup should look something like this in the AWS Console:

![](img/smgt-private-workforce.png "Screenshot of SageMaker Ground Truth private workforces configuration")

### Set up the custom task template

This sample provides 2 options for data annotation:

1. Use the **built-in [Bounding Box tool](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-bounding-box.html)**
2. Use the provided **custom task template** which collects transcription reviews as well as bounding boxes

We recommend **at least experimenting with the custom template** here in the notebook, to get a better understanding of how the model will "see" and use your annotations (and how you might extend this sample for your own use cases).

However, you'll probably want to use the built-in boxes tool for the bulk of your annotating work because:

- The ML model we present (in the next notebook) only supports tagging and cannot be directly trained on the text corrections you collect in the custom template
- ...And reviewing the text transcription takes extra time & effort

As detailed [in the developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates.html), custom Ground Truth UIs are HTML [Liquid templates](https://shopify.github.io/liquid/basics/introduction/). You can use the [Crowd HTML Elements](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html) to embed standard components, but also include custom HTML/CSS/JS as needed. A set of examples is provided in the [amazon-sagemaker-ground-truth-task-uis repository on GitHub](https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis).

Since spinning up a labelling job each time to test and debug a custom template would slow down development, SageMaker provides a [RenderUiTemplate API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RenderUiTemplate.html) for previewing the worker experience.

First, we'll populate the master template `*.liquid.tpl.html` with the entity/field types we configured earlier (and some other automated content) to produce the final SageMaker Ground Truth template `*.liquid.html`:

In [None]:
from bs4 import BeautifulSoup

with open("annotation/ocr-bbox-and-validation.liquid.tpl.html", "r") as ftpl:
    with open("annotation/ocr-bbox-and-validation.liquid.html", "w") as fout:
        template = BeautifulSoup(ftpl.read())

        annotator_el = template.find(id="annotator")
        annotator_el["header"] = "Highlight entities and review their OCR results."
        annotator_el["labels"] = json.dumps(entity_classes)

        if any(f.annotation_guidance for f in fields):
            full_instructions_el = template.find("full-instructions")
            full_instructions_el.append(
                BeautifulSoup(
                    "\n".join(
                        ["<h3>Per-Field Guidance</h3>"]
                        + [
                            f"<h4>{f.name}</h4>\n{f.annotation_guidance}"
                            for f in fields if f.annotation_guidance
                        ]
                    )
                )
            )

        fout.write(template.prettify())

To be able to serve our example images through the UI, SageMaker Ground Truth requires the target S3 bucket to be set up [with CORS permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-cors-update.html) (which is not the same as making the bucket or its contents public).

The cell below will ensure these permissions are set on the bucket configured earlier by `bucket_name`:

In [None]:
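from botocore.exceptions import ClientError

s3 = boto3.resource("s3")
bucket_cors = s3.BucketCors(bucket_name)

try:
    existing_rules = bucket_cors.cors_rules
except ClientError as e:
    # Only treat a missing CORS configuration as "no rules": Re-raise any other error
    if e.response["Error"]["Code"] != "NoSuchCORSConfiguration":
        raise
    existing_rules = []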

if any(
    r for r in existing_rules
    if "*" in r["AllowedOrigins"]
    and "GET" in r["AllowedMethods"]
):
    logger.info("Bucket already set up with CORS permissions")
else:
    new_rules = existing_rules + [
        {
            "ID": "SageMakerGroundTruth",
            "AllowedHeaders": [],
            "AllowedMethods": ["GET"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": [],
            "MaxAgeSeconds": 60,
        },
    ]
    cors_resp = bucket_cors.put(
        CORSConfiguration={"CORSRules": new_rules},
        ExpectedBucketOwner=os.environ["AWS_ACCOUNT_ID"],
    )
    logger.info("Added CORS permissions to bucket")

Now, we're ready to render the preview of what this task would look like with an actual record from the data manifest:

In [None]:
role = sagemaker.get_execution_role()
smclient = boto3.client("sagemaker")

# Fetch an example record from the manifest:
ix_example = 0
with open("data/pages-all-sample.manifest.jsonl", "r") as fmanifest:
    sample_task_str = None
    for ix, line in enumerate(fmanifest):
        if ix == ix_example:
            sample_task_str = line
            break

# Render the template with the example record:
ui_render_file = "annotation/render.tmp.html"
with open("annotation/ocr-bbox-and-validation.liquid.html", "r") as fui:
    with open(ui_render_file, "w") as frender:
        ui_render_resp = smclient.render_ui_template(
            UiTemplate={ "Content": fui.read() },
            Task={ "Input": sample_task_str },
            RoleArn=role,
        )
        frender.write(ui_render_resp["RenderedContent"])

print(f"▶️ Open {ui_render_file} and click 'Trust HTML' to see the UI in action!")

After opening [annotation/render.tmp.html](annotation/render.tmp.html) and clicking **Trust HTML** in the toolbar, you should see a view similar to the one below:

> ℹ️ **Note:** In this task template, you need to click the "Instructions" button to expand the transcription review pane on the left!

![](img/smgt-custom-template-demo.png "Screenshot of custom annotation UI")

Note that:

- When you draw a bounding box on the page image, a new OCR result is populated in the left sidebar prompting you to review (and if necessary correct) Textract's transcription of the text in that region.
- Overlapping bounding boxes of the same type are consolidated, allowing us to highlight non-rectangular regions of text (for example a particular sentence over multiple lines within a paragraph).
- Transcription review fields are mandatory: The template should not let you submit the result until all transcriptions have been reviewed.

You should aim to follow these same conventions when annotating the sample data, even with the built-in task type. Under the hood, the ML model code applies similar logic to map your bounding box annotations to the Textract detected `WORD`s and `LINE`s.
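
To build some intuition for how boxes might be mapped onto detected words, here is a minimal sketch. It is not the sample's actual implementation: the rule used here (a word takes the class of the first annotation box containing its center point) is an assumption chosen for simplicity, and `Box`, `label_words` and the `Fee` class name are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle in any consistent units (e.g. pixels)."""

    left: float
    top: float
    width: float
    height: float

    def contains_point(self, x: float, y: float) -> bool:
        return (
            self.left <= x <= self.left + self.width
            and self.top <= y <= self.top + self.height
        )


def label_words(words: Dict[str, Box], annotations: List[Tuple[str, Box]]) -> Dict[str, str]:
    """Assign each word the class of the first annotation box containing its center point.

    Because each box is tested independently, several overlapping or adjacent boxes of
    one class behave like their union, so together they can cover a non-rectangular region.
    """
    labels = {}
    for word_id, wbox in words.items():
        cx = wbox.left + wbox.width / 2
        cy = wbox.top + wbox.height / 2
        for class_name, abox in annotations:
            if abox.contains_point(cx, cy):
                labels[word_id] = class_name
                break
    return labels


# Four words on two lines of text. The highlighted sentence starts at the end of
# line 1 and continues at the start of line 2, so it takes two boxes to cover it:
words = {
    "w1": Box(10, 10, 20, 10),  # Line 1, left
    "w2": Box(60, 10, 20, 10),  # Line 1, right
    "w3": Box(10, 30, 20, 10),  # Line 2, left
    "w4": Box(60, 30, 20, 10),  # Line 2, right
}
annotations = [("Fee", Box(50, 5, 40, 20)), ("Fee", Box(5, 25, 40, 20))]

labels = label_words(words, annotations)
print(labels)
```

Here only `w2` and `w3` are tagged, even though no single rectangle could select them without also including `w1` or `w4`.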

To use this custom template in a data labeling job, you can adjust the instructions below (which assume you'll use the faster built-in template) as follows:

- Select task category 'Custom' > task type 'Custom', instead of 'Image > Bounding Box'
- For template body, copy the contents of the `*.liquid.html` file above (**NOT** the `*.liquid.tpl.html`, which has placeholders e.g. for the list of classes)
- In the tool configuration step, select the `SMGT-Pre` and `SMGT-Post` Lambda functions that have been created for you by the solution stack: These should appear in the drop-down options.

In practice, while it's important to explore how the bounding boxes are being interpreted, we recommend using the simpler built-in template for this walkthrough, to help you complete your data annotation faster.

## Annotate data

> ⏰ **If you're short on time**: You can skip the remaining steps in this notebook altogether.
>
> We've provided pre-prepared annotations for 100 pages in the `data/annotations` folder, to augment your work and help train an effective model faster. If needed, you can skip ahead to the next notebook and select **only** the `augmentation-*` datasets instead of labeling your own too. If you choose to do this, your model will likely be less accurate.

We're now ready to start annotating data, and will typically **iterate over multiple jobs** in this step to start small and then boost model accuracy.

To make incrementally adding to the dataset easy, we'll need to pay particular attention to:

- How we sample data for jobs, with good randomness but no repetition of previously-annotated pages
- How we collect our results to a single consolidated dataset

So let's follow through the steps:

### Collect a dataset

Here we will:

- **Shuffle** our data (in a *reproducible*/deterministic way), to ensure we annotate documents/pages from a range of providers - not just concentrating on the first provider/doc(s)
- **Exclude** any examples for which the page image has **already been labeled** in the `data/annotations` output folder
- **Stratify** the sample, to obtain a specific (boosted) proportion of first-page samples, since we observed the first pages of documents to often be most useful for the fields of interest.

In [None]:
# This cell just defines the necessary functions & constants:

# Keep this the same across the jobs:
annotations_base_s3uri = f"s3://{bucket_name}/{bucket_prefix}data/annotations"

def rel_path_from_s3uri(uri, key_base="data/imgs-clean/") -> str:
    """Extract e.g. 'subfolders/file' from 's3://bucket/.../{key_base}subfolders/file'"""
    return uri[len("s3://"):].partition("/")[2].partition(key_base)[2]


def get_preannotated_imgs(exclude_job_names=()) -> set:
    """Find the set of relative image paths that have already been annotated"""
    filepaths = set()  # Protect against introducing duplicates
    for job_folder in os.listdir("data/annotations"):
        if job_folder in exclude_job_names:
            logger.info(f"Skipping excluded job {job_folder}")
            continue
        manifest_file = os.path.join(
            "data",
            "annotations",
            job_folder,
            "manifests",
            "output",
            "output.manifest",
        )
        if not os.path.isfile(manifest_file):
            if os.path.isdir(os.path.join("data", "annotations", job_folder)):
                logger.warning(f"Skipping job {job_folder}: No output manifest at {manifest_file}")
            continue
        with open(manifest_file, "r") as f:
            filepaths.update([
                rel_path_from_s3uri(json.loads(l)["source-ref"])
                for l in f
            ])
    return filepaths


def select_examples(
    job_page_count,
    exclude_img_paths=frozenset(),
    job_first_page_pct=0.4,
):
    """Select a reproducibly shuffled, first-page-stratified list of pages for one job"""
    with open("data/pages-all-sample.manifest.jsonl", "r") as fmanifest:
        examples_all = [json.loads(l) for l in fmanifest]

    # Separate and shuffle the first vs non-first pages:
    examples_all_arefirsts = [l["page-num"] == 1 for l in examples_all]

    examples_firsts = [e for ix, e in enumerate(examples_all) if examples_all_arefirsts[ix]]
    examples_nonfirsts = [e for ix, e in enumerate(examples_all) if not examples_all_arefirsts[ix]]
    random.Random(1337).shuffle(examples_firsts)
    random.Random(1337).shuffle(examples_nonfirsts)

    # Exclude already-annotated images:
    filtered_firsts = [
        e for e in examples_firsts
        if rel_path_from_s3uri(e["source-ref"]) not in exclude_img_paths
    ]
    filtered_nonfirsts = [
        e for e in examples_nonfirsts
        if rel_path_from_s3uri(e["source-ref"]) not in exclude_img_paths
    ]
    print(
        f"Excluded {len(examples_firsts) - len(filtered_firsts)} first and "
        f"{len(examples_nonfirsts) - len(filtered_nonfirsts)} non-first pages"
    )
    
    # Draw from the filtered shuffled lists:
    n_first_pages = round(job_first_page_pct * job_page_count)
    n_nonfirst_pages = job_page_count - n_first_pages
    if n_first_pages > len(filtered_firsts):
        raise ValueError(
            "Unable to find enough first-page records to build manifest: Wanted "
            "{}, but only {} available from list after exclusions ({} before)".format(
                n_first_pages,
                len(filtered_firsts),
                len(examples_firsts),
            )
        )
    if n_nonfirst_pages > len(filtered_nonfirsts):
        raise ValueError(
            "Unable to find enough non-first-page records to build manifest: Wanted "
            "{}, but only {} available from list after exclusions ({} before)".format(
                n_nonfirst_pages,
                len(filtered_nonfirsts),
                len(examples_nonfirsts),
            )
        )
    print(f"Taking {n_first_pages} first pages and {n_nonfirst_pages} non-first pages.")
    selected = filtered_firsts[:n_first_pages] + filtered_nonfirsts[:n_nonfirst_pages]
    random.Random(1337).shuffle(selected)  # Shuffle again to avoid putting all 1stP at front
    return selected

To actually generate a new job input manifest, you just need to specify:

- A unique name for the job
- The number of examples (pages) you'll annotate
- The proportion of examples that should be first pages (e.g. 0.4 -> 40% of examples will be the first page of a document)
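
As a quick check of how the page budget gets split, the sketch below repeats the same arithmetic that `select_examples` uses. `split_job_pages` is a hypothetical helper, defined here only for illustration.

```python
def split_job_pages(job_page_count: int, job_first_page_pct: float):
    """Mirror the arithmetic in `select_examples` to split a job's page budget."""
    n_first_pages = round(job_first_page_pct * job_page_count)
    n_nonfirst_pages = job_page_count - n_first_pages
    return n_first_pages, n_nonfirst_pages


print(split_job_pages(20, 0.4))  # (8, 12)
print(split_job_pages(5, 0.5))  # (2, 3)
```

Note that Python 3's `round()` rounds exact halves to the nearest even integer, which is why the second example gives 2 first pages rather than 3.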

> ⚠️ **Warning:** If you've just completed an annotation job below, make sure you've `s3 sync`ed results back to the `data/annotations` folder - otherwise you'll set up a new job for the same pages again!

In [None]:
annotation_job_name = "cfpb-boxes-1"  # What will this job be called?
job_page_count = 20  # How many pages will we annotate?
job_first_page_pct = 0.4  # What proportion of pages should be first pages of a doc?


preannotated_img_paths = get_preannotated_imgs()
input_manifest_file = f"data/manifests/{annotation_job_name}.jsonl"
os.makedirs("data/manifests", exist_ok=True)
print(f"'{annotation_job_name}' saving to: {input_manifest_file}")
with open(input_manifest_file, "w") as f:
    for ix, example in enumerate(select_examples(
        job_page_count,
        exclude_img_paths=preannotated_img_paths,
        job_first_page_pct=job_first_page_pct,
    )):
        if ix < 3:
            print(example)
        elif ix == 3:
            print("...")
        f.write(json.dumps(example) + "\n")

In [None]:
input_manifest_s3uri = f"s3://{bucket_name}/{bucket_prefix}{input_manifest_file}"
!aws s3 cp $input_manifest_file $input_manifest_s3uri

### Create the labelling job

To minimize the risk of errors and get started quickly, we recommend creating your labeling job by running the utility function provided below.

This will set up a job with the default pre-built bounding box template (for faster annotation than the custom one we explored earlier):

In [None]:
print(f"Starting labeling job {annotation_job_name}\non data {input_manifest_s3uri}\n")
create_labeling_job_resp = util.smgt.create_bbox_labeling_job(
    annotation_job_name,
    bucket_name=bucket_name,
    execution_role_arn=role,
    fields=fields,
    input_manifest_s3uri=input_manifest_s3uri,
    output_s3uri=annotations_base_s3uri,
    workteam_arn=workteam_arn,
    # To create a review/adjustment job from a manifest with existing labels in:
    #reviewing_attribute_name="label",
    s3_inputs_prefix=f"{bucket_prefix}data/manifests",
)
print(f"\nLABELLING JOB STARTED:\n{create_labeling_job_resp['LabelingJobArn']}")

Alternatively, you can also explore creating the job through the [AWS Console for SageMaker Ground Truth](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-jobs) (check your AWS Region!) by clicking on *Create labeling job*:

- Leave (as default) the *label attribute name* the same as the *job name*
- Select **Manual data setup** and use:
  - The `input_manifest_s3uri` (`s3://[...].jsonl`) from above for the input location
  - The `annotations_base_s3uri` (`s3://[...]/data/annotations`) with **no trailing slash** for the output location
- Select or create any **SageMaker IAM execution role** that has access to the `bucket_name` we're using.
- For **task type**, select *Image > Bounding Box*
- On the second screen, be sure to use **worker type** *Private* and select the workteam we made earlier from the dropdown.
- For the built-in task type, you'll need to enter the **labels** manually exactly in the order that we defined them in this notebook.

The cell below prints out some of these values to help:

In [None]:
print(input_manifest_s3uri)
print(annotations_base_s3uri)
print(role)
print("\n".join(["\nLabels:", "-------"] + entity_classes))

### Label the data!

Now that the labeling job has been created, you'll see a new task for your user in the labeling portal (If you lost the portal link from your email, you can access it from the *Private* tab of the [SageMaker Ground Truth Workforces console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-workforces)).

> ⏰ SageMaker Ground Truth processes the job data in batches, so it might take a minute or two for the job to appear in your list.
>
> If it's taking a long time, you can:
>
> - Double-check the job in the [Labeling jobs page of the Console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-jobs) to see if it's failed to start due to some error
> - Check the job is set up for a workteam that you're a member of
> - Check your user is showing as *Verified* and *Enabled* (i.e. that you completed the email verification successfully) in the *Private* tab of the [Workforces console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-workforces)

▶️ Click **Start working** and annotate the examples until all are finished and you're returned to the portal homepage.

▶️ **Try to be as consistent as possible** in how you annotate the classes, because inconsistent annotations can significantly degrade final model accuracy. Refer to the guidance (in this notebook and the 'Full Instructions') that we applied when annotating the example set.

![](img/smgt-task-pending.png "Screenshot of SMGT labeling portal with pending task")

### Sync the results locally (and iterate?)

Once you've finished annotating and the job shows as "Complete" in the [SMGT Console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-jobs) (which **might take an extra minute or two**, while your annotations are consolidated), you can download the results here to the notebook via the cell below:

In [None]:
!aws s3 sync --quiet $annotations_base_s3uri ./data/annotations

You should see a subfolder created with the name of your annotation job, under which the **`manifests/output/output.manifest`** file contains the consolidated results of your labelling - again in the open JSON-Lines format.

▶️ **Check** your results appear as expected, and explore the file format.

> Because label outputs are in JSON-Lines, it's easy to consolidate, transform, and manipulate these results as required using open source tools!
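
For example, the sketch below counts how many boxes were drawn per class. It assumes the standard Ground Truth bounding box output layout, where each record carries the boxes under the label attribute name and a `class-map` under the matching `-metadata` key. It also assumes the label attribute name equals the job name, which may not hold for your job. `count_boxes_by_class` is a hypothetical helper, and the record and class names shown are made up for illustration.

```python
import json
from collections import Counter


def count_boxes_by_class(manifest_lines, label_attribute):
    """Count bounding boxes per class name across the records of an output manifest.

    Assumes each labeled record has `<label_attribute>` (with an `annotations` list) and
    `<label_attribute>-metadata` (with a `class-map` from class ID to class name).
    """
    counts = Counter()
    for line in manifest_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if label_attribute not in record:
            continue  # E.g. a page with no label recorded under this attribute
        class_map = record[f"{label_attribute}-metadata"]["class-map"]
        for box in record[label_attribute]["annotations"]:
            counts[class_map[str(box["class_id"])]] += 1
    return counts


# A made-up record in the assumed layout (class names are illustrative only):
example_line = json.dumps({
    "source-ref": "s3://example-bucket/data/imgs-clean/example.png",
    "cfpb-boxes-1": {
        "image_size": [{"width": 1000, "height": 1400, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 10, "top": 20, "width": 100, "height": 30},
            {"class_id": 0, "left": 10, "top": 60, "width": 80, "height": 30},
            {"class_id": 1, "left": 200, "top": 20, "width": 50, "height": 30},
        ],
    },
    "cfpb-boxes-1-metadata": {"class-map": {"0": "Fee", "1": "Rate"}},
})

counts = count_boxes_by_class([example_line], "cfpb-boxes-1")
print(dict(counts))
```

To run it on real results, you could pass an open file handle for your job's `manifests/output/output.manifest` in place of the example list.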

If you like, you can expand your dataset with **additional labelling jobs** by repeating these steps from [Collect a dataset](#Collect-a-dataset) down to here.

> ⚠️ Take care to set a different `annotation_job_name` each time, as these must be unique.

## Next Steps

In this notebook we set up the modelling objective, collected the project dataset, and annotated (perhaps multiple) sets of training data.

In the next, we'll consolidate these output manifests (together with some pre-prepared example data) and actually train/deploy our ML model.

So you can now open up **notebook [2. Model Training.ipynb](2.%20Model%20Training.ipynb)**, and follow along!