# Automated Pre-Processing Model Setup

> This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio

In this notebook, we'll show how you can use AWS SDKs to automatically set up a Rekognition Custom Labels model from the provided sample dataset.

For an alternative manual walkthrough, see the [README.md](README.md) in this same folder.

## Environment Preparation

First, in the cell below, we'll:

- **Import** the libraries we'll use in this notebook
- **Connect** to AWS services via the SDKs
- **Configure** our environment

You'll need to fill in the `PreprocessTrainingBucketName` created by your solution stack. You can find it in the **Outputs tab** of your stack, selected from the list in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home?#/stacks).

In [None]:
# Python Built-Ins:
from datetime import datetime
import json
import os
from zipfile import ZipFile

# External Dependencies:
import boto3  # The general-purpose AWS SDK for Python
import sagemaker  # Additional higher-level APIs for SageMaker

rekognition = boto3.client("rekognition")

training_bucket_name = None  # TODO: Set to something like "stack-name-preprocesstrainingbucket-abc123456"
assert training_bucket_name, "Please set training_bucket_name before continuing"
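
Rather than copying the bucket name by hand, the same stack output could be looked up through the CloudFormation API. The sketch below is a minimal example and not part of the original solution: `find_stack_output` is a hypothetical helper, and the stack name in the usage comment is a placeholder for your own.

```python
def find_stack_output(describe_stacks_response, output_key):
    """Return the value of one output from a CloudFormation describe_stacks response.

    (Hypothetical helper for illustration, not part of the solution code.)
    """
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output["OutputKey"] == output_key:
                return output["OutputValue"]
    raise KeyError(f"Stack output '{output_key}' not found")


# Hand-written response in the shape returned by describe_stacks:
example_response = {
    "Stacks": [
        {
            "Outputs": [
                {
                    "OutputKey": "PreprocessTrainingBucketName",
                    "OutputValue": "stack-name-preprocesstrainingbucket-abc123456",
                },
            ],
        },
    ],
}
print(find_stack_output(example_response, "PreprocessTrainingBucketName"))

# In the notebook, assuming your stack is called "my-stack-name" (placeholder):
# cfn = boto3.client("cloudformation")
# training_bucket_name = find_stack_output(
#     cfn.describe_stacks(StackName="my-stack-name"),
#     "PreprocessTrainingBucketName",
# )
```

This assumes the notebook's execution role is allowed to call `cloudformation:DescribeStacks`. If it isn't, copying the value from the console works just as well.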

## Step 1: Fetch the Labelled Data

The sample data is publicly available via Amazon S3 - with images already classified into 'good' and 'bad' sets:

In [None]:
!wget -P data -N https://public-asean-textract-demo-ap-southeast-1.s3-ap-southeast-1.amazonaws.com/receipts.zip

with ZipFile("data/receipts.zip", "r") as zip_ref:
    print("Unzipping...")
    zip_ref.extractall("data")
print("Done")
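
Before uploading anything, it can be worth checking that the archive unpacked as expected. The sketch below assumes the images end up in `data/bad` and `data/good` (the layout the manifest step relies on). `count_files_by_class` is a hypothetical helper name.

```python
import os


def count_files_by_class(data_dir, class_names=("bad", "good")):
    """Count the files found in each class sub-folder of data_dir."""
    counts = {}
    for class_name in class_names:
        class_dir = os.path.join(data_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = sum(
                1
                for name in os.listdir(class_dir)
                if os.path.isfile(os.path.join(class_dir, name))
            )
        else:
            # A missing folder is reported as zero rather than raising
            counts[class_name] = 0
    return counts


# In the notebook, after unzipping:
# print(count_files_by_class("data"))
```

A zero count for either class would suggest that the archive layout differs from what the later cells assume.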

## Step 2: Upload to Amazon S3

To use the images with Rekognition Custom Labels, we'll upload the decompressed files to Amazon S3 in the same AWS Region and Account where our solution is deployed:

In [None]:
!aws s3 sync --quiet ./data s3://$training_bucket_name/
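
If the AWS CLI isn't available in your environment, the upload could also be done with boto3. This is a minimal sketch with hypothetical helper names (`iter_upload_targets`, `upload_folder`). Unlike `aws s3 sync`, it re-uploads every file each time rather than skipping unchanged ones.

```python
import os


def iter_upload_targets(local_dir):
    """Yield (local_path, s3_key) pairs for every file under local_dir."""
    for folder, _subdirs, filenames in os.walk(local_dir):
        for filename in sorted(filenames):
            local_path = os.path.join(folder, filename)
            # S3 keys always use forward slashes, whatever the local OS uses
            key = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            yield local_path, key


def upload_folder(s3_client, local_dir, bucket_name):
    """Upload every file under local_dir, keeping relative paths as object keys."""
    for local_path, key in iter_upload_targets(local_dir):
        s3_client.upload_file(local_path, bucket_name, key)


# In the notebook:
# upload_folder(boto3.client("s3"), "data", training_bucket_name)
```

Keeping the relative path as the object key means `data/good/x.jpg` lands at `s3://<bucket>/good/x.jpg`, which is the layout the manifest's `source-ref` entries point to.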

## Step 3: Create a Manifest File

Our images have already been categorized into folders, so there's no need to manually re-label them using either [Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) or the Rekognition Custom Labels console.

Instead, we'll create a **manifest file** for our dataset as described [in the Rekognition developer guide](https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/cd-manifest-files-classification.html) - listing out each image and the corresponding annotation:

In [None]:
with open("data/receipts.manifest.jsonl", "w") as fmanifest:
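    # Class indices follow the tuple order: "bad" is 0 and "good" is 1
    for class_ix, class_name in enumerate(("bad", "good")):
        # Metadata in the SageMaker Ground Truth output format that Rekognition expects.
        # No labelling job actually ran, so the job name and date are placeholders.
        meta = {
            "class-name": class_name,
            "confidence": 0.0,
            "type": "groundtruth/image-classification",
            "job-name": "does-not-exist",
            "human-annotated": "yes",
            "creation-date": "2021-06-01T00:00:00.000000"
        }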
        for filename in os.listdir(os.path.join("data", class_name)):
            fmanifest.write(json.dumps({
                "source-ref": f"s3://{training_bucket_name}/{class_name}/{filename}",
                "label": class_ix,
                "label-metadata": meta,
            }) + "\n")
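
A quick way to sanity-check the generated file is to read it back and count the entries per class. The sketch below assumes the field names used in the cell above (`source-ref` and `label-metadata`). `summarize_manifest` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_manifest(manifest_path):
    """Count manifest entries per class name, checking each line as it goes."""
    counts = Counter()
    with open(manifest_path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Tolerate blank lines
            entry = json.loads(line)  # Each line must be a standalone JSON object
            if not entry["source-ref"].startswith("s3://"):
                raise ValueError(f"Line {line_no}: source-ref is not an S3 URI")
            counts[entry["label-metadata"]["class-name"]] += 1
    return dict(counts)


# In the notebook:
# print(summarize_manifest("data/receipts.manifest.jsonl"))
```

The per-class counts should match the number of files in each local folder.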

As with the images, the manifest itself needs to be uploaded to Amazon S3:

In [None]:
manifest_s3uri = f"s3://{training_bucket_name}/receipts.manifest.jsonl"

!aws s3 cp data/receipts.manifest.jsonl $manifest_s3uri
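
Rekognition's API takes S3 locations as separate bucket and key fields rather than as a single URI, so an `s3://` URI like the one above has to be split before it can be passed in. A small helper (hypothetical name `split_s3_uri`) makes that explicit:

```python
def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    return bucket, key


bucket, key = split_s3_uri("s3://example-bucket/receipts.manifest.jsonl")
print(bucket)  # example-bucket
print(key)  # receipts.manifest.jsonl
```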

## Step 4: Start Rekognition Custom Labels Training

With the annotated dataset now ready on Amazon S3 in a compatible format, we can create a **Project** in Rekognition, and start the process of training a model version ("project version"):

In [None]:
project_name = "receipts"

print(f"Creating Rekognition Custom Labels project '{project_name}'...")

create_project_resp = rekognition.create_project(
    ProjectName=project_name,
)
project_arn = create_project_resp["ProjectArn"]
create_project_resp

In [None]:
dataset_rek_asset = {
    "GroundTruthManifest": {
        "S3Object": {
            "Bucket": training_bucket_name,
            "Name": manifest_s3uri[len("s3://"):].partition("/")[2],
        },
    },
}

version_name = f"{datetime.now():%Y-%m-%d-%H-%M-%S}"
print(f"Starting model training for version '{version_name}'...")
create_project_version_resp = rekognition.create_project_version(
    ProjectArn=project_arn,
    VersionName=version_name,
    OutputConfig={
        "S3Bucket": training_bucket_name,
        "S3KeyPrefix": f"rekognition/{project_name}",
    },
    TrainingData={
        "Assets": [dataset_rek_asset]
    },
    TestingData={
        "Assets": [dataset_rek_asset],
    },
)

project_version_arn = create_project_version_resp["ProjectVersionArn"]
create_project_version_resp
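
The cell above passes the same manifest as both `TrainingData` and `TestingData`, so the evaluation metrics Rekognition reports are measured on images the model also trained on. One option for a more honest evaluation is to hold out part of the manifest. The following is a minimal sketch, assuming a simple random (not stratified) split. `split_manifest_lines` is a hypothetical helper.

```python
import random


def split_manifest_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle manifest lines reproducibly and split them into (train, test)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # Fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up manifest lines, just to show the split sizes:
example_lines = [
    f'{{"source-ref": "s3://example-bucket/good/{i}.jpg"}}' for i in range(10)
]
train_lines, test_lines = split_manifest_lines(example_lines)
print(len(train_lines), len(test_lines))  # 8 2
```

Each part would then be written to its own manifest file, uploaded to S3, and referenced by a separate asset in `TrainingData` and `TestingData`.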

In [None]:
print(f"Your model version ARN:\n{project_version_arn}")

## Step 5: Waiting and Model Deployment

The above step kicked off version training in the background - which will take some time to complete.

You can check the status in the Rekognition Custom Labels console, or instead wait for completion via boto3:

In [None]:
rekognition.get_waiter("project_version_training_completed").wait(
    ProjectArn=project_arn,
    VersionNames=[
        version_name,
    ],
    WaiterConfig={
        "Delay": 60,  # in seconds
        "MaxAttempts": 60 * 60 * 2,  # generous upper bound: 7200 polls x 60s = 5 days
    },
)
print("Project version training complete!")
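
The built-in waiter stays silent until it finishes. If you'd rather see progress, the same wait can be written as an explicit polling loop. This is a minimal sketch, not the solution's own code: `wait_for_version_status` is a hypothetical helper that takes the describe call as a function, so it can be tried out without touching AWS.

```python
import time


def wait_for_version_status(
    describe_fn, target_status, failure_statuses, delay=60, max_attempts=120, sleep=time.sleep
):
    """Poll describe_fn() until the first project version reaches target_status.

    describe_fn must return a dict shaped like a describe_project_versions response.
    """
    for attempt in range(1, max_attempts + 1):
        description = describe_fn()["ProjectVersionDescriptions"][0]
        status = description["Status"]
        print(f"Attempt {attempt}: {status}")
        if status == target_status:
            return description
        if status in failure_statuses:
            raise RuntimeError(f"{status}: {description.get('StatusMessage', '')}")
        sleep(delay)
    raise TimeoutError(f"Gave up after {max_attempts} attempts")


# In the notebook:
# wait_for_version_status(
#     lambda: rekognition.describe_project_versions(
#         ProjectArn=project_arn, VersionNames=[version_name]
#     ),
#     target_status="TRAINING_COMPLETED",
#     failure_statuses={"TRAINING_FAILED"},
# )
```

The defaults here (60 second delay, 120 attempts) give up after about two hours. Adjust them to suit how long your training runs take.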

In [None]:
rekognition.describe_project_versions(
    ProjectArn=project_arn,
    VersionNames=[
        version_name,
    ],
)
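
The full response is fairly verbose. The sketch below pulls out just the status and the overall F1 score. `summarize_project_version` is a hypothetical helper, and the example response is hand-written with made-up values to show the shape only.

```python
def summarize_project_version(describe_response):
    """Pull the status and F1 score out of a describe_project_versions response."""
    description = describe_response["ProjectVersionDescriptions"][0]
    # EvaluationResult may be absent, for example while training is still running
    evaluation = description.get("EvaluationResult", {})
    return {
        "status": description["Status"],
        "f1_score": evaluation.get("F1Score"),
    }


# Hand-written example (made-up values, not real results):
example_response = {
    "ProjectVersionDescriptions": [
        {
            "Status": "TRAINING_COMPLETED",
            "EvaluationResult": {"F1Score": 0.9},
        },
    ],
}
print(summarize_project_version(example_response))
# {'status': 'TRAINING_COMPLETED', 'f1_score': 0.9}
```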

When your model is trained, you can **connect it to your solution stack** as follows:

- In the [AWS SSM Parameter Store](https://console.aws.amazon.com/systems-manager/parameters/?&tab=Table) console, find the deployed stack's `RekognitionModelArn` parameter.
- **Edit** your parameter to set the *Value* as your model version ARN as displayed above.
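
The same update could be scripted with the SSM API instead of the console. This is a sketch under assumptions: the actual parameter name depends on your stack and is shown as a placeholder, the notebook's role would need permission to call `ssm:PutParameter`, and `set_model_arn_parameter` is a hypothetical helper.

```python
def set_model_arn_parameter(ssm_client, parameter_name, model_version_arn):
    """Overwrite an existing SSM parameter with the given model version ARN."""
    return ssm_client.put_parameter(
        Name=parameter_name,
        Value=model_version_arn,
        Overwrite=True,  # The parameter already exists, so replace its value
    )


# In the notebook, with the real parameter name copied from the console:
# set_model_arn_parameter(
#     boto3.client("ssm"),
#     "<your-RekognitionModelArn-parameter-name>",  # placeholder
#     project_version_arn,
# )
```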

This model is trained, but not yet deployed. At the moment, the solution Lambda will trigger deployment when first invoked, but its calls will still fail until deployment is complete. To avoid those initial failures, let's trigger deployment here from the notebook:

In [None]:
rekognition.start_project_version(
    ProjectVersionArn=project_version_arn,
    MinInferenceUnits=1,
)

As with training, this is an asynchronous operation and we have the option to wait until it's complete:

In [None]:
rekognition.get_waiter("project_version_running").wait(
    ProjectArn=project_arn,
    VersionNames=[
        version_name,
    ],
    WaiterConfig={
        "Delay": 30,  # in seconds
        "MaxAttempts": 40,
    },
)
print("Model deployed!")
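
Once the version is running, you could try a test prediction with `detect_custom_labels`. The sketch below shows one way to pick the most confident label from the response. `top_custom_label` is a hypothetical helper, the example response is hand-written with made-up confidences, and the image key in the usage comment is a placeholder.

```python
def top_custom_label(detect_response):
    """Return (name, confidence) of the most confident label, or None if there are none."""
    labels = detect_response.get("CustomLabels", [])
    if not labels:
        return None
    best = max(labels, key=lambda label: label["Confidence"])
    return best["Name"], best["Confidence"]


# Hand-written response with made-up confidences (percentages from 0 to 100):
example_response = {
    "CustomLabels": [
        {"Name": "good", "Confidence": 91.5},
        {"Name": "bad", "Confidence": 8.5},
    ],
}
print(top_custom_label(example_response))  # ('good', 91.5)

# In the notebook, with a real image key in place of the placeholder:
# response = rekognition.detect_custom_labels(
#     ProjectVersionArn=project_version_arn,
#     Image={"S3Object": {"Bucket": training_bucket_name, "Name": "good/<some-image>"}},
# )
# print(top_custom_label(response))
```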

## Clean-up

In Rekognition Custom Labels, inference [pricing](https://aws.amazon.com/rekognition/pricing/) is based on deployed capacity, not on the number of requests processed. So when you're done experimenting with your solution, be sure to stop the project version to avoid unnecessary charges:

In [None]:
rekognition.stop_project_version(
    ProjectVersionArn=project_version_arn,
)
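
If you have trained and started several versions while experimenting, it's easy to leave one running. The sketch below stops every version in a project that reports a `RUNNING` status. `stop_running_versions` is a hypothetical helper, and for brevity it only looks at the first page of `describe_project_versions` results.

```python
def stop_running_versions(rekognition_client, project_arn):
    """Stop every project version in the project that is currently RUNNING."""
    response = rekognition_client.describe_project_versions(ProjectArn=project_arn)
    stopped_arns = []
    for description in response["ProjectVersionDescriptions"]:
        if description["Status"] == "RUNNING":
            rekognition_client.stop_project_version(
                ProjectVersionArn=description["ProjectVersionArn"]
            )
            stopped_arns.append(description["ProjectVersionArn"])
    return stopped_arns


# In the notebook:
# print(stop_running_versions(rekognition, project_arn))
```

Stopping is asynchronous, so a version may report `STOPPING` for a while before it reaches `STOPPED`.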