# Create Labeling Job Demo

In this Jupyter Notebook, we will explore the workflow of labeling raw data with Ground Truth, which can later be used in SageMaker Training Jobs.

## Assumptions

Before we begin, we expect the following to exist and be accessible to the user of this notebook:
1. A workforce with a workteam for labeling data already exists. We will use the workteam for our labeling tasks. For this demo, all we need is the workteam's name. An administrator must create this workteam via the SageMaker Ground Truth API. If the specified workteam name does not exist, the last cell of the "Initial Setup" section prints a documentation link.

## Initial Setup
Replace the following variables with your workteam name, username, and MLSpace project so that this demo can use them for Labeling Job tags and for writing to a private dataset.

In [None]:
# Set the existing workteam name
workteam = 'CHANGE_ME_TO_WORKTEAM_NAME'

# MLSpace variables
username = 'CHANGE_ME_TO_CURRENT_USER_NAME'
project = 'CHANGE_ME_TO_CURRENT_PROJECT_NAME'
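
Because later cells fail in less obvious ways if a placeholder is left in place, a quick check can help. The following is a minimal sketch, not part of MLSpace: the helper name `find_unset_placeholders` is hypothetical, and the example values stand in for the `workteam`, `username`, and `project` variables set above.

```python
def find_unset_placeholders(settings):
    """Return the names of settings that are empty or still hold a CHANGE_ME placeholder."""
    return [name for name, value in settings.items()
            if not value or value.startswith("CHANGE_ME")]


# Example values for illustration; in the notebook you would pass
# {"workteam": workteam, "username": username, "project": project}.
example_settings = {
    "workteam": "my-workteam",
    "username": "CHANGE_ME_TO_CURRENT_USER_NAME",
    "project": "demo-project",
}
unset = find_unset_placeholders(example_settings)
if unset:
    print(f"Please set these variables before continuing: {', '.join(unset)}")
```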

Next, we'll import all the libraries we are going to use in this demo.

In [None]:
# import everything we'll use in this demo
import json
import os
import pickle
import time
import sagemaker

from matplotlib.image import imsave
from multiprocessing import Pool

# local helper function file for retrieving Lambda ARNs that would
# otherwise be provided by the SageMaker console
from groundtruth_utils import (
    get_documentation_domain,
    get_groundtruth_assets_domain,
    get_groundtruth_lambda_arn
)

Next, let's get the MLSpace parameters, choose a dataset location for the unlabeled data, and define the tags required to create the labeling job.

In [None]:
# Retrieve MLSpace-provided parameters
with open('/home/ec2-user/SageMaker/global-resources/notebook-params.json', 'r') as fp:
    sagemaker_parameters = json.load(fp)
data_bucket_name = sagemaker_parameters['pSMSDataBucketName']
kms_key_id = sagemaker_parameters['pSMSKMSKeyId']

# Pick the path for unlabeled data, relative to the data bucket root
# We will place objects in a private Dataset for the current user
unlabeled_dataset_key_prefix = f'private/{username}/datasets/mnistimages'
unlabeled_dataset_uri = f's3://{data_bucket_name}/{unlabeled_dataset_key_prefix}'

# Set up tags to allow our API calls to succeed
tags = [
    {
        "Key": "project",
        "Value": project
    },
    {
        "Key": "system",
        "Value": "MLSpace"
    },
    {
        "Key": "user",
        "Value": username
    }
]

Next, we'll get the session details so we can start making API calls to SageMaker.

In [None]:
# get role and service clients
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
boto_session = sagemaker_session.boto_session
region = sagemaker_session.boto_region_name
partition = boto_session.get_partition_for_region(region)
sagemaker_boto_client = boto_session.client(
    'sagemaker', region_name=region)
s3_boto_client = boto_session.client(
    's3', region_name=region)

# quick confirmation message
print(f'Session loaded. We are using role "{role}" in region "{region}".')

And one last step: let's validate that we have access to the workteam before using it in the labeling job.

In [None]:
# Print out workteam information just to make sure it exists
try:
    workteam_info = sagemaker_boto_client.describe_workteam(WorkteamName=workteam)
    workteam_arn = workteam_info["Workteam"]["WorkteamArn"]
    workteam_portal = f"https://{workteam_info['Workteam']['SubDomain']}"
    print(f"Workteam exists. Labeling portal can be found at: {workteam_portal}")
except Exception as err:  # avoid a bare except so KeyboardInterrupt still propagates
    print(f"Found an error when describing the workteam: {err}\n"
          + "Please have an administrator set up a workteam for you with the following documentation link:\n"
          + f"https://{get_documentation_domain(partition)}/sagemaker/latest/dg/sms-workforce-create-private-oidc.html")

## Data Preparation

Now, we'll grab the MNIST dataset from the SageMaker examples. It's already in a binary format, which we *don't* want for this demo, so we're going to convert everything to JPEG images and store them in S3.
So that we don't generate the dataset multiple times, we'll check whether a specific local directory exists; if it doesn't, we'll create everything. If you want to regenerate the dataset anyway, simply delete the "images" directory created by this notebook.

In [None]:
# for demo purposes, let's keep the job small so we can
# actually finish it in a reasonable amount of time.
# use 60000 for the full dataset.
num_images_in_labeling_job = 50

# check if we've already generated the dataset before
# assume we generated files already if we have the 'images' directory locally
should_generate_dataset = True
try:
    os.mkdir('images')
except FileExistsError:
    print('The "images" directory exists already. Skipping dataset generation.')
    should_generate_dataset = False

if should_generate_dataset:  # skip if we already generated the files
    s3_response = s3_boto_client.get_object(Bucket="sagemaker-sample-files",
                                            Key="datasets/image/MNIST/mnist.pkl")
    mnist_dataset = pickle.loads(s3_response["Body"].read())
    train_set = mnist_dataset[0]

    def save_and_upload_image(sample_number):
        local_filename = f"./images/{sample_number:05}.jpg"
        s3_key = f'{unlabeled_dataset_key_prefix}/raw_images/{sample_number:05}.jpg'
        imsave(local_filename, train_set[0][sample_number][0], cmap='gray')
        with open(local_filename, 'rb') as image_f:
            s3_boto_client.put_object(Bucket=data_bucket_name,
                                      Key=s3_key,
                                      Body=image_f)

    with Pool() as pool:
        # save and upload images in parallel to speed this step up
        pool.map(save_and_upload_image, range(num_images_in_labeling_job))

Now that we have the images saved and uploaded to S3, we can start a human labeling job. Even though MNIST is already annotated, this example will show how we can annotate it ourselves using Ground Truth. The following cell will generate a manifest file based on the number of images we uploaded to S3 earlier.

In [None]:
# Printing documentation link as separate line to get the dynamic link depending on partition
print(f"Documentation for the manifest format can be found here: https://{get_documentation_domain(partition)}/sagemaker/latest/dg/sms-input-data-input-manifest.html")

In [None]:
manifest_lines = [
    f'{{"source-ref":"{unlabeled_dataset_uri}/raw_images/{x:05}.jpg"}}\n' for x in range(num_images_in_labeling_job)
]

# Upload the manifest to S3
s3_boto_client.put_object(Bucket=data_bucket_name,
                          Key=f'{unlabeled_dataset_key_prefix}/mnist_images.manifest',
                          Body="".join(manifest_lines))
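
Before relying on a manifest, it can be worth confirming that every line parses as JSON and points where we expect. The following is a minimal sketch under stated assumptions: the helper `check_manifest_lines` is hypothetical, and the bucket and prefix are made-up stand-ins for `unlabeled_dataset_uri`.

```python
import json


def check_manifest_lines(lines, expected_prefix):
    """Parse JSON Lines manifest entries and return their source-ref URIs."""
    uris = []
    for line_number, line in enumerate(lines, start=1):
        entry = json.loads(line)  # raises ValueError if the line is not valid JSON
        uri = entry["source-ref"]
        if not uri.startswith(expected_prefix):
            raise ValueError(f"Line {line_number}: unexpected URI {uri}")
        uris.append(uri)
    return uris


# Hypothetical bucket and prefix, for illustration only
example_uri = "s3://example-bucket/private/example-user/datasets/mnistimages"
example_lines = [
    f'{{"source-ref":"{example_uri}/raw_images/{x:05}.jpg"}}\n' for x in range(3)
]
uris = check_manifest_lines(example_lines, example_uri)
print(uris[0])
```

In the notebook, the same check could be run as `check_manifest_lines(manifest_lines, unlabeled_dataset_uri)` before uploading.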

Next, we'll generate a list of labels for our dataset. Conveniently, the MNIST classes are just the digits 0-9. To allow the labels to match the Ground Truth Labeling Portal's keyboard shortcuts, we'll place the "0" label last.

In [None]:
# Generate labels configuration file
labels = [
    {"label": str(x+1)} for x in range(9)
]
labels.append({"label": "0"})  # 1-9, 0 to match keyboard shortcuts in GT Labeling Portal

labels_config = {
    "document-version": "2018-11-28",
    "labels": labels
}

# Upload label categories to S3
s3_boto_client.put_object(Bucket=data_bucket_name,
                          Key=f'{unlabeled_dataset_key_prefix}/label_categories.json',
                          Body=json.dumps(labels_config))
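
To see what the label categories file looks like without touching S3, the same structure can be rebuilt locally and inspected. This is only a sketch that mirrors the cell above; the uniqueness check is our own addition, not something the notebook requires.

```python
import json

# Rebuild the label list exactly as in the cell above: "1".."9", then "0"
labels = [{"label": str(x + 1)} for x in range(9)]
labels.append({"label": "0"})
labels_config = {"document-version": "2018-11-28", "labels": labels}

# Pull out the label names in order and make sure none are repeated
label_names = [entry["label"] for entry in labels_config["labels"]]
assert len(label_names) == len(set(label_names)), "duplicate labels found"

print(label_names)
print(json.dumps(labels_config))
```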

In [None]:
# Print as code cell for dynamic docs link based on partition
print(f"From the documentation here (https://{get_documentation_domain(partition)}/sagemaker/latest/dg/sms-image-classification.html), we need to upload an HTML page for our users to access.\n"
      + "Below is the page provided from the docs with modifications to make it work in different partitions, and we'll upload that to our S3 bucket")

In [None]:
labeling_portal_html = f'<script src="https://{get_groundtruth_assets_domain(region)}/crowd-html-elements.js"></script>'
labeling_portal_html += """
<crowd-form>
  <crowd-image-classifier
    name="crowd-image-classifier"
    src="{{ task.input.taskObject | grant_read_access }}"
    header="please classify"
    categories="{{ task.input.labels | to_json | escape }}"
  >
    <full-instructions header="Image classification instructions">
      <ol><li><strong>Read</strong> the task carefully and inspect the image.</li>
      <li><strong>Read</strong> the options and review the examples provided to understand more about the labels.</li>
      <li><strong>Choose</strong> the appropriate label that best suits the image.</li></ol>
    </full-instructions>
    <short-instructions>
      <h3><span style="color: rgb(0, 138, 0);">Good example</span></h3>
      <p>Enter description to explain the correct label to the workers</p>
      <h3><span style="color: rgb(230, 0, 0);">Bad example</span></h3><p>Enter description of an incorrect label</p>
    </short-instructions>
  </crowd-image-classifier>
</crowd-form>
"""

s3_boto_client.put_object(Bucket=data_bucket_name,
                          Key=f'{unlabeled_dataset_key_prefix}/task_template.html',
                          Body=labeling_portal_html)

## Starting the Labeling Job

Now that we've prepared the data and supporting configuration files, we can create the labeling job. The following cell will define all the parameters that we need for this demonstration. The included file "groundtruth_utils.py" provides a convenience function for getting Ground Truth-provided Lambda ARNs needed in the CreateLabelingJob API call.

In [None]:
labeling_job_name = f"mlspace-gt-labeling-demo-{project}-{username}-{int(time.time())}"
labeling_job_params = {
    "LabelingJobName": labeling_job_name,
    "LabelAttributeName": "label",
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": f"{unlabeled_dataset_uri}/mnist_images.manifest"
            }
        }
    },
    "OutputConfig": {
        "KmsKeyId": kms_key_id,
        "S3OutputPath": f"{unlabeled_dataset_uri}/labeling_job_output/"
    },
    "RoleArn": role,
    "LabelCategoryConfigS3Uri": f"{unlabeled_dataset_uri}/label_categories.json",
    "HumanTaskConfig": {
        "WorkteamArn": workteam_arn,
        "PreHumanTaskLambdaArn": get_groundtruth_lambda_arn("PRE", "ImageMultiClass", boto_session),
        "UiConfig": {
            "UiTemplateS3Uri": f"{unlabeled_dataset_uri}/task_template.html"
        },
        "TaskTitle": "MLSpace Demo Labeling Job",
        "TaskDescription": "Example demo for labeling hand-written digits",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": get_groundtruth_lambda_arn("ACS", "ImageMultiClass", boto_session)
        }
    },
    "Tags": tags
}
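
The job name above embeds the project and username, so it can grow long or pick up characters such as underscores or dots. To our understanding of the CreateLabelingJob API reference, `LabelingJobName` is limited to 63 characters and may contain only alphanumeric characters with hyphens between them; treat that constraint as an assumption and confirm it against the current documentation. A minimal pre-flight check might look like the following sketch (the helper `is_valid_job_name` is hypothetical, not part of MLSpace or boto3):

```python
import re

# Assumed constraint from the CreateLabelingJob API reference:
# 1-63 characters, alphanumeric, hyphens allowed only between characters.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")
MAX_JOB_NAME_LENGTH = 63


def is_valid_job_name(name):
    """Return True if the name fits the assumed length and character rules."""
    if len(name) > MAX_JOB_NAME_LENGTH:
        return False
    return JOB_NAME_PATTERN.fullmatch(name) is not None


# Illustrative names only
for candidate in ["mlspace-gt-labeling-demo-proj-user-1700000000",
                  "mlspace-gt-labeling-demo-proj-john_doe-1700000000"]:
    print(candidate, is_valid_job_name(candidate))
```

If a name fails the check, shortening the prefix or replacing unsupported characters in the username before building `labeling_job_name` is one option.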

In [None]:
# And finally, start the labeling job!
resp = sagemaker_boto_client.create_labeling_job(**labeling_job_params)

With the job submitted, users within the workteam can access the labeling job from the following URL. Once the job finishes or fails, it will disappear from the portal. To access the labeling portal, members of the workteam must sign in with the identity provider that was used to create the workteam.

In [None]:
print(f'To access available labeling jobs, click here: {workteam_portal}')

## Describing the Job
After submitting the job, the following cell will print out information about the job, such as job status, number of objects to label, and number of objects already labeled. You can rerun it multiple times to get an updated status report on the job.

In [None]:
# Rerun this cell to describe the job as it progresses
sagemaker_boto_client.describe_labeling_job(LabelingJobName=labeling_job_name)
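
Instead of rerunning the cell by hand, the status check can be wrapped in a small polling loop. The following is a sketch under stated assumptions: the helper names are hypothetical, and it relies on the `LabelingJobStatus` and `LabelCounters` fields of the DescribeLabelingJob response. The demo at the bottom uses fake responses so it runs without AWS access.

```python
import time

# Statuses after which the job will not make further progress
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def summarize_labeling_job(description):
    """Build a one-line progress summary from a DescribeLabelingJob response."""
    counters = description.get("LabelCounters", {})
    labeled = counters.get("TotalLabeled", 0)
    unlabeled = counters.get("Unlabeled", 0)
    status = description["LabelingJobStatus"]
    return f"{status}: {labeled} labeled, {unlabeled} remaining"


def wait_for_labeling_job(describe, poll_seconds=60, max_polls=120):
    """Call describe() until the job reaches a terminal status or polls run out.

    Returns the last response seen, or None if max_polls is 0.
    """
    description = None
    for _ in range(max_polls):
        description = describe()
        print(summarize_labeling_job(description))
        if description["LabelingJobStatus"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    return description


# Demo with fake responses; no AWS calls are made here
fake_responses = iter([
    {"LabelingJobStatus": "InProgress",
     "LabelCounters": {"TotalLabeled": 10, "Unlabeled": 40}},
    {"LabelingJobStatus": "Completed",
     "LabelCounters": {"TotalLabeled": 50, "Unlabeled": 0}},
])
final = wait_for_labeling_job(lambda: next(fake_responses), poll_seconds=0)
```

In the notebook, the callable would be something like `lambda: sagemaker_boto_client.describe_labeling_job(LabelingJobName=labeling_job_name)`. Human labeling can take a long time, so a generous `poll_seconds` is usually sensible.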