# Managing Quality Training Sets at Scale with SageMaker Ground Truth

# Introduction

In this lab you'll experience the end-to-end process of managing a quality training set at scale with Amazon SageMaker Ground Truth.

The purpose of this notebook is to select a sample of 10 bird images from Google's Open Images Dataset and copy them to your S3 bucket. The remainder of the lab is driven through the AWS console. The lab guide can be downloaded from [here](https://github.com/dylan-tong-aws/aws-cv-jumpstarter/blob/master/notebooks/lab3-gluoncv-on-sagemaker.ipynb).

---

In the cell below, set BUCKET to the name of your own bucket, replacing <<PROVIDE YOUR BUCKET NAME HERE>>.

In [1]:
import sagemaker
import os
from collections import defaultdict
import json
import boto3

BUCKET = 'dtong-cv-jumpstarter-workshop'
#BUCKET = '<<PROVIDE YOUR BUCKET NAME HERE>>'
S3_PREFIX = 'ground-truth-lab' # Any valid S3 prefix.

# Make sure the bucket is in the same region as this notebook.
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name
s3 = boto3.client('s3')
bucket_region = s3.head_bucket(Bucket=BUCKET)['ResponseMetadata']['HTTPHeaders']['x-amz-bucket-region']

assert bucket_region == region, "Your S3 bucket {} and this notebook need to be in the same region.".format(BUCKET)
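
The cell above reads the bucket's region from the `x-amz-bucket-region` response header. As an alternative, a minimal sketch of the same check could use the S3 `get_bucket_location` call, which reports `None` for buckets in us-east-1 and the legacy value `EU` for some older eu-west-1 buckets. The helper name `normalize_bucket_region` is hypothetical and is not part of the lab.

```python
def normalize_bucket_region(location_constraint):
    """Map an S3 LocationConstraint value to a region name.

    Hypothetical helper for illustration only.
    """
    if location_constraint is None:
        # S3 reports no LocationConstraint for buckets in us-east-1.
        return 'us-east-1'
    if location_constraint == 'EU':
        # Legacy value that S3 may report for eu-west-1 buckets.
        return 'eu-west-1'
    return location_constraint

# Possible use with the client and names from the cell above:
# location = s3.get_bucket_location(Bucket=BUCKET)['LocationConstraint']
# assert normalize_bucket_region(location) == region
```

The header-based check in the lab avoids this normalization step, which is one reason to prefer it when the header is available.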

The variable N_IMGS is set to 10 by default, so the script that follows retrieves 10 images of birds. We keep the number of images in this lab small so that the annotation tasks can be completed within the time available for the workshop.

In [2]:
N_IMGS = 10

Run the script below to select N_IMGS bird images from Google's Open Images Dataset and upload them to your S3 bucket.
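
Before running the full script, it may help to see its two selection steps on toy data: walking the label hierarchy to collect a class and all of its subclasses, then keeping only the annotation rows whose label falls in that set. The sketch below is a simplified illustration, not the lab's code. Every label ID other than '/m/015p6', the image IDs, and the CSV header names are made up, and the function names are hypothetical. Unlike the lab script, it keeps scanning after the image limit is reached so that every box for an already selected image is collected.

```python
import csv
import io
from collections import defaultdict

# Toy hierarchy in the nested shape the script reads: each node has a
# 'LabelName' and an optional 'Subcategory' list. IDs other than
# '/m/015p6' are invented for illustration.
TOY_HIERARCHY = {
    'LabelName': '/m/root',
    'Subcategory': [
        {'LabelName': '/m/015p6',
         'Subcategory': [{'LabelName': '/m/toy_swan'},
                         {'LabelName': '/m/toy_pigeon'}]},
        {'LabelName': '/m/toy_car'},
    ],
}

# Toy annotation rows; the header names here are illustrative.
TOY_CSV = """ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax
img_a,xclick,/m/toy_swan,1,0.1,0.5,0.2,0.6
img_b,xclick,/m/toy_car,1,0.0,0.9,0.0,0.9
img_a,xclick,/m/015p6,1,0.6,0.9,0.1,0.4
img_c,xclick,/m/toy_pigeon,1,0.3,0.7,0.3,0.7
"""

def collect_subclasses(node, class_id, inside=False):
    """Return class_id plus every label nested beneath it."""
    inside = inside or node['LabelName'] == class_id
    found = {node['LabelName']} if inside else set()
    for child in node.get('Subcategory', []):
        found |= collect_subclasses(child, class_id, inside)
    return found

def filter_boxes(csv_text, wanted, class_name, max_images):
    """Group bounding boxes by image, keeping at most max_images images."""
    boxes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row['LabelName'] not in wanted:
            continue
        img_id = row['ImageID']
        # Skip rows that would start a new image once the limit is reached.
        if img_id not in boxes and len(boxes) >= max_images:
            continue
        boxes[img_id].append(
            [class_name, row['XMin'], row['XMax'], row['YMin'], row['YMax']])
    return boxes

bird_labels = collect_subclasses(TOY_HIERARCHY, '/m/015p6')
bird_boxes = filter_boxes(TOY_CSV, bird_labels, 'Bird', max_images=2)
print(sorted(bird_labels))
print(dict(bird_boxes))
```

Returning the set from the recursive function, rather than updating a module-level variable, makes the traversal easy to test on small inputs like this one.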

In [10]:
import math

# Download and process the Open Images annotations.
!wget https://storage.googleapis.com/openimages/2018_04/test/test-annotations-bbox.csv
!wget https://storage.googleapis.com/openimages/2018_04/bbox_labels_600_hierarchy.json
    
with open('bbox_labels_600_hierarchy.json', 'r') as f:
    hierarchy = json.load(f)
    
CLASS_NAME = 'Bird'
CLASS_ID = '/m/015p6'

# Find all the subclasses of the desired image class (e.g. 'swans' and 'pigeons' etc if CLASS_NAME=='Bird').
good_subclasses = set()
def get_all_subclasses(hierarchy, good_subtree=False):
    if hierarchy['LabelName'] == CLASS_ID:
        good_subtree = True
    if good_subtree:
        good_subclasses.add(hierarchy['LabelName'])
    if 'Subcategory' in hierarchy:            
        for subcat in hierarchy['Subcategory']:
            get_all_subclasses(subcat, good_subtree=good_subtree)
    return good_subclasses
good_subclasses = get_all_subclasses(hierarchy)
    
fids2bbs = defaultdict(list)
# Skip images with risky content.
skip_these_images = ['251d4c429f6f9c39', 
                    '065ad49f98157c8d']

with open('test-annotations-bbox.csv', 'r') as f:
    next(f)  # Skip the CSV header row.
    for line in f:
        line = line.strip().split(',')
        # Column order: image ID, source, label ID, confidence, then box coordinates.
        img_id, _, cls_id, conf, xmin, xmax, ymin, ymax, *_ = line
        if img_id in skip_these_images:
            continue
        if cls_id in good_subclasses:
            fids2bbs[img_id].append([CLASS_NAME, xmin, xmax, ymin, ymax])
            if len(fids2bbs) == N_IMGS:
                break

# Copy the images to your S3 bucket.
s3 = boto3.client('s3')
for img_id_id, img_id in enumerate(fids2bbs.keys()):
    # Report progress roughly every 10%; max() avoids a zero divisor when N_IMGS < 10.
    if img_id_id % max(1, N_IMGS // 10) == 0:
        print('Copying image {} / {}'.format(img_id_id, N_IMGS))
    copy_source = {
        'Bucket': 'open-images-dataset',
        'Key': 'test/{}.jpg'.format(img_id)
    }
    s3.copy(copy_source, BUCKET, '{}/images/{}.jpg'.format(S3_PREFIX, img_id))
print('Done!')

# Create and upload the input manifest.
manifest_name = 'input.manifest'
with open(manifest_name, 'w') as f:
    for img_id in fids2bbs:
        img_path = 's3://{}/{}/images/{}.jpg'.format(BUCKET, S3_PREFIX, img_id)
        # One JSON object per line, each pointing at an image in S3.
        f.write(json.dumps({'source-ref': img_path}) + '\n')
s3.upload_file(manifest_name, BUCKET, S3_PREFIX + '/' + manifest_name)
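
The input manifest written above contains one JSON object per line, each with a "source-ref" key holding the S3 URI of an image. As a rough sketch of that format, the snippet below builds a few manifest lines and parses them back to confirm that each line is valid JSON. The bucket name 'example-bucket', the image IDs, and the function name `manifest_lines` are placeholders for illustration.

```python
import json

def manifest_lines(bucket, prefix, image_ids):
    """Yield one JSON manifest line per image ID (hypothetical helper)."""
    for image_id in image_ids:
        uri = 's3://{}/{}/images/{}.jpg'.format(bucket, prefix, image_id)
        yield json.dumps({'source-ref': uri})

# Placeholder bucket and image IDs, not real resources.
lines = list(manifest_lines('example-bucket', 'ground-truth-lab',
                            ['img_a', 'img_b']))

# Each line should parse on its own and point at an S3 object.
for line in lines:
    record = json.loads(line)
    assert record['source-ref'].startswith('s3://')

print('\n'.join(lines))
```

Parsing each line separately is a quick way to catch a malformed manifest locally before it is uploaded.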

--2019-05-23 01:25:44--  https://storage.googleapis.com/openimages/2018_04/test/test-annotations-bbox.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.164.112, 2607:f8b0:400a:803::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.164.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52174204 (50M) [text/csv]
Saving to: ‘test-annotations-bbox.csv.3’


2019-05-23 01:25:46 (50.1 MB/s) - ‘test-annotations-bbox.csv.3’ saved [52174204/52174204]

--2019-05-23 01:25:46--  https://storage.googleapis.com/openimages/2018_04/bbox_labels_600_hierarchy.json
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.194.176, 2607:f8b0:400a:804::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.194.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86291 (84K) [text/csv]
Saving to: ‘bbox_labels_600_hierarchy.json.3’


2019-05-23 01:25:46 (1.29 MB/s) - ‘bbox_labels_60