# Environment setup
To start, we'll set the S3 bucket name, download the training data and upload it to S3, and push the customized training container to Amazon Elastic Container Registry (ECR).

If you don't have an S3 bucket to use, create one now and note its name.

## Set basic parameters
Set up the environment with the required modules. You will need to __change the bucket name__ below to the one you created above.

In [None]:
%%time
import sys
sys.path.append('/home/ec2-user/anaconda3/lib/python3.6/site-packages/')

import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

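bucket = 'cnidus-ml-iad'  # customize to your bucket

# Map each region to the training image to use there; the key must match the
# region this notebook runs in.
# NOTE: the 'us-east-1' entry below points at an ECR registry in us-west-2.
# Check that the URI matches the account and region you actually train in.
#containers = {'us-west-2': '107995894928.dkr.ecr.us-west-2.amazonaws.com/object-detection'}
containers = {'us-east-1': '366895301435.dkr.ecr.us-west-2.amazonaws.com/object-detection'}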
training_image = containers[boto3.Session().region_name]
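
The dictionary lookup above raises a bare `KeyError` when the notebook runs in a region that has no entry. A minimal sketch of a friendlier lookup follows; the helper name `select_training_image`, the account ID `123456789012`, and the example dictionary are hypothetical placeholders, not part of the original notebook.

```python
def select_training_image(containers, region):
    """Return the image URI registered for `region`, or raise a readable error."""
    try:
        return containers[region]
    except KeyError:
        known = ", ".join(sorted(containers)) or "none"
        raise ValueError(
            "No training image configured for region {!r}; known regions: {}".format(region, known)
        ) from None


# Hypothetical placeholder account ID and repository, for illustration only.
example_containers = {
    "us-east-1": "123456789012.dkr.ecr.us-east-1.amazonaws.com/object-detection"
}
print(select_training_image(example_containers, "us-east-1"))
```

In the notebook, this could be called as `select_training_image(containers, boto3.Session().region_name)`.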

## Upload training data to S3
The next step is to download a public training dataset, format it for our model, and upload it to S3.

For this example, we're using the ["pets" dataset](http://www.robots.ox.ac.uk/~vgg/data/pets/) from Oxford University.

### Download the training sets and dataset tools

In [None]:
import requests
import os
data_dir = "./data/"
tools_dir = "./object_detection/"

URLList = [
    {'src': 'http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz','dst': data_dir},
    {'src': 'http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz', 'dst': data_dir},
    {'src': 'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/pet_label_map.pbtxt', 'dst': tools_dir},
    {'src': 'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/dataset_tools/create_pet_tf_record.py', 'dst': './'},
    # tools_dir already ends with a slash, so append the subdirectory without a leading one
    {'src': 'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/utils/dataset_util.py', 'dst': tools_dir + 'utils/'},
    {'src': 'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/utils/label_map_util.py', 'dst': tools_dir + 'utils/'},
    {'src': 'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/protos/string_int_label_map.proto', 'dst': tools_dir + 'protos/'}
]

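# Download each file
for URL in URLList:
    # Create the destination directory if it doesn't exist
    os.makedirs(URL['dst'], exist_ok=True)
    fname = os.path.join(URL['dst'], URL['src'].split("/")[-1])
    print("Downloading: " + URL['src'] + " to: " + fname)
    # Stream the response to disk in chunks so large archives are not held in memory
    r = requests.get(URL['src'], stream=True)
    r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    with open(fname, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)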

print('\n' + 'Finished downloading training dataset files')


### Extract the training set and convert it to TFRecord format
The TensorFlow Object Detection API expects data in the TFRecord format, so we'll run the `create_pet_tf_record` script to convert the raw Oxford-IIIT Pet dataset into TFRecords.

First, let's extract the training sets.

In [None]:
tfr_dir = './tfrecord/'

import tarfile
fileList = [
    './data/images.tar.gz',
    './data/annotations.tar.gz'
    ]

for file in fileList:
    print('Extracting: ' + file)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(file) as tar:
        tar.extractall(data_dir)
    print('Finished')

# Create the TFRecord output directory if it doesn't exist
os.makedirs(tfr_dir, exist_ok=True)

print('\n'+ 'Done!')

Now run the conversion script:

In [None]:
%%bash

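# Run from the notebook's working directory, where create_pet_tf_record.py was downloaded.
# The script imports object_detection.utils, which is why the helper files live under ./object_detection/.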
python3 ./create_pet_tf_record.py \
    --label_map_path=./object_detection/pet_label_map.pbtxt \
    --data_dir=./data/ \
    --output_dir=./tfrecord/
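
The training job configured later reads its input from `s3://<bucket>/pet_detection_data/tf_record`, so the generated TFRecord files still have to be uploaded there. The following is a minimal sketch of that upload, not the original author's code: the helper names `build_upload_plan` and `upload_directory` are hypothetical, and it assumes the conversion script wrote its output to `./tfrecord/`.

```python
import os


def build_upload_plan(local_dir, key_prefix):
    """Pair every file under local_dir with the S3 key it would be uploaded to."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Keep the layout below local_dir and use forward slashes in the key
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            plan.append((path, key_prefix.rstrip("/") + "/" + rel))
    return plan


def upload_directory(bucket_name, local_dir, key_prefix):
    """Upload every planned file with boto3's S3 client."""
    import boto3  # imported here so the planning helper works without AWS access

    s3_client = boto3.client("s3")
    for path, key in build_upload_plan(local_dir, key_prefix):
        print("Uploading {} to s3://{}/{}".format(path, bucket_name, key))
        s3_client.upload_file(path, bucket_name, key)


# In the notebook this would be called as:
# upload_directory(bucket, "./tfrecord/", "pet_detection_data/tf_record")
```

Separating the plan from the upload makes it possible to inspect which keys would be written before touching the bucket.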

## Upload the customized container to ECR

### Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. 

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it.

In [None]:
%%sh

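# The name of our algorithm, also used as the ECR repository name.
# NOTE: this name and the decision_trees/ paths below appear to come from a
# decision-trees sample container; change them to match your own container and
# the repository referenced by training_image.
algorithm_name=decision-trees-sample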

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

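# Get the login command from ECR and execute it directly.
# NOTE: `aws ecr get-login` exists only in AWS CLI v1. With AWS CLI v2, pipe
# `aws ecr get-login-password` into `docker login --username AWS --password-stdin` instead.
$(aws ecr get-login --region ${region} --no-include-email)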

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

# On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
# to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
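
The shell script assembles the full image name from the account ID, region, and repository name. The same URI is what a training job needs as its `TrainingImage`, so it can be useful to rebuild it in Python. This is a sketch under assumptions: the helper names are hypothetical, and `123456789012` is a placeholder account ID.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the ECR image URI in the same format as the shell variable `fullname`."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repository, tag)


def current_account_and_region():
    """Look up the caller's account ID and default region (requires AWS credentials)."""
    import boto3  # imported here so ecr_image_uri works without AWS access

    session = boto3.Session()
    account_id = session.client("sts").get_caller_identity()["Account"]
    return account_id, session.region_name


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "decision-trees-sample"))
```

With credentials available, `ecr_image_uri(*current_account_and_region(), "decision-trees-sample")` would yield the URI of the image pushed above.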

# Training

In [None]:
# For this demo, limit the training run to 10 minutes (600 seconds).
max_run_time = 600

Run the training job using the Amazon SageMaker `CreateTrainingJob` API.

In [None]:
%%time
import time
import boto3
from time import gmtime, strftime


s3 = boto3.client('s3')
# create unique job name 
job_name_prefix = 'object-detection-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_run_time": str(max_run_time) # after this time training job will terminate itself
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 20*60 # 20 minutes. After this sagemaker will stop training
    },
    # Training data should be inside a subdirectory called "train".
    # Validation data should be inside a subdirectory called "validation".
    # The algorithm currently supports only the FullyReplicated distribution type (the data is copied onto each machine).
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/pet_detection_data/tf_record'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
#             "ContentType": "application/x-recordio",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

In [None]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception as error:
    # The waiter raises if the job fails or the wait times out
    print('Waiting for the training job failed: {}'.format(error))
    # FailureReason is only present when the job itself has failed
    message = sagemaker.describe_training_job(TrainingJobName=job_name).get('FailureReason', 'not reported')
    print('Training failed with the following error: {}'.format(message))

In [None]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
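
Instead of re-running the status cell by hand, the job can be polled until it reaches a terminal state. The sketch below is one possible approach, not the notebook's own method: `wait_for_training_job` is a hypothetical helper that takes the `describe_training_job` callable as an argument, which also lets it be exercised without AWS access.

```python
import time

# Statuses after which a SageMaker training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal status, then return its description."""
    while True:
        info = describe(TrainingJobName=job_name)
        status = info["TrainingJobStatus"]
        print("{}: {} ({})".format(job_name, status, info.get("SecondaryStatus", "n/a")))
        if status in TERMINAL_STATUSES:
            if status == "Failed":
                # FailureReason is only present for failed jobs
                print("Failure reason: {}".format(info.get("FailureReason", "not reported")))
            return info
        sleep(poll_seconds)


# In the notebook this would be called as:
# training_info = wait_for_training_job(sagemaker.describe_training_job, job_name)
```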