# Environment setup
To start, we'll set the S3 bucket name, upload the training data to S3, and upload the customized training container to Elastic Container Registry (ECR).

If you don't have an S3 bucket to use, please go set one up now and note down the bucket name.
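
Before creating the bucket, it can help to check the name you plan to use. The following is a minimal sketch that covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starts and ends with a letter or digit; no adjacent dots; not shaped like an IP address). The helper name `is_valid_bucket_name` and the sample name `my-ml-bucket` are made up for illustration and are not part of this notebook.

```python
import re

# Basic shape: 3-63 chars, lowercase letters, digits, dots, hyphens,
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$')
# Names formatted like an IPv4 address are rejected.
_IP_RE = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')


def is_valid_bucket_name(name):
    """Return True if `name` passes the basic naming checks listed above."""
    if not _BUCKET_RE.match(name):
        return False
    if '..' in name:  # adjacent dots are not allowed
        return False
    if _IP_RE.match(name):
        return False
    return True


# 'my-ml-bucket' is a hypothetical bucket name
print(is_valid_bucket_name('my-ml-bucket'))
```

This check runs locally only. It does not tell you whether the name is already taken, which you find out when you create the bucket.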

In [2]:
%%bash 
sudo pip install clint





## Set basic parameters
Set up the environment with the required modules. You will need to __change the bucket name__ below to the one you created above.

In [3]:
%%time
import sys
sys.path.append('/home/ec2-user/anaconda3/lib/python3.6/site-packages/')

import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

bucket='cnidus-ml-iad' # customize to your bucket

# Set the training container image for your target region
#containers = {'us-west-2': '107995894928.dkr.ecr.us-west-2.amazonaws.com/object-detection'}
containers = {'us-east-1': '366895301435.dkr.ecr.us-west-2.amazonaws.com/object-detection'}
training_image = containers[boto3.Session().region_name]

CPU times: user 260 ms, sys: 120 ms, total: 380 ms
Wall time: 448 ms
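
The lookup `containers[boto3.Session().region_name]` raises a bare `KeyError` when the notebook runs in a region that has no entry in `containers`. A small sketch of a friendlier lookup is shown below. The function name `resolve_training_image` and the image URI in the example are placeholders for illustration, not values from this notebook.

```python
def resolve_training_image(containers, region):
    """Return the image URI configured for `region`, or raise a descriptive error."""
    try:
        return containers[region]
    except KeyError:
        known = ', '.join(sorted(containers)) or 'none'
        raise ValueError(
            'No training image configured for region {!r}; '
            'configured regions: {}'.format(region, known)
        )


# Placeholder mapping; replace the account ID and repository with your own
example_containers = {
    'us-east-1': '123456789012.dkr.ecr.us-east-1.amazonaws.com/object-detection'
}
print(resolve_training_image(example_containers, 'us-east-1'))
```

In the notebook you would call it as `resolve_training_image(containers, boto3.Session().region_name)`.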


## Upload training data to S3
The next step is to download a public training dataset, format it appropriately for our model, and upload it to S3.

For this example, we're using the ["pets" dataset](http://www.robots.ox.ac.uk/~vgg/data/pets/) from Oxford University.

In [None]:
%%bash
#From tensorflow/models/research/
wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
tar -xvf annotations.tar.gz
tar -xvf images.tar.gz
python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/pet_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd`

### Download the training sets

In [17]:
#from clint.textui import progress
import requests

URLList = [
#    'http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz',
#    'http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz',
    'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/pet_label_map.pbtxt',
    'https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/dataset_tools/create_pet_tf_record.py'  
]

# Download each file into the current directory
for URL in URLList:
    print("Downloading: " + str(URL))
    fname = URL.split("/")[-1]  # use the last path segment as the local file name
    print(fname)
    r = requests.get(URL)
    r.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(fname, 'wb') as f:
        f.write(r.content)

print("Finished downloading training dataset files")


Downloading: https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/pet_label_map.pbtxt
pet_label_map.pbtxt
Downloading: https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/dataset_tools/create_pet_tf_record.py
create_pet_tf_record.py
Finished downloading training dataset files
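
Once the TFRecord files have been generated, they still need to be copied to S3. The training job later in this notebook reads from `s3://<bucket>/pet_detection_data/tf_record`, so a minimal upload sketch could look like the following. It assumes the record files sit in the current directory and match `*.record*`. That pattern and the helper name `upload_tf_records` are assumptions, so adjust them to whatever `create_pet_tf_record.py` actually produced.

```python
import glob
import os


def upload_tf_records(s3_client, bucket, local_dir='.',
                      prefix='pet_detection_data/tf_record'):
    """Upload every *.record* file in local_dir to s3://bucket/prefix/.

    Returns the list of S3 keys that were written.
    """
    keys = []
    for path in sorted(glob.glob(os.path.join(local_dir, '*.record*'))):
        key = '{}/{}'.format(prefix, os.path.basename(path))
        # boto3 S3 client signature: upload_file(Filename, Bucket, Key)
        s3_client.upload_file(path, bucket, key)
        keys.append(key)
    return keys


# Usage inside the notebook (requires AWS credentials):
# import boto3
# print(upload_tf_records(boto3.client('s3'), bucket))
```

Passing the client in as an argument keeps the helper easy to try against a stub before touching a real bucket.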


## Upload the customized container to ECR


# Training

In [None]:
# Let the training run for 10 minutes, which is enough for a demo.
max_run_time = 600

Run the training using the Amazon SageMaker CreateTrainingJob API.

In [None]:
%%time
import time
import boto3
from time import gmtime, strftime


s3 = boto3.client('s3')
# create unique job name 
job_name_prefix = 'object-detection-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_run_time": str(max_run_time) # after this time training job will terminate itself
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 20*60 # 20 minutes. After this sagemaker will stop training
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "validation"
    # The algorithm currently only supports FullyReplicated mode (where the data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/pet_detection_data/tf_record'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
#             "ContentType": "application/x-recordio",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))
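
The request above carries two time limits: the container's own `max_run_time` hyperparameter (after which the job terminates itself) and SageMaker's `MaxRuntimeInSeconds` (after which the service stops the job). The sketch below checks that the first is shorter than the second, so the container gets a chance to finish on its own. The helper name `check_runtime_limits` is hypothetical, and the check itself is a suggestion rather than something the API requires.

```python
def check_runtime_limits(params):
    """Return the safety margin in seconds between the two limits.

    Raises ValueError if the container's own limit is not below the
    limit enforced by SageMaker.
    """
    own_limit = int(params['HyperParameters']['max_run_time'])
    hard_limit = params['StoppingCondition']['MaxRuntimeInSeconds']
    if own_limit >= hard_limit:
        raise ValueError(
            'max_run_time ({}s) should be below MaxRuntimeInSeconds ({}s)'
            .format(own_limit, hard_limit)
        )
    return hard_limit - own_limit


# Same limits as in the request above: 10 minutes versus 20 minutes
example = {
    'HyperParameters': {'max_run_time': str(600)},
    'StoppingCondition': {'MaxRuntimeInSeconds': 20 * 60},
}
print('Margin in seconds:', check_runtime_limits(example))
```

In the notebook, `check_runtime_limits(training_params)` could be called just before the job is submitted.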

In [None]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

from botocore.exceptions import WaiterError

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except WaiterError:
    # the waiter raises when the job fails or the wait times out
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    print('Training job status: ' + training_info['TrainingJobStatus'])
    # FailureReason is only present when the job has failed
    message = training_info.get('FailureReason', 'no failure reason reported')
    print('Training did not complete successfully: {}'.format(message))

In [None]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
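
The `describe_training_job` response holds more than the status. As a rough sketch, the helper below pulls out the fields that are usually of interest after a run: the failure reason (if any), the model artifact location, and the training duration. The function name `summarize_training_job` is made up for this example, and it assumes the optional keys may be absent, for instance while the job is still starting.

```python
def summarize_training_job(info):
    """Condense a describe_training_job response into a small dict."""
    summary = {'status': info['TrainingJobStatus']}
    if 'FailureReason' in info:
        summary['failure_reason'] = info['FailureReason']
    artifacts = info.get('ModelArtifacts', {}).get('S3ModelArtifacts')
    if artifacts:
        summary['model_artifacts'] = artifacts
    start = info.get('TrainingStartTime')
    end = info.get('TrainingEndTime')
    if start and end:
        # both values are datetime objects in the boto3 response
        summary['duration_seconds'] = (end - start).total_seconds()
    return summary


# Usage inside the notebook:
# print(summarize_training_job(training_info))
```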