# Training a Computer Vision Model for Coin Classification

This notebook was hosted and run on Amazon SageMaker to train a computer vision model that classifies images of coins.

The code was largely inspired by a case study on an analogous image classification task using the Caltech-256 image dataset, which is provided as an example for users signing up for SageMaker. Modifications were made to suit the attributes of the coin dataset.

## Prerequisites and Preprocessing

### Permissions and environment variables

Here we set up the linkage and authentication to AWS services. There are three parts to this:

* The role that gives training and hosting access to the data. It is obtained automatically from the role used to start the notebook.
* The S3 bucket used for training and model data.
* The Amazon SageMaker image classification Docker image, which does not need to be changed.

In [1]:
%%time
import boto3
import re
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()

bucket='my-bucket-name' # customize to your bucket

training_image = get_image_uri(boto3.Session().region_name, 'image-classification')

print(training_image)

811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:1
CPU times: user 1.1 s, sys: 291 ms, total: 1.39 s
Wall time: 8.8 s


## Fine-tuning the Image classification model

The coin image dataset consists of images from 36 classes (18 coins with 2 sides each). There are 3,024 images per class. The training and validation sets have been converted into the [RecordIO format](https://mxnet.incubator.apache.org/tutorials/basic/record_io.html) and hosted on S3 for compatibility with this training approach.

In [2]:
import os 
import urllib.request
import boto3

# S3 locations of the training and validation RecordIO data
s3_train_key = 'train'
s3_validation_key = 'validation'
s3_train = 's3://{}/{}/'.format(bucket, s3_train_key)
s3_validation = 's3://{}/{}/'.format(bucket, s3_validation_key)

Once the data is available in the correct format, the next step is to train the model. Before training, we need to set up the training parameters.

## Training parameters

There are two kinds of parameters that need to be set for training. The first are hyperparameters that are specific to the CNN algorithm:

In [3]:
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained values
use_pretrained_model = 1

# Specify hyperparameters for ResNet model architecture:
num_layers = 18
image_shape = '3,227,227'
num_classes = 36

# Additional hyperparameters:
num_training_samples = 108864
epochs = 30
mini_batch_size =  128
learning_rate = 0.001
optimizer = 'adam'
beta_1 = 0.9
beta_2 = 0.999
eps = 1e-8
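
As a quick sanity check on the values above, the following sketch recomputes `num_training_samples` from the dataset description (36 classes with 3,024 images each) and estimates how many mini-batches one epoch contains. The name `images_per_class` is introduced here for illustration, and rounding the last partial batch up is an assumption; the algorithm may handle a partial batch differently.

```python
import math

# Values taken from the dataset description and the hyperparameters above
num_classes = 36
images_per_class = 3024  # illustrative name, not a hyperparameter of the algorithm
mini_batch_size = 128
epochs = 30

# Total number of images implied by the dataset description
total_images = num_classes * images_per_class

# Assumption: a final partial mini-batch still counts as one batch
batches_per_epoch = math.ceil(total_images / mini_batch_size)
total_batches = batches_per_epoch * epochs

print('Total images: {}'.format(total_images))
print('Mini-batches per epoch: {}'.format(batches_per_epoch))
print('Mini-batches over all epochs: {}'.format(total_batches))
```

The total matches the `num_training_samples` value of 108864 set above.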

After these are set, we define the second kind of parameters, those for the training job itself. These include:

* **Input specification**: The training and validation channels, which specify where the data is located. They are set in the "InputDataConfig" section. The main parameters are "ContentType", which we set to "application/x-recordio", and "S3Uri", which specifies the bucket and folder that hold the data.
* **Output specification**: This is set in the "OutputDataConfig" section. We only need to specify the path where the output is stored after training.
* **Resource config**: This section specifies the type of instance on which to run the training and the number of hosts used. If "InstanceCount" is greater than 1, training runs in a distributed manner.
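
The two input channels share the same structure and differ only in name and S3 location. A minimal sketch of a helper that builds one channel entry is shown below; `make_channel` is a hypothetical function written for illustration and is not part of boto3 or the SageMaker SDK. The bucket name is the same placeholder used earlier.

```python
def make_channel(name, s3_uri, content_type='application/x-recordio'):
    """Build one InputDataConfig entry (hypothetical helper, for illustration)."""
    return {
        'ChannelName': name,
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': s3_uri,
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        'ContentType': content_type,
        'CompressionType': 'None'
    }

bucket = 'my-bucket-name'  # placeholder, customize to your bucket

# One entry per channel, in the same order as in the training job definition
input_data_config = [
    make_channel('train', 's3://{}/train/'.format(bucket)),
    make_channel('validation', 's3://{}/validation/'.format(bucket)),
]

for channel in input_data_config:
    print(channel['ChannelName'], channel['DataSource']['S3DataSource']['S3Uri'])
```

The resulting list has the same shape as the "InputDataConfig" value written out in full in the next cell.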

In [4]:
%%time
import time
import boto3
from time import gmtime, strftime

s3 = boto3.client('s3')

# Create unique job name 
job_name_prefix = 'coinproblem-v4-noOthers'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # Specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "use_pretrained_model": str(use_pretrained_model),
        "optimizer": str(optimizer),
        "beta_1": str(beta_1),
        "beta_2": str(beta_2),
        "eps": str(eps)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },

    # Specify training & validation data location
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_validation,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

Training job name: coinproblem-v4-noOthers-2019-10-01-16-50-06

Input Data Location: {'S3DataType': 'S3Prefix', 'S3Uri': 's3://coinproblem/train/', 'S3DataDistributionType': 'FullyReplicated'}
CPU times: user 47.7 ms, sys: 5.39 ms, total: 53.1 ms
Wall time: 215 ms


## Training

After setting the training parameters, we start the training job with the Amazon SageMaker CreateTrainingJob API and poll its status until the job is completed.

In [None]:
# Create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# Confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # Wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
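except Exception as err:
    # The waiter raises if the job fails or the wait times out
    print('Waiting for the training job failed: {}'.format(err))
    # FailureReason is only present when the job itself has failed
    message = sagemaker.describe_training_job(TrainingJobName=job_name).get('FailureReason', 'not reported')
    print('Training failed with the following error: {}'.format(message))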

Training job current status: InProgress
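
The cell above relies on a boto3 waiter to do the polling. For readers who want to see what polling amounts to, here is a minimal sketch of an explicit loop. `wait_for_training_job` is a hypothetical helper, and `fake_describe` is a stand-in for the client's `describe_training_job` method so that the sketch runs without AWS access; the poll interval and maximum number of polls are arbitrary assumptions.

```python
import time

# Statuses after which a training job no longer changes
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}

def wait_for_training_job(describe, job_name, poll_seconds=60, max_polls=1000):
    """Call describe(TrainingJobName=...) until the job reaches a terminal status."""
    for _ in range(max_polls):
        info = describe(TrainingJobName=job_name)
        if info['TrainingJobStatus'] in TERMINAL_STATUSES:
            return info
        time.sleep(poll_seconds)
    raise TimeoutError('Job {} did not finish after {} polls'.format(job_name, max_polls))

# Stand-in for describe_training_job: reports InProgress twice, then Completed
_responses = iter([
    {'TrainingJobStatus': 'InProgress'},
    {'TrainingJobStatus': 'InProgress'},
    {'TrainingJobStatus': 'Completed'},
])

def fake_describe(TrainingJobName):
    return next(_responses)

final_info = wait_for_training_job(fake_describe, 'demo-job', poll_seconds=0)
print('Training job ended with status: ' + final_info['TrainingJobStatus'])
```

With a real client, the call would be `wait_for_training_job(sagemaker.describe_training_job, job_name)`.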


In [1]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)

Training job ended with status: Completed


If we see the message,

> `Training job ended with status: Completed`

then training completed successfully and the output model was stored in the output path specified by `training_params['OutputDataConfig']`.
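
The exact location of the model archive can be read from the `describe_training_job` response, which reports it under `ModelArtifacts` once the job has completed. The sketch below uses a hand-written sample response rather than a real API call; `model_artifact_uri` is a hypothetical helper, and the S3 URI in the sample is illustrative only.

```python
def model_artifact_uri(training_info):
    """Return the S3 URI of the trained model from a describe_training_job response."""
    status = training_info.get('TrainingJobStatus')
    if status != 'Completed':
        raise ValueError('No model artifact available, job status is {}'.format(status))
    return training_info['ModelArtifacts']['S3ModelArtifacts']

# Hand-written sample with the same shape as a real response (illustrative values)
sample_training_info = {
    'TrainingJobStatus': 'Completed',
    'ModelArtifacts': {
        'S3ModelArtifacts': 's3://my-bucket-name/example-prefix/output/example-job/output/model.tar.gz'
    }
}

print(model_artifact_uri(sample_training_info))
```

In the notebook, the `training_info` dictionary from the previous cell would be passed in place of the sample.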

We can now move on to using the trained model to predict on the test data and analyzing the results.