## Image classification on a small set of images using the lst format on AWS SageMaker


### Dataset description

The dataset is small. It contains the following 20 categories:

- bike
- crab
- ipod
- license-plate
- owl
- playing-card
- raccoon
- smokestack
- spaghetti
- syringe
- chandelier
- grapes
- ketch
- octopus
- paperclip
- pyramid
- skateboard
- soda-can
- speed-boat
- umbrella

The training data covers the 20 categories above, with 40 images in each category (800 images total). The testing data consists of 40 unlabeled images.


### Get the dataset

The dataset is stored in an S3 bucket, and the images are publicly accessible. For example, you can view one of the images at https://bhargav-image-classification.s3.amazonaws.com/20/bike/bike0001.jpg.

Each image can be downloaded from a URL, with the following naming convention:

**training data**
```
https://bhargav-image-classification.s3.amazonaws.com/20/<category>/<category><index>.jpg

for example,

https://bhargav-image-classification.s3.amazonaws.com/20/octopus/octopus0001.jpg
https://bhargav-image-classification.s3.amazonaws.com/20/octopus/octopus0002.jpg
...
https://bhargav-image-classification.s3.amazonaws.com/20/octopus/octopus0010.jpg
...
https://bhargav-image-classification.s3.amazonaws.com/20/umbrella/umbrella0040.jpg
```

**testing data**
```
https://bhargav-image-classification.s3.amazonaws.com/20/ic-test/<index>.jpg

for example,

https://bhargav-image-classification.s3.amazonaws.com/20/ic-test/1.jpg
https://bhargav-image-classification.s3.amazonaws.com/20/ic-test/2.jpg
https://bhargav-image-classification.s3.amazonaws.com/20/ic-test/3.jpg
...
https://bhargav-image-classification.s3.amazonaws.com/20/ic-test/40.jpg
```

### Tasks - Image Classification
Using the provided training data, create a model and classify the unlabeled testing data. Specifically, you are expected to perform the following tasks:

1. Download the above-mentioned dataset.
2. Organize and prepare training data for training.
3. Upload the training data to S3 with the appropriate folder structure.
4. Train a model with the training data. 
5. Perform predictions for the testing data. Output the results including label and accuracy.

ToDo:

6. Output the result labels and accuracy into a CSV file. Use this CSV file to run a query in Athena or build a visualization in QuickSight.
7. Tune hyperparameters for better performance.
8. Train a new model with the improved hyperparameters.
9. Perform predictions for the testing data (with the improved model). Output the results including label and accuracy.
10. Output the results of step 5 and step 9 into CSV files. Query the CSV files using Athena and/or visualize them using QuickSight, showing the differences in accuracy due to hyperparameter tuning.


In [None]:
# 1. Download the dataset. 

import boto3
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()

bucket='veera-learning-series' # customize to your bucket

training_image = get_image_uri(boto3.Session().region_name, 'image-classification')

In [None]:
import os
import urllib.request

def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

category = ['bike', 'crab', 'ipod', 'license-plate', 'owl', 'playing-card', 'raccoon', 'smokestack', 'spaghetti', 'syringe', 'chandelier', 'grapes', 'ketch', 'octopus', 'paperclip', 'pyramid', 'skateboard', 'soda-can', 'speed-boat', 'umbrella']

for cat in category:
    for i in range(1,41):
        index = str(i).zfill(4)
        # URL follows the naming convention described under "Get the dataset"
        download('https://bhargav-image-classification.s3.amazonaws.com/20/'+cat+'/'+cat+index+'.jpg')

# Tool for creating lst file
download('https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py')

In [None]:
# 2. Organize/Prepare the data for SageMaker training. 
# Hint: you can train with the image lst format, as we did on Friday.

# TODO
import glob
import os
import shutil

category = ['bike', 'crab', 'ipod', 'license-plate', 'owl', 'playing-card', 'raccoon', 'smokestack', 'spaghetti', 'syringe', 'chandelier', 'grapes', 'ketch', 'octopus', 'paperclip', 'pyramid', 'skateboard', 'soda-can', 'speed-boat', 'umbrella']
# Move each category's images into ./inputData/<NNN>.<category>/ so that the
# sorted folder order matches the order of the category list.
for i, cat in enumerate(category, start=1):
    di = os.path.join('inputData', str(i).zfill(3) + '.' + cat)
    os.makedirs(di, exist_ok=True)  # also creates inputData/ if it is missing
    print('mv ' + cat + '*.jpg ' + di)
    for path in glob.glob(cat + '*.jpg'):
        shutil.move(path, di)

In [None]:
%%bash

mkdir -p data_train_60
for i in inputData/*; do
    c=`basename $i`
    mkdir -p data_train_60/$c
    for j in `ls $i/*.jpg | shuf | head -n 25`; do
        mv $j data_train_60/$c/
    done
done

python im2rec.py --list --recursive data-60-train data_train_60/
# the images left in inputData/ after the split form the validation set
python im2rec.py --list --recursive data-60-val inputData/
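
The bash cell above moves 25 randomly chosen images per category into the training folder and leaves the rest for validation. The same split can be sketched in Python. `split_dataset` and its `seed` parameter are hypothetical names, and the fixed seed is an addition that makes the split reproducible, which the `shuf` version is not.

```python
import os
import random
import shutil


def split_dataset(src_root, train_root, n_train=25, seed=0):
    """Move n_train random .jpg files per class folder from src_root to train_root.

    Files left in src_root afterwards can serve as the validation set.
    Returns a dict mapping class folder name -> number of files moved.
    """
    rng = random.Random(seed)
    moved = {}
    for class_dir in sorted(os.listdir(src_root)):
        src = os.path.join(src_root, class_dir)
        if not os.path.isdir(src):
            continue
        images = sorted(f for f in os.listdir(src) if f.lower().endswith(".jpg"))
        chosen = rng.sample(images, min(n_train, len(images)))
        dst = os.path.join(train_root, class_dir)
        os.makedirs(dst, exist_ok=True)
        for name in chosen:
            shutil.move(os.path.join(src, name), os.path.join(dst, name))
        moved[class_dir] = len(chosen)
    return moved
```

With the folder names used in this notebook, the call would be `split_dataset('inputData', 'data_train_60')`.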

In [None]:
!head -n 3 ./data-60-train.lst > example.lst
with open('example.lst', 'r') as f:
    lst_content = f.read()
print(lst_content)
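
Each line of a `.lst` file written by `im2rec.py --list` holds tab-separated fields: an integer index, a numeric class label, and the image path relative to the root folder. The sketch below parses one such line. `parse_lst_line` is a hypothetical helper, and the sample line is made up for illustration rather than copied from real output.

```python
def parse_lst_line(line):
    """Split one .lst line into (index, label, relative_path)."""
    fields = line.rstrip("\n").split("\t")
    index = int(fields[0])
    # im2rec writes the label as a float, for example 13.000000
    label = int(float(fields[1]))
    path = fields[-1]
    return index, label, path


# Illustrative line only: folder 014.octopus is the 14th sorted folder,
# so it would receive the zero-based label 13.
sample = "312\t13.000000\t014.octopus/octopus0007.jpg"
index, label, path = parse_lst_line(sample)
print(index, label, path)
```

Because labels follow the sorted folder order, label `n` corresponds to `category[n]` in this notebook, which is what the numeric folder prefixes are for.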

In [None]:
# 3. Upload to S3 with the correct folder structure. 
bucket='veera-learning-series'
# Four channels: train, validation, train_lst, and validation_lst
s3train = 's3://{}/image-classification/train/'.format(bucket)
s3validation = 's3://{}/image-classification/validation/'.format(bucket)
s3train_lst = 's3://{}/image-classification/train_lst/'.format(bucket)
s3validation_lst = 's3://{}/image-classification/validation_lst/'.format(bucket)

# upload the image files to train and validation channels
!aws s3 cp data_train_60 $s3train --recursive --quiet
!aws s3 cp inputData $s3validation --recursive --quiet

# upload the lst files to train_lst and validation_lst channels
!aws s3 cp data-60-train.lst $s3train_lst --quiet
!aws s3 cp data-60-val.lst $s3validation_lst --quiet

In [None]:
# Set up the hyperparameters for the training job.

# The algorithm supports multiple network depths (number of layers): 18, 34, 50, 101, 152 and 200.
# For this training, we will use 18 layers
num_layers = 18
# we need to specify the input image shape for the training data
image_shape = "3,224,224"
# we also need to specify the number of training samples in the training set
num_training_samples = 500
# specify the number of output classes
num_classes = 20
# batch size for training
mini_batch_size = 128
# number of epochs
epochs = 3
# learning rate
learning_rate = 0.01
# report top_5 accuracy
top_k = 5
# resize image before training
resize = 256
# period to store model parameters (in number of epochs); here, parameters are saved every 2 epochs
checkpoint_frequency = 2
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained weights
use_pretrained_model = 1
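
A quick sanity check on these values can catch mistakes before a training job is submitted. The sketch below is not part of the original exercise: `check_hyperparameters` is a hypothetical helper, `train_images_per_class` is a bookkeeping value rather than a SageMaker hyperparameter, and the batch-size rule is an assumption. The other rules only encode what this notebook states: the supported network depths, the `channels,height,width` form of `image_shape`, and the 25-images-per-class split.

```python
# Depths listed in the hyperparameter cell above.
SUPPORTED_NUM_LAYERS = {18, 34, 50, 101, 152, 200}


def check_hyperparameters(hp):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if hp["num_layers"] not in SUPPORTED_NUM_LAYERS:
        problems.append(
            "num_layers must be one of {}".format(sorted(SUPPORTED_NUM_LAYERS))
        )
    shape = hp["image_shape"].split(",")
    if len(shape) != 3 or not all(s.strip().isdigit() for s in shape):
        problems.append("image_shape must look like 'channels,height,width'")
    # Assumption: a batch larger than the whole training set is a mistake.
    if hp["mini_batch_size"] > hp["num_training_samples"]:
        problems.append("mini_batch_size exceeds num_training_samples")
    expected = hp["num_classes"] * hp["train_images_per_class"]
    if hp["num_training_samples"] != expected:
        problems.append("num_training_samples should be {}".format(expected))
    return problems


hp = {
    "num_layers": 18,
    "image_shape": "3,224,224",
    "num_training_samples": 500,
    "num_classes": 20,
    "mini_batch_size": 128,
    "train_images_per_class": 25,  # bookkeeping value, not sent to SageMaker
}
print(check_hyperparameters(hp))
```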

In [None]:
#4. Start training job. 

# First create the training job configuration


import time
import boto3
from time import gmtime, strftime


s3 = boto3.client('s3')
# create unique job name 
job_name_prefix = 'sagemaker-imageclassification-notebook'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "top_k": str(top_k),
        "resize": str(resize),
        "checkpoint_frequency": str(checkpoint_frequency),
        "use_pretrained_model": str(use_pretrained_model)    
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
#Training data should be inside a subdirectory called "train"
#Validation data should be inside a subdirectory called "validation"
#The algorithm currently only supports the fully replicated mode (where data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3train,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3validation,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "train_lst",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3train_lst,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation_lst",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3validation_lst,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-image",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))
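
The four entries of `InputDataConfig` differ only in channel name and S3 URI, so they could be generated by a small helper. This is a sketch rather than part of the original notebook: `make_channel` is a hypothetical name and the bucket below is a placeholder.

```python
def make_channel(name, s3_uri):
    """Build one InputDataConfig entry with the same settings used above."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": "application/x-image",
        "CompressionType": "None",
    }


bucket = "my-example-bucket"  # placeholder, replace with your own bucket
prefix = "s3://{}/image-classification".format(bucket)
channels = [
    make_channel(name, "{}/{}/".format(prefix, name))
    for name in ("train", "validation", "train_lst", "validation_lst")
]
print([c["ChannelName"] for c in channels])
```

The resulting `channels` list has the same shape as the hand-written `InputDataConfig` value.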

In [None]:
# After the configuration, start the training job

sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    # the waiter raises if the job fails or the wait times out
    print('Training did not complete successfully')
    message = sagemaker.describe_training_job(TrainingJobName=job_name).get('FailureReason', 'unknown')
    print('Training failed with the following error: {}'.format(message))

In [None]:
#Fetch the status of the training job to make sure that it has completed.
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
print (training_info)

In [None]:
# Create Model

import boto3
import time

sage = boto3.Session().client(service_name='sagemaker') 

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="image-classification-model" + timestamp
print(model_name)
info = sage.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

hosting_image = get_image_uri(boto3.Session().region_name, 'image-classification')

primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

create_model_response = sage.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

In [None]:
#Create Endpoint Config

import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp
endpoint_config_response = sage.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.p2.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

In [None]:
# Create and Deploy Endpoint

import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

In [None]:
# get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))
    
try:
    sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Arn: " + resp['EndpointArn'])
    print("Create endpoint ended with status: " + status)

    if status != 'InService':
        message = resp.get('FailureReason', 'unknown')
        print('Endpoint creation failed with the following error: {}'.format(message))
        raise Exception('Endpoint creation did not succeed')

In [None]:
# 5. Perform predictions for the testing data. 
# Output the results including label and accuracy.  
import boto3
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

In [None]:
#Download a test image and display it.

!wget -O /tmp/test.jpg https://bhargav-image-classification.s3.amazonaws.com/20/ic-test/13.jpg
file_name = '/tmp/test.jpg'
# test image
from IPython.display import Image
Image(file_name)  

In [None]:
#Send the image for inference

import json
import numpy as np
with open(file_name, 'rb') as f:
    payload = f.read()
    payload = bytearray(payload)
response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='application/x-image', 
                                   Body=payload)
result = response['Body'].read()
# the response body is JSON; parse it into a Python list
result = json.loads(result)
# the list holds one probability per class;
# find the class with the maximum probability and print its label
index = np.argmax(result)
category = ['bike', 'crab', 'ipod', 'license-plate', 'owl', 'playing-card', 'raccoon', 'smokestack', 'spaghetti', 'syringe', 'chandelier', 'grapes', 'ketch', 'octopus', 'paperclip', 'pyramid', 'skateboard', 'soda-can', 'speed-boat', 'umbrella']

object_categories = category
print("Result: label - " + object_categories[index] + ", probability - " + str(result[index]))
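
To cover all 40 test images, the per-image step above can be wrapped in a loop and the predictions written to a CSV file. The sketch below leaves the endpoint call out: `top_prediction` and `write_predictions_csv` are hypothetical helpers that operate on probability lists such as the `result` parsed above, and the probabilities in the example are made up.

```python
import csv
import os
import tempfile


def top_prediction(probabilities, labels):
    """Return (label, probability) for the most likely class."""
    idx = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[idx], probabilities[idx]


def write_predictions_csv(path, rows):
    """Write (image_name, label, probability) rows with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "label", "probability"])
        writer.writerows(rows)


labels = ["bike", "crab", "ipod"]  # shortened list, for illustration only
fake_results = {  # made-up probabilities standing in for endpoint responses
    "1.jpg": [0.1, 0.7, 0.2],
    "2.jpg": [0.6, 0.3, 0.1],
}
rows = [
    (name,) + top_prediction(probs, labels)
    for name, probs in sorted(fake_results.items())
]
csv_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
write_predictions_csv(csv_path, rows)
print(rows)
```

In the notebook, each probability list would come from parsing the body returned by `runtime.invoke_endpoint` for one test image. Since the test images are unlabeled, the CSV records the model's probability for its predicted label rather than a measured accuracy.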

In [None]:
#Delete Endpoint
sage.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# 6. (bonus) Tune Hyperparameters for better performance.

# TODO
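
One possible approach to this step is a SageMaker hyperparameter tuning job, started with the boto3 `create_hyper_parameter_tuning_job` call. The sketch below only builds the request dictionaries. The function name `build_tuning_request`, the chosen ranges, the job limits, and the use of `validation:accuracy` as the objective metric are assumptions for illustration, and should be checked against the image classification algorithm's documentation before use.

```python
def build_tuning_request(training_params, hyperparameters):
    """Derive tuning-job arguments from an existing create_training_job request."""
    tuned = {"learning_rate", "mini_batch_size"}  # assumed choice of tuned values
    tuning_config = {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:accuracy",  # assumed metric name
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": 6,   # illustrative limits
            "MaxParallelTrainingJobs": 2,
        },
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "learning_rate", "MinValue": "0.0001", "MaxValue": "0.1"},
            ],
            "IntegerParameterRanges": [
                {"Name": "mini_batch_size", "MinValue": "16", "MaxValue": "64"},
            ],
        },
    }
    # Reuse the parts of the training request that a tuning job also needs.
    definition = {
        key: training_params[key]
        for key in (
            "AlgorithmSpecification", "RoleArn", "InputDataConfig",
            "OutputDataConfig", "ResourceConfig", "StoppingCondition",
        )
    }
    # Hyperparameters that are not tuned are passed through unchanged.
    definition["StaticHyperParameters"] = {
        k: v for k, v in hyperparameters.items() if k not in tuned
    }
    return tuning_config, definition


# Possible usage inside the notebook (requires AWS credentials, not run here):
# tuning_config, definition = build_tuning_request(
#     training_params, training_params["HyperParameters"])
# sagemaker.create_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName="ic-tuning-example",
#     HyperParameterTuningJobConfig=tuning_config,
#     TrainingJobDefinition=definition)
```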

In [None]:
# 7. Train a new model with the new Hyperparameters. 

# TODO

In [None]:
# 8. Perform predictions for the testing data with the improved model. 
# Output the results including label and accuracy.
# Hint: try to do this task with the real-time inference option

# TODO

In [None]:
# 9. Output the results in step 5 and step 8 into CSV files. 
# Perform queries against the CSV using Athena. 
# Perform visualization for the CSV using QuickSight. 

### Cleaning Up

It is very important that you clean up the AWS resources you are using after this exercise. This includes the endpoints, the notebook instance, and the data stored in your S3 bucket. 
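
For the SageMaker hosting resources created in this notebook, cleanup can be sketched as below. `delete_endpoint`, `delete_endpoint_config` and `delete_model` are existing boto3 SageMaker client calls. The wrapper `cleanup_hosting`, the resource names, and the recording stand-in client used to exercise it without an AWS account are hypothetical.

```python
def cleanup_hosting(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its configuration, and the model, in that order."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stand-in for a boto3 client that records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the API methods.
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


fake = RecordingClient()
cleanup_hosting(fake, "my-endpoint", "my-endpoint-config", "my-model")
print(fake.calls)
```

In the notebook, the first argument would be the `sage` client and the names would be `endpoint_name`, `endpoint_config_name` and `model_name`. The notebook instance and the objects in the S3 bucket are not covered by this sketch and need to be removed separately.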