# MNIST Clusters and Batch Transform

Import all of the necessary libraries: 

In [1]:
import pickle, gzip, numpy, json
import urllib.request
import matplotlib.pyplot as plt
import io
import boto3
from sagemaker.amazon.common import write_numpy_to_dense_tensor

from sagemaker import get_execution_role
role = get_execution_role()

Replace `<your-bucket-name>` with the name of the bucket you created for this workshop.

In [None]:
BUCKET_NAME = '<your-bucket-name>' 
BUCKET_URL = "s3://" + BUCKET_NAME
TRAIN_VAL_TEST_FOLDER = 'ServerlessAIWorkshop/data'
TRAINING_DATA_KEY = TRAIN_VAL_TEST_FOLDER + "/train.data"
TRAINING_DATA_URL = "s3://" + BUCKET_NAME + "/" + TRAINING_DATA_KEY
TRAINING_DATA_FOLDER = "s3://" + BUCKET_NAME + "/" + TRAIN_VAL_TEST_FOLDER
MODEL_URL = BUCKET_URL + "/ServerlessAIWorkshop/model"
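
The constants above are plain string concatenations. As a rough sketch of how they fit together, the snippet below rebuilds the training-data URL with a hypothetical bucket name (`example-bucket` is an assumption, not a real bucket) and splits it back into the bucket and key that S3 calls expect.

```python
from urllib.parse import urlparse

# Hypothetical bucket name, used only for illustration.
bucket_name = "example-bucket"
folder = "ServerlessAIWorkshop/data"

training_data_key = folder + "/train.data"
training_data_url = "s3://" + bucket_name + "/" + training_data_key

# Split the URL back into a bucket name and an object key.
parsed = urlparse(training_data_url)
bucket, key = parsed.netloc, parsed.path.lstrip("/")

print(training_data_url)
print(bucket, key)
```

Keeping the key separate from the full URL matters because some calls take an `s3://` URL while others take the bucket and key as separate arguments.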

Download the dataset and load it into local variables, then upload the training data to the S3 bucket.

In [None]:
%%time

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

print('training data will be uploaded to: {}'.format(TRAINING_DATA_URL))

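# Serialize the training set into an in-memory buffer first; `buf` must exist before the upload.
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, train_set[0], train_set[1])
buf.seek(0)
boto3.resource('s3').Bucket(BUCKET_NAME).Object(TRAINING_DATA_KEY).upload_fileobj(buf)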

Look at the dataset.

In [None]:
%matplotlib inline

plt.rcParams["figure.figsize"] = (2,10)

def show_digit(img, caption='', subplot=None):
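    # Create a new axes when the caller does not supply one.
    if subplot is None:
        _, subplot = plt.subplots(1, 1)
    imgr = img.reshape((28, 28))  # each MNIST image is a flat 784-element vector
    subplot.axis('off')
    subplot.imshow(imgr, cmap='gray')
    subplot.set_title(caption)  # title the axes that was drawn on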

show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))
#plt.savefig("test.png")

The SageMaker Python library follows a number of conventions: it creates subfolders and default destinations for you, and it helps to know what they are. When setting up locations for your work, note:

* All data must reside in S3, with the possible exception of the initial dataset download. Even that should go directly to S3 if possible.
* Training, validation, and testing data should be specified by bucket only. 
* Model output will go into a directory of the format: 
    * lad

Convert the training data and labels to the RecordIO-protobuf format required by the SageMaker KMeans algorithm.

In [None]:
%%time

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, train_set[0], train_set[1])
buf.seek(0)
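
The `buf.seek(0)` call is easy to overlook. The sketch below, which uses a placeholder byte string instead of real training records, shows why it is needed: after a write, the stream position sits at the end of the buffer, so anything that reads from the buffer sees no data until the position is rewound.

```python
import io

buf = io.BytesIO()
buf.write(b"example payload")  # stand-in for the serialized training records

# After writing, the position is at the end, so a read returns nothing.
at_end = buf.read()

# Rewinding makes the whole payload readable again.
buf.seek(0)
from_start = buf.read()

print(at_end, from_start)
```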

Initialize KMeans for model training.

In [None]:
from sagemaker import KMeans

kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type='ml.c4.8xlarge',
                output_path=MODEL_URL,
                k=10,
                data_location=TRAINING_DATA_URL)

Train the model, using the high-level SDK.

In [None]:
%%time

kmeans.fit(kmeans.record_set(train_set[0]))
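
For intuition about what the training job computes, here is a minimal pure-Python sketch of Lloyd's algorithm, the textbook k-means iteration. It is not the SageMaker implementation, which AWS describes as a modified web-scale k-means. The function name `lloyd_kmeans`, the toy 2-D points, and the fixed initial centers are all assumptions made for illustration. In the notebook the points are 784-dimensional image vectors and `k=10`.

```python
def lloyd_kmeans(points, centers, iterations=10):
    """Run a fixed number of Lloyd iterations on lists of equal-length tuples."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        centers = new_centers
    return centers, clusters


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = lloyd_kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each returned center is the mean of the points assigned to it, which is the same role the trained model's cluster centers play for the MNIST images.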

Get the path to where the trained model is stored.

In [None]:
kmeans.latest_training_job.job_name
TRAINED_MODEL_URL = '{}/{}/output/model.tar.gz'.format(MODEL_URL, kmeans.latest_training_job.job_name)
TRAINED_MODEL_URL
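
To make the artifact layout concrete, the sketch below fills in the same template with made-up values. Both the bucket and the job name are hypothetical; real job names are generated when the training job starts.

```python
# Hypothetical values, used only to show the shape of the resulting path.
model_url = "s3://example-bucket/ServerlessAIWorkshop/model"
job_name = "kmeans-2018-01-01-00-00-00-000"

trained_model_url = "{}/{}/output/model.tar.gz".format(model_url, job_name)
print(trained_model_url)
```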

`amazon_estimator.get_image_uri()` returns the algorithm image URI for the given AWS region, repository name, and repository version.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

sagemaker = boto3.client('sagemaker')

image = get_image_uri(boto3.Session().region_name, 'kmeans')  # get_image_uri(region_name, repo_name, repo_version=1)

kmeans_hosting_container = {
    'Image': image,
    'ModelDataUrl': TRAINED_MODEL_URL
}

kmeans_hosting_container # print it out to make sure it is correct

Create a SageMaker model from the trained model artifact. This step is necessary because the training job started above does not by itself create a model that batch transform can use.

In [None]:
create_model_response = sagemaker.create_model(
    ModelName=kmeans.latest_training_job.job_name,
    ExecutionRoleArn=role,
    PrimaryContainer=kmeans_hosting_container)