### Machine Learning Immersion Day 
# Data preparation

This notebook contains the documented code for the data preparation part of the AWS Machine Learning Immersion Day.

Before running the code, make sure to read and try to understand what the code is doing.

Execute the code by selecting the cell that contains it and then clicking the Run button in the toolbar above or pressing Shift + Return.

### Let's go!

Begin with importing dependencies and setting up some base variables that later code will be referring to.

*Don't forget to change the your_initials value to the initials you used in Lab1*

In [None]:
your_initials = 'put-your-initials-here'

bucket = your_initials + '-ml-id-lab'

import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
  
import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

> While the code is running, there will be a bracketed asterisk showing to the left of the code **[*]**. Once finished, the asterisk will be replaced with a number showing the order of execution within the current notebook document's state.

Next, download one of the data files used in Lab1 to the notebook. 

In [None]:
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file('movielens-data/u.data/data.csv', 'u.data')

The downloaded file is a compacted version of the data explored in Lab1. This is the description of the file:

> ```text
> u.data 
>
> The full u data set, 100000 ratings by 943 users on 1682 items.
> Each user has rated at least 20 movies.  Users and items are numbered consecutively from 1.  The data is randomly ordered. This is a tab separated list of:
> user id | item id | rating | timestamp
> The time stamps are unix seconds since 1/1/1970 UTC
> ```

While this is an intuitive and relatively compact way of storing the information, it is not optimal for training factorization machine models. To produce good training data, this data needs to be split and transformed.
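
As a small illustration of the tab-separated layout described above, the sketch below parses a single rating line with the standard library. The sample line is an assumption made for illustration; it is not read from the downloaded file.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative line in the documented layout: user id, item id, rating, timestamp
sample = "196\t242\t3\t881250949\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
user_id, item_id, rating, timestamp = next(reader)

# The timestamp is Unix seconds since 1970-01-01 UTC
rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)

print(int(user_id), int(item_id), int(rating), rated_at.isoformat())
```

All four fields arrive as strings, which is why the later cells wrap them in `int()` before using them as indices or comparing ratings.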

First, split the data into one larger training part and one smaller testing part (10 samples per user).

At the end of running the code, the two rating counters will be printed to an output that is added below the cell. 


In [None]:
nbUsers=943
nbMovies=1682
nbFeatures=nbUsers+nbMovies

# Pick 10 ratings per user and save as test data set ua.test
# Save the rest as training data set ua.base
!rm -f ua.base || touch ua.base
!rm -f ua.test || touch ua.test
  
# Extract 10 samples per user into test data
maxRatingsByUser = 10
# Keep track of how many ratings have been extracted, initialize to 0
testRatingsByUser = {}
for userId in range(nbUsers):
    testRatingsByUser[str(userId)]=0

  
with open('u.data','r') as f, open('ua.base','w') as uabase, open('ua.test','w') as uatest:
    filedata=csv.reader(f,delimiter='\t')
    next(filedata, None) # skip headers
    uabasewriter = csv.writer(uabase, delimiter='\t')
    uatestwriter = csv.writer(uatest, delimiter='\t')
    nbRatingsTrain=0
    nbRatingsTest=0
    # For every rating line in file
    for userId,movieId,rating,timestamp in filedata:
        if testRatingsByUser[str(int(userId)-1)] < maxRatingsByUser:
            # Write to test data 
            uatestwriter.writerow([userId,movieId,rating,timestamp])
            testRatingsByUser[str(int(userId)-1)] = testRatingsByUser[str(int(userId)-1)] + 1
            nbRatingsTest=nbRatingsTest+1
        else:
            # Write to training data
            uabasewriter.writerow([userId,movieId,rating,timestamp]) 
            nbRatingsTrain=nbRatingsTrain+1
            
print("Train data ratings counter: %s" % (nbRatingsTrain))
print("Test data ratings counter: %s" % (nbRatingsTest))
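
The split logic in the cell above can be tried on a small in-memory data set. The following is a minimal sketch, not the lab's code: the helper name `split_per_user` and the toy ratings are made up for illustration.

```python
from collections import defaultdict

def split_per_user(ratings, max_test_per_user):
    """Send the first max_test_per_user ratings of each user to test, the rest to train."""
    taken = defaultdict(int)  # test ratings extracted so far, per user
    train, test = [], []
    for row in ratings:
        user_id = row[0]
        if taken[user_id] < max_test_per_user:
            test.append(row)
            taken[user_id] += 1
        else:
            train.append(row)
    return train, test

# Toy ratings as (user id, item id, rating); the values are made up
toy = [(1, 10, 5), (2, 11, 3), (1, 12, 4), (1, 13, 2), (2, 14, 1), (1, 15, 4)]
train, test = split_per_user(toy, max_test_per_user=2)

print(len(train), len(test))
```

If the downloaded file holds the full 100,000 ratings (excluding any header line) and every one of the 943 users has at least 10 of them, the cell above should report 943 x 10 = 9,430 test ratings and 100,000 - 9,430 = 90,570 training ratings.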

Make sure the partitioned data looks good by printing the first 10 rows of each file. 
> Notice that the exclamation mark starting each line in these snippets means that the line is executed as a shell command rather than as Python code.

In [None]:
!echo "Training data:"
!head -10 ua.base
!echo "Testing data:"
!head -10 ua.test

The output should show ten lines containing four columns for each file. You may notice that the training data has runs of consecutive lines with the same value in the first column (user_id). These kinds of regularities in the training data can lead to suboptimal training.

Create a new file containing shuffled training data.

In [None]:
!shuf ua.base -o ua.base.shuffled
!head -10 ua.base.shuffled
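
`shuf` is part of GNU coreutils and may not be available in every environment. As a hedged alternative, the sketch below shuffles lines in Python. The helper name `shuffle_lines` and the toy rows are illustrative, and the optional seed is an addition that makes the order repeatable.

```python
import random

def shuffle_lines(lines, seed=None):
    """Return a shuffled copy of lines; a fixed seed makes the order repeatable."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

# Toy rows grouped by user id, similar to the unshuffled training file
rows = ["1\t10\t5", "1\t12\t4", "2\t11\t3", "2\t14\t1", "3\t20\t2"]
shuffled = shuffle_lines(rows, seed=42)

print(shuffled)
```

The shuffled list contains exactly the same rows as the input, only in a different order, so the rating counters computed earlier remain valid.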

You now have two sets of source data, but need to process them more before training and testing a factorization machine model. What is needed for each of the sets is:

- Create a one-hot encoded sparse matrix holding **features** (input to the model)
- Create a **label** array (the expected output from the model)
- Serialize both of the above into protobuf format and write them to the S3 bucket.

Define a function **loadDataset** that loads a dataset and returns a one-hot encoded feature sparse matrix and a label vector.

In [None]:
def loadDataset(filename, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    with open(filename,'r') as f:
        samples=csv.reader(f,delimiter='\t')
        for userId,movieId,rating,timestamp in samples:
            X[line,int(userId)-1] = 1
            X[line,int(nbUsers)+int(movieId)-1] = 1
            if int(rating) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1
    Y=np.array(Y).astype('float32')
    return X,Y
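
To make the index arithmetic in `loadDataset` concrete, the sketch below computes the two columns that are set to 1 for a single rating, together with its binary label. The helper names and the sample ids are illustrative assumptions.

```python
def one_hot_columns(user_id, movie_id, nb_users):
    """Return the two column indices set to 1 for a (user, movie) pair with 1-based ids."""
    user_col = user_id - 1                # users occupy columns 0 .. nb_users-1
    movie_col = nb_users + movie_id - 1   # movies follow directly after the users
    return user_col, movie_col

def binary_label(rating, threshold=4):
    """Map a star rating to 1 (rating >= threshold) or 0 (otherwise)."""
    return 1 if rating >= threshold else 0

nb_users, nb_movies = 943, 1682
cols = one_hot_columns(user_id=196, movie_id=242, nb_users=nb_users)

print(cols, binary_label(3), binary_label(4))
```

Every row of the feature matrix therefore has exactly two non-zero entries, and the last movie of the last user lands in column `nb_users + nb_movies - 1`, the final column of the matrix.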

Use this function to verify some properties of the returned data structures.

In [None]:
X_train, Y_train = loadDataset('ua.base.shuffled', nbRatingsTrain, nbFeatures)

print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nbRatingsTrain, nbFeatures)
assert Y_train.shape == (nbRatingsTrain, )
nonzero_labels = np.count_nonzero(Y_train)
print("Training labels: %d ones, %d zeros" % (nonzero_labels, nbRatingsTrain-nonzero_labels))

In [None]:
X_test, Y_test = loadDataset('ua.test', nbRatingsTest, nbFeatures)

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nbRatingsTest, nbFeatures)
assert Y_test.shape  == (nbRatingsTest, )
nonzero_labels = np.count_nonzero(Y_test)
print("Test labels: %d ones, %d zeros" % (nonzero_labels, nbRatingsTest-nonzero_labels))

Now, you will serialize these structures to protobuf format and write them to S3. Start by defining target names for the S3 objects, and a function that does the serialization and returns the path to the object on S3.

In [None]:
prefix = 'sagemaker/recommender-fm'

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
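
The `buf.seek(0)` call in the function above matters because a file-like object is read from its current position, and after writing that position is at the end of the buffer. The sketch below shows the effect with a plain `io.BytesIO`. The byte string stands in for the protobuf data, and the bucket name is a placeholder.

```python
import io

buf = io.BytesIO()
buf.write(b"serialized-records")   # stand-in for the protobuf bytes

# After writing, the position is at the end, so a read returns nothing
at_end = buf.read()

buf.seek(0)                        # rewind to the start
from_start = buf.read()

# S3 URI built the same way as in the function above; the names are placeholders
obj = "{}/{}".format("sagemaker/recommender-fm/train", "train.protobuf")
uri = "s3://{}/{}".format("example-bucket", obj)

print(at_end, from_start, uri)
```

Without the rewind, an upload that starts reading at the current position would be expected to store an empty object.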

Last, write the data by calling the function for the two sets.

In [None]:
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)  
print("Training data at: %s" % (train_data))

test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
print("Testing data at: %s" % (test_data))

You should now see objects at these paths in the S3 console. Note how efficiently the sparse matrix is stored: only 5.8 MB for the training set.
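
For comparison, the sketch below estimates how large the same training matrix would be if every cell were stored as a dense `float32` value. It assumes 90,570 training rows (100,000 ratings minus 10 test ratings for each of the 943 users); use the value of `nbRatingsTrain` printed earlier if it differs.

```python
nb_rows = 90570           # assumed: 100,000 ratings minus 943 users x 10 test ratings
nb_features = 943 + 1682  # users plus movies, as in nbFeatures
bytes_per_value = 4       # float32

dense_bytes = nb_rows * nb_features * bytes_per_value
nonzeros = nb_rows * 2    # one user column and one movie column per row
density = nonzeros / (nb_rows * nb_features)

print("Dense size: %.0f MB" % (dense_bytes / 1e6))
print("Non-zero values: %d (%.4f%% of all cells)" % (nonzeros, density * 100))
```

Under these assumptions a dense layout would need about 951 MB, because fewer than 0.1% of the cells hold a non-zero value.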

You have now finished preparing data and are ready to start training your model. 

---

# SageMaker Model Training
In this part of the lab, you will invoke Amazon SageMaker training and testing from the notebook.

In [None]:
output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)
  
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
             'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
             'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/factorization-machines:latest',
             'ap-northeast-1': '351501993468.dkr.ecr.ap-northeast-1.amazonaws.com/factorization-machines:latest',
             'ap-northeast-2': '835164637446.dkr.ecr.ap-northeast-2.amazonaws.com/factorization-machines:latest',
             'ap-southeast-2': '712309505854.dkr.ecr.ap-southeast-2.amazonaws.com/factorization-machines:latest',
             'eu-central-1': '664544806723.dkr.ecr.eu-central-1.amazonaws.com/factorization-machines:latest',
             'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest'}
  
print("The trained model will be written to: %s" % (output_prefix))
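
The next cell looks up the image with `containers[boto3.Session().region_name]`, which raises a bare `KeyError` if the notebook runs in a region that is not listed above. The sketch below shows one way to produce a clearer message. `container_for_region` is a hypothetical helper, not part of the lab or of any SDK, and the dictionary is a subset of the mapping above repeated so the example runs on its own.

```python
# Subset of the region-to-image mapping above, repeated so the sketch is self-contained
containers = {
    'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
    'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest',
}

def container_for_region(region, mapping):
    """Hypothetical helper: return the image URI for region or fail with a readable error."""
    try:
        return mapping[region]
    except KeyError:
        raise ValueError(
            "No factorization-machines image listed for region %r; known regions: %s"
            % (region, ", ".join(sorted(mapping)))
        ) from None

print(container_for_region('eu-west-1', containers))
```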

Create a factorization machine Estimator object and set the hyperparameters to be used when training.

In [None]:
fm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                  get_execution_role(), 
                                  train_instance_count=1, 
                                  train_instance_type='ml.c4.xlarge',
                                  output_path=output_prefix,
                                  sagemaker_session=sagemaker.Session())
  
fm.set_hyperparameters(feature_dim=nbFeatures,
                     predictor_type='binary_classifier',
                     mini_batch_size=1000,
                     num_factors=64,
                     epochs=100)
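
To get a feel for what `feature_dim` and `num_factors` imply, the sketch below counts the parameters of a factorization machine in its standard formulation: one global bias, one linear weight per feature, and one factor vector of length `num_factors` per feature. This is an estimate under that assumption; the internal layout of the SageMaker algorithm may differ.

```python
nb_features = 943 + 1682   # feature_dim: users plus movies
num_factors = 64           # num_factors, as set above

bias_terms = 1                              # global bias
linear_weights = nb_features                # one weight per feature
factor_weights = nb_features * num_factors  # one factor vector per feature

total = bias_terms + linear_weights + factor_weights
print("Approximate number of model parameters: %d" % total)
```

Almost all of the parameters sit in the factor vectors, so `num_factors` is the main lever on model size.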

Now, invoke training on Amazon SageMaker. While the training is running, Amazon SageMaker will continuously produce output below the cell.

In [None]:
fm.fit({'train': train_data, 'test': test_data})

This particular training job should take 4-5 minutes. The training is finished when you see `Billable seconds: ###` at the end of the output.

You can also monitor progress of the training in the Amazon SageMaker console by selecting **Training jobs** in the main menu.

The trained model will be written to the path defined by **output_prefix**. You can verify that there is a **model.tar.gz** object in the S3 console.

You have now trained your model and are ready to start using it.

----


# Deploy Endpoint and Test Inference

In the last section of this lab you will deploy a development endpoint and test run some inferences of your model. **Do not start this section unless your training job from the earlier step has status Completed.**

In [None]:
fm_predictor = fm.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)

This will start up an endpoint instance. You can monitor progress through the notebook, or on the Amazon SageMaker console by selecting **Endpoints** in the menu.

Next, configure serialization options for the predictor.

In [None]:
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print js
    return json.dumps(js)
  
fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer
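
To see what the serializer sends to the endpoint, the sketch below builds the same JSON structure for two toy rows. `to_request_body` is an illustrative stand-in that accepts plain Python lists instead of NumPy rows, and the feature values are made up.

```python
import json

def to_request_body(rows):
    """Build the same JSON structure as fm_serializer, for rows given as plain lists."""
    return json.dumps({'instances': [{'features': list(row)} for row in rows]})

# Two toy rows with four features each; real rows have nbFeatures columns
body = to_request_body([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
print(body)
```

Each row is sent as a dense list of features, which is why the prediction cell converts the sparse test rows with `toarray()` before calling the endpoint.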

Now you are ready to call the endpoint with ten test inputs.

In [None]:
result = fm_predictor.predict(X_test[1000:1010].toarray())
print("Prediction (Score) Expected")
for index, p in enumerate(result['predictions']):
    print("%10.2f %6.2f  %8.2f" % (p['predicted_label'], p['score'], Y_test[1000 + index]))

The cell prints a text table with three columns: **Prediction** and **Score** (from the model) and **Expected** (the label from the test data). If the model works well, the Prediction and Expected values should match on most rows.
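
Agreement between predictions and expected labels can also be summarized as a single accuracy figure. The sketch below does this for a made-up response in the same shape as `result['predictions']`; the labels and scores are illustrative and do not come from a real endpoint.

```python
# Illustrative response in the shape read by the cell above; the values are made up
result = {'predictions': [
    {'predicted_label': 1.0, 'score': 0.83},
    {'predicted_label': 0.0, 'score': 0.21},
    {'predicted_label': 1.0, 'score': 0.57},
    {'predicted_label': 0.0, 'score': 0.44},
]}
expected = [1.0, 0.0, 0.0, 0.0]

# Count the rows where the predicted label equals the expected label
matches = sum(
    1 for p, y in zip(result['predictions'], expected)
    if p['predicted_label'] == y
)
accuracy = matches / len(expected)
print("Accuracy: %.2f" % accuracy)
```

With the real endpoint, `expected` would be the matching slice of `Y_test`, for example `Y_test[1000:1010]` for the ten inputs used above.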

---

**Congratulations! You have now deployed and tested your recommendation ML model and are finished with Lab 2.**