## `scikit-surprise` recommender systems on SageMaker

This notebook demonstrates how to build a movie recommender system using [`scikit-surprise`](http://surpriselib.com/) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/), using the [SageMaker SKLearn Estimator](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html) as a base to avoid building custom containers.

**Note that `surprise` is a ["SciKit"](https://www.scipy.org/scikits.html) and therefore a *peer/sibling* of `scikit-learn`, not a part of it**. This means we can't just expect the SM SKLearn container to understand a surprise model file - we have to show it how.

The notebook should run fine on an ml.t2.medium instance and needs few additional resources to fit and deploy the model, since the data set is small.

### Install `scikit-surprise`

Although we won't be fitting models on this notebook instance itself, we'll install surprise so we can use the module's standard data pre-processing tools.

We demonstrate inline installation for portability/simplicity, but there's guidance [here](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html) on how installations can be integrated into the notebook instance's creation/startup.

In [None]:
!pip install scikit-surprise

### Load libraries and configuration

**TODO:** Create your target bucket, check that this notebook's role has access to it, and update the config below.

In [None]:
# Python Built-Ins:
import json
import os

# Libraries:
import boto3
import numpy as np
import pandas as pd
import sagemaker
from sagemaker.sklearn.estimator import SKLearn as SMSKLearnEstimator
import surprise

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # We'll use this notebook's role for our API interactions

# We don't need a sagemaker boto client as all our operations are supported via the Python SDK:
# sm_client = sagemaker_session.boto_session.client("sagemaker")

In [None]:
# The bucket we'll use:
bucket = "< TODO: ENTER YOUR BUCKET NAME HERE >"

data_prefix = "data"
train_filename = "movie-lens-100k-training.csv"
test_filename = "movie-lens-100k-test.csv"

# The output of training jobs (i.e. trained models or failure details)
output_prefix = "output"

# The results of batch transform jobs (i.e. estimated ratings for test user/movie pairs)
results_prefix = "results"

### Fetch the data set

We demonstrate recommendation on the [MovieLens](https://grouplens.org/datasets/movielens/) 100K benchmark set as a small/easy example.

If you explore bigger data sets, remember that the data preparation code here runs on the notebook instance itself. The default notebook storage volume size is 5GB, but it can be set higher on creation or while the instance is stopped.

The column names are chosen for consistency with [Amazon Personalize](https://docs.aws.amazon.com/personalize/latest/dg/data-prep-formatting.html) - our advanced recommender engine as-a-service which you might be interested to check out!

In [None]:
!wget -N http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

In [None]:
movielens = pd.read_csv('./ml-100k/u.data', sep='\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])
movielens.head()

### Prepare the data

`surprise` provides its own tools for data input and preparation, so let's use them to split the data into training and test sets:

In [None]:
data = surprise.Dataset.load_from_df(
    movielens[["USER_ID", "ITEM_ID", "RATING"]],
    surprise.Reader(line_format=u"user item rating", rating_scale=(1, 5))
)

train_data, test_data = surprise.model_selection.train_test_split(data, test_size=.25)

In [None]:
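# train_test_split() returns a Trainset (which uses surprise's *inner* IDs) and a plain list of test tuples,
# so map the training ratings back to raw IDs to stay consistent with the test set:
train_df = pd.DataFrame(
    [(train_data.to_raw_uid(u), train_data.to_raw_iid(i), r) for u, i, r in train_data.all_ratings()],
    columns=["USER_ID", "ITEM_ID", "RATING"],
)
test_df = pd.DataFrame(test_data, columns=["USER_ID", "ITEM_ID", "RATING"])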
train_df.head(5)

### Upload data to S3

SageMaker training and transform jobs use S3 for data input and output, so we need to upload the prepared sets.

In [None]:
# We have full control over the training script, but batch transform jobs are orchestrated by SageMaker
# (e.g. data job splitting) so the test file **cannot have a header**:
!mkdir -p $data_prefix
train_df.to_csv(os.path.join(data_prefix, train_filename), index=False)
test_df.to_csv(os.path.join(data_prefix, test_filename), index=False, header=False)

boto3.Session().resource("s3").Bucket(bucket).Object(
    "{}/{}".format(data_prefix, train_filename)
).upload_file(os.path.join(data_prefix, train_filename))

boto3.Session().resource("s3").Bucket(bucket).Object(
    "{}/{}".format(data_prefix, test_filename)
).upload_file(os.path.join(data_prefix, test_filename))
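
The same bucket/prefix/filename pattern recurs throughout this notebook, both as S3 object keys and as `s3://` URIs passed to jobs. As a minimal sketch, the pattern could be wrapped in two small helpers. `s3_key` and `s3_uri` are hypothetical names introduced here for illustration and are not part of boto3 or the SageMaker SDK.

```python
def s3_key(*parts):
    """Join path segments into an S3 object key, ignoring empty parts and stray slashes."""
    return "/".join(part.strip("/") for part in parts if part)


def s3_uri(bucket, *parts):
    """Build an s3:// URI for the given bucket and key segments."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


# Example with placeholder values (not a real bucket):
print(s3_key("data", "movie-lens-100k-training.csv"))
print(s3_uri("example-bucket", "data", "movie-lens-100k-training.csv"))
```

With these, `s3_uri(bucket, data_prefix, train_filename)` would produce the same string as the `"s3://{}/{}/{}".format(...)` calls used later.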

### Train the model

The SKLearn Estimator takes an `entry_point` script which defines training and inference/runtime behaviour.

SKLearn usage and script interface/requirements are documented [here](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html), with the internal `Transformer` code [here](https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_transformer.py) giving some insight into how the functions are applied.

The `entry_point` is loaded by SageMaker into a container and used in two different ways:

* In a training job, SageMaker loads input S3 data "channels" into folders in the container and runs the `entry_point` **as a script**. The script should fit the model and write it (or failure logs) into a specified output folder, which SageMaker then maps back to S3.
* At inference time (batch or online endpoint), SageMaker loads the `entry_point` **as a module** inside a bigger web server application that you don't need to write: `entry_point` should export functions to handle loading the trained model from disk and performing inferences.
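
As a rough sketch of the script-mode half of this contract (not the actual contents of `surprise-recommender.py`), the training entry point typically discovers its input and output folders from environment variables that SageMaker sets in the container. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are the standard variable names for the model folder and a channel called `train`; the `parse_args` helper and the fallback paths are assumptions made here for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Read folder locations from the command line, falling back to SageMaker's environment variables."""
    parser = argparse.ArgumentParser()
    # Where the fitted model must be saved so SageMaker can upload it to S3:
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Where the "train" channel's S3 data has been downloaded to:
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # Ignore any extra arguments (e.g. hyperparameters) this sketch does not know about:
    args, _ = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    # Only runs when the file is executed as a script, i.e. in a training job.
    args = parse_args()
    print("Training data folder:", args.train)
    print("Model output folder:", args.model_dir)
    # (Fit the model on files in args.train and save it under args.model_dir here.)
```

When the same file is imported as a module at inference time, the guarded block is skipped and only the top-level function definitions are loaded.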

Our `surprise-recommender.py` implementation has the following key parts:

1. A `subprocess` invocation to install `surprise` before it's `import`ed, since it's not included by default in the SM sklearn container. This means app startup from cold will be a little slower than if `surprise` were pre-installed in the container itself, but with less code required than creating a custom container.
2. An `if __name__ == "__main__"` guard clause to separate code that should only execute when the file is run as a script (the training job).
3. A `model_fn` which can load a trained model from disk into memory.
4. A `predict_fn` which executes a `model` against requested input `data`.
5. (To show how a flexible API can be implemented) an `input_fn` to interpret different formats of request correctly.

Review the `surprise-recommender.py` file to understand how it interacts with SageMaker.
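
To make the inference-time functions listed above more concrete, here is a simplified sketch of how they might fit together. It is an illustration under assumptions, not a copy of `surprise-recommender.py`: the `model.pkl` file name, the use of `pickle`, and the integer conversion of CSV IDs are all assumed, and only JSON and CSV requests are handled. The function names and argument order follow the SageMaker scikit-learn container's interface, and `predict(uid, iid)` is the standard `surprise` algorithm method.

```python
import json
import os
import pickle


def model_fn(model_dir):
    """Load the trained model that the training job saved into model_dir."""
    # "model.pkl" is an assumed file name; it must match whatever the training code wrote.
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def input_fn(request_body, request_content_type):
    """Convert a raw request into a list of (user_id, item_id) pairs."""
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    if request_content_type == "application/json":
        payload = json.loads(request_body)
        # Accept either a single object or a list of objects:
        records = payload if isinstance(payload, list) else [payload]
        return [(record["uid"], record["iid"]) for record in records]
    if request_content_type == "text/csv":
        rows = [line.split(",") for line in request_body.splitlines() if line.strip()]
        # Assumes the model was trained with integer raw IDs:
        return [(int(row[0]), int(row[1])) for row in rows]
    raise ValueError("Unsupported content type: {}".format(request_content_type))


def predict_fn(input_data, model):
    """Estimate a rating for each (user_id, item_id) pair."""
    # Each surprise prediction is a tuple of (uid, iid, r_ui, est, details).
    return [model.predict(uid, iid) for uid, iid in input_data]
```

Keeping request parsing in `input_fn` means `predict_fn` only ever sees one data shape, regardless of which content type the caller used.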

In [None]:
script_path = "surprise-recommender.py"

estimator = SMSKLearnEstimator(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    output_path="s3://{}/{}".format(bucket, output_prefix),
    
    # possibly e.g. hyperparameters={ 'max_leaf_nodes': 30 }, if we had any
    
    # training on spot instances is an easy way to save cost:
    train_use_spot_instances=True,
    train_max_run=60*5, # 5 mins max actual run time
    train_max_wait=60*10 # 10 mins max wait for spot interruptions
)

# Instead of just specifying the training channel as an S3 path string, we can use s3_input to get more control:
train_channel = sagemaker.session.s3_input(
    "s3://{}/{}/{}".format(bucket, data_prefix, train_filename), 
    distribution="FullyReplicated",
    content_type="text/csv", 
    s3_data_type="S3Prefix"
)

# This will block until training is complete, showing console output below:
estimator.fit({ "train": train_channel })

### Test model with batch inference

Now that the model has been trained, we can use SageMaker Batch Transform to run it against a bulk data set, such as our test set:

In [None]:
# Define a SKLearn Transformer from the trained Estimator
transformer = estimator.transformer(
    instance_count=1, 
    instance_type="ml.m4.xlarge",
    assemble_with="Line",
    accept="text/csv",
    output_path="s3://{}/{}".format(bucket, results_prefix)
    # By default data will be processed in batches for speed: You could add strategy="SingleRecord"
)

# Start the inference job
transformer.transform(
    "s3://{}/{}/{}".format(bucket, data_prefix, test_filename),
    content_type="text/csv",
    split_type="Line",
    input_filter="$[0:1]" # Only send the first two columns (the input features UID & IID)
)

print("Waiting for transform job: {}".format(transformer.latest_transform_job.job_name))
transformer.wait()

In [None]:
# Download the raw output data from S3 to local filesystem
batch_output = transformer.output_path
!mkdir -p $results_prefix/
!aws s3 cp --recursive $batch_output/ $results_prefix/
# (Head to see what the batch output looks like)
!head $results_prefix/*

In [None]:
# There should be just one .out file, which we can load into a dataframe
results_df = pd.read_csv(
    os.path.join(results_prefix, "{}.out".format(test_filename)),
    names=["USER_ID", "ITEM_ID", "RATING_ACTUAL", "RATING_PREDICTED", "RESULT_METADATA"]
)
results_df.head()
# Note the RATING_ACTUAL field is empty: surprise gives us the option of passing in actuals, but we chose
# not to send them. You could join this dataframe to the underlying test CSV to match up the actuals.
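
One way to match predictions up with actuals is sketched below, together with a root mean squared error (RMSE) as a simple accuracy measure. `attach_actuals` and `rmse` are hypothetical helper names, and the toy frames only stand in for `results_df` and `test_df`; the column names are those used earlier in this notebook.

```python
import numpy as np
import pandas as pd


def attach_actuals(results, actuals):
    """Fill RATING_ACTUAL by joining predictions to known ratings on user and item."""
    known = actuals.rename(columns={"RATING": "RATING_ACTUAL"})
    return results.drop(columns=["RATING_ACTUAL"], errors="ignore").merge(
        known[["USER_ID", "ITEM_ID", "RATING_ACTUAL"]],
        on=["USER_ID", "ITEM_ID"],
        how="left",
    )


def rmse(scored):
    """Root mean squared error between predicted and actual ratings."""
    errors = scored["RATING_PREDICTED"] - scored["RATING_ACTUAL"]
    return float(np.sqrt((errors ** 2).mean()))


# Toy stand-ins for results_df and test_df:
toy_results = pd.DataFrame({
    "USER_ID": [1, 2],
    "ITEM_ID": [10, 20],
    "RATING_ACTUAL": [np.nan, np.nan],
    "RATING_PREDICTED": [4.0, 2.0],
})
toy_actuals = pd.DataFrame({"USER_ID": [1, 2], "ITEM_ID": [10, 20], "RATING": [5.0, 2.0]})

toy_scored = attach_actuals(toy_results, toy_actuals)
print(toy_scored)
print("RMSE:", rmse(toy_scored))
```

On the real data this would be called as `rmse(attach_actuals(results_df, test_df))`, assuming each user/item pair appears only once in the test set.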

### Deploy the model for online/realtime inference

As well as batch jobs, we can deploy our model as an API endpoint for real-time predictions:

In [None]:
# deploy() should print periodic "-"s while running, and a "!" when finished.
print("Deploying model...")
predictor_raw = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge"
    #content_type=sagemaker.content_types.CONTENT_TYPE_JSON
)

In [None]:
# Our entry point's input_fn() is set up to support both single and batch requests:
online_test_1 = (186, 377)
print("Sample prediction for {}: Result = \n{}".format(online_test_1, predictor_raw.predict(online_test_1)))

online_test_2 = [(186, 377), (697, 333), (308, 664)]
print("Sample prediction for {}: Result = \n{}".format(online_test_2, predictor_raw.predict(online_test_2)))

### Alternative API formats

The SKLearn Predictor defaults to raw/numpy array input and output because it's idiomatic for scikit-learn.

The deployed `endpoint` can actually accept and generate whichever content types you set it up to support.

Here we create an alternative `Predictor` object, pointing at the same deployed endpoint but demonstrating a more web-API-like JSON endpoint:

In [None]:
predictor_json = sagemaker.predictor.RealTimePredictor(
    endpoint=predictor_raw.endpoint,
    accept=sagemaker.content_types.CONTENT_TYPE_JSON,
    content_type=sagemaker.content_types.CONTENT_TYPE_JSON
)

online_test_3 = '{ "uid": 186, "iid": 377 }'
print("Sample prediction for {}: Result = \n{}".format(
    online_test_3,
    json.loads(predictor_json.predict(online_test_3))
))

online_test_4 = '[{ "uid": 186, "iid": 377 }, { "uid": 697, "iid": 333 }]'
print("Sample prediction for {}: Result = \n{}".format(
    online_test_4,
    json.loads(predictor_json.predict(online_test_4))
))
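
The JSON request bodies above are hand-written strings. In application code it is usually safer to build them from Python data with `json.dumps`, as in this minimal sketch. `build_payload` is a hypothetical helper; the `uid` and `iid` field names are the ones used in the requests above.

```python
import json


def build_payload(pairs):
    """Serialize (user_id, item_id) pairs as a JSON request body.

    A single pair becomes one JSON object; anything else becomes a JSON array.
    """
    records = [{"uid": uid, "iid": iid} for uid, iid in pairs]
    return json.dumps(records[0] if len(records) == 1 else records)


single_request = build_payload([(186, 377)])
batch_request = build_payload([(186, 377), (697, 333)])
print(single_request)
print(batch_request)
```

Either string could then be passed to `predictor_json.predict(...)` in place of the literal strings.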

### Clean up: delete the endpoint

Remember to clean up endpoint resources when no longer in use.

You may also like to do the following from the AWS console:

- Delete / clear out your bucket of data, models and results
- Stop this SageMaker notebook instance

In [None]:
predictor_raw.delete_endpoint()

### Next steps

As of 10th June 2019, AWS also offers the [Amazon Personalize]() managed service for recommender engine AutoML, including modern algorithms that can sometimes significantly outperform the approaches used here.

This sample uses the same MovieLens data set as the [Amazon Personalize sample](https://github.com/aws-samples/amazon-personalize-samples), to help you experiment comparing the two!