# Building a Recommender System with Amazon SageMaker Factorization Machines

---


## Background

- Recommender systems were a catalyst for ML's popularity (Amazon, Netflix Prize)
- User-item matrix factorization is a core methodology
- Factorization machines combine linear prediction with a factorized representation of pairwise feature interactions

$$\hat{r} = w_0 + \sum_{i} {w_i x_i} + \sum_{i} {\sum_{j > i} {\langle v_i, v_j \rangle x_i x_j}}$$

- SageMaker has a highly scalable factorization machines algorithm built-in
- To learn more about the math behind _factorization machines_, [this paper](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) is a great resource
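To make the formula above concrete, here is a minimal NumPy sketch (not the SageMaker implementation) that evaluates $\hat{r}$ two ways: literally, with a double loop over feature pairs, and with the linear-time reformulation from the paper linked above. The function names, the toy dimensions, and the random parameters are assumptions made for illustration only.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Evaluate the formula literally: bias + linear term + all pairwise interactions."""
    n = len(x)
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            pairwise += np.dot(V[i], V[j]) * x[i] * x[j]
    return w0 + np.dot(w, x) + pairwise

def fm_predict_fast(x, w0, w, V):
    """Same value in O(k * n): 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    xv = V.T @ x                    # shape (k,)
    x2v2 = (V ** 2).T @ (x ** 2)    # shape (k,)
    return w0 + np.dot(w, x) + 0.5 * np.sum(xv ** 2 - x2v2)

# Toy parameters: 6 features, 3 factors (hypothetical values).
rng = np.random.default_rng(0)
n_features, k = 6, 3
w0 = 0.5
w = rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))

# One-hot user, one-hot item, and one numeric feature, as in this notebook.
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 2.5])

naive = fm_predict_naive(x, w0, w, V)
fast = fm_predict_fast(x, w0, w, V)
print(naive, fast)
```

Both functions should agree up to floating-point error; the second form is what makes factorization machines practical on very wide, sparse inputs.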

In [25]:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region_name = boto3.Session().region_name

In [26]:
base = 'recommender'
prefix = 'sagemaker/' + base

In [27]:
import sagemaker
import os
import pandas as pd
import numpy as np
import boto3
import json
import io
import matplotlib.pyplot as plt
import sagemaker.amazon.common as smac
from sagemaker.predictor import json_deserializer
from scipy.sparse import csr_matrix

# Download Dataset

[Amazon Reviews AWS Public Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
- 1 to 5 star ratings
- 2M+ Amazon customers
- 160K+ digital videos 

Dataset columns:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.

In [28]:
!mkdir -p /tmp/recsys/
!aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz /tmp/recsys/

download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz to ../../../../../tmp/recsys/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz


In [29]:
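# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='skip' instead
df = pd.read_csv('/tmp/recsys/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz', delimiter='\t', error_bad_lines=False)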
df.head()

b'Skipping line 92523: expected 15 fields, saw 22\n'
b'Skipping line 343254: expected 15 fields, saw 22\n'
b'Skipping line 524626: expected 15 fields, saw 22\n'
b'Skipping line 623024: expected 15 fields, saw 22\n'
b'Skipping line 977412: expected 15 fields, saw 22\n'
b'Skipping line 1496867: expected 15 fields, saw 22\n'
b'Skipping line 1711638: expected 15 fields, saw 22\n'
b'Skipping line 1787213: expected 15 fields, saw 22\n'
b'Skipping line 2395306: expected 15 fields, saw 22\n'
b'Skipping line 2527690: expected 15 fields, saw 22\n'


| |marketplace|customer_id|review_id|product_id|product_parent|product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|review_headline|review_body|review_date|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|US|12190288|R3FU16928EP5TC|B00AYB1482|668895143|Enlightened: Season 1|Digital_Video_Download|5|0|0|N|Y|I loved it and I wish there was a season 3|I loved it and I wish there was a season 3... ...|2015-08-31|
|1|US|30549954|R1IZHHS1MH3AQ4|B00KQD28OM|246219280|Vicious|Digital_Video_Download|5|0|0|N|Y|As always it seems that the best shows come fr...|As always it seems that the best shows come fr...|2015-08-31|
|2|US|52895410|R52R85WC6TIAH|B01489L5LQ|534732318|After Words|Digital_Video_Download|4|17|18|N|Y|Charming movie|This movie isn't perfect, but it gets a lot of...|2015-08-31|
|3|US|27072354|R7HOOYTVIB0DS|B008LOVIIK|239012694|Masterpiece: Inspector Lewis Season 5|Digital_Video_Download|5|0|0|N|Y|Five Stars|excellant this is what tv should be|2015-08-31|
|4|US|26939022|R1XQ2N5CDOZGNX|B0094LZMT0|535858974|On The Waterfront|Digital_Video_Download|5|0|0|N|Y|Brilliant film from beginning to end|Brilliant film from beginning to end. All of t...|2015-08-31|


# Drop some fields that won't be used

In [30]:
df = df[['customer_id', 'product_id', 'product_title', 'star_rating', 'review_date']]

# Most users don't rate most movies - Check our long tail

In [31]:
customers = df['customer_id'].value_counts()
products = df['product_id'].value_counts()

quantiles = [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99, 1]
print('customers\n', customers.quantile(quantiles))
print('products\n', products.quantile(quantiles))

customers
 0.00       1.0
0.01       1.0
0.02       1.0
0.03       1.0
0.04       1.0
0.05       1.0
0.10       1.0
0.25       1.0
0.50       1.0
0.75       2.0
0.90       4.0
0.95       5.0
0.96       6.0
0.97       7.0
0.98       9.0
0.99      13.0
1.00    2704.0
Name: customer_id, dtype: float64
products
 0.00        1.00
0.01        1.00
0.02        1.00
0.03        1.00
0.04        1.00
0.05        1.00
0.10        1.00
0.25        1.00
0.50        3.00
0.75        9.00
0.90       31.00
0.95       73.00
0.96       95.00
0.97      130.00
0.98      199.00
0.99      386.67
1.00    32790.00
Name: product_id, dtype: float64


# Filter out customers who haven't rated many movies, and movies with few ratings

In [32]:
customers = customers[customers >= 5]
products = products[products >= 10]

reduced_df = df.merge(pd.DataFrame({'customer_id': customers.index})).merge(pd.DataFrame({'product_id': products.index}))
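The two inner joins above act as a filter: only rows whose customer and product both survive the count thresholds are kept. A roughly equivalent way to express that, shown here as a sketch on made-up toy data with toy thresholds (the notebook uses 5 and 10), is a boolean mask built with `isin`. Note that in both versions the counts are taken before filtering, so a surviving customer can end up with fewer rows than the threshold once sparse products are dropped.

```python
import pandas as pd

# Made-up toy data; column names mirror the notebook.
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 3, 3],
    'product_id':  ['a', 'b', 'c', 'a', 'a', 'b'],
})
min_customer_reviews, min_product_reviews = 2, 2  # toy thresholds

customer_counts = toy['customer_id'].value_counts()
product_counts = toy['product_id'].value_counts()
keep_customers = customer_counts[customer_counts >= min_customer_reviews].index
keep_products = product_counts[product_counts >= min_product_reviews].index

# Keep a row only if both its customer and its product pass the threshold.
mask = toy['customer_id'].isin(keep_customers) & toy['product_id'].isin(keep_products)
filtered = toy[mask]
print(filtered)
```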

# Create a sequential index for customers and movies

In [33]:
customers = reduced_df['customer_id'].value_counts()
products = reduced_df['product_id'].value_counts()

In [34]:
customer_index = pd.DataFrame({'customer_id': customers.index, 'user': np.arange(customers.shape[0])})
product_index = pd.DataFrame({'product_id': products.index, 
                              'item': np.arange(products.shape[0]) + customer_index.shape[0]})

reduced_df = reduced_df.merge(customer_index).merge(product_index)
reduced_df.head()

| |customer_id|product_id|product_title|star_rating|review_date|user|item|
|---|---|---|---|---|---|---|---|
|0|27072354|B008LOVIIK|Masterpiece: Inspector Lewis Season 5|5|2015-08-31|10463|140450|
|1|16030865|B008LOVIIK|Masterpiece: Inspector Lewis Season 5|5|2014-06-20|489|140450|
|2|44025160|B008LOVIIK|Masterpiece: Inspector Lewis Season 5|5|2014-05-27|32100|140450|
|3|18602179|B008LOVIIK|Masterpiece: Inspector Lewis Season 5|5|2014-12-23|2237|140450|
|4|14424972|B008LOVIIK|Masterpiece: Inspector Lewis Season 5|5|2015-08-31|32340|140450|
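The `item` index is offset by the number of customers so that user and movie columns occupy disjoint ranges of one shared feature space. The following sketch repeats the indexing step on a made-up six-row frame (the `toy_*` names are hypothetical) to show that the two ranges never overlap.

```python
import numpy as np
import pandas as pd

# Made-up toy data with 3 customers and 3 products.
toy = pd.DataFrame({'customer_id': [11, 11, 42, 42, 42, 7],
                    'product_id': ['x', 'y', 'x', 'y', 'z', 'x']})
toy_customers = toy['customer_id'].value_counts()
toy_products = toy['product_id'].value_counts()

toy_customer_index = pd.DataFrame({'customer_id': toy_customers.index,
                                   'user': np.arange(toy_customers.shape[0])})
# Items start right after the last user column.
toy_product_index = pd.DataFrame({'product_id': toy_products.index,
                                  'item': np.arange(toy_products.shape[0]) + toy_customer_index.shape[0]})

toy = toy.merge(toy_customer_index).merge(toy_product_index)
print(toy)
```

With 3 customers and 3 products, `user` takes values 0 to 2 and `item` takes values 3 to 5.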


# Count days since first review (included as a feature to capture trend)

In [35]:
reduced_df['review_date'] = pd.to_datetime(reduced_df['review_date'])
customer_first_date = reduced_df.groupby('customer_id')['review_date'].min().reset_index()
customer_first_date.columns = ['customer_id', 'first_review_date']

In [36]:
reduced_df = reduced_df.merge(customer_first_date)
reduced_df['days_since_first'] = (reduced_df['review_date'] - reduced_df['first_review_date']).dt.days
reduced_df['days_since_first'] = reduced_df['days_since_first'].fillna(0)
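As a quick check of what `days_since_first` contains, the sketch below computes it on three made-up reviews. It uses `groupby(...).transform('min')` instead of the merge above, which is an alternative formulation and not what the notebook runs; the result should be the same.

```python
import pandas as pd

# Made-up toy reviews for two customers.
toy = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'review_date': pd.to_datetime(['2015-01-01', '2015-01-29', '2014-06-20']),
})

# transform('min') broadcasts each customer's first review date back onto every row.
first_review = toy.groupby('customer_id')['review_date'].transform('min')
toy['days_since_first'] = (toy['review_date'] - first_review).dt.days
print(toy)
```

Each customer's first review gets 0, and the second review of customer 1 gets 28.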

# Split into train and test datasets

In [37]:
test_df = reduced_df.groupby('customer_id').last().reset_index()

train_df = reduced_df.merge(test_df[['customer_id', 'product_id']], 
                            on=['customer_id', 'product_id'], 
                            how='outer', 
                            indicator=True)
train_df = train_df[(train_df['_merge'] == 'left_only')]
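Note that `groupby('customer_id').last()` picks the last row per customer in the frame's current row order, which is not necessarily the most recent review. If a chronological hold-out is what is wanted (an assumption about intent, not something the notebook states), one possible variant is to sort by date first. The sketch below does that on made-up data.

```python
import pandas as pd

# Made-up toy data, deliberately out of chronological order.
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'product_id': ['a', 'b', 'c', 'a', 'd'],
    'review_date': pd.to_datetime(['2015-03-01', '2015-01-01', '2015-02-01',
                                   '2014-05-01', '2014-06-01']),
})

toy = toy.sort_values(['customer_id', 'review_date'])
# After sorting, the last occurrence of each customer is their most recent review.
is_last = ~toy.duplicated('customer_id', keep='last')
toy_test = toy[is_last]
toy_train = toy[~is_last]
print(toy_test)
```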

# Create Sparse Matrices

- Factorization machines expect data that looks something like this:
  - Sparse matrix
  - Target variable is that user's rating for a movie
  - One-hot encoding for users ($N$ features)
  - One-hot encoding for movies ($M$ features)

|Rating|User1|User2|...|UserN|Movie1|Movie2|Movie3|...|MovieM|Feature1|Feature2|...|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|4|1|0|...|0|1|0|0|...|0|20|2.2|...|
|5|1|0|...|0|0|1|0|...|0|17|9.1|...|
|3|0|1|...|0|1|0|0|...|0|3|11.0|...|
|4|0|1|...|0|0|0|1|...|0|15|6.4|...|


- Wouldn't want to hold this full matrix in memory
  - Create a sparse matrix
  - Designed to work efficiently with CPUs. Some parts of training for more dense matrices can be parallelized with GPUs
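To see the layout from the table above in miniature, this sketch builds a two-row sparse matrix for a hypothetical setup with 3 users and 4 movies, then expands it to dense form for inspection only. All values are made up; the construction mirrors the coordinate-style `(data, (row, col))` input that `scipy.sparse.csr_matrix` accepts.

```python
import numpy as np
from scipy.sparse import csr_matrix

num_users, num_items = 3, 4                 # hypothetical sizes
feature_dim = num_users + num_items + 1     # +1 for days_since_first

users = np.array([0, 2])                    # user column per row
items = np.array([3, 6])                    # item column per row, already offset by num_users
days = np.array([20.0, 3.0])                # days_since_first per row
n = len(users)

# Three stored entries per row: the user flag, the item flag, and the numeric feature.
data = np.concatenate([np.ones(n), np.ones(n), days])
row = np.concatenate([np.arange(n)] * 3)
col = np.concatenate([users, items, np.full(n, feature_dim - 1)])

toy_csr = csr_matrix((data, (row, col)), shape=(n, feature_dim), dtype=np.float32)
dense = toy_csr.toarray()
print(dense)
print('stored entries:', toy_csr.nnz, 'of', dense.size)
```

Only 6 of the 16 cells are stored here; at full scale the gap between stored and total cells is what makes the sparse format necessary.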

In [38]:
def to_csr_matrix(df, num_users, num_items):
    feature_dim = num_users + num_items + 1
    data = np.concatenate([np.array([1] * df.shape[0]),
                           np.array([1] * df.shape[0]),
                           df['days_since_first'].values])
    row = np.concatenate([np.arange(df.shape[0])] * 3)
    col = np.concatenate([df['user'].values,
                          df['item'].values,
                          np.array([feature_dim - 1] * df.shape[0])])
    return csr_matrix((data, (row, col)), 
                      shape=(df.shape[0], feature_dim), 
                      dtype=np.float32)

In [39]:
def to_s3_protobuf(csr, label, bucket, prefix, channel, splits):
    indices = np.array_split(np.arange(csr.shape[0]), splits)
    for i in range(len(indices)):
        index = indices[i]
        buf = io.BytesIO()
        smac.write_spmatrix_to_sparse_tensor(buf, csr[index, ], label[index])
        buf.seek(0)
        boto3.client('s3').upload_fileobj(buf, bucket, '{}/{}/data-{}'.format(prefix, channel, i))
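The `splits` argument controls how many S3 objects the data is written to. A small sketch of just the splitting and key-naming logic, with no S3 access and with a hypothetical prefix, shows how `np.array_split` divides row indices into nearly equal shards:

```python
import numpy as np

num_rows, splits = 10, 3
prefix, channel = 'my-prefix', 'train'   # hypothetical values for illustration

# array_split tolerates uneven division: earlier shards get the extra rows.
shards = np.array_split(np.arange(num_rows), splits)
keys = ['{}/{}/data-{}'.format(prefix, channel, i) for i in range(len(shards))]
sizes = [len(shard) for shard in shards]

for key, size in zip(keys, sizes):
    print(key, size)
```

Writing several objects per channel matters when training on more than one instance with `ShardedByS3Key`, since the objects are what gets divided among instances.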

# Convert to sparse recordIO-wrapped protobuf that SageMaker factorization machines expects

## Train Dataset

In [40]:
train_csr = to_csr_matrix(train_df, customer_index.shape[0], product_index.shape[0])

In [41]:
to_s3_protobuf(train_csr, train_df['star_rating'].values.astype(np.float32), bucket, prefix, channel='train', splits=10)

## Test Dataset

In [42]:
test_csr = to_csr_matrix(test_df, customer_index.shape[0], product_index.shape[0])

In [43]:
to_s3_protobuf(test_csr, test_df['star_rating'].values.astype(np.float32), bucket, prefix, channel='test', splits=1)

# Train the Model

- Create a [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) estimator to run a training job and specify:
  - Algorithm container image
  - IAM role
  - Hardware setup
  - S3 output location
  - Algorithm hyperparameters
    - `feature_dim`: $N + M + 1$ (additional feature is `days_since_first` to capture trend)
    - `num_factors`: number of factor dimensions (increasing too much can lead to overfitting)
    - `epochs`: number of full passes through the dataset
- `.fit()` points to training and test data in S3 and begins the training job
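One way to reason about `num_factors` is to count parameters. From the model equation, there is one bias $w_0$, one linear weight per feature, and one factor vector of length `num_factors` per feature. The helper below is a hypothetical illustration of that arithmetic, using a made-up `feature_dim` of 1000 instead of the notebook's actual value.

```python
def fm_parameter_count(feature_dim, num_factors):
    """Bias (1) + linear weights (feature_dim) + factor matrix (feature_dim * num_factors)."""
    return 1 + feature_dim + feature_dim * num_factors

# Hypothetical feature_dim; the notebook uses N + M + 1.
small = fm_parameter_count(feature_dim=1000, num_factors=64)
large = fm_parameter_count(feature_dim=1000, num_factors=256)
print(small, large)
```

The parameter count grows linearly in `num_factors`, so with many users who have only a handful of ratings each, a large value gives the model a lot of capacity per observed rating.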

**Note**: For AWS accounts registered in conjunction with a workshop, default instance limits may prevent the use of `ml.c5.2xlarge` (and other equally powerful instances), and may require a lower instance count depending on the instance type chosen.

In [57]:
# # https://github.com/aws/sagemaker-python-sdk/issues/1985
# # container = sagemaker.image_uris.retrieve(region_name, "blazingtext", "latest")

# image_uri = ''

# if region_name == 'us-west-1':
#     image_uri = '632365934929.dkr.ecr.us-west-1.amazonaws.com'

# if region_name == 'us-west-2':
#     image_uri = '433757028032.dkr.ecr.us-west-2.amazonaws.com'
    
# if region_name =='us-east-1':
#     image_uri = '811284229777.dkr.ecr.us-east-1.amazonaws.com'

# if region_name == 'us-east-2':
#     image_uri = '825641698319.dkr.ecr.us-east-2.amazonaws.com'

# if region_name =='ap-east-1':
#     image_uri = '286214385809.dkr.ecr.ap-east-1.amazonaws.com'

# if region_name == 'ap-northeast-1':
#     image_uri = '501404015308.dkr.ecr.ap-northeast-1.amazonaws.com'

# if region_name == 'ap-northeast-2':
#     image_uri = '306986355934.dkr.ecr.ap-northeast-2.amazonaws.com'

# if region_name == 'ap-south-1':
#     image_uri = '991648021394.dkr.ecr.ap-south-1.amazonaws.com'

# if region_name == 'ap-southeast-1':
#     image_uri = '475088953585.dkr.ecr.ap-southeast-1.amazonaws.com'

# if region_name == 'ap-southeast-2':
#     image_uri = '544295431143.dkr.ecr.ap-southeast-2.amazonaws.com'

# if region_name == 'ca-central-1':
#     image_uri = '469771592824.dkr.ecr.ca-central-1.amazonaws.com'

# if region_name == 'cn-north-1':
#     image_uri = '390948362332.dkr.ecr.cn-north-1.amazonaws.com.cn'

# if region_name == 'cn-northwest-1':
#     image_uri = '387376663083.dkr.ecr.cn-northwest-1.amazonaws.com.cn'

# if region_name == 'eu-central-1': 
#     image_uri = '813361260812.dkr.ecr.eu-central-1.amazonaws.com'

# if region_name == 'eu-north-1':
#     image_uri = '669576153137.dkr.ecr.eu-north-1.amazonaws.com'

# if region_name == 'eu-west-1':
#     image_uri = '685385470294.dkr.ecr.eu-west-1.amazonaws.com'

# if region_name == 'eu-west-2':
#     image_uri = '644912444149.dkr.ecr.eu-west-2.amazonaws.com'

# if region_name == 'eu-west-3':
#     image_uri = '749696950732.dkr.ecr.eu-west-3.amazonaws.com'

# if region_name == 'me-south-1':
#     image_uri = '249704162688.dkr.ecr.me-south-1.amazonaws.com'
    
# if region_name == 'sa-east-1':
#     image_uri = '855470959533.dkr.ecr.sa-east-1.amazonaws.com'

# if region_name == 'us-gov-west-1':
#     image_uri = '226302683700.dkr.ecr.us-gov-west-1.amazonaws.com'
    
# # https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
# image_uri = '{}/factorization-machines:1'.format(image_uri)

# print('Using SageMaker container: {} ({})'.format(image_uri, region_name))


Using SageMaker container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest (us-east-1)


In [None]:
# Resolves to 382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:1 in us-east-1
image_uri = sagemaker.image_uris.retrieve('factorization-machines', region_name)

In [53]:
# from sagemaker.amazon.amazon_estimator import get_image_uri

# image_uri = get_image_uri(region_name, "factorization-machines")
# print(image_uri)

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:1


In [54]:
# sagemaker>=2 renamed train_instance_count / train_instance_type to instance_count / instance_type
fm = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    base_job_name=base,
    sagemaker_session=sagemaker_session)

fm.set_hyperparameters(
    feature_dim=customer_index.shape[0] + product_index.shape[0] + 1,
    predictor_type='regressor',
    mini_batch_size=1000,
    num_factors=256,
    epochs=3)


In [55]:
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(s3_data='s3://{}/{}/train/'.format(bucket, prefix), 
                            distribution='ShardedByS3Key')

test_input = TrainingInput(s3_data='s3://{}/{}/test/'.format(bucket, prefix), 
                            distribution='ShardedByS3Key')

In [56]:
fm.fit({'train': train_input, 
        'test': test_input})

2020-11-08 21:10:15 Starting - Starting the training job...
2020-11-08 21:10:20 Starting - Launching requested ML instances.........
2020-11-08 21:11:50 Starting - Preparing the instances for training...............
2020-11-08 21:14:34 Downloading - Downloading input data...
2020-11-08 21:15:16 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from numpy.testing import nosetester
[11/08/2020 21:15:17 INFO 139620961433408] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'nor

[11/08/2020 21:15:41 INFO 139620961433408] Iter[1] Batch [500]#011Speed: 72688.85 samples/sec
[11/08/2020 21:15:41 INFO 139620961433408] #quality_metric: host=algo-1, epoch=1, batch=500 train rmse <loss>=1.0475453634
[11/08/2020 21:15:41 INFO 139620961433408] #quality_metric: host=algo-1, epoch=1, batch=500 train mse <loss>=1.09735128839
[11/08/2020 21:15:41 INFO 139620961433408] #quality_metric: host=algo-1, epoch=1, batch=500 train absolute_loss <loss>=0.820659577939
[11/08/2020 21:15:48 INFO 139620961433408] Iter[1] Batch [1000]#011Speed: 71265.85 samples/sec
[11/08/2020 21:15:48 INFO 139620961433408] #quality_metric: host=algo-1, epoch=1, batch=1000 train rmse <loss>=1.08773311719
[11/08/2020 21:15:48 INFO 139620961433408] #quality_metric: host=algo-1, epoch=1, batch=1000 train mse <loss>=1.18316333422
[11/08/2020 21:15:48 INFO 139620961433408] #quality_metric: host=algo-1, epoch=1, batch=1000 train absolute_loss <


2020-11-08 21:16:27 Uploading - Uploading generated training model
2020-11-08 21:16:54 Completed - Training job completed
Training seconds: 140
Billable seconds: 140
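The `#quality_metric` lines in the log can be read programmatically. The sketch below copies four of the lines shown above (timestamps dropped), extracts the values with a regular expression written for this illustration, and confirms that the reported `rmse` is the square root of the reported `mse`.

```python
import math
import re

# Copied from the training log above, without the timestamp prefix.
log_lines = [
    "#quality_metric: host=algo-1, epoch=1, batch=500 train rmse <loss>=1.0475453634",
    "#quality_metric: host=algo-1, epoch=1, batch=500 train mse <loss>=1.09735128839",
    "#quality_metric: host=algo-1, epoch=1, batch=1000 train rmse <loss>=1.08773311719",
    "#quality_metric: host=algo-1, epoch=1, batch=1000 train mse <loss>=1.18316333422",
]

pattern = re.compile(r"batch=(\d+) train (\w+) <loss>=([\d.]+)")
metrics = {}
for line in log_lines:
    match = pattern.search(line)
    if match:
        batch = int(match.group(1))
        metrics.setdefault(batch, {})[match.group(2)] = float(match.group(3))

for batch, values in sorted(metrics.items()):
    print(batch, values['rmse'], math.sqrt(values['mse']))
```

An RMSE of roughly 1.05 to 1.09 means the training predictions are off by about one star on average in the root-mean-square sense.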


# Host the Endpoint

Deploy trained model to a real-time production endpoint

In [66]:
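from sagemaker.serializers import JSONSerializer

class FMSerializer(JSONSerializer):
    """Serialize DataFrame rows into the sparse JSON format the endpoint expects."""
    def serialize(self, df):
        feature_dim = customer_index.shape[0] + product_index.shape[0] + 1
        js = {'instances': []}
        for index, data in df.iterrows():
            # Cast to built-in types so json.dumps never sees NumPy scalars
            js['instances'].append({'data': {'features': {'values': [1, 1, float(data['days_since_first'])],
                                                          'keys': [int(data['user']), int(data['item']), feature_dim - 1],
                                                          'shape': [feature_dim]}}})
        return json.dumps(js)

# sagemaker>=2 expects a serializer object with a serialize() method, not a bare function
fm_serializer = FMSerializer()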

In [None]:
fm_predictor = fm.deploy(instance_type='ml.m4.xlarge', 
                         initial_instance_count=1,
                         serializer=fm_serializer,
                         deserializer=sagemaker.deserializers.JSONDeserializer())

-----

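# Set up the predictor request handler
Serialize the request data to match what the model expects. With SageMaker Python SDK v2, the serializer and deserializer are passed to `deploy()` above, so the v1-style assignments below remain commented out.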

In [None]:
#fm_predictor.content_type = 'application/json'
#fm_predictor.serializer = fm_serializer
#fm_predictor.deserializer = json_deserializer

# Show some test data

In [None]:
test_df.head(25)

# Pick a single customer from the dataset

In [None]:
test_customer = test_df.iloc[[20]]
test_df.iloc[[20]] # peek at the data to confirm it's the one we wanted

# Pass `test_customer` to predictor

In [None]:
fm_predictor.predict(test_customer)

# Make a dataframe for an arbitrary customer and movie pair and test it out!

Our `fm_serializer` requires 3 inputs to perform a prediction:
 - `user` id for a customer (type = num)
 - `item` id for a movie (type = num)
 - `days_since_first` review (type = double)
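To show what those three inputs turn into on the wire, here is a standalone sketch of the sparse JSON request body for one row. `build_fm_payload` is a hypothetical helper written for this illustration, and `feature_dim=8` is a toy value; the real request uses $N + M + 1$.

```python
import json

def build_fm_payload(rows, feature_dim):
    """rows: iterable of (user, item, days_since_first) tuples."""
    instances = []
    for user, item, days in rows:
        instances.append({'data': {'features': {
            # Column positions of the non-zero entries ...
            'keys': [int(user), int(item), feature_dim - 1],
            # ... and their values: two one-hot flags plus the numeric feature.
            'values': [1, 1, float(days)],
            'shape': [feature_dim],
        }}})
    return json.dumps({'instances': instances})

payload = build_fm_payload([(2, 5, 28.0)], feature_dim=8)
decoded = json.loads(payload)
print(payload)
```

Only three positions per row are transmitted, regardless of how large `feature_dim` is.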

In [None]:
fake_customer = test_customer.copy() # make a real copy of the test_customer we pulled out before, so edits don't touch the original
desired_user_id = 65884 # person who rated Dexter with 5 stars
desired_item_id = 140461 # Code for True Blood: Season 1
desired_review_days = 28.0 # arbitrary number of days since first review

#fake_customer_data = {'user' : desired_user_id, 'item' : desired_item_id, 'days_since_first' : desired_review_days}
#fake_customer = pd.DataFrame(fake_customer_data, index=[0])
fake_customer['user'] = desired_user_id
fake_customer['item'] = desired_item_id
fake_customer['days_since_first'] = desired_review_days

# print the details for this fake customer
fake_customer

In [None]:
fm_predictor.predict(fake_customer)
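The deserialized response is a plain Python dictionary. Assuming the regressor returns a `predictions` list with one `score` per instance (the layout described in the SageMaker factorization machines documentation; verify against your own endpoint's output), a small hypothetical helper can pull out the scores and clip them to the valid 1 to 5 star range. The sample response below is made up.

```python
def extract_scores(response, low=1.0, high=5.0):
    """Return one predicted rating per instance, clipped to the star-rating range."""
    return [min(max(p['score'], low), high) for p in response['predictions']]

# Made-up response for illustration; real values come from fm_predictor.predict(...).
sample_response = {'predictions': [{'score': 4.37}, {'score': 5.6}]}
scores = extract_scores(sample_response)
print(scores)
```

Clipping is a presentation choice: the regressor is not constrained to the 1 to 5 range, so raw scores can fall slightly outside it.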

# Clean up the endpoint

In [None]:
# fm_predictor.delete_endpoint()