# Introduction

This notebook outlines how to explain the results of a recommendation system built using a Factorization Machine (FM) model in Amazon SageMaker.

There are four parts to this notebook:

1. Building a FM Model
2. Extracting FM model parameters
3. Building the influence matrix
4. Explaining recommendations for a user

## Data sources and citations

I used three primary sources for this notebook.

### How to build and extract FM model

[Extending Amazon SageMaker factorization machines algorithm to predict top x recommendations](https://aws.amazon.com/blogs/machine-learning/extending-amazon-sagemaker-factorization-machines-algorithm-to-predict-top-x-recommendations/), published on the AWS Machine Learning Blog by Zohar Karnin and Rama Thamman on April 5, 2019. This blog post includes a sample notebook that builds the FM model for the MovieLens dataset and extracts the FM model parameters. I repeat Parts 1 and 2 from that notebook here so that the entire workflow can be reproduced in a single notebook.

### How to build influence matrix

I implemented the technique in this paper for building the influence matrix:

Bashir Rastegarpanah, Mark Crovella, Krishna Gummadi. 2017. "Exploring Explanations for Matrix Factorization Recommender Systems (Position Paper)." Proceedings of the FATREC Workshop on Responsible Recommendation.  Retrieved on October 7, 2019, from https://hdl.handle.net/2144/26683.


### Data set

The blog published by Karnin and Thamman uses the GroupLens movie dataset, available on https://grouplens.org/datasets/movielens/.  Per the terms of reuse, we do not redistribute the data set here, but rather provide code to download it.  The dataset formal citation is:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

## Part 1 - Building an FM model using the MovieLens dataset

This section is reproduced with minor modifications from the blog cited above.  I include it for completeness so you can see how to build the FM model from the source data set.

Be sure to customize the name of the S3 bucket used to upload the data set for FM training.  

In [1]:
import os
import boto3
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
region_name = boto3.Session().region_name

data_prefix = 'sagemaker/factorization-machines/movielens/data'

In [2]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
import boto3, io, os

### Download movie rating data from MovieLens

In [3]:
#download data
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2020-11-08 21:37:08--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.4’


2020-11-08 21:37:08 (13.9 MB/s) - ‘ml-100k.zip.4’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating:

### Shuffle the data

In [4]:
!shuf ml-100k/ua.base -o ml-100k/ua.base.shuffled

### Load Training Data

In [5]:
user_movie_ratings_train = pd.read_csv('ml-100k/ua.base.shuffled', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_train.head(5)

Unnamed: 0,user_id,movie_id,rating
0,756,258,3
1,795,151,3
2,392,272,5
3,661,433,5
4,234,483,5


### Load Test Data

In [6]:
user_movie_ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_test.head(5)

Unnamed: 0,user_id,movie_id,rating
0,1,20,4
1,1,33,4
2,1,61,4
3,1,117,3
4,1,155,2


In [7]:
nb_users= user_movie_ratings_train['user_id'].max()
nb_movies=user_movie_ratings_train['movie_id'].max()
nb_features=nb_users+nb_movies
nb_ratings_test=len(user_movie_ratings_test.index)
nb_ratings_train=len(user_movie_ratings_train.index)
print(" # of users: ", nb_users)
print(" # of movies: ", nb_movies)
print(" Training Count: ", nb_ratings_train)
print(" Test Count: ", nb_ratings_test)
print(" Features (# of users + # of movies): ", nb_features)

 # of users:  943
 # of movies:  1682
 Training Count:  90570
 Test Count:  9430
 Features (# of users + # of movies):  2625


### FM Input

Input to FM is a one-hot encoded sparse matrix: each row has a 1 in the column for the user and a 1 in the column for the movie. The label is binary: ratings of 4 and above are labeled 1 (liked), and ratings of 3 and below are labeled 0.
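To make the column layout concrete, here is a minimal sketch of how a single rating row maps to the two non-zero columns and the label. The helper name `encode_rating` is hypothetical and is not part of the notebook; it only mirrors the indexing used in `loadDataset` below.

```python
# Hypothetical helper (not part of the notebook): returns the two non-zero
# column indices and the binary label for one (user, movie, rating) row.
def encode_rating(user_id, movie_id, rating, nb_users):
    user_col = user_id - 1                 # IDs start at 1, columns start at 0
    movie_col = nb_users + (movie_id - 1)  # movie columns follow the user columns
    label = 1 if rating >= 4 else 0        # ratings of 4 and 5 count as "liked"
    return user_col, movie_col, label


nb_users = 943  # number of users reported for this data set earlier in the notebook

# First row of the test set shown above: user 1, movie 20, rating 4
print(encode_rating(user_id=1, movie_id=20, rating=4, nb_users=nb_users))
```

Every row therefore has exactly two non-zero entries, which is why a sparse matrix is used.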

In [8]:
def loadDataset(df, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    for index, row in df.iterrows():
            X[line,row['user_id']-1] = 1
            X[line, nb_users+(row['movie_id']-1)] = 1
            if int(row['rating']) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1

    Y=np.array(Y).astype('float32')            
    return X,Y


X_train, Y_train = loadDataset(user_movie_ratings_train, nb_ratings_train, nb_features)
X_test, Y_test = loadDataset(user_movie_ratings_test, nb_ratings_test, nb_features)

In [9]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nb_ratings_train, nb_features)
assert Y_train.shape == (nb_ratings_train, )
one_labels = np.count_nonzero(Y_train)  # labels are 0/1, so the non-zero entries are the ones
print("Training labels: %d zeros, %d ones" % (nb_ratings_train-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nb_ratings_test, nb_features)
assert Y_test.shape  == (nb_ratings_test, )
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nb_ratings_test-one_labels, one_labels))

(90570, 2625)
(90570,)
Training labels: 40664 zeros, 49906 ones
(9430, 2625)
(9430,)
Test labels: 3961 zeros, 5469 ones


### Convert to Protobuf format for saving to S3

In [10]:
prefix = 'fm'

if bucket.strip() == '':
    raise RuntimeError("bucket name is empty.")

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [11]:
def writeDatasetToProtobuf(X, bucket, prefix, key, d_type, Y=None):
    buf = io.BytesIO()
    if d_type == "sparse":
        smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    else:
        smac.write_numpy_to_dense_tensor(buf, X, labels=Y)
        
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
fm_train_data_path = writeDatasetToProtobuf(X_train, bucket, train_prefix, train_key, "sparse", Y_train)    
fm_test_data_path  = writeDatasetToProtobuf(X_test, bucket, test_prefix, test_key, "sparse", Y_test)    
  
print("Training data S3 path: ",fm_train_data_path)
print("Test data S3 path: ",fm_test_data_path)
print("FM model output S3 path: {}".format(output_prefix))

Training data S3 path:  s3://sagemaker-us-east-1-835319576252/fm/train/train.protobuf
Test data S3 path:  s3://sagemaker-us-east-1-835319576252/fm/test/test.protobuf
FM model output S3 path: s3://sagemaker-us-east-1-835319576252/fm/output


### Run training job

You can experiment with the hyperparameters until you are happy with the predictions. For this dataset and hyperparameter configuration, after 200 epochs, test accuracy was around 70% on average and the F1 score (a typical metric for a binary classifier) was around 0.75 (1 indicates a perfect classifier). Not great, but you can fine-tune the model further.
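As a reminder of what those two metrics measure, the sketch below computes accuracy and F1 for a binary classifier from scratch. The function name and the toy labels are made up for illustration; they are not output from the training job.

```python
# Illustrative only: accuracy and F1 for binary labels, computed by hand.
def accuracy_and_f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0               # harmonic mean of precision and recall
    return accuracy, f1


# Made-up labels and predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy:", accuracy, "F1:", f1)
```

F1 ignores true negatives, so it can sit above accuracy when the positive class is the majority, as it is here.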

If you've already run the training job, you can load it rather than running the job again.  Just set the `model_uri` parameter to the location of the model artifact, and set the flag `model_exists` to `True`.

Similarly, if you already have a prediction endpoint available, set the flag `model_deployed` to `True` and provide the `model_endpoint` parameter.

In [12]:
# # https://github.com/aws/sagemaker-python-sdk/issues/1985
# # container = sagemaker.image_uris.retrieve(region_name, "blazingtext", "latest")

# image_uri = ''

# if region_name == 'us-west-1':
#     image_uri = '632365934929.dkr.ecr.us-west-1.amazonaws.com'

# if region_name == 'us-west-2':
#     image_uri = '433757028032.dkr.ecr.us-west-2.amazonaws.com'
    
# if region_name =='us-east-1':
#     image_uri = '811284229777.dkr.ecr.us-east-1.amazonaws.com'

# if region_name == 'us-east-2':
#     image_uri = '825641698319.dkr.ecr.us-east-2.amazonaws.com'

# if region_name =='ap-east-1':
#     image_uri = '286214385809.dkr.ecr.ap-east-1.amazonaws.com'

# if region_name == 'ap-northeast-1':
#     image_uri = '501404015308.dkr.ecr.ap-northeast-1.amazonaws.com'

# if region_name == 'ap-northeast-2':
#     image_uri = '306986355934.dkr.ecr.ap-northeast-2.amazonaws.com'

# if region_name == 'ap-south-1':
#     image_uri = '991648021394.dkr.ecr.ap-south-1.amazonaws.com'

# if region_name == 'ap-southeast-1':
#     image_uri = '475088953585.dkr.ecr.ap-southeast-1.amazonaws.com'

# if region_name == 'ap-southeast-2':
#     image_uri = '544295431143.dkr.ecr.ap-southeast-2.amazonaws.com'

# if region_name == 'ca-central-1':
#     image_uri = '469771592824.dkr.ecr.ca-central-1.amazonaws.com'

# if region_name == 'cn-north-1':
#     image_uri = '390948362332.dkr.ecr.cn-north-1.amazonaws.com.cn'

# if region_name == 'cn-northwest-1':
#     image_uri = '387376663083.dkr.ecr.cn-northwest-1.amazonaws.com.cn'

# if region_name == 'eu-central-1': 
#     image_uri = '813361260812.dkr.ecr.eu-central-1.amazonaws.com'

# if region_name == 'eu-north-1':
#     image_uri = '669576153137.dkr.ecr.eu-north-1.amazonaws.com'

# if region_name == 'eu-west-1':
#     image_uri = '685385470294.dkr.ecr.eu-west-1.amazonaws.com'

# if region_name == 'eu-west-2':
#     image_uri = '644912444149.dkr.ecr.eu-west-2.amazonaws.com'

# if region_name == 'eu-west-3':
#     image_uri = '749696950732.dkr.ecr.eu-west-3.amazonaws.com'

# if region_name == 'me-south-1':
#     image_uri = '249704162688.dkr.ecr.me-south-1.amazonaws.com'
    
# if region_name == 'sa-east-1':
#     image_uri = '855470959533.dkr.ecr.sa-east-1.amazonaws.com'

# if region_name == 'us-gov-west-1':
#     image_uri = '226302683700.dkr.ecr.us-gov-west-1.amazonaws.com'
    
# # https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html
# image_uri = '{}/factorization-machines:1'.format(image_uri)

# print('Using SageMaker container: {} ({})'.format(image_uri, region_name))


In [13]:
# image_uri = get_image_uri(region_name, "factorization-machines")
# print(image_uri)

In [14]:
image_uri = '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:1'


In [15]:
model_exists = True
model_uri = 's3://{}/factorization-machines/movielens/output/factorization-machines-2019-10-10-22-13-15-602/output/model.tar.gz'.format(bucket)
model_deployed = True
#model_endpoint = 'factorization-machines-2019-10-11-15-27-16-815' 
#if model_exists:
#    fm_model = sagemaker.FactorizationMachinesModel(model_uri, get_execution_role(), sagemaker_session=sagemaker.Session())
    
#    if model_deployed:
#        fm_predictor =  sagemaker.predictor.RealTimePredictor(model_endpoint, sagemaker_session=sagemaker.Session())
#    else:
#        fm_predictor = fm_model.deploy(initial_instance_count=1,
#                         instance_type='ml.m5.xlarge')
#else:
fm = sagemaker.estimator.Estimator(image_uri=image_uri,
                                   role=role, 
                                   instance_count=1, 
                                   instance_type='ml.m5.xlarge',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

In [16]:
fm.set_hyperparameters(feature_dim=nb_features,
                       predictor_type='binary_classifier',
                       mini_batch_size=1000,
                       num_factors=64,
                       epochs=200)
fm.fit({'train': fm_train_data_path, 'test': fm_test_data_path})

2020-11-08 21:37:30 Starting - Starting the training job...
2020-11-08 21:37:32 Starting - Launching requested ML instances......
2020-11-08 21:38:43 Starting - Preparing the instances for training...
2020-11-08 21:39:28 Downloading - Downloading input data...
2020-11-08 21:39:52 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from numpy.testing import nosetester
[11/08/2020 21:40:09 INFO 139813082523456] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', 


2020-11-08 21:40:07 Training - Training image download completed. Training in progress.
[2020-11-08 21:40:18.353] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 30, "duration": 554, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:40:18 INFO 139813082523456] #quality_metric: host=algo-1, epoch=14, train binary_classification_accuracy <score>=0.720879120879
[11/08/2020 21:40:18 INFO 139813082523456] #quality_metric: host=algo-1, epoch=14, train binary_classification_cross_entropy <loss>=0.592996871781
[11/08/2020 21:40:18 INFO 139813082523456] #quality_metric: host=algo-1, epoch=14, train binary_f_1.000 <score>=0.76650548804
#metrics {"Metrics": {"update.time": {"count": 1, "max": 556.4618110656738, "sum": 556.4618110656738, "min": 556.4618110656738}}, "EndTime": 1604871618.354108, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871617

[2020-11-08 21:40:28.491] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 66, "duration": 556, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:40:28 INFO 139813082523456] #quality_metric: host=algo-1, epoch=32, train binary_classification_accuracy <score>=0.732791208791
[11/08/2020 21:40:28 INFO 139813082523456] #quality_metric: host=algo-1, epoch=32, train binary_classification_cross_entropy <loss>=0.555287888747
[11/08/2020 21:40:28 INFO 139813082523456] #quality_metric: host=algo-1, epoch=32, train binary_f_1.000 <score>=0.768383753715
#metrics {"Metrics": {"update.time": {"count": 1, "max": 558.3701133728027, "sum": 558.3701133728027, "min": 558.3701133728027}}, "EndTime": 1604871628.492029, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871627.933078}
[11/08/2020 21:40:28 INFO 139813082523456] #progress_metric: host=al

[2020-11-08 21:40:38.234] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 100, "duration": 548, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:40:38 INFO 139813082523456] #quality_metric: host=algo-1, epoch=49, train binary_classification_accuracy <score>=0.742164835165
[11/08/2020 21:40:38 INFO 139813082523456] #quality_metric: host=algo-1, epoch=49, train binary_classification_cross_entropy <loss>=0.538613998245
[11/08/2020 21:40:38 INFO 139813082523456] #quality_metric: host=algo-1, epoch=49, train binary_f_1.000 <score>=0.773673904446
#metrics {"Metrics": {"update.time": {"count": 1, "max": 550.3809452056885, "sum": 550.3809452056885, "min": 550.3809452056885}}, "EndTime": 1604871638.23479, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871637.683804}
[11/08/2020 21:40:38 INFO 139813082523456] #progress_metric: host=al

[2020-11-08 21:40:48.425] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 136, "duration": 561, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:40:48 INFO 139813082523456] #quality_metric: host=algo-1, epoch=67, train binary_classification_accuracy <score>=0.744406593407
[11/08/2020 21:40:48 INFO 139813082523456] #quality_metric: host=algo-1, epoch=67, train binary_classification_cross_entropy <loss>=0.527468503512
[11/08/2020 21:40:48 INFO 139813082523456] #quality_metric: host=algo-1, epoch=67, train binary_f_1.000 <score>=0.774933957791
#metrics {"Metrics": {"update.time": {"count": 1, "max": 563.0748271942139, "sum": 563.0748271942139, "min": 563.0748271942139}}, "EndTime": 1604871648.42579, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871647.862052}
[11/08/2020 21:40:48 INFO 139813082523456] #progress_metric: host=al

[2020-11-08 21:40:58.602] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 172, "duration": 533, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:40:58 INFO 139813082523456] #quality_metric: host=algo-1, epoch=85, train binary_classification_accuracy <score>=0.747032967033
[11/08/2020 21:40:58 INFO 139813082523456] #quality_metric: host=algo-1, epoch=85, train binary_classification_cross_entropy <loss>=0.519658784971
[11/08/2020 21:40:58 INFO 139813082523456] #quality_metric: host=algo-1, epoch=85, train binary_f_1.000 <score>=0.777093500658
#metrics {"Metrics": {"update.time": {"count": 1, "max": 535.3460311889648, "sum": 535.3460311889648, "min": 535.3460311889648}}, "EndTime": 1604871658.602676, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871658.066683}
[11/08/2020 21:40:58 INFO 139813082523456] #progress_metric: host=a

[2020-11-08 21:41:08.238] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 206, "duration": 547, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:41:08 INFO 139813082523456] #quality_metric: host=algo-1, epoch=102, train binary_classification_accuracy <score>=0.749967032967
[11/08/2020 21:41:08 INFO 139813082523456] #quality_metric: host=algo-1, epoch=102, train binary_classification_cross_entropy <loss>=0.513777320275
[11/08/2020 21:41:08 INFO 139813082523456] #quality_metric: host=algo-1, epoch=102, train binary_f_1.000 <score>=0.779587131523
#metrics {"Metrics": {"update.time": {"count": 1, "max": 548.6500263214111, "sum": 548.6500263214111, "min": 548.6500263214111}}, "EndTime": 1604871668.239358, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871667.690092}
[11/08/2020 21:41:08 INFO 139813082523456] #progress_metric: hos

[2020-11-08 21:41:18.345] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 242, "duration": 560, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:41:18 INFO 139813082523456] #quality_metric: host=algo-1, epoch=120, train binary_classification_accuracy <score>=0.752186813187
[11/08/2020 21:41:18 INFO 139813082523456] #quality_metric: host=algo-1, epoch=120, train binary_classification_cross_entropy <loss>=0.508168722635
[11/08/2020 21:41:18 INFO 139813082523456] #quality_metric: host=algo-1, epoch=120, train binary_f_1.000 <score>=0.781653934412
#metrics {"Metrics": {"update.time": {"count": 1, "max": 562.2141361236572, "sum": 562.2141361236572, "min": 562.2141361236572}}, "EndTime": 1604871678.345892, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871677.78308}
[11/08/2020 21:41:18 INFO 139813082523456] #progress_metric: host

[2020-11-08 21:41:28.352] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 278, "duration": 559, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:41:28 INFO 139813082523456] #quality_metric: host=algo-1, epoch=138, train binary_classification_accuracy <score>=0.75489010989
[11/08/2020 21:41:28 INFO 139813082523456] #quality_metric: host=algo-1, epoch=138, train binary_classification_cross_entropy <loss>=0.502633497301
[11/08/2020 21:41:28 INFO 139813082523456] #quality_metric: host=algo-1, epoch=138, train binary_f_1.000 <score>=0.784027421401
#metrics {"Metrics": {"update.time": {"count": 1, "max": 561.1100196838379, "sum": 561.1100196838379, "min": 561.1100196838379}}, "EndTime": 1604871688.352844, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871687.791124}
[11/08/2020 21:41:28 INFO 139813082523456] #progress_metric: host

[2020-11-08 21:41:38.472] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 314, "duration": 560, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:41:38 INFO 139813082523456] #quality_metric: host=algo-1, epoch=156, train binary_classification_accuracy <score>=0.75878021978
[11/08/2020 21:41:38 INFO 139813082523456] #quality_metric: host=algo-1, epoch=156, train binary_classification_cross_entropy <loss>=0.496849003048
[11/08/2020 21:41:38 INFO 139813082523456] #quality_metric: host=algo-1, epoch=156, train binary_f_1.000 <score>=0.787516818802
#metrics {"Metrics": {"update.time": {"count": 1, "max": 562.0059967041016, "sum": 562.0059967041016, "min": 562.0059967041016}}, "EndTime": 1604871698.472673, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871697.909978}
[11/08/2020 21:41:38 INFO 139813082523456] #progress_metric: host

[2020-11-08 21:41:48.545] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 350, "duration": 559, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:41:48 INFO 139813082523456] #quality_metric: host=algo-1, epoch=174, train binary_classification_accuracy <score>=0.762956043956
[11/08/2020 21:41:48 INFO 139813082523456] #quality_metric: host=algo-1, epoch=174, train binary_classification_cross_entropy <loss>=0.490630248353
[11/08/2020 21:41:48 INFO 139813082523456] #quality_metric: host=algo-1, epoch=174, train binary_f_1.000 <score>=0.79119921788
#metrics {"Metrics": {"update.time": {"count": 1, "max": 561.0589981079102, "sum": 561.0589981079102, "min": 561.0589981079102}}, "EndTime": 1604871708.545651, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871707.983914}
[11/08/2020 21:41:48 INFO 139813082523456] #progress_metric: host

[2020-11-08 21:41:58.532] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 386, "duration": 556, "num_examples": 91, "num_bytes": 5796480}
[11/08/2020 21:41:58 INFO 139813082523456] #quality_metric: host=algo-1, epoch=192, train binary_classification_accuracy <score>=0.767879120879
[11/08/2020 21:41:58 INFO 139813082523456] #quality_metric: host=algo-1, epoch=192, train binary_classification_cross_entropy <loss>=0.483857586871
[11/08/2020 21:41:58 INFO 139813082523456] #quality_metric: host=algo-1, epoch=192, train binary_f_1.000 <score>=0.795480291632
#metrics {"Metrics": {"update.time": {"count": 1, "max": 558.3081245422363, "sum": 558.3081245422363, "min": 558.3081245422363}}, "EndTime": 1604871718.533269, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1604871717.974342}
[11/08/2020 21:41:58 INFO 139813082523456] #progress_metric: hos


2020-11-08 21:42:12 Uploading - Uploading generated training model
2020-11-08 21:42:12 Completed - Training job completed
Training seconds: 164
Billable seconds: 164


In [19]:
import json
#from sagemaker.predictor import json_deserializer

def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print(json.dumps(js))
    return json.dumps(js)

# fm_predictor.content_type = 'application/json'
# fm_predictor.serializer = fm_serializer
# fm_predictor.deserializer = json_deserializer

In [None]:
fm_predictor = fm.deploy(initial_instance_count=1,
                         instance_type='ml.m5.xlarge',                         
                         serializer=fm_serializer,
                         deserializer=sagemaker.deserializers.JSONDeserializer())

-----------------

## Part 2 - Extracting parameters from FM model

Now that the model has been created and stored in SageMaker, we can download it and extract its parameters. The FM model is stored in MXNet format.

This section is reproduced with minor modifications from the blog cited above for the sake of completeness.

### Download model data

Skip the next cell block if you have already downloaded the model.

In [None]:
import mxnet as mx
model_file_name = "model.tar.gz"
model_full_path = fm.output_path +"/"+ fm.latest_training_job.job_name +"/output/"+model_file_name
print("Model Path: ", model_full_path)

#Download FM model 
os.system("aws s3 cp "+model_full_path+ " .")

#Extract model file for loading to MXNet
os.system("tar xzvf "+model_file_name)
os.system("unzip -o model_algo-1")
os.system("mv symbol.json model-symbol.json")
os.system("mv params model-0000.params")

### Extract model data to create item and user latent matrices

In [None]:
import mxnet as mx
#Extract model data
m = mx.module.Module.load('./model', 0, False, label_names=['out_label'])
V = m._arg_params['v'].asnumpy()
w = m._arg_params['w1_weight'].asnumpy()
b = m._arg_params['w0_weight'].asnumpy()

# item latent matrix - concat(V[i], w[i]).  
knn_item_matrix = np.concatenate((V[nb_users:], w[nb_users:]), axis=1)
knn_train_label = np.arange(1,nb_movies+1)

#user latent matrix - concat (V[u], 1) 
ones = np.ones(nb_users).reshape((nb_users, 1))
knn_user_matrix = np.concatenate((V[:nb_users], ones), axis=1)
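The comments in the cell above describe the item vector as `concat(V[i], w[i])` and the user vector as `concat(V[u], 1)`. The sketch below illustrates why that layout is useful, assuming the standard FM score for a one-hot (user, movie) input, `w0 + w[u] + w[i] + <V[u], V[i]>`, taken before any link function such as a sigmoid. All parameter values are made up; they are not the trained model's parameters.

```python
# Illustrative only: toy FM parameters for 2 users and 3 movies with 2 factors.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fm_score(w0, w, V, user_col, movie_col):
    # Standard FM score for a one-hot input with exactly two active features
    return w0 + w[user_col] + w[movie_col] + dot(V[user_col], V[movie_col])


nb_users = 2
w0 = 0.5                                   # global bias (made up)
w = [0.1, -0.2, 0.3, 0.0, -0.1]            # linear terms: users first, then movies
V = [[1.0, 0.0], [0.5, 0.5],               # user factors
     [1.0, 1.0], [0.0, 2.0], [-1.0, 1.0]]  # movie factors

user_col = 0
user_vec = V[user_col] + [1.0]             # concat(V[u], 1)

offsets = []
for movie in range(3):
    movie_col = nb_users + movie
    item_vec = V[movie_col] + [w[movie_col]]  # concat(V[i], w[i])
    full = fm_score(w0, w, V, user_col, movie_col)
    short = dot(user_vec, item_vec)           # <V[u], V[i]> + w[i]
    offsets.append(full - short)

print(offsets)  # the same value for every movie, up to floating-point rounding
```

The difference between the two scores is `w0 + w[u]`, which does not depend on the movie, so for a fixed user the plain dot product ranks movies in the same order as the full FM score.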

## Part 3: Calculate Influence Matrix

Per the paper cited above, the influence matrix for user $j$ is calculated as:

$$J_j=U^T(U W_j U^T)^{-1}UW_j$$

Let's map those symbols to the variables in this notebook.

* $U$ is the embedding matrix for items. In this formula, it is the transpose of the item matrix we extracted from the FM model, so $U={knn\_item\_matrix}^{T}$.
* $U^T={knn\_item\_matrix}$
* $W_j$ is a binary matrix with 1s on the diagonal in positions corresponding to the known entries of $X$ for user $j$. In other words, it's a matrix of size $nb\_movies$ by $nb\_movies$, with a one on the diagonal in row and column $i$ where user $j$ rated movie $i$.
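To see the formula at work on something small enough to check by hand, here is a sketch with four items, two latent factors, and a user who rated only the first two items. The embeddings and helper names are made up for illustration, and the closed-form 2x2 inverse is a stand-in that only works for two factors; the notebook itself uses NumPy.

```python
# Illustrative only: J = U^T (U W U^T)^-1 U W on a tiny, made-up example.
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("U W U^T is singular; too few rated items for this many factors")
    return [[d / det, -b / det], [-c / det, a / det]]


def influence_matrix(item_matrix, rated):
    n = len(item_matrix)
    U = transpose(item_matrix)                         # factors x items
    W = [[1 if (r == c and r in rated) else 0 for c in range(n)] for r in range(n)]
    UW = matmul(U, W)                                  # zeroes the columns of unrated items
    gram_inv = inverse_2x2(matmul(UW, item_matrix))    # (U W U^T)^-1
    return matmul(matmul(item_matrix, gram_inv), UW)   # U^T (U W U^T)^-1 U W


item_matrix = [[1, 0], [1, 1], [0, 2], [2, 1]]  # one row per item (made up)
rated = {0, 1}                                  # zero-based indices of rated items
J = influence_matrix(item_matrix, rated)
for row in J:
    print(row)
```

Row 2 comes out as `[-2.0, 2.0, 0.0, 0.0]`: the embedding of item 2 equals -2 times item 0 plus 2 times item 1, and unrated items (columns 2 and 3) have no influence on any prediction.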

Now let's confirm that our dimensions line up properly.

In [None]:
knn_item_matrix.shape

In [None]:
knn_user_matrix.shape

### Build the matrix $W$.

For the sake of an example, let's pick user `846`, who was the first row of the shuffled training set in an earlier run. The shuffle is random, so the first row differs from run to run.

In [None]:
W = np.zeros([nb_movies,nb_movies])
W.shape

In [None]:
user_of_interest = 846

u1 = user_movie_ratings_train[user_movie_ratings_train.user_id == user_of_interest]
u2 = user_movie_ratings_test[user_movie_ratings_test.user_id == user_of_interest]

In [None]:
u1.head(5)

In [None]:
u1 = u1[u1.rating >= 4] # we only include ratings of 4 or more
u2 = u2[u2.rating >= 4]

In [None]:
u_all = np.concatenate((np.array(u1['movie_id']), np.array(u2['movie_id'])), axis=0)

In [None]:
for movie_id in u_all:
    # movie IDs start at 1, while the rows and columns of W start at 0
    W[movie_id - 1, movie_id - 1] = 1

### Calculate $J$ for user $j$

In [None]:
# influence matrix J = U^T (U W U^T)^-1 U W, where U = knn_item_matrix^T
J1 = np.matmul(np.transpose(knn_item_matrix), W) # U W
J2 = np.matmul(J1, knn_item_matrix) # U W U^T
J3 = np.linalg.inv(J2) # (U W U^T)^-1
J4 = np.matmul(knn_item_matrix, J3) # U^T (U W U^T)^-1
J5 = np.matmul(J4, np.transpose(knn_item_matrix)) # U^T (U W U^T)^-1 U
J = np.matmul(J5, W) # U^T (U W U^T)^-1 U W

In [None]:
J.shape

## Part 4: Explaining recommendations for a user

Now we can use the influence matrix to calculate the two metrics explained in the research paper:

_Influence_ of the actual rating that user $j$ assigned to item $k$ on the predicted rating for item $i$.  This is calculated as:

$${\beta}_k = J_{ik}^j$$

In other words, we just look up the element at row $i$ and column $k$ of the influence matrix $J$ for user $j$.

_Impact_ of the actual rating that user $j$ assigned to item $k$ on the predicted rating for item $i$.  This is calculated as:

$${\gamma}_k = {\beta}_{k}x_{kj}$$

In other words, we multiply the influence by the actual rating that user $j$ gave to item $k$.

In this example I'll just use influence, since we converted the ratings to a binary like/don't like.
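For readers who keep the original 1 to 5 ratings, the sketch below shows how influence and impact differ. It assumes the prediction is the linear combination of known ratings that the influence matrix implies, so the impacts for item $i$ sum to its predicted rating. The influence row and the ratings are made-up numbers.

```python
# Illustrative only: influence (beta) versus impact (gamma) for one target item i.
influence_row = [0.4, 0.0, -0.1, 0.7]  # made-up row i of J; 0.0 for the unrated item
ratings = [5, 0, 2, 4]                 # made-up ratings x_kj; 0 marks an unrated item

# gamma_k = beta_k * x_kj
impact = [beta * x for beta, x in zip(influence_row, ratings)]

# Under the linear-combination assumption, the impacts sum to the predicted rating
predicted = sum(impact)

# Items ranked from most to least positive impact
ranking = sorted(range(len(impact)), key=lambda k: impact[k], reverse=True)

print("impact:", impact)
print("predicted rating:", predicted)
print("items by impact:", ranking)
```

With binary like/don't-like labels, every known rating that enters the calculation has the same value, so ranking by impact gives the same order as ranking by influence.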


### Look up influence for a test recommendation

For our selected user, let's find a movie in our test set that they rated.

In [None]:
u2.head(5)

In [None]:
movie_to_rate = 60

In [None]:
result = fm_predictor.predict(X_test[8451:8452].toarray()) # use the row number from the test set

In [None]:
result

For movie 60, the user provided a rating of 4, and the FM model predicted that they'd like it with a score of 0.75.

Let's see what influenced that rating.

In [None]:
influence_i = J[movie_to_rate-1,:].copy() # movies are indexed at 1, so we offset to 0; copy so that J itself is not modified

In [None]:
influence_i[movie_to_rate-1] = 0.0 # zero this out; it's the influence of the movie itself

In [None]:
# join with movie names
# u.item is not UTF-8 encoded, so read it as latin-1; column names must be unique
df_movies = pd.read_csv('ml-100k/u.item', sep='|', header=None, encoding='latin-1',
                        names=['movie_id', 'movie_name', 'c3','c4','c5','c6','c7',
                               'c8','c9','c10','c11','c12','c13','c14','c15','c16','c17',
                               'c18','c19','c20','c21','c22','c23','c24'])
df_movies.head(5)

In [None]:
df_influence = pd.DataFrame(data={'influence': influence_i, 'movie': df_movies['movie_name']})
df_influence.head(5)

This movie is 'Three Colors: Blue', a French drama that probably appeals to 'art house' moviegoers.

In [None]:
df_movies[df_movies['movie_id'] == movie_to_rate]

And which movies influenced this prediction the most?

In [None]:
df_top_influence = df_influence.nlargest(20, 'influence')
df_top_influence

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [None]:
ax = df_top_influence.plot(x ='movie', y='influence', kind = 'barh', figsize=(20,20), title='Top 20 Influences', color='blue')
ax.set_ylabel("Movie")
ax.set_xlabel("Influence")

These influences seem plausible. The first two are the other movies in the same trilogy, and the others make sense if you're a movie fan. `Short Cuts`, for example, is an indie film by Robert Altman, which might appeal to people who liked the 'Three Colors' trilogy.

### Look up influence for new recommendation

Now let's consider a movie that the user hasn't seen before.

In [None]:
np.sort(u_all)[:5]

In [None]:
movie_to_rate = 9

In [None]:
rate_data = np.zeros((1, nb_features))

In [None]:
rate_data[0, user_of_interest-1] = 1.0

In [None]:
rate_data[0, nb_users + movie_to_rate -1] = 1.0

In [None]:
result = fm_predictor.predict(rate_data) 

In [None]:
result

The model predicts that the user will like this movie.  Let's see why.

In [None]:
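influence_i = J[movie_to_rate-1,:].copy() # movies are indexed at 1, so we offset to 0; copy so that J itself is not modified
influence_i[movie_to_rate-1] = 0.0 # zero this out; it's the influence of the movie itself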

In [None]:
df_influence = pd.DataFrame(data={'influence': influence_i, 'movie': df_movies['movie_name']})
df_influence.head(5)

We're looking at the movie 'Dead Man Walking', which was an acclaimed movie about a prisoner on Death Row.

In [None]:
df_movies[df_movies['movie_id'] == movie_to_rate]

In [None]:
df_top_influence = df_influence.nlargest(20, 'influence')
df_top_influence

In [None]:
ax = df_top_influence.plot(x ='movie', y='influence', kind = 'barh', figsize=(20,20), title='Top 20 Influences', color='blue')
ax.set_ylabel("Movie")
ax.set_xlabel("Influence")

Are these results intuitively satisfying? I'm not quite sure, but remember that we built this model with a relatively limited data set.

# Release Resources

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();