## Factorization Machines
Factorization Machines (FMs) are a generalization of linear models that also model pairwise interactions between features. The original paper is available at
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

They're well suited to high-dimensional
sparse datasets, such as user-item interaction matrices for recommendation.
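
To make the "generalization of linear models" idea concrete, here is a minimal NumPy sketch of how a second-order Factorization Machine scores one sample. It is an illustration under stated assumptions, not the SageMaker implementation: the function name `fm_predict`, the toy sizes, and the random parameters are all hypothetical.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Score one sample with a second-order Factorization Machine.

    x: (n,) feature vector; w0: global bias; w: (n,) linear weights;
    V: (n, k) factor matrix holding one k-dimensional vector per feature.
    """
    linear = w0 + w @ x
    # Pairwise term, computed with the O(k*n) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]
    xv = x @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

# Toy layout (hypothetical): 2 user columns followed by 3 movie columns
rng = np.random.default_rng(0)
n, k = 5, 4
x = np.zeros(n)
x[0] = 1.0      # user 0 is active
x[2 + 1] = 1.0  # movie 1 is active
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))

score = fm_predict(x, w0, w, V)
# For a one-hot (user, movie) pair, the score reduces to
# bias + user weight + movie weight + dot product of the two factor vectors
expected = w0 + w[0] + w[3] + V[0] @ V[3]
print(np.isclose(score, expected))
```

With one-hot user and movie columns, the interaction term is simply the dot product of a user vector and a movie vector, which is why the model suits recommendation data.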

In this example, we're going to train a recommendation model based on the MovieLens
dataset ( https://grouplens.org/datasets/movielens/ ).

Factorization Machines is a supervised learning algorithm, so we need to train it on labeled samples.

Each sample one-hot encodes a user and a movie, so almost every entry in the feature
matrix is zero. Instead of using a plain matrix, we'll use a sparse matrix, a data structure
specifically designed and optimized for sparse datasets. SciPy has exactly the object we need,
named lil_matrix
( https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html ).
It stores only the nonzero entries, so we don't have to store all those zeros.
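
As a rough illustration of the savings, the sketch below fills a small lil_matrix with two nonzero entries per row and compares its storage with an equivalent dense float32 array. The row count and the column assignments are hypothetical; only the 943 + 1682 column layout comes from the dataset used later.

```python
import numpy as np
from scipy.sparse import lil_matrix

rows, cols = 1000, 2625  # 2625 = 943 users + 1682 movies in ml-100k
X = lil_matrix((rows, cols), dtype=np.float32)
for r in range(rows):
    X[r, r % 943] = 1           # hypothetical user column
    X[r, 943 + (r % 1682)] = 1  # hypothetical movie column

# A dense float32 array stores every cell, zeros included
dense_bytes = rows * cols * 4

# CSR keeps only the nonzero values plus their index arrays
csr = X.tocsr()
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(X.nnz, dense_bytes, sparse_bytes)
```

lil_matrix is convenient for filling a matrix incrementally; converting to CSR afterwards is a common choice for compact storage and fast row access.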

## Understanding protobuf and RecordIO

So how will we pass this sparse matrix to the SageMaker algorithm? As you would expect,
we're going to serialize the object, and store it in S3. We're not going to use Python
serialization, however. Instead, we're going to use protobuf
( https://developers.google.com/protocol-buffers/ ), a popular and efficient serialization mechanism.
In addition, we're going to store the protobuf-encoded data in a record format called
RecordIO ( https://mxnet.apache.org/api/faq/recordio/ ). Our dataset
will be stored as a sequence of records in a single file. This has the following benefits:

• A single file is easier to move around: who wants to deal with thousands
of individual files that can get lost or corrupted?

• A sequential file is faster to read, which makes the training process more efficient.

• A sequence of records is easy to split for distributed training.
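
The idea of a sequence of records in a single file can be sketched with a simplified length-prefixed format. This is not the actual RecordIO byte layout, only an illustration of why such a file can be read sequentially; the helper names `write_records` and `read_records` are hypothetical.

```python
import io
import struct

def write_records(stream, payloads):
    """Write each payload as a 4-byte little-endian length followed by its bytes."""
    for payload in payloads:
        stream.write(struct.pack('<I', len(payload)))
        stream.write(payload)

def read_records(stream):
    """Yield payloads one by one until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack('<I', header)
        yield stream.read(length)

buf = io.BytesIO()
write_records(buf, [b'sample-1', b'sample-2', b'sample-3'])
buf.seek(0)
records = list(read_records(buf))
print(records)
```

Because every record carries its own length, a reader never needs to parse a payload to find the next one, which keeps sequential reads cheap and makes record boundaries easy to locate when splitting.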

# Building a Factorization Machines model on MovieLens
Download and extract the ml-100k dataset:

In [1]:
%%sh
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip -o ml-100k.zip

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         


--2021-05-02 02:42:42--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


In [6]:
# Move into the dataset folder
%cd ml-100k
!shuf ua.base -o ua.base.shuffled  # Shuffle the training split
!head -5 ua.base.shuffled          # Print the first 5 lines

389	946	3	880088363
453	238	4	877554396
58	1106	4	892068866
350	181	4	882346720
617	174	1	883788820


### Building training set and test set

In [7]:
num_users=943
num_movies=1682
num_features=num_users+num_movies

num_ratings_train=90570
num_ratings_test=9430

Now, let's write a function to load a dataset into a sparse matrix. We go through the
dataset line by line. Each sample gets one row in the X matrix, where we set the
appropriate user and movie columns to 1. We also store the rating in the
Y vector:

In [9]:
import csv
import numpy as np
from scipy.sparse import lil_matrix

def loadDataset(filename, lines, columns):
    # Features are one-hot encoded in a sparse matrix:
    # columns [0, num_users) are users, the remaining columns are movies
    X = lil_matrix((lines, columns), dtype='float32')
    # Labels (ratings) are stored in a vector
    Y = []
    line = 0
    with open(filename, 'r') as f:
        samples = csv.reader(f, delimiter='\t')
        for userId, movieId, rating, timestamp in samples:
            # IDs in the dataset are 1-based, column indices are 0-based
            X[line, int(userId) - 1] = 1
            X[line, num_users + int(movieId) - 1] = 1
            Y.append(int(rating))
            line = line + 1
    Y = np.array(Y).astype('float32')
    return X, Y

In [10]:
X_train, Y_train = loadDataset('ua.base.shuffled', num_ratings_train, num_features)
X_test, Y_test = loadDataset('ua.test', num_ratings_test, num_features)
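
To see which columns loadDataset sets for a given sample, the index arithmetic can be isolated in a tiny helper. This is only a sketch: the helper name `feature_columns` is hypothetical, and the sample (user 389, movie 946) is the first line of the shuffled file printed earlier.

```python
NUM_USERS = 943  # same value as num_users in the notebook

def feature_columns(user_id, movie_id, num_users=NUM_USERS):
    """Return the two column indices set to 1 for a (user, movie) pair.

    IDs are 1-based in the dataset; column indices are 0-based.
    """
    return user_id - 1, num_users + movie_id - 1

cols = feature_columns(389, 946)
print(cols)  # (388, 1888)
```

User columns occupy indices 0 to 942 and movie columns start at index 943, so the two indices of a sample can never collide.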

In [11]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (num_ratings_train, num_features)
assert Y_train.shape == (num_ratings_train, )

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (num_ratings_test, num_features)
assert Y_test.shape  == (num_ratings_test, )

(90570, 2625)
(90570,)
(9430, 2625)
(9430,)



# Convert to protobuf and save to S3

In [12]:
import sagemaker

bucket = sagemaker.Session().default_bucket()
prefix = 'fm-movielens'

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

Now, let's write a function that converts a dataset to the RecordIO-wrapped
protobuf format, and uploads it to an S3 bucket. We first create an in-memory binary
stream with io.BytesIO(). Then, we use the life-saving
write_spmatrix_to_sparse_tensor() function to write the sample matrix and the label vector to
that buffer in protobuf format. Finally, we use boto3 to upload the buffer to S3:

In [14]:
import io, boto3
import sagemaker.amazon.common as smac

def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    # use smac.write_numpy_to_dense_tensor(buf, feature, label) for numpy arrays
    buf.seek(0)
    print(buf)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    
test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

<_io.BytesIO object at 0x7ffabb230270>
<_io.BytesIO object at 0x7ffabb230270>
s3://sagemaker-us-east-1-603012210694/fm-movielens/train/train.protobuf
s3://sagemaker-us-east-1-603012210694/fm-movielens/test/test.protobuf
Output: s3://sagemaker-us-east-1-603012210694/fm-movielens/output



In [19]:
# Extra step for local users only: look up the ARN of the SageMaker execution role

import boto3
region = boto3.Session().region_name

def resolve_sm_role():
    client = boto3.client('iam', region_name=region)
    response_roles = client.list_roles(
        PathPrefix='/',
        # Marker='string',
        MaxItems=999
    )
    for role in response_roles['Roles']:
        if role['RoleName'].startswith('AmazonSageMaker-ExecutionRole-'):
            #print('Resolved SageMaker IAM Role to: ' + str(role))
            return role['Arn']
    raise Exception('Could not resolve what should be the SageMaker role to be used')

role = resolve_sm_role()
print(role)

arn:aws:iam::603012210694:role/service-role/AmazonSageMaker-ExecutionRole-20210304T123661



### Run training job
We retrieve the URI of the Factorization Machines container image, configure the Estimator, and set the
hyperparameters:

In [15]:
import boto3
from sagemaker import image_uris

region = boto3.Session().region_name    
container = image_uris.retrieve('factorization-machines', region)

In [None]:
fm = sagemaker.estimator.Estimator(container,
                                   role=role,#sagemaker.get_execution_role(),
                                   instance_count=1, 
                                   instance_type='ml.c5.xlarge',
                                   output_path=output_prefix
                                   )

fm.set_hyperparameters(feature_dim=num_features,
                      predictor_type='regressor',
                      num_factors=64,
                      epochs=10)

We then launch the training job. Did you notice that we didn't configure training
inputs? We're simply passing the location of the two protobuf files. As protobuf
is the default format for Factorization Machines (as well as other built-in
algorithms), we can save a step:

In [20]:
fm.fit({'train': train_data, 'test': test_data})

2021-05-01 22:22:53 Starting - Starting the training job...
2021-05-01 22:22:55 Starting - Launching requested ML instancesProfilerReport-1619907772: InProgress
......
2021-05-01 22:24:23 Starting - Preparing the instances for training...
2021-05-01 22:25:04 Downloading - Downloading input data...
2021-05-01 22:25:43 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from collections import Mapping, MutableMapping, Sequence
  """
  """
[05/01/2021 22:25:37 INFO 140392802047808] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'epochs': 1, 'mini_batch_size': '1000', 'use_bias': 'true', 'use_linear': 'true', 'bias_lr': '0.1', 'linear_lr': '0.001', 'factors_lr': '0.0001', 'bias_wd': '0.01', 'linear_wd': '0.001', 'factors_wd': '0.00001', 'bias_init_method': 'normal', 'bias_ini

[2021-05-01 22:25:42.698] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 16, "duration": 539, "num_examples": 91, "num_bytes": 5796480}
[05/01/2021 22:25:42 INFO 140392802047808] #quality_metric: host=algo-1, epoch=7, train rmse <loss>=1.0263907882225434
[05/01/2021 22:25:42 INFO 140392802047808] #quality_metric: host=algo-1, epoch=7, train mse <loss>=1.053478050148094
[05/01/2021 22:25:42 INFO 140392802047808] #quality_metric: host=algo-1, epoch=7, train absolute_loss <loss>=0.8341658989204155
#metrics {"StartTime": 1619907942.156609, "EndTime": 1619907942.6989508, "Dimensions": {"Algorithm": "factorization-machines", "Host": "algo-1", "Operation": "training"}, "Metrics": {"update.time": {"sum": 541.7196750640869, "count": 1, "min": 541.7196750640869, "max": 541.7196750640869}}}
[05/01/2021 22:25:42 INFO 140392802047808] #progress_metric: host=algo-1, completed 80.0 % of epochs
#metrics {


2021-05-01 22:26:03 Uploading - Uploading generated training model
2021-05-01 22:26:03 Completed - Training job completed
Training seconds: 51
Billable seconds: 51


## Deploying model

In [21]:
endpoint_name = 'fm-movielens-100k'
fm_predictor = fm.deploy(endpoint_name=endpoint_name,
                         instance_type='ml.t2.medium', initial_instance_count=1)

---------------------!

We'll now send samples to the endpoint in JSON format
( https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines.html#fm-inputoutput ).
For this purpose, we write a custom serializer to convert input data to JSON, and we
attach a JSONDeserializer to the predictor so that the JSON response is decoded into a
Python dictionary:

In [22]:
import json
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

class FMSerializer(JSONSerializer):
    def serialize(self, data):
        # Wrap each dense row in the {'instances': [{'features': [...]}]} structure
        js = {'instances': []}
        for row in data:
            js['instances'].append({'features': row.tolist()})
        return json.dumps(js)

fm_predictor.serializer = FMSerializer()
fm_predictor.deserializer = JSONDeserializer()

### Run predictions

In [23]:
result = fm_predictor.predict(X_test[:3].toarray())
print(result)

{'predictions': [{'score': 3.3867130279541016}, {'score': 3.422882556915283}, {'score': 3.622199535369873}]}
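
The response is a plain dictionary, so extracting the scores and comparing them with known ratings takes only a few lines. The sketch below reuses the response printed above; the helper names and the `labels` values are hypothetical, chosen only for illustration, and are not the actual test labels.

```python
import math

def scores_from_response(response):
    """Extract the predicted ratings from the endpoint's JSON response."""
    return [p['score'] for p in response['predictions']]

def rmse(predicted, actual):
    """Root mean squared error between two equally sized sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Response copied from the notebook output above
response = {'predictions': [{'score': 3.3867130279541016},
                            {'score': 3.422882556915283},
                            {'score': 3.622199535369873}]}
labels = [4.0, 3.0, 4.0]  # hypothetical ratings, for illustration only

scores = scores_from_response(response)
print(scores)
print(rmse(scores, labels))
```

In the notebook, the same helpers could be applied to `fm_predictor.predict(...)` results and the matching slice of `Y_test`.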


### Finally, we delete the endpoint

In [24]:
fm_predictor.delete_endpoint()