This notebook uses the Spotlight package (https://github.com/maciejkula/spotlight) to build a matrix factorization (MF) recommendation system based on BilinearNet. More information on the model can be found here: https://github.com/maciejkula/spotlight/tree/master/examples/movielens_explicit

The following article was also used as a reference: https://towardsdatascience.com/how-to-build-powerful-deep-recommender-systems-using-spotlight-ec11198c173c

This notebook trains and tests the model on the training, validation and test sets created in the notebook Implicit_Rating_Calculation_category.ipynb.
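
As a rough illustration of what a bilinear matrix factorization model computes, the sketch below scores a (user, item) pair as the dot product of two latent vectors plus a user bias and an item bias. It uses plain NumPy with toy sizes and random factors; the names `user_emb`, `item_emb`, `user_bias`, `item_bias` and `predict` are assumptions made for this example, not Spotlight's API.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, dim = 4, 6, 3  # toy sizes, not the notebook's

# Latent factors and bias terms (randomly initialised for illustration only)
user_emb = rng.normal(scale=0.1, size=(n_users, dim))
item_emb = rng.normal(scale=0.1, size=(n_items, dim))
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)

def predict(user_ids, item_ids):
    """Bilinear score <p_u, q_i> + b_u + b_i for each (user, item) pair."""
    dot = (user_emb[user_ids] * item_emb[item_ids]).sum(axis=1)
    return dot + user_bias[user_ids] + item_bias[item_ids]

scores = predict(np.array([0, 1, 2]), np.array([5, 0, 3]))
print(scores.shape)  # one score per pair
```

Training then amounts to adjusting the embeddings and biases so that these scores move closer to the observed ratings.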

#### Installing the Spotlight library 

!conda install -c maciejkula -c pytorch spotlight --y

!conda install -c maciejkula -c pytorch -c peterjc123 spotlight=0.1.4

In [1]:
!pip install git+https://github.com/maciejkula/spotlight.git@master --upgrade 

Collecting git+https://github.com/maciejkula/spotlight.git@master
  Cloning https://github.com/maciejkula/spotlight.git (to revision master) to /tmp/pip-req-build-34cq0wvg
  Running command git clone -q https://github.com/maciejkula/spotlight.git /tmp/pip-req-build-34cq0wvg
Collecting torch>=0.4.0
  Downloading torch-1.8.1-cp36-cp36m-manylinux1_x86_64.whl (804.1 MB)
Building wheels for collected packages: spotlight
  Building wheel for spotlight (setup.py) ... done
  Created wheel for spotlight: filename=spotlight-0.1.6-py3-none-any.whl size=33919 sha256=ee8f3cfc948c00b4cd023af3e760df6966975dd7649b63ae766c69c15577f984
  Stored in directory: /tmp/pip-ephem-wheel-cache-w6r7w92k/wheels/a2/c9/32/26cf5865967a16a7f51891503ef48d501ff3636572991130fe
Successfully built spotlight
Installing collected packages: torch, spotlight
Successfully installed spotlight-0.1.6 torch-1.8.1


In [2]:
# Loading needed libraries
import numpy as np
import pandas as pd
import datetime as dt
from datetime import date
import torch
from sklearn.preprocessing import LabelEncoder
import gc

# Spotlight Libraries
from spotlight.factorization.explicit import ExplicitFactorizationModel
from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split
from spotlight.evaluation import rmse_score


# Loading libraries for S3 bucket connection
import boto3
import io
from io import StringIO,BytesIO, TextIOWrapper
import gzip

client = boto3.client('s3') 
resource = boto3.resource('s3') 

The input data should contain the user ids, the item ids (product categories in this notebook), the implicit ratings and, optionally, the timestamps.
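
The layout described above can be sketched as four parallel arrays, one entry per interaction. The helper below is a hypothetical sanity check (the name `check_interaction_arrays` is invented for this example) that could be run before building the Spotlight objects. It assumes ids are non-negative integers, which is what the label encoding later in the notebook produces.

```python
import numpy as np

def check_interaction_arrays(user_ids, item_ids, ratings, timestamps=None):
    """Hypothetical sanity check; returns the number of interactions."""
    user_ids = np.asarray(user_ids)
    item_ids = np.asarray(item_ids)
    ratings = np.asarray(ratings)
    n = user_ids.shape[0]
    if item_ids.shape[0] != n or ratings.shape[0] != n:
        raise ValueError("user_ids, item_ids and ratings must have the same length")
    if timestamps is not None and np.asarray(timestamps).shape[0] != n:
        raise ValueError("timestamps must have the same length as user_ids")
    for name, ids in (("user_ids", user_ids), ("item_ids", item_ids)):
        if not np.issubdtype(ids.dtype, np.integer) or ids.min() < 0:
            raise ValueError(f"{name} must be non-negative integers")
    return n

# Values taken from the first rows of the training data shown in this notebook
n = check_interaction_arrays(
    user_ids=np.array([0, 0, 0, 2, 2], dtype=np.int32),
    item_ids=np.array([734, 788, 788, 668, 725], dtype=np.int32),
    ratings=np.array([2, 2, 2, 2, 2]),
    timestamps=np.array([1577786981, 1577791856, 1577806209, 1583905657, 1579265500]),
)
print(n)
```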

#### Data Preparation

In [3]:
# Reading Training,validation and testing dfs - Using the created T_implicit_cat data with timestamp
train_df = pd.read_csv('s3://myaws-capstone-bucket/data/modeling/input/T_implicit_cat_rating_train.csv')
val_train_df = pd.read_csv('s3://myaws-capstone-bucket/data/modeling/input/T_implicit_cat_rating_val_train.csv')
test_df = pd.read_csv('s3://myaws-capstone-bucket/data/modeling/input/T_implicit_cat_rating_test.csv')
val_test_df = pd.read_csv('s3://myaws-capstone-bucket/data/modeling/input/T_implicit_cat_rating_val_test.csv')

In [4]:
# Transforming the event_time column into a Unix timestamp (seconds)
train_df['event_date'] = train_df['event_time'].str[:19]  # keeping only the 'YYYY-MM-DD HH:MM:SS' portion of event_time
train_df['event_date'] = train_df['event_date'].astype('datetime64[ns]')
train_df['timestamp'] = train_df['event_date'].values.astype(np.int64)//10**9
train_df['timestamp'] = train_df['timestamp'].astype(np.int32)
train_df.head()

Unnamed: 0,user_id,category,category_id,implicit_rating,catID,event_time,event_date,timestamp
0,128968633,2232732102103663163_furniture.bedroom.blanket,2232732102103663163,2,734,2019-12-31 10:09:41 UTC,2019-12-31 10:09:41,1577786981
1,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,788,2019-12-31 11:30:56 UTC,2019-12-31 11:30:56,1577791856
2,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,788,2019-12-31 15:30:09 UTC,2019-12-31 15:30:09,1577806209
3,192078182,2232732093077520756_construction.tools.light,2232732093077520756,2,668,2020-03-11 05:47:37 UTC,2020-03-11 05:47:37,1583905657
4,192078182,2232732101063475749_appliances.environment.vacuum,2232732101063475749,2,725,2020-01-17 12:51:40 UTC,2020-01-17 12:51:40,1579265500
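
To make the conversion concrete, here is a small standard-library sketch that reproduces the `timestamp` value for a single `event_time` string. The helper name `to_epoch_seconds` is an assumption for illustration. It treats the first 19 characters as a UTC date and time, which matches how the naive `datetime64[ns]` values are turned into epoch seconds in the cell above.

```python
from datetime import datetime, timezone

def to_epoch_seconds(event_time: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS UTC' and return whole seconds since the Unix epoch."""
    parsed = datetime.strptime(event_time[:19], "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

ts = to_epoch_seconds("2019-12-31 10:09:41 UTC")
print(ts)  # 1577786981, the timestamp of the first row above

# The value has to fit into a signed 32-bit integer for the int32 cast to be safe
INT32_MAX = 2**31 - 1
print(ts <= INT32_MAX)
```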


In [5]:
# Transform category and user ids to needed format

# instantiating the labelencoder object
le = LabelEncoder()

train_df['catID'] = train_df['catID'].astype(np.int32)
train_df['userID'] = le.fit_transform(train_df['user_id'])
train_df['userID'] = train_df['userID'].astype(np.int32)

In [6]:
train_df.head()

Unnamed: 0,user_id,category,category_id,implicit_rating,catID,event_time,event_date,timestamp,userID
0,128968633,2232732102103663163_furniture.bedroom.blanket,2232732102103663163,2,734,2019-12-31 10:09:41 UTC,2019-12-31 10:09:41,1577786981,0
1,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,788,2019-12-31 11:30:56 UTC,2019-12-31 11:30:56,1577791856,0
2,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,788,2019-12-31 15:30:09 UTC,2019-12-31 15:30:09,1577806209,0
3,192078182,2232732093077520756_construction.tools.light,2232732093077520756,2,668,2020-03-11 05:47:37 UTC,2020-03-11 05:47:37,1583905657,2
4,192078182,2232732101063475749_appliances.environment.vacuum,2232732101063475749,2,725,2020-01-17 12:51:40 UTC,2020-01-17 12:51:40,1579265500,2


In [7]:
train_df.nunique()

user_id             839183
category               910
category_id            910
implicit_rating          5
catID                  910
event_time         3047065
event_date         3047065
timestamp          3047065
userID              839183
dtype: int64

In [8]:
train_df.dtypes

user_id                     int64
category                   object
category_id                 int64
implicit_rating             int64
catID                       int32
event_time                 object
event_date         datetime64[ns]
timestamp                   int32
userID                      int32
dtype: object

In [9]:
# Applying the same timestamp conversion to all other dfs
for df in (val_train_df, val_test_df, test_df):
    df['event_date'] = df['event_time'].str[:19]  # keeping only the 'YYYY-MM-DD HH:MM:SS' portion of event_time
    df['event_date'] = df['event_date'].astype('datetime64[ns]')
    df['timestamp'] = df['event_date'].values.astype(np.int64)//10**9
    df['timestamp'] = df['timestamp'].astype(np.int32)  # integer seconds, same dtype as in train_df

In [10]:
# Transform category and user ids to needed format
val_train_df['catID'] = val_train_df['catID'].astype(np.int32)
val_train_df['userID'] = le.fit_transform(val_train_df['user_id'])
val_train_df['userID'] = val_train_df['userID'].astype(np.int32)

In [11]:
val_train_df.nunique()

user_id             727477
category               910
category_id            910
implicit_rating          5
catID                  910
event_time         2693000
event_date         2693000
timestamp          2693000
userID              727477
dtype: int64

In [12]:
# Transform category and user ids to needed format
val_test_df['catID'] = val_test_df['catID'].astype(np.int32)
val_test_df['userID'] = le.fit_transform(val_test_df['user_id'])
val_test_df['userID'] = val_test_df['userID'].astype(np.int32)

In [13]:
# Transform category and user ids to needed format
test_df['catID'] = test_df['catID'].astype(np.int32)
test_df['userID'] = le.fit_transform(test_df['user_id'])
test_df['userID'] = test_df['userID'].astype(np.int32)
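
One point worth checking here: each `fit_transform` call builds a fresh mapping, so the same `userID` integer can refer to different users in different dataframes. A minimal sketch of one alternative, assuming the mapping should be learned once from the training users and then reused, is shown below in plain NumPy. The function names `fit_user_index` and `encode_users` and the `-1` marker for unseen users are assumptions for this example.

```python
import numpy as np

def fit_user_index(train_user_ids):
    """Sorted array of distinct training user ids; the position is the encoded id."""
    return np.unique(train_user_ids)

def encode_users(classes, user_ids, unknown=-1):
    """Map raw user ids to positions in `classes`; unseen ids get `unknown`."""
    user_ids = np.asarray(user_ids)
    pos = np.searchsorted(classes, user_ids)
    pos = np.clip(pos, 0, len(classes) - 1)
    known = classes[pos] == user_ids
    return np.where(known, pos, unknown).astype(np.int32)

train_ids = np.array([128968633, 128968633, 192078182, 150000000])
test_ids = np.array([192078182, 999999999, 128968633])

classes = fit_user_index(train_ids)
train_codes = encode_users(classes, train_ids)
test_codes = encode_users(classes, test_ids)
print(train_codes.tolist())  # [0, 0, 2, 1]
print(test_codes.tolist())   # [2, -1, 0]
```

Rows encoded as `-1` belong to users the model has never seen and would need to be dropped or handled separately before building the interaction objects.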

In [14]:
# Creating interaction Spotlight objects since Spotlight model expects this specific type of object
train=Interactions(user_ids=train_df['userID'].to_numpy(),item_ids=train_df['catID'].to_numpy(),ratings=train_df['implicit_rating'].to_numpy(),timestamps=train_df['timestamp'].to_numpy())

test=Interactions(user_ids=test_df['userID'].to_numpy(),item_ids=test_df['catID'].to_numpy(),ratings=test_df['implicit_rating'].to_numpy(),timestamps=test_df['timestamp'].to_numpy())

val_train=Interactions(user_ids=val_train_df['userID'].to_numpy(),item_ids=val_train_df['catID'].to_numpy(),ratings=val_train_df['implicit_rating'].to_numpy(),timestamps=val_train_df['timestamp'].to_numpy())

val_test=Interactions(user_ids=val_test_df['userID'].to_numpy(),item_ids=val_test_df['catID'].to_numpy(),ratings=val_test_df['implicit_rating'].to_numpy(),timestamps=val_test_df['timestamp'].to_numpy())

In [15]:
train

<Interactions dataset (839183 users x 927 items x 3559313 interactions)>
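
The summary reports 927 items, while `train_df.nunique()` lists 910 distinct `catID` values. A plausible explanation, assuming the number of items is inferred as the largest id plus one when it is not passed explicitly, is that the ids are not contiguous. The toy example below shows how the two counts can differ.

```python
import numpy as np

# catID values from the first rows of the training data
item_ids = np.array([734, 788, 788, 668, 725], dtype=np.int32)

n_distinct = np.unique(item_ids).size  # how many different ids actually appear
n_inferred = int(item_ids.max()) + 1   # size of an embedding table indexed by raw id

print(n_distinct, n_inferred)  # 4 789
```

Under that assumption, ids that never occur in training still get an embedding row, but that row receives no training signal.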

In [16]:
model = ExplicitFactorizationModel(loss='regression',
                                   embedding_dim=64,  # latent dimensionality
                                   n_iter=4,  # number of epochs of training
                                   batch_size=64,  # minibatch size
                                   l2=1e-9,  # strength of L2 regularization
                                   learning_rate=1e-3,
                                   use_cuda=True)  # train on the GPU

model.fit(train, verbose=True)  # training the model

Epoch 0: loss 0.2629178272461773
Epoch 1: loss 0.04738976570742345
Epoch 2: loss 0.043486515086508935
Epoch 3: loss 0.041886863962420695
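
The values logged above are training losses. Assuming `loss='regression'` corresponds to a mean squared error between observed and predicted ratings, the sketch below shows that quantity and the related root mean squared error on made-up numbers. The `rmse_score` function imported from `spotlight.evaluation` at the top of the notebook is not called in this section; a call such as `rmse_score(model, test)` would presumably report the same kind of metric on held-out data.

```python
import numpy as np

def mse(observed, predicted):
    """Mean squared error between observed and predicted ratings."""
    observed = np.asarray(observed, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    return float(np.mean((observed - predicted) ** 2))

def rmse(observed, predicted):
    """Root mean squared error, expressed in the same units as the ratings."""
    return float(np.sqrt(mse(observed, predicted)))

# Made-up ratings and predictions, for illustration only
observed = [2, 2, 5, 1]
predicted = [2.1, 1.8, 4.5, 1.4]

print(mse(observed, predicted))
print(rmse(observed, predicted))
```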


In [17]:
torch.save(model, 'MF_Spotlight.pt')# saving model

In [18]:
model = torch.load('MF_Spotlight.pt')# loading model

In [19]:
test_csr = test.tocsr()  # converting the test interactions to a sparse CSR matrix (users x items)

In [20]:
# Scoring all items for each test user and keeping the 10 highest-scoring ones, best first
results=[]
for userID in range(test_csr.shape[0]):
    predictions = model.predict(userID)
    results.append(predictions.argsort()[-10:][::-1])
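
The top-10 selection relies on `argsort`, which returns indices in ascending score order, so the last 10 indices are taken and reversed. The sketch below reproduces this on a toy score vector and shows a possible alternative based on `np.argpartition`, which avoids sorting the full vector. The function names are assumptions for this example.

```python
import numpy as np

def top_k(scores, k=10):
    """Indices of the k highest scores, best first (full sort)."""
    return np.argsort(scores)[-k:][::-1]

def top_k_partition(scores, k=10):
    """Same result without sorting everything: partition first, then sort only k items."""
    scores = np.asarray(scores)
    idx = np.argpartition(scores, -k)[-k:]
    return idx[np.argsort(scores[idx])[::-1]]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(top_k(scores, k=3).tolist())            # [1, 3, 4]
print(top_k_partition(scores, k=3).tolist())  # [1, 3, 4]
```

With only a few hundred categories per user the difference is small, so the full sort used in the notebook is a reasonable choice.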

In [21]:
# Converting prediction of model into a df
predictions_df = pd.DataFrame(data=results)
predictions_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,843,611,365,791,602,29,855,668,45,862
1,843,611,365,791,602,29,855,45,862,788
2,843,611,602,791,365,29,862,855,45,805
3,668,152,843,655,45,365,788,776,791,162
4,843,611,602,788,862,45,365,784,716,678


In [22]:
# Building a df with one userID per row of test_csr
user_df = pd.DataFrame({'userID': np.arange(test_csr.shape[0])})

In [23]:
predictions_df['userID']= user_df['userID']# mapping userID to predictions_df
predictions_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,userID
0,843,611,365,791,602,29,855,668,45,862,0
1,843,611,365,791,602,29,855,45,862,788,1
2,843,611,602,791,365,29,862,855,45,805,2
3,668,152,843,655,45,365,788,776,791,162,3
4,843,611,602,788,862,45,365,784,716,678,4


In [24]:
# Keeping only the latest row per userID, in case a user has multiple sequences
predictions_df = predictions_df.groupby('userID').tail(1)
# Rearranging recs from wide format (one column per rank) to long format (one row per user and rank)
predictions_df = predictions_df.melt(id_vars=["userID"], var_name="Category_Rank", value_name="catID")
predictions_df.head()

Unnamed: 0,userID,Category_Rank,catID
0,0,0,843
1,1,0,843
2,2,0,843
3,3,0,668
4,4,0,843
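
The reshaping step can be illustrated on a tiny frame. This is a sketch with three ranks instead of ten and two users taken from the output above. `melt` turns each rank column into rows, so every user ends up with one row per recommended category.

```python
import pandas as pd

# Wide format: one row per user, one column per rank (0 = best)
wide = pd.DataFrame({0: [843, 668], 1: [611, 152], 2: [365, 843], "userID": [0, 3]})

# Long format: one row per (userID, rank) pair
long = wide.melt(id_vars=["userID"], var_name="Category_Rank", value_name="catID")
long = long.sort_values(["userID", "Category_Rank"]).reset_index(drop=True)

print(long)
```

Sorting by `userID` and `Category_Rank` is optional; it is added here only so that each user's recommendations appear together in rank order.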


In [25]:
cat_mapping = test_df[['catID','category','category_id']]
cat_mapping = cat_mapping.drop_duplicates(subset=['catID','category','category_id'])

user_mapping = test_df[['userID','user_id']]
user_mapping = user_mapping.drop_duplicates(subset=['userID','user_id'])

In [26]:
# Merging predictions_df to obtain the correct user and category
predictions_df = pd.merge(predictions_df, cat_mapping,  how='inner', on='catID')
predictions_df = pd.merge(predictions_df, user_mapping,  how='inner', on='userID')
#Dropping duplicates
predictions_df = predictions_df.drop_duplicates(['user_id','catID','category','category_id'])
predictions_df.head(10)

Unnamed: 0,userID,Category_Rank,catID,category,category_id,user_id
0,0,0,843,2232732113948377930_sport.bicycle,2232732113948377930,128968633
1,0,7,668,2232732093077520756_construction.tools.light,2232732093077520756,128968633
2,0,6,855,2232732115005342564_apparel.shoes.keds,2232732115005342564,128968633
3,0,3,791,2232732108839715530_apparel.costume,2232732108839715530,128968633
4,0,9,862,2232732116347519880_appliances.environment.vacuum,2232732116347519880,128968633
5,0,2,365,2053013565639492569_apparel.shoes,2053013565639492569,128968633
6,0,8,45,2053013553325015316_appliances.kitchen.toster,2053013553325015316,128968633
7,0,5,29,2053013552821698803_computers.peripherals.mouse,2053013552821698803,128968633
8,0,1,611,2232732082063278200_electronics.clocks,2232732082063278200,128968633
9,0,4,602,2232732079009824823_kids.skates,2232732079009824823,128968633


In [27]:
# Saving Results in S3
predictions_df.to_csv('s3://myaws-capstone-bucket/data/modeling/output/MF_Spotlight_Param1.csv',index=False)
predictions_df.nunique()

userID           548860
Category_Rank        10
catID               588
category            588
category_id         588
user_id          548860
dtype: int64