This notebook uses the Spotlight package (https://github.com/maciejkula/spotlight) to build an LSTM recommendation model. Background on the model can be found in the following article:
- https://towardsdatascience.com/introduction-to-recommender-system-part-2-adoption-of-neural-network-831972c4cbf7

The Spotlight documentation page for LSTMNet is also used as a reference: https://maciejkula.github.io/spotlight/sequence/representations.html#spotlight.sequence.representations.LSTMNet

This code uses the training, validation and test sets created in the notebook Implicit_Rating_Calculation_category.ipynb to train and test the model.

#### Installing the Spotlight library 

In [1]:
!pip install git+https://github.com/maciejkula/spotlight.git@master --upgrade 

Collecting git+https://github.com/maciejkula/spotlight.git@master
  Cloning https://github.com/maciejkula/spotlight.git (to revision master) to /tmp/pip-req-build-mqd2as9i
  Running command git clone -q https://github.com/maciejkula/spotlight.git /tmp/pip-req-build-mqd2as9i


In [2]:
# Loading needed libraries
import numpy as np
import pandas as pd
import datetime as dt
from datetime import date
import torch
from sklearn.preprocessing import LabelEncoder
import gc

# Spotlight Libraries
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.sequence.representations import LSTMNet  # Network behind representation='lstm' (imported for reference only)
from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split
from spotlight.evaluation import sequence_mrr_score

# Loading libraries for S3 bucket connection
import boto3
import io
from io import StringIO,BytesIO, TextIOWrapper
import gzip

client = boto3.client('s3') 
resource = boto3.resource('s3') 

The input data should contain user IDs, product (here, category) IDs, implicit ratings and, optionally, timestamps.
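
As a rough illustration of that layout, the sketch below builds a two-row frame and checks it for the needed columns. The column names are the ones that appear in this notebook's data; `missing_columns` is a hypothetical helper written for this example and is not part of Spotlight.

```python
import pandas as pd

# Columns the notebook relies on; event_time is optional because timestamps are optional.
REQUIRED_COLUMNS = ["user_id", "catID", "implicit_rating"]
OPTIONAL_COLUMNS = ["event_time"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [name for name in required if name not in df.columns]


# Toy frame shaped like the notebook's input (values copied from its sample rows).
sample_df = pd.DataFrame({
    "user_id": [128968633, 192078182],
    "catID": [734, 668],
    "implicit_rating": [2, 2],
    "event_time": ["2019-12-31 10:09:41 UTC", "2020-03-11 05:47:37 UTC"],
})

print(missing_columns(sample_df))
print(missing_columns(sample_df.drop(columns=["catID"])))
```

Running a check like this before building the Spotlight objects makes a missing column fail early instead of deep inside the data preparation.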

#### Data Preparation

In [3]:
# Reading training and testing dfs - Using the created T_implicit_cat data with timestamp
train_df = pd.read_csv('s3://myaws-capstone-bucket/data/modeling/input/T_implicit_cat_rating_train.csv')
test_df = pd.read_csv('s3://myaws-capstone-bucket/data/modeling/input/T_implicit_cat_rating_test.csv')

In [4]:
# Transforming the event_time column into a Unix timestamp in seconds
train_df['event_date'] = train_df['event_time'].str[:19]  # Keep 'YYYY-MM-DD HH:MM:SS', dropping the ' UTC' suffix
train_df['event_date'] = train_df['event_date'].astype('datetime64[ns]')
train_df['timestamp'] = train_df['event_date'].values.astype(np.int64)//10**9  # Nanoseconds -> seconds
train_df['timestamp'] = train_df['timestamp'].astype(np.int32)
train_df.head()

Unnamed: 0,user_id,category,category_id,implicit_rating,catID,event_time,event_date,timestamp
0,128968633,2232732102103663163_furniture.bedroom.blanket,2232732102103663163,2,734,2019-12-31 10:09:41 UTC,2019-12-31 10:09:41,1577786981
1,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,788,2019-12-31 11:30:56 UTC,2019-12-31 11:30:56,1577791856
2,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,788,2019-12-31 15:30:09 UTC,2019-12-31 15:30:09,1577806209
3,192078182,2232732093077520756_construction.tools.light,2232732093077520756,2,668,2020-03-11 05:47:37 UTC,2020-03-11 05:47:37,1583905657
4,192078182,2232732101063475749_appliances.environment.vacuum,2232732101063475749,2,725,2020-01-17 12:51:40 UTC,2020-01-17 12:51:40,1579265500
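
The conversion in the cell above can be spot-checked on a single value with the standard library alone. This is a minimal sketch, assuming every `event_time` string has the form `YYYY-MM-DD HH:MM:SS UTC`; `to_epoch_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_epoch_seconds(event_time: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS UTC' into whole seconds since the Unix epoch."""
    # Keep the first 19 characters, mirroring .str[:19] in the notebook.
    parsed = datetime.strptime(event_time[:19], "%Y-%m-%d %H:%M:%S")
    # The source strings are labelled UTC, so attach that zone explicitly.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# First row of the table above.
print(to_epoch_seconds("2019-12-31 10:09:41 UTC"))  # 1577786981
```

The printed value matches the `timestamp` column of the first row, which confirms that the integer division by `10**9` yields seconds.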


In [5]:
# Transform category and user ids to needed format

# instantiating the labelencoder object
le = LabelEncoder()

# IDs are shifted by 1 so they start at 1; Spotlight sequences use 0 as the padding value
train_df['catID'] = train_df['catID'].astype(np.int32)+1
train_df['userID'] = le.fit_transform(train_df['user_id'])
train_df['userID'] = train_df['userID'].astype(np.int32)+1

In [6]:
# Applying the same to test df

test_df['event_date'] = test_df['event_time'].str[:19]# Grabbing only timestamp portion from original event_time column
test_df['event_date'] = test_df['event_date'].astype('datetime64[ns]')
test_df['timestamp'] = test_df['event_date'].values.astype(np.int64)//10**9
test_df['timestamp'] = test_df['timestamp'].astype(np.int32)  # Same integer dtype as the training timestamps

In [7]:
# Transform category and user ids to needed format
test_df['catID'] = test_df['catID'].astype(np.int32)+1
# Note: the encoder is refit on the test users, so userID values only identify users within test_df
test_df['userID'] = le.fit_transform(test_df['user_id'])
test_df['userID'] = test_df['userID'].astype(np.int32)+1

In [8]:
test_df.head()

Unnamed: 0,user_id,category,category_id,implicit_rating,catID,event_time,event_date,timestamp,userID
0,128968633,2232732102103663163_furniture.bedroom.blanket,2232732102103663163,2,735,2019-12-31 10:09:41 UTC,2019-12-31 10:09:41,1577786981,1
1,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,789,2019-12-31 11:30:56 UTC,2019-12-31 11:30:56,1577791856,1
2,128968633,2232732108613223108_sport.trainer,2232732108613223108,2,789,2019-12-31 15:30:09 UTC,2019-12-31 15:30:09,1577806209,1
3,200985178,2232732093077520756_construction.tools.light,2232732093077520756,3,669,2019-12-29 19:28:00 UTC,2019-12-29 19:28:00,1577647680,3
4,221480173,2232732093077520756_construction.tools.light,2232732093077520756,2,669,2019-12-17 10:52:52 UTC,2019-12-17 10:52:52,1576579972,4
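
Because the label encoder is fit separately on each frame, the same `userID` value can refer to different users in `train_df` and `test_df`. The predictions below are joined back to `user_id` through `test_df` only, so this does not affect them. If consistent IDs across both frames were ever needed, one option is to build the mapping once and reuse it. The sketch below assumes that approach; `build_id_map` is a hypothetical helper and the user IDs are sample values.

```python
import pandas as pd


def build_id_map(values):
    """Map each distinct raw id to a 1-based integer, leaving 0 free for padding."""
    return {raw: idx for idx, raw in enumerate(sorted(set(values)), start=1)}


train_users = pd.Series([128968633, 192078182, 128968633])
test_users = pd.Series([128968633, 200985178])

# Build the mapping from the training users only, then apply it to both frames.
user_map = build_id_map(train_users)
train_ids = train_users.map(user_map)
test_ids = test_users.map(user_map)  # Users unseen in training become NaN

print(train_ids.tolist())
print(test_ids.isna().tolist())
```

Users that appear only in the test set come back as `NaN`, which makes them easy to detect and handle explicitly.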


In [9]:
# Creating interaction Spotlight objects since Spotlight model expects this specific type of object
train=Interactions(user_ids=train_df['userID'].to_numpy(),item_ids=train_df['catID'].to_numpy(),timestamps=train_df['timestamp'].to_numpy())

test=Interactions(user_ids=test_df['userID'].to_numpy(),item_ids=test_df['catID'].to_numpy(),timestamps=test_df['timestamp'].to_numpy())

In [10]:
# Setting min and max sequence length
# Based on the analysis done in notebook Implicit_Rating_Calculation_final.ipynb, sequences have a minimum length of 2 and a maximum length of 8
min_sequence_length = 2
max_sequence_length = 8
random_state = np.random.RandomState(572)

In [11]:
train_seq = train.to_sequence(max_sequence_length=max_sequence_length,
                              min_sequence_length=min_sequence_length)

test_seq = test.to_sequence(max_sequence_length=max_sequence_length,
                              min_sequence_length=min_sequence_length)
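
To give an intuition for what a fixed-length sequence looks like, the sketch below orders one user's items by timestamp, keeps the most recent `max_len` of them and left-pads the rest with zeros. It is a simplified illustration under the assumption that padding uses 0 on the left, as the Spotlight documentation describes. It is not Spotlight's implementation, which can also produce several windows per user, and `last_padded_sequence` is a hypothetical helper.

```python
import numpy as np


def last_padded_sequence(item_ids, timestamps, max_len):
    """Return the user's most recent items, left-padded with zeros to max_len."""
    # Sort interactions chronologically; a stable sort keeps ties in input order.
    order = np.argsort(timestamps, kind="stable")
    recent = np.asarray(item_ids)[order][-max_len:]
    padded = np.zeros(max_len, dtype=np.int64)
    # Write the real items at the right-hand end so padding sits on the left.
    padded[max_len - len(recent):] = recent
    return padded


# Three interactions for one user, given out of chronological order.
items = [735, 789, 789]
times = [3, 1, 2]
print(last_padded_sequence(items, times, max_len=8))
print(last_padded_sequence(items, times, max_len=2))
```

This also shows why the category IDs were shifted to start at 1: a real item with ID 0 would be indistinguishable from padding.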

In [12]:
hyperparameters = {
        'n_iter': 3,
        'batch_size': 16,
        'l2': 0.0,
        'learning_rate': 0.01,
        'loss': 'adaptive_hinge',
        'embedding_dim': 64}


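# Note: hyperparameters['embedding_dim'] is not passed below, so the model keeps Spotlight's default embedding size
model = ImplicitSequenceModel(loss=hyperparameters['loss'],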
                              representation='lstm',
                              batch_size=hyperparameters['batch_size'],
                              learning_rate=hyperparameters['learning_rate'],
                              l2=hyperparameters['l2'],
                              n_iter=hyperparameters['n_iter'],
                              use_cuda=torch.cuda.is_available(),
                              random_state=random_state)

In [13]:
model.fit(train_seq, verbose=True)

Epoch 0: loss 0.15526641930176924
Epoch 1: loss 0.1529445510273218
Epoch 2: loss 0.15150756300594873
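
The training loss alone does not say how well the model ranks the next category. The import cell brings in `sequence_mrr_score`, although this notebook does not call it. As a rough sketch of what a reciprocal-rank metric measures, the NumPy code below scores a held-out target item by its position in the ranked list. The function names and toy scores are assumptions made for illustration, and the code does not reproduce Spotlight's own evaluation routine.

```python
import numpy as np


def reciprocal_rank(scores, target_item):
    """Return 1 / rank of target_item, where rank 1 is the highest score."""
    scores = np.asarray(scores, dtype=float)
    # Rank = 1 + number of items scored strictly higher than the target.
    rank = 1 + int(np.sum(scores > scores[target_item]))
    return 1.0 / rank


def mean_reciprocal_rank(score_rows, targets):
    """Average the reciprocal rank over several (scores, target) pairs."""
    return float(np.mean([reciprocal_rank(s, t) for s, t in zip(score_rows, targets)]))


# Toy example: item 3 has the second-highest score, so its reciprocal rank is 1/2.
print(reciprocal_rank([0.1, 0.9, 0.3, 0.7], target_item=3))
print(mean_reciprocal_rank([[0.1, 0.9], [0.8, 0.2]], targets=[1, 1]))
```

A value close to 1 means the true next item tends to sit at the top of the ranking, while a value near 0 means it is usually far down the list.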


In [14]:
torch.save(model, 'LSTM.pt')

In [15]:
sequences = test_seq.sequences

In [16]:
pred_list=[]
for i in range(len(sequences)):
    predictions = model.predict(sequences[i])
    pred_list.append(predictions.argsort()[-10:][::-1])
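
The loop above sorts every score vector in full just to keep ten entries. A possible alternative, sketched here with NumPy only, is to select the top `k` candidates with `np.argpartition` and sort just those. `top_k` is a hypothetical helper, and the ordering of tied scores may differ from the full-sort version.

```python
import numpy as np


def top_k(scores, k=10):
    """Return the indices of the k highest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    # Partial selection: the last k positions hold the k largest scores, unordered.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score.
    return candidates[np.argsort(scores[candidates])[::-1]]


toy_scores = np.array([0.1, 0.9, 0.3, 0.7])
print(top_k(toy_scores, k=2))
print(toy_scores.argsort()[-2:][::-1])  # Full-sort version used in the notebook
```

With only a few hundred categories the difference is small, so this mainly matters for larger item catalogues.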

In [17]:
# Converting the LSTM predictions into a df
predictions_df = pd.DataFrame(data=pred_list)
predictions_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,669,789,639,155,726,582,715,589,146,72
1,606,669,612,14,639,744,97,663,715,726
2,669,730,85,108,606,715,744,726,752,657
3,715,669,663,726,658,665,656,155,606,156
4,669,606,657,726,715,767,735,85,639,738


In [18]:
# Creating a df of user ids to have them in order
users_df = pd.DataFrame(data=test_seq.user_ids)
users_df.columns = ['userID']

predictions_df['userID']= users_df['userID']

In [19]:
predictions_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,userID
0,669,789,639,155,726,582,715,589,146,72,1
1,606,669,612,14,639,744,97,663,715,726,2
2,669,730,85,108,606,715,744,726,752,657,4
3,715,669,663,726,658,665,656,155,606,156,5
4,669,606,657,726,715,767,735,85,639,738,6


In [20]:
# Keep only the last row per user when a user has multiple sequences (intended to be the most recent one)
predictions_df = predictions_df.groupby('userID').tail(1)

In [21]:
# Rearranging recs from rows to columns
predictions_df = predictions_df.melt(id_vars=["userID"], var_name="Category_Rank", value_name="catID")
predictions_df.head()

Unnamed: 0,userID,Category_Rank,catID
0,1,0,669
1,2,0,606
2,4,0,669
3,5,0,715
4,6,0,669
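
The two reshaping steps above can be followed on a toy frame. In this sketch the numbered columns hold ranked category IDs as in `predictions_df`, but with only two ranks and made-up rows; the variable names are chosen for this example.

```python
import pandas as pd

# Wide layout: one row per sequence, one column per rank, plus the user.
wide = pd.DataFrame({
    0: [669, 606, 715],
    1: [789, 669, 663],
    "userID": [1, 1, 2],
})

# Step 1: keep only the last row for each user (user 1 has two sequences).
latest = wide.groupby("userID").tail(1)

# Step 2: turn rank columns into rows, one (user, rank, category) triple each.
long = latest.melt(id_vars=["userID"], var_name="Category_Rank", value_name="catID")

print(latest)
print(long)
```

After `melt`, each user contributes one row per rank, which is the shape needed to join every recommended category to its name.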


In [22]:
cat_mapping = test_df[['catID','category','category_id']]
cat_mapping = cat_mapping.drop_duplicates(subset=['catID','category','category_id'])

user_mapping = test_df[['userID','user_id']]
user_mapping = user_mapping.drop_duplicates(subset=['userID','user_id'])

In [23]:
# Merging predictions_df to obtain the correct user and category
predictions_df = pd.merge(predictions_df, cat_mapping,  how='inner', on='catID')
predictions_df = pd.merge(predictions_df, user_mapping,  how='inner', on='userID')
#Dropping duplicates
predictions_df = predictions_df.drop_duplicates(['user_id','catID','category','category_id'])
predictions_df.head(10)

Unnamed: 0,userID,Category_Rank,catID,category,category_id,user_id
0,1,0,669,2232732093077520756_construction.tools.light,2232732093077520756,128968633
1,1,6,715,2232732099754852875_appliances.personal.massager,2232732099754852875,128968633
2,1,9,72,2053013554155487563_computers.components.mothe...,2053013554155487563,128968633
3,1,2,639,2232732086928670945_electronics.camera.photo,2232732086928670945,128968633
4,1,1,789,2232732108613223108_sport.trainer,2232732108613223108,128968633
5,1,5,582,2232732061804790604_furniture.bedroom.bed,2232732061804790604,128968633
6,1,4,726,2232732101063475749_appliances.environment.vacuum,2232732101063475749,128968633
7,1,7,589,2232732069102879671_appliances.kitchen.kettle,2232732069102879671,128968633
8,1,8,146,2053013557024391671_apparel.shoes.moccasins,2053013557024391671,128968633
9,1,3,155,2053013557452210699_electronics.clocks,2053013557452210699,128968633


In [26]:
# Saving Results in S3
predictions_df.to_csv('s3://myaws-capstone-bucket/data/modeling/output/LSTM_param1.csv',index=False)

In [27]:
predictions_df.nunique()

userID           535748
Category_Rank        10
catID               578
category            578
category_id         578
user_id          535748
dtype: int64