## Implementing a Recommender System in PyTorch

In this notebook, we will learn how to implement a recommender system that "discovers"/"generates" latent feature vectors, for users and items, given the user ratings. TODO

We will use the Movielens dataset. TODO

Due to resource restrictions, we will be ajusting the model parameters using a batch approach. TODO

In order to perform model comparison, we will use several approaches: 
1. Mean rating prediction error, TODO
2. Mean Dislike/Like accuracy, through rating thresholding, TODO
3. Mean Dislike/Like accuracy, per batch user, through rating thresholding, TODO

Model assumptions:
1. The model isn't "time invariant": TODO
2. User taste doesn't change too much in a short time span: TODO

Hyper-parameters:
* Batchsize: TODO
* Epochs: TODO
* Lambda: TODO
* Number of Latent features: TODO

Questions:
1. After receiving new data from the users, how should we select past data to adapt the model?
2. TODO

### The model

$U \in \mathbb{R}^{(\#U x F)}$

$I \in \mathbb{R}^{(\#I x F)}$

$R \in \mathbb{R}^{(\#U x \#I)}$, a sparse real matrix

$T \in \mathbb{R}^{(\#U x \#I)}$, a sparse real matrix

$M \in \mathbb{R}^{(\#U x \#I)}$, a sparse binary matrix

$\hat{R} = (U_f \cdot I_f) \odot M$

$E = \frac{1}{\#R}(\hat{R} - R)^2 + \lambda (\left\lVert U_f \right\rVert_{2}^{2} + \left\lVert I_f \right\rVert_{2}^{2})$

$E = \sqrt{\frac{1}{\#R}(\hat{R} - R)^2} + \lambda (\left\lVert U_f \right\rVert_{2}^{2} + \left\lVert I_f \right\rVert_{2}^{2})$

In [1]:
import pandas as pd
import numpy as np
import torch
import time
from datetime import date
import sys
import pickle
import os

### Loading the ratings dataset.

The original ratings dataset is not sorted by timestamp. To make training and evaluation easier, we created a sorted dataset. TODO

In [2]:
ratings = pd.read_csv('ratings_sorted_by_timestamp.csv')

In [18]:
ratings.head(5)

Unnamed: 0.1,Unnamed: 0,userId,movieId,rating,timestamp
0,0,1,122,2,945544824
1,1,1,172,1,945544871
2,2,1,1221,5,945544788
3,3,1,1441,4,945544871
4,4,1,1609,3,945544824


In [19]:
ratings.sample()

Unnamed: 0.1,Unnamed: 0,userId,movieId,rating,timestamp
9428842,9428842,99708,1302,4.5,1120227243


### Counting how many ratings, users and items do we have in the ratings data frame.

In [20]:
numberOfItems = np.max(ratings.movieId.unique())

In [21]:
numberOfUsers = np.max(ratings.userId.unique())

In [22]:
numberOfRatings = ratings.shape[0]

In [23]:
print((numberOfRatings, numberOfItems, numberOfUsers))

(24404096, 165201, 259137)


In [24]:
def build_ratings_and_mask_submatrices_from_batch(batch, userIds, movieIds): 
    R = np.zeros((userIds.shape[0], movieIds.shape[0]))
    M = np.zeros((userIds.shape[0], movieIds.shape[0]))
    
    movieIdsIndexes = {}
    userIdsIndexes = {}
       
    for i in range(movieIds.shape[0]):
        movieIdsIndexes[movieIds[i]] = i
        
    for i in range(userIds.shape[0]):
        userIdsIndexes[userIds[i]] = i
        
    for entry in batch.iterrows():
        row = entry[1]
        
        uid = int(row.userId)
        mid = int(row.movieId)
        
        uIdx = userIdsIndexes[uid]
        mIdx = movieIdsIndexes[mid]
        
        R[uIdx,mIdx] = row.rating
        M[uIdx,mIdx] = 1
    
    return R, M

### Training (v3.0)

In [25]:
def slice_and_dice_user_and_item_matrices(userIds, itemIds, UFeats, IFeats, numberOfLatentFeatures): 
     # Create temporary user and item matrices.
    U = np.zeros((userIds.shape[0],  numberOfLatentFeatures+1))
    I = np.ones((itemIds.shape[0], numberOfLatentFeatures))

    # Slice the tensors and assign those slices to the temporary matrices.
    U[:,range(numberOfLatentFeatures+1)] = UFeats[userIds]
    I[:,range(numberOfLatentFeatures)]   = IFeats[itemIds]

    U = torch.Tensor(U)
    I = torch.Tensor(I)

    # Create Variables
    ufeats = torch.autograd.Variable(U, requires_grad=True)
    ifeats = torch.autograd.Variable(I, requires_grad=True)
    
    return ufeats, ifeats

In [26]:
def test_prediction(dataset, UFeats, IFeats): 
    testSample = dataset.sample()
    uId = testSample.userId.values[0]
    mId = testSample.movieId.values[0]
    rat = testSample.rating.values[0]

    ufeats = UFeats[uId,0:numberOfLatentFeatures]
    ubias = UFeats[uId,numberOfLatentFeatures]
    ifeats = IFeats[mId,0:numberOfLatentFeatures]

    pred = np.dot(ufeats,ifeats) + ubias

    return (((uId, mId, rat), pred))

In [27]:
def extract_batch(dataset, i, maxBatchSize): 
    batchStart = (i-1) * maxBatchSize
    
    batchEnd   = i * maxBatchSize
    
    batch = trainDataset[batchStart:batchEnd]
    
    return batch

In [252]:
def calculate_prediction_and_costs(ufeats, ifeats, mask, targets, nRats, lamb):
    
    pred = torch.mm(ufeats, ifeats.t()) * mask

    mse  = torch.sum((pred - targets) ** 2) / nRats

    rmse = torch.sqrt(mse)

    regl2_u = (1/ufeats.size()[0]) * torch.sum(ufeats * ufeats)
    #regl2_u = torch.sum(ufeats)

    regl2_i = (1/ifeats.size()[0]) *torch.sum(ifeats * ifeats)
    #regl2_i = torch.sum(ifeats)

    # If prediction is bigger than the maximum allowed, 
    # put an additional penalization.
    #aboveMaximumDifCost = torch.sum((pred > 5).float() - 5)
    
    #batchCost = rmse + (lamb * (regl2_u + regl2_i)) + aboveMaximumDifCost
    batchCost = rmse + (lamb * (regl2_u + regl2_i))

    return pred, mse, rmse, batchCost

In [253]:
def calculate_like_dislike_agreement(predictions, targets, mask, likeThreshold=3): 
    targetLikesIdx = (targets >= likeThreshold)
    targetLikesCount = float(targetLikesIdx.sum())
    likeAgreement = float((predictions[targetLikesIdx] >= likeThreshold).sum()) / targetLikesCount

    targetDislikesIdx = (mask > 0) & (targets < likeThreshold)
    targetDislikesCount = float(targetDislikesIdx.sum())
    dislikeAgreement = float((predictions[targetDislikesIdx] < likeThreshold).sum()) / targetDislikesCount
    
    return likeAgreement, dislikeAgreement

In [254]:
def single_batch_training(batch, UFeats, IFeats, lamb): 
    stats = {
        'counts': {
            'users': 0.0, 'items': 0.0, 'ratings': 0.0, 'sparsity': 0.0
        }, 
        'time': {
            'slicing_and_dicing_UFeats_IFeats': 0.0, 
            'building_ratings_and_mask_submatrices': 0.0, 
            'forward': 0.0, 'backward': 0.0, 'optimizer': 0.0, 
            'updating_UFeats_IFeats': 0.0
        }, 
        'costs': {
            'mse': 0.0, 'rmse': 0.0, 'total': 0.0
        }
    }

    # Extract the list of unique user ids for the batch.
    # Then, sort that list.
    userIds = batch.userId.unique()
    userIds.sort()

    # Extract the list of unique movie ids for the batch.
    # Then, sort that list.
    itemIds = batch.movieId.unique()
    itemIds.sort()
    
    nUIds = userIds.shape[0]
    nIIds = itemIds.shape[0]
    nRats = batch.shape[0]
    sparsity = float(nRats) / float((nUIds * nIIds))
    
    stats['counts']['users'] = nUIds
    stats['counts']['users'] = nIIds
    stats['counts']['ratings'] = nRats
    stats['counts']['sparsity'] = sparsity

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    t0 = time.time()

    ufeats, ifeats = slice_and_dice_user_and_item_matrices(userIds, itemIds, UFeats, IFeats, numberOfLatentFeatures)
    ones = torch.Tensor(np.ones((itemIds.shape[0], 1)))
    ones = torch.autograd.Variable(ones, requires_grad=False)
    iifeats = torch.cat((ifeats, ones), 1)

    t1 = time.time()
    
    stats['time']['slicing_and_dicing_UFeats_IFeats'] = (t1-t0)

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    t0 = time.time()

    # Build ratings and mask matrices.
    targets, mask = build_ratings_and_mask_submatrices_from_batch(batch, userIds, itemIds)
    targets = torch.autograd.Variable(torch.Tensor(targets), requires_grad=False)
    mask    = torch.autograd.Variable(torch.Tensor(mask), requires_grad=False)

    t1 = time.time()
    
    stats['time']['building_ratings_and_mask_submatrices'] = (t1-t0)

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    optimizer1 = torch.optim.RMSprop([ufeats, ifeats])
    optimizer1.zero_grad()

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    t0 = time.time()

    #pred, mse, rmse, batchCost = calculate_prediction_and_costs(ufeats, ifeats, mask, targets, lamb, nRats
    pred, mse, rmse, batchCost = calculate_prediction_and_costs(ufeats, iifeats, mask, targets, nRats, lamb)

    t1 = time.time()
    
    stats['time']['forward'] = (t1-t0)
    
    stats['costs']['mse']   = mse.data.numpy()[0]
    stats['costs']['rmse']  = rmse.data.numpy()[0]
    stats['costs']['total'] = batchCost.data.numpy()[0]

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    t0 = time.time()

    batchCost.backward()

    t1 = time.time()
    
    stats['time']['backward'] = (t1-t0)

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    t0 = time.time()

    optimizer1.step()

    t1 = time.time()
    
    stats['time']['optimizer'] = (t1-t0)

    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------
    # --------------------------------------------------------------------------------

    t0 = time.time()

    UFeats[userIds] = ufeats.data.numpy()[:,0:(numberOfLatentFeatures+1)]

    IFeats[itemIds] = ifeats.data.numpy()[:,0:numberOfLatentFeatures]

    t1 = time.time()
    
    stats['time']['updating_UFeats_IFeats'] = (t1-t0)
    
    return stats, targets.data.numpy(), mask.data.numpy()

In [310]:
models_config = [
    { 'id': '0', 'numberOfLatentFeatures': 10, 'lamb': 0.5, 
      'epochs': 10, 'randomSeed': 1000, 'maxBatchSize': 20000, 
      'likeThreshold': 3.5, 'valPerc': 0.1}
]

In [311]:
trainDataset = ratings

In [317]:
for config in models_config: 
    
    np.random.seed(config['randomSeed'])
    
    batchStats = []
    
    print(config)
    
    numberOfLatentFeatures = config['numberOfLatentFeatures']
    maxBatchSize = config['maxBatchSize']
    numberOfBatches = int(np.ceil(float(trainDataset.shape[0]) / float(maxBatchSize)))
    
    # [] Create model parameters
    
    # The first line won't be used
    UFeats = np.random.uniform(low=0.0, high=1.0, size=(numberOfUsers+1, numberOfLatentFeatures+1))
    UFeats[0,:] = np.inf
    
    # The first line won't be used
    IFeats = np.random.uniform(low=0.0, high=1.0, size=(numberOfItems+1, numberOfLatentFeatures))
    IFeats[0,:] = np.inf
    
    
    # [] Training
    lamb   = config['lamb']
    epochs = config['epochs']
    valPerc = config['valPerc']
    
    for b in range(1,numberOfBatches-1): 
        likeThreshold = config['likeThreshold']
        
        # Extract train, validation and test datasets.
        trainBatch = extract_batch(trainDataset, b,   maxBatchSize)
        testBatch  = extract_batch(trainDataset, b+1, maxBatchSize)
        
        trainSize = trainBatch.shape[0]
        valSize   = int(np.ceil(trainSize * valPerc))
        trainSize = int(trainSize - valSize)
        
        trainBatch = trainBatch.head(trainSize)
        valBatch   = trainBatch.tail(valSize)
        
        print('---------------------------------------')
        
        print('\t  Batch %d/%d' % (b,numberOfBatches-1))
        
        trainStats = {
            '#Ratings': 0, 
            '#Likes': 0, '#Dislikes': 0, 
            '#Users': 0, '#Items': 0, 
            'avgRMSE': 0
        }
        
        # Extract the user and item IDs.
        trainUserIds = trainBatch.userId.unique()
        trainUserIds.sort()
        trainItemIds = trainBatch.movieId.unique()
        trainItemIds.sort()
        
        valUserIds = valBatch.userId.unique()
        valUserIds.sort()
        valItemIds = valBatch.movieId.unique()
        valItemIds.sort()
        valTargets, valMask = build_ratings_and_mask_submatrices_from_batch(valBatch, valUserIds, valItemIds)
        
        testUserIds = testBatch.userId.unique()
        testUserIds.sort()
        testItemIds = testBatch.movieId.unique()
        testItemIds.sort()
        
        # Count target likes & dislikes.
        totalNumberOfLikesInTrainBatch = trainBatch[trainBatch.rating >= likeThreshold].shape[0]
        totalNumberOfDislikesInTrainBatch = trainBatch[trainBatch.rating < likeThreshold].shape[0]
        totalNumberOfLikesInValidationBatch = valBatch[valBatch.rating >= likeThreshold].shape[0]
        totalNumberOfDislikesValidationBatch = valBatch[valBatch.rating < likeThreshold].shape[0]
        totalNumberOfLikesInTestBatch = testBatch[testBatch.rating >= likeThreshold].shape[0]
        totalNumberOfDislikesInTestBatch = testBatch[testBatch.rating < likeThreshold].shape[0]
        
        # Count the ratings.
        totalRatingsInTrainBatch      = trainBatch.shape[0]
        totalRatingsInValidationBatch = valBatch.shape[0]
        totalRatingsInTestBatch       = testBatch.shape[0]
        
        # Print the counts.
        print('\t\t Training counts:')
        print('\t\t\t (#Likes, #Dislikes) : (%d,%d)' % (totalNumberOfLikesInTrainBatch, totalNumberOfDislikesInTrainBatch))
        print('\t\t\t (#Users, #Items, #Ratings): (%d,%d,%d)' % (valUserIds.shape[0], trainItemIds.shape[0], totalRatingsInTrainBatch))
        
        print('\t\t Validation counts:')
        print('\t\t\t (#Likes, #Dislikes) : (%d,%d)' % (totalNumberOfLikesInValidationBatch, totalNumberOfDislikesValidationBatch))
        print('\t\t\t (#Users, #Items, #Ratings): (%d,%d,%d)' % (trainUserIds.shape[0], valItemIds.shape[0], totalRatingsInValidationBatch))
        
        # TODO: rearrange the code to include the validation set!
        
        totalRMSE = 0.0
        for e in range(epochs):
            stats, T, M = single_batch_training(trainBatch, UFeats, IFeats, lamb)
            totalRMSE += stats['costs']['rmse']
            print('\t\t epoch %d | training RMSE: %.5f' % (e, stats['costs']['rmse']))

        ufeats, ifeats = slice_and_dice_user_and_item_matrices(userIds, itemIds, UFeats, IFeats, numberOfLatentFeatures)
        ifeats = torch.cat((ifeats, torch.autograd.Variable(torch.Tensor(np.ones((itemIds.shape[0], 1))), requires_grad=False)), 1)

        targets, mask = build_ratings_and_mask_submatrices_from_batch(trainBatch, userIds, itemIds)
        targets = torch.autograd.Variable(torch.Tensor(targets), requires_grad=False)
        mask    = torch.autograd.Variable(torch.Tensor(mask), requires_grad=False)

        pred, _, rmse, _ = calculate_prediction_and_costs(ufeats, ifeats, mask, targets, nRats, 0)
        pred = pred.data.numpy()
        targets = targets.data.numpy()
        mask = mask.data.numpy()
        
        likeAgreement, dislikeAgreement = calculate_like_dislike_agreement(pred, targets, mask, likeThreshold)
        
        trainStats = {
            '#Ratings': stats['counts']['ratings'], 
            '#Likes': totalNumberOfLikesInTrainBatch, '#Dislikes': totalNumberOfDislikesInTrainBatch, 
            '#Users': stats['counts']['users'], '#Items': stats['counts']['items'], 
            'LikeAgreeRate': likeAgreement, 'DislikeAgreeRate': dislikeAgreement, 
            'avgRMSE': totalRMSE / epochs
        }
        
        print('\t\t training avg RMSE: %.5f' % trainStats['avgRMSE'])
        print('\t\t training Like Aggreement Rate: %.5f' % (trainStats['LikeAgreeRate']))
        print('\t\t training Dislike Aggreement Rate: %.5f' % (trainStats['DislikeAgreeRate']))
            
        # [] Testing
        nRats = testBatch.shape[0]

        userIds = testBatch.userId.unique()
        userIds.sort()
        itemIds = testBatch.movieId.unique()
        itemIds.sort()

        ufeats, ifeats = slice_and_dice_user_and_item_matrices(userIds, itemIds, UFeats, IFeats, numberOfLatentFeatures)
        ifeats = torch.cat((ifeats, torch.autograd.Variable(torch.Tensor(np.ones((itemIds.shape[0], 1))), requires_grad=False)), 1)

        targets, mask = build_ratings_and_mask_submatrices_from_batch(testBatch, userIds, itemIds)
        targets = torch.autograd.Variable(torch.Tensor(targets), requires_grad=False)
        mask    = torch.autograd.Variable(torch.Tensor(mask), requires_grad=False)

        pred, _, rmse, _ = calculate_prediction_and_costs(ufeats, ifeats, mask, targets, nRats, 0)
        pred = pred.data.numpy()
        targets = targets.data.numpy()
        mask = mask.data.numpy()
        
        likeAgreement, dislikeAgreement = calculate_like_dislike_agreement(pred, targets, mask, likeThreshold)
        
        print('\t\t-------------------------------')
        totalNumberOfLikesInTestBatch = testBatch[testBatch.rating >= likeThreshold].shape[0]
        totalNumberOfDislikesInTestBatch = testBatch[testBatch.rating < likeThreshold].shape[0]
        print('\t\t (#Likes, #Dislikes) : (%d,%d)' % (totalNumberOfLikesInTestBatch,totalNumberOfDislikesInTestBatch))
        print('\t\t testing RMSE: %.5f' % (rmse.data.numpy()))
        print('\t\t testing Like Aggreement Rate: %.5f' % (likeAgreement))
        print('\t\t testing Dislike Aggreement Rate: %.5f' % (dislikeAgreement))
        
        testStats = {
            '#Ratings': testBatch.shape[0], 
            '#Likes': totalNumberOfLikesInTestBatch, '#Dislikes': totalNumberOfDislikesInTestBatch, 
            '#Users': userIds.shape[0], '#Items': itemIds.shape[0], 
            'RMSE': rmse.data.numpy(), 
            'LikeAgreeRate': likeAgreement, 'DislikeAgreeRate': dislikeAgreement
        }
        
        batchStats.append({
            'trainStats': trainStats, 'testStats': testStats
        })
    
        print('---------------------------------------')
    
    # [] Write model configuration, parameters and test results to files
    if not os.path.exists('models'): 
        os.makedirs('models')
        
    modelDir = 'models/%s' % (config['id'])
    if not os.path.exists(modelDir):
        os.makedirs(modelDir)
        
    UFeatsFilepath = 'models/%s/UFeats' % (config['id'])
    IFeatsFilepath = 'models/%s/IFeats' % (config['id'])
    ConfigFilepath = 'models/%s/Config' % (config['id'])
    StatsFilepath  = 'models/%s/Stats' % (config['id'])
    
    pickle.dump(UFeats, open(UFeatsFilepath, 'wb'))
    pickle.dump(IFeats, open(IFeatsFilepath, 'wb'))
    pickle.dump(config, open(ConfigFilepath, 'wb'))
    pickle.dump(batchStats, open(StatsFilepath, 'wb'))
    # TODO

{'lamb': 0.5, 'numberOfLatentFeatures': 10, 'randomSeed': 1000, 'likeThreshold': 3.5, 'valPerc': 0.1, 'epochs': 10, 'id': '0', 'maxBatchSize': 20000}
---------------------------------------
	  Batch 1/1220
		 Training counts:
			 (#Likes, #Dislikes) : (11463,6537)
			 (#Users, #Items, #Ratings): (36,4521,18000)
		 Validation counts:
			 (#Likes, #Dislikes) : (1393,607)
			 (#Users, #Items, #Ratings): (205,1170,2000)
		 epoch 0 | training RMSE: 1.37457
		 epoch 1 | training RMSE: 0.99120
		 epoch 2 | training RMSE: 0.90068


KeyboardInterrupt: 

In [None]:
test_prediction(dataset=trainBatch, IFeats=IFeats, UFeats=UFeats)