# Notebook #1: Designing and evaluating a recommendation algorithm

In this notebook, we become familiar with the Python recommendation toolbox in the simplest possible way. First, we set up the working environment in GDrive. Then, we go through the experimental pipeline by:
- loading the Movielens 1M dataset;
- performing a train-test split;
- creating a pointwise / pairwise / random / mostpop recommendation object;
- training the model (if applicable);
- computing the user-item relevance matrix;
- calculating some of the recommendation metrics (e.g., NDCG, Item Coverage, Diversity, Novelty).

The trained models, together with the intermediate results we save (e.g., the user-item relevance matrix or the metrics), will be the starting point for the analyses and treatments covered in the other Jupyter notebooks.

**IMPORTANT**: Please go to the "Runtime" option in the top menu, then click on "Change runtime type" and select "GPU".

## Set up the working environment for this notebook

- Python 3.6
- Package Requirements: matplotlib, numpy, pandas, scikit-learn, scipy, tensorflow-gpu==2.0
- Storage requirements: around 1GB

This step mounts the GDrive storage within this Jupyter notebook. The command will ask us to grant access permissions to this notebook, so that we can clone the project repository into our drive. Please follow the prompted instructions.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

We will clone the project repository in our My Drive folder. If you wish to change the target folder, please modify the command below.

In [None]:
%cd /content/gdrive/My Drive/

In [None]:
! git clone https://github.com/mirkomarras/bias-recsys-tutorial.git

We will move to the project folder in order to install the required packages. 

In [None]:
%cd bias-recsys-tutorial

In [None]:
! ls

In [None]:
! pip install -r requirements.txt

We will configure the notebooks directory as our working directory in order to simulate a local notebook execution. 

In [None]:
%cd ./notebooks

## Import packages

In [None]:
import sys 
import os

sys.path.append('..')  # make the project root importable from ./notebooks

In [None]:
import pandas as pd
import numpy as np

In [None]:
from helpers.train_test_splitter import *
from models.pointwise import PointWise
from models.pairwise import PairWise
from models.mostpop import MostPop
from models.random import Random
from helpers.utils import *

We will define the folders where we will store our pre-computed results. 

In [None]:
data_path = '../data/'

In [None]:
!mkdir -p '../data/outputs'
!mkdir -p '../data/outputs/splits'
!mkdir -p '../data/outputs/instances'
!mkdir -p '../data/outputs/models'
!mkdir -p '../data/outputs/predictions'
!mkdir -p '../data/outputs/metrics'

## Load data

First, we will load the Movielens 1M dataset, which has been pre-arranged to comply with the following structure: user_id, item_id, rating, timestamp, type (label of the item category), and type_id (unique id of the item category). To keep the tutorial simple, we assume here that each item is randomly assigned to one of the categories it has in the original dataset. Our toolbox can integrate any other dataset in csv format that has the same structure as the pre-arranged csv shown below; no further changes are needed to experiment with other datasets.

In [None]:
dataset = 'ml1m'  
user_field = 'user_id'
item_field = 'item_id'
rating_field = 'rating'
time_field = 'timestamp'
type_field = 'type_id'

In [None]:
data = pd.read_csv(os.path.join(data_path, 'datasets/' + dataset + '.csv'), encoding='utf8')

In [None]:
data.head()
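
Before plugging in another dataset, it can help to check that the csv exposes the expected columns. The following is only a minimal sketch on an in-memory toy csv; the `expected` list mirrors the structure described above, while the toy row and the error message are illustrative assumptions.

```python
import io
import pandas as pd

# Column names taken from the structure described in this notebook.
expected = ['user_id', 'item_id', 'rating', 'timestamp', 'type', 'type_id']

# Toy csv content (hypothetical values, used only for illustration).
csv_text = (
    "user_id,item_id,rating,timestamp,type,type_id\n"
    "1,10,4,978300760,Drama,3\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

# Collect the expected columns that are absent from the loaded frame.
missing = [col for col in expected if col not in toy.columns]
if missing:
    raise ValueError('Missing columns: ' + ', '.join(missing))
print('All expected columns are present.')
```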

During this tutorial, we will simulate a scenario with implicit feedback.

In [None]:
data[rating_field] = 1.0  # every observed interaction counts as positive feedback
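
To make the conversion to implicit feedback concrete, here is a small sketch on a toy dataframe. The first option reproduces what this tutorial does (all observed ratings become 1.0). The thresholded variant is a common alternative shown only for comparison; it is an assumption and is not used in this notebook.

```python
import pandas as pd

# Toy explicit ratings (hypothetical values, not taken from Movielens).
toy = pd.DataFrame({
    'user_id': [0, 0, 1, 1],
    'item_id': [10, 11, 10, 12],
    'rating': [5.0, 2.0, 3.0, 4.0],
})

# Option used in this tutorial: every observed interaction is a positive.
implicit_all = toy.copy()
implicit_all['rating'] = 1.0

# Alternative (assumption, not used here): keep only ratings at or above
# a threshold, then binarize them.
threshold = 4.0
implicit_thr = toy[toy['rating'] >= threshold].copy()
implicit_thr['rating'] = 1.0

print(len(implicit_all), len(implicit_thr))
```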

## Split data in train and test sets

- **smode**: 'uftime' for fixed timestamp split, 'utime' for time-based split per user, 'urandom' for random split per user 
- **train_ratio**: fraction of data to be included in the train set (e.g., 0.80 for 80%)
- **min_train**: minimum number of train samples for a user to be included  
- **min_test**: minimum number of test samples for a user to be included
- **min_time**: start timestamp for computing the splitting timestamp (only for uftime)
- **max_time**: end timestamp for computing the splitting timestamp (only for uftime)
- **step_time**: timestamp step for computing the splitting timestamp (only for uftime)

In [None]:
smode = 'utime'
train_ratio = 0.80        
min_train_samples = 8
min_test_samples = 2
min_time = None
max_time = None
step_time = 1000

During this tutorial, we will work with a common time-based split per user. 

In [None]:
if smode == 'uftime':
    traintest = fixed_timestamp(data, min_train_samples, min_test_samples, min_time, max_time, step_time, user_field, item_field, time_field, rating_field)
elif smode == 'utime':
    traintest = user_timestamp(data, train_ratio, min_train_samples+min_test_samples, user_field, item_field, time_field)
elif smode == 'urandom':
    traintest = user_random(data, train_ratio, min_train_samples+min_test_samples, user_field, item_field)
else:
    # Fail early instead of leaving traintest undefined
    raise ValueError('Unknown split mode: ' + smode)
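
To illustrate the idea behind a time-based split per user, the sketch below labels the oldest interactions of each user as train and the most recent ones as test. The function `toy_user_time_split` is hypothetical and is not the toolbox's `user_timestamp`; in particular, it does not filter users with too few interactions.

```python
import pandas as pd

def toy_user_time_split(df, train_ratio, user_col='user_id', time_col='timestamp'):
    """Label each user's oldest interactions as train, the rest as test."""
    df = df.sort_values([user_col, time_col]).copy()
    # Position of each interaction in the user's history (0 = oldest).
    rank = df.groupby(user_col).cumcount()
    # Number of interactions of the user, repeated on each of their rows.
    size = df.groupby(user_col)[time_col].transform('size')
    df['set'] = 'test'
    df.loc[rank < (size * train_ratio).astype(int), 'set'] = 'train'
    return df

# Toy interactions: two users with five interactions each.
toy = pd.DataFrame({
    'user_id': [0] * 5 + [1] * 5,
    'item_id': list(range(5)) + list(range(5)),
    'timestamp': [5, 1, 3, 2, 4, 10, 30, 20, 50, 40],
})
split = toy_user_time_split(toy, train_ratio=0.8)
print(split['set'].value_counts())
```

With a ratio of 0.8 and five interactions per user, four interactions per user go to the train set and the most recent one goes to the test set.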

Please note that user_ids and item_ids have been rescaled so that user_ids are in [0, no_users - 1] and item_ids are in [0, no_items - 1]. If you wish to link these new ids to the original ones, please refer to the user_id_original and item_id_original columns.

In [None]:
traintest.head()
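
A remapping of this kind can be sketched with `pd.factorize`, which assigns contiguous zero-based codes to the distinct values of a column. This is only an illustration on toy ids; how the toolbox performs the rescaling internally is not shown here, so the ordering of the codes is an assumption.

```python
import pandas as pd

# Toy interactions with sparse, non-contiguous ids (hypothetical values).
raw = pd.DataFrame({
    'user_id': [101, 205, 101, 999],
    'item_id': [7, 7, 42, 13],
})

remapped = raw.copy()
for col in ['user_id', 'item_id']:
    # Keep the original id so that it can be recovered later.
    remapped[col + '_original'] = raw[col]
    # sort=True assigns codes following the sorted order of the original ids.
    codes, uniques = pd.factorize(raw[col], sort=True)
    remapped[col] = codes

print(remapped)
```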

For the sake of replicability and efficiency, we will save the pre-computed train and test sets in ./data/outputs/splits.

In [None]:
traintest.to_csv(os.path.join(data_path, 'outputs/splits/' + dataset + '_' + smode + '.csv'))

## Run the model train and test

We will create two dataframes, one with train feedback and another with test feedback, from the pre-computed split data. 

In [None]:
train = traintest[traintest['set']=='train'].copy()
test = traintest[traintest['set']=='test'].copy()

In [None]:
users = list(np.unique(traintest[user_field].values))
items = list(np.unique(traintest[item_field].values))

In [None]:
len(users), len(items)

In [None]:
category_per_item = traintest.drop_duplicates(subset=[item_field], keep='first')[type_field].values

In [None]:
len(np.unique(category_per_item))

For simplicity, we will focus on four main recommendation strategies:
- Random
- MostPop
- PointWise
- PairWise

In [None]:
model_types = {'random': Random, 'mostpop': MostPop, 'pointwise': PointWise, 'pairwise': PairWise} 
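
To give an intuition of the simplest non-random baseline, the sketch below scores each item by the number of train interactions it received and assigns the same scores to every user. The function `mostpop_scores` is hypothetical; the `MostPop` class of the toolbox may be implemented differently.

```python
import numpy as np

def mostpop_scores(train_item_ids, no_users, no_items):
    """Return a (no_users, no_items) matrix where each row holds item popularity."""
    # Count how many train interactions each item received.
    counts = np.bincount(train_item_ids, minlength=no_items).astype(float)
    # Popularity is not personalized: repeat the same row for every user.
    return np.tile(counts, (no_users, 1))

# Toy train interactions: item 2 is the most popular, item 3 is never seen.
toy_item_ids = np.array([0, 2, 2, 1, 2])
pop_scores = mostpop_scores(toy_item_ids, no_users=3, no_items=4)
print(pop_scores)
```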

First, we need to initialize the model. We will see how the process works for a PairWise algorithm. Then, we will consider the other ones. 

In [None]:
model_type = 'pairwise'
model = PairWise(users, items, train, test, category_per_item, item_field, user_field, rating_field)

We will train the model by feeding the train data we previously prepared, with the following default values. 

- **no_epochs** (default: 100)
- **batches** (default: 1024)
- **lr** (default: 0.001)
- **no_factors** (default: 10)
- **no_negatives** (default: 10)
- **val_split** (default: 0.0001)

In [None]:
model.train(no_epochs=5) # To keep the tutorial fast, we stop training after 5 epochs
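
The role of **no_negatives** can be illustrated with a sketch of how pairwise training instances are commonly built: each observed (user, item) pair is matched with a number of items the user has not interacted with. The function `sample_triplets` is hypothetical and uniform sampling is an assumption; the toolbox's instance creator may follow a different strategy.

```python
import numpy as np

def sample_triplets(observed, no_items, no_negatives, seed=0):
    """observed: dict user -> set of observed items.
    Returns an array of (user, positive item, negative item) rows."""
    rng = np.random.RandomState(seed)
    triplets = []
    for user, pos_items in observed.items():
        # Candidate negatives: items the user has never interacted with.
        candidates = np.array([i for i in range(no_items) if i not in pos_items])
        for pos in sorted(pos_items):
            # Draw negatives uniformly at random, with replacement.
            negs = rng.choice(candidates, size=no_negatives, replace=True)
            triplets.extend((user, pos, int(neg)) for neg in negs)
    return np.array(triplets)

# Toy feedback: user 0 observed items 0 and 1, user 1 observed item 2.
observed = {0: {0, 1}, 1: {2}}
triplets = sample_triplets(observed, no_items=5, no_negatives=3)
print(triplets.shape)
```

Three observed pairs with three negatives each yield nine triplets.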

The architecture of the trained model looks as follows. 

In [None]:
model.print()

## Compute user-item relevance scores

Now, we will use the trained model to predict the user-item relevance scores.

In [None]:
model.predict()

In [None]:
scores = model.get_predictions()

As expected, the predicted scores are stored in a matrix of shape no_users x no_items.

In [None]:
scores.shape

Hence, we can access the relevance score of user 120 for item 320 as follows.

In [None]:
user_id, item_id = 120, 320
scores[user_id, item_id]
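
A relevance matrix of this kind is typically turned into recommendation lists by ranking, for each user, the items by decreasing score. The sketch below does so on a random toy matrix and excludes items already seen in the train set; the `seen` dictionary and the masking step are assumptions for illustration, not necessarily what the toolbox does.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_scores = rng.rand(3, 6)               # 3 users x 6 items
seen = {0: [1, 2], 1: [0], 2: [5]}        # hypothetical train items per user

# Prevent already-seen items from being recommended again.
masked = toy_scores.copy()
for user, item_ids in seen.items():
    masked[user, item_ids] = -np.inf

# Rank items by decreasing score and keep the first k per user.
k = 2
top_k = np.argsort(-masked, axis=1)[:, :k]
print(top_k)
```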

For the sake of convenience, we will save the predicted scores. 

In [None]:
save_obj(scores, os.path.join(data_path, 'outputs/predictions/' + dataset + '_' + smode + '_' + model_type + '_scores.pkl'))

## Calculate metrics

In this step, we leverage the predicted scores in order to compute a set of common recommendation metrics. 

In [None]:
cutoffs = np.array([5, 10, 20])

In [None]:
item_group = load_obj(os.path.join(data_path, 'datasets', 'ml1m-item-group')) 
# we discuss this point in detail in the third notebook

In [None]:
model.test(item_group=item_group, cutoffs=cutoffs)
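
As a reference for one of these metrics, the sketch below computes NDCG at a cutoff k for a single user, assuming binary relevance and a log2 position discount. The function `ndcg_at_k` is hypothetical; the formulation used inside `model.test` may differ in its details.

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k with binary relevance for one user."""
    ranked = list(ranked_items)[:k]
    # Gain is 1 for a relevant item, 0 otherwise.
    gains = np.array([1.0 if item in relevant_items else 0.0 for item in ranked])
    # Positions 1, 2, 3, ... are discounted by log2(position + 1).
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    dcg = float(np.sum(gains * discounts))
    # Ideal ranking: all relevant items (up to k) placed at the top.
    ideal_hits = min(len(relevant_items), k)
    if ideal_hits == 0:
        return 0.0
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg

# Toy example: test items {1, 2}; only item 1 appears in the top 3, at rank 2.
value = ndcg_at_k([3, 1, 4, 2], {1, 2}, k=3)
print(round(value, 4))
```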

The method has pre-computed a set of metrics and saved the corresponding values in a Python dictionary, as detailed below. 

In [None]:
metrics = model.get_metrics()

In [None]:
metrics.keys()

The values of each metric have been computed and stored for each cutoff.

In [None]:
for name, values in metrics.items():
    print(values.shape, name)

For instance, we can access the NDCG score of user 1324 at cutoff 10 with the following commands.

In [None]:
user_id, cutoff_index = 1324, int(np.where(cutoffs == 10)[0][0])
metrics['ndcg'][cutoff_index, user_id]

For the sake of convenience, we will save the computed metrics.

In [None]:
save_obj(metrics, os.path.join(data_path, 'outputs/metrics/' + dataset + '_' + smode + '_' + model_type + '_metrics.pkl'))

We can also see the aggregated values. 

In [None]:
model.show_metrics(index_k=int(np.where(cutoffs == 10)[0][0]))
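
For a per-user metric stored with shape (no_cutoffs, no_users), an aggregated value can be obtained by averaging over users at the chosen cutoff. The sketch below uses a toy dictionary with made-up values; whether `show_metrics` aggregates every metric with a plain mean is an assumption.

```python
import numpy as np

cutoffs = np.array([5, 10, 20])

# Toy per-user NDCG values: one row per cutoff, one column per user.
toy_metrics = {
    'ndcg': np.array([[0.1, 0.3],
                      [0.2, 0.4],
                      [0.3, 0.5]]),
}

# Locate the row of cutoff 10, then average over the users.
index_k = int(np.where(cutoffs == 10)[0][0])
mean_ndcg_at_10 = toy_metrics['ndcg'][index_k].mean()
print(mean_ndcg_at_10)
```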

## Repeat the experimental pipeline for Random and MostPop (optionally for PointWise)

We will define a utility function to perform all the above operations jointly.

In [None]:
def run_model(model_type, no_epochs=None):
    print('Running model', model_type)
    model = model_types[model_type](users, items, train, test, category_per_item, item_field, user_field, rating_field)
    # Use the model's default number of epochs unless one is given
    if no_epochs:
        model.train(no_epochs=no_epochs)
    else:
        model.train()
    model.predict()
    scores = model.get_predictions()
    save_obj(scores, os.path.join(data_path, 'outputs/predictions/' + dataset + '_' + smode + '_' + model_type + '_scores.pkl'))
    model.test(item_group=item_group, cutoffs=cutoffs)
    metrics = model.get_metrics()
    save_obj(metrics, os.path.join(data_path, 'outputs/metrics/' + dataset + '_' + smode + '_' + model_type + '_metrics.pkl'))
    print()
    model.show_metrics(index_k=int(np.where(cutoffs == 10)[0][0]))

In [None]:
run_model('random')

In [None]:
run_model('mostpop')

In [None]:
run_model('pointwise', no_epochs=5)

## How to extend the toolbox

- New splitter: take a look at the helpers/train_test_splitter.py file and how the existing generators have been defined. 
- New train instances creator: similarly, take a look at the helpers/instances_creator.py file and how the existing generators have been defined. 
- New model: a new subclass of the Model class defined in models/model.py should be defined, implementing a 'train' and a 'predict' method. 
- New metrics: both the 'test' and 'show_metrics' methods of models/model.py should be extended with the computation needed by the new metric.  
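
As an example of the kind of computation a new metric would require, the sketch below derives item coverage at cutoff k from a relevance matrix, defined here as the fraction of catalog items that appear in at least one user's top-k list. The function `item_coverage_at_k` is hypothetical and this definition is an assumption; it is not code from models/model.py.

```python
import numpy as np

def item_coverage_at_k(scores, k):
    """Fraction of items recommended to at least one user within the top k."""
    # Top-k item ids per user, ranked by decreasing score.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    return len(np.unique(top_k)) / scores.shape[1]

# Toy relevance matrix: 2 users x 4 items; both users prefer items 0 and 1.
toy_scores = np.array([[0.9, 0.8, 0.1, 0.0],
                       [0.7, 0.9, 0.2, 0.1]])
coverage = item_coverage_at_k(toy_scores, k=2)
print(coverage)
```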