# Notebook #1: Designing and evaluating a recommendation algorithm

**Hands-on Outline**. In this notebook, we will focus on becoming familiar with the recommendation pipeline through an introductory Python toolbox, in the simplest possible way. Specifically, we will:

- **Step 1** Set up the working environment in GDrive.
- **Step 2** Load and understand the Movielens 1M dataset.
- **Step 3** Split data into training and test sets.
- **Step 4** Define a pointwise / pairwise / random / mostpop recommendation algorithm.
- **Step 5** Train a recommendation model (only for point-wise and pair-wise).
- **Step 6** Compute the user-item matrix that includes the predicted relevance scores.
- **Step 7** Calculate evaluation metrics to monitor properties like effectiveness, catalog coverage, and novelty.  
- **Step 8** Run the full pipeline for the other algorithms under consideration.   

For each step of the pipeline, we will save the corresponding computations (e.g., pre-trained models, user-item relevance matrices and so on). These artifacts will be the starting point of the investigation covered in the subsequent notebooks.

## Step 1: Set up the working environment in GDrive.

Requirements for your working environment:

- Python >= 3.6
- Package Requirements: pandas, numpy, scipy, matplotlib, scikit-learn, tensorflow. 
- GDrive storage requirements: ~1GB

### Mount the GDrive storage

This step serves to mount GDrive storage within this Jupyter notebook. The command will request us to give access permissions to this notebook, so that we will be able to clone the project repository when we desire. Please follow the prompted instructions.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

We will clone the project repository in our My Drive folder. If you wish to change the target folder, please modify the command below.

In [None]:
%cd /content/gdrive/My Drive/

### Clone the Github repository into GDrive

If you want to work with the codebase locally on your laptop, you should start from the following commands.

In [None]:
! git clone https://github.com/biasinrecsys/wsdm2021.git

We will move to the project folder in order to install the required packages. 

In [None]:
%cd wsdm2021

In [None]:
! ls

In [None]:
! pip install -r requirements.txt

We will configure the notebooks directory as our working directory in order to simulate a local notebook execution. 

In [None]:
%cd ./notebooks

### Import Python packages

In [1]:
import sys 
import os

sys.path.append(os.path.join('..'))

In [2]:
import pandas as pd
import numpy as np

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
from helpers.train_test_splitter import *
from models.pointwise import PointWise
from models.pairwise import PairWise
from models.mostpop import MostPop
from models.random import Random
from helpers.utils import *

###  Create folders for saving pre-computed results

We will define the subfolders in **./data** where we will store our pre-computed results. For each dataset:

- *data/outputs/splits* will include two csv files including the train and test interactions, according to the selected train-test split rule.
- *data/outputs/instances* will include a csv file with instances to be fed to the model, either pairs for point-wise or triplets for pair-wise recommenders.
- *data/outputs/models* will include an h5 file associated with a pre-trained recommender model.
- *data/outputs/predictions* will include a numpy file representing a user-item matrix; a cell stores the relevance score of an item for a given user.
- *data/outputs/metrics* will include a pickle dictionary with the computed evaluation metrics for a given recommender model.

**N.B.** This strategy will allow us to play with the intermediate outputs of the pipeline without starting from scratch every time (e.g., to perform a bias treatment as a post-processing step, we just need to load the predictions of a model).

In [5]:
data_path = '../data/'

In [6]:
# Create the output subfolders in a platform-independent way
# (the shell `mkdir` calls fail on Windows because of the quoted forward-slash paths).
for subfolder in ['splits', 'instances', 'models', 'predictions', 'metrics']:
    os.makedirs(os.path.join(data_path, 'outputs', subfolder), exist_ok=True)


## Step 2: Load and understand the Movielens 1M dataset.

First, we will load the **Movielens 1M** dataset, which has been pre-arranged in order to comply with the following structure:

- user_id
- item_id
- rating
- timestamp
- type (label for the item category)
- type_id (unique id of the item category)

To keep the tutorial simple, we assume here that each item is randomly assigned to one of its categories in the original dataset.

**N.B.** This toolbox is flexible enough to integrate any other dataset in csv format that has the same structure as the pre-arranged csv shown below. No further changes to the pipeline are then needed in order to experiment with other datasets. The csv file of the new dataset should be placed in the *data/datasets/* folder and the name of the file (without the extension) should be assigned to the *dataset* parameter below.
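
Before plugging in a new dataset, it may help to check that its csv exposes the expected columns. The snippet below is a minimal sketch of such a check; the `missing_columns` helper and the toy dataframe are hypothetical and not part of the toolbox.

```python
import pandas as pd

# Columns of the pre-arranged csv, as listed in this notebook.
REQUIRED_COLUMNS = ['user_id', 'item_id', 'rating', 'timestamp', 'type', 'type_id']

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the dataframe."""
    return [column for column in required if column not in frame.columns]

# Toy frame standing in for a new dataset (values are made up).
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'item_id': [10, 11, 10],
    'rating': [4.0, 3.0, 5.0],
    'timestamp': ['2000-01-01 10:00:00', '2000-01-02 10:00:00', '2000-01-03 10:00:00'],
    'type': ['Drama', 'Comedy', 'Drama'],
})

# The toy frame lacks the integer-encoded category column.
print(missing_columns(toy))
```

An empty list means the dataframe has every column the pipeline refers to.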

In [7]:
dataset = 'ml1m'  
user_field = 'user_id'
item_field = 'item_id'
rating_field = 'rating'
time_field = 'timestamp'
type_field = 'type_id'

In [8]:
data = pd.read_csv(os.path.join(data_path, 'datasets/' + dataset + '.csv'), encoding='utf8')

In [9]:
data.sample(n=10, random_state=1)

Unnamed: 0,user_id,item_id,rating,timestamp,type,type_id
630120,3752,319,4.0,2000-08-13 01:03:20,Thriller,15
229398,2411,39,4.0,2000-11-19 00:38:54,Romance,13
758377,2172,2915,3.0,2002-03-02 22:18:54,Comedy,4
159240,2366,349,3.0,2000-11-16 03:11:21,Action,0
254252,1017,377,4.0,2000-11-23 21:15:04,Thriller,15
27168,3391,527,5.0,2000-08-29 19:18:26,Drama,7
196538,2376,1580,4.0,2002-06-19 12:58:43,Comedy,4
37123,4635,783,3.0,2000-07-19 21:31:53,Musical,11
982048,868,1346,4.0,2000-11-26 23:13:31,Horror,10
994502,1880,2275,4.0,2000-11-20 05:37:43,Adventure,1


### Short exercise 1: find the id of the most popular item (i.e., the item with the highest number of ratings)

In [10]:
### EXERCISE CELL ### Please, add your solution here

During this tutorial, we will simulate a scenario with **implicit feedback**. We assume that a user is interested in an item if that item was rated by the user, regardless of the rating value. Other strategies can be easily integrated.

**N.B.** Other papers in the literature assume that an item is relevant for a user only if the user has given it a rating higher than a value X. To implement this strategy here, you just need to change the body of the lambda function below.

In [11]:
data[rating_field] = data[rating_field].apply(lambda x: 1.0)
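
As an illustration of the threshold-based alternative mentioned in the note above, the sketch below binarizes ratings on a toy dataframe. The threshold value and the final filtering step are assumptions: whether rows with a zero label should be dropped depends on how the rest of the pipeline treats them.

```python
import pandas as pd

rating_field = 'rating'
threshold = 3.0  # hypothetical value of X

# Toy interactions (made-up values).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'item_id': [10, 11, 10, 12],
    'rating': [5.0, 2.0, 3.0, 4.0],
})

# An item is relevant only if its rating is strictly higher than the threshold.
toy[rating_field] = toy[rating_field].apply(lambda x: 1.0 if x > threshold else 0.0)

# Assumption: non-relevant rows are discarded, so that only positive
# interactions remain, as in the implicit feedback setting used here.
relevant = toy[toy[rating_field] == 1.0]
print(relevant)
```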

In [12]:
data.sample(n=10, random_state=1)

Unnamed: 0,user_id,item_id,rating,timestamp,type,type_id
630120,3752,319,1.0,2000-08-13 01:03:20,Thriller,15
229398,2411,39,1.0,2000-11-19 00:38:54,Romance,13
758377,2172,2915,1.0,2002-03-02 22:18:54,Comedy,4
159240,2366,349,1.0,2000-11-16 03:11:21,Action,0
254252,1017,377,1.0,2000-11-23 21:15:04,Thriller,15
27168,3391,527,1.0,2000-08-29 19:18:26,Drama,7
196538,2376,1580,1.0,2002-06-19 12:58:43,Comedy,4
37123,4635,783,1.0,2000-07-19 21:31:53,Musical,11
982048,868,1346,1.0,2000-11-26 23:13:31,Horror,10
994502,1880,2275,1.0,2000-11-20 05:37:43,Adventure,1


## Step 3: Split data into training and test sets

Once the original dataset has been loaded and the user preferences have been pre-processed, we need to split the whole dataset into two sets: a training set used for optimizing the recommender model and a test set used for evaluating it. A wide range of train-test split strategies exists in the literature. This notebook uses a strategy that, for each user, puts the oldest interactions in the training set and the most recent interactions in the test set. The Python toolbox also includes other strategies, such as a random split or a split based on a fixed timestamp (i.e., the most realistic one).

- **smode**: 'uftime' for fixed timestamp split, 'utime' for time-based split per user, 'urandom' for random split per user 
- **train_ratio**: percentage of data to be included in the train set
- **min_train**: minimum number of train samples for a user to be included  
- **min_test**: minimum number of test samples for a user to be included
- **min_time**: start timestamp for computing the splitting timestamp (only for uftime)
- **max_time**: end timestamp for computing the splitting timestamp (only for uftime)
- **step_time**: timestamp step for computing the splitting timestamp (only for uftime)

In [13]:
smode = 'utime'
train_ratio = 0.80        
min_train_samples = 8
min_test_samples = 2
min_time = None
max_time = None
step_time = 1000

During this tutorial, we will work with a common **time-based split per user**. For the sake of clarity, we provide the implementation of this strategy below. The toolbox keeps all the train-test split strategies in the file *helpers/train_test_splitter.py*.

In [14]:
def user_timestamp(interactions,split=0.80,min_samples=10,user_field='user_id',item_field='item_id',time_field='timestamp'):
    train_set = []
    test_set = []
    
    groups = interactions.groupby([user_field])
    for i, (index, group) in enumerate(groups):
        
        if len(group.index) < min_samples:
            continue
        
        sorted_group = group.sort_values(time_field)
        n_rating_test = int(len(sorted_group.index) * (1.0 - split))
        train_set.append(sorted_group.head(len(sorted_group.index) - n_rating_test))
        test_set.append(sorted_group.tail(n_rating_test))
    
    print('\r> Parsing user', i+1, 'of', len(groups))

    train, test = pd.concat(train_set), pd.concat(test_set)
    train['set'], test['set'] = 'train', 'test' # Ensure that each row has a column that identifies the associated set

    traintest = pd.concat([train, test])
    traintest[user_field + '_original'] = traintest[user_field] # Ensure that we save the original user ids
    traintest[item_field + '_original'] = traintest[item_field] # Ensure that we save the original item ids
    traintest[user_field] = traintest[user_field].astype('category').cat.codes # Ensure that user ids are in [0, |U|-1]
    traintest[item_field] = traintest[item_field].astype('category').cat.codes # Ensure that item ids are in [0, |I|-1]

    return traintest
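
To see what the per-user temporal split does in isolation, the sketch below applies the same head/tail logic to a single toy user. The helper name `split_user_by_time` and the data are hypothetical; only the splitting rule mirrors the function above.

```python
import pandas as pd

def split_user_by_time(group, train_ratio=0.80, time_field='timestamp'):
    """Split one user's interactions: oldest rows to train, newest rows to test."""
    ordered = group.sort_values(time_field)
    # Same truncation rule as in user_timestamp above.
    n_test = int(len(ordered) * (1.0 - train_ratio))
    n_train = len(ordered) - n_test
    return ordered.iloc[:n_train], ordered.iloc[n_train:]

# Toy user with 10 interactions on consecutive days, listed newest first (made-up data).
toy = pd.DataFrame({
    'user_id': [7] * 10,
    'item_id': list(range(10)),
    'timestamp': pd.date_range('2000-01-01', periods=10, freq='D')[::-1],
})

train_part, test_part = split_user_by_time(toy)
print(len(train_part), len(test_part))
# Every training interaction precedes every test interaction.
print(train_part['timestamp'].max() < test_part['timestamp'].min())
```

Note that `1.0 - 0.80` evaluates to slightly less than 0.2 in floating point, so `int()` truncates the product to 1 test interaction for this 10-interaction user. Rounding the product instead would be one possible variation.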

### Perform the training and test set split

This notebook can be easily run with any of the different train-test split strategies, through the following code. 

In [15]:
if smode == 'uftime':
    traintest = fixed_timestamp(data, min_train_samples, min_test_samples, min_time, max_time, step_time, user_field, item_field, time_field, rating_field)
elif smode == 'utime':
    traintest = user_timestamp(data, train_ratio, min_train_samples+min_test_samples, user_field, item_field, time_field)
elif smode == 'urandom':
    traintest = user_random(data, train_ratio, min_train_samples+min_test_samples, user_field, item_field)

> Parsing user 6040 of 6040


**N.B.** For the sake of convenience, *user_ids* and *item_ids* have been re-encoded so that user_ids are in *[0, |U|-1]* and item_ids are in *[0, |I|-1]*. To refer back to the original user and item ids, the *user_id_original* and *item_id_original* columns should be used.
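
The re-encoding relies on pandas categorical codes, which number the distinct ids in sorted order starting from 0. A small sketch with made-up ids shows the encoding and how to map a code back to its original id:

```python
import pandas as pd

# Made-up original ids with gaps and repetitions.
original_ids = pd.Series([50, 10, 50, 30])

# Categorical codes are consecutive integers assigned in sorted order of the ids.
codes = original_ids.astype('category').cat.codes
print(codes.tolist())

# Lookup table from the new contiguous id back to the original id.
code_to_original = dict(zip(codes.tolist(), original_ids.tolist()))
print(code_to_original[2])
```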

For the sake of replicability and efficiency of this tutorial, we will save the pre-computed train and test sets in *data/outputs/splits*.

In [16]:
traintest.to_csv(os.path.join(data_path, 'outputs/splits/' + dataset + '_' + smode + '.csv'))

In [17]:
traintest.sample(n=10, random_state=1)

Unnamed: 0,user_id,item_id,rating,timestamp,type,type_id,set,user_id_original,item_id_original
461252,4681,1980,1.0,2000-08-06 02:31:13,Adventure,1,train,4682,2161
432180,1700,2303,1.0,2001-12-30 08:15:18,Romance,13,train,1701,2497
396031,5688,759,1.0,2000-06-26 05:01:36,Drama,7,train,5689,805
692829,1217,1611,1.0,2000-12-01 18:23:13,Comedy,4,train,1218,1772
333979,1882,2597,1.0,2000-11-22 07:57:02,Action,0,train,1883,2802
524446,226,2958,1.0,2000-12-14 22:41:09,Adventure,1,train,227,3175
52926,1472,1848,1.0,2000-11-20 21:38:38,Action,0,train,1473,2028
895248,307,313,1.0,2001-07-22 22:58:50,Drama,7,train,308,322
285473,5491,2748,1.0,2001-08-07 19:22:33,Drama,7,test,5492,2959
94374,5851,3031,1.0,2000-05-12 05:35:27,Comedy,4,test,5852,3255


## Step 4: Define a pointwise / pairwise / random / mostpop recommendation algorithm.

In [18]:
train = traintest[traintest['set']=='train'].copy()
test = traintest[traintest['set']=='test'].copy()

### Short exercise 2: plot the distribution of interactions per item in the training set and in the test set

In [19]:
### EXERCISE CELL ### Please, add your solution here

First, we show some statistics about the training and test sets, e.g., number of users and items. 

In [20]:
users = list(np.unique(traintest[user_field].values))
items = list(np.unique(traintest[item_field].values))

In [21]:
len(users), len(items)

(6040, 3706)

Given that some recommender models may require the category of an item, we create a vector of size *|I|* including the integer-encoded category of the item with id *X* at position *X* of the vector. 

In [22]:
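# Sort by item id so that position X of the vector holds the category of the item with id X
category_per_item = traintest.drop_duplicates(subset=[item_field], keep='first').sort_values(item_field)[type_field].values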

In [23]:
len(np.unique(category_per_item))

18
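
The following toy sketch illustrates the alignment between positions and item ids in such a vector. The data are made up; sorting by item id after removing duplicates is what makes position *X* correspond to item *X*.

```python
import pandas as pd

# Made-up interactions: items appear in arbitrary order and more than once.
toy = pd.DataFrame({
    'item_id': [2, 0, 1, 2, 0],
    'type_id': [7, 4, 15, 7, 4],
})

# One row per item, ordered by item id.
per_item = toy.drop_duplicates(subset=['item_id'], keep='first').sort_values('item_id')
category_vector = per_item['type_id'].values

# category_vector[x] is the integer-encoded category of the item with id x.
print(category_vector)
```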

### Initialize the recommendation algorithm object

To keep the tutorial simple and short, we focus on four main recommendation strategies:

**Random**: randomly recommending a list of items to a user. 

**MostPop**: recommending the same most popular items (i.e., those which received the highest number of ratings) to all users.

**PointWise**: given a user-item pair, it is optimized for predicting a higher score (1) when the current item has been rated by the user, and a lower score (0) otherwise. The training instances include a good representation of both types of pairs.

**PairWise**: given a triplet with a user, an observed item, and an unobserved item, it is optimized for predicting a higher relevance for the pair of user and observed item than for the pair of user and unobserved item.

Each model inherits from the Model class defined in *models/model.py* and extends it by overriding the *train* and *predict* functions of the original model class. This allows us to maximize code reuse. More details on the implementation of the pairwise recommender can be found in *models/pairwise.py*.
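
To make the two non-personalized baselines concrete, the sketch below builds a user-item score matrix for Random and MostPop with NumPy. It is only an illustration on made-up interactions and does not reproduce the toolbox classes in *models/random.py* and *models/mostpop.py*.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 3, 5

# Made-up training interactions as (user, item) pairs.
train_pairs = np.array([[0, 1], [0, 3], [1, 3], [2, 3], [2, 4]])

# Random: every user-item pair receives an independent random score.
random_scores = rng.random((n_users, n_items))

# MostPop: the score of an item is its number of training interactions,
# and the same popularity vector is repeated for every user.
popularity = np.bincount(train_pairs[:, 1], minlength=n_items)
mostpop_scores = np.tile(popularity, (n_users, 1)).astype(float)

print(popularity)
print(mostpop_scores.argmax(axis=1))
```

With MostPop, every user gets the same ranking, so the top item is identical across rows.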

In [24]:
model_types = {'random': Random, 'mostpop': MostPop, 'pointwise': PointWise, 'pairwise': PairWise}

First, we need to initialize the model. We will see how the process works for a PairWise algorithm. Then, we will consider the other ones. 

In [25]:
model_type = 'pairwise'
%time model = PairWise(users, items, train, test, category_per_item, item_field, user_field, rating_field)

Initializing user, item, and categories lists
Initializing observed, unobserved, and predicted relevance scores
Initializing item popularity lists
Initializing category per item
Initializing category preference per user
Initializing metrics
Wall time: 13 s


## Step 5: Train a recommendation model (only for point-wise and pair-wise).

We will train the model by feeding the train data we previously prepared, using the following default parameters. 

- **no_epochs** (default 100): maximum number of epochs for which the training process will be run.
- **batches** (default 1024): size of the batches fed into the model during training. 
- **lr** (default 0.001): learning rate defining the pace at which the model will be trained. 
- **no_factors** (default 10): size of the latent vectors associated to users and items. 
- **no_negatives** (default 10): number of triplets for each user-item pair included in the training set. 
- **val_split** (default 0.0001): proportion of the training set used for validation. 

**N.B.** For the sake of tutorial efficiency, we stop the training process after 5 epochs, which is a reasonable trade-off for this tutorial. No grid search on the recommender model is performed at this stage.
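
The role of **no_negatives** can be illustrated with a small sketch of triplet generation: for each observed user-item pair, a number of unobserved items is sampled for the same user. The function below is a hypothetical illustration with uniform sampling; the actual generator lives in the toolbox and may sample differently.

```python
import numpy as np

def sample_triplets(train_pairs, n_items, no_negatives, rng):
    """Build (user, observed item, unobserved item) triplets by uniform sampling."""
    observed = {}
    for user, item in train_pairs:
        observed.setdefault(user, set()).add(item)

    triplets = []
    for user, pos_item in train_pairs:
        for _ in range(no_negatives):
            # Assumption: each user has at least one unobserved item,
            # otherwise this rejection loop would never terminate.
            neg_item = int(rng.integers(n_items))
            while neg_item in observed[user]:
                neg_item = int(rng.integers(n_items))
            triplets.append((user, pos_item, neg_item))
    return triplets

# Made-up training pairs for two users over a catalog of five items.
toy_pairs = [(0, 1), (0, 3), (1, 2)]
triplets = sample_triplets(toy_pairs, n_items=5, no_negatives=2, rng=np.random.default_rng(0))
print(len(triplets))
```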

In [26]:
%time model.train(no_epochs=5)

Generating training instances of type pair
Computing instances for interaction 800000 / 803798 of type pair
Performing training - Epochs 5 Batch Size 1024 Learning Rate 0.001 Factors 10 Negatives 10 Mode pair
Validation accuracy: 0.8607365016173177 (Sample 80379 of 80380)
Epoch 2/2
Epoch 3/3
Epoch 4/4
Epoch 5/5
Validation accuracy: 0.9161980592187111 (Sample 80379 of 80380)
Wall time: 5min 38s


The architecture of the trained model looks as follows. Essentially, the model includes:
- **UserEmb** encoding a latent vector for each user.
- **ItemEmb** encoding a latent vector for each item.
- **FlatUserEmb** representing the vector associated with the current user *UserInput*.
- **FlatPosItemEmb** representing the vector associated with the current observed item *PosItemInput*.
- **FlatNegItemEmb** representing the vector associated with the current unobserved item *NegItemInput*.
- **Accuracy** computing the margin between (i) the *FlatUserEmb-FlatPosItemEmb* and (ii) the *FlatUserEmb-FlatNegItemEmb* similarity scores.
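
As a rough illustration of the margin described in the last bullet, the sketch below computes it with NumPy on random embeddings. It assumes dot-product similarity and counts a triplet as correct when its margin is positive; both choices are assumptions, and the arrays are made up rather than taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
no_factors = 10

# Made-up embedding tables for 4 users and 6 items.
user_emb = rng.normal(size=(4, no_factors))
item_emb = rng.normal(size=(6, no_factors))

# A batch of three triplets: user, observed item, unobserved item.
user_idx = np.array([0, 1, 2])
pos_idx = np.array([1, 2, 3])
neg_idx = np.array([4, 5, 0])

# Similarity of each user with the observed and the unobserved item.
pos_score = np.sum(user_emb[user_idx] * item_emb[pos_idx], axis=1)
neg_score = np.sum(user_emb[user_idx] * item_emb[neg_idx], axis=1)

# A positive margin means the observed item is ranked above the unobserved one.
margin = pos_score - neg_score
accuracy = float(np.mean(margin > 0))
print(margin, accuracy)
```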

In [27]:
model.print()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
UserInput (InputLayer)          [(None, 1)]          0                                            
__________________________________________________________________________________________________
PosItemInput (InputLayer)       [(None, 1)]          0                                            
__________________________________________________________________________________________________
NegItemInput (InputLayer)       [(None, 1)]          0                                            
__________________________________________________________________________________________________
UserEmb (Embedding)             (None, 1, 10)        60410       UserInput[0][0]                  
______________________________________________________________________________________________

The model file is saved in *data/outputs/models*. 

In [28]:
model

<models.pairwise.PairWise at 0x1a6f0a14eb8>

## Step 6: Compute the user-item matrix that includes the predicted relevance scores.

Once the recommender model has been trained, we leverage the pre-trained user and item embedding matrices to compute the relevance score predicted for each user-item pair. For each pair, the prediction step extracts the corresponding user and item vectors and then computes the similarity between the two; cosine or dot-product similarity is usually used at this stage.
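
The prediction step described above can be sketched with NumPy as a single matrix product. The embeddings below are random placeholders, not the trained ones, and which similarity the toolbox actually uses is not assumed here; both variants are shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: 3 users and 5 items with 4 latent factors each.
user_vectors = rng.normal(size=(3, 4))
item_vectors = rng.normal(size=(5, 4))

# Dot-product similarity for all user-item pairs at once.
dot_scores = user_vectors @ item_vectors.T

# Cosine similarity: normalize each vector to unit length first.
user_unit = user_vectors / np.linalg.norm(user_vectors, axis=1, keepdims=True)
item_unit = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
cosine_scores = user_unit @ item_unit.T

# Rows index users and columns index items.
print(dot_scores.shape, cosine_scores.shape)
```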

In [29]:
model

<models.pairwise.PairWise at 0x1a6f0a14eb8>

Now, we will use the pre-trained model to predict the user-item relevance scores.

In [30]:
model.predict()

Computing predictions


To keep things simple, you can directly manipulate the user-item relevance matrix as a numpy array.

In [31]:
scores = model.get_predictions()

Hence, we can access the relevance score of user *120* for item *320* as follows.

In [32]:
user_id, item_id = 120, 320
scores[user_id, item_id]

2.774047374725342

### Short exercise 3: compute the range of the scores on the whole population of users.     

In [33]:
### EXERCISE CELL ### Please, add your solution here

For the sake of convenience, we will save the predicted scores. They are often used as an input for re-ranking treatments against bias. 

In [34]:
save_obj(scores, os.path.join(data_path, 'outputs/predictions/' + dataset + '_' + smode + '_' + model_type + '_scores.pkl'))
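
To reuse saved scores in a later session, they need to be read back from disk. The implementation of `save_obj` in *helpers/utils.py* is not shown here, so the sketch below only illustrates a plain `pickle` round trip on a toy matrix, under the assumption that the helper stores a standard pickle file.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy relevance matrix standing in for the real scores.
toy_scores = np.arange(6, dtype=np.float32).reshape(2, 3)

# Write the matrix to a temporary pickle file.
path = os.path.join(tempfile.mkdtemp(), 'toy_scores.pkl')
with open(path, 'wb') as handle:
    pickle.dump(toy_scores, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back and check that nothing changed.
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(np.array_equal(toy_scores, restored))
```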

### Short exercise 4: retrieve the ids of the 10 items with  the highest relevance score for user 47.   

In [35]:
### EXERCISE CELL ### Please, add your solution here

In [36]:
scores.shape

(6040, 3706)

## Step 7: Calculate evaluation metrics.

Finally, with the user-item relevance scores predicted in the previous step, we can generate the recommendations for each user and, then, compute a set of well-known evaluation metrics for recommender systems. 
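
As an example of how such a metric can be derived from the scores, the sketch below computes NDCG at a cutoff for one user with binary relevance. The function is hypothetical: excluding the training items from the ranking and the exact normalization are assumptions, and the toolbox may define the metric differently.

```python
import numpy as np

def ndcg_at_k(user_scores, test_items, train_items, k):
    """NDCG@k for one user with binary relevance (illustrative definition)."""
    scores = np.asarray(user_scores, dtype=float).copy()
    # Assumption: items already seen in training are excluded from the ranking.
    scores[np.array(sorted(train_items), dtype=int)] = -np.inf
    top_k = np.argsort(-scores)[:k]

    # Gain is 1 for a recommended item that appears in the test set, 0 otherwise.
    gains = np.isin(top_k, list(test_items)).astype(float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float(np.sum(gains * discounts[:len(top_k)]))

    # Ideal ranking: all test items (up to k) placed at the top.
    idcg = float(np.sum(discounts[:min(k, len(test_items))]))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up scores of one user over five items.
toy_user_scores = [0.9, 0.1, 0.8, 0.7, 0.3]
toy_ndcg = ndcg_at_k(toy_user_scores, test_items={3}, train_items={0}, k=2)
print(toy_ndcg)
```

In this toy case the only test item is ranked second, so the score equals `1 / log2(3)`.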

In [37]:
scores.shape

(6040, 3706)

In [38]:
cutoffs = np.array([5, 10, 20, 50, 100, 200])

For the sake of convenience, for the considered recommender model, we also compute some fairness metrics required for the case studies. The following line of code loads the demographic membership of providers, which will be discussed in detail in Notebook #03.

**N.B.** While gender is by no means a binary construct, to the best of our knowledge no recommendation dataset with non-binary gender labels exists. We therefore consider a binary feature, as offered by the currently available public datasets.

In [39]:
group_item_association = pd.read_csv(os.path.join(data_path, 'datasets', 'ml1m-dir-group.csv'))

This dataframe includes, for each item, the share of providers belonging to group_1 and group_2 (the two gender groups), respectively.

In [40]:
group_item_association.sample(n=10, random_state=1)

Unnamed: 0,item_id,group_1,group_2
3582,1886,0.0,1.0
2073,2079,0.0,1.0
2425,2745,,
93,2490,0.0,1.0
79,2236,0.0,1.0
1459,144,0.0,1.0
3398,758,,
366,1095,0.0,1.0
2724,3568,0.0,1.0
2162,3124,1.0,0.0


### Short exercise 5: compute the percentage of items where providers of group_1 are represented.  

In [41]:
### EXERCISE CELL ### Please, add your solution here

In [42]:
group_maps = {i:g for i, g in zip(group_item_association['item_id'], group_item_association['group_1'])}
item_maps = {i1:i2 for i1, i2 in zip(traintest['item_id'].unique(), traintest['item_id_original'].unique())}

In [43]:
item_group = [(1 if item_maps[i] in group_maps and group_maps[item_maps[i]] == 0 else 0) for i in range(len(items))]

Then, we run the function which computes all the metrics relevant for the subsequent case studies. 

In [None]:
model.test(item_group=item_group, cutoffs=cutoffs)

The method has pre-computed a set of metrics and saved the corresponding values in a Python dictionary, as detailed below. 

In [None]:
metrics = model.get_metrics()

In [None]:
metrics.keys()

The values of each metric have been computed and stored for each cutoff.

In [None]:
for name, values in metrics.items():
    print(values.shape, name)

For instance, we can access the NDCG score of user *1324* at cutoff *10* with the following commands.

In [None]:
user_id, cutoff_index = 1324, int(np.where(cutoffs == 10)[0][0])
metrics['ndcg'][cutoff_index, user_id]
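
Since each metric is indexed first by cutoff and then by user, aggregated values can be obtained by averaging over the user axis. The sketch below uses a made-up array with that layout; the real arrays come from `model.get_metrics()`.

```python
import numpy as np

toy_cutoffs = np.array([5, 10, 20])

# Made-up NDCG values: one row per cutoff, one column per user.
toy_ndcg = np.array([
    [0.0, 0.5, 1.0, 0.5],
    [0.2, 0.6, 1.0, 0.2],
    [0.4, 0.6, 0.8, 0.6],
])

# Position of cutoff 10 in the cutoff list.
toy_cutoff_index = int(np.where(toy_cutoffs == 10)[0][0])

# Average over users (axis 1) to get one value per cutoff.
mean_per_cutoff = toy_ndcg.mean(axis=1)
print(toy_cutoff_index, mean_per_cutoff[toy_cutoff_index])
```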

### Short exercise 6: compute catalog coverage (i.e., percentage of items recommended at least once) at top-20.  

In [None]:
### EXERCISE CELL ### Please, add your solution here

For the sake of convenience, we will save the computed metrics.

In [None]:
save_obj(metrics, os.path.join(data_path, 'outputs/metrics/' + dataset + '_' + smode + '_' + model_type + '_metrics.pkl'))

We can also see the aggregated values. 

In [None]:
model.show_metrics(index_k=int(np.where(cutoffs == 10)[0][0]))

In [None]:
' - '.join(list(metrics.keys()))

## Step 8: Run the full pipeline for the other algorithms under consideration.

We will define a utility function to run all the above operations jointly for each of the other recommender models.

In [None]:
def run_model(model_type, no_epochs=None):
    print('Running model', model_type)
    # Initialize the model
    model = model_types[model_type](users, items, train, test, category_per_item, item_field, user_field, rating_field)
    # Train the model
    model.train(no_epochs=no_epochs) if no_epochs else model.train() 
    # Make and save predictions
    model.predict()
    scores = model.get_predictions()
    save_obj(scores, os.path.join(data_path, 'outputs/predictions/' + dataset + '_' + smode + '_' + model_type + '_scores.pkl'))
    # Compute and save metrics
    model.test(item_group=item_group, cutoffs=cutoffs)
    metrics = model.get_metrics()
    save_obj(metrics, os.path.join(data_path, 'outputs/metrics/' + dataset + '_' + smode + '_' + model_type + '_metrics.pkl'))
    # Show evaluation metrics
    print('\n\nFinal evaluation metrics:')
    model.show_metrics(index_k=int(np.where(cutoffs == 10)[0][0]))

In [None]:
run_model('random')

In [None]:
run_model('mostpop')

In [None]:
run_model('pointwise', no_epochs=5)

## Summary

In this notebook, we instantiated a recommendation pipeline in the simplest possible way. Specifically, we set up the working environment in GDrive, loaded and explored the Movielens 1M dataset, split the data into training and test sets, defined a pointwise / pairwise / random / mostpop recommendation algorithm, trained a recommendation model (only for point-wise and pair-wise), computed the user-item matrix that includes the predicted relevance scores, calculated evaluation metrics to monitor properties like effectiveness, catalog coverage, and novelty, and ran the full pipeline for the other algorithms under consideration.

## Further Steps

- Take a look at the helpers/train_test_splitter.py file and how the existing generators have been defined. 
- Similarly, take a look at the helpers/instances_creator.py file and how the existing generators have been defined. 
- A new subclass of the Model class in models/model.py could be defined, implementing a 'train' and a 'predict' method. 
- The 'test' and 'show_metrics' methods of models/model.py could be extended with the computation needed by a new metric. 

## Suggested Reading 

If you are interested in an example of how to implement this pipeline for an exploratory analysis of bias, you could read:

**Boratto, L., Fenu, G., & Marras, M. (2019, April)**. The effect of algorithmic bias on recommender systems for massive open online courses. In European Conference on Information Retrieval (pp. 457-472). Springer, Cham.
[Springer Link](https://link.springer.com/chapter/10.1007/978-3-030-15712-8_30)