# Evaluation
The aim of this file is to evaluate the models that had been trained with different techniques. The main purpose is to see if the sampling techniques would make enhancement in the predictive power of the models.

## Mounting Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
/content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem


## Loading Libraries & Datasets

In [0]:
import pandas as pd
import numpy as np
import joblib
from collections import Counter
from tqdm import tqdm_notebook as tqdm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [0]:
TrainDataFilepath = '../TrivagoRecommenderSystem1/Datasets/clean_data/Sets/train.csv'
TrainMoreDataFilepath = '../TrivagoRecommenderSystem1/Datasets/clean_data/Sets/trainMore.csv'
valFilepath = '../TrivagoRecommenderSystem1/Datasets/clean_data/Sets/val.csv'
testFilepath = '../TrivagoRecommenderSystem1/Datasets/clean_data/Sets/test.csv'

TrainData = pd.read_csv(TrainDataFilepath)
TrainMoreData = pd.read_csv(TrainMoreDataFilepath)
valData = pd.read_csv(valFilepath)
testData = pd.read_csv(testFilepath)

## Validation & Test sets' Transformation & Scaling

### Preparation

In [0]:
#declaring features and label
features = TrainData.drop(columns=['session_id', 'item_id', 'clickout']).columns.tolist()
label = ['clickout']

#dropping highly correlated features
FeaturesToDrop = ['NumberInImpressions', 'NumberInReferences', 'MeanPrice', 'MinPrice']
for feature in FeaturesToDrop:
  features.remove(feature)

X_train = TrainData[features]
y_train = TrainData[label]

X_trainMore = TrainData[features]
y_trainMore = TrainData[label]

valData_sessions_item = valData[['session_id', 'item_id', 'clickout']]
X_val = valData[features]
y_val = valData[label]

testData_sessions_item = testData[['session_id', 'item_id', 'clickout']]
X_test = testData[features]
y_test  = testData[label]

### Scaling

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

features = ['item_duration','item_interactions','NumberAsClickout','NumberAsFinalClickout','item_rank','price',
            'item_session_duration','top_list']
            
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('std_scaler', StandardScaler()),
])

# Training Data
full_pipeline = ColumnTransformer([
("num", num_pipeline, list(X_train[features]))
])

# training set scaling
X_train_scaled = full_pipeline.fit_transform(X_train[features])

# More Training Data
full_pipeline = ColumnTransformer([
("num", num_pipeline, list(X_train[features]))
])

## More training set scaling
X_train_scaled = full_pipeline.fit_transform(X_trainMore[features])

# validation set scaling
X_val_scaled = full_pipeline.transform(X_val[features])

# test set scaling
X_test_scaled = full_pipeline.transform(X_test[features])

## Evaluating Models

In [0]:
#function is from this repo https://gist.github.com/bwhite/3726239
def mean_reciprocal_rank(rs):
    """Score is reciprocal of the rank of the first relevant item
    First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
    Example from http://en.wikipedia.org/wiki/Mean_reciprocal_rank
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.61111111111111105
    >>> rs = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0]])
    >>> mean_reciprocal_rank(rs)
    0.5
    >>> rs = [[0, 0, 0, 1], [1, 0, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.75
    Args:
        rs: Iterator of relevance scores (list or numpy) in rank order
            (first element is the first item)
    Returns:
        Mean reciprocal rank
    """
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])

def get_probabilities(model_path, X, session_item_dataset):
  global clickout_rank, RecommendationsDF
  '''
  Desc: function that gets the probability of each item being selected by the user, rerank the items in the session based on the probabilites

  Input: model_path: String with the name of the stored model
         X: array of scaled features of the dataset
         session_item_dataset: Pandas Dataframe with the sessions, items, and clickout
        
  Output: clickout_rank: List of lists that carries which item was selected in which rank
          RecommendationsDF: Pandas Dataframe to be transformed and merged to the Clickout Dataframe
  '''
  model = joblib.load(model_path)
  BothProbabilities = model.predict_proba(X)
  Probabilities = [Probability[1] for Probability in BothProbabilities]
  session_item_dataset['probability'] = Probabilities
  RecommendationsDF = session_item_dataset.groupby(['session_id'], sort=False).apply(lambda x: (x.sort_values('probability', ascending=False)))
  clickout_rank = RecommendationsDF.clickout
  clickout_rank = clickout_rank.reset_index().groupby('session_id').clickout.apply(list).values.tolist()
  return clickout_rank, RecommendationsDF
  
def ClassifReport(model_path, X, y):
  global y_pred
  model = joblib.load(model_path)
  y_pred = model.predict(X)
  return classification_report(y, y_pred)

def PrintMetrics(model_path, X, y, session_item_dataset):
  clickout_rank, RecommendationsDF = get_probabilities(model_path, X, session_item_dataset)
  MeanReciprocalRank = mean_reciprocal_rank(clickout_rank)
  print('Mean Reciprocal Rank : ', MeanReciprocalRank)
  print('=================================================')
  ClassificationReport = ClassifReport(model_path, X, y)
  print('Classification Report')
  print('=================================================')
  print(ClassificationReport)
  ConfMatrix = confusion_matrix(y, y_pred, labels=[1, 0])
  print('Confusion Matrix')
  print('================================================')
  print(ConfMatrix)
  return

### Without Resampling

#### Logistic Regression Illustration



Just for clarification of what the get_probabilities function does, output of each step will be displayed, but for the next models, the function will be used.

In [0]:
LR_model = joblib.load('./models/LR_model.pkl')
Predictions = LR_model.predict(X_val_scaled)
BothProbabilities = LR_model.predict_proba(X_val_scaled)
Probabilities = [Probability[1] for Probability in BothProbabilities]
Probabilities[0:5]

[0.5941419645268843,
 0.12745755945175913,
 0.09408998870484993,
 0.0833132961209747,
 0.09018704167477462]

In [0]:
valData_sessions_item['probability'] = Probabilities
valData_sessions_item.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,session_id,item_id,clickout,probability
0,06e7c29170946,10091602,0,0.594142
1,06e7c29170946,6625240,0,0.127458
2,06e7c29170946,9386776,0,0.09409
3,06e7c29170946,3954788,0,0.083313
4,06e7c29170946,9776792,0,0.090187


In [0]:
clickout_rank = valData_sessions_item.groupby(['session_id'], sort=False).apply(lambda x: (x.sort_values('probability', ascending=False))).clickout
clickout_rank = clickout_rank.reset_index().groupby('session_id').clickout.apply(list).values.tolist()
clickout_rank[0:5]

[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

#### Logistic Regression

In [0]:
PrintMetrics('./models/LR_model.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.5306751365918464
Classification Report
              precision    recall  f1-score   support

           0       0.96      1.00      0.98   3249590
           1       0.63      0.09      0.16    149676

    accuracy                           0.96   3399266
   macro avg       0.79      0.54      0.57   3399266
weighted avg       0.95      0.96      0.94   3399266

Confusion Matrix
[[  13689  135987]
 [   8072 3241518]]


#### Random Forest

In [0]:
PrintMetrics('./models/RF_model.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.6124519929264486


  _warn_prf(average, modifier, msg_start, len(result))


Classification Report
              precision    recall  f1-score   support

           0       0.96      1.00      0.98   3249590
           1       0.00      0.00      0.00    149676

    accuracy                           0.96   3399266
   macro avg       0.48      0.50      0.49   3399266
weighted avg       0.91      0.96      0.93   3399266

Confusion Matrix
[[      0  149676]
 [      0 3249590]]


####XGBoost

In [0]:
PrintMetrics('./models/XGB_model.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.5913282067787623
Classification Report
              precision    recall  f1-score   support

           0       0.96      1.00      0.98   3249590
           1       0.66      0.12      0.20    149676

    accuracy                           0.96   3399266
   macro avg       0.81      0.56      0.59   3399266
weighted avg       0.95      0.96      0.94   3399266

Confusion Matrix
[[  17552  132124]
 [   9085 3240505]]


In [0]:
PrintMetrics('./models/XGB_modelMore.pkl', X_val_scaled, y_val, valData_sessions_item)

### With SMOTE

####Logistic Regression

In [0]:
PrintMetrics('./models/LR_SMOTE.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.5328501467903881
Classification Report
              precision    recall  f1-score   support

           0       0.98      0.84      0.91   3249590
           1       0.16      0.66      0.26    149676

    accuracy                           0.83   3399266
   macro avg       0.57      0.75      0.58   3399266
weighted avg       0.95      0.83      0.88   3399266

Confusion Matrix
[[  98152   51524]
 [ 513100 2736490]]


#### Random Forest

In [0]:
PrintMetrics('./models/RF_SMOTE.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.6042348193349655
Classification Report
              precision    recall  f1-score   support

           0       0.97      0.98      0.97   3249590
           1       0.41      0.33      0.37    149676

    accuracy                           0.95   3399266
   macro avg       0.69      0.65      0.67   3399266
weighted avg       0.94      0.95      0.95   3399266

Confusion Matrix
[[  49682   99994]
 [  72059 3177531]]


#### XGBoost

In [0]:
PrintMetrics('./models/XGB_SMOTE.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.5822028529848495
Classification Report
              precision    recall  f1-score   support

           0       0.98      0.95      0.96   3249590
           1       0.34      0.55      0.42    149676

    accuracy                           0.93   3399266
   macro avg       0.66      0.75      0.69   3399266
weighted avg       0.95      0.93      0.94   3399266

Confusion Matrix
[[  83065   66611]
 [ 159313 3090277]]


### With Undersampling

#### Logistic Regression

In [0]:
PrintMetrics('./models/LR_usample.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.4874620608041104
Classification Report
              precision    recall  f1-score   support

           0       0.98      0.64      0.77   3249590
           1       0.08      0.69      0.14    149676

    accuracy                           0.64   3399266
   macro avg       0.53      0.66      0.46   3399266
weighted avg       0.94      0.64      0.75   3399266

Confusion Matrix
[[ 103249   46427]
 [1171911 2077679]]


#### Random Forest

In [0]:
PrintMetrics('./models/RF_usample.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.5182568797223334
Classification Report
              precision    recall  f1-score   support

           0       0.97      0.48      0.64   3249590
           1       0.06      0.70      0.11    149676

    accuracy                           0.49   3399266
   macro avg       0.51      0.59      0.38   3399266
weighted avg       0.93      0.49      0.62   3399266

Confusion Matrix
[[ 104320   45356]
 [1688551 1561039]]


#### XGBoost

In [0]:
PrintMetrics('./models/XGB_usample.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.5040026233202423
Classification Report
              precision    recall  f1-score   support

           0       0.98      0.46      0.63   3249590
           1       0.06      0.77      0.11    149676

    accuracy                           0.47   3399266
   macro avg       0.52      0.62      0.37   3399266
weighted avg       0.94      0.47      0.60   3399266

Confusion Matrix
[[ 115618   34058]
 [1751199 1498391]]


## MoreData Model Evaluation

Evaluating the model that had been trained on more data.


In [0]:
PrintMetrics('./models/RF_modelMore.pkl', X_val_scaled, y_val, valData_sessions_item)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Mean Reciprocal Rank :  0.6126881175382823


  _warn_prf(average, modifier, msg_start, len(result))


Classification Report
              precision    recall  f1-score   support

           0       0.96      1.00      0.98   3249590
           1       0.00      0.00      0.00    149676

    accuracy                           0.96   3399266
   macro avg       0.48      0.50      0.49   3399266
weighted avg       0.91      0.96      0.93   3399266

Confusion Matrix
[[      0  149676]
 [      0 3249590]]


Sampling techniques did not prove any enhancment in the performance of any of the models, the performance was definitely better in the recall metric, but it actually screwed the MRR metric. The baseline model that would be used with next steps would be Random Forest Without Resampling as it has a Mean Reciprocal Rank 0.6124.

When model which was trained on the more data (4.8% increase of the original training data), the model Mean Reciprocal Rank has slightly increased to be 0.6126. This increase migh be very little on this scale, but sure it can make difference in the real application.

# Evaluating Items' Properties

### Validation and Test Sets Preparation

In [0]:
TrainMoreDataFilepath = '../TrivagoRecommenderSystem1/Datasets/clean_data/Sets/trainMore.csv'
TrainMoreData = pd.read_csv(TrainMoreDataFilepath)

#declaring features and label
features = TrainMoreData.drop(columns=['session_id', 'item_id', 'clickout']).columns.tolist()
label = ['clickout']

#dropping highly correlated features
FeaturesToDrop = ['NumberInImpressions', 'NumberInReferences', 'MeanPrice', 'MinPrice']
for feature in FeaturesToDrop:
  features.remove(feature)