<a href="https://colab.research.google.com/github/ZeyadSabbah/TrivagoRecommenderSystem/blob/master/EvaluatingModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating Models
## Mounting Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem

/content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem


## Loading Libraries & Datasets

In [0]:
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import math
import matplotlib.pyplot as plt
from datetime import datetime
import re
import random
import joblib
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [0]:
TrainDataFilepath = 'Datasets/clean_data/Sets/train.csv'
valFilepath = 'clean_data/Sets/val.csv'
testFilepath = 'Datasets/clean_data/Sets/test.csv'

TrainData = pd.read_csv(TrainDataFilepath)
# valData = pd.read_csv(valFilepath)
# testData = pd.read_csv(testFilepath)

## Validation & Test sets' Transformation & Scaling

In [0]:
#declaring features and label
features = ['price', 'item_rank', 'price_rank', 'session_duration', 'item_duration', 'item_session_duration', 'item_interactions', 'maximum_step', 'top_list',
            'NumberOfProperties', 'NumberInImpressions', 'NumberInReferences', 'NumberAsClickout', 'NumberAsFinalClickout', 'FClickoutToImpressions',
            'FClickoutToReferences', 'FClickoutToClickout', 'MeanPrice', 'AveragePriceRank']
label = ['clickout']
X_train = TrainData[features]
y_train = TrainData[label]

### Transformation

In [0]:
#validation set transformation
from data_transformation import data_transformation

valData = data_transformation.transform_data(valData);
valData_sessions_item = valData[['session_id', 'item_id', 'clickout']]
X_val = valData[features]
y_val = valData[label]

In [0]:
#test set transformation and scaling
testData = data_transformation.transform_data(testData)
testData_sessions_item = testData[['session_id', 'item_id', 'clickout']]
X_test = testData[features]
y_test  = testData[label]

### Scaling

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('std_scaler', StandardScaler()),
])

from sklearn.compose import ColumnTransformer
full_pipeline = ColumnTransformer([
("num", num_pipeline, list(X_train))
])

# training set scaling
X_train_scaled = full_pipeline.fit_transform(X_train)

# validation set scaling
X_val_scaled = full_pipeline.fit_transform(X_val)

# test set scaling
X_test_scaled = full_pipeline.fit_transform(X_test)

In [0]:
X_train_scaled = np.load('Datasets/clean_data/Xscaled/X_train_scaled.npy')
X_val_scaled = np.load('Datasets/clean_data/Xscaled/X_val_scaled.npy')
X_test_scaled = np.load('Datasets/clean_data/Xscaled/X_test_scaled.npy')

## Mean Reciprocal Rank
Mean Reciprocal Rank is a measure to evaluate systems that return a ranked list of answers to queries.

In [0]:
#function is from this page https://gist.github.com/bwhite/3726239
def mean_reciprocal_rank(rs):
    """Score is reciprocal of the rank of the first relevant item
    First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
    Example from http://en.wikipedia.org/wiki/Mean_reciprocal_rank
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.61111111111111105
    >>> rs = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0]])
    >>> mean_reciprocal_rank(rs)
    0.5
    >>> rs = [[0, 0, 0, 1], [1, 0, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.75
    Args:
        rs: Iterator of relevance scores (list or numpy) in rank order
            (first element is the first item)
    Returns:
        Mean reciprocal rank
    """
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])

## Evaluating Models

In [0]:
def get_probabilities(model_name, X, session_item_dataset):
  '''
  Desc: function that gets the probability of each item being selected by the user, rerank the items in the session based on the probabilites

  Input: model_path: String with the name of the stored model
         X: array of scaled features of the dataset
         session_item_dataset: Pandas Dataframe with the sessions, items, and clickout
        
  Output: clickout_rank: List of lists that carries which item was selected in which rank
          RecommendationsDF: Pandas Dataframe to be transformed and merged to the Clickout Dataframe
  '''
  model = joblib.load(model_name)
  BothProbabilities = model.predict_proba(X)
  Probabilities = [Probability[1] for Probability in BothProbabilities]
  session_item_dataset['probability'] = Probabilities
  RecommendationsDF = session_item_dataset.groupby(['session_id'], sort=False).apply(lambda x: (x.sort_values('probability', ascending=False)))
  clickout_rank = RecommendationsDF.clickout
  clickout_rank = clickout_rank.reset_index().groupby('session_id').clickout.apply(list).values.tolist()
  return clickout_rank, RecommendationsDF

### Without Resampling

#### Logistic Regression

Just for clarification of what the get_probabilities function does, output of each step will be displayed, but for the next models, the function will be used.

In [0]:
LR_model = joblib.load('LR_model.pkl')
Predictions = LR_model.predict(X_val_scaled)
BothProbabilities = LR_model.predict_proba(X_val_scaled)
Probabilities = [Probability[1] for Probability in BothProbabilities]
Probabilities[0:5]

In [0]:
valData_sessions_item['probability'] = Probabilities
valData_sessions_item.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,session_id,item_id,clickout,probability
0,06e7c29170946,10091602,0,0.818978
1,06e7c29170946,6625240,0,0.06236
2,06e7c29170946,9386776,0,0.038675
3,06e7c29170946,3954788,0,0.046309
4,06e7c29170946,9776792,0,0.083385


In [0]:
valData_sessions_item.groupby(['session_id'], sort=False).apply(lambda x: (x.sort_values('probability', ascending=False)))

Unnamed: 0_level_0,Unnamed: 1_level_0,session_id,item_id,clickout,probability
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
06e7c29170946,0,06e7c29170946,10091602,0,0.818978
06e7c29170946,4,06e7c29170946,9776792,0,0.083385
06e7c29170946,1,06e7c29170946,6625240,0,0.062360
06e7c29170946,3,06e7c29170946,3954788,0,0.046309
06e7c29170946,2,06e7c29170946,9386776,0,0.038675
...,...,...,...,...,...
f701be9e58e9a,3399256,f701be9e58e9a,7304664,0,0.009075
f701be9e58e9a,3399261,f701be9e58e9a,840461,0,0.007265
f701be9e58e9a,3399259,f701be9e58e9a,3137050,0,0.006525
f701be9e58e9a,3399263,f701be9e58e9a,10064344,0,0.005931


In [0]:
clickout_rank = valData_sessions_item.groupby(['session_id'], sort=False).apply(lambda x: (x.sort_values('probability', ascending=False))).clickout
clickout_rank = clickout_rank.reset_index().groupby('session_id').clickout.apply(list).values.tolist()
clickout_rank[0:5]

[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

In [0]:
mean_reciprocal_rank(clickout_rank)

0.5621796407764232

### With Undersampling

#### Logistic Regression

In [20]:
clickout_rank = get_probabilities('modelsUnderSampling/LR_modelUndersampling.pkl', X_val_scaled, valData_sessions_item)
mean_reciprocal_rank(clickout_rank)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


0.6725780150707574

In [0]:
clickout_rank[0:5]

[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

In [22]:
clickout_rank = get_probabilities('modelsUnderSampling/LR_modelUndersampling.pkl', X_test_scaled, testData_sessions_item)
mean_reciprocal_rank(clickout_rank)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


0.7585556052007818

In [24]:
clickout_rank, Output = get_probabilities('modelsUnderSampling/LR_modelUndersampling.pkl', X_val_scaled, valData_sessions_item)
Output

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0_level_0,Unnamed: 1_level_0,session_id,item_id,clickout,probability
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
06e7c29170946,0,06e7c29170946,10091602,0,0.998647
06e7c29170946,4,06e7c29170946,9776792,0,0.515845
06e7c29170946,1,06e7c29170946,6625240,0,0.429573
06e7c29170946,6,06e7c29170946,6893402,1,0.384209
06e7c29170946,3,06e7c29170946,3954788,0,0.358810
...,...,...,...,...,...
f701be9e58e9a,3399256,f701be9e58e9a,7304664,0,0.110202
f701be9e58e9a,3399253,f701be9e58e9a,55441,0,0.104138
f701be9e58e9a,3399259,f701be9e58e9a,3137050,0,0.093351
f701be9e58e9a,3399263,f701be9e58e9a,10064344,0,0.089631


## Submission

In the future, the challenge might be a public competition, in that case transformation of the output is needed.

In [8]:
test_set_filepath = './Datasets/raw_data/test.csv'
test_set = pd.read_csv(test_set_filepath)
test_set.head()

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
0,004A07DM0IDW,1d688ec168932,1541555614,1,interaction item image,2059240,CO,"Santa Marta, Colombia",mobile,,,
1,004A07DM0IDW,1d688ec168932,1541555614,2,interaction item image,2059240,CO,"Santa Marta, Colombia",mobile,,,
2,004A07DM0IDW,1d688ec168932,1541555696,3,clickout item,1050068,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...
3,004A07DM0IDW,1d688ec168932,1541555707,4,clickout item,1050068,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...
4,004A07DM0IDW,1d688ec168932,1541555717,5,clickout item,1050068,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...


In [9]:
submission_format_filepath = './Datasets/raw_data/submission_popular.csv'
submission_format = pd.read_csv(submission_format_filepath)
submission_format.head()

Unnamed: 0,user_id,session_id,timestamp,step,item_recommendations
0,000324D9BBUC,89643988fdbfb,1541593942,10,924795 106315 1033140 119494 101758 903037 105...
1,0004Q49X39PY,9de47d9a66494,1541641157,1,3505150 3812004 2227896 2292254 3184842 222702...
2,0004Q49X39PY,beea5c27030cb,1541561202,1,4476010 3505150 3812004 2227896 2292254 222702...
3,00071784XQ6B,9617600e1ba7c,1541630328,2,22854 3067559 22721 22713 16121 22772 22727 22...
4,0008BO33KUQ0,2d0e2102ee0dc,1541636411,6,9857656 5849628 655716 1352530 502066 1405084 ...


In [0]:
def transform_Recommendations(clickout_dataframe, RecommendationsDF):
  ListOfItems = RecommendationsDF.reset_index(drop=True)[['session_id', 'item_id']].groupby('session_id', sort=False).item_id.apply(pd.Series.tolist).tolist()
  SessionsListOfItems = pd.DataFrame({'session_id':RecommendationsDF.session_id.unique().tolist(),
                                      'item_recommendations':ListOfItems})
  SessionsListOfItems.item_recommendations = SessionsListOfItems.item_recommendations.apply(lambda x: ' '.join(x))
  data = clickout_dataframe.merge(SessionsListOfItems, on='session_id', how='left')
  return data

In [0]:
def get_probabilities_submission(model_name, X, session_item_dataset):
  '''
  Desc: function that gets the probability of each item being selected by the user, rerank the items in the session based on the probabilites

  Input: model_path: String with the name of the stored model
         X: array of scaled features of the dataset
         session_item_dataset: Pandas Dataframe with the sessions, items, and clickout
        
  Output: clickout_rank: List of lists that carries which item was selected in which rank
          RecommendationsDF: Pandas Dataframe to be transformed and merged to the Clickout Dataframe
  '''
  model = joblib.load(model_name)
  BothProbabilities = model.predict_proba(X)
  Probabilities = [Probability[1] for Probability in BothProbabilities]
  session_item_dataset['probability'] = Probabilities
  RecommendationsDF = session_item_dataset.groupby(['session_id'], sort=False).apply(lambda x: (x.sort_values('probability', ascending=False)))
  return RecommendationsDF

In [57]:
ListOfItems = Output.reset_index(drop=True)[['session_id', 'item_id']].groupby('session_id', sort=False).item_id.apply(pd.Series.tolist).tolist()
SessionsListOfItems = pd.DataFrame({'session_id':Output.session_id.unique().tolist(),
                                    'item_recommendations':ListOfItems})
SessionsListOfItems.item_recommendations = SessionsListOfItems.item_recommendations.apply(lambda x: ' '.join(x))
SessionsListOfItems.head()

Unnamed: 0,session_id,item_recommendations
0,06e7c29170946,10091602 9776792 6625240 6893402 3954788 93867...
1,ac77b5670630f,46453 46711 123395 46980 1355056 9412876 15293...
2,8748f6984266c,929707 104802 1883779 104773 5032060 2706630 1...
3,649458cf7bc3e,2679006 2677282 3371467 4783716 134412 2635106...
4,73c99d2cd8728,3953294 2530750 18904 12618 3842698 137487 270...


In [28]:
from data_transformation import data_transformation

test_clickout = test_set[test_set.action_type=='clickout item'].groupby('session_id').tail(1)
test_clickout = test_clickout[['user_id', 'session_id', 'timestamp', 'step']]
test_set_transformed = data_transformation.transform_data(test_set)
test_session_item = test_set_transformed[['session_id', 'item_id']]
X_test_submission = test_set_transformed[features]
X_test_submission_scaled = full_pipeline.fit_transform(X_test_submission)
RecommendationsDF = get_probabilities_submission('modelsUnderSampling/LR_modelUndersampling.pkl', X_test_submission_scaled, test_session_item)
SubmissionDF = transform_Recommendations(test_clickout, RecommendationsDF)
SubmissionDF.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


Unnamed: 0,user_id,session_id,timestamp,step,item_recommendations
0,004A07DM0IDW,1d688ec168932,1541555799,7,1050068 2059240 399441 2033381 127131 1724779 ...
1,009RGHI3G9A3,f05ab0de907e2,1541570940,2,10884872 7065316
2,00Y1Z24X8084,26b6d294d66e7,1541651823,2,3853058 7101352 4476010 3833012 3843244 271448...
3,01V3WDTDM5CU,07628a0f5be0b,1541575643,5,4115018 7950162 6434434 2817590 2882092 318000...
4,02AOAVF9PVYH,4a01c3afbc224,1541681278,46,7304020 1177554 559056 693596 1451247 1963879 ...


In [0]:
SubmissionDF.to_csv('TrivagoRecommendations.csv')

In [50]:
SubmissionDF

Unnamed: 0,user_id,session_id,timestamp,step,item_recommendations
0,004A07DM0IDW,1d688ec168932,1541555799,7,1050068 2059240 399441 2033381 127131 1724779 ...
1,009RGHI3G9A3,f05ab0de907e2,1541570940,2,10884872 7065316
2,00Y1Z24X8084,26b6d294d66e7,1541651823,2,3853058 7101352 4476010 3833012 3843244 271448...
3,01V3WDTDM5CU,07628a0f5be0b,1541575643,5,4115018 7950162 6434434 2817590 2882092 318000...
4,02AOAVF9PVYH,4a01c3afbc224,1541681278,46,7304020 1177554 559056 693596 1451247 1963879 ...
...,...,...,...,...,...
275674,ZXGCLNDBW84E,53cb30d7ca9c6,1541703038,4,477936 503361 1886269 1668443 1326068 3847842 ...
275675,ZYG4MMKT847V,0f128fd98b4e3,1541634384,4,2032597 1089752 9394020 2684451 2027835 279089...
275676,ZYMVSZ5A3KQI,8cd16c29b733b,1541695019,2,364796 1299248 4891790 381296 1345175 9647370
275677,ZZ39YE45SZIE,f5db4092ec9fc,1541551945,6,2849116 2625125 3378564 7365168 3166629 577914...
