# Prediction of bookings based on user behavior
Data Scientist – User Profiling, Hotel Search

- Author: Kai Chen
- Date: Apr, 2018

## Situation:

A search session describes a user’s journey to find their ideal hotel and includes all of their interactions. Given user search sessions, we are interested in predicting the outcome of these sessions based on the users’ interactions, as well as determining which of these interactions are most important for this estimation.

## Data:

We provide two kinds of data sets:

- anonymized user logs generated by usage on our website (user actions); and,
- booking outcome per session with contextual information (bookings).


Both types of datasets are split by the same timestamp into train and target sets. The target set contains the same information as the training set, except the outcome (i.e. has_booking). More information is provided in README.md (You can find this in the resources section).


## Task:

The task is to train a machine learning model to estimate if a booking occurred – the training and target sets have been provided for you. We expect binary predictions for the target sessions, which will be evaluated by Matthews Correlation Coefficient (MCC) using the ground truth dataset on our side. You can have as many submissions as you would like to improve your solution.
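
To make the evaluation metric concrete, here is a minimal sketch of the Matthews Correlation Coefficient computed from the confusion-matrix counts, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name `mcc` and the toy labels are illustrative assumptions, not part of the case-study material; in the notebook itself `sklearn.metrics.matthews_corrcoef` is imported for this purpose.

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when the denominator vanishes
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical toy labels: 2 bookings among 10 sessions
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * 10                      # 80% accuracy, but no skill
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]     # 1 TP, 1 FP, 1 FN, 7 TN

print(mcc(y_true, all_negative))  # 0.0
print(mcc(y_true, one_hit))       # 0.375
```

The all-negative predictor illustrates why MCC is preferred over accuracy here: it is right on 8 of 10 sessions yet scores 0.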


## Additional questions:

- What makes the classification problem difficult in this task? How do you handle that?
- Evaluate and compare at least 3 classification methods for this task.
- Propose at least 3 features that are significant for predicting bookings.
- We can spot a very significant action type. What might this action refer to?



There are 3 types of data: Bookings, User Actions, and Example Solution.

## Data: Bookings
- Description: List of sessions, each with: session-related contextual data, and whether at least one booking was made
- Files:
	- case_study_bookings_train.csv: Training sessions for bookings
	- case_study_bookings_target.csv: Target sessions to predict bookings
- Rows: Each row represents a session with session context and the outcome of this session
- Columns:
	- ymd: Date of the session in format 'yyMMdd'
	- user_id: Anonymized cookie id of the visitor
	- session_id: Anonymized id of the session
	- referer_code: Encoded category of the referer to the website
	- is_app: If the session was made using the trivago app
	- agent_id: Encoded type of the browser
	- traffic_type: A categorization of the type of the traffic
	- has_booking: 1 if at least one booking was made during the session (excluded from the target set)
    
## Data: User Actions
- Description: Sequence of various types of user actions generated during the usage of the website.
- Files
	- case_study_actions_train.csv: Training set of user actions
	- case_study_actions_target.csv: User actions in the target sessions
- Rows: Each row represents one action from/to the user
- Columns:
	- ymd: Date of the action in format 'yyMMdd'
	- user_id: Anonymized cookie id of the visitor
	- session_id: Anonymized id of the session
	- action_id: Type of the action
	- reference: Object of the action. We note that action_ids with a large set of reference values (e.g. action id '2116') are typically related to the content (e.g. hotels, destinations or keywords), while action_ids with a small reference set (e.g. action id '2351') are more related to a function of the website (e.g. sorting order, room type, filters, etc.).
	- step: The number identifying the action in the session
	
## Data: Example Solution
- Description: List of predictions for bookings in the target sessions
- File: case_study_bookings_target_prediction_example.csv
- Rows: Each row represents a target session for which a prediction should be given
- Columns:
	- session_id: Anonymized id of the session
	- has_booking: Random binary predictions for bookings

In [14]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter("ignore", DeprecationWarning)

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression

import numpy as np
import pandas as pd
from datetime import datetime
import operator
from collections import OrderedDict

import csv

from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn import linear_model

import xgboost as xgb
from xgboost import XGBClassifier

import lightgbm as lgb

import catboost
from catboost import CatBoostClassifier

np.random.seed(42)

In [15]:
# ---
# Define file paths
TRAIN_BOOKING_FILE_PATH = 'data/case_study_bookings_train.csv'    # training sessions for bookings
TARGET_BOOKING_FILE_PATH = 'data/case_study_bookings_target.csv'  # target sessions to predict bookings

TRAIN_ACTION_FILE_PATH = 'data/case_study_actions_train.csv'       # training set of user actions
TARGET_ACTION_FILE_PATH = 'data/case_study_actions_target.csv'     # user actions in the target sessions

## Step 1: read and explore the data

In [16]:
"""
train booking data
- ymd: Date of the session in format 'yyMMdd'
- user_id: Anonymized cookie id of the visitor
- session_id: Anonymized id of the session
- referer_code: Encoded category of the referer to the website
- is_app: If the session was made using the trivago app
- agent_id: Encoded type of the browser
- traffic_type: A categorization of the type of the traffic
- has_booking: 1 if at least one booking was made during the session (excluded from the target set)
"""

train_booking_df = pd.read_csv(TRAIN_BOOKING_FILE_PATH, sep='\t')
train_booking_df['ymd'] = pd.to_datetime(train_booking_df['ymd'].astype('str'))

print('train booking')
print(train_booking_df.columns)
print(train_booking_df.describe())
display(train_booking_df.head(5))

train booking
Index(['ymd', 'user_id', 'session_id', 'referer_code', 'is_app', 'agent_id',
       'traffic_type', 'has_booking'],
      dtype='object')
            user_id    session_id   referer_code         is_app  \
count  3.076770e+05  3.076770e+05  307677.000000  307677.000000   
mean   4.622586e+18  4.609514e+18      22.857828       0.073571   
std    2.665868e+18  2.662771e+18      40.179017       0.261071   
min    3.883091e+14  1.097161e+14       0.000000       0.000000   
25%    2.312274e+18  2.303980e+18       0.000000       0.000000   
50%    4.635855e+18  4.612005e+18       1.000000       0.000000   
75%    6.932941e+18  6.912236e+18      15.000000       0.000000   
max    9.223267e+18  9.223359e+18      99.000000       1.000000   

            agent_id   traffic_type    has_booking  
count  307677.000000  307677.000000  307677.000000  
mean        7.424809       2.686174       0.063856  
std         3.713358       1.906000       0.244497  
min         0.000000       1.000

Unnamed: 0,ymd,user_id,session_id,referer_code,is_app,agent_id,traffic_type,has_booking
0,2017-04-23,388309106223940,3052767322364990735,0,0,2,1,0
1,2017-04-10,452426828488840,1022778951418899936,0,0,10,2,0
2,2017-04-15,452426828488840,4191504489082712531,0,0,10,2,0
3,2017-04-06,819438352219100,4560227804862289210,1,0,1,1,0
4,2017-04-07,1113732603712480,4115013282086590434,0,0,9,2,0
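
The `describe()` output above reports a mean `has_booking` of about 0.064, i.e. roughly 6.4% of training sessions contain a booking, so the classes are strongly imbalanced. The sketch below shows how the positive rate and the negative-to-positive ratio could be computed; the toy label list is a hypothetical stand-in for `train_booking_df['has_booking']`, and how the ratio is used (for example as XGBoost's `scale_pos_weight`) is an assumption rather than something the notebook states.

```python
# Hypothetical toy labels standing in for train_booking_df['has_booking']
has_booking = [0] * 94 + [1] * 6

n_pos = sum(has_booking)
n_neg = len(has_booking) - n_pos

# Share of sessions with at least one booking
positive_rate = n_pos / len(has_booking)

# Ratio of negatives to positives; this kind of ratio is often passed as a
# positive-class weight (e.g. XGBoost's scale_pos_weight parameter)
neg_pos_ratio = n_neg / n_pos

print('positive rate: {:.3f}'.format(positive_rate))
print('negative/positive ratio: {:.2f}'.format(neg_pos_ratio))
```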


In [17]:
"""
target booking data
- ymd: Date of the session in format 'yyMMdd'
- user_id: Anonymized cookie id of the visitor
- session_id: Anonymized id of the session
- referer_code: Encoded category of the referer to the website
- is_app: If the session was made using the trivago app
- agent_id: Encoded type of the browser
- traffic_type: A categorization of the type of the traffic
"""

target_booking_df = pd.read_csv(TARGET_BOOKING_FILE_PATH, sep='\t')
target_booking_df['ymd'] = pd.to_datetime(target_booking_df['ymd'].astype('str'))

print('target booking')
print(target_booking_df.columns)
display(target_booking_df.head(5))

target booking
Index(['ymd', 'user_id', 'session_id', 'referer_code', 'is_app', 'agent_id',
       'traffic_type'],
      dtype='object')


Unnamed: 0,ymd,user_id,session_id,referer_code,is_app,agent_id,traffic_type
0,2017-04-30,1607565913119260,4175939893794521966,0,0,14,6
1,2017-04-30,1607565913119260,9175174925268392332,0,0,14,1
2,2017-04-30,2669945826129900,5361965966177226983,0,0,6,6
3,2017-04-30,6247954936827660,7996347049132178025,0,0,13,2
4,2017-04-30,6447705595982360,6061498713259551906,99,0,1,6


In [18]:
# get number of users and sessions in the train booking data

train_user_id_list = train_booking_df['user_id'].unique()
train_session_id_list = train_booking_df['session_id'].unique()

print('number of users (train booking data): {}'.format(len(train_user_id_list)))
print('number of sessions (train booking data): {}'.format(len(train_session_id_list)))
print('dataframe size (train booking data)')
print(train_booking_df.shape)

number of users (train booking data): 181860
number of sessions (train booking data): 307677
dataframe size (train booking data)
(307677, 8)


In [19]:
# get number of users and sessions in the target booking data

target_user_id_list = target_booking_df['user_id'].unique()
target_session_id_list = target_booking_df['session_id'].unique()

print('number of users (target booking data): {}'.format(len(target_user_id_list)))
print('number of sessions (target booking data): {}'.format(len(target_session_id_list)))
print('dataframe size (target booking data)')
print(target_booking_df.shape)

number of users (target booking data): 23402
number of sessions (target booking data): 30128
dataframe size (target booking data)
(30128, 7)


In [20]:
"""
train action data
- ymd: Date of the action in format 'yyMMdd'
- user_id: Anonymized cookie id of the visitor
- session_id: Anonymized id of the session
- action_id: Type of the action
- reference: Object of the action. We note that action_ids with a large set of reference values (e.g. action id '2116') are typically related to the content (e.g. hotels, destinations or keywords), while action_ids with a small reference set (e.g. action id '2351') are more related to a function of the website (e.g. sorting order, room type, filters, etc.).
- step: The number identifying the action in the session
"""
train_action_df = pd.read_csv(TRAIN_ACTION_FILE_PATH, sep='\t')
train_action_df['ymd'] = pd.to_datetime(train_action_df['ymd'].astype('str'))

print('train action')
print(train_action_df.columns)
print(train_action_df.describe())
display(train_action_df.head(5))

train action
Index(['ymd', 'user_id', 'session_id', 'action_id', 'reference', 'step'], dtype='object')
            user_id    session_id     action_id     reference          step
count  5.862863e+06  5.862863e+06  5.862863e+06  5.862863e+06  5.862863e+06
mean   4.612543e+18  4.607893e+18  2.813712e+03  4.898678e+05  5.464623e+01
std    2.657161e+18  2.656793e+18  1.635718e+03  2.865848e+06  1.319077e+02
min    3.883091e+14  1.097161e+14  2.900000e+01 -1.000000e+00  1.000000e+00
25%    2.307275e+18  2.310688e+18  2.119000e+03  1.000000e+00  6.000000e+00
50%    4.624574e+18  4.606641e+18  2.146000e+03  3.164400e+04  1.800000e+01
75%    6.897454e+18  6.892497e+18  2.501000e+03  1.297190e+05  4.800000e+01
max    9.223267e+18  9.223359e+18  8.091000e+03  6.814322e+08  3.133000e+03


Unnamed: 0,ymd,user_id,session_id,action_id,reference,step
0,2017-04-23,388309106223940,3052767322364990735,8001,1323836,1
1,2017-04-10,452426828488840,1022778951418899936,2116,929835,1
2,2017-04-10,452426828488840,1022778951418899936,6999,0,2
3,2017-04-10,452426828488840,1022778951418899936,2116,929835,3
4,2017-04-10,452426828488840,1022778951418899936,2503,1,4
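
The README note about "big" versus "small" reference sets can be checked by counting the distinct `reference` values per `action_id`. The sketch below uses a small hypothetical DataFrame in place of `train_action_df`; on the real data the same `groupby` would presumably separate content-related actions from function-related ones, but that outcome is not shown in the original notebook.

```python
import pandas as pd

# Hypothetical toy actions; on the real data this would be train_action_df
actions = pd.DataFrame({
    'action_id': [2116, 2116, 2116, 2351, 2351, 2351],
    'reference': [929835, 1323836, 31644, 1, 0, 1],
})

# Number of distinct reference values observed for each action type,
# largest reference sets first
ref_cardinality = (
    actions.groupby('action_id')['reference']
    .nunique()
    .sort_values(ascending=False)
)
print(ref_cardinality)
```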


In [21]:
# get number of users and sessions in the action list
train_user_id_action_list = train_action_df['user_id'].unique()
train_session_id_action_list = train_action_df['session_id'].unique()

print('number of users (train action data): {}'.format(len(train_user_id_action_list)))
print('number of sessions (train action data): {}'.format(len(train_session_id_action_list)))
print('dataframe size (train action data)')
print(train_action_df.shape)

number of users (train action data): 181730
number of sessions (train action data): 306106
dataframe size (train action data)
(5862863, 6)


In [22]:
print(len(set(train_user_id_list) - set(train_user_id_action_list)))

print(len(set(train_user_id_action_list) - set(train_user_id_list)))

# every user who has an action also appears in the booking data
# 130 users in the training booking data have no actions

130
0


In [24]:
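# get the users who do not have any action information
train_user_id_action_set = set(train_user_id_action_list)  # a set gives fast membership tests
train_user_id_no_action_list = []
for user_id in train_user_id_list:
    if user_id not in train_user_id_action_set:
        train_user_id_no_action_list.append(user_id)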
    
print(len(train_user_id_no_action_list))

130


In [11]:
nb_bookings_action = []
nb_bookings_no_action = []

for user_id in train_user_id_list:
    nb_bookings = np.sum(train_booking_df[train_booking_df['user_id'] == user_id]['has_booking'].values)
    nb_bookings_action.append(nb_bookings)

for user_id in train_user_id_no_action_list:
    nb_bookings = np.sum(train_booking_df[train_booking_df['user_id'] == user_id]['has_booking'].values)
    nb_bookings_no_action.append(nb_bookings)
    
    
print('number of users (with action): {}'.format(len(train_user_id_list))) 
print('number of users (without action): {}'.format(len(train_user_id_no_action_list))) 
print('number of bookings (with action): {}'.format(np.sum(nb_bookings_action)))
print('number of bookings (without action): {}'.format(np.sum(nb_bookings_no_action)))
print('mean number of bookings (with action): {}'.format(np.mean(nb_bookings_action)))
print('mean number of bookings (without action): {}'.format(np.mean(nb_bookings_no_action)))
print('standard deviation number of bookings (with action): {}'.format(np.std(nb_bookings_action)))
print('standard deviation number of bookings (without action): {}'.format(np.std(nb_bookings_no_action)))
        
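# Note: the 'with action' figures are computed over train_user_id_list, i.e. all 181860 users,
# including the 130 without actions. Even so, users without actions book far more often on
# average (about 0.79 vs. 0.11 bookings per user).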

number of users (with action): 181860
number of users (without action): 130
number of bookings (with action): 19647
number of bookings (without action): 103
mean number of bookings (with action): 0.1080336522599802
mean number of bookings (without action): 0.7923076923076923
standard deviation number of bookings (with action): 0.3341389925584934
standard deviation number of bookings (without action): 0.829236477788257


In [12]:
action_id_list = train_action_df['action_id'].unique()
print('action id')
print(action_id_list)
print(len(action_id_list))
min_action_id = np.min(action_id_list)
print('min action id: {}'.format(min_action_id))
max_action_id = np.max(action_id_list)
print('max action id: {}'.format(max_action_id))

reference_list = train_action_df['reference'].unique()
print('reference id')
print(reference_list)
print(len(reference_list))
min_reference = np.min(reference_list)
max_reference = np.max(reference_list)
print('min reference {}'.format(min_reference))
print('max reference {}'.format(max_reference))

# correlation between action_id and reference, computed on the action data
# (train_user_df does not exist yet at this point in the notebook)
corr_score = train_action_df['action_id'].corr(train_action_df['reference'])
print(corr_score)

action id
[8001 2116 6999 2503 2113 2100 2362 2306 2358 2350 2146 2331 2145 2122
 2502 2166 2260 2296 8010 2119 2115 2351 2175 2142 2314 2111 2357 2123
 2367 2262 2501 2133 2136 2135 2216 2121 2257 2188 2155 2114 2700 2788
 2784 2884 2710 2791 2726 2792 2702 2226 2252 2156 2504 2863 2773 2721
 2776 2840 2765 2720 2227 2143 2356 2307 2124 2391 2440 2301 2789 2881
 2713 2701 2777 2845 2779 2750 2728 2873 2719 2126 2206 2291 2894 2793
 2897 2778 2848 2860 2851 2853 2844 2704 2759 2761 2706 2892 2820 2850
 2781 2751 2752 2888 2893 2128 2125 2725 2352 2215 2371 2448 2255 2364
 2279 2359 2160 2790 2302 2200 2842 2812 2309 2365 8002 2186 2132 2130
 2714 2846 2733 2887 2134 2205 2131 2729 2735 2794 2711 2811 2814 2731
 2753 2703 2885 2727 2730 8020 2170 2370 2449 2755 2775 2895 2764 2797
 2712 2843 2443 2442 2446 2445 2310 2353 2168 2874 2707 2878 2734 2841
 2385 2732 2292 2876 2875 2869 2148 2891 2857 2856 2705 2380 2137 8006
 2181 2283 2153 2865 2855 2882 2441 2785 2852 2854 2858 2859 2191  

In [None]:
"""
target action data
- ymd: Date of the action in format 'yyMMdd'
- user_id: Anonymized cookie id of the visitor
- session_id: Anonymized id of the session
- action_id: Type of the action
- reference: Object of the action. We note that action_ids with a large set of reference values (e.g. action id '2116') are typically related to the content (e.g. hotels, destinations or keywords), while action_ids with a small reference set (e.g. action id '2351') are more related to a function of the website (e.g. sorting order, room type, filters, etc.).
- step: The number identifying the action in the session
"""

target_action_df = pd.read_csv(TARGET_ACTION_FILE_PATH, sep='\t')
target_action_df['ymd'] = pd.to_datetime(target_action_df['ymd'].astype('str'))

print('target action')
print(target_action_df.columns)
print(target_action_df.describe())
print(target_action_df.head(5))

In [None]:
target_user_id_action_list = target_action_df['user_id'].unique()
target_session_id_action_list = target_action_df['session_id'].unique()

print('number of users (target action data): {}'.format(len(target_user_id_action_list)))
print('number of sessions (target action data): {}'.format(len(target_session_id_action_list)))
print('dataframe size (target action data)')
print(target_action_df.shape)

In [None]:
# replace NaN values with a specific sentinel value

NA_ACTION_ID = -10
NA_REFERENCE_ID = -10
NA_STEP = 0

In [None]:
train_user_df =  pd.merge(train_booking_df, train_action_df, on=['ymd', 'user_id', 'session_id'], how='left')

print('number of rows where action id is NaN: {}'.format(train_user_df['action_id'].isnull().sum()))
print('number of rows where reference is NaN: {}'.format(train_user_df['reference'].isnull().sum()))
print('number of rows where step is NaN: {}'.format(train_user_df['step'].isnull().sum()))
#print(train_user_df[train_user_df['action_id'].isnull() | train_user_df['reference'].isnull()])

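# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the parent DataFrame
train_user_df['action_id'] = train_user_df['action_id'].fillna(NA_ACTION_ID)
train_user_df['reference'] = train_user_df['reference'].fillna(NA_REFERENCE_ID)
train_user_df['step'] = train_user_df['step'].fillna(NA_STEP)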

print('number of rows where action id is NaN: {}'.format(train_user_df['action_id'].isnull().sum()))
print('number of rows where reference is NaN: {}'.format(train_user_df['reference'].isnull().sum()))
print('number of rows where step is NaN: {}'.format(train_user_df['step'].isnull().sum()))
#print(train_user_df[train_user_df['action_id'].isnull() | train_user_df['reference'].isnull()])

train_user_df['action_id'] = train_user_df['action_id'].astype('int')
train_user_df['reference'] = train_user_df['reference'].astype('int')
train_user_df['step'] = train_user_df['step'].astype('int')

print(train_user_df.columns)
print(train_user_df.describe())
print('train user df shape')
print(train_user_df.shape)
display(train_user_df.head(5))
print('number of users {}'.format(len(train_user_df['user_id'].unique())))
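
As a complement to counting NaN rows after the left merge, `pd.merge` accepts `indicator=True`, which adds a `_merge` column marking rows that exist only in the left frame. The sketch below uses two hypothetical toy frames in place of `train_booking_df` and `train_action_df`, and merges on `session_id` only for brevity; the sentinel `-10` mirrors `NA_ACTION_ID` from the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for the booking and action data
bookings = pd.DataFrame({'session_id': [10, 11, 12], 'has_booking': [0, 1, 0]})
actions = pd.DataFrame({'session_id': [10, 10, 12],
                        'action_id': [2116, 6999, 8001]})

# Left merge keeps every session; indicator=True records where each row came from
merged = pd.merge(bookings, actions, on='session_id', how='left', indicator=True)

# Sessions with no action produce exactly one 'left_only' row
n_sessions_without_actions = (merged['_merge'] == 'left_only').sum()
print('sessions without actions: {}'.format(n_sessions_without_actions))

# Fill the missing action ids with the sentinel and restore the integer dtype
merged['action_id'] = merged['action_id'].fillna(-10).astype('int')
print(merged)
```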

In [None]:
print('ymd (train)')
print(train_user_df['ymd'].unique())

In [None]:
target_user_df =  pd.merge(target_booking_df, target_action_df, on=['ymd', 'user_id', 'session_id'], how='left')

print('number of rows where action id (target) is NaN: {}'.format(target_user_df['action_id'].isnull().sum()))
print('number of rows where reference (target) is NaN: {}'.format(target_user_df['reference'].isnull().sum()))
print('number of rows where step (target) is NaN: {}'.format(target_user_df['step'].isnull().sum()))
#print(target_user_df[target_user_df['action_id'].isnull() | target_user_df['reference'].isnull()])

# assign the result back instead of calling fillna(inplace=True) on a column selection
target_user_df['action_id'] = target_user_df['action_id'].fillna(NA_ACTION_ID)
target_user_df['reference'] = target_user_df['reference'].fillna(NA_REFERENCE_ID)
target_user_df['step'] = target_user_df['step'].fillna(NA_STEP)

print('number of rows where action id (target) is NaN: {}'.format(target_user_df['action_id'].isnull().sum()))
print('number of rows where reference (target) is NaN: {}'.format(target_user_df['reference'].isnull().sum()))
print('number of rows where step (target) is NaN: {}'.format(target_user_df['step'].isnull().sum()))
#print(target_user_df[target_user_df['action_id'].isnull() | target_user_df['reference'].isnull()])

target_user_df['action_id'] = target_user_df['action_id'].astype('int')
target_user_df['reference'] = target_user_df['reference'].astype('int')
target_user_df['step'] = target_user_df['step'].astype('int')

print(target_user_df.columns)
print(target_user_df.describe())
print('target user df shape')
print(target_user_df.shape)
display(target_user_df.head(5))
print('number of users (target) {}'.format(len(target_user_df['user_id'].unique())))


In [None]:
print('ymd (target)')
print(target_user_df['ymd'].unique())

In [None]:
train_user_id_list = train_user_df['user_id'].unique()

print('number of users (train) {}'.format(len(train_user_id_list)))

target_user_id_list = target_user_df['user_id'].unique()

print('number of users (target) {}'.format(len(target_user_id_list)))


print('\nnumber of different users between train user id and target user id')
print(len(set(train_user_id_list) - set(target_user_id_list)))
print(len(set(target_user_id_list) - set(train_user_id_list)))

train_user_id_set = set(train_user_id_list)  # a set gives fast membership tests
intersect_user_id_list = []
for user_id in target_user_id_list:
    if user_id in train_user_id_set:
        intersect_user_id_list.append(user_id)
    
print('number of users in target data that can be found in train data {}'.format(len(intersect_user_id_list)))

# Although 8062 of the 23402 users in the target set also appear in the train set,
# I suspect that using 'user_id' as a feature would overfit the model.

In [None]:
print('correlation')

corr_score = train_user_df['user_id'].corr(train_user_df['has_booking'])
print('corr score of user_id and has_booking {}'.format(corr_score))

corr_score = train_user_df['referer_code'].corr(train_user_df['has_booking'])
print('corr score of referer_code and has_booking {}'.format(corr_score))

corr_score = train_user_df['is_app'].corr(train_user_df['has_booking'])
print('corr score of is_app and has_booking {}'.format(corr_score))

corr_score = train_user_df['agent_id'].corr(train_user_df['has_booking'])
print('corr score of agent_id and has_booking {}'.format(corr_score))

corr_score = train_user_df['traffic_type'].corr(train_user_df['has_booking'])
print('corr score of traffic_type and has_booking {}'.format(corr_score))

corr_score = train_user_df['action_id'].corr(train_user_df['has_booking'])
print('corr score of action_id and has_booking {}'.format(corr_score))

corr_score = train_user_df['reference'].corr(train_user_df['has_booking'])
print('corr score of reference and has_booking {}'.format(corr_score))

corr_score = train_user_df['step'].corr(train_user_df['has_booking'])
print('corr score of step and has_booking {}'.format(corr_score))
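
Because `referer_code`, `agent_id`, and `traffic_type` are encoded categories, a Pearson coefficient computed on the codes depends on how the categories happen to be numbered. A complementary view is the booking rate per category value. The sketch below is a minimal illustration on a hypothetical session-level frame (one row per session, as in `train_booking_df`); the column names `n_sessions` and `booking_rate` are illustrative.

```python
import pandas as pd

# Hypothetical toy sessions, one row per session
sessions = pd.DataFrame({
    'traffic_type': [1, 1, 1, 2, 2, 6],
    'has_booking':  [1, 0, 1, 0, 0, 0],
})

# For each category: how many sessions it has and what share of them booked
rate_by_type = (
    sessions.groupby('traffic_type')['has_booking']
    .agg(n_sessions='count', booking_rate='mean')
)
print(rate_by_type)
```

Reporting the session count next to the rate helps to avoid over-reading rates that rest on only a handful of sessions.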

In [None]:
def get_nb_bookings_dict(df, column_name, has_booking_name='has_booking'):
    # key: feature value  value: number of bookings
    dict_nb_bookings = dict()
    col_list = df[column_name].unique()
    # print(column_name)
    # print(col_list)
    for value in col_list:
        values = df[df[column_name] == value][has_booking_name].values
        dict_nb_bookings[value] = sum(values)

    return dict_nb_bookings
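
The function above sums `has_booking` over all rows that share a feature value. When it is applied to the merged frame, which has one row per action, a booked session is counted once per action it contains. The sketch below reproduces the row-level sum with a single `groupby` and contrasts it with a variant that counts each session once; the toy frame is hypothetical, and whether the session-level count is the more appropriate one is a judgment call rather than something the notebook specifies.

```python
import pandas as pd

# Hypothetical merged frame: one row per action, session 10 has three actions
merged = pd.DataFrame({
    'session_id':   [10, 10, 10, 11, 12],
    'referer_code': [0, 0, 0, 1, 0],
    'has_booking':  [1, 1, 1, 0, 1],
})

# Row-level sum: the same quantity get_nb_bookings_dict computes
row_level = merged.groupby('referer_code')['has_booking'].sum().to_dict()

# Session-level sum: keep one row per session before aggregating
session_level = (
    merged.drop_duplicates('session_id')
    .groupby('referer_code')['has_booking']
    .sum()
    .to_dict()
)

print(row_level)      # booked session 10 contributes 3 here
print(session_level)  # and 1 here
```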

In [None]:
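def save_dict_to_csv(data_dict, csv_path):
    # 'with' closes (and flushes) the file; newline='' is what the csv module expects
    with open(csv_path, "w", newline='') as f:
        w = csv.writer(f)
        for key, val in data_dict.items():
            w.writerow([key, val])

def read_dict_from_csv(csv_path):
    # note: keys are read back as strings, values as ints
    result = {}
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            result[row[0]] = int(row[1].strip())
    return result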

In [None]:
# Although 8062 of the 23402 users in the target set also appear in the train set,
# I suspect that using 'user_id' as a feature would overfit the model.
feature_columns = ['referer_code', 'is_app', 'agent_id', 'traffic_type', 'action_id', 'reference', 'step']

dict_feature_nb_bookings = dict()

for feature_column in feature_columns:
    dict_feature_nb_bookings[feature_column] = get_nb_bookings_dict(train_user_df, feature_column)
    
    print('\n --------------------')
    print(feature_column)
    print(dict_feature_nb_bookings[feature_column].keys())
    print('{}, {}'.format(feature_column, 'nb bookings'))
    for key, value in dict_feature_nb_bookings[feature_column].items():
        print('{}, {}'.format(key, value))
    print('\n --------------------\n')
    
    # save the dictionary
    csv_path = '{}-nb_bookings.csv'.format(feature_column)
    save_dict_to_csv(dict_feature_nb_bookings[feature_column], csv_path)
    print('save dictionary to {}'.format(csv_path))
    

In [None]:
def plot_dict(data_dict, title, xlabel, ylabel):
    """
    plot dictionary
    """
    plt.bar(range(len(data_dict)), list(data_dict.values()), align='center')
    plt.xticks(range(len(data_dict)), list(data_dict.keys()))
    plt.xticks(rotation=90)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    # # for python 2.x:
    # plt.bar(range(len(data_dict)), data_dict.values(), align='center')  # python 2.x
    # plt.xticks(range(len(data_dict)), data_dict.keys())  # in python 2.x
    plt.show()

In [None]:
# referer_code: Encoded category of the referer to the website

# dict_referer_code, referer_code_list = get_nb_bookings_dict(train_user_df, 'referer_code')
dict_referer_code_nb_bookings = read_dict_from_csv('referer_code-nb_bookings.csv')
    
dict_referer_code_nb_bookings = OrderedDict(sorted(dict_referer_code_nb_bookings.items(), key=lambda x: x[1]))

plot_dict(dict_referer_code_nb_bookings, 'referer_code', 'referer_code', 'nb bookings')

#plot_dict(dict_feature_nb_bookings['referer_code'], 'referer_code', 'referer_code', 'nb bookings')

print('referer code, number of bookings')
for key, value in dict_referer_code_nb_bookings.items():
    print('{}, {}'.format(key, value))
    
plt.boxplot(list(dict_referer_code_nb_bookings.values()))
plt.title('nb of bookings (referer code)')
plt.show()
    
# Users with the referer code 1 have the largest number of bookings
# Users with the referer code 24, 21 have no bookings.

In [None]:
# is_app: If the session was made using the trivago app

dict_is_app_nb_bookings = read_dict_from_csv('is_app-nb_bookings.csv')

plot_dict(dict_is_app_nb_bookings, 'is_app', 'is_app', 'nb bookings')
#plot_dict(dict_feature_nb_bookings['is_app'], 'is_app', 'is_app', 'nb bookings')

print('is app, number of bookings')
for key, value in dict_is_app_nb_bookings.items():
    print('{}, {}'.format(key, value))
    
# Most of the bookings were not made using the trivago app.

In [None]:
# agent_id: Encoded type of the browser

dict_agent_id_nb_bookings = read_dict_from_csv('agent_id-nb_bookings.csv')

dict_agent_id_nb_bookings = OrderedDict(sorted(dict_agent_id_nb_bookings.items(), key=lambda x: x[1]))

plot_dict(dict_agent_id_nb_bookings, 'agent_id', 'agent_id', 'nb bookings')

# plot_dict(dict_feature_nb_bookings['agent_id'], 'agent_id', 'agent_id', 'nb bookings')

plt.boxplot(list(dict_agent_id_nb_bookings.values()))
plt.title('nb of bookings (agent_id)')
plt.show()

# The bar chart shows which agent_id has the largest number of bookings

In [None]:
# traffic_type: A categorization of the type of the traffic

dict_traffic_type_nb_bookings = read_dict_from_csv('traffic_type-nb_bookings.csv')
    
dict_traffic_type_nb_bookings = OrderedDict(sorted(dict_traffic_type_nb_bookings.items(), key=lambda x: x[1]))

plot_dict(dict_traffic_type_nb_bookings, 'traffic_type', 'traffic_type', 'nb bookings')

# plot_dict(dict_feature_nb_bookings['traffic_type'], 'traffic_type', 'traffic_type', 'nb bookings')

print('max number of bookings (traffic type) {}'.format(np.max(list(dict_traffic_type_nb_bookings.values()))))
print('min number of bookings (traffic type) {}'.format(np.min(list(dict_traffic_type_nb_bookings.values()))))
print('mean number of bookings (traffic type) {}'.format(np.mean(list(dict_traffic_type_nb_bookings.values()))))
print('standard deviation number of bookings (traffic type) {}'.format(np.std(list(dict_traffic_type_nb_bookings.values()))))

plt.boxplot(list(dict_traffic_type_nb_bookings.values()))
plt.title('nb of bookings (traffic type)')
plt.show()

# We can see that traffic type 1 has the largest number of bookings

In [None]:
# action_id: Type of the action

dict_action_id_nb_bookings = read_dict_from_csv('action_id-nb_bookings.csv')
    
dict_action_id_nb_bookings = OrderedDict(sorted(dict_action_id_nb_bookings.items(), key=lambda x: x[1]))

print('number of action id {}'.format(len(list(dict_action_id_nb_bookings.keys()))))

dict_action_id_nb_bookings_top = {}

key_list = list(dict_action_id_nb_bookings.keys())[-20:]
for key in key_list:
    dict_action_id_nb_bookings_top[key] = dict_action_id_nb_bookings[key]
    
plot_dict(dict_action_id_nb_bookings_top, 'top 20 action id', 'action_id', 'nb bookings')

action_id_nb_booking_list = list(dict_action_id_nb_bookings.values())
print('max number of bookings (action id) {}'.format(np.max(action_id_nb_booking_list)))
print('min number of bookings (action id) {}'.format(np.min(action_id_nb_booking_list)))
print('mean number of bookings (action id) {}'.format(np.mean(action_id_nb_booking_list)))
print('standard deviation number of bookings (action id) {}'.format(np.std(action_id_nb_booking_list)))
#plot_dict(dict_feature_nb_bookings['action_id'], 'action_id', 'action_id', 'nb bookings')

plt.boxplot(action_id_nb_booking_list)
plt.title('nb bookings (action id)')
plt.show()

In [None]:
# reference: Object of the action. 
# - We note that action_ids with a large set of reference values (e.g. action id '2116') are typically related to 
# the content (e.g. hotels, destinations or keywords), while action_ids with a small reference set (e.g. action id '2351') are more related to a function of the website (e.g. sorting order, room type, filters, etc.)
    
dict_reference_nb_bookings = read_dict_from_csv('reference-nb_bookings.csv')
    
dict_reference_nb_bookings = OrderedDict(sorted(dict_reference_nb_bookings.items(), key=lambda x: x[1]))

print('number of references {}'.format(len(list(dict_reference_nb_bookings.keys()))))

dict_reference_nb_bookings_top = {}

key_list = list(dict_reference_nb_bookings.keys())[-20:]
for key in key_list:
    dict_reference_nb_bookings_top[key] = dict_reference_nb_bookings[key]
    
plot_dict(dict_reference_nb_bookings_top, 'top 20 references', 'reference', 'nb bookings')

reference_nb_booking_list = list(dict_reference_nb_bookings_top.values())
print('max number of bookings (reference) {}'.format(np.max(reference_nb_booking_list)))
print('min number of bookings (reference) {}'.format(np.min(reference_nb_booking_list)))
print('mean number of bookings (reference) {}'.format(np.mean(reference_nb_booking_list)))
print('standard deviation number of bookings (reference) {}'.format(np.std(reference_nb_booking_list)))
#plot_dict(dict_feature_nb_bookings['action_id'], 'action_id', 'action_id', 'nb bookings')

plt.boxplot(reference_nb_booking_list)
plt.title('nb bookings (reference)')
plt.show()

# plot_dict(dict_feature_nb_bookings['reference'], 'reference', 'reference', 'nb bookings')


In [None]:
# step: The number identifying the action in the session
    
dict_step_nb_bookings = read_dict_from_csv('step-nb_bookings.csv')

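# keys read back from the CSV are strings; convert to int and sort so the x-axis is numeric
step_items = sorted((int(k), v) for k, v in dict_step_nb_bookings.items())
plt.plot([k for k, _ in step_items], [v for _, v in step_items])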
plt.title('step')
plt.xlabel('step')
plt.ylabel('nb bookings')
plt.show()

# plot_dict(dict_feature_nb_bookings['step'], 'step', 'step', 'nb bookings')

step_nb_booking_list = list(dict_step_nb_bookings.values())
print('max number of bookings (step) {}'.format(np.max(step_nb_booking_list)))
print('min number of bookings (step) {}'.format(np.min(step_nb_booking_list)))
print('mean number of bookings (step) {}'.format(np.mean(step_nb_booking_list)))
print('standard deviation number of bookings (step) {}'.format(np.std(step_nb_booking_list)))

plt.boxplot(step_nb_booking_list)
plt.title('nb bookings (step)')
plt.show()

# The plot shows that the number of bookings decreases as the step number increases.

## Step 2: Train machine learning models


### Naive approach
First, I use all the features, i.e., 
feature_columns = ['referer_code', 'is_app', 'agent_id', 'traffic_type', 'action_id', 'reference', 'step']. 

Note: although 8062 of the 23402 users in the target set also appear in the train set, I suspect that using 'user_id' as a feature would overfit the model.
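
Since the task asks for one prediction per session while the merged frame has one row per action, one possible preparation step is to aggregate the action log into session-level features before training. The sketch below is an assumption about how this could be done, not necessarily the approach taken in this notebook; the toy frame and the feature names `n_actions`, `n_unique_actions`, and `max_step` are hypothetical.

```python
import pandas as pd

# Hypothetical toy action log; on the real data this would be train_action_df
actions = pd.DataFrame({
    'session_id': [10, 10, 10, 12, 12],
    'action_id':  [2116, 6999, 2116, 8001, 2503],
    'step':       [1, 2, 3, 1, 2],
})

# One row per session: number of actions, number of distinct action types,
# and the last step reached
session_features = actions.groupby('session_id').agg(
    n_actions=('action_id', 'size'),
    n_unique_actions=('action_id', 'nunique'),
    max_step=('step', 'max'),
).reset_index()
print(session_features)
```

The resulting frame could then be joined to the booking data on `session_id`, so that each training example corresponds to exactly one session.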