# Trivago dataset EDA

## Loading Libraries & Datasets

In [1]:
import math
import random
import re
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
item_metadata_filepath = '../raw_data/item_metadata.csv'
submission_popular_filepath = '../raw_data/submission_popular.csv'
train_filepath = '../raw_data/train.csv'
test_filepath = '../raw_data/test.csv'

In [3]:
submission_popular = pd.read_csv(submission_popular_filepath)
item_metadata = pd.read_csv(item_metadata_filepath)
train = pd.read_csv(train_filepath)
test = pd.read_csv(test_filepath)

## Understanding the Different Datasets

### test

In [7]:
test[test['session_id'] == '1d688ec168932']

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
0,004A07DM0IDW,1d688ec168932,1541555614,1,interaction item image,2059240.0,CO,"Santa Marta, Colombia",mobile,,,
1,004A07DM0IDW,1d688ec168932,1541555614,2,interaction item image,2059240.0,CO,"Santa Marta, Colombia",mobile,,,
2,004A07DM0IDW,1d688ec168932,1541555696,3,clickout item,1050068.0,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...
3,004A07DM0IDW,1d688ec168932,1541555707,4,clickout item,1050068.0,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...
4,004A07DM0IDW,1d688ec168932,1541555717,5,clickout item,1050068.0,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...
5,004A07DM0IDW,1d688ec168932,1541555792,6,clickout item,3241426.0,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...
6,004A07DM0IDW,1d688ec168932,1541555799,7,clickout item,,CO,"Santa Marta, Colombia",mobile,,2059240|2033381|1724779|127131|399441|103357|1...,70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...


The reference value in the last row has been removed: it is the label against which the prediction is evaluated.
The prediction must be a ranking of the items in the impressions list, and submissions are scored with mean reciprocal rank (MRR).

**Example:**


**query 1:**

impressions = [100, 101, 102, 103, 104, 105]

clicked_item_id = 102

submission = [101, 103, 104, 102, 105, 100]

reciprocal rank = 0.25




**query 2:**

impressions = [101, 103, 104, 100, 105]

clicked_item_id = 105

submission = [103, 105, 101, 100, 104]

reciprocal rank = 0.5

mrr = (0.25 + 0.5) / 2 = 0.375
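
The two queries above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical, and returning 0.0 when the clicked item is missing from the submission is an assumption, not something the dataset description states.

```python
def reciprocal_rank(submission, clicked_item_id):
    """Return 1 / rank of the clicked item in the submitted ranking."""
    for rank, item_id in enumerate(submission, start=1):
        if item_id == clicked_item_id:
            return 1.0 / rank
    # Assumption: a clicked item missing from the submission scores 0.
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (submission, clicked_item_id) pairs."""
    ranks = [reciprocal_rank(sub, clicked) for sub, clicked in queries]
    return sum(ranks) / len(ranks)


queries = [
    ([101, 103, 104, 102, 105, 100], 102),  # query 1: clicked item at rank 4
    ([103, 105, 101, 100, 104], 105),       # query 2: clicked item at rank 2
]
mrr = mean_reciprocal_rank(queries)
print(mrr)  # 0.375
```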

In [15]:
print('Number of unique sessions', test.session_id.nunique(), '.\nNumber of unique sessions that have a clickout',
      test[test.action_type=='clickout item'].session_id.nunique(),'.')

Number of unique sessions 291381 .
Number of unique sessions that have a clickout 275679 .


Not every session in the test set contains a clickout: 291381 - 275679 = 15702 sessions have none. Because the label is the reference of a clickout item action, these sessions have nothing to predict and should be removed from the datasets. A dedicated function should handle sessions without any clickout item in the action_type attribute.
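
One possible shape for such a function is sketched below. The name `drop_sessions_without_clickout` and the toy dataframe are made up for illustration; only the `session_id` and `action_type` columns and the `'clickout item'` value come from the data shown above.

```python
import pandas as pd


def drop_sessions_without_clickout(df):
    """Keep only the rows of sessions that contain at least one clickout."""
    is_clickout = df["action_type"] == "clickout item"
    clickout_sessions = df.loc[is_clickout, "session_id"].unique()
    return df[df["session_id"].isin(clickout_sessions)]


# Toy data (illustrative only): session "b" never reaches a clickout.
toy = pd.DataFrame({
    "session_id": ["a", "a", "b", "c", "c"],
    "action_type": [
        "interaction item image",
        "clickout item",
        "search for poi",
        "clickout item",
        "filter selection",
    ],
})

filtered = drop_sessions_without_clickout(toy)
print(filtered["session_id"].unique().tolist())  # ['a', 'c']
```

All rows of a kept session are retained, not just its clickout rows, so the earlier actions remain available for feature engineering.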

### submission_popular

This is the format in which the test set predictions should be submitted.

In [5]:
submission_popular.head()

Unnamed: 0,user_id,session_id,timestamp,step,item_recommendations
0,000324D9BBUC,89643988fdbfb,1541593942,10,924795 106315 1033140 119494 101758 903037 105...
1,0004Q49X39PY,9de47d9a66494,1541641157,1,3505150 3812004 2227896 2292254 3184842 222702...
2,0004Q49X39PY,beea5c27030cb,1541561202,1,4476010 3505150 3812004 2227896 2292254 222702...
3,00071784XQ6B,9617600e1ba7c,1541630328,2,22854 3067559 22721 22713 16121 22772 22727 22...
4,0008BO33KUQ0,2d0e2102ee0dc,1541636411,6,9857656 5849628 655716 1352530 502066 1405084 ...


In [32]:
submission_popular.dtypes

user_id                 object
session_id              object
timestamp                int64
step                     int64
item_recommendations    object
dtype: object
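
As the sample rows show, item_recommendations is a single string of space-separated item ids, whereas impressions and prices in the session data are pipe-separated. The sketch below converts one into the other. Ranking by ascending price is only a placeholder heuristic chosen for illustration, and the function names are hypothetical.

```python
def rank_by_price(impressions, prices):
    """Order the displayed items from cheapest to most expensive."""
    items = impressions.split("|")
    item_prices = [int(p) for p in prices.split("|")]
    # sorted() is stable, so equally priced items keep their display order.
    order = sorted(range(len(items)), key=lambda i: item_prices[i])
    return [items[i] for i in order]


def format_recommendations(ranked_item_ids):
    """Join ranked item ids into the space-separated submission string."""
    return " ".join(str(item_id) for item_id in ranked_item_ids)


# Values taken from a four-item clickout row shown later in this notebook.
impressions = "2035675|4095738|4933410|9167996"
prices = "225|23|58|53"

item_recommendations = format_recommendations(rank_by_price(impressions, prices))
print(item_recommendations)  # 4095738 9167996 4933410 2035675
```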

### item_metadata

In [16]:
item_metadata.head()

Unnamed: 0,item_id,properties
0,5101,Satellite TV|Golf Course|Airport Shuttle|Cosme...
1,5416,Satellite TV|Cosmetic Mirror|Safe (Hotel)|Tele...
2,5834,Satellite TV|Cosmetic Mirror|Safe (Hotel)|Tele...
3,5910,Satellite TV|Sailing|Cosmetic Mirror|Telephone...
4,6066,Satellite TV|Sailing|Diving|Cosmetic Mirror|Sa...


In [14]:
#making sure there are no duplicates
item_metadata.nunique(), len(item_metadata)

(item_id       927142
 properties    566835
 dtype: int64, 927142)

Each item referenced in the other datasets (in the reference and impressions attributes) is identified by a number; the properties of these items are listed here.

In [33]:
item_metadata.dtypes

item_id        int64
properties    object
dtype: object

The number of properties per item can be added as a feature.
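
A minimal sketch of that feature, assuming the properties column stays a pipe-separated string as in the head() output above. The toy dataframe and the column name `n_properties` are hypothetical.

```python
import pandas as pd

# Toy stand-in for item_metadata (illustrative values only).
item_metadata_toy = pd.DataFrame({
    "item_id": [1, 2, 3],
    "properties": [
        "Satellite TV|Golf Course|Airport Shuttle",
        "Satellite TV",
        "Sailing|Diving",
    ],
})

# Split each string on "|" and count the resulting pieces.
item_metadata_toy["n_properties"] = (
    item_metadata_toy["properties"].str.split("|").str.len()
)
print(item_metadata_toy["n_properties"].tolist())  # [3, 1, 2]
```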

### train

In [4]:
train.tail(5)

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
15932987,ZYNMLE3MV3LK,62728015bec05,1541544490,15,interaction item image,6617798,PT,"Paris, France",desktop,,,
15932988,ZYNMLE3MV3LK,62728015bec05,1541544491,16,clickout item,6617798,PT,"Paris, France",desktop,Focus on Distance,6617798|1263420|9567886|1161323|149768|1890735...,58|96|55|75|90|60|233|104|150|145|328|207|150|...
15932989,ZYNMLE3MV3LK,62728015bec05,1541544540,17,clickout item,2712342,PT,"Paris, France",desktop,Focus on Distance,6617798|1263420|9567886|1161323|149768|1890735...,58|96|55|75|90|60|233|104|150|145|328|207|150|...
15932990,ZYNMLE3MV3LK,62728015bec05,1541544967,18,change of sort order,interaction sort button,PT,"Paris, France",desktop,,,
15932991,ZYNMLE3MV3LK,62728015bec05,1541544973,19,clickout item,1161323,PT,"Paris, France",desktop,Focus on Distance,6617798|1263420|9567886|1161323|149768|1890735...,58|96|55|75|90|60|233|104|150|145|328|207|150|...


In [9]:
train.dtypes

user_id            object
session_id         object
timestamp           int64
step                int64
action_type        object
reference          object
platform           object
city               object
device             object
current_filters    object
impressions        object
prices             object
dtype: object

In [6]:
len(train.columns), train.columns, len(train)

(12, Index(['user_id', 'session_id', 'timestamp', 'step', 'action_type',
        'reference', 'platform', 'city', 'device', 'current_filters',
        'impressions', 'prices'],
       dtype='object'))

In [35]:
#checking which attributes have any NaN
for attribute in train.columns:
  if train[attribute].isna().any():
    print(attribute)

current_filters
impressions
prices


In [12]:
len(train.action_type.unique()), train.action_type.unique()

(10, array(['search for poi', 'interaction item image', 'clickout item',
        'interaction item info', 'interaction item deals',
        'search for destination', 'filter selection',
        'interaction item rating', 'search for item',
        'change of sort order'], dtype=object))

In [36]:
#checking which action_type values remain once rows with NaN in the above attributes are dropped
train.dropna().action_type.unique()

array(['clickout item'], dtype=object)

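Rows in which prices, current_filters, and impressions are all non-NaN occur only when action_type is clickout item. The converse does not hold for current_filters: several clickout rows in the samples shown in this notebook have it empty.
A clickout item action means that the user clicked through to view the item on the item's website.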
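
For a clickout row, the impressions and prices strings can be unpacked to see where the clicked item sat in the displayed list and what it cost. The sketch below is illustrative: `clicked_position` is a hypothetical helper, and returning None when the reference is not among the impressions is an assumption.

```python
def clicked_position(reference, impressions, prices):
    """Return (1-based display position, price) of the clicked item."""
    items = impressions.split("|")
    price_list = [int(p) for p in prices.split("|")]
    if reference not in items:
        # Assumption: signal a missing reference with None.
        return None
    idx = items.index(reference)
    return idx + 1, price_list[idx]


# Values from a four-item clickout row shown later in this notebook.
result = clicked_position(
    "9167996",
    "2035675|4095738|4933410|9167996",
    "225|23|58|53",
)
print(result)  # (4, 53)
```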

In [37]:
#checking unique numbers of session_id and user_id
train.session_id.nunique(), train.user_id.nunique()

(910683, 730803)

## Creating validation and test sets

Defining a function that splits a subset of sessions off a dataset

In [4]:
def CreateSubSet(dataset, ratio):
    '''
    Desc: splits a smaller set of sessions off the main dataset
    
    Input: dataset: Pandas Dataframe to extract the smaller dataframe from
           ratio: float between 0 and 1, target share of unique sessions to extract
           
    Output: SubsetDF: Pandas Dataframe with the rows of the selected sessions
            main: Pandas Dataframe with the remaining rows of the main dataset
    '''
    #unique sessions list
    UniqueSessions = dataset.session_id.unique().tolist()

    #target number of unique sessions the subset should have
    NUniqueSessionsVal = round(len(UniqueSessions) * ratio)
    print('Number of unique sessions in validation set', NUniqueSessionsVal, '.')

    #set seed for reproducibility
    random.seed(1)

    #randomly selecting session_ids; random.choices samples with replacement, so after
    #de-duplication the subset can hold fewer sessions than the number printed above
    SubsetID = list(set(random.choices(UniqueSessions, k=NUniqueSessionsVal)))

    #creating dataframe for the subset
    SubsetDF = dataset[dataset.session_id.isin(SubsetID)]
    
    #dropping the subset rows from the main dataset
    main = dataset.drop(index=SubsetDF.index)
    
    return SubsetDF, main
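
CreateSubSet draws session ids with random.choices, which samples with replacement, so after duplicates are removed the subset can contain fewer sessions than the printed target. If an exact subset size is wanted, random.sample (sampling without replacement) is one alternative. The comparison below is a sketch on made-up session ids, not a change to the splits produced in this notebook.

```python
import random

# Made-up session ids standing in for dataset.session_id.unique().tolist().
session_ids = [f"s{i}" for i in range(1000)]
k = round(len(session_ids) * 0.2)  # target of 200 sessions

random.seed(1)
# With replacement: the same id can be drawn more than once.
with_replacement = set(random.choices(session_ids, k=k))

random.seed(1)
# Without replacement: exactly k distinct ids.
without_replacement = random.sample(session_ids, k)

print(k, len(with_replacement), len(without_replacement))
```

The second count printed is at most k and the third is always exactly k.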

## Validation set

Creating a validation set.

In [5]:
val, train = CreateSubSet(train, 0.2)
val.head()

Number of unique sessions in validation set 182137 .


Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
278,0L2TX0JNYVQ6,06e7c29170946,1541041830,1,search for poi,Seoul Station,HK,"Seoul, South Korea",desktop,,,
279,0L2TX0JNYVQ6,06e7c29170946,1541041870,2,clickout item,10091602,HK,"Seoul, South Korea",desktop,,2802232|2733571|5477718|155374|155465|3549258|...,124|176|99|220|191|127|85|54|83|268|78|144|96|...
280,0L2TX0JNYVQ6,06e7c29170946,1541041882,3,interaction item deals,10091602,HK,"Seoul, South Korea",desktop,,,
281,0L2TX0JNYVQ6,06e7c29170946,1541044143,4,search for poi,Myeongdong,HK,"Seoul, South Korea",desktop,,,
282,0L2TX0JNYVQ6,06e7c29170946,1541044151,5,clickout item,10091602,HK,"Seoul, South Korea",desktop,,3549258|155465|155374|363046|3954788|4773608|3...,135|189|219|78|74|135|95|85|176|99|108|83|87|3...


## Test set

Creating a new test set from the remaining training data (this overwrites the test dataframe loaded earlier).

In [6]:
test, train = CreateSubSet(train, 0.2)
test.head()

Number of unique sessions in validation set 149151 .


Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
239,0IVOT7X0FJWE,554392be66854,1541044445,1,interaction item rating,9167996,BR,"Parnaíba, Brazil",mobile,,,
240,0IVOT7X0FJWE,554392be66854,1541044448,2,clickout item,9167996,BR,"Parnaíba, Brazil",mobile,,2035675|4095738|4933410|9167996,225|23|58|53
241,0IVOT7X0FJWE,554392be66854,1541044519,3,interaction item image,9167996,BR,"Parnaíba, Brazil",mobile,,,
242,0IVOT7X0FJWE,554392be66854,1541044519,4,interaction item image,9167996,BR,"Parnaíba, Brazil",mobile,,,
243,0IVOT7X0FJWE,554392be66854,1541044529,5,interaction item image,9167996,BR,"Parnaíba, Brazil",mobile,,,


## Trainlet set

Since the training set is already large enough, a sample from it is sufficient for trying out the feature engineering.
With the ratio of 0.01 passed to CreateSubSet, the sample targets just 1% of the remaining training sessions.
Creating a sample set

In [7]:
trainlet, train = CreateSubSet(train, 0.01)
trainlet.head()

Number of unique sessions in validation set 6106 .


Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
3297,6XWLPIVUWQE8,27d290deac485,1541106705,1,interaction item image,1488185,BG,"Ribarica, Bulgaria",mobile,,,
3298,6XWLPIVUWQE8,27d290deac485,1541106705,2,interaction item image,1488185,BG,"Ribarica, Bulgaria",mobile,,,
3299,6XWLPIVUWQE8,27d290deac485,1541106710,3,clickout item,1488185,BG,"Ribarica, Bulgaria",mobile,,1488185|1241074|1474073|4548038|4130402|733811...,40|35|33|29|26|34|24|35|27|129
3300,6XWLPIVUWQE8,27d290deac485,1541106732,4,interaction item image,1488185,BG,"Ribarica, Bulgaria",mobile,,,
3301,6XWLPIVUWQE8,27d290deac485,1541106732,5,interaction item image,1488185,BG,"Ribarica, Bulgaria",mobile,,,
