
# Trivago dataset EDA

## Mounting Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/My Drive/Trivago/

/content/drive/My Drive/Trivago


## Loading Libraries & Datasets

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [5]:
item_metadata_filepath = './Datasets/raw_data/item_metadata.csv'
submission_popular_filepath = './Datasets/raw_data/submission_popular.csv'
train_filepath = './Datasets/raw_data/train.csv'
test_filepath = './Datasets/raw_data/test.csv'

submission_popular = pd.read_csv(submission_popular_filepath)
item_metadata = pd.read_csv(item_metadata_filepath)
train = pd.read_csv(train_filepath)
test = pd.read_csv(test_filepath)

## Understanding Different Datasets

### train

In [9]:
train.shape

(15932992, 12)

In [10]:
train.dtypes

user_id            object
session_id         object
timestamp           int64
step                int64
action_type        object
reference          object
platform           object
city               object
device             object
current_filters    object
impressions        object
prices             object
dtype: object

In [13]:
train.tail()

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
15932987,ZYNMLE3MV3LK,62728015bec05,1541544490,15,interaction item image,6617798,PT,"Paris, France",desktop,,,
15932988,ZYNMLE3MV3LK,62728015bec05,1541544491,16,clickout item,6617798,PT,"Paris, France",desktop,Focus on Distance,6617798|1263420|9567886|1161323|149768|1890735...,58|96|55|75|90|60|233|104|150|145|328|207|150|...
15932989,ZYNMLE3MV3LK,62728015bec05,1541544540,17,clickout item,2712342,PT,"Paris, France",desktop,Focus on Distance,6617798|1263420|9567886|1161323|149768|1890735...,58|96|55|75|90|60|233|104|150|145|328|207|150|...
15932990,ZYNMLE3MV3LK,62728015bec05,1541544967,18,change of sort order,interaction sort button,PT,"Paris, France",desktop,,,
15932991,ZYNMLE3MV3LK,62728015bec05,1541544973,19,clickout item,1161323,PT,"Paris, France",desktop,Focus on Distance,6617798|1263420|9567886|1161323|149768|1890735...,58|96|55|75|90|60|233|104|150|145|328|207|150|...


The final clickout of each session, recorded in the action_type attribute as 'clickout item', is the most important step for the click-through rate, and it is what Trivago focuses on when calculating profit. A session can contain zero, one, or many clickouts, though. All engineered features should therefore serve the prediction of the final clickout.  
Each user has an id and can have one or more separate sessions. A timestamp is recorded for every step the user takes on the website or in the app, and steps are numbered within each session, starting at one and ending when the user leaves the session.  
A step can be anything from checking a rating to viewing an image to changing the sort order of the list, along with other actions related to items (accommodations).  
Accommodations are identified by ids shown in the reference attribute. They are displayed to the user as a list that can contain anywhere from one to 25 items.  
The displayed items are stored in the impressions attribute as a pipe-separated string, and the prices attribute holds the matching prices in the same order, also pipe-separated. (These two attributes have a value only when the action_type attribute is 'clickout item'.)  
The platform attribute contains the location from which the user is accessing the website or the app, the city attribute is the location where they are looking for accommodation, and device shows which device they are using.  
The current_filters attribute shows the filters the user has applied in their search for a suitable accommodation.
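
The structure described above can be sketched on a tiny hand-made frame. The rows below are illustrative and not taken from the dataset; only the column names match `train`. The helper names `last_clickouts` and `pair_impressions` are hypothetical, and this is one possible way to isolate the final clickout and align impressions with prices, not necessarily how later feature engineering is done.

```python
import pandas as pd

# Toy rows that mimic the train schema; the values are illustrative only.
toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2"],
    "step": [1, 2, 3, 1],
    "action_type": ["clickout item", "interaction item image",
                    "clickout item", "search for destination"],
    "reference": ["101", "102", "103", "Paris, France"],
    "impressions": ["101|102|103", None, "101|102|103", None],
    "prices": ["58|96|55", None, "58|96|55", None],
})


def last_clickouts(df):
    """Return the final 'clickout item' row of each session (hypothetical helper)."""
    clickouts = df[df["action_type"] == "clickout item"]
    # Sort by step so that tail(1) really is the last clickout of the session.
    ordered = clickouts.sort_values(["session_id", "step"])
    return ordered.groupby("session_id").tail(1)


def pair_impressions(row):
    """Pair each displayed item id with its price (hypothetical helper)."""
    items = row["impressions"].split("|")
    prices = [int(p) for p in row["prices"].split("|")]
    return list(zip(items, prices))


final = last_clickouts(toy)          # session s2 has no clickout, so it drops out
pairs = pair_impressions(final.iloc[0])
```

Sessions without any clickout simply do not appear in `final`, which matches the remark that a session can contain zero clickouts.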

### test

In [17]:
test.shape

(3782335, 12)

In [18]:
test.dtypes

user_id            object
session_id         object
timestamp           int64
step                int64
action_type        object
reference          object
platform           object
city               object
device             object
current_filters    object
impressions        object
prices             object
dtype: object

In [22]:
test.tail(1)

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
3782334,ZZCM39YKI3NR,6226bde1465e7,1541601178,1,clickout item,,IT,"Dublin, Ireland",mobile,,46149|109974|46119|8333280|12455|1185556|84002...,138|138|156|153|128|202|145|137|105|68|133|167...


The NaN value shown in reference is the hidden label. It is what needs to be predicted; put another way, the candidate items need to be returned as a ranked list, and the recommendation is considered better if this item is at the top of the list provided to the user.
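
As a rough sketch of how the rows to predict could be located, the snippet below filters a toy frame for clickouts whose reference is missing. The frame `toy_test` and its values are made up for illustration; only the column names follow `test`.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the test schema; the values are illustrative only.
toy_test = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "step": [1, 2, 1],
    "action_type": ["clickout item", "clickout item", "clickout item"],
    "reference": ["46149", np.nan, np.nan],
    "impressions": ["46149|109974", "46149|109974", "12455|84002"],
})

# Rows to predict: clickouts whose reference has been hidden (NaN).
is_target = (toy_test["action_type"] == "clickout item") & toy_test["reference"].isna()
targets = toy_test[is_target]
```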

### submission_popular

This file shows the format in which the test set predictions should be submitted.

In [6]:
submission_popular.head(2)

Unnamed: 0,user_id,session_id,timestamp,step,item_recommendations
0,000324D9BBUC,89643988fdbfb,1541593942,10,924795 106315 1033140 119494 101758 903037 105...
1,0004Q49X39PY,9de47d9a66494,1541641157,1,3505150 3812004 2227896 2292254 3184842 222702...


Take one session as an example to see how the submission_popular dataset relates to the test set.

In [7]:
test[test.session_id=='9de47d9a66494']

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
1030763,0004Q49X39PY,9de47d9a66494,1541641157,1,clickout item,,PH,"Iloilo City, Philippines",mobile,,2213014|3184842|10213134|4504242|4486372|38120...,53|40|112|57|76|29|42|37|66|66|26|43|28|46|28|...


If we compare the impressions of this session in the test set with the item_recommendations in submission_popular, we find that they contain the same items; only the order and the separator differ.

In [30]:
test_impressions = test[test.session_id=='9de47d9a66494'].impressions.values[0].split('|')
item_recommendations = submission_popular[submission_popular.session_id=='9de47d9a66494'].item_recommendations.values[0].split(' ')

sorted(test_impressions) == sorted(item_recommendations)

True
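
Since a submission row is the session's impressions reordered and joined with spaces, the conversion can be sketched as below. The function `to_submission_string` and the `scores` argument are hypothetical: any ranking signal could stand in for the scores, and this is not a description of how submission_popular itself was produced.

```python
def to_submission_string(impressions, scores):
    """Reorder pipe-separated impressions by descending score and join with spaces.

    Hypothetical helper: `scores` holds one number per impression,
    where a higher score means the item should be ranked closer to the top.
    """
    items = impressions.split("|")
    if len(items) != len(scores):
        raise ValueError("expected one score per impression")
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return " ".join(item for item, _ in ranked)


# Illustrative scores for three item ids.
example = to_submission_string("2213014|3184842|10213134", [0.1, 0.7, 0.2])
```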

### item_metadata

In [None]:
item_metadata.head()

Unnamed: 0,item_id,properties
0,5101,Satellite TV|Golf Course|Airport Shuttle|Cosme...
1,5416,Satellite TV|Cosmetic Mirror|Safe (Hotel)|Tele...
2,5834,Satellite TV|Cosmetic Mirror|Safe (Hotel)|Tele...
3,5910,Satellite TV|Sailing|Cosmetic Mirror|Telephone...
4,6066,Satellite TV|Sailing|Diving|Cosmetic Mirror|Sa...


The item_metadata set contains the properties of each item id (the ids that appear in the reference attribute of the other datasets). The properties are also stored as a pipe-separated string.
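
One possible way to make the pipe-separated properties usable as features is to expand them into one 0/1 column per property. The sketch below does this on a toy frame with pandas' `Series.str.get_dummies`. The frame `toy_items` is illustrative, and its property strings are shortened to two entries each.

```python
import pandas as pd

# Toy rows that mimic the item_metadata schema; property lists are shortened.
toy_items = pd.DataFrame({
    "item_id": [5101, 5416],
    "properties": ["Satellite TV|Golf Course", "Satellite TV|Cosmetic Mirror"],
})

# One indicator column per distinct property, split on the pipe separator.
property_flags = toy_items["properties"].str.get_dummies(sep="|")
property_flags.insert(0, "item_id", toy_items["item_id"])
```

On the full dataset this produces a wide, sparse table, so a sparse representation may be preferable there.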