<a href="https://colab.research.google.com/github/ZeyadSabbah/TrivagoRecommenderSystem/blob/master/TrivagoFeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering

## Mounting to Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem

/content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem


## Loading Libraries & Datasets

In [0]:
!pip install cudf-cuda100
!cp /usr/local/lib/python3.6/dist-packages/librmm.so .
import os  
os.environ['NUMBAPRO_NVVM']='/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so'  
os.environ['NUMBAPRO_LIBDEVICE']='/usr/local/cuda-10.0/nvvm/libdevice'

In [0]:
# import cudf 
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import math
import matplotlib.pyplot as plt
import re
import random
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
item_metadata_filepath = '../../Datasets/raw_data/item_metadata.csv'
item_metadata = pd.read_csv(item_metadata_filepath)

train_filepath = '../../Datasets/clean_data/train.csv'
train = pd.read_csv(train_filepath)

## Important Dataframes Preparation

To engineer global features from the train dataset, which mostly revolve around either clickouts or final clickouts (instances), a dataframe is created for each of the two.

To get representative global values for the items, duplicated impressions lists must be removed. Whole rows are never exact duplicates, because attributes such as timestamp, reference, and step differ, so duplicates are identified by session_id and impressions only.

In [0]:
train.drop('Unnamed: 0', axis=1, inplace=True)
ClickoutDF = train[train.action_type=='clickout item']
FinalClickoutDF = train[train.action_type=='clickout item'].groupby('session_id').tail(1)
ClickoutUniqueDF = ClickoutDF.drop_duplicates(subset=['session_id', 'impressions'], keep='first')
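
As a minimal sketch of what the cell above produces, the same three steps can be run on a tiny made-up log. The frame `toy_log` and its values are hypothetical and only illustrate the row counts; they are not taken from the train set.

```python
import pandas as pd

# Hypothetical toy log: session s1 has two clickouts on the same impressions list.
toy_log = pd.DataFrame({
    'session_id':  ['s1', 's1', 's1', 's2'],
    'action_type': ['clickout item'] * 4,
    'reference':   ['10', '20', '30', '10'],
    'impressions': ['10|20|30', '10|20|30', '30|40', '10|50'],
})

# Last clickout of every session.
toy_final = toy_log.groupby('session_id').tail(1)

# One row per distinct impressions list within a session.
toy_unique = toy_log.drop_duplicates(subset=['session_id', 'impressions'], keep='first')

print(len(toy_log), len(toy_final), len(toy_unique))  # 4 2 3
```

The repeated impressions list in session s1 is counted once in `toy_unique`, which is what keeps the impression-based counts further down from being inflated.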

## Item Global Features
### Number of Properties

In [0]:
item_metadata.properties = item_metadata.properties.apply(lambda x: x.split('|'))
item_metadata['NumberOfProperties'] = item_metadata.properties.apply(lambda x: len(x))
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46


In [0]:
#getting total number of unique properties across all items
AllPropertiesList = item_metadata.properties.tolist()

AllPropertiesFlatList = []
for sublist in AllPropertiesList:
    for item in sublist:
        AllPropertiesFlatList.append(item)
        
print('Number of unique properties is', len(set(AllPropertiesFlatList)))

Number of unique properties is 157


### Items Properties Similarities
The purpose is to get the cosine similarity between items. The items of an impressions list (at most 25) can be extracted as a dataframe and their cosine similarity computed, so that the items most similar to the ones the user interacted with can be placed at the top of the list.  

In [0]:
item_metadata.properties = item_metadata.properties.apply(lambda x: tuple(x))

one_hot = MultiLabelBinarizer()

properties_encoded = one_hot.fit_transform(item_metadata.properties.values.tolist())

properties_encodedDF = pd.DataFrame(properties_encoded)

#changing column names
properties_list = one_hot.classes_.tolist()
properties_encodedDF.columns = properties_list

#creating a column of the item id to get the similarity between items
item_metadata.item_id = item_metadata.item_id.apply(lambda x: str(x))
properties_encodedDF['item_id'] = item_metadata.item_id

properties_encodedDF.head()

Unnamed: 0,1 Star,2 Star,3 Star,4 Star,5 Star,Accessible Hotel,Accessible Parking,Adults Only,Air Conditioning,Airport Hotel,...,Terrace (Hotel),Theme Hotel,Towels,Very Good Rating,Volleyball,Washing Machine,Water Slide,Wheelchair Accessible,WiFi (Public Areas),WiFi (Rooms)
0,0,0,0,1,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,1,1
1,0,0,0,1,0,0,0,0,0,0,...,1,0,0,1,0,0,0,1,1,1
2,0,0,1,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,1,1
3,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
4,0,0,0,1,0,1,1,0,0,0,...,1,0,1,0,1,0,0,1,1,1


In [0]:
cosine_similarity(properties_encodedDF.set_index('item_id').iloc[0:25])[0]

array([1.        , 0.69283044, 0.6024145 , 0.69419307, 0.74385528,
       0.62604751, 0.7220817 , 0.75202407, 0.72586619, 0.77491695,
       0.68469194, 0.56796183, 0.65036141, 0.7573052 , 0.68861713,
       0.73636183, 0.5528638 , 0.66628253, 0.65991202, 0.68250015,
       0.74691014, 0.756971  , 0.57473697, 0.61833711, 0.5265603 ])

How this similarity is used in the implementation still needs to be discussed.
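
One possible way to use the similarity, sketched here with NumPy only, is to order the items of a list by their cosine similarity to an item the user interacted with. The vectors, the ids in `item_ids`, and the helper `rank_by_similarity` are hypothetical, and the sketch assumes every item has at least one property so that no norm is zero.

```python
import numpy as np

# Hypothetical one-hot property vectors (rows: items, columns: properties).
item_ids = ['a', 'b', 'c', 'd']
props = np.array([
    [1, 1, 0, 1],   # a: the item the user interacted with
    [1, 1, 0, 0],   # b
    [0, 0, 1, 0],   # c
    [1, 0, 1, 1],   # d
], dtype=float)

def rank_by_similarity(query_idx, vectors):
    """Return the other row indices ordered from most to least cosine-similar to the query row."""
    norms = np.linalg.norm(vectors, axis=1)
    sims = vectors @ vectors[query_idx] / (norms * norms[query_idx])
    order = np.argsort(-sims, kind='stable')
    return [int(i) for i in order if i != query_idx], sims

order, sims = rank_by_similarity(0, props)
ranked_ids = [item_ids[i] for i in order]
print(ranked_ids)  # ['b', 'd', 'c']
```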

### Number of Times in Impressions
The purpose of this feature is to count how many times an item has been shown to users in an impressions list.

In [0]:
AllImpressionsList = ClickoutUniqueDF.impressions.apply(lambda x:x.split('|'))

AllImpressionsFlatList = []
for sublist in AllImpressionsList:
    for item in sublist:
        AllImpressionsFlatList.append(item)

InImpressionsCounter = Counter(AllImpressionsFlatList)
InImpressionsDF = pd.DataFrame.from_dict(InImpressionsCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberInImpressions'})
InImpressionsDF.head(2)

Unnamed: 0,item_id,NumberInImpressions
0,3400638,491
1,1253714,84


In [0]:
InImpressionsDF.item_id.nunique(), len(InImpressionsDF), item_metadata.item_id.nunique()

(815092, 815092, 927142)

The number of items in this dataframe is smaller than the number of items in item_metadata because some items never appeared in an impressions list.
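
The same count can also be obtained without an explicit loop. The following is a small sketch on made-up data, assuming a pandas version that provides `Series.explode` (the notebook already relies on it); `toy_impressions` is hypothetical.

```python
import pandas as pd

# Hypothetical impressions column with pipe-separated item ids.
toy_impressions = pd.Series(['1|2|3', '2|3', '3'])

toy_counts = (toy_impressions.str.split('|')   # one list of item ids per row
              .explode()                       # one row per shown item
              .value_counts()                  # occurrences per item id
              .rename_axis('item_id')
              .reset_index(name='NumberInImpressions'))
```

Items that never appear in any list are simply absent from `toy_counts`, which is why the left join that follows yields NaN for them.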

In [0]:
item_metadata.item_id = item_metadata.item_id.apply(lambda x: str(x))

#left joining
item_metadata = item_metadata.merge(InImpressionsDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0


### Number of Times in Reference
The purpose of this feature is to check how many times an item has been mentioned in the Reference attribute in the whole train set. 

In [0]:
InReferencesCounter = Counter(train.reference.values.tolist())
InReferencesDF = pd.DataFrame.from_dict(InReferencesCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberInReferences'})

InReferencesDF.head(2)

Unnamed: 0,item_id,NumberInReferences
0,Newtown,16
1,666856,26


In [0]:
#left joining
item_metadata = item_metadata.merge(InReferencesDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0


### Number of Times in Clickout
The purpose of this feature is to count how many times an item has been clicked out.

In [0]:
InClickoutCounter = Counter(ClickoutDF.reference.values.tolist())
InClickoutDF = pd.DataFrame.from_dict(InClickoutCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberAsClickout'})
InClickoutDF.head(2)

Unnamed: 0,item_id,NumberAsClickout
0,109038,53
1,1257342,20


In [0]:
#left joining
item_metadata = item_metadata.merge(InClickoutDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0


### Number of Times in Final Clickout
The purpose of this feature is to count how many times an item has been the final clickout of a session.

In [0]:
InFinalClickoutCounter = Counter(FinalClickoutDF.reference.values.tolist())
InFinalClickoutDF = pd.DataFrame.from_dict(InFinalClickoutCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberAsFinalClickout'})
InFinalClickoutDF.head(2)

Unnamed: 0,item_id,NumberAsFinalClickout
0,1257342,7
1,2795374,18


In [0]:
#left joining
item_metadata = item_metadata.merge(InFinalClickoutDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0,4.0
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0,2.0


The next step is to divide NumberAsFinalClickout by each of the other three counts to get relative final clickout rates.

### Final Clickout To Impressions
The purpose of this feature is to get the rate at which an item is the final clickout when it is listed to the user.

In [0]:
FClickoutToImpressions = item_metadata.NumberAsFinalClickout/item_metadata.NumberInImpressions
FClickoutToImpressions.head(2)

0    0.044944
1    0.031746
dtype: float64

In [0]:
#adding attribute
item_metadata['FClickoutToImpressions'] = FClickoutToImpressions
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0,4.0,0.044944
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0,2.0,0.031746


### Final Clickout To References
The purpose of this feature is to get the rate at which an item is the final clickout when it was interacted with.

In [0]:
FClickoutToReferences = item_metadata.NumberAsFinalClickout/item_metadata.NumberInReferences
FClickoutToReferences.head(2)

0    0.285714
1    0.046512
dtype: float64

In [0]:
#adding attribute
item_metadata['FClickoutToReferences'] = FClickoutToReferences
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0,4.0,0.044944,0.285714
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0,2.0,0.031746,0.046512


### Final Clickout To Clickout
The purpose of this feature is to get the share of an item's clickouts that were final clickouts.

In [0]:
FClickoutToClickout = item_metadata.NumberAsFinalClickout/item_metadata.NumberAsClickout
FClickoutToClickout.head(2)

0    0.571429
1    0.333333
dtype: float64

In [0]:
#adding attribute
item_metadata['FClickoutToClickout'] = FClickoutToClickout
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333
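
The three ratios share one pattern, so they can be sketched in a single loop. The frame `toy_items` below is hypothetical: its first two rows reuse the counts shown above, and the third row stands for an item that never appeared in the train set. Replacing a zero denominator with NaN and filling the result with 0 is an assumption of this sketch, in line with the zero fill applied to item_metadata at the end of this section.

```python
import numpy as np
import pandas as pd

toy_items = pd.DataFrame({
    'item_id': ['1', '2', '3'],
    'NumberInImpressions':   [89.0, 63.0, np.nan],
    'NumberInReferences':    [14.0, 43.0, np.nan],
    'NumberAsClickout':      [7.0, 6.0, np.nan],
    'NumberAsFinalClickout': [4.0, 2.0, np.nan],
})

# New feature name -> denominator column.
denominators = {
    'FClickoutToImpressions': 'NumberInImpressions',
    'FClickoutToReferences':  'NumberInReferences',
    'FClickoutToClickout':    'NumberAsClickout',
}
for new_col, denom in denominators.items():
    # A zero denominator becomes NaN so that the division never yields inf.
    ratio = toy_items['NumberAsFinalClickout'] / toy_items[denom].replace(0, np.nan)
    toy_items[new_col] = ratio.fillna(0)
```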


### Item's Average Rank
The purpose of this feature is to get the item's average position in the lists shown to users across the train set.  
Note that some sessions contain several clickouts with different references that share the same impressions list. Duplicated impressions lists within a session should therefore be dropped. (The same applies to price.)

In [0]:
# using All Clickout dataframe, but the one with the unique impressions for each session for this feature
ClickoutUniqueDF.head(1)

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
13,00RL8Z82B2Z1,aff3928535f48,1541037543,14,clickout item,109038,AU,"Sydney, Australia",mobile,,3400638|1253714|3367857|5100540|1088584|666916...,95|66|501|112|95|100|101|72|82|56|56|143|70|25...


In [0]:
SessionImpressionsDF = ClickoutUniqueDF[['session_id', 'impressions']].copy()
SessionImpressionsDF.impressions = SessionImpressionsDF.impressions.apply(lambda x: x.split('|'))
SessionImpressionsDF.impressions.head(2)


13    [3400638, 1253714, 3367857, 5100540, 1088584, ...
15    [55109, 129343, 54824, 2297972, 109014, 125734...
Name: impressions, dtype: object

In [0]:
SessionImpressionsDF_exploded = SessionImpressionsDF.explode('impressions')
SessionImpressionsDF_exploded = SessionImpressionsDF_exploded.rename(columns={'impressions':'item_id'})
SessionImpressionsDF_exploded.head(2)

Unnamed: 0,session_id,item_id
13,aff3928535f48,3400638
13,aff3928535f48,1253714


In [0]:
#creating a rank column
rank = SessionImpressionsDF.impressions.apply(lambda x: list(range(1, (len(x) + 1))))
SessionRankDF = pd.concat([SessionImpressionsDF.session_id, rank], axis=1)
SessionRankDF = SessionRankDF.rename(columns={'impressions':'rank'})
SessionRankDF_exploded = SessionRankDF.explode('rank')
SessionRankDF_exploded.head(2)

Unnamed: 0,session_id,rank
13,aff3928535f48,1
13,aff3928535f48,2


In [0]:
#creating dataframe that combines both items and rank
ItemRankDF = pd.DataFrame({'item_id':SessionImpressionsDF_exploded['item_id'].values.tolist(),
                            'rank':SessionRankDF_exploded['rank'].values.tolist()})
ItemRankDF.head(2)

Unnamed: 0,item_id,rank
0,3400638,1
1,1253714,2


In [0]:
#getting the mean of ranks for each item
ItemAverageRankRank = ItemRankDF.groupby('item_id', sort=False)['rank'].mean().to_frame().reset_index()
ItemAverageRankRank = ItemAverageRankRank.rename(columns={'rank':'MeanRank'})
ItemAverageRankRank.head(2)

In [0]:
#left joining
item_metadata = item_metadata.merge(ItemAverageRankRank, on='item_id', how='left')
item_metadata.head(2)
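
The average rank can also be built in one pass, so that item ids and ranks never have to be aligned across two separately exploded frames. This is a minimal sketch on hypothetical data (`toy_clickouts`), not the notebook's own implementation.

```python
import pandas as pd

toy_clickouts = pd.DataFrame({
    'session_id':  ['s1', 's2'],
    'impressions': ['10|20|30', '20|10'],
})

# One (item_id, rank) pair per displayed item; ranks start at 1.
rows = [
    (item_id, rank)
    for impressions in toy_clickouts['impressions']
    for rank, item_id in enumerate(impressions.split('|'), start=1)
]
toy_item_rank = pd.DataFrame(rows, columns=['item_id', 'rank'])

toy_mean_rank = (toy_item_rank.groupby('item_id', sort=False)['rank']
                 .mean()
                 .reset_index(name='MeanRank'))
```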

### Item's Average Price
The purpose of this feature is to get the item's average price across the train set.  
First, make sure that each impressions list has the same length as its prices list.

In [0]:
ImpressionsLength = ClickoutUniqueDF.impressions.apply(lambda x: x.split('|')).apply(lambda x: len(x))
PricesLength = ClickoutUniqueDF.prices.apply(lambda x: x.split('|')).apply(lambda x: len(x))
ImpressionsLength.equals(PricesLength)

True

In [0]:
SessionPricesDF = ClickoutUniqueDF[['session_id', 'prices']].copy()
SessionPricesDF.prices = SessionPricesDF.prices.apply(lambda x: x.split('|'))
SessionPricesDF_exploded = SessionPricesDF.explode('prices')
ItemPriceDF = pd.DataFrame({'item_id':SessionImpressionsDF_exploded.item_id.values.tolist(),
                            'price':SessionPricesDF_exploded.prices.values.tolist()})
ItemPriceDF.head(2)


Unnamed: 0,item_id,price
0,3400638,95
1,1253714,66


In [0]:
ItemPriceDF.price = ItemPriceDF.price.apply(lambda x: int(x))
ItemAveragePriceDF = ItemPriceDF.groupby('item_id', sort=False)['price'].mean().to_frame().reset_index()
ItemAveragePriceDF = ItemAveragePriceDF.rename(columns={'price':'MeanPrice'})
ItemAveragePriceDF.head(2)

Unnamed: 0,item_id,MeanPrice
0,3400638,192.154786
1,1253714,82.678571


In [0]:
item_metadata = item_metadata.merge(ItemAveragePriceDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanPrice
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,121.696629
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,102.873016


### Price Rank
The purpose of this feature is to get the average price rank of an item across the train set. The items are sorted not by the order shown to the user but by their prices, and the resulting price rank is averaged for each item across the whole train set. It shows where an item stands relative to its peers.  
The price rank feature can be engineered using the argsort function.

In [0]:
ClickoutDF.prices[13]

'95|66|501|112|95|100|101|72|82|56|56|143|70|25|71|162|73|143|188|118|77|131|143|49|165'

In [0]:
np.argsort(list(map(int, ClickoutDF.prices[13].split('|'))))

array([13, 23,  9, 10,  1, 12, 14,  7, 16, 20,  8,  0,  4,  5,  6,  3, 19,
       21, 17, 22, 11, 15, 24, 18,  2])
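
It may help to keep in mind that argsort returns the indices that would sort the prices, which is not the same as the rank of each position. A per-position rank can be obtained by applying argsort to that result once more. The sketch below uses a short hypothetical price list (`toy_prices`) and a stable sort so that ties are resolved by position.

```python
import numpy as np

toy_prices = np.array([95, 66, 501, 112, 25])

# Indices of the cheapest, second cheapest, ... item.
sort_order = np.argsort(toy_prices, kind='stable')
# 0-based price rank of the item at each position (inverse of the permutation above).
price_rank = np.argsort(sort_order, kind='stable')

print(sort_order.tolist())  # [4, 1, 0, 3, 2]
print(price_rank.tolist())  # [2, 1, 4, 3, 0]
```

Here the item at position 0 (price 95) is the third cheapest, so its rank is 2, while `sort_order[0]` is 4 because the cheapest item sits at position 4.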

In [0]:
SessionPricesDF['PriceRank'] = SessionPricesDF.prices.apply(lambda x: (list(map(int, x))))\
                                                     .apply(lambda x: np.argsort(x))
SessionPricesDF.head(2)



Unnamed: 0,session_id,prices,PriceRank
13,aff3928535f48,"[95, 66, 501, 112, 95, 100, 101, 72, 82, 56, 5...","[13, 23, 9, 10, 1, 12, 14, 7, 16, 20, 8, 0, 4,..."
15,aff3928535f48,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...","[8, 1, 15, 5, 12, 16, 20, 9, 4, 10, 23, 24, 6,..."


In [0]:
SessionPricesRankDF = SessionPricesDF[['session_id', 'PriceRank']].explode('PriceRank')
SessionPricesRankDF.head(2)

Unnamed: 0,session_id,PriceRank
13,aff3928535f48,13
13,aff3928535f48,23


In [0]:
ItemPriceRankDF = pd.DataFrame({'item_id':SessionImpressionsDF_exploded.item_id.values.tolist(),
                            'PriceRank':SessionPricesRankDF.PriceRank.values.tolist()})
ItemPriceRankDF.head(2)

Unnamed: 0,item_id,PriceRank
0,3400638,13
1,1253714,23


In [0]:
#getting the average price rank of each item
ItemAveragePriceRankDF = ItemPriceRankDF.groupby('item_id', sort=False).PriceRank.mean().to_frame().reset_index()
ItemAveragePriceRankDF = ItemAveragePriceRankDF.rename(columns={'PriceRank':'AveragePriceRank'})
ItemAveragePriceRankDF.head(2)

Unnamed: 0,item_id,AveragePriceRank
0,3400638,11.826884
1,1253714,11.309524


In [0]:
# left joining
item_metadata = item_metadata.merge(ItemAveragePriceRankDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanPrice,AveragePriceRank
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,121.696629,12.303371
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,102.873016,11.095238


In [0]:
#filling NaN values with zeros
item_metadata = item_metadata.fillna(0)

Unless similarity between items is derived from the properties, the properties feature should be dropped. The item_metadata dataframe was converted to a CSV file and saved in the clean_data folder.

## Item Local Features

### General Features

In [12]:
train.session_id.nunique()

745755

In [0]:
session_gb = train.groupby('session_id', sort=False)
session = session_gb.get_group('aff3928535f48')
session_exploded = session[session.action_type=='clickout item'].tail(1)[['impressions']]
session_exploded.impressions = session_exploded.impressions.apply(lambda x: x.split('|'))
session_exploded = session_exploded.explode('impressions')
session_exploded = session_exploded.rename(columns={'impressions':'item_id'})
session_exploded

In [0]:
session_explode = session[session.action_type=='clickout item'].tail(1)[['prices']]
session_explode.prices = session_explode.prices.apply(lambda x: x.split('|'))
session_explode = session_explode.explode('prices')
session_explode = session_explode.rename(columns={'prices':'price'})
session_explode.price = session_explode.price.apply(lambda x: int(x))
session_explode.head(2)

In [0]:
session_exploded['price'] = session_explode.price
session_exploded['rank'] = range(1, len(session_exploded) + 1)
session_exploded['price_rank'] = np.argsort(session_exploded['price'])
session_exploded.head(2)

In [98]:
clickout_item_id = session[session.action_type=='clickout item'].tail(1)['reference'].values[0]
clickout_itemDF = pd.DataFrame({'item_id':[clickout_item_id], 'clickout':[1]})
session_exploded = session_exploded.merge(clickout_itemDF, on='item_id', how='left').fillna(0)
session_exploded.head(7)

Unnamed: 0,item_id,price,rank,price_rank,clickout
0,55109,162,1,8,0.0
1,129343,25,2,1,0.0
2,54824,150,3,15,0.0
3,2297972,143,4,5,0.0
4,109014,101,5,12,0.0
5,1257342,49,6,16,1.0
6,1031578,118,7,20,0.0


The next step would be getting the duration of the whole session, so that we can get the duration for each item in the session.

In [99]:
session.timestamp.max() - session.t

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
0,00RL8Z82B2Z1,aff3928535f48,1541037460,1,search for poi,Newtown,AU,"Sydney, Australia",mobile,,,
1,00RL8Z82B2Z1,aff3928535f48,1541037522,2,interaction item image,666856,AU,"Sydney, Australia",mobile,,,
2,00RL8Z82B2Z1,aff3928535f48,1541037522,3,interaction item image,666856,AU,"Sydney, Australia",mobile,,,
3,00RL8Z82B2Z1,aff3928535f48,1541037532,4,interaction item image,666856,AU,"Sydney, Australia",mobile,,,
4,00RL8Z82B2Z1,aff3928535f48,1541037532,5,interaction item image,109038,AU,"Sydney, Australia",mobile,,,
5,00RL8Z82B2Z1,aff3928535f48,1541037532,6,interaction item image,666856,AU,"Sydney, Australia",mobile,,,
6,00RL8Z82B2Z1,aff3928535f48,1541037532,7,interaction item image,109038,AU,"Sydney, Australia",mobile,,,
7,00RL8Z82B2Z1,aff3928535f48,1541037532,8,interaction item image,666856,AU,"Sydney, Australia",mobile,,,
8,00RL8Z82B2Z1,aff3928535f48,1541037542,9,interaction item image,109038,AU,"Sydney, Australia",mobile,,,
9,00RL8Z82B2Z1,aff3928535f48,1541037542,10,interaction item image,109038,AU,"Sydney, Australia",mobile,,,
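
A minimal sketch of the session duration, assuming it is defined as the difference between the last and the first timestamp of the session. The frame `toy_session` is hypothetical and reuses a few timestamps of the session shown above.

```python
import pandas as pd

toy_session = pd.DataFrame({
    'session_id': ['s1'] * 4,
    'timestamp':  [1541037460, 1541037522, 1541037532, 1541037543],
    'reference':  ['Newtown', '666856', '109038', '109038'],
})

# Duration of the whole session in seconds.
session_seconds = toy_session['timestamp'].max() - toy_session['timestamp'].min()
print(session_seconds)  # 83
```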


In [91]:
pd.DataFrame({'item_id':[clickout_item_id], 'clickout':[1]})

Unnamed: 0,item_id,clickout
0,1257342,1


In [84]:
clickout_item = dict({session[session.action_type=='clickout item'].tail(1)['reference'].values[0]:1})

'1257342'

In [85]:
session_exploded

Unnamed: 0,item_id,price,rank,price_rank
15,55109,162,1,8
15,129343,25,2,1
15,54824,150,3,15
15,2297972,143,4,5
15,109014,101,5,12
15,1257342,49,6,16
15,1031578,118,7,20
15,109018,131,8,9
15,1332971,18,9,4
15,666916,100,10,10


In [58]:
session_explode['prices'].apply(lambda x: sorted(x))[15]

['100',
 '101',
 '101',
 '112',
 '118',
 '118',
 '123',
 '124',
 '131',
 '137',
 '138',
 '143',
 '143',
 '143',
 '150',
 '162',
 '18',
 '180',
 '188',
 '25',
 '36',
 '49',
 '51',
 '66',
 '94']

In [56]:
session_explode.explode('price_rank')

Unnamed: 0,prices,price_rank
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",9
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",4
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",10
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",23
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",24
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",6
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",14
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",18
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",7
15,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...",21


### Items Interacted With & Number of Interactions

In [0]:
session_trial = train[train.session_id=='aff3928535f48']
l = session_trial[session_trial.action_type=='clickout item'].tail(1).impressions.apply(lambda x: x.split('|')).explode()

In [0]:
train[train.session_id=='aff3928535f48'].reference.isin(l)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
Name: reference, dtype: bool

In [0]:
#loading train set
ReferenceTrain = pd.read_csv('/content/drive/My Drive/Trivago/Clean Dataset/train.csv')
ReferenceTrain = ReferenceTrain.drop(columns='Unnamed: 0')

#converting word values in reference into NaN
ReferenceTrain.reference = ReferenceTrain.reference.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna().astype(int).apply(lambda x: str(x))

#dropping NaN in reference
ReferenceTrain = ReferenceTrain.drop(ReferenceTrain[ReferenceTrain.reference.isna()].index.tolist())

#what I meant by the previous step
train.reference, ReferenceTrain.reference
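
The filtering step can be sketched on a few made-up values. `pd.to_numeric` also accepts a whole Series, so the element-wise apply is not strictly needed. The series `toy_reference` is hypothetical.

```python
import pandas as pd

toy_reference = pd.Series(['Newtown', '666856', '109038', 'Sydney, Australia'])

# Non-numeric references (POIs, cities, ...) become NaN.
numeric_reference = pd.to_numeric(toy_reference, errors='coerce')
is_item = numeric_reference.notna()

# Keep only item references and turn them back into strings.
item_reference = numeric_reference[is_item].astype(int).astype(str)
```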

In [0]:
ReferenceTimeSpent = (ReferenceTrain.groupby(['session_id', 'reference']).timestamp.max() 
                      - ReferenceTrain.groupby(['session_id', 'reference']).timestamp.min())
ReferenceTimeSpent
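
On a tiny hypothetical log (`toy_log`), the same subtraction gives the seconds between the first and the last interaction with each reference in each session. The sketch reuses one groupby object for both the max and the min.

```python
import pandas as pd

toy_log = pd.DataFrame({
    'session_id': ['s1', 's1', 's1', 's2', 's2'],
    'reference':  ['10', '10', '20', '10', '10'],
    'timestamp':  [100, 130, 140, 500, 505],
})

grouped = toy_log.groupby(['session_id', 'reference'], sort=False)['timestamp']
toy_time_spent = (grouped.max() - grouped.min()).reset_index(name='SecondsSpent')
```

A reference with a single interaction gets 0 seconds, as for item 20 in session s1.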

In [0]:
ReferenceTrain.to_csv('train_ref.csv')                  #DO NOT RUN BEFORE INVESTIGATING FOR THE BEST PRACTICES
!cp train_ref.csv '/content/drive/My Drive/Trivago/Clean Dataset/'

Taking a sample of the training set to perform trials

In [0]:
trainlet = train.iloc[0:10000]

#getting session_id with no clickouts
session_idNoClickouts = list(set(trainlet.session_id.unique()) - set(trainlet[trainlet.action_type == 'clickout item'].session_id.unique()))

#dropping sessions with no clickouts (errors='ignore' because 'Unnamed: 0' may already have been dropped from train)
trainlet = trainlet[~trainlet.session_id.isin(session_idNoClickouts)].drop(columns='Unnamed: 0', errors='ignore')

#working on a grouped sample that holds the count of steps for each reference that was interacted with
trainlet1 = trainlet.groupby(['session_id', 'reference'], sort=False).step.count().to_frame()

#converting the session_id index level into a column
trainlet1.reset_index(level=0, inplace=True)

#converting the reference index level into a column
trainlet1.reset_index(level=0, inplace=True)

#changing order of columns
trainlet1 = trainlet1[['session_id', 'reference','step']]

#taking a look
trainlet1.reference.head(10)

In [0]:
#converting alphabetic values in reference attribute into NaN by converting all values into numeric, then again converting values into string
trainlet1.reference = trainlet1.reference.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna().astype(int).apply(lambda x: str(x))

#taking a look
trainlet1.reference.head(10)

In [0]:
#for the sake of comparison creating another version of trainlet by repeating the previous steps
trainlet0 = trainlet.groupby(['session_id', 'reference'], sort=False).step.count().to_frame()
trainlet0.reset_index(level=0, inplace=True)
trainlet0.reset_index(level=0, inplace=True)
trainlet0 = trainlet0[['session_id', 'reference','step']]

#taking a look
trainlet0

In [0]:
#changing the name of attribute step into NumberOfInteractions
trainlet0 = trainlet0.rename(columns = {'step':'NumberOfInteractions'})
trainlet1 = trainlet1.rename(columns = {'step':'NumberOfInteractions'})

#taking a look
trainlet1

In [0]:
#for future use, before dropping the NaN values, getting the positions of the word values in the reference attribute
WordsInReferenceIndex = np.flatnonzero(trainlet1.isnull().any(axis=1)).tolist()

#getting reference with words dataframe
ReferenceWordsDF = trainlet0.iloc[WordsInReferenceIndex, :]  #for future**

In [0]:
#dropping NaN values(rows)
trainlet1 = trainlet1.dropna()

#taking a look
trainlet1

In [0]:
#getting reference and Number of interactions lists
InteractedWithItems = trainlet1.groupby('session_id', sort=False)['reference'].apply(list)
NumberOfInteractions = trainlet1.groupby('session_id', sort=False)['NumberOfInteractions'].apply(list)

#taking a look
InteractedWithItems, NumberOfInteractions

In [0]:
#converting series into lists
InteractedWithItems = InteractedWithItems.tolist()
NumberOfInteractions = NumberOfInteractions.tolist()

Adding the two attributes to trainlet

In [0]:
#getting a list of indices InteractedWithItems and NumberOfInteractions should be left joined to
FinalClickoutIndex = trainlet[trainlet.action_type=='clickout item'].groupby('session_id').tail(1).index.tolist()

#creating InteractedWithItems dataframe
InteractedWithItemsDF = pd.DataFrame({'FinalClickoutIndex':FinalClickoutIndex, 'InteractedWithItems':InteractedWithItems,
                                      'NumberOfInteractions':NumberOfInteractions}).set_index('FinalClickoutIndex')

#left join on dataset
trainlet = trainlet.join(InteractedWithItemsDF)
trainlet
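
The join above pairs the lists with the final clickout rows by position. An alternative that pairs them by session_id is sketched below, so that the order of the sessions does not matter. The frame `toy_log` and the intermediate names are hypothetical.

```python
import pandas as pd

toy_log = pd.DataFrame({
    'session_id':  ['s1', 's1', 's1', 's1', 's2'],
    'action_type': ['search for poi', 'interaction item image',
                    'interaction item image', 'clickout item', 'clickout item'],
    'reference':   ['Newtown', '10', '10', '20', '30'],
})

# Keep only rows whose reference is an item id, then count interactions per item.
is_item = toy_log['reference'].str.isdigit()
counts = (toy_log[is_item]
          .groupby(['session_id', 'reference'], sort=False)
          .size()
          .reset_index(name='NumberOfInteractions'))

# One list of items and one list of counts per session, indexed by session_id.
by_session = counts.groupby('session_id', sort=False)
per_session = pd.DataFrame({
    'InteractedWithItems':  by_session['reference'].apply(list),
    'NumberOfInteractions': by_session['NumberOfInteractions'].apply(list),
})

# Attach the lists to the final clickout row of each session.
final_clickouts = (toy_log[toy_log['action_type'] == 'clickout item']
                   .groupby('session_id').tail(1)
                   .join(per_session, on='session_id'))
```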

### Time Spent

Creating a time spent attribute for each reference within a session.

In [0]:
#This should be a separate function

#getting session_id with no clickouts
session_idNoClickouts = list(set(trainlet.session_id.unique()) - set(trainlet[trainlet.action_type == 'clickout item'].session_id.unique()))

#dropping sessions with no clickouts (errors='ignore' because 'Unnamed: 0' may already have been dropped)
trainlet = trainlet[~trainlet.session_id.isin(session_idNoClickouts)].drop(columns='Unnamed: 0', errors='ignore')

#obtaining the seconds spent on a reference by subtracting the time started viewing the item till the time of last interaction with the item
time_spent = trainlet.groupby(['session_id', 'reference'], sort=False).timestamp.apply(lambda x:(x.max() - x.min())).to_frame()

#converting the session_id index level into a column
time_spent.reset_index(level=0, inplace=True)

#converting the reference index level into a column
time_spent.reset_index(level=0, inplace=True)

#changing order of columns
time_spent = time_spent[['session_id', 'reference','timestamp']]

#changing the timestamp into SecondsSpent
time_spent = time_spent.rename(columns = {'timestamp':'SecondsSpent'})

#converting alphabetic values in reference attribute into NaN by converting all values into numeric, then again converting values into string
time_spent.reference = time_spent.reference.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna().astype(int).apply(lambda x: str(x))

#dropping NaN values(rows)
time_spent = time_spent.dropna()

#getting time spent of interactions lists
SecondsSpent = time_spent.groupby('session_id', sort=False)['SecondsSpent'].apply(list)

#converting into list
SecondsSpent = SecondsSpent.tolist()

#getting a list of indices SecondsSpent should be left joined to
FinalClickoutIndex = trainlet[trainlet.action_type=='clickout item'].groupby('session_id').tail(1).index.tolist()

#creating Seconds dataframe
SecondsSpentDF = pd.DataFrame({'FinalClickoutIndex':FinalClickoutIndex, 'SecondsSpent':SecondsSpent}).set_index('FinalClickoutIndex')

#left join on dataset
trainlet = trainlet.join(SecondsSpentDF)
trainlet

In [0]:
time_spent.groupby('session_id', sort=False)['SecondsSpent'].apply(list)

In [0]:
SecondsSpent = time_spent.groupby('session_id', sort=False)['SecondsSpent'].apply(list)

In [0]:
#wrapping the steps above into a function
def SecondsSpent(dataset):
  #getting the session_ids that have no clickouts
  session_idNoClickouts = list(set(dataset.session_id.unique()) - set(dataset[dataset.action_type == 'clickout item'].session_id.unique()))

  #dropping sessions with no clickouts
  dataset = dataset[~dataset.session_id.isin(session_idNoClickouts)].drop(columns='Unnamed: 0')

  #seconds spent on each reference: time from the first interaction with the item to the last interaction with it
  time_spent = dataset.groupby(['session_id', 'reference'], sort=False).timestamp.apply(lambda x:(x.max() - x.min())).to_frame()

  #moving session_id from the index into a column
  time_spent.reset_index(level=0, inplace=True)

  #moving reference from the index into a column
  time_spent.reset_index(level=0, inplace=True)

  #reordering the columns
  time_spent = time_spent[['session_id', 'reference','timestamp']]

  #renaming timestamp to SecondsSpent
  time_spent = time_spent.rename(columns = {'timestamp':'SecondsSpent'})

  #coercing non-numeric references to NaN, then converting the numeric ones back into strings
  time_spent.reference = pd.to_numeric(time_spent.reference, errors='coerce').dropna().astype(int).astype(str)

  #dropping rows with NaN values
  time_spent = time_spent.dropna()

  #collecting the seconds spent on each item into one list per session
  SecondsSpent = time_spent.groupby('session_id', sort=False)['SecondsSpent'].apply(list)

  #converting the Series of lists into a plain list
  SecondsSpent = SecondsSpent.tolist()

  #getting the index labels of each session's final clickout, used as the key for the left join
  FinalClickoutIndex = dataset[dataset.action_type=='clickout item'].groupby('session_id').tail(1).index.tolist()

  #creating the SecondsSpent dataframe (both lists must have the same length and follow the same session order)
  SecondsSpentDF = pd.DataFrame({'FinalClickoutIndex':FinalClickoutIndex, 'SecondsSpent':SecondsSpent}).set_index('FinalClickoutIndex')

  #left join on dataset
  dataset = dataset.join(SecondsSpentDF)
  return dataset

In [0]:
trainlet2 = train.iloc[0:10000]
trainlet2.head()

In [0]:
SecondsSpent(trainlet2)
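
The function above pairs `FinalClickoutIndex` and the `SecondsSpent` lists by position, which only works if both have the same length and session order. A minimal sketch of an alternative that looks the lists up by `session_id` instead is shown here. The function name `seconds_spent_by_session` and the toy data are my own assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

def seconds_spent_by_session(dataset):
    """Attach each session's list of per-item seconds to its final clickout row."""
    # keep only rows whose reference is a numeric item id
    is_item = pd.to_numeric(dataset["reference"], errors="coerce").notna()
    items = dataset[is_item]

    # seconds between the first and last interaction with each item in a session
    grouped = items.groupby(["session_id", "reference"], sort=False)["timestamp"]
    spent = (grouped.max() - grouped.min()).rename("SecondsSpent").reset_index()

    # one list of seconds per session, indexed by session_id
    per_session = spent.groupby("session_id", sort=False)["SecondsSpent"].apply(list)

    # final clickout row of each session
    clickouts = dataset[dataset["action_type"] == "clickout item"]
    final_clickouts = clickouts.groupby("session_id").tail(1)

    # look the list up by session_id, then align on the row index
    out = dataset.copy()
    out["SecondsSpent"] = final_clickouts["session_id"].map(per_session)
    return out

# toy data, made up for illustration
toy = pd.DataFrame({
    "session_id": ["a", "a", "a", "a", "b", "b"],
    "timestamp": [0, 10, 25, 40, 5, 9],
    "action_type": ["interaction item image", "interaction item image",
                    "clickout item", "clickout item",
                    "search for poi", "clickout item"],
    "reference": ["101", "101", "202", "202", "Rome", "303"],
})
result = seconds_spent_by_session(toy)
print(result[["session_id", "action_type", "SecondsSpent"]])
```

Only the final clickout row of each session gets a list; every other row stays NaN, which matches the left join used above.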

In [0]:
pd.DataFrame({'FinalClickoutIndex':FinalClickoutIndex, 'SecondsSpent':SecondsSpent}).set_index('FinalClickoutIndex')

In [0]:
trainlet[trainlet.action_type=='clickout item'].groupby('session_id').tail(1).index.tolist()

In [0]:
#dropping NaN values(rows)
trainlet1 = trainlet1.dropna()

#taking a look
trainlet1

In [0]:
#getting a list of values of time spent on each reference in a session
time_spent_values = time_spent.values

#getting the index of the time_spent_values in dataset
index = trainlet.groupby(['session_id', 'reference'], sort=False).tail(1).index

#creating a dataframe for a left join on train set
time_spent_df = pd.DataFrame({'index':index, 'seconds_spent':time_spent_values})
time_spent_df.head(2)

In [0]:
#making the index column as the index for dataframe
time_spent_df = time_spent_df.set_index('index')

#left join to train set on index
train = train.join(time_spent_df)
train.head(15)

In [0]:
#exporting the dataframe to Google Drive (DO NOT RUN THIS CELL UNLESS train HAS BEEN MODIFIED)
train.to_csv('train.csv')

In [0]:
#loading the file (DO NOT RUN THIS CELL UNLESS train.csv HAS BEEN MODIFIED; TODO: search for best practices in this case)
train = pd.read_csv('/content/drive/My Drive/Trivago/Clean Dataset/train.csv')
train.head(15)
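
One possible way to make the export and reload cells safer to re-run is sketched here. The helper `save_once` and the demo file name are hypothetical, and this is only one option, not an established best practice for this project. Writing with `index=False` also avoids the `Unnamed: 0` column that otherwise shows up after reloading.

```python
import os
import tempfile
import pandas as pd

def save_once(df, path, overwrite=False):
    """Write df to path, skipping the write if the file exists and overwrite is False."""
    if os.path.exists(path) and not overwrite:
        return False
    df.to_csv(path, index=False)   # index=False avoids an 'Unnamed: 0' column on reload
    return True

# toy dataframe written to a temporary folder, made up for illustration
demo = pd.DataFrame({"session_id": ["a", "b"], "step": [1, 2]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_demo.csv")
    first_write = save_once(demo, path)     # file does not exist yet, so it is written
    second_write = save_once(demo, path)    # file exists, so the write is skipped
    reloaded = pd.read_csv(path)
print(first_write, second_write, list(reloaded.columns))
```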

### Price

In [0]:
Take a session with three unique references as an example. If we keep the references that share the same impressions list (the first two) and build a table from the impressions and prices, we find that the prices are exactly the same.
Running KNN on this table (more features can be added later, along with an analysis of how much the different properties matter) would give a list of at least 5 items (the most important ones) to go from.
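
A minimal sketch of that nearest-neighbour idea, assuming the only features are the display rank and the price of each impression. It uses a plain NumPy distance computation rather than a library KNN, and the function name `nearest_impressions` and the example strings are made up for illustration.

```python
import numpy as np
import pandas as pd

def nearest_impressions(impressions, prices, query_item, k=5):
    """Return the k impressions closest to query_item in (rank, price) space.

    query_item must be one of the impressions.
    """
    table = pd.DataFrame({
        "item": impressions.split("|"),
        "price": [int(p) for p in prices.split("|")],
    })
    table["rank"] = np.arange(len(table))

    # standardize both features so neither dominates the distance
    feats = table[["rank", "price"]].to_numpy(dtype=float)
    std = feats.std(axis=0)
    std[std == 0] = 1.0            # constant feature: avoid dividing by zero
    feats = (feats - feats.mean(axis=0)) / std

    # Euclidean distance from every impression to the query item
    query = feats[(table["item"] == query_item).to_numpy()][0]
    table["distance"] = np.linalg.norm(feats - query, axis=1)

    # the query item itself is excluded from its own neighbours
    neighbours = table[table["item"] != query_item].nsmallest(k, "distance")
    return neighbours["item"].tolist()

# made-up impressions and prices in the same pipe-separated format as the dataset
impressions = "11|12|13|14|15|16|17|18"
prices = "95|120|80|210|99|101|300|97"
print(nearest_impressions(impressions, prices, "11", k=5))
```

More item properties could be appended as extra columns of `feats` later on, as long as they are scaled the same way.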

In [0]:
len(train[(train.session_id == 'aff3928535f48') & (train.action_type == 'clickout item')].tail(1).impressions.values[0].split('|'))

In [0]:
items = train[(train.session_id == 'aff3928535f48') & (train.action_type == 'clickout item')].tail(1).impressions.values[0].split('|')

In [0]:
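all_interacted_with_items = train[train.session_id == 'aff3928535f48'].reference.unique().tolist()

#impressions and prices of the session's final clickout
final_clickout = train[(train.session_id == 'aff3928535f48') & (train.action_type == 'clickout item')].tail(1)
impression = final_clickout.impressions.values[0].split('|')
price = list(map(int, final_clickout.prices.values[0].split('|')))

#collecting the interacted-with items that appear in the impressions, together with their prices
interacted_with_items = []
interacted_with_items_prices = []
for item in all_interacted_with_items:
  for i in range(len(impression)):
    if item == impression[i]:
      interacted_with_items.append(item)
      print(item)
      interacted_with_items_prices.append(price[i])
      print(price[i])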

In [0]:
for item in all_interacted_with_items:
  for i in range(len(impression)):
    if item == impression[i]:
      print(item)

In [0]:
all_interacted_with_items

In [0]:
def impression_price(session_id):               #the issue with this function is that it ignores the earlier clickouts
                                                #(there is valuable information that can be extracted from them)
  try:                                          #some of the steps below fail on some sessions (e.g. sessions with no clickout)
    final_clickout = train[(train.session_id == session_id) & (train.action_type == 'clickout item')].tail(1)
    impression = final_clickout.impressions.values[0].split('|')
    price = final_clickout.prices.values[0].split('|')
    price = list(map(int, price))               #converting list of strings into integers
    clickout_item = final_clickout.reference.values[0]
    all_interacted_with_items = train[train.session_id == session_id].reference.unique().tolist()
    for i in range(len(impression)):            #getting the rank (position) of the clickout item
      if clickout_item == impression[i]:
        rank = i
    interacted_with_items = []
    interacted_with_items_prices = []
    for item in all_interacted_with_items:      #getting the interacted-with items and their prices
      for i in range(len(impression)):
        if item == impression[i]:
          interacted_with_items.append(item)
          interacted_with_items_prices.append(price[i])
    plt.figure(figsize=(10,8))
    plt.title('Impressions and Prices', fontsize=30)
    plt.xlabel('Impressions', fontsize=20)
    plt.ylabel('Price', fontsize=20)
    plt.xticks(rotation=90)
    plt.plot(impression, price, 'o')
    plt.plot(interacted_with_items, interacted_with_items_prices, 'o', color='red')
    plt.plot(clickout_item, price[rank] , 'o', color='black')
  except Exception:                             #skipping sessions that cannot be plotted
    pass

listOfSessions = random.choices(train.session_id.unique(), k=10)
for session_id in listOfSessions:
  impression_price(session_id)


TODO: get this code running normally and organize it properly.

In [0]:
listOfSessions = train.session_id.unique()[0:100]
for session_id in listOfSessions:
  impression_price(session_id)

The impressions on the graphs are ordered by their rank in the last list provided by Trivago. After a quick look at the sample graphs, I can see a pattern: the black dot (the clickout item) tends to sit close to the red dots (the items interacted with), or at least not very far from them.
I still need to validate that price and rank play an important role in the user's final choice.
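
As a first step towards that validation, here is a rough sketch that computes, for each final clickout, the display rank of the clicked item and how many displayed items were cheaper. The helper `clicked_item_position` and the toy rows are assumptions made for illustration. On the real data the same function would be applied to the final clickout rows of `train`.

```python
import pandas as pd

def clicked_item_position(impressions, prices, reference):
    """Return (display_rank, cheaper_count) of the clicked item, or None if it was not shown."""
    items = impressions.split("|")
    price_list = [int(p) for p in prices.split("|")]
    if reference not in items:
        return None
    display_rank = items.index(reference)
    # number of displayed items that are strictly cheaper than the clicked one
    cheaper_count = sum(p < price_list[display_rank] for p in price_list)
    return display_rank, cheaper_count

# toy final clickouts, made up for illustration
toy_clickouts = pd.DataFrame({
    "reference": ["b", "x", "q"],
    "impressions": ["a|b|c", "x|y", "r|s"],
    "prices": ["30|10|20", "5|7", "8|9"],
})
positions = [
    clicked_item_position(r.impressions, r.prices, r.reference)
    for r in toy_clickouts.itertuples()
]

# keep only clickouts whose item appears in the impressions, then summarize
shown = [p for p in positions if p is not None]
summary = pd.DataFrame(shown, columns=["display_rank", "cheaper_count"])
print(summary.describe())
```

If clicked items cluster at low `display_rank` and low `cheaper_count` on the full train set, that would support using rank and price as features.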