<a href="https://colab.research.google.com/github/ZeyadSabbah/TrivagoRecommenderSystem/blob/master/TrivagoFeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering

## Mounting to Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem

/content/drive/My Drive/Trivago/Project/TrivagoRecommenderSystem


## Loading Libraries & Datasets

In [0]:
# !pip install cudf-cuda100
# !cp /usr/local/lib/python3.6/dist-packages/librmm.so .
# import os  
# os.environ['NUMBAPRO_NVVM']='/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so'  
# os.environ['NUMBAPRO_LIBDEVICE']='/usr/local/cuda-10.0/nvvm/libdevice'

In [0]:
# import cudf 
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import math
import matplotlib.pyplot as plt
from datetime import datetime
import re
import random
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
item_metadata_filepath = '../../Datasets/raw_data/item_metadata.csv'
item_metadata = pd.read_csv(item_metadata_filepath)

# train_filepath = '../../Datasets/clean_data/train.csv'
# train = pd.read_csv(train_filepath)
# train.drop('Unnamed: 0', axis=1, inplace=True)

## Important Dataframes Preparation

In engineer global features from the train dataset, which will be most of the cases around either clickouts or final click out examples (instances), two dataframes of them will be created.

In order to get representative global values to the items, duplications must be removed. Rows would not all be duplicated because of some attributes such as timestamp, reference, and step.

In [0]:
ClickoutDF = train[train.action_type=='clickout item']
FinalClickoutDF = train[train.action_type=='clickout item'].groupby('session_id').tail(1)
ClickoutUniqueDF = ClickoutDF.drop_duplicates(subset=['session_id', 'impressions'], keep='first')

## Item Global Features
### Number of Properties

In [0]:
item_metadata.properties = item_metadata.properties.apply(lambda x: x.split('|'))
item_metadata['NumberOfProperties'] = item_metadata.properties.apply(lambda x: len(x))
item_metadata.head(2)

Unnamed: 0,item_id,properties,NumberOfProperties
0,5101,"[Satellite TV, Golf Course, Airport Shuttle, C...",62
1,5416,"[Satellite TV, Cosmetic Mirror, Safe (Hotel), ...",46


In [0]:
#getting total number of unique properties across all items
AllPropertiesList = item_metadata.properties.tolist()

AllPropertiesFlatList = []
for sublist in AllPropertiesList:
    for item in sublist:
        AllPropertiesFlatList.append(item)
        
print('Number of unique properties is', len(set(AllPropertiesFlatList)))

Number of unique properties is 157


### Items Properties Similarities
The purpose is to get cosine similarity between items. The maximum 25 items out of the list can be extracted as a dataframe and cosine similarity can be obtained in order to get the similar items to the ones the user had interacted with to be listed on the top of the list.  

In [5]:
item_metadata.properties = item_metadata.properties.apply(lambda x: tuple(x.split('|')))

one_hot = MultiLabelBinarizer()

properties_encoded = one_hot.fit_transform(item_metadata.properties.values.tolist())

properties_encodedDF = pd.DataFrame(properties_encoded)

#changing column names
properties_list = one_hot.classes_.tolist()
for i in range(len(properties_list)):
    properties_encodedDF = properties_encodedDF.rename(columns={i:properties_list[i]})

#creating a column of the item id to get the similarity between items
item_metadata.item_id = item_metadata.item_id.apply(lambda x: str(x))
properties_encodedDF['item_id'] = item_metadata.item_id

properties_encodedDF.head()

Unnamed: 0,1 Star,2 Star,3 Star,4 Star,5 Star,Accessible Hotel,Accessible Parking,Adults Only,Air Conditioning,Airport Hotel,Airport Shuttle,All Inclusive (Upon Inquiry),Balcony,Bathtub,Beach,Beach Bar,Beauty Salon,Bed & Breakfast,Bike Rental,Boat Rental,Body Treatments,Boutique Hotel,Bowling,Bungalows,Business Centre,Business Hotel,Cable TV,Camping Site,Car Park,Casa Rural (ES),Casino (Hotel),Central Heating,Childcare,Club Hotel,Computer with Internet,Concierge,Conference Rooms,Convenience Store,Convention Hotel,Cosmetic Mirror,...,Satellite TV,Satisfactory Rating,Sauna,Self Catering,Senior Travellers,Serviced Apartment,Shooting Sports,Shower,Singles,Sitting Area (Rooms),Ski Resort,Skiing,Solarium,Spa (Wellness Facility),Spa Hotel,Steam Room,Sun Umbrellas,Surfing,Swimming Pool (Bar),Swimming Pool (Combined Filter),Swimming Pool (Indoor),Swimming Pool (Outdoor),Szep Kartya,Table Tennis,Telephone,Teleprinter,Television,Tennis Court,Tennis Court (Indoor),Terrace (Hotel),Theme Hotel,Towels,Very Good Rating,Volleyball,Washing Machine,Water Slide,Wheelchair Accessible,WiFi (Public Areas),WiFi (Rooms),item_id
0,0,0,0,1,0,1,1,0,1,0,1,0,1,1,0,0,0,0,1,0,0,0,1,0,1,1,1,0,1,0,0,1,0,0,1,0,1,0,0,1,...,1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,1,1,5101
1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,...,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,1,5416
2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,...,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,1,5834
3,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,1,1,1,0,0,1,...,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,1,0,5910
4,0,0,0,1,0,1,1,0,0,0,0,0,1,1,1,0,1,0,1,1,1,0,1,0,1,1,1,0,1,0,0,1,0,0,1,0,1,0,1,1,...,1,1,1,0,0,0,0,1,0,1,0,0,1,1,1,1,0,1,0,1,1,1,0,1,1,0,1,1,1,1,0,1,0,1,0,0,1,1,1,6066


#### K-means

#### PCA

### Number of Times in Impressions
The purpose of this feature is to check how many times an item had been shown to users in the list.

Finding number of times an item has been mentioned in a list.

In [0]:
AllImpressionsList = ClickoutUniqueDF.impressions.apply(lambda x:x.split('|'))

AllImpressionsFlatList = []
for sublist in AllImpressionsList:
    for item in sublist:
        AllImpressionsFlatList.append(item)

InImpressionsCounter = Counter(AllImpressionsFlatList)
InImpressionsDF = pd.DataFrame.from_dict(InImpressionsCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberInImpressions'})
InImpressionsDF.head(2)

Unnamed: 0,item_id,NumberInImpressions
0,3400638,491
1,1253714,84


In [0]:
InImpressionsDF.item_id.nunique(), len(InImpressionsDF), item_metadata.item_id.nunique()

(815092, 815092, 927142)

Number of items in this dataframe is less than the number of items in item_metadata, that's because some of the items had not been mentioned in the impressions list.

In [0]:
item_metadata.item_id = item_metadata.item_id.apply(lambda x: str(x))
item_metadata.drop(columns='properties', inplace=True)

#left joining
item_metadata = item_metadata.merge(InImpressionsDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions
0,5101,62,89.0
1,5416,46,63.0


### Number of Times in Reference
The purpose of this feature is to check how many times an item has been mentioned in the Reference attribute in the whole train set. 

In [0]:
InReferencesCounter = Counter(train.reference.values.tolist())
InReferencesDF = pd.DataFrame.from_dict(InReferencesCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberInReferences'})

InReferencesDF.head(2)

Unnamed: 0,item_id,NumberInReferences
0,Newtown,16
1,666856,26


In [0]:
#left joining
item_metadata = item_metadata.merge(InReferencesDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences
0,5101,62,89.0,14.0
1,5416,46,63.0,43.0


### Number of Times in Clickout
The purpose of this feature is to check how many times an item has been clicked out

In [0]:
InClickoutCounter = Counter(ClickoutDF.reference.values.tolist())
InClickoutDF = pd.DataFrame.from_dict(InClickoutCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberAsClickout'})
InClickoutDF.head(2)

Unnamed: 0,item_id,NumberAsClickout
0,109038,53
1,1257342,20


In [0]:
#left joining
item_metadata = item_metadata.merge(InClickoutDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout
0,5101,62,89.0,14.0,7.0
1,5416,46,63.0,43.0,6.0


### Number of Time in Final Clickout
The purpose of this feature is to get the number of times an item had been mentioned as a final clickout.

In [0]:
InFinalClickoutCounter = Counter(FinalClickoutDF.reference.values.tolist())
InFinalClickoutDF = pd.DataFrame.from_dict(InFinalClickoutCounter, orient='index').reset_index()\
                              .rename(columns={'index':'item_id', 0:'NumberAsFinalClickout'})
InFinalClickoutDF.head(2)

Unnamed: 0,item_id,NumberAsFinalClickout
0,1257342,7
1,2795374,18


In [0]:
#left joining
item_metadata = item_metadata.merge(InFinalClickoutDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout
0,5101,62,89.0,14.0,7.0,4.0
1,5416,46,63.0,43.0,6.0,2.0


The next step would be dividing the NumberAsFinalClickout by the other 3, in order to get FinalClickout Relativity

### Final Clickout To Impressions
The purpose of this feature is to get item's rate of clicking out when listed to the user.

In [0]:
FClickoutToImpressions = item_metadata.NumberAsFinalClickout/item_metadata.NumberInImpressions
FClickoutToImpressions.head(2)

0    0.044944
1    0.031746
dtype: float64

In [0]:
#adding attribute
item_metadata['FClickoutToImpressions'] = FClickoutToImpressions
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions
0,5101,62,89.0,14.0,7.0,4.0,0.044944
1,5416,46,63.0,43.0,6.0,2.0,0.031746


### Final Clickout To References
The purpose of this feature is to get item's rate of clicking out when it was interacted with.

In [0]:
FClickoutToReferences = item_metadata.NumberAsFinalClickout/item_metadata.NumberInReferences
FClickoutToReferences.head(2)

0    0.285714
1    0.046512
dtype: float64

In [0]:
#adding attribute
item_metadata['FClickoutToReferences'] = FClickoutToReferences
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512


### Final Clickout To Clickout
The purpose of this feature is to get item's rate of clickout when it was clicked out before.

In [0]:
FClickoutToClickout = item_metadata.NumberAsFinalClickout/item_metadata.NumberAsClickout
FClickoutToClickout.head(2)

0    0.571429
1    0.333333
dtype: float64

In [0]:
#adding attribute
item_metadata['FClickoutToClickout'] = FClickoutToClickout
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333


### Item's Average Rank
The purpose of this feature is to get item's position in the list provided to the user across the train set.  
Since the purpose is to get the average rank across the lists shown to users, an important notice which is that some sessions have different reference and click outs, which provides the same impression list. Duplicated impressions lists in each session should be dropped. (Same thing applies to price as well.)

In [0]:
# using All Clickout dataframe, but the one with the unique impressions for each session for this feature
ClickoutUniqueDF.head(1)

Unnamed: 0,user_id,session_id,timestamp,step,action_type,reference,platform,city,device,current_filters,impressions,prices
13,00RL8Z82B2Z1,aff3928535f48,1541037543,14,clickout item,109038,AU,"Sydney, Australia",mobile,,3400638|1253714|3367857|5100540|1088584|666916...,95|66|501|112|95|100|101|72|82|56|56|143|70|25...


In [0]:
SessionImpressionsDF = ClickoutUniqueDF[['session_id', 'impressions']]
SessionImpressionsDF.impressions = SessionImpressionsDF.impressions.apply(lambda x: x.split('|'))
SessionImpressionsDF.impressions.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


13    [3400638, 1253714, 3367857, 5100540, 1088584, ...
15    [55109, 129343, 54824, 2297972, 109014, 125734...
Name: impressions, dtype: object

In [0]:
SessionImpressionsDF_exploded = SessionImpressionsDF.explode('impressions')
SessionImpressionsDF_exploded = SessionImpressionsDF_exploded.rename(columns={'impressions':'item_id'})
SessionImpressionsDF_exploded.head(2)

Unnamed: 0,session_id,item_id
13,aff3928535f48,3400638
13,aff3928535f48,1253714


In [0]:
#creating a rank column
rank = SessionImpressionsDF.impressions.apply(lambda x: list(range(1, (len(x) + 1))))
SessionRankDF = pd.concat([SessionImpressionsDF.session_id, rank], axis=1)
SessionRankDF = SessionRankDF.rename(columns={'impressions':'rank'})
SessionRankDF_exploded = SessionRankDF.explode('rank')
SessionRankDF_exploded.head(2)

Unnamed: 0,session_id,rank
13,aff3928535f48,1
13,aff3928535f48,2


In [0]:
#creating dataframe that combines both items and rank
ItemRankDF = pd.DataFrame({'item_id':SessionImpressionsDF_exploded['item_id'].values.tolist(),
                            'rank':SessionRankDF_exploded['rank'].values.tolist()})
ItemRankDF.head(2)

Unnamed: 0,item_id,rank
0,3400638,1
1,1253714,2


In [0]:
#getting the mean of ranks for each item
ItemAverageRank = ItemRankDF.groupby('item_id', sort=False)['rank'].mean().to_frame().reset_index()
ItemAverageRank = ItemAverageRank.rename(columns={'rank':'MeanRank'})
ItemAverageRank.head(2)

Unnamed: 0,item_id,MeanRank
0,3400638,15.627291
1,1253714,14.321429


In [0]:
#left joining
item_metadata = item_metadata.merge(ItemAverageRank, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanRank
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,12.168539
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,10.111111


### Item's Average Price
The purpose of this feature is to get item's average price accross the train set.  
Making sure that the length of each impressions list is as the same length of the prices list.

In [0]:
ImpressionsLength = ClickoutUniqueDF.impressions.apply(lambda x: x.split('|')).apply(lambda x: len(x))
PricesLength = ClickoutUniqueDF.prices.apply(lambda x: x.split('|')).apply(lambda x: len(x))
ImpressionsLength.equals(PricesLength)

True

In [0]:
SessionPricesDF = ClickoutUniqueDF[['session_id', 'prices']]
SessionPricesDF.prices = SessionPricesDF.prices.apply(lambda x: x.split('|'))
SessionPricesDF_exploded = SessionPricesDF.explode('prices')
ItemPriceDF = pd.DataFrame({'item_id':SessionImpressionsDF_exploded.item_id.values.tolist(),
                            'price':SessionPricesDF_exploded.prices.values.tolist()})
ItemPriceDF.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Unnamed: 0,item_id,price
0,3400638,95
1,1253714,66


In [0]:
ItemPriceDF.price = ItemPriceDF.price.apply(lambda x: int(x))
ItemAveragePriceDF = ItemPriceDF.groupby('item_id', sort=False)['price'].mean().to_frame().reset_index()
ItemAveragePriceDF = ItemAveragePriceDF.rename(columns={'price':'MeanPrice'})
ItemAveragePriceDF.head(2)

Unnamed: 0,item_id,MeanPrice
0,3400638,192.154786
1,1253714,82.678571


In [0]:
item_metadata = item_metadata.merge(ItemAveragePriceDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanRank,MeanPrice
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,12.168539,121.696629
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,10.111111,102.873016


### Item's Maximum Price

In [0]:
ItemMaximumPriceDF = ItemPriceDF.groupby('item_id', sort=False)['price'].max().to_frame().reset_index()
ItemMaximumPriceDF = ItemMaximumPriceDF.rename(columns={'price':'MaxPrice'})
ItemMaximumPriceDF.head(2)

Unnamed: 0,item_id,MaxPrice
0,3400638,2523
1,1253714,248


In [0]:
#left joining
item_metadata = item_metadata.merge(ItemMaximumPriceDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanRank,MeanPrice,MaxPrice
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,12.168539,121.696629,241.0
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,10.111111,102.873016,270.0


### Item's Minimum Price

In [0]:
ItemMinimumPriceDF = ItemPriceDF.groupby('item_id', sort=False)['price'].min().to_frame().reset_index()
ItemMinimumPriceDF = ItemMinimumPriceDF.rename(columns={'price':'MinPrice'})
ItemMinimumPriceDF.head(2)

Unnamed: 0,item_id,MinPrice
0,3400638,68
1,1253714,50


In [0]:
#left joining
item_metadata = item_metadata.merge(ItemMinimumPriceDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanRank,MeanPrice,MaxPrice,MinPrice
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,12.168539,121.696629,241.0,64.0
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,10.111111,102.873016,270.0,59.0


### Price Rank
The purpose of this feature is to get the average price rank of an item across the train set. Sorting the items not by what is being showed to the user, but by the items' prices, getting the average price rank for each item across the whole train. It shows where an item stands from its peers.  
By using the argsort function, the price rank can be feature can be engineered.

In [0]:
ClickoutDF.prices[13]

'95|66|501|112|95|100|101|72|82|56|56|143|70|25|71|162|73|143|188|118|77|131|143|49|165'

In [0]:
np.argsort(list(map(int, ClickoutDF.prices[13].split('|'))))

array([13, 23,  9, 10,  1, 12, 14,  7, 16, 20,  8,  0,  4,  5,  6,  3, 19,
       21, 17, 22, 11, 15, 24, 18,  2])

In [0]:
SessionPricesDF['PriceRank'] = SessionPricesDF.prices.apply(lambda x: (list(map(int, x))))\
                                                     .apply(lambda x: np.argsort(x))
SessionPricesDF.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,session_id,prices,PriceRank
13,aff3928535f48,"[95, 66, 501, 112, 95, 100, 101, 72, 82, 56, 5...","[13, 23, 9, 10, 1, 12, 14, 7, 16, 20, 8, 0, 4,..."
15,aff3928535f48,"[162, 25, 150, 143, 101, 49, 118, 131, 18, 100...","[8, 1, 15, 5, 12, 16, 20, 9, 4, 10, 23, 24, 6,..."


In [0]:
SessionPricesRankDF = SessionPricesDF[['session_id', 'PriceRank']].explode('PriceRank')
SessionPricesRankDF.head(2)

Unnamed: 0,session_id,PriceRank
13,aff3928535f48,13
13,aff3928535f48,23


In [0]:
ItemPriceRankDF = pd.DataFrame({'item_id':SessionImpressionsDF_exploded.item_id.values.tolist(),
                            'PriceRank':SessionPricesRankDF.PriceRank.values.tolist()})
ItemPriceRankDF.head(2)

Unnamed: 0,item_id,PriceRank
0,3400638,13
1,1253714,23


In [0]:
#getting the average price rank of each item
ItemAveragePriceRankDF = ItemPriceRankDF.groupby('item_id', sort=False).PriceRank.mean().to_frame().reset_index()
ItemAveragePriceRankDF = ItemAveragePriceRankDF.rename(columns={'PriceRank':'AveragePriceRank'})
ItemAveragePriceRankDF.head(2)

Unnamed: 0,item_id,AveragePriceRank
0,3400638,11.826884
1,1253714,11.309524


In [0]:
# left joining
item_metadata = item_metadata.merge(ItemAveragePriceRankDF, on='item_id', how='left')
item_metadata.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanRank,MeanPrice,MaxPrice,MinPrice,AveragePriceRank
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,12.168539,121.696629,241.0,64.0,12.303371
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,10.111111,102.873016,270.0,59.0,11.095238


In [0]:
#filling NaN values with zeros
item_metadata = item_metadata.fillna(0)

GlobalItemFeatures = item_metadata.merge(properties_encodedDF, on='item_id', how='left')
GlobalItemFeatures.head(2)

Unnamed: 0,item_id,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanRank,MeanPrice,MaxPrice,MinPrice,AveragePriceRank,1 Star,2 Star,3 Star,4 Star,5 Star,Accessible Hotel,Accessible Parking,Adults Only,Air Conditioning,Airport Hotel,Airport Shuttle,All Inclusive (Upon Inquiry),Balcony,Bathtub,Beach,Beach Bar,Beauty Salon,Bed & Breakfast,Bike Rental,Boat Rental,Body Treatments,Boutique Hotel,Bowling,Bungalows,Business Centre,Business Hotel,...,Sailing,Satellite TV,Satisfactory Rating,Sauna,Self Catering,Senior Travellers,Serviced Apartment,Shooting Sports,Shower,Singles,Sitting Area (Rooms),Ski Resort,Skiing,Solarium,Spa (Wellness Facility),Spa Hotel,Steam Room,Sun Umbrellas,Surfing,Swimming Pool (Bar),Swimming Pool (Combined Filter),Swimming Pool (Indoor),Swimming Pool (Outdoor),Szep Kartya,Table Tennis,Telephone,Teleprinter,Television,Tennis Court,Tennis Court (Indoor),Terrace (Hotel),Theme Hotel,Towels,Very Good Rating,Volleyball,Washing Machine,Water Slide,Wheelchair Accessible,WiFi (Public Areas),WiFi (Rooms)
0,5101,62,89.0,14.0,7.0,4.0,0.044944,0.285714,0.571429,12.168539,121.696629,241.0,64.0,12.303371,0,0,0,1,0,1,1,0,1,0,1,0,1,1,0,0,0,0,1,0,0,0,1,0,1,1,...,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,1,1
1,5416,46,63.0,43.0,6.0,2.0,0.031746,0.046512,0.333333,10.111111,102.873016,270.0,59.0,11.095238,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,...,0,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,1


## Item Local Features

### General Features

In [0]:
session_gb = train.groupby('session_id', sort=False)
train.session_id.nunique()

745755

#### item_id

In [0]:
def get_data_clickout(data):
  data_clickout = data[data['action_type']=='clickout item'].groupby('session_id').tail(1)
  return data_clickout

def get_item_id(data_clickout):
  item_id = data_clickout[['session_id', 'impressions']]
  item_id['impressions'] = item_id['impressions'].apply(lambda x: x.split('|'))
  item_id = item_id.explode('impressions')
  item_id = item_id.rename(columns={'impressions':'item_id'})
  item_id = item_id.reset_index(drop=True)
  return item_id

In [0]:
item_id = get_item_id(FinalClickoutDF)
item_id

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,session_id,item_id
0,aff3928535f48,55109
1,aff3928535f48,129343
2,aff3928535f48,54824
3,aff3928535f48,2297972
4,aff3928535f48,109014
...,...,...
15373343,62728015bec05,2712342
15373344,62728015bec05,48497
15373345,62728015bec05,11933
15373346,62728015bec05,1714483


#### price

In [0]:
def get_price(data_clickout):
  price = data_clickout[['session_id', 'prices']]
  price['prices'] = price['prices'].apply(lambda x: x.split('|'))
  price = price.explode('prices')
  price['prices'] = price['prices'].apply(lambda x: int(x))
  price = price.rename(columns={'prices':'price'})
  price = price.reset_index(drop=True)
  return price

In [0]:
price = get_price(FinalClickoutDF)
price

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,session_id,price
0,aff3928535f48,162
1,aff3928535f48,25
2,aff3928535f48,150
3,aff3928535f48,143
4,aff3928535f48,101
...,...,...
15373343,62728015bec05,73
15373344,62728015bec05,169
15373345,62728015bec05,87
15373346,62728015bec05,485


In [0]:
type(price.price)

pandas.core.series.Series

#### item_rank

In [0]:
def get_item_rank(data_clickout):
  item_rank = data_clickout[['session_id', 'impressions']]
  item_rank['impressions'] = item_rank['impressions'].apply(lambda x: x.split('|'))
  item_rank['impressions'] = item_rank['impressions'].apply(lambda x: list(range(1, len(x) + 1)))
  item_rank = item_rank.explode('impressions')
  item_rank['impressions'] = item_rank['impressions'].apply(lambda x: int(x))
  item_rank = item_rank.rename(columns={'impressions':'item_rank'})
  item_rank = item_rank.reset_index(drop=True)
  return item_rank

In [0]:
item_rank = get_item_rank(FinalClickoutDF)
item_rank

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,session_id,item_rank
0,aff3928535f48,1
1,aff3928535f48,2
2,aff3928535f48,3
3,aff3928535f48,4
4,aff3928535f48,5
...,...,...
15373343,62728015bec05,21
15373344,62728015bec05,22
15373345,62728015bec05,23
15373346,62728015bec05,24


#### price_rank

In [0]:
def get_price_rank(data):
  price_rank = data.groupby('session_id', sort=False).price.apply(lambda x: x.values).to_frame().reset_index().rename(columns={'price':'price_list'})
  price_rank.price_list = price_rank.price_list.apply(lambda x: np.argsort(x))
  price_rank = price_rank.rename(columns={'price_list':'price_rank'})
  price_rank = price_rank.explode('price_rank')
  price_rank = price_rank.reset_index(drop=True)
  return price_rank

In [0]:
price_rank = get_price_rank(price)
price_rank

Unnamed: 0,session_id,price_rank
0,aff3928535f48,8
1,aff3928535f48,1
2,aff3928535f48,15
3,aff3928535f48,5
4,aff3928535f48,12
...,...,...
15373343,62728015bec05,6
15373344,62728015bec05,19
15373345,62728015bec05,10
15373346,62728015bec05,23


#### clickout

In [0]:
def get_clickout(data_clickout, item_id):
  clickout = data_clickout[['session_id','reference']]
  clickout = item_id.merge(clickout, on='session_id', how='left')
  clickout['clickout'] = clickout.apply(lambda x: 1 if x['item_id'] == x['reference'] else 0, axis=1)
  clickout.drop(columns='reference', inplace=True)
  clickout = clickout.reset_index(drop=True)
  return clickout

In [0]:
clickout = get_clickout(FinalClickoutDF, item_id)
clickout.head(10)

Unnamed: 0,session_id,item_id,clickout
0,aff3928535f48,55109,0
1,aff3928535f48,129343,0
2,aff3928535f48,54824,0
3,aff3928535f48,2297972,0
4,aff3928535f48,109014,0
5,aff3928535f48,1257342,1
6,aff3928535f48,1031578,0
7,aff3928535f48,109018,0
8,aff3928535f48,1332971,0
9,aff3928535f48,666916,0


#### session_duration

In [0]:
def get_session_duration(data, item_id):
  session_duration = data.groupby('session_id', sort=False).timestamp.max() - data.groupby('session_id', sort=False).timestamp.min()
  session_duration = session_duration.to_frame().rename(columns={'timestamp':'session_duration'})
  session_duration = item_id.merge(session_duration, on='session_id', how='left')
  session_duration.drop(columns='item_id', inplace=True)
  session_duration = session_duration.reset_index(drop=True)
  return session_duration

In [0]:
session_duration = get_session_duration(train, item_id)
session_duration

Unnamed: 0,session_id,session_duration
0,aff3928535f48,1025
1,aff3928535f48,1025
2,aff3928535f48,1025
3,aff3928535f48,1025
4,aff3928535f48,1025
...,...,...
15373343,62728015bec05,571
15373344,62728015bec05,571
15373345,62728015bec05,571
15373346,62728015bec05,571


#### item_duration

In [0]:
def get_item_duration(data, item_id):
  item_duration = data.groupby(['session_id', 'reference'], sort=False).timestamp.max() - data.groupby(['session_id', 'reference'], sort=False).timestamp.min()
  item_duration = item_duration.reset_index().rename(columns={'reference':'item_id', 'timestamp':'item_duration'})
  item_duration = item_id.merge(item_duration, left_on=['session_id', 'item_id'], right_on=['session_id', 'item_id'], how='left')
  item_duration = item_duration.fillna(0)
  item_duration = item_duration.reset_index(drop=True)
  return item_duration

In [0]:
item_duration = get_item_duration(train, item_id)
item_duration[item_duration.item_duration>0]

Unnamed: 0,session_id,item_id,item_duration
25,3599a6f709eab,2795374,134.0
65,ec139e10b9238,1032816,338.0
79,325fafb5fa450,65685,65.0
126,7157899be2839,2552514,3.0
185,4c6062d7cefe4,138930,10.0
...,...,...,...
15373250,92b7ab1287edf,207081,14.0
15373251,92b7ab1287edf,50584,82.0
15373273,6b66cb0cfb518,3811156,3.0
15373274,6b66cb0cfb518,2337580,10.0


#### item_session_duration

In [0]:
def get_item_session_duration(item_duration, session_duration):
  item_duration['item_session_duration'] = item_duration.item_duration/session_duration.session_duration
  item_session_duration = item_duration[['session_id', 'item_id', 'item_session_duration']]
  item_duration = item_duration[['session_id', 'item_id', 'item_duration']]
  item_session_duration = item_session_duration.fillna(0)
  item_session_druation = item_session_duration.reset_index(drop=True)
  return item_session_duration

In [0]:
item_session_duration = get_item_session_duration(item_duration, session_duration)
item_session_duration[item_session_duration.item_session_duration>0]

Unnamed: 0,session_id,item_id,item_session_duration
25,3599a6f709eab,2795374,1.000000
65,ec139e10b9238,1032816,0.550489
79,325fafb5fa450,65685,0.256917
126,7157899be2839,2552514,1.000000
185,4c6062d7cefe4,138930,0.080645
...,...,...,...
15373250,92b7ab1287edf,207081,0.009609
15373251,92b7ab1287edf,50584,0.056280
15373273,6b66cb0cfb518,3811156,0.230769
15373274,6b66cb0cfb518,2337580,0.769231


#### item_interactions

In [0]:
def get_item_interactions(data, item_id):
  item_interactions = data.groupby(['session_id', 'reference']).step.count().to_frame().reset_index()
  item_interactions = item_interactions.rename(columns={'reference':'item_id', 'step':'item_interactions'})
  item_interactions = item_id.merge(item_interactions, left_on=['session_id', 'item_id'], right_on=['session_id', 'item_id'], how='left')
  item_interactions = item_interactions.fillna(0)
  item_interactions = item_interactions.reset_index(drop=True)
  return item_interactions

In [0]:
item_interactions = get_item_interactions(train, item_id)
item_interactions[item_interactions.item_interactions>0]

Unnamed: 0,session_id,item_id,item_interactions
5,aff3928535f48,1257342,1.0
25,3599a6f709eab,2795374,35.0
65,ec139e10b9238,1032816,4.0
79,325fafb5fa450,65685,38.0
82,325fafb5fa450,1320460,1.0
...,...,...,...
15373274,6b66cb0cfb518,2337580,8.0
15373318,e7916050980d9,8985292,1.0
15373323,62728015bec05,6617798,15.0
15373326,62728015bec05,1161323,1.0


#### maximum_step

In [0]:
def get_maximum_step(data, item_id):
  maximum_step = data.groupby('session_id', sort=False).step.max().to_frame().reset_index()
  maximum_step = maximum_step.rename(columns={'step':'maximum_step'})
  maximum_step = item_id.merge(maximum_step, on='session_id', how='left')
  maximum_step = maximum_step.reset_index(drop=True)
  return maximum_step

In [0]:
maximum_step = get_maximum_step(train, item_id)
maximum_step

Unnamed: 0,session_id,item_id,maximum_step
0,aff3928535f48,55109,16
1,aff3928535f48,129343,16
2,aff3928535f48,54824,16
3,aff3928535f48,2297972,16
4,aff3928535f48,109014,16
...,...,...,...
15373343,62728015bec05,2712342,19
15373344,62728015bec05,48497,19
15373345,62728015bec05,11933,19
15373346,62728015bec05,1714483,19


#### top_list

In [0]:
def get_top_list(item_rank):
  top_list = item_rank[['session_id', 'item_rank']]
  top_list['top_list'] = top_list.apply(lambda x: 1 if x['item_rank'] < 6 else 0, axis=1)
  top_list = top_list.reset_index(drop=True)
  return top_list

In [0]:
top_list = get_top_list(item_rank)
top_list

Unnamed: 0,session_id,item_rank,top_list
0,aff3928535f48,1,1
1,aff3928535f48,2,1
2,aff3928535f48,3,1
3,aff3928535f48,4,1
4,aff3928535f48,5,1
...,...,...,...
15373343,62728015bec05,21,0
15373344,62728015bec05,22,0
15373345,62728015bec05,23,0
15373346,62728015bec05,24,0


In [0]:
dataframes = [item_id, price, item_rank, price_rank, clickout, session_duration, item_duration, item_session_duration, item_interactions, maximum_step, top_list]
local_data = pd.concat(dataframes)
local_data

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


## Local Features Defined Function

In [0]:
def get_data_clickout(data):
  data_clickout = data[data['action_type']=='clickout item'].groupby('session_id').tail(1)
  return data_clickout

def get_item_id(data_clickout):
  item_id = data_clickout[['session_id', 'impressions']]
  item_id['impressions'] = item_id['impressions'].apply(lambda x: x.split('|'))
  item_id = item_id.explode('impressions')
  item_id = item_id.rename(columns={'impressions':'item_id'})
  item_id = item_id.reset_index(drop=True)
  return item_id

def get_price(data_clickout):
  price = data_clickout[['session_id', 'prices']]
  price['prices'] = price['prices'].apply(lambda x: x.split('|'))
  price = price.explode('prices')
  price['prices'] = price['prices'].apply(lambda x: int(x))
  price = price.rename(columns={'prices':'price'})
  price = price.reset_index(drop=True)
  return price

def get_item_rank(data_clickout):
  item_rank = data_clickout[['session_id', 'impressions']]
  item_rank['impressions'] = item_rank['impressions'].apply(lambda x: x.split('|'))
  item_rank['impressions'] = item_rank['impressions'].apply(lambda x: list(range(1, len(x) + 1)))
  item_rank = item_rank.explode('impressions')
  item_rank = item_rank.rename(columns={'impressions':'item_rank'})
  item_rank = item_rank.reset_index(drop=True)
  return item_rank

def get_price_rank(data):
  price_rank = data.groupby('session_id', sort=False).price.apply(lambda x: x.values).to_frame().reset_index().rename(columns={'price':'price_list'})
  price_rank.price_list = price_rank.price_list.apply(lambda x: np.argsort(x))
  price_rank = price_rank.rename(columns={'price_list':'price_rank'})
  price_rank = price_rank.explode('price_rank')
  price_rank = price_rank.reset_index(drop=True)
  return price_rank

def get_clickout(data_clickout, item_id):
  clickout = data_clickout[['session_id','reference']]
  clickout = item_id.merge(clickout, on='session_id', how='left')
  clickout['clickout'] = clickout.apply(lambda x: 1 if x['item_id'] == x['reference'] else 0, axis=1)
  clickout.drop(columns='reference', inplace=True)
  clickout = clickout.reset_index(drop=True)
  return clickout

def get_session_duration(data, item_id):
  session_duration = data.groupby('session_id', sort=False).timestamp.max() - data.groupby('session_id', sort=False).timestamp.min()
  session_duration = session_duration.to_frame().rename(columns={'timestamp':'session_duration'})
  session_duration = item_id.merge(session_duration, on='session_id', how='left')
  session_duration.drop(columns='item_id', inplace=True)
  session_duration = session_duration.reset_index(drop=True)
  return session_duration

def get_item_duration(data, item_id):
  item_duration = data.groupby(['session_id', 'reference'], sort=False).timestamp.max() - data.groupby(['session_id', 'reference'], sort=False).timestamp.min()
  item_duration = item_duration.reset_index().rename(columns={'reference':'item_id', 'timestamp':'item_duration'})
  item_duration = item_id.merge(item_duration, left_on=['session_id', 'item_id'], right_on=['session_id', 'item_id'], how='left')
  item_duration = item_duration.fillna(0)
  item_duration = item_duration.reset_index(drop=True)
  return item_duration

def get_item_session_duration(item_duration, session_duration):
  item_duration['item_session_duration'] = item_duration.item_duration/session_duration.session_duration
  item_session_duration = item_duration[['session_id', 'item_id', 'item_session_duration']]
  item_duration = item_duration[['session_id', 'item_id', 'item_duration']]
  item_session_duration = item_session_duration.fillna(0)
  item_session_duration = item_session_duration.reset_index(drop=True)
  return item_session_duration

def get_item_interactions(data, item_id):
  item_interactions = data.groupby(['session_id', 'reference']).step.count().to_frame().reset_index()
  item_interactions = item_interactions.rename(columns={'reference':'item_id', 'step':'item_interactions'})
  item_interactions = item_id.merge(item_interactions, left_on=['session_id', 'item_id'], right_on=['session_id', 'item_id'], how='left')
  item_interactions = item_interactions.fillna(0)
  item_interactions = item_interactions.reset_index(drop=True)
  return item_interactions

def get_maximum_step(data, item_id):
  maximum_step = data.groupby('session_id', sort=False).step.max().to_frame().reset_index()
  maximum_step = maximum_step.rename(columns={'step':'maximum_step'})
  maximum_step = item_id.merge(maximum_step, on='session_id', how='left')
  maximum_step = maximum_step.reset_index(drop=True)
  return maximum_step

def get_top_list(item_rank):
  top_list = item_rank[['session_id', 'item_rank']]
  top_list['top_list'] = top_list.apply(lambda x: 1 if x['item_rank'] < 6 else 0, axis=1)
  top_list = top_list.reset_index(drop=True)
  return top_list

## Local Data Transformation

In [0]:
def data_transformation(data):
  global local_data
  FinalClickoutDF = get_data_clickout(data)
  item_id = get_item_id(FinalClickoutDF)
  price = get_price(FinalClickoutDF)
  item_rank = get_item_rank(FinalClickoutDF)
  price_rank = get_price_rank(price)
  clickout = get_clickout(FinalClickoutDF, item_id)
  session_duration = get_session_duration(data, item_id)
  item_duration = get_item_duration(data, item_id)
  item_session_duration = get_item_session_duration(item_duration, session_duration)
  item_interactions = get_item_interactions(data, item_id)
  maximum_step = get_maximum_step(data, item_id)
  top_list = get_top_list(item_rank)
  
  local_data = item_id.copy()
  local_data['price'] = price.price
  local_data['item_rank'] = item_rank.item_rank
  local_data['price_rank'] = price_rank.price_rank
  local_data['clickout'] = clickout.clickout
  local_data['session_duration'] = session_duration.session_duration
  local_data['item_duration'] = item_duration.item_duration
  local_data['item_session_duration'] = item_session_duration.item_session_duration
  local_data['item_interactions'] = item_interactions.item_interactions
  local_data['maximum_step'] = maximum_step.maximum_step
  local_data['top_list'] = top_list.top_list

  return local_data

In [0]:
LocalData = data_transformation(train)
LocalData

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .

Unnamed: 0,session_id,item_id,price,item_rank,price_rank,clickout,session_duration,item_duration,item_session_duration,item_interactions,maximum_step,top_list
0,aff3928535f48,55109,162,1,8,0,1025,0.0,0.0,0.0,16,1
1,aff3928535f48,129343,25,2,1,0,1025,0.0,0.0,0.0,16,1
2,aff3928535f48,54824,150,3,15,0,1025,0.0,0.0,0.0,16,1
3,aff3928535f48,2297972,143,4,5,0,1025,0.0,0.0,0.0,16,1
4,aff3928535f48,109014,101,5,12,0,1025,0.0,0.0,0.0,16,1
...,...,...,...,...,...,...,...,...,...,...,...,...
15373343,62728015bec05,2712342,73,21,6,0,571,0.0,0.0,1.0,19,0
15373344,62728015bec05,48497,169,22,19,0,571,0.0,0.0,0.0,19,0
15373345,62728015bec05,11933,87,23,10,0,571,0.0,0.0,0.0,19,0
15373346,62728015bec05,1714483,485,24,23,0,571,0.0,0.0,0.0,19,0


In [0]:
LocalData.to_csv('item_local.csv', index=False)

## Merging Global Features to Local Feautres

In [6]:
GlobalPath = '/content/drive/My Drive/Trivago/Datasets/clean_data/item_global.csv'
LocalPath = '/content/drive/My Drive/Trivago/Datasets/clean_data/item_local.csv'

GlobalData = pd.read_csv(GlobalPath)
LocalData = pd.read_csv(LocalPath)

TrainData = LocalData.merge(GlobalData, on='item_id', how='left')
TrainData

Unnamed: 0.1,session_id,item_id,price,item_rank,price_rank,clickout,session_duration,item_duration,item_session_duration,item_interactions,maximum_step,top_list,Unnamed: 0,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanPrice,AveragePriceRank
0,aff3928535f48,55109,162,1,8,0,1025,0.0,0.0,0.0,16,1,917942.0,"['Car Park', 'Restaurant', 'Cot', 'Hairdryer',...",58.0,701.0,555.0,29.0,14.0,0.019971,0.025225,0.482759,196.208274,11.975749
1,aff3928535f48,129343,25,2,1,0,1025,0.0,0.0,0.0,16,1,657696.0,"['Computer with Internet', 'Fridge', 'WiFi (Ro...",26.0,136.0,69.0,9.0,4.0,0.029412,0.057971,0.444444,74.227941,10.301471
2,aff3928535f48,54824,150,3,15,0,1025,0.0,0.0,0.0,16,1,643694.0,"['WiFi (Public Areas)', 'From 2 Stars', 'Laund...",42.0,535.0,462.0,75.0,40.0,0.074766,0.086580,0.533333,106.532710,11.181308
3,aff3928535f48,2297972,143,4,5,0,1025,0.0,0.0,0.0,16,1,173694.0,"['Business Centre', 'Gay-friendly', 'From 2 St...",38.0,20.0,24.0,3.0,1.0,0.050000,0.041667,0.333333,135.200000,8.900000
4,aff3928535f48,109014,101,5,12,0,1025,0.0,0.0,0.0,16,1,302652.0,"['Radio', 'From 4 Stars', 'Gay-friendly', 'Fro...",60.0,239.0,274.0,22.0,9.0,0.037657,0.032847,0.409091,149.991632,11.271967
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15373343,62728015bec05,2712342,73,21,6,0,571,0.0,0.0,1.0,19,0,691502.0,"['Tennis Court (Indoor)', 'Fan', 'WiFi (Rooms)...",33.0,49.0,20.0,2.0,0.0,0.000000,0.000000,0.000000,127.183673,12.959184
15373344,62728015bec05,48497,169,22,19,0,571,0.0,0.0,0.0,19,0,899161.0,"['Laundry Service', 'Childcare', 'Safe (Hotel)...",46.0,128.0,70.0,6.0,2.0,0.015625,0.028571,0.333333,142.164062,11.750000
15373345,62728015bec05,11933,87,23,10,0,571,0.0,0.0,0.0,19,0,440975.0,"['Electric Kettle', 'Conference Rooms', 'Tenni...",46.0,340.0,274.0,20.0,9.0,0.026471,0.032847,0.450000,107.011765,12.044118
15373346,62728015bec05,1714483,485,24,23,0,571,0.0,0.0,0.0,19,0,418977.0,"['Telephone', 'Hotel', 'Reception (24/7)', 'Wh...",37.0,56.0,8.0,2.0,1.0,0.017857,0.125000,0.500000,292.142857,12.589286


In [6]:
TrainData[TrainData.properties.isnull()]

Unnamed: 0.1,session_id,item_id,price,item_rank,price_rank,clickout,session_duration,item_duration,item_session_duration,item_interactions,maximum_step,top_list,Unnamed: 0,properties,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanPrice,AveragePriceRank
5421,eff296130596e,5814172,23,25,5,0,33,0.0,0.000000,0.0,2,0,,,,,,,,,,,,
6173,a1c2a8bad7e09,8964532,132,22,21,0,288,0.0,0.000000,0.0,8,0,,,,,,,,,,,,
11528,ed7b7aee5a8a3,10599618,128,12,5,1,95,4.0,0.042105,2.0,3,0,,,,,,,,,,,,
11602,fe0c86b72bf1e,5786780,140,23,16,0,0,0.0,0.000000,0.0,1,0,,,,,,,,,,,,
14231,330b8df67f204,4466176,31,7,6,0,1799,0.0,0.000000,0.0,12,0,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15366242,a541d7ae86dc3,9301238,27,8,7,0,102,0.0,0.000000,0.0,11,0,,,,,,,,,,,,
15366694,d4a08da70ef00,8994580,45,8,3,0,48,0.0,0.000000,0.0,8,0,,,,,,,,,,,,
15368097,48749a0235b0b,11249888,22,24,24,0,110,0.0,0.000000,0.0,13,0,,,,,,,,,,,,
15369782,69f6a3d291979,9424218,88,19,18,0,3456,0.0,0.000000,0.0,13,0,,,,,,,,,,,,


In [0]:
TrainData.item_id.nunique(), item_metadata.item_id.nunique()

(771204, 927142)

In [0]:
len(set(TrainData.item_id.unique()) - set(item_metadata.item_id.unique()))

706

Although item_metadata should have all the items in the training set, 706 of the items don't exits in item_metadata dataset. NaN values will be filled with mean value of each attribute.

In [15]:
TrainData.columns

Index(['session_id', 'item_id', 'price', 'item_rank', 'price_rank', 'clickout',
       'session_duration', 'item_duration', 'item_session_duration',
       'item_interactions', 'maximum_step', 'top_list', 'Unnamed: 0',
       'properties', 'NumberOfProperties', 'NumberInImpressions',
       'NumberInReferences', 'NumberAsClickout', 'NumberAsFinalClickout',
       'FClickoutToImpressions', 'FClickoutToReferences',
       'FClickoutToClickout', 'MeanPrice', 'AveragePriceRank'],
      dtype='object')

'Unnamed: 0' and 'properties' columns should be dropped.

In [7]:
TrainData.drop(columns=['Unnamed: 0', 'properties'], inplace=True)
NaNcolumns = ['NumberOfProperties', 'NumberInImpressions', 'NumberInReferences', 'NumberAsClickout', 'NumberAsFinalClickout',
              'FClickoutToImpressions', 'FClickoutToReferences', 'FClickoutToClickout', 'MeanPrice', 'AveragePriceRank']
for column in NaNcolumns:
  MeanValue = TrainData[column].mean()
  TrainData[column] =  TrainData[column].fillna(MeanValue)

#checking NaN values
TrainData[TrainData.NumberAsClickout.isnull()]

Unnamed: 0,session_id,item_id,price,item_rank,price_rank,clickout,session_duration,item_duration,item_session_duration,item_interactions,maximum_step,top_list,NumberOfProperties,NumberInImpressions,NumberInReferences,NumberAsClickout,NumberAsFinalClickout,FClickoutToImpressions,FClickoutToReferences,FClickoutToClickout,MeanPrice,AveragePriceRank


Merging properties to the TrainData as it carries valuable information such as rating, but at the same time because the dataset will be huge and session crashes, further step needs to be taken.

In [0]:
# TrainData.item_id = TrainData.item_id.apply(lambda x: str(x))
# TrainData.merge(properties_encodedDF, on='item_id', how='left')

In [0]:
TrainData.to_csv('TrainData.csv', index=False)