# **HYBRID RECOMMENDER SYSTEM**

This recommender system (RS) is built on a TripAdvisor dataset that records each user's stay: the hotel, the rating given, the hotel's location and time zone, and the type of trip (solo, family, business, etc.). The system is composed of two subsystems. The first is a context-based system that filters items by relevance and by preference for the type of trip and the item's time zone. The second is a collaborative filtering system that takes the previous result and selects the best items according to their ratings.

<img src="https://i.ibb.co/726t1V2/Diagrama-en-blanco.jpg" width="400"/>

Import libraries

In [209]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from math import sqrt
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import math
from sklearn.preprocessing import MinMaxScaler

Load the TripAdvisor dataset

In [210]:
df_trip=pd.read_csv('Data_TripAdvisor_v2.csv')
df_trip.head()

Unnamed: 0,UserID,ItemID,Rating,UserState,UserTimeZone,ItemCity,ItemState,ItemTimeZone,TripType
0,5C28F393B23BB894523AE7126A7AE445,219668,5,AK,AK,GREENSBORO,NC,EASTERN,SOLO
1,3FA27F6E8AC712A82C69C4EDD8B912CC,223860,5,AK,AK,PHOENIX,AZ,MOUNTAIN,SOLO
2,B99CFBB5411EDC8881D13B7A4B313ADA,75680,5,AK,AK,ANAHEIM,CA,PACIFIC,FAMILY
3,3FA27F6E8AC712A82C69C4EDD8B912CC,224783,5,AK,AK,SEATTLE,WA,PACIFIC,SOLO
4,7CEFF5C32BA1F3B186E7838C7D3FE25E,222984,5,AK,AK,MIAMI,MI,EASTERN,COUPLES


The dataset is already clean, so we only need to drop the columns we don't need.

In [211]:
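# Hotel lookup table: one row per hotel with its location
# (columns= replaces the positional axis argument, which pandas 2.0 no longer accepts)
df_hotel=df_trip.drop(columns=['UserID','Rating','UserState','UserTimeZone','TripType']).drop_duplicates(subset=['ItemID'])
# Keep only the columns used by the recommenders
df_trip = df_trip.drop(columns=['UserState','UserTimeZone','ItemState'])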
df_trip.head()

Unnamed: 0,UserID,ItemID,Rating,ItemCity,ItemTimeZone,TripType
0,5C28F393B23BB894523AE7126A7AE445,219668,5,GREENSBORO,EASTERN,SOLO
1,3FA27F6E8AC712A82C69C4EDD8B912CC,223860,5,PHOENIX,MOUNTAIN,SOLO
2,B99CFBB5411EDC8881D13B7A4B313ADA,75680,5,ANAHEIM,PACIFIC,FAMILY
3,3FA27F6E8AC712A82C69C4EDD8B912CC,224783,5,SEATTLE,PACIFIC,SOLO
4,7CEFF5C32BA1F3B186E7838C7D3FE25E,222984,5,MIAMI,EASTERN,COUPLES


In [212]:
df_hotel.head()

Unnamed: 0,ItemID,ItemCity,ItemState,ItemTimeZone
0,219668,GREENSBORO,NC,EASTERN
1,223860,PHOENIX,AZ,MOUNTAIN
2,75680,ANAHEIM,CA,PACIFIC
3,224783,SEATTLE,WA,PACIFIC
4,222984,MIAMI,MI,EASTERN


## **Context Based RS**

https://www.datacamp.com/community/tutorials/recommender-systems-python

The first step is to relate each item (hotel) to its preferred type of trip.

In [213]:
df_item = df_trip[['ItemID','Rating','TripType']]
df_item.head()

Unnamed: 0,ItemID,Rating,TripType
0,219668,5,SOLO
1,223860,5,SOLO
2,75680,5,FAMILY
3,224783,5,SOLO
4,222984,5,COUPLES


Group by ItemID and TripType to count how many votes each observed combination has received.

In [214]:
df_count=df_item.groupby(['ItemID','TripType'],sort=False).size().reset_index(name='Count')
df_count=df_count[df_count['Count'] >= 1]
df_count.head()

Unnamed: 0,ItemID,TripType,Count
0,219668,SOLO,2
1,223860,SOLO,4
2,75680,FAMILY,22
3,224783,SOLO,12
4,222984,COUPLES,4


Assign a weight to each combination to estimate the overall preference for that context. We take into account the number of votes and the overall mean rating by applying this formula:

\begin{equation}
\text{Weight} = \left(\frac{v}{v + p} \cdot R\right) + \left(\frac{p}{v + p} \cdot \text{Tmean}\right)
\end{equation}

Where

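- v is the number of votes for the hotel (in the code below, the count for each hotel and trip type combination)
- p is the minimum number of votes required to be considered (the 75th percentile; the code below takes it from the number of ratings per user)
- R is the average rating of the hotel for the current context value (note that the code below uses the hotel's mean rating over all trip types)
- Tmean is the mean rating across the whole dataset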
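
As a quick sanity check of the formula, here is a minimal sketch with made-up numbers. The function name `weighted_rating` and the values passed to it are illustrative assumptions and do not come from the notebook.

```python
def weighted_rating(v, p, r, tmean):
    """Blend a hotel's own mean rating with the global mean.

    v: number of votes, p: vote threshold,
    r: hotel mean rating, tmean: global mean rating.
    """
    return (v / (v + p)) * r + (p / (v + p)) * tmean

# Hypothetical values: with few votes the score stays close to the global mean,
# with many votes the hotel's own rating dominates.
few_votes = weighted_rating(v=2, p=8, r=5.0, tmean=4.0)
many_votes = weighted_rating(v=72, p=8, r=5.0, tmean=4.0)
print(few_votes, many_votes)
```

With these inputs the first score is about 4.2 and the second about 4.9, which shows how the threshold p damps hotels that have only a handful of votes.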

In [215]:
item_list=list(df_item['ItemID'].unique())
context_list=list(df_item['TripType'].unique())
m={}  # mean rating per hotel
w=[]  # weight per (hotel, trip type) combination
# Vote threshold p: 75th percentile of the number of ratings per user
p75=df_trip['UserID'].value_counts().quantile(0.75)
p90=df_trip['UserID'].value_counts().quantile(0.90)  # not used below
tmean=df_trip['Rating'].mean()  # global mean rating
for i in item_list:
  mean=df_item[df_item['ItemID']==i]['Rating'].mean()
  m[i]=mean
for (i,v) in zip(df_count['ItemID'].tolist(),df_count['Count'].tolist()):
  r=m[i]
  weight=(v/(v+p75)*r)+(p75/(v+p75)*tmean)
  w.append(weight)
df_count['Weight']=w
df_trip_weighted=df_count[df_count['Count']>=p75] # keep only combinations with at least p75 votes

In [216]:
df_trip_weighted.head()

Unnamed: 0,ItemID,TripType,Count,Weight
2,75680,FAMILY,22,4.223098
3,224783,SOLO,12,3.939766
12,224305,COUPLES,6,4.335482
17,122343,BUSINESS,13,4.307084
22,1724006,COUPLES,7,3.792373


Now let's find how users respond to each type of trip. First we create a rating table with items as rows and users as columns.

In [217]:
rating = pd.pivot_table(df_trip, values='Rating', index=['ItemID'], columns=['UserID'])
rating.sort_index(axis=1, inplace=True)
rating.head()

ItemID,002F55BB8DD9A8C7DD01C3C939D378A5,003BC319571635C677EEFC610BD066F5,005A406ACD437714CC6CBE74F9AD7215,009119643396C998B80B416F1A3AB288,00BBAC3339F576E164FF9F627489481C,00D2B226DD86C30EDFF4B612FCF45131,00D673CA0747712BD29890CB31E3C58D,00E09D6FFC5B7A7D11ADAE4E2CAB809B,00E18DD2BE3B6358DE85FC864AC0283A,00E74A1E0E2A09F913F92518CFDFDD05,00E806D3BA2A15ABCEAD8F5906C15AE8,00E9D7D0CF672EE410016849C9B94F87,012BE838477D08A1A3D0A5E9D8D0922E,015DEC2DABEA8A0987A5360168B75E9B,01734D386264A7F3355F15A5126FB60A,019EF758BBDCA60967E098686B8ABEBC,01A468322D428D652628E35840150FA1,01C78A65257ABE74BD72CEC5EC4F1983,01ED60735C3F50974C5F774A648FD8B7,01F6894476E32CCCE95F0F938B472F75,01F7E9C238D7362D35F57FC4F3B865B6,02201BE6448728BE4D5ECB380D07B009,024174908A7172EA22CE177E22752987,02860EA0ED535DE587635F1F8FC7D0C0,02A73D1D26D5ECC71522223FA8861FF0,02B58A1231CECF055444C221DE3D1029,02E57B7C5ED39707E8AA69027FA30A2B,02EE1392650D0D3CF79A9CB2C0A8F221,02F480C881EACB4B8526CFEB058343D2,02F73A7A5A98DAC5FFA89A04D525CF93,036EEC64945959E1A09650741F3A9B12,037A9F7DBFE3D20F055B0F319A59BAE1,03B86B0D52E86BB5ADA0A9E78F0BF8CE,03D8882A1AFF6DB73D1FA6FA99A28051,03DFFC8FE4644524443D495F166A6B8A,03F739DB20C40E8D64B8C7AAFFF7CBB8,03F8DC0051F3D0EFFB5C49D556A40CE2,03FE6ACAB8512D185D48B75A440A0072,041CAB2C40375D502E116882562D8BB5,04260B123404A3BB64108C0210CF2079,...,FB7315194C4CE14AE2909AB6127C4A04,FB7CA977608E02497B146145FDE14EC4,FB8369036D70382F43156C7C1F749C8E,FB85200105F77A5CAB662E944D984E6A,FB9E300240C27E2894F8AA03AD896A01,FBADE2112793ACB2666E129FBF015B4B,FBAEACC07EE3786BEAAFADD7111996C1,FBC2EEDF6BCCA5852DFDA8DA705BC96B,FBC6DB453816F24A7F99411A20767448,FBD0E620F59CB0DD6D7DAD33474D32BD,FBE33FB68A984F85990B0C95758E1D32,FBE6337895DDB52EB0F7D3EE4EF114A1,FC25E445C8851E2AA658F0F9F5AD58FE,FC272D70711672328080E2ED75EB4E7F,FC59B23F08C1D0626DD57D8A497EA818,FC64E43D5C00D1F9691CBA3C40E4A8AB,FC8E0C6482EFA6CB136A6B4C9B251DA6,FC946FDFB367B4AF04A90BC6297F3FAC,FD075D1EDE01C67F36405B865E84E6B1,FD1B8050CF28BAF24F1A1C7E60D9A092,FD394B9D169121552E8EF42B0D650307,FD5BB4422733FFF591B9374AD3A64F56,FD686E6255EEFF417C5AB93BDE9D8CDF,FD6A44B5FDFA1135F91D4346000C05E1,FDC8D468CBEA76625CC036FCCB9D8E35,FE0BFEE3723AE0EFF37B8D3F19D82D57,FE167F1D245D8877DE5028DF149D9916,FE3F016ACF4B3F2BD27A68A15CE3512E,FE79C819C9526373E94432F74C294806,FE93839BA9AF951C03FA3C62FA548F0B,FE98618B59B6F7F49EE93EF7AA649D7D,FEB5DDEAB459230395D74EA02713F286,FF01119B860DA0A0CD2C85C73064725F,FF3A3C830995AFC032364ADD672B612E,FF87316F9B085A2C5E349F6F1FCA05B3,FFA89E2DBCE6DC46F0F5700D819D1284,FFC660EB649C053734777A39B29D5D93,FFD4710D6E92EE3D9192E5716CCCAEA8,FFD87ACF4AB64A9C67B9F638C4289B5F,FFDC7B441CB2676EC8CA21CFB8EB5803
72389,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
72396,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
72572,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,
72670,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
72993,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
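
To make the shape of this table concrete, the following minimal sketch pivots a tiny, made-up set of ratings in the same way. The frame `toy`, its IDs, and the name `toy_rating` are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd

# Toy ratings (hypothetical IDs), in the same long format as df_trip
toy = pd.DataFrame({
    'UserID': ['u1', 'u1', 'u2', 'u3'],
    'ItemID': [10, 20, 10, 30],
    'Rating': [5, 3, 4, 2],
})

# Rows are hotels, columns are users; user/hotel pairs without a rating become NaN
toy_rating = pd.pivot_table(toy, values='Rating', index='ItemID', columns='UserID')
print(toy_rating)
```

Because most users rate only a few hotels, the real table is almost entirely NaN, as the output above shows. Note that `pivot_table` averages the ratings by default if a user has rated the same hotel more than once.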


Create the item profiles: for each hotel, count its reviews per type of trip.

In [218]:
# Work on a copy so the in-place operations below do not raise SettingWithCopyWarning
item_profile=df_trip[['ItemID','TripType']].copy()
trip_type=pd.get_dummies(item_profile['TripType'])
item_profile.drop(columns='TripType', inplace=True)
item_profile=pd.concat([item_profile, trip_type], axis=1)
# One row per hotel, with the number of reviews for each trip type
item_profile=item_profile.groupby('ItemID').sum()
item_profile.head()


ItemID,BUSINESS,COUPLES,FAMILY,FRIENDS,SOLO
72389,1,1,0,0,0
72396,0,1,1,0,0
72572,0,3,0,0,2
72670,1,0,0,0,0
72993,1,2,0,1,1
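
The same construction can be tried on a handful of made-up reviews. This is only a sketch: `toy` and `toy_profile` are hypothetical names, and the one-line form is a compact variant of the cell above rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical reviews: hotel 10 has two SOLO reviews and one FAMILY review
toy = pd.DataFrame({
    'ItemID':   [10, 10, 10, 20],
    'TripType': ['SOLO', 'SOLO', 'FAMILY', 'BUSINESS'],
})

# One indicator column per trip type, then one row per hotel with review counts
toy_profile = pd.get_dummies(toy['TripType']).groupby(toy['ItemID']).sum()
print(toy_profile)
```

Each cell of the result is the number of reviews that a hotel received for that trip type, which is how the counts in the table above should be read.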


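Create the user profiles: for each user, weight the item profiles by the user's ratings and average the result per type of trip.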

In [219]:
users=rating.columns
df_users=pd.DataFrame(columns= context_list)
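for i in tqdm(range(len(users))):
    # Weight each hotel's trip-type counts by this user's rating of the hotel
    working_df = item_profile.mul(rating.iloc[:,i], axis=0)
    working_df.replace(0, np.nan, inplace=True)  # np.nan instead of np.NaN, which NumPy 2.0 removed
    df_users.loc[users[i]] = working_df.mean(axis=0)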
df_users.head()

100%|██████████| 2371/2371 [00:08<00:00, 287.26it/s]


Unnamed: 0,SOLO,FAMILY,COUPLES,BUSINESS,FRIENDS
002F55BB8DD9A8C7DD01C3C939D378A5,14.4,20.0,9.4,15.6,4.0
003BC319571635C677EEFC610BD066F5,25.4,19.25,51.2,31.25,5.0
005A406ACD437714CC6CBE74F9AD7215,10.2,18.0,30.333333,8.5,7.333333
009119643396C998B80B416F1A3AB288,25.0,22.8,54.0,40.666667,15.0
00BBAC3339F576E164FF9F627489481C,30.25,16.5,38.2,36.0,11.0


Measure relevance by means of Term Frequency (TF) and Inverse Document Frequency (IDF), and then create the TF-IDF matrix.

In [220]:
tf = item_profile.sum()
idf = (len(df_trip)/tf).apply(np.log)
df_TF_IDF=item_profile.mul(idf.values)
df_TF_IDF.head()

ItemID,BUSINESS,COUPLES,FAMILY,FRIENDS,SOLO
72389,1.624006,1.103705,0.0,0.0,0.0
72396,0.0,1.103705,1.579554,0.0,0.0
72572,0.0,3.311114,0.0,0.0,3.107736
72670,1.624006,0.0,0.0,0.0,0.0
72993,1.624006,2.207409,0.0,2.923289,1.553868


Predict each user's score for every item from the TF-IDF matrix and normalize the values using the Z-score:

<img src="https://www.datavedas.com/wp-content/uploads/2018/01/1.1.2.1.2-Z-Scores-Z-Test-and-Probability-Distribution-1.jpg" width="200"/>


In [221]:
df_predict=pd.DataFrame()
for i in tqdm(range(len(users))):
  working_df=df_TF_IDF.mul(df_users.iloc[i], axis=1)
  df_predict[users[i]]=working_df.sum(axis=1)
df_predict=(df_predict-df_predict.stack().mean())/df_predict.stack().std()
df_predict.stack().quantile(0.97)

100%|██████████| 2371/2371 [00:06<00:00, 365.04it/s]


2.54576241429525

In [222]:
df_predict.head()

ItemID,002F55BB8DD9A8C7DD01C3C939D378A5,003BC319571635C677EEFC610BD066F5,005A406ACD437714CC6CBE74F9AD7215,009119643396C998B80B416F1A3AB288,00BBAC3339F576E164FF9F627489481C,00D2B226DD86C30EDFF4B612FCF45131,00D673CA0747712BD29890CB31E3C58D,00E09D6FFC5B7A7D11ADAE4E2CAB809B,00E18DD2BE3B6358DE85FC864AC0283A,00E74A1E0E2A09F913F92518CFDFDD05,00E806D3BA2A15ABCEAD8F5906C15AE8,00E9D7D0CF672EE410016849C9B94F87,012BE838477D08A1A3D0A5E9D8D0922E,015DEC2DABEA8A0987A5360168B75E9B,01734D386264A7F3355F15A5126FB60A,019EF758BBDCA60967E098686B8ABEBC,01A468322D428D652628E35840150FA1,01C78A65257ABE74BD72CEC5EC4F1983,01ED60735C3F50974C5F774A648FD8B7,01F6894476E32CCCE95F0F938B472F75,01F7E9C238D7362D35F57FC4F3B865B6,02201BE6448728BE4D5ECB380D07B009,024174908A7172EA22CE177E22752987,02860EA0ED535DE587635F1F8FC7D0C0,02A73D1D26D5ECC71522223FA8861FF0,02B58A1231CECF055444C221DE3D1029,02E57B7C5ED39707E8AA69027FA30A2B,02EE1392650D0D3CF79A9CB2C0A8F221,02F480C881EACB4B8526CFEB058343D2,02F73A7A5A98DAC5FFA89A04D525CF93,036EEC64945959E1A09650741F3A9B12,037A9F7DBFE3D20F055B0F319A59BAE1,03B86B0D52E86BB5ADA0A9E78F0BF8CE,03D8882A1AFF6DB73D1FA6FA99A28051,03DFFC8FE4644524443D495F166A6B8A,03F739DB20C40E8D64B8C7AAFFF7CBB8,03F8DC0051F3D0EFFB5C49D556A40CE2,03FE6ACAB8512D185D48B75A440A0072,041CAB2C40375D502E116882562D8BB5,04260B123404A3BB64108C0210CF2079,...,FB7315194C4CE14AE2909AB6127C4A04,FB7CA977608E02497B146145FDE14EC4,FB8369036D70382F43156C7C1F749C8E,FB85200105F77A5CAB662E944D984E6A,FB9E300240C27E2894F8AA03AD896A01,FBADE2112793ACB2666E129FBF015B4B,FBAEACC07EE3786BEAAFADD7111996C1,FBC2EEDF6BCCA5852DFDA8DA705BC96B,FBC6DB453816F24A7F99411A20767448,FBD0E620F59CB0DD6D7DAD33474D32BD,FBE33FB68A984F85990B0C95758E1D32,FBE6337895DDB52EB0F7D3EE4EF114A1,FC25E445C8851E2AA658F0F9F5AD58FE,FC272D70711672328080E2ED75EB4E7F,FC59B23F08C1D0626DD57D8A497EA818,FC64E43D5C00D1F9691CBA3C40E4A8AB,FC8E0C6482EFA6CB136A6B4C9B251DA6,FC946FDFB367B4AF04A90BC6297F3FAC,FD075D1EDE01C67F36405B865E84E6B1,FD1B8050CF28BAF24F1A1C7E60D9A092,FD394B9D169121552E8EF42B0D650307,FD5BB4422733FFF591B9374AD3A64F56,FD686E6255EEFF417C5AB93BDE9D8CDF,FD6A44B5FDFA1135F91D4346000C05E1,FDC8D468CBEA76625CC036FCCB9D8E35,FE0BFEE3723AE0EFF37B8D3F19D82D57,FE167F1D245D8877DE5028DF149D9916,FE3F016ACF4B3F2BD27A68A15CE3512E,FE79C819C9526373E94432F74C294806,FE93839BA9AF951C03FA3C62FA548F0B,FE98618B59B6F7F49EE93EF7AA649D7D,FEB5DDEAB459230395D74EA02713F286,FF01119B860DA0A0CD2C85C73064725F,FF3A3C830995AFC032364ADD672B612E,FF87316F9B085A2C5E349F6F1FCA05B3,FFA89E2DBCE6DC46F0F5700D819D1284,FFC660EB649C053734777A39B29D5D93,FFD4710D6E92EE3D9192E5716CCCAEA8,FFD87ACF4AB64A9C67B9F638C4289B5F,FFDC7B441CB2676EC8CA21CFB8EB5803
72389,-0.523456,-0.262515,-0.481247,-0.195472,-0.286709,-0.275669,-0.394789,-0.586495,-0.498086,-0.132115,-0.467222,-0.519706,-0.270034,-0.48451,-0.583648,-0.457748,-0.48317,-0.51101,-0.361519,-0.394755,-0.399398,-0.341257,-0.290394,-0.337725,-0.278007,-0.400902,-0.522141,-0.210231,-0.565497,-0.344455,-0.491705,-0.518489,-0.566513,-0.480907,-0.470418,-0.585834,-0.202269,-0.538941,-0.365487,-0.273158,...,-0.34931,-0.458129,-0.616655,-0.418072,-0.36057,-0.482382,-0.14903,-0.221515,-0.294875,-0.360676,-0.545985,-0.420688,-0.34987,-0.331359,-0.491157,-0.427832,-0.481353,-0.391851,-0.299273,-0.434961,-0.568526,-0.485703,-0.567816,-0.57971,-0.507876,-0.159591,-0.215408,-0.509558,-0.448691,-0.41695,-0.370684,-0.32197,-0.592504,-0.583936,-0.582757,-0.523574,-0.249012,-0.073572,-0.189492,-0.534005
72396,-0.500638,-0.336707,-0.4279,-0.304987,-0.404876,-0.382639,-0.413909,-0.544182,-0.49283,-0.179085,-0.536589,-0.507681,-0.497365,-0.542762,-0.515962,-0.368822,-0.512733,-0.573731,-0.46118,-0.509735,-0.467968,-0.433746,-0.51058,-0.40967,-0.382757,-0.522626,-0.512822,-0.269832,-0.549518,-0.361672,-0.532877,-0.543396,-0.588045,-0.495549,-0.508898,-0.582804,-0.556938,-0.534558,-0.485084,-0.349278,...,-0.443582,-0.526792,-0.537159,-0.498452,-0.462328,-0.513973,-0.457649,-0.385066,-0.436805,-0.524021,-0.508392,-0.485065,-0.46632,-0.470525,-0.504785,-0.40773,-0.527313,-0.421745,-0.36408,-0.427357,-0.50309,-0.450253,-0.210662,-0.599108,-0.527431,-0.294362,-0.509322,-0.523592,-0.470385,-0.399544,-0.353503,-0.479064,-0.283738,-0.204712,-0.541723,-0.591501,-0.49566,-0.096764,-0.403345,-0.563361
72572,-0.37697,0.252455,-0.171792,0.281733,0.150443,0.014856,-0.183967,-0.500724,-0.128091,0.54522,-0.24592,-0.173213,-0.317108,-0.322196,-0.402016,-0.276728,-0.219572,-0.455268,-0.119086,-0.132788,-0.164176,0.024336,-0.299334,-0.03304,0.025353,-0.210336,-0.244154,0.369488,-0.44447,-0.002563,-0.28488,-0.370258,-0.514714,-0.338546,-0.180826,-0.412031,-0.435378,-0.272338,-0.106994,0.078606,...,-0.271913,-0.242608,-0.386186,-0.178537,-0.173293,-0.290489,-0.035596,0.180257,-0.026588,-0.318223,-0.31364,-0.126925,-0.02444,0.136318,-0.241243,0.004027,-0.333905,-0.054155,-0.11811,-0.105051,-0.339706,-0.195435,-0.343941,-0.558997,-0.270706,0.336073,-0.308594,-0.399235,-0.096846,-0.077796,-0.003753,-0.186878,-0.379469,-0.528377,-0.415074,-0.465601,-0.105448,0.609625,0.209712,-0.342607
72670,-0.561292,-0.468603,-0.603343,-0.412831,-0.44047,-0.428625,-0.496312,-0.621111,-0.582614,-0.410857,-0.539675,-0.629995,-0.345707,-0.545893,-0.653686,-0.535233,-0.583799,-0.554481,-0.464161,-0.493774,-0.501177,-0.48415,-0.368213,-0.46653,-0.432573,-0.485877,-0.608279,-0.44047,-0.620124,-0.498216,-0.556554,-0.585575,-0.582614,-0.539675,-0.57373,-0.624073,-0.21837,-0.603343,-0.467122,-0.45429,...,-0.436522,-0.52857,-0.653686,-0.505619,-0.461199,-0.551816,-0.274212,-0.407566,-0.393089,-0.438496,-0.624073,-0.511542,-0.468109,-0.473045,-0.576691,-0.598901,-0.553001,-0.508581,-0.425663,-0.556291,-0.624073,-0.598408,-0.653686,-0.597421,-0.581133,-0.395063,-0.298326,-0.561885,-0.576691,-0.553001,-0.551816,-0.407303,-0.653686,-0.61815,-0.627034,-0.539675,-0.359033,-0.409672,-0.38914,-0.598408
72993,-0.361372,0.140818,-0.223167,0.323474,0.155747,0.126049,-0.119006,-0.474017,-0.224247,0.541189,-0.256921,-0.274201,-0.070284,-0.162888,-0.434196,-0.248309,-0.273783,-0.382897,-0.102896,-0.132642,-0.138012,0.022258,-0.045516,-0.040276,0.069518,-0.168409,-0.22185,0.342599,-0.440228,0.039267,-0.27576,-0.357013,-0.473094,-0.310076,-0.154157,-0.436151,-0.015877,-0.343154,-0.066808,0.066389,...,-0.148724,-0.226509,-0.50142,-0.164569,-0.122713,-0.266204,0.154727,0.217038,0.036199,-0.231854,-0.372361,-0.155826,-0.012632,0.06032,-0.285058,-0.099222,-0.311088,-0.041707,0.047467,-0.125359,-0.332698,-0.195658,-0.413234,-0.509237,-0.265061,0.302841,-0.084321,-0.408495,-0.124995,-0.138394,-0.082978,-0.04239,-0.432681,-0.506405,-0.432284,-0.394938,0.060715,0.629908,0.323621,-0.33604
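
The prediction step multiplies each hotel's TF-IDF profile by a user's profile and sums over trip types, which amounts to a matrix product. The sketch below shows that view on made-up numbers; `tfidf`, `profiles`, `scores` and `z` are hypothetical names, and filling missing profile entries with 0 is an assumption meant to mimic how `sum(axis=1)` skips NaN values in the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical TF-IDF item profiles (hotels x trip types)
tfidf = pd.DataFrame([[1.6, 0.0], [0.0, 1.1]],
                     index=[10, 20], columns=['BUSINESS', 'COUPLES'])
# Hypothetical user profiles (users x trip types)
profiles = pd.DataFrame([[4.0, 2.0], [1.0, 5.0]],
                        index=['u1', 'u2'], columns=['BUSINESS', 'COUPLES'])

# Score of every hotel for every user: sum over trip types of tfidf * profile
scores = tfidf.dot(profiles.fillna(0).T)

# Global z-score over all entries (sample standard deviation, the pandas default)
values = scores.to_numpy()
z = (scores - values.mean()) / values.std(ddof=1)
print(z)
```

After this normalization the scores have mean 0 and standard deviation 1 over the whole table, so a value such as the 0.97 quantile printed above can be read as a number of standard deviations above the average score.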


Define the context-based RS function. It returns a set of recommendations, skipping hotels the user has already visited, and checks whether the user is new. Recommendations for a new user come from the df_trip_weighted dataframe, which assesses the overall preference based on the type of trip.

In [223]:
def context_based(user,trip_type_arg='FAMILY'):
  if user in df_trip['UserID'].values:
    # Known user: rank hotels by predicted score, excluding those already visited
    user_predicted_rating = df_predict[user]
    user_rating_hotel= pd.concat([user_predicted_rating,df_hotel.set_index('ItemID')], axis=1)
    already_gone=df_trip[df_trip['UserID'].isin([user])]['ItemID']
    all_rec=user_rating_hotel[~user_rating_hotel.index.isin(already_gone)]
    return all_rec.sort_values(by=[user],ascending=False)#.iloc[0:10]#
  else:
    # New user: fall back to the weighted ranking for the requested trip type
    rec_new_user=df_trip_weighted.sort_values(by=['Weight'],ascending=False)
    rec_new_user=rec_new_user[rec_new_user['TripType']==trip_type_arg]
    rec_new_user=rec_new_user.merge(df_hotel).set_index('ItemID')
    rec_new_user=rec_new_user.drop(columns=['TripType','Count'])
    return rec_new_user.sort_values(by=['Weight'],ascending=False)#.iloc[0:100]

Get recommendations for a given user and type of trip

In [224]:
userCF = '286008B04EC788EEA27081EA16850984'  #286008B04EC788EEA27081EA16850984  #01F7E9C238D7362D35F57FC4F3B865B6 #114364637642AD6AA2BE8CE34E02BDEB
context = context_based(userCF,'SOLO')
context.head(10)

ItemID,286008B04EC788EEA27081EA16850984,ItemCity,ItemState,ItemTimeZone
675000,15.166946,LASVEGAS,NV,PACIFIC
97704,14.288022,LASVEGAS,NV,PACIFIC
503598,13.84856,LASVEGAS,NV,PACIFIC
611947,13.530246,NEWYORK,NY,EASTERN
84087,12.713842,WASHINGTONDC,DC,EASTERN
91925,12.090712,LASVEGAS,NV,PACIFIC
93450,11.032175,NEWYORK,NY,EASTERN
91674,10.839573,LASVEGAS,NV,PACIFIC
224783,10.711935,SEATTLE,WA,PACIFIC
482154,10.337089,SANDIEGO,CA,PACIFIC


# Collaborative Filtering

In [225]:
#df_trip.head()

Extract the target user's ratings from the dataset

In [226]:
#userInput = [
#            {'ItemID':219668, 'Rating':4.5, 'TripType':'SOLO'},
#            {'ItemID':75680, 'Rating':5, 'TripType':'FAMILY'},
#            {'ItemID':222984, 'Rating':3.5, 'TripType':'COUPLES'},
#            {'ItemID':89361, 'Rating':3.1, 'TripType':'BUSINESS'},
#            {'ItemID':72993, 'Rating':4.5, 'TripType':'FRIENDS'},
#        ]

userInput = df_trip['UserID'] == userCF
inputHotel = df_trip[userInput]
inputHotel

Unnamed: 0,UserID,ItemID,Rating,ItemCity,ItemTimeZone,TripType
23,286008B04EC788EEA27081EA16850984,1724006,5,ATLANTA,EASTERN,COUPLES
28,286008B04EC788EEA27081EA16850984,102466,5,PHILADELPHIA,EASTERN,BUSINESS
34,286008B04EC788EEA27081EA16850984,88823,5,LEXINGTON,CENTRAL,SOLO
40,286008B04EC788EEA27081EA16850984,89377,5,ATLANTA,EASTERN,COUPLES
41,286008B04EC788EEA27081EA16850984,85380,5,ORLANDO,EASTERN,SOLO
6020,286008B04EC788EEA27081EA16850984,249126,4,PHILADELPHIA,EASTERN,BUSINESS
6027,286008B04EC788EEA27081EA16850984,578192,4,DENVER,MOUNTAIN,BUSINESS
6032,286008B04EC788EEA27081EA16850984,114764,4,MILWAUKEE,CENTRAL,SOLO


Find every rating, by any user, of the hotels the target user has visited

In [227]:
userSubset = df_trip[df_trip['ItemID'].isin(inputHotel['ItemID'].tolist())]
userSubset.head()

Unnamed: 0,UserID,ItemID,Rating,ItemCity,ItemTimeZone,TripType
23,286008B04EC788EEA27081EA16850984,1724006,5,ATLANTA,EASTERN,COUPLES
28,286008B04EC788EEA27081EA16850984,102466,5,PHILADELPHIA,EASTERN,BUSINESS
34,286008B04EC788EEA27081EA16850984,88823,5,LEXINGTON,CENTRAL,SOLO
40,286008B04EC788EEA27081EA16850984,89377,5,ATLANTA,EASTERN,COUPLES
41,286008B04EC788EEA27081EA16850984,85380,5,ORLANDO,EASTERN,SOLO


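Group these ratings by user and sort the groups so that users sharing the most hotels with the target user come first; the first 3 groups are shown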

In [228]:
# Group by a plain column name: with a one-element list, pandas 2.x yields 1-tuples as group keys
userSubsetGroup = userSubset.groupby('UserID')
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]),reverse=True)
userSubsetGroup[0:3]

[('286008B04EC788EEA27081EA16850984',
                                  UserID   ItemID  ...  ItemTimeZone  TripType
  23    286008B04EC788EEA27081EA16850984  1724006  ...       EASTERN   COUPLES
  28    286008B04EC788EEA27081EA16850984   102466  ...       EASTERN  BUSINESS
  34    286008B04EC788EEA27081EA16850984    88823  ...       CENTRAL      SOLO
  40    286008B04EC788EEA27081EA16850984    89377  ...       EASTERN   COUPLES
  41    286008B04EC788EEA27081EA16850984    85380  ...       EASTERN      SOLO
  6020  286008B04EC788EEA27081EA16850984   249126  ...       EASTERN  BUSINESS
  6027  286008B04EC788EEA27081EA16850984   578192  ...      MOUNTAIN  BUSINESS
  6032  286008B04EC788EEA27081EA16850984   114764  ...       CENTRAL      SOLO
  
  [8 rows x 6 columns]),
 ('8D74835A8F3F2B1F50FA5A58254A89D1',
                                  UserID  ItemID  ...  ItemTimeZone  TripType
  4152  8D74835A8F3F2B1F50FA5A58254A89D1   88823  ...       CENTRAL      SOLO
  4189  8D74835A8F3F2B1F50FA5

In [229]:
#userSubsetGroup = userSubsetGroup[0:100]

Pearson coefficient calculation

In [230]:
pearsonCorrelationDict = {}
for name, group in userSubsetGroup:
    group = group.sort_values(by='ItemID')
    inputHotel = inputHotel.sort_values(by='ItemID')
    nRatings = len(group)
    temp_df = inputHotel[inputHotel['ItemID'].isin(group['ItemID'].tolist())]
    tempRatingList = temp_df['Rating'].tolist()
    tempGroupList = group['Rating'].tolist()
    
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(abs(Sxx*Syy))
    else:
        pearsonCorrelationDict[name] = 0

pearsonCorrelationDict.items()

dict_items([('286008B04EC788EEA27081EA16850984', 1.0), ('8D74835A8F3F2B1F50FA5A58254A89D1', 0.9999999999999972), ('0BEA942204D75F2C83E29BDC54AA627C', 1.0), ('2DD913D1FE05AF081729771A0B81FF11', 1.0), ('343F5AA3FBDB0794C3D5A8F6FF222BF2', 0), ('3E81076073AFB11D56C822FD66ECF7AA', 0), ('54F9645134FF4CB921C3F845C8C06497', 1.0), ('65AAC318AAE7741E1A73CCC7492808BB', 0), ('68487B40EC5ED9FA2F73DACFC06C0078', 1.0), ('743C005A124C44B0CC449550BBB7A62E', 0), ('86C8BCE8FCF7E9B237C05FAEA73BBFE9', 0), ('E4BD9A4CD7872825F3585ECFFF4074B7', 0), ('F931CA7AD30AEBF56ED69D4EA02E8915', 0), ('00D2B226DD86C30EDFF4B612FCF45131', 0), ('01F6894476E32CCCE95F0F938B472F75', 0), ('04260B123404A3BB64108C0210CF2079', 0), ('047C4E52A3B90705AD7CEA7E92B470DE', 0), ('05A7527A96BC315E9ABECE659DB3687E', 0), ('07038C23A2EC9ABFBC1E489527E5D0AB', 0), ('08ED513F301A2C8591443C9A7F8A3849', 0), ('0981EA43801682A95BC06A103319E6B4', 0), ('0E3B05FCB645534C918C932D09CB8649', 0), ('1248A1751B3B2E37E492E7042EB90DAF', 0), ('135B28285CE0B835
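
The loop above uses the sums-of-squares form of the Pearson coefficient, r = Sxy / sqrt(Sxx * Syy). The following standalone sketch applies the same formula to made-up rating lists; the function name `pearson` and the sample lists are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (0 if undefined)."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    if sxx == 0 or syy == 0:
        # No variance in one of the lists, so the correlation is undefined
        return 0.0
    return sxy / sqrt(sxx * syy)

aligned = pearson([5, 4, 3], [4, 3, 2])   # ratings move together
opposed = pearson([5, 4, 3], [1, 2, 3])   # ratings move in opposite directions
flat = pearson([5, 5, 5], [4, 3, 2])      # first list has no variance
print(aligned, opposed, flat)
```

The zero-variance branch is a plausible reason for the many 0 entries in the output above: a user who shares a single hotel with the target user, or who gave every shared hotel the same rating, always has Sxx or Syy equal to 0.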

Similar users

In [231]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['UserID'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,UserID
0,1.0,286008B04EC788EEA27081EA16850984
1,1.0,8D74835A8F3F2B1F50FA5A58254A89D1
2,1.0,0BEA942204D75F2C83E29BDC54AA627C
3,1.0,2DD913D1FE05AF081729771A0B81FF11
4,0.0,343F5AA3FBDB0794C3D5A8F6FF222BF2


The 50 most similar users

In [232]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,UserID
0,1.0,286008B04EC788EEA27081EA16850984
2,1.0,0BEA942204D75F2C83E29BDC54AA627C
3,1.0,2DD913D1FE05AF081729771A0B81FF11
6,1.0,54F9645134FF4CB921C3F845C8C06497
8,1.0,68487B40EC5ED9FA2F73DACFC06C0078


In [233]:
topUsersRating=topUsers.merge(df_trip, left_on='UserID', right_on='UserID', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,UserID,ItemID,Rating,ItemCity,ItemTimeZone,TripType
0,1.0,286008B04EC788EEA27081EA16850984,1724006,5,ATLANTA,EASTERN,COUPLES
1,1.0,286008B04EC788EEA27081EA16850984,102466,5,PHILADELPHIA,EASTERN,BUSINESS
2,1.0,286008B04EC788EEA27081EA16850984,88823,5,LEXINGTON,CENTRAL,SOLO
3,1.0,286008B04EC788EEA27081EA16850984,89377,5,ATLANTA,EASTERN,COUPLES
4,1.0,286008B04EC788EEA27081EA16850984,85380,5,ORLANDO,EASTERN,SOLO


In [234]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['Rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,UserID,ItemID,Rating,ItemCity,ItemTimeZone,TripType,weightedRating
0,1.0,286008B04EC788EEA27081EA16850984,1724006,5,ATLANTA,EASTERN,COUPLES,5.0
1,1.0,286008B04EC788EEA27081EA16850984,102466,5,PHILADELPHIA,EASTERN,BUSINESS,5.0
2,1.0,286008B04EC788EEA27081EA16850984,88823,5,LEXINGTON,CENTRAL,SOLO,5.0
3,1.0,286008B04EC788EEA27081EA16850984,89377,5,ATLANTA,EASTERN,COUPLES,5.0
4,1.0,286008B04EC788EEA27081EA16850984,85380,5,ORLANDO,EASTERN,SOLO,5.0


In [235]:
# Select the numeric columns before summing so the text columns are not aggregated
tempTopUsersRating = topUsersRating.groupby('ItemID')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

ItemID,sum_similarityIndex,sum_weightedRating
73587,0.0,0.0
73601,0.0,0.0
73947,0.0,0.0
74109,0.0,0.0
74190,0.0,0.0


In [236]:
recommendation_df = pd.DataFrame()
recommendation_df['weighted average recommendation score'] = \
tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['ItemID'] = tempTopUsersRating.index
recommendation_df = recommendation_df.dropna()
recommendation_df.head(10)

ItemID,weighted average recommendation score,ItemID
74199,5.0,74199
85380,3.8,85380
85775,4.0,85775
86260,5.0,86260
86286,5.0,86286
87990,5.0,87990
88823,5.0,88823
89377,5.0,89377
89568,4.0,89568
89585,5.0,89585
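
The score computed above is a similarity-weighted average: for each hotel, the sum of similarity times rating divided by the sum of similarities. Here is a minimal pure-Python sketch of that calculation on made-up data; the triples in `neighbour_ratings` and the hotel labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical (similarity, hotel, rating) triples from the most similar users
neighbour_ratings = [
    (1.0, 'A', 5), (0.5, 'A', 3),
    (1.0, 'B', 4),
    (0.0, 'C', 5),   # a neighbour with zero similarity carries no weight
]

sum_sim = defaultdict(float)
sum_weighted = defaultdict(float)
for sim, hotel, rating in neighbour_ratings:
    sum_sim[hotel] += sim
    sum_weighted[hotel] += sim * rating

# Hotels whose similarities sum to zero have no defined score and are skipped,
# which plays the role of the dropna() step in the notebook
scores = {h: sum_weighted[h] / sum_sim[h] for h in sum_sim if sum_sim[h] != 0}
print(scores)
```

Hotel A ends up closer to 5 than to 3 because the more similar neighbour counts twice as much, while hotel C is dropped because only a zero-similarity neighbour rated it.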


In [237]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head()

ItemID,weighted average recommendation score,ItemID
74199,5.0,74199
111345,5.0,111345
1724006,5.0,1724006
1513541,5.0,1513541
674735,5.0,674735


Top 10 best Hotels for the user

In [238]:
recom = df_trip.loc[df_trip['ItemID'].isin(recommendation_df.head(10)['ItemID'].tolist())]
recom.head(10)

Unnamed: 0,UserID,ItemID,Rating,ItemCity,ItemTimeZone,TripType
23,286008B04EC788EEA27081EA16850984,1724006,5,ATLANTA,EASTERN,COUPLES
527,808C21D5621C601077234855B776452B,115484,5,PHOENIX,MOUNTAIN,BUSINESS
544,950919199B6571171E28DA08DD72506C,503598,5,LASVEGAS,PACIFIC,BUSINESS
560,240332824E556434431596CADAC9EA79,503598,5,LASVEGAS,PACIFIC,BUSINESS
655,B242E765F9D375ED50528342787DDF27,74199,5,SCOTTSDALE,MOUNTAIN,COUPLES
877,00E74A1E0E2A09F913F92518CFDFDD05,74199,5,SCOTTSDALE,MOUNTAIN,COUPLES
882,2324B27FB0715E37B491A9449967E286,1724006,5,ATLANTA,EASTERN,SOLO
1119,DFA86E8F0FD77AB2A1D38B4D32337228,123534,5,ORLANDO,EASTERN,FAMILY
1190,D9875E76132BA98FE7A245DF3F01733C,111345,5,ATLANTA,EASTERN,SOLO
1192,15A52B08F9085A7AA1180CA9D2C2B931,74199,5,SCOTTSDALE,MOUNTAIN,FAMILY


# Evaluation Metrics

Extract the context-based predictions for the user

In [239]:
cont = df_predict.loc[recommendation_df.index,userCF]
cont = cont.to_frame()
cont.head()

ItemID,286008B04EC788EEA27081EA16850984
74199,4.859793
111345,1.120132
1724006,0.644114
1513541,0.884095
674735,3.488991


Combine both scores to obtain the hybrid recommender's results

In [240]:
mix = pd.DataFrame()
mix['result'] = cont[userCF]+recommendation_df['weighted average recommendation score']
rec = recommendation_df['weighted average recommendation score'].tolist()
show_results=mix.sort_values(by='result',ascending=False)
show_results.head()

ItemID,result
503598,18.84856
97704,18.288022
89377,17.07037
102466,16.811199
91925,15.090712


Normalize the predictions and compare them with the real ratings, matched by ItemID

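\begin{equation}
\text{value}_{\text{new}} = \frac{\text{max}_{\text{new}} - \text{min}_{\text{new}}}{\text{max}_{\text{old}} - \text{min}_{\text{old}}} \times (\text{value}_{\text{old}} - \text{max}_{\text{old}}) + \text{max}_{\text{new}}
\end{equation}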
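
As a worked instance of this formula, the sketch below rescales a few made-up scores from a 0 to 20 range onto the 1 to 5 rating scale. The function name `rescale` and the input range are illustrative assumptions.

```python
def rescale(value_old, min_old, max_old, min_new, max_new):
    """Linearly map value_old from [min_old, max_old] to [min_new, max_new]."""
    scale = (max_new - min_new) / (max_old - min_old)
    return scale * (value_old - max_old) + max_new

# Hypothetical scores on a 0-20 scale mapped to the 1-5 rating scale
top = rescale(20, 0, 20, 1, 5)      # the old maximum maps to the new maximum
bottom = rescale(0, 0, 20, 1, 5)    # the old minimum maps to the new minimum
middle = rescale(10, 0, 20, 1, 5)   # the midpoint maps to the midpoint
print(top, bottom, middle)
```

The endpoints of the old range land on the endpoints of the new one and everything in between is interpolated linearly, so the ordering of the predictions is preserved.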

In [241]:
df_rat = rating
df_rat = df_rat[df_rat.index.isin(mix.index)]
list_rat = df_rat[userCF].tolist()
list_pred = mix['result'].tolist()
prueba = pd.DataFrame(list_rat)
#prueba['list_pred'] = list_pred
prueba['list_pred'] = rec
prueba = prueba.dropna()

list_rat = prueba[0].tolist()
list_pred = prueba['list_pred'].tolist()

list_rat = pd.DataFrame(list_rat)
list_pred = pd.DataFrame(list_pred)
list_pred=((5-1)/(list_pred.max()-1)*(list_pred-list_pred.max())+5)
print(list_pred)
print(list_rat)
#print(prueba)

     0
0  5.0
1  5.0
2  5.0
3  4.5
4  4.0
5  4.0
6  3.8
7  3.0
     0
0  5.0
1  5.0
2  5.0
3  5.0
4  4.0
5  4.0
6  4.0
7  5.0


In [242]:
list_pred = list_pred.round()[0].tolist()
list_rat = list_rat.round()[0].tolist()
#
list_pred = list(map(int,list_pred))
list_rat = list(map(int, list_rat))
print(list_rat)
print(list_pred)

[5, 5, 5, 5, 4, 4, 4, 5]
[5, 5, 5, 4, 4, 4, 4, 3]


In [243]:
accuracy = accuracy_score(list_rat,list_pred)
accuracy

0.75
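
To see where this value comes from, the accuracy can be recomputed by hand as the share of positions where the rounded prediction equals the rounded real rating. The lists below are copied from the printed output above; the variable `matches` is introduced only for illustration.

```python
# Rounded real and predicted ratings, copied from the output above
list_rat = [5, 5, 5, 5, 4, 4, 4, 5]
list_pred = [5, 5, 5, 4, 4, 4, 4, 3]

# Accuracy is the share of positions where prediction and real rating agree
matches = sum(1 for real, pred in zip(list_rat, list_pred) if real == pred)
accuracy = matches / len(list_rat)
print(accuracy)  # 6 matches out of 8
```

Six of the eight pairs agree, which gives the 0.75 reported by `accuracy_score`. With only eight rated hotels for this user, a single mismatch moves the score by 0.125.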