# Content Based Recommender 

Text is one of the most common and effective ways for users to express opinions and evaluations, as shown by the large number of online reviews of tourist sites, hotels, and services. Because reviews express users’ needs and emotions directly, mining text-based tourism data has the potential to transform the tourism industry.

Content-based filtering is a common approach in recommendation systems. It compares the features of items a user has previously rated with those of other items and recommends the best-matching ones. In our case, we will transform implicit information about hotel attributes into features for the recommendation engine.

## Problem Statement

In this project, the objective is to transform implicit information provided by users into explicit features for a hotel recommendation engine. The recommender has two parts, one built on hotel attributes (tags) and one built on user reviews, resulting in two separate recommendation engines.


## Dataset
The dataset is the "515K Hotel Reviews Data in Europe" dataset on Kaggle (https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe). It is a CSV file containing mostly text. Positive and negative reviews are already separated into their own columns. The reviews are all in English and were collected from Booking.com between 2015 and 2017.

The dataset contains 515738 reviews for 1492 luxury hotels in Europe.

## Executive Summary


In a very general way, recommender systems are algorithms that suggest relevant items to users (movies to watch, text to read, products to buy, or anything else, depending on the industry).<br>
<br>***There are two main data selection methods:***

Collaborative filtering: Items (for example, hotels) are recommended based on how similar your user profile is to those of other users. The method finds the users most similar to you and recommends items they have shown a preference for. It suffers from the so-called cold-start problem: if a hotel is new, no one has rated or booked it yet, so it will not appear in your list of recommended hotels, even if you would love it.

Content-based filtering: This method uses attributes of the content to recommend similar content. It avoids the cold-start problem for new items because it works from attributes or tags of the content, such as views, Wi-Fi or room types, so new hotels can be recommended right away.


The key requirement of content-based filtering is that we know the content of both the user and the item. Usually a user profile and an item profile are constructed in a shared attribute space. A movie, for example, can be represented by the stars who appear in it and its genres (using a binary encoding, for example).
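To make the idea of a shared attribute space concrete, here is a minimal sketch in plain Python. The attribute names, hotels, and user profile are all made up for illustration, and cosine similarity is just one reasonable choice of matching score, not necessarily the one used later in this project.

```python
import math

# Hypothetical attribute space shared by users and items (illustrative names only).
ATTRIBUTES = ["wifi", "city_view", "pool", "family_room"]

def encode(tags):
    """Binary-encode a set of tags against the shared attribute space."""
    return [1 if attr in tags else 0 for attr in ATTRIBUTES]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up item profiles.
items = {
    "Hotel A": encode({"wifi", "city_view"}),
    "Hotel B": encode({"wifi", "pool", "family_room"}),
    "Hotel C": encode({"city_view"}),
}

# A user profile built from attributes of hotels the user liked before.
user_profile = encode({"wifi", "city_view"})

# Rank items by similarity to the user profile, best match first.
ranking = sorted(items, key=lambda name: cosine(user_profile, items[name]), reverse=True)
print(ranking)
```

Because the user and the items live in the same vector space, a brand-new hotel can be scored as soon as its attributes are known.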



***There are a number of popular encoding schemes but the main ones are:***

- One-hot encoding
- Term frequency–inverse document frequency (TF-IDF) encoding
- Word embeddings
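As an illustration of the TF-IDF option, the sketch below uses the `TfidfVectorizer` and `linear_kernel` functions imported later in this notebook. The three review snippets are invented, and the parameters shown are assumptions rather than the settings used in the modelling step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up review snippets standing in for per-hotel text.
docs = [
    "quiet room friendly staff great breakfast",
    "friendly staff great breakfast central location",
    "noisy street small room",
]

# TfidfVectorizer L2-normalises each row by default, so the dot product
# computed by linear_kernel equals cosine similarity.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)
similarity = linear_kernel(matrix, matrix)

# For the first document, find the most similar other document.
scores = similarity[0].copy()
scores[0] = -1.0  # exclude the document itself
best_match = int(scores.argmax())
print(best_match)
```

Here the first two snippets share most of their vocabulary, so they come out as each other's closest match.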

In this project, we discuss content-based filtering for a recommender engine, turning implicit attributes into explicit features for a hotel recommender.


 

## Contents

1. EDA

2. Modelling

   - Based on reviews
   - Based on hotel tag attributes

3. Deployment

4. Conclusion and Recommendation


In [1]:
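#import required libraries
import pandas as pd
import numpy as np
from langdetect import detect
#the sklearn.feature_extraction.stop_words module was removed in newer scikit-learn releases;
#the stop-word list itself is available from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS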
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import folium
from folium import plugins
import ipywidgets
import geocoder
import geopy
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import reverse_geocode



In [2]:
# importing data
df= pd.read_csv('../data/Hotel_Reviews.csv')
# changing column to lower case
df.columns=[x.lower() for x in df.columns]

In [3]:
#shape of dataframe
df.shape

(515738, 17)

In [4]:
# The dataset comprises 17 columns.
df.columns

Index(['hotel_address', 'additional_number_of_scoring', 'review_date',
       'average_score', 'hotel_name', 'reviewer_nationality',
       'negative_review', 'review_total_negative_word_counts',
       'total_number_of_reviews', 'positive_review',
       'review_total_positive_word_counts',
       'total_number_of_reviews_reviewer_has_given', 'reviewer_score', 'tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')

This dataset contains 515738 observations of 17 variables, covering 1492 hotels across Europe. We will discuss the geographical distribution of the hotels in a later section.

In [5]:
#there are 1492 hotels
hotel=list(df.hotel_name.unique())
len(hotel)

1492

In [6]:
# Some hotel names in the dataset contain "H tel" instead of "Hôtel"
df[df['hotel_name'].str.contains("H tel")]['hotel_name'].unique()[:10]

array(['H tel De Vend me', 'H tel des Ducs D Anjou',
       'H tel Juliana Paris', 'H tel de Jos phine BONAPARTE',
       'H tel Keppler', 'H tel Chaplain Paris Rive Gauche',
       'H tel Regina Op ra Grands Boulevards', 'H tel Diva Opera',
       'H tel Duo', 'H tel Le Marianne'], dtype=object)

<br>
Through research, it became evident that accented Latin characters were removed when the data was scraped.

> <p><b>Example:</b>
> </p>
>Relais Hôtel du Vieux Paris --> Relais H tel du Vieux Paris


In [7]:
#replace "H tel" with "Hotel"
df['hotel_name']=df['hotel_name'].apply(lambda x:x.replace('H tel','Hotel'))
                                        

In [8]:
# sanity check on one affected row
df[df.index==12608]['hotel_name']

12608    Hotel De Vend me
Name: hotel_name, dtype: object
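The sanity check still shows a gap in "Vend me", which is consistent with accented letters having been replaced by spaces. The sketch below reproduces that kind of loss under the assumption (not confirmed by the dataset documentation) that every non-ASCII character was swapped for a single space; `strip_non_ascii` is a hypothetical helper, and "Hôtel De Vendôme" is my guess at the original spelling.

```python
import re

def strip_non_ascii(text):
    """Replace every non-ASCII character with a single space (assumed behaviour)."""
    return re.sub(r"[^\x00-\x7F]", " ", text)

original = "Relais Hôtel du Vieux Paris"
garbled = strip_non_ascii(original)
print(garbled)  # Relais H tel du Vieux Paris

# Replacing only "H tel" repairs the common prefix but leaves other gaps.
partially_fixed = strip_non_ascii("Hôtel De Vendôme").replace("H tel", "Hotel")
print(partially_fixed)  # Hotel De Vend me
```

Since the original characters cannot be recovered from a space alone, the remaining gaps are left as they are in this notebook.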

## Geocode based on the dataset

In this section, we look at the hotels with respect to their geolocation.

In [9]:
#group by hotelname and aggregate
geocode_df = df.groupby('hotel_name').agg({'lat': 'first','lng':'first'}).reset_index()

In [10]:
#check if there is null
geocode_df.isnull().sum()

hotel_name     0
lat           17
lng           17
dtype: int64

There are 17 hotels without latitude and longitude details in the dataframe. Since they account for only about 1% of the hotels, well below 15%, I have decided to drop them.
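The share of affected rows can be checked before dropping anything. The sketch below does this on a toy frame with made-up hotels; the `THRESHOLD` constant is a hypothetical name for the 15% cut-off mentioned above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up hotels; one of ten rows lacks coordinates.
toy = pd.DataFrame({
    "hotel_name": [f"hotel_{i}" for i in range(10)],
    "lat": [48.0 + i for i in range(9)] + [np.nan],
    "lng": [2.0 + i for i in range(9)] + [np.nan],
})

# A row counts as missing if either coordinate is null.
missing_mask = toy[["lat", "lng"]].isnull().any(axis=1)
missing_share = missing_mask.mean()
print(f"{missing_share:.1%} of hotels lack coordinates")

# Hypothetical cut-off: drop the rows only when the share is small.
THRESHOLD = 0.15
if missing_share < THRESHOLD:
    toy = toy.dropna(subset=["lat", "lng"]).reset_index(drop=True)
```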

In [11]:
#drop null and reset index
geocode_df.dropna(subset=['lat','lng'],inplace=True)
geocode_df.reset_index(drop=True,inplace=True)

In [12]:
#sanity check if there is null
geocode_df.isnull().sum()

hotel_name    0
lat           0
lng           0
dtype: int64

The section below defines a function that reverse-geocodes each hotel's latitude and longitude to obtain its country and city.

In [13]:
#function to reverse geocode every hotel
#returns one single-element result list per row, so x[0]['city'] works downstream
def search(x):
    return [reverse_geocode.search(((la, lo),)) for la, lo in zip(x['lat'], x['lng'])]
#apply function
#assign the whole column at once to avoid chained assignment (SettingWithCopyWarning)
geocode_df['location'] = search(geocode_df)


In [14]:
#function to search for country
def search_country(x):
    return x[0]['country']

In [15]:
#map the country
geocode_df['country']=geocode_df['location'].map(search_country)

In [16]:
geocode_df.country.unique()

array(['United Kingdom', 'France', 'Austria', 'Spain', 'Italy',
       'Netherlands'], dtype=object)

In [17]:
#function to search for city
def search_city(x):
    return x[0]['city']

In [18]:
#map the city
geocode_df['city']=geocode_df['location'].map(search_city)

In [19]:
geocode_df.city.value_counts()

Paris                       224
Vienna                      137
Milan                       135
Amsterdam                    83
Levallois-Perret             82
                           ... 
Sant Genís dels Agudells      1
Aubervilliers                 1
Settimo Milanese              1
Chigwell                      1
Le Pré-Saint-Gervais          1
Name: city, Length: 111, dtype: int64

In [20]:
#generating a map to output the cities 
map2 = folium.Map(location=[41.3851,2.1734], zoom_start=4)
folium.raster_layers.TileLayer('Open Street Map').add_to(map2)
for la,lo in zip(geocode_df.lat,geocode_df.lng):
    folium.Marker(
        location=[la,lo],
        icon=folium.Icon(icon_color='white')
    ).add_to(map2)
# Plotting 
map2.save('../image/Europe_overview.html')
map2

![Map of hotel locations across Europe](../image/Europe_overview.png)


In [21]:
#save to pickle
geocode_df.to_pickle('../data/geocode.pkl')

## Tags a.k.a Attributes

In this section, we clean and prepare the dataset for modelling the recommender engine.

In [22]:
tag_df=df[['hotel_name','tags','lat','lng']]

In [23]:
tag_df.head()

| | hotel_name | tags | lat | lng |
|---|---|---|---|---|
| 0 | Hotel Arena | [' Leisure trip ', ' Couple ', ' Duplex Double... | 52.360576 | 4.915968 |
| 1 | Hotel Arena | [' Leisure trip ', ' Couple ', ' Duplex Double... | 52.360576 | 4.915968 |
| 2 | Hotel Arena | [' Leisure trip ', ' Family with young childre... | 52.360576 | 4.915968 |
| 3 | Hotel Arena | [' Leisure trip ', ' Solo traveler ', ' Duplex... | 52.360576 | 4.915968 |
| 4 | Hotel Arena | [' Leisure trip ', ' Couple ', ' Suite ', ' St... | 52.360576 | 4.915968 |


In [24]:
tag_df = tag_df.groupby('hotel_name').agg({'tags': ', '.join,'lat':'first','lng':'first'}).reset_index()

In [25]:
tag_df.shape

(1491, 4)

In [26]:
tag_df.isnull().sum()

hotel_name     0
tags           0
lat           17
lng           17
dtype: int64

In [27]:
tag_df.head()

| | hotel_name | tags | lat | lng |
|---|---|---|---|---|
| 0 | 11 Cadogan Gardens | [' Leisure trip ', ' Couple ', ' Superior Quee... | 51.493616 | -0.159235 |
| 1 | 1K Hotel | [' Leisure trip ', ' Couple ', ' Superior M Do... | 48.863932 | 2.365874 |
| 2 | 25hours Hotel beim MuseumsQuartier | [' Leisure trip ', ' Solo traveler ', ' Standa... | 48.206474 | 16.35463 |
| 3 | 41 | [' Leisure trip ', ' Couple ', ' Executive Kin... | 51.498147 | -0.143649 |
| 4 | 45 Park Lane Dorchester Collection | [' Leisure trip ', ' Solo traveler ', ' Execut... | 51.506371 | -0.151536 |


In [28]:
tag_df.dropna(subset=['lat','lng'],inplace=True)

In [29]:
tag_df.to_pickle("../data/hoteltag.pkl")

## Reviews

In this section, we prepare the dataset for the review-based hotel recommender.<br>

Some positive or negative reviews contain only the placeholder "No Positive" or "No Negative". Because these placeholders appear many times, they could distort the vectorization of words.

In [30]:
#these positive reviews are supposed to be empty
df[df['positive_review']=='No Positive'][['negative_review','positive_review']].head(5)

| | negative_review | positive_review |
|---|---|---|
| 8 | Even though the pictures show very clean room... | No Positive |
| 32 | Our bathroom had an urine order Shower was ve... | No Positive |
| 98 | Got charged 50 for a birthday package when it... | No Positive |
| 121 | The first room had steep steps to a loft bed ... | No Positive |
| 134 | Foyer was a mess Only place to relax was the ... | No Positive |


In [31]:
# these negative reviews are supposed to be empty

df[df['negative_review']=='No Negative'][['negative_review','positive_review']].head(5)

| | negative_review | positive_review |
|---|---|---|
| 1 | No Negative | No real complaints the hotel was great great ... |
| 13 | No Negative | This hotel is being renovated with great care... |
| 15 | No Negative | This hotel is awesome I took it sincirely bec... |
| 18 | No Negative | Public areas are lovely and the room was nice... |
| 48 | No Negative | The quality of the hotel was brilliant and ev... |


I have decided to replace them with "No", since it will be removed as a stop word. Replacing them with an empty string could instead produce NaN values, which would affect my computation.

In [32]:
#replace No negative and no positive
df["negative_review"] = df["negative_review"].apply(lambda x: x.replace("No Negative", "No"))
df["positive_review"] = df["positive_review"].apply(lambda x: x.replace("No Positive", "No"))
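To illustrate why "No" is a harmless placeholder, the sketch below vectorizes two made-up reviews with scikit-learn's built-in English stop-word list. This is an assumption about the later modelling step, which may use a different stop-word list or different vectorizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews; the first one is the placeholder for an empty review.
reviews = ["No", "Great location and no noise at night"]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(reviews)

# "no" is part of scikit-learn's built-in English stop-word list,
# so it never enters the vocabulary.
print(sorted(vectorizer.vocabulary_))

# Total TF-IDF weight of the placeholder row.
placeholder_weight = matrix.toarray()[0].sum()
print(placeholder_weight)
```

The placeholder row ends up with no weight at all, so it contributes nothing to similarity scores while still being a valid string for the `join` step later on.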


Another issue with the dataset is that some reviews are duplicated, with the same text in both the positive and negative columns, and I am only interested in keeping one copy. In the section below, I have written a function to check for and label these duplicates.

In [33]:
#function to check negative & positive duplicates
def check(x):
    pos, neg = x
    if pos ==  neg:
        return 1
    return 0

In [34]:
df['check_dup'] = [check(x) for x in df[['positive_review','negative_review']].values]
index_col= df[df['check_dup']==1].index

In [35]:
df[df['check_dup']==1][['positive_review','negative_review']].head()

| | positive_review | negative_review |
|---|---|---|
| 1403 | No | No |
| 2451 | The hotel good location and clean but some st... | The hotel good location and clean but some st... |
| 2872 | Ok | Ok |
| 3067 | Standard Hotel | Standard Hotel |
| 5839 | I was completely disappointed and mad since t... | I was completely disappointed and mad since t... |


Based on the labelling function, I have replaced the duplicated positive reviews with 'nothing', since it is going to be removed as a stop word.

In [36]:
#overwrite the duplicated positive reviews in a single .loc assignment
#(avoids the chained-assignment SettingWithCopyWarning)
df.loc[index_col,'positive_review']='nothing'


In [37]:
#check that the duplicated positive reviews were replaced
df[df['check_dup']==1][['positive_review','negative_review']].head()

| | positive_review | negative_review |
|---|---|---|
| 1403 | nothing | No |
| 2451 | nothing | The hotel good location and clean but some st... |
| 2872 | nothing | Ok |
| 3067 | nothing | Standard Hotel |
| 5839 | nothing | I was completely disappointed and mad since t... |
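The labelling and replacement steps can also be written without a Python-level loop. The sketch below shows one possible vectorized version on a toy frame with invented reviews; it is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

# Toy reviews (made up); rows 0 and 2 repeat the same text in both columns.
toy = pd.DataFrame({
    "positive_review": ["Ok", "Great staff", "Standard Hotel"],
    "negative_review": ["Ok", "Small room", "Standard Hotel"],
})

# Element-wise comparison of the two columns, cast to 0/1 like check().
toy["check_dup"] = toy["positive_review"].eq(toy["negative_review"]).astype(int)

# Overwrite the duplicated positive text in one .loc assignment.
toy.loc[toy["check_dup"] == 1, "positive_review"] = "nothing"
print(toy)
```

Using `.loc` with a boolean mask writes to the original frame directly, which is what the pandas indexing documentation recommends over chained indexing.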


In [38]:
#group by hotel name and join the reviews into one observation per hotel, separated by ', '
df1 = df.groupby('hotel_name').agg({'negative_review': ', '.join,'positive_review': ', '.join,
                                    'lat': 'first','lng':'first','hotel_address':'first',
                                    'tags': ', '.join}).reset_index()
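To show what this aggregation does, here is the same pattern on a small toy frame. The hotels and reviews are made up, and only two of the columns are included to keep the example short.

```python
import pandas as pd

# Toy review-level data (made up): two reviews for A, one for B.
toy = pd.DataFrame({
    "hotel_name": ["A", "A", "B"],
    "positive_review": ["Great staff", "Nice view", "Clean room"],
    "lat": [48.86, 48.86, 51.50],
})

# One row per hotel: concatenate the text, keep the first coordinate.
per_hotel = (
    toy.groupby("hotel_name")
       .agg({"positive_review": ", ".join, "lat": "first"})
       .reset_index()
)
print(per_hotel)
```

Each hotel ends up with a single long document, which is the unit a text vectorizer will later treat as one item.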

In [39]:
#check if there is null
df1.isnull().sum()

hotel_name          0
negative_review     0
positive_review     0
lat                17
lng                17
hotel_address       0
tags                0
dtype: int64

In [40]:
df1.dropna(subset=['lat','lng'],inplace=True)

In [41]:
#load the pickled geocode data with country and city
geocode=pd.read_pickle('../data/geocode.pkl')

In [42]:
# merging the dataset based on hotel name
new_df= pd.merge(df1,geocode,on='hotel_name',how='outer')
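With an outer merge it can be useful to confirm that every hotel found a match. The sketch below does this on toy frames using the `indicator` parameter of `pd.merge`; the frames and values are invented for illustration.

```python
import pandas as pd

# Toy frames (made up): hotel C is missing from the geocode table.
reviews = pd.DataFrame({"hotel_name": ["A", "B", "C"], "positive_review": ["x", "y", "z"]})
geocode = pd.DataFrame({"hotel_name": ["A", "B"], "city": ["Paris", "Vienna"]})

# indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(reviews, geocode, on="hotel_name", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["hotel_name", "_merge"]])
```

Note also that when both frames carry columns with the same name other than the merge key (here, `lat` and `lng`), pandas keeps both copies and adds `_x` and `_y` suffixes.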

In [43]:
# exporting to pickle
new_df.to_pickle('../data/review.pkl')