## EDA and Modelling Based on Hotel Attributes


In this section, we explore the data in more depth, using the tags attached to the hotel reviews together with each hotel's geolocation.

In [1]:

import pandas as pd
import numpy as np
from langdetect import detect
import re
from nltk.corpus import stopwords
import geocoder
import folium
from folium import plugins
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import string
from sklearn.metrics.pairwise import cosine_similarity
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)



In [2]:
#importing data 
hoteltags=pd.read_pickle('../data/hoteltag.pkl')
geocode=pd.read_pickle('../data/geocode.pkl')

In [3]:
# merging geocode into dataset
hoteltags_geo= pd.merge(hoteltags,geocode,on='hotel_name',how='outer')

In [4]:
hoteltags_geo.head()

Unnamed: 0,hotel_name,tags,lat_x,lng_x,lat_y,lng_y,location,country,city
0,11 Cadogan Gardens,"[' Leisure trip ', ' Couple ', ' Superior Quee...",51.493616,-0.159235,51.493616,-0.159235,"[{'country_code': 'GB', 'city': 'Chelsea', 'co...",United Kingdom,Chelsea
1,1K Hotel,"[' Leisure trip ', ' Couple ', ' Superior M Do...",48.863932,2.365874,48.863932,2.365874,"[{'country_code': 'FR', 'city': 'Paris', 'coun...",France,Paris
2,25hours Hotel beim MuseumsQuartier,"[' Leisure trip ', ' Solo traveler ', ' Standa...",48.206474,16.35463,48.206474,16.35463,"[{'country_code': 'AT', 'city': 'Vienna', 'cou...",Austria,Vienna
3,41,"[' Leisure trip ', ' Couple ', ' Executive Kin...",51.498147,-0.143649,51.498147,-0.143649,"[{'country_code': 'GB', 'city': 'West End of L...",United Kingdom,West End of London
4,45 Park Lane Dorchester Collection,"[' Leisure trip ', ' Solo traveler ', ' Execut...",51.506371,-0.151536,51.506371,-0.151536,"[{'country_code': 'GB', 'city': 'West End of L...",United Kingdom,West End of London


In [5]:
#subsetting dataset
hoteltags_geo=hoteltags_geo[['hotel_name','tags','lat_x','lng_x','city']]

The hotel tags are stored as a single string per hotel. We need a function that extracts each attribute from that string and stores the result as a set.

In [6]:
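exclude = set(string.punctuation)
def clean(x):
    # x looks like "[' Leisure trip ', ' Couple ', ...]": drop the outer "['" and "']",
    # split on commas, then lowercase each tag and strip punctuation and whitespace
    return set(''.join(ch for ch in i.lower() if ch not in exclude).strip() for i in x[2:][:-2].split(','))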

In [7]:
# applying to clean the tags and assigning them to a new column
hoteltags_geo['new_tags'] = hoteltags_geo['tags'].map(clean)
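
The tag extraction can also be sketched without slicing and splitting by hand. The version below is only an illustration: it assumes every `tags` string is a valid Python list literal, and `clean_tags` and the example string are made up for this sketch rather than taken from the dataset.

```python
import ast
import string

# translation table that deletes every punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tags(raw):
    """Parse a stringified list of tags into a set of normalised tags."""
    tags = ast.literal_eval(raw)  # "[' Couple ', ...]" -> [' Couple ', ...]
    return {tag.lower().translate(PUNCT_TABLE).strip() for tag in tags}

# made-up example string in the same shape as the 'tags' column
example = "[' Leisure trip ', ' Couple ', ' Superior Queen Room ', ' Stayed 2 nights ']"
print(sorted(clean_tags(example)))
```

Parsing the list first means a tag that itself contains a comma stays in one piece, whereas splitting on commas would break it in two.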


## Feature Engineering

One-hot encoding is needed to feed categorical data to the model. Every distinct attribute becomes a column that contains either 1 or 0: 1 indicates that a hotel has that attribute and 0 that it doesn’t.

To find out which attributes appear in the dataframe, each hotel's attributes are stored in a set, which avoids duplicate attributes.
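
As a rough illustration of the encoding described above, the sketch below builds the 0/1 columns for three made-up hotels in plain Python. The helper `one_hot` and the hotel names are hypothetical; the notebook itself builds the columns on the dataframe.

```python
def one_hot(tag_sets):
    """Return (columns, rows): the sorted tag vocabulary and one 0/1 row per hotel."""
    columns = sorted(set().union(*tag_sets))
    rows = [[1 if col in tags else 0 for col in columns] for tags in tag_sets]
    return columns, rows

# made-up hotels, each with a set of attributes
hotels = {
    'Hotel A': {'couple', 'double_room', 'city_view'},
    'Hotel B': {'couple', 'twin_room'},
    'Hotel C': {'solo traveler', 'double_room'},
}

columns, rows = one_hot(list(hotels.values()))
print(columns)
for name, row in zip(hotels, rows):
    print(name, row)
```

Each row has one entry per distinct attribute, so the number of columns grows with the size of the attribute vocabulary.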

In [8]:

# global variable
tag_sum_list = []

def get_tag_sum_elems(tag_sum_string):
    global tag_sum_list # use the global variable
    # extend the global variable with this_list
    tag_sum_list.extend(tag_sum_string)
    return True

for i in hoteltags_geo['new_tags']:
    get_tag_sum_elems(i)

tag_sum_set = set(tag_sum_list)
print(tag_sum_set)
len(tag_sum_set)

{'double without credit card', 'luxury suite music theme', 'double room disability access tub', 'standard room with queen bed', 'regency suite', 'suite with spa bath', 'privilege superior room', 'suite with extra bed', 'grand deluxe contemporary room', 'superior double or twin room with balcony 1 adult', 'stayed 7 nights', 'king hilton plaza room', 'deluxe plus double or twin room', 'superior room with 1 double bed and 1 sofa bed 1 place', 'superior suite 2 adults 2 children', 'suite room 1 or 2 people', 'apartment 3 adults', '2 connecting double rooms', 'superior double bedroom', 'junior king suite with lounge access', 'executive double or twin room with spa access', 'executive king room with executive lounge access', 'river double suite', 'two adjoining superior suite', 'classic single room with spa access', 'premium suite', 'standard double room ground floor', 'deluxe triple room with view', 'two bedroom suite', 'superior apartment 3 adults', 'hypoallergenic double or twin room', 's

2400

In [9]:
# removing 'submitted from a mobile device' since this attribute has no bearing on our investigation
# (discard, unlike remove, does not raise an error if a hotel lacks the tag)
hoteltags_geo['new_tags'].apply(lambda x: x.discard('submitted from a mobile device'))

0       None
1       None
2       None
3       None
4       None
        ... 
1469    None
1470    None
1471    None
1472    None
1473    None
Name: new_tags, Length: 1474, dtype: object

In [10]:
#function to output the hotels that have any attribute containing the substring s
def get_special(s):
    spike_cols = {col for col in tag_sum_set if s in col}
    # a hotel matches when its tag set shares at least one attribute with spike_cols
    mask = hoteltags_geo['new_tags'].map(lambda tags: not spike_cols.isdisjoint(tags))
    matched_names = hoteltags_geo.loc[mask, 'hotel_name']

    return hoteltags_geo[hoteltags_geo.hotel_name.isin(matched_names)][['hotel_name','lat_x','lng_x','city']]


In [11]:
#plotting function to map out the location
def get_map(df,imagepath):
    # zoom in when there are only a few hotels, zoom out when there are many
    zoom = 12 if df.shape[0] < 15 else 2

    #generate folium map centred on the first hotel
    map2 = folium.Map(location=[df.iloc[0].lat_x,df.iloc[0].lng_x], zoom_start=zoom)
    folium.raster_layers.TileLayer('Open Street Map').add_to(map2)
    for i in range(0,len(df)):
        folium.Marker(
            location=[df.iloc[i].lat_x,df.iloc[i].lng_x],
            tooltip=df.iloc[i].hotel_name,
            icon=folium.Icon(icon_color='white')
        ).add_to(map2)
    map2.save(imagepath)
    return map2


In the following section, I refactor the attributes in three steps:
1. Collate the attributes that belong to each group into an array.
2. Add a relabelled, generic attribute to every hotel that has one of the grouped attributes.
3. Remove the original grouped attributes, which leaves only the refactored attributes.
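
The three steps can be sketched on a single tag set as follows. This is a simplified illustration, not the notebook's implementation: `GROUPS` holds only three of the groupings, and `regroup` is a hypothetical helper.

```python
# hypothetical mapping: substring to look for -> generic label
GROUPS = {
    'double room': 'double_room',
    'twin room': 'twin_room',
    'suite': '_suite_',
}

def regroup(tags, groups=GROUPS):
    """Replace every tag that contains a group substring with the group's generic label."""
    # step 1: collate the tags that belong to any group
    matched = {tag for tag in tags if any(key in tag for key in groups)}
    # step 2: work out which generic labels this hotel should receive
    labels = {label for key, label in groups.items() if any(key in tag for tag in tags)}
    # step 3: drop the grouped tags and keep the generic labels
    return (tags - matched) | labels

tags = {'couple', 'classic double room', 'superior double room', 'suite with eiffel tower view'}
print(sorted(regroup(tags)))
```

Tags that match no group, such as `couple`, pass through unchanged.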

In [12]:
grp_col=['single room','river view','private pool','breakfast','spa bath','twin room','double room'
       ,'superior room','king room','executive room','city view','sea view','stayed '
       ,'eiffel twin','eiffel tower view','suite','triple room','penthouse','standard room','wheelchair accessible',
      'family room ','deluxe room','apartment','terrace']
newarray=[]

for i in grp_col:
    newarray.extend([col for col in tag_sum_set if i in col])

In [13]:
grp_col=['single room','river view','private pool','breakfast','spa bath','twin room','double room'
       ,'superior room','king room','executive room','city view','sea view','stayed '
       ,'eiffel twin','eiffel tower view','suite','triple room','penthouse','standard room','wheelchair accessible',
      'family room ','deluxe room','guest room','apartment','terrace']

In [14]:
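#function to refactor the attributes: add generic attribute b to every hotel that has an attribute containing a
def new_fn(a,b):
    efv= list(get_special(a).index)
    for i in efv:
        # get_special returns index labels, so look the row up by label
        hoteltags_geo.at[i, 'new_tags'].add(b)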

In [15]:
hoteltags_geo.iloc[323].new_tags

{'2 rooms',
 '4 rooms',
 'business trip',
 'classic double room',
 'classic single room',
 'classic twin room',
 'comfort double room',
 'couple',
 'deluxe double room with eiffel tower view',
 'deluxe eiffel twin',
 'deluxe twin room with eiffel tower view',
 'family with older children',
 'family with young children',
 'group',
 'leisure trip',
 'solo traveler',
 'stayed 1 night',
 'stayed 2 nights',
 'stayed 3 nights',
 'stayed 4 nights',
 'stayed 5 nights',
 'stayed 6 nights',
 'suite',
 'suite with eiffel tower view',
 'superior double room',
 'travelers with friends'}

In [16]:
# to group/refactor hotel attributes to new attributes
new_fn("single room","single_room")
new_fn("river view","river_view")
new_fn("private pool","private_pool")
new_fn("breakfast","break_fast")
new_fn("spa bath","spa_bath")
new_fn("twin room","twin_room")
new_fn("double room","double_room")
new_fn("superior room","superior_room")
new_fn("king room","king_room")
new_fn("executive room","executive_room")
new_fn("city view","city_view")
new_fn("sea view","sea_view")
new_fn("eiffel tower view","eiffel_tower_view")
new_fn("suite",'_suite_')
new_fn("triple room",'triple_room')
new_fn("penthouse",'_penthouse_')
new_fn("standard room",'standard_room')
new_fn("wheelchair accessible",'wheelchair_accessible')
new_fn("family room ",'family_room')
new_fn("deluxe room",'deluxe_room')
new_fn("guest room",'guest_room')
new_fn("apartment",'_apartment_')
new_fn("terrace",'_terrace_')


In [17]:
# to remove attributes which are not used after replacing with generic attribute
for i in range(0, len(hoteltags_geo.new_tags)):
    for j in range(0,len(newarray)):
        if newarray[j] in hoteltags_geo.new_tags[i]:
            hoteltags_geo.new_tags[i].remove(newarray[j])

In [18]:
# global variable
tag_sum_list = []

def get_tag_sum_elems(tag_sum_string):
    global tag_sum_list # use the global variable
    # extend the global variable with this_list
    tag_sum_list.extend(tag_sum_string)
    return True

for i in hoteltags_geo['new_tags']:
    get_tag_sum_elems(i)

tag_sum_set = set(tag_sum_list)
len(tag_sum_set)

514

In [19]:
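# one zero-filled column per distinct attribute (dict.fromkeys drops duplicates and keeps first-seen order)
for i in dict.fromkeys(tag_sum_list):
    hoteltags_geo[i] = 0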

In [20]:
hoteltags_geo.head()

Unnamed: 0,hotel_name,tags,lat_x,lng_x,city,new_tags,_suite_,king_room,solo traveler,family with older children,...,king superior plus room,king loft,king premier room,king grand premier with canal view,twin grand premier with canal view,luxury quadruple room,large room,art room with iconic view,art room xl with iconic view,penta plus room
0,11 Cadogan Gardens,"[' Leisure trip ', ' Couple ', ' Superior Quee...",51.493616,-0.159235,Chelsea,"{_suite_, king_room, solo traveler, family wit...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1K Hotel,"[' Leisure trip ', ' Couple ', ' Superior M Do...",48.863932,2.365874,Paris,"{travelers with friends, double_room, _suite_,...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,25hours Hotel beim MuseumsQuartier,"[' Leisure trip ', ' Solo traveler ', ' Standa...",48.206474,16.35463,Vienna,"{with a pet, travelers with friends, double_ro...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,41,"[' Leisure trip ', ' Couple ', ' Executive Kin...",51.498147,-0.143649,West End of London,"{_suite_, king_room, solo traveler, family wit...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,45 Park Lane Dorchester Collection,"[' Leisure trip ', ' Solo traveler ', ' Execut...",51.506371,-0.151536,West End of London,"{park view studio, solo traveler, business tri...",0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
#iterating through the hotels: if an attribute is present in new_tags, set its column to 1
only_new_tags=pd.DataFrame(hoteltags_geo.new_tags.loc[:])

for index, row in only_new_tags.iterrows():
    cs_list = set(row['new_tags'])
    
    for j in cs_list:
        #set corresponding column to 1
        if j in tag_sum_set:
            hoteltags_geo.loc[index,j]=1

#### Visualisation of room types in the dataset

In [22]:
index_room = ['single_room','twin_room','double_room','superior_room'
        ,'king_room','executive_room','_suite_','triple_room',
        '_penthouse_','standard_room','family_room','deluxe_room','guest_room','_apartment_',
        '_terrace_']
room_count=[]

for i in index_room:
    room_count.append(get_special(i).shape[0])


In [23]:
import plotly.io as pio


In [24]:
df1 = pd.DataFrame({'No.of Hotel': room_count}, index=index_room)
ax = df1.iplot(kind='barh', yTitle='Room Types', linecolor='black', title='Number of hotels by Room Types')
ax


![Number of hotels by room type](../image/room_charactertistic.png)


#### Visualisation of hotel characteristics

In [25]:
index_view = ['river_view','private_pool','break_fast','spa_bath','city_view','sea_view','eiffel_tower_view'
              ,'wheelchair_accessible']
view_count=[]

for i in index_view:
    view_count.append(get_special(i).shape[0])

In [26]:
df2 = pd.DataFrame({'No.of Hotel': view_count}, index=index_view)
ax2 = df2.iplot(kind='barh', yTitle='Characteristic',colors='blue', linecolor='black', title='Number of hotels with special characteristics')
ax2

![Number of hotels with special characteristics](../image/hotelcharacteristic.png)


In [27]:
hoteltags_geo.isnull().sum()

hotel_name                      0
tags                            0
lat_x                           0
lng_x                           0
city                            0
                               ..
luxury quadruple room           0
large room                      0
art room with iconic view       0
art room xl with iconic view    0
penta plus room                 0
Length: 520, dtype: int64

In [28]:
hoteltags_geo.new_tags.iloc[323]

{'2 rooms',
 '4 rooms',
 '_suite_',
 'business trip',
 'couple',
 'double_room',
 'eiffel_tower_view',
 'family with older children',
 'family with young children',
 'group',
 'leisure trip',
 'single_room',
 'solo traveler',
 'travelers with friends',
 'twin_room'}

In [29]:
hoteltags_geo['double_room'].value_counts()

1    1046
0     428
Name: double_room, dtype: int64

In [30]:
# to generate map and visualise
get_map(get_special('eiffel_tower_view'),'../image/eiffel.html')

![Map of hotels with an Eiffel Tower view](../image/eiffel.png)

## Modelling

In [31]:
similarityDF=cosine_similarity(hoteltags_geo.iloc[:, 6:],hoteltags_geo.iloc[:, 6:])
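
For two attribute vectors, cosine similarity is their dot product divided by the product of their lengths, so two hotels with identical attributes score 1 and two hotels with no attribute in common score 0. The sketch below works this through in plain Python for three made-up hotels. It is an illustration only: `cosine` is a hypothetical helper, and returning 0.0 for an all-zero vector is an assumption made for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty attribute vector is similar to nothing
    return dot / (norm_u * norm_v)

# made-up one-hot rows over the columns
# ['city_view', 'couple', 'double_room', 'solo traveler', 'twin_room']
hotel_a = [1, 1, 1, 0, 0]
hotel_b = [0, 1, 0, 0, 1]
hotel_c = [0, 0, 1, 1, 0]

print(round(cosine(hotel_a, hotel_b), 3))  # one shared attribute out of 3 and 2
print(round(cosine(hotel_b, hotel_c), 3))  # no shared attribute
```

Because the vectors only contain 0 and 1, the dot product is simply the number of attributes the two hotels share.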


In [32]:
np.save("../data/tagcosine.npy",similarityDF)

In [33]:
hoteltags_geo.to_pickle("../data/clean_hoteltag.pkl")

In [34]:
def new_recommendations_tags(name,city, cosine_similarities):
    # returns up to 10 hotels in `city` most similar to `name`, keyed by rank (1 = most similar)

    # getting the index of the hotel that matches the name
    idx = hoteltags_geo[(hoteltags_geo.hotel_name == name)].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending = False)

    # keeping only hotels in the input city; the query hotel itself is skipped
    city_index = set(hoteltags_geo[hoteltags_geo.city==city].index)
    top_10_indexes = [i for i in score_series.index if i in city_index and i != idx][:10]

    if not top_10_indexes:
        print("There are no similar hotels in this city")
        return {}

    # populating a dictionary of rank -> (hotel name, [latitude, longitude])
    top = hoteltags_geo.loc[top_10_indexes, ['hotel_name','lat_x','lng_x']]
    return {rank: (row.hotel_name, [row.lat_x, row.lng_x])
            for rank, row in enumerate(top.itertuples(index=False), start=1)}


In [35]:
def get_hotel_fn(mydict,city):
    # geocode the city so the map can be centred on it
    loc2 = geocoder.osm(city)

    # map
    main_map = folium.Map(location=[loc2.lat, loc2.lng], zoom_start=13)
    folium.raster_layers.TileLayer('Open Street Map').add_to(main_map)

    # loop through dict and add one numbered marker per recommended hotel
    for rank, (hotel_name, latlng) in mydict.items():
        folium.Marker(location=latlng, tooltip=hotel_name, popup=hotel_name,
                      icon=plugins.BeautifyIcon(number=rank,
                                                icon='bus',
                                                border_color='blue',
                                                border_width=0.5,
                                                text_color='red',
                                                inner_icon_style='margin-top:0px;')).add_to(main_map)

    return main_map


In [36]:
# to populate and pin recommended list of hotels
get_hotel_fn(new_recommendations_tags('Hilton Diagonal Mar Barcelona','Vienna',similarityDF),'Vienna')

## Deployment

As part of the project, I deployed the recommender as a Flask app on Heroku. Along the way I learnt the steps involved in an end-to-end project, including refactoring code into functions and classes so that it is easier to assemble for Flask. To deploy a model, you need to understand what you want to achieve and then revisit the code to see how it can be restructured to achieve that.


[HotelPal](https://hotelrecommender.herokuapp.com)




## Limitations and Recommendation

In this project, I refactored the attributes based on my own research. Domain knowledge would make this more effective: it would show which attributes matter most and how much weight each should be given.

<br> 
To gauge the performance of the content-based recommender, we could roll the model out for A/B testing and measure its conversion rate on a commercial website.<br>
<br>
One-hot encoding has the following limitations:

1. If there are too many attributes, the matrix becomes huge and impractical for calculations.
2. Implicit relationships among categorical variables may be ignored.
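
One possible way around the first limitation, offered here as a sketch rather than something this project implemented: for 0/1 vectors, cosine similarity equals the number of shared attributes divided by the square root of the product of the two attribute counts, so it can be computed directly from the tag sets without building the matrix. `cosine_from_sets`, `top_matches` and the catalogue below are hypothetical.

```python
import math

def cosine_from_sets(a, b):
    """Cosine similarity of the 0/1 vectors implied by two tag sets."""
    if not a or not b:
        return 0.0  # assumption: a hotel with no tags is similar to nothing
    return len(a & b) / math.sqrt(len(a) * len(b))

def top_matches(query, catalogue, k=3):
    """Return the k hotels most similar to `query` as (name, score) pairs."""
    scores = {
        name: cosine_from_sets(catalogue[query], tags)
        for name, tags in catalogue.items()
        if name != query
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# made-up catalogue of hotels and their attribute sets
catalogue = {
    'Hotel A': {'couple', 'double_room', 'city_view'},
    'Hotel B': {'couple', 'twin_room'},
    'Hotel C': {'solo traveler', 'double_room'},
    'Hotel D': {'couple', 'double_room'},
}

print(top_matches('Hotel A', catalogue, k=1))
```

This keeps memory proportional to the number of tags each hotel actually has, at the cost of computing similarities pair by pair.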

<br>

This project could be extended by applying topic modelling or Word2Vec to the text reviews. We could also look into Universal Language Model Fine-tuning for Text Classification (ULMFiT), which uses neural networks and inductive transfer learning for text classification (https://arxiv.org/abs/1801.06146).

## Conclusion





Big data analysis is changing the operating mode of the global tourism economy, providing tourism managers with deeper insights and reaching into all aspects of tourist travel, while driving tourism innovation and development [1]. Text mining techniques for tourism big data have made it possible to analyze the behaviors of tourists and monitor the market in real time. Both machine learning and, more recently, deep learning have been applied to NLP with great success.

As the number of applications on the Internet grows, the sources of data are becoming richer. The various factors in this new data bring new challenges, but also a chance to create novel methods that achieve better recommendation results. Social networks are still the focus of recommendation research, and integration methods and new algorithms will continue to appear in the future. Sound, location and other user preference information are receiving more and more attention. I believe recommender systems will remain a hot area of innovation and research.

## Reference

[1] Li, J.; Xu, L.; Tang, L.; Wang, S.; Li, L. Big data in tourism research: A literature review. Tour. Manag. 2018, 68, 301–323.

[2] Li, Q.; Li, S.; Zhang, S.; Hu, J.; Hu, J. A Review of Text Corpus-Based Tourism Big Data Mining.
