# Booking.com Feature Extraction
---

Download the file BookingDotCom_HotelReviews.xlsx from Canvas. This file contains over 515,000 guest reviews and ratings of almost 1,500 hotels across Europe, scraped from the popular hotel reservation website Booking.com. The text data was cleaned by removing Unicode characters and punctuation and was converted to lower case. No other preprocessing was done. More information on each field is provided in the "Data Description" tab of the Excel file.

        1. What are the top five hotel features (e.g., location, staff, etc.) that customers mention the most in positive reviews and top five features they mention most in negative reviews? Your identified features must make sense (e.g., "great" or "negative" are not features). (3 points)
        
        2. What are the top five features that customers prefer most when they are a solo traveler vs. traveling with a group vs. on a business trip vs. on a leisure trip vs. traveling as a couple vs. a family with young children? You will find these categories in the "Tags" column. There are a few more tags that we don't need. (2 points)

        3. What are the top five features customers like most and top five features they complain about most about hotels in United Kingdom, France, Italy, and Spain? Country information is available inside Hotel_Address. (2 points)
        
        4. Create a dashboard with the following plots: (1) "Top Five Hotels Overall" with consistently high ratings, (2) "Bottom Five Hotels Overall" with consistently low ratings, and (3) "Five Most Improved Hotels" with the highest improvement in average ratings from 2015 to 2017, showing their average ratings for each of the three years. (0.5+0.5+2 points)

Write clear, compact, and understandable code with comment/markdown statements as appropriate. Non-working code or unnecessary code will be penalized. 

Submit your Jupyter file using the link below or provide a link to your Google Colab or Github file.


In [2]:
# import packages to use
import pandas as pd
import re
import pycountry
import ast
import spacy
from sklearn.feature_extraction.text  import CountVectorizer

### Load dataframe

In [3]:
df = pd.read_excel("BookingDotCom_HotelReviews.xlsx", sheet_name="Data") # original file

# df = pd.read_csv("dfProcessed.csv") # saved from executing the steps in this notebook

# work with the first 15,000 rows of df (copy to avoid chained-assignment warnings)
df = df[:15000].copy()

# rename df columns to lower case
df.columns= df.columns.str.lower()

### Remove numbers from text using regular expressions

In [4]:
df['positive_comments'] = df['positive_comments'].apply(lambda x: re.sub(r'\d+', '', str(x)))
df['negative_comments'] = df['negative_comments'].apply(lambda x: re.sub(r'\d+', '', str(x)))
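
Removing digits this way leaves the surrounding spaces in place, so a comment can end up with doubled spaces. That is harmless for the tokenization that follows, but as a minimal sketch the same substitution can be paired with whitespace normalization. The `strip_numbers` helper and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re

def strip_numbers(text):
    """Remove runs of digits, then collapse the leftover whitespace."""
    no_digits = re.sub(r'\d+', '', str(text))
    return ' '.join(no_digits.split())

sample = "room 101 was noisy after 11 pm"   # illustrative comment, not from the dataset
cleaned = strip_numbers(sample)
print(cleaned)
```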

### Lemmatize, remove stopwords and convert to lowercase using spaCy

In [5]:
load_model = spacy.load('en_core_web_sm', disable = ['parser','ner'])

# build the stopword set: spaCy defaults plus domain words that are not hotel features
myStopwords = set(load_model.Defaults.stop_words)
myStopwords.update(['negative', 'positive', 'hotel'])
# other candidates that could be excluded: 'stay', 'night', 'book', 'com'

def lemmaFunc(text):
    doc = load_model(text)
    allowed_tags = ['NOUN', 'PROPN']        # to get only nouns and proper nouns
    return " ".join([token.lemma_.lower() for token in doc if token.pos_ in allowed_tags \
                          and token.text.lower() not in myStopwords
                          ])

df['positive_comments'] = df['positive_comments'].apply(lambda x: lemmaFunc(x))
df['negative_comments'] = df['negative_comments'].apply(lambda x: lemmaFunc(x))

### Create a column of country names

In [6]:
# we get the list of countries from the pycountry package and use the hotel_address column to extract country name
df['country'] = df["hotel_address"].apply(
    lambda address: ' '.join([c.name for c in pycountry.countries if c.name in address])
    )

# use the review_date column to extract the year and store in new column
df['year'] = pd.DatetimeIndex(df['review_date']).year
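
Scanning every pycountry entry for each row can be slow on the full dataset, and a substring test can match more than one name. A lighter alternative, sketched here under the assumption that each address ends with its country name, checks only a short list of candidate countries. The `COUNTRIES` tuple, the `country_from_address` helper, and the sample addresses are all hypothetical.

```python
# Hypothetical list of countries expected in the data (an assumption, adjust as needed).
COUNTRIES = ("United Kingdom", "France", "Italy", "Spain", "Netherlands")

def country_from_address(address, countries=COUNTRIES):
    """Return the country whose name ends the address, or '' if none matches."""
    cleaned = str(address).strip()
    for name in countries:
        if cleaned.endswith(name):
            return name
    return ""

# Invented addresses, used only to exercise the helper.
examples = [
    "1 Example Street Westminster London W1 United Kingdom",
    "2 Rue Exemple 75001 Paris France",
    "address without a known country",
]
countries_found = [country_from_address(a) for a in examples]
print(countries_found)
```

In the notebook this could be applied with `df['hotel_address'].apply(country_from_address)`, provided the assumption about the address format holds for the data.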

### Get customer groupings from column of tags

In [7]:
'''
In this step we deal with the tags column using the steps defined below:
    1. Define tags we are interested in
    2. Define a function to apply to tags column to remove tags we are not interested in by:
            - Converting the individual row values to list (from string) e.g. "[' Leisure trip ']" -> [' leisure trip ']
            - Strip the whitespaces from individual elements e.g. [' leisure trip ']-> ['leisure trip']
            - Drop tags we are not interested in
'''

# customer tags we are interested in
customer_tags = ['solo traveler','group','business trip','leisure trip','couple','family with young children']


def clean_tag(x):

    # convert value from string to a list
    myTags = ast.literal_eval(x.lower())

    # strip whitespaces from elements and drop those we are not interested in
    myTags = [customerTag.strip() for customerTag in myTags if customerTag.strip() in customer_tags]

    return myTags

# apply clean_tag() function to the tags column
df['tags'] = df['tags'].apply(lambda x: clean_tag(x))


'''
Function applied to tags column to extract new columns for customer categories
Lambda function will be used as in the steps below
'''

def split_tag(x: list, tagName: str) -> int:
    # 1 if the tag is present in the row's list of tags, otherwise 0
    return int(tagName in x)


# dictionary of column names (new additional columns) and customer tags (as contained in tags of interest)
tagDict = {
    'solo_traveler' : 'solo traveler',
    'group' : 'group',
    'business_trip' : 'business trip',
    'leisure_trip' : 'leisure trip',
    'couple' : 'couple',
    'family_with_young_children' : 'family with young children'
}

# applying function on tags column to get new separated columns
for key, value in tagDict.items():
    df[key] = df['tags'].apply(lambda x: split_tag(x, value))
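
To make the two tag steps concrete, the sketch below runs them on a single raw value using only the standard library. The sample string is invented to resemble the format described above, and deriving the column names by replacing spaces with underscores is a shortcut that happens to reproduce the keys of `tagDict`.

```python
import ast

CUSTOMER_TAGS = ['solo traveler', 'group', 'business trip',
                 'leisure trip', 'couple', 'family with young children']

def clean_tag(raw):
    """Parse the stringified list, strip whitespace, keep only tags of interest."""
    tags = ast.literal_eval(raw.lower())
    return [t.strip() for t in tags if t.strip() in CUSTOMER_TAGS]

# Invented raw value in the same shape as the tags column.
sample = "[' Leisure trip ', ' Couple ', ' Stayed 2 nights ']"
cleaned = clean_tag(sample)

# One 0/1 indicator per tag, keyed by the column name used in the notebook.
flags = {tag.replace(' ', '_'): int(tag in cleaned) for tag in CUSTOMER_TAGS}
print(cleaned)
print(flags)
```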

##### Saving dataframe for future use
---

In [26]:
# df.to_csv("dfProcessed.csv", index=False)

---

### Get word frequencies using CountVectorizer from sklearn

In [8]:
vecPos = CountVectorizer(
                        strip_accents='ascii', stop_words='english',
                        analyzer='word', max_df=0.95,max_features=500
                        )

vecNeg = CountVectorizer(max_df=0.85, ngram_range=(1,2), max_features=500)

sparseVecPos = vecPos.fit_transform(df['positive_comments'])   # You can fit and transform jointly 
sparseVecNeg = vecNeg.fit_transform(df['negative_comments'])

# create dataframes from the vectors of counts
matPos = pd.DataFrame(sparseVecPos.toarray(), columns=vecPos.get_feature_names_out())
matNeg = pd.DataFrame(sparseVecNeg.toarray(), columns=vecNeg.get_feature_names_out())
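
Conceptually, each vectorizer builds a document-term matrix whose column sums are the total term counts used in the sections that follow. The sketch below illustrates that idea with `collections.Counter` from the standard library as a stand-in. It is not how `CountVectorizer` is implemented, and it ignores options such as `max_df`, `ngram_range`, and `max_features`. The toy comments are invented.

```python
from collections import Counter

# Toy lemmatized comments (invented), one string per review.
docs = [
    "staff location breakfast",
    "room staff",
    "location staff bed",
    "room bathroom",
]

# Total occurrences of each term across all comments,
# analogous to summing each column of the count matrix.
term_counts = Counter(token for doc in docs for token in doc.split())
top_three = term_counts.most_common(3)
print(top_three)
```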

### Top 5 hotel features that customers mention the most in positive reviews

In [9]:
matPos.agg(sum).sort_values(ascending=False)[:5]

staff        5732
room         5524
location     5204
breakfast    2190
bed          1829
dtype: int64

### Top 5 hotel features that customers mention the most in negative reviews

In [10]:
matNeg.agg(sum).sort_values(ascending=False)[:5]

room         7646
breakfast    1646
bed          1276
staff        1189
bathroom     1121
dtype: int64

### Top 5 features that customers prefer most by category: solo, couple, group etc

In [11]:
# indices of customer group data subsets (by travel tag)
soloIndex = df.index[df.solo_traveler != 0] # solo_traveler
groupIndex = df.index[df.group != 0] # group
businessIndex = df.index[df.business_trip != 0] # business_trip
leisureIndex = df.index[df.leisure_trip != 0] # leisure_trip
coupleIndex = df.index[df.couple != 0] # couple
familyIndex = df.index[df.family_with_young_children != 0] # family_with_young_children

Solo

In [12]:
matPos.iloc[soloIndex].agg(sum).sort_values(ascending=False)[:5]

room         1156
staff        1154
location     1073
breakfast     406
bed           362
dtype: int64

Group

In [13]:
matPos.iloc[groupIndex].agg(sum).sort_values(ascending=False)[:5]

staff        694
location     602
room         601
breakfast    247
bed          216
dtype: int64

Business

In [14]:
matPos.iloc[businessIndex].agg(sum).sort_values(ascending=False)[:5]

room         878
staff        846
location     802
breakfast    339
bed          248
dtype: int64

Leisure

In [15]:
matPos.iloc[leisureIndex].agg(sum).sort_values(ascending=False)[:5]

staff        4715
room         4501
location     4279
breakfast    1786
bed          1511
dtype: int64

Couple

In [16]:
matPos.iloc[coupleIndex].agg(sum).sort_values(ascending=False)[:5]

staff        2732
room         2672
location     2440
breakfast    1033
bed           954
dtype: int64

Family with young children

In [17]:
matPos.iloc[familyIndex].agg(sum).sort_values(ascending=False)[:5]

staff        790
room         777
location     726
breakfast    346
bed          204
dtype: int64

### Top 5 liked/complained about features by country (UK, France, Italy, Spain)

In [18]:
# United Kingdom likes/complaints
print("\nLiked: \n", matPos.iloc[df.index[df.country == 'United Kingdom']].agg(sum).sort_values(ascending=False)[:5])
print("\nComplained about:\n", matNeg.iloc[df.index[df.country == 'United Kingdom']].agg(sum).sort_values(ascending=False)[:5])


Liked: 
 staff        4920
room         4650
location     4395
breakfast    1921
bed          1534
dtype: int64

Complained about:
 room         6795
breakfast    1426
bed          1133
staff        1022
bathroom      968
dtype: int64


In [19]:
# France likes/complaints
print(matPos.iloc[df.index[df.country == 'France']].agg(sum).sort_values(ascending=False)[:5])
print('\n')
print(matNeg.iloc[df.index[df.country == 'France']].agg(sum).sort_values(ascending=False)[:5])

location     718
room         679
staff        651
breakfast    227
bed          199
dtype: int64


room         577
breakfast    180
staff        129
location     117
bathroom      98
dtype: int64


In [20]:
# Italy likes/complaints (all-zero output: no rows in the 15,000-row sample have country == 'Italy')
print(matPos.iloc[df.index[df.country == 'Italy']].agg(sum).sort_values(ascending=False)[:5])
print('\n')
print(matNeg.iloc[df.index[df.country == 'Italy']].agg(sum).sort_values(ascending=False)[:5])

access      0.0
picture     0.0
problem     0.0
price       0.0
pressure    0.0
dtype: float64


ac                      0.0
restaurant breakfast    0.0
room building           0.0
room breakfast          0.0
room bit                0.0
dtype: float64


In [21]:
# Spain likes/complaints (all-zero output: no rows in the 15,000-row sample have country == 'Spain')
print(matPos.iloc[df.index[df.country == 'Spain']].agg(sum).sort_values(ascending=False)[:5])
print('\n')
print(matNeg.iloc[df.index[df.country == 'Spain']].agg(sum).sort_values(ascending=False)[:5])

access      0.0
picture     0.0
problem     0.0
price       0.0
pressure    0.0
dtype: float64


ac                      0.0
restaurant breakfast    0.0
room building           0.0
room breakfast          0.0
room bit                0.0
dtype: float64


### Top 5 Hotels Overall with consistently high ratings

In [37]:
df.head(3)

Unnamed: 0,hotel_name,hotel_address,review_count,non_review_scoring_count,average_hotel_score,review_date,reviewer_nationality,positive_comments,negative_comments,total_reviewer_reviews,reviewer_score,tags,country,year,solo_traveler,group,business_trip,leisure_trip,couple,family_with_young_children
0,Hotel Arena,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,1403,194,7.7,2017-08-03,Russia,park,post site trip mistake place booking com night...,7,2.9,"[leisure trip, couple]",Netherlands,2017,0,0,0,1,1,0
1,Hotel Arena,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,1403,194,7.7,2017-08-03,Ireland,complaint location surrounding amenity service...,,7,7.5,"[leisure trip, couple]",Netherlands,2017,0,0,0,1,1,0
2,Hotel Arena,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,1403,194,7.7,2017-07-31,Australia,location staff breakfast range,room bit room story step level room tea coffee...,9,7.1,"[leisure trip, family with young children]",Netherlands,2017,0,0,0,1,0,1


In [33]:
overallRatings = df[['hotel_name', 'average_hotel_score']].groupby(['hotel_name']).mean()

# top 5 hotels
topOverall = overallRatings.sort_values(by=['average_hotel_score'], ascending=False)[:5]
topOverall

hotel_name,average_hotel_score
Haymarket Hotel,9.6
Milestone Hotel Kensington,9.5
Intercontinental London The O2,9.4
Apex Temple Court Hotel,9.2
One Aldwych,9.2
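
`average_hotel_score` appears to be a single value per hotel (the same 7.7 on every Hotel Arena row above), so ranking on it does not by itself show that ratings are consistent. One possible reading of "consistently high", offered here as an assumption rather than the assignment's definition, is a high mean `reviewer_score` with a low spread. The sketch below uses made-up scores and a hypothetical mean-minus-standard-deviation ranking to show how a hotel with the higher raw mean can rank lower once variability is considered.

```python
from statistics import mean, pstdev

# Invented reviewer scores per hotel, for illustration only.
ratings = {
    "Hotel A": [9.5, 9.6, 9.4, 9.5],    # high and steady
    "Hotel B": [10.0, 10.0, 10.0, 8.4], # higher mean, more variable
    "Hotel C": [7.0, 7.2, 6.8, 7.0],    # steady but lower
}

# (mean, population standard deviation) for each hotel
summary = {hotel: (mean(s), pstdev(s)) for hotel, s in ratings.items()}

# Ranking by mean alone versus a hypothetical consistency-adjusted score.
by_mean = sorted(summary, key=lambda h: summary[h][0], reverse=True)
by_adjusted = sorted(summary, key=lambda h: summary[h][0] - summary[h][1], reverse=True)
print(by_mean)
print(by_adjusted)
```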


### Bottom 5 Hotels Overall with consistently low ratings

In [35]:
bottomOverall = overallRatings.sort_values(by=['average_hotel_score'])[:5]
bottomOverall

hotel_name,average_hotel_score
Kube Hotel Ice Bar,7.2
Novotel Suites Paris Nord 18 me,7.4
H tel des Ducs D Anjou,7.7
The Park Grand London Paddington,7.7
Hotel Arena,7.7


### 5 most improved hotels with the highest improvement in average ratings from 2015 to 2017, showing their average ratings for each of the 3 years
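
No code for this part appears in the notebook. One way to approach it, sketched with the standard library and invented records, is to average scores per hotel and year, keep hotels that have reviews in all three years, and rank by the 2017 average minus the 2015 average. Using `reviewer_score` as the rating and requiring all three years are assumptions. In the notebook the same per-year table could likely be built with `df.groupby(['hotel_name', 'year'])['reviewer_score'].mean().unstack()`.

```python
from collections import defaultdict
from statistics import mean

# Invented (hotel, year, reviewer_score) records, for illustration only.
reviews = [
    ("Hotel A", 2015, 7.0), ("Hotel A", 2016, 7.5), ("Hotel A", 2017, 8.5),
    ("Hotel B", 2015, 9.0), ("Hotel B", 2016, 9.0), ("Hotel B", 2017, 9.1),
    ("Hotel C", 2015, 8.0), ("Hotel C", 2017, 7.0),   # no 2016 reviews
]
YEARS = (2015, 2016, 2017)

# Collect scores per (hotel, year), then average them.
scores = defaultdict(list)
for hotel, year, score in reviews:
    scores[(hotel, year)].append(score)

yearly = defaultdict(dict)
for (hotel, year), values in scores.items():
    yearly[hotel][year] = mean(values)

# Keep hotels reviewed in every year, then rank by the 2015 -> 2017 change.
complete = {h: avg for h, avg in yearly.items() if all(y in avg for y in YEARS)}
improvement = {h: avg[2017] - avg[2015] for h, avg in complete.items()}
most_improved = sorted(improvement, key=improvement.get, reverse=True)[:5]

for hotel in most_improved:
    print(hotel, [round(complete[hotel][y], 2) for y in YEARS])
```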