# Booking.com Feature Extraction
---

Download the file BookingDotCom_HotelReviews.xlsx from Canvas. This file contains over 515,000 guest reviews and ratings of almost 1,500 hotels across Europe, scraped from the popular hotel reservation website Booking.com. The text data was cleaned by removing Unicode characters and punctuation and converting the text to lower case. No other preprocessing was done. More information on each field is provided in the "Data Description" tab of the Excel file.

        1. What are the top five hotel features (e.g., location, staff, etc.) that customers mention the most in positive reviews and top five features they mention most in negative reviews? Your identified features must make sense (e.g., "great" or "negative" are not features). (3 points)
        
        2. What are the top five features that customers prefer most if they are a solo traveler vs. traveling with a group vs. on a business trip vs. on a leisure trip vs. traveling as a couple vs. a family with young children? You will find these categories in the "Tags" column. There are a few more tags that we don't need. (2 points)

        3. What are the top five features customers like most and the top five features they complain about most for hotels in the United Kingdom, France, Italy, and Spain? Country information is available inside Hotel_Address. (2 points)
        
        4. Create a dashboard with the following plots: (1) "Top Five Hotels Overall" with consistently high ratings, (2) "Bottom Five Hotels Overall" with consistently low ratings, (3) "Five Most Improved Hotels" with the highest improvement in average ratings from 2015 to 2017, showing their average ratings for each of the three years. (0.5+0.5+2 points)

Write clear, compact, and understandable code with comment/markdown statements as appropriate. Non-working code or unnecessary code will be penalized. 

Submit your Jupyter file using the link below or provide a link to your Google Colab or Github file.


In [28]:
# import packages to use
import pandas as pd
import nltk
import pycountry
import ast
import spacy
from spacy import displacy
# from collections import Counter
import en_core_web_sm

In [45]:
# load dataframe
df = pd.read_excel("BookingDotCom_HotelReviews.xlsx", sheet_name="Data")

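# keep only the first 100 rows while prototyping;
# .copy() avoids pandas' SettingWithCopyWarning when new columns are assigned later
df = df.head(100).copy()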

# rename df columns to lower case
df.columns= df.columns.str.lower()

In [33]:
df[['positive_comments', 'negative_comments']].head(3)

Unnamed: 0,positive_comments,negative_comments
0,Only the park outside of the hotel was beauti...,I am so angry that i made this post available...
1,No real complaints the hotel was great great ...,No Negative
2,Location was good and staff were ok It is cut...,Rooms are nice but for elderly a bit difficul...


In [46]:
# load the small English spaCy model; the parser and NER components are
# not needed for POS tags and lemmas, so they are disabled for speed
load_model = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmaFunc(text):
    '''Return the space-joined lemmas of the nouns and proper nouns in text.'''
    doc = load_model(text)
    allowed_tags = {'NOUN', 'PROPN'}
    return " ".join(token.lemma_ for token in doc if token.pos_ in allowed_tags)

# keep only noun lemmas, since hotel features (room, staff, location) are nouns
df['positive_comments'] = df['positive_comments'].apply(lemmaFunc)
df['negative_comments'] = df['negative_comments'].apply(lemmaFunc)

In [47]:
df[['positive_comments', 'negative_comments']].head(3)

Unnamed: 0,positive_comments,negative_comments
0,park hotel,post site trip one mistake place booking com n...
1,complaint hotel location surrounding amenity s...,Negative
2,Location staff hotel breakfast range,room bit room story step level room tea coffee...


In [30]:
'''
# remove stopwords

import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words

def stopFunc(text):
    return " ".join([word.lower() for word in text.split() if word.lower() not in sw_spacy])

df['positive_comments'] = df['positive_comments'].apply(lambda x: stopFunc(x))
df['negative_comments'] = df['negative_comments'].apply(lambda x: stopFunc(x))
'''

In [49]:
'''
    Preprocess the review columns: drop digits and remove stopwords
'''
import re
from nltk.corpus import stopwords   # needs nltk.download('stopwords') once

# build the stopword set once instead of on every row;
# the extra words are frequent in reviews but do not name a hotel feature
stop_words = set(stopwords.words('english'))
stop_words.update({'negative', 'positive', 'the', 'stay', 'hotel', 'night', 'good', 'great'})

def preprocess(row):
    # remove numbers
    row = re.sub(r'\d+', '', row)

    # tokenize on whitespace
    tokens_raw = row.split()

    # remove stopwords
    tokens_filtered = [token for token in tokens_raw if token not in stop_words]

    return ' '.join(tokens_filtered)

df['posCom'] = df.positive_comments.str.lower().apply(preprocess)
df['negCom'] = df.negative_comments.str.lower().apply(preprocess)

In [50]:
df[['positive_comments', 'negative_comments', 'posCom', 'negCom']].head(3)

Unnamed: 0,positive_comments,negative_comments,posCom,negCom
0,park hotel,post site trip one mistake place booking com n...,park,post site trip one mistake place booking com j...
1,complaint hotel location surrounding amenity s...,Negative,complaint location surrounding amenity service...,
2,Location staff hotel breakfast range,room bit room story step level room tea coffee...,location staff breakfast range,room bit room story step level room tea coffee...


In [51]:
'''
Create column with country name; we get the list of countries from the pycountry package
and use the hotel_address column to extract country name
'''

df['country'] = df["hotel_address"].apply(
    lambda address: ' '.join([c.name for c in pycountry.countries if c.name in address])
    )

# use the review_date column to extract the year and store in new column
df['year'] = pd.DatetimeIndex(df['review_date']).year
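
Scanning every pycountry entry for each address works, but it is slow on the full dataset. A possible alternative is sketched below. It assumes that the country name is the last part of the address and that the hand-written list `KNOWN_COUNTRIES` covers the countries in the data; both assumptions should be checked against the real `hotel_address` values. The function name `extract_country` and the sample addresses are made up for illustration.

```python
# assumed list of countries; extend it after inspecting the data
KNOWN_COUNTRIES = ['United Kingdom', 'France', 'Italy', 'Spain', 'Netherlands']

def extract_country(address: str) -> str:
    """Return the known country the address ends with, or '' if none matches."""
    cleaned = address.strip()
    for country in KNOWN_COUNTRIES:
        if cleaned.endswith(country):
            return country
    return ''

# made-up addresses for illustration
print(extract_country('1 Example Street London W1 United Kingdom'))
print(extract_country('2 Rue Exemple 75001 Paris France'))
print(extract_country('Unknown place'))
```

In the notebook this would be applied as `df['country'] = df['hotel_address'].apply(extract_country)`. Rows that come back empty show where the list needs extending.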

In [52]:
'''
In this step we deal with the tags column using the steps defined below:
    1. Define tags we are interested in
    2. Define a function to apply to tags column to remove tags we are not interested in by:
            - Converting the individual row values to list (from string) e.g. "[' Leisure trip ']" -> [' leisure trip ']
            - Strip the whitespaces from individual elements e.g. [' leisure trip ']-> ['leisure trip']
            - Drop tags we are not interested in
'''

# customer tags we are interested in
customer_tags = ['solo traveler','group','business trip','leisure trip','couple','family with young children']


def clean_tag(x):

    # convert value from string to a list
    myTags = ast.literal_eval(x.lower())

    # strip whitespaces from elements and drop those we are not interested in
    myTags = [customerTag.strip() for customerTag in myTags if customerTag.strip() in customer_tags]

    return myTags


# apply clean_tag() function to the tags column
df['tags'] = df['tags'].apply(lambda x: clean_tag(x))
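
As a quick check of the steps described in the cell above, the sketch below runs the same parsing logic on a single made-up tag string. The sample value and the names `TAGS_OF_INTEREST` and `parse_tags` are illustrative assumptions, not taken from the dataset.

```python
import ast

# mirrors customer_tags above, stored as a set for fast membership tests
TAGS_OF_INTEREST = {'solo traveler', 'group', 'business trip',
                    'leisure trip', 'couple', 'family with young children'}

def parse_tags(raw: str) -> list:
    """Parse a stringified list of tags and keep only the tags of interest."""
    tags = ast.literal_eval(raw.lower())       # "[' A ', ' B ']" -> [' a ', ' b ']
    stripped = (tag.strip() for tag in tags)   # drop surrounding whitespace
    return [tag for tag in stripped if tag in TAGS_OF_INTEREST]

# made-up value in the same shape as the tags column
sample = "[' Leisure trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"
print(parse_tags(sample))   # ['leisure trip', 'couple']
```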

In [53]:
'''
Function applied to tags column to extract new columns for customer categories
Lambda function will be used as in the steps below
'''

def split_tag(x: list, tagName: str) -> int:
    # 1 if the tag is present in the row's tag list, otherwise 0
    return int(tagName in x)

# dictionary of column names (new additional columns) and customer tags (as contained in tags of interest)
tagDict = {
    'solo_traveler' : 'solo traveler',
    'group' : 'group',
    'business_trip' : 'business trip',
    'leisure_trip' : 'leisure trip',
    'couple' : 'couple',
    'family_with_young_children' : 'family with young children'
}

# applying function on tags column to get new separated columns
for key, value in tagDict.items():
    df[key] = df['tags'].apply(lambda x: split_tag(x, value))

1. Text Processing
- Cleaning
- Normalization
- Tokenization
- Removing stopwords
- POS tagging
- Named Entity Recognition
- Lemmatization

2. Feature Extraction
- Bag of Words
- TF-IDF
- Word embeddings

3. Modeling

# NER

In [None]:
'''
overallTokens = []

# instantiate Spacy model for performing NER
nlp = en_core_web_sm.load()

# apply model to text
doc = nlp(df.positive_comments[10])
doc.text
'''

In [None]:
# docs = list(nlp.pipe(df.positive_comments))
# len(docs)

In [None]:
'''
# entities
docEntities = list(doc.ents)
docEntities

overallTokens = overallTokens + docEntities
'''

In [None]:
'''
# !!! remove stopwords

# nounPhrases
nounPhrases = [chunk.text for chunk in doc.noun_chunks]
nounPhrases

overallTokens = overallTokens + nounPhrases
'''

In [None]:
'''
# !!! lemmatize first

# using POS tagging

pos = [
    'PROPN',
    'NOUN',
    'VERB'
]

clean_doc = [token.lemma_ for token in doc if token.pos_ in pos]
print(clean_doc)

overallTokens = overallTokens + clean_doc
'''

In [None]:
'''
# sentences
spSent = list(doc.sents)

# words
spWords = [w.text for w in doc]

'''

In [None]:
'''
# Word frequency
from collections import Counter

spCleanWordFreq = Counter(overallTokens)
print(spCleanWordFreq)

# counterSum = counter1 + counter2
'''

In [None]:
'''
# dependency parsing
sentence = doc
spSentence = nlp(sentence)

for w in spSentence:
     print (w.text, w.tag_, w.head.text, w.dep_)
'''

In [None]:
'''
# shallow parsing
# noun phrase detection

for chunk in doc.noun_chunks:
     print (chunk)
'''

# Coreference resolution
https://towardsdatascience.com/from-text-to-knowledge-the-information-extraction-pipeline-b65e7e30273e

# Bi-grams

https://medium.com/analytics-vidhya/feature-extraction-and-sentiment-analysis-of-reviews-of-3-apps-in-india-84b665e1a887

# Trial

In [None]:
'''
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# !pip install gensim
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary

my_docs = ['This movie was about spaceships and aliens.',
           'I really enjoyed the movie!',
           'Awesome action scenes, but boring characters.',
           'The movie was awful! I hate alien films.',
           'Space is cool! I like space movies.',
           'More space films, please!']

tokens = [] # list of lists >> column of pos/neg reviews tokens

for doc in my_docs:
    words = regexp_tokenize(doc.lower(), r'[A-Za-z]+')
    words = [w for w in words if w not in stopwords.words('english')] # remove stopwords
    words = [lemmatizer.lemmatize(w) for w in words] # lemmatize
    tokens.append(words)

my_dict = Dictionary(tokens) # dictionary
print(my_dict.token2id) 

# Note: You can also add new words or lists of words to a dictionary, save a
# dictionary to a file, load it back later, or read a dictionary from a text file

# my_dict.add_documents(list_of_new_words)
# my_dict.save('saved_dict.dict')
# loaded_dict = Dictionary.load('saved_dict.dict')
# dictionary = Dictionary(line.split() for line in open('sample.txt', encoding='utf-8'))

dtm = [my_dict.doc2bow(doc) for doc in tokens]
dtm  # Create Gensim BOW corpus

for doc in dtm:
    print([[my_dict[i], freq] for i, freq in doc])

tokenList = tokens # list of lists

dtm = [my_dict.doc2bow(tokenListItem) for tokenListItem in tokenList]
dtm  # Create Gensim BOW corpus
'''

In [8]:
from sklearn.feature_extraction.text  import CountVectorizer

#load_model = en_core_web_sm.load()

In [57]:
vecPos = CountVectorizer(
                        strip_accents='ascii', stop_words='english', ngram_range=(1,2),
                        analyzer='word', max_df=0.85, min_df=0.01,max_features=100
                        )

vecNeg = CountVectorizer(max_df=0.85)

sparseVecPos = vecPos.fit_transform(df['posCom'])   # You can fit and transform jointly 
sparseVecNeg = vecNeg.fit_transform(df['negCom'])

matPos = pd.DataFrame(sparseVecPos.toarray(), columns=vecPos.get_feature_names_out())
matNeg = pd.DataFrame(sparseVecNeg.toarray(), columns=vecNeg.get_feature_names_out())

In [58]:
# 1a. What are the top five hotel features (e.g., location, staff, etc.) that customers mention the most in positive reviews
matPos.sum().sort_values(ascending=False)[:10]

room          61
location      39
park          38
staff         35
bed           29
restaurant    21
building      20
breakfast     18
area          13
tram          12
dtype: int64

In [59]:
# 1b. What are the top five hotel features (e.g., location, staff, etc.) that customers mention the most in negative reviews
matNeg.sum().sort_values(ascending=False)[:20]

room            116
bathroom         22
breakfast        20
shower           19
floor            18
bed              14
staff            14
time             13
construction     12
window           12
door             12
glass            11
restaurant       11
day              11
water            11
service          10
work             10
building         10
coffee            9
area              9
dtype: int64

In [60]:
matPos.shape, matNeg.shape

((100, 100), (100, 452))

In [49]:
# columns to include

colsInteresting = [
                    'hotel_name','average_hotel_score','reviewer_score','posCom','negCom','country','year',
                    'solo_traveler','group','business_trip','leisure_trip','couple','family_with_young_children'
                    ]

In [50]:
dfSample = df[colsInteresting]
dfSample.head(3)

Unnamed: 0,hotel_name,average_hotel_score,reviewer_score,posCom,negCom,country,year,solo_traveler,group,business_trip,leisure_trip,couple,family_with_young_children
0,Hotel Arena,7.7,2.9,park outside beautiful,angry make post available possible site plan t...,Netherlands,2017,0,0,0,1,1,0
1,Hotel Arena,7.7,7.5,real complaint location surroundings room amen...,,Netherlands,2017,0,0,0,1,1,0
2,Hotel Arena,7.7,7.1,location staff cute breakfast range nice back,room nice elderly difficult room story narrow ...,Netherlands,2017,0,0,0,1,0,1
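
With `hotel_name`, `reviewer_score`, and `year` in one frame, the "most improved hotels" part of question 4 could be approached with a pivot table. The sketch below uses made-up scores for three hypothetical hotels. Measuring improvement as the 2017 average minus the 2015 average, and keeping only hotels reviewed in all three years, are assumptions about how the question should be read.

```python
import pandas as pd

# made-up review scores for three hypothetical hotels
reviews = pd.DataFrame({
    'hotel_name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'year': [2015, 2016, 2017, 2017, 2015, 2016, 2017, 2015, 2016, 2017],
    'reviewer_score': [6.0, 7.0, 8.0, 9.0, 8.0, 8.0, 7.5, 5.0, 5.5, 6.0],
})

# average score per hotel and year, one column per year
yearly = reviews.pivot_table(index='hotel_name', columns='year',
                             values='reviewer_score', aggfunc='mean')

# keep hotels with reviews in every year, then measure the 2015 -> 2017 change
yearly = yearly.dropna()
yearly['improvement'] = yearly[2017] - yearly[2015]

most_improved = yearly.sort_values('improvement', ascending=False).head(5)
print(most_improved)
```

The three year columns of `most_improved` hold the per-year averages that the dashboard plot needs.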


In [55]:
matPos.head(3)

Unnamed: 0,abbey,ability,abit,able,aboard,abroad,absolute,absolutely,absoulty,abysmal,...,yogurt,yougurt,young,youtube,yoyo,yulian,yummy,zero,ziplock,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
dfPos = pd.concat([dfSample, matPos], axis=1)
dfPos.head(3)

Unnamed: 0,hotel_name,average_hotel_score,reviewer_score,posCom,negCom,country,year,solo_traveler,group,business_trip,...,yogurt,yougurt,young,youtube,yoyo,yulian,yummy,zero,ziplock,zone
0,Hotel Arena,7.7,2.9,park outside beautiful,angry make post available possible site plan t...,Netherlands,2017,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hotel Arena,7.7,7.5,real complaint location surroundings room amen...,,Netherlands,2017,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hotel Arena,7.7,7.1,location staff cute breakfast range nice back,room nice elderly difficult room story narrow ...,Netherlands,2017,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [83]:
matPos.sum().sort_values(ascending=False)[:3]

staff       4005
room        3986
location    3758
dtype: int64

In [None]:
'''
What are the top five features that customers prefer most if they are a solo traveler vs traveling with a 
group vs on a business trip vs a leisure trip vs traveling as a couple vs a family with young children.
'''

customerCategory = 'solo_traveler'

x = dfPos[dfPos[customerCategory] != 0]

y = x[list(matPos.columns)].sum().sort_values(ascending=False)[:5]
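
The cell above handles one category at a time. A minimal sketch of repeating the same filter-and-sum step for every category follows, using a small made-up frame in place of `dfPos`. The helper name `top_features` and the toy data are assumptions for illustration.

```python
import pandas as pd

# made-up data: two 0/1 category flags plus three feature-count columns
toy = pd.DataFrame({
    'solo_traveler': [1, 0, 1, 0],
    'couple':        [0, 1, 0, 1],
    'room':          [2, 0, 1, 3],
    'staff':         [0, 1, 1, 0],
    'location':      [1, 2, 3, 0],
})
category_cols = ['solo_traveler', 'couple']
feature_cols = ['room', 'staff', 'location']

def top_features(frame, category, features, n=5):
    """Sum feature counts over rows flagged with the category; return the n largest."""
    subset = frame.loc[frame[category] == 1, features]
    return subset.sum().sort_values(ascending=False).head(n)

top_by_category = {cat: top_features(toy, cat, feature_cols, n=2)
                   for cat in category_cols}
for cat, top in top_by_category.items():
    print(cat, top.to_dict())
```

With the notebook's frames, the call would be `top_features(dfPos, key, list(matPos.columns))` for each key in `tagDict`. This assumes no token column in `matPos` shares a name with a metadata column such as `group` or `couple`; a collision would leave duplicate column names in `dfPos` and should be resolved first.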

In [91]:
matPos.shape, dfPos.shape

((10000, 4665), (10000, 4678))

In [None]:
x[matPos.columns].head()

In [92]:
matPos.head()

Unnamed: 0,abbey,ability,abit,able,aboard,abroad,absolute,absolutely,absoulty,abysmal,...,yogurt,yougurt,young,youtube,yoyo,yulian,yummy,zero,ziplock,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
