# Booking.com Feature Extraction
---

Download the file BookingDotCom_HotelReviews.xlsx from Canvas. This file contains over 515,000 guest reviews and ratings of almost 1,500 hotels across Europe, scraped from the popular hotel reservation website Booking.com. The text data was cleaned by removing Unicode characters and punctuation and was transformed to lower case. No other preprocessing was done. More information on each field is provided in the "Data Description" tab of the Excel file.

        1. What are the top five hotel features (e.g., location, staff, etc.) that customers mention the most in positive reviews and top five features they mention most in negative reviews? Your identified features must make sense (e.g., "great" or "negative" are not features). (3 points)
        
        2. What are the top five features that customers prefer most if they are a solo traveler vs. traveling with a group vs. on a business trip vs. on a leisure trip vs. traveling as a couple vs. a family with young children? You will find these categories in the "Tags" column. There are a few more tags that we don't need. (2 points)

        3. What are the top five features customers like most and top five features they complain about most about hotels in United Kingdom, France, Italy, and Spain? Country information is available inside Hotel_Address. (2 points)
        
        4. Create a dashboard with the following plots: (1) "Top Five Hotels Overall" with consistently high ratings, (2) "Bottom Five Hotels Overall" with consistently low ratings, (3) "Five Most Improved Hotels" with the highest improvement in average ratings from 2015 to 2017, showing their average ratings for each of the three years. (0.5+0.5+2 points)

Write clear, compact, and understandable code with comment/markdown statements as appropriate. Non-working code or unnecessary code will be penalized. 

Submit your Jupyter file using the link below or provide a link to your Google Colab or Github file.


In [1]:
# import packages to use
import pandas as pd
import pycountry
import spacy
from spacy import displacy
# from collections import Counter
import en_core_web_sm

In [2]:
# load dataframe
df = pd.read_excel("BookingDotCom_HotelReviews.xlsx", sheet_name="Data")

# keep only the first 1000 rows (not a random sample); copy to avoid SettingWithCopyWarning
df = df[:1000].copy()

# rename df columns to lower case
df.columns = df.columns.str.lower()

In [3]:
# collapse runs of whitespace in the comments to single spaces
df['positive_comments'] = df['positive_comments'].apply(lambda x: " ".join(x.split()))
df['negative_comments'] = df['negative_comments'].apply(lambda x: " ".join(x.split()))

In [4]:
# The positive and negative comments contain mixed case; lower-casing them would
# normalize the text for later processing. The step is left disabled here, so the
# outputs below keep their original casing.
# df['positive_comments'] = df['positive_comments'].str.lower()
# df['negative_comments'] = df['negative_comments'].str.lower()

In [5]:
'''
Create column with country name; we get the list of countries from the pycountry package
and use the hotel_address column to extract country name
'''

df['country'] = df["hotel_address"].apply(
    lambda address: ' '.join([c.name for c in pycountry.countries if c.name in address])
    )

# use the review_date column to extract the year and store in new column
df['year'] = pd.DatetimeIndex(df['review_date']).year
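
Because only four countries are needed for the country-level question, a lighter alternative to scanning every `pycountry` entry is to match the end of the address against a short list. The sketch below assumes that each address ends with the country name; the helper `extract_country`, the constant `TARGET_COUNTRIES`, and the sample addresses are made up for illustration and are not part of the dataset.

```python
# Countries named in the assignment; extend the tuple to cover more of the dataset.
TARGET_COUNTRIES = ("United Kingdom", "France", "Italy", "Spain")


def extract_country(address, countries=TARGET_COUNTRIES):
    """Return the country that ends the address, or None if none matches.

    Assumption: the country name is the last part of the address string.
    """
    cleaned = " ".join(address.split())  # collapse stray whitespace
    for country in countries:
        if cleaned.endswith(country):
            return country
    return None


# Made-up addresses for illustration only
examples = [
    "1 Example Street London W1 1AA United Kingdom",
    "2 Rue Exemple 75001 Paris France",
    "3 Via Esempio 20121 Milan Italy",
    "4 Voorbeeldstraat 1011 AA Amsterdam Netherlands",
]
countries = [extract_country(a) for a in examples]
print(countries)  # ['United Kingdom', 'France', 'Italy', None]
```

Returning `None` for unmatched addresses makes it easy to filter those rows out, whereas joining all matched names into one string can produce combined values when an address happens to contain more than one country name.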

In [6]:
'''
In this step we deal with the tags column using the steps defined below:
    1. Define tags we are interested in
    2. Define a function to apply to tags column to remove tags we are not interested in by:
            - Converting the individual row values to list (from string) e.g. "[' Leisure trip ']" -> [' leisure trip ']
            - Strip the whitespaces from individual elements e.g. [' leisure trip ']-> ['leisure trip']
            - Drop tags we are not interested in
'''

# customer tags we are interested in
customer_tags = [
    'solo traveler',
    'group',
    'business trip',
    'leisure trip',
    'couple',
    'family with young children'
]

import ast

def clean_tag(x):
    # parse the string representation of the list into a real list
    myTags = ast.literal_eval(x.lower())

    # strip whitespace from elements and drop those we are not interested in
    myTags = [customerTag.strip() for customerTag in myTags if customerTag.strip() in customer_tags]

    return myTags


# apply function to the tags column
df['tags'] = df['tags'].apply(clean_tag)
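
To make the effect of the tag cleaning concrete, here is a small standalone sketch that runs the same logic on one value. The sample string is hypothetical; it only imitates the `"[' Leisure trip ']"` format described in the docstring above, and `CUSTOMER_TAGS` mirrors the `customer_tags` list as a set.

```python
import ast

# Same tags as customer_tags above, stored as a set for fast membership tests
CUSTOMER_TAGS = {
    "solo traveler",
    "group",
    "business trip",
    "leisure trip",
    "couple",
    "family with young children",
}


def clean_tag(raw):
    """Parse a stringified list of tags and keep only the customer tags."""
    tags = ast.literal_eval(raw.lower())
    return [t.strip() for t in tags if t.strip() in CUSTOMER_TAGS]


# Hypothetical raw value in the same style as the Tags column
sample = "[' Leisure trip ', ' Couple ', ' Duplex Double Room ', ' Stayed 6 nights ']"
cleaned = clean_tag(sample)
print(cleaned)  # ['leisure trip', 'couple']
```

Room type and length-of-stay entries are dropped because they are not in the set of customer tags.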

In [7]:
'''
Function to apply to the df.tags column to create indicator columns for customer categories.
A lambda is used below to apply it to the column once per category.
'''
def split_tag(x: list, tagName: str) -> int:
    # 1 if the tag is present in the row's tag list, otherwise 0
    return int(tagName in x)

# dictionary of column names and customer tags (as contained in df.tags values)
tagDict = {
    'solo_traveler' : 'solo traveler',
    'group' : 'group',
    'business_trip' : 'business trip',
    'leisure_trip' : 'leisure trip',
    'couple' : 'couple',
    'family_with_young_children' : 'family with young children'
}

# applying function on tags column to get new separated columns
for key, value in tagDict.items():
    df[key] = df['tags'].apply(lambda x: split_tag(x, value))

In [8]:
# drop columns we do not need for now
colsToDrop = [
    'hotel_address', 
    'review_date', 
    'reviewer_nationality', 
    'tags'
    ]

df.drop(columns=colsToDrop, inplace=True)

1. Text Processing
- Cleaning
- Normalization
- Tokenization
- Removing stopwords
- POS tagging
- Named Entity Recognition
- Lemmatization

2. Feature Extraction
- Bag of Words
- TF-IDF
- Word embeddings

3. Modeling
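
The Bag of Words and TF-IDF items in the outline above can be illustrated with the standard library alone. This is a minimal sketch on three invented review snippets; it uses the plain textbook formula (term frequency times the log of inverse document frequency), whereas libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

# Invented snippets standing in for cleaned review text
docs = [
    "great location friendly staff",
    "friendly staff clean room",
    "noisy room poor breakfast",
]
tokenized = [d.split() for d in docs]

# Bag of words: raw term counts per document
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, bag):
    """Unsmoothed TF-IDF of a term that occurs somewhere in the corpus."""
    tf = bag[term] / sum(bag.values())
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


scores = {term: tf_idf(term, bags[0]) for term in bags[0]}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In the first snippet, "great" and "location" score higher than "friendly" and "staff" because the latter two also appear in another document.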

# NER

In [9]:
nlp = en_core_web_sm.load()

In [10]:
doc = nlp(df.positive_comments[10])

In [11]:
doc.text

'Rooms were stunningly decorated and really spacious in the top of the building Pictures are of room 300 The true beauty of the building has been kept but modernised brilliantly Also the bath was lovely and big and inviting Great more for couples Restaurant menu was a bit pricey but there were loads of little eatery places nearby within walking distance and the tram stop into the centre was about a 6 minute walk away and only about 3 or 4 stops from the centre of Amsterdam Would recommend this hotel to anyone it s unbelievably well priced too'

In [12]:
spSent = list(doc.sents)
spSent

[Rooms were stunningly decorated and really spacious in the top of the building Pictures are of room 300,
 The true beauty of the building has been kept but modernised brilliantly Also the bath was lovely and big and inviting Great more for couples Restaurant menu was a bit pricey but there were loads of little eatery places nearby within walking distance and the tram stop into the centre was about a 6 minute walk away and only about 3 or 4 stops from the centre of Amsterdam Would recommend this hotel to anyone it s unbelievably well priced too]

In [13]:
spWords = [w.text for w in doc]
print(spWords)

['Rooms', 'were', 'stunningly', 'decorated', 'and', 'really', 'spacious', 'in', 'the', 'top', 'of', 'the', 'building', 'Pictures', 'are', 'of', 'room', '300', 'The', 'true', 'beauty', 'of', 'the', 'building', 'has', 'been', 'kept', 'but', 'modernised', 'brilliantly', 'Also', 'the', 'bath', 'was', 'lovely', 'and', 'big', 'and', 'inviting', 'Great', 'more', 'for', 'couples', 'Restaurant', 'menu', 'was', 'a', 'bit', 'pricey', 'but', 'there', 'were', 'loads', 'of', 'little', 'eatery', 'places', 'nearby', 'within', 'walking', 'distance', 'and', 'the', 'tram', 'stop', 'into', 'the', 'centre', 'was', 'about', 'a', '6', 'minute', 'walk', 'away', 'and', 'only', 'about', '3', 'or', '4', 'stops', 'from', 'the', 'centre', 'of', 'Amsterdam', 'Would', 'recommend', 'this', 'hotel', 'to', 'anyone', 'it', 's', 'unbelievably', 'well', 'priced', 'too']


In [14]:
spCleanWords = [w.text for w in doc
           if not w.is_stop and not w.is_punct]
len(spCleanWords)

48

In [None]:
print(spCleanWords)

In [15]:
# Word frequency
from collections import Counter
spCleanWordFreq = Counter(spCleanWords)
print(spCleanWordFreq)

Counter({'building': 2, 'centre': 2, 'Rooms': 1, 'stunningly': 1, 'decorated': 1, 'spacious': 1, 'Pictures': 1, 'room': 1, '300': 1, 'true': 1, 'beauty': 1, 'kept': 1, 'modernised': 1, 'brilliantly': 1, 'bath': 1, 'lovely': 1, 'big': 1, 'inviting': 1, 'Great': 1, 'couples': 1, 'Restaurant': 1, 'menu': 1, 'bit': 1, 'pricey': 1, 'loads': 1, 'little': 1, 'eatery': 1, 'places': 1, 'nearby': 1, 'walking': 1, 'distance': 1, 'tram': 1, 'stop': 1, '6': 1, 'minute': 1, 'walk': 1, 'away': 1, '3': 1, '4': 1, 'stops': 1, 'Amsterdam': 1, 'recommend': 1, 'hotel': 1, 's': 1, 'unbelievably': 1, 'priced': 1})


In [16]:
spCleanWordFreq.most_common(5)

[('building', 2),
 ('centre', 2),
 ('Rooms', 1),
 ('stunningly', 1),
 ('decorated', 1)]

In [17]:
# Words occurring twice or more
spFreqCleanWords = [w for (w, freq) in spCleanWordFreq.items() if freq > 1]
spFreqCleanWords

['building', 'centre']

In [18]:
# lemmatization
for w in doc:
     print (w, w.lemma_)

Rooms room
were be
stunningly stunningly
decorated decorate
and and
really really
spacious spacious
in in
the the
top top
of of
the the
building building
Pictures Pictures
are be
of of
room room
300 300
The the
true true
beauty beauty
of of
the the
building building
has have
been be
kept keep
but but
modernised modernise
brilliantly brilliantly
Also also
the the
bath bath
was be
lovely lovely
and and
big big
and and
inviting invite
Great great
more more
for for
couples couple
Restaurant restaurant
menu menu
was be
a a
bit bit
pricey pricey
but but
there there
were be
loads load
of of
little little
eatery eatery
places place
nearby nearby
within within
walking walk
distance distance
and and
the the
tram tram
stop stop
into into
the the
centre centre
was be
about about
a a
6 6
minute minute
walk walk
away away
and and
only only
about about
3 3
or or
4 4
stops stop
from from
the the
centre centre
of of
Amsterdam Amsterdam
Would would
recommend recommend
this this
hotel hotel
to to
anyone 

In [19]:
CleanSentence = ' '.join(spCleanWords).strip()
CleanSentence

'Rooms stunningly decorated spacious building Pictures room 300 true beauty building kept modernised brilliantly bath lovely big inviting Great couples Restaurant menu bit pricey loads little eatery places nearby walking distance tram stop centre 6 minute walk away 3 4 stops centre Amsterdam recommend hotel s unbelievably priced'

In [20]:
spCleanSentence = nlp(CleanSentence)
spCleanWords = []

for w in spCleanSentence:
    spCleanWords.append(w.lemma_)

print(spCleanWords)

['room', 'stunningly', 'decorate', 'spacious', 'building', 'Pictures', 'room', '300', 'true', 'beauty', 'building', 'keep', 'modernise', 'brilliantly', 'bath', 'lovely', 'big', 'invite', 'great', 'couple', 'restaurant', 'menu', 'bit', 'pricey', 'load', 'little', 'eatery', 'place', 'nearby', 'walk', 'distance', 'tram', 'stop', 'centre', '6', 'minute', 'walk', 'away', '3', '4', 'stop', 'centre', 'Amsterdam', 'recommend', 'hotel', 's', 'unbelievably', 'price']


In [21]:
for w in doc:
     print (w, w.pos_, w.tag_, spacy.explain(w.tag_))

Rooms NOUN NNS noun, plural
were AUX VBD verb, past tense
stunningly ADV RB adverb
decorated VERB VBN verb, past participle
and CCONJ CC conjunction, coordinating
really ADV RB adverb
spacious ADJ JJ adjective (English), other noun-modifier (Chinese)
in ADP IN conjunction, subordinating or preposition
the DET DT determiner
top NOUN NN noun, singular or mass
of ADP IN conjunction, subordinating or preposition
the DET DT determiner
building NOUN NN noun, singular or mass
Pictures PROPN NNPS noun, proper plural
are AUX VBP verb, non-3rd person singular present
of ADP IN conjunction, subordinating or preposition
room NOUN NN noun, singular or mass
300 NUM CD cardinal number
The DET DT determiner
true ADJ JJ adjective (English), other noun-modifier (Chinese)
beauty NOUN NN noun, singular or mass
of ADP IN conjunction, subordinating or preposition
the DET DT determiner
building NOUN NN noun, singular or mass
has AUX VBZ verb, 3rd person singular present
been AUX VBN verb, past participle
kep

In [22]:
# Extracting nouns using POS tags

spNouns = []
for w in doc:
     if w.pos_ == 'NOUN':
         spNouns.append(w)

spNouns

[Rooms,
 top,
 building,
 room,
 beauty,
 building,
 bath,
 couples,
 Restaurant,
 menu,
 bit,
 loads,
 eatery,
 places,
 distance,
 centre,
 minute,
 walk,
 stops,
 centre,
 hotel]
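
The noun list above comes from a single review. To rank features across many reviews, the per-review noun lists could be aggregated as in the following sketch. The helper `top_features`, its `ignore` parameter, and the sample lists are hypothetical; the lists stand in for lemmatized nouns such as those spaCy produces in the cells above.

```python
from collections import Counter


def top_features(noun_lists, n=5, ignore=frozenset()):
    """Rank nouns by the number of reviews that mention them."""
    counts = Counter()
    for nouns in noun_lists:
        # Use a set so a noun repeated within one review is counted once
        counts.update({noun.lower() for noun in nouns} - ignore)
    # Sort by count (descending), then alphabetically for a stable order
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Invented noun lists, one per review
reviews_nouns = [
    ["room", "building", "room", "bath", "hotel"],
    ["staff", "location", "room", "hotel"],
    ["location", "breakfast", "staff", "hotel"],
]
top = top_features(reviews_nouns, n=3, ignore=frozenset({"hotel"}))
print(top)  # [('location', 2), ('room', 2), ('staff', 2)]
```

Counting each noun once per review keeps a single long review from dominating the ranking, and the `ignore` set is one way to drop generic nouns such as "hotel" that are not features.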

In [23]:
# dependency parsing (doc is already parsed, so it can be iterated directly)
for w in doc:
     print (w.text, w.tag_, w.head.text, w.dep_)

Rooms NNS decorated nsubjpass
were VBD decorated auxpass
stunningly RB decorated advmod
decorated VBN decorated ROOT
and CC decorated cc
really RB spacious advmod
spacious JJ decorated conj
in IN spacious prep
the DT top det
top NN in pobj
of IN top prep
the DT building det
building NN of pobj
Pictures NNPS are nsubj
are VBP decorated conj
of IN are prep
room NN of pobj
300 CD room nummod
The DT beauty det
true JJ beauty amod
beauty NN kept nsubjpass
of IN beauty prep
the DT building det
building NN of pobj
has VBZ kept aux
been VBN kept auxpass
kept VBN s ccomp
but CC kept cc
modernised VBD kept conj
brilliantly RB modernised advmod
Also RB was advmod
the DT bath det
bath NN was nsubj
was VBD kept conj
lovely JJ was acomp
and CC lovely cc
big JJ lovely conj
and CC big cc
inviting VBG big conj
Great JJ inviting dobj
more JJR inviting dobj
for IN was mark
couples NNS menu compound
Restaurant NN menu compound
menu NN was nsubj
was VBD was advcl
a DT bit det
bit NN pricey npadvmod
pricey 

In [24]:
# shallow parsing
for chunk in doc.noun_chunks:
     print (chunk)

Rooms
the top
the building
Pictures
room
The true beauty
the building
the bath
couples Restaurant menu
loads
little eatery places
walking distance
the tram
the centre
about a 6 minute walk
only about 3 or 4 stops
the centre
Amsterdam
this hotel
anyone
it


In [25]:
# NER
for ent in doc.ents:
     print(ent.text, ent.start_char, ent.end_char,
           ent.label_, spacy.explain(ent.label_))

300 100 103 CARDINAL Numerals that do not fall under another type
6 minute 401 409 TIME Times smaller than a day
only about 3 424 436 CARDINAL Numerals that do not fall under another type
4 440 441 CARDINAL Numerals that do not fall under another type
Amsterdam 467 476 GPE Countries, cities, states


In [26]:
# highlight the named entities inline ('distance' only applies to the dependency view)
displacy.render(doc, style='ent', jupyter=True)

In [27]:
# noun phrase detection
for chunk in doc.noun_chunks:
     print (chunk)

Rooms
the top
the building
Pictures
room
The true beauty
the building
the bath
couples Restaurant menu
loads
little eatery places
walking distance
the tram
the centre
about a 6 minute walk
only about 3 or 4 stops
the centre
Amsterdam
this hotel
anyone
it


# Coreference resolution
https://towardsdatascience.com/from-text-to-knowledge-the-information-extraction-pipeline-b65e7e30273e

# Bi-grams

https://medium.com/analytics-vidhya/feature-extraction-and-sentiment-analysis-of-reviews-of-3-apps-in-india-84b665e1a887
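
As a starting point for the bi-gram idea, adjacent token pairs can be counted with the standard library. This is a minimal sketch on invented snippets; `bigram_counts` is a hypothetical helper, and in the notebook the token lists would more likely come from the cleaned, lemmatized spaCy output.

```python
from collections import Counter


def bigram_counts(token_lists):
    """Count adjacent token pairs across all token lists."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Invented snippets standing in for cleaned review text
reviews = [
    "friendly staff great location",
    "great location near tram stop",
    "friendly staff clean room",
]
counts = bigram_counts(r.split() for r in reviews)
print(counts.most_common(2))
```

Pairs such as ("friendly", "staff") carry the opinion together with the feature, which single-word counts lose.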