<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Problem Statement</a></span></li><li><span><a href="#Data-Science-Problem" data-toc-modified-id="Data-Science-Problem-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Science Problem</a></span></li><li><span><a href="#Data-Scraping" data-toc-modified-id="Data-Scraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Scraping</a></span><ul class="toc-item"><li><span><a href="#TripAdvisor" data-toc-modified-id="TripAdvisor-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>TripAdvisor</a></span></li></ul></li></ul></div>

# Capstone Project: Aspect-Based Sentiment Analysis of Luxury Hotels in Singapore


Online hotel reviews are part of a paradigm shift that has affected hotels' reputations, and subsequently their bottom lines, since the early 2000s. They are written by a hotel's guests and addressed to other potential guests, who often consider them when choosing a place to stay. For hotel management, online reviews are a valuable source of guest feedback. Just as good reviews encourage more people to stay, negative reviews drive people away. [(Source)](https://www.researchgate.net/publication/291164180_Understanding_the_Impact_of_Online_Reviews_on_Hotel_Performance_An_Empirical_Analysis)

## Problem Statement
The management of online reviews is a subset of what hotels call Reputation Management. Reputation Management is typically an executive-level function, often under the purview of the hotel's General Manager, but the actual management of online reviews tends to fall to the Marketing Department of a large hotel. Within that department, there is often only a small team (often just one person) processing review data to produce insights for the executive team and feedback for the branches of the hotel in charge of different functions. This involves lengthy communication with each stakeholder.

Such work is highly tedious because hotel reviews are subjective and multi-faceted (one review can discuss many different things), and it requires personnel skilled in the nuances of the English language to produce accurate results. That said, online reviews tend to be repetitive in the words they use to describe different parts of a hotel -- making them good candidates for processing with data analysis techniques. Reducing the labour and time required to derive insights from online reviews is the motivation for this project.

## Data Science Problem

1. Create a model that correctly identifies which aspects of a hotel are being reviewed in sections of online review data, with the highest possible accuracy.
2. The model will then identify the general sentiment (positive, neutral, negative) of each section towards its target aspect.

The hope is that such automatic processing of online review data will let each stakeholder receive only the information relevant to them, and reduce the workload of hotels' Marketing Departments.

## Data Scraping

In [1]:
import pandas as pd
import numpy as np
import re
import requests
import time
import datetime
import random

from bs4 import BeautifulSoup
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# from selenium.webdriver.chrome.options import Options
# Note: Selenium must also be installed when setting up on a new computer

### TripAdvisor

Scraping TripAdvisor is easier than scraping most other sources: each URL leads directly to a page of 5 reviews.

For page-wise iteration, the TripAdvisor (TA) website starts with a front page in the format:  
```https://www.tripadvisor.com/Hotel_Review-g294265-d1770798-Reviews-Marina_Bay_Sands-Singapore.html#REVIEWS```

On each subsequent page, an `or#` segment appears in the URL, where `#` is an offset that increases by 5 per page, so successive pages can be reached by stepping the offset 5 at a time.  
```https://www.tripadvisor.com/Hotel_Review-g294265-d1770798-Reviews-or5-Marina_Bay_Sands-Singapore.html#REVIEWS```
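
As a rough illustration of the URL scheme above, the page URLs for the first few pages could be generated as follows. This is only a sketch: `build_page_urls` and its parameters are hypothetical names introduced here, and splitting the URL into a prefix and postfix around the `or#` segment is an assumption based on the two example links.

```python
def build_page_urls(prefix, postfix, n_pages, per_page=5):
    """Return URLs for the first n_pages review pages (hypothetical helper)."""
    urls = []
    for offset in range(0, n_pages * per_page, per_page):
        # The first page has no 'or#' segment; later pages use 'or5-', 'or10-', ...
        segment = f"or{offset}-" if offset else ""
        urls.append(prefix + segment + postfix)
    return urls


prefix = "https://www.tripadvisor.com/Hotel_Review-g294265-d1770798-Reviews-"
postfix = "Marina_Bay_Sands-Singapore.html#REVIEWS"
page_urls = build_page_urls(prefix, postfix, n_pages=3)
for url in page_urls:
    print(url)
```

The first URL printed matches the front-page format and the second matches the `or5` example.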

On each TripAdvisor page, every review sits in its own div with `data-test-target="HR_CC_CARD"`. Each card is further broken down into two sections:
1. An element div containing the date of review and location (user-selected) of the reviewer
2. The content and date stayed

For part 1, the class names are randomly generated, so a `find_all`-style pattern is needed to populate the lists. For part 2, the content sits in divs carrying the review's unique identifier (`data-reviewid`).
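
The card layout described above can be sketched on a toy page. Everything in `SAMPLE_HTML` is hypothetical and heavily simplified (the real markup is far more nested), and the sketch uses the standard library's `html.parser` instead of BeautifulSoup purely so that it runs without third-party packages.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup that mimics the structure described above
SAMPLE_HTML = """
<div data-test-target="reviews-tab">
  <div data-test-target="HR_CC_CARD">
    <div class="xYz12">reviewer info</div>
    <div data-reviewid="111">first review</div>
  </div>
  <div data-test-target="HR_CC_CARD">
    <div class="aBc34">reviewer info</div>
    <div data-reviewid="222">second review</div>
  </div>
</div>
"""


class ReviewCardParser(HTMLParser):
    """Count review cards and collect review ids from div attributes."""

    def __init__(self):
        super().__init__()
        self.card_count = 0
        self.review_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        # Cards are identified by a stable data attribute, not by class name
        if attributes.get("data-test-target") == "HR_CC_CARD":
            self.card_count += 1
        # The review body carries the review's unique identifier
        if "data-reviewid" in attributes:
            self.review_ids.append(attributes["data-reviewid"])


parser = ReviewCardParser()
parser.feed(SAMPLE_HTML)
print(parser.card_count, parser.review_ids)
```

Keying on the `data-*` attributes rather than the class names is what makes the lookup robust to the randomly generated classes.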

# Tripadvisor Page Iterator

In [4]:
def create_review_dict(property_name, website, ids, review_dates, review_locations,
                       review_titles, review_content, review_scores, visit_dates):
    '''
    Combine parallel lists of review fields into the dict used to build our dataframe.
    Returns a dict mapping each review id to its content and meta information.
    '''
    reviews_dict = {}
    for i, item in enumerate(ids):
        reviews_dict.update({
            item: {
                'property': property_name,
                'rev_source': website,
                'rev_date': review_dates[i],
                'rev_location': review_locations[i],
                'rev_title': review_titles[i],
                'rev_content': review_content[i],
                'rev_score': review_scores[i],
                'rev_visit_date': visit_dates[i],
            }
        })
    return reviews_dict

In [5]:
def tripadvisor_page_url(i, url_str_prefix, url_str_postfix):
    '''
    Build the URL of the review page at offset i.
    Returns a single URL string; the 'or<i>-' segment is omitted for the first page (i == 0).
    '''
    iterator_str = ''
    if i != 0:
        iterator_str = 'or' + str(i) + '-'
        
    return url_str_prefix + iterator_str + url_str_postfix

In [6]:
def tripadvisor_get_reviews(hotel, reviews_to_scrape, url_str_prefix, url_str_postfix, start_at = 0):
    '''
    Get review data from TripAdvisor for a hotel and return it as a dict keyed by review id
    hotel: hotel name as a lowercase acronym - to be used in the dataframe
    reviews_to_scrape: number of reviews to scrape, beginning at start_at; rounded to the nearest multiple of 5
    url_str_prefix: 1st part of url (before the 'or#-' segment)
    url_str_postfix: 2nd part of url (after the 'or#-' segment)
    start_at: review offset to start from (0 = newest); rounded to the nearest multiple of 5
    '''
    # Init returned dict
    scrapped_reviews = {}
    
    # Round review number to scrape to 5 if not already
    reviews_to_scrape = int(np.around(reviews_to_scrape/5, 0) * 5)
    start_at = int(np.around(start_at/5, 0) * 5)
    
    list_pages = [tripadvisor_page_url(iterator,
                    url_str_prefix,
                    url_str_postfix) \
                  for iterator in range(start_at, (start_at + reviews_to_scrape), 5)
                 ]
    
    for page in list_pages:
        print(page)
        res = requests.get(page, headers={'User-agent': 'Curomu 1.25'})
    
        if res.status_code != 200:
            # If status code is not 200 (success), stop scraping and raise for debugging
            raise Exception(f"Get of {page} was not successful. Status Code: {res.status_code}.")
        else:
            # Else scrape as normal
            page_ids = []
            page_review_dates = []
            page_review_locations = []
            page_review_titles = []
            page_review_content = []
            page_review_scores = []
            page_visit_dates = []
    
            tripadvisor_content = BeautifulSoup(res.content, 'lxml')
            reviews_root = tripadvisor_content.find('div', {"data-test-target" : "reviews-tab"})
            
            # Get Review ID
            for string in reviews_root.select("div[data-reviewid]"):
                page_ids.append(string.get('data-reviewid'))
                
            each_review = reviews_root.find_all('div', {"data-test-target" : "HR_CC_CARD"})

            for string in each_review:
                # Get Review Date
                review_date = string.find('a', {'class': 'ui_header_link'}).next_sibling.split(' ')
                if (review_date[-1] == 'Today') or (review_date[-1] == 'Yesterday'):
                    review_year = datetime.datetime.now().year
                    review_date[-2] = datetime.datetime.now().strftime("%b")
                elif (len(review_date[-1]) < 4):
                    review_year = datetime.datetime.now().year
                else:
                    review_year = int(review_date[-1])
                review_time_str = review_date[-2] + ' ' + str(review_year)
                review_datetime = datetime.datetime.strptime(review_time_str, '%b %Y')
                page_review_dates.append(review_datetime)

                # Get Reviewer Location
                review_location_parent = string.find('a', {'class': 'ui_header_link'}).parent.parent.next_sibling
                try:
                    review_location = review_location_parent.find('span', {'class': 'map-pin-fill'}).next_sibling
                    review_location = review_location.split(', ')[-1]
                except (AttributeError, TypeError):
                    # No location pin found for this reviewer
                    review_location = np.nan
                page_review_locations.append(review_location)

            for string in [reviews_root.find('div', {'data-reviewid': page_id}) for page_id in page_ids]:
                # Get Review Title
                page_review_titles.append(string.find('div', {'data-test-target': 'review-title'}).find('a').find('span').find('span').text)

                # Get Review Content
                page_review_content.append(string.find('q').find('span').text)

                # Get review rating
                rev_classes = string.find('div', {'data-test-target': 'review-rating'}).select_one("span").get('class')
                page_review_scores.append(int(rev_classes[1].split('_')[1])/10*2)

                # Get stay date (search within this review, not the whole page)
                stay_date_text = string.find('span', text = 'Date of stay:').next_sibling.lstrip()
                page_visit_dates.append(datetime.datetime.strptime(stay_date_text, '%B %Y'))
                    
            # update total scrapped reviews dictionary
            scrapped_reviews.update(
                create_review_dict(hotel, 'tripadvisor', page_ids, page_review_dates, page_review_locations, \
                   page_review_titles, page_review_content, page_review_scores, page_visit_dates)
            )
            # Verbosely indicate number of reviews scrapped
            print(f'{len(scrapped_reviews)} scrapped.')
            
            # Sleep before next page request to reduce suspicion
            sleep_duration = random.randint(3,15)
            print(f'Sleeping {sleep_duration} sec.')
            time.sleep(sleep_duration)
            
    print(f'tripadvisor_get_reviews has scrapped {len(scrapped_reviews)} reviews successfully.')
    
    # Return all scraped reviews, ready to be loaded into a dataframe
    return scrapped_reviews
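
The review-date handling inside `tripadvisor_get_reviews` can be isolated into a small helper so that it is easy to check on its own. The sketch below is a hedged restatement of that logic: `parse_review_month` and its `now` parameter are hypothetical names, and the three text forms it handles are inferred from the branches in the loop above rather than from any TripAdvisor documentation.

```python
import datetime


def parse_review_month(tokens, now=None):
    """Map the trailing tokens of a review-date string to a month-level datetime.

    tokens: the date text split on spaces, e.g. ['Sep', '2014'].
    now: reference datetime for relative dates (defaults to the current time).
    """
    now = now or datetime.datetime.now()
    last = tokens[-1]
    if last in ("Today", "Yesterday"):
        # Relative dates: assume the current month and year
        month, year = now.strftime("%b"), now.year
    elif len(last) < 4:
        # 'Apr 3' style: the last token is a day, so the year is the current one
        month, year = tokens[-2], now.year
    else:
        # 'Sep 2014' style: the last token is a four-digit year
        month, year = tokens[-2], int(last)
    return datetime.datetime.strptime(f"{month} {year}", "%b %Y")


reference = datetime.datetime(2021, 4, 12)
print(parse_review_month(["Sep", "2014"], now=reference))
print(parse_review_month(["Apr", "3"], now=reference))
print(parse_review_month(["Today"], now=reference))
```

Passing `now` explicitly keeps the helper deterministic in tests. One limitation carried over from the original rule: a review marked 'Yesterday' on the first day of a month would be attributed to the new month.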

In [32]:
reviews = tripadvisor_get_reviews('mos', 500,
                                  'https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-',
                                  'Mandarin_Orchard_Singapore-Singapore.html',
                                  start_at = 8500)

https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8500-Mandarin_Orchard_Singapore-Singapore.html
5 scrapped.
Sleeping 8 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8505-Mandarin_Orchard_Singapore-Singapore.html
10 scrapped.
Sleeping 12 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8510-Mandarin_Orchard_Singapore-Singapore.html
15 scrapped.
Sleeping 9 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8515-Mandarin_Orchard_Singapore-Singapore.html
20 scrapped.
Sleeping 6 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8520-Mandarin_Orchard_Singapore-Singapore.html
25 scrapped.
Sleeping 7 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8525-Mandarin_Orchard_Singapore-Singapore.html
30 scrapped.
Sleeping 4 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8530-Mandarin_Orchard_Singapore-Singapore.html
35 scrapped.
Sleeping 12 sec

https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8785-Mandarin_Orchard_Singapore-Singapore.html
290 scrapped.
Sleeping 14 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8790-Mandarin_Orchard_Singapore-Singapore.html
295 scrapped.
Sleeping 13 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8795-Mandarin_Orchard_Singapore-Singapore.html
300 scrapped.
Sleeping 5 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8800-Mandarin_Orchard_Singapore-Singapore.html
305 scrapped.
Sleeping 5 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8805-Mandarin_Orchard_Singapore-Singapore.html
310 scrapped.
Sleeping 9 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8810-Mandarin_Orchard_Singapore-Singapore.html
315 scrapped.
Sleeping 4 sec.
https://www.tripadvisor.com/Hotel_Review-g294265-d300850-Reviews-or8815-Mandarin_Orchard_Singapore-Singapore.html
320 scrapped.
Sleepi

In [33]:
tripadvisor_df = pd.DataFrame(reviews).T

In [34]:
# Check for duplicates in the index (review_id)
tripadvisor_df.index.duplicated(keep='first').sum()

0

In [35]:
tripadvisor_df.tail()

| | property | rev_source | rev_date | rev_location | rev_title | rev_content | rev_score | rev_visit_date |
|---|---|---|---|---|---|---|---|---|
| 225956576 | mos | tripadvisor | 2014-09-01 | NaN | Business hotel | right location for avid shopper. right smack i... | 8 | 2014-01-01 |
| 225877196 | mos | tripadvisor | 2014-09-01 | Australia | Mandarin Orchard, hotel, singapore | I was invited to attend a church service at th... | 10 | 2014-01-01 |
| 225871380 | mos | tripadvisor | 2014-09-01 | Tassin La Demi Lune | Great location | Great location close to the metro and many sho... | 6 | 2014-01-01 |
| 225864717 | mos | tripadvisor | 2014-09-01 | Hk | Ankie C | This hotel is located at very prime area and e... | 6 | 2014-01-01 |
| 225841387 | mos | tripadvisor | 2014-09-01 | Selangor | Terrible service, disappointed. | I had to wait for more than an hour before I w... | 2 | 2014-01-01 |


In [36]:
# tripadvisor_df.to_csv(f'../datasets/tripadvisor_mos_{datetime.datetime.now().strftime("%y%m%d_%H%M")}')
# print(f'Dataframe of shape {tripadvisor_df.shape} written to csv at {datetime.datetime.now().strftime("%y%m%d_%H%M")}')

Dataframe of shape (500, 8) written to csv at 210412_2215


We will process our scraped review data in the next section.