(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will focus on practicing two techniques: web scraping and regression. For the first part, we will get some basic information for each hotel in Boston. Then, we will fit a regression model on this information and try to analyze it.

**Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

In [3]:
'''
1 - Retrieve all 82 hotels in Boston
    - url of the first page of hotels: https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html
    - url of the second page of hotels: https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html
    - url of the third page of hotels: https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html
    - extract name and url of each hotel and its page
2 - Collect traveler ratings for Omni Parker House: Location, Sleep Quality, Rooms, Service, Value and Cleanliness
    - read through each review for their IDs
    - if ratings exist, pull rating of that review
    - record review_id:rating_category:rating_score in OPH_rating_summary.txt
'''
from bs4 import BeautifulSoup
import requests
import time
from contextlib import contextmanager
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

base_url = 'https://www.tripadvisor.com'
url_list = ['https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html','https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html','https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html']
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'

def parse_hotels():
    '''Record list of all hotels in Boston + urls to dictionary'''
    hotel_list = {}
    headers = {'User-Agent' : user_agent }
    for i in range(len(url_list)):
        response = requests.get(url_list[i], headers=headers)
        html = response.text.encode('utf-8')
        soup = BeautifulSoup(html,"lxml")
        hotel_boxes = soup.findAll('div', {'class' :'listing easyClear  p13n_imperfect '})
        for hotel_box in hotel_boxes:
            name = hotel_box.find('div', {'class' :'listing_title'}).find(text=True).replace("/","")
            url = hotel_box.find('div', {'class' :'listing_title'}).find('a',href=True)['href']
            hotel_list[name] = url
    return hotel_list

def parse_hotel_reviews(hotel_list, hotel_name = "Omni Parker House"):
    '''Scrapes each review of a specific hotel to get review_body_id : attribute : rating
       Input: hotel_list, hotel = "Omni Parker House"'''
    
    #create file
    ratings_file = open('OPH_rating_summary.txt', 'w+')
    
    hotel_url = hotel_list[hotel_name]
    headers = {'User-Agent' : user_agent}
    response = requests.get(base_url + hotel_url, headers=headers)
    html = response.text.encode('utf-8')
    soup = BeautifulSoup(html,"lxml")
    
    #soupify html then find the first review url on the hotel front page
    review_box = soup.find('div', {'class' :'reviewSelector   track_back'})
    review_url = review_box.find('div',{'class' :'quote'}).find('a',href=True)['href']
    review_url = review_url.split('#')[0]
    
    #url is obtained, use selenium to load first page
    driver = webdriver.Firefox()
    n = 0
    driver.get(base_url + review_url + "#or" + str(n))
    
    #read the total number of reviews from the header, then derive the number of pages (7 reviews per page)
    soup = BeautifulSoup(driver.page_source,"lxml")
    num_reviews = int(soup.find('h3',{'class':"reviews_header"}).find(text=True).split(" reviews from our ")[0].replace(',', ''))
    num_pages = num_reviews // 7 + 1
    
    #retrieve reviews n to n+7 on each page, skipping the duplicated review at the top of every page after the first
    for i in range(num_pages):
        current_page = driver.find_element(By.ID, 'REVIEWS')
        soup = BeautifulSoup(driver.page_source,"lxml")

        if i > 0:
            review_boxes = soup.findAll('div', {'class' :" reviewSelector "})[1:]
        else:
            review_boxes = soup.findAll('div', {'class' :" reviewSelector "})
            
        for review_box in review_boxes:
            review_id = review_box['id']
            ratings = review_box.findAll('li',{'class':"recommend-answer"})
            if len(ratings) != 0:
                for rating in ratings:
                    rating_score = rating.find('img')['alt'].split(' of 5')[0]
                    rating_cat = rating.find('div',{'class':"recommend-description"}).find(text=True)
                    ratings_file.write(str(review_id) + ':' + str(rating_cat) + ':' + str(rating_score) + '\n')
        n += 7   
        driver.get(base_url + review_url + "#or" + str(n))
        current_page = WebDriverWait(driver, 10).until(EC.staleness_of(current_page))
        
    driver.quit()
    ratings_file.close()

    
#run scraper and print completion time
start = time.time()
hlist = parse_hotels()
parse_hotel_reviews(hlist)
end = time.time() - start
print("Completed, time: " + str(end) + " secs")

Completed, time: 999.205928802 secs
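
The `review_id:rating_category:rating_score` records written to `OPH_rating_summary.txt` can be reduced to one mean score per category with a few lines. The sketch below is only an illustration, not the code that produced the results in this notebook; the helper name `category_means` and the sample records are made up for the example.

```python
from collections import defaultdict

def category_means(lines):
    """Average the score of every rating category found in the given records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # split from the right so a ':' inside the review id cannot shift the fields
        review_id, category, score = line.rsplit(':', 2)
        totals[category] += int(score)
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# hypothetical sample records in the same format as OPH_rating_summary.txt
sample = [
    "review_100:Service:5",
    "review_100:Location:4",
    "review_101:Service:3",
    "review_101:Location:5",
]
means = category_means(sample)
print(means)
```

Keying the running totals by category name avoids hard-coding one branch per category, so any extra category that shows up in the file is averaged as well.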


**Task 2 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
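
As a quick check of the formula above, the weighted average can be computed directly from the number of reviews per star level. The variable names in this sketch are chosen for illustration only; the counts are the ones from the formula.

```python
# number of reviews per star level for the example hotel (taken from the formula above)
star_counts = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}

total_reviews = sum(star_counts.values())
weighted_sum = sum(stars * count for stars, count in star_counts.items())
avg_score = weighted_sum / float(total_reviews)

print(total_reviews)        # 2527
print(round(avg_score, 4))  # 4.6347
```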

In [25]:
from bs4 import BeautifulSoup
import requests
import time
import numpy as np
import pandas as pd

base_url = 'https://www.tripadvisor.com'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
url_list = ['https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html','https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html','https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html']

def parse_hotels():
    '''Record list of all hotels in Boston + urls to dictionary'''
    hotel_list = {}
    headers = {'User-Agent' : user_agent }
    for i in range(len(url_list)):
        response = requests.get(url_list[i], headers=headers)
        html = response.text.encode('utf-8')
        soup = BeautifulSoup(html,"lxml")
        hotel_boxes = soup.findAll('div', {'class' :'listing easyClear  p13n_imperfect '})
        for hotel_box in hotel_boxes:
            name = hotel_box.find('div', {'class' :'listing_title'}).find(text=True).replace("/","")
            url = hotel_box.find('div', {'class' :'listing_title'}).find('a',href=True)['href']
            hotel_list[name] = url
    return hotel_list

def parse_hotel_ratings(hotel_list):
    '''Goes through each hotel page and records # of ratings per star to calculate AVG_SCORE and Excellence
       Input: hotel_list(name,urls)'''
    
    hotel_names = hotel_list.keys()
    headers = {'User-Agent' : user_agent }
    
    hotel_avg_scores = {}
    
    for hotel in hotel_names:
        
        response = requests.get(base_url + hotel_list[hotel], headers=headers)
        html = response.text.encode('utf-8')
        
        #soupify and collect number of reviews for each star rating
        soup = BeautifulSoup(html,"lxml")
        review_box = soup.find('div', {'class' :'col rating '})
        
        avg_score = 0.00
        total_count = 0.00
        five_score = 0.00
        
        for line in review_box.findAll('li'):
            review_categories = line.find('div',{'class':'row_label'}).find(text=True)
            review_scores = line.findAll('span')[3].find(text=True)
            review_scores = int(review_scores.replace(",", ""))

            #weigh scores, count excellent scores and total
            if review_categories == "Excellent":
                avg_score += 5*review_scores
                five_score += review_scores
            if review_categories == "Very good":
                avg_score += 4*review_scores
            if review_categories == "Average":
                avg_score += 3*review_scores
            if review_categories == "Poor":
                avg_score += 2*review_scores
            if review_categories == "Terrible":
                avg_score += 1*review_scores    
            total_count += review_scores

        #average avg_score over total_count and assign to hotel in hotel_avg_scores
        avg_score = avg_score / total_count
        
        #assign excellence score
        if five_score / total_count >= 0.600:
            five_score = 1
        else:
            five_score = 0
        hotel_avg_scores[hotel] = (avg_score,five_score)
        time.sleep(0.3)
    return hotel_avg_scores

'''Read rating_summary.txt and OPH_rating_summary to create numpy array in this format:
   hotel - [hotel, AVG_SCORE, Service, Cleanliness, Value, Sleep_Quality, Rooms, Location]
   put into dataframe for later use'''

def scores_df(avg_scores, filename = "rating_summary.txt", oph_filename = "OPH_rating_summary.txt"):
    """Read appropriate file and parse through to get vector of:
                [hotel, AVG_SCORE, Service, Cleanliness, Value, Sleep_Quality, Rooms, Location, Excellence]
                AVG_SCORE = avg_scores[hotel]
       Push it all into dataframe as hotel_scores.csv
       Input: avg_scores dict, filenames for rating summaries"""
    
    hotel_vectors = []
    current_vector = {}
    score = 0.00
    total = 0
    current_hotel = "Boston Harbor Hotel"
    
    with open(filename,'r') as f:
        for line in f:
            #read 5 lines at a time, calc score then average (ignore Business_service and Check_in)
            current_line = line.split(':')
            
            #if the line starts a new hotel, the previous hotel is complete:
            #unescape its name and push its finished vector into hotel_vectors
            if current_hotel != current_line[0]:
                current_hotel = current_hotel.replace('&amp;','&')
                current_hotel = current_hotel.replace('&#39;',"'")
                hotel_vectors.append([current_hotel, avg_scores[current_hotel][0], avg_scores[current_hotel][1], current_vector['Service'],current_vector['Cleanliness'],current_vector['Value'],current_vector['Sleep Quality'],current_vector['Rooms'],current_vector['Location']])
                current_hotel = current_line[0]
                
            score += int(current_line[2]) * int(current_line[3]) 
            total += int(current_line[3])
            
            #the 5-star line is the last line of a category: store the category average in current_vector & reset
            if current_line[2] == '5':
                current_vector[current_line[1]] = score / total
                score = 0.00
                total = 0 
            
        #last line to append before close
        hotel_vectors.append([current_hotel, avg_scores[current_hotel][0], avg_scores[current_hotel][1], current_vector['Service'],current_vector['Cleanliness'],current_vector['Value'],current_vector['Sleep Quality'],current_vector['Rooms'],current_vector['Location']])
   

    total_count = [0,0,0,0,0,0]
    total_score = [0.00,0.00,0.00,0.00,0.00,0.00]
    
    #add OPH to hotel vectors
    with open(oph_filename,'r') as f:
        for line in f:
            #keep running total count and total score for each category
            current_line = line.split(':')
            if current_line[1] == 'Service':
                total_count[0] += 1
                total_score[0] += int(current_line[2])
            if current_line[1] == 'Cleanliness':
                total_count[1] += 1
                total_score[1] += int(current_line[2])
            if current_line[1] == 'Value':
                total_count[2] += 1
                total_score[2] += int(current_line[2])
            if current_line[1] == 'Sleep Quality':
                total_count[3] += 1
                total_score[3] += int(current_line[2])
            if current_line[1] == 'Rooms':
                total_count[4] += 1
                total_score[4] += int(current_line[2])
            if current_line[1] == 'Location':
                total_count[5] += 1
                total_score[5] += int(current_line[2])
                
        for i in range(6):
            total_score[i] = total_score[i] / total_count[i]
        total_score = ['Omni Parker House', avg_scores['Omni Parker House'][0],avg_scores['Omni Parker House'][1]] + total_score
        hotel_vectors.append(total_score)
        
    data = pd.DataFrame(hotel_vectors,columns = ['Hotel', 'AVG_SCORE', 'Excellence' , 'Service', 'Cleanliness', 'Value', 'Sleep_Quality', 'Rooms', 'Location'])
    data.to_csv('hotel_rating_summary.csv')
    return data
        
#hotel_avg_scores = parse_hotel_ratings(hlist)
#scores_df(hotel_avg_scores)
hlist = parse_hotels()
#the scrape sometimes returns an incomplete list; retry until all 82 hotels are found
while len(hlist) != 82:
    hlist = parse_hotels()
hotel_avg_scores = parse_hotel_ratings(hlist)
scores_df(hotel_avg_scores)

| | Hotel | AVG_SCORE | Excellence | Service | Cleanliness | Value | Sleep_Quality | Rooms | Location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Harbor Hotel | 4.707746 | 1 | 4.800151 | 4.824275 | 4.265269 | 4.710259 | 4.701031 | 4.818182 |
| 1 | Boston Hotel Buckminster | 3.595972 | 0 | 3.891148 | 3.815451 | 3.736917 | 3.852670 | 3.427338 | 4.443252 |
| 2 | Boston Marriott Copley Place | 3.991115 | 0 | 4.122093 | 4.386243 | 3.677065 | 4.209266 | 4.040532 | 4.719094 |
| 3 | Boston Marriott Long Wharf | 4.265713 | 0 | 4.377953 | 4.496833 | 3.758840 | 4.268608 | 4.186155 | 4.859583 |
| 4 | Courtyard by Marriott Boston Copley Square | 4.590604 | 1 | 4.692500 | 4.683146 | 4.321788 | 4.552743 | 4.476773 | 4.844471 |
| 5 | DoubleTree Club by Hilton Hotel Boston Bayside | 3.797970 | 0 | 4.050476 | 4.193998 | 3.765540 | 4.075000 | 3.966025 | 3.610609 |
| 6 | Embassy Suites by Hilton Boston - at Logan Air... | 3.991206 | 0 | 4.149183 | 4.325472 | 3.921615 | 4.256731 | 4.194239 | 4.212605 |
| 7 | enVision Hotel Boston | 4.577093 | 1 | 4.722359 | 4.846154 | 4.615894 | 4.562278 | 4.654362 | 4.150000 |
| 8 | Fairmont Copley Plaza, Boston | 4.372299 | 0 | 4.471007 | 4.550926 | 3.889253 | 4.444444 | 4.266212 | 4.848338 |
| 9 | Harborside Inn | 4.089302 | 0 | 4.064000 | 4.444824 | 4.089888 | 4.249608 | 4.109283 | 4.819632 |


In [26]:
import pandas as pd
import statsmodels.formula.api as sm

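#read_csv with index_col=0 replaces DataFrame.from_csv, which newer pandas versions no longer provide
df = pd.read_csv('hotel_rating_summary.csv', index_col=0)
result = sm.ols(formula="AVG_SCORE ~ Service + Cleanliness + Value + Sleep_Quality + Rooms + Location", data=df).fit()
print(result.summary())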

                            OLS Regression Results                            
Dep. Variable:              AVG_SCORE   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     515.0
Date:                Tue, 29 Mar 2016   Prob (F-statistic):           8.14e-59
Time:                        14:36:13   Log-Likelihood:                 100.19
No. Observations:                  82   AIC:                            -186.4
Df Residuals:                      75   BIC:                            -169.5
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        -0.6759      0.117     -5.773

The OLS regression fits AVG_SCORE well. All six coefficients are statistically significant: every p-value is below 0.05, so each coefficient is significant at the 95% confidence level. The high R-squared and adjusted R-squared values indicate that the multivariate regression explains most of the variation in AVG_SCORE, and the F-statistic rejects the null hypothesis that all coefficients are zero. Judging by the coefficients, Rooms has the largest impact on a hotel's score, followed by Cleanliness, Sleep_Quality, Value, Service and Location. It is interesting to note that the Location coefficient is much smaller than any other: it is at least 0.11 below the second-lowest coefficient, Service. One possible explanation is that Location has a much smaller standard error than any other variable and thus contributes less to AVG_SCORE. In context, this means that location scores vary little between reviews compared to the other, more significant variables. The non-negative skew suggests that, in the context of this regression, most ratings lean towards the higher end of the scale, and the large kurtosis means that reviews are spread further from the mean.

In the context of the model, this means that the quality of Rooms and Cleanliness matters most to the average traveler, as these two factors have the largest effect on a hotel's AVG_SCORE. That said, all coefficients are statistically significant, meaning that all of these factors matter to a hotel's rating. Location varies the least, most likely because hotels choose their location very carefully and avoid building in poor locations. Most travelers also make sure that the hotel they stay in is well situated in the city, since their transportation options are limited. The skew and kurtosis suggest that most reviews are biased towards the extremes, leaning towards the upper end of the scale. This is in line with the context, as most reviews are prompted by extraordinary experiences, mostly good and infrequently bad.
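
The adjusted R-squared in the summary can be reproduced from the reported R-squared, the number of observations and the number of predictors. The following is a small sketch of the standard adjustment formula; the function name is chosen for illustration, and the inputs are read off the OLS summary above.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors used in the model."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1.0) / (n_obs - n_predictors - 1.0)

# values from the OLS summary: R-squared 0.976, 82 observations, 6 predictors
adj = adjusted_r_squared(0.976, n_obs=82, n_predictors=6)
print(round(adj, 3))  # 0.974
```

With 82 hotels and only 6 predictors the penalty is small, which is why the two values in the summary are nearly identical.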

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
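
The binary label can be derived from the same star counts that feed AVG_SCORE. The sketch below is a minimal illustration with a hypothetical helper name, `is_excellent`. It uses a strict comparison because the task says "more than 60%"; the scraping cell in Task 2 uses `>=`, which only differs for a hotel sitting exactly at 60%.

```python
def is_excellent(star_counts, threshold=0.60):
    """Return 1 if the share of 5-star reviews exceeds the threshold, else 0."""
    total = sum(star_counts.values())
    if total == 0:
        return 0  # assumption: a hotel without reviews is not labelled excellent
    share = star_counts.get(5, 0) / float(total)
    return 1 if share > threshold else 0

# counts of the example hotel from Task 2: 1861 of 2527 reviews are 5 stars
example_label = is_excellent({1: 31, 2: 33, 3: 98, 4: 504, 5: 1861})
# hypothetical hotel where only half of the reviews are 5 stars
other_label = is_excellent({4: 50, 5: 50})
print(example_label, other_label)
```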

In [83]:
import pandas as pd
import statsmodels.formula.api as sm

#read_csv with index_col=0 replaces DataFrame.from_csv, which newer pandas versions no longer provide
df = pd.read_csv('hotel_rating_summary.csv', index_col=0)
result = sm.logit(formula='Excellence ~ Service + Cleanliness + Value + Sleep_Quality + Location + Rooms', data=df)
result_powell = result.fit(method = 'powell',maxiter = 10000)
print(result_powell.summary())

Optimization terminated successfully.
         Current function value: 0.042182
         Iterations: 20
         Function evaluations: 1620
                           Logit Regression Results                           
Dep. Variable:             Excellence   No. Observations:                   82
Model:                          Logit   Df Residuals:                       75
Method:                           MLE   Df Model:                            6
Date:                Tue, 29 Mar 2016   Pseudo R-squ.:                  0.9289
Time:                        16:24:36   Log-Likelihood:                -3.4589
converged:                       True   LL-Null:                       -48.660
                                        LLR p-value:                 2.500e-17
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept      -448.8165    297.645     -1.508      0.132     -1

The results of the logistic regression are quite different from the OLS results. While the pseudo R-squared value is still high and the p-value of the regression is low enough to be significant, the individual variables fit less well than in the OLS model.
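
The pseudo R-squared in the summary is consistent with McFadden's definition, one minus the ratio of the fitted log-likelihood to the log-likelihood of the intercept-only model. The sketch below recomputes it from the two values printed in the summary. It also shows that the "Current function value" reported by the optimizer matches the negative log-likelihood divided by the number of observations; that reading is inferred from the numbers agreeing.

```python
# values read off the Logit summary above
log_likelihood = -3.4589   # "Log-Likelihood"
ll_null = -48.660          # "LL-Null", the intercept-only model
n_obs = 82                 # "No. Observations"

pseudo_r2 = 1.0 - log_likelihood / ll_null
mean_neg_ll = -log_likelihood / n_obs

print(round(pseudo_r2, 4))    # 0.9289
print(round(mean_neg_ll, 6))  # 0.042182
```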

First, I had trouble running the multivariate regression with all the variables: including Rooms prevented the regression from converging fully. After trying a few different optimization methods, I found that Powell was the one that managed to converge on the MLE and produced the regression with the highest pseudo R-squared and log-likelihood. However, there are a few problems. None of the coefficients is statistically significant. This means that in the multivariate regression no individual coefficient can be distinguished from zero, so for each variable we cannot reject the null hypothesis that it does not affect the excellence of the hotel. Yet the log-likelihood and pseudo R-squared values show that the model as a whole still fits the data well despite these setbacks. An interesting thing to note is that the Cleanliness and Value coefficients are negative and that the Sleep_Quality coefficient is almost double any other coefficient. In the context of the model, this means that higher Service ratings negatively impact the hotel's excellence, while Sleep_Quality has a much greater positive impact on excellence than any other variable. Given the large standard errors and wide confidence intervals, however, the regression implies that the variables explain the data poorly, and that the high pseudo R-squared value is more of a fluke than proof that these variables significantly impact excellence. This is obviously not true: it would make more sense that a reviewer who enjoyed their stay would assign higher category ratings as well as a higher overall score, thus increasing the number of excellent reviews.
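
One possible explanation for this pattern (convergence trouble, very large coefficients and standard errors, yet a near-perfect fit) is complete or quasi-complete separation, where some combination of the predictors splits the excellent hotels from the rest almost perfectly. This notebook does not verify that this is what happened. The sketch below only illustrates the mechanism on made-up, perfectly separated data with a single predictor: the log-likelihood keeps increasing as the coefficient grows, so there is no finite maximum for the optimizer to find.

```python
import math

def log_likelihood(beta, xs, ys):
    """Bernoulli log-likelihood of a no-intercept logistic model p = sigmoid(beta * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta * x
        if z >= 0:
            log_p = -math.log1p(math.exp(-z))            # log(sigmoid(z))
            log_not_p = -z - math.log1p(math.exp(-z))    # log(1 - sigmoid(z))
        else:
            log_p = z - math.log1p(math.exp(z))
            log_not_p = -math.log1p(math.exp(z))
        total += log_p if y == 1 else log_not_p
    return total

# hypothetical data: every negative x has label 0 and every positive x has label 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

betas = [1, 5, 25, 125]
lls = [log_likelihood(b, xs, ys) for b in betas]
for b, ll in zip(betas, lls):
    print(b, ll)
```

Each larger coefficient gives a log-likelihood closer to zero, which is how a fit can look excellent overall while the individual estimates remain poorly determined.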

-------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()