(To load the stylesheet of this notebook, execute its last code cell.)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will focus on practicing two techniques: web scraping and regression. For the first part, we will get some basic information for each hotel in Boston. Then, we will fit a regression model on this information and try to analyze it.

**Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

In [91]:
'''
1 - Retrieve all 82 hotels in Boston
    - url of the first page of hotels: https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html
    - url of the second page of hotels: https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html
    - url of the third page of hotels: https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html
    - extract the name and url of each hotel from these pages
2 - Collect the traveler ratings (Location, Sleep Quality, Rooms, Service, Value and Cleanliness)
    from each review of the Omni Parker House
'''
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
from contextlib import contextmanager
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

base_url = 'https://www.tripadvisor.com'
url_list = [
    'https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html',
    'https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html',
    'https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html',
]
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'

def parse_hotels():
    '''Record list of all hotels in Boston + urls to review page'''
    hotel_list = {}
    headers = {'User-Agent' : user_agent }
    for url in url_list:
        response = requests.get(url, headers=headers)
        html = response.text.encode('utf-8')
        soup = BeautifulSoup(html,"lxml")
        hotel_boxes = soup.findAll('div', {'class' :'listing easyClear  p13n_imperfect '})
        for hotel_box in hotel_boxes:
            name = hotel_box.find('div', {'class' :'listing_title'}).find(text=True)
            url = hotel_box.find('div', {'class' :'listing_title'}).find('a',href=True)['href']
            hotel_list[name] = url
    return hotel_list

def parse_hotel_reviews(hotel_list, hotel_name = "Omni Parker House"):
    '''Scrapes each review of a specific hotel and writes one line per rating,
       in the form review_id:attribute:rating, to OPH_rating_summary.txt
       Input: hotel_list (dict of hotel name -> url), hotel_name = "Omni Parker House"'''
    
    #create file
    ratings_file = open('OPH_rating_summary.txt', 'w+')
    
    hotel_url = hotel_list[hotel_name]
    headers = {'User-Agent' : user_agent}
    response = requests.get(base_url + hotel_url, headers=headers)
    html = response.text.encode('utf-8')
    
    #soupify html then find the first review url on the hotel front page
    soup = BeautifulSoup(html,"lxml")
    review_box = soup.find('div', {'class' :'reviewSelector   track_back'})
    review_url = review_box.find('div',{'class' :'quote'}).find('a',href=True)['href']
    review_url = review_url.split('#')[0]
    
    #url is obtained, use selenium to load first page
    driver = webdriver.Firefox()
    n = 0
    driver.get(base_url + review_url + "#or" + str(n))
    
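    #get number of pages: read the total review count, then step through it 7 reviews at a time
    soup = BeautifulSoup(driver.page_source,"lxml")
    num_reviews = int(soup.find('h3',{'class':"reviews_header"}).find(text=True).split(" reviews from our ")[0].replace(',', ''))
    #integer division, so that range() below receives an int
    num_pages = num_reviews // 7 + 1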
    
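    #on each page, retrieve reviews n through n+6; after the first page, skip the duplicate review shown at the top
    for i in range(num_pages):
        #find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium releases no longer provide
        current_page = driver.find_element(By.ID, 'REVIEWS')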
        soup = BeautifulSoup(driver.page_source,"lxml")

        review_boxes = soup.findAll('div', {'class' :" reviewSelector "})
        if i > 0:
            review_boxes = review_boxes[1:]
            
        for review_box in review_boxes:
            review_id = review_box['id']
            ratings = review_box.findAll('li',{'class':"recommend-answer"})
            if len(ratings) != 0:
                for rating in ratings:
                    rating_score = rating.find('img')['alt'].split(' of 5')[0]
                    rating_cat = rating.find('div',{'class':"recommend-description"}).find(text=True)
                    ratings_file.write(str(review_id) + ':' + str(rating_cat) + ':' + str(rating_score) + '\n')
        n += 7   
        driver.get(base_url + review_url + "#or" + str(n))
        #wait until the old review container has been replaced by the next page
        WebDriverWait(driver, 10).until(EC.staleness_of(current_page))
        
    driver.quit()
    ratings_file.close()

start = time.time()
hlist = parse_hotels()
parse_hotel_reviews(hlist)
end = time.time() - start
print("Completed, time: " + str(end) + " secs")

Completed, time: 1374.39435792 secs
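
The scraper above writes one line per rating in the form `review_id:attribute:rating`. Before fitting a model, those lines need to be aggregated. The block below is a minimal sketch of one way to do that, assuming the attribute names contain no colon. The sample lines and the helper name `mean_rating_by_category` are hypothetical and are not real scraped data.

```python
from collections import defaultdict

# HYPOTHETICAL sample lines in the review_id:attribute:rating format;
# in practice these would be read from OPH_rating_summary.txt.
sample_lines = [
    "review_1:Service:5",
    "review_1:Value:4",
    "review_2:Service:3",
    "review_2:Value:4",
]

def mean_rating_by_category(lines):
    """Return {attribute: mean rating} for lines of review_id:attribute:rating."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review_id, category, score = line.split(":")
        totals[category] += float(score)
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

category_means = mean_rating_by_category(sample_lines)
print(category_means)
```

To use it on the real file, pass the open file object instead of `sample_lines`, since iterating over a file yields its lines.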


**Task 2 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$
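
The same weighted average can be computed directly from a hotel's rating histogram. This is a small sketch using the counts from the formula above; the function name `average_score` is only illustrative.

```python
# Rating histogram of the example hotel: stars -> number of ratings
# (counts taken from the formula above).
rating_counts = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}

def average_score(counts):
    """Mean star value, weighted by how many ratings each star value received."""
    total = sum(counts.values())
    weighted_sum = sum(stars * n for stars, n in counts.items())
    return weighted_sum / float(total)

avg_score = average_score(rating_counts)
print(round(avg_score, 2))
```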

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
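
One possible way to set this up is ordinary least squares with NumPy, sketched below. Everything about the data is an assumption made for illustration: the 50 hotels, the three features (imagined here as mean Location, Service and Value ratings) and the coefficients are synthetic placeholders, not scraped values. NumPy is not imported elsewhere in this notebook and is also an assumption.

```python
import numpy as np

# SYNTHETIC placeholder data, not scraped values: 50 hypothetical hotels with
# three made-up features (imagined as mean Location, Service, Value ratings).
rng = np.random.default_rng(0)
X = rng.uniform(3.0, 5.0, size=(50, 3))
true_coef = np.array([0.2, 0.5, 0.3])  # chosen only for this illustration
y = X @ true_coef + rng.normal(0.0, 0.01, size=50)  # stand-in for AVG_SCORE

# Add an intercept column and solve the least squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

intercept, weights = coef[0], coef[1:]
for name, w in zip(["location", "service", "value"], weights):
    print(name, round(float(w), 2))
```

With real data, the size and sign of each fitted weight indicate how strongly the corresponding feature is associated with the average score, provided the features are on comparable scales.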

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
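
The sketch below shows the labeling rule and a bare-bones logistic regression fit. The label function applies the 60% rule to the example hotel from Task 2. The fitting part uses synthetic, made-up data with a single hypothetical feature, and it uses plain gradient descent on the log loss in place of a library solver, purely to keep the example self-contained.

```python
import numpy as np

def is_excellent(counts, threshold=0.60):
    """True if more than `threshold` of all ratings are 5 stars."""
    total = sum(counts.values())
    return counts.get(5, 0) / float(total) > threshold

# Example hotel from Task 2: 1861 of its 2527 ratings are 5 stars.
example = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
print(is_excellent(example))

# SYNTHETIC data for illustration only: one hypothetical feature per hotel
# and a noisy made-up rule that generates the binary label.
rng = np.random.default_rng(0)
x = rng.uniform(3.0, 5.0, size=200)
labels = (x + rng.normal(0.0, 0.2, size=200) > 4.0).astype(float)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Standardize the feature, then fit weight w and bias b by gradient descent
# on the mean log loss (a simple stand-in for a library solver).
z = (x - x.mean()) / x.std()
w, b = 0.0, 0.0
for _ in range(2000):
    p = sigmoid(w * z + b)
    w -= 0.1 * np.mean((p - labels) * z)
    b -= 0.1 * np.mean(p - labels)

# Training accuracy of the fitted model at a 0.5 probability cutoff.
predicted = sigmoid(w * z + b) > 0.5
accuracy = np.mean(predicted == (labels == 1.0))
print(round(float(w), 2), round(float(accuracy), 2))
```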

-------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    # read the stylesheet and close the file afterwards
    with open("../../theme/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()