(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will focus on practicing two techniques: web scraping and regression. For the first part, we will get some basic information for each hotel in Boston. Then, we will fit a regression model on this information and try to analyze it.

**Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

In [16]:
'''
1 - Retrieve all 82 hotels in Boston
    - url to get to the first page of hotels: https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html
    - url to get to the second page of hotels: https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html
    - url to get to the third page of hotels: https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html
    - extract name and url of each hotel and its page, put into dataframe
2 - Collect traveler ratings for each hotel to get avg_score
3 - Get Location, Sleep Quality, Rooms, Service, Value and Cleanliness from each hotel
'''
from bs4 import BeautifulSoup
import requests
import json
import time
import pandas as pd


url_1 = 'https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html'
url_2 = 'https://www.tripadvisor.com/Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html'
url_3 = 'https://www.tripadvisor.com/Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'

def parse_hotels():
    '''Return a list of (name, url) tuples for all hotels in Boston, one url per review page'''
    hotel_list = []
    headers = {'User-Agent': user_agent}

    # Walk all three listing pages, not just the first one
    for page_url in (url_1, url_2, url_3):
        response = requests.get(page_url, headers=headers)
        # Name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(response.text, 'html.parser')
        hotel_boxes = soup.find_all('div', {'class': 'listing easyClear  p13n_imperfect '})
        for hotel_box in hotel_boxes:
            title = hotel_box.find('div', {'class': 'listing_title'})
            name = title.find(text=True)
            url = title.find('a', href=True)['href']
            hotel_list.append((name, url))
        time.sleep(1)  # pause briefly between page requests

    return hotel_list

hotel_list = parse_hotels()
print(hotel_list)

[(u'Omni Parker House', '/Hotel_Review-g60745-d89599-Reviews-Omni_Parker_House-Boston_Massachusetts.html'), (u'Hyatt Regency Boston Harbor', '/Hotel_Review-g60745-d89620-Reviews-Hyatt_Regency_Boston_Harbor-Boston_Massachusetts.html'), (u'Seaport Boston Hotel', '/Hotel_Review-g60745-d94330-Reviews-Seaport_Boston_Hotel-Boston_Massachusetts.html'), (u'Hotel Commonwealth', '/Hotel_Review-g60745-d258705-Reviews-Hotel_Commonwealth-Boston_Massachusetts.html'), (u'Revere Hotel Boston Common', '/Hotel_Review-g60745-d89600-Reviews-Revere_Hotel_Boston_Common-Boston_Massachusetts.html'), (u'The Westin Copley Place', '/Hotel_Review-g60745-d89617-Reviews-The_Westin_Copley_Place-Boston_Massachusetts.html'), (u'Sheraton Boston Hotel', '/Hotel_Review-g60745-d89602-Reviews-Sheraton_Boston_Hotel-Boston_Massachusetts.html'), (u'Boston Harbor Hotel', '/Hotel_Review-g60745-d89575-Reviews-Boston_Harbor_Hotel-Boston_Massachusetts.html'), (u'InterContinental Boston', '/Hotel_Review-g60745-d620703-Reviews-Inter
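
The hrefs in the output above are site-relative, so they need the site root prepended before each hotel page can be requested. A minimal sketch of that step, assuming Python 3 (`urllib.parse`); the names `BASE_URL` and `absolute_urls` are illustrative and not part of the assignment:

```python
from urllib.parse import urljoin

# Assumed site root; the scraped hrefs start with '/' and are relative to it
BASE_URL = 'https://www.tripadvisor.com'

def absolute_urls(hotels):
    """Turn (name, relative_href) pairs into (name, absolute_url) pairs."""
    return [(name, urljoin(BASE_URL, href)) for name, href in hotels]

# One entry copied from the scraped output above
sample = [
    ('Omni Parker House',
     '/Hotel_Review-g60745-d89599-Reviews-Omni_Parker_House-Boston_Massachusetts.html'),
]

for name, url in absolute_urls(sample):
    print(name, url)
```

The resulting pairs can be passed to `pd.DataFrame(..., columns=['name', 'url'])` if a dataframe is wanted, as step 1 of the plan suggests.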

**Task 2 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1 \cdot 31 + 2 \cdot 33 + 3 \cdot 98 + 4 \cdot 504 + 5 \cdot 1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
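
The average-score formula is a weighted mean of the star values, with the rating counts as weights. A small sketch of that computation, using the counts from the example above; the function name `avg_score` and the list layout (index 0 holds the 1-star count) are assumptions made for illustration:

```python
def avg_score(counts):
    """Weighted mean star rating.

    counts[i] is the number of ratings with i + 1 stars,
    so counts = [n_1star, n_2star, n_3star, n_4star, n_5star].
    """
    total = sum(counts)
    if total == 0:
        raise ValueError('hotel has no ratings')
    weighted = sum(stars * n for stars, n in enumerate(counts, start=1))
    return weighted / float(total)

# Counts from the worked example: 31, 33, 98, 504 and 1861 ratings
example_counts = [31, 33, 98, 504, 1861]
print(avg_score(example_counts))
```

Applying this to every scraped hotel gives the target column for the linear regression.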

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
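
The binary label can be derived from the same rating counts used for the average score. A minimal sketch, assuming the counts are ordered from 1 star to 5 stars; the name `is_excellent` and the choice to treat a hotel with no ratings as not excellent are assumptions, not part of the assignment:

```python
def is_excellent(counts, threshold=0.60):
    """True if strictly more than `threshold` of all ratings are 5 stars.

    counts = [n_1star, n_2star, n_3star, n_4star, n_5star].
    """
    total = sum(counts)
    if total == 0:
        # Assumption: a hotel without ratings is not labeled excellent
        return False
    return counts[-1] / float(total) > threshold

# Counts from the worked example in Task 2: 1861 of 2527 ratings are 5 stars
print(is_excellent([31, 33, 98, 504, 1861]))
```

The comparison is strict, so a hotel with exactly 60% five-star ratings is not labeled excellent, matching the "more than 60%" wording.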

-------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    # Read the stylesheet and close the file handle afterwards
    with open("../../theme/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()