(To load the stylesheet for this notebook, execute the last code cell.)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will practice two techniques: web scraping and regression. In the first part, we will scrape some basic information about each hotel in Boston. Then we will fit regression models to this information and analyze the results.

**Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

In [4]:
from BeautifulSoup import BeautifulSoup
import requests
import time

def download_html(url):
    # Note the trailing space after "Gecko)": without it the pieces join as "Gecko)Chrome/..."
    headers = { 'User-Agent' : "Mozilla/5.0 (Macintosh; Intel Mac " + \
                "OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) " + \
                "Chrome/41.0.2272.76 Safari/537.36" }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')
    return html

def extract_hotel_data():
    base_url = 'https://www.tripadvisor.com'
    params = '/Hotels-g60745-Boston_Massachusetts-Hotels.html'
    hotels = {}
    while True:
        html = download_html(base_url + params)
        soup = BeautifulSoup(html)
        next_page = soup.find("div", {"class" : "unified pagination standard_pagination"})
        hotel_boxes = soup.findAll('div', {'class' :'listing easyClear  p13n_imperfect '})
        for hotel_box in hotel_boxes:
            name = hotel_box.find('div', {'class' :'listing_title'}).find(text=True)
            url =  base_url + hotel_box.findAll('a', href= True)[0]['href']
            hotels[name] = url
        if next_page.find('span', {'class' : 'nav next ui_button disabled'}):
            break
        else:
            hrefs = next_page.findAll('a', href= True)
            for href in hrefs:
                if href.find(text = True) == 'Next':
                    params = href['href']
                    break
            time.sleep(2)
    review_data = load_and_parse_data()
    for hotel in hotels:
        time.sleep(2)
        hotel_data = get_hotel_attribs(hotels[hotel])
        if hotel in review_data:
            for key in hotel_data:
                review_data[hotel][key] = hotel_data[key]
        else:
            print(hotel + ' does not exist!')
    return review_data

def load_and_parse_data():
    fname = 'rating-summary.dat'
    # Each non-empty row is split on ':' into hotel, category, rating, count
    with open(fname, 'r') as f:
        data = [row.split(':') for row in f.read().split('\n') if row != '']
    
    def new_hotel():
        return {'Service': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                'Cleanliness': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
                'Business service (e.g., internet access)': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
                'Check in / front desk': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
                'Value': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
                'Sleep Quality': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
                'Rooms': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
                'Location': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}}
    
    hotels = dict()
    for entry in data:
        hotel, cat, rating, count = entry[0], entry[1], int(entry[2]), int(entry[3])
        if hotel not in hotels:
            hotels[hotel] = dict()
            hotels[hotel]['Attribute'] = new_hotel()
        hotels[hotel]['Attribute'][cat][rating] += count
        
    to_remove = []
    for hotel in hotels:
        for cat in hotels[hotel]['Attribute']:
            ratings = list(hotels[hotel]['Attribute'][cat].keys())
            count = list(hotels[hotel]['Attribute'][cat].values())
            try: # in case hotel has no reviews, avoid division by zero
                avg = sum([ratings[i]*count[i] for i in range(5)])/float(sum(count))
                hotels[hotel]['Attribute'][cat] = avg
            except ZeroDivisionError:
                to_remove.append(hotel)
                break
    
    for hotel in to_remove:
        hotels.pop(hotel)
    
    return hotels

def get_hotel_attribs(url):
    html = download_html(url)
    soup = BeautifulSoup(html)
    data = {"Rating": {}, "Type": {}}
            
    rating_str = "taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_"
    for i in range(1, 6):
        rating = soup.find("label", {"for" : rating_str + str(i)})
        rating = int(str(rating.findAll("span")[-2]).partition(">")[2].partition("<")[0].replace(",", ""))
        data['Rating'][i] = rating
    
    ratings = list(data['Rating'].keys())
    count = list(data['Rating'].values())
    avg = sum([ratings[i]*count[i] for i in range(5)])/float(sum(count))
    data['Rating']['Average'] = avg
        
    type_str = "taplc_prodp13n_hr_sur_review_filter_controls_0_filterSegment_"
    for typ in ['Family', 'Couples', 'Solo', 'Business', 'Friends']:
        trav_typ = soup.find("label", {"for" : type_str + typ}).find("span")
        trav_typ = int(str(trav_typ).partition("(")[2].partition(")")[0].replace(",", ""))
        data['Type'][typ] = trav_typ
        
    return data

data = extract_hotel_data()
with open('first_test.txt', 'w') as f:
    f.write(str(data))
print('Done!')

The Boston Common Hotel and Conference Center does not exist!
enVision Hotel Boston does not exist!
Battery Wharf Hotel, Boston Waterfront does not exist!
Sheraton Boston Hotel does not exist!
Hilton Boston Downtown / Faneuil Hall does not exist!
Aloft Boston Seaport does not exist!
Ames Boston Hotel does not exist!
Element Boston Seaport does not exist!
Holiday Inn Express Hotel &amp; Suites Boston Garden does not exist!
Revere Hotel Boston Common does not exist!
The Liberty Hotel does not exist!
The Godfrey Hotel Boston does not exist!
Wyndham Boston Beacon Hill does not exist!
Loews Boston Hotel does not exist!
The Boxer Boston does not exist!
W Boston does not exist!
Omni Parker House does not exist!
The Verb Hotel does not exist!
Residence Inn Boston Back Bay / Fenway does not exist!
Hilton Garden Inn Boston Logan Airport does not exist!
Residence Inn Boston Downtown Seaport does not exist!
The Envoy Hotel, Autograph Collection does not exist!
DoubleTree by Hilton Hotel Boston - D

**Task 2 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
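
As a quick check of the formula above, the weighted average can be computed directly from the five star counts. This is only a minimal sketch: the helper name `avg_score` and the `counts` dictionary are illustrative assumptions, and the numbers are the ones from the example formula.

```python
def avg_score(counts):
    """Weighted mean star rating from a {stars: number_of_reviews} mapping."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("hotel has no ratings")
    return sum(stars * n for stars, n in counts.items()) / float(total)

# Star counts taken from the example formula above.
counts = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
score = avg_score(counts)
print(round(score, 4))  # 11712 / 2527, roughly 4.6347
```

Raising an error for a hotel with no ratings is one possible choice; skipping such hotels would work equally well.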

In [16]:
import statsmodels.api as sm
import numpy as np

sleep = [data[hotel]['Attribute']['Sleep Quality'] for hotel in data]
rooms = [data[hotel]['Attribute']['Rooms'] for hotel in data]
service = [data[hotel]['Attribute']['Service'] for hotel in data]
cleanliness = [data[hotel]['Attribute']['Cleanliness'] for hotel in data]
business_service = [data[hotel]['Attribute']['Business service (e.g., internet access)'] for hotel in data]
front_desk = [data[hotel]['Attribute']['Check in / front desk'] for hotel in data]
value = [data[hotel]['Attribute']['Value'] for hotel in data]
location = [data[hotel]['Attribute']['Location'] for hotel in data]
couples = [data[hotel]['Type']['Couples'] for hotel in data]
solo = [data[hotel]['Type']['Solo'] for hotel in data]
friends = [data[hotel]['Type']['Friends'] for hotel in data]
business = [data[hotel]['Type']['Business'] for hotel in data]
family = [data[hotel]['Type']['Family'] for hotel in data]
rating1 = [data[hotel]['Rating'][1] for hotel in data]
rating2 = [data[hotel]['Rating'][2] for hotel in data]
rating3 = [data[hotel]['Rating'][3] for hotel in data]
rating4 = [data[hotel]['Rating'][4] for hotel in data]
rating5 = [data[hotel]['Rating'][5] for hotel in data]

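# Response variable: each hotel's average rating.
# (Reconstructed line; assumed to be the 'Average' entry stored by get_hotel_attribs.)
y = np.array([data[hotel]['Rating']['Average'] for hotel in data])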
X = np.array([sleep, rooms, service, cleanliness, business_service, front_desk, \
    value, location, couples, solo, friends, business, family, rating1, \
    rating2, rating3, rating4, rating5]).T

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.347e+04
Date:                Mon, 28 Mar 2016   Prob (F-statistic):           1.73e-71
Time:                        01:13:04   Log-Likelihood:                 89.681
No. Observations:                  59   AIC:                            -143.4
Df Residuals:                      41   BIC:                            -106.0
Df Model:                          18                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0959      0.099      0.967      0.3

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
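
The labeling rule described above might be sketched as follows. The function name `is_excellent` and its `threshold` parameter are hypothetical, the first set of counts comes from the Task 2 example, and the second set is made up to show the boundary case.

```python
def is_excellent(counts, threshold=0.6):
    """Return 1 if the share of 5-star ratings exceeds `threshold`, else 0."""
    total = sum(counts[stars] for stars in range(1, 6))
    if total == 0:
        return 0  # assumption: a hotel with no ratings is not excellent
    return 1 if counts[5] / float(total) > threshold else 0

example = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}  # counts from the Task 2 example
borderline = {1: 0, 2: 0, 3: 0, 4: 40, 5: 60}      # made-up: exactly 60% five-star

print(is_excellent(example))     # 1861 / 2527 is about 0.74, so excellent
print(is_excellent(borderline))  # exactly 60% is not "more than 60%"
```

The strict `>` follows the wording "more than 60%"; using `>=` would also label the borderline hotel as excellent.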

In [21]:
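# Label a hotel excellent (1) when more than 60% of its ratings are 5 stars.
# Sum only the star counts 1-5: data[hotel]['Rating'] also holds the 'Average' entry.
y_binary = np.array([1 if data[hotel]['Rating'][5] > \
                     .6 * sum(data[hotel]['Rating'][i] for i in range(1, 6)) \
                     else 0 for hotel in data])

# Fit on the binary labels. Passing the continuous `y` here instead raises
# "ValueError: endog must be in the unit interval."
logit = sm.Logit(y_binary, X)
result = logit.fit()
print(result.summary())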

-------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()