(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing hotel ratings on Tripadvisor

In this homework we will practice two techniques: web scraping and regression. In the first part, we will build on the sample code from the lecture to collect some basic information about each hotel. Then we will fit regression models to this information and analyze the results.

One of the main disadvantages of scraping a website instead of using an API is that the website may change its layout without notice and break our code. That is what happened in our case: Tripadvisor changed the layout of the buttons used to navigate between the result pages, which was the main reason people had problems executing the code.

**Task 1 (20 pts)**

The first task of the homework is to fix the scraping code. We need to replace the part that checks whether there is another page of results and retrieves its link with new code that reflects the new navigation layout.
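
As a rough illustration of the "is there another page, and what is its link" check, here is a minimal sketch that uses only the standard library. The markup, the `next` and `disabled` class names, and the helper names `NextPageFinder` and `find_next_page` are all assumptions made up for this example; they are not Tripadvisor's actual layout, so inspect the live page before adapting it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextPageFinder(HTMLParser):
    """Remembers the href of the first <a> tag whose classes include `marker`."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker    # hypothetical CSS class of the "next" button
        self.next_href = None   # stays None when no further page is linked

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Assumption: a disabled button marks the last page of results.
        if self.marker in classes and "disabled" not in classes:
            self.next_href = attrs.get("href")


def find_next_page(html, base_url, marker="next"):
    """Return the absolute URL of the next results page, or None on the last page."""
    finder = NextPageFinder(marker)
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(base_url, finder.next_href)


# Made-up snippets for illustration only.
sample = '<div class="pagination"><a class="nav next" href="/Hotels-page2">Next</a></div>'
last = '<div class="pagination"><span class="nav next disabled">Next</span></div>'
```

A scraping loop could then keep requesting pages while `find_next_page` returns a URL and stop as soon as it returns `None`.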

**Task 2 (30 pts)**

Then, for each hotel that our search returns, we will "click" on it (in code, of course) and scrape the information shown below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

**Task 3 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1 \cdot 31 + 2 \cdot 33 + 3 \cdot 98 + 4 \cdot 504 + 5 \cdot 1861}{2527}$$

Use the model to analyze which factors are most important in determining the $\text{AVG_SCORE}$.
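
To make the formula concrete, the following minimal sketch computes the weighted mean for the example hotel. The function name `avg_score` and the dictionary layout (star value mapped to number of ratings) are assumptions chosen for illustration, not part of the lecture code.

```python
def avg_score(counts):
    """Weighted mean of star ratings; counts[k] is the number of k-star ratings."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("hotel has no ratings")
    return sum(stars * n for stars, n in counts.items()) / total


# Rating counts of the example hotel from the formula above.
example = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
score = avg_score(example)
print(round(score, 2))  # 11712 / 2527, roughly 4.63
```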

**Task 4 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
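
The binary label can be derived directly from the rating counts. The sketch below is one possible way to do it; the name `is_excellent` is hypothetical, and treating a hotel with no ratings as not excellent is an assumption the assignment does not specify.

```python
def is_excellent(counts, threshold=0.60):
    """True if more than `threshold` of all ratings are 5 stars."""
    total = sum(counts.values())
    if total == 0:
        return False  # assumption: no ratings means not excellent
    return counts.get(5, 0) / total > threshold


# Example hotel from Task 3: 1861 of 2527 ratings are 5 stars (about 74%).
example_hotel = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
# Boston Marriott Copley Place from the scraped table: 638 of 2040 (about 31%).
marriott = {1: 35, 2: 98, 3: 332, 4: 937, 5: 638}

print(is_excellent(example_hotel), is_excellent(marriott))
```

The resulting True/False column can then serve as the response variable of the logistic regression.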

-------

To use code from a Python script file, we need to put that file in the same folder as the notebook and import it as a module. Then we will be able to access its functions. For example, in the case of the lecture code, we could do the following:

``` python
import scrape_solution as scrape

scrape.get_city_page()
```

Of course, you might need to modify and restructure the code so that it returns what you need.
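
One possible restructuring, sketched here under assumptions, is to have helper functions return cleaned values instead of only logging them. The names `parse_review_count` and `collect_hotels` and the tuple layout of `rows` are hypothetical; the sample strings mimic the format that the scraper logs (for instance `1,171 reviews `).

```python
def parse_review_count(text):
    """Turn a scraped string such as '1,171 reviews ' into the integer 1171."""
    digits = text.strip().split()[0].replace(",", "")
    return int(digits)


def collect_hotels(raw_rows):
    """Return one dict per hotel so the notebook can build a DataFrame from it."""
    return [
        {"name": name, "stars": float(stars), "reviews": parse_review_count(reviews)}
        for name, stars, reviews in raw_rows
    ]


# Hypothetical input: (name, stars, review text) tuples as a scraper might yield them.
rows = [
    ("Hyatt Boston Harbor", "4", "1,171 reviews "),
    ("Seaport Boston Hotel", "4.5", "2,551 reviews "),
]
hotels = collect_hotels(rows)
```

Returning plain Python data like this keeps the scraping module independent of the notebook, which can then decide how to store or analyze the values.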

----

In [5]:
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import scipy as sp
import scipy.sparse.linalg as linalg
import scipy.cluster.hierarchy as hr
from scipy.spatial.distance import pdist, squareform

import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.utils as utils
import sklearn.linear_model as linear_model
import sklearn.svm as svm
import sklearn.model_selection as model_selection  # replaces sklearn.cross_validation, which was removed in scikit-learn 0.20
import sklearn.cluster as cluster
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

from patsy import dmatrices

import networkx as nx

import seaborn as sns
%matplotlib inline

In [6]:
import scrape_solution as scrape
# Find the hotels, their star ratings, and their review counts for Boston, Massachusetts
scrape.scrape_hotels("boston", "massachusetts")
# Get the list of hotel page URLs collected by the scraper
urls = scrape.getUrls()

[2015-04-06 18:26:54,627] #################################### Option 2 ######################################
INFO:scrape_solution:#################################### Option 2 ######################################
[2015-04-06 18:26:54,716] #################################### Option 3 ######################################
INFO:scrape_solution:#################################### Option 3 ######################################
[2015-04-06 18:26:54,811] Hotel name: Hyatt Boston Harbor
INFO:scrape_solution:Hotel name: Hyatt Boston Harbor
[2015-04-06 18:26:54,812] Stars: 4
INFO:scrape_solution:Stars: 4
[2015-04-06 18:26:54,813] Number of reviews: 1,171 reviews 
INFO:scrape_solution:Number of reviews: 1,171 reviews 
[2015-04-06 18:26:54,815] Hotel name: Seaport Boston Hotel
INFO:scrape_solution:Hotel name: Seaport Boston Hotel
[2015-04-06 18:26:54,815] Stars: 4.5
INFO:scrape_solution:Stars: 4.5
[2015-04-06 18:26:54,818] Number of reviews: 2,551 reviews 
INFO:scrape_solution:Number of re

SystemExit: 

To exit: use 'exit', 'quit', or Ctrl-D.


In [10]:
# Visit every hotel page and append its information to a global dictionary
scrape.parse_hotelpages(urls)
# Get the dictionary that contains all the scraped information
hotel_dt = scrape.getDict()
# Convert it to a DataFrame (one column per hotel at this point)
hotel_db = pd.DataFrame(hotel_dt)
hotel_db
# Transpose so that each row is a hotel and each column is an attribute
hotel_db = hotel_db.T

Unnamed: 0,Americas Best Value Inn,Ames Boston Hotel,BEST WESTERN PLUS Roundhouse Suites,BEST WESTERN University Hotel Boston-Brighton,"Battery Wharf Hotel, Boston Waterfront",Beacon Hill Hotel and Bistro,Boston Harbor Hotel,Boston Hotel Buckminster,Boston Marriott Copley Place,Boston Marriott Long Wharf,...,The Liberty Hotel,The Midtown Hotel,The Ritz-Carlton Boston Common,The Verb Hotel,The Westin Boston Waterfront,The Westin Copley Place,W Boston,Wyndham Boston Beacon Hill,XV Beacon,enVision Hotel Boston
Average,8.0,89.0,129.0,78.0,74.0,18.0,44.0,149.0,332.0,137.0,...,77.0,338.0,51.0,9.0,267.0,277.0,107.0,185.0,30.0,21.0
Business,3.0,153.0,104.0,49.0,128.0,13.0,255.0,106.0,710.0,253.0,...,173.0,244.0,143.0,31.0,728.0,774.0,218.0,281.0,115.0,64.0
Cleanliness,3.0,4.5,4.0,4.0,4.5,4.5,5.0,4.0,4.5,4.5,...,4.5,4.0,4.5,5.0,4.5,4.5,4.5,4.5,5.0,5.0
Couples,3.0,431.0,179.0,59.0,407.0,72.0,462.0,276.0,466.0,344.0,...,264.0,326.0,120.0,78.0,383.0,495.0,312.0,167.0,406.0,102.0
Excellent,2.0,473.0,179.0,97.0,560.0,66.0,991.0,169.0,638.0,611.0,...,396.0,251.0,299.0,177.0,586.0,1071.0,418.0,348.0,598.0,246.0
Families,4.0,103.0,257.0,166.0,175.0,17.0,284.0,178.0,440.0,434.0,...,126.0,394.0,124.0,41.0,234.0,547.0,128.0,362.0,112.0,104.0
Location,3.0,5.0,4.0,4.0,4.5,5.0,5.0,4.5,4.5,5.0,...,4.5,4.5,4.5,4.5,4.5,5.0,4.5,4.5,5.0,4.5
Poor,1.0,35.0,69.0,33.0,24.0,9.0,15.0,67.0,98.0,49.0,...,28.0,124.0,39.0,1.0,85.0,93.0,50.0,70.0,22.0,7.0
Rooms,2.5,4.5,4.0,3.5,4.5,4.0,4.5,3.5,4.0,4.0,...,4.0,3.5,4.5,4.5,4.0,4.5,4.0,4.0,4.5,4.5
Service,2.5,4.5,4.0,4.0,4.5,4.0,5.0,4.0,4.0,4.5,...,4.5,4.0,4.5,4.5,4.0,4.5,4.5,4.0,4.5,4.5


In [30]:

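# Average rating = star-weighted sum of the rating counts divided by the number of ratings
avg_ls = []

for idx, row in hotel_db.iterrows():
    # 5 = Excellent, 4 = Very good, 3 = Average, 2 = Poor, 1 = Terrible
    weighted_sum = (5 * row["Excellent"] + 4 * row["Very good"] + 3 * row["Average"]
                    + 2 * row["Poor"] + 1 * row["Terrible"])
    avg_ls.append(weighted_sum / row["total number rating"])

hotel_db["Average rating"] = avg_ls
hotel_db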

Unnamed: 0,Average,Business,Cleanliness,Couples,Excellent,Families,Location,Poor,Rooms,Service,Sleep Quality,Solo,Terrible,Value,Very good,total number rating,total rating,Average rating
Americas Best Value Inn,8,3,3.0,3,2,4,3.0,1,2.5,2.5,2.5,2,5,3.0,2,18,2.5,2.722222
Ames Boston Hotel,89,153,4.5,431,473,103,5.0,35,4.5,4.5,4.5,65,15,4.0,258,870,4.5,4.309195
BEST WESTERN PLUS Roundhouse Suites,129,104,4.0,179,179,257,4.0,69,4.0,4.0,3.5,17,34,4.0,305,716,3.5,3.734637
BEST WESTERN University Hotel Boston-Brighton,78,49,4.0,59,97,166,4.0,33,3.5,4.0,4.0,27,21,4.0,132,361,3.5,3.695291
"Battery Wharf Hotel, Boston Waterfront",74,128,4.5,407,560,175,4.5,24,4.5,4.5,4.5,51,9,4.0,197,864,4.5,4.475694
Beacon Hill Hotel and Bistro,18,13,4.5,72,66,17,5.0,9,4.0,4.0,4.0,15,4,4.0,46,143,4.0,4.125874
Boston Harbor Hotel,44,255,5.0,462,991,284,5.0,15,4.5,5.0,4.5,55,11,4.5,189,1250,4.5,4.707200
Boston Hotel Buckminster,149,106,4.0,276,169,178,4.5,67,3.5,4.0,4.0,63,70,4.0,373,828,3.5,3.608696
Boston Marriott Copley Place,332,710,4.5,466,638,440,4.5,98,4.0,4.0,4.0,95,35,3.5,937,2040,4.0,4.002451
Boston Marriott Long Wharf,137,253,4.5,344,611,434,5.0,49,4.0,4.5,4.0,38,26,3.5,467,1290,4.0,4.231008


In [2]:
# Code for setting the style of the notebook
from IPython.display import HTML

def css_styling():
    # Read the custom stylesheet and close the file when done
    with open("../../theme/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)

css_styling()