(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing hotel ratings on Tripadvisor

In this homework we will practice two techniques: web scraping and regression. For the first part, we will build on the sample code from the lecture and collect some basic information about each hotel. Then we will fit regression models to this information and analyze the results.

One of the main disadvantages of scraping a website instead of using an API is that the website may change its layout without notice and render our code useless. That is what happened in our case: Tripadvisor changed the layout of the buttons that we use to navigate between the result pages. This was the main reason people were having problems executing the code.

**Task 1 (20 pts)**

The first task of the homework is to fix the scraping code. We basically need to replace the part where we are checking if there is another page and getting its link with new code that reflects the new navigation layout. 
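
As a rough illustration of the "is there another page, and what is its link" check, here is a minimal sketch written for Python 3 with only the standard library. The class names `next` and `disabled` and the sample markup are assumptions made up for this example; the real Tripadvisor markup has to be inspected in the browser, and the lecture code may use a different parser.

``` python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a> whose class list contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # "next" and "disabled" are assumed class names, not Tripadvisor's real ones.
        if "next" in classes and "disabled" not in classes:
            self.next_href = attrs.get("href")


def find_next_page(html, base_url):
    """Return the absolute URL of the next results page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)


# Hypothetical markup, for illustration only.
sample = '<div class="pagination"><a class="nav next" href="/Hotels-page2">Next</a></div>'
last = '<div class="pagination"><span class="nav next disabled">Next</span></div>'
print(find_next_page(sample, "https://www.tripadvisor.com"))
print(find_next_page(last, "https://www.tripadvisor.com"))
```

Returning `None` on the last page gives the scraping loop a simple stop condition.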

**Task 2 (30 pts)**

Then, for each hotel that our search returns, we will "click" (with the code of course) on it and scrape the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

**Task 3 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
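
The formula above can be checked with a few lines of Python. This is a small sketch assuming Python 3 (where `/` is true division); the function name `average_score` is chosen for this example, and the counts are the ones from the formula.

``` python
def average_score(counts):
    """counts maps a star value (1-5) to the number of ratings with that value."""
    total = sum(counts.values())
    return sum(stars * n for stars, n in counts.items()) / total


# Rating counts of the example hotel from the formula above.
example = {1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}
print(round(average_score(example), 3))  # 4.635
```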

**Task 4 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
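
The binary label can be derived directly from the rating counts. The sketch below assumes Python 3; `is_excellent` and its `threshold` parameter are names chosen for this example. Note the strict comparison, since the definition says "more than" 60%.

``` python
def is_excellent(counts, threshold=0.6):
    """True if strictly more than `threshold` of all ratings are 5 stars."""
    total = sum(counts.values())
    return counts.get(5, 0) / total > threshold


# The example hotel from Task 3: 1861 of 2527 ratings are 5 stars (about 74%).
print(is_excellent({1: 31, 2: 33, 3: 98, 4: 504, 5: 1861}))
# Exactly 60% five-star ratings does not count as excellent.
print(is_excellent({4: 4, 5: 6}))
```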

-------

In order to use code from a Python script file, we need to put that file in the same folder as the notebook and import it as a library. Then, we will be able to access its functions. For example, in the case of the lecture code, we could do the following:

``` python
import scrape_solution as scrape

scrape.get_city_page()
```

Of course, you might need to modify and restructure the code so that it returns what you need.
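
One way to restructure a script for importing is to have its functions return values instead of only logging them, and to keep any script-only behavior behind a `__main__` guard. The file below is a hypothetical layout for illustration, not the lecture's `scrape_solution.py`; the function name `parse_hotel_row` is an assumption.

``` python
# scrape_sketch.py: a hypothetical layout, not the lecture's scrape_solution.py


def parse_hotel_row(row):
    """Turn one scraped (name, stars, reviews) triple into a dict and return it."""
    name, stars, reviews = row
    return {
        "name": name,
        "stars": float(stars),
        # "2,551 reviews" -> 2551
        "reviews": int(reviews.split()[0].replace(",", "")),
    }


def main():
    # Runs only when the file is executed as a script, not when it is imported.
    print(parse_hotel_row(("Seaport Boston Hotel", "4.5", "2,551 reviews")))


if __name__ == "__main__":
    main()
```

With this layout, `import scrape_sketch` in the notebook has no side effects, and the notebook decides what to call and what to do with the returned data.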

----

In [70]:
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import scipy as sp
import scipy.sparse.linalg as linalg
import scipy.cluster.hierarchy as hr
from scipy.spatial.distance import pdist, squareform

import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.utils as utils
import sklearn.linear_model as linear_model
import sklearn.svm as svm
import sklearn.model_selection as model_selection  # sklearn.cross_validation was removed in scikit-learn 0.20
import sklearn.cluster as cluster
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

from patsy import dmatrices

import networkx as nx

import seaborn as sns
%matplotlib inline

In [71]:
import scrape_solution as scrape
#finds the hotels, number of stars, and number of ratings for the state and city of boston
scrape.scrape_hotels("boston", "massachusetts")
#returns a list of the urls
urls = scrape.getUrls()

[2015-04-06 22:41:21,930] #################################### Option 2 ######################################
INFO:scrape_solution:#################################### Option 2 ######################################
[2015-04-06 22:41:22,249] #################################### Option 3 ######################################
INFO:scrape_solution:#################################### Option 3 ######################################
[2015-04-06 22:41:22,605] Hotel name: Seaport Boston Hotel
INFO:scrape_solution:Hotel name: Seaport Boston Hotel
[2015-04-06 22:41:22,609] Stars: 4.5
INFO:scrape_solution:Stars: 4.5
[2015-04-06 22:41:22,615] Number of reviews: 2,551 reviews 
INFO:scrape_solution:Number of reviews: 2,551 reviews 
[2015-04-06 22:41:22,619] Hotel name: Hyatt Boston Harbor
INFO:scrape_solution:Hotel name: Hyatt Boston Harbor
[2015-04-06 22:41:22,625] Stars: 4
INFO:scrape_solution:Stars: 4
[2015-04-06 22:41:22,631] Number of reviews: 1,171 reviews 
INFO:scrape_solution:Number of re

SystemExit: 

To exit: use 'exit', 'quit', or Ctrl-D.


In [73]:
#goes through all the hotels pages and appends the information to a global dictionary
scrape.parse_hotelpages(urls)
#returns the dictionary that contains everything I need
hotel_dt = scrape.getDict()
#converts it to a dataframe. 
hotel_db =pd.DataFrame(hotel_dt)
hotel_db = hotel_db.T
hotel_db

In [74]:

avg_ls = []
exlent = []
#Calculates the average score and whether or not it is excellent
for idx, row in hotel_db.iterrows():
    weighted_sum = (5*row["Excellent"]) + (4*row["Very good"]) + (3*row["Average"]) + (2*row["Poor"]) + row["Terrible"]
    #float() guards against integer division
    avg_rate = weighted_sum / float(row["total number rating"])
    avg_ls.append(avg_rate)
    
    #excellent means strictly more than 60% of the ratings are 5 stars
    percentEx = row["Excellent"] / float(row["total number rating"])
    exlent.append(percentEx > .6)
    
hotel_db["Average rating"] = avg_ls
hotel_db["Is Excellent"] = exlent

hotel_db
# avg_ls

Unnamed: 0,Average,Business,Cleanliness,Couples,Excellent,Families,Location,Poor,Rooms,Service,Sleep Quality,Solo,Terrible,Value,Very good,total number rating,total rating,Average rating,Is Excellent
Americas Best Value Inn,8,3,3.0,3,2,4,3.0,1,2.5,2.5,2.5,2,5,3.0,2,18,2.5,2.722222,False
Ames Boston Hotel,89,153,4.5,431,473,103,5.0,35,4.5,4.5,4.5,65,15,4.0,258,870,4.5,4.309195,False
BEST WESTERN PLUS Roundhouse Suites,129,104,4.0,179,179,257,4.0,69,4.0,4.0,3.5,17,34,4.0,305,716,3.5,3.734637,False
BEST WESTERN University Hotel Boston-Brighton,78,49,4.0,59,97,166,4.0,33,3.5,4.0,4.0,27,21,4.0,132,361,3.5,3.695291,False
"Battery Wharf Hotel, Boston Waterfront",74,128,4.5,407,560,175,4.5,24,4.5,4.5,4.5,51,9,4.0,197,864,4.5,4.475694,True
Beacon Hill Hotel and Bistro,18,13,4.5,72,66,17,5.0,9,4.0,4.0,4.0,15,4,4.0,46,143,4.0,4.125874,False
Boston Harbor Hotel,44,255,5.0,462,992,285,5.0,15,4.5,5.0,4.5,55,11,4.5,189,1251,4.5,4.707434,True
Boston Hotel Buckminster,149,106,4.0,276,169,178,4.5,67,3.5,4.0,4.0,63,70,4.0,373,828,3.5,3.608696,False
Boston Marriott Copley Place,332,710,4.5,466,638,440,4.5,98,4.0,4.0,4.0,95,35,3.5,937,2040,4.0,4.002451,False
Boston Marriott Long Wharf,137,253,4.5,344,611,434,5.0,49,4.0,4.5,4.0,38,26,3.5,467,1290,4.0,4.231008,False


In [75]:

#shuffles their order by sorting on a column of random numbers
hotel_db["rand"] = np.random.rand(len(hotel_db))
hotel_db = hotel_db.sort_values("rand")

# makes a new dataframe with just the features I want to look at. 
features = hotel_db[["Families", "Couples", "Solo", "Business", "Location", "Sleep Quality", "Rooms", "Service", "Value", "Cleanliness"]]

feature_names = ["Families", "Couples", "Solo", "Business", "Location", "Sleep Quality", "Rooms", "Service", "Value", "Cleanliness"]

features2 = hotel_db[["Families", "Couples", "Solo", "Business", "Location", "Sleep Quality",
                     "Rooms", "Service", "Value", "Cleanliness", "Average rating"]]

feature_names2 = ["Families", "Couples", "Solo", "Business", "Location", "Sleep Quality",
                 "Rooms", "Service", "Value", "Cleanliness", "Average rating"]


#make the training and testing data sets 
#x2 and y2 will be used for logistic regression. 
#This is so there will be a testing set to verify with. 
y_train = hotel_db[["Average rating"]].head(40)
y_test = hotel_db[["Average rating"]].tail(len(hotel_db) - 40)

y_train2 = hotel_db[["Is Excellent"]].head(40)
y_test2 = hotel_db[["Is Excellent"]].tail(len(hotel_db) - 40)


X_train = features.head(40)
X_test = features.tail(len(features) - 40)

X_train2 = features2.head(40)
X_test2 = features2.tail(len(features2) - 40)

# model = sm.OLS(y_training, training)
# results = model.fit()
# print results.summary()
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
# The mean squared error (arguments are y_true, y_pred)
print("Training error: ", metrics.mean_squared_error(y_train, regr.predict(X_train)))
print("Test     error: ", metrics.mean_squared_error(y_test, regr.predict(X_test)))

train_score = regr.score(X_train,y_train)
test_score = regr.score(X_test,y_test)
print("Training score: ", train_score)
print("Test     score: ", test_score)

coefficients = regr.coef_
# for i in range(len(coefficients)):
#     print feature_names[i],"\t",coefficients[i]
    
pd.DataFrame(list(zip(feature_names, np.transpose(coefficients))))

# print "Confidence Intervals:", results.conf_int()
# print "Parameters:", results.params

('Training error: ', 0.011653078378382612)
('Test     error: ', 0.009875575579639027)
('Training score: ', 0.93855252750699569)
('Test     score: ', 0.95887781527938731)


Unnamed: 0,0,1
0,Families,[-0.000327535732498]
1,Couples,[0.000361935629163]
2,Solo,[-0.000520046762075]
3,Business,[-3.13212273054e-05]
4,Location,[0.0470369930227]
5,Sleep Quality,[0.22424338631]
6,Rooms,[0.283164416082]
7,Service,[-0.0128281471846]
8,Value,[0.282372420688]
9,Cleanliness,[0.269765258915]


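Based on the coefficients above, the most important features are Rooms (0.283), Value (0.282), Cleanliness (0.270), and Sleep Quality (0.224), followed at some distance by Location (0.047). The traveler-type counts (Families, Couples, Solo, Business) have coefficients close to zero, but they are raw counts on a much larger scale than the 1-5 sub-ratings, so their coefficients are not directly comparable. The model fits well: R^2 is about 0.94 on the training set and 0.96 on the test set.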
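
One caveat when reading the coefficient table: the size of a coefficient depends on the units of its feature. The sketch below uses made-up numbers (not the scraped data) and a hand-written one-feature least-squares slope to show that multiplying a feature by 100 divides its coefficient by 100 without changing the fit. The helper `ols_slope` is a name chosen for this example; it assumes Python 3.

``` python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up numbers for illustration; not taken from the scraped data.
sub_rating = [3.0, 3.5, 4.0, 4.5, 5.0]
avg_score = [3.1, 3.6, 4.0, 4.4, 4.9]

slope_small_scale = ols_slope(sub_rating, avg_score)
slope_large_scale = ols_slope([100 * x for x in sub_rating], avg_score)
print(slope_small_scale, slope_large_scale)
```

Standardizing the features before fitting (for example with the `StandardScaler` imported at the top of the notebook) is one way to put the coefficients on a comparable footing.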

Now onto Logistic Regression!

In [76]:


logistic_regr = linear_model.LogisticRegression()
# ravel() passes the labels as a 1-d array, which avoids the column_or_1d warning
logistic_regr.fit(X_train2, y_train2.values.ravel())

# predict class labels for the test set
y_predicted = logistic_regr.predict(X_test2)
print("y predicted")
print(y_predicted)


# generate evaluation metrics
#accuracy: the fraction of test hotels that were labeled correctly
print("\njaccard similarity")
print(metrics.accuracy_score(y_test2, y_predicted))

print("\nconfusion matrix")
print(metrics.confusion_matrix(y_test2, y_predicted))

print("\nclassification report")
print(metrics.classification_report(y_test2, y_predicted))

# examine the coefficients
print("\ncoefficients")
print(pd.DataFrame(list(zip(feature_names2, np.transpose(logistic_regr.coef_)))))

y predicted
[ True False False False False False False False False False False False
 False False False False False False  True False False  True  True False
 False False False False False False False False False  True False False
 False]

jaccard similarity
0.756756756757

confusion matrix
[[25  2]
 [ 7  3]]

classification report
             precision    recall  f1-score   support

      False       0.78      0.93      0.85        27
       True       0.60      0.30      0.40        10

avg / total       0.73      0.76      0.73        37


coefficients
                 0                    1
0         Families   [0.00446422489268]
1          Couples   [0.00548095804556]
2             Solo   [-0.0323030214542]
3         Business  [-0.00321877814329]
4         Location    [-0.943074141633]
5    Sleep Quality     [0.295181060245]
6            Rooms     [0.282868232319]
7          Service    [0.0399944546104]
8            Value   [-0.0279421543755]
9      Cleanliness    [-0.27394067857

  y = column_or_1d(y, warn=True)


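The logistic regression did not perform as well as the linear regression. The accuracy (printed as "jaccard similarity" above) was only 0.757, which is barely above the 0.730 that always predicting False would achieve on this test set, where 27 of the 37 hotels are not excellent.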

Confusion Matrix
The top left (25) and bottom right (3) entries are the numbers of hotels that were classified correctly. The top right (2) and bottom left (7) are the numbers that were classified incorrectly. The model misses 7 of the 10 excellent hotels in the test set, so the results are again just okay.

Classification Report
Precision is the fraction of the hotels predicted to be in a class that actually belong to it. Recall is the fraction of the hotels that actually belong to a class that the model found. For the excellent class the precision is 0.60 and the recall is only 0.30, so the results were once again not great.
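
The numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below assumes Python 3 and scikit-learn's convention that rows are true labels and columns are predicted labels, both ordered (False, True).

``` python
# Confusion matrix from the run above: rows are true labels, columns are
# predictions, both ordered (False, True).
cm = [[25, 2],
      [7, 3]]
tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_true = tp / (tp + fp)   # of the hotels predicted excellent, how many are
recall_true = tp / (tp + fn)      # of the excellent hotels, how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)
baseline = (tn + fp) / total      # accuracy of always predicting False

print(round(accuracy, 3), round(precision_true, 2), round(recall_true, 2),
      round(f1_true, 2), round(baseline, 3))
```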

Of the coefficients, Average rating was by far the largest in magnitude (its row is cut off in the output above), which makes sense because a hotel whose ratings are mostly 5 stars necessarily has a high average rating.

In [2]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    # the with-block closes the file once it has been read
    with open("../../theme/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()