(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Analyzing Hotel Ratings on Tripadvisor

In this homework we will focus on practicing two techniques: web scraping and regression. For the first part, we will get some basic information for each hotel in Boston. Then, we will fit a regression model on this information and try to analyze it.

**Task 1 (30 pts)**

We will scrape the data using Beautiful Soup. For each hotel that our search returns, we will get the information below.

![Information to be scraped](hotel_info.png)

Of course, feel free to collect even more data if you want. 

The code below does the scraping. The `parse_hotellist_page` function walks through each hotel on a listing page and calls
`get_traveller_rating`, which returns the hotel's rating breakdown (Excellent through Terrible) and the number of reviews by traveller type (Families, Couples, Solo, Business, Friends).

In [1]:
from BeautifulSoup import BeautifulSoup
import sys
import time
import os
import logging
import argparse
import requests
import codecs
import json

base_url = "http://www.tripadvisor.com"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36"
datadir = "data"
hoteldir = "data/hotelpages"
city = "BOSTON"
state = "MASSACHUSETTS"
hotel_details = [] # A list of dictionaries, one per hotel


def get_tourism_page(city, state):
    """
            Return the json containing the
            URL of the tourism city page
    """

    # EXAMPLE: http://www.tripadvisor.com/TypeAheadJson?query=boston%20massachusetts&action=API
    #          http://www.tripadvisor.com//TypeAheadJson?query=san%20francisco%20california&type=GEO&action=API
    url = "%s/TypeAheadJson?query=%s%%20%s&action=API" % (base_url, "%20".join(city.split()), state)
    # Given the url, request the HTML page
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')

    with open(os.path.join(datadir, city + '-search-page.json'), "w") as h:
        h.write(html)

    # Parse json to get url
    js = json.loads(html)
    results = js['results']
    urls = results[0]['urls'][0]

    # get tourism page url
    tourism_url = urls['url']
    return tourism_url
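
The query URL and the JSON lookup in `get_tourism_page` can be tried offline. The following is a minimal sketch written for Python 3 (the notebook itself is Python 2). The helper names are illustrative, and the sample payload is hypothetical: it only mirrors the `results[0]['urls'][0]['url']` path that the function reads, not the full TripAdvisor response.

```python
import json
from urllib.parse import quote

BASE_URL = "http://www.tripadvisor.com"


def build_search_url(city, state):
    """Build the TypeAheadJson query URL, percent-encoding the spaces."""
    query = quote("%s %s" % (city, state))
    return "%s/TypeAheadJson?query=%s&action=API" % (BASE_URL, query)


def extract_tourism_url(payload):
    """Return the first URL of the first search result."""
    results = json.loads(payload)["results"]
    return results[0]["urls"][0]["url"]


# Hypothetical payload, reduced to the fields read above.
sample = '{"results": [{"urls": [{"url": "/Tourism-g60745-Boston_Massachusetts-Vacations.html"}]}]}'

search_url = build_search_url("san francisco", "california")
tourism_url = extract_tourism_url(sample)
```

Using `quote` on the whole query handles multi-word cities and states alike, whereas the string formatting in the cell above only splits the city name.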

In [2]:
# Get the URL of the city's hotel listing page
def get_city_page(tourism_url):
    """
            Get the URL of the hotels of the city
            using the URL returned by the function
            get_tourism_page()
    """

    url = base_url + tourism_url

    # Given the url, request the HTML page
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')

    # Save to file
    with open(os.path.join(datadir, city + '-tourism-page.html'), "w") as h:
            h.write(html)


    # Use BeautifulSoup to extract the url for the list of hotels in
    # the city and state we are interested in.
    # For example, in this case we need the href of:
    #<li class="hotels twoLines">
    #<a href="/Hotels-g60745-Boston_Massachusetts-Hotels.html" data-trk="hotels_nav"
    soup = BeautifulSoup(html)
    li = soup.find("li", {"class": "hotels twoLines"})
    city_url = li.find('a', href = True)
    #log.info("CITY PAGE URL: %s" % city_url['href'])
    return city_url['href']

In [3]:
def get_hotellist_page(city_url, count):
    """ Get the hotel list page given the url returned by
            get_city_page(). Return the html after saving
            it to the datadir
    """

    url = base_url + city_url
    # Sleep 2 sec before starting a new http request
    time.sleep(2)
    # Request page
    headers = { 'User-Agent' : user_agent }
    response = requests.get(url, headers=headers)
    html = response.text.encode('utf-8')
    # Save the page to file
    with open(os.path.join(datadir, city + '-hotelist-' + str(count) + '.html'), "w") as h:
        h.write(html)
    return html

The cell below fetches each hotel's page and extracts the traveller rating breakdown and the number of reviews by traveller type (Families, Couples, etc.).

In [4]:
def get_traveller_rating(name, url_from_hotellist):
    hotel_url = base_url + url_from_hotellist
    
    # Sleep 2 sec before starting a new http request
    time.sleep(2)
    headers = { 'User-agent': user_agent }
    response = requests.get(hotel_url, headers=headers)
    html = response.text.encode('utf-8')
    #Dictionary
    traveller_rating = {}
    soup = BeautifulSoup(html)
    traveller_rating["Excellent"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_5'}).findAll('span')[2].find(text=True)
    traveller_rating["Very Good"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_4'}).findAll('span')[2].find(text=True)
    traveller_rating["Average"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_3'}).findAll('span')[2].find(text=True)
    traveller_rating["Poor"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_2'}).findAll('span')[2].find(text=True)
    traveller_rating["Terrible"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterRating_1'}).findAll('span')[2].find(text=True)
    traveller_rating["Families"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterSegment_Family'}).findAll('span')[0].find(text=True)
    traveller_rating["Couples"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterSegment_Couples'}).findAll('span')[0].find(text=True)
    traveller_rating["Solo"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterSegment_Solo'}).findAll('span')[0].find(text=True)
    traveller_rating["Business"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterSegment_Business'}).findAll('span')[0].find(text=True)
    traveller_rating["Friends"] = soup.find('label', {'for':'taplc_prodp13n_hr_sur_review_filter_controls_0_filterSegment_Friends'}).findAll('span')[0].find(text=True)

    return traveller_rating
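
The ten lookups in `get_traveller_rating` differ only in the label id suffix and in which `<span>` holds the count. One way to make that pattern explicit is a small lookup table, sketched below. `FIELDS`, `label_id` and `extract_counts` are illustrative names, and `soup` is assumed to be a parsed hotel page that supports the same `find`/`findAll` calls used in the cell above.

```python
LABEL_PREFIX = "taplc_prodp13n_hr_sur_review_filter_controls_0_"

# output key -> (label id suffix, index of the <span> holding the count)
FIELDS = {
    "Excellent": ("filterRating_5", 2),
    "Very Good": ("filterRating_4", 2),
    "Average": ("filterRating_3", 2),
    "Poor": ("filterRating_2", 2),
    "Terrible": ("filterRating_1", 2),
    "Families": ("filterSegment_Family", 0),
    "Couples": ("filterSegment_Couples", 0),
    "Solo": ("filterSegment_Solo", 0),
    "Business": ("filterSegment_Business", 0),
    "Friends": ("filterSegment_Friends", 0),
}


def label_id(suffix):
    """Full value of the label's 'for' attribute."""
    return LABEL_PREFIX + suffix


def extract_counts(soup):
    """Same lookups as get_traveller_rating, driven by FIELDS."""
    counts = {}
    for key, (suffix, span_index) in FIELDS.items():
        label = soup.find("label", {"for": label_id(suffix)})
        counts[key] = label.findAll("span")[span_index].find(text=True)
    return counts
```

If the page markup changes, only the table needs updating.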


The code below goes through each page of the hotel list, collects each hotel's link, and uses that link to fetch its traveller rating.

In [5]:
def parse_hotellist_page(html):
    """ Parse the html pages returned by get_hotellist_page().
            Return the next url page to scrape (a city can have
            more than one page of hotels) if there is, else exit
            the script.
    """

    soup = BeautifulSoup(html)
    hotel_boxes = soup.findAll('div', {'class' :'listing easyClear  p13n_imperfect '})
 
    for hotel_box in hotel_boxes:
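        hotelinfo = {}
        # Default value, so a failed lookup cannot reuse the previous hotel's ratings
        traveller_rating = {}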
        name = hotel_box.find('div', {'class' :'listing_title'}).find(text=True)
        try:
                hotel_url = hotel_box.find('div', {'class' :'listing_title'}).find('a',href=True)['href']
                traveller_rating = get_traveller_rating(name, hotel_url)
                rating = hotel_box.find('div', {'class' :'listing_rating'})
                reviews = rating.find('span', {'class' :'more'}).find(text=True)
                stars = hotel_box.find("img", {"class" : "sprite-ratings"})
        except Exception, e:
                print "No ratings for this hotel",name
                reviews = "N/A"
                stars = 'N/A'
        
        if stars != 'N/A':
                stars = stars['alt'].split()[0]

        hotelinfo["name"] = name
        hotelinfo["reviews"] = reviews
        hotelinfo["star_rating"] = stars
        hotelinfo["traveller_rating"] = traveller_rating
        hotel_details.append(hotelinfo)

    
    # # Get next URL page if exists, else exit
    div = soup.find("div", {"class" : "unified pagination standard_pagination"})
    # check if last page
    if div.find('span', {'class' : 'nav next ui_button disabled'}):
        print "We reached last page"
        sys.exit()
    # If it is not the last page, there must be a Next URL
    hrefs = div.findAll('a', href= True)
    for href in hrefs:
        if href.find(text = True) == 'Next':
            print "Next url is %s" % href['href']
            return href['href']
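
`parse_hotellist_page` ends the crawl by calling `sys.exit()`, which raises `SystemExit` inside the notebook. A possible alternative is sketched below, under the assumption that the parser returns `None` on the last page instead of exiting. `crawl`, `fetch_and_parse` and the page names are hypothetical stand-ins, not part of the code above.

```python
def crawl(first_url, fetch_and_parse):
    """Follow 'Next' links until a page reports no successor.

    `fetch_and_parse` stands in for get_hotellist_page followed by
    parse_hotellist_page: it takes a URL and returns the next URL,
    or None on the last page.
    """
    visited = []
    url = first_url
    while url is not None:
        visited.append(url)
        url = fetch_and_parse(url)
    return visited


# Stand-in for the site: each page maps to its successor.
fake_site = {
    "/Hotels-page1.html": "/Hotels-page2.html",
    "/Hotels-page2.html": "/Hotels-page3.html",
    "/Hotels-page3.html": None,
}

pages = crawl("/Hotels-page1.html", fake_site.get)
```

With this shape the loop ends normally, so later cells can run without the kernel reporting an exit.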


The code below follows each review link and parses the per-category ratings, which are written to a file. See review_file_1 for the resulting data.

In [6]:
# This function parses each review in Omni Parker and writes to a file
def parseReview(review_url):
    hotel_url = base_url + review_url
    time.sleep(2)
    headers = { 'User-agent': user_agent }
    response = requests.get(hotel_url, headers=headers)
    html = response.text.encode('utf-8')
    soup = BeautifulSoup(html)
    review_details = soup.find('div', {'class' :'innerBubble'})
    review_id = review_details.find('p',{ 'id' : True })['id']
    review_categories = review_details.findAll('li', {'class' :'recommend-answer'})
    for review_category in review_categories:
        category = review_category.find('div', {'class' :'recommend-description'}).find(text=True) 
        findvalue = review_category.find("img")
        value = findvalue['alt'].split()[0]
        review_line = review_id+":"+category+":"+value+"\n"
        review_file.write(review_line)

In [7]:
def getOmniParkerReviews():
    # Walk through every page of Omni Parker House reviews.
    # The review page URLs follow a pattern that this function exploits (see url2 below):
    # url2 = "" for page 1 of reviews, "or10-" for page 2, "or20-" for page 3, and so on.
    number_of_pages = 535 # the total number of pages of OmniParker review
    maxcount = number_of_pages*10 #max value of int part of url2
    url1 = "/Hotel_Review-g60745-d89599-Reviews-"
    url2 = ""
    url3 = "Omni_Parker_House-Boston_Massachusetts.html"
    count = 0
    headers = { 'User-agent': user_agent }
    while count < maxcount:
        hotel_url = base_url + url1 + url2 + url3
        response = requests.get(hotel_url, headers=headers)
        html = response.text.encode('utf-8')
        soup = BeautifulSoup(html)
        review_links = soup.findAll('div', {'class':'innerBubble'})
        for review in review_links:
            review_a = review.find('a',id=True)
            if review_a:
                print review_a['href']
                parseReview(review_a['href'])
        count += 10
        url2="or"+str(count)+"-"            
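
The offset pattern used by `getOmniParkerReviews` can be isolated in a small generator, which makes it easy to check the URLs before sending any requests. This is only a sketch: `review_page_urls` and `page_size` are illustrative names, while the URL pieces are the ones from the cell above.

```python
def review_page_urls(number_of_pages, page_size=10):
    """Yield the relative URL of each review page, first page first."""
    url1 = "/Hotel_Review-g60745-d89599-Reviews-"
    url3 = "Omni_Parker_House-Boston_Massachusetts.html"
    for page in range(number_of_pages):
        offset = page * page_size
        # Page 1 carries no offset segment; later pages insert "orN-".
        url2 = "" if offset == 0 else "or%d-" % offset
        yield url1 + url2 + url3


first_three = list(review_page_urls(3))
```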

In [8]:
#From now on we will call the functions defined above
current_dir = os.getcwd()
if not os.path.exists(os.path.join(current_dir, datadir)):
    os.makedirs(os.path.join(current_dir, datadir))
tourism_url = get_tourism_page(city, state)
city_url = get_city_page(tourism_url)


In [9]:
print city_url
hotel_details = []
c = 0
while(True):
    c +=1
    html = get_hotellist_page(city_url,c)
    city_url = parse_hotellist_page(html)

/Hotels-g60745-Boston_Massachusetts-Hotels.html
Next url is /Hotels-g60745-oa30-Boston_Massachusetts-Hotels.html#ACCOM_OVERVIEW
Next url is /Hotels-g60745-oa60-Boston_Massachusetts-Hotels.html#ACCOM_OVERVIEW
We reached last page


SystemExit: 

To exit: use 'exit', 'quit', or Ctrl-D.


Once the last listing page has been parsed, `sys.exit()` raises `SystemExit` and ends the while loop. Next, we call the getOmniParkerReviews() function to scrape the Omni Parker House review data.

In [None]:
review_file = open('review_file_1','w')
getOmniParkerReviews()
review_file.close()


The Omni Parker House review data is now in review_file_1, and the traveller rating data is in hotel_details.
We will now move all of this data into pandas dataframes and calculate the average score. We will also calculate the scores
for categories such as service and concatenate the dataframes so that we have the data needed for linear and logistic regression.

**Task 2 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating. For example, for the hotel above, the average rating is

$$ \text{AVG_SCORE} = \frac{1*31 + 2*33 + 3*98 + 4*504 + 5*1861}{2527}$$

Use the model to analyze the important factors that decide the $\text{AVG_SCORE}$.
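
As a quick check of the formula, the weighted mean can be computed directly from the five star counts. The function name `avg_score` is illustrative, and the counts are the ones from the example above.

```python
def avg_score(counts):
    """Weighted mean star rating; counts[i] is the number of (i+1)-star ratings."""
    total = sum(counts)
    weighted = sum(stars * n for stars, n in enumerate(counts, start=1))
    return weighted / float(total)


# Counts from the example above, 1-star through 5-star.
example = [31, 33, 98, 504, 1861]
score = avg_score(example)  # 11712 / 2527, roughly 4.63
```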

In [10]:
# Write all the data to a JSON file for future reference. It holds the traveller ratings and the number
# of reviews by families, friends, etc.
import json
#print hotel_details
print len(hotel_details)
jsondata = json.dumps(hotel_details)
print jsondata
with open("./hote_review.json","w") as h:
    h.write(jsondata)
# write the hotel_details into a json file


82
[{"reviews": "5,627 Reviews", "star_rating": "4", "name": "Omni Parker House", "traveller_rating": {"Poor": "292", "Solo": "(282)", "Very Good": "1,722", "Families": "(1,156)", "Business": "(1,322)", "Average": "689", "Terrible": "148", "Couples": "(1,753)", "Excellent": "2,495", "Friends": "(421)"}}, {"reviews": "1,674 Reviews", "star_rating": "4", "name": "Hyatt Regency Boston Harbor", "traveller_rating": {"Poor": "83", "Solo": "(132)", "Very Good": "541", "Families": "(355)", "Business": "(454)", "Average": "190", "Terrible": "41", "Couples": "(458)", "Excellent": "742", "Friends": "(84)"}}, {"reviews": "3,574 Reviews", "star_rating": "4.5", "name": "Seaport Boston Hotel", "traveller_rating": {"Poor": "46", "Solo": "(123)", "Very Good": "625", "Families": "(601)", "Business": "(1,568)", "Average": "106", "Terrible": "35", "Couples": "(770)", "Excellent": "2,630", "Friends": "(197)"}}, {"reviews": "3,580 Reviews", "star_rating": "4.5", "name": "Hotel Commonwealth", "traveller_rati

In [11]:
import pandas as pd
pd.DataFrame(hotel_details)
hotel_review_dataframe = pd.DataFrame(hotel_details)
#print hotel_review_dataframe
#print hotel_review_dataframe['traveller_rating'][0]

In [12]:
#From the traveller_rating calculate the average_rating,Number of Family,Business,Friends etc
average_rating = []
Families = []
Couples = []
Solo = []
Business = []
Friends = []
terriblelist = []
poorlist = []
averagelist = []
very_goodlist = []
excellentlist = []
for rating in hotel_review_dataframe['traveller_rating']:
    terrible = int(rating['Terrible'].replace(',',""))
    terriblelist.append(terrible)
    poor = int(rating['Poor'].replace(',',""))
    poorlist.append(poor)
    average = int(rating['Average'].replace(',',""))
    averagelist.append(average)
    very_good = int(rating['Very Good'].replace(',',""))
    very_goodlist.append(very_good)
    excellent = int(rating['Excellent'].replace(',',""))
    excellentlist.append(excellent)
    #average_rat = round(float((terrible*1+poor*2+average*3+very_good*4+excellent*5))/(poor+terrible+average+very_good+excellent)*2)/2.0
    average_rat = float((terrible*1+poor*2+average*3+very_good*4+excellent*5))/(poor+terrible+average+very_good+excellent)
    average_rating.append(average_rat)
    Families.append(int(str(rating['Families']).replace("(","").replace(")","").replace(",","")))
    Couples.append(int(str(rating['Couples']).replace("(","").replace(")","").replace(",","")))
    Solo.append(int(str(rating['Solo']).replace("(","").replace(")","").replace(",","")))
    Business.append(int(str(rating['Business']).replace("(","").replace(")","").replace(",","")))
    Friends.append(int(str(rating['Friends']).replace("(","").replace(")","").replace(",","")))
  
hotel_review_dataframe['average_rating'] = average_rating
hotel_review_dataframe['Families'] = Families
hotel_review_dataframe['Couples'] = Couples
hotel_review_dataframe['Solo'] = Solo
hotel_review_dataframe['Business'] = Business
hotel_review_dataframe['Friends'] = Friends
hotel_review_dataframe['terrible'] = terriblelist
hotel_review_dataframe['poor'] = poorlist
hotel_review_dataframe['average'] = averagelist
hotel_review_dataframe['very_good'] = very_goodlist 
hotel_review_dataframe['Excellent'] = excellentlist 
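
The scraped counts arrive as strings such as `"2,495"` or `"(1,156)"`, and the cell above cleans them with repeated `replace` calls. A single helper could do the same job. This is a sketch with an illustrative name, `parse_count`; the sample values are the Omni Parker House entries from the JSON output above.

```python
def parse_count(text):
    """Turn a scraped count such as '(1,156)' or '2,495' into an int."""
    return int(str(text).strip().strip("()").replace(",", ""))


# Omni Parker House values as they appear in the scraped data.
raw = {"Excellent": "2,495", "Terrible": "148", "Families": "(1,156)", "Friends": "(421)"}
clean = {key: parse_count(value) for key, value in raw.items()}
```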

In [13]:
# Find the hotel for which location data is missing. We will drop it later on
def getMissingHotels(reviews):
    unique_hotels = reviews['name'].unique()
    missinghotel = ""
    for hotel in unique_hotels:
        missing_hotel = reviews[reviews['name']==hotel]
        count = len(reviews[reviews['name']==hotel])
        if count < 40:
            if len(missing_hotel['category'].unique()) < 8 and 'Location' not in missing_hotel['category'].unique() :
                missinghotel = hotel
    return missinghotel

In [14]:
def average(attribute):
    attribute = attribute.reset_index()
    total_score = 1*attribute['count'][0]+2*attribute['count'][1]+3*attribute['count'][2]+4*attribute['count'][3]+5*attribute['count'][4]
    total_reviews = attribute['count'][0]+attribute['count'][1]+attribute['count'][2]+attribute['count'][3]+attribute['count'][4]
    return float(total_score)/total_reviews

In [15]:
def get_average_rating(hoteldatalist):
    category_average = []
    category_average.append(average(hoteldatalist[hoteldatalist['category'] == 'Service']))
    category_average.append(average(hoteldatalist[hoteldatalist['category'] == 'Cleanliness']))
    category_average.append(average(hoteldatalist[hoteldatalist['category'] == 'Value']))
    category_average.append(average(hoteldatalist[hoteldatalist['category'] == 'Sleep Quality']))
    category_average.append(average(hoteldatalist[hoteldatalist['category'] == 'Rooms']))
    category_average.append(average(hoteldatalist[hoteldatalist['category'] == 'Location']))
    return category_average
    

In [16]:
# Now we read the data file and call the functions above to calculate the average rating
# for each category of every hotel
hotelreviews = []
with open('rating-summary.dat') as reviewdata:
    for line in reviewdata:
        reviewline = line.split(':')
        hotelreviews.append(reviewline)          

reviews = pd.DataFrame(hotelreviews,columns=['name','category','stars','count']) 
reviews
reviews['count']= reviews['count'].apply(lambda x:int(x.replace('\n',"")))
reviews['stars'] = reviews['stars'].apply(lambda x:int(x))

    


In [17]:
# Find the hotel without location and drop that hotel from reviews
missing_hotel = getMissingHotels(reviews) 
reviews= reviews[reviews['name'] != missing_hotel]
print missing_hotel
hotel_review_dataframe = hotel_review_dataframe[hotel_review_dataframe['name'] != missing_hotel]
print len(reviews)
print len(hotel_review_dataframe)

Element Boston Seaport
2990
81


The code below computes the per-category average ratings for Omni Parker House from the scraped reviews.

In [18]:
def getAverageRatingForOmni(omnicategory):
    count1 = len(omnicategory[omnicategory['count'] == 1])
    count2 = len(omnicategory[omnicategory['count'] == 2])
    count3 = len(omnicategory[omnicategory['count'] == 3])
    count4 = len(omnicategory[omnicategory['count'] == 4])
    count5 = len(omnicategory[omnicategory['count'] == 5])
    return float(1*count1+2*count2+3*count3+4*count4+5*count5)/(count1+count2+count3+count4+count5)
    

In [20]:
# we will calculate the average for Omni Parker House
omniparker = []
with open('review_file_1') as omnifile:
    for line in omnifile:
        reviewline = line.split(':')
        omniparker.append(reviewline)
        
omnidf= pd.DataFrame(omniparker,columns=['review_id','category','count'])
omnidf['count'] = omnidf['count'].apply(lambda x:int(x.replace('\n','')))
omnilist = []
omnilist.append(getAverageRatingForOmni(omnidf[omnidf['category'] == 'Service']))
omnilist.append(getAverageRatingForOmni(omnidf[omnidf['category'] == 'Cleanliness']))
omnilist.append(getAverageRatingForOmni(omnidf[omnidf['category'] == 'Value']))
omnilist.append(getAverageRatingForOmni(omnidf[omnidf['category'] == 'Sleep Quality']))
omnilist.append(getAverageRatingForOmni(omnidf[omnidf['category'] == 'Rooms']))
omnilist.append(getAverageRatingForOmni(omnidf[omnidf['category'] == 'Location']))
omnilist.append("Omni Parker House")

In [21]:
ratingfromreviewlist = []
# list of average rating from reviews
hotels = reviews['name'].unique()
for hotel in hotels:
    categorylist = []
    categorylist = get_average_rating(reviews[reviews['name']==hotel])
    categorylist.append(hotel)
    ratingfromreviewlist.append(categorylist)

ratingfromreviewlist.append(omnilist)
hotel_review_dffromreviews = pd.DataFrame(ratingfromreviewlist,columns=['service','cleanliness','value','sleep_quality','rooms','location','name'])
hotel_review_dffromreviews

Unnamed: 0,service,cleanliness,value,sleep_quality,rooms,location,name
0,4.800151,4.824275,4.265269,4.710259,4.701031,4.818182,Boston Harbor Hotel
1,3.891148,3.815451,3.736917,3.852670,3.427338,4.443252,Boston Hotel Buckminster
2,4.122093,4.386243,3.677065,4.209266,4.040532,4.719094,Boston Marriott Copley Place
3,4.377953,4.496833,3.758840,4.268608,4.186155,4.859583,Boston Marriott Long Wharf
4,4.692500,4.683146,4.321788,4.552743,4.476773,4.844471,Courtyard by Marriott Boston Copley Square
5,4.050476,4.193998,3.765540,4.075000,3.966025,3.610609,DoubleTree Club by Hilton Hotel Boston Bayside
6,4.149183,4.325472,3.921615,4.256731,4.194239,4.212605,Embassy Suites by Hilton Boston - at Logan Air...
7,4.722359,4.846154,4.615894,4.562278,4.654362,4.150000,enVision Hotel Boston
8,4.471007,4.550926,3.889253,4.444444,4.266212,4.848338,"Fairmont Copley Plaza, Boston"
9,4.064000,4.444824,4.089888,4.249608,4.109283,4.819632,Harborside Inn


In [22]:
hotel_aggregate = pd.DataFrame.copy(hotel_review_dataframe)

hotel_aggregate['name'] = hotel_aggregate['name'].apply(lambda x:str(x).replace('/',''))

hotel_aggregate = hotel_aggregate.sort_values(by='name',ascending=True).reset_index()
hotel_aggregate

Unnamed: 0,index,name,reviews,star_rating,traveller_rating,average_rating,Families,Couples,Solo,Business,Friends,terrible,poor,average,very_good,Excellent
0,73,Aloft Boston Seaport,21 Reviews,4,"{u'Poor': 1, u'Solo': (2), u'Very Good': 8, u'...",3.944444,2,2,2,11,1,1,1,2,8,6
1,81,Americas Best Value Inn,19 Reviews,2.5,"{u'Poor': 1, u'Solo': (1), u'Very Good': 2, u'...",2.812500,4,2,1,2,3,4,1,7,2,2
2,64,Ames Boston Hotel,"1,049 Reviews",4.5,"{u'Poor': 43, u'Solo': (70), u'Very Good': 274...",4.289791,103,436,70,177,75,14,43,94,274,486
3,57,BEST WESTERN PLUS Roundhouse Suites,932 Reviews,3.5,"{u'Poor': 78, u'Solo': (20), u'Very Good': 314...",3.755051,286,181,20,140,109,43,78,133,314,224
4,65,BEST WESTERN University Hotel Boston-Brighton,494 Reviews,3.5,"{u'Poor': 42, u'Solo': (31), u'Very Good': 171...",3.718681,212,75,31,71,34,26,42,91,171,125
5,50,"Battery Wharf Hotel, Boston Waterfront","1,024 Reviews",4.5,"{u'Poor': 25, u'Solo': (50), u'Very Good': 214...",4.492244,201,443,50,154,71,10,25,81,214,637
6,67,Beacon Hill Hotel and Bistro,179 Reviews,4,"{u'Poor': 13, u'Solo': (16), u'Very Good': 54,...",4.119760,18,87,16,16,15,5,13,17,54,78
7,9,Boston Harbor Hotel,"1,527 Reviews",4.5,"{u'Poor': 17, u'Solo': (64), u'Very Good': 212...",4.707952,327,525,64,294,78,14,17,48,212,1130
8,16,Boston Hotel Buckminster,"1,002 Reviews",3.5,"{u'Poor': 72, u'Solo': (56), u'Very Good': 351...",3.595972,188,295,56,105,96,77,72,155,351,189
9,11,Boston Marriott Copley Place,"2,506 Reviews",4,"{u'Poor': 110, u'Solo': (100), u'Very Good': 1...",3.991115,474,483,100,842,147,43,110,373,1023,702


In [23]:
hotel_review_dffromreviews= hotel_review_dffromreviews.sort_values(by='name',ascending=True).reset_index()
hotel_review_dffromreviews



Unnamed: 0,index,service,cleanliness,value,sleep_quality,rooms,location,name
0,59,3.909091,4.166667,4.000000,4.500000,3.800000,4.250000,Aloft Boston Seaport
1,60,2.428571,2.769231,2.857143,3.000000,2.714286,2.636364,Americas Best Value Inn
2,61,4.379888,4.633113,4.027451,4.257824,4.413695,4.760900,Ames Boston Hotel
3,30,4.103750,4.185028,3.969231,3.929245,4.077259,3.261905,BEST WESTERN PLUS Roundhouse Suites
4,63,4.103687,3.873418,3.762148,3.985075,3.672775,4.052219,BEST WESTERN University Hotel Boston-Brighton
5,29,4.502227,4.729730,4.045872,4.579580,4.563265,4.595687,"Battery Wharf Hotel, Boston Waterfront"
6,62,4.238095,4.384615,3.731707,3.956522,3.868421,4.803571,Beacon Hill Hotel and Bistro
7,0,4.800151,4.824275,4.265269,4.710259,4.701031,4.818182,Boston Harbor Hotel
8,1,3.891148,3.815451,3.736917,3.852670,3.427338,4.443252,Boston Hotel Buckminster
9,2,4.122093,4.386243,3.677065,4.209266,4.040532,4.719094,Boston Marriott Copley Place


In [24]:
df = pd.concat([hotel_aggregate,hotel_review_dffromreviews],axis=1)
print len(df)
df = df.drop(['index','name','reviews','traveller_rating','star_rating','terrible','poor','average','very_good','Excellent'],axis=1)
df

81


Unnamed: 0,average_rating,Families,Couples,Solo,Business,Friends,service,cleanliness,value,sleep_quality,rooms,location
0,3.944444,2,2,2,11,1,3.909091,4.166667,4.000000,4.500000,3.800000,4.250000
1,2.812500,4,2,1,2,3,2.428571,2.769231,2.857143,3.000000,2.714286,2.636364
2,4.289791,103,436,70,177,75,4.379888,4.633113,4.027451,4.257824,4.413695,4.760900
3,3.755051,286,181,20,140,109,4.103750,4.185028,3.969231,3.929245,4.077259,3.261905
4,3.718681,212,75,31,71,34,4.103687,3.873418,3.762148,3.985075,3.672775,4.052219
5,4.492244,201,443,50,154,71,4.502227,4.729730,4.045872,4.579580,4.563265,4.595687
6,4.119760,18,87,16,16,15,4.238095,4.384615,3.731707,3.956522,3.868421,4.803571
7,4.707952,327,525,64,294,78,4.800151,4.824275,4.265269,4.710259,4.701031,4.818182
8,3.595972,188,295,56,105,96,3.891148,3.815451,3.736917,3.852670,3.427338,4.443252
9,3.991115,474,483,100,842,147,4.122093,4.386243,3.677065,4.209266,4.040532,4.719094
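
`pd.concat(..., axis=1)` pairs rows by position, so the combined table is only correct if both dataframes hold the same hotels in the same order. Joining on the hotel name removes that dependence. The sketch below shows the idea without any library; in pandas the corresponding operation would be `DataFrame.merge` with `on='name'`. `join_on_name` is an illustrative helper, and the rows are a few values taken from the two tables above.

```python
aggregate = [
    {"name": "Aloft Boston Seaport", "average_rating": 3.944444},
    {"name": "Ames Boston Hotel", "average_rating": 4.289791},
    {"name": "Boston Harbor Hotel", "average_rating": 4.707952},
]
# Deliberately in a different order to show that position does not matter.
from_reviews = [
    {"name": "Boston Harbor Hotel", "service": 4.800151},
    {"name": "Aloft Boston Seaport", "service": 3.909091},
    {"name": "Ames Boston Hotel", "service": 4.379888},
]


def join_on_name(left, right):
    """Inner join of two lists of row dicts on the 'name' key."""
    right_by_name = {row["name"]: row for row in right}
    joined = []
    for row in left:
        match = right_by_name.get(row["name"])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


rows = join_on_name(aggregate, from_reviews)
```

A hotel that appears in only one of the inputs is dropped by this join rather than being paired with a neighbouring row.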


In [25]:
#from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std

import statsmodels.formula.api as smf
import pandas

import seaborn as sns
%matplotlib inline

X = df[['service','cleanliness','value','sleep_quality','rooms','location','Families','Couples','Solo','Business','Friends']]
y = df[['average_rating']]
type(X)
type(y)
X.shape
y.shape

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:         average_rating   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.886e+04
Date:                Wed, 30 Mar 2016   Prob (F-statistic):          7.25e-117
Time:                        00:00:00   Log-Likelihood:                 93.915
No. Observations:                  81   AIC:                            -165.8
Df Residuals:                      70   BIC:                            -139.5
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
service           0.2029      0.084      2.414

The linear regression model has been fitted above. From these results we can identify which factors significantly
affect the average score.
Judging by a large coefficient, a low p-value and a high t-statistic, the significant factors are:
    1. rooms 
    2. location
    3. service

The R-squared value is 1.000, which taken at face value means the model explains all of the variation in the data. A value this high might also indicate that the model has overfitted. A better model could be obtained by ignoring the data with lower
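
One likely contributor to an R-squared of 1.000 is that `X`, as passed to `sm.OLS`, contains no constant column, and statsmodels does not add an intercept on its own. For a model without a constant, statsmodels reports the uncentered R-squared, which compares the residuals with the raw sum of squares of $y$ rather than with its variation about the mean:

$$ R^2_{\text{uncentered}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i^2}, \qquad R^2_{\text{centered}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

Because the average ratings are all well above zero (about 2.8 to 4.7 in the rows shown), $\sum_i y_i^2$ is much larger than $\sum_i (y_i - \bar{y})^2$, which pushes the uncentered value toward 1. Adding a constant with `sm.add_constant(X)` before fitting would give the centered figure, which is the more usual measure of fit.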


**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
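
The labelling rule can be written as a small function and checked on two hotels from the tables above. `is_excellent` is an illustrative name; the counts are those of Battery Wharf Hotel and Ames Boston Hotel.

```python
def is_excellent(terrible, poor, average, very_good, excellent, threshold=0.60):
    """Return 1 if the share of 5-star ratings exceeds the threshold, else 0."""
    total = terrible + poor + average + very_good + excellent
    return 1 if excellent / float(total) > threshold else 0


battery_wharf = is_excellent(10, 25, 81, 214, 637)  # 637 of 967 ratings are 5 stars
ames = is_excellent(14, 43, 94, 274, 486)           # 486 of 911 ratings are 5 stars
```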

In [26]:
logistic_df1 = pd.concat([hotel_aggregate,hotel_review_dffromreviews],axis=1)
logistic_df1 = logistic_df1.drop(['index','name','traveller_rating'],axis=1)
excellent_or_not = []
for i in range(len(logistic_df1)):
    if (float(logistic_df1['Excellent'][i])/(logistic_df1['Excellent'][i]+logistic_df1['very_good'][i]+logistic_df1['average'][i]+logistic_df1['poor'][i]+logistic_df1['terrible'][i])) > 0.60:
        excellent_or_not.append(1)
    else:
        excellent_or_not.append(0)
logistic_df1['ExcellentOrNot'] = excellent_or_not
logistic_df1

Unnamed: 0,reviews,star_rating,average_rating,Families,Couples,Solo,Business,Friends,terrible,poor,average,very_good,Excellent,service,cleanliness,value,sleep_quality,rooms,location,ExcellentOrNot
0,21 Reviews,4,3.944444,2,2,2,11,1,1,1,2,8,6,3.909091,4.166667,4.000000,4.500000,3.800000,4.250000,0
1,19 Reviews,2.5,2.812500,4,2,1,2,3,4,1,7,2,2,2.428571,2.769231,2.857143,3.000000,2.714286,2.636364,0
2,"1,049 Reviews",4.5,4.289791,103,436,70,177,75,14,43,94,274,486,4.379888,4.633113,4.027451,4.257824,4.413695,4.760900,0
3,932 Reviews,3.5,3.755051,286,181,20,140,109,43,78,133,314,224,4.103750,4.185028,3.969231,3.929245,4.077259,3.261905,0
4,494 Reviews,3.5,3.718681,212,75,31,71,34,26,42,91,171,125,4.103687,3.873418,3.762148,3.985075,3.672775,4.052219,0
5,"1,024 Reviews",4.5,4.492244,201,443,50,154,71,10,25,81,214,637,4.502227,4.729730,4.045872,4.579580,4.563265,4.595687,1
6,179 Reviews,4,4.119760,18,87,16,16,15,5,13,17,54,78,4.238095,4.384615,3.731707,3.956522,3.868421,4.803571,0
7,"1,527 Reviews",4.5,4.707952,327,525,64,294,78,14,17,48,212,1130,4.800151,4.824275,4.265269,4.710259,4.701031,4.818182,1
8,"1,002 Reviews",3.5,3.595972,188,295,56,105,96,77,72,155,351,189,3.891148,3.815451,3.736917,3.852670,3.427338,4.443252,0
9,"2,506 Reviews",4,3.991115,474,483,100,842,147,43,110,373,1023,702,4.122093,4.386243,3.677065,4.209266,4.040532,4.719094,0


In [27]:
import statsmodels.api as sm
Y1 = logistic_df1[['ExcellentOrNot']]
X1 = logistic_df1[['service','cleanliness','value','sleep_quality','rooms','location','Families','Couples','Solo','Business','Friends']]
logit = sm.Logit(Y1,X1)
 
# fit the model
result = logit.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.301461
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:         ExcellentOrNot   No. Observations:                   81
Model:                          Logit   Df Residuals:                       70
Method:                           MLE   Df Model:                           10
Date:                Wed, 30 Mar 2016   Pseudo R-squ.:                  0.4947
Time:                        00:00:38   Log-Likelihood:                -24.418
converged:                       True   LL-Null:                       -48.328
                                        LLR p-value:                 6.695e-07
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
service           8.8695      4.548      1.950      0.051        -0.044    17.783
cleanliness     -26.

From the output of the logistic regression we see that the following factors are significant predictors of whether a 
hotel is excellent or not:
    1. Sleep Quality (coef = 21.8 and p value 0.001)
    2. service (coef = -24 p value = 0.001)
  
The low p-values indicate that these coefficients are unlikely to be zero.
The results suggest that hotels with higher sleep-quality ratings are more likely to be excellent. The negative coefficient of service might indicate that hotels with lower ratings still have good service but do badly on other aspects, such as location, which is positively correlated.

The coefficients of the other parameters are very low and their p-values are high. This implies that there is no strong relationship between these variables and our binary output, whether the hotel is excellent or not. The type of traveller reviewing the hotel does not affect whether the hotel is excellent.
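
Logit coefficients are on the log-odds scale, which makes their size hard to read directly. A small sketch of the usual conversion follows: exponentiating a coefficient gives the factor by which the odds of being excellent change for a one-unit increase in that predictor, with the other predictors held fixed. Since the category scores only span a few rating points, the helper also accepts a smaller step. The name `odds_multiplier` and the step of 0.1 are illustrative choices, not something taken from the analysis above.

```python
import math


def odds_multiplier(coef, step=1.0):
    """Factor applied to the odds when the predictor rises by `step`."""
    return math.exp(coef * step)


# Coefficient for sleep quality as quoted in the text above.
sleep_quality_coef = 21.8
per_tenth_point = odds_multiplier(sleep_quality_coef, step=0.1)
```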

-------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()