# Regression Phase 2: Simple Linear Regression

In this phase we use hotel data from Expedia to predict hotel prices with simple (one-feature) linear regression. We will:

* Use SArray and SFrame functions to compute important summary statistics 
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input/feature given the output
* Compare two different models for predicting hotel prices

# Import GraphLab

In [1]:
import graphlab

# Load Hotel data

In [2]:
hotels = graphlab.SFrame.read_csv('LA_0421.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,float,float,str,str,str,str,int,str,str,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
hotels.head()

name,zone,star,rating,rates,checkin,checkout
Sportsmen's Lodge,Universal Studios,3.5,3.7,1175,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1839,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1839,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1839,04/21/2017,04/22/2017

room,size,price,bed,guests,address
Studio Suite,400,221,1 king bed,2,Ventura Blvd
"Deluxe Room, 2 Queen Beds",490,164,2 queen beds,4,Casino Dr.
"Deluxe Room, 1 King Bed",455,164,1 king bed,2,Casino Dr.
"Classic Suite, 1 King Bed",654,239,1 king bed,3,Casino Dr.
"Classic Suite, Accessible",654,239,1 king bed,3,Casino Dr.
Suite (The Bike),904,314,1 king and 1 sofa bed,4,Casino Dr.
"Suite, Accessible (The Bike) ...",904,314,1 king and 1 sofa bed,4,Casino Dr.
"Standard Room, 1 Queen Bed, Accessible (Roll-in ...",200,106,1 queen bed,2,W Century Blvd
"Standard Room, 2 Queen Beds ...",200,106,2 queen beds,4,W Century Blvd
"Standard Room, 1 King Bed",200,106,1 king bed,2,W Century Blvd

link
https://www.expedia.com /Universal-Studios- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...


# Split data into training and testing


In [4]:
# convert the price column from string to float
hotels['price'] = hotels['price'].astype(float)
train_data,test_data = hotels.random_split(.8,seed=0)

# Useful SFrame summary functions

To use the closed-form solution and take advantage of GraphLab's built-in functions, we first review a few important operations. In particular:

* Computing the sum of an SArray
* Computing the arithmetic average (mean) of an SArray
* Multiplying SArrays by constants
* Multiplying SArrays by other SArrays

In [5]:
# Let's compute the mean of the hotel prices in LA in two different ways.
prices = hotels['price'] # extract the price column of the hotels SFrame -- this is now an SArray

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of listings (rows):
sum_prices = prices.sum()
num_hotels = prices.size() # when prices is an SArray, .size() returns its length
avg_price_1 = sum_prices / num_hotels
avg_price_2 = prices.mean() # the same average via the built-in .mean() function
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)
num_hotels

average price via method 1: 254.275391956
average price via method 2: 254.275391956


1467

In [6]:
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5 * prices
# Let's compute the sum of squares of price. We can also multiply two SArrays of the same length elementwise with *
prices_squared = prices * prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squares and we want to add them up.
print "the sum of price squared is: " + str(sum_prices_squared)

the sum of price squared is: 128621388.0
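
For readers without GraphLab installed, the same operations can be sketched with plain Python lists. This is only an illustration: the lists stand in for SArrays (an assumption), and the five prices are taken from the preview table above rather than from the full data set.

```python
# Plain Python lists stand in for SArrays here (an assumption for illustration).
prices = [221.0, 164.0, 164.0, 239.0, 106.0]  # a few prices from the preview above

total = sum(prices)                      # plays the role of prices.sum()
count = len(prices)                      # plays the role of prices.size()
mean_price = total / count               # plays the role of prices.mean()
half_prices = [0.5 * p for p in prices]  # plays the role of 0.5 * prices
squares = [p * p for p in prices]        # plays the role of prices * prices
sum_of_squares = sum(squares)            # plays the role of (prices * prices).sum()
```

The SArray versions do the same arithmetic, but without an explicit Python loop.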


# Build a generic simple linear regression function


Armed with these SArray functions, we can use the closed-form solution from lecture to compute the slope and intercept for a simple linear regression on observations stored as SArrays: input_feature, output.
The following function computes the simple linear regression slope and intercept:

In [7]:
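def simple_linear_regression(input_feature, output):
    # sums, count and means of the input and the output
    # (caveat: under Python 2, int / int truncates; use float(N) if both sums can be integers)
    input_sum = input_feature.sum()
    output_sum = output.sum()
    N = input_feature.size()
    input_mean = input_sum/N
    output_mean = output_sum/N
    # sum of the elementwise products x_i * y_i
    in_out_prod = input_feature * output
    in_out_prod_sum = in_out_prod.sum()
    # (sum of x) * (sum of y) / N
    prod_sum = output_sum * input_sum
    prod_mean = prod_sum/N
    # sum of the squared inputs x_i^2
    sqr_test = input_feature * input_feature
    sqr_test_sum = sqr_test.sum()
    # (sum of x)^2 / N
    sqr_sum = input_sum * input_sum
    sqr_mean = sqr_sum/N
    # closed-form slope and intercept
    slope = (in_out_prod_sum - prod_mean)/(sqr_test_sum - sqr_mean)
    intercept = output_mean - (input_mean * slope)
    return (intercept, slope)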
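
For reference, the quantities computed in the function correspond to one common way of writing the closed-form least-squares solution (assuming this matches the form given in lecture). Here $x_i$ is the input feature, $y_i$ the output, $N$ the number of observations, and $\bar{x}$, $\bar{y}$ the means:

$$
\hat{w}_1 = \frac{\sum_i x_i y_i - \frac{1}{N}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{N}\left(\sum_i x_i\right)^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}
$$

The numerator is `in_out_prod_sum - prod_mean` and the denominator is `sqr_test_sum - sqr_mean` in the code.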

We can test that our function works by passing it data where we know the answer. In particular, we can generate a feature and then put the output exactly on a line: output = 3 + 5*input_feature. We then know that the intercept should be 3 and the slope should be 5.


In [8]:
test_feature = graphlab.SArray(range(5))
test_output = graphlab.SArray(3 + 5*test_feature)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print "Intercept: " + str(test_intercept)
print "Slope: " + str(test_slope)

Intercept: 3
Slope: 5
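
The same check can be reproduced without GraphLab. The sketch below is a plain-list version of the closed-form computation; the helper name `simple_linear_regression_lists` is hypothetical, and the count is converted to a float so that no division is truncated.

```python
def simple_linear_regression_lists(xs, ys):
    """Closed-form slope and intercept for plain Python lists (hypothetical helper)."""
    n = float(len(xs))  # float so that division is never truncated
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    sum_xx = sum(x * x for x in xs)              # sum of x_i^2
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

# Points that lie exactly on the line y = 3 + 5x
xs = list(range(5))
ys = [3 + 5 * x for x in xs]
intercept, slope = simple_linear_regression_lists(xs, ys)
```

Because the points lie exactly on a line, the recovered intercept and slope match the values used to generate them.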


Now that we know it works, let's build a regression model for predicting price based on hotel room size. Remember that we train on train_data!

In [9]:
size_intercept, size_slope = simple_linear_regression(train_data['size'], train_data['price'])

print "Intercept: " + str(size_intercept)
print "Slope: " + str(size_slope)

Intercept: 185.261339188
Slope: 0.182229076588


# Predicting Values


Now that we have the model parameters (intercept and slope), we can make predictions. With SArrays it is easy to multiply by a constant and add a constant, so the function below returns the predicted output given the input_feature, slope and intercept:

In [10]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + slope * input_feature
    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make one. We want to find the estimated price for a hotel room of 400 square feet according to the model we estimated above.

In [11]:
my_chosen_hotel_size = 400
estimated_price = get_regression_predictions(my_chosen_hotel_size, size_intercept, size_slope)
print "The estimated price for a hotel with %d squarefeet is $%.2f" % (my_chosen_hotel_size, estimated_price)

The estimated price for a hotel with 400 squarefeet is $258.15


# Residual Sum of Squares (RSS)


Now that we have a model and can make predictions, let's evaluate the model using the Residual Sum of Squares (RSS): the sum of the squared differences between the actual outputs and the predicted outputs.

The following function computes the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [12]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predictions = get_regression_predictions(input_feature, intercept, slope)
    # then compute the residuals
    residuals = output - predictions
    # square the residuals and add them up
    RSS = (residuals * residuals).sum()
    return RSS

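We can also report the root mean squared error (RMSE), the square root of the RSS divided by the number of observations, which is expressed in the same units as the price:

In [18]:
from math import sqrt
def get_rmse(input_feature, output, intercept, slope):
    # First get the predictions
    predictions = get_regression_predictions(input_feature, intercept, slope)
    # then compute the residuals
    residuals = output - predictions
    # average the squared residuals and take the square root
    # (.sum() returns a plain number, so use math.sqrt rather than .apply)
    RMSE = sqrt((residuals * residuals).sum() / float(len(input_feature)))
    return RMSE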
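
As a quick sanity check of how the two metrics relate, the sketch below computes both for a small example. Plain Python lists stand in for SArrays, the helper name `rss_and_rmse` is hypothetical, and the numbers are made up for illustration (they are not hotel data).

```python
from math import sqrt

def rss_and_rmse(xs, ys, intercept, slope):
    """Return (RSS, RMSE) for a fitted line, using plain lists (hypothetical helper)."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    rss = sum(r * r for r in residuals)        # sum of squared residuals
    rmse = sqrt(rss / float(len(residuals)))   # root of the mean squared residual
    return rss, rmse

xs = [0, 1, 2, 3]
ys = [1.0, 3.0, 7.0, 5.0]  # made-up outputs; the last two are off the line by +2 and -2
rss, rmse = rss_and_rmse(xs, ys, intercept=1.0, slope=2.0)
```

RSS grows with the number of observations, while RMSE is normalized by it, so RMSE is easier to compare across data sets of different sizes.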

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [19]:
print get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope) 
# should be 0.0

0


Now use our function to calculate the RSS on training data from the model calculated above.

In [20]:
rss_prices_on_size = get_residual_sum_of_squares(train_data['size'], train_data['price'], size_intercept, size_slope)
print "The RSS of predicting Prices based on Hotel's room size is : " + str(rss_prices_on_size)

The RSS of predicting Prices based on Hotel's room size is : 24624301.535


In [None]:
rmse_prices_on_size = get_rmse(train_data['size'], train_data['price'], size_intercept, size_slope)
print "The RMSE of predicting Prices based on Hotel's room size is : " + str(rmse_prices_on_size)

# Predict the size of the room given price


What if we want to predict the size given the price? Since the model is the equation y = a + b*x, we can solve it for x: x = (y - a) / b. So given the intercept (a), the slope (b) and the price (y), we can compute the estimated size (x).

Create the inverse regression estimate, i.e. predict the input_feature given the output.

In [58]:
def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature and use the result to compute the inverse predictions:
    estimated_feature = (output - intercept) / slope
    return estimated_feature

Now that we have a function to compute the size of a room given its price from our simple regression model, let's see how big we might expect a room that costs $500 a night to be.

In [59]:
my_hotel_price = 500
estimated_size = inverse_regression_predictions(my_hotel_price, size_intercept, size_slope)
print "The estimated size for a room worth $%.2f is %d sqft" % (my_hotel_price, estimated_size)

The estimated size for a room worth $500.00 is 1727 sqft
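
One way to convince yourself that the inverse function is consistent with the forward one is a round trip: predict a price from a size, then invert that price. The sketch below uses plain numbers and hypothetical helper names (`predict`, `inverse_predict`); the intercept and slope are the values printed for the size model above.

```python
def predict(x, intercept, slope):
    """Forward prediction: output = intercept + slope * input."""
    return intercept + slope * x

def inverse_predict(y, intercept, slope):
    """Inverse prediction: input = (output - intercept) / slope."""
    return (y - intercept) / slope

# Values printed above for the model that predicts price from size
intercept, slope = 185.261339188, 0.182229076588

size = 400
price = predict(size, intercept, slope)               # forward: size -> price
recovered = inverse_predict(price, intercept, slope)  # inverse: price -> size
size_for_500 = inverse_predict(500, intercept, slope)
```

Note that this only inverts the fitted line. Fitting a separate regression of size on price would generally give a different line.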


# New Model: estimate prices from the hotel star rating


We have made one model for predicting hotel prices using room size, but there are many other features in the hotels SFrame. We can also use the simple linear regression function to estimate the regression parameters for predicting prices based on the star rating of hotels. Use the training data!

In [60]:
# Estimate the slope and intercept for predicting 'price' based on 'star'
star_intercept, star_slope = simple_linear_regression(train_data['star'], train_data['price'])
print "Intercept: " + str(star_intercept)
print "Slope: " + str(star_slope)

Intercept: -137.582534468
Slope: 118.030847711


# Test our simple Linear Regression Algorithm


Now we have two models for predicting the price of a hotel. How do we know which one is better? Calculate the RSS on the TEST data (this data set wasn't involved in learning the model). 

In [61]:
# Compute RSS when using star on TEST data:
rss_prices_on_star = get_residual_sum_of_squares(test_data['star'], test_data['price'], star_intercept, star_slope)
print 'RSS when using [star] on TEST data: ' + str(rss_prices_on_star)

RSS when using [star] on TEST data: 3859485.05388


In [62]:
# Compute RSS when using size on TEST data:
rss_prices_on_size = get_residual_sum_of_squares(test_data['size'], test_data['price'], size_intercept, size_slope)
print 'RSS when using [size] on TEST data: ' + str(rss_prices_on_size)

RSS when using [size] on TEST data: 6224453.63319


As we can see, the model using star has a much lower RSS on the test data (about 3.86 million versus 6.22 million) than the model using room size, so it is the better of the two. This matches common sense: the star rating is often a better indicator of the quality, and therefore the price, of a hotel room.

# Seasonality of a feature


A feature with a seasonal pattern that repeats every 12 time steps could be modeled like this:

$$y_i = w_0 + w_1 t_i + w_2 \sin(2\pi t_i / 12 - \Phi) + \epsilon_i$$

where $\Phi$ is an unknown phase shift.
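
Because the unknown phase shift sits inside the sine, the model is not linear in that parameter. A standard trigonometric identity, sin(a - b) = sin(a)cos(b) - cos(a)sin(b), rewrites the seasonal term as a weighted sum of a sine feature and a cosine feature, which is linear in the two new weights. The sketch below checks this numerically; the function names and the values of `w2` and `phase` are made up for illustration.

```python
from math import sin, cos, pi

def seasonal_term(t, w2, phase):
    """Original form: w2 * sin(2*pi*t/12 - phase)."""
    return w2 * sin(2 * pi * t / 12.0 - phase)

def seasonal_term_linear(t, a, b):
    """Equivalent form that is linear in the weights a and b."""
    return a * sin(2 * pi * t / 12.0) + b * cos(2 * pi * t / 12.0)

w2, phase = 3.0, 0.7   # made-up values for illustration
a = w2 * cos(phase)    # weight on the sine feature
b = -w2 * sin(phase)   # weight on the cosine feature

# Largest difference between the two forms over 24 time steps
max_gap = max(abs(seasonal_term(t, w2, phase) - seasonal_term_linear(t, a, b))
              for t in range(24))
```

With this rewriting, the sine and cosine of 2*pi*t/12 can be treated as two ordinary features, at the cost of moving from simple to multiple regression.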
