# Regression phase 4: Multiple Regression (gradient descent)


In this phase we will use graphlab along with numpy to solve for the regression weights with gradient descent. To-dos:

* Add a constant column of 1's to a graphlab SFrame to account for the intercept
* Convert an SFrame into a Numpy array
* Write a predict_output() function using Numpy
* Write a numpy function to compute the derivative of the regression cost with respect to a single weight
* Write a gradient descent function to compute the regression weights given an initial weight vector, step size and tolerance
* Use the gradient descent function to estimate regression weights for multiple features

# Import graphlab create

In [1]:
import graphlab


# Load in hotels data


In [2]:
hotels = graphlab.SFrame('LA_0421.csv')
# convert the string to float
hotels['price'] = hotels['price'].astype(float)
hotels['rates'] = hotels['rates'].astype(float)

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1492738505.log



------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,float,float,str,str,str,str,int,str,str,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
hotels.head()

name,zone,star,rating,rates,checkin,checkout
Sportsmen's Lodge,Universal Studios,3.5,3.7,1.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
The Bicycle Hotel & Casino ...,Bell Gardens,4.0,4.5,296.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017
Motel 6 Los Angeles LAX,Los Angeles Intl. (LAX),2.0,3.0,1.0,04/21/2017,04/22/2017

room,size,price,bed,guests,address
Studio Suite,400,221.0,1 king bed,2,Ventura Blvd
"Deluxe Room, 2 Queen Beds",490,164.0,2 queen beds,4,Casino Dr.
"Deluxe Room, 1 King Bed",455,164.0,1 king bed,2,Casino Dr.
"Classic Suite, 1 King Bed",654,239.0,1 king bed,3,Casino Dr.
"Classic Suite, Accessible",654,239.0,1 king bed,3,Casino Dr.
Suite (The Bike),904,314.0,1 king and 1 sofa bed,4,Casino Dr.
"Suite, Accessible (The Bike) ...",904,314.0,1 king and 1 sofa bed,4,Casino Dr.
"Standard Room, 1 Queen Bed, Accessible (Roll-in ...",200,106.0,1 queen bed,2,W Century Blvd
"Standard Room, 2 Queen Beds ...",200,106.0,2 queen beds,4,W Century Blvd
"Standard Room, 1 King Bed",200,106.0,1 king bed,2,W Century Blvd

link
https://www.expedia.com /Universal-Studios- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Bell-Gardens-Hotels- ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...
https://www.expedia.com /Los-Angeles-Intl-Hot ...


# Convert to Numpy Array


Although SFrames offer a number of benefits to users (especially when using Big Data and built-in graphlab functions), understanding the details of how algorithms are implemented requires a library that allows for direct (and optimized) matrix operations. Numpy is a Python library for working with matrices (or any multi-dimensional "array").

The predicted value given the weights and the features is just the dot product between the feature and weight vector. Similarly, if we put all of the features row-by-row in a matrix then the predicted value for all the observations can be computed by right multiplying the "feature matrix" by the "weight vector".

First we need to take the SFrame of our data and convert it into a 2D numpy array (also called a matrix). One way is graphlab's built-in .to_dataframe(), which converts the SFrame into a pandas (another Python library) dataframe, followed by pandas' .as_matrix() to convert the dataframe into a numpy matrix. The function below instead calls .to_numpy() directly on the SFrame and on the output SArray.

In [4]:
import numpy as np 

Now we will write a function that accepts an SFrame, a list of feature names (e.g. ['size', 'star']) and a target feature (e.g. 'price') and returns two things:

* A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
* A numpy array containing the values of the output
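
To see what these two return values look like without graphlab installed, here is a minimal sketch that builds them from a plain dict of lists. The helper name build_feature_matrix and the dict layout are our own assumptions for illustration; the numbers are the first three rows of the hotels table above.

```python
import numpy as np

def build_feature_matrix(columns, features, output):
    """Hypothetical stand-in for get_numpy_data that works on a dict of lists."""
    n_rows = len(columns[output])
    constant = np.ones(n_rows)  # column of 1's that plays the role of the intercept
    feature_columns = [np.asarray(columns[name], dtype=float) for name in features]
    # stack the constant and the requested features side by side as columns
    feature_matrix = np.column_stack([constant] + feature_columns)
    output_array = np.asarray(columns[output], dtype=float)
    return feature_matrix, output_array

# first three rows of the hotels data shown earlier
toy = {'size': [400, 490, 455], 'star': [3.5, 4.0, 4.0], 'price': [221.0, 164.0, 164.0]}
toy_matrix, toy_output = build_feature_matrix(toy, ['size', 'star'], 'price')
print(toy_matrix)
print(toy_output)
```

Each row of toy_matrix is one observation and the first column is always 1, so the first weight acts as the intercept.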



## Test: let's assume the features are ['size', 'star']


In [5]:
hotels['constant'] = 1
features = ['size', 'star']
features = ['constant'] + features
output = ['price']
hotels[output].print_rows(num_rows=20)

+-------+
| price |
+-------+
| 221.0 |
| 164.0 |
| 164.0 |
| 239.0 |
| 239.0 |
| 314.0 |
| 314.0 |
| 106.0 |
| 106.0 |
| 106.0 |
| 149.0 |
| 279.0 |
| 338.0 |
| 309.0 |
| 368.0 |
| 379.0 |
| 438.0 |
| 250.0 |
| 250.0 |
| 270.0 |
+-------+
[1467 rows x 1 columns]



In [6]:
def get_numpy_data(data_sframe, features, output):
    # add a constant column of 1's for the intercept (note: this modifies data_sframe in place)
    data_sframe['constant'] = 1
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features
    # select the columns of data_sframe given by the features list into the SFrame features_sframe (now including constant):
    features_sframe = data_sframe[features]
    # convert features_sframe into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    output_sarray = data_sframe[output]
    # convert the SArray into a numpy array
    output_array = output_sarray.to_numpy()
    return (feature_matrix, output_array)

For testing let's use the 'size' feature and a constant as our features and price as our output:

In [7]:
(example_features, example_output) = get_numpy_data(hotels, ['size'], 'price') # the [] around 'size' makes it a list
print example_features[0,:] # this accesses the first row of the data; the ':' indicates 'all columns'
print example_output[0] # and the corresponding output

[  1 400]
221.0


# Predicting output given regression weights


Suppose we had the weights [1.0, 1.0] and the features [1.0, 400.0] and we wanted to compute the predicted output 1.0 x 1.0 + 1.0 x 400.0 = 401.0. This is the dot product between these two arrays. If they're numpy arrays we can use np.dot() to compute this:

In [8]:
my_weights = np.array([1., 1.]) # the example weights
my_features = example_features[0,] # we'll use the first data point
predicted_value = np.dot(my_features, my_weights)
print predicted_value

401.0


np.dot() also works when dealing with a matrix and a vector. The predictions for all the observations are just the RIGHT (as in weights on the right) dot product between the features matrix and the weights vector. With this in mind, finish the following predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights:

In [9]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array
    # create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix, weights)
    return(predictions)

In [10]:
# test the code
test_predictions = predict_output(example_features, my_weights)
print test_predictions[0] # should be 401.0
print test_predictions[1] # should be 491.0
# make sure the length is equal to the total observations
print len(test_predictions)

401.0
491.0
1467


# Computing the Derivative


We are now going to move to computing the derivative of the regression cost function. The cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)^2

where we have k features and a constant. By the chain rule, the derivative with respect to weight w[i] is:

2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i] *[feature_i] + ... + w[k]*[feature_k] - output)* [feature_i]

The term inside the parentheses is just the error (difference between prediction and output). So we can re-write this as:

2*error*[feature_i]

That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant, this is just twice the sum of the errors!

Recall that twice the sum of the elementwise products of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.
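
The same result can be written compactly in symbols. The notation is our own shorthand and is not used in the code: x_{j,i} is the value of feature i for data point j (with x_{j,0} = 1 for the constant), y_j is the observed output, e_j is the error for data point j, and X is the feature matrix with one row per data point.

```latex
\begin{aligned}
\mathrm{RSS}(w) &= \sum_{j=1}^{N} \Big( \sum_{i=0}^{k} w_i\, x_{j,i} - y_j \Big)^2 \\
\frac{\partial\, \mathrm{RSS}}{\partial w_i} &= 2 \sum_{j=1}^{N} e_j\, x_{j,i},
\qquad e_j = \sum_{i'=0}^{k} w_{i'}\, x_{j,i'} - y_j \\
\nabla \mathrm{RSS}(w) &= 2\, X^{\top} (X w - y)
\end{aligned}
```

The last line stacks the derivatives for all the weights into a single vector, which in numpy terms would be 2 * np.dot(feature_matrix.T, errors).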

With this in mind, complete the following derivative function, which computes the derivative of the cost with respect to one weight, given the values of the corresponding feature (over all data points) and the errors (over all data points).

In [11]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2 * np.dot(errors, feature)
    return(derivative)

In [12]:
# test the code
(example_features, example_output) = get_numpy_data(hotels, ['size'], 'price') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_output(example_features, my_weights) 
# just like SFrames 2 numpy arrays can be elementwise subtracted with '-': 
errors = test_predictions - example_output # prediction errors in this case is just the -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)
print derivative
print -np.sum(example_output)*2 # should be the same as derivative

-746044.0
-746044.0
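
Another way to gain confidence in feature_derivative is to compare it against a finite-difference approximation of the cost. The sketch below does this on three rows of the hotels data; the rss helper and the step h are our own choices for illustration, and feature_derivative is repeated so the block runs on its own.

```python
import numpy as np

def rss(feature_matrix, output, weights):
    # cost: sum of squared differences between predictions and observed outputs
    errors = np.dot(feature_matrix, weights) - output
    return np.dot(errors, errors)

def feature_derivative(errors, feature):
    # same formula as above: twice the dot product of errors and feature
    return 2 * np.dot(errors, feature)

X = np.array([[1., 400.], [1., 490.], [1., 455.]])  # constant and size
y = np.array([221., 164., 164.])                    # price
w = np.array([1., 1.])

errors = np.dot(X, w) - y
analytic = feature_derivative(errors, X[:, 1])      # derivative w.r.t. the 'size' weight

# central difference: nudge only the 'size' weight up and down by h
h = 1e-4
w_plus = w.copy()
w_plus[1] += h
w_minus = w.copy()
w_minus[1] -= h
numeric = (rss(X, y, w_plus) - rss(X, y, w_minus)) / (2 * h)

print('analytic: %.3f  numeric: %.3f' % (analytic, numeric))
```

Because the cost is quadratic in the weights, the central difference should agree with the analytic value up to floating-point rounding.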


# Gradient Descent


Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. The gradient is the direction of increase and therefore the negative gradient is the direction of decrease and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

Complete the gradient descent function below using the derivative function above. For each step in the gradient descent we update the weight for each feature before checking our stopping criterion.

In [13]:
from math import sqrt # The magnitude / length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)

In [14]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False 
    weights = np.array(initial_weights, dtype=float) # make sure it's a float numpy array so in-place updates are not truncated
    while not converged:
        # compute the predictions based on feature_matrix and weights using our predict_output() function
        predictions = predict_output(feature_matrix, weights)
        # compute the errors as predictions - output
        errors = predictions - output
        gradient_sum_squares = 0 # initialize the gradient sum of squares
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): # loop over each weight
            # feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative  = feature_derivative(errors, feature_matrix[:,i])
            # add the squared value of the derivative to the gradient sum of squares (for assessing convergence)
            gradient_sum_squares += derivative ** 2
            # subtract the step size times the derivative from the current weight
            weights[i] = weights[i] - step_size * derivative
        # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = sqrt(gradient_sum_squares)
        # print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)

A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves a product of an error and a feature, the gradient itself will be very large because the features are large (size) and the output is large (prices). So while we might expect "tolerance" to be small, small is only relative to the size of the features.

For similar reasons the step size will be much smaller than we might expect, because the gradient has such large values.
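
To make the scale effect concrete, the sketch below compares the gradient magnitude at all-zero weights for raw sizes against standardized sizes (mean 0, standard deviation 1). Standardizing is not something this notebook does; it is shown only as an assumption-laden illustration of why the raw step size has to be so small. The six (size, price) pairs are taken from the hotels table above.

```python
import numpy as np

def gradient(feature_matrix, output, weights):
    # full gradient of the squared-error cost: 2 * X^T (Xw - y)
    errors = np.dot(feature_matrix, weights) - output
    return 2 * np.dot(feature_matrix.T, errors)

sizes = np.array([400., 490., 455., 654., 904., 200.])
prices = np.array([221., 164., 164., 239., 314., 106.])
ones = np.ones(len(sizes))

raw = np.column_stack([ones, sizes])
# standardized copy of the size column (illustrative only, not used in the notebook)
scaled = np.column_stack([ones, (sizes - sizes.mean()) / sizes.std()])

w0 = np.zeros(2)
raw_norm = np.linalg.norm(gradient(raw, prices, w0))
scaled_norm = np.linalg.norm(gradient(scaled, prices, w0))
print('raw gradient magnitude:    %.1f' % raw_norm)
print('scaled gradient magnitude: %.1f' % scaled_norm)
```

With the raw sizes the gradient magnitude is hundreds of times larger, even on six rows. Note that weights fitted on standardized features are expressed in standardized units, so they are not directly comparable to the weights estimated below.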

# Running the Gradient Descent as Simple Regression


First let's split the data into training and test data.

In [15]:
train_data,test_data = hotels.random_split(.8,seed=0)

Although the gradient descent is designed for multiple regression, since the constant is now a feature we can use the gradient descent function to estimate the parameters in the simple regression on size. The following cell sets up the feature_matrix, output, initial weights and step size for the first model:


In [16]:
# let's test out the gradient descent
simple_features = ['size']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([186., 1.])
step_size = 7e-12
tolerance = 2.4e3

Next run gradient descent with the above parameters.


In [17]:
final_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
final_weights

array([  1.85998463e+02,   1.80660156e-01])

Use the newly estimated weights and our predict_output() function to compute the predictions on all the TEST data.

In [18]:
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)


Now compute predictions using test_simple_feature_matrix and the weights from above.


In [19]:
test_predictions = predict_output(test_simple_feature_matrix,final_weights)
print len(test_output)
print test_output

305
[ 338.  379.    1.  169.  239.  209.  209.  288.  497.  115.  179.  260.
  219.  119.  169.  110.  382.  239.  249.  269.  209.  249.  189.  219.
  219.  209.  249.  159.   90.  169.  229.  259.  289.   85.  359.   80.
   80.  113.  430.  223.  325.  329.  100.  459.  489.  259.  259.  279.
  145.  155.   86.  106.  129.  129.  189.  110.  110.  145.  248.  145.
  145.  120.  199.  219.  209.  274.  299.  299.  120.  106.  389.  229.
  269.  309.  105.  146.  135.   92.   96.  102.  122.  335.   86.   86.
  191.  159.  159.  211.  110.  195.  169.   99.  109.  175.  259.  130.
  159.  249.  369.  105.  154.    2.  149.  201.  211.  369.  309.  155.
  225.  209.  219.  174.  184.  189.  234.  199.  655.  154.  179.  169.
   62.   72.  323.  444.  664.  209.  209.  229.  139.  319.  150.  199.
   99.  109.  159.  229.  199.   75.   75.  569.  202.  225.   83.  147.
  405.  540.  309.  169.  159.  169.  279.  569.  569.  229.  259.  289.
  359.  377.  449.  109.  585.  585.  765.    2

In [20]:
test_predictions

array([ 255.3719625 ,  260.06912655,  374.60766532,  251.93941954,
        282.10966556,  256.45592343,  256.45592343,  267.29553278,
        302.34360301,  217.61398994,  249.2295172 ,  242.90641175,
        258.26252499,  244.71301331,  240.91915004,  232.97010318,
        339.5595951 ,  240.19650941,  240.19650941,  240.19650941,
        257.90120468,  232.60878287,  238.02858754,  238.02858754,
        238.02858754,  290.05871241,  271.99269683,  239.29320863,
        226.64699773,  249.2295172 ,  257.90120468,  275.42523979,
        312.46057173,  242.36443128,  249.2295172 ,  226.64699773,
        226.64699773,  222.13049383,  283.5549468 ,  256.81724375,
        249.2295172 ,  303.42756394,  216.34936885,  302.70492332,
        302.70492332,  229.35690007,  229.35690007,  255.55262266,
        240.37716957,  240.37716957,  226.64699773,  250.13281798,
        258.26252499,  258.26252499,  258.26252499,  230.80218131,
        240.19650941,  238.57056801,  229.35690007,  240.19650

Now that we have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. 


In [21]:
errors = test_predictions - test_output
sqr_errors = errors * errors
test_sum_sqr_errors = sqr_errors.sum()
test_sum_sqr_errors

6228820.0272891456
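
Since the RSS will be needed again for the second model, the calculation can be wrapped in a small function. The name compute_rss is our own (hypothetical) choice; the toy numbers are the first two predictions and prices from the earlier predict_output test.

```python
import numpy as np

def compute_rss(predictions, output):
    """Hypothetical helper: residual sum of squares between two numpy arrays."""
    errors = predictions - output
    return np.dot(errors, errors)  # same as (errors * errors).sum()

toy_predictions = np.array([401., 491.])
toy_output = np.array([221., 164.])
print(compute_rss(toy_predictions, toy_output))  # 180^2 + 327^2 = 139329.0
```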

# Running a multiple regression


Now we will use more than one actual feature. Use the following parameters to produce the weights for a second model:

In [22]:
model_features = ['star', 'size', 'rating']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-130., 1., 1., 1.])
step_size = 4e-12
tolerance = 9e6

When running the multiple regression, it is hard to choose good values for the initial weights, the step_size and the tolerance.
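
One way to judge a particular choice is to compare the gradient descent result against the closed-form least-squares solution, which needs no step size or tolerance. The sketch below is not part of the original notebook: it uses made-up toy data (y = 1 + 2x), a vectorized variant of the descent loop that checks the gradient before each update, and a max_iterations cap that we added so a poor step size cannot loop forever.

```python
import numpy as np

def gradient_descent(feature_matrix, output, initial_weights, step_size,
                     tolerance, max_iterations=100000):
    """Vectorized sketch; max_iterations is an added safety cap (an assumption)."""
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = np.dot(feature_matrix, weights) - output
        gradient = 2 * np.dot(feature_matrix.T, errors)
        if np.linalg.norm(gradient) < tolerance:
            break  # gradient magnitude is below the tolerance: converged
        weights = weights - step_size * gradient
    return weights

# made-up toy data: constant column plus one small-valued feature, y = 1 + 2x
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([1., 3., 5., 7.])

gd_weights = gradient_descent(X, y, [0., 0.], step_size=0.01, tolerance=1e-6)
# closed-form baseline: solve the normal equations (X^T X) w = X^T y
exact_weights = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(gd_weights)
print(exact_weights)
```

If the two weight vectors disagree noticeably on the real data, that suggests the tolerance is too loose or the step size too small for the descent to have reached the optimum.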