# Multiple Regression (gradient descent)

Here regression weights are optimized with gradient descent.

In [4]:
import graphlab
import numpy as np

# Load in house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [3]:
sales = graphlab.SFrame('kc_house_data.gl/')

[INFO] GraphLab Create v1.8.3 started. Logging: /tmp/graphlab_server_1463781390.log


# Convert to Numpy Array

The following function accepts an SFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and a target feature (e.g. 'price'), and returns two things:
* A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
* A numpy array containing the values of the output

In [5]:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # bias (intercept) column
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features 
    # select the columns of data_sframe given by the features list into the SFrame features_sframe (now including
    # the constant):
    features_sframe = data_sframe[features]

    # the following line will convert features_sframe into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    output_sarray = data_sframe[output]

    # the following will convert the SArray into a numpy array
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)
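
For readers without GraphLab, the same construction can be sketched with plain NumPy. This is only a minimal sketch, not the notebook's method: the function name `get_numpy_data_from_dict` and the toy `houses` dictionary are hypothetical, and the values are illustrative rather than taken from the King County data.

```python
import numpy as np

def get_numpy_data_from_dict(data, features, output):
    """Hypothetical stand-in for get_numpy_data() that takes a dict of lists."""
    n_rows = len(data[output])
    # a column of ones in front of the features provides the intercept
    columns = [np.ones(n_rows)] + [np.asarray(data[f], dtype=float) for f in features]
    feature_matrix = np.column_stack(columns)
    output_array = np.asarray(data[output], dtype=float)
    return feature_matrix, output_array

# Toy data (illustrative values only)
houses = {'sqft_living': [1000, 2000, 3000], 'price': [200000., 400000., 650000.]}
toy_features, toy_output = get_numpy_data_from_dict(houses, ['sqft_living'], 'price')
print(toy_features.shape)
```

Unlike `get_numpy_data()`, this sketch does not add a `'constant'` column to its input, so the caller's data is left unchanged.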

Test: Use the 'sqft_living' feature and a constant as our features and price as our output:

In [9]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') 
print example_features[0] # same as [0, :]
print example_output[0] 

[  1.00000000e+00   1.18000000e+03]
221900.0


# Predicting output given regression weights

Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0] and we wanted to compute the predicted output 1.0\*1.0 + 1.0\*1180.0 = 1181.0. This is the dot product between these two arrays. If they're numpy arrays we can use np.dot() to compute this:

In [10]:
my_weights = np.array([1., 1.]) # the example weights
my_features = example_features[0] 
print 'features', my_features
predicted_value = np.dot(my_features, my_weights)
print 'predicted val', predicted_value

features [  1.00000000e+00   1.18000000e+03]
predicted val 1181.0


Predict output: the dot product of the feature_matrix and the coefficients (weights).

In [11]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding
    # numpy array; create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix, weights)
    
    return(predictions)

In [12]:
test_predictions = predict_output(example_features, my_weights)
print test_predictions[0] 
print test_predictions[1] 

1181.0
2571.0


# Computing the Derivative

Compute the derivative of the regression cost function (RSS). 

Since the derivative of a sum is the sum of the derivatives, compute the derivative for a single data point and then sum over data points. For a single data point the squared residual is:

(w[0]\*[CONSTANT] + w[1]\*[feature_1] + ... + w[i] \*[feature_i] + ... +  w[k]\*[feature_k] - output)^2

Where we have k features and a constant (bias). The derivative with respect to weight w[i] by the chain rule is:

2\*(w[0]\*[CONSTANT] + w[1]\*[feature_1] + ... + w[i] \*[feature_i] + ... +  w[k]\*[feature_k] - output)\* [feature_i]

The term inside the parentheses is just the error (the difference between prediction and output). So we can rewrite this as:

2\*error\*[feature_i]

Since twice the sum of the product of two vectors is just twice the dot product of the two vectors, the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors. 
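
The same derivation can be written compactly in summation form. The notation here is introduced only for this sketch: $x_{ji}$ is the value of feature $i$ for data point $j$ (with $x_{j0} = 1$ for the constant), $y_j$ is the output, $N$ is the number of data points, and $\mathbf{e}$ is the vector of errors (prediction minus output).

```latex
\mathrm{RSS}(\mathbf{w}) = \sum_{j=1}^{N} \left( \sum_{m=0}^{k} w_m\, x_{jm} - y_j \right)^{2},
\qquad
\frac{\partial\, \mathrm{RSS}}{\partial w_i}
  = 2 \sum_{j=1}^{N} \left( \sum_{m=0}^{k} w_m\, x_{jm} - y_j \right) x_{ji}
  = 2\, \mathbf{x}_{:,i} \cdot \mathbf{e}
```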

In [13]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2 * np.dot(errors, feature)
    return(derivative)

Test

In [14]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') 
my_weights = np.array([0., 0.]) 
test_predictions = predict_output(example_features, my_weights) 
errors = test_predictions - example_output # with all-zero weights the prediction errors are just -example_output
feature = example_features[:, 0] # compute the derivative with respect to 'constant'
derivative = feature_derivative(errors, feature)
print derivative
print -np.sum(example_output)*2 # should be the same as derivative

-23345850022.0
-23345850022.0
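
Another way to gain confidence in an analytic derivative is to compare it against a finite-difference estimate. The sketch below does this on a small synthetic problem; the helper `rss` and all the `toy_*` names and values are hypothetical and not part of the notebook.

```python
import numpy as np

def rss(feature_matrix, output, weights):
    # hypothetical helper: residual sum of squares for the given weights
    errors = np.dot(feature_matrix, weights) - output
    return np.dot(errors, errors)

# Small synthetic problem (illustrative values only)
toy_X = np.array([[1., 2.], [1., 4.], [1., 7.]])
toy_y = np.array([3., 5., 10.])
toy_w = np.array([0.5, 1.0])

# Analytic derivative for weight 1: twice the dot product of errors and feature column
toy_errors = np.dot(toy_X, toy_w) - toy_y
analytic = 2 * np.dot(toy_errors, toy_X[:, 1])

# Central finite difference with respect to the same weight
h = 1e-6
w_plus, w_minus = toy_w.copy(), toy_w.copy()
w_plus[1] += h
w_minus[1] -= h
numeric = (rss(toy_X, toy_y, w_plus) - rss(toy_X, toy_y, w_minus)) / (2 * h)

print(analytic, numeric)
```

The two values should agree to several decimal places; a large gap would point to a sign or indexing mistake in the derivative.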


# Gradient Descent

The function below performs gradient descent. Given initial weights, it repeatedly updates them by moving in the negative gradient direction. (Recall: the gradient is the direction of *increase*, so here we *descend* to *minimize* the cost function.)

The amount by which we move in the negative gradient *direction* is called the 'step size' (eta). Stop when we are 'sufficiently close' to the optimum as defined by requiring that the magnitude of the gradient vector be smaller than a specified 'tolerance'.
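
The update and stopping rule described above can be sketched in vectorized form. This is a minimal sketch on synthetic data, not the notebook's implementation: the name `gradient_descent_sketch` is hypothetical, and the `max_iterations` guard is an added assumption so the loop always terminates.

```python
import numpy as np

def gradient_descent_sketch(feature_matrix, output, initial_weights,
                            step_size, tolerance, max_iterations=100000):
    # copy into a float array so in-place updates do not touch the caller's data
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = np.dot(feature_matrix, weights) - output
        # full gradient of RSS: one derivative per weight
        gradient = 2 * np.dot(feature_matrix.T, errors)
        # stop once the gradient magnitude falls below the tolerance
        if np.sqrt(np.dot(gradient, gradient)) < tolerance:
            break
        # move against the gradient, scaled by the step size
        weights -= step_size * gradient
    return weights

# Synthetic data lying exactly on the line y = 1 + 2*x (illustrative only)
toy_X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
toy_y = np.array([1., 3., 5., 7.])
fitted = gradient_descent_sketch(toy_X, toy_y, [0., 0.], step_size=0.01, tolerance=1e-6)
print(fitted)
```

On this toy problem the fitted weights should approach an intercept of 1 and a slope of 2. The step size that works depends on the scale of the features, which is why the house-price runs in this notebook use far smaller values.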

# Running the Gradient Descent as Simple Regression

Split the data into training and test data.

In [15]:
train_data,test_data = sales.random_split(0.8, seed = 0)

In [19]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance, verbose = True):
    converged = False 
    weights = np.array(initial_weights, dtype=float) # make sure it's a float numpy array (in-place updates need floats)
    while not converged:
        # compute the predictions based on feature_matrix and weights using your predict_output() function
        preds = predict_output(feature_matrix, weights)

        # compute the errors as predictions - output
        error = preds - output

        gradient_sum_squares = 0 # initialize the gradient sum of squares
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): # loop over each weight
            # feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            deriv_wi = feature_derivative(error, feature_matrix[:, i])

            # add the squared value of the derivative to the gradient sum of squares (for assessing convergence)
            gradient_sum_squares += (deriv_wi ** 2)
            
            # subtract the step size times the derivative from the current weight
            weights[i] -= (step_size * deriv_wi)
            if verbose:
                print 'weights:', weights
            
        # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = np.sqrt(gradient_sum_squares)
        
        if gradient_magnitude < tolerance:
            converged = True
            
    return(weights)

In [20]:
# Test
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

In [21]:
mod1_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
mod1_weights

weights: [ -4.69998578e+04   1.00000000e+00]
weights: [-46999.85779866    354.86068685]
weights: [-46999.894732      354.86068685]
weights: [-46999.894732      262.96853711]
weights: [-46999.88514683    262.96853711]
weights: [-46999.88514683    286.83150776]
weights: [-46999.88764179    286.83150776]
weights: [-46999.88764179    280.6346632 ]
weights: [-46999.88699974    280.6346632 ]
weights: [-46999.88699974    282.24388793]
weights: [-46999.88717231    282.24388793]
weights: [-46999.88717231    281.82599715]
weights: [-46999.88713334    281.82599715]
weights: [-46999.88713334    281.93451693]
weights: [-46999.88714931    281.93451693]
weights: [-46999.88714931    281.90633602]
weights: [-46999.88715101    281.90633602]
weights: [-46999.88715101    281.91365417]
weights: [-46999.88715641    281.91365417]
weights: [-46999.88715641    281.91175376]
weights: [-46999.88716085    281.91175376]
weights: [-46999.88716085    281.91224727]
weights: [-46999.88716555    281.91224727]
weights: 

array([-46999.88716555,    281.91211912])

Use the newly estimated weights and predict_output() to compute the predictions on all the TEST data.

In [23]:
(simple_feature_matrix_test, output_test) = get_numpy_data(test_data, simple_features, my_output)

In [24]:
mod1_preds = predict_output(simple_feature_matrix_test, mod1_weights)
mod1_preds

array([ 356134.44317093,  784640.86422788,  435069.83652353, ...,
        663418.65300782,  604217.10799338,  240550.4743332 ])

Compute the RSS on the test data set.

In [26]:
def get_rss(preds, outcome):
    # compute the residuals/errors
    resids = outcome - preds
    
    # square the residuals and add them up
    RSS = sum(resids ** 2)

    return(RSS)    

mod1_rss = get_rss(mod1_preds, output_test)
mod1_rss

275400047593155.69
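
The built-in `sum` iterates over the array element by element in Python; a NumPy dot product computes the same quantity in one call. A minimal sketch, using the hypothetical name `get_rss_vectorized` and toy values that are not from the dataset:

```python
import numpy as np

def get_rss_vectorized(preds, outcome):
    # residuals between observed outcomes and predictions
    resids = np.asarray(outcome, dtype=float) - np.asarray(preds, dtype=float)
    # the dot product of a vector with itself is its sum of squares
    return np.dot(resids, resids)

toy_preds = np.array([2.0, 4.0, 6.0])
toy_outcome = np.array([1.0, 4.0, 8.0])
print(get_rss_vectorized(toy_preds, toy_outcome))  # 1 + 0 + 4 = 5.0
```

Floating-point summation order differs between the two approaches, so on large arrays the results may differ in the last few digits.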

# Running a multiple regression

In [27]:
# sqft_living15 is the average square footage of the 15 nearest neighbors.
model_features = ['sqft_living', 'sqft_living15'] 
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

Use the above parameters to estimate the model weights.

In [28]:
mod2_weights = regression_gradient_descent(
    feature_matrix, output, initial_weights, step_size, tolerance, verbose = False)
mod2_weights

array([ -9.99999688e+04,   2.45072603e+02,   6.52795277e+01])

Compute the predictions on the TEST data.

In [29]:
(feature_matrix_test, output_test) = get_numpy_data(test_data, model_features, my_output)

mod2_preds = predict_output(feature_matrix_test, mod2_weights)
mod2_preds

array([ 366651.41203656,  762662.39786164,  386312.09499712, ...,
        682087.39928241,  585579.27865729,  216559.20396617])

Predicted price for first house in test_data:

In [30]:
mod2_preds[0]

366651.41203655908

Actual price for the first house in test_data:

In [31]:
output_test[0]

310000.0

Compute the RSS for model 2 on TEST data.

In [32]:
mod2_rss = get_rss(mod2_preds, output_test)
mod2_rss

270263446465243.91

In [175]:
print "mod1 RSS:", mod1_rss
print "mod2 RSS:", mod2_rss

mod1 RSS: 2.75400047593e+14
mod2 RSS: 2.70263446465e+14
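
To put the two test RSS values in perspective, the relative reduction can be computed directly from the numbers printed above. The variable names in this sketch are hypothetical; the two values are copied from the output.

```python
# RSS values copied from the printed output above
mod1_rss_reported = 2.75400047593e+14
mod2_rss_reported = 2.70263446465e+14

# fraction by which model 2 lowers the test RSS relative to model 1
relative_reduction = (mod1_rss_reported - mod2_rss_reported) / mod1_rss_reported
print('relative RSS reduction: %.4f' % relative_reduction)
```

Model 2 has the lower RSS on the test data, by a little under 2 percent.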
