# Multiple Regression (gradient descent)

**Goal**: use Turi Create along with numpy to solve for the regression weights with gradient descent.

We will:
* Add a constant column of 1's to a Turi Create SFrame to account for the intercept
* Convert an SFrame into a Numpy array
* Write a predict_output() function using Numpy
* Write a numpy function to compute the derivative of the RSS cost with respect to a single regression weight
* Write a gradient descent function to compute the regression weights given an initial weight vector, step size and tolerance.
* Use the gradient descent function to estimate regression weights for multiple features

In [5]:
import turicreate


Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [6]:
sales = turicreate.SFrame('../ml-foundations/data/home_data.sframe/')

If we want to do any "feature engineering" like creating new features or adjusting existing ones we should do this directly using the SFrames as seen in the other Week 2 notebook. For this notebook, however, we will work with the existing features.

# Converting to Numpy Arrays

Although SFrames offer a number of benefits to users (especially when using Big Data and built-in Turi Create functions), understanding the details of how algorithms are implemented requires a library that allows for direct (and optimized) matrix operations. Numpy is a Python library for working with matrices (or any multi-dimensional "array").

Recall that the predicted value given the weights and the features is just the dot product between the feature and weight vector. Similarly, if we put all of the features row-by-row in a matrix then the predicted value for *all* the observations can be computed by right multiplying the "feature matrix" by the "weight vector". 

First we need to take the SFrame of our data and convert it into a 2D numpy array (also called a matrix). To do this we use the SFrame's built-in .to_numpy() method, which returns the selected columns as a numpy array. An SArray (a single column) offers the same method, which we use for the output.

In [7]:
import numpy as np # note this allows us to refer to numpy as np instead 

Now we will write a function that accepts an SFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and a target feature (e.g. 'price') and returns two things:
* A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
* A numpy array containing the values of the output

With this in mind, complete the following function (where there's an empty line you should write a line of code that does what the comment above indicates)

In [8]:
def get_numpy_data(data_sframe, features, target):
    data_sframe['constant'] = 1 # Add a column of 1s to multiply by the intercept (w0)
    # add the column 'constant' to the front of the features list so that we can extract it along with the others:
    features = ['constant'] + features # concatenation of two lists
    
    # select the columns of data_sframe given by the features list into features_sframe (now including constant):
    features_sframe = data_sframe[features]
    # convert features_sframe into a numpy matrix:
    feature_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the output to the SArray output_sarray
    output_sarray = data_sframe[target]
    # convert the SArray into a numpy array
    output_array = output_sarray.to_numpy()
    return(feature_matrix, output_array)

For testing let's use the 'sqft_living' feature and a constant as our features and price as our output:

In [10]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') # the [] around 'sqft_living' makes it a list
print(example_features[0,:])  # this accesses the first row of the data the ':' indicates 'all columns'
print(example_output[0]) # and the corresponding output
# Result is two columns: 'constant' and 'sqft_living', for features. The target price as example_output.

[1.00e+00 1.18e+03]
221900.0
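
The same construction can be sketched with Numpy alone, which may help if Turi Create is not at hand. This is a minimal illustration, not the notebook's implementation: `toy_data` and `toy_numpy_data` are hypothetical names, and the values are a tiny made-up stand-in for the SFrame columns.

```python
import numpy as np

# Hypothetical toy data standing in for SFrame columns (values are illustrative only).
toy_data = {
    'sqft_living': [1180.0, 2570.0, 770.0],
    'bedrooms': [3.0, 3.0, 2.0],
    'price': [221900.0, 538000.0, 180000.0],
}

def toy_numpy_data(data, features, target):
    """Build a feature matrix with a leading column of 1s, plus the output array."""
    n_rows = len(data[target])
    # The constant column comes first so that weights[0] acts as the intercept.
    columns = [np.ones(n_rows)] + [np.asarray(data[name], dtype=float) for name in features]
    feature_matrix = np.column_stack(columns)
    output_array = np.asarray(data[target], dtype=float)
    return feature_matrix, output_array

toy_features, toy_output = toy_numpy_data(toy_data, ['sqft_living', 'bedrooms'], 'price')
print(toy_features.shape)   # one row per observation, one column per feature plus the constant
print(toy_features[0, :])   # first row: constant, sqft_living, bedrooms
```

Unlike `get_numpy_data`, this sketch does not add a 'constant' column to the input data as a side effect.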


# Predicting output given regression weights

Suppose we had the weights [1.0, 1.0] and the features [1.0, 1180.0] and we wanted to compute the predicted output 1.0\*1.0 + 1.0\*1180.0 = 1181.0. This is the dot product between these two arrays. If they're numpy arrays, we can use `np.dot()` to compute this:

In [17]:
# Numpy arrays are used to represent numeric vectors.
my_weights = np.array([1., 1.]) # example weight vector
my_features = example_features[0,] # use the first observation vector
predicted_value = np.dot(my_features, my_weights)
print(type(predicted_value))

<class 'numpy.float64'>


`np.dot()` also works for matrix-vector multiplication. 
Recall that the predictions for all the observations are just the RIGHT dot product (weights on the right side) between the features *matrix* and the weights *vector*. 
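
A small sketch of why the weights go on the right, using a made-up 3-by-2 matrix (the names `toy_matrix` and `toy_weights` are illustrative, not part of the assignment). With N rows and D columns, `np.dot` of the (N, D) matrix and the (D,) vector gives N predictions, while the reverse order does not line up when N differs from D:

```python
import numpy as np

# Three observations, two columns (constant and one feature); values are illustrative.
toy_matrix = np.array([[1.0, 1180.0],
                       [1.0, 2570.0],
                       [1.0, 770.0]])
toy_weights = np.array([1.0, 1.0])

# (3, 2) matrix times (2,) vector -> one prediction per row.
toy_predictions = np.dot(toy_matrix, toy_weights)
print(toy_predictions.shape)

# Putting the weights on the left fails here because the shapes (2,) and (3, 2) do not align.
try:
    np.dot(toy_weights, toy_matrix)
    left_product_works = True
except ValueError:
    left_product_works = False
print(left_product_works)
```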

Finish the following `predict_output` function to compute the predictions for an entire matrix of features given the matrix and the weights:

In [20]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns,
    # weights is the corresponding numpy array
    
    predictions = np.dot(feature_matrix, weights)  # returns an array if first arg is a matrix

    return(predictions)

If you want to test your code run the following cell:

In [19]:
test_predictions = predict_output(example_features, my_weights)
print(type(test_predictions)) # a numpy array with one prediction per row
print( test_predictions[0]) # should be 1181.0
print( test_predictions[1]) # should be 2571.0

<class 'numpy.ndarray'>
1181.0
2571.0


# Computing the Partial Derivative for a given weight[i]

We are now going to move to computing the partial derivative with respect to w[i] of the RSS Cost Function. Recall that RSS is the sum over the data points of the squared difference between an observed output and a predicted output.

Since **the derivative of a sum is the sum of the derivatives** we can compute the derivative for a single data point and then sum over data points. 

The squared difference between the observed output and predicted output for a single point (observation) is as follows:

(w[0]\*[CONSTANT] + w[1]\*[feature_1] + ... + w[i] \*[feature_i] + ... +  w[k]\*[feature_k] - output)^2

Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:

2\*(w[0]\*[CONSTANT] + w[1]\*[feature_1] + ... + w[i] \*[feature_i] + ... +  w[k]\*[feature_k] - output)\* [feature_i]

The term inside the parenthesis is just the `error` (difference between prediction and output). So we can re-write this as:

2\*error\*[feature_i]

Thus, the partial derivative for the i-th feature's weight (weight[i]) is the sum (over all data points) of 2 times the product of the error and the feature itself. In the case of the constant, this is just twice the sum of the errors!

Since the sum of element-wise products of two vectors is just the dot product, the derivative for the i-th weight (the weight for feature i) is just 2 times the dot product between the values of feature_i and the current errors (prediction minus real output). 
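
The derivation can be written compactly as follows. The notation is an assumption introduced here for illustration: $N$ is the number of data points, $h_j(x_n)$ is the value of feature $j$ for data point $n$ (with $h_0 = 1$ for the constant), $y_n$ is the observed output, $H$ is the feature matrix, and $h_i$ is its $i$-th column.

$$
\mathrm{RSS}(w) = \sum_{n=1}^{N} \Big( \sum_{j=0}^{k} w_j\, h_j(x_n) - y_n \Big)^2
$$

$$
\frac{\partial\, \mathrm{RSS}}{\partial w_i} = \sum_{n=1}^{N} 2\, \mathrm{error}_n\, h_i(x_n) = 2\, h_i^{\top} (H w - y), \qquad \mathrm{error}_n = \sum_{j=0}^{k} w_j\, h_j(x_n) - y_n
$$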

Complete the following derivative function, which computes the derivative of RSS with respect to one weight given the values of the corresponding feature (over all data points) and the errors (over all data points).

In [21]:
def feature_derivative(errors, feature):
    # errors and feature are both numpy arrays, same length (N data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2*np.dot(feature, errors)
    # Returns a scalar
    return(derivative)

To test your feature derivative (which is a scalar) run the following:

In [24]:
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_output(example_features, my_weights) 
# just like SArrays, two numpy arrays can be subtracted elementwise with '-': 
errors = test_predictions - example_output # with all-zero predictions the errors are just -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)
print( derivative)
print( -np.sum(example_output)*2) # should be the same as derivative

-23345850022.0
-23345850022.0
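
Another way to gain confidence in the derivative is a finite-difference check: nudge one weight slightly and see how RSS changes. The sketch below is only an illustration on made-up data. `toy_rss` and `toy_feature_derivative` are hypothetical helpers that mirror the formulas above, and the step `eps` is an arbitrary small value.

```python
import numpy as np

def toy_rss(feature_matrix, output, weights):
    """Residual sum of squares for the given weights."""
    residuals = np.dot(feature_matrix, weights) - output
    return np.dot(residuals, residuals)

def toy_feature_derivative(errors, feature):
    """Twice the dot product of a feature column with the errors."""
    return 2 * np.dot(feature, errors)

# Made-up data: a constant column plus one feature.
H = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 8.0])
w = np.array([0.5, 1.5])

errors = np.dot(H, w) - y
analytic = np.array([toy_feature_derivative(errors, H[:, i]) for i in range(len(w))])

# Central differences: (RSS(w + eps*e_i) - RSS(w - eps*e_i)) / (2*eps) for each weight i.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (toy_rss(H, y, w + step) - toy_rss(H, y, w - step)) / (2 * eps)

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-4))
```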


# Gradient Descent

Given a starting point we update the current weights by moving in the negative gradient direction. 
The negative gradient is the direction of *decrease* and we're trying to *minimize* a cost function. 

We stop when we are sufficiently close to the optimum. We define this by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

Using the derivative function `feature_derivative`, for each step in the gradient descent we update the weight for each feature before checking our stopping criterion.

In [59]:
from math import sqrt # the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)

In [60]:
def regression_gradient_descent(feature_matrix, output_v, initial_weights, step_size, tolerance):
    converged = False 
    weights = np.array(initial_weights, dtype=float) # make sure it's a float numpy array (integer weights would truncate updates)
    output_v = np.array(output_v)
    while not converged:
        # compute the predictions based on feature_matrix and weights using your predict_output() function
        predictions = predict_output(feature_matrix, weights)
        # compute the errors array as predictions - output
        errors = predictions - output_v
        gradient_inner_sum = 0 # initialize the gradient sum of squares
        
        # Loop over each weight, updating the weights
        for i in range(len(weights)):
            # feature_matrix[:, i] <-  feature column associated with weights[i]
            
            # compute the derivative with respect to weight[i]:
            derivative = feature_derivative(errors, feature_matrix[:,i])
            
            # add the squared value of the derivative to the gradient's sum of inner squares
            gradient_inner_sum += pow(derivative, 2)
            
            # update is step size times the derivative of current weight
            update = step_size*derivative
            weights[i] = weights[i] - update  # subtract because we move along the negative gradient
            
        # compute gradient magnitude to check convergence:
        gradient_magnitude = sqrt(gradient_inner_sum)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
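
For comparison, the per-weight loop can be collapsed into a single matrix product, since all derivatives are computed from the same `errors` array. The following is a sketch under stated assumptions rather than the assignment's solution: `toy_gradient_descent` is a hypothetical name, the `max_iterations` cap is an added safeguard not present above, and the convergence check happens before the update instead of after it. The data is made up so that the exact answer is known.

```python
import numpy as np

def toy_gradient_descent(feature_matrix, output, initial_weights,
                         step_size, tolerance, max_iterations=100000):
    """Vectorized gradient descent on RSS with an iteration cap (an added safeguard)."""
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = np.dot(feature_matrix, weights) - output
        # All partial derivatives at once: 2 * H^T * errors.
        gradient = 2 * np.dot(feature_matrix.T, errors)
        if np.linalg.norm(gradient) < tolerance:
            break
        weights = weights - step_size * gradient
    return weights

# Made-up data generated from y = 1 + 2x, so the weights should approach [1, 2].
H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
toy_weights = toy_gradient_descent(H, y, [0.0, 0.0], step_size=0.02, tolerance=1e-8)
print(np.round(toy_weights, 4))
```

The iteration cap matters in practice: with a step size that is too large the weights diverge and a bare `while not converged` loop never exits.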

The gradient is a sum over all data points of the product of an error and a feature, so it will be very large here: the features (square feet) and the output (prices) are both large. So while you might expect the tolerance to be small, "small" is only relative to the size of the features. 

For similar reasons the step size will be much smaller than you might expect, because the gradient has such large values.
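
To get a feel for the scales involved, the sketch below compares the gradient magnitude at zero weights for a small made-up problem and for the same problem with the feature multiplied by 1,000 and the output by 100,000 (roughly the jump to square feet and prices). The helper name and all numbers are illustrative assumptions.

```python
import numpy as np

def toy_gradient_magnitude(feature_matrix, output, weights):
    """Length of the RSS gradient vector at the given weights."""
    errors = np.dot(feature_matrix, weights) - output
    return np.linalg.norm(2 * np.dot(feature_matrix.T, errors))

small_H = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
small_y = np.array([2.0, 4.0, 6.0])

# Same three points with the feature and output scaled up.
large_H = np.array([[1.0, 1000.0], [1.0, 2000.0], [1.0, 3000.0]])
large_y = small_y * 100000.0

zero_weights = np.zeros(2)
small_magnitude = toy_gradient_magnitude(small_H, small_y, zero_weights)
large_magnitude = toy_gradient_magnitude(large_H, large_y, zero_weights)
print(small_magnitude)
print(large_magnitude)
```

The second magnitude is many orders of magnitude larger than the first, which is why a tolerance in the millions and a step size around 1e-12 are not unreasonable for this data.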

# Running the Gradient Descent as Simple Regression

In [61]:
train_data,test_data = sales.random_split(.8,seed=0)

We can use the gradient descent function to estimate the parameters in the simple regression on sqft_living. 
Let us set up the feature_matrix, output, initial weights and step size for the simple sqft model:

In [62]:
# let's test out the gradient descent
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

Next run your gradient descent with the above parameters.

In [64]:
simple_weights = regression_gradient_descent(simple_feature_matrix, 
                            output, initial_weights, step_size, tolerance)
simple_weights

array([-46999.88716555,    281.91211912])


**Quiz Question: What is the value of the weight for sqft_living -- the second element of ‘simple_weights’ (rounded to 1 decimal place)?**

Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first):

In [68]:
test_simple_feature_matrix, test_output_v = get_numpy_data(test_data, simple_features, my_output)


Now compute your predictions using test_simple_feature_matrix and your weights from above.

In [69]:
simple_predictions = predict_output(test_simple_feature_matrix, simple_weights)

**Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?**

In [70]:
simple_predictions[0]

356134.4431709297

Now that you have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. Recall that RSS is the sum of the squared errors (difference between prediction and output).

In [78]:
# Compute RSS of Simple Model's Test predictions
residuals = simple_predictions - test_output_v
Squares = [e*e for e in residuals]
RSS = sum(Squares)
RSS

275400047593155.7
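
Since the residuals are a numpy array, RSS can also be computed without a Python loop. The sketch below uses made-up predictions and outputs (all names are illustrative) to show three equivalent forms. On large arrays the last digits may differ slightly between them because of floating-point summation order.

```python
import numpy as np

# Made-up predictions and observed outputs.
toy_predictions = np.array([1181.0, 2571.0, 771.0])
toy_actual = np.array([1200.0, 2500.0, 800.0])
toy_residuals = toy_predictions - toy_actual

rss_loop = sum(e * e for e in toy_residuals)       # element-by-element, as in the cell above
rss_dot = np.dot(toy_residuals, toy_residuals)     # dot product of the residuals with themselves
rss_sum = np.sum(toy_residuals ** 2)               # square elementwise, then sum
print(rss_loop, rss_dot, rss_sum)
```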

# Running a multiple regression

Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:

In [72]:
model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average square footage of the nearest 15 neighbors. 
target = 'price'
# Feature matrix is equivalent to our H matrix in the class notes
feature_matrix, output_v = get_numpy_data(train_data, model_features, target)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

Use the above parameters to estimate the model weights. Record these values for your quiz.

In [73]:
model_weights = regression_gradient_descent(feature_matrix, 
                            output_v, initial_weights, step_size, tolerance)

Use your newly estimated weights and the predict_output function to compute the predictions on the TEST data. Don't forget to create a numpy array for these features from the test set first!

In [74]:
feature_matrix_test, output_v_test =  get_numpy_data(test_data, model_features, target)
model_predictions = predict_output(feature_matrix_test, model_weights)

**Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?**

In [81]:
model_predictions[0]   # vs 356,134.4431709297

366651.4120365591

What is the actual price for the 1st house in the test data set?

In [83]:
test_data[0]['price']

310000.0

**Quiz Question: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?**

Now use your predictions and the output to compute the RSS for model 2 on TEST data.

In [79]:
# Compute RSS of Bi-featured Model's Test predictions
residuals_model = model_predictions - output_v_test
Squares_model = [e*e for e in residuals_model]  # square the model 2 residuals, not the model 1 residuals
RSS_model = sum(Squares_model)
RSS_model

**Quiz Question: Which model (1 or 2) has the lowest RSS on all of the TEST data?**

In [80]:
if (RSS < RSS_model): 
    print("Simple model wins") 
else: 
    print("Bi-feature model wins")

Bi-feature model wins
