# Project 2: Multivariate Regression

The goal of this notebook is to explore multivariate regression and feature engineering.

In this notebook you will use data on house sales to predict prices using multiple regression. You will:
* Do some feature engineering
* Use built-in python functions to compute the regression weights (coefficients/parameters)
* Given the regression weights, predictors and outcome write a function to compute the Residual Sum of Squares
* Look at coefficients and interpret their meanings
* Evaluate multiple models via RSS

# Load in house sale data, split data into training and testing.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# load data
df = pd.read_csv('merged.csv')

In [3]:
df.head()

Unnamed: 0,SALE TYPE,SOLD DATE,PROPERTY TYPE,ADDRESS,CITY,STATE OR PROVINCE,ZIP OR POSTAL CODE,PRICE,BEDS,BATHS,...,STATUS,NEXT OPEN HOUSE START TIME,NEXT OPEN HOUSE END TIME,URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING),SOURCE,MLS#,FAVORITE,INTERESTED,LATITUDE,LONGITUDE
0,PAST SALE,March-13-2015,Single Family Residential,1157 S Stelling Rd,CUPERTINO,CA,95014.0,1500000.0,3.0,2.0,...,Sold,,,http://www.redfin.com/CA/Cupertino/1157-S-Stel...,MLSListings,ML81448134,N,Y,37.303816,-122.041727
1,PAST SALE,April-17-2018,Single Family Residential,10340 Las Ondas Way,CUPERTINO,CA,95014.0,2798000.0,4.0,2.5,...,Sold,,,http://www.redfin.com/CA/Cupertino/10340-Las-O...,MLSListings,ML81695379,N,Y,37.317957,-122.024231
2,PAST SALE,February-4-2019,Single Family Residential,1035 W Homestead Rd,SUNNYVALE,CA,94087.0,2200000.0,4.0,3.0,...,Sold,,,http://www.redfin.com/CA/Sunnyvale/1035-W-Home...,MLSListings,ML81733182,N,Y,37.337792,-122.056622
3,PAST SALE,July-9-2018,Townhouse,11030 Firethorne Dr,CUPERTINO,CA,95014.0,1307000.0,2.0,2.5,...,Sold,,,http://www.redfin.com/CA/Cupertino/11030-Firet...,MLSListings,ML81708624,N,Y,37.33827,-122.032713
4,PAST SALE,July-12-2019,Single Family Residential,10590 S Tantau Ave,CUPERTINO,CA,95014.0,2090000.0,3.0,3.0,...,Sold,,,http://www.redfin.com/CA/Cupertino/10590-S-Tan...,MLSListings,ML81753390,N,Y,37.314344,-122.00731


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2450 entries, 0 to 2449
Data columns (total 27 columns):
SALE TYPE                                                                                     2450 non-null object
SOLD DATE                                                                                     965 non-null object
PROPERTY TYPE                                                                                 2450 non-null object
ADDRESS                                                                                       2438 non-null object
CITY                                                                                          2444 non-null object
STATE OR PROVINCE                                                                             2450 non-null object
ZIP OR POSTAL CODE                                                                            2443 non-null float64
PRICE                                                                                   

Remove the rows where the columns you are going to use are nan values. Then split into train/test using 80/20 allocation.

In [5]:
# clean the data and split

In [6]:
df2 = df.drop(columns=['NEXT OPEN HOUSE START TIME','NEXT OPEN HOUSE END TIME','SOURCE','LOCATION','HOA/MONTH',
                       'INTERESTED','MLS#','URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)'])

In [7]:
df3 = df2.dropna()

In [8]:
y = df3.PRICE
X = df3.drop(columns='PRICE')

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Learn a multivariate regression model 

Recall we can use the following code to learn a multiple regression model predicting 'price' based on the following features on training data:

In [11]:
example_features = ['SQUARE FEET', 'BEDS', 'BATHS']

Subset traing_data and test_data using example_features and column['PRICE'] and store them into train_input, train_price, test_input, and test_price.

In [12]:
train_input = X_train[example_features]
train_price = y_train
test_input = X_test[example_features]
test_price = y_test

Build a multivariate regression model using LinearRegression in sklearn.linear_model on train_data using train_input and train_price

In [13]:
from sklearn.linear_model import LinearRegression

In [14]:
regressor = LinearRegression()

In [15]:
regressor.fit(train_input,train_price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now write a function that prints outcomes

In [16]:
#Function of printing outcomes using coefficients, intercept and input_features
def print_outcome(coefficients,intercept,input_features):
    for x in range(len(input_features)):
        print('The coefficient for {} is {}'.format(input_features[x],coefficients[x]))
    print('The intercept is {}'.format(intercept))

Now print out the coefficient and intercept of your model:

In [17]:
print_outcome(regressor.coef_,regressor.intercept_,example_features)

The coefficient for SQUARE FEET is 847.5870111405366
The coefficient for BEDS is 61498.95517570657
The coefficient for BATHS is -83887.77288419777
The intercept is 33494.332950558746


# Making Predictions

Once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above:

In [18]:
example_predictions = regressor.predict(train_input)
print(example_predictions[0])

2214685.2324394668


# Compute RSS

Now that we can make predictions given the model, let's write a function to compute the RSS of the model.

In [19]:
#Function of calculating RSS using model, data, outcome as input and return RSS
def get_residual_sum_of_squares(model, data, outcome):
    predictions = model.predict(data)
    residuals = predictions - outcome
    residuals_sum = residuals.sum()
    RSS = residuals_sum ** 2
    return(RSS)  

Use the function to calculate RSS for train_data and test_data.

In [20]:
get_residual_sum_of_squares(regressor,train_input,train_price)

1.7956556380704924e-15

In [21]:
get_residual_sum_of_squares(regressor,test_input,test_price)

12442730951891.396

# Build models with different features

Now build a model_1 using different set of features

In [22]:
model_1_features=['SQUARE FEET', 'BEDS', 'BATHS','LOT SIZE']

In [23]:
train_input = X_train[model_1_features]
train_price = y_train
test_input = X_test[model_1_features]
test_price = y_test

In [24]:
regressor2 = LinearRegression()

In [25]:
regressor2.fit(train_input,train_price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now print out the coefficient and intercept of model_1

In [26]:
print_outcome(regressor2.coef_,regressor2.intercept_,model_1_features)

The coefficient for SQUARE FEET is 847.5025061844938
The coefficient for BEDS is 61531.00590854499
The coefficient for BATHS is -83882.13202245435
The coefficient for LOT SIZE is 0.01463020890968282
The intercept is 33381.66187558905


Compute RSS for model_1

In [27]:
get_residual_sum_of_squares(regressor2,train_input,train_price)

7.806255641895632e-18

In [28]:
get_residual_sum_of_squares(regressor2,test_input,test_price)

12475333304287.207

# Create some new features

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms) but we can also consider transformations of existing features e.g. the log of the squarefeet or even "interaction" features such as the product of bedrooms and bathrooms.

You will use the logarithm function to create a new feature. so first you should import it from the math library.

Next create the following 4 new features as column in both TEST and TRAIN data:
* BEDS_squared = BEDS\*BEDS
* BED_BATH = BEDS\*BATHS
* log_SQFT = log(SQUARE FEET)
* lat_plus_long = LATITUDE + LONGITUDE

Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
bedrooms times bathrooms gives what's called an "interaction" feature. It is large when both of them are large.
Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.
Adding latitude to longitude is totally non-sensical but we will do it anyway (you'll see why)

In [29]:
import math
from math import log

In [30]:
X['BEDS_squared'] = X['BEDS'] * X['BEDS']
X['BED_BATH'] = X['BEDS'] * X['BATHS']
X['lat_plus_long'] = X['LATITUDE'] + X['LONGITUDE']

In [31]:
plus = []
for x in X['SQUARE FEET']:
    d = log(x)
    plus.append(x)

In [32]:
X['log_SQFT'] = plus

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Now build model_2 with more features than model_1: model_2_features=model_1_features+ ['BED_BATH']

In [34]:
model_2_features = ['SQUARE FEET', 'BEDS', 'BATHS','LOT SIZE','BED_BATH']

In [35]:
train_input = X_train[model_2_features]
train_price = y_train
test_input = X_test[model_2_features]
test_price = y_test

In [36]:
regressor3 = LinearRegression()

In [37]:
regressor3.fit(train_input,train_price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now print out the coefficient and intercept of model_2

In [38]:
print_outcome(regressor3.coef_,regressor3.intercept_,model_2_features)

The coefficient for SQUARE FEET is 815.8580848670385
The coefficient for BEDS is 30073.764989576193
The coefficient for BATHS is -142196.8235527953
The coefficient for LOT SIZE is -0.07282153991256131
The coefficient for BED_BATH is 18494.57928508887
The intercept is 158625.58340264368


Compute RSS for model_2

In [39]:
get_residual_sum_of_squares(regressor3,train_input,train_price)

8.501012394024343e-15

In [40]:
get_residual_sum_of_squares(regressor3,test_input,test_price)

365484301955831.0

Now build model_3 with more features than model_2: model_3_features = model_2_features + ['BEDS_squared', 'log_SQFT', 'lat_plus_long']

In [41]:
model_3_features = ['SQUARE FEET', 'BEDS', 'BATHS','LOT SIZE','BED_BATH','BEDS_squared', 'log_SQFT', 'lat_plus_long']

In [42]:
train_input = X_train[model_3_features]
train_price = y_train
test_input = X_test[model_3_features]
test_price = y_test

In [43]:
regressor4 = LinearRegression()

In [44]:
regressor4.fit(train_input,train_price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now print out the coefficient and intercept of model_3

In [45]:
print_outcome(regressor4.coef_,regressor4.intercept_,model_3_features)

The coefficient for SQUARE FEET is 286.0700915608937
The coefficient for BEDS is 321056.47086781455
The coefficient for BATHS is -271412.8976039463
The coefficient for LOT SIZE is -1.0130818878731012
The coefficient for BED_BATH is 91698.21875942852
The coefficient for BEDS_squared is -72017.82329918582
The coefficient for log_SQFT is 286.07009211113734
The coefficient for lat_plus_long is -3352692.6621714337
The intercept is -283633312.7175566


Compute RSS for model_3

In [46]:
get_residual_sum_of_squares(regressor4,train_input,train_price)

2.3283064365386963e-10

In [47]:
get_residual_sum_of_squares(regressor4,test_input,test_price)

136917497619432.69

# Compare different Models

Calculate RSS for model_1, model_2, model_3 for train_data and test_data. What insights do you get from the results?

In [None]:
# After calculating each model's RSS, the rss shows an increasing trend. I think this is strange. First, I believe
# when considering more factors, the model will be more accurate. However, when considering more factors, the difference
# between the predictions and the reality may gets bigger.
# Therefore, I assume this is the reason why the RSS is increasing.

# Gradient descent solution

Now we will estimate multivariate regression weights via gradient descent.

* Add a constant column of 1's to a dataframe to account for the intercept
* Convert a dataframe into a Numpy array
* Write a predict_output() function using Numpy
* Write a numpy function to compute the derivative of the regression weights with respect to a single feature
* Write gradient descent function to compute the regression weights given an initial weight vector, step size and tolerance.
* Use the gradient descent function to estimate regression weights for multiple features

# Convert to Numpy Array

In order to understand the details of the implementation of algorithms it's important to work with a library that allows for direct (and optimized) matrix operations. Numpy is a Python solution to work with matrices (or any multi-dimensional "array").

Recall that the predicted value given the weights and the features is just the dot product between the feature and weight vector. Similarly, if we put all of the features row-by-row in a matrix then the predicted value for all the observations can be computed by right multiplying the "feature matrix" by the "weight vector".

First we need to take the dataframe of our data and convert it into a 2D numpy array (also called a matrix).We can then use Panda's .as_matrix() to convert the dataframe into a numpy matrix.

In [48]:
import numpy as np

Now we will write a function that will accept a dataframe, a list of feature names (e.g. ['SQUARE FEET', 'BEDS','BATHS']) and an target feature e.g. ('PRICE') and will return two things:
* A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
* A numpy array containing the values of the output

With this in mind, complete the following function

In [85]:
def get_numpy_data(data, features, output):
    data['Constant'] = 1
    features = ['Constant'] + features
    feature_matrix = data[features].values
    output_array = data[output].values
    return(feature_matrix, output_array)

For testing let's use the 'SQUARE FEET' feature and a constant as our features and price as our output

In [86]:
(example_features, example_output) = get_numpy_data(df, ['SQUARE FEET'], 'PRICE') # the [] around 'sqft_living' makes it a list
print(example_features[0,:]) # this accesses the first row of the data the ':' indicates 'all columns'
print(example_output[0]) # and the corresponding output

[1.00e+00 1.19e+03]
1500000.0


# Predicting output given regression weights

Suppose we had the weights [1.0, 1.0] and the features [1.0, 2368.0] and we wanted to compute the predicted output 1.0\*1.0 + 1.0\*1000.0 = 1001.0 this is the dot product between these two arrays. If they're numpy arrays, we can use np.dot() to compute this:

In [87]:
my_weights = np.array([1., 1.]) 
my_features = [1.0,1000.0]
predicted_value = np.dot(my_features, my_weights)
print(predicted_value)

1001.0


np.dot() also works when dealing with a matrix and a vector. The predictions from all the observations is just the RIGHT (as in weights on the right) dot product between the features matrix and the weights vector. With this in mind finish the following predict_output function to compute the predictions for an entire matrix of features given the matrix and the weights:

In [88]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array
    # create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix,weights)
    return(predictions)

# Computing the Derivative

We are now going to move to computing the derivative of the regression cost function. The cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]\*[CONSTANT] + w[1]\*[feature_1] + ... + w[i] \*[feature_i] + ... +  w[k]\*[feature_k] - output)^2

Where we have k features and a constant. So the derivative with respect to weight w[i] by the chain rule is:

2\*(w[0]\*[CONSTANT] + w[1]\*[feature_1] + ... + w[i] \*[feature_i] + ... +  w[k]\*[feature_k] - output)\* [feature_i]

The term inside the paranethesis is just the error (difference between prediction and output). So we can re-write this as:

2\*error\*[feature_i]

That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant then this is just twice the sum of the errors!

Twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors. 

With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points).

In [91]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2 * np.dot(errors,feature)
    return(derivative)

Use the test code to test the feature derivative function:

In [104]:
(example_features, example_output) = get_numpy_data(df3, ['SQUARE FEET'], 'PRICE') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_output(example_features, my_weights) 
# numpy arrays can be elementwise subtracted with '-': 
errors = test_predictions - example_output # prediction errors in this case is just the -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)
print(derivative)
print(-np.sum(example_output)*2)

-2682451712.0
-2682451712.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [105]:
# If there exists some null-value, the resutls will be none.

# Gradient Descent

Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of increase and therefore the negative gradient is the direction of decrease and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring that the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

With this in mind, complete the following gradient descent function below using your derivative function above. For each step in the gradient descent we update the weight for each feature befofe computing our stopping criteria

In [106]:
from math import sqrt 
# recall that the magnitude/length of a vector [g[0], g[1], g[2]] is sqrt(g[0]^2 + g[1]^2 + g[2]^2)

In [119]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False 
    # make sure weights is a numpy array
    weights = np.array(initial_weights)
    while not converged:
        # compute the predictions based on feature_matrix and weights using your predict_output() function
        predictions = predict_output(feature_matrix, weights) 
        # compute the errors as predictions - output
        errors = predictions - output
        # initialize the gradient sum of squares to 0
        gradient_sum_of_squares = 0
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): #loop over each weight
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative = feature_derivative(errors, feature_matrix[:,i])
            # add the squared value of the derivative to the gradient sum of squares (for assessing convergence)
            gradient_sum_of_squares = (derivative * derivative).sum()
            # subtract the step size times the derivative from the current weight
            weights[i] = weights[i] - step_size * derivative
            # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = sqrt(gradient_sum_of_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)

A few things to note before we run the gradient descent. Since the gradient is a sum over all the data points and involves a product of an error and a feature the gradient itself will be very large since the features are large (squarefeet) and the output is large (prices). So while you might expect "tolerance" to be small, small is only relative to the size of the features. 

For similar reasons the step size will be much smaller than you might expect but this is because the gradient has such large values.

# Running the Gradient Descent as Simple Regression

Although the gradient descent is designed for multivariate regression since the constant is now a feature we can use the gradient descent function to estimat the parameters in the simple regression on squarefeet. The folowing cell sets up the feature_matrix, output, initial weights and step size for the first model:

In [111]:
train_data,test_data = train_test_split(df3,train_size=0.2)

In [120]:
# let's test out the gradient descent
simple_features = ['SQUARE FEET']
my_output = 'PRICE'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([0.0, 1.0])
#step_size = 7e-12
#tolerance = 2.5e7
step_size = 7e-13
tolerance = 1.0e7

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Next run your gradient descent with the above parameters.

In [124]:
slope = regression_gradient_descent(simple_feature_matrix,output,initial_weights,step_size,tolerance)
print(slope)

[5.33890829e-01 8.63240712e+02]


How do your weights compare to those achieved in the closed form solution?

Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first)

In [122]:
# create a numpy array of the test feature_matrix and test output
simple_features = ['SQUARE FEET']
my_output = 'PRICE'
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [128]:
# use predict_out function to calulate the prediction for test data.
prediction = predict_output(test_simple_feature_matrix, slope)
print(prediction[0])

4412887.054813535


In [131]:
# print the RSS calualted from the test data.
def residual_sum_of_squares(pred,test_output):
    difference = pred - test_output
    square_of_difference = difference * difference
    RSS = square_of_difference.sum()
    return RSS

In [132]:
residual_sum_of_squares(prediction,test_output)

279664929832036.88

# Running the Gradient Descent as Multivariate Regression

Extend the code to use different multivariate features as the first half part of this project. 

In [133]:
# split the data
train_data,test_data = train_test_split(df3,train_size=0.2)

In [146]:
# test out the gradient descent
multi_features = ['SQUARE FEET','BEDS']
my_output = 'PRICE'
(multi_feature_matrix, output) = get_numpy_data(train_data, multi_features, my_output)
initial_weights = np.array([-100000.0, 1.0,1.0])
#step_size = 7e-12
#tolerance = 2.5e7
step_size = 7e-13
tolerance = 1.0e7

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [147]:
slope2 = regression_gradient_descent(multi_feature_matrix,output,initial_weights,step_size,tolerance)
print(slope2)

KeyboardInterrupt: 

# Research

Try change the step_size and tolerance, how do the results change? (BE CAREFUL about the constant you picked, it may take a long time for the algorithm to converge.)

In [None]:
# Test on Simple Linear Regression

In [214]:
simple_features = ['SQUARE FEET']
my_output = 'PRICE'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([0.0, 1.0])
step_size = 7e-12
tolerance = 1e6

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [215]:
slope = regression_gradient_descent(simple_feature_matrix,output,initial_weights,step_size,tolerance)
print(slope)

[5.90661781e-01 8.07844670e+02]


In [216]:
# The coefficients are almost the same

In [217]:
prediction = predict_output(test_simple_feature_matrix, slope)
print(prediction[0])

4129702.5425454983


In [218]:
residual_sum_of_squares(prediction,test_output)

286894808821128.75

In [None]:
# From just one sample, I set the step size and tolerance are all bigger than the previous.
# The Rss is bigger than the former.
# If I change step size and tolerance are all smaller than the previous, the rss is still bigger than the former.