# Multivariable Linear Regression 

Simple linear regression produces a model in the form:

$\hat{y} = \alpha + \beta x$

Multivariable linear regression produces a model in the form:

$\hat{y} = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n}$

The linear regression algorithm accomplishes this by finding the intercept and coefficients that minimize the sum of the squared differences between actual values and predicted values. This method is called **ordinary least squares**, or **OLS**.
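
As a rough illustration of what OLS does, the sketch below fits synthetic data with NumPy's `np.linalg.lstsq`. The data, the "true" coefficients, and all variable names are made up for this example, and it is not necessarily how scikit-learn computes the fit internally.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the stock dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_intercept = 2.0
true_coefs = np.array([1.5, -0.5, 3.0])
y_demo = true_intercept + X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated as the first weight
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: the weights w that minimize sum((A @ w - y_demo) ** 2)
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("intercept:", w[0])
print("coefficients:", w[1:])
```

The recovered values should land close to the intercept and coefficients used to generate the data, with small differences caused by the added noise.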

The **scikit-learn** Python library provides a LinearRegression class for doing this job.

In [44]:
import pandas as pd
import numpy as np
%matplotlib inline

In [48]:
# For this exercise we will use a dataset of Nasdaq Apple Inc. Common Stock Historical Stock Prices
# https://www.nasdaq.com/symbol/aapl/historical
df = pd.read_csv("apple_stocks.csv")
df.head()

| | date | close | volume | open | high | low |
|---|---|---|---|---|---|---|
| 0 | 3/21/2018 | 171.27 | 36387880 | 175.04 | 175.09 | 171.26 |
| 1 | 3/20/2018 | 175.24 | 19620520 | 175.24 | 176.8 | 174.94 |
| 2 | 3/19/2018 | 175.3 | 32931110 | 177.32 | 177.47 | 173.66 |
| 3 | 3/16/2018 | 178.02 | 38313330 | 178.65 | 179.12 | 177.62 |
| 4 | 3/15/2018 | 178.65 | 22676520 | 178.5 | 180.24 | 178.0701 |


In [49]:
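# Let's examine if any of these columns are highly correlated.
# numeric_only=True skips the text 'date' column, which recent pandas versions
# would otherwise refuse to correlate.
df.corr(numeric_only=True)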

| | close | volume | open | high | low |
|---|---|---|---|---|---|
| close | 1.0 | -0.679251 | 0.9287 | 0.966223 | 0.971258 |
| volume | -0.679251 | 1.0 | -0.6108 | -0.581887 | -0.717348 |
| open | 0.9287 | -0.6108 | 1.0 | 0.972885 | 0.971381 |
| high | 0.966223 | -0.581887 | 0.972885 | 1.0 | 0.971314 |
| low | 0.971258 | -0.717348 | 0.971381 | 0.971314 | 1.0 |


In [53]:
# Unlike with simple linear regression where we use one predictor variable to predict one response variable,
# with multivariable regression we can use multiple predictor variables to predict one response variable.
# In this example, we will use 'open', 'high', and 'low' to predict 'close'
X = df[['open','high','low']]
y = df[['close']]

In [54]:
# Split Data
# Now we can split our data into a training and test set.  In this example, we are using an 80/20 split, 
# where 80% of our data will be used for training our model, and 20% of our data will be used for testing.
    
from sklearn.model_selection import train_test_split

# Split X and y into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
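
To make the 80/20 split concrete, here is a small sketch on a made-up ten-row array (the array and the variable names are assumptions for illustration). It shows the resulting sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each (illustrative data only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# Repeating the call with the same random_state yields the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
print(np.array_equal(X_te, X_te2))
```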

In [55]:
# Train Model
# Now we train our LinearRegression model using the training subset of data.

from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [56]:
# Now that our model is trained, we can view the coefficients of the model using regression_model.coef_.
# Because y is a one-column DataFrame (2-D), coef_ is a 2-D NumPy array of shape (1, 3):
# one row per target and one column per predictor, hence the [0][idx] indexing below.
# Each regression coefficient is the expected change in the outcome variable for a one-unit
# change in that predictor variable while the other predictor variables are held constant.

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for open is -0.6570935162207491
The coefficient for high is 0.8111276419948648
The coefficient for low is 0.8187193256466084


In [57]:
# regression_model.intercept_ returns an array of intercepts
intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

The intercept for our model is 4.509350671745153


Now that we know the regression coefficients for each predictor variable and the intercept, we can write out our model, where $x_{1}$, $x_{2}$, and $x_{3}$ are open, high, and low:

**$\hat{y} = 4.51 - 0.657x_{1} + 0.811x_{2} + 0.819x_{3}$**
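
The fitted equation is exactly what the model's predict method evaluates. The sketch below checks this on synthetic data (the data and names are assumptions for illustration). Note that the target here is a 1-D array, so `coef_` is 1-D and `intercept_` is a scalar, unlike the 2-D case in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three predictors (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
y_demo = 4.0 + X_demo @ np.array([-0.6, 0.8, 0.8]) + rng.normal(scale=0.05, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Evaluate the equation by hand for one row: intercept + sum(coef_i * x_i)
row = X_demo[0]
manual = model.intercept_ + np.dot(model.coef_, row)

# The same row passed through predict (which expects a 2-D input)
from_predict = model.predict(row.reshape(1, -1))[0]

print(manual, from_predict)
```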



## How good is our model - the $R^{2}$ statistic

A common method of measuring the accuracy of regression models is to use the $R^{2}$ statistic.

The $R^{2}$ statistic is defined as follows:

$R^{2} = 1 - (RSS/TSS)$

* The RSS (residual sum of squares) measures the variability left unexplained after performing the regression
* The TSS (total sum of squares) measures the total variability in Y
* Therefore the $R^{2}$ statistic measures the proportion of variability in Y that is explained by X using our model

The $R^{2}$ statistic typically ranges from zero to one, with zero indicating that the proposed model **does not improve prediction over the mean model** and one indicating **perfect prediction**. On held-out data, scikit-learn's score method can also return a negative value when the model predicts worse than the mean model. Improvement in the regression model results in proportional increases in R-squared.
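
As a small worked example of the definition, the sketch below computes RSS, TSS, and $R^{2}$ by hand for four made-up observations and compares the result with scikit-learn's `r2_score`. The numbers are illustrative and not taken from the stock dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# RSS: squared differences between actual and predicted values
rss = np.sum((y_true - y_pred) ** 2)          # 0.25 + 0.25 + 0 + 0.25 = 0.75

# TSS: squared differences between actual values and their mean (6.0)
tss = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20

r2_manual = 1 - rss / tss                     # 1 - 0.75 / 20 = 0.9625
print(r2_manual, r2_score(y_true, y_pred))
```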

In [58]:
# R^2 can be determined using our test set and the model's score method.

regression_model.score(X_test, y_test)

# This means that in our model, about 97.5% of the variability in Y can be explained using X

0.97471969717241802

## How good is our model - RMSE

* The RMSE is the square root of the variance of the residuals. 
* It indicates the absolute fit of the model to the data: how close the observed data points are to the model's predicted values.
* Whereas $R^{2}$ is a relative measure of fit, RMSE is an absolute measure of fit. 
* As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. 
* **Lower values of RMSE indicate better fit.**
* RMSE is a good measure of how accurately the model predicts the response, and is the most important criterion for fit if the main purpose of the model is prediction.

https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/
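
To show what the RMSE measures, the sketch below computes it directly from the residuals of four made-up observations and checks it against the square root of scikit-learn's `mean_squared_error`. The values are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residuals are the differences between actual and predicted values
residuals = y_true - y_pred

# RMSE: square root of the mean of the squared residuals
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # sqrt(0.75 / 4) = sqrt(0.1875)

# The same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the residuals are in the units of the response variable, so is the RMSE.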

In [59]:
# We can get the mean squared error using scikit-learn's mean_squared_error function,
# which compares the predictions for the test data set (data not used for training)
# with the ground truth for the test data set.

# We'll start with calculating the Mean Squared Error (MSE)

from sklearn.metrics import mean_squared_error

y_predict = regression_model.predict(X_test)

# The documented argument order is (y_true, y_pred)
regression_model_mse = mean_squared_error(y_test, y_predict)

regression_model_mse

0.55489803976763252

In [60]:
# And now we can calculate the Root Mean Squared Error (RMSE)
import math

math.sqrt(regression_model_mse)

0.7449147869170222

In [61]:
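# Now, let's try to make a prediction

# We can use our model to predict the closing price for a single day.

# In the dataset, the data for 1/9/2018 is as follows:

# close: 174.33
# open: 174.55
# high: 175.06
# low: 173.41

# First, let's see if our model will predict the 'close' amount given these exact values.
# The values are passed in the same column order used for training: open, high, low.

new_data = pd.DataFrame([[174.55, 175.06, 173.41]], columns=X.columns)

regression_model.predict(new_data)

# With the coefficients and intercept printed above, this works out to about 173.78,
# close to the actual close of 174.33.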

In [28]:
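# Now let's try to change some of the values so that the data is unknown to our model
# (our model wasn't trained or tested on this data).
# The model was fitted on three predictors (open, high, low), so each row must contain
# exactly three values; a fourth value would make predict raise a ValueError.

new_data = pd.DataFrame([[30.85, 28.04, 20.51]], columns=X.columns)
regression_model.predict(new_data)

# With the coefficients and intercept printed above, this works out to about 23.77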