# Multivariate Linear Regression
Multivariate linear regression with backward elimination, keeping only the features that are statistically significant to the model.
### Step 1: Importing required libraries and modules

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
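# OLS is exposed by statsmodels.api; recent statsmodels releases no longer provide it through statsmodels.formula.api
from statsmodels.api import OLS as regression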
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

### Step 2: Reading the CSV dataset into a pandas DataFrame

In [2]:
dframe = pd.read_csv("50_startups.csv")
dframe.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


Copying the feature columns (X) to a new DataFrame

In [3]:
x_data_frame = dframe[dframe.columns.tolist()[:-1]]
x_data_frame.head()

| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |


Similarly, copying the target column (Y), which pandas returns as a Series

In [4]:
y_data_frame = dframe.Profit
y_data_frame.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

### Step 3: Label Encoding
The categorical `State` column is converted to numeric codes so that it can be used for model fitting and prediction.

In [5]:
label_encoder = LabelEncoder()
# Work on an explicit copy so the assignment below does not raise pandas' SettingWithCopyWarning
x_data_frame = x_data_frame.copy()
x_data_frame["State"] = label_encoder.fit_transform(x_data_frame["State"])
x_data_frame.head()


| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 2 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 1 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 2 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 1 |
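
The codes in the `State` column (New York is 2, California is 0, Florida is 1) come from numbering the distinct labels in sorted order. A minimal plain-Python sketch of that idea, not scikit-learn's actual implementation; the helper name `label_encode` is made up for illustration:

```python
def label_encode(values):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(values))  # ['California', 'Florida', 'New York']
    lookup = {label: code for code, label in enumerate(classes)}
    return [lookup[v] for v in values], classes


# The State values of the first five rows shown by head()
states = ["New York", "California", "Florida", "New York", "Florida"]
codes, classes = label_encode(states)
print(codes)    # [2, 0, 1, 2, 1]
print(classes)  # ['California', 'Florida', 'New York']
```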


> Label codes such as 0, 1 and 2 suggest an ordering between the states that does not exist. One-hot encoding the encoded labels removes that artificial relation.

**In this example, I avoid the dummy variable trap (perfect multicollinearity among the one-hot columns) by dropping the very first column of the encoded data.**

In [6]:
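from sklearn.compose import ColumnTransformer
# `categorical_features` was removed from OneHotEncoder in scikit-learn 0.22; ColumnTransformer selects the column instead
hot_encoder = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough", sparse_threshold=0)
x_data = hot_encoder.fit_transform(x_data_frame).astype(int)[:, 1:]  # np.int no longer exists in recent NumPy
x_data[:5, :]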

array([[     0,      1, 165349, 136897, 471784],
       [     0,      0, 162597, 151377, 443898],
       [     1,      0, 153441, 101145, 407934],
       [     0,      1, 144372, 118671, 383199],
       [     1,      0, 142107,  91391, 366168]])
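
To see why one column can be dropped: a full one-hot encoding produces one indicator column per state, and those columns sum to 1 in every row, so any one of them can be derived from the others. A plain-Python sketch of this, where the helper `one_hot` is hypothetical and only illustrates the idea:

```python
def one_hot(codes, n_classes):
    """Turn integer codes into rows of 0/1 indicator columns."""
    return [[1 if code == k else 0 for k in range(n_classes)] for code in codes]


codes = [2, 0, 1, 2, 1]  # New York, California, Florida, New York, Florida
full = one_hot(codes, n_classes=3)
print(full)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Every row sums to 1, so the first column equals 1 minus the other two.
assert all(sum(row) == 1 for row in full)

# Dropping the first (California) column keeps the same information.
reduced = [row[1:] for row in full]
print(reduced)  # [[0, 1], [0, 0], [1, 0], [0, 1], [1, 0]]
```

The `reduced` rows match the first two columns of the array above.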

In [7]:
y_data = y_data_frame.values.astype(int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
y_data[:5]

array([192261, 191792, 191050, 182901, 166187])

### Step 4: Split and Fit
In this step, I split x and y into training and test sets and then fit the regression model on the training data.

scikit-learn's regression models do not report P values for the fitted coefficients, so I use statsmodels for the regression model instead.
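
The split itself can be pictured as shuffling the row indices and holding out a fraction of them. The sketch below uses only the standard library and a made-up helper `simple_split`; it will not select the same rows as scikit-learn's `train_test_split`, it only shows where the 40/10 sizes come from:

```python
import random


def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(50))  # stand-in for the 50 startups
train, test = simple_split(rows)
print(len(train), len(test))  # 40 10
```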

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=0)
regressor = regression(exog=x_train, endog=y_train).fit()
regressor.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 7.963 |
| Model: | OLS | Adj. R-squared: | 8.958 |
| Method: | Least Squares | F-statistic: | -8.005 |
| Date: | Thu, 07 Sep 2017 | Prob (F-statistic): | 1.0 |
| Time: | 08:40:57 | Log-Likelihood: | -432.43 |
| No. Observations: | 40 | AIC: | 874.9 |
| Df Residuals: | 35 | BIC: | 883.3 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 529.9975 | 5269.684 | 0.101 | 0.920 | -1.02e+04 | 1.12e+04 |
| x2 | 4999.3192 | 4661.687 | 1.072 | 0.291 | -4464.409 | 1.45e+04 |
| x3 | 0.6845 | 0.068 | 10.001 | 0.000 | 0.546 | 0.823 |
| x4 | 0.3418 | 0.036 | 9.625 | 0.000 | 0.270 | 0.414 |
| x5 | 0.0736 | 0.024 | 3.122 | 0.004 | 0.026 | 0.121 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.176 | Durbin-Watson: | 1.961 |
| Prob(Omnibus): | 0.337 | Jarque-Bera (JB): | 1.675 |
| Skew: | -0.501 | Prob(JB): | 0.433 |
| Kurtosis: | 2.963 | Cond. No. | 798000.0 |
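
For reading the coefficient rows: each `t` value is the coefficient divided by its standard error, and `P>|t|` is the two-sided probability of seeing a value at least that extreme if the true coefficient were zero. The symbols $\hat\beta_j$, $\mathrm{SE}$ and $T_{35}$ (a t-distributed variable with the 35 residual degrees of freedom) are notation introduced here, not part of the statsmodels output. The x1 and x2 rows serve as a worked check:

$$
t_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}, \qquad p_j = P\left(|T_{35}| \ge |t_j|\right)
$$

$$
t_1 = \frac{529.9975}{5269.684} \approx 0.101, \qquad t_2 = \frac{4999.3192}{4661.687} \approx 1.072
$$

A small `t` value means the coefficient is hard to distinguish from zero, which is what a large P value expresses.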


### Step 5: Building a better model with Backward Elimination
To improve the model, I use backward elimination to remove the features that have no significant effect on it.

Backward Elimination steps:
- Choose a significance level, say 5%
- Fit the model with all features and look for a feature whose P value is above the significance level; if there is none, finish
- Remove that feature (the one with the highest P value, if there are several) and fit the model again with the remaining features
- Repeat until no remaining feature has a P value above the significance level
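
The steps above can be sketched as a loop. This is a minimal illustration rather than the notebook's own code: the function `backward_eliminate` and its `fit_pvalues` callback are hypothetical names, and dropping the feature with the largest P value each round is a common convention assumed here. The demo uses made-up, fixed P values; with a real model they change after every refit.

```python
def backward_eliminate(n_features, fit_pvalues, significance=0.05):
    """Return the column indices that survive backward elimination.

    fit_pvalues(columns) must fit a model on those columns and return
    one P value per column, in the same order.
    """
    columns = list(range(n_features))
    while columns:
        pvalues = list(fit_pvalues(columns))
        worst = max(range(len(columns)), key=lambda i: pvalues[i])
        if pvalues[worst] <= significance:
            break  # every remaining feature is significant
        del columns[worst]  # drop the least significant feature, then refit
    return columns


# Stand-in for a real model fit, with made-up P values purely for demonstration.
def fake_fit(columns):
    made_up = {0: 0.80, 1: 0.30, 2: 0.001, 3: 0.02}
    return [made_up[c] for c in columns]


print(backward_eliminate(4, fake_fit))  # [2, 3]
```

With the objects in this notebook, `fit_pvalues` could be something like `lambda cols: regression(exog=x_train[:, cols], endog=y_train).fit().pvalues`, which relies on the `pvalues` attribute of a fitted statsmodels result.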

Here, x1 (the first state dummy column, see the summary above) has a P value of 0.920 (92%), far above the significance level. I remove it from the training and test sets, refit the model with the remaining features, and check the P values again.

In [9]:
x_train = x_train[:,[1,2,3,4]]
x_test = x_test[:,[1,2,3,4]]
regressor = regression(exog=x_train, endog=y_train).fit()
regressor.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 7.965 |
| Model: | OLS | Adj. R-squared: | 8.739 |
| Method: | Least Squares | F-statistic: | -10.29 |
| Date: | Thu, 07 Sep 2017 | Prob (F-statistic): | 1.0 |
| Time: | 08:40:57 | Log-Likelihood: | -432.44 |
| No. Observations: | 40 | AIC: | 872.9 |
| Df Residuals: | 36 | BIC: | 879.6 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 4811.4168 | 4211.802 | 1.142 | 0.261 | -3730.514 | 1.34e+04 |
| x2 | 0.6836 | 0.067 | 10.206 | 0.000 | 0.548 | 0.819 |
| x3 | 0.3430 | 0.033 | 10.358 | 0.000 | 0.276 | 0.410 |
| x4 | 0.0742 | 0.023 | 3.281 | 0.002 | 0.028 | 0.120 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.178 | Durbin-Watson: | 1.958 |
| Prob(Omnibus): | 0.337 | Jarque-Bera (JB): | 1.678 |
| Skew: | -0.501 | Prob(JB): | 0.432 |
| Kurtosis: | 2.961 | Cond. No. | 576000.0 |


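After this iteration, the three spending features (x2, x3 and x4) all have P values below the significance level. The remaining state dummy, x1, still has a P value of 0.261, which is above 5%, so strictly following the steps above calls for one more iteration that drops it before the model is used for prediction.

Note also that both summaries report an R-squared above 1 and a negative F-statistic. Neither is a valid value for a least-squares fit, so those two figures should not be trusted. A possible cause is the integer cast applied to the inputs in Step 3, so refitting with floating-point data is worth trying. statsmodels' `OLS` also does not add an intercept column on its own, so the model here is fitted without one.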