# Multiple linear regression (intuition)

# 5 Assumptions of linear regression
<ol>
<li><font color="blue">Linearity:</font> the target (y) has a linear relationship with the features (xi). Check: plot the errors against the predicted y; the points should be distributed symmetrically around a horizontal line at zero, with no curve or trend.</li>
<li><font color="blue">Independence:</font> the errors are not correlated with one another. Check: for time series data, plot the errors over time and look for non-random patterns.</li>
<li><font color="blue">Normality:</font> the errors are normally distributed. Check: plot a histogram of the errors and look for a roughly bell-shaped distribution.</li>
<li><font color="blue">Homoskedasticity:</font> the variance of the error term is constant across values of the target and features. Check: plot the errors against the predicted y and look for a constant vertical spread (no funnel shape).</li>
<li><font color="blue">No multicollinearity:</font> the features are not highly correlated with one another. Check: look for correlations above ~0.8 between features.</li>
</ol>
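The checks above can be sketched numerically. This is a minimal illustration on synthetic data, not part of the original project: the helper `high_correlation_pairs`, the column names `x1`..`x3`, and the 0.8 threshold default are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(features: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |r| above threshold."""
    corr = features.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1, so multicollinear
x3 = rng.normal(size=n)
features = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 3.0 + 2.0 * x1 - 1.0 * x3 + rng.normal(scale=0.5, size=n)

# Fit by ordinary least squares and compute the errors (residuals).
design = np.column_stack([np.ones(n), features.to_numpy()])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coefs
residuals = y - fitted

flagged = high_correlation_pairs(features)
print(flagged)                 # feature pairs that violate the multicollinearity check
print(abs(residuals.mean()))   # close to zero for an OLS fit with an intercept
```

For the visual checks, `plt.scatter(fitted, residuals)` covers linearity and homoskedasticity, and `plt.hist(residuals)` covers normality.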

# Dummy variable trap
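Creating one dummy column per category makes one of them redundant (with two categories, D1 = 1 - D2), so the dummies are perfectly collinear with the intercept. Always omit one dummy variable per categorical feature: 9 categories need only 8 dummy columns.
Neither pandas nor sklearn drops it by default (pd.get_dummies needs drop_first=True). In practice sklearn's LinearRegression still returns the same predictions with the redundant column, but statsmodels' coefficient tests need it removed by hand.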
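A small sketch of the redundancy and of one way to avoid it with pandas. The four-row `states` frame is made up for the example; `drop_first` and `dtype` are existing parameters of `pd.get_dummies`.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical data).
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# One dummy per category: every row sums to 1, which duplicates the intercept column.
all_dummies = pd.get_dummies(states, dtype=int)
print(all_dummies.sum(axis=1).tolist())   # [1, 1, 1, 1]

# Dropping the first category removes the redundancy: 3 categories -> 2 dummies.
reduced = pd.get_dummies(states, drop_first=True, dtype=int)
print(list(reduced.columns))              # ['State_Florida', 'State_New York']
```

The dropped category (here California) becomes the baseline that the remaining dummy coefficients are measured against.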

# Significance level (statistical significance)
<ul>
    <li>The probability of rejecting the null hypothesis in a statistical test when it is actually true (a false positive).</li>
    <li>The null hypothesis in the model is: the feature has no effect on the target (its coefficient is zero).</li>
    <li>A value of alpha = 0.05 is most often used as the threshold for statistical significance.</li>
    <li>Significance level = 1 - confidence level (alpha = 0.05 corresponds to a 95% confidence interval).</li>
</ul>

# P-value
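The probability of observing a result at least as extreme as the one in the data, assuming the null hypothesis (this feature does not help the model) is true. It is compared with the significance level to decide whether to reject the null hypothesis.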
<ul>
    <li>p-value > significance level : fail to reject the null hypothesis and leave this feature out.</li>
    <li>p-value < significance level : reject the null hypothesis and use this feature.</li>
</ul>
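As a worked example, the two-sided p-value of a coefficient can be recomputed from its t-statistic. The sketch below uses two t-values from the first OLS summary in the project part of this notebook (Administration: t = 0.495, R&D Spend: t = 14.025, both with 34 residual degrees of freedom); the helper name `two_sided_p_value` is an assumption made for the example.

```python
from scipy import stats


def two_sided_p_value(t_stat: float, df_resid: int) -> float:
    """Probability of a |t| at least this large if the true coefficient is zero."""
    return 2.0 * stats.t.sf(abs(t_stat), df_resid)


alpha = 0.05  # significance level

p_admin = two_sided_p_value(0.495, 34)
p_rd = two_sided_p_value(14.025, 34)

for name, p in [("Administration", p_admin), ("R&D Spend", p_rd)]:
    if p < alpha:
        decision = "reject H0, keep the feature"
    else:
        decision = "fail to reject H0, candidate for removal"
    print(f"{name}: p = {p:.3f} -> {decision}")
```

Because the t-values are rounded in the summary table, the recomputed p-values match the table only to about three decimals.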

# Methods of building models (feature selection)
<ol>
    <li>All in</li>
    <li>Backward elimination</li>
    <li>Forward selection</li>
    <li>Bidirectional elimination</li>
    <li>All possible models (Score comparison)</li>
</ol>
Look at the steps for each one in the PDF
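The last method, all possible models, can be sketched as a brute-force loop over feature subsets scored by adjusted R-squared. This is an illustration on synthetic data: the helper `adjusted_r2`, the feature names `a`..`d`, and the choice of adjusted R-squared as the comparison score are assumptions, and the PDF may use a different criterion.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X: np.ndarray, y: np.ndarray) -> float:
    """Adjusted R-squared of an OLS fit with intercept on the columns of X."""
    n, k = X.shape  # k counts features only, not the intercept
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Synthetic data: only features "a" and "c" influence the target.
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 4))
y = 5.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)
names = ["a", "b", "c", "d"]

# Score every non-empty subset of features.
scores = {}
for size in range(1, len(names) + 1):
    for subset in combinations(range(len(names)), size):
        key = tuple(names[i] for i in subset)
        scores[key] = adjusted_r2(X[:, list(subset)], y)

best = max(scores, key=scores.get)
print(len(scores))  # 2**4 - 1 = 15 candidate models
print(best, round(scores[best], 3))
```

The number of candidate models doubles with every extra feature, which is why this approach is only practical for small feature sets.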

# Significance level and p-value in the model
<ul>
    <li>We need p-value < significance level to conclude that a feature helps the model.</li>
    <li>If the p-value is less than your significance level (alpha), the result of the hypothesis test is statistically significant.</li>
    <li>If the confidence interval does not contain the null hypothesis value, the results are statistically significant.</li>
    <li>If the p-value is less than alpha, the (1 - alpha) confidence interval will not contain the null hypothesis value (reject H0).</li>
</ul>

# >>>>> Multiple linear regression project <<<<<

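# What sklearn's LinearRegression tolerates without extra preprocessing
<ul>
    <li>Dummy trap: in practice the predictions are the same whether or not the redundant dummy column is dropped.</li>
    <li>Feature scaling: ordinary least squares predictions do not change when the features are rescaled, so no scaling step is needed.</li>
</ul>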
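One way to see why neither step changes what LinearRegression predicts is to fit the same synthetic data three ways and compare the fitted values. This is a sketch under assumptions: the `spend` and `state` columns and their values are invented, and the agreement holds up to floating-point tolerance rather than exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): one numeric feature and one categorical feature.
rng = np.random.default_rng(3)
n = 80
frame = pd.DataFrame({
    "spend": rng.uniform(10.0, 200.0, size=n),
    "state": rng.choice(["California", "Florida", "New York"], size=n),
})
target = 50.0 + 0.8 * frame["spend"] + rng.normal(scale=5.0, size=n)

full = pd.get_dummies(frame, dtype=float)                      # keeps all three dummies
reduced = pd.get_dummies(frame, drop_first=True, dtype=float)  # drops one dummy

pred_full = LinearRegression().fit(full, target).predict(full)
pred_reduced = LinearRegression().fit(reduced, target).predict(reduced)

# Standardizing the features changes the coefficients but not the predictions.
scaled = StandardScaler().fit_transform(reduced)
pred_scaled = LinearRegression().fit(scaled, target).predict(scaled)

print(np.allclose(pred_full, pred_reduced))
print(np.allclose(pred_reduced, pred_scaled))
```

The coefficients themselves do differ between the three fits, so the redundant dummy still matters as soon as individual coefficients or p-values are interpreted.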

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# reading Data
data = pd.read_csv('data/50_Startups.csv')
data.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [3]:
# General exploration 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          50 non-null float64
Administration     50 non-null float64
Marketing Spend    50 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.0+ KB


In [4]:
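# Separate the independent variables (features) from the dependent variable (target)
IV = data.iloc[:, :-1]   # every column except Profit
DV = data.iloc[:, -1]    # Profit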

In [5]:
# Preparation: convert categorical variables into dummies
# (dtype=int keeps the dummies as 0/1; newer pandas versions default to booleans)
IV = pd.get_dummies(IV, dtype=int)
IV.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_California,State_Florida,State_New York
0,165349.2,136897.8,471784.1,0,0,1
1,162597.7,151377.59,443898.53,1,0,0
2,153441.51,101145.55,407934.54,0,1,0
3,144372.41,118671.85,383199.62,0,0,1
4,142107.34,91391.77,366168.42,0,1,0


In [6]:
# Split the data into train/test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(IV, DV, test_size=0.2, random_state=0)

In [7]:
# build  regression model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [8]:
# predict the profit value using the model
y_pred = regressor.predict(x_test)

In [9]:
# compare y_pred with y_test side by side to get a feel for how well the model fits
dic={'y_test':y_test,'y_pred': np.round(y_pred,2)}
print(pd.DataFrame(dic).head(10))

       y_test     y_pred
28  103282.38  103015.20
11  144259.40  132582.28
10  146121.95  132447.74
41   77798.83   71976.10
2   191050.39  178537.48
27  105008.31  116161.24
38   81229.06   67851.69
31   97483.56   98791.73
22  110352.25  113969.44
4   166187.94  167921.07
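The comparison printed above can be summarised with two numbers. With 50 rows and test_size=.2 the test split holds 10 rows, so the table is the whole test set. The sketch below copies those values and computes R-squared and the mean absolute error with NumPy; `sklearn.metrics.r2_score` and `mean_absolute_error` would be the library equivalents.

```python
import numpy as np

# Actual and predicted profit for the ten test rows printed above.
y_true = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_hat = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                  116161.24, 67851.69, 98791.73, 113969.44, 167921.07])

# R-squared: share of the variance in y_true explained by the predictions.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Mean absolute error, in the same units as Profit.
mae = np.mean(np.abs(y_true - y_hat))

print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```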


# Improve the model by one of the feature selection techniques<font color="blue"> (Backward elimination)</font>
 

In [10]:
# handle the dummy trap: statsmodels does not do it for us, so drop the last dummy column (State_New York)
x_train = x_train.iloc[:,:-1]
x_test = x_test.iloc[:,:-1]

In [11]:
# statsmodels' OLS does not add an intercept on its own, so add a column of ones (X0): Y = B0*X0 + B1*X1 + B2*X2 + ...
x_train.insert(0,'X0',np.ones(x_train.shape[0]))

In [12]:
import statsmodels.api as sm  # OLS is exposed by statsmodels.api; recent versions no longer expose it in statsmodels.formula.api
regressor_ols = sm.OLS(endog=y_train,exog=x_train).fit()  # OLS : ordinary least squares 
regressor_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.943
Method:,Least Squares,F-statistic:,129.7
Date:,"Thu, 13 Dec 2018",Prob (F-statistic):,3.91e-21
Time:,02:48:40,Log-Likelihood:,-421.1
No. Observations:,40,AIC:,854.2
Df Residuals:,34,BIC:,864.3
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
X0,4.325e+04,8315.816,5.201,0.000,2.64e+04,6.02e+04
R&D Spend,0.7735,0.055,14.025,0.000,0.661,0.886
Administration,0.0329,0.066,0.495,0.624,-0.102,0.168
Marketing Spend,0.0366,0.019,1.884,0.068,-0.003,0.076
State_California,-699.3691,3661.563,-0.191,0.850,-8140.560,6741.822
State_Florida,-1658.6532,4209.221,-0.394,0.696,-1.02e+04,6895.513

0,1,2,3
Omnibus:,15.823,Durbin-Watson:,2.468
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23.231
Skew:,-1.094,Prob(JB):,9.03e-06
Kurtosis:,6.025,Cond. No.,1480000.0


In [13]:
# find the highest p-value: if it is > SL, remove that column and fit the model again; otherwise the model is ready
# the highest p-value is State_California (0.850) > 0.05: remove it
x_train.drop(labels=['State_California'],axis=1,inplace=True)

In [14]:
# fit the model again
regressor_ols = sm.OLS(endog=y_train,exog=x_train).fit()  # OLS : ordinary least squares 
regressor_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.944
Method:,Least Squares,F-statistic:,166.7
Date:,"Thu, 13 Dec 2018",Prob (F-statistic):,2.87e-22
Time:,02:48:40,Log-Likelihood:,-421.12
No. Observations:,40,AIC:,852.2
Df Residuals:,35,BIC:,860.7
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
X0,4.292e+04,8020.397,5.352,0.000,2.66e+04,5.92e+04
R&D Spend,0.7754,0.053,14.498,0.000,0.667,0.884
Administration,0.0319,0.065,0.488,0.629,-0.101,0.165
Marketing Spend,0.0363,0.019,1.902,0.065,-0.002,0.075
State_Florida,-1272.1608,3639.780,-0.350,0.729,-8661.308,6116.986

0,1,2,3
Omnibus:,16.074,Durbin-Watson:,2.467
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24.553
Skew:,-1.086,Prob(JB):,4.66e-06
Kurtosis:,6.164,Cond. No.,1430000.0


In [15]:
# the highest p-value is now State_Florida (0.729) > 0.05: remove it and fit the model again
x_train.drop(labels=['State_Florida'],axis=1,inplace=True)
regressor_ols = sm.OLS(endog=y_train,exog=x_train).fit()  # OLS : ordinary least squares 
regressor_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,227.8
Date:,"Thu, 13 Dec 2018",Prob (F-statistic):,1.85e-23
Time:,02:48:41,Log-Likelihood:,-421.19
No. Observations:,40,AIC:,850.4
Df Residuals:,36,BIC:,857.1
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
X0,4.299e+04,7919.773,5.428,0.000,2.69e+04,5.91e+04
R&D Spend,0.7788,0.052,15.003,0.000,0.674,0.884
Administration,0.0294,0.064,0.458,0.650,-0.101,0.160
Marketing Spend,0.0347,0.018,1.896,0.066,-0.002,0.072

0,1,2,3
Omnibus:,15.557,Durbin-Watson:,2.481
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.539
Skew:,-1.081,Prob(JB):,1.28e-05
Kurtosis:,5.974,Cond. No.,1430000.0


In [16]:
# the highest p-value is now Administration (0.650) > 0.05: remove it and fit the model again
x_train.drop(labels=['Administration'],axis=1,inplace=True)
regressor_ols = sm.OLS(endog=y_train,exog=x_train).fit()  # OLS : ordinary least squares 
regressor_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.947
Method:,Least Squares,F-statistic:,349.0
Date:,"Thu, 13 Dec 2018",Prob (F-statistic):,9.65e-25
Time:,02:48:41,Log-Likelihood:,-421.3
No. Observations:,40,AIC:,848.6
Df Residuals:,37,BIC:,853.7
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
X0,4.635e+04,2971.236,15.598,0.000,4.03e+04,5.24e+04
R&D Spend,0.7886,0.047,16.846,0.000,0.694,0.883
Marketing Spend,0.0326,0.018,1.860,0.071,-0.003,0.068

0,1,2,3
Omnibus:,14.666,Durbin-Watson:,2.518
Prob(Omnibus):,0.001,Jarque-Bera (JB):,20.582
Skew:,-1.03,Prob(JB):,3.39e-05
Kurtosis:,5.847,Cond. No.,497000.0


In [17]:
# the highest p-value is now Marketing Spend (0.071) > 0.05: remove it and fit the model again
x_train.drop(labels=['Marketing Spend'],axis=1,inplace=True)
regressor_ols = sm.OLS(endog=y_train,exog=x_train).fit()  # OLS : ordinary least squares 
regressor_ols.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.945
Model:,OLS,Adj. R-squared:,0.944
Method:,Least Squares,F-statistic:,652.4
Date:,"Thu, 13 Dec 2018",Prob (F-statistic):,1.56e-25
Time:,02:48:41,Log-Likelihood:,-423.09
No. Observations:,40,AIC:,850.2
Df Residuals:,38,BIC:,853.6
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
X0,4.842e+04,2842.717,17.032,0.000,4.27e+04,5.42e+04
R&D Spend,0.8516,0.033,25.542,0.000,0.784,0.919

0,1,2,3
Omnibus:,13.132,Durbin-Watson:,2.325
Prob(Omnibus):,0.001,Jarque-Bera (JB):,16.254
Skew:,-0.991,Prob(JB):,0.000295
Kurtosis:,5.413,Cond. No.,157000.0


# Forward selection 
<ol>
    <li>Fit a simple regression of the target on each feature separately.</li>
    <li>Pick the feature with the lowest p-value; if that p-value < SL, keep it.</li>
    <li>Fit every model made of the features selected so far plus one additional feature.</li>
    <li>Again pick the added feature with the lowest p-value and repeat; stop when the lowest p-value >= SL and keep the previous model as the final one.</li>
</ol>

# Bidirectional elimination
<ol>
<li>Perform one step of forward selection: a new feature enters only if its p-value < SL.</li>
<li>Perform all possible steps of backward elimination on the selected features, then go back to step 1. When no feature can be added or removed, the final model is ready.</li>
</ol>

Done :) 