### Using Bootstrapping to Estimate the Accuracy of Coefficients

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model  # explicit import instead of a wildcard
%matplotlib inline
plt.style.use('ggplot')
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels as sm

In [7]:
# Loading the data set; '?' marks missing values in Auto.csv
df = pd.read_csv("Auto.csv", na_values='?')

### Data set Description

**Variable** | **Description** | **Type**
---|---|---
mpg|Miles per gallon|Float
cylinders|Number of cylinders, between 4 and 8|Integer
displacement|Engine displacement (cu. inches)|Float
horsepower|Horsepower|Integer
weight|Vehicle weight (lbs)|Integer
acceleration|Time to accelerate from 0 to 60 mph (secs)|Float
year|Model year|Integer
origin|Origin of car (1 = American, 2 = European, 3 = Japanese)|Qualitative
name|Vehicle name|String

In [8]:
# Checking for Null values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      392 non-null float64
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.0+ KB


In [9]:
# Drop the 5 rows with a missing horsepower value (397 -> 392 rows)
df = df.dropna(axis=0)

### Approach
- Fit a simple linear regression with one predictor using statsmodels, which calculates the coefficient estimates along with their standard errors and confidence intervals
- Use bootstrapping with sklearn to estimate the same standard errors, and compare the two sets of estimates
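
For reference, the quantity the bootstrap step estimates can be written as follows. The notation is introduced here for illustration and is not from the notebook: $B$ is the number of resamples and $\hat\beta_j^{*b}$ is the coefficient fitted on resample $b$.

$$
\widehat{\mathrm{SE}}_B(\hat\beta_j) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\beta_j^{*b} - \bar\beta_j^{*}\right)^2},
\qquad
\bar\beta_j^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat\beta_j^{*b}
$$

Note that `np.std` divides by $B$ rather than $B-1$ by default (`ddof=0`); with $B = 1000$ the difference is negligible.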

In [13]:
model_sm = smf.ols(formula='mpg~horsepower',data=df).fit()

In [14]:
model_sm.summary()

Dep. Variable:,mpg,R-squared:,0.606
Model:,OLS,Adj. R-squared:,0.605
Method:,Least Squares,F-statistic:,599.7
Date:,"Mon, 26 Sep 2016",Prob (F-statistic):,7.03e-81
Time:,05:48:59,Log-Likelihood:,-1178.7
No. Observations:,392,AIC:,2361.0
Df Residuals:,390,BIC:,2369.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,39.9359,0.717,55.660,0.000,38.525 41.347
horsepower,-0.1578,0.006,-24.489,0.000,-0.171 -0.145

Omnibus:,16.432,Durbin-Watson:,0.92
Prob(Omnibus):,0.0,Jarque-Bera (JB):,17.305
Skew:,0.492,Prob(JB):,0.000175
Kurtosis:,3.299,Cond. No.,322.0
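
The `std err` column in the summary comes from the classical OLS formula, which estimates the coefficient covariance as $\hat\sigma^2 (X^\top X)^{-1}$ under the assumption of constant-variance noise. The sketch below reproduces that calculation with NumPy only. It is an illustration, not the statsmodels implementation: the function name `ols_with_classical_se` is hypothetical, and the data is synthetic, standing in for `Auto.csv`.

```python
import numpy as np

def ols_with_classical_se(x, y):
    """Fit y = b0 + b1*x by least squares.

    Returns (coefficients, standard errors), each ordered [intercept, slope].
    Standard errors use the classical formula sigma^2 * (X'X)^-1, which
    assumes homoscedastic, uncorrelated noise.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept column
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]                      # n - 2 residual degrees of freedom
    sigma2 = resid @ resid / dof                   # estimate of the noise variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

# Synthetic stand-in data (assumed values, loosely shaped like mpg ~ horsepower)
rng = np.random.default_rng(0)
x_demo = rng.uniform(50, 200, size=200)
y_demo = 40.0 - 0.15 * x_demo + rng.normal(0.0, 4.0, size=200)

beta_demo, se_demo = ols_with_classical_se(x_demo, y_demo)
print("coefficients:", beta_demo)
print("std errors:  ", se_demo)
```

On a fitted statsmodels result the same quantities are available directly as `model_sm.bse` and `model_sm.conf_int()`.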


In [27]:
# Using sklearn to calculate the coefficients on the full data set
X = df[['horsepower']]
Y = df.mpg
a = linear_model.LinearRegression(fit_intercept=True).fit(X, Y)
print("The value calculated for Beta_1 by SK learn is %r-" % a.coef_)
print("The value calculated for Beta_0 by SK learn is %r-" % a.intercept_)


The value calculated for Beta_1 by SK learn is array([-0.15784473])-
The value calculated for Beta_0 by SK learn is 39.935861021170467-


In [28]:
# Bootstrap the regression to estimate the standard errors of the coefficients,
# which can then be used to calculate the confidence intervals

# empty lists to hold the values of the coefficients
beta_0 = []
beta_1 = []
# performing bootstrapping 1000 times
for i in range(1000):
    # resample the rows with replacement, keeping the original sample size
    df_bootstrapped = df.sample(frac=1, replace=True)
    X = df_bootstrapped[['horsepower']]
    Y = df_bootstrapped.mpg
    temp_model = linear_model.LinearRegression().fit(X, Y)
    beta_0.append(temp_model.intercept_)
    beta_1.append(temp_model.coef_[0])  # coef_ is a one-element array

In [30]:
print("The std err for Beta_0 is %r" % np.std(beta_0))
print("The std err for Beta_1 is %r" % np.std(beta_1))

The std err for Beta_0 is 0.84890978199449685
The std err for Beta_1 is 0.0072787625688362223


In [49]:
(a.intercept_)-2*np.std(beta_0)

38.238041457181474

In [61]:
# Approximate 95% intervals: full-sample estimate +/- 2 bootstrap standard errors (2 approximates 1.96)
Beta_0_conf_lower_lim = a.intercept_ - 2*np.std(beta_0)
Beta_0_conf_upper_lim = a.intercept_ + 2*np.std(beta_0)

# a.coef_ is a one-element array, so take its first entry
Beta_1_conf_lower_lim = a.coef_[0] - 2*np.std(beta_1)
Beta_1_conf_upper_lim = a.coef_[0] + 2*np.std(beta_1)

print(" The 95 percent conf interval for intercept is  [%r , %r]" %(Beta_0_conf_lower_lim,Beta_0_conf_upper_lim))
print(" The 95 percent conf interval for Coefficient is  [%f , %f]" %(Beta_1_conf_lower_lim,Beta_1_conf_upper_lim))

 The 95 percent conf interval for intercept is  [38.238041457181474 , 41.63368058515946]
 The 95 percent conf interval for Coefficient is  [-0.172402 , -0.143287]
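
The intervals above are built as estimate plus or minus two bootstrap standard errors, which leans on the bootstrap distribution being roughly normal. A common alternative is the percentile interval, which reads the 2.5th and 97.5th percentiles straight off the resampled coefficients. The sketch below is a swapped-in variant, not what the notebook does: it uses `np.polyfit` instead of sklearn, the helper name `bootstrap_coefs` is hypothetical, and the data is synthetic rather than `Auto.csv`.

```python
import numpy as np

def bootstrap_coefs(x, y, n_boot=1000, seed=0):
    """Return an (n_boot, 2) array of [intercept, slope] fitted on resampled rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    coefs = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # row indices drawn with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        coefs[b] = (intercept, slope)
    return coefs

# Synthetic stand-in data (assumed values, loosely shaped like mpg ~ horsepower)
rng = np.random.default_rng(1)
x_demo = rng.uniform(50, 200, size=300)
y_demo = 40.0 - 0.15 * x_demo + rng.normal(0.0, 4.0, size=300)

coefs = bootstrap_coefs(x_demo, y_demo)
# Percentile interval: rows are [lower, upper], columns are [intercept, slope]
lower, upper = np.percentile(coefs, [2.5, 97.5], axis=0)
print("intercept: [%.3f, %.3f]" % (lower[0], upper[0]))
print("slope:     [%.4f, %.4f]" % (lower[1], upper[1]))
```

With the notebook's own lists, the equivalent call would be `np.percentile(beta_1, [2.5, 97.5])`.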


#### Conclusion:
- The bootstrap standard errors are slightly larger than the ones reported in the statsmodels OLS summary: about 0.849 vs 0.717 for the intercept and 0.0073 vs 0.006 for the horsepower coefficient. The sklearn linear model only supplies the point estimates, which match OLS.
- There is a general inclination to lean towards the estimates given by bootstrapping, for the following reasons:
    - The OLS formulas rely on certain assumptions (regarding the noise and the estimate of its variance) which the bootstrap does not
    - OLS here would assume that the Xs are fixed, which in reality they are not
    - With a bootstrapped approach you can generate new samples as and when needed
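
To make the point about noise assumptions concrete, the sketch below compares the two slope standard errors on synthetic data where the noise variance is deliberately not constant. Everything here is an assumption for illustration (the data-generating values, the noise pattern, the use of `np.polyfit`); it is not an analysis of `Auto.csv`. When the noise spread grows with the distance of `x` from its centre, the classical formula tends to understate the slope's standard error, while the row-resampling bootstrap picks up the extra variability.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(50, 200, size=n)

# Heteroscedastic noise: spread grows with distance from the centre of x
# (an assumed pattern, chosen to violate the constant-variance assumption)
noise_sd = 0.1 * np.abs(x - 125.0)
y = 40.0 - 0.15 * x + rng.normal(0.0, 1.0, size=n) * noise_sd

# Classical slope standard error: sqrt(sigma^2 / sum((x - mean(x))^2))
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)
sigma2 = resid @ resid / (n - 2)
classical_se = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

# Bootstrap slope standard error: refit on rows resampled with replacement
n_boot = 1000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot_slopes[b] = np.polyfit(x[idx], y[idx], deg=1)[0]
bootstrap_se = boot_slopes.std(ddof=1)

print("classical SE: %.4f, bootstrap SE: %.4f" % (classical_se, bootstrap_se))
```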
