#### Pair Problem

Let's bootstrap! Recall that bootstrapping is sampling with replacement from a dataset to produce a new dataset of the same size. Random forests use it (each tree is fit on a bootstrap sample) to guard against overfitting. It also has wide application in many other areas of statistics; let's see two of them.

1) Produce a bootstrapped estimate of the median and 95 percent confidence interval over the median of the dependent variable in the [attached dataset](Boot.csv).

2) Use the attached data to run the linear model y = xb.  Produce bootstrapped estimates of the model parameters, b, and a 95% confidence interval over them.

**Jeremy created this problem. If you need more explanation or guidance, he said that he'll be at his desk and you can just ask him.**


# 1) Produce a bootstrapped estimate of the median and 95 percent confidence interval over the median of the dependent variable in the [attached dataset](Boot.csv).

In [152]:
import pandas as pd
import numpy as np


### Test Case - Import data

In [153]:
Boot = pd.read_csv("Boot.csv")

### Test Case - Do I have the right form for random sampling?

In [154]:
import string
alphabet = list(string.ascii_lowercase)
print type(alphabet)
print len(alphabet)
np.random.choice( alphabet, 26, replace = True)

<type 'list'>
26


array(['p', 'y', 'r', 'k', 'r', 'e', 'l', 'n', 'r', 's', 'z', 'w', 'e',
       'f', 'x', 'c', 'w', 'w', 'm', 'n', 'u', 'm', 'y', 'v', 'k', 'd'], 
      dtype='|S1')
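
The same check can be made repeatable with a seeded generator. The sketch below is only an illustration: it uses Python 3 syntax and `np.random.default_rng`, and the seed and the stand-in array `data` are assumptions, not part of the original notebook. It shows that a resample drawn with replacement has the same size as the original but contains repeats, so only about 63% of the original values tend to appear in it.

```python
import numpy as np

# Assumption: a seeded Generator is used only so the draw is repeatable.
rng = np.random.default_rng(0)

# Stand-in dataset of 1000 distinct values (hypothetical, not Boot.csv).
data = np.arange(1000)

# Bootstrap resample: same size as the original, drawn with replacement.
resample = rng.choice(data, size=data.size, replace=True)

# Fraction of the original values that appear at least once in the resample.
unique_fraction = np.unique(resample).size / data.size
print(resample.size, round(unique_fraction, 3))
```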

### Apply to the given problem

In [155]:
y = Boot["y"]
n = len(y)
# print n
# print type(y)

In [156]:
dict_of_medians = {}
list_of_medians = []

for i in range(1,101):
    #print i,
    random_sample_of_y = np.random.choice( y, n, replace = True)
    median_of_random_sample = np.median(random_sample_of_y)
    list_of_medians.append(median_of_random_sample)
    dict_of_medians[i] = (median_of_random_sample, random_sample_of_y)

bootstrap_estimate_of_y = np.mean(list_of_medians)
print "Bootstrapped Estimate:" , bootstrap_estimate_of_y

Bootstrapped Estimate: -0.396238273902


In [157]:
print "Actual Median of Y:", np.median(y)

Actual Median of Y: -0.407439743474
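
Comparing the mean of the bootstrapped medians with the sample median is, in effect, a bootstrap estimate of bias, and the spread of the bootstrapped medians estimates the standard error of the median. A minimal sketch of both, assuming Python 3, a synthetic stand-in `y_demo` for `Boot["y"]`, and an arbitrary choice of 2000 replicates:

```python
import numpy as np

rng = np.random.default_rng(42)        # assumed seed, for repeatability only
y_demo = rng.normal(size=1000)         # synthetic stand-in for Boot["y"]

n_boot = 2000                          # assumed number of bootstrap replicates
boot_medians = np.array([
    np.median(rng.choice(y_demo, size=y_demo.size, replace=True))
    for _ in range(n_boot)
])

sample_median = np.median(y_demo)
# Standard deviation of the replicates = bootstrap standard error of the median.
boot_se = boot_medians.std(ddof=1)
# Mean of the replicates minus the sample statistic = bootstrap bias estimate.
boot_bias = boot_medians.mean() - sample_median

print("sample median:", sample_median)
print("bootstrap SE:", boot_se)
print("bootstrap bias:", boot_bias)
```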


In [158]:
import scipy.stats as st

st.t.interval(0.95, len(list_of_medians)-1, loc=np.mean(list_of_medians), scale=st.sem(list_of_medians))
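# Arguments: 0.95 is the confidence level and len(list_of_medians)-1 is the
# degrees of freedom.

# loc=np.mean(list_of_medians) is the mean of the bootstrapped medians.

# scale=st.sem(list_of_medians) is the standard error of the mean
# of the values in the input array.

# Note: this is a t-interval for the mean of the bootstrapped medians, so it
# narrows as the number of replicates grows. A percentile interval such as
# np.percentile(list_of_medians, [2.5, 97.5]) reflects the sampling
# variability of the median itself.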

(-0.41045583396828267, -0.38202071383642711)

In [159]:
import statsmodels.stats.api as sms

sms.DescrStatsW(list_of_medians).tconfint_mean()

(-0.41045583396828261, -0.38202071383642705)
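
Both intervals above are t-intervals for the mean of the bootstrapped medians. A common alternative for a bootstrap confidence interval over the median itself is the percentile method, which reads the 2.5th and 97.5th percentiles directly off the bootstrap distribution. The sketch below assumes Python 3; the helper name `bootstrap_percentile_ci`, its defaults, and the synthetic `y_demo` are hypothetical and not part of the original notebook.

```python
import numpy as np

def bootstrap_percentile_ci(data, stat=np.median, n_boot=2000, level=0.95, seed=0):
    """Return (estimate, lower, upper) using the percentile bootstrap.

    Hypothetical helper: name, defaults and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Recompute the statistic on n_boot resamples of the same size as the data.
    replicates = np.array([
        stat(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    # Cut off (1 - level) / 2 of the bootstrap distribution in each tail.
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(replicates, [tail, 100 - tail])
    return stat(data), lower, upper

# Synthetic stand-in for Boot["y"] (assumption: the real data is not loaded here).
y_demo = np.random.default_rng(1).normal(size=1000)
estimate, lower, upper = bootstrap_percentile_ci(y_demo)
print("median:", estimate, "95% CI:", (lower, upper))
```

With the notebook's data, the call would be `bootstrap_percentile_ci(Boot["y"])`. Unlike the t-interval on the mean of the replicates, the width of this interval does not shrink as `n_boot` grows.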

# 2) Use the attached data to run the linear model y = xb. Produce bootstrapped estimates of the model parameters, b, and a 95% confidence interval over them.

In [160]:
#from sklearn.linear_model import LinearRegression
#LinearRegression(fit_intercept=False,)
import statsmodels.api as sm

In [161]:
X = Boot.copy()
del X["y"]
Boot.head(2)

|   | x_1 | x_2 | x_3 | x_4 | x_5 | Intercept | y |
|---|-----|-----|-----|-----|-----|-----------|---|
| 0 | 0.485176 | 0.809888 | -0.900321 | -0.024176 | -1.173217 | 1 | -1.517941 |
| 1 | 0.187372 | 2.440924 | 1.417704 | -1.032075 | 0.09003 | 1 | -0.919732 |


In [162]:
X.head(2)

|   | x_1 | x_2 | x_3 | x_4 | x_5 | Intercept |
|---|-----|-----|-----|-----|-----|-----------|
| 0 | 0.485176 | 0.809888 | -0.900321 | -0.024176 | -1.173217 | 1 |
| 1 | 0.187372 | 2.440924 | 1.417704 | -1.032075 | 0.09003 | 1 |


### Doing the basic just for proof of concept

In [163]:
model = sm.OLS(y,X).fit()
print model.params

x_1         -1.411328
x_2         -0.242640
x_3          0.551710
x_4          0.432448
x_5          0.253018
Intercept   -0.364552
dtype: float64
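
For intuition about what the fit computes, the linear model y = Xb can also be solved as a plain least-squares problem. This is a sketch on synthetic data, assuming Python 3; the coefficients in `true_b`, the seed, and the array names are hypothetical. The column of ones plays the role of the `Intercept` column in the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed, for repeatability only
n_obs = 1000
true_b = np.array([-1.5, 0.5, 1.0])     # hypothetical coefficients; last is the intercept

# Design matrix: two random predictors plus a constant column for the intercept.
X_demo = np.column_stack([rng.normal(size=(n_obs, 2)), np.ones(n_obs)])
y_demo = X_demo @ true_b + rng.normal(scale=1.0, size=n_obs)

# Least-squares solution of y = Xb; lstsq returns (solution, residuals, rank, sv).
b_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(b_hat)
```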


### Perform random sampling for the estimates

In [165]:
n

1000

In [166]:
Boot.shape

(1000, 7)

In [167]:
dict_of_coeff_estimates = {}
list_of_beta1_estimates = []
list_of_beta2_estimates = []
list_of_beta3_estimates = []
list_of_beta4_estimates = []
list_of_beta5_estimates = []
list_of_intercept_estimates = []

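for i in range(1,20):   # 19 bootstrap replicates
    #print i,
    # Random sample of our observations ("Boot"), drawn with replacement.
    # Caution: n = 5 draws only 5 rows per replicate, fewer than the 6 parameters
    # being fit; a standard bootstrap resample has len(Boot) rows.
    random_sample_of_observations = Boot.sample(n = 5, replace = True)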
    #print random_sample_of_observations
    
    # Separate out X & y from each random sample:
    y = random_sample_of_observations["y"]
    X = random_sample_of_observations.copy()
    del X["y"]

    # Estimate the coefficients for the given random sample and record each by name:
    model = sm.OLS(y,X).fit()
    #print model.params
    list_of_beta1_estimates.append(model.params["x_1"])
    list_of_beta2_estimates.append(model.params["x_2"])
    list_of_beta3_estimates.append(model.params["x_3"])
    list_of_beta4_estimates.append(model.params["x_4"])
    list_of_beta5_estimates.append(model.params["x_5"])
    list_of_intercept_estimates.append(model.params["Intercept"])
    
    dict_of_coeff_estimates[i] = (model.params)

beta1_bootstrap_estimate = np.mean(list_of_beta1_estimates)
beta2_bootstrap_estimate = np.mean(list_of_beta2_estimates)
beta3_bootstrap_estimate = np.mean(list_of_beta3_estimates)
beta4_bootstrap_estimate = np.mean(list_of_beta4_estimates)
beta5_bootstrap_estimate = np.mean(list_of_beta5_estimates)
beta0_bootstrap_estimate = np.mean(list_of_intercept_estimates)

print "beta1_bootstrap_est:", beta1_bootstrap_estimate, sms.DescrStatsW(list_of_beta1_estimates).tconfint_mean()
print "beta2_bootstrap_est:", beta2_bootstrap_estimate, sms.DescrStatsW(list_of_beta2_estimates).tconfint_mean()
print "beta3_bootstrap_est:", beta3_bootstrap_estimate, sms.DescrStatsW(list_of_beta3_estimates).tconfint_mean()
print "beta4_bootstrap_est:", beta4_bootstrap_estimate, sms.DescrStatsW(list_of_beta4_estimates).tconfint_mean()
print "beta5_bootstrap_est:", beta5_bootstrap_estimate, sms.DescrStatsW(list_of_beta5_estimates).tconfint_mean()
print "Intercept bootstrap est:", beta0_bootstrap_estimate, sms.DescrStatsW(list_of_intercept_estimates).tconfint_mean()


beta1_bootstrap_est: -1.26539941097 (-1.7404269616437757, -0.79037186029983686)
beta2_bootstrap_est: -0.810494700682 (-2.016846970257542, 0.39585756889376122)
beta3_bootstrap_est: -0.243895593339 (-0.93422285534193694, 0.44643166866382239)
beta4_bootstrap_est: 0.139017733772 (-0.5309722814960538, 0.80900774903930106)
beta5_bootstrap_est: 0.0372906828363 (-0.6333425626338558, 0.70792392830639006)
Intercept bootstrap est: -0.785017050009 (-1.5498837476350085, -0.020150352382601988)
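
For comparison, a more conventional version of this procedure resamples whole rows (the "pairs" bootstrap), draws as many rows as the dataset has, and reports percentile intervals for each coefficient. The sketch below assumes Python 3 and uses NumPy's least-squares solver in place of `sm.OLS`; the function name `bootstrap_ols`, its defaults, and the synthetic data are hypothetical.

```python
import numpy as np

def bootstrap_ols(X, y, n_boot=1000, level=0.95, seed=0):
    """Pairs bootstrap for y = Xb: return (mean, lower, upper) per coefficient.

    Hypothetical helper: name, defaults and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_obs = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for i in range(n_boot):
        # Resample row indices with replacement, keeping each (x, y) pair together.
        rows = rng.integers(0, n_obs, size=n_obs)
        coefs[i] = np.linalg.lstsq(X[rows], y[rows], rcond=None)[0]
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(coefs, [tail, 100 - tail], axis=0)
    return coefs.mean(axis=0), lower, upper

# Synthetic stand-in data (assumption: Boot.csv is not loaded here).
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.normal(size=(500, 2)), np.ones(500)])
true_b = np.array([-1.5, 0.5, 1.0])     # hypothetical coefficients; last is the intercept
y_demo = X_demo @ true_b + rng.normal(size=500)

estimate, lower, upper = bootstrap_ols(X_demo, y_demo)
for name, est, lo, hi in zip(["x_1", "x_2", "Intercept"], estimate, lower, upper):
    print(name, est, (lo, hi))
```

With the notebook's data, the inputs would be the six predictor columns of `Boot` (including `Intercept`) as `X` and `Boot["y"]` as `y`.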
