## Workshop - Bootstrap

Today we will

1. Show the average number of unique observations in a bootstrap sample
2. Estimate the standard deviation of the estimated causal effect from a **RANDOMIZED CONTROL TRIAL**

**************************************
# Bootstrap Samples

In one code cell:

- import `numpy` and `numpy.random`
- set the seed to 490
- create *a range* from 0 to 10,000
    - *hint: start with a smaller size to set up the framework*
- create an empty list
- in a 1,000 iteration for loop
    - *hint: start with a smaller size to set up the framework*
    - randomly sample your range with replacement, with a size equal to the length of your range, using `npr.choice()`
    - append to your list the number of unique values in the sample drawn with replacement
- output the average number of unique values over all bootstrapped samples

In [13]:
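import numpy as np
import numpy.random as npr

from tqdm import tqdm

npr.seed(490)

# note: range(0, 10001) holds 10,001 values (0 through 10,000 inclusive)
l = range(0,10001)
l2 = []  # number of unique values in each bootstrap sample

for i in tqdm(range(1000)):
    # draw len(l) values from l with replacement
    indx = npr.choice(l, len(l), replace = True)
    # a set keeps each sampled value only once
    l2.append(len(set(indx)))

# average share of unique values; dividing by len(l) would be exact
np.mean(l2)/10000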

100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 327.52it/s]


0.6322176

Is this closer to 1/2, 2/3, or 3/4?

2/3
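
The answer of 2/3 follows from a standard limit. When we draw $n$ values with replacement from $n$ observations, any given observation is missed by all $n$ draws with probability $(1 - 1/n)^n$. The expected share of unique observations is therefore:

```latex
\[
1 - \left(1 - \frac{1}{n}\right)^{n}
\;\longrightarrow\; 1 - e^{-1} \approx 0.632
\qquad \text{as } n \to \infty
\]
```

This is closer to 2/3 (about 0.667) than to 1/2 or 3/4.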

**************
# Randomized Control Trial

In economics, we call experiments with randomly assigned treatment and control groups __*randomized control trials*__.
In data science, they are called _**A/B tests**_.

In this application, we will be using a data set from [kaggle](https://www.kaggle.com/samtyagi/audacity-ab-testing).
We will use a linear probability model (LPM) to estimate the effect of being in the treatment group on clicking *something*.
The data is from Audacity; however, there is no information about the specific experiment.
We do not know whether it shows different versions of a website, different versions of an advertisement, or something else entirely.



In [14]:
import pandas as pd
from tqdm import tqdm

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

Load in the Audacity data as `ab` with `index_col = 'timestamp'`.
Print the head.

In [17]:
ab = pd.read_csv('homepage_actions.csv', index_col = 'timestamp')
print(ab.head(3))

                                id       group action
timestamp                                            
2016-09-24 17:42:27.839496  804196  experiment   view
2016-09-24 19:19:03.542569  434745  experiment   view
2016-09-24 19:36:00.944135  507599  experiment   view


Determine the unique values of `group` and `action`

In [None]:
print(ab.group.unique())
print(ab.action.unique())

Create a dummy variable `treatment` for those in the treatment group.
Create a dummy variable `click` for those that clicked.

Create an object `x` that is the model matrix composed of a constant and the `treatment` variable.
Create an object `y` that is the `click` variable.

In one line, fit a `statsmodels` OLS model and print the summary.
Note the estimate and standard error on the `treatment` variable.

Here we will perform the bootstrap in one code cell.

- set the `npr` seed to 490
- define `n` equal to the number of rows of `ab`
- create an empty list `beta`
- set up a for loop over 2,000 iterations using tqdm
    - use `npr.choice()` to obtain the bootstrap index
    - fit a `LinearRegression()`
        - *hint:* `X` *needs to be a DataFrame, not a Series. Select the* `treatment` *variable using* `ab[['treatment']].iloc[indx]`. `y` *needs to be a Series. Select with only single square brackets.*
    - append `fit.coef_` to `beta`
        - *Note: the intercept, which we do not need, is contained separately in* `fit.intercept_`.
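
The steps above could be sketched as follows. The DataFrame here is a synthetic stand-in for `ab`, and the loop runs only 200 iterations to keep it quick; both are assumptions for illustration, so swap in the real data and 2,000 iterations for the exercise.

```python
import numpy as np
import numpy.random as npr
import pandas as pd
from sklearn.linear_model import LinearRegression
from tqdm import tqdm

# Synthetic stand-in for `ab` (not the workshop data set)
npr.seed(0)
treat = npr.randint(0, 2, size=300)
ab = pd.DataFrame({"treatment": treat,
                   "click": npr.binomial(1, 0.2 + 0.1 * treat)})

npr.seed(490)
n = ab.shape[0]
beta = []

for _ in tqdm(range(200)):  # the exercise asks for 2,000 iterations
    # row positions drawn with replacement
    indx = npr.choice(n, n, replace=True)
    X = ab[["treatment"]].iloc[indx]  # DataFrame (double brackets)
    y = ab["click"].iloc[indx]        # Series (single brackets)
    fit = LinearRegression().fit(X, y)
    beta.append(fit.coef_)            # one-element array per iteration
```

`LinearRegression()` fits an intercept by default, so no constant column is needed in `X`.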

Using one `print()` statement, print the average of `beta` with 3 decimal places and the standard deviation of `beta` with 4 decimal places.
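
One way to do this is with an f-string and format specifiers. The values below are hypothetical placeholders, not results from the data.

```python
import numpy as np

# Hypothetical bootstrap draws standing in for `beta`
beta = [np.array([0.031]), np.array([0.027]), np.array([0.035])]

# .3f and .4f control the number of decimal places
summary = f"mean: {np.mean(beta):.3f}, sd: {np.std(beta):.4f}"
print(summary)
```

Note that `np.std()` uses `ddof=0` by default; pass `ddof=1` if you want the sample standard deviation.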

Up next, we will produce a histogram. However, we need to perform some preprocessing.

Print the first five observations of `beta` using a slice. Note the format.

To convert `beta` to a list we can work with:

- use `np.concatenate()` on `beta`
- chain the `.flat` attribute
- wrap the whole thing with `list()`
- overwrite `beta`
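
A small sketch of the conversion, assuming `beta` is a list of one-element arrays (the shape `fit.coef_` takes with a single regressor). The values are hypothetical placeholders.

```python
import numpy as np

# Hypothetical list of one-element arrays standing in for `beta`
beta = [np.array([0.031]), np.array([0.027]), np.array([0.035])]

print(beta[:5])  # a list of arrays, not a list of numbers

# Join the arrays into one 1-D array, then unpack it into a plain list
beta = list(np.concatenate(beta).flat)
print(beta[:5])
```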

Finally, use `matplotlib` to create a histogram of `beta`. 
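
A minimal histogram sketch. The draws below are simulated placeholders for `beta`, and the bin count and axis labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical bootstrap estimates standing in for `beta`
rng = np.random.default_rng(490)
beta = list(rng.normal(loc=0.03, scale=0.005, size=2000))

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(beta, bins=30)
ax.set_xlabel("bootstrap estimate of the treatment effect")
ax.set_ylabel("frequency")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.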