# Assignment

We learned from the central limit theorem (CLT) that the **mean** of a sample of size $n$ approximately follows a normal distribution with center $\mu$ and variance $\sigma^2 / n$, where $\mu$ and $\sigma^2$ are the mean and variance of the population the sample was drawn from. More importantly, the CLT holds **regardless of the distribution of the population** itself, as long as the sample size is large enough. This leads to another question: what if we wanted to estimate the distribution of *some other statistic* besides the sample mean? For example, can we draw a similar conclusion about the distribution of the **sample median**? What about the distribution of the **sample variance**?

Unfortunately, the CLT only applies to the sample mean, and finding the distribution of any other statistic is not that straightforward. The CLT is like a mathematical guarantee, but we usually cannot make a similar guarantee for other statistics. Instead, we can use computational methods to estimate the distribution of our statistic. One example of such a computational method is **bootstrapping**.
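
The CLT statement above can be checked with a quick simulation. The following is a minimal sketch, not part of the assignment: it assumes a hypothetical right-skewed population (exponential with mean 2, hence variance 4) and compares the spread of many sample means against $\sigma^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed population: exponential with mean 2 (variance 4)
population_mean, population_var = 2.0, 4.0
n = 50               # size of each sample
n_samples = 10_000   # number of repeated samples

# Draw many samples of size n and compute the mean of each one
samples = rng.exponential(scale=population_mean, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# The sample means should center near mu with variance near sigma^2 / n
print("mean of sample means:    ", sample_means.mean())
print("variance of sample means:", sample_means.var())
print("sigma^2 / n:             ", population_var / n)
```

Even though the population is far from normal, the sample means cluster around $\mu$ with a variance close to $\sigma^2 / n$.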

The sampling techniques we learned in the lecture mainly deal with the problem of how to create a smaller data set that is representative of the larger population, and all of them were a form of sampling **without replacement**. **Bootstrapping** consists of **repeated sampling from data with replacement** (which means the same row of data can end up more than once in the sampled data). Bootstrapping is a useful technique when we want to estimate the distribution of a statistic that would otherwise be hard to estimate.
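
To see the difference between the two kinds of sampling, here is a small sketch on a hypothetical five-row DataFrame (the `toy` data and its `id` column are made up for illustration). It uses the `replace` argument of the pandas `sample` method.

```python
import pandas as pd

# Hypothetical toy data with five distinct rows
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5]})

# Without replacement: every row appears at most once
without = toy.sample(n=len(toy), replace=False, random_state=0)

# With replacement: same size, but rows may repeat and others may be missing
with_repl = toy.sample(n=len(toy), replace=True, random_state=0)

print("without replacement:", sorted(without["id"]))
print("with replacement:   ", sorted(with_repl["id"]))
```

Sampling all five rows without replacement only shuffles the data, whereas sampling with replacement typically produces some repeated rows and leaves some rows out.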

Let's return to the bank marketing data from previous exercises:

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

bank = pd.read_csv('../../data/bank-full.csv', delimiter = ';')
bank.head()

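| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |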


- Report the difference in the average balance for single vs married individuals in the data. <span style="color:red" float:right>[1 point]</span>

In [None]:
## your code goes here

Since balance is skewed to the right, maybe it's not such a good idea to compare averages for these two groups. So instead of the average, let's use the **median**.
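
As a quick illustration of why skew matters, the sketch below generates hypothetical right-skewed values (lognormal, not the bank data) and compares their mean and median.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed "balances": lognormal values, not the bank data
balances = rng.lognormal(mean=7.0, sigma=1.0, size=10_000)

# A few very large values pull the mean above the median
print("mean:  ", balances.mean())
print("median:", np.median(balances))
```

With a long right tail, the mean is pulled upward by a small number of large values, while the median is not affected by them in the same way.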

- Report the difference in the median balance for single vs married individuals in the data. How does this compare to using the mean? <span style="color:red" float:right>[1 point]</span>

In [None]:
## your code goes here

But what is the distribution of the above statistics?

- In the case of the mean, we could rely on the CLT to answer that, assuming that we know the mean and variance of balance for single and married individuals **in the population** (our data is the sample).
- In the case of the median, we cannot rely on the CLT, so we will use bootstrapping to estimate the distribution.

1. Take a bootstrap sample from the data. You can use the `sample` method to do this. Recall that a bootstrap sample is a sample **with replacement**. Your bootstrap sample should have the same size (number of rows) as the data. <span style="color:red" float:right>[5 point]</span>
1. Find the difference between the median bank balances for single vs. married individuals in the bootstrap sample. We will refer to this as the **bootstrap estimate**. <span style="color:red" float:right>[2 point]</span>
1. Repeat steps 1 and 2 over 1000 iterations, and at the end of each iteration store the bootstrap estimate into a list we call `bootstrap_estimates`. <span style="color:red" float:right>[2 point]</span>
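
The three steps above can be sketched on made-up data. This is only an illustration under stated assumptions: the `toy` DataFrame, its `group` and `value` columns, the helper `median_difference`, and the choice of 200 iterations are all hypothetical, and the code would need to be adapted to the bank data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical toy data: two groups with right-skewed values
toy = pd.DataFrame({
    "group": np.repeat(["a", "b"], 500),
    "value": np.concatenate([
        rng.exponential(scale=100, size=500),
        rng.exponential(scale=150, size=500),
    ]),
})

def median_difference(df):
    """Median of `value` in group a minus the median in group b."""
    medians = df.groupby("group")["value"].median()
    return medians["a"] - medians["b"]

toy_estimates = []
for i in range(200):
    # Step 1: resample rows with replacement, keeping the original size
    resample = toy.sample(n=len(toy), replace=True, random_state=i)
    # Step 2: compute the statistic on the resample and store it (step 3)
    toy_estimates.append(median_difference(resample))

print("number of estimates:", len(toy_estimates))
print("spread of estimates:", np.std(toy_estimates))
```

Each pass through the loop produces one bootstrap estimate, and the collection of estimates approximates the sampling distribution of the statistic.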

In [None]:
bootstrap_estimates = []
n_iters = 1000

## your code goes here

- Display the distribution of the 1000 bootstrap estimates we found. Does the distribution of the difference in medians appear to be symmetric? <span style="color:red" float:right>[2 point]</span>

In [None]:
## your code goes here

- What is its standard deviation? <span style="color:red" float:right>[1 point]</span>

In [None]:
## your code goes here

So we now have not only the difference in median balance between the two groups, but also a measure of how uncertain this estimate is, based on bootstrapping.

# End of assignment