# D8 Lec 23, Prof Sanchez 
## Confidence Intervals
### Sean Villegas


**Vocab Review**
- Parameter: a number associated with the population
- Statistic (can estimate a parameter): A number calculated from the sample
- Population and Sample
    - The key idea is that you use the data from your sample to make educated guesses (conclusions) about the population. For those conclusions to be accurate, your sample should represent the population well — e.g. it should include a diverse mix of people, not just students from one dorm or professors from one department.
- Mean: _the average of a set of values, calculated by adding all the values together and dividing by the total number of values_ 
- Range: _the difference between the maximum and minimum values in a dataset_
- Bootstrap: _Bootstrapping is a resampling method in data science that estimates the variability of a statistic by repeatedly sampling with replacement from the original dataset_ 
- Confidence Interval 
    - Cut off the extremes of the bootstrapped statistics and keep the middle, between the bounds `left` and `right` (for a 95% interval, the 2.5th and 97.5th percentiles). In the plot, the yellow line is the confidence interval, and we hope the green dot (the population parameter) falls inside it.
    - If you don't have the population, treat your sample as if it were the population and draw bootstrap samples from it.
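
To make the parameter/statistic distinction concrete, here is a minimal sketch in plain Python (standard library only, rather than the `datascience` package used in lecture). The population is made-up random data, not the lecture's dataset, and the variable names are my own.

```python
import random
import statistics

random.seed(0)  # fixed seed so the example is repeatable

# Hypothetical population: 10,000 made-up salaries (not the lecture's data).
population = [random.randint(30_000, 200_000) for _ in range(10_000)]

# Parameter: a number computed from the whole population.
population_mean = statistics.mean(population)

# Statistic: the same calculation, but on a random sample.
sample = random.sample(population, 400)    # drawn without replacement
sample_mean = statistics.mean(sample)      # mean = sum of values / number of values
sample_range = max(sample) - min(sample)   # range = max - min

print(population_mean, sample_mean, sample_range)
```

In practice only the last two numbers are available, because the population (and so `population_mean`) is unknown.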

<center> What if we don’t have the population? <=> How can we figure out the value of an unknown population parameter? </center>

<center>Answer: Solved by estimation</center>

1. Take a random sample from the population
2. Obtain a statistic from this sample
3. Use this statistic as an estimate of the parameter
    - However, the estimate changes every time you sample, so instead of a single number you report a range of values that is likely to contain the unknown population parameter
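
A small sketch of why a single estimate is not enough: drawing several samples from the same (made-up) population gives a different median each time. This uses only the Python standard library, and the population here is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the example is repeatable

# Hypothetical population of made-up salaries.
population = [random.randint(30_000, 200_000) for _ in range(10_000)]

estimates = []
for _ in range(5):
    sample = random.sample(population, 400)       # step 1: random sample
    estimates.append(statistics.median(sample))   # steps 2-3: statistic used as the estimate

print(estimates)                        # five different estimates of the same parameter
print(min(estimates), max(estimates))   # a rough range of where the estimates fall
```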

To give a range, we need to know how much the estimate can vary.
1. That takes many samples, so that we can see many values of the statistic. Since we only have one sample, we resample at random from the original sample. This works if the original sample represents the population, i.e. it is large and random.


#### Bootstrapping allows for this method
Steps:
1. Draw at random, with replacement, as many values as there are in the original sample
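
The single step above can be sketched with the standard library's `random.choices`, which draws with replacement. The values in `original_sample` are made up for illustration.

```python
import random

random.seed(2)  # fixed seed so the example is repeatable

original_sample = [12, 15, 15, 18, 22, 30, 41]  # made-up values

# random.choices draws WITH replacement; k matches the original sample size.
resample = random.choices(original_sample, k=len(original_sample))

print(resample)  # same length as the original; some values repeat, some are missing
```

Because the draw is with replacement, the resample usually differs from the original sample even though it has the same size.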

Notes:
- Bootstrapping gives us a new sample from the data we already have.
    - This is useful when collecting another sample is too expensive, which is the case for most studies.
- A sample from the population is usually drawn only once (historic data, for example, or to save time), so you simulate new samples by bootstrapping.
- **Jeremy is allergic to repeating code**
- Compute the statistic on many bootstrap samples, because the statistic varies from sample to sample. The collected values form the **empirical distribution** of the statistic.
    - This only works if the original sample resembles the population, i.e. it is large and drawn at random.
- We are currently in the inference module:
    - Using data to draw reliable conclusions about the world
    - Uses statistics to estimate parameters
    - In the plots, the population parameter is represented by the green dot







In [None]:
"""
Estimate the population median with a range of values (an interval) instead of a single number, by bootstrapping a large random sample.
"""

from datascience import *
import numpy as np
import matplotlib.pyplot as plots

population = "Ignore me I represent a table"  # placeholder: in the notebook this is a Table with a 'Total Compensation' column

# **STOP**: always think about how to handle messy data first, either by filtering rows out or by marking values as missing (NULL)

pop_median = percentile(50, population.column('Total Compensation')) # the parameter: median of the whole population (the middle of its histogram)

sample_size = 400
a_sample = population.sample(sample_size, with_replacement=False)
a_sample.hist('Total Compensation', bins = np.arange(0, 800000, 25000)) 
percentile(50, a_sample.column('Total Compensation')) # sample median: a single estimate of the parameter



# ---- Bootstrap example ----
bootstrap_sample = a_sample.sample(k = a_sample.num_rows, 
                                   with_replacement = True)
bootstrap_sample.hist('Total Compensation', bins = np.arange(0, 800000, 25000))


def one_bootstrap_median():
    """Return the median of one bootstrap sample of a_sample (a function, so we don't repeat code)."""
    # draw the bootstrap sample
    resample = a_sample.sample(k = a_sample.num_rows, with_replacement=True)
    # return the median total compensation in the bootstrap sample
    return percentile(50, resample.column('Total Compensation'))

bootstrapped_sample_medians = make_array()
num_resamples = 1000

for i in np.arange(num_resamples):
    new_median = one_bootstrap_median()
    bootstrapped_sample_medians = np.append(bootstrapped_sample_medians, new_median)

print(bootstrapped_sample_medians[:5]) # first 5 of the 1000 estimates (arrays have no .show(), so slice instead)

bootstrapped_median_table = Table().with_column('Bootstrapped Sample Median', bootstrapped_sample_medians)
bootstrapped_median_table.hist(bins = np.arange(125000, 155000, 2000))

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2) # population median is represented by green
plots.title('Do our bootstrapped medians cover the true value?*');

**Note**: 
Remember that in settings where we use this method, we do not have the true value of the parameter. In reality you won't know whether the bootstrapped distribution of statistics covers the parameter (we hope that it does).


#### Looking at percentiles of an array (sorted list)
For a sorted list, the pth percentile:
- is the first value
- that is at least as large 
- as p% of the values
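
The definition above can be sketched directly in plain Python. The function name `percentile_by_definition` and the `scores` values are my own, chosen for illustration; this is not the `datascience` library's implementation.

```python
import math

def percentile_by_definition(p, values):
    """Smallest value in `values` that is at least as large as p% of the values."""
    ordered = sorted(values)
    # How many of the sorted values the answer must be at least as large as (at least 1).
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

scores = [40, 10, 30, 20]   # made-up, deliberately unsorted

print(percentile_by_definition(50, scores))    # 20
print(percentile_by_definition(51, scores))    # 30
print(percentile_by_definition(100, scores))   # 40
```

Note that the result is always one of the values in the list, never something in between.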

Interpolation:
- is a method of estimating unknown values that fall between known data points. In data science, it's often used to approximate percentiles, points on a curve, or missing values by assuming a linear (or other) pattern between the known points. The percentile definition above does not interpolate: it always returns a value that is actually in the list.
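
For contrast, here is a minimal sketch of a linearly interpolated percentile. It follows the convention I understand NumPy's `np.percentile` to use by default, and the function name `interpolated_percentile` is hypothetical.

```python
def interpolated_percentile(p, values):
    """Percentile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    # Fractional position of the percentile among the sorted values (0-based).
    position = p / 100 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

print(interpolated_percentile(50, [10, 20, 30, 40]))   # 25.0, halfway between 20 and 30
```

Here the answer (25.0) is not a value in the list, which is exactly what distinguishes interpolation from the first-value definition.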



**Which statements are true when `s = array([1, 5, 7, 3, 9])`?**
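_Sort `s` first, then add up the percents to 100%: 1 covers 20%, 3 covers 40%, 5 covers 60%, 7 covers 80%, 9 covers 100%._
1. The 50th percentile of `s` is 5.
    TRUE # (5 is the first sorted value that is at least as large as 50% of the values; it is also the median)
2. The 10th percentile of `s` is 6.
    FALSE # (By the definition above it is 1, and 6 is not even a value in `s`; with linear interpolation it would be 1.8)
3. The 39th percentile of `s` is the same as the 40th percentile of `s`. 
    TRUE # (Both are 3: 39% and 40% of 5 values both round up to 2 values, and the 2nd sorted value is 3)
4. The 40th percentile of `s` is the same as the 41st percentile of `s`. 
    FALSE # (The 40th percentile is 3, but 41% of 5 values is 2.05, which rounds up to 3 values, so the 41st percentile is 5)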

#### Confidence interval is a tool for estimation 


In [None]:
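# Middle 95% of the bootstrapped medians: cut off 2.5% on each side
left = percentile(2.5, bootstrapped_sample_medians)
right = percentile(97.5, bootstrapped_sample_medians)

bootstrapped_median_table.hist(bins = np.arange(125000, 150000, 2000))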

# Plotting parameters; you can ignore this code
plots.ylim(-0.000005, 0.00014)
plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);
plots.title('We are 95 percent confident that \n the parameter lies within the yellow bounds');
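
The whole procedure, from one sample to a 95% interval, can be sketched without the `datascience` package. This is a rough stand-alone version under stated assumptions: the sample is made-up random data, and the helper names (`percentile_by_definition`, `one_bootstrap_median`) are my own.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the example is repeatable

def percentile_by_definition(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]

# Hypothetical original sample of 400 made-up salaries.
original_sample = [random.randint(30_000, 200_000) for _ in range(400)]

def one_bootstrap_median(sample):
    """Resample with replacement (same size) and return the resample's median."""
    resample = random.choices(sample, k=len(sample))
    return statistics.median(resample)

bootstrapped_medians = [one_bootstrap_median(original_sample) for _ in range(1000)]

# Keep the middle 95%: cut off 2.5% of the bootstrapped medians on each side.
left = percentile_by_definition(2.5, bootstrapped_medians)
right = percentile_by_definition(97.5, bootstrapped_medians)

print(left, right)  # approximate 95% confidence interval for the population median
```

Changing the two percentiles changes the confidence level, e.g. 5 and 95 would give an approximate 90% interval.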