# Interpreting Confidence Intervals

In [None]:
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

## 1. Caffeine Experiment

Let's load a table of (fake) data for this experiment.

The table records the number of words each student recalled from a list before and after taking caffeine, along with the difference between the two scores.

In [None]:
caffeine = Table.read_table("data/caffeine.csv")

In [None]:
caffeine.show(5)

In [None]:
print("Sample size =", caffeine.num_rows)

In [None]:
diff_array = caffeine.column("Difference")

Let's use the mean difference between post-test and pre-test in our sample as our estimate for our population parameter. 

In [None]:
effect = np.mean(diff_array)
print("Caffeine Effect (Mean difference in the sample) = ", effect)

Ok! It's positive so maybe caffeine works... 

Hmmmm... we know this is just an estimate from one sample. Let's create a confidence interval: a range of plausible values for the population parameter that reflects how much our estimate could vary from sample to sample.

Let's use **bootstrapping and the percentile method** to create a 95\% confidence interval.

In [None]:
results = bootstrap_statistic(diff_array, np.mean, 10000)

In [None]:
ci_interval = confidence_interval(95, results)
print("95% confidence interval = ", ci_interval)
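
The two cells above rely on the course helpers `bootstrap_statistic` and `confidence_interval`. To make the idea concrete, here is a minimal NumPy-only sketch of what helpers like these might do. The names `bootstrap_means`, `percentile_interval`, and `fake_diffs` are hypothetical, the data is made up, and `np.percentile` interpolates between values, so its endpoints may differ slightly from the course helper's.

```python
import numpy as np

def bootstrap_means(sample, num_resamples, rng):
    """Resample `sample` with replacement and return the mean of each resample."""
    n = len(sample)
    # Each row is one bootstrap resample, the same size as the original sample
    resamples = rng.choice(sample, size=(num_resamples, n), replace=True)
    return resamples.mean(axis=1)

def percentile_interval(level, statistics):
    """Return the endpoints of the middle `level` percent of the statistics."""
    tail = (100 - level) / 2
    return np.percentile(statistics, [tail, 100 - tail])

rng = np.random.default_rng(0)
# Made-up differences standing in for the "Difference" column (an assumption)
fake_diffs = np.round(rng.normal(2.4, 5, 50))

boot = bootstrap_means(fake_diffs, 10000, rng)
left, right = percentile_interval(95, boot)
print("Sample mean =", np.mean(fake_diffs))
print("95% percentile interval =", left, "to", right)
```

For a 95% interval, the percentile method simply cuts 2.5% off each tail of the bootstrap distribution and keeps what is left.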

In [None]:
table = Table().with_columns("Caffeine Effects (Mean Difference)", 
                             results)
plot = table.hist("Caffeine Effects (Mean Difference)", bins = np.arange(-2,7,0.5))
plot.set_title("Bootstrap 10000 Times \n Sample Size="+str(caffeine.num_rows))
plot.interval(ci_interval)
plot.dot(effect)

## 2. Variables Influencing CIs

Several factors influence the width of our confidence intervals, including the confidence level we choose, the sample size, and the variability of the data in our sample. The following function lets you adjust all three to see their effects.
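
As a rough guide to how these three factors combine, the textbook normal approximation for a mean gives the formula below. This is an assumption brought in for intuition only, since this notebook builds its intervals by bootstrapping instead.

$$\text{width} \approx 2\, z^{*}\, \frac{s}{\sqrt{n}}$$

Here $s$ is the sample standard deviation, $n$ is the sample size, and $z^{*}$ is the critical value for the chosen confidence level (about 1.96 for 95% and about 0.67 for 50%). A higher confidence level raises $z^{*}$, more variability raises $s$, and a larger sample raises the $\sqrt{n}$ in the denominator.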

In [None]:

def caffeine_bootstrap(sample_size, variability, ci_level):
    """
    A function that helps us visualize how our estimation is affected by 
    various variables, including sample size, variability, ci level.
    """
    # Create some fake data
    rng = np.random.default_rng(0)
    diff_array = np.round(rng.normal(2.4, variability, sample_size))
    
    # Our sample statistic
    observed_effect = np.mean(diff_array)

    # Estimate effect
    np.random.seed(0)
    results = bootstrap_statistic(diff_array, np.mean, 10000)
    ci_interval = confidence_interval(ci_level, results)
    
    # Show results
    table = Table().with_columns("Caffeine Effects (Mean Difference)", results)
    plot = table.hist("Caffeine Effects (Mean Difference)", bins = np.arange(-2,7,0.5))
    ci_string = "[" + str(np.round(ci_interval.item(0), 2)) + "," + str(np.round(ci_interval.item(1), 2)) + "]"
    plot.set_title(str(ci_level) + "% Confidence Interval: " + ci_string + "\nSample Size="+str(sample_size))
    plot.interval(ci_interval)
    plot.dot(observed_effect)    

In [None]:

interact(caffeine_bootstrap, 
         sample_size = Slider(40,150,10),
         variability = Slider(0,10),
         ci_level = Slider(1,100))

### Confidence Level

Decreasing our confidence level produces a narrower confidence interval, but we have less confidence that our process will produce an interval containing the true parameter.


In [None]:
with Figure(2,1,figsize=(5,4)):
    caffeine_bootstrap(40, 10, 95)
    caffeine_bootstrap(40, 10, 50)
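
The same comparison can be made numerically. The sketch below uses made-up data and plain NumPy rather than the course helpers; `interval_width` is a hypothetical helper. It computes both intervals from one set of bootstrap means, so only the confidence level changes.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.round(rng.normal(2.4, 10, 40))   # made-up differences

# 10,000 bootstrap means, each from a resample of the same size as the sample
boot = rng.choice(sample, size=(10000, sample.size), replace=True).mean(axis=1)

def interval_width(level):
    """Width of the middle `level` percent of the bootstrap means."""
    tail = (100 - level) / 2
    lo, hi = np.percentile(boot, [tail, 100 - tail])
    return hi - lo

width_95 = interval_width(95)
width_50 = interval_width(50)
print("95% interval width:", round(width_95, 2))
print("50% interval width:", round(width_50, 2))
```

The 50% interval keeps only the middle half of the bootstrap means, so it is always nested inside the 95% interval.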

### Sample Size

Increasing the sample size produces a narrower confidence interval with the same confidence that our process will produce an interval containing the true parameter.


In [None]:
with Figure(2,1,figsize=(5,4)):
    caffeine_bootstrap(40, 10, 95)
    caffeine_bootstrap(80, 10, 95)
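
To see the effect in numbers rather than plots, the sketch below draws made-up samples of increasing size from the same population and prints the width of each 95% interval. `bootstrap_ci_width` is a hypothetical NumPy-only helper, not part of the course library, and the sample size of 160 is added here only to make the trend easier to see.

```python
import numpy as np

def bootstrap_ci_width(sample, level=95, num_resamples=10000, seed=0):
    """Width of a percentile-method bootstrap interval for the sample mean."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    # Row i holds the indices of the i-th resample (drawn with replacement)
    idx = rng.integers(0, n, size=(num_resamples, n))
    boot_means = sample[idx].mean(axis=1)
    tail = (100 - level) / 2
    lo, hi = np.percentile(boot_means, [tail, 100 - tail])
    return hi - lo

data_rng = np.random.default_rng(1)
widths = {}
for n in [40, 80, 160]:
    sample = data_rng.normal(2.4, 10, n)   # made-up data, same population each time
    widths[n] = bootstrap_ci_width(sample)
    print("Sample size", n, "-> 95% interval width", round(widths[n], 2))
```

Because the spread of a sample mean shrinks roughly with the square root of the sample size, it takes about four times as much data to halve the width.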

### Variability

Lower variability in our sample also leads to narrower confidence intervals, 
but variability depends on the data we happen to collect and is not under our control.

In [None]:
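def caffeine_sample(sample_size, variability):
    """
    Plot a histogram of one fake sample of differences, drawn with the 
    given sample size and variability (standard deviation).
    """
    # Create some fake data (seed 2 here, while caffeine_bootstrap uses seed 0)
    rng = np.random.default_rng(2)
    diff_array = np.round(rng.normal(2.4, variability, sample_size))

    # Our sample statistic
    observed_effect = np.mean(diff_array)

    # Show the sample and mark its mean
    table = Table().with_columns("Difference", diff_array)
    plot = table.hist("Difference", bins=np.arange(-25,25,5))
    plot.set_title("Caffeine Effects\nSample Size="+str(sample_size))
    plot.dot(observed_effect)    
    plot.set_xlim(-25,25)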

#### High variability

In [None]:
with Figure(1,2,figsize=(5,4)):
    caffeine_sample(40,10)
    caffeine_bootstrap(40, 10, 95)

#### Low variability

In [None]:
with Figure(1,2,figsize=(5,4)):
    caffeine_sample(40,5)
    caffeine_bootstrap(40, 5, 95)
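
The effect of variability can also be isolated numerically. The sketch below is NumPy-only with made-up data, and `ci_width` is a hypothetical helper. It builds two samples from the same underlying random draws, one with twice the spread of the other, and bootstraps both with identical resamples. Unlike the cells above, it skips rounding so that the scaling is exact.

```python
import numpy as np

n = 40
# One set of standard normal draws, reused so the two samples differ only in spread
z = np.random.default_rng(0).standard_normal(n)

def ci_width(sample, level=95, num_resamples=10000):
    """Width of a percentile-method bootstrap interval for the sample mean."""
    rng = np.random.default_rng(1)   # fixed seed: the same resamples on every call
    idx = rng.integers(0, len(sample), size=(num_resamples, len(sample)))
    boot_means = sample[idx].mean(axis=1)
    tail = (100 - level) / 2
    lo, hi = np.percentile(boot_means, [tail, 100 - tail])
    return hi - lo

width_high = ci_width(2.4 + 10 * z)   # variability 10
width_low = ci_width(2.4 + 5 * z)     # variability 5
print("High variability width:", round(width_high, 2))
print("Low variability width: ", round(width_low, 2))
```

With everything else held fixed, doubling the spread of the data doubles the width of the interval.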