# Introduction to Confidence Intervals
By Carl Shan

In this Jupyter Notebook, you will review the idea of Standard Deviation and learn about how we can use it to construct something called the Confidence Interval.

## Table of Contents

The sections of this notebook are:

0. [The Setup](#Do-this-before-going-through-the-rest-of-the-Notebook)
1. [Part 1: Standard Deviation vs. Standard Error](#Part-1:-The-Standard-Deviation-$\sigma$-and-the-Standard-Error-$\overline{SE}$)
2. [Part 2: The t-distribution and t-statistic](#Part-2:-The-t-distribution-and-the-t-statistic)
3. [Part 3: Intro to Confidence Intervals](#Part-3:-Introduction-to-Confidence-Intervals)
4. [Part 4: How to calculate a confidence interval](#Part-4:-Calculating-the-Confidence-Interval)
5. [Part 5: How to interpret a confidence interval](#Part-5:-Interpreting-confidence-intervals-and-how-it-ties-to-hypothesis-testing)

## Setup: Do this before going through the rest of the Notebook
I'm going to create a few widgets that will help you understand these ideas better. You will need to do the following first in Terminal:

* `pip3 install -U ipywidgets` 
    * Use `pip` or `pip2` if you are installing for Python2 instead of Python3
* `jupyter nbextension enable --py widgetsnbextension`
    * This will enable the widgets

I'm also going to be using a different statistical library to do some modeling, so do the following as well:

* `pip3 install -U statsmodels`

After you've done this, close and reopen `jupyter notebook` in Terminal. Then come back to this notebook.

## Why do we need to know these ideas?

Imagine this:

> At Nueva, Stephen Dunn and Diane Rosenberg have been hearing from a few families and students that they wish there were more I-Lab courses. Stephen and Diane are open to hiring more I-Lab faculty to offer more courses, but they have to be sure that these classes would be filled before allocating funding to do so. 
> 
> After all, what if these are just a few fringe opinions by a handful of families, and don't truly reflect the opinions of Nueva as a whole?
> 
> So they decide to survey some students at Nueva and get some data.
> 
> The survey asks: *(Yes/No) If we offered more Nueva classes, would you be open to taking it?*
> 
> In order to justify hiring more teachers, Stephen and Diane would want **at least 25%** of the students who respond to say *"Yes"*.
> 
> They get the survey results back. There were a total of **30** students who responded and **8** of them said *"Yes"*. That's about **26.6%**, which seems promising, but Stephen and Diane want to know: how confident can we be in this result? Could it have been a fluke?
> 
> They hire you to do some data analysis and answer the question: "Should we hire some more I-Lab faculty?"

Here's the central question you're trying to answer: *How likely is the result of the survey to be representative of the actual opinions of students at Nueva?*

In this Jupyter Notebook you will learn about one central idea, the **Confidence Interval**, to help you answer the question posed above.

![Survey](https://storage.googleapis.com/proudcity/elgl/uploads/2018/08/survey_9.png)

## Part 1: The Standard Deviation $\sigma$ and the Standard Error $\overline{SE}$

The Standard Deviation (SD for short) is one of the most important and popular topics in statistics.

It's used all the time and the symbol statisticians use to refer to it is the Greek letter sigma $\sigma$.

### What do we use the Standard Deviation $\sigma$ for?

The Standard Deviation, often represented by the lower case Greek letter sigma $\sigma$, is a measure of how "spread out" or "dispersed" a set of data are.

    NOTE: It is mathematically equivalent to the square root of something called the "variance", which is another measurement of the spread of some data.

By using the standard deviation, we can figure out how likely some points are to be different from the "center" of the data. We use it to learn if some results we got are "weird" or "unexpected" or if they fall within the normal range of what we should expect given how widely "dispersed" the data is.


### How do you calculate the Standard Deviation for some data?

The general formula for Standard Deviation is the following:

$$ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{(x_i - \bar{x})^2}} $$

The above should be read: "the standard deviation $\sigma$ is equal to the square root of the average of the squared distances between each data point and the average point."

### Say what? 

To explain it step-by-step:

1. Calculate the mean of the data
2. Then for each datapoint: subtract the mean and square the result
3. Then find the average of these squared differences.
4. Take the square root of that and that is the standard deviation!
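
If it helps to see the four steps as code, here is a minimal sketch. The five numbers in `data` are made up purely for illustration, and the variable names are my own.

```python
import numpy as np

# Hypothetical data: made-up scores for five students
data = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

# Step 1: calculate the mean of the data
mean = data.sum() / len(data)

# Step 2: for each datapoint, subtract the mean and square the result
squared_diffs = (data - mean) ** 2

# Step 3: find the average of these squared differences (this is the variance)
variance = squared_diffs.sum() / len(data)

# Step 4: take the square root to get the standard deviation
sigma = variance ** 0.5

print("mean =", mean)
print("standard deviation =", round(sigma, 3))

# NumPy's built-in np.std follows the same four steps by default
print("np.std agrees:", np.isclose(sigma, np.std(data)))
```

Note that `np.std` divides by the number of datapoints by default, which matches step 3 above.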

### But what is it good for?

I'll let Wikipedia explain at the following [link](https://simple.wikipedia.org/wiki/Standard_deviation):

> Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean) ... A low standard deviation means that most of the numbers are very close to the average. A high standard deviation means that the numbers are spread out.
> 
> [In doing data analysis] ... scientists commonly report the standard deviation of numbers from the average number in experiments. They often decide that only differences bigger than two or three times the standard deviation are important.

### Modifying the above equation: The standard deviation for the average 

So the above equation is the standard deviation of a set of data.

What about the standard deviation for the *average* of a set of data?

Well, you'd expect there to be some variability in any set of data that comes from a sample.

**Question:** Do you think there'd be more or less variability in the *average* of this set of data? E.g., if you were to take a bunch of samples, and find the average, do you think there would be a higher or lower variability in this average than in the original set of data?

Think about the above question before you scroll down for the answer.

...

...

...

...

...

...

The answer is **lower**. 

**Why is that?**

Just think about it. What has more variance: the heights of people living in a city, or the *average height of people* between cities?

Obviously there may be very wide differences within a city (e.g., some people are very short, others are very tall and the differences between them may be large.) But once you find the average of a city and compare it against averages of other cities, these averages should be much closer.

When you take the average of a bunch of data, you are "doing away" with a lot of the variance inherent in the data.

It turns out that the Standard Deviation $\sigma$ of the *average* of some data is scaled down by a factor of $\sqrt{N}$. As a result the eventual equation is:

$$ \sigma_{avg} = \frac{\sigma}{\sqrt{N}} $$
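
You don't have to take the $\sqrt{N}$ scaling on faith. Below is a rough simulation sketch: the "city" of heights is completely made up (I'm assuming heights centered at 170 cm with a spread of 10 cm), and the sample size of 25 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical "city": 100,000 heights (cm) drawn from a made-up distribution
population = rng.normal(loc=170, scale=10, size=100_000)
sigma = population.std()

N = 25  # number of people in each sample

# Draw 5,000 samples of N people each and record the average of each sample
samples = rng.choice(population, size=(5_000, N))
sample_means = samples.mean(axis=1)

print("SD of individual heights:   ", round(sigma, 2))
print("SD of the sample averages:  ", round(sample_means.std(), 2))
print("sigma / sqrt(N) predicts:   ", round(sigma / np.sqrt(N), 2))
```

The last two printed numbers should land close to each other, and both should be much smaller than the first.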

### Another twist: Standard Deviation when you're dealing with proportions rather than typical numbers
The equation I gave above is the general Standard Deviation formula. But there's a more specific formula we'll be using in this notebook that better applies to our example of figuring proportions of something. 

In the case of measuring the proportion of a population that has some quality (e.g., "the proportion of a group of students who want more I-Lab classes."), the equation for the standard deviation can be simplified to the following:

$$ \sigma_{avg} = \sqrt{\frac{p * (1-p)}{N}} $$

What do all these symbols mean?

I'll break them down one-by-one. Let's start with the left-hand side:

$$\sigma_{avg} = $$

The left-hand side of the equation says the Standard Deviation of the proportion, also known as $\sigma_{avg}$, is equal to the right side:

$$ \sqrt{\frac{p * (1-p)}{N}} $$

In the above equation:

* $N$ is the total number of points in the "population" (e.g., the total number of students at Nueva)
* $p$ is the true proportion of the population we're trying to measure (e.g., the proportion of students that would take more I-Lab classes if they were offered)

Now, we don't know the true $p$ value. We are trying to estimate it by taking a sample of $n$ students (where $n \lt N$, in other words we sample $n$ students that are only a subset of all possible $N$ students). 


### Estimating the Standard Deviation $\sigma$ by using Standard Error $\overline{SE}$

Since we don't know the true value of $p$, we try to estimate it by taking a sample of $n$ students and asking them the question instead.

Whatever proportion of students in our sample answer the question *(Yes/No) If we offered more Nueva classes, would you be open to taking it?* will be our **estimate for the value of $p$**.

We will denote the estimate of $p$ as $\hat{p}$, where the little funny symbol on top of the $p$ is called a `hat`.

So $\hat{p}$ is pronounced "p-hat".

We will use $\hat{p}$ and plug it into our Standard Deviation formula to get our best guess for $\sigma$. We will call this estimate the Standard Error, which we denote as $\overline{SE}$. This is our **estimate for the value of $\sigma$**.

So, instead of:

$$ \sigma_{avg} = \sqrt{\frac{p * (1-p)}{N}} $$

where $N$ is the total size of our population (e.g., all students at Nueva).

We now have: 

$$ \overline{SE}_{avg} = \sqrt{\frac{\hat{p} * (1-\hat{p})}{n}} $$

where $n$ is the size of our sample (e.g., how many students took our survey).
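
Here is a small sketch of that formula in code. The survey numbers (12 "Yes" answers out of 40) are hypothetical, not the Nueva survey, and the function name `standard_error_of_proportion` is one I made up for this example.

```python
def standard_error_of_proportion(num_yes, n):
    """Return (p_hat, SE) for a Yes/No survey with n responses."""
    p_hat = num_yes / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat, se

# Hypothetical survey (not the Nueva one): 12 of 40 students said "Yes"
p_hat, se = standard_error_of_proportion(num_yes=12, n=40)
print("p-hat = {:.3f}".format(p_hat))
print("SE    = {:.3f}".format(se))
```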

### Wait, what? I'm confused Carl. What's going on?

Okay, the short of it is this: Standard Deviation $\sigma$ isn't something we know. But since we need it, we're going to estimate it with something called Standard Error $\overline{SE}$.

We'd like to know $\sigma$. We really would! That's because we can use it to answer the question: "If my survey of 30 students said 26.6% of them would take more I-Lab classes, is this close to what all Nueva students believe?"

After all, if we knew that the Standard Deviation $\sigma$ was really high (say, $25\%$), then we couldn't really trust this survey because our data may be really dispersed. Getting a score of $26.6\%$ isn't very reliable with a $\sigma$ of $25\%$ because it means there's also a good chance that only $1.6\%$ (calculated by $26.6\% - 25\%$) or even $51.6\%$ (calculated by $26.6\% + 25\%$) of students want the I-Lab courses.

### Okay ... I think I get that.

Great. If you don't, call me over and I can explain this to you.

Now, we would **love** to know the Standard Deviation $\sigma$, but we don't.

### How come we don't know it?

In order to know it, we'd need the value of $p$, because the formula for Standard Deviation uses it. But the value of $p$ is the *true* proportion of students who would answer "Yes" in our survey. And we don't know this number. That's why we're doing a survey in the first place! It takes too long to ask each and every student so we have to estimate $p$ with our best guess from a smaller sample.

We call this estimate $\hat{p}$, or `p-hat`.

We plug $\hat{p}$ for $p$ in our original equation to get an estimate for $\sigma$.

So we're essentially using one thing we estimate ($\hat{p}$) to get an estimate for something else we want: $\sigma$.

### Ah, okay. So we estimate $\sigma$ using $\hat{p}$ instead of $p$ in the equation. Is that right?

Yes.

### And by using $\hat{p}$ instead of $p$ in our equation, it produces something new called Standard Error $\overline{SE}$ instead of $\sigma$. Is that right?

Yup! You got it.

We use $\overline{SE}$ because in order to calculate it, we have all the information we need. The equation is below:

$$ \overline{SE}_{avg} = \sqrt{\frac{\hat{p} * (1-\hat{p})}{n}} $$

* $\hat{p}$ is the percentage we get in our survey
* $n$ is the total number of students who took our survey.

### Exercise 1:


Okay, hopefully you understand the above well enough now to answer the following questions in your own words:

1. Why do we even want to know the true $\sigma$ in the first place? How does it help us in answering the question *"How likely is the result of the survey to be representative of the actual opinions of students at Nueva?"*?
2. If we want to know it so badly, how come we can't just directly calculate $\sigma$? What are some difficulties?
3. How does surveying a sample of $n$ students help Stephen and Diane in estimating $\sigma$? 

In [None]:
# YOUR RESPONSES HERE

# 1.



# 2.



# 3.


Remember, the larger your $\overline{SE}$, the larger your estimate of "the spread of the data".

To give you an image to hold onto, think of $\overline{SE}$ as "how wide the net you're casting to catch the *true opinion*".

In other words, if $\overline{SE}$ is really large, then you can't really precisely say what proportion of students $p$ would take more I-Lab classes.

![Net](https://infinityconcepts.net/wp-content/uploads/2012/06/Church-Social-Media.jpg)
<center>*The larger $\overline{SE}$ is, the wider a net we are casting around the true $p$*</center>

### Play with the widget below to see how different values of $\hat{p}$ and $n$ change $\overline{SE}$

Below, I made an interactive widget that allows you to see how changing the following two parameters changes $\overline{SE}$.

* $\hat{p}$ - this is the proportion of students who respond to the survey as "Yes"
* $n$ - this is the total number of students who respond to the survey.

Remember that this is the equation for $\overline{SE}$.

$$ \overline{SE}_{avg} = \sqrt{\frac{\hat{p} * (1-\hat{p})}{n}} $$

In [None]:
# RUN THIS CELL AND PLAY WITH THE INTERACTIVITY
from ipywidgets import interact

@interact(p_hat=(0, 1, 0.05), num_students=(1, 100))
def standard_error(p_hat=0.15, num_students=5):
    """
        Given a p-hat and the total number of students, calculates the standard error using the formula:
        
            se = sqrt( (p-hat * (1-p-hat))/num_students)
    """
    std_err = (p_hat * (1-p_hat) / num_students)**0.5
    print("\n")
    return 'The Standard Error SE with n={n:d} and p̂={p_hat} is {se:0.3f}'.format(n=num_students, p_hat=p_hat, se=std_err)

### Exercise 2:


1. After interacting with the above widget, what values of $\hat{p}$ and $n$ created the largest value of Standard Error $\overline{SE}$?
2. What about for the smallest value of $\overline{SE}$?
3. In your own words, why do you think this relationship between $\hat{p}$, $n$ and how they affect $\overline{SE}$ occurs?

In [None]:
# YOUR RESPONSES HERE

# 1.


# 2. 


# 3. 



We want our $\overline{SE}$ to be low, but we don't have control over $\hat{p}$. The only real thing we have control over is $n$, and it's costly to go and find more students to answer the survey. So there's only so much we can do.
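
To get a feel for how costly it is to shrink $\overline{SE}$, here is a quick sketch. I'm holding $\hat{p}$ fixed at a hypothetical value of 0.3 and only changing $n$; the particular sample sizes are arbitrary.

```python
p_hat = 0.3  # hypothetical survey result, held fixed

se_by_n = {}
for n in [25, 100, 400, 1600]:
    # Same Standard Error formula as in the widget above
    se_by_n[n] = (p_hat * (1 - p_hat) / n) ** 0.5
    print("n = {:4d}  ->  SE = {:.4f}".format(n, se_by_n[n]))
```

Because $n$ sits under a square root, you need four times as many responses just to cut $\overline{SE}$ in half.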

### Confidence Levels and the Central Limit Theorem

This next section will require you to remember what the **Central Limit Theorem** said. If you don't remember, go back and review your notes.

### Exercise 3:

In your own words, summarize the Central Limit Theorem.

In [None]:
### Your answer here





## Can we use the Central Limit Theorem here?

For a normal distribution, approximately 68% of the data falls within 1 Standard Deviation $\sigma$ of the center (which we call "mu" or $\mu$) and approximately 95% of the data falls within 2 Standard Deviations $\sigma$ of the center.

![Normal Distribution](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Empirical_Rule.PNG/450px-Empirical_Rule.PNG)

In the above image, $\mu$ is the center, or average, of the distribution. In this case it's the same as $p$.
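
If you'd like to check the 68% and 95% figures yourself, here is a short sketch using `scipy.stats.norm`. It measures the area under a standard normal curve (center 0, Standard Deviation 1) within 1, 2 and 3 Standard Deviations of the center.

```python
from scipy.stats import norm

coverage = {}
for k in [1, 2, 3]:
    # Area under the standard normal curve between -k and +k standard deviations
    coverage[k] = norm.cdf(k) - norm.cdf(-k)
    print("within {} SD of the center: {:.1%}".format(k, coverage[k]))
```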

So how does this relate to what we just talked about?


### Well, it seems like the estimate in our survey should be approximately normally distributed according to the Central Limit Theorem, right?

Now you might think to yourself: "Well, if we sampled a bunch of students in our survey, our estimate $\hat{p}$ should be normally distributed according to the Central Limit Theorem."

### Wrong!

But this isn't quite true any more. It WOULD be true if we knew the true value of $\sigma$.

However because we're using Standard Error $\overline{SE}$, which is itself an estimate, our estimate $\hat{p}$ will be distributed very *close* to a normal distribution with a slight difference.

It will be distributed closer to something called the *Student t-distribution*.

## Part 2: The *t-distribution* and the *t-statistic*

![t-distribution](https://andyjconnelly.files.wordpress.com/2017/05/distributions1.png)

A *Student's t-distribution* is very close to a normal distribution, but allows for a little bit more error given that we're estimating $\sigma$ with $\overline{SE}$. That's why its peak is a little bit shorter and its tails are a bit more spread out. After all, we have less certainty so it makes sense that the *t-distribution* covers more ground.  

A *Student's t-distribution* has an additional parameter called the "degrees of freedom" that basically refers to how many datapoints were used to generate $\overline{SE}$ (minus one).


   > Footnote: The reason this distribution is called the *Student's t-distribution* is because the creator, William Gosset, was not allowed by the company he worked for (the Guinness brewery) to publish this finding under his real name, so he used the pseudonym "Student".

### What's "Degrees of Freedom"?

The "degrees of freedom" or "df" will change the shape of the *t-distribution*. "df" is tied to the number of datapoints in the sample used to generate $\overline{SE}$: for a sample of $n$ datapoints, df is $n - 1$. The more degrees of freedom, the closer the *t-distribution* looks to a normal distribution. In general you want more degrees of freedom.

> Why is it called "degrees of freedom": Degrees of freedom is a statistical concept that basically describes how much "room" there is in your model to vary and still capture the right answer. To use a non-statistical example, imagine you're taking a multiple choice test, and you get stumped on a question that has 4 options. You're not sure which one is correct, but you eliminate the first three options. Well, if you know the first three options are wrong the fourth option must be right!
>
> In this above example, we would say that the question you were working on has 3 degrees of freedom. That's because as long as you know 3 of the choices are wrong you have enough information to perfectly answer the question. 
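
To put a number on "a bit more spread out", here is a sketch that compares how much probability sits more than 2 units away from the center for a normal distribution versus *t-distributions* with different degrees of freedom. The cutoff of 2 and the list of df values are arbitrary choices of mine.

```python
from scipy.stats import norm, t as student_t

# Probability of landing more than 2 units away from the center (both tails)
normal_tail = 2 * norm.sf(2)
print("normal:        {:.3f}".format(normal_tail))

t_tails = {}
for df in [1, 5, 30, 100]:
    t_tails[df] = 2 * student_t.sf(2, df)
    print("t (df = {:3d}): {:.3f}".format(df, t_tails[df]))
```

The tails of the *t-distribution* are always heavier than the normal's, and the gap closes as df grows.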

### INTERACTIVE: Try playing around with the degrees of freedom to see how it changes the distribution

In [None]:
# YOU DON'T NEED TO MODIFY THIS CODE. 
# JUST RUN THIS AND PLAY WITH THE RESULT BELOW.

import numpy as np
from scipy.stats import t as student_t, norm
from matplotlib import pyplot as plt
from ipywidgets import interact  # imported here so this cell runs on its own


@interact(df=(1, 30, 1))
def plot_t_distributions(df=1):
   
    fig, ax = plt.subplots(figsize=(8, 5))
    
    center = 0

    x = np.linspace(center-10, center+10, 1000)

    # plot the t-distribution

    t_dist = student_t(df, center)
    
    plt.plot(x, t_dist.pdf(x), ls='--', c='black', label="t(df={})".format(df))

    plt.xlim(center-5, center+5)
    plt.ylim(0.0, 0.45)
    
    # also overlay the normal distribution on top
    
    normal_dist = norm(center, 1)
    plt.plot(x, normal_dist.pdf(x), ls='-', c='blue', label='Normal Distribution')
    

    plt.xlabel('$x$')
    plt.ylabel('Probability')
    plt.title("Student's $t$ Distribution")

    plt.legend()
    plt.show()

### Exercise 4:

Run the code above. What happens when you increase the `df` (degrees of freedom)?

In [None]:
### Your answer here






## Part 3: Introduction to Confidence Intervals

### What's a Confidence Interval?

A confidence interval is a range of values which is likely to include the true value we're after, which in this case is $p$, the proportion of students who would be interested in taking an I-Lab class.

A confidence interval is constructed using the values of $\hat{p}$ and $\overline{SE}$.

### Exercise 5:

What's an interval of numbers (e.g., `[0, 100]`, read as "0 through 100") that is 100% likely to contain the true $p$? "100% likely" here means that you are CERTAIN that $p$ must lie within.

* Hint: What range of numbers can you guarantee $p$ will be in?
* This can be your "confidence interval", but it's not particularly useful because it is so inclusive of everything ...

In [None]:
# YOUR RESPONSE HERE









### How confidence intervals help us

Confidence intervals help us figure out a "range of numbers that we believe the true parameter we are estimating lies within."

You probably have seen some form of confidence intervals, such as in polling or election data.

For example, if a news outlet says:
>"Candidate X is $5$ percentage points ahead $\pm$ $1.6$ percentage points."

That means that they believe that Candidate X could be up to $6.6$ points ahead, or as little as $3.4$ points ahead. 

In this case, the news outlet's confidence interval is `[3.4, 6.6]`. We say that it's "from $3.4$ to $6.6$".

### Exercise 6:

Time to check your understanding. What did I do to calculate the numbers $6.6$ and $3.4$ for my confidence interval in the above example?

In [None]:
### YOUR RESPONSE HERE










Below is a picture of a confidence interval visualized for some temperature data: 

![Weather Data Confidence Intervals](https://rafalab.github.io/dsbook/book_files/figure-html/first-confidence-intervals-example-1.png)

The pale gray bands represent the range of "confidence interval values" that the true "average yearly temperature" can lie between.

## How do you construct a confidence interval?

In order to construct a confidence interval, you first need to answer the question: "Well, how confident do I need to be about the range of values?" 

In other words, you need to set a *confidence level.*

Examples of confidence levels: 99%, 95%, 90%, 50%, 85%.

The higher the confidence levels, the *wider*  your range of values will be.

Hopefully that makes intuitive sense to you. After all, if you want to be more confident that your range of values will contain the true parameter, you need to cast your net wider. 

Here is a visual example:

![Confidence Levels](http://www.biochemia-medica.com/assets/images/upload/Clanci/18/18-2/confidence_interval/182_Simundic_lessons_slika1.jpg?1534441897846)

The higher the required confidence level, the "wider" your interval of numbers has to be in order to achieve this confidence level.

### Using the confidence level to get the *t-statistic*

The reason setting a confidence level matters is because you need it to get something called the *t-statistic*.

The idea behind the *t-statistic* is the same as the idea behind a *z-score*. It expresses "how many standard deviations away" from the "center" of the distribution a value is.

However the *t-statistic* is a little bit different, and we can't use the z-score like we did in previous assignments.

**Well if that's the case Carl, how come we don't just use the z-score?**

Because we can't.

We can only use the z-score in a valid manner if (a) we have enough data in our sample and (b) we know the standard deviation of the population we're sampling from.

Unfortunately, sometimes we can only sample small groups of people (e.g., only a few respond to our survey).

Or, at other times, we don't know enough information about the population to know the true standard deviation.

As a result, we have to approximate the normal distribution with the *t-distribution* which allows for more error.

As a result of using the *t-distribution* we also use the *t-statistic* instead of the *z-score*.

**Okay, I think that makes more sense.**

If it doesn't, read steps 1-5 all over again!

## Formula for confidence intervals

The equation for calculating the confidence interval is the following:

$$\hat{p} \pm t_{confidence} * \overline{SE}$$

The $\pm$ symbol means "plus-minus" and you will generate the lower and upper bound of your confidence interval in this fashion.

The value $t_{confidence}$ is known as the *t-statistic* and depends on the *confidence level* that you set. The higher confidence level you want (e.g., $99\%$ versus $85\%$), the larger $t_{confidence}$ will be and thus the wider the "net" that will be cast. 
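
Here is a rough sketch of the whole formula in action. Everything in it is hypothetical: the survey numbers (12 "Yes" out of 40) are made up, I'm assuming the usual convention that df is the sample size minus one, and I'm assuming a two-sided interval, which is why the leftover area is split between the two tails when looking up $t_{confidence}$.

```python
from scipy.stats import t as student_t

# Hypothetical survey (not the Nueva one): 12 of 40 students said "Yes"
n = 40
p_hat = 12 / n
se = (p_hat * (1 - p_hat) / n) ** 0.5
df = n - 1  # assumption: the usual convention of sample size minus one

intervals = {}
for confidence in [0.85, 0.95, 0.99]:
    # Two-sided interval: split the leftover area evenly between both tails
    t_conf = student_t.ppf((1 + confidence) / 2, df)
    lower = p_hat - t_conf * se
    upper = p_hat + t_conf * se
    intervals[confidence] = (lower, upper)
    print("{:.0%} interval: t = {:.3f}, [{:.3f}, {:.3f}]".format(confidence, t_conf, lower, upper))
```

Notice that $\hat{p}$ and $\overline{SE}$ never change between the three lines. Only $t_{confidence}$ does, and that alone is what widens the net.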

### So how do we calculate the *t-statistic*?

In order to calculate the value of $t_{confidence}$ AKA the *t-statistic*, we will need to look at the *Student t-distribution*.

If you need a refresher, here's what different distributions look like with varying *degrees of freedom*:

![Student t-distribution](https://i2.wp.com/www.real-statistics.com/wp-content/uploads/2012/11/t-distribution-chart.png)

### What do *degrees of freedom* mean?

Basically, they refer to the number of datapoints in your sample, minus one.

So the more datapoints you have in your sample, the more degrees of freedom, and the closer the Student t-distribution looks to the Normal distribution.

### What does the *t-statistic* mean?

Your chosen confidence level (e.g., 95%, 99%) is basically the "area under the curve" that you want covered.

And the *t-statistic* is basically the number of standard deviations you have to go from the center in order to cover the percentage you chose for your confidence level.

### Huh? I'm confused. What does the confidence level have to do with the *t-statistic*?

Don't worry. I have a nifty visual for you to play around with.

In [9]:
# SETUP. YOU DON'T NEED TO MODIFY THIS. RUN THIS CODE AND PLAY AROUND WITH IT.

import numpy as np
from scipy.stats import t as student_t, norm
from matplotlib import pyplot as plt
from ipywidgets import interact

def get_t_statistic(confidence_level, df):
    """
        Given a confidence level and df, returns the t-statistic associated with
        a t-distribution at that confidence level.

        The interval is two-sided (estimate plus or minus t * SE), so the
        leftover area (1 - confidence_level) is split evenly between both tails.
    """
    return student_t.ppf((1 + confidence_level) / 2, df)

@interact(confidence_level=(0.05, 0.999, 0.001))
def plot_t_distributions_with_area(confidence_level):

    fig, ax = plt.subplots(figsize=(8, 5))

    center = 0 # center t-distribution at 0

    df = 50 # set a high df

    # plot the t-distribution

    t_dist = student_t(df, center)

    x = np.linspace(center-5, center+5, 1000)

    plt.plot(x, t_dist.pdf(x), ls='--', c='black', label="t(df={})".format(df))

    plt.xlim(center-5, center+5)
    plt.ylim(0.0, 0.45)


    # filling in the central part of the plot that covers the confidence level
    t_statistic = get_t_statistic(confidence_level, df)
    plt.xticks([center, t_statistic], [center, "{:0.3f}".format(t_statistic)])

    # shade symmetrically around the center, from -t_statistic to +t_statistic
    fill_range = np.linspace(center - t_statistic, center + t_statistic, 200)

    plt.fill_between(fill_range, t_dist.pdf(fill_range))


    plt.xlabel('$x$')
    plt.ylabel('Probability')
    plt.title("t-statistic from the Student's $t$ Distribution")

    plt.legend()
    plt.show()

    print("Your chosen confidence level is: {:0.3f}%".format(confidence_level*100))
    print("\n")
    print("A confidence level of {:0.3f}% means {:0.3f}% of the area underneath the distribution is covered.".format(confidence_level*100, confidence_level*100))
    print("\n")
    print("To cover {:0.3f}% of the distribution, you need to go at least {:0.3f} units from the center.".format(confidence_level*100, t_statistic))
    print("\n")
    print("This means that {:0.3f} is the t-statistic at this confidence level for a t-distribution with df = {}".format(t_statistic, df))


### Exercise 7:

After playing with the above widget, use your own words and explain how changing the confidence level changes the *t-statistic*. Why do you think this relationship between confidence level and *t-statistic* exists?

In [None]:
### YOUR RESPONSE HERE










## Part 4: Calculating the Confidence Interval

Okay, so now you know how the confidence level is connected to the t-statistic.

Let's put it all together.

But first, a brief review:

The equation for calculating the confidence interval is the following:

$$\hat{p} \pm t_{confidence} * \overline{SE}$$

* $\overline{SE}$ - This is called "Standard Error" and you learned in Part 1 how to calculate it and use it as an estimate of the Standard Deviation $\sigma$
* $t_{confidence}$ - This is the *t-statistic* and you learned in Part 2 how to use the *Student t-distribution* and the *confidence level* to calculate it.
* $\hat{p}$ - This is your estimate.


Now you have everything you need in order to calculate the confidence interval.

You're going to calculate the confidence interval for the situation I gave at the beginning of this notebook.

Remember the situation again? Don't worry, I'll repeat it for you below:

> At Nueva, Stephen Dunn and Diane Rosenberg have been hearing from a few families and students that they wish there were more I-Lab courses. Stephen and Diane are open to hiring more I-Lab faculty to offer more courses, but they have to be sure that these classes would be filled before allocating funding to do so. 
> 
> After all, what if these are just a few fringe opinions by a handful of families, and don't truly reflect the opinions of Nueva as a whole?
> 
> So they decide to survey some students at Nueva and get some data.
> 
> The survey asks: *(Yes/No) If we offered more Nueva classes, would you be open to taking it?*
> 
> In order to justify hiring more teachers, Stephen and Diane would want **at least 25%** of the students who respond to say *"Yes"*.
> 
> They get the survey results back. There were a total of **30** students who responded and **8** of them said *"Yes"*. That's about **26.6%**, which seems promising, but Stephen and Diane want to know: how confident can we be in this result? Could it have been a fluke?
> 
> They hire you to do some data analysis and answer the question: "Should we hire some more I-Lab faculty?"

### Exercise 8:

1. If you were hired to answer this question, what confidence level would you pick? There's no right or wrong answer here, but you need to be able to justify your answer in your response.

2. Given the confidence level above, identify the *degrees of freedom* and use the function I've provided for you below to calculate the *t-statistic*.

3. Calculate the $\overline{SE}$ using the information in the story above. Also identify $\hat{p}$.

4. Now, using your answers to the above, calculate the lower and upper bound of the confidence interval. I've copied the equation for you below:
    $$\hat{p} \pm t_{confidence} * \overline{SE}$$
5. Finally, use all of this analysis to answer the question: "Should Nueva hire more I-Lab faculty given the responses in this survey?"





In [None]:
###  YOUR RESPONSES HERE

# 1.


# 2. 
# Needed for part 2. YOU DON'T NEED TO MODIFY THIS FUNCTION.
def get_t_statistic(confidence_level, df):
    """
        Description:
            Given a confidence level and df, returns the t-statistic associated with 
            a t-distribution at that confidence level. The interval is two-sided
            (p-hat plus or minus t * SE), so the leftover area is split between both tails.
            
        Example Usage:
        
            t_statistic = get_t_statistic(0.95, 30)
    """
    return student_t.ppf((1 + confidence_level) / 2, df)



# 3. 


# 4. 


# 5.







## Part 5: Interpreting confidence intervals and how it ties to hypothesis testing

In this final Part 5, we will learn about how to interpret the meaning of confidence intervals and how it ties to hypothesis testing.

Before reading any further, try and answer the question below:

### Exercise 9:

If a statistician conducted a survey about a city's support for a new law, and calculated a 99% confidence interval of `(0.44, 0.54)` for some parameter they are estimating called $p$, which of the following are true statements and why?

> A. "There's a 99% chance that the true value of $p$ is between 0.44 and 0.54."
> 
> B. "99% of the time, the true value of $p$ will fall between 0.44 and 0.54."
> 
> C. "If the statistician could repeat this survey many times, and calculated a confidence interval each time, about 99% of those confidence intervals would contain the true value of $p$."

In [None]:
### Your response (explain which options you picked and why)


















## Interpreting Confidence Intervals

Now, watch this video from Khan Academy explaining confidence intervals: [click here](https://www.khanacademy.org/math/ap-statistics/estimating-confidence-ap/introduction-confidence-intervals/v/interpreting-confidence-intervals-example).

After watching this video, complete the exercise below.

### Exercise 10:

Have your choices for Option A - C for the previous question changed? Why or why not?

In [None]:
# YOUR RESPONSE HERE










## How are confidence intervals related to hypothesis testing, the `alpha` value and `p-values`?

Remember hypothesis testing?

If you don't remember, a short summary is below:

**Hypothesis Testing Summary**
1. You set up a hypothesis test by having a `null hypothesis` called $h_0$ and an `alternative hypothesis` called $h_1$
2. To figure out which is more likely to be true, you collect some data.
3. You then take your data and compute some measurements (e.g., taking the average).
4. You also pick an `alpha` level (ideally before looking at the data). This is similar to the idea of your `risk tolerance` level. The `alpha` level is between `0` and `1` and is the cutoff for *how unlikely your data would have to be, IF $h_0$ were true, before you reject $h_0$.*
    * Many people, when conducting hypothesis testing, set their alpha to be 0.05. That means they reject $h_0$ only if there would be less than a 5% chance of data this extreme appearing if $h_0$ were true.
5. You look at your measurement and calculate what percentage of the time a measurement THIS EXTREME OR MORE SO would have come up if $h_0$ were true. You can do this mathematically or by simulating it using a computer. For example, you can resample your data thousands of times and calculate how often you get measurements as extreme as the one you first did.
6. If you get the measurement a very small fraction of times, smaller than your `alpha` level, then you say "since a measurement this extreme would show up less than `alpha` of the time under the null hypothesis $h_0$, I'm going to reject it in favor of my alternative."
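
Here is one way steps 4 through 6 could look when you go the simulation route. This is only a sketch under made-up assumptions: the null hypothesis value, the sample size and the observed count are all hypothetical, the test is one-sided (only "more Yes answers than expected" counts as extreme), and instead of resampling real data it simulates fresh surveys from a world where $h_0$ is true.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Hypothetical setup (all numbers made up for illustration)
null_p = 0.25        # h_0: the true proportion of "Yes" answers is 25%
n = 40               # students surveyed
observed_yes = 16    # "Yes" answers actually observed
alpha = 0.05         # chosen before looking at the data

# Simulate 100,000 surveys in a world where h_0 is true
simulated_yes = rng.binomial(n=n, p=null_p, size=100_000)

# Fraction of simulated surveys at least as extreme as the observed one
p_value = (simulated_yes >= observed_yes).mean()

print("simulated p-value:", p_value)
print("reject h_0" if p_value < alpha else "fail to reject h_0")
```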

**What's the relationship between Confidence Intervals and Hypothesis testing?**

Statisticians say that the measurements you got, if they would show up less than `alpha` of the time under $h_0$, are *statistically significant.*
    
So what is the relationship between a `confidence level` (say, $0.95$) and an `alpha` level (say, $0.05$)? How does that play into whether the result is *statistically significant* or not?

There's a deep connectedness.

To learn about it, read over this link (especially the last section) and answer the question below: [click here](http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels).

### Exercise 11:

1. In your own words, what is the relationship between a `X% confidence interval` of an estimate and the `alpha` level?
 
2. If I do a hypothesis test, and I sample some data and take a measurement to construct a 95% confidence interval of `(55, 95)` for some parameter called $\mu$, what values would the null hypothesis have to propose for $\mu$ in order for me to REJECT the null hypothesis at an alpha of `0.05`?

In [None]:
# YOUR RESPONSES HERE








