<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Review: CLT, Confidence Intervals, and Hypothesis Testing

_Authors: Matt Brems (DC)_

---

### First, read in the housing data (code provided).

You can find the original data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data).

In [1]:
import pandas as pd

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# The file is whitespace-delimited and has no header row, so we supply the column names.
data = pd.read_csv("../datasets/housing.data", header=None, names=names, sep=r"\s+")

NOX = data['NOX'].values
AGE = data['AGE'].values

### 1. Find the mean, standard deviation, and standard error of the mean for the `AGE` variable.

In [2]:
import numpy as np

In [3]:
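# Note: `np.std` defaults to `ddof=0` (the population formula).
# Pass `ddof=1` to get the sample standard deviation used in the later cells.
mean = np.mean(data['AGE'])
std = np.std(data['AGE'])
se = std/(len(data['AGE'])**0.5)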

print("The mean is {}".format(mean))
print("The standard deviation is {}".format(std))
print("The standard error of the mean is {}".format(se))

The mean is 68.57490118577076
The standard deviation is 28.121032570236867
The standard error of the mean is 1.250132382568063
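
The standard error depends on which standard deviation formula is used. The sketch below compares the `ddof=0` and `ddof=1` versions and checks them against `scipy.stats.sem`, which uses `ddof=1` by default. The array `x` is a synthetic stand-in chosen for illustration, not the housing data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a column such as AGE (assumption: not the real data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=506)

n = len(x)
se_population = np.std(x) / np.sqrt(n)        # ddof=0 (NumPy's default)
se_sample = np.std(x, ddof=1) / np.sqrt(n)    # ddof=1 (sample standard deviation)
se_scipy = stats.sem(x)                       # scipy.stats.sem defaults to ddof=1

print("SE with ddof=0:  {}".format(se_population))
print("SE with ddof=1:  {}".format(se_sample))
print("scipy.stats.sem: {}".format(se_scipy))
```

With 506 observations the two versions differ only slightly, but the gap grows as the sample gets smaller.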


### 2. Generate a 90%, 95%, and 99% confidence interval for `AGE`.

You can use the `scipy.stats.t.interval()` function to calculate the confidence interval range.

```python
# Endpoints of the central range that contains a fraction `confidence` of the distribution
# (this first argument was named `alpha` in older SciPy releases):
stats.t.interval(confidence, df, loc=0, scale=1)
```

Arguments:
- `df`: The degrees of freedom; this will be the length of the vector minus 1.
- `loc`: The mean of the t-distribution (your point estimate, i.e., the mean of the variable).
- `scale`: The standard deviation of the t-distribution (i.e., the standard error of your sample mean).

**Interpret the results from all three confidence intervals.**

NOTE 1: For guidance on when to use a z-score (e.g., the 1.96 multiplier for a 95% confidence interval) versus the t-distribution (as in the current solution), see https://math.stackexchange.com/questions/1350635/when-do-i-use-a-z-score-vs-a-t-score-for-confidence-intervals  
NOTE 2: It's OK if the difference between the two isn't clear at this stage :)
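
To make the z-versus-t comparison concrete, here is a minimal sketch that prints the 95% critical value from the normal distribution next to t critical values for a few degrees of freedom. The particular `df` values are arbitrary illustrations, apart from 505, which corresponds to a sample of 506 observations.

```python
from scipy.stats import norm, t

confidence = 0.95
tail = 1 - (1 - confidence) / 2   # 0.975: upper quantile for a two-sided interval

z_crit = norm.ppf(tail)
print("z critical value: {:.4f}".format(z_crit))

# The t critical value shrinks toward the z value as the degrees of freedom grow.
t_crits = {}
for df in [5, 30, 100, 505]:
    t_crits[df] = t.ppf(tail, df)
    print("t critical value (df={}): {:.4f}".format(df, t_crits[df]))
```

For a sample as large as ours, the two multipliers are nearly identical, which is why the z-based and t-based intervals in this notebook are so close.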

In [4]:
from scipy.stats import t

In [5]:
# z-based 95% interval, using the 1.96 multiplier and the standard error from Question 1.
mean - 1.96 * se, mean + 1.96 * se

(66.12464171593736, 71.02516065560415)

In [6]:
t_interval_95 = t.interval(0.95,
                           len(AGE)-1,
                           loc=np.mean(AGE),
                           scale=np.std(AGE, ddof = 1)/(len(AGE))**0.5)

print("We are 95% confident that the true mean value for 'AGE' is between {} and {}.".format(t_interval_95[0], t_interval_95[1]))

We are 95% confident that the true mean value for 'AGE' is between 66.11636971854321 and 71.0334326529983.


In [7]:
t_interval_90 = t.interval(0.9,
                           len(AGE)-1,
                           loc=np.mean(AGE),
                           scale=np.std(AGE, ddof = 1)/(len(AGE))**0.5)

print("We are 90% confident that the true mean value for 'AGE' is between {} and {}.".format(t_interval_90[0], t_interval_90[1]))

t_interval_95 = t.interval(0.95,
                           len(AGE)-1,
                           loc=np.mean(AGE),
                           scale=np.std(AGE, ddof = 1)/(len(AGE))**0.5)

print("We are 95% confident that the true mean value for 'AGE' is between {} and {}.".format(t_interval_95[0], t_interval_95[1]))

t_interval_99 = t.interval(0.99,
                           len(AGE)-1,
                           loc=np.mean(AGE),
                           scale=np.std(AGE, ddof = 1)/(len(AGE))**0.5)

print("We are 99% confident that the true mean value for 'AGE' is between {} and {}.".format(t_interval_99[0], t_interval_99[1]))

We are 90% confident that the true mean value for 'AGE' is between 66.51279866704186 and 70.63700370449965.
We are 95% confident that the true mean value for 'AGE' is between 66.11636971854321 and 71.0334326529983.
We are 99% confident that the true mean value for 'AGE' is between 65.33936041834139 and 71.81044195320013.


Recall that a 99% confidence interval will appear in this form: 

$$\bar{x}-t \frac{s}{\sqrt{n}} \ , \ \bar{x}+t \frac{s}{\sqrt{n}}$$

Here, `t` is the critical t-value with `n - 1 = 505` degrees of freedom and 99% confidence.

In [8]:
# A two-sided 99% interval leaves 0.5% in each tail, so we take the 0.995 quantile.
critical_t = t.ppf(0.995, len(AGE) - 1)
print(round(critical_t, 3))

2.586
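
As a check on the formula, the sketch below builds a 99% interval by hand from the critical value and compares it with `t.interval`. It uses a synthetic array `x` as a stand-in (an assumption for illustration), so the numbers will not match the `AGE` results.

```python
import numpy as np
from scipy.stats import t

# Synthetic stand-in data (assumption: not the housing data).
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=506)

n = len(x)
x_bar = np.mean(x)
se = np.std(x, ddof=1) / np.sqrt(n)

confidence = 0.99
critical_t = t.ppf(1 - (1 - confidence) / 2, n - 1)

# x_bar -/+ t * s / sqrt(n), written out explicitly.
manual_interval = (x_bar - critical_t * se, x_bar + critical_t * se)
scipy_interval = t.interval(confidence, n - 1, loc=x_bar, scale=se)

print(manual_interval)
print(scipy_interval)
```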


### 3. Did you rely on the central limit theorem in Question 2? Why or why not? Explain.


**A.** _Yes. We don't know whether `AGE` is normally distributed (and a histogram of the variable suggests it is not). But because our sample size (`n = 506`) is well above the common rule-of-thumb threshold of 30, the central limit theorem tells us that the sampling distribution of the sample mean (x-bar) is approximately normal, which justifies using the t-distribution to generate confidence intervals._
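
The central limit theorem claim can be checked with a small simulation. The sketch below repeatedly draws samples from a strongly skewed population and looks at how the sample means behave. The exponential population and the number of repetitions are illustrative assumptions, not a model of `AGE`.

```python
import numpy as np

rng = np.random.default_rng(42)

# A right-skewed population: exponential with mean 1 and standard deviation 1.
n = 506            # sample size, matching the housing data
n_repeats = 5000   # number of simulated samples
sample_means = rng.exponential(scale=1.0, size=(n_repeats, n)).mean(axis=1)

print("Mean of sample means: {:.4f} (population mean is 1)".format(sample_means.mean()))
print("Std of sample means:  {:.4f} (theory: 1/sqrt(n) = {:.4f})".format(
    sample_means.std(ddof=1), 1 / np.sqrt(n)))

# If the sample means are roughly normal, about 95% fall within 1.96 standard errors.
within = np.mean(np.abs(sample_means - 1.0) < 1.96 / np.sqrt(n))
print("Share within 1.96 SE: {:.3f}".format(within))
```

Even though individual draws are far from normal, the share of sample means within 1.96 standard errors should land close to 0.95.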

### 4. For the `NOX` variable, generate a 95% confidence interval and interpret it.

In [9]:
t_interval_95 = t.interval(0.95,
                           len(NOX)-1,
                           loc=np.mean(NOX),
                           scale=np.std(NOX, ddof = 1)/(len(NOX))**0.5)

print("We are 95% confident that the true mean value for 'NOX' is between {} and {}.".format(t_interval_95[0], t_interval_95[1]))

We are 95% confident that the true mean value for 'NOX' is between 0.5445742622921801 and 0.5648158562848951.
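
One way to interpret "95% confident" is through repeated sampling: if we drew many samples and built an interval from each, about 95% of those intervals would contain the true mean. The sketch below simulates this. The population parameters (`true_mean`, the scale of 0.12, and the normal shape) are hypothetical values chosen for the simulation.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)

true_mean = 0.55    # hypothetical population mean (assumption)
n = 506
n_repeats = 2000
confidence = 0.95

critical_t = t.ppf(1 - (1 - confidence) / 2, n - 1)

# Draw many samples at once; each row is one sample of size n.
samples = rng.normal(loc=true_mean, scale=0.12, size=(n_repeats, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

lower = means - critical_t * ses
upper = means + critical_t * ses
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))

print("Share of intervals containing the true mean: {:.3f}".format(coverage))
```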


### 5. For the `NOX` variable, test the hypothesis that the mean is equal to the median. 

You can use `scipy` functions to complete this, but be sure to complete all steps listed below.

1. Define the hypotheses.
2. Set `alpha` to equal 0.05.
3. Calculate the point estimate.
4. Calculate the test statistic.
5. Find the p-value.
6. Interpret the results.

In [10]:
## Step 1: Define the hypotheses.
### H_0: mu_NOX = M_NOX
### H_A: mu_NOX != M_NOX

## Step 2: Set `alpha` to equal 0.05.
alpha = 0.05

## Step 3: Calculate the point estimate.
sample_mean = np.mean(NOX)

## Step 4: Calculate the test statistic.
t_statistic = (sample_mean - np.median(NOX))/(np.std(NOX, ddof=1)/len(NOX)**0.5)

## Step 5: Find the p-value.
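## `t.sf` is the survival function, `1 - cdf`, evaluated at a given value
## (i.e., the probability of a value larger than the one observed).
## Note: a one-sample t-test has `len(NOX) - 1` degrees of freedom; this cell passes `len(NOX)`,
## which changes the p-value only negligibly at this sample size.
p_value = t.sf(np.abs(t_statistic), len(NOX)) * 2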


## Because our alternative hypothesis is `!=` (rather than greater than or less than),
## we double our p-value. (This is called a two-sided test).
print("Our sample median is {}".format(np.median(NOX)))
print("Our sample mean is {}".format(sample_mean))
print("Our t-statistic is {}".format(t_statistic))
print("Our p-value is {}".format(p_value))

if p_value < alpha:
    print("We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.")
elif p_value > alpha:
    print("We fail to reject our null hypothesis and cannot conclude that the true mean NOX value is different from the median NOX value.")
else:
    print("Our test is inconclusive.")

Our sample median is 0.538
Our sample mean is 0.5546950592885376
Our t-statistic is 3.24088371677941
Our p-value is 0.0012700527361798387
We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.
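
The manual steps above can be cross-checked against `scipy.stats.ttest_1samp`, which runs a two-sided one-sample t-test with `n - 1` degrees of freedom. The sketch below does the comparison on a synthetic array `x`. The array and the value 0.538 used as `hypothesized_value` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption: not the real NOX column).
rng = np.random.default_rng(3)
x = rng.normal(loc=0.55, scale=0.12, size=506)
hypothesized_value = 0.538   # the fixed value we test the mean against

# Manual calculation, using n - 1 degrees of freedom.
n = len(x)
t_manual = (np.mean(x) - hypothesized_value) / (np.std(x, ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(np.abs(t_manual), n - 1)

# The same two-sided test via scipy.
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=hypothesized_value)

print("manual: t = {:.4f}, p = {:.6f}".format(t_manual, p_manual))
print("scipy:  t = {:.4f}, p = {:.6f}".format(t_scipy, p_scipy))
```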


#### NOTE: In practice, we will usually let `scipy` run the test for us, for example with a two-sample (Welch's) t-test:

In [11]:
from scipy import stats

group1 = np.random.randn(40)
group2 = 4*np.random.randn(50)

t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print("t_statistic = {}  p_value = {}".format(t_statistic, p_value))

t_statistic = -0.659937698606985  p_value = 0.5118969239267787


### 6. What do you notice about the results from Questions 4 and 5? 

**If you were going to generalize these observations to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.**

**A.** _When we calculated the median, it was 0.538. The 95% confidence interval for our mean ran from about 0.545 to 0.565. Because the median lies outside our 95% confidence interval, this suggests that the true mean is not equal to the median._

_We then conducted the hypothesis test and found that, at the `alpha = 0.05` significance level, we rejected the hypothesis that the mean and median were equal._

_The results of our hypothesis test and confidence interval are in agreement here. Because our significance level (for `HT`) is `alpha`, as long as our confidence level (for `CI`) is `1 - alpha`, the results should be in agreement. So, if the value of interest does not lie in our `1 - alpha CI`, then the hypothesis that the parameter equals the value of interest should be rejected at the `alpha` significance level. Similarly, if the value of interest **does** lie in our `1 - alpha CI`, then the hypothesis that the parameter equals the value of interest should **not** be rejected at the `alpha` significance level._
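
This agreement between the two-sided test and the confidence interval can be illustrated numerically. The sketch below tests several hypothesized values, some inside a 95% interval and some outside, and records whether "inside the interval" always coincides with "not rejected". The data in `x` and the grid of candidate values are assumptions made for the demonstration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption: not the real NOX column).
rng = np.random.default_rng(11)
x = rng.normal(loc=0.55, scale=0.12, size=506)

alpha = 0.05
n = len(x)
x_bar = np.mean(x)
se = stats.sem(x)   # sample standard error (ddof=1)
lower, upper = stats.t.interval(1 - alpha, n - 1, loc=x_bar, scale=se)

# Candidate values from 4 standard errors below the mean to 4 above.
agreement = []
for value in np.linspace(x_bar - 4 * se, x_bar + 4 * se, 9):
    _, p_value = stats.ttest_1samp(x, popmean=value)
    inside_ci = lower <= value <= upper
    rejected = p_value < alpha
    # The two procedures agree when exactly one of these is true.
    agreement.append(inside_ci != rejected)
    print("value={:.4f}  inside CI: {}  rejected: {}".format(value, inside_ci, rejected))
```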

### 7. For the `NOX` variable, test whether the mean is greater than the median (a one-sided test).

You can use `scipy` functions to complete this, but be sure to complete all steps listed below.

1. Define the hypotheses.
2. Set `alpha` to equal 0.05.
3. Calculate the point estimate.
4. Calculate the test statistic.
5. Find the p-value.
6. Interpret the results.

In [12]:
## Step 1: Define the hypotheses.
### H_0: mu_NOX <= M_NOX
### H_A: mu_NOX > M_NOX

## Step 2: Set `alpha` to equal 0.05.
alpha = 0.05

## Step 3: Calculate the point estimate.
sample_mean = np.mean(NOX)
sample_median = np.median(NOX)

## Step 4: Calculate the test statistic.
t_statistic = (sample_mean - sample_median)/(np.std(NOX, ddof=1)/len(NOX)**0.5)

## Step 5: Find the p-value.
## Because our alternative hypothesis is greater than (rather than not equal to),
## we take only the upper tail and DO NOT double our p-value. (This is called a one-sided test.)
## Note: as in Question 5, the exact degrees of freedom would be `len(NOX) - 1`.
p_value = t.sf(t_statistic, len(NOX))

print("Our sample mean is {}".format(sample_mean))
print("Our sample median is {}".format(sample_median))
print("Our t-statistic is {}".format(t_statistic))
print("Our p-value is {}".format(p_value))

if p_value < alpha:
    print("We reject our null hypothesis and conclude that the true mean NOX value is greater than the median NOX value.")
elif p_value > alpha:
    print("We fail to reject our null hypothesis and cannot conclude that the true mean NOX value is greater than the median NOX value.")
else:
    print("Our test is inconclusive.")

Our sample mean is 0.5546950592885376
Our sample median is 0.538
Our t-statistic is 3.24088371677941
Our p-value is 0.0006350263680899193
We reject our null hypothesis and conclude that the true mean NOX value is greater than the median NOX value.


### 8. Compare the p-values from Questions 5 and 7. What do you notice?

**A.** _The p-value in Question 5 is exactly double the p-value in Question 7. This is because our alternative hypotheses are different. In Question 5, we reject the null hypothesis when the sample mean is either far above or far below the hypothesized value. Because this is a two-sided test, we double the tail probability. In Question 7, we reject the null hypothesis only when the sample mean is far above the hypothesized value, **not** when it is far below. Because only one tail counts as evidence against the null, this is a one-sided test, and we therefore do not double our p-value._
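
The relationship between one-sided and two-sided p-values can also be seen through the `alternative` argument of `scipy.stats.ttest_1samp` (available in SciPy 1.6 and later). This is a sketch on synthetic data. The array `x` and `hypothesized_value` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption: not the real NOX column).
rng = np.random.default_rng(5)
x = rng.normal(loc=0.56, scale=0.12, size=506)
hypothesized_value = 0.538

t_two, p_two = stats.ttest_1samp(x, popmean=hypothesized_value)
t_greater, p_greater = stats.ttest_1samp(x, popmean=hypothesized_value, alternative='greater')
t_less, p_less = stats.ttest_1samp(x, popmean=hypothesized_value, alternative='less')

print("two-sided p-value:           {}".format(p_two))
print("one-sided (greater) p-value: {}".format(p_greater))
print("one-sided (less) p-value:    {}".format(p_less))
```

The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them, so the direction of the alternative hypothesis determines which tail is used.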