<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Review: CLT, Confidence Intervals, and Hypothesis Testing

_Authors: Matt Brems (DC)_

---

## Take some time to read about the relationship between hypothesis testing and confidence intervals

https://en.wikipedia.org/wiki/Confidence_interval#Relationship_with_other_statistical_topics

### First, read in the housing data (code provided).

You can find the original data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data).

In [1]:
import pandas as pd

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# The file is whitespace-delimited and has no header row.
data = pd.read_csv("./datasets/housing.data", header=None, names=names, sep=r"\s+")

NOX = data['NOX'].values
AGE = data['AGE'].values

### 1. Find the mean, standard deviation, and standard error of the mean for the `AGE` variable.

In [2]:
import numpy as np

In [3]:
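mean = AGE.mean()
# Note: NumPy's .std() defaults to ddof=0 (population formula);
# AGE.std(ddof=1) would give the sample standard deviation instead.
std = AGE.std()
# Standard error of the mean: standard deviation divided by sqrt(n).
stand_error = std / len(AGE) ** 0.5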
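
The choice of `ddof` matters for the standard error. The sketch below compares a manual calculation with `scipy.stats.sem`, which uses `ddof=1` by default. The `sample` array is randomly generated stand-in data, not the housing data.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data (NOT the AGE column from the housing file).
rng = np.random.default_rng(0)
sample = rng.normal(loc=68.0, scale=28.0, size=506)

n = len(sample)

# Sample standard deviation (divides by n - 1) and the resulting standard error.
sample_std = sample.std(ddof=1)
sem_manual = sample_std / np.sqrt(n)

# scipy.stats.sem uses ddof=1 by default, so it should match the manual value.
sem_scipy = stats.sem(sample)

# NumPy's default (ddof=0) divides by n and gives a slightly smaller value.
sem_ddof0 = sample.std() / np.sqrt(n)

print(sem_manual, sem_scipy, sem_ddof0)
```

With several hundred observations the two versions differ only in the later decimal places, but the `ddof=1` version is the usual estimate when the data are a sample.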

### 2. Generate a 90%, 95%, and 99% confidence interval for `AGE`.

You can use the `scipy.stats.t.interval()` function to calculate the confidence interval range.

```python
# End points of the range that contains a proportion `alpha` (e.g., 0.95) of the distribution:
stats.t.interval(alpha, df, loc=0, scale=1)
```

Arguments:
- `df`: The degrees of freedom; this is the length of the vector minus 1.
- `loc`: The center of the t-distribution (your point estimate, i.e., the mean of the variable).
- `scale`: The scale of the t-distribution (i.e., the standard error of your sample mean).
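
To make the arguments above concrete, here is a minimal sketch that builds the same interval by hand as point estimate ± critical value × standard error and compares it with `t.interval`. The data are randomly generated for illustration, and the confidence level is passed positionally.

```python
import numpy as np
from scipy import stats

# Hypothetical data used only to illustrate the calculation.
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)
df = n - 1
level = 0.95

# Two-sided critical value: leaves (1 - level) / 2 in each tail.
t_crit = stats.t.ppf(1 - (1 - level) / 2, df)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# The same interval from scipy; the first argument is the confidence level.
builtin = stats.t.interval(level, df, loc=mean, scale=sem)

print(manual)
print(builtin)
```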

**Interpret the results from all three confidence intervals.**

In [4]:
from scipy.stats import t

In [5]:
df = len(AGE)-1

In [6]:
print('90% confidence interval:', t.interval(0.9, df, loc=mean, scale=stand_error))
print('95% confidence interval:', t.interval(0.95, df, loc=mean, scale=stand_error))
print('99% confidence interval:', t.interval(0.99, df, loc=mean, scale=stand_error))

90% confidence interval: (66.51483732549254, 70.63496504604898)
95% confidence interval: (66.11880029893695, 71.03100207260457)
99% confidence interval: (65.34255917420734, 71.80724319733417)


### 3. Did you rely on the central limit theorem in Question 2? Why or why not? Explain.

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

# A:
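
One way to explore this question is a small simulation. The sketch below draws many samples from a skewed distribution and checks how the sample means behave. The exponential population, the seed, and the number of repetitions are arbitrary choices for illustration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 506           # sample size, chosen to match the number of rows in the housing data
n_repeats = 5000  # number of simulated samples (arbitrary)

# Exponential(scale=1) is right-skewed, with mean 1 and standard deviation 1.
samples = rng.exponential(scale=1.0, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

# The CLT suggests the sample means are roughly normal around the population mean,
# with standard deviation close to sigma / sqrt(n).
observed_sd = sample_means.std(ddof=1)
predicted_sd = 1.0 / np.sqrt(n)

print(sample_means.mean(), observed_sd, predicted_sd)
```

Plotting a histogram of `sample_means` with `plt.hist` shows a roughly bell-shaped curve even though the underlying population is skewed.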

### 4. For the `NOX` variable, generate a 95% confidence interval and interpret it.

In [8]:
print('95% confidence interval:', t.interval(0.95, df=len(NOX)-1, loc=NOX.mean(), scale=NOX.std()/len(NOX)**0.5))
# We are 95% confident that the population mean of NOX lies between about 0.5446 and 0.5648.

95% confidence interval: (0.544584268025715, 0.5648058505513601)


### 5. For the `NOX` variable, test the hypothesis that the mean is equal to the median. 

You can use `scipy` functions to complete this, but be sure to complete all steps listed below.

1. Define the hypotheses.
2. Set `alpha` to equal 0.05.
3. Calculate the point estimate.
4. Calculate the test statistic.
5. Find the p-value.
6. Interpret the results.

In [9]:
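# H0: The mean of NOX is equal to its median.
# H1: The mean of NOX is not equal to its median.
import scipy.stats as stats

N_mean = NOX.mean()
N_med = np.median(NOX)
diff = N_mean - N_med  # point estimate of (mean - median)
# 95% interval for the difference; under H0 the difference is 0.
t.interval(0.95, df=len(NOX)-1, loc=diff, scale=NOX.std()/len(NOX)**0.5)

# One-sample t-test of the mean against the median (treated as a fixed value).
stats.ttest_1samp(NOX, N_med)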

Ttest_1sampResult(statistic=3.2408837167794102, pvalue=0.001270210999819144)
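
The test statistic and p-value can also be computed step by step. The following is a minimal sketch on randomly generated, right-skewed data (not the NOX column) that reproduces what `ttest_1samp` reports for a two-sided test.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data, so the mean and median differ.
rng = np.random.default_rng(3)
sample = rng.gamma(shape=2.0, scale=1.0, size=200)
mu0 = np.median(sample)  # hypothesized value, treated as a fixed number

n = len(sample)
sem = sample.std(ddof=1) / np.sqrt(n)

# Test statistic: how many standard errors the sample mean is from mu0.
t_manual = (sample.mean() - mu0) / sem

# Two-sided p-value: probability of a statistic at least this extreme in either tail.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```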

### 6. What do you notice about the results from Questions 4 and 5? 

**If you were going to generalize these observations to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.**

In [10]:
# A:
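
One way to investigate the relationship is to check, for several hypothesized values, whether "the value falls outside the 95% confidence interval" and "the two-sided test rejects at `alpha = 0.05`" always agree. The helper `ci_and_test_agree` and the generated data below are hypothetical and only illustrate the idea; both calculations use the sample standard deviation (`ddof=1`).

```python
import numpy as np
from scipy import stats

def ci_and_test_agree(sample, mu0, alpha=0.05):
    """Return True if the CI and the two-sided t-test reach the same decision about mu0."""
    n = len(sample)
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)
    low, high = stats.t.interval(1 - alpha, n - 1, loc=mean, scale=sem)
    _, p_value = stats.ttest_1samp(sample, mu0)
    outside_interval = (mu0 < low) or (mu0 > high)
    rejected = p_value < alpha
    return bool(outside_interval) == bool(rejected)

# Hypothetical data (NOT the NOX column).
rng = np.random.default_rng(4)
sample = rng.normal(loc=0.55, scale=0.12, size=506)

# Try a range of hypothesized means, some inside and some outside the interval.
checks = [ci_and_test_agree(sample, mu0) for mu0 in np.linspace(0.50, 0.60, 21)]
print(all(checks))
```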

### 7. For the `NOX` variable, test the hypothesis that the mean is greater than or equal to the median. 

You can use `scipy` functions to complete this, but be sure to complete all steps listed below.

1. Define the hypotheses.
2. Set `alpha` to equal 0.05.
3. Calculate the point estimate.
4. Calculate the test statistic.
5. Find the p-value.
6. Interpret the results.

In [11]:
# A:
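
For a one-sided test, the p-value comes from a single tail of the t-distribution. The sketch below computes both one-sided p-values by hand on generated data and relates them to the two-sided p-value. Recent SciPy versions also accept an `alternative` argument in `ttest_1samp`; the manual version is used here to avoid depending on a particular version.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data (NOT the NOX column).
rng = np.random.default_rng(5)
sample = rng.gamma(shape=2.0, scale=1.0, size=200)
mu0 = np.median(sample)
df = len(sample) - 1

t_stat, p_two_sided = stats.ttest_1samp(sample, mu0)

# H0: mean >= mu0 versus H1: mean < mu0 uses the lower tail.
p_less = stats.t.cdf(t_stat, df)

# H0: mean <= mu0 versus H1: mean > mu0 uses the upper tail.
p_greater = stats.t.sf(t_stat, df)

print(p_two_sided, p_less, p_greater)
```

The smaller of the two one-sided p-values is half the two-sided p-value, and the two one-sided p-values sum to 1.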

### 8. Compare the p-values from Questions 5 and 7. What do you notice?

In [12]:
# A:

In [45]:
titanic = pd.read_csv('./datasets/titanic.csv')

In [46]:
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [47]:
titanic = titanic[np.isfinite(titanic['Age'])]
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    714 non-null int64
Survived       714 non-null int64
Pclass         714 non-null int64
Name           714 non-null object
Sex            714 non-null object
Age            714 non-null float64
SibSp          714 non-null int64
Parch          714 non-null int64
Ticket         714 non-null object
Fare           714 non-null float64
Cabin          185 non-null object
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 72.5+ KB


In [117]:
# 1 if the passenger's age is strictly between 16 and 50, otherwise 0.
titanic['adult'] = ((titanic['Age'] < 50) & (titanic['Age'] > 16))*1

In [118]:
titanic.groupby('adult').size()

adult
0    174
1    540
dtype: int64

In [123]:
titanic.groupby('adult')['Survived'].mean()

adult
0    0.471264
1    0.385185
Name: Survived, dtype: float64

In [120]:
adult_sur = titanic[titanic['adult']==1]['Survived']
non_adult_sur = titanic[titanic['adult']==0]['Survived']

In [124]:
# H0: Adults and non-adults have the same survival rate.
# H1: Adults and non-adults have different survival rates.
# stats.ttest_ind is two-sided by default; the sign of the t-statistic gives the direction
# (negative here means the adult group has the lower sample survival rate).

alpha = 0.05
t_stat, p_value = stats.ttest_ind(adult_sur, non_adult_sur)
print('t-statistic={}, p-value={}'.format(t_stat, p_value))

if p_value < alpha:
    print("We reject our null hypothesis and conclude that adults and non-adults have different survival rates.")
else:
    print("We fail to reject our null hypothesis and cannot conclude that adults and non-adults have different survival rates.")

t-statistic=-2.01354205839, p-value=0.0444334231228
We reject our null hypothesis and conclude that adults and non-adults have different survival rates.
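
Because survival is a 0/1 outcome, a two-proportion z-test is a common alternative to the t-test used above. The sketch below is one possible version with a pooled standard error; the function name `two_proportion_z_test` and the counts passed to it are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
from scipy import stats

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions (pooled standard error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 both groups share one rate, so pool the counts to estimate it.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_two_sided = 2 * stats.norm.sf(abs(z))
    return z, p_two_sided

# Hypothetical counts: 40 of 100 survive in group A, 55 of 100 in group B.
z, p = two_proportion_z_test(40, 100, 55, 100)
print('z-statistic={}, p-value={}'.format(z, p))
```

A negative `z` means group A has the lower sample proportion, mirroring the sign convention of the t-statistic.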
