<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Inferential Statistics Lab

_Author: Matt Brems (DC)_

It might be a good idea to first check the [data dictionary](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names). Alternatively, if you are familiar with the `sklearn` library, you can read the `DESCR` attribute of the loaded dataset.

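We've saved the data for you in a file named "housing.data". Load it in using any method you choose, or run the following cells to import it from `sklearn`. (Note that `load_boston` was removed in scikit-learn 1.2, so the cells below require an older version.)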

In [1]:
from sklearn import datasets
import pandas as pd
import scipy.stats as stats

In [2]:
df = datasets.load_boston()
data = pd.DataFrame(df.data,columns=df.feature_names)

In [3]:
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [4]:
data.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.593761,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.596783,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.647423,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97


Exercise 1: Conduct a brief integrity check of your data. This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense. (For example, is one variable a percentage, but with observations above 100%?)
Summarize your findings in a few sentences, including what you checked and, if appropriate, any steps you took to rectify potential integrity issues.

In [6]:
def eda(dataframe):
    """ Ritika Bhasker's function, DSI-DC-3 alumna. """
    print("Missing Values \n \n", dataframe.isnull().sum(),"\n")
    print("Duplicate Rows \n", dataframe.duplicated().sum(),"\n")
    print("Dataframe Types \n \n", dataframe.dtypes,"\n")
    print("Dataframe Shape \n", dataframe.shape,"\n")
    print("Dataframe Describe \n \n", dataframe.describe(include='all'),"\n")
    for item in dataframe:
        print(item)
        print(dataframe[item].nunique())

In [8]:
eda(data)

Missing Values 
 
 CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64 

Duplicate Rows 
 0 

Dataframe Types 
 
 CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object 

Dataframe Shape 
 (506, 13) 

Dataframe Describe 
 
              CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.

**Answer:**
- There are no missing values.
- There are no duplicate rows.
- All dataframe types are float64, which makes sense for our data. (RAD and CHAS could be ints, but it isn't a problem here.)
- Nothing jumps out from the counts of unique values. The dummy variable CHAS has two values, as expected.
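
One way to make the "do the values make logical sense?" check explicit is to compare each column's observed range against the bounds implied by the data dictionary. The sketch below is only illustrative: the observed values are copied from the `describe()` output above, while `expected_bounds` and the `out_of_range` helper are hypothetical names, and the bounds reflect an assumed reading of the data dictionary (percentages in [0, 100], the `CHAS` dummy in [0, 1]).

```python
# Observed (min, max) per column, copied from the describe() output above.
observed = {
    "ZN": (0.0, 100.0),
    "INDUS": (0.46, 27.74),
    "CHAS": (0.0, 1.0),
    "AGE": (2.9, 100.0),
    "LSTAT": (1.73, 37.97),
}

# Assumed logical bounds: percentages lie in [0, 100], the CHAS dummy in [0, 1].
expected_bounds = {
    "ZN": (0.0, 100.0),
    "INDUS": (0.0, 100.0),
    "CHAS": (0.0, 1.0),
    "AGE": (0.0, 100.0),
    "LSTAT": (0.0, 100.0),
}


def out_of_range(observed, expected_bounds):
    """Return the columns whose observed range falls outside the expected bounds."""
    flagged = []
    for column, (low, high) in observed.items():
        lower, upper = expected_bounds[column]
        if low < lower or high > upper:
            flagged.append(column)
    return flagged


# An empty list means no column violates its assumed bounds.
print(out_of_range(observed, expected_bounds))
```

With the real `DataFrame`, the same comparison could be run on `data.min()` and `data.max()` instead of hand-copied values.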

Exercise 2: For what two attributes does it make the least sense to calculate mean and median? Why?

**Answer:** RAD and CHAS. 
- RAD is an [ordinal variable](https://www.ma.utexas.edu/users/mks/statmistakes/ordinal.html): an index of accessibility to radial highways. Its values rank towns rather than measure a quantity, so adding a RAD of 8 to a RAD of 9 (as computing a mean requires) would be meaningless.
- CHAS is a [dummy variable](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)), which allows 

Exercise 3: Find the mean, standard deviation, and the standard error of the mean for variable 'AGE.'

In [19]:
"Mean:", data['AGE'].mean(), "Standard Deviation:", data['AGE'].std(), "Standard Error of the Mean:", data['AGE'].sem(), "Standard Error of the Mean:", data['AGE'].std() / len(data['AGE']) ** 0.5

('Mean:',
 68.57490118577078,
 'Standard Deviation:',
 28.148861406903638,
 'Standard Error of the Mean:',
 1.251369525258305,
 'Standard Error of the Mean:',
 1.251369525258305)

**Answer:** 
- The mean of `AGE` is 68.57. 
- The standard deviation of `AGE` is 28.15.
- The standard error of the mean of `AGE` is 1.25.
    - Note that we can calculate the standard error of the mean with `data['AGE'].sem()` or  `data['AGE'].std() / len(data['AGE']) ** 0.5`.

Exercise 4: Generate a 95% confidence interval for 'AGE' manually.

**Edit**: Generate a 95% confidence interval for the mean of 'AGE' manually. (We can't calculate a confidence interval for an entire variable.)

Remember that the formula for a confidence interval for the population mean is:

$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$

In [47]:
sample_mean = data['AGE'].mean()
z_star = 1.96    # critical value for 95% confidence (2.575 would give a 99% interval)
sigma = data['AGE'].std()
n = len(data['AGE'])

low_end = sample_mean - z_star * sigma / n ** 0.5

high_end = sample_mean + z_star * sigma / n ** 0.5

In [48]:
(round(low_end, 2), round(high_end, 2))

(66.12, 71.03)

**Answer:** The 95% confidence interval for the population mean of `AGE` is (66.12, 71.03).

In [26]:
print("We are 95% confident that the true mean value for 'AGE' is between " + str(round(low_end,2)) + " and " + str(round(high_end,2)) + ".")

We are 95% confident that the true mean value for 'AGE' is between 66.12 and 71.03.
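
As a check on the arithmetic, here is the interval worked by hand from the values found in Exercise 3, taking $z^* = 1.96$ for 95% confidence:

$$68.575 \pm 1.96 \times \frac{28.149}{\sqrt{506}} \approx 68.575 \pm 1.96 \times 1.2514 \approx 68.575 \pm 2.453 \approx (66.12,\ 71.03)$$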


Exercise 5: Create a function to take in the data and level of significance, then return the confidence interval with a helpful message interpreting it! Then, for variable 'NOX', generate a 95% confidence interval and its interpretation.

**Edit:** Create a function to take in the data and level of **confidence**, then return the confidence interval with a helpful message interpreting it! Then, for variable 'NOX', generate a 95% confidence interval and its interpretation. (For the record, `significance level = 100% - confidence level`.)
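
The critical value $z^*$ depends only on the confidence level: a two-sided interval leaves `(1 - confidence level) / 2` in each tail of the standard normal distribution. The sketch below illustrates that relationship using the standard library's `statistics.NormalDist` (an assumption made to keep the example dependency-free; `z_star_for` is a hypothetical helper name).

```python
from statistics import NormalDist


def z_star_for(conf_level):
    """Two-sided critical value: leaves (1 - conf_level) / 2 in each tail."""
    return NormalDist().inv_cdf((1 + conf_level) / 2)


# Critical values for the three most common confidence levels.
for conf_level in (0.90, 0.95, 0.99):
    print(conf_level, round(z_star_for(conf_level), 3))
```

These come out to roughly 1.645, 1.960, and 2.576, which are the values usually read from a z-table.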

In [54]:
def ci(df,conf_level):
    if conf_level == 0.90:
        z_star = 1.645
    elif conf_level == 0.95:
        z_star = 1.96
    elif conf_level == 0.99:
        z_star = 2.575
    else:
        return "Please select 0.90, 0.95, or 0.99 for confidence."
    
    sample_mean = df.mean()
    sigma = df.std()
    n = len(df)

    low_end = sample_mean - z_star * sigma / n ** 0.5
    high_end = sample_mean + z_star * sigma / n ** 0.5
    
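    # Report the confidence level that was actually requested, not a hard-coded 95%.
    return ("We are {:.0%} confident that the true mean is between ".format(conf_level) + str(round(low_end,2)) + " and " + str(round(high_end,2)) + ".")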

To create a more general (but advanced!) function, we could do the following:

In [55]:
import scipy.stats as stats

In [57]:
def ci_2(df,conf_level):
    z_star = stats.norm.ppf((conf_level + 1)/2, loc=0, scale=1)
    # Use the data passed in, not the global data['AGE'] column.
    sample_mean = df.mean()
    sigma = df.std()
    n = len(df)

    low_end = sample_mean - z_star * sigma / n ** 0.5
    high_end = sample_mean + z_star * sigma / n ** 0.5
    
    # Report the confidence level that was actually requested.
    return ("We are {:.0%} confident that the true mean is between ".format(conf_level) + str(round(low_end,2)) + " and " + str(round(high_end,2)) + ".")

In [53]:
ci_2(data['AGE'],0.95)

'We are 95% confident that the true mean is between 66.12 and 71.03.'
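
Exercise 5 also asks for a 95% confidence interval for 'NOX'. The sketch below computes it from the summary statistics in the `describe()` output above (mean 0.554695, standard deviation 0.115878, n = 506) rather than from the raw column, so it runs without the dataset. The `mean_ci` helper is a hypothetical name, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
from statistics import NormalDist


def mean_ci(sample_mean, sample_std, n, conf_level):
    """Normal-approximation confidence interval for a population mean."""
    z_star = NormalDist().inv_cdf((1 + conf_level) / 2)
    margin = z_star * sample_std / n ** 0.5
    return sample_mean - margin, sample_mean + margin


# NOX summary statistics, copied from the describe() output above.
low_end, high_end = mean_ci(0.554695, 0.115878, 506, 0.95)

print("We are 95% confident that the true mean of NOX is between "
      + str(round(low_end, 4)) + " and " + str(round(high_end, 4)) + ".")
```

With the `DataFrame` loaded, the equivalent call in the notebook would be `ci_2(data['NOX'], 0.95)`.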

Exercise 6: For the variable 'NOX', find the median.

In [58]:
data['NOX'].median()

0.538

Exercise 7: For the variable 'NOX', test the hypothesis that the mean is equal to 0.538. We'll complete all five steps.

Exercise 7, Step 1: Set up your hypotheses.

- $H_0: \mu = 0.538$ 
- $H_A: \mu \neq 0.538$

Exercise 7, Step 2: Our level of significance is 0.05. There's no work to do here. :)

$\alpha = 0.05$

Exercise 7, Step 3: Calculate your point estimate. In this case, it's your sample mean.

In [60]:
x_bar = data['NOX'].mean()

Exercise 7, Step 4: Calculate your test statistic. In this case, it's:

$$ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$

Note that $\mu_0$ is the mean we assume in our null hypothesis!
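
The lab leaves this step without code, so here is one possible sketch of the calculation. It assumes the NOX summary statistics from the `describe()` output above and uses the sample standard deviation in place of $\sigma$, as the confidence interval exercises do. The two-sided p-value is computed with the standard library's `statistics.NormalDist`, which is an assumption of this sketch.

```python
from statistics import NormalDist

# NOX summary statistics, copied from the describe() output above.
x_bar = 0.554695   # sample mean (the point estimate from Step 3)
mu_0 = 0.538       # mean assumed under the null hypothesis
sigma = 0.115878   # sample standard deviation, standing in for sigma
n = 506

# Test statistic: how many standard errors x_bar lies from mu_0.
z = (x_bar - mu_0) / (sigma / n ** 0.5)

# Two-sided p-value: probability of a |z| at least this large under H_0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), p_value)
```

The test statistic comes out to roughly 3.24 standard errors above $\mu_0$.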

Exercise 7, Step 5: Suppose your p-value is 0.06. Interpret your result!

Exercise 7, Step 6: Suppose your p-value is actually 0.02. Now interpret your result!
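
Both interpretations follow the same decision rule: reject $H_0$ when the p-value is below $\alpha$, and otherwise fail to reject it. A small sketch of that rule follows; the `interpret` helper and its wording are hypothetical, not part of the original lab.

```python
def interpret(p_value, alpha=0.05):
    """Apply the standard decision rule for a hypothesis test."""
    if p_value < alpha:
        return ("Reject H0: a p-value of {} is below alpha = {}, so the data are "
                "inconsistent with a mean of 0.538.".format(p_value, alpha))
    return ("Fail to reject H0: a p-value of {} is not below alpha = {}, so there "
            "is insufficient evidence that the mean differs from 0.538."
            .format(p_value, alpha))


print(interpret(0.06))  # Step 5
print(interpret(0.02))  # Step 6
```

Note that failing to reject $H_0$ is not the same as showing $H_0$ is true; it only means the evidence against it is too weak at this significance level.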

Exercise 8: We're going to run this exact same thing using SciPy. (We'll use a function that assumes our test statistic is $t$. That's okay! Don't worry about that issue for now. If you want to see the documentation, check it out [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html).)

In [None]:
list_of_values = data['NOX']    # This should be your NOX data!
popmean = 0.538                 # This should be mu_0, the assumed value of mu in the null hypothesis!

stats.ttest_1samp(list_of_values, popmean)