
## Assignment - Build a confidence interval

A confidence interval refers to a neighborhood around some point estimate, the size of which is determined by the desired confidence level. For instance, we might say that 52% of Americans prefer tacos to burritos, with a 95% confidence interval of +/- 5%.

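52% (0.52) is the point estimate, and +/- 5% (the interval $[0.47, 0.57]$) is the confidence interval. "95% confidence" corresponds to a significance level of $\alpha = 1 - 0.95 = 0.05$: a two-sided test rejects a hypothesized value when it falls outside the interval, that is, when its p-value is $\leq 0.05$.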

In this case, the confidence interval includes $0.5$ - which is the natural null hypothesis (that half of Americans prefer tacos and half burritos, thus there is no clear favorite). So in this case, we could use the confidence interval to report that we've failed to reject the null hypothesis.

But providing the full analysis with a confidence interval, including a graphical representation of it, can be a helpful and powerful way to tell your story. Done well, it is also more intuitive to a layperson than simply saying "fail to reject the null hypothesis" - it shows that in fact the data does *not* give a single clear result (the point estimate) but a whole range of possibilities.

How is a confidence interval built, and how should it be interpreted? It does *not* mean that 95% of the data lies in that interval - instead, the frequentist interpretation is "if we were to repeat this experiment 100 times and build an interval each time, we would expect ~95 of those intervals to contain the true population value."
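One way to check what a 95% interval promises is to simulate many repeated experiments and count how often the computed interval contains the true value. The sketch below is only an illustration: the true proportion, sample size, and number of repeats are assumed values, not taken from the assignment data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # fixed seed so the sketch is reproducible
true_p = 0.52                     # assumed "true" population proportion
n, n_experiments = 400, 1000      # hypothetical sample size and number of repeats

covered = 0
for _ in range(n_experiments):
    sample = rng.binomial(1, true_p, size=n)        # one simulated poll
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)           # standard error of the mean
    margin = stats.t.ppf(0.975, df=n - 1) * sem     # 95% two-sided margin
    if mean - margin <= true_p <= mean + margin:
        covered += 1

coverage = covered / n_experiments
print(f"Intervals containing the true value: {coverage:.1%}")
```

The printed share should land close to 95%, which is the sense in which the procedure (not any single interval) is "95% confident".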

For a 95% confidence interval and a normal(-ish) sampling distribution, you can simply remember that +/- 2 standard deviations contains 95% of the probability mass, and so the 95% confidence interval based on a given sample is centered at the mean (point estimate) and has a range of +/- 2 (or more precisely 1.96) standard errors, where the standard error is the sample standard deviation divided by $\sqrt{n}$.
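The "2 versus 1.96" rule of thumb can be checked directly against the standard normal distribution. This is a minimal sketch using `scipy.stats.norm`; the variable names are illustrative.

```python
from scipy.stats import norm

z_95 = norm.ppf(0.975)                                # two-sided 95% critical value
mass_within_196 = norm.cdf(1.96) - norm.cdf(-1.96)    # probability mass within +/- 1.96
mass_within_2 = norm.cdf(2) - norm.cdf(-2)            # probability mass within +/- 2

print(round(z_95, 4))             # approximately 1.96
print(round(mass_within_196, 4))  # approximately 0.95
print(round(mass_within_2, 4))    # approximately 0.9545
```

So +/- 2 is a slightly conservative shortcut: it covers a little more than 95% of the mass.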

Different confidence levels (90%, 99%) or distributional assumptions change the multiplier, but the overall process and interpretation (with a frequentist approach) will be the same.
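To see how the multiplier moves with the confidence level, the sketch below compares normal and Student's t critical values. `critical_value` is a hypothetical helper written for this illustration, and the 29 degrees of freedom are an arbitrary example.

```python
from scipy.stats import norm, t

def critical_value(confidence, dof=None):
    """Two-sided critical value; Student's t when dof is given, else the normal."""
    upper_quantile = 1 - (1 - confidence) / 2
    if dof is not None:
        return t.ppf(upper_quantile, dof)
    return norm.ppf(upper_quantile)

for level in (0.90, 0.95, 0.99):
    # columns: confidence level, normal multiplier, t multiplier with 29 dof
    print(level, round(critical_value(level), 3), round(critical_value(level, dof=29), 3))
```

Higher confidence means a larger multiplier and a wider interval, and for small samples the t multiplier is always somewhat larger than the normal one.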

Your assignment - using the data from the prior module ([congressional voting records](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)):


### Confidence Intervals:
1. Generate and numerically represent a confidence interval
2. Graphically (with a plot) represent the confidence interval
3. Interpret the confidence interval - what does it tell you about the data and its distribution?
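As a starting point for the first two tasks, here is a rough sketch that computes a t-based interval for one column and draws it as an error bar. The vote values are made up, `mean_confidence_interval` is a hypothetical helper, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for a 1-D array, ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    mean = values.mean()
    # stats.sem uses ddof=1 by default
    margin = stats.sem(values) * stats.t.ppf((1 + confidence) / 2, len(values) - 1)
    return mean, mean - margin, mean + margin

# Hypothetical 0/1 votes with one missing value, standing in for one column of the data.
votes = np.array([1, 0, 0, 1, 0, 0, 0, 1, np.nan, 0, 0, 1, 0, 0, 0, 0])
mean, lower, upper = mean_confidence_interval(votes)

fig, ax = plt.subplots(figsize=(5, 1.5))
ax.errorbar(mean, 0, xerr=[[mean - lower], [upper - mean]], fmt="o", capsize=5)
ax.axvline(0.5, linestyle="--", color="gray")   # reference line at 50% support
ax.set_xlim(0, 1)
ax.set_yticks([])
ax.set_xlabel("Share voting yes")
fig.tight_layout()
```

If the error bar does not cross the dashed 0.5 line, the interval excludes an even split at the chosen confidence level.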

### Chi-squared tests:
4. Take a dataset that we have used in the past in class that has **categorical** variables. Pick two of those categorical variables and run a chi-squared test on that data
  - By hand using Numpy
  - In a single line using Scipy
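The two chi-squared approaches can be sketched side by side as follows. The contingency table is invented for illustration; with real data it could come from something like `pd.crosstab` on two categorical columns.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 contingency table (rows: two groups, columns: no/yes counts).
observed = np.array([[30, 10],
                     [20, 40]])

# By hand: expected counts come from the row and column totals.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi_squared = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi_squared, dof)      # upper-tail probability

# In a single line with SciPy. correction=False turns off Yates' continuity
# correction so that the 2x2 result matches the by-hand statistic.
scipy_stat, scipy_p, scipy_dof, scipy_expected = chi2_contingency(observed, correction=False)

print(chi_squared, p_value, dof)
```

Note that `chi2_contingency` applies the continuity correction by default for 2x2 tables, so without `correction=False` the two statistics will differ slightly.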

Stretch goals:

1. Write a summary of your findings, mixing prose and math/code/results. *Note* - yes, this is by definition a political topic. It is challenging but important to keep your writing voice *neutral* and stick to the facts of the data. Data science often involves considering controversial issues, so it's important to be sensitive about them (especially if you want to publish).
2. Apply the techniques you learned today to your project data or other data of your choice, and write/discuss your findings here.
3. Refactor your code so it is elegant, readable, and can be easily run for all issues.

In [0]:
# TODO - your code!

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
# from scipy import stats
from scipy.stats import t, ttest_1samp

In [0]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', header=None)

In [4]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
columns = ['party','handicapped-infants','water-project-cost-sharing',
           'adoption-of-the-budget-resolution','physician-fee-freeze',
           'el-salvador-aid','religious-groups-in-schools',
           'anti-satellite-test-ban','aid-to-nicaraguan-contras'
           ,'mx-missile','immigration','synfuels-corporation-cutback'
           ,'education-spending','superfund-right-to-sue','crime',
           'duty-free-exports','export-administration-act-south-africa']
df.columns=columns

In [6]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [0]:
df.replace('?', np.nan, inplace=True)

In [8]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
df.replace('y', 1, inplace=True)

In [0]:
df.replace('n', 0, inplace=True)
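The three separate `replace` calls could also be folded into the loading step. The sketch below is one possible alternative, shown on a tiny made-up excerpt with placeholder column names rather than the real file.

```python
import io
import pandas as pd

# Hypothetical three-vote excerpt in the same layout as house-votes-84.data.
raw = io.StringIO(
    "republican,n,y,?\n"
    "democrat,?,y,y\n"
    "democrat,y,n,n\n"
)
columns = ["party", "vote-a", "vote-b", "vote-c"]   # placeholder column names

# na_values="?" turns the missing-vote marker into NaN while reading.
votes = pd.read_csv(raw, header=None, names=columns, na_values="?")

# Map y/n to 1/0 on the vote columns only; NaN stays NaN.
vote_cols = columns[1:]
votes[vote_cols] = votes[vote_cols].apply(lambda col: col.map({"y": 1, "n": 0}))

print(votes)
```

Restricting the mapping to the vote columns avoids touching `party`, which a frame-wide `replace` would also scan.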

In [11]:
df.head()

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
rep = df[df['party'] == 'republican']
dem = df[df['party'] == 'democrat']

In [13]:
rep.count()

party                                     168
handicapped-infants                       165
water-project-cost-sharing                148
adoption-of-the-budget-resolution         164
physician-fee-freeze                      165
el-salvador-aid                           165
religious-groups-in-schools               166
anti-satellite-test-ban                   162
aid-to-nicaraguan-contras                 157
mx-missile                                165
immigration                               165
synfuels-corporation-cutback              159
education-spending                        155
superfund-right-to-sue                    158
crime                                     161
duty-free-exports                         156
export-administration-act-south-africa    146
dtype: int64

In [14]:
print('Republican Support:', rep['handicapped-infants'].mean())
print('Democrat Support:', dem['handicapped-infants'].mean())

Republican Support: 0.18787878787878787
Democrat Support: 0.6046511627906976


In [0]:
# n = len(rep['handicapped-infants'])

In [16]:
sample = rep.sample(80)
sample

Unnamed: 0,party,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
393,republican,,,,,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,
247,republican,0.0,1.0,0.0,1.0,1.0,1.0,,0.0,0.0,0.0,0.0,,1.0,1.0,0.0,0.0
374,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
308,republican,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
378,republican,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
122,republican,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
356,republican,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
277,republican,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
154,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,


In [0]:
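n=435  # NOTE: row count of the full dataset, not of `sample` (80 rows), so the standard error computed from it below is understated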

In [0]:
dof = n-1

In [19]:
# Sample mean of the 'handicapped-infants' vote (share voting yes):
sample_mean_HI = np.mean(sample['handicapped-infants'])
print(sample_mean_HI)

0.17721518987341772


In [20]:
real_mean_all = np.mean(rep)
real_mean_all

handicapped-infants                       0.187879
water-project-cost-sharing                0.506757
adoption-of-the-budget-resolution         0.134146
physician-fee-freeze                      0.987879
el-salvador-aid                           0.951515
religious-groups-in-schools               0.897590
anti-satellite-test-ban                   0.240741
aid-to-nicaraguan-contras                 0.152866
mx-missile                                0.115152
immigration                               0.557576
synfuels-corporation-cutback              0.132075
education-spending                        0.870968
superfund-right-to-sue                    0.860759
crime                                     0.981366
duty-free-exports                         0.089744
export-administration-act-south-africa    0.657534
dtype: float64

In [21]:
sample_mean_all = np.mean(sample)
sample_mean_all

handicapped-infants                       0.177215
water-project-cost-sharing                0.535211
adoption-of-the-budget-resolution         0.151899
physician-fee-freeze                      1.000000
el-salvador-aid                           0.937500
religious-groups-in-schools               0.875000
anti-satellite-test-ban                   0.205128
aid-to-nicaraguan-contras                 0.131579
mx-missile                                0.137500
immigration                               0.500000
synfuels-corporation-cutback              0.103896
education-spending                        0.893333
superfund-right-to-sue                    0.851351
crime                                     0.987179
duty-free-exports                         0.053333
export-administration-act-south-africa    0.647887
dtype: float64

In [0]:
# ?np.mean

In [23]:
sample_std_all_rep = np.std(sample, ddof=0)
sample_std_all_rep


handicapped-infants                       0.381851
water-project-cost-sharing                0.498759
adoption-of-the-budget-resolution         0.358923
physician-fee-freeze                      0.000000
el-salvador-aid                           0.242061
religious-groups-in-schools               0.330719
anti-satellite-test-ban                   0.403795
aid-to-nicaraguan-contras                 0.338032
mx-missile                                0.344374
immigration                               0.500000
synfuels-corporation-cutback              0.305126
education-spending                        0.308689
superfund-right-to-sue                    0.355742
crime                                     0.112500
duty-free-exports                         0.224697
export-administration-act-south-africa    0.477629
dtype: float64

In [24]:
sample_std = np.std(sample['handicapped-infants'], ddof=0)
sample_std

0.3818507121265403

Why use `ddof=1` when we already subtract 1 from the sample size during the `dof` calculation (below)?
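A short sketch may help with this question. The two "minus one" adjustments do different jobs: `ddof=1` in `np.std` divides by $n - 1$ to correct the bias of the sample variance (Bessel's correction), while `n - 1` passed to `t.ppf` only selects which t distribution supplies the critical value. Using both is not double counting. The vote values below are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 votes; any 1-D numeric sample works.
votes = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0], dtype=float)
n = len(votes)

population_style_std = np.std(votes, ddof=0)   # divides by n
sample_std = np.std(votes, ddof=1)             # divides by n - 1 (Bessel's correction)

std_err = sample_std / np.sqrt(n)              # same value as scipy.stats.sem's default
t_crit = stats.t.ppf(0.975, df=n - 1)          # n - 1 here picks the t distribution's shape

print(population_style_std, sample_std, std_err, t_crit)
```

With `ddof=0`, as in the cells above, the standard deviation is slightly smaller, so the resulting interval is slightly too narrow; the effect shrinks as the sample grows.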

In [25]:


std_err = sample_std/n**.5
# Two-sided 95% interval: 2.5% in each tail, so take the 0.975 quantile; t.ppf returns the critical value, not a test statistic
t_stat = t.ppf(.975, dof)
print("t Statistic:", t_stat)
# 95% confidence interval
CI = (sample_mean_HI-(t_stat*std_err), sample_mean_HI+(t_stat*std_err))
print("Confidence Interval", CI)

t Statistic: 1.9654450635078535
Confidence Interval (0.14123115278799628, 0.21319922695883917)
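The interval above divides by the square root of `n = 435`, which is the row count of the full dataset, while the sample holds 80 rows (79 non-missing for this vote, since the printed mean equals 14/79). A sketch of the same calculation using the sample's own size and `ddof=1` follows. The vote counts are an assumption reconstructed from that printed mean, and a fresh `rep.sample(80)` would give different numbers.

```python
import numpy as np
from scipy import stats

# Assumption: 14 yes and 65 no votes, reconstructed from the printed sample
# mean (14 / 79 is about 0.1772); a different random sample gives other counts.
votes = np.array([1.0] * 14 + [0.0] * 65)

n = len(votes)                          # non-missing observations in the sample
sample_mean = votes.mean()
std_err = np.std(votes, ddof=1) / np.sqrt(n)

# By hand: mean +/- critical value * standard error
t_crit = stats.t.ppf(0.975, df=n - 1)
manual_ci = (sample_mean - t_crit * std_err, sample_mean + t_crit * std_err)

# Same interval from SciPy (confidence level passed positionally)
scipy_ci = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=std_err)

print(manual_ci)   # roughly (0.09, 0.26)
```

Under these assumptions the interval is noticeably wider than the one printed above, which is what a sample of about 80 rather than 435 observations should produce.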


In [0]:
# ?np.std

In [0]:
# ?t.interval

In [28]:
std_err = sample_std_all_rep/n**.5
# Two-sided 95% interval: 2.5% in each tail, so take the 0.975 quantile; t.ppf returns the critical value, not a test statistic
t_stat = t.ppf(.975, dof)
print("t Statistic:", t_stat)
# 95% confidence interval
CI = (sample_mean_all-(t_stat*std_err), sample_mean_all+(t_stat*std_err))
print("Confidence Interval")
print('\n')
CI

t Statistic: 1.9654450635078535
Confidence Interval




(handicapped-infants                       0.141231
 water-project-cost-sharing                0.488210
 adoption-of-the-budget-resolution         0.118075
 physician-fee-freeze                      1.000000
 el-salvador-aid                           0.914689
 religious-groups-in-schools               0.843834
 anti-satellite-test-ban                   0.167076
 aid-to-nicaraguan-contras                 0.099724
 mx-missile                                0.105048
 immigration                               0.452882
 synfuels-corporation-cutback              0.075142
 education-spending                        0.864244
 superfund-right-to-sue                    0.817828
 crime                                     0.976578
 duty-free-exports                         0.032159
 export-administration-act-south-africa    0.602878
 dtype: float64, handicapped-infants                       0.213199
 water-project-cost-sharing                0.582212
 adoption-of-the-budget-resolution         0.185

## Resources

- [Interactively visualize the Chi-Squared test](https://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html)
- [Calculation of Chi-Squared test statistic](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
- [Visualization of a confidence interval generated by R code](https://commons.wikimedia.org/wiki/File:Confidence-interval.svg)
- [Expected value of a squared standard normal](https://math.stackexchange.com/questions/264061/expected-value-calculation-for-squared-normal-distribution) (it's 1 - which is why the expected value of a Chi-Squared with $n$ degrees of freedom is $n$, as it's the sum of $n$ squared standard normals)