<a href="https://colab.research.google.com/github/Zebfred/DS-Unit-1-Sprint-3-Statistical-Tests-and-Experiments/blob/master/w3d2_Sampling_Confidence_Intervals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment - Build a confidence interval

A confidence interval is a range around a point estimate, the width of which is determined by the desired confidence level. For instance, we might say that 52% of Americans prefer tacos to burritos, with a 95% confidence interval of +/- 5%.

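52% (0.52) is the point estimate, and +/- 5% (the interval $[0.47, 0.57]$) is the confidence interval. "95% confidence" corresponds to a significance level of $\alpha = 1 - 0.95 = 0.05$: a hypothesized value outside the interval would be rejected with a p-value $\leq 0.05$.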

In this case, the confidence interval includes $0.5$ - which is the natural null hypothesis (that half of Americans prefer tacos and half burritos, thus there is no clear favorite). So in this case, we could use the confidence interval to report that we've failed to reject the null hypothesis.

But providing the full analysis with a confidence interval, including a graphical representation of it, can be a helpful and powerful way to tell your story. Done well, it is also more intuitive to a layperson than simply saying "fail to reject the null hypothesis" - it shows that in fact the data does *not* give a single clear result (the point estimate) but a whole range of possibilities.

How is a confidence interval built, and how should it be interpreted? It does *not* mean that 95% of the data lies in that interval. Instead, the frequentist interpretation is "if we were to repeat this experiment 100 times and build an interval each time, we would expect ~95 of those intervals to contain the true population value."
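One way to see what the "95%" refers to is to simulate the repeated experiment. The sketch below is an illustration only and does not use the assignment data: it assumes a hypothetical population in which 52% prefer tacos, draws many samples, builds a 95% interval from each one, and counts how often the interval contains the true proportion. The names `true_p`, `n`, and `n_experiments` are made up for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_p = 0.52          # hypothetical true share preferring tacos
n = 400                # hypothetical sample size per experiment
n_experiments = 1000   # number of repeated experiments

covered = 0
for _ in range(n_experiments):
    sample = rng.binomial(1, true_p, size=n)    # 1 = tacos, 0 = burritos
    mean = sample.mean()
    sem = stats.sem(sample)                     # sample std (ddof=1) / sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)       # two-sided 95% critical value
    lower, upper = mean - t_crit * sem, mean + t_crit * sem
    covered += int(lower <= true_p <= upper)

coverage = covered / n_experiments
print(f"Share of intervals containing the true value: {coverage:.3f}")
```

The printed share should land close to 0.95, although it will vary a little with the seed and the sample size.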

For a 95% confidence interval and a normal(-ish) distribution, you can simply remember that +/-2 standard deviations contains 95% of the probability mass. The sampling distribution of the mean has a standard deviation equal to the standard error (the sample standard deviation divided by $\sqrt{n}$), and so the 95% confidence interval based on a given sample is centered at the mean (point estimate) and extends +/- 2 (or technically 1.96) standard errors.

Different distributions/assumptions (90% confidence, 99% confidence) will require different math, but the overall process and interpretation (with a frequentist approach) will be the same.
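For other confidence levels, much of the "different math" comes down to a different critical value. A small sketch, assuming a normal sampling distribution, that looks up the two-sided critical value with `scipy.stats.norm.ppf` (the helper name `z_critical` is made up for this example):

```python
from scipy import stats

def z_critical(confidence):
    """Two-sided critical value of the standard normal for a confidence level."""
    alpha = 1 - confidence
    return stats.norm.ppf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: +/- {z_critical(level):.3f} standard errors")
```

The three values come out at roughly 1.645, 1.960, and 2.576. For small samples, `stats.t.ppf` with `n - 1` degrees of freedom gives slightly wider intervals.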

Your assignment - using the data from the prior module ([congressional voting records](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)):


### Confidence Intervals:
1. Generate and numerically represent a confidence interval
2. Graphically (with a plot) represent the confidence interval
3. Interpret the confidence interval - what does it tell you about the data and its distribution?

### Chi-squared tests:
4. Take a dataset that we have used in the past in class that has **categorical** variables. Pick two of those categorical variables and run a chi-squared test on that data
  - By hand using Numpy
  - In a single line using Scipy
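
As a rough guide to the two approaches, here is a sketch on a made-up 2x2 table of counts (the numbers in `observed` are hypothetical, not taken from any class dataset). It computes the statistic by hand with NumPy and then compares it with `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts (rows: group A/B, columns: no/yes)
observed = np.array([[30, 70],
                     [60, 40]])

# By hand: expected counts under independence are (row total * column total) / grand total
row_sums = observed.sum(axis=1)
col_sums = observed.sum(axis=0)
total = observed.sum()
expected = np.outer(row_sums, col_sums) / total
chi_squared = ((observed - expected) ** 2 / expected).sum()
dof = (len(row_sums) - 1) * (len(col_sums) - 1)
print(f"by hand: chi2={chi_squared:.4f}, dof={dof}")

# Single line with SciPy; correction=False turns off the Yates continuity
# correction so the 2x2 result matches the hand calculation
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(f"scipy:   chi2={chi2_scipy:.4f}, dof={dof_scipy}, p={p_scipy:.4g}")
```

With real data, `pd.crosstab` on the two categorical columns is one way to produce the `observed` table.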

Stretch goals:

1. Write a summary of your findings, mixing prose and math/code/results. *Note* - yes, this is by definition a political topic. It is challenging but important to keep your writing voice *neutral* and stick to the facts of the data. Data science often involves considering controversial issues, so it's important to be sensitive about them (especially if you want to publish).
2. Apply the techniques you learned today to your project data or other data of your choice, and write/discuss your findings here.
3. Refactor your code so it is elegant, readable, and can be easily run for all issues.

In [0]:
import numpy as np

In [0]:
import pandas as pd
df_v = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', header=None, na_values = '?')
#function
df_v.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [0]:
column_header = ['democrat/republican'
             ,'handicapped-infants'
             ,'water-project-cost-sharing'
             ,'adoption-of-the-budget-resolution'
             ,'physician-fee-freeze'
             ,'el-salvador-aid'
             ,'religious-groups-in-schools'
             ,'anti-satellite-test-ban'
             ,'aid-to-nicaraguan-contras'
             ,'mx-missile'
             ,'immigration'
             ,'synfuels-corporation-cutback'
             ,'education-spending'
             ,'superfund-right-to-sue'
             ,'crime'
             ,'duty-free-exports'
             ,'export-administration-act-south-africa'
                ]

df_v.columns = column_header
#attribute

In [0]:
df_vrep = df_v.replace({'n': 0, 'y': 1})

In [0]:
df_vrep.head()

Unnamed: 0,democrat/republican,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

In [0]:
from scipy.stats import ttest_1samp

In [0]:
rep = df_vrep[df_vrep['democrat/republican'] == 'republican']
rep.head()

Unnamed: 0,democrat/republican,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
7,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,,1.0
8,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
10,republican,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,1.0,0.0,0.0


In [0]:
dem = df_vrep[df_vrep['democrat/republican'] == 'democrat']
dem.head()

Unnamed: 0,democrat/republican,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0
5,democrat,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
6,democrat,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,1.0,1.0


## Confidence intervals
1. Number
2. Plot
3. Interpret

In [0]:
df_main = df_vrep

In [0]:
df_main.head()

Unnamed: 0,democrat/republican,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
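# Population standard deviation (ddof=0) of each numeric vote column; NaNs are skipped
df_std = df_main.std(numeric_only=True, ddof=0)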
print('std:', df_std)
df_size = len(df_main)
#print('Number of voters', df_main)

std: handicapped-infants                       0.496634
water-project-cost-sharing                0.499985
adoption-of-the-budget-resolution         0.490560
physician-fee-freeze                      0.493139
el-salvador-aid                           0.499977
religious-groups-in-schools               0.479557
anti-satellite-test-ban                   0.495396
aid-to-nicaraguan-contras                 0.494161
mx-missile                                0.499999
immigration                               0.499978
synfuels-corporation-cutback              0.480670
education-spending                        0.494077
superfund-right-to-sue                    0.499905
crime                                     0.491218
duty-free-exports                         0.494719
export-administration-act-south-africa    0.390161
dtype: float64


In [0]:
df_size = len(df_main['democrat/republican'])
print(df_size)

435


In [0]:
# Note: df_size counts all 435 rows, including members with no recorded vote on an issue
standard_error = df_std / (df_size**(.5))
print("standard error:", standard_error)

standard error: handicapped-infants                       0.023812
water-project-cost-sharing                0.023972
adoption-of-the-budget-resolution         0.023521
physician-fee-freeze                      0.023644
el-salvador-aid                           0.023972
religious-groups-in-schools               0.022993
anti-satellite-test-ban                   0.023752
aid-to-nicaraguan-contras                 0.023693
mx-missile                                0.023973
immigration                               0.023972
synfuels-corporation-cutback              0.023046
education-spending                        0.023689
superfund-right-to-sue                    0.023969
crime                                     0.023552
duty-free-exports                         0.023720
export-administration-act-south-africa    0.018707
dtype: float64
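
The cell above divides by the total number of rows (435) even though every issue has some missing votes, which is why `stats.sem(..., nan_policy='omit')` gives a slightly different number later on. A sketch of a per-column standard error that uses only the non-missing votes, shown on a small made-up frame (`toy` and its column names are hypothetical, not the voting data):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical votes: 1 = yes, 0 = no, NaN = no recorded vote
toy = pd.DataFrame({
    'issue_a': [1, 0, 1, 1, np.nan, 0],
    'issue_b': [0, 0, np.nan, 1, np.nan, 1],
})

n_votes = toy.count()                           # non-missing votes per column
std_err = toy.std(ddof=1) / np.sqrt(n_votes)    # sample std / sqrt(n), per column
print(std_err)

# The same quantity from SciPy for a single column
sem_a = float(stats.sem(toy['issue_a'], nan_policy='omit'))
print(sem_a)
```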


In [0]:
type(standard_error)

pandas.core.series.Series

In [0]:
standard_error_list = standard_error.tolist()

In [0]:
print(standard_error_list)

[0.02381177712168673, 0.023972444759227014, 0.023520569374124167, 0.02364420002541226, 0.023972077831243818, 0.022993006437974403, 0.023752423069596434, 0.023693202446187483, 0.02397309480006866, 0.02397211809686617, 0.023046374953218447, 0.023689179684940564, 0.023968601039088384, 0.02355208667091171, 0.023719937859748814, 0.018706793943855885]


In [0]:
standard_error_array = np.asarray(standard_error_list)

In [0]:
# Drop the party label so only the numeric vote columns remain
df_main_ = df_main.drop(columns='democrat/republican')

In [0]:
print(df_main['el-salvador-aid'])

In [0]:
type(df_main['el-salvador-aid'])

In [0]:
df_main_stderr = np.delete(df_main, 1, axis = 0)

In [0]:
df_main.head()

Unnamed: 0,democrat/republican,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [0]:
from scipy import stats

stderr = stats.sem(df_main['handicapped-infants'], nan_policy='omit')
stderr 



0.02417576424471678

In [0]:
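def std_err(df):
  """Standard error of the mean for each numeric column, ignoring NaNs."""
  stderr_list = []
  # Skip the party label column, which is text and cannot be averaged
  for column in df.select_dtypes(include='number'):
    stderr = stats.sem(df[column], nan_policy='omit')
    stderr_list.append(stderr)
  return stderr_list

In [0]:
std_err(df_main)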

In [0]:
?stats.sem


In [0]:
t = stats.t.ppf(.975 , df_size-1)

In [0]:
?stats.t.ppf

In [0]:
# Mean of each vote column (share voting yes); the text party column is excluded
df_mean = df_main.mean(numeric_only=True)

In [0]:
# Margin of error for every issue, not just the single stderr computed earlier
margin_of_error = t*standard_error_array
confidence_interval = (df_mean - margin_of_error, df_mean + margin_of_error)



In [0]:
print("Sample Mean", df_mean)


In [0]:

print("Margin of Error:", margin_of_error)


In [0]:

print("Confidence Interval:", confidence_interval)

In [0]:
confidence_interval[0]


In [0]:
confidence_interval[1]
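
The steps above (mean, standard error, t critical value) can be collected into one helper. This is a sketch rather than the notebook's original code: the function name `confidence_interval_for` and the sample `votes` are assumptions for illustration, and the helper drops missing votes before computing anything.

```python
import numpy as np
from scipy import stats

def confidence_interval_for(values, confidence=0.95):
    """Return (mean, lower, upper) for a 1-D sample, ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # keep only recorded votes
    n = len(values)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)       # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_crit * sem
    return mean, mean - margin, mean + margin

# Hypothetical yes/no votes with one missing value
votes = [1, 0, 1, 1, 0, 1, np.nan, 0, 1, 1]
mean, lower, upper = confidence_interval_for(votes)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Calling it once per column, for example in a dictionary comprehension over the vote columns, would give an interval for every issue.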

Graphical

In [0]:
df_main.head() 

In [0]:

# Drop the party label so only the numeric vote columns remain
df_main_ = df_main.drop(columns='democrat/republican')

In [0]:
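import matplotlib.pyplot as plt
import seaborn as sns

# Plot one issue at a time: axvline needs a single x position, not a whole Series
issue = 'handicapped-infants'
sns.kdeplot(df_main[issue].dropna())
plt.axvline(x=confidence_interval[0][issue], color='red')
plt.axvline(x=confidence_interval[1][issue], color='red')
plt.axvline(x=df_mean[issue], color='k');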
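
A density plot of 0/1 votes can be hard to read, so an alternative is to draw each point estimate with an error bar. The sketch below assumes matplotlib is available and uses made-up means and margins of error (`issues`, `means`, and `margins` are hypothetical values, not results from the voting data).

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical point estimates and margins of error for three issues
issues = ['issue_a', 'issue_b', 'issue_c']
means = np.array([0.44, 0.50, 0.60])
margins = np.array([0.05, 0.05, 0.04])
y_pos = np.arange(len(issues))

fig, ax = plt.subplots(figsize=(6, 3))
# One dot per issue, with a horizontal bar spanning mean +/- margin
ax.errorbar(means, y_pos, xerr=margins, fmt='o', color='k', capsize=4)
ax.axvline(0.5, color='red', linestyle='--', label='50% (no clear majority)')
ax.set_yticks(y_pos)
ax.set_yticklabels(issues)
ax.set_xlabel('Share voting yes')
ax.legend()
fig.tight_layout()
```

An interval that crosses the dashed line is consistent with an even split on that issue.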

### Interpretation
I have an interpretation

## Chi-squared
(two categories)

(by hand)

In [0]:
print(df_main.shape)
df_main.head()

In [0]:
df_main.describe()

In [0]:
df_main.describe(exclude='number')

Chi-squared
(numpy)

In [0]:
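# Contingency table of party vs. one vote (the choice of 'crime' is illustrative)
observed = pd.crosstab(df_main['democrat/republican'], df_main['crime']).values
row_sums = observed.sum(axis=1)
col_sums = observed.sum(axis=0)
# Expected counts if party and vote were independent
expected = np.outer(row_sums, col_sums) / observed.sum()
chi_squared = ((observed - expected)**2/(expected)).sum()
print(f"Chi-squared: {chi_squared}")


In [0]:
dof = (len(row_sums)-1)*(len(col_sums)-1)
print(f"Degrees of Freedom: {dof}")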
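
Once the statistic and degrees of freedom are known, the p-value is the upper-tail probability of the chi-squared distribution. A short sketch using `scipy.stats.chi2`; the numbers in `chi_squared_example` and `dof_example` are hypothetical placeholders, to be replaced with the values computed above.

```python
from scipy import stats

# Hypothetical values; substitute the statistic and dof computed from the data
chi_squared_example = 18.18
dof_example = 1

p_value = stats.chi2.sf(chi_squared_example, dof_example)   # upper-tail probability
critical_95 = stats.chi2.ppf(0.95, dof_example)             # rejection threshold at alpha = 0.05
print(f"p-value: {p_value:.2e}, 5% critical value: {critical_95:.3f}")
```

A statistic above the critical value (equivalently, a p-value below 0.05) would lead to rejecting independence at the 5% level.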

In [0]:
df_rep_std = rep.std(numeric_only=True, ddof=0)  # skip the text party column
print('std:', df_rep_std)
df_rep_size = len(rep['democrat/republican'])
print('Number of rep',df_rep_size)
df_dem_size = len(dem['democrat/republican'])
print('Number of dem', df_dem_size)

In [0]:
df_dem_std = dem.std(numeric_only=True, ddof=0)  # skip the text party column
print('std:', df_dem_std)