## Assignment - Build a confidence interval

A confidence interval is a range around a point estimate, the width of which is determined by the desired confidence level (together with the sample size and the variability of the data). For instance, we might say that 52% of Americans prefer tacos to burritos, with a 95% confidence interval of +/- 5%.

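52% (0.52) is the point estimate, +/- 5% is the margin of error, and the resulting interval $[0.47, 0.57]$ is the confidence interval. "95% confidence" corresponds to a significance level of $\alpha = 1 - 0.95 = 0.05$: a null value that falls outside the interval would be rejected with a p-value $< 0.05$.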

In this case, the confidence interval includes $0.5$ - which is the natural null hypothesis (that half of Americans prefer tacos and half burritos, thus there is no clear favorite). So in this case, we could use the confidence interval to report that we've failed to reject the null hypothesis.
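The check described above can be sketched in a few lines. The helper names `interval_from_margin` and `contains` are made up for illustration; the numbers are the taco example from the text.

```python
def interval_from_margin(point_estimate, margin_of_error):
    """Build (low, high) from a point estimate and a +/- margin of error."""
    return point_estimate - margin_of_error, point_estimate + margin_of_error


def contains(interval, value):
    """True if value lies inside the closed interval."""
    low, high = interval
    return low <= value <= high


taco_ci = interval_from_margin(0.52, 0.05)   # roughly (0.47, 0.57)
null_value = 0.5                             # "no clear favorite"

# We reject the null only if the null value falls outside the interval.
reject_null = not contains(taco_ci, null_value)
print(taco_ci, reject_null)
```

Because `0.5` lies inside the interval, `reject_null` is `False`, which matches the "fail to reject" conclusion above.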

But providing the full analysis with a confidence interval, including a graphical representation of it, can be a helpful and powerful way to tell your story. Done well, it is also more intuitive to a layperson than simply saying "fail to reject the null hypothesis" - it shows that in fact the data does *not* give a single clear result (the point estimate) but a whole range of possibilities.

How is a confidence interval built, and how should it be interpreted? It does *not* mean that 95% of the data lies in that interval. Instead, the frequentist interpretation is "if we were to repeat this experiment 100 times and build an interval from each sample, we would expect ~95 of those intervals to contain the true population value."
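One way to make the frequentist interpretation concrete is to simulate it. The sketch below assumes a normally distributed population with a known mean (the values `10.0` and `3.0`, the sample size, and the function name `coverage` are all arbitrary choices for illustration), draws many samples, builds a t-based interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats


def coverage(true_mean=10.0, true_sd=3.0, n=50, trials=1000,
             level=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain true_mean."""
    rng = np.random.default_rng(seed)
    # One row per repeated "experiment", n observations each
    samples = rng.normal(true_mean, true_sd, size=(trials, n))

    means = samples.mean(axis=1)
    std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf((1 + level) / 2, n - 1)

    lows = means - t_crit * std_errors
    highs = means + t_crit * std_errors

    hits = (lows <= true_mean) & (true_mean <= highs)
    return hits.mean()


rate = coverage()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The printed share should land close to the chosen confidence level, and it varies a little from seed to seed.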

For a 95% confidence interval and a normal(-ish) sampling distribution, you can simply remember that +/- 2 standard deviations contains about 95% of the probability mass. The 95% confidence interval based on a given sample is therefore centered at the sample mean (the point estimate) and extends +/- 2 (or technically 1.96) standard errors, where the standard error is the sample standard deviation divided by $\sqrt{n}$.

Different distributions/assumptions (90% confidence, 99% confidence) will require different math, but the overall process and interpretation (with a frequentist approach) will be the same.
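As a rough illustration of how the multiplier changes with the confidence level, the sketch below looks up two-sided critical values from the standard normal and from Student's t distribution. The helper name `critical_values` and the choice of 29 degrees of freedom are assumptions made for the example.

```python
from scipy import stats


def critical_values(level, df):
    """Return (z, t) two-sided critical values for a confidence level."""
    q = (1 + level) / 2          # cumulative probability at the upper cutoff
    z = stats.norm.ppf(q)        # normal approximation
    t = stats.t.ppf(q, df)       # Student's t with df degrees of freedom
    return z, t


for level in (0.90, 0.95, 0.99):
    z, t = critical_values(level, df=29)
    print(f"{level:.0%}: z = {z:.3f}, t(df=29) = {t:.3f}")
```

The t values are slightly larger than the z values for small samples and approach them as the degrees of freedom grow, which is why the `ci` function in this notebook uses `stats.t.ppf`.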

Your assignment - using the data from the prior module ([congressional voting records](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)):


### Confidence Intervals:
1. Generate and numerically represent a confidence interval
2. Graphically (with a plot) represent the confidence interval
3. Interpret the confidence interval - what does it tell you about the data and its distribution?

### Chi-squared tests:
4. Take a dataset that we have used in the past in class that has **categorical** variables. Pick two of those categorical variables and run a chi-squared test on that data
  - By hand using Numpy
  - In a single line using Scipy
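A minimal sketch of both approaches, assuming a small contingency table of made-up counts (the numbers below are illustrative only and do not come from any dataset used in class):

```python
import numpy as np
from scipy import stats

# Illustrative observed counts: 2 rows x 3 columns (made-up numbers)
observed = np.array([[30, 10, 20],
                     [20, 30, 10]])

# --- By hand with NumPy ---
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected counts if the two variables were independent
expected = row_totals * col_totals / grand_total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, dof)            # upper-tail probability

# --- In a single line with SciPy ---
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False)

print(chi2_stat, p_value, dof)
print(chi2_scipy, p_scipy, dof_scipy)
```

`correction=False` turns off Yates' continuity correction, which SciPy otherwise applies to 2x2 tables; it has no effect on this 2x3 table but keeps the two calculations directly comparable.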


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

# Needed for grabbing the dataset from https
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Using the adult dataset for the initial code, as I wanted to make sure
# I could replicate the instructor's results without copying code

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/adult.csv', na_values=" ?")

# Function to generate a CI and check its results against SciPy
# s = series of numeric values
# c = confidence level (e.g. 0.95)

# QUOTE:
# 'The confidence interval represents values for the data provided, for which the difference between the
# parameter and the observed estimate is not statistically significant at the c (confidence level) level.
# a confidence interval (CI) is a type of interval estimate, computed from the statistics of the observed
# data, that might contain the true value of an unknown population parameter. ... Most commonly, the 95%
# confidence level is used. However, other confidence levels can be used, for example, 90% and 99%.'

def ci(s, c):

    # s can be any 1-D numeric sequence (pandas Series or NumPy array)
    # Mean of the series
    m = np.mean(s)
    # Length of the series
    l = len(s)
    # Std of the series using 1 as degree of freedom
    std = np.std(s, ddof=1)
    # Standard error of the series
    se = std / np.sqrt(l)
    # The t critical value that corresponds to the
    # confidence level and degrees of freedom
    t = stats.t.ppf((1 + c) / 2, l - 1)
    # Margin of error
    moe = t * se

    # Return variables
    low = m - moe
    high = m + moe
    mid = m

    # Compare our results with SciPy, using the same confidence level c
    s_low, s_high = stats.t.interval(c,
                                     l - 1,
                                     loc=m,
                                     scale=se)

    # Compare with a tolerance; exact float equality is too strict
    if not (np.isclose(s_low, low) and np.isclose(s_high, high)):

        print('Your results:')
        print(str(low) + ' - ' + str(high))
        print('SciPy\'s:')
        print(str(s_low) + ' - ' + str(s_high))

        raise ValueError('The confidence interval you generated does not match SciPy\'s!')

    # Return a tuple that contains the results
    return moe, low, mid, high

# QUOTE:
# The chi-squared goodness of fit test or Pearson’s chi-squared test is used to assess whether a set of
# categorical data is consistent with proposed values for the parameters. This function checks against
# SciPy's results as well (will add in a bit).

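def cs(observed, expected):

    # Pearson's statistic: sum of (observed - expected)^2 / expected
    stat = 0
    for o, e in zip(observed, expected):
        stat += (float(o) - float(e))**2 / float(e)

    # Return only after every category has been added
    return stat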




# Confidence Interval
moe, low, mid, high = ci(df['age'], 0.95)

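# Histogram with a density curve (histplot replaces the deprecated distplot)
ax = sns.histplot(df['age'], bins=72, kde=True)

# Vertical lines for the interval bounds (red) and the mean (black)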
plt.axvline(x=low, color='red')
plt.axvline(x=mid, color='black')
plt.axvline(x=high, color='red')
plt.show()

# I was able to replicate the code and test the function for
# generating the CI. Now loading the congressional voting records:

columns = ['party',
           'handicapped-infants',
           'water-project',
           'budget',
           'physician-fee-freeze',
           'el-salvador-aid',
           'religious-groups',
           'anti-satellite-ban',
           'aid-to-contras',
           'mx-missile',
           'immigration',
           'synfuels',
           'education',
           'right-to-sue',
           'crime',
           'duty-free',
           'south-africa']

df_voting = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                 header = None,
                 names = columns,
                 na_values = '?'
                 )

# Replace string values with numbers so they are easier to process later.

df_voting = df_voting.replace({'n':0,
                               'y':1})

# Separate the two parties by subsetting.
#####################

# - Republicans
df_r = df_voting[df_voting['party'] == 'republican']

# Do not need party, it will just be in the way.
# drop=True keeps the old index from being added back as a column.
df_r = df_r.drop(['party'], axis = 1).reset_index(drop=True)

# - Democrats
df_d = df_voting[df_voting['party'] == 'democrat']

# Do not need party, it will just be in the way
df_d = df_d.drop(['party'], axis = 1).reset_index(drop=True)


# Share of "yes" votes among Republicans on two bills, with 95% CIs
moe1, low1, mid1, high1 = ci(df_r['religious-groups'].dropna(), 0.95)
moe2, low2, mid2, high2 = ci(df_r['budget'].dropna(), 0.95)


fig, ax = plt.subplots()

ax.set_title("Proportion of Republicans voting yes (95% CI)")

ax.bar(x=0, height=mid1, yerr=moe1)
ax.bar(x=1, height=mid2, yerr=moe2)


ax.set_xticks([0,1])
ax.set_xticklabels(["Religious Groups", "Budget Bill"])
plt.show()

# Grab a sample

small = df_voting.sample(40)

# chi2_contingency expects a contingency table of counts, not raw rows,
# so cross-tabulate two categorical variables first. Rows where either
# value is NaN (abstentions) are left out of the table rather than
# being recoded as "no" votes.

observed = pd.crosstab(small['party'], small['budget'])

# Chi-squared test of independence in a single line using SciPy
stat, p, dof, exp = stats.chi2_contingency(observed)
print(stat)
print(p)
print(dof)
print(exp)

# TODO - your code!

## Stretch goals:

1. Write a summary of your findings, mixing prose and math/code/results. *Note* - yes, this is by definition a political topic. It is challenging but important to keep your writing voice *neutral* and stick to the facts of the data. Data science often involves considering controversial issues, so it's important to be sensitive about them (especially if you want to publish).
2. Apply the techniques you learned today to your project data or other data of your choice, and write/discuss your findings here.
3. Refactor your code so it is elegant, readable, and can be easily run for all issues.

## Resources

- [Interactively visualize the Chi-Squared test](https://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html)
- [Calculation of Chi-Squared test statistic](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
- [Visualization of a confidence interval generated by R code](https://commons.wikimedia.org/wiki/File:Confidence-interval.svg)
- [Expected value of a squared standard normal](https://math.stackexchange.com/questions/264061/expected-value-calculation-for-squared-normal-distribution) (it's 1 - which is why the expected value of a Chi-Squared with $n$ degrees of freedom is $n$, as it's the sum of $n$ squared standard normals)
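
The claim in the last resource can be checked numerically. The sketch below is a simulation under assumed settings (5 degrees of freedom, 100,000 draws, a fixed seed), not a proof: it sums squared standard normals and compares the average with $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                  # degrees of freedom (arbitrary choice)
draws = 100_000        # number of simulated chi-squared values

z = rng.standard_normal(size=(draws, n))

# Each squared standard normal has expected value 1 ...
mean_of_squares = (z ** 2).mean()

# ... so the sum of n of them (a chi-squared draw) has expected value n
chi2_draws = (z ** 2).sum(axis=1)
mean_chi2 = chi2_draws.mean()

print(mean_of_squares, mean_chi2)
```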