# LS123: Large N and Hypothesis Testing

This lab will cover the basics of statistical sampling, the law of averages, and hypothesis testing. You should gain an intuition for how samples relate to populations, and learn the basics of statistical inference in the social sciences.

In [None]:
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
%matplotlib inline
import matplotlib.pyplot as plot
plot.style.use('fivethirtyeight')

## Data

We'll continue using the ANES data for this lab!

In [None]:
anes = pd.read_csv('../data/anes/ANES_legalst123_cleaned.csv')
anes.head()

## Empirical Distributions

### Data Manipulation and Plotting Review

Let's look at how liberal respondents rated themselves in the post-election survey. Write code that saves the "post_liberal_rating" column in the ANES data to a Series variable. Keep in mind that valid answers lie in the range [0, 100], so be sure to subset to only those values.

Plot a histogram of the data:
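
As a rough sketch of the subsetting step, here is one possible approach on a small made-up DataFrame rather than the real ANES file. The name `toy_anes` and the out-of-range values 998 and -9 are invented for illustration; the actual codes in the dataset may differ.

```python
import pandas as pd

# Toy stand-in for the ANES data; the out-of-range values are invented for illustration.
toy_anes = pd.DataFrame({"post_liberal_rating": [50, 85, 998, 0, -9, 100, 70]})

# Save the column to a Series, then keep only values in the valid range [0, 100].
liberal = toy_anes["post_liberal_rating"]
liberal_valid = liberal[liberal.between(0, 100)]  # between() includes both endpoints by default

print(liberal_valid.tolist())
```

Calling `liberal_valid.plot.hist(bins=20)` on the filtered Series would then draw the histogram; the bin count of 20 is an arbitrary choice.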

### Question 1

What patterns do you notice? Where is the center of the distribution? What does this suggest about how Americans tend to self-identify?

Answer:

### Law of Averages

Write a function, "empirical_hist_anes", that takes a Series and a sample size as its arguments, draws a random sample of that size from the Series, and then plots a histogram of the sample. Consult Adhikari and DeNero for help!

In [None]:
def empirical_hist_anes(series, n):
    ...

Check how many values are in the Series with its "size" attribute, and then use your self-defined function to plot histograms taking sample sizes 10, 100, 1000, and the total number of rows.
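
To see the Law of Averages numerically rather than visually, the sketch below compares each sample's histogram to the population's histogram. It uses synthetic ratings, not the ANES data, and the helper name `bin_proportions` is made up for this example. It also assumes sampling without replacement, which is the pandas default.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the valid liberal ratings (NOT real ANES data).
rng = np.random.default_rng(0)
population = pd.Series(rng.integers(0, 101, size=3000))

bins = np.arange(0, 110, 10)  # ten-point bins covering [0, 100]

def bin_proportions(series):
    """Share of values falling in each bin."""
    counts, _ = np.histogram(series, bins=bins)
    return counts / counts.sum()

population_props = bin_proportions(population)

# Compare each sample's histogram to the population's histogram.
gaps = {}
for n in [10, 100, 1000, population.size]:
    sample = population.sample(n, random_state=0)  # without replacement by default
    gaps[n] = np.abs(bin_proportions(sample) - population_props).max()
    print(n, round(gaps[n], 3))
```

The largest gap between sample and population bin shares shrinks as the sample grows, and it is exactly zero when the sample is the whole population. Inside a plotting function, `sample.plot.hist(bins=bins)` would draw the corresponding histogram.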

### Question 2

What happens to the histograms (compared to the original in Q1) as you increase the sample size? How does this relate to the Law of Averages? What is the relationship between sample size and population parameter estimation?

## Hypothesis Testing

In this section, we'll cover the basic tools for hypothesis testing. 

The goal in conducting a hypothesis test is to answer the question, "How likely was I to observe my test statistic by chance alone?" We say a result is statistically significant if it is sufficiently far from the center of the distribution expected under chance, and therefore unlikely to have occurred just by chance.

The basic way to frame a hypothesis test is as follows:

1. Define a null $(H_0)$ and alternative $(H_A)$ hypothesis. The null hypothesis is usually framed as "no statistical relationship between the observed data and the background distribution" and the alternative hypothesis is the opposite. More concretely, the null is our default position, and assumes that the observed statistic likely came from the background distribution.

2. Calculate a test statistic (for example, a t statistic or a $\chi^2$ statistic).

3. Check whether the test statistic is far enough from the center of the distribution expected under the null hypothesis. Traditionally, this was done by checking against a reference table, but in Python, we'll use p-values. A p-value is the probability of observing a test statistic at least as extreme as ours if the null hypothesis were true. In the social sciences, a p-value below .05 is the conventional threshold for statistical significance.

4. Either reject or fail to reject the null hypothesis.
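
The four steps can be sketched in code as follows. This is only an illustration on synthetic data: the sample, the null value of 0, and the choice of a one-sample t-test are all assumptions made for the example, not part of the lab's data.

```python
import numpy as np
from scipy import stats

# Synthetic sample (not ANES data): 200 draws from a normal distribution with true mean 0.5.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.5, scale=1.0, size=200)

# Step 1: H0 says the population mean is 0; HA says it is not.
null_mean = 0.0

# Step 2: compute the test statistic (here, a one-sample t statistic).
result = stats.ttest_1samp(sample, popmean=null_mean)

# Step 3: compare the p-value to the chosen significance threshold.
alpha = 0.05

# Step 4: reject or fail to reject the null hypothesis.
reject_null = result.pvalue < alpha
print(result.statistic, result.pvalue, reject_null)
```

Because the synthetic sample was drawn with a true mean well away from 0, the p-value falls below the threshold and the null is rejected.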

### Jury Selection

First, we'll use the jury selection example from the Adhikari and DeNero book. This example is based on the U.S. Supreme Court case, Swain v. Alabama. Robert Swain was convicted by an all-white jury, and challenged his conviction on the basis that it was statistically unlikely that a jury would be all-white by chance, given that the racial composition of the county was 26% black. Juries were selected from a panel of 100. In this case, only 8 jurors on the panel were black.

Was it likely that the panel would only include 8 black jurors out of 100, given that 26% of the county was black?
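
One way to put a number on this question is sketched below. It assumes each of the 100 panel members is drawn independently from a population that is 26% black, so the count of black panel members follows a binomial distribution. That modeling choice is an assumption of this example, as are the variable names.

```python
import numpy as np
from scipy import stats

panel_size = 100   # size of the jury panel, from the text
p_black = 0.26     # share of the county that was black, from the text
observed = 8       # black panel members actually observed

# Exact probability of seeing 8 or fewer black panel members under random selection.
exact_p = stats.binom.cdf(observed, panel_size, p_black)

# The same question answered by simulation: draw many random panels and count.
rng = np.random.default_rng(0)
simulated_counts = rng.binomial(panel_size, p_black, size=10_000)
simulated_p = np.mean(simulated_counts <= observed)

print(exact_p, simulated_p)
```

Under this model, a panel with 8 or fewer black members is extremely rare, which is the sense in which the observed panel looks inconsistent with random selection.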

In [None]:
# Create the table
jury = pd.DataFrame(data = {'Ethnicity': ['Asian', 'Black', 'Latino', 'White', 'Other'],
                           'Eligible': [0.15, 0.18, 0.12, 0.54, 0.01],
                           'Panels': [0.26, 0.08, 0.08, 0.54, 0.04]}
)

jury

In [None]:
# Horizontal Bar Chart
jury.plot.barh('Ethnicity')

In [None]:
# Augment with the difference between the "Panels" and "Eligible" columns
jury_with_diffs = jury.assign(Difference=jury['Panels'] - jury['Eligible'])
jury_with_diffs

Write code that does a t-test between the "Eligible" and "Panels" columns. Hint: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind

In [None]:
# scipy t-test
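
A minimal sketch of the call, rebuilding the same table so the snippet runs on its own. It uses `scipy.stats.ttest_ind` with its default settings, which is one reasonable reading of the exercise rather than the only one.

```python
import pandas as pd
from scipy import stats

# Same proportions as the jury table above.
jury = pd.DataFrame({
    "Ethnicity": ["Asian", "Black", "Latino", "White", "Other"],
    "Eligible": [0.15, 0.18, 0.12, 0.54, 0.01],
    "Panels": [0.26, 0.08, 0.08, 0.54, 0.04],
})

# Two-sample t-test comparing the means of the two columns.
t_stat, p_value = stats.ttest_ind(jury["Eligible"], jury["Panels"])
print(t_stat, p_value)
```

Keep in mind when interpreting the result that both columns are proportions summing to 1 over five groups, so they share the same mean of 0.2. A test that compares means therefore cannot pick up the group-by-group differences shown in the "Difference" column.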

### Hypothesis Testing on ANES Data

Now let's try with the ANES data! Write code that creates a new DataFrame with "post_liberal_rating" and "post_conservative_rating" as its columns, and includes only values below 150.
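
The sketch below shows one possible way to do this on a toy DataFrame. The name `toy_anes`, the extra column, and the value 998 are invented for illustration. Here out-of-range values are turned into NaN rather than dropped, so that the two columns stay aligned row by row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ANES data; 998 is an invented out-of-range code.
toy_anes = pd.DataFrame({
    "post_liberal_rating": [50, 85, 998, 0, 70, np.nan],
    "post_conservative_rating": [60, 998, 40, 100, 30, 55],
    "other_column": list("abcdef"),
})

# Keep just the two rating columns.
ratings = toy_anes[["post_liberal_rating", "post_conservative_rating"]]

# Keep values below 150; where() replaces everything else with NaN.
ratings_below_150 = ratings.where(ratings < 150)
print(ratings_below_150)
```

An alternative would be to drop whole rows with `ratings[(ratings < 150).all(axis=1)]`; which is preferable depends on whether you want to keep a respondent's valid answer when the other one is invalid.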

### Question 3

Plot histograms of the post liberal rating and the post conservative rating side by side. Experiment with different bin widths. Visually, what can you infer about the shape of each distribution?

### Question 4

Now write code to do a t-test between the liberal and conservative ratings. For the t-test to work, you have to remove NaN values first (check the pandas documentation to find out how to do so).

In [None]:
# Drop NaN

In [None]:
# t-test
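
As a hedged illustration of both steps, the sketch below drops NaN values and runs the test on synthetic ratings. The numbers are randomly generated and are not ANES results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic ratings (not ANES data) with some missing values mixed in.
rng = np.random.default_rng(1)
ratings = pd.DataFrame({
    "post_liberal_rating": rng.normal(50, 20, size=300),
    "post_conservative_rating": rng.normal(55, 20, size=300),
})
ratings.iloc[::25, 0] = np.nan  # blank out every 25th liberal rating

# Drop any row that has a NaN in either column.
clean = ratings.dropna()

# Two-sample t-test on the cleaned columns.
result = stats.ttest_ind(clean["post_liberal_rating"], clean["post_conservative_rating"])
print(len(clean), result.statistic, result.pvalue)
```

`ttest_ind` also accepts `nan_policy="omit"`, which skips missing values without a separate cleaning step; dropping them explicitly makes it easier to see how many observations remain.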

What does the p-value of this t-test indicate? Can we reject the null hypothesis that the two distributions have the same mean?

## Central Limit Theorem

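The central limit theorem (CLT) is a fundamental concept in statistics. It says that, as the sample size grows, the distribution of the means of repeated random samples approaches a normal distribution centered on the population mean, whatever the shape of the population's own distribution (provided its variance is finite). This is a powerful result: it lets us reason about how far a single sample mean is likely to be from the population mean without having to draw many other samples. This insight is particularly important in the social sciences, where it underpins the significance tests used with regression and other inferential methods.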
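
In symbols, a common informal statement of the result is given below. It assumes independent draws $X_1, \dots, X_n$ from a population with mean $\mu$ and finite variance $\sigma^2$; this notation is introduced here for illustration.

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\bar{X}_n \;\text{is approximately}\; N\!\left(\mu,\ \frac{\sigma^2}{n}\right) \;\text{for large } n
$$

The spread of the sample mean shrinks in proportion to $1/\sqrt{n}$, which is why larger samples give more precise estimates.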

Using the liberal ratings ("post_liberal_rating") again, let's illustrate this concept. Write code that does the following:

1. Define a sample size, and number of repetitions. Also, create an empty array to store the sample means.

2. Write a for loop that loops over the number of repetitions and:
    a. Draws a sample of that size from the liberal ratings
    b. Calculates its mean
    c. Appends the calculated mean to the array that stores sample means
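
The steps above can be sketched as follows. The population here is synthetic and deliberately skewed (an exponential distribution), not the ANES ratings, and the sample size and number of repetitions are arbitrary choices. Sampling with replacement is also an assumption of this example.

```python
import numpy as np
import pandas as pd

# Synthetic, deliberately skewed "population" standing in for the ratings (not ANES data).
rng = np.random.default_rng(0)
population = pd.Series(rng.exponential(scale=30.0, size=5000))

sample_size = 100
repetitions = 2000
sample_means = np.array([])  # step 1: empty array to collect the means

for i in range(repetitions):
    sample = population.sample(sample_size, replace=True, random_state=i)  # step 2a
    mean = sample.mean()                                                    # step 2b
    sample_means = np.append(sample_means, mean)                            # step 2c

# The sample means should center on the population mean, with a spread
# close to the population standard deviation divided by sqrt(sample_size).
print(population.mean(), sample_means.mean())
print(population.std() / np.sqrt(sample_size), sample_means.std())
```

Even though the population is far from bell-shaped, a histogram of `sample_means` should look roughly normal. For very large numbers of repetitions, collecting the means in a Python list and converting once at the end is faster than calling `np.append` on every iteration.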

Using this code, experiment with various sample sizes and number of repetitions. Plot each result. For instance, try the following:

1. Sample size = 20, repetitions = 10
2. Sample size = 100, repetitions = 10
3. Sample size = 100, repetitions = 100000
4. Sample size = 500, repetitions = 100000
5. Sample size = 1000, repetitions = 150000

### Question 5

What happens as you increase the sample size and number of repetitions? How does this property justify the use of statistical methods across a range of problems?