# Class 19: Introduction to statistical inference

Plan for today:
- Introduction to statistical inference
- Introduction to hypothesis tests


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(19)   # get class code    
# YData.download.download_class_code(19, True) # get the code with the answers 


YData.download_data("movies.csv")

There are also similar functions to download the homework:

In [None]:
# YData.download.download_class_file('project_template.ipynb', 'homework')  # downloads the class project template (hopefully you've already done this)

If you are using Google Colab, you should install polars and the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import statistics
import pandas as pd
import numpy as np
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Very quick example of interactive maps with plotly

Last class we discussed creating choropleth maps using geopandas. Let's take a quick look at creating interactive choropleth maps in plotly. 

More examples of plotly maps can be [found here](https://plotly.com/python/mapbox-county-choropleth/). There are also several other packages for creating maps, such as [geoplot](https://geopandas.org/en/stable/gallery/plotting_with_geoplot.html), [leaflet, and folium](https://www.earthdatascience.org/tutorials/introduction-to-leaflet-animated-maps/), so I encourage you to explore further if you are interested. 



In [None]:
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv",
                   dtype={"fips": str})

fig = px.choropleth_mapbox(df, geojson=counties, locations='fips', color='unemp',
                           color_continuous_scale="Viridis",
                           range_color=(0, 12),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 37.0902, "lon": -95.7129},
                           opacity=0.5,
                           labels={'unemp':'unemployment rate'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()


## 2. Statistical inference

In statistical inference we use a smaller sample of data to make claims about a larger population of data. 

As an example, let's look at the [2020 election](https://www.cookpolitical.com/2020-national-popular-vote-tracker) between Donald Trump and Joe Biden, and let's focus on the results from the state of Georgia. After all the votes had been counted, the results showed that:

- Biden received 2,473,633 votes
- Trump received 2,461,854 votes

Since all the votes have been counted, we can precisely calculate the population parameter, namely the proportion of votes that Biden received, which we will denote with the symbol $\pi_{Biden}$. 

Let's create names `num_trump_votes` and `num_biden_votes`, and calculate `true_prop_Biden` which is the value $\pi_{Biden}$. 

In [None]:
num_trump_votes = 2461854  # 2,461,854
num_biden_votes = 2473633  # 2,473,633

true_prop_Biden = num_biden_votes/(num_biden_votes + num_trump_votes)

true_prop_Biden

The code below creates a DataFrame called `georgia_df` that captures these election results. Each row in the DataFrame represents a vote. The column `Voted Biden` is `True` if a voter voted for Biden and `False` if the voter voted for Trump. 

In [None]:
biden_votes = np.repeat(True, num_biden_votes)     # create 2,473,633 Trues for the Biden votes
trump_votes = np.repeat(False, num_trump_votes)    # create 2,461,854 Falses for the Trump votes
election_outcome = np.concatenate((biden_votes, trump_votes))  # put the votes together

georgia_df = pd.DataFrame({"Voted Biden": election_outcome})  # create a DataFrame with the data
georgia_df = georgia_df.sample(frac = 1)   # shuffle the order to make it more realistic

georgia_df.head()

Now suppose we didn't know the actual value of $\pi_{Biden}$ and we wanted to estimate it based on a poll of 1,000 voters. We can simulate this by using the pandas `.sample(n = )` method.

Let's simulate sampling random voters.

In [None]:
# sample 10 random voters
georgia_df.sample(10)  

In [None]:
# simulate proportions of voters that voted for Biden - i.e., p-hats

one_sample = georgia_df.sample(1000)

np.mean(one_sample['Voted Biden'])

### Sampling distribution

Suppose 100 polls were conducted. How many of them would show that Biden would get the majority of the vote? 

Let's simulate this "sampling distribution" of statistics now... 


In [None]:
%%time


sample_size = 1000
num_simulations = 100

sampling_dist = []

for i in range(num_simulations):
    curr_sample = georgia_df.sample(sample_size)
    prop_biden = np.mean(curr_sample["Voted Biden"])
    sampling_dist.append(prop_biden)



In [None]:
plt.hist(sampling_dist, edgecolor = "black");
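
The question posed at the start of this subsection (how many of 100 polls would show Biden above 50%?) can be answered by counting. The sketch below is self-contained and does not use `georgia_df`; instead it assumes that drawing 1,000 voters without replacement can be modeled with NumPy's hypergeometric sampler, which returns the number of Biden voters in each poll. The names `rng`, `poll_props`, and `num_majority` and the seed value are illustrative choices, not part of the class code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily, only for reproducibility

num_biden_votes = 2473633
num_trump_votes = 2461854

sample_size = 1000
num_simulations = 100

# Each entry is the number of Biden voters in one poll of 1,000 voters
# drawn without replacement from all Georgia votes.
biden_counts = rng.hypergeometric(num_biden_votes, num_trump_votes,
                                  sample_size, size=num_simulations)

# Convert counts to sample proportions (p-hats)
poll_props = biden_counts / sample_size

# Count the polls in which Biden has a strict majority
num_majority = np.sum(poll_props > 0.5)
print(f"{num_majority} of {num_simulations} simulated polls show Biden above 50%")
```

Because the true proportion is so close to 0.5, the count can change noticeably from run to run if the seed is changed.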

### A faster way to simulate data

Rather than simulating polling outcomes by pulling random samples from a DataFrame, let's simulate each vote by simulating randomly flipping a coin, where the probability of getting a "Head" (True value) is the probability of Biden getting a vote.

To do this we can generate random numbers between 0 and 1. If a number is less than the value of $\pi_{Biden}$, then it is a vote for Biden (i.e., a `True` value) otherwise it is a vote for Trump (`False` value). 

Let's generate one sample of 1,000 voters...


In [None]:

# use `np.random.rand()` to generate 1,000 numbers between 0 and 1
thousand_random_nums = np.random.rand(1000)

print(thousand_random_nums[0:5])

# visualize these 1,000 numbers as a histogram
plt.hist(thousand_random_nums, edgecolor = "black");


In [None]:
# convert to a vector of Booleans (True = vote for Biden, False = vote for Trump)

voter_sample = thousand_random_nums <= true_prop_Biden

voter_sample[0:5]

In [None]:
# Calculate the proportion of votes for Biden in our sample

# method 1
sample_biden_prop = np.sum(voter_sample)/1000
print(sample_biden_prop)

# method 2
sample_biden_prop2 = np.mean(voter_sample)
print(sample_biden_prop2)


In [None]:
# function to generate proportion of Biden voters based on a poll

def generate_prop_biden(poll_size):
    random_sample = np.random.rand(poll_size) <= true_prop_Biden
    return np.mean(random_sample)


In [None]:
%%time

# sampling distribution of many polls conducted

sample_size = 1000
num_simulations = 100

sampling_dist = []

for i in range(num_simulations):
    prop_biden = generate_prop_biden(sample_size)
    sampling_dist.append(prop_biden)

In [None]:
plt.hist(sampling_dist, edgecolor = "black", bins = 10);
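
The spread of this sampling distribution can be compared with the usual standard error formula for a sample proportion, $SE = \sqrt{\pi(1 - \pi)/n}$. The sketch below is one way to check this and is not part of the class code: it simulates all polls at once as a 2-D array (one row per poll) instead of looping, and it uses 5,000 polls rather than 100 so that the simulated standard deviation is more stable. The names `votes`, `simulated_se`, and `theoretical_se` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed for reproducibility

true_prop_biden = 2473633 / (2473633 + 2461854)
sample_size = 1000
num_simulations = 5000  # more polls than in the class example, an assumption for stability

# One row per simulated poll, one column per simulated voter.
# A value below true_prop_biden counts as a vote for Biden.
votes = rng.random((num_simulations, sample_size)) < true_prop_biden

# Proportion of Biden votes in each poll
sampling_dist = votes.mean(axis=1)

simulated_se = np.std(sampling_dist)
theoretical_se = np.sqrt(true_prop_biden * (1 - true_prop_biden) / sample_size)

print("simulated SE:  ", simulated_se)
print("theoretical SE:", theoretical_se)
```

The two values should agree closely, which suggests that the coin-flip simulation behaves like the sampling model the formula assumes.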

## 3. Hypothesis tests

In hypothesis testing, we start with a claim about a population parameter (e.g., µ = 4.2, or π = 0.25).

This claim implies we should get a certain distribution of statistics, called the "null distribution". 

If our observed statistic is highly unlikely to come from the null distribution, we reject the claim. 

We can break down the process of running a hypothesis test into 5 steps. 

1. State the null and alternative hypotheses
2. Calculate the observed statistic of interest
3. Create the null distribution 
4. Calculate the p-value 
5. Make a decision

Let's run through these steps now!


#### Step 1: State the null and alternative hypotheses

$H_0: \pi = 0.5$

$H_A: \pi < 0.5$

Here $\pi$ is the proportion of all movies that pass the Bechdel test.


#### Step 2: Calculate the observed statistic of interest


In [None]:
# load the data

movies = pd.read_csv("movies.csv")

movies.head(3)

In [None]:
# reduce data to a smaller number of columns: "title" and "binary"

movies_smaller = movies[["title", "binary"]]

In [None]:
# calculate the proportion of movies that pass the Bechdel test

booleans_passed = movies_smaller["binary"] == "PASS"

prop_passed = np.mean(booleans_passed)

prop_passed


#### Step 3: Create the null distribution 

We need to create a null distribution, which is the distribution of statistics we would expect to get if the null hypothesis is true. 

**Question**: About what percent of the movies would we expect to pass the Bechdel test if the null hypothesis were true? 

**Answer**: 50%

Let's create simulated data that is consistent with this!


In [None]:
# Let's generate one proportion consistent with the null hypothesis

n = movies.shape[0]
print(n)

null_sample = np.random.rand(n) < .5

np.mean(null_sample)


In [None]:
# Let's write a function to generate a proportion consistent with a null hypothesis

def generate_prop_bechdel(n, null_prop):
    random_sample = np.random.rand(n) <= null_prop
    return np.mean(random_sample)

generate_prop_bechdel(1794, .5)

In [None]:
# Let's generate a null distribution 

null_dist = []

for i in range(10000):    
    null_dist.append(generate_prop_bechdel(1794, .5))


In [None]:
# visualize the null distribution 

plt.hist(null_dist, edgecolor = "black", bins = 20) #, range = (.4, .6));
plt.plot(prop_passed, 30, '.', markersize = 30, color = "red");

#### Step 4: Calculate the p-value 

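Calculate the proportion of points in the null distribution that are as extreme as or more extreme than the observed statistic. Since the alternative hypothesis is $\pi < 0.5$, "more extreme" here means less than or equal to the observed proportion. 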


In [None]:
# Calculate the p-value

stats_more_extreme = np.array(null_dist) <= prop_passed

print(stats_more_extreme[0:5])

p_value = np.mean(stats_more_extreme)

p_value
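
With 10,000 simulated statistics, the smallest nonzero p-value the simulation can report is 1/10,000, so a result of exactly 0 only means "smaller than the simulation can resolve". As a rough cross-check, a normal approximation to the null distribution can be used, with $z = (\hat{p} - \pi_0)/\sqrt{\pi_0(1 - \pi_0)/n}$ and the p-value given by the lower tail area below $z$. The sketch below is not part of the class code; the function name is made up for illustration, and the observed proportion of 0.45 is a hypothetical placeholder, not the value computed from `movies.csv`.

```python
import math

def normal_approx_lower_p_value(observed_prop, n, null_prop=0.5):
    """Lower-tail p-value for a proportion using a normal approximation."""
    standard_error = math.sqrt(null_prop * (1 - null_prop) / n)
    z = (observed_prop - null_prop) / standard_error
    # Standard normal CDF evaluated at z, written with the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical observed proportion (an assumption for illustration only);
# n = 1794 matches the sample size used in the cells above.
hypothetical_prop_passed = 0.45
approx_p_value = normal_approx_lower_p_value(hypothetical_prop_passed, n=1794)
print(approx_p_value)
```

To use this with the real data, one would pass `prop_passed` and `n` from the earlier cells in place of the placeholder values.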

#### Step 5: Make a decision

Since the p-value is very small (essentially zero), it is very unlikely that our statistic came from the null distribution. Thus we will reject the null hypothesis and conclude that fewer than 50% of movies pass the Bechdel test. 
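
Steps 3 to 5 above can be collected into a single function. The following is a minimal sketch under some assumptions: the function name and its parameters are hypothetical, the null distribution is generated with a binomial sampler rather than the `np.random.rand()` loop used above, the example data (90 passes out of 200) are made up, and the 0.05 cutoff is a common convention rather than something specified in these notes.

```python
import numpy as np

def simulate_lower_tail_p_value(observed_prop, n, null_prop=0.5,
                                num_simulations=10000, seed=None):
    """Simulate a null distribution of proportions and return a lower-tail p-value."""
    rng = np.random.default_rng(seed)
    # Step 3: null distribution, one simulated proportion per entry
    null_dist = rng.binomial(n, null_prop, size=num_simulations) / n
    # Step 4: proportion of null statistics as small as or smaller than the observed one
    return np.mean(null_dist <= observed_prop)

# Hypothetical data: 90 of 200 items "pass" (made up for illustration)
observed_prop = 90 / 200
p_value = simulate_lower_tail_p_value(observed_prop, n=200, seed=0)
print("p-value:", p_value)

# Step 5: decision, using a conventional (assumed) significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

The function only handles the one-sided alternative $\pi < \pi_0$ used in this class; a different alternative would need a different comparison in Step 4.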
