# Chi-Squared Test of Independence

This notebook demonstrates how to use the chi-squared test of independence to test whether two categorical variables are independent of each other.

Inspired by [Jonathan Stray's Risk Ratios notebook](https://github.com/jstray/risk-ratios/blob/main/risk-ratios-workbook.ipynb)

Here's a [paper](https://www.nejm.org/doi/full/10.1056/nejmoa2035389) that reports on the phase 3 clinical trial of the Moderna vaccine. See if you can:
1. Read the abstract and fill out the contingency table for the vaccine and placebo groups
2. Perform a chi-squared test of independence to test whether getting COVID is independent of which group (vaccine or placebo) a participant was in

## Setup

You can ignore this part. It loads the R and plotting extensions and the Python libraries the notebook relies on.

In [None]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [None]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

## Get Data

In [None]:
%%R 

require('tidyverse')

a = # number of people who DID get the vaccine and DID get COVID
b = # number of people who DID get the vaccine and DID NOT get COVID
c = # number of people who DID NOT get the vaccine and DID get COVID
d = # number of people who DID NOT get the vaccine and DID NOT get COVID

In [None]:
%%R 

# Simulate one row per trial participant (30,420 rows in total).
# Half of the rows have group == "vaccine" and half have group == "placebo".
# In the "vaccine" group, 11 rows have "covid" and 15210 - 11 have "no_covid".
# In the "placebo" group, 185 rows have "covid" and 15210 - 185 have "no_covid".

set.seed(1)
n = 30420
vaccine = rep("vaccine", n/2)
placebo = rep("placebo", n/2)
group = c(vaccine, placebo)
covid = c(rep("covid", a), rep("no_covid", b), rep("covid", c), rep("no_covid", d))
simulated_data <- data.frame(group, covid) %>% 
    sample_frac() # shuffle around the data randomly

simulated_data %>% head()

## `table` in R 

In [None]:
%%R 

cross_table <- table(simulated_data$group, simulated_data$covid)
cross_table
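
To make it concrete what `table` is doing, here is a minimal Python sketch of the same cross-tabulation using only the standard library. It assumes the counts quoted in the comments of the data-generation cell (11 and 185 cases, 15,210 people per group); the names `rows` and `counts` are just illustrative.

```python
from collections import Counter

# Counts taken from the comments in the data-generation cell above.
n_per_group = 15210
cases = {"vaccine": 11, "placebo": 185}

# Build one (group, outcome) pair per participant, like the rows of simulated_data.
rows = []
for group, n_cases in cases.items():
    rows += [(group, "covid")] * n_cases
    rows += [(group, "no_covid")] * (n_per_group - n_cases)

# Cross-tabulating is just counting how often each (group, outcome) pair occurs.
counts = Counter(rows)

# Print one line per group: group name, covid count, no_covid count.
for group in ("placebo", "vaccine"):
    print(group, counts[(group, "covid")], counts[(group, "no_covid")])
```

Each printed line corresponds to one row of the 2x2 contingency table.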

## Chi-Squared Test

In [None]:
%%R 

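# Run a chi-squared test of independence on the 2x2 table.
# correct = FALSE turns off Yates' continuity correction,
# which chisq.test applies to 2x2 tables by default.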
chisq.test(cross_table, correct = FALSE)
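
For intuition about what the test computes, the statistic can also be worked out by hand. The sketch below is a plain-Python illustration, not a replacement for `chisq.test`: it assumes the counts quoted in the comments of the data-generation cell, applies no continuity correction (matching `correct = FALSE`), and relies on the fact that with 1 degree of freedom the upper-tail probability of a chi-squared value `x` equals `erfc(sqrt(x / 2))`.

```python
import math

# Observed counts, taken from the comments in the data-generation cell.
a, b = 11, 15210 - 11      # vaccine: covid, no_covid
c, d = 185, 15210 - 185    # placebo: covid, no_covid
n = a + b + c + d

observed = [[a, b], [c, d]]
row_totals = [a + b, c + d]
col_totals = [a + c, b + d]

# Under independence, expected count = row total * column total / n.
# The statistic sums (observed - expected)^2 / expected over all four cells.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"X-squared = {chi2:.2f}, df = 1, p-value = {p_value:.3g}")
```

A large statistic means the observed counts sit far from what independence between group and outcome would predict.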

## Relative Risk

In [None]:
%%R 
# what is the risk of getting COVID if you got the vaccine?


In [None]:
%%R 
# what is the risk of getting COVID if you did not get the vaccine?


In [None]:
%%R 

# let's talk about the risk ratio
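
As a reference for the three cells above, here is a small Python sketch of the arithmetic. The counts are hypothetical and deliberately not the trial's numbers, so the exercise is still yours to do; the function name `risk` is also just illustrative.

```python
def risk(cases, total):
    """Risk = share of a group that experienced the outcome."""
    return cases / total

# Hypothetical counts for illustration only (NOT the trial data).
treated_cases, treated_total = 20, 1000
control_cases, control_total = 50, 1000

risk_treated = risk(treated_cases, treated_total)
risk_control = risk(control_cases, control_total)

# Risk ratio (relative risk): risk in the treated group divided by
# risk in the control group. A value below 1 means lower risk when treated.
risk_ratio = risk_treated / risk_control

# Relative risk reduction, often reported as a percentage.
relative_risk_reduction = 1 - risk_ratio

print(f"risk (treated) = {risk_treated:.3f}")
print(f"risk (control) = {risk_control:.3f}")
print(f"risk ratio = {risk_ratio:.2f}")
print(f"relative risk reduction = {relative_risk_reduction:.0%}")
```

Swapping in `a`, `b`, `c`, and `d` from the contingency table gives the corresponding quantities for the trial.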

## Discussion

Let's talk as a class about:
1. communicating relative risk
2. experimental vs. observational data, and how that difference changes the interpretation of risk ratios