# Chi-Squared Test of Independence

This notebook demonstrates how to use the chi-squared test of independence to check whether the data give evidence that two categorical variables are associated rather than independent.

Inspired by [Jonathan Stray's Risk Ratios notebook](https://github.com/jstray/risk-ratios/blob/main/risk-ratios-workbook.ipynb)

Here's a [paper](https://www.nejm.org/doi/full/10.1056/nejmoa2035389) that reports on the phase 3 clinical trial of the Moderna vaccine. See if you can:

1. Read the abstract and fill out the contingency table for the vaccine and placebo groups.
2. Perform a chi-squared test of independence to check whether getting COVID is independent of group (vaccine or placebo), which is how we judge here whether the vaccine is effective.

## Setup

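You can ignore this part. It loads the `rpy2` extension so that cells can run R code with `%%R`, and it configures plotting and warnings.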

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


## Get Data

In [3]:
%%R 

require('tidyverse')

a = 11          # vaccine group, DID get COVID
b = 15210 - 11  # vaccine group, DID NOT get COVID
c = 185         # placebo group (no vaccine), DID get COVID
d = 15210 - 185 # placebo group (no vaccine), DID NOT get COVID

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Loading required package: tidyverse


In [4]:
%%R 

# Build a data frame with one row per trial participant (30420 rows in total).
# Half of the rows have group == "vaccine" and half have group == "placebo".
# In the "vaccine" group, 11 rows are "covid" and 15210-11 are "no_covid".
# In the "placebo" group, 185 rows are "covid" and 15210-185 are "no_covid".

set.seed(1)
n = 30420
vaccine = rep("vaccine", n/2)
placebo = rep("placebo", n/2)
group = c(vaccine, placebo)
covid = c(rep("covid", a), rep("no_covid", b), rep("covid", c), rep("no_covid", d))
simulated_data <- data.frame(group, covid) %>% 
    sample_frac() # shuffle around the data randomly

simulated_data %>% head()

    group    covid
1 placebo no_covid
2 placebo no_covid
3 vaccine no_covid
4 placebo no_covid
5 vaccine no_covid
6 placebo no_covid


## `table` in R 

In [5]:
%%R 

cross_table <- table(simulated_data$group, simulated_data$covid)
cross_table

         
          covid no_covid
  placebo   185    15025
  vaccine    11    15199
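
For readers following along in Python rather than R, the same cross-tabulation can be sketched with the standard library. This is only an illustration, not part of the original notebook: the names `rows`, `counts`, and `cross_table` are made up here, and the counts are copied from the R cell above.

```python
from collections import Counter

# Counts copied from the notebook's R cell (a, b, c, d).
n_per_group = 15210
rows = (
    [("vaccine", "covid")] * 11
    + [("vaccine", "no_covid")] * (n_per_group - 11)
    + [("placebo", "covid")] * 185
    + [("placebo", "no_covid")] * (n_per_group - 185)
)

# Tally each (group, outcome) pair; row order does not affect the counts,
# so the shuffle done in R is not needed here.
counts = Counter(rows)

groups = ["placebo", "vaccine"]    # alphabetical, matching R's output
outcomes = ["covid", "no_covid"]
cross_table = [[counts[(g, o)] for o in outcomes] for g in groups]

for g, row in zip(groups, cross_table):
    print(g, row)
```

Each inner list is one row of the contingency table, in the same row and column order that R's `table` printed.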


## Chi-Squared Test

In [6]:
%%R 

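# Run Pearson's chi-squared test on the 2x2 table.
# correct = FALSE turns off the continuity correction that R applies to 2x2 tables by default.
chisq.test(cross_table, correct = FALSE)
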
	Pearson's Chi-squared test

data:  cross_table
X-squared = 155.47, df = 1, p-value < 2.2e-16
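
To see where the statistic comes from, here is a minimal hand computation in plain Python. It is a sketch under a few assumptions: the function name `chi_squared_2x2` and its `yates` flag are hypothetical helpers written for this illustration, the observed counts are copied from the table above, and the p-value shortcut with `math.erfc` is valid only for one degree of freedom, which is the case for a 2x2 table.

```python
import math

def chi_squared_2x2(table, yates=False):
    """Return (statistic, p_value) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were independent.
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                # Continuity correction: shrink each deviation by up to 0.5.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected

    # With df = 1, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: placebo, vaccine. Columns: covid, no_covid.
observed = [[185, 15025], [11, 15199]]

stat, p = chi_squared_2x2(observed)
stat_yates, p_yates = chi_squared_2x2(observed, yates=True)

print(round(stat, 2), p)
print(round(stat_yates, 2), p_yates)
```

The uncorrected statistic should agree with the `X-squared = 155.47` reported by R. The expected count of COVID cases in each group is 98 (196 cases split evenly across two equal-sized groups), so the observed 185 and 11 are both far from what independence would predict, which is why the p-value is so small.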

