# Week 5: Contingency Tables

## **Notebook Overview**

This notebook is available on GitHub
[here](https://github.com/Yushi-Y/AAS-ongoing-tutorials). If you find errors or would like to suggest an improvement, please let me know.

This week is about using contingency tables and a $\chi^{2}$-test to make claims. You will need to use the usual libraries as well as scipy and statsmodels.

The next notebook covers logistic regression. I recommend setting aside some extra time for it, as it is a much bigger topic and harder to understand!

### **Additional Resources**

1. **YouTube Videos**
* [**Contingency Tables:**](https://www.youtube.com/watch?v=W95BgQCp_rQ) Runs through a similar (maybe the same...) example from class but at a slower pace and going into more detail. Okay resource.
* [**Hypothesis Testing Example: [Best Resource]**](https://www.youtube.com/watch?v=hpWdDmgsIRE) A good Khan Academy video walking through a hypothesis test for independence using contingency tables and chi-squared. This is better than the first video. It's at a pace where you can work along with it.
* [**Hypothesis Testing with Two Means:**](https://www.youtube.com/watch?v=UcZwyzwWU7o) Relevant more generally.
* [**Hypothesis Testing for Proportions:**](https://www.youtube.com/watch?v=76VruarGn2Q) Again, relevant more generally.
* [**Ben Lambert on Degrees of Freedom:**](https://www.youtube.com/watch?v=-4aiKmPC994&pp=ygUeYmVuIGxhbWJlcnQgZGVncmVlcyBvZiBmcmVlZG9t) Really useful for understanding this, but not necessary if you just want to know how to use the models. Part 2 can be found [here](https://www.youtube.com/watch?v=iA2KZHHZmmg). I would strongly recommend these videos to anyone who has been wondering where the DOF value comes from.

2. **Documentation**
* [**SciPy Contingency Tables**:](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) Use SciPy to do a few of the questions in this notebook.

3. **A Level Textbooks**
* Contingency tables are taught at A Level, so have a look at the textbooks. They explain the topic at an introductory level, which might be useful if the other resources are too advanced.

4. **Textbooks**
* It is covered in the Fox textbook (briefly).

As usual we will start by importing some useful libraries.

In [1]:
%config InlineBackend.figure_format = 'svg'
import statsmodels.api as sm
from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats
import numpy as np

Today we will look at a dataset from a double-blind clinical trial of a new
treatment for rheumatoid arthritis. We will test whether treatment is associated
with a change in symptoms using a $\chi^{2}$-test.

First, we need to load the data, which `statsmodels` downloads from the Rdatasets collection (so you will need an internet connection).

In [3]:
# Access the data
ra = sm.datasets.get_rdataset("Arthritis", "vcd").data

# View it
ra.head()

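|   | ID | Treatment | Sex | Age | Improved |
|---|----|-----------|------|-----|----------|
| 0 | 57 | Treated | Male | 27 | Some |
| 1 | 46 | Treated | Male | 29 | |
| 2 | 77 | Treated | Male | 30 | |
| 3 | 17 | Treated | Male | 32 | Marked |
| 4 | 36 | Treated | Male | 46 | Marked |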


### Question 1

Use `pandas` to generate a cross tabulation of the treatment status and
improvement.

[hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)
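
As a minimal sketch of how `pd.crosstab` behaves, the example below uses a small made-up data frame. The `toy` frame and its `group` and `outcome` columns are hypothetical and are not the arthritis data. It simply counts how often each pair of categories occurs.

```python
import pandas as pd

# Hypothetical toy data, NOT the arthritis dataset
toy = pd.DataFrame({
    "group":   ["A", "A", "A", "B", "B", "B", "B"],
    "outcome": ["yes", "no", "yes", "no", "no", "yes", "no"],
})

# Row categories come from the first argument, column categories from the second
toy_tbl = pd.crosstab(toy["group"], toy["outcome"])
print(toy_tbl)
```

The result is an ordinary `DataFrame` of counts, so it can be indexed with `.loc` or converted with `.to_numpy()`.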


### Answer

In [4]:
# answer...

### Question 2

Generate a mosaic plot to display this data.

[hint](https://www.statsmodels.org/dev/generated/statsmodels.graphics.mosaicplot.mosaic.html)
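
One way to call `mosaic` is with a dictionary whose keys are tuples of category labels and whose values are counts. The sketch below uses hypothetical counts (`toy_counts` is made up and is not the arthritis data) just to show the shape of the call.

```python
from statsmodels.graphics.mosaicplot import mosaic

# Hypothetical counts keyed by (first category, second category)
toy_counts = {
    ("A", "yes"): 2,
    ("A", "no"): 1,
    ("B", "yes"): 1,
    ("B", "no"): 3,
}

# mosaic returns the matplotlib figure and a dict describing each tile
fig, rects = mosaic(toy_counts, title="Toy mosaic plot")
```

Each tile's area is proportional to its count, which is what makes the plot a visual version of the contingency table.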

### Answer

In [5]:
# answer...

### Question 3

a) What errors does the default mosaic plot from `statsmodels` make?

Hint: these are not numerical errors but things that make it harder to interpret.

b) [EXTENSION] Once you have identified the errors, try to write some code to fix them. Depending on which errors you found, this may take a while, so come back to it if you have time.

### Answer


text...

In [6]:
# code...

### Question 4

For this trial, what was the null hypothesis?

### Answer


### Question 5

Is it valid to use a $\chi^{2}$-test for this data?

### Answer

### Question 6

How many degrees of freedom are there in this data?

Hint: Do a bit of research to understand degrees of freedom (DOF) more generally. When learning statistics, it can seem a little random where the DOF comes from in each model, but once you understand the theory behind it, it all makes sense. Knowing the theory also makes it much easier to remember the rules for each model. A good DOF explainer is linked in the resources section above.


### Answer


### Question 7

Perform a $\chi^{2}$-test on the contingency table; are treatment and changes in symptoms independent?

When you do a hypothesis test, make it nice and formal: define your hypotheses clearly, etc.
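
As a rough sketch of the mechanics only, the example below runs `scipy.stats.chi2_contingency` on a hypothetical 2x3 table. The numbers in `toy_observed` are made up and are not the arthritis data. The function returns the test statistic, the p-value, the degrees of freedom and the expected counts under independence.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts, NOT the arthritis data
toy_observed = np.array([
    [20, 15, 25],
    [30, 10, 20],
])

# Unpack the four results of the test
chi2_stat, p_value, dof, expected = chi2_contingency(toy_observed)

print(f"chi-squared = {chi2_stat:.3f}, p = {p_value:.3f}, dof = {dof}")
print("Expected counts under independence:")
print(expected.round(2))
```

Looking at `expected` is also a convenient way to check whether the cell counts are large enough for the test to be reasonable.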

### Answer

text...

In [7]:
# code

### Question 8

a) What can we conclude from this hypothesis test?

b) Why did we need to randomise the treatment?

Note that a proper treatment of causality goes well beyond the scope of this course, but recall that randomised controlled trials provide very high-quality evidence.

### Question 9 [Extension - Do this if time]


Recall from earlier notebooks the function `estimate_and_ci`, which computes the probability of success in repeated Bernoulli trials and the 95% confidence interval on this estimate.



In [None]:
def estimate_and_ci(num_trials, num_success):
    """Return the estimated probability of success and its 95% confidence interval."""
    p_hat = num_success / num_trials
    z = 1.96  # critical value for a 95% normal-approximation interval
    delta = z * np.sqrt(p_hat * (1 - p_hat) / num_trials)
    return (p_hat, (p_hat - delta, p_hat + delta))
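
For reference, the calculation in `estimate_and_ci` can be written out as follows, using $n$ for `num_trials` and $k$ for `num_success` (these symbols are introduced here for illustration only). This is the usual normal-approximation interval for a proportion, so it is only a rough guide when $n$ is small or $\hat{p}$ is close to 0 or 1.

$$
\hat{p} = \frac{k}{n}, \qquad \hat{p} \pm z \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}, \qquad z = 1.96
$$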

The functions `rand_small_table` and `rand_big_table` defined below return
random datasets of the same shape as our arthritis dataset under the null
hypothesis, i.e. when the outcome is independent of treatment.
`rand_small_table` returns data from a smaller cohort and `rand_big_table`
returns data from a larger cohort.

In [None]:
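# `outcome_tbl` is assumed to be the cross tabulation you built in Question 1
_, _, _, expected = scipy.stats.chi2_contingency(outcome_tbl.to_numpy())

def rand_small_table():
    """Sample a table with half the expected counts, redrawing until no cell is empty."""
    x = np.array(0)
    while x.min() < 1:
        x = scipy.stats.poisson.rvs(mu=0.5 * expected)
    return x

def rand_big_table():
    """Sample a table with 1.5 times the expected counts, redrawing until no cell is empty."""
    x = np.array(0)
    while x.min() < 1:
        x = scipy.stats.poisson.rvs(mu=1.5 * expected)
    return x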

Using the functions `estimate_and_ci`, and `rand_small_table` and
`rand_big_table`, demonstrate how the $\chi^{2}$-test will fail if the cell values are too small.

### Answer

text

In [8]:
# code...