(chi2)=
# The Chi-Square Test

```{admonition} Important Readings
:class: seealso
- {cite}`freedman2007statistics`, Chapter 28
```

So far, we have used $z$-tests for proportions, where the box model has just two categories and can be cast as a 0-1 box.

The simple 0-1 days are gone. Now we live in a world where the box might contain many values. The NBA lottery chooses among fourteen teams, and a gambler might throw a six-sided die. At work, engineers randomize over who is supposed to conduct code reviews. A name is randomly selected, as if from a box, and that person is stuck with the onerous task of double-checking a colleague's code. The box below considers a three-colleague example.



```{figure} images/tikz/boxCodeReview.svg
:width: 38%
:name: boxCodeReview

```

Suppose that over 30 draws we expect the work to be split evenly, 10 reviews each, but Velma does 10 reviews, Marvin is tasked with 15, and Paul does only 5.

Paul is more senior and, only after seeing the data, we're suspicious that the game has been rigged to lower his workload. A $z$-test could be applied here if we consider Paul vs. Everyone. Let $p$ be the proportion of times Paul is selected. Our null is $p = \frac{1}{3}$ and the alternative is $p < \frac{1}{3}$. 

The resulting $z$-statistic is 

$$z = \dfrac{ \frac{5}{30} - \frac{1}{3} }{\sqrt{\frac{\frac{1}{3}\cdot\frac{2}{3}}{30}}} \approx -1.94.$$

This is beyond the critical value of -1.645 for a left-tailed test at the 5\% significance level. We're tempted to declare that Paul is cheating. But the next section should give us pause.
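The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming `scipy` is available; the variable names are chosen for illustration only.

```python
from math import sqrt

from scipy.stats import norm

n = 30            # total number of draws
paul_count = 5    # reviews assigned to Paul
p_null = 1 / 3    # chance of drawing Paul under the null

# Observed proportion and the standard error computed under the null
p_hat = paul_count / n
se_null = sqrt(p_null * (1 - p_null) / n)

# z-statistic and the left-tail area under the normal curve
z = (p_hat - p_null) / se_null
p_value_left = norm.cdf(z)

# Critical value for a left-tailed test at the 5% significance level
critical = norm.ppf(0.05)

print(f"z = {z:.2f}, critical value = {critical:.3f}, P-value = {p_value_left:.3f}")
```

Because `z` falls below `critical`, the one-sided $z$-test rejects the null at the 5% level.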

## The $\chi^2$ Test

The previous application of the $z$-test might have been too opportunistic. With enough colleagues, someone is probably going to get a bad break. We only became suspicious of Paul after seeing the data, and that should also raise red flags. To address the fairness of this kind of random name drawing, we should ask "are all names equally likely?" instead of asking whether a particular name comes up with a one-third chance.
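To see why picking the target after looking at the data is risky, one can simulate fair drawings. The sketch below is an illustration under stated assumptions, not part of the original analysis: it assumes `numpy`, a fixed seed, and 20,000 simulated rounds of 30 draws, and it compares how often Paul specifically ends up with 5 or fewer reviews against how often *anyone* does.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed, chosen arbitrarily
n_draws = 30                      # draws per simulated round
n_rounds = 20_000                 # number of simulated rounds (assumption)
fair_chances = [1 / 3, 1 / 3, 1 / 3]   # Velma, Marvin, Paul

# Each row holds the three review counts from one fair round of 30 draws
counts = rng.multinomial(n_draws, fair_chances, size=n_rounds)

# How often does Paul (column 2) get 5 or fewer reviews?
paul_rate = (counts[:, 2] <= 5).mean()

# How often does at least one colleague get 5 or fewer reviews?
any_rate = (counts <= 5).any(axis=1).mean()

print(f"Paul gets 5 or fewer: {paul_rate:.3f}")
print(f"Someone gets 5 or fewer: {any_rate:.3f}")
```

Even with a fair box, the chance that some colleague looks suspiciously underworked is roughly three times the chance for one name fixed in advance. That is the motivation for testing all names at once, with the hypotheses that follow.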

$$ H_0: \hspace{10pt} \text{all names are equally likely} $$

$$ H_A: \hspace{10pt}  \text{not all names are equally likely} $$

In this setup, we don't reduce this to a Paul vs. Everyone binary. Instead, we calculate a $\mathbf{\chi^2}$**-statistic**,

$$\chi^2 = \sum \frac{\text{(observed frequency - expected frequency)}^2}{\text{expected frequency}}.$$

The sum is taken over all possible names. Note that we don't convert anything to proportions; we work directly with the observed and expected counts. With our data,

$$\chi^2 = \frac{(10-10)^2 + (15-10)^2 + (5-10)^2}{10} = 5.$$

There are two degrees of freedom because the counts must sum to 30: once we know two of the values from 10, 15, and 5, we know the last. Under the null hypothesis, the $\chi^2$ statistic approximately follows a $\chi^2$ distribution with two degrees of freedom. More generally, the degrees of freedom equal the number of terms in the sum minus one.
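The same calculation can be written as a short sketch in Python, assuming `numpy` and equal expected counts under the null. The variable names are illustrative.

```python
import numpy as np

# Observed review counts for Velma, Marvin, and Paul
observed = np.array([10, 15, 5])

# Under the null, every name is equally likely, so each expected count
# is the total number of draws divided by the number of names
expected = np.full(observed.shape, observed.sum() / observed.size)

# Sum of (observed - expected)^2 / expected over all names
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of terms minus one
dof = observed.size - 1

print(f"chi-square statistic = {chi2_stat}, degrees of freedom = {dof}")
```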


### Using the $\chi^2$ Distribution

The $\chi^2$ distribution is characterized by a degrees of freedom parameter. 

```python
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Slider, CustomJS
from bokeh.layouts import column
from scipy.stats import chi2
import numpy as np
from IPython.display import HTML, display

output_notebook(hide_banner=True)

# Sample data for chi-squared distribution with initial degrees of freedom
df_initial = 1  # Initial degrees of freedom
# Start just above 0: for df = 1 the density is unbounded as x approaches 0
x = np.linspace(0.1, 40, 777)
y_chi2 = chi2.pdf(x, df_initial)  # Initial chi-squared density

source = ColumnDataSource(data={'x': x, 'y_chi2': y_chi2})

# Create the figure
p = figure(width=400, height=400, title="Chi-squared Curve",
           toolbar_location=None)
p.line('x', 'y_chi2', source=source, color='blue', line_width=2)

# Slider for adjusting degrees of freedom
slider = Slider(start=1, end=20, value=df_initial, step=1, title="Degrees of Freedom")

# CustomJS callback: recompute the density in the browser when the slider moves
callback = CustomJS(args=dict(source=source, slider=slider), code="""
    const data = source.data;
    const x = data['x'];
    const df = slider.value;
    const y_chi2 = data['y_chi2'];
    for (let i = 0; i < x.length; i++) {
        y_chi2[i] = Math.pow(x[i], df/2 - 1) * Math.exp(-x[i]/2) / (Math.pow(2, df/2) * gamma(df/2));
    }
    source.change.emit();

    // Gamma function via the Lanczos approximation (g = 7, nine coefficients)
    function gamma(z) {
        const g = 7;
        const C = [
            0.99999999999980993, 676.5203681218851, -1259.1392167224028,
            771.32342877765313, -176.61502916214059, 12.507343278686905,
            -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
        ];

        // Reflection formula for arguments below 0.5
        if (z < 0.5) return Math.PI / (Math.sin(Math.PI * z) * gamma(1 - z));
        z -= 1;

        let x = C[0];
        for (let i = 1; i < g + 2; i++) {
            x += C[i] / (z + i);
        }

        const t = z + g + 0.5;
        return Math.sqrt(2 * Math.PI) * Math.pow(t, z + 0.5) * Math.exp(-t) * x;
    }
""")

slider.js_on_change('value', callback)

# Layout and show
layout = column(slider, p)
show(layout)

# Define the CSS style for center alignment
style = """
<style>
.output {
    display: flex;
    align-items: center;
    justify-content: center;
}
</style>
"""

# Apply the style
display(HTML(style))
```


Now, we can find the P-value from the appropriate $\chi^2$ curve. This is the area to the right of the statistic. Recall that if the statistic is zero, that means the observed counts perfectly matched the expected counts. Therefore, only a large test statistic indicates data that deviates from what you would expect according to the null hypothesis. 



```{figure} images/chi2Pval.svg
:width: 70%
:name: chi2Pval

```

In Google Sheets, the P-value can be found with `=CHISQ.DIST.RT(5,2)`. Assuming a null of equal chances, data like ours or more extreme would arise about 8\% of the time. This doesn't fall below the typical 5\% threshold. The $\chi^2$-test turns out to be more charitable toward Paul.
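For readers working in Python rather than a spreadsheet, the same right-tail area can be sketched with `scipy.stats`. This assumes `scipy` is installed. `chi2.sf` gives the area to the right of a value, and `chisquare` runs the whole test, taking all categories to be equally likely when no expected counts are supplied.

```python
from scipy.stats import chi2, chisquare

# Area to the right of the statistic 5 under the chi-square curve with 2 df
p_value = chi2.sf(5, 2)

# The full test in one call; expected counts default to an even split
result = chisquare([10, 15, 5])

print(f"P-value from chi2.sf: {p_value:.4f}")
print(f"chisquare: statistic = {result.statistic:.1f}, P-value = {result.pvalue:.4f}")
```

Both routes give a P-value of about 0.08, matching the spreadsheet formula.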


### Interactive 

Use [this Colab notebook](https://colab.research.google.com/drive/1ZZkQTf1ceaLvU1QDS7UGZymmtCl_A2gy?usp=sharing) to find the area to the right of a value for a given degrees of freedom. 