# Extra cases

The simplest and most common application of the chi-squared test is to study the association between two categorical variables that each have just two levels. The test is then carried out on a 2x2 table of counts (with *df*=1). In this notebook, we will see that the chi-squared test is more widely applicable. We'll cover two more examples from the Spanish novels dataset, which contains metadata on the protagonist of each book:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
from statsmodels.graphics.mosaicplot import mosaic

In [None]:
novels = pd.read_csv("../../datasets/correlaciones/spanish-novels.tsv", sep="\t")[
    ["protagonist-gender", "protagonist-social-level"]
]
novels.columns = novels.columns.str.replace("-", "_")
novels

## Male vs female

Since so few protagonists (3) are not assigned a binary gender, we drop them to keep the analysis simple.

In [3]:
novels = novels[novels.protagonist_gender.isin(["male", "female"])]

How frequent are male and female protagonists?

In [None]:
novels.protagonist_gender.value_counts()

If we assume that the numbers of women and men are [roughly the same](https://en.wikipedia.org/wiki/Human_sex_ratio) in human societies, it would seem that women are underrepresented as protagonists in this dataset. Can we quantify our surprise at this distribution in more exact terms? How likely is it that a split this uneven arose by pure chance (as opposed to a systematic underrepresentation)? The chi-squared test can be used for this purpose too.

Here we will use the 'vanilla' chi-squared test for goodness of fit. We provide the observed counts for the single variable, together with a matching list of expected counts.

In [None]:
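# under the null hypothesis, half of the protagonists are expected to be female
counts = novels.protagonist_gender.value_counts()
# derive the expected count from the observed total, so both always sum to the same number
exp = counts.sum() * 0.5
sp.stats.chisquare(counts, [exp, exp])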

The low p-value indicates that a split this uneven would be very unlikely if male and female protagonists were equally likely, so there is almost no support for the null hypothesis that the distribution arose by pure chance.

Note that here:
- We are only dealing with a *single* categorical variable
- We can provide the expected frequencies ourselves
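
To see what `chisquare` computes, the statistic can be reproduced by hand. The sketch below uses made-up counts (70 male, 30 female; these are not the values from the novels dataset) and plain Python. For one degree of freedom the p-value has the closed form `erfc(sqrt(chi2 / 2))`, which is used here instead of a SciPy call.

```python
import math

# hypothetical counts, for illustration only (not taken from the dataset)
observed = {"male": 70, "female": 30}

total = sum(observed.values())
expected = total * 0.5  # equal split under the null hypothesis

# chi-squared statistic: sum over categories of (observed - expected)^2 / expected
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())

# two categories give df = 1; for df = 1 the survival function is erfc(sqrt(x / 2))
df = len(observed) - 1
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, df, p_value)
```

Passing the same observed and expected counts to `sp.stats.chisquare` should give the same statistic and p-value.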

## Gender vs social class

Let's ask an actual literary question! The table also includes data on the social class of protagonists:

In [None]:
novels.protagonist_social_level.value_counts()

Most of the protagonists belong to the middle class. Might there be a relationship with gender?

In [None]:
mosaic(novels, ["protagonist_gender", "protagonist_social_level"])
plt.show()
mosaic(novels, ["protagonist_social_level", "protagonist_gender"])
plt.show()

These plots suggest that women are underrepresented in the middle class: female protagonists are apparently pushed to the richer and poorer extremes. The opposite is true for male protagonists, who are overrepresented in the middle class and underrepresented in the high and low classes.

We can use the chi-squared test to check whether the relationship between these two variables is significant *as a whole*. Note that here we apply the test to a *2x3* table of counts without any extra effort. Apart from the degrees of freedom (*df* = 2), nothing much changes in the way the test statistic is calculated. The test can be applied to count tables of **arbitrary dimensions**.
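
As a rough illustration of why nothing much changes, the sketch below computes the expected frequencies, the degrees of freedom and the statistic for a 2x3 table in plain Python. The counts are invented for the example (they are not the novels data), and the p-value uses the closed form `exp(-chi2 / 2)`, which only holds for *df* = 2.

```python
import math

# hypothetical 2x3 table of counts (rows: female, male; columns: high, low, medium)
observed = [[10, 10, 20], [20, 10, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# same formula as in the 2x2 case, summed over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# for df = 2 the survival function of the chi-squared distribution is exp(-x / 2)
p_value = math.exp(-chi2 / 2)

print(expected, chi2, df, p_value)
```

`sp.stats.chi2_contingency` should report the same values for such a table, since its continuity correction is only applied when *df* = 1.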

In [None]:
obs = pd.crosstab(novels.protagonist_gender, novels.protagonist_social_level)
obs

In [9]:
res = sp.stats.chi2_contingency(obs)

We can also numerically compare the observed and expected frequencies. A `DataFrame` and a NumPy array (which is what `expected_freq` is) can be subtracted element-wise if they have the same shape! This is a little basic (R has a better kind of plot for this), but it does the job of showing whether each category is over or under its expected value.

In [None]:
obs - res.expected_freq
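
Raw differences are hard to compare across cells, because a gap of 2 matters more when only 8 cases are expected than when 30 are. One common way to put the cells on the same scale is the Pearson residual, `(observed - expected) / sqrt(expected)`. The sketch below is only an illustration with invented counts (not the novels data), written in plain Python so that each step is visible.

```python
import math

# hypothetical 2x3 table (rows: female, male; columns: high, low, medium)
observed = [[10, 10, 20], [20, 10, 30]]
# matching expected counts under independence (row total * column total / N)
expected = [[12.0, 8.0, 20.0], [18.0, 12.0, 30.0]]

# Pearson residual per cell: (observed - expected) / sqrt(expected)
residuals = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for label, row in zip(["female", "male"], residuals):
    print(label, [round(r, 2) for r in row])

# the squared residuals add up to the chi-squared statistic
chi2 = sum(r ** 2 for row in residuals for r in row)
print(chi2)
```

Negative residuals mark cells with fewer cases than expected, positive ones cells with more. With the notebook's own objects, the equivalent expression would be `(obs - res.expected_freq) / res.expected_freq ** 0.5`.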

In [None]:
# are all() expected_freq entries > 5?
# decompose this cell if you don't know how it works...

(res.expected_freq > 5).all()

Finally, we can check the results of the test. Before doing so, we verify the assumptions:
- More than 60 data points
- Expected frequency counts all above 5 (barely)
- No repeated measurements? (you didn't check, but we did)

In [None]:
res

The resulting $p$-value points towards (borderline) significance. (There are probably too few women in the dataset to make very strong claims.) The number of degrees of freedom is also higher here than in the 2x2 case: for the same value of the test statistic, this yields a larger $p$-value, which makes it harder to reach significance, particularly with a small sample (and quite an unbalanced one, given the few female protagonists overall).
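
The effect of the degrees of freedom can be made concrete with a small sketch. It takes one and the same made-up statistic and converts it to a p-value under *df* = 1 and *df* = 2, using closed forms of the chi-squared survival function that hold for these two cases only (`erfc(sqrt(x / 2))` and `exp(-x / 2)`). The value 3.84 is an illustrative choice, not a result from the dataset.

```python
import math

# made-up test statistic, chosen close to the 5% critical value for df = 1
chi2 = 3.84

# survival function of the chi-squared distribution, closed forms for df = 1 and df = 2
p_df1 = math.erfc(math.sqrt(chi2 / 2))
p_df2 = math.exp(-chi2 / 2)

print(f"df=1: p = {p_df1:.3f}")
print(f"df=2: p = {p_df2:.3f}")
```

The same statistic sits right at the conventional 0.05 threshold with one degree of freedom, but gives a p-value of roughly 0.15 with two.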

```
Version History

Current: v1.0.1

2/10/24: 1.0.0: first draft, BN
08/10/24: 1.0.1: proofread, MK
```