# Session 3: homework

# Chi-squared test for independence: Exercise social

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import statsmodels.api as sm

## Loading and inspecting the data

Load the file 'social.tsv', and assign it to the variable `social`.
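
As a reminder of how tab-separated files are read, here is a minimal sketch. It uses a small in-memory stand-in instead of the real 'social.tsv'; the rows and the `education` labels are made up for illustration, and the real file may have more columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'social.tsv'; the values below are invented.
toy_tsv = io.StringIO(
    "subject_ID\teducation\tparents_class\n"
    "1\tgeneral\t2\n"
    "2\tvocational\t6\n"
    "3\ttechnical\t4\n"
)

# sep="\t" tells pandas that the fields are separated by tabs, not commas.
toy = pd.read_csv(toy_tsv, sep="\t")

print(toy.shape)
print(toy.dtypes)
```

With the real file you would pass the file path instead of the `StringIO` object.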

Before we proceed, we are going to condense the `parents_class` variable: instead of working with 7 class levels, we will continue with 3.
You are given the code for this operation:

In [None]:
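def relevel(x: int) -> str:
    """Collapse the 7 parental class levels into 3 broader groups."""
    # Levels below 3 become "upper", levels 3 to 5 "middle", levels 6 and 7 "working"
    if x < 3:
        return "upper"
    if x < 6:
        return "middle"
    if x < 8:
        return "working"
    # If we haven't returned by now things have gone wrong
    raise ValueError(f"Unknown class level {x}")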


social["parents_class"] = social.parents_class.apply(relevel)
social

##### Question: 
> Verify the data independence assumption of the chi-squared test. Are there repeated measurements in the data, i.e. more than one observation per subject? What would that imply for your test?

> Using whichever method you choose, verify that `subject_ID` is unique
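
One possible way to check uniqueness is sketched below on a toy data frame (the IDs are invented, and one of them is deliberately repeated). `Series.is_unique` and `Series.duplicated` are standard pandas; other approaches, such as comparing `nunique()` with the number of rows, work just as well.

```python
import pandas as pd

# Toy data with one deliberately repeated ID (hypothetical values).
toy = pd.DataFrame(
    {
        "subject_ID": [101, 102, 103, 103],
        "education": ["general", "technical", "vocational", "vocational"],
    }
)

# True only if every ID occurs exactly once.
ids_unique = toy["subject_ID"].is_unique

# duplicated() flags every occurrence after the first one.
n_repeated = int(toy["subject_ID"].duplicated().sum())

print(ids_unique, n_repeated)
```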

## Data exploration

Can we observe an association between the teenagers' educational track (variable `education`) and their parents' social class (variable `parents_class`)? Create one table with row percentages and one with column percentages. Round the percentages to two decimal places.
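
The sketch below shows how `pd.crosstab` can produce both kinds of table via its `normalize` argument. The toy data frame and its `education` labels are invented; only the column names follow the exercise.

```python
import pandas as pd

# Toy data (hypothetical values), using the exercise's column names.
toy = pd.DataFrame(
    {
        "parents_class": ["upper", "upper", "middle", "middle",
                          "middle", "working", "working", "working"],
        "education": ["general", "general", "general", "technical",
                      "technical", "technical", "vocational", "vocational"],
    }
)

# normalize="index" makes every row sum to 1; multiply by 100 for percentages.
row_pct = (
    pd.crosstab(toy["parents_class"], toy["education"], normalize="index") * 100
).round(2)

# normalize="columns" makes every column sum to 1 instead.
col_pct = (
    pd.crosstab(toy["parents_class"], toy["education"], normalize="columns") * 100
).round(2)

print(row_pct)
print(col_pct)
```

Row percentages answer "given this parental class, how are the tracks distributed?", while column percentages answer "given this track, how are the parental classes distributed?".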

##### Question:
> How do you intuitively interpret these tables?

## Data visualization

Create the appropriate categorical plot for education versus parents' class. You will need to import the correct library (we forgot to do it for you above). Visualize once "by parents' class" (on the x-axis) and once "by education".

## Chi-squared test for independence

Now perform a chi-squared test for independence for the teenagers' educational track and the parents' (condensed) social class.

Start by formulating your null hypothesis and alternative hypothesis.

Perform a chi-squared test for independence using `chi2_contingency` and interpret the results. Remember to build the `crosstab` from **raw counts**, not normalized values. You will also need to verify the key assumption of the test. Review the earlier notebooks if you need to.
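
A minimal sketch of the call on an invented 3x3 table of raw counts follows. The counts are made up and are not the homework data. The check on expected counts uses the commonly cited rule of thumb of at least 5 per cell, which is an assumption here; use whichever criterion the course notebooks give.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented raw counts; rows are parents' class, columns are educational track.
counts = pd.DataFrame(
    [[30, 15, 5], [25, 30, 20], [10, 25, 40]],
    index=["upper", "middle", "working"],
    columns=["general", "technical", "vocational"],
)

# The result unpacks into the statistic, the p-value, the degrees of freedom
# and the table of counts expected under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(counts.to_numpy())

# Rule-of-thumb check (assumption): every expected count is at least 5.
min_expected = expected.min()

print(chi2_stat, p_value, dof)
print(min_expected >= 5)
```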

##### Question:
> How do you interpret this result?

## Measures of effect size

Since the chi-squared value and the p-value depend on sample size, provide a measure of effect size too.

As we can only calculate odds ratios for pairs, and both our variables have three (and not two) levels, you only need to calculate Cramer's V (i.e. chi-squared value normalized for sample size). Recall the formula: 

> Cramer's V = $\sqrt{\frac{\chi^2}{n \cdot (\min(n_{\text{rows}}, n_{\text{cols}}) - 1)}}$

Here *n* is the sample size. To obtain the number of rows and columns, either read them off your table or apply `shape` to it (to the table, not the entire dataset!). Finally, to obtain your chi-squared value again, either copy it from your output or retrieve it from the result object returned by `chi2_contingency` with `.statistic`.

Alternatively, use a different scipy method to directly output Cramer's V.
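
Both routes can be sketched on an invented 3x3 table of counts (not the homework data). The direct route shown here is `scipy.stats.contingency.association` with `method="cramer"`, which is one SciPy function that reports Cramer's V; whether it is the one the course has in mind is an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Invented raw counts (must be integers for `association`).
counts = np.array([[30, 15, 5], [25, 30, 20], [10, 25, 40]])

# Manual route: plug the chi-squared statistic into the formula.
chi2_stat, p_value, dof, expected = chi2_contingency(counts)
n = counts.sum()                # sample size
n_rows, n_cols = counts.shape   # shape of the table, not of the dataset
v_manual = np.sqrt(chi2_stat / (n * (min(n_rows, n_cols) - 1)))

# Direct route: let SciPy compute Cramer's V from the same table.
v_scipy = association(counts, method="cramer")

print(v_manual, v_scipy)
```

For a 2x2 table the two routes can disagree slightly, because `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, whereas `association` does not.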

## Chi-squared test for independence (additional exercise): Home language

Repeat the analyses above for a potential association between teenagers' educational track and the language they speak at home (variable `language`).

Inspect and interpret the table (with row and then column percentages).

Visualize with an appropriate plot, but only bother with education on the x-axis.

Now perform a chi-squared test for independence and describe your interpretation. Make sure to validate the key assumptions.

##### Question:
> Can you think of a creative solution that would allow you to apply a chi-squared test for independence here? Hint: think of what we did to the original parents' class variable.

> #### HARD one more thing...

If you managed to find a solution, calculate the critical value of the $\chi^2$ *statistic* for $p \lt 0.05$ for that test. You will need to work out the correct number of degrees of freedom from the number of rows and columns, or find it somewhere (hint: check the `Chi2ContingencyResult` object returned by the test). To calculate the critical value you will need the *inverse survival function* of the $\chi^2$ distribution.

HINT: you should have a number bigger than 2 and smaller than 20. ;)
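
To illustrate the mechanics only, here is a sketch for a hypothetical test with 4 degrees of freedom. That number is an assumption chosen for the example; your own table will determine the real value.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 4  # hypothetical degrees of freedom, e.g. (3 - 1) * (3 - 1) for a 3x3 table

# isf(alpha, dof) returns the value x for which P(X > x) = alpha.
critical_value = chi2.isf(alpha, dof)

# Sanity check: the survival function at that value gives alpha back.
tail_probability = chi2.sf(critical_value, dof)

print(critical_value, tail_probability)
```

A test statistic larger than this critical value corresponds to a p-value below 0.05.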

```
Version History

Current: v1.0.1

07/10/24: 1.0.0: first draft, BN
08/10/24: 1.0.1: proofread, MK
```