Analyze the data on the variables Age and Financial Status from
https://lib.stat.cmu.edu/DASL/Datafiles/montanadat.html

## Notes on the dataset
- The `AGE` variable takes three values:
   - 1 for under 35,
   - 2 for 35-54, and
   - 3 for 55 and over.
- The `FIN` variable describes financial status compared with one year ago
  and takes three values:
   - 1 for "worse",
   - 2 for "same", and
   - 3 for "better".
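
For readable output, the numeric codes listed above can be mapped to their labels. This is a minimal sketch: the dictionary names and the three-row sample frame are illustrative and not part of the original analysis.

```python
import pandas as pd

# Code-to-label mappings, taken from the notes above
AGE_LABELS = {1: "under 35", 2: "35-54", 3: "55 and over"}
FIN_LABELS = {1: "worse", 2: "same", 3: "better"}

# Illustrative sample rows (not the full dataset)
codes = pd.DataFrame({"AGE": [3, 2, 1], "FIN": [2, 3, 2]})

# Replace each numeric code with its label
labelled = codes.assign(
    AGE=codes["AGE"].map(AGE_LABELS),
    FIN=codes["FIN"].map(FIN_LABELS),
)
print(labelled)
```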

In [1]:
from collections import namedtuple

import numpy as np
import pandas as pd
import scipy.stats

In [2]:
# Read the data into a pandas data frame
montana_df = pd.read_csv('../data/montana_clean.dat', sep='\t')

# Print the data frame, as a sanity check
# (only the columns of interest)
montana_df[['AGE', 'FIN']]

|     | AGE | FIN |
|-----|-----|-----|
| 0   | 3   | 2   |
| 1   | 2   | 3   |
| 2   | 1   | 2   |
| 3   | 3   | 1   |
| 4   | 3   | 2   |
| ... | ... | ... |
| 204 | 1   | *   |
| 205 | 1   | 3   |
| 206 | 3   | 2   |
| 207 | 3   | 1   |


In [3]:
# Turn the raw data into a contingency table
contingency = pd.crosstab(
    index=montana_df['AGE'],
    columns=montana_df['FIN']
)

# Drop the row and column for missing responses (coded as '*')
contingency = contingency.drop(index='*', columns='*')

# Transform the data into a numpy array
X = contingency.to_numpy()

# Sanity check: show the contingency table
contingency

| AGE \ FIN | 1  | 2  | 3  |
|-----------|----|----|----|
| 1         | 21 | 16 | 34 |
| 2         | 17 | 23 | 26 |
| 3         | 22 | 37 | 11 |
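
An alternative to dropping the `'*'` row and column after cross-tabulating is to mark `'*'` as missing when the file is read. The sketch below uses a small made-up sample in place of `montana_clean.dat`; its contents are an assumption, chosen only to show the mechanics.

```python
import io

import pandas as pd

# Made-up tab-separated sample mimicking the AGE and FIN columns;
# '*' stands for a missing response.
sample = "AGE\tFIN\n3\t2\n2\t3\n1\t*\n*\t1\n1\t3\n"

# Treat '*' as NaN while parsing, then keep only complete rows
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", na_values="*")
complete = sample_df.dropna(subset=["AGE", "FIN"]).astype(int)

# Cross-tabulate the complete rows; no '*' row or column appears
sample_table = pd.crosstab(index=complete["AGE"], columns=complete["FIN"])
print(sample_table)
```

With this approach both columns are numeric from the start, so the contingency table needs no further clean-up before being converted to a numpy array.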


In [4]:
TestResult = namedtuple('TestResult', [
    'name', 'statistic', 'p_value'
])

def report_test_result(test_result):
    """
    Print the p-value of a test for
    the independence of two categorical random variables.
    """
    
    print(
        f"{test_result.name} test has a p-value of {test_result.p_value:.3}"
    )

def lrt_indep(X):
    """
    Perform the likelihood ratio test for
    the independence of two categorical random variables,
    given their contingency table X.
    
    Return a TestResult instance.
    """
    
    # D_{ij} = X_{i.} X_{.j}
    D = np.tensordot(
        X.sum(axis=1),
        X.sum(axis=0),
        axes=0
    )
    
    # Test statistic
    T = 2*np.sum(X*np.log(X.sum()*X/D))
    
    # Degrees of freedom
    (I, J) = X.shape
    df = (I-1)*(J-1)
    
    # p-value
    pval = scipy.stats.chi2.sf(T, df)
    
    return TestResult('The likelihood ratio', T, pval)

def pearson_indep(X):
    """
    Perform Pearson's chi-squared test
    for the independence of two categorical random variables,
    given their contingency table X.
    
    Return a TestResult instance.
    """
    
    # E_{ij} = X_{i.} X_{.j} / X_{..}
    E = np.tensordot(
        X.sum(axis=1),
        X.sum(axis=0),
        axes=0
    )/X.sum()
    
    # Test statistic
    U = np.sum((X-E)**2/E)
    
    # Degrees of freedom
    (I, J) = X.shape
    df = (I-1)*(J-1)
    
    # p-value
    pval = scipy.stats.chi2.sf(U, df)
    
    return TestResult("Pearson's chi-squared", U, pval)

def analyze_independence(X):
    """
    Run both independence tests on the contingency table X
    and print their p-values.
    """
    
    report_test_result(lrt_indep(X))
    report_test_result(pearson_indep(X))

In [8]:
analyze_independence(X)

The likelihood ratio test has a p-value of 0.000195
Pearson's chi-squared test has a p-value of 0.000367
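
As a cross-check on the hand-written tests, the same two statistics, $T = 2\sum_{i,j} X_{ij}\log\frac{X_{ij}X_{\cdot\cdot}}{X_{i\cdot}X_{\cdot j}}$ and $U = \sum_{i,j}(X_{ij}-E_{ij})^2/E_{ij}$, can be computed with `scipy.stats.chi2_contingency`. The sketch below hard-codes the contingency table shown earlier; the variable names are illustrative. The p-values it prints should agree with those reported above up to rounding.

```python
import numpy as np
import scipy.stats

# Contingency table from the output above (rows: AGE, columns: FIN)
observed = np.array([
    [21, 16, 34],
    [17, 23, 26],
    [22, 37, 11],
])

# Pearson's chi-squared test (the default statistic)
pearson_stat, pearson_p, dof, expected = scipy.stats.chi2_contingency(
    observed, correction=False
)

# Likelihood ratio (G) test, selected via lambda_
lrt_stat, lrt_p, _, _ = scipy.stats.chi2_contingency(
    observed, correction=False, lambda_="log-likelihood"
)

print(f"Pearson: statistic={pearson_stat:.2f}, p-value={pearson_p:.3}")
print(f"LRT:     statistic={lrt_stat:.2f}, p-value={lrt_p:.3}")
```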


## Analysis of the results
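Both tests give p-values well below 0.001 (0.000195 for the likelihood ratio test
and 0.000367 for Pearson's chi-squared test), so the null hypothesis of independence
is rejected at any conventional significance level. There is strong evidence that
`AGE` and `FIN` are **not** independent: in the contingency table, respondents under 35
most often report "better" (34 of 71), while those 55 and over most often report
"same" (37 of 70) and rarely "better" (11 of 70).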
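
To see where the dependence comes from, one can look at the share of each `FIN` response within each age group. A minimal sketch, again hard-coding the contingency table from the output above (the names `observed` and `row_props` are illustrative):

```python
import numpy as np

# Contingency table from the output above (rows: AGE, columns: FIN)
observed = np.array([
    [21, 16, 34],
    [17, 23, 26],
    [22, 37, 11],
])

# Conditional distribution of FIN within each AGE group (each row sums to 1)
row_props = observed / observed.sum(axis=1, keepdims=True)
print(np.round(row_props, 2))
```

If the variables were independent, the three rows would be roughly equal; large differences between rows are what the tests detect.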