Here are breast cancer data from Morrison et al. (1973)
on diagnostic center ($X_1$), nuclear grade ($X_2$), and
survival ($X_3$):

|            $X_2$ | malignant | malignant | benign |   benign |
| ----------------:| ---------:| ---------:| ------:| --------:|
|            $X_3$ |      died |  survived |   died | survived |
|    $X_1$: Boston |        35 |        59 |     47 |      112 |
| $X_1$: Glamorgan |        42 |        77 |     26 |       76 |

In [1]:
import numpy as np
import scipy.stats

In [3]:
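# Manually enter the data as a 2x2x2 array indexed by (X1, X2, X3):
#   X1: 0 = Boston, 1 = Glamorgan
#   X2: 0 = benign, 1 = malignant (note: reversed relative to the table)
#   X3: 0 = died,   1 = survived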
Y = np.array([
    [[47, 112], [35, 59]],
    [[26, 76], [42, 77]]
])

1. Treat this as a multinomial and
   find the maximum likelihood estimator.
2. If someone has a tumor classified as benign
   at the Glamorgan clinic, what is the estimated
   probability that they will die?
   Find the standard error for this estimate.
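
As a sketch of where the quantities in the next cell come from (the symbols $\hat p_{abc}$, $p_1$, $p_2$, and $g$ are notation introduced here, not taken from the exercise): with cell counts $Y_{abc}$ and $n = \sum_{a,b,c} Y_{abc}$, the multinomial MLE is the vector of observed proportions, and the requested probability is a ratio of two cells. Writing $p_1$ for the probability of (Glamorgan, benign, died) and $p_2$ for (Glamorgan, benign, survived),

$$
    \hat p_{abc} = \frac{Y_{abc}}{n},
    \qquad
    \theta = g(p_1, p_2) = \frac{p_1}{p_1 + p_2},
    \qquad
    \hat\theta = \frac{\hat p_1}{\hat p_1 + \hat p_2}.
$$

The delta method, with $\nabla g = (p_2, -p_1)/(p_1 + p_2)^2$ and the multinomial moments $\mathrm{Var}(\hat p_i) = p_i(1 - p_i)/n$ and $\mathrm{Cov}(\hat p_1, \hat p_2) = -p_1 p_2/n$, then gives

$$
    \mathrm{Var}(\hat\theta)
    \approx \frac{p_2^2\, p_1 (1 - p_1) + p_1^2\, p_2 (1 - p_2) + 2 p_1^2 p_2^2}{n (p_1 + p_2)^4}
    = \frac{p_1 p_2}{n (p_1 + p_2)^3},
$$

so the standard error can be estimated by plugging in $\hat p_1$ and $\hat p_2$ and taking the square root.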

In [4]:
# Number of samples
n = Y.sum()

# MLE for p: the observed cell proportions
p = Y/n

# MLE for theta = P(die | benign, Glamorgan)
theta = p[1,0,0]/(p[1,0,0] + p[1,0,1])

# Delta-method estimate of the standard error of the MLE for theta
se = np.sqrt(
    (p[1,0,0]*p[1,0,1])
    /(n*(p[1,0,0] + p[1,0,1])**3)
)

# Asymptotic 95 percent confidence interval
alpha = 0.05
z = scipy.stats.norm.isf(alpha/2)
lower_bound = theta - z*se
upper_bound = theta + z*se

# Report the results
print(
    "Estimating theta\n"
    "---------------------------------\n"
    f"Max. likelihood est.: {theta:.2}\n"
    f"Std. error estimate:  {se:.2}\n"
    f"Confidence interval: ({lower_bound:.2}, {upper_bound:.2})"
)

Estimating theta
---------------------------------
Max. likelihood est.: 0.25
Std. error estimate:  0.043
Confidence interval: (0.17, 0.34)
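
As a quick cross-check (a sketch, not part of the original solution), the standard error formula used above should coincide with the ordinary binomial standard error computed within the Glamorgan/benign stratum alone. The names `died`, `survived`, `m`, `se_binomial`, and `se_delta` are introduced here purely for illustration.

```python
import numpy as np

# Counts for benign tumors at Glamorgan, copied from the data table
died, survived = 26, 76
m = died + survived   # stratum size
n = 474               # total sample size (sum of all eight cells)

# Conditional proportion and its binomial standard error
theta = died / m
se_binomial = np.sqrt(theta * (1 - theta) / m)

# Same quantity expressed through the joint cell proportions
p1, p2 = died / n, survived / n
se_delta = np.sqrt(p1 * p2 / (n * (p1 + p2) ** 3))

print(f"theta = {theta:.3f}")
print(f"binomial SE = {se_binomial:.4f}, delta-method SE = {se_delta:.4f}")
```

The two standard errors agree because $p_1 + p_2$ is estimated by $m/n$, so the factor $n (p_1 + p_2)$ in the denominator is just the stratum size.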


Test the null hypotheses
$$
    X_i \amalg X_j \,|\, X_k
$$
for $i$, $j$, and $k$ distinct using the test from Exercise 4 (of the same chapter).
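
Judging from the code in the cells that follow, the test in question is the likelihood ratio test. As a sketch for the case $X_1 \amalg X_2 \,|\, X_3$, with a dot denoting summation over the corresponding index, the null hypothesis constrains $p_{abc} = p_{a \cdot c}\, p_{\cdot b c} / p_{\cdot \cdot c}$, and the statistic is

$$
    T = 2 \sum_{a,b,c} Y_{abc}
        \log\left( \frac{Y_{abc}\, Y_{\cdot \cdot c}}{Y_{a \cdot c}\, Y_{\cdot b c}} \right).
$$

Under the null, $T$ is approximately $\chi^2$ distributed with $(2-1)(2-1) \cdot 2 = 2$ degrees of freedom for a $2 \times 2 \times 2$ table. The other two hypotheses follow by permuting the roles of the indices.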

In [4]:
# Y_{abc} Y_{..c}
numerator = np.einsum('abc, c -> abc', Y, Y.sum(axis=(0,1)))

# Y_{a.c} Y_{.bc}
denominator = np.einsum('ac, bc -> abc', Y.sum(axis=1), Y.sum(axis=0))

# Likelihood ratio test statistic for X1 indep of X2 given X3
T = 2*np.sum(Y * np.log(numerator/denominator))

# Corresponding p-value
pval = scipy.stats.chi2.sf(T, df=2)

# Report result
print(
    "Testing the conditional independence of X1 and X2 given X3.\n"
    f"Test statistic: {T:.3}\n"
    f"p-value: {pval:.3}"
)

Testing the conditional independence of X1 and X2 given X3.
Test statistic: 13.8
p-value: 0.00102


In [5]:
# Y_{abc} Y_{.b.}
numerator = np.einsum('abc, b -> abc', Y, Y.sum(axis=(0,2)))

# Y_{ab.} Y_{.bc}
denominator = np.einsum('ab, bc -> abc', Y.sum(axis=2), Y.sum(axis=0))

# Likelihood ratio test statistic for X1 indep of X3 given X2
T = 2*np.sum(Y * np.log(numerator/denominator))

# Corresponding p-value
pval = scipy.stats.chi2.sf(T, df=2)

# Report result
print(
    "Testing the conditional independence of X1 and X3 given X2.\n"
    f"Test statistic: {T:.3}\n"
    f"p-value: {pval:.3}"
)

Testing the conditional independence of X1 and X3 given X2.
Test statistic: 0.6
p-value: 0.741


In [6]:
# Y_{abc} Y_{a..}
numerator = np.einsum('abc, a -> abc', Y, Y.sum(axis=(1,2)))

# Y_{ab.} Y_{a.c}
denominator = np.einsum('ab, ac -> abc', Y.sum(axis=2), Y.sum(axis=1))

# Likelihood ratio test statistic for X2 indep of X3 given X1
T = 2*np.sum(Y * np.log(numerator/denominator))

# Corresponding p-value
pval = scipy.stats.chi2.sf(T, df=2)

# Report result
print(
    "Testing the conditional independence of X2 and X3 given X1.\n"
    f"Test statistic: {T:.3}\n"
    f"p-value: {pval:.3}"
)

Testing the conditional independence of X2 and X3 given X1.
Test statistic: 4.07
p-value: 0.131
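
The three cells above differ only in which variable is conditioned on, so they can be folded into a single helper. The following is a minimal sketch, assuming every cell count is positive (so that the logarithm is finite); the function name `cond_indep_lrt` and its arguments are hypothetical and not part of the original solution.

```python
import numpy as np
import scipy.stats

# Same data as above, indexed by (X1, X2, X3)
Y = np.array([
    [[47, 112], [35, 59]],
    [[26, 76], [42, 77]]
])

def cond_indep_lrt(Y, i, j, k):
    """Likelihood ratio test of X_i indep X_j given X_k (0-based axes)."""
    # Reorder the axes so the table is indexed as (i, j, k)
    Z = np.transpose(Y, (i, j, k))
    # Expected counts under the null: Z_{a.c} Z_{.bc} / Z_{..c}
    expected = (
        Z.sum(axis=1, keepdims=True) * Z.sum(axis=0, keepdims=True)
        / Z.sum(axis=(0, 1), keepdims=True)
    )
    T = 2 * np.sum(Z * np.log(Z / expected))
    # Degrees of freedom: (I - 1)(J - 1)K
    df = (Z.shape[0] - 1) * (Z.shape[1] - 1) * Z.shape[2]
    return T, scipy.stats.chi2.sf(T, df=df)

for i, j, k in [(0, 1, 2), (0, 2, 1), (1, 2, 0)]:
    T, pval = cond_indep_lrt(Y, i, j, k)
    print(f"X{i+1} indep X{j+1} given X{k+1}: T = {T:.3}, p-value = {pval:.3}")
```

Using `keepdims=True` lets broadcasting line up the marginal tables without spelling out an `einsum` signature for each permutation.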


In [11]:
# SANITY CHECK
# UNCONDITIONAL independence of X1 and X2?

# Y_{ab.} Y_{...}
numerator = Y.sum(axis=2)*Y.sum()

# Y_{a..} Y_{.b.}
denominator = np.einsum('a, b -> ab', Y.sum(axis=(1,2)), Y.sum(axis=(0,2)))

# Y_{ab.}
factor = Y.sum(axis=2)

# Likelihood ratio test statistic for X1 indep of X2
T = 2*np.sum(factor * np.log(numerator/denominator))

# Corresponding p-value
pval = scipy.stats.chi2.sf(T, df=1)

# Report result
print(
    "Testing the unconditional independence of X1 and X2.\n"
    f"Test statistic: {T:.3}\n"
    f"p-value: {pval:.3}"
)

Testing the unconditional independence of X1 and X2.
Test statistic: 13.3
p-value: 0.000261
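
As an independent check on this unconditional test, SciPy's `chi2_contingency` can compute the same log-likelihood-ratio (G) statistic when called with `lambda_="log-likelihood"` and `correction=False`. This is a sketch rather than part of the original solution, and the variable names `table`, `G`, `dof`, and `expected` are introduced here for illustration.

```python
import numpy as np
import scipy.stats

# Marginal 2x2 table of X1 (rows) by X2 (columns), summed over X3
table = np.array([[159, 94],
                  [102, 119]])

# lambda_="log-likelihood" selects the G statistic; correction=False
# disables the Yates continuity correction applied to 2x2 tables by default
G, pval, dof, expected = scipy.stats.chi2_contingency(
    table, correction=False, lambda_="log-likelihood"
)

print(f"G statistic: {G:.3}, dof: {dof}, p-value: {pval:.3}")
```

The statistic and p-value should agree with the hand-rolled computation above, on 1 degree of freedom.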


In [17]:
# Visual check on that last part:
# the 2-by-2 contingency table for X1 & X2
Y.sum(axis=2)

array([[159,  94],
       [102, 119]])

In [19]:
# [ctd.]
# The odds ratio is not exactly close to 1!
(159*119)/(102*94)

1.9734042553191489
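
To put a rough uncertainty on that odds ratio, one option is the usual large-sample interval on the log scale, with standard error $\sqrt{1/a + 1/b + 1/c + 1/d}$ for a 2-by-2 table with cells $a, b, c, d$. This is a sketch under that asymptotic approximation, not part of the original solution; the variable names are introduced here for illustration.

```python
import numpy as np
import scipy.stats

# Marginal 2x2 table of X1 by X2 (same counts as above)
a, b = 159, 94
c, d = 102, 119

# Sample odds ratio and its logarithm
odds_ratio = (a * d) / (b * c)
log_or = np.log(odds_ratio)

# Large-sample standard error of the log odds ratio
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)

# Asymptotic 95 percent interval, mapped back to the odds-ratio scale
z = scipy.stats.norm.isf(0.05 / 2)
lower = np.exp(log_or - z * se_log_or)
upper = np.exp(log_or + z * se_log_or)

print(f"Odds ratio: {odds_ratio:.3}, 95% CI: ({lower:.3}, {upper:.3})")
```

An interval that excludes 1 is consistent with rejecting unconditional independence of $X_1$ and $X_2$.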