## Simpson's Paradox
[Link to lecture notes](https://github.com/bcaffo/MathematicsBiostatisticsBootCamp2/blob/master/lecture9.pdf)

* Simpson’s paradox describes the case where an observed association between two variables changes, and may even reverse, after conditioning on a third variable. A variable that is associated with both the explanatory variable and the response variable is called a confounding variable and may cause this phenomenon.
* We use stratification to adjust for confounding variables. In stratification, we split the observations into separate tables, where each table corresponds to one value of the variable that we believe is confounding.
* The Mantel-Haenszel estimator gives an estimate of the adjusted odds ratio.
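
The reversal described in the first bullet can be seen in a small sketch. The counts below are illustrative and are not taken from the lecture; the names `strata`, `success_rate`, `stratum_rates` and `pooled_rates` are assumptions made for this example only.

```python
# Illustrative counts (NOT from the lecture): (successes, total) per arm and stratum.
strata = {
    "stratum 1": {"treatment": (81, 87), "control": (234, 270)},
    "stratum 2": {"treatment": (192, 263), "control": (55, 80)},
}

def success_rate(successes, total):
    """Observed proportion of successes."""
    return successes / total

# Success rate of each arm within each stratum
stratum_rates = {
    name: {arm: success_rate(*counts) for arm, counts in arms.items()}
    for name, arms in strata.items()
}

# Success rate of each arm after pooling the strata (ignoring the third variable)
pooled_rates = {}
for arm in ("treatment", "control"):
    successes = sum(strata[name][arm][0] for name in strata)
    total = sum(strata[name][arm][1] for name in strata)
    pooled_rates[arm] = success_rate(successes, total)

for name, rates in stratum_rates.items():
    print('{}: treatment {:.3f}, control {:.3f}'.format(
        name, rates["treatment"], rates["control"]))
print('pooled: treatment {:.3f}, control {:.3f}'.format(
    pooled_rates["treatment"], pooled_rates["control"]))
```

With these counts the treatment arm has the higher success rate inside each stratum, yet the lower success rate once the strata are pooled, because the two arms are distributed very differently across the strata.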

**Example from the lecture:** Success and failure are measured for a treatment and a placebo. The observations were made in different centers and the center is believed to be a confounding variable.

In [1]:
import numpy as np
from scipy.stats import chi2

**Mantel-Haenszel odds ratio estimator**

In [2]:
def odds_ratio(arr):
    return arr[0, 0] * arr[1, 1] / (arr[0, 1] * arr[1, 0])

In [3]:
# Each 2x2 matrix is the contingency table of one center
arr = np.array([
    [[11, 25],
     [10, 27]],
    
    [[16, 4],
     [22, 10]], 
    
    [[14, 5],
     [7, 12]],
    
    [[2, 14],
     [1, 16]],
    
    [[6, 11],
     [0, 12]],
    
    [[1, 10],
     [0, 10]],
    
    [[1, 4],
     [1, 8]],
    
    [[4, 2],
     [6, 1]]
])

print('Observations per center:')
print(arr)
print('\n')

print('Odds ratio (OR) per center:')
for i in range(arr.shape[0]):
    print('Center {}: {:.3f}'.format(i+1, odds_ratio(arr[i])))

Observations per center:
[[[11 25]
  [10 27]]

 [[16  4]
  [22 10]]

 [[14  5]
  [ 7 12]]

 [[ 2 14]
  [ 1 16]]

 [[ 6 11]
  [ 0 12]]

 [[ 1 10]
  [ 0 10]]

 [[ 1  4]
  [ 1  8]]

 [[ 4  2]
  [ 6  1]]]


Odds ratio (OR) per center:
Center 1: 1.188
Center 2: 1.818
Center 3: 4.800
Center 4: 2.286
Center 5: inf
Center 6: inf
Center 7: 2.000
Center 8: 0.333
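
Centers 5 and 6 report `inf` because one of the cells in the denominator of the odds ratio is zero. One common workaround, which the lecture example does not use, is to add 0.5 to every cell of a table that contains a zero (often called the Haldane-Anscombe correction). The sketch below is a hedged illustration; the function name `odds_ratio_corrected` and its `correction` parameter are assumptions.

```python
import numpy as np

def odds_ratio_corrected(table, correction=0.5):
    """Odds ratio of a 2x2 table.

    If any cell is zero, `correction` is added to every cell first
    (an assumption of this sketch, not part of the lecture example).
    """
    table = np.asarray(table, dtype=float)
    if (table == 0).any():
        table = table + correction
    return table[0, 0] * table[1, 1] / (table[0, 1] * table[1, 0])

# Center 5 from the example above, which has a zero cell
center_5 = [[6, 11], [0, 12]]
corrected = odds_ratio_corrected(center_5)
print('Center 5 (corrected): {:.3f}'.format(corrected))

# A table without zero cells is left unchanged
center_1 = [[11, 25], [10, 27]]
print('Center 1: {:.3f}'.format(odds_ratio_corrected(center_1)))
```

The Mantel-Haenszel estimator used later does not need this correction, since a zero cell only contributes a zero term to one of its sums.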




In [4]:
# Total observations
arr1 = arr.sum(axis=0)
print('Total observations:')
print(arr1)
print('\n')

print('Total odds ratio (OR): {:.3f}'.format(odds_ratio(arr1)))

Total observations:
[[55 75]
 [47 96]]


Total odds ratio (OR): 1.498


**Mantel-Haenszel adjusted OR estimate**
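
For reference, the estimator computed in the next cell can be written as follows. The notation is an assumption made here: for stratum $i$, $a_i$, $b_i$, $c_i$, $d_i$ denote the cells `[0, 0]`, `[0, 1]`, `[1, 0]`, `[1, 1]` of its 2x2 table and $n_i = a_i + b_i + c_i + d_i$ is its total count.

```latex
\widehat{OR}_{MH} \;=\; \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}
```

Each stratum contributes to the numerator and the denominator in proportion to its own cell products, so a stratum with a zero cell does not make the pooled estimate infinite.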

In [5]:
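def MH_OR(arr):
    """Mantel-Haenszel pooled odds ratio for 2x2 tables stacked as shape (k, 2, 2)."""
    num = 0
    den = 0
    for i in range(arr.shape[0]):
        n = arr[i].sum()  # total number of observations in stratum i
        num += arr[i, 0, 0] * arr[i, 1, 1] / n
        den += arr[i, 1, 0] * arr[i, 0, 1] / n

    return num / den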

In [6]:
print('MH adjusted OR: {:.3f}'.format(MH_OR(arr)))

MH adjusted OR: 2.135


### CMH Test
The Cochran-Mantel-Haenszel (CMH) test tests the null hypothesis that the response is independent of the treatment given the confounding variable (that is, the odds ratio equals 1 in every stratum) against the alternative hypothesis that the response is not independent of the treatment given the confounding variable.
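
The statistic computed in the next cell can be written as follows. The notation is an assumption made here: for stratum $i$, $a_i$, $b_i$, $c_i$, $d_i$ denote the cells `[0, 0]`, `[0, 1]`, `[1, 0]`, `[1, 1]` of its 2x2 table and $n_i$ is its total count. Under the null hypothesis, with the margins held fixed, $a_i$ follows a hypergeometric distribution, which gives the mean and variance below. This is the version without a continuity correction, matching the code.

```latex
E[a_i] = \frac{(a_i + b_i)(a_i + c_i)}{n_i},
\qquad
\operatorname{Var}(a_i) = \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 \,(n_i - 1)}

\chi^2_{CMH} = \frac{\left[\sum_i \bigl(a_i - E[a_i]\bigr)\right]^2}{\sum_i \operatorname{Var}(a_i)}
```

Under the null hypothesis the statistic is approximately chi-squared distributed with 1 degree of freedom.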

In [7]:
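def CMH_test(arr):
    """CMH test (no continuity correction) for 2x2 tables stacked as shape (k, 2, 2)."""
    num = 0
    den = 0

    for i in range(arr.shape[0]):
        n = arr[i].sum()  # total number of observations in stratum i
        # Expected value of the top-left cell under the null hypothesis
        E = arr[i, 0, :].sum() * arr[i, :, 0].sum() / n
        # Hypergeometric variance: product of the four margins over n^2 * (n - 1)
        Var = np.prod([arr[i, :, x].sum() * arr[i, x, :].sum() for x in [0, 1]])
        Var = Var / (n**2 * (n - 1))
        num += arr[i, 0, 0] - E
        den += Var

    stat = num**2 / den

    # Upper tail of the chi-squared distribution with 1 degree of freedom
    p_val = chi2.sf(stat, df=1)

    return stat, p_val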
    

In [8]:
stat, p_val = CMH_test(arr)

print('CMH statistic: {:.3f}'.format(stat))
print('P-value: {:.3f}'.format(p_val))

CMH statistic: 6.384
P-value: 0.012


The test rejects the null hypothesis at the 5% significance level (p ≈ 0.012), which suggests that treatment success is not independent of the treatment given the center.
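
As a cross-check, the same statistic can be computed without an explicit loop. The sketch below is a minimal vectorized version under stated assumptions: the function name `cmh_vectorized` is hypothetical, no continuity correction is applied, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom at `x` equals `erfc(sqrt(x / 2))`, which avoids the SciPy dependency.

```python
import math
import numpy as np

# The eight 2x2 tables from the example, one per center
tables = np.array([
    [[11, 25], [10, 27]],
    [[16, 4], [22, 10]],
    [[14, 5], [7, 12]],
    [[2, 14], [1, 16]],
    [[6, 11], [0, 12]],
    [[1, 10], [0, 10]],
    [[1, 4], [1, 8]],
    [[4, 2], [6, 1]],
], dtype=float)

def cmh_vectorized(tables):
    """CMH statistic and p-value for an array of shape (k, 2, 2), no continuity correction."""
    tables = np.asarray(tables, dtype=float)
    n = tables.sum(axis=(1, 2))   # total count per stratum, shape (k,)
    row = tables.sum(axis=2)      # row totals per stratum, shape (k, 2)
    col = tables.sum(axis=1)      # column totals per stratum, shape (k, 2)

    # Hypergeometric mean and variance of the top-left cell under the null
    expected = row[:, 0] * col[:, 0] / n
    variance = row[:, 0] * row[:, 1] * col[:, 0] * col[:, 1] / (n**2 * (n - 1))

    stat = (tables[:, 0, 0] - expected).sum() ** 2 / variance.sum()
    # Upper tail of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat_vec, p_vec = cmh_vectorized(tables)
print('CMH statistic: {:.3f}'.format(stat_vec))
print('P-value: {:.3f}'.format(p_vec))
```

The values should agree with the loop-based `CMH_test` above up to floating-point rounding.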