# Confounders

A **confounder** is associated with a predictor variable (X) and also
with the response variable (Y); the confounder is not part of the
causal/association pathway between predictor(s) and response.

With confounding:

1.  the model may be missing an important predictor $\rightarrow$
    reduced power of inference and prediction
2.  some predictors may show high variance
3.  the model may be biased
4.  the model may be invalid
5.  detected associations/relationships may be spurious

------------------------------------------------------------------------

**Why all this emphasis on confounding in a course on longitudinal data?
Time is pervasive, and is often associated with most things. We observe
trends in time, one thing goes up, another goes down, but are they
really associated? Or is time confusing us?**

------------------------------------------------------------------------

## Simulation

We simulate some data:

-   the true causal effect (difference between exposed and not exposed
    records) is simulated to be `1`
-   a binary confounder `C`, present in 40% of the records, which adds 2
    to the outcome when present
-   a binary exposure `X` (exposed / non-exposed), whose probability
    depends on `C`
-   random errors `e`, sampled from a Gaussian distribution ($\mu = 0$,
    $\sigma^2 = 2$)
-   `Y0` and `Y1` are the potential outcomes of each record when not
    exposed and when exposed, respectively
-   `Y_obs` is the outcome actually observed

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Set seed for reproducibility
np.random.seed(127)

# Sample size
n = 1000

# Generate data
df = pd.DataFrame()

# C: Binary variable with P=0.4
df['C'] = np.random.binomial(1, 0.4, size=n)

# X: Binary variable with P = 0.3 + 0.4 * C
p_X = 0.3 + 0.4 * df['C']
df['X'] = np.random.binomial(1, p_X)

# e: Normal variable with mean 0 and variance 2
df['e'] = np.random.normal(loc=0, scale=np.sqrt(2), size=n)

# Y0: potential outcome without exposure = 2*C + e
df['Y0'] = 2 * df['C'] + df['e']

# Y1: potential outcome with exposure = 1 + 2*C + e
df['Y1'] = 1 + 2 * df['C'] + df['e']

# Y_obs: Observed outcome = Y0 + (Y1 - Y0) * X
df['Y_obs'] = df['Y0'] + (df['Y1'] - df['Y0']) * df['X']

# Now df contains all the generated data
print(df.head())

Check the magnitude of the true effect size:

In [None]:
real_diff = (df['Y1'].mean() - df['Y0'].mean())
print("Calculated mean effect size = ", round(real_diff,3))


Check the proportion of confounded records (expected value = 0.40):

In [None]:
prop_confounded = df['C'].sum()/len(df)
print("The proportion of the population with confounder C is", round(prop_confounded, 3))

Check the proportion of exposed records. The expected value is
0.30 + 0.40*P(C=1): 0.46 for the nominal P(C=1) = 0.40, or
0.3 + 0.4*0.38 = 0.452 if the proportion of `C` in the sample is 0.38:

In [None]:
prop_exposed = df['X'].sum()/len(df)
print("The proportion of the population exposed to treatment/effect X is", round(prop_exposed, 3))

Calculate the expected observed difference between exposed and
non-exposed records (in presence of confounding):

In [None]:
# Mean observed outcome among exposed records minus mean among non-exposed records
exp_nonexp = df.loc[df['X'] == 1, 'Y_obs'].mean() - df.loc[df['X'] == 0, 'Y_obs'].mean()
print("Calculated observed difference  = ", round(exp_nonexp, 3))

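The observed difference is not equal to the true effect size of the
exposure because of the confounder C. The expected observed difference
is the true effect size (1.0) plus the confounding bias, i.e. the effect
of C (2) times the difference in the prevalence of C between exposed and
non-exposed records: P(C=1|X=1) = 0.28/0.46 = 0.609 and
P(C=1|X=0) = 0.12/0.54 = 0.222, so the bias is 2\*(0.609 - 0.222) = 0.77.
The expected observed difference is therefore about 1.77 (compare it
with the difference calculated between exposed and non-exposed records).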
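The expected observed difference can also be derived directly from the simulation parameters with Bayes' rule. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative and not part of the simulation code above.

```python
# Simulation parameters (taken from the data-generating code above)
p_c = 0.4             # P(C = 1)
p_x_given_c1 = 0.7    # P(X = 1 | C = 1) = 0.3 + 0.4
p_x_given_c0 = 0.3    # P(X = 1 | C = 0)
effect_c = 2.0        # effect of the confounder on the outcome
true_effect = 1.0     # causal effect of the exposure

# Marginal probability of exposure
p_x = p_x_given_c1 * p_c + p_x_given_c0 * (1 - p_c)

# Prevalence of the confounder among exposed and non-exposed (Bayes' rule)
p_c_given_x1 = p_x_given_c1 * p_c / p_x
p_c_given_x0 = (1 - p_x_given_c1) * p_c / (1 - p_x)

# Confounding bias and expected naive (unadjusted) difference
bias = effect_c * (p_c_given_x1 - p_c_given_x0)
expected_diff = true_effect + bias

print("P(X=1) =", round(p_x, 3))
print("Confounding bias =", round(bias, 3))
print("Expected observed difference =", round(expected_diff, 3))
```

The bias grows with both the effect of `C` on the outcome and the imbalance of `C` between the exposure groups; if either is zero, the naive difference is unbiased.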

With simple linear regression we indeed obtain an estimated effect of
1.603:

In [None]:
from statsmodels.formula.api import ols

res = ols('Y_obs ~ X', data=df).fit()
print(res.summary())

Adding the confounder to the model adjusts for the bias and returns
estimates close to the simulated values for both the exposure effect
(1.0) and the confounder effect (2.0):

In [None]:
res = ols('Y_obs ~ X + C', data=df).fit()
print(res.summary())

**Question: if this is so easy (just adding a systematic effect to the
model), why do we worry so much about confounding?**

### Example with risk ratios

Retrospective cohort study: patients with high or low cholesterol,
monitored for 12 months and then assessed for all-cause mortality
(death). The amount of **exercise during the 12-month period of
follow-up** is a potential confounder: it is associated with both blood
cholesterol and death (e.g. more physical exercise $\rightarrow$ lower
cholesterol and lower death rate), and it is not in the causal path
between blood cholesterol and risk of death.

![confounding](https://drive.google.com/uc?export=view&id=1zSI8WAAx8zdVT_W6E2BX-EHWlPF2e3UY)

Figure from:
<https://rpubs.com/mbounthavong/confounding_interaction>

We have 250 deaths out of 2250 subjects with high blood cholesterol
(250/2250 = 11.1%), and 150 deaths out of 1650 subjects with low blood
cholesterol (150/1650 = 9.1%).

**Is there a higher risk of death with higher blood cholesterol
levels?**


In [None]:
# Rows: high / low cholesterol; columns: death / survival
data = np.array([[250, 2000], [150, 1500]])

# Create a DataFrame to label rows and columns
table1 = pd.DataFrame(data,
                      index=["high", "low"],
                      columns=["death", "survival"])

# Display the table
print(table1)

To compare the risk of death in the two exposure groups (high/low blood
cholesterol), we can estimate the risk ratio (RR) and odds ratio (OR):

- $$
  RR = \frac{\text{high-chol-deaths}/\text{all-high-chol}}{\text{low-chol-deaths}/\text{all-low-chol}}
  $$

- $$
  OR = \frac{\text{high-chol-deaths}/\text{high-chol-survs}}{\text{low-chol-deaths}/\text{low-chol-survs}}
  $$

(The 95% Confidence Intervals are also estimated)
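As a quick check of the two formulas, the counts given above (250 deaths and 2000 survivors with high cholesterol, 150 deaths and 1500 survivors with low cholesterol) can be plugged in by hand:

$$
RR = \frac{250/2250}{150/1650} = \frac{0.111}{0.091} \approx 1.22
\qquad
OR = \frac{250/2000}{150/1500} = \frac{0.125}{0.100} = 1.25
$$

With event rates around 10% the OR is somewhat larger than the RR; the two measures converge only when the outcome is rare.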


In [None]:
# 2x2 table with rows = [survival, death] and columns = [low, high] cholesterol
table = np.array([[1500, 2000], [150, 250]])

# Extract cell counts, indexing as table[outcome, exposure]
a = table[1, 1]  # Exposed + event (death | high)
b = table[0, 1]  # Exposed + no event (survival | high)
c = table[1, 0]  # Unexposed + event (death | low)
d = table[0, 0]  # Unexposed + no event (survival | low)

# Compute risks
risk_high = a / (a + b)
risk_low = c / (c + d)

# Compute risk ratio
rr = risk_high / risk_low

# Compute standard error and 95% CI (Wald method on log scale)
import math
se_log_rr = math.sqrt((1/a - 1/(a+b)) + (1/c - 1/(c+d)))
log_rr = math.log(rr)
ci_lower = math.exp(log_rr - 1.96 * se_log_rr)
ci_upper = math.exp(log_rr + 1.96 * se_log_rr)

# Display results
print(f"Risk (high cholesterol): {risk_high:.4f}")
print(f"Risk (low cholesterol): {risk_low:.4f}")
print(f"Risk Ratio: {rr:.4f}")
print(f"95% CI: ({ci_lower:.4f}, {ci_upper:.4f})")

In [None]:
from scipy.stats import norm

# Risk Ratio and standard error already computed
z_rr = log_rr / se_log_rr
p_rr = 2 * (1 - norm.cdf(abs(z_rr)))

print(f"RR p-value: {p_rr:.4g}")


In [None]:
# Rows = [survival, death]; columns = [low, high] cholesterol
table = np.array([[1500, 2000], [150, 250]])

# Assign cells:
#                    Event (death)    No event (survival)
# Exposed (high)     a = 250          b = 2000
# Unexposed (low)    c = 150          d = 1500

a = table[1, 1]
b = table[0, 1]
c = table[1, 0]
d = table[0, 0]

# Odds ratio
odds_ratio = (a / b) / (c / d)

# Standard error on log scale
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)

# 95% CI
log_or = math.log(odds_ratio)
ci_lower = math.exp(log_or - 1.96 * se_log_or)
ci_upper = math.exp(log_or + 1.96 * se_log_or)

# Display results
print(f"Odds Ratio: {odds_ratio:.4f}")
print(f"95% CI: ({ci_lower:.4f}, {ci_upper:.4f})")

In [None]:
from scipy.stats import chi2_contingency, fisher_exact

# chi-square test (similar to Wald for large samples)
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(f"Chi-square p-value: {p_chi2:.4g}")

# Fisher's exact test (exact p-value, useful for small samples)
oddsratio_fisher, p_fisher = fisher_exact(table)
print(f"Fisher's exact test p-value: {p_fisher:.4g}")

Subjects with *High Cholesterol* have a 22% higher risk of death
compared to Low Cholesterol subjects (RR = 0.111/0.091 = 1.22; 95% CI:
[1.01, 1.48]; p-value = 0.04).

Subjects with *High Cholesterol* have a 25% increase in the odds of
death compared to Low Cholesterol subjects (OR = 1.25; 95% CI: [1.01,
1.55]; p-value = 0.04).

From this analysis, the association between blood cholesterol and risk
of death is significant.



#### Confounding?

1.  Is there **association between the confounder (exercise) and the
    response (risk of death)**?

In [None]:
# Rows: exercise / no-exercise; columns: death / survival
data2 = np.array([[200, 2400], [200, 1100]])
table2 = pd.DataFrame(data2,
                      index=["exercise", "no-exercise"],
                      columns=["death", "survival"])

print(table2)

In [None]:
# Fisher's exact test (exact p-value, useful for small samples)
oddsratio_fisher, p_fisher = fisher_exact(table2)
print(f"Odds ratio: {oddsratio_fisher:.4g}")
print(f"Fisher's exact test p-value: {p_fisher:.4g}")

2.  Is there **association between the confounder (exercise) and the
    exposure (blood cholesterol)**?

In [None]:
data3 = np.array([[1750, 850], [500, 800]])
table3 = pd.DataFrame(data3,
                      index=["exercise", "no-exercise"],
                      columns=["high-chol", "low-chol"])

print(table3)

In [None]:
oddsratio_fisher, p_fisher = fisher_exact(table3)
print(f"Odds ratio: {oddsratio_fisher:.4g}")
print(f"Fisher's exact test p-value: {p_fisher:.4g}")

We see that `exercise` is associated with both the outcome and the
exposure and, since it is not in the causal pathway between blood
cholesterol and risk of death, we can rightfully consider it to be a
confounder!

We can estimate the size of the confounding effect using
**stratification analysis**: exercise vs no exercise.

##### Stratum 1: subjects who do exercise

In [None]:
### Deaths and survivals by blood cholesterol level
# Among subjects who exercise (N=2600)
data4 = np.array([[150, 1600], [50, 800]])
table4 = pd.DataFrame(data4,
                      index=["high-chol", "low-chol"],
                      columns=["death", "survival"])

print(table4)

In [None]:
oddsratio_fisher, p_fisher = fisher_exact(table4)
print(f"Odds ratio: {oddsratio_fisher:.4g}")
print(f"Fisher's exact test p-value: {p_fisher:.4g}")


##### Stratum 2: subjects who don't exercise

In [None]:
### Deaths and survivals by blood cholesterol level
# Among subjects who do not exercise (N=1300)
data5 = np.array([[100, 400], [100, 700]])
table5 = pd.DataFrame(data5,
                      index=["high-chol", "low-chol"],
                      columns=["death", "survival"])

print(table5)

In [None]:
oddsratio_fisher, p_fisher = fisher_exact(table5)
print(f"Odds ratio: {oddsratio_fisher:.4g}")
print(f"Fisher's exact test p-value: {p_fisher:.4g}")

Compared to the crude analysis (OR = 1.25), the stratum-specific ORs are
higher: 1.50 among subjects who exercise and 1.75 among those who do
not. This suggests that `Exercise` has a confounding effect on the
relationship between blood cholesterol and risk of death. When we
stratify by exercise, we get a stronger measure of association between
`X` and `Y`.

---

It would be convenient to have a way to get an estimate of the X-Y
association adjusted for confounding, without having to go through the
complications of the stratified analysis: this is the **Mantel-Haenszel
(M-H)** metric (a.k.a. Cochran-Mantel-Haenszel), obtained by summing over strata:

$$
\text{OR}_{\text{M-H}} = \frac{\sum_{k=1}^K (a_k \cdot d_k) / n_k}{\sum_{k=1}^K (b_k \cdot c_k) / n_k}
$$

where $a_k, b_k, c_k, d_k$ are the cells of the contingency table in
stratum $k$ (a: high-chol deaths; b: high-chol survivals; c: low-chol
deaths; d: low-chol survivals) and $n_k$ is the total number of records
in stratum $k$ (see e.g. [here](https://en.wikipedia.org/wiki/Cochran%E2%80%93Mantel%E2%80%93Haenszel_statistics)).



In [None]:
multistratum_array = np.array([table4, table5])

In [None]:
multistratum_array.shape ## (strata, rows, columns) = (2, 2, 2)

In [None]:
multistratum_array

In [None]:
or_mh = ( ((150*800)/2600) + ((100*700)/1300) ) / ( ((50*1600)/2600) + ((100*400)/1300) )

In [None]:
print("The common M-H adjusted OR is",or_mh)

In [None]:
num = ((150 - (1750*200/2600)) + (100 - (500*200)/1300))**2
denom = 1750*850*200*2400/((2600**2) * 2599) + 500*800*200*1100/((1300**2) * 1299)

In [None]:
test_stat = num/denom
print("The test statistic for the M-H test is", test_stat)

In [None]:
from scipy import stats
pval = 1 - stats.chi2.cdf(test_stat, 1)
print("The p-value for the common OR = 1 is", pval)
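The hand calculation above can be generalised to any number of strata. The function below is a minimal sketch (the name `mantel_haenszel` and its input layout are illustrative choices, not a library API); it uses the same formulas as above, without continuity correction, and obtains the p-value of the 1-degree-of-freedom chi-square statistic from the standard library.

```python
import math

def mantel_haenszel(strata):
    """Pooled M-H odds ratio and CMH test from a list of (a, b, c, d) tuples.

    a: exposed events, b: exposed non-events,
    c: unexposed events, d: unexposed non-events (one tuple per stratum).
    """
    num_or = den_or = 0.0
    obs_minus_exp = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num_or += a * d / n
        den_or += b * c / n
        row1, row2 = a + b, c + d          # exposed / unexposed totals
        col1, col2 = a + c, b + d          # event / non-event totals
        obs_minus_exp += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    odds_ratio = num_or / den_or
    statistic = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return odds_ratio, statistic, p_value

# Strata from the cholesterol example: exercise, no exercise
strata = [(150, 1600, 50, 800), (100, 400, 100, 700)]
or_pooled, cmh_stat, cmh_pval = mantel_haenszel(strata)
print("M-H adjusted OR:", round(or_pooled, 3))
print("CMH statistic:", round(cmh_stat, 3))
print("p-value:", cmh_pval)
```

Passing the strata as plain tuples keeps the function independent of how the tables are stored; with timepoints as strata, the list would simply hold one tuple per timepoint.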

The stratifying variable could be the **timepoint**: many experiments in scientific research are organised in timepoints T0, T1, T2, etc.

**Question: do you have any examples from your own research? In that case, we'd encourage you to share them, so that we can make a live illustration of stratified contingency tables**


<u>Alternative to stratified analysis</u>:

Better still, one can use a **logistic regression model**, include the
confounding effect(s) and then look at the model coefficients to get the
adjusted log-odds and odds ratios (and relative risks) for the
exposure/treatment under investigation (e.g. OR of one-unit increase in
exposure, or of belonging to one class or the other of the
exposure).
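As a rough illustration of the regression alternative, the sketch below fits a logistic model with cholesterol and exercise as predictors to the four cells of the stratified tables. It uses a hand-written Newton-Raphson loop in NumPy rather than a modelling library, purely to keep the example self-contained; in practice a package such as `statsmodels` would normally be used. All variable names are illustrative.

```python
import numpy as np

# One row per (cholesterol, exercise) cell:
# high_chol, exercise, deaths, total subjects
cells = np.array([
    [1, 1, 150, 1750],
    [0, 1,  50,  850],
    [1, 0, 100,  500],
    [0, 0, 100,  800],
], dtype=float)

# Design matrix: intercept, high cholesterol, exercise
X = np.column_stack([np.ones(len(cells)), cells[:, 0], cells[:, 1]])
deaths, totals = cells[:, 2], cells[:, 3]

# Newton-Raphson for the grouped binomial log-likelihood
beta = np.zeros(X.shape[1])
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (deaths - totals * p)
    info = X.T @ (X * (totals * p * (1 - p))[:, None])
    step = np.linalg.solve(info, score)
    beta += step
    if np.max(np.abs(step)) < 1e-10:
        break

# Exponentiated coefficients are adjusted odds ratios
or_chol_adjusted = np.exp(beta[1])
or_exercise_adjusted = np.exp(beta[2])
print("Adjusted OR (high vs low cholesterol):", round(or_chol_adjusted, 3))
print("Adjusted OR (exercise vs no exercise):", round(or_exercise_adjusted, 3))
```

The adjusted OR for cholesterol should fall between the two stratum-specific ORs, in line with the Mantel-Haenszel estimate.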

---

### Solutions

Confounding control can occur at the design stage through:

1.  randomization: e.g. RCT
2.  restriction: i.e. use only one value of the potential confounder to
    recruit subjects (e.g. only females)
3.  matching: i.e. match (balance) exposed / non-exposed by values of
    the confounder

At the analysis stage, confounding control can be managed by:

1.  standardization: e.g. normalize by age, body weight etc. (useful
    with continuous confounders)
2.  stratification: e.g. stratified analysis (as seen above)
3.  multivariable regression (linear, logistic, cox, etc.): include
    confounders in the statistical model
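In epidemiology, standardization is often understood as *direct standardization*: stratum-specific risks are averaged using the confounder distribution of a reference population. The sketch below applies this reading to the cholesterol example, taking the whole cohort as the reference population (an assumption made here for illustration).

```python
# (deaths, total) by cholesterol level within each exercise stratum
strata = {
    "exercise":    {"high": (150, 1750), "low": (50, 850)},
    "no-exercise": {"high": (100, 500),  "low": (100, 800)},
}

# Reference weights: share of each stratum in the whole cohort
stratum_size = {k: v["high"][1] + v["low"][1] for k, v in strata.items()}
n_total = sum(stratum_size.values())
weights = {k: size / n_total for k, size in stratum_size.items()}

def standardized_risk(level):
    """Weighted average of the stratum-specific risks for one exposure level."""
    return sum(weights[k] * strata[k][level][0] / strata[k][level][1]
               for k in strata)

risk_high_std = standardized_risk("high")
risk_low_std = standardized_risk("low")
rr_standardized = risk_high_std / risk_low_std

print("Standardized risk (high cholesterol):", round(risk_high_std, 4))
print("Standardized risk (low cholesterol):", round(risk_low_std, 4))
print("Standardized risk ratio:", round(rr_standardized, 3))
```

Because both exposure groups are given the same exercise distribution, the resulting risk ratio is no longer distorted by the imbalance in exercise between them.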

More advanced approaches include: i) structural causal models; ii)
directed acyclic graphs (DAGs); iii) propensity scores; iv) marginal
structural models with inverse probability weighting; v)
quasi-experimental methods such as instrumental variables (e.g.
Mendelian randomization in observational studies).
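To give a flavour of propensity scores and inverse probability weighting, the sketch below re-creates a dataset with the same structure as the simulation at the start of this section (with a larger sample size and NumPy's newer random generator, both choices made here for illustration). The propensity score is estimated as the proportion of exposed records within each level of `C`, which is only feasible because the confounder is a single binary variable.

```python
import numpy as np

rng = np.random.default_rng(127)
n = 20000

# Same data-generating process as in the simulation section
C = rng.binomial(1, 0.4, size=n)
X = rng.binomial(1, 0.3 + 0.4 * C)
Y = 1.0 * X + 2.0 * C + rng.normal(0, np.sqrt(2), size=n)

# Naive comparison of exposed vs non-exposed (confounded)
naive_effect = Y[X == 1].mean() - Y[X == 0].mean()

# Propensity score: estimated P(X = 1 | C) within each level of C
ps = np.where(C == 1, X[C == 1].mean(), X[C == 0].mean())

# Inverse probability weights: 1/ps for exposed, 1/(1 - ps) for non-exposed
w = np.where(X == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# Weighted difference in means
ipw_effect = (np.average(Y[X == 1], weights=w[X == 1])
              - np.average(Y[X == 0], weights=w[X == 0]))

print("Naive difference:", round(naive_effect, 3))
print("IPW-adjusted difference:", round(ipw_effect, 3))
```

The weights create a pseudo-population in which `C` is balanced between exposed and non-exposed records, so the weighted difference should be close to the simulated effect of 1.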

#### Time

In longitudinal data, **time** is often a confounder, i.e. a variable
that affects both the outcome and the exposure (time elapses for all).
This is often the case with **counterfactuals** (**confounders** vs
**counterfactuals**: [look here for a fun intuition on
counterfactuals](https://www.youtube.com/watch?v=0lpY0Kt4bn8)).
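A small simulation can make the role of time concrete. In the sketch below (all names and parameter values are illustrative), two series are generated independently of each other, but both follow a linear trend in time: one goes up, the other goes down. Their crude correlation is strong, while the correlation of the residuals after removing the time trend is close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(200, dtype=float)

# Two series that depend on time but not on each other
x = 0.5 * t + rng.normal(0, 5, size=t.size)    # goes up with time
y = -0.3 * t + rng.normal(0, 5, size=t.size)   # goes down with time

# Crude correlation, ignoring time
crude_corr = np.corrcoef(x, y)[0, 1]

def detrend(v, t):
    """Residuals of a simple linear regression of v on t."""
    slope, intercept = np.polyfit(t, v, 1)
    return v - (slope * t + intercept)

# Correlation after adjusting both series for time
adjusted_corr = np.corrcoef(detrend(x, t), detrend(y, t))[0, 1]

print("Crude correlation:", round(crude_corr, 3))
print("Correlation adjusted for time:", round(adjusted_corr, 3))
```

Removing a linear trend is the simplest possible adjustment; real longitudinal data may need more flexible ways of accounting for time.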

Similar considerations may apply to spatial confounding, e.g. a
geographical / environmental gradient. In this case, **space** can be a
potential confounder.

------------------------------------------------------------------------