
# Proxies
**Hands‑on Notebook**


**In this notebook**
Use a **proxy variable** to partially adjust for an unobserved confounder.


In [None]:
import numpy as np
import pandas as pd


## **Proxy variable** for an unobserved confounder

Unobserved `U` (true smoking exposure) affects both `YellowTeeth (Z)` and `Cancer (Y)`.  
`Smoking (X)` is a noisy self-report that is too imprecise to rely on. The nicotine level in the body, `NL`, is measured accurately and serves as a **proxy** for `U`.

We compare the naive estimate `P(Y|X)` with an estimate adjusted for the proxy `NL` (back-door adjustment via the proxy).
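
Before fitting anything, it can help to quantify how good each measurement is as a stand-in for `U`. The sketch below assumes the setup used in this notebook's simulation (each measurement is `U` plus independent noise, with `U` having variance 1) and uses the same noise standard deviations. The helper names `corr_with_u` and `reliability` are illustrative and not part of the notebook.

```python
import math

# Noise standard deviations taken from this notebook's simulation cell;
# U has variance 1 and each measurement is U + independent noise.
noise_sd = {"X": 1.0, "Z": 0.8, "NL": 0.1}

def corr_with_u(sd, var_u=1.0):
    """Correlation between U and (U + independent noise with std `sd`)."""
    return math.sqrt(var_u / (var_u + sd ** 2))

def reliability(sd, var_u=1.0):
    """Share of the measurement's variance that comes from U."""
    return var_u / (var_u + sd ** 2)

# (correlation with U, reliability) for each observed variable
proxy_quality = {
    name: (corr_with_u(sd), reliability(sd)) for name, sd in noise_sd.items()
}

for name, (corr, rel) in proxy_quality.items():
    print(f"{name:>2}: corr(U, {name}) = {corr:.3f}, reliability = {rel:.3f}")
```

Under these assumptions `NL` tracks `U` almost perfectly, while `X` and `Z` carry substantially more noise, which is why `NL` is the natural adjustment variable here.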


In [None]:
N = 200_000
rng = np.random.default_rng(12)

# Unobserved true exposure
U = rng.normal(0, 1, N)                 # unobserved driver of risk

# Observed variables
X  = U + rng.normal(0, 1.0, N)          # self-report (noisy, low precision)
Z  = U + rng.normal(0, 0.8, N)          # yellow teeth (crude indicator; we won't use it for adjustment here)
NL = U + rng.normal(0, 0.1, N)          # biomarker (accurate proxy; low noise)

# Outcome depends on TRUE exposure (U), not X directly
logit = -0.7 + 1.4 * U
pY = 1 / (1 + np.exp(-logit))
Y = (rng.random(N) < pY).astype(int)

dfp = pd.DataFrame(dict(X=X, Z=Z, NL=NL, Y=Y))

# --- Models: naive vs proxy-adjusted ---
import statsmodels.api as sm

# 1) Naive: Y ~ X  (confounded by U)
m_naive = sm.Logit(dfp["Y"], sm.add_constant(dfp[["X"]])).fit(disp=False)

# 2) Proxy-adjusted with accurate biomarker: Y ~ X + NL
m_proxy = sm.Logit(dfp["Y"], sm.add_constant(dfp[["X","NL"]])).fit(disp=False)

# 3) Biomarker only: Y ~ NL  (close to the "oracle" using U)
m_biomarker = sm.Logit(dfp["Y"], sm.add_constant(dfp[["NL"]])).fit(disp=False)

print("Predicting Y (cancer) using self reported smoking only")
print("Naive (Y ~ X):")
print(f"  beta_X = {m_naive.params['X']:.3f}")

print("\nProxy-adjusted to include both self-report and nicotine-level biomarker (Y ~ X + NL):")
print(f"  beta_X  = {m_proxy.params['X']:.3f}   (should shrink toward 0)")
print(f"  beta_NL = {m_proxy.params['NL']:.3f}  (captures the true U effect)")

print("\nPredicting Y (cancer) using Biomarker only (Y ~ NL):")
print(f"  beta_NL = {m_biomarker.params['NL']:.3f}")

# (Optional instructor check — uncomment to peek at "truth")
# corr_U_X  = np.corrcoef(U, X)[0,1]
# corr_U_Z  = np.corrcoef(U, Z)[0,1]
# corr_U_NL = np.corrcoef(U, NL)[0,1]
# print(f"\n[Hidden truth] corr(U,X)={corr_U_X:.2f}, corr(U,Z)={corr_U_Z:.2f}, corr(U,NL)={corr_U_NL:.2f}")



> **Observation:** With only `X`, `beta_X` picks up confounding from `U`.  
> Adding the proxy `NL` absorbs much of `U`'s influence and moves `beta_X` toward the *direct* effect of `X`, which is zero in this simulation because `Y` depends only on `U`.
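
One way to see why `beta_X` shrinks is to check how much information about `U` is left in `X` once `NL` is accounted for. The following is a minimal sketch, not the notebook's own method: it regenerates `U`, `X`, and `NL` with the same noise levels (the seed and the helper `residualize` are assumptions made for this example) and compares the marginal correlation of `X` and `U` with their partial correlation given `NL`.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen for this sketch only
n = 200_000

u = rng.normal(0, 1, n)            # unobserved true exposure
x = u + rng.normal(0, 1.0, n)      # noisy self-report
nl = u + rng.normal(0, 0.1, n)     # accurate biomarker

def residualize(v, w):
    """Return v minus its least-squares linear fit on w (with intercept)."""
    w_c = w - w.mean()
    slope = np.dot(w_c, v - v.mean()) / np.dot(w_c, w_c)
    return (v - v.mean()) - slope * w_c

# Correlation of X with U before and after removing what NL explains
marginal_corr = np.corrcoef(x, u)[0, 1]
partial_corr = np.corrcoef(residualize(x, nl), residualize(u, nl))[0, 1]

print(f"corr(X, U)      = {marginal_corr:.3f}")
print(f"corr(X, U | NL) = {partial_corr:.3f}")
```

With these noise levels the marginal correlation should be roughly 0.7 and the partial correlation roughly 0.1: once `NL` is in the model, `X` has little left to say about `U`, so its coefficient has little to pick up.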


## Exercise:

**Proxy strength:** In section C, increase proxy noise (e.g., `Z = U + 1.5*eps_z`).  
   - How do `beta_X` and the proxy coefficient (`beta_Z` when `Z` is the proxy) change? What does this say about **weak proxies**?

With increased noise, `beta_Z` is smaller than the `beta_NL` obtained with the accurate biomarker, and `beta_X` remains large. A weak proxy fails to capture the unobserved confounder `U` adequately, leaving residual confounding in the `X` coefficient.

In [None]:
N = 200_000
rng = np.random.default_rng(12)

U = rng.normal(0, 1, N)
X  = U + rng.normal(0, 1.0, N)
Z  = U + rng.normal(0, 1.5, N)          # weak proxy: noise SD raised from 0.8 to 1.5
NL = U + rng.normal(0, 0.1, N)

logit = -0.7 + 1.4 * U
pY = 1 / (1 + np.exp(-logit))
Y = (rng.random(N) < pY).astype(int)

dfp_weak = pd.DataFrame(dict(X=X, Z=Z, NL=NL, Y=Y))

# 1) Naive: Y ~ X  (confounded by U)
m_naive_weak = sm.Logit(dfp_weak["Y"], sm.add_constant(dfp_weak[["X"]])).fit(disp=False)
# 2) Adjusted with the weak proxy: Y ~ X + Z
m_proxy_weak = sm.Logit(dfp_weak["Y"], sm.add_constant(dfp_weak[["X","Z"]])).fit(disp=False)
# 3) Weak proxy only: Y ~ Z
m_biomarker_weak = sm.Logit(dfp_weak["Y"], sm.add_constant(dfp_weak[["Z"]])).fit(disp=False)

print("With weak proxy Z (noise = 1.5):")
print(f"Naive (Y ~ X): beta_X = {m_naive_weak.params['X']:.3f}")
print(f"Proxy-adjusted (Y ~ X + Z): beta_X = {m_proxy_weak.params['X']:.3f}, beta_Z = {m_proxy_weak.params['Z']:.3f}")
print(f"Weak proxy only (Y ~ Z): beta_Z = {m_biomarker_weak.params['Z']:.3f}")
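
The weak-proxy pattern can also be cross-checked without fitting any model. The sketch below is a linear-Gaussian analogue, not the logistic fit used in this notebook: assuming `U ~ N(0, 1)` and measurements of the form `U` plus independent Gaussian noise, the best linear predictor of `U` weights each measurement by its precision. The function `weight_on_x` is a hypothetical helper introduced for this example. It returns the weight placed on `X`, which is the part of `U` that `X` still has to carry after the proxy is included.

```python
def weight_on_x(sd_x, sd_proxy=None, var_u=1.0):
    """Coefficient on X in the best linear predictor of U from X (and a proxy).

    Assumes U ~ N(0, var_u) and measurements U + independent Gaussian noise.
    With sd_proxy=None the proxy is left out (the naive case).
    """
    precision = 1.0 / var_u + 1.0 / sd_x ** 2
    if sd_proxy is not None:
        precision += 1.0 / sd_proxy ** 2
    return (1.0 / sd_x ** 2) / precision

sd_x = 1.0                       # self-report noise used in this notebook
naive_weight = weight_on_x(sd_x)

# Proxy noise levels that appear in this notebook: NL (0.1), Z (0.8), weak Z (1.5)
weights = {sd: weight_on_x(sd_x, sd) for sd in (0.1, 0.8, 1.5)}

print(f"no proxy:           weight on X = {naive_weight:.3f}")
for sd, w in weights.items():
    print(f"proxy noise {sd:>3}:    weight on X = {w:.3f}")
```

The logistic coefficients will not match these numbers in scale, but they should follow the same ordering: the noisier the proxy, the closer the weight on `X` stays to its naive value.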