![Status: In Progress](https://img.shields.io/badge/status-in--progress-yellow)
![Python](https://img.shields.io/badge/python-3.10-blue)
![Coverage](https://img.shields.io/badge/coverage-70%25-yellowgreen)
![License](https://img.shields.io/badge/license-MIT-green)

<!-- ![Status: Complete](https://img.shields.io/badge/status-complete-brightgreen)
![Python](https://img.shields.io/badge/python-3.10-blue)
![Coverage](https://img.shields.io/badge/coverage-95%25-yellowgreen)
![License](https://img.shields.io/badge/license-MIT-green) -->

<a id="table-of-contents"></a>
# 🧭 Causal Inference

- [🎯 Introduction to Causal Inference](#intro)
  - [🎓 What is Causal Inference?](#what-is-causal)
  - [📌 Why go beyond Correlation?](#why-correlation)
  - [🧭 Real-world problems that need causality](#real-world-examples)

- [🧠 Core Concepts & Notation](#notation-assumptions)
  - [🧮 Treatment, Outcome, Units](#treatment-outcome-units)
  - [📐 Potential Outcomes (Rubin Causal Model)](#potential-outcomes)
  - [🧵 Fundamental Problem of Causal Inference](#fundamental-problem)
  - [🧠 Assumptions (SUTVA, Ignorability, Overlap)](#core-assumptions)

- [🧪 Simulated Dataset Setup](#simulated-data)
  - [🧬 Define treatment assignment logic](#treatment-logic)
  - [🔬 Inject confounding intentionally](#inject-confounding)
  - [🧊 Simulate potential outcomes + observed data](#simulate-outcomes)

- [🚫 Naive Estimation](#naive-estimation)
  - [❌ Simple difference in means](#diff-in-means)
  - [⚠️ Bias due to confounding](#bias-confounding)

- [🕸️ Causal Diagrams (DAGs)](#causal-diagrams)
  - [🧿 Quick primer on DAGs](#primer-dags)
  - [🕷️ Confounding vs. colliders vs. mediators](#confounder-collider-mediator)
  - [🔗 What can/can’t be estimated just from data](#estimability-from-dags)

- [🔍 Backdoor Adjustment Methods](#backdoor-adjustment)
  - [🧾 Conditioning on confounders](#conditioning)
  - [🕵️‍♂️ Stratification / Subgroup analysis](#stratification)
  - [📊 Regression Adjustment](#regression-adjustment)
  - [📌 Propensity Score Matching (PSM)](#psm)

- [🎯 Instrumental Variables (IV)](#iv-methods)
  - [🪝 When backdoor paths can’t be blocked](#when-use-iv)
  - [🎯 Valid instrument conditions](#iv-conditions)
  - [🧩 2-Stage Least Squares (2SLS)](#2sls)

- [🧰 Double Machine Learning (DML)](#dml-methods)
  - [🪛 Use ML models for nuisance functions](#ml-nuisance)
  - [🧱 Residualization + orthogonalization logic](#residualization)
  - [🧲 When to prefer over traditional regression](#dml-vs-regression)

- [🌈 Heterogeneous Treatment Effects](#heterogeneous-effects)
  - [🎨 ATE vs. CATE vs. ITE](#ate-cate-ite)
  - [🌟 Uplift models and use cases](#uplift-usecases)
  - [🧩 Tree-based methods (Causal Trees, Causal Forests)](#causal-forests)

- [🧪 Placebo Tests & Robustness Checks](#placebo-robustness)
  - [🧻 Randomized placebo treatments](#placebo)
  - [⚗️ Sensitivity to unobserved confounding](#robustness)

- [🧬 Counterfactual Thinking](#counterfactuals)
  - [🤖 Predicting what would’ve happened](#what-if)
  - [🔁 Usage in recommendation & personalization](#personalization)

- [📌 Closing Notes](#closing-notes)
  - [📝 Summary table of methods](#summary-table)
  - [📋 When to use what](#method-choice)
  - [📎 Causal vs Predictive mindset](#causal-vs-predictive)

___

<a id="intro"></a>
# 🎯 Introduction to Causal Inference


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>🧠 Why this Notebook</h5>

<p>Causal inference gives us the tools to answer "what if" questions — not just "what is." In product, policy, medicine, and science, we often need to <strong>act</strong>, and actions require understanding their consequences.</p>

<p>This field helps us:</p>
<ul>
  <li>Understand <strong>how</strong> and <strong>why</strong> outcomes change.</li>
  <li>Move from data <em>descriptions</em> to data-<em>driven interventions</em>.</li>
  <li>Avoid the trap of chasing noisy correlations.</li>
</ul>

<p>This notebook is a build-up from first principles to practical methods — with enough grounding to reason about experiments, models, and their assumptions clearly.</p>

</details>


<a id="what-is-causal"></a>
#### 🎓 What is Causal Inference?


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>🎓 Causal inference is the process of estimating the <strong>effect</strong> of one variable (the treatment) on another (the outcome), holding all else constant.</h5>

<p>The core idea is to estimate:</p>
<blockquote>What would the outcome have been if the treatment had (or had not) occurred?</blockquote>

<p>Unlike correlation or predictive modeling:</p>
<ul>
  <li>It asks <strong>counterfactual</strong> questions — what <em>would</em> have happened under different scenarios.</li>
  <li>It requires <strong>assumptions</strong>, <strong>design</strong>, and often <strong>randomization</strong> or clever statistical tricks.</li>
</ul>

<p>At its heart, causal inference is about:</p>
<ul>
  <li>Designing better <strong>interventions</strong></li>
  <li>Estimating <strong>treatment effects</strong></li>
  <li>Avoiding misleading <strong>associational patterns</strong></li>
</ul>

</details>


<a id="why-correlation"></a>
#### 📌 Why go beyond Correlation?


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>📌 Correlation can be dangerous when used as a proxy for causation.</h5>

<p>Example: Ice cream sales are correlated with shark attacks. Should we ban dessert?  
Clearly not — they’re both caused by heatwaves (a confounder).</p>

<p>Correlation fails because it:</p>
<ul>
  <li>Ignores <strong>confounders</strong> (common causes of both variables)</li>
  <li>Misses <strong>directionality</strong> (what affects what)</li>
  <li>Can be driven by <strong>reverse causation</strong> or <strong>coincidence</strong></li>
</ul>

<p>Causal inference gives tools to:</p>
<ul>
  <li><strong>Identify</strong> confounding</li>
  <li><strong>Design</strong> better studies (randomized or observational)</li>
  <li><strong>Interpret</strong> results in terms of actionable causes</li>
</ul>

</details>


<a id="real-world-examples"></a>
#### 🧭 Real-world problems that need causality


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>🧭 Correlation might be fine for dashboards. But when making <strong>decisions</strong>, causality is non-negotiable.</h5>

<p>Examples:</p>
<ul>
  <li><strong>Product</strong>: Did that new button placement increase checkout, or was it a seasonal effect?</li>
  <li><strong>Marketing</strong>: Did the email nudge lead to purchases, or did loyal users open it anyway?</li>
  <li><strong>Policy</strong>: Did a tax cut help the economy, or was it already improving?</li>
  <li><strong>Health</strong>: Does a drug reduce disease, or do healthier people tend to take it?</li>
</ul>

<p>These questions involve <strong>interventions</strong>, and only causal methods can tell us what would’ve happened under a different choice.</p>

</details>


[Back to the top](#table-of-contents)
___


<a id="notation-assumptions"></a>
# 🧠 Core Concepts & Notation


<a id="treatment-outcome-units"></a>
#### 🧮 Treatment, Outcome, Units


<details><summary><strong>📉 Click to Expand</strong></summary>

<ul>
  <li><strong>Treatment (<code>T</code>)</strong>: The intervention or condition being tested (e.g., new design, drug, policy).</li>
  <li><strong>Outcome (<code>Y</code>)</strong>: The result or metric affected by the treatment (e.g., click, recovery, score).</li>
  <li><strong>Units (<code>i</code>)</strong>: The entities receiving treatment and producing outcomes (e.g., users, patients, schools).</li>
</ul>

<p>Each unit can receive a treatment or control, and we observe only one outcome — not both.</p>

<p>This framing is universal and applies whether you're testing emails, ads, or vaccines.</p>

</details>


<a id="potential-outcomes"></a>
#### 📐 Potential Outcomes (Rubin Causal Model)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>The <strong>Potential Outcomes framework</strong> (aka Rubin Causal Model) imagines two parallel worlds for each unit:</p>

<ul>
  <li><code>Y(1)</code>: Outcome if treated</li>
  <li><code>Y(0)</code>: Outcome if not treated</li>
</ul>

<p>We define <strong>Individual Treatment Effect (ITE)</strong> as:</p>

<blockquote>ITE = Y(1) - Y(0)</blockquote>

<p><strong>Key idea:</strong></p>

<p>Each unit has both potential outcomes — but we can only observe one. The other is <strong>counterfactual</strong>.</p>

<p>This framework allows us to define:</p>

<ul>
  <li>ATE (Average Treatment Effect)</li>
  <li>CATE (Conditional ATE, for subgroups)</li>
</ul>

<p>And formalizes why causal inference is hard: we never see both outcomes.</p>

</details>


<a id="fundamental-problem"></a>
#### 🧵 Fundamental Problem of Causal Inference


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>The <strong>Fundamental Problem of Causal Inference</strong>:</h5>

<blockquote>For any individual, we can observe only one potential outcome — never both.</blockquote>

<p>Example:</p>
<ul>
  <li>A user sees version A → you observe <code>Y(0)</code></li>
  <li>You’ll never know what <code>Y(1)</code> would have been for that exact user</li>
</ul>

<p>This creates a missing data problem: the counterfactual is unobservable.</p>

<p>To solve this, we rely on:</p>
<ul>
  <li><strong>Randomization</strong></li>
  <li><strong>Modeling + assumptions</strong></li>
  <li><strong>Matching or weighting approaches</strong></li>
</ul>

<p>All causal methods are, in some way, trying to <strong>approximate the missing counterfactual</strong>.</p>

</details>


<a id="core-assumptions"></a>
#### 🧠 Assumptions (SUTVA, Ignorability, Overlap)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Causal inference relies heavily on assumptions — even when you don’t randomize.</p>

<h5>Three core ones:</h5>

<ul>
  <li><strong>SUTVA (Stable Unit Treatment Value Assumption)</strong><br>
  → Your treatment doesn’t affect someone else’s outcome.<br>
  → No interference across units.</li>

  <li><strong>Ignorability (a.k.a. Unconfoundedness)</strong><br>
  → Given the observed covariates, treatment assignment is as good as random.<br>
  → This lets you use observed data for estimation.</li>

  <li><strong>Overlap (a.k.a. Positivity)</strong><br>
  → Every unit has a non-zero probability of receiving either treatment.<br>
  → You can’t learn effects where there’s no variation.</li>
</ul>

<p>Without these, causal estimates can be biased or undefined. Always question whether they hold before trusting results.</p>

</details>


[Back to the top](#table-of-contents)
___


<a id="simulated-data"></a>
# 🧪 Simulated Dataset Setup


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Simulating data is the best way to <em>control the truth</em> when learning causal inference.</p>

<p>Here’s why we simulate:</p>
<ul>
  <li>You get full knowledge of ground-truth treatment effects.</li>
  <li>You can deliberately create <strong>confounding</strong>, <strong>bias</strong>, <strong>non-randomness</strong>.</li>
  <li>You can practice recovering the true causal effect using different methods.</li>
</ul>

<p>In real-world observational data, the "truth" is hidden. Simulating lets you debug your causal intuition safely before dealing with messy production datasets.</p>

</details>


In [30]:
# Simulated Dataset Setup
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

# We'll define features, treatment assignment, and outcomes step-by-step later


<a id="treatment-logic"></a>
#### 🧬 Define treatment assignment logic


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>To simulate treatment realistically:</p>
<ul>
  <li>Treatment should <strong>depend</strong> on observed features.</li>
  <li>Treatment <strong>should not</strong> be random — otherwise, no confounding to deal with.</li>
</ul>

<p>For example:</p>
<ul>
  <li>Wealthier users might be more likely to receive a premium offer.</li>
  <li>Healthier patients might be less likely to receive intensive care.</li>
</ul>

<p>We’ll simulate a <strong>non-random treatment assignment</strong> based on a few covariates to mimic real-world biases.</p>

</details>


In [31]:
# Define covariates (features)
n = 5000

age = np.random.normal(40, 12, n)       # Age
income = np.random.normal(60000, 15000, n)  # Annual income
prior_engagement = np.random.beta(2, 5, n)  # Past engagement score [0,1]

# Treatment assignment probability based on features
treatment_prob = (
    0.3 * (income > 70000).astype(float) +
    0.2 * (prior_engagement > 0.5).astype(float) +
    0.1 * (age < 30).astype(float) +
    np.random.normal(0, 0.05, n)  # small noise
)
treatment_prob = np.clip(treatment_prob, 0, 1)

# Assign treatment
T = np.random.binomial(1, treatment_prob)


<a id="inject-confounding"></a>
#### 🔬 Inject confounding intentionally


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>In real-world datasets, treatment assignment is <strong>not random</strong> — it’s confounded by covariates.</p>

<p>We deliberately inject confounding so that:</p>
<ul>
  <li>Covariates (age, income, engagement) affect both <strong>treatment</strong> and <strong>outcome</strong>.</li>
  <li>If we naively compare treated vs untreated, we'll get biased results.</li>
</ul>

<p>Confounding creates the need for adjustment, which will be a major theme later.</p>

</details>


In [32]:
# Let's define a "true" baseline outcome based on the same covariates
base_outcome = (
    50 + 
    0.02 * income +
    5 * prior_engagement -
    0.3 * age +
    np.random.normal(0, 5, n)  # random noise
)


<a id="simulate-outcomes"></a>
#### 🧊 Simulate potential outcomes + observed data


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>In the Rubin Potential Outcomes framework, each unit has two outcomes:</p>
<ul>
  <li><code>Y(1)</code> → If treated</li>
  <li><code>Y(0)</code> → If not treated</li>
</ul>

<p>We can simulate this by:</p>
<ul>
  <li>Applying a <strong>true treatment effect</strong> to <code>Y(1)</code></li>
  <li>Leaving <code>Y(0)</code> as the base outcome</li>
</ul>

<p><strong>Important:</strong> We observe only one of <code>Y(1)</code> or <code>Y(0)</code>, depending on treatment assignment (<code>T</code>).</p>

</details>


In [33]:
# Define a true treatment effect (could vary by subgroup later)
true_treatment_effect = 10  # a flat +10 effect for everyone

# Simulate potential outcomes
Y_0 = base_outcome
Y_1 = base_outcome + true_treatment_effect

# Observed outcome based on treatment assignment
Y_obs = T * Y_1 + (1 - T) * Y_0

# Assemble into a dataframe
df = pd.DataFrame({
    'age': age,
    'income': income,
    'prior_engagement': prior_engagement,
    'T': T,
    'Y_obs': Y_obs,
    'Y_0': Y_0,
    'Y_1': Y_1,
})

df.head()


Unnamed: 0,age,income,prior_engagement,T,Y_obs,Y_0,Y_1
0,45.96057,53643.60477,0.188077,0,1114.502287,1114.502287,1124.502287
1,38.340828,53198.788374,0.170389,0,1099.893323,1099.893323,1109.893323
2,47.772262,33065.352411,0.511379,0,704.133035,704.133035,714.133035
3,58.276358,55048.647124,0.318793,0,1139.203509,1139.203509,1149.203509
4,37.19016,70992.436227,0.384439,0,1464.308234,1464.308234,1474.308234


[Back to the top](#table-of-contents)
___


<a id="naive-estimation"></a>
# 🚫 Naive Estimation


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Many causal questions are first attacked by simply comparing the treated vs. untreated groups.</p>

<p><strong>Naive Approach:</strong></p>
<blockquote>Average outcome of treated - Average outcome of untreated.</blockquote>

<p>This looks simple, but in observational data:</p>
<ul>
  <li>Treated and untreated units <strong>are not comparable</strong>.</li>
  <li>Treatment assignment was <strong>not randomized</strong>.</li>
  <li>Differences in baseline characteristics confound the simple difference.</li>
</ul>

<p>Naive estimation <strong>almost always gives biased results</strong> unless you have perfect randomization.</p>

<p>In this section, we'll see how bad the naive approach can get even on a simple synthetic dataset.</p>

</details>


In [34]:
# Quick naive estimation
treated_mean = df.loc[df['T'] == 1, 'Y_obs'].mean()
control_mean = df.loc[df['T'] == 0, 'Y_obs'].mean()

naive_diff = treated_mean - control_mean

print(f"Naive difference in means: {naive_diff:.2f}")
print(f"True treatment effect (ground truth): {true_treatment_effect}")


Naive difference in means: 250.53
True treatment effect (ground truth): 10


<a id="diff-in-means"></a>
#### ❌ Simple difference in means


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>A simple difference in means is mathematically:</p>
<blockquote><code>E[Y | T=1] - E[Y | T=0]</code></blockquote>

<p>If treatment assignment were random:</p>
<ul>
  <li>The two groups would be exchangeable.</li>
  <li>Baseline covariates would balance on average.</li>
  <li>The simple difference would be an unbiased estimator of ATE.</li>
</ul>

<p>But if treatment is <strong>confounded</strong>, then:</p>
<ul>
  <li><code>T=1</code> units may systematically differ from <code>T=0</code> units.</li>
  <li>The naive estimator picks up both <strong>causal effect</strong> and <strong>selection bias</strong>.</li>
</ul>

<p>We’ll soon quantify how large this bias can be.</p>

</details>


In [35]:
# Let's quickly visualize how the treated vs control groups differ in covariates
df.groupby('T')[['age', 'income', 'prior_engagement']].mean()


Unnamed: 0_level_0,age,income,prior_engagement
T,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,40.389926,58304.201745,0.278891
1,37.892227,70283.224987,0.330779


<a id="bias-confounding"></a>
#### ⚠️ Bias due to confounding


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Bias from confounding</strong> happens when:</p>
<ul>
  <li>The treated group has systematically different baseline outcomes than the control group.</li>
</ul>

<p>Mathematically:</p>
<blockquote>Observed difference = True treatment effect + Bias from baseline differences</blockquote>

<p>In our simulation:</p>
<ul>
  <li>Higher income users are more likely to be treated.</li>
  <li>Income also directly influences outcome.</li>
  <li>Therefore, the observed difference <strong>overstates</strong> the real effect.</li>
</ul>

<p>This is why adjusting for confounders is critical — naive methods can easily mislead interventions and business decisions.</p>

</details>


In [36]:
# Let's calculate the *average baseline outcome* (Y_0) in treated vs untreated groups
treated_baseline = df.loc[df['T'] == 1, 'Y_0'].mean()
control_baseline = df.loc[df['T'] == 0, 'Y_0'].mean()

baseline_diff = treated_baseline - control_baseline

print(f"Baseline (Y_0) difference between treated and control: {baseline_diff:.2f}")
print("This baseline imbalance creates bias in naive estimation.")


Baseline (Y_0) difference between treated and control: 240.53
This baseline imbalance creates bias in naive estimation.


[Back to the top](#table-of-contents)
___


<a id="causal-diagrams"></a>
# 🕸️ Causal Diagrams (DAGs)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Directed Acyclic Graphs (DAGs)</strong> are a compact way to represent assumptions about the data generating process.</p>

<ul>
  <li>Nodes = variables</li>
  <li>Edges (arrows) = direct causal influence</li>
</ul>

<p>DAGs are not learned from data. They are <strong>drawn from domain knowledge</strong> to help reason about:</p>
<ul>
  <li>Confounders</li>
  <li>Biases</li>
  <li>Valid adjustment strategies</li>
</ul>

<p>Almost every causal inference method implicitly or explicitly assumes a DAG about the world.</p>

</details>


<a id="primer-dags"></a>
#### 🧿 Quick primer on DAGs


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>A <strong>DAG (Directed Acyclic Graph)</strong> encodes assumptions about how variables causally relate.</p>

<ul>
  <li><strong>Directed</strong>: Arrows have direction (cause → effect).</li>
  <li><strong>Acyclic</strong>: No feedback loops allowed (you can’t return to a node).</li>
</ul>

<p>Example:</p>
<blockquote>Age → Income → Health</blockquote>

<p>Means:</p>
<ul>
  <li>Age affects income.</li>
  <li>Income affects health.</li>
  <li>No reverse paths.</li>
</ul>

<p>DAGs help identify:</p>
<ul>
  <li>Which paths are confounded</li>
  <li>Which variables to control for</li>
  <li>Whether effects are identifiable</li>
</ul>

<p>They act like a <strong>map</strong> — letting you plan causal estimation strategies intelligently.</p>

</details>


<a id="confounder-collider-mediator"></a>
#### 🕷️ Confounding vs. colliders vs. mediators


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Confounders</strong>:</p>
<ul>
  <li>Variables that influence both treatment and outcome.</li>
  <li>Must be adjusted for to block bias.</li>
  <li>Example: Age confounds the relationship between Exercise (<code>T</code>) and Health (<code>Y</code>).</li>
</ul>

<p><strong>Colliders</strong>:</p>
<ul>
  <li>Variables caused by two other variables.</li>
  <li><strong>Must NOT adjust for colliders</strong> — doing so opens spurious associations.</li>
  <li>Example: Adjusting for "hospitalization" might introduce bias when studying Smoking → Lung Disease.</li>
</ul>

<p><strong>Mediators</strong>:</p>
<ul>
  <li>Variables on the causal pathway between treatment and outcome.</li>
  <li>Adjusting for them <strong>blocks part of the causal effect</strong> you want to measure.</li>
  <li>Example: Exercise → Fitness → Health (fitness is a mediator).</li>
</ul>

<p>👉 Correct adjustment requires distinguishing among these roles.</p>

</details>


<a id="estimability-from-dags"></a>
#### 🔗 What can/can’t be estimated just from data


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Not everything is identifiable from data alone — assumptions are unavoidable.</p>

<h5>What can be estimated:</h5>
<ul>
  <li>Associations (correlations, patterns)</li>
  <li>Conditional independence structures</li>
  <li>Causal effects <strong>if</strong> the right covariates are controlled (based on DAG structure)</li>
</ul>

<h5>What cannot be estimated:</h5>
<ul>
  <li>Whether a relationship is causal (without assumptions)</li>
  <li>The full structure of a DAG (unless randomized experiments are used)</li>
</ul>

<p>Data + assumptions → Causal conclusions.<br>
Data alone → Only correlational findings.</p>

<p>DAGs clarify where you need domain knowledge vs where data suffices.</p>

</details>


[Back to the top](#table-of-contents)
___


<a id="backdoor-adjustment"></a>
# 🔍 Backdoor Adjustment Methods


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Backdoor adjustment methods aim to block <strong>backdoor paths</strong> — non-causal paths that create bias between treatment and outcome.</p>

<p><strong>Core idea:</strong></p>
<ul>
  <li>Identify variables (confounders) that open backdoor paths.</li>
  <li>Condition on them — either by stratifying, modeling, or matching.</li>
</ul>

<p>Backdoor adjustment <strong>simulates</strong> what would happen if treatment assignment were random within levels of the confounders.</p>

<p>It’s the foundational idea behind:</p>
<ul>
  <li>Regression</li>
  <li>Matching</li>
  <li>Stratification</li>
  <li>Propensity scores</li>
</ul>

<p>If you can block all backdoor paths, you can estimate causal effects from observational data reliably.</p>

</details>


<a id="conditioning"></a>
#### 🧾 Conditioning on confounders


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Conditioning means <strong>holding confounders constant</strong> when comparing treated vs untreated units.</p>

<p>Examples:</p>
<ul>
  <li>Comparing treated vs untreated users <strong>within each income band</strong>.</li>
  <li>Comparing recovery rates <strong>within each age group</strong>.</li>
</ul>

<p>By conditioning, you eliminate the variation due to confounders, isolating the causal effect.</p>

<p><strong>Important:</strong><br>
You should only condition on true confounders — not colliders or mediators.</p>

<p>Conditioning can be implemented via:</p>
<ul>
  <li>Subgrouping</li>
  <li>Regression</li>
  <li>Matching</li>
  <li>Weighting</li>
</ul>

</details>


In [37]:
# Simple conditioning by subgroup (example: income > 70k vs <= 70k)
df['high_income'] = (df['income'] > 70000).astype(int)

grouped = df.groupby(['high_income', 'T'])['Y_obs'].mean().unstack()
grouped['diff'] = grouped[1] - grouped[0]

print("Simple Conditional Difference by Income Group:")
print(grouped[['diff']])


Simple Conditional Difference by Income Group:
T                 diff
high_income           
0            16.867043
1            13.563878


<a id="stratification"></a>
#### 🕵️‍♂️ Stratification / Subgroup analysis


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Stratification means <strong>breaking the dataset into buckets</strong> based on confounders and comparing treatment effects within each bucket.</p>

<p>Typical steps:</p>
<ol>
  <li>Divide data based on a confounder (e.g., low vs high engagement).</li>
  <li>Within each stratum, compute treated vs control differences.</li>
  <li>Aggregate across strata (weighted average).</li>
</ol>

<p><strong>When useful:</strong></p>
<ul>
  <li>When confounders are categorical or easily discretized.</li>
  <li>When interpretability is important.</li>
</ul>

<p><strong>Limits:</strong></p>
<ul>
  <li>Doesn’t scale well with many confounders (curse of dimensionality).</li>
</ul>

</details>


In [38]:
# Stratify based on prior_engagement (simple high vs low)
df['high_engagement'] = (df['prior_engagement'] > 0.5).astype(int)

strat_grouped = df.groupby(['high_engagement', 'T'])['Y_obs'].mean().unstack()
strat_grouped['diff'] = strat_grouped[1] - strat_grouped[0]

print("Conditional Difference by Engagement Group:")
print(strat_grouped[['diff']])


Conditional Difference by Engagement Group:
T                      diff
high_engagement            
0                290.043114
1                145.885304


<a id="regression-adjustment"></a>
#### 📊 Regression Adjustment


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Regression adjustment estimates causal effects by <strong>controlling for confounders via regression</strong>.</p>

<p>Simple linear model:</p>
<blockquote><code>Y = β₀ + β₁·T + β₂·(confounder1) + β₃·(confounder2) + ... + ε</code></blockquote>

<ul>
  <li><code>β₁</code> captures the <strong>adjusted</strong> effect of treatment, controlling for confounders.</li>
  <li>It removes bias from observable confounders (under correct model specification).</li>
</ul>

<p><strong>Advantages:</strong></p>
<ul>
  <li>Easy to use.</li>
  <li>Scales to many covariates.</li>
</ul>

<p><strong>Risks:</strong></p>
<ul>
  <li>Sensitive to model misspecification.</li>
  <li>Wrong functional forms (nonlinearities, interactions) can bias estimates.</li>
</ul>

</details>


In [39]:
import statsmodels.api as sm

X = df[['T', 'age', 'income', 'prior_engagement']]
X = sm.add_constant(X)
y = df['Y_obs']

reg_model = sm.OLS(y, X).fit()

print(reg_model.summary())

print(f"\nEstimated treatment effect (β₁) after adjustment: {reg_model.params['T']:.2f}")


                            OLS Regression Results                            
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.652e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15122.
No. Observations:                5000   AIC:                         3.025e+04
Df Residuals:                    4995   BIC:                         3.029e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               50.3660      0.398  

<a id="psm"></a>
#### 📌 Propensity Score Matching (PSM)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Propensity Score Matching (PSM) is a two-step procedure:</p>
<ol>
  <li>Model the <strong>probability of receiving treatment</strong> (<code>P(T=1 | X)</code>) using observed covariates.</li>
  <li>Match treated and control units with <strong>similar propensity scores</strong>.</li>
</ol>

<p><strong>Why PSM?</strong></p>
<ul>
  <li>Instead of adjusting for many covariates separately, you balance treated and control groups on a single dimension (the propensity score).</li>
</ul>

<p><strong>When useful:</strong></p>
<ul>
  <li>When covariate space is high-dimensional.</li>
  <li>When you want a matched sample that resembles randomized data.</li>
</ul>

<p><strong>Limitations:</strong></p>
<ul>
  <li>Requires good overlap (common support).</li>
  <li>Still relies on unconfoundedness assumption.</li>
</ul>

</details>


In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores
ps_model = LogisticRegression()
ps_model.fit(df[['age', 'income', 'prior_engagement']], df['T'])
df['propensity_score'] = ps_model.predict_proba(df[['age', 'income', 'prior_engagement']])[:,1]

# Step 2: Nearest neighbor matching
treated = df[df['T'] == 1]
control = df[df['T'] == 0]

nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]

# Calculate matched difference
matched_diff = (treated['Y_obs'].values - matched_control['Y_obs'].values).mean()

print(f"Propensity Score Matched Estimate of Treatment Effect: {matched_diff:.2f}")


Propensity Score Matched Estimate of Treatment Effect: 197.50


[Back to the top](#table-of-contents)
___


<a id="iv-methods"></a>
# 🎯 Instrumental Variables (IV)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Instrumental Variables (IV)</strong> methods are used <strong>when simple adjustment for confounders is impossible</strong> or <strong>not credible</strong>.</p>

<p>When treatment is <strong>endogenous</strong> (affected by unobserved factors also affecting outcome), traditional methods like regression fail.</p>

<p><strong>IV solves this by:</strong></p>
<ul>
  <li>Using a "proxy" (instrument) that affects treatment but is otherwise unrelated to the outcome except through treatment.</li>
  <li>"Re-randomizing" variation in treatment based on the instrument.</li>
</ul>

<p>You create <strong>quasi-randomization</strong> even in observational data.</p>

<p>Classic examples:</p>
<ul>
  <li>Distance to hospital → instrument for getting surgery.</li>
  <li>Random assignment of judges → instrument for harsher sentencing.</li>
</ul>

</details>


<a id="when-use-iv"></a>
#### 🪝 When backdoor paths can’t be blocked


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>You need IV methods when:</p>
<ul>
  <li>There are <strong>unobserved confounders</strong> you can’t measure.</li>
  <li>No set of observed covariates satisfies ignorability.</li>
  <li>Standard backdoor adjustment will be biased.</li>
</ul>

<p>Example:</p>
<ul>
  <li>Studying the effect of education on income: natural intelligence is a hidden confounder (affects both education and income).</li>
  <li>You can't just regress income ~ education — bias remains.</li>
</ul>

<p><strong>Key realization:</strong><br>
If <strong>backdoor paths exist</strong> through unobserved variables, IV becomes necessary.</p>

</details>


<a id="iv-conditions"></a>
#### 🎯 Valid instrument conditions


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>For an instrument (<code>Z</code>) to be valid, it must satisfy:</p>

<ol>
  <li><strong>Relevance:</strong><br>
  <code>Z</code> must affect treatment <code>T</code>.<br>
  (There must be a first-stage effect.)</li>

  <li><strong>Exclusion Restriction:</strong><br>
  <code>Z</code> must affect the outcome <code>Y</code> <strong>only</strong> through <code>T</code>.<br>
  (No direct path from <code>Z</code> to <code>Y</code>.)</li>

  <li><strong>Independence (As-if Randomness):</strong><br>
  <code>Z</code> must be independent of unobserved confounders affecting <code>Y</code>.</li>
</ol>

<hr>

<p>If any of these fail:</p>
<ul>
  <li>IV estimates are biased or meaningless.</li>
  <li>You can’t fix bad instruments with bigger sample sizes.</li>
</ul>

<p><strong>Choosing or arguing a valid instrument is 90% of the IV battle.</strong></p>

</details>


<a id="2sls"></a>
#### 🧩 2-Stage Least Squares (2SLS)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>2SLS (Two-Stage Least Squares)</strong> is the classic estimation procedure for IV:</p>

<ul>
  <li><strong>Stage 1:</strong><br>
  Regress treatment <code>T</code> on instrument <code>Z</code> (and any controls)<br>
  → get predicted treatment (<code>T̂</code>)</li>

  <li><strong>Stage 2:</strong><br>
  Regress outcome <code>Y</code> on predicted treatment (<code>T̂</code>)</li>
</ul>

<p>The second-stage coefficient gives the <strong>causal effect</strong> of treatment on outcome, isolating variation driven by the instrument.</p>

<p><strong>Warning:</strong></p>
<ul>
  <li>Standard regression software doesn't correct standard errors properly when doing 2SLS manually.</li>
  <li>Later packages like <code>linearmodels</code> automate this.</li>
</ul>

</details>


In [41]:
from statsmodels.api import OLS, add_constant

# Simulate an instrument Z (let's assume it's random and satisfies conditions)
np.random.seed(42)
df['Z'] = np.random.binomial(1, 0.5, size=len(df))

# Make treatment depend partly on Z
df['T_iv'] = (0.5 * df['Z'] + 0.5 * df['prior_engagement'] + np.random.normal(0, 0.1, len(df))) > 0.5
df['T_iv'] = df['T_iv'].astype(int)

# Stage 1: Predict treatment from instrument
X_stage1 = add_constant(df[['Z', 'age', 'income', 'prior_engagement']])
stage1_model = OLS(df['T_iv'], X_stage1).fit()
df['T_hat'] = stage1_model.predict(X_stage1)

# Stage 2: Predict outcome from predicted treatment
X_stage2 = add_constant(df[['T_hat', 'age', 'income', 'prior_engagement']])
stage2_model = OLS(df['Y_obs'], X_stage2).fit()

print(stage2_model.summary())

print(f"\nEstimated causal effect (via 2SLS): {stage2_model.params['T_hat']:.2f}")


                            OLS Regression Results                            
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.278e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:50:40   Log-Likelihood:                -15996.
No. Observations:                5000   AIC:                         3.200e+04
Df Residuals:                    4995   BIC:                         3.203e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               48.1296      0.474  

[Back to the top](#table-of-contents)
___


<a id="dml-methods"></a>
# 🧰 Double Machine Learning (DML)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Double Machine Learning (DML)</strong> is a modern causal estimation technique that:</p>

<ul>
  <li><strong>Separates</strong> the modeling of treatment and outcome.</li>
  <li><strong>Uses flexible machine learning models</strong> to control for complex confounders.</li>
  <li><strong>Debiases</strong> the final treatment effect estimation by orthogonalization.</li>
</ul>

<p><strong>Why DML matters:</strong></p>
<ul>
  <li>Traditional linear regression forces linearity.</li>
  <li>DML allows for nonlinear, high-dimensional adjustment without overfitting causal estimates.</li>
</ul>

<p>It builds robust treatment effect estimators even when you use ML methods like Random Forests, XGBoost, or Neural Nets for intermediate steps.</p>

</details>


<a id="ml-nuisance"></a>
#### 🪛 Use ML models for nuisance functions


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>In DML, you model two "nuisance functions" first:</p>
<ol>
  <li><strong>Outcome model</strong>: <code>Y ~ X</code></li>
  <li><strong>Treatment model</strong>: <code>T ~ X</code></li>
</ol>

<p>You can use <strong>any ML model</strong> (linear regression, random forest, gradient boosting, etc.) for these.</p>

<p><strong>Key point:</strong><br>
The goal is <strong>accurate prediction</strong>, not causal interpretation, at this stage.</p>

<p>Later, DML uses the residuals from these models to isolate the causal effect of <code>T</code> on <code>Y</code>.</p>

<p>This two-step process protects the final estimate from overfitting to noisy high-dimensional features.</p>

</details>


In [42]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Define features
features = ['age', 'income', 'prior_engagement']

# Split into train/test for honest estimation
X_train, X_test, y_train, y_test = train_test_split(df[features], df['Y_obs'], test_size=0.3, random_state=42)
T_train, T_test = train_test_split(df['T'], test_size=0.3, random_state=42)

# Outcome model: Y ~ X
y_model = RandomForestRegressor()
y_model.fit(X_train, y_train)
df['y_hat'] = y_model.predict(df[features])

# Treatment model: T ~ X
t_model = RandomForestRegressor()
t_model.fit(X_train, T_train)
df['t_hat'] = t_model.predict(df[features])


<a id="residualization"></a>
#### 🧱 Residualization + orthogonalization logic


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>After fitting nuisance models:</p>

<ul>
  <li>Calculate <strong>residuals</strong>:
    <ul>
      <li><code>Residual_Y = Y - Ŷ</code></li>
      <li><code>Residual_T = T - T̂</code></li>
    </ul>
  </li>

  <li>Then regress <strong>Residual_Y ~ Residual_T</strong>.</li>
</ul>

<p><strong>Why?</strong></p>
<ul>
  <li>This removes the part of <code>Y</code> and <code>T</code> that is predictable from <code>X</code>.</li>
  <li>What remains captures the <strong>"clean" causal variation</strong> of <code>T</code> on <code>Y</code>, orthogonal to confounders.</li>
</ul>

<p>This two-stage process is called <strong>orthogonalization</strong> — it minimizes bias from overfitting nuisance functions.</p>

<p>It’s a key innovation that separates DML from naive ML-based adjustment.</p>

</details>


In [43]:
# Calculate residuals
df['residual_Y'] = df['Y_obs'] - df['y_hat']
df['residual_T'] = df['T'] - df['t_hat']

# Final stage: regress residual_Y ~ residual_T
X_resid = sm.add_constant(df['residual_T'])
y_resid = df['residual_Y']

residual_model = sm.OLS(y_resid, X_resid).fit()

print(residual_model.summary())

print(f"\nEstimated causal effect via DML: {residual_model.params['residual_T']:.2f}")


                            OLS Regression Results                            
Dep. Variable:             residual_Y   R-squared:                       0.127
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     726.5
Date:                Sat, 26 Apr 2025   Prob (F-statistic):          1.62e-149
Time:                        17:50:41   Log-Likelihood:                -14928.
No. Observations:                5000   AIC:                         2.986e+04
Df Residuals:                    4998   BIC:                         2.987e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0842      0.068      1.243      0.2

<a id="dml-vs-regression"></a>
#### 🧲 When to prefer over traditional regression


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>You should prefer DML over traditional regression when:</p>

<ul>
  <li><strong>High-dimensional confounders</strong> (lots of features) exist.</li>
  <li><strong>Nonlinear relationships</strong> are likely between covariates and treatment/outcome.</li>
  <li><strong>Flexible modeling</strong> is important (tree-based, neural nets, etc.)</li>
  <li><strong>Concern about model misspecification</strong> in simple linear regression.</li>
</ul>

<p>Traditional regression assumes:</p>
<ul>
  <li>Linear relationships</li>
  <li>No complex interactions unless explicitly modeled</li>
</ul>

<p>DML frees you from strict parametric forms, allowing modern ML models while still aiming for valid causal estimates.</p>

<p>✅ DML shines in modern settings: tech products, healthcare, online platforms — where datasets are messy, rich, and big.</p>

</details>


[Back to the top](#table-of-contents)
___


<a id="heterogeneous-effects"></a>
# 🌈 Heterogeneous Treatment Effects


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Until now, we've talked about the <strong>average</strong> effect of treatment across the entire population (ATE).</p>

<p>But in reality:</p>
<ul>
  <li>Different users respond differently.</li>
  <li>Treatment effects <strong>vary</strong> by user characteristics.</li>
</ul>

<p><strong>Heterogeneous Treatment Effects</strong> (HTE) study how effects vary:</p>
<ul>
  <li>Across groups (e.g., high engagement vs low engagement)</li>
  <li>Across individuals (personalized effects)</li>
</ul>

<p>Estimating HTE is critical for:</p>
<ul>
  <li>Personalized recommendations</li>
  <li>Smart targeting (marketing, healthcare, product launches)</li>
</ul>

</details>


<a id="ate-cate-ite"></a>
#### 🎨 ATE vs. CATE vs. ITE


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Different layers of treatment effect granularity:</p>

<ul>
  <li><strong>ATE (Average Treatment Effect)</strong>:<br>
    Average effect across everyone.</li>

  <li><strong>CATE (Conditional Average Treatment Effect)</strong>:<br>
    Average effect <strong>given some subgroup</strong> (e.g., CATE for users &lt;30 years old).</li>

  <li><strong>ITE (Individual Treatment Effect)</strong>:<br>
    Effect for a <strong>specific user</strong>.</li>
</ul>

<hr>

<p><strong>In practice:</strong></p>
<ul>
  <li>ATE is easiest to estimate.</li>
  <li>CATEs are often actionable (targeted marketing).</li>
  <li>ITEs are the hardest — noisy and high-variance.</li>
</ul>

<p>Good causal inference methods can recover CATEs/ITEs <strong>if</strong> enough data and signal exist.</p>

</details>


In [44]:
# Calculate ATE (simple diff from simulation ground truth)
ate = (df['Y_1'] - df['Y_0']).mean()
print(f"True ATE: {ate:.2f}")

# Calculate CATE for high engagement group
cate_high_engagement = (df[df['prior_engagement'] > 0.5]['Y_1'] - df[df['prior_engagement'] > 0.5]['Y_0']).mean()
print(f"CATE for high engagement users: {cate_high_engagement:.2f}")

# Show a few ITEs
df['ITE_true'] = df['Y_1'] - df['Y_0']
print("\nSample ITEs:")
print(df[['age', 'income', 'prior_engagement', 'ITE_true']].head())


True ATE: 10.00
CATE for high engagement users: 10.00

Sample ITEs:
         age        income  prior_engagement  ITE_true
0  45.960570  53643.604770          0.188077      10.0
1  38.340828  53198.788374          0.170389      10.0
2  47.772262  33065.352411          0.511379      10.0
3  58.276358  55048.647124          0.318793      10.0
4  37.190160  70992.436227          0.384439      10.0


<a id="uplift-usecases"></a>
#### 🌟 Uplift models and use cases


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Uplift modeling</strong> directly models the <strong>difference in probability</strong> of a positive outcome between treated and untreated users.</p>

<p>Instead of modeling outcome probabilities separately, uplift models focus on:</p>
<ul>
  <li>Who is <strong>most persuadable</strong>?</li>
  <li>Who would change behavior because of treatment?</li>
</ul>

<p><strong>Where uplift models shine:</strong></p>
<ul>
  <li>Marketing campaigns (maximize conversions per dollar)</li>
  <li>Customer retention (target save offers only to those who would churn)</li>
  <li>Medical interventions (target high-risk patients)</li>
</ul>

<hr>

<p><strong>Typical techniques:</strong></p>
<ul>
  <li>Uplift Decision Trees</li>
  <li>Two-model approach (predict Y|T=1 and Y|T=0 separately, then subtract)</li>
  <li>Causal Forests</li>
</ul>

</details>


<a id="causal-forests"></a>
#### 🧩 Tree-based methods (Causal Trees, Causal Forests)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Tree-based methods are powerful for discovering treatment effect heterogeneity:</p>

<ul>
  <li><strong>Causal Trees</strong>:
    <ul>
      <li>Split data to maximize treatment effect differences between branches.</li>
      <li>One tree trained specifically for causal splits.</li>
    </ul>
  </li>

  <li><strong>Causal Forests</strong>:
    <ul>
      <li>Ensemble of causal trees.</li>
      <li>Averages treatment effect estimates across trees.</li>
      <li>Reduces variance compared to a single tree.</li>
    </ul>
  </li>
</ul>

<p>They can estimate <strong>CATEs</strong> reliably across different subgroups without manually specifying interactions.</p>

<hr>

<p><strong>When useful:</strong></p>
<ul>
  <li>You expect heterogeneity but don't know in advance how to segment.</li>
  <li>You want flexible, interpretable treatment effect estimation.</li>
</ul>

</details>


In [45]:
# !pip install econml

In [46]:
# Causal Forest: Full Correct Code

from econml.grf import CausalForest
from sklearn.model_selection import train_test_split

# Prepare features and outcome
X = df[['age', 'income', 'prior_engagement']].values  # Features (2D)
T = df['T'].values  # Treatment (1D)
Y = df['Y_obs'].values  # Observed outcome (1D)

# Fit causal forest
forest = CausalForest(n_estimators=100, random_state=42)
forest.fit(X, T, Y)  # Correct order: X, T, Y

# Predict treatment effects (CATEs)
cate_preds = forest.predict(X)

# Store predictions
df['CATE_predicted'] = cate_preds

# Display sample predictions
print("\nSample predicted CATEs:")
print(df[['age', 'income', 'prior_engagement', 'CATE_predicted']].head())



Sample predicted CATEs:
         age        income  prior_engagement  CATE_predicted
0  45.960570  53643.604770          0.188077       15.767683
1  38.340828  53198.788374          0.170389       18.062330
2  47.772262  33065.352411          0.511379       -4.625701
3  58.276358  55048.647124          0.318793       14.400047
4  37.190160  70992.436227          0.384439       11.047467


[Back to the top](#table-of-contents)
___


<a id="placebo-robustness"></a>
# 🧪 Placebo Tests & Robustness Checks


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Even after careful causal estimation, you must ask:</p>
<ul>
  <li>Was it a real effect?</li>
  <li>Could hidden bias still exist?</li>
</ul>

<p><strong>Robustness checks</strong> build confidence that your findings are not artifacts of modeling choices, random noise, or hidden confounders.</p>

<p><strong>Placebo tests</strong> simulate situations where you expect <strong>no effect</strong> — if you detect an effect there, something's wrong.</p>

<p>Robust causal analysis is not just about point estimates — it’s about <strong>proving to yourself that you aren't fooling yourself</strong>.</p>

</details>


<a id="placebo"></a>
#### 🧻 Randomized placebo treatments


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Placebo tests inject "fake" treatments to validate your method.</p>

<p><strong>Idea:</strong></p>
<ul>
  <li>Randomly assign a placebo treatment.</li>
  <li>Re-estimate the treatment effect.</li>
  <li>Expect <strong>no significant effect</strong> if your method is honest.</li>
</ul>

<p>If your model finds strong effects even when treatment is randomized, your pipeline is leaking bias or overfitting.</p>

<hr>

<p><strong>Placebo Tests Are Critical:</strong></p>
<ul>
  <li>They detect specification errors.</li>
  <li>They detect uncontrolled confounding.</li>
  <li>They expose overfitting to noise.</li>
</ul>

<p>Placebo tests are a basic but powerful check — always worth doing.</p>

</details>


In [47]:
# Create a random placebo treatment
np.random.seed(123)
df['placebo_T'] = np.random.binomial(1, 0.5, size=len(df))

# Estimate naive difference for placebo treatment
placebo_treated_mean = df.loc[df['placebo_T'] == 1, 'Y_obs'].mean()
placebo_control_mean = df.loc[df['placebo_T'] == 0, 'Y_obs'].mean()

placebo_naive_diff = placebo_treated_mean - placebo_control_mean

print(f"Placebo test: naive difference in means = {placebo_naive_diff:.2f}")

# Ideally close to zero if model is unbiased


Placebo test: naive difference in means = 7.79


<a id="robustness"></a>
#### ⚗️ Sensitivity to unobserved confounding


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Even after adjusting for observed confounders, <strong>unobserved variables</strong> can still bias causal estimates.</p>

<p><strong>Sensitivity analysis</strong> asks:</p>
<ul>
  <li>How strong would hidden confounding have to be to overturn my results?</li>
</ul>

<hr>

<p><strong>Typical approaches:</strong></p>
<ul>
  <li>Simulate hidden confounders and see effect size shifts.</li>
  <li>Use formulas (like Rosenbaum bounds) to quantify robustness.</li>
</ul>

<p>In practical data science:</p>
<ul>
  <li>Simulate scenarios with added fake bias.</li>
  <li>Stress test conclusions under "worst plausible" hidden biases.</li>
</ul>

<p>If your conclusions survive plausible levels of unobserved bias, they are more credible.</p>

<p><strong>Important mindset:</strong><br>
No analysis is perfect — the goal is to <strong>understand limits, not pretend away uncertainty</strong>.</p>

</details>


In [48]:
# Simulate a hidden confounder correlated with treatment and outcome
np.random.seed(42)
df['hidden_confounder'] = np.random.normal(0, 1, size=len(df))

# Make the outcome depend slightly on this hidden confounder
df['Y_obs_biased'] = df['Y_obs'] + 2 * df['hidden_confounder']

# Re-run naive difference with biased outcome
treated_mean_biased = df.loc[df['T'] == 1, 'Y_obs_biased'].mean()
control_mean_biased = df.loc[df['T'] == 0, 'Y_obs_biased'].mean()

biased_naive_diff = treated_mean_biased - control_mean_biased

print(f"Naive difference in means (with hidden confounding): {biased_naive_diff:.2f}")

# See how much bias was introduced
bias_inflation = biased_naive_diff - naive_diff
print(f"Inflation in estimate due to hidden confounder: {bias_inflation:.2f}")


Naive difference in means (with hidden confounding): 250.11
Inflation in estimate due to hidden confounder: -0.42


[Back to the top](#table-of-contents)
___


<a id="counterfactuals"></a>
# 🧬 Counterfactual Thinking


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Counterfactual thinking</strong> is the backbone of causal inference.</p>

<p>Instead of asking:</p>
<blockquote>"What happened?"</blockquote>

<p>We ask:</p>
<blockquote>"What <em>would have</em> happened if things were different?"</blockquote>

<p>In causal inference:</p>
<ul>
  <li>Each unit (user, patient, item) has two potential outcomes.</li>
  <li>Only one is observed.</li>
  <li>The other — the counterfactual — must be predicted or estimated.</li>
</ul>

<hr>

<p><strong>Counterfactual reasoning enables:</strong></p>
<ul>
  <li>Simulating user behavior under alternate scenarios.</li>
  <li>Personalizing interventions based on predicted outcomes.</li>
</ul>

<p>Without counterfactuals, causal inference is blind.</p>

</details>


<a id="what-if"></a>
#### 🤖 Predicting what would’ve happened


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Predicting counterfactuals means estimating:</p>
<ul>
  <li><code>Y(1)</code> for untreated units.</li>
  <li><code>Y(0)</code> for treated units.</li>
</ul>

<p>We use:</p>
<ul>
  <li>Machine learning models trained on observed data.</li>
  <li>Causal forests, meta-learners, and other counterfactual predictors.</li>
</ul>

<p><strong>Goal:</strong></p>
<ul>
  <li>Recover the missing potential outcome.</li>
  <li>Estimate <strong>individual treatment effects (ITEs)</strong>.</li>
</ul>

<p>This enables granular interventions — not just average effects across a population.</p>

<hr>

<p><strong>Important to remember:</strong><br>
Predicted counterfactuals are <strong>estimates</strong>, not direct observations — uncertainty always exists.</p>

</details>


In [49]:
# Predict counterfactual outcomes using Causal Forest
# (Already trained earlier, we use forest)

# Predict Y(1) and Y(0) separately
cate_preds = df['CATE_predicted'].values
baseline_preds = df['y_hat'].values  # From earlier outcome model

# Predict counterfactual outcomes
df['Y_cf_T1'] = baseline_preds + cate_preds  # Predicted outcome if treated
df['Y_cf_T0'] = baseline_preds  # Predicted outcome if untreated (baseline)

# Now simulate what would happen if treatment status flipped
df['counterfactual_outcome'] = np.where(
    df['T'] == 1,
    df['Y_cf_T0'],  # If treated, counterfactual is untreated
    df['Y_cf_T1']   # If untreated, counterfactual is treated
)

print("\nSample Counterfactual Predictions:")
print(df[['T', 'Y_obs', 'counterfactual_outcome']].head())



Sample Counterfactual Predictions:
   T        Y_obs  counterfactual_outcome
0  0  1114.502287             1127.868133
1  0  1099.893323             1119.847270
2  0   704.133035              697.394845
3  0  1139.203509             1151.640784
4  0  1464.308234             1475.218807


<a id="personalization"></a>
#### 🔁 Usage in recommendation & personalization


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Counterfactual predictions unlock personalization:</strong></p>

<p>Instead of treating everyone the same, you can:</p>
<ul>
  <li>Target users where treatment has highest predicted uplift.</li>
  <li>De-prioritize users who won't respond.</li>
</ul>

<p>Examples:</p>
<ul>
  <li><strong>Marketing</strong>: Show ads only to users likely to convert if nudged.</li>
  <li><strong>Healthcare</strong>: Prioritize interventions for patients who benefit most.</li>
  <li><strong>Products</strong>: Recommend features or promotions to maximize lift per user.</li>
</ul>

<p><strong>Strategic mindshift:</strong><br>
Focus on <strong>marginally persuadable users</strong>, not just overall averages.</p>

<p>Real-world use cases often combine:</p>
<ul>
  <li>Causal effect estimation (CATE/ITE)</li>
  <li>Ranking users by expected benefit</li>
  <li>Action prioritization based on counterfactuals</li>
</ul>

</details>


In [50]:
# Rank users by predicted CATE
df['priority_score'] = df['CATE_predicted']

# Top 5 users we should prioritize for treatment
top_users = df.sort_values('priority_score', ascending=False).head(5)

print("\nTop users to prioritize based on CATE:")
print(top_users[['age', 'income', 'prior_engagement', 'CATE_predicted']])



Top users to prioritize based on CATE:
            age        income  prior_engagement  CATE_predicted
2841  21.800289  53281.835326          0.350447       20.543173
724   42.050385  53280.579923          0.403388       20.443580
3205  31.876058  53300.354298          0.420334       20.411161
4614  40.670367  53244.477131          0.632127       20.275141
3711  24.210210  53209.647772          0.335915       20.259483


[Back to the top](#table-of-contents)
___


<a id="closing-notes"></a>
# 📌 Closing Notes


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>You now have a practical understanding of core causal inference techniques.</p>

<p>You should be able to:</p>
<ul>
  <li>Simulate data with confounding</li>
  <li>Estimate naive effects and detect bias</li>
  <li>Adjust using regression, matching, stratification</li>
  <li>Apply modern tools like DML and Causal Forests</li>
  <li>Think in terms of counterfactuals, not just correlations</li>
</ul>

<hr>

<p><strong>Remember:</strong><br>
Causal thinking is not just a technique — it’s a lens to see decision-making clearly.</p>

</details>


<a id="summary-table"></a>
#### 📝 Summary table of methods


<details><summary><strong>📉 Click to Expand</strong></summary>

<table>
<thead>
<tr>
<th>Method</th><th>When Useful</th><th>Strengths</th><th>Weaknesses</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Simple Diff-in-Means</strong></td><td>Randomized experiments</td><td>Easy, unbiased</td><td>Useless with confounding</td>
</tr>
<tr>
<td><strong>Regression Adjustment</strong></td><td>Observational data with measured confounders</td><td>Easy to implement</td><td>Model misspecification risk</td>
</tr>
<tr>
<td><strong>Stratification</strong></td><td>Small number of discrete confounders</td><td>Transparent</td><td>Breaks down in high dimensions</td>
</tr>
<tr>
<td><strong>Propensity Score Matching (PSM)</strong></td><td>Observational data with many confounders</td><td>Balances groups</td><td>Sensitive to model of treatment</td>
</tr>
<tr>
<td><strong>Instrumental Variables (IV)</strong></td><td>Unobserved confounders exist</td><td>Bypasses confounding</td><td>Hard to find good instruments</td>
</tr>
<tr>
<td><strong>Double Machine Learning (DML)</strong></td><td>High-dimensional nonlinear confounders</td><td>ML flexibility + debiasing</td><td>Needs lots of data, honest splits</td>
</tr>
<tr>
<td><strong>Causal Forests</strong></td><td>Heterogeneous treatment effects</td><td>Flexible CATE estimation</td><td>Complex, less interpretable</td>
</tr>
</tbody>
</table>

</details>


<a id="method-choice"></a>
#### 📋 When to use what


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Choosing a method depends on:</strong></p>

<ul>
  <li><strong>Randomization present?</strong>
    <ul>
      <li>Yes → Simple difference in means is fine.</li>
      <li>No → Need adjustment.</li>
    </ul>
  </li>
  
  <li><strong>Are confounders observed?</strong>
    <ul>
      <li>Yes → Regression, PSM, DML are options.</li>
      <li>No → Need IV or natural experiments.</li>
    </ul>
  </li>

  <li><strong>Do you expect heterogeneous effects?</strong>
    <ul>
      <li>Yes → Causal Trees, Causal Forests, Meta-learners.</li>
    </ul>
  </li>

  <li><strong>Is high-dimensional data involved?</strong>
    <ul>
      <li>Yes → Prefer DML over simple regression.</li>
    </ul>
  </li>
</ul>

<hr>

<p>Choosing the right method = matching the method to the bias and complexity in your data.</p>

</details>


<a id="causal-vs-predictive"></a>
#### 📎 Causal vs Predictive mindset


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Predictive modeling mindset:</strong></p>
<ul>
  <li>Focuses on fitting observed outcomes.</li>
  <li>Good for forecasts, risk scores, recommendation engines.</li>
  <li>Does not care about interventions.</li>
</ul>

<p><strong>Causal inference mindset:</strong></p>
<ul>
  <li>Focuses on <em>what would happen if we intervened</em>.</li>
  <li>Good for making decisions (policies, treatments, products).</li>
  <li>Requires stronger assumptions, careful design.</li>
</ul>

<hr>

<p><strong>Key difference:</strong><br>
Predictive models can be accurate yet useless for interventions.<br>
Causal models are harder but necessary to make confident decisions.</p>

<hr>

<p><strong>Quote to remember:</strong></p>
<blockquote>"All models are wrong. Some models are useful.  
Only causal models are useful for actions."</blockquote>

</details>


[Back to the top](#table-of-contents)
___
