![Status: In Progress](https://img.shields.io/badge/status-in--progress-yellow)
![Python](https://img.shields.io/badge/python-3.10-blue)
![Coverage](https://img.shields.io/badge/coverage-70%25-yellowgreen)
![License](https://img.shields.io/badge/license-MIT-green)

<!-- ![Status: Complete](https://img.shields.io/badge/status-complete-brightgreen)
![Python](https://img.shields.io/badge/python-3.10-blue)
![Coverage](https://img.shields.io/badge/coverage-95%25-yellowgreen)
![License](https://img.shields.io/badge/license-MIT-green) -->

<a id="table-of-contents"></a>
# 🧭 Causal Inference

- [🎯 Introduction to Causal Inference](#intro)
  - [🎓 What is Causal Inference?](#what-is-causal)
  - [📌 Why go beyond Correlation?](#why-correlation)
  - [🧭 Real-world problems that need causality](#real-world-examples)

- [🧠 Core Concepts & Notation](#notation-assumptions)
  - [🧮 Treatment, Outcome, Units](#treatment-outcome-units)
  - [📐 Potential Outcomes (Rubin Causal Model)](#potential-outcomes)
  - [🧵 Fundamental Problem of Causal Inference](#fundamental-problem)
  - [🧠 Assumptions (SUTVA, Ignorability, Overlap)](#core-assumptions)

- [🧪 Simulated Dataset Setup](#simulated-data)
  - [🧬 Define treatment assignment logic](#treatment-logic)
  - [🔬 Inject confounding intentionally](#inject-confounding)
  - [🧊 Simulate potential outcomes + observed data](#simulate-outcomes)

- [🚫 Naive Estimation](#naive-estimation)
  - [❌ Simple difference in means](#diff-in-means)
  - [⚠️ Bias due to confounding](#bias-confounding)

- [🕸️ Causal Diagrams (DAGs)](#causal-diagrams)
  - [🧿 Quick primer on DAGs](#primer-dags)
  - [🕷️ Confounding vs. colliders vs. mediators](#confounder-collider-mediator)
  - [🔗 What can/can’t be estimated just from data](#estimability-from-dags)

- [🔍 Backdoor Adjustment Methods](#backdoor-adjustment)
  - [🧾 Conditioning on confounders](#conditioning)
  - [🕵️‍♂️ Stratification / Subgroup analysis](#stratification)
  - [📊 Regression Adjustment](#regression-adjustment)
  - [📌 Propensity Score Matching (PSM)](#psm)

- [🎯 Instrumental Variables (IV)](#iv-methods)
  - [🪝 When backdoor paths can’t be blocked](#when-use-iv)
  - [🎯 Valid instrument conditions](#iv-conditions)
  - [🧩 2-Stage Least Squares (2SLS)](#2sls)

- [🧰 Double Machine Learning (DML)](#dml-methods)
  - [🪛 Use ML models for nuisance functions](#ml-nuisance)
  - [🧱 Residualization + orthogonalization logic](#residualization)
  - [🧲 When to prefer over traditional regression](#dml-vs-regression)

- [🌈 Heterogeneous Treatment Effects](#heterogeneous-effects)
  - [🎨 ATE vs. CATE vs. ITE](#ate-cate-ite)
  - [🌟 Uplift models and use cases](#uplift-usecases)
  - [🧩 Tree-based methods (Causal Trees, Causal Forests)](#causal-forests)

- [🧪 Placebo Tests & Robustness Checks](#placebo-robustness)
  - [🧻 Randomized placebo treatments](#placebo)
  - [⚗️ Sensitivity to unobserved confounding](#robustness)

- [🧬 Counterfactual Thinking](#counterfactuals)
  - [🤖 Predicting what would’ve happened](#what-if)
  - [🔁 Usage in recommendation & personalization](#personalization)

- [📌 Closing Notes](#closing-notes)
  - [📝 Summary table of methods](#summary-table)
  - [📋 When to use what](#method-choice)
  - [📎 Causal vs Predictive mindset](#causal-vs-predictive)

___

<a id="intro"></a>
# 🎯 Introduction to Causal Inference


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>🧠 Why this Notebook</h5>

<p>Causal inference gives us the tools to answer "what if" questions — not just "what is." In product, policy, medicine, and science, we often need to <strong>act</strong>, and actions require understanding their consequences.</p>

<p>This field helps us:</p>
<ul>
  <li>Understand <strong>how</strong> and <strong>why</strong> outcomes change.</li>
  <li>Move from data <em>descriptions</em> to data-<em>driven interventions</em>.</li>
  <li>Avoid the trap of chasing noisy correlations.</li>
</ul>

<p>This notebook is a build-up from first principles to practical methods — with enough grounding to reason about experiments, models, and their assumptions clearly.</p>

</details>


<a id="what-is-causal"></a>
#### 🎓 What is Causal Inference?


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>🎓 Causal inference is the process of estimating the <strong>effect</strong> of one variable (the treatment) on another (the outcome), holding all else constant.</h5>

<p>The core idea is to estimate:</p>
<blockquote>What would the outcome have been if the treatment had (or had not) occurred?</blockquote>

<p>Unlike correlation or predictive modeling:</p>
<ul>
  <li>It asks <strong>counterfactual</strong> questions — what <em>would</em> have happened under different scenarios.</li>
  <li>It requires <strong>assumptions</strong>, <strong>design</strong>, and often <strong>randomization</strong> or clever statistical tricks.</li>
</ul>

<p>At its heart, causal inference is about:</p>
<ul>
  <li>Designing better <strong>interventions</strong></li>
  <li>Estimating <strong>treatment effects</strong></li>
  <li>Avoiding misleading <strong>associational patterns</strong></li>
</ul>

</details>


<a id="why-correlation"></a>
#### 📌 Why go beyond Correlation?


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>📌 Correlation can be dangerous when used as a proxy for causation.</h5>

<p>Example: Ice cream sales are correlated with shark attacks. Should we ban dessert?  
Clearly not — they’re both caused by heatwaves (a confounder).</p>

<p>Correlation fails because it:</p>
<ul>
  <li>Ignores <strong>confounders</strong> (common causes of both variables)</li>
  <li>Misses <strong>directionality</strong> (what affects what)</li>
  <li>Can be driven by <strong>reverse causation</strong> or <strong>coincidence</strong></li>
</ul>

<p>Causal inference gives tools to:</p>
<ul>
  <li><strong>Identify</strong> confounding</li>
  <li><strong>Design</strong> better studies (randomized or observational)</li>
  <li><strong>Interpret</strong> results in terms of actionable causes</li>
</ul>

</details>


<a id="real-world-examples"></a>
#### 🧭 Real-world problems that need causality


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>🧭 Correlation might be fine for dashboards. But when making <strong>decisions</strong>, causality is non-negotiable.</h5>

<p>Examples:</p>
<ul>
  <li><strong>Product</strong>: Did that new button placement increase checkout, or was it a seasonal effect?</li>
  <li><strong>Marketing</strong>: Did the email nudge lead to purchases, or did loyal users open it anyway?</li>
  <li><strong>Policy</strong>: Did a tax cut help the economy, or was it already improving?</li>
  <li><strong>Health</strong>: Does a drug reduce disease, or do healthier people tend to take it?</li>
</ul>

<p>These questions involve <strong>interventions</strong>, and only causal methods can tell us what would’ve happened under a different choice.</p>

</details>


[Back to the top](#table-of-contents)
___


<a id="notation-assumptions"></a>
# 🧠 Core Concepts & Notation


<a id="treatment-outcome-units"></a>
#### 🧮 Treatment, Outcome, Units


<details><summary><strong>📉 Click to Expand</strong></summary>

<ul>
  <li><strong>Treatment (<code>T</code>)</strong>: The intervention or condition being tested (e.g., new design, drug, policy).</li>
  <li><strong>Outcome (<code>Y</code>)</strong>: The result or metric affected by the treatment (e.g., click, recovery, score).</li>
  <li><strong>Units (<code>i</code>)</strong>: The entities receiving treatment and producing outcomes (e.g., users, patients, schools).</li>
</ul>

<p>Each unit can receive a treatment or control, and we observe only one outcome — not both.</p>

<p>This framing is universal and applies whether you're testing emails, ads, or vaccines.</p>

</details>


<a id="potential-outcomes"></a>
#### 📐 Potential Outcomes (Rubin Causal Model)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>The <strong>Potential Outcomes framework</strong> (aka Rubin Causal Model) imagines two parallel worlds for each unit:</p>

<ul>
  <li><code>Y(1)</code>: Outcome if treated</li>
  <li><code>Y(0)</code>: Outcome if not treated</li>
</ul>

<p>We define <strong>Individual Treatment Effect (ITE)</strong> as:</p>

<blockquote>ITE = Y(1) - Y(0)</blockquote>

<p><strong>Key idea:</strong></p>

<p>Each unit has both potential outcomes — but we can only observe one. The other is <strong>counterfactual</strong>.</p>

<p>This framework allows us to define:</p>

<ul>
  <li>ATE (Average Treatment Effect)</li>
  <li>CATE (Conditional ATE, for subgroups)</li>
</ul>

<p>And formalizes why causal inference is hard: we never see both outcomes.</p>

</details>


<a id="fundamental-problem"></a>
#### 🧵 Fundamental Problem of Causal Inference


<details><summary><strong>📉 Click to Expand</strong></summary>

<h5>The <strong>Fundamental Problem of Causal Inference</strong>:</h5>

<blockquote>For any individual, we can observe only one potential outcome — never both.</blockquote>

<p>Example:</p>
<ul>
  <li>A user sees version A → you observe <code>Y(0)</code></li>
  <li>You’ll never know what <code>Y(1)</code> would have been for that exact user</li>
</ul>

<p>This creates a missing data problem: the counterfactual is unobservable.</p>

<p>To solve this, we rely on:</p>
<ul>
  <li><strong>Randomization</strong></li>
  <li><strong>Modeling + assumptions</strong></li>
  <li><strong>Matching or weighting approaches</strong></li>
</ul>

<p>All causal methods are, in some way, trying to <strong>approximate the missing counterfactual</strong>.</p>

</details>


<a id="core-assumptions"></a>
#### 🧠 Assumptions (SUTVA, Ignorability, Overlap)


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Causal inference relies heavily on assumptions — even when you don’t randomize.</p>

<h5>Three core ones:</h5>

<ul>
  <li><strong>SUTVA (Stable Unit Treatment Value Assumption)</strong><br>
  → Your treatment doesn’t affect someone else’s outcome.<br>
  → No interference across units.</li>

  <li><strong>Ignorability (a.k.a. Unconfoundedness)</strong><br>
  → Given the observed covariates, treatment assignment is as good as random.<br>
  → This lets you use observed data for estimation.</li>

  <li><strong>Overlap (a.k.a. Positivity)</strong><br>
  → Every unit has a non-zero probability of receiving either treatment.<br>
  → You can’t learn effects where there’s no variation.</li>
</ul>

<p>Without these, causal estimates can be biased or undefined. Always question whether they hold before trusting results.</p>

</details>


[Back to the top](#table-of-contents)
___


<a id="simulated-data"></a>
# 🧪 Simulated Dataset Setup


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Simulating data is the best way to <em>control the truth</em> when learning causal inference.</p>

<p>Here’s why we simulate:</p>
<ul>
  <li>You get full knowledge of ground-truth treatment effects.</li>
  <li>You can deliberately create <strong>confounding</strong>, <strong>bias</strong>, <strong>non-randomness</strong>.</li>
  <li>You can practice recovering the true causal effect using different methods.</li>
</ul>

<p>In real-world observational data, the "truth" is hidden. Simulating lets you debug your causal intuition safely before dealing with messy production datasets.</p>

</details>


In [1]:
# Simulated Dataset Setup
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

# We'll define features, treatment assignment, and outcomes step-by-step later


<a id="treatment-logic"></a>
#### 🧬 Define treatment assignment logic


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>To simulate treatment realistically:</p>
<ul>
  <li>Treatment should <strong>depend</strong> on observed features.</li>
  <li>Treatment <strong>should not</strong> be random — otherwise, no confounding to deal with.</li>
</ul>

<p>For example:</p>
<ul>
  <li>Wealthier users might be more likely to receive a premium offer.</li>
  <li>Healthier patients might be less likely to receive intensive care.</li>
</ul>

<p>We’ll simulate a <strong>non-random treatment assignment</strong> based on a few covariates to mimic real-world biases.</p>

</details>


In [2]:
# Define covariates (features)
n = 5000

age = np.random.normal(40, 12, n)       # Age
income = np.random.normal(60000, 15000, n)  # Annual income
prior_engagement = np.random.beta(2, 5, n)  # Past engagement score [0,1]

# Treatment assignment probability based on features
treatment_prob = (
    0.3 * (income > 70000).astype(float) +
    0.2 * (prior_engagement > 0.5).astype(float) +
    0.1 * (age < 30).astype(float) +
    np.random.normal(0, 0.05, n)  # small noise
)
treatment_prob = np.clip(treatment_prob, 0, 1)

# Assign treatment
T = np.random.binomial(1, treatment_prob)


<a id="inject-confounding"></a>
#### 🔬 Inject confounding intentionally


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>In real-world datasets, treatment assignment is <strong>not random</strong> — it’s confounded by covariates.</p>

<p>We deliberately inject confounding so that:</p>
<ul>
  <li>Covariates (age, income, engagement) affect both <strong>treatment</strong> and <strong>outcome</strong>.</li>
  <li>If we naively compare treated vs untreated, we'll get biased results.</li>
</ul>

<p>Confounding creates the need for adjustment, which will be a major theme later.</p>

</details>


In [3]:
# Let's define a "true" baseline outcome based on the same covariates
base_outcome = (
    50 + 
    0.02 * income +
    5 * prior_engagement -
    0.3 * age +
    np.random.normal(0, 5, n)  # random noise
)


<a id="simulate-outcomes"></a>
#### 🧊 Simulate potential outcomes + observed data


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>In the Rubin Potential Outcomes framework, each unit has two outcomes:</p>
<ul>
  <li><code>Y(1)</code> → If treated</li>
  <li><code>Y(0)</code> → If not treated</li>
</ul>

<p>We can simulate this by:</p>
<ul>
  <li>Applying a <strong>true treatment effect</strong> to <code>Y(1)</code></li>
  <li>Leaving <code>Y(0)</code> as the base outcome</li>
</ul>

<p><strong>Important:</strong> We observe only one of <code>Y(1)</code> or <code>Y(0)</code>, depending on treatment assignment (<code>T</code>).</p>

</details>


In [4]:
# Define a true treatment effect (could vary by subgroup later)
true_treatment_effect = 10  # a flat +10 effect for everyone

# Simulate potential outcomes
Y_0 = base_outcome
Y_1 = base_outcome + true_treatment_effect

# Observed outcome based on treatment assignment
Y_obs = T * Y_1 + (1 - T) * Y_0

# Assemble into a dataframe
df = pd.DataFrame({
    'age': age,
    'income': income,
    'prior_engagement': prior_engagement,
    'T': T,
    'Y_obs': Y_obs,
    'Y_0': Y_0,
    'Y_1': Y_1,
})

df.head()


Unnamed: 0,age,income,prior_engagement,T,Y_obs,Y_0,Y_1
0,45.96057,53643.60477,0.188077,0,1114.502287,1114.502287,1124.502287
1,38.340828,53198.788374,0.170389,0,1099.893323,1099.893323,1109.893323
2,47.772262,33065.352411,0.511379,0,704.133035,704.133035,714.133035
3,58.276358,55048.647124,0.318793,0,1139.203509,1139.203509,1149.203509
4,37.19016,70992.436227,0.384439,0,1464.308234,1464.308234,1474.308234


[Back to the top](#table-of-contents)
___


<a id="naive-estimation"></a>
# 🚫 Naive Estimation


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>Many causal questions are first attacked by simply comparing the treated vs. untreated groups.</p>

<p><strong>Naive Approach:</strong></p>
<blockquote>Average outcome of treated - Average outcome of untreated.</blockquote>

<p>This looks simple, but in observational data:</p>
<ul>
  <li>Treated and untreated units <strong>are not comparable</strong>.</li>
  <li>Treatment assignment was <strong>not randomized</strong>.</li>
  <li>Differences in baseline characteristics confound the simple difference.</li>
</ul>

<p>Naive estimation <strong>almost always gives biased results</strong> unless you have perfect randomization.</p>

<p>In this section, we'll see how bad the naive approach can get even on a simple synthetic dataset.</p>

</details>


In [5]:
# Quick naive estimation
treated_mean = df.loc[df['T'] == 1, 'Y_obs'].mean()
control_mean = df.loc[df['T'] == 0, 'Y_obs'].mean()

naive_diff = treated_mean - control_mean

print(f"Naive difference in means: {naive_diff:.2f}")
print(f"True treatment effect (ground truth): {true_treatment_effect}")


Naive difference in means: 250.53
True treatment effect (ground truth): 10


<a id="diff-in-means"></a>
#### ❌ Simple difference in means


<details><summary><strong>📉 Click to Expand</strong></summary>

<p>A simple difference in means is mathematically:</p>
<blockquote><code>E[Y | T=1] - E[Y | T=0]</code></blockquote>

<p>If treatment assignment were random:</p>
<ul>
  <li>The two groups would be exchangeable.</li>
  <li>Baseline covariates would balance on average.</li>
  <li>The simple difference would be an unbiased estimator of ATE.</li>
</ul>

<p>But if treatment is <strong>confounded</strong>, then:</p>
<ul>
  <li><code>T=1</code> units may systematically differ from <code>T=0</code> units.</li>
  <li>The naive estimator picks up both <strong>causal effect</strong> and <strong>selection bias</strong>.</li>
</ul>

<p>We’ll soon quantify how large this bias can be.</p>

</details>


In [6]:
# Let's quickly visualize how the treated vs control groups differ in covariates
df.groupby('T')[['age', 'income', 'prior_engagement']].mean()


Unnamed: 0_level_0,age,income,prior_engagement
T,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,40.389926,58304.201745,0.278891
1,37.892227,70283.224987,0.330779


<a id="bias-confounding"></a>
#### ⚠️ Bias due to confounding


<details><summary><strong>📉 Click to Expand</strong></summary>

<p><strong>Bias from confounding</strong> happens when:</p>
<ul>
  <li>The treated group has systematically different baseline outcomes than the control group.</li>
</ul>

<p>Mathematically:</p>
<blockquote>Observed difference = True treatment effect + Bias from baseline differences</blockquote>

<p>In our simulation:</p>
<ul>
  <li>Higher income users are more likely to be treated.</li>
  <li>Income also directly influences outcome.</li>
  <li>Therefore, the observed difference <strong>overstates</strong> the real effect.</li>
</ul>

<p>This is why adjusting for confounders is critical — naive methods can easily mislead interventions and business decisions.</p>

</details>


In [7]:
# Let's calculate the *average baseline outcome* (Y_0) in treated vs untreated groups
treated_baseline = df.loc[df['T'] == 1, 'Y_0'].mean()
control_baseline = df.loc[df['T'] == 0, 'Y_0'].mean()

baseline_diff = treated_baseline - control_baseline

print(f"Baseline (Y_0) difference between treated and control: {baseline_diff:.2f}")
print("This baseline imbalance creates bias in naive estimation.")


Baseline (Y_0) difference between treated and control: 240.53
This baseline imbalance creates bias in naive estimation.


[Back to the top](#table-of-contents)
___


<a id="causal-diagrams"></a>
# 🕸️ Causal Diagrams (DAGs)


<details><summary><strong>📉 Click to Expand</strong></summary>

**Directed Acyclic Graphs (DAGs)** are a compact way to represent assumptions about the data generating process.

- Nodes = variables
- Edges (arrows) = direct causal influence

DAGs are not learned from data. They are **drawn from domain knowledge** to help reason about:
- Confounders
- Biases
- Valid adjustment strategies

Almost every causal inference method implicitly or explicitly assumes a DAG about the world.

</details>


<a id="primer-dags"></a>
#### 🧿 Quick primer on DAGs


<details><summary><strong>📉 Click to Expand</strong></summary>

A **DAG (Directed Acyclic Graph)** encodes assumptions about how variables causally relate.

- **Directed**: Arrows have direction (cause → effect).
- **Acyclic**: No feedback loops allowed (you can’t return to a node).

Example:
> Age → Income → Health

Means:
- Age affects income.
- Income affects health.
- No reverse paths.

DAGs help identify:
- Which paths are confounded
- Which variables to control for
- Whether effects are identifiable

They act like a **map** — letting you plan causal estimation strategies intelligently.

</details>


<a id="confounder-collider-mediator"></a>
#### 🕷️ Confounding vs. colliders vs. mediators


<details><summary><strong>📉 Click to Expand</strong></summary>

**Confounders**:
- Variables that influence both treatment and outcome.
- Must be adjusted for to block bias.
- Example: Age confounds the relationship between Exercise (T) and Health (Y).

**Colliders**:
- Variables caused by two other variables.
- **Must NOT adjust for colliders** — doing so opens spurious associations.
- Example: Adjusting for "hospitalization" might introduce bias when studying Smoking → Lung Disease.

**Mediators**:
- Variables on the causal pathway between treatment and outcome.
- Adjusting for them **blocks part of the causal effect** you want to measure.
- Example: Exercise → Fitness → Health (fitness is a mediator).

👉 Correct adjustment requires distinguishing among these roles.

</details>


<a id="estimability-from-dags"></a>
#### 🔗 What can/can’t be estimated just from data


<details><summary><strong>📉 Click to Expand</strong></summary>

Not everything is identifiable from data alone — assumptions are unavoidable.

**What can be estimated**:
- Associations (correlations, patterns)
- Conditional independence structures
- Causal effects **if** the right covariates are controlled (based on DAG structure)

**What cannot be estimated**:
- Whether a relationship is causal (without assumptions)
- The full structure of a DAG (unless randomized experiments are used)

Data + assumptions → Causal conclusions.  
Data alone → Only correlational findings.

DAGs clarify where you need domain knowledge vs where data suffices.

</details>


[Back to the top](#table-of-contents)
___


<a id="backdoor-adjustment"></a>
# 🔍 Backdoor Adjustment Methods


<details><summary><strong>📉 Click to Expand</strong></summary>

Backdoor adjustment methods aim to block **backdoor paths** — non-causal paths that create bias between treatment and outcome.

**Core idea**:
- Identify variables (confounders) that open backdoor paths.
- Condition on them — either by stratifying, modeling, or matching.

Backdoor adjustment **simulates** what would happen if treatment assignment were random within levels of the confounders.

It’s the foundational idea behind:
- Regression
- Matching
- Stratification
- Propensity scores

If you can block all backdoor paths, you can estimate causal effects from observational data reliably.

</details>


<a id="conditioning"></a>
#### 🧾 Conditioning on confounders


<details><summary><strong>📉 Click to Expand</strong></summary>

Conditioning means **holding confounders constant** when comparing treated vs untreated units.

Examples:
- Comparing treated vs untreated users **within each income band**.
- Comparing recovery rates **within each age group**.

By conditioning, you eliminate the variation due to confounders, isolating the causal effect.

**Important:**  
You should only condition on true confounders — not colliders or mediators.

Conditioning can be implemented via:
- Subgrouping
- Regression
- Matching
- Weighting

</details>


In [8]:
# Simple conditioning by subgroup (example: income > 70k vs <= 70k)
df['high_income'] = (df['income'] > 70000).astype(int)

grouped = df.groupby(['high_income', 'T'])['Y_obs'].mean().unstack()
grouped['diff'] = grouped[1] - grouped[0]

print("Simple Conditional Difference by Income Group:")
print(grouped[['diff']])


Simple Conditional Difference by Income Group:
T                 diff
high_income           
0            16.867043
1            13.563878


<a id="stratification"></a>
#### 🕵️‍♂️ Stratification / Subgroup analysis


<details><summary><strong>📉 Click to Expand</strong></summary>

Stratification means **breaking the dataset into buckets** based on confounders and comparing treatment effects within each bucket.

Typical steps:
1. Divide data based on a confounder (e.g., low vs high engagement).
2. Within each stratum, compute treated vs control differences.
3. Aggregate across strata (weighted average).

**When useful:**
- When confounders are categorical or easily discretized.
- When interpretability is important.

**Limits:**
- Doesn’t scale well with many confounders (curse of dimensionality).

</details>


In [9]:
# Stratify based on prior_engagement (simple high vs low)
df['high_engagement'] = (df['prior_engagement'] > 0.5).astype(int)

strat_grouped = df.groupby(['high_engagement', 'T'])['Y_obs'].mean().unstack()
strat_grouped['diff'] = strat_grouped[1] - strat_grouped[0]

print("Conditional Difference by Engagement Group:")
print(strat_grouped[['diff']])


Conditional Difference by Engagement Group:
T                      diff
high_engagement            
0                290.043114
1                145.885304


<a id="regression-adjustment"></a>
#### 📊 Regression Adjustment


<details><summary><strong>📉 Click to Expand</strong></summary>

Regression adjustment estimates causal effects by **controlling for confounders via regression**.

Simple linear model:
> `Y = β₀ + β₁·T + β₂·(confounder1) + β₃·(confounder2) + ... + ε`

- `β₁` captures the **adjusted** effect of treatment, controlling for confounders.
- It removes bias from observable confounders (under correct model specification).

**Advantages:**
- Easy to use.
- Scales to many covariates.

**Risks:**
- Sensitive to model misspecification.
- Wrong functional forms (nonlinearities, interactions) can bias estimates.

</details>


In [10]:
import statsmodels.api as sm

X = df[['T', 'age', 'income', 'prior_engagement']]
X = sm.add_constant(X)
y = df['Y_obs']

reg_model = sm.OLS(y, X).fit()

print(reg_model.summary())

print(f"\nEstimated treatment effect (β₁) after adjustment: {reg_model.params['T']:.2f}")


                            OLS Regression Results                            
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.652e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:22:58   Log-Likelihood:                -15122.
No. Observations:                5000   AIC:                         3.025e+04
Df Residuals:                    4995   BIC:                         3.029e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               50.3660      0.398  

<a id="psm"></a>
#### 📌 Propensity Score Matching (PSM)


<details><summary><strong>📉 Click to Expand</strong></summary>

Propensity Score Matching (PSM) is a two-step procedure:
1. Model the **probability of receiving treatment** (`P(T=1 | X)`) using observed covariates.
2. Match treated and control units with **similar propensity scores**.

**Why PSM?**
- Instead of adjusting for many covariates separately, you balance treated and control groups on a single dimension (the propensity score).

**When useful:**
- When covariate space is high-dimensional.
- When you want a matched sample that resembles randomized data.

**Limitations:**
- Requires good overlap (common support).
- Still relies on unconfoundedness assumption.

</details>


In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Step 1: Estimate propensity scores
ps_model = LogisticRegression()
ps_model.fit(df[['age', 'income', 'prior_engagement']], df['T'])
df['propensity_score'] = ps_model.predict_proba(df[['age', 'income', 'prior_engagement']])[:,1]

# Step 2: Nearest neighbor matching
treated = df[df['T'] == 1]
control = df[df['T'] == 0]

nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]

# Calculate matched difference
matched_diff = (treated['Y_obs'].values - matched_control['Y_obs'].values).mean()

print(f"Propensity Score Matched Estimate of Treatment Effect: {matched_diff:.2f}")


Propensity Score Matched Estimate of Treatment Effect: 197.50


[Back to the top](#table-of-contents)
___


<a id="iv-methods"></a>
# 🎯 Instrumental Variables (IV)


<details><summary><strong>📉 Click to Expand</strong></summary>

Instrumental Variables (IV) methods are used **when simple adjustment for confounders is impossible** or **not credible**.

When treatment is **endogenous** (affected by unobserved factors also affecting outcome), traditional methods like regression fail.

**IV solves this by:**
- Using a "proxy" (instrument) that affects treatment but is otherwise unrelated to the outcome except through treatment.
- "Re-randomizing" variation in treatment based on the instrument.

You create **quasi-randomization** even in observational data.

Classic examples:
- Distance to hospital → instrument for getting surgery.
- Random assignment of judges → instrument for harsher sentencing.

</details>


<a id="when-use-iv"></a>
#### 🪝 When backdoor paths can’t be blocked


<details><summary><strong>📉 Click to Expand</strong></summary>

You need IV methods when:
- There are **unobserved confounders** you can’t measure.
- No set of observed covariates satisfies ignorability.
- Standard backdoor adjustment will be biased.

Example:
- Studying the effect of education on income: natural intelligence is a hidden confounder (affects both education and income).
- You can't just regress income ~ education — bias remains.

**Key realization:**
If **backdoor paths exist** through unobserved variables, IV becomes necessary.

</details>


<a id="iv-conditions"></a>
#### 🎯 Valid instrument conditions


<details><summary><strong>📉 Click to Expand</strong></summary>

For an instrument (`Z`) to be valid, it must satisfy:

1. **Relevance**:  
   - `Z` must affect treatment `T`.  
   (There must be a first-stage effect.)

2. **Exclusion Restriction**:  
   - `Z` must affect the outcome `Y` **only** through `T`.  
   (No direct path from `Z` to `Y`.)

3. **Independence (As-if Randomness)**:  
   - `Z` must be independent of unobserved confounders affecting `Y`.

---

If any of these fail:
- IV estimates are biased or meaningless.
- You can’t fix bad instruments with bigger sample sizes.

**Choosing or arguing a valid instrument is 90% of the IV battle.**

</details>


<a id="2sls"></a>
#### 🧩 2-Stage Least Squares (2SLS)


<details><summary><strong>📉 Click to Expand</strong></summary>

**2SLS (Two-Stage Least Squares)** is the classic estimation procedure for IV:

- **Stage 1**:  
  Regress treatment `T` on instrument `Z` (and any controls)  
  → get predicted treatment (`T̂`)

- **Stage 2**:  
  Regress outcome `Y` on predicted treatment (`T̂`)

The second-stage coefficient gives the **causal effect** of treatment on outcome, isolating variation driven by the instrument.

**Warning:**  
- Standard regression software doesn't correct standard errors properly when doing 2SLS manually.  
- Later packages like `linearmodels` automate this.

</details>


In [12]:
from statsmodels.api import OLS, add_constant

# Simulate an instrument Z (let's assume it's random and satisfies conditions)
np.random.seed(42)
df['Z'] = np.random.binomial(1, 0.5, size=len(df))

# Make treatment depend partly on Z
df['T_iv'] = (0.5 * df['Z'] + 0.5 * df['prior_engagement'] + np.random.normal(0, 0.1, len(df))) > 0.5
df['T_iv'] = df['T_iv'].astype(int)

# Stage 1: Predict treatment from instrument
X_stage1 = add_constant(df[['Z', 'age', 'income', 'prior_engagement']])
stage1_model = OLS(df['T_iv'], X_stage1).fit()
df['T_hat'] = stage1_model.predict(X_stage1)

# Stage 2: Predict outcome from predicted treatment
X_stage2 = add_constant(df[['T_hat', 'age', 'income', 'prior_engagement']])
stage2_model = OLS(df['Y_obs'], X_stage2).fit()

print(stage2_model.summary())

print(f"\nEstimated causal effect (via 2SLS): {stage2_model.params['T_hat']:.2f}")


                            OLS Regression Results                            
Dep. Variable:                  Y_obs   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.278e+06
Date:                Sat, 26 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:24:24   Log-Likelihood:                -15996.
No. Observations:                5000   AIC:                         3.200e+04
Df Residuals:                    4995   BIC:                         3.203e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               48.1296      0.474  

[Back to the top](#table-of-contents)
___


<a id="dml-methods"></a>
# 🧰 Double Machine Learning (DML)


<details><summary><strong>📉 Click to Expand</strong></summary>

Double Machine Learning (DML) is a modern causal estimation technique that:

- **Separates** the modeling of treatment and outcome.
- **Uses flexible machine learning models** to control for complex confounders.
- **Debiases** the final treatment effect estimation by orthogonalization.

**Why DML matters:**
- Traditional linear regression forces linearity.
- DML allows for nonlinear, high-dimensional adjustment without overfitting causal estimates.

It builds robust treatment effect estimators even when you use ML methods like Random Forests, XGBoost, or Neural Nets for intermediate steps.

</details>


<a id="ml-nuisance"></a>
#### 🪛 Use ML models for nuisance functions


<details><summary><strong>📉 Click to Expand</strong></summary>

In DML, you model two "nuisance functions" first:
1. **Outcome model**: `Y ~ X`
2. **Treatment model**: `T ~ X`

You can use **any ML model** (linear regression, random forest, gradient boosting, etc.) for these.

**Key point:**  
The goal is **accurate prediction**, not causal interpretation, at this stage.

Later, DML uses the residuals from these models to isolate the causal effect of `T` on `Y`.

This two-step process protects the final estimate from overfitting to noisy high-dimensional features.

</details>


In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Define features
features = ['age', 'income', 'prior_engagement']

# Split into train/test for honest estimation
X_train, X_test, y_train, y_test = train_test_split(df[features], df['Y_obs'], test_size=0.3, random_state=42)
T_train, T_test = train_test_split(df['T'], test_size=0.3, random_state=42)

# Outcome model: Y ~ X
y_model = RandomForestRegressor()
y_model.fit(X_train, y_train)
df['y_hat'] = y_model.predict(df[features])

# Treatment model: T ~ X
t_model = RandomForestRegressor()
t_model.fit(X_train, T_train)
df['t_hat'] = t_model.predict(df[features])


<a id="residualization"></a>
#### 🧱 Residualization + orthogonalization logic


<details><summary><strong>📉 Click to Expand</strong></summary>

After fitting nuisance models:

- Calculate **residuals**:
  - `Residual_Y = Y - Ŷ`
  - `Residual_T = T - T̂`

- Then regress **Residual_Y ~ Residual_T**.

Why?
- This removes the part of `Y` and `T` that is predictable from `X`.
- What remains captures the **"clean" causal variation** of `T` on `Y`, orthogonal to confounders.

This two-stage process is called **orthogonalization** — it minimizes bias from overfitting nuisance functions.

It’s a key innovation that separates DML from naive ML-based adjustment.

</details>


In [14]:
# Calculate residuals
df['residual_Y'] = df['Y_obs'] - df['y_hat']
df['residual_T'] = df['T'] - df['t_hat']

# Final stage: regress residual_Y ~ residual_T
X_resid = sm.add_constant(df['residual_T'])
y_resid = df['residual_Y']

residual_model = sm.OLS(y_resid, X_resid).fit()

print(residual_model.summary())

print(f"\nEstimated causal effect via DML: {residual_model.params['residual_T']:.2f}")


                            OLS Regression Results                            
Dep. Variable:             residual_Y   R-squared:                       0.127
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     726.5
Date:                Sat, 26 Apr 2025   Prob (F-statistic):          1.62e-149
Time:                        17:25:33   Log-Likelihood:                -14928.
No. Observations:                5000   AIC:                         2.986e+04
Df Residuals:                    4998   BIC:                         2.987e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0842      0.068      1.243      0.2

<a id="dml-vs-regression"></a>
#### 🧲 When to prefer over traditional regression


<details><summary><strong>📉 Click to Expand</strong></summary>

You should prefer DML over traditional regression when:

- **High-dimensional confounders** (lots of features) exist.
- **Nonlinear relationships** are likely between covariates and treatment/outcome.
- **Flexible modeling** is important (tree-based, neural nets, etc.)
- **Concern about model misspecification** in simple linear regression.

Traditional regression assumes:
- Linear relationships
- No complex interactions unless explicitly modeled

DML frees you from strict parametric forms, allowing modern ML models while still aiming for valid causal estimates.

✅ DML shines in modern settings: tech products, healthcare, online platforms — where datasets are messy, rich, and big.

</details>


[Back to the top](#table-of-contents)
___


<a id="heterogeneous-effects"></a>
# 🌈 Heterogeneous Treatment Effects


<details><summary><strong>📉 Click to Expand</strong></summary>

Until now, we've talked about the **average** effect of treatment across the entire population (ATE).

But in reality:
- Different users respond differently.
- Treatment effects **vary** by user characteristics.

**Heterogeneous Treatment Effects** (HTE) study how effects vary:
- Across groups (e.g., high engagement vs low engagement)
- Across individuals (personalized effects)

Estimating HTE is critical for:
- Personalized recommendations
- Smart targeting (marketing, healthcare, product launches)

</details>


<a id="ate-cate-ite"></a>
#### 🎨 ATE vs. CATE vs. ITE


<details><summary><strong>📉 Click to Expand</strong></summary>

Different layers of treatment effect granularity:

- **ATE (Average Treatment Effect)**:  
  - Average effect across everyone.

- **CATE (Conditional Average Treatment Effect)**:  
  - Average effect **given some subgroup** (e.g., CATE for users <30 years old).

- **ITE (Individual Treatment Effect)**:  
  - Effect for a **specific user**.

---

**In practice:**
- ATE is easiest to estimate.
- CATEs are often actionable (targeted marketing).
- ITEs are the hardest — noisy and high-variance.

Good causal inference methods can recover CATEs/ITEs **if** enough data and signal exist.

</details>


In [15]:
# Calculate ATE (simple diff from simulation ground truth)
ate = (df['Y_1'] - df['Y_0']).mean()
print(f"True ATE: {ate:.2f}")

# Calculate CATE for high engagement group
cate_high_engagement = (df[df['prior_engagement'] > 0.5]['Y_1'] - df[df['prior_engagement'] > 0.5]['Y_0']).mean()
print(f"CATE for high engagement users: {cate_high_engagement:.2f}")

# Show a few ITEs
df['ITE_true'] = df['Y_1'] - df['Y_0']
print("\nSample ITEs:")
print(df[['age', 'income', 'prior_engagement', 'ITE_true']].head())


True ATE: 10.00
CATE for high engagement users: 10.00

Sample ITEs:
         age        income  prior_engagement  ITE_true
0  45.960570  53643.604770          0.188077      10.0
1  38.340828  53198.788374          0.170389      10.0
2  47.772262  33065.352411          0.511379      10.0
3  58.276358  55048.647124          0.318793      10.0
4  37.190160  70992.436227          0.384439      10.0


<a id="uplift-usecases"></a>
#### 🌟 Uplift models and use cases


<details><summary><strong>📉 Click to Expand</strong></summary>

**Uplift modeling** directly models the **difference in probability** of a positive outcome between treated and untreated users.

Instead of modeling outcome probabilities separately, uplift models focus on:
- Who is **most persuadable**?
- Who would change behavior because of treatment?

**Where uplift models shine:**
- Marketing campaigns (maximize conversions per dollar)
- Customer retention (target save offers only to those who would churn)
- Medical interventions (target high-risk patients)

---

**Typical techniques:**
- Uplift Decision Trees
- Two-model approach (predict Y|T=1 and Y|T=0 separately, then subtract)
- Causal Forests

</details>


<a id="causal-forests"></a>
#### 🧩 Tree-based methods (Causal Trees, Causal Forests)


<details><summary><strong>📉 Click to Expand</strong></summary>

Tree-based methods are powerful for discovering treatment effect heterogeneity:

- **Causal Trees**:
  - Split data to maximize treatment effect differences between branches.
  - One tree trained specifically for causal splits.

- **Causal Forests**:
  - Ensemble of causal trees.
  - Averages treatment effect estimates across trees.
  - Reduces variance compared to a single tree.

They can estimate **CATEs** reliably across different subgroups without manually specifying interactions.

---

**When useful:**
- You expect heterogeneity but don't know in advance how to segment.
- You want flexible, interpretable treatment effect estimation.

</details>


In [17]:
!pip install econml

Collecting econml
  Downloading econml-0.15.1-cp311-cp311-macosx_11_0_arm64.whl (2.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m0m
Collecting sparse (from econml)
  Downloading sparse-0.16.0-py2.py3-none-any.whl (147 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m147.3/147.3 kB[0m [31m26.7 MB/s[0m eta [36m0:00:00[0m
Collecting shap<0.44.0,>=0.38.1 (from econml)
  Downloading shap-0.43.0-cp311-cp311-macosx_11_0_arm64.whl (445 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m445.4/445.4 kB[0m [31m47.2 MB/s[0m eta [36m0:00:00[0m
Collecting slicer==0.0.7 (from shap<0.44.0,>=0.38.1->econml)
  Downloading slicer-0.0.7-py3-none-any.whl (14 kB)
Installing collected packages: slicer, sparse, shap, econml
Successfully installed econml-0.15.1 shap-0.43.0 slicer-0.0.7 sparse-0.16.0


In [21]:
# Causal Forest: Full Correct Code

from econml.grf import CausalForest
from sklearn.model_selection import train_test_split

# Prepare features and outcome
X = df[['age', 'income', 'prior_engagement']].values  # Features (2D)
T = df['T'].values  # Treatment (1D)
Y = df['Y_obs'].values  # Observed outcome (1D)

# Fit causal forest
forest = CausalForest(n_estimators=100, random_state=42)
forest.fit(X, T, Y)  # Correct order: X, T, Y

# Predict treatment effects (CATEs)
cate_preds = forest.predict(X)

# Store predictions
df['CATE_predicted'] = cate_preds

# Display sample predictions
print("\nSample predicted CATEs:")
print(df[['age', 'income', 'prior_engagement', 'CATE_predicted']].head())



Sample predicted CATEs:
         age        income  prior_engagement  CATE_predicted
0  45.960570  53643.604770          0.188077       15.767683
1  38.340828  53198.788374          0.170389       18.062330
2  47.772262  33065.352411          0.511379       -4.625701
3  58.276358  55048.647124          0.318793       14.400047
4  37.190160  70992.436227          0.384439       11.047467


[Back to the top](#table-of-contents)
___


<a id="placebo-robustness"></a>
# 🧪 Placebo Tests & Robustness Checks


<details><summary><strong>📉 Click to Expand</strong></summary>

Even after careful causal estimation, you must ask:
- Was it a real effect?
- Could hidden bias still exist?

**Robustness checks** build confidence that your findings are not artifacts of modeling choices, random noise, or hidden confounders.

**Placebo tests** simulate situations where you expect **no effect** — if you detect an effect there, something's wrong.

Robust causal analysis is not just about point estimates — it’s about **proving to yourself that you aren't fooling yourself**.

</details>


<a id="placebo"></a>
#### 🧻 Randomized placebo treatments


<details><summary><strong>📉 Click to Expand</strong></summary>

Placebo tests inject "fake" treatments to validate your method.

**Idea:**
- Randomly assign a placebo treatment.
- Re-estimate the treatment effect.
- Expect **no significant effect** if your method is honest.

If your model finds strong effects even when treatment is randomized, your pipeline is leaking bias or overfitting.

---

**Placebo Tests Are Critical:**
- They detect specification errors.
- They detect uncontrolled confounding.
- They expose overfitting to noise.

Placebo tests are a basic but powerful check — always worth doing.

</details>


In [22]:
# Create a random placebo treatment
np.random.seed(123)
df['placebo_T'] = np.random.binomial(1, 0.5, size=len(df))

# Estimate naive difference for placebo treatment
placebo_treated_mean = df.loc[df['placebo_T'] == 1, 'Y_obs'].mean()
placebo_control_mean = df.loc[df['placebo_T'] == 0, 'Y_obs'].mean()

placebo_naive_diff = placebo_treated_mean - placebo_control_mean

print(f"Placebo test: naive difference in means = {placebo_naive_diff:.2f}")

# Ideally close to zero if model is unbiased


Placebo test: naive difference in means = 7.79


<a id="robustness"></a>
#### ⚗️ Sensitivity to unobserved confounding


<details><summary><strong>📉 Click to Expand</strong></summary>

Even after adjusting for observed confounders, **unobserved variables** can still bias causal estimates.

**Sensitivity analysis** asks:
- How strong would hidden confounding have to be to overturn my results?

---

**Typical approaches:**
- Simulate hidden confounders and see effect size shifts.
- Use formulas (like Rosenbaum bounds) to quantify robustness.

In practical data science:
- Simulate scenarios with added fake bias.
- Stress test conclusions under "worst plausible" hidden biases.

If your conclusions survive plausible levels of unobserved bias, they are more credible.

**Important mindset:**  
No analysis is perfect — the goal is to **understand limits, not pretend away uncertainty**.

</details>


In [23]:
# Simulate a hidden confounder correlated with treatment and outcome
np.random.seed(42)
df['hidden_confounder'] = np.random.normal(0, 1, size=len(df))

# Make the outcome depend slightly on this hidden confounder
df['Y_obs_biased'] = df['Y_obs'] + 2 * df['hidden_confounder']

# Re-run naive difference with biased outcome
treated_mean_biased = df.loc[df['T'] == 1, 'Y_obs_biased'].mean()
control_mean_biased = df.loc[df['T'] == 0, 'Y_obs_biased'].mean()

biased_naive_diff = treated_mean_biased - control_mean_biased

print(f"Naive difference in means (with hidden confounding): {biased_naive_diff:.2f}")

# See how much bias was introduced
bias_inflation = biased_naive_diff - naive_diff
print(f"Inflation in estimate due to hidden confounder: {bias_inflation:.2f}")


Naive difference in means (with hidden confounding): 250.11
Inflation in estimate due to hidden confounder: -0.42


[Back to the top](#table-of-contents)
___


<a id="counterfactuals"></a>
# 🧬 Counterfactual Thinking


<details><summary><strong>📉 Click to Expand</strong></summary>

**Counterfactual thinking** is the backbone of causal inference.

Instead of asking:
> "What happened?"

We ask:
> "What *would have* happened if things were different?"

In causal inference:
- Each unit (user, patient, item) has two potential outcomes.
- Only one is observed.
- The other — the counterfactual — must be predicted or estimated.

---

**Counterfactual reasoning enables:**
- Simulating user behavior under alternate scenarios.
- Personalizing interventions based on predicted outcomes.

Without counterfactuals, causal inference is blind.

</details>


<a id="what-if"></a>
#### 🤖 Predicting what would’ve happened


<details><summary><strong>📉 Click to Expand</strong></summary>

Predicting counterfactuals means estimating:
- `Y(1)` for untreated units.
- `Y(0)` for treated units.

We use:
- Machine learning models trained on observed data.
- Causal forests, meta-learners, and other counterfactual predictors.

**Goal:**
- Recover the missing potential outcome.
- Estimate **individual treatment effects (ITEs)**.

This enables granular interventions — not just average effects across a population.

---

**Important to remember:**  
Predicted counterfactuals are **estimates**, not direct observations — uncertainty always exists.

</details>


In [24]:
# Predict counterfactual outcomes using Causal Forest
# (Already trained earlier, we use forest)

# Predict Y(1) and Y(0) separately
cate_preds = df['CATE_predicted'].values
baseline_preds = df['y_hat'].values  # From earlier outcome model

# Predict counterfactual outcomes
df['Y_cf_T1'] = baseline_preds + cate_preds  # Predicted outcome if treated
df['Y_cf_T0'] = baseline_preds  # Predicted outcome if untreated (baseline)

# Now simulate what would happen if treatment status flipped
df['counterfactual_outcome'] = np.where(
    df['T'] == 1,
    df['Y_cf_T0'],  # If treated, counterfactual is untreated
    df['Y_cf_T1']   # If untreated, counterfactual is treated
)

print("\nSample Counterfactual Predictions:")
print(df[['T', 'Y_obs', 'counterfactual_outcome']].head())



Sample Counterfactual Predictions:
   T        Y_obs  counterfactual_outcome
0  0  1114.502287             1127.868133
1  0  1099.893323             1119.847270
2  0   704.133035              697.394845
3  0  1139.203509             1151.640784
4  0  1464.308234             1475.218807


<a id="personalization"></a>
#### 🔁 Usage in recommendation & personalization


<details><summary><strong>📉 Click to Expand</strong></summary>

**Counterfactual predictions unlock personalization:**

Instead of treating everyone the same, you can:
- Target users where treatment has highest predicted uplift.
- De-prioritize users who won't respond.

Examples:
- **Marketing**: Show ads only to users likely to convert if nudged.
- **Healthcare**: Prioritize interventions for patients who benefit most.
- **Products**: Recommend features or promotions to maximize lift per user.

---

**Strategic mindshift:**  
Focus on **marginally persuadable users**, not just overall averages.

---

Real-world use cases often combine:
- Causal effect estimation (CATE/ITE)
- Ranking users by expected benefit
- Action prioritization based on counterfactuals

</details>


In [25]:
# Rank users by predicted CATE
df['priority_score'] = df['CATE_predicted']

# Top 5 users we should prioritize for treatment
top_users = df.sort_values('priority_score', ascending=False).head(5)

print("\nTop users to prioritize based on CATE:")
print(top_users[['age', 'income', 'prior_engagement', 'CATE_predicted']])



Top users to prioritize based on CATE:
            age        income  prior_engagement  CATE_predicted
2841  21.800289  53281.835326          0.350447       20.543173
724   42.050385  53280.579923          0.403388       20.443580
3205  31.876058  53300.354298          0.420334       20.411161
4614  40.670367  53244.477131          0.632127       20.275141
3711  24.210210  53209.647772          0.335915       20.259483


[Back to the top](#table-of-contents)
___


<a id="closing-notes"></a>
# 📌 Closing Notes


<details><summary><strong>📉 Click to Expand</strong></summary>

You now have a practical understanding of core causal inference techniques.

You should be able to:
- Simulate data with confounding
- Estimate naive effects and detect bias
- Adjust using regression, matching, stratification
- Apply modern tools like DML and Causal Forests
- Think in terms of counterfactuals, not just correlations

---

**Remember:**  
Causal thinking is not just a technique — it’s a lens to see decision-making clearly.

</details>


<a id="summary-table"></a>
#### 📝 Summary table of methods


<details><summary><strong>📉 Click to Expand</strong></summary>

| Method | When Useful | Strengths | Weaknesses |
|:---|:---|:---|:---|
| **Simple Diff-in-Means** | Randomized experiments | Easy, unbiased | Useless with confounding |
| **Regression Adjustment** | Observational data with measured confounders | Easy to implement | Model misspecification risk |
| **Stratification** | Small number of discrete confounders | Transparent | Breaks down in high dimensions |
| **Propensity Score Matching (PSM)** | Observational data with many confounders | Balances groups | Sensitive to model of treatment |
| **Instrumental Variables (IV)** | Unobserved confounders exist | Bypasses confounding | Hard to find good instruments |
| **Double Machine Learning (DML)** | High-dimensional nonlinear confounders | ML flexibility + debiasing | Needs lots of data, honest splits |
| **Causal Forests** | Heterogeneous treatment effects | Flexible CATE estimation | Complex, less interpretable |

</details>


<a id="method-choice"></a>
#### 📋 When to use what


<details><summary><strong>📉 Click to Expand</strong></summary>

**Choosing a method depends on:**

- **Randomization present?**
  - Yes → Simple difference in means is fine.
  - No → Need adjustment.

- **Are confounders observed?**
  - Yes → Regression, PSM, DML are options.
  - No → Need IV or natural experiments.

- **Do you expect heterogeneous effects?**
  - Yes → Causal Trees, Causal Forests, Meta-learners.

- **Is high-dimensional data involved?**
  - Yes → Prefer DML over simple regression.

---

Choosing the right method = matching the method to the bias and complexity in your data.

</details>


<a id="causal-vs-predictive"></a>
#### 📎 Causal vs Predictive mindset


<details><summary><strong>📉 Click to Expand</strong></summary>

**Predictive modeling mindset:**
- Focuses on fitting observed outcomes.
- Good for forecasts, risk scores, recommendation engines.
- Does not care about interventions.

**Causal inference mindset:**
- Focuses on *what would happen if we intervened*.
- Good for making decisions (policies, treatments, products).
- Requires stronger assumptions, careful design.

---

**Key difference:**  
Predictive models can be accurate yet useless for interventions.  
Causal models are harder but necessary to make confident decisions.

---

**Quote to remember:**  
> "All models are wrong. Some models are useful.  
> Only causal models are useful for actions."

</details>


[Back to the top](#table-of-contents)
___
