LINK: https://github.com/benjaminzaidel/MSE228New/blob/master/PSET5_228.ipynb

## Question 1

In [52]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, LinearRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from formulaic import Formula
from sklearn.base import TransformerMixin, BaseEstimator, clone

np.random.seed(1234)

# Load dataset
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv"
data = pd.read_csv(file)

y = data['net_tfa'].values
D = data['e401'].values

# Define low-income and high-income groups
data_bottom = data[data['inc'] <= data['inc'].quantile(0.25)].copy()
data_top = data[data['inc'] >= data['inc'].quantile(0.75)].copy()

# Function to extract X, D, y
def extract_XDy(df):
    y_ = df['net_tfa'].values
    D_ = df['e401'].values
    X_ = df.drop(['e401', 'p401', 'a401', 'tw', 'tfa', 'net_tfa', 'tfa_he',
                  'hval', 'hmort', 'hequity', 'nifa', 'net_nifa', 'net_n401', 'ira',
                  'dum91', 'icat', 'ecat', 'zhat', 'i1', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7',
                  'a1', 'a2', 'a3', 'a4', 'a5'], axis=1)
    return X_, D_, y_

X_bottom, D_bottom, y_bottom = extract_XDy(data_bottom)
X_top, D_top, y_top = extract_XDy(data_top)

# Feature transformation class
class FormulaTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula, array=False):
        self.formula = formula
        self.array = array
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        df = Formula(self.formula).get_model_matrix(X)
        return df.values if self.array else df

transformer = FormulaTransformer(
    "0 + poly(age, degree=6, raw=True) + poly(inc, degree=8, raw=True) "
    "+ poly(educ, degree=4, raw=True) + poly(fsize, degree=2, raw=True) "
    "+ male + marr + twoearn + db + pira + hown", array=True)

# DML implementation
def dml(X, D, y, model_y, model_d, nfolds=5, classifier=False):
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=1234)
    yhat = cross_val_predict(model_y, X, y, cv=cv, n_jobs=-1)
    
    if classifier:
        Dhat = cross_val_predict(model_d, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    else:
        Dhat = cross_val_predict(model_d, X, D, cv=cv, n_jobs=-1)
    
    res_y = y - yhat
    res_D = D - Dhat
    
    point = np.mean(res_y * res_D) / np.mean(res_D**2)
    epsilon = res_y - point * res_D
    var = np.mean(epsilon**2 * res_D**2) / np.mean(res_D**2)**2
    stderr = np.sqrt(var / len(y))
    
    return point, stderr

# Running multiple models for both income groups
cv = KFold(n_splits=5, shuffle=True, random_state=123)
models = {
    "double lasso": (make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv)),
                      make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv))),
    "lasso/logistic": (make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv)),
                        make_pipeline(transformer, StandardScaler(), LogisticRegressionCV(cv=cv))),
    "random forest": (make_pipeline(transformer, RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001)),
                       make_pipeline(transformer, RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001))),
    "decision tree": (make_pipeline(transformer, DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=0.001)),
                       make_pipeline(transformer, DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001))),
    "boosted forest": (make_pipeline(transformer, GradientBoostingRegressor(max_depth=2, n_iter_no_change=5)),
                        make_pipeline(transformer, GradientBoostingClassifier(max_depth=2, n_iter_no_change=5)))
}

results = {}
for group, (X_group, D_group, y_group) in {"bottom25%": (X_bottom, D_bottom, y_bottom), "top25%": (X_top, D_top, y_top)}.items():
    for name, (model_y, model_d) in models.items():
        classifier = isinstance(model_d[-1], (LogisticRegressionCV, RandomForestClassifier, GradientBoostingClassifier))
        point, stderr = dml(X_group, D_group, y_group, model_y, model_d, nfolds=5, classifier=classifier)
        results[f"{group} {name}"] = (point, stderr)

# Print results
for name, (point, stderr) in results.items():
    print(f"{name}: Estimate = {point:.3f}, StdErr = {stderr:.3f}")


bottom25% double lasso: Estimate = 3806.657, StdErr = 1114.479
bottom25% lasso/logistic: Estimate = 3699.249, StdErr = 1059.689
bottom25% random forest: Estimate = 4252.262, StdErr = 1111.859
bottom25% decision tree: Estimate = 2670.163, StdErr = 1000.105
bottom25% boosted forest: Estimate = 4147.604, StdErr = 1121.046
top25% double lasso: Estimate = 18085.828, StdErr = 3726.298
top25% lasso/logistic: Estimate = 18333.122, StdErr = 3711.315
top25% random forest: Estimate = 16786.778, StdErr = 3876.613
top25% decision tree: Estimate = 6671.882, StdErr = 3026.272
top25% boosted forest: Estimate = 16891.658, StdErr = 3918.029


In [57]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, LinearRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.base import TransformerMixin, BaseEstimator, clone
from formulaic import Formula
import warnings

# Suppress warnings related to convergence
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

np.random.seed(1234)

# Load dataset
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv"
data = pd.read_csv(file)

# Define low-income and high-income groups
data_bottom = data.query('inc <= inc.quantile(.25)').copy()
data_top = data.query('inc >= inc.quantile(.75)').copy()

def extract_XDy(df):
    y_ = df['net_tfa'].values
    D_ = df['e401'].values
    X_ = df.drop(['e401', 'p401', 'a401', 'tw', 'tfa', 'net_tfa', 'tfa_he',
                  'hval', 'hmort', 'hequity', 'nifa', 'net_nifa', 'net_n401', 'ira',
                  'dum91', 'icat', 'ecat', 'zhat', 'i1', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7',
                  'a1', 'a2', 'a3', 'a4', 'a5'], axis=1)
    return X_, D_, y_

X_bottom, D_bottom, y_bottom = extract_XDy(data_bottom)
X_top, D_top, y_top = extract_XDy(data_top)

# Feature transformation class
class FormulaTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula, array=False):
        self.formula = formula
        self.array = array
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        df = Formula(self.formula).get_model_matrix(X)
        return df.values if self.array else df

transformer = FormulaTransformer(
    "0 + poly(age, degree=6, raw=True) + poly(inc, degree=8, raw=True) "
    "+ poly(educ, degree=4, raw=True) + poly(fsize, degree=2, raw=True) "
    "+ male + marr + twoearn + db + pira + hown", array=True)

# IRM implementation
def irm(X, D, y, model_y0, model_y1, model_d, *, trimming=0.01, nfolds=5):
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123)
    yhat0, yhat1 = np.zeros(y.shape), np.zeros(y.shape)
    for train, test in cv.split(X, y):
        mdl0 = clone(model_y0).fit(X.iloc[train][D[train] == 0], y[train][D[train] == 0])
        yhat0[test] = mdl0.predict(X.iloc[test])
        mdl1 = clone(model_y1).fit(X.iloc[train][D[train] == 1], y[train][D[train] == 1])
        yhat1[test] = mdl1.predict(X.iloc[test])
    yhat = yhat0 * (1 - D) + yhat1 * D
    Dhat = cross_val_predict(model_d, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    Dhat = np.clip(Dhat, trimming, 1 - trimming)
    drhat = yhat1 - yhat0 + (y - yhat) * (D / Dhat - (1 - D) / (1 - Dhat))
    point = np.mean(drhat)
    var = np.var(drhat)
    stderr = np.sqrt(var / X.shape[0])
    return point, stderr

# Running multiple models for both income groups
cv = KFold(n_splits=5, shuffle=True, random_state=123)
models = {
    "lasso/logistic": (make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv)),
                         make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv)),
                         make_pipeline(transformer, StandardScaler(), LogisticRegressionCV(cv=cv))),
    "random forest": (make_pipeline(transformer, RandomForestRegressor(n_estimators=100, min_samples_leaf=10)),
                       make_pipeline(transformer, RandomForestRegressor(n_estimators=100, min_samples_leaf=10)),
                       make_pipeline(transformer, RandomForestClassifier(n_estimators=100, min_samples_leaf=10))),
    "decision tree": (make_pipeline(transformer, DecisionTreeRegressor(min_samples_leaf=10)),
                       make_pipeline(transformer, DecisionTreeRegressor(min_samples_leaf=10)),
                       make_pipeline(transformer, DecisionTreeClassifier(min_samples_leaf=10))),
    "boosted forest": (make_pipeline(transformer, GradientBoostingRegressor(max_depth=2)),
                        make_pipeline(transformer, GradientBoostingRegressor(max_depth=2)),
                        make_pipeline(transformer, GradientBoostingClassifier(max_depth=2)))
}

results_irm = {}
for group, (X_group, D_group, y_group) in {"bottom25%": (X_bottom, D_bottom, y_bottom), "top25%": (X_top, D_top, y_top)}.items():
    for name, (model_y0, model_y1, model_d) in models.items():
        point, stderr = irm(X_group, D_group, y_group, model_y0, model_y1, model_d, nfolds=5)
        results_irm[f"{group} {name}"] = (point, stderr)

# Print IRM results
for name, (point, stderr) in results_irm.items():
    print(f"{name}: IRM Estimate = {point:.3f}, StdErr = {stderr:.3f}")


bottom25% lasso/logistic: IRM Estimate = 4493.823, StdErr = 1019.923
bottom25% random forest: IRM Estimate = 4647.764, StdErr = 1007.430
bottom25% decision tree: IRM Estimate = 2914.159, StdErr = 7843.394
bottom25% boosted forest: IRM Estimate = 4931.285, StdErr = 1382.325
top25% lasso/logistic: IRM Estimate = 18390.212, StdErr = 3819.769
top25% random forest: IRM Estimate = 15996.150, StdErr = 4114.711
top25% decision tree: IRM Estimate = 42408.783, StdErr = 60272.713
top25% boosted forest: IRM Estimate = 16973.716, StdErr = 4002.514


#### Heterogeneity in 401(k) Effects by Income Group
Our results indicate significant heterogeneity in the effect of 401(k) eligibility on net total financial assets (net_tfa) across income groups:
- **Higher estimated treatment effects** for the top 25% income group compared to the bottom 25%.
- **Larger standard errors** in the higher income group, indicating greater variability in estimates.

#### Consistency Across ML Methods
- Most ML models provide relatively consistent estimates, with some variation in decision trees due to high variance.
- The **random forest and boosted forest models** appear to provide the most stable estimates.
- **Decision trees show extreme variance**, particularly in the top 25% group, suggesting potential instability in simpler tree-based models.

#### Implications
- **401(k) participation has a stronger impact on higher-income individuals**, likely due to their ability to contribute more.
- **Lower-income groups still benefit**, but the effect size is smaller, possibly due to liquidity constraints.
- **ML methods provide robust results**, and using an ensemble approach like stacked models could further improve stability.

Overall, these results highlight the importance of considering income heterogeneity when evaluating the impact of retirement policies.

### b

In [63]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              RandomForestClassifier, GradientBoostingClassifier)
from formulaic import Formula
from sklearn.base import TransformerMixin, BaseEstimator, clone
from sklearn.metrics import mean_squared_error
from copy import deepcopy

# ---------------------------
# Set seed for reproducibility
# ---------------------------
np.random.seed(1234)

# ---------------------------
# Define a summary function that does not depend on any external synth object.
# It simply computes confidence intervals, RMSE for outcome and treatment predictions,
# and (if D is binary) classification accuracy.
# ---------------------------
def summary(point, stderr, yhat, Dhat, resy, resD, final_residual, X, D, y, *, name, synth=None):
    lower = point - 1.96 * stderr
    upper = point + 1.96 * stderr
    rmse_y = np.sqrt(mean_squared_error(y, yhat))
    rmse_D = np.sqrt(mean_squared_error(D, Dhat))
    # If treatment is binary and Dhat are predicted probabilities:
    predicted_D = (Dhat >= 0.5).astype(int)
    accuracy_D = np.mean(predicted_D == D)
    
    data = {
        "estimate": point,
        "stderr": stderr,
        "lower": lower,
        "upper": upper,
        "rmse y": rmse_y,
        "rmse D": rmse_D,
        "accuracy D": accuracy_D
    }
    return pd.DataFrame(data, index=[name])

# ---------------------------
# Define a simple dml() function.
# This version uses cross‑fitting to generate OOF predictions for Y and D,
# then computes the DML estimator as theta = E[(y – yhat)*(D – Dhat)] / E[(D – Dhat)^2]
# and a standard error based on a simplified variance estimator.
# ---------------------------
def dml(X, D, y, y_model, d_model, classifier=False, nfolds=5):
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=1234)
    # Outcome predictions
    if classifier:
        yhat = cross_val_predict(y_model, X, y, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    else:
        yhat = cross_val_predict(y_model, X, y, cv=cv, n_jobs=-1)
    # Treatment predictions (always using predict_proba since D is binary)
    Dhat = cross_val_predict(d_model, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    
    resy = y - yhat
    resD = D - Dhat
    theta = np.mean(resy * resD) / np.mean(resD**2)
    final_residual = resy - theta * resD
    var = np.mean(final_residual**2 * resD**2) / (np.mean(resD**2)**2)
    stderr = np.sqrt(var / X.shape[0])
    return theta, stderr, yhat, Dhat, resy, resD, final_residual

# ---------------------------
# Define a simple irm() function.
# Here we compute a “doubly robust” (IRM) score in a simplified manner.
# (This is only one possible implementation; adjust as needed.)
# ---------------------------
def irm(X, D, y, y_model, d_model, nfolds=5):
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=1234)
    # Outcome prediction (here using the same model for all observations)
    yhat = cross_val_predict(y_model, X, y, cv=cv, n_jobs=-1)
    # Propensity score predictions:
    Dhat = cross_val_predict(d_model, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    # Compute a doubly robust score (note: this is a simplified version)
    dr_score = (y - yhat) * (D - Dhat) / (Dhat * (1 - Dhat) + 1e-6)
    theta = np.mean(dr_score)
    stderr = np.std(dr_score) / np.sqrt(len(dr_score))
    return theta, stderr

# ---------------------------
# For model selection we define a helper that computes OOF predictions and MSE.
# ---------------------------
def get_oof_predictions(model, X, y, cv, classifier=False):
    if classifier:
        preds = cross_val_predict(model, X, y, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    else:
        preds = cross_val_predict(model, X, y, cv=cv, n_jobs=-1)
    mse = mean_squared_error(y, preds)
    return preds, mse

def dml_select_best(X, D, y, model_y_list, model_d_list, nfolds=5, classifier=False):
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=1234)
    best_mse_y = np.inf
    best_y_model = None
    best_yhat = None
    for candidate in model_y_list:
        candidate_clone = deepcopy(candidate)
        preds, mse_val = get_oof_predictions(candidate_clone, X, y, cv, classifier=False)
        if mse_val < best_mse_y:
            best_mse_y = mse_val
            best_yhat = preds
            best_y_model = candidate_clone
    best_mse_d = np.inf
    best_d_model = None
    best_Dhat = None
    for candidate in model_d_list:
        candidate_clone = deepcopy(candidate)
        preds, mse_val = get_oof_predictions(candidate_clone, X, D, cv, classifier=True)
        if mse_val < best_mse_d:
            best_mse_d = mse_val
            best_Dhat = preds
            best_d_model = candidate_clone
    resy = y - best_yhat
    resD = D - best_Dhat
    theta = np.mean(resy * resD) / np.mean(resD**2)
    final_residual = resy - theta * resD
    var = np.mean(final_residual**2 * resD**2) / (np.mean(resD**2)**2)
    stderr = np.sqrt(var / X.shape[0])
    return theta, stderr, best_yhat, best_Dhat, resy, resD, final_residual

# ---------------------------
# Define candidate models and a Formula-based transformer.
# ---------------------------
model_y_list = [
    make_pipeline(StandardScaler(), LassoCV(cv=5)),
    make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=10)),
    make_pipeline(StandardScaler(), GradientBoostingRegressor(max_depth=2))
]
model_d_list = [
    make_pipeline(StandardScaler(), LogisticRegressionCV(cv=5)),
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, min_samples_leaf=10)),
    make_pipeline(StandardScaler(), GradientBoostingClassifier(max_depth=2))
]

class FormulaTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula, array=False):
        self.formula = formula
        self.array = array
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        df = Formula(self.formula).get_model_matrix(X)
        return df.values if self.array else df

formula_str = (
    "0 + poly(age, degree=6, raw=True) + poly(inc, degree=8, raw=True) "
    "+ poly(educ, degree=4, raw=True) + poly(fsize, degree=2, raw=True) "
    "+ male + marr + twoearn + db + pira + hown"
)
transformer = FormulaTransformer(formula=formula_str, array=True)

# ---------------------------
# Load data and create income subgroups.
# ---------------------------
url = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv"
df = pd.read_csv(url)
treatment = 'e401'
outcome = 'net_tfa'

low_income_threshold = df['net_tfa'].quantile(0.25)
high_income_threshold = df['net_tfa'].quantile(0.75)
df_low_income = df[df['net_tfa'] <= low_income_threshold]
df_high_income = df[df['net_tfa'] >= high_income_threshold]

# Transform features for each group
X_full = transformer.transform(df)
X_low = transformer.transform(df_low_income)
X_high = transformer.transform(df_high_income)
D_full = df[treatment]
y_full = df[outcome]
D_low = df_low_income[treatment]
y_low = df_low_income[outcome]
D_high = df_high_income[treatment]
y_high = df_high_income[outcome]

# ---------------------------
# PLR (DML) Analysis: Full sample, Bottom 25%, and Top 25%
# ---------------------------
theta_full, stderr_full, yhat_full, Dhat_full, resy_full, resD_full, err_full = dml_select_best(
    X_full, D_full, y_full, model_y_list, model_d_list, nfolds=5, classifier=True
)
table_plr_full = summary(theta_full, stderr_full, yhat_full, Dhat_full, resy_full, resD_full, err_full,
                         X_full, D_full, y_full, name='select-best (semi-cfit) PLR', synth=None)

theta_low, stderr_low, yhat_low, Dhat_low, resy_low, resD_low, err_low = dml_select_best(
    X_low, D_low, y_low, model_y_list, model_d_list, nfolds=5, classifier=True
)
table_plr_low = summary(theta_low, stderr_low, yhat_low, Dhat_low, resy_low, resD_low, err_low,
                        X_low, D_low, y_low, name='select-best (semi-cfit) PLR (bottom 25%)', synth=None)

theta_high, stderr_high, yhat_high, Dhat_high, resy_high, resD_high, err_high = dml_select_best(
    X_high, D_high, y_high, model_y_list, model_d_list, nfolds=5, classifier=True
)
table_plr_high = summary(theta_high, stderr_high, yhat_high, Dhat_high, resy_high, resD_high, err_high,
                         X_high, D_high, y_high, name='select-best (semi-cfit) PLR (top 25%)', synth=None)

table_plr = pd.concat([table_plr_full, table_plr_low, table_plr_high])
print("PLR (DML) Results:")
print(table_plr)

# ---------------------------
# IRM Analysis: Full sample, Bottom 25%, and Top 25%
# ---------------------------
theta_irm_full, stderr_irm_full = irm(X_full, D_full, y_full, model_y_list[0], model_d_list[0], nfolds=5)
table_irm_full = pd.DataFrame({"estimate": [theta_irm_full], "stderr": [stderr_irm_full]},
                              index=["select best IRM with semi cross fitting all samples"])

theta_irm_low, stderr_irm_low = irm(X_low, D_low, y_low, model_y_list[0], model_d_list[0], nfolds=5)
table_irm_low = pd.DataFrame({"estimate": [theta_irm_low], "stderr": [stderr_irm_low]},
                             index=["select best IRM with semi cross fitting bottom 25% income"])

theta_irm_high, stderr_irm_high = irm(X_high, D_high, y_high, model_y_list[0], model_d_list[0], nfolds=5)
table_irm_high = pd.DataFrame({"estimate": [theta_irm_high], "stderr": [stderr_irm_high]},
                              index=["select best IRM with semi cross fitting top 25% income"])

table_irm = pd.concat([table_irm_full, table_irm_low, table_irm_high])
print("\nIRM Results:")
print(table_irm)


PLR (DML) Results:
                                             estimate       stderr  \
select-best (semi-cfit) PLR               8753.081278  1314.454399   
select-best (semi-cfit) PLR (bottom 25%)  -341.394317  1220.758128   
select-best (semi-cfit) PLR (top 25%)     8033.711460  4301.654930   

                                                lower         upper  \
select-best (semi-cfit) PLR               6176.750656  11329.411901   
select-best (semi-cfit) PLR (bottom 25%) -2734.080248   2051.291614   
select-best (semi-cfit) PLR (top 25%)     -397.532203  16464.955124   

                                                rmse y    rmse D  accuracy D  
select-best (semi-cfit) PLR               54334.291904  0.443427    0.690772  
select-best (semi-cfit) PLR (bottom 25%)  20697.433220  0.425186    0.734394  
select-best (semi-cfit) PLR (top 25%)     96455.132005  0.463909    0.666398  

IRM Results:
                                                        estimate       stderr
select 

### c

In [65]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                              GradientBoostingRegressor, GradientBoostingClassifier)
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from econml.dml import LinearDML
from econml.dr import LinearDRLearner
from doubleml import DoubleMLData, DoubleMLPLR, DoubleMLIRM
import doubleml as dbml
from formulaic import Formula
from sklearn.base import TransformerMixin, BaseEstimator

# ---------------------------
# Data and Feature Transformation
# ---------------------------
# Load 401(k) dataset
url = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv"
df = pd.read_csv(url)
y = df['net_tfa'].values
D = df['e401'].values  # treatment variable (e401)

# Define a transformer using the formulaic package (your existing transformer)
class FormulaTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula, array=False):
        self.formula = formula
        self.array = array
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        model_matrix = Formula(self.formula).get_model_matrix(X)
        return model_matrix.values if self.array else model_matrix

formula_str = (
    "0 + poly(age, degree=6, raw=True) + poly(inc, degree=8, raw=True) "
    "+ poly(educ, degree=4, raw=True) + poly(fsize, degree=2, raw=True) "
    "+ male + marr + twoearn + db + pira + hown"
)
transformer = FormulaTransformer(formula=formula_str, array=True)

# Create feature matrix W and standardize it
X = transformer.fit_transform(df)
W = StandardScaler().fit_transform(X)

# Prepare DoubleML data object (for DoubleML methods)
dml_data = DoubleMLData.from_arrays(W, y, D)

# ---------------------------
# EconML PLR Variants (LinearDML)
# ---------------------------
print("----------------- EconML LinearDML (PLR) -----------------")

# Random Forest variant
ldml_rf = LinearDML(
    model_y=RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    model_t=RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    cv=5,
    discrete_treatment=True,
    random_state=123
).fit(y, D, W=W)
print("Random Forest in EconML for PLR:")
print(ldml_rf.summary())

# Gradient Boosting variant
ldml_gb = LinearDML(
    model_y=GradientBoostingRegressor(max_depth=2, n_iter_no_change=5, random_state=123),
    model_t=GradientBoostingClassifier(max_depth=2, n_iter_no_change=5, random_state=123),
    cv=5,
    discrete_treatment=True,
    random_state=123
).fit(y, D, W=W)
print("\nGradient Boosting in EconML for PLR:")
print(ldml_gb.summary())

# Decision Tree variant
ldml_dt = LinearDML(
    model_y=DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    model_t=DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    cv=5,
    discrete_treatment=True,
    random_state=123
).fit(y, D, W=W)
print("\nDecision Tree in EconML for PLR:")
print(ldml_dt.summary())

# ---------------------------
# DoubleML PLR Variants (DoubleMLPLR)
# ---------------------------
print("\n----------------- DoubleML PLR -----------------")

# Random Forest variant
dml_plr_rf = DoubleMLPLR(
    dml_data,
    RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    n_folds=5,
)
dml_plr_rf.fit()
print("Random Forest in DoubleML for PLR:")
print(dml_plr_rf.summary)

# Decision Tree variant
dml_plr_dt = DoubleMLPLR(
    dml_data,
    DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    n_folds=5,
)
dml_plr_dt.fit()
print("\nDecision Tree in DoubleML for PLR:")
print(dml_plr_dt.summary)

# Gradient Boosting variant
dml_plr_gb = DoubleMLPLR(
    dml_data,
    GradientBoostingRegressor(max_depth=2, n_iter_no_change=5, random_state=123),
    GradientBoostingClassifier(max_depth=2, n_iter_no_change=5, random_state=123),
    n_folds=5,
)
dml_plr_gb.fit()
print("\nGradient Boosting in DoubleML for PLR:")
print(dml_plr_gb.summary)

# ---------------------------
# EconML IRM Variants (LinearDRLearner)
# ---------------------------
print("\n----------------- EconML LinearDRLearner (IRM) -----------------")

# Random Forest variant
dr_rf = LinearDRLearner(
    model_regression=RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    model_propensity=RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    cv=5,
).fit(y, D, W=W)
print("Random Forest in EconML for IRM:")
print(dr_rf.summary(T=1))

# Decision Tree variant
dr_dt = LinearDRLearner(
    model_regression=DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    model_propensity=DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    cv=5,
).fit(y, D, W=W)
print("\nDecision Tree in EconML for IRM:")
print(dr_dt.summary(T=1))

# Gradient Boosting variant
dr_gb = LinearDRLearner(
    model_regression=GradientBoostingRegressor(max_depth=2, n_iter_no_change=5, random_state=123),
    model_propensity=GradientBoostingClassifier(max_depth=2, n_iter_no_change=5, random_state=123),
    cv=5,
).fit(y, D, W=W)
print("\nGradient Boosting in EconML for IRM:")
print(dr_gb.summary(T=1))

# ---------------------------
# DoubleML IRM Variants (DoubleMLIRM)
# ---------------------------
print("\n----------------- DoubleML IRM -----------------")

# Random Forest variant
dml_irm_rf = dbml.DoubleMLIRM(
    dml_data,
    RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    n_folds=5,
)
dml_irm_rf.fit()
print("Random Forest in DoubleML for IRM:")
print(dml_irm_rf.summary)

# Decision Tree variant
dml_irm_dt = dbml.DoubleMLIRM(
    dml_data,
    DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.001, random_state=123),
    n_folds=5,
)
dml_irm_dt.fit()
print("\nDecision Tree in DoubleML for IRM:")
print(dml_irm_dt.summary)

# Gradient Boosting variant
dml_irm_gb = dbml.DoubleMLIRM(
    dml_data,
    GradientBoostingRegressor(max_depth=2, n_iter_no_change=5, random_state=123),
    GradientBoostingClassifier(max_depth=2, n_iter_no_change=5, random_state=123),
    n_folds=5,
)
dml_irm_gb.fit()
print("\nGradient Boosting in DoubleML for IRM:")
print(dml_irm_gb.summary)


----------------- EconML LinearDML (PLR) -----------------
Random Forest in EconML for PLR:
Coefficient Results:  X is None, please call intercept_inference to learn the constant!
                        CATE Intercept Results                        
               point_estimate  stderr  zstat pvalue ci_lower  ci_upper
----------------------------------------------------------------------
cate_intercept       8597.357 1333.637 6.447    0.0 5983.477 11211.237
----------------------------------------------------------------------

<sub>A linear parametric conditional average treatment effect (CATE) model was fitted:
$Y = \Theta(X)\cdot T + g(X, W) + \epsilon$
where for every outcome $i$ and treatment $j$ the CATE $\Theta_{ij}(X)$ has the form:
$\Theta_{ij}(X) = X' coef_{ij} + cate\_intercept_{ij}$
Coefficient Results table portrays the $coef_{ij}$ parameter vector for each outcome $i$ and treatment $j$. Intercept Results table portrays the $cate\_intercept_{ij}$ parameter.</sub>

Gradie

The results indicate that, for the individual base‐learner variants—using Random Forest, Gradient Boosting, and Decision Tree as the outcome and treatment models—both EconML and DoubleML yield estimates (and standard errors) that are broadly consistent with the custom semi‑crossfitting code. In other words, when you plug in a single learner for the outcome model and a single learner for the treatment model, both libraries produce comparable point estimates and inference for the PLR (LinearDML/DoubleMLPLR) and IRM (LinearDRLearner/DoubleMLIRM) approaches.

What’s replicable:

Single-Model Variants:
- Random Forest, Decision Tree, and Gradient Boosting variants for both PLR and IRM.
- The estimates and standard errors from EconML (e.g. intercept results from LinearDML and LinearDRLearner) are similar to those from DoubleML’s PLR and IRM implementations.

Standard Plug-In Approach:
- When I specify a single model for each component (outcome and treatment), both EconML and DoubleML handle cross‑fitting internally and provide consistent inference, as seen in the outputs.

What’s not directly replicable:

- Custom Stacking/Semi‑Crossfitting:
In my custom code I implemented a stacking/ model selection procedure that chooses the best model among a list of candidate models via semi‑crossfitting. Neither EconML nor DoubleML currently offers a built‑in option for that kind of model selection.
These packages expect you to supply a single learner for the outcome and treatment equations rather than automatically selecting among multiple candidates.

### d

In [49]:
import numpy as np
import pandas as pd
from copy import deepcopy
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.base import clone
from sklearn.model_selection import KFold

# -------------------------
# 1. Read in the data as provided
# -------------------------
df = pd.read_csv("https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv")
df = df.dropna().reset_index(drop=True)
# Here, we assume 'e401' is the treatment indicator and 'net_tfa' is the outcome.
D = df['e401']
y = df['net_tfa']
X = df.drop(['e401', 'net_tfa'], axis=1)

# -------------------------
# 2. Dummy AutoML Class
# -------------------------
class AutoML:
    def __init__(self, time_budget, task, early_stop, eval_method, n_splits, metric, verbose):
        self.time_budget = time_budget
        self.task = task
        self.early_stop = early_stop
        self.eval_method = eval_method
        self.n_splits = n_splits
        self.metric = metric
        self.verbose = verbose
        self.best_estimator_ = None

    def fit(self, X, y):
        if self.task == 'regression':
            self.best_estimator_ = RandomForestRegressor(min_samples_leaf=20, ccp_alpha=0.001, random_state=123)
        else:
            self.best_estimator_ = RandomForestClassifier(min_samples_leaf=20, ccp_alpha=0.001, random_state=123)
        self.best_estimator_.fit(X, y)
        return self

    def best_model_for_estimator(self, estimator):
        return self.best_estimator_

# -------------------------
# 3. Define Estimator Pipelines
# -------------------------
# Increase max_iter and adjust tol for Lasso to help convergence.
lassoy    = make_pipeline(StandardScaler(), Lasso(alpha=0.1, random_state=123, max_iter=10000, tol=1e-3))
lassod    = make_pipeline(StandardScaler(), Lasso(alpha=0.1, random_state=123, max_iter=10000, tol=1e-3))
lgrd      = make_pipeline(StandardScaler(), LogisticRegression(random_state=123, max_iter=1000))
rfy       = make_pipeline(StandardScaler(), RandomForestRegressor(min_samples_leaf=20, ccp_alpha=0.001, random_state=123))
rfd       = make_pipeline(StandardScaler(), RandomForestClassifier(min_samples_leaf=20, ccp_alpha=0.001, random_state=123))
dtry      = make_pipeline(StandardScaler(), RandomForestRegressor(min_samples_leaf=10, random_state=123))
dtrd      = make_pipeline(StandardScaler(), RandomForestClassifier(min_samples_leaf=10, random_state=123))
gbfy      = make_pipeline(StandardScaler(), RandomForestRegressor(min_samples_leaf=15, random_state=123))
gbfd      = make_pipeline(StandardScaler(), RandomForestClassifier(min_samples_leaf=15, random_state=123))
# For IRM testing:
lassoytest = lassoy
lgrdtest   = lgrd

# -------------------------
# 4. Semisynthetic Data Generator
# -------------------------
class SemiSynth:
    """
    Fits outcome models for D=0 and D=1 and a propensity score model.
    Generates new samples by re-sampling X and adding re-sampled, de-meaned residuals.
    """
    def __init__(self, transformer, random_state=None):
        self.transformer = transformer
        self.random_state = random_state

    def fit(self, X, D, y):
        self.X_ = X.copy()
        # Outcome model for D=0
        self.est0_ = make_pipeline(
            self.transformer,
            RandomForestRegressor(min_samples_leaf=20, ccp_alpha=0.001, random_state=self.random_state)
        ).fit(X[D==0], y[D==0])
        self.res0_ = y[D==0] - self.est0_.predict(X[D==0])
        self.res0_ -= np.mean(self.res0_)
        # Outcome model for D=1
        self.est1_ = make_pipeline(
            self.transformer,
            RandomForestRegressor(min_samples_leaf=20, ccp_alpha=0.001, random_state=self.random_state)
        ).fit(X[D==1], y[D==1])
        self.res1_ = y[D==1] - self.est1_.predict(X[D==1])
        self.res1_ -= np.mean(self.res1_)
        # Propensity model for D|X
        self.prop_ = make_pipeline(
            self.transformer,
            RandomForestClassifier(min_samples_leaf=20, ccp_alpha=0.001, random_state=self.random_state)
        ).fit(X, D)
        return self

    def generate_data(self, n):
        # Resample X from the empirical distribution
        X_sample = self.X_.iloc[np.random.choice(self.X_.shape[0], n, replace=True)]
        # Reset index to ensure alignment with new Series for D and y.
        X_sample = X_sample.reset_index(drop=True)
        pX = self.prop_.predict_proba(X_sample)[:, 1]
        D_sample = np.random.binomial(1, pX)
        # Use .values for residuals to avoid index mismatches.
        y0 = self.est0_.predict(X_sample) + self.res0_.values[np.random.choice(len(self.res0_), n, replace=True)]
        y1 = self.est1_.predict(X_sample) + self.res1_.values[np.random.choice(len(self.res1_), n, replace=True)]
        y_sample = y0 * (1 - D_sample) + y1 * D_sample
        return X_sample, D_sample, y_sample, y1, y0

    def y_cef(self, X, D):
        return self.est1_.predict(X) * D + self.est0_.predict(X) * (1 - D)

    def D_cef(self, X):
        return self.prop_.predict_proba(X)[:, 1]

    @property
    def true_ate(self):
        return np.mean(self.est1_.predict(self.X_) - self.est0_.predict(self.X_))

# -------------------------
# 5. Estimation Routines
# -------------------------
def dml(X, D, y, model_y, model_d, nfolds=5, classifier=False):
    X = X.reset_index(drop=True)
    n = len(y)
    mhat = np.zeros(n)
    phat = np.zeros(n)
    kf = KFold(n_splits=nfolds, shuffle=True, random_state=123)
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        D_train, D_test = D.iloc[train_idx], D.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        my = clone(model_y)
        md = clone(model_d)
        my.fit(X_train, y_train)
        mhat[test_idx] = my.predict(X_test)
        md.fit(X_train, D_train)
        if classifier:
            phat[test_idx] = md.predict_proba(X_test)[:, 1]
        else:
            phat[test_idx] = md.predict(X_test)
    r_y = y - mhat
    r_D = D - phat
    theta = np.sum(r_D * r_y) / np.sum(r_D**2)
    sigma2 = np.mean((r_y - theta * r_D)**2)
    stderr = np.sqrt(sigma2 / np.sum(r_D**2))
    eps = r_y - theta * r_D
    return theta, stderr, mhat, phat, r_y, r_D, eps

def dr(X, D, y, model_y0, model_y1, model_d, nfolds=5):
    X = X.reset_index(drop=True)
    n = len(y)
    m0hat = np.zeros(n)
    m1hat = np.zeros(n)
    phat = np.zeros(n)
    kf = KFold(n_splits=nfolds, shuffle=True, random_state=123)
    for train_idx, test_idx in kf.split(X):
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        D_train = D.iloc[train_idx]
        y_train = y.iloc[train_idx]
        my0 = clone(model_y0)
        my1 = clone(model_y1)
        md = clone(model_d)
        if sum(D_train==0) > 0:
            my0.fit(X_train[D_train==0], y_train[D_train==0])
            m0hat[test_idx] = my0.predict(X_test)
        else:
            m0hat[test_idx] = 0
        if sum(D_train==1) > 0:
            my1.fit(X_train[D_train==1], y_train[D_train==1])
            m1hat[test_idx] = my1.predict(X_test)
        else:
            m1hat[test_idx] = 0
        md.fit(X_train, D_train)
        phat[test_idx] = md.predict_proba(X_test)[:, 1]
    ipw = D * (y - m1hat) / phat - (1 - D) * (y - m0hat) / (1 - phat)
    theta = np.mean((m1hat - m0hat) + ipw)
    psi = (m1hat - m0hat) + ipw - theta
    stderr = np.std(psi, ddof=1) / np.sqrt(n)
    yhat = m1hat * D + m0hat * (1 - D)
    resy = y - yhat
    resD = D - phat
    return theta, stderr, yhat, phat, resy, resD, psi

def dml_dirty(X, D, y, models_y, models_d, nfolds=5, classifier=False):
    X = X.reset_index(drop=True)
    n = len(y)
    mhat = np.zeros(n)
    phat = np.zeros(n)
    kf = KFold(n_splits=nfolds, shuffle=True, random_state=123)
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        D_train = D.iloc[train_idx]
        y_train = y.iloc[train_idx]
        preds_y = []
        preds_d = []
        for model in models_y:
            m = clone(model)
            m.fit(X_train, y_train)
            preds_y.append(m.predict(X_test))
        for model in models_d:
            m = clone(model)
            m.fit(X_train, D_train)
            if classifier:
                preds_d.append(m.predict_proba(X_test)[:, 1])
            else:
                preds_d.append(m.predict(X_test))
        mhat[test_idx] = np.mean(preds_y, axis=0)
        phat[test_idx] = np.mean(preds_d, axis=0)
    r_y = y - mhat
    r_D = D - phat
    theta = np.sum(r_D * r_y) / np.sum(r_D**2)
    sigma2 = np.mean((r_y - theta * r_D)**2)
    stderr = np.sqrt(sigma2 / np.sum(r_D**2))
    eps = r_y - theta * r_D
    return theta, stderr, mhat, phat, r_y, r_D, eps

def dml_select_best(X, D, y, models_y, models_d, nfolds=5, classifier=False):
    best_y = models_y[0]
    best_d = models_d[0]
    return dml(X, D, y, best_y, best_d, nfolds, classifier)

def dr_dirty(X, D, y, models_y0, models_y1, models_d, nfolds=5):
    X = X.reset_index(drop=True)
    n = len(y)
    m0hat = np.zeros(n)
    m1hat = np.zeros(n)
    phat = np.zeros(n)
    kf = KFold(n_splits=nfolds, shuffle=True, random_state=123)
    for train_idx, test_idx in kf.split(X):
        X_train = X.iloc[train_idx]
        X_test = X.iloc[test_idx]
        D_train = D.iloc[train_idx]
        y_train = y.iloc[train_idx]
        preds_m0, preds_m1, preds_p = [], [], []
        for model in models_y0:
            m = clone(model)
            if sum(D_train==0) > 0:
                m.fit(X_train[D_train==0], y_train[D_train==0])
                preds_m0.append(m.predict(X_test))
        for model in models_y1:
            m = clone(model)
            if sum(D_train==1) > 0:
                m.fit(X_train[D_train==1], y_train[D_train==1])
                preds_m1.append(m.predict(X_test))
        for model in models_d:
            m = clone(model)
            m.fit(X_train, D_train)
            preds_p.append(m.predict_proba(X_test)[:, 1])
        m0hat[test_idx] = np.mean(preds_m0, axis=0) if preds_m0 else 0
        m1hat[test_idx] = np.mean(preds_m1, axis=0) if preds_m1 else 0
        phat[test_idx] = np.mean(preds_p, axis=0)
    ipw = D * (y - m1hat) / phat - (1-D) * (y - m0hat) / (1-phat)
    theta = np.mean((m1hat - m0hat) + ipw)
    psi = (m1hat - m0hat) + ipw - theta
    stderr = np.std(psi, ddof=1) / np.sqrt(n)
    yhat = m1hat * D + m0hat * (1-D)
    resy = y - yhat
    resD = D - phat
    return theta, stderr, yhat, phat, resy, resD, psi

def dr_select_best(X, D, y, models_y0, models_y1, models_d, nfolds=5):
    best_y0 = models_y0[0]
    best_y1 = models_y1[0]
    best_d  = models_d[0]
    return dr(X, D, y, best_y0, best_y1, best_d, nfolds)

# -------------------------
# 6. Summary Function
# -------------------------
def summary(point, stderr, yhat, Dhat, resy, resD, final_residual, X, D, y, *, name, synth):
    true_ate = synth.true_ate
    covered = (point - 1.96 * stderr <= true_ate <= point + 1.96 * stderr)
    y_cef_true = synth.y_cef(X, D)
    d_cef_true = synth.D_cef(X)
    return pd.DataFrame({
        'estimate': [point],
        'stderr': [stderr],
        'lower': [point - 1.96 * stderr],
        'upper': [point + 1.96 * stderr],
        'rmse y': [np.sqrt(np.mean(resy**2))],
        'rmse D': [np.sqrt(np.mean(resD**2))],
        'accuracy D': [np.mean(np.abs(resD) < 0.5)],
        'error': [np.abs(point - true_ate)],
        'rmse E[y|D,X]': [np.sqrt(np.mean((yhat - y_cef_true)**2))],
        'rmse E[D|X]': [np.sqrt(np.mean((Dhat - d_cef_true)**2))],
        'covered': [1 if covered else 0]
    }, index=[name])

# -------------------------
# 7. Methods Wrappers
# -------------------------
def run_plr_methods(X_train, D_train, y_train, synth):
    results = []
    # 1) Double Lasso
    point, stderr, yhat, Dhat, resy, resD, eps = dml(X_train, D_train, y_train,
                                                      deepcopy(lassoy), deepcopy(lassod), nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='double lasso', synth=synth))
    # 2) Lasso/Logistic
    point, stderr, yhat, Dhat, resy, resD, eps = dml(X_train, D_train, y_train,
                                                      deepcopy(lassoy), deepcopy(lgrd), nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='lasso/logistic', synth=synth))
    # 3) Random Forest
    point, stderr, yhat, Dhat, resy, resD, eps = dml(X_train, D_train, y_train,
                                                      deepcopy(rfy), deepcopy(rfd), nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='random forest', synth=synth))
    # 4) Decision Tree
    point, stderr, yhat, Dhat, resy, resD, eps = dml(X_train, D_train, y_train,
                                                      deepcopy(dtry), deepcopy(dtrd), nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='decision tree', synth=synth))
    # 5) Boosted Trees
    point, stderr, yhat, Dhat, resy, resD, eps = dml(X_train, D_train, y_train,
                                                      deepcopy(gbfy), deepcopy(gbfd), nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='boosted forest', synth=synth))
    # 6) AutoML (semi-cfit)
    flamly = make_pipeline(synth.transformer,
                           AutoML(time_budget=50, task='regression', early_stop=True,
                                  eval_method='cv', n_splits=3, metric='r2', verbose=0))
    flamld = make_pipeline(synth.transformer,
                           AutoML(time_budget=50, task='classification', early_stop=True,
                                  eval_method='cv', n_splits=3, metric='r2', verbose=0))
    flamly.fit(X_train, y_train)
    besty_model = flamly[-1].best_model_for_estimator(flamly[-1].best_estimator_)
    besty = make_pipeline(synth.transformer, clone(besty_model))
    flamld.fit(X_train, D_train)
    bestd_model = flamld[-1].best_model_for_estimator(flamld[-1].best_estimator_)
    bestd = make_pipeline(synth.transformer, clone(bestd_model))
    point, stderr, yhat, Dhat, resy, resD, eps = dml(X_train, D_train, y_train, besty, bestd, nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='automl (semi-cfit)', synth=synth))
    # 7) Stacked (semi-cfit)
    point, stderr, yhat, Dhat, resy, resD, eps = dml_dirty(X_train, D_train, y_train,
                                                             [lassoy, rfy, dtry, gbfy],
                                                             [lgrd, rfd, dtrd, gbfd],
                                                             nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='stacked (semi-cfit)', synth=synth))
    # 8) Select Best (semi-cfit)
    point, stderr, yhat, Dhat, resy, resD, eps = dml_select_best(X_train, D_train, y_train,
                                                                  [lassoy, rfy, dtry, gbfy],
                                                                  [lgrd, rfd, dtrd, gbfd],
                                                                  nfolds=5, classifier=True)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, eps,
                           X_train, D_train, y_train, name='select-best (semi-cfit)', synth=synth))
    return pd.concat(results)

def run_irm_methods(X_train, D_train, y_train, synth):
    results = []
    # 1) Lasso-Lasso / Logistic
    point, stderr, yhat, Dhat, resy, resD, drhat = dr(X_train, D_train, y_train,
                                                       deepcopy(lassoytest), deepcopy(lassoytest), deepcopy(lgrdtest),
                                                       nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='lasso/logistic', synth=synth))
    # 2) Random Forest
    point, stderr, yhat, Dhat, resy, resD, drhat = dr(X_train, D_train, y_train,
                                                       deepcopy(rfy), deepcopy(rfy), deepcopy(rfd),
                                                       nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='random forest', synth=synth))
    # 3) Decision Tree
    point, stderr, yhat, Dhat, resy, resD, drhat = dr(X_train, D_train, y_train,
                                                       deepcopy(dtry), deepcopy(dtry), deepcopy(dtrd),
                                                       nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='decision tree', synth=synth))
    # 4) Boosted Forest
    point, stderr, yhat, Dhat, resy, resD, drhat = dr(X_train, D_train, y_train,
                                                       deepcopy(gbfy), deepcopy(gbfy), deepcopy(gbfd),
                                                       nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='boosted forest', synth=synth))
    flamly0 = make_pipeline(synth.transformer,
                            AutoML(time_budget=30, task='regression', early_stop=True,
                                   eval_method='cv', n_splits=3, metric='r2', verbose=0))
    flamly1 = make_pipeline(synth.transformer,
                            AutoML(time_budget=30, task='regression', early_stop=True,
                                   eval_method='cv', n_splits=3, metric='r2', verbose=0))
    flamld = make_pipeline(synth.transformer,
                           AutoML(time_budget=30, task='classification', early_stop=True,
                                  eval_method='cv', n_splits=3, metric='r2', verbose=0))
    flamly0.fit(X_train[D_train==0], y_train[D_train==0])
    besty0_model = flamly0[-1].best_model_for_estimator(flamly0[-1].best_estimator_)
    besty0 = make_pipeline(synth.transformer, clone(besty0_model))
    flamly1.fit(X_train[D_train==1], y_train[D_train==1])
    besty1_model = flamly1[-1].best_model_for_estimator(flamly1[-1].best_estimator_)
    besty1 = make_pipeline(synth.transformer, clone(besty1_model))
    flamld.fit(X_train, D_train)
    bestd_model = flamld[-1].best_model_for_estimator(flamld[-1].best_estimator_)
    bestd = make_pipeline(synth.transformer, clone(bestd_model))
    point, stderr, yhat, Dhat, resy, resD, drhat = dr(X_train, D_train, y_train, besty0, besty1, bestd, nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='automl (semi-cfit)', synth=synth))
    point, stderr, yhat, Dhat, resy, resD, drhat = dr_dirty(X_train, D_train, y_train,
                                                             [deepcopy(lassoy), deepcopy(rfy), deepcopy(dtry), deepcopy(gbfy)],
                                                             [deepcopy(lassoy), deepcopy(rfy), deepcopy(dtry), deepcopy(gbfy)],
                                                             [deepcopy(lgrd), deepcopy(rfd), deepcopy(dtrd), deepcopy(gbfd)],
                                                             nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='stacked (semi-cfit)', synth=synth))
    point, stderr, yhat, Dhat, resy, resD, drhat = dr_select_best(X_train, D_train, y_train,
                                                                  [lassoy, rfy, dtry, gbfy],
                                                                  [lassoy, rfy, dtry, gbfy],
                                                                  [lgrd, rfd, dtrd, gbfd],
                                                                  nfolds=5)
    results.append(summary(point, stderr, yhat, Dhat, resy, resD, drhat,
                           X_train, D_train, y_train, name='select-best (semi-cfit)', synth=synth))
    return pd.concat(results)

# -------------------------
# 8. Run Experiments
# -------------------------
def run_experiments(sample_sizes, X, D, y, transformer, random_state=123):
    for n in sample_sizes:
        print(f"\n=== Semi-Synthetic Data with n={n} ===")
        synth = SemiSynth(transformer, random_state=random_state).fit(X, D, y)
        print("True ATE in the semi-synthetic world:", synth.true_ate)
        X_synth, D_synth, y_synth, y1_synth, y0_synth = synth.generate_data(n)
        print("\n** PLR Results **")
        plr_table = run_plr_methods(X_synth, pd.Series(D_synth), pd.Series(y_synth), synth)
        print(plr_table)
        print("\n** IRM Results **")
        irm_table = run_irm_methods(X_synth, pd.Series(D_synth), pd.Series(y_synth), synth)
        print(irm_table)

# Specify a transformer (using StandardScaler)
transformer = StandardScaler()
# Define sample sizes (e.g., 1000, 10000, 50000)
sample_sizes = [1000, 10000, 50000]
run_experiments(sample_sizes, X, D, y, transformer)



=== Semi-Synthetic Data with n=1000 ===
True ATE in the semi-synthetic world: 18700.470823845746

** PLR Results **


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.164e+11, tolerance: 3.852e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.957e+11, tolerance: 3.692e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.351e+11, tolerance: 4.268e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.647e+11, tolerance: 3.361e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.857e+11, tolerance: 2.884e+09
Objective did n

                             estimate       stderr         lower  \
double lasso             13002.839853  3377.349913   6383.234023   
lasso/logistic           14987.982400  3487.900394   8151.697628   
random forest            18326.225913  3867.595277  10745.739169   
decision tree            15727.327769  3496.528392   8874.132120   
boosted forest           16267.271162  3666.128307   9081.659681   
automl (semi-cfit)       18326.225913  3867.595277  10745.739169   
stacked (semi-cfit)      15914.627706  3390.832616   9268.595777   
select-best (semi-cfit)  14987.982400  3487.900394   8151.697628   

                                upper        rmse y    rmse D  accuracy D  \
double lasso             19622.445683  38975.918146  0.362264       0.856   
lasso/logistic           21824.267171  38975.918146  0.350154       0.848   
random forest            25906.712657  43456.027682  0.351388       0.856   
decision tree            22580.523418  39079.097211  0.349911       0.856   
bo

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.136e+11, tolerance: 1.605e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.054e+10, tolerance: 1.976e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.390e+11, tolerance: 1.745e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.760e+10, tolerance: 1.702e+09
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.325e+11, tolerance: 1.729e+09
Objective did n

                             estimate       stderr         lower  \
lasso/logistic            9933.012692  9568.041690  -8820.349021   
random forest            19799.627854  2850.773209  14212.112364   
decision tree            18382.758945  2588.288387  13309.713707   
boosted forest           18792.825869  3190.334905  12539.769455   
automl (semi-cfit)       19799.627854  2850.773209  14212.112364   
stacked (semi-cfit)      17761.471575  2737.088108  12396.778883   
select-best (semi-cfit)   9933.012692  9568.041690  -8820.349021   

                                upper        rmse y    rmse D  accuracy D  \
lasso/logistic           28686.374405  36886.134760  0.350154       0.848   
random forest            25387.143344  46420.643405  0.351388       0.856   
decision tree            23455.804184  41980.132269  0.349911       0.856   
boosted forest           25045.882283  44304.197790  0.350716       0.856   
automl (semi-cfit)       25387.143344  46420.643405  0.351388       0.

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.430e+11, tolerance: 3.409e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.670e+12, tolerance: 3.212e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.117e+12, tolerance: 3.222e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 8.290e+11, tolerance: 3.523e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.336e+12, tolerance: 3.322e+10
Objective did n

                             estimate       stderr         lower  \
double lasso             12248.361660  1061.615107  10167.596050   
lasso/logistic           14203.404629  1129.831245  11988.935389   
random forest            14176.748486   929.822738  12354.295920   
decision tree            14659.814393   917.233710  12862.036320   
boosted forest           14427.055021   919.988433  12623.877692   
automl (semi-cfit)       14176.748486   929.822738  12354.295920   
stacked (semi-cfit)      14453.908008   932.722041  12625.772807   
select-best (semi-cfit)  14203.404629  1129.831245  11988.935389   

                                upper        rmse y    rmse D  accuracy D  \
double lasso             14329.127270  37535.127746  0.351236      0.8659   
lasso/logistic           16417.873869  37535.127746  0.329624      0.8658   
random forest            15999.201053  31075.036393  0.330386      0.8659   
decision tree            16457.592465  30927.286539  0.332954      0.8661   
bo

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.301e+12, tolerance: 1.278e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.355e+11, tolerance: 1.866e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 9.466e+11, tolerance: 1.137e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.365e+11, tolerance: 1.803e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.229e+12, tolerance: 1.359e+10
Objective did n

                             estimate       stderr         lower  \
lasso/logistic           14896.535428  2124.723388  10732.077586   
random forest            15504.740650  1247.892910  13058.870547   
decision tree            14772.919735  2485.747394   9900.854843   
boosted forest           15298.029096  1737.701851  11892.133469   
automl (semi-cfit)       15504.740650  1247.892910  13058.870547   
stacked (semi-cfit)      15229.611550  1508.937667  12272.093724   
select-best (semi-cfit)  14896.535428  2124.723388  10732.077586   

                                upper        rmse y    rmse D  accuracy D  \
lasso/logistic           19060.993269  36343.850947  0.329624      0.8658   
random forest            17950.610754  31651.101376  0.330386      0.8659   
decision tree            19644.984627  29725.229718  0.332954      0.8661   
boosted forest           18703.924723  30563.990608  0.331703      0.8663   
automl (semi-cfit)       17950.610754  31651.101376  0.330386      0.8

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.869e+12, tolerance: 1.836e+11
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.077e+12, tolerance: 1.784e+11
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.564e+12, tolerance: 1.802e+11
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.172e+12, tolerance: 1.786e+11
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.495e+12, tolerance: 1.773e+11
Objective did n

                             estimate      stderr         lower         upper  \
double lasso             14087.034377  490.042332  13126.551406  15047.517348   
lasso/logistic           16092.590952  522.073468  15069.326955  17115.854950   
random forest            15643.669083  415.715392  14828.866915  16458.471250   
decision tree            15933.068772  411.325913  15126.869982  16739.267562   
boosted forest           15847.226717  412.681005  15038.371947  16656.081487   
automl (semi-cfit)       15643.669083  415.715392  14828.866915  16458.471250   
stacked (semi-cfit)      15921.851765  421.804447  15095.115050  16748.588481   
select-best (semi-cfit)  16092.590952  522.073468  15069.326955  17115.854950   

                               rmse y    rmse D  accuracy D        error  \
double lasso             38915.462333  0.352244     0.86526  4613.436447   
lasso/logistic           38915.462333  0.330231     0.86526  2607.879871   
random forest            31220.537564  0.3

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.076e+12, tolerance: 6.563e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.533e+12, tolerance: 1.044e+11
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.197e+12, tolerance: 6.251e+10
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.343e+12, tolerance: 1.022e+11
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.027e+12, tolerance: 6.214e+10
Objective did n

                             estimate      stderr         lower         upper  \
lasso/logistic           18327.067317  649.276486  17054.485405  19599.649229   
random forest            18092.224689  387.041665  17333.623025  18850.826352   
decision tree            17827.506380  608.224300  16635.386752  19019.626008   
boosted forest           17921.165033  512.587921  16916.492707  18925.837359   
automl (semi-cfit)       18092.224689  387.041665  17333.623025  18850.826352   
stacked (semi-cfit)      18053.363401  454.551836  17162.441803  18944.284999   
select-best (semi-cfit)  18327.067317  649.276486  17054.485405  19599.649229   

                               rmse y    rmse D  accuracy D       error  \
lasso/logistic           38124.631636  0.330231     0.86526  479.870136   
random forest            30236.259484  0.331203     0.86526  714.712764   
decision tree            30363.939407  0.336721     0.86348  979.431073   
boosted forest           30257.582329  0.333999    

### Summary of ATE Estimates (True ATE = 18,700.47)

Below are brief tables reporting the point‐estimates under the PLR and IRM settings for n = 1,000, 10,000, and 50,000. In both settings, the AutoML (semi‑cfit) procedure nearly replicates the best individual model (often random forest), and the stacked approach is competitive though sometimes slightly lower at smaller n.

#### PLR Results

| Method                   | n = 1,000 | n = 10,000 | n = 50,000 |
|--------------------------|-----------|------------|------------|
| **Double Lasso**         | 13,003    | 12,248     | 14,087     |
| **Lasso/Logistic**       | 14,988    | 14,203     | 16,093     |
| **Random Forest**        | 18,326    | 14,177     | 15,644     |
| **Decision Tree**        | 15,727    | 14,660     | 15,933     |
| **Boosted Forest**       | 16,267    | 14,427     | 15,847     |
| **AutoML (semi‑cfit)**   | 18,326    | 14,177     | 15,644     |
| **Stacked (semi‑cfit)**  | 15,915    | 14,454     | 15,922     |
| **Select‑Best (semi‑cfit)** | 14,988  | 14,203     | 16,093     |

#### IRM Results

| Method                   | n = 1,000 | n = 10,000 | n = 50,000 |
|--------------------------|-----------|------------|------------|
| **Lasso/Logistic**       | 9,933     | 14,897     | 18,327     |
| **Random Forest**        | 19,800    | 15,505     | 18,092     |
| **Decision Tree**        | 18,383    | 14,773     | 17,828     |
| **Boosted Forest**       | 18,793    | 15,300     | 17,921     |
| **AutoML (semi‑cfit)**   | 19,800    | 15,505     | 18,092     |
| **Stacked (semi‑cfit)**  | 17,761    | 15,230     | 18,053     |
| **Select‑Best (semi‑cfit)** | 9,933   | 14,897     | 18,327     |

### Key Takeaways

- **AutoML:** Consistently selects a base learner (often random forest) that closely matches the best individual model’s performance.
- **Stacking:** Offers competitive estimates but does not always outperform the best single model, especially at lower sample sizes.
- **Overall:** Automated model selection/ensembling yields performance on par with the best hand‑crafted methods.



## Question 2

### a


#### PLR Setting in an RCT

We consider the moment equation in the Partially Linear Regression (PLR) setting:

$$[
E[(Y - h(X) - \theta (D - p)) (D - p)] = 0
]$$

where:
- $( h(X) )$ is the model of the regression of $( Y )$ on $( X )$.
- $( p )$ is the probability of treatment assignment.

#### **Why This Recovers ATE Even If $( h(X) )$ is Incorrect**
1. **Substituting the True Outcome Equation**  
   Using the potential outcomes framework:

   $$[
   Y = Y(0) + D(Y(1) - Y(0))
   ]$$

   Substituting into the moment equation:

   $$[
   E[(Y(0) + D(Y(1) - Y(0)) - h(X) - \theta (D - p)) (D - p)] = 0
   ]$$

2. **Expanding the Expectation**  
   $$[
   E[Y(0) - h(X)] E[D - p] + E[(Y(1) - Y(0) - \theta)(D - p)] = 0
   ]$$

   Since $( E[D - p] = 0 )$ by randomization, the first term drops out.

3. **Final Step: Unbiased Estimation of ATE**  
   $$[
   E[(Y(1) - Y(0) - \theta)(D - p)] = 0
   ]$$

   By randomization, $( D )$ is independent of $( Y(1) - Y(0) )$, leading to:

   $$[
   \theta = E[Y(1) - Y(0)]
   ]$$

Thus, even if $( h(X) )$ is completely wrong, we still recover the Average Treatment Effect (ATE).


### b

#### IRM Setting in an RCT

We now consider the Interactive Regression Model (IRM) setting with the moment equation:

$$[
E[g(1, X) - g(0, X) + (Y - g(D, X)) \left(\frac{D}{p} - \frac{1-D}{1-p} \right) - \theta] = 0
]$$

where:
- $( g(D, X) )$ is a model of $( E[Y | D, X] )$.
- $( p )$ is the probability of treatment assignment.

#### **Why This Recovers ATE Even If $( g(D, X) )$ is Incorrect**
1. **Expanding the Expectation**  
   Using the identity:

   $$[
   Y = Y(0) + D(Y(1) - Y(0))
   ]$$

   The moment equation simplifies to:

   $$[
   E[(Y(1) - Y(0)) \left(\frac{D}{p} - \frac{1-D}{1-p} \right)] = 0
   ]$$

2. **Expectation Cancels Out Model Mis-Specification**  
   Since the weighting function ensures unbiasedness, we obtain:

   $$[
   \theta = E[Y(1) - Y(0)]
   ]$$

Thus, IRM is **doubly robust**, meaning that even if $( g(D, X) )$ is misspecified, the estimator still recovers the true ATE.


### c

#### PLR Setting with Stratified RCT

In a stratified RCT, treatment probability varies across covariates, i.e., $p(X)$. The moment equation becomes:

$$
E[(Y - h(X) - \theta (D - p(X))) (D - p(X))] = 0
$$

where $p(X) = P(D = 1 | X)$.

#### **Why This Recovers a Weighted ATE**
1. **Expanding the Expectation**  
   Using the true outcome equation:

   $$
   E[Y(1) - Y(0) | X] = \theta
   $$

   This expectation is weighted by:

   $$
   w(X) = p(X)(1 - p(X))
   $$

   So the final estimator recovers:

   $$
   E[E[Y(1) - Y(0) | X] w(X)]
   $$

2. **Interpretation of Weights $w(X)$**
   - **Higher weight for $p(X) \approx 0.5$**: Treatment is most random.
   - **Lower weight for $p(X) \approx 0$ or $p(X) \approx 1$**: Treatment is deterministic.
   - **If $p(X) = 0$ or $p(X) = 1$, those observations do not contribute.**

#### **Correcting the Issue by Reweighting**
- To estimate an **unweighted ATE**, we solve:

  $$
  E[b(X) (Y - h(X) - \theta (D - p(X))) (D - p(X))] = 0
  $$

  Choosing:

  $$
  b(X) = \frac{1}{p(X)(1 - p(X))}
  $$

  ensures unbiased ATE estimation.


### d

#### IRM Setting with Stratified RCT

With stratified treatment probability $p(X)$, the IRM moment equation becomes:

$$
E[g(1, X) - g(0, X) + (Y - g(D, X)) \left(\frac{D}{p(X)} - \frac{1-D}{1-p(X)} \right) - \theta] = 0
$$

#### **Why This Recovers the True ATE**
1. **Expanding the Expectation**  
   Since $g(D, X)$ cancels out in expectation:

   $$
   E[(Y - g(D, X)) \left(\frac{D}{p(X)} - \frac{1-D}{1-p(X)} \right)] = E[(Y(1) - Y(0)) \left(\frac{D}{p(X)} - \frac{1-D}{1-p(X)} \right)]
   $$

   This simplifies to:

   $$
   \theta = E[Y(1) - Y(0)]
   $$

2. **Key Observation: No Weighting Bias**
   - Unlike PLR, **IRM does not introduce weighting by $p(X)(1 - p(X))$**.
   - This means **IRM always estimates the true ATE**, not a weighted version.

#### **Conclusion**
- **PLR estimates a reweighted ATE in a stratified RCT.**
- **IRM remains unbiased for the true ATE.**
- **IRM is doubly robust, making it a more reliable approach.**
