# 02 Tune CATE estimators

In this notebook, we will tune the hyperparameters for our CATE methods.

### Contents:
1. Description of estimator library  
2. Setting up  
3. Actual tuning

## 1. Description of estimator library

We will consider the following estimators:

1. S-learner:  
A. RF  
B. XGB
2. T-learner:  
A. Lasso  
B. logistic  
C. RF  
D. XGB
3. X-learner:  
A. Outcome_learner: lasso, effect_learner: lasso  
B. Outcome_learner: logistic, effect_learner: lasso  
C. Outcome_learner: RF, effect_learner: lasso  
D. Outcome_learner: XGB, effect_learner: lasso
4. R-learner:  
A. Outcome_learner: lasso, effect_learner: lasso  
B. Outcome_learner: lasso, effect_learner: XGB  
C. Outcome_learner: RF, effect_learner: lasso  
D. Outcome_learner: RF, effect_learner: RF

The outcome and effect base learners for the R-learner were each chosen independently at random from {lasso, RF, XGB}.

We will tune the models for each outcome listed in `outcomes` (in this run, only `fausebal`), without perturbations.
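
To make the meta-learner terminology concrete, here is a minimal sketch of a T-learner on synthetic data. It is an illustration only, not the implementation in `methods.cate_estimator_validation`; the data-generating process and the helper names `fit_t_learner` and `predict_cate` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_t_learner(X, t, y, make_model):
    """Fit one outcome model per treatment arm (the 'T' in T-learner)."""
    model_control = make_model().fit(X[t == 0], y[t == 0])
    model_treated = make_model().fit(X[t == 1], y[t == 1])
    return model_control, model_treated


def predict_cate(models, X):
    """CATE estimate: predicted outcome if treated minus predicted outcome if control."""
    model_control, model_treated = models
    return model_treated.predict(X) - model_control.predict(X)


# Synthetic data (hypothetical): the true effect is 2.0 when x0 > 0 and 0.0 otherwise
rng = np.random.default_rng(405)
X = rng.normal(size=(2000, 3))
t = rng.integers(0, 2, size=2000)
true_cate = 2.0 * (X[:, 0] > 0)
y = X[:, 1] + true_cate * t + rng.normal(scale=0.5, size=2000)

models = fit_t_learner(
    X, t, y,
    lambda: RandomForestRegressor(n_estimators=100, min_samples_leaf=50, random_state=0),
)
cate_hat = predict_cate(models, X)
print(f"Mean estimated CATE, x0 > 0:  {cate_hat[X[:, 0] > 0].mean():.2f}")
print(f"Mean estimated CATE, x0 <= 0: {cate_hat[X[:, 0] <= 0].mean():.2f}")
```

An S-learner would instead fit a single model on the features together with the treatment indicator and contrast its predictions at `t = 1` and `t = 0`.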

## 2. Setting up

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import sys
import copy
import random
import joblib
import pickle

# Import sklearn methods
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Import own methods
from methods.cate_estimator_validation import make_estimator_library

Failed to import duecredit due to No module named 'duecredit'


In [2]:
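# Cap BLAS/OpenMP threads to avoid oversubscription with parallel CV.
# Note: these variables are typically read when the BLAS/OpenMP runtime is first loaded,
# so this cell should run before numpy, sklearn and xgboost are imported to be sure it takes effect.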
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"


In [3]:
# Load pre-saved analysis dataset from Analysis.ipynb
import pandas as pd
import numpy as np
from pathlib import Path

ANALYSIS_CSV = Path("data/analysis/analysis_df.csv")
if not ANALYSIS_CSV.exists():
    raise FileNotFoundError(f"Expected CSV not found at {ANALYSIS_CSV}. Run Analysis.ipynb to generate it.")

trainval_df = pd.read_csv(ANALYSIS_CSV)
print(f"Loaded analysis dataset: {trainval_df.shape[0]} rows, {trainval_df.shape[1]} columns")

# Keep rows with message == 1, billpayfa == 0 and debitfa == 0 ('message' itself is never used as a feature)
if "message" in trainval_df.columns:
    trainval_df = trainval_df.loc[(trainval_df["message"] == 1) & (trainval_df["billpayfa"] == 0) & (trainval_df["debitfa"] == 0)].copy()

# Define outcomes available in this dataset
outcomes = ["fausebal"]

# Choose treatment variable (do not include auxiliary 'message' as a feature)
treatment_var = "message_fa"

# Build a comprehensive feature set:
strat_vars = [c for c in trainval_df.columns if c.startswith("strat_")]

# low-cardinality categoricals (exclude outcomes/treatment/id and obvious non-features like 'group')
cat_candidates = ['reminder_freq', 'reminder_infreq', 'camp_short', 'htefa', 'htebal']

# Build missing indicators ONLY for categorical candidates that actually contain NaNs
available_cats = [c for c in cat_candidates if c in trainval_df.columns]
_cat_df = trainval_df[available_cats].copy()
missing_cats = [c for c in available_cats if _cat_df[c].isna().any()]
if missing_cats:
    cat_missing_indicators = _cat_df[missing_cats].isna().astype(int)
    cat_missing_indicators.columns = [f"{c}_missing" for c in missing_cats]
else:
    cat_missing_indicators = pd.DataFrame(index=_cat_df.index)

# Sentinel fill before concatenation
_cat_df = _cat_df.fillna(0)
cat_dummies = pd.concat([_cat_df, cat_missing_indicators], axis=1)

# assemble design matrix (avoid duplicates)
numeric_vars = [c for c in ["assets", "deposits", "paymentmean", "debt", "minbal", "creditcard"] if c in trainval_df.columns]
X_numeric = trainval_df[numeric_vars].copy()

# Create missing indicators for numeric variables (no binning)
num_missing_cols = [c for c in X_numeric.columns if X_numeric[c].isna().any()]
if num_missing_cols:
    num_missing_indicators = X_numeric[num_missing_cols].isna().astype(int)
    num_missing_indicators.columns = [f"{c}_missing" for c in num_missing_cols]
else:
    num_missing_indicators = pd.DataFrame(index=trainval_df.index)

# Simple imputation for numerics after adding indicators
X_numeric = X_numeric.apply(pd.to_numeric, errors='coerce').fillna(0)

X_strat = trainval_df[strat_vars].copy()

# Keep continuous numerics and add missing indicators
X_design = pd.concat([X_strat, cat_dummies, num_missing_indicators, X_numeric], axis=1)
X_design = X_design.loc[:, ~X_design.columns.duplicated()].copy()

# Track full feature set
features = list(X_design.columns)

# Combine features, treatment and outcomes into the modelling frame
model_df = pd.concat([X_design, trainval_df[[treatment_var] + outcomes]], axis=1)

print(f"Processed dataset: {model_df.shape[0]} rows")
print(f"Detected treatment_var='{treatment_var}'")
print(f"Propensity score is {model_df[treatment_var].mean()}")
print(f"Feature matrix: {X_design.shape[1]} columns")
print(f"Outcomes to tune: {outcomes}")


Loaded analysis dataset: 108000 rows, 519 columns
Processed dataset: 36020 rows
Detected treatment_var='message_fa'
Propensity score is 0.4990838423098279
Feature matrix: 329 columns
Outcomes to tune: ['fausebal']
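
The missing-value handling above follows an indicator-then-fill pattern: flag which entries were missing, then impute a constant. A minimal sketch on a toy frame, where the helper name `add_missing_indicators` and the values are made up for illustration:

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df, fill_value=0):
    """Return df with NaNs filled, plus one `<col>_missing` flag per column that had NaNs."""
    cols_with_nans = [c for c in df.columns if df[c].isna().any()]
    indicators = df[cols_with_nans].isna().astype(int)
    indicators.columns = [f"{c}_missing" for c in cols_with_nans]
    return pd.concat([df.fillna(fill_value), indicators], axis=1)


toy = pd.DataFrame({
    "assets": [100.0, np.nan, 250.0],
    "debt": [10.0, 20.0, 30.0],  # no NaNs, so no indicator column is created
})
result = add_missing_indicators(toy)
print(result)
```

Because zero can also be a legitimate observed value, the indicator column is what lets a downstream model tell "missing" apart from "observed zero".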


In [4]:
# Diagnostics: NaN summary and collinearity checks for X_design (numeric cast)
import numpy as np
import pandas as pd

print("=== Diagnostics for X_design ===")
# Align X_design to filtered rows in model_df
if 'model_df' in globals():
    X_design = X_design.loc[model_df.index].copy()
print(f"Original shape: {X_design.shape}")

# Cast to float for diagnostics (astype raises an error if a column cannot be converted)
X_diag = X_design.copy().astype(float)
print(f"Numeric diagnostic shape: {X_diag.shape}")

# 1) NaN summary
nan_counts = X_diag.isna().sum()
num_cols_with_nans = int((nan_counts > 0).sum())
print(f"Columns with NaNs: {num_cols_with_nans}")
if num_cols_with_nans > 0:
    nan_report = nan_counts[nan_counts > 0].sort_values(ascending=False)
    print(nan_report.head(20))

# 2) Constant columns (zero variance)
constant_cols = [c for c in X_diag.columns if X_diag[c].nunique(dropna=True) <= 1]
print(f"Constant columns: {len(constant_cols)}")
if constant_cols:
    print(constant_cols[:20])

# 3) High pairwise correlations for numeric columns
numeric_cols = X_diag.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns considered for correlation: {len(numeric_cols)}")

high_corr_pairs = []
threshold = 0.99
if len(numeric_cols) > 1:
    corr_vals = X_diag[numeric_cols].corr().abs().to_numpy()
    # Scan the strict upper triangle so each pair is reported once
    for i, j in zip(*np.triu_indices_from(corr_vals, k=1)):
        val = corr_vals[i, j]
        if not np.isnan(val) and val >= threshold:
            high_corr_pairs.append((numeric_cols[i], numeric_cols[j], float(val)))
# Report the threshold actually used rather than a hard-coded label
print(f"High-corr pairs (|r| >= {threshold}): {len(high_corr_pairs)}")
if high_corr_pairs:
    for a, b, r in high_corr_pairs[:20]:
        print(f"{a} ~ {b}: r={r:.3f}")

# Identify high-correlation columns to drop (drop the second column of each pair, keep the first)
corr_drop_cols = []
if high_corr_pairs:
    seen = set()
    for a, b, r in high_corr_pairs:
        if b not in seen:
            corr_drop_cols.append(b)
            seen.add(b)
print(f"High-corr drop candidates: {len(corr_drop_cols)}")
if corr_drop_cols:
    print(corr_drop_cols[:20])

print("=== End diagnostics ===")

# Drop constant columns and high-corr columns, then refresh features/model_df
drop_cols = []
if len(constant_cols) > 0:
    drop_cols.extend(list(constant_cols))
if corr_drop_cols:
    drop_cols.extend(list(set(corr_drop_cols)))

if drop_cols:
    X_design = X_design.drop(columns=drop_cols, errors="ignore")
    features = list(X_design.columns)
    rows_index = X_design.index
    model_df = pd.concat([X_design, trainval_df.loc[rows_index, [treatment_var] + outcomes]], axis=1)
    print(f"Dropped {len(drop_cols)} columns (constant + high-corr). New X_design shape: {X_design.shape}")


=== Diagnostics for X_design ===
Original shape: (36020, 329)
Numeric diagnostic shape: (36020, 329)
Columns with NaNs: 0
Constant columns: 31
['strat_29', 'strat_48', 'strat_56', 'strat_76', 'strat_118', 'strat_122', 'strat_126', 'strat_143', 'strat_144', 'strat_146', 'strat_147', 'strat_148', 'strat_152', 'strat_154', 'strat_158', 'strat_190', 'strat_194', 'strat_198', 'strat_246', 'strat_250']
Numeric columns considered for correlation: 329
High-corr pairs (|r| >= 0.95): 0
High-corr drop candidates: 0
=== End diagnostics ===
Dropped 31 columns (constant + high-corr). New X_design shape: (36020, 298)
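
The correlation check in the diagnostics cell can also be expressed without explicit loops. The sketch below is one possible vectorised variant, run on a toy frame with one deliberately redundant column; the function name `high_corr_pairs` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.99):
    """List (col_a, col_b, |r|) for column pairs whose absolute correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked or undefined correlations) fail the comparison and are filtered out
    pairs = pairs[pairs >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 1.0,          # exact linear function of "a"
    "c": rng.normal(size=200),   # independent noise
})

pairs = high_corr_pairs(toy)
# Drop the second column of each flagged pair, keeping the first
to_drop = sorted({b for _, b, _ in pairs})
reduced = toy.drop(columns=to_drop)
print(pairs)
print(list(reduced.columns))
```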


In [5]:
# Save dataset and metadata for reuse; also export trainval_data.csv with 'TREATED'
from pathlib import Path

for outcome in outcomes:
    # Paths
    OUTPUT_ANALYSIS_DIR = Path(f"output/analysis/{outcome}")
    OUTPUT_PARAMS_DIR = Path(f"output/params/{outcome}")
    OUTPUT_ANALYSIS_DIR.mkdir(parents=True, exist_ok=True)
    OUTPUT_PARAMS_DIR.mkdir(parents=True, exist_ok=True)

    MODEL_CSV = OUTPUT_ANALYSIS_DIR / "trainval_data.csv"
    MODEL_CSV_META = OUTPUT_PARAMS_DIR / "analysis_imputation_meta.pkl"
    meta = {
        "features": features,
        "treatment_var": treatment_var,
        "outcomes": outcomes,
    }
    with open(MODEL_CSV_META, 'wb') as f:
        pickle.dump(meta, f)
    
    if treatment_var != "TREATED":
        model_df.rename(columns={treatment_var: "TREATED"}, inplace=True)
    model_df.to_csv(MODEL_CSV, index=False)
    print(f"✓ Saved dataset -> {MODEL_CSV}")
    print(f"✓ Saved metadata -> {MODEL_CSV_META}")

✓ Saved dataset -> output\analysis\fausebal\trainval_data.csv
✓ Saved metadata -> output\params\fausebal\analysis_imputation_meta.pkl


### 2.2. Defining parameter grids and base learners

In [6]:
cv = StratifiedKFold(n_splits = 4, shuffle = True, random_state = 405)
lasso_grid = {"alpha" : np.logspace(-5,5,500) }
logistic_grid = {"penalty" : ["l1", "l2"], 
                 "C" : np.logspace(-5,5,500)}
rf_grid = {'min_samples_leaf': [50,100,200,300,400,500],
           'max_depth': [3,4,5,6,7,8],
           'bootstrap': [False, True],
           'n_estimators': [100,200,300,400,500]}
xgb_grid = {'max_depth': [5,6,7,8,9,10,11,12],
            'gamma': [0, 0.1, 0.2, 0.3, 0.4],
            'subsample': [0.7, 0.75, 0.8,1],
            'reg_lambda': [100,150,200,250, 300, 350, 400],
            'n_estimators': [200, 300, 400, 500, 600, 700, 800, 900, 1000],
            'min_child_weight': [4,5,6,7,8,9,10],
            'learning_rate': [0.1,0.125,0.15,0.175,0.2,0.225,0.25]}

base_learners = {"lasso" : Lasso(),
                 "logistic" : LogisticRegression(solver = "liblinear", 
                                                 max_iter = 500),
                 "rf" : RandomForestRegressor(),
                 "xgb" : XGBRegressor(objective = "reg:squarederror", n_jobs=1, tree_method="hist")}
param_grids = {"lasso" : lasso_grid,
               "logistic" : logistic_grid,
               "rf" : rf_grid,
               "xgb" : xgb_grid}

## 3. Actual tuning
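
The grids defined in Section 2.2 are too large to search exhaustively, which is presumably why the tuning call below passes `n_iter=200`. The sketch below counts the combinations in the XGBoost grid and draws 200 random candidates with scikit-learn's `ParameterSampler`. Whether `make_estimator_library` samples candidates this way is an assumption; its internals are not shown in this notebook.

```python
from math import prod

from sklearn.model_selection import ParameterSampler

# Same values as xgb_grid in Section 2.2
xgb_grid = {
    "max_depth": [5, 6, 7, 8, 9, 10, 11, 12],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4],
    "subsample": [0.7, 0.75, 0.8, 1],
    "reg_lambda": [100, 150, 200, 250, 300, 350, 400],
    "n_estimators": [200, 300, 400, 500, 600, 700, 800, 900, 1000],
    "min_child_weight": [4, 5, 6, 7, 8, 9, 10],
    "learning_rate": [0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25],
}

# Size of the full Cartesian grid
n_combinations = prod(len(values) for values in xgb_grid.values())

# Draw 200 distinct candidates at random (assumed to mirror n_iter=200)
candidates = list(ParameterSampler(xgb_grid, n_iter=200, random_state=405))

print(f"Full grid size: {n_combinations}")
print(f"Sampled candidates: {len(candidates)} ({len(candidates) / n_combinations:.4%} of the grid)")
print(f"First candidate: {candidates[0]}")
```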

In [7]:
import os
from methods.data_processing import separate_vars as _separate

treatment_var = 'TREATED'  # the treatment column of model_df was renamed in the save step above
for rv in outcomes:
    print("=== Getting results for " + rv + " ===")
    cols_needed = list(features) + [treatment_var, rv]
    df_subset = model_df.dropna(subset=[rv]).loc[:, cols_needed].copy()
    # Optional: subsample 10% stratified by treatment for faster tuning
    # df_subset = df_subset.groupby(treatment_var, group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=405))
    X, t, y = _separate(df_subset, rv, treatment_var)
    res = make_estimator_library(X, t, y, cv, base_learners, param_grids, n_iter=200)
    # Keep only the tuned hyperparameters of each estimator
    tuned = {est_name: est.get_params() for est_name, est in res.items()}
    # Save under this outcome's own directory (use rv, not the `outcome` variable left over from an earlier loop)
    params_dir = f"output/params/{rv}"
    os.makedirs(params_dir, exist_ok=True)
    joblib.dump(tuned, f"{params_dir}/{rv}_tuned_params.pkl")

=== Getting results for fausebal ===
Tuning s_xgb
Tuning s_rf
Tuning t_lasso
Tuning t_logistic
Tuning t_rf
Tuning t_xgb
Tuning x_lasso
Tuning x_logistic
Tuning x_rf
Tuning x_xgb
Tuning r_lassolasso
Tuning r_lassoxgb
Tuning r_lassorf
Tuning r_rflasso
Tuning r_rfrf
Tuning r_rfxgb
Tuning r_xgblasso
Tuning r_xgbrf
Tuning r_xgbxgb
