# Instructions

Assignment 1 for Clustering:
New methods in Machine Learning are developed either by borrowing formulas and concepts from other scientific fields and redefining them under new sets of assumptions, or by adding an extra step to an existing methodological framework.

In this exercise (Assignment 1 of the Clustering Topic), we will try to develop a novel method of Target Trial Emulation by integrating clustering concepts into the existing framework. Target Trial Emulation is a new methodological framework in epidemiology that tries to account for the biases in traditional designs.

These are the instructions:
1. Look at this website: https://rpubs.com/alanyang0924/TTE
2. Extract the dummy data in the package and save it as "data_censored.csv".
3. Convert the R code into Python code (use Jupyter Notebook) and replicate the results using your Python code.
4. Create another copy of your Python code and name it TTE-v2 (use Jupyter Notebook).
5. Using TTE-v2, think of a creative way to integrate a clustering mechanism: understand each step carefully and decide at which step a clustering method can be implemented. Generate insights from your results.
6. Do this in pairs, preferably with your thesis partner.
7. Push to your GitHub repository.
8. Deadline is 2 weeks from today: February 28, 2025 at 11:59 pm.

# Overview:

1. Setup
2. Data Preparation
3. Weight models and censoring
    (3.1) Censoring due to treatment switching
    (3.2) Other informative censoring
4. Calculate weights
5. Specify outcome models
6. Expand Trials
    (6.1) Create Sequence of Trials Data
7. Load or Sample from Expanded Data
8. Fit Marginal Structural Model
9. Inference


# Imports

In [1]:
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
import pickle

# 1. Setup

A sequence of target trials analysis starts by specifying which estimand will be used:


In [2]:
# Define estimands
trial_pp = {"estimand": "PP"}  # Per-protocol
trial_itt = {"estimand": "ITT"}  # Intention-to-treat

# Create directories for saving files
trial_pp_dir = os.path.join(os.getcwd(), "trial_pp")
os.makedirs(trial_pp_dir, exist_ok=True)

trial_itt_dir = os.path.join(os.getcwd(), "trial_itt")
os.makedirs(trial_itt_dir, exist_ok=True)

# Store directories in the trial objects
trial_pp["directory"] = trial_pp_dir
trial_itt["directory"] = trial_itt_dir

Additionally, it is useful to create a directory to save files for later inspection, as done above.

# 2. Data Preparation

Next, the user must specify the observational input data that will be used for the target trial emulation. Here we need to specify which columns contain which values and how they should be used.

In [3]:
data_censored = pd.read_csv("data_censored.csv")

# Display the first few rows
print(data_censored.head())

   id  period  treatment  x1        x2  x3        x4  age     age_s  outcome  \
0   1       0          1   1  1.146148   0  0.734203   36  0.083333        0   
1   1       1          1   1  0.002200   0  0.734203   37  0.166667        0   
2   1       2          1   0 -0.481762   0  0.734203   38  0.250000        0   
3   1       3          1   0  0.007872   0  0.734203   39  0.333333        0   
4   1       4          1   1  0.216054   0  0.734203   40  0.416667        0   

   censored  eligible  
0         0         1  
1         0         0  
2         0         0  
3         0         0  
4         0         0  
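
Before mapping columns, it can help to confirm that the file actually contains the columns the later steps rely on. The following is a minimal sketch and not part of the original notebook: `check_columns` is a hypothetical helper, and the small DataFrame only stands in for `data_censored`.

```python
import pandas as pd

def check_columns(df, required):
    """Return the required column names that are missing from df."""
    return [col for col in required if col not in df.columns]

# Tiny stand-in for data_censored, using column names from the output above
demo = pd.DataFrame({
    "id": [1, 1], "period": [0, 1], "treatment": [1, 1],
    "outcome": [0, 0], "censored": [0, 0], "eligible": [1, 0],
})

required = ["id", "period", "treatment", "outcome", "eligible"]
missing = check_columns(demo, required)
print(missing)  # an empty list means every required column is present
```

With the real data, the same call would be `check_columns(data_censored, required)`.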


In [4]:
# Define a function to structure the data
def set_data(data, id_col, period_col, treatment_col, outcome_col, eligible_col):
    return {
        "data": data,
        "id": id_col,
        "period": period_col,
        "treatment": treatment_col,
        "outcome": outcome_col,
        "eligible": eligible_col,
    }

# Using the function
trial_pp_data = set_data(
    data=data_censored,
    id_col="id",
    period_col="period",
    treatment_col="treatment",
    outcome_col="outcome",
    eligible_col="eligible",
)

# ITT (without pipe equivalent)
trial_itt_data = set_data(
    data=data_censored,
    id_col="id",
    period_col="period",
    treatment_col="treatment",
    outcome_col="outcome",
    eligible_col="eligible",
)

# Print the result (optional)
print(trial_itt_data)

{'data':      id  period  treatment  x1        x2  x3        x4  age     age_s  \
0     1       0          1   1  1.146148   0  0.734203   36  0.083333   
1     1       1          1   1  0.002200   0  0.734203   37  0.166667   
2     1       2          1   0 -0.481762   0  0.734203   38  0.250000   
3     1       3          1   0  0.007872   0  0.734203   39  0.333333   
4     1       4          1   1  0.216054   0  0.734203   40  0.416667   
..   ..     ...        ...  ..       ...  ..       ...  ...       ...   
720  99       3          0   0 -0.747906   1  0.575268   68  2.750000   
721  99       4          0   0 -0.790056   1  0.575268   69  2.833333   
722  99       5          1   1  0.387429   1  0.575268   70  2.916667   
723  99       6          1   1 -0.033762   1  0.575268   71  3.000000   
724  99       7          0   0 -1.340497   1  0.575268   72  3.083333   

     outcome  censored  eligible  
0          0         0         1  
1          0         0         0  
2        

# 3. Weight models and censoring

- To adjust for the effects of informative censoring, inverse probability of censoring weights (IPCW) can be applied. 
- To estimate these weights, we construct time-to-(censoring) event models.
- Two sets of models are fit for the two censoring mechanisms which may apply: censoring due to deviation from assigned treatment and other informative censoring.
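
As a rough illustration of how stabilized weights are typically formed: for each person, the weight at a given period is the cumulative product, over all periods so far, of the numerator-model probability of remaining uncensored divided by the denominator-model probability. The sketch below assumes both probabilities have already been predicted; the column names `p_num` and `p_den` are hypothetical and the numbers are made up for illustration.

```python
import pandas as pd

# Made-up predicted probabilities of remaining uncensored in each period
demo = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2],
    "period": [0, 1, 2, 0, 1],
    "p_num":  [0.9, 0.9, 0.9, 0.9, 0.9],     # numerator model (fewer covariates)
    "p_den":  [0.9, 0.8, 0.75, 0.95, 0.9],   # denominator model (more covariates)
})

# Periods must be in order within each person before taking a running product
demo = demo.sort_values(["id", "period"])
demo["ratio"] = demo["p_num"] / demo["p_den"]

# Stabilized weight: running product of the ratio within each person
demo["weight"] = demo.groupby("id")["ratio"].cumprod()
print(demo[["id", "period", "weight"]])
```

Periods where the denominator probability is lower than the numerator probability push the weight above 1, so people who resemble those likely to be censored count for more.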

### 3.1 Censoring due to treatment switching

We specify model formulas to be used for calculating the probability of receiving treatment in the current period. Separate models are fitted for patients who had treatment = 1 and those who had treatment = 0 in the previous period. Stabilized weights are used by fitting numerator and denominator models.

There are optional arguments to specify columns which can include/exclude observations from the treatment models. These are used in case it is not possible for a patient to deviate from a certain treatment assignment in that period.
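
To make the "separate models by previous treatment" idea concrete, the sketch below shows one way the rows could be split on the previous period's treatment using pandas. It is an assumption about how this step might be prepared, not the notebook's implementation; the `prev_treatment` column and the toy data are hypothetical.

```python
import pandas as pd

# Toy long-format data: one row per patient per period
demo = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2, 2],
    "period":    [0, 1, 2, 0, 1, 2],
    "treatment": [1, 1, 0, 0, 0, 1],
})

demo = demo.sort_values(["id", "period"])
# Treatment in the previous period within each patient; NaN in the first period
demo["prev_treatment"] = demo.groupby("id")["treatment"].shift(1)

# Rows that would feed the two separate treatment models
prev_treated = demo[demo["prev_treatment"] == 1]
prev_untreated = demo[demo["prev_treatment"] == 0]
print(len(prev_treated), len(prev_untreated))
```

Each subset could then be passed to its own numerator and denominator model. First-period rows have no previous treatment and fall into neither subset.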


In [7]:
class Trial:
    def __init__(self, data, estimand, directory):
        self.data = data["data"]  # Extract the DataFrame from the dictionary
        self.estimand = estimand
        self.directory = directory  # Store the directory path
        self.switch_weights = None
        self.censor_weights = None

    def fit_logistic_regression(self, X, y, save_path):
        """Fits a logistic regression model and saves it to a file."""
        if len(np.unique(y)) < 2:
            print("Skipping logistic regression: Only one class present in y")
            return None
        model = LogisticRegression()
        model.fit(X, y)
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        with open(save_path, "wb") as f:
            pickle.dump(model, f)
        return model
    
    def set_censor_weight_model(self, censor_event, numerator="1", denominator="1", pool_models="none", model_fitter=None):
        if model_fitter is None: 
            model_fitter = self.fit_logistic_regression
            
        if censor_event not in self.data.columns:
            raise ValueError(f"'{censor_event}' must be a column in the dataset.")
        
        formula_numerator = f"1 - {censor_event} ~ {numerator}"
        formula_denominator = f"1 - {censor_event} ~ {denominator}"

        self.censor_weights = {
            "numerator": formula_numerator,
            "denominator": formula_denominator,
            "pool_numerator": pool_models in ["numerator", "both"],
            "pool_denominator": pool_models == "both",
            "model_fitter": "te_stats_glm_logit"
        }

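        # Only one model is fitted here: the predictors are the numerator terms and the
        # response is the raw censor_event column rather than 1 - censor_event.
        self.censor_weights["fitted_model"] = model_fitter(self.data[numerator.split(" + ")], self.data[censor_event], os.path.join(self.directory, "censor_models", "censor_model.pkl"))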
        return self

    def set_switch_weight_model(self, numerator=None, denominator=None, model_fitter=None, eligible_wts_0=None, eligible_wts_1=None):
        if self.data is None:
            raise ValueError("set_data() before setting switch weight models")
        
        if self.estimand == "ITT":
            raise ValueError("Switching weights are not supported for intention-to-treat analyses")

        if eligible_wts_0 and eligible_wts_0 in self.data.columns:
            self.data = self.data.rename(columns={eligible_wts_0: "eligible_wts_0"})
        if eligible_wts_1 and eligible_wts_1 in self.data.columns:
            self.data = self.data.rename(columns={eligible_wts_1: "eligible_wts_1"})

        if numerator is None:
            numerator = "1"
        if denominator is None:
            denominator = "1"
        
        if "time_on_regime" in denominator:
            raise ValueError("time_on_regime should not be used in denominator.")

        formula_numerator = f"treatment ~ {numerator}"
        formula_denominator = f"treatment ~ {denominator}"

        self.switch_weights = {
            "numerator": formula_numerator,
            "denominator": formula_denominator,
            "model_fitter": "te_stats_glm_logit",
        }

        if model_fitter is not None:
            fitted_model = model_fitter(self.data[numerator.split(" + ")], self.data['treatment'], os.path.join(self.directory, "switch_models", "numerator.pkl"))
            self.switch_weights["fitted_model"] = fitted_model 

    def show_switch_weights(self):
        return self.switch_weights if self.switch_weights else "Not calculated"
    
    def show_censor_weights(self):
        return self.censor_weights if self.censor_weights else "Not calculated"
    
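# Create the per-protocol Trial object first; until now trial_pp was a plain dict without this method
trial_pp = Trial(trial_pp_data, "PP", trial_pp_dir)
trial_pp.set_switch_weight_model(numerator='age', denominator='age + x1 + x3', model_fitter=trial_pp.fit_logistic_regression)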
print(trial_pp.show_switch_weights())


{'numerator': 'treatment ~ age', 'denominator': 'treatment ~ age + x1 + x3', 'model_fitter': 'te_stats_glm_logit', 'fitted_model': LogisticRegression()}
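
The class above turns a formula string into column names with `numerator.split(" + ")`, which depends on exact spacing and would look for a column literally named `"1"` when the default intercept-only formula is used. A slightly more forgiving parser could look like the sketch below; `formula_terms` is a hypothetical helper, not part of the original code.

```python
def formula_terms(rhs):
    """Split a formula right-hand side such as 'age + x1 + x3' into column names."""
    terms = [term.strip() for term in rhs.split("+")]
    # "1" denotes an intercept-only model, so it maps to no predictor columns
    return [term for term in terms if term and term != "1"]

print(formula_terms("age + x1 + x3"))  # ['age', 'x1', 'x3']
print(formula_terms("age+x1"))         # ['age', 'x1']
print(formula_terms("1"))              # []
```

An empty result signals that no covariates were requested, which the caller would need to handle separately, since `LogisticRegression` cannot be fitted on zero columns.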


### 3.2 Other informative censoring


In [8]:
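# 3.2 Other informative censoring
# Create the PP trial object only if it is not already a Trial,
# so that any switch weights set in section 3.1 are not discarded
if not isinstance(trial_pp, Trial):
    trial_pp = Trial(trial_pp_data, "PP", trial_pp_dir)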

# Set censor weight model for PP
trial_pp.set_censor_weight_model(
    censor_event="censored", 
    numerator="x2", 
    denominator="x2 + x1", 
    pool_models="none", 
    model_fitter=lambda X, y, path: trial_pp.fit_logistic_regression(X, y, os.path.join(trial_pp_dir, "censor_models", "censor_model.pkl"))
)

print(trial_pp.show_censor_weights())

# Initialize trial object for ITT
trial_itt = Trial(trial_itt_data, "ITT", trial_itt_dir)

# Set censor weight model for ITT
trial_itt.set_censor_weight_model(
    censor_event="censored", 
    numerator="x2", 
    denominator="x2 + x1", 
    pool_models="numerator",  # Pool numerator across treatment arms
    model_fitter=lambda X, y, path: trial_itt.fit_logistic_regression(X, y, os.path.join(trial_itt_dir, "censor_models", "censor_model.pkl"))
)

print(trial_itt.show_censor_weights())

{'numerator': '1 - censored ~ x2', 'denominator': '1 - censored ~ x2 + x1', 'pool_numerator': False, 'pool_denominator': False, 'model_fitter': 'te_stats_glm_logit', 'fitted_model': LogisticRegression()}
{'numerator': '1 - censored ~ x2', 'denominator': '1 - censored ~ x2 + x1', 'pool_numerator': True, 'pool_denominator': False, 'model_fitter': 'te_stats_glm_logit', 'fitted_model': LogisticRegression()}
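
Because the fitted models are pickled to disk, they can be reloaded later for inspection. The sketch below shows the round trip under stated assumptions: `load_model` is a hypothetical helper, and a plain dict stands in for a fitted model so the example runs without the data.

```python
import os
import pickle
import tempfile

def load_model(path):
    """Load a pickled object from path; only use this on files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demonstrate the round trip with a plain dict standing in for a fitted model
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "censor_models", "censor_model.pkl")
    os.makedirs(os.path.dirname(demo_path), exist_ok=True)
    with open(demo_path, "wb") as f:
        pickle.dump({"coef": [0.5]}, f)
    restored = load_model(demo_path)

print(restored)
```

For the notebook's own files, the path would be built from the trial directory, for example `os.path.join(trial_pp_dir, "censor_models", "censor_model.pkl")`.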


# 4. Calculate weights