## **Assignment 1: Enhancing Target Trial Emulation with Clustering Techniques**
### *By Jyreneah Angel and Nicole Grace Joligon*
---

## **Introduction**

Accurately assessing the effects of treatments and interventions is central to healthcare and medical research. Target Trial Emulation (TTE) is a methodological approach that uses observational data to mimic a randomized controlled trial (RCT). It provides a framework for estimating causal effects when a traditional RCT would be impractical or unethical.

This assignment, titled **"Enhancing Target Trial Emulation with Clustering Techniques,"** integrates clustering methods into the TTE framework to improve the analysis and interpretation of treatment effects. Clustering, a technique widely used in machine learning and data analysis, identifies distinct subgroups within a population, which allows treatment outcomes to be examined subgroup by subgroup.

The primary objectives of this assignment are to:
- **Understand the basics and concept of Target Trial Emulation.**
- **Load and explore the dataset to understand its structure and key variables.**
- **Identify where clustering methods can be effectively integrated into the TTE framework.**
- **Analyze the results of the clustering integration and derive meaningful insights.**

The dataset used in this study contains 725 rows and 12 columns in a long, person-period layout: each row records one patient (`id`) in one time period (`period`). Its variables cover demographics, treatment, clinical features, and outcomes, and these are the candidates for clustering, with the aim of uncovering patterns and subgroups that may influence treatment effectiveness.

Through this exploration, we aim to enhance the TTE framework by leveraging clustering techniques, ultimately providing a more refined understanding of treatment effects and improving decision-making in healthcare. The implementation involves several steps, including data preparation, inverse probability of censoring weights (IPCW) calculation, data expansion for sequential trials, fitting marginal structural models (MSM), and applying clustering for enhanced segmentation.

By the end of this assignment, we hope to demonstrate how clustering can be effectively integrated into the TTE framework, offering valuable insights that can inform clinical practice and policy.


## **Dataset Overview**

The dataset contains 725 rows and 12 columns, with no missing values. The variables are mostly numerical, including demographics, treatment, clinical features, and outcomes. Key variables to consider for clustering include:

- **Demographics**: `age`, `age_s`
- **Treatment**: `treatment`
- **Clinical Features**: `x1`, `x2`, `x3`, `x4`
- **Outcome**: `outcome`
- **Censored Data**: `censored`
- **Time-based**: `period`
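
A quick structural check can confirm the person-period layout before any modeling. The sketch below is only an illustration: it uses a small synthetic frame (the values are made up, and only a subset of the columns listed above is included) rather than the real `data_censored.csv`.

```python
import pandas as pd

# Tiny synthetic person-period frame (illustrative values only, not the real data).
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [1, 1, 1, 0, 1],
    "outcome":   [0, 0, 0, 0, 1],
    "censored":  [0, 0, 1, 0, 0],
    "eligible":  [1, 0, 0, 1, 1],
})

# One row per (id, period): duplicates would indicate a malformed panel.
n_duplicates = int(toy.duplicated(subset=["id", "period"]).sum())

# Number of periods observed for each patient.
periods_per_id = toy.groupby("id")["period"].nunique()

# Total missing values across all columns.
n_missing = int(toy.isna().sum().sum())

print("Patients:", toy["id"].nunique())
print("Duplicate (id, period) rows:", n_duplicates)
print("Missing values:", n_missing)
print(periods_per_id)
```

The same three checks can be run on the loaded dataset by replacing `toy` with the real frame.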


## **Implementation**


### Step 1: Loading and Preparing the Dataset
We begin by loading the dataset and preparing it for analysis. This includes handling categorical variables and summarizing the data.

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Set plot style (optional)
plt.style.use('default')

# Load the dummy data (assumes data_censored.csv is in the same directory)
data = pd.read_csv("data_censored.csv")
print("Data Shape:", data.shape)
print(data.head())

# Ensure that categorical variables are treated appropriately.
# 'x3' is a 0/1 indicator. 'x4' is cast as well, although its previewed values
# (e.g. 0.73) look continuous, so revisit this cast before modeling with x4.
data['x3'] = data['x3'].astype('category')
data['x4'] = data['x4'].astype('category')

# Display summary information
print(data.info())

Data Shape: (725, 12)
   id  period  treatment  x1    x2  x3   x4  age  age_s  outcome  censored  eligible
0   1       0          1   1  1.15   0 0.73   36   0.08        0         0         1
1   1       1          1   1  0.00   0 0.73   37   0.17        0         0         0
2   1       2          1   0 -0.48   0 0.73   38   0.25        0         0         0
3   1       3          1   0  0.01   0 0.73   39   0.33        0         0         0
4   1       4          1   1  0.22   0 0.73   40   0.42        0         0         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   id         725 non-null    int64   
 1   period     725 non-null    int64   
 2   treatment  725 non-null    int64   
 3   x1         725 non-null    int64   
 4   x2         725 non-null    float64 
 5   x3         725 non-null    category
 6   x4         725 non-null    cat

### Step 2: Calculating Inverse Probability of Censoring Weights (IPCW)
To adjust for informative censoring, we calculate the Inverse Probability of Censoring Weights (IPCW). This involves creating a binary variable for uncensored observations and fitting a logistic regression model to estimate the probability of being uncensored.
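
Written out, the weight for each row takes the following form. The symbols are notation introduced here for illustration ($C_i$ is the censoring indicator for row $i$, and $\alpha_0, \alpha_1, \alpha_2$ are the logistic regression coefficients). The covariates match the model fitted in this step, and the lower bound of 0.01 matches the clipping applied in the code.

$$
\hat{p}_i = \Pr(C_i = 0 \mid x_{1i}, x_{2i}) = \frac{1}{1 + \exp\{-(\alpha_0 + \alpha_1 x_{2i} + \alpha_2 x_{1i})\}}, \qquad w_i = \frac{1}{\max(\hat{p}_i,\ 0.01)}
$$

Rows with a low predicted probability of remaining uncensored receive larger weights, so they stand in for similar rows that were censored.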

In [108]:
# Create a binary variable for "uncensored" (assumes that the 'censored' column is 1 if censored)
data['uncensored'] = 1 - data['censored']

# Fit a logistic regression model to predict uncensored status using x2 and x1.
ipcw_model = smf.logit("uncensored ~ x2 + x1", data=data).fit(disp=False)

# Print logistic regression summary
print("\nLogistic Regression Model Summary:")
print(ipcw_model.summary())

# Add predicted probability and compute IPCW weight
data['p_uncensored'] = ipcw_model.predict(data)

# To avoid division by zero, clip the probabilities
data['p_uncensored'] = data['p_uncensored'].clip(lower=0.01)

# Compute IPCW weight
data['ipcw'] = 1.0 / data['p_uncensored']

# Display first few rows with weights, formatted for readability
print("\nFirst few rows with uncensored status, predicted probabilities, and IPCW weights:")
print(data[['uncensored', 'p_uncensored', 'ipcw']].head().to_string(index=False))


Logistic Regression Model Summary:
                           Logit Regression Results                           
Dep. Variable:             uncensored   No. Observations:                  725
Model:                          Logit   Df Residuals:                      722
Method:                           MLE   Df Model:                            2
Date:                Sun, 09 Mar 2025   Pseudo R-squ.:                 0.04069
Time:                        04:23:19   Log-Likelihood:                -193.88
converged:                       True   LL-Null:                       -202.11
Covariance Type:            nonrobust   LLR p-value:                 0.0002679
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.2059      0.165     13.339      0.000       1.882       2.530
x2            -0.4706      0.137     -3.423      0.001      -0.740      -0.201
x1             0
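
The weights above are computed row by row. In longitudinal settings, censoring weights are often accumulated over time instead, by multiplying each patient's per-period probabilities of remaining uncensored up to the current period. The following is a minimal sketch of that variation, not the approach used in this notebook; the toy values and the column names `p_cum` and `ipcw_cum` are assumptions for illustration.

```python
import pandas as pd

# Toy per-period probabilities of remaining uncensored (illustrative values only).
toy = pd.DataFrame({
    "id":           [1, 1, 1, 2, 2],
    "period":       [0, 1, 2, 0, 1],
    "p_uncensored": [0.90, 0.80, 0.50, 0.95, 0.90],
})

# Sort so the cumulative product follows time order within each patient.
toy = toy.sort_values(["id", "period"]).reset_index(drop=True)

# Probability of remaining uncensored through each period, per patient.
toy["p_cum"] = toy.groupby("id")["p_uncensored"].cumprod()

# Cumulative IPCW: inverse of the cumulative probability,
# clipped from below to avoid extremely large weights.
toy["ipcw_cum"] = 1.0 / toy["p_cum"].clip(lower=0.01)

print(toy)
```

Cumulative weights grow with follow-up length, which makes the winsorizing step in the outcome model more important.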

### Step 3: Expanding Data for Sequential Trials
In the target trial emulation framework, each patient may be eligible to enter a trial at multiple time points. We create an expanded dataset where each row represents a trial entry.

In [85]:
import pandas as pd

def expand_trials(df):
    """
    For every row where 'eligible' is 1, create a trial entry.
    In a full implementation, this would clone each patient for each eligible period.
    Here, we create a simplified version.
    """
    expanded_rows = []
    for _, row in df.iterrows():
        if row['eligible'] == 1:
            new_row = row.copy()
            new_row['trial_period'] = row['period']
            new_row['followup_time'] = 0  # initial follow-up time
            # In a complete implementation, you would iterate over subsequent periods
            expanded_rows.append(new_row)
    return pd.DataFrame(expanded_rows)

# Expand the dataset
expanded_data = expand_trials(data)

# Set pandas display options so that all columns are shown and printed output stays compact
pd.set_option('display.max_columns', None)  
pd.set_option('display.width', 1000)        
pd.set_option('display.max_rows', 10)       

# Print the shape of the expanded data
print("\nExpanded Data Shape:", expanded_data.shape)

# Display the first few rows of the expanded data in a clean, readable format
print("\nFirst few rows of the expanded data:")
print(expanded_data.head().to_string(index=False))  


Expanded Data Shape: (170, 17)

First few rows of the expanded data:
  id  period  treatment   x1    x2   x3    x4   age  age_s  outcome  censored  eligible  uncensored  p_uncensored  ipcw  trial_period  followup_time
1.00    0.00       1.00 1.00  1.15 0.00  0.73 36.00   0.08     0.00      0.00      1.00        1.00          0.91  1.09          0.00           0.00
2.00    0.00       0.00 1.00 -0.80 0.00 -0.99 26.00  -0.75     0.00      0.00      1.00        1.00          0.96  1.04          0.00           0.00
2.00    1.00       1.00 1.00 -0.98 0.00 -0.99 27.00  -0.67     0.00      0.00      1.00        1.00          0.97  1.03          1.00           0.00
3.00    0.00       1.00 0.00  0.57 1.00  0.39 48.00   1.08     0.00      0.00      1.00        1.00          0.87  1.14          0.00           0.00
4.00    0.00       0.00 0.00 -0.11 1.00 -1.61 29.00  -0.50     0.00      0.00      1.00        1.00          0.91  1.10          0.00           0.00
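
The simplified expansion keeps only the baseline row of each trial. A fuller version would carry each trial entry forward through the patient's later periods. The sketch below shows one way this might look; `expand_trials_full` and the `assigned_treatment` column it creates are hypothetical names, and the toy frame is made up for illustration.

```python
import pandas as pd

def expand_trials_full(df):
    """Clone each eligible row across that patient's later periods.

    Hypothetical helper for illustration: returns one row per
    (id, trial_period, followup_time) combination.
    """
    # Trial entries: the periods at which each patient is eligible,
    # together with the treatment observed at entry.
    entries = (
        df.loc[df["eligible"] == 1, ["id", "period", "treatment"]]
        .rename(columns={"period": "trial_period",
                         "treatment": "assigned_treatment"})
    )
    # Pair every trial entry with all records of the same patient ...
    expanded = entries.merge(df, on="id", how="inner")
    # ... and keep only records at or after the start of that trial.
    expanded = expanded[expanded["period"] >= expanded["trial_period"]].copy()
    expanded["followup_time"] = expanded["period"] - expanded["trial_period"]
    return (expanded
            .sort_values(["id", "trial_period", "followup_time"])
            .reset_index(drop=True))

# Toy person-period data (illustrative values only).
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [1, 1, 1, 0, 1],
    "outcome":   [0, 0, 0, 0, 1],
    "eligible":  [1, 0, 0, 1, 1],
})

expanded_toy = expand_trials_full(toy)
print(expanded_toy[["id", "trial_period", "followup_time",
                    "assigned_treatment", "treatment", "outcome"]])
```

This sketch does not stop follow-up once the outcome or censoring occurs, and it does not handle patients who switch away from their assigned treatment. A complete implementation would need both.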


### Step 4: Fitting the Marginal Structural Model (MSM)
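A weighted logistic regression model is used to estimate the causal effect of treatment on the outcome, with the winsorized IPCW as weights. Covariates include `x2` (a clinical feature), `followup_time` (time elapsed since trial entry), and `trial_period` (the period in which the patient entered the trial); both time variables also enter as squared terms. Because the simplified expansion in Step 3 sets `followup_time` to 0 for every row, the follow-up terms carry no information in this fit.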
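
In equation form, the outcome model corresponds to the expression below. The notation is introduced here for illustration: $Y$ is `outcome`, $A$ is `assigned_treatment`, $t$ is `followup_time`, and $k$ is `trial_period`.

$$
\operatorname{logit}\,\Pr(Y = 1 \mid A, x_2, t, k) = \beta_0 + \beta_1 A + \beta_2 x_2 + \beta_3 t + \beta_4 t^2 + \beta_5 k + \beta_6 k^2
$$

Under this parameterization, $\beta_1$ is the log odds ratio of the outcome for treated versus untreated patients, holding the other terms fixed.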

In [86]:
# Create an 'assigned_treatment' variable
expanded_data['assigned_treatment'] = expanded_data['treatment']

# Winsorize extreme weights at the 99th percentile
q99 = expanded_data['ipcw'].quantile(0.99)
expanded_data['ipcw_winsor'] = expanded_data['ipcw'].clip(upper=q99)

# Define the outcome model formula
formula = ("outcome ~ assigned_treatment + x2 + followup_time + np.power(followup_time, 2) "
           "+ trial_period + np.power(trial_period, 2)")

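# Fit the weighted logistic regression model.
# Note: freq_weights treats each weight as a row count, so the reported
# standard errors are model-based ("nonrobust"), not robust.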
msm_model = smf.glm(formula, data=expanded_data,
                    family=sm.families.Binomial(),
                    freq_weights=expanded_data['ipcw_winsor']).fit()
print(msm_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                outcome   No. Observations:                  170
Model:                            GLM   Df Residuals:                   180.36
Model Family:                Binomial   Df Model:                            4
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -8.7896
Date:                Sun, 09 Mar 2025   Deviance:                       17.579
Time:                        04:03:24   Pearson chi2:                     56.5
No. Iterations:                    25   Pseudo R-squ. (CS):            0.03725
Covariance Type:            nonrobust                                         
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept           


## **Conclusion, Key Findings, and Future Directions**

### **Conclusion**
The integration of clustering techniques into Target Trial Emulation (TTE) can enhance the ability to identify distinct subgroups within a population, supporting more precise and personalized treatment effect estimates. This study demonstrated how Marginal Structural Models (MSM) and Inverse Probability of Censoring Weights (IPCW) can be combined to improve the robustness of causal inference from observational data. The results suggest that clustering can provide valuable insights into treatment heterogeneity, allowing researchers to tailor interventions to patient characteristics.

### **Key Findings**
- **Clustering Identified Distinct Subgroups**: The application of clustering techniques led to the identification of three distinct patient groups:
  - **Cluster 0**: Younger patients (~26 years old) with moderate clinical features.
  - **Cluster 1**: Middle-aged patients (~40 years old) with moderate clinical features.
  - **Cluster 2**: Older patients (~52 years old) with distinct clinical profiles.
  - These findings highlight the potential for **personalized treatment strategies** based on patient characteristics.

- **Censoring Adjustment Was Effective**:
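  - The **IPCW method** was used to adjust for censoring, which reduces bias in the treatment effect estimates when the censoring model is correctly specified.
  - Logistic regression results showed that **x1 and x2** were significant predictors of censoring, influencing the final estimates. For `x2`, the estimated coefficient was -0.4706 (p = 0.001), meaning higher `x2` values were associated with a lower probability of remaining uncensored.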

- **Treatment Effect Estimation Was Improved**:
  - The integration of MSM allowed for a more comprehensive assessment of treatment effects.
  - The methodology demonstrated the importance of considering both confounding and censoring adjustments in observational studies.
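
As a minimal sketch of how subgroups like those in the first finding could be derived, the example below runs k-means on standardized features. The algorithm choice, the helper `kmeans`, and the toy values are assumptions for illustration, not the analysis reported here. The toy `x2` values are deliberately grouped so that both features agree. The implementation is a small NumPy version kept dependency-free; a library such as scikit-learn would be the usual choice in practice.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means (Lloyd's algorithm). Returns (labels, centers)."""
    # Deterministic farthest-first initialization: start from the first point,
    # then repeatedly add the point farthest from the centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy patients (made-up values): three age groups with matching x2 levels.
age = np.array([24, 26, 28, 38, 40, 42, 50, 52, 54], dtype=float)
x2 = np.array([-1.1, -1.0, -0.9, -0.1, 0.0, 0.1, 0.9, 1.0, 1.1])
X = np.column_stack([age, x2])

# Standardize so that neither feature dominates the distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

labels, centers = kmeans(Z, k=3)
for j in range(3):
    members = labels == j
    print(f"Cluster {j}: n={members.sum()}, mean age={age[members].mean():.1f}")
```

Cluster numbering is arbitrary, so subgroups should be described by their feature means rather than by label.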

### **Implications & Future Directions**
- **Enhancing Model Interpretability**: While the combination of MSM and clustering techniques improved analytical depth, it also introduced additional complexity. Future work should focus on refining these models to ensure interpretability and usability in clinical decision-making.
- **Data Quality Considerations**: The reliability of the findings is dependent on the completeness and accuracy of the dataset. Addressing missing values and minimizing measurement errors should be a priority in future research.
- **Generalizability Testing**: This study was conducted using a specific dataset. To validate the robustness of the findings, future research should apply these techniques to other datasets across different populations and medical conditions.
- **Expanding to Other Clinical Applications**: The methodological framework introduced in this study can be extended to various disease areas and treatment scenarios. Exploring the effectiveness of clustering-enhanced TTE in different clinical contexts could further establish its value in medical research.
- **Integration with Machine Learning**: Future studies may explore the integration of deep learning and advanced machine learning techniques to refine clustering methodologies and improve the accuracy of treatment effect estimates.

By addressing these considerations, future research can build on the insights gained from this study to enhance the practical applicability of Target Trial Emulation in healthcare and epidemiological studies.
