# CIBMTR - Equity in post-HCT Survival Predictions

## Overview

In this project, I developed models to improve survival predictions for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT). This is an important step toward ensuring that every patient has a fair chance at a successful outcome, regardless of their background.

## What is an allogeneic HCT?

The human immune system is made up of cells that develop from hematopoietic stem cells, a special type of cell that resides in the bone marrow. These stem cells generate all blood cells, including red blood cells, platelet-producing cells, and immune system cells such as T cells, B cells, neutrophils, and natural killer (NK) cells. Allogeneic hematopoietic cell transplantation (HCT) can replace an individual's faulty hematopoietic stem cells with stem cells that produce normal immune system cells. In other words, a successful HCT can help fix a person's immune system by introducing healthy stem cells into their body. When hematopoietic stem cells are transferred from one person to another, the person receiving them is referred to as the HCT recipient. The term "allogeneic" indicates that the stem cells come from someone else, the hematopoietic stem cell donor. If the HCT is successful, the donor's hematopoietic stem cells replace the recipient's cells and produce blood and immune system cells that work correctly.

The source of hematopoietic stem cells can be bone marrow, peripheral blood, or umbilical cord blood. Depending on the source of the stem cells, HCT procedures may be called bone marrow transplants (BMT), peripheral blood stem cell transplants, or cord blood transplants.

More information on how blood stem cell transplants work: https://www.nmdp.org/patients/understanding-transplant/transplant-process

## Description

Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system.

This competition aims to encourage participants to advance predictive modeling by ensuring that survival predictions are both precise and fair for patients across diverse groups. By using synthetic data—which mirrors real-world situations while protecting patient privacy—participants can build and improve models that more effectively consider diverse backgrounds and conditions.

You’re challenged to develop advanced predictive models for allogeneic HCT that enhance both accuracy and fairness in survival predictions. The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups. Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve.

## Evaluation Criteria

The evaluation of prediction accuracy in the competition will involve a specialized metric known as the Stratified Concordance Index (C-index), adapted to consider different racial groups independently. This method allows us to gauge the predictive performance of models in a way that emphasizes equitability across diverse patient populations, particularly focusing on racial disparities in transplant outcomes.

### Concordance index

The concordance index is a global assessment of the model's discriminative power: its ability to rank survival times correctly based on the individual risk scores. It can be computed with the following formula:
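One common way to write it, assuming Harrell's definition for right-censored data (the symbols are chosen for this note and may differ from the competition's own notation):

```latex
C = \frac{\sum_{i,j} \mathbb{1}[T_j < T_i] \, \mathbb{1}[\eta_j > \eta_i] \, \delta_j}
         {\sum_{i,j} \mathbb{1}[T_j < T_i] \, \delta_j}
```

Here $T_i$ is the observed time for patient $i$, $\eta_i$ is the predicted risk score, and $\delta_j$ is 1 if the event was observed for patient $j$ and 0 if that patient was censored. The denominator counts the comparable pairs, and the numerator counts those in which the patient with the shorter time also received the higher risk score.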



The concordance index is a value between 0 and 1, where:

- 0.5 is the expected result from random predictions,
- 1.0 is perfect concordance, and
- 0.0 is perfect anti-concordance (multiply the predictions by -1 to get 1.0).
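To make these reference values concrete, here is a minimal pure-Python sketch of a pairwise concordance index and a per-group version. It is an illustration, not the competition's scoring code: the function names are hypothetical, ties in risk are counted as half, and aggregating the per-group scores as mean minus standard deviation is an assumption made for this example.

```python
from statistics import mean, pstdev


def concordance_index(times, events, risk_scores):
    """Pairwise C-index for right-censored data.

    A pair is comparable when the subject with the shorter time had an
    observed event. It is concordant when that subject also has the
    higher risk score; equal risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for j in range(n):
        if not events[j]:
            continue  # a censored subject cannot be the "earlier event" of a pair
        for i in range(n):
            if times[j] < times[i]:
                comparable += 1
                if risk_scores[j] > risk_scores[i]:
                    concordant += 1.0
                elif risk_scores[j] == risk_scores[i]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def stratified_c_index(times, events, risk_scores, groups):
    """C-index per group, aggregated as mean minus standard deviation.

    The aggregation rule is an assumption made for this sketch.
    """
    per_group = {}
    for g in sorted(set(groups)):
        idx = [k for k, v in enumerate(groups) if v == g]
        per_group[g] = concordance_index(
            [times[k] for k in idx],
            [events[k] for k in idx],
            [risk_scores[k] for k in idx],
        )
    values = list(per_group.values())
    return mean(values) - pstdev(values), per_group


# Toy data: higher risk goes with shorter time, third subject is censored.
times = [5.0, 10.0, 15.0, 20.0]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.1]

print(concordance_index(times, events, risk))                # 1.0
print(concordance_index(times, events, [-r for r in risk]))  # 0.0
print(concordance_index(times, events, [0.5] * 4))           # 0.5
```

Penalizing the spread across groups, as in `stratified_c_index`, means a model cannot score well by ranking one group accurately while ranking another poorly.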


## Dataset Description

The dataset consists of 59 variables related to hematopoietic stem cell transplantation (HSCT), encompassing a range of demographic and medical characteristics of both recipients and donors, such as age, sex, ethnicity, disease status, and treatment details. The primary outcome of interest is event-free survival, represented by the variable efs, while the time to event-free survival is captured by the variable efs_time. These two variables together encode the target for a censored time-to-event analysis. The data, which features equal representation across recipient racial categories including White, Asian, African-American, Native American, Pacific Islander, and More than One Race, was synthetically generated using the data generator from synthcity, trained on a large cohort of real CIBMTR data.
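As an illustration of how the two target columns work together, the sketch below computes a Kaplan-Meier survival estimate from (time, event) pairs in plain Python. It assumes that efs equal to 1 marks an observed event and 0 marks a censored patient; the function name is hypothetical, and the example values are the five efs_time / efs pairs shown in the `df_trainset.head()` output of this notebook.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- follow-up time per patient (for example efs_time, in months)
    events -- 1 if the event was observed, 0 if censored (for example efs)
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients that share the same time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without changing S(t).
        at_risk -= n_leaving
    return curve


efs_time = [42.356, 4.672, 19.793, 102.349, 16.223]
efs = [0, 1, 0, 0, 0]
print(kaplan_meier(efs_time, efs))  # [(4.672, 0.8)]
```

Censored patients still contribute information: they stay in the risk set until their last follow-up time, which is why efs_time is needed even when efs is 0.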

We have used the SurvivalGAN method, introduced in the paper "SurvivalGAN: Generating Time-to-Event Data for Survival Analysis" which addresses the generation of synthetic survival data with special considerations for censoring. SurvivalGAN is adept at capturing the intricate relationships and interactions among variables within survival data and their influence on time-to-event outcomes. This generative model utilizes a conditional Generative Adversarial Network (GAN) framework, which is specifically tailored to address the complexities of survival analysis, including the critical task of managing censored data. By conditioning on additional information such as censoring status and actual survival times, SurvivalGAN effectively learns the underlying distribution of the data, ensuring that the generated synthetic dataset retains the essential interactions among variables that are predictive of survival outcomes.

### Files

- train.csv - the training set, with target efs (event-free survival)
- test.csv - the test set; your task is to predict the value of efs for this data
- sample_submission.csv - a sample submission file in the correct format, with all predictions set to 0.50
- data_dictionary.csv - a list of all features and targets used in the dataset, with their descriptions

Note: The rerun test data contains approximately the same number of observations as the training data.


## Variable Overview

The dataset provides diverse and detailed information about patients undergoing allogeneic hematopoietic cell transplantation (HCT). Below is a structured explanation of a selection of the variables, grouped by type:

### Categorical Variables

- dri_score (Refined Disease Risk Index): Classification of disease risk. Examples: Intermediate, High, N/A - pediatric.
- psych_disturb (Psychiatric Disturbance): Indicates whether the patient has a history of psychiatric disorders (Yes, No, Not done).
- cyto_score (Cytogenetic Score): Evaluation of the patient’s cytogenetic profile. Examples: Favorable, Poor, Intermediate.
- diabetes: Indicates whether the patient has diabetes (Yes, No, Not done).
- tbi_status (TBI): Use of total body irradiation as part of the treatment (No TBI, TBI + Cy).
- arrhythmia: History of cardiac arrhythmia (Yes, No, Not done).
- graft_type (Graft Type): Type of graft used in the procedure (Peripheral blood, Bone marrow).
- vent_hist (History of Mechanical Ventilation): Indicates whether the patient required mechanical ventilation (Yes, No).
- renal_issue (Renal Issue): Presence of moderate or severe renal issues (Yes, No, Not done).
- pulm_severe (Pulmonary, Severe): Presence of severe pulmonary issues (Yes, No).
- prim_disease_hct (Primary Disease for HCT): The primary condition that led to the HCT (e.g., ALL, AML, MDS).
- cmv_status (CMV Serostatus): Donor/recipient CMV serostatus (e.g., +/-, +/+).
- conditioning_intensity: Intensity of the conditioning regimen (RIC, MAC, NMA).
- ethnicity (Ethnicity): Patient’s ethnicity (Hispanic or Latino, Not Hispanic or Latino).
- obesity: Indicates whether the patient is obese (Yes, No, Not done).
- in_vivo_tcd (In-vivo T-cell Depletion): Use of ATG/alemtuzumab for T-cell depletion (Yes, No).
- sex_match (Donor/Recipient Sex Match): Gender compatibility between donor and recipient (M-M, F-M).
- race_group (Race): Racial group of the patient (White, Black or African-American, etc.).
- efs (Event-free Survival): Event-free survival status (Event, Censoring).
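Many of these columns mix answers such as Yes, No and Not done with missing values. One possible way to prepare them for a model, shown here as a sketch rather than the approach used in this project, is to turn every non-numeric column into a pandas categorical and keep missing values as their own level. The function name, the "Missing" label, and the toy values are all hypothetical.

```python
import pandas as pd

# Toy frame reusing a few column names and levels from the list above.
toy = pd.DataFrame({
    "diabetes": ["No", "Yes", None, "Not done"],
    "graft_type": ["Peripheral blood", "Bone marrow", "Bone marrow", "Peripheral blood"],
    "donor_age": [30.0, 41.5, 25.0, 52.0],
})


def encode_categoricals(df, missing_label="Missing"):
    """Return a copy where non-numeric columns are pandas categoricals.

    Missing values become an explicit level, so "unknown" stays
    distinguishable from "No" and from "Not done".
    """
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(missing_label).astype("category")
    return out


encoded = encode_categoricals(toy)
print(encoded.dtypes)
print(sorted(encoded["diabetes"].cat.categories))
```

Keeping "Missing" separate from "Not done" preserves a distinction that may itself carry signal, at the cost of one extra level per column.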

### Numerical Variables

- hla_high_res_8: High-resolution HLA compatibility (A, B, C, DRB1).
- hla_low_res_6: Low-resolution HLA compatibility (A, B, DRB1).
- hla_match_c_high: Specific HLA-C compatibility (high resolution).
- hla_match_drb1_high: Specific HLA-DRB1 compatibility (high resolution).
- hla_match_dqb1_low: Specific HLA-DQB1 compatibility (low resolution).
- age_at_hct: Patient’s age at the time of HCT (years).
- donor_age: Donor’s age (years).
- comorbidity_score (Sorror Comorbidity Score): A score measuring the patient’s comorbidities.
- efs_time: Time to event-free survival (months).
- year_hct: Year of the HCT procedure.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
from pathlib import Path
# Folder containing the competition files (path specific to my machine)
DATA_DIR = Path('/Users/andreuolaortua/Desktop/code/Machine learning/00 - Projects/Project 20 - CIBMTR - Equity in post-HCT Survival Predictions/External Data/equity-post-HCT-survival-predictions')

df_dictionary = pd.read_csv(DATA_DIR / 'data_dictionary.csv')
df_sample_submission = pd.read_csv(DATA_DIR / 'sample_submission.csv')
df_testset = pd.read_csv(DATA_DIR / 'test.csv')
df_trainset = pd.read_csv(DATA_DIR / 'train.csv')

In [8]:
df_trainset.head()

| | ID | dri_score | psych_disturb | cyto_score | diabetes | hla_match_c_high | hla_high_res_8 | tbi_status | arrhythmia | hla_low_res_6 | ... | tce_div_match | donor_related | melphalan_dose | hla_low_res_8 | cardiac | hla_match_drb1_high | pulm_moderate | hla_low_res_10 | efs | efs_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | N/A - non-malignant indication | No | NaN | No | NaN | NaN | No TBI | No | 6.0 | ... | NaN | Unrelated | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 42.356 |
| 1 | 1 | Intermediate | No | Intermediate | No | 2.0 | 8.0 | TBI +- Other, >cGy | No | 6.0 | ... | Permissive mismatched | Related | N/A, Mel not given | 8.0 | No | 2.0 | Yes | 10.0 | 1.0 | 4.672 |
| 2 | 2 | N/A - non-malignant indication | No | NaN | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | Permissive mismatched | Related | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 19.793 |
| 3 | 3 | High | No | Intermediate | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | Permissive mismatched | Unrelated | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 102.349 |
| 4 | 4 | High | No | NaN | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | Permissive mismatched | Related | MEL | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 16.223 |


In [18]:
df_trainset.shape

(28800, 60)

In [13]:
df_trainset.columns

Index(['ID', 'dri_score', 'psych_disturb', 'cyto_score', 'diabetes',
       'hla_match_c_high', 'hla_high_res_8', 'tbi_status', 'arrhythmia',
       'hla_low_res_6', 'graft_type', 'vent_hist', 'renal_issue',
       'pulm_severe', 'prim_disease_hct', 'hla_high_res_6', 'cmv_status',
       'hla_high_res_10', 'hla_match_dqb1_high', 'tce_imm_match', 'hla_nmdp_6',
       'hla_match_c_low', 'rituximab', 'hla_match_drb1_low',
       'hla_match_dqb1_low', 'prod_type', 'cyto_score_detail',
       'conditioning_intensity', 'ethnicity', 'year_hct', 'obesity', 'mrd_hct',
       'in_vivo_tcd', 'tce_match', 'hla_match_a_high', 'hepatic_severe',
       'donor_age', 'prior_tumor', 'hla_match_b_low', 'peptic_ulcer',
       'age_at_hct', 'hla_match_a_low', 'gvhd_proph', 'rheum_issue',
       'sex_match', 'hla_match_b_high', 'race_group', 'comorbidity_score',
       'karnofsky_score', 'hepatic_mild', 'tce_div_match', 'donor_related',
       'melphalan_dose', 'hla_low_res_8', 'cardiac', 'hla_match_drb1_hi

In [12]:
df_trainset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28800 entries, 0 to 28799
Data columns (total 60 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      28800 non-null  int64  
 1   dri_score               28646 non-null  object 
 2   psych_disturb           26738 non-null  object 
 3   cyto_score              20732 non-null  object 
 4   diabetes                26681 non-null  object 
 5   hla_match_c_high        24180 non-null  float64
 6   hla_high_res_8          22971 non-null  float64
 7   tbi_status              28800 non-null  object 
 8   arrhythmia              26598 non-null  object 
 9   hla_low_res_6           25530 non-null  float64
 10  graft_type              28800 non-null  object 
 11  vent_hist               28541 non-null  object 
 12  renal_issue             26885 non-null  object 
 13  pulm_severe             26665 non-null  object 
 14  prim_disease_hct        28800 non-null

In [16]:
df_trainset.describe(include='all')

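| | ID | dri_score | psych_disturb | cyto_score | diabetes | hla_match_c_high | hla_high_res_8 | tbi_status | arrhythmia | hla_low_res_6 | ... | tce_div_match | donor_related | melphalan_dose | hla_low_res_8 | cardiac | hla_match_drb1_high | pulm_moderate | hla_low_res_10 | efs | efs_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28800.0 | 28646 | 26738 | 20732 | 26681 | 24180.0 | 22971.0 | 28800 | 26598 | 25530.0 | ... | 17404 | 28642 | 27395 | 25147.0 | 26258 | 25448.0 | 26753 | 23736.0 | 28800.0 | 28800.0 |
| unique | NaN | 11 | 3 | 7 | 3 | NaN | NaN | 8 | 3 | NaN | ... | 4 | 3 | 2 | NaN | 3 | NaN | 3 | NaN | NaN | NaN |
| top | NaN | Intermediate | No | Poor | No | NaN | NaN | No TBI | No | NaN | ... | Permissive mismatched | Related | N/A, Mel not given | NaN | No | NaN | No | NaN | NaN | NaN |
| freq | NaN | 10436 | 23005 | 8802 | 22201 | NaN | NaN | 18861 | 25203 | NaN | ... | 12936 | 16208 | 20135 | NaN | 24592 | NaN | 21338 | NaN | NaN | NaN |
| mean | 14399.5 | NaN | NaN | NaN | NaN | 1.764516 | 6.876801 | NaN | NaN | 5.143322 | ... | NaN | NaN | NaN | 6.903448 | NaN | 1.707128 | NaN | 8.664687 | 0.539306 | 23.237678 |
| std | 8313.988213 | NaN | NaN | NaN | NaN | 0.431941 | 1.564313 | NaN | NaN | 1.207757 | ... | NaN | NaN | NaN | 1.565017 | NaN | 0.461179 | NaN | 1.882746 | 0.498461 | 24.799748 |
| min | 0.0 | NaN | NaN | NaN | NaN | 0.0 | 2.0 | NaN | NaN | 2.0 | ... | NaN | NaN | NaN | 2.0 | NaN | 0.0 | NaN | 4.0 | 0.0 | 0.333 |
| 25% | 7199.75 | NaN | NaN | NaN | NaN | 2.0 | 6.0 | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | 6.0 | NaN | 1.0 | NaN | 7.0 | 0.0 | 5.61975 |
| 50% | 14399.5 | NaN | NaN | NaN | NaN | 2.0 | 8.0 | NaN | NaN | 6.0 | ... | NaN | NaN | NaN | 8.0 | NaN | 2.0 | NaN | 10.0 | 1.0 | 9.7965 |
| 75% | 21599.25 | NaN | NaN | NaN | NaN | 2.0 | 8.0 | NaN | NaN | 6.0 | ... | NaN | NaN | NaN | 8.0 | NaN | 2.0 | NaN | 10.0 | 1.0 | 35.1 |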


In [19]:
df_trainset.isnull().sum()

ID                            0
dri_score                   154
psych_disturb              2062
cyto_score                 8068
diabetes                   2119
hla_match_c_high           4620
hla_high_res_8             5829
tbi_status                    0
arrhythmia                 2202
hla_low_res_6              3270
graft_type                    0
vent_hist                   259
renal_issue                1915
pulm_severe                2135
prim_disease_hct              0
hla_high_res_6             5284
cmv_status                  634
hla_high_res_10            7163
hla_match_dqb1_high        5199
tce_imm_match             11133
hla_nmdp_6                 4197
hla_match_c_low            2800
rituximab                  2148
hla_match_drb1_low         2643
hla_match_dqb1_low         4194
prod_type                     0
cyto_score_detail         11923
conditioning_intensity     4789
ethnicity                   587
year_hct                      0
obesity                    1760
mrd_hct 
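Raw counts are easier to judge as a share of the 28800 rows; for example, the 11923 missing values in cyto_score_detail correspond to about 41% of the training set. The helper below is a small sketch of such a summary (the function name and the toy frame are hypothetical); on the real data it would be called as `missing_summary(df_trainset)`.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(2),
    })
    # Keep only columns that actually have gaps.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for df_trainset.
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(toy))
```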

In [20]:
df_trainset.dri_score.value_counts()

dri_score
Intermediate                                         10436
N/A - pediatric                                       4779
High                                                  4701
N/A - non-malignant indication                        2427
TBD cytogenetics                                      2003
Low                                                   1926
High - TED AML case <missing cytogenetics             1414
Intermediate - TED AML case <missing cytogenetics      481
N/A - disease not classifiable                         272
Very high                                              198
Missing disease status                                   9
Name: count, dtype: int64
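Note that `value_counts()` drops missing values by default: the counts above add up to 28646, which matches the non-null count for dri_score, so the 154 missing rows are not shown. A small sketch on a toy series (the values are illustrative) shows how `dropna=False` and `normalize=True` make the missing share visible:

```python
import pandas as pd

# Toy series standing in for df_trainset.dri_score.
toy_dri = pd.Series(
    ["Intermediate", "High", None, "Intermediate", "Low", None],
    name="dri_score",
)

counts = toy_dri.value_counts(dropna=False)                  # NaN gets its own row
shares = toy_dri.value_counts(dropna=False, normalize=True)  # same, as fractions

print(counts)
print(shares.round(3))
```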

In [26]:
df_trainset[df_trainset.dri_score.isnull()]

| | ID | dri_score | psych_disturb | cyto_score | diabetes | hla_match_c_high | hla_high_res_8 | tbi_status | arrhythmia | hla_low_res_6 | ... | tce_div_match | donor_related | melphalan_dose | hla_low_res_8 | cardiac | hla_match_drb1_high | pulm_moderate | hla_low_res_10 | efs | efs_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 38 | NaN | No | Intermediate | No | NaN | NaN | No TBI | No | 4.0 | ... | NaN | Related | N/A, Mel not given | NaN | No | NaN | NaN | NaN | 1.0 | 10.188 |
| 187 | 187 | NaN | No | NaN | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | NaN | Unrelated | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 49.754 |
| 271 | 271 | NaN | NaN | Favorable | No | NaN | NaN | No TBI | NaN | NaN | ... | NaN | Related | N/A, Mel not given | NaN | NaN | 1.0 | No | NaN | 0.0 | 42.757 |
| 559 | 559 | NaN | Yes | Poor | No | 2.0 | 7.0 | TBI +- Other, <=cGy | No | 6.0 | ... | Permissive mismatched | Unrelated | MEL | 8.0 | NaN | 2.0 | No | 10.0 | 0.0 | 25.263 |
| 659 | 659 | NaN | No | Poor | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | NaN | Unrelated | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 27.340 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27947 | 27947 | NaN | No | Poor | No | 2.0 | NaN | TBI +- Other, >cGy | No | 4.0 | ... | Permissive mismatched | Related | N/A, Mel not given | 5.0 | No | NaN | No | NaN | 1.0 | 5.678 |
| 27962 | 27962 | NaN | NaN | Favorable | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | Permissive mismatched | Unrelated | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 59.352 |
| 28053 | 28053 | NaN | No | NaN | No | 2.0 | 8.0 | No TBI | No | 6.0 | ... | Permissive mismatched | Unrelated | N/A, Mel not given | 8.0 | No | 2.0 | No | 10.0 | 0.0 | 101.367 |
| 28290 | 28290 | NaN | No | Intermediate | No | 1.0 | 5.0 | TBI +- Other, >cGy | No | 5.0 | ... | NaN | Related | N/A, Mel not given | 6.0 | No | 2.0 | No | NaN | 1.0 | 6.157 |


In [27]:
df_trainset.dri_score.value_counts()

dri_score
Intermediate                                         10436
N/A - pediatric                                       4779
High                                                  4701
N/A - non-malignant indication                        2427
TBD cytogenetics                                      2003
Low                                                   1926
High - TED AML case <missing cytogenetics             1414
Intermediate - TED AML case <missing cytogenetics      481
N/A - disease not classifiable                         272
Very high                                              198
Missing disease status                                   9
Name: count, dtype: int64