# Preface

## **Title**  
*Prognostic Value of Baseline and Pre-Lymphodepletion PET/CT Imaging in DLBCL Patients Undergoing CAR T-Cell Therapy*

***

## Motivation

Chimeric Antigen Receptor (CAR) T-cell therapy has emerged as a transformative treatment modality for hematologic malignancies, demonstrating remarkable efficacy in diffuse large B-cell lymphoma (DLBCL) [1], [2], [3], [4], [5]. However, bridging therapy is frequently required to control disease burden during the manufacturing period before CAR T-cell infusion [1], [6], [7], [8], [9]. One way to assess the effect of bridging therapy ahead of CAR T-cell infusion is 18F-Fluorodeoxyglucose Positron Emission Tomography/Computed Tomography (18F-FDG PET/CT) imaging. Current literature predominantly focuses on conventional PET metrics such as metabolically active tumor volume (MATV) and maximum standardized uptake value (SUVmax) at single timepoints, rather than employing comprehensive radiomic analysis of dynamic changes [10], [11], [12]. The prognostic value of high-dimensional radiomic features and their temporal evolution (delta radiomics) between baseline and pre-lymphodepletion chemotherapy (pre-LD) scans remains largely unexplored in the CAR T-cell therapy context [11].

***

## Strategic goals

We aim to assess whether baseline, pre-LD or delta radiomic profiles (extracted during the bridging period) provide superior prognostic value compared to conventional clinical variables for predicting treatment response, toxicity, progression-free survival, and overall survival.

***

## Starting point

Current literature predominantly focuses on conventional PET metrics such as metabolically active tumor volume (MATV) and maximum standardized uptake value (SUVmax) at single timepoints, rather than employing comprehensive radiomic analysis of dynamic changes [13]. Preliminary evidence suggests that reduced MATV prior to infusion correlates with improved overall survival (OS) and time to progression (TTP) [12], [14]. Few studies have systematically assessed delta radiomic features, and almost none have explored high-dimensional changes in a CAR T-cell cohort [13], [15]. Bridging strategies (systemic therapy, radiotherapy, or combinations) may influence imaging dynamics, but their detailed prognostic impact remains unclear [16], [17].

***

## Expected results (Hypothesis)

We hypothesize that comprehensive delta radiomic analysis will demonstrate enhanced predictive capability compared to conventional single-timepoint metrics.

# Purpose of this notebook

During this course project, we semi-manually segmented the lesions on the PET/CT scans, guided by the lesion reports that the involved radiologists prepared for each patient.

After this stage, we received the clinical data, which include factors such as age, gender, and the dates of important events.

This notebook preprocesses the radiomics data extracted from the images, combines them with the clinical data, and then runs suitable analyses to test our hypothesis.

# Results so far
The results below come from univariate Cox regression on 31 patients and about 170 features, corrected for multiple testing.

**Overall Survival (OS)**  
**Definition:** Time from randomization or treatment start to death from any cause [1].

After correcting for multiple testing, no features remained significant.  
Before the correction, radiomic features from the time point closer to the start of CAR T-cell therapy showed significant hazard ratios.

**Progression-Free Survival (PFS)**  
**Definition:** The length of time during and after the treatment of a disease, such as cancer, that a patient lives with the disease but it does not get worse. In a clinical trial, measuring progression-free survival is one way to see how well a new treatment works [1].

This analysis showed results similar to those for overall survival.
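
The notebook does not state which multiple-testing correction was applied. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure and applies it to a handful of made-up p-values; the function name `benjamini_hochberg` and the values in `p_raw` are hypothetical and are not taken from the analysis.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity: a higher-ranked p-value may not get a smaller adjusted value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# hypothetical raw p-values from five univariate tests
p_raw = [0.03, 0.001, 0.5, 0.02, 0.04]
p_adj = benjamini_hochberg(p_raw)
print(dict(zip(p_raw, np.round(p_adj, 3))))
```

With roughly 170 features and about 30 patients, an adjustment of this kind scales small raw p-values up considerably, which is consistent with features appearing significant before correction but not after.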



[1] Gutman SI, Piper M, Grant MD, et al. Progression-Free Survival: What Does It Mean for Psychological Well-Being or Quality of Life? [Internet] Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Apr. Background. Available from: https://www.ncbi.nlm.nih.gov/books/NBK137763/


## Delta radiomics computation

In [1]:
import pandas as pd
import numpy as np
import os
import yaml

In [2]:
def calculate_delta_radiomics(data_folder_path):
    """
    Reads radiomics data from patient subfolders (Time A and Time B), filters for the
    'suv2.5' segmentation, calculates the delta (B - A) for numeric features, and
    collects the results per patient into three DataFrames.

    Args:
        data_folder_path (str): The path to the main folder containing patient subfolders.

    Returns:
        (pd.DataFrame, pd.DataFrame, pd.DataFrame):
            delta_df: Delta radiomics (B - A), patients as index, features as columns.
            A_df: Radiomics at time A, same shape.
            B_df: Radiomics at time B, same shape.
    """
    # dicts to store radiomics and delta values for each patient
    all_delta_radiomics = {}
    A_radiomics, B_radiomics = {}, {}

    # loop over everything inside the main data folder (each item should be one patient)
    for patient_folder_name in os.listdir(data_folder_path):
        patient_path = os.path.join(data_folder_path, patient_folder_name)
        
        # make sure it's actually a folder (and not some random file)
        if os.path.isdir(patient_path):
            print(f"--- Processing {patient_folder_name} ---")
            
            # here we’ll store the paths to A and B Excel files
            file_A_path = None
            file_B_path = None
            
            # search inside the patient folder for the A/B radiomics files
            for filename in os.listdir(patient_path):
                path_excel = os.path.join(patient_path, filename)

                # match on the uppercase file name only (not the full path), so the search
                # is case-insensitive and folder names cannot trigger a false '_A' / '_B' hit
                upper_name = filename.upper()
                # treat anything with '_A' + .xlsx as the Time A file
                if '_A' in upper_name and upper_name.endswith('.XLSX'):
                    file_A_path = path_excel
                # treat anything with '_B' + .xlsx as the Time B file
                elif '_B' in upper_name and upper_name.endswith('.XLSX'):
                    file_B_path = path_excel

            # only continue if we actually found both A and B files for this patient
            if file_A_path and file_B_path:
                try:
                    # read both Excel files into pandas DataFrames
                    df_A = pd.read_excel(file_A_path)
                    df_B = pd.read_excel(file_B_path)
                    
                    # pick the first row whose Segmentation column contains the literal text
                    # 'suv2.5' (regex=False, so the '.' is not treated as a wildcard)
                    # and then keep only the feature columns starting from index 23
                    row_A = df_A[df_A['Segmentation'].str.contains('suv2.5', regex=False)].iloc[0, 23:]
                    row_B = df_B[df_B['Segmentation'].str.contains('suv2.5', regex=False)].iloc[0, 23:]

                    # convert the features to numeric values; anything weird becomes NaN
                    numeric_A = pd.to_numeric(row_A, errors='coerce')
                    numeric_B = pd.to_numeric(row_B, errors='coerce')

                    # delta radiomics = value at Time B minus value at Time A
                    delta_radiomics = numeric_B - numeric_A
                    
                    # save everything into dicts (drop NaNs to avoid broken features)
                    all_delta_radiomics[patient_folder_name] = delta_radiomics.dropna().to_dict()
                    A_radiomics[patient_folder_name] = numeric_A.dropna().to_dict()
                    B_radiomics[patient_folder_name] = numeric_B.dropna().to_dict()

                    # just to see progress in the console
                    print(f"Successfully calculated radiomics and delta radiomics for {patient_folder_name}.")

                except Exception as e:
                    # if something crashes for this patient, we just print it and move on
                    print(f"Error processing files for {patient_folder_name}: {e}")
            else:
                # if one of the files is missing, we log it here
                print(f"Could not find both A and B files in {patient_folder_name}.")

    # at the end, convert the dicts to DataFrames (rows = patients, columns = features)
    A_df = pd.DataFrame.from_dict(A_radiomics, orient='index')
    B_df = pd.DataFrame.from_dict(B_radiomics, orient='index')
    delta_df = pd.DataFrame.from_dict(all_delta_radiomics, orient='index')

    # return all three DataFrames so we can use them later in the notebook
    return delta_df, A_df, B_df
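

The core of the function above (select the 'suv2.5' row, coerce to numbers, subtract, drop NaNs) can be tried on toy data without any Excel files. This is a minimal sketch: the helper `select_suv25` and all values in `time_a` and `time_b` are hypothetical stand-ins for one patient's exports.

```python
import pandas as pd

def select_suv25(df):
    """Return the features of the first 'suv2.5' row as numeric values."""
    # regex=False matches the literal text, so '.' is not a wildcard
    mask = df["Segmentation"].str.contains("suv2.5", regex=False)
    row = df.loc[mask].iloc[0].drop("Segmentation")
    # anything that cannot be parsed as a number becomes NaN
    return pd.to_numeric(row, errors="coerce")

# hypothetical Time A and Time B tables for a single patient
time_a = pd.DataFrame({
    "Segmentation": ["suv4.0", "suv2.5"],
    "Volume (cc)": [8.0, 12.0],
    "TLG": [40.0, 60.0],
    "glrlm_RunEntropy": ["n/a", "n/a"],
})
time_b = pd.DataFrame({
    "Segmentation": ["suv4.0", "suv2.5"],
    "Volume (cc)": [5.0, 9.0],
    "TLG": [20.0, 45.0],
    "glrlm_RunEntropy": ["n/a", "1.2"],
})

# delta = Time B minus Time A; a feature missing at either time point is dropped
delta = (select_suv25(time_b) - select_suv25(time_a)).dropna()
print(delta.to_dict())
```

A feature that is missing at only one time point still produces a NaN delta, which is why the delta table can end up with fewer usable features than either the A or B table.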



## Load config and preview delta radiomics

In [3]:
# read config file (YAML) so we don't hard-code any paths in the notebook
with open("config.yaml", "r") as f:  # yaml.safe_load is the usual way to parse YAML configs
    cfg = yaml.safe_load(f)

data_folder_path = cfg["paths"]["data_folder"]

# run the function we wrote above to get delta, A, and B radiomics as DataFrames
delta_radiomics_results, a_radiomics, b_radiomics = calculate_delta_radiomics(data_folder_path)

# quick sanity check: print a small summary for each patient
print("\n--- Final Results Summary ---")
for patient, row in delta_radiomics_results.iterrows():  # iterrows lets us loop over patients row by row
    non_na = row.dropna()  # dropna is the standard way to remove missing values in pandas
    print(f"\n{patient} Delta Radiomics ({len(non_na)} features):")
    print(non_na.head().to_dict())  # to_dict() is handy to print a compact dict view of the features

--- Processing 024 ---
Successfully calculated radiomics and delta radiomics for 024.
--- Processing 023 ---
Successfully calculated radiomics and delta radiomics for 023.
--- Processing 015 ---
Successfully calculated radiomics and delta radiomics for 015.
--- Processing 046 ---
Successfully calculated radiomics and delta radiomics for 046.
--- Processing 048 ---
Successfully calculated radiomics and delta radiomics for 048.
--- Processing 077 ---
Successfully calculated radiomics and delta radiomics for 077.
--- Processing 070 ---
Successfully calculated radiomics and delta radiomics for 070.
--- Processing 013 ---
Successfully calculated radiomics and delta radiomics for 013.
--- Processing 014 ---
Successfully calculated radiomics and delta radiomics for 014.
--- Processing 022 ---
Successfully calculated radiomics and delta radiomics for 022.
--- Processing 047 ---
Successfully calculated radiomics and delta radiomics for 047.
--- Processing 007 ---
Successfully calculated radiomi

In [4]:
delta_radiomics_results

Unnamed: 0,MeshVolume (cc),Volume (cc),Compactness1,Compactness2,Elongation,Flatness,LeastAxisLength,MajorAxisLength,Maximum2DDiameterColumn,Maximum2DDiameterRow,...,glrlm_LongRunLowGrayLevelEmphasis,glrlm_LowGrayLevelRunEmphasis,glrlm_RunEntropy,glrlm_RunLengthNonUniformity,glrlm_RunLengthNonUniformityNormalized,glrlm_RunPercentage,glrlm_RunVariance,glrlm_ShortRunEmphasis,glrlm_ShortRunHighGrayLevelEmphasis,glrlm_ShortRunLowGrayLevelEmphasis
24,-1350.192459,-1349.633052,-0.000854,-0.004856,-0.071998,-0.065654,-27.970122,3.548268,-51.767337,-44.872591,...,,,,,,,,,,
23,-219.73116,-219.58596,-0.001445,-0.016234,-0.053647,-0.079159,-8.566525,17.394789,-17.549948,30.855676,...,,,,,,,,,,
15,49.867963,49.214353,0.016168,0.256642,0.266914,0.405557,-0.581837,-299.663486,-166.887134,-267.583392,...,,,,,,,,,,
46,81.359002,81.531557,0.00445,0.05048,0.116233,0.059981,-53.057164,-718.377143,-106.43608,-167.299933,...,,,,,,,,,,
48,132.95964,134.90532,-0.003025,-0.029388,-0.000437,0.012482,16.392367,101.549356,0.102906,24.886956,...,,,,,,,,,,
77,62.073205,61.188505,0.012091,0.27319,0.263203,0.349453,19.774698,-36.960332,-28.05052,-31.14318,...,91.455757,0.0,1.217388,26.828424,-0.119355,-0.197984,13.049563,-0.193916,-0.193916,-0.193916
70,-1857.982951,-1865.534346,-0.00259,-0.022411,0.178294,0.071478,18.907248,-127.680835,-720.751501,-693.996017,...,-137.743053,-0.012458,-0.989757,-8.304674,0.060624,0.093988,-50.938508,0.045758,0.081561,0.036807
13,282.473562,285.526503,-0.015736,-0.255738,0.442808,-0.060073,26.527685,124.898852,64.273602,104.376945,...,-7.80773,0.042759,-0.268261,153.580232,0.007379,0.014061,-1.636574,-0.009532,-0.101302,0.013411
14,-618.581947,-619.5321,-0.001067,-0.007992,0.153009,-0.070289,-34.663123,144.560198,68.070835,63.707567,...,,,,,,,,,,
22,-1760.492863,-1745.137872,-0.000683,-0.003,0.113102,0.155583,-7.834522,-259.682419,-251.945347,-282.016102,...,,,,,,,,,,


In [5]:
delta_radiomics_results.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31 entries, 024 to 005
Data columns (total 99 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   MeshVolume (cc)                         31 non-null     float64
 1   Volume (cc)                             31 non-null     float64
 2   Compactness1                            31 non-null     float64
 3   Compactness2                            31 non-null     float64
 4   Elongation                              31 non-null     float64
 5   Flatness                                31 non-null     float64
 6   LeastAxisLength                         31 non-null     float64
 7   MajorAxisLength                         31 non-null     float64
 8   Maximum2DDiameterColumn                 31 non-null     float64
 9   Maximum2DDiameterRow                    31 non-null     float64
 10  Maximum2DDiameterSlice                  31 non-null     float64
 1

In [6]:
# clean up the three radiomics DataFrames (delta, A, B)
# idea: some radiomic features are super sparse (NaNs almost everywhere),
# so we only keep columns that are fully filled for all patients
# after that, we reset the index and turn it into an 'id' column

for df in [delta_radiomics_results, a_radiomics, b_radiomics]:
    df.dropna(axis=1, how='any', inplace=True)  # classic pandas dropna on columns 
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'id'}, inplace=True)
    df['id'] = df['id'].astype(int)  # simple cast; if this ever crashes we know IDs are not purely numeric


In [7]:
delta_radiomics_results.head()

Unnamed: 0,id,MeshVolume (cc),Volume (cc),Compactness1,Compactness2,Elongation,Flatness,LeastAxisLength,MajorAxisLength,Maximum2DDiameterColumn,...,SUV_StandardDeviation,SUV_TotalEnergy,SUV_Uniformity,SUV_Variance,TLG,Number of lesions,Dmax Patient (mm),Spread Patient (mm),Dmax Bulk (mm),Spread Bulk (mm)
0,24,-1350.192459,-1349.633052,-0.000854,-0.004856,-0.071998,-0.065654,-27.970122,3.548268,-51.767337,...,1.021023,8423473.0,0.002524,5.870512,-2262.709063,-10.0,-375.047277,-4538.11501,-266.858574,-3568.952663
1,23,-219.73116,-219.58596,-0.001445,-0.016234,-0.053647,-0.079159,-8.566525,17.394789,-17.549948,...,-3.838671,-162696600.0,0.033541,-38.852979,-7366.728039,2.0,40.188724,424.511696,40.188724,671.769366
2,15,49.867963,49.214353,0.016168,0.256642,0.266914,0.405557,-0.581837,-299.663486,-166.887134,...,0.091898,1737152.0,0.0,0.131768,321.512895,-3.0,-530.927813,-1122.497606,-530.927813,-1122.497606
3,46,81.359002,81.531557,0.00445,0.05048,0.116233,0.059981,-53.057164,-718.377143,-106.43608,...,7.408148,25137480.0,-0.265369,85.762131,1210.552077,-4.0,-350.715798,-4689.383535,-696.112137,-6095.639092
4,48,132.95964,134.90532,-0.003025,-0.029388,-0.000437,0.012482,16.392367,101.549356,0.102906,...,0.160606,8044910.0,0.0,0.871056,980.935693,0.0,-3.122628,-895.747256,-3.122628,857.648622


The raw radiomics tables initially contained 99 features per patient, but many of these features had missing values for a substantial fraction of the 31 patients. To avoid unstable models and complex imputation on such a small cohort, only features that are fully observed for all patients are kept: any feature column that contains at least one missing value is dropped. This yields a smaller but complete feature set (44 columns, counting the `id` column added above, compared with the original 99 feature columns), which is easier to interpret and better suited to downstream modeling in a small-sample setting.
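
The difference between dropping columns with any missing value and dropping only fully empty columns can be seen on a small example. This is a sketch on a hypothetical three-patient table; the column `empty_feature` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical feature table: rows are patients, columns are features
toy = pd.DataFrame(
    {
        "Volume (cc)": [1.0, 2.0, 3.0],
        "glrlm_RunEntropy": [np.nan, 0.5, np.nan],
        "empty_feature": [np.nan, np.nan, np.nan],
    },
    index=["024", "023", "015"],
)

# how='any' keeps only columns that are complete for every patient
complete_only = toy.dropna(axis=1, how="any")
# how='all' removes only columns that are missing for every patient
not_all_missing = toy.dropna(axis=1, how="all")
# fraction of missing values per feature, useful before deciding what to drop
missing_fraction = toy.isna().mean()

print(list(complete_only.columns))
print(list(not_all_missing.columns))
print(missing_fraction.round(2).to_dict())
```

Inspecting `missing_fraction` first shows how many features the strict `how='any'` rule removes and whether a softer threshold would retain noticeably more of them.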

In [8]:
# to differentiate the columns of A and B datasets
a_radiomics = a_radiomics.add_suffix('_a')

In [9]:
a_radiomics.head()

Unnamed: 0,id_a,MeshVolume (cc)_a,Volume (cc)_a,Compactness1_a,Compactness2_a,Elongation_a,Flatness_a,LeastAxisLength_a,MajorAxisLength_a,Maximum2DDiameterColumn_a,...,SUV_StandardDeviation_a,SUV_TotalEnergy_a,SUV_Uniformity_a,SUV_Variance_a,TLG_a,Number of lesions_a,Dmax Patient (mm)_a,Spread Patient (mm)_a,Dmax Bulk (mm)_a,Spread Bulk (mm)_a
0,24,3236.101787,3249.393552,0.008427,0.025231,0.679259,0.379251,167.996974,442.970946,558.036287,...,2.364306,60455620.0,0.996192,5.589943,11722.728508,13.0,740.204182,5251.85917,615.445828,3999.313134
1,23,1236.71559,1240.8066,0.016529,0.097071,0.679058,0.575951,125.204304,217.387132,320.449684,...,6.980068,212479400.0,0.966459,48.721356,13734.421779,3.0,304.656578,593.294918,304.656578,346.037247
2,15,221.214992,222.556487,0.014253,0.072181,0.473212,0.140572,56.523435,402.096359,273.123144,...,0.670973,2523821.0,1.0,0.450205,734.434052,4.0,530.927813,1122.497606,530.927813,1122.497606
3,46,16.306867,17.72892,0.013737,0.067051,0.120633,0.108154,122.121675,1129.145457,582.706796,...,2.084291,365293.3,1.0,4.344271,71.489771,11.0,1091.820444,7430.718599,1091.820444,7430.718599
4,48,110.32296,112.73328,0.015185,0.08193,0.187392,0.09539,41.559987,435.685464,552.831991,...,2.631478,3952316.0,1.0,6.924674,597.957639,13.0,1063.58772,9799.746563,1063.58772,4069.22703


In [10]:
b_radiomics = b_radiomics.add_suffix('_b')

In [11]:
b_radiomics.head()

Unnamed: 0,id_b,MeshVolume (cc)_b,Volume (cc)_b,Compactness1_b,Compactness2_b,Elongation_b,Flatness_b,LeastAxisLength_b,MajorAxisLength_b,Maximum2DDiameterColumn_b,...,SUV_StandardDeviation_b,SUV_TotalEnergy_b,SUV_Uniformity_b,SUV_Variance_b,TLG_b,Number of lesions_b,Dmax Patient (mm)_b,Spread Patient (mm)_b,Dmax Bulk (mm)_b,Spread Bulk (mm)_b
0,24,1885.909327,1899.7605,0.007573,0.020375,0.607262,0.313596,140.026852,446.519213,506.26895,...,3.385329,68879090.0,0.998717,11.460455,9460.019445,3.0,365.156905,713.74416,348.587255,430.360471
1,23,1016.98443,1021.22064,0.015083,0.080836,0.625411,0.496792,116.637779,234.781921,302.899736,...,3.141397,49782750.0,1.0,9.868377,6367.69374,5.0,344.845302,1017.806614,344.845302,1017.806614
2,15,271.082955,271.77084,0.030421,0.328823,0.740126,0.546129,55.941599,102.432873,106.236011,...,0.762871,4260973.0,1.0,0.581973,1055.946947,1.0,0.0,0.0,0.0,0.0
3,46,97.66587,99.260477,0.018188,0.11753,0.236866,0.168135,69.06451,410.768314,476.270716,...,9.492439,25502770.0,0.734631,90.106402,1282.041848,7.0,741.104645,2741.335065,395.708306,1335.079507
4,48,243.2826,247.6386,0.012161,0.052543,0.186955,0.107872,57.952354,537.23482,552.934897,...,2.792084,11997230.0,1.0,7.79573,1578.893332,13.0,1060.465092,8903.999307,1060.465092,4926.875652


In [12]:
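# NOTE: delta_radiomics_results is a DataFrame at this point, so .items() iterates over
# (column name, column values) pairs, not over patients. Every column has one entry per
# patient, so the length check below never equals 99 and nothing is filtered out;
# the final count is simply the number of remaining columns, including 'id'.
for column, values in delta_radiomics_results.items():
    if len(values) == 99:
        print(column)
filtered_results = {column: values for column, values in delta_radiomics_results.items() if len(values) != 99}
len(filtered_results)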

44

# Load Clinical Data

In [13]:
with open("config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

clinical_path = cfg["paths"]["clinical_data"]

clinic_data = pd.read_excel(clinical_path)

In [14]:
clinic_data

Unnamed: 0,record_id,medhis_diag_comments,scr_date_tb1stmeeting,scr_sex,scr_sex.factor,scr_age,scr_height,scr_weight,scr_bmi,indication_dis_diagnosis,...,post_cart_ther_spec_2___ne.factor,post_cart_ther_comment_spec,cli_st_lab_date,cli_st_hemoglobin,cli_st_trombocytes,cli_st_leukocytes,cli_st_neutrophils,cli_st_ldh,cli_st_crp,cli_st_ferritin
0,Record ID,Comments,Date 1st tumorboard meeting,Sex,,Age,Height,Weight,BMI (kg/m2),Diagnosis for which there is now a cellular th...,...,,Please specify all subsequent anti-cancer ther...,Date lab results,Hemoglobin in mmol/L,Thrombocytes in 10E9/L,Leukocytes in 10E9/L,Neutrophils in 10E9/L (automated differentiation),LDH in U/L,CRP in mg/L,Ferritin in µg/l
1,FTC-UMCG-0001,splenectomy 2012: total hip links 2015: jich...,2020-05-04,0,Male,68,180,72.6,22,1,...,Unchecked,,2020-04-28,7.1,90,6.3,4.74,169,26,NE
2,FTC-UMCG-0002,> 20 jaar geleden DVT links Longembolie links...,2020-05-07,0,Male,73,190,86,24,2,...,Unchecked,,2020-05-14,64,172,4.3,2.83,NE,47,2847
3,FTC-UMCG-0003,"2019 Nov Grootcellig B-Non-Hodgkin lymfoom,...",2020-05-18,0,Male,59,181,91,28,1,...,Unchecked,Radiotherapy CNS and Korfel 3x response evalua...,2020-05-15,7.4,389,11.9,NE,214,14,1404
4,FTC-UMCG-0004,2015 gehoorverlies 2019 aug: DLBCL ...,2020-05-14,1,Female,61,169,73,26,1,...,Unchecked,,2020-04-21,6.5,159,9.2,6.55,296,3.0,NE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,FTC-UMCG-0088,Hematologische voorgeschiedenis 2013 bi-cy...,2023-09-28,0,Male,54,178,69.8,22,1,...,Unchecked,,2023-09-29,9.1,93,7.0,NE,369,7,1643
65,FTC-UMCG-0089,2013 dec: laaggradig B-NHL stadium IV met s...,2023-10-05,1,Female,70,160,58.7,23,2,...,Unchecked,Epcoritamab monotherapy - 48 mg per injection ...,2023-10-04,8.1,205,5.3,2.97,325,17,204
66,FTC-UMCG-0090,Relevante voorgeschiedenis: 2016 Stadium IV D...,2023-10-12,0,Male,70,170,73,25,1,...,Unchecked,2024-02 recidief diffuus grootcellig B-cel lym...,2023-10-12,9.5,327,6.6,5.06,991,78,669
67,FTC-UMCG-0096,Voorgeschiedenis: Tonsilectomie 2004 IBS ...,2022-11-22,0,Male,62,180,78,24,1,...,Unchecked,,2022-10-11,6.8,109,20.7,NE,475,15,1932


In [15]:
clinic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Columns: 275 entries, record_id to cli_st_ferritin
dtypes: float64(1), object(274)
memory usage: 148.4+ KB


The `record_id` column stores the unique identifier for each patient, and the long free‑text column contains the physician’s diagnostic and medical history notes. This cell is only used to inspect and understand the structure and content of the clinical file before filtering it down to the 31 patients included in the radiomics analysis.

In [16]:
clinic_data['record_id'].values

array(['Record ID', 'FTC-UMCG-0001', 'FTC-UMCG-0002', 'FTC-UMCG-0003',
       'FTC-UMCG-0004', 'FTC-UMCG-0005', 'FTC-UMCG-0006', 'FTC-UMCG-0007',
       'FTC-UMCG-0008', 'FTC-UMCG-0009', 'FTC-UMCG-0010', 'FTC-UMCG-0011',
       'FTC-UMCG-0012', 'FTC-UMCG-0013', 'FTC-UMCG-0014', 'FTC-UMCG-0015',
       'FTC-UMCG-0016', 'FTC-UMCG-0017', 'FTC-UMCG-0018', 'FTC-UMCG-0019',
       'FTC-UMCG-0020', 'FTC-UMCG-0021', 'FTC-UMCG-0022', 'FTC-UMCG-0023',
       'FTC-UMCG-0024', 'FTC-UMCG-0025', 'FTC-UMCG-0026', 'FTC-UMCG-0027',
       'FTC-UMCG-0028', 'FTC-UMCG-0029', 'FTC-UMCG-0030', 'FTC-UMCG-0031',
       'FTC-UMCG-0046', 'FTC-UMCG-0047', 'FTC-UMCG-0048', 'FTC-UMCG-0049',
       'FTC-UMCG-0050', 'FTC-UMCG-0051', 'FTC-UMCG-0052', 'FTC-UMCG-0053',
       'FTC-UMCG-0054', 'FTC-UMCG-0055', 'FTC-UMCG-0060', 'FTC-UMCG-0061',
       'FTC-UMCG-0064', 'FTC-UMCG-0065', 'FTC-UMCG-0066', 'FTC-UMCG-0067',
       'FTC-UMCG-0068', 'FTC-UMCG-0069', 'FTC-UMCG-0070', 'FTC-UMCG-0075',
       'FTC-UMCG-0076', 'FTC-

In [17]:
# to keep only 3 digits for each patient
clinic_data['id_cleaned'] = [value[-3:] for value in clinic_data['record_id'].values]
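
Slicing the last three characters works for the record IDs shown above, but it also turns the label row into `' ID'`. As an alternative sketch (not the approach used in this notebook), the trailing digits can be extracted with a regular expression so that non-matching rows become missing values; the short `record_ids` series below is a hypothetical sample.

```python
import pandas as pd

# hypothetical sample of the record_id column, including the label row
record_ids = pd.Series(["Record ID", "FTC-UMCG-0001", "FTC-UMCG-0096", "FTC-UMCG-0104"])

# capture the run of digits at the end of each string; rows without digits become NaN
digits = record_ids.str.extract(r"(\d+)$", expand=False)

# convert to a nullable integer column so the label row stays missing instead of crashing
numeric_ids = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(numeric_ids.tolist())
```

This keeps the full patient number even if an ID ever exceeds three significant digits, and the label row can then be removed with a plain `dropna()`.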

In [18]:
clinic_data

Unnamed: 0,record_id,medhis_diag_comments,scr_date_tb1stmeeting,scr_sex,scr_sex.factor,scr_age,scr_height,scr_weight,scr_bmi,indication_dis_diagnosis,...,post_cart_ther_comment_spec,cli_st_lab_date,cli_st_hemoglobin,cli_st_trombocytes,cli_st_leukocytes,cli_st_neutrophils,cli_st_ldh,cli_st_crp,cli_st_ferritin,id_cleaned
0,Record ID,Comments,Date 1st tumorboard meeting,Sex,,Age,Height,Weight,BMI (kg/m2),Diagnosis for which there is now a cellular th...,...,Please specify all subsequent anti-cancer ther...,Date lab results,Hemoglobin in mmol/L,Thrombocytes in 10E9/L,Leukocytes in 10E9/L,Neutrophils in 10E9/L (automated differentiation),LDH in U/L,CRP in mg/L,Ferritin in µg/l,ID
1,FTC-UMCG-0001,splenectomy 2012: total hip links 2015: jich...,2020-05-04,0,Male,68,180,72.6,22,1,...,,2020-04-28,7.1,90,6.3,4.74,169,26,NE,001
2,FTC-UMCG-0002,> 20 jaar geleden DVT links Longembolie links...,2020-05-07,0,Male,73,190,86,24,2,...,,2020-05-14,64,172,4.3,2.83,NE,47,2847,002
3,FTC-UMCG-0003,"2019 Nov Grootcellig B-Non-Hodgkin lymfoom,...",2020-05-18,0,Male,59,181,91,28,1,...,Radiotherapy CNS and Korfel 3x response evalua...,2020-05-15,7.4,389,11.9,NE,214,14,1404,003
4,FTC-UMCG-0004,2015 gehoorverlies 2019 aug: DLBCL ...,2020-05-14,1,Female,61,169,73,26,1,...,,2020-04-21,6.5,159,9.2,6.55,296,3.0,NE,004
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,FTC-UMCG-0088,Hematologische voorgeschiedenis 2013 bi-cy...,2023-09-28,0,Male,54,178,69.8,22,1,...,,2023-09-29,9.1,93,7.0,NE,369,7,1643,088
65,FTC-UMCG-0089,2013 dec: laaggradig B-NHL stadium IV met s...,2023-10-05,1,Female,70,160,58.7,23,2,...,Epcoritamab monotherapy - 48 mg per injection ...,2023-10-04,8.1,205,5.3,2.97,325,17,204,089
66,FTC-UMCG-0090,Relevante voorgeschiedenis: 2016 Stadium IV D...,2023-10-12,0,Male,70,170,73,25,1,...,2024-02 recidief diffuus grootcellig B-cel lym...,2023-10-12,9.5,327,6.6,5.06,991,78,669,090
67,FTC-UMCG-0096,Voorgeschiedenis: Tonsilectomie 2004 IBS ...,2022-11-22,0,Male,62,180,78,24,1,...,,2022-10-11,6.8,109,20.7,NE,475,15,1932,096


In [19]:
clinic_data['id_cleaned'].values

array([' ID', '001', '002', '003', '004', '005', '006', '007', '008',
       '009', '010', '011', '012', '013', '014', '015', '016', '017',
       '018', '019', '020', '021', '022', '023', '024', '025', '026',
       '027', '028', '029', '030', '031', '046', '047', '048', '049',
       '050', '051', '052', '053', '054', '055', '060', '061', '064',
       '065', '066', '067', '068', '069', '070', '075', '076', '077',
       '078', '079', '080', '081', '082', '083', '084', '085', '086',
       '087', '088', '089', '090', '096', '104'], dtype=object)

In [20]:
delta_radiomics_results['id']

0     24
1     23
2     15
3     46
4     48
5     77
6     70
7     13
8     14
9     22
10    47
11     7
12     9
13    31
14    52
15    55
16    64
17     8
18     6
19    18
20    11
21    16
22    17
23    28
24    10
25    26
26    95
27    61
28    50
29    68
30     5
Name: id, dtype: int64

In [21]:
# remove the first row
patient_ids = clinic_data['id_cleaned'].values[1:].astype(int)

In [22]:
# find patients that are present in both datasets
# (patient_ids already skips the label row, see the previous cell)
intercept = [pid for pid in delta_radiomics_results['id'] if pid in patient_ids]

In [23]:
clinic_data['id_cleaned'] = ['ID'] + patient_ids.tolist()

In [24]:
clinic_data

Unnamed: 0,record_id,medhis_diag_comments,scr_date_tb1stmeeting,scr_sex,scr_sex.factor,scr_age,scr_height,scr_weight,scr_bmi,indication_dis_diagnosis,...,post_cart_ther_comment_spec,cli_st_lab_date,cli_st_hemoglobin,cli_st_trombocytes,cli_st_leukocytes,cli_st_neutrophils,cli_st_ldh,cli_st_crp,cli_st_ferritin,id_cleaned
0,Record ID,Comments,Date 1st tumorboard meeting,Sex,,Age,Height,Weight,BMI (kg/m2),Diagnosis for which there is now a cellular th...,...,Please specify all subsequent anti-cancer ther...,Date lab results,Hemoglobin in mmol/L,Thrombocytes in 10E9/L,Leukocytes in 10E9/L,Neutrophils in 10E9/L (automated differentiation),LDH in U/L,CRP in mg/L,Ferritin in µg/l,ID
1,FTC-UMCG-0001,splenectomy 2012: total hip links 2015: jich...,2020-05-04,0,Male,68,180,72.6,22,1,...,,2020-04-28,7.1,90,6.3,4.74,169,26,NE,1
2,FTC-UMCG-0002,> 20 jaar geleden DVT links Longembolie links...,2020-05-07,0,Male,73,190,86,24,2,...,,2020-05-14,64,172,4.3,2.83,NE,47,2847,2
3,FTC-UMCG-0003,"2019 Nov Grootcellig B-Non-Hodgkin lymfoom,...",2020-05-18,0,Male,59,181,91,28,1,...,Radiotherapy CNS and Korfel 3x response evalua...,2020-05-15,7.4,389,11.9,NE,214,14,1404,3
4,FTC-UMCG-0004,2015 gehoorverlies 2019 aug: DLBCL ...,2020-05-14,1,Female,61,169,73,26,1,...,,2020-04-21,6.5,159,9.2,6.55,296,3.0,NE,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,FTC-UMCG-0088,Hematologische voorgeschiedenis 2013 bi-cy...,2023-09-28,0,Male,54,178,69.8,22,1,...,,2023-09-29,9.1,93,7.0,NE,369,7,1643,88
65,FTC-UMCG-0089,2013 dec: laaggradig B-NHL stadium IV met s...,2023-10-05,1,Female,70,160,58.7,23,2,...,Epcoritamab monotherapy - 48 mg per injection ...,2023-10-04,8.1,205,5.3,2.97,325,17,204,89
66,FTC-UMCG-0090,Relevante voorgeschiedenis: 2016 Stadium IV D...,2023-10-12,0,Male,70,170,73,25,1,...,2024-02 recidief diffuus grootcellig B-cel lym...,2023-10-12,9.5,327,6.6,5.06,991,78,669,90
67,FTC-UMCG-0096,Voorgeschiedenis: Tonsilectomie 2004 IBS ...,2022-11-22,0,Male,62,180,78,24,1,...,,2022-10-11,6.8,109,20.7,NE,475,15,1932,96


In [25]:
clinic_data_cleaned = clinic_data[clinic_data['id_cleaned'].isin(intercept)]

In [26]:
clinic_data_cleaned

Unnamed: 0,record_id,medhis_diag_comments,scr_date_tb1stmeeting,scr_sex,scr_sex.factor,scr_age,scr_height,scr_weight,scr_bmi,indication_dis_diagnosis,...,post_cart_ther_comment_spec,cli_st_lab_date,cli_st_hemoglobin,cli_st_trombocytes,cli_st_leukocytes,cli_st_neutrophils,cli_st_ldh,cli_st_crp,cli_st_ferritin,id_cleaned
5,FTC-UMCG-0005,2019 mei: hemicastratie links Hematologische...,1900-01-01,0,Male,62,173.0,58.0,19,1,...,,2020-05-20,5.3,145,2.4,,NE,0.3,894,5
6,FTC-UMCG-0006,2014 Diffuus Grootcellig B-cel lymfoom st I...,2020-07-02,1,Female,58,173.0,57.0,19,2,...,"Verdere behandeling, inclusief allo-SCT in UMCU",2020-07-13,7.6,6,5.7,NE,275,9,371,6
7,FTC-UMCG-0007,2020 (feb) Stadium IV high grade B-cel lymfoom...,2020-08-10,0,Male,58,182.0,99.2,30,5,...,,2020-08-18,64.0,295,50.0,NE,885,47,2570,7
8,FTC-UMCG-0008,"2017 okt: Snel progressief DLBCL, stadium I...",2020-09-21,1,Female,72,169.0,60.0,21,1,...,Epcoritamab monotherapie,2020-09-18,5.7,321,1.9,NE,250,1.0,NE,8
9,FTC-UMCG-0009,2017 okt gastro- en colonoscopie ivm chroni...,2020-09-02,0,Male,48,186.0,106.0,31,2,...,,2020-09-02,7.2,324,8.7,NE,283,10,54,9
10,FTC-UMCG-0010,2020 feb nefrostomiekatheter rechts ivm hyd...,2020-10-19,0,Male,54,181.0,69.0,21,1,...,,2020-10-19,5.4,382,26.3,NE,417,206,4786,10
11,FTC-UMCG-0011,2020-03: koorts zonder lokaliserende klachten....,2020-10-15,0,Male,34,185.0,86.1,25,5,...,,2020-10-19,6.8,442,5.3,3.87,992,17,485,11
13,FTC-UMCG-0013,Hematologische voorgeschiedenis: 2013 (mei) ...,2020-11-23,0,Male,46,187.0,97.0,28,2,...,,2020-11-25,5.8,253,5.7,4.49,313,32,559,13
14,FTC-UMCG-0014,020 (mei) Diffuus grootcellig B-cel lymfoom st...,2020-12-17,0,Male,70,190.0,96.0,27,1,...,,2020-12-15,6.8,497,6.2,NE,484,225,1535,14
15,FTC-UMCG-0015,ematologische voorgeschiedenis: 2005 (dec) s...,2021-01-07,1,Female,66,162.0,56.0,21,2,...,,2021-01-08,6.4,432,7.6,6.10,235,8,569,15


In [27]:
clinic_data_cleaned.reset_index(drop=True, inplace=True)

In [28]:
clinic_data_cleaned.shape

(30, 276)

The radiomics tables contain 31 patients, while the clinical file holds 68 patient records (69 rows including the label row). Only 30 of the 31 radiomics patients have a matching clinical record; radiomics id 95 does not appear among the clinical IDs listed above. The resulting clinical table for these 30 patients has 276 columns: the 275 original clinical columns plus the derived `id_cleaned` column. No radiomic features have been merged into this table at this stage.
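
One way to audit which patients match between the two sources is an outer merge with an indicator column. The sketch below uses small hypothetical tables and assumes both key columns already hold integers (in this notebook, `id_cleaned` still contains the `'ID'` label string for the first row, which would need to be removed first).

```python
import pandas as pd

# hypothetical miniature versions of the radiomics and clinical tables
radiomics = pd.DataFrame({"id": [24, 23, 95, 5], "TLG": [1.0, 2.0, 3.0, 4.0]})
clinical = pd.DataFrame({"id_cleaned": [5, 23, 24, 96], "scr_age": [62, 70, 65, 62]})

# outer merge keeps every patient; '_merge' records where each row came from,
# and validate raises an error if an id is duplicated on either side
audit = radiomics.merge(
    clinical,
    left_on="id",
    right_on="id_cleaned",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

matched = audit.loc[audit["_merge"] == "both"]
radiomics_only = audit.loc[audit["_merge"] == "left_only", "id"].tolist()
clinical_only = audit.loc[audit["_merge"] == "right_only", "id_cleaned"].tolist()
print(len(matched), radiomics_only, clinical_only)
```

Listing the unmatched IDs explicitly makes it easy to check whether a missing patient reflects a genuine gap in the clinical file or an ID formatting problem.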

In [29]:
# we should now select the features we need for modelling the baseline, without the delta radiomics
clinic_data_cleaned

Unnamed: 0,record_id,medhis_diag_comments,scr_date_tb1stmeeting,scr_sex,scr_sex.factor,scr_age,scr_height,scr_weight,scr_bmi,indication_dis_diagnosis,...,post_cart_ther_comment_spec,cli_st_lab_date,cli_st_hemoglobin,cli_st_trombocytes,cli_st_leukocytes,cli_st_neutrophils,cli_st_ldh,cli_st_crp,cli_st_ferritin,id_cleaned
0,FTC-UMCG-0005,2019 mei: hemicastratie links Hematologische...,1900-01-01,0,Male,62,173.0,58.0,19,1,...,,2020-05-20,5.3,145,2.4,,NE,0.3,894,5
1,FTC-UMCG-0006,2014 Diffuus Grootcellig B-cel lymfoom st I...,2020-07-02,1,Female,58,173.0,57.0,19,2,...,"Verdere behandeling, inclusief allo-SCT in UMCU",2020-07-13,7.6,6,5.7,NE,275,9,371,6
2,FTC-UMCG-0007,2020 (feb) Stadium IV high grade B-cel lymfoom...,2020-08-10,0,Male,58,182.0,99.2,30,5,...,,2020-08-18,64.0,295,50.0,NE,885,47,2570,7
3,FTC-UMCG-0008,"2017 okt: Snel progressief DLBCL, stadium I...",2020-09-21,1,Female,72,169.0,60.0,21,1,...,Epcoritamab monotherapie,2020-09-18,5.7,321,1.9,NE,250,1.0,NE,8
4,FTC-UMCG-0009,2017 okt gastro- en colonoscopie ivm chroni...,2020-09-02,0,Male,48,186.0,106.0,31,2,...,,2020-09-02,7.2,324,8.7,NE,283,10,54,9
5,FTC-UMCG-0010,2020 feb nefrostomiekatheter rechts ivm hyd...,2020-10-19,0,Male,54,181.0,69.0,21,1,...,,2020-10-19,5.4,382,26.3,NE,417,206,4786,10
6,FTC-UMCG-0011,2020-03: koorts zonder lokaliserende klachten....,2020-10-15,0,Male,34,185.0,86.1,25,5,...,,2020-10-19,6.8,442,5.3,3.87,992,17,485,11
7,FTC-UMCG-0013,Hematologische voorgeschiedenis: 2013 (mei) ...,2020-11-23,0,Male,46,187.0,97.0,28,2,...,,2020-11-25,5.8,253,5.7,4.49,313,32,559,13
8,FTC-UMCG-0014,020 (mei) Diffuus grootcellig B-cel lymfoom st...,2020-12-17,0,Male,70,190.0,96.0,27,1,...,,2020-12-15,6.8,497,6.2,NE,484,225,1535,14
9,FTC-UMCG-0015,ematologische voorgeschiedenis: 2005 (dec) s...,2021-01-07,1,Female,66,162.0,56.0,21,2,...,,2021-01-08,6.4,432,7.6,6.10,235,8,569,15


In [30]:
# after merging, count the missing values; columns that contain only NaNs are dropped in the next step
clinic_data_cleaned.isna().sum().sum()

986

In [31]:
# dropping columns with all NaN values
clinic_data_cleaned = clinic_data_cleaned.dropna(axis=1, how='all')

In [32]:
clinic_data_cleaned.shape

(30, 266)

In [33]:
clinic_data_cleaned.columns

Index(['record_id', 'medhis_diag_comments', 'scr_date_tb1stmeeting', 'scr_sex',
       'scr_sex.factor', 'scr_age', 'scr_height', 'scr_weight', 'scr_bmi',
       'indication_dis_diagnosis',
       ...
       'post_cart_ther_comment_spec', 'cli_st_lab_date', 'cli_st_hemoglobin',
       'cli_st_trombocytes', 'cli_st_leukocytes', 'cli_st_neutrophils',
       'cli_st_ldh', 'cli_st_crp', 'cli_st_ferritin', 'id_cleaned'],
      dtype='object', length=266)

In [34]:
# we don't need factor columns for modelling as they are encoded already
factors = [factor for factor in clinic_data_cleaned.columns if 'factor' in factor]

In [35]:
factors

['scr_sex.factor',
 'indication_dis_diagnosis.factor',
 'indication_priorsct.factor',
 'indication_whops.factor',
 'indication_ldh_uln.factor',
 'indication_age_60.factor',
 'indication_bulkydisease.factor',
 'indication_stage.factor',
 'indication_extran_sites.factor',
 'indication_extran_invol.factor',
 'indication_extran_site_loc___1.factor',
 'indication_extran_site_loc___2.factor',
 'indication_extran_site_loc___3.factor',
 'indication_extran_site_loc___21.factor',
 'indication_extran_site_loc___4.factor',
 'indication_extran_site_loc___5.factor',
 'indication_extran_site_loc___6.factor',
 'indication_extran_site_loc___7.factor',
 'indication_extran_site_loc___8.factor',
 'indication_extran_site_loc___9.factor',
 'indication_extran_site_loc___10.factor',
 'indication_extran_site_loc___11.factor',
 'indication_extran_site_loc___12.factor',
 'indication_extran_site_loc___13.factor',
 'indication_extran_site_loc___14.factor',
 'indication_extran_site_loc___15.factor',
 'indication_ex

In [36]:
comments = [comm for comm in clinic_data_cleaned.columns if 'comment' in comm]

In [37]:
comments

['medhis_diag_comments', 'post_cart_ther_comment_spec']

In [38]:
locations = [loc for loc in clinic_data_cleaned.columns if 'loc' in loc]

In [39]:
locations

['indication_extran_site_loc___1',
 'indication_extran_site_loc___1.factor',
 'indication_extran_site_loc___2',
 'indication_extran_site_loc___2.factor',
 'indication_extran_site_loc___3',
 'indication_extran_site_loc___3.factor',
 'indication_extran_site_loc___21',
 'indication_extran_site_loc___21.factor',
 'indication_extran_site_loc___4',
 'indication_extran_site_loc___4.factor',
 'indication_extran_site_loc___5',
 'indication_extran_site_loc___5.factor',
 'indication_extran_site_loc___6',
 'indication_extran_site_loc___6.factor',
 'indication_extran_site_loc___7',
 'indication_extran_site_loc___7.factor',
 'indication_extran_site_loc___8',
 'indication_extran_site_loc___8.factor',
 'indication_extran_site_loc___9',
 'indication_extran_site_loc___9.factor',
 'indication_extran_site_loc___10',
 'indication_extran_site_loc___10.factor',
 'indication_extran_site_loc___11',
 'indication_extran_site_loc___11.factor',
 'indication_extran_site_loc___12',
 'indication_extran_site_loc___12.

## Reducing correlated clinical features

The number of available clinical and radiomic features is relatively large compared to the cohort size, and the planned models are mainly linear (e.g. logistic regression or linear SVM). In such settings, highly correlated predictors can lead to unstable and difficult-to-interpret coefficient estimates, because the model cannot clearly decide how to distribute weight across redundant variables. Therefore, correlated features (such as height and weight with respect to BMI) are identified and considered for removal to obtain a more stable and interpretable linear model.
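
The redundancy argument above can be checked numerically before any column is removed. The following is a minimal sketch, assuming a Pearson correlation cut-off of 0.8; the helper name `correlated_pairs`, the threshold, and the toy values are illustrative assumptions and are not taken from the cohort.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return numeric feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes(include=np.number).corr().abs()
    # Keep only the strict upper triangle so each pair is listed once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "abs_corr"]
    # NaN entries (masked cells) never pass this comparison, so they are dropped here.
    strong = pairs[pairs["abs_corr"] > threshold]
    return strong.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Toy data (hypothetical values): BMI is derived from height and weight.
toy = pd.DataFrame({
    "scr_height": [160.0, 170.0, 180.0, 190.0, 175.0],
    "scr_weight": [55.0, 70.0, 85.0, 100.0, 75.0],
    "scr_age": [62, 48, 70, 34, 58],
})
toy["scr_bmi"] = toy["scr_weight"] / (toy["scr_height"] / 100) ** 2

high_corr = correlated_pairs(toy, threshold=0.8)
print(high_corr)
```

With only 30 patients, pairwise correlations are themselves noisy, so a table like this is better read as a prompt for clinical judgment than as an automatic removal rule.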


In [40]:
# height and weight are highly correlated with BMI, which is derived from them
correlated = ['scr_height', 'scr_weight']

In [41]:
indicators = ['indication_ldh_uln','indication_age_60','indication_extran_sites', 'indication_extran_invol']

In [42]:
# cause of death columns are not needed
cause_of_death = [cause for cause in clinic_data_cleaned.columns if '_cause' in cause]

In [43]:
cause_of_death

['surv_death_cause',
 'surv_death_cause.factor',
 'surv_death_cause_oth',
 'surv_death_cause_spec',
 'surv_death_contrib_cause___1',
 'surv_death_contrib_cause___1.factor',
 'surv_death_contrib_cause___2',
 'surv_death_contrib_cause___2.factor',
 'surv_death_contrib_cause___3',
 'surv_death_contrib_cause___3.factor',
 'surv_death_contrib_cause___4',
 'surv_death_contrib_cause___4.factor',
 'surv_death_contrib_cause___5',
 'surv_death_contrib_cause___5.factor',
 'surv_death_contrib_cause___6',
 'surv_death_contrib_cause___6.factor',
 'surv_death_contrib_cause___7',
 'surv_death_contrib_cause___7.factor',
 'surv_death_contrib_cause___8',
 'surv_death_contrib_cause___8.factor',
 'surv_death_contrib_cause___9',
 'surv_death_contrib_cause___9.factor',
 'surv_death_contrib_cause___10',
 'surv_death_contrib_cause___10.factor',
 'surv_death_contrib_cause___11',
 'surv_death_contrib_cause___11.factor',
 'surv_death_contrib_cause___12',
 'surv_death_contrib_cause___12.factor',
 'surv_death_contr

**NOTE:** `indication_dis_diagnosis` must be one-hot encoded, because the diagnosis is a nominal categorical feature with no natural ordering.
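
One point worth keeping in mind for linear models with an intercept: a full set of dummy columns always sums to one per row, which reintroduces exact collinearity. The sketch below, on hypothetical toy rows, contrasts the full encoding with `drop_first=True`; whether dropping a reference category is needed depends on the model (regularized models often tolerate the full set), so this is an option and not a description of what the notebook does.

```python
import pandas as pd

# Toy rows using the four diagnosis labels that appear in this dataset.
toy = pd.DataFrame({
    "indication_dis_diagnosis.factor": ["DLBCL", "tFL", "HGBCL NOS", "DLBCL", "HGBCL DH/TH"]
})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy["indication_dis_diagnosis.factor"]).astype(int)

# Reference-category encoding: the first category (alphabetically) is dropped
# and becomes the implicit baseline.
reduced = pd.get_dummies(toy["indication_dis_diagnosis.factor"], drop_first=True).astype(int)

print(full.sum(axis=1).tolist())   # every row sums to 1 in the full encoding
print(list(reduced.columns))
```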

In [44]:
disease = pd.get_dummies(clinic_data_cleaned['indication_dis_diagnosis.factor']).astype(int)

In [45]:
disease

Unnamed: 0,DLBCL,HGBCL DH/TH,HGBCL NOS,tFL
0,1,0,0,0
1,0,0,0,1
2,0,0,1,0
3,1,0,0,0
4,0,0,0,1
5,1,0,0,0
6,0,0,1,0
7,0,0,0,1
8,1,0,0,0
9,0,0,0,1


In [46]:
drop_columns = cause_of_death + factors + ['record_id', 'scr_date_tb1stmeeting', 'indication_dis_diagnosis'] + comments + locations + correlated + indicators
# reassign instead of using inplace=True, which triggers a SettingWithCopyWarning when the frame derives from a slice
clinic_data_cleaned = clinic_data_cleaned.drop(columns=drop_columns)


In [47]:
clinic_data_cleaned.shape

(30, 97)

In [48]:
clinic_data_cleaned = pd.concat([clinic_data_cleaned, disease], axis=1)

In [49]:
clinic_data_cleaned

Unnamed: 0,scr_sex,scr_age,scr_bmi,total_num_priortherapylines_fl,total_num_priortherapylines_aggressive,indication_priorsct,indication_whops,indication_bulkydisease,indication_stage,indication_extranodal_nr,...,cli_st_leukocytes,cli_st_neutrophils,cli_st_ldh,cli_st_crp,cli_st_ferritin,id_cleaned,DLBCL,HGBCL DH/TH,HGBCL NOS,tFL
0,0,62,19,,2,4,0,0,4,3.0,...,2.4,,NE,0.3,894,5,1,0,0,0
1,1,58,19,0.0,2,1,0,0,3,,...,5.7,NE,275,9,371,6,0,0,0,1
2,0,58,30,,2,4,0,0,4,,...,50.0,NE,885,47,2570,7,0,0,1,0
3,1,72,21,,2,4,0,0,4,2.0,...,1.9,NE,250,1.0,NE,8,1,0,0,0
4,0,48,31,2.0,2,4,0,0,4,2.0,...,8.7,NE,283,10,54,9,0,0,0,1
5,0,54,21,,2,4,0,1,2,,...,26.3,NE,417,206,4786,10,1,0,0,0
6,0,34,25,,2,4,0,1,4,2.0,...,5.3,3.87,992,17,485,11,0,0,1,0
7,0,46,28,2.0,2,4,0,0,4,2.0,...,5.7,4.49,313,32,559,13,0,0,0,1
8,0,70,27,,2,4,0,1,4,,...,6.2,NE,484,225,1535,14,1,0,0,0
9,1,66,21,4.0,1,4,0,1,1,,...,7.6,6.10,235,8,569,15,0,0,0,1


Missing entries must be treated consistently before type conversion, imputation, and model fitting. In this dataset, the string `NE` is used as a placeholder for "not evaluated". Left as a raw string, it keeps otherwise numeric columns stored as text, and any encoding or model-fitting step would interpret it as a valid category. To avoid counting `NE` as a real clinical state, all `NE` values are first converted to proper missing values (NaN), so that downstream cleaning steps handle them as true missing data rather than as an extra category.
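
As a small illustration of the effect described above, the sketch below uses an in-memory CSV with hypothetical values. It shows that replacing `NE` alone leaves the column stored as text until it is explicitly converted, and that declaring the placeholder at load time via `na_values` is an alternative; the variable names are assumptions for this example only.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-patient extract with an "NE" placeholder in a lab column.
csv_text = "id,cli_st_ldh,cli_st_crp\n5,NE,0.3\n6,275,9\n7,885,47\n"

raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: replace the placeholder after loading, then convert the text column to numbers.
cleaned = raw.replace({"NE": np.nan})
cleaned["cli_st_ldh"] = pd.to_numeric(cleaned["cli_st_ldh"])

# Option 2: declare the placeholder as missing while reading the file.
at_load = pd.read_csv(io.StringIO(csv_text), na_values=["NE"])

print(cleaned["cli_st_ldh"].tolist())
print(at_load["cli_st_ldh"].tolist())
```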

In [96]:
clinic_data_cleaned.replace({'NE': np.nan}, inplace=True)

In [97]:
clinic_data_cleaned

Unnamed: 0,scr_sex,scr_age,scr_bmi,total_num_priortherapylines_aggressive,indication_priorsct,indication_whops,indication_bulkydisease,indication_stage,indication_pri_refr,indication_sec_refr,...,cli_st_trombocytes,cli_st_leukocytes,cli_st_ldh,cli_st_crp,cli_st_ferritin,id,DLBCL,HGBCL DH/TH,HGBCL NOS,tFL
0,0,62,19,2,4,0,0,4,1,1,...,145,2.4,293.5,0.3,894.0,5,1,0,0,0
1,1,58,19,2,1,0,0,3,0,0,...,6,5.7,275.0,9.0,371.0,6,0,0,0,1
2,0,58,30,2,4,0,0,4,1,1,...,295,5.0,885.0,47.0,2570.0,7,0,0,1,0
3,1,72,21,2,4,0,0,4,0,1,...,321,1.9,250.0,1.0,888.0,8,1,0,0,0
4,0,48,31,2,4,0,0,4,1,1,...,324,8.7,283.0,10.0,54.0,9,0,0,0,1
5,0,54,21,2,4,0,1,2,1,1,...,382,26.3,417.0,206.0,4786.0,10,1,0,0,0
6,0,34,25,2,4,0,1,4,1,1,...,442,5.3,992.0,17.0,485.0,11,0,0,1,0
7,0,46,28,2,4,0,0,4,0,1,...,253,5.7,313.0,32.0,559.0,13,0,0,0,1
8,0,70,27,2,4,0,1,4,1,1,...,497,6.2,484.0,225.0,1535.0,14,1,0,0,0
9,1,66,21,1,4,0,1,1,1,0,...,432,7.6,235.0,8.0,569.0,15,0,0,0,1


After all preprocessing steps below, the final clinical table contains 30 rows and 44 columns: one row per patient and, besides the patient identifier and the outcome label (`surv_status`), 42 clinical features with no remaining missing values. These 30 patients are the ones with both clinical information and PET radiomics. The table is obtained by removing low-information variables (mostly-missing, zero-variance, and date columns), imputing the remaining missing numeric values with the column median, and keeping only numeric features suitable for linear modeling.

In [98]:
# check the final values
clinic_data_cleaned.describe()

Unnamed: 0,scr_sex,scr_age,scr_bmi,total_num_priortherapylines_aggressive,indication_priorsct,indication_whops,indication_bulkydisease,indication_stage,indication_pri_refr,indication_sec_refr,...,cli_st_trombocytes,cli_st_leukocytes,cli_st_ldh,cli_st_crp,cli_st_ferritin,id,DLBCL,HGBCL DH/TH,HGBCL NOS,tFL
count,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,...,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0
mean,0.366667,60.3,24.5,1.933333,3.8,0.2,0.533333,3.1,0.7,0.8,...,278.966667,8.66,390.4,44.273333,1206.8,31.366667,0.366667,0.1,0.1,0.433333
std,0.490133,11.483571,3.461612,0.365148,0.761124,0.484234,0.507416,1.124952,0.466092,0.406838,...,121.862128,6.068096,218.409667,61.637774,1115.191878,22.404561,0.490133,0.305129,0.305129,0.504007
min,0.0,27.0,19.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,6.0,1.9,188.0,0.3,54.0,5.0,0.0,0.0,0.0,0.0
25%,0.0,56.5,22.0,2.0,4.0,0.0,0.0,2.0,0.0,1.0,...,196.25,5.55,248.25,6.0,603.75,13.25,0.0,0.0,0.0,0.0
50%,0.0,61.5,24.0,2.0,4.0,0.0,1.0,4.0,1.0,1.0,...,290.0,7.15,293.5,15.0,888.0,23.5,0.0,0.0,0.0,0.0
75%,1.0,66.0,27.0,2.0,4.0,0.0,1.0,4.0,1.0,1.0,...,358.5,9.15,469.5,56.75,1142.25,49.5,1.0,0.0,0.0,1.0
max,1.0,79.0,31.0,3.0,4.0,2.0,1.0,4.0,1.0,1.0,...,497.0,26.3,992.0,225.0,4786.0,77.0,1.0,1.0,1.0,1.0


In [100]:
nans = clinic_data_cleaned.isna().sum().sort_values(ascending=False)
nans

scr_sex                                   0
scr_age                                   0
surv_bestresponse_car                     0
surv_time_bestresponse_car                0
surv_prog_after_car                       0
surv_status                               0
post_cart_ther                            0
post_cart_ther_spec_2___1                 0
post_cart_ther_spec_2___2                 0
post_cart_ther_spec_2___3                 0
post_cart_ther_spec_2___5                 0
cli_st_hemoglobin                         0
cli_st_trombocytes                        0
cli_st_leukocytes                         0
cli_st_ldh                                0
cli_st_crp                                0
cli_st_ferritin                           0
id                                        0
DLBCL                                     0
HGBCL DH/TH                               0
HGBCL NOS                                 0
ae_summ_icans_v2                          0
ae_summ_highestgrade_v2         

In [53]:
# columns with more than 12 missing values, i.e. missing for more than 40% of the 30 patients
nans[nans > 12]

post_car_ther_other                    29
surv_death_contrib_infect              29
surv_death_contrib_other               29
indication_dis_lymsubtype_cns_onset    29
tr_car_preaph_bridg_type               28
tr_car_bridg_reg_oth                   28
indication_extranodal_nr               23
total_num_priortherapylines_fl         18
ae_summ_crs_start_gr2                  18
post_cart_ther_startdate               17
ae_summ_icans_start_gr2                16
surv_death_date                        14
ae_summ_icans_stop_v2                  14
ae_summ_icans_res_v2                   14
ae_summ_icans_start_v2                 14
ae_summ_icans_highestgrade_v2          14
surv_prog_date                         13
cli_st_neutrophils                     13
dtype: int64
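
The count-based cut-off used here (more than 12 of 30) can also be written as a fraction, which keeps the rule stable if the cohort size changes. A minimal sketch, assuming a 40% limit; the helper name `high_missing_columns` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def high_missing_columns(df: pd.DataFrame, max_missing_fraction: float = 0.4) -> list:
    """Return the columns whose share of missing values exceeds `max_missing_fraction`."""
    missing_fraction = df.isna().mean()
    return missing_fraction[missing_fraction > max_missing_fraction].index.tolist()


# Toy frame: 'b' is missing in 3 of 5 rows (60%), 'c' in 1 of 5 rows (20%).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 4.0, 5.0],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],
})

to_drop = high_missing_columns(toy, max_missing_fraction=0.4)
print(to_drop)
```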

In [54]:
drop_nans = nans[nans > 12].index

In [55]:
clinic_data_cleaned = clinic_data_cleaned.drop(columns=drop_nans)

In [56]:
clinic_data_cleaned.shape

(30, 83)

In [57]:
clinic_data_cleaned.select_dtypes(include=['object']).columns

Index(['scr_sex', 'scr_age', 'scr_bmi',
       'total_num_priortherapylines_aggressive', 'indication_priorsct',
       'indication_whops', 'indication_bulkydisease', 'indication_stage',
       'indication_pri_refr', 'indication_sec_refr',
       'indication_res_last_ther', 'indication_res_last_ther_spec',
       'indication_dis_lymsubtype_cns', 'indication_ind_date',
       'tr_car_preaph_br', 'tr_car_preaph_bridg_reg___1',
       'tr_car_preaph_bridg_reg___2', 'tr_car_preaph_bridg_reg___3',
       'tr_car_preaph_bridg_reg___4', 'tr_car_preaph_bridg_reg___5',
       'tr_car_preaph_bridg_reg___6', 'tr_car_preaph_bridg_reg___7',
       'tr_car_preaph_bridg_reg___8', 'tr_car_preaph_bridg_reg___9',
       'tr_car_preaph_bridg_reg___10', 'tr_car_preaph_bridg_reg___11',
       'tr_car_preaph_bridg_reg___12', 'tr_car_preaph_bridg_reg___na',
       'tr_car_preaph_bridg_reg___ne', 'tr_car_br', 'tr_car_bridg_type',
       'tr_car_bridg_reg___1', 'tr_car_bridg_reg___2', 'tr_car_bridg_reg___3',
  

In [58]:
clinic_data_cleaned.dtypes

scr_sex                                   object
scr_age                                   object
scr_bmi                                   object
total_num_priortherapylines_aggressive    object
indication_priorsct                       object
                                           ...  
id_cleaned                                 int64
DLBCL                                      int64
HGBCL DH/TH                                int64
HGBCL NOS                                  int64
tFL                                        int64
Length: 83, dtype: object

In [59]:
clinic_data_cleaned.columns

Index(['scr_sex', 'scr_age', 'scr_bmi',
       'total_num_priortherapylines_aggressive', 'indication_priorsct',
       'indication_whops', 'indication_bulkydisease', 'indication_stage',
       'indication_pri_refr', 'indication_sec_refr',
       'indication_res_last_ther', 'indication_res_last_ther_spec',
       'indication_dis_lymsubtype_cns', 'indication_ind_date',
       'tr_car_preaph_br', 'tr_car_preaph_bridg_reg___1',
       'tr_car_preaph_bridg_reg___2', 'tr_car_preaph_bridg_reg___3',
       'tr_car_preaph_bridg_reg___4', 'tr_car_preaph_bridg_reg___5',
       'tr_car_preaph_bridg_reg___6', 'tr_car_preaph_bridg_reg___7',
       'tr_car_preaph_bridg_reg___8', 'tr_car_preaph_bridg_reg___9',
       'tr_car_preaph_bridg_reg___10', 'tr_car_preaph_bridg_reg___11',
       'tr_car_preaph_bridg_reg___12', 'tr_car_preaph_bridg_reg___na',
       'tr_car_preaph_bridg_reg___ne', 'tr_car_br', 'tr_car_bridg_type',
       'tr_car_bridg_reg___1', 'tr_car_bridg_reg___2', 'tr_car_bridg_reg___3',
  

In [60]:
# Columns whose names mark them as dates (or start/stop timestamps) are parsed as datetimes;
# every other column is converted to numeric.
date_columns = [col for col in clinic_data_cleaned.columns if ('date' in col) or ('start' in col) or ('stop' in col)]
print("Applying general type conversion...")

for col in clinic_data_cleaned.columns:
    if col not in date_columns:
        # Some values use ',' instead of '.' as the decimal separator; normalise before converting.
        # errors='raise' makes any remaining non-numeric string fail loudly instead of silently becoming NaN.
        if clinic_data_cleaned[col].dtype == 'object':
            clinic_data_cleaned[col] = pd.to_numeric(clinic_data_cleaned[col].str.replace(',', '.'), errors='raise')
        print(f"  Converted column '{col}' to numeric.")
    else:
        # Unparseable dates become NaT.
        clinic_data_cleaned[col] = pd.to_datetime(clinic_data_cleaned[col], errors='coerce')
        print(f"  Converted column '{col}' to datetime.")

print("\nAutomatic type conversion complete.")

Applying general type conversion...
  Converted column 'scr_sex' to numeric.
  Converted column 'scr_age' to numeric.
  Converted column 'scr_bmi' to numeric.
  Converted column 'total_num_priortherapylines_aggressive' to numeric.
  Converted column 'indication_priorsct' to numeric.
  Converted column 'indication_whops' to numeric.
  Converted column 'indication_bulkydisease' to numeric.
  Converted column 'indication_stage' to numeric.
  Converted column 'indication_pri_refr' to numeric.
  Converted column 'indication_sec_refr' to numeric.
  Converted column 'indication_res_last_ther' to numeric.
  Converted column 'indication_res_last_ther_spec' to numeric.
  Converted column 'indication_dis_lymsubtype_cns' to numeric.
  Converted column 'indication_ind_date' to datetime.
  Converted column 'tr_car_preaph_br' to numeric.
  Converted column 'tr_car_preaph_bridg_reg___1' to numeric.
  Converted column 'tr_car_preaph_bridg_reg___2' to numeric.
  Converted column 'tr_car_preaph_bridg_reg
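
One caveat with the conversion cell above: the `.str` accessor returns NaN for elements that are not strings, so a text column that mixes strings with already-numeric values could lose the numeric entries silently. The sketch below is one possible, more defensive variant; the helper name `to_numeric_decimal_comma` and the toy values are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def to_numeric_decimal_comma(series: pd.Series) -> pd.Series:
    """Convert a column to numeric, accepting ',' as a decimal separator.

    Non-string elements (numbers, NaN) are passed through unchanged, so they
    are not lost the way they would be with the .str accessor.
    """
    if pd.api.types.is_numeric_dtype(series):
        return series
    normalised = series.map(lambda v: v.strip().replace(",", ".") if isinstance(v, str) else v)
    # errors="raise" surfaces unexpected text instead of hiding it as NaN.
    return pd.to_numeric(normalised, errors="raise")


# Hypothetical lab column mixing decimal commas, decimal points, a number, and a missing value.
mixed = pd.Series(["3,87", "4.49", 6, np.nan], name="cli_st_neutrophils")
converted = to_numeric_decimal_comma(mixed)
print(converted.tolist())
```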

In [61]:
clinic_data_cleaned.dtypes

scr_sex                                   int64
scr_age                                   int64
scr_bmi                                   int64
total_num_priortherapylines_aggressive    int64
indication_priorsct                       int64
                                          ...  
id_cleaned                                int64
DLBCL                                     int64
HGBCL DH/TH                               int64
HGBCL NOS                                 int64
tFL                                       int64
Length: 83, dtype: object

In [62]:
variances = clinic_data_cleaned.select_dtypes(include=np.number).var().sort_values()

In [63]:
# zero variance columns are not useful for modelling so I am dropping them
zero_var = variances[variances == 0].index

In [64]:
zero_var

Index(['tr_car_preaph_bridg_reg___11', 'tr_car_preaph_bridg_reg___10',
       'tr_car_preaph_bridg_reg___12', 'tr_car_preaph_bridg_reg___na',
       'tr_car_preaph_bridg_reg___ne', 'tr_car_bridg_reg___1',
       'tr_car_bridg_reg___2', 'tr_car_bridg_reg___4',
       'tr_car_preaph_bridg_reg___9', 'tr_car_bridg_reg___5',
       'tr_car_bridg_reg___9', 'tr_car_bridg_reg___10',
       'tr_car_bridg_reg___11', 'tr_car_bridg_reg___na',
       'tr_car_bridg_reg___ne', 'tr_car_ld', 'tr_car_ld_type',
       'tr_car_bridg_reg___6', 'tr_car_preaph_bridg_reg___7',
       'ae_summ_crs_res_v2', 'tr_car_preaph_bridg_reg___5',
       'tr_car_preaph_bridg_reg___4', 'tr_car_preaph_bridg_reg___3',
       'post_cart_ther_spec_2___ne', 'tr_car_preaph_bridg_reg___2',
       'tr_car_preaph_bridg_reg___1', 'post_cart_ther_spec_2___na',
       'tr_car_preaph_bridg_reg___6', 'post_cart_ther_spec_2___4'],
      dtype='object')

In [65]:
clinic_data_cleaned = clinic_data_cleaned.drop(columns=zero_var)

In [66]:
clinic_data_cleaned.shape

(30, 54)

In [67]:
clinic_data_cleaned.head()

Unnamed: 0,scr_sex,scr_age,scr_bmi,total_num_priortherapylines_aggressive,indication_priorsct,indication_whops,indication_bulkydisease,indication_stage,indication_pri_refr,indication_sec_refr,...,cli_st_trombocytes,cli_st_leukocytes,cli_st_ldh,cli_st_crp,cli_st_ferritin,id_cleaned,DLBCL,HGBCL DH/TH,HGBCL NOS,tFL
0,0,62,19,2,4,0,0,4,1,1,...,145,2.4,,0.3,894.0,5,1,0,0,0
1,1,58,19,2,1,0,0,3,0,0,...,6,5.7,275.0,9.0,371.0,6,0,0,0,1
2,0,58,30,2,4,0,0,4,1,1,...,295,5.0,885.0,47.0,2570.0,7,0,0,1,0
3,1,72,21,2,4,0,0,4,0,1,...,321,1.9,250.0,1.0,,8,1,0,0,0
4,0,48,31,2,4,0,0,4,1,1,...,324,8.7,283.0,10.0,54.0,9,0,0,0,1


In [69]:
clinic_data_cleaned.columns

Index(['scr_sex', 'scr_age', 'scr_bmi',
       'total_num_priortherapylines_aggressive', 'indication_priorsct',
       'indication_whops', 'indication_bulkydisease', 'indication_stage',
       'indication_pri_refr', 'indication_sec_refr',
       'indication_res_last_ther', 'indication_res_last_ther_spec',
       'indication_dis_lymsubtype_cns', 'indication_ind_date',
       'tr_car_preaph_br', 'tr_car_preaph_bridg_reg___8', 'tr_car_br',
       'tr_car_bridg_type', 'tr_car_bridg_reg___3', 'tr_car_bridg_reg___7',
       'tr_car_bridg_reg___8', 'tr_car_bridg_reg___12', 'tr_car_inf_adm_date',
       'tr_car_ld_start', 'tr_car_inf_date', 'tr_car_inf_discharge_date',
       'ae_summ_start_date_v2', 'ae_summ_crs_v2', 'ae_summ_highestgrade_v2',
       'ae_summ_crs_start_v2', 'ae_summ_crs_stop_v2', 'ae_summ_icans_v2',
       'surv_bestresponse_car', 'surv_time_bestresponse_car',
       'surv_prog_after_car', 'surv_status', 'surv_date', 'post_cart_ther',
       'post_cart_ther_spec_2___1', 'post

In [70]:
# Impute missing values with the median for numeric columns.
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which operates on a copy and will stop working in pandas 3.0.
for col in clinic_data_cleaned.select_dtypes(include=np.number).columns:
    median_value = clinic_data_cleaned[col].median()
    clinic_data_cleaned[col] = clinic_data_cleaned[col].fillna(median_value)
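
Imputing with a median computed on all 30 patients lets information from patients that later serve as test data influence the training data. If the models are evaluated with a train/test split or cross-validation, one option is to compute the medians on the training rows only. The following is a minimal sketch on hypothetical values; the split and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical lab values for six patients; the last two act as the held-out set.
toy = pd.DataFrame({"cli_st_ldh": [250.0, np.nan, 300.0, 400.0, np.nan, 900.0]})
train, test = toy.iloc[:4], toy.iloc[4:]

# Medians are learned from the training rows only...
train_medians = train.median(numeric_only=True)

# ...and then applied to both parts.
train_imputed = train.fillna(train_medians)
test_imputed = test.fillna(train_medians)

print("train-only median:", train_medians["cli_st_ldh"])
print("all-data median:  ", toy["cli_st_ldh"].median())
```

In this toy example the held-out patient with an LDH of 900 shifts the all-data median, which is exactly the kind of leakage the train-only variant avoids.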


In [71]:
clinic_data_cleaned.isna().sum().sort_values(ascending=False)

ae_summ_crs_start_v2                      3
ae_summ_crs_stop_v2                       3
scr_sex                                   0
post_cart_ther_spec_2___3                 0
ae_summ_icans_v2                          0
surv_bestresponse_car                     0
surv_time_bestresponse_car                0
surv_prog_after_car                       0
surv_status                               0
surv_date                                 0
post_cart_ther                            0
post_cart_ther_spec_2___1                 0
post_cart_ther_spec_2___2                 0
post_cart_ther_spec_2___5                 0
scr_age                                   0
cli_st_lab_date                           0
cli_st_hemoglobin                         0
cli_st_trombocytes                        0
cli_st_leukocytes                         0
cli_st_ldh                                0
cli_st_crp                                0
cli_st_ferritin                           0
id_cleaned                      

In [72]:
clinic_data_cleaned.columns

Index(['scr_sex', 'scr_age', 'scr_bmi',
       'total_num_priortherapylines_aggressive', 'indication_priorsct',
       'indication_whops', 'indication_bulkydisease', 'indication_stage',
       'indication_pri_refr', 'indication_sec_refr',
       'indication_res_last_ther', 'indication_res_last_ther_spec',
       'indication_dis_lymsubtype_cns', 'indication_ind_date',
       'tr_car_preaph_br', 'tr_car_preaph_bridg_reg___8', 'tr_car_br',
       'tr_car_bridg_type', 'tr_car_bridg_reg___3', 'tr_car_bridg_reg___7',
       'tr_car_bridg_reg___8', 'tr_car_bridg_reg___12', 'tr_car_inf_adm_date',
       'tr_car_ld_start', 'tr_car_inf_date', 'tr_car_inf_discharge_date',
       'ae_summ_start_date_v2', 'ae_summ_crs_v2', 'ae_summ_highestgrade_v2',
       'ae_summ_crs_start_v2', 'ae_summ_crs_stop_v2', 'ae_summ_icans_v2',
       'surv_bestresponse_car', 'surv_time_bestresponse_car',
       'surv_prog_after_car', 'surv_status', 'surv_date', 'post_cart_ther',
       'post_cart_ther_spec_2___1', 'post

In [73]:
# some date-related columns still contain NaNs; they are not used for modelling because they cannot be imputed easily
# cli_st_lab_date is not needed either
date_columns = [
    'indication_ind_date',
    'tr_car_inf_adm_date',
    'tr_car_ld_start',
    'tr_car_inf_date',
    'tr_car_inf_discharge_date',
    'ae_summ_start_date_v2',
    'ae_summ_crs_start_v2',
    'ae_summ_crs_stop_v2',
    'surv_date',
    'cli_st_lab_date'
]

clinic_data_cleaned.drop(columns=date_columns, inplace=True)


In [74]:
clinic_data_cleaned.shape

(30, 44)

In [75]:
clinic_data_cleaned.isna().sum().sum() # confirming no nans remain

0

In [76]:
import pandas as pd
import numpy as np

# ----------------------
# Constants
# ----------------------
ID_COL = "id"
LABEL_CANDIDATES = ["surv_status", "surv_status.factor", "surv_status_factor"]

In [77]:
# ----------------------
# Clinical: ensure an 'id' column exists and is integer
# ----------------------
if "id_cleaned" in clinic_data_cleaned.columns and ID_COL not in clinic_data_cleaned.columns:
    clinic_data_cleaned = clinic_data_cleaned.rename(columns={"id_cleaned": ID_COL})

if ID_COL not in clinic_data_cleaned.columns:
    raise ValueError("clinic_data_cleaned has no 'id' or 'id_cleaned' column.")

clinic = clinic_data_cleaned.copy()
clinic[ID_COL] = pd.to_numeric(clinic[ID_COL], errors="raise").astype(int)

print("Clinical shape:", clinic.shape)
print("Clinical id dtype:", clinic[ID_COL].dtype)


Clinical shape: (30, 44)
Clinical id dtype: int64


In [78]:
# ----------------------
# Radiomics: create working copies
# ----------------------
A = a_radiomics.copy()
B = b_radiomics.copy()
D = delta_radiomics_results.copy()

print("A shape:", A.shape, "| columns contain:", ("id" in A.columns), ("id_a" in A.columns))
print("B shape:", B.shape, "| columns contain:", ("id" in B.columns), ("id_b" in B.columns))
print("Delta shape:", D.shape, "| columns contain:", ("id" in D.columns))


A shape: (31, 44) | columns contain: False True
B shape: (31, 44) | columns contain: False True
Delta shape: (31, 44) | columns contain: True


In [79]:
# ----------------------
# Radiomics A/B: restore the id column if suffixing changed it (id_a / id_b -> id)
# ----------------------
if "id_a" in A.columns and ID_COL not in A.columns:
    A = A.rename(columns={"id_a": ID_COL})

if "id_b" in B.columns and ID_COL not in B.columns:
    B = B.rename(columns={"id_b": ID_COL})

# Drop leftover id_a/id_b if both exist (avoid duplicate id columns)
for df_name, df in [("A", A), ("B", B)]:
    extra = [c for c in df.columns if c in ["id_a", "id_b"]]
    if extra:
        df.drop(columns=extra, inplace=True)
        print(f"Dropped {df_name} extra columns:", extra)

print("A id column exists:", ID_COL in A.columns)
print("B id column exists:", ID_COL in B.columns)


A id column exists: True
B id column exists: True


In [80]:
# ----------------------
# Radiomics: enforce integer id dtype everywhere
# ----------------------
for name, df in [("A", A), ("B", B), ("Delta", D)]:
    if ID_COL not in df.columns:
        raise ValueError(f"{name} dataframe has no '{ID_COL}' column.")
    df[ID_COL] = pd.to_numeric(df[ID_COL], errors="raise").astype(int)

print("A id dtype:", A[ID_COL].dtype, "| B id dtype:", B[ID_COL].dtype, "| Delta id dtype:", D[ID_COL].dtype)


A id dtype: int64 | B id dtype: int64 | Delta id dtype: int64


In [81]:
# ----------------------
# Sanity check: ids must be unique (one row per patient)
# ----------------------
for name, df in [("clinic", clinic), ("A", A), ("B", B), ("Delta", D)]:
    dup = df[ID_COL].duplicated().sum()
    if dup > 0:
        raise ValueError(f"{name}: {dup} duplicated ids found. Fix duplicates before merging.")

print("All id columns are unique.")


All id columns are unique.


In [82]:
# ----------------------
# Common cohort (intersection across all four sources)
# ----------------------
common_ids = sorted(set(clinic[ID_COL]) & set(A[ID_COL]) & set(B[ID_COL]) & set(D[ID_COL]))

print(f"clinic N={clinic.shape[0]}, A N={A.shape[0]}, B N={B.shape[0]}, Delta N={D.shape[0]}")
print(f"Common cohort N={len(common_ids)}")

if len(common_ids) < 10:
    print("WARNING: common cohort is very small; ML results will be unstable.")


clinic N=30, A N=31, B N=31, Delta N=31
Common cohort N=30
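
The intersection above reports how many patients are shared, but not which ids fall outside the common cohort. A possible way to list them is an outer merge with `indicator=True`; `validate="one_to_one"` additionally re-checks id uniqueness. The ids below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical id lists: the radiomics table has one patient without clinical data.
clinic_ids = pd.DataFrame({"id": [5, 6, 7]})
radiomics_ids = pd.DataFrame({"id": [5, 6, 7, 12]})

check = clinic_ids.merge(
    radiomics_ids,
    on="id",
    how="outer",
    indicator=True,          # adds a '_merge' column: left_only / right_only / both
    validate="one_to_one",   # raises if an id is duplicated on either side
)

unmatched = check[check["_merge"] != "both"]
print(unmatched)
```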


In [83]:
# ----------------------
# Filter to common cohort and align row order by id
# ----------------------
clinic_c = clinic[clinic[ID_COL].isin(common_ids)].sort_values(ID_COL).reset_index(drop=True)
A_c      = A[A[ID_COL].isin(common_ids)].sort_values(ID_COL).reset_index(drop=True)
B_c      = B[B[ID_COL].isin(common_ids)].sort_values(ID_COL).reset_index(drop=True)
D_c      = D[D[ID_COL].isin(common_ids)].sort_values(ID_COL).reset_index(drop=True)

# Row alignment checks (must match 1:1)
assert (clinic_c[ID_COL].values == A_c[ID_COL].values).all()
assert (clinic_c[ID_COL].values == B_c[ID_COL].values).all()
assert (clinic_c[ID_COL].values == D_c[ID_COL].values).all()

print("Aligned shapes:")
print("clinic_c:", clinic_c.shape, "| A_c:", A_c.shape, "| B_c:", B_c.shape, "| D_c:", D_c.shape)


Aligned shapes:
clinic_c: (30, 44) | A_c: (30, 44) | B_c: (30, 44) | D_c: (30, 44)


In [84]:
# ----------------------
# Pick label column and create y
# ----------------------
label_col = next((c for c in LABEL_CANDIDATES if c in clinic_c.columns), None)
if label_col is None:
    raise ValueError(f"Label column not found. Tried: {LABEL_CANDIDATES}")

y = clinic_c[label_col].copy()

print("Using label column:", label_col)
print("y shape:", y.shape, "| value counts:", y.value_counts(dropna=False).to_dict())


Using label column: surv_status
y shape: (30,) | value counts: {0: 16, 1: 14}


In [85]:
# ----------------------
# Build X_clinical from clinic_c (numeric features only)
# ----------------------
if "label_col" not in globals():
    # If you didn't store label_col earlier, recover it safely
    label_col = next((c for c in LABEL_CANDIDATES if c in clinic_c.columns), None)

if label_col is None:
    raise ValueError(f"Label column not found in clinic_c. Tried: {LABEL_CANDIDATES}")

clin_feature_cols = [c for c in clinic_c.columns if c not in [ID_COL, label_col]]

non_numeric = clinic_c[clin_feature_cols].select_dtypes(exclude=[np.number]).columns.tolist()
if non_numeric:
    raise ValueError(f"Non-numeric clinical features still exist: {non_numeric}")

X_clin = clinic_c[clin_feature_cols].copy()

print("X_clin shape:", X_clin.shape)


X_clin shape: (30, 42)


In [86]:
# ----------------------
# Helper: drop id and validate all remaining columns are numeric
# ----------------------
def features_only(df: pd.DataFrame, name: str) -> pd.DataFrame:
    feat = df.drop(columns=[ID_COL]).copy()
    bad = feat.select_dtypes(exclude=[np.number]).columns.tolist()
    if bad:
        raise ValueError(f"Non-numeric columns in {name}: {bad}")
    return feat


In [87]:
# ----------------------
# Build radiomics feature matrices
# ----------------------
X_A = features_only(A_c, "A_c")
X_B = features_only(B_c, "B_c")
X_D = features_only(D_c, "D_c")

print("X_A shape:", X_A.shape)
print("X_B shape:", X_B.shape)
print("X_D shape:", X_D.shape)


X_A shape: (30, 43)
X_B shape: (30, 43)
X_D shape: (30, 43)


In [88]:
# ----------------------
# Construct the four feature sets
# ----------------------
X_clinA = pd.concat([X_clin, X_A], axis=1)
X_clinB = pd.concat([X_clin, X_B], axis=1)
X_clinD = pd.concat([X_clin, X_D], axis=1)

print("X_clin shape :", X_clin.shape)
print("X_clinA shape:", X_clinA.shape)
print("X_clinB shape:", X_clinB.shape)
print("X_clinD shape:", X_clinD.shape)


X_clin shape : (30, 42)
X_clinA shape: (30, 85)
X_clinB shape: (30, 85)
X_clinD shape: (30, 85)
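
`pd.concat(..., axis=1)` keeps duplicate column names silently, so a clinical column and a radiomic column with the same name would both survive into the combined matrix. The shapes above (42 + 43 = 85) do not rule this out. The sketch below shows one way to detect and avoid collisions. It uses small made-up frames: `clin_demo`, `rad_demo`, and the `clin_` / `radA_` prefixes are illustrative assumptions, not names from this notebook.

```python
import pandas as pd

# Toy stand-ins for X_clin and X_A (hypothetical values, not study data).
clin_demo = pd.DataFrame({"age": [61, 54], "ldh": [250.0, 410.0]})
rad_demo = pd.DataFrame({"suv_max": [12.3, 8.1], "ldh": [0.1, 0.2]})  # "ldh" collides on purpose

# Naive concatenation keeps both "ldh" columns under the same name.
naive = pd.concat([clin_demo, rad_demo], axis=1)
dupes = naive.columns[naive.columns.duplicated()].tolist()
print("Duplicated column names:", dupes)

# Prefixing each block makes every column name unique and traceable to its source.
safe = pd.concat(
    [clin_demo.add_prefix("clin_"), rad_demo.add_prefix("radA_")],
    axis=1,
)
print("Columns after prefixing:", safe.columns.tolist())
```

If the real tables never share column names, the duplicate check alone is enough and the prefixes are optional.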


In [89]:
# ----------------------
# Prepare y as a clean 0/1 integer vector
# ----------------------
y_vec = y.copy()

if y_vec.dtype == "object" or str(y_vec.dtype).startswith("category"):
    y_str = y_vec.astype(str).str.strip().str.lower()

    mapping = {
        "0": 0, "1": 1,
        "alive": 0, "dead": 1,
        "no": 0, "yes": 1
    }

    y_mapped = y_str.map(mapping)
    if y_mapped.isna().any():
        # Avoid sorted(): mixed str/NaN labels would raise TypeError and mask this error
        raise ValueError(f"Could not map labels to 0/1. Unique labels: {y_vec.unique().tolist()}")
    y_vec = y_mapped

y_vec = y_vec.astype(int)

print("y shape:", y_vec.shape)
print("Class balance:", y_vec.value_counts().to_dict())


y shape: (30,)
Class balance: {0: 16, 1: 14}


In [90]:
# ----------------------
# Shared train/test split indices (single split used for all scenarios)
# ----------------------
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
    np.arange(len(y_vec)),
    test_size=0.25,
    random_state=42,
    stratify=y_vec
)

print("Train size:", len(train_idx), "| Test size:", len(test_idx))
print("y_train balance:", y_vec.iloc[train_idx].value_counts().to_dict())
print("y_test balance :", y_vec.iloc[test_idx].value_counts().to_dict())


Train size: 22 | Test size: 8
y_train balance: {0: 12, 1: 10}
y_test balance : {0: 4, 1: 4}
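
With 8 test patients, each prediction moves accuracy by 12.5 percentage points, so metrics from this single split are very noisy. Repeated stratified cross-validation is one common way to get a more stable estimate from a cohort of this size. The sketch below is an illustration only: it runs on synthetic data from `make_classification` shaped like the 30 x 42 clinical matrix, and `X_demo`, `y_demo`, and the 5-fold x 10-repeat setting are assumptions, not part of this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as X_clin (NOT the study data).
X_demo, y_demo = make_classification(
    n_samples=30, n_features=42, n_informative=5, random_state=0
)

# Scaling sits inside the pipeline, so it is re-fit on each training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")),
])

# 5 stratified folds repeated 10 times gives 50 held-out estimates.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC-AUC over {len(scores)} folds: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```

The spread across folds (`sd`) matters as much as the mean here, since it shows how much a single split could mislead.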


In [91]:
# ----------------------
# Re-create y_train / y_test from existing split indices
# ----------------------
if "train_idx" not in globals() or "test_idx" not in globals():
    raise RuntimeError("train_idx / test_idx not found. Run the split cell first.")

if "y_vec" not in globals():
    raise RuntimeError("y_vec not found. Run the label preparation cell first.")

y_train = y_vec.iloc[train_idx]
y_test  = y_vec.iloc[test_idx]

print("y_train shape:", y_train.shape, "| balance:", y_train.value_counts().to_dict())
print("y_test shape :", y_test.shape,  "| balance:", y_test.value_counts().to_dict())


y_train shape: (22,) | balance: {0: 12, 1: 10}
y_test shape : (8,) | balance: {0: 4, 1: 4}


In [92]:
# ----------------------
# Package feature splits (raw X; pipeline handles preprocessing)
# ----------------------
X_splits = {
    "clinical":   (X_clin.iloc[train_idx],  X_clin.iloc[test_idx]),
    "clin+A":     (X_clinA.iloc[train_idx], X_clinA.iloc[test_idx]),
    "clin+B":     (X_clinB.iloc[train_idx], X_clinB.iloc[test_idx]),
    "clin+delta": (X_clinD.iloc[train_idx], X_clinD.iloc[test_idx]),
}

print("Prepared X_splits:", list(X_splits.keys()))


Prepared X_splits: ['clinical', 'clin+A', 'clin+B', 'clin+delta']


In [93]:
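# ----------------------
# Train and evaluate an RBF-kernel SVM on each feature set (same split for all)
# ----------------------
from sklearn.pipeline import Pipeline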
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

results = {}

for name, (X_train, X_test) in X_splits.items():
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", SVC(
            kernel="rbf",
            C=1.0,
            gamma="scale",
            class_weight="balanced",
            probability=True,     # for ROC-AUC (enables predict_proba)
            random_state=42
        ))
    ])

    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]

    cm = confusion_matrix(y_test, y_pred)
    res = {
        "accuracy": accuracy_score(y_test, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_proba),
        "confusion_matrix": cm,
        "report": classification_report(y_test, y_pred, zero_division=0)
    }
    results[name] = res

    print("\n" + "="*60)
    print(f"Model: {name}")
    print(f"Acc: {res['accuracy']:.3f} | BalAcc: {res['balanced_accuracy']:.3f} | F1: {res['f1']:.3f} | ROC-AUC: {res['roc_auc']:.3f}")
    print("Confusion matrix [[TN FP],[FN TP]]:")
    print(cm)
    print("\nClassification report:")
    print(res["report"])



Model: clinical
Acc: 0.750 | BalAcc: 0.750 | F1: 0.750 | ROC-AUC: 0.125
Confusion matrix [[TN FP],[FN TP]]:
[[3 1]
 [1 3]]

Classification report:
              precision    recall  f1-score   support

           0       0.75      0.75      0.75         4
           1       0.75      0.75      0.75         4

    accuracy                           0.75         8
   macro avg       0.75      0.75      0.75         8
weighted avg       0.75      0.75      0.75         8

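The clinical model above reports an accuracy of 0.750 but a ROC-AUC of 0.125, which means the predicted probabilities rank the test patients almost in reverse of the hard predictions. One possible explanation is that `SVC(probability=True)` derives `predict_proba` from Platt scaling fitted with an internal cross-validation, and the scikit-learn documentation notes that these probabilities can be inconsistent with `predict` on very small datasets. Scoring ROC-AUC from `decision_function` avoids that extra calibration step. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and `clf_demo` are assumptions introduced for the example, and it does not reproduce the numbers above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in shaped like the 30 x 42 clinical matrix (NOT the study data).
X_demo, y_demo = make_classification(
    n_samples=30, n_features=42, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# probability=True is not needed when ranking by decision_function.
clf_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")),
])
clf_demo.fit(X_tr, y_tr)

# Signed distance to the separating surface; larger means "more class 1".
scores_te = clf_demo.decision_function(X_te)
y_hat = clf_demo.predict(X_te)

auc = roc_auc_score(y_te, scores_te)
# For a binary SVC, predict() is the sign of decision_function(),
# so the ranking used for AUC cannot contradict the hard labels.
agrees = np.array_equal(y_hat, (scores_te > 0).astype(int))

print(f"ROC-AUC from decision_function: {auc:.3f} | sign matches predict: {agrees}")
```

If calibrated probabilities are needed later, they can be fitted as a separate step. For ranking metrics such as ROC-AUC, the raw decision scores are sufficient.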

Model: clin+A
Acc: 0.500 | BalAcc: 0.500 | F1: 0.500 | ROC-AUC: 0.562
Confusion matrix [[TN FP],[FN TP]]:
[[2 2]
 [2 2]]

Classification report:
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         4
           1       0.50      0.50      0.50         4

    accuracy                           0.50         8
   macro avg       0.50      0.50      0.50         8
weighted avg       0.50      0.50      0.50         8


Model: clin+B
Acc: 0.500 | BalAcc: 0.500 | F1: 0.60