# Reclassification to Current Clinical Guidelines

```{contents}
```

Current thinking:

- 14 WHO subtypes of AML
- +1 otherwise-normal control
- +1 MDS-like and secondary neoplasms

We still need to decide how to handle very rare subtypes (e.g. AML with FUS-ERG or KAT6A fusions).
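
One possible way to handle the rare subtypes noted above is to collapse any class with too few samples into a catch-all label. The sketch below is only an illustration under assumptions: the helper name `collapse_rare_classes`, the threshold `min_count=10`, and the choice of catch-all label (borrowed from the commented-out mapping entries further down in this notebook) are hypothetical and not a settled decision.

```python
import pandas as pd

def collapse_rare_classes(labels, min_count=10,
                          other_label='AML with other rare recurring translocations'):
    """Replace classes seen fewer than `min_count` times with `other_label`.

    Hypothetical helper: missing values are left untouched.
    """
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep every label that is not rare; overwrite the rare ones
    return labels.where(~labels.isin(rare), other_label)

# Toy labels, made up for illustration
toy = pd.Series(['AML with mutated NPM1'] * 12
                + ['AML with t(16;21)(p11;q22); FUS::ERG'] * 2
                + [None])
collapsed = collapse_rare_classes(toy)
print(collapsed.value_counts(dropna=False))
```

Whether pooling is appropriate depends on the downstream analysis; dropping the rare classes or keeping them as-is are equally valid alternatives.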

## Guidelines

### WHO 2022

In [1]:
import pandas as pd
from itables import show

# Read the WHO 2022 AML classification table
df = pd.read_csv('../data/who2022_aml_classification.csv')

# Display the dataframe as an interactive table
show(df)


Subtype,Frequency in paediatric AML,Morphology,Immunophenotype,Common co-occurring genetic aberrations,Prognosis


`````{admonition} Source
:class: tip
Hasle H, Meshinchi S, Fogelstrand L, Alaggio R, et al. Acute myeloid leukaemias (AMLs) with defining genetic abnormalities. In: WHO Classification of Tumours Editorial Board. Paediatric tumours [Internet]. Lyon (France): International Agency for Research on Cancer; 2022 [cited 2024 Jan 1]. (WHO classification of tumours series, 5th ed.; vol. 7). Available from: https://tumourclassification.iarc.who.int/chapters/44.
`````

## Load and process clinical data

In [1]:
# Import functions to clean up clinical data
import sys
sys.path.insert(0, '..')
from source.clinical_data_cleanup_functions import *

# Call functions to merge, index and clean clinical data files
labels_0531         = clean_cog       (merge_index_0531())
labels_1031         = clean_cog       (merge_index_1031())
labels_aml05        = clean_aml05     (merge_index_aml05())
labels_beataml      = clean_beataml   (merge_index_beataml())
labels_amltcga      = clean_amltcga   (merge_index_amltcga())
labels_nordic_all   = clean_nordic_all(merge_index_nordic_all())
labels_mds_taml     = clean_mds_taml  (merge_index_mds_taml())
labels_all_graal    = clean_all_graal (merge_index_all_graal())
labels_target_all   = clean_target_all(merge_index_target_all())

# Combine all clinical data labels into one dataframe
labels_combined = pd.concat([labels_aml05, labels_beataml,
                        labels_0531, labels_amltcga, labels_1031,
                        labels_nordic_all, labels_mds_taml,
                        labels_all_graal,labels_target_all],
                        axis=0, join='outer')

# Redefine output path (for troubleshooting purposes in case only this cell is run)
output_path = '../../Data/Intermediate_Files/'

# Read df
df = pd.read_pickle(output_path + '3330samples-333351cpgs-withbatchcorrection-bvalues.pkl')

# Remove samples that are not in the methyl dataset
df_labels = labels_combined.loc[labels_combined.index.isin(df.index)].sort_index()

# Add age categorization and main disease classification to the clinical data
# df_labels = process_df_labels(df_labels)

# Save the clinical data labels
# df_labels.to_csv(output_path + 'discovery_clinical_data.csv')

print('The clinical data has been indexed and cleaned.\n\
Exclusion of samples may be applied depending on the analysis.')

The clinical data has been indexed and cleaned.
Exclusion of samples may be applied depending on the analysis.


In [3]:
# Save the label dataframes to Excel files with pandas `to_excel` (only the COG 0531 and 1031 cohorts are currently written)
# labels_aml05.to_excel(output_path + 'aml05_clinical_data.xlsx')
# labels_beataml.to_excel(output_path + 'beataml_clinical_data.xlsx')
labels_0531.to_excel(output_path + '0531_clinical_data.xlsx')
# labels_amltcga.to_excel(output_path + 'amltcga_clinical_data.xlsx')
labels_1031.to_excel(output_path + '1031_clinical_data.xlsx')
# labels_nordic_all.to_excel(output_path + 'nordic_all_clinical_data.xlsx')
# labels_mds_taml.to_excel(output_path + 'mds_taml_clinical_data.xlsx')
# labels_all_graal.to_excel(output_path + 'all_graal_clinical_data.xlsx')
# labels_target_all.to_excel(output_path + 'target_all_clinical_data.xlsx')



In [10]:
df

IlmnID,Batch,cg00000109,cg00000236,cg00000292,cg00000363,cg00000622,cg00000658,cg00000714,cg00000721,cg00000734,...,ch.9.83519450F,ch.9.837340R,ch.9.84051654F,ch.9.84078312F,ch.9.86947500F,ch.9.88862796F,ch.9.90287778F,ch.9.90621653R,ch.9.93402636R,ch.9.98463211R
202897270043_R01C01,GSE190931,0.934,0.891,0.879,0.363,0.013,0.905,0.364,0.957,0.079,...,0.023,0.039,0.061,0.039,0.045,0.040,0.050,0.033,0.029,0.032
202897270043_R03C01,GSE190931,0.974,0.880,0.884,0.512,0.014,0.869,0.389,0.967,0.071,...,0.022,0.057,0.021,0.047,0.041,0.033,0.022,0.026,0.024,0.026
202897270043_R04C01,GSE190931,0.868,0.936,0.753,0.639,0.016,0.815,0.439,0.921,0.077,...,0.032,0.057,0.029,0.097,0.041,0.058,0.040,0.036,0.032,0.033
202897270043_R05C01,GSE190931,0.861,0.650,0.337,0.134,0.017,0.872,0.107,0.758,0.098,...,0.050,0.327,0.034,0.135,0.048,0.049,0.048,0.035,0.039,0.066
202897270043_R06C01,GSE190931,0.924,0.778,0.710,0.167,0.014,0.868,0.247,0.958,0.081,...,0.026,0.064,0.051,0.030,0.041,0.051,0.030,0.030,0.028,0.037
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203724130020_R02C01,GSE147667,0.913,0.848,0.844,0.269,0.025,0.848,0.423,0.896,0.099,...,0.041,0.062,0.046,0.037,0.058,0.051,0.063,0.044,0.022,0.081
203724130020_R03C01,GSE147667,0.899,0.844,0.862,0.308,0.025,0.854,0.353,0.876,0.096,...,0.055,0.041,0.040,0.043,0.060,0.060,0.060,0.040,0.030,0.055
203724130020_R04C01,GSE147667,0.905,0.840,0.854,0.296,0.022,0.864,0.362,0.911,0.099,...,0.050,0.046,0.045,0.060,0.062,0.044,0.056,0.044,0.030,0.050
203724130020_R05C01,GSE147667,0.887,0.846,0.881,0.281,0.023,0.861,0.387,0.884,0.098,...,0.066,0.066,0.056,0.052,0.055,0.044,0.049,0.064,0.041,0.055


In [9]:
# List rows whose `Patient_ID` contains 'P'; narrow the pattern (e.g. 'PAWDWA') to check a single patient
df_labels[df_labels['Patient_ID'].str.contains('P', na=False)]

Sample_ID,Sex,Patient_ID,Age (years),Tissue Type,WBC Count (10⁹/L),BM Leukemic blasts (%),FLT3/ITD allelic ratio,Other genetic alterations,PRDM16 expression,Non-CR,...,Protocol,WBC at Diagnosis,Initial Therapy,ALAL Presentation,WHO ALAL Classification,WHO Dx,WHO RL1,WHO RL2,St. Jude Genome ID,Comments
0003fe29-7b8f-4ef1-b9bc-40306205f1fd_noid,Male,PANVPB,14.624658,,34.1,87.3,,,,,...,,,,,,,,,,
0031efba-f564-4fff-bd7b-2b97f37218c1_noid,Female,PASDKZ,5.136986,,53.5,73.2,0.61,,,,...,,,,,,,,,,
0037ec75-bb9e-4dbb-a2d9-de1f9bfd2362_noid,Female,PASDKZ,5.136986,,53.5,73.2,0.61,,,,...,,,,,,,,,,
004c953f-d999-4d82-898d-c091e692df3c_noid,Female,PASAUT,1.542466,,10.8,34.0,,,,,...,,,,,,,,,,
0066d4af-8019-46a0-ba29-d2962c9537a7_noid,Male,PASYWA,14.194521,,19.1,95.0,0.93,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fcdaabff-e76c-48dc-9acd-dc36beb0f4ac_noid,Male,PASPLU,9.312329,,446.0,92.6,0.52,,,,...,,,,,,,,,,
fcfa89d8-f11e-4264-be00-e112c4c619f5_noid,Male,PASCRZ,14.673973,,203.3,,,,,,...,,,,,,,,,,
fd0d7fba-01de-4035-b3fa-41a84a4c562d_noid,,TARGET-10-PAPHGD,,,,,,,,,...,,,,,,,,,,
fd4e4be4-3306-4037-b4e9-7886283243ed_noid,Male,PASTZK,5.876712,,17.9,55.0,,,,,...,,,,,,,,,,
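
The output above shows one `Patient_ID` (PASDKZ) attached to two different `Sample_ID` rows. A minimal sketch of how such repeats could be listed is given below; the toy dataframe and its sample IDs are made up, and only the `Patient_ID` and `Sample_ID` names come from the table above.

```python
import pandas as pd

# Toy stand-in for `df_labels`; the index values are hypothetical
toy_labels = pd.DataFrame(
    {'Patient_ID': ['PANVPB', 'PASDKZ', 'PASDKZ', None]},
    index=pd.Index(['a', 'b', 'c', 'd'], name='Sample_ID'))

# Ignore rows without a patient ID, then keep every row of a repeated ID
has_id = toy_labels['Patient_ID'].notna()
repeated = toy_labels[has_id & toy_labels['Patient_ID'].duplicated(keep=False)]

# Number of samples per repeated patient
samples_per_patient = repeated.groupby('Patient_ID').size()
print(samples_per_patient)
```

Knowing which patients contribute several samples matters whenever samples are later treated as independent observations.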


## Classification Strategy

### Functions

In [None]:
    def first_classify_controls(normal_samples):
        mapping = {
            'Bone Marrow Normal': 'Otherwise-Normal Control',
            'Blood Derived Normal': 'Otherwise-Normal Control'}
        
        for key, value in mapping.items():
            if key in normal_samples:
                return value

    def second_classify_annotated_diagnosis(diagnosis):
        mapping = {
            'mutated NPM1': 'AML with mutated NPM1',
            'mutated CEBPA': 'AML with bZIP mutated CEBPA',
            'myelodysplasia-related changes': 'MDS-related or secondary myeloid neoplasms'
            }
        
        for key, value in mapping.items():
            if key in diagnosis:
                return value

    def third_classify_fusions(fusion_partner):
        mapping = {
        'RUNX1-RUNX1T1': 'AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1',
        'CBFB-MYH11':    'AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11',
        'KMT2A':         'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
        'add(11)(q23)':  'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
        'MLL':           'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
        'PML-RARA':      'APL with t(15;17)(q24.1;q21.2)/PML::RARA',
        'DEK-NUP214':    'AML with t(6;9)(p23;q34.1)/DEK::NUP214',
        'MECOM':         'AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)/MECOM-rearrangement',
        'ETV6':          'AML with ETV6 fusion',
        'NPM1':          'AML with mutated NPM1',
        'RBM15-MKL1':    'AML with t(1;22)(p13.3;q13.1); RBM15::MKL1',
        'NUP98':         'AML with NUP98-fusion',
        'KAT6A-CREBBP':  'AML with t(8;16)(p11.2;p13.3); KAT6A::CREBBP',
        'FUS-ERG':       'AML with t(16;21)(p11;q22); FUS::ERG',
        'CBFA2T3-GLIS2': 'AML with CBFA2T3::GLIS2 (inv(16)(p13q24))',
        'BCR-ABL1':       'AML with t(9;22)(q34.1;q11.2)/BCR::ABL1',        
        # 'RUNX1-CBFA2T3': 'AML with other rare recurring translocations',
        # 'PRDM16-RPN1':   'AML with other rare recurring translocations',
        # 'PICALM-MLLT10': 'AML with other rare recurring translocations',
        # 'RBM15-MRTFA':   'AML with other rare recurring translocations',
        'CBFA2T3-GLIS3': 'AML with CBFA2T3::GLIS2 (inv(16)(p13q24))',
        'PSIP1-NUP214':  'AML with t(6;9)(p23;q34.1)/DEK::NUP214',
        # 'XPO1-TNRC18':   'AML with other rare recurring translocations', 
        # 'HNRNPH1-ERG':   'AML with other rare recurring translocations',
        # 'NIPBL-HOXB9':   'AML with other rare recurring translocations', 
        'SET-NUP214':    'AML with t(6;9)(p23;q34.1)/DEK::NUP214', 
        # 'FLI1-IFIT2':    'AML with other rare recurring translocations', 
        # 'TCF4-ZEB2':     'AML with other rare recurring translocations',
        # 'MBTD1-ZMYND11': 'AML with other rare recurring translocations', 
        # 'FOSB-KLF6':     'AML with other rare recurring translocations', 
        # 'SFPQ-ZFP36L2':  'AML with other rare recurring translocations', 
        # 'RUNX1-LINC00478':'AML with other rare recurring translocations',
        # 'RUNX1-EVX1':     'AML with other rare recurring translocations',  
        # 'PSPC1-ZFP36L1':  'AML with other rare recurring translocations', 
        # 'EWSR1-FEV':      'AML with other rare recurring translocations',
        # 'STAG2-AFF2':     'AML with other rare recurring translocations', 
        # 'MYB-GATA1':      'AML with other rare recurring translocations', 
        # 'RUNX1-ZFPM2':    'AML with other rare recurring translocations', 
        # 'RUNX1-CBFA2T2':  'AML with other rare recurring translocations',
        # 'PIM3-BRD1':      'AML with other rare recurring translocations',
        # 'KAT6A-EP300':    'AML with other rare recurring translocations',
        # 'DOT1L-RPS15':    'AML with other rare recurring translocations',
        # 'FUS-FEV':        'AML with other rare recurring translocations',
        # 'KAT6A-NCOA2':    'AML with other rare recurring translocations',
        # 'JARID2-PTP4A1':  'AML with other rare recurring translocations',
        # 'FUS-FLI1':       'AML with other rare recurring translocations'
        }    
        
        for key, value in mapping.items():
            if key in fusion_partner:
                return value

    # def fourth_classify_karyotype(structural_variation):  
    #     mapping = {
    #         't(8;21)(q22;q22.1)': 'AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1',
    #         'inv(16)(p13.1q22) or t(16;16)(p13.1;q22)': 'AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11',
    #         't(9;11)(p22;q23.3)': 'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
    #         't(15;17)(q24.1;q21.2)': 'APL with t(15;17)(q24.1;q21.2)/PML::RARA',
    #         't(6;9)(p23;q34.1)': 'AML with t(6;9)(p23;q34.1)/DEK::NUP214',
    #         'inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)': 'AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)/MECOM-rearrangement',
    #         't(1;22)(p13.3;q13.1)': 'AML with t(1;22)(p13.3;q13.1); RBM15::MKL1',
    #         't(8;16)(p11.2;p13.3)': 'AML with t(8;16)(p11.2;p13.3); KAT6A::CREBBP',
    #         't(16;21)(p11;q22)': 'AML with t(16;21)(p11;q22); FUS::ERG',
    #         'inv(16)(p13q24)': 'AML with CBFA2T3::GLIS2 (inv(16)(p13q24))',
    #         't(9;22)(q34.1;q11.2)': 'AML with t(9;22)(q34.1;q11.2)/BCR::ABL1'
    #     }

    #     for key, value in mapping.items():
    #         if key in structural_variation:
    #             return value      

    def process_labels_who22(df):
        # Draft version: relies on `np` (numpy) and on the `classify_*` helpers
        # defined in the next cell, not on the `first_`/`second_`/`third_` drafts above.
        # Apply each classifier and join the non-empty labels in order of precedence
        df['WHO 2022 Combined Diagnoses'] = df.apply(lambda x: ','.join(filter(None, [
            classify_controls(str(x['Sample Type'])),
            classify_fusion(str(x['Gene Fusion'])),
            classify_cebpa(str(x['CEBPA mutation'])),
            classify_npm(str(x['NPM mutation'])),
            classify_annotated_diagnosis(str(x['Comment']))
        ])), axis=1)

        # Replace empty strings with NaN
        df['WHO 2022 Combined Diagnoses'] = df['WHO 2022 Combined Diagnoses'].replace('', np.nan)

        # Create `WHO AML 2022 Diagnosis` by taking the first label of the combined string
        df['WHO AML 2022 Diagnosis'] = df['WHO 2022 Combined Diagnoses'].str.split(',').str[0]

        return df

In [17]:
    def classify_controls(normal_samples):
        mapping = {
            'Bone Marrow Normal': 'Otherwise-Normal Control',
            'Blood Derived Normal': 'Otherwise-Normal Control'}
        
        for key, value in mapping.items():
            if key in normal_samples:
                return value

    def classify_fusion(gene_fusion):
        mapping = {
        'RUNX1-RUNX1T1': 'AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1',
        'CBFB-MYH11':    'AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11',
        'KMT2A':         'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
        'add(11)(q23)':  'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
        'MLL':           'AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement',
        'PML-RARA':      'APL with t(15;17)(q24.1;q21.2)/PML::RARA',
        'DEK-NUP214':    'AML with t(6;9)(p23;q34.1)/DEK::NUP214',
        'MECOM':         'AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)/MECOM-rearrangement',
        'ETV6':          'AML with ETV6 fusion',
        'NPM1':          'AML with mutated NPM1',
        'RBM15-MKL1':    'AML with t(1;22)(p13.3;q13.1); RBM15::MKL1',
        'NUP98':         'AML with NUP98-fusion',
        'KAT6A-CREBBP':  'AML with t(8;16)(p11.2;p13.3); KAT6A::CREBBP',
        'FUS-ERG':       'AML with t(16;21)(p11;q22); FUS::ERG',
        'CBFA2T3-GLIS2': 'AML with CBFA2T3::GLIS2 (inv(16)(p13q24))',
        'BCR-ABL1':       'AML with t(9;22)(q34.1;q11.2)/BCR::ABL1',

        # Other uncharacterized abnormalities present in the dataset but not in the guidelines
        
        # 'RUNX1-CBFA2T3': 'AML with other rare recurring translocations',
        # 'PRDM16-RPN1':   'AML with other rare recurring translocations',
        # 'PICALM-MLLT10': 'AML with other rare recurring translocations',
        # 'RBM15-MRTFA':   'AML with other rare recurring translocations',
        # 'CBFA2T3-GLIS3': 'AML with CBFA2T3::GLIS2 (inv(16)(p13q24))',
        # 'PSIP1-NUP214':  'AML with t(6;9)(p23;q34.1)/DEK::NUP214',
        # 'XPO1-TNRC18':   'AML with other rare recurring translocations', 
        # 'HNRNPH1-ERG':   'AML with other rare recurring translocations',
        # 'NIPBL-HOXB9':   'AML with other rare recurring translocations', 
        # 'SET-NUP214':    'AML with t(6;9)(p23;q34.1)/DEK::NUP214', 
        # 'FLI1-IFIT2':    'AML with other rare recurring translocations', 
        # 'TCF4-ZEB2':     'AML with other rare recurring translocations',
        # 'MBTD1-ZMYND11': 'AML with other rare recurring translocations', 
        # 'FOSB-KLF6':     'AML with other rare recurring translocations', 
        # 'SFPQ-ZFP36L2':  'AML with other rare recurring translocations', 
        # 'RUNX1-LINC00478':'AML with other rare recurring translocations',
        # 'RUNX1-EVX1':     'AML with other rare recurring translocations',  
        # 'PSPC1-ZFP36L1':  'AML with other rare recurring translocations', 
        # 'EWSR1-FEV':      'AML with other rare recurring translocations',
        # 'STAG2-AFF2':     'AML with other rare recurring translocations', 
        # 'MYB-GATA1':      'AML with other rare recurring translocations', 
        # 'RUNX1-ZFPM2':    'AML with other rare recurring translocations', 
        # 'RUNX1-CBFA2T2':  'AML with other rare recurring translocations',
        # 'PIM3-BRD1':      'AML with other rare recurring translocations',
        # 'KAT6A-EP300':    'AML with other rare recurring translocations',
        # 'DOT1L-RPS15':    'AML with other rare recurring translocations',
        # 'FUS-FEV':        'AML with other rare recurring translocations',
        # 'KAT6A-NCOA2':    'AML with other rare recurring translocations',
        # 'JARID2-PTP4A1':  'AML with other rare recurring translocations',
        # 'FUS-FLI1':       'AML with other rare recurring translocations'
        }    
        
        for key, value in mapping.items():
            if key in gene_fusion:
                return value

    def classify_cebpa(cebpa_mutation):
        mapping = {
            'Yes': 'AML with bZIP mutated CEBPA'}
        
        for key, value in mapping.items():
            if key in cebpa_mutation:
                return value

    def classify_npm(npm_mutation):
        mapping = {
            'Yes': 'AML with mutated NPM1',
        }

        for key, value in mapping.items():
            if key in npm_mutation:
                return value
            
    def classify_annotated_diagnosis(diagnosis):
        mapping = {
            'mutated NPM1': 'AML with mutated NPM1',
            'mutated CEBPA': 'AML with bZIP mutated CEBPA',
            # 'myelodysplasia-related changes': 'MDS-related or secondary myeloid neoplasms'
            }
        
        for key, value in mapping.items():
            if key in diagnosis:
                return value

    def classify_karyotype(structural_variation):
        mapping = {
            't(8;16)': 'AML with t(8;16)(p11.2;p13.3); KAT6A::CREBBP',
            't(16;21)': 'AML with t(16;21)(p11;q22); FUS::ERG',
            't(6;9)': 'AML with t(6;9)(p23;q34.1)/DEK::NUP214',
            't(1;22)': 'AML with t(1;22)(p13.3;q13.1); RBM15::MKL1',
            'inv(3)': 'AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)/MECOM-rearrangement',
            't(3;3)': 'AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)/MECOM-rearrangement',}
        
        for key, value in mapping.items():
            if key in structural_variation:
                return value

    def process_labels_who22(df):
        import numpy as np  # used for np.nan below

        # Apply each classifier to its source column; each returns a label or None
        df['WHO 2022_Controls'] = df['Sample Type'].astype(str).apply(classify_controls)
        df['WHO 2022_Gene Fusion'] = df['Gene Fusion'].astype(str).apply(classify_fusion)
        df['WHO 2022_CEBPA'] = df['CEBPA mutation'].astype(str).apply(classify_cebpa)
        df['WHO 2022_NPM1'] = df['NPM mutation'].astype(str).apply(classify_npm)
        df['WHO 2022_Comment'] = df['Comment'].astype(str).apply(classify_annotated_diagnosis)
        df['WHO 2022_Karyotype'] = df['Karyotype'].astype(str).apply(classify_karyotype)

        helper_cols = ['WHO 2022_Controls', 'WHO 2022_Gene Fusion', 'WHO 2022_CEBPA',
                       'WHO 2022_NPM1', 'WHO 2022_Comment', 'WHO 2022_Karyotype']

        # Join the non-missing labels in order of precedence (controls first, karyotype last)
        df['WHO 2022 Combined Diagnoses'] = df[helper_cols]\
            .apply(lambda x: ','.join(filter(lambda i: i is not None and i==i, x)), axis=1)

        # Replace empty strings with NaN
        df['WHO 2022 Combined Diagnoses'] = df['WHO 2022 Combined Diagnoses'].replace('', np.nan)

        # Create `WHO AML 2022 Diagnosis` by splitting `WHO 2022 Combined Diagnoses` on commas and taking the first element
        df['WHO AML 2022 Diagnosis'] = df['WHO 2022 Combined Diagnoses'].str.split(',').str[0]

        # Drop the intermediate columns, keeping `WHO AML 2022 Diagnosis` and `WHO 2022 Combined Diagnoses`
        df = df.drop(helper_cols, axis=1)

        return df
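
A minimal, self-contained sketch of how the combined column resolves to a single diagnosis: each classifier returns a label or `None`, the hits are collected in a fixed order, and the first one wins. The toy rows, the reduced mappings, and the helper names `match_first` and `combine` are made up for illustration; only the column names and label strings come from the cell above.

```python
import pandas as pd

def match_first(text, mapping):
    """Return the label of the first key found as a substring of `text`, else None."""
    for key, label in mapping.items():
        if key in text:
            return label
    return None

# Reduced mappings (illustrative subset of the full dictionaries)
FUSIONS = {'RUNX1-RUNX1T1': 'AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1',
           'NUP98': 'AML with NUP98-fusion'}
NPM = {'Yes': 'AML with mutated NPM1'}

# Hypothetical samples
toy = pd.DataFrame({
    'Gene Fusion':  ['NUP98-NSD1', None, None],
    'NPM mutation': ['Yes', 'Yes', 'No'],
})

def combine(row):
    # Order of this list defines precedence: fusion first, then NPM1
    hits = [match_first(str(row['Gene Fusion']), FUSIONS),
            match_first(str(row['NPM mutation']), NPM)]
    return [h for h in hits if h is not None]

combined = [combine(row) for _, row in toy.iterrows()]
toy['WHO 2022 Combined Diagnoses'] = [','.join(h) if h else None for h in combined]
toy['WHO AML 2022 Diagnosis'] = [h[0] if h else None for h in combined]
print(toy[['WHO 2022 Combined Diagnoses', 'WHO AML 2022 Diagnosis']])
```

In this toy case the first sample carries both a NUP98 fusion and an NPM1 mutation and is assigned the fusion label, because the fusion classifier comes first. Keeping the hits as a list also avoids depending on the labels being free of commas.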

### Execution

In [18]:
labels_cog = pd.concat([labels_0531, labels_1031],
                        axis=0, join='outer')

labels_cog_who = process_labels_who22(labels_cog)

labels_cog_who.to_excel(output_path + 'cog_clinical_data.xlsx')

In [19]:
labels_cog_who['WHO AML 2022 Diagnosis'].value_counts()

WHO AML 2022 Diagnosis
AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement                           416
AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1                                245
AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11             199
Otherwise-Normal Control                                                  181
AML with mutated NPM1                                                     142
AML with NUP98-fusion                                                     123
AML with bZIP mutated CEBPA                                                85
AML with CBFA2T3::GLIS2 (inv(16)(p13q24))                                  39
AML with t(6;9)(p23;q34.1)/DEK::NUP214                                     33
AML with ETV6 fusion                                                       25
AML with t(1;22)(p13.3;q13.1); RBM15::MKL1                                 15
AML with t(16;21)(p11;q22); FUS::ERG                                       12
AML with t(8;16)(p11.2;p13.3); KAT6A::CRE

In [16]:
labels_cog_who['WHO AML 2022 Diagnosis'].value_counts()

WHO AML 2022 Diagnosis
AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement                           416
AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1                                245
AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11             199
Otherwise-Normal Control                                                  181
AML with mutated NPM1                                                     142
AML with NUP98-fusion                                                     123
AML with bZIP mutated CEBPA                                                85
AML with CBFA2T3::GLIS2 (inv(16)(p13q24))                                  40
AML with t(6;9)(p23;q34.1)/DEK::NUP214                                     35
AML with ETV6 fusion                                                       25
AML with t(1;22)(p13.3;q13.1); RBM15::MKL1                                 13
AML with t(8;16)(p11.2;p13.3); KAT6A::CREBBP                                9
AML with t(16;21)(p11;q22); FUS::ERG     

## Later

In [None]:
def process_df_labels(df):
    """
    Function to process a pandas dataframe, performing age categorization 
    and main disease classification.

    Args:
        df (pandas.DataFrame): Input dataframe.

    Returns:
        pandas.DataFrame: Processed dataframe with age categorized and main disease classified.
    """
    import numpy as np
    import pandas as pd

    def categorize_age(age):
        """
        Function to categorize age into a specific range.

        Args:
            age (int or float): The age to be categorized.

        Returns:
            str: A string representation of the age category.
        """
        if pd.isnull(age):
            return np.nan
        elif age < 5:
            return '0-5'
        elif age < 13:
            return '5-13'
        elif age < 39:
            return '13-39'
        elif age < 60:
            return '39-60'
        else:
            return '60+'

    def classify_main_disease(subtype):
        """
        Function to classify the main disease based on a given subtype.

        Args:
            subtype (str): The subtype of the disease.

        Returns:
            str: A string representation of the main disease.
        """
        mapping = {
            'AML': 'Acute myeloid leukemia (AML)',
            'ALL': 'Acute lymphoblastic leukemia (ALL)',
            'MDS': 'Myelodysplastic syndrome (MDS or MDS-like)',
            'Mixed phenotype acute leukemia': 'Mixed phenotype acute leukemia (MPAL)',
            'APL': 'Acute promyelocytic leukemia (APL)',
            'Otherwise-Normal Control': 'Otherwise-Normal (Control)'
        }

        for key, value in mapping.items():
            if key in subtype:
                return value

    def main_disease_class(df):
        """
        Function to classify the main disease and derive the `Hematopoietic Group` column.

        Args:
            df (pandas.DataFrame): The dataframe to be processed.

        Returns:
            pandas.DataFrame: The processed dataframe with a new `Hematopoietic Group` column.
        """

        df['WHO_ALL'] = df['WHO ALL 2022 Diagnosis'].astype(str).apply(classify_main_disease)
        df['ELN_AML'] = df['ELN AML 2022 Diagnosis'].astype(str).apply(classify_main_disease)

        df['Hematopoietic Group'] = df[['ELN_AML', 'WHO_ALL']] \
            .apply(lambda x: ','.join(filter(lambda i: i is not None and i == i, x)), axis=1) \
            .replace('', np.nan)

        # Drop the intermediate `ELN_AML` and `WHO_ALL` columns
        df = df.drop(['ELN_AML', 'WHO_ALL'], axis=1)

        return df

    # Convert 'Age (years)' to numeric, errors='coerce' will turn non-numeric data to NaN
    df['Age (years)'] = pd.to_numeric(df['Age (years)'], errors='coerce')

    # Assign each sample to an age group
    df['Age (group years)'] = df['Age (years)'].apply(categorize_age)
    
    # Process labels
    df = main_disease_class(df)
    
    # Create `WHO 2022 Diagnosis` column
    df['WHO 2022 Diagnosis'] = df[['WHO AML 2022 Diagnosis', 'WHO ALL 2022 Diagnosis']] \
            .apply(lambda x: ','.join(filter(lambda i: i is not None and i == i, x)), axis=1) \
            .replace('', np.nan)

    return df
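
As an alternative to the `if`/`elif` chain in `categorize_age`, the same bins could be expressed with `pandas.cut`. This is a sketch under assumptions: the names `AGE_BINS` and `AGE_LABELS` and the toy ages are illustrative, and `right=False` is chosen so that each bin includes its lower edge and excludes its upper edge, matching the `<` comparisons above.

```python
import numpy as np
import pandas as pd

AGE_BINS = [0, 5, 13, 39, 60, np.inf]          # left-closed edges: [0, 5), [5, 13), ...
AGE_LABELS = ['0-5', '5-13', '13-39', '39-60', '60+']

# Toy ages, including a missing value and a non-numeric entry
ages = pd.Series([0.5, 5.0, 12.9, 39.0, 72.3, np.nan, 'unknown'])
ages = pd.to_numeric(ages, errors='coerce')    # non-numeric entries become NaN

age_group = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)
print(age_group.tolist())
```

Two differences from the function above are worth noting: the result is an ordered categorical rather than plain strings, and negative ages fall outside the bins and become missing instead of being labelled `'0-5'`.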

### WHO AML 2022 Diagnosis

In [2]:
df_labels['WHO AML 2022 Diagnosis'].value_counts()

WHO AML 2022 Diagnosis
AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement                           313
MDS-related or secondary myeloid neoplasms                                230
AML with mutated NPM1                                                     179
AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11             178
AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1                                169
Otherwise-Normal Control                                                  162
AML with NUP98-fusion                                                      97
AML with bZIP mutated CEBPA                                                69
APL with t(15;17)(q24.1;q21.2)/PML::RARA                                   31
AML with CBFA2T3::GLIS2 (inv(16)(p13q24))                                  30
AML with t(6;9)(p23;q34.1)/DEK::NUP214                                     28
AML with ETV6 fusion                                                       16
AML with inv(3)(q21.3q26.2) or t(3;3)(q21

### ELN AML 2022 Diagnosis

In [3]:
df_labels['ELN AML 2022 Diagnosis'].value_counts()

ELN AML 2022 Diagnosis
AML with t(9;11)(p22;q23.3)/KMT2A-rearrangement                           313
MDS-related or secondary myeloid neoplasms                                228
AML with other rare recurring translocations                              185
AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22)/CBFB::MYH11             178
AML with mutated NPM1                                                     172
AML with t(8;21)(q22;q22.1)/RUNX1::RUNX1T1                                169
Otherwise-Normal Control                                                  162
AML with in-frame bZIP mutated CEBPA                                       69
APL with t(15;17)(q24.1;q21.2)/PML::RARA                                   31
AML with t(6;9)(p23;q34.1)/DEK::NUP214                                     28
AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2)/MECOM-rearrangement     10
AML with t(9;22)(q34.1;q11.2)/BCR::ABL1                                     3
Myeloid leukaemia associated with Down sy

### Evaluate final sample size by batch

In [17]:
df['Batch'].value_counts(dropna=False)

Batch
GSE49031          933
GSE190931         581
GSE124413         495
GSE159907         316
GDC_TARGET-AML    287
GDC_TCGA-AML      194
GSE152710         166
GSE147667         153
GDC_TARGET-ALL    141
GSE133986          64
Name: count, dtype: int64
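
Beyond the per-batch totals, it can help to see how diagnoses are distributed across batches. The sketch below assumes, as in the filtering step earlier in this notebook, that the methylation table and the label table share the same sample index; the toy dataframes, the sample IDs, and the `'Unclassified'` placeholder are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the methylation table (`df`) and the label table (`df_labels`)
methyl = pd.DataFrame({'Batch': ['GSE190931', 'GSE190931', 'GSE147667', 'GSE147667']},
                      index=['s1', 's2', 's3', 's4'])
labels = pd.DataFrame({'WHO AML 2022 Diagnosis': ['AML with mutated NPM1', None,
                                                  'AML with mutated NPM1',
                                                  'AML with NUP98-fusion']},
                      index=['s1', 's2', 's3', 's4'])

# Align labels to the methylation samples and mark unlabelled samples explicitly
diagnosis = (labels['WHO AML 2022 Diagnosis']
             .reindex(methyl.index)
             .fillna('Unclassified'))

# Rows are batches, columns are diagnoses, cells are sample counts
per_batch = pd.crosstab(methyl['Batch'], diagnosis)
print(per_batch)
```

A table like this makes it easier to spot diagnoses that occur in only one batch, where batch and biology cannot be separated.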