# Genie data ETL

Loads the demographic data, mutation calls, and copy number calls from AACR Project GENIE and converts them to key-value pairs for transformer input.

The GENIE data are available from https://www.synapse.org/.

First, load the **clinical patient data**. Keep only the columns of interest:
- `PATIENT_ID`: Unique identifier for each patient (e.g., 'GENIE-VICC-101416').
- `SEX`: Biological sex of the patient (e.g., 'Male', 'Female', 'Unknown').
- `PRIMARY_RACE`: Patient's self-reported or recorded primary race (e.g., 'White', 'Black', 'Asian', etc.).
- `ETHNICITY`: Patient's self-reported or recorded ethnicity (e.g., 'Non-Spanish/non-Hispanic', 'Unknown').

In [26]:
import pandas as pd
import os

# Data path
data_path = os.path.join(os.getcwd(), 'data', 'genie')

# Load clinical patient data with only the columns of interest
patient_keep_cols = ['PATIENT_ID', 'SEX', 'PRIMARY_RACE', 'ETHNICITY']
clinical_patient_df = pd.read_csv(
    os.path.join(data_path, 'data_clinical_patient.txt'),
    sep='\t',
    comment="#",
    dtype=str,
    usecols=patient_keep_cols  # Load only the specified columns
)

# Print confirmation message with the length of the dataframe
print(f"Data loaded for {len(clinical_patient_df)} patients")

Data loaded for 196244 patients
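The loader above relies on `comment="#"` to skip the metadata lines at the top of the file. A minimal sketch of that behavior on an in-memory file follows; the sample rows and the `EXTRA` column are invented for illustration and are not taken from GENIE.

```python
import io
import pandas as pd

# Hypothetical miniature of a clinical file: metadata lines start with '#'
raw = (
    "#Patient Identifier\tSex\tPrimary Race\tEthnicity\tExtra\n"
    "#STRING\tSTRING\tSTRING\tSTRING\tSTRING\n"
    "PATIENT_ID\tSEX\tPRIMARY_RACE\tETHNICITY\tEXTRA\n"
    "P-0001\tFemale\tWhite\tNon-Spanish/non-Hispanic\tx\n"
    "P-0002\tMale\tUnknown\tUnknown\ty\n"
)

toy_patient_df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    comment="#",   # fully commented lines are ignored, so the header is found correctly
    dtype=str,     # keep every value as text
    usecols=["PATIENT_ID", "SEX", "PRIMARY_RACE", "ETHNICITY"],  # EXTRA is never loaded
)
print(toy_patient_df)
```

One caveat: `comment="#"` also truncates a data line at any `#` that appears inside a value, so it is worth checking that no field of interest contains that character.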


Now load the **clinical sample data** with only the columns of interest:
- `PATIENT_ID`: Unique identifier for each patient (e.g. 'GENIE-VICC-101416').
- `SAMPLE_ID`: Unique identifier for each sample associated with a patient (e.g., 'GENIE-DFCI-002910-3355').
- `CANCER_TYPE`: The type of cancer diagnosed for the patient (e.g., 'Breast Cancer', 'Lung Cancer', etc.).
- `AGE_AT_SEQ_REPORT`: Patient's age at the time of sequencing, often used for demographic and survival analyses (e.g. '52', '<18', '>89').


In [27]:
# Load clinical sample data with only the columns of interest
sample_keep_cols = ['PATIENT_ID', 'SAMPLE_ID', 'CANCER_TYPE', 'AGE_AT_SEQ_REPORT']
clinical_sample_df = pd.read_csv(
    os.path.join(data_path, 'data_clinical_sample.txt'),
    sep='\t',
    comment="#",
    dtype=str,
    usecols=sample_keep_cols  # Load only the specified columns
)

# Print confirmation message with the length of the dataframe
print(f"Data loaded for {len(clinical_sample_df)} clinical samples.")

Data loaded for 229453 clinical samples.


Load the **mutation data** with only the columns of interest:
- `Tumor_Sample_Barcode`: Identifies individual tumor samples (e.g., 'GENIE-DFCI-002910-3355').
- `Hugo_Symbol`: Gene names in Hugo Gene Nomenclature Committee (HGNC) format (e.g. 'KRAS', 'BRAF', etc.).
- `Variant_Classification`: Type of mutation (e.g., 'Missense_Mutation', 'Nonsense_Mutation', 'Silent', etc.).

In [28]:
# Load mutation data and keep only the columns of interest
keep_cols = ['Tumor_Sample_Barcode', 'Hugo_Symbol', 'Variant_Classification']
mutations_df = pd.read_csv(
    os.path.join(data_path, 'data_mutations_extended.txt'),
    sep='\t',
    comment="#",
    dtype=str,
    usecols=keep_cols  # Load only the specified columns
)

# Print confirmation message with the length of the dataframe
print(f"Data loaded for {len(mutations_df)} mutations.")

Data loaded for 2266029 mutations.


Load the **Copy Number Alteration (CNA) data**.

- `Hugo_Symbol`: The gene name based on the Hugo Gene Nomenclature Committee (HGNC) standard.
- All other columns: One column per tumor sample, named by its sample barcode. Each value is the copy number alteration call for that gene in that sample.

Key Notes:
- The file is wide: 148,925 columns in this case, i.e. `Hugo_Symbol` plus one column per tumor sample, with one row per gene.
- Use the `Hugo_Symbol` column to identify genes and the sample barcode columns to compare alterations across samples.


In [29]:
# Load the CNA data
cna_df = pd.read_csv(
    os.path.join(data_path, 'data_CNA.txt'),
    sep='\t',
    comment="#",
    dtype=str
)

# Only use for quick testing, loads a few rows (same file name as above, including case)
# cna_df = pd.read_csv(os.path.join(data_path, 'data_CNA.txt'), sep='\t', nrows=10, comment="#", dtype=str)

# Print confirmation message with the number of columns ('Hugo_Symbol' plus one per sample)
print(f"Data loaded for {len(cna_df.columns)} CNAs.")

Data loaded for 148925 CNAs.
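Because the CNA table is wide, its column count and its row count mean different things: columns are samples (plus `Hugo_Symbol`), rows are genes. The sketch below separates the two on a tiny invented table; it also shows that `nrows=0` reads only the header, which can be a cheap way to count samples before loading the full file.

```python
import io
import pandas as pd

# Hypothetical miniature of data_CNA.txt: one row per gene, one column per sample
raw = (
    "Hugo_Symbol\tS-1\tS-2\tS-3\n"
    "KRAS\t0\t2\t-1\n"
    "TP53\t-2\t0\t0\n"
)

# nrows=0 parses only the header line
header_only = pd.read_csv(io.StringIO(raw), sep="\t", comment="#", dtype=str, nrows=0)
n_samples = len(header_only.columns) - 1   # every column except Hugo_Symbol

toy_cna_df = pd.read_csv(io.StringIO(raw), sep="\t", comment="#", dtype=str)
n_genes = len(toy_cna_df)                  # one row per gene

print(f"{n_genes} genes x {n_samples} samples")
```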


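Reshape the **Copy Number Alteration (CNA)** DataFrame from wide format (one row per gene, one column per sample) to long format (one row per gene-sample pair) using the `melt` function. The long format exposes `SAMPLE_ID` and `Hugo_Symbol` as ordinary columns, which is what the later merge with the clinical and mutation data needs.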


In [30]:
cna_melted = cna_df.melt(
    id_vars=["Hugo_Symbol"],     # Columns to keep as-is
    var_name="SAMPLE_ID",        # Name for the former column headers
    value_name="cna_value"       # Name for the values
)

# Print confirmation message with the length of the dataframe
print(f"Data loaded for {len(cna_melted)} CNA values.")

Data loaded for 149221848 CNA values.
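To make the reshape concrete, here is a small sketch of `melt` on an invented 2-gene, 3-sample table. The number of rows in the long table is the number of genes times the number of samples, which is why the real table grows to roughly 149 million rows.

```python
import pandas as pd

# Hypothetical wide CNA table: 2 genes x 3 samples
toy_cna = pd.DataFrame({
    "Hugo_Symbol": ["KRAS", "TP53"],
    "S-1": ["0", "-2"],
    "S-2": ["2", "0"],
    "S-3": ["-1", None],
})

toy_long = toy_cna.melt(
    id_vars=["Hugo_Symbol"],   # kept as a column
    var_name="SAMPLE_ID",      # former column headers
    value_name="cna_value",    # former cell values
)

# One row per (gene, sample) pair
print(len(toy_long))  # 6
print(toy_long)
```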


Begin merging the tables, starting with the patient and sample clinical data. Use the 'PATIENT_ID' column as the key to combine these datasets.

In [31]:
# First, merge patient and sample clinical data
clinical_df = pd.merge(clinical_patient_df, clinical_sample_df, on='PATIENT_ID', how='inner')

# Print confirmation message (one row per sample, since a patient can have several samples)
print(f"Merged clinical data for {len(clinical_df)} samples.")

Merged clinical data for 229453 samples.
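The merged table has one row per sample rather than one per patient, because each patient row is repeated for every sample of that patient. A small sketch of this one-to-many merge follows, using invented IDs; `validate="one_to_many"` is an optional safeguard (not used in the cell above) that raises an error if `PATIENT_ID` is duplicated in the patient table.

```python
import pandas as pd

toy_patients = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2", "P-3"],
    "SEX": ["Female", "Male", "Female"],
})
toy_samples = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-1", "P-2"],   # P-1 has two samples, P-3 has none
    "SAMPLE_ID": ["S-1", "S-2", "S-3"],
})

toy_clinical = pd.merge(
    toy_patients, toy_samples,
    on="PATIENT_ID",
    how="inner",               # patients without samples (P-3) are dropped
    validate="one_to_many",    # each patient appears once on the left side
)

n_rows = len(toy_clinical)                          # number of samples
n_patients = toy_clinical["PATIENT_ID"].nunique()   # number of distinct patients
print(n_rows, n_patients)  # 3 2
```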


Then merge the mutation data to incorporate columns like "Hugo_Symbol" and "Variant_Classification".
We match 'SAMPLE_ID' (in the clinical table) to 'Tumor_Sample_Barcode' (in the mutation table).

In [32]:
clinical_mutations_df = pd.merge(
    clinical_df,
    mutations_df,
    left_on='SAMPLE_ID',        
    right_on='Tumor_Sample_Barcode',
    how='left'
)

# Print confirmation message
print(f"Merged mutation data for {len(clinical_mutations_df)} entries.")

Merged mutation data for 2290611 entries.
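With `how='left'`, samples that have no mutation record are kept, with missing values in the mutation columns. The sketch below uses the `indicator=True` option of `pd.merge` on invented data to list such samples; this check is an illustration and is not part of the pipeline above.

```python
import pandas as pd

toy_clinical = pd.DataFrame({
    "SAMPLE_ID": ["S-1", "S-2", "S-3"],
    "CANCER_TYPE": ["Breast Cancer", "Melanoma", "Leukemia"],
})
toy_mutations = pd.DataFrame({
    "Tumor_Sample_Barcode": ["S-1", "S-1", "S-3"],
    "Hugo_Symbol": ["KRAS", "TP53", "BRAF"],
    "Variant_Classification": ["Missense_Mutation", "Nonsense_Mutation", "Missense_Mutation"],
})

toy_merged = pd.merge(
    toy_clinical, toy_mutations,
    left_on="SAMPLE_ID",
    right_on="Tumor_Sample_Barcode",
    how="left",
    indicator=True,   # adds a '_merge' column: 'both' or 'left_only'
)

# Samples that appear in the clinical table but have no mutation rows
samples_without_mutations = toy_merged.loc[
    toy_merged["_merge"] == "left_only", "SAMPLE_ID"
].tolist()
print(len(toy_merged), samples_without_mutations)  # 4 ['S-2']
```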


Merge the **clinical mutations DataFrame** with the **melted CNA DataFrame**. This step combines clinical and mutation data with copy number alteration (CNA) values for each sample and gene.

For merging, we use:
- `SAMPLE_ID`: Ensures that rows correspond to the same sample in both datasets.
- `Hugo_Symbol`: Ensures that rows correspond to the same gene in both datasets.

Together, these two columns act as a composite key, meaning that both must match between the two datasets for a row to be considered a match.


In [33]:
clinical_mutations_cna_df = pd.merge(
    clinical_mutations_df,
    cna_melted,
    on=["SAMPLE_ID", "Hugo_Symbol"],  # Merge on both sample and gene
    how="left"  # 'left' keeps all rows from clinical_mutations_df, adds CNA columns if matching
)

# Print confirmation message
print(f"Merged CNA data for {len(clinical_mutations_cna_df)} entries.")

Merged CNA data for 2290611 entries.
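One consequence of a left merge on the composite key is worth seeing on a small example: a CNA value is attached only where the same sample already has a mutation row for the same gene. In the invented data below, the `EGFR` amplification in `S-1` and the `KRAS` loss in `S-2` do not appear in the result, because those sample-gene pairs have no mutation row.

```python
import pandas as pd

toy_mutations = pd.DataFrame({
    "SAMPLE_ID": ["S-1", "S-1", "S-2"],
    "Hugo_Symbol": ["KRAS", "TP53", "BRAF"],
    "Variant_Classification": ["Missense_Mutation", "Nonsense_Mutation", "Missense_Mutation"],
})
toy_cna_long = pd.DataFrame({
    "Hugo_Symbol": ["KRAS", "EGFR", "KRAS"],
    "SAMPLE_ID": ["S-1", "S-1", "S-2"],
    "cna_value": ["2", "2", "-1"],
})

toy_merged = pd.merge(
    toy_mutations, toy_cna_long,
    on=["SAMPLE_ID", "Hugo_Symbol"],   # both columns must match
    how="left",                        # row count of the left table is preserved
)

n_with_cna = int(toy_merged["cna_value"].notna().sum())
print(toy_merged)
print(n_with_cna)  # 1
```

If CNA events in unmutated genes should be retained, an outer merge on the same keys would be one option to consider, at the cost of a larger table.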


Now transform data to keep only appropriate values for key-value Transformer input.

In [55]:
# Drop the 'Tumor_Sample_Barcode' column, which is redundant with 'SAMPLE_ID' after the merge.
# errors='ignore' makes the cell safe to re-run once the column is already gone.
clinical_mutations_cna_df.drop(columns=['Tumor_Sample_Barcode'], inplace=True, errors='ignore')

# Replace PRIMARY_RACE 'Not Applicable', 'Not Collected' and 'UNKNOWN' values with 'Unknown'
clinical_mutations_cna_df['PRIMARY_RACE'] = clinical_mutations_cna_df['PRIMARY_RACE'].replace(
    ['Not Applicable', 'Not Collected', 'UNKNOWN'], 'Unknown'
)

# Replace ETHNICITY 'Not Collected' and 'UNKNOWN' values with 'Unknown'
clinical_mutations_cna_df['ETHNICITY'] = clinical_mutations_cna_df['ETHNICITY'].replace(
    ['Not Collected', 'UNKNOWN'], 'Unknown'
)

# Replace CANCER_TYPE 'UNKNOWN' values with 'Unknown'
clinical_mutations_cna_df['CANCER_TYPE'] = clinical_mutations_cna_df['CANCER_TYPE'].replace('UNKNOWN', 'Unknown')

# Replace the censored ages with random integers so they can later be parsed as numbers:
# '<18' becomes 15..18 and '>89' becomes 90..95 (random.randint includes both endpoints)
import random
random.seed(0)  # arbitrary fixed seed so that re-running the cell gives the same ages
clinical_mutations_cna_df['AGE_AT_SEQ_REPORT'] = clinical_mutations_cna_df['AGE_AT_SEQ_REPORT'].apply(
    lambda x: str(random.randint(15, 18)) if x == '<18' else x
)
clinical_mutations_cna_df['AGE_AT_SEQ_REPORT'] = clinical_mutations_cna_df['AGE_AT_SEQ_REPORT'].apply(
    lambda x: str(random.randint(90, 95)) if x == '>89' else x
)

# Print confirmation message
print(f"Data cleaned for {len(clinical_mutations_cna_df)} entries.")

Data cleaned for 2290611 entries.
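The age replacement above draws a new random number for every row, so the rows of one patient can end up with different imputed ages. A possible variant, sketched below on invented data, draws once per patient and reuses the value. It assumes one age per patient is acceptable, which matches the later step that reads demographics from the first row of each patient group; `impute_age` and the seed value are illustrative choices, not part of the original notebook.

```python
import random
import pandas as pd

toy = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-1", "P-2", "P-3"],
    "PRIMARY_RACE": ["Not Collected", "Not Collected", "White", "UNKNOWN"],
    "AGE_AT_SEQ_REPORT": ["<18", "<18", "52", ">89"],
})

# Map several spellings to one label in a single call
toy["PRIMARY_RACE"] = toy["PRIMARY_RACE"].replace(
    ["Not Applicable", "Not Collected", "UNKNOWN"], "Unknown"
)

rng = random.Random(0)  # local, seeded generator (seed value is arbitrary)

def impute_age(age):
    """Replace a censored age label with a random integer, kept as text."""
    if age == "<18":
        return str(rng.randint(15, 18))   # same range as the cell above
    if age == ">89":
        return str(rng.randint(90, 95))
    return age

# One draw per patient, then broadcast back to all of that patient's rows
age_by_patient = toy.groupby("PATIENT_ID")["AGE_AT_SEQ_REPORT"].first().map(impute_age)
toy["AGE_AT_SEQ_REPORT"] = toy["PATIENT_ID"].map(age_by_patient)
print(toy)
```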


Now create a dictionary that stores every text key and its possible text values, for use in a text tokenizer. This will create initial text embeddings that will then be trained with the key-value transformer.

`AGE_AT_SEQ_REPORT` is also recorded, but with feature type 'numerical': its values are continuous and need to be handled separately from the categorical features.

In [56]:
features = {}

# Assign 'SEX' feature to the features dictionary with the possible values
features['SEX'] = {
    'feature_type': 'categorical',
    'values': [val for val in clinical_mutations_cna_df['SEX'].unique() if val != 'Unknown']
}

# Assign 'PRIMARY_RACE' feature to the features dictionary with the possible values
features['PRIMARY_RACE'] = {
    'feature_type': 'categorical',
    'values': [val for val in clinical_mutations_cna_df['PRIMARY_RACE'].unique() if val != 'Unknown']
}

# Assign 'ETHNICITY' feature to the features dictionary with the possible values
features['ETHNICITY'] = {
    'feature_type': 'categorical',
    'values': [val for val in clinical_mutations_cna_df['ETHNICITY'].unique() if val != 'Unknown']
}

# Assign 'AGE_AT_SEQ_REPORT' feature to the features dictionary with the possible values
features['AGE_AT_SEQ_REPORT'] = {
    'feature_type': 'numerical',
    'values': [val for val in clinical_mutations_cna_df['AGE_AT_SEQ_REPORT'].unique() if val != 'Unknown']
}

# Assign 'CANCER_TYPE' feature to the features dictionary with the possible values
features['CANCER_TYPE'] = {
    'feature_type': 'categorical',
    'values': [val for val in clinical_mutations_cna_df['CANCER_TYPE'].unique() if val != 'Unknown']
}

# Assign genes (with their 'Hugo_Symbol') to the features dictionary with the possible values
variant_classification = [
    val for val in clinical_mutations_cna_df['Variant_Classification'].unique()
    if val != 'Unknown' and not pd.isna(val)
]
cna_values = [
    val for val in clinical_mutations_cna_df['cna_value'].unique()
    if not pd.isna(val) and val != '0'
]
for gene in clinical_mutations_cna_df['Hugo_Symbol'].dropna().unique():
    features[gene] = {
        'feature_type': 'categorical',
        'values': variant_classification
    }   
    features[f"{gene}_CNA"] = {
        'feature_type': 'categorical',
        'values': cna_values
    }   

# save the features dictionary to a json file
import json
with open(os.path.join(data_path, 'feature_schema.json'), 'w') as f:
    json.dump(features, f)

# Print confirmation message
print(f"Constructed key-value features dictionary with {len(features)} features.")
print('\nFirst few features:')
print(list(features.keys())[:10])
print('\nSome feature values:')
print(f"'SEX': {features['SEX']['values']}")
print(f"'PRIMARY_RACE': {features['PRIMARY_RACE']['values']}")
print(f"'ETHNICITY': {features['ETHNICITY']['values']}")
print(f"'CANCER_TYPE': {features['CANCER_TYPE']['values']}")
print(f"'KRAS': {features['KRAS']['values']}")
print(f"'KRAS_CNA': {features['KRAS_CNA']['values']}")

Constructed key-value features dictionary with 3519 features.

First few features:
['SEX', 'PRIMARY_RACE', 'ETHNICITY', 'AGE_AT_SEQ_REPORT', 'CANCER_TYPE', 'ARID1A', 'ARID1A_CNA', 'BLM', 'BLM_CNA', 'BRCA2']

Some feature values:
'SEX': ['Female', 'Male', 'Other']
'PRIMARY_RACE': ['White', 'Black', 'Asian', 'Native American', 'Other', 'Pacific Islander']
'ETHNICITY': ['Non-Spanish/non-Hispanic', 'Spanish/Hispanic']
'CANCER_TYPE': ['Appendiceal Cancer', 'Colorectal Cancer', 'Cancer of Unknown Primary', 'Non-Small Cell Lung Cancer', 'Breast Cancer', 'Soft Tissue Sarcoma', 'Pancreatic Cancer', 'Leukemia', 'Melanoma', 'Salivary Gland Cancer', 'Endometrial Cancer', 'Head and Neck Cancer', 'Skin Cancer, Non-Melanoma', 'Myeloproliferative Neoplasms', 'Adrenocortical Carcinoma', 'Ovarian Cancer', 'Esophagogastric Cancer', 'Myelodysplastic Syndromes', 'Thyroid Cancer', 'Gastrointestinal Stromal Tumor', 'Cervical Cancer', 'Hepatobiliary Cancer', 'Renal Cell Carcinoma', 'Mature T and NK Neoplasms'
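As an illustration of how such a schema could feed a tokenizer, the sketch below turns a small hand-written `features` dictionary into integer vocabularies for keys and for categorical values. The special tokens `[PAD]` and `[UNK]` and the split into two vocabularies are assumptions made for this example; the notebook does not specify the tokenizer design.

```python
# Hypothetical miniature of the features dictionary built above
features = {
    "SEX": {"feature_type": "categorical", "values": ["Female", "Male"]},
    "AGE_AT_SEQ_REPORT": {"feature_type": "numerical", "values": ["52", "61"]},
    "KRAS": {"feature_type": "categorical", "values": ["Missense_Mutation", "Nonsense_Mutation"]},
    "KRAS_CNA": {"feature_type": "categorical", "values": ["2", "-1"]},
}

SPECIAL_TOKENS = ["[PAD]", "[UNK]"]  # assumed special tokens, IDs 0 and 1

# Every feature name becomes a key token
key_vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + sorted(features))}

# Only categorical values become value tokens; numerical ones are handled separately
value_tokens = sorted({
    value
    for spec in features.values()
    if spec["feature_type"] == "categorical"
    for value in spec["values"]
})
value_vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + value_tokens)}

print(len(key_vocab), len(value_vocab))  # 6 8
```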

Finally, generate a dictionary where each patient ID is mapped to a sequence of key-value pairs representing their demographic features, mutations, and copy number alterations (CNA).

- Important: Include only mutations and CNA values that are neither 0 nor NaN. This leads to variable-length sequences in the transformer, but a PAD token can be added to enable batch processing.
- Also skip features whose value is 'Unknown'.

Result:  
patient_sequences[patient_id] is a list of (key, value) pairs, e.g.:  
[  
  ("SEX", "Male"),  
  ("PRIMARY_RACE", "White"),  
  ...  
  ("EGFR", "Missense_Mutation"),  # from the mutations  
  ("EGFR_CNA", "2"),              # from the CNA data  
  ...  
]  
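The PAD token mentioned above can be sketched as follows: shorter sequences are extended with a padding pair up to the longest sequence in the batch, and a mask records which positions hold real data. The `("[PAD]", "[PAD]")` pair and the `pad_batch` helper are hypothetical names chosen for this example.

```python
PAD_PAIR = ("[PAD]", "[PAD]")  # assumed padding key-value pair

def pad_batch(sequences, pad=PAD_PAIR):
    """Pad a list of key-value sequences to equal length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [
    [("SEX", "Male"), ("EGFR", "Missense_Mutation"), ("EGFR_CNA", "2")],
    [("SEX", "Female"), ("KRAS", "Missense_Mutation")],
]
padded, mask = pad_batch(batch)
print(padded[1])  # [('SEX', 'Female'), ('KRAS', 'Missense_Mutation'), ('[PAD]', '[PAD]')]
print(mask)       # [[1, 1, 1], [1, 1, 0]]
```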


In [59]:
# Columns to keep from clinical data as demographic features
demographic_cols = [
    "SEX",
    "PRIMARY_RACE",
    "ETHNICITY",
    "AGE_AT_SEQ_REPORT",
    "CANCER_TYPE"
    # add or remove as needed
]

# Dictionary: patient_id -> list of (key, value) pairs
patient_sequences = {}

# Group by patient
for patient_id, group in clinical_mutations_cna_df.groupby("PATIENT_ID"):

    # 1) DEMOGRAPHIC FEATURES
    # Take the first row in this group for demographic columns
    first_row = group.iloc[0]
    demo_pairs = []
    for col in demographic_cols:
        val = first_row.get(col, "Unknown")
        # Skip if 'Unknown'
        if val == "Unknown":
            continue
        # Convert AGE_AT_SEQ_REPORT to float, keep others as strings
        if col == "AGE_AT_SEQ_REPORT":
            try:
                val = float(val)  # Convert to float if possible
            except ValueError:
                continue  # Skip if conversion fails (e.g., invalid format)
        else:
            val = str(val)  # Convert to string for other categories
        demo_pairs.append((col, val))
    
    # 2) MUTATIONS
    # Build a dict of gene -> variant classification
    gene_to_variant = {}
    for _, row in group.iterrows():
        gene = row["Hugo_Symbol"]
        variant_class = row["Variant_Classification"]

        # Skip rows with missing gene name
        if pd.isna(gene):
            continue
        
        # If a gene has multiple mutations, keep only the first one encountered.
        # Alternative: concatenate them, e.g. gene_to_variant[gene] += ";" + variant_class
        if gene not in gene_to_variant:
            gene_to_variant[gene] = variant_class

    # Create gene-level (key, value) pairs ONLY for mutated genes
    # i.e., skipping those without a 'Hugo_Symbol' or 'Variant_Classification'
    mutation_pairs = [
        (gene, gene_to_variant[gene])
        for gene in gene_to_variant
        if not pd.isna(gene_to_variant[gene])
    ]
    
    # 3) CNA
    # We'll collect gene -> cna_value from the same group
    # If a row is NaN or 0, we skip it (depending on your preference)
    cna_pairs = []
    for _, row in group.iterrows():
        gene = row["Hugo_Symbol"]
        cna_val = row["cna_value"]  # -2, -1, 0, 1, 2, or NaN, etc.
        
        # Skip missing genes or missing/neutral calls if only storing altered
        if pd.isna(gene) or pd.isna(cna_val) or cna_val == "0" or cna_val == 0:
            continue
        
        # Example: store (gene+"_CNA", cna_val) to differentiate from mutation pair
        cna_pairs.append((gene + "_CNA", str(cna_val)))
    
    # 4) Combine all pairs
    kv_sequence = demo_pairs + mutation_pairs + cna_pairs
    # Skip patients with three or fewer pairs, treated here as demographics only
    # (e.g. 'SEX', 'AGE_AT_SEQ_REPORT', 'CANCER_TYPE') with no mutation or CNA data
    if len(kv_sequence) <= 3:
        continue
    patient_sequences[patient_id] = kv_sequence

# Save the patient sequences to a json file
with open(os.path.join(data_path, 'patient_sequences.json'), 'w') as f:
    json.dump(patient_sequences, f)

# Print confirmation message
print(f"Constructed feature name-value sequences for {len(patient_sequences)} patients.")
print("Saved the patient sequences to 'patient_sequences.json'.")

Constructed feature name-value sequences for 193637 patients.
Saved the patient sequences to 'patient_sequences.json'.
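When the saved file is read back, note that JSON has no tuple type: each `(key, value)` pair comes back as a two-element list. A short sketch of the round trip on an invented patient, converting the pairs back to tuples:

```python
import json

# Hypothetical single-patient example in the same shape as patient_sequences
patient_sequences = {
    "P-1": [("SEX", "Female"), ("AGE_AT_SEQ_REPORT", 52.0), ("KRAS", "Missense_Mutation")],
}

text = json.dumps(patient_sequences)   # tuples are written as JSON arrays
loaded = json.loads(text)              # arrays come back as Python lists

# Restore the tuple form if downstream code expects (key, value) tuples
restored = {
    patient_id: [tuple(pair) for pair in sequence]
    for patient_id, sequence in loaded.items()
}
print(loaded["P-1"][0])    # ['SEX', 'Female']
print(restored["P-1"][0])  # ('SEX', 'Female')
```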


Now describe the data. For simplicity, this is done once at the end, after merging all DataFrames, instead of individually for each dataset.

In [38]:
clinical_mutations_cna_df.info()
clinical_mutations_cna_df.describe(include='object')
clinical_mutations_cna_df['cna_value'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2290611 entries, 0 to 2290610
Data columns (total 10 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   PATIENT_ID              object
 1   SEX                     object
 2   PRIMARY_RACE            object
 3   ETHNICITY               object
 4   SAMPLE_ID               object
 5   AGE_AT_SEQ_REPORT       object
 6   CANCER_TYPE             object
 7   Hugo_Symbol             object
 8   Variant_Classification  object
 9   cna_value               object
dtypes: object(10)
memory usage: 174.8+ MB


cna_value
0       1156642
-1        43076
1         37054
2          7995
-2         2198
-1.5        174
Name: count, dtype: int64
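The counts above add up to 1,247,139 of the 2,290,611 rows, because `value_counts` leaves out missing values by default; the remaining rows have no CNA call. The sketch below, on invented values, shows how `dropna=False` makes the missing rows visible and how `pd.to_numeric` converts the text values for numeric summaries.

```python
import pandas as pd

toy = pd.DataFrame({"cna_value": ["0", "-1", "2", None, "-1.5", "0"]})

# Include missing values in the tally
counts_with_missing = toy["cna_value"].value_counts(dropna=False)
n_missing = int(toy["cna_value"].isna().sum())

# Convert the text column to floats; missing values become NaN
cna_numeric = pd.to_numeric(toy["cna_value"])

print(counts_with_missing)
print(n_missing, cna_numeric.min(), cna_numeric.max())  # 1 -1.5 2.0
```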