# NBL Analysis 1 - Identifying Risk-Associated Mutations

Neuroblastoma (NBL) is a cancer that develops from immature nerve cells of the sympathetic nervous system, primarily affecting infants and children. In this analysis, I use genomic data from NBL samples in the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project to identify mutation and gene expression features that predict patient risk and could inform treatment strategy.

First, I will need to import the TARGET NBL somatic mutation data in MAF format.

In [1]:
import os

cwd = os.getcwd()
DATA = cwd + '/data'

In [3]:
import pandas as pd

# Reading in the maf file
maf_file = "/TARGET_NBL_WXS_somatic_verified.maf.txt"
maf = pd.read_csv(DATA + maf_file, comment='#', sep="\t")
maf.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_position,End_position,Strand,Variant_Classification,Variant_Type,...,CGC_Tumor_Types_Germline,CGC_Other_Diseases,DNARepairGenes_Role,FamilialCancerDatabase_Syndromes,MUTSIG_Published_Results,OREGANNO_ID,OREGANNO_Values,PolyPhen,SIFT,RNAseq
0,ABL2,27,broadinstitute.org,37,1,179077416,179077416,+,Missense_Mutation,SNP,...,,,,,,,,possibly_damaging(0.609),tolerated(0.24),RNAseq_support
1,ALK,238,broadinstitute.org,37,2,29445213,29445213,+,Missense_Mutation,SNP,...,neuroblastoma,,,Neuroblastoma_Familial_Clustering_of|Congenita...,,,,probably_damaging(0.99),deleterious(0),
2,ALK,238,broadinstitute.org,37,2,29443697,29443697,+,Missense_Mutation,SNP,...,neuroblastoma,,,Neuroblastoma_Familial_Clustering_of|Congenita...,,,,possibly_damaging(0.748),deleterious(0),
3,ALK,238,broadinstitute.org,37,2,29443697,29443697,+,Missense_Mutation,SNP,...,neuroblastoma,,,Neuroblastoma_Familial_Clustering_of|Congenita...,,,,benign(0.082),deleterious(0.02),
4,ALK,238,broadinstitute.org,37,2,29443696,29443696,+,Missense_Mutation,SNP,...,neuroblastoma,,,Neuroblastoma_Familial_Clustering_of|Congenita...,,,,probably_damaging(0.993),deleterious(0),


To identify which mutations are associated with risk in NBL, we can first find the most frequently mutated gene by counting the mutation records for each gene. This count equals the number of affected patients as long as each patient has at most one recorded mutation per gene.

In [4]:
# Use groupby to count the mutation records (one row per patient and mutation) for each gene
most_mutated_gene = maf.groupby("Hugo_Symbol").Case_USI.size()
most_mutated_gene.sort_values(ascending=False)

Hugo_Symbol
ALK         19
PTPN11       6
ATRX         6
SPTA1        6
PMFBP1       4
            ..
PDCD1LG2     1
PDGFA        1
PDGFRA       1
PIK3C2G      1
ABL2         1
Name: Case_USI, Length: 114, dtype: int64

ALK is the most frequently mutated gene in the dataset, with mutations in 19 patients. Next, we can investigate which individual mutation is most prevalent.
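The difference between counting mutation records and counting distinct patients can be sketched on a toy table. The patient IDs and rows below are made up for illustration; only the column names `Hugo_Symbol` and `Case_USI` come from the MAF file.

```python
import pandas as pd

# Toy stand-in for the MAF table (hypothetical patients P1-P3)
toy_maf = pd.DataFrame({
    "Hugo_Symbol": ["ALK", "ALK", "ALK", "ATRX"],
    "Case_USI": ["P1", "P1", "P2", "P3"],
})

# size() counts rows, so a patient with two ALK mutations is counted twice
records_per_gene = toy_maf.groupby("Hugo_Symbol")["Case_USI"].size()

# nunique() counts distinct patient IDs per gene
patients_per_gene = toy_maf.groupby("Hugo_Symbol")["Case_USI"].nunique()

print(records_per_gene.to_dict())
print(patients_per_gene.to_dict())
```

If the two counts agree for a gene in the real data, every patient carries at most one recorded mutation in that gene.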

In [5]:
# Use groupby to count the occurrences of each unique mutation across patients
most_prevalent_mutation = maf.groupby("Genome_Change").Case_USI.size()
most_prevalent_mutation.sort_values(ascending=False)

Genome_Change
g.chr2:29432664C>T       7
g.chr2:29443695G>T       5
g.chr2:16082317C>T       3
g.chr12:112888198G>A     2
g.chr2:29443696A>C       2
                        ..
g.chr3:130289800G>T      1
g.chr3:12633211G>C       1
g.chr3:118623540C>G      1
g.chr3:10084803C>A       1
g.chr10:104162112delC    1
Name: Case_USI, Length: 163, dtype: int64

The most prevalent mutation is a C>T substitution at position 29432664 on chromosome 2, found in 7 patients.
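The `Genome_Change` strings can be split into chromosome, position, and alleles when those fields are needed separately. The following is a minimal sketch that assumes the `g.chr<chrom>:<pos><ref>><alt>` layout seen in the output above; the helper name `parse_snv` is hypothetical, and it deliberately handles only single-base substitutions.

```python
import re

# Matches simple substitutions such as "g.chr2:29432664C>T" (assumed format)
SNV_PATTERN = re.compile(
    r"^g\.chr(?P<chrom>[^:]+):(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def parse_snv(genome_change):
    """Return the parts of a substitution string, or None if it is not one."""
    match = SNV_PATTERN.match(genome_change)
    if match is None:
        return None
    return {
        "chrom": match["chrom"],
        "pos": int(match["pos"]),
        "ref": match["ref"],
        "alt": match["alt"],
    }

top_mutation = parse_snv("g.chr2:29432664C>T")
deletion = parse_snv("g.chr10:104162112delC")  # not a substitution
print(top_mutation, deletion)
```

Insertions and deletions return `None` here and would need their own patterns.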

Next, we can check whether the mutations found in pediatric NBL are also observed in other cancer datasets. For this we will use the [Cancer Hotspots](https://www.cancerhotspots.org/#/home) database, which collects frequently observed mutations from public cancer datasets. We will count how many mutations in the TARGET dataset also appear in the Cancer Hotspots dataset.

In [6]:
hotspot_file = "/cancerhotspots.v2.light.maf.gz"
hotspot = pd.read_csv(DATA + hotspot_file, comment='#', sep='\t')
hotspot.head()

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,HGVSc,HGVSp,HGVSp_Short,TUMORTYPE,PLATFORM,judgement
0,WARS2,10352,.,GRCh37,1,119575617,119575617,+,Missense_Mutation,SNP,C,C,T,c.1000G>A,p.Val334Ile,p.V334I,acyc,exome,RETAIN
1,OPN3,23596,.,GRCh37,1,241761094,241761094,+,Missense_Mutation,SNP,G,G,A,c.899C>T,p.Ser300Leu,p.S300L,acyc,exome,RETAIN
2,NAA16,79612,.,GRCh37,13,41894863,41894863,+,Missense_Mutation,SNP,G,G,A,c.305G>A,p.Cys102Tyr,p.C102Y,acyc,exome,RETAIN
3,DHODH,1723,.,GRCh37,16,72056354,72056354,+,Missense_Mutation,SNP,A,A,C,c.799A>C,p.Ile267Leu,p.I267L,acyc,exome,RETAIN
4,KRTAP21-1,337977,.,GRCh37,21,32127473,32127473,+,Missense_Mutation,SNP,C,C,T,c.224G>A,p.Cys75Tyr,p.C75Y,acyc,exome,RETAIN


To do this, I build two dataframes holding the chromosome, position, and allele of each mutation, one for the TARGET dataset and one for the hotspot dataset, then perform an inner join and drop duplicates.

In [7]:
# Create two dataframes that contain chromosome, position, and allele info and join the sets
target_mut = pd.DataFrame().assign(Chromosome=maf['Chromosome'], Start_Position=maf['Start_position'], End_Position=maf['End_position'], Tumor_Seq_Allele=maf['Tumor_Seq_Allele1'])
hotspot_mut = pd.DataFrame().assign(Chromosome=hotspot['Chromosome'], Start_Position=hotspot['Start_Position'], End_Position=hotspot['End_Position'], Tumor_Seq_Allele=hotspot['Tumor_Seq_Allele2'])

# Do inner join and drop duplicates to get unique mutations
target_mut.merge(hotspot_mut, how='inner', indicator=False).drop_duplicates()

Unnamed: 0,Chromosome,Start_Position,End_Position,Tumor_Seq_Allele
0,1,179077416,179077416,C
1,2,29445213,29445213,T
2,2,29443697,29443697,G
4,2,29443696,29443696,C
8,2,29443695,29443695,T
...,...,...,...,...
422,17,7577097,7577097,T
436,19,12384448,12384448,-
447,19,37038544,37038544,A
448,19,37037970,37037970,C


The resulting dataframe has 130 rows, meaning that 130 unique mutations in the TARGET NBL dataset are also found in the Cancer Hotspots dataset.
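An inner join on several key columns only matches rows whose keys have the same value and the same dtype on both sides. The sketch below uses made-up toy rows to show one defensive pattern: naming the keys explicitly, casting `Chromosome` to `str`, and dropping duplicates before merging. Whether the real files need the cast is an assumption I have not checked here.

```python
import pandas as pd

key_cols = ["Chromosome", "Start_Position", "End_Position", "Tumor_Seq_Allele"]

# Toy data: chromosome stored as int on one side and str on the other
toy_target = pd.DataFrame({
    "Chromosome": [2, 2, 2, 17],
    "Start_Position": [100, 100, 200, 300],
    "End_Position": [100, 100, 200, 300],
    "Tumor_Seq_Allele": ["T", "T", "G", "A"],
})
toy_hotspot = pd.DataFrame({
    "Chromosome": ["2", "2", "X"],
    "Start_Position": [100, 200, 50],
    "End_Position": [100, 200, 50],
    "Tumor_Seq_Allele": ["T", "C", "A"],
})

# Align the key dtype so that 2 and "2" compare equal
toy_target["Chromosome"] = toy_target["Chromosome"].astype(str)
toy_hotspot["Chromosome"] = toy_hotspot["Chromosome"].astype(str)

# De-duplicate each side first so the join cannot multiply rows
shared = toy_target[key_cols].drop_duplicates().merge(
    toy_hotspot[key_cols].drop_duplicates(), on=key_cols, how="inner"
)
print(len(shared))
```

Only the chr2:100 `T` mutation is shared in this toy case; the chr2:200 rows differ in allele and therefore do not match.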

Now, we can explore the mutation information and clinical information at the same time. The clinical information for the TARGET dataset has been prepared and provided for the analysis.

In [8]:
clinical_file = "/nbl_target_clinical.txt"
clinical = pd.read_csv(DATA + clinical_file, comment='#', sep="\t")
clinical.head()

Unnamed: 0,TARGET USI,Gender,Race,Ethnicity,Age at Diagnosis in Days,First Event,Event Free Survival Time in Days,Vital Status,Overall Survival Time in Days,Year of Diagnosis,...,Sites of Disease Involvement,Site of Relapse,Relapse Percent Tumor,Relapse Percent Necrosis,Relapse Percent Tumor v/s Stroma,Comment,5yr_efs,in_rnaseq,in_maf,train_test
0,TARGET-30-PAHYWC,Male,White,Unknown,704,Event,324.0,Dead,437.0,1993,...,,,,,,,high risk,False,True,train
1,TARGET-30-PAICGF,Female,White,Unknown,1278,Event,772.0,Dead,1314.0,1994,...,,,,,,,high risk,False,True,train
2,TARGET-30-PAIFXV,Female,White,Unknown,2004,Event,630.0,Dead,1114.0,1995,...,,,,,,,high risk,True,False,train
3,TARGET-30-PAILNU,Male,Unknown,Unknown,1683,Event,647.0,Dead,761.0,1997,...,,,,,,,high risk,False,True,test
4,TARGET-30-PAIMDT,Female,White,Unknown,1408,Death,878.0,Dead,878.0,1997,...,,,,,,,high risk,False,True,train


The `TARGET USI` column holds the patient ID. The `5yr_efs` column indicates whether the patient belongs to the 'high risk' group (progression, relapse, or death within 5 years of genomic profiling) or the 'low risk' group (no event for more than 5 years). The `train_test` column indicates whether the patient belongs to the training or test set for the machine learning model that we will build in a later part of the analysis. Here we focus on `5yr_efs` and investigate whether mutation of the most frequently mutated gene is associated with high risk.

In [9]:
clinical.shape

(202, 36)

In [15]:
maf['Case_USI'].unique().shape

(103,)

The clinical dataset has 202 rows, while the TARGET MAF file contains 103 unique patients. Therefore, we have to subset the clinical dataset to include only those 103 patients.

In [16]:
# Create a list of the unique Case_USI values in the MAF file
unique_usi = maf['Case_USI'].unique().tolist()

# Subset the clinical table to contain only the 103 unique patients from TARGET
clinical_103 = clinical[clinical['TARGET USI'].isin(unique_usi)]
# reset_index() keeps the old row labels as a new 'index' column (hence 37 columns)
clinical_103 = clinical_103.reset_index()
clinical_103.shape

(103, 37)
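Before relying on the subset, it can help to confirm that every patient in the MAF file actually has a clinical record. The following is a small sketch on toy IDs (all hypothetical); the same set difference could be applied to `maf['Case_USI']` and `clinical['TARGET USI']`.

```python
import pandas as pd

# Toy stand-ins for the clinical table and the MAF file
toy_clinical = pd.DataFrame({"TARGET USI": ["P1", "P2", "P3", "P4"]})
toy_maf = pd.DataFrame({"Case_USI": ["P1", "P1", "P3", "P9"]})

maf_ids = set(toy_maf["Case_USI"])
clinical_ids = set(toy_clinical["TARGET USI"])

# Patients with mutations but no clinical record would silently drop out of the subset
missing_from_clinical = sorted(maf_ids - clinical_ids)

subset = toy_clinical[toy_clinical["TARGET USI"].isin(maf_ids)]
print(missing_from_clinical, len(subset))
```

In the notebook the subset has 103 rows, matching the 103 unique patients, so no patient is lost there.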

From the previous steps, we know that the most frequently mutated gene is ALK. We will add a boolean column, `TMG`, to the clinical dataframe to record which patients have a mutation in ALK.

In [17]:
# Build a dataframe of patient IDs and gene symbols, then keep only the ALK rows
target_alk = pd.DataFrame().assign(patient_id=maf['Case_USI'], Hugo_Symbol=maf['Hugo_Symbol'])
target_alk = target_alk[target_alk['Hugo_Symbol'] == 'ALK']

# Convert it to a list
usi_list = target_alk['patient_id'].tolist()
usi_list

['TARGET-30-PAKXDZ',
 'TARGET-30-PANRVJ',
 'TARGET-30-PARKNP',
 'TARGET-30-PARLTG',
 'TARGET-30-PARZIP',
 'TARGET-30-PAPRPR',
 'TARGET-30-PASVRU',
 'TARGET-30-PATFXV',
 'TARGET-30-PALAKE',
 'TARGET-30-PANZVU',
 'TARGET-30-PATFCY',
 'TARGET-30-PAINLN',
 'TARGET-30-PANWRR',
 'TARGET-30-PANXJL',
 'TARGET-30-PAPTFZ',
 'TARGET-30-PAPTLV',
 'TARGET-30-PARURB',
 'TARGET-30-PALNLU',
 'TARGET-30-PANBCI']

In [19]:
# Flag patients with an ALK mutation; isin() sets the whole column in one vectorized assignment
clinical_103['TMG'] = clinical_103['TARGET USI'].isin(usi_list)

clinical_103.head(103)


Unnamed: 0,index,TARGET USI,Gender,Race,Ethnicity,Age at Diagnosis in Days,First Event,Event Free Survival Time in Days,Vital Status,Overall Survival Time in Days,...,Site of Relapse,Relapse Percent Tumor,Relapse Percent Necrosis,Relapse Percent Tumor v/s Stroma,Comment,5yr_efs,in_rnaseq,in_maf,train_test,TMG
0,0,TARGET-30-PAHYWC,Male,White,Unknown,704,Event,324.0,Dead,437.0,...,,,,,,high risk,False,True,train,False
1,1,TARGET-30-PAICGF,Female,White,Unknown,1278,Event,772.0,Dead,1314.0,...,,,,,,high risk,False,True,train,False
2,3,TARGET-30-PAILNU,Male,Unknown,Unknown,1683,Event,647.0,Dead,761.0,...,,,,,,high risk,False,True,test,False
3,4,TARGET-30-PAIMDT,Female,White,Unknown,1408,Death,878.0,Dead,878.0,...,,,,,,high risk,False,True,train,False
4,5,TARGET-30-PAINLN,Male,White,Unknown,1404,Death,599.0,Dead,599.0,...,,,,,,high risk,False,True,train,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,190,TARGET-30-PATFCY,Male,White,Not Hispanic or Latino,1040,Death,787.0,Dead,787.0,...,,,,,,high risk,True,True,train,True
99,191,TARGET-30-PATFIN,Female,White,Not Hispanic or Latino,631,Event,1192.0,Dead,1363.0,...,,,,,,high risk,False,True,train,False
100,192,TARGET-30-PATFXV,Female,White,Hispanic or Latino,648,Progression,236.0,Dead,296.0,...,Primary site; Bone; Other metastatic sites,,,,,high risk,True,True,train,True
101,195,TARGET-30-PATGWT,Male,Black or African American,Not Hispanic or Latino,591,Relapse,727.0,Dead,928.0,...,Bone; Lymph Nodes,,,,,high risk,False,True,train,False


Now we can create a count table to easily visualize the number of patients satisfying each condition.

In [20]:
# Create a count table
count = pd.crosstab(clinical_103['TMG'], clinical_103['5yr_efs'])
count

TMG \ 5yr_efs,high risk,low risk
False,61,23
True,18,1


As the count table shows, among patients without an ALK mutation, 61 are high-risk and 23 are low-risk; among patients with an ALK mutation, 18 are high-risk and 1 is low-risk.
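Raw counts are hard to compare because the two groups differ in size, so row-wise fractions are a useful companion view. This sketch rebuilds the table from the counts above; the row labels are my own illustrative names, not the notebook's `TMG` values.

```python
import pandas as pd

# Counts copied from the crosstab above; row labels are illustrative
counts = pd.DataFrame(
    {"high risk": [61, 18], "low risk": [23, 1]},
    index=["ALK not mutated", "ALK mutated"],
)

# Divide each row by its total to get the fraction of patients in each risk group
row_fractions = counts.div(counts.sum(axis=1), axis=0)
print(row_fractions.round(3))
```

About 95% of ALK-mutated patients (18 of 19) are high-risk, compared with about 73% of the others (61 of 84). On the notebook's dataframe, `pd.crosstab(..., normalize='index')` should give the same view directly.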

To test whether this association between ALK mutation status and risk group is statistically significant, we can use Fisher's exact test, which remains valid when some cell counts are small.

In [21]:
# Perform Fisher's exact test
from scipy.stats import fisher_exact

oddsr, p = fisher_exact(count, alternative="two-sided")
print("P-value:", p)

P-value: 0.04010830574733047
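To make explicit what the test computes, here is a minimal pure-Python sketch of a two-sided Fisher's exact test together with the sample odds ratio. It is an illustration under the usual definition (sum the hypergeometric probabilities of all tables that are no more likely than the observed one, with margins fixed), not SciPy's actual implementation; the function name and the small tolerance are my own choices.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table at most as likely as the observed one (tolerance for float ties)
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )

# Rows: ALK not mutated, ALK mutated; columns: high risk, low risk
table = [[61, 23], [18, 1]]
p_manual = fisher_exact_two_sided(table)

# Odds of being high-risk for ALK-mutated patients relative to the others
odds_ratio_mutated = (18 / 1) / (61 / 23)
print(round(p_manual, 4), round(odds_ratio_mutated, 2))
```

The odds ratio of roughly 6.8 gives the direction and size of the effect, which the p-value alone does not convey.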


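Since the p-value is about 0.04, below the conventional 0.05 threshold, the association between ALK mutation status and the 5-year risk group is statistically significant in this cohort. Because the result rests on a single ALK-mutated low-risk patient, it should be interpreted with some caution. Now, we can move on to the second part of the analysis: building ML models and making predictions of NBL risk.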