# Background

- **inputs:** 
    - TSV of driver mutations in U2AF1 mutant vs. WT samples, generated from this cBioPortal query: https://bit.ly/4b9W6gP
        - Includes all lung ADC studies except studies with overlapping samples and TSP Nature, 2008. Studies labeled only "lung cancer" or "non SC lung cancer" are also excluded
    - The list of 20 driver mutations in nonsmoking lung ADC patients is from Fig 1 of [Tessema et al. 2018](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6331003/) and consists of: 
        - KRAS, EGFR, BRAF, CDKN2A, ERBB2, KEAP1, MARK1, MET, MYC, NF1, NRAS, PIK3CA, PTEN, RET, RIT1, SETD2, SMARCA4, STK11, TP53, ALK
        - MARK1 was not profiled in these studies, so it is not included in the final "Other drivers" count
- **goals**:
    - Calculate the fractions of EGFR, KRAS, and "other drivers" and plot them as a stacked bar plot in Prism. Annotate with p values for U2AF1/KRAS mutation co-occurrence from the cBioPortal query (q = 0.001)

## import modules and data

In [1]:
#import modules and data
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
from statsmodels.stats.multitest import multipletests
from scipy.stats import chi2_contingency
from scipy.stats import chisquare

pd.set_option('display.max_colwidth', None)
df = pd.read_csv('Downloads/alterations_across_samples-2.tsv', sep='\t')
df

#each row is a lung ADC sample's mutational status
#there may be multiple samples per patient

Unnamed: 0,Study ID,Sample ID,Patient ID,Altered,U2AF1,KRAS,EGFR,BRAF,CDKN2A,ERBB2,...,STK11: HOMDEL,STK11: FUSION,TP53: MUT,TP53: AMP,TP53: HOMDEL,TP53: FUSION,ALK: MUT,ALK: AMP,ALK: HOMDEL,ALK: FUSION
0,lung_msk_mind_2020,P-0001987-T01-IM3,P-0001987,1,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,M237I (driver),no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
1,lung_msk_mind_2020,P-0002921-T01-IM3,P-0002921,1,no alteration,G12A (driver),no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
2,lung_msk_mind_2020,P-0003247-T01-IM5,P-0003247,1,no alteration,G12D (driver),no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,I255F (driver),no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
3,lung_msk_mind_2020,P-0003250-T01-IM5,P-0003250,1,no alteration,no alteration,L858R (driver),no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
4,lung_msk_mind_2020,P-0003275-T01-IM5,P-0003275,1,no alteration,G12C (driver),no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4794,luad_mskcc_2023_met_organotropism,P-0030990-T03-WES,P-0030990,0,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration
4795,luad_mskcc_2023_met_organotropism,P-0031577-T01-WES,P-0031577,0,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration
4796,luad_mskcc_2023_met_organotropism,P-0039257-T01-WES,P-0039257,0,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration
4797,luad_mskcc_2023_met_organotropism,P-0039594-T01-WES,P-0039594,0,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration,no alteration,not profiled,not profiled,no alteration


In [2]:
#Only grab relevant columns for analysis

df = df.loc[:, df.columns.isin(['Sample ID', 'Patient ID', 'U2AF1', 'KRAS', 'EGFR', 
                               'BRAF', 'CDKN2A', 'ERBB2', 'KEAP1', 'MET', 'MYC', 'NF1', 
                                'NRAS', 'PIK3CA', 'PTEN', 'RET', 'RIT1', 'SETD2', 'SMARCA4', 'STK11', 
                                'TP53', 'ALK'])]
df

Unnamed: 0,Sample ID,Patient ID,U2AF1,KRAS,EGFR,BRAF,CDKN2A,ERBB2,KEAP1,MET,...,NRAS,PIK3CA,PTEN,RET,RIT1,SETD2,SMARCA4,STK11,TP53,ALK
0,P-0001987-T01-IM3,P-0001987,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,M237I (driver),no alteration
1,P-0002921-T01-IM3,P-0002921,no alteration,G12A (driver),no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
2,P-0003247-T01-IM5,P-0003247,no alteration,G12D (driver),no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,I255F (driver),no alteration
3,P-0003250-T01-IM5,P-0003250,no alteration,no alteration,L858R (driver),no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,W2016C,no alteration,no alteration,no alteration,no alteration
4,P-0003275-T01-IM5,P-0003275,no alteration,G12C (driver),no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4794,P-0030990-T03-WES,P-0030990,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
4795,P-0031577-T01-WES,P-0031577,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
4796,P-0039257-T01-WES,P-0039257,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
4797,P-0039594-T01-WES,P-0039594,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,...,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration,no alteration
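
Because each row is a sample and a patient can contribute several samples, sample-level counts can give extra weight to patients with repeat biopsies. The following is a minimal sketch, on made-up rows, of collapsing to one row per patient by flagging a gene as altered if any of that patient's samples is altered. The helper name `collapse_to_patients` and the toy data are hypothetical; the notebook itself counts samples, not patients.

```python
import pandas as pd

# Toy rows in the same shape as df above (values are made up for illustration)
toy = pd.DataFrame({
    'Sample ID': ['S1-T01', 'S1-T02', 'S2-T01'],
    'Patient ID': ['S1', 'S1', 'S2'],
    'U2AF1': ['no alteration', 'S34F (driver)', 'no alteration'],
    'KRAS': ['G12C (driver)', 'G12C (driver)', 'no alteration'],
})

def collapse_to_patients(samples, genes):
    """Hypothetical helper: one row per patient, True if any sample is altered."""
    # Same comparison the notebook uses: anything other than 'no alteration' is altered
    altered = samples[genes].ne('no alteration')
    return altered.groupby(samples['Patient ID']).any()

patients = collapse_to_patients(toy, ['U2AF1', 'KRAS'])
print(patients)
```

Whether to analyze at the sample or patient level is a design choice; the counts reported in this notebook are per sample.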


## split dataframe into U2AF1 WT samples and U2AF1 mutant samples

In [4]:
# split dataframe into U2AF1 WT/U2AF1 mut
u2af1wtdf = df.loc[(df['U2AF1'] == 'no alteration')]
u2af1mutdf = df.loc[(df['U2AF1'] != 'no alteration')]

#get N of samples 
#"mutant" here means any U2AF1 value other than 'no alteration', not only S34F
print('number of U2AF1 WT samples:', len(u2af1wtdf))
print('number of U2AF1 mutant samples:', len(u2af1mutdf))

number of U2AF1 WT samples: 4671
number of U2AF1 mutant samples: 128
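
The split above treats every value other than `no alteration` as mutant. That means it is not limited to one specific amino acid change, and a `not profiled` entry (a label that appears in the per-alteration columns of the export) would be counted as mutant if it occurred in the `U2AF1` column. Below is a hedged sketch of a stricter three-way split on toy data. The function name `split_by_status` and the toy values are hypothetical, and whether `not profiled` actually occurs in the gene-level `U2AF1` column has not been checked here.

```python
import pandas as pd

# Assumed set of labels that should not count as a mutation
NON_ALTERED = {'no alteration', 'not profiled'}

def split_by_status(samples, gene):
    """Hypothetical helper: return (wild type, mutant, not profiled) subsets."""
    values = samples[gene]
    wt = samples[values == 'no alteration']
    unprofiled = samples[values == 'not profiled']
    mutant = samples[~values.isin(NON_ALTERED)]
    return wt, mutant, unprofiled

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sample ID': ['A', 'B', 'C', 'D'],
    'U2AF1': ['no alteration', 'S34F (driver)', 'not profiled', 'Q157R'],
})

wt, mutant, unprofiled = split_by_status(toy, 'U2AF1')
print(len(wt), len(mutant), len(unprofiled))

# Listing the distinct mutant labels shows which alterations the group contains
print(mutant['U2AF1'].value_counts())
```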


## calculate KRAS, EGFR, all "other drivers" fractions for plotting on prism

In [20]:
#Make dataframes of counts for KRAS, EGFR, and all other driver mutations and 
#calculate fraction of that mutation

u2af1wtdict  = {}
u2af1mutdict = {}

dflist = [[u2af1wtdf, u2af1wtdict],[u2af1mutdf, u2af1mutdict]]

#count number of altered samples per category for each genotype
#Categories overlap because a sample can carry more than one driver alteration

other_genes = ['BRAF', 'CDKN2A', 'ERBB2', 'KEAP1', 'MET', 'MYC', 'NF1', 'NRAS', 'PIK3CA',
               'PTEN', 'RET', 'RIT1', 'SETD2', 'SMARCA4', 'STK11', 'TP53', 'ALK']

for subdf, counts in dflist:
    #Other drivers = number of samples with an alteration in any gene in other_genes
    #Can include samples that are also mutant for KRAS, EGFR
    #Note: any value other than 'no alteration' counts as altered
    other = int((subdf[other_genes] != 'no alteration').any(axis=1).sum())

    kras = int((subdf['KRAS'] != 'no alteration').sum())
    egfr = int((subdf['EGFR'] != 'no alteration').sum())

    counts['KRAS'] = kras
    counts['EGFR'] = egfr
    counts['Other driver'] = other

u2af1mutdict


# df = df.loc[:, df.columns.isin(['Sample ID', 'Patient ID', 'PTEN', 'RET', 'RIT1', 'SETD2', 'SMARCA4', 'STK11', 
#                                 'TP53', 'ALK'])]

{'KRAS': 63, 'EGFR': 35, 'Other driver': 106}

In [21]:
u2af1wtdict

{'KRAS': 1388, 'EGFR': 1560, 'Other driver': 3861}
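
The goal stated in the Background is to annotate the plot with the co-occurrence statistic reported by cBioPortal. As a rough local cross-check only, the counts printed above can be arranged into a 2x2 table (U2AF1 status by KRAS status) and passed to `chi2_contingency`, which is already imported. This is a sketch under the assumption that every non-`no alteration` value is a true alteration; it is not the test cBioPortal runs, so its p value should not be expected to match the q value quoted in the goals.

```python
from scipy.stats import chi2_contingency

# Sample counts and KRAS-altered counts printed earlier in this notebook
n_mut, n_wt = 128, 4671
kras_mut, kras_wt = 63, 1388

# Rows: U2AF1 mutant, U2AF1 WT. Columns: KRAS altered, KRAS not altered
table = [
    [kras_mut, n_mut - kras_mut],
    [kras_wt, n_wt - kras_wt],
]

# For a 2x2 table, scipy applies Yates' continuity correction by default
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
```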

## Check if numbers make sense

In [22]:
#check to see if numbers make sense

wtdriversum = sum(u2af1wtdict.values())
wtsamplesum = len(u2af1wtdf)
wtsame = sum(u2af1wtdict.values()) == len(u2af1wtdf)
wtsubtr = (sum(u2af1wtdict.values()) - len(u2af1wtdf))

mutsubtr = (sum(u2af1mutdict.values()) - len(u2af1mutdf))
mutdriversum = sum(u2af1mutdict.values())
mutsamplesum = len(u2af1mutdf)
mutsame = sum(u2af1mutdict.values()) == len(u2af1mutdf)


f"Driver mutations sum equal to sum of U2AF1 WT samples? ...{wtsame}. \
Driver mutation sum is {wtdriversum}. Sample sum is {wtsamplesum}. \
Driver mutation sum - sample sum = {wtsubtr}"  

'Driver mutations sum equal to sum of U2AF1 WT samples? ...False. Driver mutation sum is 6809. Sample sum is 4671. Driver mutation sum - sample sum = 2138'

In [23]:
f"Driver mutations sum equal to sum of U2AF1 mutant samples? ...{mutsame}. \
Driver mutation sum is {mutdriversum}. Sample sum is {mutsamplesum}. \
Driver mutation sum - sample sum = {mutsubtr}" 

'Driver mutations sum equal to sum of U2AF1 mutant samples? ...False. Driver mutation sum is 204. Sample sum is 128. Driver mutation sum - sample sum = 76'

### Conclusion: The category counts sum to more than the number of samples for both U2AF1 WT and U2AF1 mutant samples. That makes sense, since a single sample can carry alterations in more than one category

In [24]:
#Combine counts for U2AF1 WT and U2AF1 mutant samples into one dataframe

counts = pd.DataFrame(
    [u2af1wtdict, u2af1mutdict],
    index=[f'U2AF1 WT (n={len(u2af1wtdf)})', f'U2AF1 Mutant (n={len(u2af1mutdf)})'],
).rename(columns={'Other driver': 'Other drivers'})

#Normalize each row by the sum of its three category counts so each stacked bar totals 1
#Note: these are shares of category calls, not the fraction of samples carrying each alteration,
#because a sample can be counted in more than one category
mutsconcat = counts.div(counts.sum(axis=1), axis=0)
mutsconcat

Unnamed: 0,KRAS,EGFR,Other drivers
U2AF1 WT (n=4671),0.203848,0.229109,0.567044
U2AF1 Mutant (n=128),0.308824,0.171569,0.519608
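
The fractions in the table above are normalized to the sum of the three category counts, so each bar totals 1. A different and equally reasonable quantity is the per-sample prevalence, meaning the count divided by the number of samples in that group. The sketch below computes both from the counts printed earlier so the two can be compared; the variable names are illustrative and not part of the original analysis.

```python
import pandas as pd

# Category counts and group sizes printed earlier in this notebook
counts = pd.DataFrame(
    {'KRAS': [1388, 63], 'EGFR': [1560, 35], 'Other driver': [3861, 106]},
    index=['U2AF1 WT', 'U2AF1 Mutant'],
)
n_samples = pd.Series({'U2AF1 WT': 4671, 'U2AF1 Mutant': 128})

# Share of category calls: each row sums to 1 (what the stacked bar shows)
share_of_calls = counts.div(counts.sum(axis=1), axis=0)

# Per-sample prevalence: rows can sum to more than 1 because categories overlap
prevalence = counts.div(n_samples, axis=0)

print(share_of_calls.round(3))
print(prevalence.round(3))
```

Which normalization to plot depends on the question: shares of calls describe the composition of driver calls, while prevalence describes how common each alteration is among samples.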
