# Background

- **inputs:**
    - TSV of driver mutations in U2AF1-mutant vs. U2AF1-WT samples, generated from this cBioPortal query: https://bit.ly/3Ui2FrQ
        - The query covers all lung ADC studies, excluding studies with overlapping samples and TSP (Nature, 2008). Studies labeled only "lung cancer" or "non SC lung cancer" are also excluded.
        - To download: go to the query -> "Download" tab -> Downloadable Data Files -> Mutations (OQL is not in effect)

- **goals:**
    - Calculate the fractions of EGFR, KRAS, and "Other" samples (samples with neither EGFR nor KRAS mutations, which may or may not carry mutations in other genes) and plot them as a stacked bar plot in Prism. Annotate with the significance of U2AF1/KRAS mutation co-occurrence from the cBioPortal query (q < 0.001).

## import modules and data

In [17]:
#import modules and data
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
from statsmodels.stats.multitest import multipletests
from scipy.stats import chi2_contingency
from scipy.stats import chisquare

pd.set_option('display.max_colwidth', None)
df = pd.read_csv('Downloads/mutations-2.txt', sep='\t')
df

#each row is a lung ADC sample's mutational status
#there may be multiple samples per patient

Unnamed: 0,STUDY_ID,SAMPLE_ID,U2AF1,KRAS,EGFR
0,lung_msk_mind_2020,P-0000239-T01-IM3,WT,WT,WT
1,lung_msk_mind_2020,P-0001987-T01-IM3,WT,WT,WT
2,lung_msk_mind_2020,P-0002794-T01-IM3,WT,WT,WT
3,lung_msk_mind_2020,P-0002921-T01-IM3,WT,G12A,WT
4,lung_msk_mind_2020,P-0003247-T01-IM5,WT,G12D,WT
...,...,...,...,...,...
4794,luad_mskcc_2023_met_organotropism,P-0047270-T02-WES,WT,WT,WT
4795,luad_mskcc_2023_met_organotropism,P-0047338-T01-WES,WT,G12C,WT
4796,luad_mskcc_2023_met_organotropism,P-0052008-T03-WES,WT,G12A,WT
4797,luad_mskcc_2023_met_organotropism,P-0052008-T02-WES,WT,G12A,WT
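Because there may be multiple samples per patient, it can help to see how many patients the samples collapse to. The following is a minimal sketch, not part of the original analysis. It assumes that the patient ID is the first two dash-separated fields of the sample ID (as in the `P-0052008-...` rows above); that assumption may not hold for every study in the query, and `patient_of` is a hypothetical helper.

```python
from collections import defaultdict

# Sample IDs copied from the last rows of the table above.
sample_ids = [
    "P-0047270-T02-WES",
    "P-0047338-T01-WES",
    "P-0052008-T03-WES",
    "P-0052008-T02-WES",
]


def patient_of(sample_id: str) -> str:
    """Hypothetical helper: keep the first two dash-separated fields.

    Assumes IDs shaped like 'P-0052008-T03-WES'; other studies may differ.
    """
    return "-".join(sample_id.split("-")[:2])


# Group sample IDs by their inferred patient ID.
samples_by_patient = defaultdict(list)
for sid in sample_ids:
    samples_by_patient[patient_of(sid)].append(sid)

# Patients contributing more than one sample.
multi = {p: s for p, s in samples_by_patient.items() if len(s) > 1}
print(len(samples_by_patient), "patients; multi-sample patients:", multi)
```

If multi-sample patients turn out to be common, keeping one sample per patient before counting is one option, at the cost of discarding data.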


## split dataframes into one for mutational status of U2AF1 WT samples, one for mutational status of U2AF1 mutant samples

In [18]:
# split dataframe into U2AF1 WT/U2AF1 mut
u2af1wtdf = df.loc[(df['U2AF1'] == 'WT')]
u2af1mutdf = df.loc[(df['U2AF1'] != 'WT')]

#get N of samples
print('number of U2AF1 WT samples:', len(u2af1wtdf))
print('number of U2AF1 mutant samples:', len(u2af1mutdf))

number of U2AF1 WT samples: 4691
number of U2AF1 mutant samples: 108


## calculate KRAS, EGFR, and "Other" fractions for plotting in Prism

In [19]:
#Make dataframes of counts for KRAS, EGFR, and all other driver mutations and 
#calculate fraction of that mutation

u2af1wtdict  = {}
u2af1mutdict = {}

dflist = [[u2af1wtdf, u2af1wtdict],[u2af1mutdf, u2af1mutdict]]

#count samples per category for each U2AF1 genotype
#KRAS and EGFR counts overlap because a sample can carry mutations in both genes

for pair in dflist:
    #Other = number of samples that are WT for both KRAS and EGFR
    #(these samples may or may not carry mutations in other genes)
    other = len(pair[0].loc[(pair[0]['KRAS'] == 'WT') & (pair[0]['EGFR'] == 'WT')])
    kras = len(pair[0].loc[(pair[0]['KRAS'] != 'WT')])
    egfr = len(pair[0].loc[(pair[0]['EGFR'] != 'WT')])

    pair[1]['KRAS'] = kras
    pair[1]['EGFR'] = egfr
    pair[1]['Other'] = other

u2af1mutdict


# df = df.loc[:, df.columns.isin(['Sample ID', 'Patient ID', 'PTEN', 'RET', 'RIT1', 'SETD2', 'SMARCA4', 'STK11', 
#                                 'TP53', 'ALK'])]

{'KRAS': 56, 'EGFR': 24, 'Other': 34}

In [20]:
u2af1wtdict

{'KRAS': 1340, 'EGFR': 1491, 'Other': 1887}
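As a local sanity check on the U2AF1/KRAS co-occurrence, the counts above can be arranged into a 2x2 table and tested with a one-sided Fisher exact test. This is a sketch using only the standard library; it is not the cBioPortal calculation and does not reproduce its q-value. `fisher_greater` is a hypothetical helper, and the test treats samples as independent, which multiple samples per patient would violate.

```python
from math import comb

# Counts taken from the cell outputs above.
n_mut, n_wt = 108, 4691              # U2AF1 mutant / U2AF1 WT samples
kras_in_mut, kras_in_wt = 56, 1340   # KRAS-mutant samples in each group

# 2x2 table: rows = U2AF1 (mutant, WT), columns = KRAS (mutant, WT)
table = [
    [kras_in_mut, n_mut - kras_in_mut],
    [kras_in_wt, n_wt - kras_in_wt],
]


def fisher_greater(tbl):
    """One-sided Fisher exact p-value: P(top-left cell >= observed).

    Sums the hypergeometric probabilities for all tables with the same
    margins whose top-left cell is at least as large as the observed one.
    """
    (a, b), (c, d) = tbl
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / denom


# Sample odds ratio: values above 1 indicate co-occurrence.
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
p_value = fisher_greater(table)
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.2e}")
```

With SciPy available, `scipy.stats.fisher_exact(table, alternative="greater")` computes the same one-sided p-value.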

## Check if numbers make sense

In [23]:
#check to see if numbers make sense

wtdriversum = sum(u2af1wtdict.values())
wtsamplesum = len(u2af1wtdf)
wtsame = sum(u2af1wtdict.values()) == len(u2af1wtdf)
wtsubtr = (sum(u2af1wtdict.values()) - len(u2af1wtdf))

mutsubtr = (sum(u2af1mutdict.values()) - len(u2af1mutdf))
mutdriversum = sum(u2af1mutdict.values())
mutsamplesum = len(u2af1mutdf)
mutsame = sum(u2af1mutdict.values()) == len(u2af1mutdf)


f"Mutations sum equal to sum of U2AF1 WT samples? ...{wtsame}. \
Mutation sum is {wtdriversum}. Sample sum is {wtsamplesum}. \
Mutation sum - sample sum = {wtsubtr}"  

'Mutations sum equal to sum of U2AF1 WT samples? ...False. Mutation sum is 4718. Sample sum is 4691. Mutation sum - sample sum = 27'

In [22]:
f"Driver mutations sum equal to sum of U2AF1 mutant samples? ...{mutsame}. \
Driver mutation sum is {mutdriversum}. Sample sum is {mutsamplesum}. \
Driver mutation sum - sample sum = {mutsubtr}" 

'Driver mutations sum equal to sum of U2AF1 mutant samples? ...False. Driver mutation sum is 114. Sample sum is 108. Driver mutation sum - sample sum = 6'

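### Conclusion: For both U2AF1 genotypes, the category counts sum to more than the number of samples. That makes sense: "Other" excludes KRAS- and EGFR-mutant samples, so the excess (27 for U2AF1 WT, 6 for U2AF1 mutant) is the number of samples mutant for both KRAS and EGFR, each of which is counted twice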
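The size of the excess can be explained by simple counting. "Other" holds only samples that are WT for both KRAS and EGFR, so the only samples counted twice in KRAS + EGFR + Other are those mutant for both genes. The sketch below uses the reported counts to recover that number and to split each group into non-overlapping categories; the category labels are illustrative, not names used elsewhere in this notebook.

```python
# Counts copied from the cell outputs above.
groups = {
    "U2AF1 WT": {"n": 4691, "KRAS": 1340, "EGFR": 1491, "Other": 1887},
    "U2AF1 mutant": {"n": 108, "KRAS": 56, "EGFR": 24, "Other": 34},
}

partitions = {}
for name, g in groups.items():
    # Each KRAS/EGFR co-mutant sample is counted once under KRAS and once
    # under EGFR, so the excess over n is the number of co-mutant samples.
    both = g["KRAS"] + g["EGFR"] + g["Other"] - g["n"]
    partitions[name] = {
        "KRAS only": g["KRAS"] - both,
        "EGFR only": g["EGFR"] - both,
        "KRAS + EGFR": both,
        "Other": g["Other"],
    }
    # The non-overlapping categories must add up to the sample count.
    assert sum(partitions[name].values()) == g["n"]
    print(name, partitions[name])
```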

In [24]:
#Combine dataframes for WT and U2AF1 mut samples

u2af1wt_top3 = pd.DataFrame.from_dict([u2af1wtdict])
u2af1mut_top3 = pd.DataFrame.from_dict([u2af1mutdict])

u2af1wt_top3.index = ['U2AF1 WT (n=4691)'] 
u2af1mut_top3.index = ['U2AF1 Mutant (n=108)'] 

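#denominator is the sum of the three category counts, not the number of samples,
#so the fractions sum to 1 (KRAS/EGFR co-mutant samples are counted in both categories)
u2af1wt_top3['Other'] = u2af1wt_top3['Other']/sum(u2af1wtdict.values())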
u2af1wt_top3['KRAS'] = u2af1wt_top3['KRAS']/sum(u2af1wtdict.values())
u2af1wt_top3['EGFR'] = u2af1wt_top3['EGFR']/sum(u2af1wtdict.values())

u2af1mut_top3['Other'] = u2af1mut_top3['Other']/sum(u2af1mutdict.values())
u2af1mut_top3['KRAS'] = u2af1mut_top3['KRAS']/sum(u2af1mutdict.values())
u2af1mut_top3['EGFR'] = u2af1mut_top3['EGFR']/sum(u2af1mutdict.values())

#concat frames
frames = [u2af1wt_top3, u2af1mut_top3]

mutsconcat = pd.concat(frames)
mutsconcat

Unnamed: 0,KRAS,EGFR,Other
U2AF1 WT (n=4691),0.284019,0.316024,0.399958
U2AF1 Mutant (n=108),0.491228,0.210526,0.298246
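The fractions in this table use the sum of the three category counts as the denominator, not the n shown in the row labels. The sketch below reproduces the table values from the reported counts and, for comparison, computes the per-sample fractions, which add up to slightly more than 1 because KRAS/EGFR co-mutant samples appear in both categories. The variable names are illustrative only.

```python
# Counts copied from the cell outputs above.
counts = {
    "U2AF1 WT (n=4691)": {"n": 4691, "KRAS": 1340, "EGFR": 1491, "Other": 1887},
    "U2AF1 Mutant (n=108)": {"n": 108, "KRAS": 56, "EGFR": 24, "Other": 34},
}

by_category_total = {}
by_sample = {}
for label, row in counts.items():
    cats = {k: v for k, v in row.items() if k != "n"}
    total = sum(cats.values())
    # Denominator used in the table above: sum of the category counts.
    by_category_total[label] = {k: round(v / total, 6) for k, v in cats.items()}
    # Alternative denominator: number of samples in the group.
    by_sample[label] = {k: v / row["n"] for k, v in cats.items()}
    print(label, by_category_total[label])
    print("  per-sample fractions sum to", round(sum(by_sample[label].values()), 4))
```

Which denominator to report depends on whether the stacked bars are meant to sum to 1 or to read as the share of samples carrying each mutation.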
