# <span style='font-family:"Times New Roman"'> <span styel=''>**COHORT MAF FILE CREATION**
*Emile Cohen*
    
 *February 2020*

**Goal:** Through this notebook, we create a maf file composed of all patients in the cohort. As we do not have the mutations for all patients in impact-facets-tp53 datasets, we will merge the cohort file with cbioportal data.

The notebook is composed of 2 parts:
   * **1. Extraction of patients from the cohort**
   * **2. Creation of the MAF file from CbioPortal raw datasets**
---

In [1]:
%run -i '../../utils/setup_environment.ipy'

import warnings
warnings.filterwarnings('ignore')

data_path = '../../data/'

Setup environment... done!


<span style="color:green">✅ Working on **mskimpact_env** conda environment.</span>

---
# Extraction of patients from the cohort

In [281]:
cohort = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.cohort.txt', sep='\t')

In [282]:
cohort['Patient_Id'] = cohort['sample_id'].str[:9]
cohort['Tumor_Id'] = cohort['sample_id'].str[:17]

In [283]:
cohort.head()

Unnamed: 0,sample_id,sample_path,tumor_sample_id,path,fit_name,purity_run_version,purity_run_prefix,purity_run_Seed,purity_run_cval,purity_run_nhet,purity_run_Purity,purity_run_Ploidy,purity_run_dipLogR,purity_run_alBalLogR,hisens_run_version,hisens_run_prefix,hisens_run_Seed,hisens_run_cval,hisens_run_nhet,hisens_run_hisens,hisens_run_Ploidy,hisens_run_dipLogR,manual_note,is_best_fit,purity,ploidy,dipLogR,dipLogR_flag,n_alternative_dipLogR,n_dip_bal_segs,frac_dip_bal_segs,n_dip_imbal_segs,frac_dip_imbal_segs,n_amps,n_homdels,frac_homdels,n_homdels_clonal,frac_homdels_clonal,n_cn_states,n_segs,n_cnlr_clusters,n_lcn_na,n_loh,frac_loh,n_segs_subclonal,frac_segs_subclonal,n_snps,n_het_snps,frac_het_snps,n_het_snps_hom_in_tumor_1pct,n_het_snps_hom_in_tumor_5pct,frac_het_snps_hom_in_tumor_1pct,frac_het_snps_hom_in_tumor_5pct,mean_cnlr_residual,sd_cnlr_residual,n_segs_discordant_tcn,frac_discordant_tcn,n_segs_discordant_lcn,frac_discordant_lcn,n_segs_discordant_both,frac_discordant_both,n_segs_icn_cnlor_discordant,frac_icn_cnlor_discordant,homdel_filter_pass,diploid_bal_seg_filter_pass,diploid_imbal_seg_filter_pass,waterfall_filter_pass,hyper_seg_filter_pass,high_ploidy_filter_pass,valid_purity_filter_pass,diploid_seg_filter_pass,facets_suite_qc,arm_level_file,gene_level_file,ccf_file,arm_level_file_exists,gene_level_file_exists,ccf_file_exists,Patient_Id,Tumor_Id
0,P-0034223-T01-IM6_P-0034223-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6/,P-0034223-T01-IM6_P-0034223-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6_purity,100.0,100,15,0.94,2.24,-0.16,-0.16,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6_hisens,100.0,50.0,15.0,,2.24,-0.16,,False,0.941111,2.24183,-0.155483,False,0,12,0.59,0,0.0,0,0,0.0,0,0.0,6,31,10,2,4,0.062,1,1e-05,22963,2655,0.12,43,129,0.016,0.049,-0.19,0.63,1,0.0038,0.0,0.0,0,0.0,1,0.042,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.ccf.maf,True,True,True,P-0034223,P-0034223-T01-IM6
1,P-0009819-T01-IM5_P-0009819-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5/,P-0009819-T01-IM5_P-0009819-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5_purity,100.0,100,15,0.28,2.68,-0.13,-0.13,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5_hisens,100.0,50.0,15.0,,2.77,-0.13,,False,0.275237,2.681075,-0.129255,False,0,7,0.43,0,0.0,0,1,0.0062,1,0.0062,3,25,6,0,5,0.094,0,0.0,16527,2041,0.12,10,10,0.0049,0.0049,-0.25,0.83,1,0.0062,0.0,0.0,0,0.0,2,0.0063,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.ccf.maf,True,True,True,P-0009819,P-0009819-T01-IM5
2,P-0025956-T01-IM6_P-0025956-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6/,P-0025956-T01-IM6_P-0025956-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6_purity,100.0,100,15,0.19,3.5,-0.19,"-0.19, 0.02",0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6_hisens,100.0,50.0,15.0,,3.45,-0.19,,False,0.185874,3.496971,-0.187925,False,0,2,0.096,4,0.18,0,0,0.0,0,0.0,6,26,6,0,5,0.19,0,0.0,17971,2159,0.12,0,0,0.0,0.0,-0.052,0.3,2,0.015,0.0,0.0,3,0.12,8,0.3,True,True,True,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.ccf.maf,False,False,False,P-0025956,P-0025956-T01-IM6
3,P-0027408-T01-IM6_P-0027408-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6/,P-0027408-T01-IM6_P-0027408-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6_purity,100.0,100,15,0.31,1.81,0.04,0.04,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6_hisens,100.0,50.0,15.0,,1.82,0.04,,False,0.308886,1.811066,0.042724,False,0,7,0.28,1,0.035,0,0,0.0,0,0.0,4,31,6,0,12,0.34,0,0.0,18633,2163,0.12,0,0,0.0,0.0,-0.058,0.29,2,0.096,0.0,0.0,0,0.0,4,0.21,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.ccf.maf,True,True,True,P-0027408,P-0027408-T01-IM6
4,P-0006554-T01-IM5_P-0006554-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5/,P-0006554-T01-IM5_P-0006554-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5_purity,100.0,100,15,0.72,1.91,0.05,0.05,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5_hisens,100.0,50.0,15.0,,1.92,0.05,,False,0.715208,1.910719,0.046812,False,0,11,0.49,0,0.0,0,1,0.0074,0,0.0,6,30,12,2,6,0.088,6,0.15,16557,2041,0.12,1,7,0.00049,0.0034,-0.024,0.38,0,0.0,160000000.0,0.058,3,0.054,3,0.11,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.ccf.maf,True,True,True,P-0006554,P-0006554-T01-IM5


In [284]:
print_md('cohort columns:','green')
for column in cohort.columns: print(column)

<span style="color:green">cohort columns:</span>

sample_id
sample_path
tumor_sample_id
path
fit_name
purity_run_version
purity_run_prefix
purity_run_Seed
purity_run_cval
purity_run_nhet
purity_run_Purity
purity_run_Ploidy
purity_run_dipLogR
purity_run_alBalLogR
hisens_run_version
hisens_run_prefix
hisens_run_Seed
hisens_run_cval
hisens_run_nhet
hisens_run_hisens
hisens_run_Ploidy
hisens_run_dipLogR
manual_note
is_best_fit
purity
ploidy
dipLogR
dipLogR_flag
n_alternative_dipLogR
n_dip_bal_segs
frac_dip_bal_segs
n_dip_imbal_segs
frac_dip_imbal_segs
n_amps
n_homdels
frac_homdels
n_homdels_clonal
frac_homdels_clonal
n_cn_states
n_segs
n_cnlr_clusters
n_lcn_na
n_loh
frac_loh
n_segs_subclonal
frac_segs_subclonal
n_snps
n_het_snps
frac_het_snps
n_het_snps_hom_in_tumor_1pct
n_het_snps_hom_in_tumor_5pct
frac_het_snps_hom_in_tumor_1pct
frac_het_snps_hom_in_tumor_5pct
mean_cnlr_residual
sd_cnlr_residual
n_segs_discordant_tcn
frac_discordant_tcn
n_segs_discordant_lcn
frac_discordant_lcn
n_segs_discordant_both
frac_discordant_both
n_segs_icn_cnlo

In [285]:
# We verify that each sample is unique (we have an equal number of lines and unique samples )
assert(len(cohort) == len(set(cohort.sample_id)))

In [286]:
len(set(cohort.Tumor_Id))

29259

# Adding NA purity samples
We want to add the samples that have no purity, at least one tp53 mutation and a max_vaf>0.15 

In [287]:
maf_total_cohort = pd.read_csv(data_path + 'impact-facets-tp53/new_data_facets/msk_impact_facets_annotated.ccf.maf', sep='\t')
total_cohort = pd.read_csv(data_path + 'impact-facets-tp53/new_data_facets/msk_impact_facets_annotated.cohort.txt', sep='\t')
cohort_na_tp53 = maf_total_cohort[(maf_total_cohort.purity == 0.3)| (maf_total_cohort.purity.isna())][maf_total_cohort['Hugo_Symbol'] == 'TP53']


In [288]:
# We create a list for the interesting samples that we will integrate to our data
samples_maxvaf=[]
for sample in list(set(cohort_na_tp53.Tumor_Sample_Barcode)):
    sample_1 = maf_total_cohort[maf_total_cohort['Tumor_Sample_Barcode'] == sample]
    if max(sample_1.t_var_freq)>0.15:
        samples_maxvaf.append(sample)

In [289]:
len(samples_maxvaf) - 320

521

In [290]:
29304 + 521

29825

In [291]:
i=0
for sample in samples_maxvaf:
    if sample in set(cohort.Tumor_Id):
        print(sample)
        print(maf_total_cohort[maf_total_cohort['Tumor_Sample_Barcode']==sample]['purity'])
        #i+=1
#print(i)

P-0022508-T01-IM6
101198   NaN
101199   NaN
101200   NaN
101201   NaN
101202   NaN
101203   NaN
101204   NaN
101205   NaN
101206   NaN
101207   NaN
101208   NaN
101209   NaN
101210   NaN
101211   NaN
101212   NaN
101213   NaN
101214   NaN
101215   NaN
101216   NaN
101217   NaN
101218   NaN
101219   NaN
101220   NaN
101221   NaN
101222   NaN
101223   NaN
101224   NaN
101225   NaN
101226   NaN
101227   NaN
101228   NaN
101229   NaN
101230   NaN
101231   NaN
101232   NaN
101233   NaN
101234   NaN
Name: purity, dtype: float64
P-0036065-T02-IM6
73894    0.3
73895    0.3
73896    0.3
73897    0.3
73898    0.3
73899    0.3
73900    0.3
73901    0.3
Name: purity, dtype: float64
P-0022739-T01-IM6
42658    0.3
42659    0.3
42660    0.3
Name: purity, dtype: float64
P-0012289-T02-IM6
218291    0.3
218292    0.3
218293    0.3
Name: purity, dtype: float64
P-0047948-T01-IM6
325693    0.3
325694    0.3
325695    0.3
325696    0.3
325697    0.3
325698    0.3
325699    0.3
325700    0.3
Name: purity, dt

P-0039290-T01-IM6
250948    0.3
250949    0.3
250950    0.3
250951    0.3
Name: purity, dtype: float64
P-0040489-T01-IM6
258847    0.3
258848    0.3
258849    0.3
258850    0.3
258851    0.3
258852    0.3
258853    0.3
258854    0.3
258855    0.3
258856    0.3
258857    0.3
258858    0.3
258859    0.3
Name: purity, dtype: float64
P-0003820-T01-IM5
33096   NaN
33097   NaN
33098   NaN
33099   NaN
33100   NaN
Name: purity, dtype: float64
P-0006634-T01-IM5
87663    0.3
87664    0.3
87665    0.3
87666    0.3
87667    0.3
87668    0.3
87669    0.3
87670    0.3
87671    0.3
87672    0.3
87673    0.3
Name: purity, dtype: float64
P-0040655-T01-IM6
258766    0.3
258767    0.3
258768    0.3
258769    0.3
258770    0.3
258771    0.3
258772    0.3
258773    0.3
258774    0.3
Name: purity, dtype: float64
P-0047599-T01-IM6
315627   NaN
315628   NaN
315629   NaN
315630   NaN
315631   NaN
315632   NaN
315633   NaN
315634   NaN
315635   NaN
315636   NaN
315637   NaN
315638   NaN
315639   NaN
315640   Na

286584    0.3
286585    0.3
286586    0.3
286587    0.3
286588    0.3
286589    0.3
286590    0.3
286591    0.3
286592    0.3
286593    0.3
286594    0.3
Name: purity, dtype: float64
P-0004327-T01-IM5
229055    0.3
229056    0.3
229057    0.3
229058    0.3
229059    0.3
229060    0.3
229061    0.3
229062    0.3
229063    0.3
229064    0.3
229065    0.3
229066    0.3
229067    0.3
229068    0.3
229069    0.3
229070    0.3
229071    0.3
229072    0.3
229073    0.3
229074    0.3
Name: purity, dtype: float64
P-0035372-T01-IM6
100022    0.3
100023    0.3
100024    0.3
100025    0.3
100026    0.3
100027    0.3
100028    0.3
100029    0.3
100030    0.3
100031    0.3
Name: purity, dtype: float64
P-0033982-T01-IM6
85349    0.3
85350    0.3
85351    0.3
Name: purity, dtype: float64
P-0030094-T03-IM6
245185    0.3
245186    0.3
245187    0.3
245188    0.3
245189    0.3
245190    0.3
245191    0.3
245192    0.3
245193    0.3
245194    0.3
Name: purity, dtype: float64
P-0035240-T01-IM6
175319    0.

226165    0.3
226166    0.3
226167    0.3
Name: purity, dtype: float64
P-0004710-T01-IM5
229346    0.3
229347    0.3
229348    0.3
229349    0.3
229350    0.3
229351    0.3
229352    0.3
229353    0.3
229354    0.3
229355    0.3
229356    0.3
229357    0.3
229358    0.3
229359    0.3
229360    0.3
229361    0.3
229362    0.3
229363    0.3
229364    0.3
229365    0.3
229366    0.3
229367    0.3
Name: purity, dtype: float64
P-0041572-T01-IM6
268298    0.3
268299    0.3
268300    0.3
268301    0.3
268302    0.3
268303    0.3
268304    0.3
268305    0.3
Name: purity, dtype: float64
P-0009076-T06-IM6
107493   NaN
107494   NaN
107495   NaN
107496   NaN
107497   NaN
107498   NaN
107499   NaN
107500   NaN
Name: purity, dtype: float64
P-0042485-T01-IM6
311948    0.3
311949    0.3
Name: purity, dtype: float64
P-0004363-T01-IM5
7451   NaN
7452   NaN
7453   NaN
7454   NaN
7455   NaN
7456   NaN
7457   NaN
7458   NaN
7459   NaN
7460   NaN
Name: purity, dtype: float64
P-0036454-T01-IM6
20730    0.3
2

P-0019970-T01-IM6
147574   NaN
147575   NaN
147576   NaN
147577   NaN
147578   NaN
147579   NaN
147580   NaN
147581   NaN
147582   NaN
147583   NaN
147584   NaN
147585   NaN
147586   NaN
147587   NaN
147588   NaN
Name: purity, dtype: float64
P-0018182-T03-IM6
55213    0.300000
55214    0.300000
55215    0.300000
55216    0.300000
55217    0.300000
55218    0.234449
55219    0.234449
55220    0.234449
55221    0.234449
55222    0.234449
Name: purity, dtype: float64
P-0006216-T01-IM5
218298   NaN
218299   NaN
218300   NaN
218301   NaN
218302   NaN
218303   NaN
218304   NaN
218305   NaN
218306   NaN
218307   NaN
218308   NaN
218309   NaN
218310   NaN
218311   NaN
218312   NaN
218313   NaN
218314   NaN
218315   NaN
218316   NaN
218317   NaN
218318   NaN
218319   NaN
218320   NaN
218321   NaN
218322   NaN
218323   NaN
218324   NaN
218325   NaN
218326   NaN
218327   NaN
218328   NaN
218329   NaN
218330   NaN
218331   NaN
218332   NaN
218333   NaN
218334   NaN
218335   NaN
218336   NaN
218337

P-0016546-T01-IM6
48969    0.3
48970    0.3
48971    0.3
48972    0.3
48973    0.3
48974    0.3
48975    0.3
48976    0.3
48977    0.3
48978    0.3
48979    0.3
48980    0.3
48981    0.3
48982    0.3
48983    0.3
48984    0.3
48985    0.3
48986    0.3
48987    0.3
48988    0.3
48989    0.3
48990    0.3
48991    0.3
48992    0.3
48993    0.3
48994    0.3
48995    0.3
Name: purity, dtype: float64
P-0000698-T01-IM3
152979    0.3
152980    0.3
152981    0.3
152982    0.3
Name: purity, dtype: float64
P-0011077-T01-IM5
235176    0.3
235177    0.3
235178    0.3
235179    0.3
235180    0.3
235181    0.3
235182    0.3
235183    0.3
235184    0.3
235185    0.3
235186    0.3
235187    0.3
235188    0.3
Name: purity, dtype: float64
P-0004380-T01-IM5
22922   NaN
22923   NaN
22924   NaN
22925   NaN
22926   NaN
22927   NaN
22928   NaN
22929   NaN
22930   NaN
22931   NaN
22932   NaN
22933   NaN
22934   NaN
22935   NaN
22936   NaN
22937   NaN
22938   NaN
22939   NaN
22940   NaN
22941   NaN
22942   NaN


56728   NaN
56729   NaN
56730   NaN
Name: purity, dtype: float64
P-0037799-T01-IM6
241746    0.3
241747    0.3
241748    0.3
241749    0.3
241750    0.3
241751    0.3
241752    0.3
241753    0.3
241754    0.3
241755    0.3
241756    0.3
241757    0.3
241758    0.3
241759    0.3
Name: purity, dtype: float64
P-0019076-T01-IM6
194780    0.3
194781    0.3
194782    0.3
194783    0.3
194784    0.3
194785    0.3
194786    0.3
194787    0.3
194788    0.3
194789    0.3
194790    0.3
194791    0.3
194792    0.3
194793    0.3
194794    0.3
194795    0.3
194796    0.3
194797    0.3
194798    0.3
194799    0.3
194800    0.3
194801    0.3
194802    0.3
194803    0.3
194804    0.3
194805    0.3
194806    0.3
194807    0.3
194808    0.3
194809    0.3
194810    0.3
194811    0.3
194812    0.3
194813    0.3
194814    0.3
194815    0.3
194816    0.3
194817    0.3
194818    0.3
194819    0.3
194820    0.3
194821    0.3
194822    0.3
194823    0.3
194824    0.3
194825    0.3
194826    0.3
194827    0.3
19

P-0010034-T01-IM5
84922    0.3
84923    0.3
84924    0.3
84925    0.3
84926    0.3
84927    0.3
84928    0.3
84929    0.3
84930    0.3
84931    0.3
84932    0.3
84933    0.3
84934    0.3
84935    0.3
84936    0.3
84937    0.3
84938    0.3
84939    0.3
84940    0.3
84941    0.3
84942    0.3
84943    0.3
84944    0.3
84945    0.3
Name: purity, dtype: float64
P-0002178-T01-IM3
97868    0.3
97869    0.3
97870    0.3
97871    0.3
97872    0.3
97873    0.3
97874    0.3
97875    0.3
97876    0.3
97877    0.3
97878    0.3
97879    0.3
97880    0.3
Name: purity, dtype: float64
P-0023543-T01-IM6
38868    0.3
38869    0.3
Name: purity, dtype: float64
P-0024482-T01-IM6
146098   NaN
146099   NaN
146100   NaN
146101   NaN
Name: purity, dtype: float64
P-0031394-T01-IM6
146066    0.3
146067    0.3
146068    0.3
146069    0.3
146070    0.3
146071    0.3
146072    0.3
Name: purity, dtype: float64
P-0014018-T01-IM5
134967   NaN
134968   NaN
134969   NaN
134970   NaN
134971   NaN
134972   NaN
134973   NaN

P-0044858-T01-IM6
286317    0.3
286318    0.3
286319    0.3
286320    0.3
286321    0.3
286322    0.3
286323    0.3
286324    0.3
286325    0.3
286326    0.3
286327    0.3
286328    0.3
286329    0.3
286330    0.3
286331    0.3
286332    0.3
286333    0.3
286334    0.3
286335    0.3
286336    0.3
286337    0.3
286338    0.3
286339    0.3
286340    0.3
286341    0.3
286342    0.3
286343    0.3
286344    0.3
286345    0.3
286346    0.3
286347    0.3
286348    0.3
286349    0.3
286350    0.3
286351    0.3
286352    0.3
286353    0.3
286354    0.3
286355    0.3
286356    0.3
286357    0.3
286358    0.3
286359    0.3
286360    0.3
286361    0.3
286362    0.3
Name: purity, dtype: float64
P-0050654-T01-IM6
339343    0.3
339344    0.3
Name: purity, dtype: float64
P-0015458-T02-IM6
70300    0.3
Name: purity, dtype: float64
P-0023624-T01-IM6
289051    0.3
289052    0.3
289053    0.3
289054    0.3
Name: purity, dtype: float64
P-0026958-T01-IM6
109185    0.3
109186    0.3
109187    0.3
109188    0

In [292]:
total_cohort[total_cohort.tumor_sample == 'P-0022508-T01-IM6']

Unnamed: 0,sample_id,sample_path,fit_to_use,patient,tumor_sample,tumor_bamname,normal_sample,normal_bamname,tag,run_prefix,run_output_dir,run_log_dir,counts_file,has_counts_file,has_hisens_run,has_purity_run,has_qc,has_maf_anno,run_status,tumor_sample_id,path,fit_name,purity_run_version,purity_run_prefix,purity_run_Seed,purity_run_cval,purity_run_nhet,purity_run_snp_nbhd,purity_run_ndepth,purity_run_Purity,purity_run_Ploidy,purity_run_dipLogR,purity_run_alBalLogR,hisens_run_version,hisens_run_prefix,hisens_run_Seed,hisens_run_cval,hisens_run_nhet,hisens_run_snp_nbhd,hisens_run_ndepth,hisens_run_hisens,hisens_run_Ploidy,hisens_run_dipLogR,manual_note,is_best_fit,purity,ploidy,dipLogR,dipLogR_flag,n_alternative_dipLogR,wgd,n_dip_bal_segs,frac_dip_bal_segs,n_dip_imbal_segs,frac_dip_imbal_segs,n_amps,n_homdels,frac_homdels,n_homdels_clonal,frac_homdels_clonal,n_cn_states,n_segs,n_cnlr_clusters,n_lcn_na,n_loh,frac_loh,n_segs_subclonal,frac_segs_subclonal,n_segs_below_dipLogR,frac_below_dipLogR,n_segs_balanced_odd_tcn,frac_balanced_odd_tcn,n_segs_imbalanced_diploid_cn,frac_imbalanced_diploid_cn,n_segs_lcn_greater_mcn,frac_lcn_greater_mcn,n_snps,n_het_snps,frac_het_snps,n_snps_with_300x_in_tumor,n_het_snps_with_300x_in_tumor,n_het_snps_hom_in_tumor_1pct,n_het_snps_hom_in_tumor_5pct,frac_het_snps_hom_in_tumor_1pct,frac_het_snps_hom_in_tumor_5pct,mean_cnlr_residual,sd_cnlr_residual,n_segs_discordant_tcn,frac_discordant_tcn,n_segs_discordant_lcn,frac_discordant_lcn,n_segs_discordant_both,frac_discordant_both,n_segs_icn_cnlor_discordant,frac_icn_cnlor_discordant,mafr_median_all,mafr_median_clonal,mafr_n_gt_1,facets_suite_version,facets_qc_version,homdel_filter_pass,diploid_bal_seg_filter_pass,diploid_imbal_seg_filter_pass,waterfall_filter_pass,hyper_seg_filter_pass,high_ploidy_filter_pass,valid_purity_filter_pass,diploid_seg_filter_pass,em_cncf_icn_discord_filter_pass,dipLogR_too_low_filter_pass,subclonal_genome_filter_pass,icn_allelic_state_concordance_filter_pass,contamination_filter_pass,facets_qc,arm_level_file,gene_level_file,ccf_file,arm_level_file_exists,gene_level_file_exists,ccf_file_exists
13057,P-0022508-T01-IM6_P-0022508-N01-IM6,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/,,P-0022508,P-0022508-T01-IM6,QS871775-T,P-0022508-N01-IM6,AQ626086-N,P-0022508-T01-IM6_P-0022508-N01-IM6,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/default/,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/default/logs/,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/countsMerged____P-0022508-T01-IM6_P-0022508-N01-IM6.dat.gz,True,True,True,False,False,complete,P-0022508-T01-IM6_P-0022508-N01-IM6,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/,default,0.5.14,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6_purity,100,100,15,250,35,0.27,2.3,-0.06,-0.06,0.5.14,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6_hisens,100,50,15,250,35,,2.0,-0.06,,False,0.271751,2.320998,-0.061591,False,0,False,10,0.53,0,0.0,0,0,0.0,0,0.0,5,31,8,1,8,0.12,2,0.034,13,0.3,7,0.12,0,0.0,8,0.12,19903,2349,0.12,6971,731,2,0,0.00085,0.0,-0.12,0.5,5,0.1,0.0,0.0,0,0.0,7,0.12,0.0012,0.0017,0,2.0.5,1,True,True,False,True,True,True,True,True,True,True,True,True,True,True,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6.arm_level.txt,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6.gene_level.txt,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6.ccf.maf,True,True,True


In [293]:
# We only take interesting columns --> ['sample_id','Tumor_Id', 'purity', 'ploidy', 'dipLogR', 'frac_loh']
cohort_to_add = total_cohort[total_cohort.tumor_sample.isin(samples_maxvaf)]
cohort_to_add = cohort_to_add[['sample_id','tumor_sample', 'purity', 'ploidy', 'dipLogR', 'frac_loh']]
cohort_to_add.columns = ['sample_id','Tumor_Id', 'purity', 'ploidy', 'dipLogR', 'frac_loh']

In [294]:
# We concatenate cohort and cohort_to_add
cohort = pd.concat([cohort, cohort_to_add], axis=0)

In [295]:
cohort[cohort['Tumor_Id'] == 'P-0035205-T03-IM6']

Unnamed: 0,sample_id,sample_path,tumor_sample_id,path,fit_name,purity_run_version,purity_run_prefix,purity_run_Seed,purity_run_cval,purity_run_nhet,purity_run_Purity,purity_run_Ploidy,purity_run_dipLogR,purity_run_alBalLogR,hisens_run_version,hisens_run_prefix,hisens_run_Seed,hisens_run_cval,hisens_run_nhet,hisens_run_hisens,hisens_run_Ploidy,hisens_run_dipLogR,manual_note,is_best_fit,purity,ploidy,dipLogR,dipLogR_flag,n_alternative_dipLogR,n_dip_bal_segs,frac_dip_bal_segs,n_dip_imbal_segs,frac_dip_imbal_segs,n_amps,n_homdels,frac_homdels,n_homdels_clonal,frac_homdels_clonal,n_cn_states,n_segs,n_cnlr_clusters,n_lcn_na,n_loh,frac_loh,n_segs_subclonal,frac_segs_subclonal,n_snps,n_het_snps,frac_het_snps,n_het_snps_hom_in_tumor_1pct,n_het_snps_hom_in_tumor_5pct,frac_het_snps_hom_in_tumor_1pct,frac_het_snps_hom_in_tumor_5pct,mean_cnlr_residual,sd_cnlr_residual,n_segs_discordant_tcn,frac_discordant_tcn,n_segs_discordant_lcn,frac_discordant_lcn,n_segs_discordant_both,frac_discordant_both,n_segs_icn_cnlor_discordant,frac_icn_cnlor_discordant,homdel_filter_pass,diploid_bal_seg_filter_pass,diploid_imbal_seg_filter_pass,waterfall_filter_pass,hyper_seg_filter_pass,high_ploidy_filter_pass,valid_purity_filter_pass,diploid_seg_filter_pass,facets_suite_qc,arm_level_file,gene_level_file,ccf_file,arm_level_file_exists,gene_level_file_exists,ccf_file_exists,Patient_Id,Tumor_Id
23658,P-0035205-T03-IM6_P-0035205-N03-IM6,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6/,P-0035205-T03-IM6_P-0035205-N03-IM6,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6//default/P-0035205-T03-IM6_P-0035205-N03-IM6_purity,100.0,100.0,15.0,0.18,0.77,0.17,0.17,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6//default/P-0035205-T03-IM6_P-0035205-N03-IM6_hisens,100.0,50.0,15.0,,0.75,0.17,,False,0.18081,0.766488,0.170582,False,0.0,10.0,0.53,1.0,0.0088,0.0,3.0,0.0079,3.0,0.0079,5.0,30.0,9.0,1.0,6.0,0.056,0.0,0.0,19847.0,2398.0,0.12,0.0,1.0,0.0,0.00042,0.0052,0.63,1.0,0.0011,0.0,0.0,0.0,0.0,4.0,0.043,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6//default/P-0035205-T03-IM6_P-0035205-N03-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6//default/P-0035205-T03-IM6_P-0035205-N03-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00352/P-0035205-T03-IM6_P-0035205-N03-IM6//default/P-0035205-T03-IM6_P-0035205-N03-IM6.ccf.maf,True,True,True,P-0035205,P-0035205-T03-IM6
33893,P-0035205-T03-IM6_P-0035205-N02-IM6,,,,,,,,,,,,,,,,,,,,,,,,,2.0,-0.049729,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,P-0035205-T03-IM6
33894,P-0035205-T03-IM6_P-0035205-N03-IM6,,,,,,,,,,,,,,,,,,,,,,,,0.18081,0.766488,0.170582,,,,,,,,,,,,,,,,,0.056,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,P-0035205-T03-IM6


---
## Creation of the MAF file from CbioPortal raw datasets

In [2]:
clinical_data = pd.read_csv(data_path + 'cbioportal/raw/mskimpact_clinical_data-2.tsv', sep= '\t')
mutations = pd.read_pickle(data_path + 'cbioportal/raw/mutations_cohort.pkl')

In [5]:
get_groupby(clinical_data[clinical_data['Cancer Type'] == 'Breast Cancer'], 'Cancer Type Detailed', 'count')

Unnamed: 0_level_0,count
Cancer Type Detailed,Unnamed: 1_level_1
Adenoid Cystic Breast Cancer,10
Adenomyoepithelioma of the Breast,2
Breast Carcinoma with Signet Ring,1
Breast Ductal Carcinoma In Situ,3
"Breast Invasive Cancer, NOS",190
"Breast Invasive Carcinoma, NOS",358
"Breast Invasive Carcinosarcoma, NOS",3
Breast Invasive Ductal Carcinoma,3709
Breast Invasive Lobular Carcinoma,646
Breast Invasive Mixed Mucinous Carcinoma,17


In [297]:
# Here are the columns we will select for the three different files

filter_cohort = ['sample_id','Tumor_Id', 'purity', 'ploidy', 'dipLogR', 'frac_loh']

filter_mut = ['sampleId',
             'patientId',
             'gene',
             'entrezGeneId',
             'mutationType',
             'mutationStatus',
             'proteinChange',
             'startPosition',
             'endPosition',
             'referenceAllele',
             'variantAllele',
             'chr',
             'hugoGeneSymbol',
             'tumorAltCount',
             'tumorRefCount']

filter_clinical = [  'Sample ID',
                     'Patient ID',
                     'Patient Current Age',
                     'Cancer Type',
                     'Cancer Type Detailed',
                     'Ethnicity Category' ,
                     'Race Category',
                     'Sex',
                     'Mutation Count',
                     'Sample Type',
                     'Number of Samples Per Patient',
                     'Overall Survival Status',
                     'Overall Survival (Months)',
                     'MSI Score',
                     'Impact TMB Score',
                     'Somatic Status'
                      ]

cohort_filtered = cohort[filter_cohort]
mutations_filtered = mutations[filter_mut]
clinical_data_filtered = clinical_data[filter_clinical]

---
We create 3 new columns in mutations_filtered:
* *mut_key*: mutation key that describes entirely the mutation
* *sample_mut_key*: sample mutation key that adds information about the sample (it allows to filter out duplicates)
* *mut_spot*: number representing the location of the amino acid mutated

In [298]:
# Create a mutation Key
mutations_filtered['mut_key'] = pd.Series([str(i)+'_'+str(j)+'_'+str(k)+'_'+str(l) for i,j,k,l in zip(mutations_filtered.chr, mutations_filtered.startPosition, mutations_filtered.referenceAllele, mutations_filtered.variantAllele)]) 
# Create a sample key to differentiate duplicates
mutations_filtered['sample_mut_key'] = pd.Series([str(j)+'_'+str(i) for i,j in zip( mutations_filtered.mut_key, mutations_filtered.sampleId)])
# Extract the mutation spot from HGVSp
mutations_filtered['mut_spot'] = mutations_filtered.proteinChange.str.extract('(\d+)')
#Create the vaf column
mutations_filtered['vaf'] = mutations_filtered.apply(lambda x: x.tumorAltCount/(x.tumorAltCount + x.tumorRefCount) if (x.tumorAltCount + x.tumorRefCount)>0 else 'None' , axis=1)

In [299]:
# We merge the three dataframes
# Left Join on 'patient_Id' and 'Patient ID'
maf = pd.merge(left=cohort_filtered,right=clinical_data_filtered, how='left', left_on='Tumor_Id', right_on='Sample ID')
maf_cohort = pd.merge(left=maf, right=mutations_filtered, how='left', left_on='Tumor_Id', right_on='sampleId')
# We drop column duplicates
maf_cohort = maf_cohort.drop(['sampleId', 'Sample ID','patientId'], axis=1)
# We rename the columns to be consistent with other maf files created
maf_cohort.columns = ['Sample_Id', 'Tumor_Id','purity', 'ploidy', 'dipLogR', 'frac_loh', 'Patient_Id', 'Patient_Current_Age', 'Cancer_Type',
                    'Cancer_Type_Detailed', 'Ethnicity_Category','Race_Category', 'Sex', 'Mutation_Count', 'Sample_Type', 'samples_per_patient','Overall Survival Status',
                     'Overall Survival (Months)', 'MSI Score','TMB_Score','Somatic_Status', 'gene','Gene_Id','Variant_Classification','mutationStatus', 'proteinChange',
                    'Start_Position', 'End_Position', 'Reference_Allele','Variant_Allele', 'Chromosome',  
                    'Hugo_Symbol','alt_count', 'ref_count', 'mut_key', 'sample_mut_key', 'mut_spot', 'vaf']
maf_cohort

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 5290, 'hugoGeneSymbol': 'PIK3CA', 'type': 'protein-coding'}",5290.0,Missense_Mutation,SOMATIC,E545K,178936091.0,178936091.0,G,A,3,PIK3CA,284.0,334.0,3_178936091_G_A,P-0034223-T01-IM6_3_178936091_G_A,545,0.459547
1,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 2064, 'hugoGeneSymbol': 'ERBB2', 'type': 'protein-coding'}",2064.0,Missense_Mutation,SOMATIC,L755S,37880220.0,37880220.0,T,C,17,ERBB2,224.0,262.0,17_37880220_T_C,P-0034223-T01-IM6_17_37880220_T_C,755,0.460905
2,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 9641, 'hugoGeneSymbol': 'IKBKE', 'type': 'protein-coding'}",9641.0,Missense_Mutation,SOMATIC,R27H,206646650.0,206646650.0,G,A,1,IKBKE,252.0,1027.0,1_206646650_G_A,P-0034223-T01-IM6_1_206646650_G_A,27,0.197029
3,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 6926, 'hugoGeneSymbol': 'TBX3', 'type': 'protein-coding'}",6926.0,Frame_Shift_Ins,SOMATIC,S321Vfs*6,115114257.0,115114258.0,-,T,12,TBX3,358.0,384.0,12_115114257_-_T,P-0034223-T01-IM6_12_115114257_-_T,321,0.48248
4,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,Missense_Mutation,SOMATIC,C227Y,38061309.0,38061309.0,C,T,14,FOXA1,410.0,462.0,14_38061309_C_T,P-0034223-T01-IM6_14_38061309_C_T,227,0.470183
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284614,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 3845, 'hugoGeneSymbol': 'KRAS', 'type': 'protein-coding'}",3845.0,Missense_Mutation,SOMATIC,G12D,25398284.0,25398284.0,C,T,12,KRAS,37.0,120.0,12_25398284_C_T,P-0050744-T01-IM6_12_25398284_C_T,12,0.235669
284615,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,SOMATIC,Y234C,7577580.0,7577580.0,T,C,17,TP53,66.0,210.0,17_7577580_T_C,P-0050744-T01-IM6_17_7577580_T_C,234,0.23913
284616,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 1029, 'hugoGeneSymbol': 'CDKN2A', 'type': 'protein-coding'}",1029.0,Missense_Mutation,SOMATIC,H83Y,21971111.0,21971111.0,G,A,9,CDKN2A,20.0,215.0,9_21971111_G_A,P-0050744-T01-IM6_9_21971111_G_A,83,0.0851064
284617,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 5159, 'hugoGeneSymbol': 'PDGFRB', 'type': 'protein-coding'}",5159.0,Missense_Mutation,SOMATIC,E1069K,149495442.0,149495442.0,C,T,5,PDGFRB,58.0,279.0,5_149495442_C_T,P-0050744-T01-IM6_5_149495442_C_T,1069,0.172107


In [300]:
# MERGE WITH ANNOTATED DATA 

annotated_data = pd.read_pickle(data_path + 'maf_cohort_annotated.pkl')

#Create a total_mut_key
maf_cohort['total_mut_key'] = pd.Series([str(j)+'_'+str(i) for i,j in zip( maf_cohort.mut_key, maf_cohort.Sample_Id)])
annotated_data['total_mut_key'] = pd.Series([str(j)+'_'+str(i) for i,j in zip(annotated_data.mut_key, annotated_data.Sample_Id)])
annotated_data = annotated_data[['total_mut_key','mutationEffect','oncogenic','vus','hotspot']]

maf_cohort_final = pd.merge(maf_cohort, annotated_data, how='left', left_on='total_mut_key', right_on='total_mut_key')
maf_cohort_final = maf_cohort_final.drop_duplicates('total_mut_key')

In [301]:
# removing samples with tp53 vaf < 2%
maf_cohort_final = maf_cohort_final.drop(maf_cohort_final[maf_cohort_final['Tumor_Id'] == 'P-0035205-T03-IM6'].index)

In [302]:
# remove Germline (only for non tp53 mutations !!) and NA mutations: 
maf_cohort_final_ = maf_cohort_final[maf_cohort_final['mutationStatus'] != 'NA']

key_list = list(maf_cohort_final_[maf_cohort_final_['Hugo_Symbol'] != 'TP53'][maf_cohort_final_['mutationStatus'] != 'GERMLINE'].total_mut_key)
key_list = key_list + list(maf_cohort_final_[maf_cohort_final_['Hugo_Symbol'] == 'TP53'].total_mut_key)
maf_cohort_final_ = maf_cohort_final_[maf_cohort_final_.total_mut_key.isin(key_list)]

# Remove all mutations without VAF information
#maf_cohort_final_ = maf_cohort_final_[~maf_cohort_final_.vaf.isna()]

In [303]:
# Saving to pickle File
maf_cohort_final_.to_pickle(data_path + 'merged_data/maf_cohort.pkl')

In [307]:
for tumor_id in ['P-0027609-T01-IM6',
 'P-0009112-T01-IM5',
 'P-0017428-T01-IM6',
 'P-0027394-T01-IM6',
 'P-0010760-T01-IM5',
 'P-0005766-T01-IM5',
 'P-0012881-T01-IM5',
 'P-0008972-T01-IM5',
 'P-0000263-T01-IM3',
 'P-0015446-T01-IM6',
 'P-0014135-T02-IM5',
 'P-0015154-T01-IM6',
 'P-0016972-T01-IM6',
 'P-0028144-T01-IM6',
 'P-0012033-T01-IM5',
 'P-0010332-T01-IM5',
 'P-0006564-T01-IM5',
 'P-0018495-T01-IM6',
 'P-0012445-T01-IM5',
 'P-0035156-T01-IM6',
 'P-0035830-T01-IM6',
 'P-0033628-T01-IM6',
 'P-0026687-T01-IM6',
 'P-0023969-T01-IM6',
 'P-0036454-T01-IM6',
 'P-0035470-T02-IM6',
 'P-0035470-T01-IM6',
 'P-0033667-T01-IM6',
 'P-0030260-T01-IM6',
 'P-0027809-T01-IM6',
 'P-0013417-T01-IM5',
 'P-0010424-T01-IM5',
 'P-0021285-T01-IM6',
 'P-0021285-T02-IM6',
 'P-0035621-T01-IM6',
 'P-0017322-T02-IM6',
 'P-0033300-T01-IM6',
 'P-0027127-T01-IM6',
 'P-0002101-T02-IM6',
 'P-0019464-T01-IM6',
 'P-0035281-T01-IM6',
 'P-0022795-T01-IM6',
 'P-0004176-T01-IM5',
 'P-0022794-T01-IM6',
 'P-0003458-T02-IM5',
 'P-0003834-T01-IM5',
 'P-0004259-T01-IM5',
 'P-0015966-T01-IM6',
 'P-0023543-T01-IM6',
 'P-0024579-T01-IM6',
 'P-0020821-T01-IM6',
 'P-0013428-T01-IM5',
 'P-0020473-T01-IM6',
 'P-0036555-T01-IM6',
 'P-0006587-T01-IM5',
 'P-0032660-T01-IM6',
 'P-0006705-T01-IM5',
 'P-0000448-T01-IM3',
 'P-0028013-T01-IM6',
 'P-0015493-T01-IM6',
 'P-0007779-T01-IM5',
 'P-0034471-T02-IM6',
 'P-0028913-T02-IM6',
 'P-0016546-T01-IM6',
 'P-0030085-T01-IM6',
 'P-0008330-T01-IM5',
 'P-0009964-T02-IM5',
 'P-0036637-T01-IM6',
 'P-0025978-T01-IM6',
 'P-0030865-T01-IM6',
 'P-0030988-T01-IM6',
 'P-0004387-T02-IM5',
 'P-0034144-T01-IM6',
 'P-0026717-T01-IM6',
 'P-0037571-T01-IM6',
 'P-0019818-T01-IM6',
 'P-0008973-T01-IM5',
 'P-0027438-T01-IM6',
 'P-0029323-T01-IM6',
 'P-0022732-T01-IM6',
 'P-0009655-T01-IM5',
 'P-0019831-T03-IM6',
 'P-0004379-T02-IM6',
 'P-0003739-T01-IM5',
 'P-0031460-T01-IM6',
 'P-0020639-T01-IM6',
 'P-0035104-T01-IM6',
 'P-0025387-T01-IM6',
 'P-0016025-T01-IM6',
 'P-0032593-T02-IM6',
 'P-0023965-T01-IM6',
 'P-0011385-T01-IM5',
 'P-0036065-T02-IM6',
 'P-0022457-T01-IM6',
 'P-0004515-T01-IM5',
 'P-0018312-T01-IM6',
 'P-0025715-T01-IM6',
 'P-0032437-T01-IM6',
 'P-0023572-T01-IM6',
 'P-0001613-T01-IM3',
 'P-0033786-T01-IM6',
 'P-0035340-T01-IM6',
 'P-0022172-T01-IM6',
 'P-0019360-T01-IM6',
 'P-0001677-T01-IM3',
 'P-0021475-T01-IM6',
 'P-0013146-T01-IM5',
 'P-0025642-T01-IM6',
 'P-0006066-T01-IM5',
 'P-0024791-T01-IM6',
 'P-0027050-T04-IM6',
 'P-0032818-T01-IM6',
 'P-0021889-T01-IM6',
 'P-0020508-T01-IM6',
 'P-0010034-T01-IM5',
 'P-0026019-T01-IM6',
 'P-0022786-T01-IM6',
 'P-0026942-T01-IM6',
 'P-0005326-T01-IM5',
 'P-0022361-T01-IM6',
 'P-0001901-T02-IM6',
 'P-0003610-T01-IM5',
 'P-0029790-T02-IM6',
 'P-0007215-T01-IM5',
 'P-0023307-T01-IM6',
 'P-0006612-T01-IM5',
 'P-0012076-T01-IM5',
 'P-0002146-T01-IM3',
 'P-0011226-T01-IM5',
 'P-0032057-T01-IM6',
 'P-0005379-T01-IM5',
 'P-0019448-T01-IM6',
 'P-0030457-T01-IM6',
 'P-0016098-T01-IM6',
 'P-0005021-T01-IM5',
 'P-0037690-T01-IM6',
 'P-0015716-T01-IM6',
 'P-0037685-T01-IM6',
 'P-0024774-T01-IM6',
 'P-0022508-T01-IM6',
 'P-0009621-T01-IM5',
 'P-0015585-T01-IM6',
 'P-0016311-T01-IM6',
 'P-0020757-T01-IM6',
 'P-0018005-T01-IM6',
 'P-0025866-T01-IM6',
 'P-0025885-T01-IM6',
 'P-0007179-T03-IM6',
 'P-0012333-T01-IM5',
 'P-0026958-T01-IM6',
 'P-0027506-T01-IM6',
 'P-0014993-T01-IM6',
 'P-0019495-T01-IM6',
 'P-0015276-T01-IM6',
 'P-0017675-T01-IM5',
 'P-0005586-T01-IM5',
 'P-0013537-T01-IM5',
 'P-0029289-T01-IM6',
 'P-0028466-T01-IM6',
 'P-0030466-T01-IM6',
 'P-0019264-T01-IM6',
 'P-0023271-T01-IM6',
 'P-0008346-T02-IM6',
 'P-0015615-T01-IM6',
 'P-0013889-T01-IM5',
 'P-0032949-T01-IM6',
 'P-0034864-T01-IM6',
 'P-0010393-T02-IM6',
 'P-0023913-T01-IM6',
 'P-0036503-T01-IM6',
 'P-0033425-T01-IM6',
 'P-0003630-T02-IM5',
 'P-0037179-T01-IM6',
 'P-0014224-T01-IM6',
 'P-0010504-T01-IM5',
 'P-0032742-T01-IM6',
 'P-0002485-T01-IM3',
 'P-0020151-T01-IM6',
 'P-0003809-T01-IM5',
 'P-0028312-T01-IM6',
 'P-0030300-T01-IM6',
 'P-0028775-T01-IM6',
 'P-0001703-T01-IM3',
 'P-0019109-T01-IM6',
 'P-0005879-T01-IM5',
 'P-0026620-T01-IM6',
 'P-0015758-T01-IM6',
 'P-0020586-T01-IM6',
 'P-0011257-T01-IM5',
 'P-0006952-T01-IM5',
 'P-0035180-T01-IM6',
 'P-0006678-T01-IM5',
 'P-0020817-T01-IM6',
 'P-0030875-T01-IM6',
 'P-0033886-T01-IM6',
 'P-0002900-T01-IM3',
 'P-0031394-T01-IM6',
 'P-0024482-T01-IM6',
 'P-0017893-T01-IM6',
 'P-0032589-T01-IM6',
 'P-0034578-T01-IM6',
 'P-0024293-T01-IM6',
 'P-0000698-T01-IM3',
 'P-0019741-T01-IM6',
 'P-0028465-T01-IM6',
 'P-0010555-T01-IM5',
 'P-0032229-T01-IM6',
 'P-0019807-T01-IM6',
 'P-0015626-T01-IM6',
 'P-0017011-T01-IM6',
 'P-0016554-T01-IM6',
 'P-0028288-T01-IM6',
 'P-0030653-T02-IM6',
 'P-0030653-T03-IM6',
 'P-0034982-T01-IM6',
 'P-0031492-T01-IM6',
 'P-0036844-T01-IM6',
 'P-0012670-T01-IM5',
 'P-0000807-T01-IM3',
 'P-0027762-T01-IM6',
 'P-0021930-T01-IM6',
 'P-0018631-T01-IM6',
 'P-0030684-T01-IM6',
 'P-0018939-T02-IM6',
 'P-0025405-T01-IM6',
 'P-0013676-T01-IM5',
 'P-0035770-T01-IM6',
 'P-0028362-T01-IM6',
 'P-0011183-T01-IM5',
 'P-0001516-T02-IM5',
 'P-0012875-T01-IM5',
 'P-0016773-T01-IM6',
 'P-0002979-T01-IM3',
 'P-0031234-T01-IM6',
 'P-0021606-T01-IM6',
 'P-0034928-T01-IM6',
 'P-0027421-T01-IM6',
 'P-0028196-T02-IM6',
 'P-0033783-T01-IM6',
 'P-0032814-T01-IM6',
 'P-0028197-T01-IM6',
 'P-0036564-T01-IM6',
 'P-0023013-T01-IM6',
 'P-0033375-T03-IM6',
 'P-0015939-T01-IM6',
 'P-0032874-T01-IM6',
 'P-0037073-T01-IM6',
 'P-0003008-T01-IM3',
 'P-0004738-T01-IM5',
 'P-0006960-T01-IM5',
 'P-0022703-T01-IM6',
 'P-0021158-T01-IM6',
 'P-0018846-T01-IM6',
 'P-0019846-T01-IM6',
 'P-0007426-T01-IM5',
 'P-0019684-T01-IM6',
 'P-0019076-T01-IM6',
 'P-0012726-T01-IM5',
 'P-0015288-T01-IM6',
 'P-0025921-T01-IM6',
 'P-0018177-T01-IM6',
 'P-0010308-T01-IM5',
 'P-0024368-T01-IM6',
 'P-0007782-T01-IM5',
 'P-0013822-T01-IM5',
 'P-0034237-T01-IM6',
 'P-0024749-T01-IM6',
 'P-0032479-T01-IM6',
 'P-0031025-T01-IM6',
 'P-0037442-T01-IM6',
 'P-0029780-T01-IM6',
 'P-0007788-T03-IM6',
 'P-0035471-T01-IM6',
 'P-0035502-T01-IM6',
 'P-0002506-T02-IM5',
 'P-0019984-T01-IM6',
 'P-0026962-T01-IM6',
 'P-0015080-T01-IM6',
 'P-0030181-T01-IM6',
 'P-0012556-T01-IM5',
 'P-0032010-T01-IM6',
 'P-0007296-T01-IM5',
 'P-0033222-T01-IM6',
 'P-0032590-T01-IM6',
 'P-0019553-T01-IM6',
 'P-0015851-T01-IM6',
 'P-0032534-T01-IM6',
 'P-0012358-T01-IM5',
 'P-0026690-T01-IM6',
 'P-0030323-T01-IM6',
 'P-0031581-T01-IM6',
 'P-0023403-T01-IM6',
 'P-0004688-T01-IM5',
 'P-0022534-T01-IM6',
 'P-0003425-T01-IM5',
 'P-0034926-T01-IM6',
 'P-0034189-T01-IM6',
 'P-0021754-T01-IM6',
 'P-0029380-T01-IM6',
 'P-0001215-T01-IM3',
 'P-0001375-T01-IM3',
 'P-0001734-T01-IM3',
 'P-0002540-T01-IM3',
 'P-0002989-T01-IM3',
 'P-0003154-T01-IM5',
 'P-0003580-T01-IM5',
 'P-0003938-T01-IM5',
 'P-0004710-T01-IM5',
 'P-0004781-T01-IM5',
 'P-0006636-T01-IM5',
 'P-0007147-T01-IM5',
 'P-0007240-T01-IM5',
 'P-0007514-T01-IM5',
 'P-0009437-T01-IM5',
 'P-0009863-T01-IM5',
 'P-0011077-T01-IM5',
 'P-0011308-T01-IM5',
 'P-0013587-T01-IM5',
 'P-0014649-T01-IM6',
 'P-0016130-T01-IM6',
 'P-0016751-T02-IM6',
 'P-0017786-T01-IM6',
 'P-0017896-T01-IM6',
 'P-0020143-T01-IM6',
 'P-0020297-T01-IM6',
 'P-0037746-T01-IM6',
 'P-0037799-T01-IM6',
 'P-0008322-T06-IM6',
 'P-0037919-T01-IM6',
 'P-0036358-T01-IM6',
 'P-0032756-T01-IM6',
 'P-0034995-T01-IM6',
 'P-0037731-T01-IM6',
 'P-0038246-T01-IM6',
 'P-0030094-T03-IM6',
 'P-0038557-T01-IM6',
 'P-0038584-T01-IM6',
 'P-0038410-T01-IM6',
 'P-0038677-T01-IM6',
 'P-0038816-T01-IM6',
 'P-0038840-T01-IM6',
 'P-0039214-T01-IM6',
 'P-0039059-T01-IM6',
 'P-0039322-T01-IM6',
 'P-0039547-T01-IM6',
 'P-0039459-T01-IM6',
 'P-0039872-T01-IM6',
 'P-0039666-T01-IM6',
 'P-0039773-T01-IM6',
 'P-0039660-T01-IM6',
 'P-0040023-T01-IM6',
 'P-0039443-T02-IM6',
 'P-0040021-T01-IM6',
 'P-0039884-T01-IM6',
 'P-0040147-T01-IM6',
 'P-0040294-T01-IM6',
 'P-0015009-T02-IM6',
 'P-0035578-T02-IM6',
 'P-0040828-T01-IM6',
 'P-0040961-T01-IM6',
 'P-0030706-T02-IM6',
 'P-0040992-T01-IM6',
 'P-0041210-T01-IM6',
 'P-0041287-T01-IM6',
 'P-0041317-T01-IM6',
 'P-0041341-T01-IM6',
 'P-0041324-T01-IM6',
 'P-0041281-T01-IM6',
 'P-0035205-T03-IM6',
 'P-0035205-T03-IM6',
 'P-0041662-T01-IM6',
 'P-0038098-T02-IM6',
 'P-0042133-T01-IM6',
 'P-0042226-T01-IM6',
 'P-0042270-T01-IM6',
 'P-0042390-T01-IM6',
 'P-0042601-T01-IM6',
 'P-0042865-T01-IM6',
 'P-0039983-T01-IM6',
 'P-0043062-T01-IM6',
 'P-0024900-T02-IM6',
 'P-0043242-T01-IM6',
 'P-0043370-T01-IM6',
 'P-0044060-T01-IM6',
 'P-0044059-T01-IM6',
 'P-0032246-T02-IM6',
 'P-0044388-T01-IM6',
 'P-0008322-T07-IM6',
 'P-0031022-T02-IM6',
 'P-0044416-T01-IM6',
 'P-0044658-T01-IM6',
 'P-0045015-T01-IM6',
 'P-0044858-T01-IM6',
 'P-0026663-T01-IM6',
 'P-0033317-T01-IM6',
 'P-0025948-T02-IM6',
 'P-0021519-T02-IM6',
 'P-0021519-T01-IM6',
 'P-0029805-T01-IM6',
 'P-0029522-T01-IM6',
 'P-0038091-T01-IM6',
 'P-0032669-T01-IM6',
 'P-0028776-T01-IM6',
 'P-0037870-T01-IM6',
 'P-0020805-T01-IM6',
 'P-0027705-T01-IM6',
 'P-0045136-T01-IM6',
 'P-0045458-T01-IM6',
 'P-0045378-T01-IM6',
 'P-0045687-T01-IM6',
 'P-0045970-T01-IM6',
 'P-0046008-T01-IM6',
 'P-0046075-T01-IM6',
 'P-0045809-T02-IM6',
 'P-0046225-T01-IM6',
 'P-0046612-T01-IM6',
 'P-0046840-T01-IM6',
 'P-0045684-T02-IM6',
 'P-0046620-T01-IM6',
 'P-0039113-T01-IM6',
 'P-0047158-T01-IM6',
 'P-0047256-T01-IM6',
 'P-0047260-T01-IM6',
 'P-0047379-T01-IM6',
 'P-0047200-T01-IM6',
 'P-0037730-T03-IM6',
 'P-0047599-T01-IM6',
 'P-0046583-T01-IM6',
 'P-0041146-T01-IM6',
 'P-0037838-T02-IM6',
 'P-0041346-T01-IM6',
 'P-0047901-T01-IM6',
 'P-0047768-T01-IM6',
 'P-0047793-T01-IM6',
 'P-0048318-T01-IM6',
 'P-0048245-T01-IM6',
 'P-0048167-T01-IM6',
 'P-0035578-T03-IM6',
 'P-0048298-T02-IM6',
 'P-0048298-T01-IM6',
 'P-0047799-T01-IM6',
 'P-0048356-T01-IM6',
 'P-0047945-T01-IM6',
 'P-0047035-T02-IM6',
 'P-0045366-T01-IM6',
 'P-0047971-T01-IM6',
 'P-0048545-T01-IM6',
 'P-0034471-T03-IM6',
 'P-0034471-T03-IM6',
 'P-0049661-T01-IM6',
 'P-0049570-T01-IM6',
 'P-0049376-T01-IM6',
 'P-0048700-T01-IM6',
 'P-0049202-T01-IM6',
 'P-0049114-T02-IM6',
 'P-0048765-T01-IM6',
 'P-0049128-T01-IM6',
 'P-0049409-T01-IM6',
 'P-0049380-T01-IM6',
 'P-0025736-T03-IM6',
 'P-0026308-T01-IM6',
 'P-0050500-T01-IM6',
 'P-0050442-T01-IM6',
 'P-0050657-T01-IM6']:
    if tumor_id not in list(set(maf_cohort_final_.Tumor_Id)):
        print(tumor_id)

P-0035205-T03-IM6
P-0035205-T03-IM6


In [308]:
samples_mult = ['P-0009112-T01-IM5',
 'P-0017428-T01-IM6',
 'P-0005766-T01-IM5',
 'P-0012881-T01-IM5',
 'P-0008972-T01-IM5',
 'P-0014135-T02-IM5',
 'P-0010332-T01-IM5',
 'P-0036454-T01-IM6',
 'P-0013417-T01-IM5',
 'P-0021285-T01-IM6']

for sample in samples_mult:
    print(maf_cohort_final_[maf_cohort_final_['Tumor_Id'] == sample]['Hugo_Symbol'])

3127    AKT1
3128    TP53
3129    NSD1
Name: Hugo_Symbol, dtype: object
3365     TP53
3366    CDK12
3367    NCOR1
3368    RUNX1
Name: Hugo_Symbol, dtype: object
5293      TP53
5294    DICER1
5295     AXIN2
5296       SRC
5297       APC
Name: Hugo_Symbol, dtype: object
263224       TP53
263225       SPOP
263226       PTEN
263227       TP53
263228       POLE
263229        AXL
263230      EPHA5
263231       ESR1
263232       PTEN
263233     NFE2L2
263234       SDHD
263235      CASP8
263236        NF1
263237      PTPRT
263238      LATS1
263239      CASP8
263240        RB1
263241      PPM1D
263242      SF3B1
263243     CARD11
263244      XRCC2
263245    MAP3K13
263246       MSH6
263247      FBXW7
263249      ASXL2
263250     RICTOR
263251       PTEN
263252     ARID1A
263253     ARID5B
263254     INPP4A
263255       NSD1
263256      STAG2
263257       MDC1
263258     PIK3CA
263259      CDC73
263260       NSD1
263261      ERCC3
263262       TET1
263263     MAP3K1
263264      PALB2
263265     

In [305]:
len(set(maf_cohort_final_.Tumor_Id))

29382

In [260]:
h = maf_cohort_final[maf_cohort_final['mutationStatus'] == 'NA'][maf_cohort_final['Hugo_Symbol'] == 'TP53']['Tumor_Id'].tolist()
h

['P-0003867-T01-IM5',
 'P-0026701-T01-IM6',
 'P-0033092-T01-IM6',
 'P-0016168-T01-IM6',
 'P-0004827-T01-IM5',
 'P-0030800-T01-IM6',
 'P-0032474-T01-IM6',
 'P-0025686-T01-IM6',
 'P-0025686-T02-IM6',
 'P-0037395-T01-IM6',
 'P-0035227-T01-IM6',
 'P-0008695-T03-IM6',
 'P-0032533-T01-IM6',
 'P-0004140-T02-IM5',
 'P-0004140-T03-IM5',
 'P-0036760-T01-IM6',
 'P-0031933-T01-IM6',
 'P-0016643-T01-IM6',
 'P-0024672-T01-IM6',
 'P-0034440-T01-IM6',
 'P-0020430-T01-IM6',
 'P-0021318-T01-IM6',
 'P-0027735-T01-IM6',
 'P-0031946-T01-IM6',
 'P-0004387-T02-IM5',
 'P-0007169-T01-IM5',
 'P-0017265-T01-IM6',
 'P-0012776-T01-IM5',
 'P-0020606-T01-IM6',
 'P-0033821-T01-IM6',
 'P-0035527-T01-IM6',
 'P-0030493-T01-IM6',
 'P-0021358-T01-IM6',
 'P-0031409-T01-IM6',
 'P-0031409-T02-IM6',
 'P-0031284-T01-IM6',
 'P-0002354-T02-IM6',
 'P-0009044-T01-IM5',
 'P-0012597-T01-IM5',
 'P-0027135-T01-IM6',
 'P-0018462-T01-IM6',
 'P-0028435-T01-IM6',
 'P-0012773-T01-IM5',
 'P-0035373-T01-IM6',
 'P-0009743-T01-IM5',
 'P-000970

In [261]:
a = maf_cohort_final[maf_cohort_final['Tumor_Id'] == 'P-0050657-T01-IM6']
a = a[a['mutationStatus'] != 'GERMLINE']
a

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
286987,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,,2.0,-0.005398,0.0,P-0050657,82.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,3.0,Metastasis,1.0,LIVING,2.005,0.05,2.6,Matched,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,SOMATIC,R249G,7577536.0,7577536.0,T,C,17,TP53,26.0,689.0,17_7577536_T_C,P-0050657-T01-IM6_17_7577536_T_C,249,0.0363636,P-0050657-T01-IM6_P-0050657-N01-IM6_17_7577536_T_C,,,,
286988,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,,2.0,-0.005398,0.0,P-0050657,82.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,3.0,Metastasis,1.0,LIVING,2.005,0.05,2.6,Matched,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,SOMATIC,M246L,7577545.0,7577545.0,T,A,17,TP53,24.0,677.0,17_7577545_T_A,P-0050657-T01-IM6_17_7577545_T_A,246,0.0342368,P-0050657-T01-IM6_P-0050657-N01-IM6_17_7577545_T_A,,,,
286989,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,,2.0,-0.005398,0.0,P-0050657,82.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,3.0,Metastasis,1.0,LIVING,2.005,0.05,2.6,Matched,"{'entrezGeneId': 4233, 'hugoGeneSymbol': 'MET', 'type': 'protein-coding'}",4233.0,Splice_Region,SOMATIC,X1010_splice,116412046.0,116412046.0,A,G,7,MET,366.0,851.0,7_116412046_A_G,P-0050657-T01-IM6_7_116412046_A_G,1010,0.30074,P-0050657-T01-IM6_P-0050657-N01-IM6_7_116412046_A_G,,,,


In [262]:
len(set(maf_cohort_final[maf_cohort_final['Tumor_Id'].isin(samples_maxvaf)][maf_cohort_final['mutationStatus'] == 'GERMLINE']['Tumor_Id']))

70

In [263]:
master = load_clean_up_master(data_path + 'merged_data/master_file.pkl')
len(master)

26931

In [264]:
get_groupby(master[master['Tumor_Id'].isin(h)], 'tp53_cn_state', 'count')

Unnamed: 0_level_0,count
tp53_cn_state,Unnamed: 1_level_1
CNLOH,6
CNLOH BEFORE & LOSS,5
DIPLOID,7
DOUBLE LOSS AFTER,1
GAIN,2
HETLOSS,34
HOMDEL,4
INDETERMINATE,18
LOSS AFTER,2
LOSS BEFORE,9


In [265]:
get_groupby(maf_cohort_final, 'mutationStatus', 'count')

Unnamed: 0_level_0,count
mutationStatus,Unnamed: 1_level_1
GERMLINE,1532
,6692
SOMATIC,258851
UNKNOWN,2275


In [266]:
get_groupby(maf_cohort_final, 'Variant_Classification', 'count')

Unnamed: 0_level_0,count
Variant_Classification,Unnamed: 1_level_1
5'Flank,3896
Frame_Shift_Del,23680
Frame_Shift_Ins,9046
Fusion,6777
In_Frame_Del,4560
In_Frame_Ins,949
Missense_Mutation,185303
Nonsense_Mutation,25072
Nonstop_Mutation,166
Splice_Region,332


In [267]:
maf_cohort_final[maf_cohort_final['mutationStatus'] == 'UNKNOWN']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
2326,P-0000806-T01-IM3_P-0000806-N01-IM3,P-0000806-T01-IM3,0.486670,3.877389,-0.542837,0.120,P-0000806,61.0,Breast Cancer,"Breast Invasive Cancer, NOS",Non-Spanish; Non-Hispanic,WHITE,Female,3.0,Metastasis,1.0,DECEASED,45.567,1.62,3.3,Matched,"{'entrezGeneId': 8314, 'hugoGeneSymbol': 'BAP1', 'type': 'protein-coding'}",8314.0,Missense_Mutation,UNKNOWN,A217T,52440855.0,52440855.0,C,T,3,BAP1,77.0,293.0,3_52440855_C_T,P-0000806-T01-IM3_3_52440855_C_T,217,0.208108,P-0000806-T01-IM3_P-0000806-N01-IM3_3_52440855_C_T,,,,
2327,P-0000806-T01-IM3_P-0000806-N01-IM3,P-0000806-T01-IM3,0.486670,3.877389,-0.542837,0.120,P-0000806,61.0,Breast Cancer,"Breast Invasive Cancer, NOS",Non-Spanish; Non-Hispanic,WHITE,Female,3.0,Metastasis,1.0,DECEASED,45.567,1.62,3.3,Matched,"{'entrezGeneId': 6598, 'hugoGeneSymbol': 'SMARCB1', 'type': 'protein-coding'}",6598.0,Missense_Mutation,UNKNOWN,A238D,24159041.0,24159041.0,C,A,22,SMARCB1,62.0,350.0,22_24159041_C_A,P-0000806-T01-IM3_22_24159041_C_A,238,0.150485,P-0000806-T01-IM3_P-0000806-N01-IM3_22_24159041_C_A,,,,
2328,P-0000806-T01-IM3_P-0000806-N01-IM3,P-0000806-T01-IM3,0.486670,3.877389,-0.542837,0.120,P-0000806,61.0,Breast Cancer,"Breast Invasive Cancer, NOS",Non-Spanish; Non-Hispanic,WHITE,Female,3.0,Metastasis,1.0,DECEASED,45.567,1.62,3.3,Matched,"{'entrezGeneId': 54880, 'hugoGeneSymbol': 'BCOR', 'type': 'protein-coding'}",54880.0,Missense_Mutation,UNKNOWN,R540Q,39932980.0,39932980.0,C,T,23,BCOR,149.0,559.0,23_39932980_C_T,P-0000806-T01-IM3_23_39932980_C_T,540,0.210452,P-0000806-T01-IM3_P-0000806-N01-IM3_23_39932980_C_T,,,,
4971,P-0002304-T01-IM3_P-0002304-N01-IM3,P-0002304-T01-IM3,0.563198,1.703556,0.125759,0.300,P-0002304,76.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,5.0,Metastasis,1.0,LIVING,55.529,0.43,5.6,Matched,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,Missense_Mutation,UNKNOWN,R219S,38061334.0,38061334.0,G,T,14,FOXA1,435.0,1076.0,14_38061334_G_T,P-0002304-T01-IM3_14_38061334_G_T,219,0.287889,P-0002304-T01-IM3_P-0002304-N01-IM3_14_38061334_G_T,,,,
4972,P-0002304-T01-IM3_P-0002304-N01-IM3,P-0002304-T01-IM3,0.563198,1.703556,0.125759,0.300,P-0002304,76.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,5.0,Metastasis,1.0,LIVING,55.529,0.43,5.6,Matched,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,Frame_Shift_Del,UNKNOWN,S285Rfs*26,38061104.0,38061134.0,CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG,-,14,FOXA1,51.0,126.0,14_38061104_CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG_-,P-0002304-T01-IM3_14_38061104_CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG_-,285,0.288136,P-0002304-T01-IM3_P-0002304-N01-IM3_14_38061104_CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG_-,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277806,P-0018865-T01-IM6_P-0018865-N01-IM6,P-0018865-T01-IM6,,2.000000,-0.564619,0.081,P-0018865,73.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Other Spanish/Hispanic(incl European; excl Dom Rep,OTHER,Male,10.0,Primary,2.0,LIVING,19.726,13.38,9.7,Unmatched,"{'entrezGeneId': 7048, 'hugoGeneSymbol': 'TGFBR2', 'type': 'protein-coding'}",7048.0,Missense_Mutation,UNKNOWN,E453Q,30715624.0,30715624.0,G,C,3,TGFBR2,13.0,210.0,3_30715624_G_C,P-0018865-T01-IM6_3_30715624_G_C,453,0.058296,P-0018865-T01-IM6_P-0018865-N01-IM6_3_30715624_G_C,,,,
277807,P-0018865-T01-IM6_P-0018865-N01-IM6,P-0018865-T01-IM6,,2.000000,-0.564619,0.081,P-0018865,73.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Other Spanish/Hispanic(incl European; excl Dom Rep,OTHER,Male,10.0,Primary,2.0,LIVING,19.726,13.38,9.7,Unmatched,"{'entrezGeneId': 151987, 'hugoGeneSymbol': 'PPP4R2', 'type': 'protein-coding'}",151987.0,Missense_Mutation,UNKNOWN,E209K,73113284.0,73113284.0,G,A,3,PPP4R2,22.0,224.0,3_73113284_G_A,P-0018865-T01-IM6_3_73113284_G_A,209,0.0894309,P-0018865-T01-IM6_P-0018865-N01-IM6_3_73113284_G_A,,,,
277808,P-0018865-T01-IM6_P-0018865-N01-IM6,P-0018865-T01-IM6,,2.000000,-0.564619,0.081,P-0018865,73.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Other Spanish/Hispanic(incl European; excl Dom Rep,OTHER,Male,10.0,Primary,2.0,LIVING,19.726,13.38,9.7,Unmatched,"{'entrezGeneId': 668, 'hugoGeneSymbol': 'FOXL2', 'type': 'protein-coding'}",668.0,Missense_Mutation,UNKNOWN,A230V,138664876.0,138664876.0,G,A,3,FOXL2,13.0,152.0,3_138664876_G_A,P-0018865-T01-IM6_3_138664876_G_A,230,0.0787879,P-0018865-T01-IM6_P-0018865-N01-IM6_3_138664876_G_A,,,,
277809,P-0018865-T01-IM6_P-0018865-N01-IM6,P-0018865-T01-IM6,,2.000000,-0.564619,0.081,P-0018865,73.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Other Spanish/Hispanic(incl European; excl Dom Rep,OTHER,Male,10.0,Primary,2.0,LIVING,19.726,13.38,9.7,Unmatched,"{'entrezGeneId': 545, 'hugoGeneSymbol': 'ATR', 'type': 'protein-coding'}",545.0,Missense_Mutation,UNKNOWN,H1166R,142259830.0,142259830.0,T,C,3,ATR,240.0,311.0,3_142259830_T_C,P-0018865-T01-IM6_3_142259830_T_C,1166,0.435572,P-0018865-T01-IM6_P-0018865-N01-IM6_3_142259830_T_C,,,,


In [268]:
total_mut_keys = maf_cohort_final["total_mut_key"]
h = maf_cohort_final[total_mut_keys.isin(total_mut_keys[total_mut_keys.duplicated()])]#.sort("ID")
#get_groupby(h, 'oncogenic', 'count')
h[h['oncogenic'] == 'Oncogenic']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot


In [269]:
maf_cohort_final[maf_cohort_final['total_mut_key'] == 'P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
7,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,Matched,"{'entrezGeneId': 2078, 'hugoGeneSymbol': 'ERG', 'type': 'protein-coding'}",2078.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,ERG,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,,P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_,,,,


In [270]:
len(maf_cohort[maf_cohort['alt_count'] == -1.0])

13550

In [271]:
maf_cohort[maf_cohort['Tumor_Id'] == 'P-0009819-T01-IM5']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key
6,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,Matched,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,In_Frame_Ins,SOMATIC,G157dup,38061516.0,38061517.0,-,CGC,14.0,FOXA1,41.0,236.0,14_38061516_-_CGC,P-0009819-T01-IM5_14_38061516_-_CGC,157,0.148014,P-0009819-T01-IM5_P-0009819-N01-IM5_14_38061516_-_CGC
7,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,Matched,"{'entrezGeneId': 2078, 'hugoGeneSymbol': 'ERG', 'type': 'protein-coding'}",2078.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,ERG,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,,P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_
8,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,Matched,"{'entrezGeneId': 7113, 'hugoGeneSymbol': 'TMPRSS2', 'type': 'protein-coding'}",7113.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,TMPRSS2,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,,P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_


In [272]:
maf_cohort_unique = maf_cohort.drop_duplicates('Patient_Id')
print('Number of cohort patients without cancer type information: '+str(maf_cohort_unique['Cancer_Type'].isna().sum()))

Number of cohort patients without cancer type information: 9


In [273]:
len(set(maf_cohort.Patient_Id))

27435

In [274]:
cohort_patients = set(cohort.Patient_Id)
cbioportal_patients = set(clinical_data['Patient ID'])
maf_patients = set(maf_cohort['Patient_Id'])
mutation_patients = set(mutations_filtered['patientId'])
print(len(cohort_patients - mutation_patients))

885


In [275]:
clinical_data = pd.read_csv(data_path + 'cbioportal/raw/mskimpact_clinical_data-2.tsv', sep= '\t')
clinical_data[clinical_data['Patient ID'] == 'P-0003702']

Unnamed: 0,Study ID,Patient ID,Sample ID,Diagnosis Age,Age at Which Sequencing was Reported (Days),Patient Current Age,Archer Panel,Cancer Type,Cancer Type Detailed,CRDB_ADJ_TXT,CRDB_BASIC_COMMENTS,CRDB_BRAINMET,CRDB_CONSENT_DATE_DAYS,CRDB_ECOG,CRDB_NOSYSTXT,CRDB_OFF_STUDY_DAYS,CRDB_PRIOR_RX,CRDB_SURVEY_COMMENTS,CRDB_SURVIVAL_STATUS,CRDB_TREATMENT_END_DAYS,Impact TMB Percentile (Across All Tumor Types),Impact TMB Score,Impact TMB Percentile (Within Tumor Type),Date added to cBioPortal,Disease Free (Months),Disease Free Status,Ethnicity Category,Fraction Genome Altered,Gene Panel,Neoplasm Histologic Type Name,Institute Source,Metastatic Site,MGMT Status,Month added to cBioPortal,MSI Comment,MSI Score,MSI Type,MSK Slide ID,Mutation Count,Oncotree Code,Overall Survival (Months),Overall Survival Months Reported by DMT,Overall Survival Status,Overall Status Reported by DMT,Other Patient ID,12-245 Part A Consented,12-245 Part C Consented,MSK Pathology Slide Available,Pediatric Case Indicator,Primary Tumor Site,Race Category,Religion,Sample Class,Number of Samples Per Patient,Sample coverage,Sample Type,Sex,Sex Reported by DMT,Somatic Status,SO comments,Tumor Purity,Week added to cBioPortal,WHO Grade
3929,mskimpact,P-0003702,P-0003702-T02-IM5,,52,55.0,NO,Breast Cancer,Breast Invasive Ductal Carcinoma,NO,,NO,18409.0,0.0,1.0,,NO,,Alive,,59.6,4.9,69.6,2018/11/02,,,Non-Spanish; Non-Hispanic,0.1945,IMPACT410,,MSKCC,Epidural Mass,,2018/11,MICROSATELLITE STABLE (MSS). See MSI note below.,0.08,Stable,,5,IDC,58.257,,LIVING,,,YES,NO,NO,No,Breast,WHITE,CATHOLIC/ROMAN,Tumor,1,191,Metastasis,Female,,Matched,,20,"2018, Wk. 44",


In [276]:
mutations_filtered[mutations_filtered['patientId'] == 'P-0002760']

Unnamed: 0,sampleId,patientId,gene,entrezGeneId,mutationType,mutationStatus,proteinChange,startPosition,endPosition,referenceAllele,variantAllele,chr,hugoGeneSymbol,tumorAltCount,tumorRefCount,mut_key,sample_mut_key,mut_spot,vaf


In [277]:
maf_cohort.vaf.isna().sum()

1070

In [278]:
mutations = pd.read_pickle(data_path + 'cbioportal/raw/mutations_cohort.pkl')

def cond(x):
    return list(x.gene.values())[1]

mutations['hugo_gene_symbol'] = mutations.apply(cond, axis=1)

In [279]:
mutations.to_csv(data_path + 'cbioportal/raw/mutations_cohort.tsv')

KeyboardInterrupt: 

In [None]:
list(maf_cohort[maf_cohort['mutationStatus'] == 'GERMLINE'][maf_cohort['Hugo_Symbol'] == 'TP53'].Sample_Id)

In [None]:
annotated_data = pd.read_pickle(data_path + 'maf_cohort_annotated.pkl')
annotated_samples = list(set(annotated_data.Sample_Id))
maf_samples = list(set(maf_cohort.Sample_Id))
print('Samples',len(annotated_samples), len(maf_samples))
print('Number of mutations',len(annotated_data), len(maf_cohort))

In [None]:
annotated_data.head(3)

In [None]:
ARID = annotated_data[annotated_data['Hugo_Symbol'] == 'ARID1A']
ARID_Miss = ARID[ARID['Variant_Classification'] == 'Missense_Mutation']
ARID_Non = ARID[ARID['Variant_Classification'] == 'Nonsense_Mutation']
get_groupby(ARID_Non, 'oncogenic', 'count')

In [None]:
get_groupby(annotated_data, 'mutationEffect', 'count')

In [None]:
get_groupby(annotated_data, 'mutationEffect', 'count')

In [None]:
maf_cohort_final.mutationStatus

In [None]:
master = load_clean_up_master(data_path + 'merged_data/master_file.pkl')
h = list(master[master[''']>=1]['Tumor_Id'])
clinical_data_tp53 = clinical_data[clinical_data['Sample ID'].isin(h)]

In [None]:
print(h[0])

In [None]:
get_groupby(clinical_data_tp53, 'Somatic Status', 'count')

In [None]:
get_groupby(clinical_data, 'Somatic Status', 'count')

In [7]:
mutations

Unnamed: 0,uniqueSampleKey,uniquePatientKey,molecularProfileId,sampleId,patientId,entrezGeneId,gene,studyId,center,mutationStatus,validationStatus,tumorAltCount,tumorRefCount,normalAltCount,normalRefCount,startPosition,endPosition,referenceAllele,proteinChange,mutationType,functionalImpactScore,fisValue,linkXvar,linkPdb,linkMsa,ncbiBuild,variantType,keyword,driverFilter,driverFilterAnnotation,driverTiersFilter,driverTiersFilterAnnotation,chr,variantAllele,refseqMrnaId,proteinPosStart,proteinPosEnd,hugoGeneSymbol,type
0,UC0wMDAwMDA0LVQwMS1JTTM6bXNraW1wYWN0,UC0wMDAwMDA0Om1za2ltcGFjdA,mskimpact_mutations,P-0000004-T01-IM3,P-0000004,207,"{'entrezGeneId': 207, 'hugoGeneSymbol': 'AKT1', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,244.0,202.0,1.0,711.0,105246551,105246551,C,E17K,Missense_Mutation,M,2.175000e+00,"getma.org/?cm=var&var=hg19,14,105246551,C,T&fts=all",getma.org/pdb.php?prot=AKT1_HUMAN&from=6&to=108&var=E17K,getma.org/?cm=msa&ty=f&p=AKT1_HUMAN&rb=6&re=108&var=E17K,GRCh37,SNP,AKT1 E17 missense,,,,,14,T,,17,17,AKT1,protein-coding
1,UC0wMDAwMDA0LVQwMS1JTTM6bXNraW1wYWN0,UC0wMDAwMDA0Om1za2ltcGFjdA,mskimpact_mutations,P-0000004-T01-IM3,P-0000004,7157,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,58.0,209.0,0.0,600.0,7578503,7578518,CAGGGCAGGTCTTGGC,A138Cfs*27,Frame_Shift_Del,,1.401300e-45,,,,GRCh37,DEL,TP53 truncating,,,,,17,-,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_0011",138,143,TP53,protein-coding
2,UC0wMDAwMDA0LVQwMS1JTTM6bXNraW1wYWN0,UC0wMDAwMDA0Om1za2ltcGFjdA,mskimpact_mutations,P-0000004-T01-IM3,P-0000004,23013,"{'entrezGeneId': 23013, 'hugoGeneSymbol': 'SPEN', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,73.0,400.0,0.0,1071.0,16265908,16265908,A,I3661F,Missense_Mutation,M,2.275000e+00,"getma.org/?cm=var&var=hg19,1,16265908,A,T&fts=all",getma.org/pdb.php?prot=MINT_HUMAN&from=3498&to=3664&var=I3661F,getma.org/?cm=msa&ty=f&p=MINT_HUMAN&rb=3498&re=3664&var=I3661F,GRCh37,SNP,SPEN I3661 missense,,,,,1,T,NM_015001.2,3661,3661,SPEN,protein-coding
3,UC0wMDAwMDA0LVQwMS1JTTM6bXNraW1wYWN0,UC0wMDAwMDA0Om1za2ltcGFjdA,mskimpact_mutations,P-0000004-T01-IM3,P-0000004,58508,"{'entrezGeneId': 58508, 'hugoGeneSymbol': 'KMT2C', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,11.0,84.0,0.0,193.0,151945083,151945083,C,M812I,Missense_Mutation,L,8.050000e-01,"getma.org/?cm=var&var=hg19,7,151945083,C,T&fts=all",,getma.org/?cm=msa&ty=f&p=MLL3_HUMAN&rb=639&re=838&var=M812I,GRCh37,SNP,KMT2C M812 missense,,,,,7,T,NM_170606.2,812,812,KMT2C,protein-coding
4,UC0wMDAwMDEyLVQwMi1JTTM6bXNraW1wYWN0,UC0wMDAwMDEyOm1za2ltcGFjdA,mskimpact_mutations,P-0000012-T02-IM3,P-0000012,7157,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,114.0,113.0,0.0,569.0,7577515,7577515,T,T256P,Missense_Mutation,M,3.140000e+00,"getma.org/?cm=var&var=hg19,17,7577515,T,G&fts=all",getma.org/pdb.php?prot=P53_HUMAN&from=95&to=289&var=T256P,getma.org/?cm=msa&ty=f&p=P53_HUMAN&rb=95&re=289&var=T256P,GRCh37,SNP,TP53 T256 missense,,,,,17,G,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_0011",256,256,TP53,protein-coding
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
411354,UC0wMDUyMzY5LVQwMS1YUzE6bXNraW1wYWN0,UC0wMDUyMzY5Om1za2ltcGFjdA,mskimpact_mutations,P-0052369-T01-XS1,P-0052369,3845,"{'entrezGeneId': 3845, 'hugoGeneSymbol': 'KRAS', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,57.0,3616.0,0.0,1306.0,25398281,25398282,CC,G13F,Missense_Mutation,,1.401300e-45,,,,GRCh37,DNP,KRAS G13 missense,,,,,12,AA,NM_033360.2,13,13,KRAS,protein-coding
411355,UC0wMDUyMzY5LVQwMS1YUzE6bXNraW1wYWN0,UC0wMDUyMzY5Om1za2ltcGFjdA,mskimpact_mutations,P-0052369-T01-XS1,P-0052369,4089,"{'entrezGeneId': 4089, 'hugoGeneSymbol': 'SMAD4', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,11.0,3529.0,0.0,985.0,48573529,48573529,G,R38T,Missense_Mutation,,1.401300e-45,,,,GRCh37,SNP,SMAD4 R38 missense,,,,,18,C,NM_005359.5,38,38,SMAD4,protein-coding
411356,UC0wMDUyMzY5LVQwMS1YUzE6bXNraW1wYWN0,UC0wMDUyMzY5Om1za2ltcGFjdA,mskimpact_mutations,P-0052369-T01-XS1,P-0052369,55294,"{'entrezGeneId': 55294, 'hugoGeneSymbol': 'FBXW7', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,16.0,3519.0,0.0,1178.0,153253766,153253766,C,E323Q,Missense_Mutation,,1.401300e-45,,,,GRCh37,SNP,FBXW7 E323 missense,,,,,4,G,NM_033632.3,323,323,FBXW7,protein-coding
411357,UC0wMDUyMzY5LVQwMS1YUzE6bXNraW1wYWN0,UC0wMDUyMzY5Om1za2ltcGFjdA,mskimpact_mutations,P-0052369-T01-XS1,P-0052369,10320,"{'entrezGeneId': 10320, 'hugoGeneSymbol': 'IKZF1', 'type': 'protein-coding'}",mskimpact,MSKCC,SOMATIC,Unknown,11.0,4200.0,1.0,1212.0,50468030,50468030,C,A422E,Missense_Mutation,,1.401300e-45,,,,GRCh37,SNP,IKZF1 A422 missense,,,,,7,A,NM_006060.4,422,422,IKZF1,protein-coding
