# <span style='font-family:"Times New Roman"'> <span styel=''>**COHORT MAF FILE CREATION**
*Emile Cohen*
    
 *February 2020*

**Goal:** Through this notebook, we create a maf file composed of all patients in the cohort. As we do not have the mutations for all patients in impact-facets-tp53 datasets, we will merge the cohort file with cbioportal data.

The notebook is composed of 2 parts:
   * **1. Extraction of patients from the cohort**
   * **2. Creation of the MAF file from CbioPortal raw datasets**
---

In [3]:
%run -i '../../utils/setup_environment.ipy'

import warnings
warnings.filterwarnings('ignore')

data_path = '../../data/'

Setup environment... done!


<span style="color:green">✅ Working on **mskimpact_env** conda environment.</span>

---
# Extraction of patients from the cohort

In [40]:
cohort = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.cohort.txt', sep='\t')

In [41]:
cohort['Patient_Id'] = cohort['sample_id'].str[:9]
cohort['Tumor_Id'] = cohort['sample_id'].str[:17]

In [42]:
cohort.head()

Unnamed: 0,sample_id,sample_path,tumor_sample_id,path,fit_name,purity_run_version,purity_run_prefix,purity_run_Seed,purity_run_cval,purity_run_nhet,purity_run_Purity,purity_run_Ploidy,purity_run_dipLogR,purity_run_alBalLogR,hisens_run_version,hisens_run_prefix,hisens_run_Seed,hisens_run_cval,hisens_run_nhet,hisens_run_hisens,hisens_run_Ploidy,hisens_run_dipLogR,manual_note,is_best_fit,purity,ploidy,dipLogR,dipLogR_flag,n_alternative_dipLogR,n_dip_bal_segs,frac_dip_bal_segs,n_dip_imbal_segs,frac_dip_imbal_segs,n_amps,n_homdels,frac_homdels,n_homdels_clonal,frac_homdels_clonal,n_cn_states,n_segs,n_cnlr_clusters,n_lcn_na,n_loh,frac_loh,n_segs_subclonal,frac_segs_subclonal,n_snps,n_het_snps,frac_het_snps,n_het_snps_hom_in_tumor_1pct,n_het_snps_hom_in_tumor_5pct,frac_het_snps_hom_in_tumor_1pct,frac_het_snps_hom_in_tumor_5pct,mean_cnlr_residual,sd_cnlr_residual,n_segs_discordant_tcn,frac_discordant_tcn,n_segs_discordant_lcn,frac_discordant_lcn,n_segs_discordant_both,frac_discordant_both,n_segs_icn_cnlor_discordant,frac_icn_cnlor_discordant,homdel_filter_pass,diploid_bal_seg_filter_pass,diploid_imbal_seg_filter_pass,waterfall_filter_pass,hyper_seg_filter_pass,high_ploidy_filter_pass,valid_purity_filter_pass,diploid_seg_filter_pass,facets_suite_qc,arm_level_file,gene_level_file,ccf_file,arm_level_file_exists,gene_level_file_exists,ccf_file_exists,Patient_Id,Tumor_Id
0,P-0034223-T01-IM6_P-0034223-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6/,P-0034223-T01-IM6_P-0034223-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6_purity,100.0,100,15,0.94,2.24,-0.16,-0.16,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6_hisens,100.0,50.0,15.0,,2.24,-0.16,,False,0.941111,2.24183,-0.155483,False,0,12,0.59,0,0.0,0,0,0.0,0,0.0,6,31,10,2,4,0.062,1,1e-05,22963,2655,0.12,43,129,0.016,0.049,-0.19,0.63,1,0.0038,0.0,0.0,0,0.0,1,0.042,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.ccf.maf,True,True,True,P-0034223,P-0034223-T01-IM6
1,P-0009819-T01-IM5_P-0009819-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5/,P-0009819-T01-IM5_P-0009819-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5_purity,100.0,100,15,0.28,2.68,-0.13,-0.13,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5_hisens,100.0,50.0,15.0,,2.77,-0.13,,False,0.275237,2.681075,-0.129255,False,0,7,0.43,0,0.0,0,1,0.0062,1,0.0062,3,25,6,0,5,0.094,0,0.0,16527,2041,0.12,10,10,0.0049,0.0049,-0.25,0.83,1,0.0062,0.0,0.0,0,0.0,2,0.0063,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.ccf.maf,True,True,True,P-0009819,P-0009819-T01-IM5
2,P-0025956-T01-IM6_P-0025956-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6/,P-0025956-T01-IM6_P-0025956-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6_purity,100.0,100,15,0.19,3.5,-0.19,"-0.19, 0.02",0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6_hisens,100.0,50.0,15.0,,3.45,-0.19,,False,0.185874,3.496971,-0.187925,False,0,2,0.096,4,0.18,0,0,0.0,0,0.0,6,26,6,0,5,0.19,0,0.0,17971,2159,0.12,0,0,0.0,0.0,-0.052,0.3,2,0.015,0.0,0.0,3,0.12,8,0.3,True,True,True,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.ccf.maf,False,False,False,P-0025956,P-0025956-T01-IM6
3,P-0027408-T01-IM6_P-0027408-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6/,P-0027408-T01-IM6_P-0027408-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6_purity,100.0,100,15,0.31,1.81,0.04,0.04,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6_hisens,100.0,50.0,15.0,,1.82,0.04,,False,0.308886,1.811066,0.042724,False,0,7,0.28,1,0.035,0,0,0.0,0,0.0,4,31,6,0,12,0.34,0,0.0,18633,2163,0.12,0,0,0.0,0.0,-0.058,0.29,2,0.096,0.0,0.0,0,0.0,4,0.21,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.ccf.maf,True,True,True,P-0027408,P-0027408-T01-IM6
4,P-0006554-T01-IM5_P-0006554-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5/,P-0006554-T01-IM5_P-0006554-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5_purity,100.0,100,15,0.72,1.91,0.05,0.05,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5_hisens,100.0,50.0,15.0,,1.92,0.05,,False,0.715208,1.910719,0.046812,False,0,11,0.49,0,0.0,0,1,0.0074,0,0.0,6,30,12,2,6,0.088,6,0.15,16557,2041,0.12,1,7,0.00049,0.0034,-0.024,0.38,0,0.0,160000000.0,0.058,3,0.054,3,0.11,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.ccf.maf,True,True,True,P-0006554,P-0006554-T01-IM5


In [43]:
print_md('cohort columns:','green')
for column in cohort.columns: print(column)

<span style="color:green">cohort columns:</span>

sample_id
sample_path
tumor_sample_id
path
fit_name
purity_run_version
purity_run_prefix
purity_run_Seed
purity_run_cval
purity_run_nhet
purity_run_Purity
purity_run_Ploidy
purity_run_dipLogR
purity_run_alBalLogR
hisens_run_version
hisens_run_prefix
hisens_run_Seed
hisens_run_cval
hisens_run_nhet
hisens_run_hisens
hisens_run_Ploidy
hisens_run_dipLogR
manual_note
is_best_fit
purity
ploidy
dipLogR
dipLogR_flag
n_alternative_dipLogR
n_dip_bal_segs
frac_dip_bal_segs
n_dip_imbal_segs
frac_dip_imbal_segs
n_amps
n_homdels
frac_homdels
n_homdels_clonal
frac_homdels_clonal
n_cn_states
n_segs
n_cnlr_clusters
n_lcn_na
n_loh
frac_loh
n_segs_subclonal
frac_segs_subclonal
n_snps
n_het_snps
frac_het_snps
n_het_snps_hom_in_tumor_1pct
n_het_snps_hom_in_tumor_5pct
frac_het_snps_hom_in_tumor_1pct
frac_het_snps_hom_in_tumor_5pct
mean_cnlr_residual
sd_cnlr_residual
n_segs_discordant_tcn
frac_discordant_tcn
n_segs_discordant_lcn
frac_discordant_lcn
n_segs_discordant_both
frac_discordant_both
n_segs_icn_cnlo

In [44]:
# We verify that each sample is unique (we have an equal number of lines and unique samples )
assert(len(cohort) == len(set(cohort.sample_id)))

In [45]:
len(set(cohort.Tumor_Id))

29259

# Adding NA purity samples
We want to add the samples that have no purity, at least one tp53 mutation and a max_vaf>0.15 

In [46]:
maf_total_cohort = pd.read_csv(data_path + 'impact-facets-tp53/new_data_facets/msk_impact_facets_annotated.ccf.maf', sep='\t')
total_cohort = pd.read_csv(data_path + 'impact-facets-tp53/new_data_facets/msk_impact_facets_annotated.cohort.txt', sep='\t')
cohort_na_tp53 = maf_total_cohort[(maf_total_cohort.purity == 0.3)| (maf_total_cohort.purity.isna())][maf_total_cohort['Hugo_Symbol'] == 'TP53']


In [47]:
# We create a list for the interesting samples that we will integrate to our data
samples_maxvaf=[]
for sample in list(set(cohort_na_tp53.Tumor_Sample_Barcode)):
    sample_1 = maf_total_cohort[maf_total_cohort['Tumor_Sample_Barcode'] == sample]
    if max(sample_1.t_var_freq)>0.15:
        samples_maxvaf.append(sample)

In [58]:
len(samples_maxvaf) - 320

521

In [59]:
29304 + 521

29825

In [54]:
i=0
for sample in samples_maxvaf:
    if sample in set(cohort.Tumor_Id):
        print(sample)
        print(maf_total_cohort[maf_total_cohort['Tumor_Sample_Barcode']==sample]['purity'])
        #i+=1
#print(i)

P-0022508-T01-IM6
101198   NaN
101199   NaN
101200   NaN
101201   NaN
101202   NaN
101203   NaN
101204   NaN
101205   NaN
101206   NaN
101207   NaN
101208   NaN
101209   NaN
101210   NaN
101211   NaN
101212   NaN
101213   NaN
101214   NaN
101215   NaN
101216   NaN
101217   NaN
101218   NaN
101219   NaN
101220   NaN
101221   NaN
101222   NaN
101223   NaN
101224   NaN
101225   NaN
101226   NaN
101227   NaN
101228   NaN
101229   NaN
101230   NaN
101231   NaN
101232   NaN
101233   NaN
101234   NaN
Name: purity, dtype: float64
P-0036065-T02-IM6
73894    0.3
73895    0.3
73896    0.3
73897    0.3
73898    0.3
73899    0.3
73900    0.3
73901    0.3
Name: purity, dtype: float64
P-0022739-T01-IM6
42658    0.3
42659    0.3
42660    0.3
Name: purity, dtype: float64
P-0012289-T02-IM6
218291    0.3
218292    0.3
218293    0.3
Name: purity, dtype: float64
P-0047948-T01-IM6
325693    0.3
325694    0.3
325695    0.3
325696    0.3
325697    0.3
325698    0.3
325699    0.3
325700    0.3
Name: purity, dt

250948    0.3
250949    0.3
250950    0.3
250951    0.3
Name: purity, dtype: float64
P-0040489-T01-IM6
258847    0.3
258848    0.3
258849    0.3
258850    0.3
258851    0.3
258852    0.3
258853    0.3
258854    0.3
258855    0.3
258856    0.3
258857    0.3
258858    0.3
258859    0.3
Name: purity, dtype: float64
P-0003820-T01-IM5
33096   NaN
33097   NaN
33098   NaN
33099   NaN
33100   NaN
Name: purity, dtype: float64
P-0006634-T01-IM5
87663    0.3
87664    0.3
87665    0.3
87666    0.3
87667    0.3
87668    0.3
87669    0.3
87670    0.3
87671    0.3
87672    0.3
87673    0.3
Name: purity, dtype: float64
P-0040655-T01-IM6
258766    0.3
258767    0.3
258768    0.3
258769    0.3
258770    0.3
258771    0.3
258772    0.3
258773    0.3
258774    0.3
Name: purity, dtype: float64
P-0047599-T01-IM6
315627   NaN
315628   NaN
315629   NaN
315630   NaN
315631   NaN
315632   NaN
315633   NaN
315634   NaN
315635   NaN
315636   NaN
315637   NaN
315638   NaN
315639   NaN
315640   NaN
315641   NaN
315

175319    0.3
175320    0.3
175321    0.3
175322    0.3
175323    0.3
175324    0.3
175325    0.3
175326    0.3
175327    0.3
175328    0.3
175329    0.3
175330    0.3
175331    0.3
175332    0.3
175333    0.3
175334    0.3
175335    0.3
175336    0.3
175337    0.3
175338    0.3
175339    0.3
175340    0.3
175341    0.3
175342    0.3
175343    0.3
175344    0.3
175345    0.3
175346    0.3
Name: purity, dtype: float64
P-0020508-T01-IM6
83962    0.3
83963    0.3
83964    0.3
83965    0.3
83966    0.3
Name: purity, dtype: float64
P-0003011-T01-IM3
20800    0.3
20801    0.3
20802    0.3
20803    0.3
Name: purity, dtype: float64
P-0008380-T02-IM5
121940    0.3
121941    0.3
121942    0.3
121943    0.3
Name: purity, dtype: float64
P-0033330-T01-IM6
193071    0.3
193072    0.3
193073    0.3
Name: purity, dtype: float64
P-0037838-T02-IM6
318794    0.3
318795    0.3
318796    0.3
318797    0.3
318798    0.3
318799    0.3
318800    0.3
318801    0.3
318802    0.3
318803    0.3
318804    0.3
3188

132082    0.3
132083    0.3
Name: purity, dtype: float64
P-0026962-T01-IM6
209280    0.3
209281    0.3
209282    0.3
209283    0.3
209284    0.3
209285    0.3
209286    0.3
209287    0.3
209288    0.3
209289    0.3
209290    0.3
209291    0.3
209292    0.3
209293    0.3
209294    0.3
209295    0.3
209296    0.3
209297    0.3
209298    0.3
209299    0.3
209300    0.3
209301    0.3
209302    0.3
209303    0.3
209304    0.3
209305    0.3
209306    0.3
209307    0.3
209308    0.3
209309    0.3
209310    0.3
209311    0.3
209312    0.3
209313    0.3
209314    0.3
209315    0.3
209316    0.3
209317    0.3
209318    0.3
209319    0.3
209320    0.3
209321    0.3
209322    0.3
209323    0.3
209324    0.3
209325    0.3
209326    0.3
209327    0.3
209328    0.3
209329    0.3
Name: purity, dtype: float64
P-0003630-T02-IM5
128017    0.3
128018    0.3
Name: purity, dtype: float64
P-0030884-T01-IM6
174922    0.3
174923    0.3
174924    0.3
174925    0.3
174926    0.3
174927    0.3
Name: purity, dtype

P-0041317-T01-IM6
265025    0.3
265026    0.3
265027    0.3
265028    0.3
265029    0.3
265030    0.3
265031    0.3
265032    0.3
265033    0.3
265034    0.3
265035    0.3
265036    0.3
Name: purity, dtype: float64
P-0009437-T01-IM5
234106    0.3
234107    0.3
234108    0.3
234109    0.3
234110    0.3
Name: purity, dtype: float64
P-0012875-T01-IM5
180353    0.3
180354    0.3
180355    0.3
180356    0.3
180357    0.3
180358    0.3
180359    0.3
180360    0.3
180361    0.3
180362    0.3
180363    0.3
180364    0.3
180365    0.3
180366    0.3
180367    0.3
Name: purity, dtype: float64
P-0049372-T01-IM6
333808    0.3
333809    0.3
333810    0.3
333811    0.3
Name: purity, dtype: float64
P-0018312-T01-IM6
74768    0.3
74769    0.3
74770    0.3
Name: purity, dtype: float64
P-0013310-T01-IM5
165821    0.3
165822    0.3
165823    0.3
165824    0.3
165825    0.3
Name: purity, dtype: float64
P-0026790-T01-IM6
288154   NaN
288155   NaN
288156   NaN
288157   NaN
288158   NaN
288159   NaN
288160   

166170    0.3
166171    0.3
166172    0.3
166173    0.3
166174    0.3
166175    0.3
166176    0.3
166177    0.3
166178    0.3
166179    0.3
166180    0.3
166181    0.3
166182    0.3
166183    0.3
166184    0.3
166185    0.3
166186    0.3
166187    0.3
166188    0.3
166189    0.3
166190    0.3
166191    0.3
166192    0.3
166193    0.3
166194    0.3
166195    0.3
166196    0.3
166197    0.3
166198    0.3
166199    0.3
166200    0.3
166201    0.3
166202    0.3
166203    0.3
166204    0.3
166205    0.3
166206    0.3
166207    0.3
166208    0.3
166209    0.3
166210    0.3
166211    0.3
166212    0.3
166213    0.3
166214    0.3
166215    0.3
166216    0.3
166217    0.3
166218    0.3
Name: purity, dtype: float64
P-0011073-T01-IM5
148703   NaN
148704   NaN
148705   NaN
Name: purity, dtype: float64
P-0009863-T01-IM5
234395   NaN
234396   NaN
234397   NaN
234398   NaN
234399   NaN
234400   NaN
234401   NaN
234402   NaN
Name: purity, dtype: float64
P-0002663-T01-IM3
227346    0.3
227347    0.3
22

P-0010061-T01-IM5
234539    0.3
234540    0.3
234541    0.3
234542    0.3
234543    0.3
234544    0.3
234545    0.3
234546    0.3
234547    0.3
234548    0.3
234549    0.3
234550    0.3
234551    0.3
234552    0.3
234553    0.3
234554    0.3
234555    0.3
234556    0.3
234557    0.3
234558    0.3
234559    0.3
234560    0.3
234561    0.3
Name: purity, dtype: float64
P-0031393-T01-IM6
157457    0.3
157458    0.3
157459    0.3
157460    0.3
157461    0.3
157462    0.3
157463    0.3
Name: purity, dtype: float64
P-0021069-T01-IM6
185345    0.3
185346    0.3
185347    0.3
185348    0.3
185349    0.3
185350    0.3
185351    0.3
185352    0.3
185353    0.3
185354    0.3
Name: purity, dtype: float64
P-0030323-T01-IM6
216633    0.3
216634    0.3
216635    0.3
216636    0.3
216637    0.3
216638    0.3
216639    0.3
216640    0.3
216641    0.3
216642    0.3
216643    0.3
216644    0.3
216645    0.3
216646    0.3
216647    0.3
216648    0.3
216649    0.3
Name: purity, dtype: float64
P-0028581-T02-

KeyboardInterrupt: 

In [57]:
total_cohort[total_cohort.tumor_sample == 'P-0022508-T01-IM6']

Unnamed: 0,sample_id,sample_path,fit_to_use,patient,tumor_sample,tumor_bamname,normal_sample,normal_bamname,tag,run_prefix,run_output_dir,run_log_dir,counts_file,has_counts_file,has_hisens_run,has_purity_run,has_qc,has_maf_anno,run_status,tumor_sample_id,path,fit_name,purity_run_version,purity_run_prefix,purity_run_Seed,purity_run_cval,purity_run_nhet,purity_run_snp_nbhd,purity_run_ndepth,purity_run_Purity,purity_run_Ploidy,purity_run_dipLogR,purity_run_alBalLogR,hisens_run_version,hisens_run_prefix,hisens_run_Seed,hisens_run_cval,hisens_run_nhet,hisens_run_snp_nbhd,hisens_run_ndepth,hisens_run_hisens,hisens_run_Ploidy,hisens_run_dipLogR,manual_note,is_best_fit,purity,ploidy,dipLogR,dipLogR_flag,n_alternative_dipLogR,wgd,n_dip_bal_segs,frac_dip_bal_segs,n_dip_imbal_segs,frac_dip_imbal_segs,n_amps,n_homdels,frac_homdels,n_homdels_clonal,frac_homdels_clonal,n_cn_states,n_segs,n_cnlr_clusters,n_lcn_na,n_loh,frac_loh,n_segs_subclonal,frac_segs_subclonal,n_segs_below_dipLogR,frac_below_dipLogR,n_segs_balanced_odd_tcn,frac_balanced_odd_tcn,n_segs_imbalanced_diploid_cn,frac_imbalanced_diploid_cn,n_segs_lcn_greater_mcn,frac_lcn_greater_mcn,n_snps,n_het_snps,frac_het_snps,n_snps_with_300x_in_tumor,n_het_snps_with_300x_in_tumor,n_het_snps_hom_in_tumor_1pct,n_het_snps_hom_in_tumor_5pct,frac_het_snps_hom_in_tumor_1pct,frac_het_snps_hom_in_tumor_5pct,mean_cnlr_residual,sd_cnlr_residual,n_segs_discordant_tcn,frac_discordant_tcn,n_segs_discordant_lcn,frac_discordant_lcn,n_segs_discordant_both,frac_discordant_both,n_segs_icn_cnlor_discordant,frac_icn_cnlor_discordant,mafr_median_all,mafr_median_clonal,mafr_n_gt_1,facets_suite_version,facets_qc_version,homdel_filter_pass,diploid_bal_seg_filter_pass,diploid_imbal_seg_filter_pass,waterfall_filter_pass,hyper_seg_filter_pass,high_ploidy_filter_pass,valid_purity_filter_pass,diploid_seg_filter_pass,em_cncf_icn_discord_filter_pass,dipLogR_too_low_filter_pass,subclonal_genome_filter_pass,icn_allelic_state_concordance_filter_pass,contamination_filter_pass,facets_qc,arm_level_file,gene_level_file,ccf_file,arm_level_file_exists,gene_level_file_exists,ccf_file_exists
13057,P-0022508-T01-IM6_P-0022508-N01-IM6,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/,,P-0022508,P-0022508-T01-IM6,QS871775-T,P-0022508-N01-IM6,AQ626086-N,P-0022508-T01-IM6_P-0022508-N01-IM6,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/default/,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/default/logs/,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/countsMerged____P-0022508-T01-IM6_P-0022508-N01-IM6.dat.gz,True,True,True,False,False,complete,P-0022508-T01-IM6_P-0022508-N01-IM6,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6/,default,0.5.14,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6_purity,100,100,15,250,35,0.27,2.3,-0.06,-0.06,0.5.14,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6_hisens,100,50,15,250,35,,2.0,-0.06,,False,0.271751,2.320998,-0.061591,False,0,False,10,0.53,0,0.0,0,0,0.0,0,0.0,5,31,8,1,8,0.12,2,0.034,13,0.3,7,0.12,0,0.0,8,0.12,19903,2349,0.12,6971,731,2,0,0.00085,0.0,-0.12,0.5,5,0.1,0.0,0.0,0,0.0,7,0.12,0.0012,0.0017,0,2.0.5,1,True,True,False,True,True,True,True,True,True,True,True,True,True,True,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6.arm_level.txt,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6.gene_level.txt,/juno/work/ccs/shared/resources/impact/facets/all/P-00225/P-0022508-T01-IM6_P-0022508-N01-IM6//default/P-0022508-T01-IM6_P-0022508-N01-IM6.ccf.maf,True,True,True


In [63]:
# We only take interesting columns --> ['sample_id','Tumor_Id', 'purity', 'ploidy', 'dipLogR', 'frac_loh']
cohort_to_add = total_cohort[total_cohort.tumor_sample.isin(samples_maxvaf)]
cohort_to_add = cohort_to_add[['sample_id','tumor_sample', 'purity', 'ploidy', 'dipLogR', 'frac_loh']]
cohort_to_add.columns = ['sample_id','Tumor_Id', 'purity', 'ploidy', 'dipLogR', 'frac_loh']

In [64]:
# We concatenate cohort and cohort_to_add
cohort = pd.concat([cohort, cohort_to_add], axis=0)

In [66]:
cohort

Unnamed: 0,sample_id,sample_path,tumor_sample_id,path,fit_name,purity_run_version,purity_run_prefix,purity_run_Seed,purity_run_cval,purity_run_nhet,purity_run_Purity,purity_run_Ploidy,purity_run_dipLogR,purity_run_alBalLogR,hisens_run_version,hisens_run_prefix,hisens_run_Seed,hisens_run_cval,hisens_run_nhet,hisens_run_hisens,hisens_run_Ploidy,hisens_run_dipLogR,manual_note,is_best_fit,purity,ploidy,dipLogR,dipLogR_flag,n_alternative_dipLogR,n_dip_bal_segs,frac_dip_bal_segs,n_dip_imbal_segs,frac_dip_imbal_segs,n_amps,n_homdels,frac_homdels,n_homdels_clonal,frac_homdels_clonal,n_cn_states,n_segs,n_cnlr_clusters,n_lcn_na,n_loh,frac_loh,n_segs_subclonal,frac_segs_subclonal,n_snps,n_het_snps,frac_het_snps,n_het_snps_hom_in_tumor_1pct,n_het_snps_hom_in_tumor_5pct,frac_het_snps_hom_in_tumor_1pct,frac_het_snps_hom_in_tumor_5pct,mean_cnlr_residual,sd_cnlr_residual,n_segs_discordant_tcn,frac_discordant_tcn,n_segs_discordant_lcn,frac_discordant_lcn,n_segs_discordant_both,frac_discordant_both,n_segs_icn_cnlor_discordant,frac_icn_cnlor_discordant,homdel_filter_pass,diploid_bal_seg_filter_pass,diploid_imbal_seg_filter_pass,waterfall_filter_pass,hyper_seg_filter_pass,high_ploidy_filter_pass,valid_purity_filter_pass,diploid_seg_filter_pass,facets_suite_qc,arm_level_file,gene_level_file,ccf_file,arm_level_file_exists,gene_level_file_exists,ccf_file_exists,Patient_Id,Tumor_Id
0,P-0034223-T01-IM6_P-0034223-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6/,P-0034223-T01-IM6_P-0034223-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6_purity,100.0,100.0,15.0,0.94,2.24,-0.16,-0.16,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6_hisens,100.0,50.0,15.0,,2.24,-0.16,,False,0.941111,2.241830,-0.155483,False,0.0,12.0,0.590,0.0,0.000,0.0,0.0,0.0000,0.0,0.0000,6.0,31.0,10.0,2.0,4.0,0.062,1.0,0.00001,22963.0,2655.0,0.12,43.0,129.0,0.01600,0.0490,-0.190,0.63,1.0,0.0038,0.0,0.000,0.0,0.000,1.0,0.0420,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00342/P-0034223-T01-IM6_P-0034223-N01-IM6//default/P-0034223-T01-IM6_P-0034223-N01-IM6.ccf.maf,True,True,True,P-0034223,P-0034223-T01-IM6
1,P-0009819-T01-IM5_P-0009819-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5/,P-0009819-T01-IM5_P-0009819-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5_purity,100.0,100.0,15.0,0.28,2.68,-0.13,-0.13,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5_hisens,100.0,50.0,15.0,,2.77,-0.13,,False,0.275237,2.681075,-0.129255,False,0.0,7.0,0.430,0.0,0.000,0.0,1.0,0.0062,1.0,0.0062,3.0,25.0,6.0,0.0,5.0,0.094,0.0,0.00000,16527.0,2041.0,0.12,10.0,10.0,0.00490,0.0049,-0.250,0.83,1.0,0.0062,0.0,0.000,0.0,0.000,2.0,0.0063,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00098/P-0009819-T01-IM5_P-0009819-N01-IM5//default/P-0009819-T01-IM5_P-0009819-N01-IM5.ccf.maf,True,True,True,P-0009819,P-0009819-T01-IM5
2,P-0025956-T01-IM6_P-0025956-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6/,P-0025956-T01-IM6_P-0025956-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6_purity,100.0,100.0,15.0,0.19,3.50,-0.19,"-0.19, 0.02",0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6_hisens,100.0,50.0,15.0,,3.45,-0.19,,False,0.185874,3.496971,-0.187925,False,0.0,2.0,0.096,4.0,0.180,0.0,0.0,0.0000,0.0,0.0000,6.0,26.0,6.0,0.0,5.0,0.190,0.0,0.00000,17971.0,2159.0,0.12,0.0,0.0,0.00000,0.0000,-0.052,0.30,2.0,0.0150,0.0,0.000,3.0,0.120,8.0,0.3000,True,True,True,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00259/P-0025956-T01-IM6_P-0025956-N01-IM6//default/P-0025956-T01-IM6_P-0025956-N01-IM6.ccf.maf,False,False,False,P-0025956,P-0025956-T01-IM6
3,P-0027408-T01-IM6_P-0027408-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6/,P-0027408-T01-IM6_P-0027408-N01-IM6,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6_purity,100.0,100.0,15.0,0.31,1.81,0.04,0.04,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6_hisens,100.0,50.0,15.0,,1.82,0.04,,False,0.308886,1.811066,0.042724,False,0.0,7.0,0.280,1.0,0.035,0.0,0.0,0.0000,0.0,0.0000,4.0,31.0,6.0,0.0,12.0,0.340,0.0,0.00000,18633.0,2163.0,0.12,0.0,0.0,0.00000,0.0000,-0.058,0.29,2.0,0.0960,0.0,0.000,0.0,0.000,4.0,0.2100,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00274/P-0027408-T01-IM6_P-0027408-N01-IM6//default/P-0027408-T01-IM6_P-0027408-N01-IM6.ccf.maf,True,True,True,P-0027408,P-0027408-T01-IM6
4,P-0006554-T01-IM5_P-0006554-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5/,P-0006554-T01-IM5_P-0006554-N01-IM5,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5/,default,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5_purity,100.0,100.0,15.0,0.72,1.91,0.05,0.05,0.5.14,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5_hisens,100.0,50.0,15.0,,1.92,0.05,,False,0.715208,1.910719,0.046812,False,0.0,11.0,0.490,0.0,0.000,0.0,1.0,0.0074,0.0,0.0000,6.0,30.0,12.0,2.0,6.0,0.088,6.0,0.15000,16557.0,2041.0,0.12,1.0,7.0,0.00049,0.0034,-0.024,0.38,0.0,0.0000,160000000.0,0.058,3.0,0.054,3.0,0.1100,True,True,False,True,True,True,True,True,True,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.arm_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.gene_level.txt,/juno/work/ccs/resources/impact/facets/all/P-00065/P-0006554-T01-IM5_P-0006554-N01-IM5//default/P-0006554-T01-IM5_P-0006554-N01-IM5.ccf.maf,True,True,True,P-0006554,P-0006554-T01-IM5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42107,P-0050500-T01-IM6_P-0050500-N01-IM6,,,,,,,,,,,,,,,,,,,,,,,,,2.000000,-0.017907,,,,,,,,,,,,,,,,,0.000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,P-0050500-T01-IM6
42154,P-0050442-T01-IM6_P-0050442-N01-IM6,,,,,,,,,,,,,,,,,,,,,,,,0.324076,2.463234,-0.104419,,,,,,,,,,,,,,,,,0.013,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,P-0050442-T01-IM6
42183,P-0050657-T01-IM6_P-0050657-N01-IM6,,,,,,,,,,,,,,,,,,,,,,,,,2.000000,-0.005398,,,,,,,,,,,,,,,,,0.000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,P-0050657-T01-IM6
42217,P-0050654-T01-IM6_P-0050654-N01-IM6,,,,,,,,,,,,,,,,,,,,,,,,0.202936,2.842656,-0.118363,,,,,,,,,,,,,,,,,0.240,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,P-0050654-T01-IM6


---
## Creation of the MAF file from CbioPortal raw datasets

In [24]:
clinical_data = pd.read_csv(data_path + 'cbioportal/raw/mskimpact_clinical_data-2.tsv', sep= '\t')
mutations = pd.read_pickle(data_path + 'cbioportal/raw/mutations_cohort.pkl')

In [25]:
# Here are the columns we will select for the three different files

filter_cohort = ['sample_id','Tumor_Id', 'purity', 'ploidy', 'dipLogR', 'frac_loh']

filter_mut = ['sampleId',
             'patientId',
             'gene',
             'entrezGeneId',
             'mutationType',
             'mutationStatus',
             'proteinChange',
             'startPosition',
             'endPosition',
             'referenceAllele',
             'variantAllele',
             'chr',
             'hugoGeneSymbol',
             'tumorAltCount',
             'tumorRefCount']

filter_clinical = [  'Sample ID',
                     'Patient ID',
                     'Patient Current Age',
                     'Cancer Type',
                     'Cancer Type Detailed',
                     'Ethnicity Category' ,
                     'Race Category',
                     'Sex',
                     'Mutation Count',
                     'Sample Type',
                     'Number of Samples Per Patient',
                     'Overall Survival Status',
                     'Overall Survival (Months)',
                     'MSI Score',
                     'Impact TMB Score',
                     'Somatic Status'
                      ]

cohort_filtered = cohort[filter_cohort]
mutations_filtered = mutations[filter_mut]
clinical_data_filtered = clinical_data[filter_clinical]

---
We create 3 new columns in mutations_filtered:
* *mut_key*: mutation key that describes entirely the mutation
* *sample_mut_key*: sample mutation key that adds information about the sample (it allows to filter out duplicates)
* *mut_spot*: number representing the location of the amino acid mutated

In [26]:
# Create a mutation Key
mutations_filtered['mut_key'] = pd.Series([str(i)+'_'+str(j)+'_'+str(k)+'_'+str(l) for i,j,k,l in zip(mutations_filtered.chr, mutations_filtered.startPosition, mutations_filtered.referenceAllele, mutations_filtered.variantAllele)]) 
# Create a sample key to differentiate duplicates
mutations_filtered['sample_mut_key'] = pd.Series([str(j)+'_'+str(i) for i,j in zip( mutations_filtered.mut_key, mutations_filtered.sampleId)])
# Extract the mutation spot from HGVSp
mutations_filtered['mut_spot'] = mutations_filtered.proteinChange.str.extract('(\d+)')
#Create the vaf column
mutations_filtered['vaf'] = mutations_filtered.apply(lambda x: x.tumorAltCount/(x.tumorAltCount + x.tumorRefCount) if (x.tumorAltCount + x.tumorRefCount)>0 else 'None' , axis=1)

In [27]:
# We merge the three dataframes
# Left Join on 'patient_Id' and 'Patient ID'
maf = pd.merge(left=cohort_filtered,right=clinical_data_filtered, how='left', left_on='Tumor_Id', right_on='Sample ID')
maf_cohort = pd.merge(left=maf, right=mutations_filtered, how='left', left_on='Tumor_Id', right_on='sampleId')
# We drop column duplicates
maf_cohort = maf_cohort.drop(['sampleId', 'Sample ID','patientId'], axis=1)
# We rename the columns to be consistent with other maf files created
maf_cohort.columns = ['Sample_Id', 'Tumor_Id','purity', 'ploidy', 'dipLogR', 'frac_loh', 'Patient_Id', 'Patient_Current_Age', 'Cancer_Type',
                    'Cancer_Type_Detailed', 'Ethnicity_Category','Race_Category', 'Sex', 'Mutation_Count', 'Sample_Type', 'samples_per_patient','Overall Survival Status',
                     'Overall Survival (Months)', 'MSI Score','TMB_Score','Somatic_Status', 'gene','Gene_Id','Variant_Classification','mutationStatus', 'proteinChange',
                    'Start_Position', 'End_Position', 'Reference_Allele','Variant_Allele', 'Chromosome',  
                    'Hugo_Symbol','alt_count', 'ref_count', 'mut_key', 'sample_mut_key', 'mut_spot', 'vaf']
maf_cohort

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 5290, 'hugoGeneSymbol': 'PIK3CA', 'type': 'protein-coding'}",5290.0,Missense_Mutation,SOMATIC,E545K,178936091.0,178936091.0,G,A,3,PIK3CA,284.0,334.0,3_178936091_G_A,P-0034223-T01-IM6_3_178936091_G_A,545,0.459547
1,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 2064, 'hugoGeneSymbol': 'ERBB2', 'type': 'protein-coding'}",2064.0,Missense_Mutation,SOMATIC,L755S,37880220.0,37880220.0,T,C,17,ERBB2,224.0,262.0,17_37880220_T_C,P-0034223-T01-IM6_17_37880220_T_C,755,0.460905
2,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 9641, 'hugoGeneSymbol': 'IKBKE', 'type': 'protein-coding'}",9641.0,Missense_Mutation,SOMATIC,R27H,206646650.0,206646650.0,G,A,1,IKBKE,252.0,1027.0,1_206646650_G_A,P-0034223-T01-IM6_1_206646650_G_A,27,0.197029
3,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 6926, 'hugoGeneSymbol': 'TBX3', 'type': 'protein-coding'}",6926.0,Frame_Shift_Ins,SOMATIC,S321Vfs*6,115114257.0,115114258.0,-,T,12,TBX3,358.0,384.0,12_115114257_-_T,P-0034223-T01-IM6_12_115114257_-_T,321,0.48248
4,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,0.941111,2.241830,-0.155483,0.062,P-0034223,63.0,Breast Cancer,Invasive Breast Carcinoma,,NO VALUE ENTERED,Female,6.0,Metastasis,1.0,LIVING,,0.55,5.3,Matched,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,Missense_Mutation,SOMATIC,C227Y,38061309.0,38061309.0,C,T,14,FOXA1,410.0,462.0,14_38061309_C_T,P-0034223-T01-IM6_14_38061309_C_T,227,0.470183
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284614,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 3845, 'hugoGeneSymbol': 'KRAS', 'type': 'protein-coding'}",3845.0,Missense_Mutation,SOMATIC,G12D,25398284.0,25398284.0,C,T,12,KRAS,37.0,120.0,12_25398284_C_T,P-0050744-T01-IM6_12_25398284_C_T,12,0.235669
284615,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,SOMATIC,Y234C,7577580.0,7577580.0,T,C,17,TP53,66.0,210.0,17_7577580_T_C,P-0050744-T01-IM6_17_7577580_T_C,234,0.23913
284616,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 1029, 'hugoGeneSymbol': 'CDKN2A', 'type': 'protein-coding'}",1029.0,Missense_Mutation,SOMATIC,H83Y,21971111.0,21971111.0,G,A,9,CDKN2A,20.0,215.0,9_21971111_G_A,P-0050744-T01-IM6_9_21971111_G_A,83,0.0851064
284617,P-0050744-T01-IM6_P-0050744-N01-IM6,P-0050744-T01-IM6,0.300000,2.426452,-0.089454,0.210,P-0050744,69.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,,WHITE,Male,5.0,Primary,1.0,LIVING,,0.06,5.3,Matched,"{'entrezGeneId': 5159, 'hugoGeneSymbol': 'PDGFRB', 'type': 'protein-coding'}",5159.0,Missense_Mutation,SOMATIC,E1069K,149495442.0,149495442.0,C,T,5,PDGFRB,58.0,279.0,5_149495442_C_T,P-0050744-T01-IM6_5_149495442_C_T,1069,0.172107


In [28]:
# MERGE WITH ANNOTATED DATA 

annotated_data = pd.read_pickle(data_path + 'maf_cohort_annotated.pkl')

#Create a total_mut_key
maf_cohort['total_mut_key'] = pd.Series([str(j)+'_'+str(i) for i,j in zip( maf_cohort.mut_key, maf_cohort.Sample_Id)])
annotated_data['total_mut_key'] = pd.Series([str(j)+'_'+str(i) for i,j in zip(annotated_data.mut_key, annotated_data.Sample_Id)])
annotated_data = annotated_data[['total_mut_key','mutationEffect','oncogenic','vus','hotspot']]

maf_cohort_final = pd.merge(maf_cohort, annotated_data, how='left', left_on='total_mut_key', right_on='total_mut_key')
maf_cohort_final = maf_cohort_final.drop_duplicates('total_mut_key')

In [29]:
# remove Germline and NA mutations: 
maf_cohort_final_ = maf_cohort_final[maf_cohort_final['mutationStatus'] != 'GERMLINE'][maf_cohort_final['mutationStatus'] != 'NA']

In [30]:
# Saving to pickle File
maf_cohort_final_.to_pickle(data_path + 'merged_data/maf_cohort.pkl')

In [42]:
h = maf_cohort_final[maf_cohort_final['mutationStatus'] == 'NA'][maf_cohort_final['Hugo_Symbol'] == 'TP53']['Tumor_Id'].tolist()
h

['P-0003867-T01-IM5',
 'P-0026701-T01-IM6',
 'P-0033092-T01-IM6',
 'P-0016168-T01-IM6',
 'P-0004827-T01-IM5',
 'P-0030800-T01-IM6',
 'P-0032474-T01-IM6',
 'P-0025686-T01-IM6',
 'P-0025686-T02-IM6',
 'P-0037395-T01-IM6',
 'P-0035227-T01-IM6',
 'P-0008695-T03-IM6',
 'P-0032533-T01-IM6',
 'P-0004140-T02-IM5',
 'P-0004140-T03-IM5',
 'P-0036760-T01-IM6',
 'P-0031933-T01-IM6',
 'P-0016643-T01-IM6',
 'P-0024672-T01-IM6',
 'P-0034440-T01-IM6',
 'P-0020430-T01-IM6',
 'P-0021318-T01-IM6',
 'P-0027735-T01-IM6',
 'P-0031946-T01-IM6',
 'P-0004387-T02-IM5',
 'P-0007169-T01-IM5',
 'P-0017265-T01-IM6',
 'P-0012776-T01-IM5',
 'P-0020606-T01-IM6',
 'P-0033821-T01-IM6',
 'P-0035527-T01-IM6',
 'P-0030493-T01-IM6',
 'P-0021358-T01-IM6',
 'P-0031409-T01-IM6',
 'P-0031409-T02-IM6',
 'P-0031284-T01-IM6',
 'P-0002354-T02-IM6',
 'P-0009044-T01-IM5',
 'P-0012597-T01-IM5',
 'P-0027135-T01-IM6',
 'P-0018462-T01-IM6',
 'P-0028435-T01-IM6',
 'P-0012773-T01-IM5',
 'P-0035373-T01-IM6',
 'P-0009743-T01-IM5',
 'P-000970

In [68]:
a = maf_cohort_final[maf_cohort_final['Tumor_Id'] == 'P-0050657-T01-IM6']
a = a[a['mutationStatus'] != 'GERMLINE']
a

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Somatic_Status,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
286987,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,,2.0,-0.005398,0.0,P-0050657,82.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,3.0,Metastasis,1.0,LIVING,2.005,0.05,2.6,Matched,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,SOMATIC,R249G,7577536.0,7577536.0,T,C,17,TP53,26.0,689.0,17_7577536_T_C,P-0050657-T01-IM6_17_7577536_T_C,249,0.0363636,P-0050657-T01-IM6_P-0050657-N01-IM6_17_7577536_T_C,,,,
286988,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,,2.0,-0.005398,0.0,P-0050657,82.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,3.0,Metastasis,1.0,LIVING,2.005,0.05,2.6,Matched,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,SOMATIC,M246L,7577545.0,7577545.0,T,A,17,TP53,24.0,677.0,17_7577545_T_A,P-0050657-T01-IM6_17_7577545_T_A,246,0.0342368,P-0050657-T01-IM6_P-0050657-N01-IM6_17_7577545_T_A,,,,
286989,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,,2.0,-0.005398,0.0,P-0050657,82.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,3.0,Metastasis,1.0,LIVING,2.005,0.05,2.6,Matched,"{'entrezGeneId': 4233, 'hugoGeneSymbol': 'MET', 'type': 'protein-coding'}",4233.0,Splice_Region,SOMATIC,X1010_splice,116412046.0,116412046.0,A,G,7,MET,366.0,851.0,7_116412046_A_G,P-0050657-T01-IM6_7_116412046_A_G,1010,0.30074,P-0050657-T01-IM6_P-0050657-N01-IM6_7_116412046_A_G,,,,


In [77]:
len(set(maf_cohort_final[maf_cohort_final['Tumor_Id'].isin(samples_maxvaf)][maf_cohort_final['mutationStatus'] == 'NA']['Tumor_Id']))

138

In [39]:
master = load_clean_up_master(data_path + 'merged_data/master_file.pkl')
len(master)

26931

In [46]:
get_groupby(master[master['Tumor_Id'].isin(h)], 'tp53_cn_state', 'count')

Unnamed: 0_level_0,count
tp53_cn_state,Unnamed: 1_level_1
CNLOH,6
CNLOH BEFORE & LOSS,5
DIPLOID,7
DOUBLE LOSS AFTER,1
GAIN,2
HETLOSS,34
HOMDEL,4
INDETERMINATE,18
LOSS AFTER,2
LOSS BEFORE,9


In [69]:
get_groupby(maf_cohort_final, 'mutationStatus', 'count')

Unnamed: 0_level_0,count
mutationStatus,Unnamed: 1_level_1
,6622
SOMATIC,241854
UNKNOWN,2265


In [47]:
get_groupby(maf_cohort_final, 'Variant_Classification', 'count')

Unnamed: 0_level_0,count
Variant_Classification,Unnamed: 1_level_1
5'Flank,3832
Frame_Shift_Del,22744
Frame_Shift_Ins,8695
Fusion,6704
In_Frame_Del,4451
In_Frame_Ins,929
Missense_Mutation,171850
Nonsense_Mutation,23354
Nonstop_Mutation,160
Splice_Region,319


In [30]:
maf_cohort_final[maf_cohort_final['mutationStatus'] == 'UNKNOWN']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
2326,P-0000806-T01-IM3_P-0000806-N01-IM3,P-0000806-T01-IM3,0.486670,3.877389,-0.542837,0.12,P-0000806,61.0,Breast Cancer,"Breast Invasive Cancer, NOS",Non-Spanish; Non-Hispanic,WHITE,Female,3.0,Metastasis,1.0,DECEASED,45.567,1.62,3.3,"{'entrezGeneId': 8314, 'hugoGeneSymbol': 'BAP1', 'type': 'protein-coding'}",8314.0,Missense_Mutation,UNKNOWN,A217T,52440855.0,52440855.0,C,T,3,BAP1,77.0,293.0,3_52440855_C_T,P-0000806-T01-IM3_3_52440855_C_T,217,0.208108,P-0000806-T01-IM3_P-0000806-N01-IM3_3_52440855_C_T,,,,
2327,P-0000806-T01-IM3_P-0000806-N01-IM3,P-0000806-T01-IM3,0.486670,3.877389,-0.542837,0.12,P-0000806,61.0,Breast Cancer,"Breast Invasive Cancer, NOS",Non-Spanish; Non-Hispanic,WHITE,Female,3.0,Metastasis,1.0,DECEASED,45.567,1.62,3.3,"{'entrezGeneId': 6598, 'hugoGeneSymbol': 'SMARCB1', 'type': 'protein-coding'}",6598.0,Missense_Mutation,UNKNOWN,A238D,24159041.0,24159041.0,C,A,22,SMARCB1,62.0,350.0,22_24159041_C_A,P-0000806-T01-IM3_22_24159041_C_A,238,0.150485,P-0000806-T01-IM3_P-0000806-N01-IM3_22_24159041_C_A,,,,
2328,P-0000806-T01-IM3_P-0000806-N01-IM3,P-0000806-T01-IM3,0.486670,3.877389,-0.542837,0.12,P-0000806,61.0,Breast Cancer,"Breast Invasive Cancer, NOS",Non-Spanish; Non-Hispanic,WHITE,Female,3.0,Metastasis,1.0,DECEASED,45.567,1.62,3.3,"{'entrezGeneId': 54880, 'hugoGeneSymbol': 'BCOR', 'type': 'protein-coding'}",54880.0,Missense_Mutation,UNKNOWN,R540Q,39932980.0,39932980.0,C,T,23,BCOR,149.0,559.0,23_39932980_C_T,P-0000806-T01-IM3_23_39932980_C_T,540,0.210452,P-0000806-T01-IM3_P-0000806-N01-IM3_23_39932980_C_T,,,,
4971,P-0002304-T01-IM3_P-0002304-N01-IM3,P-0002304-T01-IM3,0.563198,1.703556,0.125759,0.30,P-0002304,76.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,5.0,Metastasis,1.0,LIVING,55.529,0.43,5.6,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,Missense_Mutation,UNKNOWN,R219S,38061334.0,38061334.0,G,T,14,FOXA1,435.0,1076.0,14_38061334_G_T,P-0002304-T01-IM3_14_38061334_G_T,219,0.287889,P-0002304-T01-IM3_P-0002304-N01-IM3_14_38061334_G_T,,,,
4972,P-0002304-T01-IM3_P-0002304-N01-IM3,P-0002304-T01-IM3,0.563198,1.703556,0.125759,0.30,P-0002304,76.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,5.0,Metastasis,1.0,LIVING,55.529,0.43,5.6,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,Frame_Shift_Del,UNKNOWN,S285Rfs*26,38061104.0,38061134.0,CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG,-,14,FOXA1,51.0,126.0,14_38061104_CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG_-,P-0002304-T01-IM3_14_38061104_CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG_-,285,0.288136,P-0002304-T01-IM3_P-0002304-N01-IM3_14_38061104_CTTGCGGCTCTCAGGGCCGCCCTTGGCGCCG_-,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
218802,P-0035458-T02-IM6_P-0035458-N01-IM6,P-0035458-T02-IM6,0.578450,2.260712,-0.104879,0.16,P-0035458,63.0,Pancreatic Cancer,Pancreatic Adenocarcinoma,Non-Spanish; Non-Hispanic,WHITE,Male,5.0,Primary,2.0,LIVING,7.562,-1.00,5.3,"{'entrezGeneId': 1029, 'hugoGeneSymbol': 'CDKN2A', 'type': 'protein-coding'}",1029.0,Frame_Shift_Ins,UNKNOWN,T93Nfs*27,21971080.0,21971081.0,-,T,9,CDKN2A,72.0,220.0,9_21971080_-_T,P-0035458-T02-IM6_9_21971080_-_T,93,0.246575,P-0035458-T02-IM6_P-0035458-N01-IM6_9_21971080_-_T,Gain-of-function,Oncogenic,False,True
225567,P-0002165-T01-IM3_P-0002165-N01-IM3,P-0002165-T01-IM3,0.445702,3.668292,-0.456050,0.26,P-0002165,67.0,Melanoma,Acral Melanoma,Non-Spanish; Non-Hispanic,WHITE,Male,2.0,Metastasis,1.0,LIVING,58.389,-1.00,12.3,"{'entrezGeneId': 673, 'hugoGeneSymbol': 'BRAF', 'type': 'protein-coding'}",673.0,Missense_Mutation,UNKNOWN,D594N,140453155.0,140453155.0,C,T,7,BRAF,2360.0,626.0,7_140453155_C_T,P-0002165-T01-IM3_7_140453155_C_T,594,0.790355,P-0002165-T01-IM3_P-0002165-N01-IM3_7_140453155_C_T,Unknown,,False,False
225568,P-0002165-T01-IM3_P-0002165-N01-IM3,P-0002165-T01-IM3,0.445702,3.668292,-0.456050,0.26,P-0002165,67.0,Melanoma,Acral Melanoma,Non-Spanish; Non-Hispanic,WHITE,Male,2.0,Metastasis,1.0,LIVING,58.389,-1.00,12.3,"{'entrezGeneId': 2195, 'hugoGeneSymbol': 'FAT1', 'type': 'protein-coding'}",2195.0,Missense_Mutation,UNKNOWN,E3306K,187531107.0,187531107.0,C,T,4,FAT1,165.0,647.0,4_187531107_C_T,P-0002165-T01-IM3_4_187531107_C_T,3306,0.203202,P-0002165-T01-IM3_P-0002165-N01-IM3_4_187531107_C_T,,,,
231052,P-0002078-T01-IM3_P-0002078-N01-IM3,P-0002078-T01-IM3,0.245690,1.878995,0.021606,0.28,P-0002078,21.0,Colorectal Cancer,Mucinous Adenocarcinoma of the Colon and Rectum,Non-Spanish; Non-Hispanic,ASIAN-FAR EAST/INDIAN SUBCONT,Male,2.0,Metastasis,1.0,DECEASED,1.216,0.00,,"{'entrezGeneId': 7157, 'hugoGeneSymbol': 'TP53', 'type': 'protein-coding'}",7157.0,Missense_Mutation,UNKNOWN,Y163C,7578442.0,7578442.0,T,C,17,TP53,106.0,442.0,17_7578442_T_C,P-0002078-T01-IM3_17_7578442_T_C,163,0.193431,P-0002078-T01-IM3_P-0002078-N01-IM3_17_7578442_T_C,,,,


In [76]:
total_mut_keys = maf_cohort_final["total_mut_key"]
h = maf_cohort_final[total_mut_keys.isin(total_mut_keys[total_mut_keys.duplicated()])]#.sort("ID")
#get_groupby(h, 'oncogenic', 'count')
h[h['oncogenic'] == 'Oncogenic']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
12339,P-0014183-T01-IM6_P-0014183-N01-IM6,P-0014183-T01-IM6,0.255454,1.901383,0.018288,0.27,P-0014183,82.0,Soft Tissue Sarcoma,Angiosarcoma,Non-Spanish; Non-Hispanic,WHITE,Male,9.0,Primary,1.0,DECEASED,28.800,0.45,8.8,"{'entrezGeneId': 1029, 'hugoGeneSymbol': 'CDKN2A', 'type': 'protein-coding'}",1029.0,Missense_Mutation,SOMATIC,H83Y,21971111.0,21971111.0,G,A,9,CDKN2A,132.0,544.0,9_21971111_G_A,P-0014183-T01-IM6_9_21971111_G_A,83,0.195266,P-0014183-T01-IM6_P-0014183-N01-IM6_9_21971111_G_A,Gain-of-function,Oncogenic,False,True
12349,P-0014183-T01-IM6_P-0014183-N01-IM6,P-0014183-T01-IM6,0.255454,1.901383,0.018288,0.27,P-0014183,82.0,Soft Tissue Sarcoma,Angiosarcoma,Non-Spanish; Non-Hispanic,WHITE,Male,9.0,Primary,1.0,DECEASED,28.800,0.45,8.8,"{'entrezGeneId': -1, 'hugoGeneSymbol': 'CDKN2AP16INK4A', 'type': 'protein-coding'}",-1.0,Missense_Mutation,SOMATIC,H83Y,21971111.0,21971111.0,G,A,9,CDKN2AP16INK4A,132.0,544.0,9_21971111_G_A,P-0014183-T01-IM6_9_21971111_G_A,83,0.195266,P-0014183-T01-IM6_P-0014183-N01-IM6_9_21971111_G_A,Gain-of-function,Oncogenic,False,True
12352,P-0014183-T01-IM6_P-0014183-N01-IM6,P-0014183-T01-IM6,0.255454,1.901383,0.018288,0.27,P-0014183,82.0,Soft Tissue Sarcoma,Angiosarcoma,Non-Spanish; Non-Hispanic,WHITE,Male,9.0,Primary,1.0,DECEASED,28.800,0.45,8.8,"{'entrezGeneId': -2, 'hugoGeneSymbol': 'CDKN2AP14ARF', 'type': 'protein-coding'}",-2.0,Missense_Mutation,SOMATIC,H83Y,21971111.0,21971111.0,G,A,9,CDKN2AP14ARF,132.0,544.0,9_21971111_G_A,P-0014183-T01-IM6_9_21971111_G_A,83,0.195266,P-0014183-T01-IM6_P-0014183-N01-IM6_9_21971111_G_A,Gain-of-function,Oncogenic,False,True
14286,P-0029302-T01-IM6_P-0029302-N01-IM6,P-0029302-T01-IM6,0.574623,2.745068,-0.279848,0.31,P-0029302,67.0,Cancer of Unknown Primary,"Poorly Differentiated Carcinoma, NOS",Non-Spanish; Non-Hispanic,WHITE,Male,4.0,Metastasis,1.0,LIVING,21.468,0.85,4.4,"{'entrezGeneId': 1029, 'hugoGeneSymbol': 'CDKN2A', 'type': 'protein-coding'}",1029.0,Nonsense_Mutation,SOMATIC,R80*,21971120.0,21971120.0,G,A,9,CDKN2A,298.0,185.0,9_21971120_G_A,P-0029302-T01-IM6_9_21971120_G_A,80,0.616977,P-0029302-T01-IM6_P-0029302-N01-IM6_9_21971120_G_A,Likely Loss-of-function,Oncogenic,False,False
14290,P-0029302-T01-IM6_P-0029302-N01-IM6,P-0029302-T01-IM6,0.574623,2.745068,-0.279848,0.31,P-0029302,67.0,Cancer of Unknown Primary,"Poorly Differentiated Carcinoma, NOS",Non-Spanish; Non-Hispanic,WHITE,Male,4.0,Metastasis,1.0,LIVING,21.468,0.85,4.4,"{'entrezGeneId': -1, 'hugoGeneSymbol': 'CDKN2AP16INK4A', 'type': 'protein-coding'}",-1.0,Nonsense_Mutation,SOMATIC,R80*,21971120.0,21971120.0,G,A,9,CDKN2AP16INK4A,298.0,185.0,9_21971120_G_A,P-0029302-T01-IM6_9_21971120_G_A,80,0.616977,P-0029302-T01-IM6_P-0029302-N01-IM6_9_21971120_G_A,Likely Loss-of-function,Oncogenic,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224203,P-0031227-T01-IM6_P-0031227-N01-IM6,P-0031227-T01-IM6,0.388603,3.180206,-0.297855,0.31,P-0031227,58.0,Non-Small Cell Lung Cancer,Lung Squamous Cell Carcinoma,Non-Spanish; Non-Hispanic,WHITE,Female,15.0,Metastasis,1.0,LIVING,10.258,0.55,14.0,"{'entrezGeneId': -1, 'hugoGeneSymbol': 'CDKN2AP16INK4A', 'type': 'protein-coding'}",-1.0,Frame_Shift_Del,SOMATIC,D105Tfs*41,21971045.0,21971045.0,C,-,9,CDKN2AP16INK4A,298.0,440.0,9_21971045_C_-,P-0031227-T01-IM6_9_21971045_C_-,105,0.403794,P-0031227-T01-IM6_P-0031227-N01-IM6_9_21971045_C_-,Gain-of-function,Oncogenic,False,True
224206,P-0031227-T01-IM6_P-0031227-N01-IM6,P-0031227-T01-IM6,0.388603,3.180206,-0.297855,0.31,P-0031227,58.0,Non-Small Cell Lung Cancer,Lung Squamous Cell Carcinoma,Non-Spanish; Non-Hispanic,WHITE,Female,15.0,Metastasis,1.0,LIVING,10.258,0.55,14.0,"{'entrezGeneId': -2, 'hugoGeneSymbol': 'CDKN2AP14ARF', 'type': 'protein-coding'}",-2.0,Frame_Shift_Del,SOMATIC,D105Tfs*41,21971045.0,21971045.0,C,-,9,CDKN2AP14ARF,298.0,440.0,9_21971045_C_-,P-0031227-T01-IM6_9_21971045_C_-,105,0.403794,P-0031227-T01-IM6_P-0031227-N01-IM6_9_21971045_C_-,Gain-of-function,Oncogenic,False,True
225382,P-0025328-T01-IM6_P-0025328-N01-IM6,P-0025328-T01-IM6,0.233438,3.802625,-0.275484,0.26,P-0025328,42.0,Cancer of Unknown Primary,Cancer of Unknown Primary,Non-Spanish; Non-Hispanic,WHITE,Female,7.0,Metastasis,2.0,LIVING,9.962,0.39,7.0,"{'entrezGeneId': 1029, 'hugoGeneSymbol': 'CDKN2A', 'type': 'protein-coding'}",1029.0,Nonsense_Mutation,SOMATIC,W110*,21971028.0,21971028.0,C,T,9,CDKN2A,203.0,446.0,9_21971028_C_T,P-0025328-T01-IM6_9_21971028_C_T,110,0.312789,P-0025328-T01-IM6_P-0025328-N01-IM6_9_21971028_C_T,Gain-of-function,Oncogenic,False,True
225386,P-0025328-T01-IM6_P-0025328-N01-IM6,P-0025328-T01-IM6,0.233438,3.802625,-0.275484,0.26,P-0025328,42.0,Cancer of Unknown Primary,Cancer of Unknown Primary,Non-Spanish; Non-Hispanic,WHITE,Female,7.0,Metastasis,2.0,LIVING,9.962,0.39,7.0,"{'entrezGeneId': -1, 'hugoGeneSymbol': 'CDKN2AP16INK4A', 'type': 'protein-coding'}",-1.0,Nonsense_Mutation,SOMATIC,W110*,21971028.0,21971028.0,C,T,9,CDKN2AP16INK4A,203.0,446.0,9_21971028_C_T,P-0025328-T01-IM6_9_21971028_C_T,110,0.312789,P-0025328-T01-IM6_P-0025328-N01-IM6_9_21971028_C_T,Gain-of-function,Oncogenic,False,True


In [72]:
maf_cohort_final[maf_cohort_final['total_mut_key'] == 'P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,total_mut_key,mutationEffect,oncogenic,vus,hotspot
7,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,"{'entrezGeneId': 2078, 'hugoGeneSymbol': 'ERG', 'type': 'protein-coding'}",2078.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,ERG,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,,P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_,,,,
8,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,"{'entrezGeneId': 7113, 'hugoGeneSymbol': 'TMPRSS2', 'type': 'protein-coding'}",7113.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,TMPRSS2,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,,P-0009819-T01-IM5_P-0009819-N01-IM5_NA_-1_NA_,,,,


In [74]:
len(maf_cohort[maf_cohort['alt_count'] == -1.0])

13278

In [59]:
maf_cohort[maf_cohort['Tumor_Id'] == 'P-0009819-T01-IM5']

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Race_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf
6,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,"{'entrezGeneId': 3169, 'hugoGeneSymbol': 'FOXA1', 'type': 'protein-coding'}",3169.0,In_Frame_Ins,SOMATIC,G157dup,38061516.0,38061517.0,-,CGC,14.0,FOXA1,41.0,236.0,14_38061516_-_CGC,P-0009819-T01-IM5_14_38061516_-_CGC,157,0.148014
7,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,"{'entrezGeneId': 2078, 'hugoGeneSymbol': 'ERG', 'type': 'protein-coding'}",2078.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,ERG,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,
8,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,0.275237,2.681075,-0.129255,0.094,P-0009819,72.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,Male,1.0,Primary,1.0,LIVING,23.441,0.0,1.0,"{'entrezGeneId': 7113, 'hugoGeneSymbol': 'TMPRSS2', 'type': 'protein-coding'}",7113.0,Fusion,,TMPRSS2-ERG fusion,-1.0,-1.0,,,,TMPRSS2,-1.0,-1.0,NA_-1_NA_,P-0009819-T01-IM5_NA_-1_NA_,2,


In [205]:
maf_cohort_unique = maf_cohort.drop_duplicates('Patient_Id')
print('Number of cohort patients without cancer type information: '+str(maf_cohort_unique['Cancer_Type'].isna().sum()))

Number of cohort patients without cancer type information: 7


In [203]:
len(set(maf_cohort.Patient_Id))

26972

In [213]:
cohort_patients = set(cohort.Patient_Id)
cbioportal_patients = set(clinical_data['Patient ID'])
maf_patients = set(maf_cohort['Patient_Id'])
mutation_patients = set(mutations_filtered['patientId'])
print(len(cohort_patients - mutation_patients))

884


In [200]:
clinical_data = pd.read_csv(data_path + 'cbioportal/raw/mskimpact_clinical_data-2.tsv', sep= '\t')
clinical_data[clinical_data['Patient ID'] == 'P-0003702']

Unnamed: 0,Patient ID,Sample ID,Cancer Type,Cancer Type Detailed,Number of Samples Per Patient,Mutation Count,Fraction Genome Altered,Sex,Ethnicity Category,Race Category,Sample Type,12-245 Part C Consented,Gene Panel,Impact TMB Score,Institute Source,MSI Score,MSI Type,Overall Survival Status,Patient Current Age,Sample coverage,Somatic Status,Tumor Purity
3924,P-0003702,P-0003702-T02-IM5,Breast Cancer,Breast Invasive Ductal Carcinoma,1,5,0.1945,Female,Non-Spanish; Non-Hispanic,WHITE,Metastasis,NO,IMPACT410,4.9,MSKCC,0.08,Stable,LIVING,55.0,191,Matched,20


In [198]:
mutations_filtered[mutations_filtered['patientId'] == 'P-0002760']

Unnamed: 0,sampleId,patientId,gene,mutationType,proteinChange,startPosition,endPosition,referenceAllele,variantAllele,chr,hugoGeneSymbol,mut_key,sample_mut_key,mut_spot


In [61]:
maf_cohort.vaf.isna().sum()

1070

In [43]:
mutations = pd.read_pickle(data_path + 'cbioportal/raw/mutations_cohort.pkl')

def cond(x):
    return list(x.gene.values())[1]

mutations['hugo_gene_symbol'] = mutations.apply(cond, axis=1)

In [45]:
mutations.to_csv(data_path + 'cbioportal/raw/mutations_cohort.tsv')

In [22]:
list(maf_cohort[maf_cohort['mutationStatus'] == 'GERMLINE'][maf_cohort['Hugo_Symbol'] == 'TP53'].Sample_Id)

['P-0034797-T02-IM6_P-0034797-N01-IM6',
 'P-0035720-T01-IM6_P-0035720-N01-IM6',
 'P-0012425-T01-IM5_P-0012425-N01-IM5',
 'P-0034664-T01-IM6_P-0034664-N01-IM6',
 'P-0009076-T01-IM5_P-0009076-N01-IM5',
 'P-0009076-T04-IM6_P-0009076-N01-IM6',
 'P-0009076-T05-IM6_P-0009076-N01-IM6',
 'P-0009076-T06-IM6_P-0009076-N01-IM6',
 'P-0014726-T01-IM6_P-0014726-N01-IM6',
 'P-0017801-T01-IM6_P-0017801-N01-IM6',
 'P-0017801-T02-IM6_P-0017801-N01-IM6',
 'P-0016090-T01-IM6_P-0016090-N01-IM6',
 'P-0011220-T01-IM5_P-0011220-N01-IM5',
 'P-0019349-T01-IM6_P-0019349-N01-IM6',
 'P-0031705-T01-IM6_P-0031705-N01-IM6',
 'P-0021028-T01-IM6_P-0021028-N01-IM6',
 'P-0014507-T01-IM6_P-0014507-N01-IM6',
 'P-0001586-T02-IM5_P-0001586-N01-IM5',
 'P-0037899-T01-IM6_P-0037899-N01-IM6',
 'P-0040092-T01-IM6_P-0040092-N01-IM6',
 'P-0040825-T01-IM6_P-0040825-N01-IM6',
 'P-0041242-T01-IM6_P-0041242-N01-IM6',
 'P-0043787-T01-IM6_P-0043787-N01-IM6',
 'P-0032271-T01-IM6_P-0032271-N01-IM6',
 'P-0046911-T01-IM6_P-0046911-N01-IM6',


In [79]:
annotated_data = pd.read_pickle(data_path + 'maf_cohort_annotated.pkl')
annotated_samples = list(set(annotated_data.Sample_Id))
maf_samples = list(set(maf_cohort.Sample_Id))
print('Samples',len(annotated_samples), len(maf_samples))
print('Number of mutations',len(annotated_data), len(maf_cohort))

Samples 27898 29304
Number of mutations 246466 260813


In [82]:
annotated_data.head(3)

Unnamed: 0,Sample_Id,Tumor_Id,purity,ploidy,dipLogR,frac_loh,Patient_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Sex,Mutation_Count,Sample_Type,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,gene,Gene_Id,Variant_Classification,mutationStatus,proteinChange,Start_Position,End_Position,Reference_Allele,Variant_Allele,Chromosome,Hugo_Symbol,alt_count,ref_count,mut_key,sample_mut_key,mut_spot,vaf,mutationEffect,oncogenic,vus,hotspot
189187,P-0038582-T01-IM6_P-0038582-N01-IM6,P-0038582-T01-IM6,0.280299,2.45048,-0.088324,0.58,P-0038582,69.0,Esophagogastric Cancer,Esophageal Adenocarcinoma,Non-Spanish; Non-Hispanic,Male,19.0,Primary,1.0,DECEASED,4.175,0.0,16.7,"{'entrezGeneId': 7253, 'hugoGeneSymbol': 'TSHR', 'type': 'protein-coding'}",7253.0,Missense_Mutation,SOMATIC,D633N,81610299.0,81610299.0,G,A,14,TSHR,76.0,628.0,14_81610299_G_A,P-0038582-T01-IM6_14_81610299_G_A,633,0.107955,Unknown,,False,False
165005,P-0032563-T01-IM6_P-0032563-N01-IM6,P-0032563-T01-IM6,0.399005,1.795841,0.059992,0.32,P-0032563,41.0,Hepatobiliary Cancer,Intrahepatic Cholangiocarcinoma,Non-Spanish; Non-Hispanic,Male,2.0,Primary,1.0,DECEASED,2.992,0.25,1.8,"{'entrezGeneId': 55193, 'hugoGeneSymbol': 'PBRM1', 'type': 'protein-coding'}",55193.0,Missense_Mutation,SOMATIC,R522Q,52651531.0,52651531.0,C,T,3,PBRM1,148.0,432.0,3_52651531_C_T,P-0032563-T01-IM6_3_52651531_C_T,522,0.255172,Unknown,Predicted Oncogenic,True,True
259480,P-0050547-T01-IM6_P-0050547-N01-IM6,P-0050547-T01-IM6,0.68538,3.505192,-0.600093,0.16,P-0050547,65.0,Bladder Cancer,Bladder Urothelial Carcinoma,Non-Spanish; Non-Hispanic,Male,7.0,Primary,1.0,LIVING,2.071,1.17,6.1,"{'entrezGeneId': 6009, 'hugoGeneSymbol': 'RHEB', 'type': 'protein-coding'}",6009.0,Missense_Mutation,SOMATIC,S148C,151167676.0,151167676.0,G,C,7,RHEB,270.0,239.0,7_151167676_G_C,P-0050547-T01-IM6_7_151167676_G_C,148,0.530452,Unknown,,False,False


In [96]:
ARID = annotated_data[annotated_data['Hugo_Symbol'] == 'ARID1A']
ARID_Miss = ARID[ARID['Variant_Classification'] == 'Missense_Mutation']
ARID_Non = ARID[ARID['Variant_Classification'] == 'Nonsense_Mutation']
get_groupby(ARID_Non, 'oncogenic', 'count')

Unnamed: 0_level_0,count
oncogenic,Unnamed: 1_level_1
,19
Likely Oncogenic,508
Oncogenic,2


In [91]:
get_groupby(annotated_data, 'mutationEffect', 'count')

Unnamed: 0_level_0,count
mutationEffect,Unnamed: 1_level_1
Gain-of-function,10246
Inconclusive,71
Likely Gain-of-function,2354
Likely Loss-of-function,40421
Likely Neutral,148
Likely Switch-of-function,43
Loss-of-function,3819
Neutral,15
,21
Switch-of-function,476


In [39]:
get_groupby(annotated_data, 'mutationEffect', 'count')

Unnamed: 0_level_0,count
mutationEffect,Unnamed: 1_level_1
Gain-of-function,10246
Inconclusive,71
Likely Gain-of-function,2354
Likely Loss-of-function,40421
Likely Neutral,148
Likely Switch-of-function,43
Loss-of-function,3819
Neutral,15
,21
Switch-of-function,476


In [None]:
maf_cohort_final.mutationStatus

In [45]:
master = load_clean_up_master(data_path + 'merged_data/master_file.pkl')
h = list(master[master[''']>=1]['Tumor_Id'])
clinical_data_tp53 = clinical_data[clinical_data['Sample ID'].isin(h)]

In [46]:
print(h[0])

P-0027408-T01-IM6


In [47]:
get_groupby(clinical_data_tp53, 'Somatic Status', 'count')

Unnamed: 0_level_0,count
Somatic Status,Unnamed: 1_level_1
Matched,11466
Unmatched,61


In [48]:
get_groupby(clinical_data, 'Somatic Status', 'count')

Unnamed: 0_level_0,count
Somatic Status,Unnamed: 1_level_1
Matched,49982
Unmatched,2988
