# <span style='font-family:"Times New Roman"'> <span style=''>**TP53 MAF FILE CREATION**
## <span style='font-family:"Times New Roman"'> <span style=''>*Emile Cohen*
 *February 2020*

**Goal:** In this notebook, we filter the raw data from three sources: *default_qc_pass.ccf_TP53.maf* in impact-facets-tp53, *mutations.pkl* in cbioportal/downloaded, and *mskimpact_clinical_data.tsv* in cbioportal/downloaded. We then merge the filtered tables.

We also create a new feature, *sample_mut_key*, that identifies each mutation within a sample.

The notebook is composed of 2 parts:
   * **1. Loading & Filtering the raw data**
   * **2. Adding new keys & Merging**
---

In [8]:
%run -i '../../utils/setup_environment.ipy'

import warnings
warnings.filterwarnings('ignore')

data_path = '../../data/'

Setup environment... done!


<span style="color:green">✅ Working on **mskimpact_env** conda environment.</span>

## Loading & Filtering the raw data
---

In [9]:
ccf_tp53 = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.ccf_TP53.maf', sep='\t')
clinical_data = pd.read_csv(data_path + 'cbioportal/raw/mskimpact_clinical_data-2.tsv', sep= '\t')

In [10]:
# Combine both conditions in a single boolean mask instead of chaining two selections
ccf_tp53[(ccf_tp53['tcn'] == 2) & (ccf_tp53['lcn'] == 1)]

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,dbSNP_RS,dbSNP_Val_Status,Tumor_Sample_Barcode,Matched_Norm_Sample_Barcode,Match_Norm_Seq_Allele1,Match_Norm_Seq_Allele2,Tumor_Validation_Allele1,Tumor_Validation_Allele2,Match_Norm_Validation_Allele1,Match_Norm_Validation_Allele2,Verification_Status,Validation_Status,Mutation_Status,Sequencing_Phase,Sequence_Source,Validation_Method,Score,BAM_File,Sequencer,Tumor_Sample_UUID,Matched_Norm_Sample_UUID,HGVSc,HGVSp,HGVSp_Short,Transcript_ID,Exon_Number,t_depth,t_ref_count,t_alt_count,n_depth,n_ref_count,n_alt_count,all_effects,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,ALLELE_NUM,DISTANCE,STRAND_VEP,SYMBOL,SYMBOL_SOURCE,HGNC_ID,BIOTYPE,CANONICAL,CCDS,ENSP,SWISSPROT,TREMBL,UNIPARC,RefSeq,SIFT,PolyPhen,EXON,INTRON,DOMAINS,GMAF,AFR_MAF,AMR_MAF,ASN_MAF,EAS_MAF,EUR_MAF,SAS_MAF,AA_MAF,EA_MAF,CLIN_SIG,SOMATIC,PUBMED,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,IMPACT,PICK,VARIANT_CLASS,TSL,HGVS_OFFSET,PHENO,MINIMISED,ExAC_AF,ExAC_AF_AFR,ExAC_AF_AMR,ExAC_AF_EAS,ExAC_AF_FIN,ExAC_AF_NFE,ExAC_AF_OTH,ExAC_AF_SAS,GENE_PHENO,FILTER,flanking_bps,variant_id,variant_qual,ExAC_AF_Adj,ExAC_AC_AN_Adj,ExAC_AC_AN,ExAC_AC_AN_AFR,ExAC_AC_AN_AMR,ExAC_AC_AN_EAS,ExAC_AC_AN_FIN,ExAC_AC_AN_NFE,ExAC_AC_AN_OTH,ExAC_AC_AN_SAS,ExAC_FILTER,Caller,is-a-hotspot,is-a-3d-hotspot,mutation_effect,oncogenic,LEVEL_1,LEVEL_2A,LEVEL_2B,LEVEL_3A,LEVEL_3B,LEVEL_4,LEVEL_R1,LEVEL_R2,LEVEL_R3,Highest_level,citations,driver,tcn,lcn,cf,purity,t_var_freq,expected_alt_copies,ccf_Mcopies,ccf_Mcopies_lower,ccf_Mcopies_upper,ccf_Mcopies_prob95,ccf_Mcopies_prob90,ccf_1copy,ccf_1copy_lower,ccf_1copy_upper,ccf_1copy_prob95,ccf_1copy_prob90,ccf_expected_copies,ccf_expected_copies_lower,ccf_expected_copies_upper,ccf_expected_copies_prob95,ccf_expected
_copies_prob90,facets_fit,facets_suite_qc,reviewer_set_purity,use_only_purity_run,use_edited_cncf,cncf_file_used
17,TP53,7157,MSKCC,GRCh37,17,7578442,7578442,+,Missense_Mutation,SNP,T,T,C,novel,,P-0032875-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.488A>G,p.Tyr163Cys,p.Y163C,ENST00000269305,5/11,848,749,99,649,649,0,"TP53,missense_variant,p.Tyr163Cys,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,missense_variant,p.Tyr163Cys,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,missense_variant,p.Tyr163Cys,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,missense_variant,p.Tyr163Cys,ENST00000445888,;TP53,missense_variant,p.Tyr163Cys,ENST00000359597,;TP53,missense_variant,p.Tyr163Cys,ENST00000413465,;TP53,missense_variant,p.Tyr31Cys,ENST00000509690,;TP53,missense_variant,p.Tyr163Cys,ENST00000508793,;TP53,missense_variant,p.Tyr70Cys,ENST00000514944,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,upstream_gene_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,non_coding_transcript_exon_variant,,ENST0000050...",C,ENSG00000141510,ENST00000269305,Transcript,missense_variant,678/2579,488/1182,163/393,Y/C,tAc/tGc,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",deleterious(0),probably_damaging(0.999),5/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,MODERATE,1.0,indel,,,,,,,,,,,,,1,,GTA,.,.,,,,,,,,,,,,,Y,Y,Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,25584008;8023157;11900253;15077194;23246812;15037740,True,2.0,1.0,1.0,0.201282,0.116745,1.0,1.000,0.960,1.000,6.840427e-01,9.228051e-01,1.000,0.960,1.000,6.840427e-01,9.228051e-01,1.000,0.960,1.000,6.840427e-01,9.228051e-01,/juno/work/ccs/resources/impact/facets/all/P-00328/P-0032875-T01-IM6_P-0032875-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00328/P-0032875-T01-IM6_P-0032875-N01-IM6//default/P-0032875-T01-IM6_P-0032875-N01-IM6_hisens.cncf.txt
19,TP53,7157,MSKCC,GRCh37,17,7577090,7577090,+,Frame_Shift_Del,DEL,C,C,-,novel,,P-0033129-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.848delG,p.Arg283ProfsTer62,p.R283Pfs*62,ENST00000269305,8/11,738,619,119,549,549,0,"TP53,frameshift_variant,p.Arg283ProfsTer68,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,frameshift_variant,p.Arg283ProfsTer60,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,frameshift_variant,p.Arg283ProfsTer62,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,frameshift_variant,p.Arg283ProfsTer62,ENST00000445888,;TP53,frameshift_variant,p.Arg283ProfsTer69,ENST00000359597,;TP53,frameshift_variant,p.Arg151ProfsTer?,ENST00000509690,;TP53,intron_variant,,ENST00000413465,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,downstream_gene_variant,,ENST00000514944,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,downstream_gene_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,downstream_...",-,ENSG00000141510,ENST00000269305,Transcript,frameshift_variant,1038/2579,848/1182,283/393,R/X,cGc/cc,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",,,8/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,HIGH,1.0,sequence_alteration,,,,,,,,,,,,,1,,TGCG,.,.,,,,,,,,,,,,,,,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,19336573;27759562;16007150;11753428;11900253;21467160,True,2.0,1.0,1.0,0.219131,0.161247,1.0,1.000,0.984,1.000,9.172547e-01,9.949473e-01,1.000,0.984,1.000,9.172547e-01,9.949473e-01,1.000,0.984,1.000,9.172547e-01,9.949473e-01,/juno/work/ccs/resources/impact/facets/all/P-00331/P-0033129-T01-IM6_P-0033129-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00331/P-0033129-T01-IM6_P-0033129-N01-IM6//default/P-0033129-T01-IM6_P-0033129-N01-IM6_hisens.cncf.txt
23,TP53,7157,MSKCC,GRCh37,17,7578553,7578553,+,Missense_Mutation,SNP,T,T,C,novel,,P-0021226-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.377A>G,p.Tyr126Cys,p.Y126C,ENST00000269305,5/11,1063,978,85,827,827,0,"TP53,missense_variant,p.Tyr126Cys,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,missense_variant,p.Tyr126Cys,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,missense_variant,p.Tyr126Cys,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,missense_variant,p.Tyr126Cys,ENST00000445888,;TP53,missense_variant,p.Tyr126Cys,ENST00000359597,;TP53,missense_variant,p.Tyr126Cys,ENST00000413465,;TP53,missense_variant,p.Tyr126Cys,ENST00000508793,;TP53,missense_variant,p.Tyr126Cys,ENST00000503591,;TP53,missense_variant,p.Tyr33Cys,ENST00000514944,;TP53,splice_region_variant,,ENST00000509690,;TP53,intron_variant,,ENST00000604348,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,upstream_gene_variant,,ENST00000574684,;TP53,splice_region_variant,,ENST00000505014,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;",C,ENSG00000141510,ENST00000269305,Transcript,"missense_variant,splice_region_variant",567/2579,377/1182,126/393,Y/C,tAc/tGc,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",deleterious(0),probably_damaging(1),5/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,MODERATE,1.0,indel,,,,,,,,,,,,,1,,GTA,.,.,,,,,,,,,,,,,Y,,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,8023157;11900253,True,2.0,1.0,1.0,0.288492,0.079962,1.0,0.554,0.488,0.625,4.520065e-09,1.726415e-07,0.554,0.488,0.625,4.520065e-09,1.726415e-07,0.554,0.488,0.625,4.520065e-09,1.726415e-07,/juno/work/ccs/resources/impact/facets/all/P-00212/P-0021226-T01-IM6_P-0021226-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00212/P-0021226-T01-IM6_P-0021226-N01-IM6//default/P-0021226-T01-IM6_P-0021226-N01-IM6_hisens.cncf.txt
39,TP53,7157,MSKCC,GRCh37,17,7577094,7577094,+,Missense_Mutation,SNP,G,G,A,novel,,P-0022536-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.844C>T,p.Arg282Trp,p.R282W,ENST00000269305,8/11,1084,760,324,759,758,1,"TP53,missense_variant,p.Arg282Trp,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,missense_variant,p.Arg282Trp,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,missense_variant,p.Arg282Trp,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,missense_variant,p.Arg282Trp,ENST00000445888,;TP53,missense_variant,p.Arg282Trp,ENST00000359597,;TP53,missense_variant,p.Arg150Trp,ENST00000509690,;TP53,intron_variant,,ENST00000413465,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,downstream_gene_variant,,ENST00000514944,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,downstream_gene_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,downstream_gene_variant,,ENST00000505014,;",A,ENSG00000141510,ENST00000269305,Transcript,missense_variant,1034/2579,844/1182,282/393,R/W,Cgg/Tgg,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",deleterious(0),probably_damaging(0.997),8/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,MODERATE,1.0,indel,,,,,,,,,,,,,1,,CGG,.,.,,,,,,,,,,,,,Y,Y,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,21445056,True,2.0,1.0,1.0,0.621794,0.298893,1.0,0.961,0.909,1.000,5.165369e-01,9.032131e-01,0.961,0.909,1.000,5.165369e-01,9.032131e-01,0.961,0.909,1.000,5.165369e-01,9.032131e-01,/juno/work/ccs/resources/impact/facets/all/P-00225/P-0022536-T01-IM6_P-0022536-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00225/P-0022536-T01-IM6_P-0022536-N01-IM6//default/P-0022536-T01-IM6_P-0022536-N01-IM6_hisens.cncf.txt
40,TP53,7157,MSKCC,GRCh37,17,7577120,7577120,+,Missense_Mutation,SNP,C,C,T,novel,,P-0022536-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.818G>A,p.Arg273His,p.R273H,ENST00000269305,8/11,1009,692,317,660,656,4,"TP53,missense_variant,p.Arg273His,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,missense_variant,p.Arg273His,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,missense_variant,p.Arg273His,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,missense_variant,p.Arg273His,ENST00000445888,;TP53,missense_variant,p.Arg273His,ENST00000359597,;TP53,missense_variant,p.Arg141His,ENST00000509690,;TP53,intron_variant,,ENST00000413465,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,downstream_gene_variant,,ENST00000514944,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,downstream_gene_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,downstream_gene_variant,,ENST00000505014,;",T,ENSG00000141510,ENST00000269305,Transcript,missense_variant,1008/2579,818/1182,273/393,R/H,cGt/cAt,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",tolerated(0.13),possibly_damaging(0.631),8/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,MODERATE,1.0,indel,,,,,,,,,,,,,1,,ACG,.,.,,,,,,,,,,,,,Y,Y,Loss-of-function,Oncogenic,,,,,,,,,,,25584008;26181206;15037740,True,2.0,1.0,1.0,0.621794,0.314172,1.0,1.0
00,0.954,1.000,7.760786e-01,9.816952e-01,1.000,0.954,1.000,7.760786e-01,9.816952e-01,1.000,0.954,1.000,7.760786e-01,9.816952e-01,/juno/work/ccs/resources/impact/facets/all/P-00225/P-0022536-T01-IM6_P-0022536-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00225/P-0022536-T01-IM6_P-0022536-N01-IM6//default/P-0022536-T01-IM6_P-0022536-N01-IM6_hisens.cncf.txt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13874,TP53,7157,MSKCC,GRCh37,17,7576858,7576858,+,Missense_Mutation,SNP,G,G,T,novel,,P-0050204-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.988C>A,p.Leu330Ile,p.L330I,ENST00000269305,9/11,743,613,130,572,572,0,"TP53,missense_variant,p.Leu330Ile,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,missense_variant,p.Leu330Ile,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,missense_variant,p.Leu330Ile,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,missense_variant,p.Leu330Ile,ENST00000445888,;TP53,missense_variant,p.Leu330Ile,ENST00000359597,;TP53,missense_variant,p.Leu198Ile,ENST00000509690,;TP53,missense_variant,p.Leu17Ile,ENST00000576024,;TP53,intron_variant,,ENST00000413465,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,downstream_gene_variant,,ENST00000514944,;TP53,downstream_gene_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,downstream_gene_variant,,ENST00000505014,;",T,ENSG00000141510,ENST00000269305,Transcript,missense_variant,1178/2579,988/1182,330/393,L/I,Ctt/Att,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",deleterious(0.03),possibly_damaging(0.654),9/11,,"Gene3D:1olgA00,Pfam_domain:PF07710,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF47719",,,,,,,,,,,,,,,,,MODERATE,1.0,indel,,,,,,,,,,,,,1,,AGG,.,.,,,,,,,,,,,,,,Y,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,19454241;20978130,True,2.0,1.0,1.0,0.393438,0.174966,1.0,0.889,0.808,0.975,1.553216e-01,4.198529e-01,0.889,0.808,0.975,1.553216e-01,4.198529e-01,0.889,0.808,0.975,1.553216e-01,4.198529e-01,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default/P-0050204-T01-IM6_P-0050204-N01-IM6_hisens.cncf.txt
13875,TP53,7157,MSKCC,GRCh37,17,7577504,7577504,+,Missense_Mutation,SNP,G,G,T,novel,,P-0050204-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.777C>A,p.Asp259Glu,p.D259E,ENST00000269305,7/11,631,407,224,605,604,1,"TP53,missense_variant,p.Asp259Glu,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,missense_variant,p.Asp259Glu,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,missense_variant,p.Asp259Glu,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,missense_variant,p.Asp259Glu,ENST00000445888,;TP53,missense_variant,p.Asp259Glu,ENST00000359597,;TP53,missense_variant,p.Asp259Glu,ENST00000413465,;TP53,missense_variant,p.Asp127Glu,ENST00000509690,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,downstream_gene_variant,,ENST00000514944,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,downstream_gene_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,downstream_gene_variant,,ENST00000505014,;",T,ENSG00000141510,ENST00000269305,Transcript,missense_variant,967/2579,777/1182,259/393,D/E,gaC/gaA,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",tolerated(0.05),possibly_damaging(0.874),7/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,MODERATE,1.0,indel,,,,,,,,,,,,,1,,AGT,.,.,,,,,,,,,,,,,Y,,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,8023157;11900253,True,2.0,1.0,1.0,0.393438,0.354992,2.0,1.000,0.994,1.000,9.988490e-01,9.999992e-01,1.000,0.994,1.000,9.988490e-01,9.999992e-01,1.000,0.994,1.000,9.988490e-01,9.999992e-01,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default/P-0050204-T01-IM6_P-0050204-N01-IM6_hisens.cncf.txt
13876,TP53,7157,MSKCC,GRCh37,17,7578212,7578212,+,Nonsense_Mutation,SNP,G,G,A,novel,,P-0050204-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.637C>T,p.Arg213Ter,p.R213*,ENST00000269305,6/11,806,241,565,633,633,0,"TP53,stop_gained,p.Arg213Ter,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,stop_gained,p.Arg213Ter,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,stop_gained,p.Arg213Ter,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,stop_gained,p.Arg213Ter,ENST00000445888,;TP53,stop_gained,p.Arg213Ter,ENST00000359597,;TP53,stop_gained,p.Arg213Ter,ENST00000413465,;TP53,stop_gained,p.Arg81Ter,ENST00000509690,;TP53,stop_gained,p.Arg120Ter,ENST00000514944,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,intron_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,non_coding_transcript_exon_variant,,ENST00000505014,;",A,ENSG00000141510,ENST00000269305,Transcript,stop_gained,827/2579,637/1182,213/393,R/*,Cga/Tga,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",,,6/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,Prints_domain:PR00386,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,HIGH,1.0,indel,,,,,,,,,,,,,1,,CGA,.,.,,,,,,,,,,,,,,,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,19336573;27759562;16007150;11753428;11900253;21467160,True,2.0,1.0,1.0,0.393438,0.700993,4.0,1.000,0.998,1.000,1.000000e+00,1.000000e+00,1.000,0.998,1.000,1.000000e+00,1.000000e+00,1.000,0.998,1.000,1.000000e+00,1.000000e+00,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default/P-0050204-T01-IM6_P-0050204-N01-IM6_hisens.cncf.txt
13877,TP53,7157,MSKCC,GRCh37,17,7578263,7578263,+,Nonsense_Mutation,SNP,G,G,A,novel,,P-0050204-T01-IM6,,,,,,,,,Unknown,SOMATIC,,,,MSK-IMPACT,,,,,c.586C>T,p.Arg196Ter,p.R196*,ENST00000269305,6/11,693,458,235,430,430,0,"TP53,stop_gained,p.Arg196Ter,ENST00000420246,NM_001126114.2,NM_001276696.1;TP53,stop_gained,p.Arg196Ter,ENST00000455263,NM_001276695.1,NM_001126113.2;TP53,stop_gained,p.Arg196Ter,ENST00000269305,NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1;TP53,stop_gained,p.Arg196Ter,ENST00000445888,;TP53,stop_gained,p.Arg196Ter,ENST00000359597,;TP53,stop_gained,p.Arg196Ter,ENST00000413465,;TP53,stop_gained,p.Arg64Ter,ENST00000509690,;TP53,stop_gained,p.Arg103Ter,ENST00000514944,;TP53,downstream_gene_variant,,ENST00000508793,;TP53,downstream_gene_variant,,ENST00000604348,;TP53,downstream_gene_variant,,ENST00000503591,;TP53,upstream_gene_variant,,ENST00000576024,;TP53,intron_variant,,ENST00000574684,;TP53,non_coding_transcript_exon_variant,,ENST00000510385,;TP53,non_coding_transcript_exon_variant,,ENST00000504290,;TP53,non_coding_transcript_exon_variant,,ENST00000504937,;TP53,non_coding_transcript_exon_variant,,ENST00000505014,;",A,ENSG00000141510,ENST00000269305,Transcript,stop_gained,776/2579,586/1182,196/393,R/*,Cga/Tga,,1,,-1,TP53,HGNC,11998,protein_coding,YES,CCDS11118.1,ENSP00000269305,P04637,"S5LQU8,Q761V2,Q6IT77,Q1HGV1,Q0PKT5,L0ES54,L0EQ05,K7PPA8,H2EHT1,G4Y083,E9PCY9,E7ESS1,E7EMR6,B5AKF6,B4DNI2,A4GWD0,A4GWB8,A4GWB5,A4GW97,A4GW76,A4GW75,A4GW74,A4GW67,A2I9Z1,A2I9Z0",UPI000002ED67,"NM_001126112.2,NM_001276761.1,NM_001276760.1,NM_000546.5,NM_001126118.1",,,6/11,,"Gene3D:2.60.40.720,Pfam_domain:PF00870,hmmpanther:PTHR11447,hmmpanther:PTHR11447:SF6,Superfamily_domains:SSF49417",,,,,,,,,,,,,,,,,HIGH,1.0,indel,,,,,,,,,,,,,1,,CGG,.,.,,,,,,,,,,,,,,,Likely Loss-of-function,Likely 
Oncogenic,,,,,,,,,,,19336573;27759562;16007150;11753428;11900253;21467160,True,2.0,1.0,1.0,0.393438,0.339105,2.0,1.000,0.994,1.000,9.987885e-01,9.999992e-01,1.000,0.994,1.000,9.987885e-01,9.999992e-01,1.000,0.994,1.000,9.987885e-01,9.999992e-01,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default,True,,False,False,/juno/work/ccs/resources/impact/facets/all/P-00502/P-0050204-T01-IM6_P-0050204-N01-IM6//default/P-0050204-T01-IM6_P-0050204-N01-IM6_hisens.cncf.txt
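
The selection in cell `In [10]` keeps the rows where `tcn` is 2 and `lcn` is 1. The sketch below, on made-up data, shows one way to write such a filter as a single boolean mask. It assumes that `tcn` and `lcn` hold the total and lesser copy number reported by FACETS, which this notebook does not state explicitly; `toy` and `diploid_het` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for ccf_tp53 with only the two copy-number columns (values are made up)
toy = pd.DataFrame({
    'tcn': [2.0, 2.0, 3.0, 4.0, 2.0],
    'lcn': [1.0, 0.0, 1.0, 2.0, 1.0],
})

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (toy['tcn'] == 2) & (toy['lcn'] == 1)
diploid_het = toy[mask]
print(diploid_het.index.tolist())
```

Building the mask once also makes it easy to count the selected rows with `mask.sum()` before subsetting.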


In [13]:
get_groupby(ccf_tp53, 'oncogenic', 'count')

oncogenic,count
Likely Oncogenic,12221
Oncogenic,1522
Predicted Oncogenic,16
Unknown,1


In [12]:
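# ccf_tp53 has no 'ccf' column (selecting it raises a KeyError).
# Assumption: 'ccf_expected_copies', the CCF column kept later in this notebook, is the intended one.
ccf_tp53_scatter = ccf_tp53[['cf', 'ccf_expected_copies']]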
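
Before selecting CCF columns, it can help to check which requested names are missing and which existing columns share the `ccf` prefix. The sketch below does this on a toy frame that reuses a few column names from the MAF header; the values are made up, and `toy`, `wanted`, `missing` and `candidates` are hypothetical names.

```python
import pandas as pd

# Toy frame reusing a few ccf_tp53 column names (values are made up)
toy = pd.DataFrame({
    'cf': [0.31, 0.83],
    'ccf_Mcopies': [0.92, 0.93],
    'ccf_1copy': [0.92, 0.93],
    'ccf_expected_copies': [0.925, 0.935],
})

wanted = ['cf', 'ccf']
# Requested names that are absent and would raise a KeyError on selection
missing = [c for c in wanted if c not in toy.columns]
# Existing columns whose name starts with 'ccf'
candidates = [c for c in toy.columns if c.startswith('ccf')]
print(missing)
print(candidates)
```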

In [None]:
ccf_tp53.t_var_freq.describe()

In [12]:
#set(ccf_tp53.oncogenic)
get_groupby(ccf_tp53, 'oncogenic')

oncogenic,count
Likely Oncogenic,12221
Oncogenic,1522
Predicted Oncogenic,16
Unknown,1


In [13]:
print('Variant_Classification: ' + str(set(ccf_tp53.Variant_Classification)))
print('Variant_Type: ' + str(set(ccf_tp53.Variant_Type)))

Variant_Classification: {'Nonstop_Mutation', 'Splice_Site', 'In_Frame_Del', 'Frame_Shift_Ins', 'In_Frame_Ins', 'Nonsense_Mutation', 'Splice_Region', 'Intron', 'Missense_Mutation', 'Frame_Shift_Del'}
Variant_Type: {'TNP', 'DEL', 'INS', 'ONP', 'DNP', 'SNP'}


In [15]:
print_md('ccf_tp53 columns:','green')
for column in ccf_tp53.columns: print(column)

<span style="color:green">ccf_tp53 columns:</span>

Hugo_Symbol
Entrez_Gene_Id
Center
NCBI_Build
Chromosome
Start_Position
End_Position
Strand
Variant_Classification
Variant_Type
Reference_Allele
Tumor_Seq_Allele1
Tumor_Seq_Allele2
dbSNP_RS
dbSNP_Val_Status
Tumor_Sample_Barcode
Matched_Norm_Sample_Barcode
Match_Norm_Seq_Allele1
Match_Norm_Seq_Allele2
Tumor_Validation_Allele1
Tumor_Validation_Allele2
Match_Norm_Validation_Allele1
Match_Norm_Validation_Allele2
Verification_Status
Validation_Status
Mutation_Status
Sequencing_Phase
Sequence_Source
Validation_Method
Score
BAM_File
Sequencer
Tumor_Sample_UUID
Matched_Norm_Sample_UUID
HGVSc
HGVSp
HGVSp_Short
Transcript_ID
Exon_Number
t_depth
t_ref_count
t_alt_count
n_depth
n_ref_count
n_alt_count
all_effects
Allele
Gene
Feature
Feature_type
Consequence
cDNA_position
CDS_position
Protein_position
Amino_acids
Codons
Existing_variation
ALLELE_NUM
DISTANCE
STRAND_VEP
SYMBOL
SYMBOL_SOURCE
HGNC_ID
BIOTYPE
CANONICAL
CCDS
ENSP
SWISSPROT
TREMBL
UNIPARC
RefSeq
SIFT
PolyPhen
EXON
INTRON
DOMAINS
GMAF
AFR_

In [16]:
clinical_data.head(1)

Unnamed: 0,Patient ID,Sample ID,Cancer Type,Cancer Type Detailed,Number of Samples Per Patient,Mutation Count,Fraction Genome Altered,Sex,Ethnicity Category,Race Category,Sample Type,12-245 Part C Consented,Gene Panel,Impact TMB Score,Institute Source,MSI Score,MSI Type,Overall Survival Status,Patient Current Age,Sample coverage,Somatic Status,Tumor Purity
0,P-0000004,P-0000004-T01-IM3,Breast Cancer,Breast Invasive Ductal Carcinoma,1,4,0.2782,Female,Non-Spanish; Non-Hispanic,WHITE,Primary,NO,IMPACT341,4.5,MSKCC,2.5,Stable,DECEASED,40.0,428,Matched,50


In [17]:
print_md('clinical_data columns:','green')
for column in clinical_data.columns: print(column)

<span style="color:green">clinical_data columns:</span>

Patient ID
Sample ID
Cancer Type
Cancer Type Detailed
Number of Samples Per Patient
Mutation Count
Fraction Genome Altered
Sex
Ethnicity Category
Race Category
Sample Type
12-245 Part C Consented
Gene Panel
Impact TMB Score
Institute Source
MSI Score
MSI Type
Overall Survival Status
Patient Current Age
Sample coverage
Somatic Status
Tumor Purity


---
**We will keep the following columns of *ccf_tp53*:**
* Hugo_Symbol
* Chromosome
* Start_Position
* End_Position
* Variant_Classification
* Variant_Type
* Reference_Allele
* Tumor_Seq_Allele1
* Tumor_Seq_Allele2
* HGVSp
* Consequence
* mutation_effect
* cf
* ccf_expected_copies
* purity
* t_var_freq
* Tumor_Sample_Barcode

**We will keep the following columns of *clinical_data*:**
* Sample ID
* Patient Current Age
* Cancer Type 
* Cancer Type Detailed
* Ethnicity Category
* Sex
* Mutation Count
* Sample Type
* Number of Samples Per Patient 

In [18]:
filter_ccf = ['Hugo_Symbol',
            'Chromosome',
            'Start_Position',
            'End_Position',
            'Variant_Classification',
            'Variant_Type',
            'Reference_Allele',
            'Tumor_Seq_Allele1',
            'Tumor_Seq_Allele2',
            'HGVSp',
            'Consequence',
            'mutation_effect',
            'cf',
            'ccf_expected_copies',
            'purity',
            't_var_freq', 
            'Tumor_Sample_Barcode']

filter_clinical = [ 'Sample ID',
 'Patient Current Age',
 'Cancer Type' ,
 'Cancer Type Detailed',
 'Ethnicity Category' ,
 'Sex',
 'Sample Type',
 'Number of Samples Per Patient',
 'Mutation Count']

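# Copy the selections so that later column assignments do not act on a view of the raw frames
ccf_tp53_filtered = ccf_tp53[filter_ccf].copy()
clinical_data_filtered = clinical_data[filter_clinical].copy()
# Rename the clinical columns (same order as filter_clinical)
clinical_data_filtered.columns = ['Sample_Id',
                                  'Patient_Current_Age',
                                  'Cancer_Type',
                                  'Cancer_Type_Detailed',
                                  'Ethnicity_Category',
                                  'Sex',
                                  'Sample_Type',
                                  'samples_per_patient',
                                  'mutation_count']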
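
Assigning a list to `.columns` relies on the new names being in exactly the same order as `filter_clinical`. One possible alternative, sketched here on a toy table with made-up values, is an explicit old-to-new mapping passed to `DataFrame.rename`. The names `toy_clinical`, `rename_map` and `toy_renamed` are hypothetical, and only three of the nine columns are shown.

```python
import pandas as pd

# Toy clinical table with three of the original column names (values are made up)
toy_clinical = pd.DataFrame({
    'Sample ID': ['S1', 'S2'],
    'Cancer Type': ['Breast Cancer', 'Glioma'],
    'Number of Samples Per Patient': [1, 2],
})

# Explicit old -> new mapping, so the result does not depend on column order
rename_map = {
    'Sample ID': 'Sample_Id',
    'Cancer Type': 'Cancer_Type',
    'Number of Samples Per Patient': 'samples_per_patient',
}
# Select the mapped columns, then rename them in one step
toy_renamed = toy_clinical[list(rename_map)].rename(columns=rename_map)
print(toy_renamed.columns.tolist())
```

With a mapping, reordering or extending the selected columns cannot silently attach a new name to the wrong column.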

---
## Adding new keys & Merging

We create 5 new columns in ccf_tp53:
* *mut_key*: mutation key that fully describes the mutation (chromosome, start position, reference allele, tumor allele)
* *sample_mut_key*: sample mutation key that adds the sample barcode to *mut_key* (it lets us filter out duplicates within a sample)
* *patient_mut_key*: patient mutation key that adds the patient identifier to *mut_key* (it lets us filter out duplicates within a patient)
* *Patient_Id*: identifies the patient
* *mut_spot*: position of the mutated amino acid, extracted from *HGVSp*
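
As a toy illustration of the keys listed above, the sketch below builds them on four made-up rows and then uses the sample and patient keys to drop duplicates. The barcodes and values are invented, and `toy`, `per_sample` and `per_patient` are hypothetical names; the use of `drop_duplicates` is an assumption about how the keys could serve to filter duplicates, not a step taken from this notebook.

```python
import pandas as pd

# Toy rows shaped like ccf_tp53_filtered (identifiers and values are made up)
toy = pd.DataFrame({
    'Chromosome': [17, 17, 17, 17],
    'Start_Position': [7578442, 7578442, 7578442, 7578291],
    'Reference_Allele': ['T', 'T', 'T', 'T'],
    'Tumor_Seq_Allele2': ['C', 'C', 'C', 'G'],
    'HGVSp': ['p.Tyr163Cys', 'p.Tyr163Cys', 'p.Tyr163Cys', None],
    'Tumor_Sample_Barcode': ['P-0000001-T01-IM6', 'P-0000001-T02-IM6',
                             'P-0000001-T02-IM6', 'P-0000002-T01-IM6'],
})

# chromosome_start_ref_alt
toy['mut_key'] = (toy['Chromosome'].astype(str) + '_' + toy['Start_Position'].astype(str)
                  + '_' + toy['Reference_Allele'] + '_' + toy['Tumor_Seq_Allele2'])
# First 9 characters of the barcode identify the patient
toy['Patient_Id'] = toy['Tumor_Sample_Barcode'].str[:9]
toy['sample_mut_key'] = toy['Tumor_Sample_Barcode'] + '_' + toy['mut_key']
toy['patient_mut_key'] = toy['Patient_Id'] + '_' + toy['mut_key']
# First run of digits in HGVSp; rows without a protein change stay missing
toy['mut_spot'] = toy['HGVSp'].str.extract(r'(\d+)', expand=False)

# Keep one row per (sample, mutation) and one row per (patient, mutation)
per_sample = toy.drop_duplicates(subset='sample_mut_key')
per_patient = toy.drop_duplicates(subset='patient_mut_key')
print(len(toy), len(per_sample), len(per_patient))
```

In this toy frame the same mutation appears twice in one sample and across two samples of the same patient, so the two keys remove different numbers of rows.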

In [19]:
# Create a mutation key: chromosome_start_ref_alt
ccf_tp53_filtered['mut_key'] = (ccf_tp53_filtered['Chromosome'].astype(str) + '_'
                                + ccf_tp53_filtered['Start_Position'].astype(str) + '_'
                                + ccf_tp53_filtered['Reference_Allele'].astype(str) + '_'
                                + ccf_tp53_filtered['Tumor_Seq_Allele2'].astype(str))
# Create a Patient_Id (first 9 characters of the barcode, e.g. P-0027408)
ccf_tp53_filtered['Patient_Id'] = ccf_tp53_filtered['Tumor_Sample_Barcode'].str[:9]
# Create a sample key to differentiate duplicates
ccf_tp53_filtered['sample_mut_key'] = ccf_tp53_filtered['Tumor_Sample_Barcode'] + '_' + ccf_tp53_filtered['mut_key']
# Create a patient key to differentiate duplicates
ccf_tp53_filtered['patient_mut_key'] = ccf_tp53_filtered['Patient_Id'] + '_' + ccf_tp53_filtered['mut_key']
# Extract the mutation spot (first run of digits) from HGVSp; the raw string avoids an invalid-escape warning
ccf_tp53_filtered['mut_spot'] = ccf_tp53_filtered['HGVSp'].str.extract(r'(\d+)', expand=False)

ccf_tp53_filtered

Unnamed: 0,Hugo_Symbol,Chromosome,Start_Position,End_Position,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,HGVSp,Consequence,mutation_effect,cf,ccf_expected_copies,purity,t_var_freq,Tumor_Sample_Barcode,mut_key,Patient_Id,sample_mut_key,patient_mut_key,mut_spot
0,TP53,17,7578409,7578410,Missense_Mutation,DNP,CT,CT,TC,p.Arg174Glu,missense_variant,Likely Loss-of-function,0.315621,0.925,0.308886,0.168901,P-0027408-T01-IM6,17_7578409_CT_TC,P-0027408,P-0027408-T01-IM6_17_7578409_CT_TC,P-0027408_17_7578409_CT_TC,174
1,TP53,17,7577121,7577121,Missense_Mutation,SNP,G,G,A,p.Arg273Cys,missense_variant,Likely Loss-of-function,0.325590,0.812,0.384643,0.312169,P-0036909-T01-IM6,17_7577121_G_A,P-0036909,P-0036909-T01-IM6_17_7577121_G_A,P-0036909_17_7577121_G_A,273
2,TP53,17,7578442,7578442,Missense_Mutation,SNP,T,T,C,p.Tyr163Cys,missense_variant,Loss-of-function,0.832723,0.935,0.861984,0.845070,P-0023546-T01-IM6,17_7578442_T_C,P-0023546,P-0023546-T01-IM6_17_7578442_T_C,P-0023546_17_7578442_T_C,163
3,TP53,17,7578442,7578442,Missense_Mutation,SNP,T,T,C,p.Tyr163Cys,missense_variant,Loss-of-function,0.307591,1.000,0.567171,0.636735,P-0023546-T02-IM6,17_7578442_T_C,P-0023546,P-0023546-T02-IM6_17_7578442_T_C,P-0023546_17_7578442_T_C,163
4,TP53,17,7578471,7578471,Frame_Shift_Del,DEL,G,G,-,p.Gly154AlafsTer16,frameshift_variant,Likely Loss-of-function,0.892744,1.000,0.890701,0.912621,P-0025997-T01-IM6,17_7578471_G_-,P-0025997,P-0025997-T01-IM6_17_7578471_G_-,P-0025997_17_7578471_G_-,154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13882,TP53,17,7578291,7578291,Splice_Site,SNP,T,T,G,,splice_acceptor_variant,Likely Loss-of-function,0.211973,1.000,0.220008,0.126382,P-0050748-T01-IM6,17_7578291_T_G,P-0050748,P-0050748-T01-IM6_17_7578291_T_G,P-0050748_17_7578291_T_G,
13883,TP53,17,7578394,7578394,Missense_Mutation,SNP,T,T,A,p.His179Leu,missense_variant,Loss-of-function,0.824066,1.000,0.833058,0.757801,P-0050741-T01-IM6,17_7578394_T_A,P-0050741,P-0050741-T01-IM6_17_7578394_T_A,P-0050741_17_7578394_T_A,179
13884,TP53,17,7577570,7577570,Missense_Mutation,SNP,C,C,T,p.Met237Ile,missense_variant,Likely Loss-of-function,0.254038,0.937,0.305687,0.168975,P-0050747-T01-IM6,17_7577570_C_T,P-0050747,P-0050747-T01-IM6_17_7577570_C_T,P-0050747_17_7577570_C_T,237
13885,TP53,17,7578208,7578208,Missense_Mutation,SNP,T,T,C,p.His214Arg,missense_variant,Likely Loss-of-function,1.000000,,,0.082168,P-0050652-T01-IM6,17_7578208_T_C,P-0050652,P-0050652-T01-IM6_17_7578208_T_C,P-0050652_17_7578208_T_C,214
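Rows 2 and 3 of the output above carry the same *patient_mut_key* but different *sample_mut_key* values, because the same mutation was found in two samples (T01 and T02) of patient P-0023546. A minimal sketch of how the two keys could be used to flag or drop such repeats follows. The notebook does not show this step; the frame name `rows` and the choice to keep the first occurrence are assumptions.

```python
import pandas as pd

# Three rows copied from the output above; `rows` is a hypothetical name.
rows = pd.DataFrame({
    'Tumor_Sample_Barcode': ['P-0023546-T01-IM6',
                             'P-0023546-T02-IM6',
                             'P-0025997-T01-IM6'],
    'sample_mut_key': ['P-0023546-T01-IM6_17_7578442_T_C',
                       'P-0023546-T02-IM6_17_7578442_T_C',
                       'P-0025997-T01-IM6_17_7578471_G_-'],
    'patient_mut_key': ['P-0023546_17_7578442_T_C',
                        'P-0023546_17_7578442_T_C',
                        'P-0025997_17_7578471_G_-'],
})

# Same mutation reported twice within one sample (none in this toy frame)
dup_in_sample = rows.duplicated(subset='sample_mut_key', keep=False)
# Same mutation seen in several samples of one patient
dup_in_patient = rows.duplicated(subset='patient_mut_key', keep=False)

# Assumption: keep the first occurrence per patient and mutation
one_per_patient = rows.drop_duplicates(subset='patient_mut_key', keep='first')

print(rows[dup_in_patient])
print(len(one_per_patient))
```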


In [20]:
# Left join on Tumor_Sample_Barcode (left) and Sample_Id (right)
maf = pd.merge(left=ccf_tp53_filtered,right=clinical_data_filtered, how='left', left_on='Tumor_Sample_Barcode', right_on='Sample_Id')
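A left join keeps every mutation row, so samples missing from the clinical table silently get empty clinical columns, and a repeated `Sample_Id` on the right would duplicate mutation rows. The sketch below shows one way to check for both, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames `mutations` and `clinical` are hypothetical, and the third row (barcode `P-0000000-T01-IM6`) is invented to illustrate an unmatched sample.

```python
import pandas as pd

# Hypothetical toy frames; the last barcode is made up to show a non-match.
mutations = pd.DataFrame({
    'Tumor_Sample_Barcode': ['P-0027408-T01-IM6',
                             'P-0023546-T01-IM6',
                             'P-0000000-T01-IM6'],
    'mut_key': ['17_7578409_CT_TC', '17_7578442_T_C', '17_7577121_G_A'],
})
clinical = pd.DataFrame({
    'Sample_Id': ['P-0027408-T01-IM6', 'P-0023546-T01-IM6'],
    'Cancer_Type': ['Non-Small Cell Lung Cancer', 'Prostate Cancer'],
})

checked = pd.merge(
    left=mutations, right=clinical, how='left',
    left_on='Tumor_Sample_Barcode', right_on='Sample_Id',
    validate='many_to_one',  # raises MergeError if Sample_Id is not unique
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Mutation rows whose sample has no clinical record
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['Tumor_Sample_Barcode'].tolist())
```

With `validate='many_to_one'`, the number of rows in the result always equals the number of rows in the left frame.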

In [21]:
maf.head(5)

Unnamed: 0,Hugo_Symbol,Chromosome,Start_Position,End_Position,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,HGVSp,Consequence,mutation_effect,cf,ccf_expected_copies,purity,t_var_freq,Tumor_Sample_Barcode,mut_key,Patient_Id,sample_mut_key,patient_mut_key,mut_spot,Sample_Id,Patient_Current_Age,Cancer_Type,Cancer_Type_Detailed,Ethnicity_Category,Sex,Sample_Type,samples_per_patient,mutation_count
0,TP53,17,7578409,7578410,Missense_Mutation,DNP,CT,CT,TC,p.Arg174Glu,missense_variant,Likely Loss-of-function,0.315621,0.925,0.308886,0.168901,P-0027408-T01-IM6,17_7578409_CT_TC,P-0027408,P-0027408-T01-IM6_17_7578409_CT_TC,P-0027408_17_7578409_CT_TC,174,P-0027408-T01-IM6,67.0,Non-Small Cell Lung Cancer,Non-Small Cell Lung Cancer,Non-Spanish; Non-Hispanic,Female,Metastasis,1,20
1,TP53,17,7577121,7577121,Missense_Mutation,SNP,G,G,A,p.Arg273Cys,missense_variant,Likely Loss-of-function,0.32559,0.812,0.384643,0.312169,P-0036909-T01-IM6,17_7577121_G_A,P-0036909,P-0036909-T01-IM6_17_7577121_G_A,P-0036909_17_7577121_G_A,273,P-0036909-T01-IM6,47.0,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Non-Spanish; Non-Hispanic,Female,Metastasis,3,4
2,TP53,17,7578442,7578442,Missense_Mutation,SNP,T,T,C,p.Tyr163Cys,missense_variant,Loss-of-function,0.832723,0.935,0.861984,0.84507,P-0023546-T01-IM6,17_7578442_T_C,P-0023546,P-0023546-T01-IM6_17_7578442_T_C,P-0023546_17_7578442_T_C,163,P-0023546-T01-IM6,50.0,Prostate Cancer,Prostate Neuroendocrine Carcinoma,Non-Spanish; Non-Hispanic,Male,Primary,2,4
3,TP53,17,7578442,7578442,Missense_Mutation,SNP,T,T,C,p.Tyr163Cys,missense_variant,Loss-of-function,0.307591,1.0,0.567171,0.636735,P-0023546-T02-IM6,17_7578442_T_C,P-0023546,P-0023546-T02-IM6_17_7578442_T_C,P-0023546_17_7578442_T_C,163,P-0023546-T02-IM6,50.0,Prostate Cancer,Prostate Adenocarcinoma,Non-Spanish; Non-Hispanic,Male,Primary,2,3
4,TP53,17,7578471,7578471,Frame_Shift_Del,DEL,G,G,-,p.Gly154AlafsTer16,frameshift_variant,Likely Loss-of-function,0.892744,1.0,0.890701,0.912621,P-0025997-T01-IM6,17_7578471_G_-,P-0025997,P-0025997-T01-IM6_17_7578471_G_-,P-0025997_17_7578471_G_-,154,P-0025997-T01-IM6,70.0,Cancer of Unknown Primary,Small Cell Carcinoma of Unknown Primary,Non-Spanish; Non-Hispanic,Female,Metastasis,1,9


In [22]:
# Save the merged MAF to a pickle file
maf.to_pickle(data_path + 'merged_data/maf_tp53.pkl')