# **MASTER FILE CREATION**

## *Emile Cohen*
*March 2020*

**Goal:** In this Notebook, we create a master file that summarizes all useful information.

The Notebook is divided into 6 parts: the first four build the four parts of our Master file, and the last two add the subgroup columns and merge everything:
   
* **1. Patient/Sample Information**
* **2. TP53 Mutations**
* **3. TP53 Copy Numbers**
* **4. TP53 Computed Metrics**
* **5. Subgroup columns creation**
* **6. Merge tables**

**NB1:** In each part, you must run the cells from the beginning in order to initialize the variables.

**NB2:** In order to launch the last script (Merge Tables), you have to define the functions in each part.

**NB3:** All functions used for the plots are located in utils/custom_tools.py

---

In [22]:
%run -i '../../utils/setup_environment.ipy'

from pathlib import Path
from utils.filters import *

import warnings
warnings.filterwarnings('ignore')

data_path = '../../data/'

Setup environment... done!


<span style="color:green">✅ Working on **mskimpact_env** conda environment.</span>

---
## 1. Patient/Sample Information

In this part, we focus on clinical information exported from cBioPortal. We use the MAF file created in the script *./maf_cohort_creation.ipynb* and stored in *../../data/merged_data/maf_cohort.pkl*.

The following columns are selected:
* Sample_Id
* Tumor_Id
* Patient_Id
* Cancer_Type
* Cancer_Type_Detailed
* Sample_Type
* purity
* ploidy
* samples_per_patient
* Overall Survival Status
* Overall Survival (Months)
* MSI Score
* TMB_Score (Tumor Mutational Burden)

In [23]:
def create_sample_info(path):
    '''
    This function creates a dataframe gathering all samples from the cohort with important clinical
    information (one row per Sample_Id).
    Samples that share the same tumor but have different normal samples are NOT filtered here: this is done
    later in create_master with normal_samp_duplicates_filter, which keeps the one with the highest purity.
    '''
    maf_cohort = pd.read_pickle(path)
    
    # We select only the interesting columns
    selected_cohort = maf_cohort[['Sample_Id','Tumor_Id', 'Patient_Id','Cancer_Type', 'Cancer_Type_Detailed', 'Sample_Type', 'purity', 'ploidy',
                                  'samples_per_patient','Overall Survival Status', 'Overall Survival (Months)', 
                                  'MSI Score', 'TMB_Score']]

    # The MAF has one row per mutation, so each sample appears many times:
    # we keep a single row per Sample_Id
    selected_cohort = selected_cohort.drop_duplicates('Sample_Id')
    
    return selected_cohort

In [5]:
sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
sample_info

Unnamed: 0,Sample_Id,Tumor_Id,Patient_Id,Cancer_Type,Cancer_Type_Detailed,Sample_Type,purity,ploidy,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,P-0034223,Breast Cancer,Invasive Breast Carcinoma,Metastasis,0.941111,2.241830,1.0,LIVING,,0.55,5.3
6,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,P-0009819,Prostate Cancer,Prostate Adenocarcinoma,Primary,0.275237,2.681075,1.0,LIVING,23.441,0.00,1.0
9,P-0025956-T01-IM6_P-0025956-N01-IM6,P-0025956-T01-IM6,P-0025956,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Primary,0.185874,3.496971,1.0,DECEASED,3.584,0.00,5.3
15,P-0027408-T01-IM6_P-0027408-N01-IM6,P-0027408-T01-IM6,P-0027408,Non-Small Cell Lung Cancer,Non-Small Cell Lung Cancer,Metastasis,0.308886,1.811066,1.0,LIVING,22.586,0.27,17.6
36,P-0006554-T01-IM5_P-0006554-N01-IM5,P-0006554-T01-IM5,P-0006554,Glioma,Anaplastic Oligodendroglioma,Primary,0.715208,1.910719,1.0,LIVING,26.170,1.30,46.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
260784,P-0050644-T01-IM6_P-0050644-N01-IM6,P-0050644-T01-IM6,P-0050644,Prostate Cancer,Prostate Adenocarcinoma,Primary,0.576103,2.175837,1.0,LIVING,1.249,0.00,0.9
260787,P-0050741-T01-IM6_P-0050741-N01-IM6,P-0050741-T01-IM6,P-0050741,Small Cell Lung Cancer,Small Cell Lung Cancer,Metastasis,0.833591,2.006039,1.0,LIVING,1.940,0.22,7.0
260795,P-0050747-T01-IM6_P-0050747-N01-IM6,P-0050747-T01-IM6,P-0050747,Pancreatic Cancer,Pancreatic Adenocarcinoma,Primary,0.360576,2.187990,1.0,LIVING,,0.05,6.1
260801,P-0050652-T01-IM6_P-0050652-N01-IM6,P-0050652-T01-IM6,P-0050652,Pancreatic Cancer,Pancreatic Adenocarcinoma,Primary,0.171442,2.011650,1.0,LIVING,1.085,0.26,2.6
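
The index of the table above (0, 6, 9, 15, ...) has gaps, presumably because the MAF holds one row per mutation and `drop_duplicates` keeps only the first row of each sample. The sketch below reproduces this step on a toy table. The values and the names `maf_demo` and `sample_demo` are made up for illustration, and the consistency check is an optional addition that the notebook itself does not perform.

```python
import pandas as pd

# Toy MAF-like table: one row per mutation, clinical columns repeated on every row
maf_demo = pd.DataFrame({
    'Sample_Id': ['S1_N1', 'S1_N1', 'S2_N1'],
    'Tumor_Id': ['S1', 'S1', 'S2'],
    'Gene_Id': ['TP53', 'KRAS', 'TP53'],
    'purity': [0.4, 0.4, 0.7],
})

clinical_cols = ['Sample_Id', 'Tumor_Id', 'purity']

# One row per sample: the first occurrence is kept, with its original index label
sample_demo = maf_demo[clinical_cols].drop_duplicates('Sample_Id')

# Optional check: clinical columns should be constant within a sample,
# otherwise drop_duplicates silently keeps whatever the first row contains
n_values = maf_demo.groupby('Sample_Id')[['Tumor_Id', 'purity']].nunique(dropna=False)
assert (n_values <= 1).all().all()

print(sample_demo)
```

If the check fails on real data, the clinical columns disagree between rows of the same sample and the choice of the first row is worth a closer look.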


---
## 2. TP53 Mutations

In this part, we focus on mutational information exported from *default_qc_pass.ccf_TP53.maf* file. We use the maf file created in the script *./maf_tp53_creation.ipynb* and stored in *../../data/merged_data/maf_tp53.pkl*.

We gather all mutations per sample and split them into separate columns (every column except Tumor_Id is prefixed with *tp53_* in the output table). We have the following columns:
* Tumor_Id	
* key_1 (2,3,4,5) --> Mutation key allowing to filter duplicates
* vc_1 (2,3,4,5) --> Variant Classification
* ccf_1 (2,3,4,5) --> Cancer Cell Fraction of the mutation
* vaf_1 (2,3,4,5) --> Variant Allele Frequency of the mutation
* HGVSp_1 (2,3,4,5) --> protein change
* spot_1 (2,3,4,5) --> Integer that defines the spot of the tp53 mutation
* tp53_count --> Number of tp53 mutations of the sample


In [67]:
def f(x):
    # This function groups all mutations of a sample into a single comma-separated string
    return pd.DataFrame(dict(Sample_Id = x['Sample_Id'],  
                        muts = "%s" % ','.join(x['sample_mut_key_vc_ccf_vaf_hgv_spot'])))

def count_tp53_muts(x):
    count = 0
    for i in range(1,6):
        if x['tp53_key_' + str(i)]:
            count+= 1
    return count

def create_tp53_muts(sample_info, path):
    '''
    This function gathers all tp53 mutation characteristics.
    For each sample of the cohort, we collect its tp53 mutations (up to 5) and their characteristics
    in one row; samples without any tp53 mutation get an empty row.
    '''
    # We load the  table created in maf_tp53_creation.ipynb
    maf_tp53 = pd.read_pickle(path)
    
    # We select only the interesting columns (.copy() so that the new column below is added to an independent table)
    maf_tp53_filtered = maf_tp53[['Sample_Id','sample_mut_key', 'Variant_Classification',
                                  'ccf_expected_copies', 't_var_freq', 'HGVSp','mut_spot']].copy()

    # We join mut_key, Variant_Classification, CCF, VAF, HGVSp and mut_spot into a single '%'-separated string.
    # A plain list (not a pd.Series) is assigned so that values are set by position, whatever the index of the table
    maf_tp53_filtered['sample_mut_key_vc_ccf_vaf_hgv_spot'] = [
        '%'.join(str(v) for v in row)
        for row in zip(maf_tp53_filtered.sample_mut_key,
                       maf_tp53_filtered.Variant_Classification,
                       maf_tp53_filtered.ccf_expected_copies,
                       maf_tp53_filtered.t_var_freq,
                       maf_tp53_filtered.HGVSp,
                       maf_tp53_filtered.mut_spot)
    ]

 

    # We Select important columns
    final = maf_tp53_filtered[['Sample_Id', 'sample_mut_key_vc_ccf_vaf_hgv_spot']]
    # We groupby Sample_Id and apply the function above to group mutations
    final = final.groupby(['Sample_Id'], sort=False).apply(f)

    # We separate the different mutations into 5 different columns (5 is the max number of tp53 mutations in our cohort)
    final[['mut_key_1','mut_key_2','mut_key_3','mut_key_4','mut_key_5']] = final.muts.str.split(',', expand=True)

    # Split the columns into mut_key_ and vc_
    final[['tp53_key_1','tp53_vc_1','tp53_ccf_1','tp53_vaf_1','tp53_HGVSp_1', 'tp53_spot_1']] = final.mut_key_1.str.split('%', expand=True)
    final[['tp53_key_2','tp53_vc_2','tp53_ccf_2','tp53_vaf_2','tp53_HGVSp_2', 'tp53_spot_2']] = final.mut_key_2.str.split('%', expand=True)
    final[['tp53_key_3','tp53_vc_3','tp53_ccf_3','tp53_vaf_3','tp53_HGVSp_3', 'tp53_spot_3']] = final.mut_key_3.str.split('%', expand=True)
    final[['tp53_key_4','tp53_vc_4','tp53_ccf_4','tp53_vaf_4','tp53_HGVSp_4', 'tp53_spot_4']] = final.mut_key_4.str.split('%', expand=True)
    final[['tp53_key_5','tp53_vc_5','tp53_ccf_5','tp53_vaf_5','tp53_HGVSp_5', 'tp53_spot_5']] = final.mut_key_5.str.split('%', expand=True)

    # We remove the muts column
    final = final.drop(['muts','mut_key_1','mut_key_2','mut_key_3','mut_key_4','mut_key_5'], axis=1)

    # We remove duplicates
    final = final.drop_duplicates('Sample_Id')

    # We add the cohort samples that have no tp53 mutation
    # First we create a dataframe with all missing samples (as a sorted list: the order is deterministic,
    # and recent pandas versions refuse to build a DataFrame from an unordered set)
    cohort_samples = set(sample_info.Tumor_Id)
    final_samples = set(final.Sample_Id)
    missing_samp = pd.DataFrame(sorted(cohort_samples - final_samples), columns = ['Sample_Id'])
    # Then we concatenate the two dataframes (DataFrame.append was removed in pandas 2.0)
    final = pd.concat([final, missing_samp])
    
    # We rename the Sample_Id column to have the same key as in other dataframes
    final = final.rename(columns={'Sample_Id': 'Tumor_Id'})
    
    # We add a last column tp53_count that represents the number of tp53 mutations per sample
    final = final.where(final.notnull(), None)
    final['tp53_count'] = final.apply(count_tp53_muts, axis = 1)
    
    # We change the type of the vaf and ccf columns to float64 instead of strings
    final = final.astype({'tp53_vaf_1': 'float64', 'tp53_vaf_2': 'float64', 'tp53_vaf_3': 'float64', 'tp53_vaf_4': 'float64', 'tp53_vaf_5': 'float64',
                       'tp53_ccf_1': 'float64', 'tp53_ccf_2': 'float64', 'tp53_ccf_3': 'float64', 'tp53_ccf_4': 'float64', 'tp53_ccf_5': 'float64'})

    return final

In [13]:
tp53_muts = create_tp53_muts(sample_info, data_path + 'merged_data/maf_tp53.pkl')
tp53_muts.head()

Unnamed: 0,Tumor_Id,tp53_key_1,tp53_vc_1,tp53_ccf_1,tp53_vaf_1,tp53_HGVSp_1,tp53_spot_1,tp53_key_2,vc_2,tp53_ccf_2,tp53_vaf_2,tp53_HGVSp_2,tp53_spot_2,tp53_key_3,tp53_vc_3,tp53_ccf_3,tp53_vaf_3,tp53_HGVSp_3,tp53_spot_3,tp53_key_4,tp53_vc_4,tp53_ccf_4,tp53_vaf_4,tp53_HGVSp_4,tp53_spot_4,tp53_key_5,tp53_vc_5,tp53_ccf_5,tp53_vaf_5,tp53_HGVSp_5,tp53_spot_5,tp53_count
0,P-0027408-T01-IM6,P-0027408-T01-IM6_17_7578409_CT_TC,Missense_Mutation,0.925,0.168901,p.Arg174Glu,174,,,,,,,,,,,,,,,,,,,,,,,,,1
1,P-0036909-T01-IM6,P-0036909-T01-IM6_17_7577121_G_A,Missense_Mutation,0.812,0.312169,p.Arg273Cys,273,,,,,,,,,,,,,,,,,,,,,,,,,1
2,P-0023546-T01-IM6,P-0023546-T01-IM6_17_7578442_T_C,Missense_Mutation,0.935,0.84507,p.Tyr163Cys,163,,,,,,,,,,,,,,,,,,,,,,,,,1
3,P-0023546-T02-IM6,P-0023546-T02-IM6_17_7578442_T_C,Missense_Mutation,1.0,0.636735,p.Tyr163Cys,163,,,,,,,,,,,,,,,,,,,,,,,,,1
4,P-0025997-T01-IM6,P-0025997-T01-IM6_17_7578471_G_-,Frame_Shift_Del,1.0,0.912621,p.Gly154AlafsTer16,154,,,,,,,,,,,,,,,,,,,,,,,,,1
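
The function above builds the wide table by concatenating all fields into a string and splitting it again, which is why the vaf and ccf columns have to be cast back to float64 at the end. As a sketch of an alternative, the same long-to-wide reshaping can be done with `cumcount` and `unstack`, which keeps numeric dtypes. The toy values, the reduced set of fields and the helper name `widen_mutations` are assumptions made for illustration, not the notebook's method.

```python
import pandas as pd

# Toy long-format table: one row per TP53 mutation (made-up values)
long_muts = pd.DataFrame({
    'Sample_Id': ['S1', 'S1', 'S2'],
    'sample_mut_key': ['S1_17_100_G_A', 'S1_17_200_C_T', 'S2_17_100_G_A'],
    'Variant_Classification': ['Missense_Mutation', 'Nonsense_Mutation', 'Missense_Mutation'],
    't_var_freq': [0.40, 0.10, 0.25],
})

def widen_mutations(long_df, max_muts=5):
    """Hypothetical helper: one row per sample, one block of columns per mutation rank."""
    df = long_df.copy()
    # Rank of each mutation within its sample: 1, 2, ...
    df['rank'] = df.groupby('Sample_Id', sort=False).cumcount() + 1
    df = df[df['rank'] <= max_muts]
    # Move the rank from the rows to the columns
    wide = df.set_index(['Sample_Id', 'rank']).unstack('rank')
    # Flatten the (field, rank) column MultiIndex into names such as 'tp53_vaf_1'
    short = {'sample_mut_key': 'key', 'Variant_Classification': 'vc', 't_var_freq': 'vaf'}
    wide.columns = [f"tp53_{short[field]}_{rank}" for field, rank in wide.columns]
    # Number of mutations per sample = number of non-empty key columns
    key_cols = [c for c in wide.columns if c.startswith('tp53_key_')]
    wide['tp53_count'] = wide[key_cols].notna().sum(axis=1)
    return wide.reset_index()

wide_muts = widen_mutations(long_muts)
print(wide_muts)
```

With this layout, missing mutations are NaN from the start, so no string-to-float conversion is needed afterwards.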


---
## 3. TP53 Copy Numbers

In this part, we gather the information from the gene_level table.
We create the following columns:
* Sample_Id 
* tcn --> total copy number
* mcn --> major copy number
* lcn --> lower copy number
* seg_length --> length of the segment
* cn_state --> copy number state
* cf --> Cell fraction of the cn_state
* wgd --> Whole Genome Doubling (1 or -1)

In [48]:
def wgd_condition(x, arm_level=None):
    # arm_level can be loaded once and passed through apply(..., arm_level=arm_level);
    # by default it is read from disk, as before
    if arm_level is None:
        arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    cond_wgd = ['LOSS BEFORE & AFTER', 'LOSS BEFORE', 'CNLOH BEFORE & AFTER',
           'CNLOH BEFORE', 'CNLOH BEFORE & GAIN', 'DOUBLE LOSS AFTER',
           'LOSS AFTER', 'CNLOH AFTER', 'LOSS & GAIN']
    cond_no_wgd = ['CNLOH', 'HETLOSS', 'CNLOH & GAIN', 'DIPLOID']
    
    # The arm_level table stores the state of each arm in 'cn_state' (it has no 'tp53_cn_state' column)
    for cn_state in arm_level.loc[arm_level['sample'] == x.Sample_Id, 'cn_state']:
        if cn_state in cond_wgd:
            return 1
        
    if x.tp53_cn_state in cond_no_wgd :
        return -1

In [44]:
def create_copy_number_state(sample_info, path):
    gene_level = pd.read_csv(path, sep='\t')
    gene_level['Tumor_Id'] = gene_level['sample'].str[:17]
    gene_level_subset = gene_level[['sample','tcn','mcn','lcn','seg_length','cn_state', 'cf.em']]
    
    # We add the cohort samples that are not in the dataframe
    # First we create a dataframe with all missing samples (as a sorted list: the order is deterministic,
    # and recent pandas versions refuse to build a DataFrame from an unordered set)
    cohort_samples = set(sample_info.Sample_Id)
    gene_level_samples = set(gene_level_subset['sample'])
    missing_samp = pd.DataFrame(sorted(cohort_samples - gene_level_samples), columns = ['sample'])
    
    # Then we concatenate the two dataframes (DataFrame.append was removed in pandas 2.0)
    gene_level_subset = pd.concat([gene_level_subset, missing_samp])
    
    # We rename the cf.em column 
    gene_level_subset = gene_level_subset.rename(columns={'cf.em': 'tp53_cf', 
                                                          'sample':'Sample_Id',
                                                          'tcn': 'tp53_tcn',
                                                          'mcn': 'tp53_mcn',
                                                          'lcn': 'tp53_lcn',
                                                          'seg_length': 'tp53_seg_length',
                                                          'cn_state':'tp53_cn_state'})
    
    # We add WGD information
    gene_level_subset['wgd'] = gene_level_subset.apply(wgd_condition, axis = 1)
    
    return gene_level_subset

In [59]:
def compute_frac_genome(x, arm_level: pd.DataFrame):
    lookup_table = arm_level[arm_level['sample'] == x.Sample_Id]
    lookup_table_altered = lookup_table[lookup_table['cn_state'] != 'DIPLOID']
    altered_length = lookup_table_altered.cn_length.sum()
    total_length = lookup_table.arm_length.sum()
    
    frac_gen_altered = round(altered_length/total_length,3)
    
    return frac_gen_altered

# Here is the function that computes the genome instability columns
def chr_computations(x):
    arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    CNLOH = ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH AFTER', 'CNLOH BEFORE', 'CNLOH & GAIN', 'CNLOH BEFORE & GAIN', 'AMP (LOH)']
    LOSS = ['HETLOSS', 'LOSS BEFORE', 'LOSS AFTER', 'HOMDEL', 'LOSS BEFORE & AFTER', 'DOUBLE LOSS AFTER']
    GAIN = ['GAIN', 'AMP', 'AMP (BALANCED)', 'LOSS & GAIN', 'TETRAPLOID']
    
    # .copy() so that the 'chr' column is added to an independent table
    lookup_table = arm_level[arm_level['sample'] == x.Sample_Id].copy()
    lookup_table['chr'] = lookup_table.arm.str.extract(r'(\d+)', expand=False)
    # We keep the altered arms only and exclude chromosome 17 (both conditions in a single mask)
    lookup_table = lookup_table[(lookup_table['cn_state'] != 'DIPLOID') & (lookup_table['chr'] != '17')].copy()
    lookup_table['state_chr'] = lookup_table['cn_state']+lookup_table['chr']
    #chr_affected colum
    lookup_table_chr = lookup_table.drop_duplicates(subset=['chr'])
    chr_affected = len(lookup_table_chr)
    
    #chr_loss, chr_gain, chr_cnloh columns
    lookup_table_events = lookup_table.drop_duplicates(subset=['state_chr'])
    chr_loss = len(lookup_table_events[lookup_table_events.cn_state.isin(LOSS)])
    chr_gain = len(lookup_table_events[lookup_table_events.cn_state.isin(GAIN)])
    chr_cnloh = len(lookup_table_events[lookup_table_events.cn_state.isin(CNLOH)])
    
    #frac_gen_altered column
    frac_gen_altered = compute_frac_genome(x, arm_level)
    
    return [chr_affected, chr_loss, chr_gain, chr_cnloh, frac_gen_altered]

In [110]:
sample_info

Unnamed: 0,Sample_Id,Tumor_Id,Patient_Id,Cancer_Type,Sample_Type,purity,ploidy,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,P-0034223,Breast Cancer,Metastasis,0.941111,2.241830,1.0,LIVING,,0.55
6,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,P-0009819,Prostate Cancer,Primary,0.275237,2.681075,1.0,LIVING,23.441,0.00
9,P-0025956-T01-IM6_P-0025956-N01-IM6,P-0025956-T01-IM6,P-0025956,Non-Small Cell Lung Cancer,Primary,0.185874,3.496971,1.0,DECEASED,3.584,0.00
15,P-0027408-T01-IM6_P-0027408-N01-IM6,P-0027408-T01-IM6,P-0027408,Non-Small Cell Lung Cancer,Metastasis,0.308886,1.811066,1.0,LIVING,22.586,0.27
36,P-0006554-T01-IM5_P-0006554-N01-IM5,P-0006554-T01-IM5,P-0006554,Glioma,Primary,0.715208,1.910719,1.0,LIVING,26.170,1.30
...,...,...,...,...,...,...,...,...,...,...,...
260784,P-0050644-T01-IM6_P-0050644-N01-IM6,P-0050644-T01-IM6,P-0050644,Prostate Cancer,Primary,0.576103,2.175837,1.0,LIVING,1.249,0.00
260787,P-0050741-T01-IM6_P-0050741-N01-IM6,P-0050741-T01-IM6,P-0050741,Small Cell Lung Cancer,Metastasis,0.833591,2.006039,1.0,LIVING,1.940,0.22
260795,P-0050747-T01-IM6_P-0050747-N01-IM6,P-0050747-T01-IM6,P-0050747,Pancreatic Cancer,Primary,0.360576,2.187990,1.0,LIVING,,0.05
260801,P-0050652-T01-IM6_P-0050652-N01-IM6,P-0050652-T01-IM6,P-0050652,Pancreatic Cancer,Primary,0.171442,2.011650,1.0,LIVING,1.085,0.26


In [48]:
arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
lookup_table = arm_level[arm_level['sample'] == 'P-0034223-T01-IM6_P-0034223-N01-IM6']
lookup_table_altered = lookup_table[lookup_table['cn_state'] != 'DIPLOID']
display(lookup_table)
altered_length = lookup_table_altered.cn_length.sum()
total_length = lookup_table.arm_length.sum()
frac_gen_altered = round(altered_length/total_length,3)
frac_gen_altered

Unnamed: 0,sample,arm,tcn,lcn,cn_length,arm_length,frac_of_arm,cn_state
0,P-0034223-T01-IM6_P-0034223-N01-IM6,1p,2,1,120534257,120534257,1.0,DIPLOID
1,P-0034223-T01-IM6_P-0034223-N01-IM6,1q,4,1,101461100,125135032,0.81,GAIN
2,P-0034223-T01-IM6_P-0034223-N01-IM6,2p,2,1,87764506,87764506,1.0,DIPLOID
3,P-0034223-T01-IM6_P-0034223-N01-IM6,2q,2,1,150656179,150656179,1.0,DIPLOID
4,P-0034223-T01-IM6_P-0034223-N01-IM6,3p,2,1,89055004,89055004,1.0,DIPLOID
5,P-0034223-T01-IM6_P-0034223-N01-IM6,3q,2,1,106857923,106857923,1.0,DIPLOID
6,P-0034223-T01-IM6_P-0034223-N01-IM6,4p,2,1,48257537,48257537,1.0,DIPLOID
7,P-0034223-T01-IM6_P-0034223-N01-IM6,4q,2,1,140602481,140602481,1.0,DIPLOID
8,P-0034223-T01-IM6_P-0034223-N01-IM6,5p,2,1,46187241,46187241,1.0,DIPLOID
9,P-0034223-T01-IM6_P-0034223-N01-IM6,5q,2,1,134121897,134121897,1.0,DIPLOID


0.08
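
The cell above computes the altered fraction for a single sample, and `chr_computations` repeats the same lookup for every row while re-reading the arm-level file each time. As a sketch under the assumption that the arm-level table has the columns shown above (`sample`, `cn_length`, `arm_length`, `cn_state`), the same ratio can be obtained for all samples at once with a groupby. The toy table and the function name `frac_genome_altered_by_sample` are made up for illustration.

```python
import pandas as pd

# Toy arm-level table (made-up lengths)
arm_level_demo = pd.DataFrame({
    'sample': ['A', 'A', 'A', 'B', 'B'],
    'arm': ['1p', '1q', '2p', '1p', '1q'],
    'cn_length': [100, 80, 50, 100, 100],
    'arm_length': [100, 100, 50, 100, 100],
    'cn_state': ['DIPLOID', 'GAIN', 'HETLOSS', 'DIPLOID', 'DIPLOID'],
})

def frac_genome_altered_by_sample(arm_level):
    """Altered length / total arm length per sample, rounded to 3 decimals."""
    # Length of the altered segment, or 0 for diploid arms
    altered = arm_level['cn_length'].where(arm_level['cn_state'] != 'DIPLOID', 0)
    altered_length = altered.groupby(arm_level['sample']).sum()
    total_length = arm_level.groupby('sample')['arm_length'].sum()
    return (altered_length / total_length).round(3).rename('frac_genome_altered')

frac = frac_genome_altered_by_sample(arm_level_demo)
print(frac)
```

The result is indexed by sample, so it can be joined to the sample table. Samples absent from the arm-level file simply do not appear and would come out as NaN after a left join.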

In [177]:
get_groupby(table, 'wgd', 'count')  

Unnamed: 0_level_0,count
wgd,Unnamed: 1_level_1
-1.0,17920
1.0,9286


---
## 4. TP53 Computed Metrics

In this part, we compute mainly 4 metrics:

* mutation_count (*create_mut_count*) --> It is the total mutation count per sample
* gene_count (*create_gene_count*)--> It is the number of mutated genes per sample
* max_vaf --> It is the maximum Variant Allele Frequency within all the mutations of a sample
* exp_nb_1 (2,3,4,5) --> It is the expected number of copies of tp53 mutations in a cell 
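
To make the first three metrics concrete, here is a minimal sketch that computes them in one pass with a named aggregation on a toy table. The values are made up, and `'size'` counts every row of a sample; whether the notebook's `get_groupby` helper counts rows in exactly the same way is an assumption.

```python
import pandas as pd

# Toy MAF-like table: one row per mutation (made-up values)
maf_demo = pd.DataFrame({
    'Sample_Id': ['S1', 'S1', 'S1', 'S2'],
    'Gene_Id': ['TP53', 'TP53', 'KRAS', 'TP53'],
    'vaf': [0.40, 0.10, 0.25, 0.60],
})

metrics = (
    maf_demo.groupby('Sample_Id')
    .agg(
        mutation_count=('Gene_Id', 'size'),   # number of mutation rows
        gene_count=('Gene_Id', 'nunique'),    # number of distinct mutated genes
        max_vaf=('vaf', 'max'),               # highest VAF among the mutations
    )
    .reset_index()
)
print(metrics)
```

Here sample S1 has three mutations in two genes, so `mutation_count` and `gene_count` differ, which is the case the two separate metrics are meant to capture.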


In [33]:
def create_gene_count(maf_cohort):
    '''
    This function creates the count of mutated genes for each sample.
    Arguments:
        - maf_cohort: the maf_cohort table stored in data/merged_data
    '''
    
    # First we create the gene_count table by groupbying and sizing, we then change the index
    selected_cohort = maf_cohort[['Sample_Id','Tumor_Id', 'Gene_Id']]
    gene_count = pd.DataFrame(pd.DataFrame(selected_cohort[['Sample_Id', 'Gene_Id']].groupby(['Sample_Id', 'Gene_Id']).size(), columns = ['count']).groupby(['Sample_Id']).size(), columns = ['gene_count'])
    gene_count = gene_count.reset_index()

    # We add the samples without any Gene_Id to the gene_count table to cover the whole cohort
    no_gene_id = selected_cohort.Gene_Id.isna()
    no_gene_samples = set(selected_cohort.loc[no_gene_id, 'Sample_Id'])
    missing_samp = pd.DataFrame(sorted(no_gene_samples), columns = ['Sample_Id'])

    # We concatenate the two dataframes (DataFrame.append was removed in pandas 2.0)
    gene_count = pd.concat([gene_count, missing_samp])
    
    #Fillna with 0
    gene_count = gene_count.fillna(0)

    return gene_count

def create_mut_count(maf_cohort):
    '''
    This function computes the dataframe of mutation count per sample.
    '''
    selected_cohort = maf_cohort[['Sample_Id','Tumor_Id', 'Gene_Id']]
    mut_count = get_groupby(selected_cohort, 'Sample_Id', 'mutation_count')
    
    return mut_count

# The following function needs to be called on the complete master file because it needs info from different parts
def create_copies_tp53_muts(master):
    master['tp53_exp_nb_1'] = master.apply(lambda x:(x.tp53_vaf_1 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_2'] = master.apply(lambda x:(x.tp53_vaf_2 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_3'] = master.apply(lambda x:(x.tp53_vaf_3 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_4'] = master.apply(lambda x:(x.tp53_vaf_4 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_5'] = master.apply(lambda x:(x.tp53_vaf_5 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    
    return master


# Variant Classification groups, shared by the five vc_group_cond_<i> functions below
VC_TRUNCATED = ['Splice_Site','Intron','Nonsense_Mutation','Splice_Region','Frame_Shift_Del','Frame_Shift_Ins']
VC_IN_FRAME = ['In_Frame_Ins','In_Frame_Del']
VC_MISSENSE = ['Missense_Mutation']

def vc_group_cond(x, i):
    # Group of the i-th tp53 mutation of the sample
    # (None when the classification is missing or not listed above)
    vc = x['tp53_vc_' + str(i)]
    if vc in VC_TRUNCATED: return 'truncated'
    if vc in VC_IN_FRAME: return 'in_frame'
    if vc in VC_MISSENSE: return 'missense'

# One wrapper per mutation rank, keeping the names used in create_master
def vc_group_cond_1(x): return vc_group_cond(x, 1)
def vc_group_cond_2(x): return vc_group_cond(x, 2)
def vc_group_cond_3(x): return vc_group_cond(x, 3)
def vc_group_cond_4(x): return vc_group_cond(x, 4)
def vc_group_cond_5(x): return vc_group_cond(x, 5)


In [34]:
def create_computed_metrics(path):
    # We add mutation_count and max_vaf
    maf_cohort = pd.read_pickle(path)

    # MUTATION COUNT
    #We create the table for mutation_count
    mut_count = create_mut_count(maf_cohort)
    
    # We create the table for gene_count
    gene_count = create_gene_count(maf_cohort)

    # MAX_VAF
    # To do so, we groupby Sample_Id and apply the max() function
    # But first we need to transform 'None' values into NaN to compute the max
    maf_cohort['vaf'] = maf_cohort['vaf'].replace('None', np.nan)
    max_vaf = maf_cohort[['Sample_Id','vaf']].groupby(['Sample_Id']).max()
    max_vaf = max_vaf.rename(columns={'vaf': 'max_vaf'})
    
    # Merge the tables
    computed_metrics = pd.merge(mut_count, gene_count, on=['Sample_Id'])
    computed_metrics = pd.merge(computed_metrics, max_vaf, on=['Sample_Id'])
    
    
    return computed_metrics

In [17]:
computed_metrics = create_computed_metrics(data_path + 'merged_data/maf_cohort.pkl')
computed_metrics.head()

Unnamed: 0,Sample_Id,mutation_count,gene_count,max_vaf
0,P-0000004-T01-IM3_P-0000004-N01-IM3,4,4.0,0.547085
1,P-0000012-T02-IM3_P-0000012-N01-IM3,1,1.0,0.502203
2,P-0000024-T01-IM3_P-0000024-N01-IM3,6,5.0,0.368683
3,P-0000025-T02-IM5_P-0000025-N01-IM5,2,2.0,0.203236
4,P-0000026-T01-IM3_P-0000026-N01-IM3,4,4.0,0.590164
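
The fourth metric is computed by `create_copies_tp53_muts` as `(vaf / purity) * (tcn * purity + 2 * (1 - purity))`, applied row by row for each of the five mutation slots. The sketch below writes the same formula once and applies it column-wise. The function names are hypothetical; the input values are those of sample P-0027408-T01-IM6 as displayed in this notebook (purity 0.308886, tp53_tcn 1.0, tp53_vaf_1 0.168901), for which the master table at the end lists tp53_exp_nb_1 = 0.924711.

```python
import numpy as np
import pandas as pd

def expected_copies(vaf, purity, tcn):
    """Expected number of mutated copies, same formula as create_copies_tp53_muts."""
    return (vaf / purity) * (tcn * purity + 2 * (1 - purity))

def add_expected_copies(master, n_muts=5):
    """Hypothetical vectorized variant: one column operation per mutation slot."""
    out = master.copy()
    for i in range(1, n_muts + 1):
        out[f'tp53_exp_nb_{i}'] = expected_copies(
            out[f'tp53_vaf_{i}'], out['purity'], out['tp53_tcn']
        )
    return out

# One sample with a single TP53 mutation; the second slot is empty
demo = pd.DataFrame({
    'purity': [0.308886],
    'tp53_tcn': [1.0],
    'tp53_vaf_1': [0.168901],
    'tp53_vaf_2': [np.nan],
})
demo = add_expected_copies(demo, n_muts=2)
print(demo[['tp53_exp_nb_1', 'tp53_exp_nb_2']])
```

The first value agrees with the master table up to the rounding of the displayed inputs, and an empty mutation slot stays NaN.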


## 5. Subgroup Columns Creation

#### First, we group the different Copy Number States (*tp53_cn_state*) in subgroups, under the column *tp53_first_group*:
* Group 1: cnLOH gathering ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH BEFORE', 'CNLOH BEFORE & GAIN']
* Group 2: LOSS gathering ['LOSS BEFORE', 'HETLOSS', 'LOSS BEFORE & AFTER']
* Group 3: HOMDEL gathering ['HOMDEL']
* Group 4: DOUBLE LOSS AFTER gathering ['DOUBLE LOSS AFTER']
* Group 5: WILD_TYPE gathering ['LOSS AFTER', 'DIPLOID', 'TETRAPLOID']
* Group 6: GAIN gathering ['GAIN']
* Group 7: OTHER gathering ['CNLOH AFTER', 'AMP (BALANCED)', 'AMP (LOH)', 'AMP','LOSS & GAIN', 'CNLOH & GAIN']


Based on this first column, we define 6 final groups of patients by adding the mutational information. These groups are stored in the column *tp53_group*.
* Group 1: Samples with 0 tp53 mutations and HETLOSS
* Group 2: Samples with HOMDEL
* Group 3: Samples with 1 tp53 mutation and WILD_TYPE (DIPLOID, LOSS AFTER, TETRAPLOID)
* Group 4: Samples with 1 tp53 mutation or more and LOSS
* Group 5: Samples with 1 tp53 mutation or more and cnLOH
* Group 6: Samples with 2/3/4/5 tp53 mutations and WILD_TYPE, DOUBLE LOSS AFTER or GAIN

We define the columns thanks to 2 functions that we call in the **Merge Tables** part through *create_master* function.
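
As a compact view of the first grouping listed above, the sketch below stores the seven groups in a dictionary and applies them with `Series.map`. This is an illustrative alternative to a chain of `if` statements, not the notebook's implementation; the names `CN_GROUPS` and `STATE_TO_GROUP` are assumptions.

```python
import pandas as pd

# Group name -> copy number states, copied from the list above
CN_GROUPS = {
    'cnLOH': ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH BEFORE', 'CNLOH BEFORE & GAIN'],
    'LOSS': ['LOSS BEFORE', 'HETLOSS', 'LOSS BEFORE & AFTER'],
    'HOMDEL': ['HOMDEL'],
    'DOUBLE LOSS AFTER': ['DOUBLE LOSS AFTER'],
    'WILD_TYPE': ['LOSS AFTER', 'DIPLOID', 'TETRAPLOID'],
    'GAIN': ['GAIN'],
    'OTHER': ['CNLOH AFTER', 'AMP (BALANCED)', 'AMP (LOH)', 'AMP', 'LOSS & GAIN', 'CNLOH & GAIN'],
}

# Inverted lookup: state -> group
STATE_TO_GROUP = {state: group for group, states in CN_GROUPS.items() for state in states}

# States that are not listed (e.g. INDETERMINATE) or missing map to NaN
states = pd.Series(['HETLOSS', 'DIPLOID', 'INDETERMINATE', None])
groups = states.map(STATE_TO_GROUP)
print(groups)
```

Keeping the groups in one dictionary also makes it easy to verify that no state is assigned to two groups: the inverted lookup has exactly as many entries as there are states listed.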

In [45]:
def cn_group_cond(x):
    if x.tp53_cn_state in ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH BEFORE', 'CNLOH BEFORE & GAIN']:
        return 'cnLOH'
    if x.tp53_cn_state in ['LOSS BEFORE', 'HETLOSS', 'LOSS BEFORE & AFTER']:
        return 'LOSS'
    if x.tp53_cn_state == 'HOMDEL':
        return 'HOMDEL'
    if x.tp53_cn_state in ['LOSS AFTER', 'DIPLOID', 'TETRAPLOID']:
        return 'WILD_TYPE'
    if x.tp53_cn_state == 'DOUBLE LOSS AFTER':
        return 'DOUBLE LOSS AFTER'
    if x.tp53_cn_state == 'GAIN':
        return 'GAIN'
    if x.tp53_cn_state in ['CNLOH AFTER', 'AMP (BALANCED)', 'AMP (LOH)', 'AMP','LOSS & GAIN', 'CNLOH & GAIN']:
        return 'OTHER'

def mut_cn_group_cond(x):
    if x.tp53_cn_state == 'HETLOSS' and x.tp53_count == 0:
        return '0_HETLOSS'
    if x.tp53_first_group == 'HOMDEL':
        return 'HOMDEL'
    if x.tp53_first_group == 'WILD_TYPE' and x.tp53_count == 1 :
        return '1_WILD_TYPE'
    if x.tp53_first_group == 'LOSS' and x.tp53_count >=1:
        return '>=1_LOSS'
    if x.tp53_first_group == 'cnLOH' and x.tp53_count >=1:
        return '>=1_cnLOH'
    if (x.tp53_first_group == 'WILD_TYPE' or x.tp53_first_group == 'DOUBLE LOSS AFTER' or x.tp53_first_group == 'GAIN') and x.tp53_count > 1:
        return '>1muts'

---
## 6. Merge Tables

In [79]:
from tqdm.notebook import tqdm

def compute_genome_instability():
    # Registers progress_apply on pandas objects (tqdm_notebook is deprecated in favour of tqdm.notebook)
    tqdm.pandas()
    sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
    sample_info['chr_comput'] = sample_info.progress_apply(chr_computations, axis=1)
    print('checkpoint 1')
    sample_info[['chr_affected', 'chr_loss', 'chr_gain', 'chr_cnloh', 'frac_genome_altered']] = pd.DataFrame(sample_info.chr_comput.values.tolist(), index= sample_info.index)
    print('checkpoint 2')
    
    return sample_info

sample_info = compute_genome_instability()
sample_info[['Sample_Id','chr_affected', 'chr_loss', 'chr_gain', 'chr_cnloh', 'frac_genome_altered']].to_pickle(data_path + 'merged_data/chr_metrics.pkl')


checkpoint 1
checkpoint 2


KeyError: "['Sample_id'] not in index"

In [82]:
def create_master():
    '''
    This function creates the tables and merges them.
    '''
    sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
    tp53_muts = create_tp53_muts(sample_info, data_path + 'merged_data/maf_tp53.pkl')
    
    # We load the copy number table if it is already stored 
    file = Path(data_path + 'merged_data/copy_number.pkl')
    if file.is_file():
        copy_number_state =  pd.read_pickle(data_path + 'merged_data/copy_number.pkl')
        copy_number_state = copy_number_state.rename(columns={'tcn': 'tp53_tcn',
                                                          'mcn': 'tp53_mcn',
                                                          'lcn': 'tp53_lcn',
                                                          'seg_length': 'tp53_seg_length',
                                                          'cn_state':'tp53_cn_state'})
    else:
        copy_number_state = create_copy_number_state(sample_info, data_path + 'impact-facets-tp53/raw/default_qc_pass.gene_level_TP53.txt')
    computed_metrics = create_computed_metrics(data_path + 'merged_data/maf_cohort.pkl')
    
    # We first merge sample_info and tp53_muts because they have the same list of keys
    master_file = pd.merge(sample_info, tp53_muts, on=['Tumor_Id'])
    # copy_number_state is merged on Sample_Id with the default inner join:
    # samples missing from either table are dropped
    master_file = pd.merge(master_file, copy_number_state, on=['Sample_Id'])
    # Finally we merge the computed_metrics table
    master_file = pd.merge(master_file, computed_metrics, on=['Sample_Id'])
    
    # We filter out the sample duplicates
    master_file = master_file.drop_duplicates('Sample_Id')
    
    # At this step we need to remove samples that come from the same tumor but a different normal sample
    # BUT this step makes us lose important clinical information for some samples
    # So we will spread the Patient_Id and Cancer_Type by back- and forward-filling the non-NaN values
    # within each patient (both fills are done inside the group, so nothing leaks between patients)
    master_file['Patient_Id'] = master_file.Tumor_Id.str[:9]
    master_file['Cancer_Type'] = master_file.groupby('Patient_Id')['Cancer_Type'].transform(lambda s: s.bfill().ffill())
    
    # Then we filter out samples with the same Tumor_Id but different Sample_Id with a filter function
    master_file = normal_samp_duplicates_filter(master_file, 'Sample_Id', 'purity')
    master_file = normal_samp_duplicates_filter(master_file, 'Sample_Id', 'purity')
    
    #We compute the expected number of copies of tp53 mutations
    master_file = create_copies_tp53_muts(master_file)
    
    #Finally we add the subgroup columns defined in Part 5
    master_file['tp53_first_group'] = master_file.apply(cn_group_cond, axis = 1)
    master_file['tp53_group'] = master_file.apply(mut_cn_group_cond, axis = 1)
    
    # We add Genome Instability columns
    # Genome Instability columns computed from arm_level file
    arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    arm_level['chr'] = arm_level.arm.str.extract(r'(\d+)')
    print('checkpoint_1')
    #master_file['chr_comput'] = master_file.apply(chr_computations, axis=1)
    print('checkpoint_2')
    chr_metrics =  pd.read_pickle(data_path + 'merged_data/chr_metrics.pkl')
    master_file = pd.merge(master_file, chr_metrics, on=['Sample_Id'])
    
    print('checkpoint_3')
    
    # Grouping the Variant Classification into 3 classes
    master_file['tp53_vc_group_1'] = master_file.apply(vc_group_cond_1, axis = 1)
    master_file['tp53_vc_group_2'] = master_file.apply(vc_group_cond_2, axis = 1)
    master_file['tp53_vc_group_3'] = master_file.apply(vc_group_cond_3, axis = 1)
    master_file['tp53_vc_group_4'] = master_file.apply(vc_group_cond_4, axis = 1)
    master_file['tp53_vc_group_5'] = master_file.apply(vc_group_cond_5, axis = 1)
    return master_file

In [83]:
%%time
master_file = create_master()

checkpoint_1
checkpoint_2
checkpoint_3
CPU times: user 35.6 s, sys: 1.61 s, total: 37.2 s
Wall time: 37.7 s
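
`create_master` relies on `normal_samp_duplicates_filter` from `utils/filters`, whose code is not shown in this notebook. According to the comments, it handles samples that share a tumor but were paired with different normal samples, keeping the one with the highest purity. The sketch below is a hypothetical stand-in that follows this description; the function name, its signature and the toy values are assumptions and may differ from the real helper.

```python
import pandas as pd

def keep_highest_purity_per_tumor(df, tumor_col='Tumor_Id', purity_col='purity'):
    """Hypothetical stand-in: keep, for each tumor, the row with the highest purity."""
    # Highest purity first; rows without purity go last; mergesort is stable, so ties keep their order
    ordered = df.sort_values(purity_col, ascending=False, na_position='last', kind='mergesort')
    kept = ordered.drop_duplicates(subset=tumor_col, keep='first')
    # Restore the original row order
    return kept.sort_index()

# Tumor T1 was sequenced against two normal samples (made-up values)
demo = pd.DataFrame({
    'Sample_Id': ['T1_N1', 'T1_N2', 'T2_N1'],
    'Tumor_Id': ['T1', 'T1', 'T2'],
    'purity': [0.30, 0.55, 0.80],
})
filtered = keep_highest_purity_per_tumor(demo)
print(filtered)
```

Only `T1_N2` survives for tumor T1 because its purity estimate is higher, while `T2_N1` is untouched.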


In [86]:
master_file

Unnamed: 0,Sample_Id,Tumor_Id,Patient_Id,Cancer_Type,Cancer_Type_Detailed,Sample_Type,purity,ploidy,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,tp53_key_1,tp53_vc_1,tp53_ccf_1,tp53_vaf_1,tp53_HGVSp_1,tp53_spot_1,tp53_key_2,tp53_vc_2,tp53_ccf_2,tp53_vaf_2,tp53_HGVSp_2,tp53_spot_2,tp53_key_3,tp53_vc_3,tp53_ccf_3,tp53_vaf_3,tp53_HGVSp_3,tp53_spot_3,tp53_key_4,tp53_vc_4,tp53_ccf_4,tp53_vaf_4,tp53_HGVSp_4,tp53_spot_4,tp53_key_5,tp53_vc_5,tp53_ccf_5,tp53_vaf_5,tp53_HGVSp_5,tp53_spot_5,tp53_count,tp53_tcn,tp53_mcn,tp53_lcn,tp53_seg_length,tp53_cn_state,cf,wgd,mutation_count,gene_count,max_vaf,tp53_exp_nb_1,tp53_exp_nb_2,tp53_exp_nb_3,tp53_exp_nb_4,tp53_exp_nb_5,tp53_first_group,tp53_group,chr_affected,chr_loss,chr_gain,chr_cnloh,frac_genome_altered,tp53_vc_group_1,tp53_vc_group_2,tp53_vc_group_3,tp53_vc_group_4,tp53_vc_group_5
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,P-0034223,Breast Cancer,Invasive Breast Carcinoma,Metastasis,0.941111,2.241830,1.0,LIVING,,0.55,5.3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2.0,1.0,1.0,80668592.0,DIPLOID,1.000000,-1.0,6,6.0,0.901899,,,,,,WILD_TYPE,,3,2,2,0,0.080,,,,,
1,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,P-0009819,Prostate Cancer,Prostate Adenocarcinoma,Primary,0.275237,2.681075,1.0,LIVING,23.441,0.00,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,1.0,1.0,0.0,80668309.0,HETLOSS,0.154578,-1.0,3,3.0,0.148014,,,,,,LOSS,0_HETLOSS,4,4,0,0,0.137,,,,,
2,P-0025956-T01-IM6_P-0025956-N01-IM6,P-0025956-T01-IM6,P-0025956,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Primary,0.185874,3.496971,1.0,DECEASED,3.584,0.00,5.3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,6,6.0,0.200000,,,,,,,,0,0,0,0,,,,,,
3,P-0027408-T01-IM6_P-0027408-N01-IM6,P-0027408-T01-IM6,P-0027408,Non-Small Cell Lung Cancer,Non-Small Cell Lung Cancer,Metastasis,0.308886,1.811066,1.0,LIVING,22.586,0.27,17.6,P-0027408-T01-IM6_17_7578409_CT_TC,Missense_Mutation,0.925,0.168901,p.Arg174Glu,174,,,,,,,,,,,,,,,,,,,,,,,,,1,1.0,1.0,0.0,25260272.0,HETLOSS,0.315621,-1.0,21,19.0,0.192475,0.924711,,,,,LOSS,>=1_LOSS,12,11,3,0,0.452,missense,,,,
4,P-0006554-T01-IM5_P-0006554-N01-IM5,P-0006554-T01-IM5,P-0006554,Glioma,Anaplastic Oligodendroglioma,Primary,0.715208,1.910719,1.0,LIVING,26.170,1.30,46.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2.0,,,7465132.0,INDETERMINATE,,,47,39.0,0.706897,,,,,,,,5,4,1,0,0.173,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29254,P-0050644-T01-IM6_P-0050644-N01-IM6,P-0050644-T01-IM6,P-0050644,Prostate Cancer,Prostate Adenocarcinoma,Primary,0.576103,2.175837,1.0,LIVING,1.249,0.00,0.9,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2.0,1.0,1.0,80668455.0,DIPLOID,1.000000,-1.0,3,3.0,0.300000,,,,,,WILD_TYPE,,7,7,0,0,0.144,,,,,
29255,P-0050741-T01-IM6_P-0050741-N01-IM6,P-0050741-T01-IM6,P-0050741,Small Cell Lung Cancer,Small Cell Lung Cancer,Metastasis,0.833591,2.006039,1.0,LIVING,1.940,0.22,7.0,P-0050741-T01-IM6_17_7578394_T_A,Missense_Mutation,1.000,0.757801,p.His179Leu,179,,,,,,,,,,,,,,,,,,,,,,,,,1,1.0,1.0,0.0,21694879.0,HETLOSS,0.824066,-1.0,8,8.0,0.757801,1.060359,,,,,LOSS,>=1_LOSS,6,5,2,0,0.230,missense,,,,
29256,P-0050747-T01-IM6_P-0050747-N01-IM6,P-0050747-T01-IM6,P-0050747,Pancreatic Cancer,Pancreatic Adenocarcinoma,Primary,0.360576,2.187990,1.0,LIVING,,0.05,6.1,P-0050747-T01-IM6_17_7577570_C_T,Missense_Mutation,0.937,0.168975,p.Met237Ile,237,,,,,,,,,,,,,,,,,,,,,,,,,1,1.0,1.0,0.0,18461117.0,HETLOSS,0.254038,-1.0,6,6.0,0.270471,0.768275,,,,,LOSS,>=1_LOSS,5,5,1,0,0.192,missense,,,,
29257,P-0050652-T01-IM6_P-0050652-N01-IM6,P-0050652-T01-IM6,P-0050652,Pancreatic Cancer,Pancreatic Adenocarcinoma,Primary,0.171442,2.011650,1.0,LIVING,1.085,0.26,2.6,P-0050652-T01-IM6_17_7578208_T_C,Missense_Mutation,,0.082168,p.His214Arg,214,,,,,,,,,,,,,,,,,,,,,,,,,1,2.0,1.0,1.0,28981250.0,DIPLOID,1.000000,-1.0,3,3.0,0.086842,0.958552,,,,,WILD_TYPE,1_WILD_TYPE,2,1,0,1,0.088,missense,,,,


In [190]:
print('Number of samples: ' + str(len(set(master_file.Tumor_Id))))
print('Number of patients: ' + str(len(set(master_file.Patient_Id))))
# Column names follow the tp53_-prefixed schema of the master_file displayed above
print('Number of tp53 positive samples: ' + str( len(set(master_file.Tumor_Id)) - master_file.tp53_key_1.isna().sum()))
print('Number of tp53 positive patients: ' + str( len(set(master_file.Patient_Id)) - master_file.drop_duplicates('Patient_Id').tp53_key_1.isna().sum()))
print('Number of samples with missing wgd : ' + str(master_file.wgd.isna().sum()))
print('Number of samples with missing cf : ' + str(master_file.cf.isna().sum()))
print('Number of samples with missing max_vaf : ' + str(master_file.max_vaf.isna().sum()))
print('Number of samples with missing Cn state: ' + str(master_file.tp53_cn_state.isna().sum()))
print('Number of samples with missing Sample_Type : ' + str(master_file.Sample_Type.isna().sum()))

Number of samples: 29259
Number of patients: 27021
Number of tp53 positive samples: 12731
Number of tp53 positive patients: 11885
Number of samples with missing wgd : 2092
Number of samples with missing cf : 1746
Number of samples with missing max_vaf : 1421
Number of samples with missing Cn state: 517
Number of samples with missing Sample_Type : 104
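
The per-column missing counts can also be gathered in a single call. This is a minimal sketch on a toy dataframe; `missing_summary` is a hypothetical helper and the values are made up.

```python
import numpy as np
import pandas as pd

def missing_summary(df, columns):
    """Return the number of missing values for each requested column."""
    return df[columns].isna().sum()

# Toy data with a few missing entries
toy = pd.DataFrame({'wgd': [1.0, np.nan, -1.0],
                    'cf': [0.5, np.nan, np.nan],
                    'max_vaf': [0.2, 0.3, 0.4]})

summary = missing_summary(toy, ['wgd', 'cf', 'max_vaf'])
print(summary)
```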


In [87]:
# Saving to pickle File
master_file.to_pickle(data_path + 'merged_data/master_file.pkl')

In [230]:
master_file[master_file['wgd'] == 1].describe()

Unnamed: 0,purity,ploidy,samples_per_patient,Overall Survival (Months),MSI Score,ccf_1,vaf_1,ccf_2,vaf_2,ccf_3,vaf_3,ccf_4,vaf_4,ccf_5,vaf_5,tp53_count,tcn,mcn,lcn,seg_length,cf,wgd,mutation_count,gene_count,max_vaf
count,9272.0,9272.0,9266.0,8750.0,9265.0,5268.0,6025.0,323.0,357.0,21.0,27.0,2.0,6.0,1.0,2.0,9272.0,9270.0,8085.0,8085.0,9270.0,8686.0,9272.0,9272.0,9272.0,9212.0
mean,0.477368,3.185226,1.28351,18.34375,1.146893,0.903747,0.457104,0.73622,0.277294,0.738286,0.284815,1.0,0.278522,0.612,0.221435,0.692731,2.540777,2.105133,0.36945,33524780.0,0.470042,1.0,8.488676,7.788395,0.488342
std,0.188877,0.567033,0.639602,15.825557,1.980211,0.151989,0.211558,0.258898,0.165031,0.329655,0.201141,0.0,0.120586,,0.026298,0.554211,1.256511,1.054938,0.631463,25328210.0,0.240559,0.0,12.572563,9.223415,0.213706
min,0.122458,1.376369,1.0,0.0,-1.0,0.001,0.020833,0.029,0.023593,0.107,0.026432,1.0,0.136111,0.612,0.20284,0.0,0.0,0.0,0.0,4679.0,0.00485,1.0,1.0,0.0,0.022587
25%,0.323767,2.798844,1.0,5.655,0.21,0.86175,0.289655,0.533,0.156475,0.419,0.141049,1.0,0.177837,0.612,0.212137,0.0,2.0,2.0,0.0,15942220.0,0.281185,1.0,4.0,3.0,0.321511
50%,0.450435,3.133504,1.0,13.512,0.66,0.972,0.428721,0.83,0.246166,0.917,0.255076,1.0,0.287632,0.612,0.221435,1.0,2.0,2.0,0.0,21694830.0,0.414251,1.0,6.0,6.0,0.467728
75%,0.611323,3.492561,1.0,27.847,1.46,1.0,0.603064,0.9895,0.363073,1.0,0.36605,1.0,0.361761,0.612,0.230733,1.0,3.0,2.0,1.0,40385900.0,0.621502,1.0,9.0,9.0,0.642857
max,0.968505,6.724265,9.0,73.118,39.6,1.0,0.992523,1.0,0.919643,1.0,0.930791,1.0,0.432049,0.612,0.24003,5.0,76.0,71.0,5.0,80969000.0,1.0,1.0,433.0,216.0,1.0


In [85]:
maf_cohort_annotated = pd.read_pickle(data_path + 'merged_data/maf_cohort_annotated.pkl')

EOFError: Ran out of input
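
`EOFError: Ran out of input` is what `pickle` raises when the file is empty, for example after an interrupted write. The sketch below checks the file size before loading so that the failure is reported explicitly. `read_pickle_checked` is a hypothetical helper, and the cause of the empty file in this notebook is not known.

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_pickle_checked(path):
    """Load a pickled dataframe, failing early if the file is missing or empty."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f'{path} does not exist')
    if path.stat().st_size == 0:
        raise ValueError(f'{path} is empty; the file probably needs to be regenerated')
    return pd.read_pickle(path)

# Demonstration on temporary files
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'good.pkl'
    pd.DataFrame({'a': [1, 2]}).to_pickle(good)
    loaded = read_pickle_checked(good)

    empty = Path(tmp) / 'empty.pkl'
    empty.touch()  # zero-byte file, like a pickle whose write was interrupted
    try:
        read_pickle_checked(empty)
        empty_detected = False
    except ValueError:
        empty_detected = True

print(loaded.shape, empty_detected)
```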