# <span style='font-family:"Times New Roman"'> <span styel=''> **MASTER FILE CREATION**

## <span style='font-family:"Times New Roman"'> <span styel=''> *Emile Cohen* 
*March 2020*

**Goal:** In this Notebook, we create a master file that summarizes all useful information.

The Notebook is divided in 4 parts, representing the four parts of our Master file:
   
* **1. Patient/Sample Information**
* **2. TP53 Mutations**
* **3. TP53 Copy Numbers**
* **4. TP53 Computed Metrics**
* **5. Subgroup columns creation**
* **6. Merge tables**

**NB1:** In each part, you must run the cells from the begining in order to initialize the variables

**NB2:** In order to launch the last script (Merge Tables), you have to define the functions in each part.

**NB3:** All functions used for the plots are located in utils/custom_tools.py

---

In [77]:
%run -i '../../utils/setup_environment.ipy'

from pathlib import Path
from utils.filters import *

import warnings
warnings.filterwarnings('ignore')

data_path = '../../data/'

Setup environment... done!


<span style="color:green">✅ Working on **mskimpact_env** conda environment.</span>

---
## 1. Patient/Sample Information

In this part, we focus on clinical information exported from CbioPortal. We use the maf file created in the script *./maf_cohort_creation.ipynb* and stored in *../../data/merged_data/maf_cohort.pkl*.

The following columns are selected:
* Sample_Id
* Tumor_Id
* Patient_Id
* Cancer_Type
* Cancer_Type_Detailed
* Sample_Type
* purity
* ploidy
* samples_per_patient
* Overall Survival Status
* Overall Survival (Months)
* MSI Score
* Tumor Mutational Burden

In [78]:
def create_sample_info(path):
    '''
    This function aims to create a dataframe gathering all samples from the cohort with important clinical
    information.
    We use the function normal_samp_duplicates_filter to filter out the samples that have the same tumor
    but different normal samples, we keep only the one with the highest purity.
    '''
    maf_cohort = pd.read_pickle(path)
    
    #We select only interesting columns
    selected_cohort = maf_cohort[['Sample_Id','Tumor_Id', 'Patient_Id','Cancer_Type', 'Cancer_Type_Detailed', 'Sample_Type', 'purity', 'ploidy',
                                  'samples_per_patient','Overall Survival Status', 'Overall Survival (Months)', 
                                  'MSI Score', 'TMB_Score', 'Patient_Current_Age', 'Sex', 'Ethnicity_Category', 'Race_Category', 'mutationStatus', 'Somatic_Status']]

    # But we have many duplicates
    # First, we remove the duplicates based on Sample_Id
    selected_cohort = selected_cohort.drop_duplicates('Sample_Id')
    
    return selected_cohort #selected_cohort

In [79]:
sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
sample_info

Unnamed: 0,Sample_Id,Tumor_Id,Patient_Id,Cancer_Type,Cancer_Type_Detailed,Sample_Type,purity,ploidy,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Patient_Current_Age,Sex,Ethnicity_Category,Race_Category,mutationStatus,Somatic_Status
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,P-0034223,Breast Cancer,Invasive Breast Carcinoma,Metastasis,0.941111,2.241830,1.0,LIVING,,0.55,5.3,63.0,Female,,NO VALUE ENTERED,SOMATIC,Matched
6,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,P-0009819,Prostate Cancer,Prostate Adenocarcinoma,Primary,0.275237,2.681075,1.0,LIVING,23.441,0.00,1.0,72.0,Male,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,SOMATIC,Matched
9,P-0025956-T01-IM6_P-0025956-N01-IM6,P-0025956-T01-IM6,P-0025956,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Primary,0.185874,3.496971,1.0,DECEASED,3.584,0.00,5.3,71.0,Female,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
15,P-0027408-T01-IM6_P-0027408-N01-IM6,P-0027408-T01-IM6,P-0027408,Non-Small Cell Lung Cancer,Non-Small Cell Lung Cancer,Metastasis,0.308886,1.811066,1.0,LIVING,22.586,0.27,17.6,67.0,Female,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
38,P-0006554-T01-IM5_P-0006554-N01-IM5,P-0006554-T01-IM5,P-0006554,Glioma,Anaplastic Oligodendroglioma,Primary,0.715208,1.910719,1.0,LIVING,26.170,1.30,46.2,55.0,Female,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286942,P-0025736-T03-IM6_P-0025736-N01-IM6,P-0025736-T03-IM6,P-0025736,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Metastasis,,2.000000,2.0,LIVING,25.348,0.00,1.8,62.0,Female,Non-Spanish; Non-Hispanic,BLACK OR AFRICAN AMERICAN,SOMATIC,Matched
286944,P-0026308-T01-IM6_P-0026308-N01-IM6,P-0026308-T01-IM6,P-0026308,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Metastasis,0.300000,2.399804,1.0,LIVING,21.271,0.00,2.6,81.0,Male,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
286947,P-0050500-T01-IM6_P-0050500-N01-IM6,P-0050500-T01-IM6,P-0050500,Bladder Cancer,Bladder Urothelial Carcinoma,Primary,,2.000000,1.0,LIVING,0.690,0.14,33.4,89.0,Male,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
286987,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,P-0050657,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Metastasis,,2.000000,1.0,LIVING,2.005,0.05,2.6,82.0,Male,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched


---
## 2. TP53 Mutations

In this part, we focus on mutational information exported from *default_qc_pass.ccf_TP53.maf* file. We use the maf file created in the script *./maf_tp53_creation.ipynb* and stored in *../../data/merged_data/maf_tp53.pkl*.

We gather all mutations per sample, and split it into different columns. We have the following columns:
* Tumor_Id	
* key_1 (2,3,4,5) --> Mutation key allowing to filter duplicates
* vc_1 (2,3,4,5) --> Variant Classification
* ccf_1 (2,3,4,5) --> Cancer Cell Fraction of the mutation
* vaf_1 (2,3,4,5) --> Variant Allele Frequency of the mutation
* HGVSp_1 (2,3,4,5) --> protein change
* spot_1 (2,3,4,5) --> Integer that defines the spot of the tp53 mutation
* tp53_count --> Number of tp53 mutations of the sample


In [36]:
def f(x):
    # This function helps us to group mutations together in a single cell per patient
    return pd.DataFrame(dict(Sample_Id = x['Sample_Id'],  
                        muts = "%s" % ','.join(x['sample_mut_key_vc_ccf_vaf_hgv_spot'])))

def count_tp53_muts(x):
    count = 0
    for i in range(1,6):
        if x['tp53_key_' + str(i)]:
            count+= 1
    return count

def create_tp53_muts(sample_info, path):
    '''
    This function aims to gather all tp53 mutation characteristics.
    For each sample we gather the tp53 mutations and their characteristics for all patients.
    '''
    # We load the  table created in maf_tp53_creation.ipynb
    maf_tp53 = pd.read_pickle(path)
    
    # We select only intresting columns
    maf_tp53_filtered = maf_tp53[['Sample_Id','sample_mut_key', 'Variant_Classification',\
                                        'ccf_expected_copies', 't_var_freq', 'HGVSp','mut_spot' ]]

    # Let's Merge mut_key,Variant_classification, CF, CCF, and VAF to gather them
    maf_tp53_filtered['sample_mut_key_vc_ccf_vaf_hgv_spot'] = pd.Series([str(i)+'%'+str(j)+'%'+str(k)+'%'+str(l)+'%'+str(m)+'%'+str(n) for i,j,k,l,m,n\
                                                            in zip(maf_tp53_filtered.sample_mut_key, \
                                                                   maf_tp53_filtered.Variant_Classification,\
                                                                   maf_tp53_filtered.ccf_expected_copies,\
                                                                   maf_tp53_filtered.t_var_freq,\
                                                                   maf_tp53_filtered.HGVSp,\
                                                                   maf_tp53_filtered.mut_spot\
                                                                  )]) 

 

    # We Select important columns
    final = maf_tp53_filtered[['Sample_Id', 'sample_mut_key_vc_ccf_vaf_hgv_spot']]
    # We groupby Patient_Id and apply the function above to group mutations
    final = final.groupby(['Sample_Id'], sort=False).apply(f)
    
    # We separate the different mutations into 5 different columns (5 is the max number of tp53 mutations in our cohort)
    final[['mut_key_1','mut_key_2','mut_key_3','mut_key_4','mut_key_5', 'mut_key_6']] = final.muts.str.split(',', expand=True)
    final = final.drop(['mut_key_6'],axis=1)
    # Split the columns into mut_key_ and vc_
    final[['tp53_key_1','tp53_vc_1','tp53_ccf_1','tp53_vaf_1','tp53_HGVSp_1', 'tp53_spot_1']] = final.mut_key_1.str.split('%', expand=True)
    final[['tp53_key_2','tp53_vc_2','tp53_ccf_2','tp53_vaf_2','tp53_HGVSp_2', 'tp53_spot_2']] = final.mut_key_2.str.split('%', expand=True)
    final[['tp53_key_3','tp53_vc_3','tp53_ccf_3','tp53_vaf_3','tp53_HGVSp_3', 'tp53_spot_3']] = final.mut_key_3.str.split('%', expand=True)
    final[['tp53_key_4','tp53_vc_4','tp53_ccf_4','tp53_vaf_4','tp53_HGVSp_4', 'tp53_spot_4']] = final.mut_key_4.str.split('%', expand=True)
    final[['tp53_key_5','tp53_vc_5','tp53_ccf_5','tp53_vaf_5','tp53_HGVSp_5', 'tp53_spot_5']] = final.mut_key_5.str.split('%', expand=True)

    # We remove the muts column
    final = final.drop(['muts','mut_key_1','mut_key_2','mut_key_3','mut_key_4','mut_key_5'], axis=1)

    # We remove duplicates
    final = final.drop_duplicates('Sample_Id')

    # We add the cohort patients that are not tp53 positive
    #First we create a dataframe with all missing samples
    cohort_samples = set(sample_info.Tumor_Id)
    final_samples = set(final.Sample_Id)
    missing_samp = pd.DataFrame(cohort_samples - final_samples, columns = ['Sample_Id'])
    #Then we append the two datframe
    final = final.append(missing_samp)
    
    # We rename the Sample_Id column to have the same key as in other datframes
    final = final.rename(columns={'Sample_Id': 'Tumor_Id'})
    
    # We add a last column tp53_count that represents the number of tp53 mutations per sample
    final = final.where(final.notnull(), None)
    final['tp53_count'] = final.apply(count_tp53_muts, axis = 1)
    
    # We change the type of vafs column to float64 instead of strings
    final = final.astype({'tp53_vaf_1': 'float64', 'tp53_vaf_2': 'float64', 'tp53_vaf_3': 'float64', 'tp53_vaf_4': 'float64', 'tp53_vaf_5': 'float64',
                       'tp53_ccf_1': 'float64', 'tp53_ccf_2': 'float64', 'tp53_ccf_3': 'float64', 'tp53_ccf_4': 'float64', 'tp53_ccf_5': 'float64'})

    return final

In [37]:
tp53_muts = create_tp53_muts(sample_info, data_path + 'merged_data/maf_tp53.pkl')
tp53_muts.head()

Unnamed: 0,Tumor_Id,tp53_key_1,tp53_vc_1,tp53_ccf_1,tp53_vaf_1,tp53_HGVSp_1,tp53_spot_1,tp53_key_2,tp53_vc_2,tp53_ccf_2,tp53_vaf_2,tp53_HGVSp_2,tp53_spot_2,tp53_key_3,tp53_vc_3,tp53_ccf_3,tp53_vaf_3,tp53_HGVSp_3,tp53_spot_3,tp53_key_4,tp53_vc_4,tp53_ccf_4,tp53_vaf_4,tp53_HGVSp_4,tp53_spot_4,tp53_key_5,tp53_vc_5,tp53_ccf_5,tp53_vaf_5,tp53_HGVSp_5,tp53_spot_5,tp53_count
0,P-0027408-T01-IM6,P-0027408-T01-IM617_7578409_CT_TC,Missense_Mutation,0.925,0.168901,p.Arg174Glu,174,,,,,,,,,,,,,,,,,,,,,,,,,1
1,P-0036909-T01-IM6,P-0036909-T01-IM617_7577121_G_A,Missense_Mutation,0.812,0.312169,p.Arg273Cys,273,,,,,,,,,,,,,,,,,,,,,,,,,1
2,P-0023546-T01-IM6,P-0023546-T01-IM617_7578442_T_C,Missense_Mutation,0.935,0.84507,p.Tyr163Cys,163,,,,,,,,,,,,,,,,,,,,,,,,,1
3,P-0023546-T02-IM6,P-0023546-T02-IM617_7578442_T_C,Missense_Mutation,1.0,0.636735,p.Tyr163Cys,163,,,,,,,,,,,,,,,,,,,,,,,,,1
4,P-0025997-T01-IM6,P-0025997-T01-IM617_7578471_G_-,Frame_Shift_Del,1.0,0.912621,p.Gly154AlafsTer16,154,,,,,,,,,,,,,,,,,,,,,,,,,1


---
## 3. TP53 Copy Numbers

In this part, we gather the information from gene_level table.
We creaste the following columns:
* Sample_Id 
* tcn --> total copy number
* mcn --> major copy number
* lcn --> lower copy number
* seg_length --> length of the segment
* cn_state --> copy number state
* cf --> Cell fraction of the cn_state
* wgd --> Wholde Genome Doubling (1 or -1)

In [89]:
def wgd_condition(x):
    arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    cond_wgd = ['LOSS BEFORE & AFTER', 'LOSS BEFORE', 'CNLOH BEFORE & AFTER',
           'CNLOH BEFORE', 'CNLOH BEFORE & GAIN', 'DOUBLE LOSS AFTER',
           'LOSS AFTER', 'CNLOH AFTER', 'LOSS & GAIN']
    cond_no_wgd = ['CNLOH', 'HETLOSS', 'CNLOH & GAIN', 'DIPLOID']
    
    for tp53_cn_state in list(arm_level[arm_level['sample'] == x.Sample_Id]['cn_state']):
        if tp53_cn_state in cond_wgd:
            return 1
        
    if x.tp53_cn_state in cond_no_wgd :
        return -1

In [90]:
from tqdm import tqdm,tqdm_notebook

def create_copy_number_state(sample_info, path):
    tqdm_notebook().pandas()
    arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    
    gene_level = pd.read_csv(path, sep='\t')
    gene_level['Tumor_Id'] = gene_level['sample'].str[:17]
    gene_level_subset = gene_level[['sample','tcn','mcn','lcn','seg_length','cn_state', 'cf.em']]
    
    # We add the cohort patients that are not in the dataframe
    #First we create a dataframe with all missing samples
    cohort_samples = set(sample_info.Sample_Id)
    gene_level_samples = set(gene_level_subset['sample'])
    missing_samp = pd.DataFrame(cohort_samples - gene_level_samples, columns = ['sample'])
    
    
    #Then we append the two dataframe
    gene_level_subset = gene_level_subset.append(missing_samp)
    
    # We rename the cf.em column 
    gene_level_subset = gene_level_subset.rename(columns={'cf.em': 'tp53_cf', 
                                                          'sample':'Sample_Id',
                                                          'tcn': 'tp53_tcn',
                                                          'mcn': 'tp53_mcn',
                                                          'lcn': 'tp53_lcn',
                                                          'seg_length': 'tp53_seg_length',
                                                          'cn_state':'tp53_cn_state'})
    
    # We add WGD information
    gene_level_subset['wgd'] = gene_level_subset.progress_apply(wgd_condition, axis = 1)
    
    return gene_level_subset

In [93]:
%%time 
gene_level_subset = create_copy_number_state(sample_info, data_path + 'impact-facets-tp53/raw/default_qc_pass.gene_level_TP53.txt')
gene_level_subset.to_pickle(data_path + 'merged_data/new_copy_number.pkl')

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=29445.0), HTML(value='')))



CPU times: user 6h 6min 8s, sys: 48min 52s, total: 6h 55min
Wall time: 6h 54min 55s


In [125]:
def condition_CNLOH(x):
    CNLOH = ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH AFTER', 'CNLOH BEFORE', 'CNLOH & GAIN', 'CNLOH BEFORE & GAIN', 'AMP (LOH)']
    if x.cn_state in CNLOH:
        return 'CNLOH' + x.chr
    else:
        return 'NO_CNLOH'  + x.chr 
    
def condition_GAIN(x):
    GAIN = ['GAIN', 'AMP', 'AMP (BALANCED)', 'LOSS & GAIN', 'CNLOH & GAIN', 'CNLOH BEFORE & GAIN']
    if x.cn_state in GAIN:
        return 'GAIN'+ x.chr
    else:
        return 'NO_GAIN'+ x.chr

def condition_LOSS(x):
    LOSS = ['HETLOSS', 'LOSS BEFORE', 'LOSS AFTER', 'HOMDEL', 'LOSS BEFORE & AFTER', 'DOUBLE LOSS AFTER', 'LOSS & GAIN', 'CNLOH BEFORE & LOSS']
    if x.cn_state in LOSS:
        return 'LOSS'+ x.chr
    else:
        return 'NO_LOSS'+ x.chr


def compute_frac_genome(x, arm_level: pd.DataFrame):
    copy_number_state =  pd.read_pickle(data_path + 'merged_data/new_copy_number.pkl')
    
    lookup_table = arm_level[arm_level['sample'] == x.Sample_Id]
    lookup_table['chr'] = lookup_table.arm.str.extract('(\d+)')
    if float(copy_number_state[copy_number_state['Sample_Id'] == x.Sample_Id]['wgd']) == 1:
        lookup_table_altered = lookup_table[lookup_table['cn_state'] != 'TETRAPLOID'][lookup_table['chr'] != '17']
    else:
        lookup_table_altered = lookup_table[lookup_table['cn_state'] != 'DIPLOID'][lookup_table['chr'] != '17']
    altered_length = lookup_table_altered.cn_length.sum()
    total_length = lookup_table.arm_length.sum()
    
    frac_gen_altered = round(altered_length/total_length,3)
    
    return frac_gen_altered

# Here is the function that allws to compute genome instability columns
def chr_computations(x):
    arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    copy_number_state =  pd.read_pickle(data_path + 'merged_data/new_copy_number.pkl')
    CNLOH = ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH AFTER', 'CNLOH BEFORE', 'CNLOH & GAIN', 'CNLOH BEFORE & GAIN', 'AMP (LOH)']
    LOSS = ['HETLOSS', 'LOSS BEFORE', 'LOSS AFTER', 'HOMDEL', 'LOSS BEFORE & AFTER', 'DOUBLE LOSS AFTER', 'LOSS & GAIN', 'CNLOH BEFORE & LOSS']
    GAIN = ['GAIN', 'AMP', 'AMP (BALANCED)', 'LOSS & GAIN', 'CNLOH & GAIN', 'CNLOH BEFORE & GAIN']
    arm_level_samples = list(set(arm_level['sample']))
    
    if x.Sample_Id not in arm_level_samples:
        return ['NaN','NaN','NaN','NaN', 'NaN']
    
    lookup_table = arm_level[arm_level['sample'] == x.Sample_Id]
    lookup_table['chr'] = lookup_table.arm.str.extract('(\d+)')
    if float(copy_number_state[copy_number_state['Sample_Id'] == x.Sample_Id]['wgd']) == 1:
        lookup_table = lookup_table[lookup_table['cn_state'] != 'TETRAPLOID'][lookup_table['chr'] != '17']
    else:
        lookup_table = lookup_table[lookup_table['cn_state'] != 'DIPLOID'][lookup_table['chr'] != '17']
    lookup_table['state_chr'] = lookup_table['cn_state']+lookup_table['chr']
    
    # If only DIPLOID or TETRAPLOID
    if lookup_table.empty:
        return [float(0)]*5
    
    lookup_table['cnloh_chr'] = lookup_table.apply(condition_CNLOH, axis=1)
    lookup_table['loss_chr'] = lookup_table.apply(condition_LOSS, axis=1)
    lookup_table['gain_chr'] = lookup_table.apply(condition_GAIN, axis=1)

    #chr_affected colum
    lookup_table_chr = lookup_table.drop_duplicates(subset=['chr'])
    chr_affected = len(lookup_table_chr)
    
    #chr_loss, chr_gain, chr_cnloh columns
    lookup_table_cnloh = lookup_table.drop_duplicates(subset=['cnloh_chr'])['cnloh_chr']
    lookup_table_loss = lookup_table.drop_duplicates(subset=['loss_chr'])['loss_chr']
    lookup_table_gain = lookup_table.drop_duplicates(subset=['gain_chr'])['gain_chr']

    chr_loss = len(lookup_table_loss[lookup_table_loss.str.startswith('LOSS')])
    chr_gain = len(lookup_table_gain[lookup_table_gain.str.startswith('GAIN')])
    chr_cnloh = len(lookup_table_cnloh[lookup_table_cnloh.str.startswith('CNLOH')])
    
    #frac_gen_altered column
    frac_gen_altered = compute_frac_genome(x, arm_level)
    
    return [chr_affected, chr_loss, chr_gain, chr_cnloh, frac_gen_altered]

In [None]:
from tqdm import tqdm,tqdm_notebook

def compute_genome_instability():
    tqdm_notebook().pandas()
    sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
    sample_info = sample_info
    sample_info['chr_comput'] = sample_info.progress_apply(chr_computations, axis=1)
    print('checkpoint 1')
    sample_info[['chr_affected', 'chr_loss', 'chr_gain', 'chr_cnloh', 'frac_genome_altered']] = pd.DataFrame(sample_info.chr_comput.values.tolist(), index= sample_info.index)
    print('checkpoint 2')
    
    return sample_info

sample_info = compute_genome_instability()
sample_info[['Sample_Id','chr_affected', 'chr_loss', 'chr_gain', 'chr_cnloh', 'frac_genome_altered']].to_pickle(data_path + 'merged_data/chr_metrics_new_new.pkl')

HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=29428.0), HTML(value='')))

---
## 4. TP53 Computed Metrics

In this part, we compute mainly 4 metrics:

* mutation_count (*create_mut_count*) --> It is the total mutation count per sample
* gene_count (*create_gene_count*)--> It is the number of mutated genes per sample
* max_vaf --> It is the maximum Variant Allele Frequency within all the mutations of a sample
* exp_nb_1 (2,3,4,5) --> It is the expected number of copies of tp53 mutations in a cell 


In [62]:
def create_gene_count(maf_cohort):
    '''
    This function create the count of genes mutated for each sample.
    Arguments:
        - maf_cohort: the maf_cohort file located in data/merged/data
    '''
    
    # First we create the gene_count table by groupbying and sizing, we then change the index
    selected_cohort = maf_cohort[['Sample_Id','Tumor_Id', 'Gene_Id']]
    gene_count = pd.DataFrame(pd.DataFrame(selected_cohort[['Sample_Id', 'Gene_Id']].groupby(['Sample_Id', 'Gene_Id']).size(), columns = ['count']).groupby(['Sample_Id']).size(), columns = ['gene_count'])
    gene_count = gene_count.reset_index()

    # We add missing patients to the gene_count to have all the cohort
    no_gene_id = selected_cohort.Gene_Id.isna()
    no_gene_samples = set(selected_cohort[selected_cohort.index.isin(list(no_gene_id[no_gene_id == True].index))]['Sample_Id'])
    missing_samp = pd.DataFrame(no_gene_samples, columns = ['Sample_Id'])

    # We append the two dataframes
    gene_count = gene_count.append(missing_samp)
    
    #Fillna with 0
    gene_count = gene_count.fillna(0)

    return gene_count

def create_mut_count(maf_cohort):
    '''
    This function computes the dataframe of all mutation count per sample.
    '''
    selected_cohort = maf_cohort[['Sample_Id','Tumor_Id', 'Gene_Id']]
    mut_count = get_groupby(selected_cohort, 'Sample_Id', 'mutation_count')
    
    return mut_count


def get_driver_count(x, maf_cohort):
    lookup_table = maf_cohort[maf_cohort['Sample_Id'] == x.Sample_Id]
    h = get_groupby(lookup_table, 'oncogenic', 'count')
    count = (int(h.loc['Oncogenic']) if 'Oncogenic' in h.index else 0) + (int(h.loc['Likely Oncogenic']) if 'Likely Oncogenic' in h.index else 0) +(int(h.loc['Predicted Oncogenic']) if 'Predicted Oncogenic' in h.index else 0)
    return count

def create_driver_count(maf_cohort):
    '''In this function we count the number of mutation driver per sample.'''
    samples=list(set(maf_cohort.Sample_Id))
    driver_count = pd.DataFrame(columns=['Sample_Id', 'driver_count'])
    driver_count.Sample_Id = samples
    driver_count['driver_count'] = driver_count.apply(get_driver_count, maf_cohort=maf_cohort, axis=1)

    return driver_count
    

# The following function needs to be called on the complete master file because it needs info from different parts
def create_copies_tp53_muts(master):
    master['tp53_exp_nb_1'] = master.apply(lambda x:(x.tp53_vaf_1 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_2'] = master.apply(lambda x:(x.tp53_vaf_2 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_3'] = master.apply(lambda x:(x.tp53_vaf_3 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_4'] = master.apply(lambda x:(x.tp53_vaf_4 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    master['tp53_exp_nb_5'] = master.apply(lambda x:(x.tp53_vaf_5 / x.purity) * (x.tp53_tcn * x.purity + 2*(1 - x.purity)), axis = 1)
    
    return master


def vc_group_cond_1(x):
    truncated = ['Splice_Site','Intron','Nonsense_Mutation','Splice_Region','Frame_Shift_Del','Frame_Shift_Ins']
    in_frame = ['In_Frame_Ins','In_Frame_Del']
    missense = ['Missense_Mutation']
    
    if x.tp53_vc_1 in truncated: return 'truncated'
    if x.tp53_vc_1 in in_frame: return 'in_frame'
    if x.tp53_vc_1 in missense: 
        if x.tp53_spot_1 in ['273','248','175']: return x.tp53_spot_1
        elif x.tp53_spot_1 in ['245', '282', '213', '352', '220', '196']: return 'hotspot'
        else: return 'missense'
def vc_group_cond_2(x):
    truncated = ['Splice_Site','Intron','Nonsense_Mutation','Splice_Region','Frame_Shift_Del','Frame_Shift_Ins']
    in_frame = ['In_Frame_Ins','In_Frame_Del']
    missense = ['Missense_Mutation']
    
    if x.tp53_vc_2 in truncated: return 'truncated'
    if x.tp53_vc_2 in in_frame: return 'in_frame'
    if x.tp53_vc_2 in missense: 
        if x.tp53_spot_2 in ['273','248','175']: return x.tp53_spot_2
        elif x.tp53_spot_2 in['245', '282', '213', '352', '220', '196']: return 'hotspot'
        else: return 'missense'   
def vc_group_cond_3(x):
    truncated = ['Splice_Site','Intron','Nonsense_Mutation','Splice_Region','Frame_Shift_Del','Frame_Shift_Ins']
    in_frame = ['In_Frame_Ins','In_Frame_Del']
    missense = ['Missense_Mutation']
    
    if x.tp53_vc_3 in truncated: return 'truncated'
    if x.tp53_vc_3 in in_frame: return 'in_frame'
    if x.tp53_vc_3 in missense: 
        if x.tp53_spot_3 in ['273','248','175']: return x.tp53_spot_3
        elif x.tp53_spot_3 in ['245', '282', '213', '352', '220', '196']: return 'hotspot'
        else: return 'missense' 
def vc_group_cond_4(x):
    truncated = ['Splice_Site','Intron','Nonsense_Mutation','Splice_Region','Frame_Shift_Del','Frame_Shift_Ins']
    in_frame = ['In_Frame_Ins','In_Frame_Del']
    missense = ['Missense_Mutation']
    
    if x.tp53_vc_4 in truncated: return 'truncated'
    if x.tp53_vc_4 in in_frame: return 'in_frame'
    if x.tp53_vc_4 in missense: 
        if x.tp53_spot_4 in ['273','248','175']: return x.tp53_spot_4
        elif x.tp53_spot_4 in ['245', '282', '213', '352', '220', '196']: return 'hotspot'
        else: return 'missense'
def vc_group_cond_5(x):
    truncated = ['Splice_Site','Intron','Nonsense_Mutation','Splice_Region','Frame_Shift_Del','Frame_Shift_Ins']
    in_frame = ['In_Frame_Ins','In_Frame_Del']
    missense = ['Missense_Mutation']
    
    if x.tp53_vc_5 in truncated: return 'truncated'
    if x.tp53_vc_5 in in_frame: return 'in_frame'
    if x.tp53_vc_5 in missense: 
        if x.tp53_spot_5 in ['273','248','175']: return x.tp53_spot_5
        elif x.tp53_spot_5 in ['245', '282', '213', '352', '220', '196']: return 'hotspot'
        else: return 'missense'


In [63]:
def get_driver_count(x, maf_cohort):
    lookup_table = maf_cohort[maf_cohort['Sample_Id'] == x.Sample_Id]
    h = get_groupby(lookup_table, 'oncogenic', 'count')
    count = (int(h.loc['Oncogenic']) if 'Oncogenic' in h.index else 0) + (int(h.loc['Likely Oncogenic']) if 'Likely Oncogenic' in h.index else 0) +(int(h.loc['Predicted Oncogenic']) if 'Predicted Oncogenic' in h.index else 0)
    return count

'''maf_cohort = pd.read_pickle(data_path + 'merged_data/maf_cohort.pkl')
selected_cohort = maf_cohort[['Sample_Id','oncogenic']]
samples=list(set(selected_cohort.Sample_Id))
driver_count = pd.DataFrame(columns=['Sample_Id', 'driver_count'])
driver_count.Sample_Id = samples
driver_count['driver_count'] = driver_count.apply(get_driver_count, maf_cohort=maf_cohort, axis=1)'''

"maf_cohort = pd.read_pickle(data_path + 'merged_data/maf_cohort.pkl')\nselected_cohort = maf_cohort[['Sample_Id','oncogenic']]\nsamples=list(set(selected_cohort.Sample_Id))\ndriver_count = pd.DataFrame(columns=['Sample_Id', 'driver_count'])\ndriver_count.Sample_Id = samples\ndriver_count['driver_count'] = driver_count.apply(get_driver_count, maf_cohort=maf_cohort, axis=1)"

In [20]:
maf_cohort = pd.read_pickle(data_path + 'merged_data/maf_cohort.pkl')
h = create_driver_count(maf_cohort)
display(h)

Unnamed: 0_level_0,driver_count
Sample_Id,Unnamed: 1_level_1
P-0000004-T01-IM3_P-0000004-N01-IM3,4
P-0000012-T02-IM3_P-0000012-N01-IM3,1
P-0000024-T01-IM3_P-0000024-N01-IM3,6
P-0000025-T02-IM5_P-0000025-N01-IM5,2
P-0000026-T01-IM3_P-0000026-N01-IM3,4
...,...
P-0050745-T01-IM6_P-0050745-N01-IM6,7
P-0050746-T01-IM6_P-0050746-N01-IM6,89
P-0050747-T01-IM6_P-0050747-N01-IM6,6
P-0050748-T01-IM6_P-0050748-N01-IM6,2


In [64]:
def create_computed_metrics(path):
    # We add mutation_count and max_vaf
    maf_cohort = pd.read_pickle(path)

    # MUTATION COUNT
    #We create the table for mutation_count
    mut_count = create_mut_count(maf_cohort)
    
    # We create the table for gene_count
    gene_count = create_gene_count(maf_cohort)
    
    # We create the table for the driver count per sample
    driver_count = create_driver_count(maf_cohort)

    # MAX_VAF
    # To do so, we groupby Tumor_Id and apply the max() function
    # But first we need to transform None values in Nan to compute the max
    maf_cohort['vaf'].replace('None', np.nan, inplace=True)
    max_vaf = maf_cohort[['Sample_Id','vaf']].groupby(['Sample_Id']).max()
    max_vaf = max_vaf.rename(columns={'vaf': 'max_vaf'})
    
    # Merge the tables
    computed_metrics = pd.merge(mut_count, gene_count, on=['Sample_Id'])
    computed_metrics = pd.merge(computed_metrics, driver_count, on=['Sample_Id'])
    computed_metrics = pd.merge(computed_metrics, max_vaf, on=['Sample_Id'])
    
    
    return computed_metrics

In [65]:
computed_metrics = create_computed_metrics(data_path + 'merged_data/maf_cohort.pkl')

In [67]:
computed_metrics

Unnamed: 0,Sample_Id,mutation_count,gene_count,driver_count,max_vaf
0,P-0000004-T01-IM3_P-0000004-N01-IM3,4,4.0,2,0.547085
1,P-0000012-T02-IM3_P-0000012-N01-IM3,1,1.0,0,0.502203
2,P-0000024-T01-IM3_P-0000024-N01-IM3,6,5.0,2,0.368683
3,P-0000025-T02-IM5_P-0000025-N01-IM5,2,2.0,2,0.203236
4,P-0000026-T01-IM3_P-0000026-N01-IM3,4,4.0,0,0.590164
...,...,...,...,...,...
28353,P-0050745-T01-IM6_P-0050745-N01-IM6,6,6.0,1,0.451400
28354,P-0050746-T01-IM6_P-0050746-N01-IM6,88,68.0,24,0.943925
28355,P-0050747-T01-IM6_P-0050747-N01-IM6,6,6.0,1,0.270471
28356,P-0050748-T01-IM6_P-0050748-N01-IM6,2,2.0,2,0.126382


## 5. Subgroup Columns Creation

#### tp53_group
First, we group the different COpy Number States *cn_state* in subgroups, under the column *cn_group*:
* Group 1: cnLOH gathering ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH BEFORE', 'CNLOH BEFORE & GAIN']
* Group 2: LOSS gathering ['LOSS BEFORE', 'HETLOSS', 'LOSS BEFORE & AFTER']
* Group 3: HOMDEL gathering ['HOMDEL']
* Group 4: DOUBLE LOSS AFTER gathering ['DOUBLE LOSS AFTER']
* Group 5: WILD_TYPE gathering ['LOSS AFTER', 'DIPLOID', 'TETRAPLOID']
* Group 6: GAIN gathering ['GAIN']
* Group 7: OTHER gathering ['CNLOH AFTER', 'AMP (BALANCED)', 'AMP (LOH)', 'AMP','LOSS & GAIN', 'CNLOH & GAIN']


Based on this first column we define 7 final groups of patients adding the mutational information. These groups will be under the column *mut_cn_group*.
* Group 1: Samples with 0 tp53 mutations and HETLOSS
* Group 2: Samples with HOMDEL
* Group 3: Samples with 1 tp53 mutation and WILD_TYPE (DIPLOID, LOSS AFTER, TETRAPLOID)
* Group 4: Samples with 1 tp53 mutation or more and LOSS
* Group 5: Samples with 1 tp53 mutation or more and cnLOH
* Group 6: Samples with 2/3/4/5 tp53 mutations and WILD_TYPE or GAIN

We define the columns thanks to 2 functions that we call in the **Merge Tables** part through *create_master* function.

In [68]:
def cn_group_cond(x):
    if x.tp53_cn_state in ['CNLOH', 'CNLOH BEFORE & LOSS', 'CNLOH BEFORE', 'CNLOH BEFORE & GAIN', 'LOSS BEFORE']:
        return 'cnLOH'
    if x.tp53_cn_state in ['HETLOSS', 'LOSS BEFORE & AFTER']:
        return 'LOSS'
    if x.tp53_cn_state == 'HOMDEL':
        return 'HOMDEL'
    if x.tp53_cn_state in ['LOSS AFTER', 'DIPLOID', 'TETRAPLOID']:
        return 'WILD_TYPE'
    if x.tp53_cn_state == 'DOUBLE LOSS AFTER':
        return 'DOUBLE LOSS AFTER'
    if x.tp53_cn_state == 'GAIN':
        return 'GAIN'
    if x.tp53_cn_state in ['CNLOH AFTER', 'AMP (BALANCED)', 'AMP (LOH)', 'AMP','LOSS & GAIN', 'CNLOH & GAIN']:
        return 'OTHER'

def mut_cn_group_cond(x):
    if x.tp53_cn_state == 'HETLOSS' and x.tp53_count == 0:
        return '0_HETLOSS'
    if x.tp53_first_group == 'HOMDEL':
        return 'HOMDEL'
    if x.tp53_first_group == 'WILD_TYPE' and x.tp53_count == 1 :
        return '1_WILD_TYPE'
    if x.tp53_first_group == 'LOSS' and x.tp53_count >=1:
        return '>=1_LOSS'
    if x.tp53_first_group == 'cnLOH' and x.tp53_count >=1:
        return '>=1_cnLOH'
    if (x.tp53_first_group == 'WILD_TYPE' or x.tp53_first_group == 'DOUBLE LOSS AFTER' or x.tp53_first_group == 'GAIN') and x.tp53_count > 1:
        return '>1muts'

#### tp53_residual subgroups

In [69]:
def tp53_residual_group(x):
    if x.tp53_group == '1_WILD_TYPE' or x.tp53_group == '0_HETLOSS':
        return 'tp53_res'
    if x.tp53_group == 'HOMDEL':
        return 'no_tp53_res'
    if x.tp53_group == '>=1_LOSS' or x.tp53_group == '>=1_cnLOH':
        if (x.tp53_residual_1 < 0.5) or (x.tp53_residual_2 < 0.5):
            return 'no_tp53_res'
        else:
            if (x.cf + max(x.tp53_ccf_1, x.tp53_ccf_2, x.tp53_ccf_3, x.tp53_ccf_4, x.tp53_ccf_5)) > 1:
                return 'no_tp53_res'
            else:
                return 'uncertain'
    if x.tp53_group == '>1muts':
        if (x.tp53_residual_1 + x.tp53_residual_2 < 2.5):
            if (x.tp53_ccf_1 + x.tp53_ccf_2 > 1):
                return 'no_tp53_res'
            else: 
                return 'uncertain'
           
        elif (x.tp53_residual_1 + x.tp53_residual_2 > 2.5):
            return 'tp53_res'

---
## Merge Tables

In [84]:
sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
sample_info

Unnamed: 0,Sample_Id,Tumor_Id,Patient_Id,Cancer_Type,Cancer_Type_Detailed,Sample_Type,purity,ploidy,samples_per_patient,Overall Survival Status,Overall Survival (Months),MSI Score,TMB_Score,Patient_Current_Age,Sex,Ethnicity_Category,Race_Category,mutationStatus,Somatic_Status
0,P-0034223-T01-IM6_P-0034223-N01-IM6,P-0034223-T01-IM6,P-0034223,Breast Cancer,Invasive Breast Carcinoma,Metastasis,0.941111,2.241830,1.0,LIVING,,0.55,5.3,63.0,Female,,NO VALUE ENTERED,SOMATIC,Matched
6,P-0009819-T01-IM5_P-0009819-N01-IM5,P-0009819-T01-IM5,P-0009819,Prostate Cancer,Prostate Adenocarcinoma,Primary,0.275237,2.681075,1.0,LIVING,23.441,0.00,1.0,72.0,Male,Non-Spanish; Non-Hispanic,NO VALUE ENTERED,SOMATIC,Matched
9,P-0025956-T01-IM6_P-0025956-N01-IM6,P-0025956-T01-IM6,P-0025956,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Primary,0.185874,3.496971,1.0,DECEASED,3.584,0.00,5.3,71.0,Female,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
15,P-0027408-T01-IM6_P-0027408-N01-IM6,P-0027408-T01-IM6,P-0027408,Non-Small Cell Lung Cancer,Non-Small Cell Lung Cancer,Metastasis,0.308886,1.811066,1.0,LIVING,22.586,0.27,17.6,67.0,Female,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
38,P-0006554-T01-IM5_P-0006554-N01-IM5,P-0006554-T01-IM5,P-0006554,Glioma,Anaplastic Oligodendroglioma,Primary,0.715208,1.910719,1.0,LIVING,26.170,1.30,46.2,55.0,Female,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286942,P-0025736-T03-IM6_P-0025736-N01-IM6,P-0025736-T03-IM6,P-0025736,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Metastasis,,2.000000,2.0,LIVING,25.348,0.00,1.8,62.0,Female,Non-Spanish; Non-Hispanic,BLACK OR AFRICAN AMERICAN,SOMATIC,Matched
286944,P-0026308-T01-IM6_P-0026308-N01-IM6,P-0026308-T01-IM6,P-0026308,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Metastasis,0.300000,2.399804,1.0,LIVING,21.271,0.00,2.6,81.0,Male,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
286947,P-0050500-T01-IM6_P-0050500-N01-IM6,P-0050500-T01-IM6,P-0050500,Bladder Cancer,Bladder Urothelial Carcinoma,Primary,,2.000000,1.0,LIVING,0.690,0.14,33.4,89.0,Male,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched
286987,P-0050657-T01-IM6_P-0050657-N01-IM6,P-0050657-T01-IM6,P-0050657,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,Metastasis,,2.000000,1.0,LIVING,2.005,0.05,2.6,82.0,Male,Non-Spanish; Non-Hispanic,WHITE,SOMATIC,Matched


In [80]:
def create_master():
    '''
    This function creates the tables and merges them.
    '''
    sample_info = create_sample_info(data_path + 'merged_data/maf_cohort.pkl')
    tp53_muts = create_tp53_muts(sample_info, data_path + 'merged_data/maf_tp53.pkl')
    
    # We load the copy number table if it is already stored 
    file = Path(data_path + 'merged_data/new_copy_number.pkl')
    if file.is_file():
        copy_number_state =  pd.read_pickle(data_path + 'merged_data/new_copy_number.pkl')
        copy_number_state = copy_number_state.rename(columns={'tcn': 'tp53_tcn',
                                                          'mcn': 'tp53_mcn',
                                                          'lcn': 'tp53_lcn',
                                                          'seg_length': 'tp53_seg_length',
                                                          'cn_state':'tp53_cn_state'})
    else:
        copy_number_state = create_copy_number_state(sample_info, data_path + 'impact-facets-tp53/raw/default_qc_pass.gene_level_TP53.txt')
    computed_metrics = create_computed_metrics(data_path + 'merged_data/maf_cohort.pkl')
    
    # We first merge sample_info and tp53_muts because they have the same list of keys
    master_file = pd.merge(sample_info, tp53_muts, on=['Tumor_Id'])
    #For copy_number_state we have to do a right join because it contains less Tumor_Ids
    master_file = pd.merge(master_file, copy_number_state, on=['Sample_Id'])
    # Finally we merge the computedmetrics table
    master_file = pd.merge(master_file, computed_metrics, on=['Sample_Id'])
    
    # We filter out the sample duplicates
    master_file = master_file.drop_duplicates('Sample_Id')
    
    # At this step we need to remove samples that comes from same tumor but different normal sample
    # BUT this step makes us loose important clinical information for some samples
    # So we wil spread the Patient_Id and Cancer_Type by front and backpropagating the non-NaN values
    master_file['Patient_Id'] = master_file.Tumor_Id.str[:9]
    master_file['Cancer_Type'] = master_file[['Patient_Id','Cancer_Type']].groupby(['Patient_Id']).bfill().ffill()
    
    # Then we filter out samples with the same Tumor_Id but different Sample_Id with a filter function
    master_file = normal_samp_duplicates_filter(master_file, 'Sample_Id', 'purity')
    master_file = normal_samp_duplicates_filter(master_file, 'Sample_Id', 'purity')
    
    #We compute the expected number of copies of tp53 mutations
    master_file = create_copies_tp53_muts(master_file)
    
    master_file['tp53_residual_1'] = master_file['tp53_tcn'] - master_file['tp53_exp_nb_1']
    master_file['tp53_residual_2'] = master_file['tp53_tcn'] - master_file['tp53_exp_nb_2']
    master_file['tp53_residual_3'] = master_file['tp53_tcn'] - master_file['tp53_exp_nb_3']
    master_file['tp53_residual_4'] = master_file['tp53_tcn'] - master_file['tp53_exp_nb_4']
    master_file['tp53_residual_5'] = master_file['tp53_tcn'] - master_file['tp53_exp_nb_5']

    #Finally we add the subgroup columns defined in Part 5
    master_file['tp53_first_group'] = master_file.apply(cn_group_cond, axis = 1)
    master_file['tp53_group'] = master_file.apply(mut_cn_group_cond, axis = 1)
    master_file['tp53_res_group'] = master_file.apply(tp53_residual_group, axis = 1)
    
    # We add Genome Instability columns
    # Genome Instability columns computed from arm_level file
    arm_level = pd.read_csv(data_path + 'impact-facets-tp53/raw/default_qc_pass.arm_level.txt', sep='\t')
    arm_level['chr'] = arm_level.arm.str.extract('(\d+)')
    print('checkpoint_1')
    #master_file['chr_comput'] = master_file.apply(chr_computations, axis=1)
    print('checkpoint_2')
    chr_metrics =  pd.read_pickle(data_path + 'merged_data/chr_metrics_new.pkl')
    master_file = pd.merge(master_file, chr_metrics, on=['Sample_Id'])
    
    print('checkpoint_3')
    
    # Grouping the Variant Classificationb into 3 Classes
    master_file['tp53_vc_group_1'] = master_file.apply(vc_group_cond_1, axis = 1)
    master_file['tp53_vc_group_2'] = master_file.apply(vc_group_cond_2, axis = 1)
    master_file['tp53_vc_group_3'] = master_file.apply(vc_group_cond_3, axis = 1)
    master_file['tp53_vc_group_4'] = master_file.apply(vc_group_cond_4, axis = 1)
    master_file['tp53_vc_group_5'] = master_file.apply(vc_group_cond_5, axis = 1)
    return master_file

In [81]:
%%time
master_file = create_master()

checkpoint_1
checkpoint_2
checkpoint_3
CPU times: user 12min 18s, sys: 10.7 s, total: 12min 28s
Wall time: 12min 34s


In [82]:
get_groupby(master_file, 'tp53_group', 'count').sort_values(by='count', ascending=False)

Unnamed: 0_level_0,count
tp53_group,Unnamed: 1_level_1
>=1_cnLOH,5026
>=1_LOSS,3956
0_HETLOSS,2780
1_WILD_TYPE,1336
>1muts,827
HOMDEL,287


In [83]:
master_file.shape[0]

28861

In [59]:

get_groupby(master_file, 'tp53_group', 'count').sort_values(by='count', ascending=False)

NameError: name 'master_file' is not defined

In [41]:
print('Number of samples: ' + str(len(set(master_file.Tumor_Id))))
print('Number of patients: ' + str(len(set(master_file.Patient_Id))))
print('Number of tp53 positive samples: ' + str( len(set(master_file.Tumor_Id)) - master_file.tp53_key_1.isna().sum()))
print('Number of tp53 positive patients: ' + str( len(set(master_file.Patient_Id)) - master_file.drop_duplicates('Patient_Id').tp53_key_1.isna().sum()))
print('Number of samples with missing wgd : ' + str(master_file.wgd.isna().sum()))
print('Number of samples with missing cf : ' + str(master_file.cf.isna().sum()))
print('Number of samples with missing max_vaf : ' + str(master_file.max_vaf.isna().sum()))
print('Number of samples with missing Cn state: ' + str(master_file.tp53_cn_state.isna().sum()))
print('Number of samples with missing Sample_Type : ' + str(master_file.Sample_Type.isna().sum()))

Number of samples: 29259
Number of patients: 27021
Number of tp53 positive samples: 12731
Number of tp53 positive patients: 11885
Number of samples with missing wgd : 2092
Number of samples with missing cf : 1746
Number of samples with missing max_vaf : 1421
Number of samples with missing Cn state: 517
Number of samples with missing Sample_Type : 104


In [22]:
# Saving to pickle File
master_file.to_pickle(data_path + 'merged_data/master_file.pkl')

In [70]:
master = load_clean_up_master(data_path + 'merged_data/master_file.pkl')
len(list(master[master['tp53_count']>=1]['Sample_Id']))

11527

In [44]:
max(1,2,3)

3

In [73]:
master_file.purity.describe()

count    29202.000000
mean         0.495941
std          0.207579
min          0.018483
25%          0.319273
50%          0.469923
75%          0.658398
max          0.989434
Name: purity, dtype: float64

In [3]:
master = pd.read_pickle(data_path + 'merged_data/master_file.pkl')

In [60]:
get_groupby(master, 'tp53_group', 'count').sort_values(by='count', ascending=False)

Unnamed: 0_level_0,count
tp53_group,Unnamed: 1_level_1
>=1_cnLOH,5026
>=1_LOSS,3956
0_HETLOSS,2779
1_WILD_TYPE,1457
>1muts,697
HOMDEL,287


In [75]:
master_file.shape[0]

27792

In [76]:
master = pd.read_pickle(data_path + 'merged_data/master_file.pkl')
master.shape[0]

28858