This notebook creates the inputs for deconstructSigs and documents each of the runs performed for the project.

The code relies on a workspace directory structure such as:
```
cohort/
	patientID/
		DxTumorID_vs_normalID/
		ReTumorID_vs_normalID/ (sometimes)

```
patientID, DxTumorID, etc. can be found in ../ext_files/all_cohort_clinical_groups.tsv

After running the scripts in ```filter/```, the filtered MAFs with clonal classification and joined mutations are named ```TumorID_vs_normalID + _strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf```. The code below uses this file name.
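
The layout and naming convention described above can be combined into a single path. The following is a minimal sketch: `maf_path` and the example IDs are hypothetical and not part of the project code; only the directory layout and the file suffix come from the description above.

```python
import os

# Suffix of the filtered MAF files, as stated above.
MAF_SUFFIX = "_strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf"


def maf_path(cohort_dir, patient, comparison):
    """Build cohort/patientID/TumorID_vs_normalID/<comparison><suffix>.

    Hypothetical helper, for illustration only.
    """
    return os.path.join(cohort_dir, patient, comparison, comparison + MAF_SUFFIX)


# Made-up IDs, only to show the shape of the resulting path.
example = maf_path("cohort", "PAT1", "TUMOR1_vs_NORMAL1")
print(example)
```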

In [None]:
import os
import pandas as pd
import glob
import numpy as np
from aux_functions import stage_mapping,get_context_rev,add_pyrimidine_type
from aux_data_in_pyvar import PATS_DIRS

In [None]:
clinical = pd.read_excel("Additional file 2.xlsx", sheet_name='Table S1', skiprows=[0,1]) #Additional file 2 Table S1
df_clinical = stage_mapping(clinical)
df_clinical.tail()

### ALL primary

The results of the fitting of this run were used in Figure 1c and Additional file 1 Figure S1. 

The inputs for the pediatric cohorts were created in a similar way. Each cohort had its own run for the primary samples, and the resulting signature weights were used to build Figure 1c.
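
As an illustration of what the pooled input looks like, here is a minimal sketch on invented data. The two toy tables and the patient labels are assumptions; the column names and the `mut_type == 'snv'` filter mirror the cell below.

```python
import pandas as pd

# Toy stand-ins for two per-sample MAF tables (all values are invented).
maf_a = pd.DataFrame({"#CHROM": ["1", "2"], "POS": [100, 200], "REF": ["C", "AT"],
                      "ALT": ["T", "A"], "mut_type": ["snv", "indel"]})
maf_b = pd.DataFrame({"#CHROM": ["3"], "POS": [300], "REF": ["G"],
                      "ALT": ["A"], "mut_type": ["snv"]})

cols = ["#CHROM", "POS", "REF", "ALT", "PATIENT"]
frames = []
for patient, maf in [("PAT_A", maf_a), ("PAT_B", maf_b)]:
    snvs = maf[maf["mut_type"] == "snv"].copy()  # keep single-nucleotide variants only
    snvs["PATIENT"] = patient                     # label each row with its patient
    frames.append(snvs[cols])

# One row per SNV, five columns, all patients stacked.
simple_input = pd.concat(frames, ignore_index=True)
print(simple_input)
```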

In [None]:
frames = []  # one table of SNVs per primary sample

for com in df_clinical[df_clinical['STAGE'] == 'primary']['COMPARISON'].tolist():
    
    pat = df_clinical[df_clinical['COMPARISON'] == com]['PATIENT'].tolist()[0]
    
    if pat not in ('PAT3', 'PAT4'):
        print(pat)
        dire_in = PATS_DIRS[pat]

        df_pry = pd.read_table(os.path.join(dire_in, pat, com,
                            com+'_strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf'),
                            sep='\t', low_memory=False)
        # .copy() avoids pandas' SettingWithCopyWarning when adding the column
        df_pry = df_pry[df_pry['mut_type'] == 'snv'].copy()
        df_pry['PATIENT'] = pat

        frames.append(df_pry[['#CHROM', 'POS', 'REF', 'ALT', 'PATIENT']])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
dff_pry = pd.concat(frames, ignore_index=True)

In [None]:
out_path = "" # path to the run of DeconstructSigs

# Inside out_path, create a folder for this run and its input; run_all_primary/ is used here
dff_pry.to_csv(os.path.join(out_path,"run_all_primary/simple_input.tsv"),sep='\t', index=False)

For this input deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_all_primary/simple_input.tsv -o run_all_primary -s ../../ext_files/run_deconstructSig/signatures.pcawg -t ALL_primary_subset ```

### PRIMARY AND RELAPSE

We created one input per sample type (all primary and all relapse). Both inputs were used in three different runs.

First, we fitted the PCAWG signatures plus the treatment signatures (see ```../../ext_files/run_deconstructSig/README.txt```).
The results of this fit were used in Additional file 1 Figure S6.

The run folder is ```run_samples_treatment/```

Second, we fitted the PCAWG signatures plus the HSCP signature (see ```../../ext_files/run_deconstructSig/README.txt```).
The results of this fit were used in Additional file 1 Figure S7.

The run folder is ```run_samples_hemato/```

Third, we fitted the PCAWG signatures alone (see ```../../ext_files/run_deconstructSig/README.txt```).
The results of this fit were used in Additional file 1 Figure S7.

The run folder is ```run_samples/```


In [None]:
frames_pry = []  # SNVs of the primary samples, one table per patient
frames_rel = []  # SNVs of the relapse samples, one table per patient

for pat in df_clinical['PATIENT'].unique().tolist():
    pry_com = df_clinical[(df_clinical['PATIENT'] == pat) & (df_clinical['STAGE'] == 'primary')]['COMPARISON'].tolist()[0]
    rel_com = df_clinical[(df_clinical['PATIENT'] == pat) & (df_clinical['STAGE'] == 'relapse')]['COMPARISON'].tolist()[0]
    
    if pat not in ('PAT3', 'PAT4'):
        print(pat)
        dire_in = PATS_DIRS[pat]
            
        df_pry = pd.read_table(os.path.join(dire_in, pat, pry_com,
                            pry_com+'_strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf'),
                            sep='\t', low_memory=False)
        df_rel = pd.read_table(os.path.join(dire_in, pat, rel_com,
                            rel_com+'_strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf'),
                            sep='\t', low_memory=False)
        # .copy() avoids pandas' SettingWithCopyWarning when adding the column
        df_pry = df_pry[df_pry['mut_type'] == 'snv'].copy()
        df_pry['PATIENT'] = pat
        
        df_rel = df_rel[df_rel['mut_type'] == 'snv'].copy()
        df_rel['PATIENT'] = pat

        frames_pry.append(df_pry[['#CHROM', 'POS', 'REF', 'ALT', 'PATIENT']])
        frames_rel.append(df_rel[['#CHROM', 'POS', 'REF', 'ALT', 'PATIENT']])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
dff_pry = pd.concat(frames_pry, ignore_index=True)
dff_rel = pd.concat(frames_rel, ignore_index=True)

In [None]:
dff_pry.to_csv(os.path.join(out_path,"run_samples_treatment/primary/simple_input.tsv"),
                                                                                 sep='\t', index=False)
dff_rel.to_csv(os.path.join(out_path,"run_samples_treatment/relapse/simple_input.tsv"),
                                                                                 sep='\t', index=False)

For the primary input, deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_samples_treatment/primary/simple_input.tsv -o run_samples_treatment/primary -s ../../ext_files/run_deconstructSig/leukemia_signatures.pcawg -t TALL_relapse_subset```

For the relapse input, deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_samples_treatment/relapse/simple_input.tsv -o run_samples_treatment/relapse -s ../../ext_files/run_deconstructSig/leukemia_signatures.pcawg -t TALL_relapse_subset```

In [None]:
dff_pry.to_csv(os.path.join(out_path,"run_samples_hemato/primary/simple_input.tsv"),
                                                                                  sep='\t', index=False)
dff_rel.to_csv(os.path.join(out_path,"run_samples_hemato/relapse/simple_input.tsv"),
                                                                                  sep='\t', index=False)

For the primary input, deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_samples_hemato/primary/simple_input.tsv -o run_samples_hemato/primary -s ../../ext_files/run_deconstructSig/signatures_added.pcawg -t TALL_HSCP_comparative```

For the relapse input, deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_samples_hemato/relapse/simple_input.tsv -o run_samples_hemato/relapse -s ../../ext_files/run_deconstructSig/signatures_added.pcawg -t TALL_HSCP_comparative```

In [None]:
dff_pry.to_csv(os.path.join(out_path,"run_samples/primary/simple_input.tsv"),
                                                                                  sep='\t', index=False)
dff_rel.to_csv(os.path.join(out_path,"run_samples/relapse/simple_input.tsv"),
                                                                                  sep='\t', index=False)

For the primary input, deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_samples/primary/simple_input.tsv -o run_samples/primary -s ../../ext_files/run_deconstructSig/signatures.pcawg -t TALL_subset```

For the relapse input, deconstructSigs was run:

```python assign_signature_to_mutation.py -i run_samples/relapse/simple_input.tsv -o run_samples/relapse -s ../../ext_files/run_deconstructSig/signatures.pcawg -t TALL_subset```
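
The six commands in this section follow one pattern: a run folder, a stage subfolder, a signature file and a `-t` value. The sketch below regenerates them from a small table copied from the commands above. `RUNS`, `SIG_DIR` and `build_commands` are hypothetical names introduced for illustration; the code only builds strings and does not execute anything.

```python
SIG_DIR = "../../ext_files/run_deconstructSig"

# run folder -> (signature file, value passed to -t), copied from the commands above
RUNS = {
    "run_samples_treatment": ("leukemia_signatures.pcawg", "TALL_relapse_subset"),
    "run_samples_hemato": ("signatures_added.pcawg", "TALL_HSCP_comparative"),
    "run_samples": ("signatures.pcawg", "TALL_subset"),
}


def build_commands(runs=RUNS, stages=("primary", "relapse")):
    """Return one command string per run folder and stage."""
    commands = []
    for run, (sig_file, tag) in runs.items():
        for stage in stages:
            out_dir = f"{run}/{stage}"
            commands.append(
                f"python assign_signature_to_mutation.py -i {out_dir}/simple_input.tsv "
                f"-o {out_dir} -s {SIG_DIR}/{sig_file} -t {tag}"
            )
    return commands


for cmd in build_commands():
    print(cmd)
```

Keeping the mapping in one place makes it easier to spot whether a run was given the intended signature file and `-t` value.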

### all subsets together

After finding no evidence of treatment signatures, we created a single input with all primary and relapse mutations, adding a column that labels each mutation as private to the primary, private to the relapse, or shared. The results of this fit were used in Figure 3a.

The run folder is ```run_subsets_together/```
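
The labelling comes down to set operations on the variant identifiers of the two samples. Here is a minimal sketch on invented data: the toy tables and variant names are assumptions, while the `Variant` column and the three `subset` labels match the cell below.

```python
import pandas as pd

# Toy variant tables for one patient (identifiers are invented).
primary = pd.DataFrame({"Variant": ["v1", "v2", "v3"]})
relapse = pd.DataFrame({"Variant": ["v2", "v3", "v4"]})

pry_set = set(primary["Variant"])
rel_set = set(relapse["Variant"])
shared = pry_set & rel_set  # variants present in both samples

# Shared rows are taken once, from the primary table, so they are not duplicated.
labelled = pd.concat(
    [
        primary[primary["Variant"].isin(shared)].assign(subset="shared"),
        primary[primary["Variant"].isin(pry_set - shared)].assign(subset="private_primary"),
        relapse[relapse["Variant"].isin(rel_set - shared)].assign(subset="private_relapse"),
    ],
    ignore_index=True,
)
print(labelled)
```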

In [None]:
frames = []  # one labelled table per patient

for pat in df_clinical['PATIENT'].unique().tolist():
    pry_com = df_clinical[(df_clinical['PATIENT'] == pat) & (df_clinical['STAGE'] == 'primary')]['COMPARISON'].tolist()[0]
    rel_com = df_clinical[(df_clinical['PATIENT'] == pat) & (df_clinical['STAGE'] == 'relapse')]['COMPARISON'].tolist()[0]
    
    if pat not in ('PAT3', 'PAT4'):
        print(pat)
        dire_in = PATS_DIRS[pat]
        df_pry = pd.read_table(os.path.join(dire_in, pat, pry_com,
                            pry_com+'_strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf'),
                            sep='\t', low_memory=False)
      
        df_rel = pd.read_table(os.path.join(dire_in, pat, rel_com,
                            rel_com+'_strelka_uniq_all_anno_vep92_categories_filt_snps_cluster.maf'),
                            sep='\t', low_memory=False)
        
        # Use a single name for this column in both tables
        if 'AF_less_001' in df_pry.columns:
            df_pry.rename(columns={'AF_less_001':'AF_less_0.01'}, inplace=True)
            df_rel.rename(columns={'AF_less_001':'AF_less_0.01'}, inplace=True)

        # .copy() avoids pandas' SettingWithCopyWarning when adding columns
        df_pry = df_pry[df_pry['mut_type'] == 'snv'].copy()
        df_pry['PATIENT'] = pat
        
        df_rel = df_rel[df_rel['mut_type'] == 'snv'].copy()
        df_rel['PATIENT'] = pat
        
        all_pry_variants = set(df_pry['Variant'].unique())
        all_rel_variants = set(df_rel['Variant'].unique())
    
        print(len(all_pry_variants))
        print(len(all_rel_variants))
        
        # Split variants into shared and private to each sample
        shared_variants = all_pry_variants.intersection(all_rel_variants)
        private_pry_variants = all_pry_variants.difference(shared_variants)
        private_rel_variants = all_rel_variants.difference(shared_variants) 

        # Shared rows are taken from the primary table only
        df_shared = df_pry[df_pry['Variant'].isin(shared_variants)].copy()
        df_private_pry = df_pry[df_pry['Variant'].isin(private_pry_variants)].copy()
        df_private_rel = df_rel[df_rel['Variant'].isin(private_rel_variants)].copy()

        df_shared['subset'] = 'shared'
        df_private_pry['subset'] = 'private_primary'
        df_private_rel['subset'] = 'private_relapse'

        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        df = pd.concat([df_shared, df_private_pry, df_private_rel], ignore_index=True, sort=False)
        df['PATIENT'] = pat

        df = df.apply(lambda x: get_context_rev(x), axis=1)
        df = add_pyrimidine_type(df)

        frames.append(df)

mutations = pd.concat(frames, ignore_index=True, sort=False)

In [None]:
mutations.to_csv(os.path.join(out_path,"run_subsets_together/simple_input.tsv"), sep='\t', index=False)

For this input deconstructSigs was run: 
```python assign_signature_to_mutation.py -i run_subsets_together/simple_input.tsv -o run_subsets_together -s ../../ext_files/run_deconstructSig/signatures.pcawg -t TALL_subset```
