# Generate genomic coordinates for the mappable genome

The mappable genome is the fraction of the reference hg38 genome under analysis in the manuscript. 

The mappable genome consists of: 
- regions of high mappability that do not overlap with blacklisted sequences of low mappability.
- genomic positions containing population variants (any substitution or indel with total allele frequency above 1%). 
- the above-mentioned regions, excluding cancer driver coding and non-coding regions.
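
The first and last items in the list above both amount to subtracting one set of intervals from another (blacklisted sequences from high-mappability regions, then driver regions from the result). The scripts that do this (`mappable_genome.py`, `mappable_genome_nodrivers.py`) are not shown in this notebook, so the following is only a minimal sketch of the idea. It assumes half-open `(start, end)` intervals on a single chromosome; the function name `subtract_intervals` and the toy coordinates are hypothetical.

```python
def subtract_intervals(regions, excluded):
    """Return the parts of `regions` that are not covered by `excluded`.

    Both arguments are lists of half-open (start, end) tuples on one chromosome.
    Illustrative only: this is not the code used by mappable_genome.py.
    """
    excluded = sorted(excluded)
    kept = []
    for start, end in sorted(regions):
        cursor = start  # leftmost position of the region not yet processed
        for ex_start, ex_end in excluded:
            if ex_end <= cursor:
                continue  # excluded interval ends before the cursor
            if ex_start >= end:
                break  # this and all later excluded intervals start after the region
            if ex_start > cursor:
                kept.append((cursor, ex_start))  # keep the gap before the excluded interval
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            kept.append((cursor, end))  # keep whatever remains after the last overlap
    return kept


# Toy coordinates, not real hg38 data
mappable = [(0, 100), (150, 300)]
blacklist = [(40, 60), (280, 400)]
print(subtract_intervals(mappable, blacklist))
# -> [(0, 40), (60, 100), (150, 280)]
```

On genome-scale inputs, a dedicated interval library or a single merged sweep would likely be preferable to this nested loop, which is kept simple for readability.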

This notebook first generates a version of the mappable genome that includes drivers (used to normalise mutational signature profiles), and then the final mappable genome file, which excludes drivers.

Blacklisted regions of low mappability for hg38 were obtained from the ENCODE Unified GRCh38 Blacklist (downloaded from encodeproject.org/files/ENCFF356LFX on 16-06-2020). 

Population variants were obtained from gnomAD version 3.0 (downloaded from gnomad.broadinstitute.org on 25-06-2020).

Cancer driver genes and annotations can be found in the "cancerdrivers" folder.

In [1]:
import os

In [35]:
main_dir = ''

# Mappable genome including drivers

Compute the mappable genome, drivers included, from the high-mappability regions, the ENCODE blacklist, and the gnomAD population variants.

In [5]:
mappable_regions_f = f'{main_dir}/data/inputs/hg38_100bp.coverage.regions.gz'
mappable_blacklist_f = f'{main_dir}/data/inputs/ENCFF356LFX.bed.gz'
pop_variants_f = f'{main_dir}/data/inputs/gnomad.genomes.r3.0.sites.allchr.af_0.01.tsv.gz'
output_file = f'{main_dir}/data/hg38_mappable_genome.tsv.gz'

In [6]:
map_file = f'{main_dir}/code/1_mappable_genome.map'
code_file = f'{main_dir}/code/mappable_genome.py'

In [7]:
info = [
    '[params]',
    'cores=1',
    'memory=100G\n',
    '[pre]',
    '. "/home/$USER/miniconda3/etc/profile.d/conda.sh"',
    'conda activate hotspots_framework\n',
    '[jobs]',
]

In [8]:
with open(map_file, 'w') as ofd: 
    for line in info: 
        ofd.write(f'{line}\n')
    
    ofd.write(f'python {code_file} -m {mappable_regions_f} -b {mappable_blacklist_f} -pv {pop_variants_f} -o {output_file}\n')
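
The cell above writes an INI-like job file with `[params]`, `[pre]`, and `[jobs]` sections. The tool that consumes this `.map` file is not named in this notebook, so the sketch below only reproduces the layout visible in the cells here. The helper name `render_map_file` and the shortened file names in the example command are hypothetical placeholders.

```python
def render_map_file(env_name, command, cores=1, memory="100G"):
    """Build the text of a .map job file with the section layout used in this notebook.

    Illustrative helper only; the section names are copied from the cells in
    this notebook, not from the documentation of any job-submission tool.
    """
    lines = [
        "[params]",
        f"cores={cores}",
        f"memory={memory}",
        "",  # blank line separating sections
        "[pre]",
        '. "/home/$USER/miniconda3/etc/profile.d/conda.sh"',
        f"conda activate {env_name}",
        "",
        "[jobs]",
        command,
    ]
    return "\n".join(lines) + "\n"


# Placeholder file names, not the real input paths
text = render_map_file(
    "hotspots_framework",
    "python mappable_genome.py -m regions.gz -b blacklist.bed.gz -pv variants.tsv.gz -o out.tsv.gz",
)
print(text)
```

Because both map files in this notebook share the same `[params]` and `[pre]` header, a helper of this kind could be called once per job instead of repeating the `info` list.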


# Mappable genome excluding drivers

Exclude the genomic regions of driver elements from the mappable genome computed above.

In [36]:
main_dir = ''

In [37]:
mappable_genome_file = f'{main_dir}/data/hg38_mappable_genome.tsv.gz'
drivers_file = f'{main_dir}/data/cancerdrivers_regions.tsv'
output_file = f'{main_dir}/data/hg38_mappable_genome.nodrivers.tsv.gz'

In [38]:
map_file = f'{main_dir}/code/1_mappable_genome_nodrivers.map'
code_file = f'{main_dir}/code/mappable_genome_nodrivers.py'

In [39]:
info = [
    '[params]',
    'cores=1',
    'memory=100G\n',
    '[pre]',
    '. "/home/$USER/miniconda3/etc/profile.d/conda.sh"',
    'conda activate hotspots_framework\n',
    '[jobs]',
]

In [40]:
with open(map_file, 'w') as ofd: 
    for line in info: 
        ofd.write(f'{line}\n')
    
    ofd.write(f'python {code_file} -m {mappable_genome_file} -d {drivers_file} -o {output_file}\n')
