# Generating SLiM exome recombination maps

This notebook generates a recombination map for SLiM that implements a realistic exome structure and includes the positions of sites from the archaic admixture array. It also saves the positions of array sites and coordinates of exons separately.

In [1]:
from pybedtools import BedTool
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline
plt.style.use('ggplot')

In [3]:
RECOMB_RATE = 1e-8 # crossovers per bp per generation

## Exon coordinates processing

Load the BED file of protein-coding regions, array sites and spacers, together with their recombination rates:

In [4]:
recmap = pd.read_table('../data/bed/regions/protein_coding_regions_gap_sites_spacers_recomb_rates.bed', names=["chrom", "start", "end", "type", "width", "length", "recomb_rate", "xxx"])[["chrom", "start", "end", "length", "type", "recomb_rate"]]

In [5]:
recmap.head()

Unnamed: 0,chrom,start,end,length,type,recomb_rate
0,chr1,69090,70008,918,original,2.082414
1,chr1,70008,138529,68521,spacer,3.01757
2,chr1,138529,139309,780,original,3.354927
3,chr1,139309,367658,228349,spacer,3.124309
4,chr1,367658,368597,939,original,2.887498


In [6]:
exons = recmap.query("type == 'original'").query("end - start > 1")[["chrom", "start", "end"]]

In [7]:
# strip the "chr" prefix and convert chromosome IDs to integers
exons = exons.assign(chrom=exons.chrom.str.replace("chr", "", regex=False).astype(int))
exons = exons.sort_values(["chrom", "start"])

In [8]:
BedTool.from_dataframe(exons).sort().merge().total_coverage()

33741600

In [9]:
exons = BedTool.from_dataframe(exons).sort().merge().to_dataframe()

In [10]:
BedTool.from_dataframe(exons).sort().merge().total_coverage()

33741600
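
For readers unfamiliar with bedtools, the merge-and-sum step used in the cells above can be sketched in plain Python. This is only an illustration on a single chromosome, not the pybedtools implementation; `merge_intervals`, `total_coverage` and the toy coordinates are hypothetical. It assumes half-open intervals and, like the default `bedtools merge`, joins intervals that overlap or directly touch.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # overlaps or touches the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def total_coverage(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# hypothetical toy exons on one chromosome
toy_exons = [(100, 200), (150, 260), (260, 300), (500, 510)]
print(merge_intervals(toy_exons))
print(total_coverage(toy_exons))
```

Because merging does not change which bases are covered, the total coverage is the same before and after the merge, which is what the two identical outputs above confirm.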

In [11]:
exons['type'] = 'region'

In [35]:
len(exons)

200323

Summarize the lengths of the merged exons:

In [12]:
lengths = exons.end - exons.start

In [13]:
lengths.describe()

count    200323.000000
mean        168.435976
std         279.273197
min           2.000000
25%          82.000000
50%         122.000000
75%         171.000000
max       21693.000000
dtype: float64

Specify the recombination rate within exons:

In [14]:
exons.loc[exons.type == 'region', 'recomb_rate'] = RECOMB_RATE

The modified dataframe:

In [15]:
exons.head()

Unnamed: 0,chrom,start,end,type,recomb_rate
0,1,69090,70008,region,1e-08
1,1,138529,139309,region,1e-08
2,1,367658,368597,region,1e-08
3,1,621095,622034,region,1e-08
4,1,738531,738618,region,1e-08


## Process positions from the archaic admixture array

Load the coordinates of sites from the archaic admixture array:

In [16]:
array_sites = recmap.loc[(recmap.end - recmap.start == 1) & (recmap.type != "spacer")][["chrom", "start", "end"]]

In [17]:
len(array_sites)

49990

In [18]:
# strip the "chr" prefix and convert chromosome IDs to integers
array_sites = array_sites.assign(chrom=array_sites.chrom.str.replace("chr", "", regex=False).astype(int))

In [19]:
array_sites = array_sites.sort_values(["chrom", "start"])

In [20]:
array_sites['type'] = 'site'

In [21]:
array_sites.tail()

Unnamed: 0,chrom,start,end,type
325075,22,50546466,50546467,site
325157,22,50641435,50641436,site
325217,22,50666506,50666507,site
325223,22,50677478,50677479,site
325747,22,51028230,51028231,site


## Create a union of exon regions with admixture array informative positions

Informative sites between exons will be simulated individually, with an appropriate recombination rate set between each site and the adjacent exons or other informative sites. Sites within exons are ignored when building the recombination map, because they recombine as part of the exon that contains them.

Take the subset of array sites that lie outside exons:

In [22]:
exome = BedTool.from_dataframe(exons)
sites_outside_exons = BedTool.from_dataframe(array_sites).intersect(exome, v=True).to_dataframe().rename(columns={'name': 'type'})

In [23]:
exons_and_sites = pd.concat([sites_outside_exons,
                             exons]).sort_values(by=['chrom', 'start']).reset_index(drop=True)
exons_and_sites = exons_and_sites[['chrom', 'start', 'end', 'type', 'recomb_rate']]

In [24]:
exons_and_sites.head()

Unnamed: 0,chrom,start,end,type,recomb_rate
0,1,69090,70008,region,1e-08
1,1,138529,139309,region,1e-08
2,1,367658,368597,region,1e-08
3,1,621095,622034,region,1e-08
4,1,738531,738618,region,1e-08


In [25]:
len(exons_and_sites)

250313

## Generating the recombination map of exons and archaic admixture array sites

In [26]:
AUTOSOMES = list(range(1, 23))

In [27]:
def add_recombination_gaps(regions):
    """Add 1 bp "recombination gap" between each record in
    a given DataFrame of exon and snp coordinates.
    """
    # create a new DataFrame with coordinates of 1 bp gaps
    gaps = pd.DataFrame({'chrom'       : regions.chrom.values,
                         'start'       : regions.end.values,
                         'end'         : regions.end.values + 1,
                         'recomb_rate' : list(RECOMB_RATE * (regions.start[1:].values -  # between exons/sites
                                                             regions.end[:-1].values)) +
                                         [0.5], # between chromosomes
                         'type'        : 'spacer'},
                        columns=['chrom', 'start', 'end', 'recomb_rate', 'type'])

    # merge the dataframes of regions and gap coordinates
    regions_and_gaps = pd.concat([regions, gaps]).sort_values(by=['chrom', 'start']).reset_index(drop=True)

    return regions_and_gaps[['chrom', 'start', 'end', 'type', 'recomb_rate']]


def concatenate_regions(regions):
    """Concatenate regions/sites on all chromosomes as if they were
    directly adjacent on a single chromosome, changing the coordinates
    accordingly.
    """
    concat_regions = regions.copy()

    concat_regions['width'] = regions.end - regions.start
    concat_regions['slim_start'] = pd.Series([0] + list(concat_regions.width[:-1])).cumsum().values
    concat_regions['slim_end'] = concat_regions.width.cumsum() - 1
    
    return concat_regions


def create_recomb_map(regions):
    """Create recombination map for all regions/sites on all chromosomes."""
    recomb_map = []

    for chrom in AUTOSOMES:
        # add recombination gaps between regions on this chromosome
        recomb_map.append(add_recombination_gaps(regions.query('chrom == {}'.format(chrom))))
        
    # concatenate exome recombination maps of all chromosomes
    recomb_map = pd.concat(recomb_map, ignore_index=True).sort_values(by=['chrom', 'start']).reset_index(drop=True)

    # remove the very last base of the recombination map
    # (it has a 0.5 recombination rate anyway and there's no other chromosome
    # after it)
    recomb_map = recomb_map[:-1]
    
    return recomb_map

Recombination between exons and/or informative positions is implemented by inserting a 1 bp "gap" between adjacent regions/sites and setting the recombination rate at that position to $L \times 1\cdot10^{-8}$ crossovers per generation, where $L$ is the distance in bp between the two exons/sites.

The recombination rate of the "gap" between the last feature on one chromosome and the first feature on another chromosome will be 0.5.
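
As a worked example of this rule, the following sketch recomputes the gap rates for the first three exons on chromosome 1, whose coordinates are taken from the tables above. The helper name `gap_rates` is hypothetical and the code is plain Python rather than the pandas implementation used in `add_recombination_gaps`.

```python
RECOMB_RATE = 1e-8  # crossovers per bp per generation, as defined at the top of the notebook


def gap_rates(features, between_chromosomes=0.5):
    """Return the rate of the 1 bp gap placed after each (start, end) feature."""
    rates = [RECOMB_RATE * (next_start - end)
             for (_, end), (next_start, _) in zip(features, features[1:])]
    # the gap after the last feature separates two chromosomes
    return rates + [between_chromosomes]


# first three exons on chromosome 1
chr1_head = [(69090, 70008), (138529, 139309), (367658, 368597)]
rates = gap_rates(chr1_head)
print(rates)
```

The first gap spans 138529 - 70008 = 68521 bp, so its rate is 68521 × 1e-8 ≈ 0.00068521, matching the first spacer in the recombination map.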

Create a recombination map of exons and informative sites:

In [28]:
recomb_map = create_recomb_map(exons_and_sites)

In [29]:
recomb_map.head()

Unnamed: 0,chrom,start,end,type,recomb_rate
0,1,69090,70008,region,1e-08
1,1,70008,70009,spacer,0.00068521
2,1,138529,139309,region,1e-08
3,1,139309,139310,spacer,0.00228349
4,1,367658,368597,region,1e-08


Filter out "gaps" between directly adjacent features:

In [30]:
recomb_map = recomb_map.query('(type == "site") | (recomb_rate > 0)')

Convert the coordinates of all regions/sites/gaps into SLiM's 0-based single segment coordinate system:

In [31]:
concat_map = concatenate_regions(recomb_map)

In [32]:
concat_map.head()

Unnamed: 0,chrom,start,end,type,recomb_rate,width,slim_start,slim_end
0,1,69090,70008,region,1e-08,918,0,917
1,1,70008,70009,spacer,0.00068521,1,918,918
2,1,138529,139309,region,1e-08,780,919,1698
3,1,139309,139310,spacer,0.00228349,1,1699,1699
4,1,367658,368597,region,1e-08,939,1700,2638
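
The coordinate shift can be checked by hand on the five rows above. The sketch below is a plain-Python equivalent of the cumulative sums in `concatenate_regions`; the function name `slim_coordinates` is hypothetical.

```python
from itertools import accumulate


def slim_coordinates(widths):
    """Map consecutive feature widths to 0-based, inclusive (start, end) pairs."""
    ends = [total - 1 for total in accumulate(widths)]
    starts = [0] + [end + 1 for end in ends[:-1]]
    return list(zip(starts, ends))


# widths of the first five rows of concat_map
widths = [918, 1, 780, 1, 939]
print(slim_coordinates(widths))
```

Each feature starts one position after the previous feature ends, so the whole exome is laid out as one gapless segment.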


In [33]:
concat_map.tail()

Unnamed: 0,chrom,start,end,type,recomb_rate,width,slim_start,slim_end
500620,22,51216379,51216409,region,1e-08,30,34041699,34041728
500621,22,51216409,51216410,spacer,2.724e-05,1,34041729,34041729
500622,22,51219133,51219146,region,1e-08,13,34041730,34041742
500623,22,51219146,51219147,spacer,1.469e-05,1,34041743,34041743
500624,22,51220615,51220722,region,1e-08,107,34041744,34041850


Save the recombination map in a SLiM-friendly format (end-positions of exons and gaps, without the positions of SNPs since they don't have recombination rates themselves):

In [34]:
concat_map.query('type != "site"').to_csv('../zzz_recmap.txt', sep='\t', index=False)
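
How the saved table is consumed is not shown in this notebook. As a minimal sketch, assuming the SLiM script passes the rates and end positions to `initializeRecombinationRate(rates, ends)`, the two vectors could be extracted as follows. The `rows` list is a hypothetical stand-in for `concat_map`, filled with the first rows shown above, and `to_slim_vectors` is an illustrative helper.

```python
def to_slim_vectors(rows):
    """Split (type, recomb_rate, slim_end) rows into parallel rate and end lists."""
    kept = [(rate, end) for kind, rate, end in rows if kind != "site"]
    rates = [rate for rate, _ in kept]
    ends = [end for _, end in kept]
    return rates, ends


rows = [  # (type, recomb_rate, slim_end)
    ("region", 1e-8, 917),
    ("spacer", 0.00068521, 918),
    ("region", 1e-8, 1698),
    ("spacer", 0.00228349, 1699),
    ("region", 1e-8, 2638),
]

rates, ends = to_slim_vectors(rows)
# end positions must be strictly increasing for a valid map
assert all(a < b for a, b in zip(ends, ends[1:]))
print(list(zip(ends, rates)))
```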

## Generating recombination map of exons only (ignoring admixture array sites)

In [35]:
recomb_map_exons_only = create_recomb_map(exons_and_sites.query('type != "site"'))

In [36]:
concat_map_exons_only = concatenate_regions(recomb_map_exons_only)

In [37]:
import os
os.makedirs('../clean_data', exist_ok=True)  # create the output directory if it does not exist yet
concat_map_exons_only[['slim_end', 'recomb_rate']].to_csv('../clean_data/exome_only_recombination_map.txt', sep='\t', index=False)

In [None]:
concat_map_exons_only.tail()

## Save SLiM coordinates of exons only (ignoring admixture array sites)

In [None]:
concat_map_exons_only.query('type == "region"')[['slim_start', 'slim_end']].to_csv('../clean_data/exome_only_exon_coordinates.txt', sep='\t', index=False)

## Save SLiM coordinates of all sites from the archaic admixture array

The recombination map includes only the positions of sites that fall outside of exonic regions. However, in order to simulate the sites from the archaic admixture array, we also need the positions of sites _within_ exons.

SLiM simulates all exons and individual sites as one continuous segment of concatenated regions, with the coordinates of exons and sites shifted accordingly. To obtain the coordinates of array sites within exons, we need to find, for each site, the exon it falls in and calculate its position relative to the start of that exon.

Take the subset of array sites that lie inside exons:

In [39]:
sites_within_exons = BedTool.from_dataframe(array_sites).intersect(exome).to_dataframe()

In [40]:
sites_within_exons = sites_within_exons.rename(columns={'name': 'type'})

Get a DataFrame of the coordinates of exons that contain a site from the admixture array (will contain multiple copies of one exon if more than one site falls within that exon):

In [41]:
exons_with_sites = BedTool.from_dataframe(concat_map.query('type == "region"')).intersect(BedTool.from_dataframe(sites_within_exons), wa=True).to_dataframe()


# rename and subset columns
exons_with_sites = exons_with_sites.rename(columns={'name': 'type',
                                                    'score': 'recomb_rate',
                                                    'strand': 'width',
                                                    'thickStart': 'slim_start',
                                                    'thickEnd': 'slim_end'})[['chrom', 'start', 'end', 'type', 'recomb_rate', 'width', 'slim_start', 'slim_end']]

Calculate the position of each site relative to the start of "its" exon and convert this position into a SLiM single-segment coordinate (i.e. relative to position 0 of the simulated segment):

In [42]:
sites_within_exons['width'] = 1
sites_within_exons['slim_start'] = sites_within_exons.start - exons_with_sites.start + exons_with_sites.slim_start
sites_within_exons['slim_end'] = sites_within_exons.slim_start
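
The cell above relies on the rows of `sites_within_exons` and `exons_with_sites` being in the same order. As a cross-check, the same arithmetic can be sketched with an explicit lookup of the containing exon. This is an illustration for a single chromosome only; `site_to_slim` and the queried positions are hypothetical, while the exon coordinates are the first three regions of `concat_map`.

```python
from bisect import bisect_right

# (start, end, slim_start) of exons on one chromosome, sorted by start
exons = [(69090, 70008, 0), (138529, 139309, 919), (367658, 368597, 1700)]
exon_starts = [start for start, _, _ in exons]


def site_to_slim(position):
    """Return the SLiM coordinate of a 0-based site, or None if it is not exonic."""
    i = bisect_right(exon_starts, position) - 1
    if i < 0:
        return None  # the site lies before the first exon
    start, end, slim_start = exons[i]
    if position >= end:  # half-open interval: `end` itself lies outside the exon
        return None
    # offset within the exon, shifted to the exon's position in the SLiM segment
    return position - start + slim_start


print(site_to_slim(138600))  # inside the second exon
print(site_to_slim(100000))  # between the first and second exon
```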

Output the coordinates of subsampled sites:

In [43]:
subsampled_exonic_sites = sites_within_exons.sample(10000).sort_values(by=['slim_start']).reset_index(drop=True)

In [44]:
subsampled_exonic_sites['slim_start'].to_csv('../clean_data/admixture_array_coordinates_exonic.txt', index=False)
concat_map.query('type == "site"')['slim_start'].to_csv('../clean_data/admixture_array_coordinates_nonexonic.txt', index=False)

## Save SLiM coordinates of exonic regions only

These coordinates are needed as the start and end arguments of SLiM's `initializeGenomicElement` function.

In [45]:
concat_map.query('type == "region"')[['slim_start', 'slim_end']].to_csv('../clean_data/exome_and_sites_exon_coordinates.txt', sep='\t', index=False)
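
The notebook does not show how the SLiM script reads this file. Purely as an illustration, the sketch below turns a list of exon coordinates into `initializeGenomicElement` calls. It assumes a genomic element type named `g1` has been declared in the SLiM script, and `genomic_element_calls` is a hypothetical helper.

```python
def genomic_element_calls(exon_coords, element_type="g1"):
    """Format one initializeGenomicElement() call per (slim_start, slim_end) pair."""
    return ["initializeGenomicElement({}, {}, {});".format(element_type, start, end)
            for start, end in exon_coords]


# SLiM coordinates of the first three exons from concat_map
exon_coords = [(0, 917), (919, 1698), (1700, 2638)]
calls = genomic_element_calls(exon_coords)
print("\n".join(calls))
```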