This notebook processes the `genosnp` file with data from the "Ice Age" paper into a tidy table that can be used by tools other than ADMIXTOOLS and similar programs.

In [1]:
import pandas as pd
import numpy as np
from pybedtools import BedTool

In [2]:
import matplotlib.pyplot as plt

In [3]:
%matplotlib inline

plt.style.use('ggplot')

### Read the positions of sites from the archaic admixture array

In [4]:
snp_pos = pd.read_table('../tests/cadd/clean_data/ice_age.tsv')[['chrom', 'pos']]

In [5]:
len(snp_pos)

945357

In [6]:
snp_pos.head()

Unnamed: 0,chrom,pos
0,1,847983
1,1,853089
2,1,853596
3,1,854793
4,1,858862


# Additional SNP annotations

We don't want to rely on the B value annotations alone. Since B values are used as a measure of proximity to functional regions, we add more direct measures of that proximity:

* distance to the nearest coding region
* amount of coding sequence in an X bp window around each site
* phyloP score

## Exons

Create a BED object of the array sites:

In [7]:
snp_pos['start'] = snp_pos.pos - 1
snp_pos['end'] = snp_pos.pos
snp_bed = BedTool.from_dataframe(snp_pos[['chrom', 'start', 'end']]).sort()
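
BED intervals are 0-based and half-open, so a 1-based SNP position `pos` becomes the interval `[pos - 1, pos)`, which is what the `start` and `end` columns above encode. A minimal plain-Python sketch of that conversion (the helper name `pos_to_bed` is hypothetical and not part of pybedtools):

```python
def pos_to_bed(chrom, pos):
    """Convert a 1-based position into a 0-based, half-open BED interval."""
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    return (chrom, pos - 1, pos)


# first SNP of the table above
chrom, start, end = pos_to_bed(1, 847983)
print(chrom, start, end)
```

Every single-site interval produced this way has length `end - start == 1`.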

Load the genome annotations from a local gzipped GTF file:

In [8]:
gtf = pd.read_table('../tmp/gtf.gz',
                    header=None, sep='\t', skipinitialspace=True, skiprows=5, compression='gzip',
                    names=['chrom', 'source', 'feature', 'start', 'end',
                           'score', 'strand', 'frame', 'attribute'], low_memory=False)

In [9]:
gtf.head()

Unnamed: 0,chrom,source,feature,start,end,score,strand,frame,attribute
0,1,pseudogene,gene,11869,14412,.,+,.,"gene_id ""ENSG00000223972""; gene_name ""DDX11L1""..."
1,1,processed_transcript,transcript,11869,14409,.,+,.,"gene_id ""ENSG00000223972""; transcript_id ""ENST..."
2,1,processed_transcript,exon,11869,12227,.,+,.,"gene_id ""ENSG00000223972""; transcript_id ""ENST..."
3,1,processed_transcript,exon,12613,12721,.,+,.,"gene_id ""ENSG00000223972""; transcript_id ""ENST..."
4,1,processed_transcript,exon,13221,14409,.,+,.,"gene_id ""ENSG00000223972""; transcript_id ""ENST..."


Subset to exon annotations only and create a BED object

In [10]:
exons = BedTool.from_dataframe(
    gtf[(gtf.source == "protein_coding") &
        (gtf.feature == "exon")].query('end - start > 10')
).sort().merge().sort()

In [11]:
exons.total_coverage()

81532972
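
The `.sort().merge()` chain collapses overlapping exon records (for example, the same exon annotated in several transcripts) into non-redundant intervals, so that `total_coverage()` counts each base only once. A minimal plain-Python sketch of the same idea for a single chromosome, using toy coordinates. `merge_intervals` and `total_coverage` are hypothetical helpers written for illustration, and the sketch assumes that bookended intervals are merged too, as `bedtools merge` does by default:

```python
def merge_intervals(intervals):
    """Merge overlapping or bookended half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # overlaps (or touches) the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def total_coverage(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# toy exons (not real annotation data)
toy_exons = [(100, 200), (150, 260), (260, 300), (500, 520)]
print(merge_intervals(toy_exons))
print(total_coverage(toy_exons))
```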

### Distance to the nearest exon

In [12]:
closest = snp_bed.closest(exons, t='first', d=True).to_dataframe()

In [13]:
snp_pos = snp_pos.merge(closest, on=["chrom", "start", "end"])[["chrom", "pos", "start", "end", "thickStart"]] \
       .rename(columns={"thickStart": "exon_distance"})

In [14]:
snp_pos.head()

Unnamed: 0,chrom,pos,start,end,exon_distance
0,1,847983,847982,847983,12277
1,1,853089,853088,853089,7171
2,1,853596,853595,853596,6664
3,1,854793,854792,854793,5467
4,1,858862,858861,858862,1398
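
To make the meaning of `exon_distance` concrete, here is a plain-Python sketch of the nearest-exon distance for one SNP on one chromosome. It is not how bedtools is implemented. It assumes the convention that overlapping intervals have distance 0 and bookended intervals have distance 1, which is my understanding of `bedtools closest -d`. The function name and the toy exon coordinates are hypothetical:

```python
import bisect


def distance_to_nearest(snp_start, snp_end, exons):
    """Distance from a SNP interval to the nearest exon.

    `exons` must be sorted, non-overlapping, half-open (start, end)
    intervals on the same chromosome as the SNP.
    """
    if not exons:
        return None
    starts = [start for start, _ in exons]
    # index of the first exon that starts at or after the SNP's end
    i = bisect.bisect_left(starts, snp_end)
    candidates = []
    if i < len(exons):
        # exon located after the SNP (assumed: bookended == 1)
        candidates.append(exons[i][0] - snp_end + 1)
    if i > 0:
        prev_end = exons[i - 1][1]
        if prev_end > snp_start:
            return 0  # the SNP lies inside this exon
        # exon located before the SNP
        candidates.append(snp_start - prev_end + 1)
    return min(candidates)


# toy exons (not real annotation data)
toy_exons = [(1000, 1100), (5000, 5200)]
for pos in (500, 1050, 1101, 3000):
    print(pos, distance_to_nearest(pos - 1, pos, toy_exons))
```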


### Amount of exon sequence in a 20 kb window upstream/downstream of the SNP

How far upstream/downstream to extend the window? Window size will be `(2 * flank + 1)` bp.

In [15]:
flank = 10000

Generate the BED object of windows flanking both sides of all SNPs:

In [16]:
snp_windows = snp_bed.slop(b=flank, genome='hg19')

For each exon overlapping a given window, count the number of overlapping bases. The output contains one row per window-exon pair; a window that overlaps no exons is reported once, with an overlap of 0 bp.

In [17]:
exon_overlaps = snp_windows.intersect(exons, wao=True)   \
                           .to_dataframe()[['chrom', 'start', 'end', 'thickStart']]

In [18]:
exon_overlaps.head()

Unnamed: 0,chrom,start,end,thickStart
0,1,837982,857983,0
1,1,843088,863089,69
2,1,843088,863089,40
3,1,843088,863089,63
4,1,843088,863089,143


`thickStart` contains the number of bases overlapping with each exon in a window around each SNP. Summing up this column for each unique region (i.e. the window around each SNP) using `groupby` gives the total amount of coding sequence surrounding each SNP.

In [19]:
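# total exon overlap per window; groupby returns the windows sorted by (chrom, start, end)
exon_total = exon_overlaps.groupby(['chrom', 'start', 'end'])['thickStart'].sum().reset_index()['thickStart']
# NOTE: the assignment aligns on the row index, so it assumes that snp_pos
# is sorted by (chrom, pos) and contains no duplicated sites
snp_pos['exon_density'] = exon_total / (2 * flank + 1)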

In [20]:
snp_pos.head()

Unnamed: 0,chrom,pos,start,end,exon_distance,exon_density
0,1,847983,847982,847983,12277,0.0
1,1,853089,853088,853089,7171,0.015749
2,1,853596,853595,853596,6664,0.015749
3,1,854793,854792,854793,5467,0.015749
4,1,858862,858861,858862,1398,0.027549
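
The density calculation can be sketched for a single SNP in plain Python. The function name and the toy coordinates are hypothetical. As in the cells above, the denominator is always `2 * flank + 1`, even when a window is clipped at the start of a chromosome:

```python
def exon_density(pos, exons, flank):
    """Fraction of the window around a 1-based `pos` covered by exons.

    `exons` are non-overlapping, half-open (start, end) intervals
    on the same chromosome as the SNP.
    """
    # the SNP interval [pos - 1, pos) extended by `flank` on both sides
    win_start = max(0, pos - 1 - flank)
    win_end = pos + flank
    covered = 0
    for ex_start, ex_end in exons:
        overlap = min(win_end, ex_end) - max(win_start, ex_start)
        if overlap > 0:
            covered += overlap
    return covered / (2 * flank + 1)


# toy example: window [89, 110) overlaps the two exons by 6 bp and 5 bp
density = exon_density(100, [(80, 95), (105, 200)], flank=10)
print(density)
```

The same arithmetic can be checked against the output above: the four overlaps listed for the second window sum to 69 + 40 + 63 + 143 = 315 bp, and 315 / 20001 ≈ 0.015749, the `exon_density` reported for the SNP at position 853089.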


## phyloP annotation

Conservation scores calculated without human data, which makes them more suitable for studying Neanderthal introgression.

Create a BED object of the array sites:

In [127]:
# chromosome IDs need a 'chr' prefix to match phyloP data
snp_pos_with_chr = snp_pos.copy()
snp_pos_with_chr['chrom'] = "chr" + snp_pos.chrom.astype(str)

snp_bed = BedTool.from_dataframe(snp_pos_with_chr[['chrom', 'start', 'end']]).sort()

Size of the window around each SNP, in which to calculate average phyloP score:

In [128]:
WINDOW_SIZE = 1000

In [129]:
def process_chromosome(chrom, snps, window_size):
    phylop = BedTool("/mnt/expressions/benjamin_vernot/phyloP_no_human/Compara.36_eutherian_mammals_EPO_LOW_COVERAGE.chr" + str(chrom) + "_.phyloP_no_human.SS.bed")
    
    # get all sites surrounding each SNP from the archaic admixture array
    phylop_windows = snps.window(phylop, w=window_size).to_dataframe()
    
    # calculate the average phyloP score of each SNP (based on phyloP values in windows around them)
    avg_phylop = phylop_windows.groupby(['chrom', 'start', 'end'])['thickStart'].mean().reset_index()
    del phylop_windows

    # rename the fields to make join with the original archaic admixture array SNP table possible
    avg_phylop = avg_phylop.rename(columns={'end': 'pos', 'thickStart': 'phylop_avg'})[['chrom', 'pos', 'phylop_avg']]

    # remove the 'chr' prefix of chromosome IDs and convert them to integers
    # (only the 'chrom' column is converted, not the whole dataframe)
    avg_phylop['chrom'] = avg_phylop['chrom'].str.replace('chr', '', regex=False).astype(int)
    
    return avg_phylop

In [130]:
def process_chromosome(chrom, snps, window_size):
    """Process the given chromosome - calculate average phylop for each SNP,
    and also annotate each SNP with its own per-site phyloP value. Return both
    in a single pandas dataframe.
    """
    phylop = BedTool("/mnt/expressions/benjamin_vernot/phyloP_no_human/Compara.36_eutherian_mammals_EPO_LOW_COVERAGE.chr" + str(chrom) + "_.phyloP_no_human.SS.bed")
    
    # get all sites surrounding each SNP from the archaic admixture array
    phylop_windows = snps.window(phylop, w=window_size).to_dataframe()
    
    # calculate the average phyloP score of each SNP (based on phyloP values in windows around them)
    phylop_snps = phylop_windows.groupby(['chrom', 'start', 'end'])['thickStart'].mean().reset_index()
    del phylop_windows

    # rename the fields to make join with the original archaic admixture array SNP table possible
    phylop_snps = phylop_snps.rename(columns={'thickStart': 'phylop_avg'})

    # get phylop value for each site
    phylop_per_site = snps.intersect(phylop, wb=True) \
                          .to_dataframe()[["chrom", "start", "end", "thickStart"]] \
                          .rename(columns={"thickStart": "phylop"})
    
    # merge the tables with per-site phylop and phylop averages in windows
    phylop_snps = phylop_snps.merge(phylop_per_site,
                                    on=["chrom", "start", "end"])
    
    # remove the 'chr' prefix of chromosome IDs and convert them to integers
    # (only the 'chrom' column is converted, not the whole dataframe)
    phylop_snps['chrom'] = phylop_snps['chrom'].str.replace('chr', '', regex=False).astype(int)
    
    return phylop_snps[['chrom', 'start', 'end', 'phylop', 'phylop_avg']]
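
The two quantities returned by `process_chromosome` can be illustrated for a single SNP in plain Python. This sketch assumes that `bedtools window -w` pads the SNP interval by `w` bp on each side, so that all scored sites with positions in `[pos - w, pos + w]` contribute to the average. The function name and the toy scores are hypothetical:

```python
def phylop_for_snp(pos, site_scores, window_size):
    """Return (per-site score, window average) for a 1-based SNP position.

    `site_scores` maps 1-based positions to phyloP scores; positions
    without a score are simply absent from the mapping.
    """
    in_window = [score for site, score in site_scores.items()
                 if pos - window_size <= site <= pos + window_size]
    # the average is undefined if no scored site falls into the window
    average = sum(in_window) / len(in_window) if in_window else None
    return site_scores.get(pos), average


# toy scores (not real phyloP data)
toy_scores = {95: 1.0, 100: -0.5, 104: 0.25, 120: 3.0}
per_site, window_avg = phylop_for_snp(100, toy_scores, window_size=5)
print(per_site, window_avg)
```

Note that the real function merges the two tables on `chrom`, `start` and `end`, so a SNP without its own phyloP score drops out of the result, whereas this sketch returns `None` for it.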

### Process all phyloP BED files in parallel

In [131]:
import functools

# create a partial function used to process a given chromosome's phyloP BED file
fn = functools.partial(process_chromosome, snps=snp_bed, window_size=WINDOW_SIZE)

In [133]:
# shuffle so that the biggest chromosomes are not all processed
# at the same time
import random
chroms = list(range(1, 23))
random.shuffle(chroms)

In [135]:
from multiprocessing import Pool

# create a pool of 11 workers that will process the 22 chromosomes
with Pool(processes=11) as pool:
    # map the processing function across all chromosomes
    phylop_snps_per_chrom = pool.map(fn, chroms)

    # concatenate the results into a single dataframe
    phylop_snps = pd.concat(phylop_snps_per_chrom)
    
    # convert from BED coordinates into a normal table format
    phylop_snps = phylop_snps.rename(columns={"end": "pos"})[["chrom", "pos", "phylop", "phylop_avg"]]

### Join the phyloP scores and averages with the rest of the data

In [147]:
snp_pos = snp_pos.merge(phylop_snps, on=['chrom', 'pos'])[['chrom', 'pos', 'exon_distance', 'exon_density', 'phylop', 'phylop_avg']]

In [148]:
snp_pos.head()

Unnamed: 0,chrom,pos,exon_distance,exon_density,phylop,phylop_avg
0,1,847983,12277,0.0,1.303,-0.176853
1,1,853089,7171,0.015749,-0.365,-0.145938
2,1,853596,6664,0.015749,-0.645,-0.252323
3,1,854793,5467,0.015749,-1.678,-0.280611
4,1,858862,1398,0.027549,0.197,-0.157188


# Write the processed genosnp table to a tab-separated file

In [149]:
snp_pos.to_csv('../tests/cadd/clean_data/more_annotations.tsv', sep='\t', index=False, na_rep='-')