# Simulating background selection on a nonrecombining Y chromosome


This notebook uses the packages [fwdpp](https://github.com/molpopgen/fwdpp), [fwdpy](https://github.com/molpopgen/fwdpy), [libsequence](https://github.com/molpopgen/libsequence), and [pylibseq](https://github.com/molpopgen/pylibseq), developed by Kevin Thornton (UC Irvine) for conducting forward-time population genetic simulations and analyzing their output. If you use these packages, please cite:
   * Thornton, K. R. (2014). A C++ template library for efficient forward-time population genetic simulation of large populations. Genetics 198:157-166

   * Thornton, K. R. (2003). Libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19:2325-2327

## Setting up the simulation

* These simulations evaluate whether the observed reduction in nucleotide diversity on the _R. hastatulus_ Y chromosome (relative to neutral expectations) can be explained by background selection arising from the lack of recombination on this chromosome.


* We first simulate a nonrecombining Y chromosome and collect diversity statistics under a distribution of fitness effects of new mutations in which selection coefficients are drawn from a gamma distribution with shape parameter 0.1. This is implemented by `GammaS` in fwdpy. The simulations are then repeated with recombination for X chromosomes and autosomes (where $N_{e,X} = \frac{3}{4} N_{e,A}$) to obtain normalized $X/A$ and $Y/A$ ratios of diversity. These simulated diversity ratios are compared with empirical data and with predictions under neutral evolution, and we estimate the strength of background selection required to explain the observed reduction in Y-linked diversity in _R. hastatulus_.
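
The neutral $X/A$ and $Y/A$ baselines mentioned above can be sketched from effective population sizes alone. The formulas below are the standard textbook expressions for a dioecious population with XY sex determination; they are an assumption for illustration, not necessarily the equations used in this analysis, and the function names and the male and female counts are hypothetical. The sketch also assumes equal mutation rates across chromosome types, so that diversity scales with $N_e$.

```python
from __future__ import division, print_function


def effective_sizes(n_males, n_females):
    """Return (Ne_A, Ne_X, Ne_Y) for given numbers of breeding males and females.

    Standard textbook formulas, assumed here for illustration only.
    """
    nm, nf = float(n_males), float(n_females)
    ne_a = 4.0 * nm * nf / (nm + nf)              # autosomes
    ne_x = 9.0 * nm * nf / (4.0 * nm + 2.0 * nf)  # X chromosome
    ne_y = nm / 2.0                               # Y chromosome
    return ne_a, ne_x, ne_y


def neutral_ratios(n_males, n_females):
    """Return the neutral (X/A, Y/A) diversity ratios, assuming pi scales with Ne."""
    ne_a, ne_x, ne_y = effective_sizes(n_males, n_females)
    return ne_x / ne_a, ne_y / ne_a


# Equal sex ratio recovers the familiar 3/4 and 1/4 expectations.
print(neutral_ratios(500, 500))
# A hypothetical female-biased population shifts both ratios.
print(neutral_ratios(400, 600))
```

With an equal sex ratio this reproduces $N_{e,X} = \frac{3}{4} N_{e,A}$ and $N_{e,Y} = \frac{1}{4} N_{e,A}$. A female bias raises the expected $X/A$ ratio and lowers the expected $Y/A$ ratio.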

In [1]:
#import required libraries
from __future__ import print_function
import fwdpy as fp
import numpy as np
import pandas
import math

## Establishing regions for mutation and recombination on the Y-chromosome


* Neutral mutations occur on the interval $[0,1)$.
* Selected (deleterious) mutations occur on the interval $[1,2)$ in the gamma model used below. The unused constant-$s$ alternative places them on the intervals $[-1,0)$ and $[1,2)$.
* Recombination is uniform across the region $[-1,2)$. To simulate a nonrecombining Y chromosome, we set the recombination rate to $r=0$. For the X chromosome and autosomes, we allow free recombination with $r=0.5$, which is the value passed to `evolve_regions` in the call below. Here, $r$ is the mean of the Poisson-distributed number of crossover events (per diploid, per generation) and represents the total rate across the simulated region.
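
To make the meaning of $r$ concrete, here is a minimal sketch of drawing crossover positions for one meiosis: a Poisson number of breakpoints with mean $r$, placed uniformly on the recombining region. It uses only the Python standard library and is an illustration of the model described above, not fwdpy's internal implementation. The function names are hypothetical.

```python
from __future__ import division, print_function
import math
import random


def poisson_draw(mean, rand):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    if mean <= 0.0:
        return 0
    threshold = math.exp(-mean)
    count, product = 0, rand.random()
    while product > threshold:
        count += 1
        product *= rand.random()
    return count


def crossover_positions(r, beg, end, rand):
    """Return sorted crossover positions on [beg, end) for one meiosis."""
    n_breaks = poisson_draw(r, rand)
    return sorted(beg + (end - beg) * rand.random() for _ in range(n_breaks))


rand = random.Random(238)
# r = 0 (nonrecombining Y): never any breakpoints.
print(crossover_positions(0.0, -1.0, 2.0, rand))
# r = 0.5: on average one crossover every two meioses.
print([crossover_positions(0.5, -1.0, 2.0, rand) for _ in range(5)])
```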

In [2]:
# Where neutral mutations occur:
nregions = [fp.Region(beg=0,end=1,weight=1)]

Our "selected" mutations have positions on the continuous interval $[1,2)$. The cell below defines a single class of such mutations, with gamma-distributed selection coefficients: mean $s = -0.01$ (deleterious), shape parameter 0.1, and dominance $h = 1$. A constant-$s$ alternative with $s = -0.05$ is left commented out and is not used.

In [3]:
# Where selected mutations occur:

#constant model with s=-0.05 (not used)
#sregions = [fp.ConstantS(beg=-1,end=0,weight=1,s=-0.05,h=1),
#            fp.ConstantS(beg=1,end=2,weight=1,s=-0.05,h=1)]

#gamma model
sregions = [fp.GammaS(1,2,1,-0.01,0.1,1)] #these are deleterious, with mean s = -0.01

#input parameters, in positional order
#b: the beginning of the region
#e: the end of the region
#w: the "weight" assigned to the region
#mean: mean of the gamma distribution
#shape: shape of the gamma distribution
#h: the dominance term
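
To see what a gamma distribution with mean $-0.01$ and shape 0.1 implies for individual mutations, the following sketch draws selection coefficients with the standard library. It assumes the usual shape/scale parametrization (mean = shape × scale) and assumes that a negative mean is handled by drawing magnitudes and applying the sign. Both are assumptions about the model, not a description of fwdpy's internals, and the function name is hypothetical.

```python
from __future__ import division, print_function
import random


def draw_selection_coefficients(mean, shape, n, seed=0):
    """Draw n gamma-distributed selection coefficients with the given mean and shape."""
    rand = random.Random(seed)
    scale = abs(mean) / shape            # gamma mean = shape * scale
    sign = -1.0 if mean < 0 else 1.0     # assumed sign convention
    return [sign * rand.gammavariate(shape, scale) for _ in range(n)]


s_values = draw_selection_coefficients(mean=-0.01, shape=0.1, n=100000)
mean_s = sum(s_values) / len(s_values)
frac_tiny = sum(1 for s in s_values if abs(s) < 1e-4) / len(s_values)
print("mean s:", mean_s)
print("fraction with |s| < 1e-4:", frac_tiny)
```

With a shape parameter this small the distribution is strongly skewed: a large fraction of draws are effectively neutral, while a minority of strongly deleterious mutations carries most of the mean.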

In [4]:
# Recombination:
recregions = [fp.Region(beg=-1,end=2,weight=1)]

## Population size and simulation length

In [5]:
#Population size
#we set to 1/4 Ne for autosomes
N=1000
#We'll evolve for 10N generations.
#nlist is a list of population sizes over time.
#len(nlist) is the length of the simulation
#We use numpy arrays for speed and optimised RAM
#use.  Note the dtype=np.uint32, which means 32-bit
#unsigned integer. Failure to use this type will
#cause a run-time error.
nlist = np.array([N]*10*N,dtype=np.uint32)

In [6]:
#Initialize a random number generator with a fixed seed (238), so runs are reproducible
rng = fp.GSLrng(238)

In [7]:
#Simulate 24 replicate populations. This uses C++11 threads behind the scenes:
pops = fp.evolve_regions(rng,       #The random number generator
                         24,        #Number of populations to simulate (matches the number of empirical populations sampled)
                         N,         #Initial population size for each of the populations
                         nlist[0:], #List of population sizes over time
                         0.005,     #Neutral mutation rate (per gamete, per generation)
                         0.01,      #Deleterious mutation rate (per gamete, per generation)
                         0.5,       #Recombination rate (per diploid, per generation)
                         nregions,  #Defined above
                         sregions,  #Defined above
                         recregions)#Defined above

In [8]:
#Now, pops is a Python list with len(pops) = 24
#Each element's type is fwdpy.singlepop
print(len(pops))
print(type(pops[0]))

24
<type 'fwdpy.fwdpy.singlepop'>


## Taking samples from simulated populations


In [9]:
#Use a list comprehension to get a random sample of size
#n = 24 from each replicate
samples = [fp.get_samples(rng,i,24) for i in pops]

#Samples is now a list of tuples of two lists.
#Each list contains tuples of mutation positions and genotypes.
#The first list represents neutral variants.
#The second list represents variants affecting fitness ('selected' variants)

for i in samples[:24]:
    print ("A sample from a population is a ",type(i))
    
print(len(samples))

A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type 'tuple'>
A sample from a population is a  <type '

## Getting additional information about samples

In [10]:
#Again, use list comprehension to get the 'details' of each sample
#Given that each object in samples is a tuple, and that the second
#item in each tuple represents selected mutations, i[1] in the line
#below means that we are getting the mutation information only for
#selected variants
details = [fp.get_sample_details(i[1],j) for i,j in zip(samples,pops)]

In [11]:
#details is now a list of pandas DataFrame objects
#Each DataFrame has the following columns:
#  a: mutation age (in generations)
#  h: dominance of the mutation
#  p: frequency of the mutation in the population
#  s: selection coefficient of the mutation
#  label: A label applied for mutations for each region.  Here, I use 0 for all regions
for i in details:
    print(i)

       a  h  label       p             s
0   1972  1      0  0.9750 -8.580116e-06
1    161  1      0  0.0045 -1.318960e-04
2    728  1      0  0.1130 -4.667727e-12
3    431  1      0  0.0595 -1.218205e-04
4   3415  1      0  0.7490 -3.898504e-05
5    370  1      0  0.1400 -5.357685e-14
6    442  1      0  0.0690 -1.556923e-09
7   1326  1      0  0.1350 -1.813133e-04
8    571  1      0  0.1605 -4.899956e-13
9   5444  1      0  0.3595 -1.514979e-07
10  1229  1      0  0.0735 -1.104145e-04
11   116  1      0  0.0540 -9.004279e-07
12    53  1      0  0.0550 -1.891849e-09
13  1345  1      0  0.2185 -1.580763e-05
14  1929  1      0  0.2775 -1.027599e-04
15  3092  1      0  0.1990 -3.150916e-06
16  1060  1      0  0.2415 -1.059231e-08
17  3378  1      0  0.1965 -3.581976e-04
18   541  1      0  0.0575 -3.828298e-12
19   487  1      0  0.0925 -3.244818e-07
20  3839  1      0  0.2000 -1.473197e-06
21   234  1      0  0.0375 -1.657054e-17
22    50  1      0  0.0335 -8.484371e-07
23   673  1     

In [12]:
#The order of the rows in each DataFrame is the
#same as the order of the sites in the corresponding element of 'samples':
for i in range(24):
    print("Number of sites in samples[",i,"] = ",
          len(samples[i][1]),". Number of rows in DataFrame ",i,
          " = ",len(details[i].index),sep="")

Number of sites in samples[0] = 111. Number of rows in DataFrame 0 = 111
Number of sites in samples[1] = 100. Number of rows in DataFrame 1 = 100
Number of sites in samples[2] = 99. Number of rows in DataFrame 2 = 99
Number of sites in samples[3] = 97. Number of rows in DataFrame 3 = 97
Number of sites in samples[4] = 109. Number of rows in DataFrame 4 = 109
Number of sites in samples[5] = 110. Number of rows in DataFrame 5 = 110
Number of sites in samples[6] = 115. Number of rows in DataFrame 6 = 115
Number of sites in samples[7] = 89. Number of rows in DataFrame 7 = 89
Number of sites in samples[8] = 114. Number of rows in DataFrame 8 = 114
Number of sites in samples[9] = 105. Number of rows in DataFrame 9 = 105
Number of sites in samples[10] = 90. Number of rows in DataFrame 10 = 90
Number of sites in samples[11] = 107. Number of rows in DataFrame 11 = 107
Number of sites in samples[12] = 90. Number of rows in DataFrame 12 = 90
Number of sites in samples[13] = 102. Number of rows in

In [13]:
#Add three columns to each DataFrame:
#the mutation position,
#the count of the derived state in the sample,
#and a "replicate ID"
for i in range(len(details)):
    ##samples[i][1] again is the selected mutations in the sample taken
    ##from the i-th replicate
    details[i]['pos']=[x[0] for x in samples[i][1]]               #Mutation position
    details[i]['count']=[ x[1].count('1') for x in samples[i][1]] #No. occurrences of derived state in sample
    details[i]['id']=[i]*len(details[i].index)                    #Replicate id
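
The column construction above relies on each site being a `(position, genotype string)` tuple, with `'1'` marking the derived state. A small sketch with made-up data shows what the two list comprehensions extract. The mock sites below are hypothetical and are not simulation output.

```python
from __future__ import print_function

# Hypothetical stand-in for samples[i][1]: one tuple per selected site,
# holding the mutation position and one character per sampled chromosome.
mock_selected = [
    (1.001300, "110100"),
    (1.002744, "000001"),
    (1.018532, "011110"),
]

positions = [site[0] for site in mock_selected]
derived_counts = [site[1].count('1') for site in mock_selected]

print(positions)
print(derived_counts)
```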

In [14]:
##Merge into 1 big DataFrame:
BigTable = pandas.concat(details)

print("This is a merged table:")
print(BigTable)

This is a merged table:
       a  h  label       p             s       pos  count  id
0   1972  1      0  0.9750 -8.580116e-06  1.001300     22   0
1    161  1      0  0.0045 -1.318960e-04  1.002744      1   0
2    728  1      0  0.1130 -4.667727e-12  1.018532      6   0
3    431  1      0  0.0595 -1.218205e-04  1.029117      3   0
4   3415  1      0  0.7490 -3.898504e-05  1.033391     18   0
5    370  1      0  0.1400 -5.357685e-14  1.051165      2   0
6    442  1      0  0.0690 -1.556923e-09  1.052919      2   0
7   1326  1      0  0.1350 -1.813133e-04  1.057491      5   0
8    571  1      0  0.1605 -4.899956e-13  1.061458      2   0
9   5444  1      0  0.3595 -1.514979e-07  1.061666      7   0
10  1229  1      0  0.0735 -1.104145e-04  1.076750      1   0
11   116  1      0  0.0540 -9.004279e-07  1.106480      3   0
12    53  1      0  0.0550 -1.891849e-09  1.116217      2   0
13  1345  1      0  0.2185 -1.580763e-05  1.116720      9   0
14  1929  1      0  0.2775 -1.027599e-04  1.12

## Summary statistics from samples

In [15]:
import libsequence.polytable as polyt
import libsequence.summstats as sstats

#Convert neutral mutations into libsequence "SimData" objects,
#which are intended to handle binary (0/1) data like
#what comes out of these simulations
n = [polyt.simData(i[0]) for i in samples]

#Create "factories" for calculating the summary stats
an = [sstats.polySIM(i) for i in n]

##Collect a bunch of summary stats into a pandas.DataFrame:
NeutralMutStats = pandas.DataFrame([ {'thetapi':i.thetapi(),'npoly':i.numpoly(),'thetaw':i.thetaw()} for i in an ])

NeutralMutStats

   npoly    thetapi     thetaw
0     75  21.358696  20.084131
1     74  19.485507  19.816343
2     83  23.438406  22.226438
3     55  14.851449  14.728363
4     57  14.449275  15.263940
5     72  16.072464  19.280766
6     76  20.561594  20.351919
7     72  20.931159  19.280766
8     63  17.768116  16.870670
9     71  20.670290  19.012977
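
For readers who want to see what these three statistics measure, here is a pure-Python sketch computing them from the same `(position, genotype string)` representation. It uses the standard definitions: $\pi$ as the sum of per-site heterozygosities and Watterson's $\theta_W = S / \sum_{i=1}^{n-1} 1/i$. It is an illustration only and is assumed, not verified, to match what libsequence computes. The function names and mock sites are hypothetical.

```python
from __future__ import division, print_function


def watterson_denominator(n):
    """Return a_n = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    return sum(1.0 / i for i in range(1, n))


def summary_stats(sites):
    """Compute npoly, thetapi and thetaw from (position, genotype string) tuples."""
    if not sites:
        return {'npoly': 0, 'thetapi': 0.0, 'thetaw': 0.0}
    n = len(sites[0][1])  # sample size = characters per genotype string
    npoly = 0
    thetapi = 0.0
    for _pos, genotypes in sites:
        k = genotypes.count('1')          # derived allele count at this site
        if 0 < k < n:                     # only segregating sites contribute
            npoly += 1
            thetapi += 2.0 * k * (n - k) / (n * (n - 1))
    thetaw = npoly / watterson_denominator(n)
    return {'npoly': npoly, 'thetapi': thetapi, 'thetaw': thetaw}


# Hypothetical sample of n = 4 chromosomes at three sites.
mock_sites = [(0.10, "1000"), (0.45, "1100"), (0.80, "1110")]
print(summary_stats(mock_sites))

# Consistency check against the table above: 75 segregating sites, n = 24.
print(75 / watterson_denominator(24))
```

The last line gives approximately 20.084, in agreement with the `thetaw` value reported for replicate 0.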


## The average $\pi$ under the model

Under the BGS model, the expectation of $\pi$ is $E[\pi]=\pi_0e^{-\frac{U}{2sh+r}}$. $U$ is the mutation rate to strongly-deleterious variants, $\pi_0$ is the value expected in the absence of BGS (*i.e.,* $\pi_0 = \theta = 4N_e\mu$), $s$ and $h$ are the selection and dominance coefficients, respectively, and $r$ is the recombination rate. Note that $U$ is per diploid, meaning twice the per gamete rate. (See Hudson and Kaplan 1995 for details).

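For our parameters, $\pi_0 = 4N_e\mu = 4 \times 1000 \times 0.005 = 20$ and $U = 2 \times 0.01 = 0.02$. Taking $2sh = 0.1$ (corresponding to $|s| = 0.05$ and $h = 1$, the values in the constant-$s$ model listed earlier) and $r = 0.5$, we have


$E[\pi]=20e^{-\frac{0.02}{0.1+0.5}}$, which equals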

In [16]:
print(20*math.exp(-0.02/(0.1+0.5)))

19.3443220096
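
The same expectation can be wrapped in a small function so that the recombining and nonrecombining cases are compared directly. This is a sketch of the formula given above, with $s$ entered as a positive magnitude. The function name is hypothetical.

```python
from __future__ import division, print_function
import math


def expected_pi_bgs(pi0, U, s, h, r):
    """E[pi] = pi0 * exp(-U / (2*s*h + r)), with s given as a positive magnitude."""
    return pi0 * math.exp(-U / (2.0 * s * h + r))


# Free recombination (r = 0.5), as in the worked value above.
print(expected_pi_bgs(pi0=20.0, U=0.02, s=0.05, h=1.0, r=0.5))
# No recombination (r = 0), as for a nonrecombining Y chromosome.
print(expected_pi_bgs(pi0=20.0, U=0.02, s=0.05, h=1.0, r=0.0))
```

With $r = 0$ the exponent becomes $-U/(2sh)$, so the predicted reduction in diversity is considerably stronger than with free recombination.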


Now, let's get the average over many more simulated replicates. We already have the 24 replicates from above, so we'll run another 100 sets of 24 populations (2,400 additional replicates). We will use standard Python to grow our collection of summary statistics.

In [18]:
#Run 100 more batches of 24 replicate populations each
for rep in range(100):
    pops = fp.evolve_regions(rng,
                         24,
                         N,
                         nlist[0:],
                         0.005,
                         0.01,
                         0.5,
                         nregions,
                         sregions,
                         recregions)
    #Sample 24 chromosomes from each population
    samples = [fp.get_samples(rng,pop,24) for pop in pops]
    #Neutral variants only (first element of each sample tuple)
    simdatasNeut = [polyt.simData(s[0]) for s in samples]
    polySIMn = [sstats.polySIM(sd) for sd in simdatasNeut]
    ##Append stats into our growing DataFrame:
    NeutralMutStats=pandas.concat([NeutralMutStats,
                                   pandas.DataFrame([ {'thetapi':ps.thetapi(),
                                                       'npoly':ps.numpoly(),
                                                       'thetaw':ps.thetaw()} for ps in polySIMn ])])

## Getting the mean diversity

We’ve collected everything into a big pandas DataFrame. We can easily get the mean of each column using the DataFrame's built-in mean method.

In [19]:
#Get means for each column:
NeutralMutStats.mean(0)

npoly      73.765887
thetapi    19.758085
thetaw     19.753650
dtype: float64

The ‘thetapi’ record is our mean π from all of the simulations.
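
As a quick arithmetic check, the simulated mean can be expressed relative to the neutral value $\pi_0 = 20$ and set beside the analytical expectation computed earlier. The numbers are copied from the outputs above. The function name is hypothetical, and the sketch draws no conclusion about why the two values differ.

```python
from __future__ import division, print_function
import math


def relative_diversity(pi, pi0):
    """Return pi / pi0, the diversity relative to the neutral expectation."""
    return pi / pi0


pi0 = 20.0
simulated_mean_pi = 19.758085                           # 'thetapi' mean reported above
analytical_pi = pi0 * math.exp(-0.02 / (0.1 + 0.5))     # expectation computed earlier

print("simulated pi/pi0: ", relative_diversity(simulated_mean_pi, pi0))
print("analytical pi/pi0:", relative_diversity(analytical_pi, pi0))
print("relative difference:",
      (simulated_mean_pi - analytical_pi) / analytical_pi)
```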

Our neutral expectations for polymorphism levels on the Y chromosome (Y_pi_not), the X chromosome (X_pi_not) and the autosomes (A_pi_not) in the absence of background selection were obtained using equations…. with the empirically estimated sex ratio bias in _R. hastatulus_ populations (__). To match the sample size of our empirical data, we sampled chromosomes from 12 males and 12 females. …. We then calculated simulated pi under a neutral (biased sex ratio) model, simulated pi under background selection (for shape parameters =), and then compared these results with observed levels of diversity (pi observed) on each chromosome type. In particular, we estimated the expected proportional change in the ratio of diversity expected under background selection: [equation] Y_pi_not / A_pi_not : Y_pi_bgs/A_pi_bgs, and assessed whether this was significantly different from the observed proportional change in the


In [None]:
NeutralMutStats.to_csv('a_fixed.csv')