### Estimating divergence time between Jean-Talon and its relatives in generations and years
The method is adapted from Skoglund et al. 2011 (following Green et al. 2006).
It uses triplets: Jean-Talon, a relative, and an outgroup. The outgroup is S. cerevisiae, which is also the reference genome.
First, the divergence between Jean-Talon and the outgroup is estimated with a molecular clock from the number of fixed differences between them (synonymous sites fixed for allele 1 in Jean-Talon). Second, two classes of sites are counted: sites where Jean-Talon and the outgroup share an allele that the relative lacks (genotypes 0, 1, 0 for Jean-Talon, relative, outgroup), and sites where the relative and the outgroup share an allele that Jean-Talon lacks (genotypes 1, 0, 0). The number of these sites divided by the total number of sites carries information about the length of the branch since the split of Jean-Talon from its relative, as a proportion of the branch length since the split of Jean-Talon (or the relative) from the outgroup.
@author:aniafijarczyk
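
For orientation, the quantities computed in the cells of this notebook can be summarised as follows. This is a reading of the notebook's code, not a restatement of the published method, and the symbols are introduced here only for illustration.

```latex
\begin{align*}
k &= \frac{D_{\mathrm{JT}}}{L_{\mathrm{JT}}}, &
t_{\mathrm{out}} &= \frac{k}{2\mu} \\
S_{ABA} &= \frac{n_{ABA}}{L_{\mathrm{pair}}}, &
S_{BAA} &= \frac{n_{BAA}}{L_{\mathrm{pair}}} \\
t_{\mathrm{years}} &= \frac{\tfrac{1}{2}\,(S_{ABA} + S_{BAA})\; t_{\mathrm{out}}}{g}
\end{align*}
```

Here $D_{\mathrm{JT}}$ is the number of synonymous sites fixed for the derived allele in Jean-Talon, $L_{\mathrm{JT}}$ is `jean_talon_synonymous_sites`, $\mu$ is `mutation_rate`, $n_{ABA}$ and $n_{BAA}$ are the counts of the two site patterns, $L_{\mathrm{pair}}$ is the synonymous length for the given relative, and $g$ is the assumed number of generations per year (`gen_year_min` or `gen_year_max`).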

In [1]:
import pandas as pd
import glob
import random
import gzip
import numpy as np
from collections import defaultdict

#### Setting global variables

In [2]:
focal_species = "Jean-Talon"
relatives = ["A.Muntons","A.S-33","A.T-58","BE005","CFI","CFN","CFP"]
mutation_rate = 1.67E-10
gen_year_min = 150 # lower estimate of the number of generations per year (from Gallone et al. 2016)
gen_year_max = 2920 # upper estimate of the number of generations per year (from Fay & Benavides 2005)
jean_talon_synonymous_sites = 1601812.63 # total length of synonymous sites in Jean-Talon (no missing data)

#### Lengths of synonymous sites for each relative paired with Jean-Talon, excluding positions with any missing data

In [3]:
#lengths = pd.read_csv("./input_files/synonymous_length.txt",sep="\t",header=0,names=['strain','length'])
lengths = pd.DataFrame({"strain":["A.2565","A.Muntons","A.S-33","A.T-58","BE005","CFI","CFN","CFP","CHK","Jean-Talon"],
                        "length":[825134.19,1393964.5,921367.81,981293.2,1600532.52,1601356.33,1601365.39,1600888.52,1601484.86,1601812.63]})
lengths.head()

| | strain | length |
|---|---|---|
| 0 | A.2565 | 825134.19 |
| 1 | A.Muntons | 1393964.5 |
| 2 | A.S-33 | 921367.81 |
| 3 | A.T-58 | 981293.2 |
| 4 | BE005 | 1600532.52 |


#### Selecting strain indices for triplets (duplets here - S. cerevisiae is reference genome)

In [4]:
# File with list of sample names from relatives_annot_Filtered2.vcf file
samples = pd.read_csv("./input_files/relatives_annot_Filtered2.samples", sep="\t", header=None, names=["haplotype"])
sample_names = list(samples["haplotype"])
T = [] # one list of haplotype column indices per relative: Jean-Talon haplotypes first, then the relative's
for strain in relatives:
    # exact match on the strain prefix of each haplotype name;
    # a substring test could also pick up strains with similar names
    samp_indices_p1 = [i for i, ele in enumerate(sample_names) if ele.split("_")[0] == focal_species]
    samp_indices_p2 = [i for i, ele in enumerate(sample_names) if ele.split("_")[0] == strain]
    T.append(samp_indices_p1 + samp_indices_p2)
T

[[38, 39, 40, 41, 4, 5, 6, 7],
 [38, 39, 40, 41, 8, 9, 10, 11],
 [38, 39, 40, 41, 12, 13, 14, 15],
 [38, 39, 40, 41, 20, 21, 22, 23],
 [38, 39, 40, 41, 24, 25, 26, 27],
 [38, 39, 40, 41, 28, 29, 30, 31],
 [38, 39, 40, 41, 32, 33, 34, 35]]
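
The index selection can be sketched on toy data as follows. The haplotype labels below are hypothetical and assume the `<strain>_<haplotype>` naming implied by `split("_")[0]`; the helper `haplotype_indices` is not part of the notebook.

```python
def haplotype_indices(sample_names, strain):
    """Return positions of haplotypes whose name prefix equals `strain`."""
    return [i for i, name in enumerate(sample_names)
            if name.split("_")[0] == strain]

# Hypothetical haplotype labels in VCF column order (not the real sample file)
toy_names = ["CF_0", "CF_1", "CFI_0", "CFI_1", "Jean-Talon_0", "Jean-Talon_1"]

focal = haplotype_indices(toy_names, "Jean-Talon")
relative = haplotype_indices(toy_names, "CFI")

# Focal haplotypes first, then the relative's, as in T
print(focal + relative)
```

With exact matching, the hypothetical strain `CF` is not confused with `CFI`; a substring test such as `"CF" in "CFI"` would select both.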

#### Getting synonymous variants
##### Annotating vcf file (File S5 after changing chromosome names to chr<> format)
```console
java -jar snpEff.jar R64-2-1 relatives_annot_Filtered2_R64.2.1.vcf.gz > relatives_annot_Filtered2_R64.2.1_snpEff.vcf
```
##### Extracting synonymous variants in selected genes (./output/manipulateFasta_nonoverlappingCDS.bed) into a table
```console
bcftools query -f '%CHROM\t%POS\t%INFO/ANN\n' -R ./output/manipulateFasta_nonoverlappingCDS.bed \
relatives_annot_Filtered2_R64.2.1.vcf.gz | grep 'synonymous_variant' > \
relatives_annot_synonymous_snpEff.tab
```

#### Reading file with synonymous variants

In [5]:
with gzip.open("./input_files/relatives_annot_synonymous_snpEff.tab.gz", "rt") as handle:
#with gzip.open("./input_files/sample_annot_synonymous_snpEff.tab.gz", "rt") as handle:
    fa = handle.readlines()
ann = [ele.split() for ele in fa]
# key: "<chrom>_<pos>", value: snpEff annotation string
D = {a+"_"+b:c for a,b,c in ann}
print("Number of all synonymous variants = "+str(len(D)))

Number of all synonymous variants = 51608
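
Because a dictionary keeps only the last value for a repeated key, the count printed above is the number of unique `<chrom>_<pos>` keys. A minimal sketch of a duplicate check, using made-up lines and a hypothetical helper `find_duplicate_keys`:

```python
from collections import Counter

def find_duplicate_keys(lines):
    """Return CHROM_POS keys that occur on more than one line."""
    keys = ["_".join(line.split()[:2]) for line in lines if line.strip()]
    return [key for key, n in Counter(keys).items() if n > 1]

# Made-up rows in the same three-column layout (chrom, pos, annotation)
toy_lines = [
    "chrI\t100\tANN_a\n",
    "chrI\t250\tANN_b\n",
    "chrI\t100\tANN_c\n",
]
print(find_duplicate_keys(toy_lines))
```

An empty list would indicate that every position appears once, so no annotation is silently overwritten.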


#### Generating table with genotypes encoded as 0 and 1 from vcf file for selected genes (./output/manipulateFasta_nonoverlappingCDS.bed)
```console
bcftools query -f '%CHROM\t%POS[\t%GT]\n' -R ./output/manipulateFasta_nonoverlappingCDS.bed \
relatives_annot_Filtered2_R64.2.1.vcf.gz | sed 's/\//\|/g' \
| awk  -F"\t" -v OFS="\t" 'function GSUB(F) {gsub(/[|]/,"\t",$F)} \
{GSUB(3);GSUB(4);GSUB(5);GSUB(6);GSUB(7);GSUB(8);GSUB(9);GSUB(10);GSUB(11);GSUB(12);GSUB(13)}1' \
| awk '{if (length($3)==1) print $0}' > relatives_annot_Filtered2_01.tab
```
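
For readers who prefer to stay in Python, the splitting of genotype fields into one character per haplotype can be sketched as below. This is an illustrative alternative, not the pipeline used here: `flatten_genotypes` is a hypothetical helper, the input line is made up, and the sketch does not reproduce the `length($3)==1` filter of the shell command.

```python
import re

def flatten_genotypes(line):
    """Turn 'CHROM<TAB>POS<TAB>GT1<TAB>GT2...' into (key, haplotype string)."""
    fields = line.rstrip("\n").split("\t")
    key = "_".join(fields[:2])
    alleles = []
    for gt in fields[2:]:
        # genotypes may be unphased (/) or phased (|)
        alleles.extend(re.split(r"[/|]", gt))
    return key, "".join(alleles)

# Made-up line: two tetraploid samples (one fully missing) and one diploid sample
toy = "chrI\t100\t0/0/1/1\t./././.\t1|1\n"
print(flatten_genotypes(toy))
```

The one-character-per-haplotype encoding only holds for alleles coded as a single character (`0`, `1` or `.`); multi-digit allele codes would need separate handling.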

#### Reading file with all variant genotypes & keeping only synonymous variants

In [6]:
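with gzip.open("./input_files/relatives_annot_Filtered2_01.tab.gz","rt") as handle:
#with gzip.open("./input_files/sample_annot_Filtered2_01.tab.gz","rt") as handle:
    fh = handle.readlines()
# key: "<chrom>_<pos>", value: string with one character per haplotype (0, 1 or . for missing)
d = {'_'.join(ele.split()[:2]):''.join(ele.split()[2:]) for ele in fh}
# keep only synonymous positions that are present in the genotype table
k = {ele:d[ele] for ele in D if ele in d}
print("Number of filtered synonymous variants = "+str(len(k)))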

Number of filtered synonymous variants = 51608


#### Calculating fixed differences between Jean-Talon & reference (outgroup)

In [7]:
jt = [] # variants fixed for allele 1 in Jean-Talon, i.e. fixed differences from the reference
jt_tot = [] # all synonymous variants with no missing data in Jean-Talon
for pos in k.keys():
    newset = k[pos][-4:] # the last four haplotype columns are taken as Jean-Talon
    if (newset.count('.') == 0):
        jt_tot.append(newset)
        if (newset.count('0') == 0):
            jt.append(newset)
k_rate = len(jt)/jean_talon_synonymous_sites
t_out = k_rate/(2*mutation_rate)
print("Number of fixed synonymous differences between Jean-Talon & reference is "+str(len(jt)))
print("Divergence rate between Jean-Talon & reference is "+str(k_rate))
print("Number of generations since divergence of Jean-Talon with reference is "+str(t_out))

Number of fixed synonymous differences between Jean-Talon & reference is 7697
Divergence rate between Jean-Talon & reference is 0.004805181240205354
Number of generations since divergence of Jean-Talon with reference is 14386770.180255553
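
The molecular clock step can be checked by hand. Under a constant rate, pairwise divergence accumulates along both lineages, so `k = 2 * mutation_rate * t`, which gives the factor of 2 in the denominator. A minimal sketch with the numbers reported above (the function name is introduced here for illustration):

```python
def generations_since_split(fixed_differences, n_sites, mutation_rate):
    """Return (per-site divergence, generations since the split)."""
    k = fixed_differences / n_sites
    return k, k / (2 * mutation_rate)

# Values taken from the notebook: 7697 fixed differences,
# 1601812.63 synonymous sites, mutation rate 1.67E-10
k_rate, t_out = generations_since_split(7697, 1601812.63, 1.67e-10)
print(k_rate, t_out)
```

This should reproduce the divergence rate of about 0.0048 and roughly 14.4 million generations printed above.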


#### Calculating time of split of Jean-Talon with relatives, relative to time of split with reference

In [8]:
S = defaultdict(list)
S2 = defaultdict(list)
for duplex_index in range(len(T)):
    
    sec_strain = relatives[duplex_index]
    print(sec_strain)
    n = []
    for pos in k.keys():
        newset = ''.join([k[pos][i] for i in T[duplex_index]])
        if newset.count('.') == 0:
            n.append(newset)
    #n[:3]
    P = []
    P2 = []
    C2_aba = []
    C2_baa = []
    for site in n:
        # taxon 1 and taxon 2 alleles are drawn at random, one haplotype per taxon
        # (no random seed is set, so the counts vary slightly between runs)
        sp1 = random.choice(site[0:4])
        sp2 = random.choice(site[4:8])
        p1 = "B" if sp1 == '1' else "A"
        p2 = "B" if sp2 == '1' else "A"
        pat = p1+p2+"A"
        P.append(pat)
        
        # alternatively, a taxon is scored as derived (B) if any of its haplotypes carries the derived allele, whatever its frequency
        sp1 = set(list(site[0:4]))
        sp2 = set(list(site[4:8]))
        if sp1.intersection(set('1')): p1 = "B"
        else: p1 = "A"
        if sp2.intersection(set('1')): p2 = "B"
        else: p2 = "A"
        pat2 = p1+p2+"A"
        P2.append(pat2)
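        # frequency of the derived allele among the four haplotypes;
        # counted on the haplotype string, since a set holds each allele at most once
        if pat2 == "ABA":
            C2_aba.append(site[4:8].count("1")/4.)
        elif pat2 == "BAA":
            C2_baa.append(site[0:4].count("1")/4.)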
     
    nnn = lengths.loc[lengths["strain"]==sec_strain,'length'].values[0]
    
    aba = P.count("ABA") 
    baa = P.count("BAA")
    Ss1 = aba/float(nnn)
    Ss2 = baa/float(nnn)
    S['strain'].append(sec_strain)
    S['ABA'].append(aba)
    S['BAA'].append(baa)
    S['Ss_ABA'].append(Ss1)
    S['Ss_BAA'].append(Ss2)
    S['mean_Ss'].append(np.mean([Ss1, Ss2]))
   
    
    aba2 = P2.count("ABA")
    baa2 = P2.count("BAA")
    Ss1 = aba2/float(nnn)
    Ss2 = baa2/float(nnn)
    S2['strain'].append(sec_strain)
    S2['ABA'].append(aba2)
    S2['BAA'].append(baa2)
    S2['Ss1'].append(Ss1)
    S2['Ss2'].append(Ss2)
    S2['meanSs'].append(np.mean([Ss1, Ss2]))
    # rate of aba and baa patterns is multiplied by frequency of corresponding derived mutations
    Ss1_freq = sum(C2_aba)/float(nnn)
    Ss2_freq = sum(C2_baa)/float(nnn)
    S2['Ss1_freq'].append(Ss1_freq)
    S2['Ss2_freq'].append(Ss2_freq)
    S2['meanSs_freq'].append(np.mean([Ss1_freq, Ss2_freq]))
    
dS1 = pd.DataFrame(S)
dS2 = pd.DataFrame(S2)

A.Muntons
A.S-33
A.T-58
BE005
CFI
CFN
CFP
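
The two scoring ideas used in this cell can be sketched on single sites. The helpers `sampled_pattern` and `derived_frequency` are hypothetical, and the seed is an arbitrary choice made here so that a run can be repeated; the notebook itself does not set one.

```python
import random

def sampled_pattern(site, rng, ploidy=4):
    """Draw one haplotype per taxon and return the pattern with outgroup fixed to 'A'."""
    focal = rng.choice(site[:ploidy])
    relative = rng.choice(site[ploidy:2 * ploidy])
    return ("B" if focal == "1" else "A") + ("B" if relative == "1" else "A") + "A"

def derived_frequency(haplotypes):
    """Fraction of haplotypes carrying the derived allele '1'."""
    return haplotypes.count("1") / len(haplotypes)

rng = random.Random(42)  # arbitrary seed (assumption), for repeatable sampling

# Sites fixed within each taxon give the same pattern whatever the draw
print(sampled_pattern("11110000", rng))  # derived only in the focal strain
print(sampled_pattern("00001111", rng))  # derived only in the relative

# Frequency has to be counted on the string, not on a set of alleles
print(derived_frequency("0011"))
print(list(set("0011")).count("1") / 4.0)
```

For a site where the derived allele is at intermediate frequency, the sampled pattern depends on the draw, which is why repeated runs without a seed give slightly different counts.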


#### Calculating divergence times

In [9]:
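# split times in years: relative branch length times t_out, divided by generations per year
# (column suffixes refer to gen_year_min = 150 and gen_year_max = 2920)
dS1['t_aba_150'] = (dS1['Ss_ABA']*t_out)/gen_year_min
dS1['t_baa_150'] = (dS1['Ss_BAA']*t_out)/gen_year_min
dS1['t_150'] = (dS1['mean_Ss']*t_out)/gen_year_min
dS1['t_aba_2920'] = (dS1['Ss_ABA']*t_out)/gen_year_max
dS1['t_baa_2920'] = (dS1['Ss_BAA']*t_out)/gen_year_max
dS1['t_2920'] = (dS1['mean_Ss']*t_out)/gen_year_max
dS1['t_out'] = t_out
dS1['mut_rate'] = mutation_rate
# keep only the relatives reported in the final table
dF = dS1.loc[dS1['strain'].isin(["A.Muntons","A.S-33","BE005","CFI","CFN"]),:]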
dM = pd.merge(dF,lengths,on=['strain'],how='left')
dM

| | strain | ABA | BAA | Ss_ABA | Ss_BAA | mean_Ss | t_aba_150 | t_baa_150 | t_150 | t_aba_2920 | t_baa_2920 | t_2920 | t_out | mut_rate | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A.Muntons | 1478 | 2926 | 0.00106 | 0.002099 | 0.00158 | 101.693868 | 201.323585 | 151.508727 | 5.224 | 10.341965 | 7.782983 | 14386770.0 | 1.67e-10 | 1393964.5 |
| 1 | A.S-33 | 772 | 1721 | 0.000838 | 0.001868 | 0.001353 | 80.363032 | 179.151266 | 129.757149 | 4.128238 | 9.202976 | 6.665607 | 14386770.0 | 1.67e-10 | 921367.81 |
| 2 | BE005 | 4259 | 5575 | 0.002661 | 0.003483 | 0.003072 | 255.220282 | 334.081492 | 294.650887 | 13.110631 | 17.16172 | 15.136176 | 14386770.0 | 1.67e-10 | 1600532.52 |
| 3 | CFI | 4199 | 5768 | 0.002622 | 0.003602 | 0.003112 | 251.495339 | 345.469187 | 298.482263 | 12.919281 | 17.746705 | 15.332993 | 14386770.0 | 1.67e-10 | 1601356.33 |
| 4 | CFN | 4379 | 5682 | 0.002735 | 0.003548 | 0.003141 | 262.274794 | 340.316369 | 301.295581 | 13.47302 | 17.482005 | 15.477513 | 14386770.0 | 1.67e-10 | 1601365.39 |
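
As a worked check of one row, the A.Muntons values can be recomputed from the counts in the table. This sketch only repeats the arithmetic of the cell above with hard-coded inputs; `split_time_years` is a name introduced here for illustration.

```python
def split_time_years(count, n_sites, t_out, gen_per_year):
    """Pattern count per site, scaled by t_out and converted to years."""
    return (count / n_sites) * t_out / gen_per_year

t_out = 14386770.180255553                # generations, from the output of cell 7
aba, baa, length = 1478, 2926, 1393964.5  # A.Muntons row of the table

t_aba_150 = split_time_years(aba, length, t_out, 150)
t_baa_150 = split_time_years(baa, length, t_out, 150)
t_150 = (t_aba_150 + t_baa_150) / 2       # same as using the mean of the two rates

print(round(t_aba_150, 2), round(t_baa_150, 2), round(t_150, 2))
```

The three values should match `t_aba_150`, `t_baa_150` and `t_150` for A.Muntons (about 101.69, 201.32 and 151.51 years).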


#### Saving table

In [10]:
dM.to_csv("./output/calcRelativeTime_Scer.out",sep="\t",index=False,header=True)
#dM.to_csv("calcRelativeTime_Scer_sample.out",sep="\t",index=False,header=True)