## ipcoal: simulation and analysis of coalescent genealogies

The **ipcoal** Python package provides a simple framework for simulating and analyzing genealogies and inferred gene trees under complex demographic scenarios. You can generate demographic models representing population histories, species trees, or networks from a Newick file and easily visualize the model with **toytree**.

**ipcoal** parses the model parameters to set up a simulation in **msprime**, which generates a distribution of genealogies. SNPs, loci, or chromosomes are then simulated on those genealogies under general time-reversible substitution models (similar to seq-gen).

The simulated sequence data can be saved to disk in a variety of formats, or gene tree inference can be run automatically on the simulated sequences. The resulting true genealogies, summary statistics, and inferred trees are returned by **ipcoal** as a pandas DataFrame for further statistical analysis.

### Required software
All required software can be installed with the following conda command. 

In [1]:
# conda install ipcoal -c eaton-lab -c conda-forge

In [2]:
import toytree
import ipcoal

### The main functions of *ipcoal*
Start by initializing a `Model` class object by providing a species tree/network and additional optional model parameters (e.g., Ne, migration, mutation rate, recombination rate). Then you can simulate either loci or SNPs on the genealogies produced under this model. **ipcoal** makes it easy to either write the sequence data to files under a variety of formats, or to perform phylogenetic inference on the sequence data directly. You can then compare true simulated genealogies to the inferred trees. Each of these functions is further demonstrated below.

In [3]:
# init a model Class object for simulations
tree = toytree.rtree.unittree(8, treeheight=1e6)
model = ipcoal.Model(tree=tree, Ne=1e6, seed=12345)

# simulate N unlinked SNPs (will run until N snps are produced)
model.sim_snps(100)

# simulate N loci of len L 
model.sim_loci(10, 300)

# view the genealogies and stats in a table
model.df

# save table to a CSV file
model.df.to_csv("./tree_table.csv")

# view sequence data as an array
model.seqs

# write loci as separate phylip files to a directory
model.write_loci_to_phylip(outdir="./tests")

# write concatenated loci or snps to a single phylip file
model.write_concat_to_phylip(outdir="./tests", name="test.phy")

# infer a tree for every locus
model.infer_gene_trees(inference_method='raxml')

wrote 10 loci (8 x 300bp) to /home/deren/Documents/ipcoal/notebooks/tests/[...].phy
wrote concat loci (8 x 3000bp) to /home/deren/Documents/ipcoal/notebooks/tests/test.phy


In [4]:
model.df.head(20)

|   | locus | start | end | nbps | nsnps | genealogy | inferred_tree |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 68 | 68 | 11 | `((r5:2.63305e+06,r7:2.63...` | `(((r6:0.0207490555588561...` |
| 1 | 0 | 68 | 79 | 11 | 5 | `(((r0:555160,r2:555160):...` | `(((r6:0.0207490555588561...` |
| 2 | 0 | 79 | 184 | 105 | 27 | `(r7:5.4488e+06,((r0:5551...` | `(((r6:0.0207490555588561...` |
| 3 | 0 | 184 | 188 | 4 | 0 | `(r7:5.4488e+06,((r1:2.24...` | `(((r6:0.0207490555588561...` |
| 4 | 0 | 188 | 193 | 5 | 1 | `(r7:5.4488e+06,((r1:2.24...` | `(((r6:0.0207490555588561...` |
| 5 | 0 | 193 | 246 | 53 | 2 | `((r1:2.2438e+06,(r5:1.34...` | `(((r6:0.0207490555588561...` |
| 6 | 0 | 246 | 249 | 3 | 0 | `((r1:2.2438e+06,(r6:1.34...` | `(((r6:0.0207490555588561...` |
| 7 | 0 | 249 | 300 | 51 | 10 | `((r1:2.2438e+06,(r3:1.34...` | `(((r6:0.0207490555588561...` |
| 8 | 1 | 0 | 64 | 64 | 4 | `(((r6:893045,(r4:892520,...` | `(((r7:0.0066955806294154...` |
| 9 | 1 | 64 | 265 | 201 | 22 | `(((r6:893045,(r4:892520,...` | `(((r7:0.0066955806294154...` |


### Define a species/population tree
Node heights should be in units of generations. 
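
If your divergence times are in other units, they need to be converted before being used as node heights. Below is a minimal sketch of two common conversions, assuming a diploid population (so one coalescent unit equals 2Ne generations). The helper names are hypothetical and are not part of ipcoal or toytree.

```python
def years_to_generations(years, generation_time):
    """Convert a time in years to generations, given years per generation."""
    return years / generation_time


def coalescent_units_to_generations(tau, Ne):
    """Convert coalescent units to generations, assuming diploids (2 * Ne)."""
    return tau * 2 * Ne


# a root height of 0.5 coalescent units with Ne=1e6 is 1e6 generations
root_height = coalescent_units_to_generations(0.5, Ne=1e6)

# two million years at two years per generation is also 1e6 generations
root_height_from_years = years_to_generations(2e6, generation_time=2)
```

Either value could then be passed as `treeheight` when building the tree.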

In [5]:
# generate an imbalanced 4-tip tree with a root height of 500K generations
tree = toytree.rtree.imbtree(4, treeheight=5e5)

# draw tree showing idx labels
tree.draw(tree_style='p', tip_labels=True);

### Define an ipcoal simulation model
Here you define the demographic model by setting a global Ne value (which overrides any Ne values stored on the tree) and the mutation and recombination rates. Admixture scenarios are defined with a simple syntax: a list of tuples, each of the form (source, dest, edge_prop, rate). Here edge_prop is a float giving the position of the migration pulse as a proportion of the length of the edge shared by the two taxa, measured from the recent end toward the past, and rate is the proportion of the population that migrates. For example, (7, 4, 0.5, 0.1) means that 10% of population 7 migrates into population 4 (backwards in time) at the midpoint of their shared edge.
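
To make the meaning of edge_prop concrete, here is a rough sketch of how a proportion could be turned into an absolute pulse time. It assumes that the "shared edge" is the time interval in which both the source and destination edges exist, with each edge running from its node's height to its parent's height. The function name and this interval logic are illustrative assumptions, not ipcoal's own code.

```python
def admixture_time(src_edge, dst_edge, edge_prop):
    """Return a pulse time in generations before present.

    Each edge is a (start, end) tuple holding the height of a node and the
    height of its parent. This is an assumed interpretation of edge_prop.
    """
    if not 0 <= edge_prop <= 1:
        raise ValueError("edge_prop must be between 0 and 1")
    start = max(src_edge[0], dst_edge[0])  # most recent time both edges exist
    end = min(src_edge[1], dst_edge[1])    # oldest time both edges exist
    if start >= end:
        raise ValueError("the two edges do not overlap in time")
    return start + edge_prop * (end - start)


# hypothetical edges: source spans 200k-400k generations, destination 0-300k
pulse = admixture_time((2e5, 4e5), (0, 3e5), edge_prop=0.5)
```

In this example the edges overlap between 200k and 300k generations, so a proportion of 0.5 places the pulse at 250k generations.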

In [6]:
model = ipcoal.Model(
    tree,
    Ne=1e6,
    mut=1e-8,
    recomb=1e-9,
    seed=123,
    #admixture_edges=[(6, 4, 0.5, 0.1)],
)

In [7]:
model.sim_snps(10000);

In [8]:
mats = ipcoal.utils.get_snps_count_matrix(tree, model.seqs)

In [44]:
ipcoal.utils.calculate_dstat(model.seqs, 0, 1, 2, 3)

|   | dstat | baba | abba |
|---|---|---|---|
| 0 | 0.04 | 12 | 13 |
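
The values in this table are consistent with the usual definition D = (ABBA - BABA) / (ABBA + BABA), since (13 - 12) / 25 = 0.04. The sketch below counts the two site patterns from a small integer array and applies that formula. It is an illustration only: the function names, the argument order (p1, p2, p3, outgroup as row indices), and the restriction to biallelic sites are assumptions, and this is not the code behind `ipcoal.utils.calculate_dstat`.

```python
def count_abba_baba(seqs, p1, p2, p3, out):
    """Count ABBA and BABA patterns; seqs is a list of rows (one per taxon)."""
    abba = baba = 0
    for site in zip(*seqs):
        a, b, c, o = site[p1], site[p2], site[p3], site[out]
        if len({a, b, c, o}) != 2:
            continue  # keep only biallelic sites
        if a == o and b == c:
            abba += 1  # p2 and p3 share the derived allele
        elif b == o and a == c:
            baba += 1  # p1 and p3 share the derived allele
    return abba, baba


def dstat(abba, baba):
    """Return (ABBA - BABA) / (ABBA + BABA), or NaN if there are no such sites."""
    total = abba + baba
    return (abba - baba) / total if total else float("nan")


# toy data: rows are p1, p2, p3, outgroup; columns are sites
toy = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
toy_counts = count_abba_baba(toy, 0, 1, 2, 3)

# the counts reported in the table above
d_from_table = dstat(abba=13, baba=12)
```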


In [32]:
import toyplot
toyplot.matrix(mats[0], margin=0, width=300);

In [5]:
import simcat
simcat.plot.draw_count_matrix(width=700, height=700);

In [11]:
model.seqs[[0, 1, 2, 3], :]

array([[1, 2, 1, ..., 2, 1, 2],
       [1, 2, 1, ..., 1, 0, 2],
       [1, 2, 2, ..., 1, 1, 2],
       [0, 2, 1, ..., 1, 1, 2]], dtype=uint8)

In [30]:
import itertools
import numpy as np
from ipcoal.jitted import count_matrix_int

# get matrix to fill
nquarts = sum(1 for i in itertools.combinations(range(tree.ntips), 4))
counts = np.zeros((nquarts, 16, 16), dtype=np.int64)

# iterate over all quartets of tips, filling one matrix per quartet
qiter = itertools.combinations(range(tree.ntips), 4)
for quartidx, currquart in enumerate(qiter):
    # rows of seqs are selected by tip idx b/c we named tips node.idx
    quartsnps = model.seqs[list(currquart), :]

    # get counts for this quartet
    counts[quartidx] = count_matrix_int(quartsnps.T)
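
The internals of `count_matrix_int` are not shown in this notebook. As a rough sketch of what a 16 x 16 site count matrix can hold, the function below assumes a layout in which the row index encodes the states of the first two taxa and the column index the states of the last two, with states coded 0-3. The function name and the index ordering are assumptions and may differ from what ipcoal does.

```python
import numpy as np


def quartet_count_matrix(quartsnps):
    """Count site patterns for four taxa into a 16 x 16 matrix.

    quartsnps has shape (4, nsnps) with integer states 0-3. The layout
    (row = taxa 1 and 2, column = taxa 3 and 4) is an assumed convention.
    """
    q = np.asarray(quartsnps, dtype=np.intp)
    if q.ndim != 2 or q.shape[0] != 4:
        raise ValueError("expected an array of shape (4, nsnps)")
    rows = q[0] * 4 + q[1]
    cols = q[2] * 4 + q[3]
    mat = np.zeros((16, 16), dtype=np.int64)
    # np.add.at accumulates correctly when the same cell occurs repeatedly
    np.add.at(mat, (rows, cols), 1)
    return mat


# three toy sites for four taxa (columns are sites)
toy = [[0, 0, 1], [1, 1, 1], [2, 2, 1], [3, 0, 1]]
mat = quartet_count_matrix(toy)
```

Note that this sketch takes the array as (4, nsnps), whereas the cell above passes the transposed array to `count_matrix_int`.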

In [39]:
import toyplot
toyplot.matrix(counts[0], margin=0, width=350, height=350);

### Simulate genealogies and sequences for N independent loci of length L
Because our simulation includes recombination, each locus may represent multiple genealogical histories. You can see this in the dataframe below, where locus 0 is represented by 5 genealogies.

In [6]:
# run the simulation
model.sim_loci(nloci=10, nsites=500)

In [7]:
# view the genealogies and their summary stats
model.df.head(10)

|   | locus | start | end | nbps | nsnps | genealogy |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 104 | 104 | 7 | `(r4:2.67611e+06,(r3:1.11...` |
| 1 | 0 | 104 | 234 | 130 | 11 | `((r0:2.20817e+06,r4:2.20...` |
| 2 | 0 | 234 | 257 | 23 | 3 | `((r4:2.20817e+06,(r0:1.3...` |
| 3 | 0 | 257 | 438 | 181 | 20 | `((r3:2.20817e+06,(r0:1.8...` |
| 4 | 0 | 438 | 500 | 62 | 6 | `((r3:2.20817e+06,r0:2.20...` |
| 5 | 1 | 0 | 146 | 146 | 13 | `((r2:1.3103e+06,r3:1.310...` |
| 6 | 1 | 146 | 163 | 17 | 3 | `((r1:840627,(r4:802207,r...` |
| 7 | 1 | 163 | 175 | 12 | 2 | `((r2:1.3103e+06,r3:1.310...` |
| 8 | 1 | 175 | 212 | 37 | 2 | `(r0:2.79004e+06,((r2:1.3...` |
| 9 | 1 | 212 | 242 | 30 | 3 | `((r1:840627,(r4:802207,r...` |
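
Each row of this table is one genealogy covering the interval from `start` to `end` within its locus. The sketch below uses only the (locus, start, end) values of the locus 0 rows shown above to count genealogies per locus and to check that their intervals tile the locus without gaps. The `summarize` helper is hypothetical; on the real dataframe, `model.df.groupby("locus").size()` should give the same counts.

```python
from collections import defaultdict

# (locus, start, end) copied from the locus 0 rows of the table above
rows = [
    (0, 0, 104), (0, 104, 234), (0, 234, 257), (0, 257, 438), (0, 438, 500),
]


def summarize(rows):
    """Return the genealogy count, span, and contiguity for each locus."""
    by_locus = defaultdict(list)
    for locus, start, end in rows:
        by_locus[locus].append((start, end))
    summary = {}
    for locus, ivals in by_locus.items():
        ivals.sort()
        # each interval should begin where the previous one ended
        contiguous = all(a[1] == b[0] for a, b in zip(ivals, ivals[1:]))
        summary[locus] = {
            "ngenealogies": len(ivals),
            "nsites": ivals[-1][1] - ivals[0][0],
            "contiguous": contiguous,
        }
    return summary


summary = summarize(rows)
```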


### Visualize genealogical variation using toytree
Here the genealogies are plotted with tips in the same order as in the species tree so that you can easily identify the discordance between genealogies and the species tree. 

In [8]:
# load a multitree object from all genealogies in the table
mtre = toytree.mtree(model.df.genealogy)

# draw trees from the first locus
#  with 'shared_axis' to show diff in heights
#  with 'fixed_order' to show diff in topology (relative to first tree)
mtre.draw_tree_grid(
    start=0, ncols=4, nrows=1,
    shared_axis=True,
    fixed_order=tree.get_tip_labels(),
    tree_style='c', 
    node_labels=False, 
    node_sizes=8,
    tip_labels=True,
);

# draw trees from the second locus
mtre.draw_tree_grid(
    start=6, ncols=4, nrows=1,
    shared_axis=True,
    fixed_order=tree.get_tip_labels(),
    tree_style='c', 
    node_labels=False, 
    node_sizes=8,
    tip_labels=True,
);

### Write the simulated sequence data to file

In [9]:
# view the sequence array for the first locus (showing first 20 bp)
model.seqs[0, :, :20]

array([[1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 1, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3],
       [1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 3, 0, 0, 2, 3, 0, 0, 3, 3]],
      dtype=uint8)
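
The sequence array stores bases as small integers. ipcoal's own writers handle the conversion to text, but the idea can be sketched as follows. The mapping of 0, 1, 2, 3 to A, C, G, T is an assumption here, `to_phylip` is a hypothetical helper, and the output uses a relaxed PHYLIP layout (names padded to a common width rather than to exactly 10 characters).

```python
BASES = "ACGT"  # assumed mapping of the integer codes 0..3


def to_phylip(names, rows):
    """Format integer-coded sequences as a relaxed PHYLIP string."""
    if len(names) != len(rows):
        raise ValueError("need one name per sequence")
    nsites = len(rows[0])
    lines = [f"{len(names)} {nsites}"]
    width = max(len(name) for name in names) + 2
    for name, row in zip(names, rows):
        seq = "".join(BASES[int(i)] for i in row)
        lines.append(f"{name:<{width}}{seq}")
    return "\n".join(lines)


# two hypothetical taxa with four sites each
phy = to_phylip(["r0", "r1"], [[1, 1, 0, 2], [1, 1, 1, 2]])
```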

In [10]:
# write all loci as separate phylip files to a directory
model.write_loci_to_phylip()

wrote 10 loci (6 x 500bp) to home/deren/Documents/physeqs/notebooks/ipcoal-sims/[...].phy


In [11]:
# write all loci concatenated to a single sequence file
model.write_concat_to_phylip()

wrote concatenated loci (6 x 5000bp) to /home/deren/Documents/physeqs/notebooks/test.phy


### Simulate N unlinked SNPs

In some cases you may only be interested in sampling unlinked SNPs. This is easy to do in **ipcoal** using the `.sim_snps()` function, which has two modes. The default is to simulate a genealogy and attempt to drop a mutation on it given the mutation rate, generating new genealogies until the requested number of SNPs is reached. This can take a very long time in some cases, and will never finish if the mutation rate is 0. The alternative is to set `repeat_on_trees=True`, which keeps trying to simulate a SNP on each tree until it succeeds before moving on to the next tree. This may be slightly faster but will likely introduce biases, so use it only out of curiosity.
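
One way to see where such a bias could come from is a toy model of the two modes. The sketch below does not call ipcoal or msprime. It assumes that a genealogy can be summarized by its total branch length (drawn here from an arbitrary exponential distribution) and that a site becomes a SNP with probability 1 - exp(-mut * length). All names are hypothetical.

```python
import math
import random


def sample_tree_length(rng):
    """Stand-in for simulating a genealogy: total branch length in generations."""
    return rng.expovariate(1 / 1e6)


def lengths_default_mode(nsnps, mut, rng):
    """Keep drawing new trees; a tree is kept only if its one try yields a SNP."""
    kept = []
    while len(kept) < nsnps:
        length = sample_tree_length(rng)
        if rng.random() < 1 - math.exp(-mut * length):
            kept.append(length)
    return kept


def lengths_repeat_mode(nsnps, mut, rng):
    """Retry on each tree until a SNP appears, so every drawn tree is kept."""
    return [sample_tree_length(rng) for _ in range(nsnps)]


rng = random.Random(123)
default_lengths = lengths_default_mode(2000, 1e-8, rng)
repeat_lengths = lengths_repeat_mode(2000, 1e-8, rng)
mean_default = sum(default_lengths) / len(default_lengths)
mean_repeat = sum(repeat_lengths) / len(repeat_lengths)
```

Under these assumptions the default mode yields SNPs mostly from longer genealogies, because longer trees are more likely to carry a mutation. The repeat mode gives every genealogy exactly one SNP regardless of its length, so short genealogies are over-represented by comparison.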

When simulating SNPs the dataframe in `.df` is not particularly interesting, since every genealogy corresponds to only 1 site and 1 SNP, but it is still useful for testing methods that rely on SNP data as a summary of the genealogy. The sequence data in `.seqs` is now a 2-d array (ntaxa, nsnps), as opposed to the 3-d array (nloci, ntaxa, nsites) produced when simulating loci. The functions that write the data to files work the same as before; for example, you can call `.write_concat_to_phylip()`.

In [12]:
# simulate N unlinked SNPs
model.sim_snps(100)

In [13]:
# the genealogies for each SNP are stored in .df
model.df.head()

|   | locus | start | end | nbps | nsnps | genealogy |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | `(r3:3.96605e+06,(r2:1.64...` |
| 1 | 1 | 0 | 1 | 1 | 1 | `((r1:1.38507e+06,(r0:693...` |
| 2 | 2 | 0 | 1 | 1 | 1 | `((r2:1.12575e+06,r5:1.12...` |
| 3 | 3 | 0 | 1 | 1 | 1 | `((r0:2.09737e+06,r2:2.09...` |
| 4 | 4 | 0 | 1 | 1 | 1 | `((r5:1.10614e+06,(r0:174...` |


In [14]:
# the snp array is stored in .seqs
model.seqs[:, :20]

array([[0, 1, 3, 2, 3, 1, 3, 1, 0, 2, 3, 0, 3, 2, 2, 1, 2, 3, 1, 2],
       [0, 1, 3, 1, 3, 2, 1, 1, 0, 2, 3, 0, 3, 2, 2, 1, 2, 3, 1, 2],
       [0, 1, 3, 2, 2, 1, 1, 1, 1, 2, 3, 0, 3, 2, 0, 3, 2, 3, 1, 2],
       [2, 1, 3, 1, 2, 1, 1, 1, 0, 3, 2, 0, 0, 2, 0, 3, 2, 3, 1, 2],
       [0, 0, 3, 1, 2, 2, 1, 1, 1, 2, 3, 0, 3, 1, 1, 1, 0, 3, 1, 3],
       [0, 0, 1, 1, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 0, 1, 0, 0, 0, 2]],
      dtype=uint8)

In [15]:
# write the snps array as a phylip file
model.write_concat_to_phylip()

wrote concatenated loci (6 x 100bp) to /home/deren/Documents/physeqs/notebooks/test.phy


### Infer gene trees 
Writing the sequence data to disk is optional and not required for some types of analyses, since *ipcoal* has built-in tools for inferring gene trees from the sequence data while it is held in memory. This makes for a simple and reproducible workflow that depends only on the random seed used for your analysis, with no need to upload your simulated files to Dryad at the end of your project.

When you call one of the *inference* methods it fills a new column in your dataframe called **inferred_tree**.

In [16]:
# simulate locus data
tree = toytree.rtree.unittree(8, treeheight=1e6)
model = ipcoal.Model(tree=tree, Ne=1e5)
model.sim_loci(10, 500)

In [17]:
model.infer_gene_trees(inference_method="raxml")

In [18]:
model.df

|   | locus | start | end | nbps | nsnps | genealogy | inferred_tree |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 168 | 168 | 10 | `((r7:883806,(r4:758813,r...` | `((r5:0.00401358103372166...` |
| 1 | 0 | 168 | 216 | 48 | 6 | `(r0:1.5277e+06,((r7:8838...` | `((r5:0.00401358103372166...` |
| 2 | 0 | 216 | 277 | 61 | 4 | `(r0:1.5277e+06,((r4:8838...` | `((r5:0.00401358103372166...` |
| 3 | 0 | 277 | 308 | 31 | 1 | `((r0:815595,(r5:580343,r...` | `((r5:0.00401358103372166...` |
| 4 | 0 | 308 | 467 | 159 | 13 | `((r5:882470,(r0:815595,r...` | `((r5:0.00401358103372166...` |
| 5 | 0 | 467 | 500 | 33 | 2 | `((r2:519745,r1:519745):7...` | `((r5:0.00401358103372166...` |
| 6 | 1 | 0 | 131 | 131 | 11 | `((r6:1.21659e+06,(r7:616...` | `(((r6:0.0162177841210500...` |
| 7 | 1 | 131 | 500 | 369 | 33 | `((r6:1.21659e+06,(r7:616...` | `(((r6:0.0162177841210500...` |
| 8 | 2 | 0 | 500 | 500 | 25 | `((r7:819768,r6:819768):2...` | `(((r1:0.0000010000005000...` |
| 9 | 3 | 0 | 73 | 73 | 11 | `((r2:567261,r1:567261):8...` | `((r6:0.01011614698215194...` |
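
Each row pairs a true genealogy with the tree inferred for its locus, so the two Newick strings can be compared directly. One common measure is the Robinson-Foulds distance, the number of splits found in one tree but not the other. The sketch below is a small pure-Python version that treats both trees as unrooted and assumes plain tip labels without quotes or spaces. The function names are hypothetical and this is not an ipcoal or toytree function. Because the tree strings in the table are truncated, it is shown on made-up four-tip trees.

```python
import re


def clades(newick):
    """Return the set of tip-label sets found under each internal node."""
    found, stack, prev = set(), [], None
    for tok in re.findall(r"[(),;]|[^(),;]+", newick):
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = frozenset(stack.pop())
            found.add(done)
            if stack:
                stack[-1].update(done)
        elif tok not in (",", ";") and prev != ")" and stack:
            # a tip entry such as "r0:555160"; text directly after ")" is an
            # internal node label or branch length and is skipped
            label = tok.split(":")[0].strip()
            if label:
                stack[-1].add(label)
        prev = tok
    return found


def splits(newick):
    """Return the non-trivial unrooted splits of a tree."""
    found = clades(newick)
    tips = max(found, key=len)  # the root clade holds every tip
    ref = min(tips)             # report the side that lacks this tip
    result = set()
    for clade in found:
        side = tips - clade if ref in clade else clade
        if 1 < len(side) < len(tips) - 1:
            result.add(frozenset(side))
    return result


def rf_distance(newick1, newick2):
    """Count the splits present in exactly one of the two trees."""
    return len(splits(newick1) ^ splits(newick2))


# made-up example trees
true_tree = "((a:1,b:1):1,(c:1,d:1):1);"
same_unrooted = "(a:1,b:1,(c:1,d:1):1);"
different = "((a:1,c:1):1,(b:1,d:1):1);"
dist_same = rf_distance(true_tree, same_unrooted)
dist_diff = rf_distance(true_tree, different)
```

Keep in mind that a locus with recombination has several true genealogies but only one inferred tree, so you need to decide which genealogy (or summary of them) to compare against.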


In [19]:
# save the dataframe with the inferred trees 
model.df.to_csv("./tree_table.csv")

### Write data as a site count matrix (*sensu* SVDquartets)

In [20]:
# for idx, mat in enumerate(snps.reshape((5,16,16))):
#     toyplot.matrix(mat, label="Matrix " + str(idx), colorshow=True);