# Genealogical variation *within* a population


### Notebook outline:
1. Visualize genealogical variation within a population.
2. Visualize substitutions on genealogies.
3. Visualize DNA sequence alignments.
4. Compute population genetic diversity $\theta = 4N_e\mu$.

### Learning objectives: 
By the end of this notebook you should:
1. Be a little more familiar with the `toytree` and `ipcoal` Python libraries.
2. Understand that within a population there is not one phylogenetic history, but many.
3. Understand that sequence variation arises from mutations on genealogies.


### Additional recommended reading:

- [Rosenberg and Nordborg (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics](https://eaton-lab.org/slides/genomics/readings/Rosenberg-and-Nordborg-2002.pdf)

In [None]:
import ipcoal
import toytree
import itertools
import numpy as np

### Terminology (genealogies and gene trees)
This notebook aims to demonstrate and reinforce an understanding of the difference between the *true* genealogical history of a set of samples, and the *inferred* gene tree history that is estimated from observable sequence data. The following terminology is helpful in this respect. 

**Genealogies** are the true unobserved histories of a set of sampled gene copies from one or more populations. The sampled copies share a common ancestor at some point in their past, and the relationships among the samples and their ancestors trace back a tree-like history. In sexually reproducing diploid species, gene copies will have different genealogical histories at different unlinked regions of the genome, because they trace back to different ancestors.

**Gene trees** are estimates of the genealogies. Because we cannot observe genealogies directly, we must infer them from the data that we can observe, in the form of substitutions that evolved on those genealogies and left information in genetic sequence data.

### Simulating trees and mutations
In the examples below we will simulate sequence data on one or more genealogies. In contrast to our previous notebook, where we wrote a custom coalescent simulation script, we will now use a Python package built for this purpose, named `ipcoal`. In addition to simulating coalescent trees, it can also simulate the process of mutation. Once again, this process is implemented much as it would be in a Wright-Fisher (WF) model: each generation a gene copy can experience a mutation, in which case it represents a new allele. 

The coalescent model is a particularly efficient method for studying genetic variation because we only need to keep track of the mutations that occur in the history of the samples that we are studying, not the history of all mutations that occurred in every other gene copy in the population. This is because we know that our sampled gene copies trace back to a most recent common ancestor (MRCA) at some point in time. Thus, all of our sampled gene copies were represented by the same allele at that time, and the only way they could have changed since then is by mutation. Genetic variation among a set of sampled gene copies therefore originates from new mutations that have arisen since their MRCA. 

If we assume a neutral model in which mutations have no fitness effect and occur at a constant rate, then we can simulate mutations as a random process that occurs along the edges of a genealogy. Thus, coalescent simulations of sequence data follow a two-step approach: first simulate coalescent genealogies, and then simulate a random point process of mutation along the branches of each genealogy. This is demonstrated below. 
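The two-step approach described above can be sketched with only the Python standard library. This is a minimal illustration, not how `ipcoal` is implemented: the function names are hypothetical, it assumes a diploid population (so $k$ lineages coalesce at rate $\binom{k}{2} / 2N_e$ per generation), and it only counts mutations rather than placing them on specific edges or sites.

```python
import random


def simulate_coalescent_times(nsamples, Ne, rng):
    """Step 1: draw the waiting time (in generations) for each coalescent interval.

    Returns a dict mapping k (number of lineages) to the time spent with k lineages.
    Assumes a diploid population of size Ne (2Ne gene copies).
    """
    times = {}
    for k in range(nsamples, 1, -1):
        # k lineages contain k(k-1)/2 pairs, each coalescing at rate 1/(2Ne)
        rate = (k * (k - 1) / 2) / (2 * Ne)
        times[k] = rng.expovariate(rate)
    return times


def total_branch_length(times):
    """Sum of all edge lengths: k lineages each persist for times[k] generations."""
    return sum(k * t for k, t in times.items())


def count_mutations(tree_length, mut, nsites, rng):
    """Step 2: count events of a Poisson point process along the genealogy.

    Mutations arrive at rate mut * nsites per generation along every edge, so
    waiting distances between events along the total tree length are exponential.
    """
    rate = mut * nsites
    count = 0
    position = rng.expovariate(rate)
    while position < tree_length:
        count += 1
        position += rng.expovariate(rate)
    return count


# example: parameters chosen for illustration only
rng = random.Random(333)
times = simulate_coalescent_times(nsamples=8, Ne=2e5, rng=rng)
tree_length = total_branch_length(times)
nmuts = count_mutations(tree_length, mut=1e-7, nsites=100, rng=rng)
print(f"tree length: {tree_length:.0f} generations, mutations: {nmuts}")
```

Re-running with different seeds shows how much both the tree length and the number of mutations vary from one genealogy to the next, even when the parameters are held fixed.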

### Simulate a genealogy

In this example we simulate a single locus that is 100bp in length for 8 gene copies sampled from a single population. A print statement reports the number of genealogies that span this locus, along with the number of sites and substitutions. A gene tree is then inferred with RAxML-NG, and its topology can be compared to the genealogy of the simulated locus. The inferred gene tree may not match the underlying genealogy exactly, because a short 100bp locus contains a limited amount of information. 

In [None]:
# seed for random number generator (RNG)
SEED = 333

# setup simulation with small Ne
model1 = ipcoal.Model(
    Ne=2e5, 
    nsamples=8,
    recomb=0, 
    mut=1e-7,
    store_tree_sequences=True, 
    seed_mutations=SEED, 
    seed_trees=SEED,
)

# simulate one short uninformative locus
model1.sim_loci(nloci=1, nsites=100)

In [None]:
# print information about the simulation
print(f"simulated {model1.df.shape[0]} genealogies with "
      f"{model1.df.nbps.sum()} sites with "
      f"{model1.df.nsnps.sum()} substitutions.")

In [None]:
# visualize the first 100bp of simulated data
model1.draw_seqview(show_text=True, scrollable=True, end=100);

### Substitutions occur on genealogies
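There is a direct connection between genealogies and sequence variation at a locus. Specifically, each substitution arises as a mutation on a single edge of the genealogy, and is inherited by all of the samples that descend from that edge.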


In [None]:
# visualize the substitutions on the genealogy
model1.draw_genealogy(layout='d', show_substitutions=True);

### Infer an empirical gene tree
Following our terminology from the beginning of this notebook, the <span style="color:tomato; font-weight:600">gene tree is a hypothesis for the genealogy inferred from data</span>. The function below uses a maximum likelihood method of phylogenetic inference to infer a gene tree from the simulated sequence data. We then plot the inferred gene tree below it. Compare this inferred tree to the true genealogy with substitutions mapped onto its edges, from above. 

In [None]:
# infer an empirical gene tree from the sequence data
gene_tree = ipcoal.phylo.infer_raxml_ng_tree(model1)

In [None]:
# draw the inferred gene tree
gene_tree.draw(layout='d');


<div class="alert alert-success">
    <b>Action:</b> 
Try editing parameters of the simulation above, such as the number of simulated sites (`nsites`), the mutation rate (`mut`), or `Ne`, and re-execute the last few cells. What types of changes make it more likely that the inferred gene tree will closely match the coalescent genealogy?
</div>
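One way to reason about this exercise is to ask how many variable sites a locus is expected to contain. The sketch below uses Watterson's classic expectation for the number of segregating sites, $E[S] = \theta L \sum_{i=1}^{n-1} 1/i$ with $\theta = 4N_e\mu$, which assumes an infinite-sites model; a finite-sites simulation can show somewhat fewer variable sites when mutations hit the same site more than once. The function name is hypothetical and is not part of `ipcoal`.

```python
def expected_segregating_sites(Ne, mut, nsamples, nsites):
    """Expected number of variable sites at a locus under infinite sites."""
    # per-site population mutation rate for a diploid population
    theta = 4 * Ne * mut
    # harmonic sum over 1 .. nsamples - 1
    a_n = sum(1 / i for i in range(1, nsamples))
    return theta * nsites * a_n


# compare loci of increasing length using the parameter values from the cells above
for nsites in (100, 1000, 10000):
    expected = expected_segregating_sites(Ne=2e5, mut=1e-7, nsamples=8, nsites=nsites)
    print(f"nsites={nsites}: expected variable sites = {expected:.1f}")
```

Because the expectation is a product of `Ne`, `mut`, and `nsites`, increasing any one of them increases the expected amount of information available for gene tree inference.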

### Gene tree variation
In summary, a gene tree is a hypothesis for the genealogy: an attempt to infer the correct relationships among gene copies, and sometimes also the correct branch lengths. The gene tree may be incorrect for a number of reasons, including an insufficient number of substitutions at a locus, or multiple mutations at the same site (homoplasy). Another common source of gene tree inference error is recombination. In the examples above we set `recomb=0`, so it had no effect, but if we allowed recombination and simulated a 10,000 bp locus, the locus may actually span multiple different genealogies with recombination breakpoints between them. In that case there is no single correct gene tree for the locus. 

### Sequence variation in genomes
We have seen previously that within a single population the genealogies at different unlinked regions of the genome are completely independent. Because we assume random mating within populations, the topologies are random, while the coalescent waiting times follow a distribution that is determined by the population parameter, Ne. <span style="color:tomato; font-weight:600">The variation in coalescent waiting times among gene copies is important in determining the opportunity over which mutations can arise, causing sequence variation.</span> A single coalescent genealogy is too variable to tell us much about a population, but by examining many of them we can gain information about the population. 

Let's simulate a genomic data set similar to what we would obtain by sequencing a large number of unlinked loci to measure genetic variation. The cells below simulate a sequence matrix containing data from 1K simulated loci (genealogies). We then measure the average proportion of sites that differ between any two samples in the population, and also show the theoretical expectation for the genetic diversity of this population ($4N_e\mu$). After completing this part once, try changing some parameters in the "set variables" cell below, such as increasing or decreasing the population size, and run the code again to see the effect. 
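The theoretical expectation can be recovered in two short steps, assuming a diploid Wright-Fisher population and a per-site, per-generation mutation rate $\mu$. The symbols $T_2$ (the coalescence time of two gene copies, in generations) and $\pi$ (the per-site proportion of differences between them) are introduced here only for illustration. Two gene copies are expected to coalesce $2N_e$ generations ago, and mutations can arise on either of the two branches leading back to their common ancestor:

$$
E[T_2] = 2N_e, \qquad E[\pi] = 2\mu \, E[T_2] = 4N_e\mu
$$

With the values set in the cell below, this gives $4 \times 10000 \times 10^{-7} = 0.004$.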

In [None]:
# set variables
Ne = 10000
nsamples = 8
mut = 1e-7
nloci = 1000
loclen = 20

In [None]:
# simulate sequence data for a single population at many loci
model = ipcoal.Model(Ne=Ne, nsamples=nsamples, mut=mut)
model.sim_loci(nloci=nloci, nsites=loclen)

# combine the loci into one large sequence matrix
model.seqs = np.concatenate(model.seqs, 1)

In [None]:
def get_pop_gen_diversity(seqs):
    """Return the proportion of differing sites for every pair of samples."""
    nsamples = seqs.shape[0]

    # iterate over every unordered pair of distinct gene copies
    theta = []
    for i, j in itertools.combinations(range(nsamples), 2):
        # proportion of sites at which the two gene copies differ
        theta.append(np.mean(seqs[i, :] != seqs[j, :]))
    return theta

#### Finally, measure theta across all loci
For every pair of gene copies this simply returns the proportion of sites that are different.
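As a small worked example of this calculation, the sketch below applies the same pairwise comparison to a made-up alignment of three short sequences. The function name and the toy data are hypothetical and use plain strings rather than the simulated array.

```python
import itertools


def pairwise_proportion_diffs(seqs):
    """Return the proportion of differing sites for each pair of sequences."""
    diffs = []
    for a, b in itertools.combinations(seqs, 2):
        # count the sites at which the two sequences carry different bases
        ndiff = sum(x != y for x, y in zip(a, b))
        diffs.append(ndiff / len(a))
    return diffs


# toy alignment: three gene copies, five sites
toy = ["AACGT", "AACGA", "TACGA"]
diffs = pairwise_proportion_diffs(toy)
print(diffs)                    # one value per pair of gene copies
print(sum(diffs) / len(diffs))  # average pairwise diversity
```

The first and second sequences differ at one of five sites, the first and third at two, and the second and third at one, so the pairwise values are 0.2, 0.4, and 0.2.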

In [None]:
# report results
theta = get_pop_gen_diversity(model.seqs)
print("measured population genetic diversity (theta)")
print('mean={:.4f}, std={:.4f}'.format(np.mean(theta), np.std(theta)))
print("\ntheoretical expectation (4Neu): {}".format(4 * Ne * mut))