# Wright-Fisher Model exercise

### Learning objectives

By the end of this lesson you will be familiar with:

1. The Wright-Fisher (WF) process model.
2. Assumptions underlying this model.
3. Calculating the probability that two gene copies share a parent.
4. Extensions of this probabilistic thinking to the study of genealogies.

In [None]:
from toytree.learn import wright_fisher_simulation

### Wright-Fisher Process

The Wright-Fisher process is considered one of the simplest models in population genetics. Compared to an idealized population, in which allele frequencies are always in Hardy-Weinberg (HW) equilibrium and do not change, the WF process by default relaxes at least one assumption: the population size is not infinite.

The introduction of finite population sizes allows allele frequencies to change over time, as genetic drift will cause an allele to randomly become either fixed or lost, at a rate that is a function of the population size (N). In this lesson we will simulate and visualize a population evolving under a WF process to understand how population size affects not only the rate of genetic drift, but also the genealogical relationships among samples. Ultimately, we will see that these two results are linked. Before we proceed, however, it is important to discuss some terminology.

### Effective population size

One of the most challenging concepts in population genetics is that of effective population size (Ne). This is not to be confused with the population census size, which is the number of diploid individuals in a population (N).

The concept of effective population size was first developed by Sewall Wright (1931, 1933) as a metric for describing the rate of genetic drift, or inbreeding. Specifically, it was developed in reference to populations evolving under a WF process, where an idealized population will experience changes in allele frequencies over time due to only a single evolutionary force, genetic drift caused by a finite population size.

The effective population size is usually smaller than the actual population census size, reflecting the effects of deviations from the assumptions of an idealized population. In the simplest WF process model it is easy to interpret, since Ne is equivalent to the census population size, N. In more complex models, where we might add the effects of selection or migration, the two values deviate. In that case, Ne becomes a summary statistic for the impacts of many possible evolutionary processes that violate the assumptions of an idealized population, causing genetic drift and inbreeding at higher rates than would occur in an idealized population of the same size.

For example, an idealized population of census size N=10,000 will have an effective population size of Ne=10,000. By contrast, a different population of census size N=10,000 that experiences non-random mating (or other violations of the simple WF model) may have an effective population size of Ne=6,000. This value can be interpreted to mean that the second population will experience genetic drift at the same rate as an idealized population of size N=6,000, even though it in fact contains 10,000 individuals.
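
One way to make this comparison concrete is through the expected loss of heterozygosity under drift. The sketch below assumes the standard result that expected heterozygosity shrinks by a factor of (1 - 1/(2Ne)) each generation. That formula and the function name `expected_heterozygosity` are not part of this lesson's code; they are introduced here only for illustration.

```python
def expected_heterozygosity(h0, ne, generations):
    """Expected heterozygosity after some number of generations of drift.

    Assumes the standard decay H_t = H_0 * (1 - 1/(2*Ne))**t, which is
    an illustrative assumption and not taken from the lesson's own code.
    """
    return h0 * (1 - 1 / (2 * ne)) ** generations


# Compare the two populations from the example above, both starting
# from the same heterozygosity.
for ne in (10_000, 6_000):
    h = expected_heterozygosity(h0=0.5, ne=ne, generations=1000)
    print(f"Ne={ne}: expected heterozygosity after 1000 generations = {h:.4f}")
```

The population with the smaller Ne loses variation faster, even though both populations contain the same number of individuals.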


### Visualizing the WF process

To understand WF models and their connection to genealogies, it is helpful to visualize stochastic outcomes of a WF process. The following visualizations are created using the Python package toyplot, with random statistical sampling done in numpy. Here we simply need to execute the code below to generate and draw WF process outcomes.

## Simulating WF process

A WF process can be simulated quite easily. It is a discrete-time process where each generation 2N haploid gene copies exist in a population of census size N. Because the population is assumed to be randomly mating we can randomly assign haploid gene copies to be grouped into diploid individuals. This grouping has no effect on how the model operates unless some other model assumptions were to be violated, and is thus only for visualization purposes. To simulate one generation of the WF process one needs only to create a new generation of 2N haploid copies and to randomly sample 2N copies, with replacement, from the previous generation to serve as their parents.
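
The sampling step described above can be sketched in a few lines of numpy. This is a minimal illustration and not the implementation used by `wright_fisher_simulation`; the variable names (`ncopies`, `parents`, `pedigree`) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(123)  # seeded only so the sketch is repeatable
N = 10                            # census size (diploid individuals)
ncopies = 2 * N                   # haploid gene copies per generation

# One generation: each of the 2N offspring copies draws its parent
# uniformly at random, with replacement, from the 2N copies before it.
parents = rng.integers(0, ncopies, size=ncopies)

# Repeating the same draw gives a multi-generation process. Each row
# holds the parent index chosen by every copy in one generation.
ngens = 5
pedigree = rng.integers(0, ncopies, size=(ngens, ncopies))

# Number of offspring left by each parental copy in the first draw.
offspring_counts = np.bincount(parents, minlength=ncopies)
print(offspring_counts)
```

Because sampling is with replacement, some parental copies are drawn several times and others are never drawn at all, which is what produces the branching and dead-end lines in the plot.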

The result of this random mating process, played out over multiple generations, looks like the plot below. The process gives rise to a genealogy of gene copies. Lines are drawn to connect gene copies in each generation from parent to offspring. Grey rectangles represent diploid individuals, which contain two gene copies, represented by circles.

In [None]:
wright_fisher_simulation(time=11, popsize=10, show_diploids=True, sort_edges=False);

### Interpreting WF genealogies

This process is much easier to comprehend when the edges are sorted so that they are easier to read. Remember, however, that regardless of this sorting, which is done only for visualization purposes, the parent of each gene copy is still sampled randomly each generation. The sorted visualization thus retains the same probability of a gene copy leaving descendants in the next generation, but it gives a biased appearance of greater selfing (one or both copies in an individual ending up together in an individual in the next generation). To be more accurate, the outlines of the diploid individuals are therefore hidden after sorting the edges, as we do here.

Time in this example flows from top to bottom, such that the gene copies at time=0 represent the present. Their ancestors, as well as other gene copies that did not leave descendants, exist in past generations (generations 1-9 in this visualization).

In [None]:
wf = wright_fisher_simulation(time=10, popsize=11, sort_edges=True)

### Interpreting Time
Understanding the flow of time in these diagrams is very important, and can sometimes be challenging, since different authors may present the same type of figure flowing top to bottom, or bottom to top. Fortunately, there is a simple trick to easily identify the flow of time, which is to identify the direction in which every node has an edge flowing. That direction will be backwards in time. This is because every gene copy has an ancestor, but not every gene copy will leave descendants. This is also the case in biological evolution: every living individual and gene copy has an ancestor from which they are descended.

The plot below zooms in on just 3 generations of a WF process simulation. You can see that every gene copy has an edge pointing up (backwards in time), but many gene copies do not have an edge pointing down.

As we said earlier, mating occurs randomly in this model, without regard to the way in which gene copies are grouped inside of diploid individuals. Consequently, it is possible for odd mating outcomes to occur, such as selfing. In the figure below, generated with the random seed 123, look at the first individual from the left in generation 0. The first gene copy is inherited from the diploid individual above it in the previous generation, but the other copy comes from a very distant individual. By contrast, look at the individual second from the right in generation 0. It inherited both copies from the same parent (selfing). Outcomes like this can occur by chance in this model. In fact, the Wright-Fisher process is what defines the expected level of inbreeding under random mating in a population of size N.

In [None]:
wright_fisher_simulation(
    time=3, popsize=11, sort_edges=False, show_diploids=True, seed=123, height=275,
);
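
The observation that many gene copies have no edge pointing down can be quantified. Under the sampling scheme described earlier, a given copy is missed by all 2N draws with probability (1 - 1/(2N))^(2N). The sketch below checks that value against a simple simulation; the helper name `fraction_without_descendants` is hypothetical and the code is independent of `wright_fisher_simulation`.

```python
import numpy as np


def fraction_without_descendants(popsize, nreps=2000, seed=0):
    """Average fraction of gene copies that no offspring copy picks as parent."""
    rng = np.random.default_rng(seed)
    ncopies = 2 * popsize
    fractions = []
    for _ in range(nreps):
        # each offspring copy samples one parent, with replacement
        parents = rng.integers(0, ncopies, size=ncopies)
        nparents = np.unique(parents).size
        fractions.append(1 - nparents / ncopies)
    return float(np.mean(fractions))


popsize = 11
expected = (1 - 1 / (2 * popsize)) ** (2 * popsize)
simulated = fraction_without_descendants(popsize)
print(f"expected fraction with no descendants:  {expected:.3f}")
print(f"simulated fraction with no descendants: {simulated:.3f}")
```

Roughly a third of the gene copies in any generation leave no descendants in the next one, purely by chance.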

### Genetic drift
Some gene copies produce descendants in the next generation while others do not, simply as an outcome of a random sampling process. This is an example of genetic drift. We demonstrated in the last notebook how genetic drift can lead to changes in allele frequencies over time, causing the eventual loss of genetic variation at a locus by fixing one or another allele.

We can follow this same process below in a more detailed way by assigning different allele states to the gene copies in our Wright-Fisher process. Here, in the first generation (generation 30) we randomly assign gene copies to different alleles (green or orange). This represents genetic variation at the beginning of this simulation where the allele frequencies are p=0.5 and q=0.5. 

Each generation, as a new population of gene copies is created, the allele frequencies in the population change slightly, reflecting differential replication of the alleles in the previous generation simply due to chance. In this example, the green allele is lost by about generation 15 (the orange allele becomes fixed).

In [None]:
wright_fisher_simulation(
    time=31, popsize=11, sort_edges=True, seed=123, allele_frequency=0.5, 
    node_size=7, height=600
);
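
The same drift process can be followed numerically by tracking only the allele frequency. The sketch below assumes that the number of copies of one allele in the next generation is a binomial draw of 2N trials at the current frequency, which is equivalent to sampling 2N parents with replacement. The function name `drift_trajectory` is made up for this example.

```python
import numpy as np


def drift_trajectory(popsize, p0=0.5, ngens=30, seed=None):
    """Allele frequency over time under WF drift (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    ncopies = 2 * popsize
    freqs = [p0]
    for _ in range(ngens):
        # number of offspring copies that inherit the focal allele
        count = rng.binomial(ncopies, freqs[-1])
        freqs.append(count / ncopies)
    return freqs


freqs = drift_trajectory(popsize=11, p0=0.5, ngens=30, seed=123)
print([round(f, 2) for f in freqs])
```

Once the frequency reaches 0 or 1 it stays there, since no new variation is introduced in this model.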

### Genealogy of samples

A consequence of this simple discrete random sampling process, repeated over multiple generations, is that any gene copy that exists at the present can trace back a tree-like genealogical history with any other gene copy at the present. As an example, we can randomly sample k gene copies at the present and focus on their individual ancestries. As we can see, they each trace back some number of generations before they eventually share a common ancestor: an ancestral gene copy from which they are all descended.

This is demonstrated below. Here we sample 6 gene copies, which are shown with a bold outline compared to other unsampled gene copies and their ancestors. All 6 of the gene copies that we sampled in this scenario coalesce into a single common ancestral gene copy at generation 9 (9 generations back from the present). If we ran it again with a different random seed, or if we sampled a different set of 6 gene copies, the paths connecting them would be slightly different. This tree of ancestor-descendant relationships among gene copies is called a **genealogy**.

In [None]:
wright_fisher_simulation(
    time=31, popsize=11, sort_edges=True, seed=123, allele_frequency=0.5, 
    node_size=7, height=600, nsamples=6,
);
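
Tracing sampled gene copies back to their common ancestor can also be sketched without drawing the full population. Viewed backwards in time, each sampled lineage picks a parent at random among the 2N copies of the previous generation, and lineages that pick the same parent merge. The function name `generations_to_mrca` and its `max_gens` cap are assumptions made for this example; this is not how `wright_fisher_simulation` is implemented.

```python
import numpy as np


def generations_to_mrca(popsize, nsamples, max_gens=10_000, seed=None):
    """Generations back until all sampled copies share one ancestor."""
    if nsamples < 2:
        raise ValueError("nsamples must be at least 2")
    rng = np.random.default_rng(seed)
    ncopies = 2 * popsize
    # start from nsamples distinct gene copies at the present
    lineages = rng.choice(ncopies, size=nsamples, replace=False)
    for gen in range(1, max_gens + 1):
        # every remaining lineage picks a random parent; lineages that
        # pick the same parent collapse into one
        lineages = np.unique(rng.integers(0, ncopies, size=lineages.size))
        if lineages.size == 1:
            return gen
    return None  # no common ancestor found within max_gens


print(generations_to_mrca(popsize=11, nsamples=6, seed=123))
```

Running this with different seeds gives different answers, which mirrors the variation among replicate plots.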

<div class="alert alert-block alert-info">
    <h3>Action: Run this code to answer the questions</h3>
    <p>
        Try modifying the code below to explore how the population size (popsize) and number 
        of sampled gene copies (nsamples) affect the time until all sampled gene copies 
        coalesce. Stick with relatively small N values so the simulations 
        don't take too long. Play around with these two variables to explore their effects. 
        Each run will return a randomly sampled result given the parameter settings.
        You may need to increase the (time) variable to simulate more generations if all
        samples do not coalesce within the allowed time.

- Does increasing N tend to make coalescence take longer or shorter?
- Does sampling more alleles tend to make coalescence take longer or shorter?
- Given a fixed N and k (nsamples), do you see a lot of variation across replicates in the time until the sampled gene copies have coalesced?
        
    </p>
</div>

In [None]:
wright_fisher_simulation(
    popsize=10,              # change this to examine its effect
    nsamples=2,              # change this to examine its effect
    time=61,                 # change this if needed
    height=800,              # change this if needed
    sort_edges=True,         # do not change
    seed=None,               # do not change
    allele_frequency=0.5,    # do not change
    node_size=7,             # do not change
);

<div class="alert alert-block alert-info">
    <h3>Action: Estimate time to coalescence for 2 samples</h3>
    <p>
        The code block below uses the parameter setting <code>nsamples=2</code>, meaning
        that it will highlight the genealogical history for a random set
        of two gene copies each iteration. Run this simulation at least
        10 times and record the approximate generation each time at which the 
        two samples coalesce into a common ancestor. How many generations does 
        it take on average? Record your results in the next code block down to
        calculate the mean. (Simulations should usually finish within 61 generations;
        if not, just record it as 61 for now.)
    </p>
</div>

In [None]:
wright_fisher_simulation(
    popsize=10,              # keep this at 10
    nsamples=2,              # keep this at 2
    time=61,                 # change this if needed
    height=800,              # change this if needed
    sort_edges=True,         # do not change
    node_size=7,             # do not change
);

In [None]:
# write your 10 values inside the brackets here like: [10, 20, 25, 21, 30] ...
coal_times = [0]

# print the mean value 
print(f"Mean coalescent time for 2 samples={sum(coal_times) / len(coal_times)}")

## Coalescence

Sewall Wright noted that in a finite population evolving under a WF process the probability that two random gene copies share a common ancestor in the previous generation is 1/(2N), since there are 2N gene copies in the population. This type of event, where two descendant gene copies merge into a common ancestor when viewed backwards in time, is termed a coalescence. It turns out that this view of population histories, in the form of probabilistic statements about common ancestry, can be a very powerful way to describe evolution in populations. In fact, an entire statistical framework, called **Coalescent Theory**, has been developed to study populations based on probabilistic expectations for the time until some number of samples coalesce. This is the subject of the next lesson.
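
The 1/(2N) probability can be turned into a waiting-time calculation. Because parents are resampled independently each generation, the chance that two copies first coalesce exactly t generations back is (1 - 1/(2N))^(t-1) × 1/(2N). The sketch below sums these probabilities numerically for N=10; the function name `prob_first_coalescence` and the cutoff of 5000 generations are choices made for this example.

```python
def prob_first_coalescence(t, popsize):
    """P(two gene copies first share an ancestor exactly t generations back)."""
    p = 1 / (2 * popsize)  # per-generation coalescence probability
    return (1 - p) ** (t - 1) * p


popsize = 10
gens = range(1, 5001)  # far enough back that the remaining tail is negligible
probs = [prob_first_coalescence(t, popsize) for t in gens]

total = sum(probs)
mean_time = sum(t * pr for t, pr in zip(gens, probs))
print(f"total probability: {total:.6f}")
print(f"mean waiting time: {mean_time:.2f} generations")
```

The mean of this distribution is 2N generations, which you can compare with the average you recorded in the exercise above.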

## Summary

- A Wright-Fisher process, or model, is a discrete-time process of reproduction in an idealized population of finite size over multiple generations.
- Interpreting a WF-process plot requires understanding the direction of time, and which gene copies are ancestors or descendants of others.
- The WF process gives rise to genealogical relationships among gene copies.
- All gene copies in a population will eventually coalesce to a common ancestor.
- The probability that two gene copies share a common ancestor 1 generation ago is 1/(2Ne).
