# Notebook 1: Hardy-Weinberg


### Notebook outline:
1. Idealized populations and Hardy-Weinberg
2. Hardy-Weinberg as a binomial sampling problem
3. Sampling error and genetic drift


### Learning objectives: 
By the end of this notebook series you should:
1. Know which assumptions underlie the definition of an 'idealized population'
2. Understand why allele frequencies do not change in the absence of evolutionary processes.
3. Be able to calculate genotype frequencies under Hardy-Weinberg equilibrium given allele frequencies at a locus.
4. Understand why violations of the Hardy-Weinberg assumptions cause allele frequencies to change over generations.

### Optional further reading:

- Chapter 4 of Futuyma 4th edition

In [None]:
# we start by loading NumPy, a common Python library for numerical computing
# that includes the random sampling functions used below
import numpy as np

### Genetic variation in populations
For the examples in this notebook we are focusing on variation at a single gene (A) that has two alleles (A$_1$ and A$_2$) in a single population. If a new mutation arose at this gene then we could consider additional alleles, such as A$_3$, but for simplicity we will assume only two alleles exist in the population. In fact, these types of simplifying assumptions are the topic of this notebook.

### Assumptions of an idealized population

Population genetics is the study of allele frequencies in populations. Amazingly, much of the foundational work in this field was developed *even before we knew about DNA*, and involved only combining probability and statistics with Mendel's theory of [particulate (non-blending) inheritance](https://en.wikipedia.org/wiki/Mendelian_inheritance). 

Mendel showed that genes are discrete units, that diploid organisms carry two copies of every gene, and that the two copies have equal probability of being passed on to offspring through the formation of haploid gametes. These simple laws of *segregation and independent assortment* can be formalized as probability statements, and used to develop probabilistic models.
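As a minimal sketch of segregation written as a probability statement, the cell below samples gametes from a single heterozygous parent. The allele labels, the number of gametes, and the random seed are assumptions chosen only for illustration.

```python
import numpy as np

# Assumption: the seed and the labels "A1"/"A2" are arbitrary illustration choices
rng = np.random.default_rng(seed=42)

# a heterozygous (A1A2) parent passes one of its two copies to each gamete,
# each with equal probability
parent = np.array(["A1", "A2"])
gametes = rng.choice(parent, size=10_000)

# fraction of gametes carrying A1; expected to be close to 0.5
frac_A1 = np.mean(gametes == "A1")
print(frac_A1)
```

With many gametes the fraction carrying each allele should be close to one half, which is all that segregation claims.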

For this, we need to make some assumptions about the makeup of populations. An *idealized population* refers to a theoretical population that meets a number of unrealistic but useful assumptions. Most notably, it is of *infinite size* and the individuals within it are *randomly mating*. As we will see, these and other assumptions can be relaxed to allow for further insights into their effect on the model. Overall, the purpose of these population genetic models is to understand how and why allele frequencies (i.e., the relative abundance of A$_1$ versus A$_2$) change over time.

### Hardy-Weinberg 
The [*Hardy-Weinberg* equilibrium](https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle) is a statement that **the frequencies of alleles (variants at a gene) and genotypes (combinations of alleles at a gene) will remain constant** through time in an *idealized population* in the absence of <span style="color:red">selection, mutation, migration, and genetic drift.</span> 

This statement is not especially surprising from a statistical point of view, but it is more so from an evolutionary one. It identifies that <span style="color:red">these processes are fundamental to explaining why allele frequencies change over time</span>, and also that the process of *segregation of alleles* during meiosis (i.e., Mendelian inheritance) is not expected to change allele frequencies on its own. The concept of Hardy-Weinberg equilibrium is demonstrated in the figure below from Chapter 4 of your textbook.

<img src=https://eaton-lab.org/slides/fundamentals2019/session-9-popgen/data/hardy-weinberg.png style="width:60%">


<p style="text-align:center">(Image source: Futuyma 4th Edition)</p>

### Hardy-Weinberg Expectation: 
In the first cell of the figure above the population is initially not in Hardy-Weinberg equilibrium. We can tell this by looking at the genotype frequencies. There appears to be a deficit of heterozygotes. But what is the null expectation for the number of heterozygotes that should exist? Well, this is what Hardy-Weinberg can tell us. To answer that question we need to calculate the genotype frequencies after just one generation of random mating in an infinite-sized population starting with the allele frequencies that exist in the population currently.

After a single generation **genotype frequencies** (e.g., A$_1$A$_1$, A$_1$A$_2$, and A$_2$A$_2$) will reach HW equilibrium, despite the fact that **allele frequencies** (e.g., A$_1$ and A$_2$) will not change. This expectation can be computed exactly using probability (here I will use the same genotype frequencies as in the figure above, but change the allele names to A and B, rather than A$_1$ and A$_2$ to make it easier to type).

In [None]:
# (A) diploid genotype frequencies in the population
AA = 0.3
AB = 0.0
BB = 0.7

Each AA parent produces two A alleles, each AB parent produces one A and one B allele, and each BB parent produces two B alleles. Based on this simple fact we can calculate the relative frequency of A and B alleles, which we label **p** and **q**. For example, p is equal to the frequency of AA homozygotes plus 1/2 the frequency of heterozygotes; q is equal to the frequency of BB homozygotes plus 1/2 the frequency of heterozygotes.

You might ask, but what if some of the diploids produced more alleles than others? Well, for that to happen would be a manifestation of either drift or selection, both of which we are assuming to be absent from this model. This is key to remember.

In [None]:
# (B) sample alleles from these diploid parents (e.g., haploid gametes)
p = AA + (AB / 2)
q = BB + (AB / 2)

# show result
print(p, q)

Now that we know the frequency of alleles in the gametes we can calculate the frequency of genotypes formed in the next generation *by assuming that populations are randomly mating*. AA genotypes will occur with probability of sampling p twice (p * p), heterozygotes are the probability of sampling p and q (or q and p), and BB homozygotes are the probability of sampling q twice (q * q). Once again, note the importance of the assumptions underlying our idealized population for this model prediction.

In [None]:
# (C) HW states that: p**2 + 2pq + q**2 = 1
newAA = p * p
newAB = 2 * p * q
newBB = q * q

# (D) the *genotype frequencies* have changed in the new generation
print(newAA, newAB, newBB)

We can see that the new genotype frequencies are different from the previous generation, and they match the expectation from the figure. An important point is that after only one generation they have reached an *equilibrium*, meaning that we can repeat this same calculation for the next generation and the genotype and allele frequencies will not change. 
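To make the equilibrium claim concrete, here is a small sketch that repeats the same calculation for several generations. The helper name `next_generation` and the choice of five generations are assumptions made for this illustration; the starting frequencies are the ones used above.

```python
def next_generation(AA, AB, BB):
    """Return genotype frequencies after one generation of random mating."""
    p = AA + AB / 2   # frequency of the A allele in the gametes
    q = BB + AB / 2   # frequency of the B allele in the gametes
    return p * p, 2 * p * q, q * q

# start from the same non-equilibrium genotype frequencies used above
freqs = (0.3, 0.0, 0.7)
history = [freqs]
for generation in range(5):
    freqs = next_generation(*freqs)
    history.append(freqs)
    print(generation + 1, [round(f, 4) for f in freqs])
```

The genotype frequencies change once, in the first generation, and then stay the same (up to floating-point rounding) in every later generation.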

In [None]:
# (E) it is important to note that the *allele frequencies* have NOT changed
p = newAA + (newAB / 2)
q = newBB + (newAB / 2)
print(p, q)

### Mendelian segregation and Binomial sampling
One reason I stated at the beginning that the Hardy-Weinberg concept is not surprising from a statistical viewpoint is that it can be described by a very simple and common statistical model: a *binomial sampling problem*. A *binomial* distribution is used in statistics to model the number of successes in a fixed number of independent trials that each have a binary outcome (e.g., True vs. False). For a diploid organism, we can describe the three possible genotypes (A$_1$A$_1$, A$_1$A$_2$, and A$_2$A$_2$) at a locus with two alleles as the outcome of sampling, or not sampling, the A$_1$ allele in two independent trials (we do two trials because a diploid organism has two allele copies). This is demonstrated below.
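Before sampling anything, it may help to check the correspondence exactly. The sketch below evaluates the binomial probability of drawing k copies of the A$_1$ allele in n=2 trials and compares it to the Hardy-Weinberg terms q$^2$, 2pq, and p$^2$. The helper name `binomial_pmf` and the value p=0.3 are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.3      # probability of sampling the A1 allele in one trial
q = 1 - p    # probability of sampling the other allele

# k = number of A1 copies in a diploid genotype (0, 1, or 2)
pmf = [binomial_pmf(k, 2, p) for k in range(3)]
hw = [q * q, 2 * p * q, p * p]

for k, (b, h) in enumerate(zip(pmf, hw)):
    print(k, round(b, 4), round(h, 4))
```

The two columns agree, so the Hardy-Weinberg genotype frequencies are simply the binomial probabilities for two trials.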

In [None]:
# This is just showing an example of drawing random samples from
# a binomial distribution for 1 random outcome (0 or 1) repeated 20 times
# where the probability of sampling a 1 is p=0.3
np.random.binomial(n=1, p=0.3, size=20)

In [None]:
# Here I demonstrate similarly that we can draw two random samples
# for each trial and get the sum (0, 1, or 2), repeated 20 times,
# where the probability of sampling a 1 in a trial is p=0.3.
np.random.binomial(n=2, p=0.3, size=20)

In [None]:
# OK, now let's use this to sample a large number of diploid samples,
# these look just like above, a collection of 0s, 1s and 2s.
new_genotypes = np.random.binomial(n=2, p=0.3, size=100000)

# and calculate genotype frequencies (the mean of a boolean array
# is the fraction of entries that are True)
BB = np.mean(new_genotypes == 0)
AB = np.mean(new_genotypes == 1)
AA = np.mean(new_genotypes == 2)
print(AA, AB, BB)

### Deviating from Hardy-Weinberg
You'll notice that the results above are *close* to the expectation we calculated earlier, but *not exactly the same*. This is because even though we drew a very large number of random samples, there is still a small amount of sampling error that can cause slight deviations from the expectation. This is an example of genetic drift! And it brings us to our next subject.

### Sampling error is genetic drift
The binomial sampling method allows us to approximate the change in allele frequencies over multiple generations that is expected to occur by genetic drift in a finite-sized population. It is important to note that by using finite population sizes (i.e., allowing for drift) we have violated the assumptions of the Hardy-Weinberg model, and thus both genotype and allele frequencies are expected to change each generation. Our model now has an additional parameter, N, which we can modify and see its effect on the results. 

Below is the change in genotype frequencies after one generation when the population size is finite. We can see that they are still very close to the HW expectation. When N is larger the genotype frequencies are closer to the expectation; when N is smaller they deviate further from it.
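The effect of N can be quantified with a rough sketch. Under binomial sampling of 2N gene copies, the standard deviation of the allele frequency after one generation is sqrt(p(1-p)/(2N)), a standard property of the binomial distribution. The cell below compares that value to simulated replicates. The population sizes, the number of replicates, the seed, and the `results` dictionary are assumptions for illustration. For brevity it draws the total allele count from a single binomial with n=2N, which is equivalent to summing N diploid draws with n=2.

```python
import numpy as np

p = 0.3
rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only

results = {}
for N in (100, 1_000, 10_000):
    # 2000 replicate generations: number of A copies among 2N gene copies
    counts = rng.binomial(n=2 * N, p=p, size=2_000)
    new_p = counts / (2 * N)

    # observed spread across replicates vs. the binomial expectation
    observed_sd = new_p.std()
    expected_sd = np.sqrt(p * (1 - p) / (2 * N))
    results[N] = (observed_sd, expected_sd)
    print(N, round(observed_sd, 5), round(expected_sd, 5))
```

Each tenfold increase in N shrinks the typical one-generation deviation by a factor of about sqrt(10).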

In [None]:
# model the genotype frequencies after one generation of sampling in a finite population
N = 1000  # number of diploid individuals
p = 0.3

# sample N diploid genotypes (each value is the number of A copies: 0, 1, or 2)
# and calculate genotype frequencies
new_genotypes = np.random.binomial(n=2, p=p, size=N)
BB = sum(new_genotypes == 0) / len(new_genotypes)
AB = sum(new_genotypes == 1) / len(new_genotypes)
AA = sum(new_genotypes == 2) / len(new_genotypes)
print(AA, AB, BB)

### Simulate allele frequency change over multiple generations
When we simulate this process over many generations even small fluctuations each generation can lead to large changes over time. Genetic drift is a random process (just random sampling!) so in each generation an allele may increase *or* decrease. 

Here we can see the change in the frequency (p) of the A allele in 8 replicate simulations from the same starting frequency (0.3). Sometimes it increases by drift, sometimes it decreases. This simulation uses the binomial sampling method described above.

In [None]:
# model allele frequency change over 500 generations in population of size 1000 diploids
N = 1000
p = 0.3
ngens = 500

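# calculate allele frequencies through time (8 replicates x ngens generations)
afreq = np.zeros((8, ngens))
afreq[:, 0] = p
for rep in range(8):
    for gen in range(ngens - 1):
        # sample N diploid genotypes; each value is the number of A copies (0, 1, or 2)
        genotypes = np.random.binomial(n=2, p=afreq[rep, gen], size=N)
        # frequency of the A allele among the 2N sampled gene copies
        new_p = genotypes.sum() / (len(genotypes) * 2)
        afreq[rep, gen + 1] = new_p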
        
# plot the replicates
import toyplot
canvas = toyplot.Canvas(width=500, height=325)
axes = canvas.cartesian(xlabel="generations", ylabel="allele frequency (p)")
axes.plot(afreq.T);

### Moving forward: Genealogies
As we have seen, allele frequencies in populations can change over time due to genetic drift caused by the simple process of *randomly sampling* which parents' alleles make up the next generation in a finite-sized population. Try changing the variable N in the code cell above from 1000 to 100000 and see how the results change. You should see that larger populations (larger N) have less volatility in allele frequencies over time. 

The **effective size** of a population (often shown as either N, Ne, or N$_e$) is an important parameter in population genetic models. As we showed here it affects the rate of allele frequency change caused by genetic drift. This relationship between Ne and genetic drift has many consequences. Surprisingly, this relationship can be entirely explained by the effect of Ne on the probability that two random samples (e.g., gene copies in the population) share a common ancestor in a previous generation. In other words, Ne affects the genealogy of samples! This is the basis for two mathematical models we will discuss in the next notebook, the Wright-Fisher process and the coalescent.
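As a rough preview, the sketch below estimates the probability that two gene copies share a parent copy in the previous generation. It assumes that each copy picks its parent uniformly at random, with replacement, from the 2N copies of the previous generation. That assumption, along with the small N, the number of sampled pairs, and the seed, is chosen only for illustration. Under it the probability is 1/(2N).

```python
import numpy as np

N = 50               # diploid population size (kept small so the effect is visible)
n_copies = 2 * N     # number of gene copies in the population
rng = np.random.default_rng(seed=7)  # arbitrary seed

# for 100,000 pairs of gene copies, each copy picks its parent copy at random
# from the 2N copies in the previous generation
parents = rng.integers(0, n_copies, size=(100_000, 2))

# fraction of pairs in which both copies picked the same parent copy
shared = np.mean(parents[:, 0] == parents[:, 1])
print(shared, 1 / n_copies)
```

Increasing N makes a shared parent less likely, which is one way to see why drift is weaker in larger populations.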