# Week 3 Lab: Population genetics
## Prelab 1

**Due: Monday 4/22/19 11:59PM**

For this prelab, we'll become familiar with the concepts of genotype and allele frequencies, and how they behave when we are looking at large or diverse populations. 

You may work with a partner or consult basically anything (internet, classmates) for help!

## 1. Allele and genotype frequencies

If you were to look at the genomes of two unrelated people, the vast majority would be the same. But in some cases, mutations have arisen over time that make genomes slightly different from each other.

For this week, we'll focus almost exclusively on a type of genetic variation (mutation) which we call "SNPs" (single nucleotide polymorphisms). A SNP can be thought of simply as a spelling mistake, where originally there was a particular nucleotide at a position in the genome (say "A"), but at some point there was a mutation (for example, changing the "A" to a "C"). 

We use **allele** to refer to a particular "version" of a SNP. For example, a SNP may have two different *alleles* (e.g. "A" or "C").

A **genotype** is the combination of an individual's two alleles. So for a T/A SNP a person's genotype can be either TT, AT, or AA. (At least for now, we don't care about the order of alleles in heterozygous SNPs, so you could also say "TA" instead of "AT".)
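
As a small illustration of this definition, the unordered genotypes for any set of alleles can be enumerated programmatically. This is only a sketch and not part of the graded lab; the helper name `enumerate_genotypes` is made up for this example.

```python
from itertools import combinations_with_replacement

def enumerate_genotypes(alleles):
    """Return all unordered diploid genotypes for the given alleles.

    Hypothetical helper, for illustration only.
    """
    # Sort the alleles so a heterozygote is always written the same way
    # (e.g. "AT" rather than "TA").
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]

print(enumerate_genotypes(["T", "A"]))  # ['AA', 'AT', 'TT']
```

Because order is ignored, a SNP with two alleles always has three possible genotypes.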

**Question 1 (2 pts)**: For a SNP with alleles "G" and "C", what are all possible diploid genotypes? Set the list `possible_gts_GC` to your answer below. An example is shown for you. Please keep alleles in uppercase.

In [2]:
# Example: what are the possible genotypes for a SNP with alleles "A" and "T"?
possible_gts_AT = ["AA","AT","TT"]

# What are the possible genotypes for a SNP with alleles "G" and "C"?
possible_gts_GC = ["GG", "GC", "CC"] 


In [3]:
"""Test the list of possible genotypes is correct"""
assert(len(possible_gts_AT)==3)
assert("AA" in possible_gts_AT)
assert("AT" in possible_gts_AT or "TA" in possible_gts_AT)
assert("TT" in possible_gts_AT)


We can think of all of the copies of a chromosome in the population (2 per person) as deriving from some common ancestor a long time ago. The leaves of the tree below are all the present-day copies of the chromosome. Since humans are diploid, humans each have two copies (right).

<img src=popgen_fig1.png width=800>

Once a mutation arises on one copy of a chromosome, it can be passed on to future generations and can eventually spread throughout the population (see left figure above).

We use **allele frequency** to refer to the frequency of a particular allele out of all the copies of the genome in the population. 

We use **genotype frequency** to refer to the frequency of each possible *genotype* (consisting of two alleles) in the population.

Since we each have two copies of each chromosome, allele frequency is usually the number of times the allele was seen divided by two times the number of people we analyzed (ignoring sex chromosomes or mitochondria).

For example, consider an "A/T" SNP that we are analyzing in a set of 1,000 people. We find that 40 people have genotype AA, 320 have genotype AT, and 640 have genotype TT:

* The *genotype frequencies* are: AA=40/1000 = 4%, AT=320/1000 = 32%, and TT=640/1000 = 64%.
* To find the *allele frequencies*:
  * We have 2,000 (2*number of people) total copies of the chromosome
  * For A, 40 people have two copies of A and 320 people have one copy of A, so there are $40*2+320 = 400$ total copies of A.
  * Similarly, there are $640*2+320=1600$ copies of T
  * So the allele frequencies are A=400/2000 = 20%, T=1600/2000 = 80%
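
The arithmetic in the example above can be checked with a few lines of Python. This is a minimal sketch that stores the counts in a dictionary keyed by genotype string; the variable names are chosen for this illustration only.

```python
# Genotype counts from the worked example above (1,000 people)
genotype_counts = {"AA": 40, "AT": 320, "TT": 640}

n_people = sum(genotype_counts.values())
n_chromosomes = 2 * n_people  # each person carries two copies

# Count alleles by walking over the two letters of each genotype
allele_counts = {}
for genotype, n in genotype_counts.items():
    for allele in genotype:
        allele_counts[allele] = allele_counts.get(allele, 0) + n

allele_freqs = {a: c / n_chromosomes for a, c in allele_counts.items()}

print(allele_counts)  # {'A': 400, 'T': 1600}
print(allele_freqs)   # {'A': 0.2, 'T': 0.8}
```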
  
We use **minor allele** to refer to the allele at a position that is least frequent in the population, and **major allele** to refer to the most common allele. (Note, in some cases the reference allele is actually the minor allele!)

We use **minor allele frequency (MAF)** to refer to the frequency of the minor allele.
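
Given allele counts, the minor allele and its frequency can be found as in the sketch below. The function name `minor_allele_and_maf` is hypothetical, and the counts come from the A/T example above. If two alleles are tied, this version simply returns whichever appears first in the dictionary.

```python
def minor_allele_and_maf(allele_counts):
    """Return (minor allele, minor allele frequency) from allele counts.

    Hypothetical helper, for illustration only.
    """
    total = sum(allele_counts.values())
    # The minor allele is the one with the smallest count
    minor = min(allele_counts, key=allele_counts.get)
    return minor, allele_counts[minor] / total

print(minor_allele_and_maf({"A": 400, "T": 1600}))  # ('A', 0.2)
```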

Since most SNPs have only two alleles ("bi-allelic"), we can conveniently represent most SNP genotypes as 0, 1, or 2. Usually "0" would mean homozygous for the most common allele, "1" would mean heterozygous, and "2" would mean homozygous for the least common allele.
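
One way to picture this encoding is as a count of how many copies of the minor allele a person carries. The sketch below assumes a bi-allelic SNP and that the minor allele is already known; `encode_genotype` is a name invented for this example.

```python
def encode_genotype(genotype, minor_allele):
    """Encode a genotype string as the number of minor-allele copies (0, 1, or 2).

    Hypothetical helper, for illustration only.
    """
    return genotype.count(minor_allele)

# For the A/T example above, "A" is the minor allele
genotypes = ["TT", "AT", "TA", "AA"]
print([encode_genotype(g, "A") for g in genotypes])  # [0, 1, 1, 2]
```

Note that "AT" and "TA" get the same code, which matches the convention of ignoring allele order.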

**Question 2 (5 pts)**: Consider a "G/C" SNP that we are analyzing in a set of 10,000 people. We find that 7,225 people have genotype CC ("0"), 2,550 people have genotype GC ("1"), and 225 people have genotype GG ("2"). Complete the functions `GetGenotypeFrequencies` and `GetAlleleFrequencies` to compute the genotype and allele frequencies of this SNP, respectively. The example from above is given as a test case to make sure your code is working.

In [9]:
# Compute genotype frequencies, given genotype counts
# Input: gt_counts[# people with genotype 0, # people with genotype 1, # people with genotype 2]
# Return [fraction of people with gt 0, fraction of people with gt1, fraction of people with gt2]
def GetGenotypeFrequencies(gt_counts):
    gt_freqs = [0, 0, 0]
    total_people = sum(gt_counts)
    for i, counts in enumerate(gt_counts):
        gt_freqs[i] = counts/total_people
    return gt_freqs

# Compute allele frequencies, given genotype counts
# Input: gt_counts[# people with genotype 0, # people with genotype 1, # people with genotype 2]
# Return: [fraction of alleles that are the major allele, fraction of alleles that are the minor allele]
def GetAlleleFrequencies(gt_counts):
    allele_freqs = [0, 0]
    total_alleles = sum(gt_counts) * 2
    allele_freqs[0] = (gt_counts[0]*2 + gt_counts[1])/total_alleles
    allele_freqs[1] = (gt_counts[1] + gt_counts[2]*2)/total_alleles

    return allele_freqs

print("Example")
ex_gt_counts = [640, 320, 40]
ex_gt_freqs = GetGenotypeFrequencies(ex_gt_counts)
ex_allele_freqs = GetAlleleFrequencies(ex_gt_counts)
print("Computed genotype frequencies: 0=%.2f, 1=%.2f, 2=%.2f"%(ex_gt_freqs[0], ex_gt_freqs[1], ex_gt_freqs[2]))
print("Computed allele frequencies: major allele=%.2f, minor allele=%.2f"%(ex_allele_freqs[0], ex_allele_freqs[1]))

print("Your test case")
gt_counts = [7225, 2550, 225]
gt_freqs = GetGenotypeFrequencies(gt_counts)
allele_freqs = GetAlleleFrequencies(gt_counts)
print("Computed genotype frequencies: 0=%.2f, 1=%.2f, 2=%.2f"%(gt_freqs[0], gt_freqs[1], gt_freqs[2]))
print("Computed allele frequencies: major allele=%.2f, minor allele=%.2f"%(allele_freqs[0], allele_freqs[1]))

Example
Computed genotype frequencies: 0=0.64, 1=0.32, 2=0.04
Computed allele frequencies: major allele=0.80, minor allele=0.20
Your test case
Computed genotype frequencies: 0=0.72, 1=0.26, 2=0.02
Computed allele frequencies: major allele=0.85, minor allele=0.15


In [10]:
"""Test results of GetGenotypeFrequencies and GetAlleleFrequencies"""
ex_gt_counts = [640, 320, 40]
ex_gt_freqs = GetGenotypeFrequencies(ex_gt_counts)
ex_allele_freqs = GetAlleleFrequencies(ex_gt_counts)
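import math
# Compare floats with a tolerance rather than exact equality
assert(math.isclose(sum(ex_gt_freqs), 1))
assert(math.isclose(sum(ex_allele_freqs), 1))
assert(math.isclose(ex_gt_freqs[0], 0.64))
assert(math.isclose(ex_gt_freqs[1], 0.32))
assert(math.isclose(ex_gt_freqs[2], 0.04))
assert(math.isclose(ex_allele_freqs[0], 0.80))
assert(math.isclose(ex_allele_freqs[1], 0.20))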

## 2. Mutations in populations

Depending on when a mutation occurs, it can have very different frequencies in the population. Consider a mutation that occurred thousands of years ago vs. a mutation that happened very recently:

<img src=popgen_fig2.png width=800>

Now think about different human populations in the world. At the highest level, we can think of the major population groups of the world as being related as depicted in the tree below:

<img src=popgen_fig3.png width=400>

(This is a major simplification, and ignores things like admixture between different populations.)

If a mutation occurred before these populations split, it might be pretty common all across the globe. On the other hand, if a mutation occurred after a population split, it may be common in one population but completely missing from another.

<img src=popgen_fig4.png width=700>
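
To make this concrete, here is a toy sketch in which a mutation on a branch of a genealogy is inherited by every present-day chromosome copy descending from that branch. The tree, node names, and the two populations (`p1`, `p2`) are entirely made up for illustration and are not real data.

```python
# Toy genealogy: each key is an ancestral copy, each value lists its descendants.
# Leaves (names starting with "p1_" or "p2_") are present-day chromosome copies
# from two hypothetical populations.
tree = {
    "root": ["old", "other"],
    "old": ["p1_a", "p1_b", "p2_a"],      # old lineage with descendants in both populations
    "other": ["p1_c", "p1_d", "p2_anc"],
    "p2_anc": ["p2_b", "p2_c", "p2_d"],   # lineage found only in population 2
}

def leaves_below(node):
    """Return all present-day copies descending from `node`."""
    if node not in tree:
        return [node]
    return [leaf for child in tree[node] for leaf in leaves_below(child)]

def freq_by_population(mutated_node):
    """Frequency of a mutation that arose on `mutated_node`, per population."""
    carriers = set(leaves_below(mutated_node))
    freqs = {}
    for pop in ["p1", "p2"]:
        members = [leaf for leaf in leaves_below("root") if leaf.startswith(pop + "_")]
        freqs[pop] = sum(leaf in carriers for leaf in members) / len(members)
    return freqs

print(freq_by_population("old"))     # {'p1': 0.5, 'p2': 0.25}
print(freq_by_population("p2_anc"))  # {'p1': 0.0, 'p2': 0.75}
```

In this toy tree, the mutation on the older lineage shows up in both populations, while the mutation on the population-specific lineage is absent from the other population.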

**Question 3 (3 pts)**: The ancestral copy of a chromosome has base "T" at a particular position. This "T" mutated to a "C" in an Asian individual thousands of years ago but after Asian populations had diverged from other populations. What do you expect the minor allele frequency (frequency of allele "C") of this SNP to be in Europeans? Set `expected_maf` to your answer below (ignoring new mutations at the same position, which are rare but can happen). 

In [13]:
expected_maf = 0

In [14]:
"""Test result of expected_maf"""

'Test result of expected_maf'