# Models of Evolution

1. Mutations
2. Recombination
3. Natural Selection
4. Migration
5. Mating
6. Population Size

This notebook explores fundamental evolutionary models and the code that demonstrates their impact.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Fundamentals of Genomics

Evolution is driven through mutations accumulating in a chromosome.  A **chromosome** is a structure of nucleic acids and protein found in the nucleus of most living cells, carrying genetic information in the form of genes. For this lesson, we'll only evaluate the nucleic acids in a chromosome. In this exploration we will use *A* (adenine), *C* (cytosine), *G* (guanine), and *T* (thymine) nucleotides, also known as **bases**. Each position on a chromosome is called a **locus** (plural: loci), and the specific nucleotide at that position is referred to as a **variant**.  A set of variants from a single chromosome, ordered by locus, is referred to as a **sequence**.  Stacking several sequences so that their loci line up produces an **alignment**, which makes it easy to spot loci where more than one unique variant is present; the alternative variants at such a locus are called **alleles**.


## Alignment

We will start by generating 5 identical sequences of 10 variants each.  We will skew the generation toward the A and T nucleotides so that the sequence is less uniformly random.

In [3]:
# sets the seed so the random generation is the same each time
np.random.seed(123)
bases = ['A','C','G','T']

# weights for our bases
base_p = [0.4, 0.1, 0.1, 0.4] 

# create a single strand
strand = np.random.choice(bases,10, p=base_p)

# create alignment of 5 identical strands
alignment = np.tile(strand,(5,1))

#converting to pandas
align_df = pd.DataFrame(alignment)
align_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,T,A,A,G,T,C,T,T,C,A
1,T,A,A,G,T,C,T,T,C,A
2,T,A,A,G,T,C,T,T,C,A
3,T,A,A,G,T,C,T,T,C,A
4,T,A,A,G,T,C,T,T,C,A


Now we have an alignment with 5 strands (*n=5*), each of length 10 (*L=10*), and no sites with more than a single unique variant (*S=0*).   Let's now introduce a few **SNPs** (single-nucleotide polymorphisms), which are substitutions of a single nucleotide at a specific position in the genome.   We'll add 5 SNPs at random.

In [4]:
np.random.seed(1)
mut_df = align_df.copy()

for _ in range(5):
    # copy the base list, pick a random strand and locus to mutate,
    # remove the existing variant from the options, then pick one of the remaining bases
    mut = bases.copy()
    strand = np.random.randint(0,5)
    locus = np.random.randint(0,10)
    existing_variant = mut_df.loc[strand, locus]
    mut.remove(existing_variant)
    # .loc[row, column] avoids chained assignment, which may not write back to mut_df
    mut_df.loc[strand, locus] = np.random.choice(mut)
mut_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,T,C,A,G,T,C,T,T,C,A
1,T,A,A,G,T,C,T,T,C,A
2,T,A,A,G,G,C,T,T,C,A
3,T,A,A,G,T,A,T,T,G,A
4,T,A,A,G,T,C,C,T,C,A
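
The number of sites with more than one unique variant (*S*) can also be counted programmatically rather than by eye. The following is a minimal sketch, assuming the alignment shown in the table above is retyped by hand; `segregating_sites` and `snp_df` are hypothetical names introduced here for illustration, not part of the notebook.

```python
import pandas as pd

# Alignment retyped from the table above (after the 5 SNPs were added)
snp_df = pd.DataFrame([
    list("TCAGTCTTCA"),
    list("TAAGTCTTCA"),
    list("TAAGGCTTCA"),
    list("TAAGTATTGA"),
    list("TAAGTCCTCA"),
])

def segregating_sites(df, ignore_gaps=False):
    """Return the loci (column labels) that hold more than one unique variant."""
    sites = []
    for locus in df.columns:
        variants = set(df[locus])
        if ignore_gaps:
            # optionally skip the '.' placeholder used for missing bases
            variants.discard(".")
        if len(variants) > 1:
            sites.append(locus)
    return sites

sites = segregating_sites(snp_df)
print(len(sites), sites)
```

The `ignore_gaps` flag is an assumption about how one might want to treat the "." placeholder; leaving it at `False` counts a missing base as its own variant.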


We now have 5 loci with multiple alleles (1, 4, 5, 6, 8), so now *S=5*.  Another form of mutation is insertions and deletions. An **insertion** occurs when a strand acquires new bases, generating a new locus not present in any other strand.  A **deletion** is the opposite: a strand's base is removed at a locus. Both processes exert evolutionary impact and are often lumped together as "InDels", because it can be difficult to discern which of the two produced a new locus. In both cases, the missing allele (whether on the strand itself or on the other strands) is represented with a period ".".  We will now add a deletion to our alignment.

In [5]:
np.random.seed(123)
strand = np.random.randint(0,5)
locus = np.random.randint(0,10)
# mark the deleted base with '.'; .loc[row, column] writes directly into mut_df
mut_df.loc[strand, locus] = '.'
mut_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,T,C,A,G,T,C,T,T,C,A
1,T,A,A,G,T,C,T,T,C,A
2,T,A,.,G,G,C,T,T,C,A
3,T,A,A,G,T,A,T,T,G,A
4,T,A,A,G,T,C,C,T,C,A
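
An insertion can be represented in the same way, except that the new locus is filled with "." on every strand that did not gain a base. The sketch below is illustrative only: the alignment is retyped from the table above, and the choice of strand 1, position 5, and base `G` is an arbitrary assumption, as is the name `ins_df`.

```python
import pandas as pd

# Alignment retyped from the table above (after the deletion)
ins_df = pd.DataFrame([
    list("TCAGTCTTCA"),
    list("TAAGTCTTCA"),
    list("TA.GGCTTCA"),
    list("TAAGTATTGA"),
    list("TAAGTCCTCA"),
])

# Hypothetical insertion: strand 1 gains a 'G' between the old loci 4 and 5
insert_at = 5
new_locus = ["."] * len(ins_df)   # every other strand is missing this base
new_locus[1] = "G"

# DataFrame.insert adds the column in place at the given position
ins_df.insert(insert_at, "ins", new_locus)

# renumber the loci so they run 0..L-1 again
ins_df.columns = range(ins_df.shape[1])
print(ins_df)
```

After the insertion the alignment has length *L=11*, and the variants that used to sit at loci 5 through 9 shift one position to the right.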


At this point we now have 6 loci with multiple alleles (*S=6*).  We can now calculate the allele frequency at each locus.  The allele frequency is the relative frequency of an allele (variant of a gene) at a particular locus in a population, expressed as a fraction or percentage.

The frequency ($p$) of an allele *A* is the number of copies ($n_A$) of the A allele divided by the population or sample size ($N$), so
$$
p = \frac{n_A}{N}
$$
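
As a worked instance using the alignment above: at locus 1, four of the five strands carry *A* and one carries *C*, so with $N = 5$

$$
p_A = \frac{n_A}{N} = \frac{4}{5} = 0.8, \qquad p_C = \frac{n_C}{N} = \frac{1}{5} = 0.2
$$

The frequencies at a locus sum to 1 when every variant present, including the "." placeholder, is counted as an allele.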



In [6]:
for i in range(10):
    #get values as dictionary
    vals = mut_df[i].value_counts().to_dict()
    
    # get N
    total_ct = sum(vals.values()) 
    
    #iterate through each value and create a frequency dictionary
    freq_dict = {j:vals[j]/total_ct for j in vals}
    mut_df.loc['allele_freq',i] = str(freq_dict)
mut_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,T,C,A,G,T,C,T,T,C,A
1,T,A,A,G,T,C,T,T,C,A
2,T,A,.,G,G,C,T,T,C,A
3,T,A,A,G,T,A,T,T,G,A
4,T,A,A,G,T,C,C,T,C,A
allele_freq,{'T': 1.0},"{'A': 0.8, 'C': 0.2}","{'A': 0.8, '.': 0.2}",{'G': 1.0},"{'T': 0.8, 'G': 0.2}","{'C': 0.8, 'A': 0.2}","{'T': 0.8, 'C': 0.2}",{'T': 1.0},"{'C': 0.8, 'G': 0.2}",{'A': 1.0}
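
The same frequencies can also be obtained without writing a new row into the alignment. The sketch below assumes the alignment is retyped from the table above and uses `value_counts(normalize=True)`, which divides each count by the column total; `freq_src` and `freqs` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Alignment retyped from the table above (SNPs plus one deletion)
freq_src = pd.DataFrame([
    list("TCAGTCTTCA"),
    list("TAAGTCTTCA"),
    list("TA.GGCTTCA"),
    list("TAAGTATTGA"),
    list("TAAGTCCTCA"),
])

# one {variant: frequency} dictionary per locus; '.' is counted as its own variant
freqs = {
    locus: freq_src[locus].value_counts(normalize=True).to_dict()
    for locus in freq_src.columns
}

for locus, freq in freqs.items():
    print(locus, freq)
```

Keeping the frequencies in a separate dictionary leaves the alignment itself untouched, so the cell can be re-run without the frequency row being counted as a sixth strand.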


In the new row we can now clearly see the frequency of each allele, including the frequency of the missing base.  As we can see, even when there are multiple alleles, one tends to dominate within a population.  This pattern is common because new allele variations, or mutations, are rarely advantageous and so tend to stay at low frequencies.  Next we'll explore some models of mutation to better understand why these alleles persist, even at low frequencies, and how they are introduced and spread.