## Parentage assignment simulation walktrough

In this notebook, I will do a walkthrough of how the code for the simulations of my Data Science class project works. 

### Goal 

The goal of this code is to estimate the probability of assigning parentage between two individuals correctly depending on the number of SNPs sampled to do so. The whole point is to see what is the minimum number of SNPs that would be needed to assign parentage correctly at least 95% of the time. The hypothesis is that sequencing fewer SNPs might yield the same results while being cheaper. 

In [1]:
import random as rd
import numpy as np
import pandas as pd

### Step 1: Generating the parental population

The first step is to generate a parental population. In this population, each individual will be defined by a sequence of $N$ SNPs which will have 3 possible values: 0, 1, 2. Assuming diploid genetics, the number zero indicates dominant homozygosity for that locus (let's say AA = 0), whereas a number of 1 indicates heterozygosity (let's say AC/CA = 1) and 2 recessive homozygosity (CC = 2). 

To generate unique parents I will use the `choices` function from the `random` package as seen below:

In [2]:
# define possible values for SNPS
snp_values = ["A","T","C","G"]

# generate an example parent with 10 SNPs
rd.choices(snp_values, k = 10)

['A', 'C', 'T', 'T', 'G', 'T', 'G', 'C', 'G', 'A']

Using the same principle I will generate a parental population of 50 individuals with 200 SNPs each:

In [3]:
# define the number of parents 
n_parents = 100

# define the number of SNPs
n_snps = 100

# generate an empty lists to store parental population SNPs and info
parent_snps = []
parent_id = []

# loop to generate parental population 
for i in range(n_parents):
    
    parent_snps.append(rd.choices(snp_values, k = n_snps)) # generate and add parent snps
    
    parent_id.append(i) # generate and add parent id
    
print(parent_snps)

[['T', 'G', 'A', 'T', 'C', 'A', 'T', 'T', 'G', 'A', 'G', 'A', 'C', 'G', 'C', 'C', 'G', 'A', 'C', 'A', 'G', 'T', 'G', 'A', 'C', 'C', 'T', 'C', 'G', 'T', 'A', 'T', 'C', 'G', 'C', 'G', 'C', 'A', 'A', 'G', 'G', 'G', 'G', 'G', 'C', 'C', 'T', 'T', 'A', 'T', 'A', 'A', 'T', 'G', 'G', 'T', 'C', 'C', 'G', 'A', 'T', 'G', 'A', 'G', 'A', 'C', 'A', 'G', 'C', 'G', 'G', 'G', 'A', 'G', 'T', 'G', 'G', 'T', 'C', 'T', 'G', 'C', 'G', 'G', 'G', 'T', 'T', 'T', 'G', 'T', 'G', 'C', 'C', 'G', 'C', 'G', 'A', 'G', 'A', 'T'], ['A', 'G', 'C', 'A', 'C', 'A', 'C', 'A', 'T', 'C', 'C', 'G', 'T', 'A', 'T', 'C', 'C', 'T', 'A', 'T', 'G', 'T', 'T', 'G', 'A', 'C', 'G', 'C', 'G', 'T', 'A', 'C', 'C', 'G', 'G', 'T', 'T', 'A', 'T', 'T', 'T', 'C', 'G', 'G', 'G', 'C', 'A', 'G', 'G', 'A', 'C', 'C', 'C', 'C', 'G', 'C', 'T', 'T', 'C', 'A', 'C', 'G', 'G', 'T', 'A', 'C', 'G', 'G', 'C', 'A', 'A', 'G', 'A', 'T', 'C', 'T', 'A', 'G', 'A', 'C', 'G', 'A', 'C', 'G', 'C', 'T', 'A', 'G', 'C', 'G', 'G', 'C', 'C', 'C', 'C', 'G', 'G', 'G', 'G', '

### Step 2: Generating offspring 

The second step is to generate an offspring population. This offspring population will be the same size as the parental population. First, each individual will be assigned a random parent and then, assuming perfect heritability for simplicity, the parent's SNP sequence will be copied to form the offspring's sequence. 

In [4]:
# generate empty lists to store prent ID and offspring population 
offspring_parents = []
offspring_snps = []

# loop to generate offspring population 
for i in range(n_parents):
    
    # pick a random parent 
    offspring_parent_id = rd.choice(parent_id)
    
    # attach assigned parents to the actual parents list 
    offspring_parents.append(offspring_parent_id)
    
    # generate and add offspring SNPS as a perfect copy of the parents 
    offspring_snps.append(parent_snps[offspring_parent_id])

print(offspring_parents)
print(parent_id)

[5, 46, 5, 66, 72, 85, 5, 64, 5, 19, 89, 46, 80, 10, 61, 70, 94, 1, 38, 29, 82, 56, 83, 84, 43, 75, 89, 42, 29, 61, 19, 41, 97, 15, 31, 71, 35, 80, 81, 78, 75, 33, 7, 38, 72, 69, 13, 14, 15, 14, 73, 18, 76, 93, 14, 12, 47, 94, 81, 80, 66, 55, 46, 46, 97, 0, 62, 71, 82, 87, 60, 64, 84, 54, 13, 19, 65, 90, 99, 94, 60, 76, 64, 46, 96, 64, 81, 30, 71, 13, 99, 24, 90, 45, 32, 15, 29, 74, 18, 67]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


### Step 3: Reduce the number of SNPs

Now that we know how what the parental and offspring population's SNPs look like it is time to take only a fraction of them and see if the same parental relationships are recovered. The first step is determine the number of SNPs that should be removed and then randomly choose which ones are actually being removed. 

In [5]:
# determine the percentage of SNPs that should be removed 
p_removed = 0.96

# calculate the number of SNPs that are actually removed 
n_removed = round(n_snps * p_removed) # here I need to round to get a count

# define list of SNP positions 
snp_positions = list(range(n_snps))

# determine the positions that should be removed
positions_removed = rd.sample(snp_positions, n_removed)
print(positions_removed)

# initialize lists for parent and offspring snps subsampled
parent_snps_sub = []
offspring_snps_sub = []

# loop to subset snps
for i in range(n_parents):
    
    # select the parent and offspring snps 
    p_snps = parent_snps[i]
    o_snps = offspring_snps[i]
    
    for j in range(len(positions_removed)):
        
        # assign the value of X to indicate removal
        p_snps[j] = "x"
        o_snps[j] = "x"
        
    # remove elements from both lists equaling X
    p_snps = [x for x in p_snps if x != "x"]
    o_snps = [x for x in o_snps if x != "x"]
    
    # append removed info
    parent_snps_sub.append(p_snps)
    offspring_snps_sub.append(o_snps)
    
print(parent_snps)
print(parent_snps_sub)

[39, 56, 64, 67, 5, 38, 19, 57, 26, 30, 10, 1, 61, 36, 77, 32, 28, 33, 99, 47, 66, 80, 81, 20, 68, 87, 89, 60, 53, 46, 7, 55, 50, 79, 13, 31, 82, 58, 72, 44, 43, 63, 21, 35, 2, 14, 42, 83, 73, 84, 88, 76, 96, 59, 98, 15, 86, 40, 12, 51, 0, 95, 85, 16, 70, 91, 9, 75, 90, 18, 78, 37, 22, 62, 45, 34, 49, 8, 29, 3, 94, 17, 93, 48, 24, 54, 25, 4, 11, 23, 92, 69, 97, 65, 27, 52]
[['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'A', 'G', 'A', 'T'], ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 

### Step 4: Assign parentage

In [6]:
# generate an empty list to store assigned parentage
offspring_assigned_parents = []

# define a function to find the number of matches between 2 sequences
def find_matches(list_a, list_b):

    # define empty matches offspring
    matches = []

    # loop through first list 
    for i in range(len(list_a)):

        # if elements of the list match add 1 to matches if not add 0
        if list_a[i] == list_b[i]:
            matches.append(1)
        else:
            matches.append(0)
    return(sum(matches))

# find matches between each offspring and all parents and find best parent match 
for i in range(n_parents): # offspring number = parent number, could be changed in the future

    # object to store each offspring's matches
    matches = []

    # loop to find number of matches between one offspring and all parents
    for j in range(n_parents):

        # find matches between that offspring and each specific parent
        matches.append(find_matches(offspring_snps_sub[i], parent_snps_sub[j]))

    # find index for the best match
    offspring_assigned_parents.append(matches.index(max(matches)))

print(offspring_assigned_parents)
print(offspring_parents)

[5, 46, 5, 66, 72, 11, 5, 64, 5, 19, 89, 46, 72, 10, 61, 70, 94, 1, 38, 29, 82, 56, 83, 84, 43, 75, 89, 42, 29, 61, 19, 5, 97, 15, 31, 71, 3, 72, 59, 78, 75, 33, 7, 38, 72, 69, 10, 14, 15, 14, 73, 18, 76, 93, 14, 12, 47, 94, 59, 72, 66, 55, 46, 46, 97, 0, 42, 71, 82, 87, 60, 64, 84, 54, 10, 19, 65, 90, 94, 94, 60, 76, 64, 46, 44, 64, 59, 26, 71, 10, 94, 24, 90, 45, 32, 15, 29, 74, 18, 67]
[5, 46, 5, 66, 72, 85, 5, 64, 5, 19, 89, 46, 80, 10, 61, 70, 94, 1, 38, 29, 82, 56, 83, 84, 43, 75, 89, 42, 29, 61, 19, 41, 97, 15, 31, 71, 35, 80, 81, 78, 75, 33, 7, 38, 72, 69, 13, 14, 15, 14, 73, 18, 76, 93, 14, 12, 47, 94, 81, 80, 66, 55, 46, 46, 97, 0, 62, 71, 82, 87, 60, 64, 84, 54, 13, 19, 65, 90, 99, 94, 60, 76, 64, 46, 96, 64, 81, 30, 71, 13, 99, 24, 90, 45, 32, 15, 29, 74, 18, 67]


### Step 5: Determine how well was parentage assigned

In [7]:
# initialize empty list to store correctly assigned parentage
correct_parentages = []

# define function to get mean 
def mean(lst):
    return sum(lst)/len(lst)

# find percentage of parentage assigned correctly
for i in range(len(offspring_assigned_parents)):

    if offspring_assigned_parents[i] == offspring_parents[i]:
        correct_parentages.append(1) # if match do 1
    else:
        correct_parentages.append(0) # if not match do 0    
        
print(mean(correct_parentages))

0.83


In [8]:
d = {'percentage_snps': [1 - p_removed], 'correct_parentage': [mean(correct_parentages)]}
print(d)
df = pd.DataFrame(data = d)
print(df)

{'percentage_snps': [0.040000000000000036], 'correct_parentage': [0.83]}
   percentage_snps  correct_parentage
0             0.04               0.83
