Clustering simulated sequences using TMRCA

Name: CQS21 
FYP 2024

In [None]:
# installing packages to my python environment
# NOTE: my pip is v22.0.4 but I can't seem to update to the newest 24.0

#!pip install zarr
#!pip install scipy
#!pip install scikit-allel
#!pip install matplotlib
#!pip install tqdm
#!pip install dask
#!pip install seaborn

In [18]:
# Import modules
import numpy as np
import zarr
import allel
import scipy.cluster.hierarchy as sch
import scipy.spatial
import matplotlib
import matplotlib.pyplot as plt
import scipy.signal
from scipy.ndimage import gaussian_filter1d  # the scipy.ndimage.filters namespace is deprecated
from numpy.lib.stride_tricks import sliding_window_view
from tqdm import tqdm
import dask
from dask import compute, delayed
from itertools import combinations
import time
import seaborn as sns
import sys

ModuleNotFoundError: No module named 'allel'

Step 1. Generate burn-in simulations establishing nucleotide diversity across a range of parameters (DONE BY PREVIOUS STUDENTS)

In [None]:
# see burn in simulations folder

Step 2. Run 27 variations of partial soft sweep simulations on the HPC facility and output VCF files. Complete x repeats of each variation.

Size of population, N: 100, 1000, 10000

Mutation rate, μ : (0.1/4N), (1/4N), (5/4N)

Recombination rate, r : (0.1/4N), (1/4N), (5/4N)
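
The 27 variations come from crossing the three values of N, μ and r listed above (3 x 3 x 3). A minimal sketch of how the grid could be enumerated; the variable names are my own and nothing here is taken from the actual HPC scripts:

```python
from itertools import product

population_sizes = [100, 1000, 10000]
scaled_rates = [0.1, 1, 5]  # numerators of the x/4N rates listed above

parameter_grid = []
for N, mu_scaled, r_scaled in product(population_sizes, scaled_rates, scaled_rates):
    parameter_grid.append({
        "N": N,
        "mu": mu_scaled / (4 * N),  # mutation rate
        "r": r_scaled / (4 * N),    # recombination rate
    })

print(len(parameter_grid))  # number of simulation variants
```

Each dictionary holds the per-simulation values of N, μ and r, so it could be passed to whatever script launches a SLiM run.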

Step 3. Read in VCF files from soft sweep simulations, extract haplotypes using scikit-allel

Working with 27 simulation variants and repeats of each variant (100?)

Open question: why look only at the haplotypes of the first 200 individuals? The sample frequency is calculated over 400 haplotypes because each of the 200 diploid individuals contributes 2 genomes (200 x 2 = 400).
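
To see where the 400 comes from, here is a small NumPy-only sketch with made-up random genotypes (not simulation output). It assumes diploid individuals and mimics, as far as I understand it, what to_haplotypes() does: the two alleles of each individual become two adjacent haplotype columns.

```python
import numpy as np

n_variants, n_individuals, ploidy = 5, 200, 2  # toy sizes, assumed diploid

rng = np.random.default_rng(0)
# Toy genotype array: (variants, individuals, ploidy), alleles coded 0/1
gt = rng.integers(0, 2, size=(n_variants, n_individuals, ploidy))

# Flatten the individual and ploidy axes into one haplotype axis
ht = gt.reshape(n_variants, n_individuals * ploidy)
print(ht.shape)  # 200 individuals x 2 genomes each

# Frequency of the derived allele at one (arbitrary) variant in the sample
sweep_row = ht[2]
samp_freq = sweep_row.sum() / ht.shape[1]
```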


In [21]:
seed = sys.argv[1]
# Like the Hamming distance code, this was taken from Anushka Thawani. It was adapted on the
# high-performance computer using a shell script; those adaptations are not shown here.
def convert(file, genome):
    '''
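    This function extracts haplotype sequences from a vcf file 
    Adapted from: http://alimanfoo.github.io/2018/04/09/selecting-variants.html 
    
    Arguments:
        file: name of vcf file without the .vcf extension (from SLiM soft sweep simulation)
        genome: length of genome used in SLiM simulation 
        
    Returns:
        ht: haplotype sequences for the first 200 individuals (400 haplotypes)
        pos: sorted index of variant positions
        samp_freq: frequency of sweep mutation in sample
        cols: dict mapping haplotype index to a color, used to color dendrogram
        sweep: for each haplotype, count of sweep-mutation alleles carried (nonzero if present)
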
    '''
    
    v = file + '.vcf'
    z = file + '.zarr'
    # Convert the VCF to zarr once, then read all fields back from the zarr store
    allel.vcf_to_zarr(v, z, fields='*', overwrite=True)
    data = zarr.open_group(z, mode='r')
    
 
    pos = allel.SortedIndex(data['variants/POS']) # Stores the genomic position of each variant
    
    # Extract genotypes for the first 200 individuals and convert to haplotypes
    gt = data['calldata/GT'][:,0:200] 
    ht = allel.GenotypeArray(gt).to_haplotypes()
    
    mutation = int((genome+1)/2) + 1  # position of sweep mutation
    
    
    # Output the frequency of the sweep mutation in the sample
    contains_sweep = pos.locate_range(mutation, mutation)
    sweep = ht[contains_sweep]
    sweep = np.sum(sweep, axis=0)  # per-haplotype count of sweep-mutation alleles

    n_haplotypes = ht.shape[1]  # 2 per diploid individual, so 400 for 200 individuals
    samp_freq = np.sum(sweep) / n_haplotypes


    # This dictionary is used later to color the dendrogram branches according to whether or not the
    # corresponding sequence contains the sweep mutation
    cols = {}
    for i in range(n_haplotypes):
        if sweep[i]:
            cols[i] = 'r'
        else:
            cols[i] = "#808080"
    
    return ht, pos, samp_freq, cols, sweep

Step 4. Calculate Hij (S or homozygosity) for all pairs of haplotypes
Make a function where
homozygosity = (no. of SNP differences between the pair / length of window, L)

In [3]:
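def sliding_homozygosity(length, genome, ht, sweeploc, pos):
    '''
    Arguments:
        length: length of sliding window
        genome: length of genome
        ht: haplotype array (variants x haplotypes) returned by convert()
        sweeploc: position of sweep mutation in genome (currently unused)
        pos: sorted index of variant positions returned by convert()

    Returns:
        homozygosities: list with one entry per window start; each entry is a
            condensed vector of pairwise SNP differences per base pair for all
            haplotype pairs, or None if the window contains no variants
    '''
    homozygosities = []
    for start in range(0, genome):
        end = start + length
        # Locate the variants that fall inside the window [start, end]
        try:
            region = pos.locate_range(start, end)
        except KeyError:
            # locate_range raises KeyError when no variants fall in the window
            homozygosities.append(None)
            continue
        haplotype_region = ht[region]

        # Hamming distance is the proportion of variants in the window at which a
        # pair of haplotypes differ; rescale it to differing SNPs per base pair
        pairwise_dist = allel.pairwise_distance(haplotype_region, metric='hamming', chunked=True, blen=None)
        n_variants = haplotype_region.shape[0]
        homozygosities.append(pairwise_dist * n_variants / length)

    return homozygosities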



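## Streamlining: the previous student's code says to add an if condition (region != previous region) so that
## the pairwise distances are only recomputed when the window contains a different set of variants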
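
A minimal sketch of that streamlining idea, written with plain NumPy rather than allel.pairwise_distance so it can be tried without scikit-allel. The function name and the toy data are hypothetical; the point is only the "same region as before, so reuse the result" check.

```python
import numpy as np

def windowed_pairwise_differences(ht, positions, length, genome):
    """Pairwise SNP differences per bp for each window start, reusing the
    previous result whenever the window contains the same variants."""
    ht = np.asarray(ht)
    positions = np.asarray(positions)
    i_idx, j_idx = np.triu_indices(ht.shape[1], k=1)  # all haplotype pairs i < j

    results = []
    prev_bounds, prev_value = None, None
    for start in range(genome):
        # Index range of variants with start <= position <= start + length
        lo = int(np.searchsorted(positions, start, side='left'))
        hi = int(np.searchsorted(positions, start + length, side='right'))
        bounds = (lo, hi)
        if bounds != prev_bounds:
            window = ht[lo:hi]
            # Number of sites at which each pair of haplotypes differs
            diffs = (window[:, i_idx] != window[:, j_idx]).sum(axis=0)
            prev_bounds, prev_value = bounds, diffs / length
        results.append(prev_value)
    return results

# Toy example: 3 variants x 3 haplotypes
toy_ht = np.array([[0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
toy_pos = np.array([2, 5, 9])
toy_result = windowed_pairwise_differences(toy_ht, toy_pos, length=3, genome=10)
```

Consecutive windows that contain the same variants share one result array, so the expensive pairwise comparison only runs when a variant enters or leaves the window.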

Step 5. Calculate Lij from Hij by finding width at half maximum homozygosity
where Lij is the shared haplotype length
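
A rough sketch of a width-at-half-maximum calculation on a 1D profile, assuming the profile for one pair of haplotypes has a single peak at a known index (for example at the sweep location). The function name is hypothetical and the test profile is a synthetic Gaussian, not simulation output. scipy.signal.peak_widths with rel_height=0.5 could probably be used instead.

```python
import numpy as np

def width_at_half_maximum(profile, peak_index):
    """Width (in index units) of `profile` at half of its value at `peak_index`.

    Walks outwards from the peak until the profile drops below half the peak
    value and linearly interpolates the two crossing points.
    """
    profile = np.asarray(profile, dtype=float)
    half = profile[peak_index] / 2.0

    # Walk left, then right, until the profile falls below half maximum
    left = peak_index
    while left > 0 and profile[left] >= half:
        left -= 1
    right = peak_index
    while right < len(profile) - 1 and profile[right] >= half:
        right += 1

    def crossing(i_out, i_in):
        # Interpolate between a point below half (i_out) and its inner neighbour
        y_out, y_in = profile[i_out], profile[i_in]
        if y_out >= half or y_in == y_out:
            return float(i_out)  # profile never dropped below half on this side
        return i_out + (half - y_out) / (y_in - y_out) * (i_in - i_out)

    left_cross = crossing(left, left + 1) if left < peak_index else float(left)
    right_cross = crossing(right, right - 1) if right > peak_index else float(right)
    return right_cross - left_cross

# Synthetic check: a Gaussian has FWHM = 2 * sqrt(2 * ln 2) * sigma
x = np.arange(1001)
sigma = 50.0
gaussian_profile = np.exp(-0.5 * ((x - 500) / sigma) ** 2)
fwhm = width_at_half_maximum(gaussian_profile, 500)
expected_fwhm = 2 * np.sqrt(2 * np.log(2)) * sigma
```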

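Step 6. Calculate Tij (time to the most recent common ancestor) from Lij and Kij (the number of SNPs on each shared haplotype length)

$$\tau_{ij} = \frac{k_{ij} + 1}{2\,\ell_{ij}\,(r + \mu)}$$

where τ_ij is Tij, k_ij is Kij, ℓ_ij is Lij, r is the recombination rate and μ is the mutation rate.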
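
The formula translates directly into a small function. This is only a sketch: the function name is my own, and it assumes r and μ are per base pair per generation and Lij is in base pairs. The numbers in the example are made up.

```python
def pairwise_tmrca(k_ij, l_ij, r, mu):
    """Estimate Tij = (Kij + 1) / (2 * Lij * (r + mu)).

    k_ij: number of SNPs on the shared haplotype length
    l_ij: shared haplotype length (assumed to be in base pairs)
    r, mu: recombination and mutation rates (assumed per bp per generation)
    """
    if l_ij <= 0 or (r + mu) <= 0:
        raise ValueError("l_ij and r + mu must be positive")
    return (k_ij + 1) / (2 * l_ij * (r + mu))

# Made-up example values
example_tau = pairwise_tmrca(k_ij=3, l_ij=1000, r=1e-4, mu=1e-4)
```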