## Session 1: Time complexity and data structures

In this first exercise worksheet, you will be using algorithms and data structures to gain basic insights into genetic data.

### Exercise 1: Working with k-mers

K-mers are contiguous subsequences of length k taken from a DNA sequence. The shortest possible k-mers, where k = 1, are known as _monomers_ and are just words of 1 letter. There are 4 possible monomers: A, C, T, G. Likewise, there are 16 possible k-mers with k = 2, known as _dimers_ (AA, AC, AG, AT, ...), and so on.

<img style="float: right;" alt="" src="./images/kmers.svg" />

a) How many distinct k-mers of fixed length k exist ?

b) How many k-mers of length k can we fit in a sequence of length L ?

c) Write the `get_unique_kmers` function to record all the unique k-mers found in an input DNA sequence.

d) Out of the 3 genes provided, which pair of genes share the most identical k-mers ?
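
As a sanity check for questions a) and b), the small sketch below enumerates every possible k-mer and counts how many k-length windows fit in a sequence. The helper names `all_possible_kmers` and `n_kmer_positions` are made up for this illustration and are not part of the worksheet; the sketch assumes a plain 4-letter DNA alphabet with no ambiguous bases.

```python
from itertools import product


def all_possible_kmers(k, alphabet="ACGT"):
    """Enumerate every possible k-mer over the alphabet (hypothetical helper)."""
    # product() yields all tuples of k letters, so we get len(alphabet)**k words
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def n_kmer_positions(seq_len, k):
    """Number of k-length windows in a sequence of length seq_len (hypothetical helper)."""
    # The last window starts at position seq_len - k, hence seq_len - k + 1 windows
    return max(seq_len - k + 1, 0)


print(len(all_possible_kmers(2)))         # 16 dimers, i.e. 4**2
print(n_kmer_positions(len("ACATA"), 3))  # 3 windows: ACA, CAT, ATA
```

The enumeration grows as 4^k, so it is only practical for small k; the window count is L - k + 1 whenever L >= k.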

In [1]:
# 3 genes are stored in a file in FASTA format; we load their sequences with Biopython
from Bio import SeqIO
gene1, gene2, gene3 = [str(a.seq) for a in SeqIO.parse('data/session_1.fasta', format='fasta')]


In [2]:
def get_unique_kmers(dna_string, k=3):
    """
    Add code to make this function return all unique
    k-mers of requested length in an input DNA sequence.
    
    >>> get_unique_kmers("ACATA", k=3) == {"ACA", "CAT", "ATA"}
    True
    """
    kmers = set()
    seq_len = len(dna_string)
    # Scan each position in the sequence
    for i in range(seq_len - (k - 1)):
        # Add it to our set (i.e. hash table)
        kmers.add(dna_string[i: i+k])
    
    return kmers


In [3]:

k = 8
# Compute the unique k-mer for each gene and store them in a list
kmers_genes = [get_unique_kmers(g, k=k) for g in [gene1, gene2, gene3]]

# Which pair of genes share the most k-mers ?

# Compute the shared k-mers for each combination of genes
for i, ki in enumerate(kmers_genes):
    for j, kj in enumerate(kmers_genes):
        # Skip redundant combinations (e.g. compute 1vs2 but not 2vs1)
        if i < j:
            # & is the python shortcut for set intersections (k-mers in both sets)
            shared = len(ki & kj)
            print(f"Genes {i+1} and {j+1} share {shared} k-mers")

# Can you think of a better similarity metric than this ?
# > Yes, the Jaccard index, which takes into account the total number of k-mers. Otherwise,
# > longer genes are more likely to share more k-mers.
for i, ki in enumerate(kmers_genes):
    for j, kj in enumerate(kmers_genes):
        if i < j:
            # Jaccard index is the intersect divided by the union
            # i.e.: number of common k-mers divided by the total number of distinct k-mers
            jaccard = len(ki & kj) / len(ki | kj) # &=intersect, |=union
            print(f"Jaccard index for genes {i+1} and {j+1}: {jaccard:.2f}")


# What happens if you reduce k ? Why ?
# > All genes look identical, because for a small enough k, all possible k-mers are present in all genes.

# Note: 
# Genes 1 and 2 are much more similar ! They are actually hemoglobin alpha1 from human and cat, respectively;
# gene 3 is a bacterial enzyme.


Genes 1 and 2 share 215 k-mers
Genes 1 and 3 share 17 k-mers
Genes 2 and 3 share 18 k-mers
Jaccard index for genes 1 and 2: 0.15
Jaccard index for genes 1 and 3: 0.01
Jaccard index for genes 2 and 3: 0.01
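
The Jaccard computation used above can be wrapped in a small reusable function. This is only a sketch: the name `jaccard_index` is hypothetical, and returning 0.0 for two empty sets is a convention chosen here to avoid a division by zero, not something the worksheet specifies.

```python
def jaccard_index(set_a, set_b):
    """Size of the intersection divided by size of the union (hypothetical helper)."""
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are given a similarity of 0.0
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy k-mer sets: 2 shared k-mers out of 5 distinct k-mers overall
kmers_a = {"ACA", "CAT", "ATA"}
kmers_b = {"ACA", "CAT", "ATG", "TGG"}
print(jaccard_index(kmers_a, kmers_b))  # 0.4
```

Because the union is in the denominator, the index stays between 0 and 1 regardless of gene length, which is what makes it comparable across pairs.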


### Exercise 2: Quantifying similarity

Instead of just recording presence / absence of k-mers, we can count the number of occurrences of each k-mer in a sequence. K-mer frequencies are widely used in genomics to detect contamination in sequencing libraries or estimate genomic features such as ploidy or heterozygosity.

a) Write the `get_kmers_counts` function to retrieve each k-mer in a DNA sequence along with its number of occurrences.

b) Compute the pairwise distance between the genes' k-mer counts.

c) What is the time complexity of computing the pairwise distance between two counts ?

In [4]:
from math import sqrt

def get_kmers_counts(dna_string, k=3):
    """
    Add code to make this function return the k-mer
    counts of an input DNA sequence.

    >>> get_kmers_counts("ACATAC", k=2) == {"AC": 2, "AT": 1, "CA": 1, "TA": 1}
    True
    """
    kmers_counts = dict()
    seq_len = len(dna_string)
    for i in range(seq_len - (k - 1)):
        # Python protip: We could have used a collections.defaultdict
        # object instead of a try/except statement.
        try:
            kmers_counts[dna_string[i: i+k]] += 1
        except KeyError:
            kmers_counts[dna_string[i: i+k]] = 1
        
    return kmers_counts


def pairwise_dist_kmers_counts(kmer_count1, kmer_count2):
    """
    This function should return the euclidean distance
    between two k-mer counts.
    
    >>> pairwise_dist_kmers_counts({'AA': 10, 'AC': 3}, {'AA': 1, 'AT': 2})
    9.695359714832659
    """
    # Getting the list of all k-mers present in either set
    all_kmers = set(kmer_count1.keys()) | set(kmer_count2.keys())
    dist = 0
    # Iterate over every possible k-mer
    for kmer in all_kmers:
        # Record their counts in both sets (absent means 0 occurrences)
        try:
            c1 = kmer_count1[kmer]
        except KeyError:
            c1 = 0
        try:
            c2 = kmer_count2[kmer]
        except KeyError:
            c2 = 0
        
        dist += (c2 - c1)**2
    
    dist = sqrt(dist)
    
    return dist



In [5]:
k = 4

counts = [get_kmers_counts(g, k=k) for g in [gene1, gene2, gene3]]
for a, count_a in enumerate(counts):
    for b, count_b in enumerate(counts):
        if a < b:
            dist = pairwise_dist_kmers_counts(count_a, count_b)
            print(f"The distance between genes {a+1} and {b+1} is: {dist}")


# Which are the two most similar genes ? How does changing k affect the result ? (and why ?)
# > Same as before, genes 1 and 2 are most similar. Increasing k reduces the contrast between distances:
# > if k becomes too large, the majority of k-mers are absent and distances become inaccurate.

# What's the time complexity of using pairwise_dist_kmers_counts on a pair of genes ?
# > O(n) where n = the total number of distinct k-mers in both sets

# And computing all pairwise distances between N genes ?
# > O(N^2) distance computations (N(N-1)/2 pairs), each costing O(n), i.e. O(N^2 * n) overall


The distance between genes 1 and 2 is: 27.80287754891569
The distance between genes 1 and 3 is: 58.77924803874238
The distance between genes 2 and 3 is: 60.876925020897694
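
As hinted in the comments of `get_kmers_counts`, the standard library can replace the try/except blocks. The sketch below is one possible alternative, not the worksheet's reference solution: the names `count_kmers` and `euclidean_distance` and the toy sequences are made up for illustration. It uses `collections.Counter` for counting, `dict.get` with a default of 0 for absent k-mers, and `itertools.combinations` to visit each of the N(N-1)/2 pairs exactly once.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def count_kmers(dna_string, k=3):
    """Count every k-mer in a sequence with a Counter (hypothetical helper)."""
    # One window per start position; the range is empty if the sequence is shorter than k
    return Counter(dna_string[i:i + k] for i in range(len(dna_string) - k + 1))


def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two k-mer count mappings (hypothetical helper)."""
    # Union of the keys: every k-mer seen in either sequence
    all_kmers = counts_a.keys() | counts_b.keys()
    # .get(kmer, 0) treats absent k-mers as 0 occurrences
    return sqrt(sum((counts_a.get(kmer, 0) - counts_b.get(kmer, 0)) ** 2
                    for kmer in all_kmers))


# Toy sequences standing in for the genes (illustrative data only)
toy_seqs = ["ACATAC", "ACACAC", "TTTTTT"]
toy_counts = [count_kmers(s, k=2) for s in toy_seqs]

# combinations() yields each unordered pair once: 3 pairs for 3 sequences
for (a, count_a), (b, count_b) in combinations(enumerate(toy_counts, start=1), 2):
    dist = euclidean_distance(count_a, count_b)
    print(f"The distance between sequences {a} and {b} is: {dist:.2f}")
```

Each distance is a single pass over the union of keys, so it stays linear in the number of distinct k-mers, while the number of pairs grows quadratically with the number of sequences.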


### Additional information:
    
If you are interested in k-mers, here are a few examples of how they are used in the wild:
    
* The Sequence Read Archive (SRA) is the largest public database of genomic data in the world. It uses k-mer analysis on all of its libraries to quantify their taxonomic content and detect contamination. You can view an example here by clicking on the "analysis" tab (the algorithm is described at the bottom of the page): https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11945594
* GenomeScope and Smudgeplot use k-mer profiles to predict genome length, heterozygosity and ploidy: https://www.nature.com/articles/s41467-020-14998-3
* Sourmash identifies the different organisms present in a metagenome: https://sourmash.readthedocs.io/en/latest/tutorials.html
