## Session 1: Time complexity and data structures

In this first exercise worksheet, you will be using algorithms and data structures to gain basic insights into genetic data.

### Exercise 1: Working with k-mers

K-mers are subsequences of length k in a DNA sequence. The shortest possible k-mers, where k = 1, are known as _monomers_ and are just words of 1 letter. There are 4 possible monomers: A, C, T, G. Likewise, there are 16 possible k-mers for k = 2, known as _dimers_ (AA, AC, AG, AT, ...), and so on.
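
To make the monomer and dimer examples concrete, the short sketch below enumerates every possible word of length k over the DNA alphabet for small k. The helper name `all_possible_kmers` is hypothetical and not part of the worksheet; it is only an illustration.

```python
from itertools import product

def all_possible_kmers(k, alphabet="ACGT"):
    """Return every word of length k over the alphabet (hypothetical helper)."""
    # product(alphabet, repeat=k) yields all k-tuples of letters,
    # in the order given by the alphabet string
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

monomers = all_possible_kmers(1)
dimers = all_possible_kmers(2)
print(monomers)     # ['A', 'C', 'G', 'T']
print(len(dimers))  # 16
```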

<img style="float: right;" alt="" src="./images/kmers.svg" />

a) How many distinct k-mers of fixed length k exist?

b) How many k-mers of length k (including overlapping ones) does a sequence of length L contain?

c) Write the `get_unique_kmers` function to record all the unique k-mers found in an input DNA sequence.

d) Out of the 3 genes provided, which pair of genes shares the most k-mers?

In [2]:
# The 3 genes are stored in a FASTA file; we load their sequences with Biopython
from Bio import SeqIO
gene1, gene2, gene3 = [str(a.seq) for a in SeqIO.parse('data/session_1.fasta', format='fasta')]

In [56]:
def get_unique_kmers(dna_string, k=3):
    """
    Add code to make this function return all unique
    k-mers of requested length in an input DNA sequence.
    
    >>> get_unique_kmers("ACATA", k=3)
    {"ACA", "CAT", "ATA"}
    """
    kmers = set()
    ...
    return kmers

get_unique_kmers('ACACGGT', k=2)

{'AC', 'CA', 'CG', 'GG', 'GT'}

In [70]:
k = 8
kmers_gene1 = get_unique_kmers(gene1, k=k)
kmers_gene2 = get_unique_kmers(gene2, k=k)
kmers_gene3 = get_unique_kmers(gene3, k=k)

# Which pair of genes shares the most k-mers?

# Compute the shared k-mers for each combination of genes

# Can you think of a better similarity metric than this?

# What happens if you reduce k? Why?

Genes 1 and 2 share 215 k-mers
Genes 1 and 3 share 17 k-mers
Genes 2 and 3 share 18 k-mers
Jaccard index for genes 1 and 2: 0.15
Jaccard index for genes 1 and 3: 0.01
Jaccard index for genes 2 and 3: 0.01
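
The output above reports a Jaccard index, which the worksheet does not define. Assuming the standard definition (size of the intersection divided by the size of the union of two sets), it could be computed as in the sketch below. The function name `jaccard_index` and the toy k-mer sets are illustrative assumptions, not the worksheet's solution.

```python
def jaccard_index(set_a, set_b):
    """Jaccard index |A & B| / |A | B| of two sets (hypothetical helper)."""
    union = set_a | set_b
    if not union:
        # Convention chosen here: two empty sets get a similarity of 0
        return 0.0
    return len(set_a & set_b) / len(union)

# Toy k-mer sets: 2 shared k-mers out of 7 distinct k-mers in total
kmers_a = {"AC", "CA", "CG", "GG"}
kmers_b = {"AC", "CG", "GT", "TT", "TA"}
print(jaccard_index(kmers_a, kmers_b))
```

Unlike the raw number of shared k-mers, this ratio is normalised by the total number of distinct k-mers, so it does not automatically favour longer genes.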


### Exercise 2: Quantifying similarity

Instead of just recording the presence or absence of k-mers, we can count the number of occurrences of each k-mer in a sequence. K-mer frequencies are widely used in genomics to detect contamination in sequencing libraries or to estimate genomic features such as ploidy or heterozygosity.

a) Write the `get_kmers_counts` function to retrieve each k-mer in a DNA sequence along with its number of occurrences.

b) Write the `pairwise_dist_kmers_counts` function to compute the pairwise Euclidean distance between the genes' k-mer counts.
> Note: Given two k-mer counts $C_1$ and $C_2$ over the set of k-mers $K = \{k_1, ..., k_n\}$ found in either sequence, the Euclidean distance is defined as $D_{C_1,C_2} = \sqrt{\sum_{i=1}^{n}{(C_{1,k_i} - C_{2,k_i})^2}}$, where a k-mer absent from a sequence has a count of 0.

c) Which are the two most similar genes in terms of k-mer counts?

d) What is the time complexity of computing the pairwise distance between two counts?
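
As a worked illustration of the distance in b), take the two toy counts from the docstring of `pairwise_dist_kmers_counts`: `{'AA': 10, 'AC': 3}` and `{'AA': 1, 'AT': 2}`. Assuming a k-mer missing from one count contributes 0 (which is what the expected value in that docstring implies), the three k-mers AA, AC and AT give:

$$D = \sqrt{(10 - 1)^2 + (3 - 0)^2 + (0 - 2)^2} = \sqrt{81 + 9 + 4} = \sqrt{94} \approx 9.695$$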

In [86]:
from math import sqrt

def get_kmers_counts(dna_string, k=3):
    """
    Add code to make this function return the k-mer
    counts of an input DNA sequence.

    >>> get_kmers_counts("ACATAC", k=2)
    {
       "AC": 2,
       "AT": 1,
       "CA": 1,
       "TA": 1,
    }
    """
    kmers_counts = {}
    ...
    return kmers_counts


def pairwise_dist_kmers_counts(kmer1, kmer2):
    """
    This function should return the Euclidean distance
    between two k-mer counts.
    
    >>> pairwise_dist_kmers_counts({'AA': 10, 'AC': 3}, {'AA': 1, 'AT': 2})
    9.695359714832659
    """
    dist = 0
    
    ...
    
    return dist


In [92]:
k = 4
count1 = get_kmers_counts(gene1, k=k)
count2 = get_kmers_counts(gene2, k=k)

pairwise_dist_kmers_counts(count1, count2)
#...

# Which are the two most similar genes? How does changing k affect the result, and why?

# What's the time complexity of using pairwise_dist_kmers_counts on a pair of genes?

# And of computing all pairwise distances between N genes?

The distance between genes 1 and 2 is: 281
The distance between genes 1 and 3 is: 597
The distance between genes 2 and 3 is: 583
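
For the last question above, one way to see how the work grows with the number of genes is to enumerate the unordered pairs explicitly. The sketch below uses `itertools.combinations`; the helper name `gene_pairs` and the gene labels are placeholders, not the worksheet's data or solution.

```python
from itertools import combinations

def gene_pairs(labels):
    """List every unordered pair of distinct labels (hypothetical helper)."""
    return list(combinations(labels, 2))

pairs = gene_pairs(["gene1", "gene2", "gene3"])
print(pairs)
# [('gene1', 'gene2'), ('gene1', 'gene3'), ('gene2', 'gene3')]

# Count the pairs for a growing number of placeholder genes
for n in (3, 10, 100):
    labels = [f"gene{i}" for i in range(1, n + 1)]
    print(n, len(gene_pairs(labels)))
```

Each pair still requires one distance computation, so the total cost is the number of pairs multiplied by the cost of a single distance.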


### Additional information:
    
If you are interested in k-mers, here are a few examples of how they are used in the wild:
    
* The Sequence Read Archive (SRA) is the largest public database of genomic data in the world. It runs k-mer analysis on all of its libraries to quantify their taxonomic content and detect contamination. You can view an example here by clicking on the "analysis" tab (the algorithm is described at the bottom of the page): https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11945594
* GenomeScope and Smudgeplot use k-mer profiles to estimate genome length, heterozygosity and ploidy: https://www.nature.com/articles/s41467-020-14998-3
* Sourmash identifies the different organisms present in a metagenome: https://sourmash.readthedocs.io/en/latest/tutorials.html
* PlasClass uses logistic regression on k-mer frequency vectors to predict whether a sequence originates from a plasmid or from a chromosome. It is a binary classification tool: https://github.com/Shamir-Lab/PlasClass/
