# Genome Assembly

In [1]:
from collections import defaultdict

![Newspaper Analogy from CSE 181 online notes](imgs/newspaper_blowup.png)

## Overview:
Genome sequencing is difficult. We cannot simply read from one end of a chromosome to the other (at least not yet). **So instead we must cut each chromosome into many small pieces and sequence the individual pieces.** Reconstructing the original chromosomes from these small pieces, called reads, is called genome assembly.

Just like we can find overlapping parts in the newspaper scraps, we assemble the original genome by finding overlapping sequences between reads. Then we just read the entire sequence from left to right.

![Sequencing overview from CSE 181 online material](imgs/sequencing_overview.png)

Although this process seems simple, there are a few wrinkles which complicate the process:

1. DNA is double stranded, so the overlapping sequences may be on opposite strands and we have to deal with reverse complement sequences.
2. Sequencing is not perfect and errors are still common. 
3. Some regions of the genome may not be covered by any reads.

For our exercises, though, we're mostly going to ignore these problems. We can pretend that there are no missing segments or errors in our sequences and that all sequences are on the same strand.
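For reference, here is what the first wrinkle looks like in code. This is only a minimal sketch: `reverse_complement` and `COMPLEMENT` are hypothetical names chosen for illustration, and the function assumes the input contains only the uppercase letters A, C, G and T.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own 5'->3' direction."""
    # Walk the input backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("CAATC"))  # GATTG
```

A read taken from the opposite strand would only overlap our other reads after this transformation, which is why we assume a single strand for the exercises.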

## Python functions and k-mers

Intro to Python functions.

Functions are a fundamental part of programming that exist in every language. They are used in bioinformatics to run algorithms and data analyses in exactly the same way every time.

- TODO: Complete introduction
- TODO: Give a small example for people to work on. Should include string slicing, dictionaries, for loop
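One possible warm-up along the lines of the TODO above, offered only as a sketch: a function that combines string slicing, a dictionary and a for loop. The function name `count_codons` and the choice of counting codons (non-overlapping groups of three bases) are assumptions made for illustration, not part of the planned material.

```python
def count_codons(seq: str) -> dict[str, int]:
    """Count how often each non-overlapping triplet occurs in seq."""
    counts = {}
    # Step through the string three characters at a time.
    # Stopping at len(seq) - 2 ignores an incomplete triplet at the end.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]  # string slicing
        counts[codon] = counts.get(codon, 0) + 1  # dictionary update
    return counts


print(count_codons("ATGAAAATGTAG"))  # {'ATG': 2, 'AAA': 1, 'TAG': 1}
```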

TODO: Kmer introduction

In [None]:
def kmer_composition(text: str, k: int) -> list[str]:
    """
    This function should return a list containing all kmers of length k from text
    """
    return []

In [None]:
# Insert hidden test case here
k = 5
text = "CAATCCAAC"
output = ["CAATC", "AATCC", "ATCCA", "TCCAA", "CCAAC"]

## Does aligning kmers always work?

Given our set of kmers, one way to reconstruct our original sequence is to line up their overlaps. In the previous example our kmers overlap as shown:

``` python
CAATC
 AATCC
  ATCCA
   TCCAA
    CCAAC
```
Each column corresponds to a single nucleotide and we can reconstruct the sequence by reading from left to right.

``` python
CAATCCAAC
```
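Reading the columns from left to right amounts to taking the first kmer in full and then appending the last character of every following kmer. A minimal sketch, assuming the kmers are already in the correct order and each one overlaps the next by k-1 characters (`spell_from_ordered_kmers` is a hypothetical helper name):

```python
def spell_from_ordered_kmers(kmers: list[str]) -> str:
    """Rebuild a sequence from kmers that are already in overlap order."""
    if not kmers:
        return ""
    # The first kmer contributes all of its characters;
    # every later kmer contributes only its final character.
    return kmers[0] + "".join(kmer[-1] for kmer in kmers[1:])


ordered = ["CAATC", "AATCC", "ATCCA", "TCCAA", "CCAAC"]
print(spell_from_ordered_kmers(ordered))  # CAATCCAAC
```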


More concretely, each k-mer has a **prefix** (its first k-1 characters) and a **suffix** (its last k-1 characters). Saying that k-mer A overlaps k-mer B is then equivalent to saying that the suffix of k-mer A is identical to the prefix of k-mer B.

For example let kmer_A and kmer_B be defined below:
``` python
kmer_A = "CAATC"
kmer_B = "AATCC"
```

What are the prefix and suffix of kmer_A? Of kmer_B?

In [4]:
# TODO: insert answers here
kmer_A_prefix = ""
kmer_A_suffix = ""

kmer_B_prefix = ""
kmer_B_suffix = ""

Which of the following statements will print True?

In [None]:
print("Case A:", kmer_A_prefix == kmer_B_prefix)
print("Case B:", kmer_A_prefix == kmer_B_suffix)
print("Case C:", kmer_A_suffix == kmer_B_prefix)
print("Case D:", kmer_A_suffix == kmer_B_suffix)

Now try assembly in practice to get an intuition for our goal. 

Consider this set of 3-mers. Take a few minutes and try to reconstruct the original sequence. Type your answer below.
``` python
AAT AAT ATG ATT GAA TAA TGA TTA
```

Try adding in the overlapping segments here like we saw before:
``` python
# TODO: 
SEQA
 SEQB
  SEQC
   ...

# Solution:
```

In [2]:
#TODO hide solution
original_seq = "TAATGAATTA"

You may notice that connecting the overlapping pieces doesn't always work. Consider this example:
``` python
TAA
 AAT
  ATT
   TTA
```
After assembling 4 kmers we are stuck: the only kmer that overlaps TTA is TAA, which we have already used. This happens because one kmer, AAT, appears twice. After the first AAT we had to guess whether ATG or ATT comes next, and ATT was the wrong guess at that point.
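The dead end can be reproduced with a small greedy extension. This is only a sketch under stated assumptions: `greedy_extend` is a hypothetical helper, each kmer may be used only as many times as it appears in the input, and the `choose` argument stands in for our "guess" whenever more than one kmer fits.

```python
from collections import Counter


def greedy_extend(start: str, kmers: list[str], choose) -> list[str]:
    """Extend a path one overlapping kmer at a time until no unused kmer fits."""
    unused = Counter(kmers)  # how many copies of each kmer are still available
    unused[start] -= 1
    path = [start]
    while True:
        suffix = path[-1][1:]
        # Unused kmers whose prefix matches the suffix of the last kmer.
        candidates = sorted(
            kmer for kmer, count in unused.items()
            if count > 0 and kmer[:-1] == suffix
        )
        if not candidates:
            return path
        nxt = choose(candidates)  # this is where we have to guess
        unused[nxt] -= 1
        path.append(nxt)


kmers = ["AAT", "AAT", "ATG", "ATT", "GAA", "TAA", "TGA", "TTA"]
print(greedy_extend("TAA", kmers, max))  # ['TAA', 'AAT', 'ATT', 'TTA']
print(len(greedy_extend("TAA", kmers, min)))  # 8
```

Guessing ATT after the first AAT strands four kmers, while guessing ATG happens to use all eight. With only one repeat the right guess is easy to find, but every additional repeat multiplies the number of combinations to try.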

Recall that our genomes contain millions of repeated segments. Testing every possible combination of guesses is extremely slow, even with modern computing power.

Instead we must come up with a more clever approach to solve this problem.

## Graph introduction, Overlap Graphs and de Bruijn graphs

TODO

1. Introduction to directed graphs.
    1. A graph is a set of nodes and edges.
        1. Also introduce the in-degree and out-degree of a node.
    2. We represent a directed graph in Python as a dictionary mapping each node to the nodes it connects to. (practice this on trivial case probably)
2. Then introduce overlap graphs.
    1. Practice constructing an overlap graph from a set of kmers and reading the original sequence back from it, especially when the nodes are not in the correct order.
    2. The original sequence in an overlap graph is a path which visits all nodes exactly once.
    3. Finding this path (a Hamiltonian path) is slow, so instead we use a slightly different graph
3. Then introduce de Bruijn graph == (compress identical nodes into one)
    1. Then the original sequence in this graph is a path that visits every edge. (Eulerian path)
    2. Practice reading it out.
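As a starting point for items 1 and 2 of the outline, here is a rough sketch of an overlap graph stored as a dictionary, together with the in-degree and out-degree of each node. The names `build_overlap_graph` and `degrees` are hypothetical, and the sketch assumes all kmers are distinct (repeated kmers would collide as dictionary keys).

```python
from collections import defaultdict


def build_overlap_graph(kmers: list[str]) -> dict[str, list[str]]:
    """Add an edge a -> b whenever the suffix of a equals the prefix of b."""
    graph = defaultdict(list)
    for i, a in enumerate(kmers):
        for j, b in enumerate(kmers):
            if i != j and a[1:] == b[:-1]:
                graph[a].append(b)
    return graph


def degrees(graph: dict[str, list[str]], nodes: list[str]):
    """Return (in_degree, out_degree) dictionaries for the given nodes."""
    out_degree = {node: len(graph.get(node, [])) for node in nodes}
    in_degree = {node: 0 for node in nodes}
    for neighbors in graph.values():
        for neighbor in neighbors:
            in_degree[neighbor] += 1
    return in_degree, out_degree


# The kmers from the earlier example, deliberately out of order.
kmers = ["ATCCA", "CAATC", "CCAAC", "AATCC", "TCCAA"]
graph = build_overlap_graph(kmers)
in_degree, out_degree = degrees(graph, kmers)
print(dict(graph))
print(in_degree["CAATC"], out_degree["CCAAC"])  # 0 0
```

The node with in-degree 0 (CAATC) is where the original sequence starts and the node with out-degree 0 (CCAAC) is where it ends, even though the input list was shuffled.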

Even though we introduced de Bruijn graphs from overlap graphs, we can construct them directly from a set of kmers. We don't need the original sequence at all.

`For each kmer, add an edge from its prefix to its suffix to our graph.`

For example:
``` python
de_bruijn_graph = defaultdict(list)
kmer = "ACGT"
de_bruijn_graph["ACG"].append("CGT")
```

Try implementing a function to build a de Bruijn graph from a list of kmers. I've started the function below for you:

In [None]:
def build_de_bruijn(kmers: list[str]) -> dict[str, list[str]]:
    de_bruijn_graph = defaultdict(list)
    for kmer in kmers:
        # TODO: Your solution here
        pass  # placeholder so the cell runs before the solution is filled in
    return de_bruijn_graph

In [6]:
# TODO hide test function for de_bruijn graph
"""
Sample input k and string
3
ACGTGTATA
"""
"""
Sample output edge list
AC: CG
AT: TA
CG: GT
GT: TA TG
TA: AT
TG: GT
"""

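To compare a graph against the sample edge list, it helps to print it in the same layout. The helper below is a sketch only: `format_edge_list` is a hypothetical name, and `sample_graph` is typed in by hand from the sample output rather than produced by `build_de_bruijn`.

```python
def format_edge_list(graph: dict[str, list[str]]) -> str:
    """Render a graph as 'node: neighbor neighbor' lines in sorted order."""
    lines = []
    for node in sorted(graph):
        # Sort the neighbors too, so the output does not depend on insertion order.
        neighbors = " ".join(sorted(graph[node]))
        lines.append(f"{node}: {neighbors}")
    return "\n".join(lines)


# Hand-written graph for the 3-mers of ACGTGTATA.
sample_graph = {
    "AC": ["CG"],
    "CG": ["GT"],
    "GT": ["TG", "TA"],
    "TG": ["GT"],
    "TA": ["AT"],
    "AT": ["TA"],
}
print(format_edge_list(sample_graph))
```

Sorting both nodes and neighbors means two graphs with the same edges always print identically, which makes a plain string comparison enough for a test.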

## Finding genomes in de Bruijn graphs

TODO: 
1. Introduce Eulerian Cycle
    1. Recall we want to find a path that visits every edge exactly once (Eulerian path).
    2. One of the challenges is that we don't even know where to start.
    3. Idea: if we connect the last node back to the first node with an extra edge, we can start looking from anywhere. --> Eulerian cycle
2. We can identify the first and last node because they are the only nodes with an odd number of edges (their in-degree and out-degree differ by one).
    1. Ant traveling along a graph analogy. If a node has an odd number of edges, the ant must either start or finish there. At every other node, each edge the ant arrives on is paired with an edge it leaves on.
    2. We can have them implement this part in an example
3. Then we can use efficient algorithms for finding an Eulerian cycle
    1. Hierholzer's Algorithm, may have time to explain with some figures but definitely we won't be implementing it.
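The outline above can be sketched in code as follows. This is a rough illustration rather than the planned exercise: the function names are hypothetical, the graph is typed in by hand from the sample edge list for ACGTGTATA, and the code assumes the graph contains exactly one start node and one end node. Instead of adding an extra edge to form a cycle, the sketch runs a Hierholzer-style walk directly from the start node.

```python
from collections import defaultdict


def find_path_endpoints(graph: dict[str, list[str]]) -> tuple[str, str]:
    """Return (start, end): the nodes with one extra outgoing / incoming edge."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, neighbors in graph.items():
        balance[node] += len(neighbors)
        for neighbor in neighbors:
            balance[neighbor] -= 1
    start = next(node for node, b in balance.items() if b == 1)
    end = next(node for node, b in balance.items() if b == -1)
    return start, end


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer-style walk that uses every edge exactly once."""
    remaining = {node: list(neighbors) for node, neighbors in graph.items()}
    start, _ = find_path_endpoints(graph)
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow an unused edge out of the current node.
            stack.append(remaining[node].pop())
        else:
            # No unused edges left here: the node is finished.
            path.append(stack.pop())
    return path[::-1]


def spell(path: list[str]) -> str:
    """Read the sequence out of a path of overlapping nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = {"AC": ["CG"], "AT": ["TA"], "CG": ["GT"],
         "GT": ["TA", "TG"], "TA": ["AT"], "TG": ["GT"]}
print(find_path_endpoints(graph))  # ('AC', 'TA')
print(spell(eulerian_path(graph)))  # ACGTGTATA
```

Because every edge is pushed and popped once, the walk takes time proportional to the number of edges, which is what makes this approach practical compared with trying every ordering.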
