# De Bruijn graphs

---
## Before Class
1. Review slides on De Bruijn graphs

---
## Learning Objectives
1. Understand and implement De Bruijn graphs for assembly


---
## De Bruijn graphs

In class today we will implement one of the primary algorithms currently used to assemble short-read data. We will implement a simple form of the algorithm that assumes perfect sequencing: everything is sequenced exactly once, and there are no errors or variants in the sequencing.

A graph is composed of nodes and edges, so we need a data structure to track the edges between nodes in our graph. We have provided the basic class structure as well as descriptions of the `add_edge` and `remove_edge` functions. You will need to implement these functions before you can build the de Bruijn graph. In our implementation below, we use a `defaultdict` to hold the edges: each "left" node is a key, and its value is the list of all "right" nodes that it connects to.
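As a minimal sketch of that storage scheme, separate from the class below, the following shows how a `defaultdict(list)` behaves as an adjacency list. The variable name `edges` and the DNA strings are illustrative assumptions, not part of the provided code.

```python
from collections import defaultdict

# Hypothetical stand-alone adjacency list (not the class attribute used below).
edges = defaultdict(list)

# Accessing a missing key creates an empty list first, so no key check is needed.
edges["AT"].append("TT")
edges["AT"].append("TC")
edges["AT"].append("TT")  # repeated edges are kept because lists allow duplicates

print(dict(edges))  # {'AT': ['TT', 'TC', 'TT']}

# list.remove deletes only the first matching entry.
edges["AT"].remove("TT")
print(dict(edges))  # {'AT': ['TC', 'TT']}
```

Keeping duplicates matters here because a k-mer that occurs more than once in the input contributes one edge per occurrence.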

```
build_debruijn_graph:
define substring length k and input string
For each k-length substring of input:
  split k mer into left and right k-1 mer
  add k-1 mers as nodes with a directed edge from left k-1 mer to right k-1 mer
```
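The pseudocode above can be sketched as a stand-alone function, independent of the class that follows. The name `kmer_edges` is a hypothetical helper introduced only for illustration. It returns the edges as a list of pairs so that the count is easy to check: a string of length n contains n - k + 1 k-mers, and each k-mer contributes exactly one edge.

```python
def kmer_edges(text, k):
    """Return one (left, right) pair of k-1 mers for every k-mer in text.

    Hypothetical helper for illustration; not part of the class in this notebook.
    """
    edges = []
    # Start positions 0 .. len(text) - k each yield a full k-length substring.
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        # Left k-1 mer drops the last character, right k-1 mer drops the first.
        edges.append((kmer[:-1], kmer[1:]))
    return edges


edges = kmer_edges("ACGTAC", 4)
print(edges)
# [('ACG', 'CGT'), ('CGT', 'GTA'), ('GTA', 'TAC')]
print(len(edges) == len("ACGTAC") - 4 + 1)  # True
```

If the string is shorter than `k`, the range is empty and the function returns an empty list.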


In [1]:
from collections import defaultdict
import random

class DeBruijnGraph():
    """Main class for De Bruijn graphs
    
    Private Attributes:
        graph (defaultdict of lists): Edges for De Bruijn graph
        first_node (str): starting position for traversing the graph
    """

    def __init__(self, input_string, k):
        self.graph = defaultdict(list)
        self.first_node = ''
        self.build_debruijn_graph(input_string, k)
        
    def add_edge(self, left_kmer, right_kmer):
        ''' This function adds a new edge to the graph
        
        Args:
            left_kmer (str): The k-1 mer for the left edge
            right_kmer (str): The k-1 mer for the right edge

        Updates graph attribute to add right_kmer to the list stored under left_kmer in defaultdict
        '''
        # left_kmer is the dictionary key; its value is the list of all right k-1 mers it points to.
        # defaultdict(list) creates an empty list the first time a key is accessed,
        # so append also works for a left_kmer that has not been seen before.
        # Example: after add_edge("ATT", "TTC"), graph["ATT"] == ["TTC"].
        self.graph[left_kmer].append(right_kmer)
        
        
    def remove_edge(self, left_kmer, right_kmer):
        ''' This function removes an edge from the graph
        
        Args:
            left_kmer (str): The k-1 mer for the left edge
            right_kmer (str): The k-1 mer for the right edge

        Updates graph attribute to remove right_kmer from the list stored under left_kmer in defaultdict
        '''
        # list.remove deletes only the first matching entry and raises ValueError if there is none.
        # graph[left] == [right]                 becomes graph[left] == []
        # graph[left] == [right, "AATT", "GCCG"] becomes graph[left] == ["AATT", "GCCG"]
        self.graph[left_kmer].remove(right_kmer)
        

    def build_debruijn_graph(self, input_string, k):
        ''' This function builds a De Bruijn graph from a string
        
        Args:
            input_string (str): string to use for building the graph
            k (int): k-mer length for graph construction

        Updates graph attribute to add all valid edges from the string
        
        Example:
        >>> dbg = DeBruijnGraph("this this this is a test", 4)
        >>> print(dbg.graph) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
        defaultdict(<class 'list'>, {'thi': ['his', 'his', 'his'], 'his': ['is ', 'is ', 'is '], ...)
        '''
        
        # A string of length n has n - k + 1 k-mers. range() excludes its stop value,
        # so the last k-mer starts at index n - k and ends at the end of the string.
        for i in range(len(input_string)-k+1):
            kmer = input_string[i:i+k]  # k-length substring (k-mer) starting at i
            # Split the k-mer into left and right k-1 mers; slice the k-mer itself, not input_string.
            left_kmer = kmer[0:k-1]   # drop the rightmost character
            right_kmer = kmer[1:k]    # drop the leftmost character
            
            self.add_edge(left_kmer, right_kmer)
      

In [2]:
d = defaultdict(list)

for i in range(10):
    d[i].append(i)  # i is the key; its missing list is created on first access, then i is appended
print(d)

defaultdict(<class 'list'>, {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]})


In [3]:
# define a defaultdict whose values are lists
d = defaultdict(list)

for i in range(5):
    d[i].append(i)
print("dictionary with values as list:")
print(d)

dictionary with values as list:
defaultdict(<class 'list'>, {0: [0], 1: [1], 2: [2], 3: [3], 4: [4]})


In [4]:
d = defaultdict(list)

seq = "AATTCTT"
k = 3

edges = []
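for i in range(len(seq) - k + 1):  # only start positions that yield a full k-mer
    kmer = seq[i:i+k]
    # slice relative to the k-mer itself, not to position i in seq
    edges.append((kmer[:k-1], kmer[1:]))
print(edges)

[('AA', 'AT'), ('AT', 'TT'), ('TT', 'TC'), ('TC', 'CT'), ('CT', 'TT')]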


In [5]:
graph = DeBruijnGraph("fool me once shame on shame on you fool me", 4)
graph.graph

defaultdict(list,
            {'foo': ['ool', 'ool'],
             'ool': ['ol ', 'ol '],
             'ol ': ['l m', 'l m'],
             'l m': [' me', ' me'],
             ' me': ['me '],
             'me ': ['e o', 'e o', 'e o'],
             'e o': [' on', ' on', ' on'],
             ' on': ['onc', 'on ', 'on '],
             'onc': ['nce'],
             'nce': ['ce '],
             'ce ': ['e s'],
             'e s': [' sh'],
             ' sh': ['sha', 'sha'],
             'sha': ['ham', 'ham'],
             'ham': ['ame', 'ame'],
             'ame': ['me ', 'me '],
             'on ': ['n s', 'n y'],
             'n s': [' sh'],
             'n y': [' yo'],
             ' yo': ['you'],
             'you': ['ou '],
             'ou ': ['u f'],
             'u f': [' fo'],
             ' fo': ['foo']})

Expected output (its nodes are 5 characters long, which corresponds to `k = 6`; the call above uses `k = 4`):
defaultdict(<class 'list'>, {'fool ': ['ool m', 'ool m'], 'ool m': ['ol me', 'ol me'], 'ol me': ['l me '], 'l me ': [' me o'], ' me o': ['me on'], 'me on': ['e onc', 'e on ', 'e on '], 'e onc': [' once'], ' once': ['once '], 'once ': ['nce s'], 'nce s': ['ce sh'], 'ce sh': ['e sha'], 'e sha': [' sham'], ' sham': ['shame', 'shame'], 'shame': ['hame ', 'hame '], 'hame ': ['ame o', 'ame o'], 'ame o': ['me on', 'me on'], 'e on ': [' on s', ' on y'], ' on s': ['on sh'], 'on sh': ['n sha'], 'n sha': [' sham'], ' on y': ['on yo'], 'on yo': ['n you'], 'n you': [' you '], ' you ': ['you f'], 'you f': ['ou fo'], 'ou fo': ['u foo'], 'u foo': [' fool'], ' fool': ['fool ']})

In [18]:
import doctest
doctest.testmod()

TestResults(failed=0, attempted=2)