-
Notifications
You must be signed in to change notification settings - Fork 893
/
slides.html
954 lines (655 loc) · 32.1 KB
/
slides.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
---
layout: tutorial_slides
logo: "GTN"
class: enlarge120
title: "Unicycler Assembly"
zenodo_link: "https://doi.org/10.5281/zenodo.940733"
tags:
- prokaryote
questions:
- "I have short reads and long reads. How do I assemble a genome?"
objectives:
- "Perform Quality Control on your reads"
- "Perform a Small genome Assembly with Unicycler"
- "Evaluate the Quality of the Assembly with Quast"
- "Annotate the assembly with Prokka"
time_estimation: "4h"
key_points:
- "We learned about the strategies used by assemblers for hybrid assemblies"
- "We performed an hybrid assembly of a bacterial genome and its annotation"
- "Unicycler is a pipeline bases on Spades and Pilon dedicated to hybrid assembly of Small genomes"
- "Combination of short and long reads helped us produce an almost perfect assembly"
contributors:
- nekrut
- delphine-l
- slugger70
---
# Small Genome Assembly With Unicycler
---
.enlarge120[
# **Introduction to Genome Assembly**
]
To start on Genome assembly you can read the previous assembly tutorials:
* [Introduction to Genome Assembly]({% link topics/assembly/tutorials/general-introduction/slides.html %})
* [De Bruijn Graph Assembly]({% link topics/assembly/tutorials/debruijn-graph-assembly/slides.html %}#46)
---
.enlarge120[
# **Assembly basics : Challenges of genome (and transcriptome) assembly**
]
[![Graphic shown three different assemblies, all tangled knots of sequences that will be very difficult to resolve.](../../images/illumina_graph_comparison.png)](https://github.com/rrwick/Unicycler)
Genome assembly is a difficult task. In trying to explain it we will be relying on two highly regarded sources:
- [Ben Langmead's Teaching Materials](http://www.langmead-lab.org/teaching-materials/)
- [Pevzner and Compeau Bioinformatics Book](https://www.goodreads.com/book/show/22033056-bioinformatics-algorithms).
---
.enlarge120[
# **Genomes and reads: Strings and *k*-mers**
]
## *k*-mer composition
Genomes are strings of text. When we sequence genomes we use sequencing machines that generate reads. For now let's assume that all reads have the same length *k* and every *k*-mer is sequenced only once. We will relax these assumptions later in this lecture. Thus sequencing a genome generates a large list of *k*-mers.
Suppose we are dealing with a *very* short genome `TATGGGGTGC`. Its *k*-mer composition (note the subscript) **Composition_k(Text)** is the collection of all *k*-mer substrings (including repeated ones). When *k* = 3 we get (basically we split sequence into windows of length 3 sliding window by 1 base every time):
<samp>
Composition_3(TATGGGGTGC)= ATG,GGG,GGG,GGT,GTG,TAT,TGC,TGG
<samp>
---
<samp>
Composition_3(TATGGGGTGC)= ATG,GGG,GGG,GGT,GTG,TAT,TGC,TGG
<samp>
Note that we have listed *k*-mers in lexicographic order (i.e., how they would appear in a dictionary) rather than in the order of their appearance in <code>TATGGGGTGC</code>. We have done this because the correct ordering of the reads is unknown when they are generated (i.e., a sequencing machine does not generate reads in any particular order).
---
## Assembly by overlap
In the example above we know what the "genome" sequence is. In real life we don't know that and our goal is to determine genome sequence given a scrambled collection of *k*-mers. Let's consider the following collection of 3-mers representing a hypothetical genome:
<code> AAT ATG GTT TAA TGT </code>
Let's "tile" k-mers if they overlap in k-1 nucleotides:
```
TAA
AAT
ATG
TGT
GTT
-------
TAATGTT
```
---
Now let's apply it to slightly longer "genome" with the following 3-mer composition sorted in a lexicographic order:
<code>
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
</code>
`TAA` looks like a great beginning and we are continuing:
```
1 TAA
2 AAT
3 ATG
4 TGT
5 GTT
-------
TAATGTT
```
There is nothing in the original 3-mer composition, which starts with `TT`.
---
Let's track back and instead of `TGT` in step 4 insert `TGC`:
```
1 TAA
2 AAT
3 ATG
4 TGC
5 GCC
6 CCA
7 CAT
8 ATG
9 TGG
10 GGA
11 GAT
12 ATG
13 TGT
14 GTT
----------------
TAATGCCATGGATGTT
```
We only used 14 3-mers from the total of 15, so our genome is shorter (we have extra parts!). This difficulty is related to the fact that there are three repeated `ATG` motifs in this genome and as a result each `ATG` can be extended by either `TGG`, `TGC`, or `TGT`.
---
## The concept of coverage
*Coverage* is the number of reads covering a particular position in the genome. For example, in the following case:
```
TAA
AAT
ATG <- "reads" (15 bases total)
TGT
GTT
-------
TAATGTT <- "genome" (7 bases)
-------
0123456
```
The *Coverage* at positions 1 and 6 is *1*, at positions 1 and 5 is *2*, and at position 2, 3, and 4 is *3*. <br>The *Average Coverage* will be 15/7 ~ 2x
---
Below is another, slightly more realistic example where average coverage is 177/35 ~ 7x:
```
CTAGGCCCTCAATTTTT
CTCTAGGCCCTCAATTTTT
GGCTCTAGGCCCTCATTTTTT
CTCGGCTCTAGCCCCTCATTTT
TATCTCGACTCTAGGCCCTCA <- 177 bases
TATCTCGACTCTAGGCC
TCTATATCTCGGCTCTAGG
GGCGTCTATATCTCG
GGCGTCGATATCT
GGCGTCTATATCT
-----------------------------------
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT <- 35 bases
-----------------------------------
| | | | |
0 10 20 30 34
```
---
# The First and the Second laws of assembly
The goal of assembly process is to reconstruct an unknown genome sequence given a collection of scrambled sequencing reads:
```
CTAGGCCCTCAATTTTT
CTCTAGGCCCTCAATTTTT
GGCTCTAGGCCCTCATTTTTT
CTCGGCTCTAGCCCCTCATTTT
TATCTCGACTCTAGGCCCTCA <- Reads (Given)
TATCTCGACTCTAGGCC
TCTATATCTCGGCTCTAGG
GGCGTCTATATCTCG
GGCGTCGATATCT
GGCGTCTATATCT
-----------------------------------
??????????????????????????????????? <- Genome (Unknown)
```
> **The goal of assembly process**. Given sequencing reads reconstruct underlying genome sequence.
We've seen that this can (in principle) be accomplished by finding overlaps. We also discussed the concept of the coverage. We can now formulate the two first assembly laws.
---
## The First Assembly Law: Overlaps imply co-location
Let's define terms **Prefix** and **Suffix** using string <code>TAA</code> as an example:
* <code>Prefix(TAA) = TA</code>
* <code>Suffix(TAA) = AA</code>
---
The First law states that if a *suffix* of one read is similar to a *prefix* of another read...
```
TCTATATCTCGGCTCTAGG <- read 1
||||||| |||||||
TATCTCGACTCTAGGCC <- read 2
```
...then they may overlap (may be derived from the same location) within the genome.
```
TCTATATCTCGGCTCTAGG <- read 1
-------------------------------------
AGCGTTCTATATCTCGGCTCTAGGCCGTGCAGGACGT <- genome
-------------------------------------
TATCTCGACTCTAGGCC <- read 2
```
---
Note that in the above example suffix of the first read is *not* exactly identical to the prefix of the second read: they differ by a G-to-A substitution. Such differences are quite common in real life and may be caused by:
* **sequencing errors** - experimental or computational artifacts of DNA sequencing procedures.
* **allelic differences** - organisms such as human are diploid (and others, such as wheat are hexaploid) which maternal and paternal genomes being different at a number of genomic sites.
* **polymorphic sites** - DNA that is being sequenced is usually isolated from a large number of cells (e.g., white blood cells) or individuals (bacterial and viral cultures). Natural variation present in these cell (or viral particle) populations will manifest itself as these differences.
---
## The Second Assembly Law: The higher the coverage, the better
The Second law states that higher coverage leads to more frequent and longer overlaps:
```
CTAGGCCCTCAATTTTT
TATCTCGACTCTAGGCCCTCA <- Low coverage
GGCGTCTATATCT
-----------------------------------
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT <- Genome
-----------------------------------
CTAGGCCCTCAATTTTT
CTCTAGGCCCTCAATTTTT
GGCTCTAGGCCCTCATTTTTT
CTCGGCTCTAGCCCCTCATTTT
TATCTCGACTCTAGGCCCTCA <- Higher coverage
TATCTCGACTCTAGGCC
TCTATATCTCGGCTCTAGG
GGCGTCTATATCTCG
GGCGTCGATATCT
GGCGTCTATATCT
```
---
# Solving assembly problem with graphs
We can solve assembly challenge using overlaps between sequencing reads. However, to solve this problem effectively we need to first represent all overlaps in a way that would facilitate further analysis. *Directed graphs* help achieving this.
---
## Directed graphs
Finding overlaps is identical to building a *directed graph* where directed *edges* connect *nodes* representing overlapping reads:
![Cartoons of two sequences aligned with nearly perfect matches. On the right are a list of nodes representing sequences. The two from the alignment are linked with an arrow and the text reads Suffix of source is similar to prefix of sink.](../../images/dag.png)
>**Directed graph** representing overlapping reads. (Image from [Ben Langmead](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf)).
---
For example, the string reconstruction we have seen earlier (with the difference of inserting `GGG` in line 10):
```
1 TAA
2 AAT
3 ATG
4 TGC
5 GCC
6 CCA
7 CAT
8 ATG
9 TGG
10 GGG
11 GGA
12 GAT
13 ATG
14 TGT
15 GTT
-----------------
TAATGCCATGGGATGTT
```
can be represented as a following directed graph (or genome path):
![Graph where each of the above 3-mers is a single node in a graph and they are linked together from left to right most 3-mer.](../../images/4.6.png)
**Genome path**. Trimers composing the <code>TAATGCCATGGGATGTT</code> sequence represented as the "genome" path. (Fig. 4.6 from [CP](http://bioinformaticsalgorithms.com/)). In this path a suffix of a 3-mer is equal to prefix of the next 3-mer.
---
**However**, we do not know the actual genome! All we have in real life is a collection of reads. Let's first build an overlap graph by connecting two 3-mers if suffix of one is equal to the prefix of the other:
>![Th above graph is expanded to include a lot more connections between any nodes that share 2 base overlap.](../../images/4.7.png)
>
>**Overlap graph**. All possible overlap connections for our 3-mer collection. (Fig. 4.7 from [CP](http://bioinformaticsalgorithms.com/))
So to determine the sequence of the underlying genome we are looking a path in this graph that visits every node (3-mer) once. Such path is called [Hamiltonian path](https://en.wikipedia.org/wiki/Hamiltonian_path) and it may not be unique.
---
For example for our 3-mer collection there are two possible Hamiltonian paths:
![The above graph is re-used but a single path is highlighted](../../images/4.9a.png)
![The same graph with a different path highlighted.](../../images/4.9b.png)
**Two Hamiltonian paths for the 15 3-mers**. Edges spelling "genomes" <code>TAATGCCATGGGATGTT</code> and <code>TAATGGGATGCCATGTT</code> are highlighted in black. (Fig. 4.9. from [[CP](http://bioinformaticsalgorithms.com/)](http://bioinformaticsalgorithms.com/)).
---
The reason for this "duality" is the fact that we have a *repeat*: 3-mer <code>ATG</code> is present three times in our data (<font color="green">green</font>, <font color="red">red</font>, and <font color="blue">blue</font>). As we will see later repeats cause a lot of trouble in genome assembly.
---
## Finding overlaps
In the example above we had a collection of 3-mers and were always looking for overlaps of length two. In real life things may not be so "regular". Suppose we have two reads:
```
Read X CTCTAGGCC
Read Y TAGGCCCTC
```
What is the overlap between these two reads? For now we will define overlap of <code>length - l</code> suffix of Read X matches <code>length - l</code> prefix of Read Y, where <code>l</code> is given. To find these overlap we look in Read Y for instances <code>length - l</code> suffix of Read X.
---
We will start with some minimal match of length $k$. Once a match is found it will be extended to the left to verify that the entire prefix of Read Y matches:
>![Two sequences are shown X and Y. The X sequence ends with the string GCC. In Y they start looking from the end until they find a GCC. In this case we confirm a length-6 prefix of Y matches a suffix of X.](../../images/find_overlap.png)
>
>**Finding overlaps** between Read X and Read Y (Image from [Ben Langmead](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf)).
---
As a result we represent two reads are connected nodes:
>
>![Node labelled with one sequence points to another. The sequences are the ones from above with the number 6 on the arrow between them.](../../images/og1.png)
>
>Number above the edge shows the length of the overlap.
While with just two reads the problem may seen quite straightforward. Let now consider a set of reads representing a very short genome <code>GTACGTACGAT</code>:
```
GTACGT
TACGTA
CGTACG
ACGTAC
GTACGA
TACGAT
```
---
Building an overlap graph with overlap of <code>length >= 4</code> will give us the following:
>![The overlap graph now has longer sequences and numbers annotating the length of the overlap](../../images/og2.png)
>
>You can see that there is a path through this graph that would spell out the original genome sequence <code>GTACGTACGAT</code> :
>
>![The optimal, highest scoring path is represented through this graph](../../images/og3.png)
>
>Here we are lucky enough to have all nodes having a single outgoing edge with the highest number (the length of overlap).
---
## The Shortest Common Superstring Problem
The problem of reconstructing genome using the overlap graph that we have just illustrated can be initially formulated as the *Shortest Common Superstring (SCS)* problem. It states: *given a collection of strings S, find SCS(S), which is the shortest string that contains all strings from the set S as substrings*.
For simplicity let's suppose that we have the following set of strings <code>S</code>:
<samp>
BAA AAB BBA ABA ABB BBB AAA BAB
</samp>
One way of getting a string that would contain all of these as substrings will simply be concatenating them:
<samp>
Concat(S): BAAAABBBAABAABBBBBAAABAB (length = 24)
</samp>
This, however, is not the *shortest* superstring that contains all strings from $S$. Instead the SCS is (just trust us here):
<samp>
SCS(S):: AAABBBABAA (length = 10)
</samp>
---
It looks like finding SCS for a set of sequencing reads may just be what we need to produce a genome assembly. But how can this work in practice? One potential idea is to order the strings in some way and "reduce" them into a superstring (following examples are from Ben Langmead):
>![A sequence of As and Bs in groups of 3s is labelled order 1. Below the sequence is written AAA, the current superstring.](../../images/scs1.png)
>Let's look at the first two strings. They can be "reduced" to `AAAB`:
>
>![Now AAAB is written below the sequence because of the overlap in the first two groups.](../../images/scs2.png)
>
>The next two add an `A`:
>
>![Now AAABA is written, due to overlap in the second and third groups](../../images/scs3.png)
>
>Third and fourth add `BB`:
>
>![The extension process continues with another overlapping component extending it further.](../../images/scs4.png)
---
>Continuing this we will eventually get `AAABABBAABABBABB`:
>
>![Order 1 produces a longish superstring](../../images/scs5.png)
>
>But `AAABABBAABABBABB` is the shortest only for this particular ordering. So let's reorder and try again:
>
>![Order 2, a new ordering of the same sequences, produces a shorter superstring.](../../images/scs6.png)
>
>Now we did better, but maybe we can do even better.
Ultimately we need to try all possible ordering and pick the shortest among all. Using this approach is we have <code>S</code> strings we will need to do <code>S!</code> tries. This can quickly get impossible. For our set of
eight strings <code>8! = 40320</code>. If we get, say, a 1,000,000 reads from an Illumina machine then the factorial of a million is not going to be an attractive analysis option.
---
## Shortest common superstring: Greedy approach
As we've seen it will be impossible to assemble the genome using SCS logic. There is a simplification called *Greedy* approach to SCS problem. Let's take the following set of "reads":
<samp>
AAA AAB ABB BBA BBB
</samp>
and first build an overlap graph:
>![Graph showing 5 nodes with various scores between them.](../../images/greedy1.png)
>
>**An overlap graph** for set <code>S: AAA AAB ABB BBA BBB</code>.
---
Next, we start collapsing the nodes to maximize the overlap (and hence to decrease the length of the SCS we are trying to construct):
>In the graph below there are multiple ties: nodes with outgoing edges of identical weights (e.g., edges pointing from `ABB` to both `BBA` and `BBB` have weight of two. Remember, that the weight is the length of overlap between two nodes' labels). In this situation we will break ties by randomly picking an edge to traverse. Let's pick <font color="red">`AAA` → `AAB`</font>:
>![The above graph however the node AAA -> 2 -> AAB is highlighted in red.](../../images/greedy2.png)
---
.pull-left[
>We then merge `AAA` and `AAB` into an SCS containing both, which will be `AAAB`:
>
>![The above graph however with 4 nodes, as AAA and AAB are merged](../../images/greedy3.png)
]
--
.pull-right[
>Now let's pick edge <font color="red">`ABB` → `BBB`</font>:
>![Another edge scoring 2 is highlighted, ABB to BBB (2)](../../images/greedy4.png)
]
---
.pull-left[
>Collapse the nodes:
>
>![They're collapsed, now with 3 nodes in the graph](../../images/greedy5.png)
]
--
.pull-right[
>Pick <font color="red">`ABBB` → `BBA`</font>:
>
>![Same as before, another highlighted](../../images/greedy6.png)
]
---
.pull-left[
>Collapse the nodes:
>
>![Same as before, down to 2 nodes with arrows going both directions between them. One scores 2, one scores 1.](../../images/greedy7.png)
]
--
.pull-right[
>Pick <font color="red">`AAAB` → `ABBBA`</font>:
>
>![The AAAB to ABBBA is highlighted, scoring 2.](../../images/greedy8.png)
]
---
>Collapse again and now we are left with a superstring of length 7:
>
>![A single node is left producing AAABBBA](../../images/greedy9.png)
The above procedure can be computed *very* quickly. But there is a catch: it does not guarantee that it will give us truly the shortest superstring. It really depends on how we choose edges.
---
Below is another example of using the same dataset in which we traverse graph in a slightly different way: <br>
--
.pull-left[
>We start the same way as before by choosing <font color="red">`AAA` → `AAB`</font>:
>
>![THe graph resets to 5 nodes](../../images/greedy2.png)
]
--
.pull-right[
>Merge `AAA` and `AAB`:
>
>![The same initial step](../../images/greedy3.png)
]
---
.pull-left[
>But now we pick a different edge <font color="red">`ABB` → `BBA`</font>:
>
>![Now a different edge scoring 2 is picked, ABB to BBA](../../images/greedy4_alt.png)
]
--
.pull-right[
>Collapsing these nodes dramatically changes the graph:
>
>![these are collapsed producing a three node graph. AAAB and ABBA have a score of 2, BBB is only connected to AAAB with a score of 1.](../../images/greedy5_alt.png)
]
---
.pull-left[
>Now we pick <font color="red">`AAAB` → `ABBA`</font> as this is the edge with the highest weight:
>
>![The highest scoring edge is picked](../../images/greedy6_alt.png)
]
--
.pull-right[
>Collapsing it produces two nodes that are not connected to each other:
>
>![AAABBA and BBB are left, unconnected.](../../images/greedy7_alt.png)
]
---
And the SCS of these two will be a concatenation `AAABBABBB` of length 9. Thus a greedy approach may produce different answers. However, it is a sufficient approximation as the superstring yielded this way will not be more than ~2.5 times longer than the true SCS ([Gusfield](https://www.goodreads.com/book/show/145058.Algorithms_on_Strings_Trees_and_Sequences) 16.17.1).
---
# The Third Law of Assembly: Repeats are Evil!
Let's again apply Greedy SCS to a different "genome". Suppose we want to reconstruct the phrase:
<samp>
a_long_long_long_time
</samp>
from all 6-mers that overlap by at least 3 characters. The list of 6-mers is:
```
ng_lon _long_ a_long long_l ong_ti ong_lo long_t
g_long g_time ng_tim
```
---
An overlap graph will look like this:
>![A large, messy graph with various portions fo the above sentence.](../../images/long.png)
>
<small>**An overlap graph** for with overlap length <code> >= 3<code>.</small>
---
If we proceed with Greedy SCS we will follow the following trajectory through the graph:
>![A greedy path is highlighted, all scoring 5.](../../images/long_opt.png)
---
To make things even clearer let's isolate the path:
![Irrelevant edges are removed.](../../images/long_opt_focus.png)
---
The total overlap here (the sum of edge weights) is 4+5+5+5+5+5+5+5+5=44 but it gives us `a_long_long_time` as the shortest superstring:
```
a_long
long_l
ong_lo
ng_lon
g_long
_long_
long_t
ong_ti
ng_tim
g_time
----------------
a_long_long_time
```
---
We are missing one instance of 'long' in this string. The following graph shows the path that would return the *correct* string:
>![A less optimal path is shown that recovers the duplicated long](../../images/long_corr.png)
---
A path yielding the correct string with three repeats. The total overlap here is 5+3+3+5+4+4+5+5+5=39, which is *worse* than the previous path if our goal is to find the shortest superstring:
```
a_long
_long_
ng_lon
long_l
ong_lo
g_long
long_t
ong_ti
ng_tim
g_time
---------------------
a_long_long_long_time
```
---
## Are we really looking for the shortest superstring?
As we've seen above the shortest common superstring (SCS) is:
1. **Difficult to obtain** as Greedy SCS algorithm does not guarantee finding it. So the answer we get may be longer than the real genome we are trying to assemble.
2. **May be shorter than we want** because if the genome contains repeats that are longer than the reads we are using, Greedy SCS will collapse them and make assembly shorter that the genome we are trying to get.
Let's talk about an alternative way to represent the relationship between *k*-mers that may give us a more efficient algorithm.
---
# de Bruijn graphs
[Nicolaas de Bruijn](https://en.wikipedia.org/wiki/Nicolaas_Govert_de_Bruijn) had a purely theoretical interest of constructing _k_-universal strings for an arbitrary value of _k_. A _k_-universal string contains every possible _k_-mer only once:
>[![Screenshot of a box from a textbook with illegible text and a similar graph to the ABBA ones shown above.](../../images/deBruijn_txt.png)](http://www.nature.com/nbt/journal/v29/n11/abs/nbt.2023.html)
>
>**de Bruijn graph**. From [Compeau:2011](http://www.nature.com/nbt/journal/v29/n11/abs/nbt.2023.html)
---
This problem is equivalent to a string reconstruction problem we have been talking about above: finding a _k_-universal string is equivalent to finding a Hamiltonian path in an overlap graph constructed from all _k_-mers. Yet finding a Hamiltonian path in a really large graph (representing a real genome) is not a tractable problem as we have seen. Instead de Bruijn decided to represent _k_-mer composition in a graph using a slightly different logic. Again, suppose we have a "genome" <ode>TAATGCCATGGGATGTT</code> split in a collection of 3-mers:
<samp>
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
</samp>
---
We will assign 3-mers to _edges_ instead of _nodes_:
![A long string of 2-mer nodes linked together with 3-mers on edges.](../../images/4.12.png)
**_k_-mers as edges**. Edges represented by 3-mers connect nodes representing the overlaps. (Fig. 4.12 from [CP](http://bioinformaticsalgorithms.com/))
This graph can be simplified by gluing identical nodes together:
![The same graph](../../images/4.13a.png)
![Re-organised to collapse identical nodes, first the AT node is collapsed from three into one.](../../images/4.13b.png)
---
Here the complexity of the graph is reduced by first gluing redundant <font color="red">`AT`</font> nodes
![Now the 3 TG nodes are highlighted](../../images/4.13c.png)
![and collapsed into a single one, sharing the original's connections.](../../images/4.13d.png)
---
Next, <font color="blue">`TG`</font> nodes are merged
![2 GG nodes are highlighted](../../images/4.13e.png)
![and merged, including a self-link.](../../images/4.13f.png)
---
And, finally the two <font color="green">`GG`</font> nodes are resolved. (Fig. 4.13 from [CP](http://bioinformaticsalgorithms.com/))
Because we now represent _k_-mers as edges (rather than nodes), our problem has morphed into finding a path that visits every _edge_ once, or an [Eulerian Path](https://en.wikipedia.org/wiki/Eulerian_path):
![The final graph, same as previous image but now edges are numbered as well.](../../images/4.15.png)
**Eulerian paths for the 15 3-mers**. Numbering of edges provides a way to reconstruct the original "genome". (Fig. 4.15 from [CP](http://bioinformaticsalgorithms.com/))
---
## Euler's Theorem
Some definitions:
* **Balanced node** - a node where the number of incoming edges is equal to the number of outgoing edges
* **Balanced graph** - a graph where all nodes are balanced
* **Strongly connected graph** - any node can be reached from any other node
**Euler's Theorem**:
>Every balanced, strongly connected directed graph is Eulerian.
---
Let's apply Euler's Theorem to a classical problem: The bridges of Königsberg problem. Here the question is: *Can you walk through all of Königsberg traversing every bridge exactly one time?* In other words: *Is there a Eulerian path through the city of Königsberg?*
>![Old map of Königsberg with dots on land portions and links between them. This is reduced into the same coloured nodes and bridges representing edges.](../../images/koninsberg.png)
>
>**Königsberg and Euler's Theorem**. (a) A map of old Königsberg, in which each area of the city is labeled with a different color point. (b) The Königsberg Bridge graph, formed by representing each of four land areas as a node and each of the city's seven bridges as an edge. (From [Campeau:2011](http://www.nature.com/nbt/journal/v29/n11/abs/nbt.2023.html#close))
By looking at this graph we can see that it is *unbalanced*. If one arrives to, say, the <font color="orange">orange</font> node from the <font color="blue">blue</font> node there are two ways to get out. Thus there is no way to see all of the city and traverse every bridge once!
---
## Repeats are still a challenge
Let's look at the de Bruijn graph from above again. But this time let's drop edge numbering and pretend that the genome is now really known to us (as is usually the case in real life):
>![Similar graph as above but now edges are only labelled by sequences, no more numbers.](../../images/dg1.png)
>
>**Eulerian paths for the 15 3-mers**.
---
In the original sequence `TAATGCCATGGGATGTT` *k*-mer <font color="red">`AT`</font> is present 3 times and *k*-mer <font color="blue">`TG`</font> is found twice. Thus *multiple* Eulerian walks are now possible like this:
>![Following a direct path from the one node without incoming connections, to TG node, choosing one of the three outgoing connections we go 'up'](../../images/dg2.png)
>
>**Possible path #1**. Here after we reach <font color="blue">`TG`</font> node we turn **up**.
---
The above path spells out:
```
TAA
AAT
ATG
TGC
GCC
CCA
CAT
ATG
TGG
GGG
GGA
GAT
ATG
TGT
GTT
-----------------
TAATGCCATGGGATGTT
```
---
Yet there is an alternative:
>![The same as above but this time going down to follow a different path](../../images/dg3.png)
>
>**Possible path #2**. Here after we reach <font color="blue">`TG`</font> node we turn **dow**.
---
Which spells:
```
TAA
AAT
ATG
TGG
GGA
GAT
ATG
TGC
GCC
CCA
CAT
ATG
TGT
GTT
----------------
TAATGGATGCCATGTT
```
Note how different these are:
```
TAATGCCATGGGATGTT
TAATGGATGCCATGTTT
```
and only one of them is correct. Repeats are evil!
---
## *k*-mer size affects repeat resolution
In the above example we have used *k*-mer size of 3. But what if we try 4 or 5? Below are De Bruijn graphs for different values of *k*:
#### *k* = 3
>![The same graph as above, now in a different orientation. There are many connections](../../images/k2.png)
---
>This is our original graph
#### *k* = 4
>![There are now significantly fewer connections.](../../images/k3.png)
>
>Here complexity is decreasing, but we still have the problem with having `ATG` twice.
---
.pull-left[
#### *k* = 5
In this case there is only one path. This because our *k* is larger that the repeat size, so we can resolve it accurately.
This is why technologies producing long sequencing reads stimulate so much enthusiasm - they will allow to resolve and produce accurate assembly of large genomes.
]
.pull-right[
![Graph with only a single route through the sequence.](../../images/k4.png)
]
---
# Assembly in real life
In this topic we've learned about two ways of representing the relationship between reads derived from a genome we are trying to assemble:
1. **Overlap graphs** - nodes are reads, edges are overlaps between reads.
2. **De Bruijn graphs** - nodes are overlaps, edges are reads.
>![Overlap: graph from before, nodes are 3-mers, unlabelled connections.](../../images/4.7.png)
---
**A**.
![De Bruijn: The 2-mer graph with 3-mer connections again.](../../images/4.13f.png)
---
**B.**
An overlap (A) and De Bruijn (B) graphs for the same string.
Whatever the representation will be it will be messy:
![A hypothetical real life graph with a huge number of nodes snaking all over the screen, and many many isolaed single and pairs of nodes unconnected to anything.](../../images/t12_mess.png)
A fragment of a very large De Bruijn graph (Image from [BL](https://github.com/BenLangmead/ads1-slides/blob/master/0580_asm__practice.pdf)).
---
There are multiple reasons for such messiness:
**Sequence errors**
Sequencing machines do not give perfect data. This results in spurious deviations on the graph. Some sequencing technologies such as Oxford Nanopore have very high error rate of ~15%.
>![An inset from the above. A red box highlights pairs of 2 and 3 node groups.](../../images/t12_errors.png)
>
>Graph components resulting from sequencing errors (Image from [BL](https://github.com/BenLangmead/ads1-slides/blob/master/0580_asm__practice.pdf)).
---
**Ploidy**
As we discussed earlier humans are diploid and there are multiple differences between maternal and paternal genomes. This creates "bubbles" on assembly graphs:
>![A bubble in the graph. there are a series of nodes which have 5-8 connections pointing from one to the next. Halfway through there is a single base deviation and the 5-mer nodes highlight this single base change. The change results in a split into two strands which shortly reconnect. One strand is labelled Maternal, the other Paternal.](../../images/t12_bubble.png)
>
>Bubbles due to a heterozygous site (Image from [BL](https://github.com/BenLangmead/ads1-slides/blob/master/0580_asm__practice.pdf)).
---
**Repeats**
As we've seen the third law of assembly is unbeatable. As a result some regions of the genome simply cannot be resolved and are reported in segments called *contigs*:
>![A sentence is represented "to everything turn", "turn" (a repeat) and "turn there is a season". The repeat makes the graph hard to solve into a single path.](../../images/t12_contigs.png)
>
>The following "genomic" segment will be reported in three pieces corresponding to regions flanking the repeat and repeat itself (Image from [BL](https://github.com/BenLangmead/ads1-slides/blob/master/0580_asm__practice.pdf)).
---
.enlarge120[
# **How to perform Assembly with Galaxy?**
]
---
See the tutorial accompanied by these slides!