---
layout: post  
title:  Assembling Genomes  
date:   2019-06-28  

---

If we want to use DNA to identify microbial species and their relative quantities, we need to be able to read their genomes and count how many copies of each genome we find in the sample. Ideally, we would have the technology to read entire genomes from beginning to end without making any errors. While our technology is rapidly approaching this level of capability, we are not there yet.

If we can't read genomes from beginning to end, then we need to be able to take the short fragments that we can read and stitch them together into the original, full-length genome. We do this by looking for sequences of DNA that are the same or nearly the same and joining them together via the overlap.

In [11]:
using BioSequences
using BioAlignments
using Random
Random.seed!(3);

Let's simulate a genome of 10 DNA base pairs.

In [12]:
genome = randdnaseq(10)

10nt DNA Sequence:
TTTTAGAATG

Now, let's pretend that we can only read DNA sequences 7 nucleotides at a time. To keep the example super simple, we'll pretend we first read from the beginning to the 7th base, and then from the 4th base until the end.

In [13]:
fragment_1 = genome[1:7]

7nt DNA Sequence:
TTTTAGA

In [15]:
fragment_2 = genome[4:end]

7nt DNA Sequence:
TAGAATG

We can see that these fragments overlap at the end of fragment_1 and the beginning of fragment_2, sharing the four nucleotides 'TAGA'.

In [33]:
println(fragment_1, repeat('-', 3))
println(repeat('-', 3), fragment_2)

TTTTAGA---
---TAGAATG


In [65]:
pairalign(OverlapAlignment(), fragment_1, fragment_2, AffineGapScoreModel(match=1, mismatch=-1, gap_open=-1, gap_extend=-1))

PairwiseAlignmentResult{Int64,BioSequence{DNAAlphabet{4}},BioSequence{DNAAlphabet{4}}}:
  score: 4
  seq: 1 TTTTAGA--- 7
            ||||   
  ref: 0 ---TAGAATG 7


If we are confident in the overlap, we can merge the two fragments into one by appending the part of fragment_2 that extends past the overlap.

In [36]:
fragment_1 * fragment_2[end-2:end] == genome

true
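The merge above hard-codes the overlap length. As a minimal sketch of how the overlap could be found automatically, the block below uses plain ASCII strings instead of BioSequences types and only considers exact matches, unlike the alignment above, which can tolerate mismatches. The names `longest_overlap` and `merge_fragments` are hypothetical helpers introduced here for illustration, not part of any package.

```julia
# Length of the longest suffix of `a` that is also a prefix of `b`.
# Exact matching only; assumes ASCII strings such as "ACGT".
function longest_overlap(a::AbstractString, b::AbstractString)
    for len in min(length(a), length(b)):-1:1
        if endswith(a, first(b, len))
            return len
        end
    end
    return 0
end

# Merge `b` onto the end of `a`, refusing to merge if the overlap is too short.
function merge_fragments(a::AbstractString, b::AbstractString; min_overlap::Int = 1)
    len = longest_overlap(a, b)
    len >= min_overlap || error("overlap of $len is shorter than min_overlap = $min_overlap")
    return a * b[(len + 1):end]
end

longest_overlap("TTTTAGA", "TAGAATG")                   # 4
merge_fragments("TTTTAGA", "TAGAATG"; min_overlap = 4)  # "TTTTAGAATG"
```

Trying the longest candidate first means the function returns the largest overlap, and the `min_overlap` keyword makes the confidence threshold an explicit choice.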

This overly simplistic example shows the key idea of assembling shorter fragments into longer genomes. The next key question is how long an overlap needs to be before we feel confident that it is not due to random chance.

If we are sequencing a single organism, the minimum overlap length $l$ that we would want before being confident enough to merge overlapping fragments is $l = \log_4(\text{genome length})$. The idea is that with 4 nucleotides (A, C, G, and T), the probability that two fragments of unrelated DNA overlap by chance is inversely proportional to $4^l$. If $4^l$ is greater than the length of the genome, then we wouldn't expect two fragments to match by chance. Unfortunately, this doesn't work out for many eukaryotic genomes, which have been found to contain long repeats up to thousands of DNA nucleotides in length, but it's a good starting point nonetheless. The genome length of a single organism can be estimated by pairing copy number information from a [karyotype](https://en.wikipedia.org/wiki/Karyotype) with the mass of DNA obtained per cell.
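The reasoning can be written out as a back-of-the-envelope sketch. It assumes that the four nucleotides are independent and equally likely at every position, which real genomes are not, and it introduces the symbol $G$ for the genome length:

$$
\Pr(\text{a given } l\text{-mer matches at one position}) = 4^{-l},
\qquad
\mathbb{E}[\text{chance matches in the genome}] \approx G \cdot 4^{-l}
$$

Asking for fewer than one expected chance match then gives the threshold:

$$
G \cdot 4^{-l} < 1 \iff 4^{l} > G \iff l > \log_4 G
$$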

So, for example, if we have a genome of 1,000 DNA nucleotides and we don't expect long repeats, then the minimum DNA overlap length that we would want would be:

In [53]:
genome_length = 1000
minimum_overlap = Int(ceil(log(4, genome_length)))

5

In [54]:
println(4^minimum_overlap)
println(4^minimum_overlap > genome_length)

1024
true
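Because the floating-point logarithm is rounded, it can be fragile when the genome length is an exact power of 4, and `ceil` alone does not enforce the strict inequality $4^l > \text{genome length}$. As a hedged alternative, the sketch below finds the same threshold with integer arithmetic only. The name `min_overlap_exact` is a hypothetical helper introduced here.

```julia
# Smallest l such that 4^l is strictly greater than n,
# found by repeated multiplication instead of a floating-point log.
function min_overlap_exact(n::Integer)
    n >= 0 || throw(ArgumentError("n must be non-negative"))
    target = big(n)   # BigInt, so very large n cannot overflow
    l = 0
    power = big(1)    # invariant: power == 4^l
    while power <= target
        power *= 4
        l += 1
    end
    return l
end

min_overlap_exact(1000)   # 5, since 4^5 = 1024 > 1000
min_overlap_exact(1024)   # 6, since 4^5 = 1024 is not strictly greater than 1024
```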


[Earth may have up to one trillion species](https://www.pnas.org/content/113/21/5970). If we are sequencing and trying to assemble genomes in mixed communities of organisms, then we should ideally have DNA segments long enough to identify both the species they came from and a unique location within the genome of that species. If we assume an average genome size of 1 Gb (one billion DNA base pairs), we would want overlaps of:

In [56]:
genome_size = 1_000_000_000
number_of_species = 1_000_000_000_000
# genome_size * number_of_species is 10^21, which overflows Int64 and silently wraps around,
# so add the logarithms instead of multiplying the integers
minimum_overlap = Int(ceil(log(4, genome_size) + log(4, number_of_species)))

35

Repeats within genomes can be several thousand DNA bases long, so an overlap of 35 is not enough to place every segment of DNA uniquely. Resolving a repeat requires reads that span its full length, so that the unique sequence on either side provides an anchor point at each end.
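To make the repeat problem concrete, here is a toy sketch using plain ASCII strings. The sequences and the helper name `count_occurrences` are made up for illustration. A short segment taken from inside a repeat occurs more than once in the genome, while a read that spans the whole repeat plus a little unique sequence on each side occurs exactly once.

```julia
# Count (possibly overlapping) exact occurrences of `pattern` in `text`.
# Assumes ASCII strings so that integer indexing is safe.
function count_occurrences(pattern::AbstractString, text::AbstractString)
    k = length(pattern)
    return count(i -> text[i:(i + k - 1)] == pattern, 1:(length(text) - k + 1))
end

# Toy genome: the same 8-nucleotide repeat appears twice between unique segments.
left   = "ACGTTGCA"
rep    = "GGGATCCC"
middle = "TTAACCGG"
right  = "CATGCATA"
toy_genome = left * rep * middle * rep * right

# A segment from inside the repeat cannot be placed uniquely.
count_occurrences("GATC", toy_genome)          # 2

# A read spanning the first repeat, anchored by 2 unique bases on each side.
spanning_read = left[end-1:end] * rep * middle[1:2]
count_occurrences(spanning_read, toy_genome)   # 1
```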