# Iterative Assembly

The following notebook shows some small examples of an assembly approach I have been developing, which I call iterative assembly. The approach first builds a de Bruijn assembly graph of the sequenced DNA. It then uses that assembly graph as a probabilistic model to error-correct the reads in the dataset with the Viterbi algorithm, drastically reducing error rates and improving assembly accuracy. Correcting most errors with relatively short kmers, and then progressively working up to longer kmers that help resolve more repeats, avoids a common problem in genome assembly where most kmers in the dataset are erroneous. It also helps overcome the sequencing accuracy limitations of long-read nanopore sequencers.
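As a rough illustration of the first step, the sketch below counts kmers and collects de Bruijn edges from plain strings. The function names `count_kmers` and `debruijn_edges` are hypothetical and are not Eisenia's API; strand handling and the probabilistic weighting are omitted. The reads reuse the L10 reference from later in this notebook, with one read carrying a single-base deletion.

```julia
# Minimal sketch on plain Strings (an illustration, not Eisenia's implementation).

# Count every kmer (substring of length k) across a set of reads.
function count_kmers(reads::Vector{String}, k::Int)
    counts = Dict{String,Int}()
    for read in reads
        for i in 1:(length(read) - k + 1)
            kmer = read[i:i+k-1]
            counts[kmer] = get(counts, kmer, 0) + 1
        end
    end
    return counts
end

# Connect two kmers whenever they are adjacent in some read,
# i.e. they overlap by k-1 bases.
function debruijn_edges(reads::Vector{String}, k::Int)
    edges = Set{Tuple{String,String}}()
    for read in reads
        for i in 1:(length(read) - k)
            push!(edges, (read[i:i+k-1], read[i+1:i+k]))
        end
    end
    return edges
end

reads = ["ACCAAACTAT", "ACCAAACTAT", "ACCAACTAT"]  # third read has a deletion
kmer_counts = count_kmers(reads, 5)
edges = debruijn_edges(reads, 5)

println(kmer_counts["ACCAA"])  # 3: present in every read
println(kmer_counts["CCAAC"])  # 1: created by the deletion
println(length(edges))         # 8 distinct edges
```

Kmers present in every read (such as `ACCAA`) are well supported, while kmers introduced by the deletion (such as `CCAAC`) are rare. That difference in support is the kind of signal a correction step can exploit.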

It may be helpful to start by looking at the initial assembly graph figure, which is laid out using a breadth-first search traversal. The deeper the graph, the more complexity (e.g. errors) is present. As the errors are removed during the iterative correction process, the graph flattens out and widens into a relatively easy-to-resolve assembly graph with few, if any, remaining errors.

In these examples, I assume an error rate of 1/(k+1) (the default when the user does not provide a known error rate). Note that the inferred error rate is generally closer to the true error rate than the assumed error rate is.
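To make the default concrete, the short sketch below evaluates 1/(k+1) for the k values used in this notebook. `default_error_rate` is a hypothetical helper name, not a function exported by Eisenia.

```julia
# Hypothetical helper: the default assumed error rate described above.
default_error_rate(k::Integer) = 1 / (k + 1)

# Print the assumed rate for each k used in this notebook.
for k in (5, 7, 11, 13, 17, 19)
    rate = round(default_error_rate(k), digits=3)
    println("k = ", k, "  assumed error rate = ", rate)
end
```

At k = 5 the assumed rate is 1/6, or about 0.167, which is close to the 0.15 used to simulate reads in the examples; at k = 19 it drops to 0.05.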

In my experiments, I have found that iteratively assembling and correcting at a given k until no more corrections are performed and then incrementing k is the most effective way to remove errors. When two consecutive iterations do not produce any error corrections, the process can be halted and a final genome assembly produced (variants, coverage depths, closest known species, and other attributes can also be produced at this point).
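The schedule described above can be sketched as a small driver loop. Everything here is an assumption for illustration: `iterative_correction` is a hypothetical name, `correct` stands in for one round of "build the graph at k, then Viterbi-correct the reads", and `toy_correct` is a deliberately crude placeholder that ignores k so the example runs on its own.

```julia
# Repeat correction at each k until a round changes nothing, then move to the
# next k. Halt once two consecutive rounds make no corrections.
function iterative_correction(reads, ks, correct)
    unchanged_rounds = 0
    for k in ks
        while true
            corrected = correct(reads, k)
            n_changed = count(pair -> pair[1] != pair[2], zip(reads, corrected))
            reads = corrected
            println("k = $k: corrected $n_changed reads")
            if n_changed == 0
                unchanged_rounds += 1
                break                      # converged at this k
            end
            unchanged_rounds = 0
        end
        unchanged_rounds >= 2 && break     # two clean rounds in a row: stop
    end
    return reads
end

# Toy corrector for demonstration only: replace reads seen once with the most
# frequent read. It ignores k and is not how Eisenia corrects reads.
function toy_correct(reads, k)
    counts = Dict{String,Int}()
    for r in reads
        counts[r] = get(counts, r, 0) + 1
    end
    best = maximum(values(counts))
    consensus = first(r for r in reads if counts[r] == best)
    return [counts[r] == 1 ? consensus : r for r in reads]
end

reads = ["ACCAAACTAT", "ACCAAACTAT", "ACCAACTAT", "ACCAAACTAT"]
result = iterative_correction(reads, (5, 7, 11), toy_correct)
println(result)
```

With this toy input the loop corrects one read at k = 5, sees a clean round at k = 5 and another at k = 7, and stops before reaching k = 11.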

Programs using iterative kmer assembly:

- [SPAdes](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/)
- [Accurate self-correction of errors in long reads using de Bruijn graphs](https://arxiv.org/pdf/1604.02233.pdf)

In [1]:
using Eisenia
using Random
using Dates

# L10

In [2]:
L = 10
Random.seed!(L)
reference_sequence = randdnaseq(L)
# Note: length(L) is 1 for an integer, so this always yields a 3-character ID.
reference_sequence_id = randstring(Int(round(log10(length(L)))+3))
reference_FASTA_record = FASTA.Record(reference_sequence_id, reference_sequence)

BioSequences.FASTA.Record:
   identifier: Apx
  description: <missing>
     sequence: ACCAAACTAT

In [3]:
reverse_complement(reference_sequence)

10nt DNA Sequence:
ATAGTTTGGT
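Reads can originate from either strand, which is why the reverse complement is worth inspecting and why the cells below count canonical kmers. A minimal plain-string sketch follows, assuming the common convention that a kmer's canonical form is the lexicographically smaller of the kmer and its reverse complement; whether Eisenia uses exactly this convention is an assumption. `revcomp` and `canonical` are hypothetical names.

```julia
# Plain-String sketch; BioSequences provides reverse_complement for real use.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Reverse the sequence, then complement each base.
revcomp(seq::AbstractString) = join(COMPLEMENT[base] for base in reverse(seq))

# Assumed convention: the canonical form is the lexicographically smaller of
# a kmer and its reverse complement.
canonical(kmer::AbstractString) = min(kmer, revcomp(kmer))

println(revcomp("ACCAAACTAT"))  # ATAGTTTGGT
println(canonical("ACCAA"))     # ACCAA
println(canonical("TTGGT"))     # ACCAA (same kmer seen from the other strand)
```

Collapsing both strands onto one canonical kmer lets forward and reverse reads contribute to the same count.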

In [4]:
error_rate = 0.15
n_sequences = 100
observations = [Eisenia.observe(reference_FASTA_record, error_rate=error_rate) for i in 1:n_sequences]

100-element Array{BioSequences.FASTA.Record,1}:
 BioSequences.FASTA.Record:
   identifier: RO08
  description: <missing>
     sequence: ATCCAAACTA  
 BioSequences.FASTA.Record:
   identifier: EUNI
  description: <missing>
     sequence: ATAGTTTTGGT 
 BioSequences.FASTA.Record:
   identifier: vMBU
  description: <missing>
     sequence: ATAGTTTGG   
 BioSequences.FASTA.Record:
   identifier: bbqN
  description: <missing>
     sequence: ACCAAACTAACT
 BioSequences.FASTA.Record:
   identifier: R0Fc
  description: <missing>
     sequence: ACCCAAACTAT 
 BioSequences.FASTA.Record:
   identifier: kTEg
  description: <missing>
     sequence: ACCAACCAT   
 BioSequences.FASTA.Record:
   identifier: j2mA
  description: <missing>
     sequence: ACCAACTAT   
 BioSequences.FASTA.Record:
   identifier: nQi9
  description: <missing>
     sequence: ACCAACTAT   
 BioSequences.FASTA.Record:
   identifier: SOR3
  description: <missing>
     sequence: AGTAGTTTGTGT
 BioSequences.FASTA.Record:
   identifier: 
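The observations above vary in length and contain mismatches, which suggests that `Eisenia.observe` introduces substitutions, insertions, and deletions. The sketch below is a hypothetical stand-in written from that reading of the output, not the library's implementation; `simulate_read` and its equal split among the three error types are assumptions. Strand flipping is left out.

```julia
using Random

# Hypothetical simulator (not Eisenia.observe): each base is, with probability
# error_rate, hit by a substitution, an insertion, or a deletion (equal odds).
function simulate_read(rng::AbstractRNG, reference::AbstractString, error_rate::Real)
    bases = ['A', 'C', 'G', 'T']
    out = Char[]
    for base in reference
        if rand(rng) < error_rate
            kind = rand(rng, 1:3)
            if kind == 1                     # substitution with a different base
                push!(out, rand(rng, filter(b -> b != base, bases)))
            elseif kind == 2                 # insertion of a random base after this one
                push!(out, base, rand(rng, bases))
            end                              # kind == 3: deletion, emit nothing
        else
            push!(out, base)
        end
    end
    return String(out)
end

rng = MersenneTwister(10)
simulated = [simulate_read(rng, "ACCAAACTAT", 0.15) for _ in 1:5]
foreach(println, simulated)
```

With `error_rate = 0` the function returns the reference unchanged, which is a convenient sanity check before simulating noisy reads.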

## L10 starting @ k=5

In [5]:
k = 5
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [6]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 ATCCAAACTA
	maximum likelihood sequence       ACCAAACTA
	maximum likelihood edit distance  3

evaluating sequence 2 of 100
	observed sequence                 ATAGTTTTGGT
	maximum likelihood sequence       ATAGTTTGGT
	maximum likelihood edit distance  1

evaluating sequence 3 of 100
	observed sequence                 ATAGTTTGG
	maximum likelihood sequence       ATAGTTTGG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 ACCAAACTAACT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  3

evaluating sequence 5 of 100
	observed sequence                 ACCCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  2

evaluating sequence 6 of 100
	obser

In [7]:
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [8]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, error_rate = error_rate, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 ACCAAACTA
	maximum likelihood sequence       ACCAAACTA
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 ATAGTTTGGT
	maximum likelihood sequence       ATAGTTTGGT
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 ATAGTTTGG
	maximum likelihood sequence       ATAGTTTGG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 ACCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 ACCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  0

evaluating sequence 6 of 100
	observed s

In [9]:
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [10]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, error_rate = error_rate, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 ACCAAACTA
	maximum likelihood sequence       ACCAAACTA
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 ATAGTTTGGT
	maximum likelihood sequence       ATAGTTTGGT
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 ATAGTTTGG
	maximum likelihood sequence       ATAGTTTGG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 ACCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 ACCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  0

evaluating sequence 6 of 100
	observed s

In [11]:
k = 7
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [12]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, error_rate = error_rate, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 ACCAAACTA
	maximum likelihood sequence       ACCAAACTA
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 ATAGTTTGGT
	maximum likelihood sequence       ATAGTTTGGT
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 ATAGTTTGG
	maximum likelihood sequence       ATAGTTTGG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 ACCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 ACCAAACTAT
	maximum likelihood sequence       ACCAAACTAT
	maximum likelihood edit distance  0

evaluating sequence 6 of 100
	observed s

# L20

In [13]:
L = 20
Random.seed!(L)
reference_sequence = randdnaseq(L)
reference_sequence_id = randstring(Int(round(log10(length(L)))+3))
reference_FASTA_record = FASTA.Record(reference_sequence_id, reference_sequence)

BioSequences.FASTA.Record:
   identifier: Bg3
  description: <missing>
     sequence: CTGCAAGGTCGAATCCGGTC

In [14]:
error_rate = 0.15
n_sequences = 100
observations = [Eisenia.observe(reference_FASTA_record, error_rate=error_rate) for i in 1:n_sequences]

100-element Array{BioSequences.FASTA.Record,1}:
 BioSequences.FASTA.Record:
   identifier: UClF
  description: <missing>
     sequence: GACCGGATTCGGCCTGGGAG    
 BioSequences.FASTA.Record:
   identifier: 6u5g
  description: <missing>
     sequence: CTGCCCAGGGTCGGAATCCGGTAC
 BioSequences.FASTA.Record:
   identifier: CghJ
  description: <missing>
     sequence: GATCCGGATTCGATCCTTGCAG  
 BioSequences.FASTA.Record:
   identifier: WVAh
  description: <missing>
     sequence: ACCGAGATATCGACCTTGCAG   
 BioSequences.FASTA.Record:
   identifier: yBd8
  description: <missing>
     sequence: GACCGGATTCCGATGGCAG     
 BioSequences.FASTA.Record:
   identifier: Bxw5
  description: <missing>
     sequence: GAACCGGGTTCGACCTTGCAG   
 BioSequences.FASTA.Record:
   identifier: XFoi
  description: <missing>
     sequence: CTGCAGGTTCGAATACCGGTC   
 BioSequences.FASTA.Record:
   identifier: RfvP
  description: <missing>
     sequence: CTGCAAGGTCGAATCCGGTC    
 BioSequences.FASTA.Record:
   identifier: 9sYB


## L20 starting @ k=7

In [15]:
k = 7
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [16]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 GACCGGATTCGGCCTGGGAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  3

evaluating sequence 2 of 100
	observed sequence                 CTGCCCAGGGTCGGAATCCGGTAC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  6

evaluating sequence 3 of 100
	observed sequence                 GATCCGGATTCGATCCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  4

evaluating sequence 4 of 100
	observed sequence                 ACCGAGATATCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  3

evaluating sequence 5 of 100
	observed sequence                 GACCGGATTCCGATGGCAG
	maximum likelihood seq

In [17]:
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [18]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 CCTGCAAGGTCGAATCCGGTC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence

In [19]:
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [20]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 CCTGCAAGGTCGAATCCGGTC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence

In [21]:
k = 11
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [22]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 CCTGCAAGGTCGAATCCGGTC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence

In [23]:
k = 13
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [24]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 CCTGCAAGGTCGAATCCGGTC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence

In [25]:
k = 17
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [26]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 CCTGCAAGGTCGAATCCGGTC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 5 of 100
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence

In [27]:
k = 19
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120
└ @ Eisenia /Users/Cameron/Desktop/Microbes/Eisenia/src/Eisenia.jl:1120


In [28]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 92
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 2 of 92
	observed sequence                 CCTGCAAGGTCGAATCCGGTC
	maximum likelihood sequence       CCTGCAAGGTCGAATCCGGTC
	maximum likelihood edit distance  0

evaluating sequence 3 of 92
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 4 of 92
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence       GACCGGATTCGACCTTGCAG
	maximum likelihood edit distance  0

evaluating sequence 5 of 92
	observed sequence                 GACCGGATTCGACCTTGCAG
	maximum likelihood sequence     

# L50

In [29]:
L = 50
Random.seed!(L)
reference_sequence = randdnaseq(L)
reference_sequence_id = randstring(Int(round(log10(length(L)))+3))
reference_FASTA_record = FASTA.Record(reference_sequence_id, reference_sequence)

BioSequences.FASTA.Record:
   identifier: j96
  description: <missing>
     sequence: TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGA…

In [30]:
error_rate = 0.15
n_sequences = 100
observations = [Eisenia.observe(reference_FASTA_record, error_rate=error_rate) for i in 1:n_sequences]

100-element Array{BioSequences.FASTA.Record,1}:
 BioSequences.FASTA.Record:
   identifier: nxKL8
  description: <missing>
     sequence: TGGAACTAGGATCATGCTACGGCGCGAATCTACCAGATG…
 BioSequences.FASTA.Record:
   identifier: h0jcA
  description: <missing>
     sequence: TGGAACCAGATCATGCTACGGCGCGTATTTACCACGATG…
 BioSequences.FASTA.Record:
   identifier: RKMGo
  description: <missing>
     sequence: TGCTAGAAGCACGTGGTAGATTACGCGCTAGCATGATCC…
 BioSequences.FASTA.Record:
   identifier: deuPI
  description: <missing>
     sequence: TGGAACCAAGTCTCAGTGCTTACGGCGCGTAATCTACCC…
 BioSequences.FASTA.Record:
   identifier: HKzzv
  description: <missing>
     sequence: TGGAACCAGATCATCTACGGCGCGTATTCTACCACGACC…
 BioSequences.FASTA.Record:
   identifier: cQcBT
  description: <missing>
     sequence: GGCTAGAAGCACGTGGTAGATTACGCGCTCGGTAGCATG…
 BioSequences.FASTA.Record:
   identifier: CUBmJ
  description: <missing>
     sequence: TGCTAAGAAGCATCTGGTAGATTATGCGCCGTAGCATGA…
 BioSequences.FASTA.Record:
   identifier:

## L50 starting @ k=7

In [31]:
k = 7
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [32]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 TGGAACTAGGATCATGCTACGGCGCGAATCTACCAGATGCTTCTCCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  5

evaluating sequence 2 of 100
	observed sequence                 TGGAACCAGATCATGCTACGGCGCGTATTTACCACGATGTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  5

evaluating sequence 3 of 100
	observed sequence                 TGCTAGAAGCACGTGGTAGATTACGCGCTAGCATGATCCTAGTCCA
	maximum likelihood sequence       TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood edit distance  5

evaluating sequence 4 of 100
	observed sequence                 TGGAACCAAGTCTCAGTGCTTACGGCGCGTAATCTACCCGATGCTTCCGAGCA
	maximu

In [33]:
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [34]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood sequence       TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA

In [35]:
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [36]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood sequence       TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA

In [37]:
k = 11
canonical_kmers = collect(keys(Eisenia.count_canonical_kmers(maximum_likelihood_observations, k)))
stranded_kmer_graph = Eisenia.build_stranded_kmer_graph(canonical_kmers, maximum_likelihood_observations)
filename = reference_sequence_id * "." * replace(string(Dates.now()), ':' => '.') * ".svg"
Eisenia.plot_stranded_kmer_graph(stranded_kmer_graph, filename=filename)
HTML("""
<image src="$filename" width=50%>
""")

In [38]:
maximum_likelihood_observations = Eisenia.viterbi_maximum_likelihood_traversals(stranded_kmer_graph, verbosity="reads");

computing kmer counts...
computing kmer state likelihoods...
finding shortest paths between kmers...
finding viterbi maximum likelihood paths for observed sequences...

evaluating sequence 1 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  0

evaluating sequence 2 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood sequence       TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA
	maximum likelihood edit distance  0

evaluating sequence 3 of 100
	observed sequence                 TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood sequence       TGCTAGAAGCATCGTGGTAGATTACGCGCCGTAGCATGATCCTGGTTCCA
	maximum likelihood edit distance  0

evaluating sequence 4 of 100
	observed sequence                 TGGAACCAGGATCATGCTACGGCGCGTAATCTACCACGATGCTTCTAGCA