---
layout: post  
---

In this post, I demonstrate two ways to store edges in a graph genome, and argue for explicitly storing observed edges rather than inferring possible edges at runtime.

In [2]:
import Pkg
pkgs = [
    "DataStructures",
    "BioSequences",
    "GraphRecipes",
    "LightGraphs",
    "PlotlyJS",
    "Plots",
    "Random",
    "StatsBase",
    "StatsPlots"
]

Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $pkg"))
end

Plots.plotlyjs()

[32m[1m  Resolving[22m[39m package versions...
[32m[1mUpdating[22m[39m `~/.julia/environments/v1.5/Project.toml`
 [90m [9a3f8284] [39m[92m+ Random[39m
[32m[1mNo Changes[22m[39m to `~/.julia/environments/v1.5/Manifest.toml`


Plots.PlotlyJSBackend()

In [3]:
Random.seed!(0)

Random.MersenneTwister(UInt32[0x00000000], Random.DSFMT.DSFMT_state(…), …)

In [4]:
genome = BioSequences.randdnaseq(10)

10nt DNA Sequence:
TGTATACTGG

In [5]:
k = 3

3

In [6]:
kmer_counts = StatsBase.countmap(BioSequences.canonical(kmer.fw) for kmer in BioSequences.each(BioSequences.DNAMer{k}, genome))

Dict{Any,Int64} with 6 entries:
  GTA => 2
  ATA => 2
  ACA => 1
  ACT => 1
  CAG => 1
  CCA => 1

In [7]:
kmers = DataStructures.OrderedDict(
    kmer => (index = index, count = count) for (index, (kmer, count)) in enumerate(sort(kmer_counts))
)

OrderedCollections.OrderedDict{BioSequences.Mer{BioSequences.DNAAlphabet{2},3},NamedTuple{(:index, :count),Tuple{Int64,Int64}}} with 6 entries:
  ACA => (index = 1, count = 1)
  ACT => (index = 2, count = 1)
  ATA => (index = 3, count = 2)
  CAG => (index = 4, count = 1)
  CCA => (index = 5, count = 1)
  GTA => (index = 6, count = 2)

In [8]:
K = length(kmers)

6

In this basic example, if we iterate through every pair of adjacent kmers in the original dataset to determine the set of edges, we can resolve a single path that reconstructs the original sequence.

```
1 <-> 6 <-> 3 <-> 3 <-> 6 <-> 2 <-> 4 <-> 5
```

While that sequence can be read in the forward or reverse-complement orientation (i.e. we can reverse the order in which we visit each node), the path visits every node at least once and covers all nodes in the fewest number of steps.
The path also spells out a sequence of the same length as our original genome: 8 kmer visits with k = 3 give 8 + 3 - 1 = 10 bases.
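
As a sanity check, the path above can be rebuilt without any packages. This is a minimal sketch on plain strings rather than the BioSequences types used in this post. `revcomp` and `canonical_kmer` are hypothetical helpers written for illustration, and the sketch assumes that the canonical form of a kmer is the lexicographically smaller of the kmer and its reverse complement.

```julia
# Plain-string sketch; these helpers are illustrative, not BioSequences functions.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))
canonical_kmer(s::AbstractString) = min(String(s), revcomp(s))

genome = "TGTATACTGG"
k = 3

# Canonical kmer at every position, in the order the genome visits them
observed = [canonical_kmer(genome[i:i+k-1]) for i in 1:length(genome)-k+1]

# Index the distinct kmers in sorted order, mirroring the `kmers` dictionary above
index_of = Dict(kmer => i for (i, kmer) in enumerate(sort(unique(observed))))

# The walk through the graph, expressed as node indices
path = [index_of[kmer] for kmer in observed]
println(path)

# A path of n kmer visits spells n + k - 1 bases
println(length(path) + k - 1 == length(genome))
```

The printed indices can be compared directly against the path written out above.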

In [40]:
graph = LightGraphs.SimpleGraph(K)
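# Slide a window of k+1 bases along the genome; each window holds two overlapping kmers
for i in 1:length(genome)-k
    a_to_b_connection = genome[i:i+k]
    # Canonicalize both kmers so that either strand maps to the same node
    a = BioSequences.canonical(BioSequences.DNAMer(a_to_b_connection[1:end-1]))
    b = BioSequences.canonical(BioSequences.DNAMer(a_to_b_connection[2:end]))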
    a_index = kmers[a].index
    b_index = kmers[b].index
    edge = LightGraphs.Edge(a_index, b_index)
    LightGraphs.add_edge!(graph, edge)
end
# Note hover doesn't really work well for these because it cycles the hover info over the edges as well
GraphRecipes.graphplot(
    graph,
    names = 1:K,
    markersize = 0.15,
    fontsize=12)

The runtime of determining the edges this way is proportional to the size of the dataset, not to the number of kmers. Determining the edges requires reading through the entire dataset, which we've already done once in order to count the kmers and determine the nodes of the graph.
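
One way to avoid a second read, sketched here on plain strings, is to record observed edges in the same pass that counts the kmers. This is not how the cells above are organized. `revcomp` and `canonical_kmer` are hypothetical helpers, and storing an edge as a sorted pair of canonical kmers is an assumption made for this illustration.

```julia
# Plain-string sketch; these helpers are illustrative, not BioSequences functions.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))
canonical_kmer(s::AbstractString) = min(String(s), revcomp(s))

genome = "TGTATACTGG"
k = 3

kmer_counts = Dict{String,Int}()
observed_edges = Set{Tuple{String,String}}()

for i in 1:length(genome)-k+1
    a = canonical_kmer(genome[i:i+k-1])
    kmer_counts[a] = get(kmer_counts, a, 0) + 1
    # Every kmer except the last one is followed by another kmer
    if i + k <= length(genome)
        b = canonical_kmer(genome[i+1:i+k])
        # Undirected edge, stored as a sorted pair so (a, b) and (b, a) coincide
        push!(observed_edges, minmax(a, b))
    end
end

println(length(kmer_counts), " kmers, ", length(observed_edges), " observed edges")
```

The cost is still one pass over the data, but nodes and edges come out of the same loop.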

Some other genome assemblers (CITATION NEEDED) don't actually store the edges at all. Instead, the algorithms infer whether an edge could exist based on whether the two kmers are neighbors. We define "neighbors" as two kmers that can satisfy the condition
```
kmer_a[1:end-1] == kmer_b[2:end]
```
where kmer_a and kmer_b can each be in their forward or reverse-complement orientation.
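
The neighbor condition can be written as a small predicate. This is a minimal sketch on plain strings; `revcomp` and `are_neighbors` are hypothetical names chosen for illustration and are not part of BioSequences.

```julia
# Plain-string sketch of the neighbor condition defined above.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))

# True if some orientation of kmer_a and some orientation of kmer_b
# satisfy kmer_a[1:end-1] == kmer_b[2:end]
function are_neighbors(kmer_a::AbstractString, kmer_b::AbstractString)
    for a in (kmer_a, revcomp(kmer_a)), b in (kmer_b, revcomp(kmer_b))
        a[1:end-1] == b[2:end] && return true
    end
    return false
end

println(are_neighbors("ACA", "GTA"))  # adjacent in the toy genome (TGT is followed by GTA)
println(are_neighbors("ACA", "CAG"))  # overlap by k-1 bases, but never adjacent in the toy genome
println(are_neighbors("ACA", "CCA"))  # no orientation overlaps by k-1 bases
```

Because both orientations of both kmers are tried, the predicate gives the same answer when its arguments are swapped.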



These possible edges can also be stored directly, which reduces runtime at the cost of storing more information. However, if we store all possible edges, we add graph complexity that isn't necessarily supported by the data. Below, we add an edge for every pair of kmers that could be neighbors, rather than only for the pairs that were actually observed next to each other.

In [43]:
graph = LightGraphs.SimpleGraph(K)
# This may be wrong too. We shouldn't necessarily add edges just because they COULD exist
# Maybe we want to only add edges when they DO exist
for (index_a, fw_kmer) in enumerate(keys(kmers))
    for kmer in (fw_kmer, BioSequences.reverse_complement(fw_kmer))
        for neighbor in BioSequences.neighbors(kmer)
            canonical_neighbor = BioSequences.canonical(neighbor)
            index_b = get(kmers, canonical_neighbor, (index = 0, count = 0)).index
            if index_b != 0
                LightGraphs.add_edge!(graph, LightGraphs.Edge(index_a, index_b))
            end
        end
    end
end
GraphRecipes.graphplot(
    graph,
    names = 1:K,
    markersize = 0.15,
    fontsize=12)

Since we've already made the concession that we will only store and consider kmers that were actually observed to exist, I would argue that we should only entertain edges that were also demonstrated to exist.
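
To put a number on the difference for the toy genome, the two edge sets can be compared directly. This is a minimal sketch on plain strings, not the LightGraphs code above. `revcomp`, `canonical_kmer`, and `are_neighbors` are hypothetical helpers, and edges are assumed to be undirected pairs of canonical kmers.

```julia
# Plain-string sketch; these helpers are illustrative, not BioSequences functions.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))
canonical_kmer(s::AbstractString) = min(String(s), revcomp(s))
are_neighbors(x, y) =
    any(a[1:end-1] == b[2:end] for a in (x, revcomp(x)), b in (y, revcomp(y)))

genome = "TGTATACTGG"
k = 3

# Canonical kmers in the order the genome visits them
windows = [canonical_kmer(genome[i:i+k-1]) for i in 1:length(genome)-k+1]

# Edges that were actually observed: consecutive kmers in the data
observed_edges = Set(minmax(windows[i], windows[i+1]) for i in 1:length(windows)-1)

# Edges that could exist: any pair of observed kmers passing the neighbor test
nodes = sort(unique(windows))
inferred_edges = Set([(a, b) for a in nodes for b in nodes if a <= b && are_neighbors(a, b)])

# Inferred edges with no support in the data
unsupported = setdiff(inferred_edges, observed_edges)
println(length(observed_edges), " observed, ", length(inferred_edges), " inferred")
println(sort(collect(unsupported)))
```

Under these assumptions, every observed edge is also an inferred edge, so anything left in `unsupported` is complexity that inference adds without evidence from the data.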