# Colander Benchmarking

In [7]:
from colander.mock_data_generation.utils import *
from colander.estimate import greedy_strain_estimation
import networkx as nx

Broadly speaking, this process involves a few steps:

1. Generate a random "starting genome"
2. Generate strains by mutating the starting genome
3. Shear the strains into k-mers, then build a de Bruijn graph from them
4. Run the greedy strain estimation code on the graph
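As a rough, library-free illustration of steps 1 and 3, here is a minimal sketch. The helper names (`random_genome`, `kmers_of`, `debruijn_edges`) are hypothetical and are not part of colander. The graph convention is also an assumption: nodes are (k-1)-mers and each k-mer contributes one edge, as in Compeau and Pevzner. colander's `make_debruijn_graph` may be built differently.

```python
import random
from collections import Counter


def random_genome(length, seed=None):
    """Return a uniformly random DNA string (hypothetical helper, not colander's)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def kmers_of(seq, k):
    """Return every length-k substring of seq, in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def debruijn_edges(kmers):
    """Count edges (prefix, suffix) between (k-1)-mers; each k-mer adds one edge."""
    edges = Counter()
    for kmer in kmers:
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges


genome = random_genome(50, seed=0)
kmers = kmers_of(genome, 5)
edges = debruijn_edges(kmers)
print(len(kmers), "k-mers,", len(edges), "distinct edges")
```

A sequence of length L yields L - k + 1 k-mers, so the edge counts in this sketch always sum to that number.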

Of course, real genomes are not random sequences of nucleotides: as chapter 1 of Compeau and Pevzner shows, features like G/C skew and repetitive regions make real genomes nonrandom. This gives us reason to doubt how well completely random sequences, which are what we use here, model real genomes.

That being said, we have to start somewhere.
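To make the G/C skew caveat concrete, here is a small sketch of a running skew count, that is, the number of G's minus the number of C's seen so far along the sequence. The function name `skew_profile` is hypothetical. In a uniformly random genome this count wanders without a systematic trend, which is one way a random model can differ from real data.

```python
def skew_profile(genome):
    """Return the running (#G - #C) count after each prefix of genome, starting at 0."""
    profile = [0]
    for base in genome:
        # G raises the skew, C lowers it, A and T leave it unchanged
        step = 1 if base == "G" else -1 if base == "C" else 0
        profile.append(profile[-1] + step)
    return profile


print(skew_profile("GGCAT"))
```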

## Define parameter sets for the tests we're going to run

The parameters are:

- Starting genome length
- k-mer size in de Bruijn graph construction
- Strain coverages

(We've kept the mutation rates constant, but these could of course be adjusted on a per-test basis as well.)
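One way to enumerate such parameter sets is a full grid over the three parameters. The sketch below is only an illustration: the variable names and most of the values are hypothetical, not the tests actually run in this notebook.

```python
from itertools import product

# Hypothetical values, chosen only to show the shape of a parameter grid
genome_lengths = [1000, 5000]
kmer_sizes = [5, 15]
coverage_sets = [[1, 3, 5], [10, 10]]

# One dictionary per combination of genome length, k-mer size, and coverages
parameter_grid = [
    {"genome_length": n, "k": k, "strain_coverages": cov}
    for n, k, cov in product(genome_lengths, kmer_sizes, coverage_sets)
]

print(len(parameter_grid))
```

A full grid grows multiplicatively with each parameter, so in practice a hand-picked subset of combinations may be more practical.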

In [None]:

tests =
    [1000, 

In [16]:
genome = generate_random_sequence(5000)
# Define hypervariable regions in the genome: these will undergo more mutations
hv_regions = [(100, 150), (200, 250), (3000, 3500)]

# What are the "coverages" of each strain in the metagenome sequence data?
strain_coverages = [1, 3, 5]

strains = generate_strains_from_genome(
    genome,
    strain_coverages,
    hv_regions,
    hypervariable_mutation_probability=0.01,
    normal_mutation_probability=0.001
)
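# Shear every strain into 5-mers, passing each strain's coverage to the shearing step,
# and pool the results into a single list
kmers = []
for s in strains:
    kmers += shear_into_kmers(s.seq, s.coverage, 5)
print("Three arbitrary k-mers: {}".format(kmers[:3]))
# Total number of k-mers across all strains
print(len(kmers))

# Build the de Bruijn graph from the pooled k-mers
g = make_debruijn_graph(kmers)

# Run greedy strain estimation with N = 5, then score the resulting CycleSet against the graph
cs = greedy_strain_estimation(g, 5)
print("CycleSet with N = 5 has conformity score {}".format(cs.conformity_score(g)))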

Three arbitrary k-mers: ['GTGGG', 'TGGGC', 'GGGCA']
45022
CycleSet with N = 5 has conformity score 468386
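
The role of `hv_regions` and the two mutation probabilities in the cell above can be pictured with the sketch below. It rests on assumptions that the notebook does not confirm: mutations are substitutions only, regions are half-open `(start, end)` intervals, and each position mutates independently. The function `mutate` is hypothetical, and colander's `generate_strains_from_genome` may behave differently.

```python
import random


def mutate(genome, hv_regions, hv_prob, normal_prob, seed=None):
    """Return a copy of genome with independent per-base substitutions.

    Positions inside any (start, end) region mutate with probability hv_prob;
    all other positions mutate with probability normal_prob.
    """
    rng = random.Random(seed)
    hv_positions = set()
    for start, end in hv_regions:
        hv_positions.update(range(start, end))

    mutated = []
    for i, base in enumerate(genome):
        p = hv_prob if i in hv_positions else normal_prob
        if rng.random() < p:
            # Substitute with one of the three other nucleotides
            mutated.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


original = "ACGT" * 10
strain = mutate(original, [(0, 5)], hv_prob=0.5, normal_prob=0.01, seed=1)
print(sum(a != b for a, b in zip(original, strain)), "substitutions")
```

Because this sketch only substitutes bases, the mutated strain always has the same length as the starting genome.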
