# Colander Benchmarking

## 1. Generate mock data using Colander's utility code

In [1]:
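# Import only the helpers used in this notebook instead of a wildcard import
from colander.mock_data_generation.utils import (
    generate_random_sequence,
    generate_strains_from_genome,
    shear_into_kmers,
    make_debruijn_graph,
)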

### 1.1. Generate a random "starting genome"

Real genomes are not random sequences of nucleotides: as chapter 1 of Compeau and Pevzner shows, features such as G/C skew and repetitive regions make them distinctly nonrandom. This is reason to doubt how well a completely random model, like the one used here, represents real genomes.

That said, we have to start somewhere.

In [2]:
genome = generate_random_sequence(500)
genome

'CACGCGCCATGGAGTGGTTTCCAGATCTACTACTTCCGACACGTACGAGTATTGGATAGGTACGCCCTCGGGGTCGCCCAAGGACTGGAAGCTACCCATTTAACGGGTGGTGTATGGCCGTGCATTCCTGAACCTCACCTGGCGGACTGACTCTCCTTACCTTCAGCTGCAGTCTTACCATTGGTAGATTGAAAAAATTAGCTGAGAGCCTCGCCGAGGGCGCTGTACTGTAGGATTAACAATTTCAGAGACGTTCTCTTACTAGGGCTGTCGAGGATATTTGCGCACCGACTTTGTTAGGATTATCTGAAATCCTATAGACATGGCGGCTTATGAGCGGCGAAAACTCTTGCACTCACGCTTTTGGATCTGATAAGTCCGTGCGCCTTGTGATTTACGTCCCGCTGCTTTTTGGTTGAGTCTCTGGGACAACTTAGATTGCCAGGCGGGTTGTTAGCTCTTGCCTTTGTCTAGTTATCAATCGGGGTGTCCGCTGATGT'
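
The implementation of `generate_random_sequence` is not shown in this notebook. As a rough sketch of what "completely random" means here, the snippet below draws each position independently and uniformly from A, C, G and T. The names `random_dna` and `gc_content` are hypothetical and are not part of Colander.

```python
import random


def random_dna(length, alphabet="ACGT", seed=None):
    """Hypothetical stand-in for a uniform random genome generator.

    Every position is drawn independently and uniformly from `alphabet`.
    """
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, k=length))


def gc_content(seq):
    """Fraction of positions in `seq` that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


example = random_dna(500, seed=0)
print(len(example), round(gc_content(example), 3))
```

Under this model the G/C fraction hovers around 0.5 and there is no strand-specific skew, which is exactly the simplification noted above.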

### 1.2. Generate strains by randomly adding mutations to this genome

In [3]:
# Define hypervariable regions in the genome: these will undergo more mutations
hv_regions = [(100, 150), (200, 250), (300, 350)]

# Coverage of each strain in the metagenome sequence data (one strain is generated per entry)
strain_coverages = [1, 3, 5]

strains = generate_strains_from_genome(
    genome,
    strain_coverages,
    hv_regions,
    hypervariable_mutation_probability=0.05,
    normal_mutation_probability=0.01
)

In [4]:
for s in strains:
    print(len(s.seq), s.coverage)

497 1
498 3
497 5
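
The strain lengths differ from the 500-nucleotide starting genome, which suggests that the mutations include insertions and deletions as well as substitutions. The internals of `generate_strains_from_genome` are not shown, so the following is only a sketch under stated assumptions: each position mutates independently, regions are half-open `(start, end)` intervals, and the three mutation types are equally likely. The names `in_any_region` and `mutate` are hypothetical.

```python
import random


def in_any_region(pos, regions):
    """True if pos falls inside any half-open (start, end) region (assumed convention)."""
    return any(start <= pos < end for start, end in regions)


def mutate(seq, hv_regions, hv_prob, normal_prob, rng):
    """Hypothetical per-position mutation model; not Colander's actual code."""
    out = []
    for pos, base in enumerate(seq):
        # Positions in hypervariable regions mutate with the higher probability
        p = hv_prob if in_any_region(pos, hv_regions) else normal_prob
        if rng.random() >= p:
            out.append(base)  # no mutation at this position
            continue
        kind = rng.choice(["substitution", "insertion", "deletion"])
        if kind == "substitution":
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif kind == "insertion":
            out.append(base)
            out.append(rng.choice("ACGT"))
        # deletion: the base is simply dropped
    return "".join(out)


rng = random.Random(0)
reference = "".join(rng.choices("ACGT", k=500))
regions = [(100, 150), (200, 250), (300, 350)]
mutant = mutate(reference, regions, hv_prob=0.05, normal_prob=0.01, rng=rng)
print(len(reference), len(mutant))
```

If positions really do mutate independently, the parameters above give an expected 150 × 0.05 + 350 × 0.01 = 11 mutations per strain, so lengths within a few nucleotides of 500 are what one would expect.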


### 1.3. Shear the strain genomes into k-mers and create a de Bruijn graph from all k-mers

We use k = 15 here; any other k-mer length can be substituted.

In [5]:
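import networkx as nx

k = 15  # k-mer length

# Shear every strain into k-mers, passing its coverage along
kmers = []
for s in strains:
    kmers += shear_into_kmers(s.seq, s.coverage, k)
print("Three arbitrary k-mers: {}".format(kmers[:3]))
print(len(kmers))

# Build the de Bruijn graph and write it out in Graphviz DOT format
# (nx_agraph.write_dot requires pygraphviz to be installed)
g = make_debruijn_graph(kmers)
nx.drawing.nx_agraph.write_dot(g, "dbg_k{}.gv".format(k))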

Three arbitrary k-mers: ['CCACGCGCCATGAGT', 'CACGCGCCATGAGTG', 'ACGCGCCATGAGTGG']
4350
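
The total of 4350 is consistent with each strain contributing every overlapping 15-mer once per unit of coverage: (497 − 14) × 1 + (498 − 14) × 3 + (497 − 14) × 5 = 483 + 1452 + 2415 = 4350. The sketch below illustrates that reading, together with one common de Bruijn convention in which nodes are (k−1)-mers and each k-mer is an edge from its prefix to its suffix. Whether `shear_into_kmers` and `make_debruijn_graph` work this way internally is an assumption; `shear` and `debruijn_edges` are hypothetical names, and a plain `Counter` stands in for the networkx graph.

```python
from collections import Counter


def shear(seq, coverage, k):
    """All overlapping k-mers of seq, each repeated `coverage` times (assumed behaviour)."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return windows * coverage


def debruijn_edges(kmers):
    """Map (prefix, suffix) pairs of (k-1)-mers to the number of k-mers supporting them."""
    return Counter((kmer[:-1], kmer[1:]) for kmer in kmers)


# Toy example: 5 windows of length 4, each seen twice
reads = shear("ACGTACGA", coverage=2, k=4)
edges = debruijn_edges(reads)
print(len(reads), len(edges))  # 10 5

# Check the k-mer count against the strain lengths and coverages printed earlier
total = sum((length - 15 + 1) * cov for length, cov in [(497, 1), (498, 3), (497, 5)])
print(total)  # 4350
```

In this representation the edge multiplicities carry the coverage information, which is what would let higher-coverage strains be told apart from lower-coverage ones in the graph.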
