# Colander Benchmarking

## 1. Generate mock data using Colander's utility code

In [22]:
from colander.mock_data_generation.utils import generate_random_sequence, generate_strains_from_genome

### 1.1. Generate a random "starting genome"

Real genomes are not random sequences of nucleotides. As chapter 1 of Compeau and Pevzner shows, features such as G/C skew and repetitive regions are nonrandom structure that a uniformly random model cannot capture, so there is reason to doubt how well a completely random genome, like the one generated here, stands in for a real one.
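To make the G/C skew point concrete, here is a minimal sketch of the cumulative skew (number of G's minus number of C's in each prefix of the sequence). The function name `cumulative_gc_skew` is hypothetical and is not part of Colander.

```python
def cumulative_gc_skew(seq):
    """Return a list whose i-th entry is (#G - #C) in the first i bases of seq."""
    skew = [0]
    for base in seq:
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0  # A and T leave the skew unchanged
        skew.append(skew[-1] + step)
    return skew


print(cumulative_gc_skew("GGCA"))  # [0, 1, 2, 1, 1]
```

On a uniformly random sequence this curve is an unbiased random walk, so it should not show the sustained trend that skew diagrams of real genomes can have.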

That being said, we have to start somewhere.

In [23]:
genome = generate_random_sequence(500)
genome

'ATTCAATAATCCCCGTGTGTAGCACAGTGCCTGACACAAGCAAAGTCGGCGGTTAAAACACCATTTACAAGCATTGAATGTGGCAAGAGGAGGGACTCCCCTCTTAAAGACGGATGCAGATCGTAGTGGCATATGCGACTTAACATCATATCCGATACTCCACTCCGGAAAGGATAGGTCGCTTTATTCCTTTGGTCAGGCCCCGAGCAATTCGCAAGAGGAATAGCCATTGCCCTATAAAAATAGTCTGAGAATACATAGAACGCGTCGCCATTCGTGCAGCTGTGTAAGGCGGAACACGGATATTTCGGGACGTCTTTCTATGTAATATAATTGTTTTATGTGGACTGATTTCTCCCGAAGCCAAGAAGCAAATTAATTCCGATGTAAGTATAAGACTACCTAAGTCATAGCTAAAGTTGGTGGGTCCCAGGCACGCATTGATCAGGCCGGAAAGACCTCCTAACGGAAAATGGAAGACGTTTCATAATACTGTCCGT'
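For readers without Colander installed, a uniformly random genome like the one above could be produced roughly as follows. This is a sketch under the assumption that each base is drawn independently and uniformly from `ACGT`; `random_dna` is a hypothetical stand-in, not the implementation of `generate_random_sequence`.

```python
import random


def random_dna(length, alphabet="ACGT", seed=None):
    """Return `length` nucleotides drawn independently and uniformly at random."""
    rng = random.Random(seed)  # a private RNG, so a seed makes the result reproducible
    return "".join(rng.choice(alphabet) for _ in range(length))


sketch_genome = random_dna(500, seed=0)
print(len(sketch_genome))
```

Passing a fixed `seed` makes benchmark runs repeatable, which helps when comparing results across runs.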

### 1.2. Generate strains by randomly adding mutations to this genome

In [24]:
# Define hypervariable regions in the genome: these will undergo more mutations
hv_regions = [(100, 150), (200, 250), (300, 350)]

# Coverage of each strain in the metagenome sequencing data (one entry per strain)
strain_coverages = [10, 5, 5, 3, 6, 4]

strains = generate_strains_from_genome(
    genome,
    strain_coverages,
    hv_regions,
    hypervariable_mutation_probability=0.5,
    normal_mutation_probability=0.01
)

In [25]:
for s in strains:
    print(len(s.seq), s.coverage)

492 10
514 5
523 5
507 3
512 6
515 4
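The strain lengths printed above differ from the 500-nucleotide starting genome, which suggests that the mutations include insertions and deletions as well as substitutions. The sketch below shows one way such a position-dependent mutation model might look. It is not Colander's implementation: the function name `mutate`, the half-open `(start, end)` interpretation of the regions, and the equal weighting of the three mutation types are all assumptions made for illustration.

```python
import random


def mutate(seq, hv_regions, hv_prob, normal_prob, seed=None):
    """Return a mutated copy of seq.

    Positions inside any (start, end) region (assumed half-open) mutate with
    probability hv_prob; all other positions mutate with probability normal_prob.
    """
    rng = random.Random(seed)
    out = []
    for i, base in enumerate(seq):
        in_hv = any(start <= i < end for start, end in hv_regions)
        p = hv_prob if in_hv else normal_prob
        if rng.random() >= p:
            out.append(base)  # no mutation at this position
            continue
        # Assumption: the three mutation types are equally likely
        kind = rng.choice(["substitution", "insertion", "deletion"])
        if kind == "substitution":
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif kind == "insertion":
            out.append(base)
            out.append(rng.choice("ACGT"))
        # deletion: append nothing, so the base is dropped
    return "".join(out)


toy_genome = "ACGT" * 125  # 500 nucleotides
toy_strain = mutate(toy_genome, [(100, 150), (200, 250), (300, 350)], 0.5, 0.01, seed=0)
print(len(toy_strain))
```

Because insertions and deletions are equally likely in this sketch, the strain length stays close to the original length on average, as it does in the output above.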


### 1.3. Shear the strain genomes into short reads and create a de Bruijn graph from their k-mers (TODO)
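This step is not implemented yet. As a placeholder, here is a minimal sketch of what it might look like, assuming error-free reads of a fixed length sampled uniformly from each strain, with the number of reads chosen so that reads × read length ≈ coverage × genome length. The names `shear_into_reads` and `de_bruijn_graph`, and the read length and k used below, are hypothetical and not part of Colander.

```python
import random
from collections import defaultdict


def shear_into_reads(seq, read_length, coverage, seed=None):
    """Sample fixed-length reads uniformly from seq at roughly the given coverage."""
    rng = random.Random(seed)
    n_reads = coverage * len(seq) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(seq) - read_length + 1)
        reads.append(seq[start:start + read_length])
    return reads


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns a nested dict: graph[prefix][suffix] = number of times the k-mer
    joining prefix to suffix occurs across all reads.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


toy_reads = shear_into_reads("ACGT" * 125, read_length=50, coverage=10, seed=0)
toy_graph = de_bruijn_graph(toy_reads, k=5)
print(len(toy_reads), len(toy_graph))
```

Storing k-mer multiplicities on the edges keeps per-strain coverage information in the graph, which may be useful when strains with different coverages are pooled into one metagenome.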