# Generating four genomes of the same length (1 Mbp)

## Fragment Preparation

The genomes are built from 8 fragments, R1 to R8:
* G1 = R1 R3 R3
* G2 = R2 R1 R4
* G3 = R5 R2 R6
* G4 = R7 R8 R8 R1

Size of each fragment:
* R1 = 400kbp
* R2 = 300kbp
* R3 = 300kbp
* R4 = 300kbp
* R5 = 250kbp
* R6 = 450kbp
* R7 = 350kbp
* R8 = 125kbp
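
As a quick sanity check on the layout above, the fragment sizes can be summed per genome to confirm that each one comes to 1 Mbp. This is a minimal sketch; the names `FRAGMENT_SIZES_KBP`, `GENOME_LAYOUT`, and `genome_length_kbp` are illustrative and not part of the original notebook.

```python
# Fragment sizes in kbp, copied from the list above.
FRAGMENT_SIZES_KBP = {
    "R1": 400, "R2": 300, "R3": 300, "R4": 300,
    "R5": 250, "R6": 450, "R7": 350, "R8": 125,
}

# Fragment order for each genome, copied from the layout above.
GENOME_LAYOUT = {
    "G1": ["R1", "R3", "R3"],
    "G2": ["R2", "R1", "R4"],
    "G3": ["R5", "R2", "R6"],
    "G4": ["R7", "R8", "R8", "R1"],
}

def genome_length_kbp(name):
    """Return the total length of a genome in kbp (hypothetical helper)."""
    return sum(FRAGMENT_SIZES_KBP[frag] for frag in GENOME_LAYOUT[name])

for name in GENOME_LAYOUT:
    print(name, genome_length_kbp(name), "kbp")
```

Every genome sums to 1000 kbp, which is why R8 (125 kbp) appears twice in G4.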

## Fragment preparation function


In [8]:
from random import choice

def createFrag(size):
    """Return a random DNA string of `size` bases drawn uniformly from C, G, T, A."""
    DNA = ""
    for count in range(size):
        DNA += choice("CGTA")
    return DNA
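
Because the fragments are random, rerunning the notebook produces different genomes each time. If reproducible genomes are wanted, one possible variant takes a seed and builds the string in a single `join`. This is only a sketch: `create_frag_seeded` and its `seed` parameter are hypothetical names, not something the notebook defines.

```python
import random

def create_frag_seeded(size, seed=None):
    """Return a random DNA string of `size` bases; the same seed gives the same string."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    # random.choices draws `size` bases with replacement in one call.
    return "".join(rng.choices("CGTA", k=size))

frag_a = create_frag_seeded(1000, seed=42)
frag_b = create_frag_seeded(1000, seed=42)
print(len(frag_a), frag_a == frag_b)
```

With `seed=None` the function behaves like the unseeded version and returns a different fragment on every call.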

In [51]:
R1 = createFrag(400 * 1000)
R2 = createFrag(300 * 1000)
R3 = createFrag(300 * 1000)
R4 = createFrag(300 * 1000)
R5 = createFrag(250 * 1000)
R6 = createFrag(450 * 1000)
R7 = createFrag(350 * 1000)
R8 = createFrag(125 * 1000)

G1 = R1 + R3 + R3
G2 = R2 + R1 + R4
G3 = R5 + R2 + R6
G4 = R7 + R8 + R8 + R1

## Create files for each genome

In [52]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

# Bio.Alphabet was removed in Biopython 1.78, so each Seq is built from the string alone.
# The "./Sample Genomes" directory must already exist before writing.

# G1
record1 = SeqRecord(Seq(G1), id="G1", description="Sample genome 1")
SeqIO.write(record1, "./Sample Genomes/G1.fa", "fasta")
# G2
record2 = SeqRecord(Seq(G2), id="G2", description="Sample genome 2")
SeqIO.write(record2, "./Sample Genomes/G2.fa", "fasta")
# G3
record3 = SeqRecord(Seq(G3), id="G3", description="Sample genome 3")
SeqIO.write(record3, "./Sample Genomes/G3.fa", "fasta")
# G4
record4 = SeqRecord(Seq(G4), id="G4", description="Sample genome 4")
SeqIO.write(record4, "./Sample Genomes/G4.fa", "fasta")

# Combined file with all four genomes; SeqIO.write returns the number of records written.
SeqIO.write([record1, record2, record3, record4], "./Sample Genomes/all-set.fa", "fasta")

4
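
Before simulating reads, it may be worth confirming that the written files contain what is expected. The sketch below reads a FASTA file with the standard library only and reports the length of each record. `fasta_lengths` is a hypothetical helper, not a Biopython function, and it assumes a plain, uncompressed FASTA file.

```python
def fasta_lengths(path):
    """Return {record_id: sequence_length} for a FASTA file (hypothetical helper)."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token after '>'.
                header = line[1:].split()
                current = header[0] if header else ""
                lengths[current] = 0
            elif current is not None:
                # Sequence lines may be wrapped, so lengths are accumulated.
                lengths[current] += len(line)
    return lengths

# Example usage, assuming the files above were written:
# fasta_lengths("./Sample Genomes/all-set.fa")
```

Given how the genomes were constructed, each of G1 to G4 should report 1,000,000 bases.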

## Creating Reads After Mixing Genomes

In [53]:
!cd 'Sampled Reads' && \
/media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/src/pbsim \
--data-type CLR \
--depth 25 \
--sample-fastq /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/sample/sample.fastq \
--length-min 10000 \
--length-max 30000 \
'/media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa'

:::: Simulation parameters :::

Simulated by fastq sampling.

prefix : sd
data-type : CLR
depth : 25.000000
length-mean : (sampling FASTQ)
length-sd : (sampling FASTQ)
length-min : 10000
length-max : 30000
accuracy-mean : (sampling FASTQ)
accuracy-sd : (sampling FASTQ)
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 10:60:30

:::: FASTQ stats ::::

file name : /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/sample/sample.fastq

:: all reads ::
read num. : 1711
read total length : 5154862
read min length : 140
read max length : 22539

:: filtered reads ::
read num. : 30
read total length : 397677
read min length : 10146
read max length : 22539
read length mean (SD) : 13255.900000 (3191.799047)
read accuracy mean (SD) : 0.783918 (0.015961)

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa

ref.1 (len:1000000
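
The parameters above allow a rough estimate of how much read data the simulation yields. This sketch assumes that PBSIM applies `--depth` to each reference sequence separately (which fits with it writing one read file per reference) and uses the filtered mean read length from the FASTQ stats as an approximation. The variable names are illustrative.

```python
GENOME_LENGTH_BP = 1_000_000    # each genome is 1 Mbp
DEPTH = 25                      # --depth passed to pbsim
NUM_GENOMES = 4                 # G1 to G4 in all-set.fa
MEAN_READ_LENGTH_BP = 13255.9   # filtered read length mean reported by pbsim

# Assumption: the depth applies to every reference sequence on its own.
bases_per_genome = GENOME_LENGTH_BP * DEPTH
approx_reads_per_genome = bases_per_genome / MEAN_READ_LENGTH_BP
total_bases = bases_per_genome * NUM_GENOMES

# Coverage of the pooled reads relative to a single 1 Mbp genome size.
coverage_vs_one_genome = total_bases / GENOME_LENGTH_BP

print(f"bases per genome: {bases_per_genome}")
print(f"approx. reads per genome: {approx_reads_per_genome:.0f}")
print(f"coverage relative to 1 Mbp: {coverage_vs_one_genome:.0f}")
```

Under these assumptions the pooled reads total about 100 Mbp, so an assembler given a genome size of 1 Mbp would see roughly 100x coverage.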

## Assembling the Reads with Flye

In [3]:
!/media/anuradhawick/data/Tools/Flye-2.3.6/bin/flye \
--subassemblies 'Sampled Reads/sd_0001.fastq' 'Sampled Reads/sd_0002.fastq' 'Sampled Reads/sd_0003.fastq' 'Sampled Reads/sd_0004.fastq' \
--out-dir 'Out Pacbio' \
--genome-size 1m \
--threads 16

[2018-11-05 11:02:10] INFO: Running Flye 2.3.6-release
[2018-11-05 11:02:10] INFO: Configuring run
[2018-11-05 11:02:10] INFO: Input genome size: 1000000
[2018-11-05 11:02:10] INFO: Estimated coverage: 100
[2018-11-05 11:02:10] INFO: Reads N50/N90: 12544 / 10665
[2018-11-05 11:02:10] INFO: Minimum overlap set to 1000
[2018-11-05 11:02:10] INFO: Selected k-mer size: 31
[2018-11-05 11:02:10] INFO: Assembling reads
[2018-11-05 11:02:10] INFO: Reading sequences
[2018-11-05 11:02:10] INFO: Generating solid k-mer index
[2018-11-05 11:02:23] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2018-11-05 11:02:24] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2018-11-05 11:02:26] INFO: Filling index table
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2018-11-05 11:02:32] INFO: Extending reads
[2018-11-05 11:02:36] INFO: Overlap-based coverage: 0
[2018-11-05 11:02:36] INFO: Median overlap divergence: 0
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%