# Generating reads from four genomes of the same length (5 Mbp)

## Fragment Preparation

The genomes are built from 8 fragments, R1 to R8:
* G1 = R1 R3 R3
* G2 = R2 R1 R4
* G3 = R5 R2 R6
* G4 = R7 R8 R8 R1

Size of each fragment:
* R1 = 2000kbp
* R2 = 1500kbp
* R3 = 1500kbp
* R4 = 1500kbp
* R5 = 1250kbp
* R6 = 2250kbp
* R7 = 1750kbp
* R8 = 625kbp
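
As a quick arithmetic check, the layouts and sizes above can be summed per genome. Each genome comes to 5,000 kbp, which agrees with the `len:5000000` that PBSIM reports for every reference later in this notebook. The dictionary names below are illustrative only and are not part of the notebook.

```python
# Fragment sizes in kbp, copied from the list above.
sizes_kbp = {"R1": 2000, "R2": 1500, "R3": 1500, "R4": 1500,
             "R5": 1250, "R6": 2250, "R7": 1750, "R8": 625}

# Genome layouts, copied from the list above (illustrative names).
layouts = {
    "G1": ["R1", "R3", "R3"],
    "G2": ["R2", "R1", "R4"],
    "G3": ["R5", "R2", "R6"],
    "G4": ["R7", "R8", "R8", "R1"],
}

# Total length of each genome in kbp.
genome_kbp = {g: sum(sizes_kbp[r] for r in frags) for g, frags in layouts.items()}

# Fragments that occur in more than one genome.
shared = {}
for g, frags in layouts.items():
    for r in set(frags):
        shared.setdefault(r, []).append(g)
shared = {r: sorted(gs) for r, gs in shared.items() if len(gs) > 1}

print(genome_kbp)
print(shared)
```

The `shared` mapping shows which fragments link the genomes: R1 appears in G1, G2 and G4, and R2 appears in G2 and G3.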

## Fragment preparation function


In [9]:
# python imports
from random import choice

# Biopython imports
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import DNAAlphabet  # Bio.Alphabet was removed in Biopython 1.78; this import needs an older release
from Bio import SeqIO

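def createFrag(size):
    """Return a random DNA string of `size` bases drawn uniformly from C, G, T, A."""
    # Join once instead of growing the string one base at a time.
    return "".join(choice("CGTA") for _ in range(size))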
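
For multi-megabase fragments, a variant that draws all bases in one call and accepts an optional seed can be handy for reproducible test data. This is a sketch, not the notebook's function: `create_frag_seeded` and its `seed` parameter are hypothetical names, and it relies on `random.Random.choices` (Python 3.6+).

```python
from random import Random

def create_frag_seeded(size, seed=None):
    """Return a random DNA string of `size` bases (hypothetical helper)."""
    rng = Random(seed)  # private generator; the global random state is untouched
    return "".join(rng.choices("ACGT", k=size))

frag = create_frag_seeded(1000, seed=42)
print(len(frag))
```

Calling it twice with the same seed returns the same fragment, so a synthetic dataset can be regenerated exactly.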

In [10]:
R1 = createFrag(2000 * 1000)
R2 = createFrag(1500 * 1000)
R3 = createFrag(1500 * 1000)
R4 = createFrag(1500 * 1000)
R5 = createFrag(1250 * 1000)
R6 = createFrag(2250 * 1000)
R7 = createFrag(1750 * 1000)
R8 = createFrag(625 * 1000)

G1 = R1 + R3 + R3
G2 = R2 + R1 + R4
G3 = R5 + R2 + R6
G4 = R7 + R8 + R8 + R1

## Create files for each genome

In [11]:
![ -e 'Sample Genomes' ] && rm -r 'Sample Genomes'
!mkdir 'Sample Genomes'

# G1
record1 = SeqRecord(Seq(G1, DNAAlphabet), id="G1", description="Sample genome 1")
SeqIO.write(record1, "./Sample Genomes/G1.fa", "fasta")
# G2
record2 = SeqRecord(Seq(G2, DNAAlphabet), id="G2", description="Sample genome 2")
SeqIO.write(record2, "./Sample Genomes/G2.fa", "fasta")
# G3
record3 = SeqRecord(Seq(G3, DNAAlphabet), id="G3", description="Sample genome 3")
SeqIO.write(record3, "./Sample Genomes/G3.fa", "fasta")
# G4
record4 = SeqRecord(Seq(G4, DNAAlphabet), id="G4", description="Sample genome 4")
SeqIO.write(record4, "./Sample Genomes/G4.fa", "fasta")

SeqIO.write([record1, record2, record3, record4], "./Sample Genomes/all-set.fa", "fasta")

4
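
A quick sanity check after writing is to read the combined FASTA back and confirm the record count and lengths. The sketch below does this with the standard library only. `fasta_lengths` is a hypothetical helper, and the tiny temporary file stands in for `all-set.fa`.

```python
import os
import tempfile

def fasta_lengths(path):
    """Map each record id (first word of the header) to its sequence length."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                current = line[1:].split()[0]
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths

# Demo on a small stand-in file (not the real genomes).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fa")
    with open(demo, "w") as out:
        out.write(">G1 Sample genome 1\nACGTAC\nGT\n>G2 Sample genome 2\nTTGA\n")
    demo_lengths = fasta_lengths(demo)

print(demo_lengths)
```

Run against `./Sample Genomes/all-set.fa`, the helper would be expected to list four records, G1 to G4, each 5,000,000 bases long.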

## Creating Reads After Mixing Genomes

In [12]:
![ -e 'Sampled Reads' ] && rm -r 'Sampled Reads'
!mkdir 'Sampled Reads'

!cd 'Sampled Reads' && pbsim \
--data-type CLR \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_clr \
--length-min 10000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/all-set.fa'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 10000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/all-set.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/all-set.fa

ref.1 (len:5000000) : G1 Sample genome 1
ref.2 (len:5000000) : G2 Sample genome 2
ref.3 (len:5000000) : G3 Sample genome 3
ref.4 (len:5000000) : G4 Sample genome 4

:::: Simulation stats (ref.1 - name: G1 Sample genome 1) ::::

read num. : 19164
depth : 50.000680
read length mean (SD) : 13045.470570 (3192.433938)
read length min : 10000
read length max : 29879
read accuracy mean (SD) : 0.781444 (0.018242)
substitution rate. : 0.000000
insertion rate.
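
The reported depth can be cross-checked from the other statistics, since depth is the total number of simulated bases divided by the reference length. A minimal sketch using the G1 figures printed above:

```python
# Figures copied from the G1 simulation stats above.
read_num = 19164
mean_read_len = 13045.470570
ref_len = 5_000_000

# depth = total simulated bases / reference length
depth = read_num * mean_read_len / ref_len
print(round(depth, 4))
```

The result agrees with the `depth : 50.000680` line to within rounding of the printed mean read length.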

## Mixing reads into a single FASTQ

In [13]:
!cat 'Sampled Reads/sd_0001.fastq' 'Sampled Reads/sd_0002.fastq' 'Sampled Reads/sd_0003.fastq' 'Sampled Reads/sd_0004.fastq' > 'Sampled Reads/all-in-one.fastq'
!gzip -c -1 'Sampled Reads/all-in-one.fastq' > 'Sampled Reads/all-in-one.fastq.gz'

!gzip -c -1 'Sampled Reads/sd_0001.fastq' > 'Sampled Reads/sd_0001.fastq.gz'
!gzip -c -1 'Sampled Reads/sd_0002.fastq' > 'Sampled Reads/sd_0002.fastq.gz'
!gzip -c -1 'Sampled Reads/sd_0003.fastq' > 'Sampled Reads/sd_0003.fastq.gz'
!gzip -c -1 'Sampled Reads/sd_0004.fastq' > 'Sampled Reads/sd_0004.fastq.gz'
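
The same concatenate-and-compress step can be done without shelling out. The sketch below uses `gzip` and `shutil` from the standard library. `merge_and_gzip` is a hypothetical helper, and `compresslevel=1` mirrors the `-1` flag passed to `gzip` above. The demo runs on throwaway files, not the real reads.

```python
import gzip
import os
import shutil
import tempfile

def merge_and_gzip(inputs, output_gz, level=1):
    """Concatenate the input files, in order, into one gzip file."""
    with gzip.open(output_gz, "wb", compresslevel=level) as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

with tempfile.TemporaryDirectory() as tmp:
    # Two tiny stand-in FASTQ files, one read each.
    parts = []
    for i, text in enumerate(["@r1\nACGT\n+\n!!!!\n", "@r2\nTTGA\n+\n!!!!\n"], start=1):
        path = os.path.join(tmp, f"sd_{i:04d}.fastq")
        with open(path, "w") as handle:
            handle.write(text)
        parts.append(path)

    merged = os.path.join(tmp, "all-in-one.fastq.gz")
    merge_and_gzip(parts, merged)
    with gzip.open(merged, "rt") as handle:
        merged_text = handle.read()
```

Streaming straight into the gzip file avoids writing the uncompressed `all-in-one.fastq` first, at the cost of not keeping that intermediate file.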

# Generating Reads for SPAdes (CCS)

In [35]:
!cd 'Sampled Reads' && pbsim \
--data-type CLR \
--prefix sd_ccs \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_ccs \
--length-min 10000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd_ccs
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 10000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa

ref.1 (len:5000000) : G1 Sample genome 1
ref.2 (len:5000000) : G2 Sample genome 2
ref.3 (len:5000000) : G3 Sample genome 3
ref.4 (len:5000000) : G4 Sample genome 4

:::: Simulation stats (ref.1 - name: G1 Sample genome 1) ::::

read num. : 19146
depth : 50.001312
read length mean (SD) : 13057.900397 (3198.824039)
read length min : 10000
read length max : 29879
read accuracy mean (SD) : 0.781542 (0.018263)
substitution rate. : 0.000000
insertion rate. : 0.000000
de

## Mixing CCS reads into a single FASTQ

In [37]:
!cat 'Sampled Reads/sd_ccs_0001.fastq' 'Sampled Reads/sd_ccs_0002.fastq' 'Sampled Reads/sd_ccs_0003.fastq' 'Sampled Reads/sd_ccs_0004.fastq' > 'Sampled Reads/all-in-one_ccs.fastq'
!gzip -c -1 'Sampled Reads/all-in-one_ccs.fastq' > 'Sampled Reads/all-in-one_ccs.fastq.gz'

!gzip -c -1 'Sampled Reads/sd_ccs_0001.fastq' > 'Sampled Reads/sd_ccs_0001.fastq.gz'
!gzip -c -1 'Sampled Reads/sd_ccs_0002.fastq' > 'Sampled Reads/sd_ccs_0002.fastq.gz'
!gzip -c -1 'Sampled Reads/sd_ccs_0003.fastq' > 'Sampled Reads/sd_ccs_0003.fastq.gz'
!gzip -c -1 'Sampled Reads/sd_ccs_0004.fastq' > 'Sampled Reads/sd_ccs_0004.fastq.gz'
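
To confirm that nothing was lost when merging, one can count the records in a compressed file and compare the total with the sum over the per-genome files. `count_fastq_reads` below is a hypothetical helper that assumes four lines per record (no wrapped sequences), and it is demonstrated on a throwaway file.

```python
import gzip
import os
import tempfile

def count_fastq_reads(path_gz):
    """Count reads in a gzipped FASTQ, assuming four lines per record."""
    with gzip.open(path_gz, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4

with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo, "wt") as out:
        out.write("@r1\nACGT\n+\n!!!!\n@r2\nTTGA\n+\n!!!!\n")
    n_reads = count_fastq_reads(demo)

print(n_reads)
```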

# Generating reads from E. coli

In [5]:
![ -e 'Sampled Reads EColi' ] && rm -r 'Sampled Reads EColi'
!mkdir 'Sampled Reads EColi'

!cd 'Sampled Reads EColi' && pbsim \
--data-type CLR \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_clr \
--length-min 10000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/e.coli.fasta'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 10000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/e.coli.fasta
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/e.coli.fasta

ref.1 (len:5231428) : AE014075.1 Escherichia coli CFT073, complete genome

:::: Simulation stats (ref.1 - name: AE014075.1 Escherichia coli CFT073, complete genome) ::::

read num. : 20074
depth : 50.000872
read length mean (SD) : 13030.584836 (3169.437169)
read length min : 10000
read length max : 29879
read accuracy mean (SD) : 0.781598 (0.018261)
substitution rate. : 0.000000
insertion rate. : 0.000000
deletion rate. : 0.000000

:::: System ut