# Generating reads from four genomes of the same length (5 Mbp)

## Fragment Preparation

The four genomes are built from 8 fragments, R1 to R8:
* G1 = R1 R3 R3
* G2 = R2 R1 R4
* G3 = R5 R2 R6
* G4 = R7 R8 R8 R1

Size of each fragment:
* R1 = 2000kbp
* R2 = 1500kbp
* R3 = 1500kbp
* R4 = 1500kbp
* R5 = 1250kbp
* R6 = 2250kbp
* R7 = 1750kbp
* R8 = 625kbp
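
The layouts and sizes listed above can be checked against the 5 Mbp target with a few lines of arithmetic. This is only a sanity-check sketch; the dictionary names `sizes_kbp`, `layouts` and `genome_kbp` are introduced here for illustration and do not appear in the notebook.

```python
# Fragment sizes in kbp, copied from the list above.
sizes_kbp = {"R1": 2000, "R2": 1500, "R3": 1500, "R4": 1500,
             "R5": 1250, "R6": 2250, "R7": 1750, "R8": 625}

# Genome layouts, copied from the composition list above.
layouts = {
    "G1": ["R1", "R3", "R3"],
    "G2": ["R2", "R1", "R4"],
    "G3": ["R5", "R2", "R6"],
    "G4": ["R7", "R8", "R8", "R1"],
}

# Total length of each genome in kbp.
genome_kbp = {g: sum(sizes_kbp[r] for r in frags) for g, frags in layouts.items()}

# Fragments that occur in more than one genome (shared sequence between genomes).
shared = {r: sorted(g for g, frags in layouts.items() if r in frags) for r in sizes_kbp}
shared = {r: gs for r, gs in shared.items() if len(gs) > 1}

for g, total in genome_kbp.items():
    print(g, total, "kbp")
print("shared fragments:", shared)
```

Every genome adds up to 5000 kbp, and only R1 and R2 are shared between genomes; R3 and R8 are repeated within a single genome.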

## Fragment preparation function


In [1]:
# python imports
from random import choice

# Biopython imports
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
# Bio.Alphabet was removed in Biopython 1.78, so this import needs an older release
from Bio.Alphabet import DNAAlphabet
from Bio import SeqIO
import glob

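def createFrag(size):
    """Return a random DNA string of `size` bases drawn uniformly from C, G, T, A."""
    # Join once instead of growing the string one base at a time.
    return "".join(choice("CGTA") for _ in range(size))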

In [10]:
R1 = createFrag(2000 * 1000)
R2 = createFrag(1500 * 1000)
R3 = createFrag(1500 * 1000)
R4 = createFrag(1500 * 1000)
R5 = createFrag(1250 * 1000)
R6 = createFrag(2250 * 1000)
R7 = createFrag(1750 * 1000)
R8 = createFrag(625 * 1000)

G1 = R1 + R3 + R3
G2 = R2 + R1 + R4
G3 = R5 + R2 + R6
G4 = R7 + R8 + R8 + R1
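
The fragments above are drawn from an unseeded generator, so every run of the notebook produces different genomes. If reproducible genomes are wanted, one possible variant is sketched below. The function name `create_frag_seeded` and the `seed` parameter are hypothetical additions, not part of the original notebook.

```python
import random


def create_frag_seeded(size, seed):
    """Return a random DNA string of `size` bases, reproducible for a given seed."""
    # A private Random instance keeps the global random state untouched.
    rng = random.Random(seed)
    # choices() draws all k bases in one call.
    return "".join(rng.choices("ACGT", k=size))


# Small demo fragment; the notebook itself uses sizes such as 2000 * 1000.
frag = create_frag_seeded(1000, seed=42)
print(len(frag), frag[:20])
```

Giving each of R1 to R8 its own seed would keep the fragments distinct while making the whole data set repeatable.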

## Create files for each genome

In [11]:
![ ! -e 'Sample_Genomes' ] && mkdir 'Sample_Genomes'

# G1
record1 = SeqRecord(Seq(G1, DNAAlphabet), id="G1", description="Sample genome 1")
SeqIO.write(record1, "./Sample_Genomes/G1.fa", "fasta")
# G2
record2 = SeqRecord(Seq(G2, DNAAlphabet), id="G2", description="Sample genome 2")
SeqIO.write(record2, "./Sample_Genomes/G2.fa", "fasta")
# G3
record3 = SeqRecord(Seq(G3, DNAAlphabet), id="G3", description="Sample genome 3")
SeqIO.write(record3, "./Sample_Genomes/G3.fa", "fasta")
# G4
record4 = SeqRecord(Seq(G4, DNAAlphabet), id="G4", description="Sample genome 4")
SeqIO.write(record4, "./Sample_Genomes/G4.fa", "fasta")

SeqIO.write([record1, record2, record3, record4], "./Sample_Genomes/all-set.fa", "fasta")

4
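
To make the on-disk layout of these files explicit: a FASTA record is a `>` header line (id, then description) followed by the sequence wrapped over several lines. The sketch below writes that layout with the standard library only. `write_fasta` and its `width` parameter are illustrative names, and the 60-column wrap is an arbitrary choice for this example, not something taken from the notebook.

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Write (id, description, sequence) tuples to `path`; return the record count."""
    with open(path, "w") as handle:
        for rec_id, description, sequence in records:
            handle.write(f">{rec_id} {description}\n")
            # Wrap the sequence so that no line is longer than `width` characters.
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")
    return len(records)


# Demo with a 160-base stand-in sequence written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa")
written = write_fasta([("G1", "Sample genome 1", "ACGT" * 40)], demo_path)

with open(demo_path) as handle:
    demo_lines = handle.read().splitlines()
print(written, demo_lines[0], len(demo_lines))
```

Like `SeqIO.write`, the helper returns the number of records written, which is where the `4` in the cell output above comes from.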

## Creating Reads After Mixing Genomes

In [12]:
![ -e 'Sampled_Reads' ] && rm -r 'Sampled_Reads'
!mkdir 'Sampled_Reads'

!cd 'Sampled_Reads' && pbsim \
--data-type CLR \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_clr \
--length-min 10000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/all-set.fa'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 10000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/all-set.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/all-set.fa

ref.1 (len:5000000) : G1 Sample genome 1
ref.2 (len:5000000) : G2 Sample genome 2
ref.3 (len:5000000) : G3 Sample genome 3
ref.4 (len:5000000) : G4 Sample genome 4

:::: Simulation stats (ref.1 - name: G1 Sample genome 1) ::::

read num. : 19164
depth : 50.000680
read length mean (SD) : 13045.470570 (3192.433938)
read length min : 10000
read length max : 29879
read accuracy mean (SD) : 0.781444 (0.018242)
substitution rate. : 0.000000
insertion rate.
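
The reported depth can be cross-checked from the other numbers in the stats block: depth is the total number of simulated bases (read count times mean read length) divided by the reference length. The sketch below redoes that arithmetic for G1 with the values copied from the log above; the variable names are illustrative.

```python
# Values copied from the G1 simulation stats and the reference stats above.
read_num = 19164
mean_read_length = 13045.470570
genome_length = 5_000_000
target_depth = 50

# Depth = total simulated bases / reference length.
depth = read_num * mean_read_length / genome_length

# Number of reads needed to reach the target depth at this mean read length.
expected_reads = target_depth * genome_length / mean_read_length

print(f"depth = {depth:.6f}")
print(f"expected reads = {expected_reads:.1f}")
```

The result agrees with the logged depth of 50.000680 to within rounding of the printed mean.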

## Mixing reads into a single FASTQ

In [13]:
!cat 'Sampled_Reads/sd_0001.fastq' 'Sampled_Reads/sd_0002.fastq' 'Sampled_Reads/sd_0003.fastq' 'Sampled_Reads/sd_0004.fastq' > 'Sampled_Reads/all-in-one.fastq'
!gzip -c -1 'Sampled_Reads/all-in-one.fastq' > 'Sampled_Reads/all-in-one.fastq.gz'

!gzip -c -1 'Sampled_Reads/sd_0001.fastq' > 'Sampled_Reads/sd_0001.fastq.gz'
!gzip -c -1 'Sampled_Reads/sd_0002.fastq' > 'Sampled_Reads/sd_0002.fastq.gz'
!gzip -c -1 'Sampled_Reads/sd_0003.fastq' > 'Sampled_Reads/sd_0003.fastq.gz'
!gzip -c -1 'Sampled_Reads/sd_0004.fastq' > 'Sampled_Reads/sd_0004.fastq.gz'
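
For environments without a shell, the `cat` and `gzip -c -1` steps above can be approximated in Python. This is a sketch using only the standard library; `concat_and_gzip` is a hypothetical helper, and the demo runs on two tiny stand-in FASTQ files rather than on the real pbsim output.

```python
import gzip
import os
import shutil
import tempfile


def concat_and_gzip(inputs, merged_path, level=1):
    """Concatenate `inputs` into `merged_path`, then also write `merged_path + '.gz'`."""
    # Equivalent of: cat inputs... > merged_path
    with open(merged_path, "wb") as merged:
        for path in inputs:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)
    # Equivalent of: gzip -c -1 merged_path > merged_path.gz
    gz_path = merged_path + ".gz"
    with open(merged_path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Demo: two one-read FASTQ files in a temporary directory.
workdir = tempfile.mkdtemp()
parts = []
for i, seq in enumerate(["ACGT", "GGCC"], start=1):
    part_path = os.path.join(workdir, f"demo_{i:04d}.fastq")
    with open(part_path, "w") as handle:
        handle.write(f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n")
    parts.append(part_path)

gz_path = concat_and_gzip(parts, os.path.join(workdir, "all-in-one.fastq"))
with gzip.open(gz_path, "rt") as handle:
    merged_text = handle.read()
print(merged_text)
```

Compression level 1 mirrors the `-1` flag: it trades file size for speed, which matters for multi-gigabyte read sets.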

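# Generating reads for SPAdes (CCS)

The command below points `--model_qc` at the CCS quality model but still passes `--data-type CLR`, and the logged parameters report `data-type : CLR` with `accuracy-mean : 0.780000`.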

In [35]:
!cd 'Sampled_Reads' && pbsim \
--data-type CLR \
--prefix sd_ccs \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_ccs \
--length-min 10000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly_Graph/Sample_Genomes/all-set.fa'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd_ccs
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 10000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Sample Genomes/all-set.fa

ref.1 (len:5000000) : G1 Sample genome 1
ref.2 (len:5000000) : G2 Sample genome 2
ref.3 (len:5000000) : G3 Sample genome 3
ref.4 (len:5000000) : G4 Sample genome 4

:::: Simulation stats (ref.1 - name: G1 Sample genome 1) ::::

read num. : 19146
depth : 50.001312
read length mean (SD) : 13057.900397 (3198.824039)
read length min : 10000
read length max : 29879
read accuracy mean (SD) : 0.781542 (0.018263)
substitution rate. : 0.000000
insertion rate. : 0.000000
de

## Mixing CCS reads into a single FASTQ

In [37]:
!cat 'Sampled_Reads/sd_ccs_0001.fastq' 'Sampled_Reads/sd_ccs_0002.fastq' 'Sampled_Reads/sd_ccs_0003.fastq' 'Sampled_Reads/sd_ccs_0004.fastq' > 'Sampled_Reads/all-in-one_ccs.fastq'
!gzip -c -1 'Sampled_Reads/all-in-one_ccs.fastq' > 'Sampled_Reads/all-in-one_ccs.fastq.gz'

!gzip -c -1 'Sampled_Reads/sd_ccs_0001.fastq' > 'Sampled_Reads/sd_ccs_0001.fastq.gz'
!gzip -c -1 'Sampled_Reads/sd_ccs_0002.fastq' > 'Sampled_Reads/sd_ccs_0002.fastq.gz'
!gzip -c -1 'Sampled_Reads/sd_ccs_0003.fastq' > 'Sampled_Reads/sd_ccs_0003.fastq.gz'
!gzip -c -1 'Sampled_Reads/sd_ccs_0004.fastq' > 'Sampled_Reads/sd_ccs_0004.fastq.gz'
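
After merging, it can be worth confirming that no reads were lost by counting the records in the combined file and comparing the total with the `read num.` values in the simulation logs. The sketch below assumes each FASTQ record occupies exactly four lines (unwrapped sequence and quality strings); `count_fastq_reads` is an illustrative helper, and the demo uses a small temporary file instead of the real output.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count records in a four-line-per-record FASTQ file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Demo: write three stand-in reads to a gzipped FASTQ and count them.
demo_dir = tempfile.mkdtemp()
demo_fastq = os.path.join(demo_dir, "demo.fastq.gz")
with gzip.open(demo_fastq, "wt") as handle:
    for i in range(3):
        handle.write(f"@read{i}\nACGT\n+\nIIII\n")

n_reads = count_fastq_reads(demo_fastq)
print(n_reads)
```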

# Generating reads from E. coli

In [5]:
![ -e 'Sampled_Reads_EColi' ] && rm -r 'Sampled_Reads_EColi'
!mkdir 'Sampled_Reads_EColi'

!cd 'Sampled_Reads_EColi' && pbsim \
--data-type CLR \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_clr \
--length-min 10000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/e.coli.fasta'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 10000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/e.coli.fasta
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/e.coli.fasta

ref.1 (len:5231428) : AE014075.1 Escherichia coli CFT073, complete genome

:::: Simulation stats (ref.1 - name: AE014075.1 Escherichia coli CFT073, complete genome) ::::

read num. : 20074
depth : 50.000872
read length mean (SD) : 13030.584836 (3169.437169)
read length min : 10000
read length max : 29879
read accuracy mean (SD) : 0.781598 (0.018261)
substitution rate. : 0.000000
insertion rate. : 0.000000
deletion rate. : 0.000000

:::: System ut

# Generating reads from Human Chr22

In [14]:
![ -e 'Sampled_Reads_Chr22' ] && rm -r 'Sampled_Reads_Chr22'
!mkdir 'Sampled_Reads_Chr22'

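with open('/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/chr22.fa', 'r') as f:
    # Split on whitespace: the first token is the header, the rest are sequence lines.
    # This assumes the header line contains no spaces.
    ls = f.read().split()
    # The header token keeps its leading '>', which therefore ends up in the record id.
    name = ls.pop(0)

    g0 = ''.join(ls)

    # Take the 5 Mbp window from position 25,000,000 up to 30,000,000.
    g = g0[25000000:30000000]

    # Count unknown bases in the window.
    count_N = g.count('N')

    print("N Count = " + str(count_N))

    record1 = SeqRecord(Seq(g, DNAAlphabet), id=name, description="Chromosome 22 5mb")
    SeqIO.write(record1, "/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/chr22_5mb.fa", "fasta")

print("COMPLETED")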

!cd 'Sampled_Reads_Chr22' && pbsim \
--data-type CLR \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_clr \
--length-min 20000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/chr22_5mb.fa'

N Count = 0
COMPLETED
:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 20000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/chr22_5mb.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly Graph/Assembly/Sample Genomes/chr22_5mb.fa

ref.1 (len:5000000) : >chr22 Chromosome 22 5mb

:::: Simulation stats (ref.1 - name: >chr22 Chromosome 22 5mb) ::::

read num. : 10791
depth : 50.003845
read length mean (SD) : 23169.235752 (2597.060978)
read length min : 20000
read length max : 29903
read accuracy mean (SD) : 0.781409 (0.018173)
substitution rate. : 0.000000
insertion rate. : 0.000000
deletion rate. : 0.000000

:::: System utilization ::::

CPU time(s) : 9
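
The record name in the log above reads `>chr22` because the whitespace-split parsing keeps the `>` of the header in the id. A slightly more defensive way to read a single-record FASTA and cut a window is sketched below. `read_single_fasta` and `window` are hypothetical helpers, and the demo uses a short in-memory record in place of `chr22.fa`.

```python
def read_single_fasta(lines):
    """Return (record_id, sequence) from the lines of a single-record FASTA."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                raise ValueError("more than one record found")
            # Drop the '>' and keep only the first word of the header as the id.
            record_id = (line[1:].split() or [""])[0]
        else:
            chunks.append(line)
    return record_id, "".join(chunks)


def window(sequence, start, end):
    """Return sequence[start:end] and the number of N/n bases inside it."""
    sub = sequence[start:end]
    return sub, sub.upper().count("N")


# Demo on a tiny stand-in record with a multi-word header.
demo_lines = [">chr22 extra header words", "NNNNACGT", "ACGTACGT"]
rec_id, seq = read_single_fasta(demo_lines)
sub, n_count = window(seq, 4, 12)
print(rec_id, sub, n_count)
```

On the real file, an open file handle can be passed as `lines`, with `window(seq, 25000000, 30000000)` reproducing the slice used in the cell.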


# Generating reads from Yeast Genome

In [23]:
![ -e 'Sampled_Reads_Yeast' ] && rm -r 'Sampled_Reads_Yeast'
!mkdir 'Sampled_Reads_Yeast'

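# Sort the matches so the chromosomes are concatenated in a deterministic order.
file_list = sorted(glob.glob("/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/yeast/chr*.fa"))
genome = ""

for f in file_list:
    with open(f, 'r') as fopen:
        content = fopen.read()
        # Split on whitespace and drop the first token (the FASTA header).
        # This assumes the header line contains no spaces.
        content = content.strip().split()
        content.pop(0)
        content = "".join(content)
        # Append this chromosome to the combined genome.
        genome += content

record = SeqRecord(Seq(genome, DNAAlphabet), id="YeastGenome", description="Complete Yeast Genome")
SeqIO.write(record, "./Sample_Genomes/yeast/YeastGenome.fa", "fasta")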

!cd 'Sampled_Reads_Yeast' && pbsim \
--data-type CLR \
--depth 50 \
--model_qc /media/anuradhawick/data/Tools/PBSIM-PacBio-Simulator/data/model_qc_clr \
--length-min 20000 \
--length-max 30000 \
--difference-ratio 0:0:0 \
'/media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/yeast/YeastGenome.fa'

:::: Simulation parameters :::

Simulated by stochastic model.

prefix : sd
data-type : CLR
depth : 50.000000
length-mean : 3000.000000
length-sd : 2300.000000
length-min : 20000
length-max : 30000
accuracy-mean : 0.780000
accuracy-sd : 0.020000
accuracy-min : 0.750000
accuracy-max : 1.000000
difference-ratio : 0:0:0

!! ref file /media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/yeast/YeastGenome.fa
:::: Reference stats ::::

file name : /media/anuradhawick/data/Experiments/Assembly_Graph/Assembly/Sample_Genomes/yeast/YeastGenome.fa

ref.1 (len:12157105) : YeastGenome Complete Yeast Genome

:::: Simulation stats (ref.1 - name: YeastGenome Complete Yeast Genome) ::::

read num. : 26277
depth : 50.000218
read length mean (SD) : 23132.697987 (2557.293961)
read length min : 20000
read length max : 29903
read accuracy mean (SD) : 0.781396 (0.018077)
substitution rate. : 0.000000
insertion rate. : 0.000000
deletion rate. : 0.000000

:::: System utilization ::::

CPU
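
The yeast genome above is assembled by concatenating every `chr*.fa` file, so the order of the files decides the coordinates of each chromosome in `YeastGenome.fa`. If the files happen to be numbered (for example `chr1.fa`, `chr2.fa`, `chr10.fa`, which is an assumption, since the actual file names are not shown), a natural sort key keeps them in numeric order. `natural_key` is an illustrative helper.

```python
import re


def natural_key(path):
    """Split a path into text and integer chunks so 'chr2' sorts before 'chr10'."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path)]


# Hypothetical file names, for illustration only.
names = ["yeast/chr10.fa", "yeast/chr2.fa", "yeast/chr1.fa", "yeast/chrM.fa"]
ordered = sorted(names, key=natural_key)
print(ordered)
```

For chromosome files named with Roman numerals, a plain `sorted()` still gives a repeatable order, although not the biological one.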