# Example: paired-end sequencing

This example was generated from the source
file [`pe-example.jl`](https://github.com/bioinfologics/Pseudoseq.jl/blob/master/examples/pe-example.jl).
You are viewing the
Jupyter notebook version. The corresponding online documentation page can
be found here: [`pe-example.html`](https://bioinfologics.github.io/Pseudoseq.jl/latest/examples/pe-example),
and the script can be found here: [`pe-example.jl`](./pe-example.jl).

For the simulation we are going to:

1. Create a pool of 5000 copies of a reference genome.
2. Fragment the DNA molecules in the pool, to an average length of 700bp.
3. Subsample the molecules in the pool to achieve approximately 50x coverage.
4. Create a set of 250bp paired-end reads.
5. Apply errors to the paired-end reads at a rate of 0.001 (0.1%).
6. Generate an output FASTQ file.

In [1]:
using Pseudoseq

Starting with a FASTA formatted file containing the genome we want to sequence,
we create a pool with 5000 copies of the genome.

In [2]:
pool = makepool("ecoli-ref.fasta", 5000)

┌ Info: Reading FASTA file...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:38
┌ Info: Creating pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:16
┌ Info: Populating pool with 5000 copies of the genome...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:21
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:25


Pool of 5000 molecules:
 All molecules are of the same size: 4639675


Next we use the `fragment` function to break the molecules into a pool of shorter
DNA molecules with an average length of 700bp.

In [3]:
cutpool = fragment(pool, 700)

┌ Info: Fragmenting molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:126
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:128


Pool of 33139998 molecules:
 Maximum molecule size: 11867
 Average molecule size: 700
 Minimum molecule size: 1
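
As a rough sanity check on the fragment count, the total number of bases in the pool divided by the requested average fragment length should land close to the number of molecules reported above. The sketch below only redoes that arithmetic with the values from this notebook; it does not call Pseudoseq, and the variable names are illustrative.

```julia
# Values taken from the notebook output above.
genome_size   = 4_639_675   # size of each molecule in the original pool (bp)
copies        = 5000        # number of genome copies in the pool
mean_fragment = 700         # requested average fragment length (bp)

# Total bases across all copies, then the fragment count implied by the mean length.
total_bases        = copies * genome_size
expected_fragments = round(Int, total_bases / mean_fragment)

println("Total bases in pool: ", total_bases)
println("Expected fragments:  ", expected_fragments)
```

The result is about 33.14 million fragments, in line with the 33139998 molecules reported for the fragmented pool.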


We need to determine the number of molecules to sample, and subsample the pool:

In [4]:
genome_size = 4639675
expected_coverage = 50
read_length = 250

N = needed_sample_size(expected_coverage, genome_size, read_length)
N = div(N, 2) # Divide by 2, as each sampled molecule yields a pair of reads.

sampledpool = subsample(cutpool, N)

┌ Info: Subsampling molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:147
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:149


Pool of 463967 molecules:
 Maximum molecule size: 10558
 Average molecule size: 700
 Minimum molecule size: 1
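
The size of the subsampled pool can be checked by hand with the usual coverage relation, coverage = reads × read length / genome size. The helper below, `reads_for_coverage`, is a hypothetical stand-in written for this check. It is not Pseudoseq's `needed_sample_size`, which may round or compute the value differently.

```julia
# Hypothetical helper (not part of Pseudoseq): number of reads needed so that
# reads * read_length / genome_size reaches the requested coverage.
reads_for_coverage(coverage, genome_size, read_length) =
    div(coverage * genome_size, read_length)

n_reads = reads_for_coverage(50, 4_639_675, 250)

# Each sampled molecule produces two reads in paired-end sequencing,
# so only half as many molecules are needed.
n_pairs = div(n_reads, 2)

println("Reads needed:     ", n_reads)
println("Molecules needed: ", n_pairs)
```

Under these assumptions the calculation gives 927935 reads and 463967 molecules, which agrees with the size of the subsampled pool shown above.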


Next we create a set of paired-end reads, each 250bp in length.

In [5]:
pe_reads = make_reads(PairedEnd, sampledpool, 250)



Now that we have some reads, we mark the positions in the reads that are destined
to be errors in the output FASTQ.

In [6]:
pe_w_errs = mark_errors(pe_reads, 0.001)

┌ Info: Marking 162210 bases in the set of reads as errors...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:50
┌ Info: Done...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:52




Now that we have paired-end reads with some positions marked as errors, we
can generate the output FASTQ file.

In [7]:
generate("pe-reads.fastq", pe_w_errs)

┌ Info: Writing paired end reads to FASTQ...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:189
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:215


*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*