# Example: long, single end reads

This example was generated from the source
file [`se-example.jl`](https://github.com/bioinfologics/Pseudoseq.jl/blob/masterexamples/sequencing/se-example.jl).
You are seeing the
Jupyter notebook version. The corresponding online documentation page can
be found here: [`se-example.html`](https://bioinfologics.github.io/Pseudoseq.jl/latest/examples/sequencing/se-example),
and the script can be found here: [`se-example.jl`](./se-example.jl).

Let's see how you might simulate something like an Oxford Nanopore sequencing
experiment.

For the simulation we are going to:

1. Create a pool of 5000 copies of a reference genome.
2. Fragment the DNA molecules in the pool, to an average length of 40,000bp.
3. Subsample the molecules in the pool to achieve approximately 30x coverage.
4. Create a set of single-end reads, each spanning the entire length of its molecule.
5. Apply errors to the reads at a rate of 0.10 (1 error every 10bp).
6. Generate an output FASTQ file.

In [1]:
using Pseudoseq.Sequencing

## Using the `sequence` method

First, let's see how we do this with the `sequence` method.
The first two parameters we give to the function will be the input genome we
want to sequence, and the destination FASTQ file for output reads.
Here we are setting:
- The number of genome copies in the molecule pool to 5000.
- The average fragment size to 40000bp.
- The sampling coverage to 30x.
- The read length to `nothing`, which will make the sequencer read the whole length of any DNA fragment.
- The per base read error rate to 0.1.
- Paired-end reading (`paired`) to `false`, so each fragment is read from one end only.

In [2]:
sequence("ecoli-ref.fasta", "longreads.fastq"; ng = 5000, flen = 40000, cov = 30, rdlen = nothing, err = 0.1, paired = false)

┌ Info: - ✔ Created pool of 5000 copies of a 4639675bp genome
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/sequence.jl:45
┌ Info: - ✔ Created pool of fragments with an average length of 40000bp
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/sequence.jl:55
┌ Info: - ✔ Subsampled pool at 30X coverage (3479 molecules)
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/sequence.jl:62
┌ Info: - ✔ Created set of single-end reads
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/sequence.jl:69
┌ Info: - ✔ Applied sequencing errors at a per-base rate of 0.1
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/sequence.jl:73
┌ Info: - ✔ Wrote 3479 single end reads to FASTQ file
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/Reads.jl:182




## Using the Pseudoseq API

Here's how to achieve the same thing using the Pseudoseq API. It is necessary to
use the API if you want to do something that is not anticipated by the available
functionality of the `sequence` method: the cost of convenience is fewer options.

Let's start with a pool of 5000 copies of a genome contained in a FASTA file:

In [3]:
pool = Molecules("ecoli-ref.fasta", 5000)

Pool of 5000 DNA molecules:
 All molecules are of the same size: 4639675


Cut the pool of DNA into fragments of an average length of 40,000bp:

In [4]:
cutpool = fragment(pool, 40000)

Pool of 575000 DNA molecules:
 Maximum molecule size: 502463
 Average molecule size: 40345
 Minimum molecule size: 1
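
The pool summary above can be sanity-checked with a little arithmetic. The sketch below is plain Julia and does not call Pseudoseq. It assumes that each genome copy is cut into a whole number of pieces; that is an inference from the printed numbers, not a description of how `fragment` is implemented.

```julia
# Back-of-the-envelope check of the fragment pool summary.
# Assumption: each genome copy yields floor(genome_size / target_length) fragments.
genome_size = 4_639_675      # bp, from the pool summary
n_copies = 5_000             # genome copies in the pool
target_length = 40_000       # requested average fragment length in bp

fragments_per_copy = div(genome_size, target_length)
total_fragments = n_copies * fragments_per_copy
mean_length = genome_size / fragments_per_copy

println("fragments per copy: ", fragments_per_copy)
println("total fragments:    ", total_fragments)
println("mean length (bp):   ", mean_length)
```

Under that assumption the totals agree with the 575000 molecules and the average size of 40345 reported for the fragmented pool.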


Now we'll estimate the number of fragments we need to sample from the pool to
achieve 30x coverage.

In [5]:
genome_size = 4639675
expected_coverage = 30
readlength = 40000

N = needed_sample_size(expected_coverage, genome_size, readlength)

sampledpool = subsample(cutpool, N)

Pool of 3479 DNA molecules:
 Maximum molecule size: 319879
 Average molecule size: 39558
 Minimum molecule size: 19
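
The sample size follows from the usual coverage relationship: coverage is roughly the number of reads times the read length, divided by the genome size. The helper below, `estimate_sample_size`, is a hypothetical name used only for illustration. It is not Pseudoseq's `needed_sample_size`, although integer division happens to reproduce the 3479 molecules shown above.

```julia
# Hypothetical helper: NOT Pseudoseq's `needed_sample_size`, just the arithmetic.
# coverage ≈ (number of reads × read length) / genome size,
# so number of reads ≈ coverage × genome size / read length.
estimate_sample_size(cov::Integer, gsize::Integer, rlen::Integer) = div(cov * gsize, rlen)

n_est = estimate_sample_size(30, 4_639_675, 40_000)
println("estimated sample size: ", n_est)

# Coverage implied by the sampled pool summary (3479 molecules, mean 39558bp).
achieved = 3479 * 39558 / 4_639_675
println("implied coverage: ", round(achieved, digits = 2))
```

The implied coverage lands slightly below 30x because the sampled fragments average a little under 40,000bp.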


By calling the `unpaired_reads` function with `nothing` as the read length,
we make it generate reads from the entire length of each molecule in
the pool. We do this to emulate what Nanopore sequencing is supposed to do:
it takes an entire DNA fragment and feeds it through an electrically charged
pore, producing a read for the entire fragment.

In [6]:
se_reads = unpaired_reads(sampledpool, nothing)



Long read sequencers have much higher error rates than short read sequencers,
so we use an error rate of 0.1.

In [7]:
se_w_errs = mark_errors(se_reads, 0.1)
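
To get a feel for what a per-base error rate of 0.1 means, here is a small standalone sketch. It assumes each base is flagged independently with probability equal to the rate; `flag_errors` is a made-up name, and Pseudoseq's `mark_errors` may select error positions differently.

```julia
using Random

# Illustration only: not Pseudoseq's `mark_errors`.
# Every base of a read is flagged as an error independently with probability `rate`.
function flag_errors(rng::AbstractRNG, read_length::Integer, rate::Real)
    return [i for i in 1:read_length if rand(rng) < rate]
end

rng = MersenneTwister(42)
positions = flag_errors(rng, 40_000, 0.1)
println("flagged ", length(positions), " of 40000 bases")
```

For a 40,000bp read you would expect roughly 4,000 flagged positions, with some random variation around that number.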



Finally, produce the output FASTQ file.

In [8]:
generate("longreads.fastq", se_w_errs)

┌ Info: - ✔ Wrote 3479 single end reads to FASTQ file
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/Reads.jl:182
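
For reference, a FASTQ record is four lines: a header starting with `@`, the sequence, a `+` separator line, and one quality character per base. The sketch below builds such a record by hand, assuming the common Phred+33 quality encoding. `fastq_record` is a hypothetical helper, not part of Pseudoseq, and says nothing about how `generate` writes its records.

```julia
# Illustration of the four-line FASTQ record layout; not Pseudoseq's writer.
function fastq_record(name::AbstractString, seq::AbstractString, quals::Vector{Int})
    length(seq) == length(quals) ||
        throw(ArgumentError("one quality score per base is required"))
    qualstr = join(Char(q + 33) for q in quals)   # Phred+33 encoding
    return string("@", name, "\n", seq, "\n+\n", qualstr, "\n")
end

print(fastq_record("read1", "ACGT", [30, 30, 20, 10]))
```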


## Constructing a pipeline of `Processors`

As a convenience, some users may prefer to use pipelines of `Processors`.
These behave like curried versions of the `Molecules` transformation methods.
First let's define our starting `Molecules` pool:

In [9]:
pool = Molecules("ecoli-ref.fasta", 5000)

Pool of 5000 DNA molecules:
 All molecules are of the same size: 4639675


To make a Processor, use a `Molecules` transformation method, but do not
provide a `Molecules` value as a first argument. So let's make Processors for
each step of our single end sequencing pipeline.

In [10]:
cutter = fragment(40000)
sampler = subsample(N) # Remember how we computed N previously.
mkreads = unpaired_reads(nothing)
adderr = mark_errors(0.1)

Pseudoseq.Sequencing.ErrorMaker(0.1)
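
The printed `ErrorMaker(0.1)` suggests that a Processor is an object that stores its parameters and waits for a pool. The toy sketch below shows one way such a curried method can be written in Julia. The names `Clipper` and `clip` are invented for illustration and are not Pseudoseq types or functions.

```julia
# Toy illustration of the pattern, using made-up names (`Clipper`, `clip`).
struct Clipper
    maxlen::Int
end

# Two-argument method: does the work immediately.
clip(lengths::Vector{Int}, maxlen::Int) = min.(lengths, maxlen)

# One-argument method: stores the parameter and returns a callable object.
clip(maxlen::Int) = Clipper(maxlen)

# Calling the stored object applies the two-argument method.
(c::Clipper)(lengths::Vector{Int}) = clip(lengths, c.maxlen)

clipper = clip(100)
println(clipper([50, 150, 250]))      # same result as clip([50, 150, 250], 100)
println([50, 150, 250] |> clipper)    # callable objects also work with |>
```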

Next we can construct the pipeline using standard Julia function pipelining syntax:

In [11]:
pool |> cutter |> sampler |> mkreads |> adderr |> generate("se-reads.fastq")

┌ Info: - ✔ Wrote 3479 single end reads to FASTQ file
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/Reads.jl:182


You can also compose the processors together into one whole function.
Typing `\circ` in the Julia REPL and then hitting tab gives you the circular
composition symbol `∘`. Note how the pipelining above progresses from left to right,
whereas composition is written from right to left.

In [12]:
my_protocol = adderr ∘ mkreads ∘ sampler ∘ cutter

pool |> my_protocol |> generate("se-reads.fastq")

┌ Info: - ✔ Wrote 3479 single end reads to FASTQ file
└ @ Pseudoseq.Sequencing /home/runner/.julia/packages/Pseudoseq/IOh54/src/sequencing/Reads.jl:182
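
The difference in ordering between `|>` and `∘` is easy to see with two toy functions standing in for processors. This is plain Julia and involves no Pseudoseq calls.

```julia
# Two toy steps standing in for processors.
add_one(x) = x + 1
double(x) = 2x

piped = 3 |> add_one |> double        # add_one runs first, then double
composed = (double ∘ add_one)(3)      # same order of application, written right to left
reversed = (add_one ∘ double)(3)      # swapping the order changes the result

println((piped, composed, reversed))
```

Here `piped` and `composed` agree, while `reversed` differs, which is why `my_protocol` lists the processors in the opposite order to the pipeline.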


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*