# Example: paired-end sequencing

This example was generated from the source
file [`pe-example.jl`](https://github.com/bioinfologics/Pseudoseq.jl/blob/master/examples/sequencing/pe-example.jl).
You are seeing the
Jupyter notebook version. The corresponding online documentation page can
be found here: [`pe-example.html`](https://bioinfologics.github.io/Pseudoseq.jl/latest/examples/sequencing/pe-example),
and the script can be found here: [`pe-example.jl`](./pe-example.jl)

For the simulation we are going to:

1. Create a pool of 5000 copies of a reference genome.
2. Fragment the DNA molecules in the pool, to an average length of 700bp.
3. Subsample the molecules in the pool to achieve approximately 50x coverage.
4. Create a set of 250bp paired-end reads.
5. Apply errors to the paired-end reads at a rate of 0.001 (0.1%).
6. Generate an output FASTQ file.

In [1]:
using Pseudoseq.Sequencing

Starting with a FASTA formatted file containing the genome we want to sequence,
we create a pool with a copy of the genome.

In [2]:
m = Molecules("ecoli-ref.fasta")

Pool of 1 DNA molecules:
 All molecules are of the same size: 4639675


First we want 5000 full copies of the genome, so we make an amplifier.

In [3]:
amp = amplify(5000)

Pseudoseq.Sequencing.Amplifier{Pseudoseq.Sequencing.var"#5#6"}(Pseudoseq.Sequencing.var"#5#6"(), 5000)

Next we want a fragmenter to make a pool of shorter DNA molecules, with an average
length of 700bp.

In [4]:
frag = fragment(700bp)

Pseudoseq.Sequencing.Fragmenter(700)

We also want a filter so that no extremely short or long molecules make their way through.
Here we keep only molecules between 450bp and 900bp in length.

In [5]:
size_filter = select(x -> 900 >= length(x) >= 450)

Pseudoseq.Sequencing.Selector{Main.##254.var"#1#2"}(Main.##254.var"#1#2"())
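
The predicate passed to `select` is an ordinary Julia function built on a chained comparison. As a minimal sketch, assuming plain strings stand in for molecules (the names `in_size_range` and `toy_molecules` are hypothetical and not part of Pseudoseq), the same predicate can be tried out with Base's `filter`:

```julia
# Same chained comparison as the notebook's predicate: keep 450 <= length <= 900.
in_size_range = x -> 900 >= length(x) >= 450

# Toy "molecules" of known length; plain strings, not Pseudoseq molecules.
toy_molecules = ["A"^449, "A"^450, "A"^700, "A"^900, "A"^901]

# Lengths of the molecules that pass the filter; both bounds are inclusive.
kept_lengths = length.(filter(in_size_range, toy_molecules))
println(kept_lengths)
```

Only the 449bp and 901bp strings are dropped, which shows that both ends of the range are inclusive.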

Next we create a subsampler that randomly samples molecules to give us
the desired expected coverage of 50x. It is given the read lengths we plan to use: two reads of 250bp per molecule.

In [6]:
ssmpl = subsample(50X, 2 * 250bp)

Pseudoseq.Sequencing.CovSubSampler{Tuple{Pseudoseq.Sequencing.SequenceLength,Pseudoseq.Sequencing.SequenceLength}}(50X sequencing coverage
, (250 base pairs
, 250 base pairs
))
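
To see roughly how many molecules the subsampler needs to keep, here is a back-of-the-envelope sketch in plain Julia. It assumes the usual relation coverage = (number of reads × read length) / genome length, and takes the genome length from the pool summary shown earlier. The variable names are illustrative, and this is not necessarily how Pseudoseq computes the figure internally.

```julia
# Values taken from this example.
genome_length = 4_639_675   # bp, from the pool summary printed earlier
coverage      = 50          # target expected coverage (50X)
read_length   = 250         # bp per read, two reads per sampled molecule

# Each sampled molecule yields one read pair covering 2 * read_length bases,
# so the number of pairs is coverage * genome_length / (2 * read_length).
n_pairs = div(coverage * genome_length, 2 * read_length)
n_reads = 2 * n_pairs

println("pairs: ", n_pairs, ", reads: ", n_reads)
```

Under these assumptions the estimate is 463967 molecules, or 927934 reads, which agrees with the read count reported when the pipeline is run at the end of this example.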

We then want a read-maker that will give us paired-end reads, each
250bp in length.

In [7]:
readmaker = makereads(2 * 250bp)

Pseudoseq.Sequencing.PairedReads(250, 250)

Once we have reads, we mark the positions in the reads that are incorrectly
detected by the sequencer: the errors.
We construct a `FixedProbSubstitutions` function with a per-base error
probability of 0.001 and pass it to the `make_substitutions` method.
Because every base has the same error probability, errors fall uniformly at random over each read.

In [8]:
errmaker = make_substitutions(FixedProbSubstitutions(0.001))

Pseudoseq.Sequencing.SubstitutionMaker{FixedProbSubstitutions}(FixedProbSubstitutions(0.001))
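
To illustrate what a fixed per-base error probability means, here is a minimal standalone sketch in plain Julia. It is not Pseudoseq's implementation: the function `substitute_bases` is hypothetical, it works on plain strings over the alphabet A, C, G, T, and it assumes that an error always replaces a base with one of the three other bases.

```julia
using Random

# Hypothetical helper: visit every base and, with probability p,
# replace it with a different base chosen uniformly at random.
function substitute_bases(seq::AbstractString, p::Real, rng::AbstractRNG)
    bases = ['A', 'C', 'G', 'T']
    out = collect(seq)
    for i in eachindex(out)
        if rand(rng) < p
            original = out[i]
            out[i] = rand(rng, filter(b -> b != original, bases))
        end
    end
    return join(out)
end

rng = MersenneTwister(1234)     # seeded so that a run is repeatable
read = "ACGT"^25                # a toy 100bp read
noisy = substitute_bases(read, 0.001, rng)

# With p = 0.001, a 250bp read carries 250 * 0.001 = 0.25 errors on average.
expected_errors_per_read = 250 * 0.001
```

Because each position is tested independently with the same probability, no part of the read is more error-prone than another, which is the behaviour described above.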

Now we can push our molecules through a pipeline of these processors, and out
to a FASTQ file:

In [9]:
m |> amp |> frag |> size_filter |> ssmpl |> readmaker |> errmaker |> generate("pe-reads.fastq")

┌ Info: - ✔ Wrote 927934 paired end reads to FASTQ file
└ @ Pseudoseq.Sequencing /home/runner/work/Pseudoseq.jl/Pseudoseq.jl/src/sequencing/Reads.jl:273




You can also compose the processors together into a single function.
Typing `\circ` in the Julia REPL and then pressing tab gives you the
composition operator `∘`. Note that the pipeline above runs from left to right,
whereas composition is written from right to left.

In [10]:
my_protocol = errmaker ∘ readmaker ∘ ssmpl ∘ size_filter ∘ frag ∘ amp

m |> my_protocol |> generate("pe-reads.fastq")

┌ Info: - ✔ Wrote 927934 paired end reads to FASTQ file
└ @ Pseudoseq.Sequencing /home/runner/work/Pseudoseq.jl/Pseudoseq.jl/src/sequencing/Reads.jl:273
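
The difference in ordering between piping and composition can be checked with two ordinary functions. This sketch uses only Base Julia; `double` and `increment` are illustrative stand-ins for the processors, not part of Pseudoseq.

```julia
double    = x -> 2x
increment = x -> x + 1

# Piping applies functions left to right: double first, then increment.
piped = 3 |> double |> increment

# Composition is written right to left: (f ∘ g)(x) == f(g(x)),
# so the function applied first appears last.
composed = (increment ∘ double)(3)

# Swapping the order of composition changes the result.
swapped = (double ∘ increment)(3)

println((piped, composed, swapped))
```

Here `piped` and `composed` are both 7, while `swapped` is 8, which is why `amp`, the first processor in the pipeline, appears last in `my_protocol`.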




---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*