# Example: long, single end reads

This example was generated from the source
file [`se-example.jl`](https://github.com/bioinfologics/Pseudoseq.jl/blob/master/examples/se-example.jl).
You are seeing the
Jupyter notebook version. The corresponding online documentation page can
be found here: [`se-example.html`](https://bioinfologics.github.io/Pseudoseq.jl/latest/examples/se-example),
and the script can be found here: [`se-example.jl`](./se-example.jl).

Let's see how you might simulate something like an Oxford Nanopore sequencing
experiment.

For the simulation we are going to:

1. Create a pool of 5000 copies of a reference genome.
2. Fragment the DNA molecules in the pool, to an average length of 40,000bp.
3. Subsample the molecules in the pool to achieve approximately 30x coverage.
4. Create a set of single-end reads, each spanning the entire length of its molecule.
5. Apply errors to the reads at a rate of 0.10 (on average, 1 error every 10bp).
6. Generate an output FASTQ file.

In [1]:
using Pseudoseq

Let's start with a pool of 5000 copies of a genome contained in a FASTA file:

In [2]:
pool = makepool("ecoli-ref.fasta", 5000)

┌ Info: Reading FASTA file...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:38
┌ Info: Creating pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:16
┌ Info: Populating pool with 5000 copies of the genome...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:21
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:25


Pool of 5000 molecules:
 All molecules are of the same size: 4639675


Cut the pool of DNA into fragments with an average length of 40,000bp:

In [3]:
cutpool = fragment(pool, 40000)

┌ Info: Fragmenting molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:126
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:128


Pool of 575000 molecules:
 Maximum molecule size: 487593
 Average molecule size: 40345
 Minimum molecule size: 1
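
The figures in this output can be checked by hand. The following is a minimal sketch, not Pseudoseq code: it assumes each genome copy is cut into `div(genome_size, target_length)` pieces, an assumption inferred from the printed output rather than taken from the library source. The variable names are illustrative only.

```julia
# Values taken from the notebook above.
genome_size = 4639675     # size of each molecule in the starting pool
copies = 5000             # number of genome copies in the pool
target_length = 40000     # requested average fragment length

# Assumption: each copy yields floor(genome_size / target_length) fragments.
fragments_per_copy = div(genome_size, target_length)

# Total fragments across the pool, and the mean length that count implies.
total_fragments = copies * fragments_per_copy
average_length = genome_size / fragments_per_copy

println("fragments per copy: ", fragments_per_copy)
println("total fragments:    ", total_fragments)
println("average length:     ", average_length)
```

Under that assumption, the total and the average agree with the 575000 molecules and the average size of 40345 reported above, which is why the average is slightly above the requested 40,000bp.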


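Now we'll estimate the number of fragments we need to sample from the pool to
achieve 30x coverage. Because each read will span a whole fragment, we use the
average fragment length of 40,000bp as the read length.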

In [4]:
genome_size = 4639675
expected_coverage = 30
readlength = 40000

N = needed_sample_size(expected_coverage, genome_size, readlength)

sampledpool = subsample(cutpool, N)

┌ Info: Subsampling molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:147
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:149


Pool of 3479 molecules:
 Maximum molecule size: 262728
 Average molecule size: 40046
 Minimum molecule size: 1
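
The estimate follows from the usual coverage relation: coverage × genome size ÷ read length. Below is a minimal sketch of that arithmetic. `estimate_sample_size` is a hypothetical helper written for illustration; it is not the implementation of `needed_sample_size`, and the use of integer division is an assumption chosen because it agrees with the pool size printed above.

```julia
# Hypothetical helper, not part of Pseudoseq: number of reads of a given
# length needed to cover a genome to the requested depth.
# Assumption: the result is truncated to a whole number of molecules.
estimate_sample_size(coverage, genome_size, read_length) =
    div(coverage * genome_size, read_length)

n = estimate_sample_size(30, 4639675, 40000)
println("molecules to sample: ", n)

# Coverage actually achieved, using the average molecule size (40046)
# reported for the subsampled pool.
achieved = n * 40046 / 4639675
println("achieved coverage:   ", achieved)
```

The sampled pool's average molecule size is slightly above 40,000bp, so the achieved coverage comes out just over 30x.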


When the `make_reads` function is called without a read length,
it generates reads that span the entire length of each molecule in
the pool. We do this to emulate what Nanopore sequencing is meant to do:
it takes an entire DNA fragment and feeds it through an electrically charged
pore, producing a read for the entire fragment.

In [5]:
se_reads = make_reads(SingleEnd, sampledpool)



Long-read sequencers have much higher error rates than short-read sequencers,
so we use an error rate of 0.1.

In [6]:
se_w_errs = mark_errors(se_reads, 0.1)

┌ Info: Marking 13932018 bases in the set of reads as errors...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:50
┌ Info: Done...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:52
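
The number of marked bases in the log can be cross-checked against the subsampled pool. The sketch below is plain arithmetic, not Pseudoseq code, and assumes the marked count is simply the error rate multiplied by the total number of bases in the reads; the variable names are illustrative.

```julia
# Values taken from the log and from the subsampled pool summary above.
marked_errors = 13932018   # bases marked as errors
error_rate = 0.1           # rate passed to mark_errors
molecules = 3479           # molecules (and therefore reads) in the pool

# Assumption: marked_errors ≈ error_rate * total bases in the read set.
implied_total_bases = marked_errors / error_rate
implied_mean_length = implied_total_bases / molecules

println("implied total bases: ", implied_total_bases)
println("implied mean length: ", implied_mean_length)
```

The implied mean read length rounds to the average molecule size of 40046 reported for the subsampled pool, consistent with each read covering its whole molecule.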




Finally, produce the output FASTQ file.

In [7]:
generate("longreads.fastq", se_w_errs)

┌ Info: Writing single end reads to FASTQ file...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:293
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:312


*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*