# Example: tagged paired-end reads

This example was generated from the source
file [`tg-example.jl`](https://github.com/bioinfologics/Pseudoseq.jl/blob/master/examples/tg-example.jl).
You are seeing the
Jupyter notebook version. The corresponding online documentation page can
be found here: [`tg-example.html`](https://bioinfologics.github.io/dev/examples/tg-example.html),
and the script can be found here: [`tg-example.jl`](./tg-example.jl).

Let's see how we might simulate something like a 10x sequencing experiment.

For this simulation script we will:

1. Create a pool of 5000 copies of a reference genome.
2. Fragment the DNA molecules in the pool, to an average length of 40,000bp.
3. Tag the long molecules in the pool randomly with a set of 1,000,000 tags.
4. Fragment the molecules in the pool to an average length of 700bp.
5. Subsample the molecules in the pool to achieve approximately 50x coverage.
6. Create a set of 250bp paired-end reads.
7. Apply errors to the paired-end reads at a rate of 0.001 (0.1%).
8. Generate an output FASTQ file.

In [1]:
using Pseudoseq

dnapool = makepool("ecoli-ref.fasta", 5000)
cutpool = fragment(dnapool, 40000)

┌ Info: Reading FASTA file...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:38
┌ Info: Creating pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:16
┌ Info: Populating pool with 5000 copies of the genome...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:21
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:25
┌ Info: Fragmenting molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:126
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:128


Pool of 575000 molecules:
 Maximum molecule size: 553888
 Average molecule size: 40345
 Minimum molecule size: 1


OK, now we will tag these large fragments randomly.
Once you tag a fragment in a pool, any other fragments that
are derived from that tagged fragment will inherit the same tag.
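
To make the inheritance rule concrete, here is a minimal sketch of the idea. The `Molecule` type and the `split_after` function are hypothetical names invented for this illustration; they are not Pseudoseq's internal types or API.

```julia
# Hypothetical type for illustration only; not Pseudoseq's internals.
struct Molecule
    start::Int   # first reference position covered by the molecule
    stop::Int    # last reference position covered (inclusive)
    tag::Int     # 0 means "untagged"
end

# Break a molecule after position `pos`. Both pieces keep the parent's tag,
# which is what "inheriting the tag" means in the text above.
function split_after(m::Molecule, pos::Int)
    m.start <= pos < m.stop || throw(ArgumentError("pos must fall inside the molecule"))
    return Molecule(m.start, pos, m.tag), Molecule(pos + 1, m.stop, m.tag)
end

long_molecule = Molecule(1, 40_000, 7)           # a long fragment carrying tag 7
left, right = split_after(long_molecule, 700)    # two shorter pieces

println(left.tag == right.tag == long_molecule.tag)
```

Because the tag travels with every derived piece, short reads made later can be grouped by the long molecule they came from.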

In [2]:
taggedpool = tag(cutpool, 1000000)

┌ Info: Attaching 1000000 tags randomly to each molecule in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:176
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:178


Pool of 575000 molecules:
 Maximum molecule size: 553888
 Average molecule size: 40345
 Minimum molecule size: 1
 Number of distinct tags: 437474


Here I used a set of 1,000,000 distinct tags.
Which fragment gets which tag is random.
The size of the set of tags and the number of fragments in your pool
determine how likely it is that any two fragments get the same tag.
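
As a rough check on that relationship, the sketch below estimates how many distinct tags to expect. It assumes each molecule draws one tag uniformly at random, with replacement, from the full set of tags; that is an assumption about how tagging behaves, not something confirmed from Pseudoseq's source. The function name `expected_distinct_tags` is made up for this example.

```julia
# Expected number of distinct tags when `n` molecules each draw one tag
# uniformly at random, with replacement, from `t` possible tags.
# (Assumed model; hypothetical helper, not part of Pseudoseq.)
expected_distinct_tags(n, t) = t * (1 - (1 - 1 / t)^n)

n_fragments = 575_000    # molecules in the pool after the first fragmentation
n_tags = 1_000_000       # number of tags passed to `tag`

expected = expected_distinct_tags(n_fragments, n_tags)
println("Expected distinct tags: ", round(Int, expected))
```

Under this model the expectation comes out at roughly 437,000, which is in line with the 437,474 distinct tags reported above. Fewer distinct tags than fragments means that some fragments share a tag.
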
Now we'll fragment the pool again, this time to an average length of 700bp.

In [3]:
taggedcutpool = fragment(taggedpool, 700)

┌ Info: Fragmenting molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:126
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:128


Pool of 32863848 molecules:
 Maximum molecule size: 11176
 Average molecule size: 705
 Minimum molecule size: 1
 Number of distinct tags: 437474


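Next, subsample the pool of tagged molecules to reach roughly 50x coverage.
Each sampled molecule yields two reads, so we need half as many molecules as reads.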

In [4]:
genome_size = 4639675
expected_coverage = 50
read_length = 250

N = needed_sample_size(expected_coverage, genome_size, read_length)
N = div(N, 2) # Divide by 2 as we're doing paired end sequencing.

sampledpool = subsample(taggedcutpool, N)

┌ Info: Subsampling molecules in pool...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:147
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/pool.jl:149


Pool of 463967 molecules:
 Maximum molecule size: 10443
 Average molecule size: 705
 Minimum molecule size: 1
 Number of distinct tags: 227105
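
The sample size above can be checked by hand. The sketch below assumes that `needed_sample_size` computes coverage × genome size / read length; the helper `reads_for_coverage` is a hypothetical stand-in written for this example, not the Pseudoseq function itself.

```julia
# Hypothetical stand-in: number of reads of length `read_length` needed to
# cover a genome of `genome_size` bases `coverage` times on average.
reads_for_coverage(coverage, genome_size, read_length) =
    div(coverage * genome_size, read_length)

n_reads = reads_for_coverage(50, 4_639_675, 250)
n_pairs = div(n_reads, 2)   # paired-end: each sampled molecule yields two reads

println("reads: ", n_reads, ", molecules to sample: ", n_pairs)
```

This gives 927,935 reads and 463,967 molecules, which agrees with the size of the subsampled pool reported above.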


Now let's make some 250bp tagged paired reads and generate some erroneous
positions.

In [5]:
tagged_reads = make_reads(TaggedPairs, sampledpool, 250)
tagged_w_errs = mark_errors(tagged_reads, 0.001)

┌ Info: Marking 163530 bases in the set of reads as errors...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:50
┌ Info: Done...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:52




Output to FASTQ:

In [6]:
generate("tagged_reads.fastq", tagged_w_errs)

┌ Info: Writing tagged paired end reads to FASTQ file...
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:365
┌ Info: Done
└ @ Pseudoseq /home/travis/build/bioinfologics/Pseudoseq.jl/src/reads.jl:404


*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*