## UMI example

This is an example taken from the pRESTO documentation, linked [here](http://presto.readthedocs.io/en/latest/workflows/Stern2014_Workflow.html). The abstract of the paper is below.

In [None]:
!esearch -db pubmed -query 25100741 | efetch -format abstract

The read layout is shown below.

![](http://presto.readthedocs.io/en/latest/_images/Stern2014_ReadConfiguration.svg)

The workflow for processing the data is shown below.

![](http://presto.readthedocs.io/en/latest/_images/Stern2014_Flowchart.svg)

## Obtaining the data

Use `fastq-dump` to obtain the sequence data from accession SRR1383456, by running the cell below.

In [None]:
!fastq-dump --split-files --readids SRR1383456

## Filtering by quality

Use `FilterSeq.py quality` with a minimum mean quality score of 20 to filter the paired-end reads. The usage information is given below.

In [None]:
!FilterSeq.py quality -h

Run the cell below to filter read 1.

In [None]:
!FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname SRR1383456_R1 --log SRR1383456.quality.R1.log

**Now repeat the above for read 2.**

In [None]:
!FilterSeq.py quality
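
To make the filtering criterion concrete, here is a minimal sketch of a mean-quality filter in plain Python. It assumes Phred+33 encoded quality strings and an inclusive threshold. The function names `mean_phred` and `passes_quality` are made up for illustration and are not part of pRESTO, whose implementation may differ in detail.

```python
def mean_phred(quality_string, offset=33):
    """Return the mean Phred score of a FASTQ quality string.

    Assumes Phred+33 encoding; an empty string is given a mean of 0.0.
    """
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_quality(quality_string, min_quality=20):
    """True if the mean quality is at least min_quality (inclusive by assumption)."""
    return mean_phred(quality_string) >= min_quality


# Toy quality strings: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(mean_phred("IIII"))      # 40.0
print(passes_quality("II##"))  # mean is 21, so True
print(passes_quality("I###"))  # mean is 11.5, so False
```

Note that a read with a few very poor bases can still pass if the remaining bases are good, because only the mean is tested.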

## Masking primers

Now cut the primers (`Stern2014_CPrimers.fasta`) from read 1. We know where the primers start (position 15 for read 1), so we can use `MaskPrimers.py score` rather than `MaskPrimers.py align`. Read 1 is barcoded, **so you need to specify the `--barcode` option** to extract the barcode region.

In [None]:
!MaskPrimers.py score -h

Run the cell below to mask the reverse primers and extract the barcode.

In [None]:
!MaskPrimers.py score -s SRR1383456_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname SRR1383456_R1 --log SRR1383456.REV.log

**Using the above as a guide, complete the cell below to mask the forward primer from read 2 (`Stern2014_VPrimers.fasta`), starting from position 0.**

In [None]:
!MaskPrimers.py score
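
The idea behind fixed-position primer matching can be sketched in a few lines of Python. This is a simplified illustration, not pRESTO's code: it counts plain mismatches without IUPAC ambiguity handling, the `max_error` value is illustrative, and the primer names and sequences are invented. It mimics cut mode by dropping the primer and everything before it, and keeps the preceding bases as the barcode.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mask_primer(seq, primers, start, max_error=0.2):
    """Match each primer at a fixed start position and cut the best hit.

    Returns None if no primer matches within max_error (fraction of mismatches).
    """
    best = None
    for name, primer in primers.items():
        window = seq[start:start + len(primer)]
        if len(window) < len(primer):
            continue  # read too short to contain this primer
        error = mismatches(window, primer) / len(primer)
        if best is None or error < best[1]:
            best = (name, error, len(primer))
    if best is None or best[1] > max_error:
        return None
    name, _, length = best
    return {
        "PRIMER": name,
        "BARCODE": seq[:start],          # bases preceding the primer
        "SEQ": seq[start + length:],     # primer and upstream bases removed
    }


# Hypothetical primers and a toy read with a 5-base barcode (the real data uses 15).
toy_primers = {"P1": "ACGT", "P2": "TTTT"}
print(mask_primer("NNNNNACGTGGCC", toy_primers, start=5))
```

Because the start position is known, no alignment is needed, which is why the scoring approach is cheaper than `MaskPrimers.py align`.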

## Copying over the barcodes

The barcode region is only annotated in read 1, so we need to copy this over to read 2, which we do with `PairSeq.py`.

In [None]:
!PairSeq.py -h

Run the cell below to copy over the annotation.

In [None]:
!PairSeq.py -1 SRR1383456_R1_primers-pass.fastq -2 SRR1383456_R2_primers-pass.fastq \
    --1f BARCODE --coord sra

## Building a consensus based on the barcodes

We now build a consensus sequence for each group of sequences that share a barcode. We do this separately for the paired sequences, `SRR1383456_R1_primers-pass_pair-pass.fastq` and `SRR1383456_R2_primers-pass_pair-pass.fastq`, using `BuildConsensus.py`.

In [None]:
!BuildConsensus.py -h

Run the cell below to build a consensus from read 1.

In [None]:
!BuildConsensus.py -s SRR1383456_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname SRR1383456_R1 --log SRR1383456.consensus.R1.log

**Now complete the cell below to build a consensus for read 2.**

In [None]:
!BuildConsensus.py
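
As a rough illustration of what building a consensus per barcode means, the sketch below groups toy reads by barcode and takes a majority vote at each position. It is deliberately simplified and should not be read as the `BuildConsensus.py` algorithm: it ignores quality scores, does not apply the `--maxerror`, `--maxgap` or `--prcons` filters, and truncates each group to its shortest read. The function names and the toy data are invented.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Majority-vote consensus over a list of sequences.

    Ties go to the base seen first; reads are truncated to the shortest length.
    """
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
    )


def consensus_by_barcode(reads):
    """Group (barcode, sequence) pairs and return {barcode: (consensus, count)}."""
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)
    return {bc: (consensus(seqs), len(seqs)) for bc, seqs in groups.items()}


# Toy reads: three share barcode AAA, one of which carries a single error.
toy_reads = [("AAA", "ACGT"), ("AAA", "ACGT"), ("AAA", "ACTT"), ("CCC", "GGGG")]
print(consensus_by_barcode(toy_reads))
```

The count stored next to each consensus plays the same role as the `CONSCOUNT` annotation that appears in the headers later in this workflow.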

Now run the following cell to pair the sequences that passed the consensus building.

In [None]:
!PairSeq.py -1 SRR1383456_R1_consensus-pass.fastq -2 SRR1383456_R2_consensus-pass.fastq \
    --coord presto

## Assembling mate pairs

In [None]:
!AssemblePairs.py align -h

In [None]:
!AssemblePairs.py align -1 SRR1383456_R2_consensus-pass_pair-pass.fastq \
    -2 SRR1383456_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname SRR1383456 --log SRR1383456.assemble.log

## Deduplication and filtering

Let's take a look at the resulting sequences.

In [None]:
!head SRR1383456_assemble-pass.fastq

As you can see, there are two entries for the number of sequences used for each consensus. The following command replaces this pair with the minimum of the two.

In [None]:
!ParseHeaders.py collapse -s SRR1383456_assemble-pass.fastq -f CONSCOUNT --act min
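
The effect of collapsing an annotation to its minimum can be mimicked in Python. This sketch assumes the header layout `ID|KEY=value|KEY=value`, with multiple values for one key separated by commas. The helper `collapse_min` and the example header are hypothetical, and the real tool handles more cases than this.

```python
def collapse_min(header, field):
    """Replace a comma-separated integer annotation with its minimum value.

    Assumes a header of the form ID|KEY=value|KEY=value.
    """
    seq_id, *annotations = header.split("|")
    collapsed = []
    for annotation in annotations:
        key, value = annotation.split("=", 1)
        if key == field:
            value = str(min(int(v) for v in value.split(",")))
        collapsed.append(f"{key}={value}")
    return "|".join([seq_id] + collapsed)


# Made-up header with two consensus counts, one from each mate.
print(collapse_min("read1|CONSCOUNT=5,3|PRCONS=X", "CONSCOUNT"))
# read1|CONSCOUNT=3|PRCONS=X
```

Taking the minimum is the conservative choice: the assembled sequence is only as well supported as its less well supported mate.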

The following command removes duplicate sequences.

In [None]:
!CollapseSeq.py -s SRR1383456_assemble-pass_reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname SRR1383456
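
Conceptually, this step groups identical sequences that share a `PRCONS` value and sums their `CONSCOUNT` values. The sketch below shows that idea on toy data. It assumes exact string identity, so it does not reproduce the handling of missing characters controlled by `-n` and `--inner`. The function name is invented, and `DUPCOUNT` is used here simply as a label for the number of merged records.

```python
def collapse_duplicates(records):
    """Merge records with the same (PRCONS, sequence) pair.

    records: iterable of (sequence, prcons, conscount) tuples.
    Returns {(prcons, sequence): {"CONSCOUNT": summed, "DUPCOUNT": merged}}.
    """
    unique = {}
    for seq, prcons, conscount in records:
        entry = unique.setdefault((prcons, seq), {"CONSCOUNT": 0, "DUPCOUNT": 0})
        entry["CONSCOUNT"] += conscount  # corresponds to --cf CONSCOUNT --act sum
        entry["DUPCOUNT"] += 1           # number of records merged into this one
    return unique


# Toy records: the first two are duplicates; the third differs only in PRCONS.
toy_records = [("ACGT", "A", 3), ("ACGT", "A", 2), ("ACGT", "B", 1), ("TTTT", "A", 4)]
for key, counts in collapse_duplicates(toy_records).items():
    print(key, counts)
```

Including `PRCONS` in the key means that identical sequences with different primer annotations are kept as separate records, which is the purpose of `--uf PRCONS`.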

We will also convert the sequences to FASTA format.

In [None]:
from Bio import SeqIO
SeqIO.convert("SRR1383456_collapse-unique.fastq","fastq","SRR1383456_collapse-unique.fasta","fasta")

In [None]:
!head SRR1383456_collapse-unique.fasta