wf-isoforms

This repository is now deprecated; please see https://github.com/epi2me-labs/wf-transcriptomes

This repository contains a nextflow workflow for the assembly and annotation of transcripts from Oxford Nanopore cDNA or direct RNA reads. It has been adapted from two existing Snakemake pipelines.

Introduction

This workflow identifies RNA isoforms using either cDNA or direct RNA (dRNA) Oxford Nanopore reads.

Preprocessing

cDNA reads are first preprocessed by pychopper, which identifies full-length reads and performs trimming and orientation correction. This step is omitted for direct RNA reads.

Reference-aided approach

Full-length reads are mapped to a supplied reference genome using minimap2.
Transcripts are assembled by stringtie in long-read mode (with or without a guide reference annotation) to generate the GFF annotation.
The annotation generated by the pipeline is compared to the reference annotation using gffcompare.
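
The three steps above can be sketched as plain shell commands. This is a minimal illustration under stated assumptions, not the workflow's own implementation: the file names, the `DRY_RUN` switch and the `run` helper are hypothetical, and the workflow itself may pass different options to each tool. By default the function only prints the commands it would run.

```shell
# Sketch of the reference-aided steps; all file names are placeholders.

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

reference_assembly() {
    local ref="$1" reads="$2" annotation="$3"
    # 1. Spliced alignment of full-length reads to the reference genome.
    run minimap2 -ax splice -o aligned.sam "$ref" "$reads"
    run samtools sort -o aligned.bam aligned.sam
    # 2. Assemble transcripts in long-read mode (-L), guided by an annotation (-G).
    run stringtie -L -G "$annotation" -o transcripts.gff aligned.bam
    # 3. Compare the assembled annotation with the reference annotation.
    run gffcompare -r "$annotation" -o compare transcripts.gff
}

# Example call (prints four commands in dry-run mode):
# reference_assembly genome.fasta full_length.fastq annotation.gtf
```

Setting `DRY_RUN=0` would execute the commands instead of printing them, which requires minimap2, samtools, stringtie and gffcompare to be installed.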

de novo-based approach (experimental!)

Sequence clusters are generated using isONclust2.
- If a reference genome is supplied, cluster quality metrics are determined by comparing
  with clusters generated from a minimap2 alignment.
A consensus sequence for each cluster is generated using spoa.
Three rounds of polishing with racon and minimap2 give a final polished CDS for each gene.
Full-length reads are then mapped to these polished CDS.
Transcripts are assembled by stringtie as for the reference-based approach.
Note: This approach is currently not supported with direct RNA reads.
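
The polishing step can be pictured as a loop in which each round maps the reads back to the current consensus and hands the overlaps to racon. The sketch below is an assumption about how such a loop might look, not the workflow's own code: the function name `polish_consensus`, the file names and the `map-ont` preset are all illustrative.

```shell
# Sketch: repeated racon polishing; names and presets are illustrative.
polish_consensus() {
    local reads="$1" current="$2" rounds="${3:-3}" i=1
    while [ "$i" -le "$rounds" ]; do
        # Map reads to the current consensus; minimap2 writes PAF to stdout.
        minimap2 -x map-ont "$current" "$reads" > "overlaps_${i}.paf"
        # racon takes <reads> <overlaps> <target> and writes FASTA to stdout.
        racon "$reads" "overlaps_${i}.paf" "$current" > "polished_${i}.fasta"
        # The polished output becomes the target of the next round.
        current="polished_${i}.fasta"
        i=$((i + 1))
    done
    # Report the name of the final polished file.
    printf '%s\n' "$current"
}

# Example call (requires minimap2 and racon):
# polish_consensus cluster_reads.fastq consensus.fasta 3
```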

Workflow inputs

Directory containing cDNA/direct RNA reads, or a directory containing subdirectories, each with reads from a different sample (in fastq/fastq.gz format).
Reference genome in fasta format (required for reference-based assembly).
Optional reference annotation in GFF2/3 format.

## Quickstart

The workflow uses nextflow to manage compute and software resources, so nextflow must be installed before attempting to run the workflow.

The workflow can currently be run using Docker, Singularity or conda to provide isolation of the required software. Each method is automated out-of-the-box, provided that Docker, Singularity or conda is installed.

It is not required to clone or download the git repository in order to run the workflow. For more information on running EPI2ME Labs workflows, visit our website.

Workflow options

To obtain the workflow, having installed nextflow, users can run:

nextflow run epi2me-labs/wf-isoforms --help

to see the options for the workflow.

Example execution of a workflow for reference-based transcript assembly

This example uses a synthetic SIRV dataset, so minimap2 must be told about the non-canonical splice junctions with --minimap2_opts '-uf --splice-flank=no'.

OUTPUT=~/output
nextflow run wf-isoforms/ --fastq test_data/fastq --ref_genome test_data/SIRV_150601a.fasta --ref_annotation test_data/SIRV_isofroms.gtf \
--minimap2_opts '-uf --splice-flank=no' --out_dir outdir -w workspace_dir -profile conda -resume

# To evaluate the workflow on a larger Drosophila dataset
./evaluation/run_evaluation_dmel.sh outdir

Example execution of a workflow for de novo transcript assembly

OUTPUT=~/output
nextflow run . --fastq test_data/fastq --denovo --ref_genome test_data/SIRV_150601a.fasta  -profile local --out_dir ${OUTPUT} -w ${OUTPUT}/workspace \
--sample sample_id -resume

A full list of options can be seen in nextflow_schema.json. Below are some commonly used ones.

Threshold for including isoforms in the interactive table: transcript_table_cov_thresh = 50
Run the de novo pipeline: denovo = true (default: false)
Run the workflow with direct RNA reads: --direct_rna (skips the pychopper step)
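
As an alternative to passing each option on the command line, nextflow accepts a JSON or YAML parameters file through its `-params-file` option. The sketch below writes such a file using parameter names that appear in this README; the file name `params.json` and the chosen values are only examples, and the final command is printed rather than executed.

```shell
# Write an example parameters file (values are illustrative).
cat > params.json <<'EOF'
{
    "fastq": "test_data/fastq",
    "ref_genome": "test_data/SIRV_150601a.fasta",
    "denovo": false,
    "transcript_table_cov_thresh": 50,
    "minimap2_opts": "-uf --splice-flank=no",
    "out_dir": "outdir"
}
EOF

# Print the command that would use the file.
echo "nextflow run epi2me-labs/wf-isoforms -params-file params.json -profile conda"
```

Keeping the parameters in a file makes it easier to rerun the same configuration or to keep one file per dataset.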

minimap2 and pychopper can take additional options via minimap2_opts and pychopper_opts respectively, for example:

When using the SIRV synthetic test data
- minimap2_opts = '-uf --splice-flank=no'
pychopper needs to know which cDNA synthesis kit was used
- SQK-PCS109: use pychopper_opts = '-k PCS109' (default)
- SQK-PCS110: use pychopper_opts = '-k PCS110'
pychopper can use one of two available backends for identifying primers in the raw reads
- nhmmscan: pychopper_opts = '-m phmm'
- edlib: pychopper_opts = '-m edlib'

Note: edlib is set by default in the config because it is considerably faster. However, it may be less sensitive than nhmmscan.
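
The kit and backend choices above can be turned into an option string with a small helper. The sketch below is hypothetical: the `pychopper_opts_for` function is not part of the workflow, and it assumes that the `-k` and `-m` flags can be combined in a single pychopper_opts value. The nextflow command is printed, not executed.

```shell
# Build a pychopper_opts string from a kit name and a backend name.
pychopper_opts_for() {
    local kit="$1" backend="${2:-edlib}" kit_flag method
    case "$kit" in
        SQK-PCS109) kit_flag="PCS109" ;;
        SQK-PCS110) kit_flag="PCS110" ;;
        *) echo "unsupported kit: $kit" >&2; return 1 ;;
    esac
    case "$backend" in
        nhmmscan) method="phmm" ;;
        edlib)    method="edlib" ;;
        *) echo "unsupported backend: $backend" >&2; return 1 ;;
    esac
    printf '%s\n' "-k ${kit_flag} -m ${method}"
}

# Example: print a command line for SQK-PCS110 reads with the nhmmscan backend.
opts=$(pychopper_opts_for SQK-PCS110 nhmmscan)
echo "nextflow run epi2me-labs/wf-isoforms --fastq test_data/fastq --ref_genome test_data/SIRV_150601a.fasta --pychopper_opts '${opts}'"
```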
