FASTQ2OTU

⚠️ This package is still under active development

FASTQ2OTU was developed as an easy and effective tool for downloading, analyzing, and processing large-scale microbiome rRNA gene data obtained from NCBI's SRA database. The package uses many functions from DADA2 to analyze sequence data. The primary objectives of FASTQ2OTU are to (1) increase the reproducibility of microbiome analysis and (2) encourage the analysis of archived data to obtain new knowledge.

FASTQ2OTU's workflow can be broken down into multiple stages:

Get Sequences - Sequences can be downloaded using fastq-dump or wget (if FTP links are available).
Plot Quality Distribution - Generate a figure that shows the quality distribution of the dataset. This step can be run independently if necessary.
Filter and Trim
Learn Errors and Denoise
Find and Remove Chimeric Sequences
Merge Paired-End Sequences
Assign Taxonomy
Merge OTU Tables - Merges individual OTU tables to create a single table that can be used in downstream analyses.
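
The DADA2 calls behind the filtering, denoising, merging, chimera-removal, and taxonomy stages can be sketched roughly as follows. This is a minimal illustration modeled on the standard DADA2 paired-end workflow, not FASTQ2OTU's internal code. The function name run_dada2_sketch, its arguments, and the truncation lengths are hypothetical, and the sketch merges pairs before removing chimeras, as the DADA2 tutorial does.

```r
# Hypothetical helper: pushes one paired-end sample through the core DADA2 steps.
# It needs the dada2 package, but nothing runs until the function is called.
run_dada2_sketch <- function(fwd, rev, filt_fwd, filt_rev, ref_fasta,
                             trunc_len = c(240, 160), multithread = FALSE) {
  # Filter and trim (truncation lengths are illustrative, not recommendations)
  dada2::filterAndTrim(fwd, filt_fwd, rev, filt_rev,
                       truncLen = trunc_len, maxEE = c(2, 2),
                       multithread = multithread)

  # Learn error rates from the filtered reads
  err_f <- dada2::learnErrors(filt_fwd, multithread = multithread)
  err_r <- dada2::learnErrors(filt_rev, multithread = multithread)

  # Denoise forward and reverse reads
  dd_f <- dada2::dada(filt_fwd, err = err_f, multithread = multithread)
  dd_r <- dada2::dada(filt_rev, err = err_r, multithread = multithread)

  # Merge paired-end reads and build the sequence table
  merged <- dada2::mergePairs(dd_f, filt_fwd, dd_r, filt_rev)
  seqtab <- dada2::makeSequenceTable(merged)

  # Remove chimeric sequences
  seqtab_nochim <- dada2::removeBimeraDenovo(seqtab, method = "consensus",
                                             multithread = multithread)

  # Assign taxonomy against a reference database (path supplied by the caller)
  taxa <- dada2::assignTaxonomy(seqtab_nochim, ref_fasta,
                                multithread = multithread)

  list(seqtab = seqtab_nochim, taxa = taxa)
}
```

Wrapping the steps in a function keeps the sketch inert until real file paths are supplied.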

Advantages of using FASTQ2OTU

Documentation is automated (all inputs and outputs are recorded)
Integrated workflow
Outputs are automatically generated
Easy to use

Getting Started

After installing FASTQ2OTU, the following input files and/or directories will be required to begin processing data:

A YML-formatted config file containing all parameters (more information about the config file can be found below).
A directory of single or paired-end FASTQ files OR a text file containing SRA ids to download from NCBI.
A bit of knowledge about the sequences
- Ideal trimming parameters
- Forward and reverse primer lengths
- If merging, the desired overlap length
- Working knowledge of DADA2 workflow

Execute pipeline

# Load package into environment (package name as used in the Installation section)
library("FASTQ2OTU")

# Path to config file
paired_config <- "path/to/my_paired-example_config.yml"

# Run pipeline
runPipeline(configFile = paired_config, isPaired = TRUE, getQuality = TRUE, getMergedSamples = TRUE, getDownloadedSeqs = TRUE, getGeneratedReport = FALSE)

The runPipeline() function runs the entire DADA2 pipeline. Its parameters allow users to specify which steps of the pipeline they would like to execute. The following table provides a description of each parameter and the action(s) it controls.

Parameter	Description	Directions
configFile	Path to YML-file containing all user inputs	The file must be formatted with the correct variable names (please refer to template)
isPaired	TRUE if handling paired-end data and FALSE if handling single-end.	Please note that paired-end and single-end data must be processed separately (the package cannot analyze both datatypes simultaneously).
getQuality	TRUE if you would like to generate a quality distribution plot and FALSE if you would like to skip the step.	This step can be run independently.
getMergedSamples	TRUE if you would like to generate a merged sample table and FALSE if you would like to skip the step.	Generates a single table containing data from all samples.
getDownloadedSeqs	TRUE if you would like to use `fastq-dump` or `wget` to download data directly from NCBI's SRA database.	Requires a text file containing all SRA sample IDs or FTP download links
getGeneratedReport	If TRUE, a FASTQC report is generated using the FASTQCR R package	This step can also be run independently.
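
As an illustration of how the parameters combine, the sketch below sets up a single-end run on local FASTQ files, so no download is requested. The config path is a hypothetical placeholder, and the call is guarded so that it only executes when runPipeline() is actually available in the session.

```r
# Hypothetical path to a single-end config file
single_config <- "path/to/my_single-example_config.yml"

# Arguments documented in the table above, set for single-end local data
single_args <- list(
  configFile = single_config,
  isPaired = FALSE,            # single-end reads
  getQuality = TRUE,           # produce the quality distribution plot
  getMergedSamples = TRUE,     # build the merged sample table
  getDownloadedSeqs = FALSE,   # FASTQ files are already on disk
  getGeneratedReport = FALSE   # skip the FASTQC report
)

# Run only if the package has been loaded and exports runPipeline()
if (exists("runPipeline", mode = "function")) {
  do.call(runPipeline, single_args)
}
```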

Quick Start Guide

DADA2 is an R package that allows users to perform high-resolution taxonomic analyses from FASTQ files. FASTQ2OTU allows most users to analyze datasets using the DADA2 pipeline. This procedure will cover some basics of R programming, installing and running the package on an R server, and interpreting some of the outputs generated. This document has two objectives:

Introduce new users to DADA2’s functions;
Set up a pipeline for 16S rRNA analyses of target bacterial isolates.

Plot Quality Distribution

DADA2’s plotQualityProfile() function creates one or more plots that visualize the overall distribution of quality scores within a dataset. Users can use the plots to make informed decisions about how they would like their data to be processed (e.g. filtering and trimming). The generated quality graphs show colored lines that signify different statistics.

Green is the mean quality score for all reads in a single dataset
Orange is the median
Dashed orange lines demarcate the 25th and 75th quantiles.
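
A quality plot can also be produced directly with DADA2, along the lines of the sketch below. The helper names list_fastq and plot_quality and the file-name pattern are assumptions made for illustration; only plotQualityProfile() and its aggregate argument come from DADA2 itself.

```r
# List FASTQ files in a directory (the pattern is an illustrative assumption)
list_fastq <- function(dir, pattern = "\\.fastq(\\.gz)?$") {
  sort(list.files(dir, pattern = pattern, full.names = TRUE))
}

# Plot quality profiles only when dada2 is installed and files were found
plot_quality <- function(dir, aggregate = FALSE) {
  files <- list_fastq(dir)
  if (length(files) == 0 || !requireNamespace("dada2", quietly = TRUE)) {
    return(invisible(NULL))
  }
  # aggregate = TRUE pools all files into a single summary plot
  dada2::plotQualityProfile(files, aggregate = aggregate)
}
```

Calling plot_quality("path/to/fastq_files", aggregate = TRUE) would then give one pooled plot instead of one panel per file.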

Merging Samples

For a single sample, DADA2’s makeSequenceTable() function returns a single-row matrix, with consensus sequences as column headings and read counts as elements in the row. OTU tables (given by DADA2's assignTaxonomy() or assignSpecies() function) contain taxonomic assignments and amplicon sequence variants (ASVs). FASTQ2OTU's mergeSamples() function merges data from sequence and OTU tables obtained from different samples to generate a single table. The final table can be used to make inter-sample comparisons that may inform downstream analyses.
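
The idea of combining single-row sequence tables can be illustrated in base R. The sketch below is not FASTQ2OTU's implementation; the function merge_seq_tables and the toy sequences are hypothetical. It takes the union of all sequences as columns and fills in zero where a sample lacks a sequence.

```r
# Merge single-row count matrices (sequences as column names) into one table
merge_seq_tables <- function(tables) {
  all_seqs <- unique(unlist(lapply(tables, colnames)))
  merged <- matrix(0L, nrow = length(tables), ncol = length(all_seqs),
                   dimnames = list(names(tables), all_seqs))
  for (i in seq_along(tables)) {
    tab <- tables[[i]]
    # Copy this sample's counts into the matching sequence columns
    merged[i, colnames(tab)] <- as.integer(tab[1, ])
  }
  merged
}

# Toy single-sample tables with made-up sequences and counts
s1 <- matrix(c(10L, 5L), nrow = 1, dimnames = list("sampleA", c("ACGT", "TTGA")))
s2 <- matrix(c(7L, 3L), nrow = 1, dimnames = list("sampleB", c("ACGT", "GGCC")))

merged <- merge_seq_tables(list(sampleA = s1, sampleB = s2))
print(merged)
```

The result has one row per sample and one column per distinct sequence, which is the shape needed for inter-sample comparisons.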

Downloading Data from NCBI

Public data can be accessed from NCBI’s SRA website. To view datasets, enter a project ID (e.g. PRJEB8073), click "Search", and select the "Send results to Run Selector" link to view the results interactively. The Run Selector tool can also be accessed directly. To obtain a list of all SRA accession IDs within a given project, click the "Accession List" button in the middle of the "Select" panel and wait for the text file to download.

Using FASTQ-DUMP

To download datasets using NCBI's fastq-dump utility, download the sra-toolkit from NCBI and obtain the path to the fastq-dump tool. Record the paths to the fastq-dump script and the SRA accession list in the config file (described below). Make sure to set the getDownloadedSeqs parameter to TRUE when executing the runPipeline() function.
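
For readers who want to see what such a download involves, here is a minimal sketch that calls fastq-dump from R for each accession in a list. It does not reproduce FASTQ2OTU's own download code; the helper names are hypothetical, and the accession file is assumed to hold one ID per line.

```r
# Read accession IDs, one per line, dropping blank lines and stray whitespace
read_accessions <- function(path) {
  ids <- trimws(readLines(path, warn = FALSE))
  ids[nzchar(ids)]
}

# Build the argument vector for a single fastq-dump call
fastq_dump_args <- function(accession, out_dir, paired = TRUE) {
  args <- c("--gzip", "--outdir", out_dir)
  if (paired) {
    args <- c("--split-files", args)  # write forward and reverse reads separately
  }
  c(args, accession)
}

# Hypothetical driver: run fastq-dump once per accession
download_with_fastq_dump <- function(fastq_dump, id_file, out_dir, paired = TRUE) {
  for (id in read_accessions(id_file)) {
    system2(fastq_dump, fastq_dump_args(id, out_dir, paired))
  }
}
```

Here fastq_dump is the path to the locally installed fastq-dump executable, matching the path recorded in the config file.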

Using WGET

To download datasets from SRA using wget, navigate to SRA-Explorer and input your project ID. Once you click the search icon, a table should appear at the bottom of the window. Select all rows in the table and store the results by clicking the blue "Add to Collection" button on the right. Please note that each search returns a limited number of results (at most 500). To obtain data on more than 500 samples, you must update the "Start at Record" text box after each search. Once you have stored all your samples in your collection, click the shopping cart icon on the top right. Click the tab that says "Raw FastQ Download URLs" and select the download link. Record the path to the downloaded text file in the config file and make sure to set the getDownloadedSeqs parameter to TRUE when executing the runPipeline() function.
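
The URL file can also be processed by hand. The sketch below assumes the downloaded text file holds one URL per line and uses R's download.file() with its wget method; the helper names are hypothetical and this is not FASTQ2OTU's own download routine.

```r
# Keep only lines that look like FTP or HTTP(S) URLs
read_urls <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines[grepl("^(ftp|https?)://", lines)]
}

# Pair each URL with a destination file named after the remote file
download_plan <- function(urls, out_dir) {
  data.frame(url = urls,
             dest = file.path(out_dir, basename(urls)),
             stringsAsFactors = FALSE)
}

# Hypothetical driver: fetch every file with wget (wget must be on the PATH)
download_all <- function(url_file, out_dir) {
  plan <- download_plan(read_urls(url_file), out_dir)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_len(nrow(plan))) {
    utils::download.file(plan$url[i], plan$dest[i], method = "wget")
  }
  invisible(plan)
}
```

Building the plan separately from the download makes it easy to inspect which files would be fetched before any network traffic occurs.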

Installation

This application is designed to be lightweight and simple to use. It is intended to be run on a remote server, but it can also be run in RStudio (the package was written in R 3.5.3). It can be downloaded from GitHub using devtools.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Biostrings")
BiocManager::install("ShortRead")
BiocManager::install("dada2")
BiocManager::install("gtools")

# Install package into environment
install.packages("devtools")
library(devtools)
install_github("ananata/fastq2otu")
library("FASTQ2OTU")

Using a config file

Variable	Type	Default	Description
projectPrefix	Character	"myproject"	Prefix to append to newly created files (e.g. _filtered_files/ is created to store filtered files)
outDir	Character	Current working directory	Path to the output directory that will contain all output files and documents.
pathToData	Character	N/A	Path to directory storing all input data.
verbose	Logical	FALSE	Sets `verbose` parameter for all functions
multithread	Logical	FALSE	Sets the `multithread` parameter for all functions
pathToSampleIDs	Character	N/A	The path to a text file containing SRA Accession IDs.
fastaPattern	Character	^.*[1,2]?.fastq(.gz)?$	Regex pattern to use when parsing directories for FASTQ files.
aggregateQual	Logical	N/A	Provide TRUE if you would like to aggregate your quality profile diagram.
qualN	Numeric	0	
useFastqDump	Logical	FALSE	Provide TRUE if you would like to download sequences using a locally installed version of SRA's fastq-dump
pathToFastqDump	Character	N/A	Path to fastq-dump script. Required if useFastqDump parameter is TRUE.
pathToSampleURLs	Character	N/A	Path to text file containing FTP download links.
pathToFastqc	Character	N/A	Path to fastqc software. Required to use FASTQCR
installFastqc	Logical	FALSE	If TRUE, FASTQC will be automatically downloaded into the user's home directory. Unless an input for pathToFastqc is provided, the new download will overwrite the older version.
pathToFastqcResults	Character	N/A	Path to the directory storing the FASTQC reports.
taxDatabase	Character	N/A	Required. Path to reference taxonomy database.

Please refer to a template config file for a more comprehensive list of the available parameters.
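
To show how the variables in the table fit together, the sketch below writes a small config file from R. It assumes a flat "key: value" YAML layout and covers only a handful of variables; the template config file remains the authoritative reference for the exact layout and the full parameter list. The helper names and all paths are hypothetical.

```r
# A few variables from the table above (all paths are placeholders)
config <- list(
  projectPrefix = "myproject",
  outDir = "path/to/output",
  pathToData = "path/to/fastq_files",
  verbose = FALSE,
  multithread = FALSE,
  taxDatabase = "path/to/reference_taxonomy_database"
)

# Render one value as YAML text: quote strings, spell out logicals
format_value <- function(value) {
  if (is.logical(value)) {
    if (value) "TRUE" else "FALSE"
  } else if (is.character(value)) {
    sprintf('"%s"', value)
  } else {
    as.character(value)
  }
}

# Turn the named list into "key: value" lines
to_yaml_lines <- function(cfg) {
  vapply(names(cfg), function(key) {
    sprintf("%s: %s", key, format_value(cfg[[key]]))
  }, character(1), USE.NAMES = FALSE)
}

# Write to a temporary location for illustration
config_path <- file.path(tempdir(), "my_paired-example_config.yml")
writeLines(to_yaml_lines(config), config_path)
```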

Authors

Nana Afia Twumasi-Ankrah
Dennis Wylie, PhD
Jennifer Fettweis, PhD

License

This project is licensed under the GNU GPLv3 License. This license restricts the use of this application in non-open-source systems. Please contact the authors with questions about relicensing this software for use in non-open-source systems.

Acknowledgments

DADA2's Original Developers (Callahan Lab)
Virginia Commonwealth University - Vaginal Microbiome Consortium
University of Texas - Austin
Emory University
