Automatically analyze aggregated meta-genomic datasets

ananata/fastq2otu

FASTQ2OTU

⚠️ This package is still under active development

FASTQ2OTU was developed as an easy and effective tool for downloading, processing, and analyzing large-scale microbiome rRNA gene data obtained from NCBI's SRA database. The package uses many functions from DADA2 to analyze sequence data. The primary objectives of FASTQ2OTU are to (1) increase the reproducibility of microbiome analysis and (2) encourage the analysis of archived data to obtain new knowledge.

FASTQ2OTU's workflow can be broken down into multiple stages:

  1. Get Sequences - Sequences can be downloaded using fastq-dump or wget (if FTP links are available).
  2. Plot Quality Distribution - Generate a figure that shows the quality distribution of the dataset. This step can be run independently if necessary.
  3. Filter and Trim
  4. Learn Errors and Denoise
  5. Find and Remove Chimeric Sequences
  6. Merge Paired-End Sequences
  7. Assign Taxonomy
  8. Merge OTU Tables - Merges individual OTU tables to create a single table that can be used in downstream analyses.
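
For orientation, stages 3 to 7 correspond roughly to the following DADA2 calls for a single paired-end sample. This is a minimal sketch, not FASTQ2OTU's own implementation: the function name `run_dada2_sketch`, its arguments, and the `trunc_len` values are hypothetical, and the sketch follows the order used in the DADA2 tutorial (merge before chimera removal) rather than the stage numbering above.

```r
# Sketch of the DADA2 calls behind stages 3 to 7 for ONE paired-end sample.
# All arguments are placeholders supplied by the caller; the trunc_len
# default is illustrative only and is not a FASTQ2OTU default.
run_dada2_sketch <- function(fwd, rev, filt_fwd, filt_rev, tax_db,
                             trunc_len = c(240, 200)) {
  stopifnot(requireNamespace("dada2", quietly = TRUE))

  # Stage 3: filter and trim the raw reads into new files
  dada2::filterAndTrim(fwd, filt_fwd, rev, filt_rev,
                       truncLen = trunc_len, maxEE = c(2, 2), truncQ = 2)

  # Stage 4: learn the error models, then denoise each read direction
  err_fwd <- dada2::learnErrors(filt_fwd)
  err_rev <- dada2::learnErrors(filt_rev)
  dada_fwd <- dada2::dada(filt_fwd, err = err_fwd)
  dada_rev <- dada2::dada(filt_rev, err = err_rev)

  # Stage 6: merge the denoised forward and reverse reads
  merged <- dada2::mergePairs(dada_fwd, filt_fwd, dada_rev, filt_rev)
  seqtab <- dada2::makeSequenceTable(merged)

  # Stage 5: find and remove chimeric sequences
  seqtab_nochim <- dada2::removeBimeraDenovo(seqtab, method = "consensus")

  # Stage 7: assign taxonomy against the reference database
  taxa <- dada2::assignTaxonomy(seqtab_nochim, tax_db)

  list(seqtab = seqtab_nochim, taxa = taxa)
}
```

The function only wraps the calls; in FASTQ2OTU the corresponding settings come from the config file instead of function arguments.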

Advantages of using FASTQ2OTU

  • Documentation is automated (all inputs and outputs are recorded)
  • Integrated workflow
  • Outputs are automatically generated
  • Easy to use

Getting Started

After installing FASTQ2OTU, the following input files and/or directories will be required to begin processing data:

  • A YML-formatted config file containing all parameters (more information about the config file can be found below).
  • A directory of single or paired-end FASTQ files OR a text file containing SRA ids to download from NCBI.
  • A bit of knowledge about the sequences
    • Ideal trimming parameters
    • Forward and reverse primer lengths
    • If merging, the desired overlap length
    • Working knowledge of DADA2 workflow

Execute pipeline

# Load package into environment
library("FASTQ2OTU")

# Path to config file
paired_config <- "path/to/my_paired-example_config.yml"

# Run pipeline
runPipeline(configFile = paired_config, isPaired = TRUE, getQuality = TRUE, getMergedSamples = TRUE, getDownloadedSeqs = TRUE, getGeneratedReport = FALSE)

The runPipeline() function runs the entire DADA2 pipeline. Its parameters let users specify which steps of the pipeline they would like to execute. The following table describes each parameter and the action(s) it controls.

| Parameter | Description | Directions |
| --- | --- | --- |
| configFile | Path to YML file containing all user inputs. | The file must be formatted with the correct variable names (please refer to the template). |
| isPaired | TRUE if handling paired-end data and FALSE if handling single-end data. | Paired-end and single-end data must be processed separately (the package cannot analyze both data types simultaneously). |
| getQuality | TRUE if you would like to generate a quality distribution plot and FALSE if you would like to skip the step. | This step can be run independently. |
| getMergedSamples | TRUE if you would like to generate a merged sample table and FALSE if you would like to skip the step. | Generates a single table containing data from all samples. |
| getDownloadedSeqs | TRUE if you would like to use fastq-dump or wget to download data directly from NCBI's SRA database. | Requires a text file containing all SRA sample IDs or FTP download links. |
| getGeneratedReport | If TRUE, a FASTQC report is generated using the FASTQCR R package. | This step can also be run independently. |
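
As an illustration of how these switches combine, the sketch below prepares a single-end run that produces only the quality plot. The config path is hypothetical, and the installed package name "FASTQ2OTU" is assumed from the Installation section; the call is skipped unless both the package and the config file are present.

```r
# Hypothetical path; replace with your own single-end config file.
single_config <- "path/to/my_single-example_config.yml"

# Collect the documented runPipeline() arguments in a named list so the
# selected steps are easy to review before anything is run.
single_end_args <- list(
  configFile = single_config,
  isPaired = FALSE,            # single-end data
  getQuality = TRUE,           # generate the quality distribution plot
  getMergedSamples = FALSE,    # skip the merged sample table
  getDownloadedSeqs = FALSE,   # use local FASTQ files, no download
  getGeneratedReport = FALSE   # skip the FASTQC report
)

# Only call the pipeline when the package and the config file are available.
if (requireNamespace("FASTQ2OTU", quietly = TRUE) && file.exists(single_config)) {
  do.call(FASTQ2OTU::runPipeline, single_end_args)
}
```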

Quick Start Guide

DADA2 is an R package that allows users to perform high-resolution taxonomic analyses from FASTQ files. This package will allow most users to analyze datasets using the DADA2 pipeline. This guide covers some basics of R programming, installing and running the package on an R server, and interpreting some of the outputs generated. There are two objectives for this document:

  1. To introduce new users to DADA2’s functions;
  2. To set up a pipeline for 16S rRNA analyses of target bacterial isolates.

Plot Quality Distribution

DADA2’s plotQualityProfile() function creates one or more plots that visualize the overall distribution of quality scores within a dataset. Users can use the plots to make informed decisions about how they would like their data to be processed (e.g. filtering and trimming). The generated quality graphs show colored lines that signify different statistics.

  • Green is the mean quality score for all reads in a single dataset
  • Orange is the median
  • Dashed orange lines demarcate the 25th and 75th quantiles.
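
A minimal sketch of producing such a plot directly with DADA2, outside of FASTQ2OTU, might look like the following. The input directory is a placeholder, the file pattern is the documented default of the fastaPattern config variable, and saving the plot to a temporary PDF is an assumption made only for illustration.

```r
# Hypothetical input directory; replace with your own.
fastq_dir <- "path/to/fastq_files"

# Default value of the fastaPattern config variable.
fastq_pattern <- "^.*[1,2]?.fastq(.gz)?$"
fastq_files <- list.files(fastq_dir, pattern = fastq_pattern, full.names = TRUE)

# Plot only when input files were found and dada2 is installed.
if (length(fastq_files) > 0 && requireNamespace("dada2", quietly = TRUE)) {
  # aggregate = TRUE draws one combined profile instead of one panel per file
  quality_plot <- dada2::plotQualityProfile(fastq_files, aggregate = TRUE)
  ggplot2::ggsave(file.path(tempdir(), "quality_profile.pdf"), quality_plot)
}
```

Note that the dots in the default pattern are unescaped, so they match any character rather than a literal period.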

Merging Samples

Sequence tables generated by DADA2’s makeSequenceTable() function are formatted as single-row matrices (contain only one row), with consensus sequences as column headings and read counts as elements in the row. OTU tables (given by DADA2's assignTaxonomy() or assignSpecies() function) contain taxonomic assignments and amplicon sequence variants (ASVs). FASTQ2OTU's mergeSamples() function merges data from sequence and OTU tables obtained from different samples to generate a single table. The final table can be used to make inter-sample comparisons that may inform downstream analyses.
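
The idea of combining single-row sequence tables can be illustrated in base R. This is a sketch of the concept only, not the implementation of mergeSamples(); the helper name `merge_seqtabs`, the sample names, and the toy sequences and counts are all made up for the example.

```r
# Merge a list of single-row count matrices (one per sample) into one matrix
# with one row per sample and one column per distinct sequence.
# Sequences that are absent from a sample receive a count of 0.
merge_seqtabs <- function(tabs) {
  all_seqs <- unique(unlist(lapply(tabs, colnames)))
  sample_names <- vapply(tabs, rownames, character(1))
  merged <- matrix(0L, nrow = length(tabs), ncol = length(all_seqs),
                   dimnames = list(sample_names, all_seqs))
  for (i in seq_along(tabs)) {
    merged[i, colnames(tabs[[i]])] <- tabs[[i]][1, ]
  }
  merged
}

# Toy single-row tables standing in for makeSequenceTable() output.
sample_a <- matrix(c(10L, 5L), nrow = 1,
                   dimnames = list("sampleA", c("ACGT", "TTGA")))
sample_b <- matrix(c(7L, 3L), nrow = 1,
                   dimnames = list("sampleB", c("TTGA", "GGCC")))

merged <- merge_seqtabs(list(sample_a, sample_b))
print(merged)
```

DADA2 also ships its own mergeSequenceTables() function for combining sequence tables produced in separate runs.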

Downloading Data from NCBI

Public data can be accessed from NCBI’s SRA website. To view datasets, enter a project ID (e.g. PRJEB8073), click "Search", and select the "Send results to Run Selector" link to view the results interactively. The Run Selector tool can also be accessed directly. To obtain a list of all SRA accession IDs within a given project, click the "Accession List" button in the middle of the "Select" panel and wait for the text file to download.

Using FASTQ-DUMP

To download datasets using NCBI's fastq-dump utility, download the sra-toolkit from NCBI and obtain the path to the fastq-dump tool. Record the paths to the fastq-dump script and the SRA accession list in the config file (described below). Make sure to set the getDownloadedSeqs parameter to TRUE when executing the runPipeline() function.
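
For reference, a download of this kind can be sketched in base R by calling fastq-dump once per accession. This is not how FASTQ2OTU necessarily invokes the tool: the helper name `build_fastq_dump_args` and all three paths are hypothetical, and the choice of the `--split-files`, `--gzip`, and `--outdir` options is an assumption about a typical paired-end download.

```r
# Build the command-line arguments for one accession.
# --split-files writes paired reads to separate files; --gzip compresses output.
build_fastq_dump_args <- function(accession, out_dir, paired = TRUE) {
  args <- c("--gzip", "--outdir", shQuote(out_dir), accession)
  if (paired) c("--split-files", args) else args
}

# Hypothetical paths; replace with your own.
fastq_dump <- "path/to/sratoolkit/bin/fastq-dump"
accession_file <- "path/to/SRR_Acc_List.txt"
out_dir <- "path/to/output"

# Run only when both the tool and the accession list exist.
if (file.exists(fastq_dump) && file.exists(accession_file)) {
  accessions <- trimws(readLines(accession_file, warn = FALSE))
  accessions <- accessions[nzchar(accessions)]   # drop blank lines
  for (acc in accessions) {
    system2(fastq_dump, args = build_fastq_dump_args(acc, out_dir))
  }
}
```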

Using WGET

To download datasets from SRA using wget, navigate to SRA-Explorer and input your project ID. Once you click the search icon, a table should appear at the bottom of the window. Select all rows in the table and store the results by clicking the blue "Add to Collection" button on the right. Please note that the search only outputs a certain number of results each time (with the maximum being 500). To obtain data on more than 500 samples, you must update the "Start at Record" text box after each search. Once you have stored all your samples in your collection, click the shopping cart icon on the top right. Click the tab that says "Raw FastQ Download URLs" and select the download link. Record the path to the downloaded text file in the config file and make sure to set the getDownloadedSeqs parameter to TRUE when executing the runPipeline() function.
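
The downloaded URL list can also be inspected or fetched by hand. The sketch below uses base R's download.file() in place of wget and assumes the text file holds one URL per line; the helper name `read_fastq_urls` and both paths are hypothetical.

```r
# Read a text file of download links, keeping only lines that look like
# FTP or HTTP(S) URLs (blank lines and other text are ignored).
read_fastq_urls <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines[grepl("^(ftp|https?)://", lines)]
}

# Hypothetical paths; replace with your own.
url_file <- "path/to/fastq_urls.txt"
download_dir <- "path/to/downloads"

# Download each file only when the URL list and target directory exist.
if (file.exists(url_file) && dir.exists(download_dir)) {
  for (u in read_fastq_urls(url_file)) {
    download.file(u, destfile = file.path(download_dir, basename(u)), mode = "wb")
  }
}
```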

Installation

This application is designed to be lightweight and simple to use. It is intended to be used on a remote server, but it can also be run in RStudio (the package was written in R 3.5.3). It can be downloaded from GitHub using devtools.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Biostrings")
BiocManager::install("ShortRead")
BiocManager::install("dada2")
BiocManager::install("gtools")

# Install package into environment
install.packages("devtools")
library(devtools)
install_github("ananata/fastq2otu")
library("FASTQ2OTU")

Using a config file

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| projectPrefix | Character | "myproject" | Prefix to append to newly created files (e.g. _filtered_files/ is created to store filtered files). |
| outDir | Character | Current working directory | Path to the output directory that will contain all output files and documents. |
| pathToData | Character | N/A | Path to the directory storing all input data. |
| verbose | Logical | FALSE | Sets the verbose parameter for all functions. |
| multithread | Logical | FALSE | Sets the multithread parameter for all functions. |
| pathToSampleIDs | Character | N/A | Path to a text file containing SRA accession IDs. |
| fastaPattern | Character | ^.*[1,2]?.fastq(.gz)?$ | Regex pattern to use when parsing directories for FASTQ files. |
| aggregateQual | Logical | N/A | Provide TRUE if you would like to aggregate your quality profile diagram. |
| qualN | Numeric | 0 | |
| useFastqDump | Logical | FALSE | Provide TRUE if you would like to download sequences using a locally installed version of SRA's fastq-dump. |
| pathToFastqDump | Character | N/A | Path to the fastq-dump script. Required if useFastqDump is TRUE. |
| pathToSampleURLs | Character | N/A | Path to a text file containing FTP download links. |
| pathToFastqc | Character | N/A | Path to the FASTQC software. Required to use FASTQCR. |
| installFastqc | Logical | FALSE | If TRUE, FASTQC is automatically downloaded into the user's home directory. Unless a value for pathToFastqc is provided, the new download overwrites the older version. |
| pathToFastqcResults | Character | N/A | Path to the directory storing the FASTQC reports. |
| taxDatabase | Character | N/A | Required. Path to the reference taxonomy database. |

Please refer to a template config file for a more comprehensive list of the available parameters.
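
As a rough illustration, a small config file using a few of the variables above could be written from R as follows. The flat `key: value` layout, the output file name, and every path are assumptions made for this sketch; the template config file remains the authoritative reference for the expected structure.

```r
# Lines of a minimal, hypothetical config file. Only variable names from the
# table above are used; all paths are placeholders.
config_lines <- c(
  'projectPrefix: "myproject"',
  'outDir: "path/to/output"',
  'pathToData: "path/to/fastq_files"',
  'verbose: false',
  'multithread: false',
  'taxDatabase: "path/to/reference_taxonomy.fa.gz"'
)

# Write the file to a temporary location and show its contents.
config_path <- file.path(tempdir(), "example_config.yml")
writeLines(config_lines, config_path)
cat(readLines(config_path), sep = "\n")
```

The resulting path can then be passed to runPipeline() through the configFile argument.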

Authors

  • Nana Afia Twumasi-Ankrah
  • Dennis Wylie, PhD
  • Jennifer Fettweis, PhD

License

This project is licensed under the GNU GPLv3 License. This license restricts the use of this application in non-open-source systems. Please contact the authors with questions about relicensing this software for non-open-source systems.

Acknowledgments

  • DADA2's Original Developers (Callahan Lab)
  • Virginia Commonwealth University - Vaginal Microbiome Consortium
  • University of Texas - Austin
  • Emory University
