DBTC is an R implementation of Dada2 and a BLAST approach to metabarcoding analysis.
molecular analysis of high-throughput metabarcode molecular sequence data fastq files and taxonomic assignment of unique reads (in fasta format files) using the Basic Local Alignment Search Tool (BLAST) and R to reduced obtained results (See additional documentation at DBTC). The Dada-BLAST-Taxon Assign-Condense ('DBTC') package contains the foundational 'DBTC' functions run through the R command line (But also see 'DBTCShiny' and DBTCShinyTutorial). The 'DBTC' functions have four main outcomes...
- Fastq file processing using Dada in R
- Using the Basic Local Alignment Search Tool (BLAST), amplicon sequence variants (ASV) can be searched against local NCBI or custom libraries
- Assign taxa to the unique reads using NCBI taxon database (obtain the database using taxonomizr website)
- Condense the resulting ASV taxonomic assignment tables to unique taxa with the ability to combine datasets (using different sequence libraries for the same reads, or results from the same samples for different molecular regions) into a combined results table
NOTE: While the DBTC package has been built for the analysis of high-throughput sequencing results, the BLAST, taxonomic assignment, and taxonomic condense can be utilized with single specimen Sanger sequencing data.
Also see DBTCShiny and the associated DBTCShinyTutorial.
- Installation
- Package Dependencies
- External R Element Dependencies
- NCBI BLAST+ local program to run BLAST on local databases
- R package taxonomizr to establish a local NCBI taxonomy database
- Establish a local NCBI prepared sequence database
- Create a custom sequence database to BLAST against
- Create a local NCBI taxonomy database to assign taxonomic identifications to BLAST results
- R Packages Dependencies
- External R Element Dependencies
- Function Descriptions
- Naming Convention Rules
- Package Function Details
- Dada Implement - dada_implement()
- Combine Dada Output - combine_dada_output()
- Make BLAST DB - make_BLAST_DB()
- Sequence BLAST - seq_BLAST()
- Taxon Assignment - taxon_assign()
- Combine Assignment Output - combine_assign_output()
- Reduce Taxa - reduce_taxa()
- Combine Reduced Output - combine_reduced_output()
- Citation
DBTC can be installed three ways.
install.packages('DBTCShiny’)
Run the following commands in your R terminal...
if(!require(devtools)) install.packages('devtools')
library('devtools')
devtools::install_github('rgyoung6/DBTC')
library('DBTC')
Note: the first command to install the "devtools" may not be necessary if already installed.
Navigate to the DBTC GitHub page. Download the files associated with this page to your local computer and place them somewhere in the main file folder named DBTC. Then run the following command pointing to that location on your local computer by replacing the HERE with the path in the below command...
library("DBTC", lib.loc="HERE")
Follow the instructions on the NCBI BLAST+ executables page to obtain a local version of the BLAST tools. The list of the latest installation files can be found here.
Note: It is best to download and install the most recent versions of the blast+ suite to your computer and place the programs in your computers path so you can access the program from any folder. However, the program files for both blastn and makeblastdb have been included in the DBTCShinyTutorial GitHub page for ease of use (please note that these may not be the most recent versions).
The R package taxonomizr website is used to establish a NCBI taxaID database (NOTE: this package is also required when using the taxon assignment elements in the DBTC pipeline).
install.packages('taxonomizr')
library('taxonomizr')
NCBI BLASTable databases can be established through two methods.
-
Download your desired preformatted NCBI database by using the 'update_blastdb.pl' (found in the NCBI BLAST+ local install folder). NOTE: Perl programming language needs to be installed on your local machine. Instructions can be found at Get NCBI BLAST databases.
-
You can download your desired preformatted NCBI database manually with instructions at BLAST FTP Site and a list of the available databases at Index of /blast/db.
In addition to the NCBI resources, DBTC can also use custom databases. To establish these databases you will require a Fasta file with the desired records with MACER formatted headers. The MACER R package and instructions can be found at either of the two locations:
MACER GitHub (will have the most recent version and development versions)
In the 'Preparation' section of the taxonomizr website, use the instructions and the prepareDatabase('accessionTaxa.sql', getAccessions = FALSE) taxonomizr command to establish a local taxonomy database.
prepareDatabase('accessionTaxa.sql', getAccessions = FALSE)
Note: Along with the accessionTaxa.sql two other files nodes.dmp and names.dmp files are downloaded. These two files are not necessary for downstream analyses and can be deleted.
The ShortRead package is required to run elements of the DBTC pipeline and can be obtained through Bioconductor.
The dada2 package is main package to process the raw fastq files and can be obtained from Bioconductor. There is also a dada2 GitHub resource.
if (!require('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('ShortRead')
BiocManager::install('dada2')
library(dada2)
Each of below CRAN packages and their dependencies are required for the DBTC package.
install.packages(c('ggplot2',
'parallel',
'pbapply',
'plyr',
'stats',
'taxonomizr',
'utils'))
library(c('ggplot2',
'parallel',
'pbapply',
'plyr',
'stats',
'taxonomizr',
'utils'))
After installation of the DBTC and all of its dependencies you need to load the package using the following commands...
library(DBTC)
The dada_implement() function takes fastq files as input, analyses them and produces amplicon sequence variant (ASV) files. This function requires a main directory containing folder(s) representing sequencing runs which in-turn contains fastq files (the location of one of the fastq files in one of the sequencing run folders is used as an input argument). A run is a group of results processed at the same time on the same machine representing the same molecular methods. All sequencing folders in the main directory need to represent data from sequencing runs that have used the same primers and protocols. Output from this function includes all processing files and final main output files in the form of fasta files and ASV tables.
DBTC dada_implement() uses ASV output files ('YYYY_MM_DD_HH_MM_UserInputRunName_Merge' and/or 'YYYY_MM_DD_HH_MM_UserInputRunName_MergeFwdRev') and combines them into a single ASV table and creates an accompanying fasta file. This function also produces a file containing the processing information for the function. The main input argument for this function is the location of a file in a folder containing all ASV tables wanting to be combined. Output files are generated with the naming convention 'YYYY_MM_DD_HH_MM_combinedDada'.
This function takes a fasta file with headers in the MACER format (Young et al. 2021) and establishes a database upon which a BLAST search can be completed. However, if a NCBI sequence database is desired, it is advisable to use, where applicable, NCBI preformatted databases and skip the make_BLAST_DB() function (https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_subdirectory). The outcome of the function is a folder with a BLASTable NCBI formatted sequence database.
The MACER fasta header format
>GenBankAccessionOrBOLDID|GenBankAccession|Genus|species|UniqueID|Marker
Fasta file(s) are used as input along with a user selected NCBI formatted database upon which query sequences will be searched using BLAST. The outcome of the function are two files, a BLAST run file and a single file containing all of the BLAST results in tab delimited format. There are no headers in the BLAST results file but the columns (in order left to right) are: query sequence ID, search sequence ID, search taxonomic ID, query to sequence coverage, percent identity, search scientific name, search common name, query start, query end, search start, search end, e-value.
This function takes a BLAST result file and associated fasta files (either on their own or with accompanying ASV files generated from the dada_implement() function) and collapses the multiple BLAST results into as single result for each query sequence. When an ASV table is present the taxonomic results will be combined with the ASV table.
The combine_assign_output() function takes a file selection and then uses all DBTC ASV taxon assign files ('_taxaAssign_YYYY_MM_DD_HHMM.tsv') in a selected directory and combines them into a single output 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' file. The files being combined should represent different samples but representing data that have all come from analysis using the same molecular methods, the same analysis arguments, and the same molecular sequence databases.
To reduce ASV results to unique taxa per file the reduce_taxa() function takes a file selection and then uses all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory. This function then reduces all ASV with the same taxonomic assignment into a single result and places these results in a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for each of the target files in the directory.
This function takes a file selection and then uses all '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files in that directory and combines them into a single 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' taxa table file with presence absence results. The files being combined should represent the same biological samples but with different molecular marker information. The output ASV can include read numbers or be reduced to presence absence results.
WARNING - NO WHITESPACE!
When running DBTC functions the paths for the files selected cannot have white space! File folder locations should be as short as possible (close to the root directory) as some functions do not process long naming conventions.
Also, special characters should be avoided (including question mark, number sign, exclamation mark). It is recommended that dashes be used for separations in naming conventions while retaining underscores for use as information delimiters (this is how DBTC functions use underscore).
There are several key character strings used in the DBTC pipeline, the presence of these strings in file or folder names will cause errors when running DBTC functions. The following strings are those used in DBTC and should not be used in file or folder naming:
- _BLAST
- _combinedDada
- _taxaAssign
- _taxaAssignCombined
- _taxaReduced
- _CombineTaxaReduced
dada_implement() - Process metabarcode raw fastq files by runs using Dada2 (Note: molecular markers are independently analysed and combined at the end of the analysis pipeline using the combine_reduced_output() function).
Two files are required as input for the dada_implement() function. The first are the fastq files in the appropriate folder structure (see below) and the second is a file containing the primers used for the amplification of the sequence reads (tab separated file, see DBTCShinyTutorial for file examples). To select all of the fastq files simply select one file in one of the Run directories to point to the desired data files, as long as the file and folder structure is correct.
Fastq File Folder Structure
Parent Directory
|
|
-----------------
| |
| |
Run1 Directory Run2 Directory
-Fastq -Fastq
-Fastq -Fastq
... ...
Format of the primer file
Forward | Reverse |
---|---|
AGTGTGTAGTGATTG | CGCATCGCTCAGACTGACTGC |
GAGCCCTCGATCGCT | GGTCGATAGCTACGCGCGCATACGACT |
GGTTCACATCGCATTCAT |
- runFolderLoc - Select a file in the one of the run folders with the fastq files of interest (Default NULL).
- primerFile - Select a file with the primers for this analysis (Default NULL).
- fwdIdent - Forward identifier naming string (Default '_R1_001').
- revIdent - Reverse identifier naming string (Default '_R2_001').
- bidirectional - Selection to process paired forward and reverse sequence for analysis (Default TRUE).
- unidirectional - Selection to process files independently (Default FALSE).
- printQualityPdf - Selection to process save image files showing quality metrics (Default TRUE).
- maxPrimeMis -Maximum number of mismatches allowed when pattern matching trimming the primers from the ends of the reads for the ShortRead trimLRPatterns() function (Default 2).
- fwdTrimLen - Select a forward trim length for the Dada filterAndTrim() function (Default 0).
- revTrimLen - Select a reverse trim length for the Dada filterAndTrim() function (Default 0).
- maxEEVal - Maximum number of expected errors allowed in a read for the Dada filterAndTrim() function (Default 2).
- truncQValue - Truncation value use to trim ends of reads, nucleotides with quality values less than this value will be used to trim the remainder of the reads for the Dada filterAndTrim() function (Default 2).
- truncLenValueF - Dada forward length trim value for the Dada filterAndTrim() function. This function is set to 0 when the pattern matching trim function is enabled (Default 0).
- truncLenValueR - Dada reverse length trim value for the Dada filterAndTrim() function. This function is set to 0 when the pattern matching trim function is enabled (Default 0).
- error - Percent of fastq files used to assess error rates for the Dada learnErrors() function (Default 0.1).
- nbases - The total number of bases used to assess errors for the Dada learnErrors() function (Default 1e80) NOTE: this value is set very high to get all nucleotides in the error present file subset. If the error is to be assessed using total reads and not specific fastq files then set the error to 1 and set this value to the desired number of reads.
- maxMismatchValue - Maximum number of mismatches allowed when merging two reads for the Dada mergePairs() function (Default 2).
- minOverlapValue - Minimum number of overlapping nucleotides for the forward and reverse reads for the Dada mergePairs() function (Default 12).
- trimOverhang - Trim merged reads past the start of the complimentary primer regions for the Dada mergePairs() function (Default FALSE).
- minFinalSeqLen - The minimum final desired length of the read (Default 100).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
The output files from this function appear in four folders. See the below diagram for the saved file structure.
Parent Directory
|
|
-----------------------------
| |
| |
Run1 Directory Run2 Directory
- Fastq - Fastq
- Fastq - Fastq
... ...
|
|
--------------------------------------------------------
| | | |
YYYY_MM_DD_HHMM_Run1_A_Qual | | |
-revQual.pdf | YYYY_MM_DD_HHMM_Run1_C_FiltQual |
... | -filtFwdQual.pdf |
| -filtRevQual.pdf |
| ... |
| |
| YYYY_MM_DD_HHMM_Run1_D_Output
| -dadaSummary.txt
| -dadaSummaryTable.tsv
| -ErrorForward.pdf
| -ErrorReverse.pdf
| -MergeFwdRev.tsv
YYYY_MM_DD_HHMM_Run1_B_Filt -MergeFwdRev.fas
-fwdFilt.fastq -Merge.tsv
-revFilt.fastq -Merge.fas
... -TotalTable.tsv
Primer_Trim
-primeTrim.fastq
...
Quality pdf's in the A_Qual folder represent the quality metrics for the raw Fastq files (This folder may not be present if printQualityPdf is set to FALSE). Files in the B_Filt folder represent the trimming (in the Primer_Trim folder) and trimmed and cleaned in the larger folder. Quality pdf's in the C_FiltQual folder represent the quality metrics for the trimmed and cleaned Fastq files (This folder may not be present if printQualityPdf is set to FALSE). There are numerous files in the D_Output folder. These include:
- dadaSummary.txt file which provides all of the information on the running of the dada_implement() function.
- dadaSummaryTable.tsv contains a table with summary information for the processing of the samples in the run.
- ErrorForward.pdf and ErrorReverse.pdf provide visualizations on the assessed sequencing error for the sequencing run.
- MergeFwdRev.tsv is the ASV table with the data from the sequencing run and the MergeFwdRev.fas is a Fasta file with the reads from the samples. The MergeFwdRev files include reads that were able to be merged, as well as reads that were not able to be merged. NOTE: The merged, forward, and reverse reads are obtained in parallel analyses and combined into a single file so MergeFwdRev files will represent triplicate molecular processing results. These files are present to see if there are reads that are not represented or poorly represented across merged and unidirectional results, perhaps indicating issues with one of the primers.
- Merge.tsv is the ASV table with the read data able to be merged from the sequencing run. The companion Merge.fas is a Fasta file with the reads from the samples in Fasta format.
- TotalTable.tsv is an ASV table with all of the merged, forward, and reverse results as well as the retained results for the merged reads removed due to being suspected chimeric combinations.
- dada2
- ShortRead readFastq()
- ShortRead writeFastq()
combine_dada_output() - Combine Dada2 ASV tables (YYYY_MM_DD_HHMM_FileName_MergeFwdRev.tsv OR YYYY_MM_DD_HHMM_FileName_Merge.tsv) into a single ASV output file.
Two or more files to be combined are required as input for this function. These files need to be ASV files as outputted from the dada_inplement() and can include Merge, MergeFwdRev, or TotalTable.tsv files.
- fileLoc - Select a file in the file folder with dada_inplement() results you would like to combine (YYYY_MM_DD_HHMM_FileName_MergeFwdRev OR YYYY_MM_DD_HHMM_FileName_Merge both .tsv and .fas files (Default NULL).
- minLen - The minimum final desired length of the read (Default 100).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
The output from this function includes three files.
- YYYY_MM_DD_HHMM_combinedDada.tsv - combined ASV table
- YYYY_MM_DD_HHMM_combinedDada.fas - combined Fasta file
- YYYY_MM_DD_HHMM_combinedDada.txt - Summary file from the combine_dada_output()
Outputted data files will come in the same ASV table format as the output dada_inplement() ASV files.
- Base R
make_BLAST_DB() - Create a local database to BLAST against.
This function takes a Fasta file (in MACER format) and establishes a database upon which a BLAST search can be completed. The outcome of the function is a folder with an NCBI database.
- The MACER Fasta header format -
>UniqueID|OtherInformation|Genus|species|OtherInformation|Marker
- An example of the header format output from the MACER program is
>GenBankAccessionOrBOLDID|GenBankAccession|Genus|species|UniqueID|Marker
- fileLoc - The location of a file in a directory where all Fasta files will be used to construct a BLASTable database (Default = NULL).
- makeblastdbPath - The local path for the blast+ makeblastdbPath program (Default 'makeblastdb').
- taxaDBLoc - The location of the NCBI accessionTaxa.sql taxonomic data base (Default NULL).
- dbName - A short 6-8 alpha character name used when building a database (Default NULL).
- minLen - The minimum sequence length used to construct the BLAST database (Default 100).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
The output from this function includes a folder with the BLAST database named according to the submitted dbName.
The constructed database can then be used with the seq_BLAST() function.
- taxonomizr()
seq_BLAST() - Search Fasta files of unknown sequences against a BLAST formatted database.
Provide a location for the BLAST database you would like to use by selecting a file in the target directory (This can be a built database using the make_BLAST_DB() function or a preformatted NCBI BLAST database). Then provide the location of the query sequence files by indicating a file in a directory that contains the Fasta files. Provide the path for the blast+ blastn program. Finally, provide the minimum query sequence length to BLAST (Default = 100), the maximum depth of the BLAST returned results (Default = 200), and finally the number of cores to process the function (default = 1, Windows implementation can only use a single core and will default to this value when running on Windows).
- databasePath - The location of a file in a directory where the desired BLAST database is located (Default NULL).
- querySeqPath - The location of a file in a directory containing all of the Fasta files wishing to be BLASTed (Default NULL).
- blastnPath - The location of the NCBI blast+ blastn program (Default 'blastn').
- minLen - The minimum length of the sequences that will be BLASTed (Default 100).
- BLASTResults - The number of returned results, or the depth of the reported results, saved from the BLAST (Default 200).
- numCores - The number of cores used to run the function (Default 1, Windows systems can only use a single core).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
Two files are produced from this function, a BLAST run file and a BLAST results file for each of the Fasta files in the target directory.
The BLAST run file contains the command used to run the BLAST search. The BLAST results file includes all results in a tab delimited .tsv file format with the columns qseqid, sseqid, staxid, qcovs, pident, ssciname, scomname, qstart, qend, sstart, send, evalue.
- Base R
taxon_assign() - Using BLAST results to construct a table with taxonomic assignments for each unique sequence.
This function requires a BLAST output file and an associated Fasta file. In addition, if present an ASV file will also be used to combine with the taxonomic results. The function will take the BLAST results and reduce the taxonomic assignment to a single result for each read.
- fileLoc - The location of a file in a directory where all of the paired Fasta and BLAST (and potentially ASV) files are located (Default NULL).
- taxaDBLoc - The location of the NCBI accessionTaxa.sql taxonomic data base (Default NULL).
- numCores - The number of cores used to run the function (Default 1, Windows systems can only use a single core).
- coverage - The percent coverage used for taxonomic assignment for the above threshold results (Default 95).
- ident - The percent identity used for the taxonomic assignment for above threshold results (Default 95).
- propThres - The proportional threshold flags the final result based on the preponderance of the data. So if the threshold is set to 0.95, results will be flagged if the taxa directly below the assigned taxa has fewer than 0.95 percent of the records causing the upward taxonomic placement (Default 0.95).
- coverReportThresh - The percent coverage threshold used for reporting flags below this threshold (Default 95).
- identReportThresh - The percent identity threshold used for reporting flags below this threshold (Default 95).
- includeAllDada - When paired Dada ASV tables are present, when set to FALSE, this will exclude records without taxonomic assignment (Default TRUE).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
A single taxonomic assignment file is created for each BLAST output file with the naming convention having the string '_taxaAssign_YYYY_MM_DD_HHMM.tsv'. In addition, there is a run file that is also generated which contains the run details with the naming convention 'YYYY_MM_DD_HHMM_taxaAssign.txt'.
The number of returned BLAST results is dictated by the seq_BLAST() BLASTResults argument. The taxon_assign() function takes into account all returned BLAST results for each read. At each taxonomic level assignments have quality metrics in parentheses after the name. These values ("Num_Rec", "Coverage", "Identity", "Max_eVal") represent the number of records with this taxonomic placement, the minimum coverage and identity, and the maximum eValue for the reported taxa.
Column headers for the resulting taxonomic assignments include...
uniqueID, superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Lowest_Single_Ran, Lowest_Single_Taxa, Lowest_Single_Rank_Above_Thres, Lowest_Single_Taxa_Above_Thres, Final_Common_Names, Final_Rank, Final_Taxa, Final_Rank_Taxa_Thres, Result_Code, Sequence, Length, Results, followed by columns of samples containing ASV values
There are three columns that deserve special explanation.
-
The Final_Rank_Taxa_Thres column contains the threshold values (Coverage, Identitiy) applied to the final rank and taxonomic values for the associated records.
-
The Results column contains the reference to the record if it was from a merged, forward or reverse analysis result.
-
The Result_Code column contains flags placed on the results to better understand the quality of the resulting taxonomic assignments. Below is a list of codes:
- SFAT(coverage, ident): Saturated filtered taxa above threshold
- SANF(coverage, ident): Saturated non-filtered
- BIRT(identReportThresh): Final taxa result is below the identity reporting threshold
- BCRT(coverReportThresh): Final taxa result is below the nucleotide coverage reporting threshold
- TBAT(propThres): Taxa Below Assigned Taxa Threshold
Note: Records with BIRT and BCRT flags should be highly scrutinized. TBAT are also concerning in that they may represent a less specific taxonomic placement due to the moving of the result to a higher taxonomic placement. The taxonomic rank directly below the final reported rank should be reviewed as there may be potential to adjust the final taxonomic assignment. SANF results should be explored and the size of the database, the trust placed in the records in the database, and the depth of the BLAST results should be considered when assessing records with this flag. Records with SFAT are among the least concerning as the BLAST results were saturated but this taxonomic assignment saturation occurred above your set quality coverage and identity threshold. Concerns with records with this result could be that the depth of the BLAST analysis was not low enough for very large databases, or that the database is not complete (taxonomic breadth) when using smaller databases.
- taxonomizr()
- pbapply()
- parallel()
combine_assign_output() - Using results from the taxon_assign() function, combines all files with the string '_taxaAssign_YYYY_MM_DD_HHMM.tsv' in to a single .tsv file.
Select a file in a folder with the taxa assigned files you would like to combine (extension '_taxaAssign_YYYY_MM_DD_HHMM.tsv'). NOTE: all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files in the folder location should originate from the same dada output file but have outputs from different BLAST sequence libraries and therefore contain the same ASV's.
- fileLoc - The location of a file in a directory where all of the '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files are located (Default NULL).
- numCores - The number of cores used to run the function (Default 1, Windows systems can only use a single core).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
This function produces a 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' and a 'YYYY_MM_DD_HHMM_taxaAssignCombined.txt' file in the selected target directory.
The interpretation of the output file for the combine_assign_output() 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files is the same as the is the same as the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
- pbapply()
- parallel()
reduce_taxa() - Using results from taxon_assign() and/or Combine Assignment Output this function combines all reads with the same taxonomic assignment into a single result for each of the files submitted.
This function requires a file in a directory where all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory will be combined. All records with the same taxonomic result will be combined. The BLAST values in parentheses ("Num_Rec", "Coverage", "Identity", "Max_eVal") are combine by the mean number of records, the mean of the minimum coverage and identity values, and the mean of the maximum eValues.
- fileLoc - The location of a file in a directory where all of the '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files are located (Default NULL).
- numCores - The number of cores used to run the function (Default 1, Windows systems can only use a single core).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
This function produces a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for every '_taxaAssign_YYYY_MM_DD_HHMM.tsv' or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' present in the target directory.
Reduced taxonomic assignment files have fewer columns in the main taxa_reduced.tsv file than the '_taxaAssign_YYYY_MM_DD_HHMM.tsv' or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files as columns are collapsed. In addition, the values in the taxonomic columns in parentheses represent the average values across all of the results with the same taxonomic assignment (see taxon_assign() interpretation).
The columns include, superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Final_Common_Names, Final_Rank, Final_Taxa, Result_Code, RepSequence, Number_ASV, Average_ASV_Length, Number_Occurrences, Average_ASV_Per_Sample, Median_ASV_Per_Sample, Results. NOTE: The representative sequence (RepSequence) is the longest sequence for each of the collapsed taxa assigned.
- pbapply()
- parallel()
combine_reduced_output() - This function takes '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files generated from the same biological samples but representing different amplified molecular markers and collapses these data into a single file. The outcome of this process results in a presence absence matrix for all taxa and markers.
Select a file in a folder with '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files representing data for the same biological samples but representing different amplified molecular markers.
- fileLoc - The location of a file in a directory where all of the '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files are located (Default NULL).
- presenceAbsence - A TRUE or FALSE value used to indicate if the read values should be replaced with presence/absence (1/0) data. This change is necessary when combining taxa for the same named samples across molecular markers (TRUE) but is not necessary when combining results for taxa with all unique sample names (FALSE).
- verbose - If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
Two files, a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.tsv' result file and a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' run summary file are generated from this function. The result file contains presence/absence data in a matrix that associates the data with samples, taxa, and molecular marker. The column headers in the results file includes the following, superkingdom, phylum, class, order, family, genus, species, markers(n number of columns), samples (n number).
The interpretation of the output file is the same as the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
- plyr() rbind.fill
Young RG, et al., Hanner RH (2024) A Scalable, Open Source, Cross Platform, MetaBarcode Analysis Method using Dada2 and BLAST. Biodiversity Data Journal (In progress)