Process TDPortal and ProSightPD top-down reports (.tdReport files) by retrieving information from UniProt and filtering by false detection rate. TDViewer is also highly recommended for viewing top-down reports.
The easiest way to use the functionality of GUPPI by accessing the GUPPI web application.
Install from Github with:
remotes::install_github("davidsbutcher/GUPPI")A pre-built Docker image is available at Docker
hub.
Alternately, after downloading the source code the Dockerfile found in
/inst/shiny can be used to build the Docker image.
To download the image and start a container, install Docker Desktop, start it, and run the following commands in the appropriate shell (e.g. Windows PowerShell). NOTE: Check Docker hub for the latest release version before pulling an image.
docker pull davidsbutcher/guppi:release6
docker run -p 3838:3838 davidsbutcher/guppi:release6The application can be accessed in a web browser at http://localhost:3838 after the Docker container is started.
The processing of tdReports is carried out by the guppi() function. An
example of running the function:
library(GUPPI)
guppi(
"C:/Users/David Butcher/TDReports",
c(
"20200420_Excellent_TDReport_01.tdReport",
"20200420_Excellent_TDReport_02.tdReport"
),
83333,
GOLocType = "bacteria",
fractionAssignments = NULL,
outputdir = "C:/Users/David Butcher/guppi_output",
fdr = 0.01,
saveOutput = TRUE,
makeDashboard = TRUE
)Arguments to the guppi function are as follows:
-
filedirAdd the path to the folder containing tdReports to be processed. This directory can be at any level above the tdReports in the directory structure. -
filenameFull name of the tdReport file or files, including extension. Must match the name of file exactly. -
taxonNumberNCBI taxon number to use for adding UniProt data. Should match the taxon number from the TDPortal/ProSightPD analysis. -
outputdirDirectory to place output files. Directory will be created if it doesn’t exist, but parent directory must be an existing folder.
-
GOLocTypeName of taxon to use for determination of subcellular locations. Acceptable values are “bacteria” or “eukaryota”. -
fractionAssignmentsOptional argument. A named list with names set to the fraction numbers of the input files and values set to the input file names. If left blank, GUPPI will attempt to assign fraction numbers automatically from file names in the tdReport. -
fdrFalse detection rate to use for filtering results. Defaults to 1%. -
saveOutputBoolean value. Controls whether protein report, proteoform report, etc. are saved to output directory. Doesn’t effect GUPPI report. Defaults to TRUE. -
makeDashboardBoolean value (TRUE or FALSE) which controls whether an HTML report is generated. Defaults to FALSE. -
dashboardPathFull path and name of dashboard output (i.e. GUPPI report). Defaults to saving to the /report subdirectory of the output directory.
A connection is established to the SQLite database in the TD Report
using RSQLite. All protein- and proteoform-level IDs and other
relevant data for each ID are extracted. The taxon number is checked
against files in the package directory to see if a corresponding UniProt
taxon database has already been downloaded. The following UniProt
databases are included with the package: Escherichia coli,
Saccharomyces cerevisiae, Chlamydomonas reinhardtii,
Mycoplasma genitalium, Caenorhabditis elegans, Homo sapiens,
and Mus musculus.
If a database is not available, the UniProt web service is queried for
all UniProt accession numbers in the taxon using the package
UniProt.ws. Protein name, organism, organism taxon ID, protein
sequence, protein function, subcellular location, and any associated GO
IDs are returned. Note that some of these values may not be found and
come back as empty or NA. This process can take a long time, owing to
limitations with the UniProt web service.
The UniProt taxon database is used to add information for all IDs
extracted from the tdReport. GO terms are obtained for all GO IDs using
the GO.db package and terms corresponding to subcellular locations are
saved in column “GO_subcellular_locations”.
Minimum Q value from among all hits, average and monoisotopic masses, and data file for lowest Q value hit are obtained for all proteoforms. Proteoforms whose Q values are above the FDR cutoff are deleted. Proteoforms whose corresponding protein entry is above the FDR cutoff are also deleted.
The “main” output includes Q values, observed precursor masses, data
files, subcellular locations from the GO database and a variety of other
parameters for all protein and proteoform IDs. All protein and
proteoform IDs with Q values which are missing or greater than the
cutoff value (fdr) are deleted.
Output files are saved to the output directory (outputdir). Files are
timestamped with the time the script was initialized or share the same
name as the input file.
- Protein results are saved to
outputdir/protein_results/YYYYMMDD_hhmmss_protein_results.xlsx. - Proteoform results are saved to
outputdir/proteoform_results/YYYYMMDD_hhmmss_proteoform_results.xlsx. - Output xlsx files contain sheets corresponding to each input tdReport file and a final summary sheet containing all input file names and counts of proteins based on identified subcellular localizations.
- Proteoform results including all hits from the tdReport are saved to
outputdir/protein_results_allhits/{filename}_allhits.xlsx. - Results showing generalized subcellular localizations determined
from protein results for each tdReport are saved to
outputdir/protein_results_countsbyfraction/YYYYMMDD_hhmmss_countsbyfrac.xlsx. - If
makeDashboardis set to TRUE, an HTML report is saved tooutputdir/report.
Package dependencies are listed in the Imports section of the
DESCRIPTION file and include packages from CRAN, Bioconductor, and
Github.
Package developed by David S. Butcher and available under the MIT license.
