dien-n-nguyen/SeuratToGO

SeuratToGO

Description

In single-cell RNA sequencing (scRNA-seq), clusters are groups of cells that exhibit similar gene expression patterns. The primary goal of clustering in scRNA-seq analysis is to identify and group together cells that share similar transcriptional profiles. Each cluster represents a distinct population of cells with potentially similar cell types, biological states, or functions. Seurat is a popular R package for carrying out the pre-processing, clustering and visualization steps in scRNA-seq analysis.

SeuratToGO processes Seurat’s differential expression markers (the output of Seurat’s FindAllMarkers() function). It reformats the gene markers so that they can be submitted for gene ontology (GO) analysis with DAVID (Database for Annotation, Visualization and Integrated Discovery). It also provides functions for analyzing and visualizing the DAVID output files.

Currently, users have to manually separate the clusters in Seurat’s markers data frame using Excel and export each cluster as a tab-delimited text file to upload to DAVID. They then have to manually combine all the DAVID output files (one per cluster) for further analysis.
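The manual separation step can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy `markers` data frame, the `out_dir` location and the `cluster_<n>.txt` file names are assumptions made for illustration. The `cluster` and `gene` columns are the ones FindAllMarkers() output is known to contain.

```r
# Toy stand-in for FindAllMarkers() output (real output has more columns).
markers <- data.frame(
  cluster = factor(c(0, 0, 1, 1, 1)),
  gene = c("CCR7", "LEF1", "S100A9", "S100A8", "LYZ"),
  p_val_adj = c(1e-50, 1e-40, 1e-100, 1e-90, 1e-80),
  stringsAsFactors = FALSE
)

# One character vector of gene symbols per cluster.
genes_by_cluster <- split(markers$gene, markers$cluster)

# Write each cluster to its own tab-delimited file.
# The directory and file names are hypothetical.
out_dir <- file.path(tempdir(), "clusters")
dir.create(out_dir, showWarnings = FALSE)
out_files <- vapply(names(genes_by_cluster), function(cl) {
  path <- file.path(out_dir, paste0("cluster_", cl, ".txt"))
  write.table(data.frame(gene = genes_by_cluster[[cl]]), path,
              sep = "\t", quote = FALSE, row.names = FALSE)
  path
}, character(1))
```

Each resulting file holds a single `gene` column and could be uploaded to DAVID as a gene list.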

The package includes the main components of an R package: DESCRIPTION, NAMESPACE, and the man and R subdirectories. It also includes LICENSE, README, and the subdirectories vignettes, tests, data and inst. The SeuratToGO package was developed using R version 4.3.2 (2023-10-31 ucrt), Platform: x86_64-w64-mingw32/x64 (64-bit) and Running under: Windows 11 x64 (build 22621).

Installation

You can install the development version of SeuratToGO from GitHub with:

install.packages("devtools")
library("devtools")
devtools::install_github("dien-n-nguyen/SeuratToGO", build_vignettes = TRUE)
library("SeuratToGO")

To run the Shiny app:

SeuratToGO::run_SeuratToGO()

Overview

ls("package:SeuratToGO")
data(package = "SeuratToGO") 
browseVignettes("SeuratToGO")

SeuratToGO contains 5 functions.

  1. separate_clusters for separating the differentially expressed markers data frame generated by Seurat and exporting it as a tab-delimited text file.

  2. combine_david_files for combining all the DAVID output files into a list of data frames.

  3. get_top_processes for getting the top processes for one specified cluster. The output is a data frame in which each row is a biological process and each column is a property of that process, for example genes, p-value, population, etc. This gives a closer look at each cluster.

  4. get_all_top_processes to get the p-values of the top processes for every cluster and consolidate them into one data frame.

  5. top_processes_heatmap for generating a heatmap of all the top processes in each cluster.
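The idea behind combining the DAVID files and consolidating the top processes can be sketched in base R. This is only an illustration under stated assumptions, not the package's code: the toy `david` list, the `Term` and `PValue` column names, the `top_n` cut-off and the choice to fill missing terms with a p-value of 1 are all assumptions.

```r
# Toy stand-ins for two DAVID output tables, one per cluster.
# The column names "Term" and "PValue" are assumed.
david <- list(
  cluster_0 = data.frame(
    Term = c("GO:0006955~immune response", "GO:0006412~translation"),
    PValue = c(1e-8, 1e-5)
  ),
  cluster_1 = data.frame(
    Term = c("GO:0006954~inflammatory response", "GO:0006955~immune response"),
    PValue = c(1e-12, 1e-6)
  )
)

# Collect the top_n most significant terms of every cluster.
top_n <- 2
top_terms <- unique(unlist(lapply(david, function(df) {
  head(df$Term[order(df$PValue)], top_n)
})))

# Rows are processes, columns are clusters.
# A term that a cluster does not report gets p = 1 (an assumption).
p_mat <- sapply(david, function(df) {
  p <- df$PValue[match(top_terms, df$Term)]
  ifelse(is.na(p), 1, p)
})
rownames(p_mat) <- top_terms

# -log10 scale, so that larger values mean more significant.
score <- -log10(p_mat)
```

A matrix such as `score` could then be passed to a heatmap function, for example pheatmap::pheatmap(), to compare processes across clusters.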

The package also contains a dataset called pbmc_markers, which holds the differentially expressed markers generated by following Seurat’s tutorial. In addition, inst/extdata/ contains a zip archive called david.zip with sample DAVID output files for users who want to view them.
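As a hedged sketch, the bundled files could be located with base R once the package is installed. The snippet assumes that david.zip is installed under extdata, as is usual for files in inst/extdata/, and it does nothing beyond printing a message when the package is absent.

```r
# system.file() returns "" when the package or file is not found.
zip_path <- system.file("extdata", "david.zip", package = "SeuratToGO")

if (nzchar(zip_path)) {
  # List the sample DAVID output files without extracting them.
  print(unzip(zip_path, list = TRUE)$Name)

  # Load the example markers dataset and inspect its structure.
  data("pbmc_markers", package = "SeuratToGO")
  str(pbmc_markers)
} else {
  message("SeuratToGO does not appear to be installed.")
}
```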

An overview of the package is illustrated below. The steps highlighted in yellow are not supported by this package, since DAVID’s API does not support the type of gene IDs we are working with. See the vignette for more details.

Contributions

The author of the package is Dien Nguyen. The author wrote all 5 functions mentioned above. separate_clusters uses the package magrittr for piping and the package dplyr for filtering and selecting. get_top_processes uses dplyr to sort data frames. top_processes_heatmap uses the package pheatmap to generate the heatmap. The pbmc_markers dataset was generated by following Seurat’s clustering tutorial. The DAVID output files were generated using the DAVID web server.

References

  • Bache S, Wickham H. 2022. magrittr: A Forward-Pipe Operator for R. https://magrittr.tidyverse.org, https://github.com/tidyverse/magrittr.

  • Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological). 57(1):289–300. doi:10.1111/j.2517-6161.1995.tb02031.x.

  • Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. 2018. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 36(5):411–420. doi:10.1038/nbt.4096.

  • Kolde R. 2019. pheatmap: Pretty Heatmaps. https://github.com/raivokolde/pheatmap

  • R Core Team. 2023. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  • Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, Imamichi T, Chang W. 2022. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50(W1):W216–W221. doi:10.1093/nar/gkac194.

  • Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. 2019. Comprehensive Integration of Single-Cell Data. Cell. 177(7):1888-1902.e21. doi:10.1016/j.cell.2019.05.031.

  • Wickham H, Bryan J. 2019. R Packages (2nd edition). Newton, Massachusetts: O’Reilly Media. https://r-pkgs.org/

  • Wickham H, François R, Henry L, Müller K, Vaughan D. 2023. dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.

Acknowledgements

This package was developed as part of an assessment for the 2022-2023 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, Canada. SeuratToGO welcomes issues, enhancement requests, and other contributions. To submit an issue, use GitHub Issues. Many thanks to those who provided feedback to improve this package.
