Added the option to return dispersal and corrected orthogroup scores …

…as separate variables in calculate_H()
almeidasilvaf · Oct 7, 2023 · e9a3ad5 · e9a3ad5
1 parent ffae8b0
commit e9a3ad5
Show file tree

Hide file tree

Showing 5 changed files with 113 additions and 45 deletions.
diff --git a/.github/workflows/check-bioc.yml b/.github/workflows/check-bioc.yml
@@ -52,7 +52,7 @@ jobs:
       fail-fast: false
       matrix:
         config:
-          - { os: ubuntu-latest, r: '4.2', bioc: '3.15', cont: "bioconductor/bioconductor_docker:RELEASE_3_15", rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest" }
+          - { os: ubuntu-latest, r: '4.3', bioc: '3.17', cont: "bioconductor/bioconductor_docker:RELEASE_3_17", rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest" }
     env:
       R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
       RSPM: ${{ matrix.config.rspm }}

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -37,7 +37,7 @@ biocViews:
 Encoding: UTF-8
 LazyData: false
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.2.1
+RoxygenNote: 7.2.3
 SystemRequirements: BUSCO (>= 5.1.3) <https://busco.ezlab.org/>
 Imports:
     utils,

diff --git a/R/homology_detection.R b/R/homology_detection.R
@@ -1,4 +1,51 @@
 
+#' Correct orthogroup scores for overclustering
+#'
+#' @param homogeneity_df A 2-column data frame with
+#' variables \strong{Orthogroup} and \strong{Score} as returned
+#' by \code{calculate_H().}
+#' @param orthogroup_df Data frame with orthogroups and their associated genes
+#' and annotation. The columns \strong{Gene}, \strong{Orthogroup}, and
+#' \strong{Annotation} are mandatory, and they must represent Gene ID,
+#' Orthogroup ID, and Annotation ID (e.g., Interpro/PFAM), respectively.
+#' @param update_score Logical indicating whether to replace scores with
+#' corrected scores or not. If FALSE, the dispersal term and corrected scores
+#' are returned as separate variables in the output data frame.
+#'
+#' @return A data frame with the following variables:
+#' \describe{
+#'   \item{Orthogroup}{Character, orthogroup ID.}
+#'   \item{Score}{Numeric, orthogroup scores.}
+#'   \item{Dispersal}{Numeric, dispersal term. Only present
+#'         if \strong{update_score = FALSE}.}
+#'   \item{Score_c}{Numeric, corrected orthogroup scores. Only present if
+#'         \strong{update_score = FALSE}.}
+#' }
+#'
+#' @noRd
+overclustering_correction <- function(
+        homogeneity_df, orthogroup_df, update_score = TRUE
+) {
+
+    # Calculate % of domains present in 2+ OGs
+    dispersal <- split(orthogroup_df, orthogroup_df$Annotation)
+    dispersal <- unlist(lapply(dispersal, function(x) {
+        return(length(unique(x$Orthogroup)))
+    }))
+
+    dispersal_freq <- (sum(dispersal > 1) / length(dispersal))
+
+    if(update_score) {
+        homogeneity_df$Score <- homogeneity_df$Score / dispersal_freq
+    } else {
+        homogeneity_df$Dispersal <- dispersal_freq
+        homogeneity_df$Score_c <- homogeneity_df$Score - dispersal_freq
+    }
+
+    return(homogeneity_df)
+}
+
+
 #' Calculate homogeneity scores for orthogroups
 #'
 #' @param orthogroup_df Data frame with orthogroups and their associated genes
@@ -7,6 +54,13 @@
 #' Orthogroup ID, and Annotation ID (e.g., Interpro/PFAM), respectively.
 #' @param correct_overclustering Logical indicating whether to correct
 #' for overclustering in orthogroups. Default: TRUE.
+#' @param max_size Numeric indicating the maximum orthogroup size to consider.
+#' If orthogroups are too large, calculating Sorensen-Dice indices for all
+#' pairwise combinations could take a long time, so setting a limit prevents
+#' that. Default: 200.
+#' @param update_score Logical indicating whether to replace scores with
+#' corrected scores or not. If FALSE, the dispersal term and corrected scores
+#' are returned as separate variables in the output data frame.
 #'
 #' @details
 #' Homogeneity is calculated based on pairwise Sorensen-Dice similarity
@@ -15,12 +69,18 @@
 #' orthogroup share the same domain, the orthogroup will have a homogeneity
 #' score of 1. On the other hand, if genes in an orthogroup do not have any
 #' domain in common, the orthogroup will have a homogeneity score of 0.
+#' The percentage of orthogroups with size greater
+#' than \strong{max_size} will be subtracted from the homogeneity scores, since
+#' too large orthogroups typically have very low scores.
 #' Additionally, users can correct for overclustering by penalizing
 #' protein domains that appear in multiple orthogroups (default).
 #'
 #' @return A 2-column data frame with the variables \strong{Orthogroup}
 #' and \strong{Score}, corresponding to orthogroup ID and orthogroup score,
-#' respectively.
+#' respectively. If \strong{update_score = FALSE}, additional columns
+#' named \strong{Dispersal} and \strong{Score_c} are added, which correspond
+#' to the dispersal term and corrected scores, respectively.
+#'
 #' @export
 #' @rdname calculate_H
 #' @examples
@@ -30,22 +90,31 @@
 #' # Filter data to reduce run time
 #' orthogroup_df <- orthogroup_df[1:10000, ]
 #' H <- calculate_H(orthogroup_df)
-calculate_H <- function(orthogroup_df, correct_overclustering = TRUE) {
+calculate_H <- function(orthogroup_df, correct_overclustering = TRUE,
+                        max_size = 200, update_score = TRUE) {
 
     by_og <- split(orthogroup_df, orthogroup_df$Orthogroup)
 
+    # Calculate OG sizes
+    og_sizes <- vapply(by_og, function(x) {
+        return(length(unique(x$Gene)))
+    }, numeric(1))
+    perc_excluded <- (sum(og_sizes >= max_size) / length(og_sizes)) * 100
+
     # Calculate homogeneity scores
     sdice <- Reduce(rbind, lapply(by_og, function(x) {
-        genes <- unique(x$Gene)
+
+        ngenes <- length(unique(x$Gene))
         og <- unique(x$Orthogroup)
 
+        x <- x[!is.na(x$Annotation), ]
+        annot_genes <- unique(x$Gene)
+
         scores_df <- NULL
-        if(length(genes) > 1) {
-            # Create a list of domains for each gene
-            domains_per_gene <- split(x$Annotation, x$Gene)
+        if(length(annot_genes) > 1 & ngenes <= max_size) {
 
             # Calculate Sorensen-Dice indices for all pairwise combinations
-            combinations <- utils::combn(genes, 2, simplify = FALSE)
+            combinations <- utils::combn(annot_genes, 2, simplify = FALSE)
             scores <- lapply(combinations, function(y) {
                 d1 <- x$Annotation[x$Gene == y[1]]
                 d2 <- x$Annotation[x$Gene == y[2]]
@@ -55,7 +124,8 @@ calculate_H <- function(orthogroup_df, correct_overclustering = TRUE) {
                 s <- round(numerator / denominator, 2)
                 return(s)
             })
-            scores <- mean(Reduce(c, scores))
+            scores <- unlist(scores) - perc_excluded * 0.1
+            scores <- mean(scores)
 
             scores_df <- data.frame(
                 Orthogroup = og,
@@ -67,19 +137,9 @@ calculate_H <- function(orthogroup_df, correct_overclustering = TRUE) {
 
     # Account for overclustering
     if(correct_overclustering) {
-
-        n_domains <- length(unique(orthogroup_df$Annotation))
-        n_ogs <- length(unique(orthogroup_df$Orthogroup))
-
-        dispersal <- split(orthogroup_df, orthogroup_df$Annotation)
-        dispersal <- unlist(lapply(dispersal, function(x) {
-            return(length(unique(x$Orthogroup)))
-        }))
-
-        dispersal <- sum(dispersal) / (n_domains * n_ogs)
-        sdice$Score <- sdice$Score / dispersal
-
+        sdice <- overclustering_correction(sdice, orthogroup_df, update_score)
     }
+
     return(sdice)
 }
 

diff --git a/man/calculate_H.Rd b/man/calculate_H.Rd
diff --git a/vignettes/vignette_02_assessing_orthogroup_inference.Rmd b/vignettes/vignette_02_assessing_orthogroup_inference.Rmd
@@ -96,13 +96,13 @@ formula below:
 
 $$
 \begin{aligned}
-Scores &= \frac{Homogeneity}{Dispersal}
+Scores &= Homogeneity - Dispersal
 \\
 \end{aligned}
 $$
 
 
-The numerator, $homogeneity$, is the mean Sorensen-Dice index for all 
+The $homogeneity$ term is the mean Sorensen-Dice index for all 
 pairwise combinations of genes in an orthogroup. The Sorensen-Dice index
 measures how similar two genes are, and it ranges from 0 to 1, with 0 meaning
 that a gene pair does not share any protein domain, and 1 meaning that 
@@ -124,23 +124,11 @@ the orthogroup will have $homogeneity = 0$. If only some gene pairs share
 the same domain, $homogeneity$ will be somewhere between 0 and 1. 
 
 
-The denominator, $dispersal$, aims to correct for overclustering (i.e.,
+The $dispersal$ term aims to correct for overclustering (i.e.,
 orthogroup assignments that break "true" gene families into an artificially
-large number of smaller subfamilies). It is the mean number of 
-orthogroups containing the same protein domain corrected by 
-the number of orthogroup. Formally:
-
-
-$$
-\begin{aligned}
-Dispersal &=  \frac{1}{N_{domains} N_{OG}} \sum_{i=1}^{N_{domains}}D_{i} \\
-\\
-\end{aligned}
-$$
-
-
-where $N_{OG}$ is the number of orthogroups, and $D_{i}$ is the number of 
-orthogroups containing the protein domain $i$. This term penalizes
+large number of smaller subfamilies), and it is the relative frequency of 
+dispersed domains (i.e., domains that are split into multiple orthogroups).
+This term penalizes
 orthogroup assignments where the same protein domains appears in multiple 
 orthogroups. As orthogroups represent groups of genes that 
 evolved from a common ancestor, a protein domain being present in multiple 
@@ -151,10 +139,11 @@ convergent evolution.
 To calculate scores for each orthogroup, you can use the 
 function `assess_orthogroups()`. This function takes as input a list of 
 annotation data frames[^1] and an orthogroups data frame, and returns the 
-relative homogeneity scores of each orthogroup for each species. Note that
+relative homogeneity scores of each orthogroup for each species. 
+If you do not want to calculate scores separately by species, you can also 
+use the function `calculate_H()`. Note that
 if you don't want to take the dispersal into account, you can set
-`correct_overclustering = FALSE`. This will ignore the denominator of the
-score formula.
+`correct_overclustering = FALSE`.
 
 [^1]: **NOTE:** The names of the list elements must match the species 
 abbreviations in the column *Species* of the orthogroups data frame.