Skip to content

Commit

Permalink
Added the option to return dispersal and corrected orthogroup scores …
Browse files Browse the repository at this point in the history
…as separate variables in calculate_H()
  • Loading branch information
almeidasilvaf committed Oct 7, 2023
1 parent ffae8b0 commit e9a3ad5
Show file tree
Hide file tree
Showing 5 changed files with 113 additions and 45 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/check-bioc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ jobs:
fail-fast: false
matrix:
config:
- { os: ubuntu-latest, r: '4.2', bioc: '3.15', cont: "bioconductor/bioconductor_docker:RELEASE_3_15", rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest" }
- { os: ubuntu-latest, r: '4.3', bioc: '3.17', cont: "bioconductor/bioconductor_docker:RELEASE_3_17", rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest" }
env:
R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
RSPM: ${{ matrix.config.rspm }}
Expand Down
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ biocViews:
Encoding: UTF-8
LazyData: false
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
RoxygenNote: 7.2.3
SystemRequirements: BUSCO (>= 5.1.3) <https://busco.ezlab.org/>
Imports:
utils,
Expand Down
100 changes: 80 additions & 20 deletions R/homology_detection.R
Original file line number Diff line number Diff line change
@@ -1,4 +1,51 @@

#' Correct orthogroup scores for overclustering
#'
#' @param homogeneity_df A 2-column data frame with
#' variables \strong{Orthogroup} and \strong{Score} as returned
#' by \code{calculate_H().}
#' @param orthogroup_df Data frame with orthogroups and their associated genes
#' and annotation. The columns \strong{Gene}, \strong{Orthogroup}, and
#' \strong{Annotation} are mandatory, and they must represent Gene ID,
#' Orthogroup ID, and Annotation ID (e.g., Interpro/PFAM), respectively.
#' @param update_score Logical indicating whether to replace scores with
#' corrected scores or not. If FALSE, the dispersal term and corrected scores
#' are returned as separate variables in the output data frame.
#'
#' @return A data frame with the following variables:
#' \describe{
#' \item{Orthogroup}{Character, orthogroup ID.}
#' \item{Score}{Numeric, orthogroup scores.}
#' \item{Dispersal}{Numeric, dispersal term. Only present
#' if \strong{update_score = FALSE}.}
#' \item{Score_c}{Numeric, corrected orthogroup scores. Only present if
#' \strong{update_score = FALSE}.}
#' }
#'
#' @noRd
overclustering_correction <- function(
homogeneity_df, orthogroup_df, update_score = TRUE
) {

# Calculate % of domains present in 2+ OGs
dispersal <- split(orthogroup_df, orthogroup_df$Annotation)
dispersal <- unlist(lapply(dispersal, function(x) {
return(length(unique(x$Orthogroup)))
}))

dispersal_freq <- (sum(dispersal > 1) / length(dispersal))

if(update_score) {
homogeneity_df$Score <- homogeneity_df$Score / dispersal_freq
} else {
homogeneity_df$Dispersal <- dispersal_freq
homogeneity_df$Score_c <- homogeneity_df$Score - dispersal_freq
}

return(homogeneity_df)
}


#' Calculate homogeneity scores for orthogroups
#'
#' @param orthogroup_df Data frame with orthogroups and their associated genes
Expand All @@ -7,6 +54,13 @@
#' Orthogroup ID, and Annotation ID (e.g., Interpro/PFAM), respectively.
#' @param correct_overclustering Logical indicating whether to correct
#' for overclustering in orthogroups. Default: TRUE.
#' @param max_size Numeric indicating the maximum orthogroup size to consider.
#' If orthogroups are too large, calculating Sorensen-Dice indices for all
#' pairwise combinations could take a long time, so setting a limit prevents
#' that. Default: 200.
#' @param update_score Logical indicating whether to replace scores with
#' corrected scores or not. If FALSE, the dispersal term and corrected scores
#' are returned as separate variables in the output data frame.
#'
#' @details
#' Homogeneity is calculated based on pairwise Sorensen-Dice similarity
Expand All @@ -15,12 +69,18 @@
#' orthogroup share the same domain, the orthogroup will have a homogeneity
#' score of 1. On the other hand, if genes in an orthogroup do not have any
#' domain in common, the orthogroup will have a homogeneity score of 0.
#' The percentage of orthogroups with size greater
#' than \strong{max_size} will be subtracted from the homogeneity scores, since
#' too large orthogroups typically have very low scores.
#' Additionally, users can correct for overclustering by penalizing
#' protein domains that appear in multiple orthogroups (default).
#'
#' @return A 2-column data frame with the variables \strong{Orthogroup}
#' and \strong{Score}, corresponding to orthogroup ID and orthogroup score,
#' respectively.
#' respectively. If \strong{update_score = FALSE}, additional columns
#' named \strong{Dispersal} and \strong{Score_c} are added, which correspond
#' to the dispersal term and corrected scores, respectively.
#'
#' @export
#' @rdname calculate_H
#' @examples
Expand All @@ -30,22 +90,31 @@
#' # Filter data to reduce run time
#' orthogroup_df <- orthogroup_df[1:10000, ]
#' H <- calculate_H(orthogroup_df)
calculate_H <- function(orthogroup_df, correct_overclustering = TRUE) {
calculate_H <- function(orthogroup_df, correct_overclustering = TRUE,
max_size = 200, update_score = TRUE) {

by_og <- split(orthogroup_df, orthogroup_df$Orthogroup)

# Calculate OG sizes
og_sizes <- vapply(by_og, function(x) {
return(length(unique(x$Gene)))
}, numeric(1))
perc_excluded <- (sum(og_sizes >= max_size) / length(og_sizes)) * 100

# Calculate homogeneity scores
sdice <- Reduce(rbind, lapply(by_og, function(x) {
genes <- unique(x$Gene)

ngenes <- length(unique(x$Gene))
og <- unique(x$Orthogroup)

x <- x[!is.na(x$Annotation), ]
annot_genes <- unique(x$Gene)

scores_df <- NULL
if(length(genes) > 1) {
# Create a list of domains for each gene
domains_per_gene <- split(x$Annotation, x$Gene)
if(length(annot_genes) > 1 & ngenes <= max_size) {

# Calculate Sorensen-Dice indices for all pairwise combinations
combinations <- utils::combn(genes, 2, simplify = FALSE)
combinations <- utils::combn(annot_genes, 2, simplify = FALSE)
scores <- lapply(combinations, function(y) {
d1 <- x$Annotation[x$Gene == y[1]]
d2 <- x$Annotation[x$Gene == y[2]]
Expand All @@ -55,7 +124,8 @@ calculate_H <- function(orthogroup_df, correct_overclustering = TRUE) {
s <- round(numerator / denominator, 2)
return(s)
})
scores <- mean(Reduce(c, scores))
scores <- unlist(scores) - perc_excluded * 0.1
scores <- mean(scores)

scores_df <- data.frame(
Orthogroup = og,
Expand All @@ -67,19 +137,9 @@ calculate_H <- function(orthogroup_df, correct_overclustering = TRUE) {

# Account for overclustering
if(correct_overclustering) {

n_domains <- length(unique(orthogroup_df$Annotation))
n_ogs <- length(unique(orthogroup_df$Orthogroup))

dispersal <- split(orthogroup_df, orthogroup_df$Annotation)
dispersal <- unlist(lapply(dispersal, function(x) {
return(length(unique(x$Orthogroup)))
}))

dispersal <- sum(dispersal) / (n_domains * n_ogs)
sdice$Score <- sdice$Score / dispersal

sdice <- overclustering_correction(sdice, orthogroup_df, update_score)
}

return(sdice)
}

Expand Down
23 changes: 21 additions & 2 deletions man/calculate_H.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

31 changes: 10 additions & 21 deletions vignettes/vignette_02_assessing_orthogroup_inference.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -96,13 +96,13 @@ formula below:

$$
\begin{aligned}
Scores &= \frac{Homogeneity}{Dispersal}
Scores &= Homogeneity - Dispersal
\\
\end{aligned}
$$


The numerator, $homogeneity$, is the mean Sorensen-Dice index for all
The $homogeneity$ term is the mean Sorensen-Dice index for all
pairwise combinations of genes in an orthogroup. The Sorensen-Dice index
measures how similar two genes are, and it ranges from 0 to 1, with 0 meaning
that a gene pair does not share any protein domain, and 1 meaning that
Expand All @@ -124,23 +124,11 @@ the orthogroup will have $homogeneity = 0$. If only some gene pairs share
the same domain, $homogeneity$ will be somewhere between 0 and 1.


The denominator, $dispersal$, aims to correct for overclustering (i.e.,
The $dispersal$ term aims to correct for overclustering (i.e.,
orthogroup assignments that break "true" gene families into an artificially
large number of smaller subfamilies). It is the mean number of
orthogroups containing the same protein domain corrected by
the number of orthogroup. Formally:


$$
\begin{aligned}
Dispersal &= \frac{1}{N_{domains} N_{OG}} \sum_{i=1}^{N_{domains}}D_{i} \\
\\
\end{aligned}
$$


where $N_{OG}$ is the number of orthogroups, and $D_{i}$ is the number of
orthogroups containing the protein domain $i$. This term penalizes
large number of smaller subfamilies), and it is the relative frequency of
dispersed domains (i.e., domains that are split into multiple orthogroups).
This term penalizes
orthogroup assignments where the same protein domains appears in multiple
orthogroups. As orthogroups represent groups of genes that
evolved from a common ancestor, a protein domain being present in multiple
Expand All @@ -151,10 +139,11 @@ convergent evolution.
To calculate scores for each orthogroup, you can use the
function `assess_orthogroups()`. This function takes as input a list of
annotation data frames[^1] and an orthogroups data frame, and returns the
relative homogeneity scores of each orthogroup for each species. Note that
relative homogeneity scores of each orthogroup for each species.
If you do not want to calculate scores separately by species, you can also
use the function `calculate_H()`. Note that
if you don't want to take the dispersal into account, you can set
`correct_overclustering = FALSE`. This will ignore the denominator of the
score formula.
`correct_overclustering = FALSE`.

[^1]: **NOTE:** The names of the list elements must match the species
abbreviations in the column *Species* of the orthogroups data frame.
Expand Down

0 comments on commit e9a3ad5

Please sign in to comment.