mutmatch is an R package for estimating somatic selection on a gene. It supports two neutral baselines for the mutation rate: the gene's neighboring genes, or low-impact mutations identified by CADD score.
Before installing the package, it's recommended to set up an isolated conda environment to ensure compatibility and avoid conflicts with other packages.
- If you don't have conda or mamba installed, download and install Miniconda or Anaconda.
- Create a new conda environment using the provided environment.yml:
conda env create -f environment.yml
- Activate the newly created conda environment:
conda activate mutmatch
You can install mutmatch from GitHub using:
# If the devtools package is not installed, uncomment the following line:
# install.packages("devtools")
devtools::install_github("ebesedina/mutmatch")
- Estimate somatic selection forces using a gene's neighboring baseline mutation rates.
- Use low-impact mutations estimated with CADD score as an alternative baseline.
- Comprehensive filtering options for genomic regions.
- Flexible statistical modeling for selection.
The following example demonstrates how to estimate selection for the KRAS gene, using neighboring genes as the neutral mutation rate baseline:
# Load the mutmatch library
library(mutmatch)
# Execute the function to get selection estimates using neighboring genes
selection_estimates <- get_selection_estimates_neighbors(
hgnc = "KRAS",
mutationsPath = system.file("extdata", "example_mutations.csv.gz", package = "mutmatch"),
annotationGenePath = system.file("extdata", "example_gene_annotation.csv.gz", package = "mutmatch"),
annotationGenomeWidePath = system.file("extdata", "example_genomewide_annotation.csv", package = "mutmatch"),
neighborsWindow = "0.5Mb",
outlierNeighborsThreshold = 0.2
)
For further details please follow this tutorial.
First, download the Combined Annotation-Dependent Depletion (CADD) scores. Note that these scores are not included in the package due to their large file size.
Replace "your/destination/path/CADD_GRCh37-v1.4.bw" with the path where you want to save the file.
# Load the mutmatch library
library(mutmatch)
# Specify the path to store the CADD file
caddScoresPath = "your/destination/path/CADD_GRCh37-v1.4.bw"
# Download the CADD file
download_cadd_file(caddScoresPath = caddScoresPath)
For Windows machines, download the CADD scores directly from this link.
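Whichever download route is used, it can help to confirm that the file is in place before running the analysis. The following is a minimal sketch in base R; the helper name cadd_file_ready is hypothetical and not part of mutmatch, and the path is the same placeholder used above.

```r
# Hypothetical helper (not part of mutmatch): confirm that the CADD file is
# present, is a regular file, and is non-empty before passing it downstream.
cadd_file_ready <- function(path) {
  info <- file.info(path)  # returns NA fields when the path does not exist
  !is.na(info$size) && !info$isdir && info$size > 0
}

# Placeholder path; replace with your own destination
caddScoresPath <- "your/destination/path/CADD_GRCh37-v1.4.bw"
if (!cadd_file_ready(caddScoresPath)) {
  message("CADD file not found or empty: ", caddScoresPath)
}
```

This only checks that the file exists and has content; it does not validate that the file is a well-formed bigWig.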
After obtaining the CADD scores, execute the following command to estimate selection using low-CADD regions as a neutral mutational rate baseline.
# Execute the function to get selection estimates using low-CADD regions
selection_estimates <- get_selection_estimates_cadd(
hgnc = "KRAS",
caddScoresPath = caddScoresPath,
mutationsPath = system.file("extdata", "example_mutations.csv.gz", package = "mutmatch"),
annotationGenePath = system.file("extdata", "example_gene_annotation.csv.gz", package = "mutmatch"),
annotationGenomeWidePath = system.file("extdata", "example_genomewide_annotation.csv", package = "mutmatch")
)
The package includes example datasets for convenient testing and demonstration:
- example_mutations.csv.gz: Example mutations dataset.
- example_gene_annotation.csv.gz: Example gene annotation dataset.
- example_genomewide_annotation.csv: Example genome-wide annotation dataset.
These files can be found under inst/extdata/ and accessed using the system.file() function.
For example:
genomewide_annotation = data.table::fread(system.file("extdata",
"example_genomewide_annotation.csv", package = "mutmatch"))
The MutMatch estimates of overall and conditional selection, along with their p-values, can be distributionally inaccurate when derived from sparse datasets. To mitigate this, the fit_selection_model function applies a debiasing step: it simulates a null distribution of selection estimates under the assumption of neutral selection, then subtracts the median of this null distribution from the observed selection estimate to obtain a corrected selection estimate.
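The median-subtraction idea can be illustrated with a minimal, self-contained sketch. It does not call fit_selection_model or reproduce its internals; the helper name debias_estimate, the normally distributed null simulation, and all numeric values are assumptions made purely for illustration.

```r
# Hypothetical helper (not the mutmatch API): subtract the median of a
# simulated null distribution from an observed selection estimate.
debias_estimate <- function(observed, null_estimates) {
  stopifnot(is.numeric(observed), length(observed) == 1L,
            is.numeric(null_estimates), length(null_estimates) > 0L)
  observed - stats::median(null_estimates)
}

set.seed(1)
# Assumed stand-in for estimates simulated under neutrality: with sparse
# data the null distribution may be shifted away from zero (here by +0.3).
null_estimates <- rnorm(1000, mean = 0.3, sd = 0.5)

observed_estimate <- 1.1  # made-up raw estimate
corrected_estimate <- debias_estimate(observed_estimate, null_estimates)
```

The median is used here, as in the description above, because it is less sensitive than the mean to extreme simulated estimates.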
To address potential biases in p-values caused by sparse data, the post_process_pvalues function adds a further step. It corrects p-values by comparing them against a distribution of p-values obtained from randomized data: each observed p-value is mapped to its quantile within this simulated distribution, which adjusts for deflation or inflation in the original p-values. For best results, apply this function across a wide range of genes, since a larger pool of simulated null p-values gives finer resolution.
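A minimal sketch of quantile mapping against a null distribution is shown below. It is not the implementation of post_process_pvalues; the helper name map_pvalues_to_null, the simulated null p-values, and the example inputs are all assumptions for illustration.

```r
# Hypothetical helper (not the mutmatch API): map each observed p-value to
# its quantile within p-values obtained from randomized (null) data.
map_pvalues_to_null <- function(observed_p, null_p) {
  stopifnot(is.numeric(observed_p), is.numeric(null_p), length(null_p) > 0L)
  null_cdf <- stats::ecdf(null_p)  # empirical CDF of the null p-values
  null_cdf(observed_p)             # fraction of null p-values <= observed
}

set.seed(1)
# Assumed stand-in for null p-values that are skewed towards 0, so that raw
# p-values would look more significant than they should.
null_p <- runif(5000)^2

observed_p <- c(0.001, 0.01, 0.25)  # made-up raw p-values
corrected_p <- map_pvalues_to_null(observed_p, null_p)
```

In this sketch the smallest non-zero corrected value is 1 / length(null_p), and an observed p-value below every null p-value maps to 0. This is one way to see why pooling null p-values across many genes improves resolution.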
This package is released under the MIT License. For more details, see the LICENSE.md file.