Table of Contents
- DMR Approach and Interpretation
- The Design Matrix and Covariates
DMR Approach and Interpretation
The main statistical approach applied by the executable script located in the
exec folder is
dmrseq::dmrseq(), which identifies DMRs in a two step approach:
- DMR Detection: The differences in CpG methylation for the effect of interest are pooled and smoothed to give CpG sites with higher coverage a higher weight, and candidate DMRs with a difference between groups are assembled.
- Statistical Analysis: A region statistic for each DMR, which is comparable across the genome, is estimated via the application of a generalized least squares (GLS) regression model with a nested autoregressive correlated error structure for the effect of interest. Then, permutation testing of a pooled null distribution enables the identification of significant DMRs This approach accounts for both inter-individual and inter-CpG variability across the entire genome.
The main estimate of a difference in methylation between groups is not a fold change but rather a beta coefficient, which is representative of the average effect size; however, it is on the scale of the arcsine transformed differences and must be divided by π (3.14) to be similar to the mean methylation difference over a DMR, which is provided in the
percentDifference column. Since the testing is permutation based, it provides empirical p-values as well as FDR corrected q-values.
One of the key differences between
dmrseq and other DMR identification packages, like
bsseq, is that
dmrseq is performing statistical testing on the DMRs themselves rather than testing for differences in single CpGs that are then assembled into DMRs like
bsseq::dmrFinder() does. This unique approach helps with controlling the false discovery rate and testing the correlated nature of CpG sites in a regulatory region, while also enabling complex experimental designs. However, since
dmrseq::dmrseq() does not provide individual smoothed methylation values,
bsseq::BSmooth() is utlized to generate individual smoothed methylation values from the DMRs. Therefore, while the DMRs themselves are adjusted for covariates, the indvidual smoothed methylation values for these DMRs are not adjusted for covaraites.
You can also read my general summary of the drmseq approach on EpiGenie.
Example DMR Each dot represents the methylation level of an individual CpG in a single sample, where the size of the dot is representative of coverage. The lines represent smoothed methylation levels for each sample, either control (blue) or DS (red). Genic and CpG annotations are shown below the plot.
No manual installation of R packages is required, since the required packages and updates will occur automatically upon running the executable script located in the
exec folder. However, the package does require Bioconductor 3.8, which you can install or update to using:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("BiocInstaller", version = "3.9")
Additionally, if you are interested in creating your own workflow as opposed to using the executable script, you can download the package using:
The Design Matrix and Covariates
This script requires a basic design matrix to identify the groups and covariates, which should be named
sample_info.xlsx and contain header columns to identify the factor. It is important to have the label for the experimental samples start with a letter in the alphabet that comes after the one used for control samples in order to obtain results for experimental vs. control rather than control vs. experimental. You can select which specific samples to analyze from the working directory through the design matrix, where pattern matching of the sample name will only select bismark cytosine report files with a matching name before the first underscore, which also means that sample names should not contain underscores. Within the script, covariates can be selected for adjustment. There are two different ways to adjust for covariates: directly adjust values or balance permutations.
Before running the executable, ensure you have the following project directory tree structure for the Bismark cytosine reports and design matrix:
├── Project │ ├── cytosine_reports │ │ ├── sample1_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz │ │ ├── sample2_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz │ │ ├── sample_info.xlsx
This workflow requires the following variables:
-g --genomeSelect either: hg38, mm10, rn6, or rheMac8.
-x --coverageCpG coverage cutoff for all samples, 1x is the default and minimum value.
-s --perGroupPercent of samples per a group for CpG coverage cutoff, values range from 0 to 1, 1 (100%) is the default.
-m --minCpGsMinimum number of CpGs for a DMR, 5 is default.
-p --maxPermsNumber of permutations for DMR and block analyses, 10 is default.
-o --cutoffThe cutoff value for the single CpG coefficient utilized to discover testable background regions, values range from 0 to 1, 0.05 (5%) is the default.
-t --testCovariateCovariate to test for significant differences between experimental and control, i.e. Diagnosis.
-a --adjustCovariateAdjust covariates that are continuous or contain two or more factor groups, i.e. "Age". More than one covariate can be adjusted for using single brackets and the
-m --matchCovariateCovariate to balance permutations, which is meant for two-group factor covariates in small sample sizes in order to prevent extremely unbalanced permutations. Only one covariate two-group factor can be balanced, i.e. Sex. Note: This will not work for larger sample sizes (> 500,000 permutations) and is not needed for them as the odds of sampling an extremely unbalanced permutation for a covariate decreases with increasing sample size.
-c --coresThe number of cores to use, 20 is recommended but you can go as low as 1, 8 is the default.
Below is an example of how to execute the main R script (DM.R) in the
exec folder on command line. This should be called from the working directory that contains the cytosine reports.
call="Rscript \ --vanilla \ /share/lasallelab/programs/DMRichR/DM.R \ --genome hg38 \ --coverage 1 \ --perGroup '1' \ --minCpGs 5 \ --maxPerms 10 \ --cutoff '0.05' \ --testCovariate Diagnosis \ --adjustCovariate 'BMI;Smoking' \ --matchCovariate Sex \ --cores 20" echo $call eval $call
UC Davis Example
If you are using the Barbera cluster at UC Davis, the following commands can be used to execute
DM.R from your login node (i.e. epigenerate), where
htop should be called first to make sure the whole node is available. This should be called from the working directory that contains the cytosine reports and not from within a
module load R/3.6.0 call="nohup \ Rscript \ --vanilla \ /share/lasallelab/programs/DMRichR/DM.R \ --genome hg38 \ --coverage 1 \ --perGroup '1' \ --minCpGs 5 \ --maxPerms 10 \ --cutoff '0.05' \ --testCovariate Diagnosis \ --adjustCovariate 'BMI;Smoking' \ --matchCovariate Sex \ --cores 60 \ > DMRichR.log 2>&1 &" echo $call eval $call echo $! > save_pid.txt
You can then check on the job using
tail -f DMRichR.log and ⌃ Control + c to exit the log view.
You can cancel the job from the project directory using
cat save_pid.txt | xargs kill. You can also check your running jobs using
ps -ef | grep , which should be followed by your username i.e.
ps -ef | grep blaufer. Finally, if you still see leftover processes in htop, you can cancel all your processes using
pkill -u, which should be followed by your username i.e.
pkill -u blaufer.
Alternatively, the executable can also be submitted to the cluster using the shell script via
This workflow provides the following files:
- DMRs and testable background regions
- Individual smoothed methylation values for DMRs and background regions
- DMR plots of individual smoothed methylation values
- Smoothed global and chromosomal methylation values and statistics
- PCA plots of 20 Kb windows (all genomes) and CpG island windows (hg38, mm10, and rn6)
- Heatmap of individual smoothed methylation values for DMRs
- An HTML report of annotated DMRs
- Gene region and CpG annotations and plots (only for hg38, mm10, or rn6)
- Manhattan and Q-Q plots
- Gene ontology and pathway enrichment tables and plots (enrichr and GOfuncR for all genomes, GREAT for hg38 and mm10)
- Large blocks of differential methylation and testable background blocks
- Block plots
If you use DMRichR in published research please cite the following 3 articles:
Laufer BI, Hwang H, Vogel Ciernia A, Mordaunt CE, LaSalle JM. Whole genome bisulfite sequencing of Down syndrome brain reveals regional DNA hypermethylation and novel disease insights. Epigenetics, 2019. doi: 10.1080/15592294.2019.1609867
Korthauer K, Chakraborty S, Benjamini Y, and Irizarry RA. Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics, 2018. doi: 10.1093/biostatistics/kxy007
Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 2012. doi: 10.1186/gb-2012-13-10-r83
This workflow is primarily based on the dmrseq and bsseq bioconductor packages. I would like to thank Keegan Korthauer, the creator of dmrseq, for helpful conceptual advice in establishing and optimizing this workflow. I would like to thank Matt Settles from the UC Davis Bioinformatics Core for advice on creating an R package and use of the tidyverse and also for help with the UC Davis example. I would like to thank Rochelle Coulson for a script that was developed into the PCA function. I would also like to thank Blythe Durbin-Johnson and Annie Vogel Ciernia for statistical consulting that enabled the global and chromosomal methylation statistics. I would like to thank Nikhil Joshi from the UC Davis Bioinformatics Core for troubleshooting of a resource issue. Finally, I would like to thank Ian Korf for invaluable discussions related to the bioinformatic approaches utilized in this repository.