MethylBench

Reproducible analysis code for a systematic benchmark of six DNA methylation profiling technologies across diverse sequencing platforms.


Overview

MethylBench systematically compares six widely used DNA methylation profiling technologies:

| Technology | Type | CpG Coverage |
|---|---|---|
| Illumina EPIC array | Array-based | ~850k CpGs |
| TWIST Methylation Panel | Targeted short-read | 2–4 million CpGs |
| Whole-Genome Enzymatic Conversion (WGEC) | Genome-wide short-read | 28–30 million CpGs |
| Reduced Representation Bisulfite Sequencing (RRBS) | Enrichment-based short-read | 1–4 million CpGs |
| Oxford Nanopore Technologies (ONT) | Long-read | 28–30 million CpGs |
| Pacific Biosciences (PacBio) | Long-read | 28–30 million CpGs |

Analyses were performed on matched blood and fibroblast samples from 5 individuals and two Genome in a Bottle (GIAB) reference samples (HG001, HG002).


Repository Structure

MethylBench/
│
├── README.md
├── LICENSE
├── .gitignore
│
├── scripts/
│   ├── bash/
│   │   ├── 01_modkit_pileup.sh			# modkit, extract methylation information from aligned .bam files.
│   │   ├── 02_toulligqc.sh			# ToulligQC, perform QC analysis on aligned .bam files and create intermediary files.
│   │   ├── 03_pbcpgtools.sh			# pb-cpg-tools, extract methylation information from PacBio alignment files.         		
│   │   └── 04_methylseq.sh			# nf-core/methylseq, run the nextflow methylseq pipeline for standard short-read data processing.         		
│   ├── python/
│   │   ├── 05_parse_toulligqc.py	       	# Summarize ToulligQC .data files into one QC table.
│   └── R/
│       ├── 06_visualize_toulligqc_summary.R 	# Visualization for ONT QC reports. 
│       ├── 07_generate_cpg_stats.R		# Summarize CpG information for further analysis.
│       ├── 08_qc_visualization.R 		# QC Visualization, Figure 3. 
│       ├── 09_correlation_analysis.R 		# Correlation Analysis, Figure 4. 
│       ├── 10_density_plots.R 			# Methylation Density Analysis, Figure 5. 
│       ├── 11_pca.R 				# Principal Component Analysis, Figure 6. 
│       ├── 12_differential_methylation.R 	# Differential Methylation Analysis, Figure 7. 
│       ├── 13_annotation.R 			# DMC annotation, Figure S14.
│       ├── limma_diff_meth.R 	# limma-based differential methylation (needed for some figures in 12_differential_methylation.R).
│       └── helpers.R				# Helper functionality.
│
└── envs/
    ├── environment.yml				# Basic environment, Tools and Python utility
    ├── ont.yml					# ONT related tools, modkit, toulligQC
    ├── pacbio.yml				# PacBio specific tool, pb-cpg-tools
    └── r_analysis.yml				# R-related packages for R-analysis

Requirements

All tool-specific dependencies are managed via Conda environments defined in envs/.


Quick Start

# Clone repository
git clone https://github.com/epigenetics-sb/MethylBench
cd MethylBench

# Create environments
cd envs/
conda env create -f environment.yml
conda env create -f ont.yml
conda env create -f pacbio.yml
conda env create -f r_analysis.yml

cd ..
################################################################################################
# NOTE: Check each script for its expected arguments and folder layout before running!
################################################################################################

# Run preprocessing steps
# Each bash script prints its usage and required arguments with --help.
conda activate ont

bash scripts/bash/01_modkit_pileup.sh --help
bash scripts/bash/02_toulligqc.sh --help

conda deactivate 
conda activate pacbio

bash scripts/bash/03_pbcpgtools.sh --help

conda deactivate
conda activate methylbench

bash scripts/bash/04_methylseq.sh --help

# Summarize ToulligQC output into one QC table
python3 scripts/python/05_parse_toulligqc.py

# Run actual analysis in R
Rscript scripts/R/06_visualize_toulligqc_summary.R
Rscript scripts/R/07_generate_cpg_stats.R
Rscript scripts/R/08_qc_visualization.R
Rscript scripts/R/09_correlation_analysis.R
Rscript scripts/R/10_density_plots.R
Rscript scripts/R/11_pca.R
Rscript scripts/R/limma_diff_meth.R
Rscript scripts/R/12_differential_methylation.R
Rscript scripts/R/13_annotation.R
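
The ToulligQC summarizing step merges several per-sample report files into a single table. The following is a minimal sketch of that idea only, not the code in 05_parse_toulligqc.py. It assumes each .data file holds plain KEY=VALUE lines; the function names, the metric keys, and the use of the file stem as the sample ID are hypothetical.

```python
import csv
from pathlib import Path


def parse_data_file(path):
    """Read KEY=VALUE lines into a dict (assumed layout, see lead-in)."""
    metrics = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comment lines, and lines without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        metrics[key.strip()] = value.strip()
    return metrics


def build_qc_table(data_files, out_path):
    """Write a tab-separated table: one row per file, one column per metric."""
    # Hypothetical convention: the file stem is used as the sample ID.
    rows = {Path(p).stem: parse_data_file(p) for p in data_files}
    columns = sorted({key for metrics in rows.values() for key in metrics})
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample"] + columns)
        for sample in sorted(rows):
            # Metrics missing from a file are written as NA.
            writer.writerow([sample] + [rows[sample].get(c, "NA") for c in columns])
    return columns
```

Calling `build_qc_table(["sampleA.data", "sampleB.data"], "qc_summary.tsv")` would write one row per input file. Metrics that are absent from a file show up as NA rather than being dropped, so the column set stays stable across samples.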

Data Availability

Processed methylation matrices are available from the corresponding author upon request.

GIAB reference samples (HG001/NA12878, HG002/NA24385) including PacBio methylation data are publicly available via the PacBio website.


Input Data: QC Statistics Tables

Two pre-computed summary tables are required as input for the downstream R analysis scripts. Both are provided in data/stats/:

cpg_stats_methylbench.tab – fully reproducible

Per-sample counts of overlapping CpGs at increasing coverage thresholds, computed across all methods simultaneously. Generated programmatically from the raw per-CpG methylation files:

Rscript scripts/R/07_generate_cpg_stats.R \
  --samplesheet [samplesheet.tsv] \
  --datadir     data/ \
  --outdir      data/stats/
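
To illustrate what "overlapping CpGs at increasing coverage thresholds" means, here is a minimal Python sketch of the counting logic. It is not the implementation in 07_generate_cpg_stats.R. The function name, the input layout (one dict of per-CpG coverage per method), and the threshold values are assumptions made for the example.

```python
def count_overlapping_cpgs(coverage_by_method, thresholds=(1, 5, 10, 20)):
    """Count CpGs that reach each coverage threshold in every method.

    coverage_by_method: {method: {(chrom, pos): coverage}} for one sample.
    The threshold values are illustrative, not those used in the manuscript.
    """
    methods = list(coverage_by_method.values())
    if not methods:
        return {t: 0 for t in thresholds}
    counts = {}
    for t in thresholds:
        # For each method, keep the sites with coverage >= t ...
        passing = [{site for site, cov in m.items() if cov >= t} for m in methods]
        # ... then count only the sites that pass in all methods at once.
        counts[t] = len(set.intersection(*passing))
    return counts


example = {
    "ONT": {("chr1", 100): 12, ("chr1", 200): 4, ("chr1", 300): 30},
    "RRBS": {("chr1", 100): 25, ("chr1", 200): 8},
}
print(count_overlapping_cpgs(example))  # {1: 2, 5: 1, 10: 1, 20: 0}
```

Because a site must pass in every method, the count can only stay the same or shrink as the threshold rises, and the method with the sparsest coverage bounds the result.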

qc_stats_methylbench.tab – manually compiled

Per-sample × per-method QC metrics. This table was assembled manually by extracting summary statistics from the QC reports of each tool:

| Column | Description | Source tool / file |
|---|---|---|
| Mean_meth_general | Global mean methylation | Bismark summary report / modkit stats |
| Mean_meth_overlapped | Mean methylation at overlapping CpGs | Computed from merged matrices |
| Mean_meth_10x | Mean methylation at ≥10× CpGs | Computed from merged matrices |
| Passed_reads | Fraction of reads passing QC | nf-core/methylseq MultiQC report (RRBS/WGEC/TWIST), ToulligQC report (ONT) |
| Mean_readlength_passed | Mean read length of passing reads | MultiQC (short-read), ToulligQC .data report (ONT) |
| Mean_Cov | Mean CpG-level coverage | Bismark coverage report / modkit / pb-cpg-tools |
| Insert_size | Mean insert size (TWIST only) | Picard InsertSizeMetrics via MultiQC |
| Unique_alignments | Number of uniquely aligned reads | Bismark alignment report / modkit stats |
| Mean_Genome_Cov | Mean genome-wide coverage (ONT only) | samtools coverage summary |

Note: Because qc_stats_methylbench.tab aggregates heterogeneous per-tool reports that do not share a common machine-readable format, it was compiled manually and is not auto-generated by this pipeline.
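
The three mean-methylation columns differ only in which CpGs enter the average. The sketch below shows that distinction in Python. It assumes an unweighted mean over per-CpG methylation fractions; the function name and input layout are hypothetical, and the values in the table may have been derived differently.

```python
from statistics import mean


def mean_methylation(sites, min_cov=0, restrict_to=None):
    """Unweighted mean methylation over a filtered set of CpGs.

    sites: {(chrom, pos): (methylation_fraction, coverage)}
    min_cov: keep only CpGs with at least this coverage.
    restrict_to: optional set of sites, e.g. CpGs shared by all methods.
    """
    values = [
        frac
        for site, (frac, cov) in sites.items()
        if cov >= min_cov and (restrict_to is None or site in restrict_to)
    ]
    return mean(values) if values else float("nan")


sites = {
    ("chr1", 100): (0.9, 15),
    ("chr1", 200): (0.1, 3),
    ("chr1", 300): (0.5, 10),
}
shared = {("chr1", 100), ("chr1", 200)}  # hypothetical overlap across methods

general = mean_methylation(sites)                         # like Mean_meth_general
overlapped = mean_methylation(sites, restrict_to=shared)  # like Mean_meth_overlapped
at_10x = mean_methylation(sites, min_cov=10)              # like Mean_meth_10x
```

An empty selection returns NaN instead of raising an error, so a method with no CpGs at the requested coverage still yields a table entry.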


Reproduce Individual Figures

The following R scripts in scripts/R/ each reproduce one figure of the manuscript:

# Figure 3 – QC metrics and coverage
Rscript scripts/R/08_qc_visualization.R

# Figure 4 – Cross-platform correlation
Rscript scripts/R/09_correlation_analysis.R

# Figure 5 – Methylation density distributions
Rscript scripts/R/10_density_plots.R

# Figure 6 – Principal component analysis
Rscript scripts/R/11_pca.R

# Figure 7 – Differential methylation analysis
Rscript scripts/R/12_differential_methylation.R

# Figure S14 – DMC annotation
Rscript scripts/R/13_annotation.R

All scripts expect preprocessed (methylation) matrices as input.
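
As an illustration of a cross-platform comparison on such matrices, the Python sketch below computes a Pearson correlation between two platforms, restricted to the CpGs both cover. The choice of Pearson correlation, the function name, and the input layout are assumptions for the example and are not taken from 09_correlation_analysis.R.

```python
from math import sqrt


def pearson_on_shared_cpgs(meth_a, meth_b):
    """Pearson correlation of methylation values at CpGs present in both inputs.

    meth_a, meth_b: {(chrom, pos): methylation_fraction} for one sample each.
    Returns NaN when fewer than two shared CpGs exist or a platform is constant.
    """
    shared = sorted(set(meth_a) & set(meth_b))
    n = len(shared)
    if n < 2:
        return float("nan")
    x = [meth_a[site] for site in shared]
    y = [meth_b[site] for site in shared]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)
```

Restricting to shared CpGs matters because the technologies cover very different numbers of sites, so only the intersection can be compared value by value.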


License

This project is licensed under the MIT License – see LICENSE for details.


Contact

For questions regarding the analysis code, please open a GitHub issue or contact the corresponding author:

Julia Schulze-Hentrich – Department of Genetics, Saarland University
Lukas Laufer – Department of Genetics, Saarland University
