Skip to content

Snakemake re-implementation of Snakepipe ChipSeq workflow

Notifications You must be signed in to change notification settings

bmc-CompBio/chipseq-standalone

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChIP-seq Standalone Pipeline

A Snakemake pipeline for ChIP-seq analysis including read mapping, peak calling with MACS2, and differential binding analysis with CSAW.

Requirements

  • Snakemake >= 7.0
  • Conda/Mamba (for environment management)

Quick Start

# 1. Clone the repository
git clone <repo-url> chipseq-standalone
cd chipseq-standalone

# 2. Edit configuration files
#    - config/samples.tsv    (your samples)
#    - config/genome.yaml    (genome resources)
#    - config/config.yaml    (pipeline parameters)

# 3. Run the pipeline
snakemake --use-conda --cores 8

# Dry run (see what would be executed)
snakemake -n

# Generate DAG visualization
snakemake --dag | dot -Tpdf > dag.pdf

Input Specification

Sample table format (config/samples.tsv)

Tab-separated file with one row per library (IP or INPUT).

Column Required Description
sample_id Yes Unique identifier (used in filenames)
type Yes IP or INPUT
group Yes Biological condition (for differential analysis)
target Yes* ChIP target / histone mark (e.g. H3K4me2, H3K9me2); *required for IP samples
broad Yes* TRUE/FALSE or 1/0 (*required for IP, ignored for INPUT)
fastq_1 Yes Path to R1 FASTQ file (.gz supported)
fastq_2 No Path to R2 FASTQ file (empty for single-end)
control_sample_id Yes* sample_id of INPUT control (*required for IP only)

Target column:

  • A column named target is required for IP samples.
  • target identifies the ChIP target / histone mark (e.g. H3K4me2, H3K9me2).
  • Differential analysis (CSAW) is performed separately per target (no mixing marks).
  • INPUT samples should have target = NA or empty.

Validation Rules:

  • sample_id must be unique
  • type must be IP or INPUT
  • IP samples must have:
    • Valid target value identifying the ChIP mark
    • Valid control_sample_id pointing to an INPUT sample
    • Valid broad value
  • INPUT samples must have empty control_sample_id and target

Example (two targets):

sample_id	type	group	target	broad	fastq_1	fastq_2	control_sample_id
IP_H3K4me2_ESC	IP	ESC	H3K4me2	FALSE	data/IP_H3K4me2_ESC.fq.gz		INPUT_ESC
IP_H3K4me2_EpiLC	IP	EpiLC	H3K4me2	FALSE	data/IP_H3K4me2_EpiLC.fq.gz		INPUT_EpiLC
IP_H3K9me2_ESC	IP	ESC	H3K9me2	TRUE	data/IP_H3K9me2_ESC.fq.gz		INPUT_ESC
IP_H3K9me2_EpiLC	IP	EpiLC	H3K9me2	TRUE	data/IP_H3K9me2_EpiLC.fq.gz		INPUT_EpiLC
INPUT_ESC	INPUT	ESC			data/INPUT_ESC.fq.gz
INPUT_EpiLC	INPUT	EpiLC			data/INPUT_EpiLC.fq.gz

2. Genome Configuration (config/genome.yaml)

genome:
  name: "GRCh38"
  effective_genome_size: 2913022398

  fasta:
    url: "https://..."       # Optional: URL for download
    local: "resources/genome/GRCh38.fa"
    sha256: "..."            # Optional: checksum verification

  gtf:                       # Optional: for gene annotation
    url: "https://..."
    local: "resources/genome/GRCh38.gtf"

  blacklist_bed:             # Optional: problematic regions to exclude
    url: "https://..."
    local: "resources/genome/blacklist.bed"

3. Pipeline Configuration (config/config.yaml)

Key parameters:

  • samples: Path to samples.tsv
  • genome_config: Path to genome.yaml
  • min_mapq: Minimum mapping quality (default: 20)
  • macs2_qvalue: MACS2 q-value threshold (default: 0.05)
  • csaw_fdr: Differential binding FDR (default: 0.05)
  • csaw_lfc: Log fold change threshold (default: 1.0)

Output Structure

.
├── filtered_bam/           # Filtered, deduplicated BAM files
│   ├── {sample}.filtered.bam
│   └── {sample}.filtered.bam.bai
├── bigwig/                 # Coverage tracks
│   ├── {sample}.bw
│   ├── {sample}.log2ratio.bw    # ChIP/Input ratio (IP only)
│   └── {sample}.subtract.bw     # ChIP - Input (IP only)
├── macs2/                  # Peak calls
│   ├── {sample}_peaks.narrowPeak
│   ├── {sample}_peaks.broadPeak
│   └── {sample}_peaks.xls
├── csaw/                   # Differential binding results (per target)
│   ├── {target}/           # e.g. H3K4me2/, H3K9me2/
│   │   ├── DB_regions.UP.bed
│   │   ├── DB_regions.DOWN.bed
│   │   ├── DB_regions.annotated.tsv
│   │   └── csaw_report.html
├── qc/                     # Quality control
│   ├── fastqc/
│   ├── flagstat/
│   ├── markdup/
│   ├── multiqc_report.html
│   └── qc_summary.tsv
├── deeptools_qc/           # deepTools QC
│   ├── fragmentSize.metrics.tsv
│   ├── plotFingerprint.metrics.txt
│   ├── correlation.pearson.png
│   └── pca.png
└── logs/                   # Log files

Workflow Steps

  1. Genome Preparation: Download and index reference genome
  2. Read Mapping: Bowtie2 alignment
  3. BAM Processing: Sort, mark duplicates, filter
  4. Quality Control: FastQC, flagstat, deepTools metrics
  5. Coverage Tracks: Generate bigWig files
  6. Peak Calling: MACS2 (narrow or broad based on sample annotation)
  7. Differential Binding: CSAW analysis per target (if multiple groups)
  8. Reporting: MultiQC and summary reports

Cluster Execution

# SLURM example
snakemake --use-conda --cores 100 \
    --cluster "sbatch -p {cluster.partition} -t {cluster.time} -c {threads}" \
    --cluster-config cluster.yaml \
    --jobs 50

License

MIT

About

Snakemake re-implementation of Snakepipe ChipSeq workflow

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published