A Snakemake pipeline for ChIP-seq analysis including read mapping, peak calling with MACS2, and differential binding analysis with CSAW.
- Snakemake >= 7.0
- Conda/Mamba (for environment management)
# 1. Clone the repository
git clone <repo-url> chipseq-standalone
cd chipseq-standalone
# 2. Edit configuration files
# - config/samples.tsv (your samples)
# - config/genome.yaml (genome resources)
# - config/config.yaml (pipeline parameters)
# 3. Run the pipeline
snakemake --use-conda --cores 8
# Dry run (see what would be executed)
snakemake -n
# Generate DAG visualization
snakemake --dag | dot -Tpdf > dag.pdfTab-separated file with one row per library (IP or INPUT).
| Column | Required | Description |
|---|---|---|
sample_id |
Yes | Unique identifier (used in filenames) |
type |
Yes | IP or INPUT |
group |
Yes | Biological condition (for differential analysis) |
target |
Yes* | ChIP target / histone mark (e.g. H3K4me2, H3K9me2); *required for IP samples |
broad |
Yes* | TRUE/FALSE or 1/0 (*required for IP, ignored for INPUT) |
fastq_1 |
Yes | Path to R1 FASTQ file (.gz supported) |
fastq_2 |
No | Path to R2 FASTQ file (empty for single-end) |
control_sample_id |
Yes* | sample_id of INPUT control (*required for IP only) |
Target column:
- A column named
targetis required for IP samples. targetidentifies the ChIP target / histone mark (e.g. H3K4me2, H3K9me2).- Differential analysis (CSAW) is performed separately per
target(no mixing marks). - INPUT samples should have
target= NA or empty.
Validation Rules:
sample_idmust be uniquetypemust beIPorINPUT- IP samples must have:
- Valid
targetvalue identifying the ChIP mark - Valid
control_sample_idpointing to an INPUT sample - Valid
broadvalue
- Valid
- INPUT samples must have empty
control_sample_idandtarget
Example (two targets):
sample_id type group target broad fastq_1 fastq_2 control_sample_id
IP_H3K4me2_ESC IP ESC H3K4me2 FALSE data/IP_H3K4me2_ESC.fq.gz INPUT_ESC
IP_H3K4me2_EpiLC IP EpiLC H3K4me2 FALSE data/IP_H3K4me2_EpiLC.fq.gz INPUT_EpiLC
IP_H3K9me2_ESC IP ESC H3K9me2 TRUE data/IP_H3K9me2_ESC.fq.gz INPUT_ESC
IP_H3K9me2_EpiLC IP EpiLC H3K9me2 TRUE data/IP_H3K9me2_EpiLC.fq.gz INPUT_EpiLC
INPUT_ESC INPUT ESC data/INPUT_ESC.fq.gz
INPUT_EpiLC INPUT EpiLC data/INPUT_EpiLC.fq.gz
genome:
name: "GRCh38"
effective_genome_size: 2913022398
fasta:
url: "https://..." # Optional: URL for download
local: "resources/genome/GRCh38.fa"
sha256: "..." # Optional: checksum verification
gtf: # Optional: for gene annotation
url: "https://..."
local: "resources/genome/GRCh38.gtf"
blacklist_bed: # Optional: problematic regions to exclude
url: "https://..."
local: "resources/genome/blacklist.bed"Key parameters:
samples: Path to samples.tsvgenome_config: Path to genome.yamlmin_mapq: Minimum mapping quality (default: 20)macs2_qvalue: MACS2 q-value threshold (default: 0.05)csaw_fdr: Differential binding FDR (default: 0.05)csaw_lfc: Log fold change threshold (default: 1.0)
.
├── filtered_bam/ # Filtered, deduplicated BAM files
│ ├── {sample}.filtered.bam
│ └── {sample}.filtered.bam.bai
├── bigwig/ # Coverage tracks
│ ├── {sample}.bw
│ ├── {sample}.log2ratio.bw # ChIP/Input ratio (IP only)
│ └── {sample}.subtract.bw # ChIP - Input (IP only)
├── macs2/ # Peak calls
│ ├── {sample}_peaks.narrowPeak
│ ├── {sample}_peaks.broadPeak
│ └── {sample}_peaks.xls
├── csaw/ # Differential binding results (per target)
│ ├── {target}/ # e.g. H3K4me2/, H3K9me2/
│ │ ├── DB_regions.UP.bed
│ │ ├── DB_regions.DOWN.bed
│ │ ├── DB_regions.annotated.tsv
│ │ └── csaw_report.html
├── qc/ # Quality control
│ ├── fastqc/
│ ├── flagstat/
│ ├── markdup/
│ ├── multiqc_report.html
│ └── qc_summary.tsv
├── deeptools_qc/ # deepTools QC
│ ├── fragmentSize.metrics.tsv
│ ├── plotFingerprint.metrics.txt
│ ├── correlation.pearson.png
│ └── pca.png
└── logs/ # Log files
- Genome Preparation: Download and index reference genome
- Read Mapping: Bowtie2 alignment
- BAM Processing: Sort, mark duplicates, filter
- Quality Control: FastQC, flagstat, deepTools metrics
- Coverage Tracks: Generate bigWig files
- Peak Calling: MACS2 (narrow or broad based on sample annotation)
- Differential Binding: CSAW analysis per target (if multiple groups)
- Reporting: MultiQC and summary reports
# SLURM example
snakemake --use-conda --cores 100 \
--cluster "sbatch -p {cluster.partition} -t {cluster.time} -c {threads}" \
--cluster-config cluster.yaml \
--jobs 50MIT