ChIP-seq Standalone Pipeline

A Snakemake pipeline for ChIP-seq analysis including read mapping, peak calling with MACS2, and differential binding analysis with CSAW.

Requirements

Snakemake >= 7.0
Conda/Mamba (for environment management)

Quick Start

# 1. Clone the repository
git clone <repo-url> chipseq-standalone
cd chipseq-standalone

# 2. Edit configuration files
#    - config/samples.tsv    (your samples)
#    - config/genome.yaml    (genome resources)
#    - config/config.yaml    (pipeline parameters)

# 3. Run the pipeline
snakemake --use-conda --cores 8

# Dry run (see what would be executed)
snakemake -n

# Generate DAG visualization
snakemake --dag | dot -Tpdf > dag.pdf

Input Specification

Sample table format (config/samples.tsv)

Tab-separated file with one row per library (IP or INPUT).

Column	Required	Description
`sample_id`	Yes	Unique identifier (used in filenames)
`type`	Yes	`IP` or `INPUT`
`group`	Yes	Biological condition (for differential analysis)
`target`	Yes*	ChIP target / histone mark (e.g. H3K4me2, H3K9me2); *required for IP samples
`broad`	Yes*	`TRUE`/`FALSE` or `1`/`0` (*required for IP, ignored for INPUT)
`fastq_1`	Yes	Path to R1 FASTQ file (.gz supported)
`fastq_2`	No	Path to R2 FASTQ file (empty for single-end)
`control_sample_id`	Yes*	sample_id of INPUT control (*required for IP only)

Target column:

A column named target is required for IP samples.
target identifies the ChIP target / histone mark (e.g. H3K4me2, H3K9me2).
Differential analysis (CSAW) is performed separately per target (no mixing marks).
INPUT samples should have target = NA or empty.

Validation Rules:

sample_id must be unique
type must be IP or INPUT
IP samples must have:
- Valid target value identifying the ChIP mark
- Valid control_sample_id pointing to an INPUT sample
- Valid broad value
INPUT samples must have empty control_sample_id and target

Example (two targets):

sample_id	type	group	target	broad	fastq_1	fastq_2	control_sample_id
IP_H3K4me2_ESC	IP	ESC	H3K4me2	FALSE	data/IP_H3K4me2_ESC.fq.gz		INPUT_ESC
IP_H3K4me2_EpiLC	IP	EpiLC	H3K4me2	FALSE	data/IP_H3K4me2_EpiLC.fq.gz		INPUT_EpiLC
IP_H3K9me2_ESC	IP	ESC	H3K9me2	TRUE	data/IP_H3K9me2_ESC.fq.gz		INPUT_ESC
IP_H3K9me2_EpiLC	IP	EpiLC	H3K9me2	TRUE	data/IP_H3K9me2_EpiLC.fq.gz		INPUT_EpiLC
INPUT_ESC	INPUT	ESC			data/INPUT_ESC.fq.gz
INPUT_EpiLC	INPUT	EpiLC			data/INPUT_EpiLC.fq.gz

2. Genome Configuration (`config/genome.yaml`)

genome:
  name: "GRCh38"
  effective_genome_size: 2913022398

  fasta:
    url: "https://..."       # Optional: URL for download
    local: "resources/genome/GRCh38.fa"
    sha256: "..."            # Optional: checksum verification

  gtf:                       # Optional: for gene annotation
    url: "https://..."
    local: "resources/genome/GRCh38.gtf"

  blacklist_bed:             # Optional: problematic regions to exclude
    url: "https://..."
    local: "resources/genome/blacklist.bed"

3. Pipeline Configuration (`config/config.yaml`)

Key parameters:

samples: Path to samples.tsv
genome_config: Path to genome.yaml
min_mapq: Minimum mapping quality (default: 20)
macs2_qvalue: MACS2 q-value threshold (default: 0.05)
csaw_fdr: Differential binding FDR (default: 0.05)
csaw_lfc: Log fold change threshold (default: 1.0)

Output Structure

.
├── filtered_bam/           # Filtered, deduplicated BAM files
│   ├── {sample}.filtered.bam
│   └── {sample}.filtered.bam.bai
├── bigwig/                 # Coverage tracks
│   ├── {sample}.bw
│   ├── {sample}.log2ratio.bw    # ChIP/Input ratio (IP only)
│   └── {sample}.subtract.bw     # ChIP - Input (IP only)
├── macs2/                  # Peak calls
│   ├── {sample}_peaks.narrowPeak
│   ├── {sample}_peaks.broadPeak
│   └── {sample}_peaks.xls
├── csaw/                   # Differential binding results (per target)
│   ├── {target}/           # e.g. H3K4me2/, H3K9me2/
│   │   ├── DB_regions.UP.bed
│   │   ├── DB_regions.DOWN.bed
│   │   ├── DB_regions.annotated.tsv
│   │   └── csaw_report.html
├── qc/                     # Quality control
│   ├── fastqc/
│   ├── flagstat/
│   ├── markdup/
│   ├── multiqc_report.html
│   └── qc_summary.tsv
├── deeptools_qc/           # deepTools QC
│   ├── fragmentSize.metrics.tsv
│   ├── plotFingerprint.metrics.txt
│   ├── correlation.pearson.png
│   └── pca.png
└── logs/                   # Log files

Workflow Steps

Genome Preparation: Download and index reference genome
Read Mapping: Bowtie2 alignment
BAM Processing: Sort, mark duplicates, filter
Quality Control: FastQC, flagstat, deepTools metrics
Coverage Tracks: Generate bigWig files
Peak Calling: MACS2 (narrow or broad based on sample annotation)
Differential Binding: CSAW analysis per target (if multiple groups)
Reporting: MultiQC and summary reports

Cluster Execution

# SLURM example
snakemake --use-conda --cores 100 \
    --cluster "sbatch -p {cluster.partition} -t {cluster.time} -c {threads}" \
    --cluster-config cluster.yaml \
    --jobs 50

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
GSE155062_test		GSE155062_test
config		config
docs		docs
envs		envs
tests		tests
workflow		workflow
.gitignore		.gitignore
README.md		README.md
dag.dot		dag.dot
rulegraph.dot		rulegraph.dot
rulegraph.png		rulegraph.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ChIP-seq Standalone Pipeline

Requirements

Quick Start

Input Specification

Sample table format (config/samples.tsv)

2. Genome Configuration (`config/genome.yaml`)

3. Pipeline Configuration (`config/config.yaml`)

Output Structure

Workflow Steps

Cluster Execution

License

About

Uh oh!

Releases

Packages

Languages

bmc-CompBio/chipseq-standalone

Folders and files

Latest commit

History

Repository files navigation

ChIP-seq Standalone Pipeline

Requirements

Quick Start

Input Specification

Sample table format (config/samples.tsv)

2. Genome Configuration (config/genome.yaml)

3. Pipeline Configuration (config/config.yaml)

Output Structure

Workflow Steps

Cluster Execution

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

2. Genome Configuration (`config/genome.yaml`)

3. Pipeline Configuration (`config/config.yaml`)

Packages