A comprehensive Nextflow pipeline for RNA-seq quality control and contamination detection.
Conterminator is a production-ready bioinformatics pipeline designed to perform comprehensive quality control and contamination detection on RNA-seq data. The pipeline combines multiple tools to identify potential contamination sources in sequencing data, including cross-strain contamination, bacterial/fungal/viral contamination, and technical artifacts.
- Flexible input handling: Direct file paths for FASTQ, pre-aligned BAM, or unmapped reads
- Strain-specific alignment with STAR using custom pseudogenomes or reference genome
- Multi-database contamination screening via BLAST
- Microbial contamination detection using DecontaMiner
- Comprehensive QC suite: FastQC, FastQ Screen, Qualimap, DeepTools, Picard, BEDTools
- Singularity container with all tools bundled for reproducible execution
- SLURM cluster support with intelligent job scheduling and resource management
- Automatic retry with dynamic memory scaling on out-of-memory errors
- Flexible subsampling for performance optimization
- Interactive visualizations and MultiQC reporting
- Per-sample strain assignment with optional default (GRCm39) for mixed experiments
- Highly configurable with sensible defaults
- Requirements
- Installation
- Quick Start
- Input Format
- Usage
- Configuration
- Output Structure
- Troubleshooting
- Execution Profiles
- Singularity Container
- Contact
- Nextflow ≥ 23.04.0
- Java ≥ 11
- Singularity ≥ 3.5 (optional, for containerized execution)
To install Nextflow, you can run:
```bash
curl -s https://get.nextflow.io | bash
```

If you get an error that your Java is not up to date, read below on how to install it.
To install Java 24, you can run the following commands:
```bash
cd
wget https://download.oracle.com/java/24/archive/jdk-24.0.2_linux-x64_bin.tar.gz
tar -zxf jdk-24.0.2_linux-x64_bin.tar.gz
echo 'export JAVA_CMD="/home/${USER%%@*}/jdk-24.0.2/bin/java"' >> ~/.bashrc
echo 'export JAVA_HOME="/home/${USER%%@*}/jdk-24.0.2"' >> ~/.bashrc
echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

Note: When using Singularity, Python and R dependencies are bundled in the container.
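The `${USER%%@*}` expansion used in the lines above strips any `@domain` suffix from the username, which is useful on clusters where `$USER` looks like `name@realm`. A minimal demonstration (the example username is made up):

```bash
# ${VAR%%@*} deletes the longest suffix starting at '@',
# so a cluster username like "jdoe@intranet.epfl.ch" becomes "jdoe".
USER_EXAMPLE="jdoe@intranet.epfl.ch"
echo "${USER_EXAMPLE%%@*}"
```

This prints `jdoe`, so the exported paths resolve to `/home/jdoe/...` regardless of the realm suffix.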
The pipeline uses the following bioinformatics tools (paths configurable in nextflow.config):
| Tool | Purpose | Version |
|---|---|---|
| STAR | RNA-seq alignment | 2.7.11b |
| Samtools | BAM manipulation | 1.22.1 |
| FastQC | Quality control | 0.12.1 |
| FastQ Screen | Contamination screening | 0.16.0 |
| Qualimap | BAM quality control | 2.3 |
| Picard | GC bias analysis | 3.4.0 |
| DeepTools | GC bias computation | 3.5.6 |
| BEDTools | Coverage analysis | 2.31.1 |
| BLAST+ | Sequence alignment | 2.17.0+ |
| DecontaMiner | Microbial detection | 1.4 |
| Seqtk | FASTQ subsampling | 1.5-r133 |
| BBMap | Format conversion | 39.37 |
| MultiQC | Report aggregation | 1.31 |
- Strain-specific pseudogenomes and annotations
- STAR genome indices (auto-generated if missing)
- BLAST databases for contamination screening
- FastQ Screen configuration and indices
```bash
git clone git@github.com:Z-Zen/Conterminator.git
cd Conterminator
```

The container includes all required bioinformatics tools pre-configured.
```bash
# Download from Google Drive
pip install gdown
gdown 1a2CevfBkMUjSt5R4AnQyDX4ofAHBZmfZ

# Test the download
singularity test conterminator.sif
```

Alternatively, build the Singularity container with all dependencies bundled:
```bash
# Build the container
sudo singularity build conterminator.sif conterminator.def

# Test the build
singularity test conterminator.sif
```

Alternatively, install all required tools manually and configure their paths in nextflow.config.
Edit nextflow.config to specify paths to installed tools and reference data:
```groovy
params {
    // Tool paths
    star_bin     = "/path/to/STAR"
    samtools_bin = "/path/to/samtools"
    fastqc_bin   = "/path/to/fastqc"
    // ... (see nextflow.config for all options)

    // Reference directories
    strains_base_dir        = "/path/to/pseudogenomes"   // Strain-specific pseudogenomes
    standard_references_dir = "/path/to/references/Mus"  // Standard reference genomes (GRCm39, etc.)
    star_index_dir          = "/path/to/star/indices"
    contamination_blast_dbs = "/path/to/blast/databases"
}
```

Quick start:

```bash
# Native installation
nextflow run main.nf --help

# With Singularity
nextflow run main.nf --singularity_path /path/to/conterminator.sif -profile singularity --help

# With Singularity + SLURM
nextflow run main.nf --singularity_path /path/to/conterminator.sif -profile singularity_hpc,slurm --help
```

Create a tab-separated file (e.g., samples.tsv):
```text
sample strain type read1 read2
sample1 C57BL_6J fastq /data/sample1_R1.fq.gz /data/sample1_R2.fq.gz
sample2 C57BL_6J fastq /data/sample2_R1.fq.gz /data/sample2_R2.fq.gz
```

Run natively:

```bash
nextflow run main.nf \
    --sample_sheet samples.tsv \
    --outdir results \
    -bg &> results.log
```

Or with Singularity:

```bash
nextflow run main.nf \
    -profile singularity \
    --singularity_path /path/to/conterminator.sif \
    --sample_sheet samples.tsv \
    --outdir results \
    -bg &> results.log
```

When running the pipeline on SLURM, you cannot use data located on RCP storage or the archive. You will need to copy some dependencies from the lispserver before running the pipeline.
Copy them to /scratch/[username], /export/lisp/[username], or /work/lisp/[username]. Data copied to /scratch is deleted after 30 days; the other locations incur storage charges.
For example, you can copy the GRCm39 genome from the lispserver to /scratch with the following command:
```bash
cd /scratch/`whoami`/
mkdir -p Data/Mus/GRCm39 blast_databases
cd /scratch/`whoami`/Data/Mus/GRCm39
# enter your password when prompted by the command below
scp `whoami`@lispserver.rcp.epfl.ch:/mnt/sas/Data/References/Mus/GRCm39/* .
cd /scratch/`whoami`/blast_databases/
# enter your password when prompted by the command below
scp `whoami`@lispserver.rcp.epfl.ch:/mnt/sas/Tools/blast_databases/* .
```

Then, you can run the pipeline using:
```bash
nextflow run main.nf \
    -profile slurm,singularity_hpc \
    --singularity_path conterminator.sif \
    --sample_sheet samples.tsv \
    --outdir /scratch/`whoami`/myproject_results/ \
    --star_index_dir /scratch/`whoami`/Data/Mus \
    --strains_base_dir /scratch/`whoami`/Data/Mus \
    --standard_references_dir /scratch/`whoami`/Data/Mus \
    --contamination_blast_dbs /scratch/`whoami`/blast_databases/ \
    -bg &> ~/myproject_results.log
```

The pipeline requires a tab-separated sample sheet specifying samples, input types, and file paths.
Create a TSV file (e.g., samples.tsv) with 5 columns:
```text
sample strain type read1 read2
sample1 C57BL_6J fastq /path/to/sample1_R1.fastq.gz /path/to/sample1_R2.fastq.gz
sample2 C57BL_6J bam /path/to/sample2.bam
sample3 DBA_2J unmapped_fastq /path/to/sample3_unmapped_R1.fastq.gz /path/to/sample3_unmapped_R2.fastq.gz
sample4  fastq /path/to/sample4_R1.fastq.gz /path/to/sample4_R2.fastq.gz
sample5  bam /path/to/sample5.bam
```

| Column | Required | Description | Values/Examples |
|---|---|---|---|
| sample | Yes | Unique sample identifier | sample1, exp_001, ctrl-A |
| strain | No | Reference strain name (defaults to GRCm39 if empty) | C57BL_6J, DBA_2J, BALB_cJ, or leave empty |
| type | Yes | Input data type | fastq, bam, or unmapped_fastq |
| read1 | Yes | Path to first read file or BAM file | /path/to/sample_R1.fastq.gz or /path/to/sample.bam |
| read2 | Conditional | Path to second read file (required for fastq and unmapped_fastq, empty for bam) | /path/to/sample_R2.fastq.gz or empty |
| Type | Description | Use Case | Required Columns |
|---|---|---|---|
| fastq | Paired-end FASTQ files for alignment | Raw sequencing data to be aligned with STAR | read1, read2 |
| bam | Pre-aligned BAM file | Already aligned data, skip STAR alignment | read1 only |
| unmapped_fastq | Unmapped reads in FASTQ format | Already extracted unmapped reads for contamination analysis | read1, read2 |
- Header row is mandatory - the first line must be: sample strain type read1 read2
- Tab-separated format - columns must be separated by tabs (not spaces)
- Five columns required - all 5 columns must be present in the header
- Unique sample IDs - each sample name must be unique across the sheet
- Valid type values - type must be one of fastq, bam, or unmapped_fastq
- Strain handling:
  - If strain is empty or not specified, it defaults to GRCm39
  - If strain is specified, it must exist in params.strains_base_dir
- read2 rules:
  - For type=fastq: read2 is required (paired-end reads)
  - For type=unmapped_fastq: read2 is required (paired-end reads)
  - For type=bam: read2 must be empty (BAM files don't have separate read files)
- File paths - all paths in read1 and read2 must exist and be accessible
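The header and type rules above can be checked with a few shell commands before launching a run. A standalone sketch (not part of the pipeline; the samples.tsv content written here is fabricated for the demo):

```bash
# Write a tiny example sheet (illustrative content only)
printf 'sample\tstrain\ttype\tread1\tread2\n'                   >  samples.tsv
printf 'ex1\t\tfastq\t/data/ex1_R1.fq.gz\t/data/ex1_R2.fq.gz\n' >> samples.tsv

# The header must contain exactly 5 tab-separated columns
head -n 1 samples.tsv | awk -F'\t' 'NF != 5 { print "bad header"; exit 1 }'

# Every data row must use one of the valid type values
awk -F'\t' 'NR > 1 && $3 !~ /^(fastq|bam|unmapped_fastq)$/ { print "bad type on line " NR; bad = 1 }
END { exit bad }' samples.tsv && echo "sheet looks OK"
```

A sheet that mixes spaces and tabs will fail the first check, which is the most common formatting mistake.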
Example 1: Mixed input types

```text
sample strain type read1 read2
WT_rep1 C57BL_6J fastq /data/exp1/WT_rep1_R1.fq.gz /data/exp1/WT_rep1_R2.fq.gz
WT_rep2 C57BL_6J fastq /data/exp1/WT_rep2_R1.fq.gz /data/exp1/WT_rep2_R2.fq.gz
KO_rep1 DBA_2J bam /data/exp1/KO_rep1_aligned.bam
archived_sample  bam /archive/old_alignment.bam
contaminated_sample BALB_cJ unmapped_fastq /data/unmapped/sample_R1.fq.gz /data/unmapped/sample_R2.fq.gz
```

Example 2: All FASTQ inputs (standard RNA-seq workflow)
```text
sample strain type read1 read2
ctrl_1 C57BL_6J fastq /mnt/data/ctrl_1_R1.fastq.gz /mnt/data/ctrl_1_R2.fastq.gz
ctrl_2 C57BL_6J fastq /mnt/data/ctrl_2_R1.fastq.gz /mnt/data/ctrl_2_R2.fastq.gz
treat_1 C57BL_6J fastq /mnt/data/treat_1_R1.fastq.gz /mnt/data/treat_1_R2.fastq.gz
treat_2 C57BL_6J fastq /mnt/data/treat_2_R1.fastq.gz /mnt/data/treat_2_R2.fastq.gz
```

Example 3: Pre-aligned BAM files (QC only)
```text
sample strain type read1 read2
sample1 C57BL_6J bam /alignments/sample1.bam
sample2 DBA_2J bam /alignments/sample2.bam
sample3  bam /alignments/sample3.bam
```

Example 4: Using default strain (GRCm39)

```text
sample strain type read1 read2
exp001  fastq /data/exp001_R1.fq.gz /data/exp001_R2.fq.gz
exp002  fastq /data/exp002_R1.fq.gz /data/exp002_R2.fq.gz
exp003  bam /data/exp003.bam
```

For each strain specified in the sample sheet (or the default GRCm39), the following files must exist in params.strains_base_dir:
```text
strains_base_dir/
└── STRAIN_NAME/
    ├── *pseudogenome__strain_STRAIN_NAME.fa.gz   # FASTA reference
    └── *pseudogenome__strain_STRAIN_NAME.gtf.gz  # Gene annotations
```

Example for C57BL_6J strain:

```text
/path/to/strains_base_dir/C57BL_6J/
├── mm39_pseudogenome__strain_C57BL_6J.fa.gz
└── mm39_pseudogenome__strain_C57BL_6J.gtf.gz
```

Example for the default GRCm39 strain:

```text
/path/to/strains_base_dir/GRCm39/
├── GRCm39.genome.fa.gz
└── gencode.vM35.primary_assembly.annotation.gtf.gz
```
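Before launching, it can save a failed run to confirm that each strain directory holds exactly one FASTA and one GTF. A sketch of such a check (the demo_strains tree created here is a stand-in for your real strains_base_dir; it is not a pipeline feature):

```bash
# Build a mock strain directory so the check below has something to inspect
mkdir -p demo_strains/C57BL_6J
touch demo_strains/C57BL_6J/mm39_pseudogenome__strain_C57BL_6J.fa.gz \
      demo_strains/C57BL_6J/mm39_pseudogenome__strain_C57BL_6J.gtf.gz

# Report each strain directory that is missing (or duplicates) its reference pair
for dir in demo_strains/*/; do
  strain=$(basename "$dir")
  n_fa=$(ls "$dir"*.fa.gz 2>/dev/null | wc -l)
  n_gtf=$(ls "$dir"*.gtf.gz 2>/dev/null | wc -l)
  if [ "$n_fa" -eq 1 ] && [ "$n_gtf" -eq 1 ]; then
    echo "$strain: OK"
  else
    echo "$strain: missing or duplicate reference files"
  fi
done
```

Point the loop at your actual strains_base_dir to audit a real collection.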
Note on Standard Reference Genomes (v1.2+): The pipeline now supports storing standard reference genomes (like GRCm39, GRCm38) in a separate directory specified by params.standard_references_dir (default: /mnt/sas/Data/References/Mus). The pipeline will automatically search both strains_base_dir (for strain-specific pseudogenomes) and standard_references_dir (for standard references) when resolving strain names. This allows you to maintain your existing HDP pseudogenome collection while using standard Ensembl/GENCODE references without reorganizing your directory structure.
Core options:

```bash
--sample_sheet <file>            # Sample sheet with 5 columns: sample, strain, type, read1, read2 (required)
--outdir <path>                  # Output directory (default: results)
```

Subsampling options:

```bash
--subset_for_fastq_qc true       # Subsample for FastQ QC (default: true)
--subset_fastq_qc_reads 100000   # Reads for QC (default: 100k)
--subset_for_star false          # Subsample for STAR (default: false)
--subset_star_reads 100000       # Reads for STAR if enabled
--subset_bam_for_qc true         # Subsample BAMs for QC (default: true)
--bam_qc_subset_mapped 200000    # Mapped reads for BAM QC (default: 200k)
--subset_unmapped_for_blast true # Subsample for BLAST (default: true)
--unmapped_subset_reads 100000   # Unmapped reads for BLAST (default: 100k)
```

Process toggles:

```bash
--run_star_alignment true        # Run STAR alignment (default: true)
--run_fastqc true                # Run FastQC (default: true)
--run_fastq_screen true          # Run FastQ Screen (default: true)
--run_deeptools true             # Run DeepTools GC bias (default: true)
--run_picard_gc true             # Run Picard GC bias (default: true)
--run_bedtools_gc true           # Run BEDTools coverage (default: true)
--run_qualimap true              # Run Qualimap (default: true)
--run_mapinsights true           # Run MapInsights (default: true)
--run_decontaminer true          # Run DecontaMiner (default: true)
--run_contamination_check true   # Run BLAST contamination (default: true)
--run_multiqc true               # Run MultiQC (default: true)
```

Resource limits:

```bash
--max_cpus 16                    # Maximum CPUs per process
--max_mem "32 GB"                # Maximum memory per process
--max_time "24h"                 # Maximum time per process
--max_parallel_samples 4         # Max samples in parallel
```

Qualimap options:

```bash
--qualimap_mode "both"           # "bamqc", "rnaseq", or "both"
--qualimap_genome "mm10"         # Reference genome name
--qualimap_protocol "strand-specific-reverse"
--qualimap_threads 8
```

Contamination screening options:

```bash
--contamination_blast_dbs "/path/to/blast/dbs" # BLAST database directory
--contamination_evalue "1e-10"   # E-value threshold
--blast_max_parallel 10          # Max parallel BLAST jobs
```

DecontaMiner options:

```bash
--decontaminer_config "/path/to/config.txt"
--decontaminer_organisms "bfv"   # b=bacteria, f=fungi, v=viruses
--decontaminer_pairing "P"       # P=paired, S=single
--decontaminer_quality_filter "yes"
--decontaminer_ribo_filter "yes"
```

```text
results/
├── Input/
│   ├── reference/               # Prepared references
│   ├── strain_references/       # Strain-specific files
│   ├── subsampled_bams/         # Subsampled BAM files
│   ├── subsampled_fastq/        # Subsampled FASTQ files
│   └── sample_sheet.tsv         # Copy of input sample sheet
├── Output/
│   ├── bedtools_gc/             # GC content coverage
│   ├── contamination_check/     # BLAST results and plots
│   ├── decontaminer/            # DecontaMiner reports
│   ├── deeptools_gc_bias/       # DeepTools GC bias
│   ├── fastqc/                  # FastQC reports
│   ├── fastq_screen/            # FastQ Screen results
│   ├── mapinsights/             # MapInsights reports
│   ├── multiqc/                 # MultiQC aggregate report
│   ├── picard_gc_bias/          # Picard GC metrics
│   └── qualimap/                # Qualimap QC
├── Temporary/
│   ├── decontaminer/            # Intermediate files
│   └── star_alignment/          # STAR outputs
├── NextflowReports/             # Nextflow's internal reports
├── pid.txt                      # Contains Nextflow PIDs
└── pipeline_info.txt            # Run metadata
```
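A quick way to confirm a run completed is to check that the key entries of this tree exist. A sketch of such a check (the check_outputs function and demo_results tree are illustrative, not part of the pipeline; `results` is simply the default --outdir):

```bash
# Report whether the key pipeline outputs are present in a results directory
check_outputs() {
  local outdir=$1
  for f in "$outdir/Output/multiqc" "$outdir/pipeline_info.txt"; do
    if [ -e "$f" ]; then echo "found: $f"; else echo "MISSING: $f"; fi
  done
}

# Demo against a mock results tree
mkdir -p demo_results/Output/multiqc
touch demo_results/pipeline_info.txt
check_outputs demo_results
```

Run `check_outputs results` (or whatever --outdir you used) after a real run; any MISSING line usually means the corresponding process failed or was toggled off.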
Issue: STAR index not found
```bash
# The pipeline will auto-build missing indices.
# Ensure strains_base_dir contains:
#   strains_base_dir/STRAIN_NAME/*pseudogenome__strain_STRAIN_NAME.fa.gz
#   strains_base_dir/STRAIN_NAME/*pseudogenome__strain_STRAIN_NAME.gtf.gz
```

Issue: Out-of-memory errors

```bash
# Increase memory allocation
--max_mem "64 GB"

# Enable more aggressive subsampling
--subset_bam_for_qc true
--bam_qc_subset_mapped 100000
```

Issue: Pipeline hangs during BLAST

```bash
# Reduce parallel BLAST jobs
--blast_max_parallel 5

# Enable BLAST subsampling
--subset_unmapped_for_blast true
--unmapped_subset_reads 50000
```

If your run failed due to an error, you can resume it after fixing the problem:
```bash
# List all the runs launched by Nextflow
nextflow log

# Copy the session id of the run you want to resume.
# It looks like this: 2b29d621-8eff-400c-83a2-05146c4f6131

# Resume the pipeline with exactly the same command as before, adding -resume [session id].
# For example, if your command was
nextflow run main.nf --outdir myresults --sample_sheet samples.tsv -profile singularity,slurm

# it becomes
nextflow run main.nf --outdir myresults --sample_sheet samples.tsv -profile singularity,slurm -resume 2b29d621-8eff-400c-83a2-05146c4f6131
```

You can run Nextflow in the background using -bg:
```bash
nextflow run main.nf --outdir myresults --sample_sheet samples.tsv -profile singularity,slurm -bg &> myrun.log
```

You can find the PID of the latest run in pid.txt at the root of the output directory.
```bash
cat pid.txt
# ---------------------
# 4149804
# nextflow run main.nf --sample_sheet input_test/sample_sheet.tsv --outdir results_test --singularity_path /mnt/sas/Users/abadreddine/Projects/Conterminator/conterminator.sif -profile singularity -resume 2b29d621-8eff-400c-83a2-05146c4f6131
# Start: 07-Nov-2025 01:25:30

# You can then use the PID to kill the running job
kill 4149804
```

Nextflow runs its processes in a folder called work in your current path. The work directory is required to resume jobs: if this folder is deleted, you can no longer resume them.
However, once your analysis has finished, you can clean your working directory by running:
```bash
rm -rf work .nextflow.log*
# OR
nextflow clean -f $(nextflow log -q)
```

Author: Alaa Badreddine
Project Repository: https://github.com/Z-Zen/Conterminator
For bug reports and feature requests, please open an issue on GitHub.
- v1.3 (March 2026) - Remove DecontaMiner, eliminate hardcoded paths, add tests:
  - Remove DecontaMiner entirely (5 processes, all params, config, workflow logic)
  - Replace all hardcoded /mnt/sas tool/reference paths with null defaults
  - Add centralized parameter validation with clear error messages
  - Simplify tool path resolution: use bare command names via $PATH in containers, with optional param overrides for local installs
  - Fix Picard to use the wrapper command instead of java -jar with a bare filename
  - Support both compressed and uncompressed FASTA/GTF in reference discovery
  - Fix STAR index detection to check the star_index/ subdirectory
  - Parse the sample sheet independently of the STAR alignment toggle
  - Update all URLs from GitLab to GitHub
  - Add sample sheet JSON schema (assets/samplesheet_schema.json)
  - Add test suite with synthetic data and validation tests
  - Add test profile to nextflow.config
- v1.2 (December 2025) - Enhanced reference genome support:
  - Dual directory support for reference genomes: strains_base_dir for strain-specific pseudogenomes, standard_references_dir for standard reference genomes (GRCm39, GRCm38, etc.)
  - Automatic directory resolution - the pipeline searches both locations for each strain
  - Flexible reference file pattern matching for standard genomes
  - Fixed WRITE_PIPELINE_INFO staging to avoid file collision errors
  - Pipeline configuration files (main.nf, nextflow.config, conterminator.def) now copied to output for reproducibility
  - Dynamic user path support using ${System.getProperty('user.home')}
- v1.1 (November 2025) - Major updates:
  - Full Singularity container support with all tools bundled
  - SLURM cluster execution with job scheduling
  - Per-process resource configuration with automatic retry on OOM
  - Dynamic memory scaling (2×/3× on retry)
  - Parallelization control with the maxForks parameter
  - Enhanced error handling with intelligent retry strategy
  - Singularity information in pipeline reports
  - Improved channel scoping for complex workflows
  - New flexible sample sheet format supporting multiple input types:
    - Direct FASTQ file paths (no directory scanning required)
    - Pre-aligned BAM files for QC-only workflows
    - Unmapped FASTQ files for contamination-only analysis
    - Optional strain specification with GRCm39 default
- v1.0 (2025) - Initial release with full QC and contamination detection suite
