GitHub - Z-Zen/Conterminator

A comprehensive Nextflow pipeline for RNA-seq quality control and contamination detection.

Overview

Conterminator is a production-ready bioinformatics pipeline designed to perform comprehensive quality control and contamination detection on RNA-seq data. The pipeline combines multiple tools to identify potential contamination sources in sequencing data, including cross-strain contamination, bacterial/fungal/viral contamination, and technical artifacts.

Key Features

Flexible input handling: Direct file paths for FASTQ, pre-aligned BAM, or unmapped reads
Strain-specific alignment with STAR using custom pseudogenomes or reference genome
Multi-database contamination screening via BLAST
Microbial contamination detection using DecontaMiner
Comprehensive QC suite: FastQC, FastQ Screen, Qualimap, DeepTools, Picard, BEDTools
Singularity container with all tools bundled for reproducible execution
SLURM cluster support with intelligent job scheduling and resource management
Automatic retry with dynamic memory scaling on out-of-memory errors
Flexible subsampling for performance optimization
Interactive visualizations and MultiQC reporting
Per-sample strain assignment with optional default (GRCm39) for mixed experiments
Highly configurable with sensible defaults

Requirements

Software Dependencies

Nextflow ≥ 23.04.0
Java ≥ 11
Singularity ≥ 3.5 (optional, for containerized execution)

To install Nextflow, you can run:

curl -s https://get.nextflow.io | bash

If you get an error that your Java is not up-to-date, read below on how to install it.

To install Java 24, you can run the following commands:

cd
wget https://download.oracle.com/java/24/archive/jdk-24.0.2_linux-x64_bin.tar.gz
tar -zxf jdk-24.0.2_linux-x64_bin.tar.gz
echo 'export JAVA_CMD="/home/${USER%%@*}/jdk-24.0.2/bin/java"' >> ~/.bashrc
echo 'export JAVA_HOME="/home/${USER%%@*}/jdk-24.0.2"' >> ~/.bashrc
echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> ~/.bashrc
source .bashrc

Note: When using Singularity, Python and R dependencies are bundled in the container.

Required Tools

The pipeline uses the following bioinformatics tools (paths configurable in nextflow.config):

Tool	Purpose	Version
STAR	RNA-seq alignment	2.7.11b
Samtools	BAM manipulation	1.22.1
FastQC	Quality control	0.12.1
FastQ Screen	Contamination screening	0.16.0
Qualimap	BAM quality control	2.3
Picard	GC bias analysis	3.4.0
DeepTools	GC bias computation	3.5.6
BEDTools	Coverage analysis	2.31.1
BLAST+	Sequence alignment	2.17.0+
DecontaMiner	Microbial detection	1.4
Seqtk	FASTQ subsampling	1.5-r133
BBMap	Format conversion	39.37
MultiQC	Report aggregation	1.31

Reference Data

Strain-specific pseudogenomes and annotations
STAR genome indices (auto-generated if missing)
BLAST databases for contamination screening
FastQ Screen configuration and indices

Installation

Clone the Repository

git clone git@github.com:Z-Zen/Conterminator.git
cd Conterminator

Tools

Option 1: download pre-built tools container (Recommended)

The container includes all required bioinformatics tools pre-configured.

# Download from Google Drive
pip install gdown
gdown 1a2CevfBkMUjSt5R4AnQyDX4ofAHBZmfZ

# Test the build
singularity test conterminator.sif

Option 2: build Singularity container (requires sudo)

Build the Singularity container with all dependencies bundled:

# Build the container
sudo singularity build conterminator.sif conterminator.def

# Test the build
singularity test conterminator.sif

Option 3: native installation

Install all required tools manually and configure paths in nextflow.config.

Edit nextflow.config to specify paths to installed tools and reference data:

params {
    // Tool paths
    star_bin = "/path/to/STAR"
    samtools_bin = "/path/to/samtools"
    fastqc_bin = "/path/to/fastqc"
    // ... (see nextflow.config for all options)
    
    // Reference directories
    strains_base_dir = "/path/to/pseudogenomes"          // Strain-specific pseudogenomes
    standard_references_dir = "/path/to/references/Mus"  // Standard reference genomes (GRCm39, etc.)
    star_index_dir = "/path/to/star/indices"
    contamination_blast_dbs = "/path/to/blast/databases"
}

Test Installation

# Native installation
nextflow run main.nf --help

# With Singularity
nextflow run main.nf --singularity_path /path/to/conterminator.sif -profile singularity --help

# With Singularity + Slurm
nextflow run main.nf --singularity_path /path/to/conterminator.sif -profile singularity_hpc,slurm --help

Quick Start

Step 1: Prepare Sample Sheet

Create a tab-separated file (e.g., samples.tsv):

sample	strain	type	read1	read2
sample1	C57BL_6J	fastq	/data/sample1_R1.fq.gz	/data/sample1_R2.fq.gz
sample2	C57BL_6J	fastq	/data/sample2_R1.fq.gz	/data/sample2_R2.fq.gz

Step 2: Run Pipeline

Minimal Example (Local)

nextflow run main.nf \
--sample_sheet samples.tsv \
--outdir results \
-bg &> results.log

With Singularity Container

nextflow run main.nf \
-profile singularity \
--singularity_path /path/to/conterminator.sif \
--sample_sheet samples.tsv \
--outdir results \
-bg &> results.log

On SLURM Cluster with Singularity (Recommended)

When running the pipeline on SLURM, you cannot provide data located on the RCP storage or archive. You will need to copy some dependencies from the lispserver before you can run the pipeline.

Copy them to either /scratch/[username], /export/lisp/[username], or /work/lisp/[username]. Data copied on scratch stays for 30 days before they are deleted. Other spaces cost money to put data on them.

For example, you can copy the GRCm39 genome from the lispserver to /scratch with the following command:

cd /scratch/`whoami`/
mkdir -p Data/Mus/GRCm39 blast_databases

cd /scratch/`whoami`/Data/Mus/GRCm39
# enter password when prompted from the command below
scp `whoami`@lispserver.rcp.epfl.ch:/mnt/sas/Data/References/Mus/GRCm39/* .

cd /scratch/`whoami`/blast_databases/
# enter password when prompted from the command below
scp `whoami`@lispserver.rcp.epfl.ch:/mnt/sas/Tools/blast_databases/* .

Then, you can run the pipeline using:

nextflow run main.nf \
-profile slurm,singularity_hpc \
--singularity_path conterminator.sif \
--sample_sheet samples.tsv \
--outdir /scratch/`whoami`/myproject_results/ \
--star_index_dir /scratch/`whoami`/Data/Mus \
--strains_base_dir /scratch/`whoami`/Data/Mus \
--standard_references_dir /scratch/`whoami`/Data/Mus \
--contamination_blast_dbs /scratch/`whoami`/blast_databases/ \
-bg &> ~/myproject_results.log

Input Format

Sample Sheet (Required)

The pipeline requires a tab-separated sample sheet file specifying samples, input types, and file paths.

Format Specification

Create a TSV file (e.g., samples.tsv) with 5 columns:

sample	strain	type	read1	read2
sample1	C57BL_6J	fastq	/path/to/sample1_R1.fastq.gz	/path/to/sample1_R2.fastq.gz
sample2	C57BL_6J	bam	/path/to/sample2.bam
sample3	DBA_2J	unmapped_fastq	/path/to/sample3_unmapped_R1.fastq.gz	/path/to/sample3_unmapped_R2.fastq.gz
sample4		fastq	/path/to/sample4_R1.fastq.gz	/path/to/sample4_R2.fastq.gz
sample5		bam	/path/to/sample5.bam

Column Descriptions

Column	Required	Description	Values/Examples
sample	Yes	Unique sample identifier	`sample1`, `exp_001`, `ctrl-A`
strain	No	Reference strain name (defaults to `GRCm39` if empty)	`C57BL_6J`, `DBA_2J`, `BALB_cJ`, or leave empty
type	Yes	Input data type	`fastq`, `bam`, or `unmapped_fastq`
read1	Yes	Path to first read file or BAM file	`/path/to/sample_R1.fastq.gz` or `/path/to/sample.bam`
read2	Conditional	Path to second read file (required for `fastq` and `unmapped_fastq`, empty for `bam`)	`/path/to/sample_R2.fastq.gz` or empty

Input Type Definitions

Type	Description	Use Case	Required Columns
fastq	Paired-end FASTQ files for alignment	Raw sequencing data to be aligned with STAR	`read1`, `read2`
bam	Pre-aligned BAM file	Already aligned data, skip STAR alignment	`read1` only
unmapped_fastq	Unmapped reads in FASTQ format	Already extracted unmapped reads for contamination analysis	`read1`, `read2`

Requirements and Rules

Header row is mandatory - First line must be: sample strain type read1 read2
Tab-separated format - Columns must be separated by tabs (not spaces)
Five columns required - All 5 columns must be present in header
Unique sample IDs - Each sample name must be unique across the sheet
Valid type values - Type must be one of: fastq, bam, or unmapped_fastq
Strain handling:
- If strain is empty or not specified, defaults to GRCm39
- If strain is specified, it must exist in params.strains_base_dir
read2 rules:
- For type=fastq: read2 is required (paired-end reads)
- For type=unmapped_fastq: read2 is required (paired-end reads)
- For type=bam: read2 must be empty (BAM files don't have separate read files)
File paths - All file paths in read1 and read2 must exist and be accessible

Comprehensive Examples

Example 1: Mixed input types

sample	strain	type	read1	read2
WT_rep1	C57BL_6J	fastq	/data/exp1/WT_rep1_R1.fq.gz	/data/exp1/WT_rep1_R2.fq.gz
WT_rep2	C57BL_6J	fastq	/data/exp1/WT_rep2_R1.fq.gz	/data/exp1/WT_rep2_R2.fq.gz
KO_rep1	DBA_2J	bam	/data/exp1/KO_rep1_aligned.bam
archived_sample		bam	/archive/old_alignment.bam
contaminated_sample	BALB_cJ	unmapped_fastq	/data/unmapped/sample_R1.fq.gz	/data/unmapped/sample_R2.fq.gz

Example 2: All FASTQ inputs (standard RNA-seq workflow)

sample	strain	type	read1	read2
ctrl_1	C57BL_6J	fastq	/mnt/data/ctrl_1_R1.fastq.gz	/mnt/data/ctrl_1_R2.fastq.gz
ctrl_2	C57BL_6J	fastq	/mnt/data/ctrl_2_R1.fastq.gz	/mnt/data/ctrl_2_R2.fastq.gz
treat_1	C57BL_6J	fastq	/mnt/data/treat_1_R1.fastq.gz	/mnt/data/treat_1_R2.fastq.gz
treat_2	C57BL_6J	fastq	/mnt/data/treat_2_R1.fastq.gz	/mnt/data/treat_2_R2.fastq.gz

Example 3: Pre-aligned BAM files (QC only)

sample	strain	type	read1	read2
sample1	C57BL_6J	bam	/alignments/sample1.bam
sample2	DBA_2J	bam	/alignments/sample2.bam
sample3		bam	/alignments/sample3.bam

Example 4: Using default strain (GRCm39)

sample	strain	type	read1	read2
exp001		fastq	/data/exp001_R1.fq.gz	/data/exp001_R2.fq.gz
exp002		fastq	/data/exp002_R1.fq.gz	/data/exp002_R2.fq.gz
exp003		bam	/data/exp003.bam

Strain Reference Requirements

For each strain specified in the sample sheet (or the default GRCm39), the following files must exist in params.strains_base_dir:

strains_base_dir/
└── STRAIN_NAME/
    ├── *pseudogenome__strain_STRAIN_NAME.fa.gz    # FASTA reference
    └── *pseudogenome__strain_STRAIN_NAME.gtf.gz   # Gene annotations

Example for C57BL_6J strain:

/path/to/strains_base_dir/C57BL_6J/
├── mm39_pseudogenome__strain_C57BL_6J.fa.gz
└── mm39_pseudogenome__strain_C57BL_6J.gtf.gz

Example for default GRCm39 strain:

/path/to/strains_base_dir/GRCm39/
├── GRCm39.genome.fa.gz
└── gencode.vM35.primary_assembly.annotation.gtf.gz

Note on Standard Reference Genomes (v1.2+): The pipeline now supports storing standard reference genomes (like GRCm39, GRCm38) in a separate directory specified by params.standard_references_dir (default: /mnt/sas/Data/References/Mus). The pipeline will automatically search both strains_base_dir (for strain-specific pseudogenomes) and standard_references_dir (for standard references) when resolving strain names. This allows you to maintain your existing HDP pseudogenome collection while using standard Ensembl/GENCODE references without reorganizing your directory structure.

Usage

Key Parameters

Input/Output

--sample_sheet <file>      # Sample sheet with 5 columns: sample, strain, type, read1, read2 (required)
--outdir <path>            # Output directory (default: results)

Subsampling (Performance Tuning)

--subset_for_fastq_qc true          # Subsample for FastQ QC (default: true)
--subset_fastq_qc_reads 100000      # Reads for QC (default: 100k)

--subset_for_star false             # Subsample for STAR (default: false)
--subset_star_reads 100000          # Reads for STAR if enabled

--subset_bam_for_qc true            # Subsample BAMs for QC (default: true)
--bam_qc_subset_mapped 200000       # Mapped reads for BAM QC (default: 200k)

--subset_unmapped_for_blast true    # Subsample for BLAST (default: true)
--unmapped_subset_reads 100000      # Unmapped reads for BLAST (default: 100k)

Tool Toggles

--run_star_alignment true     # Run STAR alignment (default: true)
--run_fastqc true             # Run FastQC (default: true)
--run_fastq_screen true       # Run FastQ Screen (default: true)
--run_deeptools true          # Run DeepTools GC bias (default: true)
--run_picard_gc true          # Run Picard GC bias (default: true)
--run_bedtools_gc true        # Run BEDTools coverage (default: true)
--run_qualimap true           # Run Qualimap (default: true)
--run_mapinsights true        # Run MapInsights (default: true)
--run_decontaminer true       # Run DecontaMiner (default: true)
--run_contamination_check true # Run BLAST contamination (default: true)
--run_multiqc true            # Run MultiQC (default: true)

Resources

--max_cpus 16              # Maximum CPUs per process
--max_mem "32 GB"          # Maximum memory per process
--max_time "24h"           # Maximum time per process
--max_parallel_samples 4   # Max samples in parallel

Configuration

Qualimap Settings

--qualimap_mode "both"                    # "bamqc", "rnaseq", or "both"
--qualimap_genome "mm10"                  # Reference genome name
--qualimap_protocol "strand-specific-reverse"
--qualimap_threads 8

Contamination Detection

--contamination_blast_dbs "/path/to/blast/dbs"  # BLAST database directory
--contamination_evalue "1e-10"                   # E-value threshold
--blast_max_parallel 10                          # Max parallel BLAST jobs

DecontaMiner Settings

--decontaminer_config "/path/to/config.txt"
--decontaminer_organisms "bfv"          # b=bacteria, f=fungi, v=viruses
--decontaminer_pairing "P"              # P=paired, S=single
--decontaminer_quality_filter "yes"
--decontaminer_ribo_filter "yes"

Output Structure

results/
├── Input/
│   ├── reference/              # Prepared references
│   ├── strain_references/      # Strain-specific files
│   ├── subsampled_bams/        # Subsampled BAM files
│   ├── subsampled_fastq/       # Subsampled FASTQ files
│   └── sample_sheet.tsv        # Copy of input sample sheet
├── Output/
│   ├── bedtools_gc/            # GC content coverage
│   ├── contamination_check/    # BLAST results and plots
│   ├── decontaminer/           # DecontaMiner reports
│   ├── deeptools_gc_bias/      # DeepTools GC bias
│   ├── fastqc/                 # FastQC reports
│   ├── fastq_screen/           # FastQ Screen results
│   ├── mapinsights/            # MapInsights reports
│   ├── multiqc/                # MultiQC aggregate report
│   ├── picard_gc_bias/         # Picard GC metrics
│   └── qualimap/               # Qualimap QC
├── Temporary/
│   ├── decontaminer/           # Intermediate files
│   └── star_alignment/         # STAR outputs
├── NextflowReports/            # Internal reports by Nextflow
├── pid.txt                     # Contains Nextflow PIDs
└── pipeline_info.txt           # Run metadata

Troubleshooting

Common Issues

Issue: STAR index not found

# The pipeline will auto-build missing indices
# Ensure strains_base_dir contains:
# strains_base_dir/STRAIN_NAME/*pseudogenome__strain_STRAIN_NAME.fa.gz
# strains_base_dir/STRAIN_NAME/*pseudogenome__strain_STRAIN_NAME.gtf.gz

Issue: Out of memory errors

# Increase memory allocation
--max_mem "64 GB"

# Enable more aggressive subsampling
--subset_bam_for_qc true
--bam_qc_subset_mapped 100000

Issue: Pipeline hangs during BLAST

# Reduce parallel BLAST jobs
--blast_max_parallel 5

# Enable BLAST subsampling
--subset_unmapped_for_blast true
--unmapped_subset_reads 50000

Nextflow tips

Resuming failed runs

If your run has failed due to an error, you can resume it after fixing the problem by running the following:

# Get all the runs launched by nextflow
nextflow log

# Copy the session id of the run you want to resume. It looks like this: 2b29d621-8eff-400c-83a2-05146c4f6131

# Resume your pipeline by using exactly the same command as before but by adding -resume [session id]
# For example, if your command was
nextflow run main.nf --outdir myresults --sample_sheet samples.tsv -profile singularity,slurm
# It will become
nextflow run main.nf --outdir myresults --sample_sheet samples.tsv -profile singularity,slurm -resume 2b29d621-8eff-400c-83a2-05146c4f6131

Run Nextflow in the background

You can simply run nextflow in the background using -bg:

nextflow run main.nf --outdir myresults --sample_sheet samples.tsv -profile singularity,slurm -bg &> myrun.log

Stopping background run

You can find the latest running pid in the pid.txt in the root of output directory.

cat pid.txt
# ---------------------
# 4149804
# nextflow run main.nf --sample_sheet input_test/sample_sheet.tsv --outdir results_test --singularity_path /mnt/sas/Users/abadreddine/Projects/Conterminator/conterminator.sif -profile singularity -resume 2b29d621-8eff-400c-83a2-05146c4f6131
# Start: 07-Nov-2025 01:25:30

# Then you can use the PID number to kill the running job
kill 4149804

Cleaning the work directory

Nextflow works by running his processes in a folder called work in your current path. The work directory is important to resume your jobs. If this folder gets deleted, you cannot resume your jobs anymore.

However, once your analysis has finished, you can clean your working directory by running:

rm -rf work .nextflow.log*
# OR 
nextflow clean -f $(nextflow log -q)

Contact

Author: Alaa Badreddine

Project Repository: https://github.com/Z-Zen/Conterminator

For bug reports and feature requests, please open an issue on GitLab.

Version History

v1.3 (March 2026) - Remove DecontaMiner, eliminate hardcoded paths, add tests:
- Remove DecontaMiner entirely (5 processes, all params, config, workflow logic)
- Replace all hardcoded /mnt/sas tool/reference paths with null defaults
- Add centralized parameter validation with clear error messages
- Simplify tool path resolution: use bare command names via $PATH in containers, with optional param overrides for local installs
- Fix Picard to use wrapper command instead of java -jar with bare filename
- Support both compressed and uncompressed FASTA/GTF in reference discovery
- Fix STAR index detection to check star_index/ subdirectory
- Parse sample sheet independently of STAR alignment toggle
- Update all URLs from GitLab to GitHub
- Add sample sheet JSON schema (assets/samplesheet_schema.json)
- Add test suite with synthetic data and validation tests
- Add test profile to nextflow.config
v1.2 (December 2025) - Enhanced reference genome support:
- Dual directory support for reference genomes:
  - strains_base_dir for strain-specific pseudogenomes
  - standard_references_dir for standard reference genomes (GRCm39, GRCm38, etc.)
- Automatic directory resolution - pipeline searches both locations for each strain
- Flexible reference file pattern matching for standard genomes
- Fixed WRITE_PIPELINE_INFO staging to avoid file collision errors
- Pipeline configuration files (main.nf, nextflow.config, conterminator.def) now copied to output for reproducibility
- Dynamic user path support using ${System.getProperty('user.home')}
v1.1 (November 2025) - Major updates:
- Full Singularity container support with all tools bundled
- SLURM cluster execution with job scheduling
- Per-process resource configuration with automatic retry on OOM
- Dynamic memory scaling (2×/3× on retry)
- Parallelization control with maxForks parameter
- Enhanced error handling with intelligent retry strategy
- Singularity information in pipeline reports
- Improved channel scoping for complex workflows
- New flexible sample sheet format supporting multiple input types:
- Direct FASTQ file paths (no directory scanning required)
- Pre-aligned BAM files for QC-only workflows
- Unmapped FASTQ files for contamination-only analysis
- Optional strain specification with GRCm39 default
v1.0 (2025) - Initial release with full QC and contamination detection suite

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
assets		assets
bin		bin
tests		tests
.gitlab-ci.yml		.gitlab-ci.yml
README.md		README.md
conterminator.def		conterminator.def
conterminator.png		conterminator.png
main.nf		main.nf
name.txt		name.txt
nextflow.config		nextflow.config
sample_sheet_example.tsv		sample_sheet_example.tsv

Folders and files

Latest commit

History

Repository files navigation

Overview

Key Features

Table of Contents

Requirements

Software Dependencies

Required Tools

Reference Data

Installation

Clone the Repository

Tools

Option 1: download pre-built tools container (Recommended)

Option 2: build Singularity container (requires sudo)

Option 3: native installation

Test Installation

Quick Start

Step 1: Prepare Sample Sheet

Step 2: Run Pipeline

Minimal Example (Local)

With Singularity Container

On SLURM Cluster with Singularity (Recommended)

Input Format

Sample Sheet (Required)

Format Specification

Column Descriptions

Input Type Definitions

Requirements and Rules

Comprehensive Examples

Strain Reference Requirements

Usage

Key Parameters

Input/Output

Subsampling (Performance Tuning)

Tool Toggles

Resources

Configuration

Qualimap Settings

Contamination Detection

DecontaMiner Settings

Output Structure

Troubleshooting

Common Issues

Nextflow tips

Resuming failed runs

Run Nextflow in the background

Stopping background run

Cleaning the work directory

Contact

Version History

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages