ECCFP (EccDNA Caller based on Consecutive Full Pass) is a comprehensive, end-to-end bioinformatics workflow for identifying extrachromosomal circular DNA (eccDNA) from long-read Nanopore sequencing data. This repository provides the complete implementation of the protocol described in the manuscript "A bioinformatics workflow to identify eccDNA using ECCFP from long-read Nanopore sequencing data".
The workflow integrates quality control, adapter trimming, genome alignment, and eccDNA detection into a streamlined pipeline optimized for Nanopore data.
- README.md: Main documentation file for the ECCFP workflow
- LICENSE: MIT License file specifying the open-source license terms
- setup_workspace.sh: One-click workspace setup script that creates the working directory structure
The scripts/ directory contains all executable scripts for installing, running, and testing the ECCFP workflow.
- install_all.sh: Complete software installation script that sets up the Conda environment and installs all dependencies
- run_test.sh: Test script that validates the installation using the provided test data
Contains small datasets for quick validation and testing of the workflow:
- test.fastq: Small test FASTQ file for validating the complete workflow
- GRCh38_chr21.fasta: Reference genome containing only chromosome 21 for quick testing
This directory is created automatically when you run setup_workspace.sh:
- RawData/: Place your own FASTQ files here for analysis
- Reference/: Store reference genome files (FASTA format) here
- Software/: Contains installed software tools and dependencies
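As a quick sanity check after setup, the sketch below reports any of the three subdirectories listed above that are missing. The function name `check_workspace` is hypothetical and is not part of the repository scripts.

```shell
# Hypothetical helper: report expected workspace subdirectories that are missing.
# The directory names are taken from the layout described above.
check_workspace() {
    local ws="${1:-eccfpws}"
    local missing=0
    local d
    for d in RawData Reference Software; do
        if [ ! -d "$ws/$d" ]; then
            echo "missing: $ws/$d"
            missing=1
        fi
    done
    # Return status is 0 when everything is present, 1 otherwise
    return "$missing"
}
```

Calling `check_workspace eccfpws` from the repository root prints nothing and returns 0 when the layout is complete.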
- Initial Setup: Run bash setup_workspace.sh to create the eccfpws/ working directory
- Software Installation: Navigate to scripts/ and run bash install_all.sh to install all dependencies
- Validation Test: Run bash scripts/run_test.sh to verify the installation using test data
- Your Own Analysis: Place your FASTQ files in eccfpws/RawData/ and run the workflow scripts
git clone https://github.com/WSG-Lab/ECCFP_Workflow.git
cd ECCFP_Workflow
# Create the working directory structure
bash setup_workspace.sh
# Navigate to the scripts directory
cd scripts
# Run the complete installation script
bash install_all.sh
Important: The installation script will:
- Install Conda (if not already installed)
- Create a Conda environment named eccfpEnv
- Install all required software with exact versions matching the manuscript
- Verify all installations
# Run the test script to verify everything works
bash run_test.sh
The test script will:
- Validate the environment and software versions
- Run the complete workflow on a small test dataset
- Generate a test report
- Verify that all outputs are generated correctly
Expected Output: All steps should complete successfully with a final message indicating test completion.
Important: All following steps should be performed within the eccfpws working directory and with the eccfpEnv Conda environment activated:
# Activate the Conda environment
conda activate eccfpEnv
# Navigate to the working directory
cd eccfpws
Running the Complete Workflow
To analyze your own data, follow these steps:
# Place your FASTQ files in the RawData directory (paths are relative to eccfpws)
cp /path/to/your/data/*.fastq RawData/
# Place your reference genome in the Reference directory
cp /path/to/your/reference.fasta Reference/
Command (single sample):
mkdir -p 01QualityControl
NanoPlot --fastq RawData/HRR2590080.fastq \
-o 01QualityControl/HRR2590080 \
-t 16 \
--plots hex dot
Command (batch processing):
mkdir -p 01QualityControl
for fastq in RawData/*.fastq RawData/*.fq; do
    [ -e "$fastq" ] || continue    # skip patterns that matched no files
    prefix=$(basename "$fastq")
    prefix=${prefix%.fastq}        # strip whichever extension is present
    prefix=${prefix%.fq}
    NanoPlot --fastq "$fastq" \
        -o "01QualityControl/$prefix" \
        -t 16 \
        --plots hex dot
done
Expected Output:
- 01QualityControl/[sample]/NanoPlot-report.html - Interactive HTML report
- 01QualityControl/[sample]/LengthvsQualityScatterPlot_dot.png - Read length vs quality scatter plot
- 01QualityControl/[sample]/NonWeightedHistogramReadLength.png - Read length distribution
Quality Assessment:
- Check read length distribution (should approximate log-normal)
- Verify average read quality > Q7
- Ensure sufficient sequencing depth
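For a quick look at sequencing depth without opening the NanoPlot report, the following sketch counts reads and bases directly from a FASTQ file. It assumes an uncompressed file with exactly four lines per record; the function name `fastq_stats` is hypothetical.

```shell
# Hypothetical helper: read count, total bases, and mean read length.
# Assumes an uncompressed FASTQ with four lines per record.
fastq_stats() {
    awk 'NR % 4 == 2 { n++; bases += length($0) }   # line 2 of each record is the sequence
         END {
             mean = (n > 0) ? bases / n : 0
             printf "reads=%d bases=%.0f mean_length=%.1f\n", n, bases, mean
         }' "$1"
}
```

For example, `fastq_stats RawData/HRR2590080.fastq` prints a single summary line for that sample.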
Removes adapter and barcode sequences that can interfere with alignment and eccDNA detection.
Command (single sample):
mkdir -p 02TrimAdapters
porechop -i RawData/HRR2590080.fastq \
-o 02TrimAdapters/HRR2590080_clean.fastq \
--extra_end_trim 0 \
--discard_middle \
-t 16
Command (batch processing):
mkdir -p 02TrimAdapters
for fastq in RawData/*.fastq RawData/*.fq; do
    [ -e "$fastq" ] || continue    # skip patterns that matched no files
    prefix=$(basename "$fastq")
    prefix=${prefix%.fastq}        # strip whichever extension is present
    prefix=${prefix%.fq}
    porechop -i "$fastq" \
        -o "02TrimAdapters/${prefix}_clean.fastq" \
        --extra_end_trim 0 \
        --discard_middle \
        -t 16
done
Parameters Explained:
- --extra_end_trim 0: No additional trimming beyond adapters
- --discard_middle: Discard reads that contain adapters in the middle (by default, Porechop splits such reads). This is required for reads that will be used with Nanopolish, and it is on by default when reads are output into barcode bins
- -t 16: Use 16 threads for faster processing
Re-evaluate data quality after adapter removal to confirm that trimming was successful.
Command:
mkdir -p 03QualityCheck
for fastq in 02TrimAdapters/*_clean.fastq; do
    [ -e "$fastq" ] || continue
    # Strip the directory and the _clean.fastq suffix to recover the sample name
    prefix=$(basename "$fastq" _clean.fastq)
    NanoPlot --fastq "$fastq" \
        -o "03QualityCheck/$prefix" \
        -t 16 \
        --plots hex dot
done
What to Check:
- Compare pre- and post-trimming quality metrics
- Verify adapter removal (should see improved quality scores)
- Ensure no significant data loss (<10% reads trimmed is normal)
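The data-loss check above can be approximated by comparing read counts before and after trimming. This is a minimal sketch, assuming uncompressed four-line FASTQ files and interpreting "data loss" as the percentage of reads removed; the function names are hypothetical.

```shell
# Hypothetical helpers: percentage of reads removed between raw and trimmed FASTQ.
# Assumes uncompressed FASTQ files with four lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

read_loss_pct() {
    local raw clean
    raw=$(count_reads "$1")
    clean=$(count_reads "$2")
    awk -v r="$raw" -v c="$clean" 'BEGIN {
        if (r == 0) { print "NA"; exit }   # avoid division by zero on empty input
        printf "%.2f\n", 100 * (r - c) / r
    }'
}
```

For example, `read_loss_pct RawData/HRR2590080.fastq 02TrimAdapters/HRR2590080_clean.fastq` prints the percentage of reads discarded by Porechop for that sample.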
Align trimmed reads to a reference genome to determine their genomic locations.
First, build the reference index:
minimap2 -d Reference/GRCh38.p14.genome.fa.mmi Reference/GRCh38.p14.genome.fa
Then align reads (single sample):
mkdir -p 04MappingGenome
minimap2 -cx map-ont \
Reference/GRCh38.p14.genome.fa.mmi \
02TrimAdapters/HRR2590080_clean.fastq \
--secondary=no \
-t 16 > 04MappingGenome/HRR2590080.paf
Command (batch processing):
mkdir -p 04MappingGenome
for fastq in 02TrimAdapters/*_clean.fastq; do
    [ -e "$fastq" ] || continue
    # Strip the directory and the _clean.fastq suffix to recover the sample name
    prefix=$(basename "$fastq" _clean.fastq)
    minimap2 -cx map-ont \
        Reference/GRCh38.p14.genome.fa.mmi \
        "$fastq" \
        --secondary=no \
        -t 16 > "04MappingGenome/${prefix}.paf"
done
Parameters Explained:
- -cx map-ont: -x map-ont selects the preset for Oxford Nanopore reads; -c writes CIGAR strings to the PAF output
- --secondary=no: Do not output secondary alignments
- -t 16: Use 16 threads
Expected Mapping Rates:
- Human samples to hg38: >95% (optimal), >85% (acceptable)
- Low mapping rates may indicate contamination or reference mismatch
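One way to estimate the mapping rate is to divide the number of distinct read names in the PAF file by the number of reads in the trimmed FASTQ. The sketch below relies on the first PAF column being the query name and on minimap2 not writing unmapped reads to PAF by default; it also assumes an uncompressed four-line FASTQ. The function name `mapping_rate` is hypothetical.

```shell
# Hypothetical helper: percentage of reads with at least one alignment.
# A read can appear on several PAF lines, so query names (column 1)
# are de-duplicated before counting.
mapping_rate() {
    local paf="$1"
    local fastq="$2"
    local mapped total
    mapped=$(cut -f 1 "$paf" | sort -u | wc -l)
    total=$(awk 'END { print int(NR / 4) }' "$fastq")
    awk -v m="$mapped" -v t="$total" 'BEGIN {
        if (t == 0) { print "NA"; exit }   # avoid division by zero on empty input
        printf "%.2f\n", 100 * m / t
    }'
}
```

For example, `mapping_rate 04MappingGenome/HRR2590080.paf 02TrimAdapters/HRR2590080_clean.fastq` prints the percentage of mapped reads for that sample.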
The core step that identifies circular DNA from aligned reads.
Command (single sample):
mkdir -p 05IdentifyingEccDNA
eccfp --fastq 02TrimAdapters/HRR2590080_clean.fastq \
--paf 04MappingGenome/HRR2590080.paf \
--output 05IdentifyingEccDNA/HRR2590080 \
--reference Reference/GRCh38.p14.genome.fa
Command (batch processing):
mkdir -p 05IdentifyingEccDNA
for paf in 04MappingGenome/*.paf; do
    [ -e "$paf" ] || continue
    # Strip the directory and the .paf suffix to recover the sample name
    prefix=$(basename "$paf" .paf)
    eccfp --fastq "02TrimAdapters/${prefix}_clean.fastq" \
        --paf "$paf" \
        --reference Reference/GRCh38.p14.genome.fa \
        --output "05IdentifyingEccDNA/${prefix}"
done
Output Files:
05IdentifyingEccDNA/[sample]/
├── final_eccDNA.csv            # Final eccDNA list with genomic coordinates
├── consensus_sequence.fasta    # Consensus sequences for each eccDNA
├── variants.csv                # Mutations/variants within eccDNA regions
├── unit.csv                    # Candidate eccDNAs within individual reads
└── candidate_consolidated.csv  # Consolidated candidate information
final_eccDNA.csv
The main results file, with the following columns:
| Column | Description |
|---|---|
| eccDNApos | Genomic position of the eccDNA |
| Nfullpass | Total number of consecutive full passes covering this eccDNA across all reads |
| Nfragments | Number of fragments that form this eccDNA |
| Nreads | Number of reads in which this eccDNA was identified |
| refLength | Length of the reference genome region spanned by this eccDNA |
| seqLength | Length of the consensus sequence of this eccDNA |
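To keep only well-supported calls, the table can be filtered on Nfullpass. The sketch below assumes that final_eccDNA.csv is comma-separated, has a header row using the column names listed above, and has no quoted fields containing commas; none of this is confirmed by the source. The function name `filter_by_fullpass` and the threshold are illustrative.

```shell
# Hypothetical helper: keep the header plus rows with Nfullpass >= a threshold.
# The column is located by name because the column order is not stated above.
filter_by_fullpass() {
    awk -F ',' -v min="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "Nfullpass") col = i
            if (!col) { print "Nfullpass column not found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        $col + 0 >= min + 0 { print }
    ' "$1"
}
```

For example, `filter_by_fullpass 05IdentifyingEccDNA/HRR2590080/final_eccDNA.csv 2` prints the header and every eccDNA supported by at least two full passes.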
consensus_sequence.fasta
FASTA-format consensus sequences for each identified eccDNA:
>chrM_201_299_+
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACC
>chrM_2614_2766_-
GTTAGGTACTGTTTGCATTAATAAATTAAAGCTCCATAGGGTCTTCTCGTCTTGCTGTGTCATGCCCGCCTCTTCACGGGCAGGTCAATTTCACTGGTTAAAAGTAAGAGACAGCTGAACCCTCGTGGAGCCATTCATACAGGTCCCTATTTA
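The consensus lengths can be tabulated directly from the FASTA file, for example to compare them with seqLength in final_eccDNA.csv. This is a minimal sketch that handles both single-line and wrapped sequences; the function name `fasta_lengths` is hypothetical.

```shell
# Hypothetical helper: print "<record name><TAB><sequence length>" per FASTA record.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($0, 2)                  # drop the leading ">"
            len = 0
            next
        }
        { len += length($0) }                     # sequence may span several lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

For example, `fasta_lengths 05IdentifyingEccDNA/HRR2590080/consensus_sequence.fasta` prints one line per eccDNA.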
variants.csv
Mutation information within eccDNA regions:
| Column | Description |
|---|---|
| col1 | chromosome |
| col2 | position in the reference genome |
| col3 | reference base |
| col4 | variant |
| col5 | supportive coverage depth |
| col6 | total coverage depth |
| col7 | type |
| col8 | eccDNApos |
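A variant's allele fraction can be derived from the supportive depth (col5) and the total depth (col6). The sketch below assumes variants.csv is comma-separated with the eight columns in the order listed above; whether the file has a header row is not stated, so rows whose sixth field is not a positive number are skipped. The function name `variant_allele_fraction` is hypothetical.

```shell
# Hypothetical helper: print chromosome, position, reference base, variant,
# and allele fraction (col5 / col6) for each variant row.
# A header row, if present, is skipped because its col6 is not a positive number.
variant_allele_fraction() {
    awk -F ',' '
        $6 + 0 > 0 { printf "%s,%s,%s,%s,%.3f\n", $1, $2, $3, $4, $5 / $6 }
    ' "$1"
}
```

For example, `variant_allele_fraction 05IdentifyingEccDNA/HRR2590080/variants.csv` prints one line per variant with its allele fraction in the last field.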