ECCFP: A Bioinformatics Workflow for eccDNA Identification from Nanopore Sequencing Data

📋 Overview

ECCFP (EccDNA Caller based on Consecutive Full Pass) is a comprehensive, end-to-end bioinformatics workflow for identifying extrachromosomal circular DNA (eccDNA) from long-read Nanopore sequencing data. This repository provides the complete implementation of the protocol described in the manuscript "A bioinformatics workflow to identify eccDNA using ECCFP from long-read Nanopore sequencing data".

The workflow integrates quality control, adapter trimming, genome alignment, and eccDNA detection into a streamlined pipeline optimized for Nanopore data.

📂 Repository Structure

Root Level Files

README.md - Main documentation file for the ECCFP workflow
LICENSE - MIT License file specifying the open-source license terms
setup_workspace.sh - One-click workspace setup script that creates the working directory structure

Core Scripts Directory (`scripts/`)

The scripts/ directory contains all executable scripts for installing, running, and testing the ECCFP workflow.

Main Installation and Test Scripts

install_all.sh - Complete software installation script that sets up Conda environment and installs all dependencies
run_test.sh - Test script that validates the installation using provided test data

Test Data Directory (`test_data/`)

Contains small datasets for quick validation and testing of the workflow:

test.fastq - Small test FASTQ file for validating the complete workflow
GRCh38_chr21.fasta - Reference genome containing only chromosome 21 for quick testing

Working Directory (`eccfpws/`)

This directory is created automatically when you run setup_workspace.sh and contains:

RawData/ - Place your own FASTQ files here for analysis
Reference/ - Store reference genome files (FASTA format) here
Software/ - Contains installed software tools and dependencies
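
As a rough illustration of the layout described above, the following sketch creates the three documented subdirectories. The function name create_workspace is hypothetical, and the real setup_workspace.sh may do more than this.

```shell
# Sketch only: create the documented subdirectories under a workspace root.
# create_workspace is a hypothetical helper, not part of the repository.
create_workspace() {
    local root="${1:-eccfpws}"   # workspace root; defaults to ./eccfpws
    mkdir -p "$root/RawData" "$root/Reference" "$root/Software"
}

# Usage:
#   create_workspace              # creates ./eccfpws with its subdirectories
#   create_workspace /data/ws     # hypothetical custom location
```

Using mkdir -p keeps the helper safe to re-run, since existing directories are left untouched.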

How to Use This Structure

Initial Setup: Run bash setup_workspace.sh to create the eccfpws/ working directory
Software Installation: Navigate to scripts/ and run bash install_all.sh to install all dependencies
Validation Test: Run bash scripts/run_test.sh to verify the installation using test data
Your Own Analysis: Place your FASTQ files in eccfpws/RawData/ and run the workflow scripts

🚀 Quick Start

Step 1: Clone the Repository

git clone https://github.com/WSG-Lab/ECCFP_Workflow.git
cd ECCFP_Workflow

Step 2: Set Up the Workspace

# Create the working directory structure
bash setup_workspace.sh

Step 3: Install All Software and Dependencies

# Navigate to the scripts directory
cd scripts

# Run the complete installation script
bash install_all.sh

Important: The installation script will:

Install Conda (if not already installed)
Create a Conda environment named eccfpEnv
Install all required software with exact versions matching the manuscript
Verify all installations
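
If you want to double-check the installation by hand, a minimal sketch such as the one below reports which commands are missing from PATH. The helper name check_tools is hypothetical; the tool names in the usage comment are the commands used later in this README.

```shell
# Report which of the given commands are available on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (after activating the environment):
#   conda activate eccfpEnv
#   check_tools NanoPlot porechop minimap2 eccfp
```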

Step 4: Verify Installation with Test Data

# Run the test script to verify everything works
bash run_test.sh

The test script will:

Validate the environment and software versions
Run the complete workflow on a small test dataset
Generate a test report
Verify that all outputs are generated correctly

Expected Output: All steps should complete successfully with a final message indicating test completion.

📖 Detailed Usage Instructions

Important: All following steps should be performed within the eccfpws working directory and with the eccfpEnv Conda environment activated:

# Activate the Conda environment
conda activate eccfpEnv

# Navigate to the working directory
cd eccfpws

Running the Complete Workflow

To analyze your own data, follow these steps:

Step 1: Prepare Your Data

# Place your FASTQ files in the RawData directory (run from inside eccfpws/)
cp /path/to/your/data/*.fastq RawData/

# Place your reference genome in the Reference directory
cp /path/to/your/reference.fasta Reference/

Step 2: Data Quality Control using NanoPlot

Command (single sample):

mkdir -p 01QualityControl
NanoPlot --fastq RawData/HRR2590080.fastq \
         -o 01QualityControl/HRR2590080 \
         -t 16 \
         --plots hex dot

Command (batch processing):

mkdir -p 01QualityControl
for fastq in RawData/*.fastq RawData/*.fq; do
    [ -e "$fastq" ] || continue      # skip a pattern that matched no files
    prefix=$(basename "$fastq")
    prefix=${prefix%.fastq}          # strip whichever extension is present
    prefix=${prefix%.fq}
    NanoPlot --fastq "$fastq" \
             -o "01QualityControl/$prefix" \
             -t 16 \
             --plots hex dot
done

Expected Output:

01QualityControl/[sample]/NanoPlot-report.html - Interactive HTML report
01QualityControl/[sample]/LengthvsQualityScatterPlot_dot.png - Read length vs quality scatter plot
01QualityControl/[sample]/NonWeightedHistogramReadLength.png - Read length distribution

Quality Assessment:

Check read length distribution (should approximate log-normal)
Verify average read quality > Q7
Ensure sufficient sequencing depth
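
For a quick cross-check of read count and yield outside NanoPlot, the sketch below summarizes a FASTQ file with awk. It assumes the standard four-lines-per-record FASTQ layout; the function name fastq_stats is hypothetical, and the NanoPlot report remains the reference for quality metrics.

```shell
# Print read count, total bases, and mean read length for a FASTQ file.
# Assumes four lines per record (header, sequence, "+", qualities).
fastq_stats() {
    awk 'NR % 4 == 2 { n++; bases += length($0) }   # every 2nd line of a record is the sequence
         END {
             mean = (n > 0) ? bases / n : 0
             printf "reads=%d bases=%.0f mean_length=%.1f\n", n, bases, mean
         }' "$1"
}

# Usage:
#   fastq_stats RawData/HRR2590080.fastq
```

Dividing the total bases by the size of the reference gives a rough idea of sequencing depth.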

Step 3: Trim Adapters and Barcodes using Porechop

Removes adapter and barcode sequences that can interfere with alignment and eccDNA detection.

Command (single sample):

mkdir -p 02TrimAdapters
porechop -i RawData/HRR2590080.fastq \
         -o 02TrimAdapters/HRR2590080_clean.fastq \
         --extra_end_trim 0 \
         --discard_middle \
         -t 16

Command (batch processing):

mkdir -p 02TrimAdapters
for fastq in RawData/*.fastq RawData/*.fq; do
    [ -e "$fastq" ] || continue      # skip a pattern that matched no files
    prefix=$(basename "$fastq")
    prefix=${prefix%.fastq}          # strip whichever extension is present
    prefix=${prefix%.fq}
    porechop -i "$fastq" \
             -o "02TrimAdapters/${prefix}_clean.fastq" \
             --extra_end_trim 0 \
             --discard_middle \
             -t 16
done

Parameters Explained:

--extra_end_trim 0: No additional trimming beyond adapters
--discard_middle: Discard reads that contain adapters in the middle instead of splitting them (splitting is the Porechop default). This option is required for reads to be used with Nanopolish and is on by default when reads are output into barcode bins.
-t 16: Use 16 threads for faster processing

Step 4: Post-Adapter Trimming Quality Assessment

Re-evaluate data quality after adapter removal to ensure trimming was successful.

Command:

mkdir -p 03QualityCheck
for fastq in 02TrimAdapters/*_clean.fastq; do
    [ -e "$fastq" ] || continue
    prefix=$(basename "$fastq" _clean.fastq)   # sample name without directory or suffix
    NanoPlot --fastq "$fastq" \
             -o "03QualityCheck/$prefix" \
             -t 16 \
             --plots hex dot
done

What to Check:

Compare pre- and post-trimming quality metrics
Verify adapter removal (should see improved quality scores)
Ensure no significant data loss (<10% reads trimmed is normal)
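
One way to put a number on data loss is to compare read counts before and after trimming. The sketch below does this under the assumption that both files use four-line FASTQ records; the helper names count_reads and percent_reads_lost are hypothetical. It measures reads that were discarded entirely, which is only one reading of the 10% guideline above.

```shell
# Count records in a four-line-per-record FASTQ file.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Percentage of raw reads that are absent from the trimmed file.
# Prints "NA" if the raw file contains no reads.
percent_reads_lost() {
    local raw trimmed
    raw=$(count_reads "$1")
    trimmed=$(count_reads "$2")
    awk -v r="$raw" -v t="$trimmed" 'BEGIN {
        if (r == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * (r - t) / r
    }'
}

# Usage:
#   percent_reads_lost RawData/HRR2590080.fastq 02TrimAdapters/HRR2590080_clean.fastq
```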

Step 5: Mapping Reads to Reference Genome

Align trimmed reads to a reference genome to determine genomic locations.

First, build the reference index:

minimap2 -d Reference/GRCh38.p14.genome.fa.mmi Reference/GRCh38.p14.genome.fa

Then align reads (single sample):

mkdir -p 04MappingGenome
minimap2 -cx map-ont \
         Reference/GRCh38.p14.genome.fa.mmi \
         02TrimAdapters/HRR2590080_clean.fastq \
         --secondary=no \
         -t 16 > 04MappingGenome/HRR2590080.paf

Command (batch processing):

mkdir -p 04MappingGenome
for fastq in 02TrimAdapters/*_clean.fastq; do
    [ -e "$fastq" ] || continue
    prefix=$(basename "$fastq" _clean.fastq)   # safe for sample names containing "_"
    minimap2 -cx map-ont \
             Reference/GRCh38.p14.genome.fa.mmi \
             "$fastq" \
             --secondary=no \
             -t 16 > "04MappingGenome/${prefix}.paf"
done

Parameters Explained:

-cx map-ont: Use the Oxford Nanopore preset (-x map-ont) and write CIGAR strings to the PAF output (-c)
--secondary=no: Do not output secondary alignments; primary and supplementary alignments are kept
-t 16: Use 16 threads

Expected Mapping Rates:

Human samples to hg38: >95% (optimal), >85% (acceptable)
Low mapping rates may indicate contamination or reference mismatch
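
The mapping rate can be estimated directly from the PAF and the trimmed FASTQ. The sketch below counts distinct read names in PAF column 1 and divides by the number of FASTQ records. It assumes four-line FASTQ records and relies on minimap2 leaving unmapped reads out of PAF output by default; rows with "*" as the target are skipped in case unmapped reads were written. The function name mapping_rate is hypothetical.

```shell
# Percentage of reads with at least one alignment in the PAF file.
# $1 = PAF file, $2 = FASTQ file that was aligned. Prints "NA" for an empty FASTQ.
mapping_rate() {
    local paf="$1" fastq="$2" mapped total
    # Column 1 is the read name, column 6 the target name ("*" means unmapped).
    mapped=$(awk -F'\t' '$6 != "*" { print $1 }' "$paf" | sort -u | wc -l)
    total=$(awk 'END { print int(NR / 4) }' "$fastq")
    awk -v m="$mapped" -v t="$total" 'BEGIN {
        if (t == 0) { print "NA"; exit }
        printf "%.2f\n", 100 * m / t
    }'
}

# Usage:
#   mapping_rate 04MappingGenome/HRR2590080.paf 02TrimAdapters/HRR2590080_clean.fastq
```

Counting distinct names matters here because a single read can produce several alignment rows.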

Step 6: eccDNA Identification using ECCFP

The core step that identifies circular DNA from aligned reads.

Command (single sample):

mkdir -p 05IdentifyingEccDNA
eccfp --fastq 02TrimAdapters/HRR2590080_clean.fastq \
      --paf 04MappingGenome/HRR2590080.paf \
      --output 05IdentifyingEccDNA/HRR2590080 \
      --reference Reference/GRCh38.p14.genome.fa

Command (batch processing):

mkdir -p 05IdentifyingEccDNA
for paf in 04MappingGenome/*.paf; do
    [ -e "$paf" ] || continue
    prefix=$(basename "$paf" .paf)   # sample name without directory or extension
    eccfp --fastq "02TrimAdapters/${prefix}_clean.fastq" \
          --paf "$paf" \
          --reference Reference/GRCh38.p14.genome.fa \
          --output "05IdentifyingEccDNA/${prefix}"
done

Output Files:

05IdentifyingEccDNA/[sample]/
├── final_eccDNA.csv          # Final eccDNA list with genomic coordinates
├── consensus_sequence.fasta  # Consensus sequences for each eccDNA
├── variants.csv             # Mutations/variants within eccDNA regions
├── unit.csv                 # Candidate eccDNAs within individual reads
└── candidate_consolidated.csv # Consolidated candidate information

📊 Output Interpretation

final_eccDNA.csv

The main results file, with the following columns:

| Column | Description |
|---|---|
| eccDNApos | eccDNA position |
| Nfullpass | Number of consecutive full passes covering this eccDNA, across all reads |
| Nfragments | Number of fragments that form this eccDNA |
| Nreads | Number of reads identified for this eccDNA |
| refLength | Length of the reference genome region covered by this eccDNA |
| seqLength | Length of the consensus sequence of this eccDNA |
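
To narrow the list to better-supported calls, rows can be filtered on Nfullpass. The sketch below looks the column up by name in the header, so it does not depend on column order. It assumes a comma-separated file with a header row and no commas inside fields; the function name filter_by_fullpass and the default threshold of 2 are illustrative choices, not recommendations from the manuscript.

```shell
# Keep the header plus rows whose Nfullpass value is >= a threshold.
# $1 = CSV file, $2 = minimum Nfullpass (illustrative default: 2).
filter_by_fullpass() {
    local csv="$1" min="${2:-2}"
    awk -F',' -v min="$min" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                h = $i; gsub(/"/, "", h)          # tolerate quoted header names
                if (h == "Nfullpass") col = i
            }
            if (!col) { print "Nfullpass column not found" > "/dev/stderr"; exit 1 }
            print; next
        }
        $col + 0 >= min + 0' "$csv"
}

# Usage:
#   filter_by_fullpass 05IdentifyingEccDNA/HRR2590080/final_eccDNA.csv 3
```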

consensus_sequence.fasta

FASTA-format consensus sequences for each identified eccDNA:

>chrM_201_299_+
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACC
>chrM_2614_2766_-
GTTAGGTACTGTTTGCATTAATAAATTAAAGCTCCATAGGGTCTTCTCGTCTTGCTGTGTCATGCCCGCCTCTTCACGGGCAGGTCAATTTCACTGGTTAAAAGTAAGAGACAGCTGAACCCTCGTGGAGCCATTCATACAGGTCCCTATTTA
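
For downstream work it can help to turn the FASTA headers into a table. The sketch below assumes headers of the form shown above (chromosome, start, end, strand joined by underscores) and parses them from the right so that chromosome names containing underscores survive. Headers with a different shape, which may occur for eccDNAs built from several fragments, are reported with NA fields. The function name consensus_table is hypothetical.

```shell
# Emit one tab-separated row per record:
# name, chromosome, start, end, strand, sequence length.
consensus_table() {
    awk '
        function emit(   n, p, i, chrom) {
            if (name == "") return
            n = split(name, p, "_")
            if (n < 4) { printf "%s\tNA\tNA\tNA\tNA\t%d\n", name, len; return }
            chrom = p[1]
            for (i = 2; i <= n - 3; i++) chrom = chrom "_" p[i]   # rejoin names like chr1_xxx_random
            printf "%s\t%s\t%s\t%s\t%s\t%d\n", name, chrom, p[n-2], p[n-1], p[n], len
        }
        /^>/ { emit(); name = substr($0, 2); len = 0; next }
        { len += length($0) }                                     # sequences may span several lines
        END { emit() }
    ' "$1"
}

# Usage:
#   consensus_table 05IdentifyingEccDNA/HRR2590080/consensus_sequence.fasta
```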

variants.csv

Mutation information within eccDNA regions:

| Column | Description |
|---|---|
| col1 | Chromosome |
| col2 | Position in the reference genome |
| col3 | Reference base |
| col4 | Variant |
| col5 | Supporting coverage depth |
| col6 | Total coverage depth |
| col7 | Type |
| col8 | eccDNApos |
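
A common follow-up is the fraction of reads supporting each variant, that is, col5 divided by col6. The sketch below appends that value as an extra column. It assumes comma-separated fields in the order listed above with no commas inside fields; the function name add_support_fraction and the column label supportFraction are hypothetical.

```shell
# Append supporting depth / total depth (col5 / col6) to every row.
# A non-numeric first row is treated as a header; other unusable rows get "NA".
add_support_fraction() {
    awk -F',' 'BEGIN { OFS = "," }
        $6 ~ /^[0-9]+$/ && $6 > 0 { print $0, sprintf("%.3f", $5 / $6); next }
        NR == 1 { print $0, "supportFraction"; next }
        { print $0, "NA" }' "$1"
}

# Usage:
#   add_support_fraction 05IdentifyingEccDNA/HRR2590080/variants.csv > variants_with_fraction.csv
```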
