WSG-Lab/ECCFP_Workflow

ECCFP: A Bioinformatics Workflow for eccDNA Identification from Nanopore Sequencing Data

License: MIT

📋 Overview

ECCFP (EccDNA Caller based on Consecutive Full Pass) is a comprehensive, end-to-end bioinformatics workflow for identifying extrachromosomal circular DNA (eccDNA) from long-read Nanopore sequencing data. This repository provides the complete implementation of the protocol described in the manuscript "A bioinformatics workflow to identify eccDNA using ECCFP from long-read Nanopore sequencing data".

The workflow integrates quality control, adapter trimming, genome alignment, and eccDNA detection into a streamlined pipeline optimized for Nanopore data.

📂 Repository Structure

Root Level Files

  • README.md - Main documentation file for the ECCFP workflow
  • LICENSE - MIT License file specifying the open-source license terms
  • setup_workspace.sh - One-click workspace setup script that creates the working directory structure

Core Scripts Directory (scripts/)

The scripts/ directory contains all executable scripts for installing, running, and testing the ECCFP workflow.

Main Installation and Test Scripts

  • install_all.sh - Complete software installation script that sets up Conda environment and installs all dependencies
  • run_test.sh - Test script that validates the installation using provided test data

Test Data Directory (test_data/)

Contains small datasets for quick validation and testing of the workflow:

  • test.fastq - Small test FASTQ file for validating the complete workflow
  • GRCh38_chr21.fasta - Reference genome containing only chromosome 21 for quick testing

Working Directory (eccfpws/)

This directory is created automatically when you run setup_workspace.sh:

  • RawData/ - Place your own FASTQ files here for analysis
  • Reference/ - Store reference genome files (FASTA format) here
  • Software/ - Contains installed software tools and dependencies

How to Use This Structure

  1. Initial Setup: Run bash setup_workspace.sh to create the eccfpws/ working directory
  2. Software Installation: Navigate to scripts/ and run bash install_all.sh to install all dependencies
  3. Validation Test: From the same scripts/ directory, run bash run_test.sh to verify the installation using test data
  4. Your Own Analysis: Place your FASTQ files in eccfpws/RawData/ and run the workflow scripts

🚀 Quick Start

Step 1: Clone the Repository

git clone https://github.com/WSG-Lab/ECCFP_Workflow.git
cd ECCFP_Workflow

Step 2: Set Up the Workspace

# Create the working directory structure
bash setup_workspace.sh

Step 3: Install All Software and Dependencies

# Navigate to the scripts directory
cd scripts

# Run the complete installation script
bash install_all.sh

Important: The installation script will:

  1. Install Conda (if not already installed)
  2. Create a Conda environment named eccfpEnv
  3. Install all required software with exact versions matching the manuscript
  4. Verify all installations

Step 4: Verify Installation with Test Data

# Run the test script to verify everything works
bash run_test.sh

The test script will:

  1. Validate the environment and software versions
  2. Run the complete workflow on a small test dataset
  3. Generate a test report
  4. Verify that all outputs are generated correctly

Expected Output: All steps should complete successfully with a final message indicating test completion.

📖 Detailed Usage Instructions

Important: All following steps should be performed within the eccfpws working directory and with the eccfpEnv Conda environment activated:

# Activate the Conda environment
conda activate eccfpEnv

# Navigate to the working directory
cd eccfpws
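
A quick way to confirm that the environment is really active is to check that the four command-line tools used in the steps below are on the PATH. This is a minimal sketch; `missing_tools` is a hypothetical helper, not something the installer provides.

```shell
# Hypothetical helper: print the name of every requested tool that is not on PATH.
# An empty result means all tools were found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools NanoPlot porechop minimap2 eccfp` should print nothing once eccfpEnv is activated; any name it prints points to a tool that still needs installing.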

Running the Complete Workflow

To analyze your own data, follow these steps:

Step 1: Prepare Your Data

# Place your FASTQ files in the RawData directory (paths are relative to eccfpws/)
cp /path/to/your/data/*.fastq RawData/

# Place your reference genome in the Reference directory
cp /path/to/your/reference.fasta Reference/

Step 2: Data Quality Control using NanoPlot

Command (single sample):

mkdir -p 01QualityControl
NanoPlot --fastq RawData/HRR2590080.fastq \
         -o 01QualityControl/HRR2590080 \
         -t 16 \
         --plots hex dot

Command (batch processing):

mkdir -p 01QualityControl
for fastq in RawData/*.fastq RawData/*.fq; do
    [ -e "$fastq" ] || continue          # skip patterns that matched nothing
    prefix=$(basename "$fastq")
    prefix=${prefix%.fastq}              # strip whichever extension is present
    prefix=${prefix%.fq}
    NanoPlot --fastq "$fastq" \
             -o "01QualityControl/$prefix" \
             -t 16 \
             --plots hex dot
done

Expected Output:

  • 01QualityControl/[sample]/NanoPlot-report.html - Interactive HTML report
  • 01QualityControl/[sample]/LengthvsQualityScatterPlot_dot.png - Read length vs quality scatter plot
  • 01QualityControl/[sample]/NonWeightedHistogramReadLength.png - Read length distribution

Quality Assessment:

  • Check read length distribution (should approximate log-normal)
  • Verify average read quality > Q7
  • Ensure sufficient sequencing depth

Step 3: Trim Adapters and Barcodes using Porechop

This step removes adapter and barcode sequences that can interfere with alignment and eccDNA detection.

Command (single sample):

mkdir -p 02TrimAdapters
porechop -i RawData/HRR2590080.fastq \
         -o 02TrimAdapters/HRR2590080_clean.fastq \
         --extra_end_trim 0 \
         --discard_middle \
         -t 16

Command (batch processing):

mkdir -p 02TrimAdapters
for fastq in RawData/*.fastq RawData/*.fq; do
    [ -e "$fastq" ] || continue          # skip patterns that matched nothing
    prefix=$(basename "$fastq")
    prefix=${prefix%.fastq}              # strip whichever extension is present
    prefix=${prefix%.fq}
    porechop -i "$fastq" \
             -o "02TrimAdapters/${prefix}_clean.fastq" \
             --extra_end_trim 0 \
             --discard_middle \
             -t 16
done

Parameters Explained:

  • --extra_end_trim 0: No additional trimming beyond adapters
  • --discard_middle: Discard reads that contain adapters in the middle (by default, such reads are split instead). According to the Porechop help text, this is required for reads to be used with Nanopolish and is enabled by default when outputting reads into barcode bins.
  • -t 16: Use 16 threads for faster processing

Step 4: Post-Adapter Trimming Quality Assessment

Re-evaluate data quality after adapter removal to confirm that trimming was successful.

Command:

mkdir -p 03QualityCheck
for fastq in 02TrimAdapters/*_clean.fastq; do
    [ -e "$fastq" ] || continue
    prefix=$(basename "$fastq" _clean.fastq)   # sample name without directory or suffix
    NanoPlot --fastq "$fastq" \
             -o "03QualityCheck/$prefix" \
             -t 16 \
             --plots hex dot
done

What to Check:

  • Compare pre- and post-trimming quality metrics
  • Verify adapter removal (should see improved quality scores)
  • Ensure no significant data loss (<10% reads trimmed is normal)
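
The data-loss check can be approximated by counting reads before and after trimming. This is a minimal sketch with hypothetical helper names; it assumes uncompressed FASTQ files with exactly four lines per record.

```shell
# Assumption: uncompressed FASTQ, four lines per read.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print the percentage of reads lost between a raw and a trimmed FASTQ file.
percent_reads_lost() (
    before=$(count_reads "$1")
    after=$(count_reads "$2")
    awk -v b="$before" -v a="$after" 'BEGIN {
        if (b == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * (b - a) / b
    }'
)
```

For example, `percent_reads_lost RawData/HRR2590080.fastq 02TrimAdapters/HRR2590080_clean.fastq` prints the share of reads lost for that sample, which can then be compared against the 10% guideline above.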

Step 5: Mapping Reads to Reference Genome

Align trimmed reads to a reference genome to determine their genomic locations.

First, build the reference index:

minimap2 -d Reference/GRCh38.p14.genome.fa.mmi Reference/GRCh38.p14.genome.fa

Then align reads (single sample):

mkdir -p 04MappingGenome
minimap2 -cx map-ont \
         Reference/GRCh38.p14.genome.fa.mmi \
         02TrimAdapters/HRR2590080_clean.fastq \
         --secondary=no \
         -t 16 > 04MappingGenome/HRR2590080.paf

Command (batch processing):

mkdir -p 04MappingGenome
for fastq in 02TrimAdapters/*_clean.fastq; do
    [ -e "$fastq" ] || continue
    prefix=$(basename "$fastq" _clean.fastq)   # sample name without directory or suffix
    minimap2 -cx map-ont \
             Reference/GRCh38.p14.genome.fa.mmi \
             "$fastq" \
             --secondary=no \
             -t 16 > "04MappingGenome/${prefix}.paf"
done

Parameters Explained:

  • -cx map-ont: Optimize for Oxford Nanopore reads
  • --secondary=no: Suppress secondary alignments in the output
  • -t 16: Use 16 threads

Expected Mapping Rates:

  • Human samples to hg38: >95% (optimal), >85% (acceptable)
  • Low mapping rates may indicate contamination or reference mismatch
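
One rough way to estimate the mapping rate from the files produced above is to divide the number of distinct read names in the PAF file (column 1) by the number of reads in the trimmed FASTQ. This is a sketch under stated assumptions, not the method used in the manuscript: `mapping_rate` is a hypothetical helper, and it assumes an uncompressed four-line FASTQ and that unmapped reads are absent from the PAF.

```shell
# Hypothetical helper: percentage of reads with at least one alignment.
# Usage: mapping_rate sample.paf sample_clean.fastq
mapping_rate() (
    paf=$1
    fastq=$2
    mapped=$(cut -f 1 "$paf" | sort -u | wc -l)          # distinct query names
    total=$(awk 'END { print int(NR / 4) }' "$fastq")    # four lines per read
    awk -v m="$mapped" -v t="$total" 'BEGIN {
        if (t == 0) { print "NA"; exit }
        printf "%.1f\n", 100 * m / t
    }'
)
```

For example, `mapping_rate 04MappingGenome/HRR2590080.paf 02TrimAdapters/HRR2590080_clean.fastq` prints a percentage that can be compared with the thresholds above.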

Step 6: eccDNA Identification using ECCFP

The core step that identifies circular DNA from aligned reads.

Command (single sample):

mkdir -p 05IdentifyingEccDNA
eccfp --fastq 02TrimAdapters/HRR2590080_clean.fastq \
      --paf 04MappingGenome/HRR2590080.paf \
      --output 05IdentifyingEccDNA/HRR2590080 \
      --reference Reference/GRCh38.p14.genome.fa

Command (batch processing):

mkdir -p 05IdentifyingEccDNA
for paf in 04MappingGenome/*.paf; do
    [ -e "$paf" ] || continue
    prefix=$(basename "$paf" .paf)             # sample name without directory or extension
    eccfp --fastq "02TrimAdapters/${prefix}_clean.fastq" \
          --paf "$paf" \
          --reference Reference/GRCh38.p14.genome.fa \
          --output "05IdentifyingEccDNA/${prefix}"
done

Output Files:

05IdentifyingEccDNA/[sample]/
├── final_eccDNA.csv            # Final eccDNA list with genomic coordinates
├── consensus_sequence.fasta    # Consensus sequences for each eccDNA
├── variants.csv                # Mutations/variants within eccDNA regions
├── unit.csv                    # Candidate eccDNAs within individual reads
└── candidate_consolidated.csv  # Consolidated candidate information
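
After a batch run it can be useful to confirm that every sample directory contains the five files listed above. A minimal sketch follows; `missing_outputs` is a hypothetical helper, and the file names are taken from the listing in this README.

```shell
# Hypothetical helper: print each expected ECCFP output file that is absent
# from the given sample directory. An empty result means all five are present.
missing_outputs() {
    for f in final_eccDNA.csv consensus_sequence.fasta variants.csv \
             unit.csv candidate_consolidated.csv; do
        [ -e "$1/$f" ] || echo "$f"
    done
}
```

For example, `missing_outputs 05IdentifyingEccDNA/HRR2590080` prints nothing when the run for that sample produced all expected files.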

📊 Output Interpretation

final_eccDNA.csv

The main results file, with the following columns:

Column      Description
eccDNApos   Genomic position of the eccDNA
Nfullpass   Total number of consecutive full passes of this eccDNA across all supporting reads
Nfragments  Number of fragments that form this eccDNA
Nreads      Number of reads in which this eccDNA was identified
refLength   Length of this eccDNA on the reference genome
seqLength   Length of the consensus sequence for this eccDNA
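
A common follow-up is to keep only eccDNAs with stronger support. The sketch below filters on the Nfullpass column. It assumes the file is comma-separated with a header row that uses the column names listed above, and that no field contains an embedded comma; check this against your own output first. `filter_by_fullpass` is a hypothetical helper and the default threshold of 2 is illustrative only.

```shell
# Assumption: comma-separated file whose first row holds the column names.
# Keeps the header plus every row whose Nfullpass is at least the given minimum.
filter_by_fullpass() {
    awk -F',' -v min="${2:-2}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "Nfullpass") col = i
            print
            next
        }
        col && $col + 0 >= min + 0
    ' "$1"
}
```

For example, `filter_by_fullpass 05IdentifyingEccDNA/HRR2590080/final_eccDNA.csv 3` prints the header and the rows with at least three full passes.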

consensus_sequence.fasta

Consensus sequences in FASTA format for each identified eccDNA:

>chrM_201_299_+
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACC
>chrM_2614_2766_-
GTTAGGTACTGTTTGCATTAATAAATTAAAGCTCCATAGGGTCTTCTCGTCTTGCTGTGTCATGCCCGCCTCTTCACGGGCAGGTCAATTTCACTGGTTAAAAGTAAGAGACAGCTGAACCCTCGTGGAGCCATTCATACAGGTCCCTATTTA
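
The headers in the example above appear to follow a chromosome_start_end_strand pattern. Assuming that pattern holds for every record (an inference from the two examples, not a documented format), the sketch below turns the headers into a tab-separated table. Fields are read from the right so that chromosome names containing underscores stay intact. `headers_to_table` is a hypothetical helper.

```shell
# Assumption: every header looks like ">chrom_start_end_strand".
# Prints chrom, start, end and strand as tab-separated columns;
# headers with fewer than four underscore-separated parts are skipped.
headers_to_table() {
    awk '/^>/ {
        name = substr($0, 2)
        n = split(name, part, "_")
        if (n < 4) next
        chrom = part[1]
        for (i = 2; i <= n - 3; i++) chrom = chrom "_" part[i]
        printf "%s\t%s\t%s\t%s\n", chrom, part[n - 2], part[n - 1], part[n]
    }' "$1"
}
```

The coordinates are passed through unchanged, since this README does not state whether they are 0-based or 1-based.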

variants.csv

Mutation information within eccDNA regions:

Column  Description
col1    Chromosome
col2    Position in the reference genome
col3    Reference base
col4    Variant
col5    Supporting coverage depth
col6    Total coverage depth
col7    Variant type
col8    eccDNApos (eccDNA position)
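
Dividing the supporting depth (col5) by the total depth (col6) gives the fraction of reads carrying each variant. The sketch below assumes a comma-separated file in the column order listed above; rows whose sixth field is not a positive number, such as a header row, are skipped. `variant_fractions` is a hypothetical helper.

```shell
# Assumption: comma-separated file with columns in the order col1..col8 above.
# Prints chromosome, position, reference base, variant and col5/col6 (3 decimals).
variant_fractions() {
    awk -F',' '$6 + 0 > 0 {
        printf "%s\t%s\t%s\t%s\t%.3f\n", $1, $2, $3, $4, $5 / $6
    }' "$1"
}
```

For example, `variant_fractions 05IdentifyingEccDNA/HRR2590080/variants.csv` lists each variant with its supporting fraction.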
