Frequency-Domain Transformation of cfDNA End-Motif Profiles Enhances Robust Cancer Detection
This repository provides code and example workflows for frequency-domain transformation, model construction, statistical evaluation, motif interpretation, and figure generation of cfDNA end-motif profiles.
- Feature Extraction: Calculation of 5' end-motif frequencies for k = 4, 5, and 6.
- Spectral Transformation: Z-score standardization and Softmax mapping followed by DFT to extract amplitude spectra.
- Classification: Training base learners on spectral features and final prediction via a meta-classifier.
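The spectral transformation step can be sketched roughly as follows. This is a minimal illustration and not the repository's `transform_features.py`: it assumes z-scoring is done per motif across samples, softmax is applied per sample, and the DFT runs along the motif axis. All three choices, the function names, and the toy numbers are assumptions. A naive DFT is written out with the standard library for clarity; in practice `numpy.fft` would be the usual choice.

```python
import cmath
import math
from statistics import mean, pstdev

def zscore_columns(matrix):
    """Standardize each feature (column) across samples (assumed axis)."""
    cols = list(zip(*matrix))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns: avoid dividing by 0
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in matrix]

def softmax(row):
    """Map a real vector to positive weights that sum to 1."""
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dft_amplitude(signal):
    """Amplitude |X[f]| of the naive O(N^2) discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(signal)))
        for f in range(n)
    ]

# Toy matrix: 3 samples x 4 motif frequencies (made-up numbers).
toy = [
    [0.30, 0.20, 0.25, 0.25],
    [0.10, 0.40, 0.30, 0.20],
    [0.25, 0.25, 0.20, 0.30],
]
spectra = [dft_amplitude(softmax(row)) for row in zscore_columns(toy)]
```

Because each softmax output sums to 1, the zero-frequency (DC) amplitude of every sample equals 1 under these assumptions and so carries no information.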
DFT_code/
├── data/ # Feature and metadata files
├── examples/ # Small example inputs for quick testing
├── scripts/
│ ├── core/ # Core modeling workflow
│ │ ├── transform_features.py # EDM → FFT/DCT/wavelet features
│ │ ├── run_cv_models.py # Repeated cross-validation for single models
│ │ ├── run_svm_model.py # Train SVM and validate on external dataset
│ │ └── run_ESM.py # Ensemble Spectrum Model pipeline
│ ├── preprocessing/ # Data preparation
│ │ ├── bam_to_fragment_tsv.py # BAM → fragment TSV
│ │ ├── extract_edm_from_fragments.py # Fragment TSV → end-motif frequency table
│ │ ├── merge_edm_features.py # Merge per-sample features into matrix
│ │ └── calculate_mds.py # Calculate Motif Diversity Score
│ ├── statistics/ # Statistical evaluation
│ │ ├── run_score_summary.py # AUC, CI, and sensitivity summary
│ │ ├── run_delong_test.py # DeLong test between ROC curves
│ │ ├── run_permutation_test.py # Permutation test
│ │ └── run_feature_u_test.py # Feature-wise U test with FDR correction
│ ├── plotting/ # Visualization scripts for main figures
│ ├── interpretation/ # Frequency-guided motif interpretation analyses
│ └── utils/ # Shared utility functions used across scripts
└── README.md
We recommend using Conda to manage the environment.
# Clone the repository
git clone https://github.com/Upupdownn/DFT_code.git
cd DFT_code
# Create the environment
conda env create -f environment.yml
# Activate the environment
conda activate dft_analysis
The DFT pipeline supports two input formats. You can provide either raw alignment files (BAM) or pre-processed fragment files (TSV).
1. Input Fragment Data
- Option A: BAM Files
  - Standard genomic alignment files are supported.
  - Requirements: Files must be sorted and indexed (e.g., sample.bam and sample.bam.bai).
- Option B: Fragment TSV Files
  - If you have already extracted fragment information, provide a TSV file with a header and the following columns: chr, start, end, mapq, and strand.
  - Example Data: For testing purposes, sample fragment TSV files are provided in the examples/frag_file/ directory.
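Before running the pipeline on your own fragment files, a quick header check can catch format problems early. The sketch below is not part of the repository. It only assumes the five column names listed above; the function name, the integer parsing of start, end, and mapq, and the demo rows are illustrative assumptions.

```python
import csv
import io

REQUIRED_COLUMNS = ("chr", "start", "end", "mapq", "strand")

def read_fragments(handle):
    """Parse a fragment TSV and fail early if a required column is absent."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"fragment TSV is missing columns: {missing}")
    return [
        (row["chr"], int(row["start"]), int(row["end"]), int(row["mapq"]), row["strand"])
        for row in reader
    ]

# Made-up demo rows; real files would be opened with open(path, newline="").
demo = (
    "chr\tstart\tend\tmapq\tstrand\n"
    "chr1\t10000\t10167\t60\t+\n"
    "chr2\t20500\t20644\t42\t-\n"
)
fragments = read_fragments(io.StringIO(demo))
```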
2. Reference Genome (2bit format)
A .2bit file of the reference genome is required for sequence extraction and end-motif frequency calculation. Please choose the version (e.g., hg19 or hg38) that matches your alignment.
To run the provided example dataset, you can download the hg19 reference genome using the following command:
# Download hg19 reference genome from UCSC
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
All scripts accept the -h/--help option, which prints their help documentation.
We provide a toy dataset for testing the ESM pipeline in examples/ESM/. This example can be used to verify that the environment and scripts are installed correctly before running your own analysis.
The ESM workflow is designed as a general multi-scale ensemble framework. It is not limited to the EDM features used in this study and can be applied to any multi-scale feature set.
To run the workflow, the following requirements should be met:
- Feature directories
  - cv_feature_dir: directory containing feature matrices for cross-validation
  - val_feature_dir (optional): directory containing feature matrices for external validation
- Matched scales
  - cv and val directories must contain the same number of feature files
  - corresponding files must have identical filenames, so that each scale can be matched correctly
- Info files
  - cv_info.tsv is required
  - val_info.tsv is optional
  - each info file should contain:
    - sample ID as index
    - label column indicating sample class (e.g. 0 / 1)
- Validation set
  - val data are optional
  - the ESM model can be trained using only the cv dataset
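The matched-scales requirement can be checked before launching a run. The helper below is a hypothetical pre-flight check, not something shipped with the repository. It compares the filenames in the two directories and reports any that appear on only one side.

```python
import tempfile
from pathlib import Path

def matched_scale_files(cv_feature_dir, val_feature_dir):
    """Return the shared filenames, or raise if the two directories differ."""
    cv_names = {p.name for p in Path(cv_feature_dir).iterdir() if p.is_file()}
    val_names = {p.name for p in Path(val_feature_dir).iterdir() if p.is_file()}
    if cv_names != val_names:
        raise ValueError(
            f"only in cv: {sorted(cv_names - val_names)}; "
            f"only in val: {sorted(val_names - cv_names)}"
        )
    return sorted(cv_names)

# Demo on throwaway directories holding empty per-scale files.
with tempfile.TemporaryDirectory() as tmp:
    cv_dir, val_dir = Path(tmp, "cv"), Path(tmp, "val")
    for d in (cv_dir, val_dir):
        d.mkdir()
        for name in ("4mer.tsv", "5mer.tsv", "6mer.tsv"):
            (d / name).touch()
    scales = matched_scale_files(cv_dir, val_dir)

    # Removing one validation file should be reported as a mismatch.
    (val_dir / "6mer.tsv").unlink()
    try:
        matched_scale_files(cv_dir, val_dir)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```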
Using the provided example dataset:
python scripts/core/run_ESM.py \
examples/ESM/cv_features/ \
examples/ESM/cv_info.tsv \
examples/ESM/result/esm_cv_score.tsv \
--val_feature_dir examples/ESM/val_features/ \
--val_info_tsv examples/ESM/val_info.tsv \
--val_score_tsv examples/ESM/result/esm_val_score.tsv
The output score file contains:
- sample: sample identifier
- score: predicted ESM score
Example:
sample score
Sample_1 0.8732
Sample_2 0.2145
...
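As a quick sanity check on a score file, the scores can be joined with known labels and summarized as an AUC. The sketch below is illustrative and is not the repository's `run_score_summary.py`. It assumes a tab-separated file with the `sample` and `score` columns shown above; the labels and the extra samples are made up.

```python
import csv
import io

def read_scores(handle):
    """Parse a tab-separated score file with 'sample' and 'score' columns."""
    return {
        row["sample"]: float(row["score"])
        for row in csv.DictReader(handle, delimiter="\t")
    }

def rank_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [scores[s] for s in scores if labels[s] == 1]
    neg = [scores[s] for s in scores if labels[s] == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores and labels for illustration only.
score_tsv = (
    "sample\tscore\n"
    "Sample_1\t0.8732\n"
    "Sample_2\t0.2145\n"
    "Sample_3\t0.6100\n"
    "Sample_4\t0.4020\n"
)
labels = {"Sample_1": 1, "Sample_2": 0, "Sample_3": 1, "Sample_4": 0}
scores = read_scores(io.StringIO(score_tsv))
auc_value = rank_auc(scores, labels)
```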
You can also build your own ESM model from raw alignment data by following the workflow described below:
Convert paired-end BAM files into fragment-level TSV format.
python scripts/preprocessing/bam_to_fragment_tsv.py sample.bam sample.fragments.tsv
Output columns:
chr | start | end | mapq | strand
Calculate 5′ cfDNA end-motif frequencies from fragment files. Example for 4-mer:
python scripts/preprocessing/extract_edm_from_fragments.py \
sample.fragments.tsv \
hg19.2bit \
sample.4mer.tsv \
-k 4
Repeat with -k 5 and -k 6 to obtain the 5-mer and 6-mer tables.
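Conceptually, end-motif frequency calculation counts the first k bases at fragment 5′ ends and normalizes the counts. The sketch below illustrates the idea on in-memory sequences and is not the repository's implementation. It assumes each input is the forward-strand sequence of a whole fragment and that both 5′ ends are counted (the second by reverse-complementing the last k bases). The 2bit lookup is omitted.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters pass through."""
    return seq.translate(_COMPLEMENT)[::-1]

def end_motif_frequencies(fragment_seqs, k=4):
    """Frequency of every k-mer among the 5' ends of the given fragments."""
    counts = Counter()
    for seq in fragment_seqs:
        seq = seq.upper()
        if len(seq) < k:
            continue
        # Assumption: count the 5' end of both strands of each fragment.
        for motif in (seq[:k], reverse_complement(seq[-k:])):
            if set(motif) <= set("ACGT"):  # skip motifs containing N
                counts[motif] += 1
    total = sum(counts.values())
    all_motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return {m: (counts[m] / total if total else 0.0) for m in all_motifs}

# Two made-up fragments: ends are CCCA/AACC and CCCA/AAAA.
freqs = end_motif_frequencies(["CCCAGGTT", "CCCATTTT"], k=4)
```

The result always has 4^k entries (256 for k = 4), so every sample yields a vector of the same length and order.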
Merge per-sample EDM frequency tables into one feature matrix.
python scripts/preprocessing/merge_edm_features.py \
input_4mer_dir \
--merged_output 4mer_matrix.tsv
Output:
rows = samples
columns = k-mer features
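The merge step amounts to aligning per-sample profiles on a common motif axis. A minimal sketch of that layout is shown below. The in-memory profile format, the `sample` header label, and filling absent motifs with 0.0 are assumptions for illustration and may differ from what `merge_edm_features.py` does.

```python
import csv
import io

def merge_profiles(profiles):
    """Build a samples x motifs table from {sample: {motif: frequency}}."""
    motifs = sorted({m for profile in profiles.values() for m in profile})
    header = ["sample"] + motifs
    rows = [
        [sample] + [profiles[sample].get(m, 0.0) for m in motifs]
        for sample in sorted(profiles)
    ]
    return header, rows

# Made-up per-sample profiles.
profiles = {
    "S2": {"CCCA": 0.7, "AAAA": 0.3},
    "S1": {"CCCA": 0.6, "AACC": 0.4},
}
header, rows = merge_profiles(profiles)

# Serialize as TSV: one row per sample, one column per motif.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
tsv_text = buffer.getvalue()
```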
Transform EDM frequency matrix into frequency-domain features.
python scripts/core/transform_features.py \
4mer_matrix.tsv \
transformed_features/
Generated outputs include:
- FFT amplitude
fft_amplitude_full.tsv
fft_amplitude_processed.tsv
- FFT phase
fft_phase_full.tsv
fft_phase_processed.tsv
- DCT features
dct_features.tsv
- Wavelet features
wavelet_features.tsv
The processed DFT amplitude features keep only the half spectrum: the symmetric, redundant half and the DC component are removed. These processed features were used for model construction in this study.
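The reason half of the spectrum is redundant is that the DFT of a real-valued signal is conjugate-symmetric, so the amplitude at bin f equals the amplitude at bin N - f. The sketch below demonstrates this on a made-up 8-point signal and then keeps bins 1 through N/2. Whether the repository keeps the Nyquist bin (f = N/2) is an assumption here, and the function names are illustrative.

```python
import cmath

def dft_amplitude(signal):
    """Full amplitude spectrum from a naive O(N^2) DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(signal)))
        for f in range(n)
    ]

def half_spectrum_without_dc(full_amplitude):
    """Drop bin 0 (DC) and the mirrored upper half; keeps bins 1..N//2."""
    n = len(full_amplitude)
    return full_amplitude[1 : n // 2 + 1]

# Made-up real-valued signal whose entries sum to 1.
signal = [0.20, 0.05, 0.15, 0.10, 0.25, 0.05, 0.10, 0.10]
full = dft_amplitude(signal)
processed = half_spectrum_without_dc(full)

# Largest difference between each bin and its mirror image (should be ~0).
mirror_gap = max(abs(full[f] - full[len(full) - f]) for f in range(1, len(full)))
```

Under this convention a 4-mer profile with 256 entries would yield 128 processed amplitude features.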
Train the final ESM classifier using transformed multi-scale features.
python scripts/core/run_ESM.py \
cv_feature_dir \
cv_info.tsv \
cv_score.tsv \
--val_feature_dir val_feature_dir \
--val_info_tsv val_info.tsv \
--val_score_tsv val_score.tsv
cv_feature_dir should contain one feature file per scale:
4mer.tsv
5mer.tsv
6mer.tsv
Validation directory should contain matching files with identical names.
For each scale (4-mer, 5-mer, and 6-mer), four base learners are trained:
- SVM
- Logistic Regression
- Random Forest
- Gradient Boosting
Their out-of-fold prediction scores are combined into meta-features, followed by a second-level SVM classifier.
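The out-of-fold mechanics can be sketched without any machine-learning library. In the illustration below, a nearest-centroid scorer stands in for the real base learners, the folds are shuffled but not stratified, and each sample's score is averaged over repeats. All of these are assumptions, as are the function names and toy data, and the second-level SVM is not shown.

```python
import math
import random

def repeated_kfold(n_samples, n_splits, n_repeats, seed=0):
    """Yield held-out index lists for shuffled K-fold, repeated n_repeats times."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_splits):
            yield order[fold::n_splits]

def centroid_learner(X_train, y_train):
    """Placeholder base learner: score by distance to the two class centroids."""
    def centroid(label):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1.0 / (1.0 + math.exp(d1 - d0))  # > 0.5 when closer to class 1
    return predict

def out_of_fold_scores(X, y, fit, n_splits=3, n_repeats=2, seed=0):
    """Average, per sample, the scores it received while held out."""
    n = len(X)
    totals = [0.0] * n
    for test in repeated_kfold(n, n_splits, n_repeats, seed):
        held = set(test)
        train = [i for i in range(n) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            totals[i] += predict(X[i])
    return [t / n_repeats for t in totals]  # each sample is held out once per repeat

# Made-up features for two scales; rows are the same six samples in both.
y = [1, 1, 1, 0, 0, 0]
scales = {
    "4mer": [[0.9, 0.8], [0.8, 0.9], [0.85, 0.85], [0.1, 0.2], [0.2, 0.1], [0.15, 0.15]],
    "5mer": [[0.7, 0.9, 0.8], [0.9, 0.7, 0.8], [0.8, 0.8, 0.9],
             [0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.1]],
}
# One meta-feature column per (scale, learner) pair; here one learner per scale.
columns = [out_of_fold_scores(X, y, centroid_learner) for X in scales.values()]
meta_features = [list(row) for row in zip(*columns)]
```

With three scales and four base learners, the same layout would give twelve meta-feature columns per sample, which the second-level classifier is then trained on.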
The repeated cross-validation strategy used for out-of-fold score generation is illustrated below.
The ESM framework integrates multi-scale spectral features and multiple base learners through a second-level SVM classifier.
If you have any questions or feedback, please contact us at:
Email: upupdownn@gmail.com
This project is licensed under the MIT License - see the LICENSE file for details.
