otk (ecDNA Analysis Toolkit) is a deep learning-based tool for analyzing extrachromosomal DNA (ecDNA). At the gene level it predicts whether a gene is an ecDNA cargo gene, and at the sample level it classifies the focal amplification type.
- Deep learning-based ecDNA cargo gene prediction
- Sample-level focal amplification type classification
- Support for analysis from BAM files or processed copy number data
- Efficient command-line interface
- GPU acceleration support
- Python 3.8+
- PyTorch 2.0+
- NumPy
- Pandas
- scikit-learn
- Click (command-line interface)
- Clone the repository:
git clone https://github.com/WangLabCSU/otk.git
cd otk
- Install with pip:
pip install -e .
The following dependencies will be installed automatically:
- pandas>=2.0
- numpy>=1.24
- torch>=2.0
- scikit-learn>=1.3
- tqdm>=4.65
- click>=8.1
- matplotlib>=3.7
- seaborn>=0.12
- pyyaml>=6.0
otk provides two main command-line subcommands: train and predict.
Use the otk train command to train the model:
otk train --config configs/model_config.yml --output models/ --gpu 0
Parameters:
- --config, -c: Path to configuration file (default: configs/model_config.yml)
- --output, -o: Output directory for trained models (default: models/)
- --gpu, -g: GPU device ID to use (default: 0)
Use the otk predict command for predictions:
otk predict --model models/best_model.pth --input data/test_data.csv --output predictions/ --gpu -1
Parameters:
- --model, -m: Path to trained model (required)
- --input, -i: Path to input data file (required)
- --output, -o: Output directory for predictions (default: predictions/)
- --gpu, -g: GPU device ID to use (default: -1, i.e., use CPU)
Input data should be in CSV format with the following columns:
Required identifier columns:
- sample: Tumor sample ID
- gene_id: Gene ID
Copy number features:
- segVal: Total gene copy number
- minor_cn: Minor gene copy number
- intersect_ratio: Proportion of overlap between the copy number segment and the gene region
Sample-level genomic features (same value for all genes in a sample):
- purity: Tumor purity estimate
- ploidy: Tumor genome ploidy estimate
- AScore: Aneuploidy score
- pLOH: Proportion of genome with loss of heterozygosity (LOH)
- cna_burden: Proportion of genome with copy number alterations
Copy number signature features:
- CN1 to CN19: Activity estimates for 19 copy number signatures
Clinical features:
- age: Patient age
- gender: Patient gender (0/1 encoded)
Tumor type features (one-hot encoded, 24 cancer types):
type_BLCA, type_BRCA, type_CESC, type_COAD, type_DLBC, type_ESCA, type_GBM, type_HNSC,
type_KICH, type_KIRC, type_KIRP, type_LGG, type_LIHC, type_LUAD, type_LUSC, type_OV,
type_PRAD, type_READ, type_SARC, type_SKCM, type_STAD, type_THCA, type_UCEC, type_UVM
Gene frequency features:
- freq_Linear: Prior estimated frequency of the gene in linear focal amplifications
- freq_BFB: Prior estimated frequency of the gene in breakage-fusion-bridge (BFB) events
- freq_Circular: Prior estimated frequency of the gene in circular focal amplifications (ecDNA)
- freq_HR: Prior estimated frequency of the gene in homologous recombination events
Target column (for training data):
- y: Binary label indicating whether the gene is detected as an ecDNA cargo gene (1) or not (0)
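The column layout described above can be checked before training or prediction. The following is a minimal sketch and not part of otk itself: the helper `missing_columns` and the constant names are hypothetical, and it assumes the 19 copy number signature columns are named CN1 through CN19.

```python
import csv
import io

# Column names taken from the input format description above.
ID_COLUMNS = ["sample", "gene_id"]
CANCER_TYPES = [
    "BLCA", "BRCA", "CESC", "COAD", "DLBC", "ESCA", "GBM", "HNSC",
    "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "OV",
    "PRAD", "READ", "SARC", "SKCM", "STAD", "THCA", "UCEC", "UVM",
]
FEATURE_COLUMNS = (
    ["segVal", "minor_cn", "intersect_ratio"]
    + ["purity", "ploidy", "AScore", "pLOH", "cna_burden"]
    + [f"CN{i}" for i in range(1, 20)]  # assumption: columns CN1 ... CN19
    + ["age", "gender"]
    + [f"type_{t}" for t in CANCER_TYPES]
    + ["freq_Linear", "freq_BFB", "freq_Circular", "freq_HR"]
)
TARGET_COLUMN = "y"  # only present in training data


def missing_columns(header, training=False):
    """Return the expected columns that are absent from a CSV header."""
    expected = ID_COLUMNS + FEATURE_COLUMNS
    if training:
        expected = expected + [TARGET_COLUMN]
    present = set(header)
    return [name for name in expected if name not in present]


# Usage: inspect only the header row (an in-memory stand-in for a real file).
demo_csv = io.StringIO("sample,gene_id,segVal,minor_cn\nS1,G1,8,1\n")
header = next(csv.reader(demo_csv))
print(len(FEATURE_COLUMNS), "feature columns expected")
print("first missing column:", missing_columns(header)[0])
```

The feature groups add up to 57 columns (3 + 5 + 19 + 2 + 24 + 4), which matches the input size given for the default MLP.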
Prediction results are saved as a CSV file with the following columns:
Gene-level predictions:
- sample: Tumor sample ID
- gene_id: Gene ID
- prediction_prob: Probability of being an ecDNA cargo gene (0-1)
- prediction: Binary classification result (0 = not ecDNA cargo, 1 = ecDNA cargo)
Sample-level predictions:
- sample_level_prediction_label: Sample-level focal amplification type classification:
  - nofocal: No focal amplification detected
  - noncircular: Non-circular focal amplification detected
  - circular: Circular focal amplification (ecDNA) detected
- sample_level_prediction: Numerical encoding of the sample-level classification (0 = nofocal, 1 = noncircular, 2 = circular)
Note: Sample-level classification follows these rules:
- If any gene in the sample is predicted as ecDNA cargo (prediction = 1), the sample is classified as circular
- If no gene is predicted as ecDNA cargo but any gene has segVal > ploidy + 2, the sample is classified as noncircular
- Otherwise, the sample is classified as nofocal
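The three rules can be sketched in plain Python as follows. This illustrates the stated logic under simple assumptions and is not otk's own implementation: the function name `classify_samples` and the dict-per-row input are hypothetical, and each row is assumed to carry the sample, prediction, segVal and ploidy fields described above.

```python
from collections import defaultdict

# Numerical encoding as documented for sample_level_prediction.
LABEL_CODES = {"nofocal": 0, "noncircular": 1, "circular": 2}


def classify_samples(rows):
    """Map each sample ID to a focal amplification label.

    rows: iterable of dicts with keys sample, prediction, segVal, ploidy.
    """
    has_cargo = defaultdict(bool)      # any gene with prediction == 1
    has_amplified = defaultdict(bool)  # any gene with segVal > ploidy + 2
    for row in rows:
        sample = row["sample"]
        has_cargo[sample] |= row["prediction"] == 1
        has_amplified[sample] |= row["segVal"] > row["ploidy"] + 2

    labels = {}
    for sample in has_cargo:
        if has_cargo[sample]:
            labels[sample] = "circular"
        elif has_amplified[sample]:
            labels[sample] = "noncircular"
        else:
            labels[sample] = "nofocal"
    return labels


rows = [
    {"sample": "S1", "prediction": 1, "segVal": 12.0, "ploidy": 2.0},
    {"sample": "S1", "prediction": 0, "segVal": 2.0, "ploidy": 2.0},
    {"sample": "S2", "prediction": 0, "segVal": 6.5, "ploidy": 3.0},
    {"sample": "S3", "prediction": 0, "segVal": 4.0, "ploidy": 2.0},
]
labels = classify_samples(rows)
for sample, label in labels.items():
    print(sample, label, LABEL_CODES[label])
```

Sample S3 stays nofocal because the comparison is strict: a segVal of 4.0 does not exceed ploidy + 2 = 4.0.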
otk supports multiple deep learning model architectures (MLP, Transformer, MultiInputTransformer) with configurable parameters. The default MLP configuration is:
- Input layer: 57 features (matching the input data format)
- Hidden layer 1: 128 neurons, ReLU activation, 20% dropout
- Hidden layer 2: 64 neurons, ReLU activation, 20% dropout
- Hidden layer 3: 32 neurons, ReLU activation, 10% dropout
- Output layer: 1 neuron, Sigmoid activation
The model uses BCEWithLogitsLoss (or CombinedLoss with Focal Loss for imbalanced data) as the loss function and Adam as the optimizer.
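The default MLP layout described above can be sketched in PyTorch as shown below. This is an illustration, not otk's actual model class: the name `MLPSketch`, the learning rate, and the random stand-in data are assumptions. Because `BCEWithLogitsLoss` expects raw logits and applies the sigmoid internally, the sketch also assumes that the Sigmoid listed for the output layer is applied only at inference time.

```python
import torch
from torch import nn


class MLPSketch(nn.Module):
    """Hypothetical stand-in for the default MLP; forward() returns logits."""

    def __init__(self, n_features: int = 57):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(32, 1),  # no Sigmoid here: the loss applies it
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


torch.manual_seed(0)
model = MLPSketch()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

# One training step on random stand-in data (8 genes, 57 features each).
features = torch.randn(8, 57)
labels = torch.randint(0, 2, (8,)).float()
model.train()
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()

# At inference, the sigmoid converts logits to cargo-gene probabilities.
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(features))
print(probs.shape, float(loss))
```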
Model configuration uses YAML format, with example configuration files located in configs/. You can modify parameters in the configuration files as needed, such as model architecture and training parameters.
# Train model with default configuration
otk train
# Train model with custom configuration file
otk train --config my_config.yml
# Make predictions using a trained model
otk predict --model models/best_model.pth --input test_data.csv
The following performance metrics are recorded during model training:
- auPRC (Area under Precision-Recall Curve)
- AUC (Area under ROC Curve)
- F1 Score
- Precision
- Recall
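These metrics can be reproduced from gene-level labels and predicted probabilities with scikit-learn. The sketch below rests on assumptions: the helper `summarize_metrics` is hypothetical, the 0.5 decision threshold is not taken from otk, and `average_precision_score` is used as one common estimate of auPRC, which may differ from how otk computes it.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def summarize_metrics(y_true, y_prob, threshold=0.5):
    """Compute the five listed metrics from labels and probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Threshold-dependent metrics need hard 0/1 predictions.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "auPRC": average_precision_score(y_true, y_prob),
        "AUC": roc_auc_score(y_true, y_prob),
        "F1": f1_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
    }


# Toy labels and probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
metrics = summarize_metrics(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

auPRC and AUC are computed from the probabilities directly, while F1, precision and recall depend on the chosen threshold. With few positive labels, auPRC is usually the more informative of the two ranking metrics.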
We welcome community contributions! If you have any questions or suggestions, please submit them through GitHub Issues.
- Fork the repository
- Create a feature branch
- Implement features or fix bugs
- Run tests
- Submit a Pull Request
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you use otk in your research, please cite the following paper:
Wang, S., Wu, C. Y., He, M. M., Yong, J. X., Chen, Y. X., Qian, L. M., ... & Zhao, Q. (2024). Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. Nature Communications, 15(1), 1-17.
- Project homepage: https://github.com/WangLabCSU/otk
- Email: wangshx@csu.edu.cn