A hybrid neuro-symbolic AI approach combining Logic Tensor Networks with domain knowledge for DPP-4 inhibitor prediction in diabetes drug discovery.
This repository contains the complete implementation of our neuro-symbolic QSAR model for DPP-4 (Dipeptidyl Peptidase-4) inhibitor prediction. The system integrates:
- Logic Tensor Networks (LTN) for neuro-symbolic reasoning
- Domain knowledge rules via SMARTS pharmacophore patterns
- 3D molecular descriptors for geometric information
- Heterogeneous ensemble combining NeSy + XGBoost models
| Model | ROC-AUC | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| NeSy+XGBoost (Final) | 0.9959 | 96.95% | 96.33% | 96.98% | 96.65% | 0.9388 |
| MolFormer (SOTA) | 0.9956 | 95.96% | 95.32% | 95.84% | 95.58% | 0.9188 |
| NeSy + 3D Only | 0.9946 | 95.73% | 94.52% | 96.32% | 95.41% | 0.9142 |
| XGBoost (Traditional) | 0.9487 | 90.18% | 88.62% | 90.05% | 89.33% | 0.8030 |
# Clone repository
git clone https://github.com/yourusername/NeSyDPP4.git
cd NeSyDPP4
# Create conda environment
conda env create -f environment.yml
conda activate nesydpp4
# Or install manually
pip install tensorflow==2.10.1 ltn==1.0.0 scikit-learn pandas numpy rdkit xgboost matplotlib seaborn

The DPP-4 dataset (data/dpp4-26-03-25-feat-with-3d.parquet) contains:
- 6,563 molecules (2,979 active, 3,584 inactive)
- Train/Val/Test split: 72% / 8% / 20% (stratified)
- Features:
  - CDKextended descriptors (1,024-D)
  - ECFP4 fingerprints (3,584-D)
  - 3D geometric descriptors (10-D)
  - SMARTS pharmacophore patterns (22-D)
  - Total: 4,640 features
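The 72% / 8% / 20% stratified split can be sketched in plain Python as below. This is only an illustration of the idea: the function name `stratified_split`, the per-class rounding, and the seed are assumptions, and the repository's own split code may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.72, 0.08, 0.20), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_test = round(n * fractions[2])
        n_val = round(n * fractions[1])
        # Test and validation folds are cut first; the remainder is training data.
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test


# Class counts from the dataset description: 2,979 active (1), 3,584 inactive (0).
labels = [1] * 2979 + [0] * 3584
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

With these class counts the test fold holds 1,313 molecules (596 active, 717 inactive), which agrees with the row totals of the confusion matrix reported in this README.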
python experiments/01_xgboost_baseline.py
python experiments/12_nesy_with_3d.py
python experiments/15_heterogeneous_ensemble.py
python experiments/compare_all_models.py

All experimental results are available in:
- Figures: figures/main/ (PNG, PDF, CSV)
- Metrics: results/ (detailed CSV files)
- Data Dictionary: figures/main/DATA_DICTIONARY.md
- Figure 2: Performance comparison across models
- Figure 3: ROC curves for all models
- Figure 4: Bootstrap confidence intervals (1,000 iterations)
- Figure 5: Feature ablation study
- Figure 6: Confusion matrix analysis
The system incorporates 22 SMARTS-based pharmacophore patterns representing:
Pharmacophore Rules (promote activity):
- Amine, Cyano, Hydroxyl, Amide groups
- Triazole, Piperazine, Fluorinated aromatic rings
- β-amino acid mimics, Proline analogs
Toxicophore Rules (inhibit activity):
- Nitro groups, Thiophenol, Hydrazine
- PAINS (Pan-Assay Interference Structures)
See docs/SMARTS_patterns.md for complete pattern definitions.
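Rules like these enter the model as soft logical constraints (pharmacophore implies active, toxicophore implies not active). The sketch below shows one way such an implication can be given a truth value in [0, 1]. It uses plain Python rather than the `ltn` package. The Reichenbach implication, the p-mean-error aggregator for the universal quantifier, and all predicate values are assumptions chosen for illustration; the operators actually used in this repository are not stated here.

```python
def implies(a, b):
    """Reichenbach fuzzy implication: truth of (a -> b) for a, b in [0, 1]."""
    return 1.0 - a + a * b


def negate(a):
    """Standard fuzzy negation."""
    return 1.0 - a


def forall(truths, p=2):
    """p-mean-error aggregator, a smooth stand-in for the universal quantifier."""
    mean_err = sum((1.0 - t) ** p for t in truths) / len(truths)
    return 1.0 - mean_err ** (1.0 / p)


# Hypothetical predicate outputs for three molecules (illustrative values only).
has_pharmacophore = [1.0, 1.0, 0.0]
has_toxicophore = [0.0, 1.0, 0.0]
is_active = [0.9, 0.2, 0.1]

# forall x: HasPharmacophore(x) -> IsActive(x)
pharm_axiom = forall([implies(h, a) for h, a in zip(has_pharmacophore, is_active)])
# forall x: HasToxicophore(x) -> not IsActive(x)
tox_axiom = forall([implies(t, negate(a)) for t, a in zip(has_toxicophore, is_active)])

# Training would push axiom satisfaction towards 1; a plain mean is used here for simplicity.
loss = 1.0 - (pharm_axiom + tox_axiom) / 2.0
print(round(pharm_axiom, 3), round(tox_axiom, 3), round(loss, 3))
```

In this toy example the second molecule carries both a pharmacophore and a toxicophore and is predicted mostly inactive, so it lowers the satisfaction of the pharmacophore axiom more than that of the toxicophore axiom.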
# Logic Tensor Network with domain knowledge
class LTNModel:
    - Base MLP: [768, 512, 256] units
    - Predicates: IsActive, HasPharmacophore, HasToxicophore
    - Axioms:
        * ∀x: HasPharmacophore(x) → IsActive(x)
        * ∀x: HasToxicophore(x) → ¬IsActive(x)
        * ∀x: SimilarTo(x, active) → IsActive(x)

# Optimal weights: 76.2% NeSy + 23.8% XGBoost
final_prediction = 0.762 * nesy_proba + 0.238 * xgb_proba

NeSyDPP4/
├── data/ # Dataset files
│   └── dpp4-26-03-25-feat-with-3d.parquet
├── experiments/ # Experiment scripts
│   ├── 01_xgboost_baseline.py
│   ├── 12_nesy_with_3d.py
│   ├── 15_heterogeneous_ensemble.py
│   ├── 20_smarts_pharmacophore.py
│   └── compare_all_models.py
├── src/ # Source code
│   └── evaluation/
│       └── statistical_tests.py
├── figures/ # Result visualizations
│   └── main/
├── results/ # Experimental results
├── docs/ # Documentation
│   └── SMARTS_patterns.md
├── environment.yml # Conda environment
└── README.md
All experiments use fixed random seeds:
import random
import numpy as np
import tensorflow as tf

SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)
random.seed(SEED)

- Bootstrap resampling: 1,000 iterations
- McNemar's test: χ² = 4.82, p = 0.028
- Cohen's d: 0.31 vs MolFormer, 0.89 vs XGBoost
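As a rough guide to how numbers of this kind are obtained, the sketch below implements McNemar's test and a percentile bootstrap using only the standard library. The function names, the continuity correction, the discordant counts, and the choice of accuracy as the bootstrapped metric are assumptions for illustration; the repository's src/evaluation/statistical_tests.py may do this differently.

```python
import math
import random


def chi2_sf_1dof(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def mcnemar(b, c, continuity=True):
    """McNemar statistic and p-value from the two discordant counts.

    b: test molecules that model A classifies correctly and model B does not
    c: the reverse
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c) - (1.0 if continuity else 0.0)
    stat = max(diff, 0.0) ** 2 / (b + c)
    return stat, chi2_sf_1dof(stat)


def bootstrap_accuracy_ci(correct, n_iter=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_iter))
    lower = stats[round(alpha / 2 * n_iter)]
    upper = stats[round((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


# p-value implied by the reported statistic (chi-square = 4.82, 1 degree of freedom).
print(f"p = {chi2_sf_1dof(4.82):.3f}")

# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar(b=30, c=15)

# 1,273 of the 1,313 test predictions are correct according to the confusion matrix.
correct = [1] * 1273 + [0] * 40
ci_low, ci_high = bootstrap_accuracy_ci(correct)
```

The first print evaluates to about 0.028, so the reported χ² and p-value are consistent with a chi-square distribution with one degree of freedom.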
| Configuration | ROC-AUC | Δ ROC-AUC |
|---|---|---|
| Base (CDK+ECFP) | 0.9879 | baseline |
| + 3D Descriptors | 0.9903 | +0.0024 |
| + SMARTS Rules | 0.9926 | +0.0023 |
| + NeSy Axioms | 0.9946 | +0.0020 |
| + XGBoost Ensemble | 0.9959 | +0.0013 |
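The last ablation row blends the NeSy and XGBoost outputs. A minimal sketch of that weighted average, using the weights given in this README (76.2% NeSy, 23.8% XGBoost), could look as follows. The function name, the 0.5 decision threshold, and the probability values are hypothetical.

```python
def ensemble_proba(nesy_proba, xgb_proba, w_nesy=0.762):
    """Weighted average of two models' positive-class probabilities."""
    w_xgb = 1.0 - w_nesy
    return [w_nesy * p + w_xgb * q for p, q in zip(nesy_proba, xgb_proba)]


# Hypothetical per-molecule probabilities from the two models.
nesy_proba = [0.90, 0.40, 0.10]
xgb_proba = [0.60, 0.70, 0.20]

final_prediction = ensemble_proba(nesy_proba, xgb_proba)
# Assumed decision threshold of 0.5 for the active class.
predicted_labels = [int(p >= 0.5) for p in final_prediction]
print(final_prediction, predicted_labels)
```

Because the weights sum to 1, the blended value stays a valid probability and always lies between the two model outputs.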
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | 695 (TN) | 22 (FP) |
| Actual Positive | 18 (FN) | 578 (TP) |
- Sensitivity: 96.98%
- Specificity: 96.93%
- False Positive Rate: 3.07%
- False Negative Rate: 3.02%
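The rates above follow directly from the four confusion-matrix counts. The short sketch below recomputes them; the function name `confusion_metrics` is illustrative and not part of the repository.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Counts taken from the confusion matrix above.
metrics = confusion_metrics(tn=695, fp=22, fn=18, tp=578)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```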
- Domain Knowledge Integration: SMARTS patterns encode medicinal chemistry expertise
- 3D Geometric Features: Asphericity, eccentricity, PMI ratios capture molecular shape
- Neuro-Symbolic Reasoning: LTN axioms enforce logical consistency
- Heterogeneous Ensemble: Combines symbolic (NeSy) and statistical (XGBoost) strengths
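To make the shape features concrete, the sketch below computes asphericity, eccentricity, and the two normalized PMI ratios from three principal moments of inertia. The formulas are the commonly used definitions (the same forms documented for RDKit's `Descriptors3D` module). Whether the repository's 10 geometric descriptors are computed this way is an assumption, and the input moments are illustrative values, not taken from a real molecule.

```python
import math


def shape_descriptors(moments):
    """Shape descriptors from the three principal moments of inertia."""
    pm1, pm2, pm3 = sorted(moments)  # enforce pm1 <= pm2 <= pm3
    sum_sq = pm1 ** 2 + pm2 ** 2 + pm3 ** 2
    pairwise = (pm3 - pm2) ** 2 + (pm3 - pm1) ** 2 + (pm2 - pm1) ** 2
    return {
        "npr1": pm1 / pm3,  # normalized PMI ratio I1/I3
        "npr2": pm2 / pm3,  # normalized PMI ratio I2/I3
        "asphericity": 0.5 * pairwise / sum_sq,
        "eccentricity": math.sqrt(pm3 ** 2 - pm1 ** 2) / pm3,
    }


# Illustrative moments: a sphere-like and a rod-like mass distribution.
sphere_like = shape_descriptors((100.0, 100.0, 100.0))
rod_like = shape_descriptors((1.0, 250.0, 250.0))
print(sphere_like)
print(rod_like)
```

Sphere-like inputs give asphericity and eccentricity of 0, while rod-like inputs push eccentricity towards 1 and NPR1 towards 0, which is the kind of shape contrast these features expose to the model.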
If you use this code or dataset, please cite:
@article{nesydpp4_2026,
title={NeSyDPP4: A Neuro-Symbolic AI Approach for DPP-4 Inhibitor Discovery in Diabetes Treatment},
author={Your Name},
journal={Journal Name},
year={2026}
}

This project is licensed under the MIT License; see the LICENSE file for details.
- Logic Tensor Networks (LTN): https://github.com/logictensornetworks/LTN
- RDKit: Open-source cheminformatics toolkit
- ChEMBL: Bioactivity database for DPP-4 data
For questions or collaborations:
- Email: your.email@example.com
- Issues: GitHub Issues
Note: This is a research project for academic purposes. Models are not intended for clinical use without proper validation.