cyuanlong/NeSyDPP4

NeSyDPP4: Neuro-Symbolic AI for DPP-4 Inhibitor Discovery

License: MIT | Python 3.8+ | TensorFlow 2.10+

A hybrid neuro-symbolic AI approach combining Logic Tensor Networks with domain knowledge for DPP-4 inhibitor prediction in diabetes drug discovery.

📋 Overview

This repository contains the complete implementation of our neuro-symbolic QSAR model for DPP-4 (Dipeptidyl Peptidase-4) inhibitor prediction. The system integrates:

  • Logic Tensor Networks (LTN) for neuro-symbolic reasoning
  • Domain knowledge rules via SMARTS pharmacophore patterns
  • 3D molecular descriptors for geometric information
  • Heterogeneous ensemble combining NeSy + XGBoost models

Key Results

| Model | ROC-AUC | Accuracy | Precision | Recall | F1-Score | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| NeSy+XGBoost (Final) | 0.9959 | 96.95% | 96.33% | 96.98% | 96.65% | 0.9388 |
| MolFormer (SOTA) | 0.9956 | 95.96% | 95.32% | 95.84% | 95.58% | 0.9188 |
| NeSy + 3D Only | 0.9946 | 95.73% | 94.52% | 96.32% | 95.41% | 0.9142 |
| XGBoost (Traditional) | 0.9487 | 90.18% | 88.62% | 90.05% | 89.33% | 0.8030 |

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/cyuanlong/NeSyDPP4.git
cd NeSyDPP4

# Create conda environment
conda env create -f environment.yml
conda activate nesydpp4

# Or install manually
pip install tensorflow==2.10.1 ltn==1.0.0 scikit-learn pandas numpy rdkit xgboost matplotlib seaborn

Dataset

The DPP-4 dataset (data/dpp4-26-03-25-feat-with-3d.parquet) contains:

  • 6,563 molecules (2,979 active, 3,584 inactive)
  • Train/Val/Test split: 72% / 8% / 20% (stratified)
  • Features:
    • CDKextended descriptors (1,024-D)
    • ECFP4 fingerprints (3,584-D)
    • 3D geometric descriptors (10-D)
    • SMARTS pharmacophore patterns (22-D)
    • Total: 4,640 features
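
The figures above can be sanity-checked with a few lines of arithmetic. The sketch below lays the four feature blocks end to end and derives split sizes from the stated percentages. The column order, the block names, and the two-stage split (hold out 20% for testing, then take 10% of the remainder for validation, rounding the held-out part up) are assumptions for illustration, not details confirmed by the repository.

```python
# Feature blocks as listed above; names and column order are assumptions.
FEATURE_BLOCKS = [
    ("cdk_extended", 1024),
    ("ecfp4", 3584),
    ("geometry_3d", 10),
    ("smarts", 22),
]


def block_slices(blocks):
    """Map each block name to its (start, stop) column range."""
    slices, start = {}, 0
    for name, width in blocks:
        slices[name] = (start, start + width)
        start += width
    return slices, start


def ceil_percent(n, pct):
    """Integer ceiling of n * pct / 100 (avoids floating-point rounding)."""
    return (n * pct + 99) // 100


def split_sizes(n, test_pct=20, val_pct_of_rest=10):
    """Two-stage split: hold out the test set, then carve validation from the rest."""
    n_test = ceil_percent(n, test_pct)
    n_rest = n - n_test
    n_val = ceil_percent(n_rest, val_pct_of_rest)
    return n_rest - n_val, n_val, n_test


slices, total = block_slices(FEATURE_BLOCKS)
train, val, test = split_sizes(6563)
print(total)             # 4640
print(slices["smarts"])  # (4618, 4640)
print(train, val, test)  # 4725 525 1313
```

Under these assumptions the test set has 1,313 molecules, and the training and validation sets come to roughly 72% and 8% of the 6,563 molecules.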

Run Experiments

1. Train XGBoost Baseline

python experiments/01_xgboost_baseline.py

2. Train NeSy Model with 3D Descriptors

python experiments/12_nesy_with_3d.py

3. Train Heterogeneous Ensemble (Final Model)

python experiments/15_heterogeneous_ensemble.py

4. Compare All Models

python experiments/compare_all_models.py

📊 Results

All experimental results are available in:

  • Figures: figures/main/ (PNG, PDF, CSV)
  • Metrics: results/ (detailed CSV files)
  • Data Dictionary: figures/main/DATA_DICTIONARY.md

Performance Visualization

  • Figure 2: Performance comparison across models
  • Figure 3: ROC curves for all models
  • Figure 4: Bootstrap confidence intervals (1,000 iterations)
  • Figure 5: Feature ablation study
  • Figure 6: Confusion matrix analysis

🧬 Domain Knowledge Rules

The system incorporates 22 SMARTS-based pharmacophore patterns representing:

Pharmacophore Rules (promote activity):

  • Amine, Cyano, Hydroxyl, Amide groups
  • Triazole, Piperazine, Fluorinated aromatic rings
  • β-amino acid mimics, Proline analogs

Toxicophore Rules (suppress activity):

  • Nitro groups, Thiophenol, Hydrazine
  • PAINS (Pan-Assay Interference Structures)

See docs/SMARTS_patterns.md for complete pattern definitions.
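
As a rough illustration of how SMARTS rules like these become binary features, the sketch below matches three patterns against a molecule with RDKit. The three SMARTS strings, the `EXAMPLE_PATTERNS` name, and the `smarts_flags` helper are hypothetical stand-ins chosen for this example; the project's actual 22 patterns are the ones defined in docs/SMARTS_patterns.md.

```python
from rdkit import Chem

# Illustrative SMARTS only (assumptions); not the project's pattern set.
EXAMPLE_PATTERNS = {
    "primary_amine": "[NX3;H2][CX4]",  # NH2 attached to an sp3 carbon
    "cyano": "C#N",                    # nitrile group
    "nitro": "[N+](=O)[O-]",           # charge-separated nitro group
}

# Compile each SMARTS once so it can be reused across molecules.
COMPILED = {name: Chem.MolFromSmarts(s) for name, s in EXAMPLE_PATTERNS.items()}


def smarts_flags(smiles):
    """Return a 0/1 flag per pattern for the molecule given as SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {name: int(mol.HasSubstructMatch(patt)) for name, patt in COMPILED.items()}


print(smarts_flags("NCC#N"))                 # aminoacetonitrile
print(smarts_flags("c1ccccc1[N+](=O)[O-]"))  # nitrobenzene
```

Stacking such flags in a fixed pattern order would give one row of a binary rule matrix; with 22 patterns that row would be 22-D.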

πŸ—οΈ Architecture

Neuro-Symbolic Model

# Logic Tensor Network with domain knowledge
class LTNModel:
    """
    Base MLP: [768, 512, 256] units
    Predicates: IsActive, HasPharmacophore, HasToxicophore
    Axioms:
      * ∀x: HasPharmacophore(x) → IsActive(x)
      * ∀x: HasToxicophore(x) → ¬IsActive(x)
      * ∀x: SimilarTo(x,active) → IsActive(x)
    """
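
One way to read the axioms above: in a Logic Tensor Network every predicate returns a truth degree in [0, 1], and the connectives and quantifiers are replaced by differentiable fuzzy operators so that axiom satisfaction can be maximised during training. The sketch below evaluates the first two axioms in plain Python. It assumes the Reichenbach implication and the p-mean-error aggregator for ∀, which are common choices in the LTN literature; the operators this repository actually uses are not stated here, and the truth degrees are invented for illustration.

```python
def implies(a, b):
    """Reichenbach fuzzy implication: 1 - a + a*b."""
    return 1.0 - a + a * b


def negate(a):
    """Standard fuzzy negation."""
    return 1.0 - a


def forall(truths, p=2.0):
    """p-mean-error aggregator: 1 - (mean((1 - t)^p))^(1/p)."""
    truths = list(truths)
    mean_err = sum((1.0 - t) ** p for t in truths) / len(truths)
    return 1.0 - mean_err ** (1.0 / p)


# Hypothetical truth degrees for three molecules.
has_pharmacophore = [1.0, 1.0, 0.0]
has_toxicophore = [0.0, 1.0, 0.0]
is_active = [0.9, 0.2, 0.1]  # network output for IsActive(x)

# forall x: HasPharmacophore(x) -> IsActive(x)
axiom_1 = forall(implies(h, a) for h, a in zip(has_pharmacophore, is_active))
# forall x: HasToxicophore(x) -> not IsActive(x)
axiom_2 = forall(implies(t, negate(a)) for t, a in zip(has_toxicophore, is_active))

# Training would minimise 1 minus the aggregated satisfaction.
loss = 1.0 - (axiom_1 + axiom_2) / 2.0
print(round(axiom_1, 4), round(axiom_2, 4), round(loss, 4))
```

The second molecule carries both a pharmacophore and a toxicophore, so the two axioms pull its IsActive score in opposite directions; a soft aggregator lets the optimiser trade such conflicts off instead of treating the rules as hard constraints.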

Heterogeneous Ensemble

# Optimal weights: 76.2% NeSy + 23.8% XGBoost
final_prediction = 0.762 * nesy_proba + 0.238 * xgb_proba
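
The README does not say how the 76.2% / 23.8% weights were chosen. A simple possibility, shown here purely as an assumption, is a grid search over the blending weight on the validation set. The helper names and the four validation examples are hypothetical.

```python
def blend(nesy_proba, xgb_proba, w_nesy=0.762):
    """Weighted average of two probability lists; weights sum to 1."""
    w_xgb = 1.0 - w_nesy
    return [w_nesy * p + w_xgb * q for p, q in zip(nesy_proba, xgb_proba)]


def accuracy(proba, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for p, y in zip(proba, labels))
    return hits / len(labels)


def search_weight(nesy_val, xgb_val, y_val, steps=20):
    """Return the first grid weight with the best validation accuracy."""
    grid = [i / steps for i in range(steps + 1)]
    best_w = max(grid, key=lambda w: accuracy(blend(nesy_val, xgb_val, w), y_val))
    return best_w, accuracy(blend(nesy_val, xgb_val, best_w), y_val)


# Hypothetical validation predictions: each model gets one example wrong.
y_val = [1, 1, 0, 0]
nesy_val = [0.9, 0.4, 0.1, 0.3]
xgb_val = [0.7, 0.8, 0.6, 0.2]

best_w, best_acc = search_weight(nesy_val, xgb_val, y_val)
print(best_w, best_acc)
```

In this toy case each model alone scores 0.75, while a blend in the middle of the range classifies all four examples correctly, which is the usual motivation for combining models whose errors differ.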

πŸ“ Project Structure

NeSyDPP4/
β”œβ”€β”€ data/                           # Dataset files
β”‚   └── dpp4-26-03-25-feat-with-3d.parquet
β”œβ”€β”€ experiments/                    # Experiment scripts
β”‚   β”œβ”€β”€ 01_xgboost_baseline.py
β”‚   β”œβ”€β”€ 12_nesy_with_3d.py
β”‚   β”œβ”€β”€ 15_heterogeneous_ensemble.py
β”‚   β”œβ”€β”€ 20_smarts_pharmacophore.py
β”‚   └── compare_all_models.py
β”œβ”€β”€ src/                            # Source code
β”‚   └── evaluation/
β”‚       └── statistical_tests.py
β”œβ”€β”€ figures/                        # Result visualizations
β”‚   └── main/
β”œβ”€β”€ results/                        # Experimental results
β”œβ”€β”€ docs/                           # Documentation
β”‚   └── SMARTS_patterns.md
β”œβ”€β”€ environment.yml                 # Conda environment
└── README.md

🔬 Reproducibility

Random Seed Control

All experiments use fixed random seeds:

import random
import numpy as np
import tensorflow as tf

SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)
random.seed(SEED)

Statistical Validation

  • Bootstrap resampling: 1,000 iterations
  • McNemar's test: χ² = 4.82, p = 0.028
  • Cohen's d: 0.31 vs MolFormer, 0.89 vs XGBoost
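
For readers who want to see where a McNemar statistic and its p-value come from, here is a minimal standard-library sketch. McNemar's test looks only at the discordant pairs: test molecules that one model classifies correctly and the other does not. The counts `b=10` and `c=25` are hypothetical, and whether the reported statistic used the continuity correction is not stated, so the correction is left as an option. The p-value uses the chi-square distribution with one degree of freedom, whose survival function is `erfc(sqrt(x / 2))`.

```python
import math


def chi2_sf_1dof(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mcnemar(b, c, continuity=True):
    """McNemar's test from discordant counts.

    b: examples only model A gets right; c: examples only model B gets right.
    Returns (chi-square statistic, p-value).
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c) - (1.0 if continuity else 0.0)
    stat = max(diff, 0.0) ** 2 / (b + c)
    return stat, chi2_sf_1dof(stat)


# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar(b=10, c=25)
print(round(stat, 2), round(p_value, 4))

# The p-value quoted above follows from its chi-square statistic.
print(round(chi2_sf_1dof(4.82), 3))  # 0.028
```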

📈 Performance Analysis

Ablation Studies

| Configuration | ROC-AUC | Δ ROC-AUC |
| --- | --- | --- |
| Base (CDK+ECFP) | 0.9879 | baseline |
| + 3D Descriptors | 0.9903 | +0.0024 |
| + SMARTS Rules | 0.9926 | +0.0023 |
| + NeSy Axioms | 0.9946 | +0.0020 |
| + XGBoost Ensemble | 0.9959 | +0.0013 |

Confusion Matrix (Test Set, N=1,313)

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actual Negative | 695 (TN) | 22 (FP) |
| Actual Positive | 18 (FN) | 578 (TP) |

  • Sensitivity: 96.98%
  • Specificity: 96.93%
  • False Positive Rate: 3.07%
  • False Negative Rate: 3.02%
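
The four rates listed above follow directly from the confusion-matrix counts. The short sketch below recomputes them, along with accuracy and precision, from TN, FP, FN and TP; the function name is a hypothetical helper, not part of the repository.

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Derive standard classification rates from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "sensitivity": tp / (tp + fn),          # recall on actives
        "specificity": tn / (tn + fp),          # recall on inactives
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # 1 - sensitivity
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
    }


rates = rates_from_confusion(tn=695, fp=22, fn=18, tp=578)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```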

💡 Key Innovations

  1. Domain Knowledge Integration: SMARTS patterns encode medicinal chemistry expertise
  2. 3D Geometric Features: Asphericity, eccentricity, PMI ratios capture molecular shape
  3. Neuro-Symbolic Reasoning: LTN axioms enforce logical consistency
  4. Heterogeneous Ensemble: Combines symbolic (NeSy) and statistical (XGBoost) strengths
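
To make the shape descriptors in item 2 more concrete, the sketch below derives PMI ratios, asphericity and eccentricity from three principal moments of inertia. It uses one common PMI-based formulation; the repository's exact definitions and the code that computes the moments from 3D conformers are not shown here, so both the formulas and the `shape_descriptors` name should be read as assumptions.

```python
import math


def shape_descriptors(i1, i2, i3):
    """Shape descriptors from principal moments of inertia (any order)."""
    i1, i2, i3 = sorted((i1, i2, i3))  # enforce i1 <= i2 <= i3
    if i3 <= 0:
        raise ValueError("largest principal moment must be positive")
    sum_sq = i1 * i1 + i2 * i2 + i3 * i3
    return {
        # Normalised PMI ratios, the axes of the usual rod/disc/sphere plot.
        "npr1": i1 / i3,
        "npr2": i2 / i3,
        # Zero when all three moments are equal (sphere-like shape).
        "asphericity": 0.5 * ((i3 - i2) ** 2 + (i3 - i1) ** 2 + (i2 - i1) ** 2) / sum_sq,
        # Zero for a sphere-like shape, one when the smallest moment vanishes.
        "eccentricity": math.sqrt(i3 * i3 - i1 * i1) / i3,
    }


sphere_like = shape_descriptors(2.0, 2.0, 2.0)
rod_like = shape_descriptors(0.0, 5.0, 5.0)
print(sphere_like)
print(rod_like)
```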

πŸ“ Citation

If you use this code or dataset, please cite:

@article{nesydpp4_2026,
  title={NeSyDPP4: A Neuro-Symbolic AI Approach for DPP-4 Inhibitor Discovery in Diabetes Treatment},
  author={Your Name},
  journal={Journal Name},
  year={2026}
}

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ“§ Contact

For questions or collaborations:

πŸ”— Related Resources


Note: This is a research project for academic purposes. Models are not intended for clinical use without proper validation.
