etiu/LogS_predictor

logS ADMET Predictor — End-to-End ML Pipeline

XGBoost solubility (logS) prediction model, served via BentoML, containerized with Docker, orchestrated with Kubernetes, and deployed on Azure — with a professional web frontend.


Overview

This project builds a complete production ML pipeline for predicting aqueous solubility (logS) of drug-like molecules from SMILES strings. It demonstrates how modern ML tools chain together from model training to cloud deployment.

Train & track  →  Register model  →  Package as API  →  Containerize  →  Deploy
   MLflow            MLflow            BentoML            Docker         Kubernetes + Azure

Model performance (Butina-split validated):

  • Dataset: Delaney logS benchmark (1,128 molecules)
  • Split: Butina clustering (chemically diverse train/test)
  • Tuning: Optuna (50 trials)
  • Final RMSE: ~0.84 log mol/L
  • Final R²: ~0.88
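
RMSE and R² here are the standard definitions; a minimal sketch on toy values (the numbers below are illustrative, not taken from the real test set):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, in log units for logS."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# toy logS values
y_true = [-2.18, -0.74, -3.01, -1.33]
y_pred = [-2.00, -0.90, -2.70, -1.50]
```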

Project Structure

project_19_LogSpredictor/
│
├── train_logs.py          # Step 1 — train XGBoost, log to MLflow, register model
├── save_model.py          # Step 2 — export model from MLflow → BentoML store
├── service.py             # Step 3 — BentoML HTTP API service definition
├── bentofile.yaml         # Step 4 — Bento bundle config (deps + model + docker)
├── deployment.yaml        # Step 5 — Kubernetes Deployment + Service
├── index.html             # Frontend — professional ADMET predictor UI
├── requirements.txt       # Python dependencies
│
├── mlflow.db              # MLflow SQLite tracking database (gitignore this)
├── mlruns/                # MLflow artifact store (gitignore this)
├── bento_export/          # Exported Bento bundle for Docker build (gitignore this)
└── bento_export.tar       # Compressed bento export (gitignore this)

Tools & Why Each One

Tool         Role                                            Analogy
----         ----                                            -------
XGBoost      Learns patterns from molecular features         The brain
MLflow       Tracks experiments, stores model versions       Git for models
BentoML      Packages model as a production HTTP API         The lunchbox
Docker       Containerizes the bento for any machine         The delivery truck
Kubernetes   Orchestrates containers at scale                The warehouse manager
Azure AKS    Hosts Kubernetes in the cloud                   The restaurant

Key Design Decisions

Why Butina Clustering Split?

Random splits are over-optimistic for molecular ML. If training molecules are similar to test molecules (high Tanimoto similarity), the model appears to perform better than it actually does on new chemical series.

Butina clustering groups molecules by structural similarity and assigns whole clusters to either train or test — ensuring the test set contains genuinely new chemical scaffolds.

Random split:  Train molecule ←→ Test molecule  (Tanimoto ~0.8, too similar)
Butina split:  Train cluster  ←→ Test cluster   (Tanimoto ~0.3, genuinely different)
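
The idea can be sketched in pure Python on fingerprint bit sets (the real pipeline presumably uses RDKit's Butina.ClusterData on Tanimoto distances between Morgan fingerprints; the cutoff and toy fingerprints below are illustrative):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def butina_clusters(fps, cutoff=0.35):
    """Greedy Butina-style clustering: largest neighbourhoods first."""
    n = len(fps)
    neighbours = [
        {j for j in range(n) if tanimoto(fps[i], fps[j]) >= cutoff}
        for i in range(n)
    ]
    unassigned, clusters = set(range(n)), []
    while unassigned:
        # the molecule with the most unassigned neighbours becomes a centroid
        centroid = max(unassigned, key=lambda i: len(neighbours[i] & unassigned))
        cluster = neighbours[centroid] & unassigned
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters

# two pairs of similar molecules -> two clusters
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8}]
clusters = butina_clusters(fps)
# whole clusters then go to train or test, never split across both
```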

Why MLflow + BentoML instead of one tool?

MLflow is optimised for the training side — experiment comparison, hyperparameter logging, model versioning. BentoML is optimised for the serving side — HTTP APIs, batching, containerization. They are intentionally separate so the data scientist and the deployment engineer can work independently.

Why save with bentoml.sklearn instead of bentoml.xgboost?

When a model is logged via mlflow.sklearn.log_model(), it is stored as a scikit-learn pickle. Loading it back with mlflow.sklearn.load_model() returns an XGBRegressor object with its full sklearn wrapper intact. Saving this into BentoML with bentoml.sklearn.save_model() preserves the _estimator_type attribute that XGBoost requires to load correctly inside the container.
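
The mechanics can be shown with a toy stand-in, assuming (as the paragraph above describes) that both stores pickle the estimator: a plain pickle round-trip keeps class attributes like _estimator_type intact.

```python
import pickle

# Toy class standing in for the sklearn-wrapped XGBRegressor;
# real scikit-learn estimators carry _estimator_type the same way.
class ToyRegressor:
    _estimator_type = "regressor"

    def predict(self, X):
        return [0.0 for _ in X]

# pickling and unpickling preserves the sklearn-style attribute
restored = pickle.loads(pickle.dumps(ToyRegressor()))
```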


Setup

Prerequisites

  • Python 3.11
  • conda or miniforge
  • Docker Desktop
  • kubectl
  • Azure CLI (brew install azure-cli)

Install dependencies

conda create -n proj19 python=3.11 -y
conda activate proj19

pip install setuptools==69.5.1
pip install numpy==1.26.4
pip install pandas==2.2.0
pip install scikit-learn==1.4.0
pip install xgboost==2.0.3
pip install mlflow==2.11.0
pip install optuna==3.5.0
pip install rdkit==2023.9.5
pip install pydantic==1.10.13
pip install bentoml==1.3.4.post1

Running the Pipeline

Step 1 — Train the model

python train_logs.py

This will:

  • Download the Delaney logS dataset
  • Compute Morgan fingerprints + physicochemical descriptors
  • Perform Butina clustering split
  • Run 50 Optuna trials logged as nested MLflow runs
  • Register the best model in the MLflow Model Registry

View results in MLflow UI:

mlflow ui --backend-store-uri sqlite:///mlflow.db
# Open http://localhost:5000

Step 2 — Export model to BentoML

python save_model.py

Pulls the registered model from MLflow and saves it into BentoML's local model store as a scikit-learn pickle.

Verify:

bentoml models list

Step 3 — Test the API locally

bentoml serve service:svc
# Open http://localhost:3000

Test with curl:

curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"smiles": ["CC(=O)O", "c1ccccc1"]}'

Expected response:

{
  "smiles": ["CC(=O)O", "c1ccccc1"],
  "logS": [0.814, -1.72],
  "valid": [true, true]
}
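
The same call from Python, using only the standard library (the helper names and default URL here are illustrative, not part of the project):

```python
import json
import urllib.request

def build_payload(smiles):
    """Encode a list of SMILES strings as the JSON body the API expects."""
    return json.dumps({"smiles": smiles}).encode("utf-8")

def predict(smiles, url="http://localhost:3000/predict"):
    """POST SMILES to the running service and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=build_payload(smiles),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# predict(["CC(=O)O", "c1ccccc1"])  # requires the Step 3 service to be running
```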

Step 4 — Build the Bento bundle

bentoml build
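
This reads bentofile.yaml from the current directory. A hedged sketch of what that file typically contains (the service path, includes, and pinned packages below are assumptions based on this project's layout):

```yaml
service: "service:svc"        # import path of the BentoML Service object
include:
  - "service.py"
python:
  packages:                   # pinned to match requirements.txt
    - xgboost==2.0.3
    - scikit-learn==1.4.0
    - rdkit==2023.9.5
    - pydantic==1.10.13
docker:
  python_version: "3.11"
```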

Step 5 — Export and containerize

# Export bento
bentoml export logs-predictor-bento:latest ./bento_export.tar
mkdir bento_export
tar -xf bento_export.tar -C bento_export

# Fix Dockerfile for newer uv installer
sed -i '' 's/.cargo\/bin\/uv/.local\/bin\/uv/' bento_export/env/docker/Dockerfile

# Build multi-platform Docker image (arm64 for Mac, amd64 for Azure)
cd bento_export
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t your-dockerhub-username/logs-predictor:v1 \
  -f env/docker/Dockerfile \
  --push \
  --progress=plain \
  --no-cache \
  .

Step 6 — Deploy to Kubernetes

Update deployment.yaml with your Docker Hub image tag, then:

# Local (Docker Desktop)
kubectl apply -f deployment.yaml
kubectl port-forward svc/logs-predictor-svc 3000:80

# Azure AKS
az aks get-credentials --resource-group logs-predictor-rg --name logs-predictor-aks
kubectl apply -f deployment.yaml
kubectl get svc logs-predictor-svc  # get public IP
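
A hedged sketch of what deployment.yaml plausibly contains, inferred from the commands above (replica count, labels, and image tag are assumptions; the Service maps port 80 to BentoML's default port 3000):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logs-predictor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: logs-predictor
  template:
    metadata:
      labels:
        app: logs-predictor
    spec:
      containers:
        - name: logs-predictor
          image: your-dockerhub-username/logs-predictor:v1
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: logs-predictor-svc
spec:
  type: LoadBalancer
  selector:
    app: logs-predictor
  ports:
    - port: 80
      targetPort: 3000
```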

Azure Deployment

Create AKS cluster

# Login
az login

# Create resource group
az group create --name logs-predictor-rg --location ukwest

# Register AKS provider
az provider register --namespace Microsoft.ContainerService

# Create cluster
az aks create \
  --resource-group logs-predictor-rg \
  --name logs-predictor-aks \
  --node-count 2 \
  --node-vm-size Standard_B2s_v2 \
  --generate-ssh-keys

# Connect kubectl to Azure
az aks get-credentials \
  --resource-group logs-predictor-rg \
  --name logs-predictor-aks

# Deploy
kubectl apply -f deployment.yaml

# Get public IP
kubectl get svc logs-predictor-svc

Cost management

# Stop cluster overnight (stops VM charges)
az aks stop --resource-group logs-predictor-rg --name logs-predictor-aks

# Restart next day
az aks start --resource-group logs-predictor-rg --name logs-predictor-aks

# Delete everything (stops all charges)
az group delete --name logs-predictor-rg --yes

Frontend

Open index.html in your browser. Enter your API endpoint (local or Azure public IP) in the URL field, paste SMILES strings (one per line), and click Run Prediction.

Features:

  • Multi-SMILES input
  • Solubility classification (Soluble / Moderate / Poor)
  • Bar chart visualisation of results
  • Export to CSV
  • Property tabs for future ADMET models (logP, hERG, BBB, Caco-2)
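
The classification step maps a predicted logS to a band; a minimal sketch (the exact cutoffs used in index.html are assumptions here, logS in log mol/L):

```python
def classify_solubility(logS):
    """Bucket a predicted logS value into a readable solubility band."""
    if logS >= -2:
        return "Soluble"
    if logS >= -4:
        return "Moderate"
    return "Poor"
```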

MLflow Structure

Experiment: logS-xgboost-butina
└── optuna-butina-split (parent run)
    ├── trial_0    val_rmse=1.20
    ├── trial_1    val_rmse=0.94
    ├── ...
    └── trial_N    val_rmse=0.84  ← best
        → registered as "logS-predictor v1" in Model Registry

MLflow database tables:

  • experiments — experiment names and IDs
  • runs — each training run
  • metrics — RMSE, MAE, R² per run
  • params — hyperparameters per run
  • model_versions — registered models

Featurization

Each molecule is converted to a 2054-dimensional feature vector:

  • Morgan fingerprints (2048 bits, radius=2, ECFP4 equivalent) — encodes local chemical environment
  • Physicochemical descriptors (6 features):
    • Molecular weight
    • LogP (lipophilicity)
    • H-bond donors
    • H-bond acceptors
    • Topological polar surface area (TPSA)
    • Number of rotatable bonds
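
The layout can be sketched without RDKit; in the real pipeline the on-bits would come from RDKit's AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048) and the six descriptors from rdkit.Chem.Descriptors (the toy values below are illustrative):

```python
NBITS = 2048  # Morgan fingerprint length

def assemble_features(on_bits, descriptors):
    """Concatenate a 2048-bit fingerprint with 6 physicochemical descriptors."""
    assert len(descriptors) == 6
    fp = [0.0] * NBITS
    for b in on_bits:
        fp[b] = 1.0
    return fp + list(descriptors)

# toy stand-ins for MW, LogP, HBD, HBA, TPSA, rotatable bonds
vec = assemble_features({10, 512, 1999}, [60.05, -0.17, 1, 2, 37.3, 0])
```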

.gitignore

Add these to your .gitignore:

__pycache__/
*.pyc
mlflow.db
mlruns/
bento_export/
bento_export.tar

Dependencies

xgboost==2.0.3
scikit-learn==1.4.0
numpy==1.26.4
pandas==2.2.0
mlflow==2.11.0
optuna==3.5.0
rdkit==2023.9.5
pydantic==1.10.13
bentoml==1.3.4.post1
setuptools==69.5.1

Lessons Learned

  • Butina split is essential for honest evaluation — random splits can inflate R² by 0.05-0.15 on molecular datasets
  • BentoML version compatibility matters — v1.2, v1.3, and v1.4 have breaking API changes; pin your version
  • Save model format matters — mlflow.sklearn.log_model + bentoml.sklearn.save_model preserves the sklearn wrapper; mlflow.xgboost.log_model does not
  • Multi-platform Docker builds are essential when developing on Apple Silicon (arm64) and deploying to cloud servers (amd64)
  • Local Kubernetes vs Azure AKS — local is useful for testing but resource-heavy; Azure gives a real public IP and proper scaling

About

End-to-end MLOps pipeline: XGBoost logS solubility model trained on the Delaney dataset, tracked with MLflow, packaged with BentoML, containerized with Docker, orchestrated with Kubernetes, and deployed on Azure AKS. Includes a professional web frontend for predictions.
