XGBoost solubility (logS) prediction model, served via BentoML, containerized with Docker, orchestrated with Kubernetes, and deployed on Azure — with a professional web frontend.
This project builds a complete production ML pipeline for predicting aqueous solubility (logS) of drug-like molecules from SMILES strings. It demonstrates how modern ML tools chain together from model training to cloud deployment.
Train & track (MLflow) → Register model (MLflow) → Package as API (BentoML) → Containerize (Docker) → Deploy (Kubernetes + Azure)
Model performance (Butina-split validated):
- Dataset: Delaney logS benchmark (1,128 molecules)
- Split: Butina clustering (chemically diverse train/test)
- Tuning: Optuna (50 trials)
- Final RMSE: ~0.84 log mol/L
- Final R²: ~0.88
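For reference, the reported metrics correspond to standard scikit-learn regression metrics; a minimal sketch (the arrays below are illustrative, not the project's actual test-set values):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values standing in for Butina test-set labels and predictions
y_true = np.array([-2.1, -0.5, -3.4, 0.8, -1.2])
y_pred = np.array([-1.9, -0.7, -3.0, 0.5, -1.5])

rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # root mean squared error
r2 = r2_score(y_true, y_pred)                              # coefficient of determination
print(f"RMSE: {rmse:.2f} log mol/L, R2: {r2:.2f}")
```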
project_19_LogSpredictor/
│
├── train_logs.py # Step 1 — train XGBoost, log to MLflow, register model
├── save_model.py # Step 2 — export model from MLflow → BentoML store
├── service.py # Step 3 — BentoML HTTP API service definition
├── bentofile.yaml # Step 4 — Bento bundle config (deps + model + docker)
├── deployment.yaml # Step 5 — Kubernetes Deployment + Service
├── index.html # Frontend — professional ADMET predictor UI
├── requirements.txt # Python dependencies
│
├── mlflow.db # MLflow SQLite tracking database (gitignore this)
├── mlruns/ # MLflow artifact store (gitignore this)
├── bento_export/ # Exported Bento bundle for Docker build (gitignore this)
└── bento_export.tar # Compressed bento export (gitignore this)
| Tool | Role | Analogy |
|---|---|---|
| XGBoost | Learns patterns from molecular features | The brain |
| MLflow | Tracks experiments, stores model versions | Git for models |
| BentoML | Packages model as a production HTTP API | The lunchbox |
| Docker | Containerizes the bento for any machine | The delivery truck |
| Kubernetes | Orchestrates containers at scale | The warehouse manager |
| Azure AKS | Hosts Kubernetes in the cloud | The restaurant |
Random splits are over-optimistic for molecular ML. If training molecules are similar to test molecules (high Tanimoto similarity), the model appears to perform better than it actually does on new chemical series.
Butina clustering groups molecules by structural similarity and assigns whole clusters to either train or test — ensuring the test set contains genuinely new chemical scaffolds.
Random split: Train molecule ←→ Test molecule (Tanimoto ~0.8, too similar)
Butina split: Train cluster ←→ Test cluster (Tanimoto ~0.3, genuinely different)
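A sketch of how such a split can be implemented with RDKit; the cutoff, test fraction, and function name are illustrative, and the real train_logs.py may differ:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def butina_split(smiles_list, cutoff=0.6, test_frac=0.2):
    """Cluster by Tanimoto distance, then assign whole clusters to the test set."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

    # Flattened lower-triangle distance matrix: 1 - Tanimoto similarity
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)

    clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

    # Fill the test set with whole clusters (smallest first) until test_frac is reached
    test_idx = set()
    for cluster in sorted(clusters, key=len):
        if len(test_idx) >= test_frac * len(smiles_list):
            break
        test_idx.update(cluster)
    train_idx = [i for i in range(len(smiles_list)) if i not in test_idx]
    return train_idx, sorted(test_idx)
```

Because whole clusters move together, no test molecule shares a cluster (and hence a close structural neighbour) with any training molecule.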
MLflow is optimised for the training side — experiment comparison, hyperparameter logging, model versioning. BentoML is optimised for the serving side — HTTP APIs, batching, containerization. They are intentionally separate so the data scientist and the deployment engineer can work independently.
When a model is logged via mlflow.sklearn.log_model(), it is stored as a scikit-learn pickle. Loading it back with mlflow.sklearn.load_model() returns an XGBRegressor object with its full sklearn wrapper intact. Saving this into BentoML with bentoml.sklearn.save_model() preserves the _estimator_type attribute that XGBoost requires to load correctly inside the container.
- Python 3.11
- conda or miniforge
- Docker Desktop
- kubectl
- Azure CLI (`brew install azure-cli`)
conda create -n proj19 python=3.11 -y
conda activate proj19
pip install setuptools==69.5.1
pip install numpy==1.26.4
pip install pandas==2.2.0
pip install scikit-learn==1.4.0
pip install xgboost==2.0.3
pip install mlflow==2.11.0
pip install optuna==3.5.0
pip install rdkit==2023.9.5
pip install pydantic==1.10.13
pip install bentoml==1.3.4.post1

python train_logs.py

This will:
- Download the Delaney logS dataset
- Compute Morgan fingerprints + physicochemical descriptors
- Perform Butina clustering split
- Run 50 Optuna trials logged as nested MLflow runs
- Register the best model in the MLflow Model Registry
View results in MLflow UI:
mlflow ui --backend-store-uri sqlite:///mlflow.db
# Open http://localhost:5000

python save_model.py

Pulls the registered model from MLflow and saves it into BentoML's local model store as a scikit-learn pickle.
Verify:
bentoml models list

bentoml serve service:svc
# Open http://localhost:3000

Test with curl:
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
  -d '{"smiles": ["CC(=O)O", "c1ccccc1"]}'

Expected response:
{
"smiles": ["CC(=O)O", "c1ccccc1"],
"logS": [0.814, -1.72],
"valid": [true, true]
}

bentoml build

# Export bento
bentoml export logs-predictor-bento:latest ./bento_export.tar
mkdir bento_export
tar -xf bento_export.tar -C bento_export
# Fix Dockerfile for newer uv installer
sed -i '' 's/.cargo\/bin\/uv/.local\/bin\/uv/' bento_export/env/docker/Dockerfile
# Build multi-platform Docker image (arm64 for Mac, amd64 for Azure)
cd bento_export
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t your-dockerhub-username/logs-predictor:v1 \
-f env/docker/Dockerfile \
--push \
--progress=plain \
--no-cache \
  .

Update deployment.yaml with your Docker Hub image tag, then:
# Local (Docker Desktop)
kubectl apply -f deployment.yaml
kubectl port-forward svc/logs-predictor-svc 3000:80
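For reference, deployment.yaml pairs a Deployment with a LoadBalancer Service along these lines; replica count, image tag, and labels are placeholders, with only the service name and port mapping taken from the commands in this README:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logs-predictor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: logs-predictor
  template:
    metadata:
      labels:
        app: logs-predictor
    spec:
      containers:
        - name: logs-predictor
          image: your-dockerhub-username/logs-predictor:v1
          ports:
            - containerPort: 3000   # BentoML's default HTTP port
---
apiVersion: v1
kind: Service
metadata:
  name: logs-predictor-svc
spec:
  type: LoadBalancer
  selector:
    app: logs-predictor
  ports:
    - port: 80
      targetPort: 3000
```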
# Azure AKS
az aks get-credentials --resource-group logs-predictor-rg --name logs-predictor-aks
kubectl apply -f deployment.yaml
kubectl get svc logs-predictor-svc   # get public IP

# Login
az login
# Create resource group
az group create --name logs-predictor-rg --location ukwest
# Register AKS provider
az provider register --namespace Microsoft.ContainerService
# Create cluster
az aks create \
--resource-group logs-predictor-rg \
--name logs-predictor-aks \
--node-count 2 \
--node-vm-size Standard_B2s_v2 \
--generate-ssh-keys
# Connect kubectl to Azure
az aks get-credentials \
--resource-group logs-predictor-rg \
--name logs-predictor-aks
# Deploy
kubectl apply -f deployment.yaml
# Get public IP
kubectl get svc logs-predictor-svc

# Stop cluster overnight (stops VM charges)
az aks stop --resource-group logs-predictor-rg --name logs-predictor-aks
# Restart next day
az aks start --resource-group logs-predictor-rg --name logs-predictor-aks
# Delete everything (stops all charges)
az group delete --name logs-predictor-rg --yes

Open index.html in your browser. Enter your API endpoint (local or Azure public IP) in the URL field, paste SMILES strings (one per line), and click Run Prediction.
Features:
- Multi-SMILES input
- Solubility classification (Soluble / Moderate / Poor)
- Bar chart visualisation of results
- Export to CSV
- Property tabs for future ADMET models (logP, hERG, BBB, Caco-2)
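If you prefer scripting to the UI, the same /predict endpoint can be called from Python with only the standard library; the endpoint URL is whatever your deployment exposes:

```python
import json
from urllib import request

API_URL = "http://localhost:3000/predict"  # or http://<azure-public-ip>/predict

def predict_logs(smiles):
    """POST a batch of SMILES strings to the service and return the parsed JSON."""
    payload = json.dumps({"smiles": smiles}).encode()
    req = request.Request(API_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# result = predict_logs(["CC(=O)O", "c1ccccc1"])
# result["logS"] holds predictions, result["valid"] flags parseable SMILES
```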
Experiment: logS-xgboost-butina
└── optuna-butina-split (parent run)
├── trial_0 val_rmse=1.20
├── trial_1 val_rmse=0.94
├── ...
└── trial_N val_rmse=0.84 ← best
→ registered as "logS-predictor v1" in Model Registry
MLflow database tables:
- `experiments` — experiment names and IDs
- `runs` — each training run
- `metrics` — RMSE, MAE, R² per run
- `params` — hyperparameters per run
- `model_versions` — registered models
Each molecule is converted to a 2054-dimensional feature vector:
- Morgan fingerprints (2048 bits, radius=2, ECFP4 equivalent) — encodes local chemical environment
- Physicochemical descriptors (6 features):
- Molecular weight
- LogP (lipophilicity)
- H-bond donors
- H-bond acceptors
- Topological polar surface area (TPSA)
- Number of rotatable bonds
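Putting both blocks together, a featurizer along these lines can be written with RDKit (the function name and descriptor order are illustrative):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def featurize(smiles):
    """SMILES -> 2054-dim vector: 2048-bit Morgan FP + 6 physchem descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES

    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)  # ECFP4-like
    fp_arr = np.zeros(2048, dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, fp_arr)

    desc = np.array([
        Descriptors.MolWt(mol),              # molecular weight
        Descriptors.MolLogP(mol),            # lipophilicity
        Descriptors.NumHDonors(mol),         # H-bond donors
        Descriptors.NumHAcceptors(mol),      # H-bond acceptors
        Descriptors.TPSA(mol),               # topological polar surface area
        Descriptors.NumRotatableBonds(mol),  # rotatable bonds
    ], dtype=np.float32)
    return np.concatenate([fp_arr, desc])
```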
Add these to your .gitignore:
__pycache__/
*.pyc
mlflow.db
mlruns/
bento_export/
bento_export.tar
xgboost==2.0.3
scikit-learn==1.4.0
numpy==1.26.4
pandas==2.2.0
mlflow==2.11.0
optuna==3.5.0
rdkit==2023.9.5
pydantic==1.10.13
bentoml==1.3.4.post1
setuptools==69.5.1
- Butina split is essential for honest evaluation — random splits can inflate R² by 0.05-0.15 on molecular datasets
- BentoML version compatibility matters — v1.2, v1.3, and v1.4 have breaking API changes; pin your version
- Save model format matters — `mlflow.sklearn.log_model` + `bentoml.sklearn.save_model` preserves the sklearn wrapper; `mlflow.xgboost.log_model` does not
- Multi-platform Docker builds are essential when developing on Apple Silicon (arm64) and deploying to cloud servers (amd64)
- Local Kubernetes vs Azure AKS — local is useful for testing but resource-heavy; Azure gives a real public IP and proper scaling