etiu/LogS_predictor

logS ADMET Predictor — End-to-End ML Pipeline

XGBoost solubility (logS) prediction model, served via BentoML, containerized with Docker, orchestrated with Kubernetes, and deployed on Azure — with a professional web frontend.


Overview

This project builds a complete production ML pipeline for predicting aqueous solubility (logS) of drug-like molecules from SMILES strings. It demonstrates how modern ML tools chain together from model training to cloud deployment.

Train & track  →  Register model  →  Package as API  →  Containerize  →  Deploy
   MLflow            MLflow            BentoML            Docker         Kubernetes + Azure

Model performance (Butina-split validated):

  • Dataset: Delaney logS benchmark (1,128 molecules)
  • Split: Butina clustering (chemically diverse train/test)
  • Tuning: Optuna (50 trials)
  • Final RMSE: ~0.84 log mol/L
  • Final R²: ~0.88
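
RMSE and R² here are the standard definitions; a minimal sketch on toy values (the numbers below are illustrative, not taken from the real test set):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, in log units for logS."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# toy logS values
y_true = [-2.18, -0.74, -3.01, -1.33]
y_pred = [-2.00, -0.90, -2.70, -1.50]
```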

Project Structure

project_19_LogSpredictor/
│
├── train_logs.py          # Step 1 — train XGBoost, log to MLflow, register model
├── save_model.py          # Step 2 — export model from MLflow → BentoML store
├── service.py             # Step 3 — BentoML HTTP API service definition
├── bentofile.yaml         # Step 4 — Bento bundle config (deps + model + docker)
├── deployment.yaml        # Step 5 — Kubernetes Deployment + Service
├── index.html             # Frontend — professional ADMET predictor UI
├── requirements.txt       # Python dependencies
│
├── mlflow.db              # MLflow SQLite tracking database (gitignore this)
├── mlruns/                # MLflow artifact store (gitignore this)
├── bento_export/          # Exported Bento bundle for Docker build (gitignore this)
└── bento_export.tar       # Compressed bento export (gitignore this)

Tools & Why Each One

Tool         Role                                            Analogy
----         ----                                            -------
XGBoost      Learns patterns from molecular features         The brain
MLflow       Tracks experiments, stores model versions       Git for models
BentoML      Packages model as a production HTTP API         The lunchbox
Docker       Containerizes the bento for any machine         The delivery truck
Kubernetes   Orchestrates containers at scale                The warehouse manager
Azure AKS    Hosts Kubernetes in the cloud                   The restaurant

Key Design Decisions

Why Butina Clustering Split?

Random splits are over-optimistic for molecular ML. If training molecules are similar to test molecules (high Tanimoto similarity), the model appears to perform better than it actually does on new chemical series.

Butina clustering groups molecules by structural similarity and assigns whole clusters to either train or test — ensuring the test set contains genuinely new chemical scaffolds.

Random split:  Train molecule ←→ Test molecule  (Tanimoto ~0.8, too similar)
Butina split:  Train cluster  ←→ Test cluster   (Tanimoto ~0.3, genuinely different)
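
The idea can be sketched in pure Python on fingerprint bit sets (the real pipeline presumably uses RDKit's Butina.ClusterData on Tanimoto distances between Morgan fingerprints; the cutoff and toy fingerprints below are illustrative):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def butina_clusters(fps, cutoff=0.35):
    """Greedy Butina-style clustering: largest neighbourhoods first."""
    n = len(fps)
    neighbours = [
        {j for j in range(n) if tanimoto(fps[i], fps[j]) >= cutoff}
        for i in range(n)
    ]
    unassigned, clusters = set(range(n)), []
    while unassigned:
        # the molecule with the most unassigned neighbours becomes a centroid
        centroid = max(unassigned, key=lambda i: len(neighbours[i] & unassigned))
        cluster = neighbours[centroid] & unassigned
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters

# two pairs of similar molecules -> two clusters
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8}]
clusters = butina_clusters(fps)
# whole clusters then go to train or test, never split across both
```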

Why MLflow + BentoML instead of one tool?

MLflow is optimised for the training side — experiment comparison, hyperparameter logging, model versioning. BentoML is optimised for the serving side — HTTP APIs, batching, containerization. They are intentionally separate so the data scientist and the deployment engineer can work independently.

Why save with bentoml.sklearn instead of bentoml.xgboost?

When a model is logged via mlflow.sklearn.log_model(), it is stored as a scikit-learn pickle. Loading it back with mlflow.sklearn.load_model() returns an XGBRegressor object with its full sklearn wrapper intact. Saving this into BentoML with bentoml.sklearn.save_model() preserves the _estimator_type attribute that XGBoost requires to load correctly inside the container.
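
The mechanics can be shown with a toy stand-in, assuming (as the paragraph above describes) that both stores pickle the estimator: a plain pickle round-trip keeps class attributes like _estimator_type intact.

```python
import pickle

# Toy class standing in for the sklearn-wrapped XGBRegressor;
# real scikit-learn estimators carry _estimator_type the same way.
class ToyRegressor:
    _estimator_type = "regressor"

    def predict(self, X):
        return [0.0 for _ in X]

# pickling and unpickling preserves the sklearn-style attribute
restored = pickle.loads(pickle.dumps(ToyRegressor()))
```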


Setup

Prerequisites

  • Python 3.11
  • conda or miniforge
  • Docker Desktop
  • kubectl
  • Azure CLI (brew install azure-cli)

Install dependencies

conda create -n proj19 python=3.11 -y
conda activate proj19

pip install setuptools==69.5.1
pip install numpy==1.26.4
pip install pandas==2.2.0
pip install scikit-learn==1.4.0
pip install xgboost==2.0.3
pip install mlflow==2.11.0
pip install optuna==3.5.0
pip install rdkit==2023.9.5
pip install pydantic==1.10.13
pip install bentoml==1.3.4.post1

Running the Pipeline

Step 1 — Train the model

python train_logs.py

This will:

  • Download the Delaney logS dataset
  • Compute Morgan fingerprints + physicochemical descriptors
  • Perform Butina clustering split
  • Run 50 Optuna trials logged as nested MLflow runs
  • Register the best model in the MLflow Model Registry

View results in MLflow UI:

mlflow ui --backend-store-uri sqlite:///mlflow.db
# Open http://localhost:5000

Step 2 — Export model to BentoML

python save_model.py

Pulls the registered model from MLflow and saves it into BentoML's local model store as a scikit-learn pickle.

Verify:

bentoml models list

Step 3 — Test the API locally

bentoml serve service:svc
# Open http://localhost:3000

Test with curl:

curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"smiles": ["CC(=O)O", "c1ccccc1"]}'

Expected response:

{
  "smiles": ["CC(=O)O", "c1ccccc1"],
  "logS": [0.814, -1.72],
  "valid": [true, true]
}
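
The same call from Python, using only the standard library (the helper names and default URL here are illustrative, not part of the project):

```python
import json
import urllib.request

def build_payload(smiles):
    """Encode a list of SMILES strings as the JSON body the API expects."""
    return json.dumps({"smiles": smiles}).encode("utf-8")

def predict(smiles, url="http://localhost:3000/predict"):
    """POST SMILES to the running service and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=build_payload(smiles),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# predict(["CC(=O)O", "c1ccccc1"])  # requires the Step 3 service to be running
```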

Step 4 — Build the Bento bundle

bentoml build
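
This reads bentofile.yaml from the current directory. A hedged sketch of what that file typically contains (the service path, includes, and pinned packages below are assumptions based on this project's layout):

```yaml
service: "service:svc"        # import path of the BentoML Service object
include:
  - "service.py"
python:
  packages:                   # pinned to match requirements.txt
    - xgboost==2.0.3
    - scikit-learn==1.4.0
    - rdkit==2023.9.5
    - pydantic==1.10.13
docker:
  python_version: "3.11"
```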

Step 5 — Export and containerize

# Export bento
bentoml export logs-predictor-bento:latest ./bento_export.tar
mkdir bento_export
tar -xf bento_export.tar -C bento_export

# Fix Dockerfile for newer uv installer
sed -i '' 's/.cargo\/bin\/uv/.local\/bin\/uv/' bento_export/env/docker/Dockerfile

# Build multi-platform Docker image (arm64 for Mac, amd64 for Azure)
cd bento_export
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t your-dockerhub-username/logs-predictor:v1 \
  -f env/docker/Dockerfile \
  --push \
  --progress=plain \
  --no-cache \
  .

Step 6 — Deploy to Kubernetes

Update deployment.yaml with your Docker Hub image tag, then:

# Local (Docker Desktop)
kubectl apply -f deployment.yaml
kubectl port-forward svc/logs-predictor-svc 3000:80

# Azure AKS
az aks get-credentials --resource-group logs-predictor-rg --name logs-predictor-aks
kubectl apply -f deployment.yaml
kubectl get svc logs-predictor-svc  # get public IP
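
A hedged sketch of what deployment.yaml plausibly contains, inferred from the commands above (replica count, labels, and image tag are assumptions; the Service maps port 80 to BentoML's default port 3000):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logs-predictor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: logs-predictor
  template:
    metadata:
      labels:
        app: logs-predictor
    spec:
      containers:
        - name: logs-predictor
          image: your-dockerhub-username/logs-predictor:v1
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: logs-predictor-svc
spec:
  type: LoadBalancer
  selector:
    app: logs-predictor
  ports:
    - port: 80
      targetPort: 3000
```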

Azure Deployment

Create AKS cluster

# Login
az login

# Create resource group
az group create --name logs-predictor-rg --location ukwest

# Register AKS provider
az provider register --namespace Microsoft.ContainerService

# Create cluster
az aks create \
  --resource-group logs-predictor-rg \
  --name logs-predictor-aks \
  --node-count 2 \
  --node-vm-size Standard_B2s_v2 \
  --generate-ssh-keys

# Connect kubectl to Azure
az aks get-credentials \
  --resource-group logs-predictor-rg \
  --name logs-predictor-aks

# Deploy
kubectl apply -f deployment.yaml

# Get public IP
kubectl get svc logs-predictor-svc

Cost management

# Stop cluster overnight (stops VM charges)
az aks stop --resource-group logs-predictor-rg --name logs-predictor-aks

# Restart next day
az aks start --resource-group logs-predictor-rg --name logs-predictor-aks

# Delete everything (stops all charges)
az group delete --name logs-predictor-rg --yes

Frontend

Open index.html in your browser. Enter your API endpoint (local or Azure public IP) in the URL field, paste SMILES strings (one per line), and click Run Prediction.

Features:

  • Multi-SMILES input
  • Solubility classification (Soluble / Moderate / Poor)
  • Bar chart visualisation of results
  • Export to CSV
  • Property tabs for future ADMET models (logP, hERG, BBB, Caco-2)
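
The classification step maps a predicted logS to a band; a minimal sketch (the exact cutoffs used in index.html are assumptions here, logS in log mol/L):

```python
def classify_solubility(logS):
    """Bucket a predicted logS value into a readable solubility band."""
    if logS >= -2:
        return "Soluble"
    if logS >= -4:
        return "Moderate"
    return "Poor"
```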

MLflow Structure

Experiment: logS-xgboost-butina
└── optuna-butina-split (parent run)
    ├── trial_0    val_rmse=1.20
    ├── trial_1    val_rmse=0.94
    ├── ...
    └── trial_N    val_rmse=0.84  ← best
        → registered as "logS-predictor v1" in Model Registry

MLflow database tables:

  • experiments — experiment names and IDs
  • runs — each training run
  • metrics — RMSE, MAE, R² per run
  • params — hyperparameters per run
  • model_versions — registered models

Featurization

Each molecule is converted to a 2054-dimensional feature vector:

  • Morgan fingerprints (2048 bits, radius=2, ECFP4 equivalent) — encodes local chemical environment
  • Physicochemical descriptors (6 features):
    • Molecular weight
    • LogP (lipophilicity)
    • H-bond donors
    • H-bond acceptors
    • Topological polar surface area (TPSA)
    • Number of rotatable bonds
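
The layout can be sketched without RDKit; in the real pipeline the on-bits would come from RDKit's AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048) and the six descriptors from rdkit.Chem.Descriptors (the toy values below are illustrative):

```python
NBITS = 2048  # Morgan fingerprint length

def assemble_features(on_bits, descriptors):
    """Concatenate a 2048-bit fingerprint with 6 physicochemical descriptors."""
    assert len(descriptors) == 6
    fp = [0.0] * NBITS
    for b in on_bits:
        fp[b] = 1.0
    return fp + list(descriptors)

# toy stand-ins for MW, LogP, HBD, HBA, TPSA, rotatable bonds
vec = assemble_features({10, 512, 1999}, [60.05, -0.17, 1, 2, 37.3, 0])
```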

.gitignore

Add these to your .gitignore:

__pycache__/
*.pyc
mlflow.db
mlruns/
bento_export/
bento_export.tar

Dependencies

xgboost==2.0.3
scikit-learn==1.4.0
numpy==1.26.4
pandas==2.2.0
mlflow==2.11.0
optuna==3.5.0
rdkit==2023.9.5
pydantic==1.10.13
bentoml==1.3.4.post1
setuptools==69.5.1

Lessons Learned

  • Butina split is essential for honest evaluation — random splits can inflate R² by 0.05-0.15 on molecular datasets
  • BentoML version compatibility matters — v1.2, v1.3, and v1.4 have breaking API changes; pin your version
  • Save model format matters — mlflow.sklearn.log_model + bentoml.sklearn.save_model preserves the sklearn wrapper; mlflow.xgboost.log_model does not
  • Multi-platform Docker builds are essential when developing on Apple Silicon (arm64) and deploying to cloud servers (amd64)
  • Local Kubernetes vs Azure AKS — local is useful for testing but resource-heavy; Azure gives a real public IP and proper scaling

About

End-to-end MLOps pipeline: XGBoost logS solubility model trained on the Delaney dataset, tracked with MLflow, packaged with BentoML, containerized with Docker, orchestrated with Kubernetes, and deployed on Azure AKS. Includes a professional web frontend for predictions.
