A reusable, end-to-end ML pipeline for business forecasting — built to reduce analysis time from days to hours.
Most businesses that want to use machine learning for forecasting face the same bottleneck: data scientists spend 70%+ of their time on repetitive setup tasks — data cleaning, encoding, model selection, evaluation, reporting — before any real insight is produced.
The core question: Can we build a modular, plug-and-play predictive framework that takes any structured business dataset and produces a tuned, evaluated forecasting model — with minimal manual effort?
- Build a fully reusable ML pipeline that handles the entire modelling lifecycle
- Support both regression (price, revenue, demand) and classification (churn, fraud, approval) tasks
- Automate preprocessing, model selection, hyperparameter tuning, and evaluation
- Generate a structured business-ready output report at the end of every run
- Apply the framework to 3 real business use cases to validate generalisability
Raw Business Data
│
▼
┌─────────────────────┐
│ 1. Data Ingestion │ ← CSV, Excel, SQL, API
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 2. Auto-Cleaning │ ← Missing values, outliers, types
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 3. Feature Engine │ ← Encoding, scaling, feature selection
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 4. Model Zoo │ ← LR, RF, XGBoost, LGBM, SVR/SVC
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 5. AutoTune │ ← Optuna / GridSearchCV
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 6. Evaluation │ ← R², RMSE, AUC, F1, confusion matrix
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 7. Report Generator │ ← PDF/HTML business report, charts
└─────────────────────┘
| # | Business Problem | Task Type | Best Model | Performance |
|---|---|---|---|---|
| 1 | Sales Revenue Forecasting | Regression | XGBoost | R² = 0.89 |
| 2 | Customer Churn Prediction | Classification | LightGBM | AUC = 0.91 |
| 3 | Loan Default Risk Scoring | Classification | Random Forest | F1 = 0.87 |
- Accepts CSV, Excel, and SQL input
- Auto-detects column types (numeric, categorical, datetime, text)
- Generates an instant data quality report: completeness %, cardinality, skewness
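
The type detection and quality report described above could look roughly like the following. This is a minimal sketch, not the framework's actual `ingest.py`: the function name `profile_columns`, the `text_min_unique_ratio` threshold, and the rule that mostly-unique string columns count as free text are all assumptions made for illustration.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame, text_min_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return one row per column: inferred type, completeness %, cardinality, skewness."""
    rows = []
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            kind = "numeric"
        elif pd.api.types.is_datetime64_any_dtype(col):
            kind = "datetime"
        else:
            # Heuristic (assumption): mostly-unique strings are free text,
            # everything else is treated as categorical.
            unique_ratio = col.nunique() / max(col.notna().sum(), 1)
            kind = "text" if unique_ratio > text_min_unique_ratio else "categorical"
        rows.append({
            "column": name,
            "type": kind,
            "completeness_pct": round(100 * col.notna().mean(), 1),
            "cardinality": col.nunique(),
            "skewness": round(col.skew(), 2) if kind == "numeric" else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Tiny illustrative dataset (made up for this example).
df = pd.DataFrame({
    "revenue": [100.0, 120.0, None, 400.0],
    "region": ["north", "south", "north", "south"],
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "notes": ["late delivery", "repeat buyer", "new account", "bulk order"],
})
report = profile_columns(df)
print(report)
```

A real implementation would likely need extra rules, for example parsing date-like strings or treating boolean and integer ID columns separately.
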
- Imputes missing values: median for numeric, mode for categorical
- Detects outliers and caps them at the 1.5×IQR fences (Q1 − 1.5×IQR and Q3 + 1.5×IQR)
- Drops near-zero variance features automatically
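
The three cleaning rules above can be sketched in a few lines of pandas. The function name `auto_clean` and the `min_variance` cut-off are hypothetical; the README does not say what threshold the framework uses for "near-zero variance", so treat the value here as a placeholder.

```python
import pandas as pd


def auto_clean(df: pd.DataFrame, min_variance: float = 1e-8) -> pd.DataFrame:
    """Impute, cap outliers at the 1.5×IQR fences, and drop near-constant numeric columns."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(numeric_cols)

    for name in numeric_cols:
        col = out[name].fillna(out[name].median())  # median imputation
        q1, q3 = col.quantile([0.25, 0.75])
        iqr = q3 - q1
        # Cap (winsorise) values outside the Tukey fences instead of dropping rows.
        out[name] = col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    for name in other_cols:
        mode = out[name].mode(dropna=True)
        if not mode.empty:
            out[name] = out[name].fillna(mode.iloc[0])  # mode imputation

    # Assumed rule: a numeric column is "near-zero variance" if var <= min_variance.
    variances = out[numeric_cols].var()
    return out.drop(columns=variances[variances <= min_variance].index)


# Illustrative data: one missing value per column, one extreme outlier, one constant column.
raw = pd.DataFrame({
    "amount": [10.0, 12.0, 11.0, 13.0, None, 500.0],
    "segment": ["a", "b", "a", None, "a", "b"],
    "constant": [1.0] * 6,
})
cleaned = auto_clean(raw)
print(cleaned)
```

Capping rather than deleting keeps the row count stable, which matters when the target column lives in the same frame.
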
- One-Hot Encoding for low-cardinality categoricals
- Target Encoding for high-cardinality categoricals
- Standard scaling / MinMax scaling based on model family
- Automated feature importance pre-screening (removes noise features)
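
The cardinality-based choice between One-Hot and Target Encoding might be implemented along these lines. This is a sketch under stated assumptions: `encode_categoricals`, `max_onehot_levels`, and `smoothing` are illustrative names and defaults, the frame is assumed to hold only numeric and categorical columns, and the target encoding shown is a simple smoothed mean. In practice the encoding should be fitted on training folds only to avoid target leakage. Scaling and importance pre-screening are left out to keep the example short.

```python
import pandas as pd


def encode_categoricals(X: pd.DataFrame, y: pd.Series,
                        max_onehot_levels: int = 10,
                        smoothing: float = 10.0) -> pd.DataFrame:
    """One-hot encode low-cardinality columns, target-encode the rest."""
    encoded = X.copy()
    global_mean = y.mean()
    for name in X.select_dtypes(exclude="number").columns:
        if X[name].nunique() <= max_onehot_levels:
            dummies = pd.get_dummies(X[name], prefix=name, dtype=float)
            encoded = pd.concat([encoded.drop(columns=name), dummies], axis=1)
        else:
            # Smoothed target mean: shrink rare categories towards the global mean.
            stats = y.groupby(X[name]).agg(["mean", "count"])
            smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
                stats["count"] + smoothing
            )
            encoded[name] = X[name].map(smoothed).fillna(global_mean)
    return encoded


# Illustrative data (made up): "plan" has 3 levels, "postcode" has 5.
X = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "team", "pro", "basic"],
    "postcode": ["A1", "B2", "C3", "D4", "A1", "E5"],
    "tenure": [1, 5, 2, 8, 3, 4],
})
y = pd.Series([1, 0, 1, 0, 0, 1])
encoded = encode_categoricals(X, y, max_onehot_levels=3, smoothing=1.0)
print(encoded)
```
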
Trains and benchmarks 5 models simultaneously:
- Linear/Logistic Regression (baseline)
- Random Forest
- XGBoost
- LightGBM
- Support Vector Machine
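
A benchmark loop over the model zoo could be as simple as the sketch below. To keep it self-contained it uses only Scikit-Learn estimators on synthetic data; the assumption is that `xgboost.XGBClassifier` and `lightgbm.LGBMClassifier` would be added to the same dictionary in the real framework. The dictionary name `model_zoo` and the choice of ROC AUC as the ranking metric are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a business dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Scale-sensitive models get a scaler; tree ensembles do not need one.
model_zoo = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svc": make_pipeline(StandardScaler(), SVC()),
    # Assumed extension points: XGBClassifier() and LGBMClassifier() would go here.
}

# Mean 5-fold cross-validated ROC AUC per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in model_zoo.items()
}

leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = leaderboard[0]
for name, score in leaderboard:
    print(f"{name:<22} AUC = {score:.3f}")
```

Wrapping the scaler and model in one pipeline means scaling is refitted inside each fold, so the validation fold never influences the scaler.
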
- Uses Optuna for Bayesian optimisation, which typically reaches good hyperparameters in far fewer trials than an exhaustive grid search
- 50-trial search on the top-performing model
- 5-fold cross-validation, using StratifiedKFold for classification tasks
- Full metrics table: RMSE, MAE, R² (regression) or AUC, Precision, Recall, F1 (classification)
- SHAP values for feature-level explainability ("why did the model predict this?")
- Calibration curves for classification models
- Residual analysis plots
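
The metrics table listed above can be produced with `sklearn.metrics`. The helper names `regression_metrics` and `classification_metrics` and the 0.5 decision threshold are assumptions for this sketch; SHAP values, calibration curves, and residual plots are not shown.

```python
import numpy as np
from sklearn.metrics import (
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)


def regression_metrics(y_true, y_pred) -> dict:
    """RMSE, MAE and R² for a regression run."""
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


def classification_metrics(y_true, y_score, threshold: float = 0.5) -> dict:
    """AUC on the raw scores; precision, recall and F1 on thresholded labels."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "auc": float(roc_auc_score(y_true, y_score)),
        "precision": float(precision_score(y_true, y_pred)),
        "recall": float(recall_score(y_true, y_pred)),
        "f1": float(f1_score(y_true, y_pred)),
    }


# Toy inputs chosen so the numbers are easy to check by hand.
reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
clf = classification_metrics([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
print(reg)  # rmse ≈ 0.707, mae = 0.5, r2 = 0.9
print(clf)  # auc = 0.75, precision = recall = f1 = 0.5
```

AUC is computed from the predicted scores rather than the thresholded labels, which is why the two are kept separate in `classification_metrics`.
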
- Auto-generates an HTML/PDF report summarising: data profile, model performance, feature importance, top predictions
- Designed to be shared directly with non-technical stakeholders
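
Rendering the HTML report with Jinja2 could follow the pattern below. The template layout and the variable names (`title`, `best_model`, `metrics`) are assumptions, not the framework's actual `report.py`; the sample values are taken from the churn row of the use-case table.

```python
from jinja2 import Template

# Inline template for brevity; a real project would more likely load it from a file.
REPORT_TEMPLATE = Template("""
<html>
  <body>
    <h1>{{ title }}</h1>
    <p>Best model: <strong>{{ best_model }}</strong></p>
    <table>
      <tr><th>Metric</th><th>Value</th></tr>
      {% for name, value in metrics.items() %}
      <tr><td>{{ name }}</td><td>{{ "%.2f"|format(value) }}</td></tr>
      {% endfor %}
    </table>
  </body>
</html>
""")

html = REPORT_TEMPLATE.render(
    title="Customer Churn Prediction",
    best_model="LightGBM",
    metrics={"AUC": 0.91},
)
print(html)

# PDF export (not run here) would pass the same string to WeasyPrint:
#   from weasyprint import HTML
#   HTML(string=html).write_pdf("report.pdf")
```

Note that `Template` does not auto-escape by default, so any user-supplied text (column names, free-text labels) should be escaped before it reaches the template.
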
| Category | Tools |
|---|---|
| Language | Python 3.10 |
| Data | Pandas, NumPy |
| Machine Learning | Scikit-Learn, XGBoost, LightGBM |
| Tuning | Optuna |
| Explainability | SHAP |
| Visualisation | Plotly, Matplotlib, Seaborn |
| Reporting | Jinja2 + WeasyPrint (HTML → PDF) |
| Deployment | Streamlit (interactive demo) |
- XGBoost wins most regression tasks when features include mixed types and non-linear relationships, but LightGBM outperforms on large datasets (>100k rows) due to speed.
- Feature engineering > model choice. In the churn use case, adding two engineered features (recency × frequency interaction, account age buckets) improved AUC from 0.83 to 0.91, more than switching models did.
- SHAP explanations changed stakeholder decisions. In the loan default use case, SHAP waterfall charts revealed that "number of recent credit enquiries" was the #1 driver, overturning assumptions that income was dominant.
- AutoTune delivers a ~8–12% performance gain over default hyperparameters across all tested models, a meaningful improvement for production use.
- Model Comparison Bar Chart — All models benchmarked side-by-side
- SHAP Summary Plot — Global feature importance with direction
- SHAP Waterfall Chart — Single prediction explained
- Residual Plot — Error distribution analysis
- Confusion Matrix — For classification use cases
- ROC Curve — AUC comparison across models
📁 All visualisations are in `/visuals/`; an interactive version is available via the Streamlit demo.
| Metric | Before Framework | After Framework |
|---|---|---|
| Time to first model | 2–3 days | ~2 hours |
| Models evaluated per run | 1–2 | 5 (automated) |
| Stakeholder report | Manual (Word/PPT) | Auto-generated |
| Reproducibility | Low | High (logged runs) |
This type of framework is standard at firms like McKinsey, JPMorgan, and Grab — where reusable ML pipelines are a core productivity asset.
predictive-modelling-framework/
│
├── framework/
│ ├── ingest.py # Data loading module
│ ├── clean.py # Auto-cleaning module
│ ├── features.py # Feature engineering
│ ├── models.py # Model zoo + training
│ ├── tune.py # Optuna hyperparameter tuning
│ ├── evaluate.py # Metrics + SHAP
│ └── report.py # Report generator
│
├── use_cases/
│ ├── 01_sales_forecasting/
│ ├── 02_churn_prediction/
│ └── 03_loan_default/
│
├── visuals/
├── reports/ # Auto-generated outputs
├── app.py # Streamlit demo
├── requirements.txt
└── README.md
git clone https://github.com/yourusername/predictive-modelling-framework.git
cd predictive-modelling-framework
pip install -r requirements.txt
# Run a specific use case
python framework/run_pipeline.py --input use_cases/01_sales_forecasting/data.csv --target revenue --task regression
# Launch interactive demo
streamlit run app.py

- Time-Series Module: Add Prophet + LSTM for sequential forecasting tasks
- AutoML Integration: Benchmark against H2O AutoML
- MLflow Tracking: Log all experiments for full reproducibility
- Drift Detection: Monitor model performance degradation on new data
- REST API: Expose the pipeline as a `/predict` endpoint via FastAPI
This project is licensed under the MIT License.