13 end-to-end data-science projects on real public data, 5 long-form bilingual case studies (EN/FR), one Streamlit twin.
Domains: forecasting · GLMs / actuarial pricing · survival analysis · hierarchical reconciliation · stochastic optimization / RL · experimental design.
Live site: gabayae.github.io/data-portfolio · Version française
Each project follows the same pipeline: business question → data & EDA → modeling → validation → deployment → business outcome.
| # | Project | Domain | Real data source | Headline result |
|---|---|---|---|---|
| 01 | Health Supply-Chain Demand Forecasting | Health logistics | USAID PEPFAR SCMS | SARIMA WAPE 0.94 |
| 02 | Hourly Load Forecasting — PJM (case study) | Energy | PJM Hourly Consumption | GBM MAPE 6.2% (UC PI 99%) |
| 03 | Insurance Claims Frequency & Severity | Insurance | freMTPL2 freq + sev | Tweedie Gini 0.310 |
| 04 | Hierarchical Retail Demand Forecasting | Retail | M5 sample | MinT-OLS reconciliation |
| 05 | Stochastic Optimization for Resource Allocation | Health / Ops | Kenya KMPDC + SHA | Q-learning +122% vs manual |
| 06 | River Flow Forecasting — Lake Kariba | Hydropower | Kariba Reservoir | GBM RMSE 7 cm |
| 07 | Solar Energy Forecasting — Nairobi | Energy | NASA POWER API | GBM MAPE 9.4% |
| 08 | Customer Survival Analysis — MTN Nigeria | Telecom | MTN Nigeria Churn | Cox PH + Weibull AFT |
| 09 | Flight Demand & Price — Southern Africa | Aviation | SA Flight Prices | SARIMA MAPE 2.0% |
| 10 | Property Valuation — Lagos | Real estate | Lagos Housing | GBM R² 0.57 |
| 11 | Geospatial Farm-Output Forecasting | Agriculture | African Farm Households | GBM R² 0.66 |
| 12 | Churn Classification — MTN Nigeria | Telecom | MTN Nigeria Churn | XGBoost AUC 0.71 |
| 13 | A/B Test Framework | Marketing | Marketing Campaign A/B | ANOVA F=21.95, p < 1e-9 |
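Several headline results above are reported as MAPE or WAPE. The notebooks' exact implementations are not reproduced here, so the following is only a minimal sketch of the usual textbook definitions. The function names and the choice to skip zero actuals in MAPE are assumptions for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error as a fraction (0.062 means 6.2%).

    Assumption: observations with a zero actual are skipped, because the
    percentage error is undefined there.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual is zero")
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def wape(actual, forecast):
    """Weighted absolute percentage error: total absolute error / total absolute actual."""
    total_actual = sum(abs(a) for a in actual)
    if total_actual == 0:
        raise ValueError("WAPE is undefined when all actuals are zero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / total_actual


# Toy numbers, not taken from any project in this repository.
actual = [100, 200, 0, 50]
forecast = [110, 180, 5, 50]
mape_value = mape(actual, forecast)
wape_value = wape(actual, forecast)
```

WAPE stays defined when individual actuals are zero, which is one common reason to prefer it for intermittent demand series.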
Five projects have a long-form narrative deep-dive — business context, methodology, results, trade-offs, deployment sketch. Each is fully bilingual (EN/FR) with reading time, prev/next nav, breadcrumb schema, and a per-case-study OG card.
| Case study | Family | Honest finding |
|---|---|---|
| PJM hourly load · FR | Time-series | UC state-space underperforms SARIMA on point error (19.3% vs 14.5% MAPE), but UC's 99% PI coverage is what procurement actually uses. |
| Lake Kariba river flow · FR | Time-series + exog | 7 cm RMSE on a 7 m operational band; adding turbine discharge as an exogenous regressor gives a 2× point-error improvement. |
| Nairobi solar irradiance · FR | Time-series | Monthly climatology (12.3% MAPE) beats SARIMA (13.8%) — at this latitude the seasonal envelope is most of the predictability. |
| freMTPL2 pricing · FR | GLMs / actuarial | Tweedie wins on segmentation (Gini 0.310); Poisson + Gamma wins on top-decile lift (2.66×). The choice is actuarial, not technical. |
| Kenya mobile clinics · FR | Stochastic optimization | Q-learning +122% over manual, but the constraint formulation matters more than the algorithm — capped LP with explicit equity at +39% is more defensible. |
Browse all five at /case-studies/ (or /fr/case-studies/).
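The PJM finding above contrasts point error with prediction-interval (PI) coverage. As a rough sketch of what empirical coverage measures, assuming the intervals are available as lower and upper bounds per time step (the function name and toy values are hypothetical):

```python
def interval_coverage(actual, lower, upper):
    """Share of observations that fall inside their prediction interval."""
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("actual, lower and upper must have the same length")
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)


# Toy numbers for illustration only; three of the four points are covered.
actual = [10, 12, 15, 11]
lower = [8, 11, 16, 9]
upper = [12, 14, 18, 13]
coverage = interval_coverage(actual, lower, upper)  # 0.75
```

A model can lose on MAPE and still be the better choice if its nominal 99% intervals actually cover close to 99% of outcomes.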
- Languages: Python, SQL
- Time-series: SARIMA, UnobservedComponents (state-space, Kalman filter), Prophet, ETS / Holt-Winters, ARIMA
- ML: scikit-learn (GradientBoostingRegressor, RandomForest, LogisticRegression), XGBoost, LightGBM
- GLM / generalized: statsmodels (Poisson, Gamma, Tweedie, Cox PH, Weibull AFT, OLS)
- Survival: lifelines (KaplanMeier, CoxPHFitter, WeibullAFTFitter)
- Hierarchical reconciliation: MinT-OLS
- Stochastic optimization: Markov Decision Processes, Q-learning, linear programming (scipy.optimize.linprog)
- Experimentation: ANOVA, Welch t-tests with Bonferroni correction, Bayesian A/B (posterior simulation)
- Data engineering: pandas, NumPy, Kaggle CLI, NASA POWER API
- Visualization: matplotlib, seaborn
- Energy (2): PJM hourly load, Nairobi solar
- Health logistics & operations (2): USAID supply chain, Kenya facility scheduling
- Telecom (2): MTN customer survival, MTN churn classification
- Time-series & forecasting infrastructure (2): hierarchical reconciliation, river flow
- Insurance (1), Aviation (1), Real estate (1), Agriculture (1), Experimentation (1)
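Q-learning appears in the methods listed above. For readers unfamiliar with it, here is a minimal sketch of a single tabular Q-learning update. The state names, actions, reward, and hyperparameters are hypothetical and not taken from the Kenya project.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state table for illustration.
q_table = {
    "s0": {"stay": 0.0, "move": 0.0},
    "s1": {"stay": 1.0, "move": 0.0},
}
new_value = q_update(q_table, "s0", "move", reward=2.0, next_state="s1")
```

Here the target is 2.0 + 0.9 × 1.0 = 2.9, so the entry for ("s0", "move") moves a tenth of the way from 0 toward 2.9.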
git clone https://github.com/gabayae/data-portfolio.git
cd data-portfolio
make install # install deps from pyproject.toml
make data # download all datasets (Kaggle CLI required for some)
make notebooks # execute every notebook end-to-end
For a single project:
cd <NN-project-name>
pip install -r requirements.txt
python download_data.py
jupyter nbconvert --to notebook --execute notebook.ipynb
Each notebook is self-contained: open it on GitHub and the rendered plots and tables are visible without running anything.
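If make is unavailable, the per-project command can be looped from Python. This is only a sketch, assuming every project folder keeps a notebook.ipynb at its top level as in the single-project steps. It is not a description of what the Makefile does, and the function names are hypothetical.

```python
from pathlib import Path
import subprocess


def notebook_commands(repo_root):
    """Build one nbconvert command per project folder containing notebook.ipynb."""
    root = Path(repo_root)
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", str(nb)]
        for nb in sorted(root.glob("*/notebook.ipynb"))
    ]


def run_all(repo_root, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in notebook_commands(repo_root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first notebook that fails to execute.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(".", dry_run=True)
```

The dry run is the default so the commands can be inspected before anything is executed.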
portfolio_app.py is an interactive twin of index.html — same content, same color palette, with project filtering and clickable cards. Run it locally:
pip install -r requirements-app.txt
streamlit run portfolio_app.py
To deploy on Streamlit Community Cloud:
- share.streamlit.io → New app
- Repository: gabayae/data-portfolio · Branch: main · Main file path: portfolio_app.py
- Advanced settings → Python requirements file: requirements-app.txt
- Deploy. Free, public, takes ~1 minute.
portfolio_config.py is the single source of truth for project metadata — edit there and both surfaces update.
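The contents of portfolio_config.py are not shown here, so the following is a hypothetical sketch of the single-source-of-truth idea: one list of records, rendered by two independent functions. The Project class, its field names, and both render functions are assumptions for illustration. Only the sample titles and results are copied from the project table.

```python
from dataclasses import dataclass
import html


@dataclass(frozen=True)
class Project:
    # Hypothetical schema; the real config may use different fields.
    number: str
    title: str
    domain: str
    headline: str


PROJECTS = [
    Project("03", "Insurance Claims Frequency & Severity", "Insurance", "Tweedie Gini 0.310"),
    Project("13", "A/B Test Framework", "Marketing", "ANOVA F=21.95, p < 1e-9"),
]


def to_markdown_row(p):
    """Render one project as a markdown table row."""
    return f"| {p.number} | {p.title} | {p.domain} | {p.headline} |"


def to_html_card(p):
    """Render one project as an HTML card, escaping special characters."""
    return (
        f'<div class="card"><h3>{html.escape(p.title)}</h3>'
        f"<p>{html.escape(p.headline)}</p></div>"
    )


markdown_rows = [to_markdown_row(p) for p in PROJECTS]
html_cards = [to_html_card(p) for p in PROJECTS]
```

Because both outputs are derived from PROJECTS, editing a record once changes every surface that renders it.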
Drop a square headshot at profile.jpg in the repo root and the hero avatar on index.html automatically swaps the "YG" monogram for the photo (no code change — handled by onerror="this.remove()" on the <img>). Recommended ≥240×240 px; JPG / PNG / WebP all work.
- Email: yaeulrich.gaba@gmail.com
- LinkedIn: linkedin.com/in/gabayae
- Personal site: gabayae.github.io
- Google Scholar: profile (h-index 12)
- Book (No Starch Press, 2024): The Shape of Data
MIT — see LICENSE.