This is a personal MLOps project based on a Kaggle dataset for stroke prediction.
Feel free to ⭐ and clone this repo 😉
The project has been structured with the following folders and files:
- data: raw and clean data
- src: source code. It is divided into:
  - Notebooks with EDA, Baseline Model and AWS Pipelines incl. unit testing
  - code_scripts: processing, training, evaluation, docker container, serving and lambda
- requirements.txt: project requirements
The dataset was obtained from Kaggle and contains 5110 rows and 10 columns for predicting strokes. To prepare the data for modelling, an Exploratory Data Analysis was conducted, which showed that the dataset is very imbalanced (95% no stroke, 5% stroke). For modelling, the categorical features were encoded and XGBoost was used as the model. Because of the imbalance, threshold-moving was applied: the best threshold on the ROC curve was selected and used for the predictions. The learning rate was tuned in order to find the best value for the deployed model.
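As a rough illustration of threshold-moving on an imbalanced target, the sketch below picks the cut-off that maximises Youden's J (TPR minus FPR) along the ROC curve. The selection criterion, the function names `roc_points` and `best_threshold`, and the toy labels and scores are all assumptions made for this example; the notebooks in this repo may choose the threshold differently.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) for every distinct score used as a cut-off."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is >= the cut-off.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((thr, fp / negatives, tp / positives))
    return points


def best_threshold(y_true, scores):
    """Pick the cut-off with the largest Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(y_true, scores), key=lambda p: p[2] - p[1])[0]


# Toy data, made up for illustration: 8 "no stroke" cases and 2 "stroke" cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.35, 0.30, 0.60]

threshold = best_threshold(y_true, scores)
y_pred = [int(s >= threshold) for s in scores]
print(threshold, y_pred)
```

On this toy data the default cut-off of 0.5 would miss one of the two positive cases, while the moved threshold catches both at the cost of one false positive.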
All pipelines were deployed on AWS SageMaker, as well as the Model Registry and Endpoints. The following pipelines were created:
- ✅ Preprocessing
- ✅ Training
- ✅ Tuning
- ✅ Evaluation
- ✅ Model Registry
- ✅ Model Conditional Registry
- ✅ Deployment
Additionally, the experiments were tracked on Comet ML.
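The idea behind the Model Conditional Registry pipeline listed above can be sketched in plain Python: a model is only registered when its evaluation metric clears a minimum value. This is not the SageMaker API. The function names, the `roc_auc` metric key, the 0.80 cut-off, the model names and the metric values are all hypothetical and only serve the illustration.

```python
def should_register(metrics, metric_name="roc_auc", min_value=0.80):
    """Return True when the evaluation metric reaches the (hypothetical) minimum."""
    # A missing metric is treated as a failed check.
    return metrics.get(metric_name, float("-inf")) >= min_value


def run_conditional_registry(metrics, registry, model_name):
    """Append the model to the registry only if the condition holds."""
    if should_register(metrics):
        registry.append(model_name)
        return "registered"
    return "skipped"


# Stand-in for a model registry; the metric values below are made up.
registry = []
status_good = run_conditional_registry({"roc_auc": 0.84}, registry, "stroke-xgb-a")
status_bad = run_conditional_registry({"roc_auc": 0.71}, registry, "stroke-xgb-b")
print(status_good, status_bad, registry)
```

In a real pipeline the condition would be evaluated on the report produced by the Evaluation step, so that the registration and deployment steps only run for models that pass.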