Machine Learning Zoomcamp Mid-term Project
Image credits: Forbes article
- Problem statement
- Directory layout
- Setup
- Running the app with Docker (Recommended)
- Running the app manually
- Notebooks
- Application running on Cloud
- Checkpoints
- References
This project aims to support individuals seeking approximate values for their medical insurance costs. Whether they are moving to a different state within the United States or are foreigners who recently relocated to the country, they need an algorithm that predicts these costs, so that the end user can determine whether or not they can afford the resulting charges.

To achieve this goal, the "Health Insurance Premium Prediction Database for the United States" (see References) was utilized; it comprises information on a range of elements that impact healthcare expenses and insurance premiums in the United States. The database covers ten distinct variables: age, gender, body mass index (BMI), number of dependents, smoking habits, geographical location, income level, educational attainment, profession, and the nature of the insurance plan. All of these features were analyzed to extract useful insights and observe patterns among them. As a result, a LinearRegression model was trained, validated, and deployed in real time to provide predictions for medical insurance charges.
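The modeling approach described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the sample rows, the chosen feature subset, and the column names are illustrative stand-ins for the Kaggle dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample covering a subset of the dataset's ten variables
df = pd.DataFrame({
    "age": [25, 47, 33, 52],
    "bmi": [22.1, 30.5, 27.8, 25.0],
    "smoker": ["no", "yes", "no", "yes"],
    "region": ["northeast", "southwest", "northeast", "southeast"],
    "charges": [2200.0, 41000.0, 4100.0, 38500.0],
})

# One-hot encode the categorical columns, pass numeric columns through,
# then fit a plain LinearRegression on the encoded matrix
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker", "region"])],
        remainder="passthrough",
    )),
    ("model", LinearRegression()),
])
pipeline.fit(df.drop(columns="charges"), df["charges"])

# Predict the charges for one individual
prediction = pipeline.predict(df.drop(columns="charges").head(1))
```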
```
.
├── .github                 # CI/CD workflows
├── backend_app/            # Entrypoint for the application
|   ├── config/             # Config files
|   ├── ml_workflow/        # Classes related to machine learning processes
|   ├── schemas/            # Classes used to model the application data
├── frontend_streamlit/     # Files to create the Streamlit UI application
├── images/                 # Assets
├── notebooks/              # Notebooks used to explore data and select the best model
├── .env.example            # Template to set environment variables
├── docker-compose.yaml     # To orchestrate containers locally
├── Dockerfile              # Docker image for the backend application
├── Makefile                # Commands to automate the applications
├── poetry.lock             # Requirements for development and production
├── pyproject.toml          # Project metadata and dependencies
└── README.md
```
- Rename `.env.example` to `.env` and set your Kaggle credentials in this file:
  - Sign in to your Kaggle account.
  - Go to https://www.kaggle.com/settings
  - Click on `Create new Token` to download the `kaggle.json` file.
  - Copy the `username` and `key` values and paste them into the respective `.env` variables.
- Install make:
  - For UNIX-based systems and Windows (WSL), you do not need to install make.
  - For Windows without WSL:
    - Install Chocolatey from here
    - Then, run `choco install make`.
Run `make build_services` to start the services for the first time, or `make up_services` to start them after the initial build:
- http://localhost:8501 (Streamlit UI)
- http://localhost:8080 (Backend service): not only starts a Uvicorn server, but also fetches the dataset from Kaggle and trains the model on application startup.
The output should look like this:
User interface designed using Streamlit to interact with backend endpoints:
Swagger documentation for FastAPI backend:
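Once the backend is up, its endpoints can also be called programmatically. The sketch below is a hedged example: the `/predict` path and the payload field names are assumptions for illustration; the real ones are defined by the backend's schemas and visible in the Swagger documentation.

```python
import json
import urllib.request

# Hypothetical payload; the actual field names are defined by the
# Pydantic schemas in backend_app/schemas/
sample = {
    "age": 35,
    "gender": "female",
    "bmi": 26.4,
    "children": 1,
    "smoker": "no",
    "region": "northeast",
}

def predict_charges(payload: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST the payload to the (assumed) /predict endpoint and return the JSON body."""
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```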
- Stop the services with `docker-compose down`
A virtual environment is needed to run the app manually. Run the following commands from the root project directory:
- `pip install poetry`
- `poetry shell`
- `poetry install`
- `make start_server`
- Go to http://localhost:8080 (Swagger doc)
- Open a new terminal.
- Run `deactivate` in case the backend service environment is still activated.
- `cd frontend_streamlit`
- `poetry shell`
- Make sure the environment is activated by running `poetry env info`
- `poetry install`
- In the same terminal, set the endpoint URL variable: `export ENDPOINT_URL=http://localhost:8080`
- `streamlit run app.py`
- Go to http://localhost:8501
Run the notebooks in the notebooks/ directory to conduct Exploratory Data Analysis and experiment with feature selection using the Feature-engine module, which is designed for exactly these purposes (see References for further information). Diverse experiments were carried out using Linear Regression, RandomForest, and XGBoost. The resulting features were persisted into a YAML file that also contains other global properties.
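Persisting the selected features to a YAML file can be sketched as below. The file name, feature list, and keys are illustrative assumptions, not the project's actual configuration file:

```python
import yaml

# Hypothetical list of features kept after the selection step in the notebooks
selected_features = ["age", "bmi", "smoker", "children", "region"]

# Global properties stored alongside the features (keys are assumptions)
config = {
    "features": selected_features,
    "target": "charges",
    "model": "LinearRegression",
}

# Write the configuration to a YAML file (hypothetical file name)
with open("selected_features.yml", "w") as fh:
    yaml.safe_dump(config, fh, sort_keys=False)
```

At runtime, the application can load this file back and use the `features` list to subset the incoming data before prediction.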
To reproduce the notebooks, you will need to follow steps 1 to 3 from the backend service (manual steps).
From VSCode
- Open the notebook and select the kernel interpreter from VSCode
From Jupyter Notebook:
- Run `jupyter notebook` in the terminal.
- Select the kernel.
The following picture, obtained from the model_selection.ipynb notebook, displays the error distribution of the Linear Regression model, which achieved the best performance.
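The kind of error distribution shown in that figure can be reproduced in spirit with synthetic numbers. Everything below is simulated for illustration, not the notebook's actual validation data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for validation targets and model predictions
y_val = rng.uniform(1_000, 40_000, size=500)
y_pred = y_val + rng.normal(0, 2_000, size=500)

# Residuals are the quantity plotted in the error-distribution figure;
# a roughly symmetric, centered histogram indicates an unbiased model
residuals = y_val - y_pred
rmse = float(np.sqrt(np.mean(residuals ** 2)))

# e.g. seaborn.histplot(residuals) would render the distribution
```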
The application has been deployed to the cloud using AWS Elastic Beanstalk; the frontend and backend were deployed separately using the `eb` command:
Deploy backend app
- In the root project directory:

  ```
  eb init
  ```

  Follow the prompts after entering the command, but make sure to pick `Docker running on 64bit Amazon Linux 2` for the Docker platform question.
- Then:

  ```
  eb create medical-insurance-backend-env --instance_type m5.large --envvars \
  KAGGLE_USERNAME=<kaggle_username>,\
  KAGGLE_KEY=<kaggle_key>,\
  N_SPLITS=4
  ```

  Replace `<kaggle_username>` and `<kaggle_key>` with your Kaggle credentials. Optionally, set the N_SPLITS variable to another integer value. Additionally, it is necessary to use a more robust EC2 instance, namely m5.large, because the training and validation of the model is carried out while the container is created and run.
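The N_SPLITS environment variable passed above controls the number of cross-validation folds. A minimal sketch of how the backend might consume it follows; the exact place where the project reads this variable is an assumption:

```python
import os
import numpy as np
from sklearn.model_selection import KFold

# Read the fold count from the environment, defaulting to the value
# used in the eb create command above
n_splits = int(os.environ.get("N_SPLITS", 4))

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)

X = np.arange(20).reshape(10, 2)          # toy stand-in for the training matrix
folds = list(kfold.split(X))              # (train_idx, val_idx) pairs per fold
```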
Deploy frontend app
- Navigate to the frontend application directory: `cd frontend_streamlit`
- `eb init` (same steps as the backend app)
- Then:

  ```
  eb create medical-insurance-charges-frontend-env --envvars ENDPOINT_URL=<endpoint_url>
  ```

  You must replace `<endpoint_url>` with the endpoint URL resulting from deploying the backend application, removing the trailing "/" character.
Application working
As a result, you will be able to see the applications running on the AWS cloud:
- Frontend: http://medical-insurance-charges-frontend-env.eba-gqxzgsm2.us-east-2.elasticbeanstalk.com/
- Backend: http://medical-insurance-backend-env.eba-fv2x9xjx.us-east-2.elasticbeanstalk.com/
Warning
After the mid-term deadline, these cloud services will no longer be accessible.
- Problem description
- EDA
- Model training
- Exporting notebook to script
- Reproducibility
- Model deployment
- Dependency and environment management
- Containerization (Docker with multi-stage)
- Cloud deployment
- Linter
- CI/CD workflow (Only to analyze the code with linter)
- Pipeline orchestration
- Unit tests
LinkedIn: https://www.linkedin.com/in/erick-calderin-5bb6963b/
e-mail: edcm.erick@gmail.com
Explore more of my work on Medium
I regularly share insights, tutorials, and reflections on tech, AI, and more. Your feedback and thoughts are always welcome!