This project aims to build an end-to-end MLOps pipeline that predicts student performance based on a variety of academic, socio-economic, and behavioral factors. The model is designed to help educational institutions proactively identify students at risk of underperformance, enabling early interventions and personalized support strategies.
The project demonstrates the complete lifecycle of machine learning, from data collection and model development to deployment, monitoring, and continuous integration. By leveraging MLOps principles, we ensure that the solution is scalable, reproducible, and maintainable.
Note: The "Student Performance Factors" dataset used in this project is a synthetic dataset generated for educational and analytical purposes. The data is not sourced from any real-world institutions but is created to simulate realistic scenarios for analyzing student performance factors. As such, the results and predictions should not be used in real-world decision-making. Care should be taken when interpreting the outcomes, as the model's performance is based solely on this synthetic dataset and may not generalize to different or actual student populations.
- Project Overview
- Docker Network
  - 2.1. Creating the Docker Network
- MLflow: Experiment Tracking and Model Registration
  - 3.1. Starting MLflow
  - 3.2. Important Note
- Orchestration
  - 4.1. Starting Airflow
    - 4.1.1. Initialize the Airflow Database
    - 4.1.2. Start Airflow Services
- Web Service
  - 5.1. Running the Web Service
- Monitoring Logs
  - 6.1. Setting Up the ELK Stack
    - 6.1.1. Initialize the Services
    - 6.1.2. Start the ELK Stack
  - 6.2. Visualizing Logs
- GitHub Actions CI/CD
  - 7.1. Workflow Overview
  - 7.2. Workflow Configuration
  - 7.3. Getting Started
- Installation
  - 8.1. Prerequisites
  - 8.2. Steps
- Makefile
  - 9.1. Makefile Commands
  - 9.2. Recommended Workflow
A shared Docker network is created to facilitate communication between the various services in the project. This ensures that the different containers can interact seamlessly with each other.
To create the shared Docker network, run the following command:
docker network create shared_network
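If you prefer to check for or create this network from Python (for example, inside a setup script), a minimal sketch using the Docker SDK for Python could look like the following; the network name matches the command above, everything else is illustrative:

```python
# Minimal sketch: create the shared network via the Docker SDK for Python (pip install docker).
# Equivalent to `docker network create shared_network`; skips creation if it already exists.
import docker

client = docker.from_env()
if not client.networks.list(names=["shared_network"]):
    client.networks.create("shared_network", driver="bridge")
```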
MLflow is an open-source platform for managing the machine learning lifecycle. In this project, MLflow is utilized primarily for tracking experiments and registering models, enabling better organization and reproducibility of machine learning workflows.
Before starting the other services, ensure that MLflow is up and running to guarantee proper functionality. Use the following command to start the MLflow service:
docker-compose --env-file mlflow.env -f mlflow.docker-compose.yml up --build
It is crucial to start the MLflow service before any other services (such as Airflow and the web service) to ensure that they can properly connect to and utilize MLflow’s tracking and model registration capabilities. Failing to do so may lead to connectivity issues and hinder the effectiveness of the machine learning lifecycle management in this project.
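For orientation, the sketch below shows how training code typically logs a run and registers a model against a tracking server like this one. The tracking URI, experiment name, and model are illustrative assumptions; the project's actual values come from mlflow.env and the training pipeline.

```python
# Sketch only: log a run and register a model with the MLflow tracking server.
# The tracking URI, experiment name, and model below are illustrative assumptions.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

mlflow.set_tracking_uri("http://localhost:5000")  # assumed tracking server address
mlflow.set_experiment("student-performance")      # hypothetical experiment name

X, y = [[1.0], [2.0], [3.0]], [55.0, 65.0, 75.0]  # toy data, not the project dataset

with mlflow.start_run():
    model = LinearRegression().fit(X, y)
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model", registered_model_name="student-performance-model")
```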
Apache Airflow is an open-source platform to programmatically create, schedule, and monitor workflows. In this project, Airflow is used for orchestrating the machine learning workflows, ensuring that tasks are executed in the correct order and dependencies are managed effectively.
To initialize and start the Airflow services, use the following commands:
- Initialize the Airflow Database: These commands set up the necessary database for Airflow to manage its metadata and task states.
echo -e "AIRFLOW_UID=$(id -u)" > .env
docker-compose -f airflow.docker-compose.yaml up airflow-init
- Start Airflow Services: After initializing, start the Airflow services with the following command:
docker-compose -f airflow.docker-compose.yaml up --build
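For context, a minimal DAG of the kind Airflow schedules is sketched below; the DAG id, schedule, and task body are placeholders rather than the project's actual pipeline.

```python
# Sketch only: a minimal Airflow DAG; the id, schedule, and task are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model():
    # The real pipeline would pull data, train the model, and log results to MLflow.
    print("training placeholder")


with DAG(
    dag_id="student_performance_pipeline",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,                 # run on demand (e.g., triggered from CI)
    catchup=False,
):
    PythonOperator(task_id="train_model", python_callable=train_model)
```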
This project includes a web service that consists of two main components: an API built using FastAPI and a user interface created with Streamlit. The FastAPI backend handles model predictions and serves the machine learning model, while the Streamlit frontend provides an interactive interface for users to input data and visualize results.
Both the FastAPI API and the Streamlit interface are containerized and can be run together using Docker Compose. The following command will build and start the services in Docker containers:
docker-compose -f web-service.docker-compose.yml up --build
- FastAPI URL: Once the service is up, you can access the API documentation (Swagger UI) at http://localhost:8000/docs
- Streamlit URL: The Streamlit interface will be available at http://localhost:8501
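As a quick smoke test once both containers are up, a prediction request can be sent directly to the API. The /predict path and the feature names below are assumptions; the authoritative request schema is in the Swagger UI at /docs.

```python
# Sketch only: call the prediction API; the endpoint path and payload fields are assumptions.
import requests

payload = {
    "hours_studied": 12,   # hypothetical feature names; check /docs for the real schema
    "attendance": 85,
    "sleep_hours": 7,
}

response = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
response.raise_for_status()
print(response.json())
```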
This project utilizes the ELK Stack (Elasticsearch, Logstash, and Kibana) for monitoring logs and visualizing system performance. The ELK stack provides powerful tools for searching, analyzing, and visualizing log data in real-time.
To set up the ELK stack, initialize and then start the services as follows:
- Initialize the Services: Run the following command to set up the necessary configurations and prepare the ELK stack for use:
docker-compose -f elk.docker-compose.yml up setup
- Start the ELK Stack: Once the setup is complete, you can start the ELK stack with the following command:
docker-compose -f elk.docker-compose.yml up
Logs can be visualized in Kibana while using the Streamlit app. As users interact with the Streamlit interface, logs are generated and sent to the ELK stack for analysis and visualization. To view the logs, open Kibana at http://localhost:5601.
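One common pattern for making application logs easy to parse in Kibana is to emit them as structured JSON, as sketched below. How the logs actually reach Logstash in this project (TCP input, Filebeat, container log collection, etc.) is defined by the ELK Compose setup, so treat this purely as an illustration of the application side.

```python
# Sketch only: emit JSON-structured logs so Logstash/Kibana can parse fields easily.
# The shipping mechanism (Logstash input, Filebeat, etc.) is project-specific.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("student-performance")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction requested")  # appears in Kibana once the line is shipped to Logstash
```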
This project utilizes GitHub Actions to automate the continuous integration and continuous deployment (CI/CD) processes. The CI/CD workflow is triggered by pushes to the main branch, as well as pull requests targeting the main branch. Additionally, it supports manual triggers for greater flexibility.
The CI/CD workflow consists of the following steps:
- Checkout the Repository: The workflow begins by checking out the code from the repository.
- Set Up Environment:
  - Sets up Docker Buildx for building Docker images.
  - Installs Python 3.12 using the setup-python action.
  - Installs Docker Compose.
- Install Dependencies: Installs the necessary Python packages using Pipenv.
- Create Shared Network: A Docker network named shared_network is created to allow communication between containers.
- Start Services:
  - MLflow Service: The MLflow service is started using a Docker Compose file.
  - Airflow Service: The Airflow service is initialized and started, allowing for task orchestration.
- Trigger Airflow DAG: The workflow triggers a specified Airflow DAG to execute tasks defined within it.
- Start Web Service: The FastAPI web service is started using Docker Compose.
- Install FastAPI Dependencies: Installs dependencies from the Pipfile for the FastAPI application.
- Run Tests:
  - Runs a request test to ensure the web service is functioning as expected.
  - Executes unit tests for the FastAPI application using pytest (see the sketch after this list).
- Stop Services: Finally, the workflow stops all running Docker services to clean up resources.
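To give a feel for the kind of unit test the workflow runs, here is a small pytest sketch using FastAPI's TestClient; the app import path and the /predict endpoint are assumptions about the web service code, not its actual layout.

```python
# Sketch only: a pytest unit test for the FastAPI app; import path and endpoint are assumptions.
from fastapi.testclient import TestClient

from web_service.app import app  # hypothetical module path for the FastAPI app

client = TestClient(app)


def test_predict_returns_a_prediction():
    payload = {"hours_studied": 12, "attendance": 85}  # hypothetical feature names
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    assert "prediction" in response.json()
```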
The CI/CD workflows are defined in the .github/workflows directory of the repository. Each workflow file outlines the steps required for the automated processes, including:
- Triggering on specific events (e.g., push, pull requests, manual triggers).
- Specifying jobs to run (e.g., testing, building, deploying).
- Using environment variables and secrets for secure configuration.
To set up GitHub Actions for your fork of this project, ensure that the necessary secrets and environment variables are configured in your repository settings. This setup will allow the CI/CD pipeline to run smoothly, automating the testing and deployment process whenever changes are made.
By using GitHub Actions, we ensure a robust, efficient, and automated workflow, enhancing the development experience and minimizing the risk of errors during deployments.
Note: This project was originally created on GitHub Codespaces using a 4-core machine with 16 GB RAM and 32 GB of storage. If you want to run the code easily, consider using GitHub Codespaces for a seamless experience.
Prerequisites:
- Python 3.12
- Pipenv
- Docker
To simplify the installation and setup process, you can utilize the Makefile commands instead of manual steps. However, if you prefer to do it manually, follow these steps:
- Clone the repository:
git clone https://github.com/aalvan/mlops-student-performance.git
- Navigate to the project directory:
cd mlops-student-performance
- Install dependencies using pipenv:
pipenv install
- Activate the Pipenv shell:
pipenv shell
- Install the project as a module using setup.py:
pipenv install --editable .
The Makefile automates project management tasks, including:
- Installation: Sets up Python dependencies and installs the project.
- Service Management: Starts and stops essential services like MLflow, Airflow, Web Service, and ELK.
- Airflow DAG Handling: Unpauses and triggers specific Airflow DAGs while monitoring their status.
- Testing: Runs unit tests using pytest.
The following commands are available:
- help: Displays a list of available commands and their descriptions.
- install: Installs or updates project dependencies and sets up the project in editable mode.
- setup-services: Creates a Docker network and initializes the ELK stack and Airflow.
- start-services: Launches all services (MLflow, Airflow, Web Service, and ELK).
- mlflow: Starts the MLflow service using Docker.
- airflow: Starts the Airflow scheduler and web server and waits for them to be reachable.
- airflow-dag-trigger: Unpauses and triggers a specific Airflow DAG, monitoring its execution status.
- web_service: Starts the Web Service using Docker.
- elk: Starts the ELK stack using Docker.
- test: Runs unit tests for the Web Service using pytest.
- stop-services: Stops all running Docker containers associated with the project.
To run the commands correctly, follow this order:
- install: Set up dependencies and project configuration.
- setup-services: Initialize the Docker network and services.
- start-services: Launch all necessary services.
Important Note: Always run start-services before executing tests to ensure all necessary services are running.
After running these commands, the entire setup will be ready for use.