MLOps Project Car Prices Prediction

This project has been developed as part of the MLOps Zoomcamp course provided by DataTalks.Club.

The dataset was downloaded from Kaggle, and a preliminary data analysis (see the notebooks folder) was performed to gain insights that guided the rest of the project.

Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉

Tech Stack

Visual Studio Code, Jupyter Notebook, PostgreSQL, Python, Pandas, Matplotlib, scikit-learn, Flask, MLflow, Docker, Anaconda, Linux, AWS, Grafana, Git

Project Structure

The project has been structured with the following folders and files:

  • .github: contains the CI/CD files (GitHub Actions)
  • data: dataset and test sample for testing the model
  • integration_tests: prediction integration test with docker-compose
  • lambda: test of the Lambda handler with and without Docker
  • model: full pipeline from preprocessing to prediction and monitoring using MLflow, Prefect, Grafana, Adminer, and docker-compose
  • notebooks: EDA and Modeling performed at the beginning of the project to establish a baseline
  • tests: unit tests
  • terraform: IaC stream-based pipeline infrastructure in AWS using Terraform
  • Makefile: set of execution tasks
  • pyproject.toml: linting and formatting configuration
  • setup.py: project installation module
  • requirements.txt: project requirements

Project Description

The dataset was obtained from Kaggle and contains various columns with car details and prices. An Exploratory Data Analysis was conducted first to understand the numerical and categorical features and to choose suitable scalers and encoders for the preprocessing pipeline. A grid search was then performed to select the best regression models. RandomForestRegressor and GradientBoostingRegressor were the top performers, each achieving an R² of approximately 0.9.
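The preprocessing and model selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the column names (`horsepower`, `mileage`, `fuel_type`), the synthetic data, and the parameter grid are all assumptions chosen only to make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Kaggle data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "horsepower": rng.uniform(60, 250, n),
        "mileage": rng.uniform(0, 200_000, n),
        "fuel_type": rng.choice(["gas", "diesel"], n),
    }
)
price = 100 * df["horsepower"] - 0.05 * df["mileage"] + rng.normal(0, 500, n)

# Scale numerical features and one-hot encode categorical ones.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["horsepower", "mileage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
    ]
)

pipeline = Pipeline(
    [
        ("preprocess", preprocessor),
        ("model", RandomForestRegressor(random_state=0)),
    ]
)

# Search over both regressors by swapping the "model" step itself.
param_grid = [
    {
        "model": [RandomForestRegressor(random_state=0)],
        "model__n_estimators": [20, 50],
    },
    {
        "model": [GradientBoostingRegressor(random_state=0)],
        "model__n_estimators": [20, 50],
    },
]

search = GridSearchCV(pipeline, param_grid, scoring="r2", cv=3)
search.fit(df, price)

best_model_name = type(search.best_estimator_.named_steps["model"]).__name__
print(best_model_name, round(search.best_score_, 3))
```

Keeping the preprocessor inside the pipeline means the scalers and encoders are refit on each cross-validation fold, which avoids leaking information from the validation split into the training split.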

Afterward, the models were tested, registered, and deployed using MLflow, Prefect, and Flask. Model monitoring was set up with Grafana, with Adminer used to inspect the underlying database. The project infrastructure was then provisioned with Terraform, using AWS services such as Kinesis Streams (Producer & Consumer), Lambda (Serving API), S3 Bucket (Model artifacts), and ECR (Image Registry).
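To illustrate how a Lambda function can serve predictions from a Kinesis stream, here is a minimal sketch. The event layout (`Records[].kinesis.data`, base64-encoded) is the standard shape Lambda receives from Kinesis; the payload schema (`car_id`, `features`) and the `predict` placeholder are hypothetical and stand in for the repository's actual model loading and inference code.

```python
import base64
import json


def predict(features):
    """Hypothetical placeholder; a real handler would call the model loaded from S3."""
    return 10_000.0


def lambda_handler(event, context):
    """Decode each Kinesis record and return one prediction per record."""
    predictions = []
    for record in event["Records"]:
        # Kinesis delivers the record payload as a base64-encoded string.
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        car = json.loads(payload)
        predictions.append(
            {"car_id": car.get("car_id"), "price": predict(car["features"])}
        )
    return {"predictions": predictions}


# Local usage example with a hand-built event (the payload fields are assumptions).
sample_payload = json.dumps({"car_id": 1, "features": {"horsepower": 111}})
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(sample_payload.encode("utf-8")).decode("utf-8")}}
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

Building the event by hand like this makes it possible to exercise the handler locally, without Docker or an AWS account, before running the integration tests.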

Finally, to streamline the development process, a fully automated CI/CD pipeline was created using GitHub Actions.

Project Set Up

The Python version used for this project is Python 3.9.

  1. Clone the repo (or download it as a zip file):

    git clone https://github.com/benitomartin/mlops-car-prices.git
  2. Create the virtual environment named main-env using Conda with Python version 3.9:

    conda create -n main-env python=3.9
    conda activate main-env
  3. Install setuptools and wheel:

    conda install setuptools wheel
    
  4. Execute the setup.py script and install the project dependencies included in the requirements.txt:

    pip install .
    
    or
    
    make install

Each project folder contains a README.md file with instructions on how to run the code. I highly recommend creating a separate virtual environment for each one. Note that an AWS account, credentials, and policies granting full access to EC2, S3, ECR, Lambda, and Kinesis are required for the project to work. Make sure the appropriate credentials are configured before interacting with AWS services.

Project Best Practices

The following best practices were implemented:

  • Problem description: The project is well described, clear, and understandable
  • Cloud: The project is developed on the cloud and IaC tools are used for provisioning the infrastructure
  • Experiment tracking and model registry: Both experiment tracking and model registry are used
  • Workflow orchestration: Fully deployed workflow
  • Model deployment: The model deployment code is containerized and can be deployed to the cloud
  • Model monitoring: Basic model monitoring that calculates and reports metrics
  • Reproducibility: Instructions are clear, it's easy to run the code, and it works. The versions for all the dependencies are specified.
  • Best practices:
    • There are unit tests
    • There is an integration test
    • Linter and code formatting are used
    • There is a Makefile
    • There is a CI/CD pipeline