
benitomartin/mlops-chicago-rides


MLOps Project Chicago Taxi Prices Prediction 🚕

This is a personal MLOps project based on a BigQuery dataset for taxi ride prices in Chicago. Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉

Tech Stack

Visual Studio Code · Jupyter Notebook · SQLite · Python · Pandas · NumPy · Matplotlib · scikit-learn · FastAPI · MLflow · Prefect · Docker · Anaconda · Linux · Google Cloud · Git

Project Structure

The project has been structured with the following folders and files:

  • images: images from results

  • notebooks: EDA and Modelling performed at the beginning of the project to establish a baseline

  • src: source code. It is divided into:

    • api: FastAPI app code
    • interface: main workflows
    • ml_logic: data/preprocessing/modelling functions
  • requirements.txt: project requirements

  • Dockerfile: docker image for deployment

  • .env.sample: sample file of environment variables

  • .env.sample.yaml: sample file of environment variables for the Dockerfile deployment

Project Description

The dataset was obtained from BigQuery and contains around 200 million rows; from its many columns, the following were selected for this project: price, pick-up and drop-off locations, and timestamps. To prepare the data for modelling, an Exploratory Data Analysis was conducted to preprocess the time and distance features, and suitable scalers and encoders were chosen for the preprocessing pipeline.

Fare Distribution

The following two charts show the fare distribution of the rides. Since the full dataset is too large to load at once, an environment variable (DATA_SIZE) was set up to decide how many rows to query. The price distribution for the first 1 million rows shows a strong concentration below 100 USD.

To detect outliers, the z-score is calculated for each query, so that outliers are removed relative to the number of rows actually downloaded. The following chart shows the fare distribution after removing outliers.
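The per-query z-score filter can be sketched as below. This is a minimal stand-alone illustration, not the actual project code; the threshold of 3 and the function name are assumptions.

```python
import math

def remove_fare_outliers(fares, threshold=3.0):
    """Drop fares whose z-score exceeds the threshold.

    The mean and standard deviation are computed per query, so the
    cut-off adapts to however many rows were downloaded (DATA_SIZE).
    """
    mean = sum(fares) / len(fares)
    std = math.sqrt(sum((f - mean) ** 2 for f in fares) / len(fares))
    if std == 0:
        return list(fares)
    return [f for f in fares if abs(f - mean) / std <= threshold]

fares = [8.0, 9.0, 10.0, 11.0, 12.0] * 4 + [500.0]
print(remove_fare_outliers(fares))  # the 500.0 outlier is dropped
```

Because mean and standard deviation are recomputed on every query, a larger DATA_SIZE simply tightens the same relative cut-off rather than requiring a hand-tuned absolute fare limit.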

Distance Distribution

For the distance preprocessing, the first approach was to plot the pick-up and drop-off locations on a map and as a histogram (excluding outliers) to see the distribution.

It can be seen that the distance distribution is heavily concentrated in the first 10–50 km. The preprocessing approach was to calculate the Manhattan and Haversine distances for each ride and encode them.
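The two distance features can be sketched as follows. This is a minimal stand-alone version with illustrative (approximate) Chicago coordinates; the actual functions in src/ml_logic may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def manhattan_km(lat1, lon1, lat2, lon2):
    """Approximate 'taxicab' distance: a north-south leg plus an east-west leg."""
    return (haversine_km(lat1, lon1, lat2, lon1)
            + haversine_km(lat2, lon1, lat2, lon2))

# Illustrative ride: the Loop to O'Hare (approximate coordinates)
print(haversine_km(41.8781, -87.6298, 41.9742, -87.9073))
print(manhattan_km(41.8781, -87.6298, 41.9742, -87.9073))
```

The Manhattan variant is always at least as long as the Haversine distance, which better reflects how a taxi actually travels a street grid.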

Time Distribution

For the time preprocessing, the idea was to extract the hour, day, and month as separate features and encode them. The hour was first decomposed into sine and cosine components.
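The sine/cosine decomposition of the hour can be sketched as below; the function name is an assumption, but the transform itself is the standard cyclical encoding.

```python
import math

def encode_hour(hour):
    """Map the hour of day onto a circle so that 23:00 and 00:00
    end up close together, unlike a raw 0-23 integer feature."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

for h in (0, 6, 12, 23):
    print(h, encode_hour(h))
```

With this encoding, midnight and 23:00 are neighbours in feature space, so the model is not misled by the artificial 23-to-0 jump of the raw hour value.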

Subsequently, a Neural Network model was trained with several Dense, BatchNormalization, and Dropout layers. The results showed an MAE of around 3 USD against an average price of 20 USD. However, the price predictions for rides above 10 USD are more accurate than for rides up to 10 USD.

Afterwards, the models went through model registry and deployment using MLflow, Prefect, and FastAPI. The Docker image was pushed to Google Container Registry and deployed on Google Cloud Run.

Modelling

To train a model, run the file main.py in the src/interface folder. This logs the models in MLflow and allows registering them and transitioning them from None to the Staging and Production stages. These options can be set up in the file registry.py in the src/ml_logic folder. Additionally, the environment variable MODEL_TARGET must be set to either local or gcs, so that the model is saved locally or in a GCS bucket.
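A minimal sketch of how the MODEL_TARGET switch might look is shown below. The function signature and the bucket variable are assumptions, not the actual registry.py code, and the GCS branch is left as a comment to keep the example self-contained.

```python
import os
import pickle

def save_model(model, path="model.pkl"):
    """Save the model locally or to a GCS bucket, depending on MODEL_TARGET."""
    target = os.environ.get("MODEL_TARGET", "local")
    if target == "local":
        with open(path, "wb") as f:
            pickle.dump(model, f)
        return path
    if target == "gcs":
        # Hypothetical GCS upload, e.g. with google-cloud-storage:
        # bucket = storage.Client().bucket(os.environ["BUCKET_NAME"])
        # bucket.blob(path).upload_from_string(pickle.dumps(model))
        raise NotImplementedError("GCS upload sketched in the comment above")
    raise ValueError(f"Unknown MODEL_TARGET: {target}")
```

With MODEL_TARGET=gcs, the same call would take the bucket branch instead, so the training code never has to know where models end up.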

Once a model is saved/registered, the workflow.py file in the src/interface folder runs a Prefect workflow that predicts new data with the saved model and trains a new model on these data to compare the results. If the MAE of the new model is lower, it can be moved to the Production stage and the old model is archived.
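Stripped of the Prefect decorators, the comparison step of that workflow can be sketched along these lines; the function names and return labels are assumptions, not the actual workflow.py code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted fares."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def choose_model(y_true, old_preds, new_preds):
    """Promote the new model only if it beats the old one on MAE."""
    old_mae = mean_absolute_error(y_true, old_preds)
    new_mae = mean_absolute_error(y_true, new_preds)
    if new_mae < old_mae:
        return "promote_new"  # old model would be archived
    return "keep_old"

fares = [10.0, 22.5, 8.0]
print(choose_model(fares, [12.0, 20.0, 9.0], [10.5, 22.0, 8.2]))  # promote_new
```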

To run Prefect and MLflow and see the logs, run the following commands in the terminal from the src/interface directory:

  • MLflow:
mlflow ui --backend-store-uri sqlite:///mlflow.db
  • Prefect Cloud (with own account):
prefect cloud login
  • Prefect (locally):

    prefect server start
    prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api

Deployment

With a model saved and in Production, the fast.py file can be run to get a prediction. This can be done locally by running the prediction API, by building a Docker image, or by pushing the image to Google Cloud Run to get a service URL.

Prediction API

To run the prediction API, run this from the project root directory and check the results at http://127.0.0.1:8000/predict:

uvicorn src.api.fast:app --reload

Docker image

To run the Docker image, build it and check the results at http://127.0.0.1:8000/predict:

docker build --tag=image .
docker run -it -e PORT=8000 -p 8000:8000 --env-file your/path/to/.env image

Docker image in Google Cloud

To get a service URL, first build the image:

docker build --tag=image .

Then push the image to Google Container Registry:

docker push image

Finally, deploy it and get the service URL in the terminal to run predictions against your own service. You should get something like this: Service URL: https://yourimage-jdhsk768sdfd-rt.a.run.app

gcloud run deploy --image image --region your-gcp-region --env-vars-file .env.yaml