
Big Data Management - Data Pipeline System

A comprehensive big data pipeline system implementing the Formatted and Exploitation zones of a Data Management Backbone. This project processes real estate, income, and air quality data for Barcelona, creating a multidimensional data warehouse and machine learning models for predictive analytics.

Overview

This system implements a complete data pipeline architecture with three main components:

  1. Formatted Zone Pipeline - Data ingestion, cleaning, and standardization
  2. Multidimensional Exploitation Zone - Star schema data warehouse with dimension and fact tables
  3. Predictive Exploitation Zone - Machine learning pipelines for training and inference

The project uses Apache Spark for distributed data processing, MongoDB for the formatted zone storage, PostgreSQL for the data warehouse, and MLflow for machine learning model management.

Key Features

  • Automated ETL Pipeline: Processes data from multiple sources (Idealista real estate, Barcelona income statistics, air quality measurements)
  • Data Reconciliation: Intelligent neighborhood matching and data harmonization across datasets
  • Star Schema Data Warehouse: Optimized multidimensional model with fact and dimension tables
  • Machine Learning Pipeline:
    • Linear regression model for real estate price prediction
    • Automated feature engineering and preprocessing
    • Model versioning and tracking with MLflow
  • Flexible Configuration: Environment-based configuration for different deployment scenarios
  • Code Quality: Automated formatting with Black and isort

Architecture

Data Sources → Formatted Zone (MongoDB) → Exploitation Zones
                                        ├─ Multidimensional (PostgreSQL)
                                        └─ Predictive (MLflow + Models)

Data Sources

  • Idealista: Barcelona real estate listings with property details
  • Open Data BCN Income: Neighborhood-level income statistics
  • Open Data BCN Air Quality: Air quality measurements for Barcelona

Prerequisites

  • Python >= 3.8
  • pip >= 21.0
  • make >= 4.2.1
  • Docker (for MongoDB and PostgreSQL setup)
  • Java (for Spark)

Installation

1. Clone the Repository

git clone https://github.com/akossch0/bdm2.git
cd bdm2

2. Create Virtual Environment and Install Dependencies

make install

This will create a virtual environment in .venv/ and install all required Python packages from requirements.txt.

3. Set Environment Variables

source setup.sh

This sets the PYTHONPATH to include the src directory.

4. Configure Environment

Create a .env file based on .env-template:

cp .env-template .env

Edit the .env file to configure:

  • MongoDB connection (host, port, database name)
  • PostgreSQL connection (host, port, database, credentials)
  • MLflow server (host, port)
  • Data source folders
  • Model names and versions

Key variables to adjust:

  • MONGODB_HOST, MONGODB_PORT: Your MongoDB deployment
  • POSTGRES_HOST, POSTGRES_PORT, POSTGRES_USER, POSTGRES_PASSWORD: Your PostgreSQL deployment
  • MLFLOW_HOST, MLFLOW_PORT: Your MLflow tracking server
  • Source folder paths: Point to your data directories
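
As an illustration, a local-development .env might look like the fragment below. The values are placeholders matching the Docker setup in the next step; adjust them to your own deployment (the MLflow port shown is the tool's common default, not something mandated by this project):

```
MONGODB_HOST=localhost
MONGODB_PORT=27017
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=bdm
MLFLOW_HOST=localhost
MLFLOW_PORT=5000
```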

5. Set Up Database Instances (Optional)

If you don't have MongoDB and PostgreSQL instances, you can run them locally with Docker:

PostgreSQL Setup

docker pull postgres
docker run -itd -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=bdm -p 5432:5432 postgres

MongoDB Setup

Follow the MongoDB Docker installation guide.
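For a quick unauthenticated local instance, the container can be started much like the PostgreSQL one above (a minimal sketch; the official guide covers authentication and replica sets):

```
docker pull mongo
docker run -itd -p 27017:27017 mongo
```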

Usage

Run the Formatted Zone Pipeline

Processes raw data and loads it into MongoDB with standardized schemas:

make formatted-pipeline

This pipeline:

  • Reads data from the persistent landing zone
  • Performs data cleaning and transformation
  • Reconciles location data across datasets
  • Loads formatted data into MongoDB collections
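The reconciliation step hinges on normalizing neighborhood names before matching them across datasets. A minimal, hypothetical sketch of such a normalizer (not the project's actual implementation, which lives in src/data_reconciliation.py):

```python
import re
import unicodedata

def normalize_neighborhood(name: str) -> str:
    """Normalize a neighborhood name for cross-dataset matching."""
    # Strip accents: "Sarrià" -> "Sarria"
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Lowercase, drop punctuation, collapse whitespace
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

# Listings and income rows can then be joined on the normalized key:
assert normalize_neighborhood("Sarrià - Sant Gervasi") == normalize_neighborhood("sarria sant gervasi")
```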

Run the Multidimensional Exploitation Pipeline

Creates a star schema data warehouse in PostgreSQL:

make exploitation-multidim

This pipeline:

  • Reads formatted data from MongoDB
  • Creates dimension tables (locations, time periods, property types)
  • Creates fact tables (income facts, property listings)
  • Establishes foreign key relationships
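Conceptually, each formatted record is split into dimension rows (deduplicated, keyed by surrogate IDs) and a fact row holding foreign keys plus measures. A simplified pure-Python illustration of that split (table and column names here are hypothetical, not the warehouse's actual schema):

```python
def build_star_schema(records):
    """Split flat records into a location dimension and a listings fact table."""
    dim_location = {}   # natural key -> surrogate key
    fact_listings = []
    for rec in records:
        key = rec["neighborhood"]
        if key not in dim_location:
            dim_location[key] = len(dim_location) + 1  # assign surrogate key
        fact_listings.append({
            "location_id": dim_location[key],   # FK into the dimension table
            "price": rec["price"],              # measure
        })
    dim_rows = [{"location_id": sk, "name": nk} for nk, sk in dim_location.items()]
    return dim_rows, fact_listings

dims, facts = build_star_schema([
    {"neighborhood": "Gracia", "price": 300000},
    {"neighborhood": "Eixample", "price": 450000},
    {"neighborhood": "Gracia", "price": 280000},
])
```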

Train Machine Learning Model

Trains a linear regression model to predict real estate prices:

make model-training

This pipeline:

  • Loads property data from MongoDB
  • Performs feature engineering (imputation, encoding, scaling)
  • Trains a linear regression model
  • Logs model and metrics to MLflow
  • Registers the model for production use
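At its core, the training step fits a line through feature/price pairs and reports an error metric. A dependency-free sketch of the closed-form least-squares fit for a single feature (the real pipeline uses Spark with multiple engineered features and logs results to MLflow; the numbers below are illustrative only):

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for y = w*x + b with one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

def rmse(xs, ys, w, b):
    """Root mean squared error of the fitted line."""
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# e.g. square meters -> listing price (toy data)
w, b = fit_linear([50, 70, 90], [200000, 280000, 360000])
```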

Run Model Inference

Applies the trained model to make predictions on new data:

make model-inference

Before running, configure in .env:

  • MONGODB_PREDICTION_COLLECTION: Collection with data to predict on
  • MLFLOW_MODEL_VERSION: Version of the model to use

Code Formatting

Format Python code with Black and isort:

make format

Help

View all available commands:

make help

Project Structure

.
├── src/
│   ├── formatted_pipeline.py        # Main ETL pipeline for formatted zone
│   ├── data_formatters.py           # Data source handlers and formatters
│   ├── data_reconciliation.py       # Cross-dataset reconciliation logic
│   ├── exploitation/
│   │   ├── multidimensional.py      # Star schema warehouse creation
│   │   ├── train.py                 # ML model training pipeline
│   │   └── predict.py               # ML model inference pipeline
│   ├── utils.py                     # Utility functions
│   └── logging_config.py            # Logging configuration
├── data/
│   └── persistent-landing-zone/     # Raw data storage
├── Makefile                          # Build automation
├── requirements.txt                  # Python dependencies
├── .env-template                     # Environment configuration template
└── README.md                         # This file

Dependencies

Main Python packages (see requirements.txt for full list):

  • PySpark 3.5.1: Distributed data processing
  • pandas 2.2.2: Data manipulation
  • MLflow 2.13.0: ML experiment tracking and model registry
  • SQLAlchemy 2.0.30: Database ORM
  • psycopg2 2.9.9: PostgreSQL adapter
  • python-dotenv 1.0.1: Environment configuration
  • black 24.4.2 & isort 5.13.2: Code formatting

Troubleshooting

Spark/Java Issues

Ensure Java is properly installed and JAVA_HOME is set. The project expects Java to be available for Spark execution.

Database Connection Issues

Verify that MongoDB and PostgreSQL are running and accessible at the configured hosts and ports. Check firewall settings if connecting to remote instances.

Data Path Issues

Ensure all data source folders specified in .env exist and contain the expected data files.

Contributors

  • Darryl Abraham
  • Ákos Schneider

License

This project is part of the Big Data Management course.
