A comprehensive big data pipeline system implementing the Formatted and Exploitation zones of a Data Management Backbone. This project processes real estate, income, and air quality data for Barcelona, creating a multidimensional data warehouse and machine learning models for predictive analytics.
This system implements a complete data pipeline architecture with three main components:
- Formatted Zone Pipeline - Data ingestion, cleaning, and standardization
- Multidimensional Exploitation Zone - Star schema data warehouse with dimension and fact tables
- Predictive Exploitation Zone - Machine learning pipelines for training and inference
The project uses Apache Spark for distributed data processing, MongoDB for the formatted zone storage, PostgreSQL for the data warehouse, and MLflow for machine learning model management.
- Automated ETL Pipeline: Processes data from multiple sources (Idealista real estate, Barcelona income statistics, air quality measurements)
- Data Reconciliation: Intelligent neighborhood matching and data harmonization across datasets
- Star Schema Data Warehouse: Optimized multidimensional model with fact and dimension tables
- Machine Learning Pipeline:
  - Linear regression model for real estate price prediction
  - Automated feature engineering and preprocessing
  - Model versioning and tracking with MLflow
- Flexible Configuration: Environment-based configuration for different deployment scenarios
- Code Quality: Automated formatting with Black and isort
Data flows through the zones as follows:

```
Data Sources → Formatted Zone (MongoDB) → Exploitation Zones
                                          ├─ Multidimensional (PostgreSQL)
                                          └─ Predictive (MLflow + Models)
```

The pipeline ingests three data sources:
- Idealista: Barcelona real estate listings with property details
- Open Data BCN Income: Neighborhood-level income statistics
- Open Data BCN Air Quality:
  - Stations: air quality monitoring stations
  - Measurements: detailed air quality measurements per station
- Python >= 3.8
- pip >= 21.0
- make >= 4.2.1
- Docker (for MongoDB and PostgreSQL setup)
- Java (for Spark)
```bash
git clone https://github.com/akossch0/bdm2.git
cd bdm2
make install
```

This will create a virtual environment in `.venv/` and install all required Python packages from `requirements.txt`.

```bash
source setup.sh
```

This sets the `PYTHONPATH` to include the `src` directory.
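For reference, the effect is roughly equivalent to exporting the path yourself (a sketch; the actual contents of setup.sh may differ):

```bash
# Roughly what setup.sh does: make modules under src/ importable.
export PYTHONPATH="${PYTHONPATH}:$(pwd)/src"
```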
Create a `.env` file based on `.env-template`:

```bash
cp .env-template .env
```

Edit the `.env` file to configure:
- MongoDB connection (host, port, database name)
- PostgreSQL connection (host, port, database, credentials)
- MLflow server (host, port)
- Data source folders
- Model names and versions
Key variables to adjust:
- `MONGODB_HOST`, `MONGODB_PORT`: Your MongoDB deployment
- `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_USER`, `POSTGRES_PASSWORD`: Your PostgreSQL deployment
- `MLFLOW_HOST`, `MLFLOW_PORT`: Your MLflow tracking server
- Source folder paths: Point to your data directories
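The pipelines read these values through python-dotenv (a listed dependency). A minimal sketch of how a component might pick up the MongoDB settings; the defaults shown are illustrative assumptions:

```python
import os

from dotenv import load_dotenv  # python-dotenv, already in requirements.txt

load_dotenv()  # reads key=value pairs from .env into the process environment

# Variable names as in .env-template; the fallback defaults are assumptions.
mongodb_host = os.getenv("MONGODB_HOST", "localhost")
mongodb_port = int(os.getenv("MONGODB_PORT", "27017"))
print(f"MongoDB at {mongodb_host}:{mongodb_port}")
```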
If you don't have MongoDB and PostgreSQL instances, you can run them locally with Docker:
```bash
docker pull postgres
docker run -itd -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=bdm -p 5432:5432 postgres
```

Follow the MongoDB Docker installation guide.
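For a quick local MongoDB, a minimal sketch using the official image (default port 27017; the container name is an arbitrary choice):

```bash
docker pull mongo
docker run -itd -p 27017:27017 --name mongodb mongo
```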
Processes raw data and loads it into MongoDB with standardized schemas:
```bash
make formatted-pipeline
```

This pipeline:
- Reads data from the persistent landing zone
- Performs data cleaning and transformation
- Reconciles location data across datasets
- Loads formatted data into MongoDB collections
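For orientation, a minimal sketch of the final load step, assuming the MongoDB Spark connector (v10.x) is on the Spark classpath; the database, collection, and column names here are illustrative, not the pipeline's actual identifiers:

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes the MongoDB Spark connector is available, e.g. via
# --packages org.mongodb.spark:mongo-spark-connector_2.12:10.3.0
spark = (
    SparkSession.builder.appName("formatted-zone-sketch")
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Toy stand-in for a cleaned, standardized DataFrame.
cleaned_df = spark.createDataFrame(
    [("el Raval", 850.0)], ["neighborhood", "price_per_m2"]
)

# Append the standardized records into a formatted-zone collection.
(
    cleaned_df.write.format("mongodb")
    .mode("append")
    .option("database", "formatted_zone")   # illustrative names
    .option("collection", "idealista")
    .save()
)
```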
Creates a star schema data warehouse in PostgreSQL:
```bash
make exploitation-multidim
```

This pipeline:
- Reads formatted data from MongoDB
- Creates dimension tables (locations, time periods, property types)
- Creates fact tables (income facts, property listings)
- Establishes foreign key relationships
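To illustrate the shape of the resulting schema, a minimal sketch with SQLAlchemy (a listed dependency); the table and column names are simplified assumptions, not the warehouse's actual DDL, and the credentials match the Docker example above:

```python
from sqlalchemy import (Column, Date, Float, ForeignKey, Integer, MetaData,
                        String, Table, create_engine)

metadata = MetaData()

# Dimension table: one row per location (names are illustrative).
dim_location = Table(
    "dim_location", metadata,
    Column("location_id", Integer, primary_key=True),
    Column("neighborhood", String, nullable=False),
    Column("district", String),
)

# Fact table referencing the dimension through a foreign key.
fact_listing = Table(
    "fact_listing", metadata,
    Column("listing_id", Integer, primary_key=True),
    Column("location_id", Integer, ForeignKey("dim_location.location_id")),
    Column("listed_on", Date),
    Column("price", Float),
)

engine = create_engine("postgresql+psycopg2://postgres:bdm@localhost:5432/postgres")
metadata.create_all(engine)  # issues CREATE TABLE statements in dependency order
```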
Trains a linear regression model to predict real estate prices:
```bash
make model-training
```

This pipeline:
- Loads property data from MongoDB
- Performs feature engineering (imputation, encoding, scaling)
- Trains a linear regression model
- Logs model and metrics to MLflow
- Registers the model for production use
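As a rough illustration of these steps with Spark ML and MLflow, a self-contained sketch; the column names and toy rows are assumptions, not the project's actual feature set:

```python
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import (Imputer, OneHotEncoder, StandardScaler,
                                StringIndexer, VectorAssembler)
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("training-sketch").getOrCreate()

# Toy stand-in for the property data loaded from MongoDB.
train_df = spark.createDataFrame(
    [(80.0, "flat", 250000.0), (120.0, "house", 410000.0), (None, "flat", 199000.0)],
    ["size", "property_type", "price"],
)

pipeline = Pipeline(stages=[
    Imputer(inputCols=["size"], outputCols=["size_imp"]),           # imputation
    StringIndexer(inputCol="property_type", outputCol="type_idx"),  # encoding
    OneHotEncoder(inputCols=["type_idx"], outputCols=["type_ohe"]),
    VectorAssembler(inputCols=["size_imp", "type_ohe"], outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),  # scaling
    LinearRegression(featuresCol="features", labelCol="price"),
])

with mlflow.start_run():
    model = pipeline.fit(train_df)
    mlflow.spark.log_model(model, "model")  # logs to the configured tracking server
```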
Applies the trained model to make predictions on new data:
```bash
make model-inference
```

Before running, configure in `.env`:
- `MONGODB_PREDICTION_COLLECTION`: Collection containing the data to predict on
- `MLFLOW_MODEL_VERSION`: Version of the registered model to use
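Under the hood, resolving a registered model version from MLflow looks roughly like this (a sketch; the model name and version shown are placeholders for your `.env` values):

```python
import mlflow

# models:/<name>/<version> resolves through the MLflow Model Registry.
model_name = "real-estate-price"  # placeholder: use your registered model name
model_version = 1                 # placeholder: see MLFLOW_MODEL_VERSION in .env
model = mlflow.spark.load_model(f"models:/{model_name}/{model_version}")

# new_data_df: a Spark DataFrame holding the records to score (assumed to exist).
predictions = model.transform(new_data_df)
```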
Format Python code with Black and isort:
```bash
make format
```

View all available commands:

```bash
make help
```

Project layout:
```
.
├── src/
│   ├── formatted_pipeline.py    # Main ETL pipeline for formatted zone
│   ├── data_formatters.py       # Data source handlers and formatters
│   ├── data_reconciliation.py   # Cross-dataset reconciliation logic
│   ├── exploitation/
│   │   ├── multidimensional.py  # Star schema warehouse creation
│   │   ├── train.py             # ML model training pipeline
│   │   └── predict.py           # ML model inference pipeline
│   ├── utils.py                 # Utility functions
│   └── logging_config.py        # Logging configuration
├── data/
│   └── persistent-landing-zone/ # Raw data storage
├── Makefile                     # Build automation
├── requirements.txt             # Python dependencies
├── .env-template                # Environment configuration template
└── README.md                    # This file
```
Main Python packages (see requirements.txt for full list):
- PySpark 3.5.1: Distributed data processing
- pandas 2.2.2: Data manipulation
- MLflow 2.13.0: ML experiment tracking and model registry
- SQLAlchemy 2.0.30: Database ORM
- psycopg2 2.9.9: PostgreSQL adapter
- python-dotenv 1.0.1: Environment configuration
- black 24.4.2 & isort 5.13.2: Code formatting
Ensure Java is properly installed and JAVA_HOME is set. The project expects Java to be available for Spark execution.
Verify that MongoDB and PostgreSQL are running and accessible at the configured hosts and ports. Check firewall settings if connecting to remote instances.
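A quick connectivity check from Python (a sketch; `psycopg2` is a listed dependency, while `pymongo` is assumed to be available; hosts and credentials match the Docker examples above):

```python
import psycopg2
from pymongo import MongoClient

# PostgreSQL: connect() raises OperationalError if the server is unreachable.
pg = psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="bdm", dbname="postgres")
pg.close()

# MongoDB: the "ping" command fails fast if the server does not answer.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000)
client.admin.command("ping")
print("Both databases reachable.")
```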
Ensure all data source folders specified in .env exist and contain the expected data files.
- Darryl Abraham
- Ákos Schneider
This project is part of the Big Data Management course.