A comprehensive big data pipeline system implementing the Formatted and Exploitation zones of a Data Management Backbone. This project processes real estate, income, and air quality data for Barcelona, creating a multidimensional data warehouse and machine learning models for predictive analytics.
This system implements a complete data pipeline architecture with three main components:
- Formatted Zone Pipeline - Data ingestion, cleaning, and standardization
- Multidimensional Exploitation Zone - Star schema data warehouse with dimension and fact tables
- Predictive Exploitation Zone - Machine learning pipelines for training and inference
The project uses Apache Spark for distributed data processing, MongoDB for the formatted zone storage, PostgreSQL for the data warehouse, and MLflow for machine learning model management.
- Automated ETL Pipeline: Processes data from multiple sources (Idealista real estate, Barcelona income statistics, air quality measurements)
- Data Reconciliation: Intelligent neighborhood matching and data harmonization across datasets
- Star Schema Data Warehouse: Optimized multidimensional model with fact and dimension tables
- Machine Learning Pipeline:
  - Linear regression model for real estate price prediction
  - Automated feature engineering and preprocessing
  - Model versioning and tracking with MLflow
- Flexible Configuration: Environment-based configuration for different deployment scenarios
- Code Quality: Automated formatting with Black and isort
Data flows through the zones as follows:

```
Data Sources → Formatted Zone (MongoDB) → Exploitation Zones
                                          ├─ Multidimensional (PostgreSQL)
                                          └─ Predictive (MLflow + Models)
```

The pipeline ingests three data sources:
- Idealista: Barcelona real estate listings with property details
- Open Data BCN Income: Neighborhood-level income statistics
- Open Data BCN Air Quality:
  - Stations: air quality monitoring stations
  - Measurements: detailed air quality measurements per station
- Python >= 3.8
- pip >= 21.0
- make >= 4.2.1
- Docker (for MongoDB and PostgreSQL setup)
- Java (for Spark)
```bash
git clone https://github.com/akossch0/bdm2.git
cd bdm2
make install
```

This will create a virtual environment in `.venv/` and install all required Python packages from `requirements.txt`.

```bash
source setup.sh
```

This sets the `PYTHONPATH` to include the `src` directory.
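For reference, the effect is roughly equivalent to exporting the path yourself (a sketch; the actual contents of setup.sh may differ):

```bash
# Roughly what setup.sh does: make modules under src/ importable.
export PYTHONPATH="${PYTHONPATH}:$(pwd)/src"
```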
Create a `.env` file based on `.env-template`:

```bash
cp .env-template .env
```

Edit the `.env` file to configure:
- MongoDB connection (host, port, database name)
- PostgreSQL connection (host, port, database, credentials)
- MLflow server (host, port)
- Data source folders
- Model names and versions
Key variables to adjust:
- `MONGODB_HOST`, `MONGODB_PORT`: Your MongoDB deployment
- `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_USER`, `POSTGRES_PASSWORD`: Your PostgreSQL deployment
- `MLFLOW_HOST`, `MLFLOW_PORT`: Your MLflow tracking server
- Source folder paths: Point to your data directories
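The pipelines read these values through python-dotenv (a listed dependency). A minimal sketch of how a component might pick up the MongoDB settings; the defaults shown are illustrative assumptions:

```python
import os

from dotenv import load_dotenv  # python-dotenv, already in requirements.txt

load_dotenv()  # reads key=value pairs from .env into the process environment

# Variable names as in .env-template; the fallback defaults are assumptions.
mongodb_host = os.getenv("MONGODB_HOST", "localhost")
mongodb_port = int(os.getenv("MONGODB_PORT", "27017"))
print(f"MongoDB at {mongodb_host}:{mongodb_port}")
```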
If you don't have MongoDB and PostgreSQL instances, you can run them locally with Docker:
```bash
docker pull postgres
docker run -itd -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=bdm -p 5432:5432 postgres
```

Follow the MongoDB Docker installation guide.
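For a quick local MongoDB, a minimal sketch using the official image (default port 27017; the container name is an arbitrary choice):

```bash
docker pull mongo
docker run -itd -p 27017:27017 --name mongodb mongo
```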
Processes raw data and loads it into MongoDB with standardized schemas:
```bash
make formatted-pipeline
```

This pipeline:
- Reads data from the persistent landing zone
- Performs data cleaning and transformation
- Reconciles location data across datasets
- Loads formatted data into MongoDB collections
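For orientation, a minimal sketch of the final load step, assuming the MongoDB Spark connector (v10.x) is on the Spark classpath; the database, collection, and column names here are illustrative, not the pipeline's actual identifiers:

```python
from pyspark.sql import SparkSession

# Illustrative only: assumes the MongoDB Spark connector is available, e.g. via
# --packages org.mongodb.spark:mongo-spark-connector_2.12:10.3.0
spark = (
    SparkSession.builder.appName("formatted-zone-sketch")
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Toy stand-in for a cleaned, standardized DataFrame.
cleaned_df = spark.createDataFrame(
    [("el Raval", 850.0)], ["neighborhood", "price_per_m2"]
)

# Append the standardized records into a formatted-zone collection.
(
    cleaned_df.write.format("mongodb")
    .mode("append")
    .option("database", "formatted_zone")   # illustrative names
    .option("collection", "idealista")
    .save()
)
```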
Creates a star schema data warehouse in PostgreSQL:
```bash
make exploitation-multidim
```

This pipeline:
- Reads formatted data from MongoDB
- Creates dimension tables (locations, time periods, property types)
- Creates fact tables (income facts, property listings)
- Establishes foreign key relationships
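To illustrate the shape of the resulting schema, a minimal sketch with SQLAlchemy (a listed dependency); the table and column names are simplified assumptions, not the warehouse's actual DDL, and the credentials match the Docker example above:

```python
from sqlalchemy import (Column, Date, Float, ForeignKey, Integer, MetaData,
                        String, Table, create_engine)

metadata = MetaData()

# Dimension table: one row per location (names are illustrative).
dim_location = Table(
    "dim_location", metadata,
    Column("location_id", Integer, primary_key=True),
    Column("neighborhood", String, nullable=False),
    Column("district", String),
)

# Fact table referencing the dimension through a foreign key.
fact_listing = Table(
    "fact_listing", metadata,
    Column("listing_id", Integer, primary_key=True),
    Column("location_id", Integer, ForeignKey("dim_location.location_id")),
    Column("listed_on", Date),
    Column("price", Float),
)

engine = create_engine("postgresql+psycopg2://postgres:bdm@localhost:5432/postgres")
metadata.create_all(engine)  # issues CREATE TABLE statements in dependency order
```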
Trains a linear regression model to predict real estate prices:
```bash
make model-training
```

This pipeline:
- Loads property data from MongoDB
- Performs feature engineering (imputation, encoding, scaling)
- Trains a linear regression model
- Logs model and metrics to MLflow
- Registers the model for production use
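As a rough illustration of these steps with Spark ML and MLflow, a self-contained sketch; the column names and toy rows are assumptions, not the project's actual feature set:

```python
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import (Imputer, OneHotEncoder, StandardScaler,
                                StringIndexer, VectorAssembler)
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("training-sketch").getOrCreate()

# Toy stand-in for the property data loaded from MongoDB.
train_df = spark.createDataFrame(
    [(80.0, "flat", 250000.0), (120.0, "house", 410000.0), (None, "flat", 199000.0)],
    ["size", "property_type", "price"],
)

pipeline = Pipeline(stages=[
    Imputer(inputCols=["size"], outputCols=["size_imp"]),           # imputation
    StringIndexer(inputCol="property_type", outputCol="type_idx"),  # encoding
    OneHotEncoder(inputCols=["type_idx"], outputCols=["type_ohe"]),
    VectorAssembler(inputCols=["size_imp", "type_ohe"], outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),  # scaling
    LinearRegression(featuresCol="features", labelCol="price"),
])

with mlflow.start_run():
    model = pipeline.fit(train_df)
    mlflow.spark.log_model(model, "model")  # logs to the configured tracking server
```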
Applies the trained model to make predictions on new data:
```bash
make model-inference
```

Before running, configure in `.env`:
- `MONGODB_PREDICTION_COLLECTION`: Collection containing the data to predict on
- `MLFLOW_MODEL_VERSION`: Version of the registered model to use
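Under the hood, resolving a registered model version from MLflow looks roughly like this (a sketch; the model name and version shown are placeholders for your `.env` values):

```python
import mlflow

# models:/<name>/<version> resolves through the MLflow Model Registry.
model_name = "real-estate-price"  # placeholder: use your registered model name
model_version = 1                 # placeholder: see MLFLOW_MODEL_VERSION in .env
model = mlflow.spark.load_model(f"models:/{model_name}/{model_version}")

# new_data_df: a Spark DataFrame holding the records to score (assumed to exist).
predictions = model.transform(new_data_df)
```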
Format Python code with Black and isort:
```bash
make format
```

View all available commands:

```bash
make help
```

Project layout:
```
.
├── src/
│   ├── formatted_pipeline.py    # Main ETL pipeline for formatted zone
│   ├── data_formatters.py       # Data source handlers and formatters
│   ├── data_reconciliation.py   # Cross-dataset reconciliation logic
│   ├── exploitation/
│   │   ├── multidimensional.py  # Star schema warehouse creation
│   │   ├── train.py             # ML model training pipeline
│   │   └── predict.py           # ML model inference pipeline
│   ├── utils.py                 # Utility functions
│   └── logging_config.py        # Logging configuration
├── data/
│   └── persistent-landing-zone/ # Raw data storage
├── Makefile                     # Build automation
├── requirements.txt             # Python dependencies
├── .env-template                # Environment configuration template
└── README.md                    # This file
```
Main Python packages (see requirements.txt for full list):
- PySpark 3.5.1: Distributed data processing
- pandas 2.2.2: Data manipulation
- MLflow 2.13.0: ML experiment tracking and model registry
- SQLAlchemy 2.0.30: Database ORM
- psycopg2 2.9.9: PostgreSQL adapter
- python-dotenv 1.0.1: Environment configuration
- black 24.4.2 & isort 5.13.2: Code formatting
Ensure Java is properly installed and JAVA_HOME is set. The project expects Java to be available for Spark execution.
Verify that MongoDB and PostgreSQL are running and accessible at the configured hosts and ports. Check firewall settings if connecting to remote instances.
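A quick connectivity check from Python (a sketch; `psycopg2` is a listed dependency, while `pymongo` is assumed to be available; hosts and credentials match the Docker examples above):

```python
import psycopg2
from pymongo import MongoClient

# PostgreSQL: connect() raises OperationalError if the server is unreachable.
pg = psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="bdm", dbname="postgres")
pg.close()

# MongoDB: the "ping" command fails fast if the server does not answer.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000)
client.admin.command("ping")
print("Both databases reachable.")
```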
Ensure all data source folders specified in .env exist and contain the expected data files.
- Darryl Abraham
- Ákos Schneider
This project is part of the Big Data Management course.