Final project for the Data Engineering Zoomcamp 2026. This platform automates the ingestion, transformation, and analysis of UFC statistics to provide performance indicators and fight predictions based on historical data.
Note: fight predictions will be added in a future iteration of the project using Gemini and function calling.
Combat sports analysis, specifically for the UFC, often suffers from a lack of structured and accessible data for temporal analysis. This project solves this problem by:
- centralizing scattered UFC data (fighters, events, bout results) into a modern data warehouse.
- modeling fighter state changes (weight, rankings, record) via an SCD Type 2 approach to allow for accurate retrospective analysis.
- automating the end-to-end pipeline to offer decision-making dashboards for comparing fighting styles and predicting potential outcomes based on striking and grappling metrics.
The project follows a medallion-inspired lakehouse pattern, managing the lineage and data flow through three distinct layers:
Bruin automatically manages the entire pipeline within the project:
- raw/ layer: Python-based scrapers extract data directly from web sources into BigQuery tables (e.g., `raw.bouts`).
- staging/ layer: SQL transformations clean, cast, and deduplicate raw data while generating identifiers. `staging.bouts` depends on `raw.bouts`.
- DWH layer (i.e., core): dimensional models consolidate data for analysis. The `fact_bouts` table depends on `staging.bouts`, `dim_fighters`, and `dim_events` to ensure referential integrity.
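The staging-layer deduplication can be sketched in plain Python (the actual transformation runs as SQL in Bruin; the field names `bout_url` and `scraped_at` are hypothetical):

```python
# Sketch of the staging-layer dedup logic: keep the most recently
# scraped row per immutable bout URL. Field names are assumptions,
# not the pipeline's actual schema.

def deduplicate_bouts(raw_rows):
    """Return one row per bout URL, keeping the latest scrape."""
    latest = {}
    for row in raw_rows:
        key = row["bout_url"]  # immutable source URL acts as the natural key
        if key not in latest or row["scraped_at"] > latest[key]["scraped_at"]:
            latest[key] = row
    return list(latest.values())

raw = [
    {"bout_url": "http://example.com/fight/abc", "scraped_at": "2026-01-01", "result": "W"},
    {"bout_url": "http://example.com/fight/abc", "scraped_at": "2026-01-02", "result": "NC"},
]
rows = deduplicate_bouts(raw)
print(rows[0]["result"])  # NC — the re-scraped outcome wins
```

The same idea (latest row per source URL) is what makes re-fetching safe when an outcome changes after the fact.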
The data warehouse uses a star schema optimized for BigQuery:
- Surrogate keys: persistent identifiers are generated via a custom macro (using immutable source URLs), ensuring stable joins across the pipeline. Most of the time there were no suitable natural candidates for a stable surrogate key, so the URL serves as the SK. Additionally, I used `FARM_FINGERPRINT`, which is preferable as it returns an INT64 instead of a string hash.
- SCD Type 2: all dimensions (`dim_fighters`, `dim_events`) implement slowly changing dimension Type 2 via Bruin's `scd2_by_column` strategy. This allows point-in-time analysis (e.g., a fighter's weight class at the time of a specific historic fight). For `dim_fighters`, I would have loved to use a surrogate key based on (firstname, lastname); however, female fighters might change their last name when married, so duplicates are possible. Had I fetched the fighter's birthday, it could have made a suitable SK.
- Fact table optimization: the `fact_bouts` table is partitioned by `date` to reduce scan costs and clustered by `event_sk` and `fighter_sk` to accelerate common joins in downstream dashboards. Even though the dataset is currently small, I plan to add data from other providers, so implementing best practices for large-scale data now minimizes technical debt.
- Incremental strategy: all transformations use an incremental `time_interval` approach, processing only new data daily to minimize compute consumption. Additionally, if a fighter fails a doping test and a bout's outcome changes (e.g., from a win to a no contest), we can simply re-fetch the data.
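The point-in-time lookups that SCD Type 2 enables can be illustrated with a small Python sketch (the `valid_from`/`valid_to` column names are assumptions, not Bruin's actual SCD2 schema):

```python
# Minimal SCD Type 2 point-in-time lookup sketch. Each dimension row is
# a version of the fighter, valid over [valid_from, valid_to).
# Column names and sample data are hypothetical.
from datetime import date

dim_fighters = [
    {"fighter_sk": 1, "weight_class": "Lightweight",
     "valid_from": date(2020, 1, 1), "valid_to": date(2023, 6, 1)},
    {"fighter_sk": 1, "weight_class": "Welterweight",
     "valid_from": date(2023, 6, 1), "valid_to": date(9999, 12, 31)},
]

def version_as_of(rows, sk, as_of):
    """Return the dimension row that was valid on a given date."""
    for row in rows:
        if row["fighter_sk"] == sk and row["valid_from"] <= as_of < row["valid_to"]:
            return row
    return None

# Weight class at the time of a 2022 bout vs. a 2024 bout:
print(version_as_of(dim_fighters, 1, date(2022, 3, 14))["weight_class"])  # Lightweight
print(version_as_of(dim_fighters, 1, date(2024, 3, 14))["weight_class"])  # Welterweight
```

Joining `fact_bouts` to the dimension on the SK plus the validity window is what makes retrospective analysis accurate even after a fighter changes weight class.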
- Orchestration: Bruin (SQL & Python assets) with Cloud Scheduler & Cloud Run.
- Data warehouse: BigQuery (storage & compute), as its storage cost is comparable to Cloud Storage buckets.
- Infrastructure: Terraform & Terragrunt for multi-environment management.
- Environments: local (DuckDB), staging (GCP), production (GCP).
- Security: GitHub Workload Identity Federation (WIF) for keyless authentication. Nowadays, generating a service account key as a .json file is too dangerous.
```shell
pip install uv
curl -LsSf https://getbruin.com/install/cli | sh
uv sync --dev
```

Install pre-commit (to get a production-ready coding experience):

```shell
uv --directory app run prek install
```

Each time you commit, you will see this, for instance:

```
Ruff Format..........................................(no files to check)Skipped
Ruff Lint............................................(no files to check)Skipped
Bruin Format sql files & asset definitions...............................Passed
Bruin Validate...........................................................Passed
Terragrunt Format........................................................Passed
```
```shell
# Source environment (within app/)
source .venv/bin/activate

# Validate the pipeline
bruin validate app/

# Formatting
uv run ruff format
bruin format app/ --sqlfluff

# Staging execution
bruin run app --environment staging
```

Infrastructure is fully automated via Terragrunt:
```shell
cd iac/staging  # or production
terragrunt run-all apply
```

NB 1: Make sure to update the `project_id` in `.bruin.yml` and `iac/*/env.hcl`.
NB 2: Make sure to activate all GCP services before running Terragrunt: `gcloud services enable ...`
The project uses Workload Identity Federation. Configure your GitHub environments with:
- 2 environments (staging & production)
- required secrets:
  - `GCP_WID_PROVIDER`: WID provider identifier.
  - `GCP_SA_EMAIL`: Terraform service account email.
The `.github/workflows/ci.yml` workflow automatically handles Docker builds, pushing to Artifact Registry, and deploying to Cloud Run.
An interactive interface to explore the data. Generate a JSON key for your service account and use it through Streamlit Cloud; otherwise, use the link below.
The dashboard contains multiple analysis tiles (KO trends, performance by weight class, finish rates, etc.).
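A tile like "finish rates" boils down to a simple aggregation over bout outcomes. A hedged Python sketch (the real dashboard queries `fact_bouts` in BigQuery; the `method` values here are assumptions):

```python
# Hypothetical sketch of the "finish rate" tile computation: the share
# of bouts ending before the final bell (KO/TKO or submission).
# Method codes are assumptions, not the warehouse's actual values.

def finish_rate(bouts):
    """Fraction of bouts won by KO/TKO or submission."""
    if not bouts:
        return 0.0
    finishes = sum(1 for b in bouts if b["method"] in ("KO/TKO", "SUB"))
    return finishes / len(bouts)

sample = [{"method": "KO/TKO"}, {"method": "DEC"}, {"method": "SUB"}, {"method": "DEC"}]
print(finish_rate(sample))  # 0.5
```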
- Slack webhook alerts for pipeline failures.
- Support for other organizations (Hexagone MMA, ARES, PFL, KSW, Bellator), e.g., inside raw/providers -> ares/, ...
- Gemini integration for advanced predictive analysis.



