Blueprint for Porting Apache Airflow Breeze to Apache Sedona #2993

@jbampton

We call it Airflow Breeze as it's a breeze to contribute to Airflow.

I have worked with Apache Airflow, the Breeze Docker environment and the pre-commit / prek hooks system.

https://github.com/apache/airflow/blob/main/dev/breeze/doc/README.rst

refs #2729
refs #2202


Gemini is AI and can make mistakes so double check it


What is Breeze

Apache Airflow Breeze is a Python-based CLI development environment designed to streamline workflows for project contributors and maintainers. It eliminates the need to manually manage complex Docker topologies and conflicting Python dependencies by encapsulating them into a single platform.

Because it uses the same Docker images and configuration matrices locally as the GitHub Actions CI pipelines, the local environment closely matches CI. This consistency limits environment drift and reduces the CI compute wasted on failures that only show up after a pull request is pushed.

Developers can instantly swap runtime environments, including various Python versions and backend databases like PostgreSQL or MySQL, using simple CLI flags. The environment isolates all external provider dependencies inside containers, keeping the host machine clean and automatically updating layers when files like pyproject.toml change.

Breeze also live-mounts the local host repository into the workspace, meaning any code edits made in an IDE reflect immediately inside the running container without requiring a rebuild. For testing frontend components, it features built-in background asset compilation utilities to dynamically handle UI modifications.

Code quality is strictly maintained through a unified static analysis framework powered by pre-commit for linting and security checks. Finally, it includes direct tooling to compile documentation locally and spin up Kubernetes test environments using Kind commands.


Blueprint for Porting Apache Airflow Breeze to Apache Sedona

Porting Apache Airflow Breeze (Airflow's containerized development environment) over to Apache Sedona creates a standardized, reproducible contributor environment.

Since Breeze is essentially a complex orchestration wrapper built on top of Docker, Docker Compose, and Python's click library, you aren't porting APIs; you are adapting its container architecture and CLI harness to handle Spark, Java/Scala dependencies, and geospatial libraries instead of Airflow backends.


1. Deconstruct the Architecture

Airflow Breeze works by building a heavyweight CI Docker image, mounting your local git repository into the container, and providing a unified CLI (./breeze) to handle testing, linting, and image building.

To port this concept to Sedona, you need to swap the dependencies:

  • CLI Framework: Python click -> Python click (retaining the framework)
  • Containers: Webserver, Scheduler, Celery, Postgres/MySQL -> Spark Master, Spark Worker, Jupyter Notebook/Lab
  • Base Image: Debian/Ubuntu + Python + Airflow system deps -> Ubuntu + Java (17/21) + Spark + Python (uv/pip)
  • Core Matrix: Python versions x Backend Databases -> Python versions x Spark versions x Scala versions
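
As a rough illustration of the Sedona-side matrix, the sketch below enumerates Python x Spark x Scala combinations with itertools.product. The version lists, the BuildCombo class, and the sedona-ci tag scheme are all placeholders invented for this example; the real supported set would have to come from Sedona's CI configuration.

```python
from dataclasses import dataclass
from itertools import product

# Placeholder version lists (assumptions, not Sedona's official support matrix).
PYTHON_VERSIONS = ["3.10", "3.11"]
SPARK_VERSIONS = ["3.4", "3.5"]
SCALA_VERSIONS = ["2.12", "2.13"]


@dataclass(frozen=True)
class BuildCombo:
    """One cell of the Python x Spark x Scala matrix."""
    python: str
    spark: str
    scala: str

    @property
    def image_tag(self) -> str:
        # Hypothetical tag scheme for the CI image.
        return f"sedona-ci:py{self.python}-spark{self.spark}-scala{self.scala}"


def build_matrix(pythons=PYTHON_VERSIONS, sparks=SPARK_VERSIONS, scalas=SCALA_VERSIONS):
    """Return every Python x Spark x Scala combination as a BuildCombo."""
    return [BuildCombo(p, s, c) for p, s, c in product(pythons, sparks, scalas)]
```

A real port would probably also filter out combinations that Spark itself does not publish, which this sketch does not attempt.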

2. Step-by-Step Porting Guide

Step 1: Isolate the CLI Skeleton

Airflow's Breeze CLI code lives in dev/breeze/, so you don't need the entire Airflow repository. You can copy the structure of dev/breeze/src/airflow_breeze/ into Sedona's repository (e.g., dev/sedona_breeze/).

Keep the foundational click structure, but strip out Airflow-specific commands like start-airflow, setup-kvm, or k8s. Focus on defining these core commands instead:

  • breeze shell – Drops the developer into an interactive shell inside a container pre-configured with Spark and Sedona.
  • breeze test – Runs pytest or the Maven test suite (mvn test, since Sedona builds with Maven rather than sbt) inside the containerized environment.
  • breeze build-image – Compiles the local developer Docker image.
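
The three commands above could be laid out roughly as follows. To keep the sketch free of third-party dependencies it uses the standard library's argparse as a stand-in for click; a real port would keep Breeze's click groups. The option defaults are placeholders taken from the example invocation later in this issue, and the `paths` argument on `test` is a hypothetical addition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a 'breeze' parser with shell, test and build-image subcommands."""
    parser = argparse.ArgumentParser(prog="breeze")

    # Matrix options shared by every subcommand (defaults are placeholders).
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--python", default="3.11")
    common.add_argument("--spark", default="3.5")
    common.add_argument("--scala", default="2.12")

    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("shell", parents=[common],
                   help="Open an interactive shell in the Spark/Sedona container")
    test = sub.add_parser("test", parents=[common],
                          help="Run the test suite inside the container")
    test.add_argument("paths", nargs="*", help="Optional test paths (hypothetical)")
    sub.add_parser("build-image", parents=[common],
                   help="Build the local developer Docker image")
    return parser


def main(argv=None) -> argparse.Namespace:
    """Parse arguments only; a real implementation would dispatch to Docker here."""
    return build_parser().parse_args(argv)
```

For example, `main(["shell", "--spark", "3.5"])` returns a namespace whose `command` is `"shell"`, which a dispatcher could then map to the matching Docker invocation.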

Step 2: Redesign the Dockerfile (Dockerfile.ci)

Airflow's Breeze Dockerfile is highly optimized for multi-stage caching. You will need to write a new Dockerfile.ci tailored for Sedona's unique dual-language setup (Java/Scala + Python).

Your base image layer needs to manage:

  1. Java Runtime: Install your required JDK (e.g., Temurin or Zulu OpenJDK 17).
  2. Spark Binaries: Download and unpack the targeted Apache Spark version matching your matrix.
  3. Geospatial Libraries: Install underlying native system dependencies if required (like libgeos-dev or proj-bin for certain Python bindings), though Sedona handles most geometry primitives natively via JTS.
  4. Python Tooling: Install uv or pip alongside the targeted Python version so the apache-sedona Python package can be installed in editable mode (pip install -e ".[spark]").

Step 3: Map the Volume Mounts and Environments

The magic of Breeze is that code changes on your host machine instantly update inside the container. In your ported Docker Compose configuration (docker-compose.yaml generated dynamically by your script), make sure to map:

  • The Sedona root repository to /opt/sedona
  • Local Maven/SBT caches (~/.m2 and ~/.sbt) to container paths to prevent downloading JARs (like the geotools-wrapper) on every restart.
  • Local uv or pip cache directories.
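
A minimal sketch of how the CLI might turn these mounts into `docker run -v` arguments. Only /opt/sedona comes from the list above; the container-side cache paths assume the container runs as root, and the host-side uv and pip cache locations assume Linux defaults. The function name is hypothetical.

```python
from pathlib import Path


def volume_args(repo_root: Path, home: Path) -> list[str]:
    """Build `-v host:container` arguments for the repository and cache mounts."""
    mounts = {
        repo_root: "/opt/sedona",                    # from the list above
        home / ".m2": "/root/.m2",                   # assumed container path
        home / ".sbt": "/root/.sbt",                 # assumed container path
        home / ".cache" / "uv": "/root/.cache/uv",   # assumed Linux default
        home / ".cache" / "pip": "/root/.cache/pip", # assumed Linux default
    }
    args: list[str] = []
    for host_path, container_path in mounts.items():
        args += ["-v", f"{host_path}:{container_path}"]
    return args
```

The same mapping could just as well be rendered into the `volumes:` section of a generated Docker Compose file; building a flat argument list is simply the shortest way to show the idea.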

Step 4: Adapt the Matrix Generation

Airflow Breeze uses environment variables to dynamically construct container tags and combinations (e.g., Python 3.9 + Postgres 15).

For Sedona, rewrite the shell and build parameters to accept a different matrix:

# Example of what your ported CLI options should look like
./breeze shell --python 3.11 --spark 3.5 --scala 2.12

Your Python script will catch these parameters and feed them as --build-arg strings to Docker:

  • SPARK_VERSION=${SPARK_VERSION}
  • SCALA_VERSION=${SCALA_VERSION}
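
One way the script could assemble that `docker build` call is sketched below. SPARK_VERSION and SCALA_VERSION come from the list above; PYTHON_VERSION, the sedona-ci tag scheme, and the function name are assumptions made for illustration, and the Dockerfile would need matching ARG declarations.

```python
def docker_build_command(python: str, spark: str, scala: str,
                         dockerfile: str = "Dockerfile.ci",
                         context: str = ".") -> list[str]:
    """Return the argv for `docker build` with one --build-arg per matrix option."""
    build_args = {
        "PYTHON_VERSION": python,  # assumed ARG name
        "SPARK_VERSION": spark,
        "SCALA_VERSION": scala,
    }
    cmd = ["docker", "build", "--file", dockerfile]
    for name, value in build_args.items():
        cmd += ["--build-arg", f"{name}={value}"]
    # Hypothetical tag scheme; the build context path goes last.
    cmd += ["--tag", f"sedona-ci:py{python}-spark{spark}-scala{scala}", context]
    return cmd
```

Returning a list rather than a shell string lets the caller pass it straight to `subprocess.run(cmd, check=True)` without quoting issues.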

3. Recommended Directory Blueprint

When restructuring the copied Breeze code inside Sedona, aim for this simplified file structure to keep your build system maintainable:

dev/sedona_breeze/
├── BREEZE.rst                        # Developer documentation
├── breeze                             # Main executable entrypoint script
├── pyproject.toml                     # Python dependencies for the CLI tool (Click, Rich)
├── setup.cfg
└── src/
    └── sedona_breeze/
        ├── __init__.py
        ├── main.py                    # Root Click group configuration
        ├── commands/
        │   ├── developer_commands.py  # 'shell', 'test', 'jupyter' logic
        │   └── ci_commands.py         # 'build-image', 'pull-image' logic
        └── utils/
            ├── docker_command_utils.py # Wrapper logic for spinning up docker/docker-compose
            └── path_utils.py          # Finds Sedona repository root paths
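
As an example of what path_utils.py might contain, the sketch below walks upward from a starting directory until it finds the repository root. Treating "a directory that holds both .git and pom.xml" as the root is an assumption of this sketch, as is the function name.

```python
from pathlib import Path
from typing import Optional


def find_sedona_root(start: Optional[Path] = None, marker: str = "pom.xml") -> Path:
    """Walk upward from `start` to the first directory containing .git and `marker`."""
    current = (start or Path.cwd()).resolve()
    for candidate in [current, *current.parents]:
        # .git may be a directory or a file (git worktrees), so only test existence.
        if (candidate / ".git").exists() and (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No repository root found above {current}")
```

Requiring both markers avoids stopping early in a submodule directory that has its own pom.xml.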


4. Key Pitfalls to Avoid

  • Over-complicating the Backend: Airflow Breeze spins up multiple companion services (Postgres, MySQL, Celery queues). Sedona does not need this. Your Docker Compose file should be lean: usually a single container running a local Spark environment, or a 3-node layout (1 Master, 2 Workers) if testing distributed spatial partitioning behavior.
  • Ignoring the Fat JARs: Sedona's Python bindings depend on the shaded JAR built from the Scala/Java code (sedona-spark-shaded). Ensure your breeze test or breeze shell commands include an automated pre-hook step that runs the Maven build (for example mvn clean package) before starting the Python environment, so that Spark loads the freshly built local JAR (for example via spark.jars, or via spark.jars.packages once mvn install has placed the artifact in the local Maven repository) rather than a stale or published one.
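
The pre-hook from the second pitfall could avoid rebuilding on every start by comparing timestamps first. The sketch below is one possible heuristic, not an existing Sedona or Breeze feature: the function names are hypothetical, it only looks at .java and .scala files, and the -DskipTests flag is an added assumption to keep the hook fast.

```python
from pathlib import Path


def jar_is_stale(jar: Path, source_dirs: list[Path]) -> bool:
    """Return True if `jar` is missing or older than any .java/.scala source file."""
    if not jar.exists():
        return True
    jar_mtime = jar.stat().st_mtime
    for src_dir in source_dirs:
        for path in src_dir.rglob("*"):
            if path.suffix in {".java", ".scala"} and path.stat().st_mtime > jar_mtime:
                return True
    return False


def prehook_command(jar: Path, source_dirs: list[Path],
                    build_cmd=("mvn", "clean", "package", "-DskipTests")):
    """Return the build command to run before starting Python, or None if up to date."""
    return list(build_cmd) if jar_is_stale(jar, source_dirs) else None
```

A caller would run the returned command with `subprocess.run(..., check=True)` when it is not None, then continue into the shell or test step.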
