In [170]:
%%writefile src\orchestrators\learning-airflow\README.md


# Airflow Learning Project

This project demonstrates best practices for creating modular, maintainable Airflow DAGs using Airflow 3.0's TaskFlow API and AssetWatcher for event-driven scheduling.

## Project Structure

```
learning-airflow/
├── dags/                     # DAG definition files 
│   ├── example_etl_galaxies.py     # Example ETL DAG that processes galaxy data
│   ├── gtfs_data_pipeline.py       # DAG that processes GTFS realtime transit data
│   ├── nba_ingest_pipeline.py      # DAG that processes NBA game data
│   ├── user_metrics_etl.py         # Simple ETL DAG example for user metrics
│   └── weather_kafka_pipeline.py   # DAG that processes weather data from Kafka
├── include/                  # Shared code and resources
│   ├── custom_functions/          # Modularized DAG functions
│   │   ├── galaxy_functions.py     # Functions for galaxy data processing
│   │   ├── gtfs_functions.py       # Functions for GTFS data processing
│   │   ├── nba_functions.py        # Functions for NBA data processing
│   │   ├── user_metrics_functions.py # Functions for user metrics processing
│   │   └── weather_functions.py    # Functions for weather data processing
│   ├── data/                      # Sample data files
│   └── astronomy.db               # DuckDB database for the example DAG
├── tests/                    # Test files
├── .astro/                   # Astro CLI configuration
├── Dockerfile                # Custom Dockerfile for this project
├── packages.txt              # System-level dependencies
├── README.md                 # This file
└── requirements.txt          # Python dependencies
```

## DAG Modularization

Each DAG in this project follows best practices for code organization:

1. **Separation of concerns**: Core data processing logic is separated from DAG flow definition
2. **Modular functions**: Each DAG has corresponding helper functions in `include/custom_functions/`
3. **Reusable components**: Common patterns are extracted into helper classes
4. **Well-documented**: DAGs and functions include descriptive docstrings
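
The split between flow definition and processing logic can be sketched as follows. This is a minimal illustration, not code from the project: `filter_by_distance`, the module name in the comment, and the sample distances are all hypothetical.

```python
# include/custom_functions/example_functions.py (hypothetical module name)
def filter_by_distance(galaxies: list[dict], max_distance_ly: int) -> list[dict]:
    """Keep galaxies closer than max_distance_ly.

    Pure Python with no Airflow imports, so it can be unit-tested on its own.
    """
    return [g for g in galaxies if g["distance_from_milkyway"] < max_distance_ly]


# A DAG file in dags/ would then only describe the flow, for example:
#
#     @task
#     def transform(galaxies):
#         return filter_by_distance(galaxies, 500_000)

# Illustrative sample rows (made-up values for the sketch)
sample = [
    {"name": "Andromeda", "distance_from_milkyway": 2_500_000},
    {"name": "Large Magellanic Cloud", "distance_from_milkyway": 160_000},
]
close = filter_by_distance(sample, 500_000)
print([g["name"] for g in close])
```

Because the helper has no Airflow dependency, it can be tested without starting the scheduler.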

## Example DAGs

### Galaxy ETL Example

A simple ETL pipeline that extracts galaxy data, filters based on distance, and loads into a DuckDB database.

```python
# Use like this:
from include.custom_functions.galaxy_functions import get_galaxy_data
```

### GTFS Data Pipeline

Processes GTFS Realtime transit data and stores it in a configurable backend (S3, GCS, BigQuery, or DuckDB). Azure Blob storage is stubbed but not yet implemented.

```python
# Use like this:
from include.custom_functions.gtfs_functions import GTFSProcessor
```

### NBA Ingest Pipeline

Demonstrates event-driven processing with AssetWatcher, fetching NBA game data and loading into PostgreSQL.

```python
# Use like this:
from include.custom_functions.nba_functions import NBAProcessor
```

### User Metrics ETL

Simple ETL example for processing user metrics.

```python
# Use like this:
from include.custom_functions.user_metrics_functions import UserMetricsProcessor
```

### Weather Kafka Pipeline

Shows how to consume Kafka messages, process weather data, and save to PostgreSQL with analytics.

```python
# Use like this:
from include.custom_functions.weather_functions import WeatherProcessor
```

## Running the DAGs

### Using Astro CLI

This project is configured for the Astro CLI, making it easy to run locally:

```bash
# Start the project
astro dev start

# Access the Airflow UI
# Open http://localhost:8080 in your browser
# Default credentials: admin/admin
```

### Using Docker

You can also run the project using Docker directly:

```bash
docker-compose up -d
```

### Environment Variables

Environment variables can be set in a `.env` file at the project root or passed to the Airflow containers.

## Testing

Execute tests using the following commands:

```bash
# Run all tests
astro dev pytest

# Run a specific test
astro dev pytest tests/dags/test_dag_example.py
```

## Contributing

When adding a new DAG:

1. Create a new DAG file in the `dags/` directory
2. Create a module with helper functions in `include/custom_functions/`
3. Add tests in the `tests/` directory
4. Update this README with relevant information

## License

This project is licensed under the MIT License.


Overwriting src\orchestrators\learning-airflow\README.md


In [171]:
%%writefile src\orchestrators\learning-airflow\Dockerfile
FROM astrocrpublic.azurecr.io/runtime:3.0-2


Overwriting src\orchestrators\learning-airflow\Dockerfile


!astro version

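# uncomment and cd to wherever you want to create the Airflow environment
#cd 

!astro dev init --from-template learning-airflow

# or
# (`!cd` runs in a throwaway subshell, so use the %cd magic to change directory for the next command)
!mkdir hello-astro
%cd hello-astro
!astro dev init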


!astro dev start

# Once healthy, open your browser to http://localhost:8080/ to access the Airflow UI. Your DAGs (including example-astronauts) appear under the DAGs view.




In [172]:
%%writefile src\orchestrators\learning-airflow\requirements.txt

# Astro Runtime includes the following pre-installed providers packages: https://docs.astronomer.io/astro/runtime-image-architecture#provider-packages

apache-airflow[postgres]
apache-airflow-providers-apache-kafka
apache-airflow-providers-amazon
apache-airflow-providers-google
apache-airflow-providers-microsoft-azure
apache-airflow-providers-postgres
apache-airflow-providers-common-sql
# apache-airflow-upgrade-check  # Tool for checking Airflow 3 upgrade compatibility
# airflow upgrade_check

duckdb>=1.0.0
pandas>=2.0.0
tabulate
pyarrow>=10.0.0
requests>=2.0.0
python-dotenv>=1.0.0
confluent-kafka>=2.0.0
kafka-python
protobuf
gtfs-realtime-bindings
nba_api  # imported by include/custom_functions/nba_functions.py
psycopg2-binary
ruff

Overwriting src\orchestrators\learning-airflow\requirements.txt


In [173]:
%%writefile src\orchestrators\learning-airflow\dags\example_etl_galaxies.py
"""
## Galaxies ETL example DAG

This example demonstrates an ETL pipeline using Airflow.
The pipeline mocks data extraction for data about galaxies using a modularized
function, filters the data based on the distance from the Milky Way, and loads the
filtered data into a DuckDB database.
"""
# This DAG uses the TaskFlow API. See: https://www.astronomer.io/docs/learn/airflow-decorators
from airflow.sdk import Asset, chain, Param, dag, task
from airflow.timetables.trigger import MultipleCronTriggerTimetable  # 🎉 Airflow 3 timetable
from airflow.models import Variable  # 🔒 For centralized credentials
from pendulum import datetime, duration
from tabulate import tabulate
import pandas as pd
import duckdb
import logging
import os

# modularize code by importing functions from the include folder
from include.custom_functions.galaxy_functions import get_galaxy_data

# use the Airflow task logger to log information to the task logs (or use print())
t_log = logging.getLogger("airflow.task")

# define variables used in a DAG as environment variables in .env for your whole Airflow instance
# to standardize your DAGs
_DUCKDB_INSTANCE_NAME = Variable.get("DUCKDB_INSTANCE_NAME", default_var=os.getenv("DUCKDB_INSTANCE_NAME", "include/astronomy.db"))  # 🔒 Use Variable.get for configuration
_DUCKDB_TABLE_NAME = Variable.get("DUCKDB_TABLE_NAME", default_var=os.getenv("DUCKDB_TABLE_NAME", "galaxy_data"))  # 🔒 Use Variable.get for configuration
_DUCKDB_TABLE_URI = f"duckdb://{_DUCKDB_INSTANCE_NAME}/{_DUCKDB_TABLE_NAME}"
# Variables and environment variables are returned as strings, so cast numeric settings to int
_CLOSENESS_THRESHOLD_LY_DEFAULT = int(Variable.get("CLOSENESS_THRESHOLD_LY_DEFAULT", default_var=os.getenv("CLOSENESS_THRESHOLD_LY_DEFAULT", 500000)))  # 🔒 Use Variable.get for configuration
_CLOSENESS_THRESHOLD_LY_PARAMETER_NAME = "closeness_threshold_light_years"
_NUM_GALAXIES_TOTAL = int(Variable.get("NUM_GALAXIES_TOTAL", default_var=os.getenv("NUM_GALAXIES_TOTAL", 20)))  # 🔒 Use Variable.get for configuration

# -------------- #
# DAG Definition #
# -------------- #


# instantiate a DAG with the @dag decorator and set DAG parameters (see: https://www.astronomer.io/docs/learn/airflow-dag-parameters)
@dag(
    start_date=datetime(2025, 4, 1),  # date after which the DAG can be scheduled
    schedule=MultipleCronTriggerTimetable(  # 🎉 Using MultipleCronTriggerTimetable for multiple schedules
        "0 10 * * *",  # 10 AM daily
        "0 14 * * 1-5",  # 2 PM weekdays
        timezone="UTC"
    ),
    dag_display_name="Galaxy ETL 🚀",  # 📌 dag_display_name improves UI discoverability
    max_consecutive_failed_dag_runs=5,  # auto-pauses the DAG after 5 consecutive failed runs, experimental
    max_active_runs=1,  # only allow one concurrent run of this DAG, prevents parallel DuckDB calls
    doc_md=__doc__,  # add DAG Docs in the UI, see https://www.astronomer.io/docs/learn/custom-airflow-ui-docs-tutorial
    default_args={
        "owner": "Astro",  # owner of this DAG in the Airflow UI
        "retries": 3,  # tasks retry 3 times before they fail
        "retry_delay": duration(seconds=30),  # tasks wait 30s in between retries
    },  # default_args are applied to all tasks in a DAG
    tags=["example", "ETL"],  # add tags in the UI
    params={  # Airflow params can add interactive options on manual runs. See: https://www.astronomer.io/docs/learn/airflow-params
        _CLOSENESS_THRESHOLD_LY_PARAMETER_NAME: Param(
            _CLOSENESS_THRESHOLD_LY_DEFAULT,
            type="number",
            title="Galaxy Closeness Threshold",
            description="Set how close galaxies need to be to the Milky Way in order to be loaded to DuckDB.",
        )
    },
    # Warning: this file-based DuckDB database is local to one machine and is not shared between workers.
    # To move this workflow into production, use a cloud-based database and adjust max_active_runs and
    # max_consecutive_failed_dag_runs (set above) to match its concurrency capabilities.
    is_paused_upon_creation=False,  # start running the DAG as soon as it's created
)
def example_etl_galaxies():  # by default the dag_id is the name of the decorated function

    # ---------------- #
    # Task Definitions #
    # ---------------- #
    # the @task decorator turns any Python function into an Airflow task
    # any @task decorated function that is called inside the @dag decorated
    # function is automatically added to the DAG.
    # if one exists for your use case, you can still use traditional Airflow operators
    # and mix them with @task decorators. Check out registry.astronomer.io for available operators
    # see: https://www.astronomer.io/docs/learn/airflow-decorators for information about @task
    # see: https://www.astronomer.io/docs/learn/what-is-an-operator for information about traditional operators

    @task(retries=2)  # you can override default_args at the task level
    def create_galaxy_table_in_duckdb(  # by default the name of the decorated function is the task_id
        duckdb_instance_name: str = _DUCKDB_INSTANCE_NAME,
        table_name: str = _DUCKDB_TABLE_NAME,
    ) -> None:
        """
        Create a table in DuckDB to store galaxy data.
        This task simulates a setup step in an ETL pipeline.
        Args:
            duckdb_instance_name: The name of the DuckDB instance.
            table_name: The name of the table to be created.
        """

        t_log.info("Creating galaxy table in DuckDB.")

        os.makedirs(os.path.dirname(duckdb_instance_name), exist_ok=True)

        cursor = duckdb.connect(duckdb_instance_name)

        cursor.execute(
            f"""
            CREATE TABLE IF NOT EXISTS {table_name} (
                name STRING PRIMARY KEY,
                distance_from_milkyway INT,
                distance_from_solarsystem INT,
                type_of_galaxy STRING,
                characteristics STRING
            )"""
        )
        cursor.close()

        t_log.info(f"Table {table_name} created in DuckDB.")

    @task
    def extract_galaxy_data(num_galaxies: int = _NUM_GALAXIES_TOTAL) -> pd.DataFrame:
        """
        Retrieve data about galaxies.
        This task simulates an extraction step in an ETL pipeline.
        Args:
            num_galaxies (int): The number of galaxies for which data should be returned.
            Default is 20. Maximum is 20.
        Returns:
            pd.DataFrame: A DataFrame containing data about galaxies.
        """

        galaxy_df = get_galaxy_data(num_galaxies)

        return galaxy_df

    @task
    def transform_galaxy_data(galaxy_df: pd.DataFrame, **context):
        """
        Filter the galaxy data based on the distance from the Milky Way.
        This task simulates a transformation step in an ETL pipeline.
        The distance threshold is read from the DAG param
        `closeness_threshold_light_years` (500,000 light years unless overridden).
        Args:
            galaxy_df (pd.DataFrame): The galaxy data to filter.
        Returns:
            pd.DataFrame: A DataFrame containing filtered galaxy data.
        """

        # retrieve param values from the context
        closeness_threshold_light_years = context["params"][
            _CLOSENESS_THRESHOLD_LY_PARAMETER_NAME
        ]

        t_log.info(
            f"Filtering for galaxies closer than {closeness_threshold_light_years} light years."
        )

        filtered_galaxy_df = galaxy_df[
            galaxy_df["distance_from_milkyway"] < closeness_threshold_light_years
        ]

        return filtered_galaxy_df

    @task(
        outlets=[Asset(_DUCKDB_TABLE_URI)]
    )  # Define that this task produces updates to an Airflow Asset (called a Dataset in Airflow 2).
    # Downstream DAGs can be scheduled based on combinations of Asset updates
    # coming from tasks in the same Airflow instance or calls to the Airflow API.
    # See: https://www.astronomer.io/docs/learn/airflow-datasets
    def load_galaxy_data(
        filtered_galaxy_df: pd.DataFrame,
        duckdb_instance_name: str = _DUCKDB_INSTANCE_NAME,
        table_name: str = _DUCKDB_TABLE_NAME,
    ):
        """
        Load the filtered galaxy data into a DuckDB database.
        This task simulates a loading step in an ETL pipeline.
        Args:
            filtered_galaxy_df (pd.DataFrame): The filtered galaxy data to be loaded.
            duckdb_instance_name (str): The name of the DuckDB instance.
            table_name (str): The name of the table to load the data into.
        """

        t_log.info("Loading galaxy data into DuckDB.")
        cursor = duckdb.connect(duckdb_instance_name)
        cursor.sql(
            f"INSERT OR IGNORE INTO {table_name} BY NAME SELECT * FROM filtered_galaxy_df;"
        )
        t_log.info("Galaxy data loaded into DuckDB.")

    @task
    def print_loaded_galaxies(
        duckdb_instance_name: str = _DUCKDB_INSTANCE_NAME,
        table_name: str = _DUCKDB_TABLE_NAME,
    ):
        """
        Get the galaxies stored in the DuckDB table, which were filtered
        based on closeness to the Milky Way, and print them to the logs.
        This task returns nothing; the table is only written to the task log.
        Args:
            duckdb_instance_name (str): The name of the DuckDB instance
            where the table is stored.
            table_name (str): The name of the table to read from.
        """

        cursor = duckdb.connect(duckdb_instance_name)
        near_galaxies_df = cursor.sql(f"SELECT * FROM {table_name};").df()
        near_galaxies_df = near_galaxies_df.sort_values(
            by="distance_from_milkyway", ascending=True
        )
        t_log.info(tabulate(near_galaxies_df, headers="keys", tablefmt="pretty"))

    # ------------------------------------ #
    # Calling tasks + Setting dependencies #
    # ------------------------------------ #

    # each call of a @task decorated function creates one task in the Airflow UI
    # passing the return value of one @task decorated function to another one
    # automatically creates a task dependency
    create_galaxy_table_in_duckdb_obj = create_galaxy_table_in_duckdb()
    extract_galaxy_data_obj = extract_galaxy_data()
    transform_galaxy_data_obj = transform_galaxy_data(extract_galaxy_data_obj)
    load_galaxy_data_obj = load_galaxy_data(transform_galaxy_data_obj)

    # you can set explicit dependencies using the chain function (or bit-shift operators)
    # See: https://www.astronomer.io/docs/learn/managing-dependencies
    chain(
        create_galaxy_table_in_duckdb_obj, load_galaxy_data_obj, print_loaded_galaxies()
    )


# Instantiate the DAG
example_etl_galaxies()



Overwriting src\orchestrators\learning-airflow\dags\example_etl_galaxies.py
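
Settings read from Airflow Variables or environment variables arrive as strings, while `Param(type="number")` and `num_galaxies: int` expect numbers. A minimal sketch of coercing such settings, using only `os.getenv`; `read_int_setting` is a hypothetical helper and not part of the project:

```python
import os


def read_int_setting(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to default."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError as exc:
        # Fail loudly at parse time instead of passing a bad value to a task
        raise ValueError(f"Setting {name!r} must be an integer, got {raw!r}") from exc


# Demo: one setting present as a string, one missing
os.environ["NUM_GALAXIES_TOTAL"] = "12"
os.environ.pop("CLOSENESS_THRESHOLD_LY_DEFAULT", None)

num = read_int_setting("NUM_GALAXIES_TOTAL", 20)
thr = read_int_setting("CLOSENESS_THRESHOLD_LY_DEFAULT", 500_000)
print(num, thr)
```

The same coercion could wrap a `Variable.get(...)` result, since that also returns a string when the Variable is set.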


In [174]:
%%writefile src\orchestrators\learning-airflow\include\custom_functions\gtfs_functions.py

#!/usr/bin/env python3
"""GTFS helper functions for Airflow DAGs"""

import os
import logging
import json
from datetime import datetime

class GTFSProcessor:
    """Class to process GTFS Realtime data"""
    
    @staticmethod
    def process_data(data):
        """Process the GTFS data before storing"""
        # Add processing timestamp
        processed_data = []
        processing_time = datetime.now().isoformat()

        for entity in data:
            # Add processing metadata
            entity['_processing_time'] = processing_time
            processed_data.append(entity)

        logging.info(f"Processed {len(processed_data)} GTFS entities")
        return processed_data

    @staticmethod
    def transform_for_sql(data):
        """Transform data into a format suitable for SQL insertion"""
        if not data:
            return []

        sql_ready_data = []
        for entity in data:
            if 'vehicle' in entity and 'position' in entity['vehicle']:
                try:
                    vehicle_id = entity.get('id', '') or entity['vehicle'].get('vehicle', {}).get('id', 'unknown')
                    position = entity['vehicle']['position']
                    timestamp = entity['vehicle'].get('timestamp', '')

                    record = (
                        vehicle_id,
                        position.get('latitude', 0),
                        position.get('longitude', 0),
                        position.get('bearing', 0),
                        position.get('speed', 0),
                        timestamp,
                        entity.get('_processing_time', '')
                    )
                    sql_ready_data.append(record)
                except (KeyError, TypeError) as e:
                    logging.warning(f"Could not extract position data from entity: {e}")

        logging.info(f"Transformed {len(sql_ready_data)} entities for SQL insertion")
        # Return as a list of tuples for SQL insertion
        return sql_ready_data

    @staticmethod
    def prepare_sql_values(sql_data):
        """Convert data to SQL VALUES format for SQLExecuteQueryOperator"""
        if not sql_data:
            return "''"  # Sentinel (a quoted empty string) that the DAG treats as "no data"

        # Convert list of tuples to SQL VALUES syntax
        values_strings = []
        for record in sql_data:
            values_str = f"('{record[0]}', {record[1]}, {record[2]}, {record[3]}, {record[4]}, '{record[5]}', '{record[6]}')"
            values_strings.append(values_str)

        return ", ".join(values_strings)

    @staticmethod
    def cleanup_flag_file(flag_file_path):
        """Clean up the flag file that triggered a DAG"""
        try:
            if os.path.exists(flag_file_path):
                os.remove(flag_file_path)
                logging.info(f"Removed flag file: {flag_file_path}")
            else:
                logging.warning(f"Flag file not found: {flag_file_path}")
        except Exception as e:
            logging.error(f"Error removing flag file: {e}") 

Overwriting src\orchestrators\learning-airflow\include\custom_functions\gtfs_functions.py
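
`prepare_sql_values` builds the VALUES clause by string interpolation, which breaks on values that contain quotes and is open to SQL injection. One alternative, sketched here under the assumption that the database driver uses `%s` placeholders (as psycopg2 does), is to generate placeholders and pass the values separately. `build_insert` and `COLUMNS` are hypothetical names, and the sample records are made up.

```python
from typing import Sequence

# Column order matches the tuples produced by transform_for_sql
COLUMNS = (
    "vehicle_id", "latitude", "longitude", "bearing",
    "speed", "timestamp", "processing_time",
)


def build_insert(table: str, records: Sequence[tuple]) -> tuple[str, list]:
    """Return (sql, params) for a multi-row INSERT with %s placeholders.

    The table name is still interpolated, so it must come from trusted code.
    """
    if not records:
        return "", []
    for record in records:
        if len(record) != len(COLUMNS):
            raise ValueError(f"Expected {len(COLUMNS)} values, got {len(record)}")
    row = "(" + ", ".join(["%s"] * len(COLUMNS)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(COLUMNS)}) VALUES "
        + ", ".join([row] * len(records))
    )
    # Flatten the records into one parameter list, in placeholder order
    params = [value for record in records for value in record]
    return sql, params


demo_records = [
    ("bus-1", 40.7, -74.0, 90.0, 5.5, "2025-04-01T10:00:00", "2025-04-01T10:00:05"),
    ("O'Hare-7", 41.9, -87.9, 0.0, 0.0, "2025-04-01T10:00:00", "2025-04-01T10:00:05"),
]
sql, params = build_insert("public.gtfs_vehicle_positions", demo_records)
print(sql)
```

With psycopg2, the pair would be passed as `cursor.execute(sql, params)`, leaving quoting to the driver.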


In [175]:
%%writefile src\orchestrators\learning-airflow\dags\gtfs_data_pipeline.py

#!/usr/bin/env python3
"""
GTFS Realtime Data Pipeline DAG 

This DAG fetches GTFS-RT data from the MTA Bus Time API, processes it,
and loads it into the storage backend of choice: S3, BigQuery, Azure Blob, or DuckDB.
It also demonstrates SQL operations by loading data into PostgreSQL.

The DAG demonstrates the Airflow TaskFlow API (Python functions as tasks)
and parameterization for different cloud environments.
"""

import os
import sys
import json
import logging
from datetime import timedelta
from pathlib import Path

from airflow.sdk import dag, task  # Airflow 3 location of the TaskFlow decorators
from airflow.models import Variable, Connection

from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.providers.standard.operators.empty import EmptyOperator

# Import custom modularized functions
from include.custom_functions.gtfs_functions import GTFSProcessor

# NOTE: GTFSFetcher is used by fetch_gtfs() and store_data() below but is not imported in this file.
# Import it from the module that defines it (the src/ingestion package) before running this DAG.

from pendulum import today, duration

# Define helper function to replace days_ago
def days_ago(n: int):
    return today("UTC").subtract(days=n)


# Default settings applied to all tasks
default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(minutes=10),
}

# Configurable parameters with defaults
# These can be overridden by setting Airflow Variables
CLOUD_PROVIDER = Variable.get("CLOUD_PROVIDER", default_var="local")  # aws, gcp, azure, or local
STORAGE_TYPE = Variable.get("STORAGE_TYPE", default_var="duckdb")  # s3, gcs, azure_blob, bigquery, duckdb
API_URL = Variable.get("GTFS_API_URL", default_var="https://gtfsrt.prod.obanyc.com/vehiclePositions")
OUTPUT_FORMAT = Variable.get("OUTPUT_FORMAT", default_var="json")
USE_SQL_DB = Variable.get("USE_SQL_DB", default_var="true").lower() == "true"  # Whether to also load data into PostgreSQL

# Cloud-specific settings with defaults
if CLOUD_PROVIDER == "aws":
    S3_BUCKET = Variable.get("S3_BUCKET", default_var="gtfs-data")
    S3_PREFIX = Variable.get("S3_PREFIX", default_var="vehicle_positions")
elif CLOUD_PROVIDER == "gcp":
    GCS_BUCKET = Variable.get("GCS_BUCKET", default_var="gtfs-data")
    GCS_PREFIX = Variable.get("GCS_PREFIX", default_var="vehicle_positions")
    BQ_DATASET = Variable.get("BQ_DATASET", default_var="gtfs_data")
    BQ_TABLE = Variable.get("BQ_TABLE", default_var="vehicle_positions")
elif CLOUD_PROVIDER == "azure":
    AZURE_CONTAINER = Variable.get("AZURE_CONTAINER", default_var="gtfs-data")
    AZURE_PREFIX = Variable.get("AZURE_PREFIX", default_var="vehicle_positions")
else:  # local
    DUCKDB_PATH = Variable.get("DUCKDB_PATH", default_var="/tmp/gtfs.duckdb")
    DUCKDB_TABLE = Variable.get("DUCKDB_TABLE", default_var="vehicle_positions")

# Define an asset for asset-driven scheduling
from airflow.sdk import Asset, AssetWatcher
from airflow.providers.standard.triggers.file import FileDeleteTrigger

# Create a file sensor trigger for the GTFS asset
gtfs_file_trigger = FileDeleteTrigger(filepath="/data/gtfs/new_data.flag")
gtfs_asset = Asset(
    "gtfs_data_asset", 
    watchers=[AssetWatcher(name="gtfs_data_watcher", trigger=gtfs_file_trigger)]
)

@dag(
    default_args=default_args,
    schedule=[gtfs_asset],  # asset-driven scheduling
    start_date=days_ago(1),
    catchup=False,
    max_active_runs=1,
    dag_display_name="GTFS Real-time Data Pipeline 🚌",
    tags=['gtfs', 'realtime', 'sql', CLOUD_PROVIDER],
    doc_md=__doc__
)
def gtfs_data_pipeline():
    """
    ### GTFS-RT Data Pipeline

    This DAG demonstrates how to fetch and process GTFS-RT data with Airflow,
    using different cloud providers and storage backends.

    #### Environment configuration
    * Cloud Provider: {cloud_provider}
    * Storage Type: {storage_type}
    * Data Format: {format}
    * Also Load to SQL DB: {use_sql_db}
    * Schedule: Asset-driven (file trigger)
    """.format(
        cloud_provider=CLOUD_PROVIDER,
        storage_type=STORAGE_TYPE,
        format=OUTPUT_FORMAT,
        use_sql_db=USE_SQL_DB
    )

    @task()
    def fetch_gtfs():
        """Fetch GTFS-RT data from the configured API"""
        # Get API key from connection if configured
        try:
            conn = Connection.get_connection_from_secrets("gtfs_api")
            api_key = conn.password if conn else Variable.get("MTA_API_KEY", default_var=os.getenv("MTA_API_KEY"))
        except Exception:  # connection not configured; fall back to Variable or environment
            api_key = Variable.get("MTA_API_KEY", default_var=os.getenv("MTA_API_KEY"))

        # Initialize fetcher
        fetcher = GTFSFetcher(api_url=API_URL, api_key=api_key)

        # Get the data
        logging.info(f"Fetching GTFS data from {API_URL}")
        try:
            data = fetcher.fetch_and_parse()
            logging.info(f"Successfully fetched {len(data)} GTFS entities")
            return data
        except Exception as e:
            logging.error(f"Error fetching GTFS data: {e}")
            raise

    @task()
    def process_data(data):
        """Process the GTFS data before storing"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.process_data(data)

    @task()
    def transform_for_sql(data):
        """Transform data into a format suitable for SQL insertion"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.transform_for_sql(data)

    @task()
    def prepare_sql_values(sql_data):
        """Convert data to SQL VALUES format for SQLExecuteQueryOperator"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.prepare_sql_values(sql_data)

    @task()
    def store_data(data):
        """Store the data in the configured backend"""
        if not data:
            logging.warning("No data to store")
            return {"status": "warning", "message": "No data to store"}

        # Get the fetcher for storage methods
        try:
            conn = Connection.get_connection_from_secrets("gtfs_api")
            api_key = conn.password if conn else os.getenv("MTA_API_KEY")
        except Exception:  # connection not configured; fall back to the environment
            api_key = os.getenv("MTA_API_KEY")

        fetcher = GTFSFetcher(api_url=API_URL, api_key=api_key)

        # Store based on the configured backend
        try:
            if CLOUD_PROVIDER == "aws":
                location = fetcher.save_to_s3(
                    data, 
                    bucket=S3_BUCKET, 
                    prefix=S3_PREFIX, 
                    fmt=OUTPUT_FORMAT
                )
                logging.info(f"Data saved to S3: {location}")
                return {"status": "success", "location": location}

            elif CLOUD_PROVIDER == "gcp":
                if STORAGE_TYPE == "bigquery":
                    rows = fetcher.save_to_bigquery(data, BQ_DATASET, BQ_TABLE)
                    logging.info(f"Data saved to BigQuery: {rows} rows")
                    return {"status": "success", "rows": rows}
                else:
                    location = fetcher.save_to_gcs(
                        data, 
                        bucket=GCS_BUCKET, 
                        prefix=GCS_PREFIX, 
                        fmt=OUTPUT_FORMAT
                    )
                    logging.info(f"Data saved to GCS: {location}")
                    return {"status": "success", "location": location}

            elif CLOUD_PROVIDER == "azure":
                # Azure implementation would go here
                # This would use the Azure blob storage client
                logging.info("Azure storage not yet implemented")
                return {"status": "not_implemented", "message": "Azure storage not yet implemented"}

            else:  # local/duckdb
                rows = fetcher.save_to_duckdb(data, table=DUCKDB_TABLE, db_path=DUCKDB_PATH)
                logging.info(f"Data saved to DuckDB: {DUCKDB_PATH}, table: {DUCKDB_TABLE}, {rows} rows")
                return {"status": "success", "rows": rows, "database": DUCKDB_PATH}

        except Exception as e:
            logging.error(f"Error storing data: {e}")
            raise

    # Create PostgreSQL tables
    create_pg_table = SQLExecuteQueryOperator(
        task_id="create_gtfs_table",
        conn_id="postgres_default",
        sql="""
        CREATE TABLE IF NOT EXISTS public.gtfs_vehicle_positions (
            vehicle_id TEXT,
            latitude DOUBLE PRECISION,
            longitude DOUBLE PRECISION,
            bearing DOUBLE PRECISION,
            speed DOUBLE PRECISION,
            timestamp TIMESTAMP,
            processing_time TIMESTAMP,
            PRIMARY KEY (vehicle_id, processing_time)
        );
        """
    )

    # Insert task with dynamic SQL
    @task()
    def insert_to_postgres(values):
        """Insert the values into PostgreSQL using SQLExecuteQueryOperator"""
        if not values or values == "''":
            logging.warning("No values to insert into PostgreSQL")
            return {"rows_inserted": 0}

        pg_insert = SQLExecuteQueryOperator(
            task_id="insert_gtfs_data",
            conn_id="postgres_default",
            sql=f"""
            INSERT INTO public.gtfs_vehicle_positions
            (vehicle_id, latitude, longitude, bearing, speed, timestamp, processing_time)
            VALUES {values}
            ON CONFLICT (vehicle_id, processing_time) 
            DO UPDATE SET
                latitude = EXCLUDED.latitude,
                longitude = EXCLUDED.longitude,
                bearing = EXCLUDED.bearing,
                speed = EXCLUDED.speed;
            """
        )

        pg_insert.execute(context={})
        return {"rows_inserted": values.count('),') + 1 if values else 0}

    # Task to clean up the flag file that triggered this DAG
    @task()
    def cleanup():
        """Clean up the flag file that triggered this DAG"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.cleanup_flag_file("/data/gtfs/new_data.flag")

    # Define SQL branch based on configuration
    sql_branch = EmptyOperator(task_id="skip_sql_branch") if not USE_SQL_DB else EmptyOperator(task_id="use_sql_branch")

    # Define the task dependencies
    raw_data = fetch_gtfs()
    processed_data = process_data(raw_data)
    storage_result = store_data(processed_data)

    # SQL branch
    if USE_SQL_DB:
        sql_data = transform_for_sql(processed_data)
        sql_values = prepare_sql_values(sql_data)
        create_pg_table >> insert_to_postgres(sql_values) >> cleanup()

    # Main flow
    raw_data >> processed_data >> storage_result

    # Return the DAG result
    return {"result": storage_result}

# Instantiate the DAG
gtfs_pipeline = gtfs_data_pipeline() 


Overwriting src\orchestrators\learning-airflow\dags\gtfs_data_pipeline.py
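
The DAG above is scheduled on an asset whose watcher waits for `/data/gtfs/new_data.flag`. Assuming the trigger fires once that file exists, an upstream process signals new data by creating it. The sketch below shows one way to do that atomically, so the watcher never sees a half-written file. `raise_flag` is a hypothetical helper, and the demo writes to a temporary directory rather than `/data/gtfs`.

```python
import os
import tempfile
from pathlib import Path


def raise_flag(flag_path: str, note: str = "") -> Path:
    """Create the flag file atomically: write a temp file, then rename it."""
    target = Path(flag_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".flag-")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(note)
    os.replace(tmp_name, target)
    return target


# Demo in a throwaway directory
demo_dir = tempfile.mkdtemp()
flag = raise_flag(os.path.join(demo_dir, "gtfs", "new_data.flag"), note="batch ready")
print(flag.exists())
```

If the trigger itself deletes the file after firing, the `cleanup` task will typically find nothing to remove and only log a warning.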


In [176]:
%%writefile src\orchestrators\learning-airflow\include\custom_functions\nba_functions.py

#!/usr/bin/env python3
"""NBA data processing helper functions for Airflow DAGs"""

import logging
import os
from datetime import datetime
import requests
from typing import Dict, List, Any, Optional, Union


class NBAProcessor:
    """Class to process NBA game data"""
    
    @staticmethod
    def fetch_nba_games() -> list[dict]:
        """
        Fetch today's NBA games live from stats.nba.com via nba_api.
        No API key required.
        """
        from datetime import datetime
        from nba_api.stats.endpoints import ScoreboardV2

        # NBA expects dates as MM/DD/YYYY
        game_date = datetime.utcnow().strftime("%m/%d/%Y")
        sb = ScoreboardV2(game_date=game_date)
        payload = sb.get_dict()
        rows = payload["resultSets"][0]["rowSet"]

        games = []
        for row in rows:
            games.append({
                "game_id":       row[2],            # GAME_ID
                "date":          row[0],            # GAME_DATE
                "home_team":     row[6],            # HOME_TEAM_ABBREVIATION
                "away_team":     row[7],            # VISITOR_TEAM_ABBREVIATION
                "score_home":    row[21] or 0,      # PTS_HOME; if None, 0
                "score_away":    row[22] or 0,      # PTS_AWAY; if None, 0
            })
        return games

    @staticmethod
    def process_games(games_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Process the NBA games data
        
        Args:
            games_data: Raw NBA game data
            
        Returns:
            Processed NBA game data with added timestamps
        """
        # Add processing metadata
        processed_data = []
        processing_time = datetime.now().isoformat()

        for game in games_data:
            # Add processing timestamp
            game['processing_time'] = processing_time
            processed_data.append(game)

        logging.info(f"Processed {len(processed_data)} NBA games")
        return processed_data

    @staticmethod
    def prepare_sql_values(games: List[Dict[str, Any]]) -> str:
        """
        Prepare SQL VALUES for insertion
        
        Args:
            games: List of processed game data
            
        Returns:
            SQL VALUES string for insertion
        """
        if not games:
            return "''"

        values_strings = []
        for game in games:
            # Format values for SQL INSERT
            # fetch_nba_games() emits the key "game_id", not "id"
            values_str = (
                f"('{game['game_id']}', '{game['date']}', '{game['home_team']}', "
                f"'{game['away_team']}', {game['score_home']}, {game['score_away']}, "
                "CURRENT_TIMESTAMP)"
            )
            values_strings.append(values_str)

        return ", ".join(values_strings)
        
    @staticmethod
    def cleanup_flag_file(flag_file_path: str) -> Dict[str, str]:
        """
        Clean up the flag file that triggered a DAG
        
        Args:
            flag_file_path: Path to the flag file
            
        Returns:
            Status dictionary
        """
        try:
            if os.path.exists(flag_file_path):
                os.remove(flag_file_path)
                logging.info(f"Removed flag file: {flag_file_path}")
            else:
                logging.warning(f"Flag file not found: {flag_file_path}")
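        except Exception as e:
            logging.error(f"Error removing flag file: {e}")
            # Report the failure instead of falling through to "success"
            return {"status": "error", "message": str(e)}

        return {"status": "success"}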


Overwriting src\orchestrators\learning-airflow\include\custom_functions\nba_functions.py
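
`fetch_nba_games` reads columns by position (`row[2]`, `row[6]`, ...), so a column order that differs from what the comments assume goes unnoticed. A minimal sketch of a name-based alternative follows. It assumes each result set carries a `headers` list next to `rowSet`; `rows_as_dicts` is a hypothetical helper, and the payload is made-up sample data rather than real API output.

```python
from typing import Any, Dict, List


def rows_as_dicts(result_set: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each row with the column names from the same result set."""
    headers = result_set["headers"]
    return [dict(zip(headers, row)) for row in result_set["rowSet"]]


# Made-up payload in the shape the code above indexes into
# (assumption: a "headers" list sits alongside "rowSet").
sample_payload = {
    "resultSets": [
        {
            "headers": ["GAME_DATE_EST", "GAME_SEQUENCE", "GAME_ID"],
            "rowSet": [["2025-05-01T00:00:00", 1, "0042400101"]],
        }
    ]
}

games = rows_as_dicts(sample_payload["resultSets"][0])
print(games[0]["GAME_ID"])
```

Looking fields up by name turns a changed column order into an explicit `KeyError` instead of a silently wrong value.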


In [177]:
%%writefile src\orchestrators\learning-airflow\dags\nba_ingest_pipeline.py
#!/usr/bin/env python3
"""
GTFS Realtime Data Pipeline DAG 

This DAG fetches GTFS-RT data from the MTA Bus Time API, processes it,
and loads it into the storage backend of choice: S3, BigQuery, Azure Blob, or DuckDB.
It also demonstrates SQL operations by loading data into PostgreSQL.

The DAG demonstrates the Airflow TaskFlow API (Python functions as tasks)
and parameterization for different cloud environments.
"""

import os
import sys
import json
import logging
from datetime import timedelta
from pathlib import Path

from airflow.decorators import dag, task
from airflow.models import Variable, Connection
from airflow.operators.python import get_current_context

from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.operators.empty import EmptyOperator

# Import custom modularized functions
# TODO: GTFSFetcher is used in fetch_gtfs and store_data but is never imported;
# add its import here (its module path is not shown in this file).
from include.custom_functions.gtfs_functions import GTFSProcessor
from pendulum import today, duration

# Define helper function to replace days_ago
def days_ago(n: int):
    return today("UTC").subtract(days=n)


# Default settings applied to all tasks
default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(minutes=10),
}

# Configurable parameters with defaults
# These can be overridden by setting Airflow Variables
CLOUD_PROVIDER = Variable.get("CLOUD_PROVIDER", default_var="local")  # aws, gcp, azure, or local
STORAGE_TYPE = Variable.get("STORAGE_TYPE", default_var="duckdb")  # s3, gcs, azure_blob, bigquery, duckdb
API_URL = Variable.get("GTFS_API_URL", default_var="https://gtfsrt.prod.obanyc.com/vehiclePositions")
OUTPUT_FORMAT = Variable.get("OUTPUT_FORMAT", default_var="json")
USE_SQL_DB = Variable.get("USE_SQL_DB", default_var="true").lower() == "true"  # Whether to also load data into PostgreSQL

# Cloud-specific settings with defaults
if CLOUD_PROVIDER == "aws":
    S3_BUCKET = Variable.get("S3_BUCKET", default_var="gtfs-data")
    S3_PREFIX = Variable.get("S3_PREFIX", default_var="vehicle_positions")
elif CLOUD_PROVIDER == "gcp":
    GCS_BUCKET = Variable.get("GCS_BUCKET", default_var="gtfs-data")
    GCS_PREFIX = Variable.get("GCS_PREFIX", default_var="vehicle_positions")
    BQ_DATASET = Variable.get("BQ_DATASET", default_var="gtfs_data")
    BQ_TABLE = Variable.get("BQ_TABLE", default_var="vehicle_positions")
elif CLOUD_PROVIDER == "azure":
    AZURE_CONTAINER = Variable.get("AZURE_CONTAINER", default_var="gtfs-data")
    AZURE_PREFIX = Variable.get("AZURE_PREFIX", default_var="vehicle_positions")
else:  # local
    DUCKDB_PATH = Variable.get("DUCKDB_PATH", default_var="/tmp/gtfs.duckdb")
    DUCKDB_TABLE = Variable.get("DUCKDB_TABLE", default_var="vehicle_positions")

# Define an asset for asset-driven scheduling
from airflow.sdk import Asset, AssetWatcher
from airflow.providers.standard.triggers.file import FileDeleteTrigger

# Create a file sensor trigger for the GTFS asset
gtfs_file_trigger = FileDeleteTrigger(filepath="/data/gtfs/new_data.flag")
gtfs_asset = Asset(
    "gtfs_data_asset", 
    watchers=[AssetWatcher(name="gtfs_data_watcher", trigger=gtfs_file_trigger)]
)

@dag(
    default_args=default_args,
    schedule=[gtfs_asset],  # asset-driven scheduling
    start_date=days_ago(1),
    catchup=False,
    max_active_runs=1,
    dag_display_name="GTFS Real-time Data Pipeline 🚌",
    tags=['gtfs', 'realtime', 'sql', CLOUD_PROVIDER],
    doc_md=__doc__
)
def gtfs_data_pipeline():
    """
    ### GTFS-RT Data Pipeline

    This DAG demonstrates how to fetch and process GTFS-RT data with Airflow,
    using different cloud providers and storage backends.

    #### Environment configuration
    * Cloud Provider: {cloud_provider}
    * Storage Type: {storage_type}
    * Data Format: {format}
    * Also Load to SQL DB: {use_sql_db}
    * Schedule: Asset-driven (file trigger)
    """.format(
        cloud_provider=CLOUD_PROVIDER,
        storage_type=STORAGE_TYPE,
        format=OUTPUT_FORMAT,
        use_sql_db=USE_SQL_DB
    )

    @task()
    def fetch_gtfs():
        """Fetch GTFS-RT data from the configured API"""
        # Get API key from connection if configured
        try:
            conn = Connection.get_connection_from_secrets("gtfs_api")
            api_key = conn.password if conn else Variable.get("MTA_API_KEY", default_var=os.getenv("MTA_API_KEY"))
        except Exception:  # a bare except would also swallow KeyboardInterrupt/SystemExit
            api_key = Variable.get("MTA_API_KEY", default_var=os.getenv("MTA_API_KEY"))

        # Initialize fetcher
        fetcher = GTFSFetcher(api_url=API_URL, api_key=api_key)

        # Get the data
        logging.info(f"Fetching GTFS data from {API_URL}")
        try:
            data = fetcher.fetch_and_parse()
            logging.info(f"Successfully fetched {len(data)} GTFS entities")
            return data
        except Exception as e:
            logging.error(f"Error fetching GTFS data: {e}")
            raise

    @task()
    def process_data(data):
        """Process the GTFS data before storing"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.process_data(data)

    @task()
    def transform_for_sql(data):
        """Transform data into a format suitable for SQL insertion"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.transform_for_sql(data)

    @task()
    def prepare_sql_values(sql_data):
        """Convert data to SQL VALUES format for SQLExecuteQueryOperator"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.prepare_sql_values(sql_data)

    @task()
    def store_data(data):
        """Store the data in the configured backend"""
        if not data:
            logging.warning("No data to store")
            return {"status": "warning", "message": "No data to store"}

        # Get the fetcher for storage methods
        try:
            conn = Connection.get_connection_from_secrets("gtfs_api")
            api_key = conn.password if conn else os.getenv("MTA_API_KEY")
        except Exception:  # a bare except would also swallow KeyboardInterrupt/SystemExit
            api_key = os.getenv("MTA_API_KEY")

        fetcher = GTFSFetcher(api_url=API_URL, api_key=api_key)

        # Store based on the configured backend
        try:
            if CLOUD_PROVIDER == "aws":
                location = fetcher.save_to_s3(
                    data, 
                    bucket=S3_BUCKET, 
                    prefix=S3_PREFIX, 
                    fmt=OUTPUT_FORMAT
                )
                logging.info(f"Data saved to S3: {location}")
                return {"status": "success", "location": location}

            elif CLOUD_PROVIDER == "gcp":
                if STORAGE_TYPE == "bigquery":
                    rows = fetcher.save_to_bigquery(data, BQ_DATASET, BQ_TABLE)
                    logging.info(f"Data saved to BigQuery: {rows} rows")
                    return {"status": "success", "rows": rows}
                else:
                    location = fetcher.save_to_gcs(
                        data, 
                        bucket=GCS_BUCKET, 
                        prefix=GCS_PREFIX, 
                        fmt=OUTPUT_FORMAT
                    )
                    logging.info(f"Data saved to GCS: {location}")
                    return {"status": "success", "location": location}

            elif CLOUD_PROVIDER == "azure":
                # Azure implementation would go here
                # This would use the Azure blob storage client
                logging.info("Azure storage not yet implemented")
                return {"status": "not_implemented", "message": "Azure storage not yet implemented"}

            else:  # local/duckdb
                rows = fetcher.save_to_duckdb(data, table=DUCKDB_TABLE, db_path=DUCKDB_PATH)
                logging.info(f"Data saved to DuckDB: {DUCKDB_PATH}, table: {DUCKDB_TABLE}, {rows} rows")
                return {"status": "success", "rows": rows, "database": DUCKDB_PATH}

        except Exception as e:
            logging.error(f"Error storing data: {e}")
            raise

    # Create PostgreSQL tables
    create_pg_table = SQLExecuteQueryOperator(
        task_id="create_gtfs_table",
        conn_id="postgres_default",
        sql="""
        CREATE TABLE IF NOT EXISTS public.gtfs_vehicle_positions (
            vehicle_id TEXT,
            latitude DOUBLE PRECISION,
            longitude DOUBLE PRECISION,
            bearing DOUBLE PRECISION,
            speed DOUBLE PRECISION,
            timestamp TIMESTAMP,
            processing_time TIMESTAMP,
            PRIMARY KEY (vehicle_id, processing_time)
        );
        """
    )

    # Insert task with dynamic SQL
    @task()
    def insert_to_postgres(values):
        """Insert the values into PostgreSQL using SQLExecuteQueryOperator"""
        if not values or values == "''":
            logging.warning("No values to insert into PostgreSQL")
            return {"rows_inserted": 0}

        pg_insert = SQLExecuteQueryOperator(
            task_id="insert_gtfs_data",
            conn_id="postgres_default",
            sql=f"""
            INSERT INTO public.gtfs_vehicle_positions
            (vehicle_id, latitude, longitude, bearing, speed, timestamp, processing_time)
            VALUES {values}
            ON CONFLICT (vehicle_id, processing_time) 
            DO UPDATE SET
                latitude = EXCLUDED.latitude,
                longitude = EXCLUDED.longitude,
                bearing = EXCLUDED.bearing,
                speed = EXCLUDED.speed;
            """
        )

        pg_insert.execute(context={})
        return {"rows_inserted": values.count('),') + 1 if values else 0}

    # Task to clean up the flag file that triggered this DAG
    @task()
    def cleanup():
        """Clean up the flag file that triggered this DAG"""
        # Use the modularized GTFSProcessor class
        return GTFSProcessor.cleanup_flag_file("/data/gtfs/new_data.flag")

    # Define SQL branch based on configuration
    sql_branch = EmptyOperator(task_id="skip_sql_branch") if not USE_SQL_DB else EmptyOperator(task_id="use_sql_branch")

    # Define the task dependencies
    raw_data = fetch_gtfs()
    processed_data = process_data(raw_data)
    storage_result = store_data(processed_data)

    # SQL branch
    if USE_SQL_DB:
        sql_data = transform_for_sql(processed_data)
        sql_values = prepare_sql_values(sql_data)
        create_pg_table >> insert_to_postgres(sql_values) >> cleanup()

    # Main flow
    raw_data >> processed_data >> storage_result

    # Return the DAG result
    return {"result": storage_result}

# Instantiate the DAG
gtfs_pipeline = gtfs_data_pipeline() 


Overwriting src\orchestrators\learning-airflow\dags\nba_ingest_pipeline.py
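
The insert task above splices a pre-rendered `VALUES` string into the statement, so a quote character inside the data changes the SQL itself. Below is a hedged sketch of the bound-parameter alternative. It uses the stdlib `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere; the table is a trimmed, hypothetical version of `gtfs_vehicle_positions`, and the sample rows are invented. `sqlite3` uses `?` placeholders, whereas a PostgreSQL driver such as psycopg2 uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vehicle_positions (
        vehicle_id TEXT,
        latitude REAL,
        longitude REAL,
        processing_time TEXT,
        PRIMARY KEY (vehicle_id, processing_time)
    )
    """
)

# Invented sample rows; the second id contains a quote on purpose
rows = [
    ("MTA_1234", 40.7128, -74.0060, "2025-05-01T12:00:00"),
    ("O'Brien-bus", 40.7306, -73.9352, "2025-05-01T12:00:00"),
]

upsert = """
    INSERT INTO vehicle_positions (vehicle_id, latitude, longitude, processing_time)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (vehicle_id, processing_time)
    DO UPDATE SET latitude = excluded.latitude, longitude = excluded.longitude
"""
conn.executemany(upsert, rows)
conn.executemany(upsert, rows)  # second run updates instead of duplicating
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM vehicle_positions").fetchone()[0]
print(row_count)
```

Because the driver passes the values separately from the statement text, the row count can also be taken from `len(rows)` rather than by counting `),` in a string.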


In [178]:


%%writefile src\orchestrators\learning-airflow\include\custom_functions\weather_functions.py


#!/usr/bin/env python3
"""Weather data processing helper functions for Airflow DAGs"""

import os
import json
import logging
import time
from datetime import datetime
from typing import Dict, List, Any, Optional

class WeatherProcessor:
    """Class to process weather data from Kafka"""

    @staticmethod
    def consume_kafka_messages(
        consumer,
        topic: str = "weather-updates",
        max_messages: int = 100
    ) -> List[Dict[str, Any]]:
        """
        Consume weather data from Kafka

        Args:
            consumer: Kafka consumer instance
            topic: Kafka topic to consume from
            max_messages: Maximum number of messages to consume

        Returns:
            List of consumed messages
        """
        messages = []
        message_count = 0

        logging.info(f"Starting to consume messages from Kafka topic: {topic}")
        start_time = time.time()

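        # KafkaConsumerHook.get_consumer() returns a confluent_kafka.Consumer, so
        # use its consume() API: it returns a list of Message objects whose fields
        # are read through accessor methods (value(), topic(), offset(), ...).
        try:
            while message_count < max_messages:
                # Request only what is still needed so max_messages is never exceeded
                batch = consumer.consume(
                    num_messages=max_messages - message_count, timeout=5.0
                )
                if not batch:
                    break

                for record in batch:
                    if record.error():
                        logging.warning(f"Skipping errored message: {record.error()}")
                        continue
                    try:
                        message = json.loads(record.value().decode('utf-8'))
                        message['_metadata'] = {
                            'topic': record.topic(),
                            'partition': record.partition(),
                            'offset': record.offset(),
                            # timestamp() returns (timestamp_type, timestamp_ms)
                            'timestamp': record.timestamp()[1]
                        }
                        messages.append(message)
                        message_count += 1
                    except (json.JSONDecodeError, UnicodeDecodeError):
                        logging.warning(
                            f"Skipping non-JSON message: {record.value()}"
                        )

                # Commit offsets after processing
                consumer.commit()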

            logging.info(
                f"Consumed {message_count} messages in "
                f"{time.time() - start_time:.2f} seconds"
            )
        finally:
            consumer.close()

        return messages

    @staticmethod
    def process_weather_data(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Process and transform the weather data

        Args:
            messages: Raw Kafka messages

        Returns:
            Processed weather data
        """
        if not messages:
            logging.warning("No weather messages to process")
            return []

        processed_data = []
        processing_time = datetime.now().isoformat()

        for message in messages:
            try:
                # Extract weather observation data
                observation = {
                    'location': message.get('location', 'unknown'),
                    'latitude': message.get('lat'),
                    'longitude': message.get('lon'),
                    'obs_time': message.get('observation_time'),
                    'temperature': message.get('temp_c'),
                    'humidity': message.get('humidity'),
                    'pressure': message.get('pressure_mb'),
                    'wind_speed': message.get('wind_kph'),
                    'wind_direction': message.get('wind_dir'),
                    # "condition" may be present but null; fall back to an empty dict
                    'conditions': (message.get('condition') or {}).get('text'),
                    '_processing_time': processing_time
                }
                processed_data.append(observation)
            except Exception as e:
                logging.error(f"Error processing weather message: {e}")

        logging.info(f"Processed {len(processed_data)} weather observations")
        return processed_data

    @staticmethod
    def prepare_sql_values(observations: List[Dict[str, Any]]) -> str:
        """
        Prepare SQL VALUES for insertion

        Args:
            observations: List of processed weather observations

        Returns:
            SQL VALUES string for insertion
        """
        if not observations:
            return "''"

        def _text(value: Any) -> str:
            """Render a text value as a SQL literal: None -> NULL, quotes doubled."""
            if value is None:
                return "NULL"
            return "'" + str(value).replace("'", "''") + "'"

        def _num(value: Any) -> str:
            """Render a numeric value, mapping None to NULL."""
            return "NULL" if value is None else str(value)

        values_strings = []
        for obs in observations:
            # Format values for SQL INSERT, in table column order
            fields = [
                _text(obs['location']),
                _num(obs['latitude']),
                _num(obs['longitude']),
                _text(obs['obs_time']),
                _num(obs['temperature']),
                _num(obs['humidity']),
                _num(obs['pressure']),
                _num(obs['wind_speed']),
                _text(obs['wind_direction']),
                _text(obs['conditions'] or ''),
            ]
            values_strings.append("(" + ", ".join(fields) + ")")

        return ", ".join(values_strings)

    @staticmethod
    def cleanup_flag_file(flag_file_path: str) -> Dict[str, str]:
        """
        Clean up the flag file that triggered a DAG

        Args:
            flag_file_path: Path to the flag file

        Returns:
            Status dictionary
        """
        try:
            if os.path.exists(flag_file_path):
                os.remove(flag_file_path)
                logging.info(f"Removed flag file: {flag_file_path}")
            else:
                logging.warning(f"Flag file not found: {flag_file_path}")
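        except Exception as e:
            logging.error(f"Error removing flag file: {e}")
            # Report the failure instead of falling through to "success"
            return {"status": "error", "message": str(e)}

        return {"status": "success"}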

Overwriting src\orchestrators\learning-airflow\include\custom_functions\weather_functions.py
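
`process_weather_data` implies a message schema without stating it. The sketch below spells that mapping out as data. The field names come from the function above; `FIELD_MAP`, `to_observation`, and the sample message are illustrative assumptions, not part of the project.

```python
from typing import Any, Dict

# Output column -> key in the incoming Kafka message
FIELD_MAP = {
    "location": "location",
    "latitude": "lat",
    "longitude": "lon",
    "obs_time": "observation_time",
    "temperature": "temp_c",
    "humidity": "humidity",
    "pressure": "pressure_mb",
    "wind_speed": "wind_kph",
    "wind_direction": "wind_dir",
}


def to_observation(message: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw message to the observation columns; missing keys become None."""
    obs = {column: message.get(key) for column, key in FIELD_MAP.items()}
    if "location" not in message:
        obs["location"] = "unknown"
    # "condition" is a nested object; tolerate it being absent or null
    obs["conditions"] = (message.get("condition") or {}).get("text")
    return obs


# Invented sample message with only some of the fields present
sample_message = {
    "location": "New York",
    "lat": 40.71,
    "lon": -74.01,
    "observation_time": "2025-05-01T12:00:00",
    "temp_c": 18.5,
    "condition": {"text": "Partly cloudy"},
}

observation = to_observation(sample_message)
print(observation["temperature"], observation["conditions"])
```

Keeping the mapping in one dictionary makes it easy to compare against the producer's schema when a field is renamed.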


In [179]:
%%writefile src\orchestrators\learning-airflow\dags\weather_kafka_pipeline.py
from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.providers.apache.kafka.hooks.consume import KafkaConsumerHook
from pendulum import datetime

from include.custom_functions.weather_functions import WeatherProcessor

@dag(
    start_date=datetime(2025, 5, 1),
    catchup=False,
    dag_display_name="Weather Kafka Pipeline ☁️",
    tags=['weather', 'kafka', 'event-driven', 'sql']
)
def weather_kafka_pipeline():
    @task()
    def consume_kafka_messages():
        topic = Variable.get("WEATHER_KAFKA_TOPIC", default_var="weather-updates")
        max_messages = int(Variable.get("WEATHER_MAX_MESSAGES", default_var="100"))

        kafka_hook = KafkaConsumerHook(
            topics=[topic],
            kafka_config_id="kafka_default",
        )
        consumer = kafka_hook.get_consumer()
        return WeatherProcessor.consume_kafka_messages(consumer, topic, max_messages)

    @task()
    def process_weather_data(messages):
        return WeatherProcessor.process_weather_data(messages)

    @task()
    def prepare_sql_values(observations):
        return WeatherProcessor.prepare_sql_values(observations)

    create_table = SQLExecuteQueryOperator(
        task_id="create_weather_observations_table",
        conn_id="postgres_default",
        sql="""
        CREATE TABLE IF NOT EXISTS weather.observations (
          id SERIAL PRIMARY KEY,
          location TEXT,
          latitude DOUBLE PRECISION,
          longitude DOUBLE PRECISION,
          obs_time TIMESTAMP,
          temperature DOUBLE PRECISION,
          humidity DOUBLE PRECISION,
          pressure DOUBLE PRECISION,
          wind_speed DOUBLE PRECISION,
          wind_direction TEXT,
          conditions TEXT,
          created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        """
    )

    insert = SQLExecuteQueryOperator(
        task_id="insert_weather_data",
        conn_id="postgres_default",
        sql="""
        INSERT INTO weather.observations
        (location, latitude, longitude, obs_time, temperature, humidity, pressure, wind_speed, wind_direction, conditions)
        VALUES {{ ti.xcom_pull(task_ids='prepare_sql_values') }};
        """
    )

    msgs = consume_kafka_messages()
    proc = process_weather_data(msgs)
    vals = prepare_sql_values(proc)
    # insert reads the XCom from prepare_sql_values, so it must run after it
    create_table >> insert
    vals >> insert

weather_kafka_pipeline_dag = weather_kafka_pipeline()


Overwriting src\orchestrators\learning-airflow\dags\weather_kafka_pipeline.py
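
The `insert_weather_data` task renders its SQL from the XCom that `prepare_sql_values` pushes, so it can only produce a valid statement after that task has finished. A small stdlib sketch of the intended ordering, using `graphlib`, is shown below. It models the dependency graph only and does not involve Airflow; the task names are the ones used in the DAG above.

```python
from graphlib import TopologicalSorter

# task -> set of tasks that must finish first
dependencies = {
    "process_weather_data": {"consume_kafka_messages"},
    "prepare_sql_values": {"process_weather_data"},
    "insert_weather_data": {
        "create_weather_observations_table",
        "prepare_sql_values",
    },
}

# static_order() yields every task after all of its prerequisites
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any valid order places both the table creation and `prepare_sql_values` ahead of `insert_weather_data`; dropping either edge would let the scheduler start the insert too early.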


In [180]:
%%writefile src\orchestrators\learning-airflow\dags\user_metrics_etl.py
# src/orchestrators/learning-airflow/dags/user_metrics_etl.py

from airflow.decorators import dag, task
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from pendulum import datetime  # use pendulum directly

from include.custom_functions.user_metrics_functions import UserMetricsProcessor

@dag(
    start_date=datetime(2025, 5, 1),      # use pendulum directly here
    schedule="@daily",
    dag_display_name="User Metrics ETL 📊",
    catchup=False,
    tags=["metrics", "sql"]
)
def user_metrics_etl():
    @task()
    def extract():
        return UserMetricsProcessor.extract()

    @task()
    def transform(data):
        return UserMetricsProcessor.transform(data)

    create_table = SQLExecuteQueryOperator(
        task_id="create_user_metrics_table",
        conn_id="postgres_default",
        sql="""
        CREATE TABLE IF NOT EXISTS user_metrics (
            user_id INTEGER PRIMARY KEY,
            session_count INTEGER,
            total_duration_mins INTEGER,
            conversion_rate DOUBLE PRECISION,
            updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        """
    )

    load = SQLExecuteQueryOperator(
        task_id="load_user_metrics",
        conn_id="postgres_default",
        sql="""
        INSERT INTO user_metrics (user_id, session_count, total_duration_mins, conversion_rate)
        VALUES {{ ti.xcom_pull(task_ids='transform') }}
        ON CONFLICT (user_id) DO UPDATE
          SET session_count       = EXCLUDED.session_count,
              total_duration_mins = EXCLUDED.total_duration_mins,
              conversion_rate     = EXCLUDED.conversion_rate,
              updated_at          = CURRENT_TIMESTAMP;
        """
    )

    data = extract()
    vals = transform(data)
    # load reads the XCom from transform, so it must run after it
    create_table >> load
    vals >> load

user_metrics_etl_dag = user_metrics_etl()




Overwriting src\orchestrators\learning-airflow\dags\user_metrics_etl.py
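
The `load_user_metrics` statement expects `transform` to return a ready-made `VALUES` list, but `UserMetricsProcessor.transform` is not shown here. One possible shape is sketched below, assuming each record is a dict with the four inserted columns; `to_values_clause` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List


def to_values_clause(rows: List[Dict[str, Any]]) -> str:
    """Render numeric user metrics as a SQL VALUES list."""
    tuples = []
    for row in rows:
        # Cast explicitly so only numbers can reach the SQL text
        tuples.append(
            "({}, {}, {}, {})".format(
                int(row["user_id"]),
                int(row["session_count"]),
                int(row["total_duration_mins"]),
                float(row["conversion_rate"]),
            )
        )
    return ", ".join(tuples)


# Invented sample rows
sample_rows = [
    {"user_id": 1, "session_count": 5, "total_duration_mins": 120, "conversion_rate": 0.25},
    {"user_id": 2, "session_count": 3, "total_duration_mins": 45, "conversion_rate": 0.1},
]

values_clause = to_values_clause(sample_rows)
print(values_clause)
```

The explicit `int`/`float` casts make the function raise on non-numeric input rather than pass arbitrary text into the statement.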
