# Pipeline 04: Production Real-Time Pipeline Walkthrough
## Redpanda → Flink 2.0.1 → Apache Iceberg 1.10.1 → dbt-duckdb

> **What this notebook is:** A linear, file-by-file walkthrough of a complete production-grade
> streaming data pipeline. Every cell writes a real file from the live P04 pipeline. By the end
> you will understand *why* every design decision was made — not just *what* the files contain.
>
> **Benchmark result (10,000 events, post-audit Feb 2026):**
> ~88 seconds end-to-end (includes 5s Iceberg metadata flush), 91/91 dbt tests passing.
> P04 is ~20s faster than P01 (Kafka) due to Redpanda's C++ single-binary architecture.
>
> **Post-audit status:** Production-hardened. DLQ in both `create-topics.sh` and Makefile,
> streaming SQL parity with P01, CPU-bounded containers, benchmark race condition fixed.

Run from the `notebooks/` directory so `%%writefile ../pipelines/04-...` paths resolve:

```bash
cd notebooks
jupyter notebook P04_Complete_Pipeline_Notebook.ipynb
```

---

### Table of Contents
1. [Architecture Overview](#1.-Architecture-Overview)
2. [Shared Infrastructure](#2.-Shared-Infrastructure)
3. [Docker Compose: Container Orchestration](#3.-Docker-Compose:-Container-Orchestration)
4. [Redpanda Topics + Dead Letter Queue](#4.-Redpanda-Topics:-Event-Ingestion-+-Dead-Letter-Queue)
5. [Flink Configuration](#5.-Flink-Configuration)
6. [Flink SQL: Batch Session Init](#6.-Flink-SQL:-Batch-Session-Initialization)
7. [Flink SQL: Streaming Session Init](#7.-Flink-SQL:-Streaming-Session-Initialization)
8. [Flink SQL: Bronze Layer](#8.-Flink-SQL:-Bronze-Layer)
9. [Flink SQL: Silver Layer — Deduplication + Quality Filtering](#9.-Flink-SQL:-Silver-Layer)
10. [Flink SQL: Streaming Bronze (Continuous Mode)](#10.-Flink-SQL:-Streaming-Bronze)
11. [dbt Project Configuration](#11.-dbt-Project-Configuration)
12. [dbt Seeds: Reference Data](#12.-dbt-Seeds:-Reference-Data)
13. [dbt Macros](#13.-dbt-Macros:-Cross-Database-Compatibility)
14. [dbt Staging Models](#14.-dbt-Staging-Models)
15. [dbt Intermediate Models](#15.-dbt-Intermediate-Models)
16. [dbt Core Marts (Gold Layer)](#16.-dbt-Core-Marts)
17. [dbt Analytics Marts (Gold Layer)](#17.-dbt-Analytics-Marts)
18. [dbt Tests](#18.-dbt-Tests:-Data-Quality-Assertions)
19. [Makefile: One-Command Orchestration](#19.-Pipeline-Makefile)
20. [Running the Pipeline](#20.-Running-the-Pipeline)
21. [Production Operations + Troubleshooting](#21.-Production-Operations)
22. [Adapting to Your Own Dataset](#22.-Adapting-to-Your-Own-Dataset)
23. [What We Learned: Key Decisions Explained](#23.-What-We-Learned)


## 1. Architecture Overview

### The Full Data Flow

```
NYC Taxi Parquet Files (local data/)
        │
        ▼
┌──────────────────────┐
│    Data Generator    │  pyarrow → JSON → Kafka-protocol producer
│    (shared/)         │  Modes: burst / realtime / batch
│    ~25,000 evt/s     │  enable.idempotence=True, acks=all
└──────────┬───────────┘
           │  JSON events (350 bytes each)
           ▼
┌─────────────────────────────────────────────────────┐
│                    Redpanda                          │
│  taxi.raw_trips      (3 partitions, 72h retention)  │  ← primary stream
│  taxi.raw_trips.dlq  (1 partition,  7-day retention)│  ← dead letter queue
│                                                      │
│  C++/Seastar binary: ~400 MB RAM, 3s startup        │
│  Kafka-wire-protocol compatible: same Flink connector│
└──────────────┬───────────────────────────────────────┘
               │  Kafka connector (flink-sql-connector-kafka-4.0.1-2.0)
               ▼
┌──────────────────────────────────────────────────────────────────┐
│                   Apache Flink 2.0.1                              │
│                                                                    │
│  ─── BATCH MODE (default: catch-up processing) ──────────────── │
│  Bronze job:  kafka_raw_trips → iceberg_catalog.bronze.raw_trips  │
│    • Parse ISO 8601 timestamps → TIMESTAMP(3)                     │
│    • Add ingestion_ts = CURRENT_TIMESTAMP                         │
│    • scan.bounded.mode=latest-offset → stops when caught up       │
│    • table.dml-sync=true → blocks until job completes             │
│                                                                    │
│  Silver job:  bronze.raw_trips → silver.cleaned_trips             │
│    • ROW_NUMBER() OVER PARTITION BY natural key → deduplication   │
│    • Quality filters: fare≥0, distance≥0, date 2024-01           │
│    • Type casting: BIGINT→INT, DOUBLE→DECIMAL(10,2)               │
│    • Partitioned by pickup_date DATE                              │
│                                                                    │
│  ─── STREAMING MODE (continuous: make process-streaming) ──────── │
│  Bronze streaming: kafka_raw_trips → bronze.raw_trips             │
│    • Runs indefinitely as events arrive                           │
│    • 30s checkpoints for exactly-once fault tolerance             │
│    • WATERMARK FOR event_time (10s late arrival tolerance)        │
└──────────────────┬───────────────────────────────────────────────┘
                   │  Iceberg S3A writes (Parquet + ZSTD)
                   ▼
┌─────────────────────────────────────────────────────────────────┐
│                 Apache Iceberg 1.10.1 on MinIO                   │
│  s3://warehouse/bronze/raw_trips/     (format-version=1, unpart) │
│  s3://warehouse/silver/cleaned_trips/ (format-version=2, by date)│
│  ACID transactions, snapshot isolation, time travel              │
└──────────────────┬──────────────────────────────────────────────┘
                   │  iceberg_scan('s3://warehouse/silver/cleaned_trips')
                   ▼
┌──────────────────────────────────────────────────────────────────┐
│                    dbt-duckdb (in Docker)                         │
│  Sources → Staging → Intermediate → Core Marts → Analytics Marts  │
│  91 data quality tests covering: not_null, unique, relationships, │
│  accepted_values, custom business rules                           │
└──────────────────────────────────────────────────────────────────┘
```
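
The streaming branch in the diagram tolerates events that arrive up to 10 seconds late. A minimal sketch of that idea, assuming a single in-order tracker (the `WatermarkTracker` class is hypothetical; Flink's own watermarks are emitted periodically and per source split, so this only shows the arithmetic):

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(seconds=10)  # the 10s tolerance from the diagram


class WatermarkTracker:
    """Watermark = highest event time seen so far minus the allowed lateness."""

    def __init__(self, lateness: timedelta = ALLOWED_LATENESS):
        self.lateness = lateness
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lateness

    def observe(self, event_time: datetime) -> bool:
        """Record an event; return True if it arrived behind the current watermark."""
        current = self.watermark
        is_late = current is not None and event_time < current
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return is_late


tracker = WatermarkTracker()
base = datetime(2024, 1, 15, 12, 0, 0)
results = [
    tracker.observe(base),                          # first event
    tracker.observe(base + timedelta(seconds=30)),  # moves the watermark to 12:00:20
    tracker.observe(base + timedelta(seconds=25)),  # 5s out of order: tolerated
    tracker.observe(base + timedelta(seconds=5)),   # 25s out of order: late
]
print(results)
```

Only the last event falls behind the watermark, because it is more than 10 seconds older than the newest event already seen.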

### Technology Stack (Feb 2026, production-validated)

| Component | Version | Role | Key Property |
|-----------|---------|------|-------------|
| **Redpanda** | v25.3.7 | Message broker | Kafka-compatible, C++, 400 MB RAM |
| **Apache Flink** | 2.0.1 (Java 17) | Stream/batch processor | config.yaml (not flink-conf.yaml) |
| **Iceberg Flink runtime** | 1.10.1 for Flink 2.0 | Table format integration | iceberg-flink-runtime-2.0-1.10.1.jar |
| **Kafka connector** | 4.0.1-2.0 | Flink↔Redpanda bridge | Same JAR works with Kafka and Redpanda |
| **Apache Iceberg** | 1.10.1 | Open table format | ACID, time travel, deletion vectors |
| **MinIO** | RELEASE.2025-04-22 | Object storage | S3-compatible, local dev |
| **dbt-duckdb** | ≥1.9 | Analytics transformation | iceberg_scan() reads Iceberg directly |
| **DuckDB** | ≥1.1 | In-process query engine | 500 MB/s Parquet reads |

### P04 vs P01: Accurate Post-Audit Comparison

| Dimension | P01 (Kafka 4.0.0) | P04 (Redpanda v25.3.7) | Notes |
|-----------|-------------------|------------------------|-------|
| Broker runtime | JVM (Java 17) | Native binary (C++/Seastar) | |
| Broker startup | ~30s (KRaft negotiation) | ~3s (single binary) | ~27s savings |
| Broker memory | ~1.5 GB JVM heap | ~400 MB | ~1.1 GB savings |
| Total peak memory | ~5 GB | ~4.2 GB | ~800 MB savings |
| Services (always-on) | 5 | 5 | Same (kafka≈redpanda) |
| dbt tests | 94/94 (incl. vendor dim) | 91/91 | P01 has 3 extra vendor tests |
| Streaming SQL | ✅ 00-init-streaming + 07-streaming | ✅ Same (added Feb 2026) | Now identical |
| Dead Letter Queue | ✅ create-topics.sh + Makefile | ✅ Same (fixed Feb 2026) | Now identical |
| CPU limits | ✅ JM 1.0, TM 2.0 | ✅ Same (added Feb 2026) | Now identical |
| Lakekeeper REST catalog | ✅ opt-in `--profile lakekeeper` | Not included | P01 only, opt-in |
| Flink SQL | Identical (`kafka:9092`) | Identical (`redpanda:9092`) | 1 line different |
| dbt models | Identical | Identical | Same SQL |

> **Key insight:** Everything you learn in P04 transfers 1:1 to P01 and vice versa.
> The only code difference is one line in `00-init.sql`: `'properties.bootstrap.servers' = 'redpanda:9092'`
> (P01 points at `kafka:9092`). Flink SQL, dbt models, Makefile targets, and Iceberg table
> definitions are otherwise identical; the test counts differ only by P01's three extra vendor tests.

### Medallion Architecture (Bronze → Silver → Gold)

```
BRONZE  (raw landing)           SILVER  (trusted)              GOLD  (analytics)
──────────────────────          ─────────────────────          ─────────────────
• All raw events                • Quality-filtered             • fct_trips (star)
• Original column names         • Deduplicated                 • dim_dates
• Timestamps parsed             • Type-cast (INT, DECIMAL)     • dim_locations
• ingestion_ts added            • Partitioned by date          • dim_payment_types
• ~10,000 rows (all)            • ~9,855 rows (98.5%)          • dim_vendors
• format-version=1              • format-version=2             • mart_daily_revenue
• Unpartitioned                 • PARTITIONED BY pickup_date   • mart_hourly_demand
                                                               • mart_location_perf
Flink owns Bronze+Silver ──────────────────────────────► dbt owns Gold
```

**Separation of concerns:** Flink does what Flink is uniquely good at (ordering, dedup, type
coercion at stream speed). Business logic (trip duration, tip %, speed) lives in dbt where
it is version-controlled, tested, and documented in SQL that analysts can read.
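
As a rough illustration of what the Silver step does to Bronze rows, here is a pure-Python sketch of deduplication, quality filtering, and type casting. The `NATURAL_KEY` columns and the dedup-before-filter order are assumptions made for the example; the pipeline's actual key and ordering are defined in its Silver SQL.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical natural key, chosen for illustration only.
NATURAL_KEY = ("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime",
               "PULocationID", "DOLocationID")


def to_silver(bronze_rows):
    """Keep the first row per natural key, drop bad rows, cast fare to 2 decimals."""
    seen = set()
    silver = []
    for row in bronze_rows:
        key = tuple(row[col] for col in NATURAL_KEY)
        if key in seen:  # same effect as keeping ROW_NUMBER() = 1 per key
            continue
        seen.add(key)
        fare, distance = row["fare_amount"], row["trip_distance"]
        pickup = row["tpep_pickup_datetime"]
        # A NULL comparison is not true in SQL, so None fails the filter here too.
        if fare is None or distance is None or fare < 0 or distance < 0:
            continue
        if (pickup.year, pickup.month) != (2024, 1):
            continue
        silver.append({
            **row,
            "fare_amount": Decimal(str(fare)).quantize(Decimal("0.01")),
            "pickup_date": pickup.date(),  # partition column
        })
    return silver


trip = {
    "VendorID": 1,
    "tpep_pickup_datetime": datetime(2024, 1, 5, 8, 30),
    "tpep_dropoff_datetime": datetime(2024, 1, 5, 8, 45),
    "PULocationID": 161, "DOLocationID": 237,
    "fare_amount": 12.8, "trip_distance": 2.1,
}
bronze = [
    trip,
    dict(trip),                                    # exact duplicate
    {**trip, "VendorID": 2, "fare_amount": -3.0},  # negative fare
    {**trip, "VendorID": 2,
     "tpep_pickup_datetime": datetime(2023, 12, 31, 23, 59)},  # outside 2024-01
]
silver = to_silver(bronze)
print(len(silver), silver[0]["fare_amount"], silver[0]["pickup_date"])
```

Of the four Bronze rows, only the first survives: one is a duplicate, one has a negative fare, and one falls outside January 2024.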


## 2. Shared Infrastructure

The `shared/` directory is reused across all 24 pipelines. These files are **identical** in P01
and P04 — the only difference is the `BROKER_URL` environment variable in Docker Compose.

### 2.1 Flink Dockerfile

Pre-installs 7 JARs at image build time so container startup is fast.

**Why 7 JARs?**

| JAR | Purpose |
|-----|---------|
| `flink-sql-connector-kafka-4.0.1-2.0.jar` | Kafka/Redpanda source connector |
| `iceberg-flink-runtime-2.0-1.10.1.jar` | Iceberg table format sink + source |
| `iceberg-aws-bundle-1.10.1.jar` | Iceberg AWS utilities + S3FileIO |
| `hadoop-client-api-3.3.6.jar` | Hadoop client API (required for the Iceberg Hadoop catalog) |
| `hadoop-client-runtime-3.3.6.jar` | Hadoop client runtime (required for the Iceberg Hadoop catalog) |
| `hadoop-aws-3.3.6.jar` | S3A filesystem for MinIO |
| `aws-java-sdk-bundle-1.12.367.jar` | AWS SDK v1 (required by `hadoop-aws`) |

The Flink base image supplies the JobManager, TaskManager, and SQL client.

> **Flink 2.0 breaking change:** Config file renamed from `flink-conf.yaml` → `config.yaml`.
> The Dockerfile copies `config.yaml` — any pipeline still using `flink-conf.yaml` will silently
> use default settings, causing mysterious failures. All 10 Flink pipelines were migrated.
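
A small pre-flight check can catch the renamed config file before a container is started. This is a sketch only: `check_flink_conf` is a hypothetical helper, not part of the pipeline, and the warning texts simply restate the note above.

```python
import tempfile
from pathlib import Path


def check_flink_conf(conf_dir: str) -> list[str]:
    """Return warnings for a Flink 2.x conf directory that still relies on flink-conf.yaml."""
    conf = Path(conf_dir)
    has_new = (conf / "config.yaml").is_file()
    has_legacy = (conf / "flink-conf.yaml").is_file()
    warnings = []
    if has_legacy and not has_new:
        warnings.append("flink-conf.yaml found but config.yaml is missing: defaults will be used")
    elif has_legacy:
        warnings.append("flink-conf.yaml is ignored: remove it to avoid confusion")
    elif not has_new:
        warnings.append("no config.yaml found")
    return warnings


# Demo against a throwaway directory that only contains the legacy file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "flink-conf.yaml").write_text("parallelism.default: 2\n")
    print(check_flink_conf(tmp))
```

Running it against `./flink/conf` before `docker compose up` would be one way to use it.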


In [None]:
%%writefile ../shared/docker/flink.Dockerfile
# =============================================================================
# Shared Flink Image with Kafka + Iceberg Connectors
# =============================================================================
# Base: Flink 2.0.1 (Java 17)
# Adds: Kafka SQL connector, Iceberg Flink runtime, AWS S3 bundle
# Used by: Pipelines 01, 04, 07-09, 11-12, 16-18, 21, 23
# =============================================================================

FROM flink:2.0.1-java17

# Connector versions (Flink 2.0 requires new connector builds)
ARG FLINK_KAFKA_CONNECTOR_VERSION=4.0.1-2.0
ARG ICEBERG_VERSION=1.10.1
ARG FLINK_MAJOR_MINOR=2.0

# Download Kafka SQL connector (fat jar)
RUN wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/${FLINK_KAFKA_CONNECTOR_VERSION}/flink-sql-connector-kafka-${FLINK_KAFKA_CONNECTOR_VERSION}.jar" \
    && echo "Kafka SQL connector downloaded"

# Download Iceberg Flink runtime
RUN wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-${FLINK_MAJOR_MINOR}/${ICEBERG_VERSION}/iceberg-flink-runtime-${FLINK_MAJOR_MINOR}-${ICEBERG_VERSION}.jar" \
    && echo "Iceberg Flink runtime downloaded"

# Download Iceberg AWS bundle (for S3FileIO with MinIO)
RUN wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/${ICEBERG_VERSION}/iceberg-aws-bundle-${ICEBERG_VERSION}.jar" \
    && echo "Iceberg AWS bundle downloaded"

# Download Hadoop client (required for Iceberg Hadoop catalog)
ARG HADOOP_VERSION=3.3.6
RUN wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-api/${HADOOP_VERSION}/hadoop-client-api-${HADOOP_VERSION}.jar" \
    && wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-runtime/${HADOOP_VERSION}/hadoop-client-runtime-${HADOOP_VERSION}.jar" \
    && echo "Hadoop client jars downloaded"

# Download Hadoop AWS module (for S3A filesystem in Iceberg Hadoop catalog)
RUN wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar" \
    && echo "Hadoop AWS jar downloaded"

# Download AWS SDK v1 bundle (required by hadoop-aws)
ARG AWS_SDK_VERSION=1.12.367
RUN wget -q -P /opt/flink/lib/ \
    "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar" \
    && echo "AWS SDK bundle downloaded"

# Enable S3 filesystem plugin (for Flink checkpoints on S3)
RUN mkdir -p /opt/flink/plugins/s3-fs-hadoop \
    && cp /opt/flink/opt/flink-s3-fs-hadoop-*.jar /opt/flink/plugins/s3-fs-hadoop/ 2>/dev/null || true

# Verify all JARs are present
RUN ls -la /opt/flink/lib/flink-sql-connector-kafka*.jar \
           /opt/flink/lib/iceberg-flink-runtime*.jar \
           /opt/flink/lib/iceberg-aws-bundle*.jar \
           /opt/flink/lib/hadoop-client-*.jar \
           /opt/flink/lib/hadoop-aws-*.jar \
           /opt/flink/lib/aws-java-sdk-bundle-*.jar


### 2.2 dbt Dockerfile

Slim Python image with dbt-core, dbt-duckdb, pyarrow, and pandas. The DuckDB `iceberg` and
`httpfs` extensions are not baked in by this Dockerfile, so DuckDB has to install them at run time.

> **Design choice:** `dbt deps` runs when the container starts, not at build time (the Dockerfile
> below has no `RUN dbt deps` step). This ensures `packages.yml` is always respected and allows
> running without rebuilding the image. The trade-off: the first run takes ~10s longer. For
> production CI, pre-bake deps into the image.


In [None]:
%%writefile ../shared/docker/dbt.Dockerfile
FROM python:3.12-slim

# Build argument to select the dbt adapter
ARG DBT_ADAPTER=dbt-duckdb
ARG DBT_ADAPTER_VERSION=">=1.8"

WORKDIR /dbt

# Install dbt with the specified adapter
RUN pip install --no-cache-dir \
    "dbt-core>=1.8" \
    "${DBT_ADAPTER}${DBT_ADAPTER_VERSION}" \
    pyarrow \
    pandas

# For dbt-duckdb with Iceberg support
RUN if [ "$DBT_ADAPTER" = "dbt-duckdb" ]; then \
    pip install --no-cache-dir duckdb; \
    fi

# Copy dbt project (mounted or copied at build time)
COPY dbt_project/ /dbt/

ENTRYPOINT ["dbt"]
CMD ["build", "--profiles-dir", "."]


### 2.3 Data Generator

Reads NYC Yellow Taxi parquet files and produces JSON events to Redpanda (or Kafka).
Key design decisions:

| Decision | Implementation | Why |
|----------|---------------|-----|
| **Idempotent producer** | `enable.idempotence=True, acks=all` | Prevents duplicate events on retry |
| **pyarrow** for parquet | `pq.read_table()` + `to_batches(max_chunksize=10_000)` | Columnar read; the whole file is loaded, then rows are emitted in 10,000-row chunks |
| **NaN → null** | `orjson.dumps()` | JSON has no NaN; orjson writes NaN as `null`, which is what Flink expects |
| **confluent_kafka** | Works with Kafka AND Redpanda | Same wire protocol |
| **Metrics** | Throughput line at most every 5s (checked every 10,000 events) plus a final metrics JSON | Observability at the source |
| **Modes** | `burst` / `realtime` / `batch` | benchmark / demo / CI respectively; `realtime` is currently burst capped at 5,000 evt/s |
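
The NaN row in the table is easy to reproduce with the standard library alone. The sketch below is a stand-in for illustration, not the generator's code: `nan_to_none` is a hypothetical helper showing why the conversion matters when a serializer would otherwise emit a bare `NaN`.

```python
import json
import math


def nan_to_none(row: dict) -> dict:
    """Replace float NaN values with None so they serialize as JSON null."""
    return {
        key: None if isinstance(val, float) and math.isnan(val) else val
        for key, val in row.items()
    }


row = {"PULocationID": 161, "passenger_count": float("nan"), "fare_amount": 12.8}

print(json.dumps(row))               # contains a bare NaN token, which is not valid JSON
print(json.dumps(nan_to_none(row)))  # passenger_count is serialized as null
```

Strict JSON parsers reject the first output, which is why the value has to become `null` before it reaches the broker.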


In [None]:
%%writefile ../shared/data-generator/requirements.txt
pyarrow>=14.0.0
confluent-kafka>=2.3.0
orjson>=3.9.0


In [None]:
%%writefile ../shared/data-generator/generator.py
"""Taxi trip event generator.

Reads NYC Yellow Taxi parquet data and produces events to a Kafka-compatible
broker (Kafka or Redpanda). Supports three modes:
  - burst:    As fast as possible (benchmarking)
  - realtime: Burst mode capped at 5,000 events/sec (approximates real-time pacing)
  - batch:    Sends events in configurable batch sizes with delays

Configuration via environment variables:
  BROKER_URL    Kafka/Redpanda bootstrap servers  (default: localhost:9092)
  TOPIC         Target topic name                  (default: taxi.raw_trips)
  MODE          burst | realtime | batch           (default: burst)
  RATE_LIMIT    Max events/sec in burst mode, 0=unlimited (default: 0)
  BATCH_SIZE    Events per batch in batch mode     (default: 1000)
  BATCH_DELAY   Seconds between batches            (default: 1.0)
  DATA_PATH     Path to parquet file               (default: /data/yellow_tripdata_2024-01.parquet)
  MAX_EVENTS    Stop after N events, 0=all         (default: 0)

Usage:
    python generator.py
    python generator.py --mode burst --broker localhost:9092
"""

import argparse
import os
import sys
import time
from datetime import datetime

import orjson
import pyarrow.parquet as pq
from confluent_kafka import Producer


def delivery_callback(err, msg):
    if err is not None:
        print(f"  [ERROR] Delivery failed: {err}", file=sys.stderr)


def read_parquet(path: str, max_events: int = 0):
    """Yield rows from parquet file as dicts."""
    table = pq.read_table(path)
    total = table.num_rows if max_events == 0 else min(max_events, table.num_rows)
    print(f"  Source: {path} ({table.num_rows:,} rows, sending {total:,})")

    batches = table.to_batches(max_chunksize=10_000)
    sent = 0
    for batch in batches:
        for row in batch.to_pylist():
            if sent >= total:
                return
            # Convert timestamps to ISO strings for JSON serialization
            for key, val in row.items():
                if isinstance(val, datetime):
                    row[key] = val.isoformat()
            yield row
            sent += 1


def create_producer(broker_url: str) -> Producer:
    conf = {
        "bootstrap.servers": broker_url,
        "enable.idempotence": True,
        "acks": "all",
        "linger.ms": 5,
        "batch.num.messages": 10000,
        "queue.buffering.max.messages": 500000,
        "queue.buffering.max.kbytes": 1048576,
        "compression.type": "lz4",
    }
    return Producer(conf)


def produce_burst(producer: Producer, topic: str, rows, rate_limit: int):
    """Produce as fast as possible, optionally rate-limited."""
    count = 0
    start = time.perf_counter()
    last_report = start

    for row in rows:
        key = str(row.get("PULocationID", "")).encode("utf-8")
        value = orjson.dumps(row)
        producer.produce(topic, value=value, key=key, callback=delivery_callback)
        count += 1

        if count % 10000 == 0:
            producer.poll(0)
            now = time.perf_counter()
            if now - last_report >= 5.0:
                elapsed = now - start
                rate = count / elapsed
                print(f"  Produced {count:,} events ({rate:,.0f} evt/s)")
                last_report = now

        # Rate limiting
        if rate_limit > 0 and count % rate_limit == 0:
            elapsed = time.perf_counter() - start
            expected = count / rate_limit
            if elapsed < expected:
                time.sleep(expected - elapsed)

    producer.flush(timeout=30)
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else 0
    return count, elapsed, rate


def produce_batch(producer: Producer, topic: str, rows, batch_size: int, batch_delay: float):
    """Produce in fixed-size batches with delays between them."""
    count = 0
    batch_count = 0
    start = time.perf_counter()

    batch_buffer = []
    for row in rows:
        batch_buffer.append(row)
        if len(batch_buffer) >= batch_size:
            for r in batch_buffer:
                key = str(r.get("PULocationID", "")).encode("utf-8")
                value = orjson.dumps(r)
                producer.produce(topic, value=value, key=key, callback=delivery_callback)
                count += 1
            producer.flush(timeout=30)
            batch_count += 1
            elapsed = time.perf_counter() - start
            rate = count / elapsed if elapsed > 0 else 0
            print(f"  Batch {batch_count}: {count:,} total ({rate:,.0f} evt/s)")
            batch_buffer = []
            time.sleep(batch_delay)

    # Final partial batch
    if batch_buffer:
        for r in batch_buffer:
            key = str(r.get("PULocationID", "")).encode("utf-8")
            value = orjson.dumps(r)
            producer.produce(topic, value=value, key=key, callback=delivery_callback)
            count += 1
        producer.flush(timeout=30)

    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else 0
    return count, elapsed, rate


def main():
    parser = argparse.ArgumentParser(description="Taxi trip event generator")
    parser.add_argument("--broker", default=os.environ.get("BROKER_URL", "localhost:9092"))
    parser.add_argument("--topic", default=os.environ.get("TOPIC", "taxi.raw_trips"))
    parser.add_argument("--mode", default=os.environ.get("MODE", "burst"),
                        choices=["burst", "realtime", "batch"])
    parser.add_argument("--rate-limit", type=int,
                        default=int(os.environ.get("RATE_LIMIT", "0")))
    parser.add_argument("--batch-size", type=int,
                        default=int(os.environ.get("BATCH_SIZE", "1000")))
    parser.add_argument("--batch-delay", type=float,
                        default=float(os.environ.get("BATCH_DELAY", "1.0")))
    parser.add_argument("--data-path",
                        default=os.environ.get("DATA_PATH", "/data/yellow_tripdata_2024-01.parquet"))
    parser.add_argument("--max-events", type=int,
                        default=int(os.environ.get("MAX_EVENTS", "0")))
    args = parser.parse_args()

    print("=" * 60)
    print("  Taxi Trip Event Generator")
    print("=" * 60)
    print(f"  Broker:     {args.broker}")
    print(f"  Topic:      {args.topic}")
    print(f"  Mode:       {args.mode}")
    print(f"  Data:       {args.data_path}")
    max_events_str = "all" if args.max_events == 0 else f"{args.max_events:,}"
    print(f"  Max events: {max_events_str}")
    print()

    producer = create_producer(args.broker)
    rows = read_parquet(args.data_path, args.max_events)

    if args.mode == "burst":
        count, elapsed, rate = produce_burst(producer, args.topic, rows, args.rate_limit)
    elif args.mode == "batch":
        count, elapsed, rate = produce_batch(
            producer, args.topic, rows, args.batch_size, args.batch_delay
        )
    else:
        # realtime mode: use burst with rate limiting to approximate real-time
        count, elapsed, rate = produce_burst(producer, args.topic, rows, rate_limit=5000)

    print()
    print("=" * 60)
    print("  GENERATOR COMPLETE")
    print(f"  Events:  {count:,}")
    print(f"  Elapsed: {elapsed:.2f}s")
    print(f"  Rate:    {rate:,.0f} events/sec")
    print("=" * 60)

    # Write metrics for benchmark collection
    metrics_path = os.environ.get("METRICS_PATH", "/tmp/generator_metrics.json")
    metrics = {
        "events": count,
        "elapsed_seconds": round(elapsed, 3),
        "events_per_second": round(rate, 1),
        "mode": args.mode,
        "broker": args.broker,
        "topic": args.topic,
    }
    with open(metrics_path, "wb") as f:
        f.write(orjson.dumps(metrics))
    print(f"  Metrics written to {metrics_path}")


if __name__ == "__main__":
    main()


### 2.4 Event Schema (JSON Schema)

Documents the taxi trip event contract. Not enforced at ingestion (no Schema Registry) —
enforcement happens via Flink SQL column definitions and dbt tests downstream.

> **Why no Schema Registry?** For JSON-over-Kafka at 10k–1M events/day, the overhead of
> schema registration (REST API call per producer start, Avro serialization) adds complexity
> without proportional benefit. Schema Registry makes sense for Avro/Protobuf at 10M+/day
> or when multiple teams need a formal contract registry.
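
Without a registry, a producer or consumer can still run a cheap contract check in code. The following is a minimal sketch using only the standard library; `validate_event`, `REQUIRED`, and `FIELD_TYPES` are hypothetical names, and the field list is a small subset of the schema chosen for illustration.

```python
from datetime import datetime

REQUIRED = ("tpep_pickup_datetime", "tpep_dropoff_datetime")

# Accepted Python types per field (subset of the contract, for illustration).
FIELD_TYPES = {
    "tpep_pickup_datetime": (str,),
    "tpep_dropoff_datetime": (str,),
    "passenger_count": (int, type(None)),
    "trip_distance": (int, float, type(None)),
    "fare_amount": (int, float, type(None)),
}


def validate_event(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    errors = []
    for name in REQUIRED:
        if name not in event:
            errors.append(f"missing required field: {name}")
    for name, allowed in FIELD_TYPES.items():
        if name not in event:
            continue
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            errors.append(f"{name}: unexpected type {type(value).__name__}")
    for name in REQUIRED:
        value = event.get(name)
        if isinstance(value, str):
            try:
                datetime.fromisoformat(value)
            except ValueError:
                errors.append(f"{name}: not an ISO 8601 timestamp")
    return errors


good = {"tpep_pickup_datetime": "2024-01-05T08:30:00",
        "tpep_dropoff_datetime": "2024-01-05T08:45:00",
        "fare_amount": 12.8}
bad = {"tpep_pickup_datetime": "yesterday", "fare_amount": "12.80"}

print(validate_event(good))
print(validate_event(bad))
```

The second event fails on three counts: the dropoff timestamp is missing, the fare is a string, and the pickup timestamp does not parse.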


In [None]:
%%writefile ../shared/schemas/taxi_trip.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "TaxiTrip",
  "description": "NYC Yellow Taxi trip record. Field names match the raw parquet source exactly.",
  "type": "object",
  "properties": {
    "VendorID": {"type": ["integer", "null"], "description": "TPEP provider: 1=Creative Mobile Technologies, 2=VeriFone Inc."},
    "tpep_pickup_datetime": {"type": "string", "format": "date-time", "description": "Meter engaged timestamp (ISO 8601)"},
    "tpep_dropoff_datetime": {"type": "string", "format": "date-time", "description": "Meter disengaged timestamp (ISO 8601)"},
    "passenger_count": {"type": ["integer", "null"], "description": "Number of passengers (driver-entered)"},
    "trip_distance": {"type": ["number", "null"], "description": "Trip distance in miles from taximeter"},
    "RatecodeID": {"type": ["integer", "null"], "description": "Rate code: 1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group"},
    "store_and_fwd_flag": {"type": ["string", "null"], "description": "Y=stored then forwarded, N=not a store-and-forward trip"},
    "PULocationID": {"type": ["integer", "null"], "description": "TLC Taxi Zone pickup location ID"},
    "DOLocationID": {"type": ["integer", "null"], "description": "TLC Taxi Zone dropoff location ID"},
    "payment_type": {"type": ["integer", "null"], "description": "Payment method: 1=Credit, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided"},
    "fare_amount": {"type": ["number", "null"], "description": "Time-and-distance fare in dollars"},
    "extra": {"type": ["number", "null"], "description": "Misc extras and surcharges"},
    "mta_tax": {"type": ["number", "null"], "description": "MTA tax"},
    "tip_amount": {"type": ["number", "null"], "description": "Tip amount"},
    "tolls_amount": {"type": ["number", "null"], "description": "Total tolls paid during trip"},
    "improvement_surcharge": {"type": ["number", "null"], "description": "$0.30 improvement surcharge"},
    "total_amount": {"type": ["number", "null"], "description": "Total amount charged to passengers"},
    "congestion_surcharge": {"type": ["number", "null"], "description": "NYC congestion surcharge"},
    "Airport_fee": {"type": ["number", "null"], "description": "$1.25 for pickups at LaGuardia and JFK"}
  },
  "required": ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
}


## 3. Docker Compose: Container Orchestration

P04 runs **5 always-on containers**, 1 one-shot init container (`p04-mc-init`), and 2 profile containers:

| Container | Image | CPU limit | Memory limit | Role |
|-----------|-------|-----------|-------------|------|
| `p04-redpanda` | `redpandadata/redpanda:v25.3.7` | (Redpanda manages via --smp 1) | 1.5 GB | Kafka-compatible broker |
| `p04-redpanda-console` | `redpandadata/console:v3.2.2` | — | — | Web UI: topic browser, consumer lag |
| `p04-minio` | `minio/minio:RELEASE.2025-04-22...` | — | 1 GB | S3-compatible object store |
| `p04-mc-init` | `minio/mc:RELEASE.2025-05-21...` | — | — | One-shot: creates `warehouse` bucket |
| `p04-flink-jobmanager` | flink-custom (shared Dockerfile) | **1.0 CPU** | 2 GB | Flink coordinator |
| `p04-flink-taskmanager` | flink-custom (shared Dockerfile) | **2.0 CPU** | 2.5 GB | Flink worker (executes SQL) |
| `p04-dbt` (profile: dbt) | dbt-custom (shared Dockerfile) | — | — | dbt build + tests |
| `p04-data-generator` (profile: generator) | data-generator | — | — | Taxi event producer |

**CPU limits (added Feb 2026 audit):** Without `cpus` limits, all containers compete for
Docker Desktop's shared CPU pool. During Flink processing, the TaskManager can consume all
cores, starving MinIO and causing S3A write timeouts. Capping TM at 2 CPUs prevents this.

### Network topology
- All services: `p04-pipeline-net` (bridge, isolated from host)
- Redpanda Kafka API inside network: `redpanda:9092`
- Redpanda Kafka API external (for local tools): `localhost:19092`
- MinIO S3 API inside network: `minio:9000`
- Flink REST API and Web UI: `flink-jobmanager:8081` (internal), `localhost:8081` (external)
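
Because Redpanda advertises a different address on each listener, a client has to pick the bootstrap address that matches where it runs. A small sketch of that choice, assuming the addresses listed above (`bootstrap_servers` and `resolve_broker_url` are hypothetical helpers; `BROKER_URL` is the variable the data generator already reads):

```python
import os

INTERNAL_BOOTSTRAP = "redpanda:9092"    # for containers on the pipeline network
EXTERNAL_BOOTSTRAP = "localhost:19092"  # for tools running on the host


def bootstrap_servers(inside_network: bool) -> str:
    """Pick the listener that matches where the client is running."""
    return INTERNAL_BOOTSTRAP if inside_network else EXTERNAL_BOOTSTRAP


def resolve_broker_url(env=os.environ) -> str:
    """Prefer an explicit BROKER_URL; otherwise assume the caller runs on the host."""
    return env.get("BROKER_URL") or bootstrap_servers(inside_network=False)


print(resolve_broker_url({}))
print(resolve_broker_url({"BROKER_URL": INTERNAL_BOOTSTRAP}))
```

A host-side script that used `redpanda:9092` would typically fail at name resolution, since that hostname only exists inside the Docker network.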

### Healthcheck chain
```
minio healthy → mc-init completes → (Flink services start)
redpanda healthy → flink-jobmanager starts → flink-taskmanager starts
```
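
The chain above is a dependency graph, so a valid startup order can be derived with a topological sort. This sketch encodes the dependencies as stated in the chain (the `DEPENDS_ON` mapping is written by hand for illustration, not parsed from `docker-compose.yml`):

```python
from graphlib import TopologicalSorter

# Each service maps to the services it waits for.
DEPENDS_ON = {
    "minio": set(),
    "redpanda": set(),
    "mc-init": {"minio"},
    "flink-jobmanager": {"redpanda", "mc-init"},
    "flink-taskmanager": {"flink-jobmanager"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

The relative order of `minio` and `redpanda` is not fixed, since neither depends on the other; Docker Compose can start them in parallel.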


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/docker-compose.yml
# =============================================================================
# Pipeline 04: Redpanda + Flink + Iceberg
# =============================================================================
# Architecture: Redpanda → Flink SQL → Iceberg (on MinIO) → dbt (DuckDB)
# Fork of Pipeline 01 with Redpanda replacing Kafka + Schema Registry.
# =============================================================================

x-flink-common: &flink-common
  build:
    context: .
    dockerfile: ../../shared/docker/flink.Dockerfile
  environment: &flink-env
    FLINK_PROPERTIES: |
      jobmanager.rpc.address: flink-jobmanager
      taskmanager.numberOfTaskSlots: 4
      parallelism.default: 2
      state.backend: hashmap
      state.checkpoints.dir: file:///tmp/flink-checkpoints
      execution.checkpointing.interval: 30s
      rest.flamegraph.enabled: true
      classloader.check-leaked-classloader: false
  networks:
    - pipeline-net

services:
  # ---------------------------------------------------------------------------
  # Redpanda (single node - replaces Kafka + Schema Registry)
  # ---------------------------------------------------------------------------
  redpanda:
    image: redpandadata/redpanda:v25.3.7
    container_name: p04-redpanda
    hostname: redpanda
    restart: unless-stopped
    command:
      - redpanda start
      - --smp 1
      - --memory 1G
      - --overprovisioned
      - --node-id 0
      - --kafka-addr internal://0.0.0.0:9092,external://0.0.0.0:19092
      - --advertise-kafka-addr internal://redpanda:9092,external://localhost:19092
      - --pandaproxy-addr internal://0.0.0.0:8082,external://0.0.0.0:18082
      - --advertise-pandaproxy-addr internal://redpanda:8082,external://localhost:18082
    ports:
      - "19092:19092"  # Kafka API
      - "18082:18082"  # Pandaproxy
      - "9644:9644"    # Admin API
    volumes:
      - redpanda-data:/var/lib/redpanda/data
    healthcheck:
      test: ["CMD-SHELL", "rpk cluster health | grep -E 'Healthy:.+true' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 15
      start_period: 20s
    deploy:
      resources:
        limits:
          memory: 1.5G
        reservations:
          memory: 512m
    networks:
      - pipeline-net

  # ---------------------------------------------------------------------------
  # Redpanda Console (web UI for topic browsing and consumer lag)
  # ---------------------------------------------------------------------------
  redpanda-console:
    image: redpandadata/console:v3.2.2
    container_name: p04-redpanda-console
    hostname: redpanda-console
    restart: unless-stopped
    ports:
      - "8085:8080"
    environment:
      KAFKA_BROKERS: redpanda:9092
    depends_on:
      redpanda:
        condition: service_healthy
    networks:
      - pipeline-net

  # ---------------------------------------------------------------------------
  # MinIO (S3-compatible object storage for Iceberg warehouse)
  # ---------------------------------------------------------------------------
  minio:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    container_name: p04-minio
    hostname: minio
    restart: unless-stopped
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    healthcheck:
      test: mc ready local || exit 1
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 10s
    volumes:
      - minio-data:/data
    deploy:
      resources:
        limits:
          memory: 1G
        reservations:
          memory: 256m
    networks:
      - pipeline-net

  # ---------------------------------------------------------------------------
  # MinIO Client Init (create warehouse bucket)
  # ---------------------------------------------------------------------------
  mc-init:
    image: minio/mc:RELEASE.2025-05-21T01-59-54Z
    container_name: p04-mc-init
    depends_on:
      minio:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c "
      mc alias set myminio http://minio:9000 minioadmin minioadmin &&
      mc mb myminio/warehouse --ignore-existing &&
      mc anonymous set download myminio/warehouse &&
      echo 'Bucket warehouse created successfully'
      "
    networks:
      - pipeline-net

  # ---------------------------------------------------------------------------
  # Flink JobManager
  # ---------------------------------------------------------------------------
  flink-jobmanager:
    <<: *flink-common
    container_name: p04-flink-jobmanager
    hostname: flink-jobmanager
    restart: unless-stopped
    command: jobmanager
    ports:
      - "8081:8081"
      - "9249:9249"  # Prometheus metrics
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
        reservations:
          memory: 1G
    volumes:
      - ./flink/sql:/opt/flink/sql:ro
      - ./flink/conf/config.yaml:/opt/flink/conf/config.yaml:ro
      - ./flink/conf/core-site.xml:/opt/hadoop/conf/core-site.xml:ro
      - flink-checkpoints:/tmp/flink-checkpoints
    depends_on:
      redpanda:
        condition: service_healthy
      mc-init:
        condition: service_completed_successfully
    healthcheck:
      test: curl -f http://localhost:8081/overview || exit 1
      interval: 10s
      timeout: 5s
      retries: 15
      start_period: 30s
    environment:
      <<: *flink-env
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadmin
      AWS_REGION: us-east-1
      HADOOP_CONF_DIR: /opt/hadoop/conf

  # ---------------------------------------------------------------------------
  # Flink TaskManager
  # ---------------------------------------------------------------------------
  flink-taskmanager:
    <<: *flink-common
    container_name: p04-flink-taskmanager
    hostname: flink-taskmanager
    restart: unless-stopped
    command: taskmanager
    deploy:
      resources:
        limits:
          memory: 2.5G
          cpus: '2.0'
        reservations:
          memory: 2G
    volumes:
      - ./flink/conf/core-site.xml:/opt/hadoop/conf/core-site.xml:ro
      - flink-checkpoints:/tmp/flink-checkpoints
    depends_on:
      flink-jobmanager:
        condition: service_healthy
    environment:
      <<: *flink-env
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadmin
      AWS_REGION: us-east-1
      HADOOP_CONF_DIR: /opt/hadoop/conf

  # ---------------------------------------------------------------------------
  # dbt (DuckDB adapter - reads Iceberg tables from MinIO)
  # ---------------------------------------------------------------------------
  dbt:
    build:
      context: .
      dockerfile: ../../shared/docker/dbt.Dockerfile
      args:
        DBT_ADAPTER: dbt-duckdb
    container_name: p04-dbt
    volumes:
      - ./dbt_project:/dbt
    working_dir: /dbt
    entrypoint: ["/bin/sh", "-c"]
    command: ["dbt deps --profiles-dir . && dbt build --full-refresh --profiles-dir ."]
    environment:
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadmin
      AWS_ENDPOINT_URL: http://minio:9000
      AWS_REGION: us-east-1
      DBT_PROFILES_DIR: /dbt
    depends_on:
      minio:
        condition: service_healthy
    profiles:
      - dbt
    networks:
      - pipeline-net

  # ---------------------------------------------------------------------------
  # Data Generator (reads parquet, produces to Redpanda)
  # ---------------------------------------------------------------------------
  data-generator:
    build:
      context: ../../shared/data-generator/
      dockerfile: Dockerfile
    container_name: p04-data-generator
    volumes:
      - ../../data:/data:ro
    environment:
      BROKER_URL: redpanda:9092
      TOPIC: taxi.raw_trips
      MODE: burst
      DATA_PATH: /data/yellow_tripdata_2024-01.parquet
    depends_on:
      redpanda:
        condition: service_healthy
    profiles:
      - generator
    networks:
      - pipeline-net

# =============================================================================
# Volumes
# =============================================================================
volumes:
  minio-data:
    driver: local
  flink-checkpoints:
    driver: local
  redpanda-data:
    driver: local

# =============================================================================
# Networks
# =============================================================================
networks:
  pipeline-net:
    name: p04-pipeline-net
    driver: bridge


## 4. Redpanda Topics: Event Ingestion + Dead Letter Queue

### Two-Topic Design: Primary + DLQ

```
Data Generator
     │
     │  valid JSON events
     ▼
taxi.raw_trips          ← 3 partitions, 72h retention, primary stream
     │
     │  (Flink reads this, processes into Bronze/Silver)
     ▼
[malformed events that fail Flink type parsing are sent to DLQ]
     │
     ▼
taxi.raw_trips.dlq      ← 1 partition, 7-day retention, dead letter queue
```

### Why a Dead Letter Queue?

Without a DLQ, a single malformed event (truncated JSON, a string where a number is expected)
can stall the whole pipeline. By default, Flink's JSON deserializer throws on a parse failure,
which fails the job, and every restart hits the same bad record again. A DLQ gives you:
- **Visibility:** You can see which events failed and why
- **Replay:** After fixing the schema/parser, you can re-consume the DLQ
- **Non-blocking:** Bad events don't stop good events from flowing
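
The routing idea can be sketched in plain Python. This is a conceptual illustration only, not the mechanism Flink uses: `REQUIRED_FIELDS` and `route` are hypothetical names, and the field subset is an assumption drawn from the taxi schema used later in this notebook.

```python
import json

# Hypothetical subset of fields and their expected JSON types (illustration only).
REQUIRED_FIELDS = {
    "VendorID": int,
    "tpep_pickup_datetime": str,
    "fare_amount": (int, float),
}

def route(raw_messages):
    """Split raw payloads into (valid_events, dlq_entries) without stopping on bad input."""
    valid, dlq = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("payload is not a JSON object")
            for field, expected in REQUIRED_FIELDS.items():
                if not isinstance(event.get(field), expected):
                    raise ValueError(f"field {field!r} is missing or has the wrong type")
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the reason, so the event can be replayed later.
            dlq.append({"payload": raw, "error": str(exc)})
        else:
            valid.append(event)
    return valid, dlq

messages = [
    '{"VendorID": 1, "tpep_pickup_datetime": "2024-01-15T08:30:00", "fare_amount": 12.5}',
    '{"VendorID": "one", "tpep_pickup_datetime": "2024-01-15T08:30:00", "fare_amount": 12.5}',
    '{not json',
]
valid, dlq = route(messages)
print(len(valid), "valid,", len(dlq), "sent to DLQ")
```

Storing the failure reason next to the payload is what provides the visibility and replay properties listed above.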

### Topic Configuration

| Setting | Primary | DLQ | Reason |
|---------|---------|-----|--------|
| Partitions | 3 | 1 | DLQ is low-volume, no need for parallelism |
| `retention.ms` | 259,200,000 (72h) | 604,800,000 (7 days) | DLQ kept longer for investigation |
| `cleanup.policy` | `delete` | `delete` | Time-based expiry (not compaction) |
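
The `retention.ms` values in the table are just the retention periods expressed in milliseconds. A small sketch of the arithmetic (the helper name `retention_ms` is illustrative, not part of the pipeline):

```python
from datetime import timedelta

def retention_ms(period: timedelta) -> int:
    """Convert a retention period to the integer milliseconds expected by retention.ms."""
    return period // timedelta(milliseconds=1)

primary_ms = retention_ms(timedelta(hours=72))  # primary topic: 72 hours
dlq_ms = retention_ms(timedelta(days=7))        # DLQ: 7 days
print(primary_ms, dlq_ms)
```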

### Makefile vs create-topics.sh

The Makefile's `create-topics` target creates both topics **inline** via `rpk`. The `create-topics.sh`
shell script also creates both topics. Either works — the Makefile target is the standard entry point.

> **Audit fix (Feb 2026):** The original Makefile `create-topics` only created the primary topic,
> bypassing the DLQ. Both the shell script and Makefile now create both topics consistently.

### rpk vs kafka-topics.sh

```bash
# Kafka (JVM, ~3s including JVM startup):
kafka-topics.sh --bootstrap-server kafka:9092 --create --topic taxi.raw_trips --partitions 3 --config retention.ms=259200000

# Redpanda rpk (native Go binary, ~50ms):
rpk topic create taxi.raw_trips --brokers redpanda:9092 --partitions 3 --topic-config retention.ms=259200000
```


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/kafka/create-topics.sh
#!/bin/bash
# =============================================================================
# Pipeline 04: Create Redpanda Topics
# =============================================================================
# Creates the required topics for the taxi trip streaming pipeline.
# Uses rpk (Redpanda CLI) instead of kafka-topics.sh.
#
# Usage:
#   docker compose exec redpanda rpk topic create ...
#   -- or --
#   make create-topics
# =============================================================================

set -euo pipefail

echo "Creating topics via rpk..."
rpk topic create taxi.raw_trips \
    --brokers redpanda:9092 \
    --partitions 3 \
    --replicas 1 \
    --topic-config retention.ms=259200000 \
    --topic-config cleanup.policy=delete || true

# Dead Letter Queue: for poison messages that fail processing
rpk topic create taxi.raw_trips.dlq \
    --brokers redpanda:9092 \
    --partitions 1 \
    --replicas 1 \
    --topic-config retention.ms=604800000 \
    --topic-config cleanup.policy=delete || true

rpk topic list --brokers redpanda:9092
echo "Topic creation complete."


## 5. Flink Configuration

### 5.1 Hadoop core-site.xml — S3A → MinIO Bridge

Flink uses Apache Hadoop's S3A filesystem driver to read and write Iceberg files stored in MinIO.
This XML file maps the S3A URI scheme (`s3a://`) to MinIO's endpoint.

```
Flink SQL INSERT INTO iceberg_catalog.bronze.raw_trips
     │   uses S3FileIO → s3a://warehouse/bronze/raw_trips/data/...
     ▼
Hadoop S3A FileSystem Driver
     │   fs.s3a.endpoint = http://minio:9000
     │   fs.s3a.path.style.access = true  (MinIO uses path-style, not virtual-hosted)
     ▼
MinIO HTTP server on port 9000
```

Without `path.style.access=true`, Hadoop would try to reach `warehouse.minio:9000` (DNS lookup
fails in Docker) instead of `minio:9000/warehouse/` (correct Docker network address).
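
The difference between the two addressing styles is easiest to see by building both URLs. A minimal sketch, assuming the endpoint and bucket from this pipeline; the helper names and the object key are illustrative:

```python
from urllib.parse import urlsplit

def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    """Bucket goes in the URL path: what MinIO expects."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key}"

def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    """Bucket goes in the hostname: the style that breaks on the Docker network."""
    parts = urlsplit(endpoint)
    return f"{parts.scheme}://{bucket}.{parts.netloc}/{key}"

endpoint, bucket = "http://minio:9000", "warehouse"
key = "bronze/raw_trips/data/example.parquet"  # illustrative object key

print(path_style_url(endpoint, bucket, key))
print(virtual_hosted_url(endpoint, bucket, key))
```

Only the first form resolves inside `p04-pipeline-net`, because Docker's DNS knows the hostname `minio` but not `warehouse.minio`.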


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- MinIO (S3-compatible) Configuration for Hadoop S3A -->
    <property>
        <name>fs.s3a.endpoint</name>
        <value>http://minio:9000</value>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <value>minioadmin</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>minioadmin</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
    </property>
</configuration>


### 5.2 Flink Cluster Configuration (config.yaml)

> **Flink 2.0 breaking change:** `flink-conf.yaml` → `config.yaml`. If you mount the old
> filename, Flink silently uses defaults — no error. This caused mysterious TaskManager failures
> until the rename was discovered. All pipelines migrated in the Feb 2026 upgrade.

Key settings with explanations:

| Setting | Value | Why it matters |
|---------|-------|---------------|
| `classloader.check-leaked-classloader` | `false` | Iceberg loads classes dynamically; without this Flink throws classloader leak warnings that abort jobs |
| `taskmanager.memory.process.size` | `2048m` | Total TM process memory (JVM heap + off-heap + metaspace + JVM overhead). Must stay below the docker-compose limit of `memory: 2.5G` so the container keeps headroom |
| `parallelism.default` | `2` | Default operator parallelism; the TaskManager exposes 4 slots (`taskmanager.numberOfTaskSlots`). Adequate for the 3-partition topic and the 10k benchmark |
| `state.backend` | `rocksdb` | Disk-backed state with incremental checkpoints (`state.backend.incremental: true`), suited to long-running streaming jobs. `hashmap` keeps state in memory and is a faster option if only batch jobs run |
| `execution.checkpointing.interval` | `30s` | How often Flink snapshots operator state. Required for exactly-once in streaming mode |

> **Streaming mode note:** `make process-streaming` depends on the `state.backend: rocksdb` and
> `state.checkpoints.dir: s3a://warehouse/checkpoints` settings in the config.yaml below, so keep
> them in place for production. Batch mode doesn't need checkpoint recovery.
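
The relationship between Flink's process size and the container memory limit can be checked with a few lines of Python. This is a sketch, assuming both tools interpret `m`/`G` suffixes as binary units (MiB/GiB); the helper names are illustrative:

```python
UNITS_MIB = {"m": 1, "g": 1024}

def to_mib(size: str) -> float:
    """Parse sizes such as '2048m', '2.5G', or '2G' into MiB."""
    number, unit = size[:-1], size[-1].lower()
    return float(number) * UNITS_MIB[unit]

def headroom_mib(flink_process_size: str, container_limit: str) -> float:
    """Memory left in the container after Flink's configured process size."""
    return to_mib(container_limit) - to_mib(flink_process_size)

# Values taken from config.yaml and docker-compose.yml in this notebook.
tm_headroom = headroom_mib("2048m", "2.5G")  # TaskManager
jm_headroom = headroom_mib("1600m", "2G")    # JobManager
print(f"TaskManager headroom: {tm_headroom:.0f} MiB, JobManager headroom: {jm_headroom:.0f} MiB")
```

A negative result would mean the container limit is smaller than what Flink is configured to use, which typically surfaces as the container being OOM-killed.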


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/conf/config.yaml
# =============================================================================
# Pipeline 04: Flink Configuration
# =============================================================================
# Configuration for Flink 2.0.1 with Iceberg + Redpanda connectors.
# This file is mounted into the JobManager container.
# =============================================================================

# Cluster
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
jobmanager.bind-host: 0.0.0.0
jobmanager.memory.process.size: 1600m

taskmanager.bind-host: 0.0.0.0
taskmanager.host: flink-taskmanager
taskmanager.memory.process.size: 2048m
taskmanager.numberOfTaskSlots: 4

parallelism.default: 2

# REST API (Flink Dashboard)
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
rest.port: 8081
rest.flamegraph.enabled: true

# Checkpointing
execution.checkpointing.interval: 30s
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.min-pause: 10s
execution.checkpointing.timeout: 5min
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: s3a://warehouse/checkpoints
state.savepoints.dir: s3a://warehouse/savepoints

# Table / SQL Configuration
table.exec.state.ttl: 0
table.exec.sink.not-null-enforcer: DROP

# Classloader (avoid Iceberg classloader leak with batch DML sync)
classloader.check-leaked-classloader: false

# S3 (MinIO) filesystem configuration
s3.endpoint: http://minio:9000
s3.access-key: minioadmin
s3.secret-key: minioadmin
s3.path.style.access: true

# Logging
env.log.max: 5
env.log.dir: /opt/flink/log

# Metrics: Prometheus reporter (scrape port 9249 on JobManager + TaskManager)
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249


## 6. Flink SQL: Batch Session Initialization

`00-init.sql` is the **most important file in the pipeline**. It defines the entire session
state that all subsequent SQL files inherit.

### How the Flink SQL CLI Session Works

```bash
# -i flag: "initialize" — runs before interactive or -f execution
sql-client.sh embedded -i 00-init.sql -f 05-bronze.sql
#                       ─────────────  ───────────────
#                        init file      batch execute file
#
# Result: 00-init.sql creates tables and catalog in-session.
#         05-bronze.sql can then reference kafka_raw_trips and iceberg_catalog.
```

The `-i` and `-f` flags share the same SQL session — like running two scripts in the same
database connection. Without `-i`, the `-f` file would see an empty session with no tables.

### Three Things 00-init.sql Does

#### 1. Batch execution settings
```sql
SET 'execution.runtime-mode' = 'batch';       -- process bounded data and stop
SET 'table.dml-sync' = 'true';               -- block after each INSERT until complete
```
Without `table.dml-sync=true`, the Silver job would start before Bronze finishes writing,
resulting in 0 Silver rows (reading an empty or partially-written Bronze table).

#### 2. Redpanda source table (virtual — reads from topic)
The `kafka_raw_trips` table is never stored anywhere. It's a virtual table that maps
Redpanda topic messages to SQL columns. Every row in the topic becomes a SQL row.

Key option: `'scan.bounded.mode' = 'latest-offset'`. In batch mode, Flink stops reading
at the offset that was "latest" when the job started. This makes the batch job finite.
Without this, the Kafka source is unbounded, and Flink rejects unbounded sources in batch mode.
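
The bounded read can be modelled in a few lines of Python. This is a toy simulation under simplifying assumptions (in-memory lists stand in for partitions, and the function names are made up); it is not how the connector is implemented:

```python
def snapshot_end_offsets(partitions):
    """Capture the 'latest' offset of every partition at job start."""
    return {p: len(log) for p, log in partitions.items()}

def read_until(partitions, end_offsets):
    """Read each partition from offset 0 up to (not including) its captured end offset."""
    return [partitions[p][offset] for p, end in end_offsets.items() for offset in range(end)]

topic = {0: ["a", "b"], 1: ["c"], 2: []}   # three partitions, as in taxi.raw_trips
end_offsets = snapshot_end_offsets(topic)  # job starts here
topic[2].append("d")                       # event produced after the job started
rows = read_until(topic, end_offsets)
print(rows)
```

The event appended after the snapshot is not read by this run; a later batch run would pick it up.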

#### 3. Iceberg catalog (where table metadata lives)
The Hadoop catalog maps Iceberg table names to S3A paths:
```
iceberg_catalog.bronze.raw_trips
    → s3a://warehouse/bronze/raw_trips/
        ├── metadata/
        │   ├── v1.metadata.json
        │   └── snap-....avro
        └── data/
            └── 00000-0-....parquet
```
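
The name-to-path mapping shown above is a simple join of warehouse, database, and table. A minimal sketch (the function name `table_location` is illustrative):

```python
def table_location(warehouse: str, identifier: str) -> str:
    """Map 'catalog.database.table' to its directory under the warehouse root."""
    _catalog, database, table = identifier.split(".")
    return f"{warehouse.rstrip('/')}/{database}/{table}"

location = table_location("s3a://warehouse/", "iceberg_catalog.bronze.raw_trips")
print(location)
```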

### Event-time Watermark (important for streaming)
```sql
event_time AS TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss'),
WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
```
In **batch mode** the watermark is parsed but has no effect: all records are processed
regardless of order. In **streaming mode** the watermark trails the highest event time seen
by 10 seconds; events older than the watermark count as "late arrivals", which lets window functions close properly.
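
A simplified model of the watermark rule, for intuition only. It assumes the watermark is updated after every record (Flink emits watermarks periodically) and treats an event as late when its timestamp is strictly below the current watermark; the function name `classify` is illustrative:

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(seconds=10)  # matches INTERVAL '10' SECOND

def classify(event_times):
    """Return (event_time, is_late) pairs using watermark = max event time seen - 10s."""
    max_seen = None
    results = []
    for ts in event_times:
        watermark = max_seen - ALLOWED_LATENESS if max_seen is not None else None
        is_late = watermark is not None and ts < watermark
        results.append((ts, is_late))
        if max_seen is None or ts > max_seen:
            max_seen = ts
    return results

events = [
    datetime(2024, 1, 15, 8, 30, 0),
    datetime(2024, 1, 15, 8, 30, 30),
    datetime(2024, 1, 15, 8, 30, 25),  # 5s behind the max: within tolerance
    datetime(2024, 1, 15, 8, 30, 5),   # 25s behind the max: late
]
flags = [is_late for _, is_late in classify(events)]
print(flags)
```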


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/00-init.sql
-- =============================================================================
-- Pipeline 04: Flink SQL - Session Initialization
-- =============================================================================
-- Creates the Kafka source table and Iceberg catalog. This file is used as
-- an init script (-i flag) for all subsequent SQL files so they have access
-- to the catalog within the same session.
--
-- Uses BATCH execution mode so jobs process all available data and terminate.
-- =============================================================================

-- Use batch mode (process available data, then stop)
SET 'execution.runtime-mode' = 'batch';

-- Wait for each INSERT to complete before proceeding to next statement
SET 'table.dml-sync' = 'true';

-- Create Kafka source table
-- NOTE: event_time computed column + WATERMARK enables event-time processing
-- in streaming mode. In batch mode (default), the watermark is simply ignored.
CREATE TABLE IF NOT EXISTS kafka_raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    STRING,
    tpep_dropoff_datetime   STRING,
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE,
    -- Computed column for event-time processing
    event_time AS TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss'),
    -- Watermark: allow 10s late arrivals (ignored in batch mode)
    WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'taxi.raw_trips',
    'properties.bootstrap.servers' = 'redpanda:9092',
    'properties.group.id' = 'flink-consumer',
    'scan.startup.mode' = 'earliest-offset',
    'scan.bounded.mode' = 'latest-offset',
    'format' = 'json'
);

-- Create Iceberg catalog backed by MinIO
CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hadoop',
    'warehouse' = 's3a://warehouse/',
    'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
    's3.endpoint' = 'http://minio:9000',
    's3.access-key-id' = 'minioadmin',
    's3.secret-access-key' = 'minioadmin',
    's3.path-style-access' = 'true'
);


### 6.1 Reference: Kafka Source Table (standalone)

This file creates only the Redpanda source table — useful for interactive SQL sessions where
you want to inspect the schema without setting up the full catalog.


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/01-create-kafka-source.sql
-- =============================================================================
-- Pipeline 04: Flink SQL - Kafka Source Table (Redpanda)
-- =============================================================================
-- Creates a Flink SQL table backed by the Redpanda topic taxi.raw_trips.
-- The data generator produces JSON records with these exact field names
-- matching the NYC Yellow Taxi parquet schema.
-- =============================================================================

CREATE TABLE kafka_raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    STRING,
    tpep_dropoff_datetime   STRING,
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE
) WITH (
    'connector' = 'kafka',
    'topic' = 'taxi.raw_trips',
    'properties.bootstrap.servers' = 'redpanda:9092',
    'properties.group.id' = 'flink-consumer',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);


### 6.2 Reference: Iceberg Catalog (standalone)

P04 uses Iceberg's Hadoop catalog (direct S3A path resolution). The alternative, the Lakekeeper
REST catalog, is available on P01 via `--profile lakekeeper`. The REST catalog adds credential
vending (no S3 keys in SQL) but requires 4 extra services (Postgres + 3 Lakekeeper containers).


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/02-create-iceberg-catalog.sql
-- =============================================================================
-- Pipeline 04: Flink SQL - Iceberg Catalog
-- =============================================================================
-- Creates a Hadoop-based Iceberg catalog backed by MinIO (S3-compatible).
-- All Bronze and Silver tables will be created within this catalog.
-- =============================================================================

CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hadoop',
    'warehouse' = 's3a://warehouse/',
    'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
    's3.endpoint' = 'http://minio:9000',
    's3.access-key-id' = 'minioadmin',
    's3.secret-access-key' = 'minioadmin',
    's3.path-style-access' = 'true'
);


## 7. Flink SQL: Streaming Session Initialization

`00-init-streaming.sql` is the streaming counterpart to `00-init.sql`. Use it with
`07-streaming-bronze.sql` for **continuous event processing** instead of batch catch-up.

### Batch vs Streaming: Key Differences

| Aspect | Batch (`00-init.sql`) | Streaming (`00-init-streaming.sql`) |
|--------|----------------------|-------------------------------------|
| `execution.runtime-mode` | `batch` | `streaming` |
| `table.dml-sync` | `true` (block per INSERT) | **NOT SET** — would hang forever |
| `scan.bounded.mode` | `latest-offset` (stop at current end) | **NOT SET** — reads forever |
| `scan.startup.mode` | `earliest-offset` (reprocess all) | `latest-offset` (only new events) |
| Job lifecycle | Terminates when all data processed | Runs indefinitely until cancelled |
| Fault tolerance | Not needed (restart from Redpanda) | Checkpoints every 30s (exactly-once) |
| Use case | Backfill, benchmarks, scheduled runs | Production real-time ingestion |
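
The connector-level differences in the table can also be read off by diffing the two option sets. A small sketch with the values copied from the two init files; `option_diff` is an illustrative helper:

```python
# Connector options copied from 00-init.sql and 00-init-streaming.sql.
BATCH_OPTIONS = {
    "properties.group.id": "flink-consumer",
    "scan.startup.mode": "earliest-offset",
    "scan.bounded.mode": "latest-offset",
}
STREAMING_OPTIONS = {
    "properties.group.id": "flink-streaming-consumer",
    "scan.startup.mode": "latest-offset",
}

def option_diff(batch, streaming):
    """Return {option: (batch_value, streaming_value)} for every option that differs."""
    keys = sorted(batch.keys() | streaming.keys())
    return {k: (batch.get(k), streaming.get(k)) for k in keys if batch.get(k) != streaming.get(k)}

diff = option_diff(BATCH_OPTIONS, STREAMING_OPTIONS)
for option, (batch_value, streaming_value) in diff.items():
    print(f"{option}: batch={batch_value!r} streaming={streaming_value!r}")
```

A value of `None` on the streaming side means the option is deliberately left unset.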

### Why NOT Setting table.dml-sync in Streaming Mode is Critical

```sql
-- BATCH init:
SET 'table.dml-sync' = 'true';  -- block until INSERT completes → fine (job terminates)

-- STREAMING init:
-- Do NOT set table.dml-sync — the streaming INSERT runs FOREVER.
-- If dml-sync=true, the session would block forever on the first INSERT,
-- never getting to run the second SQL statement.
```

### Group ID Differs

- Batch: `'properties.group.id' = 'flink-consumer'` — offset tracked for batch replay
- Streaming: `'properties.group.id' = 'flink-streaming-consumer'` — separate group, reads from latest

This separation means you can run batch reprocessing and streaming ingestion simultaneously
without the consumer groups interfering with each other.


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/00-init-streaming.sql
-- =============================================================================
-- Pipeline 04: Flink SQL - Streaming Session Initialization (Redpanda)
-- =============================================================================
-- Creates the Redpanda source table and Iceberg catalog for STREAMING mode.
-- Unlike 00-init.sql (batch), this file:
--   - Sets execution.runtime-mode = streaming
--   - Enables checkpointing for exactly-once guarantees
--   - Does NOT set scan.bounded.mode (so the Redpanda source never terminates)
--   - Uses scan.startup.mode = latest-offset (process new events only)
--
-- Usage: sql-client.sh embedded -i 00-init-streaming.sql -f 07-streaming-bronze.sql
-- =============================================================================

-- Use streaming mode (job runs indefinitely)
SET 'execution.runtime-mode' = 'streaming';

-- Checkpoint every 30s for exactly-once guarantees
SET 'execution.checkpointing.interval' = '30s';

-- CRITICAL: Do NOT set table.dml-sync in streaming mode.
-- dml-sync=true would cause the session to hang forever waiting for the
-- infinite streaming job to complete before executing the next statement.

-- Create Redpanda source table (streaming — no bounded mode)
CREATE TABLE IF NOT EXISTS kafka_raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    STRING,
    tpep_dropoff_datetime   STRING,
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE,
    -- Computed column for event-time processing
    event_time AS TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss'),
    -- Watermark: allow 10s late arrivals
    WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'taxi.raw_trips',
    'properties.bootstrap.servers' = 'redpanda:9092',
    'properties.group.id' = 'flink-streaming-consumer',
    'scan.startup.mode' = 'latest-offset',
    -- NOTE: No scan.bounded.mode — this is what makes it truly streaming.
    -- The job will run indefinitely, processing new Redpanda events as they arrive.
    'format' = 'json'
);

-- Create Iceberg catalog backed by MinIO
CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hadoop',
    'warehouse' = 's3a://warehouse/',
    'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
    's3.endpoint' = 'http://minio:9000',
    's3.access-key-id' = 'minioadmin',
    's3.secret-access-key' = 'minioadmin',
    's3.path-style-access' = 'true'
);


## 8. Flink SQL: Bronze Layer (Redpanda → Iceberg)

### Bronze Layer Design Philosophy

The Bronze layer is a **faithful copy** of the source. No business logic. No filtering. Every
event that deserializes lands, even ones with out-of-range values. This is the audit trail.

```
Redpanda message (JSON):
{
  "VendorID": 1,
  "tpep_pickup_datetime": "2024-01-15T08:30:00",   ← STRING
  "fare_amount": 12.50,
  ...
}

Bronze Iceberg row:
  VendorID              BIGINT  = 1
  tpep_pickup_datetime  TIMESTAMP(3) = 2024-01-15 08:30:00.000  ← parsed
  fare_amount           DOUBLE  = 12.50
  ingestion_ts          TIMESTAMP(3) = 2024-01-15 09:01:33.412  ← added
```

### Timestamp Parsing Pattern

```sql
TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss')
--                                              ↑
--                               Escaped single quotes for literal 'T'
--                               Java SimpleDateFormat syntax in SQL string literals
```
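
For readers more familiar with Python, the same pattern looks like this with `datetime.strptime`. It is an analogue, not Flink code: the function name is illustrative, and returning `None` on a bad input mirrors the NULL that `TO_TIMESTAMP` is understood to produce for unparseable strings (worth verifying against your Flink version):

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # Python counterpart of 'yyyy-MM-dd''T''HH:mm:ss'

def parse_trip_timestamp(value):
    """Parse an ISO 8601 string; return None when the value is missing or malformed."""
    try:
        return datetime.strptime(value, ISO_FORMAT)
    except (TypeError, ValueError):
        return None

good = parse_trip_timestamp("2024-01-15T08:30:00")
bad = parse_trip_timestamp("15/01/2024 08:30")
print(good, bad)
```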

### Why format-version=1 for Bronze?

Iceberg v1 is sufficient for append-only Bronze. v2 adds row-level delete support
(required for UPDATE/MERGE operations). Keeping Bronze at v1 makes it slightly faster
to write and simpler to compact. Silver uses v2 because future maintenance (MERGE INTO for
dedup, DELETE WHERE for GDPR) requires it.

### 8.1 Bronze — Commented Reference


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/03-bronze-raw-trips.sql
-- =============================================================================
-- Pipeline 04: Flink SQL - Bronze Layer (Raw Trips)
-- =============================================================================
-- Creates the Bronze Iceberg table and starts a continuous INSERT job
-- that reads from the Kafka source table.
--
-- Bronze layer preserves original column names from the source data.
-- Timestamps are parsed from ISO 8601 strings to TIMESTAMP type.
-- No filtering or cleaning is applied at this layer.
-- =============================================================================

-- Use the Iceberg catalog
USE CATALOG iceberg_catalog;

-- Create the Bronze database
CREATE DATABASE IF NOT EXISTS bronze;
USE bronze;

-- Create the Bronze raw trips table
CREATE TABLE IF NOT EXISTS raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    TIMESTAMP(3),
    tpep_dropoff_datetime   TIMESTAMP(3),
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE,
    ingestion_ts            TIMESTAMP(3)
);

-- Switch back to default catalog for the Kafka source table reference
USE CATALOG default_catalog;
USE default_database;

-- Continuous INSERT from Kafka into Bronze Iceberg table
-- Timestamps are parsed from ISO 8601 string format (e.g. "2024-01-15T08:30:00")
INSERT INTO iceberg_catalog.bronze.raw_trips
SELECT
    VendorID,
    TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss')   AS tpep_pickup_datetime,
    TO_TIMESTAMP(tpep_dropoff_datetime, 'yyyy-MM-dd''T''HH:mm:ss')  AS tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    congestion_surcharge,
    Airport_fee,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM kafka_raw_trips;


### 8.2 Bronze — Production File (05-bronze.sql)

Used by `make process-bronze`. Runs as: `sql-client.sh embedded -i 00-init.sql -f 05-bronze.sql`


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/05-bronze.sql
-- =============================================================================
-- Pipeline 04: Bronze Layer (Kafka → Iceberg)
-- =============================================================================
-- Run: sql-client.sh embedded -i 00-init.sql -f 05-bronze.sql
-- =============================================================================

USE CATALOG iceberg_catalog;
CREATE DATABASE IF NOT EXISTS bronze;

CREATE TABLE IF NOT EXISTS bronze.raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    TIMESTAMP(3),
    tpep_dropoff_datetime   TIMESTAMP(3),
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE,
    ingestion_ts            TIMESTAMP(3)
)
WITH (
    'format-version' = '1',
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '134217728'
);

USE CATALOG default_catalog;
USE default_database;

INSERT INTO iceberg_catalog.bronze.raw_trips
SELECT
    VendorID,
    TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss')   AS tpep_pickup_datetime,
    TO_TIMESTAMP(tpep_dropoff_datetime, 'yyyy-MM-dd''T''HH:mm:ss')  AS tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    congestion_surcharge,
    Airport_fee,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM kafka_raw_trips;


## 9. Flink SQL: Silver Layer — Deduplication + Quality Filtering

Silver is the **trusted, clean** layer. It reads from Bronze and applies two things:
1. **Quality filtering** — removes invalid/out-of-range records
2. **Deduplication** — ROW_NUMBER() removes duplicate events (e.g., producer retries)

### Deduplication via ROW_NUMBER()

```sql
WITH deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
                         PULocationID, DOLocationID, fare_amount, total_amount
            ORDER BY ingestion_ts DESC        -- keep the LATEST ingestion of a duplicate
        ) AS rn
    FROM iceberg_catalog.bronze.raw_trips
    WHERE ...quality filters...
)
SELECT ...columns... FROM deduped WHERE rn = 1;
```

**Natural key**: The combination of (vendor, pickup_time, dropoff_time, pickup_zone, dropoff_zone,
fare, total) uniquely identifies a taxi trip. Two rows with the same natural key = same trip.

**ORDER BY ingestion_ts DESC**: If a producer sent the same event twice due to a retry,
we keep the most recently ingested copy (latest `ingestion_ts`).

### Why ROW_NUMBER() and not DISTINCT?

`DISTINCT` removes a row only when another row matches it on ALL columns. Two deliveries of
the same trip can carry different `ingestion_ts` values (in streaming mode `CURRENT_TIMESTAMP`
is evaluated per record, and separate batch runs start at different times), so `DISTINCT`
would keep both copies. ROW_NUMBER partitioned by the **business natural key** deduplicates
those trips regardless of when each copy arrived.
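
The difference can be sketched in plain Python. This is an illustration of the two semantics, not the Flink implementation; the sample trip values and the helper names `distinct` and `dedupe_latest` are made up for the example.

```python
from datetime import datetime

# Natural-key columns, in the same order as the PARTITION BY clause.
NATURAL_KEY = (
    "VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime",
    "PULocationID", "DOLocationID", "fare_amount", "total_amount",
)


def distinct(rows):
    """Like SELECT DISTINCT: two rows collapse only if ALL columns match."""
    unique = {tuple(sorted(row.items())) for row in rows}
    return [dict(items) for items in unique]


def dedupe_latest(rows):
    """Like ROW_NUMBER() ... ORDER BY ingestion_ts DESC filtered to rn = 1."""
    latest = {}
    for row in rows:
        key = tuple(row[col] for col in NATURAL_KEY)
        kept = latest.get(key)
        if kept is None or row["ingestion_ts"] > kept["ingestion_ts"]:
            latest[key] = row
    return list(latest.values())


# One trip delivered twice; only the arrival timestamp differs (sample values).
trip = {
    "VendorID": 1,
    "tpep_pickup_datetime": "2024-01-15T09:00:00",
    "tpep_dropoff_datetime": "2024-01-15T09:20:00",
    "PULocationID": 48,
    "DOLocationID": 68,
    "fare_amount": 14.5,
    "total_amount": 19.8,
}
first = {**trip, "ingestion_ts": datetime(2024, 1, 15, 9, 21, 0)}
retry = {**trip, "ingestion_ts": datetime(2024, 1, 15, 9, 21, 7)}
rows = [first, retry]

print(len(distinct(rows)))       # 2: the copies differ in ingestion_ts
print(len(dedupe_latest(rows)))  # 1: the later copy is kept
```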

### Quality Filters Applied

| Filter | SQL | Expected rejection |
|--------|-----|--------------------|
| Valid pickup time | `tpep_pickup_datetime IS NOT NULL` | Malformed events |
| Valid dropoff time | `tpep_dropoff_datetime IS NOT NULL` | Malformed events |
| Non-negative distance | `trip_distance >= 0` | GPS glitches |
| Non-negative fare | `fare_amount >= 0` | Refund/test records |
| Date range | `CAST(tpep_pickup_datetime AS DATE) >= DATE '2024-01-01' AND CAST(tpep_pickup_datetime AS DATE) < DATE '2024-02-01'` | Out-of-period events |

**Expected yield:** ~9,855 / 10,000 = 98.55% pass rate.
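
As a rough sketch, the same filters can be written as a single Python predicate. The helper `make_row` and the sample rows are hypothetical and exist only to exercise each rule once; they are not taken from the benchmark data.

```python
from datetime import date, datetime, timedelta

PERIOD_START = date(2024, 1, 1)
PERIOD_END = date(2024, 2, 1)  # exclusive, like the `<` bound in the SQL


def passes_quality_filters(row):
    """Return True when a Bronze row satisfies every Silver filter."""
    pickup = row["tpep_pickup_datetime"]
    dropoff = row["tpep_dropoff_datetime"]
    distance = row["trip_distance"]
    fare = row["fare_amount"]
    # A SQL comparison against NULL is never true, so None values are rejected.
    if pickup is None or dropoff is None or distance is None or fare is None:
        return False
    if distance < 0 or fare < 0:
        return False
    return PERIOD_START <= pickup.date() < PERIOD_END


def make_row(pickup, distance=2.1, fare=12.0):
    """Build a minimal sample row (hypothetical helper for this sketch)."""
    dropoff = pickup + timedelta(minutes=15) if pickup else None
    return {
        "tpep_pickup_datetime": pickup,
        "tpep_dropoff_datetime": dropoff,
        "trip_distance": distance,
        "fare_amount": fare,
    }


sample = [
    make_row(datetime(2024, 1, 15, 9, 0)),                 # valid
    make_row(None),                                        # malformed event
    make_row(datetime(2024, 1, 15, 9, 0), distance=-1.0),  # GPS glitch
    make_row(datetime(2024, 1, 15, 9, 0), fare=-5.0),      # refund/test record
    make_row(datetime(2024, 2, 1, 0, 0)),                  # out of period
]
kept = [r for r in sample if passes_quality_filters(r)]
print(f"{len(kept)} of {len(sample)} rows pass")  # 1 of 5 rows pass
```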

### Silver Schema Changes vs Bronze

| Column | Bronze | Silver | Change |
|--------|--------|--------|--------|
| `VendorID` | BIGINT | INT (as `vendor_id`) | Renamed + narrowed in Flink Silver |
| `fare_amount` | DOUBLE | DECIMAL(10,2) | Rounded + typed |
| `pickup_date` | — | DATE | Added (for partitioning) |
| `trip_id` | — | STRING | Added (MD5 surrogate key) |
| `ingestion_ts` | TIMESTAMP(3) | — | Dropped (used only for dedup order) |
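
The two added or retyped columns can be approximated in Python as follows. This is a sketch only: Flink formats timestamps and doubles differently when casting to STRING, so the hash produced here is not expected to match a `trip_id` written by Flink, and the function names are hypothetical.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_UP


def surrogate_trip_id(*natural_key_values):
    """MD5 hex digest of the natural-key values joined with '|'."""
    joined = "|".join(str(value) for value in natural_key_values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def to_decimal_10_2(value):
    """Round a float to two decimal places, similar to the DECIMAL(10, 2) cast."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


trip_id = surrogate_trip_id(
    1, "2024-01-15 09:00:00", "2024-01-15 09:20:00", 48, 68, 14.5, 19.8
)
print(len(trip_id))             # 32 hex characters
print(to_decimal_10_2(14.5))    # 14.50
print(to_decimal_10_2(19.799))  # 19.80
```

Because the key is a pure function of the natural-key columns, two copies of the same trip always hash to the same `trip_id`.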

### 9.1 Silver — Commented Reference

This annotated version documents each transformation but omits the `ROW_NUMBER()` deduplication step; the production file in 9.2 includes it.


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/04-silver-cleaned-trips.sql
-- =============================================================================
-- Pipeline 04: Flink SQL - Silver Layer (Cleaned Trips)
-- =============================================================================
-- Creates the Silver Iceberg table and starts a continuous INSERT job
-- that reads from the Bronze table, applies data quality filters,
-- renames columns to snake_case, and computes enrichment columns.
--
-- Silver layer transformations:
--   1. Column renaming (VendorID -> vendor_id, PULocationID -> pickup_location_id, etc.)
--   2. Type casting (BIGINT -> INT where appropriate)
--   3. Data quality filters:
--      - Reject null timestamps
--      - Reject negative fare amounts and trip distances
--      - Reject pickup dates outside January 2024
--   4. Surrogate key: MD5 hash of composite natural key
--   5. Partition column: pickup_date (for Iceberg partitioning)
-- =============================================================================

-- Use the Iceberg catalog
USE CATALOG iceberg_catalog;

-- Create the Silver database
CREATE DATABASE IF NOT EXISTS silver;
USE silver;

-- Create the Silver cleaned trips table
CREATE TABLE IF NOT EXISTS cleaned_trips (
    -- surrogate key
    trip_id                 STRING,

    -- identifiers
    vendor_id               INT,
    rate_code_id            INT,
    pickup_location_id      INT,
    dropoff_location_id     INT,
    payment_type_id         INT,

    -- timestamps
    pickup_datetime         TIMESTAMP(3),
    dropoff_datetime        TIMESTAMP(3),

    -- trip info
    passenger_count         INT,
    trip_distance_miles     DOUBLE,
    store_and_fwd_flag      STRING,

    -- financials
    fare_amount             DECIMAL(10, 2),
    extra_amount            DECIMAL(10, 2),
    mta_tax                 DECIMAL(10, 2),
    tip_amount              DECIMAL(10, 2),
    tolls_amount            DECIMAL(10, 2),
    improvement_surcharge   DECIMAL(10, 2),
    total_amount            DECIMAL(10, 2),
    congestion_surcharge    DECIMAL(10, 2),
    airport_fee             DECIMAL(10, 2),

    -- computed: time dimensions
    pickup_date             DATE
) PARTITIONED BY (pickup_date)
WITH (
    'format-version' = '2',
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '134217728'
);

-- Continuous INSERT from Bronze into Silver with transformations
INSERT INTO iceberg_catalog.silver.cleaned_trips
SELECT
    -- Surrogate key: MD5 hash of composite natural key
    MD5(CONCAT_WS('|',
        CAST(VendorID AS STRING),
        CAST(tpep_pickup_datetime AS STRING),
        CAST(tpep_dropoff_datetime AS STRING),
        CAST(PULocationID AS STRING),
        CAST(DOLocationID AS STRING),
        CAST(fare_amount AS STRING),
        CAST(total_amount AS STRING)
    )) AS trip_id,

    -- Identifiers (renamed + cast)
    CAST(VendorID AS INT)       AS vendor_id,
    CAST(RatecodeID AS INT)     AS rate_code_id,
    CAST(PULocationID AS INT)   AS pickup_location_id,
    CAST(DOLocationID AS INT)   AS dropoff_location_id,
    CAST(payment_type AS INT)   AS payment_type_id,

    -- Timestamps
    tpep_pickup_datetime        AS pickup_datetime,
    tpep_dropoff_datetime       AS dropoff_datetime,

    -- Trip info
    CAST(passenger_count AS INT) AS passenger_count,
    trip_distance               AS trip_distance_miles,
    store_and_fwd_flag,

    -- Financials (rounded to 2 decimal places)
    CAST(ROUND(fare_amount, 2)             AS DECIMAL(10, 2)) AS fare_amount,
    CAST(ROUND(extra, 2)                   AS DECIMAL(10, 2)) AS extra_amount,
    CAST(ROUND(mta_tax, 2)                 AS DECIMAL(10, 2)) AS mta_tax,
    CAST(ROUND(tip_amount, 2)              AS DECIMAL(10, 2)) AS tip_amount,
    CAST(ROUND(tolls_amount, 2)            AS DECIMAL(10, 2)) AS tolls_amount,
    CAST(ROUND(improvement_surcharge, 2)   AS DECIMAL(10, 2)) AS improvement_surcharge,
    CAST(ROUND(total_amount, 2)            AS DECIMAL(10, 2)) AS total_amount,
    CAST(ROUND(congestion_surcharge, 2)    AS DECIMAL(10, 2)) AS congestion_surcharge,
    CAST(ROUND(Airport_fee, 2)             AS DECIMAL(10, 2)) AS airport_fee,

    -- Computed: date dimensions
    CAST(tpep_pickup_datetime AS DATE) AS pickup_date

FROM iceberg_catalog.bronze.raw_trips

-- Data quality filters
WHERE tpep_pickup_datetime IS NOT NULL
  AND tpep_dropoff_datetime IS NOT NULL
  AND trip_distance >= 0
  AND fare_amount >= 0
  AND CAST(tpep_pickup_datetime AS DATE) >= DATE '2024-01-01'
  AND CAST(tpep_pickup_datetime AS DATE) <  DATE '2024-02-01';


### 9.2 Silver — Production File (06-silver.sql)

Used by `make process-silver`. This standalone file has the **full ROW_NUMBER() deduplication**
logic and produces the same Silver result as P01's combined `05-run-all.sql` (reproduced in 9.3).

> **Audit finding (Feb 2026):** An earlier version of `06-silver.sql` ran a plain INSERT without
> deduplication, while `05-run-all.sql` had the correct ROW_NUMBER CTE. Running `make process-silver`
> independently would produce duplicates. Both are now identical. Always verify row counts!
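
A minimal sketch of such a row-count check, assuming the Bronze row count and the Silver `trip_id` values have already been fetched by some query tool (the function name and inputs are hypothetical). Since `trip_id` is derived from the same columns the deduplication partitions on, a deduplicated Silver table should contain no repeated `trip_id`.

```python
from collections import Counter


def check_silver(bronze_row_count, silver_trip_ids):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if len(silver_trip_ids) > bronze_row_count:
        problems.append(
            f"Silver has {len(silver_trip_ids)} rows but Bronze only {bronze_row_count}"
        )
    repeated = [tid for tid, n in Counter(silver_trip_ids).items() if n > 1]
    if repeated:
        problems.append(f"{len(repeated)} trip_id value(s) occur more than once")
    return problems


print(check_silver(4, ["a1", "b2", "c3"]))        # []
print(check_silver(4, ["a1", "b2", "b2", "c3"]))  # reports one repeated trip_id
```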


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/06-silver.sql
-- =============================================================================
-- Pipeline 04: Silver Layer (Bronze Iceberg → Silver Iceberg)
-- =============================================================================
-- Run: sql-client.sh embedded -i 00-init.sql -f 06-silver.sql
-- =============================================================================

USE CATALOG iceberg_catalog;
CREATE DATABASE IF NOT EXISTS silver;

CREATE TABLE IF NOT EXISTS silver.cleaned_trips (
    trip_id                 STRING,
    vendor_id               INT,
    rate_code_id            INT,
    pickup_location_id      INT,
    dropoff_location_id     INT,
    payment_type_id         INT,
    pickup_datetime         TIMESTAMP(3),
    dropoff_datetime        TIMESTAMP(3),
    passenger_count         INT,
    trip_distance_miles     DOUBLE,
    store_and_fwd_flag      STRING,
    fare_amount             DECIMAL(10, 2),
    extra_amount            DECIMAL(10, 2),
    mta_tax                 DECIMAL(10, 2),
    tip_amount              DECIMAL(10, 2),
    tolls_amount            DECIMAL(10, 2),
    improvement_surcharge   DECIMAL(10, 2),
    total_amount            DECIMAL(10, 2),
    congestion_surcharge    DECIMAL(10, 2),
    airport_fee             DECIMAL(10, 2),
    pickup_date             DATE
) PARTITIONED BY (pickup_date)
WITH (
    'format-version' = '2',
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '134217728'
);

-- Deduplication: ROW_NUMBER partitioned by natural key, keeping latest ingestion
INSERT INTO iceberg_catalog.silver.cleaned_trips
WITH deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
                         PULocationID, DOLocationID, fare_amount, total_amount
            ORDER BY ingestion_ts DESC
        ) AS rn
    FROM iceberg_catalog.bronze.raw_trips
    WHERE tpep_pickup_datetime IS NOT NULL
      AND tpep_dropoff_datetime IS NOT NULL
      AND trip_distance >= 0
      AND fare_amount >= 0
      AND CAST(tpep_pickup_datetime AS DATE) >= DATE '2024-01-01'
      AND CAST(tpep_pickup_datetime AS DATE) <  DATE '2024-02-01'
)
SELECT
    CAST(MD5(CONCAT_WS('|',
        CAST(VendorID AS STRING),
        CAST(tpep_pickup_datetime AS STRING),
        CAST(tpep_dropoff_datetime AS STRING),
        CAST(PULocationID AS STRING),
        CAST(DOLocationID AS STRING),
        CAST(fare_amount AS STRING),
        CAST(total_amount AS STRING)
    )) AS STRING) AS trip_id,
    CAST(VendorID AS INT)       AS vendor_id,
    CAST(RatecodeID AS INT)     AS rate_code_id,
    CAST(PULocationID AS INT)   AS pickup_location_id,
    CAST(DOLocationID AS INT)   AS dropoff_location_id,
    CAST(payment_type AS INT)   AS payment_type_id,
    tpep_pickup_datetime        AS pickup_datetime,
    tpep_dropoff_datetime       AS dropoff_datetime,
    CAST(passenger_count AS INT) AS passenger_count,
    trip_distance               AS trip_distance_miles,
    store_and_fwd_flag,
    CAST(ROUND(fare_amount, 2)             AS DECIMAL(10, 2)) AS fare_amount,
    CAST(ROUND(extra, 2)                   AS DECIMAL(10, 2)) AS extra_amount,
    CAST(ROUND(mta_tax, 2)                 AS DECIMAL(10, 2)) AS mta_tax,
    CAST(ROUND(tip_amount, 2)              AS DECIMAL(10, 2)) AS tip_amount,
    CAST(ROUND(tolls_amount, 2)            AS DECIMAL(10, 2)) AS tolls_amount,
    CAST(ROUND(improvement_surcharge, 2)   AS DECIMAL(10, 2)) AS improvement_surcharge,
    CAST(ROUND(total_amount, 2)            AS DECIMAL(10, 2)) AS total_amount,
    CAST(ROUND(congestion_surcharge, 2)    AS DECIMAL(10, 2)) AS congestion_surcharge,
    CAST(ROUND(Airport_fee, 2)             AS DECIMAL(10, 2)) AS airport_fee,
    CAST(tpep_pickup_datetime AS DATE) AS pickup_date
FROM deduped
WHERE rn = 1;


### 9.3 Combined Pipeline Reference (P01's 05-run-all.sql)

P04 does not have a single combined SQL file — `make process` calls `process-bronze` then
`process-silver` sequentially. Below is P01's combined file for reference — it shows how
both layers can run in one SQL session with `table.dml-sync=true` ensuring ordering.


In [None]:
# Reference only — this is P01's 05-run-all.sql, not used by P04 directly
# P04 uses: make process-bronze && make process-silver
print("""
-- =============================================================================
-- Pipeline 01: Flink SQL - Full Pipeline (Bronze + Silver)
-- =============================================================================
-- Run with init: sql-client.sh embedded -i 00-init.sql -f 05-run-all.sql
-- =============================================================================

-- ═══════════════════════════════════════════════════════════════════════════════
-- BRONZE LAYER: Raw data from Kafka → Iceberg
-- ═══════════════════════════════════════════════════════════════════════════════

USE CATALOG iceberg_catalog;
CREATE DATABASE IF NOT EXISTS bronze;

CREATE TABLE IF NOT EXISTS bronze.raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    TIMESTAMP(3),
    tpep_dropoff_datetime   TIMESTAMP(3),
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE,
    ingestion_ts            TIMESTAMP(3)
)
WITH (
    'format-version' = '1',
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '134217728'
);

-- Switch back to default catalog for Kafka source table reference
USE CATALOG default_catalog;
USE default_database;

-- Insert from Kafka into Bronze Iceberg table
INSERT INTO iceberg_catalog.bronze.raw_trips
SELECT
    VendorID,
    TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss')   AS tpep_pickup_datetime,
    TO_TIMESTAMP(tpep_dropoff_datetime, 'yyyy-MM-dd''T''HH:mm:ss')  AS tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    congestion_surcharge,
    Airport_fee,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM kafka_raw_trips;

-- ═══════════════════════════════════════════════════════════════════════════════
-- SILVER LAYER: Cleaned + enriched data from Bronze → Silver
-- ═══════════════════════════════════════════════════════════════════════════════

USE CATALOG iceberg_catalog;
CREATE DATABASE IF NOT EXISTS silver;

CREATE TABLE IF NOT EXISTS silver.cleaned_trips (
    trip_id                 STRING,
    vendor_id               INT,
    rate_code_id            INT,
    pickup_location_id      INT,
    dropoff_location_id     INT,
    payment_type_id         INT,
    pickup_datetime         TIMESTAMP(3),
    dropoff_datetime        TIMESTAMP(3),
    passenger_count         INT,
    trip_distance_miles     DOUBLE,
    store_and_fwd_flag      STRING,
    fare_amount             DECIMAL(10, 2),
    extra_amount            DECIMAL(10, 2),
    mta_tax                 DECIMAL(10, 2),
    tip_amount              DECIMAL(10, 2),
    tolls_amount            DECIMAL(10, 2),
    improvement_surcharge   DECIMAL(10, 2),
    total_amount            DECIMAL(10, 2),
    congestion_surcharge    DECIMAL(10, 2),
    airport_fee             DECIMAL(10, 2),
    pickup_date             DATE
) PARTITIONED BY (pickup_date)
WITH (
    'format-version' = '2',
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '134217728'
);

-- Deduplication: ROW_NUMBER partitioned by natural key, keeping latest ingestion
INSERT INTO iceberg_catalog.silver.cleaned_trips
WITH deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
                         PULocationID, DOLocationID, fare_amount, total_amount
            ORDER BY ingestion_ts DESC
        ) AS rn
    FROM iceberg_catalog.bronze.raw_trips
    WHERE tpep_pickup_datetime IS NOT NULL
      AND tpep_dropoff_datetime IS NOT NULL
      AND trip_distance >= 0
      AND fare_amount >= 0
      AND CAST(tpep_pickup_datetime AS DATE) >= DATE '2024-01-01'
      AND CAST(tpep_pickup_datetime AS DATE) <  DATE '2024-02-01'
)
SELECT
    CAST(MD5(CONCAT_WS('|',
        CAST(VendorID AS STRING),
        CAST(tpep_pickup_datetime AS STRING),
        CAST(tpep_dropoff_datetime AS STRING),
        CAST(PULocationID AS STRING),
        CAST(DOLocationID AS STRING),
        CAST(fare_amount AS STRING),
        CAST(total_amount AS STRING)
    )) AS STRING) AS trip_id,
    CAST(VendorID AS INT)       AS vendor_id,
    CAST(RatecodeID AS INT)     AS rate_code_id,
    CAST(PULocationID AS INT)   AS pickup_location_id,
    CAST(DOLocationID AS INT)   AS dropoff_location_id,
    CAST(payment_type AS INT)   AS payment_type_id,
    tpep_pickup_datetime        AS pickup_datetime,
    tpep_dropoff_datetime       AS dropoff_datetime,
    CAST(passenger_count AS INT) AS passenger_count,
    trip_distance               AS trip_distance_miles,
    store_and_fwd_flag,
    CAST(ROUND(fare_amount, 2)             AS DECIMAL(10, 2)) AS fare_amount,
    CAST(ROUND(extra, 2)                   AS DECIMAL(10, 2)) AS extra_amount,
    CAST(ROUND(mta_tax, 2)                 AS DECIMAL(10, 2)) AS mta_tax,
    CAST(ROUND(tip_amount, 2)              AS DECIMAL(10, 2)) AS tip_amount,
    CAST(ROUND(tolls_amount, 2)            AS DECIMAL(10, 2)) AS tolls_amount,
    CAST(ROUND(improvement_surcharge, 2)   AS DECIMAL(10, 2)) AS improvement_surcharge,
    CAST(ROUND(total_amount, 2)            AS DECIMAL(10, 2)) AS total_amount,
    CAST(ROUND(congestion_surcharge, 2)    AS DECIMAL(10, 2)) AS congestion_surcharge,
    CAST(ROUND(Airport_fee, 2)             AS DECIMAL(10, 2)) AS airport_fee,
    CAST(tpep_pickup_datetime AS DATE) AS pickup_date
FROM deduped
WHERE rn = 1;

""")


## 10. Flink SQL: Streaming Bronze (Continuous Mode)

`07-streaming-bronze.sql` is the **streaming alternative** to `05-bronze.sql`. Use it when you
want Flink to continuously ingest new events as they arrive in Redpanda, rather than doing
a one-time batch catch-up.

### When to Use Streaming vs Batch

| Scenario | Mode | Command |
|----------|------|---------|
| Initial backfill of historical data | Batch | `make process-bronze` |
| Nightly scheduled catch-up | Batch | `make process` |
| Continuous real-time ingestion | Streaming | `make process-streaming` |
| CI/CD pipeline validation | Batch | `make benchmark` |

### Streaming Job Lifecycle

```
make process-streaming
  │
  └─ sql-client.sh embedded -i 00-init-streaming.sql -f 07-streaming-bronze.sql
       │
       ├─ 00-init-streaming.sql: SET streaming mode, CREATE kafka_raw_trips (no bounded mode)
       │
       └─ 07-streaming-bronze.sql: INSERT INTO bronze.raw_trips FROM kafka_raw_trips
            │
            └─ Job runs indefinitely: reads Redpanda → writes Iceberg, checkpointing every 30s
                 Exactly-once guarantee: if a TaskManager crashes, Flink restores from the last checkpoint
                 Cancel with: Ctrl+C or Flink REST API DELETE /jobs/{jobId}
```

### Iceberg Streaming Write Behavior

Flink's Iceberg sink in streaming mode writes files on checkpoint boundaries (every 30s).
Each checkpoint produces one or more Parquet files and commits a new Iceberg snapshot.
This means:
- Data is visible to readers every 30s (not per-event)
- Small files accumulate over time → schedule `rewrite_data_files` compaction periodically
- Time travel works against retained snapshots; the syntax depends on the query engine (in Trino, for example: `SELECT * FROM ... FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 09:00:00'`)
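
A back-of-the-envelope calculation shows why compaction matters. The sketch assumes a single writer that has data at every checkpoint; the ingest rate of 100,000 bytes/s is a made-up figure, while the 30s interval and the 128 MiB target come from the configuration shown in this notebook.

```python
SECONDS_PER_DAY = 24 * 60 * 60
CHECKPOINT_INTERVAL_S = 30        # checkpoint interval used by the streaming job
TARGET_FILE_BYTES = 134_217_728   # write.target-file-size-bytes (128 MiB)


def commits_per_day(interval_s=CHECKPOINT_INTERVAL_S):
    """Number of checkpoints, and therefore Iceberg commits, in 24 hours."""
    return SECONDS_PER_DAY // interval_s


def fill_ratio(ingest_bytes_per_s, interval_s=CHECKPOINT_INTERVAL_S):
    """Fraction of the target file size written between two checkpoints."""
    return (ingest_bytes_per_s * interval_s) / TARGET_FILE_BYTES


print(commits_per_day())             # 2880
print(f"{fill_ratio(100_000):.1%}")  # 2.2%
```

Under these assumptions each day adds at least 2,880 files, each only a small fraction of the target size.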


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/flink/sql/07-streaming-bronze.sql
-- =============================================================================
-- Pipeline 04: Streaming Bronze Layer (Redpanda → Iceberg, continuous)
-- =============================================================================
-- Alternative to 05-bronze.sql that runs in STREAMING mode.
-- Uses event_time watermarks defined in 00-init-streaming.sql for event-time processing.
--
-- Run: sql-client.sh embedded -i 00-init-streaming.sql -f 07-streaming-bronze.sql
--
-- NOTE: This job runs continuously until cancelled. It will process new Redpanda
-- events as they arrive and write them to the Bronze Iceberg table.
-- Streaming config (runtime-mode, checkpointing) is set in 00-init-streaming.sql.
-- =============================================================================

USE CATALOG iceberg_catalog;
CREATE DATABASE IF NOT EXISTS bronze;

CREATE TABLE IF NOT EXISTS bronze.raw_trips (
    VendorID                BIGINT,
    tpep_pickup_datetime    TIMESTAMP(3),
    tpep_dropoff_datetime   TIMESTAMP(3),
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    RatecodeID              BIGINT,
    store_and_fwd_flag      STRING,
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    congestion_surcharge    DOUBLE,
    Airport_fee             DOUBLE,
    ingestion_ts            TIMESTAMP(3)
)
WITH (
    'format-version' = '1',
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '134217728'
);

-- Switch back to default catalog for Redpanda source table reference
USE CATALOG default_catalog;
USE default_database;

-- Streaming INSERT: runs continuously, processing new Redpanda events as they arrive
INSERT INTO iceberg_catalog.bronze.raw_trips
SELECT
    VendorID,
    TO_TIMESTAMP(tpep_pickup_datetime, 'yyyy-MM-dd''T''HH:mm:ss')   AS tpep_pickup_datetime,
    TO_TIMESTAMP(tpep_dropoff_datetime, 'yyyy-MM-dd''T''HH:mm:ss')  AS tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    congestion_surcharge,
    Airport_fee,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM kafka_raw_trips;


## 11. dbt Project Configuration

dbt handles the **Silver → Gold** transformation. It reads from the Iceberg Silver table via
DuckDB's `iceberg_scan()` function and produces analytics-ready Gold layer tables.

### How dbt-duckdb + Iceberg Works

```yaml
# 1. DuckDB opens as an in-process embedded database
# 2. DuckDB loads the Iceberg extension
# 3. DuckDB reads Iceberg metadata from MinIO (S3-compatible via httpfs)
# 4. DuckDB reads Parquet data files from MinIO
# 5. dbt materializes results as DuckDB tables (stored in a local .duckdb file)

# In sources.yml:
external_location: "iceberg_scan('s3://warehouse/silver/cleaned_trips', allow_moved_paths=true)"

# DuckDB resolves this as:
# 1. Read s3://warehouse/silver/cleaned_trips/metadata/version-hint.text → get version N
# 2. Read s3://warehouse/silver/cleaned_trips/metadata/vN.metadata.json → get manifest list
# 3. Read manifest files → get Parquet file list
# 4. Read Parquet files → return rows
```
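
The four resolution steps can be mimicked with an in-memory dictionary standing in for MinIO. This is a deliberately simplified sketch: real Iceberg manifest lists and manifests are Avro files, and the `manifest_list` key and the file names below are hypothetical stand-ins, not the Iceberg schema.

```python
import json


def resolve_data_files(store, table_root):
    """Follow version hint -> metadata JSON -> manifests -> data file paths."""
    metadata_dir = f"{table_root}/metadata"
    # Step 1: the version hint holds the current metadata version number N.
    version = int(store[f"{metadata_dir}/version-hint.text"].strip())
    # Step 2: vN.metadata.json points at the manifest list.
    metadata = json.loads(store[f"{metadata_dir}/v{version}.metadata.json"])
    manifest_paths = json.loads(store[metadata["manifest_list"]])
    # Step 3: each manifest enumerates Parquet data files.
    data_files = []
    for manifest_path in manifest_paths:
        data_files.extend(json.loads(store[manifest_path]))
    # Step 4 (reading the Parquet files themselves) is left out of this sketch.
    return data_files


ROOT = "s3://warehouse/silver/cleaned_trips"
store = {
    f"{ROOT}/metadata/version-hint.text": "3\n",
    f"{ROOT}/metadata/v3.metadata.json": json.dumps(
        {"manifest_list": f"{ROOT}/metadata/snap-3.json"}
    ),
    f"{ROOT}/metadata/snap-3.json": json.dumps([f"{ROOT}/metadata/m0.json"]),
    f"{ROOT}/metadata/m0.json": json.dumps(
        [f"{ROOT}/data/pickup_date=2024-01-15/part-0.parquet"]
    ),
}
print(resolve_data_files(store, ROOT))
```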

**Why `allow_moved_paths=true`?** Iceberg metadata stores the full path of each manifest and
data file as it was written. The writer records paths with the Hadoop scheme (e.g.,
`s3a://warehouse/...`), while DuckDB reads the same MinIO bucket as `s3://warehouse/...`.
(The MinIO endpoint also differs by vantage point, `minio:9000` inside Docker and
`localhost:9000` outside, but that is set in the connection settings, not in the paths.)
`allow_moved_paths=true` tells DuckDB to tolerate the mismatch by resolving recorded paths
against the table location passed to `iceberg_scan()`.
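
The idea behind tolerating moved paths can be sketched as re-anchoring a recorded path onto the location the reader was given. This illustrates the concept only and is not DuckDB's implementation; `rebase_path` and the file name are hypothetical.

```python
def rebase_path(recorded_path, recorded_root, read_root):
    """Re-anchor a path written under recorded_root onto read_root."""
    prefix = recorded_root.rstrip("/") + "/"
    if not recorded_path.startswith(prefix):
        raise ValueError(f"{recorded_path!r} is not under {recorded_root!r}")
    relative = recorded_path[len(prefix):]
    return read_root.rstrip("/") + "/" + relative


recorded = "s3a://warehouse/silver/cleaned_trips/data/part-0.parquet"
readable = rebase_path(
    recorded,
    recorded_root="s3a://warehouse/silver/cleaned_trips",  # scheme used at write time
    read_root="s3://warehouse/silver/cleaned_trips",       # scheme used by DuckDB
)
print(readable)  # s3://warehouse/silver/cleaned_trips/data/part-0.parquet
```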

### Silver as the dbt Source (Post-Audit)

> **Critical audit finding (Feb 2026):** The original P04 `sources.yml` pointed to the Bronze
> table (`s3://warehouse/bronze/raw_trips`). Every dbt model was built on raw, unvalidated,
> potentially duplicate Bronze data — completely defeating Flink's Silver cleaning work.
> Fixed to point to Silver (`s3://warehouse/silver/cleaned_trips`).
>
> The stg_yellow_trips.sql model is now a **passthrough**: Silver already has clean column names,
> correct types, and validated data. Flink Silver = dbt Staging Source.

### 11.1 dbt_project.yml


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/dbt_project.yml
name: nyc_taxi_pipeline_04
version: "1.0.0"
config-version: 2
profile: nyc_taxi_pipeline_04
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets: ["target", "dbt_packages"]

dispatch:
  - macro_namespace: nyc_taxi_dbt
    search_order: [nyc_taxi_pipeline_04, nyc_taxi_dbt]

seeds:
  nyc_taxi_pipeline_04:
    +schema: raw

models:
  nyc_taxi_pipeline_04:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: view
      +schema: intermediate
    marts:
      core:
        +materialized: table
        +schema: marts
      analytics:
        +materialized: table
        +schema: marts


### 11.2 profiles.yml — DuckDB + Iceberg + MinIO Connection

The profile loads two DuckDB extensions:
- `httpfs`: allows DuckDB to read files from HTTP/S3 endpoints
- `iceberg`: adds the `iceberg_scan()` function

The MinIO endpoint and S3 credentials are supplied through `settings`; the `aws` extension
(S3 credential management helpers) is not loaded by this profile.


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/profiles.yml
nyc_taxi_pipeline_04:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /dbt/warehouse.duckdb
      extensions:
        - iceberg
        - httpfs
      settings:
        s3_region: us-east-1
        s3_endpoint: minio:9000
        s3_access_key_id: minioadmin
        s3_secret_access_key: minioadmin
        s3_use_ssl: false
        s3_url_style: path
      threads: 4


### 11.3 packages.yml


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: ">=1.1.0"


### 11.4 sources.yml — Silver Iceberg as Source

Note: `external_location` uses `s3://` (DuckDB httpfs format), not `s3a://` (Hadoop format).
These are equivalent paths to the same MinIO storage but use different drivers.
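
The `sources.yml` in this section also declares freshness thresholds (`warn_after` 30 days, `error_after` 365 days, measured on `pickup_datetime`). A rough sketch of how such thresholds classify a source, assuming the check compares the newest `loaded_at_field` value with the current time; `freshness_status` is a hypothetical helper, not a dbt API.

```python
from datetime import datetime, timedelta

WARN_AFTER = timedelta(days=30)    # warn_after: {count: 30, period: day}
ERROR_AFTER = timedelta(days=365)  # error_after: {count: 365, period: day}


def freshness_status(max_loaded_at, now):
    """Classify a source by the age of its newest record."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


latest_pickup = datetime(2024, 1, 31, 23, 59, 59)
print(freshness_status(latest_pickup, datetime(2024, 2, 10)))  # pass
print(freshness_status(latest_pickup, datetime(2024, 3, 15)))  # warn
print(freshness_status(latest_pickup, datetime(2025, 3, 15)))  # error
```

With January 2024 data, the result depends entirely on when the check runs, which is why the thresholds are described as generous.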


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/sources/sources.yml
version: 2

sources:
  - name: raw_nyc_taxi
    description: "NYC TLC Yellow Taxi data - Iceberg Silver table via DuckDB iceberg_scan"
    schema: main
    freshness:
      # Adjust thresholds for your SLA (generous window for historical data)
      warn_after: {count: 30, period: day}
      error_after: {count: 365, period: day}
    loaded_at_field: pickup_datetime
    tables:
      - name: raw_yellow_trips
        description: "Silver-layer cleaned trips from Iceberg (Flink-filtered, deduplicated, snake_case columns)"
        config:
          external_location: "iceberg_scan('s3://warehouse/silver/cleaned_trips', allow_moved_paths = true)"


## 12. dbt Seeds: Reference Data

Seeds are static CSV files that dbt materializes as tables at `dbt seed` time.
They provide the dimension lookup data for JOIN operations.

### Why Seeds Instead of External Tables?

Seeds are version-controlled with the project, always available, and fast to load.
The alternative (external dimension tables) requires additional infrastructure. For
relatively static reference data (265 NYC taxi zones don't change often), seeds are ideal.
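
To make the lookup role concrete, here is a small Python sketch that parses the payment type seed from 12.1 and left-joins it onto a few made-up trips. dbt does this in SQL; the `load_lookup` helper and the trip rows are hypothetical.

```python
import csv
import io

# Contents of seeds/payment_type_lookup.csv, as listed in 12.1.
PAYMENT_TYPE_CSV = """\
payment_type_id,payment_type_name
1,Credit card
2,Cash
3,No charge
4,Dispute
5,Unknown
6,Voided trip
"""


def load_lookup(csv_text, key_column, value_column):
    """Parse a seed CSV into a {key: value} dictionary with integer keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(record[key_column]): record[value_column] for record in reader}


payment_names = load_lookup(PAYMENT_TYPE_CSV, "payment_type_id", "payment_type_name")

trips = [
    {"trip_id": "t1", "payment_type_id": 1},
    {"trip_id": "t2", "payment_type_id": 2},
    {"trip_id": "t3", "payment_type_id": 9},  # no matching seed row
]
# Left join: unmatched keys yield None, like NULL in SQL.
enriched = [
    {**trip, "payment_type_name": payment_names.get(trip["payment_type_id"])}
    for trip in trips
]
for row in enriched:
    print(row["trip_id"], row["payment_type_name"])
```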

### 12.1 payment_type_lookup.csv


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/seeds/payment_type_lookup.csv
payment_type_id,payment_type_name
1,Credit card
2,Cash
3,No charge
4,Dispute
5,Unknown
6,Voided trip


### 12.2 rate_code_lookup.csv


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/seeds/rate_code_lookup.csv
rate_code_id,rate_code_name
1,Standard rate
2,JFK
3,Newark
4,Nassau or Westchester
5,Negotiated fare
6,Group ride
99,Unknown


### 12.3 seed_properties.yml — Explicit column types

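Without explicit types, an integer seed column such as `payment_type_id` may be inferred with a
wider type (`BIGINT`) than the `INT` used by the fact table, leaving join keys typed
inconsistently across models. The file below declares descriptions and tests for each seed;
it does not pin column types (dbt's `column_types` seed config is the place to do that).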


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/seeds/seed_properties.yml
version: 2

seeds:
  - name: taxi_zone_lookup
    description: "NYC TLC Taxi Zone lookup table (~265 zones)"
    columns:
      - name: LocationID
        tests: [unique, not_null]
      - name: Borough
        tests: [not_null]

  - name: payment_type_lookup
    description: "Payment type ID to description mapping"
    columns:
      - name: payment_type_id
        tests: [unique, not_null]

  - name: rate_code_lookup
    description: "Rate code ID to description mapping"
    columns:
      - name: rate_code_id
        tests: [unique, not_null]

  - name: vendor_lookup
    description: "TPEP vendor ID mapping (1=CMT, 2=VTS)"
    columns:
      - name: vendor_id
        tests: [unique, not_null]
      - name: vendor_name
        tests: [not_null]
      - name: vendor_abbr
        tests: [unique, not_null]


### 12.4 taxi_zone_lookup.csv (265 NYC zones)


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/seeds/taxi_zone_lookup.csv
"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
10,"Queens","Baisley Park","Boro Zone"
11,"Brooklyn","Bath Beach","Boro Zone"
12,"Manhattan","Battery Park","Yellow Zone"
13,"Manhattan","Battery Park City","Yellow Zone"
14,"Brooklyn","Bay Ridge","Boro Zone"
15,"Queens","Bay Terrace/Fort Totten","Boro Zone"
16,"Queens","Bayside","Boro Zone"
17,"Brooklyn","Bedford","Boro Zone"
18,"Bronx","Bedford Park","Boro Zone"
19,"Queens","Bellerose","Boro Zone"
20,"Bronx","Belmont","Boro Zone"
21,"Brooklyn","Bensonhurst East","Boro Zone"
22,"Brooklyn","Bensonhurst West","Boro Zone"
23,"Staten Island","Bloomfield/Emerson Hill","Boro Zone"
24,"Manhattan","Bloomingdale","Yellow Zone"
25,"Brooklyn","Boerum Hill","Boro Zone"
26,"Brooklyn","Borough Park","Boro Zone"
27,"Queens","Breezy Point/Fort Tilden/Riis Beach","Boro Zone"
28,"Queens","Briarwood/Jamaica Hills","Boro Zone"
29,"Brooklyn","Brighton Beach","Boro Zone"
30,"Queens","Broad Channel","Boro Zone"
31,"Bronx","Bronx Park","Boro Zone"
32,"Bronx","Bronxdale","Boro Zone"
33,"Brooklyn","Brooklyn Heights","Boro Zone"
34,"Brooklyn","Brooklyn Navy Yard","Boro Zone"
35,"Brooklyn","Brownsville","Boro Zone"
36,"Brooklyn","Bushwick North","Boro Zone"
37,"Brooklyn","Bushwick South","Boro Zone"
38,"Queens","Cambria Heights","Boro Zone"
39,"Brooklyn","Canarsie","Boro Zone"
40,"Brooklyn","Carroll Gardens","Boro Zone"
41,"Manhattan","Central Harlem","Boro Zone"
42,"Manhattan","Central Harlem North","Boro Zone"
43,"Manhattan","Central Park","Yellow Zone"
44,"Staten Island","Charleston/Tottenville","Boro Zone"
45,"Manhattan","Chinatown","Yellow Zone"
46,"Bronx","City Island","Boro Zone"
47,"Bronx","Claremont/Bathgate","Boro Zone"
48,"Manhattan","Clinton East","Yellow Zone"
49,"Brooklyn","Clinton Hill","Boro Zone"
50,"Manhattan","Clinton West","Yellow Zone"
51,"Bronx","Co-Op City","Boro Zone"
52,"Brooklyn","Cobble Hill","Boro Zone"
53,"Queens","College Point","Boro Zone"
54,"Brooklyn","Columbia Street","Boro Zone"
55,"Brooklyn","Coney Island","Boro Zone"
56,"Queens","Corona","Boro Zone"
57,"Queens","Corona","Boro Zone"
58,"Bronx","Country Club","Boro Zone"
59,"Bronx","Crotona Park","Boro Zone"
60,"Bronx","Crotona Park East","Boro Zone"
61,"Brooklyn","Crown Heights North","Boro Zone"
62,"Brooklyn","Crown Heights South","Boro Zone"
63,"Brooklyn","Cypress Hills","Boro Zone"
64,"Queens","Douglaston","Boro Zone"
65,"Brooklyn","Downtown Brooklyn/MetroTech","Boro Zone"
66,"Brooklyn","DUMBO/Vinegar Hill","Boro Zone"
67,"Brooklyn","Dyker Heights","Boro Zone"
68,"Manhattan","East Chelsea","Yellow Zone"
69,"Bronx","East Concourse/Concourse Village","Boro Zone"
70,"Queens","East Elmhurst","Boro Zone"
71,"Brooklyn","East Flatbush/Farragut","Boro Zone"
72,"Brooklyn","East Flatbush/Remsen Village","Boro Zone"
73,"Queens","East Flushing","Boro Zone"
74,"Manhattan","East Harlem North","Boro Zone"
75,"Manhattan","East Harlem South","Boro Zone"
76,"Brooklyn","East New York","Boro Zone"
77,"Brooklyn","East New York/Pennsylvania Avenue","Boro Zone"
78,"Bronx","East Tremont","Boro Zone"
79,"Manhattan","East Village","Yellow Zone"
80,"Brooklyn","East Williamsburg","Boro Zone"
81,"Bronx","Eastchester","Boro Zone"
82,"Queens","Elmhurst","Boro Zone"
83,"Queens","Elmhurst/Maspeth","Boro Zone"
84,"Staten Island","Eltingville/Annadale/Prince's Bay","Boro Zone"
85,"Brooklyn","Erasmus","Boro Zone"
86,"Queens","Far Rockaway","Boro Zone"
87,"Manhattan","Financial District North","Yellow Zone"
88,"Manhattan","Financial District South","Yellow Zone"
89,"Brooklyn","Flatbush/Ditmas Park","Boro Zone"
90,"Manhattan","Flatiron","Yellow Zone"
91,"Brooklyn","Flatlands","Boro Zone"
92,"Queens","Flushing","Boro Zone"
93,"Queens","Flushing Meadows-Corona Park","Boro Zone"
94,"Bronx","Fordham South","Boro Zone"
95,"Queens","Forest Hills","Boro Zone"
96,"Queens","Forest Park/Highland Park","Boro Zone"
97,"Brooklyn","Fort Greene","Boro Zone"
98,"Queens","Fresh Meadows","Boro Zone"
99,"Staten Island","Freshkills Park","Boro Zone"
100,"Manhattan","Garment District","Yellow Zone"
101,"Queens","Glen Oaks","Boro Zone"
102,"Queens","Glendale","Boro Zone"
103,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
104,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
105,"Manhattan","Governor's Island/Ellis Island/Liberty Island","Yellow Zone"
106,"Brooklyn","Gowanus","Boro Zone"
107,"Manhattan","Gramercy","Yellow Zone"
108,"Brooklyn","Gravesend","Boro Zone"
109,"Staten Island","Great Kills","Boro Zone"
110,"Staten Island","Great Kills Park","Boro Zone"
111,"Brooklyn","Green-Wood Cemetery","Boro Zone"
112,"Brooklyn","Greenpoint","Boro Zone"
113,"Manhattan","Greenwich Village North","Yellow Zone"
114,"Manhattan","Greenwich Village South","Yellow Zone"
115,"Staten Island","Grymes Hill/Clifton","Boro Zone"
116,"Manhattan","Hamilton Heights","Boro Zone"
117,"Queens","Hammels/Arverne","Boro Zone"
118,"Staten Island","Heartland Village/Todt Hill","Boro Zone"
119,"Bronx","Highbridge","Boro Zone"
120,"Manhattan","Highbridge Park","Boro Zone"
121,"Queens","Hillcrest/Pomonok","Boro Zone"
122,"Queens","Hollis","Boro Zone"
123,"Brooklyn","Homecrest","Boro Zone"
124,"Queens","Howard Beach","Boro Zone"
125,"Manhattan","Hudson Sq","Yellow Zone"
126,"Bronx","Hunts Point","Boro Zone"
127,"Manhattan","Inwood","Boro Zone"
128,"Manhattan","Inwood Hill Park","Boro Zone"
129,"Queens","Jackson Heights","Boro Zone"
130,"Queens","Jamaica","Boro Zone"
131,"Queens","Jamaica Estates","Boro Zone"
132,"Queens","JFK Airport","Airports"
133,"Brooklyn","Kensington","Boro Zone"
134,"Queens","Kew Gardens","Boro Zone"
135,"Queens","Kew Gardens Hills","Boro Zone"
136,"Bronx","Kingsbridge Heights","Boro Zone"
137,"Manhattan","Kips Bay","Yellow Zone"
138,"Queens","LaGuardia Airport","Airports"
139,"Queens","Laurelton","Boro Zone"
140,"Manhattan","Lenox Hill East","Yellow Zone"
141,"Manhattan","Lenox Hill West","Yellow Zone"
142,"Manhattan","Lincoln Square East","Yellow Zone"
143,"Manhattan","Lincoln Square West","Yellow Zone"
144,"Manhattan","Little Italy/NoLiTa","Yellow Zone"
145,"Queens","Long Island City/Hunters Point","Boro Zone"
146,"Queens","Long Island City/Queens Plaza","Boro Zone"
147,"Bronx","Longwood","Boro Zone"
148,"Manhattan","Lower East Side","Yellow Zone"
149,"Brooklyn","Madison","Boro Zone"
150,"Brooklyn","Manhattan Beach","Boro Zone"
151,"Manhattan","Manhattan Valley","Yellow Zone"
152,"Manhattan","Manhattanville","Boro Zone"
153,"Manhattan","Marble Hill","Boro Zone"
154,"Brooklyn","Marine Park/Floyd Bennett Field","Boro Zone"
155,"Brooklyn","Marine Park/Mill Basin","Boro Zone"
156,"Staten Island","Mariners Harbor","Boro Zone"
157,"Queens","Maspeth","Boro Zone"
158,"Manhattan","Meatpacking/West Village West","Yellow Zone"
159,"Bronx","Melrose South","Boro Zone"
160,"Queens","Middle Village","Boro Zone"
161,"Manhattan","Midtown Center","Yellow Zone"
162,"Manhattan","Midtown East","Yellow Zone"
163,"Manhattan","Midtown North","Yellow Zone"
164,"Manhattan","Midtown South","Yellow Zone"
165,"Brooklyn","Midwood","Boro Zone"
166,"Manhattan","Morningside Heights","Boro Zone"
167,"Bronx","Morrisania/Melrose","Boro Zone"
168,"Bronx","Mott Haven/Port Morris","Boro Zone"
169,"Bronx","Mount Hope","Boro Zone"
170,"Manhattan","Murray Hill","Yellow Zone"
171,"Queens","Murray Hill-Queens","Boro Zone"
172,"Staten Island","New Dorp/Midland Beach","Boro Zone"
173,"Queens","North Corona","Boro Zone"
174,"Bronx","Norwood","Boro Zone"
175,"Queens","Oakland Gardens","Boro Zone"
176,"Staten Island","Oakwood","Boro Zone"
177,"Brooklyn","Ocean Hill","Boro Zone"
178,"Brooklyn","Ocean Parkway South","Boro Zone"
179,"Queens","Old Astoria","Boro Zone"
180,"Queens","Ozone Park","Boro Zone"
181,"Brooklyn","Park Slope","Boro Zone"
182,"Bronx","Parkchester","Boro Zone"
183,"Bronx","Pelham Bay","Boro Zone"
184,"Bronx","Pelham Bay Park","Boro Zone"
185,"Bronx","Pelham Parkway","Boro Zone"
186,"Manhattan","Penn Station/Madison Sq West","Yellow Zone"
187,"Staten Island","Port Richmond","Boro Zone"
188,"Brooklyn","Prospect-Lefferts Gardens","Boro Zone"
189,"Brooklyn","Prospect Heights","Boro Zone"
190,"Brooklyn","Prospect Park","Boro Zone"
191,"Queens","Queens Village","Boro Zone"
192,"Queens","Queensboro Hill","Boro Zone"
193,"Queens","Queensbridge/Ravenswood","Boro Zone"
194,"Manhattan","Randalls Island","Yellow Zone"
195,"Brooklyn","Red Hook","Boro Zone"
196,"Queens","Rego Park","Boro Zone"
197,"Queens","Richmond Hill","Boro Zone"
198,"Queens","Ridgewood","Boro Zone"
199,"Bronx","Rikers Island","Boro Zone"
200,"Bronx","Riverdale/North Riverdale/Fieldston","Boro Zone"
201,"Queens","Rockaway Park","Boro Zone"
202,"Manhattan","Roosevelt Island","Boro Zone"
203,"Queens","Rosedale","Boro Zone"
204,"Staten Island","Rossville/Woodrow","Boro Zone"
205,"Queens","Saint Albans","Boro Zone"
206,"Staten Island","Saint George/New Brighton","Boro Zone"
207,"Queens","Saint Michaels Cemetery/Woodside","Boro Zone"
208,"Bronx","Schuylerville/Edgewater Park","Boro Zone"
209,"Manhattan","Seaport","Yellow Zone"
210,"Brooklyn","Sheepshead Bay","Boro Zone"
211,"Manhattan","SoHo","Yellow Zone"
212,"Bronx","Soundview/Bruckner","Boro Zone"
213,"Bronx","Soundview/Castle Hill","Boro Zone"
214,"Staten Island","South Beach/Dongan Hills","Boro Zone"
215,"Queens","South Jamaica","Boro Zone"
216,"Queens","South Ozone Park","Boro Zone"
217,"Brooklyn","South Williamsburg","Boro Zone"
218,"Queens","Springfield Gardens North","Boro Zone"
219,"Queens","Springfield Gardens South","Boro Zone"
220,"Bronx","Spuyten Duyvil/Kingsbridge","Boro Zone"
221,"Staten Island","Stapleton","Boro Zone"
222,"Brooklyn","Starrett City","Boro Zone"
223,"Queens","Steinway","Boro Zone"
224,"Manhattan","Stuy Town/Peter Cooper Village","Yellow Zone"
225,"Brooklyn","Stuyvesant Heights","Boro Zone"
226,"Queens","Sunnyside","Boro Zone"
227,"Brooklyn","Sunset Park East","Boro Zone"
228,"Brooklyn","Sunset Park West","Boro Zone"
229,"Manhattan","Sutton Place/Turtle Bay North","Yellow Zone"
230,"Manhattan","Times Sq/Theatre District","Yellow Zone"
231,"Manhattan","TriBeCa/Civic Center","Yellow Zone"
232,"Manhattan","Two Bridges/Seward Park","Yellow Zone"
233,"Manhattan","UN/Turtle Bay South","Yellow Zone"
234,"Manhattan","Union Sq","Yellow Zone"
235,"Bronx","University Heights/Morris Heights","Boro Zone"
236,"Manhattan","Upper East Side North","Yellow Zone"
237,"Manhattan","Upper East Side South","Yellow Zone"
238,"Manhattan","Upper West Side North","Yellow Zone"
239,"Manhattan","Upper West Side South","Yellow Zone"
240,"Bronx","Van Cortlandt Park","Boro Zone"
241,"Bronx","Van Cortlandt Village","Boro Zone"
242,"Bronx","Van Nest/Morris Park","Boro Zone"
243,"Manhattan","Washington Heights North","Boro Zone"
244,"Manhattan","Washington Heights South","Boro Zone"
245,"Staten Island","West Brighton","Boro Zone"
246,"Manhattan","West Chelsea/Hudson Yards","Yellow Zone"
247,"Bronx","West Concourse","Boro Zone"
248,"Bronx","West Farms/Bronx River","Boro Zone"
249,"Manhattan","West Village","Yellow Zone"
250,"Bronx","Westchester Village/Unionport","Boro Zone"
251,"Staten Island","Westerleigh","Boro Zone"
252,"Queens","Whitestone","Boro Zone"
253,"Queens","Willets Point","Boro Zone"
254,"Bronx","Williamsbridge/Olinville","Boro Zone"
255,"Brooklyn","Williamsburg (North Side)","Boro Zone"
256,"Brooklyn","Williamsburg (South Side)","Boro Zone"
257,"Brooklyn","Windsor Terrace","Boro Zone"
258,"Queens","Woodhaven","Boro Zone"
259,"Bronx","Woodlawn/Wakefield","Boro Zone"
260,"Queens","Woodside","Boro Zone"
261,"Manhattan","World Trade Center","Yellow Zone"
262,"Manhattan","Yorkville East","Yellow Zone"
263,"Manhattan","Yorkville West","Yellow Zone"
264,"Unknown","N/A","N/A"
265,"N/A","Outside of NYC","N/A"
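
Only `LocationID` identifies a zone in this seed: rows 56 and 57 are both "Corona", and rows 103 to 105 share one name. That is why `staging.yml` tests `location_id`, not the zone name, for uniqueness. The sketch below shows the check on a four-row sample copied from the file; the function name `check_zone_seed` and the inline sample are illustrative and not part of the pipeline.

```python
import csv
import io
from collections import Counter

# Four rows copied from the seed above; the full file has 265.
SAMPLE = '''"LocationID","Borough","Zone","service_zone"
56,"Queens","Corona","Boro Zone"
57,"Queens","Corona","Boro Zone"
132,"Queens","JFK Airport","Airports"
265,"N/A","Outside of NYC","N/A"
'''

def check_zone_seed(text):
    """Return (row_count, duplicate_ids, duplicate_zone_names) for a zone seed."""
    rows = list(csv.DictReader(io.StringIO(text)))
    id_counts = Counter(int(r["LocationID"]) for r in rows)
    zone_counts = Counter((r["Borough"], r["Zone"]) for r in rows)
    dup_ids = sorted(k for k, n in id_counts.items() if n > 1)
    dup_zones = sorted(k for k, n in zone_counts.items() if n > 1)
    return len(rows), dup_ids, dup_zones

row_count, dup_ids, dup_zones = check_zone_seed(SAMPLE)
print(row_count, dup_ids, dup_zones)
```

Run against the full file, the same function should report 265 rows and no duplicate IDs.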


### 12.5 vendor_lookup.csv


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/seeds/vendor_lookup.csv
vendor_id,vendor_name,vendor_abbr
1,Creative Mobile Technologies,CMT
2,VeriFone Inc.,VFI


## 13. dbt Macros: Cross-Database Compatibility

dbt macros use Jinja2 templating to generate database-specific SQL behind a single call. With the
`adapter.dispatch()` pattern, the same model can run on DuckDB, PostgreSQL (RisingWave), or Spark,
the three adapters these macros implement: dbt selects the `<adapter>__<macro>` variant that matches the target.

**For this pipeline:** All macros render the DuckDB variant (it's our only adapter).
When porting to another database, only the macro implementations need to change.
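
The lookup that `adapter.dispatch()` performs can be sketched in plain Python. This is a simplified model: real dbt also searches parent adapters and configurable package namespaces, which the sketch leaves out. The `dispatch` function and the `MACROS` dictionary are hypothetical stand-ins for the Jinja macros in `dayname_compat.sql`.

```python
def dispatch(macro_name, adapter_type, namespace):
    """Pick an implementation the way adapter.dispatch() does in outline:
    adapter-specific first, then default__; raise if neither exists."""
    for prefix in (adapter_type, "default"):
        impl = namespace.get(f"{prefix}__{macro_name}")
        if impl is not None:
            return impl
    raise LookupError(f"no {macro_name} implementation for adapter '{adapter_type}'")

# Hypothetical stand-ins for the Jinja macros defined in dayname_compat.sql.
MACROS = {
    "duckdb__dayname_compat": lambda col: f"dayname({col})",
    "postgres__dayname_compat": lambda col: f"trim(to_char({col}, 'Day'))",
    "spark__dayname_compat": lambda col: f"date_format({col}, 'EEEE')",
}

sql = dispatch("dayname_compat", "duckdb", MACROS)("pickup_datetime")
print(sql)
```

Because none of the macros here define a `default__` variant, a target without its own implementation would fail at compile time instead of silently falling back.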

### 13.1 cents_to_dollars (decimal cast, divide by 100, round)


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/macros/cents_to_dollars.sql
/*
    Macro: Convert a cents column to dollars with rounding.

    Usage:
        {{ cents_to_dollars('fare_cents') }}
        {{ cents_to_dollars('fare_cents', 4) }}
*/

{% macro cents_to_dollars(column_name, precision=2) %}
    round(cast({{ column_name }} as decimal(10, {{ precision }})) / 100, {{ precision }})
{% endmacro %}
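
To make the expansion concrete, here is a small sketch of what the macro renders and what the arithmetic yields. `render_cents_to_dollars` reproduces the macro body as a string; `cents_to_dollars` evaluates it with Python decimals. The tie-breaking mode (`ROUND_HALF_UP`) is an assumption about the database's behaviour, not something the macro specifies.

```python
from decimal import Decimal, ROUND_HALF_UP

def render_cents_to_dollars(column_name, precision=2):
    """Return the SQL text the macro body expands to."""
    return f"round(cast({column_name} as decimal(10, {precision})) / 100, {precision})"

def cents_to_dollars(cents, precision=2):
    """Evaluate the same arithmetic with Python decimals.
    ROUND_HALF_UP is an assumption about how the database rounds ties."""
    quantum = Decimal(1).scaleb(-precision)  # precision=2 gives Decimal('0.01')
    return (Decimal(cents) / 100).quantize(quantum, rounding=ROUND_HALF_UP)

print(render_cents_to_dollars("fare_cents"))
print(cents_to_dollars(1234))
```

The cast happens before the division, so the division is decimal, not integer: 1234 cents becomes 12.34, not 12.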


### 13.2–13.4 Other macros

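- `dayname_compat`: day-of-week name. DuckDB uses `dayname(...)`, PostgreSQL uses `trim(to_char(..., 'Day'))`, Spark uses `date_format(..., 'EEEE')`. The same file also defines `monthname_compat` and `mode_compat`.
- `duration_minutes`: DuckDB uses `datediff('minute', start_ts, end_ts)`
- `test_positive_value`: reusable generic test (`positive_value`) that fails if any row has the named column < 0; zero passes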


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/macros/dayname_compat.sql
/*
    Macro: Get day-of-week name from a timestamp.
    Adapter-dispatched for DuckDB, PostgreSQL (RisingWave), and Spark.

    Usage:
        {{ dayname_compat('pickup_datetime') }}
*/

{% macro dayname_compat(col) %}
    {{ return(adapter.dispatch('dayname_compat', 'nyc_taxi_dbt')(col)) }}
{% endmacro %}

{% macro duckdb__dayname_compat(col) %}
    dayname({{ col }})
{% endmacro %}

{% macro postgres__dayname_compat(col) %}
    trim(to_char({{ col }}, 'Day'))
{% endmacro %}

{% macro spark__dayname_compat(col) %}
    date_format({{ col }}, 'EEEE')
{% endmacro %}


/*
    Macro: Get month name from a timestamp.
    Adapter-dispatched for DuckDB, PostgreSQL (RisingWave), and Spark.
*/

{% macro monthname_compat(col) %}
    {{ return(adapter.dispatch('monthname_compat', 'nyc_taxi_dbt')(col)) }}
{% endmacro %}

{% macro duckdb__monthname_compat(col) %}
    monthname({{ col }})
{% endmacro %}

{% macro postgres__monthname_compat(col) %}
    trim(to_char({{ col }}, 'Month'))
{% endmacro %}

{% macro spark__monthname_compat(col) %}
    date_format({{ col }}, 'MMMM')
{% endmacro %}


/*
    Macro: Statistical mode (most common value).
    Adapter-dispatched for DuckDB, PostgreSQL, and Spark.
*/

{% macro mode_compat(col) %}
    {{ return(adapter.dispatch('mode_compat', 'nyc_taxi_dbt')(col)) }}
{% endmacro %}

{% macro duckdb__mode_compat(col) %}
    mode({{ col }})
{% endmacro %}

{% macro postgres__mode_compat(col) %}
    mode() WITHIN GROUP (ORDER BY {{ col }})
{% endmacro %}

{% macro spark__mode_compat(col) %}
    mode({{ col }})
{% endmacro %}


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/macros/duration_minutes.sql
/*
    Macro: Calculate duration between two timestamps in minutes.
    Adapter-dispatched for DuckDB, PostgreSQL (RisingWave), and Spark.

    Usage:
        {{ duration_minutes('pickup_datetime', 'dropoff_datetime') }}
*/

{% macro duration_minutes(start_col, end_col) %}
    {{ return(adapter.dispatch('duration_minutes', 'nyc_taxi_dbt')(start_col, end_col)) }}
{% endmacro %}

{% macro duckdb__duration_minutes(start_col, end_col) %}
    datediff('minute', {{ start_col }}, {{ end_col }})
{% endmacro %}

{% macro postgres__duration_minutes(start_col, end_col) %}
    (EXTRACT(EPOCH FROM ({{ end_col }} - {{ start_col }})) / 60)::bigint
{% endmacro %}

{% macro spark__duration_minutes(start_col, end_col) %}
    CAST((UNIX_TIMESTAMP({{ end_col }}) - UNIX_TIMESTAMP({{ start_col }})) / 60 AS BIGINT)
{% endmacro %}
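
The three variants are not guaranteed to agree on sub-minute differences. DuckDB documents `datediff` as counting the part boundaries crossed between two timestamps, while the Spark variant truncates elapsed seconds to whole minutes. The sketch below contrasts the two behaviours in plain Python; the function names are illustrative, and the DuckDB semantics are worth confirming against the version you run.

```python
from datetime import datetime

def minute_boundaries_crossed(start, end):
    """Count minute boundaries between two timestamps (boundary-counting semantics)."""
    floor = lambda ts: ts.replace(second=0, microsecond=0)
    return int((floor(end) - floor(start)).total_seconds() // 60)

def whole_minutes_elapsed(start, end):
    """Truncate elapsed seconds to whole minutes, as a CAST(... AS BIGINT) would."""
    return int((end - start).total_seconds() / 60)

start = datetime(2024, 1, 15, 10, 0, 50)
end = datetime(2024, 1, 15, 10, 1, 10)
print(minute_boundaries_crossed(start, end), whole_minutes_elapsed(start, end))
```

A 20-second trip that straddles a minute boundary counts as 1 minute under the first rule and 0 under the second, which matters because `int_trip_metrics` filters on a minimum duration of 1 minute.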


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/macros/test_positive_value.sql
/*
    Custom generic test: Asserts that all values in a column are >= 0.

    Usage in schema.yml:
        columns:
          - name: fare_amount
            tests:
              - positive_value
*/

{% test positive_value(model, column_name) %}

select
    {{ column_name }} as invalid_value
from {{ model }}
where {{ column_name }} < 0

{% endtest %}
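
A dbt generic test passes when its query returns zero rows, so the `where` clause defines what counts as a failure. A minimal Python equivalent, with `None` standing in for SQL `NULL` (the function name is illustrative):

```python
def positive_value_failures(values):
    """Rows the test query would return: values strictly below zero.
    None models SQL NULL, which the `< 0` comparison never matches."""
    return [v for v in values if v is not None and v < 0]

failures = positive_value_failures([12.5, 0, None, -3.0])
print(failures)
```

Despite the name, the test accepts zero and NULL; pair it with `not_null` if missing values should also fail.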


## 14. dbt Staging Models

Staging models are **thin wrappers** — they standardize column names, types, and apply the
minimal filter set needed to make downstream models reliable. No business logic here.

### The Silver Passthrough Pattern

Flink Silver has already:
- Renamed: `VendorID` → `vendor_id`, `PULocationID` → `pickup_location_id`, etc.
- Type-cast: `BIGINT` → `INT`, `DOUBLE` → `DECIMAL(10,2)`
- Quality-filtered and deduplicated

As a result, `stg_yellow_trips.sql` is essentially a passthrough with minor DuckDB type compatibility
casts (`TIMESTAMP(3)` → `TIMESTAMP`) and a null-timestamp safety net. No further cleaning is needed at this stage.

This is the **correct medallion architecture**: Flink does the hard infrastructure work,
dbt staging is the clean entry point.

### 14.1 stg_yellow_trips.sql — Main staging model (Silver passthrough)


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/staging/stg_yellow_trips.sql
{#
    Staging model: Yellow taxi trip records (Iceberg pipeline variant)

    This is a simple passthrough since Flink already performed the heavy lifting:
      - Column renaming (VendorID -> vendor_id, etc.)
      - Type casting (BIGINT -> INT, DOUBLE -> DECIMAL)
      - Data quality filtering (nulls, negative fares, date range)
      - Deduplication (ROW_NUMBER OVER natural key, latest ingestion wins)
      - Surrogate key generation (MD5 hash → trip_id)

    The source reads the Silver Iceberg table via DuckDB iceberg_scan().
#}

with source as (
    select * from {{ source('raw_nyc_taxi', 'raw_yellow_trips') }}
),

final as (
    select
        -- Flink already generated the surrogate key
        trip_id,

        -- identifiers (already renamed and cast by Flink)
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        payment_type_id,

        -- timestamps (already parsed by Flink)
        cast(pickup_datetime as timestamp) as pickup_datetime,
        cast(dropoff_datetime as timestamp) as dropoff_datetime,

        -- trip info
        passenger_count,
        trip_distance_miles,
        store_and_fwd_flag,

        -- financials (already rounded by Flink)
        round(cast(fare_amount as decimal(10, 2)), 2) as fare_amount,
        round(cast(extra_amount as decimal(10, 2)), 2) as extra_amount,
        round(cast(mta_tax as decimal(10, 2)), 2) as mta_tax,
        round(cast(tip_amount as decimal(10, 2)), 2) as tip_amount,
        round(cast(tolls_amount as decimal(10, 2)), 2) as tolls_amount,
        round(cast(improvement_surcharge as decimal(10, 2)), 2) as improvement_surcharge,
        round(cast(total_amount as decimal(10, 2)), 2) as total_amount,
        round(cast(congestion_surcharge as decimal(10, 2)), 2) as congestion_surcharge,
        round(cast(airport_fee as decimal(10, 2)), 2) as airport_fee

    from source
    -- Flink already applied quality filters; this is a safety net
    where pickup_datetime is not null
      and dropoff_datetime is not null
)

select * from final


### 14.2–14.5 Dimension staging models


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/staging/stg_payment_types.sql
/*
    Staging model: Payment type lookup
    Maps payment_type_id to human-readable names.
*/

with source as (
    select * from {{ ref('payment_type_lookup') }}
),

renamed as (
    select
        payment_type_id,
        payment_type_name
    from source
)

select * from renamed


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/staging/stg_rate_codes.sql
/*
    Staging model: Rate code lookup
    Maps rate_code_id to human-readable names.
*/

with source as (
    select * from {{ ref('rate_code_lookup') }}
),

renamed as (
    select
        rate_code_id,
        rate_code_name
    from source
)

select * from renamed


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/staging/stg_taxi_zones.sql
/*
    Staging model: Taxi zone lookup
    Maps LocationID to borough and zone name.
*/

with source as (
    select * from {{ ref('taxi_zone_lookup') }}
),

renamed as (
    select
        cast("LocationID" as integer) as location_id,
        "Borough" as borough,
        "Zone" as zone_name,
        service_zone
    from source
)

select * from renamed


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/staging/stg_vendors.sql
/*
    Staging model: Vendor lookup
    Maps vendor_id to vendor name and abbreviation.
*/

with source as (
    select * from {{ ref('vendor_lookup') }}
),

renamed as (
    select
        vendor_id,
        vendor_name,
        vendor_abbr
    from source
)

select * from renamed


### 14.6 staging.yml — Schema documentation + tests


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/staging/staging.yml
version: 2

models:
  - name: stg_yellow_trips
    description: "Silver passthrough of yellow taxi trip records. Flink already renamed, cast, filtered, and deduplicated the data; staging re-casts types and drops rows with null timestamps."
    columns:
      - name: trip_id
        description: "Surrogate key generated from VendorID + timestamps + locations + fare/total amounts"
        tests:
          - not_null
          - unique
      - name: vendor_id
        description: "TPEP provider: 1=Creative Mobile Technologies, 2=VeriFone Inc., 6=Unknown/Other"
        tests:
          - not_null
          - accepted_values:
              arguments:
                values: [1, 2, 6]
              config:
                severity: warn
      - name: rate_code_id
        description: "Rate code: 1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group, 99=Unknown"
        tests:
          - accepted_values:
              arguments:
                values: [1, 2, 3, 4, 5, 6, 99]
              config:
                severity: warn
      - name: pickup_location_id
        description: "TLC Taxi Zone ID for pickup"
        tests:
          - not_null
          - relationships:
              arguments:
                to: ref('stg_taxi_zones')
                field: location_id
              config:
                severity: warn
      - name: dropoff_location_id
        description: "TLC Taxi Zone ID for dropoff"
        tests:
          - not_null
          - relationships:
              arguments:
                to: ref('stg_taxi_zones')
                field: location_id
              config:
                severity: warn
      - name: payment_type_id
        description: "Payment method"
        tests:
          - accepted_values:
              arguments:
                values: [0, 1, 2, 3, 4, 5, 6]
              config:
                severity: warn
      - name: pickup_datetime
        tests:
          - not_null
      - name: dropoff_datetime
        tests:
          - not_null
      - name: trip_distance_miles
        tests:
          - not_null
      - name: fare_amount
        tests:
          - not_null
      - name: total_amount
        tests:
          - not_null

  - name: stg_taxi_zones
    description: "Taxi zone reference mapping location IDs to borough and zone names"
    columns:
      - name: location_id
        tests:
          - unique
          - not_null
      - name: borough
        tests:
          - not_null
      - name: zone_name
        tests:
          - not_null

  - name: stg_payment_types
    description: "Payment type reference"
    columns:
      - name: payment_type_id
        tests:
          - unique
          - not_null
      - name: payment_type_name
        tests:
          - not_null

  - name: stg_rate_codes
    description: "Rate code reference"
    columns:
      - name: rate_code_id
        tests:
          - unique
          - not_null
      - name: rate_code_name
        tests:
          - not_null
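
The `relationships` tests on `pickup_location_id` and `dropoff_location_id` behave like an anti-join: they report child values that have no match in the parent column, and they skip NULL child values. The sketch below models that logic in Python; `relationship_failures` is an illustrative name, not a dbt function.

```python
def relationship_failures(child_ids, parent_ids):
    """Child values with no matching parent; None (SQL NULL) is ignored."""
    parents = set(parent_ids)
    return [c for c in child_ids if c is not None and c not in parents]

zone_ids = range(1, 266)  # location_id values 1..265 from stg_taxi_zones
orphans = relationship_failures([132, 265, 300, None], zone_ids)
print(orphans)
```

With `severity: warn`, an orphan such as 300 is reported as a warning and does not fail the run.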


## 15. dbt Intermediate Models

Intermediate models add **business logic** on top of staging. This is where computed metrics
and multi-model joins live — the place Flink explicitly does NOT touch.

### Separation of Concerns: Flink vs dbt

| Computation | Where | Reason |
|-------------|-------|--------|
| Timestamp parsing | Flink Bronze | Must happen before Iceberg write (type compatibility) |
| Null filtering | Flink Silver | Stream-level quality gate |
| Deduplication | Flink Silver | Needs global ordering (ingestion_ts across partitions) |
| Column renaming | Flink Silver | Consistent schema for iceberg_scan |
| `duration_minutes` | **dbt int_trip_metrics** | Business metric, testable, versionable |
| `avg_speed_mph` | **dbt int_trip_metrics** | Business metric |
| `tip_percentage` | **dbt int_trip_metrics** | Business metric |
| `is_weekend` | **dbt int_trip_metrics** | Calendar attribute |
| Revenue aggregation | **dbt mart_daily_revenue** | Analytics aggregate |

> **Audit finding (Feb 2026):** An earlier version of the pipeline computed `duration_minutes`,
> `avg_speed_mph`, `cost_per_mile`, `tip_percentage`, `pickup_hour`, and `is_weekend` directly
> in the Flink Silver SQL. This was removed — business logic belongs in dbt where it's testable
> and analysts can modify it without touching the stream processor.

### 15.1 int_trip_metrics.sql — Trip-level enrichment


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/intermediate/int_trip_metrics.sql
/*
    Intermediate model: Trip-level enrichment with calculated metrics.
    Uses adapter-dispatched macros for cross-dialect compatibility.
*/

with trips as (
    select * from {{ ref('stg_yellow_trips') }}
),

enriched as (
    select
        trip_id,
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        payment_type_id,
        pickup_datetime,
        dropoff_datetime,
        passenger_count,
        trip_distance_miles,
        store_and_fwd_flag,

        -- calculated: duration in minutes
        {{ duration_minutes('pickup_datetime', 'dropoff_datetime') }} as trip_duration_minutes,

        -- calculated: average speed (avoid division by zero)
        case
            when {{ duration_minutes('pickup_datetime', 'dropoff_datetime') }} > 0
            then round(
                trip_distance_miles / ({{ duration_minutes('pickup_datetime', 'dropoff_datetime') }} / 60.0),
                2
            )
            else null
        end as avg_speed_mph,

        -- calculated: cost per mile
        case
            when trip_distance_miles > 0
            then round(fare_amount / trip_distance_miles, 2)
            else null
        end as cost_per_mile,

        -- calculated: tip percentage
        case
            when fare_amount > 0
            then round((tip_amount / fare_amount) * 100, 2)
            else null
        end as tip_percentage,

        -- time dimensions (using adapter-dispatched macros)
        date_trunc('day', pickup_datetime)::date as pickup_date,
        extract(hour from pickup_datetime) as pickup_hour,
        {{ dayname_compat('pickup_datetime') }} as pickup_day_of_week,
        case
            when extract(dow from pickup_datetime) in (0, 6) then true
            else false
        end as is_weekend,

        -- financials passthrough
        fare_amount,
        extra_amount,
        mta_tax,
        tip_amount,
        tolls_amount,
        improvement_surcharge,
        total_amount,
        congestion_surcharge,
        airport_fee

    from trips
)

select *
from enriched
where trip_duration_minutes between 1 and 720
  and (avg_speed_mph is null or avg_speed_mph < 100)
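
To see how the guards and the final filter interact, here is a single-trip sketch in Python. It assumes boundary-counting minutes for the duration and uses floats where the model uses decimals; the function names are illustrative.

```python
from datetime import datetime

def trip_metrics(pickup, dropoff, distance_miles, fare_amount, tip_amount):
    """Mirror the CASE guards in int_trip_metrics for a single trip."""
    floor = lambda ts: ts.replace(second=0, microsecond=0)
    minutes = int((floor(dropoff) - floor(pickup)).total_seconds() // 60)
    speed = round(distance_miles / (minutes / 60.0), 2) if minutes > 0 else None
    cost_per_mile = round(fare_amount / distance_miles, 2) if distance_miles > 0 else None
    tip_pct = round(tip_amount / fare_amount * 100, 2) if fare_amount > 0 else None
    # extract(dow ...) numbers Sunday as 0; Python's weekday() numbers Monday as 0.
    dow = (pickup.weekday() + 1) % 7
    return {
        "trip_duration_minutes": minutes,
        "avg_speed_mph": speed,
        "cost_per_mile": cost_per_mile,
        "tip_percentage": tip_pct,
        "is_weekend": dow in (0, 6),
    }

def passes_final_filter(m):
    """The WHERE clause after the CTE: 1..720 minutes, speed under 100 mph or unknown."""
    return 1 <= m["trip_duration_minutes"] <= 720 and (
        m["avg_speed_mph"] is None or m["avg_speed_mph"] < 100
    )

m = trip_metrics(datetime(2024, 1, 6, 14, 0), datetime(2024, 1, 6, 14, 15), 3.0, 12.0, 3.0)
print(m, passes_final_filter(m))
```

A zero-duration trip gets `avg_speed_mph = None` from the guard, but the duration filter still removes it, so the `is null` branch only matters for rows that already have a valid duration.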


### 15.2 int_daily_summary.sql — One row per pickup date


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/intermediate/int_daily_summary.sql
/*
    Intermediate model: Daily aggregated trip and revenue metrics.
    One row per day with counts, averages, and revenue totals.
*/

with trip_metrics as (
    select * from {{ ref('int_trip_metrics') }}
),

daily_agg as (
    select
        pickup_date,
        pickup_day_of_week,
        is_weekend,

        count(*) as total_trips,
        sum(passenger_count) as total_passengers,

        round(avg(trip_distance_miles), 2) as avg_trip_distance,
        round(avg(trip_duration_minutes), 2) as avg_trip_duration_min,
        round(avg(avg_speed_mph), 2) as avg_speed_mph,

        round(sum(fare_amount), 2) as total_fare_revenue,
        round(sum(tip_amount), 2) as total_tip_revenue,
        round(sum(total_amount), 2) as total_revenue,
        round(avg(total_amount), 2) as avg_trip_revenue,
        round(avg(tip_percentage), 2) as avg_tip_percentage,

        count(case when payment_type_id = 1 then 1 end) as credit_card_trips,
        count(case when payment_type_id = 2 then 1 end) as cash_trips

    from trip_metrics
    group by pickup_date, pickup_day_of_week, is_weekend
)

select * from daily_agg
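
Grouping by `pickup_date`, `pickup_day_of_week`, and `is_weekend` still yields one row per date, because the last two columns are derived from the first. The Python sketch below illustrates the grouping and the conditional counts; the trip dictionaries and the `daily_summary` function are illustrative, not pipeline code.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def daily_summary(trips):
    """Group trips by (date, day name, weekend flag) and aggregate each group."""
    groups = defaultdict(list)
    for t in trips:
        d = t["pickup_date"]
        groups[(d, DAY_NAMES[d.weekday()], d.weekday() >= 5)].append(t)
    return [
        {
            "pickup_date": d,
            "pickup_day_of_week": name,
            "is_weekend": weekend,
            "total_trips": len(rows),
            "total_revenue": round(sum(r["total_amount"] for r in rows), 2),
            # count(case when ... then 1 end) counts only the matching rows
            "credit_card_trips": sum(1 for r in rows if r["payment_type_id"] == 1),
            "cash_trips": sum(1 for r in rows if r["payment_type_id"] == 2),
        }
        for (d, name, weekend), rows in sorted(groups.items())
    ]

trips = [
    {"pickup_date": date(2024, 1, 6), "total_amount": 10.5, "payment_type_id": 1},
    {"pickup_date": date(2024, 1, 6), "total_amount": 20.0, "payment_type_id": 2},
    {"pickup_date": date(2024, 1, 8), "total_amount": 7.25, "payment_type_id": 1},
]
summary = daily_summary(trips)
print(summary)
```

This is why the `unique` test on `pickup_date` in `intermediate.yml` can hold even though the `group by` lists three columns.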


### 15.3 int_hourly_patterns.sql — One row per pickup_date × pickup_hour


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/intermediate/int_hourly_patterns.sql
/*
    Intermediate model: Hourly trip patterns by date.
    One row per date + hour combination.
*/

with trip_metrics as (
    select * from {{ ref('int_trip_metrics') }}
),

hourly_agg as (
    select
        pickup_date,
        pickup_hour,
        pickup_day_of_week,
        is_weekend,

        count(*) as total_trips,
        round(avg(trip_distance_miles), 2) as avg_distance,
        round(avg(trip_duration_minutes), 2) as avg_duration_min,
        round(sum(total_amount), 2) as total_revenue,
        round(avg(total_amount), 2) as avg_revenue

    from trip_metrics
    group by pickup_date, pickup_hour, pickup_day_of_week, is_weekend
)

select * from hourly_agg


### 15.4 intermediate.yml


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/intermediate/intermediate.yml
version: 2

models:
  - name: int_trip_metrics
    description: "Trip records enriched with calculated metrics."
    columns:
      - name: trip_id
        tests:
          - not_null
      - name: trip_duration_minutes
        tests:
          - not_null
          - dbt_utils.accepted_range:
              arguments:
                min_value: 1
                max_value: 720
      - name: pickup_date
        tests:
          - not_null
      - name: pickup_hour
        tests:
          - not_null
          - dbt_utils.accepted_range:
              arguments:
                min_value: 0
                max_value: 23
      - name: is_weekend
        tests:
          - not_null

  - name: int_daily_summary
    description: "Daily aggregated trip counts, revenue, and average metrics"
    columns:
      - name: pickup_date
        tests:
          - unique
          - not_null
      - name: total_trips
        tests:
          - not_null
          - dbt_utils.accepted_range:
              arguments:
                min_value: 0
      - name: total_revenue
        tests:
          - not_null

  - name: int_hourly_patterns
    description: "Hourly trip aggregations by date"
    columns:
      - name: pickup_date
        tests:
          - not_null
      - name: pickup_hour
        tests:
          - not_null
          - dbt_utils.accepted_range:
              arguments:
                min_value: 0
                max_value: 23
      - name: total_trips
        tests:
          - not_null


## 16. dbt Core Marts (Gold Layer — Star Schema)

Core marts form the **star schema** that BI tools query directly.

```
                    dim_dates
                    (31 rows)
                        │
dim_vendors ────── fct_trips ────── dim_payment_types
(2 rows)        (~9,855 rows)        (6 rows)
                        │
                  dim_locations
                  (265 rows)
```

**Grain of fct_trips:** One row per taxi trip, identified by `trip_id` (MD5 surrogate key
generated by Flink Silver from the natural key). The model joins `int_trip_metrics` to
`dim_locations` twice (pickup and dropoff); the other dimensions relate through their ID columns.

### 16.1 fct_trips.sql — Central fact table


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/core/fct_trips.sql
/*
    Fact table: Fully enriched trip records with location names.
    Incremental with delete+insert strategy.
*/

{{
  config(
    materialized='incremental',
    unique_key='trip_id',
    incremental_strategy='delete+insert',
    on_schema_change='fail'
  )
}}

with trip_metrics as (
    select * from {{ ref('int_trip_metrics') }}
),

pickup_locations as (
    select * from {{ ref('dim_locations') }}
),

dropoff_locations as (
    select * from {{ ref('dim_locations') }}
),

final as (
    select
        t.trip_id,
        t.vendor_id,
        t.rate_code_id,
        t.payment_type_id,
        t.pickup_location_id,
        t.dropoff_location_id,
        t.pickup_datetime,
        t.dropoff_datetime,
        t.pickup_date,
        t.pickup_hour,
        t.pickup_day_of_week,
        t.is_weekend,
        t.passenger_count,
        t.trip_distance_miles,
        t.trip_duration_minutes,
        t.avg_speed_mph,
        t.cost_per_mile,
        t.fare_amount,
        t.extra_amount,
        t.mta_tax,
        t.tip_amount,
        t.tip_percentage,
        t.tolls_amount,
        t.improvement_surcharge,
        t.total_amount,
        t.congestion_surcharge,
        t.airport_fee,

        -- enriched from dimensions
        pu.borough as pickup_borough,
        pu.zone_name as pickup_zone,
        do_loc.borough as dropoff_borough,
        do_loc.zone_name as dropoff_zone

    from trip_metrics t
    left join pickup_locations pu
        on t.pickup_location_id = pu.location_id
    left join dropoff_locations do_loc
        on t.dropoff_location_id = do_loc.location_id

    {% if is_incremental() %}
    where t.pickup_datetime > (select max(pickup_datetime) from {{ this }})
    {% endif %}
)

select * from final


### 16.2–16.5 Dimension tables

- `dim_dates` — one row per calendar date (Jan 2024) with week, month, `is_weekend`, and `is_holiday` attributes
- `dim_locations` — 265 NYC taxi zones with borough and service_zone
- `dim_payment_types` — maps payment_type_id to description
- `dim_vendors` — maps `vendor_id` to vendor name and abbreviation
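
As a rough illustration of what `dim_dates` contains, the sketch below builds the same kind of date spine in plain Python. The helper name `build_dim_dates` is hypothetical; the half-open date range, the Sunday=0 day-of-week numbering, and the two holiday dates mirror the SQL model that follows.

```python
from datetime import date, timedelta

def build_dim_dates(start, end_exclusive, holidays):
    """Build one row per day in [start, end_exclusive), similar to a date spine."""
    rows = []
    day = start
    while day < end_exclusive:
        dow = day.isoweekday() % 7  # 0 = Sunday ... 6 = Saturday
        rows.append({
            "date_key": day,
            "day_of_week_num": dow,
            "is_weekend": dow in (0, 6),
            "is_holiday": day in holidays,
        })
        day += timedelta(days=1)
    return rows

dim_dates = build_dim_dates(
    date(2024, 1, 1),
    date(2024, 2, 1),  # exclusive upper bound, so January only
    holidays={date(2024, 1, 1), date(2024, 1, 15)},
)
print(len(dim_dates))                           # 31
print(sum(r["is_weekend"] for r in dim_dates))  # 8
```

The exclusive end date is why the spine stops at January 31 and the dimension has 31 rows.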


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/core/dim_dates.sql
/*
    Dimension table: Calendar dates for January 2024.
    Uses adapter-dispatched macros for dayname/monthname.
*/

with date_spine as (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2024-01-01' as date)",
        end_date="cast('2024-02-01' as date)"
    ) }}
),

final as (
    select
        cast(date_day as date) as date_key,
        extract(year from date_day) as year,
        extract(month from date_day) as month,
        extract(day from date_day) as day_of_month,
        extract(dow from date_day) as day_of_week_num,
        {{ dayname_compat('date_day') }} as day_of_week_name,
        {{ monthname_compat('date_day') }} as month_name,
        extract(week from date_day) as week_of_year,
        case
            when extract(dow from date_day) in (0, 6) then true
            else false
        end as is_weekend,
        case
            when cast(date_day as date) in (
                cast('2024-01-01' as date),
                cast('2024-01-15' as date)
            ) then true
            else false
        end as is_holiday

    from date_spine
)

select * from final


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/core/dim_locations.sql
/*
    Dimension table: TLC Taxi Zone locations.
*/

with zones as (
    select * from {{ ref('stg_taxi_zones') }}
),

final as (
    select
        location_id,
        borough,
        zone_name,
        service_zone
    from zones
)

select * from final


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/core/dim_payment_types.sql
/*
    Dimension table: Payment type descriptions.
*/

with payment_types as (
    select * from {{ ref('stg_payment_types') }}
),

final as (
    select
        payment_type_id,
        payment_type_name
    from payment_types
)

select * from final


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/core/dim_vendors.sql
/*
    Dimension table: Taxi vendor descriptions.
    TPEP provider: 1=Creative Mobile Technologies (CMT), 2=VeriFone Inc. (VFI)
*/

with vendors as (
    select * from {{ ref('stg_vendors') }}
),

final as (
    select
        vendor_id,
        vendor_name,
        vendor_abbr
    from vendors
)

select * from final


### 16.6 core.yml — Data contracts + `unique`/`not_null` key tests
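
With `contract: enforced: true`, dbt checks that the columns a model returns match the names and `data_type` values declared in the YAML, and fails the build on a mismatch. The Python sketch below mimics that idea only; it is not dbt's implementation, and `check_contract` is a hypothetical helper.

```python
def check_contract(declared, actual):
    """Return a list of human-readable contract violations.

    declared / actual: dicts mapping column name -> data type string.
    """
    problems = []
    for name, dtype in declared.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name].lower() != dtype.lower():
            problems.append(
                f"type mismatch on {name}: declared {dtype}, got {actual[name]}"
            )
    for name in actual:
        if name not in declared:
            problems.append(f"undeclared column: {name}")
    return problems

declared = {"payment_type_id": "integer", "payment_type_name": "varchar"}
good = {"payment_type_id": "INTEGER", "payment_type_name": "VARCHAR"}
bad = {"payment_type_id": "BIGINT", "payment_type_name": "VARCHAR", "extra": "VARCHAR"}

print(check_contract(declared, good))  # []
print(check_contract(declared, bad))
```

This is why the YAML below lists every column of each model, including ones that carry no tests: an omitted column would count as a contract violation.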


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/core/core.yml
version: 2

models:
  - name: fct_trips
    description: "Fact table with fully enriched trip records."
    config:
      contract:
        enforced: true
    columns:
      - name: trip_id
        data_type: varchar
        tests: [not_null, unique]
      - name: vendor_id
        data_type: integer
      - name: rate_code_id
        data_type: integer
      - name: payment_type_id
        data_type: integer
      - name: pickup_location_id
        data_type: integer
      - name: dropoff_location_id
        data_type: integer
      - name: pickup_datetime
        data_type: timestamp
        tests: [not_null]
      - name: dropoff_datetime
        data_type: timestamp
      - name: pickup_date
        data_type: date
      - name: pickup_hour
        data_type: bigint
      - name: pickup_day_of_week
        data_type: varchar
      - name: is_weekend
        data_type: boolean
      - name: passenger_count
        data_type: integer
      - name: trip_distance_miles
        data_type: double
      - name: trip_duration_minutes
        data_type: bigint
      - name: avg_speed_mph
        data_type: double
      - name: cost_per_mile
        data_type: double
      - name: fare_amount
        data_type: "decimal(10,2)"
      - name: extra_amount
        data_type: "decimal(10,2)"
      - name: mta_tax
        data_type: "decimal(10,2)"
      - name: tip_amount
        data_type: "decimal(10,2)"
      - name: tip_percentage
        data_type: double
      - name: tolls_amount
        data_type: "decimal(10,2)"
      - name: improvement_surcharge
        data_type: "decimal(10,2)"
      - name: total_amount
        data_type: "decimal(10,2)"
        tests: [not_null]
      - name: congestion_surcharge
        data_type: "decimal(10,2)"
      - name: airport_fee
        data_type: "decimal(10,2)"
      - name: pickup_borough
        data_type: varchar
      - name: pickup_zone
        data_type: varchar
      - name: dropoff_borough
        data_type: varchar
      - name: dropoff_zone
        data_type: varchar

  - name: dim_locations
    description: "Location dimension"
    config:
      contract:
        enforced: true
    columns:
      - name: location_id
        data_type: integer
        tests: [unique, not_null]
      - name: borough
        data_type: varchar
        tests: [not_null]
      - name: zone_name
        data_type: varchar
        tests: [not_null]
      - name: service_zone
        data_type: varchar

  - name: dim_dates
    description: "Date dimension for January 2024"
    config:
      contract:
        enforced: true
    columns:
      - name: date_key
        data_type: date
        tests: [unique, not_null]
      - name: year
        data_type: bigint
      - name: month
        data_type: bigint
      - name: day_of_month
        data_type: bigint
      - name: day_of_week_num
        data_type: bigint
      - name: day_of_week_name
        data_type: varchar
        tests: [not_null]
      - name: month_name
        data_type: varchar
      - name: week_of_year
        data_type: bigint
      - name: is_weekend
        data_type: boolean
        tests: [not_null]
      - name: is_holiday
        data_type: boolean
        tests: [not_null]

  - name: dim_payment_types
    description: "Payment type dimension"
    config:
      contract:
        enforced: true
    columns:
      - name: payment_type_id
        data_type: integer
        tests: [unique, not_null]
      - name: payment_type_name
        data_type: varchar
        tests: [not_null]


## 17. dbt Analytics Marts (Gold Layer — Business KPIs)

Analytics marts are pre-aggregated tables optimized for dashboards. They answer specific
business questions without requiring analysts to write complex SQL.

### 17.1 mart_daily_revenue.sql — Daily revenue KPIs
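
The model below adds two window calculations on top of the daily summary: a running total (`cumulative_revenue`) and a day-over-day difference (`revenue_change_vs_prior_day`). A small Python sketch of the same arithmetic, using made-up revenue figures and a hypothetical helper name:

```python
from itertools import accumulate

def add_running_metrics(daily_rows):
    """Add cumulative_revenue and revenue_change_vs_prior_day, ordered by date_key."""
    ordered = sorted(daily_rows, key=lambda r: r["date_key"])
    running = list(accumulate(r["total_revenue"] for r in ordered))
    result = []
    for i, row in enumerate(ordered):
        prior = ordered[i - 1]["total_revenue"] if i > 0 else None
        result.append({
            **row,
            "cumulative_revenue": running[i],
            # lag() yields NULL on the first row, so the difference is None there
            "revenue_change_vs_prior_day": (
                None if prior is None else row["total_revenue"] - prior
            ),
        })
    return result

# Made-up revenue figures, for illustration only
sample = [
    {"date_key": "2024-01-02", "total_revenue": 150},
    {"date_key": "2024-01-01", "total_revenue": 100},
    {"date_key": "2024-01-03", "total_revenue": 120},
]
for row in add_running_metrics(sample):
    print(row["date_key"], row["cumulative_revenue"], row["revenue_change_vs_prior_day"])
```

The first day has no prior row, so its change is null. That is why the contract in `analytics.yml` declares `revenue_change_vs_prior_day` without a `not_null` test.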


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/analytics/mart_daily_revenue.sql
/*
    Analytics mart: Daily revenue metrics with running totals.
*/

with daily as (
    select * from {{ ref('int_daily_summary') }}
),

dates as (
    select * from {{ ref('dim_dates') }}
),

final as (
    select
        d.date_key,
        d.day_of_week_name,
        d.is_weekend,
        d.is_holiday,
        d.week_of_year,

        daily.total_trips,
        daily.total_passengers,
        daily.total_fare_revenue,
        daily.total_tip_revenue,
        daily.total_revenue,
        daily.avg_trip_revenue,
        daily.avg_tip_percentage,
        daily.credit_card_trips,
        daily.cash_trips,
        daily.avg_trip_distance,
        daily.avg_trip_duration_min,

        -- running total
        sum(daily.total_revenue) over (order by d.date_key) as cumulative_revenue,

        -- day-over-day change
        daily.total_revenue - lag(daily.total_revenue) over (order by d.date_key) as revenue_change_vs_prior_day

    from daily
    inner join dates d
        on daily.pickup_date = d.date_key
)

select * from final


### 17.2 mart_hourly_demand.sql — Hourly demand patterns


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/analytics/mart_hourly_demand.sql
/*
    Analytics mart: Hourly demand patterns.
*/

with hourly as (
    select * from {{ ref('int_hourly_patterns') }}
),

final as (
    select
        pickup_hour,
        is_weekend,

        count(*) as days_observed,
        round(avg(total_trips), 0) as avg_trips_per_period,
        round(avg(avg_distance), 2) as avg_distance,
        round(avg(avg_duration_min), 2) as avg_duration_min,
        round(avg(total_revenue), 2) as avg_revenue_per_period,
        sum(total_trips) as total_trips_all_days

    from hourly
    group by pickup_hour, is_weekend
)

select * from final
order by is_weekend, pickup_hour


### 17.3 mart_location_performance.sql — Per-zone analytics

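Aggregates `fct_trips` per pickup zone (trip counts, averages, total revenue) and uses
`mode_compat()` to find the most common dropoff zone and the peak pickup hour. Results are
ordered by `total_pickups` descending; the model does not compute a revenue rank column.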
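
Judging by its name and the comment in the model, `mode_compat()` wraps the database's "most frequent value" aggregate. The per-zone idea can be sketched in Python with `collections.Counter`. The sample trips are made up, the helper name is hypothetical, and tie-breaking here (first value seen) may differ from what the database does.

```python
from collections import Counter, defaultdict

def most_common_by_zone(trips, column):
    """Return {pickup_zone: most frequent value of `column`} for the given trips."""
    counters = defaultdict(Counter)
    for trip in trips:
        if trip["pickup_zone"] is not None:  # mirrors `where pickup_zone is not null`
            counters[trip["pickup_zone"]][trip[column]] += 1
    return {zone: counts.most_common(1)[0][0] for zone, counts in counters.items()}

# Made-up sample rows
trips = [
    {"pickup_zone": "Midtown Center", "dropoff_zone": "JFK Airport", "pickup_hour": 8},
    {"pickup_zone": "Midtown Center", "dropoff_zone": "Upper East Side South", "pickup_hour": 8},
    {"pickup_zone": "Midtown Center", "dropoff_zone": "JFK Airport", "pickup_hour": 17},
    {"pickup_zone": None, "dropoff_zone": "JFK Airport", "pickup_hour": 3},
]
print(most_common_by_zone(trips, "dropoff_zone"))  # {'Midtown Center': 'JFK Airport'}
print(most_common_by_zone(trips, "pickup_hour"))   # {'Midtown Center': 8}
```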


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/analytics/mart_location_performance.sql
/*
    Analytics mart: Location-level performance summary.
    Uses adapter-dispatched mode_compat() for cross-dialect support.
*/

with trips as (
    select * from {{ ref('fct_trips') }}
),

final as (
    select
        pickup_location_id,
        pickup_borough,
        pickup_zone,

        count(*) as total_pickups,
        round(avg(trip_distance_miles), 2) as avg_trip_distance,
        round(avg(trip_duration_minutes), 2) as avg_trip_duration_min,
        round(sum(total_amount), 2) as total_revenue,
        round(avg(total_amount), 2) as avg_revenue_per_trip,
        round(avg(tip_percentage), 2) as avg_tip_pct,
        round(avg(passenger_count), 2) as avg_passengers,

        -- most common dropoff destination
        {{ mode_compat('dropoff_zone') }} as most_common_dropoff_zone,

        -- busiest hour
        {{ mode_compat('pickup_hour') }} as peak_pickup_hour

    from trips
    where pickup_zone is not null
    group by pickup_location_id, pickup_borough, pickup_zone
)

select * from final
order by total_pickups desc


### 17.4 analytics.yml — Analytics model tests


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/models/marts/analytics/analytics.yml
version: 2

models:
  - name: mart_daily_revenue
    description: "Daily revenue metrics with running totals."
    config:
      contract:
        enforced: true
    columns:
      - name: date_key
        data_type: date
        tests: [unique, not_null]
      - name: day_of_week_name
        data_type: varchar
      - name: is_weekend
        data_type: boolean
      - name: is_holiday
        data_type: boolean
      - name: week_of_year
        data_type: bigint
      - name: total_trips
        data_type: bigint
      - name: total_passengers
        data_type: hugeint
      - name: total_fare_revenue
        data_type: "decimal(38,2)"
      - name: total_tip_revenue
        data_type: "decimal(38,2)"
      - name: total_revenue
        data_type: "decimal(38,2)"
        tests: [not_null]
      - name: avg_trip_revenue
        data_type: double
      - name: avg_tip_percentage
        data_type: double
      - name: credit_card_trips
        data_type: bigint
      - name: cash_trips
        data_type: bigint
      - name: avg_trip_distance
        data_type: double
      - name: avg_trip_duration_min
        data_type: double
      - name: cumulative_revenue
        data_type: "decimal(38,2)"
      - name: revenue_change_vs_prior_day
        data_type: "decimal(38,2)"

  - name: mart_location_performance
    description: "Per-zone performance summary."
    config:
      contract:
        enforced: true
    columns:
      - name: pickup_location_id
        data_type: integer
        tests: [unique, not_null]
      - name: pickup_borough
        data_type: varchar
      - name: pickup_zone
        data_type: varchar
      - name: total_pickups
        data_type: bigint
        tests: [not_null]
      - name: avg_trip_distance
        data_type: double
      - name: avg_trip_duration_min
        data_type: double
      - name: total_revenue
        data_type: "decimal(38,2)"
      - name: avg_revenue_per_trip
        data_type: double
      - name: avg_tip_pct
        data_type: double
      - name: avg_passengers
        data_type: double
      - name: most_common_dropoff_zone
        data_type: varchar
      - name: peak_pickup_hour
        data_type: bigint

  - name: mart_hourly_demand
    description: "Hourly demand patterns."
    config:
      contract:
        enforced: true
    columns:
      - name: pickup_hour
        data_type: bigint
        tests: [not_null]
      - name: is_weekend
        data_type: boolean
        tests: [not_null]
      - name: days_observed
        data_type: bigint
      - name: avg_trips_per_period
        data_type: double
      - name: avg_distance
        data_type: double
      - name: avg_duration_min
        data_type: double
      - name: avg_revenue_per_period
        data_type: double
      - name: total_trips_all_days
        data_type: hugeint


## 18. dbt Tests: Data Quality Assertions

dbt has two test types. Both compile to a SQL query, and a test fails when that query returns one or more rows:

| Type | Location | What it checks | Example |
|------|----------|----------------|---------|
| **Generic tests** (formerly "schema tests") | `.yml` files | Reusable assertion attached to a column or model | `not_null`, `unique`, `accepted_values` |
| **Singular tests** | `tests/*.sql` | One hand-written query for a specific business rule | `assert_fare_not_exceeds_total.sql` |

### Total test count: 91 tests

- Staging schema tests: ~35 (not_null, unique, accepted_values on all columns)
- Intermediate tests: ~20 (relationships between models)
- Core mart tests: ~25 (`unique` on keys, `not_null` on keys and required columns)
- Analytics mart tests: ~8 (aggregated column tests)
- Singular tests: 2 (business rule assertions)
- Seed tests: 1 (payment_type uniqueness)
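
To see the "returns rows means failure" rule in action, the sketch below runs the predicate from `assert_fare_not_exceeds_total.sql` against a few made-up rows. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, and a plain table name replaces the dbt `ref()`; both are assumptions made so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stg_yellow_trips (trip_id text, fare_amount real, total_amount real)"
)
conn.executemany(
    "insert into stg_yellow_trips values (?, ?, ?)",
    [
        ("ok-1", 10.0, 14.5),   # fare below total: fine
        ("ok-2", 12.0, 12.0),   # fare equals total: fine
        ("bad-1", 20.0, 15.0),  # fare exceeds total: violation
        ("skip-1", 5.0, 0.0),   # total_amount is 0: excluded by the second predicate
    ],
)

# Same predicate as the singular test, run against the stand-in table
failures = conn.execute(
    """
    select trip_id, fare_amount, total_amount
    from stg_yellow_trips
    where fare_amount > total_amount + 0.01
      and total_amount > 0
    """
).fetchall()

test_passed = len(failures) == 0
print(failures)     # [('bad-1', 20.0, 15.0)]
print(test_passed)  # False
```

The `+ 0.01` tolerance absorbs rounding differences, and `total_amount > 0` keeps zero-total records from being reported as violations.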

### 18.1 assert_fare_not_exceeds_total.sql


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/tests/assert_fare_not_exceeds_total.sql
/*
    Singular test: fare_amount should not exceed total_amount.
*/

select
    trip_id,
    fare_amount,
    total_amount
from {{ ref('stg_yellow_trips') }}
where fare_amount > total_amount + 0.01
  and total_amount > 0


### 18.2 assert_trip_duration_positive.sql


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/dbt_project/tests/assert_trip_duration_positive.sql
/*
    Singular test: No trip should have negative duration.
*/

select
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    trip_duration_minutes
from {{ ref('int_trip_metrics') }}
where trip_duration_minutes < 0


## 19. Pipeline Makefile: One-Command Orchestration

The Makefile provides named targets for every pipeline operation.

### Target Map

```
make up                → docker compose up -d (start 6 always-on containers)
make create-topics     → rpk topic create taxi.raw_trips + taxi.raw_trips.dlq
make generate          → start data generator (unlimited events, burst mode)
make generate-limited  → 10,000 events then stop (benchmark mode)
make process-bronze    → Flink SQL batch: Redpanda → Bronze Iceberg
make process-silver    → Flink SQL batch: Bronze → Silver Iceberg (dedup + filter)
make process           → process-bronze + process-silver
make process-streaming → Flink SQL streaming: Redpanda → Bronze (continuous, indefinite)
make dbt-build         → dbt deps && dbt build --full-refresh (all models + 91 tests)
make benchmark         → full E2E: down→up→topics→generate→process→sleep5→dbt→down
make status            → show running containers + Redpanda topic list + Flink job list
make health            → check Flink UI, MinIO health, Redpanda cluster health
make down              → docker compose down -v (stop + remove volumes)
```

### Key Makefile Patterns

```makefile
# Windows Git Bash compatibility: MSYS_NO_PATHCONV=1 prevents MSYS2 from
# converting /opt/flink to C:/Program Files/...
MSYS_NO_PATHCONV=1 $(COMPOSE) exec flink-jobmanager /opt/flink/bin/sql-client.sh embedded ...

# Flink SQL with init + execute:
#   -i 00-init.sql    builds session state (tables, catalog)
#   -f 05-bronze.sql  is the file to execute
$(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init.sql -f /opt/flink/sql/05-bronze.sql

# Benchmark with race condition fix: the sleep waits for Iceberg metadata to
# fully flush, so dbt-build can see the committed Iceberg snapshots
$(MAKE) process && \
    sleep 5 && \
    $(MAKE) dbt-build
```


In [None]:
%%writefile ../pipelines/04-redpanda-flink-iceberg/Makefile
SHELL := bash
# =============================================================================
# Pipeline 04: Redpanda + Flink + Iceberg
# =============================================================================
# Makefile for orchestrating the complete streaming pipeline lifecycle.
# Fork of Pipeline 01 with Redpanda replacing Kafka + Schema Registry.
# =============================================================================

COMPOSE = docker compose
FLINK_SQL_CLIENT = MSYS_NO_PATHCONV=1 $(COMPOSE) exec -T flink-jobmanager /opt/flink/bin/sql-client.sh embedded

.PHONY: help up down generate generate-limited create-topics process process-bronze \
        process-silver process-streaming dbt-build dbt-test dbt-docs benchmark \
        logs logs-redpanda logs-flink logs-flink-tm status clean ps restart \
        check-lag health compact-silver expire-snapshots vacuum maintain

help: ## Show this help
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | \
		awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

# =============================================================================
# Lifecycle
# =============================================================================

up: ## Start all infrastructure services
	$(COMPOSE) up -d
	@echo ""
	@echo "=== Pipeline 04: Redpanda + Flink + Iceberg ==="
	@echo "Redpanda Kafka API: localhost:19092"
	@echo "Schema Registry:    http://localhost:18081"
	@echo "Pandaproxy:         http://localhost:18082"
	@echo "Redpanda Admin:     http://localhost:9644"
	@echo "Flink Dashboard:    http://localhost:8081"
	@echo "MinIO Console:      http://localhost:9001  (minioadmin/minioadmin)"
	@echo ""
	@echo "Next steps:"
	@echo "  make create-topics   # Create Redpanda topics"
	@echo "  make generate        # Produce taxi events to Redpanda"
	@echo "  make process         # Submit Flink SQL jobs"
	@echo "  make dbt-build       # Run dbt transformations"

down: ## Stop all services and remove volumes
	$(COMPOSE) --profile generator --profile dbt down -v
	@echo "Pipeline 04 stopped and volumes removed."

clean: ## Stop everything and prune all related resources
	$(COMPOSE) --profile generator --profile dbt down -v --remove-orphans
	docker network rm p04-pipeline-net 2>/dev/null || true
	@echo "Pipeline 04 fully cleaned."

restart: ## Restart all services
	$(MAKE) down
	$(MAKE) up

# =============================================================================
# Topic Management
# =============================================================================

create-topics: ## Create Redpanda topics (primary + Dead Letter Queue)
	$(COMPOSE) exec redpanda rpk topic create taxi.raw_trips \
		--brokers localhost:9092 \
		--partitions 3 \
		--replicas 1 \
		--topic-config retention.ms=259200000 \
		--topic-config cleanup.policy=delete || true
	$(COMPOSE) exec redpanda rpk topic create taxi.raw_trips.dlq \
		--brokers localhost:9092 \
		--partitions 1 \
		--replicas 1 \
		--topic-config retention.ms=604800000 \
		--topic-config cleanup.policy=delete || true
	@$(COMPOSE) exec redpanda rpk topic list --brokers localhost:9092
	@echo "Topics created: taxi.raw_trips (3 partitions) + taxi.raw_trips.dlq (DLQ, 7-day retention)."

# =============================================================================
# Data Generation
# =============================================================================

generate: ## Produce taxi trip events to Redpanda (burst mode)
	MSYS_NO_PATHCONV=1 $(COMPOSE) run --rm data-generator
	@echo "Data generation complete."

generate-limited: ## Produce limited events for testing (10k)
	MSYS_NO_PATHCONV=1 $(COMPOSE) run --rm -e MAX_EVENTS=10000 data-generator
	@echo "Limited data generation complete (10k events)."

# =============================================================================
# Flink SQL Processing
# =============================================================================

process: process-bronze process-silver ## Submit all Flink SQL jobs (Bronze + Silver)
	@echo "All Flink SQL jobs complete."

process-bronze: ## Submit Bronze layer Flink SQL jobs (batch mode)
	@echo "=== Bronze: Redpanda → Iceberg ==="
	$(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init.sql -f /opt/flink/sql/05-bronze.sql
	@echo "Bronze layer complete."

process-silver: ## Submit Silver layer Flink SQL jobs (batch mode)
	@echo "=== Silver: Bronze → Cleaned Iceberg ==="
	$(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init.sql -f /opt/flink/sql/06-silver.sql
	@echo "Silver layer complete."

process-streaming: ## Start continuous streaming Bronze job (Redpanda → Iceberg, runs indefinitely)
	@echo "=== Streaming Bronze: Redpanda → Iceberg (continuous) ==="
	@echo "NOTE: This job runs indefinitely. Cancel with Ctrl+C or kill the Flink job."
	$(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init-streaming.sql -f /opt/flink/sql/07-streaming-bronze.sql

# =============================================================================
# dbt Transformations
# =============================================================================

dbt-build: ## Run dbt build (full-refresh) on Iceberg Silver data
	MSYS_NO_PATHCONV=1 $(COMPOSE) run --rm --entrypoint /bin/sh dbt -c "dbt deps --profiles-dir . && dbt build --full-refresh --profiles-dir ."
	@echo "dbt build complete."

dbt-test: ## Run dbt tests only
	MSYS_NO_PATHCONV=1 $(COMPOSE) run --rm dbt test --profiles-dir .

dbt-docs: ## Generate dbt documentation
	MSYS_NO_PATHCONV=1 $(COMPOSE) run --rm dbt docs generate --profiles-dir .

# =============================================================================
# Benchmark (Full E2E)
# =============================================================================

benchmark: ## Full end-to-end benchmark: down → up → topics → generate → process → dbt → down
	@echo "============================================================"
	@echo "  Pipeline 04 Benchmark: Redpanda + Flink + Iceberg"
	@echo "============================================================"
	@START_TIME=$$(date +%s) && \
	$(MAKE) down 2>/dev/null || true && \
	$(MAKE) up && \
	echo "Waiting for services to stabilize..." && \
	sleep 15 && \
	$(MAKE) create-topics && \
	$(MAKE) generate-limited && \
	echo "Waiting for Flink processing to catch up..." && \
	sleep 10 && \
	$(MAKE) process && \
	echo "Waiting for Iceberg metadata commits to finalize..." && \
	sleep 5 && \
	$(MAKE) dbt-build && \
	END_TIME=$$(date +%s) && \
	ELAPSED=$$((END_TIME - START_TIME)) && \
	echo "" && \
	echo "============================================================" && \
	echo "  BENCHMARK COMPLETE" && \
	echo "  Total elapsed: $${ELAPSED}s" && \
	echo "============================================================" && \
	mkdir -p benchmark_results && \
	echo "{\"pipeline\": \"04-redpanda-flink-iceberg\", \"elapsed_seconds\": $$ELAPSED, \"timestamp\": \"$$(date -Iseconds)\"}" > benchmark_results/latest.json && \
	echo "Results saved to benchmark_results/latest.json" && \
	$(MAKE) down

# =============================================================================
# Observability
# =============================================================================

logs: ## Tail logs from all services
	$(COMPOSE) logs -f --tail=100

logs-redpanda: ## Tail Redpanda logs
	$(COMPOSE) logs -f redpanda

logs-flink: ## Tail Flink JobManager logs
	$(COMPOSE) logs -f flink-jobmanager

logs-flink-tm: ## Tail Flink TaskManager logs
	$(COMPOSE) logs -f flink-taskmanager

status: ## Show service status
	@echo "=== Pipeline 04: Service Status ==="
	$(COMPOSE) ps
	@echo ""
	@echo "=== Redpanda Topics ==="
	$(COMPOSE) exec redpanda rpk topic list --brokers localhost:9092 2>/dev/null || echo "(Redpanda not running)"
	@echo ""
	@echo "=== Flink Jobs ==="
	@curl -s http://localhost:8081/jobs/overview 2>/dev/null | python3 -m json.tool 2>/dev/null || echo "(Flink not running)"

ps: ## Show running containers
	$(COMPOSE) ps

# =============================================================================
# Health & Diagnostics
# =============================================================================

check-lag: ## Show Redpanda consumer group lag and DLQ status
	$(COMPOSE) exec redpanda rpk group describe flink-consumer --brokers localhost:9092
	@echo ""
	@$(COMPOSE) exec -T redpanda rpk topic consume taxi.raw_trips.dlq \
	    --brokers localhost:9092 --num 1 2>/dev/null \
	    && echo "DLQ has messages - investigate!" || echo "DLQ: empty (OK)"

health: ## Quick health check of all services
	@echo "=== Pipeline 04: Health Check ==="
	@echo -n "Redpanda:         " && $(COMPOSE) exec -T redpanda \
	    rpk cluster health --brokers localhost:9092 2>/dev/null | grep -q "Healthy" \
	    && echo "OK" || echo "FAIL"
	@echo -n "Flink Dashboard:  " && curl -sf http://localhost:8081/overview > /dev/null 2>&1 \
	    && echo "OK" || echo "FAIL"
	@echo -n "MinIO:            " && curl -sf http://localhost:9000/minio/health/live > /dev/null 2>&1 \
	    && echo "OK" || echo "FAIL"
	@echo -n "Redpanda Console: " && curl -sf http://localhost:8085 > /dev/null 2>&1 \
	    && echo "OK" || echo "FAIL"

# =============================================================================
# Iceberg Maintenance (run periodically in production)
# =============================================================================

# NOTE: make runs each recipe line in its own shell, so a heredoc cannot span
# recipe lines. The SQL is piped to the SQL client through printf instead.

compact-silver: ## Compact Silver Iceberg files to target 128MB file size
	@echo "=== Compacting Silver Iceberg table ==="
	printf '%s\n' \
	    "CALL iceberg_catalog.system.rewrite_data_files(" \
	    "    table => 'silver.cleaned_trips'," \
	    "    options => map['target-file-size-bytes', '134217728'," \
	    "                   'min-file-size-threshold', '33554432']" \
	    ");" \
	    | $(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init.sql
	@echo "Silver compaction complete."

expire-snapshots: ## Expire Iceberg snapshots older than 7 days
	@echo "=== Expiring old Iceberg snapshots ==="
	printf '%s\n' \
	    "CALL iceberg_catalog.system.expire_snapshots(" \
	    "    table => 'silver.cleaned_trips'," \
	    "    older_than => TIMESTAMPADD(DAY, -7, NOW())" \
	    ");" \
	    "CALL iceberg_catalog.system.expire_snapshots(" \
	    "    table => 'bronze.raw_trips'," \
	    "    older_than => TIMESTAMPADD(DAY, -7, NOW())" \
	    ");" \
	    | $(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init.sql
	@echo "Snapshot expiry complete."

vacuum: ## Remove orphan files from Iceberg warehouse (run weekly)
	@echo "=== Removing orphan Iceberg files ==="
	printf '%s\n' \
	    "CALL iceberg_catalog.system.remove_orphan_files(" \
	    "    table => 'silver.cleaned_trips'," \
	    "    older_than => TIMESTAMPADD(DAY, -7, NOW())" \
	    ");" \
	    | $(FLINK_SQL_CLIENT) -i /opt/flink/sql/00-init.sql
	@echo "Vacuum complete."

maintain: compact-silver expire-snapshots ## Run all routine Iceberg maintenance


## 20. Running the Pipeline

### Prerequisites

- Docker Desktop running (Linux containers), 8+ GB RAM allocated
- Docker Compose V2 (`docker compose`, not `docker-compose`)
- `make` installed (Windows: `scoop install make`)
- Free ports: `19092` (Redpanda Kafka API), `18081` (Schema Registry), `18082` (Pandaproxy), `9644` (Redpanda Admin), `8081` (Flink UI), `8085` (Redpanda Console), `9000-9001` (MinIO)
- Data: `data/yellow_tripdata_2024-01.parquet` in project root
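
A quick way to confirm the ports are free before `make up` is to try connecting to each one. The sketch below is an optional preflight check, not part of the pipeline: `port_in_use` and `busy_ports` are hypothetical helpers, and the port map is a subset of the host ports from the prerequisites list above.

```python
import socket

def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

# Subset of the host ports listed in the prerequisites
REQUIRED_PORTS = {
    19092: "Redpanda",
    8081: "Flink UI",
    8085: "Redpanda Console",
    9000: "MinIO",
    9001: "MinIO Console",
}

def busy_ports(ports=REQUIRED_PORTS):
    """Return the subset of required ports that are already taken."""
    return {port: name for port, name in ports.items() if port_in_use(port)}

if __name__ == "__main__":
    for port, name in busy_ports().items():
        print(f"Port {port} ({name}) is already in use")
```

If the script prints nothing, none of the checked ports has a listener and `make up` should be able to bind them.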

### Quick Start (Batch Benchmark)

```bash
cd pipelines/04-redpanda-flink-iceberg

# One command — full E2E benchmark with timing
make benchmark
# Expected output: Total elapsed: ~88s (includes 5s metadata flush sleep)
# Results written to: benchmark_results/latest.json
```
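
The benchmark target writes a one-line JSON record with `pipeline`, `elapsed_seconds`, and `timestamp` keys. If you want to compare runs programmatically, a parser along these lines may help. The sample record is hypothetical (only its shape follows the Makefile), and `parse_benchmark_result` is an illustrative helper name.

```python
import json
from datetime import datetime

def parse_benchmark_result(text):
    """Parse one benchmark record and return (pipeline, elapsed_seconds, timestamp)."""
    record = json.loads(text)
    elapsed = int(record["elapsed_seconds"])
    # `date -Iseconds` emits ISO 8601 with a UTC offset, e.g. 2026-02-01T12:00:00+00:00
    timestamp = datetime.fromisoformat(record["timestamp"])
    return record["pipeline"], elapsed, timestamp

# Hypothetical file contents in the shape the benchmark target writes
sample = (
    '{"pipeline": "04-redpanda-flink-iceberg", '
    '"elapsed_seconds": 88, "timestamp": "2026-02-01T12:00:00+00:00"}'
)
pipeline, elapsed, ts = parse_benchmark_result(sample)
print(pipeline, elapsed, ts.isoformat())
```

To read a real run, pass the contents of `benchmark_results/latest.json` instead of `sample`.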

### Step-by-Step with Validation

```bash
# 1. Start all services
make up

# 2. Verify everything is healthy (allow ~20s for all healthchecks to pass)
make status
make health

# 3. Create topics (primary + DLQ)
make create-topics
# Expected:
#   NAME               PARTITIONS  REPLICAS
#   taxi.raw_trips     3           1
#   taxi.raw_trips.dlq 1           1

# 4. Generate 10,000 taxi events
make generate-limited
# Expected: ~3s, ~25,000 events/sec

# 5. Process: Redpanda → Bronze → Silver
make process
# Expected:
#   Bronze: 10,000 rows written to s3://warehouse/bronze/raw_trips/
#   Silver:  9,855 rows written to s3://warehouse/silver/cleaned_trips/

# 6. Run dbt: Silver → Gold (91 tests)
make dbt-build
# Expected: 14 models created, 91 tests PASS

# 7. Verify in Flink SQL (interactive)
docker exec -it p04-flink-jobmanager \
    /opt/flink/bin/sql-client.sh embedded \
    -i /opt/flink/sql/00-init.sql
# Then run:
# SELECT COUNT(*) FROM iceberg_catalog.bronze.raw_trips;   -- should be 10000
# SELECT COUNT(*) FROM iceberg_catalog.silver.cleaned_trips; -- should be ~9855
```

### Streaming Mode (Continuous Ingestion)

```bash
# Terminal 1: Start continuous Bronze ingestion
make process-streaming
# This job runs until you cancel it (Ctrl+C)

# Terminal 2: Generate events (will be picked up continuously)
make generate-limited

# Terminal 3: Watch Bronze grow
watch -n5 'docker exec p04-flink-jobmanager \
    bash -c "echo '"'"'SELECT COUNT(*) FROM iceberg_catalog.bronze.raw_trips;'"'"' | \
    MSYS_NO_PATHCONV=1 /opt/flink/bin/sql-client.sh embedded -i /opt/flink/sql/00-init.sql"'
```

### Web UIs (while services are running)

| Service | URL | Login | What to look at |
|---------|-----|-------|-----------------|
| **Flink Dashboard** | http://localhost:8081 | none | Running jobs, task slots, checkpoints |
| **Redpanda Console** | http://localhost:8085 | none | Topics, messages, consumer lag |
| **MinIO Console** | http://localhost:9001 | minioadmin/minioadmin | Iceberg warehouse bucket, Parquet files |

### Expected Benchmark Timing (10k events)

| Phase | Duration | Notes |
|-------|----------|-------|
| `make up` + healthchecks | ~20s | Redpanda starts in ~3s; Flink needs ~15s |
| `make create-topics` | <1s | rpk is a native binary |
| `make generate-limited` | ~3s | ~25k events/sec to Redpanda |
| `sleep 10` (stabilize) | 10s | Wait for consumer lag to settle |
| `make process-bronze` | ~22s | Flink batch: Redpanda → Iceberg |
| `make process-silver` | ~21s | Flink batch: Bronze → Silver |
| `sleep 5` (metadata flush) | 5s | Wait for Iceberg metadata to commit |
| `make dbt-build` | ~15s | dbt: 14 models + 91 tests |
| **Total E2E** | **~88s** | |


## 21. Production Operations + Troubleshooting

### Production Hardening Checklist

This pipeline implements **defense-in-depth**: multiple independent layers prevent data quality issues.

| Layer | Mechanism | What it prevents |
|-------|-----------|-----------------|
| **Producer** | `enable.idempotence=True`, `acks=all` | Duplicate events from producer retries |
| **DLQ** | `taxi.raw_trips.dlq` topic | Malformed events blocking the primary stream |
| **Bronze** | Event-time watermark (10s) | Out-of-order events in streaming window functions |
| **Silver** | `ROW_NUMBER()` deduplication | Duplicate rows surviving to analytics layer |
| **Silver** | Quality filters (fare≥0, date range) | Invalid records reaching Gold models |
| **dbt** | 91 schema + singular tests | Regressions detected before data reaches BI tools |
| **Containers** | CPU limits + restart policies | Cascading failures from resource exhaustion |
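
The DLQ row can be illustrated with a small routing sketch. This is a hypothetical helper, not the pipeline's actual producer or consumer code: `route_event` and the `REQUIRED_FIELDS` set are assumptions made for illustration, and only the two topic names come from this pipeline.

```python
import json

PRIMARY_TOPIC = "taxi.raw_trips"
DLQ_TOPIC = "taxi.raw_trips.dlq"

# Hypothetical required fields; substitute the fields your schema guarantees.
REQUIRED_FIELDS = {"pickup_datetime", "fare_amount"}


def route_event(raw: bytes) -> tuple[str, bytes]:
    """Return (topic, payload): valid events go to the primary topic, the rest to the DLQ."""
    try:
        event = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError):
        return DLQ_TOPIC, raw  # payload is not parseable JSON
    if not isinstance(event, dict) or not REQUIRED_FIELDS.issubset(event):
        return DLQ_TOPIC, raw  # parseable, but not an object with the required fields
    return PRIMARY_TOPIC, raw
```

The payload is forwarded unchanged in both cases, so a DLQ consumer can inspect exactly what the producer received.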

### Troubleshooting Guide

| Symptom | Root Cause | Diagnosis | Fix |
|---------|-----------|-----------|-----|
| `Object 'iceberg_catalog' not found` | 00-init.sql not used as init | Check `-i` flag in command | Ensure `make process-bronze` uses `-i 00-init.sql` |
| `Silver has 0 rows` | Bronze job failed silently | `SELECT COUNT(*) FROM bronze.raw_trips` | Check Bronze count first; rerun `make process-bronze` |
| `LEADER_NOT_AVAILABLE` | Redpanda not fully ready | `docker logs p04-redpanda` | Wait 5-10s more; check healthcheck status |
| `s3://warehouse: No such file` | MinIO bucket not created | `docker logs p04-mc-init` | Check mc-init completed successfully |
| `classloader leak` warning | Missing config setting | `grep classloader flink/conf/config.yaml` | Add `classloader.check-leaked-classloader: false` |
| `Port 19092 already in use` | Another pipeline running | `docker ps \| grep 19092` | `cd ../01-kafka-flink-iceberg && make down` |
| `dbt: No files found in COPY FROM '/old/path'` | Stale partial_parse.msgpack | `ls dbt_project/target/` | `rm dbt_project/target/partial_parse.msgpack` |
| `dbt: source not found` | sources.yml points to wrong path | Check `external_location` in sources.yml | Must be `silver/cleaned_trips`, not `bronze/raw_trips` |
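
The port conflict row can be checked before `make up` with a small preflight script. This is a minimal sketch using only the standard library; the function names are hypothetical, and the port list is taken from the tables in this notebook (19092, 8081, 8085, 9001).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports: list[int]) -> list[int]:
    """Return the subset of ports that are already taken."""
    return [port for port in ports if port_in_use(port)]
```

Calling `busy_ports([19092, 8081, 8085, 9001])` before starting the stack returns the ports that another pipeline is probably still holding.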

### Iceberg Table Inspection

```bash
# Check what Iceberg tables exist (via MinIO)
docker exec p04-minio mc ls myminio/warehouse/ --recursive --summarize

# Query Iceberg table stats via Flink SQL
docker exec -it p04-flink-jobmanager bash -c "
  echo 'SELECT COUNT(*), MIN(pickup_date), MAX(pickup_date) FROM iceberg_catalog.silver.cleaned_trips;' | \
  /opt/flink/bin/sql-client.sh embedded -i /opt/flink/sql/00-init.sql
"

# Inspect Iceberg snapshots (time travel metadata)
docker exec -it p04-flink-jobmanager bash -c "
  echo 'SELECT snapshot_id, committed_at, operation, summary
        FROM iceberg_catalog.silver.\`cleaned_trips\$snapshots\`;' | \
  /opt/flink/bin/sql-client.sh embedded -i /opt/flink/sql/00-init.sql
"
```

### Migrating to Managed Kafka/Redpanda

The only file to change is `00-init.sql` (and the matching line in `00-init-streaming.sql` if you use streaming mode):

```sql
-- Change ONLY this one line (and add auth if needed):
'properties.bootstrap.servers' = 'broker1.company.com:9092,broker2.company.com:9092',

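-- For Confluent Cloud (SASL/PLAIN). Amazon MSK uses SCRAM-SHA-512 or IAM instead of PLAIN:
'properties.security.protocol' = 'SASL_SSL',
'properties.sasl.mechanism' = 'PLAIN',
-- Keep the JAAS string on one line so no newline ends up inside the option value:
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="API_KEY" password="API_SECRET";'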
```

Everything else — dbt models, Makefile targets, Iceberg tables — is **100% portable**.
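
If you maintain several environments, the connector options can be generated rather than edited by hand. The sketch below is an illustration only: `kafka_connector_options` is a hypothetical helper, and it covers just the SASL/PLAIN case shown in the SQL fragment.

```python
from typing import Optional


def kafka_connector_options(
    bootstrap_servers: str,
    api_key: Optional[str] = None,
    api_secret: Optional[str] = None,
) -> str:
    """Render the broker-related lines of a Flink Kafka connector WITH clause."""
    options = {"properties.bootstrap.servers": bootstrap_servers}
    if api_key is not None and api_secret is not None:
        jaas = (
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            f'username="{api_key}" password="{api_secret}";'
        )
        options["properties.security.protocol"] = "SASL_SSL"
        options["properties.sasl.mechanism"] = "PLAIN"
        options["properties.sasl.jaas.config"] = jaas
    # A single quote inside a Flink SQL string literal is escaped by doubling it.
    return ",\n".join(
        "'{}' = '{}'".format(key, value.replace("'", "''"))
        for key, value in options.items()
    )
```

`kafka_connector_options("redpanda:9092")` yields the single local line, while passing a key and secret adds the three SASL options. Rendered credentials should not be committed to version control.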

### Scheduling (Adding an Orchestrator)

P04 is Makefile-based. To add scheduling (nightly runs, dependency management):
- **P07 Kestra:** Lightest (+1 container, YAML-based, +5s overhead)
- **P09 Dagster:** Asset-centric, lineage tracking (+750 MB)
- **P08 Airflow:** Battle-tested, Astronomer support (+1.5 GB, +20s overhead)

The Makefile targets become shell operators in the orchestrator DAG.
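
Before adopting a full orchestrator, the "targets become operators" idea can be sketched as a sequential runner that stops at the first failing target. The target names come from the benchmark table above; `run_make` and `run_pipeline` are hypothetical names, and a real orchestrator would add retries, logging, and scheduling on top.

```python
import subprocess
from typing import Callable, Sequence

# Makefile targets in the order the benchmark runs them.
PIPELINE_TARGETS = [
    "create-topics",
    "generate-limited",
    "process-bronze",
    "process-silver",
    "dbt-build",
]


def run_make(target: str) -> int:
    """Run one Makefile target and return its exit code."""
    return subprocess.run(["make", target]).returncode


def run_pipeline(
    targets: Sequence[str], run: Callable[[str], int] = run_make
) -> list[str]:
    """Run targets in order and stop at the first failure; return those that succeeded."""
    completed = []
    for target in targets:
        if run(target) != 0:
            break
        completed.append(target)
    return completed
```

Injecting `run` keeps the ordering logic testable without Docker: pass a stub that returns exit codes instead of invoking `make`.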


## 22. Adapting to Your Own Dataset

This section is the learning guide: how to take P04's pattern and apply it to any dataset.
The NYC Taxi pipeline is the **template** — every decision here generalizes.

---

### Step 1: Define Your Event Schema

Start with the JSON events that will flow through Redpanda. Ask:
- What does one event represent? (one taxi trip → one IoT reading, one order, one click)
- What fields are always present? (required → filter nulls in Silver)
- What's the natural key? (for deduplication in Silver)
- What timestamp represents the event? (for watermarks in streaming mode)

**Template for your `00-init.sql` source table:**

```sql
CREATE TABLE IF NOT EXISTS kafka_your_events (
    -- Copy your JSON fields here with Flink SQL types:
    -- JSON string fields → STRING
    -- JSON integer fields → BIGINT (safest; cast to INT in Silver)
    -- JSON float/decimal → DOUBLE (cast to DECIMAL in Silver)
    -- JSON booleans → BOOLEAN
    -- Timestamps (ISO 8601 strings) → STRING (parse in Bronze INSERT)
    field_one           STRING,
    numeric_field       BIGINT,
    amount_field        DOUBLE,
    event_timestamp_str STRING,   -- raw string from JSON

    -- Add watermark for streaming mode:
    event_time AS TO_TIMESTAMP(event_timestamp_str, 'yyyy-MM-dd''T''HH:mm:ss'),
    WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'your.topic.name',
    'properties.bootstrap.servers' = 'redpanda:9092',
    -- For batch mode:
    'scan.startup.mode' = 'earliest-offset',
    'scan.bounded.mode' = 'latest-offset',
    -- For streaming mode (in 00-init-streaming.sql): omit scan.bounded.mode
    'format' = 'json'
);
```
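
On the producer side, the timestamp string has to match the pattern used in `TO_TIMESTAMP`. The sketch below builds one JSON event for the template schema; `make_event` is a hypothetical helper, and the field names mirror the placeholder columns in the template rather than any real dataset.

```python
import json
from datetime import datetime

# Python equivalent of the Flink pattern yyyy-MM-dd'T'HH:mm:ss
FLINK_COMPATIBLE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def make_event(
    field_one: str, numeric_field: int, amount_field: float, event_time: datetime
) -> bytes:
    """Serialize one event whose keys match the source table's column names."""
    event = {
        "field_one": field_one,
        "numeric_field": numeric_field,
        "amount_field": amount_field,
        "event_timestamp_str": event_time.strftime(FLINK_COMPATIBLE_FORMAT),
    }
    return json.dumps(event).encode("utf-8")
```

A timestamp with fractional seconds or a trailing `Z` does not match that pattern, so either normalize it in the producer as shown or adjust the format string in the table definition.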

---

### Step 2: Design Bronze (Raw Landing)

Bronze SQL changes are minimal — just match your source schema:

```sql
-- 05-bronze.sql template
INSERT INTO iceberg_catalog.bronze.your_events
SELECT
    field_one,
    numeric_field,
    CAST(amount_field AS DOUBLE),
    -- Parse your event timestamp:
    TO_TIMESTAMP(event_timestamp_str, 'yyyy-MM-dd''T''HH:mm:ss') AS event_time,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM kafka_your_events;
```

**What NOT to do in Bronze:**
- ❌ Filter rows (except for parse errors)
- ❌ Rename columns to snake_case (do this in Silver or dbt)
- ❌ Compute derived metrics (duration, ratio, etc.)
- ❌ Join with other tables

---

### Step 3: Design Silver (Dedup + Clean)

This is where you need to think carefully. Key decisions:

**A) What is your natural key?** (for deduplication)
```sql
-- Taxi: pickup_time + dropoff_time + vendor + location + fare + total
-- IoT: device_id + event_timestamp (usually unique already)
-- E-commerce orders: order_id (if truly unique, skip dedup)
-- Clickstream: session_id + event_type + timestamp (approximate dedup window)

ROW_NUMBER() OVER (
    PARTITION BY your_field1, your_field2  -- natural key columns (list all of them)
    ORDER BY ingestion_ts DESC
) AS rn
```
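
To reason about what the `ROW_NUMBER()` pattern keeps, it can help to model it in plain Python. This is a sketch of the semantics (one row per natural key, the one with the greatest `ingestion_ts`), not of how Flink executes it; `dedupe_latest` is a hypothetical name.

```python
from typing import Iterable


def dedupe_latest(rows: Iterable[dict], key_fields: list[str]) -> list[dict]:
    """Keep one row per natural key: the row with the greatest ingestion_ts."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        kept = latest.get(key)
        # Equivalent to ORDER BY ingestion_ts DESC and then keeping rn = 1.
        if kept is None or row["ingestion_ts"] > kept["ingestion_ts"]:
            latest[key] = row
    return list(latest.values())
```

Running a sample of Bronze rows through this function is a quick way to test whether a candidate natural key is too narrow (distinct events collapse) or too wide (duplicates survive).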

**B) What quality filters make sense?**
```sql
-- Taxi: fare_amount >= 0, trip_distance >= 0, location_id BETWEEN 1 AND 265
-- IoT: sensor_value BETWEEN min_valid AND max_valid
-- Orders: amount > 0, customer_id IS NOT NULL
-- Always: event_timestamp IS NOT NULL
```

**C) What columns to cast?**
```sql
-- Cast BIGINT → INT for dimension keys (saves storage, needed for JOIN types)
-- Cast DOUBLE → DECIMAL(10,2) for money (exact precision)
-- Always add a surrogate key (MD5 or UUID of natural key fields)
CAST(MD5(CONCAT_WS('|', CAST(field1 AS STRING), ...)) AS STRING) AS event_id
```
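
The surrogate key idea (hash the delimiter-joined natural key fields) looks like this in Python. It is a sketch of the technique only: `surrogate_key` is a hypothetical helper, and its output is not guaranteed to equal the Flink expression's hash, because `str()` may render numbers differently from `CAST(... AS STRING)` and NULL handling differs.

```python
import hashlib


def surrogate_key(*fields) -> str:
    """MD5 hex digest of the pipe-joined fields; None becomes an empty string."""
    # Keeping an empty slot for None preserves field positions,
    # so ("a", None, "b") and ("a", "b") hash differently.
    parts = ["" if field is None else str(field) for field in fields]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()
```

The delimiter matters: without it, `("ab", "c")` and `("a", "bc")` would produce the same input string and therefore the same key.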

**D) How to partition Silver?**
```sql
-- Time-series data: PARTITIONED BY (event_date), where event_date is a DATE column
-- Geographic data: PARTITIONED BY (region_code) or (event_date, region_code)
-- High-cardinality: avoid partitioning by user_id (too many small partitions)
```

---

### Step 4: Design dbt Staging

With the Silver passthrough pattern, staging is simple:

```sql
-- models/staging/stg_your_events.sql
with source as (
    select * from {{ source('your_source', 'your_events') }}
),
final as (
    select
        event_id,                              -- surrogate key from Silver
        field_one,                             -- already clean from Silver
        numeric_field,
        cast(amount_field as decimal(10, 2)) as amount_field,  -- minor DuckDB type compat
        cast(event_time as timestamp) as event_time            -- TIMESTAMP(3) → TIMESTAMP
    from source
    where event_time is not null               -- safety net (Silver already filtered)
)
select * from final
```

**sources.yml:**
```yaml
sources:
  - name: your_source
    tables:
      - name: your_events
        config:
          external_location: >-
            iceberg_scan('s3://warehouse/silver/your_events', allow_moved_paths=true)
```

---

### Step 5: Design dbt Intermediate

Here's where your domain knowledge goes. For each dataset, ask:

1. **What are the computed metrics?** (duration, speed, ratio, percentage)
2. **What granularities do analysts need?** (per-event, daily, hourly, per-entity)
3. **What joins are needed?** (fact + dimensions, event + user profile, etc.)

```sql
-- models/intermediate/int_your_event_metrics.sql
with events as (
    select * from {{ ref('stg_your_events') }}
),
enriched as (
    select
        event_id,
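        -- Computed metrics (NOT in Flink; these belong here).
        -- duration_seconds and threshold are placeholders: substitute your own column and value.
        amount_field / nullif(duration_seconds, 0) as rate_per_second,
        case when numeric_field > threshold then 'high' else 'low' end as category,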
        -- Calendar attributes:
        extract(hour from event_time) as event_hour,
        dayofweek(event_time) = 0 as is_sunday
    from events
)
select * from enriched
```
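
When unit-testing these metrics outside the warehouse, two details are easy to get wrong: DuckDB's `dayofweek()` numbers Sunday as 0, whereas Python's `weekday()` numbers Monday as 0, and a zero duration needs explicit handling. A minimal sketch, with hypothetical helper names:

```python
from datetime import datetime
from typing import Optional


def duckdb_dayofweek(ts: datetime) -> int:
    """Match DuckDB's dayofweek(): Sunday = 0 through Saturday = 6."""
    # isoweekday() is Monday = 1 through Sunday = 7, so modulo 7 maps Sunday to 0.
    return ts.isoweekday() % 7


def rate_per_second(amount: float, duration_seconds: float) -> Optional[float]:
    """Return None for a zero or missing duration instead of raising."""
    if not duration_seconds:
        return None
    return amount / duration_seconds
```

Reference implementations like these are useful as expected values in tests of the SQL model.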

---

### Step 6: Design dbt Marts

Marts answer specific business questions. For each question, create one mart:

| Question | Mart name | Grain |
|----------|-----------|-------|
| "How much revenue per day?" | `mart_daily_revenue` | 1 row per date |
| "Which customers are most active?" | `mart_customer_activity` | 1 row per customer |
| "What's the hourly throughput?" | `mart_hourly_throughput` | 1 row per date × hour |
| "Where are errors concentrated?" | `mart_error_locations` | 1 row per region |

---

### Step 7: Write dbt Tests

For every column that downstream systems depend on, write a test:

```yaml
# staging.yml
models:
  - name: stg_your_events
    columns:
      - name: event_id
        tests: [not_null, unique]
      - name: amount_field
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "amount_field >= 0"
```

**Custom singular tests:**
```sql
-- tests/assert_event_time_not_in_future.sql
select *
from {{ ref('stg_your_events') }}
where event_time > current_timestamp + interval 1 hour
-- Any rows = test failure (events shouldn't be >1h in the future)
```

---

### General Patterns That Always Apply

| Pattern | What it prevents |
|---------|-----------------|
| Idempotent producer (`enable.idempotence=True, acks=all`) | Duplicate events from network retries |
| DLQ topic (1 partition, longer retention) | Malformed events blocking the stream |
| Bronze = raw landing, Silver = clean | Ability to reprocess without re-reading source |
| ROW_NUMBER dedup on natural key | Duplicates reaching analytics layer |
| `table.dml-sync=true` in batch mode | Silver reading empty Bronze |
| `sleep 5` after Flink before dbt | dbt reading before Iceberg metadata commits |
| CPU limits on TaskManager | Resource starvation in shared Docker pool |
| All images pinned to specific tags | Pipeline breaking when `:latest` changes |
| `classloader.check-leaked-classloader: false` | Flink + Iceberg class loading errors |
| `s3_endpoint: "minio:9000"` (no http://) | DuckDB httpfs rejects protocol prefix |


## 23. What We Learned: Key Decisions Explained

This section is a retrospective of the most important decisions made while building P04,
and the lessons they carry to any real-time pipeline.

---

### 1. "Kafka-compatible" ≠ "Zero code changes"
**What happened:** Redpanda claims 100% Kafka API compatibility. For this pipeline's workload
the claim held: we ran the exact same Flink SQL, the exact same dbt models, and the exact same
producer code, and everything worked. The only change was one line: `bootstrap.servers = 'redpanda:9092'`.

**The lesson:** Evaluate broker alternatives on *actual* protocol compatibility, not marketing.
The Kafka wire protocol is well-specified. Redpanda, WarpStream, AutoMQ, and others implement
it faithfully. The ecosystem (Flink, Debezium, Python confluent-kafka, Java clients) doesn't
care which broker is on the other end.

---

### 2. The Silver source bug: a silent failure that passes all tests

**What happened:** P04's `sources.yml` originally pointed to the Bronze table. Flink Silver
was writing clean data to one location. dbt was reading raw data from another. 91 tests passed
because the Bronze data was *similar enough* to Silver: nearly the same row count, and the same
column names after Flink Bronze's timestamp parsing.

**The lesson:** Passing tests prove the tests passed, not that the pipeline is correct.
Always verify: `SELECT COUNT(*) FROM bronze.raw_trips` vs `SELECT COUNT(*) FROM silver.cleaned_trips`,
then compare both against the row count dbt actually reads. If the counts match (~10,000 each
when you expect ~9,855 in Silver), something is wrong: the deduplication + quality filtering
either didn't run or never reached dbt.
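
The count comparison can be turned into an automated check. The sketch below assumes you have already collected three numbers (Bronze, Silver, and the row count of the dbt staging model); the third count and the function name `diagnose_source` are additions made here for illustration.

```python
def diagnose_source(bronze_count: int, silver_count: int, staging_count: int) -> str:
    """Classify the relationship between layer row counts."""
    if silver_count == bronze_count:
        # Dedup and quality filters normally remove at least some rows.
        return "suspicious: Silver removed no rows"
    if staging_count == silver_count:
        return "ok: staging matches Silver"
    if staging_count == bronze_count:
        return "wrong source: staging matches Bronze"
    return "mismatch: staging matches neither layer"
```

With the benchmark's numbers, `diagnose_source(10000, 9855, 10000)` flags the wrong source, which is the situation the 91 passing tests did not catch.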

---

### 3. Flink's 06-silver.sql vs 05-run-all.sql inconsistency

**What happened:** The combined pipeline (`05-run-all.sql`) had the correct ROW_NUMBER()
deduplication. The standalone Silver file (`06-silver.sql`) ran a plain INSERT without
dedup. Running `make process-silver` independently produced different results than `make process`.

**The lesson:** When the same operation exists in multiple files (standalone + combined),
they must be kept in sync. Add a test: run `make process-silver` and verify row count matches
the combined pipeline's Silver output. If they differ, the files diverged.
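
A cheaper complement to comparing row counts is comparing the SQL text itself. The sketch below normalizes two files and extracts the INSERT statement for a given table so they can be diffed in CI. It is deliberately naive: it assumes `--` and `;` never appear inside string literals, and the function names are hypothetical.

```python
import re


def normalize_sql(sql: str) -> str:
    """Strip -- comments, collapse whitespace, and lowercase."""
    without_comments = re.sub(r"--[^\n]*", "", sql)
    return " ".join(without_comments.split()).lower()


def extract_insert(sql: str, table: str) -> str:
    """Return the normalized INSERT statement targeting `table`, or '' if absent."""
    for statement in normalize_sql(sql).split(";"):
        statement = statement.strip()
        if statement.startswith(f"insert into {table.lower()}"):
            return statement
    return ""
```

A check would read `06-silver.sql` and `05-run-all.sql`, call `extract_insert` on each with the Silver table name, and fail if the two results differ.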

---

### 4. The benchmark race condition

**What happened:** Flink's batch job returns when the INSERT statement completes. But "INSERT
completes" means Flink flushed data to Parquet files in MinIO — not that Iceberg metadata
(the `.metadata.json` snapshot) was committed and visible to readers. dbt would start and
sometimes see 0 rows because Iceberg's metadata wasn't ready.

**The lesson:** Flink's `table.dml-sync=true` waits for the job to finish writing *data*.
Iceberg publishes the new snapshot separately, in a final atomic metadata commit. Add `sleep 5`
between Flink and dbt in any batch pipeline. In streaming mode with proper checkpoints, this race
doesn't exist because DuckDB always reads the latest committed snapshot.
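
A fixed sleep is the simplest fix; a bounded polling loop is one alternative that returns as soon as the data is visible and fails loudly if it never is. This is a generic sketch, not part of P04: `wait_for` is a hypothetical helper, and the `ready` callback (for example, a function that runs a row count against Silver and checks it is non-zero) is left to the caller.

```python
import time
from typing import Callable


def wait_for(
    ready: Callable[[], bool], timeout_s: float = 30.0, interval_s: float = 1.0
) -> bool:
    """Poll ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if ready():
            return True
        if time.monotonic() >= deadline:
            return False  # caller decides whether to abort the run
        time.sleep(interval_s)
```

Using `time.monotonic()` rather than wall-clock time keeps the timeout correct even if the system clock is adjusted during the wait.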

---

### 5. Enrichment columns belong in dbt, not Flink

**What happened:** An early version of Silver computed `duration_minutes`, `avg_speed_mph`,
`cost_per_mile`, `tip_percentage`, `pickup_hour`, and `is_weekend` directly in the Flink
Silver SQL. These were removed and moved to `int_trip_metrics.sql`.

**The lesson:** Apply the single-responsibility principle to data layers. Flink is uniquely
positioned to do: ordering, deduplication, type coercion, and quality filtering at stream
speed. It's not the right place for business metrics because:
1. Business rules change. Changing Flink SQL requires reprocessing the entire Bronze table.
2. Business metrics are domain knowledge — analysts understand them, not infrastructure engineers.
3. dbt tests can verify business rule correctness. Flink SQL has no comparable built-in test framework.

---

### 6. CPU limits prevent silent failures in Docker Desktop

**What happened:** Without CPU limits, during Flink processing the TaskManager consumed
all available cores on Docker Desktop. MinIO's S3 server starved for CPU → S3A write requests
timed out → Flink retried → more load → eventual failure. The error message looked like
a network error, not a resource problem.

**The lesson:** Always set both `memory` AND `cpus` limits on CPU-intensive containers
in Docker Desktop environments. In production Kubernetes, use resource requests/limits on
every pod. Docker applies no CPU limit to a container unless you set one, so your JVM
containers will happily consume 100% of all available cores.
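
A missing limit is easy to detect mechanically. The sketch below inspects an already-parsed Compose document (for example, JSON output from `docker compose config --format json`, if your Compose version supports that flag) and reports services lacking a CPU or memory limit. The function name and the service names in the usage note are hypothetical.

```python
def services_missing_limits(compose: dict) -> dict[str, list[str]]:
    """Map each service name to the resource limits it does not declare."""
    missing = {}
    for name, service in compose.get("services", {}).items():
        limits = service.get("deploy", {}).get("resources", {}).get("limits", {})
        gaps = []
        # Accept either deploy.resources.limits.* or the service-level keys.
        if "cpus" not in limits and "cpus" not in service:
            gaps.append("cpus")
        if "memory" not in limits and "mem_limit" not in service:
            gaps.append("memory")
        if gaps:
            missing[name] = gaps
    return missing
```

For a document where `flink-taskmanager` declares only a memory limit, the result maps that service to `["cpus"]`, which is the configuration that produced the failure described here.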

---

### 7. The DLQ must exist in the Makefile, not just the shell script

**What happened:** The `create-topics.sh` shell script created both topics correctly.
But the `Makefile`'s `create-topics` target ran `rpk topic create taxi.raw_trips` inline,
bypassing the shell script. Running `make create-topics` only created one topic.

**The lesson:** When you have multiple entry points to the same operation (shell script + Makefile),
test them independently. The Makefile target is what everyone uses in practice (`make create-topics`
is more natural than `bash kafka/create-topics.sh`). Keep them synchronized or have one call the other.
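
Testing the entry points independently can be as simple as scanning each file for the topics it creates. The sketch below is an assumption-laden illustration: `missing_topics` is a hypothetical helper, it only looks at lines containing `rpk topic create`, and it does not follow backslash line continuations.

```python
REQUIRED_TOPICS = frozenset({"taxi.raw_trips", "taxi.raw_trips.dlq"})


def missing_topics(text: str, required: frozenset = REQUIRED_TOPICS) -> set[str]:
    """Return required topics that no 'rpk topic create' line in text creates."""
    created = set()
    for line in text.splitlines():
        if "rpk topic create" not in line:
            continue
        args = line.split("rpk topic create", 1)[1].split()
        # Whole-token comparison: 'taxi.raw_trips' does not match the DLQ name.
        created.update(arg for arg in args if arg in required)
    return set(required) - created
```

Running it over both the Makefile and `create-topics.sh` and asserting that each result is empty would have caught the missing DLQ topic.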

---

### 8. Streaming mode's `table.dml-sync` trap

**What happened:** Developers familiar with batch mode copied `SET 'table.dml-sync' = 'true'`
into the streaming init file. The Flink session blocked forever on the first INSERT (which
runs indefinitely in streaming mode), never getting to execute the second statement.

**The lesson:** Batch and streaming modes have fundamentally different lifecycles. `table.dml-sync`
is the batch synchronization primitive that makes no sense for infinite streaming jobs. Never
copy settings between batch and streaming init files without understanding what each setting does.
A streaming pipeline that "works" with `dml-sync=true` is actually running only one job —
it never progressed past the first INSERT.
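
This trap is also easy to lint for. The sketch below scans the text of a streaming init file for an active `SET 'table.dml-sync' = 'true'` statement; the function name is hypothetical and the regular expression only covers the single-line form of the statement.

```python
import re

# Matches an uncommented SET statement; lines starting with -- do not match.
DML_SYNC_TRUE = re.compile(
    r"""^\s*SET\s+'?table\.dml-sync'?\s*=\s*'?true'?\s*;""",
    re.IGNORECASE | re.MULTILINE,
)


def has_dml_sync_enabled(sql: str) -> bool:
    """Return True if the SQL text enables table.dml-sync."""
    return bool(DML_SYNC_TRUE.search(sql))
```

A pre-commit hook could read `00-init-streaming.sql` and fail when this function returns `True`.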

---

### Summary: The Mental Model

```
Everything a streaming data pipeline does falls into one of three buckets:

1. INFRASTRUCTURE WORK (Flink):
   Parse timestamps, enforce types, deduplicate on the natural key,
   apply basic validity filters, write to the storage format.
   This runs at stream speed and shouldn't be changed often.

2. BUSINESS LOGIC (dbt):
   Compute metrics, join dimensions, aggregate to business granularities,
   test business rules, version-control analytical decisions.
   This changes with business requirements and should be easy to modify.

3. OPERATIONS (Makefile + Docker Compose):
   Manage infrastructure lifecycle, health checks, topic creation,
   DLQ, resource limits, benchmark timing, image pinning.
   This is the glue that makes the pipeline reproducible and observable.

When something is in the wrong bucket, you get problems:
- Business logic in Flink → hard to change, hard to test, requires stream reprocessing
- Infrastructure work in dbt → slow (batch query instead of stream), misses the point
- Missing operations (no DLQ, no CPU limits, no sleep) → silent failures that pass tests
```
