# Structured Streaming Quickstart: Kafka → Iceberg

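This notebook walks through a real-time pipeline: Kafka source → transformation → sink. The notebook itself writes to the console for demonstration; the Iceberg sink is covered by `examples/streaming/kafka_to_iceberg.py`.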

**Note:** This notebook assumes Spark in local or cluster mode, not Spark Connect. Run it with a `local[*]` master.

## 1. Setup

Start Kafka first: `docker compose -f examples/streaming/docker-compose.yaml up -d`

Then create the Spark session with the Kafka connector package and the Iceberg catalog settings.

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

KAFKA_BOOTSTRAP = os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
CHECKPOINT = os.environ.get("CHECKPOINT_LOCATION", "/tmp/streaming-checkpoint")

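# Match the connector to your Spark and Scala build.
# Spark 4.1: spark-sql-kafka-0-10_2.13:4.1.0; Spark 3.5: spark-sql-kafka-0-10_2.12:3.5.0
# Only the Kafka connector is downloaded here. The Iceberg settings below also need a
# matching iceberg-spark-runtime jar on the classpath before the catalog can be used.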
spark = (
    SparkSession.builder
    .appName("streaming-quickstart")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:4.1.0")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hadoop")
    .config("spark.sql.catalog.iceberg.warehouse", "s3a://warehouse/iceberg")
    .config("spark.sql.streaming.checkpointLocation", CHECKPOINT)
    .getOrCreate()
)

print(f"Spark {spark.version}")
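
The package coordinates have to track the Spark and Scala versions, which is easy to get wrong when editing them by hand. A minimal sketch of building the `spark.jars.packages` value, assuming the usual Maven naming of the two artifacts; `spark_packages` is a hypothetical helper, not part of this repository, and it does not check that a given combination is actually published:

```python
from typing import Optional


def spark_packages(
    spark_version: str,
    scala_version: str,
    iceberg_version: Optional[str] = None,
) -> str:
    """Build a comma-separated value for spark.jars.packages (hypothetical helper)."""
    # The Kafka connector is versioned with Spark itself.
    coords = [
        f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"
    ]
    if iceberg_version is not None:
        # The Iceberg runtime artifact name carries the Spark minor version, e.g. "3.5".
        spark_minor = ".".join(spark_version.split(".")[:2])
        coords.append(
            "org.apache.iceberg:"
            f"iceberg-spark-runtime-{spark_minor}_{scala_version}:{iceberg_version}"
        )
    return ",".join(coords)


print(spark_packages("3.5.0", "2.12", iceberg_version="1.5.0"))
```

The result can be passed to `.config("spark.jars.packages", ...)` in place of the hard-coded string.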

## 2. Define schema and read from Kafka

In [None]:
schema = StructType([
    StructField("id", StringType(), False),
    StructField("user_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("event_ts", LongType(), True),
])

stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
    .option("subscribe", "events")
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
    .withColumn("processing_time", current_timestamp())
)

stream_df.printSchema()
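
The stream only shows data once the `events` topic holds JSON messages whose fields match the schema; by default, payloads that do not parse come through with null fields rather than raising an error. A small sketch of generating matching payloads, assuming `event_ts` is epoch milliseconds (the schema only says `LongType`); `make_event` is a hypothetical helper, and sending the lines to Kafka is left to whichever producer you use:

```python
import json
import time
import uuid


def make_event(user_id: str, amount: float) -> str:
    """Serialize one event with the four fields of the schema (hypothetical helper)."""
    return json.dumps(
        {
            "id": str(uuid.uuid4()),
            "user_id": user_id,
            "amount": amount,
            # Assumption: epoch milliseconds. Adjust to the unit your producers use.
            "event_ts": int(time.time() * 1000),
        }
    )


# Print a few sample messages, one JSON document per line.
for i in range(3):
    print(make_event(f"user-{i}", 10.0 * i))
```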

## 3. Run streaming query (console sink for demo)

The console sink is for demonstration only. For production, write to an Iceberg table instead; see `examples/streaming/kafka_to_iceberg.py`.

In [None]:
query = (
    stream_df.writeStream
    .outputMode("append")
    .format("console")
    .option("checkpointLocation", CHECKPOINT)
    .trigger(processingTime="5 seconds")
    .start()
)

# Let the query run for up to 60 seconds, then stop it
query.awaitTermination(60)
query.stop()
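
For comparison with the console demo, here is a minimal sketch of what an Iceberg sink could look like. It is not the contents of `examples/streaming/kafka_to_iceberg.py`. It assumes the Iceberg runtime jar is on the classpath and that the target table already exists; `start_iceberg_sink` and the table name in the usage comment are hypothetical:

```python
def start_iceberg_sink(stream_df, table: str, checkpoint: str):
    """Start an append-mode streaming write into an existing Iceberg table."""
    return (
        stream_df.writeStream
        .format("iceberg")
        .outputMode("append")
        # The checkpoint tracks which Kafka offsets have been committed to the table.
        .option("checkpointLocation", checkpoint)
        .trigger(processingTime="5 seconds")
        # toTable() starts the query and returns a StreamingQuery.
        .toTable(table)
    )


# Usage (hypothetical table name; use a checkpoint directory separate from the console demo):
# query = start_iceberg_sink(stream_df, "iceberg.demo.events", CHECKPOINT + "-iceberg")
```

Keeping one checkpoint directory per query matters because the checkpoint stores that query's offsets and sink metadata.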

## 4. Full pipeline (spark-submit)

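For the full Kafka → Iceberg pipeline with exactly-once semantics, run the script with `spark-submit`. Match both package versions to your Spark and Scala build; the Iceberg runtime artifact name includes the Spark minor version it was built for (`3.5` in the command below).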

```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:4.1.0,org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.5.0 examples/streaming/kafka_to_iceberg.py
```

See `examples/streaming/README.md` for full setup.