# 02 - Streaming Bronze

Ingests raw data from Kafka into the Bronze layer (Delta Lake).

## Configuration

In [1]:
import os
from dotenv import load_dotenv
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType, BooleanType, LongType
from config import get_s3_path, create_spark_session

load_dotenv()

KAFKA_BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP", "kafka1:9092")
TOPIC_NAME = os.getenv("TOPIC_NAME", "opensky-data")
BRONZE_PATH = get_s3_path("bronze", "flights")
CHECKPOINT_PATH = get_s3_path("checkpoints", "bronze_flights")

spark = create_spark_session("StreamingBronze", extra_packages=["org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.3"])

print(f"✅ Output: {BRONZE_PATH}")

✅ Configuration loaded from .env
:: loading settings :: url = jar:file:/opt/conda/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.apache.spark#spark-hadoop-cloud_2.12 added as a dependency
io.delta#delta-spark_2.12 added as a dependency
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f5492da6-1898-48e3-b9cb-92c2b00e3752;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.apache.spark#spark-hadoop-cloud_2.12;3.5.3 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found com

✅ Spark Session 'StreamingBronze' configured
✅ Output: s3a://datalake/bronze/flights
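
`get_s3_path` comes from the project's `config` module, which is not shown here. Judging only from the output above, where `get_s3_path("bronze", "flights")` resolves to `s3a://datalake/bronze/flights`, a helper along these lines would produce the same path. This is a sketch, not the project's code: the name `sketch_get_s3_path` and the `S3_BUCKET` environment variable are assumptions.

```python
import os
from typing import Optional


def sketch_get_s3_path(layer: str, name: str, bucket: Optional[str] = None) -> str:
    """Build an s3a:// URI of the form s3a://<bucket>/<layer>/<name>."""
    # Hypothetical: the bucket is read from S3_BUCKET, defaulting to "datalake".
    bucket = bucket or os.getenv("S3_BUCKET", "datalake")
    return f"s3a://{bucket}/{layer.strip('/')}/{name.strip('/')}"


print(sketch_get_s3_path("bronze", "flights", bucket="datalake"))
# prints: s3a://datalake/bronze/flights
```

Keeping the data path and the checkpoint path behind one helper means both always point at the same bucket.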


## Data schema

In [2]:
schema = StructType([
    StructField("time", LongType(), True),
    StructField("icao24", StringType(), True),
    StructField("callsign", StringType(), True),
    StructField("origin_country", StringType(), True),
    StructField("time_position", LongType(), True),
    StructField("last_contact", LongType(), True),
    StructField("longitude", FloatType(), True),
    StructField("latitude", FloatType(), True),
    StructField("baro_altitude", FloatType(), True),
    StructField("on_ground", BooleanType(), True),
    StructField("velocity", FloatType(), True),
    StructField("true_track", FloatType(), True),
    StructField("vertical_rate", FloatType(), True),
    StructField("geo_altitude", FloatType(), True),
    StructField("squawk", StringType(), True),
    StructField("spi", BooleanType(), True),
    StructField("position_source", IntegerType(), True),
    StructField("category", IntegerType(), True)
])

print("✅ Schema defined")

✅ Schema defined
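
Every field in the schema is nullable, so a payload that does not match it tends to surface as null columns rather than as an error. One way to catch this earlier is to check a sample payload against the same field names before it reaches Kafka. The sketch below mirrors the schema above in plain Python. `EXPECTED_FIELDS`, `check_record`, and the sample values are illustrative assumptions, not part of the notebook.

```python
import json

# Field name -> accepted Python types, mirroring the StructType above.
EXPECTED_FIELDS = {
    "time": (int,),
    "icao24": (str,),
    "callsign": (str,),
    "origin_country": (str,),
    "time_position": (int,),
    "last_contact": (int,),
    "longitude": (int, float),
    "latitude": (int, float),
    "baro_altitude": (int, float),
    "on_ground": (bool,),
    "velocity": (int, float),
    "true_track": (int, float),
    "vertical_rate": (int, float),
    "geo_altitude": (int, float),
    "squawk": (str,),
    "spi": (bool,),
    "position_source": (int,),
    "category": (int,),
}


def check_record(raw: str) -> list:
    """Return the problems found in one JSON payload (empty list if none)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["payload is not a JSON object"]

    problems = []
    for name, types in EXPECTED_FIELDS.items():
        value = record.get(name)
        if value is None:
            continue  # every field is nullable in the schema
        # bool is a subclass of int in Python, so reject it for numeric fields.
        wrong_bool = isinstance(value, bool) and bool not in types
        if wrong_bool or not isinstance(value, types):
            problems.append(f"{name}: unexpected type {type(value).__name__}")

    unknown = sorted(set(record) - set(EXPECTED_FIELDS))
    if unknown:
        problems.append(f"unknown fields: {unknown}")
    return problems


# Made-up sample values, for illustration only.
good = json.dumps({"time": 1700000000, "icao24": "abc123", "longitude": 2.35, "on_ground": False})
bad = json.dumps({"icao24": 123, "altitude": 1000})
print(check_record(good))
print(check_record(bad))
```

A check like this fits best on the producer side or in a test, since the streaming job itself only sees the nulls.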


## Streaming Kafka → Bronze

In [5]:
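# Read the Kafka topic as a stream, starting from the earliest available offset.
# failOnDataLoss=false keeps the query running when offsets are no longer
# available (for example, aged out by retention).
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP) \
    .option("subscribe", TOPIC_NAME) \
    .option("startingOffsets", "earliest") \
    .option("failOnDataLoss", "false") \
    .load()

# Kafka delivers `value` as bytes: cast it to a string, parse the JSON with the
# schema, then flatten the resulting struct into top-level columns.
parsed_df = kafka_df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")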

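print(f"🚀 Streaming to {BRONZE_PATH}...")

# Append each micro-batch to the Bronze Delta table. The checkpoint records the
# consumed offsets so that a restart resumes where the previous run stopped.
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", CHECKPOINT_PATH) \
    .start(BRONZE_PATH)

# Block the cell until the query stops or is interrupted.
query.awaitTermination()

🚀 Streaming to s3a://datalake/bronze/flights...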


26/01/23 14:07:58 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
26/01/23 14:07:58 WARN StreamingQueryManager: Stopping existing streaming query [id=00581bd0-e6af-4af7-b49c-64568b446ba4, runId=6d4927a3-f9ab-45b7-9dde-9d38f754cb5a], as a new run is being started.
26/01/23 14:07:58 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
26/01/23 14:12:57 ERROR NonFateSharingFuture: Failed to get result from future  
scala.runtime.NonLocalReturnControl
ERROR:root:KeyboardInterrupt while sending command.                             
Traceback (most recent call last):
  File "/opt/conda/lib/python3.12/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/

KeyboardInterrupt: 
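
The traceback above comes from interrupting the cell by hand while `query.awaitTermination()` was blocking. A gentler pattern is to wait in short intervals and stop the query explicitly on interrupt. The sketch below relies only on the `StreamingQuery` members `awaitTermination(timeout)`, `isActive`, and `stop()`. The helper name `run_until_interrupted` and the polling interval are illustrative assumptions.

```python
def run_until_interrupted(query, poll_seconds: float = 5.0) -> str:
    """Wait on a streaming query and stop it cleanly on a keyboard/kernel interrupt.

    Returns "terminated" if the query ended by itself, "stopped" if it was
    interrupted. Exceptions raised by a failing query are not caught here.
    """
    try:
        # awaitTermination(timeout) returns True once the query has terminated
        # and False if the timeout elapsed first, so keep polling until True.
        while not query.awaitTermination(poll_seconds):
            pass
        return "terminated"
    except KeyboardInterrupt:
        # Stop the query explicitly instead of leaving it running in the JVM.
        if query.isActive:
            query.stop()
        return "stopped"
```

In this notebook it would take the place of the final `query.awaitTermination()` call, as `run_until_interrupted(query)`. Since the checkpoint keeps the consumed offsets, a later run should resume from where the stopped query left off.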