# 07 - Streaming Gold

Real-time aggregations over time windows on the Silver_ML data.

**Implemented queries:**
1. **Count by flight phase** - 1-minute tumbling window
2. **Anomaly alerts** - 5-minute sliding window (1-minute slide)
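
To make the two window types concrete, the sketch below assigns a single event timestamp to its windows in plain Python. It assumes half-open windows (`[start, end)`) aligned to the Unix epoch, which is how Spark's `window` function behaves when no start offset is given. `windows_for` is a hypothetical helper for illustration, not part of the pipeline.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def windows_for(event_time, size, slide=None):
    """Return every [start, end) window that contains event_time.

    Hypothetical helper: slide=None means a tumbling window (slide == size).
    """
    slide = slide or size
    # Latest window start at or before the event, aligned to the epoch.
    start = event_time - (event_time - EPOCH) % slide
    windows = []
    # Walk back one slide at a time while the window still covers the event.
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event = datetime(2026, 1, 23, 13, 37, 42, tzinfo=timezone.utc)
tumbling = windows_for(event, timedelta(minutes=1))
sliding = windows_for(event, timedelta(minutes=5), timedelta(minutes=1))

print(len(tumbling), "tumbling window:", tumbling[0][0].time(), "->", tumbling[0][1].time())
print(len(sliding), "sliding windows, first starting at", sliding[0][0].time())
```

Each observation therefore contributes to exactly one window in the first query and to five overlapping windows in the second, so totals from the sliding-window table should not be summed across windows.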

## Configuration

In [1]:
from pyspark.sql.functions import (
    col, window, count, avg, stddev, max as spark_max, min as spark_min,
    when, lit, current_timestamp
)
from config import get_s3_path, create_spark_session

SILVER_ML_PATH = get_s3_path("silver", "flights_ml")
GOLD_AGGREGATIONS_PATH = get_s3_path("gold", "streaming_aggregations", "flight_phase_counts")
GOLD_ANOMALIES_PATH = get_s3_path("gold", "streaming_aggregations", "anomaly_alerts")
CHECKPOINT_AGGREGATIONS = get_s3_path("checkpoints", "gold_aggregations")
CHECKPOINT_ANOMALIES = get_s3_path("checkpoints", "gold_anomalies")

spark = create_spark_session("StreamingGold")

print(f"Input:  {SILVER_ML_PATH}")
print(f"Output Aggregations: {GOLD_AGGREGATIONS_PATH}")
print(f"Output Anomalies:    {GOLD_ANOMALIES_PATH}")

✅ Configuration loaded from .env
✅ Spark session 'StreamingGold' configured
Input:  s3a://datalake/silver/flights_ml
Output Aggregations: s3a://datalake/gold/streaming_aggregations/flight_phase_counts
Output Anomalies:    s3a://datalake/gold/streaming_aggregations/anomaly_alerts
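
The `config` module is not shown in this notebook. Judging only from the paths printed above, `get_s3_path` appears to join its arguments under an `s3a://` bucket. The sketch below is a guess at what such a helper might look like: the bucket name `datalake` is taken from the output, while the function name `get_s3_path_sketch`, its body, and the `bucket` parameter are assumptions.

```python
def get_s3_path_sketch(*parts, bucket="datalake"):
    """Hypothetical stand-in for config.get_s3_path: build an s3a:// URI."""
    # Strip stray slashes so callers can pass "gold" or "gold/".
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([f"s3a://{bucket}", *cleaned])


silver_path = get_s3_path_sketch("silver", "flights_ml")
gold_path = get_s3_path_sketch("gold", "streaming_aggregations", "flight_phase_counts")

print(silver_path)
print(gold_path)
```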


## Reading the Silver_ML stream

In [2]:
df_silver_stream = spark.readStream \
    .format("delta") \
    .load(SILVER_ML_PATH)

print("Silver_ML stream initialized")
print(f"Columns: {df_silver_stream.columns}")

26/01/23 14:07:07 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


Silver_ML stream initialized
Columns: ['event_timestamp', 'icao24', 'callsign', 'origin_country', 'longitude', 'latitude', 'velocity_kmh', 'altitude_meters', 'on_ground', 'category', 'prev_altitude', 'prev_velocity', 'altitude_change', 'velocity_change', 'observation_rank', 'airport_icao', 'airport_name', 'airport_country', 'rolling_avg_altitude', 'rolling_std_altitude', 'rolling_avg_velocity', 'flight_phase']


## Stream 1: Count by flight phase (1-minute tumbling window)

Real-time count of flight observations (rows) per flight phase (CLIMB, CRUISE, DESCENT, etc.) over a 1-minute tumbling window, together with the average altitude and velocity for each phase.

In [3]:
df_phase_counts = df_silver_stream \
    .withWatermark("event_timestamp", "2 minutes") \
    .groupBy(
        window(col("event_timestamp"), "1 minute"),
        col("flight_phase")
    ) \
    .agg(
        count("*").alias("flight_count"),
        avg("altitude_meters").alias("avg_altitude"),
        avg("velocity_kmh").alias("avg_velocity")
    ) \
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("flight_phase"),
        col("flight_count"),
        col("avg_altitude"),
        col("avg_velocity")
    )

print("Stream 1: count by flight phase (1-minute tumbling window)")

Stream 1: count by flight phase (1-minute tumbling window)


In [4]:
query_aggregations = df_phase_counts.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", CHECKPOINT_AGGREGATIONS) \
    .start(GOLD_AGGREGATIONS_PATH)

print(f"Stream 1 started -> {GOLD_AGGREGATIONS_PATH}")

Stream 1 started -> s3a://datalake/gold/streaming_aggregations/flight_phase_counts
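
Stream 1 combines a 2-minute watermark with the `append` output mode. In that mode a windowed aggregate is written only once Spark considers the window closed, that is, once the watermark (the largest event time seen so far minus the allowed delay) has reached the window end. The sketch below models that rule in plain Python. `watermark` and `is_window_final` are illustrative names, and the exact boundary comparison and micro-batch timing inside Spark are simplified away.

```python
from datetime import datetime, timedelta


def watermark(max_event_time, delay):
    """Watermark = largest event time observed so far minus the allowed lateness."""
    return max_event_time - delay


def is_window_final(window_end, max_event_time, delay):
    """Simplified rule: a window can be emitted once the watermark reaches its end."""
    return watermark(max_event_time, delay) >= window_end


delay = timedelta(minutes=2)
window_end = datetime(2026, 1, 23, 13, 38, 0)

# Events up to 13:39:30 have been seen: the watermark is 13:37:30, window still open.
before = is_window_final(window_end, datetime(2026, 1, 23, 13, 39, 30), delay)
# Events up to 13:40:00 have been seen: the watermark is 13:38:00, window can be emitted.
after = is_window_final(window_end, datetime(2026, 1, 23, 13, 40, 0), delay)

print(before, after)
```

Under this model, a row for a given minute cannot appear in the Gold table until events at least two minutes newer than the window end have arrived. A source that stops producing data also stops output, because the watermark only advances when newer events are seen.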


26/01/23 14:07:10 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


## Stream 2: Anomaly alerts by country (5-minute sliding window, 1-minute slide)

Detects abnormal velocities and altitudes per country of origin over a 5-minute sliding window that advances every minute.

In [5]:
# Anomaly thresholds
ALTITUDE_MAX_THRESHOLD = 12000  # meters
VELOCITY_MAX_THRESHOLD = 1000   # km/h
ALTITUDE_MIN_THRESHOLD = -100   # meters (below sea level)
VELOCITY_MIN_THRESHOLD = 0      # km/h

print("Anomaly thresholds:")
print(f"  Altitude: {ALTITUDE_MIN_THRESHOLD}m - {ALTITUDE_MAX_THRESHOLD}m")
print(f"  Velocity: {VELOCITY_MIN_THRESHOLD} - {VELOCITY_MAX_THRESHOLD} km/h")

Anomaly thresholds:
  Altitude: -100m - 12000m
  Velocity: 0 - 1000 km/h
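
As a plain-Python illustration of how these thresholds turn into the two flags and the `anomaly_rate` column computed in the next cell, the sketch below classifies individual observations and aggregates them. The threshold values are the ones from this notebook. The sample records and the helper names `flags` and `anomaly_rate` are made up for the example.

```python
ALTITUDE_MAX_THRESHOLD = 12000  # meters
VELOCITY_MAX_THRESHOLD = 1000   # km/h
ALTITUDE_MIN_THRESHOLD = -100   # meters
VELOCITY_MIN_THRESHOLD = 0      # km/h


def flags(altitude_m, velocity_kmh):
    """Return (is_altitude_anomaly, is_velocity_anomaly) as 0/1 values."""
    alt = int(altitude_m > ALTITUDE_MAX_THRESHOLD or altitude_m < ALTITUDE_MIN_THRESHOLD)
    vel = int(velocity_kmh > VELOCITY_MAX_THRESHOLD or velocity_kmh < VELOCITY_MIN_THRESHOLD)
    return alt, vel


def anomaly_rate(observations):
    """(altitude flags + velocity flags) / number of observations."""
    pairs = [flags(a, v) for a, v in observations]
    return sum(alt + vel for alt, vel in pairs) / len(pairs)


# Made-up sample: (altitude in meters, velocity in km/h)
sample = [
    (11000.0, 850.0),   # normal
    (12001.5, 1120.1),  # both flags raised
    (10355.6, 900.0),   # normal
    (-150.0, 300.0),    # altitude flag only
]
rate = anomaly_rate(sample)
print(rate)
```

Because the two counts are summed, an observation that breaks both thresholds counts twice, so this rate ranges from 0 to 2 rather than 0 to 1.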


In [6]:
# Re-read the stream for the second pipeline
df_silver_stream_2 = spark.readStream \
    .format("delta") \
    .load(SILVER_ML_PATH)

df_anomalies = df_silver_stream_2 \
    .withColumn(
        "is_altitude_anomaly",
        when(
            (col("altitude_meters") > ALTITUDE_MAX_THRESHOLD) | 
            (col("altitude_meters") < ALTITUDE_MIN_THRESHOLD),
            1
        ).otherwise(0)
    ) \
    .withColumn(
        "is_velocity_anomaly",
        when(
            (col("velocity_kmh") > VELOCITY_MAX_THRESHOLD) | 
            (col("velocity_kmh") < VELOCITY_MIN_THRESHOLD),
            1
        ).otherwise(0)
    ) \
    .withWatermark("event_timestamp", "6 minutes") \
    .groupBy(
        window(col("event_timestamp"), "5 minutes", "1 minute"),
        col("origin_country")
    ) \
    .agg(
        count("*").alias("total_observations"),
        count(when(col("is_altitude_anomaly") == 1, 1)).alias("altitude_anomalies"),
        count(when(col("is_velocity_anomaly") == 1, 1)).alias("velocity_anomalies"),
        spark_max("altitude_meters").alias("max_altitude"),
        spark_min("altitude_meters").alias("min_altitude"),
        spark_max("velocity_kmh").alias("max_velocity"),
        avg("altitude_meters").alias("avg_altitude"),
        avg("velocity_kmh").alias("avg_velocity"),
        stddev("altitude_meters").alias("stddev_altitude"),
        stddev("velocity_kmh").alias("stddev_velocity")
    ) \
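    .withColumn(
        # Sum of both anomaly counts: a row that breaks both thresholds counts
        # twice, so this rate ranges from 0 to 2, not 0 to 1.
        "anomaly_rate",
        (col("altitude_anomalies") + col("velocity_anomalies")) / col("total_observations")
    ) \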
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("origin_country"),
        col("total_observations"),
        col("altitude_anomalies"),
        col("velocity_anomalies"),
        col("anomaly_rate"),
        col("max_altitude"),
        col("min_altitude"),
        col("max_velocity"),
        col("avg_altitude"),
        col("avg_velocity"),
        col("stddev_altitude"),
        col("stddev_velocity")
    )

print("Stream 2: anomaly alerts by country (5-minute sliding window)")

Stream 2: anomaly alerts by country (5-minute sliding window)


In [7]:
query_anomalies = df_anomalies.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", CHECKPOINT_ANOMALIES) \
    .start(GOLD_ANOMALIES_PATH)

print(f"Stream 2 started -> {GOLD_ANOMALIES_PATH}")

Stream 2 started -> s3a://datalake/gold/streaming_aggregations/anomaly_alerts


26/01/23 14:07:11 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


## Monitoring the streams

In [8]:
import time

print("Monitoring Gold streams (Ctrl+C to stop)")
print("=" * 60)

try:
    while True:
        print(f"\n{time.strftime('%H:%M:%S')}")
        print(f"  Aggregations: {query_aggregations.status}")
        print(f"  Anomalies:    {query_anomalies.status}")
        time.sleep(30)
except KeyboardInterrupt:
    print("\nStop requested...")

Monitoring Gold streams (Ctrl+C to stop)

14:07:11
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}


26/01/23 14:07:11 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                


14:07:41
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}

14:08:11
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Waiting for data to arrive', 'isDataAvailable': False, 'isTriggerActive': False}


                                                                                


14:08:41
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}

14:09:11
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}

14:09:41
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}


                                                                                


14:10:11
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}


                                                                                


14:10:41
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}


                                                                                


14:11:11
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}



14:11:42
  Aggregations: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Anomalies:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}


26/01/23 14:11:48 WARN HDFSBackedStateStoreProvider: The state for version 2 doesn't exist in loadedMaps. Reading snapshot file and delta files if needed...Note that this is normal for the first batch of starting query.


14:12:13
  Aggregations: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Anomalies:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}


26/01/23 14:12:26 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers


14:12:44
  Aggregations: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Anomalies:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}





14:13:15
  Aggregations: {'message': 'No new data but cleaning up state', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}


                                                                                


14:13:45
  Aggregations: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Anomalies:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}


                                                                                


14:14:15
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}

14:14:45
  Aggregations: {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}
  Anomalies:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/silver/flights_ml]', 'isDataAvailable': False, 'isTriggerActive': True}

Stop requested...
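
The loop above runs until it is interrupted. A bounded variant can be handy in scripts or tests. The sketch below polls a fixed number of times and returns the collected status messages. It relies only on the `status` and `isActive` attributes of a PySpark `StreamingQuery`. The function `poll_queries`, its parameters, and the stand-in query objects are assumptions made for the example.

```python
import time
from types import SimpleNamespace


def poll_queries(queries, polls=3, interval_s=30.0, sleep=time.sleep):
    """Poll named streaming queries a fixed number of times.

    queries: dict mapping a label to an object exposing `.status` (dict)
    and `.isActive` (bool). Returns (label, message) tuples in polling order.
    """
    history = []
    for i in range(polls):
        for label, query in queries.items():
            history.append((label, query.status["message"]))
        # Stop early if every query has terminated.
        if not any(q.isActive for q in queries.values()):
            break
        if i < polls - 1:
            sleep(interval_s)
    return history


# Stand-in objects so the sketch runs without a Spark cluster.
fake = {
    "Aggregations": SimpleNamespace(status={"message": "Processing new data"}, isActive=True),
    "Anomalies": SimpleNamespace(status={"message": "Waiting for data to arrive"}, isActive=True),
}
history = poll_queries(fake, polls=2, interval_s=0)
for label, message in history:
    print(f"{label}: {message}")
```

With the real queries, the call would look like `poll_queries({"Aggregations": query_aggregations, "Anomalies": query_anomalies})`.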


## Stopping the streams

In [None]:
query_aggregations.stop()
query_anomalies.stop()
print("All Gold streams stopped")

## Verifying the Gold data

In [9]:
print("Gold statistics:")

try:
    df_agg = spark.read.format("delta").load(GOLD_AGGREGATIONS_PATH)
    print(f"  Aggregations: {df_agg.count():,} rows")
    print("\n  Latest aggregations by phase:")
    df_agg.orderBy(col("window_start").desc()).limit(10).show(truncate=False)
except Exception as e:
    print(f"  Aggregations: table not available ({e})")

try:
    df_anom = spark.read.format("delta").load(GOLD_ANOMALIES_PATH)
    print(f"\n  Anomalies: {df_anom.count():,} rows")
    print("\n  Countries with the highest anomaly rates:")
    df_anom.filter(col("anomaly_rate") > 0) \
        .orderBy(col("anomaly_rate").desc()) \
        .limit(10).show(truncate=False)
except Exception as e:
    print(f"  Anomalies: table not available ({e})")

Gold statistics:
  Aggregations: 14 rows

  Latest aggregations by phase:
+-------------------+-------------------+------------+------------+------------------+------------------+
|window_start       |window_end         |flight_phase|flight_count|avg_altitude      |avg_velocity      |
+-------------------+-------------------+------------+------------+------------------+------------------+
|2026-01-23 13:37:00|2026-01-23 13:38:00|TRANSITION  |8008        |2314.071631850777 |349.0629907592404 |
|2026-01-23 13:37:00|2026-01-23 13:38:00|TAKEOFF     |677         |1539.0936806423801|399.10004431314627|
|2026-01-23 13:37:00|2026-01-23 13:38:00|DESCENT     |3788        |4783.215619616674 |580.8134345300947 |
|2026-01-23 13:37:00|2026-01-23 13:38:00|GROUND      |84          |412.3871455873762 |7.297142857142856 |
|2026-01-23 13:37:00|2026-01-23 13:38:00|CLIMB       |3361        |6125.106081411548 |648.9094346920566 |
|2026-01-23 13:37:00|2026-01-23 13:38:00|CRUISE      |13600       |1078

                                                                                

+-------------------+-------------------+--------------+------------------+------------------+------------------+------------+------------+------------+------------+------------------+-----------------+------------------+------------------+
|window_start       |window_end         |origin_country|total_observations|altitude_anomalies|velocity_anomalies|anomaly_rate|max_altitude|min_altitude|max_velocity|avg_altitude      |avg_velocity     |stddev_altitude   |stddev_velocity   |
+-------------------+-------------------+--------------+------------------+------------------+------------------+------------+------------+------------+------------+------------------+-----------------+------------------+------------------+
|2026-01-23 13:34:00|2026-01-23 13:39:00|Ethiopia      |48                |12                |12                |0.5         |12001.5     |10355.58    |1120.1      |11304.27001953125 |910.0291666666666|610.1307357183057 |139.1286301024501 |
|2026-01-23 13:31:00|2026-01-23 13:3