# 07 - Streaming Gold

Real-time aggregations with time windows on the Silver_ML data.

**Queries implemented:**
1. **Count by flight phase** - 1-minute tumbling window
2. **Anomaly alerts by origin country** - 5-minute sliding window (1-minute slide)
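
To make the difference between the two window types concrete, here is a minimal plain-Python sketch (no Spark required) of how an event timestamp maps to windows. The helper `windows_for` is hypothetical and not part of the notebook; it assumes windows are aligned to the Unix epoch, which is the default for Spark's `window` function when no `startTime` is given.

```python
from datetime import datetime, timedelta

def windows_for(ts, size, slide=None):
    """Return the [start, end) windows that contain ts, aligned to the Unix epoch."""
    slide = slide or size  # a tumbling window is a sliding window with slide == size
    epoch = datetime(1970, 1, 1)
    # Start of the latest window that begins at or before ts.
    start = ts - (ts - epoch) % slide
    windows = []
    # Walk backwards while the window still covers ts.
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

ts = datetime(2026, 1, 23, 15, 42, 30)
tumbling = windows_for(ts, timedelta(minutes=1))
sliding = windows_for(ts, timedelta(minutes=5), timedelta(minutes=1))
print(len(tumbling))  # 1
print(len(sliding))   # 5
```

With a 1-minute tumbling window each observation lands in exactly one window, whereas with a 5-minute window sliding every minute it contributes to five overlapping windows. Summing `total_observations` across sliding windows therefore counts each observation several times.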

## Configuration

In [34]:
from pyspark.sql.functions import (
    col, window, count, avg, stddev, max as spark_max, min as spark_min,
    when, lit, current_timestamp
)
from config import get_s3_path, create_spark_session

SILVER_ML_PATH = get_s3_path("silver", "flights_ml")
GOLD_AGGREGATIONS_PATH = get_s3_path("gold", "phase_stats")
GOLD_ANOMALIES_PATH = get_s3_path("gold", "country_stats")
CHECKPOINT_AGGREGATIONS = get_s3_path("checkpoints", "gold_aggregations")
CHECKPOINT_ANOMALIES = get_s3_path("checkpoints", "gold_anomalies")

spark = create_spark_session("StreamingGold")

print(f"Input:  {SILVER_ML_PATH}")
print(f"Output Aggregations: {GOLD_AGGREGATIONS_PATH}")
print(f"Output Anomalies:    {GOLD_ANOMALIES_PATH}")

✅ Spark Session 'StreamingGold' configurée
Input:  s3a://datalake/silver/flights_ml
Output Aggregations: s3a://datalake/gold/phase_stats
Output Anomalies:    s3a://datalake/gold/country_stats


## Reading the Silver_ML stream

In [35]:
df_silver_stream = spark.readStream \
    .format("delta") \
    .load(SILVER_ML_PATH)

print("Silver_ML stream initialized")
print(f"Columns: {df_silver_stream.columns}")

Silver_ML stream initialized
Columns: ['event_timestamp', 'icao24', 'callsign', 'origin_country', 'longitude', 'latitude', 'velocity_kmh', 'altitude_meters', 'on_ground', 'category', 'prev_altitude', 'prev_velocity', 'altitude_change', 'velocity_change', 'observation_rank', 'airport_icao', 'airport_name', 'airport_country', 'rolling_avg_altitude', 'rolling_std_altitude', 'rolling_avg_velocity', 'flight_phase']


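## Stream 1: Count by flight phase (1-minute tumbling window)

Real-time count of observations per flight phase (CLIMB, CRUISE, DESCENT, etc.) over a 1-minute tumbling window. Because `count("*")` counts rows, `flight_count` is the number of observations in the window, not the number of distinct aircraft.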

In [None]:
df_phase_counts = df_silver_stream \
    .withWatermark("event_timestamp", "2 minutes") \
    .groupBy(
        window(col("event_timestamp"), "1 minute"),
        col("flight_phase")
    ) \
    .agg(
        count("*").alias("flight_count"),
        avg("altitude_meters").alias("avg_altitude"),
        avg("velocity_kmh").alias("avg_velocity")
    ) \
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("flight_phase"),
        col("flight_count"),
        col("avg_altitude"),
        col("avg_velocity")
    )

print("Stream 1: Count by flight phase (1-min tumbling window)")

Stream 1: Count by flight phase (1-min tumbling window)


In [None]:
query_aggregations = df_phase_counts.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", CHECKPOINT_AGGREGATIONS) \
    .start(GOLD_AGGREGATIONS_PATH)

print(f"Stream 1 started -> {GOLD_AGGREGATIONS_PATH}")

26/01/23 15:44:10 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


Stream 1 started -> s3a://datalake/gold/phase_stats
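
In append mode, a windowed aggregate is written only once the watermark has passed the end of its window, so results reach the Gold table with a delay. The plain-Python sketch below is a simplified model of that rule, assuming the watermark equals the maximum event time seen minus the configured delay (2 minutes for this stream). The function `finalized_windows` is hypothetical, and real Spark only advances the watermark between micro-batches, so actual emission can lag a little further.

```python
from datetime import datetime, timedelta

def finalized_windows(window_ends, max_event_time, delay):
    """Return the window ends that the watermark has already passed.

    Simplified model: watermark = max event time seen - delay.
    A window is considered final once its end is <= the watermark.
    """
    watermark = max_event_time - delay
    return [end for end in window_ends if end <= watermark]

ends = [
    datetime(2026, 1, 23, 15, 42),
    datetime(2026, 1, 23, 15, 43),
    datetime(2026, 1, 23, 15, 44),
]
# Latest event seen at 15:44:10 gives a watermark of 15:42:10.
done = finalized_windows(ends, datetime(2026, 1, 23, 15, 44, 10), timedelta(minutes=2))
print(done)  # only the window ending at 15:42:00
```

Under this model, a longer watermark delay tolerates later-arriving events at the cost of a longer wait before each window appears in the output.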



## Stream 2: Anomaly alerts by country (5-minute sliding window, 1-minute slide)

Detects abnormal velocities and altitudes per origin country over a 5-minute sliding window.

In [38]:
# Anomaly thresholds
ALTITUDE_MAX_THRESHOLD = 12000  # meters
VELOCITY_MAX_THRESHOLD = 1000   # km/h
ALTITUDE_MIN_THRESHOLD = -100   # meters (below sea level)
VELOCITY_MIN_THRESHOLD = 0      # km/h

print("Anomaly thresholds:")
print(f"  Altitude: {ALTITUDE_MIN_THRESHOLD}m - {ALTITUDE_MAX_THRESHOLD}m")
print(f"  Velocity: {VELOCITY_MIN_THRESHOLD} - {VELOCITY_MAX_THRESHOLD} km/h")

Anomaly thresholds:
  Altitude: -100m - 12000m
  Velocity: 0 - 1000 km/h
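
The thresholds can be checked without Spark. The sketch below is a plain-Python illustration of the per-observation flags this stream derives from them; `flag_anomalies` is a hypothetical helper, and treating a missing value as "not anomalous" is meant to mirror how a `when(...).otherwise(0)` expression evaluates on NULL input.

```python
ALTITUDE_MAX_THRESHOLD = 12000  # meters
VELOCITY_MAX_THRESHOLD = 1000   # km/h
ALTITUDE_MIN_THRESHOLD = -100   # meters
VELOCITY_MIN_THRESHOLD = 0      # km/h

def flag_anomalies(altitude_meters, velocity_kmh):
    """Return (is_altitude_anomaly, is_velocity_anomaly) as 0/1 flags.

    Comparisons are strict, so a value equal to a threshold is not flagged.
    None is treated as not anomalous.
    """
    altitude_flag = int(
        altitude_meters is not None
        and (altitude_meters > ALTITUDE_MAX_THRESHOLD
             or altitude_meters < ALTITUDE_MIN_THRESHOLD)
    )
    velocity_flag = int(
        velocity_kmh is not None
        and (velocity_kmh > VELOCITY_MAX_THRESHOLD
             or velocity_kmh < VELOCITY_MIN_THRESHOLD)
    )
    return altitude_flag, velocity_flag

print(flag_anomalies(12192.0, 817.88))  # (1, 0)
print(flag_anomalies(12000, 1000))      # (0, 0)
print(flag_anomalies(None, 500))        # (0, 0)
```

The first call uses the maximum altitude and velocity from the sample anomaly output at the end of this notebook. 12,192 m corresponds to 40,000 ft, so a 12,000 m ceiling may flag aircraft at that cruise level; whether that is intended depends on the use case.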


In [None]:
# Re-read the stream for the second pipeline
df_silver_stream_2 = spark.readStream \
    .format("delta") \
    .load(SILVER_ML_PATH)

df_anomalies = df_silver_stream_2 \
    .withColumn(
        "is_altitude_anomaly",
        when(
            (col("altitude_meters") > ALTITUDE_MAX_THRESHOLD) | 
            (col("altitude_meters") < ALTITUDE_MIN_THRESHOLD),
            1
        ).otherwise(0)
    ) \
    .withColumn(
        "is_velocity_anomaly",
        when(
            (col("velocity_kmh") > VELOCITY_MAX_THRESHOLD) | 
            (col("velocity_kmh") < VELOCITY_MIN_THRESHOLD),
            1
        ).otherwise(0)
    ) \
    .withWatermark("event_timestamp", "6 minutes") \
    .groupBy(
        window(col("event_timestamp"), "5 minutes", "1 minute"),
        col("origin_country")
    ) \
    .agg(
        count("*").alias("total_observations"),
        count(when(col("is_altitude_anomaly") == 1, 1)).alias("altitude_anomalies"),
        count(when(col("is_velocity_anomaly") == 1, 1)).alias("velocity_anomalies"),
        spark_max("altitude_meters").alias("max_altitude"),
        spark_min("altitude_meters").alias("min_altitude"),
        spark_max("velocity_kmh").alias("max_velocity"),
        avg("altitude_meters").alias("avg_altitude"),
        avg("velocity_kmh").alias("avg_velocity"),
        stddev("altitude_meters").alias("stddev_altitude"),
        stddev("velocity_kmh").alias("stddev_velocity")
    ) \
    .withColumn(
        "anomaly_rate",
        (col("altitude_anomalies") + col("velocity_anomalies")) / col("total_observations")
    ) \
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("origin_country"),
        col("total_observations"),
        col("altitude_anomalies"),
        col("velocity_anomalies"),
        col("anomaly_rate"),
        col("max_altitude"),
        col("min_altitude"),
        col("max_velocity"),
        col("avg_altitude"),
        col("avg_velocity"),
        col("stddev_altitude"),
        col("stddev_velocity")
    )

print("Stream 2: Anomaly alerts by country (5-min sliding window)")

Stream 2: Anomaly alerts by country (5-min sliding window)


26/01/23 15:44:33 WARN HDFSBackedStateStoreProvider: The state for version 10 doesn't exist in loadedMaps. Reading snapshot file and delta files if needed...Note that this is normal for the first batch of starting query.

In [40]:
query_anomalies = df_anomalies.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", CHECKPOINT_ANOMALIES) \
    .start(GOLD_ANOMALIES_PATH)

print(f"Stream 2 started -> {GOLD_ANOMALIES_PATH}")

26/01/23 15:44:59 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


Stream 2 started -> s3a://datalake/gold/country_stats


26/01/23 15:45:02 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers
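
One property of the `anomaly_rate` column is worth keeping in mind when reading the results: it adds the altitude and velocity anomaly counts before dividing by the number of observations, so an observation that breaks both thresholds is counted twice and the rate can exceed 1. The plain-Python sketch below compares that formula with one possible alternative, the share of observations with at least one anomaly. `anomaly_rates` is a hypothetical helper and the alternative is only a suggestion, not what the notebook computes.

```python
def anomaly_rates(flags):
    """Compute two anomaly rates for one window.

    flags: non-empty list of (is_altitude_anomaly, is_velocity_anomaly) pairs.
    Returns (summed_rate, either_rate).
    """
    total = len(flags)
    altitude_anomalies = sum(a for a, _ in flags)
    velocity_anomalies = sum(v for _, v in flags)
    flagged = sum(1 for a, v in flags if a or v)
    # Same formula as the streaming aggregation: both counts are added.
    summed_rate = (altitude_anomalies + velocity_anomalies) / total
    # Alternative: fraction of observations with at least one anomaly.
    either_rate = flagged / total
    return summed_rate, either_rate

print(anomaly_rates([(1, 1), (1, 0), (0, 0), (0, 0)]))  # (0.75, 0.5)
print(anomaly_rates([(1, 1)]))                          # (2.0, 1.0)
```

If the rate is meant to be read as a proportion bounded by 1, the second definition may be the easier one to interpret.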

## Monitoring the streams

In [42]:
import time

print("Monitoring Gold streams (Ctrl+C to stop)")
print("=" * 60)

try:
    while True:
        print(f"\n{time.strftime('%H:%M:%S')}")
        print(f"  Aggregations: {query_aggregations.status}")
        print(f"  Anomalies:    {query_anomalies.status}")
        time.sleep(30)
except KeyboardInterrupt:
    print("\nStop requested...")



Monitoring Gold streams (Ctrl+C to stop)

15:46:40
  Aggregations: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Anomalies:    {'message': 'No new data but cleaning up state', 'isDataAvailable': False, 'isTriggerActive': True}

Stop requested...
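
The `status` dictionary only says what a query is doing right now. For throughput figures, the `lastProgress` attribute of a streaming query can be more informative. The sketch below is a hypothetical formatting helper, assuming `lastProgress` is dict-like (as in PySpark 3.x) and is `None` until a first micro-batch has completed; the keys it reads (`batchId`, `numInputRows`, `processedRowsPerSecond`) are fields of Spark's streaming progress report.

```python
def summarize_progress(name, progress):
    """Format one line of monitoring output from a progress dictionary."""
    if not progress:
        # lastProgress is None before the first micro-batch completes.
        return f"{name}: no batch completed yet"
    return (
        f"{name}: batch {progress.get('batchId')} | "
        f"{progress.get('numInputRows')} rows in | "
        f"{progress.get('processedRowsPerSecond')} rows/s"
    )

# Possible use inside the monitoring loop (requires the running queries):
# print(summarize_progress("Aggregations", query_aggregations.lastProgress))
# print(summarize_progress("Anomalies", query_anomalies.lastProgress))
```

Using `.get` keeps the helper tolerant of missing keys, at the price of printing `None` for fields that are absent.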





## Stopping the streams

In [None]:
query_aggregations.stop()
query_anomalies.stop()
print("All Gold streams stopped")

All Gold streams stopped


## Verifying the Gold data

In [43]:
print("Gold statistics:")

try:
    df_agg = spark.read.format("delta").load(GOLD_AGGREGATIONS_PATH)
    print(f"  Aggregations: {df_agg.count():,} rows")
    print("\n  Latest aggregations by phase:")
    df_agg.orderBy(col("window_start").desc()).limit(10).show(truncate=False)
except Exception as e:
    print(f"  Aggregations: table unavailable ({e})")

try:
    df_anom = spark.read.format("delta").load(GOLD_ANOMALIES_PATH)
    print(f"\n  Anomalies: {df_anom.count():,} rows")
    print("\n  Countries with the highest anomaly rates:")
    df_anom.filter(col("anomaly_rate") > 0) \
        .orderBy(col("anomaly_rate").desc()) \
        .limit(10).show(truncate=False)
except Exception as e:
    print(f"  Anomalies: table unavailable ({e})")

Gold statistics:

  Aggregations: 95 rows

  Latest aggregations by phase:

+-------------------+-------------------+------------+------------+------------------+------------------+
|window_start       |window_end         |flight_phase|flight_count|avg_altitude      |avg_velocity      |
+-------------------+-------------------+------------+------------+------------------+------------------+
|2026-01-23 15:42:00|2026-01-23 15:43:00|TRANSITION  |27118       |6832.430376908976 |609.5445397890696 |
|2026-01-23 15:42:00|2026-01-23 15:43:00|GROUND      |64          |233.3625025600195 |14.790625000000002|
|2026-01-23 15:41:00|2026-01-23 15:42:00|TRANSITION  |27044       |6834.77202292639  |609.3469730809036 |
|2026-01-23 15:41:00|2026-01-23 15:42:00|GROUND      |68          |430.97823266422046|5.1635294117647055|
|2026-01-23 15:40:00|2026-01-23 15:41:00|GROUND      |69          |234.01130753669185|5.86376811594203  |
|2026-01-23 15:40:00|2026-01-23 15:41:00|TRANSITION  |27045       |6836.639210580109 |609.2889850249575 |
|2026-01-23 15:40:00|2026-01-23 15:41:00|CRUIS


  Anomalies: 2,820 rows

  Countries with the highest anomaly rates:

+-------------------+-------------------+--------------------+------------------+------------------+------------------+------------+------------+------------+------------+------------------+------------------+------------------+------------------+
|window_start       |window_end         |origin_country      |total_observations|altitude_anomalies|velocity_anomalies|anomaly_rate|max_altitude|min_altitude|max_velocity|avg_altitude      |avg_velocity      |stddev_altitude   |stddev_velocity   |
+-------------------+-------------------+--------------------+------------------+------------------+------------------+------------+------------+------------+------------+------------------+------------------+------------------+------------------+
|2026-01-23 14:59:00|2026-01-23 15:04:00|Nigeria             |11                |11                |0                 |1.0         |12192.0     |12184.38    |817.88      |12191.307262073864|813.6290909090909 |2.297551779043264 |2.897169151242152 |
|2026-01

26/01/23 15:48:00 ERROR NonFateSharingFuture: Failed to get result from future  
scala.runtime.NonLocalReturnControl
26/01/23 15:49:15 ERROR NonFateSharingFuture: Failed to get result from future  
scala.runtime.NonLocalReturnControl
26/01/23 15:52:39 ERROR NonFateSharingFuture: Failed to get result from future  
scala.runtime.NonLocalReturnControl