# 03 - Streaming Silver & Silver_ML

Transformation pipeline:
- **Bronze → Silver**: cleaning and enrichment
- **Bronze → Silver_ML**: feature engineering for machine learning

## Configuration

In [1]:
from pyspark.sql.functions import (
    col, from_unixtime, to_timestamp, round,
    lag, avg, stddev, row_number, when, sqrt, pow, lit, min as spark_min, broadcast
)
from pyspark.sql.window import Window
from config import get_s3_path, create_spark_session

BRONZE_PATH = get_s3_path("bronze", "flights")
SILVER_PATH = get_s3_path("silver", "flights")
SILVER_ML_PATH = get_s3_path("silver", "flights_ml")
CHECKPOINT_SILVER = get_s3_path("checkpoints", "silver_flights")
CHECKPOINT_SILVER_ML = get_s3_path("checkpoints", "silver_ml_flights")
AIRPORTS_CSV = "./data/airports.csv"

spark = create_spark_session("StreamingSilver")

print(f"✅ Input:     {BRONZE_PATH}")
print(f"✅ Silver:    {SILVER_PATH}")
print(f"✅ Silver_ML: {SILVER_ML_PATH}")

✅ Configuration loaded from .env
:: loading settings :: url = jar:file:/opt/conda/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.apache.spark#spark-hadoop-cloud_2.12 added as a dependency
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5c42e1d1-0066-4ffb-af93-677331a368ec;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found org.apache.spark#spark-hadoop-cloud_2.12;3.5.3 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.c

✅ Spark Session 'StreamingSilver' configured
✅ Input:     s3a://datalake/bronze/flights
✅ Silver:    s3a://datalake/silver/flights
✅ Silver_ML: s3a://datalake/silver/flights_ml


## Loading airport data (for Silver_ML)

In [2]:
df_airports = (
    spark.read.option("header", "true").csv(AIRPORTS_CSV)
    # Filter on "type" before the select, since the select does not keep that column
    .filter(col("type").isin("large_airport", "medium_airport"))
    .select(
        col("ident").alias("airport_icao"),
        col("name").alias("airport_name"),
        col("iso_country").alias("airport_country"),
        col("latitude_deg").cast("double").alias("airport_lat"),
        col("longitude_deg").cast("double").alias("airport_lon")
    )
)

print(f"✅ {df_airports.count()} airports loaded")

✅ 5211 airports loaded


## Stream 1: Bronze → Silver

In [9]:
df_bronze_stream = spark.readStream.format("delta").load(BRONZE_PATH)

df_silver = df_bronze_stream \
    .filter(col("icao24").isNotNull()) \
    .filter(col("latitude").isNotNull() & col("longitude").isNotNull()) \
    .withColumn("event_timestamp", to_timestamp(from_unixtime(col("time")))) \
    .withColumn("velocity_kmh", round(col("velocity") * 3.6, 2)) \
    .withColumn("altitude_meters", col("baro_altitude")) \
    .select(
        "event_timestamp", "icao24", "callsign", "origin_country",
        "longitude", "latitude", "velocity_kmh", "altitude_meters",
        "on_ground", "category"
    )

print(f"üöÄ Stream 1: Bronze ‚Üí Silver")

query_silver = df_silver.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", CHECKPOINT_SILVER) \
    .option("mergeSchema", "true") \
    .start(SILVER_PATH)

🚀 Stream 1: Bronze → Silver


26/01/23 14:08:17 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
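
The per-record logic of Stream 1 can be sketched in plain Python, without Spark. This is a minimal illustration, not the notebook's implementation: the helper name `to_silver` and the sample values are hypothetical, and the timestamp is built in UTC as an assumption (Spark's `from_unixtime` uses the session time zone).

```python
from datetime import datetime, timezone


def to_silver(record):
    """Return a Silver-style dict for one Bronze record, or None if it is filtered out."""
    # Same filters as the stream: an aircraft id and a position are required
    if record.get("icao24") is None:
        return None
    if record.get("latitude") is None or record.get("longitude") is None:
        return None

    velocity = record.get("velocity")  # metres per second in Bronze
    return {
        # Epoch seconds -> timestamp (UTC is an assumption of this sketch)
        "event_timestamp": datetime.fromtimestamp(record["time"], tz=timezone.utc),
        "icao24": record["icao24"],
        "longitude": record["longitude"],
        "latitude": record["latitude"],
        # m/s -> km/h, rounded to 2 decimals as in the stream
        "velocity_kmh": None if velocity is None else round(velocity * 3.6, 2),
        "altitude_meters": record.get("baro_altitude"),
    }


# Hypothetical sample record, for illustration only
sample = {
    "time": 0, "icao24": "abc123",
    "latitude": 48.85, "longitude": 2.35,
    "velocity": 250.0, "baro_altitude": 10000.0,
}
row = to_silver(sample)
print(row["velocity_kmh"], row["event_timestamp"].isoformat())
```

Records without `icao24` or without a position return `None`, which mirrors the two `filter` calls in the stream definition.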




## Stream 2: Bronze → Silver_ML (Feature Engineering)

Transformation with ML features, computed directly from Bronze.

In [4]:
def process_ml_batch(batch_df, batch_id):
    """Process one micro-batch for Silver_ML, including feature engineering."""
    
    if batch_df.isEmpty():
        return
    
    # Bronze → Silver-format transformation
    df_base = batch_df \
        .filter(col("icao24").isNotNull()) \
        .filter(col("latitude").isNotNull() & col("longitude").isNotNull()) \
        .withColumn("event_timestamp", to_timestamp(from_unixtime(col("time")))) \
        .withColumn("velocity_kmh", round(col("velocity") * 3.6, 2)) \
        .withColumn("altitude_meters", col("baro_altitude"))
    
    # ML cleaning: drop rows with out-of-range (or null) altitude and speed
    df_clean = df_base \
        .filter(col("altitude_meters").between(-500, 15000)) \
        .filter(col("velocity_kmh").between(0, 1200))
    
    if df_clean.isEmpty():
        return
    
    # Temporal features (lag is computed within the current micro-batch only)
    window_aircraft = Window.partitionBy("icao24").orderBy("event_timestamp")
    
    df_temporal = df_clean \
        .withColumn("prev_altitude", lag("altitude_meters", 1).over(window_aircraft)) \
        .withColumn("prev_velocity", lag("velocity_kmh", 1).over(window_aircraft)) \
        .withColumn("altitude_change", col("altitude_meters") - col("prev_altitude")) \
        .withColumn("velocity_change", col("velocity_kmh") - col("prev_velocity")) \
        .withColumn("observation_rank", row_number().over(window_aircraft))
    
    # Airport join: nearest airport for aircraft on the ground
    df_on_ground = df_temporal.filter(col("on_ground") == True)
    df_in_flight = df_temporal.filter(col("on_ground") == False)
    
    if df_on_ground.count() > 0:
        df_with_airports = df_on_ground.crossJoin(broadcast(df_airports)).withColumn(
            "dist", sqrt(pow(col("latitude") - col("airport_lat"), 2) + pow(col("longitude") - col("airport_lon"), 2))
        )
        
        w = Window.partitionBy("icao24", "event_timestamp")
        df_closest = df_with_airports.withColumn("min_dist", spark_min("dist").over(w)) \
            .filter(col("dist") == col("min_dist")) \
            .drop("dist", "min_dist", "airport_lat", "airport_lon")
        
        df_enriched = df_closest.unionByName(
            df_in_flight.withColumn("airport_icao", lit(None).cast("string"))
                        .withColumn("airport_name", lit(None).cast("string"))
                        .withColumn("airport_country", lit(None).cast("string")),
            allowMissingColumns=True
        )
    else:
        # Cast the null placeholders so the columns are typed as strings, not NullType
        df_enriched = df_in_flight \
            .withColumn("airport_icao", lit(None).cast("string")) \
            .withColumn("airport_name", lit(None).cast("string")) \
            .withColumn("airport_country", lit(None).cast("string"))
    
    # Rolling-window features (current row and the 5 preceding rows)
    rolling_window = Window.partitionBy("icao24").orderBy("event_timestamp").rowsBetween(-5, 0)
    
    df_rolling = df_enriched \
        .withColumn("rolling_avg_altitude", avg("altitude_meters").over(rolling_window)) \
        .withColumn("rolling_std_altitude", stddev("altitude_meters").over(rolling_window)) \
        .withColumn("rolling_avg_velocity", avg("velocity_kmh").over(rolling_window))
    
    # flight_phase label
    df_ml = df_rolling.withColumn(
        "flight_phase",
        when(col("on_ground") == True, "GROUND")
        .when((col("altitude_change") > 50) & (col("altitude_meters") < 3000), "TAKEOFF")
        .when(col("altitude_change") > 20, "CLIMB")
        .when(col("altitude_change").between(-20, 20) & (col("altitude_meters") > 8000), "CRUISE")
        .when(col("altitude_change") < -20, "DESCENT")
        .otherwise("TRANSITION")
    )
    
    # Final column selection
    df_final = df_ml.select(
        "event_timestamp", "icao24", "callsign", "origin_country",
        "longitude", "latitude", "velocity_kmh", "altitude_meters",
        "on_ground", "category",
        "prev_altitude", "prev_velocity", "altitude_change", "velocity_change", "observation_rank",
        "airport_icao", "airport_name", "airport_country",
        "rolling_avg_altitude", "rolling_std_altitude", "rolling_avg_velocity",
        "flight_phase"
    )
    
    # Write to the Silver_ML Delta table
    df_final.write.format("delta").mode("append").save(SILVER_ML_PATH)
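
The `flight_phase` rules in `process_ml_batch` are evaluated in order, so the first matching condition wins. The plain-Python sketch below mirrors that ordering for a single observation; the function name and its arguments are illustrative, not part of the notebook. It also makes one consequence explicit: the first observation of an aircraft in a micro-batch has no previous row, so `altitude_change` is null and the label falls through to `TRANSITION`.

```python
def flight_phase(on_ground, altitude_change, altitude_meters):
    """Label one observation using the same rule order as the Spark `when` chain."""
    if on_ground:
        return "GROUND"
    # No previous observation in the batch: every comparison is null in Spark,
    # so the chain ends in the `otherwise` branch
    if altitude_change is None:
        return "TRANSITION"
    if altitude_change > 50 and altitude_meters < 3000:
        return "TAKEOFF"
    if altitude_change > 20:
        return "CLIMB"
    if -20 <= altitude_change <= 20 and altitude_meters > 8000:
        return "CRUISE"
    if altitude_change < -20:
        return "DESCENT"
    return "TRANSITION"


# Illustrative inputs: (on_ground, altitude_change in m, altitude in m)
examples = [
    (True, None, 0),
    (False, 60, 1000),
    (False, 60, 5000),
    (False, 0, 10000),
    (False, -50, 5000),
    (False, 0, 2000),
    (False, None, 9000),
]
for args in examples:
    print(args, "->", flight_phase(*args))
```

Level flight below 8000 m matches none of the named phases and is also labelled `TRANSITION`.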

In [10]:
df_bronze_ml_stream = spark.readStream.format("delta").load(BRONZE_PATH)

print(f"üöÄ Stream 2: Bronze ‚Üí Silver_ML (Feature Engineering)")

query_silver_ml = df_bronze_ml_stream.writeStream \
    .foreachBatch(process_ml_batch) \
    .option("checkpointLocation", CHECKPOINT_SILVER_ML) \
    .start()

                                                                                

🚀 Stream 2: Bronze → Silver_ML (Feature Engineering)


26/01/23 14:08:22 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
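
The nearest-airport step in `process_ml_batch` measures distance as the Euclidean norm of latitude and longitude differences in degrees. A degree of longitude covers less ground as latitude increases, so this can rank airports incorrectly away from the equator. The sketch below shows one possible alternative, the haversine great-circle distance, in plain Python. It is not what the notebook does; the helper names are hypothetical and the airport coordinates are approximate values used only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_airport(lat, lon, airports):
    """Return the single closest airport; `min` keeps exactly one row even on ties."""
    return min(
        airports,
        key=lambda a: haversine_km(lat, lon, a["airport_lat"], a["airport_lon"]),
    )


# Approximate coordinates, for illustration only
airports = [
    {"airport_icao": "LFPG", "airport_lat": 49.01, "airport_lon": 2.55},
    {"airport_icao": "LFPO", "airport_lat": 48.72, "airport_lon": 2.38},
]
closest = nearest_airport(49.0, 2.5, airports)
print(closest["airport_icao"])
```

Returning a single row also avoids a side effect of the `dist == min_dist` filter, which keeps every airport tied for the minimum and can therefore duplicate an observation.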




## Stream monitoring

In [None]:
import time

print("üìä Monitoring des streams (Ctrl+C pour arr√™ter)")
print("="*60)

try:
    while True:
        print(f"\n‚è±Ô∏è  {time.strftime('%H:%M:%S')}")
        print(f"  Silver:    {query_silver.status}")
        print(f"  Silver_ML: {query_silver_ml.status}")
        time.sleep(30)
except KeyboardInterrupt:
    print("\n‚èπÔ∏è  Arr√™t demand√©...")



üìä Monitoring des streams (Ctrl+C pour arr√™ter)

⏱️  14:08:25
  Silver:    {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:08:55
  Silver:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:09:25
  Silver:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:09:55
  Silver:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:10:25
  Silver:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:10:55
  Silver:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Silver_ML: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

26/01/23 14:11:04 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

⏱️  14:11:27
  Silver:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Silver_ML: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

26/01/23 14:11:39 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

⏱️  14:11:57
  Silver:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Silver_ML: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

26/01/23 14:12:11 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

⏱️  14:12:27
  Silver:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Silver_ML: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

26/01/23 14:12:43 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

⏱️  14:12:57
  Silver:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

⏱️  14:13:27
  Silver:    {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}
  Silver_ML: {'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}

26/01/23 14:13:38 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

⏱️  14:13:58
  Silver:    {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:14:28
  Silver:    {'message': 'Waiting for data to arrive', 'isDataAvailable': False, 'isTriggerActive': False}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

⏱️  14:14:58
  Silver:    {'message': 'Waiting for data to arrive', 'isDataAvailable': False, 'isTriggerActive': False}
  Silver_ML: {'message': 'Getting offsets from DeltaSource[s3a://datalake/bronze/flights]', 'isDataAvailable': False, 'isTriggerActive': True}

## Stopping the streams

In [None]:
query_silver.stop()
query_silver_ml.stop()
print("✅ All streams stopped")

✅ All streams stopped


## Verification

In [8]:
print("üìä Statistiques :")
print(f"  Bronze:    {spark.read.format('delta').load(BRONZE_PATH).count():,} lignes")
print(f"  Silver:    {spark.read.format('delta').load(SILVER_PATH).count():,} lignes")
print(f"  Silver_ML: {spark.read.format('delta').load(SILVER_ML_PATH).count():,} lignes")

print("\nüìä Distribution flight_phase (Silver_ML) :")
spark.read.format("delta").load(SILVER_ML_PATH).groupBy("flight_phase").count().orderBy("count", ascending=False).show()

üìä Statistiques :
  Bronze:    98,840 lignes
  Silver:    97,950 lignes
  Silver_ML: 88,432 lignes

üìä Distribution flight_phase (Silver_ML) :
+------------+-----+
|flight_phase|count|
+------------+-----+
|      CRUISE|37190|
|  TRANSITION|29374|
|     DESCENT|10699|
|       CLIMB| 8877|
|     TAKEOFF| 2035|
|      GROUND|  257|
+------------+-----+
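
The raw counts above are easier to interpret as percentages. The sketch below recomputes, in plain Python, the share of Bronze rows retained by each layer and the share of each `flight_phase` label. All numbers are copied from the output above; the variable and function names are illustrative.

```python
# Row counts copied from the verification output
layer_rows = {"Bronze": 98_840, "Silver": 97_950, "Silver_ML": 88_432}
phase_counts = {
    "CRUISE": 37_190,
    "TRANSITION": 29_374,
    "DESCENT": 10_699,
    "CLIMB": 8_877,
    "TAKEOFF": 2_035,
    "GROUND": 257,
}


def percentages(counts, total):
    """Express each count as a percentage of `total`, rounded to 2 decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Share of Bronze rows that survive in each layer
retention = percentages(layer_rows, layer_rows["Bronze"])
# Share of each label among all Silver_ML rows
phase_share = percentages(phase_counts, sum(phase_counts.values()))

# Sanity check: the label counts add up to the Silver_ML row count
assert sum(phase_counts.values()) == layer_rows["Silver_ML"]

for name, pct in retention.items():
    print(f"{name:<10} {pct:6.2f}% of Bronze")
for name, pct in phase_share.items():
    print(f"{name:<10} {pct:6.2f}% of Silver_ML")
```

The check confirms that every Silver_ML row carries exactly one label.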

