# Notebook 3: Silver 2 - Advanced Calculations (PySpark)

**Duration**: 20 minutes  
**Lakehouse**: Lakehouse_silver  
**Objective**: Use PySpark for calculations that are impossible in SQL

## Cell 1: PySpark imports

In [1]:
from pyspark.sql import functions as F, Window
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import pandas_udf
import pandas as pd

print("✅ PySpark imported")

✅ PySpark imported


## Cell 2: Load the Silver table

In [2]:
df = spark.table("Lakehouse_silver.silver.consumption_with_prices")
print(f"📊 Data loaded: {df.count()} rows")
df.show(5)

📊 Data loaded: 17280 rows
+-------------------+------------+------------------+------------------+------------------+------------+-------------+---------+
|               hour|     site_id|avg_consumption_mw|max_consumption_mw|min_consumption_mw|measurements|price_eur_mwh|   market|
+-------------------+------------+------------------+------------------+------------------+------------+-------------+---------+
|2025-01-03 13:00:00|SITE_COM_002|             0.862|             1.045|             0.601|           4|       101.28|EPEX Spot|
|2025-01-03 13:00:00|SITE_COM_002|             0.862|             1.045|             0.601|           4|        88.88|EPEX Spot|
|2025-01-03 13:00:00|SITE_COM_002|             0.862|             1.045|             0.601|           4|        90.48|EPEX Spot|
|2025-01-03 13:00:00|SITE_COM_002|             0.862|             1.045|             0.601|           4|        89.16|EPEX Spot|
|2025-01-04 19:00:00|SITE_RES_002|             0.305|     

## Cell 3: Add the is_weekend, day_of_week and hour_of_day columns

In [3]:
# Flag weekends and extract the day of week and hour of day.
# Spark's dayofweek() returns 1 = Sunday ... 7 = Saturday.
df = df.withColumn("is_weekend", F.dayofweek("hour").isin([1, 7]))
df = df.withColumn("day_of_week", F.dayofweek("hour"))
df = df.withColumn("hour_of_day", F.hour("hour"))

print("✅ Time columns added")
df.select("hour", "site_id", "avg_consumption_mw", "is_weekend", "day_of_week", "hour_of_day").show(10)

✅ Time columns added
+-------------------+------------+-------------------+----------+-----------+-----------+
|               hour|     site_id| avg_consumption_mw|is_weekend|day_of_week|hour_of_day|
+-------------------+------------+-------------------+----------+-----------+-----------+
|2025-01-03 13:00:00|SITE_COM_002|              0.862|     false|          6|         13|
|2025-01-03 13:00:00|SITE_COM_002|              0.862|     false|          6|         13|
|2025-01-03 13:00:00|SITE_COM_002|              0.862|     false|          6|         13|
|2025-01-03 13:00:00|SITE_COM_002|              0.862|     false|          6|         13|
|2025-01-04 19:00:00|SITE_RES_002|              0.305|      true|          7|         19|
|2025-01-04 19:00:00|SITE_RES_002|              0.305|      true|          7|         19|
|2025-01-04 19:00:00|SITE_RES_002|              0.305|      true|          7|         19|
|2025-01-04 19:00:00|SITE_RES_002|              0.305|      true|
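
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `datetime.isoweekday()` (1 = Monday, 7 = Sunday). The plain-Python sketch below mirrors the Spark convention so the weekend flag can be checked against the timestamps shown in the output above. `spark_dayofweek` and `is_weekend` are illustrative helper names, not part of the notebook.

```python
from datetime import datetime


def spark_dayofweek(ts: datetime) -> int:
    """Return the day number in Spark's convention: 1 = Sunday ... 7 = Saturday."""
    # isoweekday(): Monday = 1 ... Sunday = 7, so shift by one and wrap Sunday to 1.
    return ts.isoweekday() % 7 + 1


def is_weekend(ts: datetime) -> bool:
    """Same test as F.dayofweek("hour").isin([1, 7])."""
    return spark_dayofweek(ts) in (1, 7)


friday = datetime(2025, 1, 3, 13)    # first timestamp in the output above
saturday = datetime(2025, 1, 4, 19)  # second timestamp in the output above
print(spark_dayofweek(friday), is_weekend(friday))
print(spark_dayofweek(saturday), is_weekend(saturday))
```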

## Cell 4: Join with maintenance events

In [4]:
# Load maintenance events from Bronze
df_maintenance = spark.sql("SELECT * FROM Lakehouse_bronze.bronze.maintenance_events")

# Cast the event bounds to timestamps
df_maintenance = df_maintenance.withColumn("start_time", F.col("start_time").cast("timestamp"))
df_maintenance = df_maintenance.withColumn("end_time", F.col("end_time").cast("timestamp"))

# Left join on site_id, then flag rows whose hour falls inside an event.
# Note: the key is site_id only, so a site with several events gets one
# output row per (consumption row, event) pair.
df = df.join(
    df_maintenance.select("site_id", "start_time", "end_time"),
    on="site_id",
    how="left"
).withColumn(
    "in_maintenance",
    (F.col("hour") >= F.col("start_time")) & (F.col("hour") <= F.col("end_time"))
).fillna({"in_maintenance": False})

# Drop the temporary columns
df = df.drop("start_time", "end_time")

print(f"✅ Maintenance added: {df.filter('in_maintenance = true').count()} periods in maintenance")

✅ Maintenance added: 488 periods in maintenance
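
Because the join key is `site_id` only, each consumption row is paired with every maintenance event of its site before the interval test runs. A minimal plain-Python sketch of that effect, and of one way to collapse the result back to a single flag per row, follows. The sample rows and the helper names `join_on_site` and `flag_per_row` are invented for illustration.

```python
from datetime import datetime

# Hypothetical sample data: one consumption row, two maintenance events for the site.
consumption = [{"site_id": "S1", "hour": datetime(2025, 1, 10, 8)}]
events = [
    {"site_id": "S1", "start": datetime(2025, 1, 10, 6), "end": datetime(2025, 1, 10, 12)},
    {"site_id": "S1", "start": datetime(2025, 1, 20, 6), "end": datetime(2025, 1, 20, 12)},
]


def join_on_site(rows, events):
    """Left join on site_id, then evaluate the interval test per joined pair."""
    out = []
    for row in rows:
        matches = [e for e in events if e["site_id"] == row["site_id"]]
        if not matches:
            # No event for this site: keep the row with a False flag.
            out.append({**row, "in_maintenance": False})
        for e in matches:
            out.append({**row, "in_maintenance": e["start"] <= row["hour"] <= e["end"]})
    return out


def flag_per_row(rows, events):
    """One output row per input row: True if any event of the site covers the hour."""
    return [
        {**row, "in_maintenance": any(
            e["site_id"] == row["site_id"] and e["start"] <= row["hour"] <= e["end"]
            for e in events)}
        for row in rows
    ]


print(len(join_on_site(consumption, events)))  # one row per (row, event) pair
print(len(flag_per_row(consumption, events)))  # one row per consumption row
```

In Spark, a comparable collapse could be obtained by aggregating the flag per row key after the join. Whether it is needed depends on how many events each site has in `maintenance_events`.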


## Cell 5: Smart 7-day baseline (Window + exclusions)

In [5]:
# Sliding window per site: the current row plus the 168 preceding rows.
# rowsBetween counts rows, so this spans about 7 days only with one row per site and hour.
window_7d = Window.partitionBy("site_id").orderBy("hour").rowsBetween(-168, 0)

# Smart baseline: weekend and maintenance rows are excluded from the average
df = df.withColumn(
    "baseline_7d_mw",
    F.avg(
        F.when(
            (~F.col("in_maintenance")) & (~F.col("is_weekend")),
            F.col("avg_consumption_mw")
        )
    ).over(window_7d)
)

print("✅ 7-day baseline computed (excludes weekends + maintenance)")
df.select("hour", "site_id", "avg_consumption_mw", "baseline_7d_mw", "is_weekend", "in_maintenance").show(10)

✅ 7-day baseline computed (excludes weekends + maintenance)
+-------------------+------------+------------------+-------------------+----------+--------------+
|               hour|     site_id|avg_consumption_mw|     baseline_7d_mw|is_weekend|in_maintenance|
+-------------------+------------+------------------+-------------------+----------+--------------+
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|             0.2035|     false|         false|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|             0.2035|     false|         false|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|             0.2035|     false|         false|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|             0.2035|     false|         false|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|0.20349999999999996|     false|         false|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|             0.2035|     false|         false|
|2025-01-01 00:00:00|SITE_COM_001|        
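
`rowsBetween(-168, 0)` counts rows, so the frame covers seven days only when each site has exactly one row per hour. The sketch below shows the same baseline logic in plain Python with the frame defined on timestamps instead, skipping weekend and maintenance rows. The function name `baseline_7d` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def baseline_7d(rows, now, days=7):
    """Average consumption over the trailing `days` ending at `now` (inclusive),
    ignoring rows flagged as weekend or maintenance. Returns None if no row qualifies."""
    start = now - timedelta(days=days)
    values = [
        r["mw"] for r in rows
        if start <= r["hour"] <= now and not r["is_weekend"] and not r["in_maintenance"]
    ]
    return sum(values) / len(values) if values else None


# Hypothetical rows for one site: two normal weekdays, a weekend, a maintenance hour.
rows = [
    {"hour": datetime(2025, 1, 6, 10), "mw": 1.0, "is_weekend": False, "in_maintenance": False},
    {"hour": datetime(2025, 1, 7, 10), "mw": 3.0, "is_weekend": False, "in_maintenance": False},
    {"hour": datetime(2025, 1, 11, 10), "mw": 9.0, "is_weekend": True, "in_maintenance": False},
    {"hour": datetime(2025, 1, 13, 10), "mw": 50.0, "is_weekend": False, "in_maintenance": True},
]

print(baseline_7d(rows, datetime(2025, 1, 13, 10)))
```

In Spark, a time-based frame is typically expressed with `rangeBetween` over a numeric ordering, such as the timestamp cast to seconds.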

## Cell 6: Z-score calculation

In [7]:
from pyspark.sql import functions as F

# Mean and standard deviation per site over the sliding window
df = df.withColumn(
    "mean_consumption",
    F.avg("avg_consumption_mw").over(window_7d)
).withColumn(
    "stddev_consumption",
    F.stddev("avg_consumption_mw").over(window_7d)
)

# Manual z-score with built-in column functions; 0 when the stddev is null or zero
df = df.withColumn(
    "z_score",
    F.when(
        F.col("stddev_consumption") > 0,
        (F.col("avg_consumption_mw") - F.col("mean_consumption")) / F.col("stddev_consumption")
    ).otherwise(0)
)

# Flag anomalies (|z| > 3)
df = df.withColumn(
    "anomaly",
    F.when(F.abs(F.col("z_score")) > 3, "⚠️ ANOMALY").otherwise("Normal")
)

# Drop the temporary columns
df = df.drop("mean_consumption", "stddev_consumption")

anomaly_count = df.filter("anomaly = '⚠️ ANOMALY'").count()
print(f"✅ Z-score computed: {anomaly_count} anomalies detected")
df.filter("anomaly = '⚠️ ANOMALY'").show(10)

✅ Z-score computed: 315 anomalies detected
+------------+-------------------+------------------+------------------+------------------+------------+-------------+---------+----------+-----------+-----------+--------------+-------------------+------------------+-----------+
|     site_id|               hour|avg_consumption_mw|max_consumption_mw|min_consumption_mw|measurements|price_eur_mwh|   market|is_weekend|day_of_week|hour_of_day|in_maintenance|     baseline_7d_mw|           z_score|    anomaly|
+------------+-------------------+------------------+------------------+------------------+------------+-------------+---------+----------+-----------+-----------+--------------+-------------------+------------------+-----------+
|SITE_COM_001|2025-01-29 15:00:00|           2.33325|             5.548|             1.155|           4|        77.38|EPEX Spot|     false|          4|         15|          true| 0.6339504504504498|3.0212267379898865| ⚠️ ANOMALY|
|SITE_COM_002|2025-01-01 02
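
The z-score measures how many standard deviations the current value lies from the window mean, `z = (x - mean) / stddev`, with 0 returned when the deviation is undefined or zero. A small plain-Python sketch of the same rule follows. `z_score` and `label` are illustrative helpers, and the sketch uses the sample standard deviation, which is what Spark's `stddev` computes.

```python
from statistics import mean, stdev


def z_score(window, x):
    """z = (x - mean) / sample stddev of `window`; 0.0 when the stddev is undefined or zero."""
    if len(window) < 2:
        return 0.0  # sample stddev is undefined for fewer than two values
    sigma = stdev(window)
    return (x - mean(window)) / sigma if sigma > 0 else 0.0


def label(z, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    return "ANOMALY" if abs(z) > threshold else "Normal"


# Hypothetical window: twenty steady readings followed by one spike (the current value).
window = [1.0] * 20 + [10.0]
z = z_score(window, window[-1])
print(round(z, 2), label(z))
```

Because the window includes the current value, a spike also inflates the mean and the deviation it is compared against.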

## Cell 7: Feature engineering for ML (lags)

In [9]:
from pyspark.sql import Window

# Window WITHOUT rowsBetween, as needed for lag/lead
window_for_lag = Window.partitionBy("site_id").orderBy("hour")

# Lags: consumption 24 and 168 rows earlier (D-1 and D-7 with one row per site and hour)
df = df.withColumn("consumption_lag_1d", F.lag("avg_consumption_mw", 24).over(window_for_lag))
df = df.withColumn("consumption_lag_7d", F.lag("avg_consumption_mw", 168).over(window_for_lag))

# Ratio vs baseline
df = df.withColumn(
    "ratio_vs_baseline",
    F.when(F.col("baseline_7d_mw") > 0, F.col("avg_consumption_mw") / F.col("baseline_7d_mw")).otherwise(0)
)

print("✅ Features created: D-1 and D-7 lags, ratio vs baseline")
df.select("hour", "site_id", "avg_consumption_mw", "consumption_lag_1d", "consumption_lag_7d", "ratio_vs_baseline").show(10)

✅ Features created: D-1 and D-7 lags, ratio vs baseline
+-------------------+------------+------------------+------------------+------------------+------------------+
|               hour|     site_id|avg_consumption_mw|consumption_lag_1d|consumption_lag_7d| ratio_vs_baseline|
+-------------------+------------+------------------+------------------+------------------+------------------+
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|              NULL|              NULL|               1.0|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|              NULL|              NULL|               1.0|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|              NULL|              NULL|               1.0|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|              NULL|              NULL|               1.0|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|              NULL|              NULL|1.0000000000000002|
|2025-01-01 00:00:00|SITE_COM_001|            0.2035|  
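
`F.lag(col, 24)` steps back 24 rows, which equals one day only if the site has one row per hour with no gaps or duplicates. As a hedged plain-Python illustration, the lookup below keys the lag on the timestamp itself. `lag_by_time` and the sample series are made up for the example.

```python
from datetime import datetime, timedelta


def lag_by_time(series, hours):
    """Map each timestamp to the value observed exactly `hours` earlier (None if absent)."""
    delta = timedelta(hours=hours)
    return {ts: series.get(ts - delta) for ts in series}


# Hypothetical hourly series for one site with gaps between readings.
series = {
    datetime(2025, 1, 1, 0): 0.20,
    datetime(2025, 1, 1, 1): 0.25,
    datetime(2025, 1, 2, 1): 0.30,
    datetime(2025, 1, 3, 1): 0.35,
}

lag_1d = lag_by_time(series, 24)
print(lag_1d[datetime(2025, 1, 2, 1)])  # value from 2025-01-01 01:00
print(lag_1d[datetime(2025, 1, 1, 0)])  # no reading 24 hours earlier
```

In Spark, a comparable effect could be obtained with a self-join on `site_id` and the timestamp shifted by 24 hours, at the cost of an extra join.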

## Cell 8: Join with weather data

In [10]:
df_weather = spark.sql("SELECT * FROM Lakehouse_bronze.bronze.weather_data")
# Truncate weather timestamps to the hour so they align with the consumption rows
df_weather = df_weather.withColumn("hour", F.date_trunc("hour", F.col("timestamp").cast("timestamp")))

# Left join on hour only: several weather rows for the same hour duplicate consumption rows
df = df.join(
    df_weather.select("hour", "temperature_c", "wind_speed_ms"),
    on="hour",
    how="left"
)

print("✅ Weather added")

✅ Weather added
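
A left join keeps the left row count only if the join key is unique on the right side. The row count goes from 17280 at load time to 25920 at save time, so at least one of the joins in this notebook multiplies rows. Below is a plain-Python sketch of a uniqueness check that could be run on the weather keys before joining. `duplicated_keys`, the `station` field, and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime


def duplicated_keys(rows, key):
    """Return the key values that appear more than once in `rows`, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical weather rows: two stations report for the same hour.
weather = [
    {"hour": datetime(2025, 1, 1, 0), "station": "A", "temperature_c": 4.0},
    {"hour": datetime(2025, 1, 1, 0), "station": "B", "temperature_c": 5.0},
    {"hour": datetime(2025, 1, 1, 1), "station": "A", "temperature_c": 3.5},
]

print(duplicated_keys(weather, "hour"))
```

In Spark, the equivalent check is a `groupBy("hour").count()` filtered on `count > 1`.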


## Cell 9: Join with the sites reference table (broadcast)

In [11]:
from pyspark.sql.functions import broadcast

df_sites = spark.sql("SELECT * FROM Lakehouse_bronze.bronze.sites_reference")
df = df.join(broadcast(df_sites), on="site_id", how="left")

print("✅ Sites reference added (broadcast join)")
df.printSchema()

✅ Sites reference added (broadcast join)
root
 |-- site_id: string (nullable = true)
 |-- hour: timestamp (nullable = true)
 |-- avg_consumption_mw: double (nullable = true)
 |-- max_consumption_mw: double (nullable = true)
 |-- min_consumption_mw: double (nullable = true)
 |-- measurements: long (nullable = true)
 |-- price_eur_mwh: double (nullable = true)
 |-- market: string (nullable = true)
 |-- is_weekend: boolean (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- hour_of_day: integer (nullable = true)
 |-- in_maintenance: boolean (nullable = false)
 |-- baseline_7d_mw: double (nullable = true)
 |-- z_score: double (nullable = true)
 |-- anomaly: string (nullable = false)
 |-- consumption_lag_1d: double (nullable = true)
 |-- consumption_lag_7d: double (nullable = true)
 |-- ratio_vs_baseline: double (nullable = true)
 |-- temperature_c: double (nullable = true)
 |-- wind_speed_ms: double (nullable = true)
 |-- site_type: string (nullable = true)
 |-- capa
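
A broadcast join ships the small table to every executor, where it serves as an in-memory lookup, so the large table does not need to be shuffled. The idea can be sketched in plain Python as a dictionary lookup. `broadcast_left_join` and the sample rows are illustrative, and the column values are invented.

```python
def broadcast_left_join(big_rows, small_rows, key):
    """Left join `big_rows` to `small_rows` using an in-memory lookup on `key`.
    Assumes `key` is unique in `small_rows` (later duplicates overwrite earlier ones)."""
    lookup = {row[key]: row for row in small_rows}  # the "broadcast" side
    # Unmatched rows simply lack the extra columns here; Spark would fill them with nulls.
    return [{**row, **lookup.get(row[key], {})} for row in big_rows]


# Hypothetical data: a large fact table and a small reference table.
consumption = [
    {"site_id": "SITE_COM_001", "avg_consumption_mw": 0.2035},
    {"site_id": "SITE_XYZ", "avg_consumption_mw": 1.0},
]
sites = [{"site_id": "SITE_COM_001", "site_type": "commercial"}]

for row in broadcast_left_join(consumption, sites, "site_id"):
    print(row)
```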

## Cell 10: Save the enriched Silver table

In [12]:
df.write.mode("overwrite").format("delta").saveAsTable("silver.consumption_enriched")
print(f"✅ Enriched Silver saved: {df.count()} rows")

✅ Enriched Silver saved: 25920 rows


## Cell 11: SQL vs PySpark comparison

### 📊 SQL vs PySpark summary

| Task | Spark SQL | PySpark | Winner |
|-------|-----------|---------|-------|
| Simple cleaning | ✅ 3 min | ⚠️ 4 min | 🏆 SQL (more concise) |
| Smart 7-day baseline | ❌ Impossible (excluding weekends) | ✅ 2 min | 🏆 PySpark (only option) |
| Z-score with UDF | ❌ Impossible | ✅ 1 min | 🏆 PySpark (only option) |
| Feature engineering | ❌ Very complex | ✅ 2 min | 🏆 PySpark (more readable) |

**💡 Rule of thumb**:
- **SQL**: Import, cleaning, simple aggregations
- **PySpark**: Complex calculations, UDFs, business logic, ML

## Cell 12: Summary

### ✅ Enriched Silver complete

**PySpark additions**:
- ✅ Smart 7-day baseline (excludes weekends + maintenance)
- ✅ Z-score for anomaly detection
- ✅ Features for ML (lags, ratios, weather)
- ✅ Optimized joins (broadcast for small tables)

**Columns created**:
- is_weekend, day_of_week, hour_of_day
- in_maintenance
- baseline_7d_mw
- z_score, anomaly
- consumption_lag_1d, consumption_lag_7d
- ratio_vs_baseline
- temperature_c, wind_speed_ms
- site_type, capacity_mw, flexible, region, etc.

➡️ **Next step**: Predictive ML (Notebook 4)