# Notebook 4: Silver 3 - ML Prediction (PySpark MLlib)

**Duration**: 20 minutes  
**Lakehouse**: Lakehouse_silver  
**Objective**: Train a model that predicts next-day (D+1) consumption

## Cell 1: MLlib imports

In [None]:
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import functions as F

print("✅ MLlib imported")

## Cell 2: Loading the enriched Silver data

In [None]:
df = spark.table("Lakehouse_silver.silver.consumption_enriched")
print(f"📊 Data loaded: {df.count()} rows")
df.printSchema()

## Cell 3: Preparing features for ML

In [None]:
# Relevant features
features_cols = [
    "hour_of_day",
    "day_of_week",
    "temperature_c",
    "wind_speed_ms",
    "consumption_lag_1d",
    "consumption_lag_7d",
    "baseline_7d_mw",
    "price_eur_mwh"
]

target_col = "avg_consumption_mw"

# Drop rows containing NULLs (introduced by the lag features)
df_ml = df.select(features_cols + [target_col, "hour", "site_id"]).na.drop()

print(f"📊 ML dataset: {df_ml.count()} rows (after dropping NULLs)")
df_ml.show(5)
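
The `na.drop()` call above removes rows where a lag feature is undefined. The sketch below is plain Python, not Spark code, and it assumes the lag columns were built upstream by shifting each series back in time (this notebook does not show that step). The `lag` helper and the sample values are hypothetical; the point is only that the first rows of a shifted series have no earlier value to look up.

```python
def lag(values, offset):
    """Shift a series by `offset` positions; the leading slots have no history."""
    head = [None] * min(offset, len(values))
    return head + values[: max(len(values) - offset, 0)]

consumption = [10.0, 12.0, 11.0, 13.0, 14.0]  # made-up daily values

# One row per day: (value, value 1 step earlier, value 3 steps earlier)
rows = list(zip(consumption, lag(consumption, 1), lag(consumption, 3)))

# Equivalent in spirit to na.drop(): keep only rows with no missing field
complete_rows = [row for row in rows if None not in row]
print(len(rows), "rows before,", len(complete_rows), "rows after dropping NULLs")
```

With a 7-day lag on hourly data, the same effect would be expected to remove roughly the first week of observations for each series.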

## Cell 4: VectorAssembler + StandardScaler

In [None]:
assembler = VectorAssembler(
    inputCols=features_cols,
    outputCol="features_raw"
)

scaler = StandardScaler(
    inputCol="features_raw",
    outputCol="features",
    withStd=True,
    withMean=True
)

print("✅ VectorAssembler + Scaler configured")
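
With `withMean=True` and `withStd=True`, each feature is centered on its mean and divided by its standard deviation. The plain-Python sketch below illustrates that arithmetic on one made-up column. It assumes the sample (n - 1) standard deviation, which is what the Spark documentation describes for `StandardScaler`; the `standardize` helper is hypothetical and is not part of MLlib.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a column on its mean and scale it to unit sample standard deviation."""
    mu = mean(column)
    sigma = stdev(column)  # sample standard deviation (divides by n - 1)
    return [(x - mu) / sigma for x in column]

temperature_c = [2.0, 4.0, 6.0, 8.0]  # made-up values
scaled = standardize(temperature_c)
print([round(v, 4) for v in scaled])
```

Scaling puts features with very different ranges (for example `hour_of_day` and `price_eur_mwh`) on a comparable scale, so the regularization penalty treats their coefficients more evenly. In the pipeline, the scaler statistics are fitted on the training set only and then reused on the test set.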

## Cell 5: Train/Test split (80/20, chronological)

In [None]:
# Chronological split: earliest 80% of distinct timestamps = train, latest 20% = test
dates_sorted = df_ml.select("hour").distinct().orderBy("hour").collect()
split_idx = int(len(dates_sorted) * 0.8)
split_date = dates_sorted[split_idx]["hour"]

train_df = df_ml.filter(F.col("hour") < split_date)
test_df = df_ml.filter(F.col("hour") >= split_date)

print(f"📊 Train: {train_df.count()} rows (~80% of timestamps)")
print(f"📊 Test : {test_df.count()} rows (~20% of timestamps)")
print(f"📅 Split date: {split_date}")
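
The split logic above can be sketched in plain Python on made-up rows (two hypothetical sites, ten hourly timestamps). It mirrors the cell's approach: sort the distinct timestamps, take the one at the 80% position as the cut-off, and send everything before it to train.

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 1)
# Made-up rows: (hour, site_id), two sites over 10 hourly timestamps
rows = [(start + timedelta(hours=h), site) for h in range(10) for site in ("A", "B")]

distinct_hours = sorted({hour for hour, _ in rows})
split_hour = distinct_hours[int(len(distinct_hours) * 0.8)]

train = [r for r in rows if r[0] < split_hour]
test = [r for r in rows if r[0] >= split_hour]
print(len(train), "train rows,", len(test), "test rows, split at", split_hour)
```

Because the cut is made on distinct timestamps rather than on rows, the row counts match 80/20 only when every timestamp carries about the same number of rows. Either way, no test row predates a train row, which is what keeps future information out of training.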

## Cell 6: Training the LinearRegression model

In [None]:
lr = LinearRegression(
    featuresCol="features",
    labelCol=target_col,
    predictionCol="prediction",
    maxIter=100,
    regParam=0.01
)

pipeline = Pipeline(stages=[assembler, scaler, lr])

print("🚀 Training the model...")
model = pipeline.fit(train_df)
print("✅ Model trained")

# Coefficients of the fitted regression (last pipeline stage)
lr_model = model.stages[-1]
print(f"\n📊 Intercept: {lr_model.intercept:.4f}")
print(f"📊 Number of coefficients: {len(lr_model.coefficients)}")

## Cell 7: Predictions on the test set

In [None]:
predictions = model.transform(test_df)

predictions.select(
    "hour", 
    "site_id", 
    target_col, 
    "prediction"
).orderBy("hour").show(20)

## Cell 8: Evaluation (RMSE, MAE, R²)

In [None]:
evaluator_rmse = RegressionEvaluator(
    labelCol=target_col,
    predictionCol="prediction",
    metricName="rmse"
)

evaluator_mae = RegressionEvaluator(
    labelCol=target_col,
    predictionCol="prediction",
    metricName="mae"
)

evaluator_r2 = RegressionEvaluator(
    labelCol=target_col,
    predictionCol="prediction",
    metricName="r2"
)

rmse = evaluator_rmse.evaluate(predictions)
mae = evaluator_mae.evaluate(predictions)
r2 = evaluator_r2.evaluate(predictions)

print("\n" + "="*50)
print("📊 MODEL METRICS")
print("="*50)
print(f"RMSE (Root Mean Square Error)      : {rmse:.4f} MW")
print(f"MAE  (Mean Absolute Error)         : {mae:.4f} MW")
print(f"R²   (Coefficient of determination): {r2:.4f}")
print("="*50)

if r2 > 0.7:
    print("✅ Strong model (R² > 0.7)")
elif r2 > 0.5:
    print("⚠️ Acceptable model (R² between 0.5 and 0.7)")
else:
    print("❌ Weak model (R² < 0.5)")
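
For reference, the three metrics can be computed by hand. The plain-Python sketch below uses the standard definitions on made-up values; the `regression_metrics` helper is hypothetical and only illustrates what the evaluators report.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) using the standard definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    rmse = sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

actual = [100.0, 110.0, 120.0, 130.0]     # made-up consumption in MW
predicted = [102.0, 108.0, 123.0, 129.0]  # made-up predictions in MW
rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE={rmse:.4f} MW, MAE={mae:.4f} MW, R2={r2:.4f}")
```

RMSE and MAE are in the target's unit (MW here), and RMSE penalizes large errors more heavily than MAE. R² is unitless and can be negative on a test set when the model does worse than predicting the mean.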

## Cell 9: Error analysis

In [None]:
# Compute the absolute error of each prediction
predictions = predictions.withColumn(
    "error_abs",
    F.abs(F.col(target_col) - F.col("prediction"))
)

error_stats = predictions.select(
    F.mean("error_abs").alias("mean_error"),
    F.stddev("error_abs").alias("std_error"),
    F.max("error_abs").alias("max_error"),
    F.min("error_abs").alias("min_error")
).collect()[0]

print("\n📊 ERROR STATISTICS")
print("="*50)
print(f"Mean error         : {error_stats['mean_error']:.4f} MW")
print(f"Standard deviation : {error_stats['std_error']:.4f} MW")
print(f"Maximum error      : {error_stats['max_error']:.4f} MW")
print(f"Minimum error      : {error_stats['min_error']:.4f} MW")
print("="*50)

# Top 10 worst predictions
print("\n⚠️ Top 10 worst predictions:")
predictions.select(
    "hour", 
    "site_id", 
    target_col, 
    "prediction", 
    "error_abs"
).orderBy(F.desc("error_abs")).show(10)

## Cell 10: Comparison with a naive baseline

In [None]:
# Naive baseline: prediction = consumption on the previous day (D-1)
predictions_baseline = predictions.withColumn(
    "prediction_baseline",
    F.col("consumption_lag_1d")
)

rmse_baseline = RegressionEvaluator(
    labelCol=target_col,
    predictionCol="prediction_baseline",
    metricName="rmse"
).evaluate(predictions_baseline)

print("\n📊 MODEL vs NAIVE BASELINE")
print("="*50)
print(f"RMSE ML model       : {rmse:.4f} MW")
print(f"RMSE baseline (D-1) : {rmse_baseline:.4f} MW")
improvement = ((rmse_baseline - rmse) / rmse_baseline * 100)
print(f"Improvement         : {improvement:.1f}%")
print("="*50)

if rmse < rmse_baseline:
    print("✅ The ML model beats the naive baseline")
else:
    print("❌ The ML model does not improve on the baseline")
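
The comparison above reduces to a relative RMSE reduction. The plain-Python sketch below works through that arithmetic on made-up values, where the naive forecast simply repeats the previous day's consumption; the `rmse_of` helper and all numbers are hypothetical.

```python
from math import sqrt

def rmse_of(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [100.0, 104.0, 98.0, 102.0]      # made-up consumption in MW
lag_1d = [96.0, 100.0, 104.0, 98.0]       # naive forecast: previous day's value
model_pred = [101.0, 102.0, 99.0, 103.0]  # made-up model predictions

rmse_model = rmse_of(actual, model_pred)
rmse_naive = rmse_of(actual, lag_1d)

# Positive means the model reduces the error; negative means it is worse
improvement_pct = (rmse_naive - rmse_model) / rmse_naive * 100
print(f"model={rmse_model:.4f} MW, naive={rmse_naive:.4f} MW, gain={improvement_pct:.1f}%")
```

A model whose inputs already include `consumption_lag_1d` should normally do at least as well as this baseline, so a negative improvement would suggest revisiting the features or the split.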

## Cell 11: Saving the predictions

In [None]:
# Save all predictions (train + test)
all_predictions = model.transform(df_ml)

all_predictions.write.mode("overwrite").format("delta").saveAsTable("silver.consumption_predictions")

print(f"✅ Predictions saved: {all_predictions.count()} rows")
print("📍 Table: silver.consumption_predictions")

## Cell 12: Summary

### ✅ ML prediction complete

**Model trained**:
- Algorithm: Linear Regression
- Features: 8 (hour, day, weather, lags, baseline, price)
- Train/Test: 80/20 (chronological split)

**Performance**:
- RMSE, MAE and R² printed above
- Improvement over the naive baseline computed

**Predictions saved**:
- 📍 Table: silver.consumption_predictions

**💡 Why was PySpark required?**
- ❌ Spark SQL alone cannot train ML models (MLlib has no SQL interface)
- ✅ PySpark gives access to MLlib's VectorAssembler, LinearRegression and Pipeline
- ✅ PySpark supports evaluation with metrics (RMSE, MAE, R²)

➡️ **Next step**: Business aggregations in Gold (Notebook 5)