# Grid Consumption Forecasting with PySpark

This notebook demonstrates how to forecast grid consumption using PySpark. It involves:
- Loading and preprocessing weather and consumption data.
- Feature engineering, including temporal and lagged features.
- Training a Random Forest Regressor for prediction.
- Evaluating the model's performance.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lag, month, dayofweek, hour, unix_timestamp, from_unixtime
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

## Initialize Spark Session

In [None]:
spark = SparkSession.builder\
         .master("local")\
         .appName("Grid Consumption Forecasting")\
         .config('spark.ui.port', '4050')\
         .config("spark.executor.memory", "4g") \
         .getOrCreate()

## Load Weather and Consumption Data

Load the weather and grid consumption data from CSV files.

In [None]:
# Use this cell when running locally: read the CSV files from ../dataset
weather_file = "../dataset/weather_data.csv"
consumption_file = "../dataset/grid_consumption.csv"

weather_data = spark.read.csv(weather_file, header=True, inferSchema=True)
consumption_data = spark.read.csv(consumption_file, header=True, inferSchema=True)

When running in Google Colab, mount Google Drive and load the same CSV files from there instead.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# File paths to point to the mounted drive
weather_file = "/content/drive/MyDrive/Weather_Data/weather_data.csv"
consumption_file = "/content/drive/MyDrive/Weather_Data/grid_consumption.csv"

weather_data = spark.read.csv(weather_file, header=True, inferSchema=True)
consumption_data = spark.read.csv(consumption_file, header=True, inferSchema=True)

## Data Preprocessing

Cast the `Date` columns to Spark timestamps, truncate the weather timestamps to the start of the hour, and join the two datasets on `City` and `Date`.

In [None]:
# Cast 'Date' to timestamp in both datasets
weather_data = weather_data.withColumn("Date", col("Date").cast("timestamp"))
consumption_data = consumption_data.withColumn("Date", col("Date").cast("timestamp"))

# Truncate weather timestamps to the hour: floor epoch seconds to a multiple of 3600
weather_data = weather_data.withColumn("Date", (unix_timestamp("Date") / 3600).cast("long") * 3600)

# from_unixtime returns a string, so cast the result back to timestamp before joining
weather_data = weather_data.withColumn("Date", from_unixtime(col("Date")).cast("timestamp"))
merged_data = consumption_data.join(weather_data, on=["City", "Date"], how="inner")
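
The hourly alignment relies on integer arithmetic over epoch seconds. As a minimal plain-Python sketch of the same idea (the helper name `truncate_to_hour` and the sample reading are illustrative and not part of the notebook):

```python
from datetime import datetime, timezone

SECONDS_PER_HOUR = 3600

def truncate_to_hour(epoch_seconds: int) -> int:
    """Floor an epoch-seconds value to the start of its hour."""
    return (epoch_seconds // SECONDS_PER_HOUR) * SECONDS_PER_HOUR

# Hypothetical weather reading taken at 14:37:12 UTC
reading = datetime(2024, 1, 15, 14, 37, 12, tzinfo=timezone.utc)
aligned = datetime.fromtimestamp(
    truncate_to_hour(int(reading.timestamp())), tz=timezone.utc
)
print(aligned.isoformat())  # 2024-01-15T14:00:00+00:00
```

A reading at 14:37 is therefore matched with the consumption row stamped 14:00, assuming the consumption data is already recorded on the hour.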

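## Add Temporal Features

Derive `Hour`, `DayOfWeek`, and `Month` from `Date` on the merged dataset. The join on `City` and `Date` itself happens at the end of the preprocessing cell.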

In [None]:
merged_data = merged_data \
    .withColumn("Hour", hour(col("Date"))) \
    .withColumn("DayOfWeek", dayofweek(col("Date"))) \
    .withColumn("Month", month(col("Date")))
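
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's `datetime` conventions. The plain-Python sketch below mirrors the three derived columns for a single timestamp; the helper name `temporal_features` is hypothetical and only illustrates the encoding.

```python
from datetime import datetime

def temporal_features(ts: datetime) -> dict:
    """Return Hour, DayOfWeek and Month using Spark's numbering conventions."""
    return {
        "Hour": ts.hour,                       # 0-23, as with hour()
        "DayOfWeek": ts.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday, as with dayofweek()
        "Month": ts.month,                     # 1-12, as with month()
    }

# 14 January 2024 is a Sunday
print(temporal_features(datetime(2024, 1, 14, 9, 0)))
# {'Hour': 9, 'DayOfWeek': 1, 'Month': 1}
```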

## Feature Engineering

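Add lagged consumption features: for each city, the consumption values of the 24 preceding rows ordered by `Date` (`Lag_1` to `Lag_24`). Rows containing nulls, including the first 24 rows per city that lack a full lag history, are then dropped.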

In [None]:
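# Order each city's rows by time so that lag() looks back at earlier readings
window_spec = Window.partitionBy("City").orderBy("Date")

# Lag_1 ... Lag_24: consumption from the 1 to 24 preceding rows
for lag_val in range(1, 25):
    merged_data = merged_data.withColumn(f"Lag_{lag_val}", lag("Consumption (MW)", lag_val).over(window_spec))

# Drop rows with any null, including rows without a full lag history
merged_data = merged_data.dropna()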
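
To make the effect of `lag` followed by `dropna` concrete, here is a small plain-Python sketch on a made-up series for a single city. It uses 2 lags instead of 24, and the helper `add_lags` and the sample values are illustrative only.

```python
def add_lags(values, n_lags):
    """Pair each value with its n_lags predecessors; None marks missing history."""
    rows = []
    for i, value in enumerate(values):
        row = {"value": value}
        for k in range(1, n_lags + 1):
            # Look k positions back, or record None when there is no earlier reading
            row[f"Lag_{k}"] = values[i - k] if i - k >= 0 else None
        rows.append(row)
    return rows

consumption = [100.0, 110.0, 105.0, 120.0]  # made-up readings, ordered by time
rows = add_lags(consumption, n_lags=2)

# Equivalent of dropna(): keep only rows with a complete lag history
complete = [row for row in rows if None not in row.values()]
for row in complete:
    print(row)
# {'value': 105.0, 'Lag_1': 110.0, 'Lag_2': 100.0}
# {'value': 120.0, 'Lag_1': 105.0, 'Lag_2': 110.0}
```

With 24 lags, the same logic removes the first 24 rows of every city partition.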

## Prepare Data for Modeling

Use a `VectorAssembler` to combine features into a single vector for model training.

In [None]:
feature_columns = [
    "Temperature (C)", "Feels Like (C)", "Humidity (%)", "Pressure (hPa)",
    "Wind Speed (m/s)", "Cloudiness (%)", "Rain (1h mm)", "Hour", "DayOfWeek", "Month"
] + [f"Lag_{lag_val}" for lag_val in range(1, 25)]

assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(merged_data).select("features", col("Consumption (MW)").alias("label"))

## Split Data into Training and Testing Sets

In [None]:
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

## Train a Random Forest Regressor

In [None]:
# Initialize RandomForest model
rf = RandomForestRegressor(featuresCol="features", labelCol="label")

# Set up hyperparameter grid. numTrees is held at a single value;
# list several distinct values to also search over forest size.
param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [2])
              .addGrid(rf.maxDepth, [15, 20])
              .build())

# Cross-validator for model selection
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=param_grid,
                          evaluator=RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse"),
                          numFolds=3)

# Train model using cross-validation
cv_model = crossval.fit(train_data)
model = cv_model.bestModel
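
The cost of a grid search grows with the number of parameter combinations times the number of folds. As a rough plain-Python sketch of that count (the grid values below are placeholders chosen for illustration, not the notebook's settings):

```python
from itertools import product

# Placeholder grid: every combination of the listed values is one candidate
grid = {"numTrees": [20, 50], "maxDepth": [15, 20]}
num_folds = 3

param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
cv_fits = len(param_maps) * num_folds  # one fit per candidate per fold

print(len(param_maps), cv_fits)  # 4 12
```

Listing the same value twice for a parameter adds duplicate candidates, so it increases training time without widening the search.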

## Evaluate the Model

In [None]:
test_predictions = model.transform(test_data)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(test_predictions)

print(f"Root Mean Squared Error (RMSE): {rmse}")
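
For reference, RMSE is the square root of the mean squared difference between labels and predictions, so it is expressed in the label's units (MW here). A minimal plain-Python sketch with made-up numbers, using a hypothetical helper `rmse_of`:

```python
import math

def rmse_of(labels, predictions):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

labels = [100.0, 110.0, 120.0]       # made-up consumption values (MW)
predictions = [103.0, 106.0, 120.0]  # made-up model outputs (MW)
print(rmse_of(labels, predictions))  # square root of (9 + 16 + 0) / 3
```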

## Save the Model for Future Use

In [None]:
# Save the trained model using PySpark's persistence method
model.write().overwrite().save("spark_rf_model")

print(f"Model saved successfully with RMSE: {rmse}")