# Probabilistic Forecasting - Electricity

This notebook demonstrates data analysis and data preparation with Amazon SageMaker Studio using an AWS Glue Interactive Session.

By running the cells in order, we read the data, visualize it, and apply transformations using PySpark with an AWS Glue Interactive Session.

Let's start preparing our dataset.

In [None]:
%pip install -U -q sagemaker

***

# Dataset

The dataset (Electricity Price Forecasting) was downloaded from [Kaggle](https://www.kaggle.com/code/dimitriosroussis/electricity-price-forecasting-with-dnns-eda/data).

It contains past values of the electricity price together with other features related to energy generation and weather conditions.

# Step 1 - Import Modules

Here we’ll import some libraries and define some variables.

In [None]:
import os

# os.environ["AWS_PROFILE"] = "<aws_profile>"

In [None]:
import boto3
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor

In [None]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

Create a SageMaker session and store the default region and the execution role in Python variables.

In [None]:
sagemaker_session = sagemaker.Session()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

In [None]:
bucket_name = sagemaker_session.default_bucket()

bucket_name

***

# Step 2 - Prepare data and upload to S3

In [None]:
! python utils/syntetic_data_energy.py

In [None]:
! python utils/syntetic_data_weather.py

In [None]:
from pathlib import Path

output_dir = Path("./data/output")

for file_path in output_dir.rglob("*"):
    if file_path.is_file():
        # Create S3 key by replacing local path structure
        relative_path = file_path.relative_to(output_dir)
        s3_key = f"electricity-forecasting/data/input/{relative_path}"

        print(f"Uploading {file_path} to s3://{bucket_name}/{s3_key}")
        s3_client.upload_file(str(file_path), bucket_name, s3_key)

print("Upload complete!")
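
The upload loop above derives each S3 key from the file's path relative to `./data/output`. As a minimal sketch of that mapping, the helper below (`build_s3_key` is an illustrative name, and the file name is hypothetical) converts the relative path to forward slashes explicitly, so the key has the same shape regardless of the local OS path separator.

```python
from pathlib import Path, PurePosixPath


def build_s3_key(file_path: Path, base_dir: Path, prefix: str) -> str:
    """Map a local file under base_dir to an S3 key under prefix."""
    relative_path = file_path.relative_to(base_dir)
    # S3 keys use forward slashes, so convert explicitly instead of
    # relying on the local OS path separator.
    return str(PurePosixPath(prefix) / relative_path.as_posix())


key = build_s3_key(
    Path("data/output/energy/part-0000.csv"),  # hypothetical file name
    Path("data/output"),
    "electricity-forecasting/data/input",
)
print(key)
```

The returned key can be passed to `s3_client.upload_file` in place of the f-string used in the loop.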

***

# Step 3 - Upload Python Scripts

In [None]:
script_location = "electricity-forecasting/code/processing"

In [None]:
# Download the
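# Clean the target prefix first. delete_object removes only a single exact key,
# so delete every object stored under the prefix instead.
boto3.resource("s3").Bucket(bucket_name).objects.filter(Prefix=f"{script_location}/").delete()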

code_path = sagemaker_session.upload_data('./code', key_prefix=script_location)

code_path

***

# Step 4 - Run the processing job

With [PySparkProcessor](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html), we can submit PySpark scripts to an Amazon SageMaker Processing job and run them as a distributed data processing workload.

In [None]:
! pygmentize ./code/processing.py

## Global Parameters

To let users run the SageMaker Processing job locally, we define the variable `local_mode`. To try local mode, set the variable to `True`.

In [None]:
# Build the ECR image URI from the caller's account ID and the current region
account_id = boto3.client("sts").get_caller_identity()["Account"]
processing_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-spark-custom-container:3.3"

processing_code = "electricity-forecasting/code/processing"
processing_input_files_path = "electricity-forecasting/data/input"
processing_output_files_path = "electricity-forecasting/data/output"

processing_instance_count = 2
processing_instance_type = "ml.m5.12xlarge"

spark_configurations = [
    {
        "Classification":"spark-defaults",
        "Properties":{
            "spark.executor.cores": 5,
            "spark.driver.cores": 5,
            "spark.executor.memory": "35g",
            "spark.executor.memoryOverhead": "3g",
            "spark.driver.memory": "35g",
            "spark.executor.instances": 17,
            "spark.sql.parquet.fs.optimized.committer.optimization-enabled": True
        }
    }
]

processing_image_uri
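
The notebook does not state how the values in `spark_configurations` were chosen. As a hedged sketch, one common sizing heuristic (reserve one core per node for the OS and daemons, then give up one executor slot for the driver) happens to reproduce `spark.executor.instances = 17` for two `ml.m5.12xlarge` instances with 48 vCPUs each and 5 cores per executor. The function name and its parameters are illustrative assumptions, not part of the SageMaker or Spark APIs, and the sketch does not cover the memory settings.

```python
def executor_instances(nodes, vcpus_per_node, cores_per_executor,
                       reserved_cores_per_node=1, driver_slots=1):
    """Estimate spark.executor.instances with a common sizing heuristic."""
    # Leave some cores on each node for the OS and cluster daemons.
    usable_cores = vcpus_per_node - reserved_cores_per_node
    executors_per_node = usable_cores // cores_per_executor
    # One executor-sized slot across the cluster is left for the driver.
    return nodes * executors_per_node - driver_slots


# Two ml.m5.12xlarge instances (48 vCPUs each), 5 cores per executor
print(executor_instances(nodes=2, vcpus_per_node=48, cores_per_executor=5))  # -> 17
```

If the instance type or count changes, rerunning the function gives a starting point to tune from rather than a definitive setting.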

Define the `PySparkProcessor` object.

### Update:

Starting with container version `sagemaker-spark-processing:3.3-cpu-py39-v1.2`, SageMaker Spark containers provide an automatically optimized Spark configuration. To use it, set the environment variable `AWS_SPARK_CONFIG_MODE` to `"2"`:

```
env={
    "AWS_SPARK_CONFIG_MODE": "2"
}
```

In [None]:
processor = PySparkProcessor(
    image_uri=processing_image_uri,
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    sagemaker_session=sagemaker_session,
    env={
        "AWS_SPARK_CONFIG_MODE": "2"
    }
)

In [None]:
processor.run(
    "./code/processing.py",

    inputs=[
        ProcessingInput(
            input_name="input",
            source="s3://{}/{}/".format(bucket_name, processing_input_files_path),
            s3_data_distribution_type="ShardedByS3Key",
            destination="/opt/ml/processing/input"
        ),
        ProcessingInput(
            input_name="scripts",
            source="s3://{}/{}".format(bucket_name, processing_code),
            destination="/opt/ml/processing/input/code/scripts"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="output",
            source="/opt/ml/processing/output",
            destination="s3://{}/{}".format(bucket_name, processing_output_files_path))
    ],
    #configuration=spark_configurations,
    spark_event_logs_s3_uri="s3://{}/electricity-forecasting/logs".format(bucket_name),
    wait=False,
)
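
Because the job is started with `wait=False`, the call returns as soon as the job is submitted. A minimal sketch of how one might poll for completion is shown below. It takes the describe call as a parameter so it can be exercised without AWS access; `wait_for_processing_job` and its defaults are illustrative assumptions, while `describe_processing_job` and the `ProcessingJobStatus` field come from the boto3 SageMaker client.

```python
import time

# Statuses after which the job will not change state again
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe, job_name, poll_seconds=30, max_polls=120):
    """Poll describe(ProcessingJobName=...) until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = describe(ProcessingJobName=job_name)["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


# In the notebook this could be called as follows (requires AWS credentials):
# status = wait_for_processing_job(
#     sagemaker_client.describe_processing_job,
#     processor.latest_job.job_name,
# )
```

Alternatively, `processor.latest_job.wait()` blocks until the job finishes and streams its logs.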