# Probabilistic Forecasting - Electricity

This notebook demonstrates data analysis and data preparation in Amazon SageMaker Studio using an AWS Glue Interactive Session.

By executing the cells in order, we read the data, visualize it, and apply transformations using PySpark in an AWS Glue Interactive Session.

Let's start preparing our dataset.

**SageMaker Studio Kernel**: DataScience 3.0 - Python3

***

# Dataset

The dataset (Electricity Price Forecasting) was downloaded from [Kaggle](https://www.kaggle.com/code/dimitriosroussis/electricity-price-forecasting-with-dnns-eda/data).
It contains past values of the electricity price together with other features related to energy generation and weather conditions.

# Step 1 - Import Modules

Here we’ll import some libraries and define some variables.

In [2]:
import boto3
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor

In [3]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

Create a SageMaker session and store the default region and the execution role in Python variables.

In [4]:
sagemaker_session = sagemaker.Session()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

In [5]:
bucket_name = sagemaker_session.default_bucket()
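When no bucket has been configured explicitly, `default_bucket()` in the SageMaker Python SDK resolves to a name of the form `sagemaker-{region}-{account_id}`. The helper below is a hypothetical illustration of that naming rule only; it does not call AWS, and `default_bucket_name` is not part of the SDK.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Return the conventional SageMaker default bucket name.

    Hypothetical helper: it mirrors the naming convention only and
    does not create or look up any bucket.
    """
    return f"sagemaker-{region}-{account_id}"


# Example with a placeholder account ID (not a real account).
print(default_bucket_name("eu-west-1", "123456789012"))
```

In the notebook itself, `sagemaker_session.default_bucket()` remains the right call, since it also takes care of resolving the account and region for the current session.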

***

# Step 2 - Run the processing job

With [PySparkProcessor](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html), we can submit a PySpark script to an Amazon SageMaker Processing job, which runs it in distributed data processing mode.

In [6]:
! pygmentize ./code/processing.py

import argparse
import csv
import logging
import numpy as np
import os
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.

## Global Parameters

To let users run the SageMaker Processing job locally, we define the variable `local_mode`. To try local mode, set it to `True`.

In [7]:
local_mode = False

In [8]:
processing_framework_version = "3.1"
processing_input_files_path = "electricity-forecasting/data/input"
processing_instance_count = 2
processing_output_files_path = "electricity-forecasting/data/output"

if local_mode:
    processing_instance_type = "local"
else:
    processing_instance_type = "ml.m5.xlarge"
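The two path variables above are S3 key prefixes rather than full URIs; the cells that follow combine them with the bucket name using `"s3://{}/{}".format(...)`. As a small sketch of the same idea, the hypothetical helper `s3_uri` below joins a bucket and one or more key parts while avoiding doubled slashes. The bucket name in the example is a placeholder.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI.

    Hypothetical helper: strips leading and trailing slashes from each
    part so that prefixes can be combined safely.
    """
    key = "/".join(p.strip("/") for p in parts if p)
    return f"s3://{bucket}/{key}"


input_uri = s3_uri("my-bucket", "electricity-forecasting/data/input")
logs_uri = s3_uri("my-bucket", "electricity-forecasting/", "/logs")
print(input_uri)
print(logs_uri)
```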

Define the `PySparkProcessor` object

In [9]:
processor = PySparkProcessor(
    framework_version=processing_framework_version,
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    sagemaker_session=sagemaker_session
)

In [10]:
run_args = processor.get_run_args(
    "./code/processing.py",
    inputs=[
        ProcessingInput(
            input_name="input",
            source="s3://{}/{}".format(bucket_name, processing_input_files_path),
            destination="/opt/ml/processing/input"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="output",
            source="/opt/ml/processing/output",
            destination="s3://{}/{}".format(bucket_name, processing_output_files_path)
        )
    ]
)

This function has been deprecated and could break pipeline step caching. We recommend using the run() function directly with pipeline sessions to access step arguments.
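The warning above suggests calling `run()` directly instead of going through `get_run_args()`. One way this could look is sketched below: the inputs and outputs are built inline and the local script path is passed as `submit_app`. The function name `run_spark_job` and its parameters are hypothetical; the paths and flags are the ones used elsewhere in this notebook.

```python
def run_spark_job(processor, bucket_name, input_prefix, output_prefix):
    """Submit ./code/processing.py via processor.run(), without get_run_args().

    Sketch only: `processor` is expected to be a PySparkProcessor.
    """
    # Imported here so the function can be defined before the SDK is loaded.
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    input_uri = f"s3://{bucket_name}/{input_prefix}"
    output_uri = f"s3://{bucket_name}/{output_prefix}"

    processor.run(
        submit_app="./code/processing.py",
        inputs=[
            ProcessingInput(
                input_name="input",
                source=input_uri,
                destination="/opt/ml/processing/input",
            )
        ],
        outputs=[
            ProcessingOutput(
                output_name="output",
                source="/opt/ml/processing/output",
                destination=output_uri,
            )
        ],
        arguments=[
            "--copy_hdfs", "1",
            "--bucket_name", bucket_name,
            "--processing_input_files_path", input_prefix,
            "--processing_output_files_path", output_prefix,
        ],
        spark_event_logs_s3_uri=f"s3://{bucket_name}/electricity-forecasting/logs",
        wait=True,
    )
```

In this notebook it would be called as `run_spark_job(processor, bucket_name, processing_input_files_path, processing_output_files_path)`, replacing both the `get_run_args` cell and the `run` cell.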


In [11]:
processor.run(
    submit_app=run_args.code,
    arguments=[
        "--copy_hdfs",
        "1",
        "--bucket_name",
        bucket_name,
        "--processing_input_files_path",
        processing_input_files_path,
        "--processing_output_files_path",
        processing_output_files_path
    ],
    inputs=run_args.inputs,
    outputs=run_args.outputs,
    spark_event_logs_s3_uri="s3://{}/electricity-forecasting/logs".format(bucket_name),
    wait=True
)

INFO:sagemaker:Creating processing-job with name sagemaker-spark-processing-2023-02-02-13-06-02-090



Job Name:  sagemaker-spark-processing-2023-02-02-13-06-02-090
Inputs:  [{'InputName': 'input', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-691148928602/electricity-forecasting/data/input', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-691148928602/sagemaker-spark-processing-2023-02-02-13-06-02-090/input/code/processing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-691148928602/electricity-forecasting/data/output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'output-2', 'AppManaged':