# Looking at 5-minute pre-dispatch demand forecast errors in 2021

In this example, we will take a look at 5-minute pre-dispatch ({term}`5MPD`) demand forecast "error" (the difference between actual and forecasted demand) for 2021. AEMO runs {term}`5MPD` to provide system and market information for the next hour.

We'll look at forecast "error" on a NEM-wide basis; that is, we will sum actual scheduled demand across all NEM regions and then compare that to the sum of forecast scheduled demand across all NEM regions. 

The code below could be modified to do this analysis on a region by region basis.
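
As a rough illustration of that region-by-region variant, the sketch below keeps `REGIONID` as a merge key instead of summing across it. The column names mirror those used later in this example, but the demand values are invented and the frames are built by hand rather than compiled by `NEMOSIS` or `NEMSEER`.

```python
import pandas as pd

# invented actual demand (MW) for two regions at one dispatch interval
actual = pd.DataFrame(
    {
        "forecasted_time": pd.to_datetime(["2021-01-01 00:30:00"] * 2),
        "REGIONID": ["NSW1", "VIC1"],
        "TOTALDEMAND": [7000.0, 5000.0],
    }
)
# invented forecasts for the same interval from two P5MIN runs
forecast = pd.DataFrame(
    {
        "forecasted_time": pd.to_datetime(["2021-01-01 00:30:00"] * 4),
        "RUN_DATETIME": pd.to_datetime(
            ["2021-01-01 00:20:00"] * 2 + ["2021-01-01 00:25:00"] * 2
        ),
        "REGIONID": ["NSW1", "VIC1", "NSW1", "VIC1"],
        "FORECAST_TOTALDEMAND": [6950.0, 5030.0, 6990.0, 5010.0],
    }
)
# merging on REGIONID as well as forecasted_time keeps regions separate
regional = forecast.merge(actual, on=["forecasted_time", "REGIONID"], how="left")
regional["error"] = regional["TOTALDEMAND"] - regional["FORECAST_TOTALDEMAND"]
print(regional[["REGIONID", "RUN_DATETIME", "error"]])
```

Any later grouping (e.g. by ahead time) would then also need to include `REGIONID` so that percentiles are computed per region.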

## Key imports

In [1]:
# standard libraries
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from pathlib import Path
import multiprocessing as mp

# NEM data libraries
# NEMOSIS for actual demand data
# NEMSEER for forecast demand data
import nemosis
from nemseer import compile_data, download_raw_data, generate_runtimes
from nemseer.data import DATETIME_FORMAT

# data wrangling libraries
import numpy as np
import pandas as pd

# interactive plotting
import plotly.express as px
import plotly.io as pio

# progress bar for error computation
from tqdm.autonotebook import tqdm

# suppress logging from NEMSEER and NEMOSIS
import logging

logging.getLogger("nemosis").setLevel(logging.WARNING)
logging.getLogger("nemseer").setLevel(logging.ERROR)

## Defining our analysis start and end dates

In [2]:
analysis_start = "2021/01/01 00:05:00"
analysis_end = "2022/01/01 00:00:00"

## Obtaining actual demand data from `NEMOSIS`

We will download `DISPATCHREGIONSUM` to access the `TOTALDEMAND` field (actual scheduled demand).

We'll first download the data we need and cache it so that it's ready for computation.

In [3]:
nemosis_cache = Path("nemosis_cache/")
if not nemosis_cache.exists():
    nemosis_cache.mkdir()

In [4]:
nemosis.cache_compiler(
    analysis_start, analysis_end, "DISPATCHREGIONSUM", nemosis_cache, fformat="parquet"
)

## Obtaining forecast demand data from `NEMSEER`

We will download `REGIONSOLUTION` to access the `TOTALDEMAND` field in `P5MIN` forecasts.

We'll first download the data we need and cache it so that it's ready for computation.

In [5]:
download_raw_data(
    "P5MIN",
    "REGIONSOLUTION",
    "nemseer_cache/",
    forecasted_start=analysis_start,
    forecasted_end=analysis_end,
)

## Calculating forecast error

Below we calculate demand forecast error for `P5MIN` forecasts using forecast demand data and actual demand data. 

```{attention}

The {term}`actual run time` of 5MPD is approximately 5 minutes before the nominal {term}`run time`. We will adjust for this when calculating forecast ahead times. See the note in {ref}`this section <quick_start:core concepts and information for users>`.
```
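
The adjustment can be illustrated with a single made-up forecast. The timestamps below are invented for illustration, and the 5-minute offset is the approximation described in the note above.

```python
from datetime import datetime, timedelta

# nominal run time reported for a 5MPD run, and the interval being forecast
nominal_run_time = datetime(2021, 1, 1, 12, 0)
forecasted_time = datetime(2021, 1, 1, 12, 30)

# assume the run actually executes about 5 minutes before the nominal run time
actual_run_time = nominal_run_time - timedelta(minutes=5)

# ahead time is measured from the actual run time, not the nominal one
ahead_time = forecasted_time - actual_run_time
print(ahead_time)  # 0:35:00
```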

We provide two methods below:

1. A **simpler** implementation that uses handy functionalities from both `xarray` and `pandas`. It is a quick way to compute demand forecast error for a handful of forecasted intervals. We also show how to run it over a longer period (e.g. a year), but you should prefer the next method for that unless RAM/memory is a limiting factor. Note that using `multiprocessing` with this method speeds things up, but also consumes more memory.

2. A **vectorised**, pure-`pandas` implementation. This implementation requires more lines of `pandas` code, but is much faster and preferable to the first implementation if you are computing error across a longer period (e.g. a year). However, as data for the entire period is loaded into memory, adapt the length of the period you select to your machine specifications (e.g. a year's worth of forecast data consumed ~15GB on the test machine).
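
If a full year does not fit in memory, one option is to split the analysis period into shorter windows and process them one at a time. The sketch below is one way to do that: `split_into_monthly_windows` and `start_of_next_month` are hypothetical helpers (not part of `NEMSEER` or `NEMOSIS`) that return start and end strings in the same format as `analysis_start` and `analysis_end`.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

FMT = "%Y/%m/%d %H:%M:%S"


def start_of_next_month(dt: datetime) -> datetime:
    """Midnight on the first day of the month after dt."""
    year = dt.year + dt.month // 12
    month = dt.month % 12 + 1
    return datetime(year, month, 1)


def split_into_monthly_windows(start: str, end: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: split [start, end] into month-long windows."""
    window_start = datetime.strptime(start, FMT)
    final_end = datetime.strptime(end, FMT)
    windows = []
    while window_start <= final_end:
        # each window ends at midnight starting the next month (or at final_end)
        window_end = min(start_of_next_month(window_start), final_end)
        windows.append((window_start.strftime(FMT), window_end.strftime(FMT)))
        # the next window begins with the following 5-minute interval
        window_start = window_end + timedelta(minutes=5)
    return windows


windows = split_into_monthly_windows("2021/01/01 00:05:00", "2022/01/01 00:00:00")
print(len(windows))  # 12
print(windows[0])  # ('2021/01/01 00:05:00', '2021/02/01 00:00:00')
```

Each window could then be passed to the error calculation separately, with the resulting DataFrames concatenated or written to disk as they are produced.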

### `xarray` + `pandas` implementation (simpler code)

The code below uses functionalities offered by `NEMOSIS`, `NEMSEER` and `xarray` to simplify coding effort. 

In [6]:
def calculate_p5min_demand_forecast_error_simpler(forecasted_time: str) -> pd.DataFrame:
    """
    Calculates P5MIN demand forecast error (Actual - Forecast) for all forecasts
    that are run for a given forecasted_time.

    Args:
        forecasted_time: Datetime string in the form YYYY/mm/dd HH:MM:SS
    Returns:
        pandas DataFrame with forecast error in `TOTALDEMAND` columns, the ahead time
        of the forecast run in `ahead_time`, and the forecasted time in
        `forecasted_time`.
    """
    # necessary for datetime indexing with pandas and xarray
    time = str(forecasted_time).replace("-", "/")
    # get forecast data for forecasted_time
    run_start, run_end = generate_runtimes(time, time, "P5MIN")
    nemseer_data = compile_data(
        run_start,
        run_end,
        time,
        time,
        "P5MIN",
        "REGIONSOLUTION",
        "nemseer_cache/",
        data_format="xr",
    )
    demand_forecasts = nemseer_data["REGIONSOLUTION"]["TOTALDEMAND"]
    # get actual demand data for forecasted_time
    # nemosis start time must precede end of interval of interest by 5 minutes
    nemosis_start = (
        datetime.strptime(time, "%Y/%m/%d %H:%M:%S") - timedelta(minutes=5)
    ).strftime("%Y/%m/%d %H:%M:%S")
    # compile data using nemosis, using cached parquet and filtering out interventions
    nemosis_data = nemosis.dynamic_data_compiler(
        nemosis_start,
        time,
        "DISPATCHREGIONSUM",
        nemosis_cache,
        filter_cols=["INTERVENTION"],
        filter_values=([0],),
        fformat="parquet",
    )
    # sum actual demand across regions
    actual_demand = nemosis_data.groupby("SETTLEMENTDATE")["TOTALDEMAND"].sum()[time]
    # sum forecast demand across regions
    query_forecasts = demand_forecasts.sum(dim="REGIONID").sel(forecasted_time=time)
    # calculate error and return as a pandas DataFrame
    error = (actual_demand - query_forecasts).to_dataframe()
    # calculate number of minutes ahead, but adjust for nominal vs actual run time of P5MIN
    error["ahead_time"] = error["forecasted_time"] - (
        error.index - timedelta(minutes=5)
    )
    error = error.set_index("forecasted_time")
    return error

#### Computing error across 2021

```{caution}
While this code demonstrates how you could use the `pandas` + `xarray` implementation to compute error across a year, we only provide this as an example. We recommend you use the vectorised implementation if your system memory permits.
```

Because we haven't optimised our code, it will take a while to calculate forecast error across a year.

To speed up computation, we will use Python's [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) module. In this example, we use 10 simultaneous processes.

`tqdm` provides us with a progress bar that shows how many iterations are completed per second, as well as progress over all intervals in the year of interest.

Results DataFrames are added to a list as processes finish computation. Once they've finished, we can concatenate these DataFrames to get a single forecast error DataFrame.

In [7]:
times = pd.date_range(analysis_start, analysis_end, freq="5min")
with mp.Pool(10) as p:
    results = list(
        tqdm(
            p.imap(calculate_p5min_demand_forecast_error_simpler, times),
            total=len(times),
        )
    )
forecast_error = pd.concat(results, axis=0)


### pure-`pandas` implementation (vectorised code)

The code below uses functionalities offered by `NEMOSIS`, `NEMSEER` and `pandas` to quickly calculate demand forecast error across a longer period.
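
Before the full function, here is a minimal sketch of the core idea on invented data: a left merge attaches the single actual value for each forecasted interval to every forecast run that targeted it, after which the error and ahead time are plain column arithmetic. The column names follow those used in the function below, and the demand values are made up.

```python
import pandas as pd

# one invented NEM-wide actual value (MW) per dispatch interval
actual_demand = pd.DataFrame(
    {
        "forecasted_time": pd.to_datetime(
            ["2021-01-01 00:30:00", "2021-01-01 00:35:00"]
        ),
        "TOTALDEMAND": [20000.0, 20100.0],
    }
)
# two invented forecasts (one per run) for each interval
forecast_demand = pd.DataFrame(
    {
        "forecasted_time": pd.to_datetime(
            ["2021-01-01 00:30:00"] * 2 + ["2021-01-01 00:35:00"] * 2
        ),
        "RUN_DATETIME": pd.to_datetime(
            ["2021-01-01 00:25:00", "2021-01-01 00:30:00"] * 2
        ),
        "FORECAST_TOTALDEMAND": [19950.0, 20000.0, 20180.0, 20120.0],
    }
)
# left merge: every forecast row receives the matching actual value
merged = forecast_demand.merge(actual_demand, on="forecasted_time", how="left")
merged["error"] = merged["TOTALDEMAND"] - merged["FORECAST_TOTALDEMAND"]
# subtract 5 minutes from the nominal run time to approximate the actual run time
merged["ahead_time"] = merged["forecasted_time"] - (
    merged["RUN_DATETIME"] - pd.Timedelta(minutes=5)
)
print(merged[["forecasted_time", "RUN_DATETIME", "error", "ahead_time"]])
```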

In [7]:
def calculate_p5min_demand_forecast_error_vectorised(
    analysis_start: str, analysis_end: str
) -> pd.DataFrame:
    """
    Calculates P5MIN demand forecast error (Actual - Forecast) for all forecasts
    that are run for forecasted times between analysis_start and analysis_end,
    in a vectorised fashion.

    Args:
        analysis_start: Start of the analysis period, as a datetime string in
            the form YYYY/mm/dd HH:MM:SS
        analysis_end: End of the analysis period, as a datetime string in
            the form YYYY/mm/dd HH:MM:SS
    Returns:
        pandas DataFrame with forecast error in the `TOTALDEMAND` column, the ahead
        time of the forecast run in `ahead_time`, and the forecasted time in
        `forecasted_time`.
    """

    def get_forecast_data(analysis_start: str, analysis_end: str) -> pd.DataFrame:
        """
        Use NEMSEER to get 5MPD forecast data. Also omits any intervention periods.
        """
        # use NEMSEER functions to compile pre-cached data
        forecasts_run_start, forecasts_run_end = generate_runtimes(
            analysis_start, analysis_end, "P5MIN"
        )
        forecast_df = compile_data(
            forecasts_run_start,
            forecasts_run_end,
            analysis_start,
            analysis_end,
            "P5MIN",
            "REGIONSOLUTION",
            "nemseer_cache/",
        )["REGIONSOLUTION"]
        # remove intervention periods
        forecast_df = forecast_df.query("INTERVENTION == 0")
        return forecast_df

    def get_actual_data(analysis_start: str, analysis_end: str) -> pd.DataFrame:
        """
        Use NEMOSIS to get actual data. Also omits any intervention periods
        """
        # NEMOSIS start time must precede end of interval of interest by 5 minutes
        nemosis_start = (
            datetime.strptime(analysis_start, DATETIME_FORMAT + ":%S")
            - timedelta(minutes=5)
        ).strftime(DATETIME_FORMAT + ":%S")
        # use NEMOSIS to compile pre-cached data and filter out interventions
        actual_df = nemosis.dynamic_data_compiler(
            nemosis_start,
            analysis_end,
            "DISPATCHREGIONSUM",
            nemosis_cache,
            filter_cols=["INTERVENTION"],
            filter_values=([0],),
            fformat="parquet",
        )
        return actual_df

    def calculate_p5min_forecast_demand_error(
        actual_demand: pd.DataFrame, forecast_demand: pd.DataFrame
    ) -> pd.DataFrame:
        """
        Calculate P5MIN forecast demand error given actual and forecast demand

        Ahead time calculation reflects the fact that P5MIN actual run time is
        5 minutes before the nominal run time.
        """
        # left merge ensures all forecasted values have the corresponding actual value merged in
        merged = pd.merge(
            forecast_demand, actual_demand, on="forecasted_time", how="left"
        )
        if len(merged) > len(forecast_demand):
            raise ValueError(
                "Merge should return DataFrame with dimensions of forecast data"
            )
        # subtract 5 minutes from run time to get actual run time
        merged["ahead_time"] = merged["forecasted_time"] - (
            merged["RUN_DATETIME"] - timedelta(minutes=5)
        )
        forecast_error = (
            merged["TOTALDEMAND"] - merged["FORECAST_TOTALDEMAND"]
        ).rename("TOTALDEMAND")
        # create the forecast error DataFrame
        forecast_error = pd.concat(
            [forecast_error, merged["ahead_time"]], axis=1
        ).set_index(merged["forecasted_time"])
        return forecast_error

    # get forecast data
    forecast_df = get_forecast_data(analysis_start, analysis_end)
    # rename columns in preparation for merge
    forecast_df = forecast_df.rename(
        columns={
            "TOTALDEMAND": "FORECAST_TOTALDEMAND",
            "INTERVAL_DATETIME": "forecasted_time",
        }
    )
    # group by forecasted and run times, then sum demand across regions to get NEM-wide demand
    forecast_demand = forecast_df.groupby(["forecasted_time", "RUN_DATETIME"])[
        "FORECAST_TOTALDEMAND"
    ].sum()
    forecast_demand = forecast_demand.reset_index()

    # get actual data
    actual_df = get_actual_data(analysis_start, analysis_end)
    # rename columns in preparation for merge
    actual_df = actual_df.rename(
        columns={
            "SETTLEMENTDATE": "forecasted_time",
            "TOTALDEMAND": "TOTALDEMAND",
        }
    )
    # group by forecasted time and then sum demand across regions to get NEM-wide demand
    actual_demand = (
        actual_df.groupby("forecasted_time")["TOTALDEMAND"].sum().reset_index()
    )

    # calculate forecast error
    forecast_error = calculate_p5min_forecast_demand_error(
        actual_demand, forecast_demand
    )
    return forecast_error

In [8]:
forecast_error = calculate_p5min_demand_forecast_error_vectorised(
    analysis_start, analysis_end
)

## Plotting forecast error percentiles for each ahead time

How does forecast error change with the number of minutes between when a forecast is run and the time it is forecasting for?

### Forecast error percentiles

We can compute forecast error percentiles for each `ahead_time` (between 5 and 60 minutes for 5-minute pre-dispatch, once the 5-minute adjustment for the actual run time is applied).

To do this, we will group the error DataFrame by `ahead_time`, compute the percentile and then add a column that indicates the computed percentile. We'll repeat this process across all percentiles of interest and then concatenate the results to form a single DataFrame for plotting.

In [9]:
percentile_data = []
for quantile in (0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99):
    quantile_result = forecast_error.groupby(
        forecast_error["ahead_time"].dt.seconds / 60
    )["TOTALDEMAND"].quantile(quantile)
    percentile_result = pd.concat(
        [
            quantile_result,
            pd.Series(
                np.repeat(quantile * 100, len(quantile_result)),
                index=quantile_result.index,
                name="percentile",
            ).astype(int),
        ],
        axis=1,
    )
    percentile_data.append(percentile_result)
percentile_df = pd.concat(percentile_data, axis=0).reset_index()
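
As an aside, the same kind of table can be built without an explicit loop, because `quantile` on a grouped column accepts a list of quantiles. The sketch below shows this on a small invented error DataFrame rather than on `forecast_error`, and the names `toy_error` and `toy_percentiles` are illustrative only.

```python
import pandas as pd

# invented forecast errors (MW) at two ahead times
toy_error = pd.DataFrame(
    {
        "ahead_time": pd.to_timedelta([5, 5, 5, 5, 5, 10, 10, 10], unit="min"),
        "TOTALDEMAND": [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0],
    }
)
quantiles = [0.25, 0.5, 0.75]
# group by ahead time in minutes and compute all quantiles in one call
ahead_minutes = toy_error["ahead_time"].dt.total_seconds() / 60
toy_percentiles = (
    toy_error.groupby(ahead_minutes)["TOTALDEMAND"]
    .quantile(quantiles)
    .rename_axis(["ahead_time", "quantile"])
    .reset_index()
)
# round before casting so that values such as 0.29 * 100 do not truncate to 28
toy_percentiles["percentile"] = (
    (toy_percentiles["quantile"] * 100).round().astype(int)
)
print(toy_percentiles)
```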

We can plot these quantiles for each ahead time. 

It's interesting to note that there is only a slight positive bias in the 50th percentile forecast error as the forecast ahead time approaches one hour.

In [10]:
ahead_percentile = px.line(
    percentile_df,
    x="ahead_time",
    y="TOTALDEMAND",
    color="percentile",
    title="5MPD NEM-wide Demand Forecast Error 2021 (Actual - Forecast)",
    labels={
        "TOTALDEMAND": "Demand Forecast Error (MW)",
        "ahead_time": "Forecast Ahead Time (minutes)",
    },
)
ahead_percentile["layout"]["xaxis"]["autorange"] = "reversed"

In [11]:
pio.write_html(
    ahead_percentile, "../_static/p5min_error_2021_ahead_time_percentile.html"
)

```{raw} html
---
file: ../_static/p5min_error_2021_ahead_time_percentile.html
---
```

## Plotting the distributions of forecast errors by ahead time

We can look at the full distributions of forecast errors across ahead times. 

But first, we'll remove "forecasts" at `ahead_time` = 5, as these correspond to actual dispatch conditions.

We'll also convert the Timedeltas into an integer, which will be helpful for plotting.
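
A small note on that conversion, sketched on invented values: `.dt.seconds` returns only the seconds component of each Timedelta, which is sufficient here because 5MPD ahead times never exceed an hour. If you adapt this example to forecasts with horizons longer than a day, `.dt.total_seconds()` is likely the safer choice.

```python
import pandas as pd

ahead = pd.Series(
    pd.to_timedelta(["0 days 00:05:00", "0 days 01:00:00", "1 days 00:05:00"])
)

# .dt.seconds ignores whole days, so the last value wraps around to 5 minutes
minutes_from_seconds = (ahead.dt.seconds / 60).tolist()
print(minutes_from_seconds)  # [5.0, 60.0, 5.0]

# .dt.total_seconds() covers the full duration, including whole days
minutes_from_total = (ahead.dt.total_seconds() / 60).tolist()
print(minutes_from_total)  # [5.0, 60.0, 1445.0]
```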

In [12]:
error_excluding_real_time = forecast_error[
    forecast_error["ahead_time"].dt.seconds > 300
].copy()
# convert Timedelta ahead times to integer minutes
# (working on an explicit copy avoids pandas' SettingWithCopyWarning)
error_excluding_real_time["ahead_time"] = (
    error_excluding_real_time["ahead_time"].dt.seconds / 60
).astype(int)

In [13]:
ahead_hist = px.histogram(
    error_excluding_real_time, x="TOTALDEMAND", color="ahead_time"
)

In [14]:
pio.write_html(ahead_hist, "../_static/p5min_error_2021_ahead_time_hists.html")

```{raw} html
---
file: ../_static/p5min_error_2021_ahead_time_hists.html
---
```

## Plotting forecast error quantiles against time of day

How does forecast error change across the day?

Below, we repeat percentile calculations, but this time we group the data by the time of day.

From the chart below, we can see that, across the NEM, intra-hour demand forecasting errors tend to be larger during the morning and evening ramps.
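
The grouping key used in the next cell is the `time` attribute of the DatetimeIndex, which drops the date so that the same clock time on different days falls into one group. A minimal sketch on invented values (the `toy` DataFrame is illustrative only):

```python
import datetime

import pandas as pd

# invented errors (MW) for the same two clock times on two different days
toy = pd.DataFrame(
    {"TOTALDEMAND": [10.0, -20.0, 30.0, -40.0]},
    index=pd.to_datetime(
        [
            "2021-01-01 07:00:00",
            "2021-01-01 18:00:00",
            "2021-01-02 07:00:00",
            "2021-01-02 18:00:00",
        ]
    ),
)
# index.time yields datetime.time objects, pooling rows across days
median_by_time = toy.groupby(toy.index.time)["TOTALDEMAND"].median()
print(median_by_time)
```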

In [15]:
TOD_percentile_data = []
for quantile in (0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99):
    quantile_result = error_excluding_real_time.groupby(
        error_excluding_real_time.index.time
    )["TOTALDEMAND"].quantile(quantile)
    percentile_result = pd.concat(
        [
            quantile_result,
            pd.Series(
                np.repeat(quantile * 100, len(quantile_result)),
                index=quantile_result.index,
                name="percentile",
            ).astype(int),
        ],
        axis=1,
    )
    TOD_percentile_data.append(percentile_result)
TOD_percentile = pd.concat(TOD_percentile_data, axis=0).reset_index()

In [16]:
tod_percentile = px.line(
    TOD_percentile,
    x="index",
    y="TOTALDEMAND",
    color="percentile",
    labels={
        "TOTALDEMAND": "Demand Forecast Error (MW)",
        "ahead_time": "Forecast Ahead Time (minutes)",
        "index": "Time of Day",
    },
    title="5MPD NEM-wide Demand Forecast Error 2021 (Actual - Forecast,"
    + " excludes forecast run at real time)",
)

In [17]:
pio.write_html(tod_percentile, "../_static/p5min_error_2021_tod_percentile.html")

```{raw} html
---
file: ../_static/p5min_error_2021_tod_percentile.html
---
```