# Enefit - Predict Energy Behavior of Prosumers

The challenge in this competition is to predict the amount of electricity produced and consumed by Estonian energy customers who have installed solar panels. The dataset includes weather data, the relevant energy prices, and records of the installed photovoltaic capacity.

This is a forecasting competition using the time series API.

**Description**

The number of prosumers is rapidly increasing, and solving the problems of energy imbalance and their rising costs is vital. If left unaddressed, this could lead to increased operational costs, potential grid instability, and inefficient use of energy resources. If this problem were effectively solved, it would significantly reduce the imbalance costs, improve the reliability of the grid, and make the integration of prosumers into the energy system more efficient and sustainable. Moreover, it could potentially incentivize more consumers to become prosumers, knowing that their energy behavior can be adequately managed, thus promoting renewable energy production and use.

**About us**

Enefit is one of the biggest energy companies in the Baltic region. As experts in the field of energy, we help customers plan their green journey in a personal and flexible manner as well as implement it by using environmentally friendly energy solutions.
At present, Enefit is attempting to solve the imbalance problem by developing internal predictive models and relying on third-party forecasts. However, these methods have proven to be insufficient due to their low accuracy in forecasting the energy behavior of prosumers. The shortcomings of these current methods lie in their inability to accurately account for the wide range of variables that influence prosumer behavior, leading to high imbalance costs. By opening up the challenge to the world's best data scientists through the Kaggle platform, Enefit aims to leverage a broader pool of expertise and novel approaches to improve the accuracy of these predictions and consequently reduce the imbalance and associated costs.

**Evaluation**

Submissions are evaluated on the Mean Absolute Error (MAE) between the predicted value and the observed target. The formula is given by:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right|$$

Where:
* $n$ is the total number of data points.
* $y_i$ is the predicted value for data point $i$.
* $x_i$ is the observed value for data point $i$.
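
As a minimal sketch of the metric defined above, the function below computes the MAE in plain Python. The function name and the sample values are made up for illustration; they are not part of the competition API.

```python
def mean_absolute_error(predicted, observed):
    """Return the mean absolute error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one data point is required")
    return sum(abs(y - x) for y, x in zip(predicted, observed)) / len(predicted)


# Made-up values, for illustration only.
predicted = [1.0, 2.5, 0.0, 4.0]
observed = [1.5, 2.0, 0.0, 6.0]

mae = mean_absolute_error(predicted, observed)
print(mae)  # (0.5 + 0.5 + 0.0 + 2.0) / 4 = 0.75
```

Because the errors are not squared, a single large miss (the 2.0 above) contributes linearly rather than dominating the score.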

**Submitting**

You must submit to this competition using the provided Python time-series API, which ensures that models do not peek forward in time. To use the API, follow the template in this [notebook](https://www.kaggle.com/code/sohier/enefit-basic-submission-demo).

**Timeline**

This is a future data prediction competition with an active training phase and a second period where selected submissions will be evaluated against future ground truth data.

*Training Timeline*

* November 1, 2023 - Start Date.
* January 24, 2024 - Entry Deadline. You must accept the competition rules before this date in order to compete.
* January 24, 2024 - Team Merger Deadline. This is the last day participants may join or merge teams.
* January 31, 2024 - Final Submission Deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

*Prediction Timeline:*

Starting after the final submission deadline there will be periodic updates to the leaderboard to reflect future data updates that will be evaluated against selected submissions. We anticipate 1-3 interim updates before the final evaluation.

* April 30, 2024 - Competition End Date

**Prizes**

* 1st Place - $ 15,000
* 2nd Place - $ 10,000
* 3rd Place - $ 8,000
* 4th Place - $ 7,000
* 5th Place - $ 5,000
* 6th Place - $ 5,000

**Code Requirements**

Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

* CPU Notebook <= 9 hours run-time
* GPU Notebook <= 9 hours run-time
* Internet access disabled
* Freely & publicly available external data is allowed, including pre-trained models
* Submission file must be named submission.csv and be generated by the API.

Please see the [Code Competition FAQ](https://www.kaggle.com/docs/competitions#notebooks-only-FAQ) for more information on how to submit. And review the [code debugging doc](https://www.kaggle.com/code-competition-debugging) if you are encountering submission errors.

### Load Workspace

In [1]:
import re
import datetime as dt
import itertools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use('fivethirtyeight')
import seaborn as sns
import opendatasets as od
import kaggle
import zipfile
import io
import json
import warnings

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.gofplots import qqplot
from statsmodels.tsa.stattools import adfuller
from tqdm import notebook
from itertools import product
from typing import Union

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, classification_report

### Load the Data

In [2]:
def list_files_in_zip(zip_file_path):
    """Return the names of all files contained in the zip archive."""
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        return zip_ref.namelist()

zip_file_path = 'predict-energy-behavior-of-prosumers.zip'

enefit_files = list_files_in_zip(zip_file_path)
enefit_files

['client.csv',
 'county_id_to_name_map.json',
 'electricity_prices.csv',
 'enefit/__init__.py',
 'enefit/competition.cpython-310-x86_64-linux-gnu.so',
 'example_test_files/client.csv',
 'example_test_files/electricity_prices.csv',
 'example_test_files/forecast_weather.csv',
 'example_test_files/gas_prices.csv',
 'example_test_files/historical_weather.csv',
 'example_test_files/revealed_targets.csv',
 'example_test_files/sample_submission.csv',
 'example_test_files/test.csv',
 'forecast_weather.csv',
 'gas_prices.csv',
 'historical_weather.csv',
 'public_timeseries_testing_util.py',
 'train.csv',
 'weather_station_to_county_mapping.csv']

In [3]:
def read_csv_from_zip(zip_file_path, csv_file_name):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        with zip_ref.open(csv_file_name) as file:
            df = pd.read_csv(io.TextIOWrapper(file))
            return df

def read_json_from_zip(zip_file_path, json_file_name):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        with zip_ref.open(json_file_name) as file:
            data = json.load(file)
            df = pd.DataFrame(data, index=range(len(data)))
            return df

def clean_date(df):
    date_cols = ['date', 'datetime', 'forecast_date', 'origin_date', 'forecast_datetime', 'origin_datetime']
    for col in df.columns:
        if col in date_cols:
            df[col] = pd.to_datetime(df[col])
    return df

In [4]:
enefit_dict = dict()
keys = ['forecast_weather']

for key in keys:
    if key + '.csv' in enefit_files:
        csv_file_name = key + '.csv'
        enefit_dict[key] = read_csv_from_zip(zip_file_path, csv_file_name)
        enefit_dict[key] = clean_date(enefit_dict[key])
    elif key + '.json' in enefit_files:
        json_file_name = key + '.json'
        enefit_dict[key] = read_json_from_zip(zip_file_path, json_file_name)

enefit_dict['forecast_weather'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424512 entries, 0 to 3424511
Data columns (total 18 columns):
 #   Column                             Dtype              
---  ------                             -----              
 0   latitude                           float64            
 1   longitude                          float64            
 2   origin_datetime                    datetime64[ns, UTC]
 3   hours_ahead                        int64              
 4   temperature                        float64            
 5   dewpoint                           float64            
 6   cloudcover_high                    float64            
 7   cloudcover_low                     float64            
 8   cloudcover_mid                     float64            
 9   cloudcover_total                   float64            
 10  10_metre_u_wind_component          float64            
 11  10_metre_v_wind_component          float64            
 12  data_block_id                      int64  

In [15]:
deep_colors = [
    '#4C72B0', '#55A868', '#C44E52',
    '#8172B2', '#CCB974', '#64B5CD'
]

**Forecast Weather**

Weather forecasts that would have been available at prediction time. Sourced from the European Centre for Medium-Range Weather Forecasts.

The features in this dataset are:

* latitude/longitude: The coordinates of the weather forecast.
* origin_datetime: The timestamp of when the forecast was generated.
* hours_ahead: The number of hours between the forecast generation and the forecast weather. Each forecast covers 48 hours in total.
* temperature: The air temperature at 2 meters above ground in degrees Celsius.
* dewpoint: The dew point temperature at 2 meters above ground in degrees Celsius.
* cloudcover_[low/mid/high/total]: The percentage of the sky covered by clouds in the following altitude bands: 0-2 km, 2-6, 6+, and total.
* 10_metre_[u/v]_wind_component: The [eastward/northward] component of wind speed measured 10 meters above surface in meters per second.
* data_block_id:  All rows sharing the same `data_block_id` will be available at the same forecast time. This is a function of what information is available when forecasts are actually made, at 11 AM each morning. For example, if the forecast weather `data_block_id` for predictions made on October 31st is 100 then the historic weather data_block_id for October 31st will be 101 as the historic weather data is only actually available the next day.
* forecast_datetime: The timestamp of the predicted weather. Generated from origin_datetime plus hours_ahead.
* direct_solar_radiation: The direct solar radiation reaching the surface on a plane perpendicular to the direction of the Sun accumulated during the preceding hour, in watt-hours per square meter.
* surface_solar_radiation_downwards: The solar radiation, both direct and diffuse, that reaches a horizontal plane at the surface of the Earth, in watt-hours per square meter.
* snowfall: Snowfall over the previous hour in units of meters of water equivalent.
* total_precipitation: The accumulated liquid, comprising rain and snow that falls on Earth's surface over the preceding hour, in units of meters.
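
Two of the features described above are simple functions of other columns, which a short sketch can make concrete: `forecast_datetime` is `origin_datetime` plus `hours_ahead`, and the two wind components can be combined into a scalar wind speed. The helper names below are hypothetical, and the wind-speed feature is an illustration rather than something the notebook computes. The input values are taken from the first row of the dataset preview.

```python
import math
from datetime import datetime, timedelta, timezone


def forecast_valid_time(origin_datetime, hours_ahead):
    """Timestamp the forecast refers to: origin plus the lead time in hours."""
    return origin_datetime + timedelta(hours=hours_ahead)


def wind_speed(u, v):
    """Horizontal wind speed (m/s) from eastward (u) and northward (v) components."""
    return math.hypot(u, v)


# Values from the first row of the forecast_weather preview.
origin = datetime(2021, 9, 1, 0, 0, tzinfo=timezone.utc)
valid_time = forecast_valid_time(origin, hours_ahead=1)
speed = wind_speed(-0.411328, -9.106137)

print(valid_time.isoformat())  # 2021-09-01T01:00:00+00:00
print(round(speed, 2))         # 9.12
```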

In [5]:
enefit_dict['forecast_weather'].head()

Unnamed: 0,latitude,longitude,origin_datetime,hours_ahead,temperature,dewpoint,cloudcover_high,cloudcover_low,cloudcover_mid,cloudcover_total,10_metre_u_wind_component,10_metre_v_wind_component,data_block_id,forecast_datetime,direct_solar_radiation,surface_solar_radiation_downwards,snowfall,total_precipitation
0,57.6,21.7,2021-09-01 00:00:00+00:00,1,15.655786,11.553613,0.904816,0.019714,0.0,0.905899,-0.411328,-9.106137,1,2021-09-01 01:00:00+00:00,0.0,0.0,0.0,0.0
1,57.6,22.2,2021-09-01 00:00:00+00:00,1,13.003931,10.689844,0.886322,0.004456,0.0,0.886658,0.206347,-5.355405,1,2021-09-01 01:00:00+00:00,0.0,0.0,0.0,0.0
2,57.6,22.7,2021-09-01 00:00:00+00:00,1,14.206567,11.671777,0.729034,0.005615,0.0,0.730499,1.451587,-7.417905,1,2021-09-01 01:00:00+00:00,0.0,0.0,0.0,0.0
3,57.6,23.2,2021-09-01 00:00:00+00:00,1,14.844507,12.264917,0.336304,0.074341,0.000626,0.385468,1.090869,-9.163999,1,2021-09-01 01:00:00+00:00,0.0,0.0,0.0,0.0
4,57.6,23.7,2021-09-01 00:00:00+00:00,1,15.293848,12.458887,0.102875,0.088074,1.5e-05,0.17659,1.268481,-8.975766,1,2021-09-01 01:00:00+00:00,0.0,0.0,0.0,0.0


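For each forecast timestamp there are 224 rows, except for 48 hourly timestamps (falling on four calendar dates at the very start and end of the data) where there are only 112.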

In [7]:
enefit_dict['forecast_weather'].groupby('forecast_datetime').size().value_counts()

224    15264
112       48
Name: count, dtype: int64

In [13]:
# Count rows per forecast timestamp once, then list the dates with only 112 rows
counts = enefit_dict['forecast_weather'].groupby('forecast_datetime').size()
np.unique(counts.loc[counts == 112].index.date)

array([datetime.date(2021, 9, 1), datetime.date(2021, 9, 2),
       datetime.date(2023, 5, 31), datetime.date(2023, 6, 1)],
      dtype=object)
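
One plausible reading of these counts is that every timestamp is covered by two overlapping 48-hour forecasts, except the first and last 24 hours of the data, which are covered by only one. The sketch below checks that reading with a small simulation. It assumes one forecast origin per day starting on 2021-09-01 00:00 UTC, 112 grid points, and 637 origins (3,424,512 rows / (48 hours × 112 points)); these numbers are inferred from the outputs above, not stated by the data provider.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

N_GRID_POINTS = 112  # assumption: inferred from the 112-row timestamps
N_ORIGINS = 637      # assumption: 3,424,512 rows / (48 hours * 112 points)
HORIZON_HOURS = 48   # stated in the hours_ahead description

first_origin = datetime(2021, 9, 1, tzinfo=timezone.utc)

# Count how many rows target each forecast timestamp.
rows_per_timestamp = Counter()
for day in range(N_ORIGINS):
    origin = first_origin + timedelta(days=day)
    for hours_ahead in range(1, HORIZON_HOURS + 1):
        rows_per_timestamp[origin + timedelta(hours=hours_ahead)] += N_GRID_POINTS

# How many timestamps have 224 rows and how many have 112.
size_counts = Counter(rows_per_timestamp.values())

# Calendar dates of the timestamps covered by a single forecast.
single_covered_dates = sorted(
    {ts.date() for ts, n in rows_per_timestamp.items() if n == N_GRID_POINTS}
)

print(dict(size_counts))
print([d.isoformat() for d in single_covered_dates])
```

Under these assumptions the simulation reproduces both outputs above: 15,264 timestamps with 224 rows, 48 with 112 rows, and the same four calendar dates.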

For this dataset, we will consider the relationship between direct and surface solar radiation and other features.

In [None]:
plt.figure(figsize=(10, 8))
sns.scatterplot(
    data=enefit_dict['forecast_weather'],
    x='temperature', y='dewpoint', alpha=.4
)
plt.xlabel('Temperature')
plt.ylabel('Dewpoint')
plt.title('Relationship between Temperature & Dewpoint')
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
sns.scatterplot(
    data=enefit_dict['forecast_weather'],
    x='surface_solar_radiation_downwards', 
    y='direct_solar_radiation', alpha=.4
)
plt.xlabel('Surface Solar Radiation')
plt.ylabel('Direct Solar Radiation')
plt.title('Relationship between Surface Solar Radiation & Direct Solar Radiation')
plt.show()

In [None]:
pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].temperature,
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

plt.figure(figsize=(8, 6))
sns.jointplot(
    x='temperature',
    y='direct_solar_radiation', data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=1.1,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)

plt.xlabel('Temperature')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Temperature & Direct Solar Radiation')

plt.show()

In [None]:
pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill(),
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

plt.figure(figsize=(8, 6))
sns.jointplot(
    x='surface_solar_radiation_downwards',
    y='direct_solar_radiation', data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=20, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)

plt.xlabel('Surface Solar Radiation')
plt.ylabel('Direct Solar Radiation')
plt.title('Relationship between Surface Solar Radiation & \nDirect Solar Radiation', y=1.2, fontsize=12)

plt.show()

There is a positive correlation between solar radiation and temperature, but the relationship is not visually pronounced:

In [None]:
plt.figure(figsize=(12, 8))
pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].temperature,
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='temperature',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=1000,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Temperature')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Temperature & Direct Solar Radiation', 
             y=1, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 1050)

plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].temperature,
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='temperature',
    y='surface_solar_radiation_downwards', data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Temperature')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Temperature & Surface Solar Radiation', 
             y=1, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)

plt.show()

Dewpoint & Solar Radiation:

In [None]:
plt.figure(figsize=(12, 8))
pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].dewpoint,
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='dewpoint',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=700,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Dewpoint')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Dewpoint & Direct Solar Radiation', 
             y=1, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 1050)

plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].dewpoint,
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='dewpoint',
    y='surface_solar_radiation_downwards', data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Dewpoint')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Dewpoint & Surface Solar Radiation', 
             y=1, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)

plt.show()

Solar Radiation & Cloud Cover

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].cloudcover_total,
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='cloudcover_total',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=.2, y=700,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Total Cloud Cover')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Total Cloud Cover & Direct Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].cloudcover_total,
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='cloudcover_total',
    y='surface_solar_radiation_downwards', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=.2, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Total Cloud Cover')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Total Cloud Cover & Surface Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)

plt.show()

Snowfall & Solar Radiation:

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].snowfall,
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='snowfall',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=.001, y=700,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Snowfall')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Snowfall & Direct Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].snowfall,
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='snowfall',
    y='surface_solar_radiation_downwards', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=.001, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Snowfall')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Snowfall & Surface Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)
plt.show()

Precipitation and Solar Radiation:

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].total_precipitation,
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='total_precipitation',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=.001, y=700,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Total Precipitation')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Total Precipitation & Direct Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 1050)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather'].total_precipitation,
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='total_precipitation',
    y='surface_solar_radiation_downwards', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=.001, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('Total Precipitation')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Total Precipitation & Surface Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)
plt.show()

Wind Component & Solar Radiation

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather']['10_metre_u_wind_component'],
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='10_metre_u_wind_component',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-10, y=700,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('10m East Wind Component')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Wind Component & Direct Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 1050)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather']['10_metre_u_wind_component'],
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='10_metre_u_wind_component',
    y='surface_solar_radiation_downwards', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('10m East Wind Component')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Wind Component & Surface Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather']['10_metre_v_wind_component'],
    enefit_dict['forecast_weather'].direct_solar_radiation
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='10_metre_v_wind_component',
    y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=700,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('10m North Wind Component')
plt.ylabel('Direct Solar Radiation')
plt.suptitle('Relationship between Wind Component & Direct Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 1050)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))

pearson, p = stats.pearsonr(
    enefit_dict['forecast_weather']['10_metre_v_wind_component'],
    enefit_dict['forecast_weather'].surface_solar_radiation_downwards.bfill()
)
pearson = round(pearson, 2)
p = '{:.2e}'.format(p)

sns.jointplot(
    x='10_metre_v_wind_component',
    y='surface_solar_radiation_downwards', 
    data=enefit_dict['forecast_weather'], 
    kind="reg",
    # joint_kws={'color':deep_colors[0]},
    line_kws={'color':deep_colors[2]}
).ax_joint.text(
    s=f' pearsonr = {pearson}; p = {p} ',
    ha='left', va='top', x=-20, y=800,
    bbox={
        'boxstyle':'round','pad':0.25,
        'facecolor':'white','edgecolor':'gray'
    }
)
plt.xlabel('10m North Wind Component')
plt.ylabel('Surface Solar Radiation')
plt.suptitle('Relationship between Wind Component & Surface Solar Radiation', 
             y=1.05, fontsize=13)
plt.grid(alpha=.3)
plt.ylim(0, 850)
plt.show()

Heatmap Correlation of Features in the Forecast Weather Dataset:

Temperature and dewpoint are positively correlated with solar radiation, while cloud cover (especially low cloud cover) is negatively correlated with it. This is physically plausible: warmer conditions tend to coincide with more solar radiation, and the higher the cloud cover, the less sunlight reaches the surface.
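
The two heatmaps in this section use Pearson and Spearman (rank) correlation respectively. As a rough sketch of how the two differ, the plain-Python functions below compute both on a made-up, perfectly monotonic but non-linear series. The helper names are illustrative and this is not the code pandas uses internally.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Made-up data: y grows exponentially with x (monotonic, but not linear).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]

r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(round(r_pearson, 2), round(r_spearman, 2))
```

Here the rank correlation is 1 while the Pearson correlation is clearly lower, which is the kind of gap that would hint at a monotonic but non-linear relationship.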

In [None]:
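plt.figure(figsize=(15, 13))
sns.heatmap(
    # numeric_only=True restricts the correlation to numeric columns
    # (the datetime columns are skipped)
    enefit_dict['forecast_weather'].corr(numeric_only=True), 
    annot=True,
    linewidths=0.5,
    fmt= ".2f",
    cmap="YlGnBu"
    )
plt.title('Correlation Heatmap of Forecast Weather Table')
plt.show()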

In [None]:
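plt.figure(figsize=(15, 13))
sns.heatmap(
    # numeric_only=True restricts the correlation to numeric columns
    # (the datetime columns are skipped)
    enefit_dict['forecast_weather'].corr(method='spearman', numeric_only=True), 
    annot=True,
    linewidths=0.5,
    fmt= ".2f",
    cmap="YlGnBu"
    )
plt.title('Rank Correlation Heatmap of Forecast Weather Table')
plt.show()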

Explaining Variations in Solar Radiation:

A linear regression model on the other features in the dataset explains about 37% of the variation in solar radiation. This may imply that the relationships between the features are not linear; however, the rank correlation heatmap did not produce values that differed substantially from the Pearson correlation heatmap.
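
The "share of variation explained" is the R-squared of the regression, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it for a one-predictor least-squares fit on made-up data with a curved relationship, to illustrate how a straight line can leave much of the variation unexplained. The function names and data are illustrative and unrelated to the weather table.

```python
def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(ys, fitted):
    """Share of the variance in ys that the fitted values account for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: y = x^2 + x, so a straight line can only capture part of it.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 + v for v in x]

intercept, slope = fit_simple_ols(x, y)
fitted = [intercept + slope * v for v in x]
r2 = r_squared(y, fitted)
print(round(r2, 3))  # 1 - 14/24 ≈ 0.417
```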

*Interpreting the Durbin-Watson Statistic*

The Durbin-Watson statistic ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. The low values for both models indicate positive autocorrelation, implying a systematic relationship between the residuals (errors) of the model at different observations in the dataset. Since our dataset is a time series, this implies that adjacent or nearby residuals tend to have similar values or patterns. We will see this visually in the plots below. In our case, positive autocorrelation suggests that if solar radiation is higher (or lower) than predicted for one observation, it is more likely to be higher (or lower) than predicted for the following observations as well.

This positive serial correlation in residuals violates the assumption of independence and can affect the reliability of statistical tests and the efficiency of coefficient estimates in regression analysis. It indicates that the model does not account for some systematic pattern or structure present in the data, leading to inefficiencies in the estimation process. We will test for seasonality in the next section. 
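
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The following is a small plain-Python sketch with made-up residual series; statsmodels reports the same quantity in the regression summary, and the function here is only an illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals, for illustration only.
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # neighbours tend to agree
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours tend to disagree

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
print(round(dw_persistent, 2))   # 4/6 ≈ 0.67, positive autocorrelation
print(round(dw_alternating, 2))  # 20/6 ≈ 3.33, negative autocorrelation
```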

*Interpreting the Jarque-Bera Statistic*

In Linear Regression, residuals should ideally be normally distributed for valid inference. A low Jarque-Bera test statistic (and a high p-value) suggests that residuals are normally distributed, indicating that the assumption of normality might be valid. Conversely, a high test statistic (and a low p-value) suggests departure from normality in the residuals.

In this case, the Jarque-Bera statistic is very high, implying a significant deviation from normality in the residuals. We will see this visually in the following section.
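
The Jarque-Bera statistic combines skewness $S$ and kurtosis $K$ as $JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right)$, so it grows linearly with the sample size $n$. The sketch below implements this formula in plain Python. The second call plugs in the rounded skewness (0.9) and kurtosis (4) quoted in this notebook for the direct solar radiation model, and assumes for illustration that $n$ equals the full row count; it is not meant to reproduce the exact value in the regression summary.

```python
def sample_skew_kurtosis(values):
    """Moment-based skewness and (non-excess) kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def jarque_bera(n, skewness, kurtosis):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4); zero for exactly normal moments."""
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


# Small made-up sample.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
skew, kurt = sample_skew_kurtosis(sample)
jb_small = jarque_bera(len(sample), skew, kurt)

# Assumption: n is taken as the full row count of the forecast table.
jb_large = jarque_bera(3_424_512, skewness=0.9, kurtosis=4.0)

print(round(jb_small, 3))
print(round(jb_large, 2))
```

With millions of rows, even moderate skewness and kurtosis produce a very large statistic, so a high Jarque-Bera value on its own says little about how severe the departure from normality is.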

*Interpreting the Skewness & Kurtosis Values*

The skewness values in both models indicate moderate positive skewness in the distribution of the residuals, i.e. tail extending to the right toward higher values.

The kurtosis values in both models indicate higher-than-normal kurtosis. 

For the direct solar radiation model, the kurtosis value of 4 is slightly above the normal distribution's value of 3, which makes the residual distribution mildly leptokurtic rather than mesokurtic (mesokurtic distributions have the same kurtosis as a normal distribution). A kurtosis of 4 suggests tails that are somewhat heavier than normal, but without extreme tails or an extremely peaked shape.

When combined with its skewness value of 0.9, these values indicate a distribution that is moderately positively skewed (most values concentrated on the left, with a longer tail of large values on the right) and has a kurtosis fairly close to that of a normal distribution. Moderate skewness and kurtosis suggest that the distribution deviates somewhat from perfect symmetry and the exact shape of a normal distribution, but it does not exhibit extreme departures in tailedness or peakedness.

In the context of regression analysis, these mild departures from perfect normality might have minimal impact on the reliability of statistical inferences or the performance of the regression model. Usually, these levels of skewness and kurtosis are within an acceptable range and might not significantly affect the model assumptions or interpretations.

The surface solar radiation model has a kurtosis of 5.7, which indicates a leptokurtic distribution. Leptokurtic distributions have heavier tails and a higher peak (more values concentrated around the mean) than a normal distribution, which has a kurtosis of 3. A value of 5.7 suggests that the residuals contain more outliers or extreme values in the tails than is typical of a normal distribution.

When combined with its skewness value of 1.2, these values suggest that the distribution of the residuals may deviate from a normal distribution. A positively skewed distribution with higher kurtosis indicates non-normality, potentially indicating the presence of outliers, heavier tails, or a more peaked distribution than the normal curve.

When working with regression analysis, such departures from normality might impact the reliability of statistical inferences or the performance of the regression model.
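
The Jarque-Bera statistic is built directly from the skewness and kurtosis discussed above: JB = n/6 * (S^2 + (K - 3)^2 / 4). The sketch below plugs in the reported skewness and kurtosis values for both models. The sample size `n_obs` is a hypothetical placeholder (the real value is in the OLS summary), and the helper name `jarque_bera_from_moments` is ours.

```python
import math

def jarque_bera_from_moments(n, skewness, kurtosis):
    """Jarque-Bera statistic and p-value from n, skewness and (non-excess) kurtosis."""
    statistic = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function has the closed form exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

n_obs = 1000  # hypothetical sample size, NOT taken from the notebook

jb_direct, p_direct = jarque_bera_from_moments(n_obs, skewness=0.9, kurtosis=4.0)
jb_surface, p_surface = jarque_bera_from_moments(n_obs, skewness=1.2, kurtosis=5.7)

print(f"direct:  JB = {jb_direct:.1f}, p = {p_direct:.3g}")
print(f"surface: JB = {jb_surface:.1f}, p = {p_surface:.3g}")
```

Because the statistic scales linearly with the sample size, even moderate skewness and kurtosis produce a very large Jarque-Bera value on a dataset with many rows. For the same n, the surface model's statistic is roughly three times that of the direct model.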

*Conclusion*

Judged on skewness, kurtosis, Durbin-Watson and Jarque-Bera, the direct solar radiation model is the better-fitting of the two models.

In [None]:
data = enefit_dict['forecast_weather'].rename(columns={
    '10_metre_v_wind_component':'ten_metre_v_wind_component',
    '10_metre_u_wind_component':'ten_metre_u_wind_component',
})

formula = 'direct_solar_radiation ~ longitude + latitude + hours_ahead + temperature + dewpoint + cloudcover_total + snowfall + total_precipitation + ten_metre_u_wind_component + ten_metre_v_wind_component'
model = smf.ols(formula, data=data)
results = model.fit()
print(results.summary())

In [None]:
data = enefit_dict['forecast_weather'].rename(columns={
    '10_metre_v_wind_component':'ten_metre_v_wind_component',
    '10_metre_u_wind_component':'ten_metre_u_wind_component',
})
formula = 'surface_solar_radiation_downwards ~ longitude + latitude + hours_ahead + temperature + dewpoint + cloudcover_total + snowfall + total_precipitation + ten_metre_u_wind_component + ten_metre_v_wind_component'
model = smf.ols(formula, data=data)
results = model.fit()
print(results.summary())

Seasonality in Solar Radiation:

There is clearly seasonality in this plot. We'll use seasonal decomposition to dig deeper.

In [None]:
plt.figure(figsize=(12, 8))
sns.lineplot(
    x='forecast_datetime', y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'],
    estimator='mean', linewidth=.7, label='Direct Solar Radiation',
    legend='brief', alpha=.5
)
sns.lineplot(
    x='forecast_datetime', y='surface_solar_radiation_downwards',
    data=enefit_dict['forecast_weather'],
    estimator='mean', linewidth=.7, label='Surface Solar Radiation Downwards',
    legend='brief', alpha=.5
)

plt.xlabel('Date')
plt.ylabel('Solar Radiation')
plt.title('Time Series Plot of Forecasted Solar Radiation')
plt.xticks(rotation=45)
plt.grid(alpha=.3)
plt.legend()

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(
    x='forecast_datetime', y='direct_solar_radiation', 
    data=enefit_dict['forecast_weather'], color=deep_colors[0], alpha=.8,
    estimator='mean', linewidth=1.2, label='Avg Direct Solar Radiation',
    ax=ax
)
sns.lineplot(
    x='forecast_datetime', y='surface_solar_radiation_downwards',
    data=enefit_dict['forecast_weather'], color=deep_colors[1], alpha=.4,
    estimator='mean', linewidth=1.2, label='Avg Surface Solar Radiation',
    ax=ax
)
plt.xlabel('Date')
plt.ylabel('Solar Radiation')
plt.title('Time Series Plot of Avg Solar Radiation')
plt.xticks(rotation=45)
plt.grid(alpha=.3)

fig.autofmt_xdate()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

monthly_mean = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation.resample('M').mean().interpolate()
ax.plot(
    monthly_mean.index, monthly_mean, marker='o', 
    linestyle='-', linewidth=1.5,
    label='Direct Solar Radiation'
)

monthly_mean = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards.resample('M').mean().interpolate()
ax.plot(
    monthly_mean.index, monthly_mean, marker='o', 
    linestyle='-', linewidth=1.5,
    label='Surface Solar Radiation'
)
ax.set_xlabel('Date')
ax.set_ylabel('Solar Radiation')
ax.set_title('Time Series Plot of Forecasted Solar Radiation')
plt.xticks(rotation=45)
plt.grid(alpha=.3)
plt.legend()

plt.show()


In [None]:

fig, ax = plt.subplots(figsize=(12, 8))
monthly_mean = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation.resample('M').mean().interpolate()

ax.plot(
    monthly_mean.index, monthly_mean, marker='o', 
    linestyle='-', linewidth=1.5, color=deep_colors[0],
    label='Direct Solar Radiation'
)
monthly_mean = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards.resample('M').mean().interpolate()
ax.plot(
    monthly_mean.index, monthly_mean, marker='o', 
    linestyle='-', linewidth=1.5, color=deep_colors[1],
    label='Surface Solar Radiation'
)
plt.xticks(rotation=45)
plt.ylabel('Mean Solar Radiation')
plt.title('Monthly Solar Radiation Over Time')
plt.grid(alpha=.5)

plt.tight_layout()
plt.show()

print('Solar Radiation peaks in the summer months.')


In [None]:
plt.figure(figsize=(12, 9))

plt.subplot(211)
sns.boxplot(
    x=enefit_dict['forecast_weather'].forecast_datetime.dt.month,
    y='direct_solar_radiation',
    data=enefit_dict['forecast_weather']
)
plt.title('Direct Solar Radiation by Month')

plt.subplot(212)
sns.boxplot(
    x=enefit_dict['forecast_weather'].forecast_datetime.dt.month,
    y='surface_solar_radiation_downwards',
    data=enefit_dict['forecast_weather']
)
plt.title('Surface Solar Radiation Downwards by Month')

plt.tight_layout()
plt.show()

print('In the peak (summer) months, surface solar radiation downwards is not as high as direct solar radiation, but in the winter months, it is significantly higher.')

Time Series Forecasting:

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
# data = data.asfreq('D').bfill()

# NOTE: `period` counts observations, not days; on an hourly index, 365 spans about 15 days
decomposition = STL(data, period=365).fit()
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Plot the original time series
plt.figure(figsize=(15, 15))
plt.subplot(411)
plt.plot(decomposition.observed, label='Original', color=deep_colors[0])
plt.legend(loc='upper left')
plt.title('Original Time Series')

# Plot the trend component
plt.subplot(412)
plt.plot(trend, label='Trend', color=deep_colors[1])
plt.legend(loc='upper left')
plt.title('Trend Component')

# Plot the seasonal component
plt.subplot(413)
plt.plot(seasonal, label='Seasonal', color=deep_colors[2])
plt.legend(loc='upper left')
plt.title('Seasonal Component')

# Plot the residual component
plt.subplot(414)
plt.plot(residual, label='Residual', color=deep_colors[3])
plt.legend(loc='upper left')
plt.title('Residual Component')

plt.tight_layout()
plt.show()

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
# data = data.asfreq('D').bfill()

# NOTE: `period` counts observations, not days; on an hourly index, 365 spans about 15 days
decomposition = STL(data, period=365).fit()
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# Plot the original time series
plt.figure(figsize=(15, 15))
plt.subplot(411)
plt.plot(data, label='Original', color=deep_colors[0])
plt.legend(loc='upper left')
plt.title('Original Time Series')

# Plot the trend component
plt.subplot(412)
plt.plot(trend, label='Trend', color=deep_colors[1])
plt.legend(loc='upper left')
plt.title('Trend Component')

# Plot the seasonal component
plt.subplot(413)
plt.plot(seasonal, label='Seasonal', color=deep_colors[2])
plt.legend(loc='upper left')
plt.title('Seasonal Component')

# Plot the residual component
plt.subplot(414)
plt.plot(residual, label='Residual', color=deep_colors[3])
plt.legend(loc='upper left')
plt.title('Residual Component')

plt.tight_layout()
plt.show()

Analyzing Trend Component:

The trend component still carries the annual cycle, with the highest solar radiation occurring in the summer months. The fitted trendline also slopes upward, suggesting that solar radiation increases over the observed period.

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
# Drop the NaN edges left by the centered moving average so they do not bias the fit
trend = decomposition.trend.dropna()

plt.figure(figsize=(10, 6))
plt.plot(trend, label='Trend', color=deep_colors[1])

# Fit a linear regression trend-line (x in seconds since the epoch)
x_values = pd.to_numeric(trend.index) / 10**9
coefficients = np.polyfit(x_values, trend, 1)
slope, intercept = coefficients
trendline_values = slope * x_values + intercept
trendline_dates = pd.to_datetime(x_values.astype(int) * 10**9)

plt.plot(trendline_dates, trendline_values, label='Trendline', linestyle='--', color='red')
plt.legend(loc='upper left')
plt.title('Direct Solar Radiation - Trend Component')

plt.show()

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
# Drop the NaN edges left by the centered moving average so they do not bias the fit
trend = decomposition.trend.dropna()

plt.figure(figsize=(10, 6))
plt.plot(trend, label='Trend', color=deep_colors[1])

# Fit a linear regression trend-line (x in seconds since the epoch)
x_values = pd.to_numeric(trend.index) / 10**9
coefficients = np.polyfit(x_values, trend, 1)
slope, intercept = coefficients
trendline_values = slope * x_values + intercept
trendline_dates = pd.to_datetime(x_values.astype(int) * 10**9)

plt.plot(trendline_dates, trendline_values, label='Trendline', linestyle='--', color='red')
plt.legend(loc='upper left')
plt.title('Surface Solar Radiation Downward - Trend Component')

plt.show()
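
The slope fitted in the two cells above is expressed per second since the epoch, which makes it hard to read and leaves `np.polyfit` working with very large x values. As a minimal sketch on synthetic data (the series and its slope of 0.5 per day are made up for illustration), fitting against elapsed days gives a slope that can be read directly:

```python
import numpy as np
import pandas as pd

# Synthetic daily "trend" with a known slope of 0.5 units per day (illustrative only)
start = pd.Timestamp("2022-01-01")
index = start + pd.to_timedelta(np.arange(100), unit="D")
trend = pd.Series(5.0 + 0.5 * np.arange(100), index=index)

# Elapsed days since the first observation: small, well-scaled x values
elapsed_days = (trend.index - trend.index[0]) / pd.Timedelta(days=1)
x = np.asarray(elapsed_days, dtype=float)

slope_per_day, intercept = np.polyfit(x, trend.to_numpy(), 1)

print(f"slope: {slope_per_day:.3f} per day, {slope_per_day * 365:.1f} per year")
```

The same rescaling could be applied to the decomposition trend to report how much mean radiation changes per day or per year rather than per second.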

Analyzing Seasonal Component:

When we drill down into the seasonal component, we find a pattern that repeats every day, which suggests that time of day is a factor in determining solar radiation.

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
seasonal = decomposition.seasonal

year = 2022
months_to_select = [1]
days_to_select = range(1, 8)

selected_data = seasonal[(seasonal.index.year == year) & (seasonal.index.month.isin(months_to_select)) & (seasonal.index.day.isin(days_to_select))]

plt.figure(figsize=(10, 6))
plt.plot(selected_data, label='Seasonal', color=deep_colors[2])
plt.xlabel('Date')
plt.title(f'Direct Solar Radiation \nSeasonal Trends for January, {year}')
plt.legend(loc='upper left')
plt.xticks(rotation=45)
plt.show()

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
seasonal = decomposition.seasonal

year = 2022
months_to_select = [1]
days_to_select = range(1, 8)

selected_data = seasonal[(seasonal.index.year == year) & (seasonal.index.month.isin(months_to_select)) & (seasonal.index.day.isin(days_to_select))]

plt.figure(figsize=(10, 6))
plt.plot(selected_data, label='Seasonal', color=deep_colors[2])
plt.xlabel('Date')
plt.title(f'Surface Solar Radiation Downward \nSeasonal Trends for January, {year}')
plt.legend(loc='upper left')
plt.xticks(rotation=45)
plt.show()
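
A daily pattern like the one in these plots can be cross-checked without a decomposition by averaging the series by hour of day. The sketch below uses a synthetic hourly series (a clipped sine wave standing in for radiation, not the competition data) to show the idea:

```python
import numpy as np
import pandas as pd

# One week of synthetic hourly "radiation": zero at night, peaking at midday
n_hours = 24 * 7
index = pd.Timestamp("2022-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h")
hour_of_day = index.hour.to_numpy()
radiation = 500.0 * np.clip(np.sin((hour_of_day - 6) / 12 * np.pi), 0, None)
series = pd.Series(radiation, index=index)

# Mean value for each hour of the day across all days
hourly_profile = series.groupby(series.index.hour).mean()
peak_hour = int(hourly_profile.idxmax())

print(hourly_profile.round(1))
print(f"peak hour: {peak_hour}")
```

If the hourly profile of the real series shows the same day and night shape, hour of day is a natural feature for any downstream model.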

Assessing the amplitude or magnitude of the seasonal fluctuations:

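The seasonal component has a mean very close to zero and a much larger standard deviation, which produces an outsized coefficient of variation. Because an additive decomposition centers the seasonal component around zero, the coefficient of variation is dominated by its near-zero denominator. The amplitude (range), standard deviation and percentiles are therefore the more reliable measures of how large the seasonal swings are.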
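
One caveat when reading the coefficient of variation: it divides by the mean, so any series centered near zero yields a very large value regardless of how big its swings are. A minimal sketch with a synthetic sine-wave "seasonal component" (all numbers are illustrative):

```python
import numpy as np

# Synthetic seasonal component: 30 daily cycles of an hourly sine wave
t = np.arange(24 * 30)
seasonal = 100.0 * np.sin(2 * np.pi * t / 24)

def coefficient_of_variation(values):
    """Standard deviation divided by the mean."""
    return values.std() / values.mean()

nearly_centered = seasonal + 0.001  # mean close to zero, as in an additive decomposition
with_level = seasonal + 300.0       # same swings around a non-zero level

cv_centered = coefficient_of_variation(nearly_centered)
cv_level = coefficient_of_variation(with_level)
range_centered = nearly_centered.max() - nearly_centered.min()
range_level = with_level.max() - with_level.min()

print(f"CV, centered series: {cv_centered:.1f}")
print(f"CV, series with level: {cv_level:.4f}")
print(f"range in both cases: {range_centered:.1f} vs {range_level:.1f}")
```

Both series have identical swings, yet their coefficients of variation differ by several orders of magnitude, which is why the range and standard deviation are easier to interpret here.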


In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
seasonal = decomposition.seasonal

seasonal_amplitude = seasonal.max() - seasonal.min()
std_dev = seasonal.std()
coefficient_of_variation = std_dev / seasonal.mean()
amplitude_mean = seasonal.mean()
amplitude_median = seasonal.median()
amplitude_variance = seasonal.var()
amplitude_percentile_95 = np.percentile(seasonal, 95)
amplitude_percentile_5 = np.percentile(seasonal, 5)

print('Direct Solar Radiation:')
print(f"Amplitude (Range) of Seasonal Component: {seasonal_amplitude:.2f}")
print(f"Mean of Seasonal Component: {amplitude_mean}")
print(f"Median of Seasonal Component: {amplitude_median:.2f}")
print(f"Variance of Seasonal Component: {amplitude_variance:.2f}")
print(f"Standard Deviation of Seasonal Component: {std_dev:.2f}")
print(f"Coefficient of Variation of Seasonal Component: {coefficient_of_variation:.4f}")
print(f"95th Percentile of Seasonal Component: {amplitude_percentile_95:.2f}")
print(f"5th Percentile of Seasonal Component: {amplitude_percentile_5:.2f}")

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
seasonal = decomposition.seasonal

seasonal_amplitude = seasonal.max() - seasonal.min()
std_dev = seasonal.std()
coefficient_of_variation = std_dev / seasonal.mean()
amplitude_mean = seasonal.mean()
amplitude_median = seasonal.median()
amplitude_variance = seasonal.var()
amplitude_percentile_95 = np.percentile(seasonal, 95)
amplitude_percentile_5 = np.percentile(seasonal, 5)

print('Surface Solar Radiation Downward:')
print(f"Amplitude (Range) of Seasonal Component: {seasonal_amplitude:.2f}")
print(f"Mean of Seasonal Component: {amplitude_mean}")
print(f"Median of Seasonal Component: {amplitude_median:.2f}")
print(f"Variance of Seasonal Component: {amplitude_variance:.2f}")
print(f"Standard Deviation of Seasonal Component: {std_dev:.2f}")
print(f"Coefficient of Variation of Seasonal Component: {coefficient_of_variation:.4f}")
print(f"95th Percentile of Seasonal Component: {amplitude_percentile_95:.2f}")
print(f"5th Percentile of Seasonal Component: {amplitude_percentile_5:.2f}")

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation
duplicates = data.index.duplicated()
data = data[~duplicates].copy()

result = seasonal_decompose(data, model='additive') 

period = 30
num_cycles = len(data) // period
seasonal_amplitudes = []
for i in range(num_cycles):
    start_idx = i * period
    end_idx = min((i + 1) * period, len(data))
    seasonal_component = result.seasonal[start_idx:end_idx]
    amplitude = seasonal_component.max() - seasonal_component.min()
    cv = (seasonal_component.std() / seasonal_component.mean()) * 100
    seasonal_amplitudes.append({'Period': i + 1, 'Amplitude_30': amplitude, 'CV_30': cv})
amplitudes_df_30 = pd.DataFrame(seasonal_amplitudes)

period = 7
num_cycles = len(data) // period
seasonal_amplitudes = []
for i in range(num_cycles):
    start_idx = i * period
    end_idx = min((i + 1) * period, len(data))
    seasonal_component = result.seasonal[start_idx:end_idx]
    amplitude = seasonal_component.max() - seasonal_component.min()
    cv = (seasonal_component.std() / seasonal_component.mean()) * 100
    seasonal_amplitudes.append({'Period': i + 1, 'Amplitude_7': amplitude, 'CV_7': cv})
amplitudes_df_7 = pd.DataFrame(seasonal_amplitudes)

amplitudes_df = pd.concat([amplitudes_df_7, amplitudes_df_30], axis=1)
amplitudes_df.head(7)

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').surface_solar_radiation_downwards
duplicates = data.index.duplicated()
data = data[~duplicates].copy()

result = seasonal_decompose(data, model='additive') 

period = 30
num_cycles = len(data) // period
seasonal_amplitudes = []
for i in range(num_cycles):
    start_idx = i * period
    end_idx = min((i + 1) * period, len(data))
    seasonal_component = result.seasonal[start_idx:end_idx]
    amplitude = seasonal_component.max() - seasonal_component.min()
    cv = (seasonal_component.std() / seasonal_component.mean()) * 100
    seasonal_amplitudes.append({'Period': i + 1, 'Amplitude_30': amplitude, 'CV_30': cv})
amplitudes_df_30 = pd.DataFrame(seasonal_amplitudes)

period = 7
num_cycles = len(data) // period
seasonal_amplitudes = []
for i in range(num_cycles):
    start_idx = i * period
    end_idx = min((i + 1) * period, len(data))
    seasonal_component = result.seasonal[start_idx:end_idx]
    amplitude = seasonal_component.max() - seasonal_component.min()
    cv = (seasonal_component.std() / seasonal_component.mean()) * 100
    seasonal_amplitudes.append({'Period': i + 1, 'Amplitude_7': amplitude, 'CV_7': cv})
amplitudes_df_7 = pd.DataFrame(seasonal_amplitudes)

amplitudes_df = pd.concat([amplitudes_df_7, amplitudes_df_30], axis=1)
amplitudes_df.head(7)

Analyzing Residual Component:

The residuals appear to be approximately normally distributed, and the autocorrelation plot is flat with no significant spikes. This indicates that the residuals are centered and have consistent variability.

The flat autocorrelation plot also means that the residuals are not correlated with their own past values at any lag, which points to randomness rather than a systematic pattern.

Together, the near-normal distribution and the absence of significant autocorrelation suggest that the seasonal decomposition adequately captures the trend and seasonality in the solar radiation series. This implies goodness of fit and model adequacy.

Whatever variation remains in the residuals is therefore random noise or unexplained variability rather than leftover trend or seasonality.

We can therefore use the decomposition model to forecast future observations, as the model adequately captures the data's variability.
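
The autocorrelation claim can also be checked numerically. As a rough sketch on synthetic data (the helper `autocorrelation` and both series are illustrative, not part of the notebook), the lag-k sample autocorrelation of uncorrelated residuals should stay within about ±1.96/sqrt(n), while an autocorrelated series clearly exceeds that bound:

```python
import numpy as np

def autocorrelation(x, lag):
    """Sample autocorrelation at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.sum(centered[:-lag] * centered[lag:]) / np.sum(centered ** 2))

rng = np.random.default_rng(42)
n = 5000
white_noise = rng.normal(size=n)

# AR(1) series with coefficient 0.8: strongly correlated with its own past
ar1 = np.zeros(n)
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + white_noise[i]

bound = 1.96 / np.sqrt(n)  # approximate 95% band for uncorrelated data
r_noise = autocorrelation(white_noise, 1)
r_ar1 = autocorrelation(ar1, 1)

print(f"95% band: +/-{bound:.4f}")
print(f"lag-1 autocorrelation, white noise: {r_noise:.4f}")
print(f"lag-1 autocorrelation, AR(1):       {r_ar1:.4f}")
```

Applying the same function to the decomposition residuals (after dropping NaN values) gives a number to set beside the plot.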

When we try to explain variance in solar radiation using date as a predictor, we find that linear models are not a good fit. 

In [None]:
data = enefit_dict['forecast_weather'].set_index('forecast_datetime').direct_solar_radiation
duplicates = data.index.duplicated()
data = data[~duplicates].copy()
decomposition = seasonal_decompose(data, model='additive')
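# Drop the NaN edges left by the moving-average trend; NaNs would propagate through autocorrelation_plot
residual = decomposition.resid.dropna()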

plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
residual.plot(title='Residuals Time Plot')

plt.subplot(2, 2, 2)
residual.plot(kind='hist', bins=20, title='Residuals Distribution')

plt.subplot(2, 2, 3)
residual.plot(kind='kde', title='Residuals Density Plot')

plt.subplot(2, 2, 4)
pd.plotting.autocorrelation_plot(residual)
plt.title('Autocorrelation Plot of Residuals')

plt.tight_layout()
plt.suptitle('Residuals Plot - Direct Solar Radiation', y=1.05, fontsize=25)
plt.show()

In [None]:
data = data.to_frame()
data['time'] = np.arange(len(data.index))
data['lag_1'] = data.direct_solar_radiation.shift(1)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 7))
ax1.plot('time', 'direct_solar_radiation', data=data, color='0.75')
sns.regplot(
    x='time', y='direct_solar_radiation', data=data,
    ci=None, scatter_kws=dict(color='.25'), ax=ax1
)
ax1.set_title('Time Plot of Direct Solar Radiation')

sns.regplot(
    x='lag_1', y='direct_solar_radiation', data=data,
    ci=None, scatter_kws=dict(color='0.25'), ax=ax2
)
ax2.set_aspect('equal')
ax2.set_title('Lag Plot of Direct Solar Radiation')

plt.tight_layout()
plt.show()

OLS Model:

In [None]:
formula = 'direct_solar_radiation ~ time'
model = smf.ols(formula, data=data)
results = model.fit()
print(results.summary())

In [None]:
formula = 'direct_solar_radiation ~ lag_1'
model = smf.ols(formula, data=data)
results = model.fit()
print(results.summary())

Next, we forecast solar radiation as a time series. We'll use an ARIMAX or a SARIMAX model, depending on whether seasonal differencing is required.
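
To make the distinction concrete: first-order differencing (the `d` in `order`) removes a trend, while seasonal differencing at lag `s` (the `D` in `seasonal_order`) removes a pattern that repeats every `s` observations. A small sketch on a synthetic series with a linear trend and a weekly pattern (all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend (2 per step) plus a pattern repeating every 7 steps
t = np.arange(8 * 7)
weekly_pattern = np.array([0.0, 3.0, 5.0, 4.0, 2.0, -1.0, -2.0])
y = pd.Series(2.0 * t + weekly_pattern[t % 7])

first_diff = y.diff(1).dropna()      # removes the trend, keeps the weekly pattern
seasonal_diff = y.diff(7).dropna()   # removes the weekly pattern, leaves a constant
both = y.diff(7).diff(1).dropna()    # removes both

print("distinct values after first differencing:   ", first_diff.nunique())
print("distinct values after seasonal differencing:", seasonal_diff.nunique())
print("distinct values after both:                 ", both.nunique())
```

If a repeating pattern survives first differencing, as it does here, a seasonal term is worth considering. If the first difference already looks stationary, a non-seasonal ARIMAX may be enough.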

In [None]:
exog_cols = ['temperature', 'dewpoint', 'cloudcover_total', 'snowfall', 'total_precipitation']
data = enefit_dict['forecast_weather'].set_index('forecast_datetime')[exog_cols + ['direct_solar_radiation']].resample('D').mean()
target = data.direct_solar_radiation
exog = data[exog_cols]

In [None]:
data.head()

First order differencing:

In [None]:
target_diff = target.diff()

ad_fuller_result = adfuller(target_diff[1:])

print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
target.plot(title='Time Series - Observed Direct Solar Radiation', ax=ax1, color=deep_colors[0], label='Observed')
target_diff.plot(title='Time Series - Differenced Direct Solar Radiation', ax=ax2, color=deep_colors[1], label='Differenced')
ax1.legend()
ax2.legend()
fig.autofmt_xdate()
plt.tight_layout()
plt.show()

Plot the autocorrelation and partial autocorrelation of the differenced series to judge whether the target can be modeled using ARIMAX:

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(target_diff[1:], lags=10, ax=ax1)
plot_pacf(target_diff[1:], lags=10, ax=ax2)
plt.show()

SARIMAX Modeling:

In [None]:
def optimize_SARIMAX(
        endog: Union[pd.Series, list], exog: Union[pd.DataFrame, pd.Series, list],
        order_list: list, d: int,
        # D: int, s: int
) -> pd.DataFrame:
    """Fit one SARIMAX model per (p, q) pair and rank the fits by AIC (lowest first)."""
    results = []

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        for order in notebook.tqdm(order_list):
            try:
                model = SARIMAX(
                    endog,
                    exog,
                    order=(order[0], d, order[1]),
                    # seasonal_order=(order[2], D, order[3], s),
                    simple_differencing=False).fit(disp=False)
            except Exception:
                # Skip orders that fail to fit (e.g. numerical errors during estimation)
                continue

            results.append([order, model.aic])

    # Passing the column names up front also works when no model could be fitted
    result_df = pd.DataFrame(results, columns=['(p,q)', 'AIC'])
    return result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True)

In [None]:
ps = range(0, 10, 1)
qs = range(0, 10, 1)
d = 1

SARIMAX_order_list = list(product(ps, qs))
SARIMAX_result_df = optimize_SARIMAX(target, exog, SARIMAX_order_list, d)
SARIMAX_result_df.head()

In [None]:
SARIMAX_model = SARIMAX(
    endog=target, 
    exog=exog,
    order=(SARIMAX_result_df.iloc[0,0][0], d, SARIMAX_result_df.iloc[0,0][1]),
    simple_differencing=False
    )
SARIMAX_model_fit = SARIMAX_model.fit(disp=False)
print(SARIMAX_model_fit.summary())

In [None]:
SARIMAX_model_fit.plot_diagnostics(figsize=(10,8))
plt.show()

In [None]:
residuals = SARIMAX_model_fit.resid
lb_df = acorr_ljungbox(residuals, np.arange(1, 11, 1))

lb_df

Surface Solar Radiation:

In [None]:
exog_cols = ['temperature', 'dewpoint', 'cloudcover_total', 'snowfall', 'total_precipitation']
data = enefit_dict['forecast_weather'].set_index('forecast_datetime')[exog_cols + ['surface_solar_radiation_downwards']].resample('D').mean()
target = data.surface_solar_radiation_downwards
exog = data[exog_cols]

In [None]:
target_diff = target.diff()

ad_fuller_result = adfuller(target_diff[1:])

print(f'ADF Statistic: {ad_fuller_result[0]}')
print(f'p-value: {ad_fuller_result[1]}')

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(target_diff[1:], lags=10, ax=ax1)
plot_pacf(target_diff[1:], lags=10, ax=ax2)
plt.show()

In [None]:
ps = range(0, 10, 1)
qs = range(0, 10, 1)
d = 1

SARIMAX_order_list = list(product(ps, qs))
SARIMAX_result_df = optimize_SARIMAX(target, exog, SARIMAX_order_list, d)
SARIMAX_result_df.head()

In [None]:
SARIMAX_model = SARIMAX(
    endog=target,
    exog=exog,
    order=(SARIMAX_result_df.iloc[0,0][0], d, SARIMAX_result_df.iloc[0,0][1]),
    simple_differencing=False
)
SARIMAX_model_fit = SARIMAX_model.fit(disp=False)
print(SARIMAX_model_fit.summary())

In [None]:
SARIMAX_model_fit.plot_diagnostics(figsize=(10,8))
plt.show()

In [None]:
residuals = SARIMAX_model_fit.resid
lb_df = acorr_ljungbox(residuals, np.arange(1, 11, 1))

lb_df