<a href="https://www.kaggle.com/code/angelicababei/enefiteda?scriptVersionId=172931389" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

from datetime import datetime, timedelta

In [2]:
df = pd.read_csv('/kaggle/input/trainingdata/trainingData.csv') 
# The data contains the categorical columns 'county', 'is_business', 'is_consumption', and 'product_type', which together determine a time series
# 'target' is the value we want to predict; '48h_shift' is the 48-hour lag. We have to predict 48 hours in advance
# 'datetime' is the timestamp that we need the forecast for

df.head()

Unnamed: 0,county,is_business,product_type,target,is_consumption,datetime,data_block_id,row_id,prediction_unit_id
0,0,0,1,0.713,0,2021-09-01 00:00:00,0,0,0
1,0,0,1,96.59,1,2021-09-01 00:00:00,0,1,0
2,0,0,2,0.0,0,2021-09-01 00:00:00,0,2,1
3,0,0,2,17.314,1,2021-09-01 00:00:00,0,3,1
4,0,0,3,2.904,0,2021-09-01 00:00:00,0,4,2


# Introduction

In this notebook, we perform some basic EDA for the Enefit competition on Kaggle, which aims to predict energy generation and consumption in Estonia. 

Citation:
Kristjan Eljand, Martin Laid, Jean-Baptiste Scellier, Sohier Dane, Maggie Demkin, Addison Howard. (2023). 
Enefit - Predict Energy Behavior of Prosumers. Kaggle. https://kaggle.com/competitions/predict-energy-behavior-of-prosumers


# EDA

## 1. Plot the timeseries

In [3]:
# import utility script
from enefitutils import Plotting
timeseries_plots = Plotting(df)

Next, we plot each time series, selected via the dropdown menus. Each series is identified by its ('county', 'is_business', 'is_consumption', 'product_type') tuple. Not all tuples occur in the data.
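
To make the selection concrete, here is a minimal sketch of how one series could be pulled out by its key tuple, and how to list the tuples that actually occur. The `toy` frame, its values, and the `get_series` helper are hypothetical; the `Plotting` class in `enefitutils` may do this differently.

```python
import pandas as pd

# Hypothetical miniature of the training data; the values are made up.
toy = pd.DataFrame({
    "county": [0, 0, 0, 0, 1, 1],
    "is_business": [0, 0, 0, 0, 1, 1],
    "is_consumption": [0, 1, 0, 1, 1, 1],
    "product_type": [1, 1, 1, 1, 3, 3],
    "datetime": pd.to_datetime([
        "2021-09-01 00:00", "2021-09-01 00:00",
        "2021-09-01 01:00", "2021-09-01 01:00",
        "2021-09-01 00:00", "2021-09-01 01:00",
    ]),
    "target": [0.7, 96.6, 0.5, 77.7, 12.0, 11.0],
})

keys = ["county", "is_business", "is_consumption", "product_type"]

# Key tuples that actually occur (not every combination does).
present = set(toy[keys].drop_duplicates().itertuples(index=False, name=None))


def get_series(frame, county, is_business, is_consumption, product_type):
    """Return the target series for one key tuple, indexed by datetime."""
    mask = (
        (frame["county"] == county)
        & (frame["is_business"] == is_business)
        & (frame["is_consumption"] == is_consumption)
        & (frame["product_type"] == product_type)
    )
    return frame.loc[mask].set_index("datetime")["target"].sort_index()


series = get_series(toy, county=0, is_business=0, is_consumption=1, product_type=1)
```

Checking a tuple against `present` before plotting avoids producing an empty plot for a combination that has no data.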

In [4]:
timeseries_plots.series_plot()


## 2. Decompose into seasonal, trend and residual components.

We notice a few things: the data seems to follow a yearly seasonality, which makes sense because energy consumption and generation depend on weather conditions.
Moreover, the variation in the yearly pattern appears to be proportional to the target level. Since STL itself is additive, we can obtain a multiplicative decomposition by applying it to a logarithmic transformation of the target data.
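
As a small illustration of why the logarithm helps, the sketch below builds a synthetic series whose yearly swing grows with its level and compares the swing in each year before and after the transform. All numbers are invented for illustration. Note that the real target contains zeros, so a variant such as `np.log1p` would be needed there; how `enefitutils` handles this is not shown in this notebook.

```python
import numpy as np

year = 365 * 24
hours = np.arange(2 * year)                               # two years of hourly steps
level = np.linspace(10.0, 40.0, hours.size)               # slowly rising level (made up)
yearly = 1.0 + 0.5 * np.sin(2 * np.pi * hours / year)     # multiplicative yearly factor
y = level * yearly                                        # swing grows with the level

log_y = np.log(y)                                         # log(level) + log(yearly)


def swing(x):
    """Peak-to-trough range of an array."""
    return x.max() - x.min()


# Remove the level additively in each scale, then compare year 2 against year 1.
detrended_raw = y - level
detrended_log = log_y - np.log(level)
raw_ratio = swing(detrended_raw[year:]) / swing(detrended_raw[:year])
log_ratio = swing(detrended_log[year:]) / swing(detrended_log[:year])

print(f"raw swing ratio (year 2 / year 1): {raw_ratio:.2f}")
print(f"log swing ratio (year 2 / year 1): {log_ratio:.2f}")
```

In the raw scale the second year's swing is clearly larger than the first, while in the log scale the two are the same, which is the situation an additive decomposition assumes.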

There is also a strong daily seasonality, which we incorporate in our decomposition. While there might also be a weekly seasonality, it is much weaker than the daily and yearly one, so we choose to exclude it from our decomposition.

We also print the strength of the trend and seasonality.
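
A common definition of these strengths in the forecasting literature is `max(0, 1 - Var(R) / Var(C + R))`, where `R` is the remainder and `C` is the trend or seasonal component. The sketch below applies it to synthetic components; whether `stl_decomposition` uses exactly this formula is an assumption, and the `strength` helper is hypothetical.

```python
import numpy as np


def strength(component, remainder):
    """max(0, 1 - Var(R) / Var(C + R)); values near 1 indicate a strong component."""
    ratio = np.var(remainder) / np.var(component + remainder)
    return float(max(0.0, 1.0 - ratio))


# Synthetic components standing in for the output of a decomposition.
rng = np.random.default_rng(0)
t = np.arange(60 * 24)                          # 60 days of hourly steps
trend = 0.002 * t                               # slow linear drift
daily = np.sin(2 * np.pi * t / 24)              # 24-hour cycle
remainder = rng.normal(scale=0.3, size=t.size)  # noise

trend_strength = strength(trend, remainder)
daily_strength = strength(daily, remainder)
noise_only = strength(np.zeros_like(remainder), remainder)

print(f"trend strength: {trend_strength:.2f}")
print(f"daily seasonal strength: {daily_strength:.2f}")
print(f"strength of an absent component: {noise_only:.2f}")
```

A component that explains nothing beyond the remainder scores 0, and a component that dominates the remainder scores close to 1.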

In [5]:
timeseries_plots.stl_decomposition() # this might take ~ 30 seconds


Occasionally (especially when 'is_consumption' is False), the residuals seem to retain a strong yearly pattern, despite the yearly component having already been subtracted. This behavior is more pronounced when the yearly seasonality is not very strong, since the sum of the yearly and daily patterns forms a very thick band, which leaves its imprint when subtracted from the series.

## 3. Correlations between different timeseries

Another thing we can notice from the initial plots is that the series seem to be correlated; in particular, if we fix a county, product type and business value, the consumption should fall when energy is generated, and vice versa. In the following, we explore two correlation matrices: one as above, between series that share the same county, product type and business value (so they differ only in describing energy generation vs. energy consumption), and one between series that share a fixed ('is_business', 'is_consumption', 'product_type') tuple and differ only in county.

### 3.1. Correlations between energy consumption and energy generation

We expect energy consumption and energy generation to be negatively correlated. We can examine this assumption in the following dropdown plot, where we can choose values for county, is_business, and product_type.
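
The mechanics of such a comparison can be sketched as follows: pivot the long table so that generation and consumption become two columns aligned on `datetime`, then compute their Pearson correlation. The data here is synthetic and negatively related by construction, so the result says nothing about the real series; `gen_cons_corr` in `enefitutils` may be implemented differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 14 * 24                                                   # two weeks of hourly steps
times = pd.Timestamp("2021-09-01") + pd.to_timedelta(np.arange(n), unit="h")
hour = times.hour.to_numpy()

# Made-up solar-like generation: positive during the day, zero at night.
generation = 50.0 * np.clip(np.sin(np.pi * (hour - 6) / 12), 0.0, None)
# Made-up consumption that drops when generation is high (an assumption).
consumption = 80.0 - 0.8 * generation + rng.normal(0.0, 3.0, n)

# Long format, as in the training data, for one (county, is_business, product_type).
toy = pd.concat([
    pd.DataFrame({"datetime": times, "is_consumption": 0, "target": generation}),
    pd.DataFrame({"datetime": times, "is_consumption": 1, "target": consumption}),
], ignore_index=True)

# One column per is_consumption value, aligned on datetime.
wide = toy.pivot(index="datetime", columns="is_consumption", values="target")
corr = wide[0].corr(wide[1])

print(f"generation vs. consumption correlation: {corr:.2f}")
```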

In [6]:
timeseries_plots.gen_cons_corr() 


### 3.2. Correlations between counties

At the same time, we expect energy readings to be positively correlated when ranging over counties for a specific choice of is_business, is_consumption, and product_type.
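
A correlation matrix over counties can be obtained by pivoting the counties into columns and calling `DataFrame.corr`. The sketch below uses synthetic series that share a daily pattern by construction, so it only demonstrates the mechanics; the county count, scales, and noise level are invented, and `counties_corr` in `enefitutils` may work differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 7 * 24                                                    # one week of hourly steps
times = pd.Timestamp("2021-09-01") + pd.to_timedelta(np.arange(n), unit="h")
shared = np.sin(2 * np.pi * np.arange(n) / 24)                # daily pattern common to all counties

# One made-up series per county for a fixed (is_business, is_consumption, product_type).
frames = []
for county in range(4):
    target = (county + 1) * (shared + 2.0) + rng.normal(0.0, 0.2, n)
    frames.append(pd.DataFrame({"county": county, "datetime": times, "target": target}))
toy = pd.concat(frames, ignore_index=True)

# One column per county, aligned on datetime, then pairwise Pearson correlations.
wide = toy.pivot(index="datetime", columns="county", values="target")
corr_matrix = wide.corr()

print(corr_matrix.round(2))
```

If a county is missing some timestamps, the pivot leaves `NaN` in those cells and `DataFrame.corr` computes each pair from the rows where both values are present.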

In [7]:
timeseries_plots.counties_corr() # each plot might take ~ 30 seconds to update
