# Enefit - Predict Energy Behavior of Prosumers

The challenge in this competition is to predict the amount of electricity produced and consumed by Estonian energy customers who have installed solar panels. The dataset includes weather data, the relevant energy prices, and records of the installed photovoltaic capacity.

This is a forecasting competition using the time series API.

**Description**

The number of prosumers is rapidly increasing, and solving the problems of energy imbalance and their rising costs is vital. If left unaddressed, this could lead to increased operational costs, potential grid instability, and inefficient use of energy resources. If this problem were effectively solved, it would significantly reduce the imbalance costs, improve the reliability of the grid, and make the integration of prosumers into the energy system more efficient and sustainable. Moreover, it could potentially incentivize more consumers to become prosumers, knowing that their energy behavior can be adequately managed, thus promoting renewable energy production and use.

**About us**

Enefit is one of the biggest energy companies in the Baltic region. As experts in the field of energy, we help customers plan their green journey in a personal and flexible manner as well as implement it by using environmentally friendly energy solutions.
At present, Enefit is attempting to solve the imbalance problem by developing internal predictive models and relying on third-party forecasts. However, these methods have proven to be insufficient due to their low accuracy in forecasting the energy behavior of prosumers. The shortcomings of these current methods lie in their inability to accurately account for the wide range of variables that influence prosumer behavior, leading to high imbalance costs. By opening up the challenge to the world's best data scientists through the Kaggle platform, Enefit aims to leverage a broader pool of expertise and novel approaches to improve the accuracy of these predictions and consequently reduce the imbalance and associated costs.

**Evaluation**

Submissions are evaluated on the Mean Absolute Error (MAE) between the predicted value and the observed target. The formula is given by:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|$$

Where:
* $n$ is the total number of data points.
* $y_i$ is the predicted value for data point $i$.
* $x_i$ is the observed value for data point $i$.
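
As a quick illustration of the metric defined above, here is a minimal sketch that computes MAE with NumPy. The function name `mean_absolute_error` and the sample values are made up for this example and are not part of the competition materials.

```python
import numpy as np

def mean_absolute_error(y_pred, x_obs):
    """Return the mean of |y_i - x_i| over all data points."""
    y_pred = np.asarray(y_pred, dtype=float)
    x_obs = np.asarray(x_obs, dtype=float)
    if y_pred.shape != x_obs.shape:
        raise ValueError("predictions and observations must have the same shape")
    return float(np.mean(np.abs(y_pred - x_obs)))

# Illustrative values only (not competition data).
predicted = [1.0, 2.5, 0.0, 4.0]
observed = [1.5, 2.0, 0.0, 3.0]

mae = mean_absolute_error(predicted, observed)
print(mae)  # absolute errors are 0.5, 0.5, 0.0, 1.0, so the mean is 0.5
```

Because every error enters linearly, a single large miss weighs less here than it would under a squared-error metric.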

**Submitting**

You must submit to this competition using the provided Python time-series API, which ensures that models do not peek forward in time. To use the API, follow the template in this [notebook](https://www.kaggle.com/code/sohier/enefit-basic-submission-demo).

**Timeline**

This is a future data prediction competition with an active training phase and a second period where selected submissions will be evaluated against future ground truth data.

*Training Timeline*

* November 1, 2023 - Start Date.
* January 24, 2024 - Entry Deadline. You must accept the competition rules before this date in order to compete.
* January 24, 2024 - Team Merger Deadline. This is the last day participants may join or merge teams.
* January 31, 2024 - Final Submission Deadline.

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

*Prediction Timeline*

After the final submission deadline, the leaderboard will be updated periodically as selected submissions are evaluated against newly collected data. We anticipate 1-3 interim updates before the final evaluation.

* April 30, 2024 - Competition End Date

**Prizes**

* 1st Place - $ 15,000
* 2nd Place - $ 10,000
* 3rd Place - $ 8,000
* 4th Place - $ 7,000
* 5th Place - $ 5,000
* 6th Place - $ 5,000

**Code Requirements**

Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

* CPU Notebook <= 9 hours run-time
* GPU Notebook <= 9 hours run-time
* Internet access disabled
* Freely & publicly available external data is allowed, including pre-trained models
* Submission file must be named submission.csv and be generated by the API.

Please see the [Code Competition FAQ](https://www.kaggle.com/docs/competitions#notebooks-only-FAQ) for more information on how to submit, and review the [code debugging doc](https://www.kaggle.com/code-competition-debugging) if you encounter submission errors.

### Load Workspace

In [1]:
# Standard library
import datetime as dt
import io
import itertools
import json
import re
import warnings
import zipfile
from itertools import product
from typing import Union

# Data handling and plotting
import kaggle
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import opendatasets as od
import pandas as pd
import seaborn as sns
from tqdm import notebook

# Statistics and time-series modelling
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.graphics.gofplots import qqplot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import acf, adfuller, pacf

# Machine learning
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, classification_report
from sklearn.model_selection import train_test_split

# Apply the plotting style once all imports are done
mpl.style.use('fivethirtyeight')

### Load the Dataset

In [2]:
def list_files_in_zip(zip_file_path):
    """Return the names of all members of the zip archive."""
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        return zip_ref.namelist()

zip_file_path = 'predict-energy-behavior-of-prosumers.zip'

enefit_files = list_files_in_zip(zip_file_path)
enefit_files

['client.csv',
 'county_id_to_name_map.json',
 'electricity_prices.csv',
 'enefit/__init__.py',
 'enefit/competition.cpython-310-x86_64-linux-gnu.so',
 'example_test_files/client.csv',
 'example_test_files/electricity_prices.csv',
 'example_test_files/forecast_weather.csv',
 'example_test_files/gas_prices.csv',
 'example_test_files/historical_weather.csv',
 'example_test_files/revealed_targets.csv',
 'example_test_files/sample_submission.csv',
 'example_test_files/test.csv',
 'forecast_weather.csv',
 'gas_prices.csv',
 'historical_weather.csv',
 'public_timeseries_testing_util.py',
 'train.csv',
 'weather_station_to_county_mapping.csv']

In [3]:
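def read_csv_from_zip(zip_file_path, csv_file_name):
    """Read one CSV member of the zip archive into a DataFrame."""
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        with zip_ref.open(csv_file_name) as file:
            df = pd.read_csv(io.TextIOWrapper(file))
            return df

def read_json_from_zip(zip_file_path, json_file_name):
    """Read one JSON member of the zip archive into a DataFrame."""
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        with zip_ref.open(json_file_name) as file:
            data = json.load(file)
            # An explicit index is passed so that a flat object of scalar values
            # can be loaded; each value is then repeated on every row.
            df = pd.DataFrame(data, index=range(len(data)))
            return df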

In [4]:
enefit_dict = dict()
keys = [
    'client', 'county_id_to_name_map', 
    'electricity_prices', 'forecast_weather',
    'gas_prices', 'historical_weather', 
    'train', 'weather_station_to_county_mapping'
]

In [5]:
# Load each dataset from the archive, choosing the reader by file extension.
for key in keys:
    if key + '.csv' in enefit_files:
        csv_file_name = key + '.csv'
        enefit_dict[key] = read_csv_from_zip(zip_file_path, csv_file_name)
    elif key + '.json' in enefit_files:
        json_file_name = key + '.json'
        enefit_dict[key] = read_json_from_zip(zip_file_path, json_file_name)

In [6]:
def clean_date(df):
    date_cols = ['date', 'datetime', 'forecast_date', 'origin_date', 'forecast_datetime', 'origin_datetime']
    for col in df.columns:
        if col in date_cols:
            df[col] = pd.to_datetime(df[col])
    return df

for key in keys:
    enefit_dict[key] = clean_date(enefit_dict[key])

In [7]:
# Every row of this frame is identical, so keep the first one as a Series of county names.
enefit_dict['county_id_to_name_map'] = enefit_dict['county_id_to_name_map'].iloc[0]
enefit_dict['county_id_to_name_map']

0          HARJUMAA
1           HIIUMAA
2       IDA-VIRUMAA
3          JÄRVAMAA
4         JÕGEVAMAA
5     LÄÄNE-VIRUMAA
6          LÄÄNEMAA
7          PÄRNUMAA
8          PÕLVAMAA
9          RAPLAMAA
10         SAAREMAA
11         TARTUMAA
12          UNKNOWN
13         VALGAMAA
14      VILJANDIMAA
15          VÕRUMAA
Name: 0, dtype: object
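
The county Series above can be used to attach readable names to a `county` column. The sketch below uses toy stand-ins rather than the real data, and it rests on two assumptions: that the Series index holds the JSON keys as strings (JSON object keys are always strings), and that `county` is stored as integers in the CSV files. The names `county_names` and `toy` are hypothetical.

```python
import pandas as pd

# Toy stand-in for enefit_dict['county_id_to_name_map'] (assumed string index).
county_names = pd.Series({'0': 'HARJUMAA', '1': 'HIIUMAA', '11': 'TARTUMAA'})

# Toy stand-in for a frame with an integer `county` column (assumption).
toy = pd.DataFrame({'county': [0, 11, 1, 0]})

# Cast the index to int so that it lines up with the integer county codes;
# without this step every lookup would miss and return NaN.
county_names.index = county_names.index.astype(int)

# Series.map looks each county code up in the index of county_names.
toy['county_name'] = toy['county'].map(county_names)
print(toy)
```

If the real index turns out to be integer already, the cast is harmless and can be dropped.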

### Create Training Data

**Train**

The features in this dataset are:
* county: An ID code for the county.
* is_business: Boolean for whether or not the prosumer is a business.
* product_type: ID code with the following mapping of codes to contract types: {0: "Combined", 1: "Fixed", 2: "General service", 3: "Spot"}.
* target: The consumption or production amount for the relevant segment for the hour. The segments are defined by the county, is_business, and product_type.
* is_consumption: Boolean for whether or not this row's target is consumption or production.
* datetime: The Estonian time in EET (UTC+2) / EEST (UTC+3). It describes the start of the 1-hour period on which target is given.
* data_block_id: All rows sharing the same data_block_id will be available at the same forecast time. This is a function of what information is available when forecasts are actually made, at 11 AM each morning. For example, if the forecast weather data_block_id for predictions made on October 31st is 100 then the historic weather data_block_id for October 31st will be 101 as the historic weather data is only actually available the next day.
* row_id: A unique identifier for the row.
* prediction_unit_id: A unique identifier for the county, is_business, and product_type combination. New prediction units can appear or disappear in the test set.
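
To make the feature descriptions above concrete, the following sketch decodes `product_type` with the mapping from the list and checks that each (county, is_business, product_type) segment carries exactly one `prediction_unit_id`. The frame `toy_train` and its values are invented for illustration; only the column names and the code-to-contract mapping come from the description.

```python
import pandas as pd

# Mapping of product_type codes to contract types, as listed in the description.
PRODUCT_TYPES = {0: 'Combined', 1: 'Fixed', 2: 'General service', 3: 'Spot'}

# Invented rows that follow the documented schema (not real competition data).
toy_train = pd.DataFrame({
    'county': [0, 0, 0, 0, 11, 11],
    'is_business': [0, 0, 1, 1, 0, 0],
    'product_type': [1, 1, 3, 3, 1, 1],
    'is_consumption': [0, 1, 0, 1, 0, 1],
    'target': [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
    'prediction_unit_id': [0, 0, 1, 1, 2, 2],
})

# Decode the contract type for readability.
toy_train['product_name'] = toy_train['product_type'].map(PRODUCT_TYPES)

# A segment is one (county, is_business, product_type) combination; count how
# many distinct prediction_unit_id values appear in each segment.
segment_cols = ['county', 'is_business', 'product_type']
ids_per_segment = toy_train.groupby(segment_cols)['prediction_unit_id'].nunique()
one_id_per_segment = bool((ids_per_segment == 1).all())

print(toy_train[['prediction_unit_id', 'product_name', 'is_consumption', 'target']])
print(one_id_per_segment)
```

Running the same uniqueness check on `enefit_dict['train']` is a cheap sanity test before grouping by `prediction_unit_id`.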

In [None]:
enefit_dict['train'].head()