<a href="https://colab.research.google.com/github/carlos-alves-one/-Energy-Comp/blob/main/enefit_project_FV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Description

**Understand the specific problem of energy imbalance caused by prosumers and how the model can help Enefit**

Specific Problem: Energy Imbalance Resulting from Prosumers

The primary concern is the energy imbalance that arises when actual energy consumption or generation differs from what was anticipated. Prosumers aggravate the problem because they act as both energy consumers and producers, and their usage and generation can be erratic. This creates logistical and financial difficulties for energy firms like Enefit, including the struggle to align supply and demand and the costs incurred by the resulting imbalance.

The Role of the Model in Supporting Enefit

The model aims to address these difficulties by offering precise forecasts of prosumers' energy usage and production. By doing so, the model will:

1. Increase Forecasting Precision: Enhance Enefit's capacity to anticipate energy demands and production levels accurately.
   
2. Minimise Imbalance Expenses: Enefit can optimise its energy allocation by utilising more accurate forecasts, hence decreasing the expenses linked to energy imbalance.

3. Enhance Resource Allocation Efficiency: Precise predictions will allow Enefit to distribute resources more optimally, thereby minimising waste and decreasing operational expenses.

4. Enhance Strategic Decision-Making: Enefit can improve its ability to make strategic decisions on infrastructure investments and policy changes by gaining deeper insights into consumer behaviour.

5. Encourage Sustainable Habits: Enefit may encourage prosumers to use renewable energy sources through efficient energy management, facilitating the shift towards more environmentally friendly energy habits.

The model must incorporate multiple variables that influence prosumer behaviour, such as weather patterns, past energy usage trends, and price fluctuations. Its performance is assessed by Mean Absolute Error (MAE), so predictions should match the actual values as closely as possible to keep the error low.
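
As a concrete illustration of the metric, the sketch below computes MAE by hand with NumPy. The numbers are made up for the example and are not competition data.

```python
import numpy as np

# Made-up actual and predicted values, purely illustrative.
y_true = np.array([10.0, 0.0, 5.0, 8.0])
y_pred = np.array([12.0, 1.0, 5.0, 4.0])

# MAE is the mean of the absolute differences: (2 + 1 + 0 + 4) / 4.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 1.75
```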

The competition offers a dataset of historical meteorological data, energy pricing, and details regarding prosumer attributes. The given Python time-series API will guarantee that the model complies with the competition's specifications, including the prohibition of looking ahead in time and utilising just the available data for making predictions.

# Setup an Environment


This code snippet is written in Python and includes a series of import statements, bringing various Python libraries and functions into the current script. These libraries are primarily used for data manipulation, machine learning, and optimization. Let us break down each part:

1. **Import Statements**:
   - `import os`: Imports the OS module, which provides functions for interacting with the operating system.
   - `import gc`: Imports the garbage collector interface, which can manually trigger Python's garbage collection process.
   - `import pickle`: Imports the pickle module used for serializing (pickling) and deserializing (unpickling) Python object structures.
   - `import numpy as np`: Imports the NumPy library (as `np`), which is fundamental for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
   - `import pandas as pd`: Imports the pandas library (as `pd`), a powerful data manipulation and analysis tool.
   - `import polars as pl`: Imports the Polars library (as `pl`), another data manipulation and analysis library, similar to pandas but often faster for certain operations.
   - `from sklearn.model_selection import cross_val_score, cross_validate`: Imports `cross_val_score` and `cross_validate` functions from Scikit-Learn, used for cross-validation in machine learning.
   - `from sklearn.metrics import mean_absolute_error`: Imports the Mean Absolute Error (MAE) metric from Scikit-Learn, used to evaluate the performance of regression models.
   - `from sklearn.compose import TransformedTargetRegressor`: Imports a utility from Scikit-Learn to transform the target variable in regression problems.
   - `from sklearn.ensemble import VotingRegressor`: Imports the VotingRegressor from Scikit-Learn, a meta-regressor that fits several base regressors and averages their predictions.
   - `import lightgbm as lgb`: Imports LightGBM (as `lgb`), a gradient boosting framework that uses tree-based learning algorithms.
   - `!pip install optuna`: Uses the pip package installer to install Optuna, a library for hyperparameter optimization.
   - `import optuna`: Imports the Optuna library for optimizing machine learning models.

2. **Purpose of the Code**:
   - This code snippet creates an environment for data processing, machine learning, and hyperparameter optimization.
   - It is likely part of a larger script or project focused on building, evaluating, and optimizing machine learning models, particularly regression models, given the import of `mean_absolute_error` and `TransformedTargetRegressor`.
   - The use of `cross_val_score` and `cross_validate` suggests that cross-validation is used for model evaluation.
   - `VotingRegressor` and `lightgbm` indicate that ensemble methods and gradient boosting are used in model training.
   - `Optuna` is used to optimize the hyperparameters of these models to enhance performance.

This code, by itself, doesn't perform any data analysis or machine learning operations. It is a setup for such tasks, defining the necessary libraries and tools to be used in subsequent code.

In [2]:
import os
import gc
import pickle

import numpy as np
import pandas as pd
import polars as pl

from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import mean_absolute_error
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import VotingRegressor

import lightgbm as lgb

!pip install optuna

import optuna



# Cross-Validation Strategy

The provided code snippet defines a Python class named `MonthlyKFold`, a custom implementation of a cross-validation strategy. The class is designed for time series data where observations are organized by month and year. Let us break down the code:

1. **Class Definition - `MonthlyKFold`**:
   - The class `MonthlyKFold` is intended for cross-validation in machine learning, specifically for datasets with a time component (monthly data in this case).

2. **Constructor - `__init__(self, n_splits=3)`**:
   - The `__init__` method is the class's constructor. It initializes the instance with the number of splits (`n_splits`) for cross-validation. By default, the splits are set to 3 if not specified.

3. **Method - `split(self, X, y, groups=None)`**:
   - The `split` method is the main functionality of this class. It is called with the features (`X`) and the target (`y`), along with an optional `groups` parameter.
   - The method calculates a unique identifier for each month by combining the year and month values from the dataset. This assumes that `X` has columns named "year" and "month".
   - It then sorts these unique monthly timesteps.
   - The method iterates over the last `n_splits` months. For each of these months:
     - It determines the indices of the training set (data before the current month) and the test set (data of the current month).
     - It yields these indices, allowing for splitting the dataset into training and testing sets for each fold in the cross-validation.

4. **Method - `get_n_splits(self, X, y, groups=None)`**:
   - This method returns the number of splits (`n_splits`) the cross-validation process will use. This is a standard method in Scikit-Learn's cross-validator classes.

5. **Usage and Purpose**:
   - This custom cross-validator is particularly useful for time-series data where you want to ensure that the validation set comes chronologically after the training set, respecting the temporal order.
   - The design targets scenarios where models must be evaluated on unseen future data, a common requirement in time-series forecasting. Because each fold trains on all months before the test month, the training window grows from fold to fold.

In summary, `MonthlyKFold` is a custom class for performing time-based cross-validation, particularly suited for monthly data sets. It ensures that the model is continually trained on past data and tested on future data, adhering to the chronological order, which is crucial in time-series analysis.

In [3]:
class MonthlyKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        dates = 12 * X["year"] + X["month"]
        timesteps = sorted(dates.unique().tolist())
        X = X.reset_index()

        for t in timesteps[-self.n_splits:]:
            idx_train = X[dates.values < t].index
            idx_test = X[dates.values == t].index

            yield idx_train, idx_test

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
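
To make the splitting behaviour concrete, here is a small sketch that runs the splitter on a toy frame. The class is repeated (in condensed form) only so the example runs on its own; `X_toy` and `y_toy` are made-up data, not part of the notebook.

```python
import pandas as pd

class MonthlyKFold:
    """Condensed copy of the splitter above so this sketch is self-contained."""
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        dates = 12 * X["year"] + X["month"]
        timesteps = sorted(dates.unique().tolist())
        X = X.reset_index()
        for t in timesteps[-self.n_splits:]:
            yield X[dates.values < t].index, X[dates.values == t].index

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits

# Toy frame: two rows per month, from November 2022 to February 2023.
X_toy = pd.DataFrame({
    "year":  [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "month": [11, 11, 12, 12, 1, 1, 2, 2],
})
y_toy = pd.Series(range(len(X_toy)))

# With n_splits=2 the last two months become test folds in turn.
folds = [(list(tr), list(te)) for tr, te in MonthlyKFold(n_splits=2).split(X_toy, y_toy)]
for tr, te in folds:
    print("train:", tr, "test:", te)
# train: [0, 1, 2, 3] test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7]
```

Each test fold is one calendar month, and its training set contains only rows from earlier months.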


# Data Preparation

The provided code snippet defines a Python function named `to_pandas`, which is designed to convert datasets into a Pandas DataFrame and perform specific transformations on it. Let us break down the function:

1. **Function Definition - `to_pandas(X, y=None)`**:
   - The function `to_pandas` takes two arguments: `X` and an optional `y`. `X` is presumably a dataset, and `y` could be a target variable associated with `X`.
   - The function calls `.to_pandas()` on its inputs, so `X` and `y` are expected to be objects that expose that method, such as Polars DataFrames.

2. **Converting to Pandas DataFrame**:
   - If `y` is provided, the function concatenates `X` and `y` along the columns (axis=1) after converting them to Pandas DataFrames. This suggests that `X` and `y` are separate but related datasets, like features and target labels in machine learning.
   - If `y` is not provided, it converts `X` to a Pandas DataFrame.

3. **Data Transformation**:
   - The function defines a list of categorical columns (`cat_cols`). It then sets the "row_id" column as the index of the DataFrame and converts the columns in `cat_cols` to the categorical data type. This is often done to optimize memory usage and improve performance in certain operations.
   - It calculates the mean and standard deviation of a series of columns named "target_1" through "target_6" and creates new columns "target_mean" and "target_std" to store these values. This is typically done for feature engineering, where new features are derived from existing data.
   - It also calculates a ratio of "target_6" to "target_7" (with a small constant added to the denominator to avoid division by zero) and stores this in a new column "target_ratio".

4. **Return Value**:
   - The function returns the transformed DataFrame.

5. **Usage and Purpose**:
   - This function is likely used in a data processing pipeline, especially in contexts where machine learning models are being developed.
   - Its primary purpose is to prepare the data by converting it into a format suitable for analysis (Pandas DataFrame), setting the correct index, converting specific columns to categorical types for better processing, and creating new features valuable for predictive modelling.

In summary, `to_pandas` is a utility function for data preparation, particularly in a machine-learning context. It transforms datasets by converting them into Pandas DataFrames, setting appropriate data types, and engineering new features.

In [4]:
def to_pandas(X, y=None):
    cat_cols = ["county", "is_business", "product_type", "is_consumption", "category_1"]

    if y is not None:
        df = pd.concat([X.to_pandas(), y.to_pandas()], axis=1)
    else:
        df = X.to_pandas()

    df = df.set_index("row_id")
    df[cat_cols] = df[cat_cols].astype("category")

    df["target_mean"] = df[[f"target_{i}" for i in range(1, 7)]].mean(1)
    df["target_std"] = df[[f"target_{i}" for i in range(1, 7)]].std(1)
    df["target_ratio"] = df["target_6"] / (df["target_7"] + 1e-3)

    return df
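
The row-wise features can be illustrated on a tiny frame. This is a sketch with made-up values and only three of the `target_*` columns; the real function averages `target_1` through `target_6`.

```python
import pandas as pd

# Made-up values; the column names mirror those used in to_pandas.
df = pd.DataFrame({
    "target_1": [1.0, 0.0],
    "target_2": [3.0, 0.0],
    "target_6": [2.0, 0.0],
    "target_7": [4.0, 0.0],
})

lag_cols = ["target_1", "target_2", "target_6"]  # subset chosen for brevity
df["target_mean"] = df[lag_cols].mean(axis=1)    # row-wise mean
df["target_std"] = df[lag_cols].std(axis=1)      # row-wise sample std (ddof=1)

# The small constant keeps the ratio finite when target_7 is zero.
df["target_ratio"] = df["target_6"] / (df["target_7"] + 1e-3)

print(df[["target_mean", "target_std", "target_ratio"]])
```

In the second row every value is zero, and the ratio evaluates to 0.0 rather than NaN because of the added constant.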


# Hyperparameter Optimization

The provided code snippet defines a Python function named `lgb_objective`, which appears to be designed as an objective function for hyperparameter optimization using Optuna, specifically for a LightGBM regression model. Let us break down the function:

1. **Function Definition - `lgb_objective(trial)`**:
   - The function `lgb_objective` takes a single argument `trial`, representing an Optuna trial. An Optuna trial is used to suggest hyperparameters and evaluate their performance.

2. **Hyperparameter Suggestions**:
   - Inside the function, a dictionary named `params` holds the hyperparameters for a LightGBM model.
   - The function uses `trial.suggest_*` methods to define a range of values for various hyperparameters. These methods are part of Optuna and are used to sample hyperparameters from specified ranges:
     - `learning_rate`: A float between 0.01 and 0.1.
     - `colsample_bytree` and `colsample_bynode`: Floats between 0.5 and 1.0, specifying the subsample ratio of columns when constructing each tree/node.
     - `lambda_l1` and `lambda_l2`: Floats between 0.01 and 10.0, the L1 and L2 regularization strengths.
     - `min_data_in_leaf`: An integer between 4 and 256, specifying the minimum number of samples per leaf node.
     - `max_depth`: An integer between 5 and 10, specifying the maximum depth of the trees.
     - `max_bin`: An integer between 32 and 1024, specifying the maximum number of bins used for constructing histograms.
   - Fixed parameters like `n_iter`, `verbose`, `random_state`, and `objective` are also set.

3. **Model Training and Cross-Validation**:
   - A LightGBM regressor model is instantiated with the suggested parameters.
   - The function assumes the existence of a DataFrame `df_train`, from which it separates the features (`X`) and the target variable (`y`).
   - It instantiates the `MonthlyKFold` class defined earlier as `MonthlyKFold(1)`, so each trial is scored on a single split: the most recent month is held out and all earlier months are used for training.
   - The model's performance is evaluated using the `cross_val_score` function from Scikit-Learn, with the scoring metric set to 'neg_mean_absolute_error'.

4. **Objective Value Return**:
   - With `scoring='neg_mean_absolute_error'`, `cross_val_score` returns negated MAE values, because Scikit-Learn scorers follow a higher-is-better convention. The function multiplies the mean score by -1, so it returns the positive mean absolute error. Optuna minimizes the objective by default, so it will search for hyperparameters that give the lowest MAE.

5. **Usage and Purpose**:
   - This function is used in hyperparameter tuning using Optuna for a LightGBM regression model.
   - It is designed to iteratively test different sets of hyperparameters, evaluate their performance using cross-validation, and return a score that reflects the average performance of a model with those hyperparameters.
   - The Optuna framework uses this function to identify the best set of hyperparameters for the model based on the objective function's output.

In summary, `lgb_objective` is an objective function for hyperparameter optimization of a LightGBM regression model, utilizing cross-validation and designed to be used with the Optuna hyperparameter optimization framework.
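
The sign handling is easy to get wrong, so here is a minimal sketch of it. It uses a `DummyRegressor` and a hand-written single split instead of LightGBM and `MonthlyKFold`; both substitutions are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([0.0, 0.0, 0.0, 3.0, 3.0, 3.0])

# One fixed split: train on the first three rows, test on the last three.
cv = [(np.array([0, 1, 2]), np.array([3, 4, 5]))]

# The dummy model predicts the training mean (0.0), so the test MAE is 3.0.
model = DummyRegressor(strategy="mean")
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

mae = -1 * np.mean(scores)  # flip the sign back to a positive error
print(scores, mae)          # [-3.] 3.0
```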

In [5]:
def lgb_objective(trial):
    params = {
        'n_iter'           : 1000,
        'verbose'          : -1,
        'random_state'     : 42,
        'objective'        : 'l2',
        'learning_rate'    : trial.suggest_float('learning_rate', 0.01, 0.1),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'colsample_bynode' : trial.suggest_float('colsample_bynode', 0.5, 1.0),
        'lambda_l1'        : trial.suggest_float('lambda_l1', 1e-2, 10.0),
        'lambda_l2'        : trial.suggest_float('lambda_l2', 1e-2, 10.0),
        'min_data_in_leaf' : trial.suggest_int('min_data_in_leaf', 4, 256),
        'max_depth'        : trial.suggest_int('max_depth', 5, 10),
        'max_bin'          : trial.suggest_int('max_bin', 32, 1024),
    }

    model  = lgb.LGBMRegressor(**params)
    X, y   = df_train.drop(columns=["target"]), df_train["target"]
    cv     = MonthlyKFold(1)
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')

    return -1 * np.mean(scores)

# Feature Engineering

This Python code defines several functions focused on data preprocessing, particularly for handling time series and geospatial data using the Polars library. It also includes a comprehensive function for feature engineering that integrates all these preprocessing steps. Let us break down each function and the overall workflow:

1. **`convert_to_polars(*dfs)`**:
   - Converts a list of data frames (possibly in different formats) to Polars DataFrames. It checks if each dataframe is already a Polars DataFrame; if not, it converts it.

2. **`process_datetime(df, column_name, is_date=False)`**:
   - Processes datetime columns in a dataframe to ensure they are in the correct format (either `pl.Date` or `pl.Datetime`). If `is_date` is `True`, it processes the column as a date; otherwise, it processes it as a datetime.

3. **`process_location(df)`**:
   - Converts latitude and longitude columns in a dataframe to `float32` type. This is useful for geospatial data processing.

4. **`join_dataframes(df_main, dfs, join_conditions, suffixes)`**:
   - Joins multiple data frames with specified conditions and suffixes. It iterates through a list of data frames (`dfs`), left-joining each to the main data frame (`df_main`) on the corresponding join condition. If the join columns are not present in both the main and the joining data frame, the join is skipped for that data frame and a message is printed.

5. **`add_time_features(df)`**:
   - Adds time-related features to a dataframe, such as ordinal day, hour, day, weekday, month, year, and sine and cosine transformations of day of year and hour. The trigonometric features capture the cyclical nature of time data.

6. **`feature_eng(df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)`**:
   - This is the primary feature engineering function. It takes multiple data frames as inputs, each representing different aspects of a dataset (like client information, gas usage, electricity usage, forecasts, historical data, location data, and target variables).
   - Each dataframe is processed using the previously defined functions. Datetime columns are formatted correctly, location data is cast to float, and the data frames are joined based on specified conditions.
   - Time-related features are added to the main dataframe.
   - Unnecessary columns are optionally dropped at the end.
   - The processed and feature-engineered main dataframe is returned.

This code is a comprehensive approach to preprocessing and feature engineering for a dataset that combines multiple data sources, including time series and geospatial data. It is likely part of a more extensive data analysis or machine learning pipeline where such preprocessing is crucial for model training and analysis.



In [6]:
import polars as pl
import numpy as np

def convert_to_polars(*dfs):
    """Converts a list of dataframes to Polars DataFrames."""
    return [pl.DataFrame(df) if not isinstance(df, pl.DataFrame) else df for df in dfs]

def process_datetime(df, column_name, is_date=False):
    """Processes datetime columns to ensure correct format."""
    if column_name in df.columns:
        dtype = df.dtypes[df.columns.index(column_name)]
        # Cast only when the column is not already the requested temporal type
        if is_date and dtype != pl.Date:
            df = df.with_columns(pl.col(column_name).cast(pl.Date))
        elif not is_date and dtype != pl.Datetime:
            df = df.with_columns(pl.col(column_name).cast(pl.Datetime))
    return df

def process_location(df):
    """Converts latitude and longitude to float."""
    return df.with_columns(
        pl.col("latitude").cast(pl.Float32),
        pl.col("longitude").cast(pl.Float32)
    )

def join_dataframes(df_main, dfs, join_conditions, suffixes):
    """Joins multiple dataframes with specified conditions and suffixes."""
    for df, condition, suffix in zip(dfs, join_conditions, suffixes):
        if isinstance(condition, list):
            condition_check = all(col in df_main.columns and col in df.columns for col in condition)
        else:
            condition_check = condition in df_main.columns and condition in df.columns

        if condition_check:
            df_main = df_main.join(df, on=condition, how="left", suffix=suffix)
        else:
            print(f"Skipping join for {suffix} due to missing column in condition: {condition}")
    return df_main

def add_time_features(df):
    """Adds time-related features to the dataframe."""
    if "datetime" in df.columns and df.dtypes[df.columns.index("datetime")] in [pl.Datetime, pl.Date]:
        # Extract calendar components first: expressions inside a single
        # with_columns call cannot reference columns created in that same call.
        df = df.with_columns(
            pl.col("datetime").dt.ordinal_day().alias("dayofyear"),
            pl.col("datetime").dt.hour().alias("hour"),
            pl.col("datetime").dt.day().alias("day"),
            pl.col("datetime").dt.weekday().alias("weekday"),
            pl.col("datetime").dt.month().alias("month"),
            pl.col("datetime").dt.year().alias("year"),
        ).with_columns(
            # Cyclical encodings built from the columns created above
            (np.pi * pl.col("dayofyear") / 183).sin().alias("sin(dayofyear)"),
            (np.pi * pl.col("dayofyear") / 183).cos().alias("cos(dayofyear)"),
            (np.pi * pl.col("hour") / 12).sin().alias("sin(hour)"),
            (np.pi * pl.col("hour") / 12).cos().alias("cos(hour)")
        )
    else:
        print("Warning: 'datetime' column not found. Time features cannot be added.")
    return df

def feature_eng(df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target):
    # Convert to Polars DataFrame
    df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target = convert_to_polars(df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)

    # Process each DataFrame
    df_data = process_datetime(df_data, "datetime")
    df_client = process_datetime(df_client, "date", is_date=True)
    df_gas = process_datetime(df_gas, "forecast_date", is_date=True)
    df_electricity = process_datetime(df_electricity, "forecast_date")
    df_location = process_location(df_location)
    df_forecast = process_datetime(df_forecast, "forecast_datetime")
    df_historical = process_datetime(df_historical, "datetime")  # Assuming df_historical has a datetime column

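    # Define join conditions and suffixes.
    # Note: eight conditions are listed, but only seven frames and seven suffixes
    # are passed below; zip() in join_dataframes stops at the shortest input,
    # so the last condition is never used.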
    join_conditions = [
        "date",
        ["county", "is_business", "product_type", "date"],
        "datetime",
        "datetime",
        "datetime",
        "datetime",
        ["county", "datetime"],
        "datetime"
    ]
    suffixes = ["_gas", "_client", "_elec", "_fcast", "_hist", "_loc", "_target"]

    # Join dataframes
    df_data = join_dataframes(df_data, [df_gas, df_client, df_electricity, df_forecast, df_historical, df_location, df_target], join_conditions, suffixes)

    # Add time-related features
    df_data = add_time_features(df_data)

    # Optionally, drop unnecessary columns
    df_data = df_data.drop(["date", "datetime", "hour", "dayofyear"])

    return df_data
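
The reason for the sine and cosine features is that they place the end of a cycle next to its start. The sketch below checks this for the hour encoding, using the same `pi * hour / 12` angle as `add_time_features`; the `dist` helper is introduced only for this illustration.

```python
import numpy as np

hours = np.array([0, 6, 12, 23])

# Same angle as in add_time_features: a full turn every 24 hours.
sin_h = np.sin(np.pi * hours / 12)
cos_h = np.cos(np.pi * hours / 12)

def dist(i, j):
    """Euclidean distance between two encoded hours (by position in `hours`)."""
    return np.hypot(sin_h[i] - sin_h[j], cos_h[i] - cos_h[j])

# Hour 23 ends up much closer to hour 0 than hour 12 does,
# even though 23 is numerically further from 0.
print(dist(0, 3) < dist(0, 2))  # True
```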


# Global Variables

The code snippet provided is a data preprocessing setup in a Python script, likely for a data analysis or machine learning project focused on energy behaviour prediction. It defines variables that specify file paths, column names for different datasets, and paths for saving or loading processed data. Let us break it down:

1. **File Paths:**
   - `root`: Defines the root directory where the data files are stored. It is currently set to a path in Google Drive, indicating that this script might be used in a Google Colab environment. The commented-out path suggests an alternative location, possibly for a Kaggle kernel.

2. **Column Name Lists:**
   - These lists define specific columns to be used or expected in different datasets. Each list corresponds to a different aspect of the data:
     - `data_cols`: The main data columns, likely the primary dataset.
     - `client_cols`: Columns related to client information.
     - `gas_cols`: Columns related to gas price forecasts.
     - `electricity_cols`: Columns related to electricity price forecasts.
     - `forecast_cols`: Columns related to weather forecasts.
     - `historical_cols`: Columns related to historical weather data.
     - `location_cols`: Columns related to geographical location.
     - `target_cols`: Columns related to target variables for prediction.

3. **Save and Load Paths:**
   - `save_path` and `load_path` are placeholders for paths where processed data can be saved or loaded. They are currently set to `None`, indicating that they will either be defined later in the script or that it currently does not use these functionalities.

4. **Usage and Purpose:**
   - This setup is typically used in data processing scripts where different datasets (like client data, energy forecasts, weather forecasts, historical weather data, etc.) are read, processed, and possibly joined or merged for analysis.
   - The defined column lists help select, filter, or process specific parts of these datasets.
   - The root path and the save/load paths are essential for file management in data processing workflows, especially when working with large datasets or in cloud environments like Google Colab.

In summary, this code snippet is a preparatory part of a larger data processing script, setting up necessary variables for file paths and dataset columns, which will be used in subsequent data loading, preprocessing, and analysis steps in a project related to predicting energy behaviour.

In [7]:
# root = "/kaggle/input/predict-energy-behavior-of-prosumers"
root = "/content/drive/MyDrive/project_energy"

data_cols        = ['target', 'county', 'is_business', 'product_type', 'is_consumption', 'datetime', 'row_id']
client_cols      = ['product_type', 'county', 'eic_count', 'installed_capacity', 'is_business', 'date']
gas_cols         = ['forecast_date', 'lowest_price_per_mwh', 'highest_price_per_mwh']
electricity_cols = ['forecast_date', 'euros_per_mwh']
forecast_cols    = ['latitude', 'longitude', 'hours_ahead', 'temperature', 'dewpoint', 'cloudcover_high', 'cloudcover_low', 'cloudcover_mid', 'cloudcover_total', '10_metre_u_wind_component', '10_metre_v_wind_component', 'forecast_datetime', 'direct_solar_radiation', 'surface_solar_radiation_downwards', 'snowfall', 'total_precipitation']
historical_cols  = ['datetime', 'temperature', 'dewpoint', 'rain', 'snowfall', 'surface_pressure','cloudcover_total','cloudcover_low','cloudcover_mid','cloudcover_high','windspeed_10m','winddirection_10m','shortwave_radiation','direct_solar_radiation','diffuse_radiation','latitude','longitude']
location_cols    = ['longitude', 'latitude', 'county']
target_cols      = ['target', 'county', 'is_business', 'product_type', 'is_consumption', 'datetime']

save_path = None
load_path = None
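
Instead of commenting one `root` path in and out, the environment could be detected at run time. The following is a sketch under the assumption that exactly these two directories are the candidates; `pick_root` is a hypothetical helper, not part of the notebook, and in Colab it would have to run after Google Drive is mounted.

```python
import os

def pick_root(candidates):
    """Return the first candidate directory that exists, or None if none do."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

root = pick_root([
    "/kaggle/input/predict-energy-behavior-of-prosumers",  # Kaggle kernel
    "/content/drive/MyDrive/project_energy",               # Colab with Drive mounted
])
print(root)
```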


# Mount Google Drive

In [8]:
# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive

# This function mounts Google Drive
def mount_google_drive():
    drive.mount('/content/drive')

# Call the function to mount Google Drive
mount_google_drive()


Mounted at /content/drive


# Load the Data

This code snippet is part of a data loading and schema inspection process in a Python script, using the Polars library to handle CSV files. It reads multiple datasets from CSV files, selects specific columns, and extracts their schemas. Let us go through each part of the code:

1. **Reading CSV Files with Polars**:
   - The code uses `pl.read_csv()` from the Polars library to read various CSV files. Each file represents a different aspect of the data related to energy behaviour prediction.
   - `os.path.join(root, "filename.csv")` constructs the file path for each CSV file, using the `root` variable defined earlier as the base directory.
   - The `columns` parameter in `pl.read_csv()` specifies which columns to read from each CSV file. The relevant column names are provided by the variables defined in the previous code snippet (like `data_cols`, `client_cols`, etc.).
   - `try_parse_dates=True` attempts to automatically parse columns recognized as date columns into appropriate date formats.

2. **DataFrames for Different Data Aspects**:
   - `df_data`, `df_client`, `df_gas`, `df_electricity`, `df_forecast`, `df_historical`, and `df_location` are the DataFrames created for the training data, client information, gas prices, electricity prices, weather forecast, historical weather, and location mapping, respectively.
   - `df_target` is created by selecting target columns from `df_data` using the `select()` method.

3. **Schema Inspection**:
   - For each DataFrame, the code extracts its schema (data types of each column) and stores it in a corresponding schema variable (like `schema_data`, `schema_client`, etc.).
   - The schema of a DataFrame provides information about the type of data in each column, which is crucial for data preprocessing and understanding the nature of the data.

4. **Purpose and Usage**:
   - This code is used for loading and initial inspection of various datasets that will likely be used in an energy behaviour prediction project.
   - It ensures that the data is loaded with the correct columns and provides an initial look at the data types, essential for subsequent data processing and analysis steps.
   - The separation of data into different frames based on their nature (like client data, gas prices, etc.) suggests a structured approach to handling a complex dataset with multiple facets.

In summary, the code is part of a data loading and exploration process, focusing on reading different datasets related to energy behaviour, selecting specific columns, parsing date columns where possible, and inspecting the data types of each column for further processing and analysis.

In [9]:
df_data        = pl.read_csv(os.path.join(root, "train.csv"), columns=data_cols, try_parse_dates=True)
df_client      = pl.read_csv(os.path.join(root, "client.csv"), columns=client_cols, try_parse_dates=True)
df_gas         = pl.read_csv(os.path.join(root, "gas_prices.csv"), columns=gas_cols, try_parse_dates=True)
df_electricity = pl.read_csv(os.path.join(root, "electricity_prices.csv"), columns=electricity_cols, try_parse_dates=True)
df_forecast    = pl.read_csv(os.path.join(root, "forecast_weather.csv"), columns=forecast_cols, try_parse_dates=True)
df_historical  = pl.read_csv(os.path.join(root, "historical_weather.csv"), columns=historical_cols, try_parse_dates=True)
df_location    = pl.read_csv(os.path.join(root, "weather_station_to_county_mapping.csv"), columns=location_cols, try_parse_dates=True)
df_target      = df_data.select(target_cols)

schema_data        = df_data.schema
schema_client      = df_client.schema
schema_gas         = df_gas.schema
schema_electricity = df_electricity.schema
schema_forecast    = df_forecast.schema
schema_historical  = df_historical.schema
schema_target      = df_target.schema


# Prepare the Data for ML

The code provided continues the data processing and feature engineering workflow in Python, with an additional step for filtering the data and inspecting the resulting DataFrame. Let us break it down:

1. **Separating Features and Target**:
   - `X, y = df_data.drop("target"), df_data.select("target")`: This line is the same as in the previous snippet. It splits `df_data` into features (`X`) and the target variable (`y`).

2. **Feature Engineering**:
   - `X = feature_eng(X, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)`: Also identical to the previous snippet, this line applies the `feature_eng` function to integrate and enrich the features from different data sources.

3. **Converting to Pandas DataFrame**:
   - `df_train = to_pandas(X, y)`: This line converts the feature-engineered data into a Pandas DataFrame for easier handling and analysis.

4. **Filtering the DataFrame**:
   - `df_train = df_train[df_train["target"].notnull() & df_train["year"].gt(2021)]`: This line filters out rows from `df_train` based on two conditions:
     - Rows where the "target" column has null values are removed (`df_train["target"].notnull()`).
     - Only rows where the "year" column is greater than 2021 are kept (`df_train["year"].gt(2021)`).
   - This filtering is likely done to ensure the model trains only on complete and recent data, a common practice to improve model performance and relevance.

5. **Inspecting the DataFrame**:
   - `df_train.info(verbose=True)`: This line prints detailed information about `df_train`, including the number of non-null entries in each column, the data type of each column, and memory usage.
   - This is a valuable step for getting an overview of the final dataset before proceeding to model training or further analysis.

6. **Purpose and Usage**:
   - This code snippet is part of the data preparation process in a machine learning or data analysis workflow. It focuses on ensuring that the data is rich in features, clean (free of rows with a null target), and relevant (restricted to recent data).
   - The inspection of the DataFrame helps in understanding the data structure, which is crucial before moving into model training or in-depth analysis.

In summary, the code finalizes the data preparation by filtering the dataset to remove rows with a null target and to keep only data after 2021. It also inspects the final dataset to confirm its readiness for further steps in the machine learning or data analysis workflow.
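
The filtering step can be tried in isolation on a small frame. The sketch below is only an illustration: the `toy` DataFrame and its values are made up and stand in for `df_train`, while the mask expression mirrors the one used in the notebook.

```python
import pandas as pd

# Toy stand-in for df_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "year":   [2021, 2022, 2022, 2023],
    "target": [1.5, None, 0.7, 2.1],
})

# Keep rows whose target is present AND whose year is after 2021.
mask = toy["target"].notnull() & toy["year"].gt(2021)
filtered = toy[mask]

print(filtered)
```

Both conditions must hold for a row to survive, so the 2021 row is dropped even though its target is present, and the 2022 row with a missing target is dropped even though it is recent.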

In [None]:
X, y = df_data.drop("target"), df_data.select("target")

X = feature_eng(X, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)

df_train = to_pandas(X, y)
df_train = df_train[df_train["target"].notnull() & df_train["year"].gt(2021)]
df_train.info(verbose=True)

Skipping join for _gas due to missing column in condition: date
Skipping join for _client due to missing column in condition: ['county', 'is_business', 'product_type', 'date']
Skipping join for _elec due to missing column in condition: datetime
Skipping join for _fcast due to missing column in condition: datetime
Skipping join for _loc due to missing column in condition: datetime


# Hyperparameter Tuning

The provided code snippet defines a Python dictionary named `best_params`, which appears to contain a set of optimized hyperparameters for a machine learning model, specifically a LightGBM model, given the nature of the parameters. Let us go through each key-value pair in the dictionary:

1. **`'n_iter': 900`**:
   - Specifies the number of iterations for the model training process. In the context of LightGBM, this typically refers to the number of boosting rounds.

2. **`'verbose': -1`**:
   - Sets the verbosity level for the model's training process. In LightGBM, a negative value restricts logging to fatal errors, so training is effectively silent.

3. **`'objective': 'l2'`**:
   - Indicates the objective function to be used by the model. Here, `'l2'` refers to the L2 loss, also known as mean squared error, commonly used for regression tasks.

4. **`'learning_rate': 0.05689066836106983`**:
   - This is the learning rate of the model, a crucial hyperparameter in gradient-boosting models. It determines the step size at each iteration while moving towards a minimum of the loss function.

5. **`'colsample_bytree': 0.8915976762048253`**:
   - Specifies the subsample ratio of columns when constructing each tree. Values closer to 1 mean more columns are used to build each tree.

6. **`'colsample_bynode': 0.5942203285139224`**:
   - This parameter is similar to `colsample_bytree` but applies to each node of the trees, specifying the subsample ratio of columns for each split.

7. **`'lambda_l1': 3.6277555139102864`** and **`'lambda_l2': 1.6591278779517808`**:
   - These are the L1 (Lasso-style) and L2 (Ridge-style) regularization strengths. They help prevent overfitting by penalising large leaf values.

8. **`'min_data_in_leaf': 186`**:
   - Defines the minimum number of data points required to form a leaf. This can be used to control overfitting.

9. **`'max_depth': 9`**:
   - Specifies the maximum depth of each tree. Deeper trees can model more complex patterns but can also lead to overfitting.

10. **`'max_bin': 813`**:
    - Determines the maximum number of bins used for bucketing feature values. Higher numbers allow the algorithm to consider more split points, potentially leading to more accurate models but increasing computation.

The dictionary `best_params` suggests that these parameters were likely obtained through a hyperparameter tuning process, possibly using a tool like Optuna, as indicated in previous code snippets. These optimized parameters are usually used to configure a LightGBM model to achieve better performance on a specific dataset. The exact values are tailored to the data's characteristics and the machine-learning task's specific requirements.
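
To build some intuition for why a small `learning_rate` is usually paired with a large number of boosting rounds, here is a deliberately simplified sketch of boosting under the L2 objective. It is not LightGBM: each "tree" is reduced to a single constant (the mean residual), and the function name `boost_constant` and the toy targets are assumptions made for illustration.

```python
def boost_constant(targets, learning_rate, n_rounds):
    """Toy L2 boosting: each round fits one constant to the current residuals."""
    prediction = 0.0
    for _ in range(n_rounds):
        # Under L2 loss, the best constant fit to the residuals is their mean.
        residual_mean = sum(t - prediction for t in targets) / len(targets)
        # The learning rate shrinks each round's contribution.
        prediction += learning_rate * residual_mean
    return prediction


targets = [8.0, 10.0, 12.0]  # the mean is 10.0

slow = boost_constant(targets, learning_rate=0.05, n_rounds=10)
fast = boost_constant(targets, learning_rate=0.5, n_rounds=10)

print(f"learning_rate=0.05 after 10 rounds: {slow:.3f}")
print(f"learning_rate=0.5  after 10 rounds: {fast:.3f}")
```

In this toy setting the remaining error shrinks by a factor of `1 - learning_rate` per round, so a rate near 0.05 needs many more rounds to converge than a rate of 0.5. Real gradient-boosted trees are far richer, but the same trade-off is one reason a rate of about 0.057 sits alongside 900 iterations in `best_params`.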

In [1]:
best_params = {
    'n_iter'           : 900,
    'verbose'          : -1,
    'objective'        : 'l2',
    'learning_rate'    : 0.05689066836106983,
    'colsample_bytree' : 0.8915976762048253,
    'colsample_bynode' : 0.5942203285139224,
    'lambda_l1'        : 3.6277555139102864,
    'lambda_l2'        : 1.6591278779517808,
    'min_data_in_leaf' : 186,
    'max_depth'        : 9,
    'max_bin'          : 813,
}


# Training the Model

The code snippet provided is part of a machine-learning workflow in Python, focused specifically on model loading, training, and saving. It either loads a pre-trained model or trains a new one, and then optionally saves the result. Let us dissect each part:

1. **Loading a Pre-trained Model (if available)**:
   - The first `if` statement checks whether `load_path` is not `None`. If there is a specified `load_path`, it assumes a pre-trained model is saved at this location.
   - `model = pickle.load(open(load_path, "rb"))`: This line uses the `pickle` module to load a serialized model from the specified file path. The `"rb"` argument indicates that the file is opened in read-binary mode.

2. **Creating and Training a New Model (if no pre-trained model)**:
   - If `load_path` is `None`, a new model is created using the `VotingRegressor` from Scikit-Learn. This model is an ensemble of several LightGBM regressors.
   - Each LightGBM regressor (`lgb.LGBMRegressor`) is instantiated with the same `best_params` (optimized hyperparameters) but a different `random_state`. Because the parameters subsample columns (`colsample_bytree` and `colsample_bynode` are below 1), each seed yields a slightly different model, which can benefit an ensemble.
   - The `VotingRegressor` averages the predictions of the individual regressors to make more robust overall predictions.
   - The model is then trained on the `df_train` DataFrame using the `fit` method. Features (`X`) are obtained by dropping the `"target"` column, and the target variable (`y`) is the `"target"` column.

3. **Saving the Trained Model (if specified)**:
   - After training (or loading) the model, the `if` statement checks if `save_path` is not `None`.
   - If a `save_path` is provided, the trained model is serialized and saved to this path using `pickle.dump`. The `"wb"` argument indicates that the file is opened in write-binary mode.

4. **Purpose and Usage**:
   - This code snippet is a crucial part of a machine-learning pipeline where model persistence is essential. It provides flexibility to either load a pre-trained model (useful for scenarios like model deployment or when retraining is not required) or train a new model from scratch.
   - Saving the trained model allows the user to reuse the model later without retraining it, saving time and computational resources.
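
The combining step performed by the ensemble in step 2 can be sketched in plain Python. With equal weights, a voting regressor's output for each row is the mean of its members' predictions. The helper `ensemble_predict` and the member outputs below are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean


def ensemble_predict(member_predictions):
    """Average per-row predictions across ensemble members (equal weights)."""
    # zip(*...) regroups the values so each tuple holds one row's predictions.
    return [mean(row) for row in zip(*member_predictions)]


# Hypothetical outputs of three ensemble members for the same two rows.
member_predictions = [
    [10.0, 4.0],
    [12.0, 5.0],
    [11.0, 6.0],
]

averaged = ensemble_predict(member_predictions)
print(averaged)
```

Averaging tends to cancel out the seed-specific noise of individual members, which is why the ensemble is usually at least as stable as any single model.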

In summary, the code manages a machine learning model's loading, training, and saving, specifically a Voting Regressor ensemble of LightGBM models. This allows for efficient model reuse and persistence in a machine-learning workflow.
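
The load-or-train-then-save pattern can be exercised without LightGBM. The sketch below uses only the standard library; `load_or_train` and `train_dummy` are hypothetical names, and a plain dictionary stands in for the fitted ensemble. It also opens files with `with` blocks so the handles are closed promptly.

```python
import os
import pickle
import tempfile


def load_or_train(train_fn, load_path=None, save_path=None):
    """Load a pickled model if a path is given, otherwise train a new one."""
    if load_path is not None:
        with open(load_path, "rb") as f:   # read-binary mode
            model = pickle.load(f)
    else:
        model = train_fn()

    if save_path is not None:
        with open(save_path, "wb") as f:   # write-binary mode
            pickle.dump(model, f)
    return model


def train_dummy():
    # Placeholder for the real training step; returns a picklable object.
    return {"weights": [0.1, 0.2, 0.3]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    trained = load_or_train(train_dummy, save_path=path)    # train, then save
    reloaded = load_or_train(train_dummy, load_path=path)   # load from disk

print(trained == reloaded)
```

Note that `pickle.load` can execute arbitrary code while deserializing, so it should only be used on model files from a trusted source.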

In [None]:
if load_path is not None:
    model = pickle.load(open(load_path, "rb"))
else:
    model = VotingRegressor([
        ('lgb_1', lgb.LGBMRegressor(**best_params, random_state=100)),
        ('lgb_2', lgb.LGBMRegressor(**best_params, random_state=101)),
        ('lgb_3', lgb.LGBMRegressor(**best_params, random_state=102)),
        ('lgb_4', lgb.LGBMRegressor(**best_params, random_state=103)),
        ('lgb_5', lgb.LGBMRegressor(**best_params, random_state=104)),
    ])

    model.fit(
        X=df_train.drop(columns=["target"]),
        y=df_train["target"]
    )

if save_path is not None:
    with open(save_path, "wb") as f:
        pickle.dump(model, f)


# Real-Time Prediction Environment

This code snippet is part of a machine-learning workflow designed for a competition-style, real-time prediction environment, such as one hosted on Kaggle. It uses an iterative testing approach, common in time-series forecasting competitions, where the model makes predictions on test data as it becomes available. Let us break down the critical components of the code:

1. **Import and Environment Setup**:
   - `import enefit`: This line imports a module named `enefit`, which is likely specific to the context or competition for which this code is written.
   - `env = enefit.make_env()`: Creates an environment for the iterative test process. This environment provides test data in chunks over time.

2. **Iterative Testing Loop**:
   - `iter_test = env.iter_test()`: Initializes an iterator for the test set.
   - The `for` loop iterates over the test data provided by `iter_test`. Each iteration yields several DataFrames representing different aspects of the test data, such as `test`, `client`, `historical_weather`, etc.

3. **Data Processing**:
   - `test = test.rename(columns={"prediction_datetime": "datetime"})`: Renames a column in the `test` DataFrame for consistency.
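   - Several DataFrames (`df_test`, `df_client`, `df_gas`, etc.) are created from the test data chunks using Polars (`pl.from_pandas`), selecting the predefined column lists and passing the schemas captured at load time as `schema_overrides` so the data types match the training data.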
   - `df_forecast`, `df_historical`, and `df_target` are updated by concatenating new data and removing duplicates.

4. **Feature Engineering and Prediction**:
   - `X_test = feature_eng(...)`: Applies the previously defined `feature_eng` function to process and combine the test data.
   - `X_test = to_pandas(X_test)`: Converts the processed test data to a Pandas DataFrame.
   - `sample_prediction["target"] = model.predict(X_test).clip(0)`: The model makes predictions on the test data. The `.clip(0)` method ensures that no negative predictions are made, which might be important depending on the context (e.g., predicting quantities that cannot be negative).

5. **Submitting Predictions**:
   - `env.predict(sample_prediction)`: Submits the predictions for the current test batch.

6. **Purpose and Usage**:
   - This code is typically used in a competition or real-world scenario where predictions are made as new data becomes available, often in a time-series forecasting context.
   - The iterative approach allows the model to use the most recent data for making predictions, which can be crucial for accuracy in time-sensitive contexts.

In summary, the code is set up for an iterative prediction environment, processing incoming test data, applying feature engineering, making predictions with a trained model, and submitting these predictions. The exact context (like the nature of the `enefit` library and the data involved) is specific to the particular use case or competition for which this code is written.
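
The control flow of such an iterative environment can be imitated with a small mock. Everything in the sketch below is an assumption made for illustration: `MockEnv` is not the real `enefit` API, the batches are plain lists of numbers, and `toy_model` is a placeholder that can produce negative values so the effect of clipping at zero is visible.

```python
class MockEnv:
    """Hypothetical stand-in for the competition environment (not the real API)."""

    def __init__(self, batches):
        self._batches = batches
        self.submitted = []

    def iter_test(self):
        # Yield one batch of feature values at a time, in order.
        for batch in self._batches:
            yield batch

    def predict(self, predictions):
        # Record the predictions submitted for the current batch.
        self.submitted.append(predictions)


def toy_model(rows):
    # Placeholder model: subtracting 1.0 can yield negative outputs.
    return [x - 1.0 for x in rows]


env = MockEnv(batches=[[0.5, 3.0], [2.0, 0.0]])

for batch in env.iter_test():
    raw = toy_model(batch)
    # Same effect as .clip(0): negative predictions are raised to zero.
    clipped = [max(0.0, value) for value in raw]
    env.predict(clipped)

print(env.submitted)
```

Each loop iteration handles exactly one batch and submits exactly one set of predictions, which mirrors the receive, predict, submit rhythm of the notebook's loop.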

In [None]:
import enefit

env = enefit.make_env()
iter_test = env.iter_test()

for (test, revealed_targets, client, historical_weather,
        forecast_weather, electricity_prices, gas_prices, sample_prediction) in iter_test:

    test = test.rename(columns={"prediction_datetime": "datetime"})

    df_test           = pl.from_pandas(test[data_cols[1:]], schema_overrides=schema_data)
    df_client         = pl.from_pandas(client[client_cols], schema_overrides=schema_client)
    df_gas            = pl.from_pandas(gas_prices[gas_cols], schema_overrides=schema_gas)
    df_electricity    = pl.from_pandas(electricity_prices[electricity_cols], schema_overrides=schema_electricity)
    df_new_forecast   = pl.from_pandas(forecast_weather[forecast_cols], schema_overrides=schema_forecast)
    df_new_historical = pl.from_pandas(historical_weather[historical_cols], schema_overrides=schema_historical)
    df_new_target     = pl.from_pandas(revealed_targets[target_cols], schema_overrides=schema_target)

    df_forecast       = pl.concat([df_forecast, df_new_forecast]).unique()
    df_historical     = pl.concat([df_historical, df_new_historical]).unique()
    df_target         = pl.concat([df_target, df_new_target]).unique()

    X_test = feature_eng(df_test, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)
    X_test = to_pandas(X_test)

    sample_prediction["target"] = model.predict(X_test).clip(0)

    env.predict(sample_prediction)