<a href="https://colab.research.google.com/github/carlos-alves-one/-Energy-Comp/blob/main/enefit_project_FV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up an Environment for Data Processing

The cell below imports the Python libraries this notebook relies on for data manipulation, machine learning, and hyperparameter optimization. Each import is described in turn:

1. Import Statements:
   - `import os`: Imports the OS module, which provides functions for interacting with the operating system.
   - `import gc`: Imports the garbage collector interface, which can be used to trigger Python's garbage collection manually.
   - `import pickle`: Imports the pickle module used for serializing (pickling) and deserializing (unpickling) Python object structures.
   - `import numpy as np`: Imports the NumPy library (as `np`), which is fundamental for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
   - `import pandas as pd`: Imports the pandas library (as `pd`), a powerful data manipulation and analysis tool.
   - `import polars as pl`: Imports the Polars library (as `pl`), another data manipulation and analysis library, similar to pandas but often faster for certain operations.
   - `from sklearn.model_selection import cross_val_score, cross_validate`: Imports `cross_val_score` and `cross_validate` functions from Scikit-Learn, used for cross-validation in machine learning.
   - `from sklearn.metrics import mean_absolute_error`: Imports the Mean Absolute Error (MAE) metric from Scikit-Learn, used to evaluate the performance of regression models.
   - `from sklearn.compose import TransformedTargetRegressor`: Imports a utility from Scikit-Learn to transform the target variable in regression problems.
   - `from sklearn.ensemble import VotingRegressor`: Imports the VotingRegressor from Scikit-Learn, a meta-regressor that fits several base regressors and averages their predictions.
   - `import lightgbm as lgb`: Imports LightGBM (as `lgb`), a gradient boosting framework that uses tree-based learning algorithms.
   - `!pip install optuna`: Runs pip as a shell command (the leading `!` is notebook syntax) to install Optuna, a library for hyperparameter optimization.
   - `import optuna`: Imports the Optuna library, used here to tune model hyperparameters.

2. Purpose of the Code:
   - This cell sets up the environment for data processing, machine learning, and hyperparameter optimization.
   - It supports the rest of the notebook, which builds, evaluates, and optimizes regression models, as the imports of `mean_absolute_error` and `TransformedTargetRegressor` indicate.
   - The use of `cross_val_score` and `cross_validate` suggests that cross-validation is used for model evaluation.
   - `VotingRegressor` and `lightgbm` indicate that ensemble methods and gradient boosting are used in model training.
   - `Optuna` is used to optimize the hyperparameters of these models to enhance performance.

This code, by itself, doesn't perform any data analysis or machine learning operations. It is a setup for such tasks, defining the necessary libraries and tools to be used in subsequent code.
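
The cross-validation and MAE imports described above fit together roughly as follows. This is a minimal sketch on synthetic data with a plain `LinearRegression`; the names `X_demo`, `y_demo`, and the coefficient vector are made up for illustration and do not come from the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only, not the competition data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn scorers follow a "higher is better" convention,
# so the MAE is reported with its sign flipped.
neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

# Negate again to recover an ordinary, non-negative MAE per fold.
mae_per_fold = -neg_mae
print(mae_per_fold.mean())
```

The same sign flip appears in `lgb_objective`, which returns `-1 * np.mean(scores)` so that the optimizer minimizes a positive error.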

In [None]:
import os
import gc
import pickle

import numpy as np
import pandas as pd
import polars as pl

from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import mean_absolute_error
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import VotingRegressor

import lightgbm as lgb

!pip install optuna

import optuna

Collecting optuna
  Downloading optuna-3.5.0-py3-none-any.whl (413 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.0-py3-none-any.whl (230 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.0-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.13.0 colorlog-6.8.0 optuna-3.5.0


In [None]:
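class MonthlyKFold:
    """Expanding-window splitter: each of the last `n_splits` months is a test
    fold, trained on all strictly earlier months."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        # Encode (year, month) as one integer so months sort chronologically.
        dates = 12 * X["year"] + X["month"]
        timesteps = sorted(dates.unique().tolist())
        # Reset to a positional index, since scikit-learn expects integer positions.
        X = X.reset_index()

        for t in timesteps[-self.n_splits:]:
            idx_train = X[dates.values < t].index
            idx_test = X[dates.values == t].index

            yield idx_train, idx_test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits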
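
To see what `MonthlyKFold` yields, the sketch below repeats the class so that it runs on its own and applies it to a toy frame. The six-row `toy` frame is made up for illustration; only the `year` and `month` columns matter to the splitter.

```python
import pandas as pd


class MonthlyKFold:
    """Same splitter as in the notebook, repeated so the sketch is self-contained."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        dates = 12 * X["year"] + X["month"]   # one integer per calendar month
        timesteps = sorted(dates.unique().tolist())
        X = X.reset_index()                   # positional 0..n-1 index

        for t in timesteps[-self.n_splits:]:
            yield X[dates.values < t].index, X[dates.values == t].index

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


# Hypothetical data: two rows each for Nov 2022, Dec 2022, and Jan 2023.
toy = pd.DataFrame({
    "year":  [2022, 2022, 2022, 2022, 2023, 2023],
    "month": [11, 11, 12, 12, 1, 1],
})

folds = [
    (train.tolist(), test.tolist())
    for train, test in MonthlyKFold(n_splits=2).split(toy, None)
]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each test fold is a single month, and its training set contains only earlier months, so no future rows leak into training.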

In [None]:
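def to_pandas(X, y=None):
    """Converts Polars features (and an optional target) to a pandas frame indexed by row_id."""
    # Every column listed here must exist in X.
    cat_cols = ["county", "is_business", "product_type", "is_consumption", "category_1"]

    if y is not None:
        df = pd.concat([X.to_pandas(), y.to_pandas()], axis=1)
    else:
        df = X.to_pandas()

    df = df.set_index("row_id")
    # LightGBM handles pandas "category" columns natively.
    df[cat_cols] = df[cat_cols].astype("category")

    # Summary statistics over lagged targets; X must provide target_1 ... target_7.
    df["target_mean"] = df[[f"target_{i}" for i in range(1, 7)]].mean(1)
    df["target_std"] = df[[f"target_{i}" for i in range(1, 7)]].std(1)
    df["target_ratio"] = df["target_6"] / (df["target_7"] + 1e-3)

    return df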

In [None]:
def lgb_objective(trial):
    params = {
        'n_iter'           : 1000,
        'verbose'          : -1,
        'random_state'     : 42,
        'objective'        : 'l2',
        'learning_rate'    : trial.suggest_float('learning_rate', 0.01, 0.1),
        'colsample_bytree' : trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'colsample_bynode' : trial.suggest_float('colsample_bynode', 0.5, 1.0),
        'lambda_l1'        : trial.suggest_float('lambda_l1', 1e-2, 10.0),
        'lambda_l2'        : trial.suggest_float('lambda_l2', 1e-2, 10.0),
        'min_data_in_leaf' : trial.suggest_int('min_data_in_leaf', 4, 256),
        'max_depth'        : trial.suggest_int('max_depth', 5, 10),
        'max_bin'          : trial.suggest_int('max_bin', 32, 1024),
    }

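    model  = lgb.LGBMRegressor(**params)
    X, y   = df_train.drop(columns=["target"]), df_train["target"]  # df_train is a notebook-level global
    cv     = MonthlyKFold(1)  # validate on the most recent month only
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')

    # The scorer returns negated MAE; flip the sign so Optuna can minimize it.
    return -1 * np.mean(scores)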

In [None]:
import polars as pl
import numpy as np

def convert_to_polars(*dfs):
    """Converts a list of dataframes to Polars DataFrames."""
    return [pl.DataFrame(df) if not isinstance(df, pl.DataFrame) else df for df in dfs]

def process_datetime(df, column_name, is_date=False):
    """Processes datetime columns to ensure correct format."""
    if column_name in df.columns:
        if is_date and df.dtypes[df.columns.index(column_name)] == pl.Date:
            df = df.with_columns(pl.col(column_name).cast(pl.Date))
        elif not is_date and df.dtypes[df.columns.index(column_name)] == pl.Datetime:
            df = df.with_columns(pl.col(column_name).cast(pl.Datetime))
    return df

def process_location(df):
    """Converts latitude and longitude to float."""
    return df.with_columns(
        pl.col("latitude").cast(pl.Float32),
        pl.col("longitude").cast(pl.Float32)
    )

def join_dataframes(df_main, dfs, join_conditions, suffixes):
    """Joins multiple dataframes with specified conditions and suffixes."""
    for df, condition, suffix in zip(dfs, join_conditions, suffixes):
        if isinstance(condition, list):
            condition_check = all(col in df_main.columns and col in df.columns for col in condition)
        else:
            condition_check = condition in df_main.columns and condition in df.columns

        if condition_check:
            df_main = df_main.join(df, on=condition, how="left", suffix=suffix)
        else:
            print(f"Skipping join for {suffix} due to missing column in condition: {condition}")
    return df_main

def add_time_features(df):
    """Adds time-related features to the dataframe."""
    if "datetime" in df.columns and df.dtypes[df.columns.index("datetime")] in [pl.Datetime, pl.Date]:
        # Step 1: extract calendar fields from the timestamp.
        df = df.with_columns(
            pl.col("datetime").dt.ordinal_day().alias("dayofyear"),
            pl.col("datetime").dt.hour().alias("hour"),
            pl.col("datetime").dt.day().alias("day"),
            pl.col("datetime").dt.weekday().alias("weekday"),
            pl.col("datetime").dt.month().alias("month"),
            pl.col("datetime").dt.year().alias("year"),
        )
        # Step 2: cyclical encodings. These read "dayofyear" and "hour", which only
        # exist after step 1; expressions in a single with_columns call cannot see
        # columns created in that same call.
        df = df.with_columns(
            (np.pi * pl.col("dayofyear") / 183).sin().alias("sin(dayofyear)"),
            (np.pi * pl.col("dayofyear") / 183).cos().alias("cos(dayofyear)"),
            (np.pi * pl.col("hour") / 12).sin().alias("sin(hour)"),
            (np.pi * pl.col("hour") / 12).cos().alias("cos(hour)"),
        )
    else:
        print("Warning: 'datetime' column is missing or not a date/datetime type. Time features cannot be added.")
    return df

def feature_eng(df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target):
    # Convert to Polars DataFrame
    df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target = convert_to_polars(df_data, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)

    # Process each DataFrame
    df_data = process_datetime(df_data, "datetime")
    df_client = process_datetime(df_client, "date", is_date=True)
    df_gas = process_datetime(df_gas, "forecast_date", is_date=True)
    df_electricity = process_datetime(df_electricity, "forecast_date")
    df_location = process_location(df_location)
    df_forecast = process_datetime(df_forecast, "forecast_datetime")
    df_historical = process_datetime(df_historical, "datetime")  # Assuming df_historical has a datetime column

    # Define join conditions and suffixes, one entry per joined frame
    join_conditions = [
        "date",                                             # df_gas
        ["county", "is_business", "product_type", "date"],  # df_client
        "datetime",                                         # df_electricity
        "datetime",                                         # df_forecast
        "datetime",                                         # df_historical
        "datetime",                                         # df_location
        ["county", "datetime"],                             # df_target
    ]
    suffixes = ["_gas", "_client", "_elec", "_fcast", "_hist", "_loc", "_target"]

    # Join dataframes
    df_data = join_dataframes(df_data, [df_gas, df_client, df_electricity, df_forecast, df_historical, df_location, df_target], join_conditions, suffixes)

    # Add time-related features
    df_data = add_time_features(df_data)

    # Drop helper columns that are no longer needed, skipping any that are absent
    drop_cols = [c for c in ["date", "datetime", "hour", "dayofyear"] if c in df_data.columns]
    df_data = df_data.drop(drop_cols)

    return df_data
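
The `sin`/`cos` pairs in `add_time_features` place the hour and the day of year on a circle, so that the end of a cycle sits next to its start. Below is a minimal NumPy sketch of the hour case. The helper `circular_distance` is hypothetical and exists only for this illustration.

```python
import numpy as np

hours = np.arange(24)

# Same scaling as add_time_features: pi * hour / 12 equals 2 * pi * hour / 24.
angle = np.pi * hours / 12
sin_hour, cos_hour = np.sin(angle), np.cos(angle)


def circular_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return np.hypot(sin_hour[h1] - sin_hour[h2], cos_hour[h1] - cos_hour[h2])


# On the raw scale 23 and 0 are 23 apart; on the circle they are neighbours,
# exactly as close as 0 and 1, while 0 and 12 sit on opposite sides.
print(circular_distance(23, 0), circular_distance(0, 1), circular_distance(0, 12))
```

Both columns are needed: a sine or cosine value alone is shared by two different hours, while the pair identifies each hour uniquely.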


### Global Variables

In [None]:
# root = "/kaggle/input/predict-energy-behavior-of-prosumers"
root = "/content/drive/MyDrive/project_energy"

data_cols        = ['target', 'county', 'is_business', 'product_type', 'is_consumption', 'datetime', 'row_id']
client_cols      = ['product_type', 'county', 'eic_count', 'installed_capacity', 'is_business', 'date']
gas_cols         = ['forecast_date', 'lowest_price_per_mwh', 'highest_price_per_mwh']
electricity_cols = ['forecast_date', 'euros_per_mwh']
forecast_cols    = ['latitude', 'longitude', 'hours_ahead', 'temperature', 'dewpoint', 'cloudcover_high', 'cloudcover_low', 'cloudcover_mid', 'cloudcover_total', '10_metre_u_wind_component', '10_metre_v_wind_component', 'forecast_datetime', 'direct_solar_radiation', 'surface_solar_radiation_downwards', 'snowfall', 'total_precipitation']
historical_cols  = ['datetime', 'temperature', 'dewpoint', 'rain', 'snowfall', 'surface_pressure','cloudcover_total','cloudcover_low','cloudcover_mid','cloudcover_high','windspeed_10m','winddirection_10m','shortwave_radiation','direct_solar_radiation','diffuse_radiation','latitude','longitude']
location_cols    = ['longitude', 'latitude', 'county']
target_cols      = ['target', 'county', 'is_business', 'product_type', 'is_consumption', 'datetime']

save_path = None
load_path = None

### Data I/O

In [None]:
# Imports the 'drive' module from 'google.colab' and mounts the Google Drive to
# the '/content/drive' directory in the Colab environment.
from google.colab import drive

# This function mounts Google Drive
def mount_google_drive():
    drive.mount('/content/drive')

# Call the function to mount Google Drive
mount_google_drive()

Mounted at /content/drive


In [None]:
df_data        = pl.read_csv(os.path.join(root, "train.csv"), columns=data_cols, try_parse_dates=True)
df_client      = pl.read_csv(os.path.join(root, "client.csv"), columns=client_cols, try_parse_dates=True)
df_gas         = pl.read_csv(os.path.join(root, "gas_prices.csv"), columns=gas_cols, try_parse_dates=True)
df_electricity = pl.read_csv(os.path.join(root, "electricity_prices.csv"), columns=electricity_cols, try_parse_dates=True)
df_forecast    = pl.read_csv(os.path.join(root, "forecast_weather.csv"), columns=forecast_cols, try_parse_dates=True)
df_historical  = pl.read_csv(os.path.join(root, "historical_weather.csv"), columns=historical_cols, try_parse_dates=True)
df_location    = pl.read_csv(os.path.join(root, "weather_station_to_county_mapping.csv"), columns=location_cols, try_parse_dates=True)
df_target      = df_data.select(target_cols)

schema_data        = df_data.schema
schema_client      = df_client.schema
schema_gas         = df_gas.schema
schema_electricity = df_electricity.schema
schema_forecast    = df_forecast.schema
schema_historical  = df_historical.schema
schema_target      = df_target.schema

### Feature Engineering

In [None]:
X, y = df_data.drop("target"), df_data.select("target")

X = feature_eng(X, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)

df_train = to_pandas(X, y)

Skipping join for _gas due to missing column in condition: date
Skipping join for _client due to missing column in condition: ['county', 'is_business', 'product_type', 'date']
Skipping join for _elec due to missing column in condition: datetime
Skipping join for _fcast due to missing column in condition: datetime
Skipping join for _loc due to missing column in condition: datetime


In [None]:
df_train = df_train[df_train["target"].notnull() & df_train["year"].gt(2021)]

In [None]:
df_train.info(verbose=True)

### Hyperparameter Optimization

In [None]:
# study = optuna.create_study(direction='minimize', study_name='Regressor')
# study.optimize(lgb_objective, n_trials=100, show_progress_bar=True)

In [None]:
best_params = {
    'n_iter'           : 900,
    'verbose'          : -1,
    'objective'        : 'l2',
    'learning_rate'    : 0.05689066836106983,
    'colsample_bytree' : 0.8915976762048253,
    'colsample_bynode' : 0.5942203285139224,
    'lambda_l1'        : 3.6277555139102864,
    'lambda_l2'        : 1.6591278779517808,
    'min_data_in_leaf' : 186,
    'max_depth'        : 9,
    'max_bin'          : 813,
} # val score is 62.24 for the last month

### Validation

In [None]:
'''result = cross_validate(
    estimator=lgb.LGBMRegressor(**best_params, random_state=42),
    X=df_train.drop(columns=["target"]),
    y=df_train["target"],
    scoring="neg_mean_absolute_error",
    cv=MonthlyKFold(1),
)

print(f"Fit Time(s): {result['fit_time'].mean():.3f}")
print(f"Score Time(s): {result['score_time'].mean():.3f}")
print(f"Error(MAE): {-result['test_score'].mean():.3f}")'''

### Training

In [None]:
if load_path is not None:
    # Use a context manager so the file handle is closed after loading.
    with open(load_path, "rb") as f:
        model = pickle.load(f)
else:
    model = VotingRegressor([
        ('lgb_1', lgb.LGBMRegressor(**best_params, random_state=100)),
        ('lgb_2', lgb.LGBMRegressor(**best_params, random_state=101)),
        ('lgb_3', lgb.LGBMRegressor(**best_params, random_state=102)),
        ('lgb_4', lgb.LGBMRegressor(**best_params, random_state=103)),
        ('lgb_5', lgb.LGBMRegressor(**best_params, random_state=104)),
    ])

    model.fit(
        X=df_train.drop(columns=["target"]),
        y=df_train["target"]
    )

if save_path is not None:
    with open(save_path, "wb") as f:
        pickle.dump(model, f)
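
The ensemble above combines five LightGBM models that differ only in their random seed. As a rough illustration of what `VotingRegressor` does with them, the sketch below uses scikit-learn random forests as stand-ins (an assumption made so that the example does not depend on LightGBM) and synthetic data named `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Stand-ins for the seeded LightGBM models: same settings, different seeds.
voter = VotingRegressor([
    (f"rf_{seed}", RandomForestRegressor(n_estimators=20, random_state=seed))
    for seed in (100, 101, 102)
])
voter.fit(X_demo, y_demo)

# With no weights given, the ensemble prediction is the plain mean
# of the fitted members' predictions.
member_preds = np.stack([est.predict(X_demo) for est in voter.estimators_])
ensemble_pred = voter.predict(X_demo)
print(np.allclose(ensemble_pred, member_preds.mean(axis=0)))
```

Averaging models that differ only by seed does not change the model class; it mainly reduces the variance that comes from random column subsampling and similar sources of randomness.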

### Prediction

In [None]:
import enefit

env = enefit.make_env()
iter_test = env.iter_test()

In [None]:
for (test, revealed_targets, client, historical_weather,
        forecast_weather, electricity_prices, gas_prices, sample_prediction) in iter_test:

    test = test.rename(columns={"prediction_datetime": "datetime"})

    df_test           = pl.from_pandas(test[data_cols[1:]], schema_overrides=schema_data)
    df_client         = pl.from_pandas(client[client_cols], schema_overrides=schema_client)
    df_gas            = pl.from_pandas(gas_prices[gas_cols], schema_overrides=schema_gas)
    df_electricity    = pl.from_pandas(electricity_prices[electricity_cols], schema_overrides=schema_electricity)
    df_new_forecast   = pl.from_pandas(forecast_weather[forecast_cols], schema_overrides=schema_forecast)
    df_new_historical = pl.from_pandas(historical_weather[historical_cols], schema_overrides=schema_historical)
    df_new_target     = pl.from_pandas(revealed_targets[target_cols], schema_overrides=schema_target)

    df_forecast       = pl.concat([df_forecast, df_new_forecast]).unique()
    df_historical     = pl.concat([df_historical, df_new_historical]).unique()
    df_target         = pl.concat([df_target, df_new_target]).unique()

    X_test = feature_eng(df_test, df_client, df_gas, df_electricity, df_forecast, df_historical, df_location, df_target)
    X_test = to_pandas(X_test)

    sample_prediction["target"] = model.predict(X_test).clip(0)

    env.predict(sample_prediction)