Author:
        
        KIM, JeongYoong, jeongyoong@ccnets.org
        
    COPYRIGHT (c) 2024. CCNets. All Rights reserved.

<p align="center">
  <img src="https://storage.googleapis.com/kaggle-datasets-images/4956778/8344638/a2a6aa289fce8461958dc287f1dab799/dataset-cover.jpg?t=2024-05-07-09-36-53" alt="IMG">
</p>

<h1 style = 'font-family: Times New Roman'> <b>|</b><i> 1. Content</i></h1>

<br>

> <h4 style = 'font-family: Times New Roman'>
This dataset explores how weather conditions affect renewable energy generation. <br><br>Spanning January 1, 2017, to August 31, 2022, it provides climate data such as temperature, pressure, wind speed, and sunlight duration at 15-minute intervals. <br><br>Variables such as GHI and sunlightTime make it possible to predict solar energy production.
</h4>


- DataSource: https://www.kaggle.com/datasets/pythonafroz/renewable-power-generation-and-weather-conditions/data
  
<h1 style = 'font-family: Times New Roman'> <b>|</b><i> 2. About Dataset</i></h1>

<br> 

<details>
    <summary>More Columns Info</summary>
    The dataset has 17 columns; the <u>main features</u> are described below:

    *  (1) Time: The timestamp of the recorded data in the format of YYYY-MM-DD HH:MM:SS.
        
    *  (2) Energy delta[Wh]: The difference in energy consumption in Watt-hours (Wh) from the previous timestamp to the current timestamp.
        
    *  (3) GHI: Global Horizontal Irradiance in Watts per square meter (W/m²) measured by a pyranometer.
        
    *  (4) temp: The temperature in degrees Celsius (°C) measured at the same height as the pyranometer.

    *  (5) pressure: The atmospheric pressure in hectopascals (hPa) measured at the same height as the pyranometer.

    *  (6) humidity: The relative humidity in percentage (%) measured at the same height as the pyranometer.

    *  (7) wind_speed: The wind speed in meters per second (m/s) measured at the same height as the pyranometer.

    *  (8) rain_1h: The amount of precipitation in millimeters (mm) measured over the past hour.
    
    *  (9) snow_1h: The amount of snowfall in millimeters.

    *  (10) clouds_all: The cloud cover, apparently expressed as a percentage (sample values in this notebook range up to 100).
</details>    
    
<h1 style = 'font-family: Times New Roman'> <b>|</b><i> 3. Goal of the Notebook</i></h1>
    
> <h4 style = 'font-family: Times New Roman'>
The goal is to train and test a model using GPT and PyTorch. <br><br>
    The target column is <b>Energy delta[Wh]</b>
</h4>

 

***

<a id="1"></a>
> <h1 style = 'font-family: Times New Roman'><b> <b style = 'color: #42c2f5'>1.</b> Import Necessary Libraries </b></h1>

In [1]:
import os
import sys
import warnings
warnings.filterwarnings("ignore")

path_append = "../"
sys.path.append(path_append)  # Go up one directory from where you are.

In [2]:
import pandas as pd
df = pd.read_csv(path_append + '../data/renewable power generation/Renewable.csv')
df.head()

Unnamed: 0,Time,Energy delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month
0,2017-01-01 00:00:00,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
1,2017-01-01 00:15:00,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
2,2017-01-01 00:30:00,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
3,2017-01-01 00:45:00,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
4,2017-01-01 01:00:00,0,0.0,1.7,1020,100,5.2,0.0,0.0,100,0,0,450,0.0,4,1,1


<a id="2"></a>
> <h1 style = 'font-family: Times New Roman'><b> <b style = 'color: #4290f5'>2.</b> Modeling: Preprocess </b></h1>

In [3]:
# Remove rows where the energy delta is 0; copy so later column assignments do not act on a view of df
df1 = df[df['Energy delta[Wh]'] != 0].copy()
df1.head()

Unnamed: 0,Time,Energy delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month
32,2017-01-01 08:00:00,5,10.6,3.5,1016,99,6.0,0.0,0.0,98,1,60,450,0.13,4,8,1
33,2017-01-01 08:15:00,33,6.0,3.5,1016,99,6.0,0.0,0.0,98,1,75,450,0.17,4,8,1
34,2017-01-01 08:30:00,44,2.8,3.5,1016,99,6.0,0.0,0.0,98,1,90,450,0.2,4,8,1
35,2017-01-01 08:45:00,61,3.1,3.5,1016,99,6.0,0.0,0.0,98,1,105,450,0.23,4,8,1
36,2017-01-01 09:00:00,65,3.5,3.6,1016,97,6.2,0.0,0.0,100,1,120,450,0.27,4,9,1


In [4]:
df1.tail()

Unnamed: 0,Time,Energy delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month
196770,2022-08-31 16:30:00,157,31.9,18.6,1023,57,3.8,0.0,0.0,52,1,765,825,0.93,3,16,8
196771,2022-08-31 16:45:00,118,23.7,18.6,1023,57,3.8,0.0,0.0,52,1,780,825,0.95,3,16,8
196772,2022-08-31 17:00:00,82,15.6,18.5,1023,61,4.2,0.0,0.0,60,1,795,825,0.96,3,17,8
196773,2022-08-31 17:15:00,51,8.0,18.5,1023,61,4.2,0.0,0.0,60,1,810,825,0.98,3,17,8
196774,2022-08-31 17:30:00,24,2.1,18.5,1023,61,4.2,0.0,0.0,60,1,825,825,1.0,3,17,8


In [5]:
# Convert Time to datetime and then to timestamp
df1['Time'] = pd.to_datetime(df1['Time'], format='%Y-%m-%d %H:%M:%S')

df1['Timestamp'] = df1['Time'].astype('int64') // 10**9  # Convert to seconds

df1.drop(['Time'], axis=1, inplace=True)

df1.head()

Unnamed: 0,Energy delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month,Timestamp
32,5,10.6,3.5,1016,99,6.0,0.0,0.0,98,1,60,450,0.13,4,8,1,1483257600
33,33,6.0,3.5,1016,99,6.0,0.0,0.0,98,1,75,450,0.17,4,8,1,1483258500
34,44,2.8,3.5,1016,99,6.0,0.0,0.0,98,1,90,450,0.2,4,8,1,1483259400
35,61,3.1,3.5,1016,99,6.0,0.0,0.0,98,1,105,450,0.23,4,8,1,1483260300
36,65,3.5,3.6,1016,97,6.2,0.0,0.0,100,1,120,450,0.27,4,9,1,1483261200
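
As a cross-check on the conversion above, the following dependency-free sketch reproduces the first `Timestamp` value with the standard library only. It assumes the naive timestamps are read as UTC, which matches how pandas casts timezone-naive datetimes to `int64`. `to_unix_seconds` is a hypothetical helper name, not part of the notebook.

```python
from datetime import datetime, timezone

def to_unix_seconds(text: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return whole seconds since the Unix epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

first = to_unix_seconds("2017-01-01 08:00:00")
second = to_unix_seconds("2017-01-01 08:15:00")

print(first)           # 1483257600, the Timestamp of row 32 above
print(second - first)  # 900 seconds, i.e. one 15-minute step
```

Consecutive rows that are exactly 900 seconds apart belong to the same uninterrupted 15-minute series, which is a quick way to spot gaps left by the zero-row filter.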


In [6]:
from sklearn.model_selection import train_test_split

X = df1.drop(columns=['Energy delta[Wh]'])
y = df1['Energy delta[Wh]'] 

# Split the data into train and test sets (train_test_split shuffles rows by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Recombine features and target into train and test dataframes
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

# Restore chronological order within each split
train_df.sort_values(by='Timestamp', ascending=True, inplace=True)
test_df.sort_values(by='Timestamp', ascending=True, inplace=True)

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
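
The cell above shuffles rows before splitting and only afterwards sorts each part by `Timestamp`, so train and test rows are interleaved in time and neighbouring rows in a split need not be 15 minutes apart. If a strict hold-out of the most recent period is preferred, a chronological split is one alternative. The sketch below is not the notebook's method: `chronological_split` is a hypothetical helper, and it works on plain Python rows to stay dependency-free.

```python
def chronological_split(rows, test_size=0.2, key=lambda row: row["Timestamp"]):
    """Sort rows by time and hold out the most recent `test_size` fraction."""
    ordered = sorted(rows, key=key)
    cut = int(len(ordered) * (1 - test_size))
    return ordered[:cut], ordered[cut:]

# Toy data: ten 15-minute steps starting at the first timestamp in the notebook.
rows = [{"Timestamp": 1483257600 + 900 * i, "Energy delta[Wh]": i} for i in range(10)]
train_rows, test_rows = chronological_split(rows, test_size=0.2)

print(len(train_rows), len(test_rows))  # 8 2
# Every training row precedes every test row in time.
print(train_rows[-1]["Timestamp"] < test_rows[0]["Timestamp"])  # True
```

With a DataFrame, the same idea amounts to sorting by `Timestamp` and slicing with `iloc` at the cut position.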

In [7]:
train_df.head()

Unnamed: 0,GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month,Timestamp,Energy delta[Wh]
0,10.6,3.5,1016,99,6.0,0.0,0.0,98,1,60,450,0.13,4,8,1,1483257600,5
1,6.0,3.5,1016,99,6.0,0.0,0.0,98,1,75,450,0.17,4,8,1,1483258500,33
2,2.8,3.5,1016,99,6.0,0.0,0.0,98,1,90,450,0.2,4,8,1,1483259400,44
3,3.1,3.5,1016,99,6.0,0.0,0.0,98,1,105,450,0.23,4,8,1,1483260300,61
4,3.5,3.6,1016,97,6.2,0.0,0.0,100,1,120,450,0.27,4,9,1,1483261200,65


In [8]:
import pandas as pd
from sklearn.preprocessing import RobustScaler, MinMaxScaler

# Define the categorical columns to exclude from scaling
categorical_columns = ['isSun', 'weather_type', 'hour', 'month']

# Initialize scalers
robust_scaler = RobustScaler()
min_max_scaler = MinMaxScaler(feature_range=(0, 1))

# Separate features and categorical columns
train_features = train_df.drop(columns=categorical_columns)
test_features = test_df.drop(columns=categorical_columns)

# Apply RobustScaler to every column except the last one, which is the target 'Energy delta[Wh]'.
# The scaler is fit on the training set only and then reused for the test set.
train_features.iloc[:, :-1] = robust_scaler.fit_transform(train_features.iloc[:, :-1].astype(float))
test_features.iloc[:, :-1] = robust_scaler.transform(test_features.iloc[:, :-1].astype(float))

# Cast the target column to float before scaling
train_features.iloc[:, -1] = train_features.iloc[:, -1].astype(float)
test_features.iloc[:, -1] = test_features.iloc[:, -1].astype(float)

# Apply MinMaxScaler to the target column (again fit on the training set only)
train_features.iloc[:, -1] = min_max_scaler.fit_transform(train_features.iloc[:, -1].values.reshape(-1, 1))
test_features.iloc[:, -1] = min_max_scaler.transform(test_features.iloc[:, -1].values.reshape(-1, 1))

# Concatenate the scaled features with the categorical columns
train_df = pd.concat([train_features, train_df[categorical_columns]], axis=1)
test_df = pd.concat([test_features, test_df[categorical_columns]], axis=1)

# Count the columns after scaling (this total includes the target column)
num_features = train_df.shape[1]

print(f"Number of features after scaling: {num_features}")


Number of features after scaling: 17
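
To make the two transforms concrete, here is a dependency-free sketch of what `RobustScaler` (with its defaults: median centering and scaling by the 25th to 75th percentile range) and `MinMaxScaler` compute for a single column. The helper names are hypothetical, the sketch assumes a non-zero interquartile range, and it fits on the five sample values only, so the numbers differ from the notebook's output, which is fit on the full training set.

```python
from statistics import median

def percentile(values, q):
    """q-th percentile using linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def robust_scale(values):
    """(x - median) / IQR; assumes the interquartile range is not zero."""
    center = median(values)
    iqr = percentile(values, 75) - percentile(values, 25)
    return [(v - center) / iqr for v in values]

def min_max_scale(values):
    """(x - min) / (max - min), mapping the column onto [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

energy = [5, 33, 44, 61, 65]  # first five non-zero targets shown earlier

print(robust_scale(energy))   # the median (44) maps to 0.0
print(min_max_scale(energy))  # the minimum maps to 0.0 and the maximum to 1.0
```

Because the median and interquartile range are barely affected by extreme values, robust scaling suits weather features with occasional spikes, while min-max scaling keeps the target in a bounded range.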


In [9]:
import torch
from torch.utils.data import Dataset
import random

class SequentialDataset(Dataset):
    def __init__(self, df, indices, max_window_size, num_classes):
        self.df = df
        self.indices = indices
        self.max_window_size = max_window_size
        self.min_window_size = max_window_size // 2
        self.num_classes = num_classes

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        start_idx = self.indices[idx]
        window_size = random.randint(self.min_window_size, self.max_window_size)
        # window_size = self.max_window_size
        end_idx = min(start_idx + window_size, len(self.df))

        seq = self.df.iloc[start_idx:end_idx]
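        # Inputs are all steps but the last; targets are the last column, shifted one step ahead.
        # NOTE: after the concat in In [8], the last column of train_df/test_df is 'month', not
        # 'Energy delta[Wh]'; reorder the columns first if the energy target is intended.
        X, y = seq.values[:-1, :], seq.values[1:, -1:]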
        X = torch.tensor(X, dtype=torch.float32)
        y = torch.tensor(y, dtype=torch.float32)
        return X, y
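
The one-step shift in `__getitem__` pairs the features at each time step with the target value at the following step. The sketch below shows that pairing on a toy window using plain lists; `split_window` and its `target_col` parameter are hypothetical and not part of the notebook's class.

```python
def split_window(window, target_col=-1):
    """Inputs: every row except the last. Targets: `target_col` of every row except the first."""
    inputs = window[:-1]
    targets = [[row[target_col]] for row in window[1:]]
    return inputs, targets

# Toy window of 4 time steps with columns [feature, target].
window = [[0.1, 10.0], [0.2, 20.0], [0.3, 30.0], [0.4, 40.0]]
X, y = split_window(window)

print(len(X), len(y))  # 3 3
print(y)               # [[20.0], [30.0], [40.0]]
```

Passing the target position explicitly, rather than relying on it being the last column, guards against column reordering in earlier preprocessing steps.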

In [10]:
def generate_indices(df, max_window_size):
    # Every start index that leaves room for a full window of max_window_size rows
    max_index = len(df) - max_window_size + 1
    return list(range(max_index))

# Longest window (in rows) that SequentialDataset may sample
max_window_size = 48
train_indices = generate_indices(train_df, max_window_size)
test_indices = generate_indices(test_df, max_window_size)

print('train df shape: ', train_df.shape)
print('test df shape: ', test_df.shape)
print('Train indices: ', len(train_indices))
print('Test indices: ', len(test_indices))

trainset = SequentialDataset(df=train_df, indices=train_indices, max_window_size=max_window_size, num_classes=num_features)
testset = SequentialDataset(df=test_df, indices=test_indices, max_window_size=max_window_size, num_classes=num_features)

train df shape:  (76740, 17)
test df shape:  (19186, 17)
Train indices:  76693
Test indices:  19139
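
The index counts follow directly from the frame lengths: a frame with `n` rows has `n - max_window_size + 1` start positions that leave room for a full window. The short check below reproduces the printed numbers; `count_windows` is a hypothetical helper used only for this illustration.

```python
def count_windows(num_rows, max_window_size):
    """Number of start indices that leave room for a full window."""
    return max(num_rows - max_window_size + 1, 0)

max_window_size = 48
print(count_windows(76740, max_window_size))  # 76693
print(count_windows(19186, max_window_size))  # 19139

# Sampled windows span 24 to 48 rows, so each X has between 23 and 47 time steps.
min_steps = max_window_size // 2 - 1
max_steps = max_window_size - 1
print(min_steps, max_steps)  # 23 47
```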


In [11]:
train_df.head()

Unnamed: 0,GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,sunlightTime,dayLength,SunlightTime/daylength,Timestamp,Energy delta[Wh],isSun,weather_type,hour,month
0,-0.422902,-0.782258,0.0,0.857143,0.875,0.0,0.0,0.292308,-0.846154,-1.238095,-0.8125,-0.970489,0.000814,1,4,8,1
1,-0.475057,-0.782258,0.0,0.857143,0.875,0.0,0.0,0.292308,-0.807692,-1.238095,-0.729167,-0.970479,0.006515,1,4,8,1
2,-0.511338,-0.782258,0.0,0.857143,0.875,0.0,0.0,0.292308,-0.769231,-1.238095,-0.666667,-0.970469,0.008754,1,4,8,1
3,-0.507937,-0.782258,0.0,0.857143,0.875,0.0,0.0,0.292308,-0.730769,-1.238095,-0.604167,-0.970459,0.012215,1,4,8,1
4,-0.503401,-0.774194,0.0,0.785714,0.958333,0.0,0.0,0.323077,-0.692308,-1.238095,-0.520833,-0.97045,0.013029,1,4,9,1


<a id="3"></a>
> <h1 style = 'font-family: Times New Roman'><b> <b style = 'color: #427bf5'>3.</b> Modeling</b></h1>

In [None]:
from tools.setting.data_config import DataConfig
from tools.setting.ml_params import MLParameters
from trainer_hub import TrainerHub

data_config = DataConfig(dataset_name = 'renewable-power-gen-prediction', task_type='regression', obs_shape=[num_features], label_size=1)

# Set the training configuration with MLParameters, using a GPT network and no encoder
ml_params = MLParameters(ccnet_network = 'gpt', encoder_network = 'none')

ml_params.num_epoch = 3
ml_params.model.ccnet_config.dropout = 0.1
ml_params.optimization.decay_rate_100k = 0.001

# Set the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Initialize the TrainerHub class with the training configuration, data configuration, device, and use_print and use_wandb flags
trainer_hub = TrainerHub(ml_params, data_config, device, use_print=True, use_wandb=False) 
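
The name `decay_rate_100k` suggests a learning rate that is multiplied by that factor every 100,000 steps. The learning rates printed in the training log are consistent with an exponential schedule starting at 2e-4. Both the starting rate and the step offset below are inferred from the log, not taken from the library's source, so this is a sketch under those assumptions; `decayed_lr` is a hypothetical function.

```python
def decayed_lr(step, initial_lr=2e-4, decay_rate_100k=0.001):
    """Exponential decay: the rate is multiplied by `decay_rate_100k` every 100,000 steps."""
    return initial_lr * decay_rate_100k ** (step / 100_000)

# Assumption: the log line [0/3][50/1198] corresponds to optimizer step 51.
print(decayed_lr(51))   # close to the logged 0.0001992966486266836
print(decayed_lr(101))  # close to the logged 0.00019860948975280947
```

Under this reading, three epochs of 1198 iterations lower the learning rate only gradually, in line with the slowly falling values in the log.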

In [13]:
trainer_hub.train(trainset, testset)

[0/3][50/1198][Time 8.05]
Unified LR across all optimizers: 0.0001992966486266836
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0123	Gen: 0.6441	Rec: 0.6478	E: 0.0085	R: 0.0160	P: 1.2835
--------------------Test Metrics------------------------
mse: 1.3645
mae: 1.0284
r2: -2.6296

[0/3][100/1198][Time 7.63]
Unified LR across all optimizers: 0.00019860948975280947
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0001	Gen: 0.2810	Rec: 0.2810	E: 0.0000	R: 0.0001	P: 0.5594
--------------------Test Metrics------------------------
mse: 1.1359
mae: 0.9630
r2: -3.5695

[0/3][150/1198][Time 7.54]
Unified LR across all optimizers: 0.00019792470014766712
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0001	Gen: 0.2023	Rec: 0.2023	E: 0.0001	R: 0.0001	P: 0.4044
--------------------Test Metrics------------------------
mse: 0.1514
mae: 0.3155
r2: 0.5658

[0/3][200/1198][Time 7.53]
Unified LR across all optimizer

[1/3][2/1198][Time 7.63]
Unified LR across all optimizers: 0.0001840771983090567
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0000	Gen: 0.0809	Rec: 0.0809	E: 0.0000	R: 0.0000	P: 0.1637
--------------------Test Metrics------------------------
mse: 0.0336
mae: 0.1098
r2: 0.9081

[1/3][52/1198][Time 7.63]
Unified LR across all optimizers: 0.00018344251488027063
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0000	Gen: 0.0801	Rec: 0.0801	E: 0.0000	R: 0.0000	P: 0.1626
--------------------Test Metrics------------------------
mse: 0.0415
mae: 0.1140
r2: 0.8778

[1/3][102/1198][Time 7.70]
Unified LR across all optimizers: 0.0001828100197890869
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0000	Gen: 0.0803	Rec: 0.0803	E: 0.0000	R: 0.0000	P: 0.1625
--------------------Test Metrics------------------------
mse: 0.3189
mae: 0.2694
r2: 0.9795

[1/3][152/1198][Time 9.49]
Unified LR across all optimizers: 0.

[2/3][4/1198][Time 12.01]
Unified LR across all optimizers: 0.00016943377835309297
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0000	Gen: 0.0646	Rec: 0.0646	E: 0.0000	R: 0.0000	P: 0.1314
--------------------Test Metrics------------------------
mse: 0.0363
mae: 0.0998
r2: 0.8725

[2/3][54/1198][Time 11.50]
Unified LR across all optimizers: 0.000168849584262868
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0000	Gen: 0.0644	Rec: 0.0644	E: 0.0000	R: 0.0000	P: 0.1307
--------------------Test Metrics------------------------
mse: 0.0485
mae: 0.1899
r2: 0.8726

[2/3][104/1198][Time 11.81]
Unified LR across all optimizers: 0.00016826740442705195
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0000	Gen: 0.0644	Rec: 0.0644	E: 0.0000	R: 0.0000	P: 0.1308
--------------------Test Metrics------------------------
mse: 0.5861
mae: 0.1145
r2: 0.9202

[2/3][154/1198][Time 11.81]
Unified LR across all optimizers

In [14]:
trainer_hub.test(testset)

{'mse': 0.08140304684638977,
 'mae': 0.16518843173980713,
 'r2': 0.9903948307037354}
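
For reference, the three reported metrics have the standard definitions sketched below in plain Python. How `TrainerHub` aggregates them (for example per batch, or in scaled rather than original units) is not shown in this notebook, so the sketch only illustrates the formulas; `regression_metrics` is a hypothetical helper.

```python
def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and coefficient of determination."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
print(metrics)  # mse 1.0, mae 0.5, r2 about 0.2
```

An `r2` below zero, as in the first log entries, means the predictions are worse than always predicting the mean of the true values.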