![](https://storage.googleapis.com/kaggle-datasets-images/312121/636393/a5097396fc07cf882d3e0d631b100a36/dataset-cover.jpg?t=2019-08-23-15-00-53)

***

<h1 style = 'font-family: Times New Roman'> <b>|</b><i> 1. Content</i></h1>

<br>

> <h4 style = 'font-family: Times New Roman'>
This dataset is intended for developers who want to train weather-forecasting models for the Indian climate.<br><br> It provides daily data for the city of Delhi, India, from 1st January 2013 to 24th April 2017. <br><br>The 4 parameters are
meantemp, humidity, wind_speed, and meanpressure.
    
  
<h1 style = 'font-family: Times New Roman'> <b>|</b><i> 2. About Dataset</i></h1>

<br> 
    
The dataset has 5 columns: <b>date</b> plus the 4 features below. <u>Description of features</u>:

*  (1) <b>meantemp</b>: Mean temperature, averaged over multiple 3-hour intervals in a day.
    
*  (2) <b>humidity</b>: Humidity value for the day (units are grams of water vapor per cubic meter volume of air).
    
    
*  (3) <b>wind_speed</b>: Wind speed measured in kmph.
    
    
*  (4) <b>meanpressure</b>: Atmospheric pressure reading (listed as atm in the dataset description; the sample values near 1015 are more consistent with hPa/millibars).
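
As a rough illustration of how a daily `meantemp` value could be derived from 3-hourly readings, the sketch below averages synthetic readings per day with pandas. The readings, the `temp_c` name, and the timestamps are made up for the example; the dataset's actual aggregation procedure is not documented here.

```python
import pandas as pd

# Hypothetical 3-hourly temperature readings covering two days (values are made up).
timestamps = pd.date_range("2013-01-01", periods=16, freq=pd.Timedelta(hours=3))
temps = [
    8.0, 7.0, 6.0, 9.0, 13.0, 14.0, 12.0, 11.0,  # day 1: 00:00 to 21:00
    6.0, 5.0, 5.0, 7.0, 10.0, 11.0, 8.0, 8.0,    # day 2: 00:00 to 21:00
]
readings = pd.Series(temps, index=timestamps, name="temp_c")

# Group the readings by calendar day and average them.
daily_mean = readings.resample("D").mean()
print(daily_mean)
```

Each day contributes eight readings here, so the daily value is simply their arithmetic mean.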
    
    
<h1 style = 'font-family: Times New Roman'> <b>|</b><i> 3. Goal of the Notebook</i></h1>
    
> <h4 style = 'font-family: Times New Roman'>
The goal is to train and test a model using LSTM and PyTorch. <br><br>
    The target column is <b>meantemp</b>
</h4>

 

Dataset source: https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data

***

<a id="1"></a>
> <h1 style = 'font-family: Times New Roman'><b> <b style = 'color: #42c2f5'>1.</b> Import Necessary Libraries </b></h1>

In [1]:
import os
import sys
path_append = "../"
sys.path.append(path_append)  # Add the parent directory to the import search path.

In [2]:
import pandas as pd
train_df = pd.read_csv(path_append + '../data/Daily Climate/DailyDelhiClimateTrain.csv')
test_df = pd.read_csv(path_append + '../data/Daily Climate/DailyDelhiClimateTest.csv')
train_df.head()

|   | date | meantemp | humidity | wind_speed | meanpressure |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | 10.0 | 84.5 | 0.0 | 1015.666667 |
| 1 | 2013-01-02 | 7.4 | 92.0 | 2.98 | 1017.8 |
| 2 | 2013-01-03 | 7.166667 | 87.0 | 4.633333 | 1018.666667 |
| 3 | 2013-01-04 | 8.666667 | 71.333333 | 1.233333 | 1017.166667 |
| 4 | 2013-01-05 | 6.0 | 86.833333 | 3.7 | 1016.5 |


<a id="2"></a>
> <h1 style = 'font-family: Times New Roman'><b> <b style = 'color: #4290f5'>2.</b> Modeling: Preprocess </b></h1>

In [3]:
print('Train set \n\n')
train_df.head()

Train set 




|   | date | meantemp | humidity | wind_speed | meanpressure |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | 10.0 | 84.5 | 0.0 | 1015.666667 |
| 1 | 2013-01-02 | 7.4 | 92.0 | 2.98 | 1017.8 |
| 2 | 2013-01-03 | 7.166667 | 87.0 | 4.633333 | 1018.666667 |
| 3 | 2013-01-04 | 8.666667 | 71.333333 | 1.233333 | 1017.166667 |
| 4 | 2013-01-05 | 6.0 | 86.833333 | 3.7 | 1016.5 |


In [4]:
from datetime import datetime

def get_day_of_year(date):
    date_obj = datetime.strptime(date, '%Y-%m-%d')  # Ensure your date format matches '%Y-%m-%d'
    return date_obj.timetuple().tm_yday

# Apply function to get the day of the year
train_df['day_of_year'] = train_df['date'].apply(get_day_of_year)
test_df['day_of_year'] = test_df['date'].apply(get_day_of_year)

# Drop the original date column
train_df = train_df.drop('date', axis=1)
test_df = test_df.drop('date', axis=1)

# Display the first few rows to verify
print(train_df.head())
print(test_df.head())


    meantemp   humidity  wind_speed  meanpressure  day_of_year
0  10.000000  84.500000    0.000000   1015.666667            1
1   7.400000  92.000000    2.980000   1017.800000            2
2   7.166667  87.000000    4.633333   1018.666667            3
3   8.666667  71.333333    1.233333   1017.166667            4
4   6.000000  86.833333    3.700000   1016.500000            5
    meantemp   humidity  wind_speed  meanpressure  day_of_year
0  15.913043  85.869565    2.743478     59.000000            1
1  18.500000  77.222222    2.894444   1018.277778            2
2  17.111111  81.888889    4.016667   1018.333333            3
3  18.700000  70.050000    4.545000   1015.700000            4
4  18.388889  74.944444    3.300000   1014.333333            5
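
The same day-of-year feature can also be computed without a Python-level loop. The sketch below uses `pd.to_datetime` and the `.dt.dayofyear` accessor on a small hand-made frame; the `demo_df` name and its dates are illustrative only and are not taken from the dataset files.

```python
import pandas as pd

# Illustrative dates: first day of a year, last day of a common year,
# and last day of a leap year.
demo_df = pd.DataFrame({"date": ["2013-01-01", "2013-12-31", "2016-12-31"]})

# Parse the strings once, then read the day of the year from the datetime accessor.
parsed = pd.to_datetime(demo_df["date"], format="%Y-%m-%d")
demo_df["day_of_year"] = parsed.dt.dayofyear

print(demo_df)
```

Because 2016 is a leap year, its last day maps to 366 rather than 365, so the feature's range depends on which years the data covers.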


In [5]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# train_df and test_df are the DataFrames prepared above;
# day_of_year is the last column, the four weather features come before it.
feature_cols = train_df.columns[:-1]
doy_col = train_df.columns[-1]

# Initialize scalers
standard_scaler = StandardScaler()
min_max_scaler = MinMaxScaler()

# Standardize the weather features: fit on the training set only, then reuse on the test set
train_df[feature_cols] = standard_scaler.fit_transform(train_df[feature_cols].astype(float))
test_df[feature_cols] = standard_scaler.transform(test_df[feature_cols].astype(float))

# Cast day_of_year to float by column name. Assigning floats through .iloc into the
# int64 column is what triggers pandas' "incompatible dtype" warning.
train_df[doy_col] = train_df[doy_col].astype(float)
test_df[doy_col] = test_df[doy_col].astype(float)

# Min-max scale day_of_year with the range learned from the training set
train_df[doy_col] = min_max_scaler.fit_transform(train_df[[doy_col]]).ravel()
test_df[doy_col] = min_max_scaler.transform(test_df[[doy_col]]).ravel()

# Number of features (columns) per time step
num_features = train_df.shape[1]
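
Because the weather features are standardized before training, any value read directly from the scaled columns is in units of standard deviations rather than, say, degrees Celsius. The sketch below shows the arithmetic that standardization applies and how to undo it, using NumPy on made-up temperatures; `standardize` and `unstandardize` are hypothetical helper names, not part of scikit-learn.

```python
import numpy as np

def standardize(values, mean, std):
    """Map raw values to zero mean and unit variance."""
    return (values - mean) / std

def unstandardize(scaled, mean, std):
    """Invert standardize(): map scaled values back to the original units."""
    return scaled * std + mean

# Made-up training temperatures in degrees Celsius.
train_temps = np.array([10.0, 20.0, 30.0, 40.0])
mean = train_temps.mean()
std = train_temps.std()  # population std (ddof=0), which is what StandardScaler uses

scaled = standardize(train_temps, mean, std)
restored = unstandardize(scaled, mean, std)

print(scaled)
print(restored)
```

With the fitted scaler from the cell above, `standard_scaler.inverse_transform` performs the same inversion for all standardized columns at once.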


In [6]:
import random

import torch
from torch.utils.data import Dataset


class SequentialDataset(Dataset):
    """Variable-length windows over a DataFrame for next-step prediction."""

    def __init__(self, df, indices, max_window_size, num_classes):
        self.df = df
        self.indices = indices  # Valid window start positions
        self.max_window_size = max_window_size
        self.min_window_size = max_window_size // 2
        self.num_classes = num_classes  # Stored but not used inside this class

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        start_idx = self.indices[idx]
        # Draw a random window length in [min_window_size, max_window_size]
        window_size = random.randint(self.min_window_size, self.max_window_size)
        end_idx = min(start_idx + window_size, len(self.df))

        seq = self.df.iloc[start_idx:end_idx]
        # Inputs are all rows but the last; targets are the same rows shifted one step ahead
        X, y = seq.values[:-1, :], seq.values[1:, :]

        X = torch.tensor(X, dtype=torch.float32)
        y = torch.tensor(y, dtype=torch.float32)
        return X, y
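
Each item returned by `SequentialDataset` pairs a window with the same window shifted one step ahead, and window lengths vary between items. The sketch below reproduces the shift on a tiny array and shows one possible way to pad variable-length windows into a batch. `next_step_pairs` and `pad_windows` are hypothetical helpers written for this illustration; how `TrainerHub` actually batches these items is not shown in this notebook.

```python
import numpy as np

def next_step_pairs(window):
    """Split a (T, F) window into inputs (first T-1 rows) and targets (last T-1 rows)."""
    return window[:-1], window[1:]

def pad_windows(windows):
    """Zero-pad (T_i, F) arrays to a common length; return the batch and a validity mask."""
    max_len = max(len(w) for w in windows)
    num_features = windows[0].shape[1]
    batch = np.zeros((len(windows), max_len, num_features), dtype=np.float32)
    mask = np.zeros((len(windows), max_len), dtype=bool)
    for i, w in enumerate(windows):
        batch[i, :len(w)] = w
        mask[i, :len(w)] = True  # True marks real (non-padded) time steps
    return batch, mask

# A toy window with 5 time steps and 2 features.
window = np.arange(10, dtype=np.float32).reshape(5, 2)
X, y = next_step_pairs(window)

# Batch two windows of different lengths (4 and 2 time steps).
batch, mask = pad_windows([X, X[:2]])
print(batch.shape, mask.sum(axis=1))
```

The mask lets a loss function ignore the padded positions, which would otherwise be scored as if they were real observations.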

In [7]:
from random import shuffle

def generate_indices(df, max_window_size):
    """Return every start index at which a full window of max_window_size rows fits."""
    max_index = len(df) - max_window_size + 1  # Number of valid starting positions
    return list(range(max_index))

# train_df, test_df and num_features come from the preprocessing cells above
max_window_size = 30
train_indices = generate_indices(train_df, max_window_size)
test_indices = generate_indices(test_df, max_window_size)

print('train df shape: ', train_df.shape)
print('test df shape: ', test_df.shape)
print('Train indices: ', len(train_indices))
print('Test indices: ', len(test_indices))

trainset = SequentialDataset(df=train_df, indices=train_indices, max_window_size=max_window_size, num_classes=num_features)
testset = SequentialDataset(df=test_df, indices=test_indices, max_window_size=max_window_size, num_classes=num_features)

train df shape:  (1462, 5)
test df shape:  (114, 5)
Train indices:  1433
Test indices:  85
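
The index counts printed above follow directly from the window arithmetic: a series of length N admits N - W + 1 start positions for a full window of length W. A quick check, with `count_window_starts` as a hypothetical helper name:

```python
def count_window_starts(series_length, window_size):
    """Number of start indices for which a full window fits inside the series."""
    return max(series_length - window_size + 1, 0)

train_starts = count_window_starts(1462, 30)
test_starts = count_window_starts(114, 30)

print(train_starts)  # 1433
print(test_starts)   # 85
```

Since the dataset draws window lengths no larger than `max_window_size`, every sampled window starting at one of these indices fits inside the DataFrame.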


<a id="4"></a>
> <h1 style = 'font-family: Times New Roman'><b> <b style = 'color: #427bf5'>4.</b> Modeling</b></h1>

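<b> First, configure the model and trainer. The cell below configures TrainerHub with the 'gpt' core model rather than defining an 'LSTMModel' class.</b>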

In [8]:
from tools.setting.data_config import DataConfig
from tools.setting.ml_params import MLParameters
from trainer_hub import TrainerHub

data_config = DataConfig(dataset_name = 'daily-delhi-climate', task_type='regression', obs_shape=[num_features], label_size=num_features)

# Build the training configuration with MLParameters: 'gpt' core model, no encoder.
ml_params = MLParameters(core_model = 'gpt', encoder_model = 'none')

ml_params.core_config.d_model = 256 
ml_params.core_config.num_layer = 8 
ml_params.core_config.dropout = 0.1 

ml_params.num_epoch = 1000

# Set the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Initialize the TrainerHub class with the training configuration, data configuration, device, and use_print and use_wandb flags
trainer_hub = TrainerHub(ml_params, data_config, device, use_print=True, use_wandb=False) 

In [9]:
trainer_hub.train(trainset, testset)

Epochs:   0%|          | 0/1000 [00:00<?, ?it/s]

[2/1000][6/22][Time 6.21]
Unified LR across all optimizers: 0.0001995308238189185
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.1514	Gen: 0.2349	Rec: 0.2219	E: 0.1658	R: 0.1390	P: 0.3072
--------------------Test Metrics------------------------
mse: 9.6497
mae: 2.4412
r2: -20.0180

[4/1000][12/22][Time 5.69]
Unified LR across all optimizers: 0.00019907191565870155
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0618	Gen: 0.1174	Rec: 0.1129	E: 0.0671	R: 0.0575	P: 0.1676
--------------------Test Metrics------------------------
mse: 9.8677
mae: 2.2721
r2: -20.3952

[6/1000][18/22][Time 5.62]
Unified LR across all optimizers: 0.00019861406295796434
--------------------Training Metrics--------------------
Trainer:  gpt
Inf: 0.0529	Gen: 0.1026	Rec: 0.0994	E: 0.0569	R: 0.0502	P: 0.1484
--------------------Test Metrics------------------------
mse: 9.3111
mae: 2.1918
r2: -19.4094
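
The negative `r2` values in these early logs are not a bug in themselves: R² compares the model's squared error with that of always predicting the mean of the targets, so it drops below zero whenever the model does worse than that baseline. The sketch below implements the textbook definitions on made-up numbers. The helper names are hypothetical, and how `TrainerHub` aggregates these metrics across features and time steps is not shown here.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets and three sets of predictions.
y_true = [1.0, 2.0, 3.0]
good_pred = [1.1, 1.9, 3.2]   # close to the targets
mean_pred = [2.0, 2.0, 2.0]   # always predicts the target mean
bad_pred = [3.0, 3.0, 0.0]    # worse than predicting the mean

for name, pred in [("good", good_pred), ("mean", mean_pred), ("bad", bad_pred)]:
    print(name, mse(y_true, pred), mae(y_true, pred), r2(y_true, pred))
```

Predicting the mean gives an R² of zero, so a strongly negative value early in training simply indicates that the model has not yet beaten that baseline on the test set.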

[9/1000][2/22][Time 5.63]
Unified LR across all optimi