# Time Series Data Preprocessing and Model Preparation

In this notebook, we will:
- Normalize the `Global_active_power` feature using `MinMaxScaler`.
- Split the dataset into training, validation, and test sets.
- Create a function to generate input (`X`) and output (`y`) sets for time series forecasting.
- Print the shapes of the training, validation, and test sets.
- Save the resulting arrays to an `.npz` file and load them back.

1. [Imports and Setup](#1.-Imports-and-Setup)
2. [Data Normalization](#2.-Data-Normalization)
3. [Define Time Periods for Training and Validation Sets](#3.-Define-Time-Periods-for-Training-and-Validation-Sets)
4. [Create X and y Sets](#4.-Create-X-and-y-Sets)
5. [Data Saving and Loading](#5.-Data-Saving-and-Loading)


## 1. Imports and Setup
We will use the following libraries:
- `pandas` for data manipulation.
- `numpy` for handling arrays.
- `MinMaxScaler` from `sklearn.preprocessing` to normalize the data.
- `joblib` to save and load the scaler for future use.

In [16]:
# Python Packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import joblib

In [6]:
# Read the hourly CSV, parsing the 'Date' column into a DatetimeIndex
df = pd.read_csv('global_active_power_hourly.csv', index_col='Date', parse_dates=True)

In [13]:
df

Date,Global_active_power
2006-12-16 17:00:00,0.636816
2006-12-16 18:00:00,0.545045
2006-12-16 19:00:00,0.509006
2006-12-16 20:00:00,0.488550
2006-12-16 21:00:00,0.455597
...,...
2010-11-26 17:00:00,0.248876
2010-11-26 18:00:00,0.225194
2010-11-26 19:00:00,0.238534
2010-11-26 20:00:00,0.161531


## 2. Data Normalization
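`MinMaxScaler` with its default settings rescales the `Global_active_power` column to the range [0, 1]. Keeping inputs and targets in a small, fixed range can improve the training of many machine learning models, particularly those fitted by gradient descent. Note that the scaler here is fitted on the full series, before the data is split into training, validation, and test sets.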

In [None]:
scaler = MinMaxScaler()
df['Global_active_power'] = scaler.fit_transform(df[['Global_active_power']])

# Save the scaler for inverse transformation
# Saving the scaler allows us to revert the normalization when needed (e.g., to interpret predictions in original units).
joblib.dump(scaler, 'scaler.pkl')
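
Reverting the normalization has one practical wrinkle: the scaler was fitted on a single column, while multi-step predictions have one column per forecast hour. The sketch below shows one way to handle that by flattening, inverting, and restoring the shape. The arrays `raw` and `y_pred_scaled` are made-up stand-ins, and the scaler is fitted in place rather than loaded with `joblib.load('scaler.pkl')` so that the example runs on its own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the fitted scaler; in the notebook it would
# come from joblib.load('scaler.pkl') instead.
raw = np.array([[0.2], [1.0], [3.4], [5.0]])
scaler = MinMaxScaler().fit(raw)

# Hypothetical scaled predictions: 3 samples, 2-hour horizon.
y_pred_scaled = np.array([[0.0, 0.5],
                          [0.25, 1.0],
                          [0.1, 0.9]])

# The scaler expects shape (n, 1), so flatten to one column, invert the
# scaling, then restore the original (samples, horizon) shape.
y_pred = scaler.inverse_transform(
    y_pred_scaled.reshape(-1, 1)
).reshape(y_pred_scaled.shape)

print(y_pred)  # values back in the original units of `raw`
```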

## 3. Define Time Periods for Training and Validation Sets
- Fixed date ranges separate the data used for training and validation (2006-12-16 through 2008-12-31) from two test periods (2009 and 2010), which keeps model evaluation consistent.
- The training/validation range is then split chronologically: the first 80% of its rows form the training set and the remaining 20% form the validation set.

In [None]:
train_val_start = '2006-12-16' # 2006 to 2008
train_val_end = '2008-12-31'
test_start_1 = '2009-01-01' # 2009
test_end_1 = '2009-12-31'
test_start_2 = '2010-01-01' # 2010
test_end_2 = '2010-12-31'

# Splitting training and validation (chronological 80-20 split)
train_val_df = df[train_val_start:train_val_end]
split_index = int(len(train_val_df) * 0.8)
train_df = train_val_df[:split_index] # 80%
val_df = train_val_df[split_index:] # 20%
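
The notebook fits the scaler on the whole series before splitting. A common alternative, shown here only as a sketch and not as what this notebook does, is to fit the scaler on the training rows alone and reuse it for later rows, so that the minimum and maximum of later periods do not influence the scaling. The frame `raw_df` and its values are hypothetical stand-ins for the unscaled data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the raw (unscaled) hourly series.
idx = pd.date_range("2006-12-16 17:00", periods=100, freq=pd.Timedelta(hours=1))
raw_df = pd.DataFrame({"Global_active_power": np.linspace(0.1, 5.0, 100)}, index=idx)

# Chronological 80-20 split, mirroring the split used above.
split_index = int(len(raw_df) * 0.8)
train_raw = raw_df.iloc[:split_index]
later_raw = raw_df.iloc[split_index:]

# Fit on the training rows only, then apply the same scaling to the rest.
col = ["Global_active_power"]
train_only_scaler = MinMaxScaler().fit(train_raw[col])
train_scaled = train_only_scaler.transform(train_raw[col])
later_scaled = train_only_scaler.transform(later_raw[col])

# Later values can fall outside [0, 1] when they exceed the training range.
print(train_scaled.min(), train_scaled.max(), later_scaled.max())
```

With this variant, values outside the training range map outside [0, 1], which a model has to tolerate.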

## 4. Create X and y Sets

This step turns each time series into supervised-learning pairs. Each input row of `X` holds the previous `look_back` values, and the matching row of `y` holds the next `prediction_horizon` values. Sliding this window one step at a time over the series yields the samples a model needs to learn temporal patterns and forecast future values.
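
To make the windowing concrete, here is a toy sketch on the integers 0 to 9 with a look-back of 3 and a horizon of 2. The `toy_` names and values are illustrative only; the notebook's own parameters are set further down.

```python
import numpy as np

toy_series = np.arange(10)   # toy stand-in: 0, 1, ..., 9
toy_look_back = 3            # use the previous 3 values as input
toy_horizon = 2              # predict the next 2 values

X_rows, y_rows = [], []
for i in range(toy_look_back, len(toy_series) - toy_horizon + 1):
    X_rows.append(toy_series[i - toy_look_back:i])   # window before position i
    y_rows.append(toy_series[i:i + toy_horizon])     # values from position i on

X_toy, y_toy = np.array(X_rows), np.array(y_rows)

print(X_toy[0], y_toy[0])        # [0 1 2] [3 4]
print(X_toy[-1], y_toy[-1])      # [5 6 7] [8 9]
print(X_toy.shape, y_toy.shape)  # (6, 3) (6, 2)
```

A series of length n yields n - look_back - prediction_horizon + 1 samples, here 10 - 3 - 2 + 1 = 6.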

In [None]:
def create_X_y(data, look_back, prediction_horizon):
    """
    Create input (X) and output (y) sets for time series forecasting.

    Parameters:
    data (pd.Series): The time series data from which to create X and y sets.
    look_back (int): The number of previous time steps to use as input features.
    prediction_horizon (int): The number of time steps to predict into the future.

    Returns:
    tuple: Two numpy arrays (X, y) where:
        - X contains the input features for the model
        - y contains the corresponding output values to predict
    """
    
    # Initialize empty lists to hold the input (X) and output (y) sets
    X, y = [], []
    
    # Loop through the data, starting from the look_back period to the end
    for i in range(look_back, len(data) - prediction_horizon + 1):
        # Append the last 'look_back' values to X
        X.append(data.iloc[i - look_back:i].values)
        
        # Append the next 'prediction_horizon' values to y
        y.append(data.iloc[i:i + prediction_horizon].values)
    
    # Convert the lists to numpy arrays for model compatibility
    return np.array(X), np.array(y)

# Define Parameters for X and y Creation
look_back = 24  # Look back period in hours
prediction_horizon = 2  # Predict the next 2 hours

# Now we will create the X and y sets for each dataset (training, validation, and testing).

# Training Set
X_train, y_train = create_X_y(train_df['Global_active_power'], look_back, prediction_horizon)

# Validation Set
X_val, y_val = create_X_y(val_df['Global_active_power'], look_back, prediction_horizon)

# Test Set 1
test_data_1 = df[test_start_1:test_end_1]['Global_active_power']
X_test_1, y_test_1 = create_X_y(test_data_1, look_back, prediction_horizon)

# Test Set 2
test_data_2 = df[test_start_2:test_end_2]['Global_active_power']
X_test_2, y_test_2 = create_X_y(test_data_2, look_back, prediction_horizon)

# Print the shapes of the generated datasets to verify that they have been created correctly.
print("Train set shapes - X:", X_train.shape, ", y:", y_train.shape)
print("Validation set shapes - X:", X_val.shape, ", y:", y_val.shape)
print("Test set 1 shapes - X:", X_test_1.shape, ", y:", y_test_1.shape)
print("Test set 2 shapes - X:", X_test_2.shape, ", y:", y_test_2.shape)

Train set shapes - X: (14303, 24) , y: (14303, 2)
Validation set shapes - X: (3558, 24) , y: (3558, 2)
Test set 1 shapes - X: (8735, 24) , y: (8735, 2)
Test set 2 shapes - X: (7893, 24) , y: (7893, 2)
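
These shapes follow from the window arithmetic: a series of n hourly values yields n - look_back - prediction_horizon + 1 samples, so the 8,760 hours of 2009 give 8760 - 24 - 2 + 1 = 8735. As a possible alternative to the Python loop, the sketch below builds the same windows with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). The name `create_X_y_vectorized` is hypothetical, and the input is a synthetic array rather than the power data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_X_y_vectorized(values, look_back, prediction_horizon):
    """Hypothetical vectorized variant; expects a 1-D array-like.

    Raises ValueError if the series is shorter than
    look_back + prediction_horizon.
    """
    values = np.asarray(values)
    # Each row is one window of look_back inputs followed by the targets.
    windows = sliding_window_view(values, look_back + prediction_horizon)
    # Copy so the results are independent arrays, not views of the input.
    return windows[:, :look_back].copy(), windows[:, look_back:].copy()

# Synthetic stand-in with as many points as there are hours in 2009.
hours_2009 = np.arange(8760, dtype=float)
X_demo, y_demo = create_X_y_vectorized(hours_2009, 24, 2)
print(X_demo.shape, y_demo.shape)  # (8735, 24) (8735, 2)
```

To use it with a pandas Series, pass `series.to_numpy()`.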


## 5. Data Saving and Loading
Save the preprocessed training, validation, and test arrays to a single `.npz` archive, then load them back. Note that `np.savez` stores the arrays uncompressed; `np.savez_compressed` is the compressed variant.

In [18]:
# Uncomment to write the archive; the load below expects 'power_data.npz' to exist.
# np.savez('power_data.npz',
#          X_train=X_train, y_train=y_train,
#          X_val=X_val, y_val=y_val,
#          X_test_1=X_test_1, y_test_1=y_test_1,
#          X_test_2=X_test_2, y_test_2=y_test_2)


# Load the datasets from the .npz file
loaded_data = np.load('power_data.npz')

# Accessing the loaded data
X_train_loaded = loaded_data['X_train']
y_train_loaded = loaded_data['y_train']
X_val_loaded = loaded_data['X_val']
y_val_loaded = loaded_data['y_val']
X_test_1_loaded = loaded_data['X_test_1']
y_test_1_loaded = loaded_data['y_test_1']
X_test_2_loaded = loaded_data['X_test_2']
y_test_2_loaded = loaded_data['y_test_2']

# Output the shapes of the loaded sets
print("Loaded Train set shapes - X:", X_train_loaded.shape, ", y:", y_train_loaded.shape)
print("Loaded Validation set shapes - X:", X_val_loaded.shape, ", y:", y_val_loaded.shape)
print("Loaded Test set 1 shapes - X:", X_test_1_loaded.shape, ", y:", y_test_1_loaded.shape)
print("Loaded Test set 2 shapes - X:", X_test_2_loaded.shape, ", y:", y_test_2_loaded.shape)

Loaded Train set shapes - X: (14303, 24) , y: (14303, 2)
Loaded Validation set shapes - X: (3558, 24) , y: (3558, 2)
Loaded Test set 1 shapes - X: (8735, 24) , y: (8735, 2)
Loaded Test set 2 shapes - X: (7893, 24) , y: (7893, 2)
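
A minimal round-trip sketch, assuming compression is wanted: it uses `np.savez_compressed` and opens the archive in a `with` block so the file handle is closed once the arrays are read. The small random arrays and the file name `power_data_demo.npz` are placeholders, written to a temporary directory so nothing in the working directory is touched.

```python
import os
import tempfile
import numpy as np

# Small hypothetical arrays standing in for X_train and y_train.
X_small = np.random.default_rng(0).random((5, 24))
y_small = np.random.default_rng(1).random((5, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "power_data_demo.npz")

    # Keyword names become the keys used to retrieve each array later.
    np.savez_compressed(path, X_train=X_small, y_train=y_small)

    # Arrays are read when indexed, so access them inside the with block.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        X_back = archive["X_train"]
        y_back = archive["y_train"]

round_trip_ok = np.array_equal(X_small, X_back) and np.array_equal(y_small, y_back)
print(keys, round_trip_ok)
```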
