# 2 Format the Input Data for ML

The most tedious part of this project is formatting our data into the correct "shape" so our machine learning program can read it. We have to format the training and testing data manually, and we will save each data set as a NumPy array.  
In their paper, the authors suggested creating (2 * 2 * 8) - 14 = 18 data sets. 
The breakdown is as follows. We need 2 splits (training and testing). For each split, we need "input parameters (x)" and "output parameters (y)". Finally, there are 8 ways to group the input features: 

1. Explanatory Variables
2. Explanatory Variables + Garch Params
3. Explanatory Variables + Egarch Params
4. Explanatory Variables + Ewma Params
5. Explanatory Variables + Garch Params + Egarch Params
6. Explanatory Variables + Garch Params + Ewma Params
7. Explanatory Variables + Egarch Params + Ewma Params
8. Explanatory Variables + Garch Params + Egarch Params + Ewma Params

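We subtract 14 because training_y and testing_y are the same for groups 1-8. We only need one copy of each, so the other 7 copies of training_y and the other 7 copies of testing_y (2 * 7 = 14) are redundant. 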
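
As a quick check of this arithmetic, the sketch below enumerates the 8 groupings and counts the distinct arrays. The variable names are illustrative only and do not come from the paper.

```python
from itertools import combinations

# The three optional parameter blocks that can be added to the explanatory variables.
extras = ["Garch Params", "Egarch Params", "Ewma Params"]

# Every subset of the extras (including the empty subset) gives one input grouping.
groups = [
    ("Explanatory Variables",) + combo
    for r in range(len(extras) + 1)
    for combo in combinations(extras, r)
]

n_groups = len(groups)               # 8 groupings
total_arrays = 2 * 2 * n_groups      # (training, testing) x (x, y) x 8 groupings
duplicate_y = 2 * (n_groups - 1)     # training_y and testing_y each repeat 7 extra times
distinct_arrays = total_arrays - duplicate_y

print(n_groups, total_arrays, duplicate_y, distinct_arrays)  # 8 32 14 18
```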

However, creating all 18 data sets would be incredibly tedious. For the sake of time, we will only create data set 8 (Explanatory Variables + Garch Params + Egarch Params + Ewma Params) and data set 1 (Explanatory Variables). In total, we will create 6 data sets: data_8_training_x, data_8_testing_x, data_1_training_x, data_1_testing_x, training_y, and testing_y.

In [243]:
import numpy as np
import pandas as pd
import os

**Creating Data Set 8**

**Training Set** - data_8_training_x, training_y

In [247]:
df_training = pd.read_excel("./excel_data/training_data.xlsx") # training dataframe

In [248]:
window_size = 7

Work on x data first

In [250]:
x_training = df_training.drop(columns = ['date','realized_volatility'])
x_training_numpy = x_training.to_numpy() 
subsets_x_training = [x_training_numpy[i:i+window_size] for i in range(len(x_training_numpy) - window_size)]
subsets_x_training_numpy = np.array(subsets_x_training)
print("The shape of subsets_x_training_numpy is: ", subsets_x_training_numpy.shape)

The shape of subsets_x_training_numpy is:  (2174, 7, 14)
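
To see where the first dimension comes from, here is a minimal sketch of the same windowing on a small toy array (the sizes 20 and 3 are made up). A frame with N rows produces N - window_size windows of shape (window_size, features). The second half shows what we believe is an equivalent vectorized form using NumPy's `sliding_window_view` (available in NumPy 1.20 and later); it is offered as an alternative, not as the method used in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

window_size = 7
n_rows, n_features = 20, 3   # toy sizes, not the real data

# Toy 2-D array standing in for x_training_numpy.
toy = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)

# Same list comprehension as in the cell above: one window per start index i.
windows = np.array([toy[i:i + window_size] for i in range(len(toy) - window_size)])
print(windows.shape)  # (13, 7, 3)

# sliding_window_view appends the window axis last, giving (14, 3, 7).
# Swap the last two axes and drop the final window, which has no following row
# left to serve as its target.
view = sliding_window_view(toy, window_size, axis=0)
windows_vectorized = view.transpose(0, 2, 1)[:-1]

print(np.array_equal(windows, windows_vectorized))  # True
```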


In [251]:
head = subsets_x_training_numpy[:1]
print(head)

[[[ 2.09270000e+02 -7.09476109e-03  4.28000000e+00  5.33000000e+00
    7.72300000e+01  1.11445000e+03  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [ 2.11860000e+02  1.23003949e-02  4.27000000e+00  5.32000000e+00
    7.69800000e+01  1.10980000e+03  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [ 2.11720000e+02 -6.61032179e-04  4.28000000e+00  5.32000000e+00
    7.31400000e+01  1.06353000e+03  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [ 2.05060000e+02 -3.19620277e-02  4.22500000e+00  5.26500000e+00
    7.11900000e+01  1.06625000e+03  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [ 2.03370000e+02 -8.27563913e-03  4.23000000e+00  

To make the training converge faster, we will standardize the first 6 features (the explanatory variables) to mean = 0 and standard deviation = 1, while keeping the last 8 features (the Garch, Egarch, and Ewma params) unchanged. 

In [253]:
# Extract the first 6 features and last 8 features
first_6_features = subsets_x_training_numpy[:, :, :6]
last_8_features = subsets_x_training_numpy[:, :, 6:]

# Compute mean and std for the first 6 features across samples and timesteps
# this normalizes across each of the first 6 features
mean = np.mean(first_6_features, axis=(0, 1), keepdims=True)
std = np.std(first_6_features, axis=(0, 1), keepdims=True)

# Normalize the first 6 features
normalized_first_6 = (first_6_features - mean) / std

# Concatenate normalized and unchanged features
data_8_training_x = np.concatenate([normalized_first_6, last_8_features], axis=-1)
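
The sketch below repeats the same steps on random toy data (a made-up stand-in, not our real features) to confirm what the normalization does: with `axis=(0, 1)` and `keepdims=True`, the mean and std have shape (1, 1, 6), so they broadcast over every window and timestep, and the last 8 features pass through untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for subsets_x_training_numpy: 100 windows, 7 timesteps, 14 features.
toy = rng.normal(loc=50.0, scale=10.0, size=(100, 7, 14))

first_6 = toy[:, :, :6]
last_8 = toy[:, :, 6:]

# One mean and one std per feature, pooled over all windows and timesteps.
mean = np.mean(first_6, axis=(0, 1), keepdims=True)
std = np.std(first_6, axis=(0, 1), keepdims=True)
print(mean.shape, std.shape)  # (1, 1, 6) (1, 1, 6)

normalized = np.concatenate([(first_6 - mean) / std, last_8], axis=-1)

# Each of the first 6 features now has mean ~0 and std ~1; the last 8 are unchanged.
print(np.allclose(normalized[:, :, :6].mean(axis=(0, 1)), 0.0))  # True
print(np.allclose(normalized[:, :, :6].std(axis=(0, 1)), 1.0))   # True
print(np.array_equal(normalized[:, :, 6:], last_8))              # True
```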

In [254]:
print("\nNormalized Data:")
head = data_8_training_x[:1] # results look correct
print(head)


Normalized Data:
[[[-2.02679181e+00 -7.24829151e-01  2.24219359e+00  2.51619232e+00
    1.61261941e-01 -1.26125674e+00  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [-1.92910201e+00  1.23038370e+00  2.22953566e+00  2.50535096e+00
    1.50046136e-01 -1.28705881e+00  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [-1.93438254e+00 -7.62491951e-02  2.24219359e+00  2.50535096e+00
   -2.22286390e-02 -1.54380323e+00  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [-2.18558488e+00 -3.23168185e+00  2.17257499e+00  2.44572348e+00
   -1.09711923e-01 -1.52871041e+00  1.97160000e-06  5.00000000e-02
    9.30000000e-01 -1.46200000e-01  1.07700000e-01  9.83900000e-01
    9.85000000e-01  1.50000000e-02]
  [-2.24932842e+00 -8.43872685e-01

Now, let's work on the y data

In [256]:
y_training = df_training[['realized_volatility']]
y_training_numpy = y_training['realized_volatility'].to_numpy()
y_training_truncated = y_training_numpy[window_size:]
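
Dropping the first `window_size` targets lines each window up with the value on the row right after it: window i covers rows i to i + 6, and its target is row i + 7. A minimal sketch on toy data (where the feature and the target both simply equal the row number) makes the alignment visible:

```python
import numpy as np

window_size = 7
x = np.arange(20).reshape(-1, 1)   # toy feature: x[t] = t
y = np.arange(20)                  # toy target:  y[t] = t

windows = np.array([x[i:i + window_size] for i in range(len(x) - window_size)])
targets = y[window_size:]

# Same number of windows and targets.
print(len(windows), len(targets))  # 13 13

# The last row inside window i is row i + 6, and its target is row i + 7.
last_row_in_window = windows[:, -1, 0]
print(np.array_equal(last_row_in_window + 1, targets))  # True
```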

In [257]:
# save numpy data to folder
folder_path = "./numpy_data"
os.makedirs(folder_path, exist_ok=True)

In [258]:
# save data_8_training_x
file_path = os.path.join(folder_path, "data_8_training_x.npy")
np.save(file_path, data_8_training_x)

In [259]:
# save y_training_truncated
file_path = os.path.join(folder_path, "training_y.npy")
np.save(file_path, y_training_truncated)
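
To confirm that a saved file can be read back unchanged, we can round-trip an array through `np.save` and `np.load`. The sketch below uses a temporary folder and a made-up file name so it does not touch `./numpy_data`.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for one of our (samples, window_size, features) arrays.
arr = np.random.default_rng(0).normal(size=(5, 7, 14))

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "example_x.npy")  # hypothetical file name
    np.save(file_path, arr)
    loaded = np.load(file_path)

# Shape, dtype, and values all survive the round trip.
print(loaded.shape, loaded.dtype)     # (5, 7, 14) float64
print(np.array_equal(loaded, arr))    # True
```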

**Now work on the testing set** - data_8_testing_x, testing_y 

In [261]:
df_testing = pd.read_excel("./excel_data/testing_data.xlsx") # testing dataframe

In [262]:
x_testing = df_testing.drop(columns = ['date','realized_volatility'])
x_testing_numpy = x_testing.to_numpy() 
subsets_x_testing = [x_testing_numpy[i:i+window_size] for i in range(len(x_testing_numpy) - window_size)]
subsets_x_testing_numpy = np.array(subsets_x_testing)
print("The shape of subsets_x_testing_numpy is: ", subsets_x_testing_numpy.shape)

The shape of subsets_x_testing_numpy is:  (1116, 7, 14)


In [263]:
# Extract the first 6 features and last 8 features
first_6_features = subsets_x_testing_numpy[:, :, :6]
last_8_features = subsets_x_testing_numpy[:, :, 6:]

# Compute mean and std for the first 6 features across samples and timesteps
# this normalizes across each of the first 6 features
mean = np.mean(first_6_features, axis=(0, 1), keepdims=True)
std = np.std(first_6_features, axis=(0, 1), keepdims=True)

# Normalize the first 6 features
normalized_first_6 = (first_6_features - mean) / std

# Concatenate normalized and unchanged features
data_8_testing_x = np.concatenate([normalized_first_6, last_8_features], axis=-1)
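
Note that the cell above normalizes the testing windows with the testing set's own mean and std. A common alternative is to compute the statistics on the training windows only and reuse them for the testing windows, so both sets are on the same scale and no information from the testing period is used. The sketch below shows that alternative on toy data. It is not what this notebook does, switching to it would change the values saved in data_8_testing_x, and the helper names `fit_standardizer` and `apply_standardizer` are hypothetical.

```python
import numpy as np

def fit_standardizer(x_train, n_scaled=6):
    """Per-feature mean and std of the first n_scaled features, from training windows only."""
    block = x_train[:, :, :n_scaled]
    mean = block.mean(axis=(0, 1), keepdims=True)
    std = block.std(axis=(0, 1), keepdims=True)
    return mean, std

def apply_standardizer(x, mean, std):
    """Scale the leading features with the given statistics; leave the rest untouched."""
    n_scaled = mean.shape[-1]
    scaled = (x[:, :, :n_scaled] - mean) / std
    return np.concatenate([scaled, x[:, :, n_scaled:]], axis=-1)

# Toy stand-ins for the training and testing windows (random data, made-up sizes).
rng = np.random.default_rng(0)
toy_train = rng.normal(5.0, 2.0, size=(200, 7, 14))
toy_test = rng.normal(6.0, 2.0, size=(80, 7, 14))

mean, std = fit_standardizer(toy_train)
toy_train_scaled = apply_standardizer(toy_train, mean, std)
toy_test_scaled = apply_standardizer(toy_test, mean, std)   # reuses the training statistics

print(toy_train_scaled.shape, toy_test_scaled.shape)  # (200, 7, 14) (80, 7, 14)
```

With this approach the testing features will generally not have exactly mean 0 and std 1, which is expected.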

In [264]:
y_testing = df_testing[['realized_volatility']]
y_testing_numpy = y_testing['realized_volatility'].to_numpy()
y_testing_truncated = y_testing_numpy[window_size:]

In [265]:
y_testing_truncated.shape

(1116,)

In [266]:
# save data_8_testing_x
file_path = os.path.join(folder_path, "data_8_testing_x.npy")
np.save(file_path, data_8_testing_x)

In [None]:
# save y_testing_truncated
file_path = os.path.join(folder_path, "testing_y.npy")
np.save(file_path, y_testing_truncated)

**Creating Data Set 1**

Make sure to save data_1_training_x

In [322]:
x_training = df_training.drop(columns = ['date','realized_volatility'])
x_training = x_training.drop(x_training.columns[-8:], axis=1)
x_training.head()

| | price_kospi_raw | log_return_kospi | interest_rate_government | interest_rate_corporate | price_oil | price_gold |
|---|---|---|---|---|---|---|
| 0 | 209.27 | -0.007095 | 4.28 | 5.33 | 77.23 | 1114.45 |
| 1 | 211.86 | 0.0123 | 4.27 | 5.32 | 76.98 | 1109.8 |
| 2 | 211.72 | -0.000661 | 4.28 | 5.32 | 73.14 | 1063.53 |
| 3 | 205.06 | -0.031962 | 4.225 | 5.265 | 71.19 | 1066.25 |
| 4 | 203.37 | -0.008276 | 4.23 | 5.26 | 71.89 | 1062.85 |
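
Dropping `x_training.columns[-8:]` relies on the 8 parameter columns being the last 8 columns in the spreadsheet. The sketch below, on a toy frame with placeholder column names (`col_0` to `col_13`, not our real ones), shows that this is equivalent to keeping everything except the last 8 columns by position with `iloc`.

```python
import numpy as np
import pandas as pd

# Toy frame with 14 placeholder columns standing in for the 6 + 8 feature columns.
columns = [f"col_{i}" for i in range(14)]
toy = pd.DataFrame(np.arange(3 * 14).reshape(3, 14), columns=columns)

by_label = toy.drop(toy.columns[-8:], axis=1)   # same call as in the cell above
by_position = toy.iloc[:, :-8]                  # positional equivalent

print(list(by_label.columns))
# ['col_0', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5']
print(by_label.equals(by_position))  # True
```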


In [None]:
x_training_numpy = x_training.to_numpy() 
subsets_x_training = [x_training_numpy[i:i+window_size] for i in range(len(x_training_numpy) - window_size)]
subsets_x_training_numpy = np.array(subsets_x_training)
print("The shape of subsets_x_training_numpy is: ", subsets_x_training_numpy.shape)

In [284]:
head = subsets_x_training_numpy[:1]
print(head)

[[[ 2.09270000e+02 -7.09476109e-03  4.28000000e+00  5.33000000e+00
    7.72300000e+01  1.11445000e+03]
  [ 2.11860000e+02  1.23003949e-02  4.27000000e+00  5.32000000e+00
    7.69800000e+01  1.10980000e+03]
  [ 2.11720000e+02 -6.61032179e-04  4.28000000e+00  5.32000000e+00
    7.31400000e+01  1.06353000e+03]
  [ 2.05060000e+02 -3.19620277e-02  4.22500000e+00  5.26500000e+00
    7.11900000e+01  1.06625000e+03]
  [ 2.03370000e+02 -8.27563913e-03  4.23000000e+00  5.26000000e+00
    7.18900000e+01  1.06285000e+03]
  [ 2.06010000e+02  1.28977312e-02  4.24000000e+00  5.26000000e+00
    7.37500000e+01  1.07810000e+03]
  [ 2.05940000e+02 -3.39847072e-04  4.23000000e+00  5.24000000e+00
    7.45200000e+01  1.07210000e+03]]]


In [290]:
first_6_features = subsets_x_training_numpy[:, :, :6]

# Compute mean and std for the first 6 features across samples and timesteps
# this normalizes across each of the first 6 features
mean = np.mean(first_6_features, axis=(0, 1), keepdims=True)
std = np.std(first_6_features, axis=(0, 1), keepdims=True)

# Normalize the first 6 features
data_1_training_x = (first_6_features - mean) / std

In [294]:
print("\nNormalized Data:")
head = data_1_training_x[:1] # results look correct
print(head)


Normalized Data:
[[[-2.02679181 -0.72482915  2.24219359  2.51619232  0.16126194
   -1.26125674]
  [-1.92910201  1.2303837   2.22953566  2.50535096  0.15004614
   -1.28705881]
  [-1.93438254 -0.0762492   2.24219359  2.50535096 -0.02222864
   -1.54380323]
  [-2.18558488 -3.23168185  2.17257499  2.44572348 -0.10971192
   -1.52871041]
  [-2.24932842 -0.84387268  2.17890395  2.4403028  -0.07830767
   -1.54757644]
  [-2.14975271  1.29060077  2.19156188  2.4403028   0.00513793
   -1.46295676]
  [-2.15239298 -0.04387074  2.17890395  2.41862008  0.03968261
   -1.49624975]]]


In [296]:
file_path = os.path.join(folder_path, "data_1_training_x.npy")
np.save(file_path, data_1_training_x)

Make sure to save data_1_testing_x

In [336]:
df_testing = pd.read_excel("./excel_data/testing_data.xlsx") # testing dataframe
x_testing = df_testing.drop(columns = ['date','realized_volatility'])
x_testing = x_testing.drop(x_testing.columns[-8:], axis=1)
x_testing_numpy = x_testing.to_numpy() 
subsets_x_testing = [x_testing_numpy[i:i+window_size] for i in range(len(x_testing_numpy) - window_size)]
subsets_x_testing_numpy = np.array(subsets_x_testing)
print("The shape of subsets_x_testing_numpy is: ", subsets_x_testing_numpy.shape)

The shape of subsets_x_testing_numpy is:  (1116, 7, 6)


In [338]:
# normalize
first_6_features = subsets_x_testing_numpy[:, :, :6]

# Compute mean and std for the first 6 features across samples and timesteps
# this normalizes across each of the first 6 features
mean = np.mean(first_6_features, axis=(0, 1), keepdims=True)
std = np.std(first_6_features, axis=(0, 1), keepdims=True)

# Normalize the first 6 features
data_1_testing_x = (first_6_features - mean) / std

In [340]:
head = data_1_testing_x[:1] # results look correct
print(head)

[[[-1.31056959  0.17027519 -0.43312125 -0.62131363 -0.23435912
   -2.98016009]
  [-1.28844425  0.33212355 -0.45056801 -0.63592424 -0.23959993
   -2.99870023]
  [-1.27833525  0.14554695 -0.44533398 -0.63884636 -0.31440065
   -3.00661226]
  [-1.28348512 -0.08995626 -0.46278073 -0.65345698 -0.48163029
   -2.94691776]
  [-1.32086932 -0.58972175 -0.46714242 -0.65710963 -0.44732678
   -2.9381791 ]
  [-1.31552872  0.07844127 -0.45492969 -0.6468822  -0.42302845
   -2.9712443 ]
  [-1.37904373 -1.00344307 -0.48720619 -0.67391183 -0.4387509
   -2.96858727]]]


In [342]:
# save data_1_testing_x
file_path = os.path.join(folder_path, "data_1_testing_x.npy")
np.save(file_path, data_1_testing_x)
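
All four x arrays above are built with the same windowing steps, so a small helper could remove most of the repetition. The sketch below is one possible version, not part of the original workflow: `make_windows` is a hypothetical name, and the toy frame and its `feature_a` / `feature_b` columns are made up for the demonstration.

```python
import numpy as np
import pandas as pd

def make_windows(df, target_column, window_size, drop_columns=()):
    """Return (x_windows, y): x_windows[i] covers rows i .. i + window_size - 1,
    and y[i] is the target on row i + window_size."""
    x = df.drop(columns=[target_column, *drop_columns]).to_numpy()
    y = df[target_column].to_numpy()
    n_windows = len(x) - window_size
    x_windows = np.array([x[i:i + window_size] for i in range(n_windows)])
    return x_windows, y[window_size:]

# Toy frame with 10 rows and 2 feature columns.
toy_df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10),
    "feature_a": np.arange(10, dtype=float),
    "feature_b": np.arange(10, dtype=float) * 2,
    "realized_volatility": np.arange(10, dtype=float) / 10,
})

x_windows, y = make_windows(
    toy_df, "realized_volatility", window_size=7, drop_columns=["date"]
)
print(x_windows.shape, y.shape)  # (3, 7, 2) (3,)
```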