**Ensemble Notebook**

In this notebook, we ensemble the NN model and the LGBM model.

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
device = 'cpu'

**Load the data**

In [2]:
from google.colab import drive
drive.mount('/content/drive')
df_train = pd.read_csv('/content/drive/MyDrive/Data/df_train.csv')
train = pd.read_csv('/content/drive/MyDrive/Data/train.csv')

Mounted at /content/drive


**Preparing the data**

The LGBM model takes stock_id as a categorical variable, whereas the NN model takes it as an integer and uses an embedding layer to map each stock_id to a vector of dimension emd_dim = 50.
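
As a minimal sketch of the two representations described above, the snippet below builds both inputs from the same ids. The toy ids are illustrative values, not taken from the dataset, and the embedding table here is freshly initialised rather than loaded from the trained model.

```python
import pandas as pd
import torch
import torch.nn as nn

# Toy stock ids (illustrative values, not taken from the dataset)
stock_ids = pd.Series([0, 3, 3, 7])

# LightGBM-style input: the ids stored as a pandas categorical column
stock_cat = stock_ids.astype('category')

# NN-style input: integer ids of shape (batch, 1), as in X_stock_val_NN
ids = torch.from_numpy(stock_ids.to_numpy().reshape(-1, 1)).long()

# Embedding table with one row of size 50 per possible id (0 .. max id)
embd = nn.Embedding(int(stock_ids.max()) + 1, 50)

# Lookup gives (batch, 1, 50); squeeze(1) drops the middle axis
vectors = embd(ids).squeeze(1)

print(stock_cat.dtype)   # category
print(vectors.shape)     # torch.Size([4, 50])
```

Rows with the same stock id (the second and third here) receive the same embedding vector, which is how the network shares what it learns about a stock across time ids.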

In [3]:
time_order = pd.read_csv('time_order.csv')

df_train['stock_id'] = df_train['row_id'].apply(lambda x: int(x.split('-')[0]))
df_train['time_id'] = df_train['row_id'].apply(lambda x: int(x.split('-')[1]))

df_train['target'] = train['target']

df_train = pd.merge(df_train, time_order, on = 'time_id')
df_train.sort_values('time_id_ordered', inplace = True)

In [4]:
df_train.dropna(inplace=True)

t_train = max(df_train['time_id_ordered'])*0.8

X_train = df_train[df_train['time_id_ordered'] <= t_train].drop(columns=['target', 'row_id', 'time_id', 'time_id_ordered', 'stock_id']).to_numpy()

stock_data_train = df_train[df_train['time_id_ordered'] <= t_train]['stock_id'].to_numpy()

X_val_NN = df_train[df_train['time_id_ordered'] > t_train].drop(columns=['target', 'row_id', 'time_id', 'time_id_ordered', 'stock_id']).to_numpy()

stock_data_val_NN = df_train[df_train['time_id_ordered'] > t_train]['stock_id'].to_numpy()

numeric_features = X_train.shape[1]

In [5]:
df_train['stock_id'] = df_train['row_id'].apply(lambda x: int(x.split('-')[0])).astype('category')

X_val_lgb = df_train[df_train['time_id_ordered'] > t_train].drop(columns=['target', 'row_id', 'time_id', 'time_id_ordered'])

y_val = df_train[df_train['time_id_ordered'] > t_train]['target'].to_numpy()

In [6]:
scaler = StandardScaler()

scaler.fit(X_train)

X_scaled_val_NN = scaler.transform(X_val_NN)

X_numeric_val_NN = torch.from_numpy(X_scaled_val_NN ).float().to(device)

X_stock_val_NN = torch.from_numpy(stock_data_val_NN.reshape(-1,1)).long().to(device)

In [7]:
num_of_diff_stocks = max(df_train['stock_id'].unique())+1

class Stock_Volatility(nn.Module):
    def __init__(self, hidden_units, emd_dim, num_features):
        super().__init__()
        # One learnable vector of size emd_dim per stock_id
        self.embd = nn.Embedding(num_of_diff_stocks, emd_dim)
        layers = []
        in_dim = emd_dim + num_features
        self.out = nn.Sigmoid()
        for units in hidden_units:
            layers.append(nn.Linear(in_dim, units))
            layers.append(nn.SELU())
            in_dim = units
        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(in_dim, 1)

    def forward(self, stock_id, numeric):
        # (batch, 1) ids -> (batch, 1, emd_dim) -> (batch, emd_dim)
        x_stock = self.embd(stock_id.long()).squeeze(1)
        x = torch.cat([numeric, x_stock], dim=1)
        x = self.hidden(x)
        return self.out(self.output(x))

emd_dim = 50  #Setting the embd dimension

hidden_units = [256, 256, 128, 128, 64, 64, 32, 32] #Setting the hidden layers

model_NN = Stock_Volatility(hidden_units, emd_dim, numeric_features)

**Load the models**

In [8]:
state_dict = torch.load("/content/NN_model.pth", map_location='cpu')
model_NN.load_state_dict(state_dict)

model_lgb = lgb.Booster(model_file='/content/lgbm_model.txt')

In [13]:
def rmspe(y_true, y_pred):
    ratio = (y_pred - y_true) / y_true
    return np.sqrt(np.mean(ratio**2))

model_NN.eval()  # inference mode
with torch.no_grad():  # no gradients needed for prediction
    y_NN_pred_val = model_NN(X_stock_val_NN, X_numeric_val_NN).numpy()


y_lgb_pred_val = model_lgb.predict(X_val_lgb)

**Compute the RMSPE for each model**

In [16]:
print(f'The NN model error is {rmspe(y_val.reshape(-1,1), y_NN_pred_val)}')

print(f'The LGBM model error is {rmspe(y_val, y_lgb_pred_val)}')

The NN model error is 0.23769501769098272
The LGBM model error is 0.239399752882906


**Finding the ensemble**

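We use a simple convex combination of the two models as our ensemble and search for the alpha that gives the lowest validation error. The final model improves on both individual models on the validation set: its RMSPE of about 0.2326 is roughly 0.005 lower than the NN model's (0.2377) and roughly 0.007 lower than the LGBM model's (0.2394).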
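
Written out, the ensemble and its error measure look as follows. The symbols are introduced here for illustration and follow the `rmspe` function and the search loop in this notebook, where alpha is the weight on the NN prediction and the search runs over a grid starting at 0.1 with step 0.01.

```latex
\hat{y}_{\mathrm{ens}}(\alpha) = \alpha\,\hat{y}_{\mathrm{NN}} + (1-\alpha)\,\hat{y}_{\mathrm{LGBM}}, \qquad \alpha \in [0, 1]

\mathrm{RMSPE}(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{y}_i - y_i}{y_i}\right)^{2}}
```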

In [20]:
alphas = np.arange(0.1,1,0.01)

best_alpha = 1

best_err = np.inf

for alpha in alphas:
  # alpha weights the NN prediction, (1-alpha) the LGBM prediction
  y_en = alpha*y_NN_pred_val + (1-alpha)*y_lgb_pred_val.reshape(-1,1)
  err = rmspe(y_val.reshape(-1,1), y_en)
  if err < best_err:
    best_err = err
    best_alpha = alpha


print(f'The best alpha is {best_alpha}')
print(f'The best error is {best_err}')

The best alpha is 0.5399999999999998
The best error is 0.23255804998479193
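
The grid search above can be cross-checked in closed form. The squared RMSPE of a convex combination is a quadratic function of alpha, so its minimiser can be computed directly and then clipped to [0, 1]. The sketch below is one way to do this; the helper name `best_convex_weight` and the synthetic data are hypothetical and are not part of the original notebook.

```python
import numpy as np

def rmspe(y_true, y_pred):
    ratio = (y_pred - y_true) / y_true
    return np.sqrt(np.mean(ratio ** 2))

def best_convex_weight(y_true, pred_a, pred_b):
    """Weight alpha on pred_a that minimises the RMSPE of
    alpha * pred_a + (1 - alpha) * pred_b, restricted to [0, 1]."""
    r_a = (pred_a - y_true) / y_true   # relative errors of model a
    r_b = (pred_b - y_true) / y_true   # relative errors of model b
    diff = r_a - r_b
    denom = np.mean(diff ** 2)
    if denom == 0:
        return 0.5  # identical predictions: any weight gives the same error
    # Blend error is r_b + alpha * diff; set the derivative of its
    # mean square with respect to alpha to zero
    alpha = -np.mean(r_b * diff) / denom
    return float(np.clip(alpha, 0.0, 1.0))

# Synthetic stand-ins for y_val and the two prediction vectors (1-D arrays)
rng = np.random.default_rng(0)
y_true = rng.uniform(0.001, 0.01, size=1000)
pred_a = y_true * (1 + rng.normal(0, 0.2, size=1000))
pred_b = y_true * (1 + rng.normal(0, 0.2, size=1000))

alpha_star = best_convex_weight(y_true, pred_a, pred_b)
err_star = rmspe(y_true, alpha_star * pred_a + (1 - alpha_star) * pred_b)
print(f'closed-form alpha: {alpha_star}')
print(f'blend RMSPE: {err_star}')
```

With the notebook's arrays, the call would take `y_val`, `y_NN_pred_val.ravel()` and `y_lgb_pred_val`, so that all three inputs are one-dimensional. The result should land close to the grid-search alpha, up to the 0.01 grid resolution.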
