## Training with a Neural Network
This notebook trains a Keras Sequential model on insider-trading data to predict the maximum 90-day percentage gain of a ticker whose insider(s) made a trade. The network uses two dense hidden layers, each followed by a dropout layer, and we use the Keras Tuner to choose the learning rate, dropout rate, and number of hidden units.

In [1]:
%load_ext autoreload
%autoreload 1
%aimport my_functions

import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, confusion_matrix 
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from operator import itemgetter

from my_functions import *

In [2]:
DAYS_TO_LOOK_BACK = 6  # used for calculating volume volatility and recent-trade counts

train_and_cv = my_model_prep.prepareForModel(pd.read_csv('data/training_and_cv_data.csv'))

startDate = min(train_and_cv.FilingDate) + dt.timedelta(days=DAYS_TO_LOOK_BACK)
endDate = max(train_and_cv.FilingDate)
splitDate = startDate + dt.timedelta(days=int(0.9*(endDate-startDate).days))

train_XY, train_X, train_Y = my_model_prep.returnXandY(
    train_and_cv, dt.date.isoformat(startDate), dt.date.isoformat(splitDate), binStarts=[-10, 0, 20]
)

cv_XY, cv_X, cv_Y = my_model_prep.returnXandY(
    train_and_cv, dt.date.isoformat(splitDate+dt.timedelta(days=1)), dt.date.isoformat(endDate)
)
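
The date arithmetic behind the chronological split can be sketched in isolation. This is a minimal illustration using only the standard library; the function name `chronological_split` and the toy filing dates are assumptions made for the example, not part of `my_model_prep`.

```python
import datetime as dt

def chronological_split(dates, look_back_days=6, train_fraction=0.9):
    """Return (start, split, end) dates for a time-ordered train/CV split."""
    # Skip the first few days so look-back features have history to draw on.
    start = min(dates) + dt.timedelta(days=look_back_days)
    end = max(dates)
    # The first train_fraction of the remaining date range goes to training.
    split = start + dt.timedelta(days=int(train_fraction * (end - start).days))
    return start, split, end

# Hypothetical filing dates: one per day over 107 consecutive days.
filings = [dt.date(2021, 1, 1) + dt.timedelta(days=d) for d in range(107)]
start, split, end = chronological_split(filings)
print(start, split, end)
```

Splitting by date rather than at random keeps every cross-validation example later in time than every training example.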

In [3]:
print(f'train shape: {train_X.shape}')
print(f'cv shape: {cv_X.shape}')

train shape: (11082, 17)
cv shape: (1618, 17)


In [4]:
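# Perform standard feature normalization. The scaler is fit on the training
# set only; the cross-validation set reuses the training mean and variance.
scaler = StandardScaler()
train_X_scaled = scaler.fit_transform(train_X)
cv_X_scaled = scaler.transform(cv_X)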
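
A common convention is to estimate the scaling statistics on the training set and reuse them unchanged on held-out data. The sketch below illustrates why on a single toy column, using only the standard library; the helper names and the numbers are assumptions for illustration, not the notebook's data.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return the (mean, std) used to standardize a column."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # guard against constant columns
    return mean, std

def standardize(column, mean, std):
    """Apply previously fitted statistics to a column."""
    return [(x - mean) / std for x in column]

train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
cv_col = [6.0, 7.0]  # held-out values larger than anything seen in training

mean, std = fit_standardizer(train_col)
train_scaled = standardize(train_col, mean, std)
cv_scaled = standardize(cv_col, mean, std)                # training statistics
cv_refit = standardize(cv_col, *fit_standardizer(cv_col))  # refit on CV data

print(cv_scaled)  # stays on the training scale, well above zero
print(cv_refit)   # recentered around zero, no longer comparable to training
```

Refitting on the held-out column recenters it around zero, so the model would see inputs on a different scale than the one it was trained on.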

Here is a StackExchange answer that provides a starting point for deciding on the number of hidden units to include: https://stats.stackexchange.com/a/136542

### Below is a computation to obtain the absolute maximum number of hidden units:

In [5]:
Ns = train_X_scaled.shape[0]  # training examples
No = 1                        # output neurons
Ni = train_X_scaled.shape[1]  # input neurons
alpha = 2                     # scale factor

Nh = Ns / (alpha*(Ni + No))   # maximum hidden neurons

print(f'An upper bound for number of hidden units: {int(Nh)}')

An upper bound for number of hidden units: 307


This seems like *quite* a lot; we certainly don't need this many hidden units! 
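
To see how strongly this bound depends on the scale factor, here is a small sketch that recomputes it for a few values of alpha, using the training shape printed earlier (11082 examples, 17 features). The function name is hypothetical, and the alpha values other than 2 are assumptions chosen for illustration.

```python
def hidden_unit_upper_bound(n_samples, n_inputs, n_outputs=1, alpha=2):
    """Rule-of-thumb upper bound: Ns / (alpha * (Ni + No))."""
    return int(n_samples / (alpha * (n_inputs + n_outputs)))

# Same formula as above, evaluated for several scale factors.
bounds = {
    alpha: hidden_unit_upper_bound(11082, 17, alpha=alpha)
    for alpha in (2, 5, 10)
}
print(bounds)
```

The bound shrinks in proportion to alpha, so the rule of thumb mostly narrows the search range rather than picking a single value.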

### However, we can remove some of the guesswork by using Keras Tuner. 
With this tool, we can search the hyperparameter space and also determine an optimal number of hidden units.

We have 17 inputs, so let's opt for a small network: two Dense hidden layers, each followed by a Dropout layer.

In [14]:
UNDERESTIMATE_BIAS = 10.

In [15]:
def asymm_mse(y_true, y_pred):
    '''This is our custom objective loss function. Squared errors of overestimates
    (y_pred > y_true) are weighted by UNDERESTIMATE_BIAS, so it favors either
    underestimates (bias > 1) or overestimates (0 < bias < 1). Despite the name,
    it returns the square root of the mean weighted squared error.'''
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = tf.cast(y_true, y_pred.dtype)

    sq_err = tf.math.squared_difference(y_pred, y_true)
    bias = tf.cast(UNDERESTIMATE_BIAS, y_pred.dtype)

    # Scale the squared error only where the model overestimates.
    diff = tf.where(y_true < y_pred, bias * sq_err, sq_err)

    loss = tf.math.sqrt(tf.reduce_mean(diff, axis=-1))

    return loss
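
To make the asymmetry concrete, here is a plain-Python sketch of the same weighting applied to a single example. It mirrors the formula above but is not the TensorFlow implementation; the function name `asymmetric_rmse` and the sample values are assumptions for illustration.

```python
import math

def asymmetric_rmse(y_true, y_pred, bias=10.0):
    """Square root of the mean squared error, with overestimates scaled by bias."""
    weighted = [
        (bias if p > t else 1.0) * (p - t) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(weighted) / len(weighted))

# Both predictions miss the true value of 10 by exactly 2.
over = asymmetric_rmse([10.0], [12.0])   # overestimate, weighted by the bias
under = asymmetric_rmse([10.0], [8.0])   # underestimate, unweighted
print(over, under)
```

With a bias of 10, the overestimate costs sqrt(10) times as much as an underestimate of the same size.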

In [18]:
import keras_tuner as kt

tf.random.set_seed(40)

def model_builder(numFeatures):
    def builder(hp):
        # hp is the HyperParameters object that Keras Tuner passes to the builder
        numUnits = hp.Int('units', min_value=4, max_value=32, step=4)
        learningRate = hp.Choice('learningRate', values=[1e-2, 1e-3, 1e-4])
        dropoutRate = hp.Choice('dropoutRate', values=[0., 0.1, 0.2])
        
        model = Sequential(
            [               
                Input(shape=(numFeatures,)),
                Dense(units=numUnits, activation='relu', name='dense_1'),
                Dropout(dropoutRate),
                Dense(units=numUnits//2, activation='relu', name='dense_2'),  # integer division keeps units an int
                Dropout(dropoutRate),
                Dense(units=1, activation='linear', name='dense_out')
            ], name = 'nn_model' 
        )

        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=learningRate),
            loss=asymm_mse,
            metrics=[asymm_mse]
        )

        return model
    return builder
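
For a sense of scale, the search space defined by the builder can be enumerated directly. This sketch only counts the combinations with the standard library and does not involve Keras Tuner; it assumes the upper limit of the units range is inclusive.

```python
from itertools import product

# The same candidate values as in model_builder above.
units = list(range(4, 32 + 1, 4))      # 4, 8, ..., 32
learning_rates = [1e-2, 1e-3, 1e-4]
dropout_rates = [0.0, 0.1, 0.2]

grid = list(product(units, learning_rates, dropout_rates))
print(f'{len(units)} x {len(learning_rates)} x {len(dropout_rates)} = {len(grid)} combinations')
```

The space is small enough that an exhaustive grid would also be feasible; a tuner mainly saves epochs by abandoning weak configurations early.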

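Hyperband is a search algorithm that trains many hyperparameter configurations for a few epochs each, discards the worst performers, and spends the remaining epoch budget on the most promising ones. The quantity we want to minimize is the custom asymmetric loss. Note that the objective name asymm_mse refers to the metric computed on the training set; val_asymm_mse would track the cross-validation set instead.

We use EarlyStopping to halt training early if the validation loss (the callback's default monitored quantity) has not improved in the 10 most recent epochs.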
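
The patience rule can be sketched without Keras. This is a simplified model of the idea, not the callback's actual implementation; the function name and the loss history are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical history: improves for three epochs, then plateaus.
history = [5.0, 4.0, 3.0] + [3.5] * 12
print(early_stopping_epoch(history, patience=10))
```

Here the best loss occurs at epoch 2, so training stops ten epochs later; a history that keeps improving never triggers the stop.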

In [19]:
tuner = kt.Hyperband(
    model_builder(train_X_scaled.shape[1]),
    objective=kt.Objective('asymm_mse', 'min'),
    max_epochs=100,
    overwrite=True,
    directory='tuner logs',
    project_name=f'asymm_mse({int(UNDERESTIMATE_BIAS)})'
)

tuner.search(
    train_X_scaled, train_Y, 
    epochs=100,
    validation_data=(cv_X_scaled, cv_Y),
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)]
)

best_hparams = tuner.get_best_hyperparameters()[0]

Trial 60 Complete [00h 00m 02s]
asymm_mse: 15.461211204528809

Best asymm_mse So Far: 14.532682418823242
Total elapsed time: 00h 01m 51s
INFO:tensorflow:Oracle triggered exit


In [20]:
print(f'These are the best hyperparameter values: \n {best_hparams.values}')
nn_model = tuner.hypermodel.build(best_hparams)

These are the best hyperparameter values: 
 {'units': 16, 'learningRate': 0.01, 'dropoutRate': 0.0, 'tuner/epochs': 2, 'tuner/initial_epoch': 0, 'tuner/bracket': 4, 'tuner/round': 0}


We aren't surprised to see that the loss is minimized for UNDERESTIMATE_BIAS=1, i.e. a symmetric squared-error loss, because any bias above 1 directly increases the loss accumulated from each overestimate. I initially thought about only allowing values greater than 1, but as will be seen below, we have another, more pressing problem!

In [22]:
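train_Y_preds = nn_model.predict(train_X_scaled)
cv_Y_preds = nn_model.predict(cv_X_scaled)

# predict() returns shape (n, 1); flatten predictions and targets to 1-D so
# they align element-wise and can be used as DataFrame columns.
cv_preds_flat = np.ravel(cv_Y_preds)
cv_vals_flat = np.ravel(cv_Y)

cv_loss = asymm_mse(cv_vals_flat.astype('float32'), cv_preds_flat)
print(f'\ncv MSE: {cv_loss}\n')

dfcv = pd.DataFrame(data={'cvPreds': cv_preds_flat, 'cvVals': cv_vals_flat})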

Note that this cv MSE is ~100 points higher than that produced by the XGBoost model.

In [None]:
n = 5
nlargest = dfcv.nlargest(n, ['cvPreds'])
nsmallest = dfcv.nsmallest(n, ['cvPreds'])
print(f'The {n} largest predictions and their values: \n{nlargest}\n')
print(f'The {n} smallest predictions and their values: \n{nsmallest}')
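
Beyond inspecting the extremes, one compact diagnostic is to compare the spread of the predictions with the spread of the actual values. The sketch below uses only the standard library; the function name and the toy numbers are assumptions for illustration and are not taken from `dfcv`.

```python
from statistics import pstdev

def spread_ratio(preds, targets):
    """Ratio of prediction spread to target spread (population std dev)."""
    target_spread = pstdev(targets)
    if target_spread == 0:
        raise ValueError('targets have zero spread')
    return pstdev(preds) / target_spread

# Hypothetical values: targets vary widely, predictions barely move.
toy_targets = [-5.0, 0.0, 10.0, 40.0, 80.0]
toy_preds = [0.5, 1.0, 1.5, 1.0, 0.5]

ratio = spread_ratio(toy_preds, toy_targets)
print(f'prediction spread is {ratio:.1%} of target spread')
```

A ratio close to zero suggests the model is outputting nearly the same value for every example, whatever the inputs.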

### Here, we've reached a bit of a standstill in the modeling process.
Recall from above that these predictions are generated with UNDERESTIMATE_BIAS=1, so there is no bias towards underestimates in our objective function! Yet all but 3 cross-validation estimates are well under 2%, and all but 3 lie in the interval \[-3, 2\]%.

This is puzzling, and I need to think more about why this would be. One thing to check first: tuner.hypermodel.build(best_hparams) returns a freshly initialized model, and this notebook never calls fit on nn_model before predicting, so the predictions may simply come from untrained weights. As they stand, these predictions are of little use: an investment strategy built on them probably wouldn't lose money, but it likely wouldn't gain much either, especially relative to the S&P500 gaining 8% in this time period!

However, the XGBoost model seems to be working better, so let's implement a strategy with that model in strategy_simulation.