# Stock Trading Strategy Based on Neural Net Time Series Analysis

For background, please see our previous notebook: leading_pattern/notebook.ipynb.

The basic idea is to predict the price of a target stock from historical values (price, volume, etc.) of a set of predictor stocks: an event happens, ripples through the predictors, and then reaches the target. In our previous strategy we found this difficult, maybe even impossible. Maybe we will have better luck with neural networks (NN).

One possible benefit of NNs is that they make it easy to combine predictors. Previously we looked for individual stocks as predictors, with the thought that we would later find a way to combine them. That method can miss two or more predictors that individually fail to exceed a threshold but together would exceed it. A NN can weigh all of its inputs jointly, so it could find such a combination.

## Unbalanced Classes
In the previous analysis, we looked for spikes in the target that lasted 2 days and exceeded a 3% gain. That approach gave us about 10-20 events/year for an average stock. Using a NN classifier on data like that would fail because the algorithm would be right more than 90% of the time just by predicting "no price increase" every day. To get around that issue, we tried using a regression model to predict the stock price for every day.

### Regression Model
In a sense the regression model worked really well, which is not all that surprising given that the past prices of the target stock were used for prediction and the prices of a stock do not change very much day-to-day. However, as a stock buying strategy the results were almost useless because the model was usually a day late.

### Classification Model
Back to a classification NN. One way to balance the classes is to just decrease the price threshold for getting into the positive class. The problem with that approach is it is not consistent with our hypothesis. It is likely that small price increases are not related to any predictor. Adding them as positive samples is just adding noise.

Furthermore, when we were making synthetic data to test the model, we discovered that our model is not designed to handle price increases that happen over multiple days. To accommodate this, we changed our label maker. A day is labeled positive only when all of the following conditions hold:

    1. (price[today] - price[yesterday]) / price[yesterday]  < threshold
    2. (price[tomorrow] - price[today]) / price[today]  >= threshold
    3. today is at least a convolution kernel width after the last label
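
The labeling rule can be sketched in a few lines of NumPy. This is only an illustration, not the code behind `nn2.make_labels`: the function name is hypothetical, and it assumes that a positive label marks the day before the first gain that reaches the threshold (so the second condition is read as "tomorrow's gain is at or above the threshold").

```python
import numpy as np


def make_labels_sketch(prices, threshold, conv_width):
    """Hypothetical label maker: 1 marks the day before a first above-threshold gain."""
    prices = np.asarray(prices, dtype=float)
    # gains[i] is the fractional change from day i to day i + 1
    gains = np.diff(prices) / prices[:-1]
    labels = np.zeros(len(prices), dtype=int)
    last_positive = -conv_width  # lets the first eligible day be labeled
    for today in range(1, len(prices) - 1):
        quiet_today = gains[today - 1] < threshold    # condition 1
        jump_tomorrow = gains[today] >= threshold     # condition 2 (assumed direction)
        spaced = today - last_positive >= conv_width  # condition 3
        if quiet_today and jump_tomorrow and spaced:
            labels[today] = 1
            last_positive = today
    return labels


prices = [100, 100, 105, 110, 110, 110, 116, 116]
labels = make_labels_sketch(prices, threshold=0.03, conv_width=3)
print(labels)  # [0 1 0 0 0 1 0 0]
```

The second day of the two-day rise (105 to 110) is not labeled, because the gain leading into it already exceeded the threshold.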

Of course, this does not resolve the unbalanced classes. There are multiple techniques for dealing with unbalanced classes. Throwing away data seems wrong, so whenever possible we prefer to synthesize data to balance the classes. In this case it is not clear how to do that, so we will solve the problem by under-sampling the majority class.
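
As a rough illustration of under-sampling, the sketch below keeps every minority-class sample and a random subset of the majority class of the same size. The function name and the fixed seed are assumptions made for the example; the notebook's own generator may balance the classes differently.

```python
import numpy as np


def undersample_majority(labels, seed=0):
    """Return sorted sample indices with equal counts of class 0 and class 1."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    # Whichever class is smaller is kept whole; the other is sampled down.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.choice(majority, size=len(minority), replace=False)
    return np.sort(np.concatenate([minority, kept]))


example_labels = np.array([0] * 90 + [1] * 10)
balanced_idx = undersample_majority(example_labels)
print(len(balanced_idx), example_labels[balanced_idx].mean())  # 20 0.5
```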

## Train, Validate and Test
This data is not stationary, so we should intermix the train and validation samples. This is somewhat tricky because each sample extends over a few time points and we do not want the train and validation samples to contain any of the same time points.

In production, we will only be predicting the next day and we will be doing that sequentially. So the test data will come from time points after the train and validate data.
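
One simple way to meet both requirements is sketched below. It is an assumption-laden illustration, not the logic of `IntermixedWindowGenerator`: the function name, `trainval_frac`, and `val_every` are hypothetical. Train and validation windows are laid out back to back (step equal to the window width) so they cannot share a time point, every third window goes to validation, and test windows start only after the last train/validation time point.

```python
def split_window_starts(n_points, width, trainval_frac=0.75, val_every=3):
    """Return start indices of the train, validation and test windows."""
    n_trainval = int(n_points * trainval_frac)
    train, val = [], []
    # Stepping by `width` means no two train/validation windows overlap.
    for j, start in enumerate(range(0, n_trainval - width + 1, width)):
        if j % val_every == val_every - 1:
            val.append(start)
        else:
            train.append(start)
    # Test windows slide one day at a time, after all train/validation data.
    test = list(range(n_trainval, n_points - width + 1))
    return train, val, test


train, val, test = split_window_starts(n_points=40, width=3)
print(train)  # [0, 3, 9, 12, 18, 21, 27]
print(val)    # [6, 15, 24]
print(test)   # [30, 31, 32, 33, 34, 35, 36, 37]
```

The cost of this layout is that it uses far fewer windows than a sliding scheme would; a block-based split is one alternative that keeps more samples.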

## Needles in a Haystack
There are 5000 stocks in the NASDAQ. The number of possible combinations of predictors for each stock is enormous. The ideal model would allow us to test many predictors at once. This is similar to image classification, where there are an enormous number of pixels. Convolution models work well for this. We will give it a try in our domain.

## 1D Convolution Model
Our hypothesis is that in the few days before the target stock price goes up, there will be a signal in the predictors. Thus each sample is a $\text{conv\_window} \times \text{n\_predictors}$ tensor. Using these samples, each 1D convolution filter has $\text{conv\_window} \times \text{n\_predictors}$ weights. Every predictor has its own kernel for every filter.

After the convolution layer, there needs to be at least one dense layer with the number of inputs equal to the number of filters. If we let:


p = number predictors

f = number 1D convolution filters

k = number time points per sample


number of weights = $p*f*k + f \approx p * f * k$

That approach seems to give too many degrees of freedom, which would exacerbate over-fitting and increase the training time. The alternative we chose was to reduce the number of weights by forcing each convolution filter to have only conv_window weights.

Our intuition is that we do not need a unique kernel for each predictor for each filter. Rather a small number of kernels, maybe 10, would work for all.

To accomplish this, we changed the shape of each sample by stacking the conv_window of points for each predictor. Now each 1D convolution filter has only one kernel. To prevent mixing of data between predictors in this stacked configuration, we set the convolution stride to conv_window. The number of outputs from each filter is now equal to the number of predictors. And the dense layer now has f * p inputs. The number of weights is:

number of weights = $k*f + f*p \approx p * f$

We are using k=3, so when p is much larger than k this modified 1D convolution has about a factor of 3 fewer weights.
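
To make the two weight counts concrete, the short calculation below evaluates both formulas as written above (biases other than the ones already in the formulas are ignored). The function names are made up for this example; the values p=10, f=12 and k=3 are the ones used later in this notebook.

```python
def weights_per_predictor_kernels(p, f, k):
    """Standard layout: p*f*k convolution weights plus f dense weights."""
    return p * f * k + f


def weights_shared_kernels(p, f, k):
    """Stacked layout: k*f convolution weights plus f*p dense weights."""
    return k * f + f * p


for p in (10, 500):
    full = weights_per_predictor_kernels(p, f=12, k=3)
    shared = weights_shared_kernels(p, f=12, k=3)
    print(p, full, shared, round(full / shared, 2))
# 10 372 156 2.38
# 500 18012 6036 2.98
```

With only 10 predictors the saving is a factor of about 2.4; it approaches k=3 as the number of predictors grows.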

In [1]:
%matplotlib widget

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf

import nn.nn as nn
import nn.nn2 as nn2
from my_utils.volatility import load_volatility
from window_generators import IntermixedWindowGenerator

plt.rcParams['figure.figsize'] = [10, 5]

config = {
    'start_date': '20170103',
    'end_date': '20181231',
    'price_field': 'Open',
    'predictor_field': 'Open'
}
splits = [0.5, 0.75]
CONV_WIDTH = 3

# Load the most volatile stocks as predictors
results_dir = nn2.get_results_dir(config)

volatility = load_volatility(results_dir, config)
target = 'PLAG'
predictors = [x[0] for x in volatility[0:10]]
dataframe = nn.load_data(target, predictors, config)
n_predictors = dataframe.shape[1]
n_filters = max(10, n_predictors + 2)
print('n predictors: {}'.format(n_predictors))
print('n convolution filters: {}'.format(n_filters))
print(predictors + [target])

n predictors: 10
n convolution filters: 12
['CUEN', 'VERB', 'VRME', 'NMTR', 'AEYE', 'PLAG', 'BLNK', 'ANY', 'ELOX', 'CREX', 'PLAG']


The next step is to label the data. We are looking for price increases that are above some threshold. We don't want to set the threshold too high because that will give too few positive samples and make fitting the data difficult. If we set the threshold too low, it is possible that many of the positive samples will be just random noise and not related to the predictors.

Additionally, our model is not designed to handle multiple days of successive price increases. So we need to limit the samples to the first day the price increase was above threshold, and not allow the next positive label until at least one conv_window width after the last positive.

In [2]:
# For this target we could not get more than 9% of the samples to be in the + class
labels, threshold, frac_pos = nn2.make_labels(dataframe.target, CONV_WIDTH, frac_positive=0.09)
print('Percent positive: {}'.format(100 * frac_pos))

Percent positive: 9.362549800796813


The predictor data varies wildly from stock to stock and over time. Our hypothesis is that we are looking for localized ripples in the predictor. This suggests that the data can be normalized by taking the difference between successive time points and dividing by the value at one of those time points.

In [3]:
norm_df = nn2.diff_norm(dataframe)
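
The normalization described above can be illustrated with NumPy. This is a sketch under one assumption: the difference is divided by the earlier of the two time points, which the text leaves open. The function name is hypothetical and is not the implementation of `nn2.diff_norm`.

```python
import numpy as np


def diff_norm_sketch(values):
    """Fractional change between successive rows; one column per stock."""
    values = np.asarray(values, dtype=float)
    # Assumption: normalize by the earlier time point of each pair.
    return np.diff(values, axis=0) / values[:-1]


raw = np.array([[10.0, 200.0],
                [11.0, 220.0],
                [9.9, 198.0]])
print(diff_norm_sketch(raw))
# Both columns become roughly [0.1, -0.1] despite the 20x difference in price level.
```

The result has one row fewer than the input, so the labels need to be aligned to the later day of each pair.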

Next we need to make the data samples. As mentioned above, we intermixed the train and validation samples. To balance the data, we under-sampled the majority class. 

In [4]:
conv_window = IntermixedWindowGenerator(norm_df, labels, splits, CONV_WIDTH, balanced=True)
model = nn2.limited_filters_conv_model(n_filters, CONV_WIDTH, n_predictors)
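
To show what the stacked layout with a stride of conv_window computes, here is a NumPy sketch of the convolution step only (no bias, no activation, no dense layer). It is an illustration under assumptions, not the code of `nn2.limited_filters_conv_model`; the function name and array shapes are chosen for the example.

```python
import numpy as np


def stacked_conv1d(sample, kernels):
    """sample: (k, p) window of k time points for p predictors.
    kernels: (f, k), one shared kernel per filter.
    Returns (p, f): every filter applied to every predictor separately."""
    k, p = sample.shape
    # Stack each predictor's k points end to end: a vector of length p * k.
    stacked = sample.T.reshape(-1)
    out = np.empty((p, kernels.shape[0]))
    for j in range(p):
        # A stride of k moves from one predictor's block to the next,
        # so a kernel never straddles two predictors.
        segment = stacked[j * k:(j + 1) * k]
        out[j] = kernels @ segment
    return out


rng = np.random.default_rng(0)
sample = rng.normal(size=(3, 10))    # k=3 time points, p=10 predictors
kernels = rng.normal(size=(12, 3))   # f=12 filters of width k=3
features = stacked_conv1d(sample, kernels)
print(features.shape)  # (10, 12)
```

The f * p outputs are what the dense layer receives, and the loop is equivalent to the matrix product `sample.T @ kernels.T`.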

It's not obvious what loss function we should use. It is likely that not every price gain in the target is related to a predictor. So we expect a non-zero rate of false negatives, maybe as high as 50%. At the same time, false positives will either tie up our money on buys with meager gains or, worse, lose money. The way to avoid false positives is to set the threshold high. This will also increase false negatives. But what is the right rate of false negatives? And if we set the threshold too high, we will not make any buys. We might as well skip all this and just put our money in a stock index fund.

For now, we will go with binary crossentropy as the loss function.
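
For reference, binary cross-entropy on labels $y$ and predicted probabilities $\hat{p}$ is $-\text{mean}\left(y \log \hat{p} + (1-y) \log(1-\hat{p})\right)$. The NumPy sketch below is a plain re-implementation for illustration (the clipping constant `eps` is an assumption for numerical safety), not the loss object used during training.

```python
import numpy as np


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for predictions of exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


y = [1, 0, 1, 0]
uninformed = binary_crossentropy(y, [0.5, 0.5, 0.5, 0.5])
confident = binary_crossentropy(y, [0.9, 0.1, 0.9, 0.1])
print(round(uninformed, 4), round(confident, 4))  # 0.6931 0.1054
```

On balanced data, a model that has learned nothing and always outputs 0.5 sits at a loss of ln 2, about 0.693, which is a useful baseline when reading the loss curves.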

In [None]:
history = nn.compile_and_fit_classifier(model, conv_window, patience=2, max_epochs=30, verbose=1)

Epoch 1/30


The ROC AUC for training is much higher than for validation, indicating over-fitting.

This is confirmed by the plot of the loss functions:

In [None]:
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.legend()
plt.title('Loss')

Even with over-fitting, the validation ROC AUC is near enough to 0.5 to indicate that these predictors are not predictors for this target. This brings us back to the need to efficiently search the space of predictors.
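
As a reminder of why 0.5 means "no signal": the ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. It is a small illustrative helper with a hypothetical name, not the metric implementation used during training.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0, perfect ranking
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5, no information
```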

## Efficiently Looking for Predictors
We can search the predictor space faster if we can test lots of predictors at once. How many should that be?

To answer that question, we wrote some code that modifies a predictor by adding a kernel to it immediately before each positive class in the labels. When we ran the modified predictor along with the target through our model, we got a perfect fit of both the training and validation data.

The perfect fit continued when we added 9 other predictors that did not predict this target. With 99 others, we got an AUC in the high 0.90s. With 499 others, the AUC was around 0.8. So we could probably search in blocks of 500.
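
The synthetic-signal experiment can be sketched as follows. The original code is not shown in this notebook, so everything here is an assumption for illustration: the function name, the kernel values, and the choice to place the kernel on the `len(kernel)` days that end the day before each positive label.

```python
import numpy as np


def inject_kernel(predictor, labels, kernel):
    """Return a copy of `predictor` with `kernel` added just before each positive label."""
    signal = np.array(predictor, dtype=float)  # copy, leave the input untouched
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    for day in np.flatnonzero(np.asarray(labels)):
        # Skip positives too close to the start to fit a whole kernel.
        if day >= k:
            signal[day - k:day] += kernel
    return signal


flat = np.zeros(10)
toy_labels = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0])
planted = inject_kernel(flat, toy_labels, kernel=[1.0, 2.0, 3.0])
print(planted)  # [0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]
```

Mixing one such planted predictor with a growing number of unrelated ones gives a rough measure of how many candidates the model can screen at once.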

## Running on Test Data
At this point there is no need to run the model on the test data because we have not found a set of predictors that work on validation.

When we find a set of predictors that work, we will probably update the model's weights every week. Even though we have a year of test data, we will only extrapolate over a short period of time.
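
A weekly update could follow a walk-forward schedule like the one sketched below. This is only an outline of the idea: the function name is hypothetical, and a "week" is assumed to be 5 trading days. Each entry says "refit on all data before `fit_end`, then predict the listed days".

```python
def weekly_refit_schedule(first_test_day, n_points, refit_every=5):
    """Return (fit_end, days_to_predict) pairs covering the test period."""
    schedule = []
    for fit_end in range(first_test_day, n_points, refit_every):
        stop = min(fit_end + refit_every, n_points)
        # The model fitted on days [0, fit_end) predicts only the next few days.
        schedule.append((fit_end, list(range(fit_end, stop))))
    return schedule


for fit_end, days in weekly_refit_schedule(first_test_day=10, n_points=22):
    print(fit_end, days)
# 10 [10, 11, 12, 13, 14]
# 15 [15, 16, 17, 18, 19]
# 20 [20, 21]
```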
