# Predicting ETF Volume and Volatility with News Data

In this notebook, as in the earlier notebook that used ticker data, we try to predict whether each ETF in our selected universe will have higher-than-normal volume or volatility on a given day. We have sourced nearly 600,000 financial news articles in JSON format and converted them into a pandas DataFrame in which each article's title and body are concatenated. On any given day there are usually many news articles. For now, we treat each article as a single training point and predict the daily in_play target variables. As before, this is a multilabel classification problem.

In [1]:
import os
import numpy as np
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import alpaca_trade_api as tradeapi
import datetime
import seaborn as sns
from gensim.models import Word2Vec
from nltk import word_tokenize
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, LSTM, CuDNNLSTM, Embedding
from keras.layers import Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence
from keras.optimizers import SGD
np.random.seed(0)

api = tradeapi.REST(
    base_url=os.environ['APCA_API_BASE_URL'],
    key_id=os.environ['APCA_API_KEY_ID'],
    secret_key=os.environ['APCA_API_SECRET_KEY']
)

Using TensorFlow backend.


Recall that we have already cleaned and joined our data and have fit our tokenizer. Let's read these into memory.

In [2]:
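# Get Text and Target Variables
# Uncomment to load the raw text; X_load is only needed by the cells that
# preview the text or rebuild tokenized_text.csv.
#X_load = pd.read_csv('text_data.csv', index_col=0).astype(str)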
y = pd.read_csv('data/vv_target.csv', index_col=0)

In [7]:
# loading tokenizer
with open('models/tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

In [8]:
X_load.head()

Unnamed: 0,combined_text
2017-12-08,Mexican official disputes reports of tainted a...
2017-12-08,Saudi prince has history of extravagant impuls...
2017-12-11,Risks From The WTO’s New Power Vacuum WASHIN...
2017-12-14,Winners and Losers of the GOP Tax Bill Christm...
2017-12-15,WSJ. Magazine’s 10 Most-Read Stories of the ...


In [3]:
y.head()

Unnamed: 0_level_0,DBC,EEM,EWJ,FXI,GDX,GLD,QQQ,SPY,TLT,USO,VTI,VXX,XHB,XLF,XRT,XSW
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2013-12-02 00:00:00-05:00,True,True,False,True,True,True,False,False,False,False,False,False,True,False,False,True
2013-12-03 00:00:00-05:00,False,True,False,True,False,True,False,False,False,False,False,False,False,False,False,False
2013-12-04 00:00:00-05:00,False,True,True,True,True,True,False,True,False,False,True,False,True,True,False,False
2013-12-05 00:00:00-05:00,False,False,True,False,True,True,False,False,False,False,False,False,False,False,False,False
2013-12-06 00:00:00-05:00,False,True,False,True,False,True,False,False,False,False,False,False,False,False,False,True


We can apply our pre-fitted tokenizer to convert the text into integer-encoded sequences. This both vectorizes our data and consumes less memory. We keep only the 20,000 most frequent words in the vocabulary and cap each article at 100 tokens. Many articles are shorter than that, so we pad them with zeros at the front to keep the length at a consistent 100. Note that `pad_sequences` truncates from the front by default (`truncating='pre'`), so for longer articles the call below keeps the last 100 tokens rather than the first.

In [4]:
# Get tokenized headlines
#list_tokenized_headlines = tokenizer.texts_to_sequences(X_load['combined_text'])
# Fit on data - max sequences at 100 words and pad with zeros if too short
#X = sequence.pad_sequences(list_tokenized_headlines, maxlen=100)
X = pd.read_csv('tokenized_text.csv', index_col=0)
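
To make the padding and truncation behavior concrete, here is a small pure-Python sketch that mimics what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings. The helper `pad_pre` is hypothetical and written for illustration only; the notebook itself relies on the Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the Keras pad_sequences defaults (illustrative helper, maxlen > 0).

    Sequences longer than maxlen lose tokens from the front;
    shorter sequences are left-padded with `value`.
    """
    seq = list(seq)[-maxlen:]                     # keep the last `maxlen` tokens
    return [value] * (maxlen - len(seq)) + seq    # pad at the front

short_article = [5, 8, 13]
long_article = list(range(1, 11))   # token ids 1..10

print(pad_pre(short_article, 5))    # [0, 0, 5, 8, 13]
print(pad_pre(long_article, 5))     # [6, 7, 8, 9, 10]
```

If the first 100 tokens of each article are wanted instead, `pad_sequences` accepts `truncating='post'`.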

In [5]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
2017-12-08 00:00:00-05:00,1,2229,80,10,83,44,4760,40,112,1,...,2,6370,213,80,458,1406,35,179,483,8
2017-12-08 00:00:00-05:00,433,3,3229,1,4089,62,4,1,5303,1,...,65,552,469,672,5669,165,1587,93,663,762
2017-12-11 00:00:00-05:00,0,0,0,0,0,0,0,0,0,0,...,8,3047,14,1,6545,522,1619,2232,5,3367
2017-12-14 00:00:00-05:00,0,0,0,0,0,0,0,0,0,0,...,1420,3797,3,1420,2659,13,1,7487,236,1647
2017-12-15 00:00:00-05:00,0,0,0,0,0,0,0,0,0,0,...,1961,13,4,538,1,242,1211,2644,1447,5



In [7]:
y = X.join(y, how='left')[y.columns]

In [8]:
y.head()

Unnamed: 0,DBC,EEM,EWJ,FXI,GDX,GLD,QQQ,SPY,TLT,USO,VTI,VXX,XHB,XLF,XRT,XSW
2017-12-08 00:00:00-05:00,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2017-12-08 00:00:00-05:00,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2017-12-11 00:00:00-05:00,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2017-12-14 00:00:00-05:00,False,False,False,False,False,False,False,True,True,False,True,False,False,True,True,True
2017-12-15 00:00:00-05:00,False,False,False,True,False,False,True,True,True,False,True,False,False,True,False,False
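
The join above relies on the article index containing repeated dates while the target index has one row per day. The sketch below reproduces that behavior on toy data; the frames `articles` and `daily` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for X (one row per article) and y (one row per day).
articles = pd.DataFrame(
    {"tok_0": [11, 12, 13]},
    index=["2017-12-08", "2017-12-08", "2017-12-11"],
)
daily = pd.DataFrame(
    {"SPY": [True, False], "QQQ": [False, False]},
    index=["2017-12-08", "2017-12-11"],
)

# Left join on the index: every article picks up the labels of its date.
labels = articles.join(daily, how="left")[daily.columns]
print(labels)

# Any NaN here would mean an article date has no matching row in the targets.
print(labels.isna().any().any())
```

Because this is a left join, an article whose date is missing from the target file would end up with missing labels, so checking for NaN values before training is a cheap safeguard.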


We will save these tokenized sequences with their DatetimeIndex for later use. Ultimately, our goal is to use both news and chart data to make our predictions. However, given the size of our data and the fact that the two datasets have heterogeneous lengths, we cannot concatenate the models. Instead, we will model each separately and use stacking to combine the predictions in a final notebook.

In [12]:
tokenized_df = pd.DataFrame(X, index=pd.to_datetime(X_load.index).tz_localize('US/Eastern'))

In [15]:
tokenized_df.to_csv('tokenized_text.csv')

In [13]:
tokenized_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
2017-12-08 00:00:00-05:00,1,2229,80,10,83,44,4760,40,112,1,...,2,6370,213,80,458,1406,35,179,483,8
2017-12-08 00:00:00-05:00,433,3,3229,1,4089,62,4,1,5303,1,...,65,552,469,672,5669,165,1587,93,663,762
2017-12-11 00:00:00-05:00,0,0,0,0,0,0,0,0,0,0,...,8,3047,14,1,6545,522,1619,2232,5,3367
2017-12-14 00:00:00-05:00,0,0,0,0,0,0,0,0,0,0,...,1420,3797,3,1420,2659,13,1,7487,236,1647
2017-12-15 00:00:00-05:00,0,0,0,0,0,0,0,0,0,0,...,1961,13,4,538,1,242,1211,2644,1447,5


In [9]:
embedding_size = 200
input_ = Input(shape=(X.shape[1],))
x = Embedding(20000, embedding_size)(input_)
x = CuDNNLSTM(40, return_sequences=True)(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.5)(x)
x = Dense(200, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(100, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(y.shape[1], activation='sigmoid')(x)
model = Model(inputs=input_, outputs=x)
# Multilabel setup: one sigmoid output per ETF, trained with binary cross-entropy
model.compile(loss='binary_crossentropy',
              optimizer='adam', 
              metrics=['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
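
As a rough illustration of why a sigmoid output layer with binary cross-entropy suits a multilabel problem, the sketch below scores three labels independently for a single sample. The logits and labels are hypothetical, and the functions are plain-Python stand-ins rather than the Keras implementations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob):
    """Mean of the per-label log losses for one sample."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.0]              # hypothetical pre-activation outputs for 3 ETFs
probs = [sigmoid(z) for z in logits]   # each probability is independent of the others
y_true = [1, 0, 1]                     # two ETFs can be in play on the same day

print([round(p, 3) for p in probs])
print(round(binary_crossentropy(y_true, probs), 4))
```

Unlike a softmax, the sigmoid probabilities do not have to sum to one, so several ETFs can be flagged at once.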


In [10]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 200)          4000000   
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 100, 40)           38720     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 40)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 40)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 200)               8200      
_________________________________________________________________
dropout_2 (Dropout)          (None, 200)               0         
__________
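
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch based on the layer sizes defined above; the `38720` figure is consistent with an LSTM layer that carries two bias vectors per gate.

```python
vocab_size, embedding_size, lstm_units = 20000, 200, 40

# One embedding vector per vocabulary entry.
embedding_params = vocab_size * embedding_size

# Four gates, each with input weights, recurrent weights and two bias vectors.
lstm_params = 4 * (embedding_size * lstm_units
                   + lstm_units * lstm_units
                   + 2 * lstm_units)

# First dense layer: weights from the 40 pooled features plus 200 biases.
dense_1_params = lstm_units * 200 + 200

print(embedding_params, lstm_params, dense_1_params)   # 4000000 38720 8200
```

The embedding layer accounts for the vast majority of the trainable parameters.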

In [11]:
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.1)

Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 531590 samples, validate on 59066 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1cd1f8242b0>

In [12]:
model.evaluate(X, y)



[0.5314331532464357, 0.7189882943710044]
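
To put the accuracy figure in context, it helps to compare it with a trivial baseline. With `binary_crossentropy`, the Keras `accuracy` metric is computed element-wise across all label columns, so a model that always predicts `False` scores one minus the share of `True` cells. The sketch below shows the calculation on a made-up label matrix.

```python
import numpy as np

# Toy label matrix: rows are articles, columns are ETFs (hypothetical values).
y_toy = np.array([
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=bool)

positive_rate = y_toy.mean()              # share of True cells
baseline_accuracy = 1.0 - positive_rate   # accuracy of always predicting False

print(positive_rate, baseline_accuracy)
```

On the real data, and assuming `y` contains no missing values, `1 - y.values.mean()` gives the number to compare against the accuracy reported above.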

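Overall, the model reaches a loss of about 0.53 and an accuracy of about 0.72, which looks solid at first glance. However, these numbers do not account for lookahead bias: `model.evaluate(X, y)` scores the same articles the model was fit on, and the 10% validation split is not guaranteed to be strictly later in time than the training data, so the model may be trained on articles published after the dates it is evaluated on. In our combined model phase, we will test only on the most recent data to get a realistic sense of how the model would perform in real time.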
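
A minimal sketch of a date-based holdout is shown below, assuming the articles carry a datetime index as in `tokenized_text.csv`. The toy frame, the 80/20 cutoff and the variable names are illustrative assumptions, not the split used in the final notebook.

```python
import pandas as pd

# Toy frame with a date index standing in for the tokenized articles.
dates = pd.to_datetime(
    ["2017-12-15", "2017-12-08", "2017-12-11", "2017-12-14", "2017-12-08"]
)
X_toy = pd.DataFrame({"tok_0": [5, 1, 3, 4, 2]}, index=dates)

X_sorted = X_toy.sort_index()                        # oldest articles first
cutoff = X_sorted.index[int(len(X_sorted) * 0.8)]    # hypothetical 80/20 split date

# Split on the date itself so that no single day straddles both sets.
train = X_sorted[X_sorted.index < cutoff]
test = X_sorted[X_sorted.index >= cutoff]

print(train.index.max(), test.index.min())
```

Keras's `validation_split` takes the last fraction of the rows before shuffling, so it only behaves like this chronological holdout when the rows are already sorted by date.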