# Preparing News Data for Modeling

To use our news data, each article must be aligned with a trading day for which financial data exists. Articles published on a weekend are therefore matched to Monday's financial data, and articles published after trading hours are rolled forward to the next day. Finally, we tokenize and clean the news text with Keras preprocessing utilities: we combine each article's title and text, fit a tokenizer capped at the 20,000 most frequent words in the combined vocabulary, and save both the tokenizer and the combined text data.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import alpaca_trade_api as tradeapi
import datetime
import seaborn as sns
from gensim.models import Word2Vec
from nltk import word_tokenize
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, LSTM, Embedding
from keras.layers import Dropout, Activation, Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.preprocessing import text, sequence
from keras.optimizers import SGD
np.random.seed(0)

api = tradeapi.REST(
    base_url=os.environ['APCA_API_BASE_URL'],
    key_id=os.environ['APCA_API_KEY_ID'],
    secret_key=os.environ['APCA_API_SECRET_KEY']
)

Using TensorFlow backend.


In [6]:
def get_df():
    # Get financial news data
    # Sources: Bloomberg, CNBC, Reuters, WSJ, Fortune, ... (financial news sources)
    df = pd.read_csv('data/news_data.csv', index_col=0, parse_dates=True)
    # Round each timestamp to the nearest day (UTC): news published late in the
    # day is not usable for that day's trading, so it rolls over to the next day
    # Only morning news is used in real trading
    df.index = pd.to_datetime(df.index, utc=True).round('D').date
    df.reset_index(inplace=True)
    df['index'] = pd.to_datetime(df['index'])
    # Move weekend news to Monday (market is closed on weekends)
    weekday = df['index'].dt.weekday
    df.loc[weekday == 5, 'index'] += datetime.timedelta(days=2)  # Saturday -> Monday
    df.loc[weekday == 6, 'index'] += datetime.timedelta(days=1)  # Sunday -> Monday
    df.set_index('index', inplace=True)
    return df
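
To make the alignment rule concrete, here is a minimal plain-Python sketch of the same logic applied to one timestamp at a time. The helper name `align_to_trading_day` is hypothetical (it is not part of the notebook), it assumes the timestamp is already in UTC, and, like `get_df`, it ignores market holidays.

```python
import datetime


def align_to_trading_day(ts):
    """Map an article timestamp (assumed UTC) to the date it is read on.

    Hypothetical helper mirroring get_df(): round to the nearest day,
    then push Saturday and Sunday dates to the following Monday.
    """
    day = ts.date()
    # Nearest-day rounding: anything from noon onward rolls to the next day.
    # (Exact-noon ties are rounded up here; pandas may resolve them differently.)
    if ts.hour >= 12:
        day += datetime.timedelta(days=1)
    # weekday(): Monday == 0 ... Saturday == 5, Sunday == 6
    if day.weekday() == 5:
        day += datetime.timedelta(days=2)
    elif day.weekday() == 6:
        day += datetime.timedelta(days=1)
    return day


# Friday 2017-12-08 at 09:30 stays on Friday
print(align_to_trading_day(datetime.datetime(2017, 12, 8, 9, 30)))
# Friday 2017-12-08 at 18:00 rounds to Saturday, then rolls to Monday 2017-12-11
print(align_to_trading_day(datetime.datetime(2017, 12, 8, 18, 0)))
```

A per-timestamp function like this is slower than the vectorized pandas version, but it is convenient for unit-testing edge cases such as Friday evenings and Sunday mornings.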

In [7]:
df = get_df()

In [8]:
df.head()

index,title,text
2017-12-08,Mexican official disputes reports of tainted a...,Mexico's secretary of tourism disputed reports...
2017-12-08,Saudi prince has history of extravagant impuls...,Timothy A. Clary | AFP | Getty Images Christie...
2017-12-11,Risks From The WTO’s New Power Vacuum,WASHINGTON—The world trading system confront...
2017-12-14,Winners and Losers of the GOP Tax Bill,Christmas may be over but WSJ’s Richard Rubi...
2017-12-15,WSJ. Magazine’s 10 Most-Read Stories of the ...,1. How Jony Ive Masterminded Apple’s New Hea...


In [9]:
df.tail()

index,title,text
2019-06-03,Ankr Network Price Tops $0.0077 on Major Excha...,Ankr Network Price Tops $0.0077 on Major Excha...
2019-06-03,"42-coin Price Reaches $19,937.09 on Major Exch...","42-coin Price Reaches $19,937.09 on Major Exch..."
2019-06-03,Photon Price Hits $0.0000 on Top Exchanges (PHO),Photon Price Hits $0.0000 on Top Exchanges (PH...
2019-06-03,Mao Zedong Trading 3.8% Higher Over Last Week ...,Mao Zedong Trading 3.8% Higher Over Last Week ...
2019-06-03,Nexty Trading 1% Higher Over Last 7 Days (NTY),Nexty Trading 1% Higher Over Last 7 Days (NTY)...


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 615632 entries, 2017-12-08 to 2019-06-03
Data columns (total 2 columns):
title    615624 non-null object
text     615631 non-null object
dtypes: object(2)
memory usage: 14.1+ MB


In [17]:
# Combine title and text
def get_combined_text(df):
    # Fill missing titles/texts so a NaN does not turn the whole row into 'nan'
    return (df['title'].fillna('') + ' ' + df['text'].fillna('')).astype(str)

In [18]:
X = get_combined_text(df)

In [25]:
# Tokenize top 20,000 words
tokenizer = text.Tokenizer(num_words=20000)
# Fit on news articles
tokenizer.fit_on_texts(list(X))
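
One detail of `num_words` is worth knowing: `fit_on_texts` records every word it sees in `tokenizer.word_index`, and the cap is only applied later, when texts are converted to sequences. At that point only indices below `num_words` are kept, so in effect the `num_words - 1` most frequent words survive. The sketch below is a simplified plain-Python imitation of that behavior, not Keras's implementation; `build_word_index` and `to_sequence` are hypothetical names, and tokenization here is just lowercasing and whitespace splitting.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # most_common() returns (word, count) pairs sorted by descending count
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words):
    """Convert text to indices, dropping words whose index is >= num_words."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            seq.append(idx)
    return seq


texts = [
    "markets rally as stocks rise",
    "markets fall as stocks slide",
    "markets open as traders wait",
    "markets close",
]
word_index = build_word_index(texts)
# The full vocabulary is retained; the cap only affects conversion.
print(len(word_index))  # 11
# With num_words=4, only indices 1-3 (markets, as, stocks) are kept.
print(to_sequence("stocks rally as markets close", word_index, num_words=4))  # [3, 2, 1]
```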

In [26]:
import pickle

# Save tokenizer
with open('models/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
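
Later notebooks need to read the saved tokenizer back with `pickle.load`. The helpers below are a hedged sketch of that round trip; `save_pickle` and `load_pickle` are hypothetical names, and a plain dictionary stands in for the fitted tokenizer so the example runs without Keras. Creating the parent directory first avoids a `FileNotFoundError` when `models/` does not exist yet.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Pickle obj to path, creating the parent directory if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load a pickled object; only unpickle files you created yourself."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Stand-in for the fitted tokenizer (assumption for illustration only)
stand_in = {'num_words': 20000, 'word_index': {'markets': 1, 'stocks': 2}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'models', 'tokenizer.pickle')
    save_pickle(stand_in, path)
    restored = load_pickle(path)

print(restored == stand_in)
```

In the notebook itself the equivalent call would be `tokenizer = load_pickle('models/tokenizer.pickle')`. Unpickling a real tokenizer generally requires a Keras installation compatible with the one that created the file.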

In [21]:
# Save data
X.to_csv('data/text_data.csv')