Last December, I set up a script on AWS Lambda to automatically scrape headlines from financial news websites (e.g. Reuters, CNBC) every 3 hours. Applying NLP to news data is now a popular topic in finance; Two Sigma, for example, is holding Kaggle competitions on using news to predict stock returns. I thought it would be interesting to do some exploratory ML on the news I have scraped so far.

In [1]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import json
import numpy as np
import math

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical
from sklearn.metrics import accuracy_score

Using TensorFlow backend.


In [2]:
data = pd.read_csv('~/Downloads/headlines.csv', parse_dates=['timestamp'], index_col=0)
data.head()

Unnamed: 0,source,timestamp,headline
5765,reuters,2018-12-07 10:00:48.453100,"[""Huawei CFO to appear in Canada court"", ""A to..."
5763,cnbc,2018-12-07 10:00:48.453100,"[""Tech stocks lift European markets; oil price..."
5764,ft,2018-12-07 10:00:48.453100,"[""Top stories"", ""Cyber Security"", ""Huawei cave..."
5822,cnbc,2018-12-07 11:00:48.314326,"[""European markets rally after global sell-off..."
5823,ft,2018-12-07 11:00:48.314326,"[""Top stories"", ""Cyber Security"", ""Huawei cave..."


A peek at what our raw text data looks like:

In [3]:
data['headline'].iloc[1]

'["Tech stocks lift European markets; oil prices slide ahead of OPEC meeting", "Asian stocks take a breather after days of declines", "Dow set to fall by 200 points at the open as sell-off continues", "Gold set for best week since August, US jobs data eyed", "Oil sinks as OPEC mulls Iran supply cut exemption, tries to get Russia on board", "Treasury yields muted ahead of nonfarm payrolls", "Dollar struggles on Fed pause talk ahead of jobs data", "Dollar struggles on Fed pause talk ahead of jobs data", "Bitcoin/USD Coinbase", "Iran seeks exemptions as OPEC awaits approval from Russia to impose production cuts", "A crunch Brexit vote is coming next week that could plunge the UK into fresh political chaos", "2 Hours Ago", "Bitcoin plunges 11 percent as December rout continues", "Eustance Huang", "3 Hours Ago", "John Bolton \'knew in advance\' about arrest of Huawei executive; official says Trump did not", "Dow rebounds from 780-point plunge, ends day just slightly lower on report Fed may 

Pretty raw. This cell contains all the headlines from one news source (e.g. CNBC) at a specific timestamp; the whole list of headlines is stored as a single JSON string. The next step is to clean up the raw text.
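
To make the structure concrete, here is a minimal plain-Python sketch of what one cell holds and how it can be unpacked. The `raw_cell` string is a shortened copy of the output above, and `min_len` is a name I made up for the length cutoff; treat it as an illustration rather than the exact cleaning code.

```python
import json

# A shortened copy of one raw cell: a JSON array serialized as a string.
raw_cell = json.dumps([
    "Asian stocks take a breather after days of declines",
    "Bitcoin/USD Coinbase",
    "2 Hours Ago",
    "Bitcoin plunges 11 percent as December rout continues",
    "Eustance Huang",
])

min_len = 30  # hypothetical name for the length cutoff

# json.loads turns the string back into a Python list of strings.
items = json.loads(raw_cell)

# Keep only strings long enough to plausibly be headlines.
headlines = [text for text in items if len(text) >= min_len]
dropped = [text for text in items if len(text) < min_len]

print(headlines)
print(dropped)
```

A length cutoff is a blunt filter: it would also discard any genuine headline shorter than the cutoff, so it trades a little recall for simplicity.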

In [4]:
source = data['source'].tolist()
timestamp = data['timestamp'].tolist()
headline = data['headline'].tolist()

expanded_source = []
expanded_timestamp = []
expanded_headline = []

# In the original dataframe each row contains multiple headlines, as shown above.
# We need to separate them out into one headline per row.

for i, hl in enumerate(headline):
    for text in json.loads(hl):

        # Some scraped strings are not really headlines, e.g. "2 Hours Ago"
        # or "Kelly Olsen"; they are just irrelevant metadata picked up
        # from the website. Such strings are all shorter than 30 characters,
        # so we can simply filter them out by string length.

        if len(text) < 30:
            continue
        expanded_source.append(source[i])
        expanded_timestamp.append(timestamp[i])
        expanded_headline.append(text)

data = pd.DataFrame({'source': expanded_source, 'timestamp': expanded_timestamp,
                     'headline': expanded_headline})

stop_words=set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()
stemmer = PorterStemmer()

# preprocessing pipeline to clean the text
def text_preprocess(text, return_str=True):
    table = str.maketrans('', '', '!"#&\'()*+,./:;<=>?@[\\]^_`{|}~‘’')
    text = word_tokenize(text)
    text = [t.translate(table) for t in text]
    text = [t for t in text if t not in stop_words]
    text = [lemmatiser.lemmatize(t) for t in text]
    text = [stemmer.stem(t) for t in text]
    text = [t for t in text if t != '']
    return ' '.join(text) if return_str else text

data['processed'] = data['headline'].apply(text_preprocess)
data['date'] = data['timestamp'].apply(lambda x: x.date().isoformat())

Load stock index data for the past 7-month period. I downloaded both the S&P 500 and MSCI World indices. For this exercise I will focus on MSCI World; the results on SPX are quite similar.

In [5]:
index_data = pd.read_csv('~/Downloads/index.csv', parse_dates=['date'])
index_data.head()

Unnamed: 0,date,spx,mxwo
0,2018-12-01,2760.17,2041.36
1,2018-12-02,2760.17,2041.36
2,2018-12-03,2790.37,2066.62
3,2018-12-04,2700.06,2016.89
4,2018-12-05,2700.06,2008.63


In [6]:
# the original data includes weekends as well; filter them out.
index_data['weekday'] = index_data['date'].apply(lambda x: x.isoweekday() <= 5)
index_data = index_data[index_data['weekday']]
del index_data['weekday']

# use the 1-day forward log return of the MSCI World Index
# (today's level to the next trading day's level) as the indicator for market sentiment
index = 'mxwo'
target = 'mxwo_1d_ret'

index_data[target] = np.log(index_data[index].shift(-1) / index_data[index])
index_data.dropna(inplace=True)

# helper function to convert a daily return into a class label;
# each label represents a 1% return interval
def generate_label(dataframe, column, interval=.01):
    def categorize(num):
        cat = (math.ceil(abs(num) / interval) ) * (1 if num > 0 else -1)
        # winsorize the data to (+/-)2% interval to 
        # prevent having outlier classes that
        # only show up in training set or test set
        return cat if abs(cat) <= 2 else 2 if cat > 0 else -2
    dataframe[column] = dataframe[column].apply(categorize)

generate_label(index_data, target)
index_data['date'] = index_data['date'].apply(lambda x: x.date().isoformat())

# merge the headline dataframe with the index data. Note that weekend news
# is dropped by the inner join for simplicity.
data = data.merge(index_data, on='date', how='inner')
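
As a sanity check on the labelling rule, the sketch below re-implements the nested `categorize` as a standalone function and applies it to a few made-up returns. The name `categorize_return` and the `cap` parameter are my own additions for illustration; the logic is meant to mirror the cell above.

```python
import math

def categorize_return(num, interval=0.01, cap=2):
    """Map a return to a signed bucket index, clipped to +/-cap.

    Standalone copy of the nested categorize() above; `cap` is a
    hypothetical parameter standing in for the hard-coded 2.
    """
    cat = math.ceil(abs(num) / interval) * (1 if num > 0 else -1)
    return max(-cap, min(cap, cat))

# Made-up returns, chosen away from bucket boundaries.
examples = [0.004, -0.004, 0.013, -0.035, 0.0]
labels = [categorize_return(r) for r in examples]
print(labels)  # [1, -1, 2, -2, 0]
```

So the possible labels are -2, -1, 0, 1 and 2, with 0 assigned only when the return is exactly zero.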

In [7]:
data.head()

Unnamed: 0,source,timestamp,headline,processed,date,spx,mxwo,mxwo_1d_ret
0,reuters,2018-12-07 10:00:48.453100,Huawei CFO to appear in Canada court,huawei cfo appear canada court,2018-12-07,2633.08,1965.24,-1
1,reuters,2018-12-07 10:00:48.453100,A top executive of China's Huawei Technologies...,A top execut china huawei technolog arrest can...,2018-12-07,2633.08,1965.24,-1
2,reuters,2018-12-07 10:00:48.453100,Huawei appoints chairman as acting CFO,huawei appoint chairman act cfo,2018-12-07,2633.08,1965.24,-1
3,reuters,2018-12-07 10:00:48.453100,"Japan to shun Huawei, ZTE equipment",japan shun huawei zte equip,2018-12-07,2633.08,1965.24,-1
4,reuters,2018-12-07 10:00:48.453100,TV: Trump was unaware of arrest - officials,TV trump unawar arrest - offici,2018-12-07,2633.08,1965.24,-1


In [8]:
print(data.shape)
print(data['date'].min())
print(data['date'].max())

(176721, 8)
2018-12-07
2019-07-08


176,721 headlines were captured for the period from 2018-12-07 to 2019-07-08. I will sample 30,000 headlines for this exercise: 20,000 will be used for training and the remaining 10,000 for testing.

In [9]:
sample_index = np.random.choice(data.shape[0], size=30000, replace=False)
train = data.loc[sample_index[:20000], :]
test = data.loc[sample_index[20000:], :]

# sampling without look-ahead bias
# train = data.iloc[:100000, :].sample(10000)
# test = data.iloc[100000:, :].sample(10000)

tokenizer = Tokenizer()
# fit tokenizer on the whole corpus of text. Maybe restrict this to 
# training set corpus to avoid look-ahead bias?
tokenizer.fit_on_texts(data['processed'].tolist())

train_x = train['processed'].tolist()
train_x = tokenizer.texts_to_sequences(train_x)
headline_len = [len(vector) for vector in train_x]
max_len = int(np.percentile(headline_len, 80))         # take the 80th percentile length
vocabulary_size = len(tokenizer.word_index) + 1
train_x = pad_sequences(train_x, maxlen=max_len, padding='pre', truncating='post')
train_y = to_categorical(train[target])
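
One thing worth double-checking here: the labels range from -2 to 2, and as far as I understand `to_categorical`, it uses each label directly as a column index, with `max(label) + 1` columns by default. Under that assumption negative labels wrap around like negative Python indices, so -1 lands in the same column as 2 and -2 in the same column as 1, which would be consistent with the three probability columns that appear later in this notebook. The plain-Python sketch below imitates that behaviour (the helper `one_hot_by_index` is mine, not a Keras function) and shows one possible remedy: shifting the labels to 0..4 first.

```python
def one_hot_by_index(labels, num_classes=None):
    """Imitation of index-based one-hot encoding (not the Keras function)."""
    if num_classes is None:
        num_classes = max(labels) + 1
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # a negative label indexes from the end of the row
        rows.append(row)
    return rows

raw_labels = [-2, -1, 0, 1, 2]

# Without shifting: 3 columns, and -1/2 (and -2/1) share a column.
collided = one_hot_by_index(raw_labels)

# With shifting: labels become 0..4, one column per class.
offset = -min(raw_labels)
shifted = one_hot_by_index([label + offset for label in raw_labels])

print(collided)
print(len(shifted[0]))
```

If such a shift were applied for training, the same offset would have to be subtracted from the predicted class index to get back to the original labels.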

I will use an LSTM RNN for this classification exercise. Building the model:

In [10]:
embedding_dim = 500
rnn = Sequential()
rnn.add(Embedding(vocabulary_size, embedding_dim, input_length=max_len))
rnn.add(SpatialDropout1D(0.2))
rnn.add(LSTM(units=30, dropout=.2))
rnn.add(Dense(train_y.shape[1], activation='softmax'))
rnn.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
rnn.fit(train_x, train_y, batch_size=512, epochs=30, verbose=2)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Epoch 1/30
 - 16s - loss: 0.8299 - acc: 0.5948
Epoch 2/30
 - 15s - loss: 0.6410 - acc: 0.6530
Epoch 3/30
 - 15s - loss: 0.5495 - acc: 0.7337
Epoch 4/30
 - 14s - loss: 0.4783 - acc: 0.7814
Epoch 5/30
 - 13s - loss: 0.4212 - acc: 0.8144
Epoch 6/30
 - 13s - loss: 0.3752 - acc: 0.8374
Epoch 7/30
 - 13s - loss: 0.3366 - acc: 0.8554
Epoch 8/30
 - 14s - loss: 0.3037 - acc: 0.8688
Epoch 9/30
 - 13s - loss: 0.2761 - acc: 0.8791
Epoch 10/30
 - 13s - loss: 0.2517 - acc: 0.8909
Epoch 11/30
 - 13s - loss: 0.2361 - acc: 0.8962
Epoch 12/30
 - 13s - loss: 0.2198 - acc: 0.9031
Epoch 13/30
 - 14s - loss: 0.2030 - acc: 0.9131
Epoch 14/30
 - 14s - loss: 0.1917 - acc: 0.9152
Epoch 15/30
 - 14s - loss: 0.1827 - acc: 0.9184
Epoch 16/30
 - 14s - loss: 0.1747 - acc: 0.9202
Epoch 

<keras.callbacks.History at 0x1a499a6b38>

From the training stats, it looks like training has converged. Let's see how the model performs on the test set.

In [11]:
# prepare the test set and feed it into the RNN
test_x = test['processed'].tolist()
test_x = tokenizer.texts_to_sequences(test_x)
test_x = pad_sequences(test_x, maxlen=max_len, padding='pre', truncating='post')
pred_y = rnn.predict(test_x)
pred_y = np.argmax(pred_y, axis=1)

# The RNN outputs a class index rather than the label we originally generated, so we
# convert our generated labels to class indices with to_categorical() followed by argmax.
test_y = to_categorical(test[target])
test_y = np.argmax(test_y, axis=1)

headline = pd.DataFrame({'date': test['date'], 'headline': test['headline'], 'actual': test_y, 'predict': pred_y})
headline.sort_values('date', inplace=True)

# accuracy score on a per headline basis
print('headline level accuracy score {}'.format(accuracy_score(headline['actual'], headline['predict'])))

headline level accuracy score 0.7074


For each headline, the model achieves an accuracy of 71%. However, not all headlines are highly relevant financial news: these websites sometimes display ads or non-financial articles (e.g. on art or sport) on their front page. To come up with a prediction for the next day's return, we make use of all the headlines gathered on a given day.

Since the activation function of our last layer is a softmax, which outputs the probability of belonging to each class, we can sum the per-class probabilities over all headlines for a day and pick the class with the largest total as our predicted class.
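
The aggregation can be sketched in plain Python as follows. The dates and probabilities are invented for illustration, and `aggregate_by_day` is a hypothetical helper rather than part of the notebook.

```python
def aggregate_by_day(dates, probs):
    """Sum per-class probabilities per date and return the argmax class."""
    totals = {}
    for date, p in zip(dates, probs):
        running = totals.setdefault(date, [0.0] * len(p))
        for k, value in enumerate(p):
            running[k] += value
    # For each date, pick the class index with the largest summed probability.
    return {
        date: max(range(len(sums)), key=sums.__getitem__)
        for date, sums in totals.items()
    }

# Made-up softmax outputs for four headlines over two days.
dates = ["2019-01-04", "2019-01-04", "2019-01-04", "2019-01-07"]
probs = [
    [0.1, 0.5, 0.4],  # on its own this headline would vote for class 1
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.2, 0.6, 0.2],
]

daily = aggregate_by_day(dates, probs)
print(daily)
```

Summing probabilities instead of counting per-headline votes lets confident predictions outweigh lukewarm ones.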

In [12]:
pred_y = rnn.predict(test_x)
#pred_y = scale(pred_y)
pred_y = np.hstack((pred_y, test.loc[:, ['date']]))
category_prob = pd.DataFrame(pred_y)
category_prob.columns = [0, 1, 2, 'date']
category_prob.sort_values('date', inplace=True)

sum_prob = category_prob.groupby('date').sum()
date = sum_prob.index
label = np.argmax(sum_prob.values, axis=1)
sum_prob = pd.DataFrame({'date': date, 'predict': label})
sum_prob.set_index('date', inplace=True)
# select the numeric column before averaging so the text column is never aggregated
y_label = headline.groupby('date')['actual'].mean()
result = sum_prob.join(y_label)

# accuracy score on a per day basis
print('day level accuracy score {}'.format(accuracy_score(result['actual'], result['predict'])))

day level accuracy score 0.8355263157894737


Interestingly, this drastically improves the accuracy to roughly 84%, which is more than I expected. Let's have a look at what our predictions look like on a per-headline basis and on a daily basis.

In [13]:
headline.sample(20)

Unnamed: 0,date,headline,actual,predict
154066,2019-05-23,Facebook will not allow marijuana sales on its...,1,1
114897,2019-03-06,This company may be first big Chinese IPO of 2...,2,1
109675,2019-02-22,"Queues of up to 15,000 people could stretch fo...",1,1
70398,2019-01-10,The National Butterfly Center in Hidalgo Count...,2,1
98102,2019-02-05,Michigan’s investment arm bought Intel and IBM...,2,1
37154,2018-12-26,Want a free country house in Japan? They're gi...,1,1
37589,2018-12-27,'The worst is yet to come': Experts say a glob...,1,2
40074,2018-12-27,"In Japan, a scramble for new workers disrupts ...",1,1
79601,2019-01-15,Markets need not be worried about China's econ...,1,1
36534,2018-12-26,Niche markets in 2018: the bad and the bountiful,1,1


In [14]:
result.sample(20)

Unnamed: 0_level_0,predict,actual
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-12-11,2,2.0
2019-03-28,1,1.0
2019-03-14,1,1.0
2019-06-19,2,2.0
2019-02-27,2,2.0
2019-03-05,1,2.0
2019-04-01,1,2.0
2019-01-23,1,1.0
2019-03-26,1,2.0
2019-05-22,1,1.0


The result looks impressive. However, don't forget that we generated our training and test sets by random sampling across the entire 7-month period, which to some degree suffers from look-ahead bias. Every headline from a given day carries that day's label, so a random split puts headlines from the same day in both sets. Headlines from the same period also normally have a similar word distribution. At the beginning of January 2019, for example, the market saw a wave of rebounds. Having seen in the training set that words like "surge", "rebound" and "recover" are associated with a very positive label, the classifier will naturally link test headlines with similar meaning to a positive label.

In the training/test data generation part, I commented out another way to generate the train/test sets without look-ahead bias. Under this setting, the accuracy drops to around 50%! Would a longer history help? Probably: we only have 7 months of data to train on, whereas a long 10-year series would span different market sentiment cycles.
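
For reference, a chronological split along the lines of the commented-out code can be sketched like this. The rows are toy data with invented headlines and `chronological_split` is a hypothetical helper; the idea is only that every training row precedes every test row in time, and that anything fitted on text (such as the tokenizer vocabulary) sees the training rows only.

```python
def chronological_split(rows, cutoff_date):
    """Split (date, text) rows so training is strictly before cutoff_date.

    ISO-formatted date strings sort chronologically, so plain string
    comparison is enough here.
    """
    train = [row for row in rows if row[0] < cutoff_date]
    test = [row for row in rows if row[0] >= cutoff_date]
    return train, test

# Toy rows: (date, already-cleaned headline text).
rows = [
    ("2019-01-03", "stocks slide on growth fears"),
    ("2019-01-04", "markets rebound after jobs report"),
    ("2019-01-04", "wall street surges on fed comments"),
    ("2019-01-07", "stocks extend rebound"),
]

train, test = chronological_split(rows, cutoff_date="2019-01-07")

# Build the vocabulary from training text only, so no test-period words leak in.
vocabulary = {word for _, text in train for word in text.split()}

print(len(train), len(test))
print("extend" in vocabulary)
```

With this kind of split, words that first appear in the test period are simply unknown to the model, which is closer to how it would be used in practice.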