In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
#resources

import numpy as np
import pandas as pd
import os
import json
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, ElasticNetCV, SGDRegressor

import autokeras as ak
import tensorflow as tf

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Introduction

## Problem

Daily stock market returns are notoriously difficult to predict and forecast: they are volatile, with many possible predictors and underlying interactions.

## Goal

To predict S&P 500 returns based on news data.

## Data

Predictors:
* Huff Post News Data (https://www.kaggle.com/datasets/rmisra/news-category-dataset)
    * **category**: category in which the article was published.
    * **headline**: the headline of the news article.
    * **authors**: list of authors who contributed to the article.
    * **link**: link to the original news article.
    * **short_description**: Abstract of the news article.
    * **date**: publication date of the article between 2012-01-28 and 2022-09-23

Target:
* S&P500 Data (https://fred.stlouisfed.org/series/SP500)
    * **Returns** (USD) between 2013-06-27 and 2023-06-26


Side note: timestamps would have been useful to determine whether the headlines on a given date occurred before the market closed.


## Methodology

1. Data ETL
2. Data Pre-Processing
3. Text predictor feature extraction
4. Feature engineering
5. Modeling
    * Linear Regression (baseline prediction)
    * Random Forest Regression (ensemble learner prediction)
    * Autokeras (out-of-the-box neural net prediction)
    * 1D CNN (custom spatio-temporal prediction)
    * LSTM (custom time-series prediction)


# ETL

In [3]:
#predictors
news = []
with open('News_Category_Dataset_v3.json', 'r') as file:
    for line in file:
        news.append(json.loads(line))
news = pd.DataFrame.from_dict(news)

#target
returns = pd.read_csv('SP500.csv')

In [4]:
news.shape
news.head()
news.describe()
news.dtypes

returns.shape
returns.head()
returns.describe()
returns.dtypes

(209527, 6)

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


Unnamed: 0,link,headline,category,short_description,authors,date
count,209527,209527,209527,209527.0,209527.0,209527
unique,209486,207996,42,187022.0,29169.0,3890
top,https://www.huffingtonpost.comhttps://www.wash...,Sunday Roundup,POLITICS,,,2014-03-25
freq,2,90,35602,19712.0,37418.0,100


link                 object
headline             object
category             object
short_description    object
authors              object
date                 object
dtype: object

(2608, 2)

Unnamed: 0,DATE,SP500
0,2013-06-27,1613.2
1,2013-06-28,1606.28
2,2013-07-01,1614.96
3,2013-07-02,1614.08
4,2013-07-03,1615.41


Unnamed: 0,DATE,SP500
count,2608,2608
unique,2608,2504
top,2013-06-27,.
freq,1,92


DATE     object
SP500    object
dtype: object

In [5]:
# cast date columns as datetime types
news['date'] = pd.to_datetime(news['date'])

returns['DATE'] = pd.to_datetime(returns['DATE'])


In [6]:
# cast returns column as float
returns['SP500'] = pd.to_numeric(returns['SP500'], errors='coerce')

returns.dtypes

DATE     datetime64[ns]
SP500           float64
dtype: object
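
The `describe()` output above shows `.` as the most frequent `SP500` value (92 rows), which is why the column is cast with `errors='coerce'`. The plain-Python sketch below illustrates what that coercion amounts to, and how a simple daily return could be derived from the index levels if a return series were wanted. The helper names and the return step are assumptions for illustration; the notebook itself keeps the levels as the target. The numeric values are the first rows shown above, with one hypothetical `.` row inserted.

```python
import math

def to_float(raw):
    """Parse one cell; anything non-numeric (such as ".") becomes NaN."""
    try:
        return float(raw)
    except ValueError:
        return math.nan

def daily_returns(levels):
    """Simple return p_t / p_(t-1) - 1; NaN whenever either level is missing."""
    out = [math.nan]  # the first day has no previous level
    for prev, cur in zip(levels, levels[1:]):
        out.append(cur / prev - 1.0)
    return out

raw_cells = ["1613.2", "1606.28", ".", "1614.08"]  # "." is a hypothetical gap row
levels = [to_float(cell) for cell in raw_cells]
rets = daily_returns(levels)
print(levels)
print(rets)
```

Note that a missing level makes both its own return and the next day's return undefined in this sketch.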

Authors and links are dropped as predictors. Authors tend to write on particular topics and do not stay with the company indefinitely, and links are derived from the headlines. Category and author are therefore likely collinear, as are headline/description and link, so dropping the two columns reduces multicollinearity from the start.

In [7]:
#take an explicit copy so the column assignments below do not target a slice of news
data = news[['date', 'category', 'headline', 'short_description']].copy()

#map target to predictors using date
di = dict(zip(returns.DATE, returns.SP500))

data['returns'] = data['date'].map(di)

data


Unnamed: 0,date,category,headline,short_description,returns
0,2022-09-23,U.S. NEWS,Over 4 Million Americans Roll Up Sleeves For O...,Health experts said it is too early to predict...,3693.23
1,2022-09-23,U.S. NEWS,"American Airlines Flyer Charged, Banned For Li...",He was subdued by passengers and crew when he ...,3693.23
2,2022-09-23,COMEDY,23 Of The Funniest Tweets About Cats And Dogs ...,"""Until you have a dog you don't understand wha...",3693.23
3,2022-09-23,PARENTING,The Funniest Tweets From Parents This Week (Se...,"""Accidentally put grown-up toothpaste on my to...",3693.23
4,2022-09-22,U.S. NEWS,Woman Who Called Cops On Black Bird-Watcher Lo...,Amy Cooper accused investment firm Franklin Te...,3757.99
...,...,...,...,...,...
209522,2012-01-28,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,Verizon Wireless and AT&T are already promotin...,
209523,2012-01-28,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,"Afterward, Azarenka, more effusive with the pr...",
209524,2012-01-28,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...","Leading up to Super Bowl XLVI, the most talked...",
209525,2012-01-28,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,CORRECTION: An earlier version of this story i...,
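
The date lookup leaves the target empty for every article published on a day without a close (weekends, holidays, and dates outside the S&P range), and the next cell drops those rows. As a hedged alternative, not something the notebook does, the sketch below rolls such dates forward to the next available trading day. The function name and the toy dictionary are hypothetical; the three closes are the first rows of the S&P data shown earlier.

```python
from bisect import bisect_left
from datetime import date

# Toy data: closing levels keyed by trading day
closes = {
    date(2013, 6, 27): 1613.20,  # Thursday
    date(2013, 6, 28): 1606.28,  # Friday
    date(2013, 7, 1): 1614.96,   # Monday
}
trading_days = sorted(closes)

def next_close(day):
    """Close of `day`, or of the first trading day after it; None past the data."""
    i = bisect_left(trading_days, day)
    return closes[trading_days[i]] if i < len(trading_days) else None

saturday = date(2013, 6, 29)
print(closes.get(saturday))  # None: the plain lookup has no weekend entry
print(next_close(saturday))  # 1614.96: rolled forward to Monday
```

Whether rolling forward is appropriate depends on the modeling goal; it keeps weekend news in the dataset at the cost of assuming it acts on the next session.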


In [8]:
#drop any rows with empty values in the target column (copy so later assignments do not target a slice)
data = data[data['returns'].notna()].copy()
data

Unnamed: 0,date,category,headline,short_description,returns
0,2022-09-23,U.S. NEWS,Over 4 Million Americans Roll Up Sleeves For O...,Health experts said it is too early to predict...,3693.23
1,2022-09-23,U.S. NEWS,"American Airlines Flyer Charged, Banned For Li...",He was subdued by passengers and crew when he ...,3693.23
2,2022-09-23,COMEDY,23 Of The Funniest Tweets About Cats And Dogs ...,"""Until you have a dog you don't understand wha...",3693.23
3,2022-09-23,PARENTING,The Funniest Tweets From Parents This Week (Se...,"""Accidentally put grown-up toothpaste on my to...",3693.23
4,2022-09-22,U.S. NEWS,Woman Who Called Cops On Black Bird-Watcher Lo...,Amy Cooper accused investment firm Franklin Te...,3757.99
...,...,...,...,...,...
161346,2013-06-27,STYLE & BEAUTY,Cheryl Cole's Style Evolution: From Cornrows T...,Cheryl Cole's path to fame wasn't exactly ordi...,1613.20
161347,2013-06-27,TRAVEL,Three of Europe's Most Hedonistic Cities: Part...,"Paris brings us back again and again, season a...",1613.20
161348,2013-06-27,WELLNESS,Anxiety Tied To Sleep Deprivation,"""It's been hard to tease out whether sleep los...",1613.20
161349,2013-06-27,FOOD & DRINK,Mac And Cheese Creations: Over The Top And Com...,You can add this dish to just about everything.,1613.20


In [9]:
#sanity check date range after drop
data.date.min()
data.date.max()

Timestamp('2013-06-27 00:00:00')

Timestamp('2022-09-23 00:00:00')

In [10]:
#add a combined corpus column to test as a feature downstream

data['corpus'] = data['headline'] + ' ' + data['short_description']
data.head()


Unnamed: 0,date,category,headline,short_description,returns,corpus
0,2022-09-23,U.S. NEWS,Over 4 Million Americans Roll Up Sleeves For O...,Health experts said it is too early to predict...,3693.23,Over 4 Million Americans Roll Up Sleeves For O...
1,2022-09-23,U.S. NEWS,"American Airlines Flyer Charged, Banned For Li...",He was subdued by passengers and crew when he ...,3693.23,"American Airlines Flyer Charged, Banned For Li..."
2,2022-09-23,COMEDY,23 Of The Funniest Tweets About Cats And Dogs ...,"""Until you have a dog you don't understand wha...",3693.23,23 Of The Funniest Tweets About Cats And Dogs ...
3,2022-09-23,PARENTING,The Funniest Tweets From Parents This Week (Se...,"""Accidentally put grown-up toothpaste on my to...",3693.23,The Funniest Tweets From Parents This Week (Se...
4,2022-09-22,U.S. NEWS,Woman Who Called Cops On Black Bird-Watcher Lo...,Amy Cooper accused investment firm Franklin Te...,3757.99,Woman Who Called Cops On Black Bird-Watcher Lo...


# Pre-Processing

In [11]:
#Remove emojis and unicode chars
#fxn from https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python

def deEmojify(text):
    regex = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regex.sub(r'',text)


def deSymbolify(text):
    regex = re.compile('[^a-zA-Z]')
    return regex.sub(r'', text)


def dataframe_preprocess(col):
    #tokenize into a new column
    data[f'tokens_{col}'] = data[col].apply(nltk.word_tokenize)

    #remove stop words (set lookup is much faster than scanning a list for every token)
    stopword = set(stopwords.words('english'))
    data[f'tokens_{col}'] = data[f'tokens_{col}'].apply(lambda x: [item for item in x if item not in stopword])

    #remove symbols, non-ascii, digits, and too long and too short tokens
    #deSymbolify keeps only a-z/A-Z, so the ascii and digit filters below act as a safety net
    data[f'tokens_{col}'] = data[f'tokens_{col}'].apply(lambda x: [deSymbolify(word) for word in x])
    data[f'tokens_{col}'] = data[f'tokens_{col}'].apply(lambda x: [word for word in x if word.isascii()])
    data[f'tokens_{col}'] = data[f'tokens_{col}'].apply(lambda x: [word for word in x if not any(ch.isdigit() for ch in word)])
    #keep tokens of 5 to 11 characters
    data[f'tokens_{col}'] = data[f'tokens_{col}'].apply(lambda x: [word for word in x if 4 < len(word) < 12])
    
    #add stemmed corpus column
    stemmer = SnowballStemmer("english")
    data[f'stemmed_{col}'] = data[f'tokens_{col}'].apply(lambda x: [stemmer.stem(y) for y in x])
    
    #add lemmatized corpus
    wnl = WordNetLemmatizer()
    data[f'lemmatized_{col}'] = data[f'tokens_{col}'].apply(lambda x: [wnl.lemmatize(y) for y in x])
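
To make the filter chain in `dataframe_preprocess` easier to follow, here is a minimal stand-alone sketch that applies the same steps (stop-word removal, stripping non-letters, keeping tokens of 5 to 11 characters) to one already-tokenized headline. It avoids NLTK on purpose: `clean_tokens` and the tiny stop-word set are hypothetical stand-ins for the notebook's function and for `stopwords.words('english')`.

```python
import re

NON_LETTERS = re.compile(r'[^a-zA-Z]')

def clean_tokens(tokens, stop_words, min_len=5, max_len=11):
    """Mirror the notebook's filter chain on a list of lowercase tokens."""
    kept = []
    for tok in tokens:
        if tok in stop_words:
            continue
        tok = NON_LETTERS.sub('', tok)  # drops digits, punctuation and non-ASCII
        if min_len <= len(tok) <= max_len:
            kept.append(tok)
    return kept

toy_stop_words = {'over', 'up', 'for'}  # stand-in for the NLTK English list
tokens = ['over', '4', 'million', 'americans', 'roll', 'up',
          'sleeves', 'for', 'covid', 'boosters']
cleaned = clean_tokens(tokens, toy_stop_words)
print(cleaned)  # ['million', 'americans', 'sleeves', 'covid', 'boosters']
```

Short words such as "roll" are removed by the length filter, not by the stop-word list, which is worth keeping in mind when judging how much signal the cleaned corpus retains.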
    

In [12]:
#text pre-processing for NLP use

for col in ['category', 'headline', 'short_description', 'corpus']:
    #Convert to lowercase
    data[col] = data[col].str.lower()

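    #remove punctuation (regex=True is needed in pandas >= 2.0, where str.replace defaults to a literal match)
    data[col] = data[col].str.replace(r'[^\w\s]', '', regex=True)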

    #get rid of unicode chars and any emojis
    data[col] = data[col].apply(deEmojify)
    
    #tokenize, stem, and lemmatize 
    ## choice to not pre-process the category further as the values are usually unigrams or bigrams max
    if col != 'category':
        dataframe_preprocess(col)


In [13]:
#clean up redundant categories

data.category.value_counts()

di = {'food  drink': 'food drink',
      'style  beauty': 'style',
      'the worldpost': 'worldpost',
      'arts  culture': 'culture',
      'culture  arts': 'culture',
      'home  living': 'home living'
     }

data['category'] = data['category'].replace(di)  #assign back instead of replacing in place on an attribute access

data.category.value_counts()


politics          26524
entertainment     11117
wellness           6092
healthy living     5163
travel             4168
queer voices       3523
parents            3495
comedy             3239
black voices       3238
parenting          3238
business           3144
sports             3133
women              3037
food  drink        2692
world news         2411
style  beauty      2390
media              2340
the worldpost      2180
impact             2106
crime              2021
weird news         1949
green              1800
style              1695
religion           1675
taste              1626
home  living       1354
worldpost          1253
arts  culture      1197
divorce            1163
tech               1150
good news          1112
weddings           1039
arts                995
science             967
latino voices       964
college             866
us news             863
fifty               796
education           701
money               298
environment          66
culture  arts   


politics          26524
entertainment     11117
wellness           6092
healthy living     5163
travel             4168
style              4085
queer voices       3523
parents            3495
worldpost          3433
comedy             3239
parenting          3238
black voices       3238
business           3144
sports             3133
women              3037
food drink         2692
world news         2411
media              2340
impact             2106
crime              2021
weird news         1949
green              1800
religion           1675
taste              1626
home living        1354
culture            1246
divorce            1163
tech               1150
good news          1112
weddings           1039
arts                995
science             967
latino voices       964
college             866
us news             863
fifty               796
education           701
money               298
environment          66
Name: category, dtype: int64
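
The double spaces in labels such as `food  drink` come from the punctuation step, which deletes `&` but leaves the spaces on either side of it. A small sketch of a normalizer that also collapses whitespace runs is shown below; `normalize_category` is a hypothetical helper, not part of the notebook. Semantic merges such as mapping `style beauty` to `style` would still need the manual dictionary.

```python
import re

def normalize_category(raw):
    """Lowercase, drop punctuation, then collapse whitespace runs to one space."""
    no_punct = re.sub(r'[^\w\s]', '', raw.lower())
    return re.sub(r'\s+', ' ', no_punct).strip()

print(normalize_category('FOOD & DRINK'))  # food drink
print(normalize_category('U.S. NEWS'))     # us news
```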

In [14]:
#split date predictor into elementary components

data['year'] = data['date'].dt.year 
data['month'] = data['date'].dt.month 
data['day'] = data['date'].dt.day

data.head()


Unnamed: 0,date,category,headline,short_description,returns,corpus,tokens_headline,stemmed_headline,lemmatized_headline,tokens_short_description,stemmed_short_description,lemmatized_short_description,tokens_corpus,stemmed_corpus,lemmatized_corpus,year,month,day
0,2022-09-23,us news,over 4 million americans roll up sleeves for o...,health experts said it is too early to predict...,3693.23,over 4 million americans roll up sleeves for o...,"[million, americans, sleeves, covid, boosters]","[million, american, sleev, covid, booster]","[million, american, sleeve, covid, booster]","[health, experts, early, predict, whether, dem...","[health, expert, earli, predict, whether, dema...","[health, expert, early, predict, whether, dema...","[million, americans, sleeves, covid, boosters,...","[million, american, sleev, covid, booster, hea...","[million, american, sleeve, covid, booster, he...",2022,9,23
1,2022-09-23,us news,american airlines flyer charged banned for lif...,he was subdued by passengers and crew when he ...,3693.23,american airlines flyer charged banned for lif...,"[american, airlines, flyer, charged, banned, p...","[american, airlin, flyer, charg, ban, punch, f...","[american, airline, flyer, charged, banned, pu...","[subdued, passengers, aircraft, according, att...","[subdu, passeng, aircraft, accord, attorney, o...","[subdued, passenger, aircraft, according, atto...","[american, airlines, flyer, charged, banned, p...","[american, airlin, flyer, charg, ban, punch, f...","[american, airline, flyer, charged, banned, pu...",2022,9,23
2,2022-09-23,comedy,23 of the funniest tweets about cats and dogs ...,until you have a dog you dont understand what ...,3693.23,23 of the funniest tweets about cats and dogs ...,"[funniest, tweets]","[funniest, tweet]","[funniest, tweet]","[understand, could, eaten]","[understand, could, eaten]","[understand, could, eaten]","[funniest, tweets, understand, could, eaten]","[funniest, tweet, understand, could, eaten]","[funniest, tweet, understand, could, eaten]",2022,9,23
3,2022-09-23,parenting,the funniest tweets from parents this week sep...,accidentally put grownup toothpaste on my todd...,3693.23,the funniest tweets from parents this week sep...,"[funniest, tweets, parents]","[funniest, tweet, parent]","[funniest, tweet, parent]","[grownup, toothpaste, toddlers, toothbrush, sc...","[grownup, toothpast, toddler, toothbrush, scre...","[grownup, toothpaste, toddler, toothbrush, scr...","[funniest, tweets, parents, grownup, toothpast...","[funniest, tweet, parent, grownup, toothpast, ...","[funniest, tweet, parent, grownup, toothpaste,...",2022,9,23
4,2022-09-22,us news,woman who called cops on black birdwatcher los...,amy cooper accused investment firm franklin te...,3757.99,woman who called cops on black birdwatcher los...,"[woman, called, black, birdwatcher, loses, law...","[woman, call, black, birdwatch, lose, lawsuit,...","[woman, called, black, birdwatcher, loses, law...","[cooper, accused, investment, franklin, temple...","[cooper, accus, invest, franklin, templeton, u...","[cooper, accused, investment, franklin, temple...","[woman, called, black, birdwatcher, loses, law...","[woman, call, black, birdwatch, lose, lawsuit,...","[woman, called, black, birdwatcher, loses, law...",2022,9,22


After manually reviewing records, the lemmatized text is chosen over the stemmed text to avoid nonsensical stems (e.g. "sleev", "airlin"). A single combined corpus per record is used rather than separate headline and short_description fields, to limit the size of the text vectors and avoid out-of-memory errors.

### build actionable dataframe from data


In [15]:

df = data[['returns', 'year', 'month', 'day', 'category', 
           'lemmatized_headline', 'lemmatized_short_description', 'lemmatized_corpus']].copy(deep=True)

#join tokens back together for final corpus
for col in ['lemmatized_headline', 'lemmatized_short_description', 'lemmatized_corpus']:
    df[col] = df[col].str.join(" ")
    
df

Unnamed: 0,returns,year,month,day,category,lemmatized_headline,lemmatized_short_description,lemmatized_corpus
0,3693.23,2022,9,23,us news,million american sleeve covid booster,health expert early predict whether demand wou...,million american sleeve covid booster health e...
1,3693.23,2022,9,23,us news,american airline flyer charged banned punching...,subdued passenger aircraft according attorney ...,american airline flyer charged banned punching...
2,3693.23,2022,9,23,comedy,funniest tweet,understand could eaten,funniest tweet understand could eaten
3,3693.23,2022,9,23,parenting,funniest tweet parent,grownup toothpaste toddler toothbrush screamed...,funniest tweet parent grownup toothpaste toddl...
4,3757.99,2022,9,22,us news,woman called black birdwatcher loses lawsuit e...,cooper accused investment franklin templeton u...,woman called black birdwatcher loses lawsuit e...
...,...,...,...,...,...,...,...,...
161346,1613.20,2013,6,27,style,cheryl cole style evolution cornrows couture p...,cheryl cole wasnt exactly ordinary winning gir...,cheryl cole style evolution cornrows couture p...
161347,1613.20,2013,6,27,travel,three europe hedonistic city paris,paris brings season season,three europe hedonistic city paris paris bring...
161348,1613.20,2013,6,27,wellness,anxiety sleep deprivation,tease whether sleep simply byproduct anxiety w...,anxiety sleep deprivation tease whether sleep ...
161349,1613.20,2013,6,27,food drink,cheese creation completely amazing,everything,cheese creation completely amazing everything


In [16]:
#drop duplicate records
df.drop_duplicates(inplace=True)

In [17]:
#create full text field as ML corpus
df['corpus'] = df['category'] + ' ' + df['lemmatized_corpus']


In [18]:
#train test 80:20 split

df_train = df.sample(frac=0.80)
df_test = df.drop(df_train.index)

In [19]:
#one hot encode the category predictor
enc = OneHotEncoder(handle_unknown='infrequent_if_exist') #categories unseen in training go to the infrequent bucket if one exists; none is configured here, so they encode as all zeros

enc.fit(df_train.category.to_numpy().reshape(-1, 1))

#transform train and test separately
df_train['category_enc'] = list(np.array(enc.transform(df_train.category.to_numpy().reshape(-1, 1)).todense()))
df_test['category_enc'] = list(np.array(enc.transform(df_test.category.to_numpy().reshape(-1, 1)).todense()))

df_train = df_train[['returns', 'year', 'month', 'day', 'category_enc', 'corpus']]
df_test = df_test[['returns', 'year', 'month', 'day', 'category_enc', 'corpus']]



In [20]:
df_train.shape
df_train.head()

df_test.shape
df_test.head()

(95042, 6)

Unnamed: 0,returns,year,month,day,category_enc,corpus
32714,2440.35,2017,6,13,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",black voices kerry washington artist doesnt vo...
82235,2090.11,2015,11,27,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",worldpost world leader build momentum paris cl...
40611,2365.45,2017,3,14,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",politics going backward american insurance dec...
105341,2044.16,2015,3,10,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",worldpost white house need support egypt jorda...
96210,2124.2,2015,6,23,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",college south carolina college president call ...


(23760, 6)

Unnamed: 0,returns,year,month,day,category_enc,corpus
2,3693.23,2022,9,23,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",comedy funniest tweet understand could eaten
7,3757.99,2022,9,22,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",world news puerto ricans desperate water hurri...
8,3757.99,2022,9,22,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...",culture documentary capture complexity child i...
17,3855.93,2022,9,20,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",sports maury will shortstop dodger maury will ...
20,3855.93,2022,9,20,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",entertainment golden globe returning january o...


# Text Feature Extraction

TF-IDF is used for feature extraction for a few reasons:

1. Raw bag-of-words counts weight every token by frequency alone, whereas TF-IDF down-weights tokens that appear in most records.
2. word2vec and BERT are well suited to neural-network applications and large corpora. This corpus is small by NLP standards, and the goal is not necessarily to favor a neural network over simpler model types, so large embedding spaces are not a hard requirement for this task.

The text is vectorized as unigrams only, in order to avoid MemoryErrors.
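
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand for a three-document toy corpus. It follows the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy documents are hypothetical, and whitespace splitting stands in for the library's tokenizer.

```python
import math
from collections import Counter

docs = [
    "politics senate vote budget",
    "politics election vote",
    "travel paris hotel",
]

def tfidf(documents):
    """Return (idf per term, one {term: weight} dict per document)."""
    tokenized = [d.split() for d in documents]
    n = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        weights = {t: tf * idf[t] for t, tf in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

idf, rows = tfidf(docs)
print(round(idf["politics"], 4), round(idf["senate"], 4))
print(rows[0])
```

In the first document "senate" ends up with a larger weight than "politics", because "politics" occurs in two of the three documents.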

In [21]:
#train validation 90:10 split

df_training = df_train.sample(frac=0.90)
df_val = df_train.drop(df_training.index)

df_training.shape
df_training.head()
df_val.shape
df_val.head()

(85538, 6)

Unnamed: 0,returns,year,month,day,category_enc,corpus
44250,2279.55,2017,2,1,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",black voices twitter imago trump white house c...
88739,1995.31,2015,9,16,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",politics union chief call james blake arrest c...
130721,1900.53,2014,5,23,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",women swimsuit guide woman swimsuit guide they...
129308,1951.27,2014,6,9,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",latino voices sanction sanction forget leading...
96784,2100.44,2015,6,17,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",sports beyond number boston olympic opportunit...


(9504, 6)

Unnamed: 0,returns,year,month,day,category_enc,corpus
95435,2076.78,2015,7,2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",women resilience overcome stress stress listen...
76766,1940.24,2016,1,29,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",sports gotcha charger diego charger battle con...
136179,1866.52,2014,3,21,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",food drink grass regular taste better burger s...
141559,1828.46,2014,1,23,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",divorce dating process process begin first lik...
115743,2038.26,2014,11,10,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",parents smile goofy mouthwide handing pediatri...


In [22]:
# create master training corpus (df_train still contains the validation rows, so the vocabulary and IDF weights also see them)
corpus = (df_train['corpus']).to_list()

len(corpus)


95042

In [23]:
#TF-IDF vectorization
#strip any remaining unicode characters and set n-gram range
tfidf = TfidfVectorizer(strip_accents='unicode', ngram_range=(1,1), use_idf=True)
fit_tfidf = tfidf.fit(corpus)


In [24]:
#apply the fitted TF-IDF to vectorize the training and validation sets

vectors_train = fit_tfidf.transform(df_training['corpus'].to_list())
vectors_val = fit_tfidf.transform(df_val['corpus'].to_list())

In [25]:
#sanity check
pd.DataFrame(vectors_train[0].T.todense(), index=tfidf.get_feature_names_out(), columns=["tfidf"])

Unnamed: 0,tfidf
aaaaaah,0.0
aaaargh,0.0
aakayla,0.0
aakomon,0.0
aaliyah,0.0
...,...
zwirner,0.0
zwirners,0.0
zyola,0.0
zyrtec,0.0


In [26]:
X_train = vectors_train.toarray()
X_val = vectors_val.toarray()
y_train = df_training['returns'].to_numpy()
y_val = df_val['returns'].to_numpy()
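
Calling `.toarray()` turns the sparse TF-IDF output into a dense float64 matrix, which is the likely source of the memory pressure mentioned in this notebook. The back-of-the-envelope sketch below estimates the dense size. The row count is the training shape reported above; the vocabulary size is a hypothetical round number, since the notebook does not print it.

```python
def dense_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense matrix with 8-byte (float64) entries by default."""
    return n_rows * n_cols * bytes_per_value / 2**30

n_rows = 85538       # training rows reported above
vocab_size = 50_000  # hypothetical vocabulary size, for illustration only
estimate = dense_gib(n_rows, vocab_size)
print(f"{estimate:.1f} GiB")
```

Under that assumption the dense training matrix alone would need tens of GiB, while the sparse matrix stores only the handful of non-zero weights per row.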

### Apply feature extractions to test set based on trained tfidf

In [27]:
df_test.head()

Unnamed: 0,returns,year,month,day,category_enc,corpus
2,3693.23,2022,9,23,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",comedy funniest tweet understand could eaten
7,3757.99,2022,9,22,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",world news puerto ricans desperate water hurri...
8,3757.99,2022,9,22,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...",culture documentary capture complexity child i...
17,3855.93,2022,9,20,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",sports maury will shortstop dodger maury will ...
20,3855.93,2022,9,20,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",entertainment golden globe returning january o...


In [28]:
test_corpus = df_test['corpus'].to_list()

len(test_corpus)

23760

In [29]:
vectors_test = fit_tfidf.transform(df_test['corpus'].to_list())

X_test = vectors_test.toarray()
y_test = df_test['returns'].to_numpy()

# Feature Engineering

How do the non-corpus predictors fare in a penalized regression to predict S&P returns?

In [29]:
# correlation (numeric columns only; Styler.format(precision=...) replaces the removed set_precision)
df_train[['returns', 'year', 'month', 'day']].corr().style.background_gradient(cmap='coolwarm', vmin=-1, vmax=1).format(precision=2)


Unnamed: 0,returns,year,month,day
returns,1.0,0.91,-0.06,-0.0
year,0.91,1.0,-0.23,-0.02
month,-0.06,-0.23,1.0,-0.01
day,-0.0,-0.02,-0.01,1.0


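Year is highly correlated with the S&P values (r = 0.91), but month and day are not. The target in this dataset rose from roughly 1,600 in 2013 to roughly 3,700 in 2022, so the year largely captures that long-run trend, and the date alone may not be sufficient to predict day-to-day movements.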

In [127]:
## date features
# ElasticNet (L1, L2) regression with 10 fold cross validation

regr = ElasticNetCV(cv=10, random_state=47)
regr.fit(df_training[['year', 'month', 'day']].to_numpy(), y_train)

#validate - coef of determination R2
regr.score(df_val[['year', 'month', 'day']].to_numpy(), y_val)

#test - coef of determination R2
regr.score(df_test[['year', 'month', 'day']].to_numpy(), y_test)


0.8226728807944109

0.8181590873887096

Using the date parts alone, an elastic net regression (which regularizes the predictors by shrinking their coefficients) reaches a coefficient of determination of about 0.82 on both the validation and test sets.
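
For reference, this is the objective that scikit-learn documents for its elastic net estimators, where $n$ is the number of training rows, $\alpha$ the overall penalty strength chosen by cross-validation, and $\rho$ the `l1_ratio` (0.5 by default in `ElasticNetCV`). The symbols $X$, $y$ and $w$ are the usual design matrix, target vector and coefficient vector, introduced here only for illustration.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

The $\ell_1$ term can set coefficients exactly to zero, while the $\ell_2$ term shrinks them smoothly, which is the shrinkage referred to above.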

In [138]:
##categorical onehot encoding

# ElasticNet (L1, L2) regression with 10 fold cross validation

regr = ElasticNetCV(cv=10, random_state=47)
regr.fit(df_training['category_enc'].to_list(), y_train)

#validate - coef of determination R2
regr.score(df_val['category_enc'].to_list(), y_val)

#test - coef of determination R2
regr.score(df_test['category_enc'].to_list(), y_test)

0.1806686235894004

0.17709845285778514

The one-hot encoded category alone is insufficient for predicting S&P returns (R² of about 0.18 on both the validation and test sets).

In [144]:
##date and categorical onehot encoding together

# ElasticNet (L1, L2) regression with 10 fold cross validation

regr = ElasticNetCV(cv=10, random_state=47)
regr.fit(np.concatenate([df_training[['year', 'month', 'day']].to_numpy(), df_training['category_enc'].to_list()], axis=1), y_train)

#validate - coef of determination R2
regr.score(np.concatenate([df_val[['year', 'month', 'day']].to_numpy(), df_val['category_enc'].to_list()], axis=1), y_val)

#test - coef of determination R2
regr.score(np.concatenate([df_test[['year', 'month', 'day']].to_numpy(), df_test['category_enc'].to_list()], axis=1), y_test)

0.8228182848974891

0.8183309715701851

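Combining the date parts with the categorical encoding does not improve performance over the date parts alone: R^2 stays at ~0.82 on both datasets, so the regularized model gains essentially nothing from the category features.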

# Predictive Models

In [31]:
#linear regression

reg = LinearRegression()
# reg.fit(X_train, y_train)

reg.fit(X_val, y_val)

reg.score(X_test, y_test) #R2


-5.476625958388923e+22

Horrible prediction when using the text data to train an ordinary least squares regressor: R^2 is hugely negative, i.e. far worse than always predicting the mean.
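For reference, the `score` method of sklearn regressors reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot, which has no lower bound. The sketch below recomputes it in plain Python on made-up numbers (not the notebook's data) to show how wildly oversized predictions push it far below zero. The function name `r2_score_manual` and all values are illustrative assumptions. One plausible cause of the result above, not verified here, is that unregularized least squares on a very wide, sparse text matrix produces extreme coefficients.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not the notebook's data).
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, y_true)                    # predictions equal targets
mean_only = r2_score_manual(y_true, [2.5] * 4)               # always predict the mean
exploded = r2_score_manual(y_true, [1e6, -1e6, 1e6, -1e6])   # wildly oversized predictions

print(perfect, mean_only, exploded)
```

A perfect fit scores 1, predicting the mean scores 0, and anything worse than the mean goes negative without limit.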

In [30]:
#Stochastic Gradient Descent regression

reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=47)

reg.fit(X_val, y_val)

reg.score(X_test, y_test)



0.30815385794393013

Using an SGD regressor improves the prediction of returns from text data, but the coefficient of determination is still too low for a reliable predictor; I would like to see R^2 > 0.6 at least. Additionally, these sklearn models take too long to train on the huge feature matrix, so I had to fit on the small validation set instead.
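One possible way around the training-time limit is to feed the data to the regressor in mini-batches with `SGDRegressor.partial_fit`, so that only one batch is processed at a time. The following is a minimal sketch on synthetic data, not the notebook's text matrix; the array shapes, `batch_size`, and the number of passes are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stand-in for a large feature matrix (assumption, not the real data).
rng = np.random.default_rng(47)
n_samples, n_features = 5000, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = rng.normal(size=n_features)
y = X @ true_coef + rng.normal(scale=0.1, size=n_samples)

X_train, y_train = X[:4000], y[:4000]
X_test, y_test = X[4000:], y[4000:]

reg = SGDRegressor(random_state=47)
batch_size = 500  # hypothetical; tune to the available memory

# Several passes over the training data, one mini-batch at a time.
for epoch in range(5):
    order = rng.permutation(len(X_train))  # reshuffle every pass
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        reg.partial_fit(X_train[idx], y_train[idx])

r2 = reg.score(X_test, y_test)  # coefficient of determination on held-out rows
print(round(r2, 3))
```

With a real corpus, each batch could be read and vectorized on demand instead of being sliced from an in-memory array.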

In [57]:
#autokeras - trying an out-of-the-box search across 5 trial text regressor architectures (max_trials=5)

reg = ak.TextRegressor(project_name='trainset_test', overwrite=True, max_trials=5, metrics=['mean_squared_error',
                                                                                            'accuracy'])
reg.fit(df_val['corpus'].to_numpy(), y_val, epochs=5, shuffle=True, validation_split=0.1)

print(reg.evaluate(df_test['corpus'].to_numpy(), y_test))


Trial 5 Complete [00h 00m 28s]
val_loss: 186560.96875

Best val_loss So Far: 132005.59375
Total elapsed time: 00h 02m 24s
INFO:tensorflow:Oracle triggered exit



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




INFO:tensorflow:Assets written to: .\trainset_test\best_model\assets



<keras.callbacks.History at 0x2579a9ff010>

[139907.109375, 139907.109375, 0.0]


Training loss is much larger than validation loss, indicating that the model has not trained enough to generalize well to unseen data. These models are computationally expensive; I would re-run them on a GPU if using AutoKeras again. I also can't recover the assets file that contains the plain-text description of the best model from the trials: when using conda + Jupyter, the files saved by AutoKeras are not UTF-8 encoded and thus have saving disabled.

Not an improvement over the SGD regressor.

In [67]:
#DNN

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.6),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='Adam', loss='mean_squared_error', metrics=['accuracy'])

#train and store history
History = model.fit(x=X_val, y=y_val, epochs=3, validation_split=0.1, shuffle=True, verbose=1)
History.history


Epoch 1/3
Epoch 2/3
Epoch 3/3


{'loss': [1054927.5, 167652.453125, 134390.8125],
 'accuracy': [0.0, 0.0, 0.0],
 'val_loss': [152292.6875, 142208.453125, 151755.625],
 'val_accuracy': [0.0, 0.0, 0.0]}

Using a deep, fully connected network regularized with dropout is an improvement over the AutoKeras neural network in terms of training and validation losses.

With more time, I would train this model into the ground to minimize loss, and I would definitely experiment with the funnel architecture and test different initializers to speed up training.
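A note on the metrics: accuracy is a classification notion that counts how often a predicted label matches the true label, so with a continuous target it essentially never scores, which is consistent with the 0.0 values in the history above. Error-based metrics are more informative for regression. The sketch below uses a simplified exact-match stand-in (not Keras's actual implementation) and made-up numbers to illustrate the difference, then converts the final reported training loss (an MSE) into an RMSE in the target's own units. All function names and toy values are assumptions.

```python
import math


def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions exactly equal to the target (simplified stand-in)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between target and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only (not the notebook's data).
y_true = [4100.5, 4200.25, 4300.75]
y_pred = [4000.0, 4250.0, 4310.0]

acc = exact_match_accuracy(y_true, y_pred)  # no prediction hits its target exactly
mae = mean_absolute_error(y_true, y_pred)   # typical miss, in target units

# Taking the square root puts an MSE loss back into the target's units.
# 134390.8125 is the final training loss reported in the history above.
rmse_from_loss = math.sqrt(134390.8125)

print(acc, round(mae, 2), round(rmse_from_loss, 1))
```

In Keras, passing a metric such as `'mean_absolute_error'` in place of `'accuracy'` would report an interpretable value during training.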

In [68]:
#evaluate model
modeleval = model.evaluate(X_test, y_test, verbose=1)




Test loss is higher with the current DNN architecture than with AutoKeras, but not by much. This DNN is a much simpler architecture and required less training.

# Summary

Based on these experiments, complex models such as neural networks are necessary for predicting S&P 500 returns from Huff Post News text data. AutoKeras may be an option, as it requires less data preprocessing and can run tens or hundreds of tracked experiments without user intervention. TensorFlow and Keras may be a strong contender, since customizing the layers allows for fine-tuned training and regularization.


Simpler regression models such as ElasticNet can predict returns from the date parts of the news data; however, the slight drop in the coefficient of determination from validation (~0.823) to test (~0.818) suggests that this model may not generalize well.

A more advanced regression model such as SGD does show promise as a returns predictor using text data input if a larger training set and more epochs are employed.

* Huff Post News category alone cannot predict S&P returns.
* Year of news publication is highly correlated with S&P returns, and using the date parts as features for a regularized linear regression may be a useful predictive tool given Huff Post data.


## Limitations

I was severely limited by memory and hardware. That is acceptable for running outside of a cloud environment, as I did here, but I typically do not fully train on my local machine for staging or production deployments.


## Future Work

I would like to collect more data from different news sources to reduce bias. I would also experiment further with n-gram size, since increasing it caused memory issues here. Given that the training corpus would grow with more news sources, I would propose using word2vec or BERT embeddings instead of TF-IDF to handle the larger dimensionality and, most likely, to better train complex networks, which for this NLP task are necessary compared to simple regressors.

I would experiment more with the custom neural network architecture, adding LSTM or 1-D CNN layers and trying different layer activations. In the background I would run a larger AutoKeras experiment but would not use this over a custom architecture for production since I do not have all the metadata on the underlying model.


## Final Thoughts

I enjoyed doing an NLP regression task, as I am typically building for sentiment analysis or recommendation systems in the NLP space, which lean more toward classification. I focused primarily on data ingestion, cleaning and processing, as I believe in 'garbage in, garbage out' with models; my specialization in AI is in model architecture, but I know I can't trust the performance of anything I build unless I am certain the underlying data is prepped to be actionable. I also wanted to highlight the scientist in me: I chose to experiment with models and features rather than drilling down on an arbitrarily chosen set first. This is how I work, whether at the computer or at the wet bench, when working on something previously unreported: I run a set of quick yet robust experiments whose success and loss I know I can measure accurately and reproducibly, increasing model or experiment sophistication incrementally based on the outcomes of the last. Then, once I have narrowed the field down to one or two models that seem optimal, I drill down on regularization parameters and tune the architectures.