In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [33]:
#resources

import pandas as pd
import os
import json

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk import pos_tag

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alxra\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alxra\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

# Introduction

## Problem

Daily stock market returns are notoriously difficult to predict: they are volatile, and they are driven by many possible predictors and the interactions among them.

## Goal

To predict S&P 500 returns based on news data.

## Data

Predictors:
* Huff Post News Data (https://www.kaggle.com/datasets/rmisra/news-category-dataset)
    * **category**: category in which the article was published.
    * **headline**: the headline of the news article.
    * **authors**: list of authors who contributed to the article.
    * **link**: link to the original news article.
    * **short_description**: Abstract of the news article.
    * **date**: publication date of the article.

Target:
* S&P500 Data (https://fred.stlouisfed.org/series/SP500)
    * **Returns** (USD) between

## Methodology

1. Data ETL
2. Data Pre-Processing
3. Text predictor feature extraction (TF-IDF, BERT)
4. Feature engineering
5. Modeling
    * Logistic Regression (baseline prediction)
    * Random Forest Regression (ensemble learner prediction)
    * Autokeras (out-of-the-box neural net prediction)
    * 1D CNN (custom spatio-temporal prediction)
    * LSTM (custom time-series prediction)


# ETL

In [11]:
#predictors
news = []
with open('News_Category_Dataset_v3.json', 'r') as file:
    for line in file:
        news.append(json.loads(line))
news = pd.DataFrame.from_dict(news)

#target
returns = pd.read_csv('SP500.csv')

In [15]:
news.shape
news.head()
news.describe()
news.dtypes

returns.shape
returns.head()
returns.describe()
returns.dtypes

(209527, 6)

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


Unnamed: 0,link,headline,category,short_description,authors,date
count,209527,209527,209527,209527.0,209527.0,209527
unique,209486,207996,42,187022.0,29169.0,3890
top,https://www.huffingtonpost.comhttps://www.wash...,Sunday Roundup,POLITICS,,,2014-03-25
freq,2,90,35602,19712.0,37418.0,100


link                 object
headline             object
category             object
short_description    object
authors              object
date                 object
dtype: object

(2608, 2)

Unnamed: 0,DATE,SP500
0,2013-06-27,1613.2
1,2013-06-28,1606.28
2,2013-07-01,1614.96
3,2013-07-02,1614.08
4,2013-07-03,1615.41


Unnamed: 0,DATE,SP500
count,2608,2608
unique,2608,2504
top,2013-06-27,.
freq,1,92


DATE     object
SP500    object
dtype: object

In [17]:
# cast date columns as datetime types
news['date'] = pd.to_datetime(news['date'])

returns['DATE'] = pd.to_datetime(returns['DATE'])

# display the dtypes to confirm the conversion
news.dtypes
returns.dtypes


link                         object
headline                     object
category                     object
short_description            object
authors                      object
date                 datetime64[ns]
dtype: object

DATE     datetime64[ns]
SP500            object
dtype: object

In [27]:
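# cast the SP500 column as float; the "." placeholder seen in describe() above
# is not numeric, so errors='coerce' turns it into NaN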
returns['SP500'] = pd.to_numeric(returns['SP500'], errors='coerce')

returns.dtypes

DATE     datetime64[ns]
SP500           float64
dtype: object
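
The `SP500` values shown above (for example 1613.2) are index levels, so a return has to be derived from consecutive levels. The following is a minimal sketch under stated assumptions, not the notebook's method: the column name `daily_return` is hypothetical, and the toy frame reuses the first three rows of the table above.

```python
import pandas as pd

# Toy frame built from the first three index levels shown in the output above
levels = pd.DataFrame({
    "DATE": pd.to_datetime(["2013-06-27", "2013-06-28", "2013-07-01"]),
    "SP500": [1613.20, 1606.28, 1614.96],
})

# Simple daily return: (level_t - level_{t-1}) / level_{t-1}.
# The first row has no previous level, so its return is NaN.
levels = levels.sort_values("DATE")
levels["daily_return"] = levels["SP500"].pct_change()
```

On the full series it would probably be safer to drop the rows that were coerced to NaN before calling `pct_change`, so that each return is computed between two observed levels.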

We drop `authors` and `link` as predictors. Authors tend to write on particular topics and do not stay with the publication indefinitely, and each link is derived from the article's headline. Author is therefore likely collinear with category, and link with headline and description, so dropping both reduces multicollinearity up front.

In [31]:
# keep only the predictors of interest; .copy() gives an independent frame,
# which avoids pandas' SettingWithCopyWarning on the assignment below
data = news[['date', 'category', 'headline', 'short_description']].copy()

#map target to predictors using date
di = dict(zip(returns.DATE, returns.SP500))

data['returns'] = data['date'].map(di)

data

Unnamed: 0,date,category,headline,short_description,returns
0,2022-09-23,U.S. NEWS,Over 4 Million Americans Roll Up Sleeves For O...,Health experts said it is too early to predict...,3693.23
1,2022-09-23,U.S. NEWS,"American Airlines Flyer Charged, Banned For Li...",He was subdued by passengers and crew when he ...,3693.23
2,2022-09-23,COMEDY,23 Of The Funniest Tweets About Cats And Dogs ...,"""Until you have a dog you don't understand wha...",3693.23
3,2022-09-23,PARENTING,The Funniest Tweets From Parents This Week (Se...,"""Accidentally put grown-up toothpaste on my to...",3693.23
4,2022-09-22,U.S. NEWS,Woman Who Called Cops On Black Bird-Watcher Lo...,Amy Cooper accused investment firm Franklin Te...,3757.99
...,...,...,...,...,...
209522,2012-01-28,TECH,RIM CEO Thorsten Heins' 'Significant' Plans Fo...,Verizon Wireless and AT&T are already promotin...,
209523,2012-01-28,SPORTS,Maria Sharapova Stunned By Victoria Azarenka I...,"Afterward, Azarenka, more effusive with the pr...",
209524,2012-01-28,SPORTS,"Giants Over Patriots, Jets Over Colts Among M...","Leading up to Super Bowl XLVI, the most talked...",
209525,2012-01-28,SPORTS,Aldon Smith Arrested: 49ers Linebacker Busted ...,CORRECTION: An earlier version of this story i...,


In [32]:
#drop any rows with empty values in the target column
data = data[data['returns'].notna()]
data

Unnamed: 0,date,category,headline,short_description,returns
0,2022-09-23,U.S. NEWS,Over 4 Million Americans Roll Up Sleeves For O...,Health experts said it is too early to predict...,3693.23
1,2022-09-23,U.S. NEWS,"American Airlines Flyer Charged, Banned For Li...",He was subdued by passengers and crew when he ...,3693.23
2,2022-09-23,COMEDY,23 Of The Funniest Tweets About Cats And Dogs ...,"""Until you have a dog you don't understand wha...",3693.23
3,2022-09-23,PARENTING,The Funniest Tweets From Parents This Week (Se...,"""Accidentally put grown-up toothpaste on my to...",3693.23
4,2022-09-22,U.S. NEWS,Woman Who Called Cops On Black Bird-Watcher Lo...,Amy Cooper accused investment firm Franklin Te...,3757.99
...,...,...,...,...,...
161346,2013-06-27,STYLE & BEAUTY,Cheryl Cole's Style Evolution: From Cornrows T...,Cheryl Cole's path to fame wasn't exactly ordi...,1613.20
161347,2013-06-27,TRAVEL,Three of Europe's Most Hedonistic Cities: Part...,"Paris brings us back again and again, season a...",1613.20
161348,2013-06-27,WELLNESS,Anxiety Tied To Sleep Deprivation,"""It's been hard to tease out whether sleep los...",1613.20
161349,2013-06-27,FOOD & DRINK,Mac And Cheese Creations: Over The Top And Com...,You can add this dish to just about everything.,1613.20


# Pre-Processing

In [None]:
#text pre-processing

stemmer = SnowballStemmer("english")
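
One way the text pre-processing step could look is sketched below: lowercase, tokenize, drop stopwords, then stem. The function name `preprocess` and the constant `STOP_WORDS` are hypothetical. To keep the sketch self-contained, a regex tokenizer and a tiny hand-written stopword set stand in for `word_tokenize` and `stopwords.words('english')`, which the notebook imports but which need the downloaded NLTK corpora.

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Assumption: a tiny illustrative stopword set; the notebook would more likely
# use stopwords.words('english') from the downloaded NLTK corpus.
STOP_WORDS = frozenset({"the", "a", "an", "of", "to", "and", "in", "for", "on", "is"})

stemmer = SnowballStemmer("english")

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on alphabetic runs, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]

tokens = preprocess("The cats running to the markets")
```

Applied to the notebook's frame, this could be mapped over `data['headline']` and `data['short_description']`. Whether stemming helps downstream models is an empirical question.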

# Text Feature Extraction

In [None]:
#TF-IDF vectorization

In [None]:
#BERT embeddings

# Feature Engineering

In [None]:
# lasso regression
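
As a sketch of how lasso regression might serve as a feature-selection step, the example below fits scikit-learn's `Lasso` on synthetic data in which only the first feature drives the target. The data, the `alpha` value, and the variable names are all assumptions chosen for illustration. The L1 penalty shrinks the coefficients of uninformative features to exactly zero, which is what makes lasso usable for selection.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, but only feature 0 influences the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty (assumed value)
model = Lasso(alpha=0.1).fit(X, y)

# Indices of features whose coefficients were not shrunk to zero
selected = np.flatnonzero(model.coef_)
```

With real TF-IDF features, `alpha` would normally be tuned, for example by cross-validation, rather than fixed by hand.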

In [None]:
# ridge regression

# Predictive Models

In [None]:
#logistic regression

In [None]:
#random forest regression

In [None]:
#autokeras

In [None]:
#1D CNN

In [None]:
#LSTM