
# Predicting the Dow Jones with News

## General Data flow for a Text Related Business Problem

# Problem Statement & Reference Architecture

* **Aim**: Use Reddit News headlines to predict the movement of the Dow Jones Industrial Average.


* **Data Source**: https://www.kaggle.com/aaron7sun/stocknews


* **Data Description**: Daily Dow Jones Open, High, Low, and Close values from 2008-08-08 to 2016-07-01, plus the Reddit News headlines for those dates.


* **Methodology**: We use GloVe for the word embeddings and build the model from convolutional layers (CNN) followed by an LSTM. The model is based on the work in this paper: https://www.aclweb.org/anthology/C/C16/C16-1229.pdf.

# Installation Prerequisites

In [1]:
# !apt-get update  && apt-get install -y --allow-downgrades --no-install-recommends git wget


In [2]:
# !apt-get -y install graphviz

In [3]:
# !pip install nltk keras

In [4]:
# !pip install pydot

In [5]:
# !pip install graphviz

In [49]:
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip

--2024-11-10 12:25:14--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2024-11-10 12:25:14--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2024-11-10 12:25:14--  https://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/

In [50]:
!unzip glove.840B.300d.zip

Archive:  glove.840B.300d.zip
  inflating: glove.840B.300d.txt     


# Imports

In [8]:
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
import matplotlib.pyplot as plt

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
# Keras imports (LSTM and GRU are imported directly from keras.layers)
from keras.models import Model, Sequential
from keras import initializers, regularizers
from keras.layers import (Dropout, Activation, Embedding, Convolution1D, MaxPooling1D, Input, Dense, add,
                          BatchNormalization, Flatten, Reshape, Concatenate, LSTM, GRU)
from keras.callbacks import Callback, ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.utils import plot_model

In [12]:
#!/bin/bash
# !kaggle datasets download lykin22/stock-headlines
!kaggle datasets download aaron7sun/stocknews

Dataset URL: https://www.kaggle.com/datasets/aaron7sun/stocknews
License(s): CC-BY-NC-SA-4.0
Downloading stocknews.zip to /content
  0% 0.00/5.82M [00:00<?, ?B/s]
100% 5.82M/5.82M [00:00<00:00, 89.0MB/s]


In [13]:
!unzip stocknews.zip

Archive:  stocknews.zip
  inflating: Combined_News_DJIA.csv  
  inflating: RedditNews.csv          
  inflating: upload_DJIA_table.csv   


In [14]:
# dj = pd.read_csv("dowjones-news-data/DowJones.csv")
# news = pd.read_csv("dowjones-news-data/News.csv")

dj = pd.read_csv("upload_DJIA_table.csv")
news = pd.read_csv("RedditNews.csv", encoding='latin-1')

## Inspect the data

In [15]:
dj.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


In [16]:
dj.isnull().sum() #No missing data

Unnamed: 0,0
Date,0
Open,0
High,0
Low,0
Close,0
Volume,0
Adj Close,0


In [17]:
news.isnull().sum() #No missing data

Unnamed: 0,0
Date,0
News,0


In [18]:
news.head(2)

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host


In [19]:
# Convert 'Date' column to datetime objects
dj['Date'] = pd.to_datetime(dj['Date'])
news['Date'] = pd.to_datetime(news['Date'])

In [20]:
print(dj.shape)
print(news.shape)

(1989, 7)
(73608, 2)


In [21]:
# Compare the number of unique dates. We want matching values.
print(len(set(dj.Date)))
print(len(set(news.Date)))

1989
2943


In [22]:
# Remove the extra dates that are in news
news = news[news.Date.isin(dj.Date)]

In [23]:
print(len(set(dj.Date)))
print(len(set(news.Date)))

1989
1989


In [24]:
dj.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

In [25]:
# Convert 'Open' column to numeric, coercing errors to NaN
# dj['Open'] = pd.to_numeric(pd.to_numeric(dj['Open'].str.replace(',',''), errors='coerce'), errors='coerce')

In [26]:
dj.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

In [27]:
# Remove unwanted features - keep the 'Open' price only
dj = dj.drop(['High', 'Low', 'Close', 'Volume', 'Adj Close'], axis=1)
dj.head()

Unnamed: 0,Date,Open
0,2016-07-01,17924.240234
1,2016-06-30,17712.759766
2,2016-06-29,17456.019531
3,2016-06-28,17190.509766
4,2016-06-27,17355.210938


In [28]:
# Calculate the difference in opening prices between the following and current day.
# The model will try to predict the change in the Open value based on today's news.
dj = dj.set_index('Date')
dj.head()

Unnamed: 0_level_0,Open
Date,Unnamed: 1_level_1
2016-07-01,17924.240234
2016-06-30,17712.759766
2016-06-29,17456.019531
2016-06-28,17190.509766
2016-06-27,17355.210938


In [29]:
# Target variable = Tomorrow's Open Price - Today's Open Price
dj = -1 * dj.diff(periods=1)
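
Because the rows are sorted newest-first, `diff(periods=1)` subtracts the following day's open from the current day's, so multiplying by -1 yields tomorrow minus today. A minimal pure-Python sketch of the same arithmetic, using the first five `Open` values shown earlier (the names `dates`, `opens_newest_first`, and `targets` are only for illustration):

```python
# Open prices in the same newest-first order as the DataFrame
dates = ["2016-07-01", "2016-06-30", "2016-06-29", "2016-06-28", "2016-06-27"]
opens_newest_first = [17924.240234, 17712.759766, 17456.019531,
                      17190.509766, 17355.210938]

# The row above each date holds the following trading day, so
# target(today) = open(tomorrow) - open(today)
targets = {}
for i in range(1, len(opens_newest_first)):
    change = opens_newest_first[i - 1] - opens_newest_first[i]
    targets[dates[i]] = round(change, 6)

print(targets)
```

The newest date has no following day in the data, which is why its target is missing and the row is dropped afterwards.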

In [30]:
dj

Unnamed: 0_level_0,Open
Date,Unnamed: 1_level_1
2016-07-01,
2016-06-30,211.480468
2016-06-29,256.740235
2016-06-28,265.509765
2016-06-27,-164.701172
...,...
2008-08-14,79.139649
2008-08-13,-100.739258
2008-08-12,-148.890625
2008-08-11,52.030273


In [31]:
# Remove top row since it has a null value.
dj = dj[dj.Open.notnull()]

In [32]:
# Check if there are any more null values.
dj.isnull().sum()

Unnamed: 0,0
Open,0


## Combine the two datasets - For each date, get all the headlines and the price

In [33]:
dj.reset_index(inplace=True)

In [34]:
# Create a list of the opening prices and their corresponding daily headlines from the news
# Define/Initialize the variables
price = []
headlines = []

# For all the rows in the dataframe
for row in dj.iterrows():
    # define a new variable to store all the headlines for the day
    daily_headlines = []
    # Spot the date in the given row
    date = row[1]['Date']
    # Store the price for the date
    price.append(row[1]['Open'])
    for row_ in news[news.Date==date].iterrows():
        daily_headlines.append(row_[1]['News'])

    # Append the headlines for the date
    headlines.append(daily_headlines)
    # Track progress
    if len(price) % 500 == 0:
        print(len(price))

500
1000
1500


<table size="100">
    <tr>
        <td>headlines</td>
        <td>price</td>
    </tr>
    <tr>
        <td>headline-1, headline-2 ..., headline-n</td>
        <td>211.48</td>
    </tr>
</table>
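
The nested `iterrows` loops above rescan the whole `news` table once per trading day. As a sketch of an alternative, the same pairing can be built in a single pass with plain Python containers. The rows and the names `rows`, `targets`, `by_date`, `price_list`, and `headline_lists` below are made up for illustration and are not part of the notebook:

```python
from collections import defaultdict

# (date, headline) pairs standing in for the rows of `news`
rows = [
    ("2016-06-30", "headline-1"),
    ("2016-06-30", "headline-2"),
    ("2016-06-29", "headline-3"),
]
# date -> target value, standing in for `dj`
targets = {"2016-06-30": 211.48, "2016-06-29": 256.74}

# Group the headlines by date in one pass over the news rows
by_date = defaultdict(list)
for date, headline in rows:
    by_date[date].append(headline)

# Build the two parallel lists in the order of the price table
price_list = []
headline_lists = []
for date, value in targets.items():
    price_list.append(value)
    headline_lists.append(by_date[date])

print(price_list)
print(headline_lists)
```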

In [35]:
news

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host
2,2016-07-01,"The president of France says if Brexit won, so..."
3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...
4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...
...,...,...
72078,2008-08-08,b'Why the Pentagon Thinks Attacking Iran is a ...
72079,2008-08-08,b'Caucasus in crisis: Georgia invades South Os...
72080,2008-08-08,b'Indian shoe manufactory - And again in a se...
72081,2008-08-08,b'Visitors Suffering from Mental Illnesses Ban...


In [36]:
# Check what the headlines look like
headlines[:3], price[:3]

([['Jamaica proposes marijuana dispensers for tourists at airports following legalisation: The kiosks and desks would give people a license to purchase up to 2 ounces of the drug to use during their stay',
   "Stephen Hawking says pollution and 'stupidity' still biggest threats to mankind: we have certainly not become less greedy or less stupid in our treatment of the environment over the past decade",
   'Boris Johnson says he will not run for Tory party leadership',
   'Six gay men in Ivory Coast were abused and forced to flee their homes after they were pictured signing a condolence book for victims of the recent attack on a gay nightclub in Florida',
   'Switzerland denies citizenship to Muslim immigrant girls who refused to swim with boys: report',
   'Palestinian terrorist stabs israeli teen girl to death in her bedroom',
   'Puerto Rico will default on $1 billion of debt on Friday',
   'Republic of Ireland fans to be awarded medal for sportsmanship by Paris mayor.',
   "Afghan s

## Clean up the price list

In [37]:
price[:2]

[211.48046800000157, 256.7402349999975]

In [38]:
# Min-max normalize the daily changes in opening price (target values)
max_price = max(price)
min_price = min(price)
mean_price = np.mean(price)
def normalize(price):
    return ((price-min_price)/(max_price-min_price))

In [39]:
norm_price = []
for p in price:
    norm_price.append(normalize(p))

In [40]:
# Check that normalization worked well
print(min(norm_price))
print(max(norm_price))
print(np.mean(norm_price))

0.0
1.0
0.4551577545098642
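
The min-max scaling above maps every price change into the range 0 to 1, so any prediction made on that scale has to be mapped back before it can be read as index points. A minimal sketch of the round trip, using invented bounds (`lo`, `hi`, and the helper `unnormalize` are assumptions for illustration, not values or functions from the notebook):

```python
def normalize(value, lo, hi):
    """Min-max scale `value` from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def unnormalize(scaled, lo, hi):
    """Invert `normalize`: map a value in [0, 1] back to [lo, hi]."""
    return scaled * (hi - lo) + lo


# Invented bounds for illustration only
lo, hi = -800.0, 900.0
change = 211.48

scaled = normalize(change, lo, hi)
restored = unnormalize(scaled, lo, hi)
print(scaled, restored)
```

In the notebook, `min_price` and `max_price` would play the roles of `lo` and `hi`.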


In [41]:
# Compare the number of headlines for each day
print(max(len(i) for i in headlines))
print(min(len(i) for i in headlines))
print(np.mean([len(i) for i in headlines]))

25
22
24.996478873239436


In [42]:
norm_price[:2]

[0.5780280759194737, 0.6047364662478155]

## Clean up the headlines list

In [43]:
# Expand contractions
def decontracted(phrase):
    if "'" in phrase:
        # specific
        phrase = re.sub(r"won't", "will not", phrase)
        phrase = re.sub(r"can\'t", "can not", phrase)

        # general
        phrase = re.sub(r"n\'t", " not", phrase)
        phrase = re.sub(r"\'re", " are", phrase)
        phrase = re.sub(r"\'s", " is", phrase)
        phrase = re.sub(r"\'d", " would", phrase)
        phrase = re.sub(r"\'ll", " will", phrase)
        phrase = re.sub(r"\'t", " not", phrase)
        phrase = re.sub(r"\'ve", " have", phrase)
        phrase = re.sub(r"\'m", " am", phrase)
    return phrase

text = "I should've gone to dentist so my teeth wouldn't hurt"
text1 = "But I am good now"
print(decontracted(text))
print(decontracted(text1))

I should have gone to dentist so my teeth would not hurt
But I am good now
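
The patterns above match an apostrophe anywhere in the word, so an opening quote followed by `s` is rewritten too: `'stupidity'` becomes ` istupidity'`, which is why `istupidity` shows up in the cleaned headlines later on. One possible way to avoid this, sketched here as a hypothetical variant (`decontracted_strict` is not part of the notebook), is to expand only when the apostrophe directly follows a letter:

```python
import re

# Each pattern requires a letter right before the apostrophe (or a whole-word
# match), so quotation marks around a word are left alone.
_CONTRACTIONS = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "can not"),
    (r"(?<=[a-z])n't\b", " not"),
    (r"(?<=[a-z])'re\b", " are"),
    (r"(?<=[a-z])'s\b", " is"),
    (r"(?<=[a-z])'d\b", " would"),
    (r"(?<=[a-z])'ll\b", " will"),
    (r"(?<=[a-z])'ve\b", " have"),
    (r"(?<=[a-z])'m\b", " am"),
]


def decontracted_strict(phrase):
    """Expand common English contractions without touching quoted words."""
    for pattern, replacement in _CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase, flags=re.IGNORECASE)
    return phrase


print(decontracted_strict("I should've gone to dentist so my teeth wouldn't hurt"))
print(decontracted_strict("'stupidity'"))
```

Like the original, this still expands possessives such as `hawking's` to `hawking is`; it only addresses the quoted-word case.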


In [44]:
def clean_text(text):
    '''Remove unwanted characters and format the text so that fewer words lack a word embedding'''

    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms, word by word
    text = " ".join(decontracted(word) for word in text.split())

    # Format words and remove unwanted characters
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'0,0', '00', text)
    text = re.sub(r'[_"\-;%()|.,+&=*%.,!?:#@\[\]]', ' ', text)
    text = re.sub(r'\'', ' ', text)
    text = re.sub(r'\$', ' $ ', text)
    text = re.sub(r'u s ', ' united states ', text)
    text = re.sub(r'u n ', ' united nations ', text)
    text = re.sub(r'u k ', ' united kingdom ', text)
    text = re.sub(r'j k ', ' jk ', text)
    text = re.sub(r' s ', ' ', text)
    text = re.sub(r' yr ', ' year ', text)
    text = re.sub(r' l g b t ', ' lgbt ', text)
    text = re.sub(r'0km ', '0 km ', text)

    # Remove stop words
    stops = set(stopwords.words("english"))
    text = " ".join(w for w in text.split() if w not in stops)

    return text

In [45]:
# Clean the headlines
clean_headlines = []

for daily_headlines in headlines:
    clean_daily_headlines = []
    for headline in daily_headlines:
        clean_daily_headlines.append(clean_text(headline))
    clean_headlines.append(clean_daily_headlines)

In [46]:
# Take a look at some headlines to ensure everything was cleaned well
clean_headlines[:2]

[['jamaica proposes marijuana dispensers tourists airports following legalisation kiosks desks would give people license purchase 2 ounces drug use stay',
  'stephen hawking says pollution istupidity still biggest threats mankind certainly become less greedy less stupid treatment environment past decade',
  'boris johnson says run tory party leadership',
  'six gay men ivory coast abused forced flee homes pictured signing condolence book victims recent attack gay nightclub florida',
  'switzerland denies citizenship muslim immigrant girls refused swim boys report',
  'palestinian terrorist stabs israeli teen girl death bedroom',
  'puerto rico default $ 1 billion debt friday',
  'republic ireland fans awarded medal sportsmanship paris mayor',
  'afghan suicide bomber kills 40 bbc news',
  'us airstrikes kill least 250 isis fighters convoy outside fallujah official says',
  'turkish cop took istanbul gunman hailed hero',
  'cannabis compounds could treat alzheimer removing plaque formin

In [47]:
# Count the unique words and how often each occurs in the headlines

from collections import Counter

def unique_words_count(headlines):
  """
  Calculates the frequency of unique words across all headlines.

  Args:
    headlines: A list of daily headline lists, where each inner list holds the headline strings for one day.

  Returns:
    A Counter object where keys are unique words and values are their frequencies.
  """
  word_counts = Counter()
  for daily_headlines in headlines:
    for headline in daily_headlines:
      words = headline.split()
      word_counts.update(words)
  return word_counts

unique_word_counts = unique_words_count(clean_headlines)
unique_word_counts

Counter({'jamaica': 23,
         'proposes': 65,
         'marijuana': 216,
         'dispensers': 1,
         'tourists': 95,
         'airports': 33,
         'following': 219,
         'legalisation': 4,
         'kiosks': 2,
         'desks': 1,
         'would': 888,
         'give': 289,
         'people': 1887,
         'license': 21,
         'purchase': 26,
         '2': 733,
         'ounces': 2,
         'drug': 648,
         'use': 605,
         'stay': 107,
         'stephen': 59,
         'hawking': 23,
         'says': 2563,
         'pollution': 135,
         'istupidity': 2,
         'still': 326,
         'biggest': 331,
         'threats': 118,
         'mankind': 13,
         'certainly': 12,
         'become': 314,
         'less': 204,
         'greedy': 6,
         'stupid': 27,
         'treatment': 133,
         'environment': 96,
         'past': 242,
         'decade': 109,
         'boris': 23,
         'johnson': 24,
         'run': 291,
         'tory': 24

# A note on Word Embeddings

![word_embed](https://github.com/dasarpai/DAI-Projects/blob/main/BFSI/DoeJones-Prediction-with-News/resources/wordvectors.png?raw=1)

**Reference**: https://nlp.stanford.edu/projects/glove/

## We will use GloVe embeddings to initialize the weights of the embedding layer in our neural network. Let's load them so that we can match the vocabulary of our headline corpus against GloVe's vocabulary wherever possible.

In [51]:
# Load GloVe's embeddings
embeddings_index = {}
with open('glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

Word embeddings: 2196016


## GloVe will not have an embedding for every word in our headlines. To limit such cases, we restrict the vocabulary with a simple rule: remove the words that are both "rare" and not available in GloVe.

In [52]:
# Limit the vocab that we will use to words that appear ≥ threshold or are in GloVe

# Define threshold
threshold = 10

# Dictionary to convert words to integers
vocab_to_int = {}

value = 0
for word, count in unique_word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1
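
The rule in the cell above keeps a word when it is frequent or when it has a pretrained vector, so only words that fail both tests are dropped. A small sketch with toy counts and a made-up stand-in for the GloVe vocabulary (`glove_vocab` and its contents are assumptions for illustration):

```python
from collections import Counter

# Toy word counts and a made-up set standing in for the GloVe vocabulary
word_counts = Counter({"people": 1887, "dispensers": 1, "istupidity": 2, "brexit": 120})
glove_vocab = {"people", "dispensers"}
threshold = 10

vocab_to_int = {}
for word, count in word_counts.items():
    # Keep a word if it is frequent OR has a pretrained vector
    if count >= threshold or word in glove_vocab:
        vocab_to_int[word] = len(vocab_to_int)

print(vocab_to_int)
```

Here `dispensers` survives despite appearing once because it has a vector, `brexit` survives on frequency alone, and `istupidity` is dropped because it is both rare and unknown.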

In [53]:
vocab_to_int

{'jamaica': 0,
 'proposes': 1,
 'marijuana': 2,
 'dispensers': 3,
 'tourists': 4,
 'airports': 5,
 'following': 6,
 'legalisation': 7,
 'kiosks': 8,
 'desks': 9,
 'would': 10,
 'give': 11,
 'people': 12,
 'license': 13,
 'purchase': 14,
 '2': 15,
 'ounces': 16,
 'drug': 17,
 'use': 18,
 'stay': 19,
 'stephen': 20,
 'hawking': 21,
 'says': 22,
 'pollution': 23,
 'still': 24,
 'biggest': 25,
 'threats': 26,
 'mankind': 27,
 'certainly': 28,
 'become': 29,
 'less': 30,
 'greedy': 31,
 'stupid': 32,
 'treatment': 33,
 'environment': 34,
 'past': 35,
 'decade': 36,
 'boris': 37,
 'johnson': 38,
 'run': 39,
 'tory': 40,
 'party': 41,
 'leadership': 42,
 'six': 43,
 'gay': 44,
 'men': 45,
 'ivory': 46,
 'coast': 47,
 'abused': 48,
 'forced': 49,
 'flee': 50,
 'homes': 51,
 'pictured': 52,
 'signing': 53,
 'condolence': 54,
 'book': 55,
 'victims': 56,
 'recent': 57,
 'attack': 58,
 'nightclub': 59,
 'florida': 60,
 'switzerland': 61,
 'denies': 62,
 'citizenship': 63,
 'muslim': 64,
 'immigra

In [54]:
len(vocab_to_int)

31295

In [55]:
# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>"]

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(unique_word_counts),4)*100

print("Total Number of Unique Words:", len(unique_word_counts))
print("Number of Words we will use:", len(vocab_to_int))
print("Percent of Words we will use: {}%".format(usage_ratio))

Total Number of Unique Words: 36311
Number of Words we will use: 31297
Percent of Words we will use: 86.19%


## Words that are common in the headlines but absent from the GloVe vocabulary are initialized randomly. During training, these vectors are fine-tuned along with the GloVe vectors.

In [56]:
# Need to use 300 for embedding dimensions to match GloVe's vectors.
embedding_dim = 300

nb_words = len(vocab_to_int)
# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim))
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in GloVe, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

31297
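
The matrix construction above can be illustrated on a toy scale. This is a sketch under stated assumptions: the 4-dimensional vectors, the `pretrained` dictionary standing in for GloVe, and the seeded generator `rng` are invented for the example.

```python
import numpy as np

embedding_dim = 4  # toy size; the notebook uses 300
pretrained = {"people": np.ones(embedding_dim, dtype="float32")}  # stand-in for GloVe
vocab_to_int = {"people": 0, "brexit": 1, "<UNK>": 2, "<PAD>": 3}

rng = np.random.default_rng(0)  # seeded so the toy example is repeatable
matrix = np.zeros((len(vocab_to_int), embedding_dim))
for word, idx in vocab_to_int.items():
    if word in pretrained:
        # Copy the pretrained vector into the row for this word id
        matrix[idx] = pretrained[word]
    else:
        # No pretrained vector: start from random values in [-1, 1)
        matrix[idx] = rng.uniform(-1.0, 1.0, embedding_dim)

print(matrix.shape)
```

Row `i` of the matrix is the starting vector for the word whose id is `i`, which is the layout the embedding layer expects when the matrix is passed in as its initial weights.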


## Convert the word sequences to equivalent integer sequences so that they can be used as input to the model

In [57]:
# Change the text from words to integers
# If word is not in vocab, replace it with <UNK> (unknown)
word_count = 0
unk_count = 0

headlines_sequence = []

for daily_headline in clean_headlines:
    daily_headlines_seq = []
    for headline in daily_headline:
        headline_seq = []
        for word in headline.split():
            word_count += 1
            if word in vocab_to_int:
                headline_seq.append(vocab_to_int[word])
            else:
                headline_seq.append(vocab_to_int["<UNK>"])
                unk_count += 1
        daily_headlines_seq.append(headline_seq)
    headlines_sequence.append(daily_headlines_seq)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Total number of words in headlines: 616686
Total number of UNKs in headlines: 7139
Percent of words that are UNK: 1.16%
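
The lookup with an `<UNK>` fallback can be written compactly with `dict.get`. A minimal sketch with a toy vocabulary (the helper `encode` and the ids are hypothetical, chosen only for the example):

```python
vocab_to_int = {"boris": 0, "johnson": 1, "says": 2, "<UNK>": 3, "<PAD>": 4}


def encode(headline, vocab):
    """Map each whitespace-separated word to its id, or to <UNK> if unseen."""
    unk = vocab["<UNK>"]
    return [vocab.get(word, unk) for word in headline.split()]


encoded = encode("boris johnson says istupidity", vocab_to_int)
unk_share = encoded.count(vocab_to_int["<UNK>"]) / len(encoded)
print(encoded, unk_share)
```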


In [58]:
len(headlines_sequence)
type(headlines_sequence[0][0][0])

int

In [59]:
len(headlines_sequence[0:1][0][2])

7

In [60]:
a=headlines_sequence[:1]
for k in range(len(headlines_sequence)):
    for i in range(len(headlines_sequence[k])) :
        print(headlines_sequence[k][i], end="\n")
        #for j in range(len(headlines_sequence[k][i])) :
        #    print(headlines_sequence[k][i], end="\n")
    print("================")

Streaming output truncated to the last 5000 lines.
[5967, 212, 1011, 18152, 25281, 3202, 44, 29739, 27939, 31295, 31295]
[5967, 7349, 9455, 1037, 7921, 7737, 1798]
[5967, 5535, 801, 5405, 30236, 4658, 13977, 211, 101]
[5967, 25272, 2370, 341, 6864, 876, 2729, 514, 184]
[5967, 367, 476, 2584, 2165, 199, 11155, 503, 967, 5289, 5290, 22310, 823, 184, 11411, 102, 7628]
[5967, 162, 3013, 3213, 6099, 825, 21310, 45, 3645, 3455]
[5967, 29162, 7770, 8040, 31295, 2709, 4422, 562]
[5967, 28146, 2056, 14665, 2022, 2205, 2712, 29710]
[5967, 16118, 876, 18214, 4103, 11917, 31295, 18762, 437, 4037, 4431]
[5967, 294, 14146, 5948, 3153, 2545, 5185, 4405]
[5967, 711, 5254, 1823, 71, 416, 4872, 5909, 3895, 484, 12, 527, 71, 5314, 427, 30237]
[5967, 823, 45, 2262, 162, 163, 22996, 4060, 10784, 730, 23403, 1551, 2474]
[5967, 4719, 5549, 142, 638, 3922, 972, 786, 15183, 551, 3395, 3512, 9807]
[5967, 4760, 4761, 2044, 698, 1482, 18035, 804, 453, 21828, 496, 83, 514, 3029, 1530, 105, 613, 3029,

## Handle the variation in the number of headlines per day and in the length of each headline by inspecting the distribution of headline lengths and then fixing a maximum length for both

In [61]:
# Find the length of headlines
lengths = []
for headlines in headlines_sequence:
    for headline in headlines:
        lengths.append(len(headline))

# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])

In [62]:
lengths.describe()

Unnamed: 0,counts
count,49693.0
mean,12.409917
std,6.789827
min,1.0
25%,7.0
50%,10.0
75%,16.0
max,41.0


## Limit the length of a day's news to 200 words and the length of any headline to 16 words. These values keep the training time reasonable while balancing the number of headlines used against the number of words kept from each headline.

In [63]:
max_headline_length = 16
max_daily_length = 200
pad_headlines = []

# For each date in all the dates available
for headlines in headlines_sequence:
    pad_daily_headlines = []
    # for each headline for each date
    for headline in headlines:
        # Add headline if it is less than max length
        if len(headline) <= max_headline_length:
            for word in headline:
                pad_daily_headlines.append(word)
        # Limit headline if it is more than max length
        else:
            headline = headline[:max_headline_length]
            for word in headline:
                pad_daily_headlines.append(word)

    # Pad daily_headlines if they are less than max length
    if len(pad_daily_headlines) < max_daily_length:
        for i in range(max_daily_length-len(pad_daily_headlines)):
            pad = vocab_to_int["<PAD>"]
            pad_daily_headlines.append(pad)
    # Limit daily_headlines if they are more than max length
    else:
        pad_daily_headlines = pad_daily_headlines[:max_daily_length]
    pad_headlines.append(pad_daily_headlines)
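
The truncate-then-pad logic above can be condensed into one helper. The sketch below uses toy ids and small limits so the result is easy to read; `pad_day` and the pad id 99 are hypothetical stand-ins, not names from the notebook.

```python
def pad_day(day_headlines, pad_id, max_headline_length=16, max_daily_length=200):
    """Flatten one day's headlines into a fixed-length list of word ids."""
    flat = []
    for headline in day_headlines:
        # Keep at most max_headline_length ids from each headline
        flat.extend(headline[:max_headline_length])
    # Cap the whole day, then right-pad short days up to the fixed length
    flat = flat[:max_daily_length]
    flat += [pad_id] * (max_daily_length - len(flat))
    return flat


# Toy ids; 99 stands in for vocab_to_int["<PAD>"]
day = [[1, 2, 3], list(range(10, 40))]
padded = pad_day(day, pad_id=99, max_headline_length=16, max_daily_length=25)
print(padded)
```

With the notebook's limits, a day with 25 headlines of up to 16 words can produce as many as 400 ids, so on busy days the headlines beyond the first 200 ids are cut off.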

## Split the data into training and testing sets.
## The validation set is created during training.

In [64]:
x_train, x_test, y_train, y_test = train_test_split(pad_headlines, norm_price, test_size = 0.15, random_state = 2)

x_train = np.array(x_train)
x_test = np.array(x_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

In [65]:
# Check the lengths
print(len(x_train))
print(len(x_test))

1689
299
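
The sizes printed above, and the validation share carved out later during training, follow from simple arithmetic. The sketch below assumes the usual rounding behaviour: scikit-learn rounds the test share up, and Keras takes the validation samples from the end of the training arrays. The batch sizes are the ones used for fitting (128) and Keras' default for prediction (32).

```python
import math

n_samples = 1988          # trading days with a target (1689 + 299)
test_size = 0.15
validation_split = 0.15
fit_batch_size = 128
predict_batch_size = 32   # Keras default when no batch size is given

n_test = math.ceil(n_samples * test_size)                    # test rows
n_train_total = n_samples - n_test                           # rows passed to fit
n_fit = math.floor(n_train_total * (1 - validation_split))   # rows trained on
n_val = n_train_total - n_fit                                # rows held out for validation

steps_per_epoch = math.ceil(n_fit / fit_batch_size)
predict_steps = math.ceil(n_test / predict_batch_size)

print(n_test, n_train_total, n_fit, n_val, steps_per_epoch, predict_steps)
```

Under these assumptions the model trains on 1435 days and validates on 254, and the step counts agree with the 12 steps per epoch and 10 prediction steps in the logs further down.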


# Model Building

## 1. Define the hyperparameters

In [66]:
filter_length = 5
dropout = 0.5
learning_rate = 0.001
weights = initializers.TruncatedNormal(mean=0.0, stddev=0.1, seed=2)
nb_filter = 16
rnn_output_size = 128
hidden_dims = 128
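
These hyperparameters, together with the vocabulary size and embedding dimension from earlier cells, fix the size of every layer in the model defined in the next section (embedding, two 1D convolutions, LSTM, dense, output). As a rough sanity check, the parameter counts can be worked out by hand with the standard per-layer formulas; this is a back-of-the-envelope sketch, not output from the notebook.

```python
nb_words, embedding_dim = 31297, 300   # vocabulary size and GloVe dimension from earlier cells
filter_length, nb_filter = 5, 16
rnn_output_size, hidden_dims = 128, 128

params = {
    # One vector per vocabulary entry
    "embedding": nb_words * embedding_dim,
    # (kernel width * input channels + bias) per filter
    "conv1d_1": (filter_length * embedding_dim + 1) * nb_filter,
    "conv1d_2": (filter_length * nb_filter + 1) * nb_filter,
    # 4 gates, each with input weights, recurrent weights and a bias
    "lstm": 4 * ((nb_filter + rnn_output_size) * rnn_output_size + rnn_output_size),
    "dense": rnn_output_size * hidden_dims + hidden_dims,
    "output": hidden_dims + 1,
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total:,}")
```

Almost all of the roughly 9.5 million parameters sit in the embedding layer, which is worth keeping in mind given that there are fewer than 2,000 training days.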

## 2. Create the model

In [67]:
def build_model():
    model = Sequential()

    # Layer 1 - Embedding, initialized from the GloVe-based matrix
    model.add(Embedding(nb_words,
                        embedding_dim,
                        weights=[word_embedding_matrix],
                        input_length=max_daily_length))  # Remove `input_length` if your Keras version rejects it
    model.add(Dropout(dropout))

    # Layer 2 - Convolution 1 with dropout
    model.add(Convolution1D(filters=nb_filter,
                            kernel_size=filter_length,
                            padding='same',
                            activation='relu'))
    model.add(Dropout(dropout))

    # Layer 3 - Convolution 2 with dropout
    model.add(Convolution1D(filters=nb_filter,
                            kernel_size=filter_length,
                            padding='same',
                            activation='relu'))
    model.add(Dropout(dropout))

    # Layer 4 - RNN with dropout
    model.add(LSTM(rnn_output_size,
                   activation=None,
                   kernel_initializer=weights,
                   dropout=dropout))

    # Layer 5 - Dense FFN with dropout
    model.add(Dense(hidden_dims, kernel_initializer=weights))
    model.add(Dropout(dropout))

    # Output layer - a single value, the normalized price change
    model.add(Dense(1, kernel_initializer=weights, name='output'))

    # Compile the model (`learning_rate` replaces the older `lr` keyword)
    model.compile(loss='mean_squared_error',
                  optimizer=Adam(learning_rate=learning_rate, clipvalue=1.0))
    return model

## 3. Fit the model


In [69]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to only allocate memory as needed
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)


In [70]:
model = build_model()

save_best_weights = 'best_weights.keras'

callbacks = [
    ModelCheckpoint(save_best_weights, monitor='val_loss', save_best_only=True),
    EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto'),
    ReduceLROnPlateau(monitor='val_loss', factor=0.2, verbose=1, patience=3)
]

# Pass x_train directly (not wrapped in a list); 15% of it is held out for validation
history = model.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=100,
    validation_split=0.15,
    verbose=True,
    shuffle=True,
    callbacks=callbacks
)

model.summary()

Epoch 1/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 27s 620ms/step - loss: 0.1116 - val_loss: 0.0462 - learning_rate: 0.0010
Epoch 2/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 2s 217ms/step - loss: 0.0383 - val_loss: 0.0384 - learning_rate: 0.0010
Epoch 3/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 2s 147ms/step - loss: 0.0190 - val_loss: 0.0226 - learning_rate: 0.0010
Epoch 4/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 2s 134ms/step - loss: 0.0154 - val_loss: 0.0124 - learning_rate: 0.0010
Epoch 5/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 3s 234ms/step - loss: 0.0129 - val_loss: 0.0092 - learning_rate: 0.0010
Epoch 6/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 2s 143ms/step - loss: 0.0116 - val_loss: 0.0087 - learning_rate: 0.0010
Epoch 7/100
12/12 ━━━━━━━━━━━━━━━━━━━━ 3s 163ms/step - loss: 0.0108 - val_loss: 0.0079 - learn


## 4. Predict using the model

In [72]:
predictions = model.predict([x_test], verbose = True)

[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 124ms/step


In [73]:
predictions = model.predict(x_test, verbose=True) # Remove the square brackets around x_test

[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 84ms/step


In [74]:
import numpy as np

# Convert x_test to the expected data type if necessary
x_test = x_test.astype(np.float32)

# Reshape x_test to match the input shape expected by the model
# Assuming the input shape is (200, 1) for example:
x_test = x_test.reshape(-1, 200, 1)

predictions = model.predict(x_test, verbose=True)  # Pass the array directly, not wrapped in a list

10/10 ━━━━━━━━━━━━━━━━━━━━ 2s 99ms/step


In [75]:
import numpy as np

# Convert x_test to the expected data type if necessary
x_test = x_test.astype(np.float32)

# Reshape x_test to match the input shape expected by the model
# Assuming the input shape is (200, 1) for example:
x_test = x_test.reshape(-1, 200, 1)

# Check for NaN values in x_test (a float array cannot hold None)
if np.any(np.isnan(x_test)):
    print("x_test contains NaN values!")
    # Replace NaN values with 0 (imputing with the mean or median is an alternative)
    x_test = np.nan_to_num(x_test)

In [86]:
predictions = model.predict(x_test, verbose=True)  # Pass the array directly, not wrapped in a list

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step


In [87]:
import numpy as np

# Convert x_test to the expected data type if necessary
x_test = x_test.astype(np.float32)

# Reshape x_test to match the input shape expected by the model
# Assuming the input shape is (200, 1) for example:
x_test = x_test.reshape(-1, 200, 1)

# Check for NaN or infinite values in x_test (a float32 array cannot hold None)
if not np.all(np.isfinite(x_test)):
    print("x_test contains NaN or infinite values!")
    # Replace NaN and +/- infinity with 0 (the mean or median are alternatives)
    x_test = np.nan_to_num(x_test, nan=0.0, posinf=0.0, neginf=0.0)
    # To drop the affected samples instead (apply the same mask to y_test):
    # bad_rows = np.any(~np.isfinite(x_test), axis=(1, 2))
    # x_test = x_test[~bad_rows]


predictions = model.predict(x_test, verbose=True)  # Pass the array directly, not wrapped in a list

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step


In [88]:
# Compare testing loss to training and validating loss
mse(y_test, predictions)

0.007533114229954784

In [89]:
# Revert prediction back to actual scale
def unnormalize(price):
    '''Revert values to their unnormalized amounts'''
    return price * (max_price - min_price) + min_price
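
The formula in `unnormalize` is the inverse of min-max scaling. The sketch below shows the round trip, assuming the forward step was the matching min-max normalization. The `normalize` helper and the bounds used for `min_price` and `max_price` are hypothetical; the notebook derives the real bounds from the data.

```python
# Hypothetical bounds for the daily price change, for illustration only.
min_price = -700.0
max_price = 900.0


def normalize(price):
    """Min-max scale a price change into the [0, 1] range."""
    return (price - min_price) / (max_price - min_price)


def unnormalize(price):
    """Invert normalize(): map a [0, 1] value back to index points."""
    return price * (max_price - min_price) + min_price


scaled = normalize(100.0)
restored = unnormalize(scaled)
print(scaled, restored)  # 0.5 100.0
```

Under these assumed bounds an error of 0.01 on the normalized scale corresponds to 16 index points, which is why errors are easier to interpret after back-scaling.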

In [90]:
# Store back-scaled predictions
unnorm_predictions = [unnormalize(pred) for pred in predictions]

# Store back-scaled actuals
unnorm_y_test = [unnormalize(y) for y in y_test]

In [91]:
# Calculate the median absolute error for the predictions
mae(unnorm_y_test, unnorm_predictions)

84.98162773046897
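
To put a median absolute error in context, it can help to compare it with a naive baseline that always predicts no change. The sketch below is a pure-Python illustration, assuming `mae` is scikit-learn's `median_absolute_error` as the cell comment suggests. The `actual` and `predicted` values are made up and are not results from this model.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between actual and predicted values."""
    return median([abs(t - p) for t, p in zip(y_true, y_pred)])


# Hypothetical daily price changes, for illustration only.
actual = [10.0, -50.0, 80.0, -600.0, 20.0]
predicted = [0.0, -30.0, 60.0, -100.0, 30.0]
no_change = [0.0] * len(actual)  # naive baseline: the Dow never moves

model_error = median_absolute_error(actual, predicted)
baseline_error = median_absolute_error(actual, no_change)
print(model_error, baseline_error)  # 20.0 50.0
```

Because the median ignores the size of the largest misses, the single 500-point error on the fourth day does not affect the score. That makes the metric robust to outliers, but it also hides how badly the model can miss on volatile days.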

In [92]:
pd.Series(unnorm_y_test).describe()

| Statistic | Value |
|---|---|
| count | 299.0 |
| mean | 7.094101 |
| std | 139.532324 |
| min | -673.139648 |
| 25% | -54.689941 |
| 50% | 10.759766 |
| 75% | 87.465332 |
| max | 541.050782 |
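
The summary above describes the size of the daily changes. Since the stated aim is to predict the direction of the Dow's movement, one possible complementary check is directional accuracy: the share of days on which the predicted and actual changes have the same sign. The helper `directional_accuracy` and the sample values below are hypothetical and are not part of the original notebook.

```python
def directional_accuracy(y_true, y_pred):
    """Fraction of days where predicted and actual changes share a sign."""
    hits = sum((t > 0) == (p > 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical daily price changes, for illustration only.
actual = [10.0, -50.0, 80.0, -600.0, 20.0]
predicted = [5.0, 30.0, 60.0, -100.0, -30.0]

accuracy = directional_accuracy(actual, predicted)
print(accuracy)  # 0.6
```

A value near 0.5 would suggest the predictions carry little information about direction, even if the absolute errors look small.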


## Make Your Own Predictions

The code below lets you make your own predictions. I found that predictions are most accurate when the input contains no padding. The `create_news` variable holds some default headlines from April 30th, 2017; change the text to whatever you want and see what impact your new headlines have on the prediction.

In [93]:
def news_to_int(news):
    '''Convert your created news into integers'''
    ints = []
    for word in news.split():
        if word in vocab_to_int:
            ints.append(vocab_to_int[word])
        else:
            ints.append(vocab_to_int['<UNK>'])
    return ints
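
As a quick illustration of the lookup, the sketch below runs an equivalent version of `news_to_int` against a toy vocabulary. The `vocab_to_int` mapping here is hypothetical; the real one is built from the cleaned headlines earlier in the notebook.

```python
# Toy vocabulary for illustration only.
vocab_to_int = {"<UNK>": 0, "<PAD>": 1, "dow": 2, "falls": 3, "sharply": 4}


def news_to_int(news):
    """Map each whitespace-separated word to its id, or to <UNK> if unseen."""
    unk = vocab_to_int["<UNK>"]
    return [vocab_to_int.get(word, unk) for word in news.split()]


encoded = news_to_int("dow falls sharply today")
print(encoded)  # [2, 3, 4, 0]
```

The word "today" is not in the toy vocabulary, so it maps to the `<UNK>` id. Headlines with many unseen words therefore give the model little to work with.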

In [94]:
def padding_news(news):
    '''Pad or truncate your created news to the model's input length.'''
    padded_news = list(news)  # Copy so the caller's list is not modified
    if len(padded_news) < max_daily_length:
        padded_news += [vocab_to_int["<PAD>"]] * (max_daily_length - len(padded_news))
    else:
        padded_news = padded_news[:max_daily_length]
    return padded_news
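
The padding step can be sketched on its own with small numbers. The helper `pad_or_truncate` below is a hypothetical, self-contained variant that takes the target length and pad id as arguments instead of reading `max_daily_length` and `vocab_to_int`.

```python
def pad_or_truncate(ids, length, pad_id):
    """Return a new list of exactly `length` ids without modifying `ids`."""
    if len(ids) >= length:
        return ids[:length]
    return ids + [pad_id] * (length - len(ids))


short = [2, 3, 4]
padded = pad_or_truncate(short, 5, pad_id=1)
truncated = pad_or_truncate([2, 3, 4, 5, 6, 7], 5, pad_id=1)
print(padded)     # [2, 3, 4, 1, 1]
print(truncated)  # [2, 3, 4, 5, 6]
print(short)      # [2, 3, 4]
```

Returning a new list keeps the original sequence intact, which is convenient if you want to compare predictions with and without padding.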

In [95]:
# Default news that you can use

create_news =  "Woman says note from Chinese 'prisoner' was hidden in new purse. \
               21,000 AT&T workers poised for Monday strike \
               Thousands march against Trump climate policies in D.C., across USA \
               Kentucky judge won't hear gay adoptions because it's not in the child's \"best interest\" \
               Multiple victims shot in UTC area apartment complex \
               Drones Lead Police to Illegal Dumping in Riverside County | NBC Southern California \
               An 86-year-old Californian woman has died trying to fight a man who was allegedly sexually assaulting her 61-year-old friend. \
               Fyre Festival Named in $5Million+ Lawsuit after Stranding Festival-Goers on Island with Little Food, No Security. \
               The \"Greatest Show on Earth\" folds its tent for good \
               U.S.-led fight on ISIS have killed 352 civilians: Pentagon \
               Woman offers undercover officer sex for $25 and some Chicken McNuggets \
               Ohio bridge refuses to fall down after three implosion attempts \
               Jersey Shore MIT grad dies in prank falling from library dome \
               New York graffiti artists claim McDonald's stole work for latest burger campaign \
               SpaceX to launch secretive satellite for U.S. intelligence agency \
               Severe Storms Leave a Trail of Death and Destruction Through the U.S. \
               Hamas thanks N. Korea for its support against ‘Israeli occupation’ \
               Baker Police officer arrested for allegedly covering up details in shots fired investigation \
               Miami doctor’s call to broker during baby’s delivery leads to $33.8 million judgment \
               Minnesota man gets 15 years for shooting 5 Black Lives Matter protesters \
               South Australian woman facing possible 25 years in Colombian prison for drug trafficking \
               The Latest: Deal reached on funding government through Sept. \
               Russia flaunts Arctic expansion with new military bases"

clean_news = clean_text(create_news)

int_news = news_to_int(clean_news)

pad_news = padding_news(int_news)

pad_news = np.array(pad_news).reshape((1,-1))

pred = model.predict(pad_news)

price_change = unnormalize(pred)

print("The Dow should open: {:.2f} from the previous open.".format(float(price_change[0][0])))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
The Dow should open: -38.85 from the previous open.
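
One way to see the impact of a single new headline is to predict twice, once without and once with the headline, and compare the results. The sketch below assumes a hypothetical `predict_fn` that wraps the full pipeline above (`clean_text`, `news_to_int`, `padding_news`, `model.predict`, `unnormalize`) and returns one number. A stand-in predictor is used here so the example runs on its own.

```python
def headline_impact(base_news, extra_headline, predict_fn):
    """Return the change in prediction caused by appending one headline."""
    before = predict_fn(base_news)
    after = predict_fn(base_news + " " + extra_headline)
    return after - before


def fake_predict(news):
    """Stand-in for the real pipeline: simply counts words."""
    return float(len(news.split()))


impact = headline_impact("markets rally", "oil prices drop", fake_predict)
print(impact)  # 3.0
```

With the real model, a positive value would mean the added headline pushes the predicted open higher, and a negative value would mean it pushes it lower.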
