# SENTIMENT-BASED STOCK MARKET PREDICTION
By Gauri Narayan & Raefah Wahid

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import copy
import pandas as pd
import json
from statistics import mean
from statsmodels.tsa.stattools import grangercausalitytests
import datetime as dt
from datetime import datetime, timedelta, date
import torch
from torch import nn
from pyro.nn import PyroModule
from pyro.infer import Predictive

assert issubclass(PyroModule[nn.Linear], nn.Linear)
assert issubclass(PyroModule[nn.Linear], PyroModule)

## Goal
Stock market prediction is widely regarded as a difficult task, in part because of the volatility of the market itself. The Efficient Market Hypothesis (EMH) proposes that the stock market is primarily affected by new information, such as textual data in the form of news or tweets, rather than by technical indicators that rely on past information (Bollen et al., 2011). Following this line of thought, researchers have explored different Natural Language Processing techniques for working with textual information and have experimented with different prediction models. Our goal for this project is to evaluate several models that can be used for stock market prediction. We begin with a simple Naive Bayes model for sentiment prediction as a baseline, and then follow it with a continuous Dirichlet Process Mixture Model for topic-based sentiment prediction. We then use Vector Autoregression to evaluate how well these models’ outputs forecast stock market closing prices.

## Data
We narrowed the focus of this project to ten companies, five of which have a relatively heavy social media presence and five of which do not: Amazon, Apple, Microsoft, Disney, Google, CVS, General Electric, Santander, Goldman Sachs, China Construction Bank. We scraped tweets for each company by using the company’s common name as a keyword. Ten tweets were gathered per day for each company over the year 2019.

In [6]:
# read in data
companies = [('AMZN', 'Amazon'), ('AAPL', 'Apple'), ('MSFT', 'Microsoft'),
             ('DIS', 'Disney'), ('GOOG', 'Google'), ('CVS', 'CVS'),
             ('GE', 'General Electric'), ('SAN', 'Santander'),
             ('GS', 'Goldman Sachs'), ('CICHY', 'China Construction Bank')]
stock_data = []
tweet_data = []
tweets = []
n = 1  # usually this would be equal to len(companies), but due to the long runtime this report uses a demo (i.e., just Amazon)
for company in companies:
    abbr = company[0]
    tweet = pd.read_csv('./tweets/' + abbr + '_tweets.csv')
    del tweet['Unnamed: 0']
    tweets.append(tweet)
    curr_stock = pd.read_csv('./financial/' + abbr + '_financial.csv')
    del curr_stock['Unnamed: 0']
    stock_data.append(curr_stock)
    curr_tweet = pd.read_csv('./sentiment/' + abbr + '_sentiment.csv')
    # keep only the date part of each timestamp; a comprehension avoids
    # shadowing the `date` class imported from datetime above
    curr_tweet['Time'] = [timestamp.split()[0] for timestamp in curr_tweet['Time']]
    del curr_tweet['Unnamed: 0']
    del curr_tweet['Unnamed: 0.1']
    tweet_data.append(curr_tweet)

In [7]:
tweets[0].head()  # sample tweet data for Amazon

Unnamed: 0,Time,Text
0,2018-12-31 23:59:56+00:00,dang amazon. i talking customer service the ph...
1,2020-11-25 17:06:25+00:00,learn the new cloud like way manage premise da...
2,2018-12-31 23:59:53+00:00,chacousa amazonpay i ordered pair chaco’s sund...
3,2018-12-31 23:59:43+00:00,check loiygit amazon music
4,2018-12-31 23:59:41+00:00,head banging doll [clean] kakicchysmusic mp do...


In [8]:
stock_data[0].head()  # sample stock data for Amazon

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2018-12-31,1510.800049,1520.76001,1487.0,1501.969971,1501.969971,6954500
1,2019-01-02,1465.199951,1553.359985,1460.930054,1539.130005,1539.130005,7983100
2,2019-01-03,1520.01001,1538.0,1497.109985,1500.280029,1500.280029,6975600
3,2019-01-04,1530.0,1594.0,1518.310059,1575.390015,1575.390015,9182600
4,2019-01-07,1602.310059,1634.560059,1589.189941,1629.51001,1629.51001,7993200


# Modeling

## Naive Bayes
The tweets were cleaned as they were scraped, using the standard techniques of removing digits, punctuation, symbols, and stopwords. For ease of analysis, we dealt only with tweets written in English. As a baseline for sentiment analysis, we began by implementing the Naive Bayes model. Though Naive Bayes assumes that the words in a tweet are independent given the sentiment label, it is a standard approach to text classification and a good way to test the performance of our forecasting model later on. The Naive Bayes model we implemented assigns a positive and a negative sentiment score to each tweet using the formula

$$\hat{y} = \frac{p(S_k) \cdot \prod_{i=1}^n p(x_i \mid S_k)}{\prod_{i=1}^n p(x_i)},$$

where $\hat{y}$ is the resulting sentiment score, $S_k$ is one of two sentiment labels (positive or negative), and $x_i$ is a particular word in the tweet. To train this model, we used a public dataset [LINK?] of standard positive and negative tweets that were already labelled. After computing prior probabilities based on this training data, we created a Naive Bayes model that read in tweets related to our chosen companies and computed those tweets’ likelihoods and resulting log sentiment scores. Because some of the companies are not particularly well-known on Twitter, some days produced no tweets that mentioned the company; for these days, a sentiment value of 0 was assigned.
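The scoring formula above can be sketched in log space as follows. This is a minimal illustration, not the code used for this report: the function names, the add-one (Laplace) smoothing, and the four-tweet training set are all assumptions made for the example.

```python
import math
from collections import Counter

def train_naive_bayes(labelled_tweets):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training tweets
    for text, label in labelled_tweets:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab

def log_score(text, label, word_counts, label_counts, vocab):
    """log[ p(S_k) * prod_i p(x_i | S_k) / prod_i p(x_i) ], add-one smoothed."""
    vocab_size = len(vocab)
    pooled = Counter()  # word frequencies over all labels, used for p(x_i)
    for counts in word_counts.values():
        pooled.update(counts)
    pooled_total = sum(pooled.values())
    label_total = sum(word_counts[label].values())

    score = math.log(label_counts[label] / sum(label_counts.values()))  # log prior
    for word in text.split():
        p_word_given_label = (word_counts[label][word] + 1) / (label_total + vocab_size)
        p_word = (pooled[word] + 1) / (pooled_total + vocab_size)
        score += math.log(p_word_given_label) - math.log(p_word)
    return score

# Hypothetical toy training data, for illustration only.
training = [
    ("great fast delivery", "positive"),
    ("love this great service", "positive"),
    ("terrible late delivery", "negative"),
    ("awful service never again", "negative"),
]
model = train_naive_bayes(training)
tweet = "great service"
# One score per label, positive first and negative second.
scores = [log_score(tweet, label, *model) for label in ("positive", "negative")]
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on longer tweets; the smoothing keeps words unseen under one label from driving that label's score to negative infinity.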

In [9]:
# sample sentiment data for Amazon; the first index in the sentiment is positive, the second negative
tweet_data[0].head()

Unnamed: 0,Time,Text,Sentiment
0,2018-12-31,dang amazon. i talking customer service the ph...,"[66.97592609158492, 71.83459462136356, 5.00994..."
1,2020-11-25,learn the new cloud like way manage premise da...,"[63.96525067459205, 83.39114379494283, 3.53773..."
2,2018-12-31,chacousa amazonpay i ordered pair chaco’s sund...,"[65.3480030078838, 40.37738505535857, -8.38073..."
3,2018-12-31,check loiygit amazon music,"[8.873573528973495, 6.795776260509992, -11.449..."
4,2018-12-31,head banging doll [clean] kakicchysmusic mp do...,"[33.08781006722295, 15.435225146842129, 11.620..."


Our output for each tweet is an array of length two, with the positive score in the first position and the negative score in the second. Instead of classifying any particular tweet into a final category of positive or negative, we kept these scores so that we could average the relative positive or negative sentiment for a particular company across a day. Thus, we can compare the effect of positive sentiment against that of negative sentiment when it comes to stock price forecasting.
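The per-day averaging described above might look like the following sketch. It assumes that each sentiment value is either a list or the JSON string form of one (as it would be after a round trip through CSV), and that only the first two entries (positive, negative) are used; the function name and sample rows are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean

def daily_average_sentiment(rows):
    """Average the (positive, negative) scores of all tweets on each day.

    `rows` is an iterable of (date_string, sentiment) pairs.
    """
    by_day = defaultdict(list)
    for day, sentiment in rows:
        if isinstance(sentiment, str):
            sentiment = json.loads(sentiment)  # assumed JSON-parseable
        by_day[day].append(sentiment[:2])      # keep positive and negative only
    return {
        day: (mean(s[0] for s in scores), mean(s[1] for s in scores))
        for day, scores in sorted(by_day.items())
    }

# Made-up rows in the same shape as the Time and Sentiment columns.
rows = [
    ("2019-01-02", "[4.0, 1.0]"),
    ("2019-01-02", "[2.0, 3.0]"),
    ("2019-01-03", [1.0, 5.0]),
]
daily = daily_average_sentiment(rows)
```

For a DataFrame such as `tweet_data[0]`, the rows could be supplied as `zip(df['Time'], df['Sentiment'])`, provided the stored strings parse as JSON.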

## Vector Autoregression
Vector Autoregression (VAR) is often used for time series forecasting because it models each time series as a linear combination of its own past values and the past values of other time series. Since our stock information only deals with end-of-day results, we averaged the sentiment scores for tweets across each day for each company. With a lag of 1, our model takes the form of the following linear regression

$$y_t = \theta_1 x_{t- \text{lag}} + \theta_2 y_{t - \text{lag}} + b,$$

where $y_t$ is the closing price of the current day’s stock, $x_{t- \text{lag}}$ is the previous day’s average positive or negative sentiment, $y_{t - \text{lag}}$ is the previous day’s closing price, $\theta_1$ and $\theta_2$ are weights, and $b$ is the bias.

To implement this, we used Pyro’s linear regression module, with a mean squared error (MSE) loss optimized using Adam. We began with a lag of 1, using positive sentiment first and then negative sentiment.
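To make the lagged regression concrete, here is a small sketch that builds the lagged inputs and fits $\theta_1$, $\theta_2$, and $b$. Note the substitution: it solves the least-squares problem in closed form with NumPy rather than training a Pyro linear module with Adam, so it minimizes the same MSE objective by a different route. The function names and the synthetic series are assumptions for illustration.

```python
import numpy as np

def make_lagged(sentiment, close, lag=1):
    """Pair each close y_t with x_{t-lag} and y_{t-lag}. Requires lag >= 1."""
    sentiment = np.asarray(sentiment, dtype=float)
    close = np.asarray(close, dtype=float)
    features = np.column_stack([sentiment[:-lag], close[:-lag]])
    targets = close[lag:]
    return features, targets

def fit_least_squares(features, targets):
    """Return (theta_1, theta_2, b) minimizing the mean squared error."""
    design = np.column_stack([features, np.ones(len(features))])  # bias column
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef

# Synthetic series (not real market data) generated from known weights,
# so the fit can be checked against them.
rng = np.random.default_rng(0)
sentiment = rng.normal(size=200)
close = np.empty(200)
close[0] = 100.0
for t in range(1, 200):
    close[t] = 0.5 * sentiment[t - 1] + 0.9 * close[t - 1] + 10.0

X, y = make_lagged(sentiment, close, lag=1)
theta_1, theta_2, b = fit_least_squares(X, y)
```

Passing a different `lag` to `make_lagged` shifts both inputs by the same number of days, which is how the longer lags would be set up under this sketch.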

In [None]:
## getting lagged data
## running VAR (looking at loss)

Due to the highly dynamic nature of both sentiment and the stock market, forecasting with a lag of more than seven days is unlikely to be effective. We began with a lag of 1, as seen above, but we also experimented with lags of 3 and 5.

In [None]:
## lag of 3 and 5
## running VAR

We found that although the loss decreases further with a lag of 3, a lag of 5 brings little additional improvement. A lag of 3 performs best, though only marginally.

# Inference
## Granger Causality
///

In [None]:
## run GC

# Evaluation
## Mean Squared Error
To evaluate our model, we utilized MSE as a metric. For each company’s dataset, we split the data into a training and testing set, with 80% of the original data in the training set and 20% in the testing set. We ran VAR on the training set to retrieve an appropriate set of weights and bias for each company. We used this output to predict the closing prices of the testing set, which yielded the following MSE results.
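The evaluation procedure can be sketched as below. Two assumptions are made here: the 80/20 split is chronological (earliest 80% for training), which the text does not specify, and the weights are fit by closed-form least squares rather than by the Pyro/Adam setup. The data are synthetic and the helper names are hypothetical.

```python
import numpy as np

def chronological_split(features, targets, train_fraction=0.8):
    """Use the earliest rows for training and the remaining rows for testing."""
    cut = int(len(features) * train_fraction)
    return features[:cut], targets[:cut], features[cut:], targets[cut:]

def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((predicted - actual) ** 2))

# Synthetic, noise-free data standing in for lagged sentiment and price.
rng = np.random.default_rng(1)
features = rng.normal(size=(100, 2))
targets = features @ np.array([2.0, -1.0]) + 3.0

X_train, y_train, X_test, y_test = chronological_split(features, targets)

# Fit weights and bias on the training set only.
design = np.column_stack([X_train, np.ones(len(X_train))])
weights, *_ = np.linalg.lstsq(design, y_train, rcond=None)

# Predict the held-out targets and score them.
predictions = np.column_stack([X_test, np.ones(len(X_test))]) @ weights
test_mse = mse(predictions, y_test)
```

A chronological split keeps future prices out of the training set, which matters for time series; a random split would let the model see days that come after some of its test days.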

In [None]:
## run MSE

# Modeling
## Continuous Dirichlet Process Mixture Model

In [None]:
## explanation + running

In [None]:
## Var on DPM

In [None]:
## inference and evaluation on DPM

# Conclusion