# Machine learning to predict stock price

In [186]:
# Set it up again
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn as sns
import sklearn
from afinn import Afinn
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
%matplotlib inline

TODO: discuss the different machine learning models used and the hypotheses about using sentiment analysis for prediction. Compare Top1 and Top25.

In [187]:
reddit_news_df = pd.read_csv("Combined_News_DJIA.csv")
# reddit_news_df.head()

In phase 2 we found that only 3 rows contain null values, so I'm dropping them again.

In [188]:
reddit_news_df = reddit_news_df.dropna()

## Are sentiment analysis values of the more popular headlines/articles in the r/worldnews subreddit better stock market predictors than those of less popular articles?
In the following cells I'll train logistic regression models using sentiment analysis of different Top# articles to predict whether the Dow Jones Industrial Average (DJIA) goes up or down.

Initializing sentiment analysis values:

In [189]:
af = Afinn()

top1_headlines = reddit_news_df['Top1']
top1_headlines = [headline.replace('b"', '').replace('b\'', "'") for headline in top1_headlines]
top1_sentiment_scores = [af.score(headline) for headline in top1_headlines]

reddit_news_df['Top1 Sentiment'] = top1_sentiment_scores

top5_headlines = reddit_news_df['Top5']
top5_headlines = [headline.replace('b"', '').replace('b\'', "'") for headline in top5_headlines]
top5_sentiment_scores = [af.score(headline) for headline in top5_headlines]

reddit_news_df['Top5 Sentiment'] = top5_sentiment_scores

top15_headlines = reddit_news_df['Top15']
top15_headlines = [headline.replace('b"', '').replace('b\'', "'") for headline in top15_headlines]
top15_sentiment_scores = [af.score(headline) for headline in top15_headlines]

reddit_news_df['Top15 Sentiment'] = top15_sentiment_scores

top25_headlines = reddit_news_df['Top25']
top25_headlines = [headline.replace('b"', '').replace('b\'', "'") for headline in top25_headlines]
top25_sentiment_scores = [af.score(headline) for headline in top25_headlines]

reddit_news_df['Top25 Sentiment'] = top25_sentiment_scores
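The `replace` calls in this cell remove `b"` wherever it occurs in a headline, not only at the start. Here is a minimal sketch of a stricter cleanup, assuming the headlines arrive as the string form of a bytes literal (as the `replace` calls suggest). The helper name `strip_bytes_prefix` is hypothetical and not part of the notebook.

```python
import re

# Matches a leading bytes-literal prefix (b" or b') and captures the quote used.
_BYTES_PREFIX = re.compile(r"""^b(["'])""")

def strip_bytes_prefix(headline: str) -> str:
    """Remove a leading b"..." or b'...' wrapper, leaving other text untouched."""
    match = _BYTES_PREFIX.match(headline)
    if not match:
        return headline
    quote = match.group(1)
    body = headline[2:]
    # Drop the matching closing quote only if it is present.
    if body.endswith(quote):
        body = body[:-1]
    return body

# A wrapped headline loses its wrapper; b" in the middle of a word is kept.
print(strip_bytes_prefix('b"Georgia downs two Russian warplanes"'))
print(strip_bytes_prefix('Fans mob"scene outside stadium'))
```

The cleaned strings can be passed to `af.score` in the same way as before.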

In [190]:
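# .loc slices include both endpoints, so stop the training set one label early
# to keep row 944 out of both sets.
reddit_train = reddit_news_df.loc[:943].copy()
reddit_test = reddit_news_df.loc[944:].copy()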
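Label-based `.loc` slices include both endpoints, so a split that uses the same label on both sides puts that row in both sets. Here is a minimal sketch on a toy frame. The data and the cut position `split_at` are made up for illustration.

```python
import pandas as pd

# Toy frame with a gap in the index, as left behind by dropna().
df = pd.DataFrame({"Label": [1, 0, 1, 1, 0, 1]}, index=[0, 1, 2, 4, 5, 6])

# Label-based slicing includes both endpoints, so row 4 lands in both halves.
overlap = df.loc[:4].index.intersection(df.loc[4:].index)
print(list(overlap))

# Position-based slicing excludes the stop position, so the halves are disjoint.
split_at = 4  # hypothetical cut position
train, test = df.iloc[:split_at], df.iloc[split_at:]
print(len(train), len(test))
```

Slicing by position with `.iloc` avoids having to reason about which index labels survived `dropna()`.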

### Comparing models using Top1, Top5, Top15, Top25 headlines sentiment analysis values

In [191]:
x_train = reddit_train[['Top1 Sentiment']]
x_test = reddit_test[['Top1 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of Top1 test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Top1 model true negatives:', tn)
print('Top1 model false positives:', fp)
print('Top1 model false negatives:', fn)
print('Top1 model true positives:', tp, '\n')

################################# TOP 5 ####################################

x_train = reddit_train[['Top5 Sentiment']]
x_test = reddit_test[['Top5 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of Top5 test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Top5 model true negatives:', tn)
print('Top5 model false positives:', fp)
print('Top5 model false negatives:', fn)
print('Top5 model true positives:', tp, '\n')


################################# TOP 15 ####################################

x_train = reddit_train[['Top15 Sentiment']]
x_test = reddit_test[['Top15 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of Top15 test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Top15 model true negatives:', tn)
print('Top15 model false positives:', fp)
print('Top15 model false negatives:', fn)
print('Top15 model true positives:', tp, '\n')

################################# TOP 25 ####################################

x_train = reddit_train[['Top25 Sentiment']]
x_test = reddit_test[['Top25 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of Top25 test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Top25 model true negatives:', tn)
print('Top25 model false positives:', fp)
print('Top25 model false negatives:', fn)
print('Top25 model true positives:', tp, '\n')

Accuracy of Top1 test set: 0.53
Top1 model true negatives: 0
Top1 model false positives: 322
Top1 model false negatives: 0
Top1 model true positives: 367

Accuracy of Top5 test set: 0.53
Top5 model true negatives: 0
Top5 model false positives: 322
Top5 model false negatives: 0
Top5 model true positives: 367

Accuracy of Top15 test set: 0.52
Top15 model true negatives: 8
Top15 model false positives: 314
Top15 model false negatives: 14
Top15 model true positives: 353

Accuracy of Top25 test set: 0.53
Top25 model true negatives: 0
Top25 model false positives: 322
Top25 model false negatives: 0
Top25 model true positives: 367



#### Observations
The models that use sentiment analysis values of articles of different popularity have roughly the same accuracy at predicting the DJIA: all four score 52-53%. At first glance, sentiment analysis doesn't appear to be a good feature. The Top1, Top5 and Top25 models predict class 1 for every test day (zero true negatives and zero false negatives), so their accuracy is simply the share of up days in the test set. Predicting the stock market is hard, though, and other features need to be considered before disregarding sentiment analysis.
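A useful reference point for accuracies like these is the score obtained by always predicting the most common label. Here is a minimal sketch, using made-up labels rather than the notebook's data. The helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels (made up): 1 = index rose or stayed flat, 0 = index fell.
y_test = [1, 0, 1, 1, 0, 1, 0, 1]
print(f"{majority_baseline_accuracy(y_test):.3f}")
```

A model is only informative if it beats this number on the same test set.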

One thing to note is that the models predict class 1 almost every time. This does not mean they are better at recognizing days when the DJIA goes up or stays the same. They rarely predict that it's going to go down at all, so nearly all down days are misclassified.
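For binary labels 0/1, scikit-learn's `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp. Reading the Top15 output in that order gives tn=8, fp=314, fn=14, tp=353. The sketch below turns those counts into per-class rates. The helper `class_rates` and its key names are hypothetical.

```python
def class_rates(tn, fp, fn, tp):
    """Per-class rates from binary confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall_up": tp / (tp + fn),    # share of up days predicted as up
        "recall_down": tn / (tn + fp),  # share of down days predicted as down
    }

# Top15 counts from the output above, read in ravel() order: tn, fp, fn, tp.
rates = class_rates(tn=8, fp=314, fn=14, tp=353)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Looking at recall per class makes it clear when a model's accuracy comes almost entirely from one class.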

## Are there better features than sentiment analysis to predict the DJIA?

In this section I will use a moving average of the DJIA as a feature to train a new model.

A moving average is the mean value of the DJIA over a fixed time period. For example, a 100-day moving average is the mean of the DJIA closing values over the past 100 days. It's widely used by investors to help decide whether to sell or hold a stock market position.
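Here is a minimal sketch of a trailing moving average in pandas, using made-up prices. It assumes the rows must first be sorted oldest-first so that each window only covers earlier days. The `shift(1)` step, which keeps the same day's close out of its own feature, is my assumption and not something the notebook does. The column name `ma-3` is only illustrative.

```python
import pandas as pd

# Toy closing prices (made up), deliberately listed newest-first.
prices = pd.DataFrame({
    "Date": ["2020-01-06", "2020-01-03", "2020-01-02", "2020-01-01"],
    "Close": [104.0, 103.0, 102.0, 100.0],
})

# Sort oldest-first so each rolling window covers earlier days only.
prices = prices.sort_values("Date").reset_index(drop=True)

# shift(1) drops the current day's close from its own window.
prices["ma-3"] = prices["Close"].shift(1).rolling(window=3).mean()
print(prices)
```

The first three rows have no value because fewer than three earlier closes are available for them.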

In [192]:
# reset df 
reddit_news_df = pd.read_csv("Combined_News_DJIA.csv")
reddit_news_df = reddit_news_df.dropna()

# CSV with the DJIA data
djia_df = pd.read_csv("upload_DJIA_table.csv")

# Keep only the rows with matching dates.
merged_df = pd.merge(djia_df, reddit_news_df, on=['Date'], how='inner')
merged_df = merged_df.dropna()
# merged_df.head()

# ma_5 = merged_df['Close'].rolling(window=5).mean()
# # print(ma_5)

# ma_df = pd.DataFrame(ma_5, columns=['ma-5'])
# ma_df = ma_df.dropna()

# # print(ma_df.iloc[2])
# ma_df.head(100)

### Comparing models with 5-day, 10-day, 25-day, 50-day and 100-day moving averages

In [197]:
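# Find moving averages.
# Note: the rows are in descending date order (see the table below), so each
# rolling window covers the current row and the rows above it, which are later dates.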
merged_df["ma-5"] = merged_df['Close'].rolling(window=5).mean()
merged_df["ma-10"] = merged_df['Close'].rolling(window=10).mean()
merged_df["ma-25"] = merged_df['Close'].rolling(window=25).mean()
merged_df["ma-50"] = merged_df['Close'].rolling(window=50).mean()
merged_df["ma-100"] = merged_df['Close'].rolling(window=100).mean()

merged_df = merged_df.dropna()

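# .loc slices include both endpoints, so stop the training set one label early
# to keep row 1300 out of both sets.
ma_train = merged_df.loc[:1299].copy()
ma_test = merged_df.loc[1300:].copy()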

merged_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,Label,Top1,Top2,...,Top21,Top22,Top23,Top24,Top25,ma-5,ma-10,ma-25,ma-50,ma-100
198,2015-09-18,16674.740234,16674.740234,16343.759766,16384.580078,341690000,16384.580078,0,Brazil's Supreme Court has banned corporate co...,Investigation finds Exxon knew about CO2's eff...,...,Farmers in northern France have been ordered t...,'Super-gonorrhoea' outbreak in Leeds,"In egalitarian Sweden, richer regions reluctan...",China Is Building The Mother Of All Reputation...,Half a million children have fled attacks by t...,16341.289844,16262.884863,16732.680195,17222.747949,17041.170156
199,2015-09-17,16738.080078,16933.429688,16639.929688,16674.740234,129600000,16674.740234,0,"Efficiency up, turnover down: Sweden experimen...",7.9-Magnitude Earthquake Strikes off the Coast...,...,"Threatened, starved: Cook reveals life at Saud...",Global study reveals soaring antibiotic resist...,Burkina Faso 'coup': Presidential guard dissol...,Malicious Cisco router backdoor found on 79 mo...,Russian Authorities Close Down American Center...,16435.973828,16303.15791,16700.103398,17199.974941,17048.770156
200,2015-09-16,16599.509766,16755.980469,16593.900391,16739.949219,99620000,16739.949219,1,Tuna and mackerel populations suffer catastrop...,"Australian Government introduces ""No Jab No Pa...",...,"Flying Korea's farmed dogs to safety - ""Our go...",Turkish presidents office says insulting presi...,Saudi suspends Binladen group over Mecca crane...,"Anheuser-Busch InBev, the maker of Budweiser a...",China stocks resume sharp slide as economic wo...,16527.985742,16348.682812,16682.956992,17178.506113,17056.02585
201,2015-09-15,16382.580078,16644.109375,16382.580078,16599.849609,93050000,16599.849609,1,Egyptian Billionaire who wants to purchase pri...,The UN Says US Drone Strikes in Yemen Targetin...,...,Canadian banks helping clients bend rules to m...,Poland &amp; Sweden agree to intensify militar...,North Korea 'restarts nuclear operations' | BBC,Teen Arrested for Planning Alleged ISIS-Inspir...,'Syria Is Emptying',16581.861719,16403.754785,16658.266601,17154.259316,17061.753848
202,2015-09-14,16450.859375,16450.859375,16330.870117,16370.959961,92660000,16370.959961,0,Malcom Turnbull becomes Prime Minister of Aust...,El Nino set to be strongest ever. The most pow...,...,Taliban storms Afghan jail with suicide bomber...,NASA Launching 4K TV Channel,The Egyptian army announced it has killed 64 a...,Oxfam: Increasing inequality plunging millions...,Czech PM insists migrant quotas 'won't work',16554.01582,16440.661816,16623.883437,17125.824922,17063.41375


In [194]:
x_train = ma_train[['ma-5']]
x_test = ma_test[['ma-5']]
y_train = ma_train[['Label']]
y_test = ma_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of 5-day moving average test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('5-day moving average model true negatives:', tn)
print('5-day moving average model false positives:', fp)
print('5-day moving average model false negatives:', fn)
print('5-day moving average model true positives:', tp, '\n')

############################ 10-day model ####################################

x_train = ma_train[['ma-10']]
x_test = ma_test[['ma-10']]
y_train = ma_train[['Label']]
y_test = ma_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of 10-day moving average test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('10-day moving average model true negatives:', tn)
print('10-day moving average model false positives:', fp)
print('10-day moving average model false negatives:', fn)
print('10-day moving average model true positives:', tp, '\n')

############################ 25-day model ####################################

x_train = ma_train[['ma-25']]
x_test = ma_test[['ma-25']]
y_train = ma_train[['Label']]
y_test = ma_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of 25-day moving average test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('25-day moving average model true negatives:', tn)
print('25-day moving average model false positives:', fp)
print('25-day moving average model false negatives:', fn)
print('25-day moving average model true positives:', tp, '\n')

############################ 50-day model ####################################

x_train = ma_train[['ma-50']]
x_test = ma_test[['ma-50']]
y_train = ma_train[['Label']]
y_test = ma_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of 50-day moving average test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('50-day moving average model true negatives:', tn)
print('50-day moving average model false positives:', fp)
print('50-day moving average model false negatives:', fn)
print('50-day moving average model true positives:', tp, '\n')

############################ 100-day model ####################################

x_train = ma_train[['ma-100']]
x_test = ma_test[['ma-100']]
y_train = ma_train[['Label']]
y_test = ma_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of 100-day moving average test set: {:.2f}'.format(logReg.score(x_test, y_test)))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('100-day moving average model true negatives:', tn)
print('100-day moving average model false positives:', fp)
print('100-day moving average model false negatives:', fn)
print('100-day moving average model true positives:', tp, '\n')


Accuracy of 5-day moving average test set: 0.55
5-day moving average model true negatives: 0
5-day moving average model false positives: 311
5-day moving average model false negatives: 0
5-day moving average model true positives: 375

Accuracy of 10-day moving average test set: 0.55
10-day moving average model true negatives: 0
10-day moving average model false positives: 311
10-day moving average model false negatives: 0
10-day moving average model true positives: 375

Accuracy of 25-day moving average test set: 0.55
25-day moving average model true negatives: 0
25-day moving average model false positives: 311
25-day moving average model false negatives: 0
25-day moving average model true positives: 375

Accuracy of 50-day moving average test set: 0.55
50-day moving average model true negatives: 0
50-day moving average model false positives: 311
50-day moving average model false negatives: 0
50-day moving average model true positives: 375
Accuracy of 100-day moving average test s

#### Observations
The models that use moving averages appear only slightly better at predicting the DJIA, at 55% accuracy versus 52-53%. Just like the sentiment-based models, though, every moving-average model whose output is shown above predicts class 1 for every test day (zero true negatives and zero false negatives), so the 55% is just the share of up days in this test set. This leads me to believe that the models have a hard time finding a connection between the features and the DJIA going down.
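One way to make such comparisons easier to read is to score every single-feature model next to a most-frequent-class baseline on the same split. The sketch below uses synthetic random data, not the DJIA data. The column names `feature_a` and `feature_b`, the 300/100 split and the helper `evaluate` are all hypothetical. `DummyClassifier(strategy="most_frequent")` is scikit-learn's built-in baseline that always predicts the majority class seen during training.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in data: random features and random up/down labels.
data = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "Label": rng.integers(0, 2, size=n),
})
train, test = data.iloc[:300], data.iloc[300:]

def evaluate(feature_names):
    """Return test accuracy of one single-feature model per feature, plus a baseline."""
    results = {}
    # The baseline ignores its input and predicts the majority class of the training labels.
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(train[[feature_names[0]]], train["Label"])
    results["baseline"] = baseline.score(test[[feature_names[0]]], test["Label"])
    for name in feature_names:
        model = LogisticRegression()
        model.fit(train[[name]], train["Label"])
        results[name] = model.score(test[[name]], test["Label"])
    return results

scores = evaluate(["feature_a", "feature_b"])
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

If a feature's accuracy matches the baseline, the model has most likely learned nothing beyond the class balance.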