# NUS Fintech Society: Stock Market Prediction Project 

Credit to: Samuel Khoo, Darren Lim, Tan Qing Lin, Leonard Tan

This is an extension of the NUS Fintech Stock Market Prediction Project done by my team. 

Source: https://www.kaggle.com/aaron7sun/stocknews


#### There are three datasets. 

1. RedditNews.csv: two columns. The first column is the "date" and the second column is the "news headlines". All news items are ranked from top to bottom based on how hot they are, and there are 25 lines for each date.

2. DJIA_table.csv: Downloaded directly from Yahoo Finance: check out the web page for more info.

3. Combined_News_DJIA.csv: To make things easier, the uploader provides this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".


#### Business Problem

**"Given the news headline of a particular day, are we able to predict if the market will move upwards or downwards?"**


To me, this is an interesting business problem. We all know that world events affect the markets. I believe this to be a good first step in attempting to discover the impact of news on the markets.


### Importing Packages

In [1]:
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.offline as py_offline

from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from matplotlib import pyplot

### Reading in datasets

In [2]:
DJIA_news = pd.read_csv("data/Combined_News_DJIA.csv")
DJIA = pd.read_csv("data/DJIA_table.csv")
reddit_news = pd.read_csv("data/RedditNews.csv")

### Exploring the data

In [3]:
DJIA_news.head(3)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."


In [4]:
DJIA_news.shape

(1989, 27)

In [5]:
DJIA_news['Label'].value_counts()

1    1065
0     924
Name: Label, dtype: int64
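The label counts above give a useful reference point: a model that always predicts the majority class would already be right a little over half the time. A minimal sketch of that arithmetic, with the counts typed in by hand from the output above (the names `label_counts` and `baseline_accuracy` are illustrative only):

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
label_counts = pd.Series({1: 1065, 0: 924})

# Share of each class; the largest share is the majority-class baseline.
label_share = label_counts / label_counts.sum()
baseline_accuracy = label_share.max()

print(round(baseline_accuracy, 4))
```

Any classifier built later on the headlines would need to beat this share to be adding information.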

In [6]:
DJIA.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


In [7]:
DJIA.shape

(1989, 7)

In [8]:
reddit_news.head()

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host
2,2016-07-01,"The president of France says if Brexit won, so..."
3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...
4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...


In [9]:
reddit_news['Date'].nunique()

2943

##### Note

As per https://www.kaggle.com/aaron7sun/stocknews , the uploader has made things easier for us by combining the news headlines for each date into a single row (i.e. reddit_news -> DJIA_news).
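To make the reddit_news -> DJIA_news relationship concrete, here is a rough sketch of how a long table (one row per headline) could be reshaped into a wide table (one row per date). It runs on a small made-up frame, and the names `long_news`, `rank` and `wide_news` are hypothetical. This is not necessarily how the uploader built the combined file.

```python
import pandas as pd

# Toy long-format frame standing in for reddit_news (one row per headline).
long_news = pd.DataFrame({
    "Date": ["2016-07-01"] * 3 + ["2016-06-30"] * 2,
    "News": ["a1", "a2", "a3", "b1", "b2"],
})

# Rank headlines within each date by their order of appearance (1, 2, 3, ...).
long_news["rank"] = long_news.groupby("Date").cumcount() + 1

# Pivot so that each date becomes one row with columns Top1, Top2, ...
wide_news = long_news.pivot(index="Date", columns="rank", values="News")
wide_news.columns = [f"Top{r}" for r in wide_news.columns]
wide_news = wide_news.reset_index()

print(wide_news)
```

Dates with fewer headlines than the widest date end up with missing values in the trailing columns.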

### Data Preprocessing

There are three datasets. For now, we will **ignore reddit_news** and work on DJIA and DJIA_news. These are the things we need to do.

---
1. Datetime format of DJIA
2. Focus on volume & adjusted close of DJIA
3. Datetime format of DJIA_news
4. Removing b prefix of DJIA_news
5. Converting everything to lowercase of DJIA_news
6. Stemming


For sentiment analysis, we will be using [vader](https://github.com/cjhutto/vaderSentiment)

[Alternative sentiment analysis](https://medium.com/@Intellica.AI/vader-ibm-watson-or-textblob-which-is-better-for-unsupervised-sentiment-analysis-db4143a39445)



##### Processing DJIA 

- datetime format
- dropping of certain columns

In [10]:
DJIA['Date'] = pd.to_datetime(DJIA['Date'])


In [11]:
# Keep only Date, Volume and Adj Close
DJIA_close = DJIA.drop(columns=['High', 'Low', 'Open', 'Close'])

In [12]:
DJIA_close.head(2)

Unnamed: 0,Date,Volume,Adj Close
0,2016-07-01,82160000,17949.369141
1,2016-06-30,133030000,17929.990234


##### Note 

What is [adjusted close?](https://budgeting.thenest.com/adjusted-closing-price-vs-closing-price-32457.html)

The difference between adjusted close and close price is that the adjusted close takes into account factors such as dividends, stock splits and new stock offerings.
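As an illustration of how an up/down label might be derived from the adjusted close, the sketch below uses the five rows shown in `DJIA.head()`. The rule used here (1 when Adj Close rose or stayed the same versus the previous trading day, 0 otherwise) is an assumption and is not stated in this notebook. The names `djia_sample`, `change` and `up_or_flat` are hypothetical.

```python
import pandas as pd

# Five rows copied from the DJIA.head() output (newest first, as in the file).
djia_sample = pd.DataFrame({
    "Date": pd.to_datetime(["2016-07-01", "2016-06-30", "2016-06-29",
                            "2016-06-28", "2016-06-27"]),
    "Adj Close": [17949.369141, 17929.990234, 17694.679688,
                  17409.720703, 17140.240234],
})

# Sort oldest to newest so diff() compares each day with the previous one.
djia_sample = djia_sample.sort_values("Date").reset_index(drop=True)
djia_sample["change"] = djia_sample["Adj Close"].diff()

# The first day has no previous day, so drop it before labelling.
labelled = djia_sample.dropna(subset=["change"]).copy()

# Assumed rule: 1 if the adjusted close rose or was unchanged, else 0.
labelled["up_or_flat"] = (labelled["change"] >= 0).astype(int)

print(labelled[["Date", "Adj Close", "up_or_flat"]])
```

Note that the file is stored newest first, so sorting by date before differencing matters.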

##### Processing DJIA_news

- datetime format
- removing b prefix
- lower case
- [Stemming](https://medium.com/@tusharsri/nlp-a-quick-guide-to-stemming-60f1ca5db49e)

In [13]:
DJIA_news.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."


In [14]:
DJIA_news['Top25'].isna().value_counts()

False    1986
True        3
Name: Top25, dtype: int64

In [15]:
len(DJIA_news) - DJIA_news.count()

Date     0
Label    0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
dtype: int64

##### Note 

There are a few NA values in the dataframe: 1 in Top23, and 3 each in Top24 and Top25.

We will handle them in two steps.

1. We check whether the missing headlines exist in the reddit_news dataframe.
2. If they do not, we ignore them, as they are few in number.
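If the NA values are ignored, it may still help to decide what they become once the columns are cast to strings later on. One option, sketched here on a made-up two-row column, is to replace missing headlines with an empty string before casting, so that no placeholder text ends up in the cleaned headlines. The name `filled` is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy column with one real headline and one missing value.
headlines = pd.DataFrame({"Top23": ["b'Some headline'", np.nan]})

# Replace missing headlines with an empty string, then cast to str.
filled = headlines["Top23"].fillna("").astype(str)

print(filled.tolist())
```

An empty string passes through lowercasing, punctuation removal and stemming unchanged.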

In [16]:
reddit_news.head(2)

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host


In [17]:
DJIA_news[DJIA_news['Top23'].isna()]

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
277,2009-09-15,1,b'The Church of Scientology won\'t be dissolve...,b'New virus from rats can kill 80 per cent of ...,b'The gruesome spectacle of dolphins being sla...,b'The End of Innocence in Afghanistan: \'The G...,b'France approves Internet piracy bill',b'The Rural Doctors Association says right now...,b'Al Jazeera English - Africa - Shabab to aven...,"b""How Sri Lanka governs through detentions - S...",...,b'In an equine echo of the controversy surroun...,b'UPDATE: 5-New York homes raided in terrorism...,b'Population Growth Impeding Progress on the M...,b'Global Population to Reach 7 Billion by 2011',b'Government Funded Feminist Porn ',b'Can someone enlighten me re:Holy Land disput...,b'Human Rights Watch official suspended for co...,,,


In [18]:
reddit_news['Date'] = pd.to_datetime(reddit_news['Date'])


In [19]:
reddit_news[reddit_news['Date'] == pd.to_datetime("2009-09-15")].count()

Date    22
News    22
dtype: int64

Since the reddit_news dataframe contains only 22 headlines for 2009-09-15, the missing Top23 to Top25 headlines for that date cannot be recovered from it, so we shall ignore the NA values for now.
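The check above covers a single date. A sketch of how the same count could be run across every date at once is given below, using a small made-up frame in place of reddit_news. The names `toy_news`, `expected_per_date` and `short_dates` are hypothetical, and the threshold would be 25 on the real data.

```python
import pandas as pd

# Toy stand-in for reddit_news: two headlines on one date, three on another.
toy_news = pd.DataFrame({
    "Date": pd.to_datetime(["2009-09-15"] * 2 + ["2009-09-16"] * 3),
    "News": ["h1", "h2", "h3", "h4", "h5"],
})

# Number of headlines expected per date (3 for this toy frame).
expected_per_date = 3

# Count headlines per date and keep the dates that fall short.
per_date = toy_news.groupby("Date")["News"].count()
short_dates = per_date[per_date < expected_per_date]

print(short_dates)
```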

In [20]:
DJIA_news['Date'] = pd.to_datetime(DJIA_news['Date'])

In [21]:
DJIA_news_col = DJIA_news.columns

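for i in range(2, len(DJIA_news_col)):
    # Cast to str so that every cell, including missing ones, can be sliced
    DJIA_news[DJIA_news_col[i]] = DJIA_news[DJIA_news_col[i]].astype(str)
    # Strip the leading b only when it is a bytes-literal prefix (b' or b"),
    # so headlines that genuinely start with the letter b are left intact
    DJIA_news[DJIA_news_col[i]] = DJIA_news[DJIA_news_col[i]].apply(lambda x: x[1:] if x[:2] in ("b'", 'b"') else x)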

DJIA_news.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"""Georgia 'downs two Russian warplanes' as coun...",'BREAKING: Musharraf to be impeached.','Russia Today: Columns of troops roll into Sou...,'Russian tanks are moving towards the capital ...,"""Afghan children raped with 'impunity,' U.N. o...",'150 Russian tanks have entered South Ossetia ...,"""Breaking: Georgia invades South Ossetia, Russ...","""The 'enemy combatent' trials are nothing but ...",...,'Georgia Invades South Ossetia - if Russia get...,'Al-Qaeda Faces Islamist Backlash',"'Condoleezza Rice: ""The US would not act to pr...",'This is a busy day: The European Union has a...,"""Georgia will withdraw 1,000 soldiers from Ira...",'Why the Pentagon Thinks Attacking Iran is a B...,'Caucasus in crisis: Georgia invades South Oss...,'Indian shoe manufactory - And again in a ser...,'Visitors Suffering from Mental Illnesses Bann...,"""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,'Why wont America and Nato help us? If they wo...,'Bush puts foot down on Georgian conflict',"""Jewish Georgian minister: Thanks to Israeli t...",'Georgian army flees in disarray as Russians a...,"""Olympic opening ceremony fireworks 'faked'""",'What were the Mossad with fraudulent New Zeal...,'Russia angered by Israeli military sale to Ge...,'An American citizen living in S.Ossetia blame...,...,'Israel and the US behind the Georgian aggress...,"'""Do not believe TV, neither Russian nor Georg...",'Riots are still going on in Montreal (Canada)...,'China to overtake US as largest manufacturer','War in South Ossetia [PICS]','Israeli Physicians Group Condemns State Torture',' Russia has just beaten the United States ove...,'Perhaps *the* question about the Georgia - Ru...,'Russia is so much better at war',"""So this is what it's come to: trading sex for..."


Above, we first store the column names. We then iterate over the headline columns and use a conditional expression to strip the leading 'b' wherever the prefix is present.
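As an alternative to the loop and lambda, the same clean-up can be sketched with a single vectorised regex on a pandas Series. This variant differs from the cell above in that it also removes the surrounding quote characters. The names `raw`, `pattern` and `cleaned` are illustrative, and the sample strings are made up or taken from the table output.

```python
import pandas as pd

raw = pd.Series([
    "b'BREAKING: Musharraf to be impeached.'",
    'b"No Help for Mexico\'s Kidnapping Surge"',
    "bush puts foot down",  # no bytes-literal prefix, should be left alone
])

# Match a leading b plus an opening quote, capture the content,
# and require the same quote character at the end of the string.
pattern = r"""^b(['"])(.*)\1$"""

# Keep only the captured content (group 2).
cleaned = raw.str.replace(pattern, r"\2", regex=True)

print(cleaned.tolist())
```

Because the pattern requires both the prefix and matching quotes, a headline that merely begins with the letter b is not altered.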

In [22]:
DJIA_news_col

Index(['Date', 'Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7',
       'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15',
       'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23',
       'Top24', 'Top25'],
      dtype='object')

##### Note

We've completed the following for DJIA_news.
- Converted datetime format
- Removed 'b' prefix

---

We will conduct the following on a separate dataset
- lower case
- stemming
- removal of punctuation

In [47]:

DJIA_news2 = DJIA_news.copy()
DJIA_news2_col = DJIA_news2.columns

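for i in range(2, len(DJIA_news2_col)):
    # Lowercase every headline
    DJIA_news2[DJIA_news2_col[i]] = DJIA_news2[DJIA_news2_col[i]].str.lower()
    # Remove punctuation: drop anything that is not a word character or whitespace
    DJIA_news2[DJIA_news2_col[i]] = DJIA_news2[DJIA_news2_col[i]].str.replace(r'[^\w\s]', '', regex=True)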
    


In [48]:
DJIA_news['Top1'][2]

"'Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'"

In [49]:
DJIA_news2['Top1'][2]

'remember that adorable 9yearold who sang at the opening ceremonies that was fake too'

In [50]:
st = PorterStemmer()

DJIA_news2_col = DJIA_news2.columns

for i in range(2, len(DJIA_news2_col)):
    DJIA_news2[DJIA_news2_col[i]] = DJIA_news2[DJIA_news2_col[i]].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

In [51]:
DJIA_news['Top1'][2]

"'Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'"

In [52]:
DJIA_news2['Top1'][2]

'rememb that ador 9yearold who sang at the open ceremoni that wa fake too'

##### Note

Above, we have completed the following:
- Converted the strings to lower case.
- Removed punctuation using [regex](https://www.w3schools.com/python/python_regex.asp): the pattern `[^\w\s]` drops everything except word characters (letters, digits and underscores) and whitespace.
- Applied [stemming](https://medium.com/@tusharsri/nlp-a-quick-guide-to-stemming-60f1ca5db49e) to the strings, using the Porter stemmer.


The main aim of stemming is to "reduce the inflectional forms of each word into a common base word or root word or stem word."
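The three steps can also be sketched as one small function applied to a single headline, which makes it easier to see what each step contributes. The function name `clean_headline` is hypothetical. It uses Python's `re` module rather than the pandas string methods, and the sample headline is the Top1 entry shown earlier.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_headline(text):
    """Lowercase, strip punctuation, then Porter-stem each word."""
    lowered = text.lower()
    # Remove anything that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Stem word by word and join back into one string.
    return " ".join(stemmer.stem(word) for word in no_punct.split())

example = "'Remember that adorable 9-year-old who sang at the opening ceremonies? That was fake, too.'"
print(clean_headline(example))
```

Stems such as "ceremoni" or "wa" are not dictionary words; the stemmer only aims to map related word forms to the same token.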

In [53]:
DJIA_news2.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,georgia down two russian warplan as countri mo...,break musharraf to be impeach,russia today column of troop roll into south o...,russian tank are move toward the capit of sout...,afghan children rape with impun un offici say ...,150 russian tank have enter south ossetia whil...,break georgia invad south ossetia russia warn ...,the enemi combat trial are noth but a sham sal...,...,georgia invad south ossetia if russia get invo...,alqaeda face islamist backlash,condoleezza rice the us would not act to preve...,thi is a busi day the european union ha approv...,georgia will withdraw 1000 soldier from iraq t...,whi the pentagon think attack iran is a bad id...,caucasu in crisi georgia invad south ossetia,indian shoe manufactori and again in a seri of...,visitor suffer from mental ill ban from olymp,no help for mexico kidnap surg
1,2008-08-11,1,whi wont america and nato help us if they wont...,bush put foot down on georgian conflict,jewish georgian minist thank to isra train wer...,georgian armi flee in disarray as russian adva...,olymp open ceremoni firework fake,what were the mossad with fraudul new zealand ...,russia anger by isra militari sale to georgia,an american citizen live in sossetia blame us ...,...,israel and the us behind the georgian aggress,do not believ tv neither russian nor georgian ...,riot are still go on in montreal canada becaus...,china to overtak us as largest manufactur,war in south ossetia pic,isra physician group condemn state tortur,russia ha just beaten the unit state over the ...,perhap the question about the georgia russia c...,russia is so much better at war,so thi is what it come to trade sex for food


# Checkpoint

To date, we have achieved the following. 
1. Basic data exploration
2. Data pre-processing of DJIA and DJIA_news

We will continue with basic data visualization and sentiment analysis of the stemmed dataset.