# Project 4: Predicting Volatility Index price with Sentiment Analysis on News headlines

## Problem statement

To train a classifier that predicts whether the Volatility Index (VIX) will go up or down, and to build a sentiment analysis tool that predicts the extent of that movement, using a news headlines dataset consisting of the top 25 daily headlines from the Reddit WorldNews channel over a period of about 8 years.

### Executive Summary

The Efficient Market Hypothesis proposed by Fama [1965] states that stock market prices are driven by all observable information. In practice, however, investor sentiment has been shown to affect asset prices, owing to the well-known psychological tendency of investors with positive (negative) sentiment to make overly optimistic (pessimistic) judgments and decisions [Keynes, 1937].

Hence the purpose of this notebook is to check whether the sentiment effect described above holds, i.e. whether news sentiment can indeed affect the Volatility Index, and to what extent.

### Data Dictionary


**Top 25 news headlines from Reddit WorldNews Channel** : 

Source : https://www.kaggle.com/aaron7sun/stocknews

Historical news headlines from the Reddit WorldNews channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are considered for each date (range: 2008-06-08 to 2016-07-01). Headlines are ranked from top to bottom by how hot they are, so there are 25 headlines for each date. The first column is "Date", the second is "Label", and the remaining ones are the news headlines, from "Top1" to "Top25".

The columns included in this dataset are:

|S/N|Label|Description|
|---|:--|:--|
|1|Date|Date on which the headlines were posted|
|2|Label|Binary label supplied with the dataset; it is dropped in this notebook and replaced by a label derived from VIX prices|
|3-27|Top1 to Top25|The 25 top-ranked headlines for the date, from most to least hot|

### 2. VIX historical data: 2004 to 2020

**Volatility Index (VIX)**:

Source : http://www.cboe.com/products/vix-index-volatility/vix-options-and-futures/vix-index/vix-historical-data

Created by the Chicago Board Options Exchange (CBOE), the Volatility Index, or VIX, is a real-time market index that represents the market's expectation of 30-day forward-looking volatility. 

Derived from the price inputs of the S&P 500 index options, it provides a measure of market risk and investors' sentiments. It is also known by other names like "Fear Gauge" or "Fear Index." Investors, research analysts and portfolio managers look to VIX values as a way to measure market risk, fear and stress before they take investment decisions.

Higher market risk, fear and stress usually coincide with an increase in the VIX.
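To give a feel for what a "30-day forward-looking volatility" figure means, here is a minimal sketch that converts a VIX level into a rough one-month figure. It assumes the usual square-root-of-time scaling (the VIX is quoted as an annualised percentage, and there are 12 months in a year); the function name `implied_30_day_move` is hypothetical and not part of this notebook.

```python
import math

def implied_30_day_move(vix_level):
    """Scale an annualised VIX level (in percent) down to a rough 30-day figure.

    Assumption: square-root-of-time scaling, i.e. divide by sqrt(12).
    """
    return vix_level / math.sqrt(12)

# 20.66 is the VIX close on 8/8/2008 from the dataset used below
move = implied_30_day_move(20.66)
print(f"A VIX of 20.66 corresponds to a 30-day figure of about {move:.2f}%")
```

Under this assumption, a VIX close of 20.66 corresponds to an expected one-month volatility of roughly 6%.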

The columns included in this dataset are : 

|S/N|Label|Description|
|---|:--|:--|
|1|Date|Date of the VIX observation|
|2|VIX Open|Opening price of VIX for the date|
|3|VIX High|Highest price of VIX for the date|
|4|VIX Low|Lowest price of VIX for the date|
|5|VIX Close|Closing price of VIX for the date|

### 3. Process

Data collection - Collect the data and create the news-label dataset (described in the previous section)

Text preprocessing - Remove punctuation, stopwords and malformed words, lowercase, lemmatize and finally tokenize the words

Train test split - Randomly shuffle and split the processed data into an 80% training set and a 20% test set.

Create models for training - We defined these models:

1. Logistic Regression with CountVectorizer
2. Logistic Regression with TF-IDF
3. Naive Bayes with TF-IDF
4. Random Forest with TF-IDF
5. 3 layers of Stacked LSTM
6. LSTM with Convolutional Neural Network for Sequence Classification

Evaluate performance - Use the tuned models to predict the test set and compare the performance of the six models, using accuracy and F1 score as metrics.
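The text preprocessing step can be sketched roughly as follows. This is an illustration only, not the notebook's actual implementation: the stopword list `TOY_STOPWORDS` and the function `preprocess_headline` are hypothetical, lemmatization is left out to keep the example free of external downloads, and the `b'...'` handling is an assumption based on how the headlines appear in the raw CSV shown later.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only
TOY_STOPWORDS = {"a", "an", "the", "to", "of", "and", "in", "on", "is", "be"}

def preprocess_headline(raw):
    """Lowercase, strip punctuation, tokenize on whitespace and drop stopwords."""
    # Strip the b'...' or b"..." wrapper that the raw headlines carry
    text = re.sub(r"""^b(['"])(.*)\1$""", r"\2", raw.strip())
    text = text.lower()
    # Remove all ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

print(preprocess_headline("b'BREAKING: Musharraf to be impeached.'"))
```

A lemmatizer (for example from NLTK) would normally be applied to each token before the stopword filter.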

# Importing of Libraries

In [1]:
# get some libraries that will be useful
import re
import numpy as np # linear algebra
import pandas as pd
import seaborn as sns
import string
import matplotlib.pyplot as plt
import pandas_datareader as dr
#To remove weekends from dataset
from pandas.tseries.offsets import BDay

# necessary libraries for wordcloud
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from PIL import Image

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder

#to filter out selected dates from dataset
import datetime


%matplotlib inline



# Data Import

In [2]:
combined_news = pd.read_csv("../data/Combined_News_DJIA.csv")

In [3]:
combined_news.head()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
1,2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2,2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
3,2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
4,2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


In [4]:
#We have 1989 rows and 27 columns
combined_news.shape  

(1989, 27)

# Data Cleaning and Preprocessing

In [5]:
#Keep only the Date and Top1 to Top25 headline columns (i.e. drop 'Label') and store them as X_df
features = [col for col in combined_news.columns if col != 'Label']
X_df = combined_news[features]

In [29]:
# Function to print some basic information about a dataframe:
# shape of the dataframe, i.e. number of rows and columns
# total number of null values (counted per cell across all columns)
# total number of rows that are duplicated
# data types of the columns

def basic_eda(df, df_name):
    print(df_name.upper())
    print()
    print(f"Rows: {df.shape[0]} \t Columns: {df.shape[1]}")
    print()
    
    # Note: this counts null cells, not rows, although the printed label says "rows"
    print(f"Total null rows: {df.isnull().sum().sum()}")
    print(f"Percentage null rows: {round(df.isnull().sum().sum() / df.shape[0] * 100, 2)}%")
    print()
    
    # keep=False flags every member of a duplicated group, not just the repeats
    print(f"Total duplicate rows: {df[df.duplicated(keep=False)].shape[0]}")
    print(f"Percentage dupe rows: {round(df[df.duplicated(keep=False)].shape[0] / df.shape[0] * 100, 2)}%")
    print()
    
    print(df.dtypes)
    print("-----\n")

In [30]:
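#Note: the execution counter shows this cell was run after the merge further down,
#so df is the merged dataframe (hence the Date_x, Date_y and upordown columns in the output)
basic_eda(df, "X_df")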

X_DF

Rows: 1989 	 Columns: 28

Total null rows: 7
Percentage null rows: 0.35%

Total duplicate rows: 0
Percentage dupe rows: 0.0%

Date_x       object
Top1         object
Top2         object
Top3         object
Top4         object
Top5         object
Top6         object
Top7         object
Top8         object
Top9         object
Top10        object
Top11        object
Top12        object
Top13        object
Top14        object
Top15        object
Top16        object
Top17        object
Top18        object
Top19        object
Top20        object
Top21        object
Top22        object
Top23        object
Top24        object
Top25        object
Date_y       object
upordown    float64
dtype: object
-----



There appear to be no duplicate rows in our news headlines. Let's proceed to check the null values.

In [None]:
#Count the missing values in each column of the news headlines dataframe
print('Missing values per column:\n', X_df.isnull().sum(), '\n')


There is 1 missing value in the Top23 column, 3 in Top24 and 3 in Top25. No action is taken, as these 7 missing headlines should not significantly affect our results given that the dataset covers 1989 days.

# Since we are testing against the Volatility Index, we only want headlines from weekdays, not weekends.

In [7]:
#Extracting only the weekdays out of the dataset according to the 'Date' column
#is_on_offset replaces onOffset, which has been deprecated since pandas 1.0
isBusinessDay = BDay().is_on_offset
match_series = pd.to_datetime(X_df['Date']).map(isBusinessDay)



In [8]:
#to remove the weekend out of the combined dataset.
X_df = X_df[match_series]

In [9]:
#confirm that there are still 1989 rows, i.e. the dataset contained no weekend dates
X_df.shape 

(1989, 26)
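Because the row count is unchanged, the filter above removes nothing on this dataset. To see what it does when weekend dates are present, here is a minimal sketch on toy data. The dataframe `toy` and the helper `keep_weekdays` are hypothetical, and the day-of-week test is used as a stand-in that, like `BDay`, removes Saturdays and Sundays but knows nothing about exchange holidays.

```python
import pandas as pd

# Toy dates (hypothetical rows): Friday to Monday, so two fall on a weekend
toy = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-09", "2008-08-10", "2008-08-11"],
    "Top1": ["fri headline", "sat headline", "sun headline", "mon headline"],
})

def keep_weekdays(df, date_col="Date"):
    """Return only the rows whose date falls on Monday to Friday."""
    day_of_week = pd.to_datetime(df[date_col]).dt.dayofweek  # Monday=0 ... Sunday=6
    return df[day_of_week < 5]

weekdays_only = keep_weekdays(toy)
print(weekdays_only["Date"].tolist())
```

Only the Friday and Monday rows survive the filter.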

In [10]:
#We save a copy as csv after dropping the label column and filtering out weekends.
X_df.to_csv('../data/X_features.csv', index=False)

# Let's proceed to obtain our Y variable (VIX price)

In [11]:
#Dataset consists of the Y variable (VIX price) from 2004 to 2020
price = pd.read_csv("../data/vixcurrent.csv") 

### Since the combined news headlines dataset runs from 8 August 2008 to 1 July 2016, we need to extract the same date range from the Y variable (VIX price).

In [12]:
#Index 1158 indicates price on 8th August 2008
price[price['Date'] == '8/8/2008'] 

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
1158,8/8/2008,21.15,21.69,20.11,20.66


In [13]:
#Index 3146 indicates price on 1st July 2016
price[price['Date'] == '7/1/2016']

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
3146,7/1/2016,15.59,15.86,14.61,14.77


In [14]:
#We know that the price range is between index 1158 and 3146
#Date is between 8 August 2008 and 1 July 2016
#.copy() gives an independent dataframe, so columns can be added later without a SettingWithCopyWarning
price = price.iloc[1158:3147, :].copy()
price
#We can see that there are 1989 rows between these dates, which matches the number of rows in our news headlines dataset.

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
1158,8/8/2008,21.15,21.69,20.11,20.66
1159,8/11/2008,20.66,20.96,19.66,20.12
1160,8/12/2008,20.64,21.51,20.38,21.17
1161,8/13/2008,21.57,22.11,20.80,21.55
1162,8/14/2008,22.30,22.30,20.07,20.34
...,...,...,...,...,...
3142,6/27/2016,24.38,26.72,22.93,23.85
3143,6/28/2016,21.76,22.07,18.75,18.75
3144,6/29/2016,18.12,18.27,16.48,16.64
3145,6/30/2016,16.91,16.99,15.29,15.63
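The slice above relies on looking up the two row positions by hand. As an alternative, the same range can be selected by parsing the dates and filtering on them. This is a minimal sketch under stated assumptions: the helper `select_date_range` and the dataframe `toy_price` are hypothetical, the rows outside the 2008-2016 window carry made-up values rather than real VIX quotes, and the `%m/%d/%Y` format is assumed from the dates shown in the output.

```python
import pandas as pd

# Toy stand-in for the VIX file; the first and last closes are made-up placeholders
toy_price = pd.DataFrame({
    "Date": ["8/7/2008", "8/8/2008", "8/11/2008", "7/1/2016", "7/5/2016"],
    "VIX Close": [0.0, 20.66, 20.12, 14.77, 0.0],
})

def select_date_range(df, start, end, date_col="Date", date_format="%m/%d/%Y"):
    """Return the rows whose date lies between start and end (both inclusive)."""
    parsed = pd.to_datetime(df[date_col], format=date_format)
    mask = parsed.between(pd.Timestamp(start), pd.Timestamp(end))
    return df.loc[mask].copy()

selected = select_date_range(toy_price, "2008-08-08", "2016-07-01")
print(selected)
```

Filtering on parsed dates keeps working if the source file gains or loses rows, whereas hard-coded positions would silently shift.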


In [None]:
#We will save a copy of this price dataframe for EDA purpose for our part 2 notebook
price.to_csv('../data/vix_price.csv', index=False)

In [18]:
#create the upordown label from the closing and opening prices
#1 if the closing price is higher than the opening price
#0 if the closing price is equal to or lower than the opening price
#stored as float (1.0 / 0.0), matching the output below
price['upordown'] = (price['VIX Close'] > price['VIX Open']).astype(float)

In [19]:
#The upordown column will be our Y variable for modelling, with a value of either 1 or 0.
price.head() 

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close,upordown
1158,8/8/2008,21.15,21.69,20.11,20.66,0.0
1159,8/11/2008,20.66,20.96,19.66,20.12,0.0
1160,8/12/2008,20.64,21.51,20.38,21.17,1.0
1161,8/13/2008,21.57,22.11,20.8,21.55,0.0
1162,8/14/2008,22.3,22.3,20.07,20.34,0.0


In [20]:
#We finally create the Y variables for the date range below. 
Y_feature = price.filter(['Date','upordown'], axis=1)
Y_feature.reset_index(drop=True, inplace=True)
Y_feature.head()

Unnamed: 0,Date,upordown
0,8/8/2008,0.0
1,8/11/2008,0.0
2,8/12/2008,1.0
3,8/13/2008,0.0
4,8/14/2008,0.0


In [27]:
#We merge the 2 dataframes on their index, attaching upordown (the VIX direction label) to the top 25 headlines for each date.
df = pd.merge(X_df, Y_feature, left_index=True, right_index=True, how='left')
#28 columns: the 26 columns of X_df plus Date and upordown from Y_feature
df.shape

(1989, 28)
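The index-based merge assumes that row i of the headlines and row i of the labels refer to the same date. A more defensive variant, sketched below on toy data, parses both date columns and merges on the date itself. The helper `merge_on_date` and the toy dataframes are hypothetical, and the two date formats are assumptions read off the outputs shown in this notebook.

```python
import pandas as pd

# Toy stand-ins (hypothetical headlines); the date formats mirror X_df and Y_feature
toy_news = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11", "2008-08-12"],
    "Top1": ["headline a", "headline b", "headline c"],
})
# Labels deliberately out of order to show that row position no longer matters
toy_labels = pd.DataFrame({
    "Date": ["8/12/2008", "8/8/2008", "8/11/2008"],
    "upordown": [1.0, 0.0, 0.0],
})

def merge_on_date(news, labels):
    """Left-merge labels onto news using parsed dates instead of row position."""
    news = news.assign(Date=pd.to_datetime(news["Date"], format="%Y-%m-%d"))
    labels = labels.assign(Date=pd.to_datetime(labels["Date"], format="%m/%d/%Y"))
    # validate raises an error if a date appears more than once on either side
    return news.merge(labels, on="Date", how="left", validate="one_to_one")

merged = merge_on_date(toy_news, toy_labels)
print(merged)
```

Merging on the date also yields a single `Date` column, so no `Date_x` / `Date_y` clean-up is needed afterwards.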

In [22]:
#We drop the duplicate 'Date_y' column created by the merge, as it is not required.
df.drop(columns=['Date_y'], inplace=True)
#We then rename the column Date_x back to Date.
df.rename(columns={"Date_x": "Date"}, inplace=True)

In [26]:
#Confirm the final columns: Date, Top1 to Top25 and upordown
df.columns

Index(['Date', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8',
       'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16',
       'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23', 'Top24',
       'Top25', 'upordown'],
      dtype='object')

In [23]:
#Finally we have our dataframe for modelling. Before that, let's proceed to part 2 for more EDA.

In [24]:
df.to_csv('../data/final_dataframe.csv', index=False)