# Title

The Efficient Market Hypothesis proposed by Fama [1965] states that stock market prices are driven by all observable information. In reality, it has been shown that investor sentiment can affect asset prices, due to the well-known psychological fact that investors with positive (negative) sentiment tend to make overly optimistic (pessimistic) judgments and decisions [Keynes, 1937].

## Problem statement : 

To create a sentiment analysis tool for traders, delivered as a Google Chrome extension, that uses news headlines to estimate the direction of the VIX (the "fear index"). We will use the sentiment of the news headlines to predict whether the VIX moves up or down on a given day.

# Feature Engineering process

This notebook performs feature engineering on our dataset, which consists of:

X Features : News headline titles (Range: 2008-08-08 to 2016-07-01)

Y variable : Whether the VIX is up or down on that day (Range: 2004 to 2020)





# Dataset : 

### 1. Data Csv : Combined_News_DJIA.csv

Link : https://www.kaggle.com/aaron7sun/stocknews

Combined_News_DJIA.csv:

Historical news headlines from the Reddit WorldNews channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are considered for a single date.
(Range: 2008-08-08 to 2016-07-01)

All news items are ranked from top to bottom based on how hot they are. Hence, there are 25 headlines for each date.
The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".

### Data Dictionary

The columns included in this dataset are:


|S/N|Label|Description|
|---|:--|:--|
|1|Date|Date of the headlines|
|2|Label|Binary label supplied with the dataset (not used as a feature in this notebook)|
|3|Top1 to Top25|The 25 top-ranked headlines for the date, ordered from hottest to least hot|


### 2. VIX historical data : VIX data from 2004 to 2020

Source : 
http://www.cboe.com/products/vix-index-volatility/vix-options-and-futures/vix-index/vix-historical-data

Content :

The columns included in this dataset are : 


|S/N|Label|Description|
|---|:--|:--|
|1|Date|Date of the VIX quote|
|2|VIX Open|Opening price of VIX for the date|
|3|VIX High|Highest price of VIX for the date|
|4|VIX Low|Lowest price of VIX for the date|
|5|VIX Close|Closing price of VIX for the date|

### 3. Process

Data collection - collect the data and create the news-label dataset (described in the previous section).

Text preprocessing - remove punctuation, stopwords and malformed words, then lowercase, lemmatize and finally tokenize the words.

Train test split - randomly shuffle the processed data and split it into an 80% training set and a 20% test set.

Create models for training - we defined these models:

1. Logistic Regression with CountVectorizer
2. Logistic Regression with TF-IDF
3. Naive Bayes with TF-IDF
4. Random Forest with TF-IDF
5. 3 layers of stacked LSTM
6. LSTM with Convolutional Neural Network for sequence classification

Hyperparameter tuning - tune the hyperparameters of each model.

Evaluate performance - use the tuned models to predict the test set and compare the performance of the models, using accuracy and F1 score as metrics.
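
The steps above can be sketched end to end on a tiny example. This is only an illustration under stated assumptions: the headlines and labels are made up, `clean_headline` is a hypothetical helper that covers lowercasing and punctuation removal only (no lemmatization), and a single TF-IDF plus Logistic Regression pipeline stands in for the full list of models.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def clean_headline(text):
    """Hypothetical helper: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


# Made-up toy data (assumption): 1 = VIX closed above its open, 0 = otherwise.
headlines = [
    "Markets tumble as crisis fears grow",
    "Bank collapse sparks panic selling",
    "War fears shake global investors",
    "Debt default looms over eurozone",
    "Terror attack rattles financial markets",
    "Stocks rally on strong earnings",
    "Economy grows faster than expected",
    "Peace deal signed after long talks",
    "Unemployment falls to record low",
    "Central bank signals steady outlook",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

cleaned = [clean_headline(h) for h in headlines]

# 80% training / 20% test, shuffled, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.2, random_state=42, stratify=labels
)

# TF-IDF features (English stopwords removed) feeding a logistic regression.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, zero_division=0)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

With so few rows the scores are not meaningful; the point is only the shape of the workflow (clean, split, vectorize, fit, score).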

In [1]:
# get some libraries that will be useful
import re
import numpy as np # linear algebra
import pandas as pd
import seaborn as sns
import string
import matplotlib.pyplot as plt
import pandas_datareader as dr
#To remove weekends from dataset
from pandas.tseries.offsets import BDay

# necessary libraries for wordcloud
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from PIL import Image

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder

#to filter out selected dates from dataset
import datetime


%matplotlib inline



In [2]:
combined_news = pd.read_csv("../data/Combined_News_DJIA.csv")
combined_news.tail()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
1984,2016-06-27,0,Barclays and RBS shares suspended from trading...,Pope says Church should ask forgiveness from g...,Poland 'shocked' by xenophobic abuse of Poles ...,"There will be no second referendum, cabinet ag...","Scotland welcome to join EU, Merkel ally says",Sterling dips below Friday's 31-year low amid ...,No negative news about South African President...,Surge in Hate Crimes in the U.K. Following U.K...,...,German lawyers to probe Erdogan over alleged w...,"Boris Johnson says the UK will continue to ""in...",Richard Branson is calling on the UK governmen...,Turkey 'sorry for downing Russian jet',Edward Snowden lawyer vows new push for pardon...,Brexit opinion poll reveals majority don't wan...,"Conservative MP Leave Campaigner: ""The leave c...","Economists predict UK recession, further weake...","New EU 'superstate plan by France, Germany: Cr...",Pakistani clerics declare transgender marriage...
1985,2016-06-28,1,"2,500 Scientists To Australia: If You Want To ...","The personal details of 112,000 French police ...",S&amp;P cuts United Kingdom sovereign credit r...,Huge helium deposit found in Africa,CEO of the South African state broadcaster qui...,"Brexit cost investors $2 trillion, the worst o...",Hong Kong democracy activists call for return ...,Brexit: Iceland president says UK can join 'tr...,...,"US, Canada and Mexico pledge 50% of power from...",There is increasing evidence that Australia is...,"Richard Branson, the founder of Virgin Group, ...","37,000-yr-old skull from Borneo reveals surpri...",Palestinians stone Western Wall worshipers; po...,Jean-Claude Juncker asks Farage: Why are you h...,"""Romanians for Remainians"" offering a new home...",Brexit: Gibraltar in talks with Scotland to st...,8 Suicide Bombers Strike Lebanon,Mexico's security forces routinely use 'sexual...
1986,2016-06-29,1,Explosion At Airport In Istanbul,Yemeni former president: Terrorism is the offs...,UK must accept freedom of movement to access E...,Devastated: scientists too late to captive bre...,British Labor Party leader Jeremy Corbyn loses...,A Muslim Shop in the UK Was Just Firebombed Wh...,Mexican Authorities Sexually Torture Women in ...,UK shares and pound continue to recover,...,"Escape Tunnel, Dug by Hand, Is Found at Holoca...",The land under Beijing is sinking by as much a...,Car bomb and Anti-Islamic attack on Mosque in ...,Emaciated lions in Taiz Zoo are trapped in blo...,Rupert Murdoch describes Brexit as 'wonderful'...,More than 40 killed in Yemen suicide attacks,Google Found Disastrous Symantec and Norton Vu...,Extremist violence on the rise in Germany: Dom...,BBC News: Labour MPs pass Corbyn no-confidence...,Tiny New Zealand town with 'too many jobs' lau...
1987,2016-06-30,1,Jamaica proposes marijuana dispensers for tour...,Stephen Hawking says pollution and 'stupidity'...,Boris Johnson says he will not run for Tory pa...,Six gay men in Ivory Coast were abused and for...,Switzerland denies citizenship to Muslim immig...,Palestinian terrorist stabs israeli teen girl ...,Puerto Rico will default on $1 billion of debt...,Republic of Ireland fans to be awarded medal f...,...,Googles free wifi at Indian railway stations i...,Mounting evidence suggests 'hobbits' were wipe...,The men who carried out Tuesday's terror attac...,Calls to suspend Saudi Arabia from UN Human Ri...,More Than 100 Nobel Laureates Call Out Greenpe...,British pedophile sentenced to 85 years in US ...,"US permitted 1,200 offshore fracks in Gulf of ...",We will be swimming in ridicule - French beach...,UEFA says no minutes of silence for Istanbul v...,Law Enforcement Sources: Gun Used in Paris Ter...
1988,2016-07-01,1,A 117-year-old woman in Mexico City finally re...,IMF chief backs Athens as permanent Olympic host,"The president of France says if Brexit won, so...",British Man Who Must Give Police 24 Hours' Not...,100+ Nobel laureates urge Greenpeace to stop o...,Brazil: Huge spike in number of police killing...,Austria's highest court annuls presidential el...,"Facebook wins privacy case, can track any Belg...",...,"The United States has placed Myanmar, Uzbekist...",S&amp;P revises European Union credit rating t...,India gets $1 billion loan from World Bank for...,U.S. sailors detained by Iran spoke too much u...,Mass fish kill in Vietnam solved as Taiwan ste...,Philippines president Rodrigo Duterte urges pe...,Spain arrests three Pakistanis accused of prom...,"Venezuela, where anger over food shortages is ...",A Hindu temple worker has been killed by three...,Ozone layer hole seems to be healing - US &amp...


In [3]:
combined_news.shape

(1989, 27)

In [4]:
features = [col for col in combined_news.columns if col != 'Label']
X_df = combined_news[features]

In [5]:
#change format to date time 
#X_df['Date'] = X_df['Date'].astype('datetime64[ns]')

# Next, we only want headlines from weekdays, when the market is trading, so we keep the weekday rows

In [6]:
#onOffset is deprecated in recent pandas versions; is_on_offset is its replacement
isBusinessDay = BDay().is_on_offset
match_series = pd.to_datetime(X_df['Date']).map(isBusinessDay)



In [7]:
#remove the weekends from the combined dataset
X_df = X_df[match_series]
X_df.shape #confirms that there are still 1989 rows, so this dataset contains no weekends

(1989, 26)
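
The weekday filter can be checked on a small frame. The sketch below uses a made-up four-row frame (an assumption for illustration) and `dt.dayofweek` instead of `BDay`; like the cell above, it removes weekends only and does not know about market holidays.

```python
import pandas as pd

# Hypothetical toy frame: 2008-08-08 is a Friday, so the next two dates are a weekend.
toy = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-09", "2008-08-10", "2008-08-11"],
    "Top1": ["headline a", "headline b", "headline c", "headline d"],
})

dates = pd.to_datetime(toy["Date"])
# dayofweek runs from 0 (Monday) to 6 (Sunday), so values below 5 are weekdays.
is_weekday = dates.dt.dayofweek < 5
weekdays_only = toy[is_weekday]
print(weekdays_only["Date"].tolist())
```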

# We save our final X features as X_features.csv for feature engineering

In [8]:
X_df.to_csv('../data/X_features.csv', index=False)

# Let's proceed to obtain our Y variable (VIX price)

We would like to obtain our Y variable as follows:

1. Whether the closing price is higher or lower than the opening price.

In [30]:
#Dataset consists of the Y variable (VIX price) from 2004 to 2020
price = pd.read_csv("../data/vixcurrent.csv") 

# We need to filter the Y variable to the dates covered by our X dataframe.

### Since our news headline dataset runs from August 8th, 2008 to July 1st, 2016, let's obtain this date range first.

In [31]:
price[price['Date'] == '8/8/2008'] #the output shows this is index 1158

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
1158,8/8/2008,21.15,21.69,20.11,20.66


In [32]:
#We will end our price data on 7/1/2016
price[price['Date'] == '7/1/2016'] #the output shows this is index 3146

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
3146,7/1/2016,15.59,15.86,14.61,14.77


In [33]:
#rows 1158 to 3146 inclusive (iloc excludes the end index 3147)
#Date is between 8 August 2008 and 1 July 2016
price.iloc[1158:3147, :]

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
1158,8/8/2008,21.15,21.69,20.11,20.66
1159,8/11/2008,20.66,20.96,19.66,20.12
1160,8/12/2008,20.64,21.51,20.38,21.17
1161,8/13/2008,21.57,22.11,20.80,21.55
1162,8/14/2008,22.30,22.30,20.07,20.34
...,...,...,...,...,...
3142,6/27/2016,24.38,26.72,22.93,23.85
3143,6/28/2016,21.76,22.07,18.75,18.75
3144,6/29/2016,18.12,18.27,16.48,16.64
3145,6/30/2016,16.91,16.99,15.29,15.63


In [34]:
price = price.iloc[ 1158:3147 , : ]
price

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
1158,8/8/2008,21.15,21.69,20.11,20.66
1159,8/11/2008,20.66,20.96,19.66,20.12
1160,8/12/2008,20.64,21.51,20.38,21.17
1161,8/13/2008,21.57,22.11,20.80,21.55
1162,8/14/2008,22.30,22.30,20.07,20.34
...,...,...,...,...,...
3142,6/27/2016,24.38,26.72,22.93,23.85
3143,6/28/2016,21.76,22.07,18.75,18.75
3144,6/29/2016,18.12,18.27,16.48,16.64
3145,6/30/2016,16.91,16.99,15.29,15.63
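
The slice above depends on row positions looked up by hand. As an alternative sketch, the same range can be selected by parsing the dates and filtering on them. The toy frame below is an assumption for illustration: three rows are copied from the outputs above, while the 8/7/2008 and 7/5/2016 rows are made up to show what falls outside the range.

```python
import pandas as pd

# Toy rows in the same M/D/YYYY format as the VIX file.
# The first and last rows are hypothetical values outside the wanted range.
toy_price = pd.DataFrame({
    "Date": ["8/7/2008", "8/8/2008", "8/11/2008", "7/1/2016", "7/5/2016"],
    "VIX Open": [20.50, 21.15, 20.66, 15.59, 15.00],
    "VIX Close": [21.00, 20.66, 20.12, 14.77, 15.50],
})

parsed = pd.to_datetime(toy_price["Date"], format="%m/%d/%Y")
start = pd.Timestamp("2008-08-08")
end = pd.Timestamp("2016-07-01")

# between() is inclusive on both ends by default.
in_range = parsed.between(start, end)
selected = toy_price[in_range].reset_index(drop=True)
print(selected)
```

Filtering on dates keeps working if the source file gains or loses rows, whereas fixed positions would silently shift.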


In [35]:
#We will save a copy of this price dataframe for EDA purpose for part 2.

In [15]:
price.to_csv('../data/vix_price.csv', index=False)

In [50]:
#create a new column for the difference between the closing and opening price
price['upordown'] = price['VIX Close'] - price['VIX Open']
#if the closing price is higher than the opening price, assign value 1
price['upordown'] = np.where(price['upordown'] > 0, 1, price['upordown'])
#if the closing price is equal to the opening price, assign value 0
price['upordown'] = np.where(price['upordown'] == 0, 0, price['upordown'])
#if the closing price is lower than the opening price, assign value 0
price['upordown'] = np.where(price['upordown'] < 0, 0, price['upordown'])

In [51]:
price.head() #the upordown column will be our Y variable for modelling

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close,upordown
1158,8/8/2008,21.15,21.69,20.11,20.66,0.0
1159,8/11/2008,20.66,20.96,19.66,20.12,0.0
1160,8/12/2008,20.64,21.51,20.38,21.17,1.0
1161,8/13/2008,21.57,22.11,20.8,21.55,0.0
1162,8/14/2008,22.3,22.3,20.07,20.34,0.0
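
The labelling rule (1 when the close is above the open, 0 when it is equal or lower) can also be written as a single comparison. The sketch below applies it to a toy frame: the first two rows are taken from the output above, and the third row is a made-up tie added to show that equal prices map to 0. It produces integer labels rather than the floats shown above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "VIX Open": [21.15, 20.64, 22.30],
    "VIX Close": [20.66, 21.17, 22.30],  # last row is a hypothetical tie
})

# 1 if the close is strictly above the open, otherwise 0.
toy["upordown"] = np.where(toy["VIX Close"] > toy["VIX Open"], 1, 0)
print(toy)
```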


In [52]:
#Finally, we create the Y variable for the date range selected above.
Y_feature = price.filter(['Date','upordown'], axis=1)
Y_feature.reset_index(drop=True, inplace=True)
Y_feature.head()

Unnamed: 0,Date,upordown
0,8/8/2008,0.0
1,8/11/2008,0.0
2,8/12/2008,1.0
3,8/13/2008,0.0
4,8/14/2008,0.0


In [57]:
#We merge the 2 dataframes on their row index, attaching upordown (the VIX direction) to the top 25 headlines for each date.
df = pd.merge(X_df, Y_feature, left_index=True, right_index=True, how='left')
df.columns

Index(['Date_x', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7',
       'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15',
       'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23',
       'Top24', 'Top25', 'Date_y', 'upordown'],
      dtype='object')
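
The index-based merge above assumes that row i of the headlines and row i of the VIX labels refer to the same day. A more defensive sketch, shown here on made-up toy frames (an assumption for illustration), parses both date formats and merges on the date itself, so any day without a VIX row shows up as a missing label instead of shifting every later row.

```python
import pandas as pd

# Hypothetical toy frames using the two date formats seen in this notebook.
news = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11", "2008-08-12"],
    "Top1": ["headline a", "headline b", "headline c"],
})
vix = pd.DataFrame({
    "Date": ["8/8/2008", "8/12/2008"],  # 8/11/2008 deliberately left out
    "upordown": [0.0, 1.0],
})

news["Date"] = pd.to_datetime(news["Date"], format="%Y-%m-%d")
vix["Date"] = pd.to_datetime(vix["Date"], format="%m/%d/%Y")

# validate raises an error if either side has duplicate dates.
merged = news.merge(vix, on="Date", how="left", validate="one_to_one")
unmatched = int(merged["upordown"].isna().sum())
print(merged)
print("rows without a VIX label:", unmatched)
```

Merging on `Date` also avoids the `Date_x` / `Date_y` pair, since the key column is shared.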

In [58]:
#We drop 'Date_y' column as it is not required. 
df.drop(columns=['Date_y'],inplace = True)
#We then rename the column Date_x into Date.
df.rename(columns={"Date_x": "Date"},inplace= True)

In [54]:
#Finally, we have our dataframe for modelling. Before modelling, let's proceed to part 2 for more EDA.

In [59]:
df.to_csv('../data/final_dataframe.csv', index=False)