# Project 4: Predicting Volatility Index price with Sentiment Analysis on News headlines

## Notebook 5 - Validation

In this portion of the notebook, we will perform a validation check on our:

1. Chosen Classifier     (3 stacked LSTM)
2. TradingSentiment Tool (Textblob)

We have retrieved a small sample of data from a new source, News API.

News API is a simple HTTP REST API for searching and retrieving live articles from all over the web. In this case, we have chosen to retrieve top news headlines.

Source  :  https://newsapi.org/docs/endpoints/top-headlines

**News headlines** consist of:

- Top 10 BBC Headlines
- Top 10 Google Headlines
- Top 10 Tech Crunch Headlines
- Top 20 Trump Headlines
- Top 20 UK headlines
- Top 20 US headlines
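
The retrieval step itself is not shown in this notebook. As a rough sketch of how such requests could be built against the top-headlines endpoint linked above, the snippet below only constructs the request URLs and does not call the API. The source ids (`bbc-news`, `google-news`, `techcrunch`), the `q="trump"` search term, the country codes and the `YOUR_API_KEY` placeholder are assumptions for illustration, not values taken from the original data collection.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/top-headlines"


def top_headlines_url(api_key, page_size, **filters):
    """Build a top-headlines request URL (hypothetical helper, no network call)."""
    params = {**filters, "pageSize": page_size, "apiKey": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


API_KEY = "YOUR_API_KEY"  # placeholder, replace with a real key

# Assumed mapping of the headline groups to query parameters.
# The endpoint does not allow `sources` to be mixed with `country`.
queries = {
    "BBC": top_headlines_url(API_KEY, 10, sources="bbc-news"),
    "Google": top_headlines_url(API_KEY, 10, sources="google-news"),
    "TechCrunch": top_headlines_url(API_KEY, 10, sources="techcrunch"),
    "Trump": top_headlines_url(API_KEY, 20, q="trump"),
    "UK": top_headlines_url(API_KEY, 20, country="gb"),
    "US": top_headlines_url(API_KEY, 20, country="us"),
}

for name, url in queries.items():
    print(name, url)
```

Each URL could then be fetched with any HTTP client; the JSON response carries the headlines in its `articles` list.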

In [1]:
# get some libraries that will be useful
import re
import numpy as np # linear algebra
import pandas as pd
import seaborn as sns
import string
import matplotlib.pyplot as plt
import pandas_datareader as dr
#To remove weekends from dataset
from pandas.tseries.offsets import BDay

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder



#keras modeling
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, SimpleRNN, GRU
from keras.layers.convolutional import Convolution1D
from keras import backend as K
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, roc_auc_score, accuracy_score

#Sentiment modelling
from textblob import TextBlob

#to filter out selected dates from dataset
import datetime


%matplotlib inline

Using TensorFlow backend.


In [2]:
Finaldf = pd.read_csv("../data/Final_df.csv")
Finaldf.head()

Unnamed: 0,publishedAt,title
0,3/16/2020,'History will not forgive us for waiting': San...
1,3/17/2020,"Singapore, Taiwan and Hong Kong Face Second Wa..."
2,3/18/2020,"Stock market live updates: Futures tumble, hit..."
3,3/19/2020,UK braces for coronavirus shut down as London ...


In [3]:
#Rename column publishedAt to Date
Finaldf.rename(columns={"publishedAt": "Date"},inplace = True)
Finaldf.head()

Unnamed: 0,Date,title
0,3/16/2020,'History will not forgive us for waiting': San...
1,3/17/2020,"Singapore, Taiwan and Hong Kong Face Second Wa..."
2,3/18/2020,"Stock market live updates: Futures tumble, hit..."
3,3/19/2020,UK braces for coronavirus shut down as London ...


In [4]:
# We need to import the VIX prices first

In [5]:
price = pd.read_csv("../data/vixcurrent.csv") 

In [6]:
price.tail()

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
4091,4/3/2020,51.11,52.29,46.74,46.8
4092,4/6/2020,44.17,45.73,43.45,45.24
4093,4/7/2020,44.83,47.51,43.51,46.7
4094,4/8/2020,45.9,47.28,42.53,43.35
4095,4/9/2020,43.0,45.73,41.39,41.67


In [7]:
price[price['Date'] == '3/16/2020'] #the output shows this row is at index 4077

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
4077,3/16/2020,57.83,83.56,57.83,82.69


In [8]:
price[price['Date'] == '3/19/2020'] #the output shows this row is at index 4080

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
4080,3/19/2020,80.62,84.26,68.57,72.0


In [9]:
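#.copy() gives an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
price = price.iloc[4077:4081, :].copy()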
price

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close
4077,3/16/2020,57.83,83.56,57.83,82.69
4078,3/17/2020,82.69,84.83,70.37,75.91
4079,3/18/2020,69.37,85.47,69.37,76.45
4080,3/19/2020,80.62,84.26,68.57,72.0
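
The slice above relies on row positions that were looked up by hand. As an alternative sketch, the same window can be selected by date, which keeps working if rows are added to the CSV. The small dataframe below is a stand-in built from rows shown in this notebook, not the full `vixcurrent.csv`, and the name `window` is introduced only for this example.

```python
import pandas as pd

# Stand-in for ../data/vixcurrent.csv, using rows shown in the outputs above.
price = pd.DataFrame({
    "Date": ["3/16/2020", "3/17/2020", "3/18/2020", "3/19/2020", "4/3/2020"],
    "VIX Open": [57.83, 82.69, 69.37, 80.62, 51.11],
    "VIX Close": [82.69, 75.91, 76.45, 72.0, 46.8],
})

# Parse the month/day/year strings so they compare as dates, not as text.
dates = pd.to_datetime(price["Date"], format="%m/%d/%Y")

# Series.between includes both end points by default.
mask = dates.between(pd.Timestamp("2020-03-16"), pd.Timestamp("2020-03-19"))

# .copy() returns an independent frame that is safe to add columns to.
window = price.loc[mask].copy()
print(window)
```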


In [10]:
#create a new column for the difference between the closing and opening price
price['upordown'] = price['VIX Close'] - price['VIX Open']
#if the closing price is higher than the opening price, assign value 1
price['upordown'] = np.where(price['upordown'] > 0, 1, price['upordown'])
#if the closing price is equal to the opening price, assign value 0
price['upordown'] = np.where(price['upordown'] == 0, 0, price['upordown'])
#if the closing price is lower than the opening price, assign value 0
price['upordown'] = np.where(price['upordown'] < 0, 0, price['upordown'])

In [11]:
price.head()

Unnamed: 0,Date,VIX Open,VIX High,VIX Low,VIX Close,upordown
4077,3/16/2020,57.83,83.56,57.83,82.69,1.0
4078,3/17/2020,82.69,84.83,70.37,75.91,0.0
4079,3/18/2020,69.37,85.47,69.37,76.45,1.0
4080,3/19/2020,80.62,84.26,68.57,72.0,0.0
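
The three `np.where` steps reduce to a single rule: the label is 1 when the close is above the open and 0 otherwise. The plain-Python sketch below applies that rule to the four rows shown above; the helper name `up_or_down` is made up for this example.

```python
def up_or_down(open_price, close_price):
    """Return 1 if the VIX closed above its open, otherwise 0."""
    return 1 if close_price > open_price else 0


# (Date, VIX Open, VIX Close) taken from the table above.
rows = [
    ("3/16/2020", 57.83, 82.69),
    ("3/17/2020", 82.69, 75.91),
    ("3/18/2020", 69.37, 76.45),
    ("3/19/2020", 80.62, 72.0),
]

labels = {date: up_or_down(open_price, close_price)
          for date, open_price, close_price in rows}
print(labels)
```

In pandas, the same rule could be written in one step as a comparison between the two columns, which would also avoid the intermediate float column.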


In [12]:
#We finally create the Y variables for the date range below. 
Y_feature = price.filter(['Date','upordown'], axis=1)
Y_feature.reset_index(drop=True, inplace=True)
Y_feature.head()

Unnamed: 0,Date,upordown
0,3/16/2020,1.0
1,3/17/2020,0.0
2,3/18/2020,1.0
3,3/19/2020,0.0


In [13]:
#We merge the two dataframes on their index, attaching upordown (the VIX direction label) to the top headlines for each date.
df = pd.merge(Finaldf, Y_feature, left_index=True, right_index= True)
#Display the result to confirm the columns have been merged successfully
df

Unnamed: 0,Date_x,title,Date_y,upordown
0,3/16/2020,'History will not forgive us for waiting': San...,3/16/2020,1.0
1,3/17/2020,"Singapore, Taiwan and Hong Kong Face Second Wa...",3/17/2020,0.0
2,3/18/2020,"Stock market live updates: Futures tumble, hit...",3/18/2020,1.0
3,3/19/2020,UK braces for coronavirus shut down as London ...,3/19/2020,0.0
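
The index-based merge works here because both dataframes happen to list the same four dates in the same order. A more defensive sketch is to join on the `Date` column, so that each headline row is matched to its own label even if the row order differs. The two small dataframes and the placeholder titles below are made up for illustration.

```python
import pandas as pd

# Placeholder headline rows, in chronological order.
headlines = pd.DataFrame({
    "Date": ["3/16/2020", "3/17/2020", "3/18/2020", "3/19/2020"],
    "title": ["headline A", "headline B", "headline C", "headline D"],
})

# The labels from the notebook, deliberately listed in reverse order.
labels = pd.DataFrame({
    "Date": ["3/19/2020", "3/18/2020", "3/17/2020", "3/16/2020"],
    "upordown": [0.0, 1.0, 0.0, 1.0],
})

# validate="one_to_one" raises an error if a date appears twice on either side.
merged = pd.merge(headlines, labels, on="Date", how="inner", validate="one_to_one")
print(merged)
```

Joining on the key also leaves a single `Date` column, so no `Date_x` / `Date_y` clean-up is needed afterwards.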


In [14]:
#We drop 'Date_y' column as it is not required. 
df.drop(columns=['Date_y'],inplace = True)
#We then rename the column Date_x into Date.
df.rename(columns={"Date_x": "Date"},inplace= True)

In [15]:
#We now have our final dataframe
df

Unnamed: 0,Date,title,upordown
0,3/16/2020,'History will not forgive us for waiting': San...,1.0
1,3/17/2020,"Singapore, Taiwan and Hong Kong Face Second Wa...",0.0
2,3/18/2020,"Stock market live updates: Futures tumble, hit...",1.0
3,3/19/2020,UK braces for coronavirus shut down as London ...,0.0


## Time for Modelling

Let's test on the days from 3/16/2020 to 3/19/2020.

In [16]:
X = df['title']
y = df['upordown']

In [17]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.50,stratify = y)

In [18]:
#num_words - This will be the maximum number of words 
#from our resulting tokenized data vocabulary which are to be used, 
#truncated after the 10000 most common words in our case.
tokenizer = Tokenizer(num_words=10000)
# Fit the tokenizer on our training headlines
tokenizer.fit_on_texts(X_train)
# Encode the headlines into integer sequences for both the training and validation data.
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test = tokenizer.texts_to_sequences(X_val)
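
To make the tokenizing step more concrete, here is a simplified plain-Python stand-in for what the tokenizer does: words are ranked by frequency in the training text, the most frequent word gets index 1, and words that were never seen during fitting are dropped when new text is encoded. This is an approximation for illustration only (the real `Tokenizer` uses its own punctuation filter), and the function names and example sentences are made up.

```python
import re
from collections import Counter


def fit_word_index(texts, num_words=None):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
    if num_words is not None:
        # Simplification: keep only indices below num_words.
        word_index = {w: i for w, i in word_index.items() if i < num_words}
    return word_index


def texts_to_sequences(texts, word_index):
    """Encode each text as a list of word indices, skipping unknown words."""
    return [
        [word_index[w] for w in re.findall(r"[a-z0-9']+", text.lower())
         if w in word_index]
        for text in texts
    ]


train = ["stocks fall as virus spreads", "virus fears hit stocks"]
word_index = fit_word_index(train, num_words=10000)
encoded = texts_to_sequences(["virus lockdown hits stocks"], word_index)
print(word_index)
print(encoded)
```

With only two training headlines, most words in the validation headlines will be unseen, so their encoded sequences can end up very short or empty.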

In [19]:
print('Pad sequences (samples x time)')

#Features for model training
#nb_classes - total number of classes.
nb_classes = 2
# maxlen is feature of maximum sequence length for padding our encoded sentences
maxlen = 200

# Pad the sequences, as we need our encoded sequences to be of the same length.
# With its default arguments, pad_sequences adds extra '0's at the start ('pre') and
# also truncates any sequence longer than maxlen from the start ('pre').
X_train = sequence.pad_sequences(sequences_train, maxlen=maxlen)
X_val = sequence.pad_sequences(sequences_test, maxlen=maxlen)

#convert them into array before we put them into model
y_train = np.array(y_train)
y_val = np.array(y_val)

# np_utils.to_categorical to convert array of labeled data(from 0 to nb_classes-1) to one-hot vector.
Y_train = np_utils.to_categorical(y_train, 2)
Y_val = np_utils.to_categorical(y_val, 2)

#print out X_train and X_test shape.
print('X_train shape:', X_train.shape)
print('X_val shape:', X_val.shape)
print('y_train shape:', Y_train.shape)
print('y_val shape:', Y_val.shape)

Pad sequences (samples x time)
X_train shape: (2, 200)
X_val shape: (2, 200)
y_train shape: (2, 2)
y_val shape: (2, 2)
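
As an illustration of the padding step, the sketch below mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) in plain Python: short sequences get zeros added at the front, and long sequences keep only their last `maxlen` entries. The helper name `pad_pre` and the toy sequences are made up for this example.

```python
def pad_pre(sequences, maxlen, value=0):
    """Pad at the front and truncate from the front, like the Keras defaults."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # truncating='pre': keep the last maxlen tokens
        padded.append([value] * (maxlen - len(kept)) + kept)  # padding='pre'
    return padded


toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

Passing `padding='post'` and `truncating='post'` to `pad_sequences` would instead add zeros and cut tokens at the end.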


In [20]:
print('Build LSTM model...')
# expected input data shape: (batch_size, timesteps, data_dim)
data_dim = 16
timesteps = 8
max_features = 10000
#intialize model
model = Sequential()
#Embedding layer mapping each word index to a 128-dimensional vector
model.add(Embedding(max_features, 128))
# returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True,input_shape=(timesteps, 16)))  
# returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True)) 
# return a single vector of dimension 32
model.add(LSTM(32))  
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
#Compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())

Build LSTM model...
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 32)          20608     
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 32)          8320      
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 66        
_________________________________________________________________
activation_1 (Activation)    (None, 2)                 0         
Total params: 1,317

In [21]:
# Train the model, using the held-out headlines as validation data
history = model.fit(X_train, Y_train,
          batch_size=64, epochs=5,
          validation_data=(X_val, Y_val))

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Train on 2 samples, validate on 2 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [22]:
model.predict_classes(X_val, verbose=0)

array([0, 0], dtype=int64)

## LSTM and Convolutional Neural Network For Sequence Classification

In [23]:
# create the model
model = Sequential()
#Embedding
model.add(Embedding(max_features, 128))
#Convolutional 1D layer 
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
#Maxpool 
model.add(MaxPooling1D(pool_size=2))
#LSTM
model.add(LSTM(100))
#Dense
model.add(Dense(2, activation='sigmoid'))
#Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())


Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 32)          12320     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 32)          0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 202       
Total params: 1,345,722
Trainable params: 1,345,722
Non-trainable params: 0
_________________________________________________________________
None
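
The parameter counts in the two model summaries can be reproduced by hand with the standard formulas for each layer type. The sketch below is only a sanity check of the `Param #` columns; the helper names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units + 1) * units)


def conv1d_params(in_channels, filters, kernel_size):
    # One kernel per filter, plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


stacked_lstm = [
    embedding_params(10000, 128),  # embedding_1
    lstm_params(128, 32),          # lstm_1
    lstm_params(32, 32),           # lstm_2
    lstm_params(32, 32),           # lstm_3
    dense_params(32, 2),           # dense_1
]

cnn_lstm = [
    embedding_params(10000, 128),  # embedding_2
    conv1d_params(128, 32, 3),     # conv1d_1
    lstm_params(32, 100),          # lstm_4
    dense_params(100, 2),          # dense_2
]

print(stacked_lstm, sum(stacked_lstm))
print(cnn_lstm, sum(cnn_lstm))
```

In both models the embedding layer accounts for almost all of the parameters, which is a lot to fit from two training samples.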


In [24]:
# Train the model, using the held-out headlines as validation data
history = model.fit(X_train, Y_train,
                    batch_size=64, 
                    epochs=5,
                    validation_data=(X_val, Y_val))

Train on 2 samples, validate on 2 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [25]:
#Training Accuracy Score on training dataset
print("Generating training accuracy...")
#We take an average of the training accuracy score
trainingacc6 = np.mean(history.history['accuracy'])
print('Training Accuracy Score: ',trainingacc6)

Generating training accuracy...
Training Accuracy Score:  0.85


In [26]:
print("Generating test predictions...")
score, acc = model.evaluate(X_val, Y_val,batch_size=64)
preds = model.predict_classes(X_val, verbose=0)
acc = accuracy_score(y_val, preds)
print('Prediction accuracy: ', acc)

Generating test predictions...
Prediction accuracy:  0.0


In [27]:
preds 

array([0, 1], dtype=int64)
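
Newer TensorFlow/Keras releases no longer provide `predict_classes` on `Sequential` models. A common replacement is to take the arg-max over the class probabilities returned by `model.predict`. The sketch below shows only that arg-max step; the probability values are made up, since a trained model is not available outside the notebook.

```python
import numpy as np

# Made-up class probabilities, shaped like the output of model.predict(X_val):
# one row per validation sample, one column per class (0 = not up, 1 = up).
probs = np.array([
    [0.62, 0.38],
    [0.41, 0.59],
])

# The predicted class is the column with the highest probability in each row.
preds = np.argmax(probs, axis=-1)
print(preds)
```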

In [28]:
# The stacked LSTM predicted 0, 0 and the CNN-LSTM predicted 0, 1 for the two validation days.
# A prediction accuracy of 0.0 means that neither CNN-LSTM prediction matched the true label.

Next, we will try TextBlob.

In [30]:
pol = lambda x : TextBlob(x).sentiment.polarity
df['textblobpol']  = df['title'].apply(pol)

In [31]:
df

Unnamed: 0,Date,title,upordown,textblobpol
0,3/16/2020,'History will not forgive us for waiting': San...,1.0,-0.031818
1,3/17/2020,"Singapore, Taiwan and Hong Kong Face Second Wa...",0.0,0.023502
2,3/18/2020,"Stock market live updates: Futures tumble, hit...",1.0,0.084927
3,3/19/2020,UK braces for coronavirus shut down as London ...,0.0,0.100088
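
The notebook does not show how the polarity scores are turned into an up-or-down call. One possible sketch, under the assumption that negative sentiment goes with a rising VIX (label 1) and non-negative sentiment with label 0, is shown below. The rule and the helper name `direction_from_polarity` are assumptions for illustration, not the method used in this project.

```python
def direction_from_polarity(polarity):
    """Assumed rule: negative sentiment -> VIX expected to rise (1), else 0."""
    return 1 if polarity < 0 else 0


# (Date, textblobpol, upordown) taken from the table above.
rows = [
    ("3/16/2020", -0.031818, 1.0),
    ("3/17/2020", 0.023502, 0.0),
    ("3/18/2020", 0.084927, 1.0),
    ("3/19/2020", 0.100088, 0.0),
]

predicted = [direction_from_polarity(polarity) for _, polarity, _ in rows]
actual = [int(label) for _, _, label in rows]

# Share of days on which the assumed rule agrees with the true label.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(rows)
print(predicted, actual, accuracy)
```

A different threshold or the opposite sign convention would give a different score, so any figure from this rule should be read as illustrative, especially with only four days of data.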


### Based on the results on our small fresh dataset from 16 March to 19 March, we can see that:

1. Our LSTM model produces an accuracy score of *****....... based on up or down.
2. TextBlob gives an accuracy score of *****....... based on sentiment polarity.