# Project 4: Predicting Volatility Index price with Sentiment Analysis on News headlines

### Notebook 4 : Sentiment Analysis Tool - Extent of the Direction  

This notebook uses two sentiment analysis tools to estimate the extent (how positive or negative) of the direction of the Volatility Index price:

1. Vader
2. Textblob

In [1]:
# get some libraries that will be useful

import re
import numpy as np # linear algebra
import pandas as pd
import seaborn as sns
import string
import matplotlib.pyplot as plt
import pandas_datareader as dr
#To remove weekends from dataset
from pandas.tseries.offsets import BDay

# necessary libraries for wordcloud
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from PIL import Image

# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
#words
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# function for encoding categories
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier, SGDRegressor,LogisticRegression
#keras modeling
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, SimpleRNN, GRU
from keras.layers.convolutional import Convolution1D
from keras import backend as K
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
#evaluation
from sklearn.metrics import f1_score



#Sentiment modelling
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from textblob import TextBlob


%matplotlib inline



Source : 
https://medium.com/@Intellica.AI/vader-ibm-watson-or-textblob-which-is-better-for-unsupervised-sentiment-analysis-db4143a39445

In [2]:
# Load the combined news headlines dataset
combined_news = pd.read_csv("../data/final_dataframe.csv")

In [3]:
combined_news.head()

Unnamed: 0,Date,all25,upordown
0,2008-08-08,"0,b""georgia 'downs two russian warplanes' as c...",0.0
1,2008-08-11,"1,b'why wont america and nato help us? if they...",0.0
2,2008-08-12,"0,b'remember that adorable 9-year-old who sang...",1.0
3,2008-08-13,"0,b' u.s. refuses israel weapons to attack ira...",0.0
4,2008-08-14,"1,b'all the experts admit that we should legal...",0.0


## Define a new dataframe that only includes the columns we want for sentiment analysis.

In [4]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df = combined_news[['Date','upordown','all25']].copy()

# VADER Sentiment Analysis

VADER is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

VADER relies on a sentiment lexicon: a list of lexical features (e.g., words) that are generally labeled according to their semantic orientation as either positive or negative.

VADER reports not only positivity and negativity scores but also how positive or negative a sentiment is.
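
To make the lexicon idea concrete, here is a minimal sketch of lexicon-based scoring in plain Python. The three-word lexicon, its valence values, and the `toy_lexicon_score` helper are made up for illustration; VADER's real lexicon is far larger, and its rules (negation, intensifiers, punctuation) are not reproduced here.

```python
# Toy lexicon: word -> valence. Hypothetical values, not taken from VADER.
TOY_LEXICON = {"good": 1.9, "crisis": -3.1, "attack": -2.1}


def toy_lexicon_score(text):
    """Sum the valence of every known word; unknown words contribute 0."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


# The sign gives the direction, the magnitude gives the extent.
print(toy_lexicon_score("good news today"))
print(toy_lexicon_score("crisis deepens after attack"))
```

The second sentence contains two negative words, so its total is more strongly negative than the first is positive. That magnitude is what "extent" refers to in this notebook.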

In [5]:
# Run VADER over each day's combined headlines (the 'all25' column).
sia = SIA()  # build the analyzer once instead of once per row
results = []

for headline in df['all25']:
    pol_score = sia.polarity_scores(headline)  # neg / neu / pos / compound scores
    pol_score['headline'] = headline  # keep this row's text next to its scores
    results.append(pol_score)

In [6]:
results

[{'neg': 0.214,
  'neu': 0.724,
  'pos': 0.062,
  'compound': -0.9966,
  'headline': ...},
 {'neg': 0.135,
  'neu': 0.773,
  'pos': 0.092,
  'compound': -0.9075,
  'headline': ...},
 ...

In [7]:
df['VaderScore'] = pd.DataFrame(results)['compound']



## A negative VaderScore indicates overall negative sentiment on that day, while a positive VaderScore indicates overall positive sentiment.

In [8]:
df.head() 

Unnamed: 0,Date,upordown,all25,VaderScore
0,2008-08-08,0.0,"0,b""georgia 'downs two russian warplanes' as c...",-0.9966
1,2008-08-11,0.0,"1,b'why wont america and nato help us? if they...",-0.9075
2,2008-08-12,1.0,"0,b'remember that adorable 9-year-old who sang...",-0.9739
3,2008-08-13,0.0,"0,b' u.s. refuses israel weapons to attack ira...",-0.9842
4,2008-08-14,0.0,"1,b'all the experts admit that we should legal...",-0.9774
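
All five compound scores above sit close to -1. One likely contributor is the way VADER squashes its summed word valences into the range [-1, 1]: to the best of my knowledge, the reference implementation normalizes a raw sum x as x / sqrt(x² + α) with α = 15. The sketch below reimplements that formula in plain Python. The raw sums fed in are hypothetical, not taken from this dataset, and the function is a stand-in rather than the library call itself.

```python
import math


def normalize(raw_sum, alpha=15):
    """Squash a raw valence sum into (-1, 1), mirroring VADER's compound formula."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Hypothetical raw sums: one mildly negative headline vs. many concatenated ones.
for raw_sum in (-2.0, -10.0, -40.0):
    print(raw_sum, round(normalize(raw_sum), 4))
```

Because each day's 25 headlines are scored as one long string, even a modest negative tilt per headline can push the compound score toward -1. Scoring each headline separately and averaging the results is one alternative worth considering.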


### Create another column, Vadersentimentscore, and compare it to the 'upordown' column

In [9]:
# Map the compound score to a predicted direction so it can be compared with 'upordown':
#   score > 0  -> 0 (positive sentiment, VIX expected to go down)
#   score <= 0 -> 1 (neutral or negative sentiment, VIX expected to go up)
df['Vadersentimentscore'] = np.where(df['VaderScore'] > 0, 0.0, 1.0)


In [10]:
df.head()

Unnamed: 0,Date,upordown,all25,VaderScore,Vadersentimentscore
0,2008-08-08,0.0,"0,b""georgia 'downs two russian warplanes' as c...",-0.9966,1.0
1,2008-08-11,0.0,"1,b'why wont america and nato help us? if they...",-0.9075,1.0
2,2008-08-12,1.0,"0,b'remember that adorable 9-year-old who sang...",-0.9739,1.0
3,2008-08-13,0.0,"0,b' u.s. refuses israel weapons to attack ira...",-0.9842,1.0
4,2008-08-14,0.0,"1,b'all the experts admit that we should legal...",-0.9774,1.0


## ACCURACY SCORE FOR VADER SENTIMENT ANALYSIS

In [41]:
vaderacc = accuracy_score(df['upordown'], df['Vadersentimentscore'])
vaderacc  # about 40% accuracy for Vader (0.398)

0.39819004524886875

In [42]:
# F1 score for vader sentiment analysis 
vaderf1 = f1_score(df['upordown'], df['Vadersentimentscore'], average='weighted')
vaderf1

0.24350903253953143

# TextBlob

TextBlob is a Python library built on top of NLTK. It is easy to use and offers a simple API for basic NLP tasks, including rule-based sentiment scoring.

Polarity is a score between -1 (negative sentiment) and +1 (positive sentiment).

In [12]:
pol = lambda x : TextBlob(x).sentiment.polarity
df['textblobpol']  = df['all25'].apply(pol)



In [13]:
df.head()

Unnamed: 0,Date,upordown,all25,VaderScore,Vadersentimentscore,textblobpol
0,2008-08-08,0.0,"0,b""georgia 'downs two russian warplanes' as c...",-0.9966,1.0,-0.048568
1,2008-08-11,0.0,"1,b'why wont america and nato help us? if they...",-0.9075,1.0,0.121956
2,2008-08-12,1.0,"0,b'remember that adorable 9-year-old who sang...",-0.9739,1.0,-0.04653
3,2008-08-13,0.0,"0,b' u.s. refuses israel weapons to attack ira...",-0.9842,1.0,0.011398
4,2008-08-14,0.0,"1,b'all the experts admit that we should legal...",-0.9774,1.0,0.040677


In [14]:
# Create a new column, textblobpolscore, to compare against 'upordown':
#   polarity > 0  -> 0 (positive sentiment, VIX expected to go down)
#   polarity <= 0 -> 1 (neutral or negative sentiment, VIX expected to go up)
df['textblobpolscore'] = np.where(df['textblobpol'] > 0, 0.0, 1.0)


In [15]:
df.head()

Unnamed: 0,Date,upordown,all25,VaderScore,Vadersentimentscore,textblobpol,textblobpolscore
0,2008-08-08,0.0,"0,b""georgia 'downs two russian warplanes' as c...",-0.9966,1.0,-0.048568,1.0
1,2008-08-11,0.0,"1,b'why wont america and nato help us? if they...",-0.9075,1.0,0.121956,0.0
2,2008-08-12,1.0,"0,b'remember that adorable 9-year-old who sang...",-0.9739,1.0,-0.04653,1.0
3,2008-08-13,0.0,"0,b' u.s. refuses israel weapons to attack ira...",-0.9842,1.0,0.011398,0.0
4,2008-08-14,0.0,"1,b'all the experts admit that we should legal...",-0.9774,1.0,0.040677,0.0


In [16]:
#compare the score between upordown to textblob
#53% of textblobpolscore matches upordown.
textblobacc = accuracy_score(df['upordown'], df['textblobpolscore'])
textblobacc #53% accuracy for Textblob 

0.5349421820010055

In [43]:
#to calculate f1 score for textblob
textblobf1 = f1_score(df['upordown'], df['textblobpolscore'], average='weighted')
textblobf1

0.5244319997460022

# Let's check the positive and negative accuracy for TextBlob and Vader

In [17]:
# Count the actual positive and negative labels in the 'upordown' column
# rows with value 1.0 in upordown (positive counts)
totalpositivecounts = list(df['upordown']).count(1.0)
print("Total positive counts :",totalpositivecounts)
# rows with value 0.0 in upordown (negative counts)
totalnegativecounts = list(df['upordown']).count(0.0)
print("Total negative counts :",totalnegativecounts)

Total positive counts : 788
Total negative counts : 1201
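
With classes this imbalanced, it helps to know what a trivial classifier would score before reading the accuracy figures. The sketch below uses only the two counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the output above.
total_up = 788      # days labelled 1.0 in 'upordown'
total_down = 1201   # days labelled 0.0 in 'upordown'

# Accuracy obtained by always predicting the majority class (0.0 here).
majority_baseline = max(total_up, total_down) / (total_up + total_down)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

A model's overall accuracy is easier to interpret when read next to this baseline.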


### Positive Accuracy and Negative Accuracy Counts for Vader

In [18]:
# Count Vader's true positives (predicted 1.0 and actual 1.0)
vaderppc = df[(df.Vadersentimentscore == 1.0) & (df.upordown == 1.0)].count() #771 counts
vaderppc['VaderScore']

771

In [19]:
#Vader positive accuracy score (Sensitivity  - true positive rate)
vaderpositiveaccuracy =  vaderppc['VaderScore']/totalpositivecounts #97.8%
vaderpositiveaccuracy

0.9784263959390863

In [20]:
# Count Vader's true negatives (predicted 0.0 and actual 0.0)
vaderpnc = df[(df.Vadersentimentscore == 0.0) & (df.upordown == 0.0)].count() #21 counts
vaderpnc['VaderScore']

21

In [21]:
#Vader negative accuracy score (Specificity - true negative rate)
vadernegativeaccuracy = vaderpnc['VaderScore'] /totalnegativecounts #1.7%
vadernegativeaccuracy

0.017485428809325562

### Positive Accuracy and Negative Accuracy Counts for TextBlob

In [22]:
# Count TextBlob's true positives (predicted 1.0 and actual 1.0)
textblobppc = df[(df.textblobpolscore == 1.0) & (df.upordown == 1.0)].count()
textblobppc['textblobpolscore']

252

In [23]:
#textblob positive accuracy score  (Sensitivity  - true positive rate)
textblobpositiveaccuracy =  textblobppc['textblobpolscore']/totalpositivecounts 
textblobpositiveaccuracy #32.0% accuracy

0.3197969543147208

In [24]:
# Count TextBlob's true negatives (predicted 0.0 and actual 0.0)
textblobpnc = df[(df.textblobpolscore == 0.0) & (df.upordown == 0.0)].count() #812 counts
textblobpnc['textblobpolscore']

812

In [25]:
#textblob negative accuracy score  (Specificity - true negative rate)
textblobnegativeaccuracy =  textblobpnc['textblobpolscore']/totalnegativecounts 
textblobnegativeaccuracy #67.6% accuracy

0.6761032472939217

# Summary of Accuracy

In [44]:
# Set up the columns of the summary dataframe

#Model type
Model               = ['Vader','TextBlob']
#Dataset Accuracy Scores
Accuracy            = [vaderacc,textblobacc]
#Positive Accuracy Scores
Sensitivity_true_positive_rate = [vaderpositiveaccuracy,textblobpositiveaccuracy]
#Negative Accuracy Scores
Specificity_true_negative_rate  = [vadernegativeaccuracy,textblobnegativeaccuracy]
#F1 scores 
F1scores = [vaderf1,textblobf1]

In [45]:
summary = pd.DataFrame(
    {'Model'                           :  Model,
     'Dataset Accuracy Scores'         :  Accuracy,
     'Sensitivity_true_positive_rate'  :  Sensitivity_true_positive_rate,
     'Specificity_true negative rate'  :  Specificity_true_negative_rate,
     'F1 scores'                       :  F1scores})
  
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
summary  # rows stay in insertion order: Vader, then TextBlob

  


Unnamed: 0,Model,Dataset Accuracy Scores,Sensitivity_true_positive_rate,Specificity_true negative rate,F1 scores
0,Vader,0.39819,0.978426,0.017485,0.243509
1,TextBlob,0.534942,0.319797,0.676103,0.524432
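
The figures in this table can be cross-checked from just four numbers per model: the true-positive and true-negative counts and the two class totals reported earlier. The sketch below is a plain-Python recomputation, not the scikit-learn code used above; `metrics_from_counts` is a hypothetical helper, and the weighted F1 is assumed to be the support-weighted mean of the two per-class F1 scores, which is how `average='weighted'` is documented.

```python
def metrics_from_counts(tp, tn, total_pos, total_neg):
    """Return accuracy, sensitivity, specificity and support-weighted F1."""
    fn = total_pos - tp   # actual 1.0 predicted as 0.0
    fp = total_neg - tn   # actual 0.0 predicted as 1.0
    total = total_pos + total_neg

    accuracy = (tp + tn) / total
    sensitivity = tp / total_pos   # true positive rate
    specificity = tn / total_neg   # true negative rate

    # Per-class F1 = 2*TP / (2*TP + FP + FN), treating each class as "positive" in turn.
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    weighted_f1 = (total_pos * f1_pos + total_neg * f1_neg) / total
    return accuracy, sensitivity, specificity, weighted_f1


# Counts reported earlier in this notebook.
vader = metrics_from_counts(tp=771, tn=21, total_pos=788, total_neg=1201)
textblob = metrics_from_counts(tp=252, tn=812, total_pos=788, total_neg=1201)
print("Vader   :", [round(v, 6) for v in vader])
print("TextBlob:", [round(v, 6) for v in textblob])
```

Working from the counts also makes the error pattern visible: for Vader, `fp` is 1180 of the 1201 "down" days, which is why its specificity is so low.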


# Conclusion 

Based on our findings, Vader achieves an overall accuracy of about 40%. It predicts 97.8% of the days correctly when the VIX goes up (value 1) but only 1.7% when the VIX goes down (value 0), which means it labels almost every day as "up".

On the other hand, TextBlob predicts 32.0% of the days correctly when the VIX goes up and 67.6% when the VIX goes down.

Comparing F1 scores (0.52 for TextBlob versus 0.24 for Vader), we chose TextBlob as our model; it also achieves the higher overall accuracy (53% versus 40%).