# Simple News Sentiment Analysis

In [1]:
import json
import nltk
import numpy as np
import pandas as pd
import pickle
import pprint
import re

from tqdm import tqdm

## Load Data

In [2]:
with open('../MarketWatch-Web-Crawler/latest.json') as json_file:
    data = json.load(json_file)

## Preprocess Data
#### Download NLP Corpora
Two corpora are needed: the stopwords corpus for removing stopwords, and WordNet for lemmatizing.

In [3]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/shiyu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/shiyu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords

wnl = nltk.stem.WordNetLemmatizer()

lemma_english_stopwords = [wnl.lemmatize(x) for x in stopwords.words('english')]

def preprocess(paragraphs):
    message = ' '.join(paragraphs)
    
    # Lowercase the message
    text = message.lower()
    
    # Replace everything not a letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize tokens (skipping single-character tokens), reusing the
    # module-level lemmatizer, then drop lemmatized stopwords
    tokens = [wnl.lemmatize(x) for x in tokens if len(x) > 1]
    tokens = [x for x in tokens if x not in lemma_english_stopwords]
    
    return tokens

In [5]:
for d in data:
    if d['frontpage_summary']['label'] == 'flash headline':
        d['token'] = preprocess(d['frontpage_summary']['headline'])
    else:
        d['token'] = preprocess(d['paragraphs'])

#### Example of Tokenized Article

In [6]:
data[50]['token'][:20]

['courtship',
 'aurora',
 'cannabis',
 'inc',
 'reliva',
 'began',
 'many',
 'romance',
 'gathering',
 'industry',
 'bigwig',
 'banker',
 'quite',
 'love',
 'first',
 'sight',
 'well',
 'ahead',
 'first',
 'meeting']

#### Why do I tokenize the data when using VADER?

Punctuation, capitalization, and intensifiers are used far less in serious news articles than in social media posts, so I still lowercase and lemmatize the text and rely only on the word-level sentiment from this package.
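To make the trade-off concrete, here is a minimal sketch of the lowercasing and letter-only filtering steps from `preprocess` (lemmatization and stopword removal are left out, and `strip_cues` is a hypothetical helper name used only for this illustration). It shows that an emphatic sentence and a plain one collapse to the same string, so the punctuation and capitalization cues never reach the scorer.

```python
import re

def strip_cues(text):
    """Mimic only the lowercasing and letter-only filtering of preprocess()."""
    text = text.lower()
    # Replace everything that is not a letter with a space
    text = re.sub(r'[^a-z]', ' ', text)
    # Collapse runs of whitespace
    return ' '.join(text.split())

emphatic = "I LOVE pizza!!"
plain = "i love pizza"

# Both variants reduce to the same string after these two steps
print(strip_cues(emphatic) == strip_cues(plain))
```

Whether losing these cues matters depends on how often the source text uses them; for news copy I assume it is rare.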

# Analysis on News

## VaderSentiment
https://github.com/cjhutto/vaderSentiment#python-demo-and-code-examples

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 

The threshold values it uses are:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
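The thresholds above can be written as a small helper. This is only a sketch: `label_from_compound` and its `threshold` parameter are names I made up for illustration, not part of the VADER package.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

for score in (0.9973, -0.6249, 0.01):
    print(score, label_from_compound(score))
```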

VADER judges sentiment using:
- Punctuation: "I love pizza" vs. "I love pizza!!"
- Capitalization: "I’m hungry!!" vs. "I’M HUNGRY!!"
- Degree modifiers (use of intensifiers): "I WANT TO EAT!!" vs. "I REALLY WANT TO EAT!!"
- Conjunctions (a shift in sentiment polarity, with the later clause dictating polarity): "I love pizza, but I really hate Pizza Hut" (bad review)
- Preceding tri-gram (identifying reversed polarity by examining the tri-gram before the lexical feature): "Canadian Pizza is not really all that great."


VADER focuses on social media and short texts, while financial news is almost the opposite. So I updated the VADER lexicon with words and sentiments from the Loughran-McDonald Financial Sentiment Word Lists.

## Loughran McDonald Sentiment Word Lists
data source: https://sraf.nd.edu/textual-analysis/resources/

The Loughran-McDonald Sentiment Word Lists categorize words into 7 sentiments. They are built exclusively from corporate disclosures.

- Negative
- Positive
- Uncertainty
- Litigious
- StrongModal
- WeakModal
- Constraining


#### Load Loughran McDonald Sentiment Words

In [7]:
import os

sentiments = ['Negative', 'Positive', 'Uncertainty', 'Litigious', 'StrongModal', 'WeakModal', 'Constraining']

# read sentiment words
path_LM = os.path.join('data', 'LoughranMcDonald_SentimentWordLists_2018.xlsx')
dict_LM = {} # sentiment: [word list]
for s in sentiments:
    df_temp = pd.read_excel(path_LM, sheet_name=s, header=None)
    dict_LM[s] = np.array(df_temp.iloc[:,0])
    
# print out sentiment+words
for s, word_list in dict_LM.items():
    print(s,':\t', word_list[:5], '...')

Negative :	 ['ABANDON' 'ABANDONED' 'ABANDONING' 'ABANDONMENT' 'ABANDONMENTS'] ...
Positive :	 ['ABLE' 'ABUNDANCE' 'ABUNDANT' 'ACCLAIMED' 'ACCOMPLISH'] ...
Uncertainty :	 ['ABEYANCE' 'ABEYANCES' 'ALMOST' 'ALTERATION' 'ALTERATIONS'] ...
Litigious :	 ['ABOVEMENTIONED' 'ABROGATE' 'ABROGATED' 'ABROGATES' 'ABROGATING'] ...
StrongModal :	 ['ALWAYS' 'BEST' 'CLEARLY' 'DEFINITELY' 'DEFINITIVELY'] ...
WeakModal :	 ['ALMOST' 'APPARENTLY' 'APPEARED' 'APPEARING' 'APPEARS'] ...
Constraining :	 ['ABIDE' 'ABIDING' 'BOUND' 'BOUNDED' 'COMMIT'] ...


#### Update VADER

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

new_words = {}
for w in dict_LM['Negative']:
    new_words[w.lower()] = -1
for w in dict_LM['Positive']:
    new_words[w.lower()] = 1

sia = SentimentIntensityAnalyzer()

sia.lexicon.update(new_words)
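One thing to keep in mind about this step: `dict.update` overwrites entries that already exist, and VADER's own lexicon rates words on a scale of roughly -4 to +4, so a word VADER already knows is replaced by a milder ±1. The sketch below illustrates this with a toy dictionary standing in for `sia.lexicon`. The helper `build_lexicon_update`, its `weight` parameter, the sample words, and the toy valences are all assumptions made for illustration; they are not checked against the real lists.

```python
def build_lexicon_update(negative_words, positive_words, weight=1.0):
    """Map word lists to valences: -weight for negative, +weight for positive."""
    update = {}
    for w in negative_words:
        update[w.lower()] = -weight
    for w in positive_words:
        update[w.lower()] = weight
    return update

# Toy stand-in for sia.lexicon; words and valences are made up for illustration
toy_lexicon = {'good': 1.9, 'crisis': -3.1}

# Placeholder word lists in the same uppercase style as the spreadsheet
update = build_lexicon_update(['CRISIS', 'IMPAIRMENT'], ['GOOD', 'PROFITABLE'])
toy_lexicon.update(update)

# Existing entries are overwritten, new ones are added
print(toy_lexicon)
```

Raising `weight` would be one way to let the financial words count more strongly relative to VADER's built-in entries.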

## Sentiment Results

In [9]:
for d in data:
    d['sentiment_scores'] = sia.polarity_scores(' '.join(d['token']))
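Because each article is scored as one long string, many word valences are summed before normalization. As far as I understand VADER's implementation, the compound score is the summed valence x passed through x / sqrt(x² + alpha) with alpha = 15, which pushes long texts toward ±1. The sketch below re-implements only that normalization step; it is my own re-implementation under that assumption, not a call into the library.

```python
import math

def normalize(score, alpha=15):
    """Squash a summed valence into (-1, 1); alpha=15 is assumed to match VADER's default."""
    return score / math.sqrt(score * score + alpha)

# A handful of sentiment-bearing words already gives a strong compound score;
# a long article with a large net sum ends up very close to 1
for total in (1, 3, 10, 50):
    print(total, round(normalize(total), 4))
```

This is worth remembering when reading compound scores for full articles: values near ±1 may reflect article length as much as strength of sentiment.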

Let's print the scores for a random selection of articles:

In [10]:
import random

# random.choices samples with replacement, so the same article can appear twice
for d in random.choices(data,k=20):
    print("headline: " + d['frontpage_summary']['headline'])
    print("\t",d['sentiment_scores'])

headline: It’s ‘no fun’ to be a small fund manager most of the time, Jan van Eck says — but right now is an exception
	 {'neg': 0.087, 'neu': 0.738, 'pos': 0.175, 'compound': 0.9973}
headline: Virgin Orbit’s first test launch postponed
	 {'neg': 0.057, 'neu': 0.943, 'pos': 0.0, 'compound': -0.6249}
headline: It’s ‘no fun’ to be a small fund manager most of the time, Jan van Eck says — but right now is an exception
	 {'neg': 0.087, 'neu': 0.738, 'pos': 0.175, 'compound': 0.9973}
headline: Is your city reopening after coronavirus lockdown? Scientists say avoid these places
	 {'neg': 0.096, 'neu': 0.838, 'pos': 0.066, 'compound': -0.9638}
headline: ‘I owe child support from my first marriage and did not receive a stimulus check. Does Trump not realize I have another family to take care of?’
	 {'neg': 0.127, 'neu': 0.667, 'pos': 0.206, 'compound': 0.9935}
headline: Shocking before-and-after photo shows what the coronavirus did to an otherwise healthy man in just 6 weeks
	 {'neg': 0.153, 'n

## Conclusion
For most articles, the scores are dominated by the neutral component. But the relative values of the negative and positive scores do shed some light on the sentiment.
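One simple way to compare the two is the difference between the `pos` and `neg` proportions. The helper below is a sketch; `net_sentiment` is a name I made up, and the two sample dictionaries are copied from the printed results above.

```python
def net_sentiment(scores):
    """Positive minus negative proportion from a VADER polarity_scores dict."""
    return scores['pos'] - scores['neg']

samples = [
    {'neg': 0.087, 'neu': 0.738, 'pos': 0.175, 'compound': 0.9973},
    {'neg': 0.057, 'neu': 0.943, 'pos': 0.0, 'compound': -0.6249},
]

for s in samples:
    print(round(net_sentiment(s), 3), s['compound'])
```

For these two articles the sign of the difference agrees with the sign of the compound score, even though `neu` is by far the largest component in both.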

#### Here are the next steps I could follow:
1. Parse articles over a longer period of time and from more sources
2. Link the article sentiment analysis to the referenced stocks
3. Use stock sentiment to predict future returns
4. Construct alpha signals from the stock sentiment scores