# Simple News Sentiment Analysis

In [1]:
import json
import nltk
import numpy as np
import pandas as pd
import pickle
import pprint
import re

from tqdm import tqdm

## Load Data

In [2]:
with open('../MarketWatch-Web-Crawler/latest.json') as json_file:
    data = json.load(json_file)

#### Example of Parsed Article

In [3]:
display(data[0])

{'frontpage_summary': {'label': 'flash headline',
  'headline': "Cummings cites wife's symptoms and child-care needs as reasons controversial London departure",
  'article_summary': None,
  'timestamp': '2020-05-25T11:41:56'}}

In [4]:
display(data[100])

{'frontpage_summary': {'url': 'https://www.marketwatch.com/story/san-francisco-fire-destroys-fishermans-wharf-warehouse-early-saturday-2020-05-23?mod=newsviewer_click_seemore',
  'headline': 'San Francisco fire destroys Fisherman’s Wharf warehouse early Saturday',
  'article_summary': 'A fire engulfed a warehouse on San Francisco’s Fisherman’s Wharf early Saturday, sending a thick plume of smoke over the waterfront and threatening to spread to a historic World War II–era ship before firefighters brought ...',
  'label': 'Updated'},
 'category': ['General', 'Associated Press'],
 'header': {'headline': 'San Francisco fire destroys Fisherman’s Wharf warehouse early Saturday',
  'sub_headline': '',
  'time_publish': 'Published: May 23, 2020 at 8:38 p.m. ET',
  'time_update': ''},
 'author': {'author_name': 'Associated Press',
  'author_link': 'https://www.marketwatch.com/column/associated-press?mod=MW_author_byline'},
 'referenced_tickers': {},
 'paragraphs': ['SAN FRANCISCO (AP) — A fire 

## Preprocess Data
#### Download NLP Corpora
Two corpora are needed: the stopwords corpus for removing stopwords and WordNet for lemmatizing.

In [5]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/shiyu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/shiyu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
from nltk.corpus import stopwords

wnl = nltk.stem.WordNetLemmatizer()

lemma_english_stopwords = [wnl.lemmatize(x) for x in stopwords.words('english')]

def preprocess(paragraphs):
    message = ' '.join(paragraphs)
    
    # Lowercase the message
    text = message.lower()
    
    # Replace everything not a letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize words (reusing the module-level lemmatizer) and drop single-character tokens
    tokens = [wnl.lemmatize(x) for x in tokens if len(x) > 1]

    # Remove stopwords, comparing against their lemmatized forms
    tokens = [x for x in tokens if x not in lemma_english_stopwords]
    
    return tokens

In [7]:
for d in data:
    if d['frontpage_summary']['label'] == 'flash headline':
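        # Flash headlines have no body text. Wrap the headline in a list:
        # preprocess() calls ' '.join(...), which would split a bare string
        # into single characters that the length filter then discards.
        d['token'] = preprocess([d['frontpage_summary']['headline']])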
    else:
        d['token'] = preprocess(d['paragraphs'])

#### Example of Tokenized Article

In [8]:
data[50]['token']

['courtship',
 'aurora',
 'cannabis',
 'inc',
 'reliva',
 'began',
 'many',
 'romance',
 'gathering',
 'industry',
 'bigwig',
 'banker',
 'quite',
 'love',
 'first',
 'sight',
 'well',
 'ahead',
 'first',
 'meeting',
 'conference',
 'run',
 'investment',
 'bank',
 'aurora',
 'acb',
 'acb',
 'shopping',
 'way',
 'enter',
 'market',
 'time',
 'saying',
 'publicly',
 'earnings',
 'call',
 'interview',
 'marketwatch',
 'took',
 'aurora',
 'month',
 'seriously',
 'vet',
 'reliva',
 'acquisition',
 'target',
 'chief',
 'executive',
 'company',
 'told',
 'marketwatch',
 'telephone',
 'interview',
 'week',
 'month',
 'first',
 'meeting',
 'aurora',
 'executive',
 'team',
 'flew',
 'boston',
 'met',
 'reliva',
 'company',
 'specializes',
 'cannabidiol',
 'cbd',
 'hour',
 'boss',
 'aurora',
 'reliva',
 'visited',
 'wholesale',
 'brick',
 'mortar',
 'store',
 'talked',
 'business',
 'aurora',
 'interim',
 'ceo',
 'michael',
 'singer',
 'telling',
 'marketwatch',
 'learned',
 'enough',
 'two',
 'd

#### Why Tokenize the Data When Using VADER?

VADER also scores punctuation, capitalization and intensifiers, but these cues are much rarer in serious news articles than in social media posts. I therefore still lowercase and lemmatize the text, so that the scores rely mainly on the sentiment of the words themselves.
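
To see why those surface cues disappear, here is a minimal sketch that mirrors only the lowercasing and letters-only steps of `preprocess` above. Lemmatization and stopword removal are left out so that it runs without any NLTK data, and the helper name `strip_surface_cues` is made up for this illustration.

```python
import re

def strip_surface_cues(text):
    """Lowercase the text and keep letters only, as preprocess() does first."""
    lowered = text.lower()
    letters_only = re.sub(r'[^a-z]', ' ', lowered)
    return letters_only.split()

# Capitalization and exclamation marks no longer distinguish the two inputs.
shouted = strip_surface_cues("Stocks SOAR!!!")
plain = strip_surface_cues("stocks soar")
print(shouted, plain, shouted == plain)
```

After this step, any emphasis VADER would have read from casing or punctuation is gone, which is the intended effect here.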

# Analysis on News

## VaderSentiment
https://github.com/cjhutto/vaderSentiment#python-demo-and-code-examples

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. 

The threshold values it uses are:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
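
As a small illustration, the thresholds above can be written as a labelling helper. The function name `label_from_compound` and its `threshold` argument are hypothetical choices for this sketch and are not part of the VADER package.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Try a few compound scores on either side of the thresholds.
for score in (0.9871, -0.2547, 0.01):
    print(score, label_from_compound(score))
```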

VADER judges sentiment using:
- Punctuation: "I love pizza" vs "I love pizza!!"
- Capitalization: "I’m hungry!!" vs "I’M HUNGRY!!"
- Degree modifiers (use of intensifiers): "I WANT TO EAT!!" vs "I REALLY WANT TO EAT!!"
- Conjunctions (a shift in sentiment polarity, with the later clause dictating polarity): "I love pizza, but I really hate Pizza Hut" (bad review)
- Preceding tri-gram (identifying reversed polarity by examining the tri-gram before the lexical feature): "Canadian Pizza is not really all that great."


VADER is designed for social media and other short texts, while financial news is almost the opposite. I therefore update the VADER lexicon with the words and sentiments from the Loughran-McDonald Financial Sentiment Word Lists.

## Loughran McDonald Sentiment Word Lists
data source: https://sraf.nd.edu/textual-analysis/resources/

The Loughran McDonald Sentiment Word Lists categorize words into seven sentiments. They were built exclusively from corporate disclosures.

- Negative
- Positive
- Uncertainty
- Litigious
- StrongModal
- WeakModal
- Constraining


#### Load Loughran McDonald Sentiment Words

In [9]:
import os

sentiments = ['Negative', 'Positive', 'Uncertainty', 'Litigious', 'StrongModal', 'WeakModal', 'Constraining']

# read sentiment words
path_LM = os.path.join('data', 'LoughranMcDonald_SentimentWordLists_2018.xlsx')
dict_LM = {} # sentiment: [word list]
for s in sentiments:
    df_temp = pd.read_excel(path_LM, sheet_name=s, header=None)
    dict_LM[s] = np.array(df_temp.iloc[:,0])
    
# print out sentiment+words
for s, word_list in dict_LM.items():
    print(s,':\t', word_list[:5], '...')

Negative :	 ['ABANDON' 'ABANDONED' 'ABANDONING' 'ABANDONMENT' 'ABANDONMENTS'] ...
Positive :	 ['ABLE' 'ABUNDANCE' 'ABUNDANT' 'ACCLAIMED' 'ACCOMPLISH'] ...
Uncertainty :	 ['ABEYANCE' 'ABEYANCES' 'ALMOST' 'ALTERATION' 'ALTERATIONS'] ...
Litigious :	 ['ABOVEMENTIONED' 'ABROGATE' 'ABROGATED' 'ABROGATES' 'ABROGATING'] ...
StrongModal :	 ['ALWAYS' 'BEST' 'CLEARLY' 'DEFINITELY' 'DEFINITIVELY'] ...
WeakModal :	 ['ALMOST' 'APPARENTLY' 'APPEARED' 'APPEARING' 'APPEARS'] ...
Constraining :	 ['ABIDE' 'ABIDING' 'BOUND' 'BOUNDED' 'COMMIT'] ...


#### Update VADER

In [12]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

new_words = {}
for w in dict_LM['Negative']:
    new_words[w.lower()] = -1
for w in dict_LM['Positive']:
    new_words[w.lower()] = 1

sia = SentimentIntensityAnalyzer()

sia.lexicon.update(new_words)
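
A minimal sketch of what this update does, using a toy dictionary in place of `sia.lexicon`. All words and valences below are made up for illustration, and `build_overrides` is a hypothetical helper. Because `dict.update` overwrites existing keys, a word that is already in the lexicon takes the new ±1 value, while unseen words are simply added. For scale, VADER's own ratings run from -4 to +4, so ±1 is a fairly mild valence.

```python
def build_overrides(negative_words, positive_words):
    """Lowercase the word lists and assign -1 / +1, positives applied last."""
    overrides = {w.lower(): -1.0 for w in negative_words}
    overrides.update({w.lower(): 1.0 for w in positive_words})
    return overrides

# Toy stand-in for the VADER lexicon (values are invented).
toy_lexicon = {'great': 3.0, 'liability': -0.5, 'happy': 2.5}
overrides = build_overrides(['LIABILITY', 'IMPAIRMENT'], ['GREAT'])

# Record which entries will be overwritten and which are new.
changed = {w: (toy_lexicon[w], v) for w, v in overrides.items() if w in toy_lexicon}
added = sorted(w for w in overrides if w not in toy_lexicon)

toy_lexicon.update(overrides)
print('overwritten:', changed)
print('added:', added)
```

Checking `changed` before updating is a cheap way to see how many general-purpose valences the financial word lists replace.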

## Sentiment Results

In [13]:
for d in data:
    d['sentiment_scores'] = sia.polarity_scores(' '.join(d['token']))

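Let's print the scores for a random sample of articles (random.choices samples with replacement, so an article can appear more than once):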

In [14]:
import random

for d in random.choices(data,k=20):
    print("headline: " + d['frontpage_summary']['headline'])
    print("\t",d['sentiment_scores'])

headline: Are ETFs safe... for retail investors?
	 {'neg': 0.115, 'neu': 0.726, 'pos': 0.16, 'compound': 0.9871}
headline: Trump pins reelection on hopes that voters trust him to reboot economy
	 {'neg': 0.098, 'neu': 0.779, 'pos': 0.123, 'compound': 0.9062}
headline: Trump pins reelection on hopes that voters trust him to reboot economy
	 {'neg': 0.098, 'neu': 0.779, 'pos': 0.123, 'compound': 0.9062}
headline: The Stock Market’s Great Week Is Reason to Worry
	 {'neg': 0.134, 'neu': 0.711, 'pos': 0.155, 'compound': -0.2547}
headline: Clorox and Netflix Shares Have Prospered in the Pandemic. But the Risk Is Rising.
	 {'neg': 0.0, 'neu': 0.956, 'pos': 0.044, 'compound': 0.296}
headline: TJX results show shoppers will head back to stores if the price is right, analysts say
	 {'neg': 0.059, 'neu': 0.809, 'pos': 0.132, 'compound': 0.9834}
headline: ‘I woke up in a free country’: Costco shopper gets bounced from store after refusing to wear a mask
	 {'neg': 0.102, 'neu': 0.779, 'pos': 0.119,

## Conclusion
For most of the articles, the result is dominated by the neutral score. Still, the relative sizes of the negative and positive scores do reflect the opinion of the author.
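
One rough way to read the relative value of the negative and positive scores is to take their difference. The sketch below does this for three score dictionaries copied from the output above. The helper name `net_tone` is hypothetical, and the difference is only one possible summary, not something VADER itself reports.

```python
def net_tone(scores):
    """Positive share minus negative share, rounded to three decimals."""
    return round(scores['pos'] - scores['neg'], 3)

# Score dictionaries copied from the sample output above.
samples = [
    {'neg': 0.115, 'neu': 0.726, 'pos': 0.16, 'compound': 0.9871},
    {'neg': 0.134, 'neu': 0.711, 'pos': 0.155, 'compound': -0.2547},
    {'neg': 0.059, 'neu': 0.809, 'pos': 0.132, 'compound': 0.9834},
]

for s in samples:
    print(net_tone(s), s['compound'])
```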

#### Next Steps:
1. Parse articles over a longer period of time and from more sources
2. Link the article sentiment to the referenced stocks
3. Use stock sentiment to predict future returns
4. Construct alpha signals from the stock sentiment scores