# News Sentiment Analytics | FIN:9160

Spring 2022 | Instructors: Qihang Lin and Tong Yao

Sentiment analysis is a technique for detecting and quantifying sentiment polarity in text data. It can be used to generate sentiment scores for news and financial reports (e.g., Form 10-K).

There are **two major approaches** to sentiment analysis:
- Supervised predictive modeling approaches.
    - Require each sentence/document to be annotated with a sentiment score.
    
    
- Unsupervised lexicon-based (rule-based) approaches. (**The focus of this lecture.**)
    - Require a lexicon where each word receives a pre-determined sentiment score.
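
To make the lexicon-based approach concrete, here is a minimal sketch. The three-word lexicon, its scores, and the `lexicon_score` helper are invented for illustration; they are not taken from TextBlob or any published dictionary.

```python
import re

# Toy lexicon with hypothetical scores, for illustration only.
TOY_LEXICON = {"good": 0.7, "awesome": 1.0, "bad": -0.7}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Average the scores of the words in `text` that appear in the lexicon.

    Returns 0.0 when no word of the text is in the lexicon.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(lexicon_score("The acting was good and the music was awesome."))
print(lexicon_score("The ending was bad."))
print(lexicon_score("No sentiment words here."))
```

No labeled training data is needed: all of the "knowledge" sits in the pre-determined word scores.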

## Challenges

Why is high-quality sentiment analysis difficult?
1. **Negations.**  "I don't think this coffee machine is a good choice."

2. **Multipolarity.** "The acting was good but the movie could have been better."

3. **Irony and sarcasm.** "This phone has an awesome battery back-up of 2 hours."

4. **Word ambiguity.** "The story is unpredictable." vs. "The steering wheel is unpredictable."

5. **Others:** Shorthand, abbreviations, different spellings, misspelled words, punctuation, slang, and emojis.
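
The negation challenge can be illustrated with a small sketch. The lexicon, the negation list, and the `score` function below are hypothetical and deliberately simplistic: a word's sign is flipped if a negation word appears within a fixed window of tokens before it.

```python
import re

# Hypothetical toy lexicon and negation list, for illustration only.
LEXICON = {"good": 1.0, "awesome": 1.0, "bad": -1.0}
NEGATIONS = {"not", "no", "never", "don't"}

def score(text, window=0):
    """Sum lexicon scores; flip a word's sign if a negation word occurs
    within `window` tokens before it (window=0 disables the rule)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            value = LEXICON[tok]
            if any(t in NEGATIONS for t in tokens[max(0, i - window):i]):
                value = -value
            total += value
    return total

short = "This is not good."
long_range = "I don't think this coffee machine is a good choice."

print(score(short), score(short, window=2))
print(score(long_range), score(long_range, window=2))
```

With a two-token window the rule handles "not good", but in the coffee machine sentence the negation sits seven tokens before "good", so the sentence is still scored as positive. Widening the window would fix this case at the risk of flipping words that are not actually negated.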

## News Data

In [1]:
import pandas as pd
dfnews=pd.read_csv('classdata/News.RTRS_CMPNY_AMER.201901-202102-TSLA-TextBlob.csv')
dfnews["timestamp"]=pd.to_datetime(dfnews["timestamp"],format="%Y-%m-%dT%H:%M:%S.%fZ",utc=True)
dfnews["timestamp"]=dfnews["timestamp"].dt.tz_convert('US/Eastern')
#We delete the existing sentiment scores and we will generate them again later.
del dfnews["TextBlobTitle"]
del dfnews["TextBlobBody"]
dfnews.head()

| | timestamp | headline | body |
|---|---|---|---|
| 0 | 2019-01-01 22:54:59-05:00 | BREAKINGVIEWS-Truce brakes Evergrande's bad be... | (The author is a Reuters Breakingviews column... |
| 1 | 2019-01-02 07:32:29.952000-05:00 | Reuters Insider - U.S. Morning Call: U.S. stoc... | \nClick the following link to watch video: htt... |
| 2 | 2019-01-02 08:30:00-05:00 | NORWAY'S SALE OF FULLY ELECTRIC CARS ROSE TO 3... | |
| 3 | 2019-01-02 08:30:01-05:00 | Almost a third of Norway's car sales in 2018 e... | OSLO, Jan 2 (Reuters) - Almost a third of ... |
| 4 | 2019-01-02 08:33:18.667000-05:00 | TESLA Q4 2018 VEHICLE PRODUCTION & DELIVERIES,... | |


## TextBlob

TextBlob is a library for processing **generic** textual data. In this lecture, we only focus on its functionality in sentiment analysis.

TextBlob returns the **polarity** of a text based on a lexicon.
- The lexicon used is available at https://github.com/sloria/TextBlob/blob/dev/textblob/en/en-sentiment.xml
- The scores of the individual words are averaged to obtain an overall score for the text.
- Polarity is normalized to the range [-1, 1].
- Negation words reverse the polarity (but not always).
- TextBlob also takes emoticons and exclamation marks into account.
- Each word also has an **intensity**. Intensity determines whether a word modifies the next word ('very good', 'kind of good').
- A more detailed description of these rules is available at https://planspace.org/20150607-textblob_sentiment/

In [2]:
#!pip3 install --upgrade textblob
#!pip3 install --upgrade nltk
from textblob import TextBlob

In [3]:
TextBlob("That is not a very good idea!").sentiment.polarity

-0.33653846153846145
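
One way to arrive at this score by hand, following the rules described in the planspace article linked above, is sketched below. The individual values (polarity 0.7 for "good", intensity 1.3 for "very", a factor of -0.5 for "not", and a factor of 1.25 for "!") are assumptions about the TextBlob lexicon and may differ between versions; the variable names are only for illustration.

```python
# Assumed values; check them against the lexicon of your TextBlob version.
good_polarity = 0.7         # polarity of "good"
very_intensity = 1.3        # intensity of the modifier "very"
negation_factor = -0.5      # effect of "not"
exclamation_factor = 1.25   # effect of "!"

# Under negation the intensity is applied inversely (1 / intensity),
# so "not very good" is milder than "not good".
polarity = negation_factor * (1 / very_intensity) * good_polarity * exclamation_factor
print(round(polarity, 6))
```

Under these assumptions the product agrees with the score printed above to many decimal places, which suggests how negation, intensity, and punctuation interact in this sentence.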

We then generate the sentiment score for each headline and each body. 

In [4]:
dfnews["TextBlobTitle"]=[TextBlob(s).sentiment.polarity for s in dfnews["headline"]]

In [5]:
#"body" can be NaN so we replace them by an empty string ""
dfnews["body"]=dfnews["body"].fillna("")
dfnews["TextBlobBody"]=[TextBlob(s).sentiment.polarity for s in dfnews["body"]]

## Loughran and McDonald Financial Sentiment Dictionaries

The TextBlob lexicon was created from movie reviews and might not accurately measure the sentiment of text in other domains, so alternatives are needed.

The Loughran and McDonald (LM) lexicon is a sentiment lexicon created by Tim Loughran and Bill McDonald at the University of Notre Dame. It was created by analysing over 50,000 company financial reports (Form 10-K) from the 1994–2008 period.

The complete lexicon can be found at https://drive.google.com/file/d/15UPaF2xJLSVz8DYuphierz67trCxFLcl/view?usp=sharing

Note that the LM lexicon and the TextBlob lexicon overlap by only about 10%.

**pysentiment2** is a library for sentiment analysis based on the LM lexicon. It calculates a sentiment polarity score between -1 and 1 using the following equation: 

$$Polarity= \frac{N_{pos}-N_{neg}}{N_{pos}+N_{neg}+\epsilon}$$

where $N_{pos}$ and $N_{neg}$ are the numbers of words in the text that the LM lexicon classifies as positive and negative, respectively, and $\epsilon=0.000001$ is a smoothing parameter that prevents division by zero.
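
The formula can be checked with a few lines of plain Python. The function name `lm_polarity` is made up for this sketch; it only mirrors the equation above and does not call pysentiment2.

```python
def lm_polarity(n_pos, n_neg, eps=1e-6):
    """Polarity = (N_pos - N_neg) / (N_pos + N_neg + eps)."""
    return (n_pos - n_neg) / (n_pos + n_neg + eps)

print(lm_polarity(4, 0))   # only positive words: slightly below 1
print(lm_polarity(1, 3))   # more negative than positive words
print(lm_polarity(0, 0))   # no lexicon words at all
```

Because of $\epsilon$, a text with only positive words scores slightly below 1 rather than exactly 1, and a text without any lexicon words scores 0 instead of causing a division by zero.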

In [6]:
#!pip3 install pysentiment2
import pysentiment2 as ps

Before calculating the sentiment score, the pysentiment2 library needs to preprocess the text:

   * Tokenize the text into words
   * Remove all digits and punctuation
   * Convert all letters to lowercase
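
The three steps listed above can be approximated with a regular expression. This is a rough stand-in written for illustration (the name `simple_tokenize` is hypothetical), not pysentiment2's own tokenizer, which may apply further processing.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep only runs of letters,
    which drops digits and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("TESLA INC. UPGRADED TO 'B+' IN 2020"))
```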

In [7]:
lm = ps.LM()
tokens = lm.tokenize("10S&P SAYS TESLA INC. UPGRADED TO 'B+' ON STRENGTHENING BUSINESS POSITION, IMPROVING METRICS; OUTLOOK STABLE")
score = lm.get_score(tokens)
print("Sentiment score:")
print(score['Polarity'])
print("Number of positive words:")
print(score['Positive'])
print("Number of negative words:")
print(score['Negative'])

Sentiment score:
0.9999997500000625
Number of positive words:
4
Number of negative words:
0


In [8]:
from IPython.display import display, clear_output
#Initialize with a float so that assigning polarity scores does not change the column type.
dfnews["LMTitle"]=0.0
for i in range(len(dfnews["headline"])):
    tokens = lm.tokenize(dfnews["headline"][i])  
    score = lm.get_score(tokens)
    dfnews.loc[i,"LMTitle"]=score["Polarity"]
    if (i % 1000) ==0:
        clear_output()
        display('News '+str(i)+' out of '+str(len(dfnews["headline"])))

'News 9000 out of 9895'

In [9]:
from IPython.display import display, clear_output
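#Initialize with a float so that assigning polarity scores does not change the column type.
dfnews["LMBody"]=0.0
for i in range(len(dfnews["body"])):
    #Skip missing bodies (they were already replaced by "" above, so this is only a safeguard).
    if pd.notna(dfnews["body"][i]):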
        tokens = lm.tokenize(dfnews["body"][i])  
        score = lm.get_score(tokens)
        dfnews.loc[i,"LMBody"]=score["Polarity"]
    if (i % 1000) ==0:
        clear_output()
        display('News '+str(i)+' out of '+str(len(dfnews["body"])))

'News 9000 out of 9895'

In [10]:
dfnews.head()

| | timestamp | headline | body | TextBlobTitle | TextBlobBody | LMTitle | LMBody |
|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 22:54:59-05:00 | BREAKINGVIEWS-Truce brakes Evergrande's bad be... | (The author is a Reuters Breakingviews column... | -0.7 | 0.036709 | -0.999999 | -0.636364 |
| 1 | 2019-01-02 07:32:29.952000-05:00 | Reuters Insider - U.S. Morning Call: U.S. stoc... | \nClick the following link to watch video: htt... | -0.1875 | -0.05 | -1.0 | -1.0 |
| 2 | 2019-01-02 08:30:00-05:00 | NORWAY'S SALE OF FULLY ELECTRIC CARS ROSE TO 3... | | 0.6 | 0.0 | 0.0 | 0.0 |
| 3 | 2019-01-02 08:30:01-05:00 | Almost a third of Norway's car sales in 2018 e... | OSLO, Jan 2 (Reuters) - Almost a third of ... | 0.068182 | 0.120779 | 0.0 | -1.0 |
| 4 | 2019-01-02 08:33:18.667000-05:00 | TESLA Q4 2018 VEHICLE PRODUCTION & DELIVERIES,... | | 0.0 | 0.0 | 0.0 | 0.0 |


Calculate the correlation between all (numeric) columns.

In [11]:
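#Select the score columns explicitly; recent pandas versions raise an error if corr() is given text columns.
dfnews[["TextBlobTitle","TextBlobBody","LMTitle","LMBody"]].corr()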

| | TextBlobTitle | TextBlobBody | LMTitle | LMBody |
|---|---|---|---|---|
| TextBlobTitle | 1.0 | 0.110051 | 0.195738 | 0.074213 |
| TextBlobBody | 0.110051 | 1.0 | 0.05376 | 0.072557 |
| LMTitle | 0.195738 | 0.05376 | 1.0 | 0.238553 |
| LMBody | 0.074213 | 0.072557 | 0.238553 | 1.0 |
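
By default, pandas computes the pairwise Pearson correlation coefficient. The sketch below shows what that coefficient measures, using a hand-written `pearson` helper and made-up scores for five hypothetical headlines (neither is part of the lecture data).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up scores for five hypothetical headlines.
textblob_scores = [0.6, -0.7, 0.0, 0.1, -0.2]
lm_scores = [1.0, -1.0, 0.0, 0.0, -1.0]
print(round(pearson(textblob_scores, lm_scores), 3))
```

A value near 1 means the two scores tend to move together, and a value near 0 means they are largely unrelated. The low off-diagonal values in the table above therefore indicate that the TextBlob and LM scores often disagree on the same news item.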
