# Sentiment Analysis | BAIS 6100

**Instructor: Qihang Lin**

Sentiment analysis is a technique for detecting and quantifying sentiment polarity (e.g., a positive or negative opinion) in text data.
- A star rating can serve as a sentiment measure for a whole review. Sentiment analysis, however, can measure the sentiment of each sentence, which provides more information than the overall star rating.
- Sentiment analysis can gauge customers' responses to a product or service on social media, where star ratings are not available.

There are **two major approaches** to sentiment analysis:
- Supervised predictive modeling approaches.
    - Require each sentence/document to be annotated with a sentiment score.
- Unsupervised lexicon-based (rule-based) approaches. (**The focus of this lecture.**)
    - Require a lexicon where each word receives a pre-determined sentiment score.
    
In both approaches, the scores and rules are typically assigned manually by domain experts, by linguists, or through the wisdom of the crowd.

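Both approaches are sensitive to stemming: a stemmed word may no longer match its entry in the lexicon or the vocabulary a model was trained on.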
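
As a rough illustration of the lexicon-based approach and of why stemming matters, here is a minimal sketch. The lexicon `TOY_LEXICON`, the stem table `TOY_STEMS`, and the function `lexicon_score` are all hypothetical and made up for this example; the stem table only imitates the kind of output a stemmer can produce.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score (not a real lexicon).
TOY_LEXICON = {"love": 3, "happy": 3, "terrible": -3}

# Hypothetical stem table standing in for a real stemmer (assumption).
TOY_STEMS = {"loved": "love", "happy": "happi"}


def lexicon_score(text, stem=False):
    """Sum the lexicon scores of the words in text, optionally after stemming."""
    total = 0
    for token in re.findall(r"[a-z']+", text.lower()):
        if stem:
            token = TOY_STEMS.get(token, token)
        total += TOY_LEXICON.get(token, 0)  # unknown words contribute 0
    return total


# Stemming helps when it maps a word onto a lexicon entry ...
print(lexicon_score("I loved it."), lexicon_score("I loved it.", stem=True))  # 0 3
# ... and hurts when the stem is no longer in the lexicon.
print(lexicon_score("I am happy."), lexicon_score("I am happy.", stem=True))  # 3 0
```

Whether to stem therefore depends on whether the lexicon itself stores stems or full word forms.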

## Challenges

Why is high-quality sentiment analysis difficult to achieve?
1. **Negations.**  "I don't think this coffee machine is a good choice."

2. **Multipolarity.** "The acting was good but the movie could have been better."

3. **Irony and sarcasm.** "This phone has an awesome battery back-up of 2 hours."

4. **Word ambiguity.** "The story is unpredictable." vs. "The steering wheel is unpredictable."

5. **Others:** Shorthand, abbreviations, different spellings, misspelled words, punctuation, slang, and emojis.

To compare the performance of different libraries, we create the following data frame **df1**.

In [1]:
sentences_negation=["This move is bad.",
                    "This move is not bad.",
                    "This movie is good.",
                    "This movie is not good.", 
                    "This movie is not very good.",
                    "This movie isn't good.",
                    "This movie isn't all that good.",
                    "This movie isn't really all that good.",
                    "I don't think this movie is good."]
import pandas as pd
#Create a data frame from a list!
df1=pd.DataFrame(sentences_negation,columns=["text"])
df1

| | text |
|---|---|
| 0 | This move is bad. |
| 1 | This move is not bad. |
| 2 | This movie is good. |
| 3 | This movie is not good. |
| 4 | This movie is not very good. |
| 5 | This movie isn't good. |
| 6 | This movie isn't all that good. |
| 7 | This movie isn't really all that good. |
| 8 | I don't think this movie is good. |


## AFINN

The AFINN lexicon is one of the simplest lexicons for sentiment analysis. It contains more than 3,300 words, each with a polarity score between -5 and 5.

- The complete AFINN lexicons: https://github.com/fnielsen/afinn/tree/master/afinn/data
- It can score some emoticons, such as :) and :(.
- It doesn't handle negation, except for a few fixed phrases such as "not good" and "not working".
- The total score of a text is simply the sum of the scores of its words.
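
The last two bullets can be sketched in a few lines. This is not the `afinn` library's implementation; `TOY_AFINN` and `toy_afinn_score` are hypothetical names, and the three scores are a hand-picked subset assumed to match the AFINN word list.

```python
import re

# Hypothetical subset of an AFINN-style lexicon (scores are assumptions).
TOY_AFINN = {"good": 3, "bad": -3, "not good": -2}

# Try longer entries first so the fixed phrase "not good" wins over "good".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(TOY_AFINN, key=len, reverse=True))
    + r")\b"
)


def toy_afinn_score(text):
    """Sum the scores of all lexicon entries found in text."""
    return float(sum(TOY_AFINN[m] for m in _PATTERN.findall(text.lower())))


print(toy_afinn_score("This movie is not good."))  # -2.0 (fixed phrase matched)
print(toy_afinn_score("This movie isn't good."))   # 3.0  (negation missed)
print(toy_afinn_score("This move is not bad."))    # -3.0 (no entry for "not bad")
```

Because negation is only captured by fixed phrases, any other wording of a negation is scored as if it were not there.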

In [2]:
#Install afinn if running first time
#!pip3 install --upgrade afinn 
from afinn import Afinn
afinn = Afinn(emoticons=True)        #Initialize an analyzer
afinn.score("He is a nice colleague and a great father. :)")

8.0

In [3]:
df1["AFINN"]=[afinn.score(s) for s in df1.text]
df1

| | text | AFINN |
|---|---|---|
| 0 | This move is bad. | -3.0 |
| 1 | This move is not bad. | -3.0 |
| 2 | This movie is good. | 3.0 |
| 3 | This movie is not good. | -2.0 |
| 4 | This movie is not very good. | 3.0 |
| 5 | This movie isn't good. | 3.0 |
| 6 | This movie isn't all that good. | 3.0 |
| 7 | This movie isn't really all that good. | 3.0 |
| 8 | I don't think this movie is good. | 3.0 |


## TextBlob

TextBlob is a library for processing textual data. It is a good alternative to NLTK and can perform common NLP and text mining tasks such as part-of-speech tagging, tokenization, sentiment analysis, classification, and more.

In this lecture, we only focus on its functionality in sentiment analysis.

TextBlob returns the **polarity** of a text based on a lexicon.
- The lexicon used: https://github.com/sloria/TextBlob/blob/dev/textblob/en/en-sentiment.xml
- The scores of the individual words are averaged to obtain an overall score for the text.
- Polarity is normalized to the range [-1, 1].
- Negation words reverse the polarity (but not always).
- TextBlob also scores emoticons and exclamation marks.
- An "**intensifier**" modifies the score of the next word ('very good', 'kind of good').
- A more detailed description of these rules is available at https://planspace.org/20150607-textblob_sentiment/
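
The negation and intensifier rules can be sketched as follows. This is a simplified illustration based on the description linked above, not TextBlob's actual code: the names `toy_textblob_polarity`, `TOY_POLARITY`, and `TOY_INTENSITY` are hypothetical, and the constants (0.7 for "good", 1.3 for "very", a factor of -0.5 for negation) are assumptions chosen to agree with the scores reported in this section.

```python
import re

# Assumed toy values, not read from TextBlob's lexicon file.
TOY_POLARITY = {"good": 0.7, "bad": -0.7}
TOY_INTENSITY = {"very": 1.3}
NEGATIONS = {"not"}


def toy_textblob_polarity(text):
    """Average the adjusted polarities of the scored words in text."""
    scores = []
    negated, intensity = False, 1.0
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negated = True
        elif token in TOY_INTENSITY:
            intensity *= TOY_INTENSITY[token]
        elif token in TOY_POLARITY:
            polarity = TOY_POLARITY[token]
            if negated:
                # Negation flips and halves the polarity and inverts the intensifier.
                polarity = -0.5 * polarity / intensity
            else:
                polarity = polarity * intensity
            scores.append(max(-1.0, min(1.0, polarity)))
            negated, intensity = False, 1.0
        else:
            # Any other word breaks the "not"/intensifier chain.
            negated, intensity = False, 1.0
    return sum(scores) / len(scores) if scores else 0.0


print(round(toy_textblob_polarity("This movie is not good."), 6))       # -0.35
print(round(toy_textblob_polarity("This movie is not very good."), 6))  # -0.269231
print(round(toy_textblob_polarity("This movie isn't good."), 6))        # 0.7
```

In this sketch the contraction "isn't" is not treated as a negation word, which is one way the "but not always" caveat above can arise.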

In [7]:
#Install if running first time
#!pip3 install --upgrade textblob
#import nltk
#nltk.download('movie_reviews')
#nltk.download('subjectivity')
#nltk.download('vader_lexicon')
#nltk.download('sentiwordnet')
#nltk.download('wordnet')
from textblob import TextBlob

In [11]:
TextBlob("That is not bad :)!!").sentiment.polarity

0.5656249999999999

In [9]:
df1["TextBlob"]=[TextBlob(s).sentiment.polarity for s in df1.text]
df1

| | text | AFINN | TextBlob |
|---|---|---|---|
| 0 | This move is bad. | -3.0 | -0.7 |
| 1 | This move is not bad. | -3.0 | 0.35 |
| 2 | This movie is good. | 3.0 | 0.7 |
| 3 | This movie is not good. | -2.0 | -0.35 |
| 4 | This movie is not very good. | 3.0 | -0.269231 |
| 5 | This movie isn't good. | 3.0 | 0.7 |
| 6 | This movie isn't all that good. | 3.0 | 0.7 |
| 7 | This movie isn't really all that good. | 3.0 | 0.45 |
| 8 | I don't think this movie is good. | 3.0 | 0.7 |


## VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in **social media**.
- The lexicon used (https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt)
- It works exceedingly well on social media text.
- **Positive, Negative and Neutral scores** are generated to represent the proportion of text that falls in these categories. 
- A **Compound score** is computed from the sum of the scores of each word in the text and is normalized to the range [-1, 1].
- More details on VADER scoring methodology: https://github.com/cjhutto/vaderSentiment#about-the-scoring
- It examines the **tri-gram** preceding a sentiment-heavy term to detect negation and flip the polarity.
- It also scores emoticons, emojis, slang, exclamation marks, and capitalization.
- Intensifiers ("extremely", "a little bit") modify the sentiment scores.
- A conjunction ("but", "although") signals a shift in sentiment polarity. Depending on the conjunction, the text before or after it may be weighted more heavily when the sentiment score is calculated.
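
To make the normalization and negation steps concrete, here is a minimal sketch. It assumes the compound score is the summed valence `x` mapped through `x / sqrt(x^2 + alpha)` with `alpha = 15`, that negation multiplies a word's valence by -0.74, and that "good" and "bad" have valences 1.9 and -2.5. These constants are assumptions, not values taken from this lecture; they were chosen because they reproduce the VADER scores reported for `df1` in this section. `toy_compound` is a hypothetical helper, not part of the `vaderSentiment` API.

```python
import math

ALPHA = 15               # assumed normalization constant
NEGATION_SCALAR = -0.74  # assumed multiplier applied to a negated word
TOY_VALENCE = {"good": 1.9, "bad": -2.5}  # assumed lexicon valences


def normalize(total, alpha=ALPHA):
    """Squash an unbounded summed valence into the open interval (-1, 1)."""
    return total / math.sqrt(total * total + alpha)


def toy_compound(word, negated=False):
    """Compound score of a text whose only sentiment word is `word`."""
    valence = TOY_VALENCE[word]
    if negated:
        valence *= NEGATION_SCALAR
    return round(normalize(valence), 4)


print(toy_compound("good"))                # 0.4404
print(toy_compound("bad"))                 # -0.5423
print(toy_compound("good", negated=True))  # -0.3412
print(toy_compound("bad", negated=True))   # 0.431
```

Note that under these assumptions a negated word is flipped and dampened rather than mirrored, so "not bad" scores lower in magnitude than "bad".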

In [13]:
#!pip3 install --upgrade vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#Initialize VADER sentiment analyzer
analyzer=SentimentIntensityAnalyzer()

In [14]:
#The output is a dictionary object with four items.
analyzer.polarity_scores("The food here is great, but the service is horrible")

{'neg': 0.31, 'neu': 0.523, 'pos': 0.167, 'compound': -0.4939}

In [15]:
#To get the compound score only.
analyzer.polarity_scores("The food here is great, but the service is horrible")['compound']

-0.4939

In [16]:
df1["VADER"]=[analyzer.polarity_scores(s)['compound'] for s in df1.text]
df1

| | text | AFINN | TextBlob | VADER |
|---|---|---|---|---|
| 0 | This move is bad. | -3.0 | -0.7 | -0.5423 |
| 1 | This move is not bad. | -3.0 | 0.35 | 0.431 |
| 2 | This movie is good. | 3.0 | 0.7 | 0.4404 |
| 3 | This movie is not good. | -2.0 | -0.35 | -0.3412 |
| 4 | This movie is not very good. | 3.0 | -0.269231 | -0.3865 |
| 5 | This movie isn't good. | 3.0 | 0.7 | -0.3412 |
| 6 | This movie isn't all that good. | 3.0 | 0.7 | -0.3412 |
| 7 | This movie isn't really all that good. | 3.0 | 0.45 | 0.4877 |
| 8 | I don't think this movie is good. | 3.0 | 0.7 | 0.4404 |


## Sentence Tokenization before Sentiment Analysis (To be Discussed in Zoom Meeting)

A long document might express mixed sentiment, so it is helpful to apply sentiment analysis to each individual sentence. 

Some sentiment analyzers (e.g., VADER) are designed for short texts (e.g., tweets) and therefore do not work well on long documents. When applying such an analyzer to a long document, it is better to score each sentence separately and then take the sum or average of the scores.
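
The score-each-sentence-then-aggregate idea can be sketched independently of any particular analyzer. In the sketch below, `sentence_level_scores`, `naive_sent_split`, and `toy_score` are hypothetical helpers written for this example; in practice `split_fn` could be `nltk.tokenize.sent_tokenize` and `score_fn` a function returning VADER's compound score.

```python
import re
from statistics import mean


def naive_sent_split(text):
    """Very rough sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_level_scores(text, score_fn, split_fn=naive_sent_split):
    """Score each sentence with score_fn and aggregate by sum and mean."""
    sentences = split_fn(text)
    scores = [score_fn(s) for s in sentences]
    return {
        "sentences": sentences,
        "scores": scores,
        "sum": sum(scores),
        "mean": mean(scores) if scores else 0.0,
    }


def toy_score(sentence):
    """Stand-in scorer for illustration only (not a real analyzer)."""
    lowered = sentence.lower()
    if "not great" in lowered:
        return -1.0
    if "fantastic" in lowered:
        return 1.0
    return 0.0


result = sentence_level_scores(
    "The old fashioned was not great. The salad was fantastic! We left.",
    toy_score,
)
print(result["scores"], result["mean"])
```

The sum rewards long reviews with many opinionated sentences, while the mean is comparable across documents of different lengths; which one to use depends on the question being asked.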

In [17]:
from nltk import tokenize
df = pd.read_csv("classdata/clinton-street-social-club.csv",encoding="latin-1")
df.reviews[131]

'Dinner and cocktails on a Friday night. Place was packed which I took as a good sign. We ordered classic cocktails (negroni and an old fashioned). I got an americano, which is close to what I ordered, but not right. The server definitely got the name right so the bartender didn\'t know what she was doing.... The old fashioned was not great.\r\nWe ordered the cheese curds and romaine salad to start and they were both fantastic! Both were perfect sharing size and tasted great.\r\nFor the second course I ordered wine which came in a short "juice" glass. I get that it may fit the theme of the place to serve it that way but it is impossible to enjoy and actually taste red wine when not served in a proper glass. \r\nI ordered the veg slider, which as others have said was served on a cold bun with wilted greens on the side. Not terribly appetizing. We had ordered a side of mayo that I added to it that helped a little bit. The cauliflower gratin was a tasty preparation of an otherwise boring 

Review 131 has a star rating of only 2 out of 5, but VADER gives it a high compound score. 

In [18]:
analyzer.polarity_scores(df.reviews[131])['compound']

0.9903

We then tokenize this review into sentences and evaluate each sentence.

In [19]:
sentences_temp = tokenize.sent_tokenize(df.reviews[131])
sentences_temp

['Dinner and cocktails on a Friday night.',
 'Place was packed which I took as a good sign.',
 'We ordered classic cocktails (negroni and an old fashioned).',
 'I got an americano, which is close to what I ordered, but not right.',
 "The server definitely got the name right so the bartender didn't know what she was doing....",
 'The old fashioned was not great.',
 'We ordered the cheese curds and romaine salad to start and they were both fantastic!',
 'Both were perfect sharing size and tasted great.',
 'For the second course I ordered wine which came in a short "juice" glass.',
 'I get that it may fit the theme of the place to serve it that way but it is impossible to enjoy and actually taste red wine when not served in a proper glass.',
 'I ordered the veg slider, which as others have said was served on a cold bun with wilted greens on the side.',
 'Not terribly appetizing.',
 'We had ordered a side of mayo that I added to it that helped a little bit.',
 'The cauliflower gratin was a

In [20]:
sentences_temp = tokenize.sent_tokenize(df.reviews[131])
dftemp=pd.DataFrame(sentences_temp,columns=["Sentence"])
dftemp["VADER"]=[analyzer.polarity_scores(s)['compound'] for s in dftemp.Sentence]
dftemp

| | Sentence | VADER |
|---|---|---|
| 0 | Dinner and cocktails on a Friday night. | 0.0 |
| 1 | Place was packed which I took as a good sign. | 0.4404 |
| 2 | We ordered classic cocktails (negroni and an o... | 0.0 |
| 3 | I got an americano, which is close to what I o... | 0.0 |
| 4 | The server definitely got the name right so th... | 0.4019 |
| 5 | The old fashioned was not great. | -0.5096 |
| 6 | We ordered the cheese curds and romaine salad ... | 0.5983 |
| 7 | Both were perfect sharing size and tasted great. | 0.891 |
| 8 | For the second course I ordered wine which cam... | 0.0 |
| 9 | I get that it may fit the theme of the place t... | 0.7227 |


Once we have computed a sentiment score (by any of the three methods) for each row, we can sort the data frame by that score to find the most positive or most negative rows.

In [21]:
dftemp.sort_values(by="VADER",ascending=True,inplace=True)
dftemp.reset_index(drop=True,inplace=True)
list(dftemp.Sentence[0:5])

['The old fashioned was not great.',
 'The cauliflower gratin was a tasty preparation of an otherwise boring veggie.',
 'Dinner and cocktails on a Friday night.',
 'We ordered classic cocktails (negroni and an old fashioned).',
 'I got an americano, which is close to what I ordered, but not right.']