# Sentiment and Classification

For sentiment, we will look at VADER and NLTK's Sentiwordnet.

* "VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."
  * https://github.com/cjhutto/vaderSentiment
  * nice example: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
  
* Sentiwordnet is a lexical resource, available through NLTK, that adds sentiment scores to the word senses (synsets) defined in WordNet.
  * https://www.nltk.org/howto/sentiwordnet.html

## VADER

In [None]:
import nltk
from nltk.sentiment import vader
nltk.download('vader_lexicon')
nltk.download('stopwords')

In [None]:
sia = vader.SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores('Luke, I am your father.')

In [None]:
sia.polarity_scores('NO!!!!!!')

In [None]:
sia.polarity_scores('I hate you.')

In [None]:
sia.polarity_scores('I HATE you.')

In [None]:
sia.polarity_scores('I HATE you!!!!')

In [None]:
sia.polarity_scores('Thank you Dad')

Try typing in a couple of sentences of your own to explore the polarity scores. Emoticons are scored too:

In [None]:
sia.polarity_scores(':D')

In [None]:
sia.polarity_scores(':(')

In [None]:
sia.polarity_scores('>:(')

Negation

In [None]:
sia.polarity_scores("I don't hate you")

In [None]:
sia.polarity_scores("I don't not love you")

In [None]:
sia.polarity_scores("I love you")

In [None]:
sia.polarity_scores("I LOVE you")

In [None]:
sia.polarity_scores("I really love you")

In [None]:
sia.polarity_scores("I am in love with you")

In [None]:
sia.polarity_scores("I am so in love with you")

Contrast

In [None]:
sia.polarity_scores("I usually hate shrimp but I loved this")

The clause after "but" takes precedence. Compare with the same sentence joined by "and":

In [None]:
sia.polarity_scores("I usually hate shrimp and I loved this")

In [None]:
sia.polarity_scores("I usually hate shrimp and I liked this")
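
To make the role of "but" concrete, here is a minimal sketch of how such a rule could work. It assumes, as we understand VADER's reference implementation, that word valences before "but" are scaled by 0.5 and those after it by 1.5. The lexicon `TOY_LEXICON` and the function `but_weighted_valences` are hypothetical names with made-up values, and the real analyzer applies many other rules on top of this one.

```python
# Toy valence lexicon: hypothetical values, NOT VADER's real lexicon.
TOY_LEXICON = {"hate": -2.5, "loved": 3.0, "liked": 1.5}

# Assumed scaling factors for the clauses before and after "but".
BEFORE_BUT = 0.5
AFTER_BUT = 1.5


def but_weighted_valences(tokens, lexicon=TOY_LEXICON):
    """Return per-token valences, damped before 'but' and boosted after it."""
    valences = [lexicon.get(tok, 0.0) for tok in tokens]
    if "but" not in tokens:
        return valences  # no contrast marker: leave valences untouched
    pivot = tokens.index("but")
    weighted = []
    for i, v in enumerate(valences):
        if i < pivot:
            weighted.append(v * BEFORE_BUT)
        elif i > pivot:
            weighted.append(v * AFTER_BUT)
        else:
            weighted.append(v)
    return weighted


with_but = "i usually hate shrimp but i loved this".split()
with_and = "i usually hate shrimp and i loved this".split()

print(sum(but_weighted_valences(with_but)))  # -1.25 + 4.5 = 3.25
print(sum(but_weighted_valences(with_and)))  # -2.5 + 3.0 = 0.5
```

With "and", the two opinions nearly cancel; with "but", the second clause dominates the total.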

## Sentiment Classification

Using NLTK's example movie review dataset, we'll first work through how to tackle a classification problem like this by hand.

In [None]:
import nltk

nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

# Access the movie reviews dataset
reviews = movie_reviews.fileids()

In [None]:
reviews[0]

In [None]:
movie_reviews.raw(reviews[0])

In [None]:
print(movie_reviews.raw(reviews[0]))

In [None]:
movie_reviews.categories(reviews[0])

* There's at least one processing step we should apply right away: remove the newline characters ("\n").
* Let's also put all the reviews into a dataframe.

In [None]:
import pandas as pd

# Create a pandas dataframe
df = pd.DataFrame({'review_sentiment': [movie_reviews.categories(review)[0]
                                        for review in reviews], 
                   'review_text': [movie_reviews.raw(review).replace('\n','')
                                   for review in reviews]})

# Display the dataframe
df.head()

In [None]:
df.groupby('review_sentiment').count()

The following defines a function to return the Sentiment Intensity Analyzer's compound score for any review.

In [None]:
def getSentiment(review):
    return sia.polarity_scores(review)['compound']

In [None]:
# Test
myreview = 'This movie tries to be Star Wars but fails miserably.'

In [None]:
getSentiment(myreview)

The following computes VADER's compound score for every review and stores the scores in a new column, `VaderSentiment`.

In [None]:
df['VaderSentiment'] = df['review_text'].apply(getSentiment)

In [None]:
df.head()

Count the number of rows where the true label is 'pos' and `VaderSentiment` is > 0 (that is, where VADER would classify the review as positive).

In [None]:
df.loc[(df['review_sentiment']=='pos') & 
       (df['VaderSentiment']>0),'review_text'].count()
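
The cells in this section use a strict cutoff at 0. As a point of comparison, the VADER README suggests treating compound scores of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. A minimal sketch of that rule follows; the function name `label_from_compound` and its `threshold` parameter are our own, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'pos', 'neg', or 'neutral'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neutral"


# A few made-up compound scores, purely for illustration.
for score in (0.62, 0.0, -0.03, -0.41):
    print(score, label_from_compound(score))
```

Either way, a review whose score falls in the neutral band (or exactly at 0 under the strict cutoff) counts as a miss when every review is labeled 'pos' or 'neg'.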

We can quantify the fraction of positive reviews that VADER classifies correctly.

In [None]:
correct = df.loc[(df['review_sentiment']=='pos') & 
                 (df['VaderSentiment']>0),'review_text'].count()
total = df.loc[(df['review_sentiment']=='pos'),'review_text'].count()
correct/total

And the fraction of negative reviews that it classifies correctly.

In [None]:
correct = df.loc[(df['review_sentiment']=='neg') & 
                 (df['VaderSentiment']<0),'review_text'].count()
total = df.loc[(df['review_sentiment']=='neg'),'review_text'].count()
correct/total

Less than 50% of the negative reviews are classified correctly! A coin flip would do better on this class.

Let's check out a couple of examples.

In [None]:
for i in df.loc[(df['review_sentiment']=='neg')][:5].index:
    print(df.loc[i,'VaderSentiment'], ':', df.loc[i,'review_text'])

Let's look at the distribution of scores to see if that provides any insights.

In [None]:
df.loc[df['review_sentiment']=='pos', 'VaderSentiment'].hist()

In [None]:
df.loc[df['review_sentiment']=='neg', 'VaderSentiment'].hist()
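
If the histograms show the scores for both classes piled up toward one end, the cutoff at 0 may simply be in the wrong place. One way to probe this is to sweep over a few cutoffs and compare accuracy. The sketch below uses made-up labels and scores; `accuracy_at_cutoff` is a hypothetical helper, and unlike the cells above it treats every score at or below the cutoff as 'neg'.

```python
def accuracy_at_cutoff(labels, scores, cutoff):
    """Fraction labeled correctly when score > cutoff is read as 'pos'."""
    correct = sum(
        (score > cutoff) == (label == "pos")
        for label, score in zip(labels, scores)
    )
    return correct / len(labels)


# Made-up labels and scores, skewed positive, purely for illustration.
toy_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_scores = [0.99, 0.95, 0.60, 0.80, 0.40, -0.70]

for cutoff in (0.0, 0.5, 0.9):
    print(cutoff, accuracy_at_cutoff(toy_labels, toy_scores, cutoff))
```

A cutoff tuned on the same reviews it is evaluated on will look better than it really is, so a held-out split would be needed for an honest comparison.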

The total accuracy is given by:

In [None]:
poscorrect = df.loc[(df['review_sentiment']=='pos') & (df['VaderSentiment']>0),'review_text'].count()
negcorrect = df.loc[(df['review_sentiment']=='neg') & (df['VaderSentiment']<0),'review_text'].count()
total = df['review_text'].count()
(poscorrect + negcorrect)/total

We can encapsulate the essential code from above into a function to generalize the process.

In [None]:
def runScoring(df):
    poscorrect = df.loc[(df['review_sentiment']=='pos') & 
                        (df['VaderSentiment']>0),'review_text'].count()
    postotal = df.loc[(df['review_sentiment']=='pos'),'review_text'].count()

    negcorrect = df.loc[(df['review_sentiment']=='neg') & 
                        (df['VaderSentiment']<0),'review_text'].count()
    negtotal = df.loc[(df['review_sentiment']=='neg'),'review_text'].count()

    total = df['review_text'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(df)
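
The three accuracy figures do not show where the errors go. A small confusion table, sketched here with made-up labels and scores, separates reviews that were given the opposite sign from reviews that scored exactly 0. `confusion_counts` is a hypothetical helper that mirrors the `> 0` and `< 0` rules used in `runScoring`.

```python
from collections import Counter


def confusion_counts(labels, scores):
    """Count (true label, predicted label) pairs using the sign of the score."""

    def predicted(score):
        if score > 0:
            return "pos"
        if score < 0:
            return "neg"
        return "zero"  # neither rule fires, so it counts as a miss

    return Counter(
        (label, predicted(score)) for label, score in zip(labels, scores)
    )


# Made-up data, purely for illustration.
toy_labels = ["pos", "pos", "neg", "neg", "neg"]
toy_scores = [0.9, -0.2, 0.7, 0.0, -0.5]

counts = confusion_counts(toy_labels, toy_scores)
for (true_label, pred), n in sorted(counts.items()):
    print(true_label, "->", pred, n)
```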

## Sentiwordnet

NLTK includes functionality for using Sentiwordnet, a lexical resource that attaches positive, negative, and objective scores to WordNet synsets (sets of words that share a meaning) and can thereby be used to help assess sentiment.

In [None]:
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
nltk.download('wordnet')

In [None]:
list(swn.senti_synsets('funny'))

In [None]:
list(swn.senti_synsets('funny'))[0]

In [None]:
list(swn.senti_synsets('funny'))[0].pos_score()

In [None]:
list(swn.senti_synsets('funny'))[0].neg_score()

In [None]:
list(swn.senti_synsets('funny'))[0].obj_score()

In [None]:
for i in list(swn.senti_synsets('funny')):
    print(i)

`wordnet` allows us to get the definitions of the synsets:

In [None]:
from nltk.corpus import wordnet

In [None]:
for i in wordnet.synsets('funny'):
    print(i,i.definition())

Consider one review:

In [None]:
df.loc[0,'review_text']

We could use the synset polarity scores of individual words to manually score a review's sentiment:

1. Break up the review into words.
2. Remove stopwords.
3. Sum the synset scores of the words.
   * For each word, a simple first attempt is to take all of its synsets and (a) add the positive score if the positive score is larger or (b) subtract the negative score otherwise. Then divide the word's total by its number of synsets.
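
The per-word arithmetic in step 3 can be sketched on its own, without touching the corpus. The `(pos_score, neg_score)` pairs below are hypothetical, not real Sentiwordnet values, and `average_signed_score` is a name chosen for this illustration.

```python
def average_signed_score(synset_scores):
    """Average signed score over a word's synsets.

    Each item is a (pos_score, neg_score) pair. A synset contributes
    +pos_score when pos_score is larger, otherwise -neg_score.
    """
    if not synset_scores:
        return 0.0  # a word with no synsets contributes nothing
    total = 0.0
    for pos, neg in synset_scores:
        total += pos if pos > neg else -neg
    return total / len(synset_scores)


# Hypothetical scores for a word with three senses.
toy_word = [(0.5, 0.0), (0.0, 0.25), (0.0, 0.0)]
print(average_signed_score(toy_word))  # (0.5 - 0.25 - 0.0) / 3
```

Averaging treats every sense as equally likely, so a rarely used negative sense pulls the word's score down as much as its most common sense pulls it up.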

In [None]:
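from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
nltk.download('punkt')      # tokenizer models required by word_tokenize
nltk.download('punkt_tab')  # same models, as packaged for newer NLTK releases
# Tokens to ignore: punctuation marks plus English stopwords
myStopWords = list(punctuation) + stopwords.words('english')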

Example of breaking a review into a list of individual words:

In [None]:
[w for w in word_tokenize(df.loc[0,'review_text'].lower())]

The same list, but with stopwords removed:

In [None]:
[w for w in word_tokenize(df.loc[0,'review_text'].lower()) if w not in myStopWords]

Here's our function for getting the average synset scores of words in a review and summing them all up to get a polarity score for the review.

In [None]:
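def naiveSentiment(review):
    reviewPolarity = 0.0
    words = [w for w in word_tokenize(review.lower()) if w not in myStopWords]
    for word in words:
        # Look up the word's synsets once and reuse the list
        synsets = list(swn.senti_synsets(word))
        if not synsets:
            continue  # word unknown to Sentiwordnet: contributes nothing
        sentScore = 0.0
        for synset in synsets:
            # Each synset votes with its dominant polarity
            if synset.pos_score() > synset.neg_score():
                sentScore += synset.pos_score()
            else:
                sentScore -= synset.neg_score()
        # Average over the word's synsets, then add to the review total
        reviewPolarity += sentScore / len(synsets)

    return reviewPolarity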

In [None]:
naiveSentiment(df.loc[0,'review_text'])

Make a new column in our main dataframe that uses our sentiwordnet-based scoring system.

In [None]:
df['naiveSentiment'] = df['review_text'].apply(naiveSentiment)

We reuse `runScoring` from above for a final assessment, now with an extra argument that specifies which sentiment column to use.

In [None]:
def runScoring(df, sentimentMethod):
    poscorrect = df.loc[(df['review_sentiment']=='pos') & 
                        (df[sentimentMethod]>0),'review_text'].count()
    postotal = df.loc[(df['review_sentiment']=='pos'),'review_text'].count()

    negcorrect = df.loc[(df['review_sentiment']=='neg') & 
                        (df[sentimentMethod]<0),'review_text'].count()
    negtotal = df.loc[(df['review_sentiment']=='neg'),'review_text'].count()

    total = df['review_text'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(df, 'VaderSentiment')

In [None]:
runScoring(df, 'naiveSentiment')

This synset-scoring method does a slightly worse job at classification, though admittedly its approach is relatively simplistic.

Let's again look at the distribution of scores, separately for the reviews labeled positive and those labeled negative.

In [None]:
df.loc[df['review_sentiment']=='pos', 'naiveSentiment'].hist()

In [None]:
df.loc[df['review_sentiment']=='neg', 'naiveSentiment'].hist()

There are many shortcomings, but one is very easy to spot: VADER properly accounts for negation, while our naive sentiment scorer with synset averaging does not.

In [None]:
getSentiment('this restaurant is lousy')

In [None]:
getSentiment('this restaurant is not lousy')

In [None]:
naiveSentiment('this restaurant is lousy')

In [None]:
naiveSentiment('this restaurant is not lousy')

Why is this?

In [None]:
print(myStopWords)

Note that "not" is in the stopword list, so `naiveSentiment` drops it before any scoring happens.
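
One possible remedy, sketched under simplifying assumptions, is to keep negation words out of the stopword filter and flip the sign of the word that follows one. The stopword list, negation list, and word scores below are small hypothetical stand-ins, and sign flipping is a crude rule that VADER's own negation handling refines considerably.

```python
# Small hypothetical stand-ins, for illustration only.
TOY_STOPWORDS = {"this", "is", "the", "not", "no"}
NEGATIONS = {"not", "no", "never"}
TOY_SCORES = {"lousy": -0.6, "great": 0.7}

# Filter out stopwords, but let negation words through.
FILTERED_OUT = TOY_STOPWORDS - NEGATIONS


def negation_aware_score(text):
    """Sum toy word scores, flipping the sign of a word that follows a negation."""
    kept = [w for w in text.lower().split() if w not in FILTERED_OUT]
    total = 0.0
    negate = False
    for word in kept:
        if word in NEGATIONS:
            negate = True  # remember to flip the next scored word
            continue
        score = TOY_SCORES.get(word, 0.0)
        total += -score if negate else score
        negate = False
    return total


print(negation_aware_score("this restaurant is lousy"))      # -0.6
print(negation_aware_score("this restaurant is not lousy"))  # 0.6
```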

What happens if we completely ignore the sentiment connotations of individual words? Does it make sense to ignore meaning altogether and look only at the statistical occurrences of words across a given set of texts?
* This will be studied in our part-2 notebook.