# Sentiment and Classification

For sentiment, we will look at VADER and NLTK's Sentiwordnet.

* "VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."
  * https://github.com/cjhutto/vaderSentiment
  * nice example: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
  
* Sentiwordnet is a lexical resource, available through the NLTK library, that adds sentiment scores for words on top of the information provided by WordNet.
  * https://www.nltk.org/howto/sentiwordnet.html

## VADER

In [None]:
import nltk
from nltk.sentiment import vader
nltk.download('vader_lexicon')
nltk.download('stopwords')

In [None]:
sia = vader.SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores('Luke, I am your father.')

In [None]:
sia.polarity_scores('NO!!!!!!')

In [None]:
sia.polarity_scores('I hate you.')

In [None]:
sia.polarity_scores('I HATE you.')

In [None]:
sia.polarity_scores('I HATE you!!!!')

In [None]:
sia.polarity_scores('Thank you Dad')

Try typing in a couple of sentences of your own to explore polarity scores.

VADER's lexicon also covers emoticons:

In [None]:
sia.polarity_scores(':D')

In [None]:
sia.polarity_scores(':(')

In [None]:
sia.polarity_scores('>:(')

Negation

In [None]:
sia.polarity_scores("I don't hate you")

In [None]:
sia.polarity_scores("I don't not love you")

In [None]:
sia.polarity_scores("I love you")

In [None]:
sia.polarity_scores("I LOVE you")

In [None]:
sia.polarity_scores("I really love you")

In [None]:
sia.polarity_scores("I am in love with you")

In [None]:
sia.polarity_scores("I am so in love with you")

Contrast

In [None]:
sia.polarity_scores("I usually hate shrimp but I loved this")

The part after the "but" takes precedence.
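
The effect of a contrastive "but" can be imitated with a toy scorer. This is only a rough sketch, not VADER's code: the lexicon values are hypothetical, the function name `toy_but_score` is made up, and the weights 0.5 (before "but") and 1.5 (after "but") are an assumption based on my reading of VADER's rule.

```python
# Toy lexicon: hypothetical valence values chosen for illustration only.
TOY_LEXICON = {"hate": -2.7, "loved": 2.9, "liked": 1.8}

def toy_but_score(sentence, before_weight=0.5, after_weight=1.5):
    """Sum toy word valences, down-weighting words before 'but'
    and up-weighting words after it (assumed weights)."""
    words = sentence.lower().split()
    but_index = words.index("but") if "but" in words else None
    total = 0.0
    for position, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        if but_index is not None:
            if position < but_index:
                valence *= before_weight
            elif position > but_index:
                valence *= after_weight
        total += valence
    return total

with_but = toy_but_score("I usually hate shrimp but I loved this")
with_and = toy_but_score("I usually hate shrimp and I loved this")
print(with_but, with_and)
```

With "but", the positive word after the conjunction dominates the sum; with "and", the two words are weighted equally and nearly cancel.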

In [None]:
sia.polarity_scores("I usually hate shrimp and I loved this")

In [None]:
sia.polarity_scores("I usually hate shrimp and I liked this")

## Cornell's movie review data

https://www.cs.cornell.edu/people/pabo/movie-review-data/

In [None]:
# Open the data files
# read the lines of the files
# and for every line, convert it into an ASCII string

with open('rt-polaritydata/rt-polarity.neg','rb') as f:
    negReviews = f.readlines()
    for i in range(len(negReviews)):
        negReviews[i] = str(negReviews[i], 'ascii', errors='ignore')
        
with open('rt-polaritydata/rt-polarity.pos','rb') as f:
    posReviews = f.readlines()
    for i in range(len(posReviews)):
        posReviews[i] = str(posReviews[i], 'ascii', errors='ignore')

We'll work with the data in Pandas.

1. put the data into dataframes
2. add a column for polarity scores
3. concatenate the positive and negative reviews together into one collective dataframe

In [None]:
import pandas as pd

In [None]:
dfpos = pd.DataFrame({'Review':posReviews, 'Polarity':1})
dfneg = pd.DataFrame({'Review':negReviews, 'Polarity':-1})

In [None]:
dfpos.head()

In [None]:
dfall = pd.concat([dfpos,dfneg], ignore_index=True)

Let's look at a couple example entries.

In [None]:
dfall.head()

In [None]:
dfall.tail()

Remember that `loc` is for indexing based on row and column labels, and that you can use Boolean indexing (i.e. you can use a true/false condition to retrieve specific rows or columns).

In [None]:
dfall.loc[dfall['Polarity']==1,'Review']

In [None]:
dfall.loc[dfall['Polarity']==-1,'Review'][:5]

The following defines a function to return the Sentiment Intensity Analyzer's compound score for any review.

In [None]:
def getSentiment(review):
    return sia.polarity_scores(review)['compound']

In [None]:
# Test
myreview = 'This movie tries to be Star Wars but fails miserably.'

In [None]:
getSentiment(myreview)
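
The compound score returned by `getSentiment` always lies between -1 and 1. Here is a minimal sketch of how an unbounded sum of word valences can be squashed into that range, assuming the normalization used in VADER's reference implementation. The function name `normalize_valence` is made up for this example, and the constant 15 is taken to be VADER's default.

```python
import math

def normalize_valence(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the interval (-1, 1).

    alpha controls how quickly the result saturates; 15 is assumed here.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Larger sums move toward +/-1 but never reach it.
for raw_sum in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw_sum, round(normalize_valence(raw_sum), 4))
```

Because of this saturation, doubling the number of strongly negative words in a sentence does not double the compound score.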

We're going to make a new list of the review scores from VADER.

We'll use a list comprehension to streamline this process.

In [None]:
# Example list comprehension
[i for i in [1,2,3,4]]

In [None]:
[a for a in range(5)]

In [None]:
[x for x in [2,3,6,5,7,8,4] if x > 5]

The following makes a list of VADER's scores for every row of the dataframe `dfall` and adds them as a new column, `VaderSentiment`.

In [None]:
dfall['VaderSentiment'] = [getSentiment(review) for review in dfall['Review']]

In [None]:
dfall.head()

Count the number of rows where `Polarity` is 1 and `VaderSentiment` is greater than 0 (that is, where VADER would classify the review as positive).

In [None]:
dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()

We can quantify the fraction of positive reviews that VADER correctly classifies as positive.

In [None]:
correct = dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()
total = dfall.loc[(dfall['Polarity']==1),'Review'].count()
correct/total

And the fraction of negative reviews that it correctly classifies as negative.

In [None]:
correct = dfall.loc[(dfall['Polarity']==-1) & (dfall['VaderSentiment']<0),'Review'].count()
total = dfall.loc[(dfall['Polarity']==-1),'Review'].count()
correct/total

Less than 50% correct for the negative reviews! That is worse than random guessing between two classes.
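
One thing worth checking when reading these numbers: a score of exactly 0 satisfies neither `> 0` nor `< 0`, so under the thresholds used above a neutral score counts as a miss for both classes. The sketch below separates the three cases. The function name `class_breakdown` and the scores are hypothetical, chosen only for illustration.

```python
def class_breakdown(scores):
    """Return the fractions of scores that are positive, negative, and exactly zero."""
    total = len(scores)
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    neutral = total - positive - negative
    return positive / total, negative / total, neutral / total

# Hypothetical compound scores for eight truly negative reviews.
toy_negative_review_scores = [-0.6, -0.3, 0.0, 0.0, 0.4, 0.2, -0.1, 0.0]
pos_frac, neg_frac, neutral_frac = class_breakdown(toy_negative_review_scores)
print(pos_frac, neg_frac, neutral_frac)
```

On the real data, the same function could be applied to `list(dfall.loc[dfall['Polarity']==-1, 'VaderSentiment'])` to see how many misses are neutral scores rather than outright positive ones.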

Let's check out a couple examples.

In [None]:
for i in dfall.loc[(dfall['Polarity']==-1)][:5].index:
    print(dfall.loc[i,'VaderSentiment'], ':', dfall.loc[i,'Review'])

In [None]:
getSentiment('''exploitative and largely devoid of the depth or 
             sophistication that would make watching such a graphic 
             treatment of the crimes bearable''')

Let's look at the distribution of scores to see if that provides any insights.

In [None]:
dfall.loc[dfall['Polarity']==1, 'VaderSentiment'].hist()

In [None]:
dfall.loc[dfall['Polarity']==-1, 'VaderSentiment'].hist()

The total accuracy is given by:

In [None]:
poscorrect = dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()
negcorrect = dfall.loc[(dfall['Polarity']==-1) & (dfall['VaderSentiment']<0),'Review'].count()
total = dfall['Review'].count()
(poscorrect + negcorrect)/total

We can encapsulate the essential code from above into a function to generalize the process.

In [None]:
def runScoring(dfall):
    poscorrect = dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()
    postotal = dfall.loc[(dfall['Polarity']==1),'Review'].count()

    negcorrect = dfall.loc[(dfall['Polarity']==-1) & (dfall['VaderSentiment']<0),'Review'].count()
    negtotal = dfall.loc[(dfall['Polarity']==-1),'Review'].count()

    total = dfall['Review'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(dfall)

## Sentiwordnet

NLTK includes functionality for using Sentiwordnet, a lexical resource built on WordNet's synsets (sets of words that share a meaning, much like synonyms). Sentiwordnet assigns each synset a positive, a negative, and an objective score, which can be used to help assess sentiment.

In [None]:
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
nltk.download('wordnet')  # Sentiwordnet looks its synsets up in WordNet

In [None]:
list(swn.senti_synsets('funny'))

In [None]:
list(swn.senti_synsets('funny'))[0]

In [None]:
list(swn.senti_synsets('funny'))[0].pos_score()

In [None]:
list(swn.senti_synsets('funny'))[0].neg_score()

In [None]:
list(swn.senti_synsets('funny'))[0].obj_score()

In [None]:
for i in list(swn.senti_synsets('funny')):
    print(i)

`wordnet` allows us to get the definitions of the synsets:

In [None]:
from nltk.corpus import wordnet

In [None]:
for i in wordnet.synsets('funny'):
    print(i,i.definition())

Consider one review:

In [None]:
dfall.loc[0,'Review']

We could use the synset polarity scores of individual words in a sentence to manually score a review's sentiment:

1. Break up the sentence into words.
2. Remove stopwords.
3. Sum the synset scores of the words.
   * For each word, a simple first attempt is to go through all of its synsets and (a) add the positive score if it is larger than the negative score or (b) subtract the negative score otherwise, and then divide the word's total by its number of synsets.
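
The per-word averaging in step 3 can be tried out without any corpus downloads. This is a minimal sketch: the helper name `average_synset_score` and the (positive, negative) score pairs are hypothetical, standing in for what `pos_score()` and `neg_score()` would return for each synset of a word.

```python
def average_synset_score(synset_scores):
    """Average signed scores over a word's synsets.

    synset_scores is a list of (pos_score, neg_score) pairs, one per synset.
    """
    if not synset_scores:
        return 0.0  # a word with no synsets contributes nothing
    signed_total = 0.0
    for pos_score, neg_score in synset_scores:
        if pos_score > neg_score:
            signed_total += pos_score
        else:
            signed_total -= neg_score
    return signed_total / len(synset_scores)

# Hypothetical (pos, neg) pairs for a word with four synsets.
toy_word = [(0.5, 0.0), (0.0, 0.25), (0.125, 0.0), (0.0, 0.0)]
print(average_synset_score(toy_word))
```

Note that every synset counts equally, so a rare sense of a word pulls the average as much as its most common sense does.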

In [None]:
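from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
nltk.download('punkt')  # tokenizer models used by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
myStopWords = list(punctuation) + stopwords.words('english')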

Example of breaking a review into a list of individual words:

In [None]:
[w for w in word_tokenize(dfall.loc[0,'Review'].lower())]

The same list, but with stopwords removed:

In [None]:
[w for w in word_tokenize(dfall.loc[0,'Review'].lower()) if w not in myStopWords]

Here's our function for getting the average synset scores of words in a review and summing them all up to get a polarity score for the review.

In [None]:
def naiveSentiment(review):
    reviewPolarity = 0.0
    words = [w for w in word_tokenize(review.lower()) if w not in myStopWords]
    for word in words:
        # Look up the word's synsets once and reuse the list
        synsets = list(swn.senti_synsets(word))
        if len(synsets) > 0:
            sentScore = 0.0
            for i in synsets:
                if i.pos_score() > i.neg_score():
                    sentScore += i.pos_score()
                else:
                    sentScore -= i.neg_score()
            # Average over the word's synsets, then add to the review total
            reviewPolarity += sentScore / len(synsets)

    return reviewPolarity

In [None]:
naiveSentiment(dfall.loc[0,'Review'])

Make a new column in our main dataframe that uses our sentiwordnet-based scoring system.

In [None]:
dfall['naiveSentiment'] = [naiveSentiment(review) for review in dfall['Review']]

Copy the above `runScoring` for a final method assessment, but now add an extra parameter for specifying which sentiment column to use.

In [None]:
def runScoring(dfall,sentimentMethod):
    poscorrect = dfall.loc[(dfall['Polarity']==1) & (dfall[sentimentMethod]>0),'Review'].count()
    postotal = dfall.loc[(dfall['Polarity']==1),'Review'].count()

    negcorrect = dfall.loc[(dfall['Polarity']==-1) & (dfall[sentimentMethod]<0),'Review'].count()
    negtotal = dfall.loc[(dfall['Polarity']==-1),'Review'].count()

    total = dfall['Review'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(dfall, 'VaderSentiment')

In [None]:
runScoring(dfall, 'naiveSentiment')

This synset-scoring method does a slightly better job at classification, even though its approach is relatively simplistic.

Let's again look at the distribution of scores, split by the true polarity of the reviews.

In [None]:
dfall.loc[dfall['Polarity']==1, 'naiveSentiment'].hist()

In [None]:
dfall.loc[dfall['Polarity']==-1, 'naiveSentiment'].hist()

There are many shortcomings, and one is very easy to spot: VADER accounts for negation, while our naive sentiment scorer with synset averaging does not.

In [None]:
getSentiment('this restaurant is lousy')

In [None]:
getSentiment('this restaurant is not lousy')

In [None]:
naiveSentiment('this restaurant is lousy')

In [None]:
naiveSentiment('this restaurant is not lousy')

Why is this?

In [None]:
print(myStopWords)

Note that "not" is in the stopword list, so `naiveSentiment` drops it before any word is scored.
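
One possible fix is to keep negation words out of the stopword filter and flip the sign of the word that follows them. The sketch below is a rough illustration with hypothetical stand-ins (a tiny stopword set, a tiny word-score table, and the made-up name `toy_negation_sentiment`). It is not how VADER handles negation.

```python
# Hypothetical stand-ins: a tiny stopword set and a tiny word-score table.
toy_stopwords = {"this", "is", "not", "the"}
toy_word_scores = {"lousy": -0.5, "great": 0.6}
negation_words = {"not", "no", "never"}

# Keep negation words out of the stopword filter.
kept_stopwords = toy_stopwords - negation_words

def toy_negation_sentiment(sentence):
    """Sum toy word scores, flipping the sign of the word right after a negation."""
    words = [w for w in sentence.lower().split() if w not in kept_stopwords]
    total = 0.0
    flip_next = False
    for word in words:
        if word in negation_words:
            flip_next = True
            continue
        score = toy_word_scores.get(word, 0.0)
        total += -score if flip_next else score
        flip_next = False  # the flip only applies to the very next word
    return total

print(toy_negation_sentiment("this restaurant is lousy"))
print(toy_negation_sentiment("this restaurant is not lousy"))
```

The flip only reaches the immediately following word, so a phrase like "not very lousy" would still be scored as negative. A wider negation window would be needed to handle such cases.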

## Naive Bayes

In [None]:
dfall

We can manually split our dataframe into training and test sets (and make sure that we keep a 50/50 split in each of positive/negative reviews).

In [None]:
trainNum = 2000
testNum = 5331 - trainNum

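trainPosReviews = dfall.loc[dfall['Polarity']==1][:trainNum]
# .copy() makes the test sets independent dataframes, so adding a column to
# them later does not raise pandas' SettingWithCopyWarning
testPosReviews = dfall.loc[dfall['Polarity']==1][trainNum:].copy()

trainNegReviews = dfall.loc[dfall['Polarity']==-1][:trainNum]
testNegReviews = dfall.loc[dfall['Polarity']==-1][trainNum:].copy()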

In [None]:
trainPosReviews

We're going to use how often each word occurs in positive and in negative reviews to estimate the probabilities needed for Naive Bayes classification.
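
To make the idea concrete, here is a tiny from-scratch sketch of Naive Bayes on word presence. Everything in it is hypothetical (the four toy reviews, the names `train_toy_nb` and `classify_toy_nb`), and it uses add-one smoothing as an assumption. NLTK's classifier has its own probability estimator and implementation, so its numbers will differ.

```python
import math

# Hypothetical, pre-tokenized training reviews labeled 1 (positive) or -1 (negative).
toy_training = [
    (["great", "fun", "movie"], 1),
    (["great", "acting"], 1),
    (["boring", "movie"], -1),
    (["boring", "awful", "acting"], -1),
]

toy_vocab = sorted({word for words, _ in toy_training for word in words})
labels = sorted({label for _, label in toy_training})

def train_toy_nb(training):
    """Estimate P(label) and P(word present | label) with add-one smoothing."""
    priors, presence = {}, {}
    for label in labels:
        docs = [set(words) for words, doc_label in training if doc_label == label]
        priors[label] = len(docs) / len(training)
        presence[label] = {
            word: (sum(word in doc for doc in docs) + 1) / (len(docs) + 2)
            for word in toy_vocab
        }
    return priors, presence

def classify_toy_nb(words, priors, presence):
    """Pick the label with the highest log posterior (up to a shared constant)."""
    present = set(words)
    best_label, best_logp = None, -math.inf
    for label in labels:
        logp = math.log(priors[label])
        for word in toy_vocab:
            p = presence[label][word]
            # Both present and absent vocabulary words contribute evidence.
            logp += math.log(p if word in present else 1 - p)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

toy_priors, toy_presence = train_toy_nb(toy_training)
print(classify_toy_nb(["great", "movie"], toy_priors, toy_presence))
print(classify_toy_nb(["boring", "movie"], toy_priors, toy_presence))
```

The "naive" part is the assumption that words occur independently of each other given the label, which is what allows the per-word probabilities to be multiplied (here, summed as logarithms).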

The following makes lists of words found in the positive reviews and in the negative reviews (and drops stopwords).  It also makes a list of all words called `vocab`.

In [None]:
posWords = []
negWords = []
vocab = []
for i in trainPosReviews.index:
    words = [w for w in word_tokenize(trainPosReviews.loc[i,'Review'].lower()) if w not in myStopWords]
    for word in words:
        if word not in posWords:
            posWords.append(word)
        if word not in vocab:
            vocab.append(word)
for i in trainNegReviews.index:
    words = [w for w in word_tokenize(trainNegReviews.loc[i,'Review'].lower()) if w not in myStopWords]
    for word in words:
        if word not in negWords:
            negWords.append(word)
        if word not in vocab:
            vocab.append(word)
    

In [None]:
# Here is the list of all words retained:
vocab

Each review is made into a "feature vector".  This vector is a long dictionary -- every word in the total `vocab` list is a key and for each key, the value is set to `1` if the word is in the review and to `0` if the word is not in the review.

In [None]:
def makeFeatureVector(review):
    words = [w for w in word_tokenize(review.lower()) if w not in myStopWords]
    featureVector = {}
    for word in vocab:
        if word in words:
            featureVector[word] = 1
        else:
            featureVector[word] = 0
    return featureVector

Here's an example of the feature vector for a review that reads "This is my favorite movie"

In [None]:
makeFeatureVector('This is my favorite movie')

Make our training data by making a list that contains the review strings and their respective Polarity scores.

In [None]:
trainingData = []
for i in trainPosReviews.index:
    trainingData.append((trainPosReviews.loc[i,'Review'],trainPosReviews.loc[i,'Polarity']))
for i in trainNegReviews.index:
    trainingData.append((trainNegReviews.loc[i,'Review'],trainNegReviews.loc[i,'Polarity']))

Here are the first five items of our training data:

In [None]:
trainingData[:5]

And here's an example negative review contained in our training data:

In [None]:
trainingData[2500]

As part of our steps to pre-process the data, we need to convert each review in our training data into a feature vector.

To do this, we can use `nltk.classify.apply_features`.  We pass in our training dataset, as well as the function that we have defined above to make a feature vector out of a review, `makeFeatureVector`.  `apply_features` applies the function to convert every review contained in the training dataset into a feature vector, and the result gets returned and then stored into our new variable `trainingFeatureVectors`.
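
Conceptually, `apply_features` transforms the first element of each (review, label) pair and leaves the label untouched. Here is a minimal eager sketch of that behavior, using a hypothetical stand-in `toy_feature_vector` with a three-word vocabulary in place of `makeFeatureVector`:

```python
def toy_feature_vector(text):
    """Hypothetical stand-in for makeFeatureVector with a three-word vocabulary."""
    toy_vocab = ["good", "bad", "movie"]
    words = text.lower().split()
    return {word: int(word in words) for word in toy_vocab}

toy_data = [("Good movie", 1), ("Bad movie", -1)]

# Eager equivalent: transform the text, keep the label.
toy_vectors = [(toy_feature_vector(text), label) for text, label in toy_data]
print(toy_vectors)
```

The NLTK version builds its result lazily, computing each feature vector only when it is accessed, which keeps memory use down when the vocabulary is large.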

In [None]:
trainingFeatureVectors = nltk.classify.apply_features(makeFeatureVector, trainingData)

Here's how the first review turned out:

In [None]:
trainingFeatureVectors[0]

NLTK has a class `NaiveBayesClassifier`.  Rather than calling `fit` as we are used to from scikit-learn, here we use the `train` method.  Furthermore, the data passed into `train` contains both the independent variables (each review's feature vector) and the dependent variable (the polarity label).

In [None]:
trainedClassifier = nltk.NaiveBayesClassifier.train(trainingFeatureVectors)

Now that we have trained our classifier, we can use it to predict the sentiment score of any review.

To make a prediction, we need to convert the review into a feature vector and then pass that feature vector into our trained classifier to get the prediction.

The following function carries out those two steps:

In [None]:
def naiveBayesSentimentCalculator(review):
    problemFeatureVector = makeFeatureVector(review)
    return trainedClassifier.classify(problemFeatureVector)

Here are two test examples:

In [None]:
naiveBayesSentimentCalculator("What an awesome movie")

In [None]:
naiveBayesSentimentCalculator("What a terrible movie")

As you can see, since our Polarity scores were 1 and -1, our classifier gives us 1 and -1 as possible classes.

To quantify how our classifier performs, we now pass in the test data to produce predicted sentiment scores that we can compare against the actual test data's Polarity.

In [None]:
testPosReviews['naiveBayesSentiment'] = [naiveBayesSentimentCalculator(review) for review in testPosReviews['Review']]
testNegReviews['naiveBayesSentiment'] = [naiveBayesSentimentCalculator(review) for review in testNegReviews['Review']]

The following function assesses the accuracy of our Naive Bayes classifier.

In [None]:
def runScoringNB():
    poscorrect = testPosReviews.loc[(testPosReviews['Polarity']==1) & (testPosReviews['naiveBayesSentiment']==1),'Review'].count()
    postotal = testPosReviews.loc[(testPosReviews['Polarity']==1),'Review'].count()

    negcorrect = testNegReviews.loc[(testNegReviews['Polarity']==-1) & (testNegReviews['naiveBayesSentiment']==-1),'Review'].count()
    negtotal = testNegReviews.loc[(testNegReviews['Polarity']==-1),'Review'].count()

    total = testPosReviews['Review'].count() + testNegReviews['Review'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(dfall, 'VaderSentiment')
runScoring(dfall, 'naiveSentiment')
runScoringNB()

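The accuracy here is starting to improve! Keep in mind that the Naive Bayes figures are computed on the held-out test reviews only, while the VADER and Sentiwordnet figures cover all of the reviews.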