# Sentiment and Classification

For sentiment, we will look at VADER and SentiWordNet, both available through NLTK.

* "VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."
  * https://github.com/cjhutto/vaderSentiment
  
* SentiWordNet is a lexical resource, available through NLTK, that adds positive, negative, and objective sentiment scores to each WordNet synset.
  * https://www.nltk.org/howto/sentiwordnet.html

In [None]:
import nltk
from nltk.sentiment import vader
nltk.download('vader_lexicon')

In [None]:
sia = vader.SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores('Luke, I am your father.')

In [None]:
sia.polarity_scores('NO!!!!!!')

In [None]:
sia.polarity_scores('I hate you.')

In [None]:
sia.polarity_scores('I HATE you.')

In [None]:
sia.polarity_scores('I HATE you!!!!')

In [None]:
sia.polarity_scores('Thank you Dad')

Try typing in a couple of sentences of your own to explore the polarity scores.

In [None]:
sia.polarity_scores(':D')

In [None]:
sia.polarity_scores(':(')

In [None]:
sia.polarity_scores('>:(')

Negation

In [None]:
sia.polarity_scores("I don't hate you")

In [None]:
sia.polarity_scores("I don't not love you")

In [None]:
sia.polarity_scores("I love you")

In [None]:
sia.polarity_scores("I LOVE you")

In [None]:
sia.polarity_scores("I really love you")

In [None]:
sia.polarity_scores("I am in love with you")

In [None]:
sia.polarity_scores("I am so in love with you")

Contrast

In [None]:
sia.polarity_scores("I usually hate shrimp but I loved this")

The clause after "but" takes precedence.
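
To make the contrast effect concrete, here is a minimal sketch of one way a "but" rule can work. The toy lexicon and its values are illustrative assumptions, and the 0.5 and 1.5 weights are modeled on our reading of VADER's heuristic rather than taken from its documentation; VADER's real pipeline applies several other rules as well.

```python
# Toy valence lexicon: the words and numbers are illustrative assumptions,
# not entries copied from VADER's lexicon.
TOY_LEXICON = {"hate": -2.7, "loved": 2.9, "liked": 1.8}


def score_with_contrast(text, before_weight=0.5, after_weight=1.5):
    """Sum word valences, reweighting the words around the first 'but'."""
    words = text.lower().split()
    valences = [TOY_LEXICON.get(w, 0.0) for w in words]
    if "but" in words:
        pivot = words.index("but")
        # Dampen everything before 'but' and amplify everything after it
        valences = [
            v * (before_weight if i < pivot else after_weight)
            for i, v in enumerate(valences)
        ]
    return sum(valences)


with_but = score_with_contrast("I usually hate shrimp but I loved this")
with_and = score_with_contrast("I usually hate shrimp and I loved this")
print(with_but, with_and)
```

With "and" the two opinions nearly cancel, while with "but" the second clause dominates the total.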

In [None]:
sia.polarity_scores("I usually hate shrimp and I loved this")

In [None]:
sia.polarity_scores("I usually hate shrimp and I liked this")

## Cornell's movie review data

https://www.cs.cornell.edu/people/pabo/movie-review-data/

In [None]:
# Read as bytes and decode to ASCII, silently dropping any non-ASCII characters
with open('rt-polaritydata/rt-polarity.neg','rb') as f:
    negReviews = [str(line, 'ascii', errors='ignore') for line in f]

with open('rt-polaritydata/rt-polarity.pos','rb') as f:
    posReviews = [str(line, 'ascii', errors='ignore') for line in f]

In [None]:
import pandas as pd

In [None]:
dfpos = pd.DataFrame({'Review':posReviews, 'Polarity':1})
dfneg = pd.DataFrame({'Review':negReviews, 'Polarity':-1})

In [None]:
dfpos.head()

In [None]:
dfall = pd.concat([dfpos,dfneg], ignore_index=True)

In [None]:
dfall.head()

In [None]:
dfall.tail()

In [None]:
dfall.loc[dfall['Polarity']==1,'Review']

In [None]:
dfall.loc[dfall['Polarity']==-1,'Review'][:5]

## Classify with VADER

In [None]:
def getSentiment(review):
    return sia.polarity_scores(review)['compound']

In [None]:
# Test
myreview = 'This movie tries to be Star Wars but fails miserably.'

In [None]:
getSentiment(myreview)
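
The `compound` value that `getSentiment` returns always falls between -1 and 1. As far as we can tell from the vaderSentiment source, this comes from squashing the summed word valences with the function sketched below; the default `alpha=15` is our reading of that source, so treat it as an assumption rather than a documented constant.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Larger raw sums approach -1 or 1 but never reach them
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize(raw), 4))
```

A raw sum of zero stays at zero, which is why text without any lexicon words gets a compound score of exactly 0.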

In [None]:
dfall['VaderSentiment'] = [getSentiment(review) for review in dfall['Review']]

In [None]:
dfall.head()

In [None]:
dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()

In [None]:
correct = dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()
total = dfall.loc[(dfall['Polarity']==1),'Review'].count()
correct/total

In [None]:
correct = dfall.loc[(dfall['Polarity']==-1) & (dfall['VaderSentiment']<0),'Review'].count()
total = dfall.loc[(dfall['Polarity']==-1),'Review'].count()
correct/total

In [None]:
for i in dfall.loc[(dfall['Polarity']==-1)][:5].index:
    print(dfall.loc[i,'VaderSentiment'], ':', dfall.loc[i,'Review'])

In [None]:
getSentiment('''exploitative and largely devoid of the depth or 
             sophistication that would make watching such a graphic 
             treatment of the crimes bearable''')

In [None]:
dfall.loc[dfall['Polarity']==1, 'VaderSentiment'].hist()

In [None]:
dfall.loc[dfall['Polarity']==-1, 'VaderSentiment'].hist()

In [None]:
poscorrect = dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()
negcorrect = dfall.loc[(dfall['Polarity']==-1) & (dfall['VaderSentiment']<0),'Review'].count()
total = dfall['Review'].count()
(poscorrect + negcorrect)/total

Why do we write functions?  To generalize the coding we've done.

In [None]:
def runScoring(dfall):
    poscorrect = dfall.loc[(dfall['Polarity']==1) & (dfall['VaderSentiment']>0),'Review'].count()
    postotal = dfall.loc[(dfall['Polarity']==1),'Review'].count()

    negcorrect = dfall.loc[(dfall['Polarity']==-1) & (dfall['VaderSentiment']<0),'Review'].count()
    negtotal = dfall.loc[(dfall['Polarity']==-1),'Review'].count()

    total = dfall['Review'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(dfall)
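
One detail of this scoring is worth spelling out: because the comparisons are strictly `> 0` and `< 0`, a review whose score is exactly 0 counts as wrong for either class. The sketch below reproduces the same per-class accuracy logic in plain Python on made-up labels and scores (the helper name and the data are hypothetical).

```python
def class_accuracy(labels, scores):
    """Return (positive, negative, overall) accuracy for sign-based predictions."""
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = sum(1 for y in labels if y == -1)
    # A score of exactly 0 satisfies neither test, so it is always a miss
    pos_correct = sum(1 for y, s in zip(labels, scores) if y == 1 and s > 0)
    neg_correct = sum(1 for y, s in zip(labels, scores) if y == -1 and s < 0)
    overall = (pos_correct + neg_correct) / len(labels)
    return pos_correct / pos_total, neg_correct / neg_total, overall


# Made-up data: four positive and four negative reviews
labels = [1, 1, 1, 1, -1, -1, -1, -1]
scores = [0.6, 0.0, 0.3, -0.2, -0.5, 0.0, 0.4, -0.1]
print(class_accuracy(labels, scores))
```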

## Classify with SentiWordNet

In [None]:
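from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
# SentiWordNet scores are attached to WordNet synsets, so WordNet is needed too
nltk.download('wordnet')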

In [None]:
list(swn.senti_synsets('funny'))

In [None]:
list(swn.senti_synsets('funny'))[0]

In [None]:
list(swn.senti_synsets('funny'))[0].pos_score()

In [None]:
list(swn.senti_synsets('funny'))[0].neg_score()

In [None]:
list(swn.senti_synsets('funny'))[0].obj_score()

In [None]:
for i in list(swn.senti_synsets('funny')):
    print(i)

In [None]:
from nltk.corpus import wordnet

In [None]:
for i in wordnet.synsets('funny'):
    print(i,i.definition())

In [None]:
dfall.loc[0,'Review']

In [None]:
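from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
# word_tokenize and stopwords need these resources
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')
myStopWords = list(punctuation) + stopwords.words('english')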

In [None]:
[w for w in word_tokenize(dfall.loc[0,'Review'].lower())]

In [None]:
[w for w in word_tokenize(dfall.loc[0,'Review'].lower()) if w not in myStopWords]

In [None]:
def naiveSentiment(review):
    reviewPolarity = 0.0
    words = [w for w in word_tokenize(review.lower()) if w not in myStopWords]
    for word in words:
        # Look up the senses once per word instead of on every use
        synsets = list(swn.senti_synsets(word))
        if synsets:
            sentScore = 0.0
            for synset in synsets:
                # Each sense contributes its dominant polarity
                if synset.pos_score() > synset.neg_score():
                    sentScore += synset.pos_score()
                else:
                    sentScore -= synset.neg_score()
            # Average over all senses of the word
            reviewPolarity += sentScore / len(synsets)

    return reviewPolarity

In [None]:
naiveSentiment(dfall.loc[0,'Review'])
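
The averaging inside `naiveSentiment` has a side effect that is easy to miss: a word with both positive and negative senses can cancel itself out. The sketch below mimics the same per-word rule on a hypothetical table of (positive, negative) score pairs, so the numbers are invented and do not come from SentiWordNet.

```python
# Hypothetical (pos_score, neg_score) pairs, one per sense of each word
TOY_SENSES = {
    "funny": [(0.5, 0.0), (0.0, 0.5), (0.25, 0.0), (0.0, 0.25)],
    "dull": [(0.0, 0.75), (0.0, 0.5)],
}


def word_score(word):
    """Average the dominant polarity of every sense of a word."""
    senses = TOY_SENSES.get(word, [])
    if not senses:
        return 0.0
    total = 0.0
    for pos, neg in senses:
        total += pos if pos > neg else -neg
    return total / len(senses)


def review_score(words):
    """Sum the per-word averages, as naiveSentiment does."""
    return sum(word_score(w) for w in words)


print(word_score("funny"), word_score("dull"))
print(review_score(["funny", "dull", "film"]))
```

Here the ambiguous word contributes nothing, so the review score is driven entirely by the unambiguous one.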

In [None]:
dfall['naiveSentiment'] = [naiveSentiment(review) for review in dfall['Review']]

Copy runScoring from above, but add a parameter that names the sentiment column to score.

In [None]:
def runScoring(dfall,sentimentMethod):
    poscorrect = dfall.loc[(dfall['Polarity']==1) & (dfall[sentimentMethod]>0),'Review'].count()
    postotal = dfall.loc[(dfall['Polarity']==1),'Review'].count()

    negcorrect = dfall.loc[(dfall['Polarity']==-1) & (dfall[sentimentMethod]<0),'Review'].count()
    negtotal = dfall.loc[(dfall['Polarity']==-1),'Review'].count()

    total = dfall['Review'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(dfall, 'VaderSentiment')

In [None]:
runScoring(dfall, 'naiveSentiment')

In [None]:
dfall.loc[dfall['Polarity']==1, 'naiveSentiment'].hist()

In [None]:
dfall.loc[dfall['Polarity']==-1, 'naiveSentiment'].hist()

In [None]:
getSentiment('this restaurant is lousy')

In [None]:
getSentiment('this restaurant is not lousy')

In [None]:
naiveSentiment('this restaurant is lousy')

In [None]:
naiveSentiment('this restaurant is not lousy')

In [None]:
print(myStopWords)
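
NLTK's English stop word list contains "not", so the naive scorer throws the negation away before it ever looks up a score. The sketch below shows the effect with a made-up score table and a deliberately simple rule that flips the sign of the word following a negation; that rule is our assumption for illustration, not what VADER or SentiWordNet does.

```python
TOY_SCORES = {"lousy": -0.5}          # invented score, for illustration only
STOP_WORDS = {"this", "is", "not"}    # tiny stand-in for the real stop list
NEGATIONS = {"not"}


def score_dropping_stop_words(text):
    """Filter every stop word, including 'not', then sum the scores."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return sum(TOY_SCORES.get(w, 0.0) for w in words)


def score_keeping_negations(text):
    """Keep negations and flip the sign of the word that follows one."""
    kept = STOP_WORDS - NEGATIONS
    words = [w for w in text.lower().split() if w not in kept]
    total, flip = 0.0, False
    for w in words:
        if w in NEGATIONS:
            flip = True
            continue
        value = TOY_SCORES.get(w, 0.0)
        total += -value if flip else value
        flip = False
    return total


for sentence in ("this restaurant is lousy", "this restaurant is not lousy"):
    print(score_dropping_stop_words(sentence), score_keeping_negations(sentence))
```

With the stop words dropped, both sentences get the same score; keeping the negation lets the second one come out positive.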

## Classify with Naive Bayes Classifier

In [None]:
dfall

In [None]:
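trainNum = 2000
# The dataset has 5331 reviews per class; the rest are held out for testing
testNum = 5331 - trainNum

trainPosReviews = dfall.loc[dfall['Polarity']==1][:trainNum]
# .copy() so a prediction column can be added later without a SettingWithCopyWarning
testPosReviews = dfall.loc[dfall['Polarity']==1][trainNum:].copy()

trainNegReviews = dfall.loc[dfall['Polarity']==-1][:trainNum]
testNegReviews = dfall.loc[dfall['Polarity']==-1][trainNum:].copy()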

In [None]:
trainPosReviews

In [None]:
posWords = []
negWords = []
vocab = []
for i in trainPosReviews.index:
    words = [w for w in word_tokenize(trainPosReviews.loc[i,'Review'].lower()) if w not in myStopWords]
    for word in words:
        if word not in posWords:
            posWords.append(word)
        if word not in vocab:
            vocab.append(word)
for i in trainNegReviews.index:
    words = [w for w in word_tokenize(trainNegReviews.loc[i,'Review'].lower()) if w not in myStopWords]
    for word in words:
        if word not in negWords:
            negWords.append(word)
        if word not in vocab:
            vocab.append(word)
    

In [None]:
vocab

In [None]:
def makeFeatureVector(review):
    words = [w for w in word_tokenize(review.lower()) if w not in myStopWords]
    featureVector = {}
    for word in vocab:
        if word in words:
            featureVector[word] = 1
        else:
            featureVector[word] = 0
    return featureVector

In [None]:
makeFeatureVector('This is my favorite movie')
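
The feature vector is a binary bag of words: one entry per vocabulary word, set to 1 if the word occurs in the review and 0 otherwise. A minimal standalone sketch, using a hypothetical three-word vocabulary and a set for the membership test:

```python
def make_feature_vector(review_words, vocabulary):
    """Map each vocabulary word to 1 if it appears in the review, else 0."""
    present = set(review_words)  # set lookup is faster than scanning a list
    return {word: int(word in present) for word in vocabulary}


toy_vocab = ["favorite", "movie", "boring"]
vector = make_feature_vector("this is my favorite movie".split(), toy_vocab)
print(vector)
```

Words outside the vocabulary, such as "this" here, are simply ignored.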

In [None]:
trainingData = []
for i in trainPosReviews.index:
    trainingData.append((trainPosReviews.loc[i,'Review'],trainPosReviews.loc[i,'Polarity']))
for i in trainNegReviews.index:
    trainingData.append((trainNegReviews.loc[i,'Review'],trainNegReviews.loc[i,'Polarity']))

In [None]:
trainingData[:5]

In [None]:
trainingData[2500]

In [None]:
trainingFeatureVectors = nltk.classify.apply_features(makeFeatureVector, trainingData)

In [None]:
trainingFeatureVectors[0]

In [None]:
trainedClassifier = nltk.NaiveBayesClassifier.train(trainingFeatureVectors)
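
To see what training on binary features amounts to, here is a small from-scratch Naive Bayes sketch. It uses add-one smoothing, which is our assumption to keep the arithmetic simple; NLTK's classifier uses its own probability estimator, so its numbers will differ. The function names and the four toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (set_of_words, label) pairs. Returns a model dict."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)  # label -> word -> number of reviews
    vocabulary = set()
    for words, label in examples:
        word_counts[label].update(set(words))
        vocabulary.update(words)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def classify(model, words):
    """Return the label with the highest log posterior for the given words."""
    total = sum(model["labels"].values())
    present = set(words)
    best_label, best_logp = None, -math.inf
    for label, n_label in model["labels"].items():
        logp = math.log(n_label / total)  # log prior
        for word in model["vocab"]:
            # Add-one smoothing over the two outcomes (present / absent)
            p_present = (model["words"][label][word] + 1) / (n_label + 2)
            logp += math.log(p_present if word in present else 1 - p_present)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


toy_training = [
    ({"awesome", "movie"}, 1),
    ({"great", "fun"}, 1),
    ({"terrible", "movie"}, -1),
    ({"boring", "mess"}, -1),
]
toy_model = train_naive_bayes(toy_training)
print(classify(toy_model, {"awesome", "fun"}))
print(classify(toy_model, {"terrible", "mess"}))
```

Note that absent words contribute evidence too, just as the 0 entries in the feature vector do.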

In [None]:
def naiveBayesSentimentCalculator(review):
    problemFeatureVector = makeFeatureVector(review)
    return trainedClassifier.classify(problemFeatureVector)

In [None]:
naiveBayesSentimentCalculator("What an awesome movie")

In [None]:
naiveBayesSentimentCalculator("What a terrible movie")

In [None]:
testPosReviews['naiveBayesSentiment'] = [naiveBayesSentimentCalculator(review) for review in testPosReviews['Review']]
testNegReviews['naiveBayesSentiment'] = [naiveBayesSentimentCalculator(review) for review in testNegReviews['Review']]

Adapt runScoring for the Naive Bayes predictions, which are class labels (1 or -1) on the held-out test reviews rather than signed scores.

In [None]:
def runScoringNB():
    poscorrect = testPosReviews.loc[(testPosReviews['Polarity']==1) & (testPosReviews['naiveBayesSentiment']==1),'Review'].count()
    postotal = testPosReviews.loc[(testPosReviews['Polarity']==1),'Review'].count()

    negcorrect = testNegReviews.loc[(testNegReviews['Polarity']==-1) & (testNegReviews['naiveBayesSentiment']==-1),'Review'].count()
    negtotal = testNegReviews.loc[(testNegReviews['Polarity']==-1),'Review'].count()

    total = testPosReviews['Review'].count() + testNegReviews['Review'].count()

    print('The accuracy for positive reviews is: ' + str(poscorrect/postotal*100) + '%')
    print('The accuracy for negative reviews is: ' + str(negcorrect/negtotal*100) + '%')
    print('The overall accuracy is: ' + str((poscorrect+negcorrect)/total*100) + '%')

In [None]:
runScoring(dfall, 'VaderSentiment')
runScoring(dfall, 'naiveSentiment')
runScoringNB()