Let's first load the relevant packages and the processed data from our last step.

In [28]:
import pickle
import pandas as pd
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet

with open('../data/processed_reviews.pkl', 'rb') as file:
    try:
        processed_reviews = pickle.load(file)
        print("Processed reviews loaded successfully.")
    except EOFError:
        print("Error: Processed reviews is empty or corrupted.")

with open('../data/normalized_reviews.pkl', 'rb') as file2:
    try:
        normalized_reviews = pickle.load(file2)
        print("Normalized reviews loaded successfully.")
    except EOFError:
        print("Error: Normalized reviews is empty or corrupted.")

processed_reviews[0][0:5]

Processed reviews loaded successfully.
Normalized reviews loaded successfully.


[('Sean', 'NNP'),
 ('Murphy', 'NNP'),
 ('crew', 'VBD'),
 ('top', 'JJ'),
 ('salvage', 'NN')]

# Sentiment analysis

Within the context of our data, we are trying to determine whether a movie review is positive or negative. The scores generated by this approach, sometimes referred to as valence scores, range from -1 (negative) to 1 (positive).

How do we get sentiment scores? In a nutshell, this method creates scores using an unsupervised, lexicon-based approach: we use a lexicon (dictionary) to help us tabulate the scores. My favorite is the SentiWordNet lexicon, which takes the WordNet synsets (sets of synonyms that share one meaning) and labels each with a positive, a negative and an objective score. In plain English, we provide a word, its synsets are looked up, and each synset comes with scores that tell us whether it leans negative (toward -1) or positive (toward 1).
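
To make the lexicon idea concrete, here is a minimal sketch that does not touch SentiWordNet at all. The `TOY_LEXICON` dictionary, its scores and the `toy_valence` helper are made up for illustration; only the arithmetic (positive minus negative, averaged over the words that were found) mirrors the approach described above.

```python
# Hypothetical mini-lexicon: word -> (positive score, negative score).
# These values are invented for illustration, not taken from SentiWordNet.
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "boring": (0.0, 0.625),
    "film": (0.0, 0.0),
}


def toy_valence(words):
    """Average (positive - negative) over the words found in the lexicon."""
    scores = [TOY_LEXICON[w][0] - TOY_LEXICON[w][1] for w in words if w in TOY_LEXICON]
    if not scores:
        return 0.0  # nothing to score: treat the text as neutral
    return sum(scores) / len(scores)


print(toy_valence(["great", "film"]))   # 0.375
print(toy_valence(["boring", "film"]))  # -0.3125
```

Because every per-word value lies between -1 and 1, the average does too, which is where the -1 to 1 range comes from.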

We can start with the word 'happy'. WordNet returns a list of the synsets it belongs to.

In [30]:
wordnet.synsets('happy')

[Synset('happy.a.01'),
 Synset('felicitous.s.02'),
 Synset('glad.s.02'),
 Synset('happy.s.04')]

From our part-of-speech tags we know that happy can be an adjective, so we label it as such and pull the scores for its first synset.

In [31]:
happy = list(swn.senti_synsets('happy', 'a'))
# take the first adjective sense returned for 'happy'
happy0 = happy[0]
happy0.pos_score(), happy0.neg_score(), happy0.obj_score()

(0.875, 0.0, 0.125)

The function below tabulates scores for nouns, verbs, adjectives and adverbs. If a token carries one of these POS tags and SentiWordNet has synsets for it, the first synset is pulled and its scores are tallied.
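
The tag-to-lexicon step can be looked at in isolation. The sketch below is only an illustration: `penn_to_wordnet_pos` is a hypothetical helper, not part of NLTK, and it assumes Penn Treebank tags like the ones shown in the preview of `processed_reviews`.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code, or None if unsupported."""
    # A prefix match covers the tag variants, e.g. NN/NNS/NNP, VB/VBD/VBG, JJ/JJR, RB/RBR.
    for prefix, wn_pos in (("NN", "n"), ("VB", "v"), ("JJ", "a"), ("RB", "r")):
        if tag.startswith(prefix):
            return wn_pos
    return None


tagged = [("Sean", "NNP"), ("crew", "VBD"), ("top", "JJ"), ("the", "DT")]
print([(word, penn_to_wordnet_pos(tag)) for word, tag in tagged])
# [('Sean', 'n'), ('crew', 'v'), ('top', 'a'), ('the', None)]
```

One design difference worth knowing: the function in this post tests with `'RB' in tag`, a substring match that also accepts `WRB` (wh-adverbs such as "how"), whereas the prefix match in this sketch does not.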

In [12]:
def analyze_sentiment_sentiwordnet_lexicon(review, verbose=False):

    # review is already tokenized and POS tagged: a list of (word, tag) pairs
    tagged_text = review
    pos_score = neg_score = token_count = obj_score = 0
    # Penn Treebank tag fragment -> WordNet POS code
    tag_to_wordnet_pos = (('NN', 'n'),  # noun
                          ('VB', 'v'),  # verb
                          ('JJ', 'a'),  # adjective
                          ('RB', 'r'))  # adverb
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        for tag_fragment, wn_pos in tag_to_wordnet_pos:
            if tag_fragment in tag:
                synsets = list(swn.senti_synsets(word, wn_pos))
                if synsets:
                    # keep only the first sense returned for this word and POS
                    ss_set = synsets[0]
                    break
        # if senti-synset is found
        if ss_set is not None:
            # add the scores of the selected synset
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1

    # aggregate final scores; fall back to 1 so a review with no scorable
    # tokens gets a neutral 0.0 instead of raising ZeroDivisionError
    denominator = token_count or 1
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / denominator, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / denominator, 2)
        norm_pos_score = round(float(pos_score) / denominator, 2)
        norm_neg_score = round(float(neg_score) / denominator, 2)
        # to display results in a nice table
        # ('codes' replaces the 'labels' argument, which was removed in pandas 1.0)
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score,
                                         norm_neg_score, norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                             ['Predicted Sentiment', 'Objectivity',
                                                              'Positive', 'Negative', 'Overall']],
                                                             codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)

    return (final_sentiment, norm_final_score)

Running the function across all our reviews.

In [13]:
sentiments = [analyze_sentiment_sentiwordnet_lexicon(x) for x in processed_reviews]

word_sentiment, sentiment_score = zip(*sentiments)

#Preview

print(sentiments[0])

('positive', -0.0)


Next we add the score and the overall verbal sentiment (positive or negative) to our data. We see that Ghost Ship had a positive review; in this case any review with a sentiment score greater than or equal to zero is labelled as positive. This may not accurately represent how good the movie is, as there is only one review per movie. We could scrape hundreds of reviews per movie to make the estimate more reliable.
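
The `-0.0` score for Ghost Ship comes from rounding before labelling: a slightly negative mean rounds to `-0.0`, and `-0.0 >= 0` is true in Python, so the review is labelled positive. A minimal sketch of that edge case follows; `label_from_mean` is a hypothetical helper and `-0.004` is an invented input, since the unrounded mean for Ghost Ship is not shown here.

```python
def label_from_mean(mean_score, ndigits=2):
    """Round the mean score first, then label it, as the function above does."""
    rounded = round(mean_score, ndigits)
    return ("positive" if rounded >= 0 else "negative", rounded)


print(label_from_mean(-0.004))  # ('positive', -0.0)
print(label_from_mean(-0.02))   # ('negative', -0.02)
print(label_from_mean(0.03))    # ('positive', 0.03)
```

If that behaviour is unwanted, one option is to compare the unrounded mean against zero and round only for display.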

In [37]:
normalized_reviews['word_sentiment'] = word_sentiment
normalized_reviews['sentiment_score'] = sentiment_score
normalized_reviews.head()

Unnamed: 0,Movie,Review,word_sentiment,sentiment_score
0,Ghost Ship,Sean Murphy crew top salvage experts land sea ...,positive,-0.0
1,The Craft,SPOILERSI thought decent teen flick remember e...,negative,-0.02
2,House of 1000 Corpses,opinion House 1000 Corpses fan Fans genre Rob ...,negative,-0.01
3,The Haunting of Bly Manor,many people saying right Haunted house tales g...,positive,0.03
4,Attack on Titan,moment watch audiovisual masterpiece immediate...,positive,0.02


In [39]:
import pickle


with open('../data/sentiment_scores.pkl', 'wb') as file:
    pickle.dump(normalized_reviews, file)
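
A quick way to confirm that a pickle was written correctly is to read it back and compare. This is a minimal sketch under stated assumptions: a plain list of dicts and a temporary file stand in for `normalized_reviews` and `../data/sentiment_scores.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in for normalized_reviews; the real object is a pandas DataFrame.
records = [{"Movie": "Ghost Ship", "word_sentiment": "positive", "sentiment_score": -0.0}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_scores.pkl")
    # write the object, then read it straight back
    with open(path, "wb") as out_file:
        pickle.dump(records, out_file)
    with open(path, "rb") as in_file:
        restored = pickle.load(in_file)

print(restored == records)  # True
```

For the actual DataFrame, `restored.equals(normalized_reviews)` would be the equivalent comparison.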