# Implementing a Sentiment Analyzer


Sentiment analysis is the process of determining the sentiment of a given piece of text and is also referred to as Opinion Mining. It is one of the most popular applications of Natural Language Processing and it's mostly used in social media and customer reviews data. 

In this Notebook we are using sentiment analysis to determine whether a movie review is positive or negative. 

## 1. Import and download the data

We will use the NLTK's movie_reviews corpus as our labeled training data. The movie_reviews corpus contains 2K movie reviews with sentiment classification. We're going to use the Naive Bayes classifier. This is a pretty popular classifier used in text classification, sentiment analysis, spam filtering, ...

In [1]:
from nltk.corpus import movie_reviews 
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
from nltk.tokenize import word_tokenize

import nltk

You can download the data as follows. We will also download the English stopwords for later use.

In [2]:
nltk.download('movie_reviews')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/yori/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /home/yori/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Explore the data

We have two categories for classification: positive and negative. The movie_reviews corpus has already categorized the reviews  as positive or negative. As you can see there are 1,000 positive reviews and 1,000 negative ones.

In [3]:
# review categories
print (movie_reviews.categories())

# total reviews
print (len(movie_reviews.fileids()))
 
# total positive reviews
print (len(movie_reviews.fileids('pos')))
 
# total negative reviews
print (len(movie_reviews.fileids('neg')))
 
# print the name of the first positive review file
positive_review_file = movie_reviews.fileids('neg')[0] 
print (positive_review_file)

['neg', 'pos']
2000
1000
1000
neg/cv000_29416.txt


We can also print the content of a file. We can obtain all words in a review with the words(review_file)-method. Using the words()-method without any parameter, would return the words in all movie reviews.

In [4]:
for word in movie_reviews.words('neg/cv000_29416.txt'):
    print(word, end = ' ') # use a space instead of a linefeed after each word

plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn ' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it ' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea

## 3. Create a list of documents

First, execute the code below. We iterate over the two categories ('neg' and 'pos') and take all of the file IDs (each review has its own review file). Then we'll store the word_tokenized version (a list of words) of the file ID, followed by the positive or negative label, in one big list.

In [5]:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

Just so you can see the outcome of the code above, we print out documents[0]: the first element is a list of words (from the first file), and the 2nd element is the "pos" or "neg" label (look at the end of the output).

In [6]:
print(documents[0])

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'b

Next, we use random to shuffle our documents. This is because we're going to train and test. If we left them in order, we'd train on all the negatives, some positives, and then test only against positives. We don't want that, so we shuffle the data.

In [7]:
import random

random.shuffle(documents)

## 4. Collect the top 3,000 words

Now, we want to collect all words that we found, so we have a massive list of typical words.

In [8]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

From here, we can perform a frequency distribution, to find out the most common words. As you will see, the most popular "words" are actually things like punctuation, "the," "a" and so on.

In [9]:
word_frequency = nltk.FreqDist(all_words)
print(word_frequency.most_common(100))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595), (')', 11781), ('(', 11664), ('as', 11378), ('with', 10792), ('for', 9961), ('his', 9587), ('this', 9578), ('film', 9517), ('i', 8889), ('he', 8864), ('but', 8634), ('on', 7385), ('are', 6949), ('t', 6410), ('by', 6261), ('be', 6174), ('one', 5852), ('movie', 5771), ('an', 5744), ('who', 5692), ('not', 5577), ('you', 5316), ('from', 4999), ('at', 4986), ('was', 4940), ('have', 4901), ('they', 4825), ('has', 4719), ('her', 4522), ('all', 4373), ('?', 3771), ('there', 3770), ('like', 3690), ('so', 3683), ('out', 3637), ('about', 3523), ('up', 3405), ('more', 3347), ('what', 3322), ('when', 3258), ('which', 3161), ('or', 3148), ('she', 3141), ('their', 3122), (':', 3042), ('some', 2985), ('just', 2905), ('can', 2882), ('if', 2799), ('we', 2775), ('him', 2633), ('into', 2

Let's try to clean up things a little bit.

In [10]:
from nltk.corpus import stopwords 
import string 
 
stopwords_english = stopwords.words('english')

words_clean = []
 
for word in all_words:
    word = word.lower()
    if word not in stopwords_english and word not in string.punctuation:
        words_clean.append(word)

You can find out how many occurrences a word has.

In [11]:
word_frequency = nltk.FreqDist(words_clean)
print(word_frequency.most_common(100))
print(word_frequency["stupid"])

[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049), ('character', 2020), ('also', 1967), ('get', 1949), ('two', 1911), ('well', 1906), ('characters', 1859), ('first', 1836), ('--', 1815), ('see', 1749), ('way', 1693), ('make', 1642), ('life', 1586), ('really', 1558), ('films', 1536), ('plot', 1513), ('little', 1501), ('people', 1455), ('could', 1427), ('scene', 1397), ('man', 1396), ('bad', 1395), ('never', 1374), ('best', 1333), ('new', 1292), ('scenes', 1274), ('many', 1268), ('director', 1237), ('know', 1217), ('movies', 1206), ('action', 1172), ('great', 1148), ('another', 1121), ('love', 1119), ('go', 1113), ('made', 1084), ('us', 1073), ('big', 1064), ('end', 1062), ('something', 1061), ('back', 1060), ('still', 1047), ('world', 1037), ('seems', 1033), ('work', 1020), ('makes', 992), ('however', 989), ('every', 947), ('though', 940), ('better', 922), ('real', 915), ('aud

This looks much better. Let's make a new variable, top_words, which contains the top 3,000 most common words.

In [12]:
top_words = list(word_frequency.keys())[:3000]
print(top_words)



## 5. Create the featureset

We're going to build a function that will find these top 3,000 words in our positive and negative documents, marking their presence either positive or negative.

In [13]:
def find_top_words(words):
    wordset = set(words)
    result = {}
    for w in top_words:
        result[w] = (w in wordset) # true if top_word is occurring in the wordset

    return result

Next we can create an object with all the top 3,000 words and an indication whether the word is present in the review.

In [14]:
print((find_top_words(movie_reviews.words('neg/cv000_29416.txt'))))



Finally we can do this for all of our documents, saving the word existence booleans and their respective positive or negative categories (have a look at the end of the output).

In [15]:
featuresets = []
for (words, category) in documents:
    featuresets.append((find_top_words(words), category))

In [16]:
print(featuresets[0])



As you can see, we are using the top 3,000 words as input features for our classifier (the value of the feature is a boolean indicating whether the word exists in the document). The output feature or label is "pos" or "neg" indicating whether the review is positive of negative.

## 6. Train the classifier 

Now it is time to choose an algorithm, separate our data into training and testing sets, and press go! The algorithm that we're going to use is the Naive Bayes classifier. This is a pretty popular algorithm used in text classification, sentiment analysis, spam filtering, ...

In [17]:
# training set that we'll train our classifier with
training_set = featuresets[:1900]

# testing set that we'll test against.
testing_set = featuresets[1900:]

Train a Naive Bayes classifier using the training data and compute the accuracy using the inbuilt method available in NLTK.

In [18]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
print('\nAccuracy of the classifier:', nltk_accuracy(classifier, testing_set))


Accuracy of the classifier: 0.85


We can take it a step further to see what the most valuable words are when it comes to positive or negative reviews.

In [19]:
classifier.show_most_informative_features(15)

Most Informative Features
                   sucks = True              neg : pos    =     10.4 : 1.0
                  annual = True              pos : neg    =      9.8 : 1.0
                 frances = True              pos : neg    =      9.2 : 1.0
           unimaginative = True              neg : pos    =      7.5 : 1.0
                  crappy = True              neg : pos    =      6.9 : 1.0
                    mena = True              neg : pos    =      6.9 : 1.0
                  shoddy = True              neg : pos    =      6.9 : 1.0
             silverstone = True              neg : pos    =      6.9 : 1.0
                  suvari = True              neg : pos    =      6.9 : 1.0
               atrocious = True              neg : pos    =      6.9 : 1.0
                 idiotic = True              neg : pos    =      6.9 : 1.0
              schumacher = True              neg : pos    =      6.9 : 1.0
                  regard = True              pos : neg    =      6.7 : 1.0

We can see that the term "sucks" appears 10.7 more times as often in negative reviews as it does in positive reviews. You might get another value since we randomly shuffled our documents before splitting the train and test data.
 
We also can print the most informative features in a list.

In [20]:
N = 15
print('\nTop ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features(N)):
    print(str(i+1) + '. ' + item[0])


Top 15 most informative words:
1. sucks
2. annual
3. frances
4. unimaginative
5. crappy
6. mena
7. shoddy
8. silverstone
9. suvari
10. atrocious
11. idiotic
12. schumacher
13. regard
14. cunning
15. jumbo


## 7. Test the classifier with custom reviews

We provide custom review texts and check the classification output of the trained classifier. Will the classifier correctly predict both negative and positive reviews provided?

In [21]:
# test input movie reviews
input_reviews = [
    'The costumes in this movie were great.',
    'I think the story was terrible and the characters were very weak.',
    'People say that the director of the movie is amazing. It was a wonderful movie.',
    'This is such an idiotic movie. I will not recommend it to anyone.',
    'It doesn\'t matter how much you enjoy kung-fu and karate films: with 47 Ronin, you\'re better off saving your money, your popcorn, and time.',
    'Majili is the rare movie that succeeds fully on almost every level, where each character, scene, costume, and joke firing on all cylinders to make this great film worth repeated viewings.',
    'Despite a compelling lead performance by Tom Hanks, Forrest Gump never gets out of the shadow of its weak plot and questionable crappy premise.'
]

In [22]:
def cleanup_words(text):
    words = word_tokenize(text)
    
    words_clean = []
    for word in words:
        word = word.lower()
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)
    
    words_dictionary = dict([word, True] for word in words_clean)  
    return words_dictionary

print("\nMovie review predictions:")
for review in input_reviews:
    print("\nReview:", review)

    # Compute the probabilities
    probabilities = classifier.prob_classify(cleanup_words(review))

    # Pick the maximum value
    predicted_sentiment = probabilities.max()

    # Print outputs
    print("Predicted sentiment:", predicted_sentiment)
    print("Probability:", round(probabilities.prob(predicted_sentiment), 2))


Movie review predictions:

Review: The costumes in this movie were great.
Predicted sentiment: pos
Probability: 0.63

Review: I think the story was terrible and the characters were very weak.
Predicted sentiment: neg
Probability: 0.84

Review: People say that the director of the movie is amazing. It was a wonderful movie.
Predicted sentiment: pos
Probability: 0.71

Review: This is such an idiotic movie. I will not recommend it to anyone.
Predicted sentiment: neg
Probability: 0.91

Review: It doesn't matter how much you enjoy kung-fu and karate films: with 47 Ronin, you're better off saving your money, your popcorn, and time.
Predicted sentiment: neg
Probability: 0.57

Review: Majili is the rare movie that succeeds fully on almost every level, where each character, scene, costume, and joke firing on all cylinders to make this great film worth repeated viewings.
Predicted sentiment: pos
Probability: 0.65

Review: Despite a compelling lead performance by Tom Hanks, Forrest Gump never get