# Movie Reviews Project
In this project, we will build a small classifier.

Given some training movie reviews and a label with them that tells whether the review is positive or not, our job is to classify test movie reviews into positive and negative classes.

In [1]:
# Importing Dataset
from nltk.corpus import movie_reviews

The above command has downloaded a **lot of files**. These files contains the reviews. It has stored the files into 2 different folders. One folder is for **negative reviews** and the other folder is for **positive reviews**

In [2]:
movie_reviews.categories()  # We can see that the data is of 2 classes

['neg', 'pos']

In [3]:
movie_reviews.fileids()  # This function retuns the FileIDs of those files that were just downloaded

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt',
 'neg/cv010_29063.txt',
 'neg/cv011_13044.txt',
 'neg/cv012_29411.txt',
 'neg/cv013_10494.txt',
 'neg/cv014_15600.txt',
 'neg/cv015_29356.txt',
 'neg/cv016_4348.txt',
 'neg/cv017_23487.txt',
 'neg/cv018_21672.txt',
 'neg/cv019_16117.txt',
 'neg/cv020_9234.txt',
 'neg/cv021_17313.txt',
 'neg/cv022_14227.txt',
 'neg/cv023_13847.txt',
 'neg/cv024_7033.txt',
 'neg/cv025_29825.txt',
 'neg/cv026_29229.txt',
 'neg/cv027_26270.txt',
 'neg/cv028_26964.txt',
 'neg/cv029_19943.txt',
 'neg/cv030_22893.txt',
 'neg/cv031_19540.txt',
 'neg/cv032_23718.txt',
 'neg/cv033_25680.txt',
 'neg/cv034_29446.txt',
 'neg/cv035_3343.txt',
 'neg/cv036_18385.txt',
 'neg/cv037_19798.txt',
 'neg/cv038_9781.txt',
 'neg/cv039_5963.txt',
 'neg/cv040_8829.txt',
 'neg/cv041_22364.txt',


A total of **2000 files** were downloaded. **1000 files** were downloaded for **Negative Reviews** and **1000 files** were downloaded for **Positive Reviews**

To access words from individual documents

In [4]:
movie_reviews.words(movie_reviews.fileids()[5])  # This returns the words of document whose file ID index is 5

['capsule', ':', 'in', '2176', 'on', 'the', 'planet', ...]

## Movie Reviews Cleaning
Let's create an array of size 2000, each element of which will be a tuple of (words, category)

In [5]:
documents = []
for category in movie_reviews.categories() :
    for fileid in movie_reviews.fileids(category) :
        documents.append((movie_reviews.words(fileid), category))
        
documents[0:5]

[(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg'),
 (['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...], 'neg'),
 (['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...], 'neg'),
 (['"', 'quest', 'for', 'camelot', '"', 'is', 'warner', ...], 'neg'),
 (['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...], 'neg')]

Let's shuffle the list so that **positive** and **negative** reviews come at random

In [6]:
import random
random.shuffle(documents)
documents[0:5]

[(['ultra', 'low', 'budget', 'but', 'extremely', ...], 'pos'),
 (['if', 'anyone', 'had', 'been', 'able', 'in', '1983', ...], 'neg'),
 (['"', 'the', '44', 'caliber', 'killer', 'has', ...], 'neg'),
 (['call', 'me', 'crazy', ',', 'but', 'i', 'don', "'", ...], 'pos'),
 (['this', 'talky', ',', 'terribly', '-', 'plotted', ...], 'neg')]

Let's start with **cleaning**. We will clean each of the documents
<ul>
    <li>Removing Stop Words</li>
    <li>Lemmetization</li>
</ul>

In [7]:
# Making a list of stop words and punctuations
from nltk.corpus import stopwords
import string
stop_words = set(stopwords.words('english'))   # get a list of stop words in english
punctuations = list(string.punctuation)  # get a list of all punctuations
stop_words.update(punctuations)   # add punctuations in the set of stop words

In [8]:
from nltk import pos_tag     # Import pos_tag
from nltk.corpus import wordnet

# convert code from pos_tag to code that lemmatizer can understand
def get_simple_pos(tag) :
    if tag.startswith('J') :
        return wordnet.ADJ
    elif tag.startswith('V') :
        return wordnet.VERB
    elif tag.startswith('N') :
        return wordnet.NOUN
    elif tag.startswith('R') :
        return wordnet.ADV
    else :
        return wordnet.NOUN

In [9]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [10]:
def clean_review(words) :
    cleaned_words = []  # cleaned words
    for w in words :
        if w.lower() not in stop_words :
            pos = pos_tag([w])[0][1]    # pos_tag expects a list of words that it will give tags to. Thus we 
            # convert w to a list [w]. Now, pos_tag returns a list of tuples of the form (word, tag). Now we need
            # the 0th tuple from that list and the 1st entry, i.e. tag from the tuple
            
            clean_word = lemmatizer.lemmatize(w, pos = get_simple_pos(pos))  # lemmatized the word
            
            cleaned_words.append(clean_word.lower())
    return cleaned_words

In [11]:
# Testing if the function works
words = ['Jaikirat', 'is', 'good']
clean_review(words)

['jaikirat', 'good']

As we can see, the above function works

Let's go through all the documents and clean each and every one of them

In [12]:
import time
initial_time = time.time()
documents = [(clean_review(document), category) for document, category in documents]
final_time = time.time()

print('Time taken to clean documents : ', (final_time-initial_time)*1000, 'ms')

Time taken to clean documents :  256770.43628692627 ms


In [13]:
documents[0]

(['ultra',
  'low',
  'budget',
  'extremely',
  'inventive',
  'horror',
  'film',
  'group',
  'friend',
  'vacation',
  'cabin',
  'accidentally',
  'awaken',
  'evil',
  'force',
  'wood',
  'via',
  'necronomicon',
  'book',
  'dead',
  'bruce',
  'campbell',
  'star',
  'ash',
  'eventually',
  'becomes',
  'sole',
  'survivor',
  'battle',
  'demon',
  'wood',
  'friend',
  'become',
  'demon',
  'include',
  'girlfriend',
  'result',
  'show',
  'screen',
  'amaze',
  'consider',
  'film',
  'tiny',
  'budget',
  'constant',
  'location',
  'change',
  'film',
  'schedule',
  'sporadic',
  'two',
  'year',
  'follow',
  'two',
  'sequel',
  'evil',
  'dead',
  'ii',
  '1987',
  'army',
  'darkness',
  '1993',
  'dvd',
  'release',
  'film',
  'elite',
  'entertainment',
  'contains',
  'film',
  'original',
  'aspect',
  'ratio',
  '1',
  '33',
  '1',
  'disc',
  'contains',
  'two',
  'version',
  'film',
  'audio',
  'stereo',
  '2',
  '0',
  'track',
  'newly',
  'remastered

## Building the Feature Set
Once cleaning is done, the next step is to get the top words that make features.

Before finding top words, we need to split our data into training and test set

In [14]:
training_documents = documents[0:1500]
testing_documents =  documents[1500:0]

Let's create a huge array that will contain all the words from all the documents

In [15]:
all_words = []
for words, category in training_documents :
    all_words += words

Let's find the top most common 3000 words as features

In [16]:
import nltk
frequency = nltk.FreqDist(all_words)
top_features = [i[0] for i in frequency.most_common(3000)]

In [17]:
top_features

['film',
 'movie',
 'one',
 'make',
 'like',
 'character',
 'get',
 'see',
 'go',
 'time',
 'well',
 'scene',
 'even',
 'good',
 'story',
 'take',
 'much',
 'would',
 'come',
 'bad',
 'look',
 'give',
 'life',
 'two',
 'also',
 'way',
 'seem',
 'first',
 '--',
 'know',
 'end',
 'year',
 'work',
 'thing',
 'plot',
 'play',
 'really',
 'say',
 'show',
 'little',
 'could',
 'people',
 'man',
 'star',
 'love',
 'never',
 'try',
 'great',
 'best',
 'director',
 'performance',
 'big',
 'many',
 'new',
 'action',
 'u',
 'want',
 'actor',
 'watch',
 'find',
 'another',
 'think',
 'role',
 'act',
 'audience',
 'world',
 'something',
 'back',
 'turn',
 'still',
 'old',
 'day',
 'set',
 'use',
 'however',
 'though',
 'feel',
 'real',
 'every',
 'comedy',
 'guy',
 'part',
 'point',
 'around',
 'cast',
 'run',
 'enough',
 'last',
 'interest',
 'long',
 'name',
 'young',
 'begin',
 'write',
 'almost',
 'right',
 'funny',
 'may',
 'fact',
 'nothing',
 'john',
 'lot',
 'woman',
 'script',
 'minute',
 

Now we need to format create feature matrix X. This X will be the input for NLTK Classifiers

NLTK classifiers take data points (documents) in the following format :

Each document should be in a form of dictionary where key is the Features that we extracted in the last cell. And the value is whether the feature_word appears in that document. The output for each document would be its category, i.e. Negative or Positive

In [18]:
# Returns the dictionary that NLTK classifiers need
def get_feature_dict(words) :    # Input words of the document
    features = {}
    words_set = set(words)
    for w in top_features :
        features[w] = w in words_set    # Set value of feature as True if that word exists in document, else false
    return features

**Training/Testing** Data needs to of the form of **array of tuples of form (dictionary, category)**<br>

That is, array where each element is a tuple for each document<br>
Each document's tuple is of the form (dictionary, category) where **category** is the category of the document and the **dictionary** is as described above

In [19]:
training_data = [(get_feature_dict(doc), category) for doc, category in training_documents]
testing_data = [(get_feature_dict(doc), category) for doc, category in testing_documents]

## Classification using NLTK Naive Bayes
Now we have formatted our data as NLTK classifiers need. The next step is to simply use built-in NLTK Naive Bayes Classifier to classify test documents

In [20]:
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(training_data)   # Code is a little different to fit

In [21]:
# Lets find the accuracy of our classifier
nltk.classify.accuracy(classifier, testing_data)

0