We use the `movie_reviews` corpus from `nltk.corpus` as the data set.



> From it, we separate the positive reviews into `pos_reviews` and the
negative reviews into `neg_reviews`.



---







In [0]:
from nltk.corpus import movie_reviews 
import nltk

nltk.download('movie_reviews') # downloading the data set

pos_reviews = [] 
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)
 
neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)
 
# print first positive review item from the pos_reviews list
print (pos_reviews[0])
'''
Output:
 
['films', 'adapted', 'from', 'comic', 'books', ...]
'''
 
# print first negative review item from the neg_reviews list
print (neg_reviews[0])
'''
Output:
 
['plot', ':', 'two', 'teen', 'couples', 'go', ...]
'''
 
# print first 20 items of the first item of positive review
print (pos_reviews[0][:20])
'''
Output:
 
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']
'''
 
# print first 20 items of the first item of negative review
print (neg_reviews[0][:20])
'''
Output:
 
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']
'''


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']


First, we clean the data by removing stop words and punctuation.

> **Stop words** are frequently occurring words that do not carry any significant meaning in text analysis. For example, I, me, my, the, a, and, is, are, he, she, we, etc.
**Punctuation marks** such as the comma, full stop, and inverted comma occur very frequently in any text data.
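The cleaning step described above can be sketched without downloading any corpus. This is a minimal illustration, not the notebook's own function: `STOP_WORDS` is a deliberately tiny, hypothetical list built from the examples in the text, and `clean_tokens` is a name introduced only for this sketch.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English stop-word list.
STOP_WORDS = {"i", "me", "my", "the", "a", "and", "is", "are", "he", "she", "we"}
PUNCTUATION = set(string.punctuation)  # single-character punctuation marks


def clean_tokens(tokens):
    """Lowercase tokens and drop stop words and punctuation marks."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS or token in PUNCTUATION:
            continue
        cleaned.append(token)
    return cleaned


sample = ["The", "plot", "is", "thin", ",", "and", "the", "acting", "is", "flat", "."]
print(clean_tokens(sample))  # ['plot', 'thin', 'acting', 'flat']
```

Only the content-bearing words survive, which is what the classifier is later trained on.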

# Create Feature Set

<ul>
  <li>Now, we write a function that creates the feature set. The feature set is used to train the classifier.</li>
  <li>We use the bag-of-words feature and tag each review with its category, positive or negative.</li>
</ul>


```
Read bag of words feature in Readme.md
```


---



In [0]:
from nltk.corpus import stopwords 
import string 
nltk.download('stopwords')
stopwords_english = stopwords.words('english')
 
# feature extractor function
def bag_of_words(words):
    words_clean = []
 
    for word in words:
        word = word.lower()
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)
    
    words_dictionary = dict([word, True] for word in words_clean)
    
    return words_dictionary
 
# using dict will remove duplicate words from the words list
# note the output: stopword 'the' is also removed
print (bag_of_words(['the', 'the', 'good', 'bad', 'the', 'good']))
'''
Output:
 
{'good': True, 'bad': True}
'''

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'good': True, 'bad': True}


Here, we build `pos_reviews_set` and `neg_reviews_set`, which are different from the `pos_reviews` and `neg_reviews` lists defined above.
Each entry of `pos_reviews_set` and `neg_reviews_set` pairs a review's word-feature dictionary with its label; these lists are used to create the **train set** and **test set**.

> **Train Set**: used to train the classifier.

> **Test Set**: used to test the classifier and check how accurately it classifies the given text.


---



In [0]:
# positive reviews feature set
pos_reviews_set = []
for words in pos_reviews:
    pos_reviews_set.append((bag_of_words(words), 'pos'))
 
# negative reviews feature set
neg_reviews_set = []
for words in neg_reviews:
    neg_reviews_set.append((bag_of_words(words), 'neg'))
 
# print first positive review item from the pos_reviews_set list
print (pos_reviews_set[0])
'''
Output:
 
({'childs': True, 'steve': True, 'surgical': True, 'go': True, 'certainly': True, 'watchmen': True, 'song': True, 'simpsons': True, 'novel': True, ........................................................................
........................................................ 'menace': True, 'starting': True, 'original': True}, 'pos')
'''

# print first negative review item from the neg_reviews_set list
print (neg_reviews_set[0])
'''
Output:
 
({'concept': True, 'skip': True, 'insight': True, 'playing': True, 'executed': True, 'go': True, 'still': True, 'find': True, 'seemed': True, .............................................................................................
................................................. 'entertaining': True, 'years': True, 'away': True, 'came': True}, 'neg')
'''

({'films': True, 'adapted': True, 'comic': True, 'books': True, 'plenty': True, 'success': True, 'whether': True, 'superheroes': True, 'batman': True, 'superman': True, 'spawn': True, 'geared': True, 'toward': True, 'kids': True, 'casper': True, 'arthouse': True, 'crowd': True, 'ghost': True, 'world': True, 'never': True, 'really': True, 'book': True, 'like': True, 'hell': True, 'starters': True, 'created': True, 'alan': True, 'moore': True, 'eddie': True, 'campbell': True, 'brought': True, 'medium': True, 'whole': True, 'new': True, 'level': True, 'mid': True, '80s': True, '12': True, 'part': True, 'series': True, 'called': True, 'watchmen': True, 'say': True, 'thoroughly': True, 'researched': True, 'subject': True, 'jack': True, 'ripper': True, 'would': True, 'saying': True, 'michael': True, 'jackson': True, 'starting': True, 'look': True, 'little': True, 'odd': True, 'graphic': True, 'novel': True, '500': True, 'pages': True, 'long': True, 'includes': True, 'nearly': True, '30': Tru

<h3>Next, we create the train set and test set as the `train_set` and `test_set` lists respectively.</h3>

# Create Train and Test Set


> There are 1000 positive and 1000 negative review feature sets. We take 20% (i.e. 200) of the positive reviews and 20% (i.e. 200) of the negative reviews as the test set. The remaining positive and negative reviews form the training set.
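The 80/20 split described above can also be made repeatable. The sketch below is one possible way to do it, under assumptions that are not part of the notebook: the helper name `split_train_test`, the `seed` parameter, and the placeholder data are all introduced here for illustration. It uses a private `random.Random` instance so that the same seed always yields the same split.

```python
import random


def split_train_test(pos_items, neg_items, test_fraction=0.2, seed=42):
    """Shuffle each class separately and hold out the same fraction of both."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    pos, neg = list(pos_items), list(neg_items)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos_test = round(len(pos) * test_fraction)
    n_neg_test = round(len(neg) * test_fraction)
    test = pos[:n_pos_test] + neg[:n_neg_test]
    train = pos[n_pos_test:] + neg[n_neg_test:]
    return train, test


# Stand-in data: 1000 labelled placeholders per class instead of real feature sets.
pos_demo = [({"id": i}, "pos") for i in range(1000)]
neg_demo = [({"id": i}, "neg") for i in range(1000)]
train_demo, test_demo = split_train_test(pos_demo, neg_demo)
print(len(test_demo), len(train_demo))  # 400 1600
```

Fixing the seed trades the run-to-run variation for reproducible accuracy figures; passing a different seed gives a different but equally balanced split.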


---


In [0]:
print (len(pos_reviews_set), len(neg_reviews_set)) # Output: (1000, 1000)
 
# randomize pos_reviews_set and neg_reviews_set
# doing so gives a different accuracy result every time we run the program
from random import shuffle 
shuffle(pos_reviews_set)
shuffle(neg_reviews_set)
 
test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
train_set = pos_reviews_set[200:] + neg_reviews_set[200:]
X_count_vect =  pos_reviews_set + neg_reviews_set
print(len(test_set),  len(train_set)) # Output: (400, 1600)

1000 1000
400 1600


# Testing the Trained Classifier

Let’s see the accuracy of the trained classifier. The accuracy value changes each time you run the program because the review feature sets are shuffled above.

> https://en.wikipedia.org/wiki/Naive_Bayes_classifier


---

# Training Classifier and Calculating Accuracy

We train a Naive Bayes classifier using the training set and calculate the classification accuracy of the trained classifier using the test set.
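Classification accuracy is the fraction of test examples whose predicted label matches the true label. The sketch below shows that calculation on its own; `toy_predict` is a hypothetical stand-in for a trained classifier and `toy_test_set` is made-up data, so the number it prints says nothing about the movie review model.

```python
def accuracy(predict, labelled_examples):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(
        1 for features, label in labelled_examples if predict(features) == label
    )
    return correct / len(labelled_examples)


# Hypothetical stand-in for a trained classifier: predicts 'pos' when the
# feature dictionary contains the word 'good', otherwise 'neg'.
def toy_predict(features):
    return "pos" if "good" in features else "neg"


toy_test_set = [
    ({"good": True, "plot": True}, "pos"),
    ({"bad": True, "acting": True}, "neg"),
    ({"good": True, "idea": True, "wasted": True}, "neg"),  # predicted 'pos'
    ({"brilliant": True}, "pos"),                            # predicted 'neg'
]
print(accuracy(toy_predict, toy_test_set))  # 0.5
```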



---




In [0]:
from nltk import classify
from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)
 
accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # Output: 0.7325
 
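# show_most_informative_features() prints the table itself and returns None,
# so wrapping it in print() also prints a trailing "None"
print (classifier.show_most_informative_features(10))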
'''
Output:
 
Most Informative Features
            breathtaking = True              pos : neg    =     20.3 : 1.0
                dazzling = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     12.2 : 1.0
             outstanding = True              pos : neg    =     10.6 : 1.0
                 insipid = True              neg : pos    =     10.3 : 1.0
               stretched = True              neg : pos    =     10.3 : 1.0
               stupidity = True              neg : pos    =     10.2 : 1.0
                  annual = True              pos : neg    =      9.7 : 1.0
                headache = True              neg : pos    =      9.7 : 1.0
                  avoids = True              pos : neg    =      9.7 : 1.0
'''

0.72
Most Informative Features
                captures = True              pos : neg    =     17.0 : 1.0
               insulting = True              neg : pos    =     16.3 : 1.0
              astounding = True              pos : neg    =     11.7 : 1.0
                   stark = True              pos : neg    =     11.7 : 1.0
                    slip = True              pos : neg    =     11.0 : 1.0
                   kudos = True              pos : neg    =     10.3 : 1.0
              unbearable = True              neg : pos    =     10.3 : 1.0
             wonderfully = True              pos : neg    =      9.9 : 1.0
             outstanding = True              pos : neg    =      9.7 : 1.0
                  avoids = True              pos : neg    =      9.7 : 1.0
None



# Observations
Let’s look at the most informative features in the feature set.

In the run shown above, the word captures appears in positive reviews 17.0 times more often than in negative reviews, and the word insulting appears in negative reviews 16.3 times more often than in positive reviews. The other words are read the same way. These ratios are also called likelihood ratios, and their exact values change from run to run because the data is shuffled.

Therefore, a review has a high chance of being classified as positive if it contains words like outstanding and wonderfully. Similarly, a review has a high chance of being classified as negative if it contains words like insulting and unbearable.
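To make the idea of a likelihood ratio concrete, the sketch below computes one by hand on made-up data. It is an illustration under stated assumptions, not NLTK's implementation: the function name `likelihood_ratio`, the smoothing value of 0.5, and `toy_train_set` are all introduced here.

```python
def likelihood_ratio(word, labelled_examples, smoothing=0.5):
    """Ratio of the smoothed P(word | pos) to the smoothed P(word | neg)."""
    totals = {"pos": 0, "neg": 0}
    hits = {"pos": 0, "neg": 0}
    for features, label in labelled_examples:
        totals[label] += 1
        if word in features:
            hits[label] += 1
    # Smoothing keeps the ratio finite when a word never occurs under one label.
    # The value 0.5 and the two-outcome denominator are assumptions of this sketch.
    p_pos = (hits["pos"] + smoothing) / (totals["pos"] + 2 * smoothing)
    p_neg = (hits["neg"] + smoothing) / (totals["neg"] + 2 * smoothing)
    return p_pos / p_neg


toy_train_set = [
    ({"outstanding": True, "cast": True}, "pos"),
    ({"outstanding": True}, "pos"),
    ({"outstanding": True, "score": True}, "pos"),
    ({"fine": True}, "pos"),
    ({"outstanding": True, "boredom": True}, "neg"),
    ({"dull": True}, "neg"),
    ({"dull": True, "plot": True}, "neg"),
    ({"flat": True}, "neg"),
]
ratio = likelihood_ratio("outstanding", toy_train_set)
print(round(ratio, 2))  # 2.33
```

A ratio above 1 means the word is more typical of positive reviews, and a ratio below 1 means it is more typical of negative ones.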


---



# Testing Classifier with Custom Review

We provide custom review text and check the classification output of the trained classifier. The classifier correctly predicts both negative and positive reviews provided.


---



In [0]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models for word_tokenize; recent NLTK releases may require 'punkt_tab' instead
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_words(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: neg
# Negative review correctly classified as negative
 
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: neg
print (prob_result.prob("neg")) # Output: 0.776128854994
print (prob_result.prob("pos")) # Output: 0.223871145006
 
custom_review = "It was a wonderful and amazing movie. I loved it. Best direction, good acting."
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_words(custom_review_tokens)
 
print (classifier.classify(custom_review_set)) # Output: pos
# Positive review correctly classified as positive
 
# probability result
prob_result = classifier.prob_classify(custom_review_set)
print (prob_result) # Output: <ProbDist with 2 samples>
print (prob_result.max()) # Output: pos
print (prob_result.prob("neg")) # Output: 0.0972171562901
print (prob_result.prob("pos")) # Output: 0.90278284371

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
neg
<ProbDist with 2 samples>
neg
0.8647638871672272
0.13523611283277223
pos
<ProbDist with 2 samples>
pos
0.09619195143993352
0.9038080485600672
