In this project I use the movie_reviews corpus from nltk.corpus, clean it with standard NLP preprocessing (stop-word removal and lemmatization),
and then train and test a Naive Bayes classifier on it.

In [1]:
import nltk
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [2]:
len(movie_reviews.fileids())

2000

In [3]:
len(movie_reviews.fileids('pos'))

1000

The dataset contains 2000 movie reviews, of which 1000 are positive and the rest are negative.

In [4]:
movie_reviews.fileids()[0]

'neg/cv000_29416.txt'

In [5]:
movie_reviews.words(movie_reviews.fileids())

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

The words are already tokenized, so let's build a list of documents in which each entry holds a review's words
together with its label (pos/neg).

In [6]:
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append([movie_reviews.words(fileid), category])

In [7]:
documents

[[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg'],
 [['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', ...], 'neg'],
 [['it', 'is', 'movies', 'like', 'these', 'that', 'make', ...], 'neg'],
 [['"', 'quest', 'for', 'camelot', '"', 'is', 'warner', ...], 'neg'],
 [['synopsis', ':', 'a', 'mentally', 'unstable', 'man', ...], 'neg'],
 [['capsule', ':', 'in', '2176', 'on', 'the', 'planet', ...], 'neg'],
 [['so', 'ask', 'yourself', 'what', '"', '8mm', '"', '(', ...], 'neg'],
 [['that', "'", 's', 'exactly', 'how', 'long', 'the', ...], 'neg'],
 [['call', 'it', 'a', 'road', 'trip', 'for', 'the', ...], 'neg'],
 [['plot', ':', 'a', 'young', 'french', 'boy', 'sees', ...], 'neg'],
 [['best', 'remembered', 'for', 'his', 'understated', ...], 'neg'],
 [['janeane', 'garofalo', 'in', 'a', 'romantic', ...], 'neg'],
 [['and', 'now', 'the', 'high', '-', 'flying', 'hong', ...], 'neg'],
 [['a', 'movie', 'like', 'mortal', 'kombat', ':', ...], 'neg'],
 [['she', 'was', 'the', 'femme', 'in', 

Now that our documents are created, we will remove stop words and lemmatize the remaining words so that different forms of the same word are treated as one.
The lemmatizer takes a POS (part-of-speech) tag as an argument, so we will use pos_tag from the nltk library to find it.
The catch is that pos_tag returns detailed Penn Treebank tags, whereas the lemmatizer expects the simple tags defined in wordnet.
So first we will create a function that converts the detailed tags into simple ones.

As the list below shows, adjective tags start with J, verb tags with V, noun tags with N, and adverb tags with R. For every other tag we will return the noun tag.

POS tag list:

Abbreviation  Meaning
CC            coordinating conjunction
CD            cardinal number
DT            determiner
EX            existential there
FW            foreign word
IN            preposition/subordinating conjunction
JJ            adjective (large)
JJR           adjective, comparative (larger)
JJS           adjective, superlative (largest)
LS            list marker
MD            modal (could, will)
NN            noun, singular (cat, tree)
NNS           noun, plural (desks)
NNP           proper noun, singular (sarah)
NNPS          proper noun, plural (indians or americans)
PDT           predeterminer (all, both, half)
POS           possessive ending (parent's)
PRP           personal pronoun (hers, herself, him, himself)
PRP$          possessive pronoun (her, his, mine, my, our)
RB            adverb (occasionally, swiftly)
RBR           adverb, comparative (better)
RBS           adverb, superlative (best)
RP            particle (about)
TO            infinitive marker (to)
UH            interjection (goodbye)
VB            verb (ask)
VBG           verb, gerund (judging)
VBD           verb, past tense (pleaded)
VBN           verb, past participle (reunified)
VBP           verb, present tense, not 3rd person singular (wrap)
VBZ           verb, present tense, 3rd person singular (bases)
WDT           wh-determiner (that, what)
WP            wh-pronoun (who)
WRB           wh-adverb (how)

In [8]:
from nltk.corpus import wordnet
def get_simple_pos(tag):  # map a detailed POS tag to the simple WordNet tag the lemmatizer expects
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
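
As a quick check of the mapping, the sketch below applies the same prefix rule to a few tags from the list above. To keep it runnable without downloading WordNet, it uses the plain strings 'a', 'v', 'n' and 'r', which are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV. The name simple_pos is only illustrative and is not part of the notebook.

```python
# Illustrative stand-ins for wordnet.ADJ / VERB / NOUN / ADV (their string values).
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

def simple_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    prefix_to_pos = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}
    return prefix_to_pos.get(tag[:1], NOUN)

for tag in ['JJR', 'VBD', 'NNS', 'RBS', 'DT']:
    print(tag, '->', simple_pos(tag))
```

The dictionary lookup is equivalent to the if/elif chain: any tag whose first letter is not J, V, N or R falls back to the noun tag.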

In [9]:
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


In [16]:
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
stops = set(stopwords.words('english') + list(string.punctuation))  # a set gives fast membership tests

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
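
One thing worth noting about this stop list: string.punctuation holds single characters only, so a multi-character token such as '--' is not filtered by a plain membership test. A minimal sketch of a stricter check follows; is_punctuation is a hypothetical helper and is not used in this notebook's pipeline.

```python
import string

punct = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(ch in punct for ch in token)

print('--' in punct)            # plain membership test, as a stop list would do
print(is_punctuation('--'))     # character-wise check
print(is_punctuation("don't"))  # mixed tokens are not treated as punctuation
```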


In [17]:
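def clean_lemmatize(words):
    """Drop stop words and punctuation, then lemmatize the remaining words."""
    output_words = []
    for w in words:
        if w.lower() not in stops:
            # Tag the word on its own (no sentence context) to pick a WordNet POS.
            pos = pos_tag([w])
            clean_word = lemmatizer.lemmatize(w, get_simple_pos(pos[0][1]))
            output_words.append(clean_word.lower())
    return output_words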

In [21]:
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
documents = [(clean_lemmatize(document), category) for document, category in documents]

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Let's shuffle our documents randomly so that we can split them into training and test data.

In [22]:
import random
random.seed(2)   # fixing the seed makes the shuffle reproducible: the same order is obtained on every run
random.shuffle(documents)  # shuffle the documents before splitting

In [23]:
training_documents = documents[0:1600]
testing_documents = documents[1600:]
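
The effect of the seed can be checked on a toy list: seeding before each shuffle yields the same order, so the train/test split is reproducible. The sketch below uses made-up data, a hypothetical helper named shuffled_split, and a split_fraction of 0.8, which matches the 1600/400 split used here. Unlike the cell above, it uses a private random.Random generator rather than the module-level seed.

```python
import random

def shuffled_split(items, seed, split_fraction=0.8):
    """Shuffle a copy of items with a fixed seed, then cut it into two parts."""
    items = list(items)            # copy so the caller's list is untouched
    rng = random.Random(seed)      # private generator; same seed -> same order
    rng.shuffle(items)
    cut = int(len(items) * split_fraction)
    return items[:cut], items[cut:]

toy = list(range(10))
train_a, test_a = shuffled_split(toy, seed=2)
train_b, test_b = shuffled_split(toy, seed=2)
print(train_a == train_b, test_a == test_b)
print(len(train_a), len(test_a))
```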

In [24]:
documents[0][0]

['cold',
 'molecule',
 'move',
 'everything',
 'clean',
 'essential',
 'word',
 'mikey',
 'carver',
 'elijah',
 'wood',
 'young',
 'teenage',
 'boy',
 'living',
 '1973',
 'new',
 'canaan',
 'connecticut',
 'ice',
 'storm',
 'mikey',
 'delivers',
 'word',
 'bore',
 'science',
 'class',
 'unlikely',
 'anyone',
 'realizes',
 'much',
 'parallel',
 'mikey',
 'life',
 'life',
 'surround',
 'father',
 'jim',
 'jamey',
 'sheridan',
 'rarely',
 'see',
 'mother',
 'janey',
 'sigourney',
 'weaver',
 'affair',
 'married',
 'neighbor',
 'ben',
 'hood',
 'kevin',
 'kline',
 'ben',
 'wife',
 'elena',
 'joan',
 'allen',
 'suspect',
 'affair',
 'say',
 'anything',
 'meanwhile',
 'ben',
 '14',
 'year',
 'old',
 'daughter',
 'wendy',
 'christina',
 'ricci',
 'continuously',
 'lure',
 'mikey',
 'young',
 'brother',
 'sandy',
 'adam',
 'hann',
 'byrd',
 'sexual',
 'exploration',
 'tobey',
 'maguire',
 'play',
 'paul',
 'hood',
 '16',
 'year',
 'old',
 'narrator',
 'story',
 'also',
 'happens',
 'least',
 '

In [25]:
all_words = []
for doc in training_documents:
    all_words += doc[0]

In [26]:
freq = nltk.FreqDist(all_words)                 # returns a frequency distribution object
len(freq)

28535

There are 28535 distinct words in total, and we cannot use all of them as features, so let's take the 3000 most frequent words as features.

In [27]:
common = freq.most_common(3000)
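
nltk.FreqDist is a subclass of collections.Counter, so the same selection can be sketched on a toy token list with the standard library alone. The tokens below are made up for illustration.

```python
from collections import Counter

tokens = ['film', 'good', 'film', 'plot', 'film', 'good']
freq = Counter(tokens)

# most_common(n) returns the n highest-count (word, count) pairs, highest first.
top_two = freq.most_common(2)
top_words = [word for word, count in top_two]
print(top_two)
print(top_words)
```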


In [28]:
common

[('film', 8881),
 ('movie', 5546),
 ('one', 4714),
 ('make', 3378),
 ('like', 3168),
 ('character', 3083),
 ('get', 2920),
 ('see', 2503),
 ('go', 2409),
 ('time', 2349),
 ('well', 2237),
 ('scene', 2088),
 ('even', 2040),
 ('good', 1960),
 ('story', 1855),
 ('take', 1789),
 ('would', 1678),
 ('much', 1650),
 ('bad', 1558),
 ('look', 1548),
 ('also', 1545),
 ('give', 1542),
 ('come', 1537),
 ('life', 1512),
 ('two', 1508),
 ('way', 1500),
 ('end', 1469),
 ('know', 1459),
 ('first', 1445),
 ('seem', 1434),
 ('--', 1419),
 ('work', 1367),
 ('year', 1366),
 ('thing', 1349),
 ('plot', 1306),
 ('say', 1245),
 ('play', 1239),
 ('really', 1236),
 ('little', 1225),
 ('show', 1217),
 ('people', 1167),
 ('could', 1125),
 ('man', 1117),
 ('star', 1101),
 ('try', 1099),
 ('never', 1094),
 ('best', 1070),
 ('love', 1061),
 ('new', 1054),
 ('director', 1054),
 ('great', 1048),
 ('performance', 1035),
 ('big', 1023),
 ('many', 1003),
 ('want', 986),
 ('action', 984),
 ('actor', 981),
 ('u', 967),
 ('

In [29]:
features = [i[0] for i in common] #choosing the top 3000 frequency words

Now that we have our features, for any document we will check whether each feature is present in its words and store the resulting True/False value in a dictionary keyed by the feature. A function for this is defined below.

In [30]:
def get_feature_dict(words): 
    current_features = {}
    words_set = set(words)
    for w in features:
        current_features[w] = w in words_set
    return current_features
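
For instance, with a toy feature list the function yields one boolean per feature, regardless of how many words the review contains. The sketch below is a self-contained variant that takes the feature list as a parameter instead of reading the global features; the names feature_dict and feature_words are illustrative.

```python
def feature_dict(words, feature_words):
    """Return {feature: True/False} marking which features occur in words."""
    words_set = set(words)  # set lookup keeps each membership test cheap
    return {w: (w in words_set) for w in feature_words}

toy_features = ['film', 'good', 'bad']
result = feature_dict(['good', 'film', 'good', 'actor'], toy_features)
print(result)
```

Words that are not in the feature list ('actor' here) are simply ignored, and repeated words count once.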

In [31]:
training_data = [(get_feature_dict(doc), category) for doc, category in training_documents] 
testing_data = [(get_feature_dict(doc), category) for doc, category in testing_documents]

In [None]:
#Classification using NLTK Naive Bayes

In [32]:
from nltk import NaiveBayesClassifier 

In [34]:
classifier = NaiveBayesClassifier.train(training_data)
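
To make the training step less of a black box, here is a minimal sketch of a Naive Bayes classifier over boolean feature dictionaries. It is not NLTK's implementation: it uses simple add-one smoothing as an assumption, and the names train_nb and classify_nb are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels and (label, feature, value) occurrences."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    for featureset, label in labeled_featuresets:
        label_counts[label] += 1
        for feature, value in featureset.items():
            value_counts[(label, feature)][value] += 1
    return label_counts, value_counts

def classify_nb(model, featureset):
    """Pick the label with the highest log posterior (up to a constant)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)              # log prior
        for feature, value in featureset.items():
            seen = value_counts[(label, feature)][value]
            # Add-one smoothing over the two possible values (True/False).
            score += math.log((seen + 1) / (n_label + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data in the same (feature dict, label) shape as training_data.
toy_train = [
    ({'great': True,  'boring': False}, 'pos'),
    ({'great': True,  'boring': False}, 'pos'),
    ({'great': False, 'boring': True},  'neg'),
    ({'great': False, 'boring': True},  'neg'),
]
model = train_nb(toy_train)
print(classify_nb(model, {'great': True, 'boring': False}))
```

The "naive" part is the inner loop: each feature contributes its own log-likelihood term, as if features were independent given the label.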

Now that we have trained our classifier, we will check its accuracy on the testing data.

In [35]:
nltk.classify.accuracy(classifier, testing_data)

0.7625
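
Accuracy here is the fraction of test documents whose predicted label matches the true one. With 400 test documents, 0.7625 corresponds to 305 correct predictions. A minimal sketch of the computation, using made-up labels and an illustrative helper named accuracy:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError('label lists must have the same length')
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted = ['pos', 'neg', 'pos', 'neg']
actual    = ['pos', 'neg', 'neg', 'neg']
print(accuracy(predicted, actual))   # 3 of 4 labels match
print(305 / 400)                     # the score above, as correct / total
```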

We can use show_most_informative_features to list the features whose presence is most lopsided between the two classes, which is what makes them useful for classification.

In [36]:
classifier.show_most_informative_features(15)

Most Informative Features
               ludicrous = True              neg : pos    =     21.6 : 1.0
             outstanding = True              pos : neg    =     11.4 : 1.0
                   damon = True              pos : neg    =      9.7 : 1.0
              schumacher = True              neg : pos    =      9.4 : 1.0
             wonderfully = True              pos : neg    =      9.0 : 1.0
                 idiotic = True              neg : pos    =      8.4 : 1.0
               stupidity = True              neg : pos    =      8.4 : 1.0
                   anger = True              pos : neg    =      8.4 : 1.0
                   mulan = True              pos : neg    =      7.9 : 1.0
                 balance = True              pos : neg    =      7.8 : 1.0
                ordinary = True              pos : neg    =      7.0 : 1.0
                  seagal = True              neg : pos    =      6.8 : 1.0
            breathtaking = True              pos : neg    =      6.5 : 1.0
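
Each row reports how much more likely a feature value is under one label than under the other. The sketch below computes such a ratio from hypothetical counts. The numbers are invented, likelihood_ratio is an illustrative helper, and the add-0.5 smoothing is an assumption chosen for the example rather than a statement about how the figures above were produced.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio P(feature=True | a) / P(feature=True | b) with additive smoothing.

    The feature is boolean, so each estimate has two possible outcomes.
    """
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b

# Hypothetical counts: a word seen in 30 of 800 'neg' reviews and 2 of 800 'pos' reviews.
ratio = likelihood_ratio(30, 800, 2, 800)
print(round(ratio, 1))
```

Rare but one-sided words produce the largest ratios, which is why names such as actors or directors can appear next to genuinely evaluative words.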