# NLP Text Classification with Naive Bayes Model

Here, the main idea behind this project is to make an classification model with the help of naive bayes to classify the movie reviews in positives and negatives. So, first, all the required libraries are imported. 

In [47]:
import nltk                            #importing the nltk library
import random                          #importing random library
from nltk.corpus import movie_reviews  #importing movie_reviews library from nltk

Then, making a list of all the words in the movie reviews with their labels like positive and negative. And then shuffling all the reviews. After that, making another list of words in lower case and then making an frequency distribution of all the words. Then, making another list of first 3000 high frequency words.

In [48]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)] #making list of all the words in movie reviews with lables

In [49]:
random.shuffle(documents) #shuffling all the reviews

In [50]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  #making a list of all the words in the movie reviews in lower case

In [51]:
all_words = nltk.FreqDist(all_words) #making a frequency distribution of all the words

In [52]:
word_features = list(all_words.keys())[:3000] #making list of the first 3000 high frequency words 

Then, making a list based on whether they are present in the first 3000 high frequency list or not and then based on that again making set with categories to make a final training and testing set for the naive bayes classifier. 

In [53]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features                  #making a list of words with the tag whether they are present in the word_features list or not

In [55]:
featuresets = [(find_features(rev), category) for (rev, category) in documents] #making a set for training and testing with categories

'featuresets = []\nfor rev in document:\n    featuresets.append(find_features(rev))\nfor category in document:\n        featuresets.append(category)'

In [56]:
training_set = featuresets[:1900]  #training set
testing_set = featuresets[1900:]   #testing set

Finally, making a classifier object with naive bayes to train and test the dataset. Also printing the first 15 most informative features.

In [59]:
classifier = nltk.NaiveBayesClassifier.train(training_set)   #making classifier object with naive bayes
print("Naive Bayes Algo Accuray percent:", (nltk.classify.accuracy(classifier, testing_set))*100) #prining the accuracy of testing set
classifier.show_most_informative_features(15) #getting first 15 most informative features

Naive Bayes Algo Accuray percent: 84.0
Most Informative Features
                   sucks = True              neg : pos    =     10.1 : 1.0
                  justin = True              neg : pos    =      9.6 : 1.0
                 frances = True              pos : neg    =      9.1 : 1.0
                 idiotic = True              neg : pos    =      8.6 : 1.0
                  annual = True              pos : neg    =      8.4 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
             silverstone = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.3 : 1.0
                  regard = True              pos : neg    =      7.1 : 1.0
                  shoddy = True              neg : pos    =      6.9 : 1.0
                  suvari = True              neg : pos    =      6.9 : 1.0
                    mena = True              neg : pos    =      6.9 : 1.0
               atrocious = True    