## Text classification with NLTK using a Naive Bayes classifier

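NLTK ships with many built-in tools for text classification. First, we import the libraries we need, including the `movie_reviews` corpus that comes with NLTK. If the corpus is not yet available locally, download it once with `nltk.download('movie_reviews')`.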

In [None]:
import nltk
import pandas as pd
from nltk.corpus import movie_reviews
from sklearn.utils import shuffle

### Preparing the dataset

In [None]:
# A list of all the words in 'movie_reviews'
movie_reviews.words()

In [None]:
# number of words in 'movie_reviews'
len(movie_reviews.words())

In [None]:
print("Words of one review: " + str(movie_reviews.words(movie_reviews.fileids()[1])))
print("Labels: " + str(movie_reviews.categories()))

In [None]:
# Displays frequency of words in ‘movie_reviews’
nltk.FreqDist(movie_reviews.words())

In [None]:
# Print file ids of positive reviews
movie_reviews.fileids('pos')

In [None]:
# Print all words in movie_review with file id ‘neg/cv001_19502.txt’
movie_reviews.words('neg/cv001_19502.txt')

We store each review together with its category, in this case the sentiment, as a pair (review, category):

In [None]:
reviews = [ (list(movie_reviews.words(fileid)), category) 
            for category in movie_reviews.categories() 
            for fileid in movie_reviews.fileids(category)]

Check what they look like:

In [None]:
print("#reviews: ", len(reviews))
# show the first 10 reviews together with their category
for index, (review, category) in enumerate(reviews[:10], start=1):
    print('Review ', index, ' first 5 words: ', review[0:5], ' \tcategory: ', category)

Next, we shuffle them (to avoid having all the negative and positive ones sitting together):

In [None]:
reviews = shuffle(reviews)
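
To see why the shuffle matters, consider a toy list built the same way as `reviews`, with all items of one category followed by all items of the other. The sketch below is only an illustration: it uses Python's standard `random` module rather than `sklearn.utils.shuffle`, and the placeholder documents and the helper `label_counts` are made up.

```python
import random
from collections import Counter

# Toy stand-in for `reviews`: four negative items followed by four positive ones
toy_reviews = [(["bad"], "neg")] * 4 + [(["good"], "pos")] * 4

def label_counts(pairs):
    """Count how often each category occurs in a list of (document, category) pairs."""
    return Counter(category for _, category in pairs)

half = len(toy_reviews) // 2

# Without shuffling, the first half contains a single category
print("unshuffled first half:", label_counts(toy_reviews[:half]))

# Shuffle a copy with a fixed seed so the result is reproducible
shuffled = toy_reviews[:]
random.Random(42).shuffle(shuffled)
print("shuffled first half:  ", label_counts(shuffled[:half]))
```

A split by position on the unshuffled list would train on one sentiment only, which is why the order is randomised before splitting.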

### Creating features and classification with NLTK

This function converts a review into a dictionary of features (i.e. for each feature word, whether it is present in the review).

In [None]:
def getWordVector(document, features):
    # only keep a set, the count of the words does not matter
    document_words = set(document)
    
    # create a new dictionary to store the features
    doc_features = {}
    for word in features:
        # if the word is present, True is stored, otherwise, False is stored
        doc_features["contains_"+word] = (word in document_words)
    return doc_features
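
As a quick illustration of the resulting format, the sketch below applies a compact equivalent of the function to a made-up review. The name `word_presence_features` and the toy data are hypothetical and only serve this example.

```python
def word_presence_features(document, features):
    """Map each feature word to True/False depending on whether the document contains it."""
    document_words = set(document)  # duplicates are irrelevant, only presence counts
    return {"contains_" + word: (word in document_words) for word in features}

toy_review = ["a", "great", "great", "film"]
toy_features = ["great", "boring", "film"]

print(word_presence_features(toy_review, toy_features))
# → {'contains_great': True, 'contains_boring': False, 'contains_film': True}
```

Note that "great" occurs twice in the toy review but is recorded only as present, matching the set-based logic above.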

Creating features out of text:

In [None]:
words = list(movie_reviews.words())
all_words = nltk.FreqDist(words)

# keep all words that appear more than 200 times (all_words maps each word to its frequency);
# a sorted list gives the features a fixed order, which matters when they become column labels
features = sorted(w for w, count in all_words.items() if count > 200)

In [None]:
# check feature number
len(features)
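
The frequency threshold can be tried out on a handful of words. This is a minimal sketch that uses `collections.Counter` (the class `nltk.FreqDist` is built on) so that it runs without the corpus; the toy words and the threshold `min_count` are assumptions chosen for the example.

```python
from collections import Counter

toy_words = ["the", "film", "the", "plot", "the", "film", "twist"]
toy_counts = Counter(toy_words)  # maps each word to its frequency

# keep only the words that occur more often than the threshold
min_count = 1
toy_features = sorted(w for w, count in toy_counts.items() if count > min_count)

print(toy_counts["the"], toy_counts["film"], toy_counts["plot"])  # → 3 2 1
print(toy_features)  # → ['film', 'the']
```

Raising the threshold shrinks the feature set, which keeps the feature vectors small but discards rarer words.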

In [None]:
# loop through the reviews and convert the review into a feature vector, store it together with the category
featureset = [(getWordVector(rev, features), cat) for (rev, cat) in reviews]

In [None]:
featureset

Create your training and test set, and train the model:

In [None]:
# rudimentary way to split training and test set
no_points = round(len(reviews)/2)
train, test = featureset[no_points:], featureset[:no_points]

# train NB model
classifier = nltk.NaiveBayesClassifier.train(train)

# print accuracy
print("Accuracy: "+str(nltk.classify.accuracy(classifier, test)))
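
Conceptually, the accuracy is the fraction of test items whose predicted label matches the true label. The sketch below spells that out with a hypothetical rule-based stand-in for the trained classifier; the function names and the toy test set are invented for illustration and are not NLTK's implementation.

```python
def accuracy(classify, labelled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) equals label."""
    correct = sum(1 for feats, label in labelled_featuresets if classify(feats) == label)
    return correct / len(labelled_featuresets)

def toy_classify(feats):
    """Hypothetical stand-in classifier: predict 'pos' whenever 'great' is present."""
    return "pos" if feats["contains_great"] else "neg"

toy_test = [
    ({"contains_great": True}, "pos"),   # predicted pos, correct
    ({"contains_great": False}, "neg"),  # predicted neg, correct
    ({"contains_great": True}, "neg"),   # predicted pos, wrong
    ({"contains_great": False}, "neg"),  # predicted neg, correct
]

print(accuracy(toy_classify, toy_test))  # → 0.75
```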

You can check which features are the most informative (based on probability):

In [None]:
for feature in classifier.most_informative_features(n = 20):
    print('Interesting feature: ', feature)
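
One way to think about informativeness is to compare how often a feature occurs in each category: a word that is far more likely in one category than in the other helps the classifier more than a word that is equally common in both. The sketch below is a simplified illustration with unsmoothed estimates and made-up counts; NLTK's own computation works on its smoothed probability estimates, so the numbers will not match exactly.

```python
# Hypothetical counts: in how many reviews of each category does a word appear?
reviews_per_label = {"pos": 100, "neg": 100}
reviews_containing = {
    "outstanding": {"pos": 40, "neg": 5},
    "film": {"pos": 90, "neg": 88},
}

def presence_ratio(word):
    """Ratio between the larger and the smaller estimate of P(word present | label)."""
    probs = {
        label: reviews_containing[word][label] / reviews_per_label[label]
        for label in reviews_per_label
    }
    return max(probs.values()) / min(probs.values())

for word in reviews_containing:
    print(word, round(presence_ratio(word), 2))
```

In this toy data "outstanding" separates the categories far better than "film", even though "film" is much more frequent overall.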

### Creating features and classification with scikit-learn

We need some extra libraries:

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Creating a binary (one-hot style) encoding of words, i.e. 1 if a word is present in the review and 0 if not:

In [None]:
def getBinaryVector(review, features):
    # the output is a vector of 1's and 0's to indicate whether a word is present in the document
    x = []
    # count does not matter
    review = set(review)

    # go through the features (all words)
    for word in features:
        # check whether the word is present
        if word in review:
            x.append(1)
        else:
            x.append(0)   
    return x
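
To see what the resulting feature matrix looks like, the sketch below encodes two made-up reviews with a compact equivalent of the function. The name `binary_vector` and the toy data are hypothetical. Each row is one review, and each column follows the order of the feature list.

```python
def binary_vector(review, features):
    """Return a list with 1 where the feature word occurs in the review, else 0."""
    review_words = set(review)
    return [1 if word in review_words else 0 for word in features]

toy_features = ["boring", "film", "great"]
toy_docs = [
    ["a", "great", "film"],
    ["boring", "boring", "film"],
]

toy_X = [binary_vector(doc, toy_features) for doc in toy_docs]
print(toy_X)  # → [[0, 1, 1], [1, 1, 0]]
```

Because the columns carry no names of their own, the features must be iterated in the same order for every review.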

Preparing the dataset:

In [None]:
X = []
y = []

for (rev, category) in reviews:
    # convert your review into a vector and save it into the feature matrix
    X.append(getBinaryVector(rev, features))
    
    # save the label for your dependent variable list
    y.append(category)

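# pass the features as a list so the column order is explicit (recent pandas versions reject a set here)
X_df = pd.DataFrame(data = X, columns = list(features))

# train and test split with sklearn (test_size=0.7 holds out 70% of the reviews for testing)
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.7)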

Training and evaluating the naive Bayes and random forest classifiers:

In [None]:
nb = GaussianNB()
rf = RandomForestClassifier()
nb_fit = nb.fit(X_train, y_train)
rf_fit = rf.fit(X_train, y_train)

# predict based on the test data points
y_pred = nb_fit.predict(X_test)
y_pred_rf = rf_fit.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Accuracy naive Bayes: " + str(accuracy))
print("Accuracy random forest: " + str(accuracy_rf))

In [None]:
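# pair each feature with its own importance first, then sort the pairs by importance;
# sorting the importances on their own would detach them from their feature names
ranked = sorted(zip(X_df.columns, rf.feature_importances_), key=lambda pair: pair[1], reverse=True)
for feature, importance in ranked:
    if importance > 0.005:
        print(feature, ' score: \t', round(importance, 3))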
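
When ranking features by importance, each name has to stay attached to its own score. The sketch below shows the pattern on made-up feature names and scores (both are assumptions for this example): the names and scores are zipped into pairs first, and the pairs are then sorted.

```python
# Hypothetical feature names and importance scores, in matching order
feature_names = ["film", "great", "boring", "plot"]
importances = [0.10, 0.45, 0.40, 0.05]

# zip first, then sort the (name, score) pairs by score, highest first
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(name, score)
```

Sorting the scores alone and zipping afterwards would assign the lowest score to the first name, whatever its true importance.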

**(Optional exercise) Instead of naive Bayes, try using an SVM.**

You may look at the following reference for hints:
https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34
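
As a starting point, here is a minimal sketch of fitting a linear SVM with scikit-learn's `SVC` on a tiny made-up binary matrix. The toy data are assumptions for illustration only; for the exercise, the same `fit`/`predict` calls would be applied to `X_train`, `y_train` and `X_test` from above.

```python
from sklearn.svm import SVC

# Toy binary features: column 0 = "great" present, column 1 = "boring" present
toy_X = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_y = ["pos", "pos", "neg", "neg"]

# A linear kernel is a common first choice for sparse word-presence features
svm = SVC(kernel="linear")
svm.fit(toy_X, toy_y)

toy_pred = svm.predict([[1, 0], [0, 1]])
print(list(toy_pred))
```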


Further reference on movie review analysis: 
https://medium.com/@joel_34096/sentiment-analysis-of-movie-reviews-in-nltk-python-4af4b76a6f3