### Custom Sentiment Analysis Module

Sentiment analysis is the utilization of natural language processing techniques to understand the "sentiment" or attutide/tone of a body of text. Sentiment analysis proves useful for tasks like analysis of product reviews, employee feedback, and trending topics and reactions.

This project creates a customized sentiment analysis module that combines together the classifications of many different classification algorithms. This ensemble method creates a classifier which votes on the sentiment of text pieces. 

The approach for designing this ensemble classifier will be as follows:

We'll prep the data for use in training our classifier, doing things like removing stop words, etc. Next we'll turn the words into features for our classifier, creating feature sets - along with their associated labels. Next, we'll train our chosen classifiers on the feature sets, and then "Pickle" the trained classifier so they don't need to be retrained.

After all the classifiers have been trained, we'll create a custom class that combines the decision of all classifiers and renders a judgement. 

After that, we'll load the pickled weights into a new file and then create a function to return the combined vote of the classifier.

First thing we should do is import all the libraries we will need to create the ensemble classifier.

In [1]:
import nltk
import random
from nltk.tokenize import word_tokenize
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode

To begin with, we'll need to load in the data we wish to use for the training of our classifiers. The data is two text files, one containing examples of positive sentiment reviews while the other contains negative sentiment reviews. 

In [2]:
pos_text = open("positive.txt","r", encoding='utf-8', errors='replace').read()
neg_text = open("negative.txt","r", encoding='utf-8', errors='replace').read()

Next, we're going to need lists to store the data from our files in. We'll create a list for all the words and a list for the individual reviews in our documents. Each line in the document is a different review.

In [3]:
all_words = []
reviews = []

We'll be using the NLTK library to filter all the words in our two datasets. We're only interested in certain parts of speech, so we'll specify which parts of speech should be used to filter the words. We'll also need to select some stop words as we don't want extremely common words to be analyzed. Thankfully, NLTK has a set of stop words built in. The allowed word types refers to the way NLTK tags different parts of speech. We're interested mainly in adjectives (J), adverbs (R) and verbs (V). If we wanted, we could also throw in nouns (N), but it probably wouldn't be very useful.

In [4]:
allowed_word_types = ["J", "R", "V"]
stop_words = set(stopwords.words('english'))

Now we're going to need to use the parts of speech we've selected to make a list of all the words we want in our training corpus. We're going to append a label to every review/line in the training docs, so we'll split on lines and add "Pos" to the positives and "Neg" for negatives. Then, we need to represent our words as numbers our algorithms can interpret, or "Tokenize" them for our algorithms. We'll use the `word_tokenize` function in NLTK for this. Finally, for all the words that match our parts of speech, assuming the words aren't in our list of stop words, we'll add them to the list of all words we are interested in.

In [5]:
for p in pos_text.split('\n'):
    # Note that we take all the words in the doc and append a label to them
    reviews.append((p, "pos"))
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            if w[0] not in stop_words:
                all_words.append(w[0].lower())

for p in neg_text.split('\n'):
    reviews.append((p, "neg"))
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            if w[0] not in stop_words:
                all_words.append(w[0].lower())

We now need to transform the lists of words/tokens into a list of features we want to use for training. We probably don't want to use all the words for training, as this would take quite a long time. Instead, let's just select the top 5000 words. In order to get the top 5000 words, we want to get the word count of all the words and then grab the top 5000. It's important to be aware of how the frequency distribution is returned. It's basically returned as a dictionary with key-value pairs, and we'll be using the keys later on.

In [6]:
# frequency distribution gives words in order of most common to least common, essentially a key:val pair with
# a frequency val for every word(key)
word_dist = nltk.FreqDist(all_words)
print(word_dist.most_common(20))

word_features = list(word_dist.keys())[:5000]
print(len(word_features))


[("'s", 1709), ("n't", 940), ('much', 386), ('even', 382), ('good', 370), ('little', 302), ('make', 273), ('never', 262), ('enough', 260), ('funny', 255), ('makes', 252), ('bad', 234), ('best', 232), ('new', 206), ('really', 197), ('well', 196), ('made', 193), ('many', 183), ('still', 179), ('see', 177)]
5000


Now we need to pickle the words and the features, as these features as what our classifier will use to reason about future text examples.

In [7]:
save_review_files = open("pickled_docs.pickle", "wb")
pickle.dump(reviews, save_review_files)
save_review_files.close()

save_features = open("pickled_features5k.pickle","wb")
pickle.dump(word_features, save_features)
save_features.close()

We have the features we want to use to classify a document, but we now have to create a function to extract the features from a document we want to classify. Here's where it's important to remember that our word features exist as a key-value pair. We want to extract the keys from the document we're classifying. After we tokenize the document, we'll check the document to see if the keys are in it. After we get the keys, we put them in a list of features and return them.

In [8]:
def feature_extractor(document):
    # The words are the first part of the set, with the occurence rate being the second part
    words = word_tokenize(document)
    features = {}
    # for every word in the list of word features (the words we care about)
    # the key in the feature's dictionary must be equal to boolean value of w in words
    # if the word in the dictionary is in the set of document (is within the document at all)
    # a True value is returned
    for w in word_features:
        features[w] = (w in words)
    return features

So now we can get the desired features out of the document. However, we'll also need the labels for these features. We can use the function we just created to collect the features from the documents, and then get the label from our list of reviews. We also need to shuffle up the data, because as it exists now the training data would be all positive and then all negative.

In [9]:
features_sets = [(feature_extractor(review), category) for (review, category) in reviews]
random.shuffle(features_sets)

As you might expect, we should pickle the feature sets now that we have our feature/label pairs.

In [10]:
save_features_labels = open("pickled_features_labels_5k.pickle","wb")
pickle.dump(features_sets, save_features_labels)
save_features_labels.close()

Now that we have both our featues and labels contained in a variable list, we can split the list up into training and testing set.

In [11]:
training_data = features_sets[:10000]
testing_data = features_sets[10000:]

Now we can choose some classifiers to use. In this instance, we want an odd number of classifiers since there will be a vote and we want a tie-breaker. We'll be trying Naive Bayes, along with Multinomal Naive Bayes and Bernoulli Naive Bayes. We'll also use:
the Logistic Regression classifier, the Linear Support Vector Machine classifer, NuSVC, Stochastic Gradient Descent, K-Nearest Neighbors, and the Decision Tree Classifier. 

Let's see which ones perform best and then schoose some to drop.

In [12]:
NB_clf = nltk.NaiveBayesClassifier.train(testing_data)
MNB_clf = SklearnClassifier(MultinomialNB())
BNB_clf = SklearnClassifier(BernoulliNB())
LogReg_clf = SklearnClassifier(LogisticRegression())
SGDC_clf= SklearnClassifier(SGDClassifier())
LinSVC_clf = SklearnClassifier(LinearSVC())
NuSVC_clf = SklearnClassifier(NuSVC())
KNN_clf = SklearnClassifier(KNeighborsClassifier(n_neighbors=3))
DT_clf = SklearnClassifier(DecisionTreeClassifier())

Now we can train the classifiers. Note that this is done differently than you would normally do in Scikit-learn. Since we are using versions of the classifier from NLTK we call `train` on them, rather than fit and predicting. Please note that this can take a while.

In [13]:
clf_list = [NB_clf, MNB_clf, BNB_clf, LogReg_clf, SGDC_clf, LinSVC_clf, NuSVC_clf, KNN_clf, DT_clf]

for clf in clf_list:
    classifier = str(clf)
    print("Training: " + classifier)
    clf.train(training_data)

Training: <nltk.classify.naivebayes.NaiveBayesClassifier object at 0x000002987FCC0588>
Training: <SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>
Training: <SklearnClassifier(BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))>
Training: <SklearnClassifier(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False))>




Training: <SklearnClassifier(SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=None, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False))>
Training: <SklearnClassifier(LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0))>
Training: <SklearnClassifier(NuSVC(cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, nu=0.5, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False))>




Training: <SklearnClassifier(KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform'))>
Training: <SklearnClassifier(DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best'))>


Now that they have trained we can test the classifiers on the validation set and see what their accuracy. We'll use a dictionary to contain the name and the classifier.

In [14]:
clf_names = {'Vanilla Naive Bayes': NB_clf,
            'Multinomial Naive Bayes': MNB_clf,
            'Bernoulli Naive Bayes': BNB_clf,
            'Logistic Regression': LogReg_clf,
            'SGDC': SGDC_clf,
            'Linear SVC': LinSVC_clf,
            'Nu SVC': NuSVC_clf,
            'Decision Tree': DT_clf,
            'K-Nearest Neighbors': KNN_clf}

for key, val in clf_names.items():
        print(key + " " + "accuracy is:")
        print(nltk.classify.accuracy(val, testing_data) * 100)

Vanilla Naive Bayes accuracy is:
93.22289156626506
Multinomial Naive Bayes accuracy is:
73.19277108433735
Bernoulli Naive Bayes accuracy is:
72.28915662650603
Logistic Regression accuracy is:
71.98795180722891
SGDC accuracy is:
68.22289156626506
Linear SVC accuracy is:
69.7289156626506
Nu SVC accuracy is:
69.87951807228916
Decision Tree accuracy is:
64.7590361445783
K-Nearest Neighbors accuracy is:
55.42168674698795


Here's the accuracy that I received for the classifiers (this tends to fluctuate a bit, so you might get something different):

Vanilla Naive Bayes accuracy is:
93.97590361445783

Multinomial Naive Bayes accuracy is:
72.43975903614458

Bernoulli Naive Bayes accuracy is:
71.3855421686747

Logistic Regression accuracy is:
71.83734939759037

SGDC accuracy is:
70.03012048192771

Linear SVC accuracy is:
70.63253012048193

Nu SVC accuracy is:
71.98795180722891

Decision Tree accuracy is:
65.06024096385542

K-Nearest Neighbors accuracy is:
55.12048192771084

Let's pick only the classifiers that performed the best. It looks like K-Nearest Neighbors performed barely better than chance, so let's drop it. Regular Naive Bayes performs extremely well, in fact it performed suspiciously well. It seems likely that the algorithm could be overfitting. For that reason, let us drop it as well.

After the classifiers have been trained and the classifiers we want to use chosen, we'll want to pickle them as well so that we don't have to retrain them again.

In [15]:
save_BNB_classifier = open("BNBclf5k.pickle","wb")
pickle.dump(BNB_clf, save_BNB_classifier)
save_BNB_classifier.close()

save_MNB_classifier = open("multinaivebayes5k.pickle","wb")
pickle.dump(MNB_clf, save_MNB_classifier)
save_MNB_classifier.close()

save_LogReg_classifier = open("LogReg5k.pickle","wb")
pickle.dump(LogReg_clf, save_LogReg_classifier)
save_LogReg_classifier.close()

save_SGDC_classifier = open("SGDC5k.pickle","wb")
pickle.dump(SGDC_clf, save_SGDC_classifier)
save_SGDC_classifier.close()

save_LinSVC_classifier = open("LinSVC5k.pickle","wb")
pickle.dump(LinSVC_clf, save_LinSVC_classifier)
save_LinSVC_classifier.close()

save_NuSVC_classifier = open("NuSVC5k.pickle","wb")
pickle.dump(NuSVC_clf, save_NuSVC_classifier)
save_NuSVC_classifier.close()

save_DT_classifier = open("DT5k.pickle","wb")
pickle.dump(DT_clf, save_DT_classifier)
save_DT_classifier.close()

All the features, labels, and classifier have been set up. This means we just need to create the voting classifier now.

In [16]:
class VotingClassifier(ClassifierI):

    def __init__(self, *classifiers):
        # The classifiers of this class are the classifiers we've specified above
        self.__classifiers = classifiers

    # This function overrides the default classifier in that originates from NLTK
    def classify(self, features):
        # Need a way to store the votes from the individual classifiers
        votes = []

        # For all the classifiers, classify the features
        # append the result of the classification to the votes list
        for i in self.__classifiers:
            v = i.classify(features)
            votes.append(v)

        # The classification will be the mode of all the votes
        return mode(votes)

    # We may also want to include a confidence statistic - 
    # which reflects how many classifiers voted in favor of the class
    
    def confidence(self, features):
        votes = []
        for c in self.__classifiers:
            v = c.classify(features)
            votes.append(v)

        # instead of the pure mode, we count how many classifiers voted for the mode
        # and then divide by the number of votes
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

Let's test out the accuracy of our voting classifier. This initial test will take a while, though when we use it to classfify individual pieces of text, it should be much faster.

In [17]:
print("Voting classifier test:")

vote_clf = VotingClassifier(MNB_clf, BNB_clf, LogReg_clf, LinSVC_clf, NuSVC_clf, SGDC_clf, DT_clf)
print("Voted Classifier accuracy:", (nltk.classify.accuracy(vote_clf, testing_data)) * 100)

Voting classifier test:
Voted Classifier accuracy: 70.93373493975903


Here's what I got when I ran this:

Voting classifier test:
Voted Classifier accuracy: 72.13855421686746

Currently we have about 72% accuracy, but this seems somewhat volatile as during testing it went as high as 74 or 75%. In general, over multiple tests, it seems to fluctuate between 69% to 74%. It does seem to be performing at least as well or better than most of our classifiers, and does much better than our weakest classifiers. A voting classifier/ensemble method should be more robust to overfitting as well.

Why don't we print a couple example sentences to see if it is working like we intend.

In [18]:
# Check performance on sample reviews
print("Classification:", vote_clf.classify(testing_data[0][0]),
      "Confidence: %:", vote_clf.confidence(testing_data[0][0]))

print("Classification:", vote_clf.classify(testing_data[1][0]),
      "Confidence: %:", vote_clf.confidence(testing_data[1][0]))

Classification: neg Confidence: %: 1.0
Classification: neg Confidence: %: 1.0


In order to use our Voting Classifier we now need to load the pickled data back in (if we're running this outside of a notebook).

In [19]:
documents_file = open("pickled_docs.pickle", "rb")
documents = pickle.load(documents_file)
documents_file.close()

word_features_file = open("pickled_features5k.pickle", "rb")
word_features = pickle.load(word_features_file)
word_features_file.close()

feature_sets_file = open("pickled_features_labels_5k.pickle", "rb")
feature_sets = pickle.load(feature_sets_file)
feature_sets_file.close()

Now all we have to do is create a function to classify inputs.

In [20]:
vote_clf = VotingClassifier(MNB_clf, BNB_clf, LogReg_clf, LinSVC_clf, NuSVC_clf, SGDC_clf, DT_clf)

# We can just import this function into another script to use our custom classifier

def sentiment(text):
    feats = feature_extractor(text)
    return vote_clf.classify(feats), vote_clf.confidence(feats)

Why don't we give it a shot on some custom data?

In [21]:
text = "This game is terrible. The controls are garbage and so are the character models, I hate everything about it."
text2 = "I love her so much. She makes me really happy."

analyzer = sentiment(text)
analyzer2 = sentiment(text2)

print(analyzer)
print(analyzer2)

('neg', 1.0)
('pos', 0.7142857142857143)


To get an idea of how our custom module is performing, let's compare it to some other sentiment analysis modules.

In [22]:
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentences = [text, text2]

analyzer = SentimentIntensityAnalyzer()

for s in sentences:
    print("Sentence classification - " + "Test sentence is :  " + str(s))
    print("---")
    text_blob = TextBlob(s)
    # For Textblob, range runs from [-1.0, 1.0], Above 0 is positive
    print("TextBlob: " + str(text_blob.sentiment.polarity))
    v_sent = analyzer.polarity_scores(s)
    print("Vader: " + str(v_sent))
    custom = sentiment(s)
    print("Custom: " + str(custom))
    print()

Sentence classification - Test sentence is :  This game is terrible. The controls are garbage and so are the character models, I hate everything about it.
---
TextBlob: -0.7333333333333334
Vader: {'neg': 0.298, 'neu': 0.702, 'pos': 0.0, 'compound': -0.7783}
Custom: ('neg', 1.0)

Sentence classification - Test sentence is :  I love her so much. She makes me really happy.
---
TextBlob: 0.5
Vader: {'neg': 0.0, 'neu': 0.461, 'pos': 0.539, 'compound': 0.8479}
Custom: ('pos', 0.7142857142857143)



How did they compare? It looks like our classifier is more sure about the two examples than the other classifiers are.