# Naive Bayes Classifier in Natural Language Processing

### Layal Bata and Bryce Boyle

## Introduction

How would you characterize the following review of the Eagle Rock icon, Walt's Bar?
"Walt's bar is a hole in the wall that looks like something out of an 80s after school special
It's got a couple of pinball machine and serves bar food and drinks. Not a wide selection of beers but it's a cute little space." -Maria Bazan, Local Guide

Is it positive? Negative? What words make you think that? Now, how would a computer interpret this review?
This is a common question addressed in Natural Language Processing!

NLP is an area of computer science that focuses on gathering meaning from written or spoken language in a way a computer would understand it. If you've used Siri to call someone, or autocorrect to fix a typo, you've benefitted from NLP!

Here, we're going to use NLP techniques to figure out if a restaurant review is positive or negative based on the words used. But we'll get to that later! First, let's discuss sentiment analysis in text classification.

In NLP, text classification means sorting pieces of text into categories based on their characteristics. For our purposes, we'll be looking at whether the words in a review carry a positive or negative connotation, analyzing the "sentiment" of the review.

Sometimes this can be very difficult to do because computers don't have the context or general knowledge we apply automatically when we're looking at language. One attempted solution to this problem is the Naive Bayes' Classifier, the limitations of which we will definitely get into later. 

Now that we understand NLP a little better, let's look at how we can use a Naive Bayes' Classifier to help us understand our restaurant review data we mentioned earlier!

## Bayes' Theorem

We're gonna look at some math here, but we promise we'll break it down so stick with us!
 
Bayes' theorem is used in conditional probability. This is the probability that something will happen given something else is true. For example, the probability that a candidate will win an election given they're the incumbent, or the probability that your friend will like limeade given that they like lemonade. 

Bayes' Theorem is defined as: <br>$$P(A|B) = \displaystyle\frac {P(B|A)P(A)} {P(B)}$$

Let's say the probability your friend likes limeade is P(A), and the probability your friend likes lemonade is P(B). To calculate the probability that your friend likes limeade given they like lemonade, P(A|B), we need 3 probabilities: the probability that your friend likes lemonade given they like limeade, P(B|A), the probability your friend likes limeade, P(A), and the probability your friend likes lemonade, P(B).

So according to Bayes' Theorem:
probability(limeade|lemonade) = (probability(lemonade|limeade)*probability(limeade)) / probability(lemonade)

Let's calculate this together! Let's say that the probability of someone liking lemonade given they like limeade is pretty high, 80% (0.8). Think of one of your friends, and make up the next two probabilities based on your friend's taste!

In [None]:
friendName = input("What's your friend's name? ")
limeade = input("What's the probability that " +friendName + " likes limeade as a decimal (0.11-0.99)?")
lemonade = input("What's the probability that " +friendName+ " likes lemonade as a decimal (0.11-0.99)?")
BGivenA = float(0.8)
numerator = BGivenA*float(limeade)
AGivenB = round(100*(numerator/float(lemonade)),2)
print("The probability " +friendName+ " likes limeade given they like lemonade is: " +str(AGivenB)+ "%!")
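If you'd rather not type anything in, here is a minimal sketch of the same calculation with fixed numbers. The function name `bayes_posterior` and the values 0.5 and 0.6 are made up for illustration; only the 0.8 comes from the example above.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("P(B) must be greater than 0 and at most 1")
    return p_b_given_a * p_a / p_b

# A = likes limeade, B = likes lemonade
p_lemonade_given_limeade = 0.8   # from the example above
p_limeade = 0.5                  # made-up value
p_lemonade = 0.6                 # made-up value

posterior = bayes_posterior(p_lemonade_given_limeade, p_limeade, p_lemonade)
print(str(round(100 * posterior, 2)) + "%")   # prints 66.67%
```

One thing to watch for: not every set of made-up numbers is consistent. If the result comes out above 100%, the three probabilities you chose can't all be true at once.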

Wasn't that fun! Now that we have a solid understanding of Bayes' Theorem, let's find out how to use this for NLP!

## Naive Bayes Classifier

Let's go through this word by word!

We'll start with the end; the "Classifier" in "Naive Bayes Classifier" tells us how we're going to deal with our data. Classification is the process of predicting the "class" of a data point. In the example of our restaurant reviews, our classes are "positive" or "negative." To teach the classifier what counts as positive and negative, we split our data into training and testing sets. The training set is the portion of the data our classifier uses to learn how our input data (our reviews) relates to the class (positive or negative). In our classifier, known positive and negative reviews are used as our training data. Once it's done with the training data, we then use that information to classify unknown reviews and decide if they're positive or negative.

The "Bayes" in "Naive Bayes Classifier" means that we're applying Bayes' Theorem to each of our reviews. Given the probabilities of each word in the review being in a positive review (gathered from our training set), we calculate how likely it is that the review is positive. We then use this probability to classify whether or not the review is positive.

The "Naive" in "Naive Bayes Classifier" means that we treat each piece of data we put into our model as independent of the others. This means that if we're trying to decide if the review "Loved the atmosphere! Food was great!" is positive or negative, seeing "Loved" tells the model nothing about whether "atmosphere" will show up too; each word contributes its own evidence separately. In creating a Naive Bayes Classifier model, we are making the assumption that, within positive reviews and within negative reviews, the words appear independently of one another.

Putting it all together, our Naive Bayes Classifier takes each of our restaurant reviews, applies Bayes' Theorem to find the probability of the review being positive or negative given each word in the review, and then classifies that review as positive or negative based on the results.
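To make that recipe concrete, here is a toy sketch of the idea written from scratch. It is not how NLTK implements its classifier; the four training reviews, the `toy_` names, and the add-one smoothing are all assumptions chosen to keep the example small.

```python
# Tiny made-up training set: (words in the review, label)
toy_training = [
    (["great", "food"], "pos"),
    (["great", "service"], "pos"),
    (["bad", "food"], "neg"),
    (["bad", "slow", "service"], "neg"),
]
toy_vocabulary = {w for words, _ in toy_training for w in words}

def toy_word_likelihood(word, label):
    """Estimate P(word appears | label), with add-one smoothing."""
    reviews = [set(words) for words, lab in toy_training if lab == label]
    count = sum(word in review for review in reviews)
    return (count + 1) / (len(reviews) + 2)

def toy_classify(words):
    """Return P(label | words) for each label, treating words as independent."""
    scores = {}
    for label in ("pos", "neg"):
        # prior: share of training reviews with this label
        score = sum(lab == label for _, lab in toy_training) / len(toy_training)
        for word in words:
            if word in toy_vocabulary:   # skip words never seen in training
                score *= toy_word_likelihood(word, label)
        scores[label] = score
    total = sum(scores.values())         # normalize so the two scores add up to 1
    return {label: score / total for label, score in scores.items()}

toy_probs = toy_classify(["great", "service"])
print(toy_probs)
```

With this toy data, "great service" comes out at 0.75 for 'pos' and 0.25 for 'neg': "great" only ever appeared in positive reviews, while "service" appeared equally in both.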

## Code time!

Now that we've gone over how the Naive Bayes' Classifier works, let's get into the implementation and have it sort some restaurant reviews! The dataset we are using can be found [here](https://www.kaggle.com/vigneshwarsofficial/reviews), and consists of 1000 restaurant reviews that have been labelled as either positive or negative. The first few lines of code below are just for setup and aren't relevant to this walkthrough, but they still need to be run before the other cells will work.

In [None]:
import random

import nltk
import pandas as pd
from nltk import FreqDist, NaiveBayesClassifier
from nltk.corpus import stopwords

# one-time downloads of the tokenizer model and stop word list used below
# (newer NLTK releases may ask for 'punkt_tab' as well)
nltk.download('punkt')
nltk.download('stopwords')

dataframe = pd.read_csv('restaurantrevievs.csv') #make sure the data file is in the same folder as the notebook! if not, copy file path and paste here
review_list = dataframe[' Review'].tolist() # grabbing reviews and putting them into a list (one review per element of list)
rating_list = dataframe['Liked'].tolist() # same as above but with the ratings

### Preprocessing

The first step in analyzing any kind of data is making sure it's in a form that is easy to work with. In the code above, we extracted the restaurant reviews and ratings from a large file where all of the information was separated by commas. But there is still more to do! In order to apply Bayes' Theorem correctly, we need to organize the data into two main collections: a list of every word in the dataset, and a list of all reviews labelled as positive or negative.

To create the list of all words, we have to sort through all of the reviews and add the words that we want. Right now, the reviews contain punctuation and many words without actual meaning, such as "and", "a", and "the" that don't contribute to the sentiment of the review. In NLP, these words are referred to as stop words, and are usually removed from text before we analyze it.

In [None]:
def remove_stopWords(words):
    stopword_list = stopwords.words('english')
    content = [w for w in words if w.lower() not in stopword_list]
    return content

def remove_punctuation(words):
    content = [w for w in words if w.isalpha()]
    return content

We have named this list of all words "all_words" (creative, we know!) and removed punctuation and stop words from it. In NLP, all_words is called a "bag of words" because every word is extracted separately without any information about surrounding words. For example, if a review contains the phrase "not bad", then both "not" and "bad" are tossed into the list with no context (you can imagine why there'd be issues with this...). This list will be used later on for training and testing the classifier to help us find the most commonly used words.
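Here is a small sketch of what "no context" means in practice. The helper `bag_of_words` and the two example reviews are made up for illustration, and the sketch skips stop word and punctuation removal to keep things short.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

review_a = "not bad really good"
review_b = "really bad not good"

# Same words, different order: the two bags are identical
print(bag_of_words(review_a) == bag_of_words(review_b))   # prints True
```

The two reviews say opposite things to a human reader, but once the words are tossed in the bag, nothing is left to tell them apart.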

Now that our bag of words has been created and sorted through, we make a list called "documents" that will hold all of the reviews and information on whether each one is considered positive or negative. We do this by labelling each review with 'pos' or 'neg' depending on how it was originally labelled in the dataset. Then, we shuffle the order of the reviews in our list so that the training and testing sets we split off later each get a random mix of reviews, rather than depending on the order of the original file.

In [None]:
documents = []
all_words = []

for i in range(len(review_list)):
    if rating_list[i] == 1:           # if the rating is positive, add to documents and label as 'pos'
        documents.append([i, 'pos'])

    elif rating_list[i] == 0:         # if the rating is negative, add to documents and label as 'neg'
        documents.append([i, 'neg'])

    for word in nltk.word_tokenize(review_list[i]):     # add all individual words in each review to all_words (bag of words)
        all_words.append(word)

all_words = remove_punctuation(remove_stopWords(all_words)) # removing punctuation and stop words
random.shuffle(documents)

## Implementation

Now for the fun part! Our data has been organized into useful lists (all_words and documents), and now we have to go through a few additional steps to ensure that our classifier has all of the information it needs.

Below, we convert all_words into an object that contains information about how frequently each word appears in the list. The 2000 most common words found in the list are then added to a new list called "word_features". Finally, each review is analyzed for which of the most common words it contains (from word_features) and whether the review as a whole is positive or negative. This will help our classifier determine what words might be included in a negative review and which are more likely to be found in a positive review. This information is then added to another list called feature_sets that will be used for training and testing our classifier.

In [None]:
all_words = FreqDist(w.lower() for w in all_words)  # creates Frequency Distribution object with info from all_words

word_features = []
feature_sets = []

for item in all_words.most_common(2000):    # adds the most common 2000 words to word_features
    word = item[0]
    word_features.append(word)
    
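def document_features(document, word_features):
    # lowercase the review's words so they match word_features, which were lowercased above
    document_words = set(w.lower() for w in document)
    features = {}
    for word in word_features:
        # one True/False feature per common word: does this review contain it?
        features['contains({})'.format(word)] = (word in document_words)
    return features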
    
for (review, pos_neg) in documents:     # for each review, what common words does it contain and is it pos or neg?
    features = document_features(nltk.word_tokenize(review_list[review]), word_features)
    feature_sets.append(tuple((features, pos_neg)))
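To see what one of these feature dictionaries looks like, here is a toy version with only two feature words. It uses `collections.Counter` as a stand-in for FreqDist (both offer `most_common`), and the word list and `toy_` names are made up for illustration.

```python
from collections import Counter

toy_words = ["great", "food", "great", "service", "bad", "food", "great"]
toy_counts = Counter(toy_words)

# keep only the 2 most common words as features (the notebook keeps 2000)
toy_top_words = [word for word, _ in toy_counts.most_common(2)]

def toy_document_features(document, feature_words):
    """Map each feature word to True/False depending on whether the review contains it."""
    document_words = {w.lower() for w in document}
    return {'contains({})'.format(w): (w in document_words) for w in feature_words}

toy_feature_dict = toy_document_features(["Great", "patio"], toy_top_words)
print(toy_feature_dict)   # prints {'contains(great)': True, 'contains(food)': False}
```

Notice that "patio" doesn't appear anywhere in the result: a word that isn't among the chosen feature words leaves no trace in the dictionary.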

We're finally at the point where we can train and test our classifier!

First, we need to split our feature set (explained above) and allocate some to our training set, and the rest to our testing set. For simplicity, we've decided to use half of our data for training, and half of our data for testing, but later you'll get a chance to mess with these proportions yourself!

Now, let's train our classifier! With just that one line, we're calling on NLTK's built-in Naive Bayes' Classifier (thanks NLTK <3) to work out, for each of the 2000 most common words, how often it shows up in positive reviews compared to negative ones.

Then, we're calling on a built-in function (show_most_informative_features) to give us the 25 words with the most lopsided ratios, meaning that they're the 25 words that have the most consistent connotations of either positive or negative.

Finally, we pass our classifier and our testing set into a built-in function that will output how accurate our classifier is, as a fraction between 0 and 1 (so 0.75 means 75% of the test reviews were classified correctly).

In [None]:
train_set, test_set = feature_sets[500:], feature_sets[:500]

classifier = NaiveBayesClassifier.train(train_set)
print('\n')
classifier.show_most_informative_features(25)

print('\nClassifier accuracy: ' + str(nltk.classify.accuracy(classifier, test_set)))
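The accuracy number is nothing mysterious: it's the share of test reviews whose predicted label matches the real one. Here is a hand-rolled sketch of that idea; the function `fraction_correct` and the two label lists are made up for illustration and are not part of NLTK.

```python
def fraction_correct(predicted, actual):
    """Return the fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

toy_predicted = ["pos", "neg", "pos", "pos"]
toy_actual = ["pos", "neg", "neg", "pos"]

toy_score = fraction_correct(toy_predicted, toy_actual)
print(toy_score)   # prints 0.75 (3 of the 4 labels match)
```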

## Try it yourself!

Now that we have the whole classifier built, let's mess with some stuff ourselves!

Let's start by seeing how changing the ratio of training to testing data changes the accuracy of our model. We have it currently set to 50% (0.5) training, and 50% (0.5) testing, with the accuracy above.

Below, try raising the testing set size. What do you think will happen to the accuracy?

In [None]:
valid = False
while not valid:
    test_size = input('Enter a percent (as a decimal) for the testing set size: ')
    try:
        valid = 0 < float(test_size) < 1
    except ValueError:      # the input wasn't a number at all
        valid = False
    if not valid:
        print('Please enter a valid input between 0 and 1 (non-inclusive)')
test_size = int(float(test_size) * len(documents))   # convert the fraction to a number of reviews

test_set, train_set = feature_sets[:test_size], feature_sets[test_size:]

classifier = NaiveBayesClassifier.train(train_set)
print('\n')
classifier.show_most_informative_features(25)

print('\nClassifier accuracy: ' + str(nltk.classify.accuracy(classifier, test_set)))

When we increase the proportion of our testing set, we're leaving less data to train our classifier. This means it's likely the accuracy of our classifier will decrease, because the more data we have to work with in training our model, the more kinds of cases our classifier will encounter, strengthening its accuracy. The most informative features section will also likely look different now, and may make less sense at a glance than when we had a larger training set.

Now try to lower our testing set size to below our default 0.5. What do you think will happen?

In [None]:
valid = False
while not valid:
    test_size = input('Enter a percent (as a decimal) for the testing set size: ')
    try:
        valid = 0 < float(test_size) < 1
    except ValueError:      # the input wasn't a number at all
        valid = False
    if not valid:
        print('Please enter a valid input between 0 and 1 (non-inclusive)')
test_size = int(float(test_size) * len(documents))   # convert the fraction to a number of reviews

test_set, train_set = feature_sets[:test_size], feature_sets[test_size:]
classifier = NaiveBayesClassifier.train(train_set)
print('\n')
classifier.show_most_informative_features(25)

print('\nClassifier accuracy: ' + str(nltk.classify.accuracy(classifier, test_set)))

Lowering our testing set means we have more to work with for training, meaning our accuracy should generally increase! Just the decision of how much data we're allocating to testing vs training can have significant effects on how useful our classifier is.
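If you want to reuse the splitting step on its own, here is a small sketch of it as a function. The name `split_by_fraction` and the stand-in data are made up for illustration; the slicing mirrors what the cells above do with feature_sets.

```python
def split_by_fraction(items, test_fraction):
    """Return (train, test), putting the first test_fraction of items in the test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1 (non-inclusive)")
    test_size = int(test_fraction * len(items))
    return items[test_size:], items[:test_size]

toy_data = list(range(8))             # stand-in for 8 labelled reviews
toy_train, toy_test = split_by_fraction(toy_data, 0.25)
print(len(toy_train), len(toy_test))  # prints 6 2
```

Keep in mind that a very small testing set cuts both ways: there's more data for training, but the accuracy is then measured on only a handful of reviews, so it can jump around a lot from one shuffle to the next.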


Now that we've put in all this work to create and understand this classifier, let's try entering a review ourselves! Go ahead and write up a review for your favorite or least favorite Mexican restaurant! Or whatever you want, we can't stop you :~)

In [None]:
# uses the classifier trained in the most recent cell above
text = input('Enter a review: ')
text = nltk.word_tokenize(text)
featset = document_features(text, word_features)
result = classifier.classify(featset)   # returns the predicted label: 'pos' or 'neg'
if result == 'neg':
    print('\nClassified as negative')
elif result == 'pos':
    print('\nClassified as positive')

## Limitations and Pitfalls

Although our Naive Bayes' Classifier can be useful, there are also some limitations inherent in its design and assumptions. This means that we can mess with our classifier based on what's in the review we're giving it.

Let's see this in action! If you need a prompt to get those creative juices flowing, write a review of a restaurant in Eagle Rock!

In [None]:
# uses the classifier trained in the most recent cell above
text = input('Enter a review with an idiom: ')
text = nltk.word_tokenize(text)
featset = document_features(text, word_features)
result = classifier.classify(featset)   # returns the predicted label: 'pos' or 'neg'
if result == 'neg':
    print('\nClassified as negative')
elif result == 'pos':
    print('\nClassified as positive')

Did it classify your review correctly? If there's an idiom in a review, there's a good chance our classifier had trouble placing it in the right category. One potential problem is that our classifier looks at each word separately, meaning we can't look at the context around words to determine their connotations the way we do in the "real world." Things like the idiom you entered may lead to misclassifying a review!

Let's try another trick! If you need a prompt, write a review of the most expensive meal you've had!

In [None]:
# uses the classifier trained in the most recent cell above
text = input('Enter a review with double negatives: ')
text = nltk.word_tokenize(text)
featset = document_features(text, word_features)
result = classifier.classify(featset)   # returns the predicted label: 'pos' or 'neg'
if result == 'neg':
    print('\nClassified as negative')
elif result == 'pos':
    print('\nClassified as positive')

How did our classifier do with that review? There's a good chance that it was classified as negative. Even though we can read "not not" or "not bad" as cancelling out to make a positive, our classifier looks at each word separately, so it never sees that "not" flips the meaning of "bad." In fact, "not" is on NLTK's English stop word list, so our preprocessing kept it out of word_features entirely, and "not bad" looks just like "bad" to the classifier. Independence is an assumption we make when we use a Naive Bayes' Classifier, so there's no "fix" for this within our classifier as it is.

Let's try one more trick! If you need a prompt, review a restaurant you associate with a very specific memory!

In [None]:
# uses the classifier trained in the most recent cell above
text = input('Enter a review with rarer words: ')
text = nltk.word_tokenize(text)
featset = document_features(text, word_features)
result = classifier.classify(featset)   # returns the predicted label: 'pos' or 'neg'
if result == 'neg':
    print('\nClassified as negative')
elif result == 'pos':
    print('\nClassified as positive')

How did our classifier do now? Because rarer words show up less often, we have less data on them, so our estimate of their negative or positive connotation isn't as accurate. Remember, the more data, the better, but we're always going to run into some words that are rarer than others within our set. Because of these rarer words, our classifier may have had a hard time classifying your review.
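The extreme case is a word that never made it into word_features at all: the classifier simply can't see it. Here is a toy sketch of that effect, with a made-up three-word feature list standing in for the 2000 most common words and hypothetical `toy_` names throughout.

```python
toy_word_features = ["great", "bad", "food"]   # stand-in for the 2000 most common words

def toy_features(words):
    """Build the same kind of contains(...) dictionary the notebook uses."""
    present = {w.lower() for w in words}
    return {'contains({})'.format(w): (w in present) for w in toy_word_features}

rare_review = ["Sublime", "ambrosial", "tiramisu"]
empty_review = []

# None of the rare words are features, so the review looks the same as an empty one
print(toy_features(rare_review) == toy_features(empty_review))   # prints True
```

A glowing review written entirely in words outside the feature list gives the classifier nothing to go on, so it can only fall back on what it learned about reviews that contain none of the common words.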

We've now used 3 tricks that mess with the performance of our classifier, all because of the assumptions we made when making it. Now, let's look at some takeaways and what we've learned!

## Conclusion

What a journey! We started not even knowing what NLP means, and we ended understanding the intricacies of the limitations of a Naive Bayes' Classifier. Kudos to you!

This is just the tip of the iceberg when it comes to natural language processing. There are so many different classifiers we could apply to the same area of restaurant reviews with much different results. Further, there are so many more questions we could explore using NLP techniques. We hope you feel equipped enough to 