What is Natural Language Processing?

Plug NLTK

In [None]:
import nltk
nltk.download()

It'll bring up another window. From here, download: 
corpora -> movie_reviews

corpora -> stopwords

all packages -> punkt

corpora -> wordnet

Import everything we need; we'll explain each piece as we use it.

In [None]:
import nltk.classify.util
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

Break apart a sentence (tokenize)

In [None]:
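sentence = "This is a test sentence. It will break everything apart!"
print(sentence.split())  # assumed first attempt: split on whitespace

However, splitting on whitespace leaves punctuation attached to the words ("sentence.", "apart!"). Luckily, NLTK has a solution!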

In [None]:
sentence = "This is a test sentence. It will break everything apart!"
print(word_tokenize(sentence))  # punctuation becomes its own token

We can also have NLTK determine the part of speech (POS) for each token in our sentence.

In [None]:
sentence = "This is a test sentence. It will tag everything!"
print(nltk.pos_tag(word_tokenize(sentence)))  # needs NLTK's POS tagger data as well

NN means singular noun, JJ means adjective, and so on.

You can use NLTK to find the definitions of words, synonyms, antonyms, etc.

Some words have multiple definitions and multiple parts of speech depending on usage. This can get complicated very fast, so for our purposes, we're just going to assume that NLTK knows what it's doing.

The next step is removing stop words. Stop words are words that serve grammatical purposes but carry little meaning (the, a, I, is, etc.).

So, let's remove them. The issue is that over 100 English words are considered stop words. Unless we want to write out a list of stop words and check every word against it by hand every time we write a program, we need a better solution.

So, let's just let NLTK remove them for us.

In [None]:
para = "Our symposium will also include two rounds of workshops with several choices in each round- so you can brush up on your Python, learn about data visualization, or deepen your knowledge of machine learning. "

In [None]:
words = word_tokenize(para)
print([w for w in words if w not in stopwords.words("english")])  # case-sensitive match

Notice that "Our" was not removed even though it is a stop word. That's because NLTK's list is all lowercase. So let's convert our paragraph to lowercase first so it doesn't miss any.

This shorter list means we have fewer words to process, which saves time and makes things more efficient while losing very little meaning.
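A minimal sketch of the lowercasing fix, using only the standard library. The tiny `STOP_WORDS` set and the whitespace split are stand-ins chosen for illustration, not NLTK's actual list or tokenizer; with NLTK you would build the set from `stopwords.words("english")` and tokenize with `word_tokenize`.

```python
# Hand-picked stand-in for NLTK's English stop word list (assumption for illustration).
STOP_WORDS = {"our", "will", "of", "the", "a", "is"}

text = "Our symposium will also include two rounds of workshops"
words = text.split()  # simple whitespace tokenization, enough for this sentence

# Case-sensitive filtering misses "Our" because the stop word list is lowercase.
case_sensitive = [w for w in words if w not in STOP_WORDS]

# Lowercasing each word before the lookup catches it.
lowercased = [w.lower() for w in words if w.lower() not in STOP_WORDS]

print(case_sensitive)
print(lowercased)
```

The first list still starts with "Our"; the second does not.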

Now, let's start creating our tool.

Machine learning works by learning from a set of data and then applying what it's learned to a new set of data.

So, the first thing we need is some data. This data can be tweets, reviews, books, anything really. We're going to be using movie reviews today.

Let's explore the data a little bit:

In [None]:
from nltk.corpus import movie_reviews

In [None]:
movie_reviews.fileids()[:4]

Let's see what the most common words are in our data:

In [None]:
all_words = movie_reviews.words()
print(nltk.FreqDist(all_words).most_common(15))  # 15 most frequent tokens with counts

Notice how a lot of these are stop words. It's a good thing we know how to remove those!

We now have most of the tools we need. Let's get started.




What is sentiment analysis? 


Sentiment analysis is the process of determining the mood of a certain piece of text. For this example, we're going to be looking at movie reviews to determine if the review is positive or negative.

There are many ways of going about doing this. We're going to be using a Naive Bayes algorithm, a fairly simple but effective machine learning algorithm.

Bayesian text classifiers like this one use what's called a bag-of-words model. That means the words are not looked at in context; only their frequency matters. The classifier also relies on Bayes' Theorem, which states that "the probability of A given that B is true equals the probability of B given that A is true times the probability of A being true, divided by the probability of B being true".

Basically, it uses statistics to find the probability that something is true given something else, a concept called conditional probability. With this, we can go from P(Evidence|Known Outcome) to P(New Outcome|Known Evidence), which is called Bayes' Rule.

Example: 

P(Disease|Test positive) =

            P(Test positive|Disease) * P(Disease)
     _______________________________________________
      P(Test positive), with or without the disease
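
To make the formula concrete, here is a small worked sketch. The numbers (1% of people have the disease, the test catches 90% of cases, 5% of healthy people test positive anyway) are made up for illustration and do not describe any real test; the function name `posterior` is likewise hypothetical.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return P(disease | positive test) using Bayes' theorem."""
    # P(positive), with or without the disease (law of total probability).
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos

# Hypothetical inputs, chosen only to illustrate the arithmetic.
p = posterior(prior=0.01, p_pos_given_disease=0.90, p_pos_given_healthy=0.05)
print(round(p, 3))
```

Under these assumed numbers, a positive result still means only about a 15% chance of disease, because the disease is rare to begin with.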
     
     
Now, to Naive Bayes. What we've done so far assumes that we only have one piece of evidence for an outcome. In the real world, we usually have several. This leads to very complicated math. One way around this is to treat each piece of evidence independently, looking at it without knowing anything about the other pieces (ahhhh, now I get the name).

                      P(evidence|outcome) * P(outcome)
P(outcome|evidence) = ________________________________
                                 P(evidence)
                                         
So, an example. Let's say we have 1000 pieces of fruit. For each one, we know whether it's long or short, sweet or not, and yellow or not. We also know what fruit it actually is (banana, orange or something else).

Let's say we now have a new piece of fruit. How will we know if it's a banana, an orange or something else?

Let's say our new fruit is long, sweet and yellow. What is it going to be?

P(Banana|Long, Sweet and Yellow)
      P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(Banana)
    = _______________________________________________________________
                                P(evidence)

    = 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)

    = 0.252 / P(evidence)


P(Orange|Long, Sweet and Yellow) = 0


P(Other Fruit|Long, Sweet and Yellow)
      P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit)
    = ____________________________________________________________________________________
                                          P(evidence)

    = (100/200 * 150/200 * 50/200 * 200/1000) / P(evidence)

    = 0.01875 / P(evidence)

All three scores share the same denominator, P(evidence), so we only need to compare the numerators. The banana wins by a large margin (0.252 >> 0.01875), so we conclude that our new fruit is a banana. We can now do this with any other piece of fruit we come across. Since we can precompute most of these values once and use them over and over, this classifier is simple and effective.
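
The same comparison can be sketched in a few lines of Python. The likelihoods and priors are the ones used in the calculation above; the dictionary layout and the `score` helper are just one hypothetical way to organize them. Orange is left out because its score in the example is 0.

```python
# Likelihoods and priors taken from the fruit example above.
fruit = {
    "banana": {"long": 0.8, "sweet": 0.7, "yellow": 0.9, "prior": 0.5},
    "other": {"long": 100 / 200, "sweet": 150 / 200, "yellow": 50 / 200, "prior": 200 / 1000},
}

def score(name):
    """Unnormalized naive Bayes score: product of the likelihoods times the prior."""
    f = fruit[name]
    return f["long"] * f["sweet"] * f["yellow"] * f["prior"]

scores = {name: score(name) for name in fruit}
best = max(scores, key=scores.get)

# The total of the scores plays the role of P(evidence), so dividing by it
# turns the scores into probabilities that sum to 1.
total = sum(scores.values())
posteriors = {name: s / total for name, s in scores.items()}

print(best)
print(scores)
print(posteriors)
```

Picking the class only needs the unnormalized scores; the normalization step is there in case you want actual probabilities.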

# Let's get started on our tool!

First, we need to start looking at some reviews.

We now have 1000 positive reviews and 1000 negative reviews in the format we want. We still need to break these into training and test sets:
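
One way to do the split, sketched with placeholder data. The toy `(text, label)` pairs, the 80/20 ratio, the fixed seed, and the `split` helper are all assumptions for illustration; the real reviews would come from `movie_reviews`.

```python
import random

# Placeholder data standing in for the 1000 positive and 1000 negative reviews.
positive = [(f"positive review {i}", "pos") for i in range(1000)]
negative = [(f"negative review {i}", "neg") for i in range(1000)]

def split(items, train_fraction=0.8):
    """Return (train, test) slices of items."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

rng = random.Random(0)  # fixed seed so the shuffle is repeatable
rng.shuffle(positive)
rng.shuffle(negative)

# Split each class separately so both sets stay balanced.
pos_train, pos_test = split(positive)
neg_train, neg_test = split(negative)
train = pos_train + neg_train
test = pos_test + neg_test

print(len(train), len(test))
```

Keeping the test set separate matters because the classifier should be judged on reviews it has never seen during training.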

Let's start looking at the reviews themselves.

Now, let's put this into our classifier through TextBlob's implementation of Naive Bayes:

In [None]:
import textblob
from textblob.classifiers import NaiveBayesClassifier

This might take a second. But when it's done, we have our algorithm trained!

In the meantime, let's go over what training and testing sets are.