What is Natural Language Processing?

Plug NLTK

In [44]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

This opens the NLTK downloader in a separate window. From there, download:

corpora -> movie_reviews

corpora -> stopwords

all packages -> punkt

corpora -> wordnet

Import everything we need. Each piece is explained as we use it.

In [45]:
import nltk.classify.util
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

Break apart a sentence (tokenize)

In [46]:
sentence = "This is a test sentence. It will break everything apart!"
sentence.split(" ")

['This',
 'is',
 'a',
 'test',
 'sentence.',
 'It',
 'will',
 'break',
 'everything',
 'apart!']

However, splitting on spaces leaves punctuation attached to the words ('sentence.' and 'apart!'). Luckily, NLTK has a solution!

In [47]:
sentence = "This is a test sentence. It will break everything apart!"
word_tokenize(sentence)

['This',
 'is',
 'a',
 'test',
 'sentence',
 '.',
 'It',
 'will',
 'break',
 'everything',
 'apart',
 '!']
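To get a feel for what a tokenizer does beyond splitting on spaces, here is a minimal sketch using only a regular expression. It is not how `word_tokenize` works internally, and `simple_tokenize` is a hypothetical helper written for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "This is a test sentence. It will break everything apart!"
tokens = simple_tokenize(sentence)
print(tokens)
```

On this sentence the sketch gives the same tokens as the NLTK output above. It falls short on harder cases such as contractions ("don't" becomes three tokens here), which is one reason to prefer `word_tokenize`.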

We can also have NLTK determine the part of speech (POS) for each token in our sentence:

In [48]:
sentence = "This is a test sentence. It will tag everything!"
w = word_tokenize(sentence)
nltk.pos_tag(w)

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('test', 'NN'),
 ('sentence', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('will', 'MD'),
 ('tag', 'VB'),
 ('everything', 'NN'),
 ('!', '.')]

These are Penn Treebank tags: NN means singular noun, DT means determiner, VB means a verb in its base form, JJ means adjective, and so on.
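As a reading aid, the tags that show up in the output above can be put into a small lookup table. This is a hand-written sketch covering only a few Penn Treebank tags; `TAG_MEANINGS` and `describe` are names made up for this example, not part of NLTK.

```python
# Meanings of a few Penn Treebank tags (hand-written, not exhaustive).
TAG_MEANINGS = {
    "DT": "determiner",
    "VBZ": "verb, 3rd person singular present",
    "NN": "noun, singular or mass",
    "PRP": "personal pronoun",
    "MD": "modal",
    "VB": "verb, base form",
    "JJ": "adjective",
    ".": "sentence-ending punctuation",
}

def describe(tagged_tokens):
    """Replace each tag in a list of (word, tag) pairs with a readable meaning."""
    return [(word, TAG_MEANINGS.get(tag, "unknown tag")) for word, tag in tagged_tokens]

# The first few (word, tag) pairs copied from the pos_tag output above.
tagged = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NN"), ("sentence", "NN"), (".", ".")]
for word, meaning in describe(tagged):
    print(f"{word:10} {meaning}")
```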

You can use NLTK to find the definitions, synonyms, and antonyms of words.

Some words have multiple definitions and multiple parts of speech depending on usage. This can get complicated very fast, so for our purposes we're just going to assume that NLTK knows what it's doing.
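A rough sketch of how such a lookup might be written. `related_words` is a hypothetical helper; it expects the list returned by `wordnet.synsets(word)` and relies only on the `lemmas()`, `name()`, and `antonyms()` methods of WordNet synsets and lemmas.

```python
def related_words(synsets):
    """Collect synonym and antonym names from a list of WordNet synsets."""
    synonyms, antonyms = set(), set()
    for synset in synsets:              # one synset per sense of the word
        for lemma in synset.lemmas():   # each lemma is a word form sharing that sense
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return sorted(synonyms), sorted(antonyms)

# Usage inside this notebook (needs the wordnet download from earlier):
#   synonyms, antonyms = related_words(wordnet.synsets("good"))
#   print(wordnet.synsets("good")[0].definition())
```

Because a word can have several senses, the helper pools the lemmas of every synset rather than picking one sense. That is a simplification: synonyms of unrelated senses end up in the same list.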

The next step is removing stopwords. Stopwords are words that serve a grammatical purpose but carry little meaning (the, a, I, is, etc.).

So, let's remove them. The issue is that over 100 English words are considered stopwords. Unless we want to write out a list of stopwords by hand and check every word against it each time we write a program, we need a better solution.

So, let's just let NLTK remove them for us.

In [49]:
para = "Our symposium will also include two rounds of workshops with several choices in each round- so you can brush up on your Python, learn about data visualization, or deepen your knowledge of machine learning. "
words = word_tokenize(para)
print(words)

['Our', 'symposium', 'will', 'also', 'include', 'two', 'rounds', 'of', 'workshops', 'with', 'several', 'choices', 'in', 'each', 'round-', 'so', 'you', 'can', 'brush', 'up', 'on', 'your', 'Python', ',', 'learn', 'about', 'data', 'visualization', ',', 'or', 'deepen', 'your', 'knowledge', 'of', 'machine', 'learning', '.']


In [50]:
words = word_tokenize(para)
useful_words = [word for word in words if word not in stopwords.words('english')]
print(useful_words)

['Our', 'symposium', 'also', 'include', 'two', 'rounds', 'workshops', 'several', 'choices', 'round-', 'brush', 'Python', ',', 'learn', 'data', 'visualization', ',', 'deepen', 'knowledge', 'machine', 'learning', '.']


Notice that "Our" was not removed even though "our" is a stopword. That's because NLTK's list is all lowercase. So let's convert our paragraph to lowercase first so it doesn't miss any.

In [51]:
para = para.lower()
words = word_tokenize(para)
useful_words = [word for word in words if word not in stopwords.words('english')]
print(useful_words)

['symposium', 'also', 'include', 'two', 'rounds', 'workshops', 'several', 'choices', 'round-', 'brush', 'python', ',', 'learn', 'data', 'visualization', ',', 'deepen', 'knowledge', 'machine', 'learning', '.']


This shorter list means we have fewer words to process, which saves time without losing much meaning.
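The filtering itself can also be made cheaper. In the cells above, `stopwords.words('english')` is evaluated again for every word in the list comprehension. A possible variation, sketched here with a small hand-written stand-in for NLTK's list, builds a set once, compares in lowercase without changing the original text, and drops bare punctuation as well. `STOP_WORDS` and `remove_stopwords` are names invented for this example.

```python
import string

# Stand-in for set(stopwords.words('english')); the real NLTK list is much longer.
STOP_WORDS = {"our", "will", "of", "with", "in", "each", "so", "you",
              "can", "up", "on", "your", "about", "or"}
PUNCTUATION = set(string.punctuation)

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop stopwords (compared in lowercase) and single punctuation tokens."""
    return [
        token for token in tokens
        if token.lower() not in stop_words and token not in PUNCTUATION
    ]

tokens = ["Our", "symposium", "will", "also", "include", "two", "rounds",
          "of", "workshops", ",", "learn", "about", "Python", "."]
print(remove_stopwords(tokens))
```

Set membership is a constant-time check, and lowercasing only for the comparison keeps words like "Python" capitalized in the result.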

Now, let's start creating our tool.

Machine learning works by learning from one set of data and then applying what it has learned to a new set of data.

So, the first thing we need is some data. This data can be tweets, reviews, books, anything really. We're going to be using movie reviews today.

Let's explore the data a little bit:

In [52]:
from nltk.corpus import movie_reviews
movie_reviews.words()

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

In [None]:
movie_reviews.fileids()[:4]

Let's see what the most common words are in our data:

In [54]:
all_words = movie_reviews.words()
freq_dist = nltk.FreqDist(all_words)
freq_dist.most_common(20)

[(',', 77717),
 ('the', 76529),
 ('.', 65876),
 ('a', 38106),
 ('and', 35576),
 ('of', 34123),
 ('to', 31937),
 ("'", 30585),
 ('is', 25195),
 ('in', 21822),
 ('s', 18513),
 ('"', 17612),
 ('it', 16107),
 ('that', 15924),
 ('-', 15595),
 (')', 11781),
 ('(', 11664),
 ('as', 11378),
 ('with', 10792),
 ('for', 9961)]

Notice how a lot of these are stopwords or punctuation. It's a good thing we know how to remove those!
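To illustrate the idea, here is a small sketch that counts words only after filtering. It uses `collections.Counter` (which `nltk.FreqDist` builds on) and a made-up sentence with a tiny hand-written stopword set, so the names and data below are assumptions for illustration rather than the movie review corpus.

```python
from collections import Counter
import string

# Tiny hand-written stopword set, standing in for NLTK's English list.
STOP_WORDS = {"the", "a", "and", "of", "to", "is", "in", "it", "that"}
PUNCTUATION = set(string.punctuation)

def most_common_content_words(tokens, n):
    """Count lowercase tokens, skipping stopwords and single punctuation marks."""
    counts = Counter(
        token.lower() for token in tokens
        if token.lower() not in STOP_WORDS and token not in PUNCTUATION
    )
    return counts.most_common(n)

# Made-up, pre-tokenized sample text.
sample = "the plot of the film is thin , and the film drags , but the cast is good .".split()
print(most_common_content_words(sample, 3))
```

Applied to `movie_reviews.words()` with NLTK's full stopword list, the same pattern should push content words to the top of the frequency table.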

We now have most of the tools we need. Let's get started.

We're going to use a Naive Bayes classifier. Due to time constraints, we won't get into how the classifier works; we'll just treat it as a black box.

Let's start looking at the reviews themselves.

In [55]:
# Pair each negative review's word list with the label "neg".
neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append((words, "neg"))
print(neg_reviews[0])
print(len(neg_reviews))

# Do the same for the positive reviews, labeled "pos".
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append((words, "pos"))
print(len(pos_reviews))

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...], 'neg')
1000
1000


We now have 1000 positive reviews and 1000 negative reviews in the format we want. We still need to break these into training and test sets:

In [60]:
train_set = neg_reviews[:300] + pos_reviews[:300]
# Hold out the last 100 reviews of each class, giving the 200 test reviews printed below.
test_set = neg_reviews[900:] + pos_reviews[900:]
print(len(train_set),  len(test_set))

600 200
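The split above takes reviews in corpus order. One possible variation, not what this notebook does, is to shuffle each class first so the split does not depend on file order. The sketch below uses placeholder reviews, and `split_by_class` is a hypothetical helper.

```python
import random

def split_by_class(neg, pos, n_train, n_test, seed=0):
    """Shuffle each class, then take n_train and n_test items from each."""
    if n_train + n_test > min(len(neg), len(pos)):
        raise ValueError("not enough reviews for the requested split")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    neg, pos = list(neg), list(pos)
    rng.shuffle(neg)
    rng.shuffle(pos)
    train = neg[:n_train] + pos[:n_train]
    test = neg[n_train:n_train + n_test] + pos[n_train:n_train + n_test]
    return train, test

# Placeholder data standing in for neg_reviews and pos_reviews.
neg = [(f"neg review {i}", "neg") for i in range(1000)]
pos = [(f"pos review {i}", "pos") for i in range(1000)]
train_set, test_set = split_by_class(neg, pos, n_train=300, n_test=100)
print(len(train_set), len(test_set))
```

Taking the test slice from positions after the training slice guarantees that no review appears in both sets, and drawing the same number from each class keeps both sets balanced.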


Now, let's feed this into our classifier, using TextBlob's implementation of Naive Bayes:

In [61]:
import textblob
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train_set)

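This might take a while. But when it's done, our classifier is trained!

In the meantime, let's go over what training and testing sets are. The training set is the data the classifier learns from. The test set is held back during training and used afterwards to check how well the classifier handles reviews it has not seen.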

In [72]:
cl.accuracy(test_set)

0.81
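An accuracy of 0.81 means that about 81% of the test reviews received the correct label. As a sketch of what such a score measures, assuming accuracy is simply the fraction of predictions that match the true labels, the calculation can be written out by hand. The labels below are made up, and `accuracy` here is a hypothetical standalone function, not TextBlob's method.

```python
def accuracy(predicted, actual):
    """Return the fraction of predicted labels that equal the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up labels: four of the five predictions match.
predicted = ["pos", "neg", "pos", "pos", "neg"]
actual = ["pos", "neg", "neg", "pos", "neg"]
print(accuracy(predicted, actual))
```

To label a new piece of text with the trained classifier, TextBlob classifiers also provide a `classify` method.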