# Natural Language Processing
So far, we have focused on data science involving quantitative and categorical variables. Today, we will learn how to analyze bodies of text with Natural Language Processing (NLP).

In [3]:
# Imports
import pandas as pd
import numpy as np
import nltk

# Download the nltk data packages used below
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [1]:
# Read in the IMDB Dataset and look at the first few rows
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-9/workshop/IMDB.csv'
reviews = pd.read_csv(url)
reviews.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In its raw form, the data has one feature (review) and a label (sentiment). Each row is a movie review that is either positive or negative.

As a review (ba dum tsss), let's start by creating our own feature: the length of the review in characters. We're going to use an apply function on the 'review' column.

In [5]:
# Create a 'length' column
reviews['length'] = reviews['review'].apply(len)
reviews.head()

Unnamed: 0,review,sentiment,length
0,One of the other reviewers has mentioned that ...,positive,1761
1,A wonderful little production. <br /><br />The...,positive,998
2,I thought this was a wonderful way to spend ti...,positive,926
3,Basically there's a family where a little boy ...,negative,748
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1317


Let's find the longest review (in terms of number of characters) and use that to learn some NLP skills.

In [6]:
# Use idxmax to find the index of the longest review
max_index = reviews['length'].idxmax()
max_index

31481

In [7]:
# Store the text of the longest review
# (idxmax returns an index label, so look it up with .loc)
txt = reviews['review'].loc[max_index]
txt

'Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba Ray and Spike Dudley started things off with a Tag Team Table Match against Eddie Guerrero and Chris Benoit. According to the rules of the match, both opponents have to go through tables in order to get the win. Benoit and Guerrero heated up early on by taking turns hammering first Spike and then Bubba Ray. A German suplex by Benoit to Bubba took the wind out of the Dudley brother. Spike tried to help his brother, but the referee restrained him while Benoit and Guerrero ganged up on him in the corner. With Benoit stomping away on Bubba, Guerrero set up a table outside. Spike dashed into the ring and somersaulted over the top rope onto Guerrero on the outside! After recovering and taking care of Spike, Guerrero slipped a table into the ring and helped the Wolverine set it up. The tandem then set up for a double superplex from the middle rope which would have put Bubba through the table, bu

Let's lowercase all the words so that words with different capitalizations are treated as the same word.

In [8]:
txt = txt.lower()

# Tokenize
"Tokenizing" is breaking a piece of text into smaller parts. The smaller parts of text are called "tokens."

In [9]:
# Import sent_tokenize and word_tokenize from nltk.tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

In [10]:
# Tokenize by sentence
sentence_tokens = sent_tokenize(txt)

# See how many sentences are in the review and preview the first 10
print(len(sentence_tokens))
sentence_tokens[:10]

147


['match 1: tag team table match bubba ray and spike dudley vs eddie guerrero and chris benoit bubba ray and spike dudley started things off with a tag team table match against eddie guerrero and chris benoit.',
 'according to the rules of the match, both opponents have to go through tables in order to get the win.',
 'benoit and guerrero heated up early on by taking turns hammering first spike and then bubba ray.',
 'a german suplex by benoit to bubba took the wind out of the dudley brother.',
 'spike tried to help his brother, but the referee restrained him while benoit and guerrero ganged up on him in the corner.',
 'with benoit stomping away on bubba, guerrero set up a table outside.',
 'spike dashed into the ring and somersaulted over the top rope onto guerrero on the outside!',
 'after recovering and taking care of spike, guerrero slipped a table into the ring and helped the wolverine set it up.',
 'the tandem then set up for a double superplex from the middle rope which would hav

In [11]:
# Tokenize by word
word_tokens = word_tokenize(txt)

# See how many words are in the review and preview the first 10
print(len(word_tokens))
word_tokens[:10]

2818


['match', '1', ':', 'tag', 'team', 'table', 'match', 'bubba', 'ray', 'and']
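For intuition about what the tokenizers are doing, here is a minimal pure-Python sketch of sentence and word tokenizing. It is only an illustration: the regular expressions and the names `sample`, `naive_sentences`, and `naive_words` are assumptions made for this example, and nltk's tokenizers handle many cases (abbreviations, contractions) that this naive version does not.

```python
import re

# Hypothetical two-sentence sample, already lowercased
sample = "spike dashed into the ring! guerrero set up a table outside."

# Naive sentence split: break at whitespace that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

# Naive word split: runs of word characters, or single punctuation marks
naive_words = re.findall(r"\w+|[^\w\s]", naive_sentences[0])

print(naive_sentences)
print(naive_words)
```

Notice that punctuation comes out as its own token, just like the ':' in the nltk output above.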

Tokenizing by word gives a list of all the words in the text. We can then count how often each word appears to get a sense of the main ideas.

In [12]:
# Convert to pd.Series and look at the value counts
pd.Series(word_tokens).value_counts()

the         198
.           117
and          98
to           85
,            78
           ... 
slipped       1
want          1
charisma      1
propped       1
grab          1
Length: 693, dtype: int64

Hmmmm. That's not very helpful. We don't really care about words like "the," "and," and "to," and punctuation tells us little about the content. Luckily, there's a really easy way to improve on this!

We'll start by removing the punctuation using *punctuation* from Python's built-in "string" module.

In [13]:
# Remove punctuation
import string
no_punc = []
for word in word_tokens:
  if word not in string.punctuation:
    no_punc.append(word)
no_punc[:20]

['match',
 '1',
 'tag',
 'team',
 'table',
 'match',
 'bubba',
 'ray',
 'and',
 'spike',
 'dudley',
 'vs',
 'eddie',
 'guerrero',
 'and',
 'chris',
 'benoit',
 'bubba',
 'ray',
 'and']
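One subtlety worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. It removes single punctuation characters, but multi-character tokens such as '``' or '...' can slip through. The sketch below compares that test with an alternative that keeps a token only if it contains a letter or digit. The token list is a hypothetical example, not taken from the review, and the alternative is one possible approach rather than the method used in this notebook.

```python
import string

# Hypothetical tokens, including multi-character punctuation
tokens = ['match', '1', ':', '``', 'bubba', "''", '...', "'s"]

# Substring test: only tokens that appear inside string.punctuation are dropped
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Alternative: keep a token only if it has at least one letter or digit
alnum_filtered = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(substring_filtered)
print(alnum_filtered)
```

The second filter drops all the punctuation-only tokens but still keeps "'s", which is why it can show up in word counts later.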

# Stop Words
A stop word is a commonly used word that does not convey much meaning.

Let's take a look at the stop words that nltk provides for us.

In [14]:
# Take a look at the nltk stopwords
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(stop_words)

{'won', 'she', 'between', 'while', 'his', "isn't", 'yourself', 'ain', 'ma', 'i', "it's", 'very', 'were', 're', 'did', 'am', 'here', 'through', "couldn't", 'ours', 'aren', 't', "should've", 'your', 'of', 'to', 'so', 'before', 'is', 'at', 'shouldn', 'nor', 'didn', "hasn't", 'themselves', 'more', 'shan', 'the', 'theirs', 'during', 'does', 'both', 'are', 'had', 'this', 've', 'my', 'yourselves', 'with', 'only', 'same', "don't", 'y', 'you', 'that', 'an', 'do', 'don', 'again', 'be', 'her', 'most', 'can', "you're", 'me', 'few', 'doing', 'm', 'how', 'when', 'he', 'himself', 'into', 'hasn', 'haven', 'yours', 'we', "mustn't", 'herself', 'and', 'some', "you've", 'them', "that'll", 'these', "won't", "didn't", 'our', "you'll", 'was', 'against', 'than', 'where', "she's", 'ourselves', 'been', 'in', 'once', 'why', 'should', 'if', 'until', 'which', 'under', "aren't", 'after', 'has', "haven't", 'mustn', 'a', 'needn', "doesn't", 'hadn', 'doesn', 'mightn', "shouldn't", "weren't", 'too', 'further', 'about',

Let's remove these from our tokenized text.

In [15]:
# Remove stop words
filtered_txt = [w for w in no_punc if w not in stop_words]
filtered_txt[:20]

['match',
 '1',
 'tag',
 'team',
 'table',
 'match',
 'bubba',
 'ray',
 'spike',
 'dudley',
 'vs',
 'eddie',
 'guerrero',
 'chris',
 'benoit',
 'bubba',
 'ray',
 'spike',
 'dudley',
 'started']

We can see that 'and' was removed because it is a stop word.
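The same filter-then-count pattern can be sketched without nltk or pandas. The stop list and tokens below are hypothetical and far smaller than the real ones; `collections.Counter` plays the role that `value_counts` plays in this notebook.

```python
from collections import Counter

# Hypothetical mini stop list and token list, for illustration only
toy_stop_words = {"and", "the", "to"}
toy_tokens = ["spike", "and", "bubba", "went", "to", "the",
              "ring", "and", "the", "ring"]

# Storing stop words in a set keeps each membership check fast
kept = [t for t in toy_tokens if t not in toy_stop_words]

# Count the remaining words and pull out the most frequent one
top_word = Counter(kept).most_common(1)
print(kept)
print(top_word)
```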

In [16]:
# Get value counts
pd.Series(filtered_txt).value_counts()

ring          27
rock          26
angle         26
's            23
taker         22
              ..
nwo            1
repeatedly     1
pledging       1
superstar      1
'd             1
Length: 598, dtype: int64

That's more like it! We're getting a sense of the important words in the text. We can still do better though!

If "sneak" is in the text, we would want to group it with "sneaking" because they express the same concept. We can do this with stemming!

# Stemming
Stemming is reducing words to their stem. Words like "start," "started," "starting," and "starts" all have the same stem. After stemming them, they will all become "start."

In [17]:
# Look at words 15-30 before stemming
filtered_txt[15:30]

['bubba',
 'ray',
 'spike',
 'dudley',
 'started',
 'things',
 'tag',
 'team',
 'table',
 'match',
 'eddie',
 'guerrero',
 'chris',
 'benoit',
 'according']

In [18]:
# Import the PorterStemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

# Stem all words in filtered_txt
stemmed = [ps.stem(w) for w in filtered_txt]

# Compare the previous output with the same words after stemming
stemmed[15:30]

['bubba',
 'ray',
 'spike',
 'dudley',
 'start',
 'thing',
 'tag',
 'team',
 'tabl',
 'match',
 'eddi',
 'guerrero',
 'chri',
 'benoit',
 'accord']

We can see that "according" has been shortened to "accord." "Started" has been stemmed to "start."
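Stemmers work by applying suffix-stripping rules rather than looking words up in a dictionary, which is why results like "tabl" are not real words. The toy function below is a deliberately crude sketch of that idea. It is not the Porter algorithm, and `toy_stem` and its three suffix rules are assumptions made up for this example.

```python
def toy_stem(word):
    """Strip one common suffix, keeping at least 3 characters of stem."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["start", "started", "starting", "starts"]
stems = [toy_stem(w) for w in words]
print(stems)

# Blind rules also make mistakes: "spring" is not a form of "spr"
print(toy_stem("spring"))
```

Real stemmers use many more rules and conditions, but they share the same trade-off: simple rules are fast and group related words, at the cost of occasional odd stems.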

In [19]:
pd.Series(stemmed).value_counts()[:20]

ring            27
rock            26
angl            26
's              23
taker           22
match           20
br              20
bubba           17
spike           15
tripl           15
tabl            14
back            13
benoit          13
brock           13
rvd             13
got             13
championship    12
nobl            12
kick            12
booker          12
dtype: int64

This gives us a much clearer picture of the important concepts in the text. However, we can see that it isn't perfect. Using computers to process text rarely is. The "'s" isn't helpful, and "br" is most likely left over from the "<br />" HTML line-break tags in the raw reviews. Interestingly, "booker" did not get stemmed to "book." The natural language libraries do the best they can, but the English language is complicated.

# Part of Speech Tagging
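Another cool feature of nltk is part-of-speech (POS) tagging, which labels each token with its grammatical role, such as noun, verb, or conjunction.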

In [20]:
tagged = nltk.pos_tag(no_punc)
tagged[:10]

[('match', 'NN'),
 ('1', 'CD'),
 ('tag', 'NN'),
 ('team', 'NN'),
 ('table', 'JJ'),
 ('match', 'NN'),
 ('bubba', 'NN'),
 ('ray', 'NN'),
 ('and', 'CC'),
 ('spike', 'NN')]

NN means singular noun. CD is a cardinal number. CC is a coordinating conjunction.

Here you can find a description of what the acronyms mean: https://www.guru99.com/pos-tagging-chunking-nltk.html

POS tagging comes in handy for chunking. "Chunking" is grouping similar words or phrases together based on the nature of the word or phrase. You can search for sequences of words of different types. For example, you could find verbs followed by nouns to get more information on the actions in the text. It involves regular expressions, which we have not covered yet. If NLP is something that interests you, "chunking" is definitely something to look into.
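As a taste of the verb-followed-by-noun idea, here is a small sketch that avoids regular expressions by walking through adjacent pairs of tagged tokens. The `tagged_sample` list is hand-written for illustration (the tags are assigned by hand, not produced by nltk), and this pairwise scan is a simplification of what full chunking does.

```python
# Hypothetical (word, tag) pairs in the same format that nltk.pos_tag returns
tagged_sample = [
    ('guerrero', 'NN'),
    ('slipped', 'VBD'),
    ('table', 'NN'),
    ('into', 'IN'),
    ('ring', 'NN'),
]

# Pair each token with the one after it, and keep verb -> noun pairs.
# Verb tags start with 'VB' and noun tags start with 'NN'.
verb_noun_pairs = [
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged_sample, tagged_sample[1:])
    if t1.startswith('VB') and t2.startswith('NN')
]
print(verb_noun_pairs)
```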

# Sentiment Analysis
Sentiment analysis is using computers to categorize opinions in text, especially to determine whether the attitude of the text is positive, negative, or neutral.

In [38]:
# Import TextBlob
from textblob import TextBlob

# Take a look at the review at index 1
positive = reviews['review'].iloc[1]

# Tokenize by sentence
positive_sent_tokens = sent_tokenize(positive)

In [22]:
# Get the sentiment of the first sentence
TextBlob(positive_sent_tokens[0]).sentiment

Sentiment(polarity=0.40625, subjectivity=0.75)

TextBlob reports two numbers: polarity, which ranges from -1 (most negative) to 1 (most positive), and subjectivity, which ranges from 0 (objective) to 1 (subjective).

## Your turn
Before having TextBlob tell you the sentiment of the sentences, take a look at the sentences for yourself and decide what you think the sentiment should be.

1. Find the polarity and subjectivity of the last sentence in the review we just looked at (index 1).
2. Find the polarity and subjectivity of the first sentence (index 0) of the review at index 49996.

In [23]:
# Take a look at the last sentence in positive_sent_tokens and judge the sentiment for yourself
# Remember that the index of the last element is "-1"
positive_sent_tokens[-1]

"It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done."

In [24]:
# Get the sentiment of the last sentence in review 1
TextBlob(positive_sent_tokens[-1]).sentiment

Sentiment(polarity=-0.17291666666666666, subjectivity=0.44791666666666663)

In [25]:
# Get the review at index 49996
negative = reviews["review"].iloc[49996]

# Tokenize by sentence
negative_sent_tokens = sent_tokenize(negative)

In [26]:
# Take a look at the first sentence (index 0) and judge the sentiment for yourself
negative_sent_tokens[0]

'Bad plot, bad dialogue, bad acting, idiotic directing, the annoying porn groove soundtrack that ran continually over the overacted script, and a crappy copy of the VHS cannot be redeemed by consuming liquor.'

In [27]:
# Get the sentiment of the first sentence
TextBlob(negative_sent_tokens[0]).sentiment

Sentiment(polarity=-0.5944444444444444, subjectivity=0.6222222222222222)

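As we can see, sentiment analysis works well on text that is clearly positive or negative, like the "Bad plot, bad dialogue" sentence. However, in some situations, it can get confused: the sentence ending in "terribly well done" is praise, yet TextBlob gave it a negative polarity.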
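Why would "terribly well done" come out negative? A plausible explanation is that lexicon-based sentiment tools score individual words and combine the scores, so a word like "terribly" can drag the result down even when it is used as an intensifier. The sketch below shows that effect with a made-up lexicon. The words, scores, and the `toy_polarity` function are all assumptions for illustration and are not TextBlob's actual lexicon or algorithm.

```python
# Hypothetical word scores, invented for this example
toy_lexicon = {"wonderful": 1.0, "bad": -0.7, "terribly": -1.0, "idiotic": -0.8}

def toy_polarity(text):
    """Average the scores of the words that appear in the toy lexicon."""
    words = text.lower().split()
    scores = [toy_lexicon[w] for w in words if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("a wonderful little production"))
print(toy_polarity("the sets are terribly well done"))
```

A human reads "terribly well done" as strong praise, but a word-by-word score only sees "terribly."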

# Bag of Words
A bag-of-words is a way of representing text as the frequency of certain words in a text. It involves a vocabulary of known words and a count for each of those words. The bag of words representation ignores the order of the words in the text. It is only concerned with the frequency of each word from the vocabulary.
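Before using scikit-learn, it can help to build a bag of words by hand. The sketch below uses two tiny made-up documents. The names `docs`, `vocabulary`, and `bow` are assumptions for this example, and splitting on spaces stands in for real tokenization.

```python
from collections import Counter

# Two hypothetical mini "reviews"
docs = ["good movie good cast", "bad movie bad plot"]

# The vocabulary is every distinct word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a row of counts, one column per vocabulary word
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

Word order is lost: "good movie good cast" and "cast good good movie" would produce exactly the same row.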

In [28]:
# sklearn imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [29]:
# Convert the review column to a list of reviews
corpus = reviews.review.to_list()

In [30]:
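# Use the CountVectorizer to convert to a BOW representation
# (recent scikit-learn versions expect stop_words as a list, not a set)
vectorizer = CountVectorizer(analyzer='word', stop_words=list(stop_words), max_features=100)
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())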

['10', 'acting', 'actors', 'actually', 'also', 'another', 'around', 'back', 'bad', 'best', 'better', 'big', 'br', 'cast', 'character', 'characters', 'comedy', 'could', 'director', 'end', 'enough', 'even', 'ever', 'every', 'fact', 'film', 'films', 'find', 'first', 'funny', 'get', 'give', 'go', 'going', 'good', 'got', 'great', 'horror', 'however', 'know', 'life', 'like', 'little', 'long', 'look', 'lot', 'love', 'made', 'make', 'makes', 'man', 'many', 'may', 'movie', 'movies', 'much', 'must', 'never', 'new', 'nothing', 'old', 'one', 'part', 'people', 'plot', 'pretty', 'quite', 'real', 'really', 'right', 'say', 'scene', 'scenes', 'see', 'seems', 'seen', 'series', 'show', 'something', 'still', 'story', 'take', 'thing', 'things', 'think', 'though', 'thought', 'time', 'two', 'us', 'want', 'watch', 'watching', 'way', 'well', 'work', 'world', 'would', 'years', 'young']


The feature names above are the words in the vocabulary. With max_features=100, the vectorizer keeps the 100 most frequent words across all reviews, after removing stop words.

In [31]:
X.toarray()

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 1],
       ...,
       [0, 1, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])

In [32]:
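# Convert to a DataFrame
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())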

Unnamed: 0,10,acting,actors,actually,also,another,around,back,bad,best,better,big,br,cast,character,characters,comedy,could,director,end,enough,even,ever,every,fact,film,films,find,first,funny,get,give,go,going,good,got,great,horror,however,know,...,old,one,part,people,plot,pretty,quite,real,really,right,say,scene,scenes,see,seems,seen,series,show,something,still,story,take,thing,things,think,though,thought,time,two,us,want,watch,watching,way,well,work,world,would,years,young
0,0,0,0,0,0,0,1,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,2,0,2,0,1,0,0,1,0,0,0,0,...,0,1,0,0,0,1,0,0,0,2,2,0,1,0,0,0,0,3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,1,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,6,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,...,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,3,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,1,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2,1,0,1,0,0,1,1,1,0,0,0,1,1
3,1,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
4,0,1,0,0,0,1,0,0,0,1,0,1,8,1,0,2,0,0,1,0,0,0,0,0,0,2,0,1,0,0,0,0,0,0,2,0,0,0,0,1,...,0,6,0,3,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,0,2,0,2,0,1,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,1,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,2,0,0,1,1,0,1,0,0,1,...,0,1,0,1,0,0,0,0,0,1,1,0,0,2,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
49996,0,1,0,0,0,0,0,0,4,0,2,0,4,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0
49997,0,1,0,0,0,0,0,1,1,0,0,0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,2,0,0,0,0,0,...,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,2,0,0,0,0,0,1,0,0,1,0,0
49998,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,...,0,4,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0


Now we can see the BOW much more clearly. Each row is still a review. But, instead of containing text, there is a column for each word in the vocabulary. Each cell represents the count of that vocabulary word in the review.

In [33]:
# Map y so that 'positive' is 1 and 'negative' is 0
y = reviews['sentiment'].map({'positive': 1, 'negative': 0})

In [34]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

While the cell below is running, it's a good time to talk about the advantages and disadvantages of different models.

In [35]:
# Fit a RandomForestClassifier (this takes a little while)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [36]:
# Predict on the testing data and compare to the actual
predicted = clf.predict(X_test)
actual = np.array(y_test)

print(predicted[:10])
print(actual[:10])

[1 0 0 1 0 1 1 0 0 1]
[1 1 0 1 0 1 1 1 0 0]


In [37]:
# accuracy_score expects the true labels first, then the predictions
acc = accuracy_score(actual, predicted)
acc

0.7245

Not bad, considering that the reviews are 50% positive and 50% negative, so always guessing the same label would only score about 50%!
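To make the accuracy number concrete, the sketch below recomputes accuracy by hand on the ten predictions and ten actual labels printed earlier, then compares it with a baseline that always guesses the most common label. The variable names are assumptions for this example, and ten rows are far too few to judge a model; this only shows how the metric works.

```python
# The first ten predictions and actual labels, copied from the printed output
predicted_10 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
actual_10 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]

# Accuracy is the share of positions where prediction and truth agree
matches = sum(p == a for p, a in zip(predicted_10, actual_10))
sample_accuracy = matches / len(actual_10)

# Baseline: always predict whichever label is most common in the sample
majority = max(set(actual_10), key=actual_10.count)
baseline_accuracy = actual_10.count(majority) / len(actual_10)

print(sample_accuracy)
print(baseline_accuracy)
```

A model is only useful to the extent that it beats a simple baseline like this one.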