# Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)

Adapted from online exercise found here: https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk

## Step 1 — Installing NLTK and Downloading the Data

You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

**First, install the NLTK package with the pip package manager:**

```pip install nltk==3.3```
 
This tutorial will use sample tweets that are part of the NLTK package.

In [1]:
# import the nltk module in the python interpreter
import nltk

In [2]:
# Download the sample tweets from the NLTK package
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\clorh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


True

Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

## Step 2 — Tokenizing the Data

Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first step in making sense of the data is tokenization, the process of splitting strings into smaller parts called tokens.

A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
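To make the whitespace-and-punctuation idea concrete, here is a minimal sketch that uses only the standard library. The function name `simple_tokenize` is hypothetical and is not part of NLTK; it is meant only to show what a naive tokenizer does.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world! #nlp"))
print(simple_tokenize("week :)"))
```

Notice that this naive approach splits `#nlp` into `#` and `nlp`, and breaks the emoticon `:)` into two tokens. This is one reason a tweet-aware tokenizer is a better fit for this dataset.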

In [4]:
# import the twitter_samples
from nltk.corpus import twitter_samples

This will import three datasets from NLTK that contain various tweets to train and test the model:

- negative_tweets.json: 5000 tweets with negative sentiments
- positive_tweets.json: 5000 tweets with positive sentiments
- tweets.20150430-223406.json: 20000 tweets with no sentiments

In [5]:
# store tweets in variables
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

The `strings()` method of `twitter_samples` returns all of the tweets within a dataset as a list of strings. Storing the different tweet collections in variables will make processing and testing easier.

### Download additional resources

Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it.

NLTK provides a default tokenizer for tweets with the `.tokenized()` method.

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\clorh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [7]:
# example of a token
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
print(tweet_tokens[0])

#FollowFriday


In [8]:
# Tokenize all tweets
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

## Step 3 — Normalizing the Data

Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. **Normalization** in NLP is the process of converting a word to its canonical form.

Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

**Stemming** is a process of removing affixes from a word. Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

In this tutorial you will use the process of **lemmatization**, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. 

A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.
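The trade-off can be illustrated with a toy sketch. Both functions below are hypothetical simplifications written for this example: `toy_stem` is a crude suffix stripper (not a real stemming algorithm such as Porter's), and `toy_lemmatize` uses a tiny hand-written lookup table in place of a lexical database.

```python
def toy_stem(word):
    """Strip a few common suffixes using fixed rules (a crude heuristic)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Collapse a doubled final letter, e.g. "runn" -> "run"
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

# Hypothetical lookup table standing in for a real lexical database
TOY_LEMMAS = {"ran": "run", "runs": "run", "running": "run"}

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ("ran", "runs", "running"):
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

The rule-based stemmer handles "runs" and "running" but leaves the irregular form "ran" untouched, whereas the lookup-based approach maps all three to "run" at the cost of needing vocabulary knowledge.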

### Download resources required for lemmatization

`wordnet` is a lexical database for the English language that helps the script determine the base word. 

You need the `averaged_perceptron_tagger` resource to determine the context of a word in a sentence.

In [9]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\clorh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\clorh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

You are almost ready to use the lemmatizer. But before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. This can be done using the `pos_tag` function.

In [12]:
from nltk.tag import pos_tag
from pprint import pprint #improves formatting of printed results

pprint(pos_tag(tweet_tokens[0]))

[('#FollowFriday', 'JJ'),
 ('@France_Inte', 'NNP'),
 ('@PKuchly57', 'NNP'),
 ('@Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':)', 'NN')]


From the list of tags, here is the list of the most common items and their meaning:
- **NNP**: Noun, proper, singular  
- **NN**: Noun, common, singular or mass  
- **IN**: Preposition or conjunction, subordinating  
- **VBG**: Verb, gerund or present participle  
- **VBN**: Verb, past participle  

Here is a [full list of the tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

In general, if a tag starts with NN, the word is a noun and if it starts with VB, the word is a verb. 
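This prefix rule can be sketched as a small helper. The name `penn_to_wordnet` is hypothetical; the return values `'n'`, `'v'`, and `'a'` are the part-of-speech codes this tutorial passes to the lemmatizer, and treating every other tag as an adjective mirrors the tutorial's simplification rather than a linguistic rule.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the part-of-speech code used for lemmatizing."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    return "a"      # everything else is treated as an adjective

# A few (word, tag) pairs taken from the tagged tweet above
tagged = [("being", "VBG"), ("members", "NNS"), ("top", "JJ"), ("for", "IN")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```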

### Import and use lemmatizer

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer

The function `lemmatize_sentence` first gets the part-of-speech tag of each token of a tweet. Within the `if` statement, if the tag starts with `NN`, the token is treated as a noun. Similarly, if the tag starts with `VB`, the token is treated as a verb. Any other token is passed to the lemmatizer as an adjective (`'a'`).

In [48]:
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

# print a sample result
pprint(lemmatize_sentence(tweet_tokens[0]))

['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'be',
 'top',
 'engage',
 'member',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']


## Step 4 — Removing Noise from the Data

In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.
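As a minimal sketch of stop word filtering, the snippet below uses a tiny hand-picked set built from the three examples above. The names `TOY_STOP_WORDS` and `drop_stop_words` are hypothetical, and a real stop word list is much longer.

```python
# Hypothetical, deliberately tiny stop word set for illustration
TOY_STOP_WORDS = {"is", "the", "a"}

def drop_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

print(drop_stop_words(["The", "weather", "is", "a", "delight"]))
```

Comparing the lowercase form means that "The" and "the" are both filtered, which is the same approach the noise removal function later in this step takes.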

In this tutorial, you will use regular expressions in Python to search for and remove these items:

- **Hyperlinks** - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
- **Twitter handles in replies** - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.
- **Punctuation and special characters** - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

To remove hyperlinks, you need to first search for a substring that matches a URL starting with `http://` or `https://`, followed by letters, numbers, or special characters. Once a pattern is matched, the `.sub()` method replaces it with an empty string.

### Create function to remove noise

In [16]:
import re, string

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

This code creates a `remove_noise()` function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with `http://` or `https://`, followed by letters, numbers, or special characters. Once a pattern is matched, the `.sub()` method replaces it with an empty string, or `''`.

Similarly, to remove `@` mentions, the code substitutes the relevant part of text using regular expressions. The code uses the `re` library to search `@` symbols, followed by numbers, letters, or `_`, and replaces them with an empty string.

Finally, you can remove punctuation using the library `string`.
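The substitutions can be tried in isolation with the sketch below. It is not the tutorial's function: `strip_noise` is a hypothetical name, the URL pattern is a simplified assumption (`https?://\S+`) rather than the longer pattern used in `remove_noise()`, and lemmatization and stop word removal are left out.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")        # simplified assumption
MENTION_PATTERN = re.compile(r"@[A-Za-z0-9_]+")  # same idea as in remove_noise()

def strip_noise(tokens):
    """Remove URLs, @ mentions, and punctuation-only tokens."""
    cleaned = []
    for token in tokens:
        token = URL_PATTERN.sub("", token)
        token = MENTION_PATTERN.sub("", token)
        # Skip tokens that became empty or that are punctuation
        if token and token not in string.punctuation:
            cleaned.append(token)
    return cleaned

sample = ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame',
          '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
print(strip_noise(sample))
```

Note that `token not in string.punctuation` is a substring test on a string. A multi-character token is dropped only if it appears contiguously inside `string.punctuation`, so tokens such as `:D` and `...` are kept.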

Since the `remove_noise()` function also normalizes word forms, the `lemmatize_sentence()` function is no longer needed.

### Remove stop words

In addition to removing noise, you will also remove stop words. NLTK has a built-in set of stop words, which needs to be downloaded separately.

In [17]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\clorh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [18]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

pprint(remove_noise(tweet_tokens[0], stop_words))

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']


Notice that the function removes all `@` mentions and stop words, and converts the words to lowercase.

### Clean tweets

In [19]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
    
# Compare original tokens to the cleaned tokens for a sample tweet
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']


There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one token, and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine-tune the noise removal process for your specific data.
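One possible way to handle elongated words is to cap how many times a character may repeat. This is only a sketch of one option: the helper `squeeze_repeats` and its `max_run` parameter are hypothetical and are not used elsewhere in this tutorial.

```python
import re

def squeeze_repeats(token, max_run=2):
    """Shorten any run of a repeated character to at most max_run copies."""
    # (.)\1{max_run,} matches a character followed by at least max_run more copies of it
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, r"\1" * max_run, token)

for token in ("Hi", "Hii", "Hiiiii"):
    print(token, "->", squeeze_repeats(token))

# A stricter cap merges more variants but can damage ordinary words
print(squeeze_repeats("Hiiiii", max_run=1), squeeze_repeats("good", max_run=1))
```

With the default cap of two, "Hii" and "Hiiiii" collapse to the same token while "Hi" stays distinct. A cap of one merges all three, but it also turns "good" into "god", so the right setting depends on your data.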

## Step 5 — Determining Word Density

The most basic form of analysis on textual data is to examine word frequency. A single tweet is too small to reveal a meaningful distribution of words, so you will analyze word frequency across all positive tweets.

The following snippet defines a **generator function**, named `get_all_words`, that takes a list of cleaned tweet token lists as an argument and yields every token from every tweet, one at a time.

In [33]:
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the `FreqDist` class of `NLTK`. The `.most_common()` method lists the words which occur most frequently in the data. 
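`FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The toy tweets below are made up for illustration. The sketch also shows why a generator has to be re-created before it can be used a second time.

```python
from collections import Counter

def get_all_words(cleaned_tokens_list):
    """Yield every token from every tweet, one at a time."""
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

# Hypothetical cleaned tweets
toy_tweets = [["thanks", ":)"], ["love", ":)"], ["thanks", "follow", ":)"]]

words = get_all_words(toy_tweets)
counts = Counter(words)          # consumes the generator
print(counts.most_common(2))

# The generator is now exhausted, so a second pass yields nothing
print(list(words))
```

Because the counts live in the `Counter` object, you can keep querying it after the generator is exhausted. You only need a fresh generator if you want to iterate over the words again.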

In [34]:
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)

# example of first 10 most common words
pprint(freq_dist_pos.most_common(10))

# re-create the generator
all_pos_words = get_all_words(positive_cleaned_tokens_list)
freq_dist_pos = FreqDist(all_pos_words)

[(':)', 3691),
 (':-)', 701),
 (':d', 658),
 ('thanks', 388),
 ('follow', 357),
 ('love', 333),
 ('...', 290),
 ('good', 283),
 ('get', 263),
 ('thank', 253)]


From this data, you can see that emoticon entities form some of the most common parts of positive tweets. 

To summarize, you extracted the tweets from `nltk`, then tokenized, normalized, and cleaned them for use in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

In the next step you will prepare data for sentiment analysis.

## Step 6 — Preparing Data for the Model

**Sentiment analysis** is the process of identifying the author's attitude toward the topic being written about. You will create a training dataset to train a model. This is a supervised machine learning process, which requires you to associate each tweet in the dataset with a “sentiment” label for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

A **model** is a description of a system using rules and equations. It may be as simple as an equation which predicts the weight of a person, given their height. The sentiment analysis model that you will build would associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts. The purpose of the first part is to build the model, whereas the next part tests the performance of the model.

In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

### Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the **Naive Bayes classifier** in `NLTK` to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and `True` as values. 

The following code makes a generator function to change the format of the cleaned data.

It converts the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and `True` as values. The corresponding dictionaries are stored in `positive_tokens_for_model` and `negative_tokens_for_model`.

In [45]:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
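To see what a single converted tweet looks like, here is a small sketch on made-up tokens. It uses a dictionary comprehension, which is an equivalent way to write the same conversion.

```python
def get_tweets_for_model(cleaned_tokens_list):
    """Yield one {token: True} dictionary per tweet."""
    for tweet_tokens in cleaned_tokens_list:
        yield {token: True for token in tweet_tokens}

# Hypothetical cleaned tweet in which "top" appears twice
toy_cleaned = [["top", "engage", "member", "top"]]

features = list(get_tweets_for_model(toy_cleaned))
print(features[0])
```

Since dictionary keys are unique, a token that appears twice in a tweet is recorded only once. The features capture whether a token is present, not how often it occurs.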

### Splitting the Dataset for Training and Testing the Model

Prepare the data for training the NaiveBayesClassifier class.

In [46]:
import random

# assign label of 'positive' or 'negative'
positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

# join both into a single dataset
dataset = positive_dataset + negative_dataset

# remove bias by shuffling the dataset records
random.seed(456) # set seed for reproducibility
random.shuffle(dataset)

# split into training and test data (70/30 split)
train_data = dataset[:7000]
test_data = dataset[7000:]

This code attaches a `Positive` or `Negative` label to each tweet. It then creates a dataset by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the `.shuffle()` method of `random`.

Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.
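If your dataset has a different size, you may prefer to compute the cut-off from the ratio instead of hard-coding `7000`. The helper below is a hypothetical sketch, not part of the tutorial's code; it uses its own `random.Random` instance so that the seed does not affect other code.

```python
import random

def shuffle_and_split(records, train_fraction=0.7, seed=456):
    """Return (train, test) lists after a seeded shuffle."""
    rng = random.Random(seed)   # private generator, reproducible for a given seed
    shuffled = list(records)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled records: five positive, then five negative
toy_dataset = [({"w%d" % i: True}, "Positive" if i < 5 else "Negative")
               for i in range(10)]

train, test = shuffle_and_split(toy_dataset)
print(len(train), len(test))
```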

## Step 7 — Building and Testing the Model

Finally, use the `NaiveBayesClassifier` class to build the model. Use the `.train()` method to train the model and the `.accuracy()` method to test the model on the testing data.

In [47]:
from nltk import classify
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features() prints the table itself and returns None,
# so it does not need to be wrapped in print()
classifier.show_most_informative_features(10)

Accuracy is: 0.9963333333333333
Most Informative Features
                      :( = True           Negati : Positi =   2067.5 : 1.0
                      :) = True           Positi : Negati =   1655.0 : 1.0
                     sad = True           Negati : Positi =     36.1 : 1.0
                     bam = True           Positi : Negati =     23.5 : 1.0
                 welcome = True           Positi : Negati =     22.0 : 1.0
                follower = True           Positi : Negati =     21.6 : 1.0
                   enjoy = True           Positi : Negati =     20.8 : 1.0
                     x15 = True           Negati : Positi =     20.5 : 1.0
                    glad = True           Positi : Negati =     13.3 : 1.0
                    blog = True           Positi : Negati =     12.9 : 1.0


**Accuracy** is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. A 99.6% accuracy on the test set is pretty good.
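The definition can be written out in a few lines. This is a sketch of the calculation, not NLTK's implementation, and the predicted and actual labels below are made up.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels: three of the four predictions are correct
predicted = ["Positive", "Negative", "Positive", "Negative"]
actual = ["Positive", "Negative", "Negative", "Negative"]
print(accuracy(predicted, actual))
```

Applied to the result above, an accuracy of 0.9963 on 3000 test tweets corresponds to 2989 correct predictions and 11 errors.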

In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token `:(`, the ratio of negative to positive tweets was `2067.5` to `1`. Interestingly, it seems that there was one token with `:(` in the positive datasets. You can see that the top two discriminating items in the text are the emoticons. Further, words such as `sad` lead to negative sentiments, whereas `welcome` and `glad` are associated with positive sentiments.
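To get a feel for where such ratios come from, the sketch below estimates how likely a token is to appear in tweets of each label and divides the two estimates. It is a simplified illustration under stated assumptions: the function `likelihood_ratio`, the add-one smoothing, and the toy data are all hypothetical, and NLTK's own probability estimator may use different smoothing, so the numbers will not match the table above.

```python
from collections import Counter

def likelihood_ratio(token, labeled_tweets, label_a, label_b, smoothing=1.0):
    """Estimate P(token present | label_a) / P(token present | label_b)."""
    containing = Counter()  # tweets per label that contain the token
    totals = Counter()      # tweets per label
    for features, label in labeled_tweets:
        totals[label] += 1
        if features.get(token):
            containing[label] += 1
    # Two outcomes per feature (present or absent), hence 2 * smoothing
    p_a = (containing[label_a] + smoothing) / (totals[label_a] + 2 * smoothing)
    p_b = (containing[label_b] + smoothing) / (totals[label_b] + 2 * smoothing)
    return p_a / p_b

# Hypothetical training data: ":(" occurs in 3 of 4 negative tweets and in no positive tweet
toy_data = [
    ({":(": True, "sad": True}, "Negative"),
    ({":(": True}, "Negative"),
    ({":(": True, "miss": True}, "Negative"),
    ({"tired": True}, "Negative"),
    ({":)": True, "glad": True}, "Positive"),
    ({":)": True}, "Positive"),
    ({"welcome": True}, "Positive"),
    ({":)": True, "enjoy": True}, "Positive"),
]

ratio = likelihood_ratio(":(", toy_data, "Negative", "Positive")
print(f"Negative : Positive = {ratio:.1f} : 1.0")
```

In this sketch, smoothing keeps the ratio finite even though `:(` never appears in a positive toy tweet. Without it, the denominator would be zero.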

Next, you can check how the model performs on random tweets from Twitter. 

This code will allow you to test custom tweets by updating the string associated with the `custom_tweet` variable.

#### Test Negative Tweet

In [39]:
from nltk.tokenize import word_tokenize

# update this variable to any string you want to analyze
custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Negative


#### Test Positive Tweet

In [40]:
# update this variable to any string you want to analyze
custom_tweet = "Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies"

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


#### Test Sarcastic Tweet

In [41]:
custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. 

If you want your model to detect sarcasm, you would need to provide a sufficient amount of training data that includes sarcastic examples.

#### Test Random Tweet

In [103]:
custom_tweet = text[random.randint(0, len(text) - 1)]
custom_tokens = remove_noise(word_tokenize(custom_tweet))
print(custom_tweet, '\n', classifier.classify(dict([token, True] for token in custom_tokens)))

RT @Different_Name_: Ed Miliband is saying he would rather pass power to the Tories than accept Scottish democracy. Wow. Just wow.  #bbcqt 
 Negative


## Conclusion

This tutorial introduced you to a basic sentiment analysis model using the `nltk` library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you examined the most frequently occurring tokens in the data. Finally, you built a model to associate tweets to a particular sentiment.

A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories like excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s [a detailed guide on various considerations](https://monkeylearn.com/sentiment-analysis/) that one must take care of while performing sentiment analysis.