# Twitter Sentiment Analysis

This notebook shows how to build a classification model that performs sentiment analysis on tweets. The goal is to determine the *Polarity*, *Positive* or *Negative*, of each tweet coming from a real-time Twitter API.

It is part of a larger project available on my GitHub: [twitter-sentiment-analysis](https://github.com/redouane-dev/twitter-sentiment-analysis).

This notebook is inspired by [this DigitalOcean tutorial](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk).

###### Steps:

1. Install the NLTK package (Natural Language Toolkit) and additional libraries to process tweets.
2. Load datasets of positive and negative tweets.
3. Tokenize, normalize, and remove noise and stopwords from each tweet.
4. Assemble the cleaned data into a dataset and split it into training and testing sets.
5. Train a Naive Bayes classification model and validate it.

# Install NLTK and Dependencies

To repeat the experiment, we will need to install Jupyter and the dependencies of this project.

The following steps are optional:

```bash
# Create a virtual environment to isolate this project from other Python projects and avoid dependency conflicts
virtualenv -p python3 venv

# Activate your virtual env. You will see a (venv) before your usual terminal prompt
source venv/bin/activate

# If you want to use Jupyter and have it installed in this virtual environment
pip install jupyter
```

Then comes the installation part:

```bash
# Install the single main dependency
pip install nltk==3.4.5
```

...and voila!

Or almost. We still need to download the NLTK resources that will help us process the tweets.

In [5]:
import nltk

nltk.download('punkt')        # Contains a pre-trained model to help tokenize sentences into single words
nltk.download('wordnet')      # Lexical database that will be used during normalization
nltk.download('averaged_perceptron_tagger')    # Tagger to find nature of words (verb, noun, ...)
nltk.download('stopwords')    # Lists of common words (stopwords) to filter out during de-noising

[nltk_data] Downloading package punkt to /home/peaceful/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/peaceful/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/peaceful/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/peaceful/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Load Datasets

In [6]:
# Download and store datasets locally
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/peaceful/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [7]:
from nltk.corpus import twitter_samples

# List the available files
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [8]:
# Load the labeled tweets used to train and evaluate the model
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# Load the unlabeled tweets (not used in the rest of this notebook)
text = twitter_samples.strings('tweets.20150430-223406.json')

# Tokenize, Normalize, and Remove Noise and Stopwords

## Tokenization

Tokenization means splitting a sentence into individual units called *tokens*: words, but also hashtags, mentions, and emoticons :)

In [9]:
from nltk.tokenize import TweetTokenizer

# Instantiate a tweet tokenizer that will preserve each word (or token) as it is
tweet_tokenizer = TweetTokenizer(
    preserve_case = True,
    reduce_len    = False,
    strip_handles = False)

tokens_positive = [tweet_tokenizer.tokenize(p) for p in positive_tweets]
tokens_negative = [tweet_tokenizer.tokenize(n) for n in negative_tweets]

print("Example of a positive tweet:\n{}\n".format(positive_tweets[0]))
print("Tokens:\n{}".format(tokens_positive[0]))

Example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Tokens:
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
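To get a feel for what a tweet-aware tokenizer does differently from a plain whitespace split, here is a minimal sketch using only the standard library. The pattern and the `simple_tokenize` helper are hypothetical simplifications, not NLTK's implementation; `TweetTokenizer` handles many more cases.

```python
import re

# Simplified, hypothetical token pattern (an assumption for illustration only)
TOKEN_PATTERN = re.compile(
    r"""
    [@#]\w+               # mentions and hashtags kept as single tokens
    | [:;=][-']?[)(DPp]   # a few common emoticons such as :) :-( :D
    | \w+(?:'\w+)?        # words, with an optional apostrophe part
    | [^\w\s]             # any other single punctuation mark
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("#FollowFriday @France_Inte thanks for being top engaged :)")
print(tokens)
# ['#FollowFriday', '@France_Inte', 'thanks', 'for', 'being', 'top', 'engaged', ':)']
```

The order of the alternatives matters: hashtags, mentions, and emoticons are tried before generic words and punctuation, so `:)` is not split into `:` and `)`.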


## Normalization

Normalization brings words to their canonical form. We will use lemmatization as the normalization process.

To lemmatize correctly, we first need the part of speech of each word, which a tagger provides as tags such as:
- NNP: Noun, proper, singular
- NN: Noun, common, singular or mass
- IN: Preposition or conjunction, subordinating
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle
- JJ: Adjective, e.g. 'big'
- JJR: Adjective, comparative, e.g. 'bigger'
- JJS: Adjective, superlative, e.g. 'biggest'
- ...

After getting the types (verb, noun, or others), we can extract the lemma of each word.

In [10]:
from nltk.tag import pos_tag    # Part-of-speech tagger

tags_positive = [pos_tag(p) for p in tokens_positive]
tags_negative = [pos_tag(n) for n in tokens_negative]

# print
tags_positive[0]

[('#FollowFriday', 'JJ'),
 ('@France_Inte', 'NNP'),
 ('@PKuchly57', 'NNP'),
 ('@Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':)', 'NN')]

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer

# All we need is the coarse type (noun, verb, or other) of each word
def _tag2type(tag):
    '''
    Take a Penn Treebank tag and return a WordNet part-of-speech code:
    'n' for nouns, 'v' for verbs, and 'a' (adjective) for everything else.
    '''
    if tag.startswith('NN'):
        return 'n'
    elif tag.startswith('VB'):
        return 'v'
    else:
        return 'a'

lemmatizer = WordNetLemmatizer()

lemma_positive = [[lemmatizer.lemmatize(word, _tag2type(tag)) for (word, tag) in tags] for tags in tags_positive]
lemma_negative = [[lemmatizer.lemmatize(word, _tag2type(tag)) for (word, tag) in tags] for tags in tags_negative]


print("Example of a positive tweet:\n{}\n".format(positive_tweets[0]))
print("Lemmatized:\n{}".format(lemma_positive[0]))

Example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Lemmatized:
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']


Notice that the verb *being* is converted to *be*, *engaged* to *engage*, and the noun *members* to *member*.
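The `_tag2type` helper above sends every tag that is neither a noun nor a verb to `'a'`. WordNet also has an adverb code (`'r'`), so a slightly finer mapping is possible. The sketch below is one possible variant, not the notebook's method: `tag_to_wordnet_pos` is a hypothetical name, and falling back to `'n'` mirrors the default `pos` argument of `WordNetLemmatizer.lemmatize`.

```python
def tag_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith('NN'):
        return 'n'   # noun
    if tag.startswith('VB'):
        return 'v'   # verb
    if tag.startswith('JJ'):
        return 'a'   # adjective
    if tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # assumed fallback, same as the lemmatizer's default

# Tags taken from the tagged tweet shown earlier
for word, tag in [('members', 'NNS'), ('being', 'VBG'), ('top', 'JJ'), ('for', 'IN')]:
    print(word, tag, '->', tag_to_wordnet_pos(tag))
```

Whether this changes the final accuracy has not been measured here; it only affects how non-noun, non-verb tokens are looked up.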

## De-noising or Noise Reduction

We consider the following as noise:
1. Stopwords: The most common words in a language, such as "a", "the", and "it". They generally carry little meaning on their own.
2. Hyperlinks: Twitter shortens hyperlinks with t.co, so the URLs left in a tweet carry no useful information.
3. Mentions: Usernames and pages that start with an @.
4. Punctuation: It adds context and meaning, but makes the text more complex to process. For simplicity, we'll remove standalone punctuation marks (emoticons such as :) are kept).

We will use the *stopwords* list from NLTK, plus regular expressions, to de-noise.

In [14]:
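# Import under an alias so that re-running this cell does not fail on a shadowed name
from nltk.corpus import stopwords as nltk_stopwords
stopwords = nltk_stopwords.words('english')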

# print
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [20]:
import re
from string import punctuation

def _is_noise(word):
    # Matches hyperlinks (http:// or https://) and @mentions
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(@[A-Za-z0-9_]+)'
    return word in punctuation \
        or word.lower() in stopwords \
        or re.search(pattern, word, re.IGNORECASE) is not None

denoised_positive = [[p.lower() for p in _list if not _is_noise(p)] for _list in lemma_positive]
denoised_negative = [[n.lower() for n in _list if not _is_noise(n)] for _list in lemma_negative]

print("Example of a positive tweet:\n{}\n".format(positive_tweets[0]))
print("Denoised:\n{}".format(denoised_positive[0]))

Example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Denoised:
['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']
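The noise check can be tried on individual tokens to see which rule fires. The sketch below uses only the standard library: `SAMPLE_STOPWORDS` is a tiny stand-in for NLTK's English stopword list, the URL pattern is a simplified version of the one above, and `noise_reason` and the `t.co` link are hypothetical.

```python
import re
from string import punctuation

# Tiny stand-in for NLTK's English stopword list (assumption for illustration)
SAMPLE_STOPWORDS = {"for", "in", "my", "this", "the", "a", "it"}

# Simplified pattern: anything starting with http(s):// or an @mention
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")

def noise_reason(token):
    """Return why a token counts as noise, or None if it should be kept."""
    if len(token) == 1 and token in punctuation:
        return "punctuation"
    if token.lower() in SAMPLE_STOPWORDS:
        return "stopword"
    if URL_OR_MENTION.search(token):
        return "link or mention"
    return None

for token in ["@PKuchly57", "https://t.co/abc123", "for", "!", "community", ":)"]:
    print(token, "->", noise_reason(token))
```

Emoticons such as `:)` survive because they are longer than one character and match none of the rules, which is consistent with the de-noised output above.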


In [20]:
from nltk import FreqDist

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

# Count token frequencies across all de-noised positive tweets
all_pos_words = get_all_words(denoised_positive)

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))


# Same for the de-noised negative tweets
all_neg_words = get_all_words(denoised_negative)

freq_dist_neg = FreqDist(all_neg_words)
print(freq_dist_neg.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]
[(':(', 4585), (':-(', 501), ("i'm", 343), ('...', 332), ('get', 325), ('miss', 291), ('go', 275), ('please', 275), ('want', 246), ('like', 218)]
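`FreqDist` is a subclass of Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The three sample tweets below are made up for illustration.

```python
from collections import Counter
from itertools import chain

# Made-up de-noised tweets (assumption for illustration)
sample_tweets = [
    ['top', 'engage', ':)'],
    ['thanks', ':)'],
    ['love', 'thanks', ':)'],
]

# Flatten the list of token lists, then count every token
counts = Counter(chain.from_iterable(sample_tweets))
print(counts.most_common(2))  # [(':)', 3), ('thanks', 2)]
```

`chain.from_iterable` plays the same role as the `get_all_words` generator: it yields every token of every tweet in turn.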


In [40]:
# Convert each tweet into the {token: True} feature dictionary expected by NLTK classifiers
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(denoised_positive)
negative_tokens_for_model = get_tweets_for_model(denoised_negative)
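Each tweet is turned into a dictionary that only records which tokens are present. Two consequences are worth seeing on a small example: repeated tokens collapse into one key, and a generator can be consumed only once. The `tokens_to_features` helper and the sample tokens are illustrative assumptions.

```python
def tokens_to_features(tokens):
    """Turn a list of tokens into a {token: True} feature dictionary."""
    return {token: True for token in tokens}

# Repeated tokens collapse into a single key
features = tokens_to_features(['good', 'good', 'day', ':)'])
print(features)  # {'good': True, 'day': True, ':)': True}

# A generator is exhausted after the first pass over it
lazy = (tokens_to_features(t) for t in [['thanks', ':)'], ['miss', ':(']])
first_pass = list(lazy)
second_pass = list(lazy)
print(len(first_pass), len(second_pass))  # 2 0
```

If the feature dictionaries are needed more than once, materializing them with `list(...)` avoids silently working on an empty sequence.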

In [42]:
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

# Use 7,000 of the 10,000 tweets for training and the remaining 3,000 for testing
train_data = dataset[:7000]
test_data = dataset[7000:]
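Because `random.shuffle` is called without a seed, every run produces a different split and a slightly different accuracy. A minimal sketch of a reproducible split follows; the `split_dataset` helper, the seed value, and the toy dataset are assumptions for illustration.

```python
import random
from collections import Counter

def split_dataset(dataset, train_fraction=0.7, seed=42):
    """Shuffle a copy of the dataset reproducibly and split it in two."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)   # private generator, fixed seed
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy dataset with alternating labels
toy = [({'token%d' % i: True}, 'Positive' if i % 2 else 'Negative') for i in range(10)]

train, test = split_dataset(toy)
print(len(train), len(test))  # 7 3

# Check how the labels are distributed in the training part
print(Counter(label for _, label in train))
```

Using a dedicated `random.Random(seed)` instance leaves the global random state untouched, and copying the list keeps the original order available.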

In [45]:
print(len(dataset))
dataset

10000


[({'sticker': True,
   'come': True,
   'sponsor': True,
   'prize': True,
   'entry': True,
   'tablet': True,
   ':)': True},
  'Positive'),
 ({'party': True, 'promotion': True, ':(': True}, 'Negative'),
 ({'fback': True, ':)': True}, 'Positive'),
 ({'look': True, 'different': True, ':)': True}, 'Positive'),
 ({'always': True, 'miss': True, 'something': True, 'wifi': True, ':(': True},
  'Negative'),
 ({'good': True, 'night': True, ':(': True}, 'Negative'),
 ({'thanks': True, ':)': True}, 'Positive'),
 ({'dont': True,
   'give': True,
   'know': True,
   'good': True,
   'day': True,
   'come': True,
   ':-)': True},
  'Positive'),
 ({'thankies': True, ':d': True, 'fuck': True, 'hot': True, 'xx': True},
  'Positive'),
 ({':(': True, "i'll": True, 'try': True, 'catch': True}, 'Negative'),
 ({'hi': True,
   'wayne': True,
   "we're": True,
   'sorry': True,
   'hear': True,
   'look': True,
   'leave': True,
   ':(': True,
   "what's": True,
   'happen': True,
   'make': True,
   'want

In [46]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features() prints the table itself and returns None
classifier.show_most_informative_features(10)

Accuracy is: 0.995
Most Informative Features
                      :( = True           Negati : Positi =   2058.7 : 1.0
                      :) = True           Positi : Negati =   1651.1 : 1.0
                follower = True           Positi : Negati =     35.6 : 1.0
                followed = True           Negati : Positi =     24.4 : 1.0
                     sad = True           Negati : Positi =     23.0 : 1.0
                    glad = True           Positi : Negati =     20.3 : 1.0
                     x15 = True           Negati : Positi =     14.3 : 1.0
                     ugh = True           Negati : Positi =     14.3 : 1.0
                  arrive = True           Positi : Negati =     13.3 : 1.0
               goodnight = True           Positi : Negati =     12.3 : 1.0
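To make the classifier less of a black box, here is a minimal sketch of a Naive Bayes classifier over presence features, written with the standard library only. It is a simplification and not NLTK's implementation: the add-one smoothing, the function names, and the four training examples are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_features):
    """Count how often each label and each (label, token) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_features:
        label_counts[label] += 1
        for name in features:
            feature_counts[label][name] += 1
            vocabulary.add(name)
    return label_counts, feature_counts, vocabulary

def predict(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)          # log prior of the label
        for name in features:
            if name not in vocabulary:
                continue                         # unseen tokens carry no evidence
            # Add-one smoothing over the two outcomes (present / absent)
            p_present = (feature_counts[label][name] + 1) / (count + 2)
            score += math.log(p_present)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training examples in the same format as the notebook's dataset
training = [
    ({'love': True, ':)': True}, 'Positive'),
    ({'thanks': True, ':)': True}, 'Positive'),
    ({'miss': True, ':(': True}, 'Negative'),
    ({'sad': True, ':(': True}, 'Negative'),
]

model = train_naive_bayes(training)
print(predict(model, {'love': True, 'day': True}))  # Positive
print(predict(model, {':(': True}))                 # Negative
```

A token that appears almost exclusively under one label, as `:(` and `:)` do in the table above, shifts the score strongly toward that label.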


In [59]:
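from nltk.tokenize import word_tokenize

def remove_noise(tokens):
    '''Tag, lemmatize, and de-noise tokens with the helpers defined above.'''
    lemmas = [lemmatizer.lemmatize(word, _tag2type(tag)) for (word, tag) in pos_tag(tokens)]
    return [lemma.lower() for lemma in lemmas if not _is_noise(lemma)]

custom_tweet = "Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline"

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))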

Positive


In [58]:
import pickle

with open('./model.pickle', 'wb') as f:
    pickle.dump(classifier, f)
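Loading the model back is the mirror operation. The sketch below shows a full save-and-load round trip in a temporary directory; the dictionary is only a stand-in for the trained classifier, which would need NLTK installed to be unpickled.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier (assumption for illustration)
model = {'labels': ['Positive', 'Negative']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pickle')

    # Save the object in binary mode
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Load it back, also in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)  # True
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.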