# Model for Analysing Twitter Sentiments

In this notebook we build a classification model to perform sentiment analysis on tweets. Sentiment analysis determines the *polarity* of a tweet, that is, whether it is *Positive* or *Negative*. Once the model is built, it can analyse tweets in real time through a Twitter API.

###### Steps to be executed: 

1. Install the NLTK package (Natural Language Toolkit) and the other libraries needed to process tweets.
2. Load the datasets of tweets that express positive and negative sentiments.
3. Tokenize and normalize each tweet, then remove noise and stopwords.
4. Determine the word density of each dataset.
5. Assemble the cleaned data into a dataset and split it into training and testing sets.
6. Train the Naive Bayes classification model and validate it.
7. Save the model in binary format.



# Step 1: Install NLTK and Dependencies

We will need to install Jupyter and the dependencies of this project.

The following steps are optional:

```bash
# To avoid dependency conflicts, create a virtual environment that separates this project from other Python projects
virtualenv -p python3 venv

# Activate your virtual env, and you will see a (venv) before your usual terminal prompt
source venv/bin/activate

# To install jupyter in the environment
pip install jupyter
```

Installation:

```bash
# Install the single main dependency
pip install nltk==3.4.5
```
We still need to download the NLTK resources that will aid in the processing of tweets.

In [None]:
import nltk

nltk.download('punkt')        # Pre-trained tokenizer model (Punkt)
nltk.download('wordnet')      # Lexical database used during normalization
nltk.download('averaged_perceptron_tagger')    # Tagger that identifies each word's part of speech (verb, noun, ...)
nltk.download('stopwords')    # List of common words that we filter out as noise

# Step 2: Load Datasets

In [None]:
# Download the datasets and store them locally
nltk.download('twitter_samples')

In [1]:
from nltk.corpus import twitter_samples

# Check the files available in our dataset
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [2]:
# Load the labelled tweets; they are later split into training and testing sets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# Load the unlabelled tweets (not used by the model built below)
text = twitter_samples.strings('tweets.20150430-223406.json')

# Step 3: Tokenization, Normalization, and Removal of Noise and Stopwords

## Tokenization

Tokenization splits a sentence into individual units known as *tokens*, which include words as well as emojis.

In [3]:
from nltk.tokenize import TweetTokenizer

# Create an instance of a tweet tokenizer that will preserve each word (or token) as it is
tweet_tokenizer = TweetTokenizer(
    preserve_case = True,
    reduce_len    = False,
    strip_handles = False)

tokens_positive = [tweet_tokenizer.tokenize(p) for p in positive_tweets]
tokens_negative = [tweet_tokenizer.tokenize(n) for n in negative_tweets]

print("Example of a positive tweet:\n{}\n".format(positive_tweets[0]))
print("Tokens:\n{}".format(tokens_positive[0]))

Example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Tokens:
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
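
To make the idea of tokenization concrete, here is a minimal sketch of a tweet-aware tokenizer built on the standard `re` module. The pattern and the name `simple_tokenize` are assumptions made for illustration only; NLTK's `TweetTokenizer` is far more thorough and remains the tokenizer used in this notebook.

```python
import re

# Simplified, hypothetical pattern for illustration only;
# it covers far fewer cases than NLTK's TweetTokenizer.
TOKEN_PATTERN = re.compile(
    r"""
      [@#]\w+               # mentions and hashtags, kept as single tokens
    | [:;=][-']?[)(DPp]     # a few western-style emoticons
    | \w+(?:'\w+)?          # words, optionally with an inner apostrophe
    | [^\w\s]               # any other single non-space character
    """,
    re.VERBOSE,
)


def simple_tokenize(text):
    """Split a tweet into tokens, keeping handles, hashtags and emoticons whole."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("#FollowFriday @France_Inte for being top :)"))
# ['#FollowFriday', '@France_Inte', 'for', 'being', 'top', ':)']
```

Keeping `:)` as one token matters later on, since emoticons turn out to be strong sentiment signals.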


## Normalization

In this stage we reduce words to their base form. We implement lemmatization as the normalization process.

We first need to find the part of speech of each word by making use of a tagger:
- NNP: Noun, proper, singular
- NN: Noun, common, singular or mass
- IN: Preposition or conjunction, subordinating
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle
- JJ: Adjective, e.g. ‘big’
- JJR: Adjective, comparative, e.g. ‘bigger’
- JJS: Adjective, superlative, e.g. ‘biggest’
- ...

After getting the types (Verb, noun, or others), we can extract the lemma of each word.

In [4]:
from nltk.tag import pos_tag    # Part-of-speech tagger

tags_positive = [pos_tag(p) for p in tokens_positive]
tags_negative = [pos_tag(n) for n in tokens_negative]

# Show the tags of the first positive tweet
tags_positive[0]

[('#FollowFriday', 'JJ'),
 ('@France_Inte', 'NNP'),
 ('@PKuchly57', 'NNP'),
 ('@Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':)', 'NN')]

In [5]:
from nltk.stem.wordnet import WordNetLemmatizer

# We only need the coarse type of each word: noun, verb, or other
def _tag2type(tag):
    '''
    Take a Penn Treebank tag and return a WordNet part-of-speech type.
    Return 'n' for nouns, 'v' for verbs, and 'a' (adjective) for everything else.
    '''
    if tag.startswith('NN'):
        return 'n'
    elif tag.startswith('VB'):
        return 'v'
    else:
        return 'a'

lemmatizer = WordNetLemmatizer()

lemma_positive = [[lemmatizer.lemmatize(word, _tag2type(tag)) for (word, tag) in tags] for tags in tags_positive]
lemma_negative = [[lemmatizer.lemmatize(word, _tag2type(tag)) for (word, tag) in tags] for tags in tags_negative]


print("Example of a positive tweet:\n{}\n".format(positive_tweets[0]))
print("Lemmatized:\n{}".format(lemma_positive[0]))

Example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Lemmatized:
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']


Notice that the verb *being* is converted to *be*, *engaged* to *engage*, and the noun *members* to *member*.
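
The `_tag2type` helper sends every tag that is neither a noun nor a verb to `'a'`, which WordNet reads as "adjective". As a possible refinement, the sketch below also maps adverbs and falls back to `'n'`, the default part of speech of `WordNetLemmatizer.lemmatize`. The names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical and not part of this notebook, and whether the finer mapping changes the final accuracy is untested here.

```python
# Hypothetical finer-grained alternative to _tag2type (illustrative names).
PENN_PREFIX_TO_WORDNET = {
    "NN": "n",  # nouns: NN, NNS, NNP, NNPS
    "VB": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "a",  # adjectives: JJ, JJR, JJS
    "RB": "r",  # adverbs: RB, RBR, RBS
}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:2], default)


for tag in ["NNP", "VBG", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```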

## De-noising or Noise Reduction on our data

At this stage we consider the following as noise:
1. Stopwords: Words such as "a", "the", and "it" are the most common words in a language and generally carry little meaning, except in special cases.
2. Hyperlinks: Twitter shortens hyperlinks with t.co, so the URLs carry no useful information.
3. Mentions: Usernames and pages that start with an @ symbol are classified as mentions.
4. Punctuation: To keep things simple we remove standalone punctuation marks, as they complicate text processing.

We use the *stopwords* corpus from NLTK, together with regular expressions, to de-noise the data.

In [6]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

# Show the first 10 stopwords
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [7]:
import re
from string import punctuation

def _is_noise(word):
    # Matches hyperlinks (http/https URLs) and @mentions; a raw string avoids invalid-escape warnings
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|(@[A-Za-z0-9_]+)'
    return word in punctuation \
        or word.lower() in stopwords \
        or re.search(pattern, word, re.IGNORECASE) is not None

denoised_positive = [[p.lower() for p in _list if not _is_noise(p)] for _list in lemma_positive]
denoised_negative = [[n.lower() for n in _list if not _is_noise(n)] for _list in lemma_negative]

print("Example of a positive tweet:\n{}\n".format(positive_tweets[0]))
print("Denoised:\n{}".format(denoised_positive[0]))

Example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

Denoised:
['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']
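
The four kinds of noise can also be checked one at a time, which makes it easier to see why a given token is dropped. The sketch below is a simplified stand-in for `_is_noise`, not a replacement: `noise_kind`, the two regular expressions and the tiny `DEMO_STOPWORDS` set are assumptions made for illustration. It treats only single-character tokens as punctuation, so multi-character tokens such as `:)` and `...` survive, as they do in the output above.

```python
import re
from string import punctuation

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
# Tiny stand-in for NLTK's English stopword list (hypothetical subset).
DEMO_STOPWORDS = {"for", "in", "my", "this", "the", "it"}


def noise_kind(token):
    """Return the kind of noise a token represents, or None if it should be kept."""
    if URL_RE.fullmatch(token):
        return "hyperlink"
    if MENTION_RE.fullmatch(token):
        return "mention"
    if len(token) == 1 and token in punctuation:
        return "punctuation"
    if token.lower() in DEMO_STOPWORDS:
        return "stopword"
    return None


tokens = ["@France_Inte", "for", "community", "!", ":)", "https://t.co/abc123"]
kept = [t for t in tokens if noise_kind(t) is None]
print(kept)
# ['community', ':)']
```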


## Step 4: Determine the Word Density of the Dataset

In [8]:
from nltk import FreqDist

def get_all_words(tokens_list):
    '''
    Generator function that flattens the dataset into a single stream of words.
    
    @arg tokens_list: a 2-D list of (preferably cleaned) tokens
    @return a generator that yields every word in the dataset
    '''
    for tokens in tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(denoised_positive)
all_neg_words = get_all_words(denoised_negative)

freq_dist_pos = FreqDist(all_pos_words)
freq_dist_neg = FreqDist(all_neg_words)

print("The 10 most common words in a set of positive tweets:\n{}\n".format(freq_dist_pos.most_common(10)))
print("The 10 most common words in a set of negative tweets:\n{}".format(freq_dist_neg.most_common(10)))

The 10 most common words in a set of positive tweets:
[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]

The 10 most common words in a set of negative tweets:
[(':(', 4585), (':-(', 501), ("i'm", 343), ('...', 332), ('get', 325), ('miss', 291), ('go', 275), ('please', 275), ('want', 246), ('like', 218)]
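
`FreqDist` is a subclass of Python's `collections.Counter`, so the same counts can be reproduced with the standard library alone. The sketch below uses a toy list, `toy_tweets`, as a hypothetical stand-in for `denoised_positive`, and also derives relative frequencies, which are closer to a "density" than raw counts.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for denoised_positive (hypothetical data).
toy_tweets = [["top", "engage", ":)"], ["thanks", ":)"], ["love", "thanks", ":)"]]

# chain.from_iterable flattens the 2-D list, like get_all_words does.
counts = Counter(chain.from_iterable(toy_tweets))
total = sum(counts.values())
relative = {word: n / total for word, n in counts.items()}

print(counts.most_common(2))   # [(':)', 3), ('thanks', 2)]
print(relative[":)"])          # 0.375
```

Note that `get_all_words` returns a generator, so `all_pos_words` and `all_neg_words` can each be consumed only once; call the function again if the words are needed a second time.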


## Step 5: Prepare Training and Testing Datasets

Our dataset is split into a training set for building the model, and a testing set for testing the performance of our model.

In [9]:
def get_tweets_for_model(tokens_list):
    '''
    Generator function that converts each tweet's tokens into a feature dictionary
    mapping every token to the boolean 'True', meaning the token is present in the tweet.
        
    @arg tokens_list: a 2-D list of (preferably cleaned) tokens
    @return a generator that yields one {token: True} dictionary per tweet
    '''
    for tweet_tokens in tokens_list:
        yield {token: True for token in tweet_tokens}

positive_tokens_for_model = get_tweets_for_model(denoised_positive)
negative_tokens_for_model = get_tweets_for_model(denoised_negative)
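
Two properties of this representation are worth seeing on a small example: repeated tokens collapse into a single key, so only presence is recorded, and the generator can be read only once. The sketch below uses a hypothetical helper `to_features` and toy data, not the notebook's variables.

```python
def to_features(tokens_list):
    """Yield one {token: True} dictionary per tweet (presence-only features)."""
    for tokens in tokens_list:
        yield {token: True for token in tokens}


toy = [["love", "love", ":)"], ["sad"]]
features = to_features(toy)

first_pass = list(features)
second_pass = list(features)   # the generator is already exhausted

print(first_pass)    # [{'love': True, ':)': True}, {'sad': True}]
print(second_pass)   # []
```

This is why the next cell materialises the generators into lists before shuffling and splitting them.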

In [10]:
import random

TRAIN_SIZE_RATIO = 0.7    # We use 70% as a training set

positive_dataset = [(tweet_dict, "Positive") for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative") for tweet_dict in negative_tokens_for_model]

# Merge the positive and negative sets, then shuffle to avoid any bias
# that could come from the arrangement of tweets.
dataset = positive_dataset + negative_dataset
random.shuffle(dataset)

train_data = dataset[: round(len(dataset) * TRAIN_SIZE_RATIO)]
test_data = dataset[round(len(dataset) * TRAIN_SIZE_RATIO) :]
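
Because `random.shuffle` is not seeded above, the split, and therefore the reported accuracies, will differ slightly from run to run. A minimal sketch of a reproducible split follows; the function name `split_dataset` and the seed value `42` are arbitrary choices for illustration.

```python
import random


def split_dataset(dataset, train_ratio=0.7, seed=42):
    """Return (train, test) after shuffling a copy of the data with a fixed seed."""
    rng = random.Random(seed)      # private generator, leaves the global state alone
    shuffled = list(dataset)       # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy labelled feature sets (hypothetical data).
toy = [({"w%d" % i: True}, "Positive" if i % 2 else "Negative") for i in range(10)]
train, test = split_dataset(toy)

print(len(train), len(test))   # 7 3
```

Calling `split_dataset(toy)` again returns exactly the same two lists, which makes results comparable between runs.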

## Step 6: Train the Model

We will use a Naive Bayes classifier.
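
As a rough sketch of what such a classifier computes, the code below scores each label by adding the log prior of the label to the log probability of every token in the tweet given that label, assuming the tokens are independent. It is a simplified illustration under stated assumptions: the function names are hypothetical, smoothing is plain add-one, absent tokens are ignored, and NLTK's own implementation differs in its smoothing and other details.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets, alpha=1.0):
    """Count how often each label and each (label, token) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name in features:
            feature_counts[label][name] += 1
            vocabulary.add(name)
    return label_counts, feature_counts, vocabulary, alpha


def classify_naive_bayes(model, features):
    """Return the label with the highest log-posterior score."""
    label_counts, feature_counts, vocabulary, alpha = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)          # log prior of the label
        for name in features:
            if name not in vocabulary:
                continue                           # skip tokens never seen in training
            # Smoothed estimate of P(token present | label)
            p = (feature_counts[label][name] + alpha) / (n_label + 2 * alpha)
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical), in the same shape as train_data.
toy_train = [
    ({":)": True, "thanks": True}, "Positive"),
    ({"love": True, ":)": True}, "Positive"),
    ({":(": True, "sad": True}, "Negative"),
    ({"miss": True, ":(": True}, "Negative"),
]
model = train_naive_bayes(toy_train)

print(classify_naive_bayes(model, {"thanks": True, "love": True}))   # Positive
print(classify_naive_bayes(model, {"sad": True, "unseen": True}))    # Negative
```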

In [11]:
from nltk import NaiveBayesClassifier
# Import accuracy directly so that the classify() helper defined later cannot shadow the nltk.classify module
from nltk.classify import accuracy

classifier = NaiveBayesClassifier.train(train_data)

print("Training accuracy is:{}\n".format(accuracy(classifier, train_data)))
print("Testing accuracy is:{}\n".format(accuracy(classifier, test_data)))
classifier.show_most_informative_features(10)    # Prints its table directly and returns None

Training accuracy is:0.9995714285714286

Testing accuracy is:0.9953333333333333

Most Informative Features
                      :) = True           Positi : Negati =    988.2 : 1.0
                     sad = True           Negati : Positi =     31.1 : 1.0
                follower = True           Positi : Negati =     23.6 : 1.0
                     bam = True           Positi : Negati =     21.9 : 1.0
                    glad = True           Positi : Negati =     19.8 : 1.0
                     x15 = True           Negati : Positi =     16.9 : 1.0
                 welcome = True           Positi : Negati =     15.1 : 1.0
               community = True           Positi : Negati =     14.5 : 1.0
                     ugh = True           Negati : Positi =     12.9 : 1.0
                    dont = True           Negati : Positi =     12.6 : 1.0
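
Accuracy alone does not show which class the errors fall on. A minimal sketch of precision and recall computed from gold and predicted labels is given below; the helper names and the toy label lists are hypothetical. In this notebook the gold labels would come from `test_data` and the predictions from `classifier.classify`.

```python
from collections import Counter


def confusion_counts(gold, predicted, positive_label="Positive"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive_label:
            counts["tp" if g == positive_label else "fp"] += 1
        else:
            counts["fn" if g == positive_label else "tn"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical data).
gold = ["Positive", "Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive"]

counts = confusion_counts(gold, predicted)
precision, recall = precision_recall(counts)
print(dict(counts))
print(precision, recall)
```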


### Custom testing

For ease of use, we wrap the pre-processing and classification steps into a function, then test it on tweets that express various emotions.

In [12]:
def classify(tweet):
    '''
    Wrapper function for the pre-processing and classification steps previously performed.
    
    @arg tweet: String representing a tweet
    @return a tuple (tokens, polarity): the denoised tokens and the predicted polarity (Positive or Negative)
    '''
    tokens = tweet_tokenizer.tokenize(tweet)
    tokens = [
        lemmatizer.lemmatize(word, _tag2type(tag)).lower()
        for word, tag in pos_tag(tokens)
        if not _is_noise(word)
    ]
    
    return tokens, classifier.classify({token: True for token in tokens})

In [13]:
positive_tweet = "@bakery_brothers Thanks for the Pie! Really appreciate it :) #yummy #pie_day"
tokens, polarity = classify(positive_tweet)

print("Denoised tokens: {}\nPolarity: {}\n".format(tokens, polarity))

Denoised tokens: ['thanks', 'pie', 'really', 'appreciate', ':)', '#yummy', '#pie_day']
Polarity: Positive



In [14]:
negative_tweet = "@raptors really sad that you lost the qualifications to the final. #no_luck"
tokens, polarity = classify(negative_tweet)

print("Denoised tokens: {}\nPolarity: {}\n".format(tokens, polarity))

Denoised tokens: ['really', 'sad', 'lose', 'qualification', 'final', '#no_luck']
Polarity: Negative



In [15]:
sarcastic_tweet = "@police thank you so much for closing half the roads to the city in the middle of the day! #traffic"
tokens, polarity = classify(sarcastic_tweet)

print("Denoised tokens: {}\nPolarity: {}\n".format(tokens, polarity))

Denoised tokens: ['thank', 'much', 'close', 'half', 'road', 'city', 'middle', 'day', '#traffic']
Polarity: Positive



### Conclusions

The model is not able to recognize sarcasm because the training set lacks such examples.

Training a more complex model that would recognize more nuanced emotions requires a training set that contains all of those emotions, and possibly a classification algorithm that can cope with this complexity.

## Step 7: Save the Model to a Binary File

In [16]:
import pickle

with open('./model.pickle', 'wb') as f:
    pickle.dump(classifier, f)
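
Loading the model back follows the same pattern with `pickle.load` and a file opened in binary read mode. The sketch below demonstrates the round trip on a plain dictionary, used here as a hypothetical stand-in for the trained classifier, and writes to a temporary directory so that it does not touch `model.pickle`.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier (hypothetical object, for illustration).
model = {"labels": ["Positive", "Negative"], "note": "stand-in for the classifier"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")

    with open(path, "wb") as f:      # write in binary mode
        pickle.dump(model, f)

    with open(path, "rb") as f:      # read in binary mode as well
        restored = pickle.load(f)

print(restored == model)   # True
```

When the real classifier is restored, NLTK must be importable in the loading environment, and pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.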