# Twitter Sentiment Analysis

Create a Jupyter notebook with a simple analysis based on Twitter Sentiment Analysis Training Corpus:
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

First, let us load the relevant libraries.

In [4]:
import os
import re
import random
import zipfile

from six.moves import urllib

import numpy as np

import pandas
from pandas import Series, DataFrame

import matplotlib
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.model_selection import train_test_split

import tensorflow as tf

from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Bidirectional, Input, LSTM, Dense, Dropout, Activation, Conv1D, Flatten
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, ModelCheckpoint

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer

from gensim.models import Word2Vec

seed = 1
tf.set_random_seed(seed)
random.seed(seed)

lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package punkt to /Users/cesar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/cesar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


If the [dataset](http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip) has not been downloaded, run the cell below to do so. Otherwise, copy it to the project folder.

In [5]:
download_url = "http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip"
download_path = os.path.join(".", "Sentiment-Analysis-Dataset.zip")

urllib.request.urlretrieve(download_url, download_path)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
    zip_ref.extractall(".")

Now, we load the dataset. 

In [None]:
df = pandas.read_csv("Sentiment Analysis Dataset.csv", error_bad_lines=False)
print("Number of lines loaded:", len(df))
print("Positive tweets:", len(df[df.Sentiment == 1]))
print("Negative tweets:", len(df[df.Sentiment == 0]))

## Data Preprocessing
Before performing the tasks, we need to preprocess the data, removing undesired characters and words (such as links) and tokenizing the clean tweets into arrays.
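
As a rough illustration of the sanitizing idea, the sketch below applies the same kind of substitutions to a single made-up tweet using only the standard `re` module. The `sanitize` helper and the sample tweet are hypothetical, and a plain whitespace split stands in for the NLTK tokenizer used in the notebook.

```python
import re

def sanitize(tweet):
    """Lowercase a tweet, drop handles and links, and keep only a-z, space, '?', '!' and '#'."""
    text = tweet.lower()
    text = re.sub(r"@[a-z0-9_]+", "", text)      # handles such as @bob
    text = re.sub(r"https?://[^ ]+", "", text)   # http:// and https:// links
    text = re.sub(r"www\.[^ ]+", "", text)       # links that start with www.
    text = re.sub(r"[^a-z ?!#]", "", text)       # everything else that is not allowed
    return text.split()                          # naive whitespace tokenization

print(sanitize("@bob I LOVE this!!! http://t.co/xyz #happy 100%"))
# ['i', 'love', 'this!!!', '#happy']
```

The handle, the link, the digits and the percent sign are removed, while the hashtag and the exclamation marks survive because they may carry sentiment.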

In [None]:
# Sanitizing step. Remove handles, links, and non-alpha (except for '?', '!' and '#') characters.
# Note: this could be done in one pass with the regex alternation operator |, but we keep separate steps for readability.
clean_tweets = df.SentimentText.str.lower().str.replace(r"@[A-Za-z0-9_]+", "", regex=True)
# "https?" matches both http:// and https:// links.
clean_tweets = clean_tweets.str.replace(r"https?://[^ ]+", "", regex=True)
clean_tweets = clean_tweets.str.replace(r"www\.[^ ]+", "", regex=True)
clean_tweets = clean_tweets.str.replace(r"[^a-z ?!#]", "", regex=True)

df['CleanTweet'] = clean_tweets

In [None]:
# We use the tweet tokenizer provided by NLTK. Apart from normal tokenization,
# it shortens elongated words (e.g. wooooowwwwwww becomes wooowww),
# which reduces the vocabulary without stripping much context.
tt = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
tokenized_tweets = df.CleanTweet.apply(tt.tokenize)

def lemmatize(sentence):
    return [lemmatizer.lemmatize(w) for w in sentence]

# And finally a lemmatizing step.
lemmatized_tweets = tokenized_tweets.apply(lemmatize)
df['TokenizedTweet'] = lemmatized_tweets

In [None]:
# Here we can take a look at a few sanitized examples
print(lemmatized_tweets.sample(5).values)

After lemmatizing and removing non-alphabetic characters, we are ready to analyze the dataset.

## Task 1
Show the top-10 most positive words, top-10 negative words (words more frequent with positive and negative labels respectively).

We present two approaches to find this: 
* The first is to compute, for each word, the difference between its number of occurrences in positive tweets and in negative tweets (and vice versa for negative sentiment) and present the highest scores. This calculation favors frequent words, so stop words like "i", "but" and "the" may appear.
* The second approach is to compute the share of a word's occurrences that fall in positive (or negative) tweets and present the top 10. This list may contain words that are not very common but were used far more often in one context than in the other.
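
To make the difference between the two scores concrete, here is a minimal sketch on four made-up tokenized tweets. The toy data and variable names are hypothetical, and unlike the notebook cell it treats a word that is absent from one class as having a count of zero.

```python
from collections import Counter

# Hypothetical, already tokenized toy tweets.
positive_tweets = [["i", "love", "rain"], ["i", "love", "you"]]
negative_tweets = [["i", "hate", "rain"], ["i", "hate", "monday", "rain"]]

pos = Counter(w for tweet in positive_tweets for w in tweet)
neg = Counter(w for tweet in negative_tweets for w in tweet)
vocab = set(pos) | set(neg)

# Approach 1: absolute difference, which favors frequent words.
difference = {w: pos[w] - neg[w] for w in vocab}
# Approach 2: share of occurrences that are positive, independent of frequency.
ratio = {w: pos[w] / (pos[w] + neg[w]) for w in vocab}

print(difference["love"], difference["i"])  # 2 0
print(ratio["love"], ratio["i"])            # 1.0 0.5
```

A word such as "i" is frequent but evenly spread, so it scores zero on the difference and 0.5 on the ratio, while a rare word used in only one class gets an extreme ratio.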

In [None]:
from collections import defaultdict

negative_words = defaultdict(int)
positive_words = defaultdict(int)

# let's count how many positive/negative appearances each word has
def count_negative(sentence):
    for w in sentence: negative_words[w]+=1

def count_positive(sentence):
    for w in sentence: positive_words[w]+=1

df[df.Sentiment == 1].TokenizedTweet.apply(count_positive)
df[df.Sentiment == 0].TokenizedTweet.apply(count_negative)

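# Words that occur under only one label get NaN in the other column, so the
# arithmetic below yields NaN for them and they never reach the rankings.
word_to_sentiment = DataFrame({
    'NegativeCount' : Series(negative_words),
    'PositiveCount' : Series(positive_words),
})

word_to_sentiment['TotalCount'] = word_to_sentiment.NegativeCount + word_to_sentiment.PositiveCount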
print("Top 10 positive words by difference")
print((word_to_sentiment.PositiveCount - word_to_sentiment.NegativeCount).nlargest(10))
print()
print("Top 10 negative words by difference")
print((word_to_sentiment.NegativeCount - word_to_sentiment.PositiveCount).nlargest(10))
print()
print("Top 10 positive words by ratio")
print((word_to_sentiment.PositiveCount / word_to_sentiment.TotalCount).nlargest(10))
print()
print("Top 10 negative words by ratio")
print((word_to_sentiment.NegativeCount / word_to_sentiment.TotalCount).nlargest(10))
print()


## Task 2 
Check the Zipf law (https://en.wikipedia.org/wiki/Zipf%27s_law: the frequency of any word is inversely proportional to its rank in the frequency table), show it using a plot. 
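
One way to check the law numerically, in addition to a plot, is to fit a straight line to log(frequency) against log(rank): if frequency is proportional to 1/rank, the slope should be close to -1. The sketch below does this with ordinary least squares in plain Python; `zipf_slope` is a hypothetical helper and the frequencies are synthetic, not taken from the dataset.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) versus log(rank), with ranks starting at 1."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic frequencies that follow Zipf's law exactly: f(rank) = 1000 / rank.
ideal = [1000 / rank for rank in range(1, 201)]
print(round(zipf_slope(ideal), 3))
```

On real word counts the slope is only approximately -1, and the fit is usually worst at the very top and the long tail of the ranking.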

In [None]:
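sorted_words = word_to_sentiment.TotalCount.dropna().sort_values(ascending=False)

y = sorted_words.to_numpy(dtype=float)
# Ranks start at 1: 1/rank is undefined at rank 0, which also cannot be shown on a log axis.
x = np.arange(1, len(y) + 1)

# A very rough estimate of Zipf's law, scaled to the most frequent word, just to illustrate its trend.
zipfs = y[0] / x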

fig, ax = plt.subplots()
ax.plot(x, y, label='dataset')
ax.plot(x, zipfs, label="Zipf's law")

ax.set_title("Rank vs. Frequency and Zipf's law")
ax.set_xlabel('rank')
ax.set_ylabel('frequency')
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend()
ax.grid()

plt.show()

## Task 3
Write a simple classifier to predict sentiment, describe your approach and what can be done to improve the results.

A good choice for a sentiment classifier is a bidirectional RNN with word embeddings as inputs. There are several reasons to choose such a model:
* The overall sentiment is dependent on the whole sequence, and an RNN is capable of considering relationships across inputs, in this case, word embeddings.
* A bidirectional RNN allows us to look at the dependencies not only from beginning to end but also from end to beginning. A negative word at the end can change the whole context of the tweet. Example: I love rain. NOT!
* Word embeddings capture different features of a word, usually by representing its co-occurrences with other words. Similar words have similar word embeddings.
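
The claim that similar words have similar embeddings is usually measured with cosine similarity. The sketch below uses three hand-made 3-dimensional vectors purely for illustration; they are hypothetical values, not embeddings learned from the dataset.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand for the example.
toy_embeddings = {
    "happy": [0.9, 0.1, 0.3],
    "glad":  [0.8, 0.2, 0.4],
    "angry": [-0.7, 0.9, 0.1],
}

same = cosine_similarity(toy_embeddings["happy"], toy_embeddings["glad"])
other = cosine_similarity(toy_embeddings["happy"], toy_embeddings["angry"])
print(same > other)  # True
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.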

First, we have to obtain our embeddings matrix. We could download a set of pre-trained embeddings, but by training our own we make sure that the embedding weights match our dataset domain. Pre-trained embeddings reflect the corpus they were trained on, which may or may not transfer well to our problem.

In [None]:
embedding_len = 100
word2vec = Word2Vec(sentences=df.TokenizedTweet, size=embedding_len)

Let's see a few examples of our word vectors.

In [None]:
computer = [w for w, s in word2vec.wv.most_similar('computer')]
angry = [w for w, s in word2vec.wv.most_similar('angry')]
print("Words similar to computer:\n", computer)
print("Words similar to angry:\n", angry)

In [None]:
# Using Keras Tokenizer which helps us with the input format of the embedding layer
filters = r'"$%&()*+,-./:;<=>[\]^_`{|}~'  # raw string, so the backslash is kept literally
vocabulary_len = 100000
tokenizer = Tokenizer(filters=filters, num_words=vocabulary_len)
tokenizer.fit_on_texts(df.TokenizedTweet)
max_words_in_tweet = max(df.TokenizedTweet.apply(len))
print("Max number of words in a tweet:" , max_words_in_tweet)

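# Build the embeddings for the vocabulary we use.
# word_index holds every word seen, not just the top vocabulary_len, so we skip
# indices that would fall outside the matrix (the Tokenizer drops them anyway).
embeddings = np.zeros((vocabulary_len, embedding_len))
for word, index in tokenizer.word_index.items():
    if index < vocabulary_len and word in word2vec.wv:
        embeddings[index, :] = word2vec.wv[word]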


Given our large dataset of more than 1.5M samples, we can afford a bigger training set with smaller validation and test sets. We do a 98-1-1 split and prepare the sets in the input format the model expects.
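
The input format mentioned here boils down to two steps: map every word to an integer index, and pad every tweet to the same length. The following plain-Python sketch imitates that idea on toy data. The helper names are hypothetical, and it follows the common convention of ranking words by frequency from index 1 so that 0 is free for padding; it is not the Keras implementation.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Assign indices 1, 2, ... by decreasing frequency; 0 stays reserved for padding."""
    counts = Counter(w for text in tokenized_texts for w in text)
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(tokenized_texts, word_index, maxlen):
    """Replace words by indices and right-pad with zeros (assumes no text is longer than maxlen)."""
    rows = []
    for text in tokenized_texts:
        ids = [word_index[w] for w in text if w in word_index]
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows

tweets = [["i", "love", "rain"], ["i", "hate", "monday", "rain"], ["rain"]]
word_index = build_word_index(tweets)
padded = to_padded_sequences(tweets, word_index, maxlen=4)
print(padded)
# [[2, 3, 1, 0], [2, 4, 5, 1], [1, 0, 0, 0]]
```

Every row now has the same length, which is what lets the tweets be stacked into a single integer matrix for the embedding layer.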

In [None]:
train_df, rest_df = train_test_split(df, train_size=0.98, random_state=seed)
val_df, test_df = train_test_split(rest_df, train_size=0.5, random_state=seed) # ~1% of the data each

# To use the Embedding layer we need to represent words in sentences by their index in the dict.
# The inputs need to be transformed to be vectors of indices in the dict for each word.
X_train = tokenizer.texts_to_sequences(train_df.TokenizedTweet)
# The inputs have to be of the same dimensions, regardless of the numbers of words in a sentence.
X_train = pad_sequences(X_train, maxlen = max_words_in_tweet, padding="post")
Y_train = train_df.Sentiment.to_numpy()

X_val = tokenizer.texts_to_sequences(val_df.TokenizedTweet)
X_val = pad_sequences(X_val, maxlen=max_words_in_tweet, padding="post")
Y_val = val_df.Sentiment.to_numpy()

X_test = tokenizer.texts_to_sequences(test_df.TokenizedTweet)
X_test = pad_sequences(X_test, maxlen=max_words_in_tweet, padding="post")
Y_test = test_df.Sentiment.to_numpy()

In [None]:
print("Training set size:", len(train_df))
print("Validation set size:", len(val_df))
print("Test set size:", len(test_df))

In [None]:
# Bidirectional LSTM-based RNN with a single sigmoid activation output
def TweetSentimentRNNModel(lstm_units, input_size, embeddings):
    inputs = Input(shape=(input_size,), dtype='int32')

    # The Embedding layer will be in charge of converting our sentences with indices to word embeddings.
    # Since we have pre-trained word embeddings, we freeze the layer (trainable=False) so the optimizer does not update them.
    vocabulary_len, embedding_size = embeddings.shape
    embedding_layer = Embedding(vocabulary_len, embedding_size, weights=[embeddings], input_length=input_size, trainable=False)
    
    X = embedding_layer(inputs)
    # After converting to word embeddings, we pass them to the bidirectional LSTM.
    X = Bidirectional(LSTM(lstm_units))(X)
    # The encoding produced by the LSTM is passed to an output node using sigmoid activation.
    X = Dense(1, activation='sigmoid')(X)
    
    return Model(inputs=inputs, outputs=X)


After preparing the inputs and declaring our model, we proceed to train it and find the best number of LSTM units to use. We could perform cross-validation over several iterations to get a more reliable estimate and reduce the risk of overfitting to one validation set. In this case, we do a single loop over the range of LSTM units and keep the validation set static.
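
The training cell relies on early stopping, so it may help to see the stopping rule in isolation. The sketch below is a simplified plain-Python version of the idea (stop once validation accuracy has failed to improve by at least `min_delta` for `patience` consecutive epochs). The function name and the accuracy values are made up, and it is not the Keras callback itself.

```python
def stopping_epoch(val_accuracies, patience=1, min_delta=0.001):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc - min_delta > best:
            # Improvement large enough: remember it and reset the counter.
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Never triggered: training runs for all available epochs.
    return len(val_accuracies)

# Hypothetical validation accuracies per epoch.
print(stopping_epoch([0.790, 0.801, 0.8015, 0.803]))  # 3
```

With a patience of 1, the tiny gain in the third epoch is below `min_delta`, so training stops there even though the fourth epoch would have been slightly better. A larger patience trades training time for a lower chance of stopping too early.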

In [None]:
# Let's test different number of units for the LSTM to find the best number that can be used
models = {
    8: TweetSentimentRNNModel(8, max_words_in_tweet, embeddings), 
    16: TweetSentimentRNNModel(16, max_words_in_tweet, embeddings), 
    32: TweetSentimentRNNModel(32, max_words_in_tweet, embeddings), 
    64: TweetSentimentRNNModel(64, max_words_in_tweet, embeddings), 
    128: TweetSentimentRNNModel(128, max_words_in_tweet, embeddings), 
}


**Note:** The training time is very long. To skip it, you can load the trained weights in the cell below this one.

In [None]:
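# The checkpoint files are written to models/, so make sure the folder exists.
os.makedirs("models", exist_ok=True)

for lstm_units, model in models.items():

    # Save the weights that give the model's best validation accuracy.
    filepath = "models/LSTM_{}_pretrained_embeddings_".format(lstm_units) + "best_weights_{epoch:02d}_epochs.hdf5"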
    checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

    # We use early stopping with a patience of 1 because the model converges quickly given the huge dataset.
    # After that, the accuracy and loss do not improve further.
    early_stop = EarlyStopping(monitor='val_acc', patience=1, min_delta=0.001, mode='max') 
 
    # We choose the Adam optimizer, which combines momentum with RMSprop-style adaptive learning rates for faster learning.
    adam = Adam(lr=0.01)
    # Since we are dealing with a binary classification problem, we minimize the binary log loss (cross-entropy).
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    
    print("Training with lstm_units:", lstm_units)
    model.fit(X_train, Y_train, 
              validation_data=(X_val, Y_val), 
              callbacks=[early_stop, checkpoint], 
              batch_size=250,
              epochs=50, 
              shuffle=True)
        

In [None]:
models_filepath = {
    8: "models/LSTM_8_pretrained_embeddings_best_weights_01_epochs.hdf5", 
    16: "models/LSTM_16_pretrained_embeddings_best_weights_01_epochs.hdf5", 
    32: "models/LSTM_32_pretrained_embeddings_best_weights_03_epochs.hdf5", 
    64: "models/LSTM_64_pretrained_embeddings_best_weights_01_epochs.hdf5", 
    128: "models/LSTM_128_pretrained_embeddings_best_weights_01_epochs.hdf5", 
}
for lstm_units, filepath in models_filepath.items():
    model = models[lstm_units]
    model.load_weights(filepath, by_name=False)
    adam = Adam(lr=0.01)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    print("%s LSTM units" % lstm_units)
    loss, acc = model.evaluate(X_val, Y_val, verbose=0)
    print("Validation accuracy = %0.4f" % acc)
    print()

The best validation accuracy was `80.51%` with `32` LSTM units, although the rest of the models had similar scores. We could follow Occam's Razor and choose the simpler model with `8` units, since it may be less prone to overfitting and is much faster to train and evaluate without sacrificing much accuracy, but let's keep the `32` units.

In [None]:
best_model = models[32]

To report a final accuracy for our model, we want a dataset that is completely separate from the training and validation sets, since the model and its hyperparameters were selected by optimizing results on those two. We use our test set for this.
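
For reference, the two numbers returned by the evaluation (loss and accuracy) can be computed by hand from the sigmoid outputs. The sketch below does so in plain Python with made-up probabilities and labels; the function names are hypothetical, and a 0.5 decision threshold is assumed.

```python
import math

def binary_cross_entropy(probabilities, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return correct / len(labels)

# Hypothetical sigmoid outputs and true labels for four tweets.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]

print(accuracy(probs, labels))  # 0.5
print(binary_cross_entropy(probs, labels))
```

Accuracy only looks at which side of the threshold a prediction falls on, while the cross-entropy also penalizes how confident the wrong predictions are, which is why the loss is the quantity being minimized.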

In [None]:
loss, acc = best_model.evaluate(X_test, Y_test, verbose=0)
print("Test accuracy = %0.4f" % acc)

Judging by the low training accuracies, the model was not able to fit the training data completely. This could be due to the embeddings we are using, which we can check by training a model with trainable embeddings. Note that trainable embeddings increase the training time considerably because of the extra parameters. By sacrificing some training speed, we hope the model will learn which embedding weights are most useful for predicting sentiment, so that the training accuracy rises and, hopefully, the validation and test accuracy with it. Because the training is very slow, we choose 8 LSTM units for this model but keep the embedding size of 100.
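
A back-of-the-envelope count shows where the extra parameters come from. The sketch below assumes the usual LSTM layout (four gates, each with input weights, recurrent weights and one bias vector) and the sizes used in this notebook (100,000 vocabulary entries, embedding size 100, 8 LSTM units per direction). The helper names are hypothetical; `model.summary()` gives the authoritative numbers.

```python
def embedding_params(vocab_len, embedding_size):
    """One vector of embedding_size weights per vocabulary entry."""
    return vocab_len * embedding_size

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, outputs):
    """Weights plus one bias per output."""
    return input_dim * outputs + outputs

vocab_len, embedding_size, units = 100000, 100, 8

emb = embedding_params(vocab_len, embedding_size)
bilstm = 2 * lstm_params(embedding_size, units)   # forward and backward LSTM
out = dense_params(2 * units, 1)                  # both directions are concatenated

print(emb, bilstm, out)  # 10000000 6976 17
```

Under these assumptions the embedding matrix accounts for almost all of the weights, so unfreezing it multiplies the number of trainable parameters by roughly a thousand.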

In [None]:
def TweetSentimentTrainableEmbeddingsModel(lstm_units, input_size, vocab_len, embedding_size):
    inputs = Input(shape=(input_size,), dtype='int32')
    
    X = Embedding(vocab_len, embedding_size, trainable=True)(inputs)
    X = Bidirectional(LSTM(lstm_units))(X)
    X = Dense(1, activation='sigmoid')(X)
    
    return Model(inputs=inputs, outputs=X)

In [None]:
lstm_units = 8
model = TweetSentimentTrainableEmbeddingsModel(lstm_units, max_words_in_tweet, vocabulary_len, embedding_size=100)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Since training this model is much slower, we pick the number of LSTM units based on our previous results to get a quick answer. This does not mean it is also the best number of LSTM units for this model.

**Note:** The training time is very long. To skip it, you can load the trained weights in the cell below this one.

In [None]:
filepath="models/LSTM_{}_trainable_embeddings_".format(lstm_units) + "best_weights_{epoch:02d}_epochs.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early_stop = EarlyStopping(monitor='val_acc', patience=1, min_delta=0.001, mode='max') 

model.fit(X_train, Y_train, 
          validation_data=(X_val, Y_val), 
          callbacks=[early_stop, checkpoint], 
          epochs=50, 
          batch_size=250, 
          shuffle=True)

In [None]:
filepath = "models/LSTM_8_trainable_embeddings_best_weights_02_epochs.hdf5"
model.load_weights(filepath, by_name=False)
loss, acc = model.evaluate(X_val, Y_val, verbose=0)
print("Validation accuracy = %0.4f" % acc)

In [None]:
loss, acc = model.evaluate(X_test, Y_test, verbose=0)
print("Test accuracy = %0.4f" % acc)

Using 8 LSTM units and allowing the embedding layer to be trainable, we obtained a validation accuracy of `82.09%` and a similar test accuracy of `82.18%`, meaning the model generalized well. There is a small increase in accuracy of `~3%` compared to the pretrained embeddings with 8 LSTM units. This indicates that the frozen pretrained embedding layer was holding the model back from fitting the data, so it is a possible area of improvement.

### Tentative Improvements
The model used for sentiment prediction was a simple LSTM-based RNN, tried with both pre-trained (frozen) and trainable word embeddings. We saw that allowing the embeddings to be trainable increased the accuracy of the model, but it came with a large increase in training time due to the number of parameters. Considering the improvement was around 3%, it may be worth exploring more complex or deeper models as well as looking into the data preprocessing step for improvements.

While the selection of a model is important, the preprocessing of the data can also affect performance. If our preprocessing step removes too much, the context required by the model won't be there. On the other hand, if we don't sanitize our data enough, the model might not be able to generalize from the training data to the testing data. The preprocessing step and the vocabulary size can both be tweaked to explore other results.

Making use of publicly available word embeddings is also a possibility, although given that the dataset has a very particular use of words, it would be surprising if they produce improvements.

Adding more layers to our RNN may help us fit our dataset better. These layers need not be LSTMs. For example, after the encoding step of the LSTM, we can have one or more fully-connected layers before reaching the output node. Another approach to consider is an attention mechanism, which is widely used in machine translation due to its performance with longer sequences.
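
To give a feel for what attention would add, here is a minimal plain-Python sketch of dot-product attention pooling: each time step's hidden state is scored against a query vector, the scores are turned into weights with a softmax, and the weighted sum replaces the single final LSTM state. The hidden states and the query are hypothetical toy values; in a real model both would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Weight each hidden state by its dot product with the query and sum them up."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

# Hypothetical hidden states for a 3-word tweet (2 units) and a toy query vector.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]

weights, context = attention_pool(hidden_states, query)
print(weights)
print(context)
```

The third state matches the query best and therefore receives the largest weight, which is how attention lets a model focus on the words that matter most for the prediction.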

Other models that can be used are CNNs, which are usually used for image classification and analysis but have also received attention for sequence analysis. Their convolution steps allow them to identify and group low-level patterns into higher-level patterns. In our case, they can go from character-level information to sentence-level patterns. Using such an architecture, dos Santos and Gatti reported an accuracy of 86.4% (https://www.aclweb.org/anthology/C14-1008), which suggests that tweet sentiment is hard to fit even for more intricate models.
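
The core operation of such a model can be sketched in a few lines. The example below slides a single filter over a sequence of scalar features and keeps the strongest response (max-over-time pooling). It is a deliberately simplified illustration with made-up numbers: a real text CNN works on embedding vectors and learns many filters.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence without padding and return the responses."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def max_over_time(feature_map):
    """Keep only the strongest response, regardless of where it occurred."""
    return max(feature_map)

# Hypothetical per-token features and a hand-picked filter of width 3.
sequence = [0, 1, 3, 1, 0]
kernel = [1, 2, 1]

feature_map = conv1d_valid(sequence, kernel)
print(feature_map)                 # [5, 8, 5]
print(max_over_time(feature_map))  # 8
```

Because the pooling step discards the position of the match, the same filter can detect a pattern anywhere in the tweet, which suits short texts of varying length.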