# Building Deep Learning Applications with Keras: Advanced Practices for Recurrent Neural Networks

Now we are going to do sentiment analysis in a more realistic setting. We will perform sentiment analysis on tweets referencing the first debate of the Republican Party (GOP). Sentiment analysis for this type of political event can help us understand who is reacting negatively to the debate. The people reacting with negative sentiment may be supporters of the Democratic Party, or they may be Republican supporters. But to answer these questions, we must first understand the sentiment polarity of the tweets.

Our dataset consists of 13871 tweets, and we now have three distinct labels for the tweets: **Positive**, **Negative** and **Neutral**.

However, this dataset is highly imbalanced, as can be seen in the following image:

<img src="images/tweet_data.png">

The number of tweets for each label in our dataset is:
    
* **Negative**: 8493
* **Neutral**:  3142
* **Positive**: 2236

This is a much more realistic setting than the IMDB dataset we used before. Most of the time, a dataset we collect will be imbalanced, and we will need to handle that.
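
To put numbers on the imbalance, the short sketch below turns the label counts listed above into class proportions. It also computes the accuracy of a hypothetical baseline that always predicts the most frequent label, which is a useful reference point when judging a trained model. The variable names are illustrative and not part of the course code, and the counts describe the full dataset, so the training and test splits may differ slightly.

```python
# Label counts taken from the list above (full dataset).
label_counts = {"Negative": 8493, "Neutral": 3142, "Positive": 2236}

total = sum(label_counts.values())

# Fraction of the dataset that belongs to each label.
proportions = {label: count / total for label, count in label_counts.items()}

# A classifier that always answers with the most frequent label
# would reach this accuracy without learning anything.
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = proportions[majority_label]

for label, share in proportions.items():
    print("{}: {:.1%}".format(label, share))
print("Majority-class baseline ({}): {:.1%}".format(majority_label, baseline_accuracy))
```

A model whose accuracy is close to this baseline may simply be predicting the majority label, so accuracy alone is not enough to evaluate it on this dataset.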

With that in mind, I have split this dataset into a training and a test dataset. Let's load the training dataset and take a look at it:

In [None]:
import pandas as pd

train_data_path = 'data/gop_tweets/train.csv'
train_data = pd.read_csv(train_data_path, index_col=0)

test_data_path = 'data/gop_tweets/test.csv'
test_data = pd.read_csv(test_data_path, index_col=0)

print(train_data.head())

We can turn any pandas column into an array by using its `.values` attribute. For example:

In [None]:
test_data["sentiment"].values

We can see that we have two variables: **text**, which holds the tweet, and **sentiment**, which holds the polarity of the tweet.

But before we use what we learned about recurrent models, we need to transform the data into a format that the model can understand. This takes the following steps:

* **Pre-process the tweets**: In this step, we clean the text data. We may want to turn all the words into lowercase, or remove characters that will not help us much in this task, like punctuation.
* **Tokenization**: In this step, we choose how many words to allow in our vocabulary and create our valid_words list. After that, we convert each word into its position in the valid_words list.
* **Padding**: We must make every tweet the same length by padding the shorter tweets.
* **Labels to One-Hot**: We must turn each of our labels into a one-hot encoding representation.

We will apply these steps to both the training and the test set.

Let's create a function that receives a pandas DataFrame and applies text pre-processing to it. Here is an example of a pre-processing function. It removes the retweet tag "RT", lowercases every word, and removes punctuation. Feel free to update this function as you want.

In [None]:
import re

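def pre_process_text(data):
    # Remove the retweet tag "RT" (must run before lowercasing)
    data['text'] = data['text'].apply(lambda x: re.sub(r'RT', '', x))
    # Lowercase every word
    data['text'] = data['text'].apply(lambda x: x.lower())
    # Drop everything that is not a letter, a digit, or whitespace
    data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
    return data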
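
To see what the three cleaning steps do to a single tweet, here is a minimal sketch that applies them to a plain string instead of a DataFrame column. The helper `clean_tweet` and the sample tweet are made up for illustration. The sketch also shows why the character class should use the range `A-Z`: a range written as `A-z` spans extra ASCII characters such as `_` and `^`, so those would survive the cleaning.

```python
import re

def clean_tweet(text):
    """Apply the same three steps to one string (illustrative helper)."""
    text = re.sub(r'RT', '', text)               # drop the retweet tag
    text = text.lower()                          # lowercase everything
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # keep letters, digits, whitespace
    return text

sample = "RT @GOPDebate: Great night!!! #GOPDebate"
cleaned = clean_tweet(sample)
print(repr(cleaned))

# The range A-z also covers the characters [ \ ] ^ _ ` that sit between
# 'Z' and 'a' in ASCII, so they are not removed by that pattern.
loose = re.sub(r'[^a-zA-z0-9\s]', '', 'a_b^c')
strict = re.sub(r'[^a-zA-Z0-9\s]', '', 'a_b^c')
print(loose, strict)
```

Note that removing "RT" leaves a leading space behind. That is harmless here, because splitting on whitespace during tokenization ignores it.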

Now, let's apply tokenization. Here we will define how many words we want to use for this problem and turn each word into a position id.

Create a function that receives a pandas DataFrame and turns its text column into a list of position ids.

**Hint**: Keras has a class called **Tokenizer** that will be really helpful here.

In [None]:
from keras.preprocessing.text import Tokenizer

def tokenize_texts(data, num_words):
    ###YOUR CODE HERE

    ### END YOUR CODE
    return tokens
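
To make the idea of tokenization concrete, the sketch below builds a small vocabulary by word frequency and maps each word to its position id. It uses plain Python rather than the Keras `Tokenizer`, so it illustrates the concept and is not a solution to the exercise. The function names and toy sentences are hypothetical. Ids start at 1 so that 0 stays free for padding, and words outside the vocabulary are dropped. One design point worth noting: if a fresh vocabulary were built separately for the test set, the same word could receive different ids in training and test, so this sketch builds the vocabulary once from the training texts and reuses it.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the num_words most frequent words to ids 1..num_words."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(num_words)]
    return {word: position + 1 for position, word in enumerate(most_common)}

def texts_to_ids(texts, vocab):
    """Replace each known word by its id; unknown words are skipped."""
    return [[vocab[word] for word in text.split() if word in vocab]
            for text in texts]

# Toy data, invented for this example.
train_texts = ["the debate was great", "the debate was bad", "the debate", "the"]
test_texts = ["was the moderator bad"]

vocab = build_vocab(train_texts, num_words=3)   # built from training data only
train_ids = texts_to_ids(train_texts, vocab)
test_ids = texts_to_ids(test_texts, vocab)      # same vocabulary reused

print(vocab)
print(train_ids)
print(test_ids)
```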

Now, let's apply padding to the sentence tokens we have created.

Create a function that receives the list of tweets turned into tokens and pads them.

In [None]:
from keras.preprocessing import sequence

def pad_reviews(review_tokens, maxlen):
    ###YOUR CODE HERE

    ###END YOUR CODE
    
    return pad_tokens
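
As an illustration of what padding does, the following plain-Python sketch brings every token list to the same length. It pads with zeros at the front and, for lists that are too long, keeps only the last `maxlen` tokens, which mirrors the default behavior of Keras `pad_sequences`. The helper name `pad_left` is hypothetical, and the sketch assumes `maxlen` is greater than zero.

```python
def pad_left(tokens, maxlen, value=0):
    """Return a list of exactly maxlen ids (assumes maxlen > 0)."""
    tokens = tokens[-maxlen:]                    # truncate from the front if too long
    padding = [value] * (maxlen - len(tokens))   # zeros needed to reach maxlen
    return padding + tokens

short_tweet = [1, 2, 3]
long_tweet = [1, 2, 3, 4, 5, 6]

padded_short = pad_left(short_tweet, maxlen=5)
padded_long = pad_left(long_tweet, maxlen=5)

print(padded_short)
print(padded_long)
```

Padding at the front keeps the real words next to the end of the sequence, which is the last thing the recurrent layer reads before producing its output.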

Finally, let's turn the **sentiment** column into a list of one-hot encodings. For example, given a tweet with **Neutral** sentiment, its representation as a one-hot encoding vector will be:

[0, 1, 0]

If it were **Positive**, it would be:

[0, 0, 1]

This means that a one-hot encoding creates a vector with size equal to the number of labels in the problem and places a **1** only in the position associated with the given label. Here, we have set the position of the labels based on the alphabetical order of the label names.
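
The mapping just described can be sketched in a few lines of plain Python. This is only an illustration of the encoding itself, with a hypothetical helper named `one_hot`; it does not use pandas and is not the exercise solution.

```python
# Labels sorted alphabetically, so each label gets a fixed position.
labels = sorted(["Positive", "Negative", "Neutral"])
label_position = {label: position for position, label in enumerate(labels)}

def one_hot(label):
    """Vector of zeros with a single 1 at the label's position."""
    vector = [0] * len(labels)
    vector[label_position[label]] = 1
    return vector

print(labels)
for label in labels:
    print(label, one_hot(label))
```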

Create a function that receives a pandas DataFrame and returns an array with the one-hot encoding of the **sentiment** column.

**Hint**: We can use pandas **get_dummies** to turn the labels into a one-hot encoding.

In [None]:
def label_to_one_hot(train_data):
    ###YOUR CODE HERE

    ###END YOUR CODE
    
    return one_hot

Now let's put it all together:

In [None]:
train_data = pre_process_text(train_data)
test_data = pre_process_text(test_data)

#HERE YOU WILL DEFINE THE VALUE OF THIS VARIABLE
num_words = 2000
x_train = tokenize_texts(train_data, num_words)
x_test = tokenize_texts(test_data, num_words)

#HERE YOU WILL DEFINE THE VALUE OF THIS VARIABLE
maxlen = 20
x_train = pad_reviews(x_train, maxlen)
x_test = pad_reviews(x_test, maxlen)

y_train = label_to_one_hot(train_data)
y_test = label_to_one_hot(test_data)

print(x_train[0])
print(y_train[0])

Now, let's create our model. Again, we will use the same config class and the same Recurrent Model class as before. However, feel free to add or change any option in the config class and to update the model as you want.

In [None]:
class RecurrentConfig:

    def __init__(self,
                 batch_size=32,
                 embedding_size=50,
                 num_words=2000,
                 lstm_units=64,
                 num_classes=3,
                 epochs=5):

        """
        Holds Recurrent Neural Network model hyperparams.

        :param batch_size: batch size for training
        :type batch_size: int
        
        :param embedding_size: The dimension of our embedding.
                               Recall that the embedding is a V X D matrix, where V is the number of
                               words in the vocabulary and D is the dimension of the embeddings.
                               This variable represents the D value in the embedding matrix.
        :type embedding_size:  int
        
        :param num_words: The size of our vocabulary or the number of words in the word_list variable.
        :type num_words:  int
        
        :param lstm_units: The number of units in the LSTM layer.
        :type lstm_units:  int

        :param num_classes: number of classes in the problem
        :type num_classes: int
        
        :param epochs: number of epochs
        :type epochs: int
        """
        self.batch_size = batch_size
        self.embedding_size = embedding_size
        self.num_words = num_words
        self.lstm_units = lstm_units
        self.num_classes = num_classes
        self.epochs = epochs

    def __str__(self):
        status = ''
        status += 'batch size: {}\n'.format(self.batch_size)
        status += 'embedding size: {}\n'.format(self.embedding_size)
        status += 'num words: {}\n'.format(self.num_words)
        status += 'lstm units: {}\n'.format(self.lstm_units)
        status += 'num classes: {}\n'.format(self.num_classes)
        status += 'epochs: {}\n'.format(self.epochs)

        return status
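
As a rough sanity check on these defaults, the sketch below estimates how many trainable parameters a model would have if it stacked one embedding layer, one LSTM layer, and one dense softmax layer. That architecture is an assumption made for illustration, since the exercise leaves the model open, and the function names are hypothetical. The embedding count follows the V x D matrix described in the docstring; the LSTM count uses the standard formula of four gates, each with input weights, recurrent weights, and a bias.

```python
def embedding_params(num_words, embedding_size):
    # One D-dimensional vector per vocabulary entry: V x D.
    return num_words * embedding_size

def lstm_params(input_size, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_size + units) + units)

def dense_params(input_size, num_classes):
    # Weight matrix plus one bias per class.
    return input_size * num_classes + num_classes

# Default values from RecurrentConfig.
num_words, embedding_size, lstm_units, num_classes = 2000, 50, 64, 3

counts = {
    "embedding": embedding_params(num_words, embedding_size),
    "lstm": lstm_params(embedding_size, lstm_units),
    "dense": dense_params(lstm_units, num_classes),
}
total_params = sum(counts.values())

print(counts)
print("total:", total_params)
```

Under these assumptions most of the parameters sit in the embedding matrix, so `num_words` and `embedding_size` are the first values to revisit if the model overfits.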

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense

from model.model import Model


class RecurrentModel(Model):
    
    def build_model(self):
        self.model = Sequential()
        
        ### YOUR CODE HERE

        ### END YOUR CODE
        
        self.model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        
        print(self.model.summary())

In [None]:
tweet_config = RecurrentConfig()
tweet_model = RecurrentModel(tweet_config)

tweet_model.build_model()

train_data = (x_train, y_train)
model_history = tweet_model.fit(train_data)

Now, let's generate our test predictions:

In [None]:
predictions = tweet_model.predict(x_test)

Now let's take a look at the confusion matrix from our test data.

In [None]:
%matplotlib inline

import numpy as np

from utils.utils import plot_confusion_matrix

test_predictions = np.argmax(predictions, axis=1)
test_labels = np.argmax(y_test, axis=1)

plot_confusion_matrix(test_labels, test_predictions, classes=["negative", "neutral", "positive"])
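
To make the plot easier to read, here is a minimal plain-Python sketch of how a confusion matrix is computed and how per-class recall follows from it. The labels and predictions are invented toy values, not results from the model. The sketch puts true classes on the rows and predicted classes on the columns, which is an assumption; the `plot_confusion_matrix` helper may lay things out differently.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts examples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix

def recall_per_class(matrix):
    """Share of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Toy data: 0 = negative, 1 = neutral, 2 = positive.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 0, 1, 0, 2]

toy_matrix = confusion_matrix(toy_true, toy_pred, num_classes=3)
toy_recall = recall_per_class(toy_matrix)

for row in toy_matrix:
    print(row)
print(toy_recall)
```

On an imbalanced dataset, per-class recall is often more informative than overall accuracy, because a high accuracy can hide poor results on the smaller classes.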

We can see that our model is very good at classifying negative tweets, but is not doing as well on neutral and positive tweets. In these circumstances, there are a lot of things we could try in order to get a better model:

* Add regularization: We can add regularizers such as dropout or L2 to avoid model overfitting
* We can reduce the number of units in the Recurrent Neural Net layer
* We can balance the dataset before training the model
* We can better filter our text data

Try some of these approaches and see if your model improves. Also, remember to always plot the training vs. validation curves to see how the model is behaving.
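
As one possible starting point for handling the imbalance, the sketch below computes class weights that are inversely proportional to class frequency, using the common heuristic weight = N / (K * n_c), where N is the number of tweets, K the number of classes, and n_c the count for class c. This is an alternative to resampling the dataset rather than the balancing step itself. The counts are the full-dataset numbers listed earlier, so weights computed from the training split would differ slightly. Keras `fit` accepts such a dictionary through its `class_weight` argument; whether the course's `tweet_model.fit` wrapper forwards it is an assumption you would need to check.

```python
# Class index follows the alphabetical label order used for the one-hot encoding.
class_counts = {0: 8493,   # Negative
                1: 3142,   # Neutral
                2: 2236}   # Positive

num_samples = sum(class_counts.values())
num_classes = len(class_counts)

# Rare classes get weights above 1, frequent classes below 1.
class_weight = {index: num_samples / (num_classes * count)
                for index, count in class_counts.items()}

for index, weight in class_weight.items():
    print("class {}: weight {:.4f}".format(index, weight))
```

With these weights, every class contributes the same total amount to the loss, so mistakes on neutral and positive tweets cost more than mistakes on negative ones.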