# Sentiment Classification

This notebook shows how to implement sentiment classification in TensorFlow. It is based on tweets on various subjects. Each of these tweets carries some embedded emotion. Whether positive or negative, the sentiment behind such tweets may significantly affect the brand of a company or a person.

As for the machine learning, the following concepts are covered:

- Data preprocessing
    - stopword removal
    - tokenization
    - padding
    - label encoding
- Neural Networks
    - embedding layers
    - pooling layers
    - fully-connected layers
- Callbacks
- Transfer learning

## Data

The notebook is based on the Emotions in Text dataset from Kaggle. It contains 31k tweets representing various sentiments. After building and training the neural network, you will be able to classify the sentiment of any short text as belonging to one of these categories:
- positive
- negative
- neutral

To start, [download the data](https://www.kaggle.com/c/16295/download-all) and place it in your working directory. 


In [None]:
import random

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

from utils import (AccuracyCallback, classify_sentence, create_embedding_layer,
                   load_glove_embeddings, load_text_data,
                   plot_training_progress, unpack_file)

In [None]:
# Configure GPU
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices: 
    tf.config.experimental.set_memory_growth(device, True)

In [None]:
DATA_FILE = "tweet-sentiment-extraction.zip"
DATA_DIR = "data/tweets"

unpack_file(DATA_FILE, DATA_DIR)

TRAIN_FILE = DATA_DIR + "/train.csv"
TEST_FILE = DATA_DIR + "/test.csv"

In [None]:
# Data preprocessing
SENTIMENT_TO_LABEL = {
    "positive": 2,
    "neutral": 1,
    "negative": 0,
}
LABEL_TO_SENTIMENT = {label: sentiment for sentiment, label in SENTIMENT_TO_LABEL.items()}

train_texts, train_sentiments = load_text_data(TRAIN_FILE, 2, 3, remove_stopwords=True)
train_labels = [SENTIMENT_TO_LABEL[sentiment] for sentiment in train_sentiments]
train_labels = tf.keras.utils.to_categorical(train_labels, dtype=int)    

test_texts, test_sentiments = load_text_data(TEST_FILE, 1, 2)
test_labels = [SENTIMENT_TO_LABEL[sentiment] for sentiment in test_sentiments]
test_labels = tf.keras.utils.to_categorical(test_labels, dtype=int)    
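
The label encoding above turns each integer label into a one-hot vector, which is the target format that a softmax output with categorical cross-entropy expects. As a minimal pure-Python sketch of the same idea (the helper `one_hot` is hypothetical and only illustrates what `to_categorical` produces for these three classes):

```python
def one_hot(label, num_classes):
    """Return a list with a 1 at position `label` and 0 elsewhere."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector


SENTIMENT_TO_LABEL = {"positive": 2, "neutral": 1, "negative": 0}

sentiments = ["negative", "positive", "neutral"]
encoded = [one_hot(SENTIMENT_TO_LABEL[s], 3) for s in sentiments]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```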

In [None]:
# Let's have a look at sample tweets. 

tweets_to_show = 15

for i in range(tweets_to_show):
    idx = random.randrange(0,len(train_texts))
    print("Tweet: {}".format(" ".join(train_texts[idx])))
    print("Label: {}".format(train_sentiments[idx]))
    print("")

In [None]:
# Tokenization and padding
EMBEDDING_DIM = 16    
MAX_WORDS = 100_000    # max words in the dictionary (an int, not the float 1e5)
MAX_SEQUENCE_LEN = 50  # max tweet length

tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token="<oov>")
tokenizer.fit_on_texts(train_texts)

sequences_train = tokenizer.texts_to_sequences(train_texts)
padded_train = pad_sequences(sequences_train, padding = "post", maxlen = MAX_SEQUENCE_LEN)

sequences_test = tokenizer.texts_to_sequences(test_texts)
padded_test = pad_sequences(sequences_test, padding = "post", maxlen = MAX_SEQUENCE_LEN)
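
To make the tokenization and padding step more concrete, here is a rough pure-Python sketch of what happens to a single tweet. It assumes, as in the cells above, that each text is already a list of words. The helpers `build_vocab` and `to_padded` are hypothetical simplifications: they mimic the id layout used above (0 for padding, 1 for the `<oov>` token, frequent words first) but ignore details such as lowercasing and the `num_words` limit.

```python
from collections import Counter

PAD_ID = 0  # index 0 is reserved for padding
OOV_ID = 1  # index 1 stands for out-of-vocabulary words


def build_vocab(texts):
    """Map each word to an integer id, most frequent words first."""
    counts = Counter(word for text in texts for word in text)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common())}


def to_padded(text, vocab, maxlen):
    """Convert words to ids, keep at most the last `maxlen` ids, pad at the end."""
    ids = [vocab.get(word, OOV_ID) for word in text]
    ids = ids[-maxlen:]
    return ids + [PAD_ID] * (maxlen - len(ids))


toy_texts = [["good", "day"], ["bad", "day", "today"]]
toy_vocab = build_vocab(toy_texts)

# "everyone" was never seen during fitting, so it maps to OOV_ID.
padded = to_padded(["good", "day", "everyone"], toy_vocab, 5)
print(padded)
```

Every tweet ends up with exactly `maxlen` ids, so the whole dataset can be fed to the network as one rectangular array.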

## Neural Network

Let's build a simple neural network for multiclass classification. We will include a trainable embedding layer.
It will enable the network to learn multidimensional relationships between the words in our text corpus.
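
As a rough illustration of the first two layers of such a network, the sketch below looks up one vector per word id and then averages those vectors into a single fixed-size vector per tweet. The lookup table and its values are made up for the example, and the functions `embed` and `global_average_pool` are hypothetical stand-ins, not the Keras implementations.

```python
def embed(ids, table):
    """Look up one embedding vector per word id."""
    return [table[i] for i in ids]


def global_average_pool(vectors):
    """Average the vectors element-wise over all sequence positions."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy lookup table: 3 ids (0 = padding), 2 dimensions each.
table = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]

pooled = global_average_pool(embed([1, 2, 0, 0], table))
print(pooled)  # [1.0, 2.0]
```

Note that in this sketch the padded positions take part in the average, which is also what happens in Keras when the embedding layer does not produce a mask.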

In [None]:
# Model architecture

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(len(tokenizer.word_index)+1, EMBEDDING_DIM, input_length = MAX_SEQUENCE_LEN),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation = "relu"),
    tf.keras.layers.Dense(3, activation = "softmax")
])

model.summary()

In [None]:
# Now it's time to train our model. 

tf.keras.backend.clear_session()

accuracy_callback = AccuracyCallback(0.9)

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", 
              metrics=["acc"])

history = model.fit(padded_train, 
                    train_labels, 
                    epochs=200, 
                    verbose=1, 
                    callbacks=[accuracy_callback], 
                    validation_data=(padded_test, test_labels))
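
The `AccuracyCallback` helper lives in `utils` and its code is not shown here. Assuming it stops training once the accuracy reaches the given threshold (0.9 above), the stopping rule could look roughly like the framework-free sketch below. The class `StopAtAccuracy` and the accuracy values are hypothetical; a real Keras callback would subclass `tf.keras.callbacks.Callback` and set `self.model.stop_training = True` instead of its own flag.

```python
class StopAtAccuracy:
    """Framework-free stand-in for an early-stopping callback (hypothetical)."""

    def __init__(self, target):
        self.target = target
        self.stop_training = False

    def on_epoch_end(self, epoch, logs):
        # `logs` holds the metrics reported at the end of the epoch.
        if logs.get("acc", 0.0) >= self.target:
            self.stop_training = True


callback = StopAtAccuracy(0.9)
stopped_at = None

# Made-up accuracy values standing in for a real training run.
for epoch, acc in enumerate([0.62, 0.81, 0.93, 0.95]):
    callback.on_epoch_end(epoch, {"acc": acc})
    if callback.stop_training:
        stopped_at = epoch
        break

print("stopped after epoch", stopped_at)  # stopped after epoch 2
```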

In [None]:
# Let's evaluate its performance on test data
test_metrics = model.evaluate(padded_test, test_labels)

for i in range(len(test_metrics)):
    print("Test {}: {}".format(model.metrics_names[i], test_metrics[i]))

plot_training_progress(history, "acc")
plot_training_progress(history, "loss")

## Transfer Learning

Can we do any better than that? 

Instead of training our model for long hours, let us build upon well-trained models. More specifically, let's include pre-trained word embeddings. We will work with Global Vectors for Word Representation (GloVe). These embeddings were trained by Stanford researchers on a massive amount of English text.

To start, [download the GloVe embeddings](http://nlp.stanford.edu/data/glove.6B.zip) and place them in your working directory. 
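
The GloVe archive contains plain-text files in which each line holds a word followed by the space-separated components of its vector. A minimal sketch of parsing that format is shown below. The function `parse_glove_lines` and the sample values are hypothetical; the notebook itself relies on the `load_glove_embeddings` helper from `utils`, whose implementation may differ.

```python
def parse_glove_lines(lines):
    """Build a word -> vector dictionary from GloVe-formatted text lines."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Two made-up lines in GloVe format (real vectors have 50 to 300 dimensions).
sample = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"]
vectors = parse_glove_lines(sample)
print(vectors["good"])  # [0.1, 0.2, 0.3]
```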

In [None]:
GLOVE_FILE = "glove.6B.zip"
word2vec = load_glove_embeddings(GLOVE_FILE)
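
Once the vectors are loaded, they have to be arranged so that row `i` of a matrix holds the vector of the word with tokenizer id `i`. The sketch below shows one common way to do this, under the assumption that words missing from GloVe keep an all-zero row. The function `build_embedding_matrix` is hypothetical and is not the `create_embedding_layer` helper used in this notebook.

```python
def build_embedding_matrix(word_index, word_vectors, dim):
    """Return a list of rows where row i is the vector of the word with id i."""
    # Row 0 is left as zeros because id 0 is reserved for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = word_vectors.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


toy_index = {"<oov>": 1, "good": 2, "zzz": 3}
toy_vectors = {"good": [0.1, 0.2]}

matrix = build_embedding_matrix(toy_index, toy_vectors, 2)
print(matrix)  # [[0.0, 0.0], [0.0, 0.0], [0.1, 0.2], [0.0, 0.0]]
```

A matrix like this is typically passed to an embedding layer as its initial weights, often with training of those weights switched off.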

In [None]:
# Let's design a model with a pretrained embedding layer

# Embedding layer
sentence_indices = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LEN,), dtype="int32")
embedding_layer = create_embedding_layer(tokenizer.word_index, word2vec, MAX_WORDS)
embeddings = embedding_layer(sentence_indices)

# Dense layers
x = tf.keras.layers.GlobalAveragePooling1D()(embeddings)
x = tf.keras.layers.Dense(24, activation='relu')(x)
x = tf.keras.layers.Dense(3, activation='softmax')(x)

model = tf.keras.models.Model(inputs=sentence_indices, outputs=x)

model.summary()

In [None]:
# Time to train the model based on pretrained word embeddings.
tf.keras.backend.clear_session()

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", 
              metrics=["acc"])

history = model.fit(padded_train, 
                    train_labels,
                    validation_data=(padded_test, test_labels),
                    epochs=200, 
                    verbose=1, 
                    callbacks=[accuracy_callback])

In [None]:
# Let's evaluate its performance on test data. 
# Does it do any better than when we built the embeddings from scratch?

test_metrics = model.evaluate(padded_test, test_labels)

for i in range(len(test_metrics)):
    print("Test {}: {}".format(model.metrics_names[i], test_metrics[i]))

plot_training_progress(history, "acc")
plot_training_progress(history, "loss")

We have evaluated models by accuracy and log loss. Now, let's see how our model performs on random texts.

Can it correctly classify your tweet?
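
The network outputs three probabilities per text, one for each class. Assuming `classify_sentence` returns the index of the largest one (its implementation is in `utils` and not shown here), the last step of a prediction can be sketched as follows. The function `pick_label` and the probability values are hypothetical.

```python
LABEL_TO_SENTIMENT = {0: "negative", 1: "neutral", 2: "positive"}


def pick_label(probabilities):
    """Return the index of the highest probability (argmax)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Made-up softmax output for a single sentence.
label = pick_label([0.7, 0.2, 0.1])
print(LABEL_TO_SENTIMENT[label])  # negative
```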

In [None]:
# Out of sample prediction

sentence = "I hate you!!!"
label = classify_sentence(model, tokenizer, sentence, MAX_SEQUENCE_LEN)

print("Sentiment: {}".format(LABEL_TO_SENTIMENT[label]))