<a href="https://colab.research.google.com/github/emery1189/nlp_emotion_classifier/blob/main/emotion_from_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Emotion from Text

using a [Kaggle dataset](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp), we will train a neural network to classify a piece of text into one of six emotions: anger, fear, joy, love, sadness, and surprise.

we will:
1. get the data into colab and import our tools
2. use SKLearn's `MultinomialNB` to generate a baseline
3. restructure the data to fit tensorflow's `text_dataset_from_directory()` utility
4. vectorize and embed our text
5. build model(s) and fit
6. create a `detect_emotion()` function to predict the emotion contained in a user-generated sentence

## getting data

In [1]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [2]:
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
!kaggle datasets download praveengovi/emotions-dataset-for-nlp

Downloading emotions-dataset-for-nlp.zip to /content
  0% 0.00/721k [00:00<?, ?B/s]
100% 721k/721k [00:00<00:00, 132MB/s]


In [4]:
!unzip /content/emotions-dataset-for-nlp.zip

Archive:  /content/emotions-dataset-for-nlp.zip
  inflating: test.txt                
  inflating: train.txt               
  inflating: val.txt                 


In [5]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

getting sentences and labels

In [6]:
train_sentences = []
train_labels = []

with open("/content/train.txt") as train_file:
  for line in train_file:
    line = line.split(";")
    train_sentences.append(line[0])
    train_labels.append(line[1].rstrip())


len(train_sentences), len(train_labels), train_sentences[1834], train_labels[1834]

(16000, 16000, 'i instantly feel rejected', 'sadness')

In [7]:
valid_sentences = []
valid_labels = []

with open("/content/val.txt") as valid_file:
  for line in valid_file:
    line = line.split(";")
    valid_sentences.append(line[0])
    valid_labels.append(line[1].rstrip())


len(valid_sentences), len(valid_labels), valid_sentences[134], valid_labels[134]

(2000,
 2000,
 'i was feeling frightened to the core what if my friends laughed at me what if sir was too harsh what if',
 'fear')

In [8]:
test_sentences = []
test_labels = []

with open("/content/test.txt") as test_file:
  for line in test_file:
    line = line.split(";")
    test_sentences.append(line[0])
    test_labels.append(line[1].rstrip())


len(test_sentences), len(test_labels), test_sentences[184], test_labels[184]

(2000,
 2000,
 'i feel terrified because my landlord has not changed our locks yet',
 'fear')
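
the three loading cells above differ only in the file path, so the loop could be factored into one helper. below is a minimal sketch, assuming the same `sentence;label` line format. the name `load_split` and the use of `rsplit` (so that a stray `;` inside a sentence would not break the split) are our own choices for illustration, not part of the original cells. the demo writes a tiny temporary file so the block runs anywhere.

```python
import os
import tempfile


def load_split(path):
    """Read a `sentence;label` file into two parallel lists."""
    sentences, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # split on the last ";" only, in case a sentence contains one
            sentence, label = line.rsplit(";", 1)
            sentences.append(sentence)
            labels.append(label)
    return sentences, labels


# demo on a throwaway file that reuses two lines shown in the outputs above
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("i instantly feel rejected;sadness\n")
        f.write("i feel terrified because my landlord has not changed our locks yet;fear\n")
    demo_sentences, demo_labels = load_split(sample_path)

print(demo_labels)
```

in the notebook this would be called as `train_sentences, train_labels = load_split("/content/train.txt")`, and likewise for `val.txt` and `test.txt`.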

## getting a baseline score with SKLearn's `MultinomialNB`

In [9]:
# create tokenization and modeling pipeline
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # convert words to numbers using tfidf
    ("clf", MultinomialNB())  # model the text
])

# fit the pipeline to the training data
baseline.fit(train_sentences, train_labels)

baseline_score = baseline.score(test_sentences, test_labels)
print(f"baseline model accuracy: {baseline_score*100:.2f}%")

baseline model accuracy: 64.85%


## restructuring data

per tensorflow's `text_dataset_from_directory()`, we want our data in the following structure:

main_directory/ <br>
...class_a/<br>
......a_text_1.txt<br>
......a_text_2.txt<br>
...class_b/<br>
......b_text_1.txt<br>
......b_text_2.txt<br>



first, we'll need train, test, and valid folders.

In [10]:
for dataset in ["train", "test", "valid"]:
  os.makedirs(f"/content/{dataset}", exist_ok=True)  # exist_ok lets the cell be re-run safely

now each of the datasets will need a folder for each class (emotion).

In [11]:
for emotion in ["anger", "fear", "joy", "love", "sadness", "surprise"]:
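  # exist_ok=True avoids a FileExistsError when the cell is run a second time
  os.makedirs(f"/content/test/{emotion}", exist_ok=True)
  os.makedirs(f"/content/train/{emotion}", exist_ok=True)
  os.makedirs(f"/content/valid/{emotion}", exist_ok=True)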

now we need each example of an emotion to be its own .txt file in the associated folder.

In [12]:
for emotion in ["anger", "fear", "joy", "love", "sadness", "surprise"]:
  p = 1
  for i in range(len(train_sentences)):
    if train_labels[i] == emotion:
      with open(f"/content/train/{emotion}/{emotion}_text_{p}.txt", "w") as f:
        f.write(train_sentences[i])
        p += 1

In [13]:
for emotion in ["anger", "fear", "joy", "love", "sadness", "surprise"]:
  p = 1
  for i in range(len(test_sentences)):
    if test_labels[i] == emotion:
      with open(f"/content/test/{emotion}/{emotion}_text_{p}.txt", "w") as f:
        f.write(test_sentences[i])
        p += 1

In [14]:
for emotion in ["anger", "fear", "joy", "love", "sadness", "surprise"]:
  p = 1
  for i in range(len(valid_sentences)):
    if valid_labels[i] == emotion:
      with open(f"/content/valid/{emotion}/{emotion}_text_{p}.txt", "w") as f:
        f.write(valid_sentences[i])
        p += 1
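
the three cells above repeat the same nested loop and scan every sentence once per emotion. a single-pass version could be sketched as follows. the helper name `write_split`, the per-class counter, and two of the demo sentences are hypothetical. unlike the cells above, this sketch creates a class folder only when that class actually appears in the data.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def write_split(root, sentences, labels):
    """Write each sentence to root/<label>/<label>_text_<n>.txt."""
    counters = defaultdict(int)  # per-class running file number
    for sentence, label in zip(sentences, labels):
        counters[label] += 1
        class_dir = Path(root) / label
        class_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run
        file_path = class_dir / f"{label}_text_{counters[label]}.txt"
        file_path.write_text(sentence, encoding="utf-8")
    return dict(counters)


# demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    written = write_split(
        Path(tmp) / "train",
        ["i instantly feel rejected", "i feel terrified", "i feel low"],
        ["sadness", "fear", "sadness"],
    )
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.txt"))

print(written)
print(created)
```

in the notebook the calls would look like `write_split("/content/train", train_sentences, train_labels)`, repeated for the valid and test splits.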

let's see how many of each class we have:

In [15]:
train_labels_array = np.array(train_labels)

uniques, count = np.unique(train_labels_array, return_counts=True)

print(np.asarray((uniques, count)).T)

[['anger' '2159']
 ['fear' '1937']
 ['joy' '5362']
 ['love' '1304']
 ['sadness' '4666']
 ['surprise' '572']]


In [16]:
valid_labels_array = np.array(valid_labels)

uniques, count = np.unique(valid_labels_array, return_counts=True)

print(np.asarray((uniques, count)).T)

[['anger' '275']
 ['fear' '212']
 ['joy' '704']
 ['love' '178']
 ['sadness' '550']
 ['surprise' '81']]


In [17]:
test_labels_array = np.array(test_labels)

uniques, count = np.unique(test_labels_array, return_counts=True)

print(np.asarray((uniques, count)).T)

[['anger' '275']
 ['fear' '224']
 ['joy' '695']
 ['love' '159']
 ['sadness' '581']
 ['surprise' '66']]


perhaps not an ideal balance of classes, but we work with what we have.
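
to make the imbalance concrete, the sketch below uses the counts printed above to compute two things: the accuracy of a model that always predicts the most common class (a floor any real model should beat), and inverse-frequency class weights. the weight formula `n_samples / (n_classes * class_count)` is the one scikit-learn uses for `class_weight="balanced"`. applying such weights is an option we did not try in this notebook, so treat it as a suggestion rather than part of the method.

```python
# class counts copied from the outputs above
train_counts = {"anger": 2159, "fear": 1937, "joy": 5362,
                "love": 1304, "sadness": 4666, "surprise": 572}
test_counts = {"anger": 275, "fear": 224, "joy": 695,
               "love": 159, "sadness": 581, "surprise": 66}

# accuracy of a model that always predicts the most common training class
majority_class = max(train_counts, key=train_counts.get)
majority_accuracy = test_counts[majority_class] / sum(test_counts.values())
print(f"always predicting {majority_class!r}: {majority_accuracy*100:.2f}%")

# "balanced" weights: n_samples / (n_classes * class_count)
n_samples = sum(train_counts.values())
n_classes = len(train_counts)
class_weights = {name: n_samples / (n_classes * count)
                 for name, count in train_counts.items()}
for name, weight in class_weights.items():
    print(f"{name:>8}: {weight:.2f}")
```

the first line printed is `always predicting 'joy': 34.75%`, well below the 64.85% baseline. keras' `fit()` accepts a `class_weight` dictionary keyed by integer label index, so the names here would first need to be mapped to indices.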

## creating `tf.data.Dataset`s

In [18]:
batch_size = 32
seed = 42
AUTOTUNE = tf.data.AUTOTUNE

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    '/content/train',
    batch_size=batch_size,
    validation_split=0,
    seed=seed)

class_names = raw_train_ds.class_names

raw_valid_ds = tf.keras.utils.text_dataset_from_directory(
    '/content/valid',
    batch_size=batch_size,
    validation_split=0,
    seed=seed)

raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    '/content/test',
    batch_size=batch_size,
    validation_split=0,
    seed=seed)


train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
valid_ds = raw_valid_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = raw_test_ds.cache().prefetch(buffer_size=AUTOTUNE)

Found 16000 files belonging to 6 classes.
Found 2000 files belonging to 6 classes.
Found 2000 files belonging to 6 classes.


In [19]:
for i in range(len(class_names)):
  print(f"label {i} corresponds to", raw_train_ds.class_names[i])

label 0 corresponds to anger
label 1 corresponds to fear
label 2 corresponds to joy
label 3 corresponds to love
label 4 corresponds to sadness
label 5 corresponds to surprise
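
the mapping above comes from the folder names: when labels are inferred, `text_dataset_from_directory()` assigns indices to the class subdirectories in alphanumeric order. a short sketch that reproduces the same mapping in plain python (the variable names are our own):

```python
emotions = ["sadness", "joy", "anger", "surprise", "love", "fear"]  # any order

# sorting the folder names reproduces the index -> class mapping printed above
index_to_class = dict(enumerate(sorted(emotions)))
class_to_index = {name: i for i, name in index_to_class.items()}

print(index_to_class)
# {0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}
```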


## vectorize and embed text

In [20]:
# find the average number of tokens (words) in the training sentences

round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

19

In [21]:
text_vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000,
                                                    output_mode="int",
                                                    output_sequence_length=19)

In [22]:
# make a text-only dataset (without labels) and call adapt

train_text = raw_train_ds.map(lambda x, y: x)
text_vectorizer.adapt(train_text)

In [23]:
embedding_layer = tf.keras.layers.Embedding(input_dim=10000,
                                            output_dim=64,
                                            input_length=19)

In [24]:
train_sentences[777]

'i sometimes feel resentful that this has come into our lives at this time'

In [25]:
text_vectorizer(train_sentences[777])

<tf.Tensor: shape=(19,), dtype=int64, numpy=
array([  2, 186,   3, 523,   8,  23,  99, 182, 106, 133, 684,  33,  23,
        52,   0,   0,   0,   0,   0])>
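
the trailing zeros above are padding: the sentence has 14 words and the output length is fixed at 19. to show the idea without tensorflow, here is a simplified imitation of integer vectorization in plain python. it follows the same conventions (index 0 for padding, index 1 for out-of-vocabulary words, remaining indices by descending frequency) but it is not keras' implementation. for instance, it only lowercases and splits on whitespace, and the toy corpus is hypothetical.

```python
from collections import Counter


def build_vocab(sentences, max_tokens):
    """Index 0 is padding, index 1 is out-of-vocabulary, then words by frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}


def vectorize(sentence, vocab, sequence_length):
    """Map words to ints, then truncate or right-pad with zeros to a fixed length."""
    ids = [vocab.get(w, 1) for w in sentence.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_corpus = ["i feel happy", "i feel sad today", "i am happy"]  # hypothetical
toy_vocab = build_vocab(toy_corpus, max_tokens=10)

print(vectorize("i feel great", toy_vocab, sequence_length=5))
# [2, 3, 1, 0, 0]  ("great" is unseen -> 1, then two padding zeros)
```

sentences longer than the chosen length are cut off, so with a length equal to the average (19) some longer training sentences lose their final words.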

In [26]:
embedding_layer(text_vectorizer(train_sentences[777]))

<tf.Tensor: shape=(19, 64), dtype=float32, numpy=
array([[ 0.04218544,  0.02298843,  0.01594229, ..., -0.00779108,
        -0.03963272, -0.01821632],
       [ 0.02916528,  0.04562428, -0.04782431, ..., -0.00677817,
        -0.02804784,  0.00100296],
       [-0.03272986, -0.01396491,  0.00617515, ..., -0.01699731,
         0.01951266,  0.02484052],
       ...,
       [-0.0076418 ,  0.01530791, -0.02656021, ...,  0.00044016,
         0.04240118, -0.01912725],
       [-0.0076418 ,  0.01530791, -0.02656021, ...,  0.00044016,
         0.04240118, -0.01912725],
       [-0.0076418 ,  0.01530791, -0.02656021, ...,  0.00044016,
         0.04240118, -0.01912725]], dtype=float32)>
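
notice that the last rows of the output above are identical: those positions are padding (token id 0), and every occurrence of the same id looks up the same row. conceptually an embedding layer is a lookup table, which the following toy sketch imitates in plain python. the sizes and the random range are assumptions chosen for illustration, not values read from the layer.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # toy sizes; the notebook uses 10000 and 64

# one randomly initialised row per token id
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

token_ids = [2, 3, 1, 0, 0]  # a padded sequence of token ids
embedded = [embedding_table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))  # sequence length, embedding dim
print(embedded[3] == embedded[4])       # both padding positions share row 0
```

during training the rows of the table are updated like any other weights, which is how words used in similar contexts can end up with similar vectors.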

## models

In [27]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                                  patience=7,
                                                  mode="auto",
                                                  restore_best_weights=True)

reduce_LOR = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                                  patience=5,
                                                  verbose=1)
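
to see what `patience=7` means in practice, here is a simplified simulation of the early-stopping rule: training stops once the monitored value has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are kept. this is a sketch of the idea only. it ignores options such as `min_delta`, and the accuracy values are hypothetical.

```python
def simulate_early_stopping(val_accuracy, patience):
    """Return (last_epoch_run, best_epoch), both 1-indexed."""
    best_value = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best_value:
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; weights from best_epoch are restored
    return len(val_accuracy), best_epoch


# hypothetical validation accuracies that peak at epoch 3
history = [0.70, 0.78, 0.82, 0.81, 0.82, 0.80, 0.81, 0.80, 0.79, 0.80, 0.78, 0.77]
stopped_at, best = simulate_early_stopping(history, patience=7)
print(stopped_at, best)  # 10 3
```

`ReduceLROnPlateau` follows a similar counting rule with its own `patience=5`, so the learning rate is lowered a couple of epochs before early stopping would trigger.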

### dense model with `Flatten()` layer

In [28]:
# build (Sequential)
model = tf.keras.Sequential([
    text_vectorizer,
    embedding_layer,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation="softmax")
])

# compile
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

# fit
model.fit(train_ds,
          epochs=50,
          validation_data=valid_ds,
          callbacks=[early_stopping, reduce_LOR])

# evaluate
model.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 8: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 9/50
Epoch 10/50


[0.5691350102424622, 0.8195000290870667]

82% on our test data. not bad!

### dense model with `GlobalAveragePooling1D` layer

In [29]:
# build
model = tf.keras.Sequential([
    text_vectorizer,
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(6, activation="softmax")
])

# compile
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

# fit
model.fit(train_ds,
          epochs=50,
          validation_data=valid_ds,
          callbacks=[early_stopping, reduce_LOR])

# evaluate
model.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 10: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 11/50
Epoch 12/50


[0.5599210262298584, 0.8274999856948853]

83% on our test data. moving in the right direction!
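
the two dense models differ only in how they collapse the `(19, 64)` embedding output before the final layer. the arithmetic below (plain python, using the sizes from this notebook) compares the number of features each approach hands to the output layer, and includes a toy version of average pooling. the helper names are our own.

```python
sequence_length, embedding_dim, n_classes = 19, 64, 6  # values used above


def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


# Flatten concatenates all 19 token vectors into one long vector
flatten_features = sequence_length * embedding_dim
# GlobalAveragePooling1D averages over the 19 positions, keeping one vector
pooled_features = embedding_dim

print("Flatten ->", flatten_features, "features,",
      dense_params(flatten_features, n_classes), "output-layer parameters")
print("GlobalAveragePooling1D ->", pooled_features, "features,",
      dense_params(pooled_features, n_classes), "output-layer parameters")


def global_average_pool(sequence):
    """Average a list of equal-length vectors position by position."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


print(global_average_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```

so the pooled model reaches a similar score with 390 output-layer parameters instead of 7302, and its output no longer depends on where in the sentence a word appears.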

### LSTM model

In [30]:
# build (Functional)

inputs = tf.keras.layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding_layer(x)
x = tf.keras.layers.LSTM(units=64, return_sequences=True)(x)
x = tf.keras.layers.LSTM(64)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(6, activation="softmax")(x)

lstm_model = tf.keras.Model(inputs, outputs)

# compile
lstm_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=["accuracy"])

# fit
lstm_model.fit(train_ds,
               epochs=50,
               validation_data=valid_ds,
               callbacks=[early_stopping, reduce_LOR])

# evaluate
lstm_model.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 8: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 14: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 15/50
Epoch 16/50


[1.1735920906066895, 0.796999990940094]

80% on our test data, slightly below the dense models.

### GRU model

In [31]:
# build
inputs = tf.keras.layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding_layer(x)
x = tf.keras.layers.GRU(64, return_sequences=True)(x) # if stacking recurrent layers, use return_sequences=True
x = tf.keras.layers.GRU(64)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(6, activation="softmax")(x)
model_GRU = tf.keras.Model(inputs, outputs)


# compile
model_GRU.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=["accuracy"])

# fit
model_GRU.fit(train_ds,
              epochs=50,
              validation_data=valid_ds,
              callbacks=[early_stopping, reduce_LOR])

# evaluate
model_GRU.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 14: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 22: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 32: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.
Epoch 33/50
Epoch 34/50


[1.6014018058776855, 0.8169999718666077]

82% on our test data.
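
the comment about `return_sequences=True` deserves a short illustration. a recurrent layer produces one state per time step; by default only the last state is returned, but a second recurrent layer needs the whole sequence as its input. the toy one-unit recurrence below is not an LSTM or a GRU, and its update rule and input values are made up for illustration. it only shows the difference in what gets returned.

```python
import math


def toy_recurrent_layer(sequence, return_sequences=False):
    """A one-unit toy recurrence, h_t = tanh(x_t + 0.5 * h_{t-1})."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(x + 0.5 * h)
        states.append(h)
    return states if return_sequences else states[-1]


inputs = [0.1, -0.3, 0.7, 0.2]  # hypothetical 4-step sequence

all_states = toy_recurrent_layer(inputs, return_sequences=True)  # one value per step
last_state = toy_recurrent_layer(inputs)                         # final step only

# stacking: the second layer needs a sequence, so the first must return all steps
stacked_output = toy_recurrent_layer(all_states)

print(len(all_states), last_state == all_states[-1])  # 4 True
```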

### GRU with `Dropout()` layer

In [32]:
# build
inputs = tf.keras.layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding_layer(x)
x = tf.keras.layers.GRU(64, return_sequences=True)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.GRU(64)(x)
x = tf.keras.layers.Dense(64)(x)
x = tf.keras.layers.Dropout(0.1)(x)
outputs = tf.keras.layers.Dense(6, activation="softmax")(x)

GRU_model_2 = tf.keras.Model(inputs, outputs)

# compile
GRU_model_2.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                            optimizer=tf.keras.optimizers.Adam(),
                            metrics=["accuracy"])

# fit
GRU_model_2.fit(train_ds,
                epochs=50,
                validation_data=valid_ds,
                callbacks=[early_stopping, reduce_LOR])

# evaluate
GRU_model_2.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 11: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 20: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 21/50
Epoch 22/50


[1.3717806339263916, 0.8109999895095825]

81% on our test data.

### Conv1D model

In [34]:
# build
inputs = tf.keras.layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding_layer(x)
x = tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu")(x)
x = tf.keras.layers.GlobalMaxPool1D()(x)
outputs = tf.keras.layers.Dense(6, activation="softmax")(x)
model_conv1d = tf.keras.Model(inputs, outputs)


# compile
model_conv1d.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=["accuracy"])

# fit
model_conv1d.fit(train_ds,
                 epochs=50,
                 validation_data=valid_ds,
                 callbacks=[early_stopping, reduce_LOR])

# evaluate
model_conv1d.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 7: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 8/50
Epoch 9/50


[0.6569477915763855, 0.8169999718666077]

82% on our test data.
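
for intuition about the shapes in the Conv1D model: with `kernel_size=5` and keras' default of no padding and stride 1, the filter slides over 19 - 5 + 1 = 15 windows of five consecutive tokens, and `GlobalMaxPool1D` then keeps the strongest response of each filter across those windows. a plain-python sketch of both steps (the helper names and the toy feature map are our own):

```python
def conv1d_output_length(sequence_length, kernel_size, stride=1):
    """Number of window positions without padding."""
    return (sequence_length - kernel_size) // stride + 1


def global_max_pool(sequence):
    """Keep the largest value of each feature across all positions."""
    return [max(column) for column in zip(*sequence)]


positions = conv1d_output_length(sequence_length=19, kernel_size=5)
print(positions)  # 15 windows of 5 consecutive tokens

# toy feature map: 3 positions x 2 filters (hypothetical values)
feature_map = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
print(global_max_pool(feature_map))  # [0.4, 0.9]
```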

### model with Universal Sentence Encoder

In [35]:
# load in USE layer from TF Hub
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # the input to the USE is variable length, so we leave input_shape empty
                                        dtype=tf.string,
                                        name="USE")

# build model
model_USE = tf.keras.Sequential([
    sentence_encoder_layer,
    tf.keras.layers.Dense(6, activation="softmax")
])

# compile
model_USE.compile(loss="sparse_categorical_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=["accuracy"])

# fit
model_USE.fit(train_ds,
              epochs=50,
              validation_data=valid_ds,
              callbacks=[early_stopping, reduce_LOR])

# evaluate
model_USE.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 15: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 16/50
Epoch 17/50


[0.9944573044776917, 0.6324999928474426]
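
to compare the runs side by side, the snippet below collects the test accuracies printed above (rounded to four decimals) and ranks them. the display names are our own labels for the models.

```python
# test accuracies copied from the outputs above
results = {
    "MultinomialNB baseline": 0.6485,
    "dense + Flatten": 0.8195,
    "dense + GlobalAveragePooling1D": 0.8275,
    "LSTM": 0.7970,
    "GRU": 0.8170,
    "GRU + Dropout": 0.8110,
    "Conv1D": 0.8170,
    "Universal Sentence Encoder": 0.6325,
}

# sort from highest to lowest accuracy and print an aligned table
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{name:<32}{accuracy*100:6.2f}%")

best_model_name = ranked[0][0]
```

two caveats when reading these numbers. the same `text_vectorizer` and `embedding_layer` objects are reused in every model, so later models start from embeddings that earlier runs already trained, and the comparison is not strictly like-for-like. and the USE model, at 63%, lands slightly below the naive bayes baseline. one possible factor is that `hub.KerasLayer` is not trainable by default, so only the final dense layer was fitted.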

### recreating our best model

In [38]:
model = tf.keras.Sequential([
    text_vectorizer,
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(6, activation="softmax")
])

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

model.fit(train_ds,
          epochs=50,
          validation_data=valid_ds,
          callbacks=[early_stopping, reduce_LOR])


model.evaluate(test_ds)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 7: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 8/50
Epoch 9/50


[0.5902280807495117, 0.8264999985694885]

### examining model preds

In [39]:
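# predict on the raw test sentences rather than on test_ds: text_dataset_from_directory
# shuffles and reads files folder by folder, so test_ds does not line up with test_sentences
model_preds = model.predict(tf.constant(test_sentences))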


# let's take a look at one of our preds

model_preds[13]



array([0.00814457, 0.00919383, 0.89692247, 0.01242341, 0.00296762,
       0.0703481 ], dtype=float32)

In [40]:
model_preds[13].argmax(), class_names[model_preds[13].argmax()], test_labels[13], test_sentences[13]

(2,
 'joy',
 'joy',
 'i just feel extremely comfortable with the group of people that i dont even need to hide myself')
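
each prediction is a vector of six probabilities, one per class, and `argmax` simply picks the largest. ranking the whole vector also shows the runner-up, which can be informative when the model is unsure. a plain-python sketch using the probabilities printed above:

```python
class_names = ["anger", "fear", "joy", "love", "sadness", "surprise"]

# probabilities copied from the prediction printed above
probabilities = [0.00814457, 0.00919383, 0.89692247,
                 0.01242341, 0.00296762, 0.0703481]

# pair each class with its probability and sort from most to least likely
ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)

top_class, top_probability = ranked[0]
print(top_class, f"{top_probability:.2f}")  # joy 0.90
print([name for name, _ in ranked[:2]])     # ['joy', 'surprise']
```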

In [41]:
my_sample_sentence = "I went to the store and got some really great stuff"

In [43]:
sample_pred = model.predict([my_sample_sentence])

print(f"predicted emotion: {class_names[sample_pred.argmax()]}")

predicted emotion: joy


## creating a `detect_emotion()` function

In [44]:
emotion_dict = {"anger" : "😡",
                "fear" : "😬",
                "joy" : "😁",
                "love" : "❤️",
                "sadness" : "😢",
                "surprise" : "🤯"}

In [45]:
from IPython.display import clear_output

def detect_emotion():
  user_text = input("please enter a sentence: ")
  prediction = model.predict([user_text])
  clear_output()
  emotion = class_names[prediction.argmax()]
  print(f"\n{user_text}\n\nemotion detected in your sentence: {emotion}.\n          {emotion_dict[emotion]}")

In [46]:
detect_emotion()


the sun is shining bright in every direction

emotion detected in your sentence: joy.
          😁


In [47]:
detect_emotion()


i find myself terrified of rainbows.

emotion detected in your sentence: fear.
          😬


In [49]:
detect_emotion()


i am glad to be at the end of another project

emotion detected in your sentence: joy.
          😁
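
because `detect_emotion()` reads from `input()` and prints its result, it is awkward to reuse or test. one possible variant separates the prediction from the interaction: it takes the text and a prediction function as arguments and returns the result. the names `describe_emotion` and `fake_predict` are hypothetical, and `fake_predict` is only a stand-in so the sketch runs without a trained model.

```python
class_names = ["anger", "fear", "joy", "love", "sadness", "surprise"]
emotion_dict = {"anger": "😡", "fear": "😬", "joy": "😁",
                "love": "❤️", "sadness": "😢", "surprise": "🤯"}


def describe_emotion(text, predict_fn):
    """Return (emotion, emoji); predict_fn maps a list of strings to rows of probabilities."""
    probabilities = list(predict_fn([text])[0])
    emotion = class_names[probabilities.index(max(probabilities))]
    return emotion, emotion_dict[emotion]


def fake_predict(texts):
    """Stand-in for model.predict: always returns a 'joy'-heavy distribution."""
    return [[0.01, 0.01, 0.90, 0.02, 0.01, 0.05] for _ in texts]


print(describe_emotion("the sun is shining bright in every direction", fake_predict))
```

in the notebook, `model.predict` would be passed in place of `fake_predict`.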
