# NLP With TensorFlow/Keras

Reference: https://medium.com/geekculture/nlp-with-tensorflow-keras-explanation-and-tutorial-cae3554b1290

## Main Concepts

### Tokenization
* Splits sentence into tokens (often words)
* Remove unimportant chars like punctuation
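
The two steps above can be sketched with a regular expression. This is a minimal illustration using only the standard library, not the tokenizer used later in this notebook; the function name `tokenize` and the choice to keep apostrophes are assumptions made for the example.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and return its word tokens, dropping punctuation."""
    # Runs of letters, digits and apostrophes are tokens; anything else separates them.
    return re.findall(r"[a-z0-9']+", sentence.lower())


print(tokenize("I didn't feel humiliated!"))
# ['i', "didn't", 'feel', 'humiliated']
```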

### Stop Word Removal
* Remove words that carry little meaning, such as "and", "to" and "the"; which words count as irrelevant depends on the purpose of the model
* Can improve model accuracy by reducing noise in the training data
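
A minimal sketch of stop word filtering, assuming the text has already been split into tokens. The three-word `STOP_WORDS` set is illustrative only; real stop word lists are much longer and task dependent.

```python
STOP_WORDS = frozenset({"and", "to", "the"})  # illustrative list, not exhaustive


def remove_stop_words(tokens: list[str], stop_words: frozenset[str] = STOP_WORDS) -> list[str]:
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["i", "went", "to", "the", "shop", "and", "back"]))
# ['i', 'went', 'shop', 'back']
```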

### Stemming
* "waiting" and "waited" become "wait"
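
Stemming is usually done by stripping suffixes with fixed rules. The sketch below is a deliberately naive suffix stripper, not the Porter algorithm or any library's stemmer; the `SUFFIXES` tuple and the minimum stem length of 3 are assumptions chosen for the example.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule set, checked in this order
MIN_STEM_LENGTH = 3            # avoid reducing short words such as "sing" to "s"


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


print(naive_stem("waiting"), naive_stem("waited"), naive_stem("sing"))
# wait wait sing
```

Rule-based stemming is crude: this sketch turns "running" into "runn", which is why production stemmers carry many more rules.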

### Lemmatization
* normalise to base form: "went" -> "go"
* "better" -> "good"
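
Unlike stemming, lemmatization needs vocabulary knowledge, because irregular forms cannot be derived by stripping suffixes. A minimal sketch using a lookup table follows; the `LEMMAS` dictionary is a tiny hand-written assumption, whereas real lemmatizers use full dictionaries and part-of-speech information.

```python
# Hypothetical lookup table covering a few irregular English forms.
LEMMAS = {
    "went": "go",
    "gone": "go",
    "better": "good",
    "was": "be",
}


def lemmatize(word: str) -> str:
    """Return the base form if it is known, otherwise the lowercased word."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


print(lemmatize("went"), lemmatize("Better"), lemmatize("tweet"))
# go good tweet
```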

## Topic Modelling
* Unsupervised learning
* Groups texts under certain subjects

## Tutorial: Detect Text Emotion

Dataset: English Twitter messages https://huggingface.co/datasets/emotion
* The `nlp` module was originally used to import the data; the code below uses its successor, `datasets`.

In [2]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# import nlp  # https://pypi.org/project/nlp/
import datasets
import random
# from tensorflow.keras.preprocessing.text import Tokenizer
# from tensorflow.keras.preprocessing.sequence import pad_sequences



### Import and Prepare Data

In [3]:
# Original dataset loading using nlp failed because file train.txt had been deleted from Dropbox.
# Below is from https://huggingface.co/datasets/dair-ai/emotion?library=datasets
# split_ds = datasets.load_dataset("dair-ai/emotion", "split")
# unsplit_ds = datasets.load_dataset("dair-ai/emotion", "unsplit")

# Updated to use https://huggingface.co/datasets/dair-ai/emotion
DATASET = "dair-ai/emotion"
LABEL_MAP = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise"
}

In [4]:
# train = split_ds["train"]
# val = split_ds["validation"]
# test = split_ds["test"]
# From https://huggingface.co/docs/hub/datasets-usage
train_dataset = datasets.load_dataset(DATASET, split="train")
valid_dataset = datasets.load_dataset(DATASET, split="validation")
test_dataset = datasets.load_dataset(DATASET, split="test")

In [5]:
def get_tweet(data: datasets.arrow_dataset.Dataset) -> tuple[list[str], list[str]]:
    """Splits a data split into its tweets and labels."""
    tweets = [x["text"] for x in data]
    labels = [LABEL_MAP[x["label"]] for x in data]
    return tweets, labels

In [6]:
tweets, labels = get_tweet(train_dataset)
print(tweets[0], labels[0])

i didnt feel humiliated sadness


### Tokenization

Assign each word a number based on how commonly it appears in the dataset.

In [7]:
# Uses the deprecated https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="<UNK>")
tokenizer.fit_on_texts(tweets)  # Calibrate to the training data.
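
To make the frequency-based numbering concrete, here is a standard-library sketch of the idea. It is an approximation written for illustration, not the Keras implementation: it splits on whitespace only, and the helper names `build_word_index` and `texts_to_ids` are hypothetical. As in the notebook's output, index 1 is reserved for out-of-vocabulary words and index 0 is left free for padding.

```python
from collections import Counter

OOV_TOKEN = "<UNK>"  # same placeholder the notebook passes as oov_token
OOV_INDEX = 1        # index 0 is left free for padding


def build_word_index(texts: list[str]) -> dict[str, int]:
    """Rank words by frequency: the most common word gets index 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {OOV_TOKEN: OOV_INDEX}
    for rank, (word, _count) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_ids(texts: list[str], word_index: dict[str, int]) -> list[list[int]]:
    """Replace each word with its index; unseen words map to OOV_INDEX."""
    return [
        [word_index.get(word, OOV_INDEX) for word in text.lower().split()]
        for text in texts
    ]


corpus = ["i feel happy", "i feel sad", "i am here"]
word_index = build_word_index(corpus)
print(word_index["i"], word_index["feel"])         # 2 3
print(texts_to_ids(["i feel great"], word_index))  # [[2, 3, 1]]
```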

### Make all Sequences Same Shape

The ML model expects inputs to be a fixed shape and length.

Turn all tweets into sequences of length `MAXLEN`, padding shorter ones with zeros and cutting off extra words.

In [8]:
VOCAB_SIZE = 10000
MAXLEN = 50
def get_sequences(tokenizer: tf.keras.preprocessing.text.Tokenizer, tweets: list[str]) -> np.ndarray:
    """Converts tweets to integer sequences, padded or truncated at the end to MAXLEN."""
    sequences = tokenizer.texts_to_sequences(tweets)
    padded = tf.keras.utils.pad_sequences(sequences, truncating="post", padding="post", maxlen=MAXLEN)
    return padded

# Because the Tokenizer is now deprecated, prepare a vectorizer layer for the model.
vec_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAXLEN,
)
vec_layer.adapt(tweets)

In [9]:
padded_train_seq = get_sequences(tokenizer, tweets)

In [10]:
print(tweets[0], len(padded_train_seq[0]))
padded_train_seq[0]

i didnt feel humiliated 50


array([  2, 139,   3, 679,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=int32)
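
The effect of `padding="post"` and `truncating="post"` can be reproduced in a few lines of plain Python. This is a sketch of the behaviour for illustration, not the Keras implementation; `pad_post` is a hypothetical helper name.

```python
def pad_post(sequences: list[list[int]], maxlen: int, value: int = 0) -> list[list[int]]:
    """Cut each sequence to maxlen from the end, then pad the end with `value`."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])  # "post" truncation drops trailing tokens
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded


print(pad_post([[2, 139, 3, 679], [5, 6, 7, 8, 9, 10, 11]], maxlen=6))
# [[2, 139, 3, 679, 0, 0], [5, 6, 7, 8, 9, 10]]
```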

### Preparing Data for Model

Create a set for all the labels

In [11]:
index_to_class = LABEL_MAP.copy()
classes = set(index_to_class.values())
class_to_index = {
    c: i
    for i, c in index_to_class.items()
}
def names_to_ids(labels: list[str]) -> np.ndarray:
    return np.array([class_to_index[x] for x in labels])
train_labels = names_to_ids(labels)

In [12]:
class_to_index

{'sadness': 0, 'joy': 1, 'love': 2, 'anger': 3, 'fear': 4, 'surprise': 5}

### Model Definition

* 1 embedding layer.  https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
* 2 bidirectional LSTM layers, which read the sequence both forwards and backwards so each position sees context from both sides.  https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* 1 dense layer for output.

In [16]:
common_layers = [
    tf.keras.layers.Embedding(VOCAB_SIZE, 16, input_length=MAXLEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    tf.keras.layers.Dense(6, activation="softmax")
]

old_model = tf.keras.models.Sequential(common_layers)

vec_model = tf.keras.models.Sequential([vec_layer, *common_layers])
    
model = old_model
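
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes default Keras layer settings (one bias vector per LSTM gate, biases enabled on the dense layer); the helper function names are made up for this example.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim: int, units: int) -> int:
    """A bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


VOCAB_SIZE, EMBED_DIM, LSTM_UNITS, NUM_CLASSES = 10000, 16, 20, 6

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),   # 160000
    "bilstm_1": bilstm_params(EMBED_DIM, LSTM_UNITS),       # 5920
    # The second layer receives forward and backward outputs concatenated.
    "bilstm_2": bilstm_params(2 * LSTM_UNITS, LSTM_UNITS),  # 9760
    "dense": dense_params(2 * LSTM_UNITS, NUM_CLASSES),     # 246
}
total = sum(counts.values())
print(counts, total)  # total is 175926
```

Almost all of the parameters sit in the embedding layer, so vocabulary size and embedding dimension dominate the model size.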

#### Model Compilation:
* use the Adam optimiser https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
* loss function sparse categorical cross-entropy https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy
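
Sparse categorical cross-entropy takes the integer class label directly and returns the negative log of the probability the model assigned to that class. The sketch below, written with the standard library only, shows that it gives the same value as ordinary categorical cross-entropy on a one-hot label; the function names and the probability vector are made up for illustration.

```python
import math


def sparse_categorical_crossentropy(probs: list[float], label: int) -> float:
    """Loss for one sample when the label is an integer class index."""
    return -math.log(probs[label])


def categorical_crossentropy(probs: list[float], one_hot: list[float]) -> float:
    """Loss for one sample when the label is a one-hot vector."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))


probs = [0.7, 0.1, 0.05, 0.05, 0.05, 0.05]  # hypothetical softmax output
sparse_loss = sparse_categorical_crossentropy(probs, label=0)
dense_loss = categorical_crossentropy(probs, one_hot=[1, 0, 0, 0, 0, 0])
print(round(sparse_loss, 4), round(dense_loss, 4))  # 0.3567 0.3567
```

The sparse variant suits this notebook because the labels are stored as integers 0 to 5 rather than as one-hot vectors.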

In [17]:
model.compile(
     loss="sparse_categorical_crossentropy",
     optimizer="adam",
     metrics=["accuracy"]
)

### Training
* use an `EarlyStopping` callback to halt training when validation accuracy has not improved for 2 consecutive epochs
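
The stopping rule can be sketched as a counter that resets whenever the monitored metric improves. This approximates the idea behind the callback's `patience` setting and is not the Keras source; the function name and the accuracy history are hypothetical.

```python
from typing import Optional


def early_stopping_epoch(val_scores: list[float], patience: int = 2) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


history = [0.66, 0.80, 0.84, 0.87, 0.87, 0.89, 0.88, 0.885]  # made-up values
print(early_stopping_epoch(history))  # 8
```

In this history the plateau at epoch 5 does not trigger a stop because epoch 6 improves again; only epochs 7 and 8, both below the best score of 0.89, exhaust the patience.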

In [18]:
val_tweets, val_labels = get_tweet(valid_dataset)
val_seq = get_sequences(tokenizer, val_tweets)
val_labels = names_to_ids(val_labels)
h = model.fit(
     padded_train_seq, train_labels,
     validation_data=(val_seq, val_labels),
     epochs=20,
     callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=2)]
)

Epoch 1/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 30s 45ms/step - accuracy: 0.3816 - loss: 1.5293 - val_accuracy: 0.6605 - val_loss: 1.0141
Epoch 2/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 22s 43ms/step - accuracy: 0.7384 - loss: 0.6994 - val_accuracy: 0.8045 - val_loss: 0.5934
Epoch 3/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 21s 42ms/step - accuracy: 0.8707 - loss: 0.3726 - val_accuracy: 0.8400 - val_loss: 0.4751
Epoch 4/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 22s 44ms/step - accuracy: 0.9160 - loss: 0.2554 - val_accuracy: 0.8670 - val_loss: 0.4214
Epoch 5/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 22s 43ms/step - accuracy: 0.9427 - loss: 0.1756 - val_accuracy: 0.8670 - val_loss: 0.4131
Epoch 6/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 21s 42ms/step - accuracy: 0.9584 - loss: 0.1429 - val_accuracy: 0.8855 - val_loss: 0.3949
Epoch 7/20
...

### Model Evaluation

In [19]:
test_tweets, test_labels = get_tweet(test_dataset)
test_seq = get_sequences(tokenizer, test_tweets)
test_labels = names_to_ids(test_labels)

Evaluate model accuracy against test data.
* `metrics_value` corresponds to the `metrics=["accuracy"]` argument given during model compilation.

In [20]:
loss_value, metrics_value = model.evaluate(test_seq, test_labels)

63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.8839 - loss: 0.3977


In [21]:
loss_value, metrics_value

(0.4076148271560669, 0.8794999718666077)

#### Random model sampling
Pick a random tweet from the test set and predict its class.

In [22]:
i = random.randint(0,len(test_labels)-1)
print('Sentence:', test_tweets[i])
print('Emotion:', index_to_class[test_labels[i]])
p = model.predict(np.expand_dims(test_seq[i], axis=0))[0]
print(test_seq[i])
pred_class=index_to_class[np.argmax(p).astype('uint8')]
print('Predicted Emotion: ', pred_class)

Sentence: i found myself feeling inhibited and shushing her quite a lot
Emotion: sadness
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 815ms/step
[   2  323   51    8 1067    4    1   68  157    7  159    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]
Predicted Emotion:  sadness
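
The prediction step reduces to taking the index of the largest probability and looking it up in the label map. Here is a standard-library sketch of that step, including the softmax that turns raw scores into probabilities; the function names and the score vector are invented for illustration.

```python
import math

LABEL_MAP = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(probs: list[float]) -> str:
    """Return the class name with the highest probability."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best_index]


probs = softmax([0.2, 0.5, -1.0, 3.0, 0.4, -0.5])  # hypothetical raw scores
print(predict_label(probs))  # anger
```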


#### Classifying an Input Sentence

In [23]:
sentence = 'goodbye frustration'
sequence = tokenizer.texts_to_sequences([sentence])
paddedSequence = tf.keras.utils.pad_sequences(sequence, truncating = 'post', padding='post', maxlen=MAXLEN)
p = model.predict(np.expand_dims(paddedSequence[0], axis=0))[0]
pred_class=index_to_class[np.argmax(p).astype('uint8')]
print('Sentence:', sentence)
print('Predicted Emotion: ', pred_class)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 76ms/step
Sentence: goodbye frustration
Predicted Emotion:  anger


### Saving the Model
The original tutorial saved the model in Hierarchical Data Format 5 (`.h5`); the code below uses the newer native Keras format (`.keras`) instead.
See https://medium.com/@mysterious_obscure/a-deep-dive-into-model-files-pkl-pt-h5-and-the-magic-of-machine-learning-740768317e76

Formats:
* `.pkl` = Pickled Python Objects - used by `scikit-learn`
* `.pt` = PyTorch Tensors - stores architecture and learned params as tensors
* `.h5` = Hierarchical Data Format 5 - stores architecture, learned params and training configuration (loss and optimiser state, not the training data itself).

In [24]:
# Originally for Google Colab and Drive.
# from google.colab import drive
# drive.mount("/content/drive")
# path = "/content/drive/My Drive/TweetEmotionRecognition/h5/tweet_model.h5"
# path = "/tmp/tweet_model.h5"
# model.save(path)
# Save in more modern keras format.
path = "/tmp/tweet_model.keras"
model.save(path)

### Load Model

In [25]:
loaded_model = tf.keras.models.load_model(path)
loaded_model.summary()  # summary() prints the table itself and returns None, so do not wrap it in print().
