# NLP With TensorFlow/Keras

Reference: https://medium.com/geekculture/nlp-with-tensorflow-keras-explanation-and-tutorial-cae3554b1290

## Main Concepts

### Tokenization
* Splits a sentence into tokens (often words)
* Removes unimportant characters such as punctuation
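
As a rough illustration of the two bullets above, here is a minimal tokenizer sketch using only the standard library. The function name `simple_tokenize` and the choice to keep apostrophes are assumptions made for this example; real tokenizers, including the Keras one used later, apply their own rules.

```python
import re

def simple_tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep only runs of letters, digits and apostrophes."""
    # Anything that is not matched (spaces, commas, exclamation marks...) is dropped.
    return re.findall(r"[a-z0-9']+", sentence.lower())

tokens = simple_tokenize("I didn't feel humiliated, honestly!")
print(tokens)  # ['i', "didn't", 'feel', 'humiliated', 'honestly']
```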

### Stop Word Removal
* Removes words that carry little meaning, such as "and", "to", "the"; which words count as irrelevant may depend on the purpose of the model
* Can improve model accuracy by reducing noise in the training data
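
A minimal sketch of stop word removal on an already tokenized sentence. The `STOP_WORDS` set is a deliberately tiny, hypothetical list chosen for illustration; NLP libraries ship much longer ones, and the right list depends on the task.

```python
# Hypothetical, deliberately small stop word list for illustration only.
STOP_WORDS = frozenset({"and", "to", "the", "a", "of"})

def remove_stop_words(tokens: list[str], stop_words: frozenset = STOP_WORDS) -> list[str]:
    """Return the tokens that are not in the stop word list, preserving order."""
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words(["i", "went", "to", "the", "shop", "and", "cried"])
print(filtered)  # ['i', 'went', 'shop', 'cried']
```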

### Stemming
* "waiting" and "waited" become "wait"
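
The idea can be sketched with a naive suffix stripper. This is only an illustration under simple assumptions (the `SUFFIXES` tuple and the minimum stem length of 3 are made up for this example); established stemmers such as the Porter algorithm use far more careful rules.

```python
# Hypothetical suffix list; checked in order, first match wins.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["waiting", "waited", "waits", "wait"]]
print(stems)  # ['wait', 'wait', 'wait', 'wait']
```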

### Lemmatization
* Normalises words to their base (dictionary) form: "went" -> "go"
* "better" -> "good"
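
Irregular forms such as "went" cannot be reached by stripping suffixes, so a lemmatizer needs vocabulary knowledge. The sketch below fakes that knowledge with a small lookup table; `LEMMAS` is a hypothetical dictionary written for this example, not data from any library.

```python
# Hypothetical lookup table standing in for a real lemmatizer's vocabulary.
LEMMAS = {"went": "go", "gone": "go", "better": "good", "best": "good"}

def lemmatize(word: str) -> str:
    """Return the base form if the word is known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

lemmas = [lemmatize(w) for w in ["went", "better", "wait"]]
print(lemmas)  # ['go', 'good', 'wait']
```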

## Topic Modelling
* Unsupervised learning
* Groups texts under certain subjects

## Tutorial: Detect Text Emotion

Dataset: English Twitter messages https://huggingface.co/datasets/dair-ai/emotion

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import datasets
import random

### Import and Prepare Data

In [None]:
DATASET = "dair-ai/emotion"
LABEL_MAP = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise"
}

In [None]:
# From https://huggingface.co/docs/hub/datasets-usage
train_dataset = datasets.load_dataset(DATASET, split="train")
valid_dataset = datasets.load_dataset(DATASET, split="validation")
test_dataset = datasets.load_dataset(DATASET, split="test")

In [None]:
def get_tweet(data: datasets.arrow_dataset.Dataset) -> tuple[list[str], list[str]]:
    """Splits a data split into its tweets and labels."""
    tweets = [x["text"] for x in data]
    labels = [LABEL_MAP[x["label"]] for x in data]
    return tweets, labels

def inputs_for_vector(strings: list[str]) -> list[list[str]]:
    """Adapt a list of strings to become the input to the TextVectorization layer.
    Example at https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/
    """
    return [[s] for s in strings]
    

In [None]:
tweets, labels = get_tweet(train_dataset)
print(tweets[0], labels[0])

### Tokenization

Assign each word a number according to how commonly it appears in the dataset.
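
To make the frequency-based numbering concrete, here is a standard-library sketch. It assumes index 0 is reserved for padding and index 1 for the out-of-vocabulary token, which is roughly how the Keras `Tokenizer` behaves when `oov_token` is set. The helper names `build_word_index` and `encode` are made up for this example.

```python
from collections import Counter

def build_word_index(texts: list[str], num_words: int, oov_token: str = "<UNK>") -> dict[str, int]:
    """Map the most frequent words to small integers; index 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # Indices 0 (padding) and 1 (OOV) are taken, so num_words - 2 slots remain.
    for rank, (word, _) in enumerate(counts.most_common(num_words - 2), start=2):
        word_index[word] = rank
    return word_index

def encode(text: str, word_index: dict[str, int]) -> list[int]:
    """Replace each word by its index, or by 1 if it is not in the vocabulary."""
    return [word_index.get(word, 1) for word in text.lower().split()]

index = build_word_index(["i feel sad", "i feel happy", "i am"], num_words=4)
encoded = encode("i feel angry", index)
print(index)    # {'<UNK>': 1, 'i': 2, 'feel': 3}
print(encoded)  # [2, 3, 1]
```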

In [None]:
# Uses the deprecated https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="<UNK>")
tokenizer.fit_on_texts(tweets)  # Calibrate to the training data.

### Make all Sequences Same Shape

The ML model expects inputs to be a fixed shape and length.

Bring every tweet to the same length of `MAXLEN` tokens: shorter tweets are padded with zeros at the end and longer ones have the extra words cut off.
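
The effect of padding and truncating at the end of a sequence ("post") can be sketched in plain Python. `pad_post` is a hypothetical helper for illustration; the notebook itself relies on `tf.keras.utils.pad_sequences`.

```python
def pad_post(sequence: list[int], maxlen: int, pad_value: int = 0) -> list[int]:
    """Cut the sequence to maxlen items, then fill the tail with pad_value."""
    truncated = sequence[:maxlen]
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([5, 2, 9], maxlen=5)
long = pad_post([5, 2, 9, 7, 4, 1, 8], maxlen=5)
print(short)  # [5, 2, 9, 0, 0]
print(long)   # [5, 2, 9, 7, 4]
```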

In [None]:
VOCAB_SIZE = 10000
MAXLEN = 50
def get_sequences(tokenizer: tf.keras.preprocessing.text.Tokenizer, tweets: list[str]) -> list[list[int]]:
    """Uses the tokenizer and padding to turn a list of strings to a list of list of ints of equal length."""
    sequences = tokenizer.texts_to_sequences(tweets)
    padded = tf.keras.utils.pad_sequences(sequences, truncating = "post", padding="post", maxlen=MAXLEN)
    return padded

# Because the Tokenizer is now deprecated, prepare a vectorizer layer for the model.
vec_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAXLEN,
)
vec_layer.adapt(tweets)

In [None]:
padded_train_seq = get_sequences(tokenizer, tweets)

In [None]:
print(tweets[0], len(padded_train_seq[0]))
print(padded_train_seq[0])
vec_layer(inputs_for_vector(tweets))

### Preparing Data for Model

Build the set of class names and a lookup from class name to integer index.

In [None]:
index_to_class = LABEL_MAP.copy()
classes = set(index_to_class.values())
class_to_index = {
    c: i
    for i, c in index_to_class.items()
}

def names_to_ids(labels: list[str]) -> np.ndarray:
    return np.array([class_to_index[x] for x in labels])
    
train_labels = names_to_ids(labels)

In [None]:
class_to_index

### Model Definition

* 1 embedding layer.  https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
* 2 bidirectional LSTM layers, which read the sequence both forwards and backwards so that each position sees context from both sides.  https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* 1 dense layer for output.
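
As a sanity check on the size of this architecture, the trainable parameter counts can be worked out by hand. The sketch below uses the layer sizes from the model cell (vocabulary 10000, embedding width 16, 20 LSTM units, 6 classes) and assumes default Keras settings: bias terms enabled and a `Bidirectional` wrapper that concatenates the two directions. The helper functions are written for this example.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim: int, units: int) -> int:
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

VOCAB_SIZE, EMBED_DIM, LSTM_UNITS, NUM_CLASSES = 10000, 16, 20, 6

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),
    # Bidirectional runs two LSTMs, so the count doubles.
    "bi_lstm_1": 2 * lstm_params(EMBED_DIM, LSTM_UNITS),
    # The second layer receives the concatenated 2 * 20 outputs of the first.
    "bi_lstm_2": 2 * lstm_params(2 * LSTM_UNITS, LSTM_UNITS),
    "dense": dense_params(2 * LSTM_UNITS, NUM_CLASSES),
}
total = sum(counts.values())
print(counts)
print(total)  # 175926
```

These figures should be comparable to what `model.summary()` prints; nearly all of the parameters sit in the embedding layer.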

In [None]:
common_layers = [
    tf.keras.layers.Embedding(VOCAB_SIZE, 16),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    tf.keras.layers.Dense(6, activation="softmax")
]

old_model = tf.keras.models.Sequential(common_layers, name="tokeniser")

# Experiment in using the vectorisation layer.
vec_model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(None,), dtype=tf.string),
    vec_layer,
    *common_layers
], name="vectorisation")
    
model = old_model
print(model.summary())

#### Model Compilation:
* use the Adam optimiser https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
* loss function sparse categorical cross-entropy https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy
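
For a single example, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true class, where the label is given as an integer index rather than a one-hot vector. A minimal sketch follows; the probabilities are invented for illustration, and Keras additionally clips values to avoid `log(0)`.

```python
import math

def sparse_categorical_crossentropy(true_index: int, probabilities: list[float]) -> float:
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probabilities[true_index])

# Made-up softmax output over the 6 emotion classes.
probs = [0.1, 0.7, 0.05, 0.05, 0.05, 0.05]

loss_if_joy = sparse_categorical_crossentropy(1, probs)      # true class got 0.7
loss_if_sadness = sparse_categorical_crossentropy(0, probs)  # true class got 0.1
print(round(loss_if_joy, 3))      # 0.357
print(round(loss_if_sadness, 3))  # 2.303
```

A confident correct prediction gives a loss near 0, while assigning low probability to the true class is penalised heavily.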

In [None]:
model.compile(
     loss="sparse_categorical_crossentropy",
     optimizer="adam",
     metrics=["accuracy"]
)

### Training
* Use an `EarlyStopping` callback to halt training once validation accuracy has not improved for 2 consecutive epochs (`patience=2`)
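
The patience rule can be simulated without training anything. The sketch below is a simplified model of the behaviour, assuming that only a strict increase counts as an improvement; the accuracy values are invented and `epochs_run` is a hypothetical helper, not part of Keras.

```python
def epochs_run(val_accuracies: list[float], patience: int = 2) -> int:
    """Return how many epochs run before early stopping triggers."""
    best = float("-inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_accuracies)

# Accuracy stalls after epoch 3: epochs 4 and 5 do not beat 0.80.
stopped_at = epochs_run([0.60, 0.72, 0.80, 0.79, 0.80, 0.81])
print(stopped_at)  # 5
```

Note that the run stops at epoch 5 and never reaches the 0.81 of epoch 6; a larger `patience` trades longer training for a better chance of riding out such plateaus.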

In [None]:
val_tweets, val_labels = get_tweet(valid_dataset)
val_seq = get_sequences(tokenizer, val_tweets)
val_labels = names_to_ids(val_labels)
h = model.fit(
     # The tokenizer-based model expects padded integer sequences, not raw strings.
     padded_train_seq, train_labels,
     validation_data=(val_seq, val_labels),
     epochs=20,
     callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=2)]
)

### Model Evaluation

In [None]:
test_tweets, test_labels = get_tweet(test_dataset)
test_seq = get_sequences(tokenizer, test_tweets)
test_labels = names_to_ids(test_labels)

Evaluate model accuracy against test data.
* `metrics_value` corresponds to the `metrics=["accuracy"]` argument given during model compilation.
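
For intuition, accuracy here is the fraction of examples whose highest-probability class matches the true label. A small sketch with made-up probabilities over 3 classes (the `accuracy` helper is written for this example):

```python
def accuracy(probabilities: list[list[float]], true_labels: list[int]) -> float:
    """Fraction of rows whose argmax equals the true label."""
    predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    correct = sum(int(p == t) for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)

# Made-up model outputs; each row sums to 1.
probs = [
    [0.8, 0.1, 0.1],  # predicts class 0
    [0.2, 0.5, 0.3],  # predicts class 1
    [0.3, 0.3, 0.4],  # predicts class 2
    [0.6, 0.3, 0.1],  # predicts class 0
]
labels = [0, 1, 1, 0]
acc = accuracy(probs, labels)
print(acc)  # 0.75
```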

In [None]:
loss_value, metrics_value = model.evaluate(test_seq, test_labels)

In [None]:
loss_value, metrics_value

#### Random model sampling
Pick a random tweet from the test set and predict its class.

In [None]:
i = random.randint(0,len(test_labels)-1)
print('Sentence:', test_tweets[i])
print('Emotion:', index_to_class[test_labels[i]])
p = model.predict(np.expand_dims(test_seq[i], axis=0))[0]
print(test_seq[i])
pred_class = index_to_class[int(np.argmax(p))]
print('Predicted Emotion: ', pred_class)

#### Classifying an Input Sentence

In [None]:
sentence = "an ethereal performance by helene grimaud"
sequence = tokenizer.texts_to_sequences([sentence])
paddedSequence = tf.keras.utils.pad_sequences(sequence, truncating = "post", padding="post", maxlen=MAXLEN)
p = model.predict(np.expand_dims(paddedSequence[0], axis=0))[0]
pred_class = index_to_class[np.argmax(p).astype("uint8")]
print("Sentence:", sentence)
print("Predicted Emotion: ", pred_class)

### Saving the Model
The original tutorial saved the model in Hierarchical Data Format 5 (`.h5`); the cell below uses the newer `.keras` format instead.
See https://medium.com/@mysterious_obscure/a-deep-dive-into-model-files-pkl-pt-h5-and-the-magic-of-machine-learning-740768317e76

Formats:
* `.pkl` = Pickled Python objects, used by `scikit-learn`
* `.pt` = PyTorch tensors; stores architecture and learned params as tensors
* `.h5` = Hierarchical Data Format 5; stores architecture, learned params and training configuration (loss and optimizer state)

In [None]:
# Originally for Google Colab and Drive.
# from google.colab import drive
# drive.mount("/content/drive")
# path = "/content/drive/My Drive/TweetEmotionRecognition/h5/tweet_model.h5"
# path = "/tmp/tweet_model.h5"
# model.save(path)
# Save in the more modern Keras format.
path = "/tmp/tweet_model.keras"
model.save(path)

### Load Model

In [None]:
load_model = tf.keras.models.load_model(path)
print(load_model.summary())