# Two approaches for representing groups of words: Sets and sequences
We’ll demonstrate each approach on a well-known text classification benchmark: the IMDB movie review sentiment-classification dataset.

In the previous notebooks (ch. 4 and 5) we worked with a prevectorized version of the IMDB dataset; now, let’s process the raw IMDB text data, just like you would do when approaching a new text-classification problem in the real world.

## Preparing the IMDB movie reviews data
Let’s start by downloading the dataset from the Stanford page of Andrew Maas and uncompressing it:

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

In total, there are $25,000$ text files for training and another $25,000$ for testing.

There’s also a train/unsup subdirectory in there, which we don’t need. Let’s delete it:

In [None]:
!rm -r aclImdb/train/unsup

Whether you’re working with text data or image data, remember to always inspect what your data looks like before you dive into modeling it. It will ground your intuition about what your model is actually doing:

In [None]:
!cat aclImdb/train/pos/4077_10.txt

Next, let’s prepare a validation set by setting apart 20% of the training text files in a new directory, ```aclImdb/val```:

In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # Shuffle the list of training files using a seed, to ensure we get the same validation set every time
    random.Random(1337).shuffle(files)
    # Take 20% of the training files to use for validation
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        # Move the files to 'aclImdb/val/neg' and 'aclImdb/val/pos'
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In the same way we used ```image_dataset_from_directory``` to create a batched _Dataset_ in ch. 8, we can use ```text_dataset_from_directory``` for text files.

Let’s create three _Dataset_ objects for training, validation, and testing:

In [None]:
from tensorflow import keras

batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

### Displaying the shapes and dtypes of the first batch
These datasets yield inputs that are TensorFlow _tf.string_ tensors and targets that are _int32_ tensors encoding the value “0” or “1”.

In [None]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

## Processing words as a set: The bag-of-words approach
The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set (a “bag”) of tokens.

You could either look at individual words (unigrams), or try to recover some local order information by looking at groups of consecutive tokens (N-grams).

### Single words (unigrams) with binary encoding
If you use a bag of single words, the sentence _“the cat sat on the mat”_ becomes:

{"cat", "mat", "on", "sat", "the"}

The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word. For instance, using binary encoding (multi-hot), you’d encode a text as a vector with as many dimensions as there are words in your vocabulary—with 0s almost everywhere and some 1s for dimensions that encode words present in the text.
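
To make the idea concrete, here is a minimal pure-Python sketch of multi-hot encoding. The helper name `multi_hot` and the seven-word toy vocabulary are assumptions made for illustration only; they are not part of Keras, and a real vocabulary would hold thousands of entries.

```python
def multi_hot(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    # Word order and repetition are discarded: only presence matters.
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical toy vocabulary, chosen for illustration only.
vocabulary = ["cat", "dog", "mat", "on", "sat", "the", "tree"]

vector = multi_hot("the cat sat on the mat", vocabulary)
print(vector)  # [1, 0, 1, 1, 1, 1, 0]
```

Note that “the” occurs twice in the sentence but still contributes a single 1: binary encoding records presence, not frequency.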

### Preprocessing our datasets with a _TextVectorization_ layer
First, let’s process our raw text datasets with a _TextVectorization_ layer so that they yield multi-hot encoded binary word vectors. Our layer will only look at single words (that is to say, _unigrams_).

In [None]:
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    # Limit the vocabulary to the 20,000 most frequent words
    max_tokens=20000,
    # Encode the output tokens as multi-hot binary vectors
    output_mode="multi_hot",
)

# Prepare a dataset that only yields raw text inputs (no labels)
text_only_train_ds = train_ds.map(lambda x, y: x)
# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

# Prepare processed versions of our training, validation, and test dataset
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4) # Make sure to specify num_parallel_calls to leverage multiple CPU cores
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

### Inspecting the output of our binary unigram dataset
You can try to inspect the output of one of these datasets.

In [None]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

### Our model-building utility
Next, let’s write a reusable model-building function that we’ll use in all of our experiments in this section.

In [None]:
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

### Training and testing the binary unigram model
Finally, let’s train and test our model.

We call ```cache()``` on the datasets to cache them in memory: this way, we will only do the preprocessing once, during the first epoch, and we’ll reuse the preprocessed texts for the following epochs. This can only be done if the data is small enough to fit in memory.

Note that in this case, since the dataset is a balanced two-class classification dataset (there are as many positive samples as negative samples), the “naive baseline” we could reach without training an actual model would only be 50%.

In [None]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

### Bigrams with binary encoding
Of course, discarding word order is very reductive, because even atomic concepts can be expressed via multiple words: the term “United States” conveys a concept that is quite distinct from the meaning of the words “states” and “united” taken separately.

For this reason, you will usually end up re-injecting local order information into your bag-of-words representation by looking at N-grams rather than single words (most commonly, bigrams).

With bigrams, our sentence becomes:

{"the", "the cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the mat", "mat"}
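
The set above can be reproduced with a few lines of plain Python. This is only a sketch: the function name `bag_of_ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption rather than a description of how any particular library tokenizes text.

```python
def bag_of_ngrams(text, max_n=2):
    """Return the set of all N-grams of length 1..max_n found in `text`."""
    tokens = text.lower().split()
    bag = set()
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            bag.add(" ".join(tokens[i:i + n]))
    return bag

bigram_bag = bag_of_ngrams("the cat sat on the mat")
print(sorted(bigram_bag))
# ['cat', 'cat sat', 'mat', 'on', 'on the', 'sat', 'sat on', 'the', 'the cat', 'the mat']
```

With `max_n=2` the bag holds both the unigrams and the bigrams, which is why “the” and “the cat” appear side by side.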

### Configuring the _TextVectorization_ layer to return bigrams
The _TextVectorization_ layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc. Just pass an ```ngrams=N``` argument as in the following listing:

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

### Training and testing the binary bigram model
Let’s test how our model performs when trained on such binary-encoded bags of bigrams.

In [None]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

### Bigrams with TF-IDF encoding
You can also add a bit more information to this representation by counting how many times each word or N-gram occurs, that is to say, by taking the histogram of the words over the text:

{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
"sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}

If you’re doing text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word “terrible” regardless of sentiment, but a review that contains many instances of the word “terrible” is likely a negative one.
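
As a rough illustration, the histogram shown above can be computed with `collections.Counter` from the standard library. The helper name `ngram_counts` and the whitespace tokenization are assumptions for this sketch only.

```python
from collections import Counter

def ngram_counts(text, max_n=2):
    """Count how often each N-gram of length 1..max_n occurs in `text`."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts

counts = ngram_counts("the cat sat on the mat")
print(counts["the"], counts["the cat"])  # 2 1
```

Only “the” reaches a count of 2 in this sentence; every other unigram and bigram occurs once.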

### Configuring the _TextVectorization_ layer to return token counts

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

### Configuring _TextVectorization_ to return TF-IDF-weighted outputs
Now, of course, some words are bound to occur more often than others no matter what the text is about.

The words _“the”_, _“a”_, _“is”_, and _“are”_ will always dominate your word count histograms, drowning out other words, despite being pretty much useless features in a classification context.

We can address this via **normalization**. The best practice is to go with something called **TF-IDF normalization**.

TF-IDF stands for _“term frequency, inverse document frequency”_. It weights a given term by taking “term frequency,” how many times the term appears in the current document, and dividing it by a measure of “document frequency,” which estimates how often the term comes up across the dataset. In other words, if a given term appears frequently across all documents in a dataset, then it's not informative about a specific document.
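
The sketch below shows one way to compute such weights by hand. It uses a smoothed logarithmic inverse document frequency, `log(1 + N / (1 + df))`, which is one common variant chosen here as an assumption; the exact formula applied by any given library may differ. The function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (illustrative variant)."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # term frequency within this document
        weighted.append({
            # Assumed smoothed IDF: log(1 + N / (1 + df)).
            term: count * math.log(1 + num_docs / (1 + doc_freq[term]))
            for term, count in term_freq.items()
        })
    return weighted

# Hypothetical toy corpus.
docs = [
    "the movie was terrible terrible",
    "the movie was great",
    "the plot was thin",
]
weights = tf_idf(docs)
print(weights[0]["terrible"] > weights[0]["the"])  # True
```

Here “the” appears in every document, so its weight is pushed down, while “terrible” appears twice in a single document and ends up with the largest weight in the first review.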

TF-IDF is so common that it’s built into the _TextVectorization_ layer. All you need to do to start using it is to switch the ```output_mode``` argument to ```"tf_idf"```.

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

### Training and testing the TF-IDF bigram model
Let’s train a new model with this scheme.

In [None]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

These past few examples clearly show that word order matters: manual engineering of order-based features, such as bigrams, yields a nice accuracy boost. Now remember: the history of deep learning is that of a move away from manual feature engineering, toward letting models learn their own features from exposure to data alone.

## Processing words as a sequence: The sequence model approach
To implement a sequence model, you’d start by representing your input samples as sequences of integer indices (one integer standing for one word).

Then, you’d map each integer to a vector to obtain vector sequences.

Finally, you’d feed these sequences of vectors into a stack of layers that could cross-correlate features from adjacent vectors, such as a 1D convnet, an RNN, or a Transformer.
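
The first of these steps, turning text into fixed-length integer sequences, can be sketched in plain Python. The helper names `build_index` and `encode` are hypothetical, and reserving index 0 for padding and index 1 for out-of-vocabulary words is a convention assumed for this illustration.

```python
from collections import Counter

# Assumed convention: 0 marks padding, 1 marks out-of-vocabulary words.
PAD, UNK = 0, 1

def build_index(texts, max_tokens):
    """Map the most frequent words to integer indices starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: i + 2 for i, word in enumerate(most_common)}

def encode(text, index, max_length):
    """Convert `text` to exactly `max_length` integers (truncate or pad)."""
    ids = [index.get(word, UNK) for word in text.lower().split()]
    ids = ids[:max_length]
    return ids + [PAD] * (max_length - len(ids))

index = build_index(["the cat sat on the mat", "the dog sat"], max_tokens=10)
encoded = encode("the cat barked", index, max_length=5)
print(encoded)
```

The word “barked” never appeared when the index was built, so it maps to the out-of-vocabulary index, and the sequence is padded with zeros up to the requested length.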

### Preparing integer sequence datasets
First, let’s prepare datasets that return integer sequences.

In [None]:
max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,  # to keep a manageable input size, we’ll truncate the inputs after the first 600 words (only 5% of reviews are longer)
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

### A sequence model built on one-hot encoded vector sequences
Next, let’s make a model. The simplest way to convert our integer sequences to vector sequences is to one-hot encode the integers (each dimension would represent one possible term in the vocabulary). On top of these one-hot vectors, we’ll add a simple bidirectional LSTM.

In [None]:
import tensorflow as tf

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)  # Encode the integers into binary 20,000-dimensional vectors
x = layers.Bidirectional(layers.LSTM(32))(embedded)  # Add a bidirectional LSTM
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # Finally, add a classification layer
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

### Training a first basic sequence model
Now, let’s train our model.

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

First, this model trains very slowly, especially compared to the lightweight model of the previous section. This is because our inputs are quite large: each input sample is encoded as a matrix of size $(600, 20000)$ (600 words per sample, 20,000 possible words). That’s $12,000,000$ floats for a single movie review. Our bidirectional LSTM has a lot of work to do.
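
The size of these inputs can be checked with a quick back-of-the-envelope calculation. The sketch assumes each value is stored as a 4-byte `float32` and reuses the batch size of 32 from earlier; the variable names are illustrative.

```python
sequence_length = 600      # words kept per review
vocabulary_size = 20_000   # one-hot dimensions per word
bytes_per_float32 = 4      # assumption: values are stored as float32
batch_size = 32            # batch size used when building the datasets

floats_per_review = sequence_length * vocabulary_size
megabytes_per_review = floats_per_review * bytes_per_float32 / 1e6
gigabytes_per_batch = megabytes_per_review * batch_size / 1e3

print(floats_per_review)     # 12000000
print(megabytes_per_review)  # 48.0
print(gigabytes_per_batch)   # 1.536
```

Under these assumptions a single one-hot encoded review takes about 48 MB, and one batch takes roughly 1.5 GB, almost all of it zeros.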

Second, the model doesn’t perform nearly as well as our (very fast) binary unigram model. Clearly, using one-hot encoding to turn words into vectors, which was the simplest thing we could do, wasn’t a great idea. There’s a better way: **_word embeddings_**.