<a href="https://colab.research.google.com/github/UPstartDeveloper/DS-2.4-Advanced-Topics/blob/main/Notebooks/NLP/Efficient_IMDb_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring the Data API

In this exercise we'll revisit the IMDb dataset, but this time we'll use the features of the TensorFlow Data API, `tf.data`, to implement highly performant input pipelines.

We'll also take another look at making language models for binary classification, and use an `Embedding` layer to see if we can get a computer to learn the implicit relationships between words.

## Setup

In [40]:
# Copied from Aurélien Géron's Ch. 13 notebook, 
# for "Hands-on Machine Learning with Scikit-Learn, Keras and Tensorflow": 
# https://colab.research.google.com/github/ageron/handson-ml2/blob/master/13_loading_and_preprocessing_data.ipynb


# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# To Download the Dataset (see Part 1)
from pathlib import Path

# for easily counting the frequency of items in a collection
from collections import Counter

# Great Source of Datasets
import tensorflow_datasets as tfds

## Part 1: Get the Data

**About the IMDb Dataset**:
1. 50,000 movie reviews from the Internet Movie Database (IMDb). 
2. Training and testing data are in `train/` and `test/`
3. Both of these directories have their own subdirectories for samples of `pos/` and `neg/` reviews.
4. Dataset is *balanced*:
  - 12,500 samples per class, in both the training and test data
5. The samples themselves are *text files.*

In [4]:
# locating the dataset TAR file
DOWNLOAD_ROOT = "http://ai.stanford.edu/~amaas/data/sentiment/"
FILENAME = "aclImdb_v1.tar.gz"
# downloading it onto the client machine
filepath = keras.utils.get_file(FILENAME, DOWNLOAD_ROOT + FILENAME, 
                                extract=True)
# finding a place for it on our machine
path = Path(filepath).parent / "aclImdb"
# here it is!
print(path)

/root/.keras/datasets/aclImdb


## Part 2: Splitting the Data

In [7]:
def review_paths(dirpath):
    """Given a directory path, returns a list of all the text files present.

    Args:
      dirpath: pathlib.Path. The path to a folder on the filesystem.

    Returns: List[str]. The full path of each .txt file in the folder.

    Example Usage:
    review_paths(Path("/root/.foo")) ==> ["/root/.foo/bar.txt", "/root/.foo/foobar.txt"]
    """
    return [str(path) for path in dirpath.glob("*.txt")]


# collect samples for each of the training data, divided by class
train_pos = review_paths(path / "train" / "pos")
train_neg = review_paths(path / "train" / "neg")
# do the same for test data (includes data we'll use for validation as well)
test_valid_pos = review_paths(path / "test" / "pos")
test_valid_neg = review_paths(path / "test" / "neg")

In [9]:
# verify we collected all the samples for each section
len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

(12500, 12500, 12500, 12500)

In [10]:
# shuffle test data to make sure the validation data is identically distributed
np.random.shuffle(test_valid_neg)
np.random.shuffle(test_valid_pos)
# shuffle training data so the model doesn't learn based on how it is ordered
np.random.shuffle(train_pos)
np.random.shuffle(train_neg)

To aid our training process, we'll create a separate validation set from 15,000 of the samples in the testing data.

The remaining 10,000 samples of the test data will be kept separate, and of course not seen by the model until after training is completed.

In [11]:
# keep just 5,000 samples of pos and neg test data 
test_pos = test_valid_pos[:5000]
test_neg = test_valid_neg[:5000]
# the rest of the data is for validation
valid_pos = test_valid_pos[5000:]
valid_neg = test_valid_neg[5000:]
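
As a quick sanity check on the arithmetic above, here is a minimal sketch that applies the same slicing to stand-in lists. The `fake_pos` and `fake_neg` lists and the `split_test_valid` helper are hypothetical names made up for illustration; they are not part of the notebook's pipeline.

```python
import random

def split_test_valid(paths, n_test):
    """Split one class's shuffled paths into (test, validation) lists."""
    return paths[:n_test], paths[n_test:]

# Hypothetical stand-ins for the 12,500 test file paths of each class.
rng = random.Random(42)
fake_pos = [f"pos/{i}.txt" for i in range(12500)]
fake_neg = [f"neg/{i}.txt" for i in range(12500)]
rng.shuffle(fake_pos)
rng.shuffle(fake_neg)

test_pos_demo, valid_pos_demo = split_test_valid(fake_pos, 5000)
test_neg_demo, valid_neg_demo = split_test_valid(fake_neg, 5000)

n_test = len(test_pos_demo) + len(test_neg_demo)     # test samples, both classes
n_valid = len(valid_pos_demo) + len(valid_neg_demo)  # validation samples, both classes
print(n_test, n_valid)
# → 10000 15000

# the two splits must not share any file
assert not set(test_pos_demo) & set(valid_pos_demo)
```

Because the slices are taken from the same shuffled list, the test and validation sets are disjoint and both stay balanced between the two classes.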

## Part 3: Using the Data API

Say hello to `tf.data`!

Pretending as though this dataset were humongous, we could read it efficiently using `TextLineDataset`. This is a good fit for 2 reasons:

1. Each review is just 1 line of text long, so every line the dataset yields is one complete review.
2. If this dataset were in fact humongous, this technique would let us avoid loading it into RAM all at once. 

In [14]:
def convert_to_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
    """Create a Tensorflow-based dataset object from the dataset on disk.

    This is intended for use on binary classification problems only.
    
    Args:
      filepaths_positive: List[str]. The file paths of the positive (1) samples.
      filepaths_negative: List[str]. The file paths of the negative (0) samples.
      n_read_threads: int. Specifies the number of files to read in parallel,
        so we can read in multiple records at once. 

    Returns: tf.data.Dataset. Contains labeled data for both classes.
    """
    # read in the negative reviews
    dataset_neg = tf.data.TextLineDataset(filepaths_negative,
                                          num_parallel_reads=n_read_threads)
    # read in the positive reviews
    dataset_pos = tf.data.TextLineDataset(filepaths_positive,
                                          num_parallel_reads=n_read_threads)
    # label the positive and negative reviews numerically 
    dataset_neg = dataset_neg.map(lambda review: (review, 0))
    dataset_pos = dataset_pos.map(lambda review: (review, 1))
    return tf.data.Dataset.concatenate(dataset_pos, dataset_neg)
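
One detail of the function above is easy to miss: concatenating the two labeled datasets yields every positive review before any negative one. The sketch below is a plain-Python analogy only (generators and `itertools.chain` stand in for `map` and `concatenate`), and the sample reviews are made up for illustration.

```python
import random
from itertools import chain

# Hypothetical miniature "datasets" of reviews for each class.
pos_reviews = ["loved it", "a classic", "great cast"]
neg_reviews = ["boring", "awful plot", "fell asleep"]

# Label each review, mirroring the two map() calls.
labeled_pos = ((review, 1) for review in pos_reviews)
labeled_neg = ((review, 0) for review in neg_reviews)

# Concatenate: all positives come first, then all negatives.
combined = list(chain(labeled_pos, labeled_neg))
labels_in_order = [label for _, label in combined]
print(labels_in_order)
# → [1, 1, 1, 0, 0, 0]

# Shuffling mixes the two classes together.
random.Random(42).shuffle(combined)
```

This ordering is why the combined dataset should be shuffled before batching. Otherwise each batch would contain only one class.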

Speed test!

In [21]:
%timeit -r1 convert_to_dataset(train_pos, train_neg).repeat(10)

10 loops, best of 1: 40.9 ms per loop


Oh, not fast enough? We can make it more efficient by using `.cache()`!

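Alternatively, if the dataset is small enough to fit in RAM, we could load it all into memory and build the `tf.data.Dataset` from in-memory tensors, which would be just as efficient as the following. (Note that `tf.data` pipelines are lazy: the timings in this section measure how long it takes to build the pipeline, not to read the files, so the benefit of `cache()` only shows up once the dataset is iterated more than once, e.g. across several epochs.)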

In [22]:
%timeit -r1 convert_to_dataset(train_pos, train_neg).cache().repeat(10)

10 loops, best of 1: 40.7 ms per loop


Let's do this for all of our data splits. 

We'll add some extra efficiency by using `prefetch()`. This tells the pipeline to prepare the next batch in the background while the current batch is still being processed by the model.
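
To build some intuition for prefetching, here is a rough pure-Python analogy using a background thread and a bounded queue. It is only a sketch of the idea and not how `tf.data` implements `prefetch()` internally; the `prefetch` function and its `buffer_size` parameter below are hypothetical names chosen for this example.

```python
import queue
import threading

def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, producing them ahead of time in a thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # unique marker meaning "no more items"

    def producer():
        for item in iterable:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(sentinel)   # signal that the source is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()    # blocks until the producer has an item ready
        if item is sentinel:
            return
        yield item

batches = [[1, 2], [3, 4], [5, 6]]
print(list(prefetch(batches, buffer_size=1)))
# → [[1, 2], [3, 4], [5, 6]]
```

While the consumer works on one batch, the producer thread is already filling the buffer with the next one, so the consumer rarely has to wait.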

In [23]:
batch_size = 32
# Shuffle the training set so each batch mixes both classes
# (convert_to_dataset puts all positive reviews before all negative ones)
train_set = convert_to_dataset(train_pos, train_neg).shuffle(25000).batch(batch_size).prefetch(1)
valid_set = convert_to_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)
test_set = convert_to_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)

## Part 4: Define the Model

### Text Preprocessing (Layers + Operations)

This comes first, as neural networks cannot understand raw text!

Since we will frequently need to refer to the maximum number of tokens we believe are in the dataset, we'll start this section by saving a constant for this value (I arbitrarily chose 5000):

In [28]:
MAX_VOCAB_SIZE = 5000  

Next we'll need a helper function for cleaning the text:

In [24]:
def preprocess(X_batch, n_words=50):
    """Preprocess the input text for a single batch of data.

    Args:
      X_batch: tf.Tensor. A batch (1-D string tensor) of text samples
      n_words: int. The desired number of tokens for each review

    Returns: tf.Tensor. Cleaned text version of the batch.
    """
    # target output shape: [batch_size, n_words]
    shape = tf.shape(X_batch) * tf.constant([1, 0]) + tf.constant([0, n_words])
    # crop each review to its first 300 bytes (substr counts bytes by default)
    Z = tf.strings.substr(X_batch, 0, 300)
    # lower-case the reviews
    Z = tf.strings.lower(Z)
    # text-cleaning: replace all <br /> and all non-letter characters w/ spaces
    Z = tf.strings.regex_replace(Z, b"<br\\s*/?>", b" ")
    Z = tf.strings.regex_replace(Z, b"[^a-z]", b" ")
    # tokenize the review
    Z = tf.strings.split(Z)
    # ensure each review has n_words tokens (padding or cropping)
    return Z.to_tensor(shape=shape, default_value=b"<pad>")
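
To see what these steps produce, here is a minimal pure-Python sketch of the same cleaning steps applied to a single review. The `preprocess_py` helper is a hypothetical stand-in written for illustration; unlike the TensorFlow version, it slices characters rather than bytes and returns a list of `str` tokens.

```python
import re

def preprocess_py(review, n_words=50, max_chars=300, pad="<pad>"):
    """Clean one review and return exactly n_words tokens."""
    text = review[:max_chars].lower()        # crop, then lower-case
    text = re.sub(r"<br\s*/?>", " ", text)   # replace <br /> tags with spaces
    text = re.sub(r"[^a-z]", " ", text)      # replace non-letters with spaces
    tokens = text.split()[:n_words]          # tokenize, crop to n_words tokens
    return tokens + [pad] * (n_words - len(tokens))  # pad short reviews

print(preprocess_py("It's GREAT!<br />Loved it.", n_words=8))
# → ['it', 's', 'great', 'loved', 'it', '<pad>', '<pad>', '<pad>']
```

Note that the apostrophe is replaced by a space, so "It's" becomes the two tokens `it` and `s`.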

This utility will ensure the tokens are encoded with the **most frequent tokens getting the lowest indices**:

In [29]:
def get_vocabulary(data_sample, max_size=MAX_VOCAB_SIZE):
    """List unique tokens in a corpus, sorted by most frequently occurring.

    The tokens will be listed as byte strings, as that works best with TF

    Args:
      data_sample: tf.Tensor. Contains the input strings.
      max_size: int. The maximum number of tokens we believe are in the dataset.

    Returns: List[bytes].
    """
    preprocessed_reviews = preprocess(data_sample).numpy()
    # count the frequencies of different tokens
    counter = Counter()
    for words in preprocessed_reviews:
        for word in words:
            if word != b"<pad>":
                counter[word] += 1
    # sort the tokens by occurrence, make sure the padding token appears first
    return [b"<pad>"] + [word for word, count in counter.most_common(max_size)]
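
The frequency-ordering idea can be checked on a tiny corpus. This is a minimal sketch using plain `str` tokens instead of byte strings; `build_vocab` and the three sample reviews are hypothetical and exist only for this example.

```python
from collections import Counter

def build_vocab(tokenized_reviews, max_size, pad="<pad>"):
    """Return [pad] followed by the max_size most frequent tokens."""
    counter = Counter()
    for tokens in tokenized_reviews:
        counter.update(token for token in tokens if token != pad)
    return [pad] + [word for word, _ in counter.most_common(max_size)]

# Token counts: great=4, movie=3, bad=2, cast=1
reviews = [
    ["great", "movie", "great", "cast", "<pad>"],
    ["bad", "movie", "<pad>", "<pad>", "<pad>"],
    ["great", "movie", "bad", "great", "<pad>"],
]
vocab = build_vocab(reviews, max_size=3)
print(vocab)
# → ['<pad>', 'great', 'movie', 'bad']
```

The padding token sits at index 0, the most frequent word gets index 1, and the rare word `cast` is left out because only the top 3 tokens are kept.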

Now we will define our `TextVectorization` layer (Keras also has its own version of this, although it was still experimental when this notebook was written):

In [31]:
class TextVectorization(keras.layers.Layer):
    def __init__(self, max_vocabulary_size=MAX_VOCAB_SIZE, n_oov_buckets=100, 
                dtype=tf.string, **kwargs):
        '''define the hyperparams for our text corpus'''
        super().__init__(dtype=dtype, **kwargs)
        self.max_vocabulary_size = max_vocabulary_size
        self.n_oov_buckets = n_oov_buckets

    def adapt(self, data_sample):
        '''create word IDs for the vocabulary of our text corpus'''
        self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)
        words = tf.constant(self.vocab)
        word_ids = tf.range(len(self.vocab), dtype=tf.int64)
        vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
        self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets)
        
    def call(self, inputs):
        '''convert input text to the corresponding integers in the vocabulary'''
        preprocessed_inputs = preprocess(inputs)
        return self.table.lookup(preprocessed_inputs)
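
The lookup table maps known tokens to their vocabulary index and sends every out-of-vocabulary (OOV) token to one of the extra buckets placed after the vocabulary. Here is a rough pure-Python sketch of that scheme. Using `zlib.crc32` as the hash is an assumption made for illustration (TensorFlow uses its own fingerprint function, so the actual bucket numbers will differ), and `lookup_ids` is a hypothetical helper.

```python
import zlib

def lookup_ids(tokens, vocab, n_oov_buckets):
    """Map tokens to IDs; unknown tokens land in a hashed OOV bucket."""
    word_to_id = {word: index for index, word in enumerate(vocab)}
    ids = []
    for token in tokens:
        if token in word_to_id:
            ids.append(word_to_id[token])
        else:
            # stand-in hash (assumption): any stable hash works for the idea
            bucket = zlib.crc32(token.encode("utf-8")) % n_oov_buckets
            ids.append(len(vocab) + bucket)
    return ids

vocab = ["<pad>", "great", "movie", "bad"]
ids = lookup_ids(["great", "movie", "zzz", "<pad>"], vocab, n_oov_buckets=100)
print(ids[:2], ids[3])  # known tokens keep their vocabulary index
# → [1, 2] 0
```

The unknown token `zzz` receives an ID somewhere between `len(vocab)` and `len(vocab) + n_oov_buckets - 1`, which is why the model's input dimension has to account for the buckets as well as the vocabulary.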

Now we can apply this layer to our dataset (if it were humongous, we could also just use a representative sample of the dataset's vocabulary):

In [32]:
# A: define the max number of unique tokens that we expect are in the corpus
max_vocabulary_size = MAX_VOCAB_SIZE
n_oov_buckets = 100

# B: store a sample of JUST the input text as an iterable
sample_review_batches = train_set.map(lambda review, label: review)
sample_reviews = np.concatenate(list(sample_review_batches.as_numpy_iterator()),
                                axis=0)
# C: now just apply the vectorizer, to encode the word ids 
text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets,
                                       input_shape=[])
text_vectorization.adapt(sample_reviews)

As the very last step in preprocessing, we need to encode the text samples themselves into a numerical format - for example, we can use a custom layer for the bag-of-words approach.

In [33]:
class BagOfWords(keras.layers.Layer):
    def __init__(self, n_tokens, dtype=tf.int32, **kwargs):
        '''n_tokens tells us the length that the bag of word vectors must be'''
        super().__init__(dtype=dtype, **kwargs)
        self.n_tokens = n_tokens
    def call(self, inputs):
        '''creating the bag of word vectors on the input'''
        one_hot = tf.one_hot(inputs, self.n_tokens)
        return tf.reduce_sum(one_hot, axis=1)[:, 1:]
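
The same computation can be traced by hand with NumPy on a tiny batch. This is a minimal sketch assuming a vocabulary of only 4 token IDs, where ID 0 is the padding token; `bag_of_words_np` is a hypothetical helper mirroring the layer's `call` method.

```python
import numpy as np

def bag_of_words_np(token_ids, n_tokens):
    """Count how often each non-padding token ID occurs in each review."""
    one_hot = np.eye(n_tokens, dtype=np.int32)[token_ids]  # (batch, n_words, n_tokens)
    counts = one_hot.sum(axis=1)                           # (batch, n_tokens)
    return counts[:, 1:]                                   # drop the <pad> column

batch = np.array([[1, 2, 1, 0],    # token 1 twice, token 2 once, one <pad>
                  [3, 0, 0, 0]])   # token 3 once, three <pad>
print(bag_of_words_np(batch, n_tokens=4))
# → [[2 1 0]
#    [0 0 1]]
```

Each row is a fixed-length vector of word counts, so the order of the words in the review is discarded.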

The next step is to create the bag-of-words layer, sized to match our vocabulary:

In [34]:
# store size of the vocabulary (including an extra for the <pad> token)
n_tokens = max_vocabulary_size + n_oov_buckets + 1 
# use the BoW layer
bag_of_words = BagOfWords(n_tokens)

### Build + Train the Model

Here we will use a fully-connected network that has:

- a text vectorization layer and a BoW layer to do the preprocessing,
- 1 hidden layer with 100 neurons and ReLU activation,
- 1 neuron with sigmoid activation in the output layer (since this is binary classification).

We will train it for 5 epochs with Nadam optimization, and track accuracy as the metric.

In [35]:
model = keras.models.Sequential([
    text_vectorization,
    bag_of_words,
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f2a34e5e0f0>

## Part 5: Use Word Embeddings

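Reviews contain different numbers of words, so rather than feeding the model one embedding per token, we'll make a function to compute the mean embedding of each review, scaled by the square root of its number of words: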

In [36]:
def compute_mean_embedding(inputs):
    """Compute the normalized embedding vectors from input data.

    Args:
      inputs: tf.Tensor. Contains vectorized text samples.
    
    Returns: tf.Tensor. Contains the embedding vectors for each sample.

    Example usage:
    >>> example = tf.constant([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
    ...                        [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]])
    >>> compute_mean_embedding(example)
    <tf.Tensor: shape=(2, 3), dtype=float32, numpy=
    array([[2.3570225, 3.2998314, 1.4142135],
           [2.       , 0.       , 0.       ]], dtype=float32)>
    """
    # for each token, count the nonzero components of its embedding vector
    # (this is 0 for an all-zero padding vector)
    not_pad = tf.math.count_nonzero(inputs, axis=-1)
    # count the number of words (tokens with a nonzero vector) in each review
    n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)  
    # multiply by the square root of the number of words   
    sqrt_n_words = tf.math.sqrt(tf.cast(n_words, tf.float32))
    return tf.reduce_mean(inputs, axis=1) * sqrt_n_words
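
The numbers in the docstring example can be reproduced step by step with NumPy. This is a minimal sketch; `mean_embedding_np` is a hypothetical NumPy port of the function above, written only to make the arithmetic visible.

```python
import numpy as np

def mean_embedding_np(inputs):
    """NumPy version: mean over the token axis, times sqrt(number of words)."""
    not_pad = np.count_nonzero(inputs, axis=-1)                  # (batch, n_tokens)
    n_words = np.count_nonzero(not_pad, axis=-1)[:, np.newaxis]  # (batch, 1)
    return inputs.mean(axis=1) * np.sqrt(n_words.astype(np.float32))

example = np.array([[[1., 2., 3.], [4., 5., 0.], [0., 0., 0.]],
                    [[6., 0., 0.], [0., 0., 0.], [0., 0., 0.]]], dtype=np.float32)
result = mean_embedding_np(example)
print(result)
```

For the first review there are 2 non-zero token vectors, so the mean over all 3 positions, `[5/3, 7/3, 1]`, is multiplied by `sqrt(2)`, giving approximately `[2.357, 3.300, 1.414]`. The second review has 1 word, so its mean `[2, 0, 0]` is multiplied by 1.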


### Building a New Model

It is the same as before, except now the `BagOfWords` layer has been replaced with Keras' `Embedding` layer, followed by a layer that applies the scaling function above:

In [37]:
embedding_size = 20 # this says the embedding space has 20 dimensions

model = keras.models.Sequential([
    text_vectorization,
    keras.layers.Embedding(input_dim=n_tokens,
                           output_dim=embedding_size,
                           mask_zero=True), # <pad> tokens => zero vectors
    keras.layers.Lambda(compute_mean_embedding),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

How does it compare to the previous model?

In [38]:
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
model.fit(train_set, epochs=5, validation_data=valid_set)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f2a34e5e588>

The model appears to perform about the same as the previous one, with 75% accuracy on the validation data.

## Part 6: A Look at Tensorflow Datasets

In [Part 1](https://colab.research.google.com/drive/1WD0OcJT14Ms07hJDQ3A70a9pyYEKetiR#scrollTo=_myKm9V-b9Ol) of this notebook, we used `keras.utils.get_file` to download the IMDb dataset into our project.

Alternatively, we could have used TensorFlow Datasets (TFDS) to get this dataset. This library is specifically designed to provide data that is ready to be used by TensorFlow models (e.g. in this scenario, it gives us text that has already been cleaned, shuffled, and stored as byte strings). 

A full list of the available datasets on TFDS can be found [here](https://www.tensorflow.org/datasets/catalog/overview).

In [39]:
datasets = tfds.load(name="imdb_reviews")
train_set, test_set = datasets["train"], datasets["test"]

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete2ONUZD/imdb_reviews-train.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete2ONUZD/imdb_reviews-test.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete2ONUZD/imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [41]:
for example in train_set.take(1):
    print(example["text"])
    print(example["label"])

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)
