# Chapter 11: Part 1

Modern NLP uses machine learning and large datasets to give computers the ability not to understand language, which is a loftier goal, but to ingest a piece of language as input and return something useful, such as predicting:

    "What’s the topic of this text?" (text classification)
    "Does this text contain abuse?" (content filtering)
    "Does this text sound positive or negative?" (sentiment analysis)
    "What should be the next word in this incomplete sentence?" (language modeling)
    "How would you say this in German?" (translation)
    "How would you summarize this article in one paragraph?" (summarization)

### Using the `TextVectorization` layer

In [1]:
def HR():
    # Print a horizontal rule: the '-' character repeated 80 times
    print('-' * 80)

In [2]:
def set_mixed_precision():
    import tensorflow as tf
    from tensorflow import keras

    physical_devices = tf.config.list_physical_devices('GPU')
    if len(physical_devices) > 0:
        print("GPU mode - switch to mixed precision.")
        print("Every layer will use a 16-bit compute dtype and float32 variable dtype by default.")
        keras.mixed_precision.set_global_policy("mixed_float16")

    HR()
    print("global policy:", tf.keras.mixed_precision.global_policy())

set_mixed_precision()   

--------------------------------------------------------------------------------
global policy: <Policy "float32">
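
The `mixed_float16` policy mentioned above computes in 16-bit floats while keeping variables in float32. A minimal sketch of why that split matters, using only the standard library's half-precision `struct` format (the helper `to_half` is hypothetical and introduced here for illustration): a small weight update can be rounded away entirely when stored in float16.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

weight = 1.0
update = 0.0001

# Near 1.0, adjacent float16 values are 2**-10 (about 0.00098) apart,
# so an update of 0.0001 cannot be represented and is rounded away.
half_result = to_half(weight + update)
print("float16 result:", half_result)
print("update lost in float16:", half_result == weight)
```

Keeping the variables in float32 avoids this loss for small updates, while the layer computations can still use the faster 16-bit type.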


In [3]:
# listing11_2_4, p.359
# This is just for demonstration purposes, as it is not performant.
# In actuality, it is better to use the Keras TextVectorization layer,
# which is fast and efficient, and can be dropped directly into a 
# tf.data pipeline or a Keras model.

def listing11_2_4():
        
    import string

    class Vectorizer:
        def standardize(self, text):
            text = text.lower()
            return "".join(char for char in text if char not in string.punctuation)

        def tokenize(self, text):
            text = self.standardize(text)
            return text.split()

        def make_vocabulary(self, dataset):
            self.vocabulary = {"": 0, "[UNK]": 1}
            for text in dataset:
                text = self.standardize(text)
                tokens = self.tokenize(text)
                for token in tokens:
                    if token not in self.vocabulary:
                        self.vocabulary[token] = len(self.vocabulary)
            self.inverse_vocabulary = dict(
                (v, k) for k, v in self.vocabulary.items())

        def encode(self, text):
            text = self.standardize(text)
            tokens = self.tokenize(text)
            return [self.vocabulary.get(token, 1) for token in tokens]

        def decode(self, int_sequence):
            return " ".join(
                self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

    #####################
        
    vectorizer = Vectorizer()
    dataset = [
        "I write, erase, rewrite",
        "Erase again, and then",
        "A poppy blooms.",
    ]
    vectorizer.make_vocabulary(dataset)


    # test Haiku-like sentence
    test_sentence = "I write, rewrite, and still rewrite again"
    encoded_sentence = vectorizer.encode(test_sentence)
    print(encoded_sentence)
    print()
    
    decoded_sentence = vectorizer.decode(encoded_sentence)
    print(decoded_sentence)

listing11_2_4()

[2, 3, 5, 7, 1, 5, 6]

i write rewrite and [UNK] rewrite again


In [4]:
# Using the Keras TextVectorization layer, p.360
# This can be dropped directly into a tf.data pipeline or a Keras model.
# We can provide custom functions for standardization and tokenization, 
# which means the layer is flexible enough to handle any use case.
# Such custom functions should operate on tf.string tensors, 
# not regular Python strings.

def listing11_2_5():
    import re
    import string
    import tensorflow as tf
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

    def custom_standardization_fn(string_tensor):
        # Convert strings to lowercase
        lowercase_string = tf.strings.lower(string_tensor)
        # Replace punctuation characters with the empty string
        return tf.strings.regex_replace(
            lowercase_string, f"[{re.escape(string.punctuation)}]", "")

    def custom_split_fn(string_tensor):
        # Split strings on whitespace
        return tf.strings.split(string_tensor)

    # Configures the layer to return sequences of words encoded as integer indices.
    text_vectorization = TextVectorization(
        output_mode="int",
        standardize=custom_standardization_fn,
        split=custom_split_fn,
    )


    # To index the vocabulary of a text corpus, call the adapt() method
    # of the layer with a Dataset object that yields strings, or just with
    # a list of Python strings:
    dataset = [
        "I write, erase, rewrite",
        "Erase again, and then",
        "A poppy blooms.",
    ]
    text_vectorization.adapt(dataset)

    
    # Displaying the vocabulary.
    # We can retrieve the computed vocabulary via get_vocabulary().
    # This is useful if you need to convert text encoded as integer sequences back into words.
    # The first two entries in the vocabulary are the mask token (index 0) and the OOV token (index 1).
    # Entries in the vocabulary list are sorted by frequency.
    print(text_vectorization.get_vocabulary())
    print()

    
    # Encode and then decode an example sentence.
    vocabulary = text_vectorization.get_vocabulary()
    test_sentence = "I write, rewrite, and still rewrite again"
    encoded_sentence = text_vectorization(test_sentence)
    print(encoded_sentence)
    print()

    inverse_vocab = dict(enumerate(vocabulary))
    decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
    print(decoded_sentence)
    print()

listing11_2_5()

['', '[UNK]', 'erase', 'write', 'then', 'rewrite', 'poppy', 'i', 'blooms', 'and', 'again', 'a']

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)

i write rewrite and [UNK] rewrite again
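
The frequency-sorted vocabulary printed above can be approximated in plain Python. This is only a sketch of the idea, not the layer's implementation: the `standardize` helper is a stand-in for the custom standardization function, and tokens with equal counts may come out in a different order than in the layer's output.

```python
import string
from collections import Counter

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]

def standardize(text):
    """Lowercase the text and strip punctuation characters."""
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)

# Count every token across the whole corpus.
counts = Counter(
    token for text in dataset for token in standardize(text).split()
)

# Reserve index 0 for the mask token and index 1 for out-of-vocabulary words,
# then list the remaining tokens from most to least frequent.
vocabulary = ["", "[UNK]"] + [token for token, _ in counts.most_common()]
print(vocabulary)
```

"erase" is the only token that occurs twice, so it takes the first slot after the two reserved entries, as in the layer's output.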



---
## 11.3 Two approaches for representing groups of words: sets and sequences, p.362

How to represent word order is the pivotal question from which different kinds of NLP architectures spring. The simplest thing you could do is just discard order and treat text as an unordered set of words—this gives you bag-of-words models. You could also decide that words should be processed strictly in the order in which they appear, one at a time, like steps in a timeseries—you could then leverage the recurrent models from last chapter. Finally, a hybrid approach is also possible: the Transformer architecture is technically order-agnostic, yet it injects word-position information into the representations it processes, which enables it to simultaneously look at different parts of a sentence (unlike RNNs) while still being order-aware. Because they take into account word order, both RNNs and Transformers are called sequence models.
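
A small illustration of the difference between the two representations, using two made-up sentences that contain the same words in a different order:

```python
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# Bag-of-words view: an unordered set of tokens, so word order is lost
# and the two sentences become indistinguishable.
bag_a = set(sentence_a.split())
bag_b = set(sentence_b.split())
print("same bag of words:", bag_a == bag_b)

# Sequence view: tokens are kept in order, so the sentences differ.
seq_a = sentence_a.split()
seq_b = sentence_b.split()
print("same sequence:", seq_a == seq_b)
```

A sequence model can tell who bit whom; a bag-of-words model cannot.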

### 11.3.1 Preparing the IMDB movie reviews data

In [5]:
def listing11_3_1():
    import os

    dirpath = 'aclImdb'
    if not os.path.isdir(dirpath):
        print(f'{dirpath} not found, creating directory')
        HR()
        try:
            !curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
            !tar -xf aclImdb_v1.tar.gz
            !rm -r aclImdb/train/unsup
            
        except Exception as ex:
            print(f"Not able to create directory due to error {ex}")

    print()
    !cat aclImdb/train/pos/4077_10.txt

listing11_3_1()

# This creates this folder structure, where pos is positive, neg is negative
# aclImdb/
# ...train/
# ......pos/
# ......neg/
# ...test/
# ......pos/
# ......neg/

aclImdb not found, creating directory
--------------------------------------------------------------------------------
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  37.5M      0  0:00:02  0:00:02 --:--:-- 37.5M

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on 

In [6]:
def listing11_3_2():
    import os, pathlib, shutil, random
    from tensorflow import keras

    dirpath = 'aclImdb/val'
    if os.path.isdir(dirpath):
        print(f"{dirpath} already exists")
    else:
        print(f"Prepare a validation set by setting apart 20% of the training text files in a new directory, {dirpath}")
        base_dir = pathlib.Path("aclImdb")
        val_dir = base_dir / "val"
        train_dir = base_dir / "train"
        
        for category in ("neg", "pos"):
            os.makedirs(val_dir / category, exist_ok=True)
            files = os.listdir(train_dir / category)

            # Shuffle the list of training files using a seed, to ensure
            # we get the same validation set every time we run the code
            random.Random(1337).shuffle(files)

            # Take 20% of the training files to use for validation
            num_val_samples = int(0.2 * len(files))
            val_files = files[-num_val_samples:]

            # Move the files to aclImdb/val/neg and aclImdb/val/pos
            for fname in val_files:
                shutil.move(train_dir / category / fname,
                            val_dir / category / fname)
    
listing11_3_2()

Prepare a validation set by setting apart 20% of the training text files in a new directory, aclImdb/val
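
The seeded shuffle is what makes the validation split repeatable. A minimal sketch of the same idea on an in-memory list (the file names and the `split_files` helper are hypothetical, used only for illustration):

```python
import random

# Hypothetical file names standing in for the review files of one category.
files = [f"{i}_review.txt" for i in range(10)]

def split_files(names, val_fraction=0.2, seed=1337):
    """Return (train, val) lists using a seeded shuffle of a copy of names."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    num_val = int(val_fraction * len(shuffled))
    split_at = len(shuffled) - num_val
    return shuffled[:split_at], shuffled[split_at:]

train_a, val_a = split_files(files)
train_b, val_b = split_files(files)
print(val_a == val_b)            # same seed and same input order: same split
print(len(train_a), len(val_a))
```

The result depends on the input order as well as the seed. Because `os.listdir` returns names in arbitrary order, sorting the list before shuffling would make the split reproducible across machines.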


In [7]:
# https://realpython.com/python-namedtuple/
# Pympler is a development tool to measure, monitor, and analyze the memory behavior of Python objects.
!pip install pympler

Collecting pympler
  Downloading Pympler-0.9.tar.gz (178 kB)
Building wheels for collected packages: pympler
  Building wheel for pympler (setup.py) ... done
  Created wheel for pympler: filename=Pympler-0.9-py3-none-any.whl size=164823 sha256=70639f90d7ba70e1f55a8fd7e9cf44922187b1674d1ce0aab248ac4d3983be0d
  Stored in directory: /root/.cache/pip/wheels/1a/f3/d8/35d5614ea4ddd295ffb9372a5f2f9570d9593d1ea4be33ec6d
Successfully built pympler
Installing collected packages: pympler
Successfully installed pympler-0.9


In [36]:
from pympler import asizeof
from collections import namedtuple

# One possibility: use a namedtuple as the data container.

DATA = namedtuple("DATA", [
    'train_ds',
    'val_ds',
    'test_ds',
    'binary_1gram_train_ds',
    'binary_1gram_val_ds',
    'binary_1gram_test_ds',
    'text_only_train_ds'
])
print(asizeof.asizeof(DATA))

#data_c = 

0


In [104]:
import tensorflow
from dataclasses import dataclass
from dataclasses import astuple

# Create dataclass
# https://realpython.com/python-data-classes/

# create immutable dataclass
@dataclass(frozen=True)
class DATACLASS_C:
    #train_ds: tensorflow.data
    train_ds: object
    val_ds: object
    test_ds: object
    binary_1gram_train_ds: object
    binary_1gram_val_ds: object
    binary_1gram_test_ds: object
    text_only_train_ds: object

    # How to create iterable dataclass?
    # def __iter__(self):
    #     return iter(astuple(self))

print(DATACLASS_C)
print(asizeof.asizeof(DATACLASS_C))

<class '__main__.DATACLASS_C'>
0
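
One possible answer to the "iterable dataclass" question in the cell above, sketched on a smaller hypothetical class (`Splits` is not part of the notebook). It iterates with `dataclasses.fields` instead of `astuple`, because `astuple` deep-copies field values, which is probably not wanted for dataset objects.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Splits:
    """Hypothetical three-field container used only for this illustration."""
    train_ds: object
    val_ds: object
    test_ds: object

    def __iter__(self):
        # Yield field values in declaration order without copying them.
        return (getattr(self, f.name) for f in fields(self))

splits = Splits(train_ds="train", val_ds="val", test_ds="test")
train, val, test = splits          # unpacking works through __iter__
print(train, val, test)

try:
    splits.train_ds = "other"      # frozen=True forbids reassignment
except AttributeError as err:
    print("immutable:", type(err).__name__)
```

This gives the dataclass the same unpacking behavior as the namedtuple while keeping attribute access and immutability.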


In [105]:
# In chapter 8, we used the utility image_dataset_from_directory to create a 
# batched Dataset of images and their labels for a directory structure.  
# We can do the same thing for text files using the utility text_dataset_from_directory.
# We create three Dataset objects, for training, validation, and testing.
# p.364

# Organize this project into 2 main aspects:
# 1. Data engineering
# 2. Model building

# Create and process data ("data-engineering")
def listing11_3_2b():
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

    batch_size = 32

    train_ds = keras.preprocessing.text_dataset_from_directory(
        "aclImdb/train", batch_size=batch_size
    )
    val_ds = keras.preprocessing.text_dataset_from_directory(
        "aclImdb/val", batch_size=batch_size
    )
    test_ds = keras.preprocessing.text_dataset_from_directory(
        "aclImdb/test", batch_size=batch_size
    )

    print()

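    # Displaying the shapes and dtypes of the first batch.
    # The raw datasets yield inputs that are tf.string tensors
    # and targets that are int32 tensors encoding the value 0 or 1.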
    for inputs, targets in train_ds:
        print("inputs.shape:", inputs.shape)
        print("inputs.dtype:", inputs.dtype)
        print("targets.shape:", targets.shape)
        print("targets.dtype:", targets.dtype)
        print("inputs[0]:", inputs[0])
        print("targets[0]:", targets[0])
        break


    # Processing words as a set: the bag-of-words approach
    # Single words (unigrams) with binary encoding
    # Preprocessing our datasets with a TextVectorization layer
    text_vectorization = TextVectorization(
        max_tokens=20000,
        output_mode="binary",
    )
    text_only_train_ds = train_ds.map(lambda x, y: x)
    text_vectorization.adapt(text_only_train_ds)

    binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
    binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
    binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))


    # Inspecting the output of our binary unigram dataset.
    # After vectorization, the inputs are float32 multi-hot vectors of shape
    # (batch_size, 20000); the targets are still int32 tensors encoding 0 or 1.
    for inputs, targets in binary_1gram_train_ds:
        print("inputs.shape:", inputs.shape)
        print("inputs.dtype:", inputs.dtype)
        print("targets.shape:", targets.shape)
        print("targets.dtype:", targets.dtype)
        print("inputs[0]:", inputs[0])
        print("targets[0]:", targets[0])
        print('--------')
        break
    

    HR()
    print(type(train_ds))

    HR()

    data = {
        'train_ds': train_ds,
        'val_ds': val_ds,
        'test_ds': test_ds,
        'binary_1gram_train_ds': binary_1gram_train_ds,
        'binary_1gram_val_ds': binary_1gram_val_ds,
        'binary_1gram_test_ds': binary_1gram_test_ds,
        'text_only_train_ds': text_only_train_ds,
    }

    data_t = DATA(
        train_ds = train_ds,
        val_ds = val_ds,
        test_ds = test_ds,
        binary_1gram_train_ds = binary_1gram_train_ds,
        binary_1gram_val_ds = binary_1gram_val_ds,
        binary_1gram_test_ds = binary_1gram_test_ds,
        text_only_train_ds = text_only_train_ds
    )

    data_c = DATACLASS_C(
        train_ds = train_ds,
        val_ds = val_ds,
        test_ds = test_ds,
        binary_1gram_train_ds = binary_1gram_train_ds,
        binary_1gram_val_ds = binary_1gram_val_ds,
        binary_1gram_test_ds = binary_1gram_test_ds,
        text_only_train_ds = text_only_train_ds
    )


    HR()
    print(f"Size of data:\t{asizeof.asizeof(data)}")
    print(f"Size of data_t:\t{asizeof.asizeof(data_t)}")
    print(f"Size of data_c:\t{asizeof.asizeof(data_c)}")
    HR()


    print(f"Size of data_c: {asizeof.asizeof(data_c)}")
    print("Testing data_c")
    print(data_c)
    print(data_c.train_ds)
    print(data_c.test_ds)
    print(data_c.val_ds)
    print()
    HR()



    return data, data_t, data_c



data, data_t, data_c = listing11_3_2b()

print(f"Size of data 2: {asizeof.asizeof(data)}")

print(type(data))

for x, y in data.items():
    print(x, y)

print(data['train_ds'])

# print(type(binary_1gram_train_ds))
# print(binary_1gram_train_ds)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"if you didn't live in the 90's or didn't listen to rapper EVER!! this movie might be OK for you, but any for any fan or any single person who ever listened to rap this movie was boring and there was no point in the movie where i said thats interesting or i didn't know that. another thing that bugged me was it made it look like anything in his life he did was very easy there was no struggle he made jail look easy, selling drugs, and even rapping it wasn't realistic. i think if the movie where released in about 15 years from now it might have more of an impact maybe!!! good rap movies hustle and flow, get rich or die trying not notorious", shape=(), dtype=string)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)
inputs.shape: (32, 20000)
in

In [107]:
print(f"Size of data_t: {asizeof.asizeof(data_t)}")
print(data_t)
print(data_t.train_ds)
print(data_t.test_ds)
print(data_t.val_ds)
print()

for x in data_t:
    print(x)
HR()

print(f"Size of data_c: {asizeof.asizeof(data_c)}")
print(data_c)
print(data_c.train_ds)
print(data_c.test_ds)
print(data_c.val_ds)
print()

# for x in data_c:
#     print(x)
# HR()


Size of data_t: 663248
DATA(train_ds=<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>, val_ds=<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>, test_ds=<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>, binary_1gram_train_ds=<MapDataset shapes: ((None, 20000), (None,)), types: (tf.float32, tf.int32)>, binary_1gram_val_ds=<MapDataset shapes: ((None, 20000), (None,)), types: (tf.float32, tf.int32)>, binary_1gram_test_ds=<MapDataset shapes: ((None, 20000), (None,)), types: (tf.float32, tf.int32)>, text_only_train_ds=<MapDataset shapes: (None,), types: tf.string>)
<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>

<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
<Bat

In [None]:
# Processing words as a set: the bag-of-words approach
# The simplest way to encode a piece of text for processing by a machine learning 
# model is to discard order and treat it as a set (a "bag") of tokens. You could 
# either look at individual words (unigrams), or try to recover some local order 
# information by looking at groups of consecutive tokens (N-grams).
 

def get_model(max_tokens=20000, hidden_dim=16):
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="rmsprop",
        loss="binary_crossentropy",
        metrics=["accuracy"]
    )
    return model


def listing11_7():
    from tensorflow import keras
    from tensorflow.keras import layers

    binary_1gram_train_ds = data['binary_1gram_train_ds']
    binary_1gram_val_ds = data['binary_1gram_val_ds']
    binary_1gram_test_ds = data['binary_1gram_test_ds']

   
    #############################
    
    # Listing 11.8 Training and testing the binary unigram model
    model = get_model()
    model.summary()

    callbacks = [
        keras.callbacks.ModelCheckpoint(
            "binary_1gram.keras",
            save_best_only=True
        )
    ]

    model.fit(
        # Call .cache() on the datasets to cache them in memory: this way, we will only 
        # do the preprocessing once during the first epoch, and we’ll reuse the 
        # preprocessed texts for the following epochs. This can only be done if the 
        # data is small enough to fit in memory.
        binary_1gram_train_ds.cache(),
        validation_data=binary_1gram_val_ds.cache(),
        epochs=10,
        callbacks=callbacks
    )
    
    model = keras.models.load_model("binary_1gram.keras")

    HR()

    print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

listing11_7()

# Test acc: 0.879

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 20000)]           0         
_________________________________________________________________
dense (Dense)                (None, 16)                320016    
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------------------------------------
Test acc: 0.880
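
The binary unigram encoding fed to this model can be sketched in plain Python. This is an illustration under assumptions, not the layer's implementation: the `multi_hot` helper and the toy vocabulary (with `[UNK]` placed at index 0) are made up for the example.

```python
def multi_hot(tokens, vocabulary, unk_index=0):
    """Return a 0/1 vector with a 1 at the index of every token present."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        # Unknown tokens all share the [UNK] slot.
        vector[index.get(token, unk_index)] = 1
    return vector

# Toy vocabulary; the position of "[UNK]" is an assumption of this sketch.
vocabulary = ["[UNK]", "the", "cat", "sat", "on", "mat"]

vec = multi_hot("the cat sat on the hat".split(), vocabulary)
print(vec)
```

The repeated "the" still produces a single 1, and word order plays no role: the vector only records which vocabulary entries are present.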


In [None]:
def listing11_10():
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

    # Bigrams with binary encoding
    
    # Of course, discarding word order is very reductive, because even atomic concepts 
    # can be expressed via multiple words: the term "United States" conveys a concept 
    # that is quite distinct from the meaning of the words "states" and "united" taken 
    # separately. For this reason, you will usually end up re-injecting local order 
    # information into your bag-of-words representation by looking at N-grams rather 
    # than single words (most commonly, bigrams).
    
    binary_1gram_train_ds = data['binary_1gram_train_ds']
    binary_1gram_val_ds = data['binary_1gram_val_ds']
    binary_1gram_test_ds = data['binary_1gram_test_ds']
    text_only_train_ds = data['text_only_train_ds']
    train_ds = data['train_ds']
    val_ds = data['val_ds']
    test_ds = data['test_ds']

    # Configuring the TextVectorization layer to return bigrams
    text_vectorization = TextVectorization(
        ngrams=2,
        max_tokens=20000,
        output_mode="binary",
    )

    # Test how model would perform when trained on such binary-encoded bags of bigrams.
    # Listing 11.10 Training and testing the binary bigram model
    text_vectorization.adapt(text_only_train_ds)
    binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
    binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
    binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

    # Build the same dense model used for the unigram features; here it is trained on bigram features
    model = get_model()

    model.summary()
    callbacks = [
        keras.callbacks.ModelCheckpoint(
            "binary_2gram.keras",
            save_best_only=True
        )
    ]
    model.fit(
        binary_2gram_train_ds.cache(),
        validation_data=binary_2gram_val_ds.cache(),
        epochs=10,
        callbacks=callbacks
    )
    model = keras.models.load_model("binary_2gram.keras")

    HR()

    print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

listing11_10()

# Test acc: 0.887

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 20000)]           0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                320016    
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------------------------------------
Test acc: 0.896
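
A plain-Python sketch of what "up to bigrams" means for one sentence. The `ngrams` helper is hypothetical; it mirrors the idea of emitting all N-grams of length 1 through N, not the layer's actual code.

```python
def ngrams(tokens, n):
    """Return all n-grams of length 1 through n, each joined by a space."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

tokens = "the cat sat on the mat".split()

# Treating the result as a set gives the "bag of bigrams" for this sentence.
bag_of_bigrams = set(ngrams(tokens, 2))
print(sorted(bag_of_bigrams))
```

Pairs such as "the cat" keep a little local order that single words lose, which is consistent with the bigram model scoring slightly higher than the unigram model in the runs above.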


In [None]:
def listing11_13():
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

    # Bigrams with TF-IDF encoding
    
    # Add a bit more information to this representation by counting how many times 
    # each word or N-gram occurs, that is to say, by taking the histogram of the 
    # words over the text.
    # {"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
    # "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
    
    # If you’re doing text classification, knowing how many times a word occurs in a 
    # sample is critical: any sufficiently long movie review may contain the word 
    # "terrible" regardless of sentiment, but a review that contains many instances 
    # of the word "terrible" is likely a negative one.
    

    binary_1gram_train_ds = data['binary_1gram_train_ds']
    binary_1gram_val_ds = data['binary_1gram_val_ds']
    binary_1gram_test_ds = data['binary_1gram_test_ds']
    text_only_train_ds = data['text_only_train_ds']
    train_ds = data['train_ds']
    val_ds = data['val_ds']
    test_ds = data['test_ds']


    # Configuring the TextVectorization layer to return token counts
    text_vectorization = TextVectorization(
        ngrams=2,
        max_tokens=20000,
        output_mode="count"
    )

    # Configuring the TextVectorization layer to return TF-IDF-weighted outputs.
    # This replaces the count-based layer defined just above, which is kept only for comparison.
    text_vectorization = TextVectorization(
        ngrams=2,
        max_tokens=20000,
        output_mode="tf-idf", # TF-IDF normalization
    )

    # Training and testing the TF-IDF bigram model
    text_vectorization.adapt(text_only_train_ds)

    tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
    tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
    tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

    model = get_model()
    model.summary()
    callbacks = [
        keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                        save_best_only=True)
    ]
    model.fit(
        tfidf_2gram_train_ds.cache(),
        validation_data=tfidf_2gram_val_ds.cache(),
        epochs=10,
        callbacks=callbacks
    )
    model = keras.models.load_model("tfidf_2gram.keras")
    
    HR()

    print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

listing11_13()

# Test acc: 0.888

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 20000)]           0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                320016    
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------------------------------------
Test acc: 0.895
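
To make the idea behind TF-IDF weighting concrete, here is a small sketch on three made-up documents. The smoothed formula below is one common variant, chosen for illustration; the exact weighting applied by the `TextVectorization` layer may differ.

```python
import math
from collections import Counter

# Three tiny made-up documents, already tokenized.
docs = [
    "the movie was terrible terrible".split(),
    "the movie was great".split(),
    "the plot was thin".split(),
]

def tf_idf(doc, docs):
    """Weight each term count by a smoothed inverse document frequency."""
    counts = Counter(doc)
    n_docs = len(docs)
    weights = {}
    for term, count in counts.items():
        # Number of documents that contain the term at least once.
        doc_freq = sum(1 for d in docs if term in d)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        weights[term] = count * idf
    return weights

weights = tf_idf(docs[0], docs)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

"the" appears in every document and gets the lowest weight, while "terrible" is both repeated and rare across documents, so it gets the highest.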


In [None]:
def listing_sidebar():
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

    # Bigrams with TF-IDF encoding
    
    # Add a bit more information to this representation by counting how many times 
    # each word or N-gram occurs, that is to say, by taking the histogram of the 
    # words over the text.
    # {"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
    # "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
    
    # If you’re doing text classification, knowing how many times a word occurs in a 
    # sample is critical: any sufficiently long movie review may contain the word 
    # "terrible" regardless of sentiment, but a review that contains many instances 
    # of the word "terrible" is likely a negative one.
    

    binary_1gram_train_ds = data['binary_1gram_train_ds']
    binary_1gram_val_ds = data['binary_1gram_val_ds']
    binary_1gram_test_ds = data['binary_1gram_test_ds']
    text_only_train_ds = data['text_only_train_ds']
    train_ds = data['train_ds']
    val_ds = data['val_ds']
    test_ds = data['test_ds']


    # Configuring the TextVectorization layer to return token counts
    text_vectorization = TextVectorization(
        ngrams=2,
        max_tokens=20000,
        output_mode="count"
    )

    # Configuring the TextVectorization layer to return TF-IDF-weighted outputs.
    # This replaces the count-based layer defined just above, which is kept only for comparison.
    text_vectorization = TextVectorization(
        ngrams=2,
        max_tokens=20000,
        output_mode="tf-idf", # TF-IDF normalization
    )

    # Training and testing the TF-IDF bigram model
    text_vectorization.adapt(text_only_train_ds)

    tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
    tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
    tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

    model = get_model()

    model.summary()
    callbacks = [
        keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                        save_best_only=True)
    ]
    model.fit(
        tfidf_2gram_train_ds.cache(),
        validation_data=tfidf_2gram_val_ds.cache(),
        epochs=10,
        callbacks=callbacks
    )
    model = keras.models.load_model("tfidf_2gram.keras")
    
    HR()

    print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

    print()

    ###########################

    print("Exporting a model that processes raw strings.")

    # Exporting a model that processes raw strings
    inputs = keras.Input(shape=(1,), dtype="string")
    processed_inputs = text_vectorization(inputs)
    outputs = model(processed_inputs)
    inference_model = keras.Model(inputs, outputs)

    raw_text_data = tf.convert_to_tensor([
        ["That was an excellent movie, I loved it."],
    ])
    print("raw_text_data", raw_text_data)
    print()

    predictions = inference_model(raw_text_data)
    print(f"{float(predictions[0] * 100):.2f} percent positive")
    print()

    inference_model.summary()

listing_sidebar()

# Test acc: 0.888

# ORIGINAL MODEL that processes data in separate pipeline
# Model: "model_10"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_11 (InputLayer)        [(None, 20000)]           0         
# _________________________________________________________________
# dense_14 (Dense)             (None, 16)                320016    
# _________________________________________________________________
# dropout_7 (Dropout)          (None, 16)                0         
# _________________________________________________________________
# dense_15 (Dense)             (None, 1)                 17        
# =================================================================
# Total params: 320,033
# Trainable params: 320,033
# Non-trainable params: 0


# NEW MODEL that processes input as part of model
# Model: "model_11"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_12 (InputLayer)        [(None, 1)]               0         
# _________________________________________________________________
# text_vectorization_20 (TextV (None, 20000)             0         
# _________________________________________________________________
# model_10 (Functional)        (None, 1)                 320033    
# =================================================================
# Total params: 320,033
# Trainable params: 320,033
# Non-trainable params: 0
# _________________________________________________________________

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 20000)]           0         
_________________________________________________________________
dense_6 (Dense)              (None, 16)                320016    
_________________________________________________________________
dropout_3 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 17        
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------------------------------------
Test acc: 0.882

Exporting a model that processes raw strin

### Next is 11.3.3 Processing words as a sequence: the Sequence Model approach