### Pulled from https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

# Text classification with TensorFlow Hub: Movie reviews

This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

The tutorial demonstrates the basic application of transfer learning with <a href="https://tfhub.dev/">TensorFlow Hub</a> and Keras.

It uses the <a href="https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb">IMDB dataset</a>, which contains the text of 50,000 movie reviews from the <a href="https://www.imdb.com/">Internet Movie Database</a>. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses <a href="https://www.tensorflow.org/guide/keras"><code>tf.keras</code></a>, a high-level API to build and train models in TensorFlow, and <a href="https://www.tensorflow.org/hub"><code>tensorflow_hub</code></a>, a library for loading trained models from <a href="https://tfhub.dev/">TFHub</a> in a single line of code. For a more advanced text classification tutorial using <a href="https://www.tensorflow.org/api_docs/python/tf/keras"><code>tf.keras</code></a>, see the <a href="https://developers.google.com/machine-learning/guides/text-classification/">MLCC Text Classification Guide</a>.

In [1]:
!pip install tensorflow-hub
!pip install tensorflow-datasets

Collecting tensorflow-hub
  Downloading tensorflow_hub-0.16.1-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting tf-keras>=2.14.1 (from tensorflow-hub)
  Downloading tf_keras-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorflow<2.20,>=2.19 (from tf-keras>=2.14.1->tensorflow-hub)
  Downloading tensorflow-2.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorboard~=2.19.0 (from tensorflow<2.20,>=2.19->tf-keras>=2.14.1->tensorflow-hub)
  Downloading tensorboard-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting ml-dtypes<1.0.0,>=0.5.1 (from tensorflow<2.20,>=2.19->tf-keras>=2.14.1->tensorflow-hub)
  Downloading ml_dtypes-0.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Downloading tensorflow_hub-0.16.1-py2.py3-none-any.whl (30 kB)
Downloading tf_keras-2.19.0-py3-none-any.whl (1.7 MB)

In [2]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

2025-03-19 02:48:00.931137: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742352480.948563  207866 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742352480.954232  207866 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742352480.966626  207866 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742352480.966644  207866 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742352480.966646  207866 computation_placer.cc:177] computation placer alr

Version:  2.19.0
Eager mode:  True
Hub version:  0.16.1
GPU is NOT AVAILABLE


2025-03-19 02:48:03.523759: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


## Download the IMDB dataset

The IMDB dataset is available as <a href="https://www.tensorflow.org/datasets/catalog/imdb_reviews">imdb_reviews</a> in <a href="https://www.tensorflow.org/datasets">TensorFlow Datasets</a>. The following code downloads the IMDB dataset to your machine (or the Colab runtime):

In [3]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
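
As a quick check of the split sizes quoted in the code comment above, the arithmetic can be sketched in plain Python. The helper `split_sizes` is hypothetical and not part of `tensorflow_datasets`; it only mirrors the `train[:60%]` / `train[60%:]` boundary for a total that divides evenly, so it says nothing about how the library rounds uneven boundaries.

```python
# Sizes stated in the tutorial: 25,000 labeled training reviews, 25,000 test reviews.
TRAIN_TOTAL = 25_000
TEST_TOTAL = 25_000


def split_sizes(total, first_percent):
    """Return (first, rest) sizes for a 'train[:P%]' / 'train[P%:]' style split.

    Hypothetical helper for illustration only; not a tensorflow_datasets API.
    """
    boundary = total * first_percent // 100  # integer arithmetic avoids float rounding
    return boundary, total - boundary


n_train, n_validation = split_sizes(TRAIN_TOTAL, 60)
print(n_train, n_validation, TEST_TOTAL)  # 15000 10000 25000
```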


## Explore the data

Let's take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

Let's print the first 10 examples.

In [4]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

2025-03-19 02:51:13.611259: I tensorflow/core/kernels/data/tf_record_dataset_op.cc:387] The default buffer size is 262144, which is overridden by the user specified `buffer_size` of 8388608
2025-03-19 02:51:13.615997: W tensorflow/core/kernels/data/cache_dataset_ops.cc:916] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

In [5]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])>
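
To read the label batch at a glance, the values can be tallied in plain Python. This is a small sketch that copies the ten labels from the output above into a list; it does not touch the dataset itself.

```python
from collections import Counter

# Labels copied from the batch printed above (0 = negative, 1 = positive).
labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

counts = Counter(labels)
positive_fraction = counts[1] / len(labels)
print(counts[0], counts[1], positive_fraction)  # 7 3 0.3
```

A batch of ten need not be balanced even though the full training and testing sets are.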

## Build the model

The neural network is created by stacking layers, which requires three main architectural decisions:

<ul>
    <li>How to represent the text?</li>
    <li>How many layers to use in the model?</li>
    <li>How many hidden units to use for each layer?</li>
</ul>

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. Use a pre-trained text embedding as the first layer, which has three advantages:

<ul>
    <li>You don't have to worry about text preprocessing.</li>
    <li>You benefit from transfer learning.</li>
    <li>The embedding has a fixed size, so it's simpler to process.</li>
</ul>
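
The fixed-size property can be illustrated with a toy sketch. It is not how the TensorFlow Hub model computes its embeddings: the hash-based token vectors, the whitespace tokenizer, the mean pooling, and the names `toy_token_vector` and `toy_sentence_embedding` are all assumptions made for illustration. The only point is that pooling per-token vectors yields the same output length for any input length.

```python
import hashlib

EMBEDDING_DIM = 4  # toy size chosen for readability; the tutorial's model outputs 50


def toy_token_vector(token, dim=EMBEDDING_DIM):
    """Map a token to a deterministic pseudo-random vector (illustrative only)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(dim)]


def toy_sentence_embedding(sentence, dim=EMBEDDING_DIM):
    """Average the token vectors, so the result always has `dim` entries."""
    tokens = sentence.lower().split()
    if not tokens:
        return [0.0] * dim
    vectors = [toy_token_vector(token, dim) for token in tokens]
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


short_vec = toy_sentence_embedding("great movie")
long_vec = toy_sentence_embedding(
    "this was an absolutely terrible movie and I could barely sit through it"
)
print(len(short_vec), len(long_vec))  # both equal EMBEDDING_DIM
```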

For this example you use a <b>pre-trained text embedding model</b> from <a href="https://tfhub.dev/">TensorFlow Hub</a> called <a href="https://tfhub.dev/google/nnlm-en-dim50/2">google/nnlm-en-dim50/2</a>.

There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:

<ul>
    <li><a href="https://tfhub.dev/google/nnlm-en-dim128/2">google/nnlm-en-dim128/2</a> - trained with the same NNLM architecture on the same data as <a href="https://tfhub.dev/google/nnlm-en-dim50/2">google/nnlm-en-dim50/2</a>, but with a larger embedding dimension. Larger-dimensional embeddings can improve performance on your task, but the model may take longer to train.</li>
    <li><a href="https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2">google/nnlm-en-dim128-with-normalization/2</a> - the same as <a href="https://tfhub.dev/google/nnlm-en-dim128/2">google/nnlm-en-dim128/2</a>, but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.</li>
    <li><a href="https://tfhub.dev/google/universal-sentence-encoder/4">google/universal-sentence-encoder/4</a> - a much larger model yielding 512-dimensional embeddings, trained with a deep averaging network (DAN) encoder.</li>
</ul>

And many more! Find more <a href="https://tfhub.dev/s?module-type=text-embedding">text embedding models</a> on TFHub.

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is `(num_examples, embedding_dimension)`.

In [6]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

2025-03-19 03:08:51.047685: W external/local_xla/xla/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 192762400 exceeds 10% of free system memory.


<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423195 , -0.0119017 ,  0.06337538,  0.06862972, -0.16776837,
        -0.10581174,  0.16865303, -0.04998824, -0.31148055,  0.07910346,
         0.15442263,  0.01488662,  0.03930153,  0.19772711, -0.12215476,
        -0.04120981, -0.2704109 , -0.21922152,  0.26517662, -0.80739075,
         0.25833532, -0.3100421 ,  0.28683215,  0.1943387 , -0.29036492,
         0.03862849, -0.7844411 , -0.0479324 ,  0.4110299 , -0.36388892,
        -0.58034706,  0.30269456,  0.3630897 , -0.15227164, -0.44391504,
         0.19462997,  0.19528408,  0.05666234,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201318 , -0.04418665, -0.08550783,
        -0.55847436, -0.23336391, -0.20782952, -0.03543064, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862679,  0.7753425 , -0.07667089,
        -0.15752277,  0.01872335, -0.08169781, -0.3521876 ,  0.4637341 ,
        -0.08492756,  0.07166859, -0.00670817,  0.12686075, -0.19326553,
 

Let's now build the full model:

In [7]:
# hub.KerasLayer is a Keras 2 (tf_keras) layer. With TensorFlow 2.19, tf.keras is
# Keras 3, and tf.keras.Sequential rejects the layer with:
#   ValueError: Only instances of `keras.Layer` can be added to a Sequential model.
# Building the model with tf_keras (installed alongside tensorflow-hub) avoids this.
import tf_keras

model = tf_keras.Sequential()
model.add(hub_layer)
model.add(tf_keras.layers.Dense(16, activation='relu'))
model.add(tf_keras.layers.Dense(1))

model.summary()
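
As a rough size check for the two `Dense` layers in the model code above, the trainable parameter counts can be worked out by hand. This sketch assumes the 50-dimensional output of `google/nnlm-en-dim50/2` feeds the 16-unit layer; `dense_param_count` is a hypothetical helper, not a Keras function, and the embedding layer's own parameters are left out.

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


EMBEDDING_DIM = 50  # output size of google/nnlm-en-dim50/2

hidden_params = dense_param_count(EMBEDDING_DIM, 16)  # Dense(16, activation='relu')
output_params = dense_param_count(16, 1)              # Dense(1)
print(hidden_params, output_params, hidden_params + output_params)  # 816 17 833
```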