# Basic text classification

URL: https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb

This tutorial demonstrates text classification starting from plain text files stored on disk. You'll train a binary classifier to perform sentiment analysis on an IMDB dataset.

In [2]:
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

In [3]:
print(tf.__version__)

2.6.2


## Sentiment analysis

This notebook trains a sentiment analysis model to classify movie reviews as positive or negative, based on the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

You'll use the Large Movie Review Dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

### Download and explore the IMDB dataset
Let's download and extract the dataset, then explore the directory structure.

In [4]:
def download_datafiles(path, url):
    #path = "aclImdb_v1"
    #url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    # Load the files
    dataset = tf.keras.utils.get_file(path, url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')
    # Define the dataset directory
    dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
    # Files in the directories
    print(os.listdir(dataset_dir))
    # Set the train dir
    train_dir = os.path.join(dataset_dir, 'train')
    test_dir = os.path.join(dataset_dir, 'test')
    # Remove the unnecessary 'unsup' directory (skipped if it is already gone)
    remove_dir = os.path.join(train_dir, 'unsup')
    if os.path.isdir(remove_dir):
        shutil.rmtree(remove_dir)
    
    return train_dir, test_dir
    


### Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the helpful text_dataset_from_directory utility, which expects a directory structure as follows.

main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt

To prepare a dataset for binary classification, you will need two folders on disk, corresponding to class_a and class_b. These will be the positive and negative movie reviews, which can be found in aclImdb/train/pos and aclImdb/train/neg. As the IMDB dataset contains additional folders, you will remove them before using this utility.
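
The layout described above can be reproduced in miniature. The following is a minimal sketch using only the standard library; the helper names `make_layout` and `list_classes` and the placeholder file contents are assumptions made for illustration, not part of the tutorial. It builds a two-class directory tree in a temporary folder and lists the class folders in sorted order, which matches the neg/pos label order reported later in this notebook.

```python
import os
import tempfile


def make_layout(root, classes, files_per_class=2):
    """Create root/<class>/<class>_text_<i>.txt files with placeholder text."""
    for name in classes:
        class_dir = os.path.join(root, name)
        os.makedirs(class_dir, exist_ok=True)
        for i in range(1, files_per_class + 1):
            path = os.path.join(class_dir, f"{name}_text_{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(f"placeholder review {i} for class {name}\n")


def list_classes(root):
    """Return the sorted subdirectory names, one per class."""
    return sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )


root = tempfile.mkdtemp()
make_layout(root, ["pos", "neg"])
class_names = list_classes(root)
print(class_names)
```

Any extra subdirectory under the root would show up as an additional class in this listing, which is why the unlabeled folder is removed before loading.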

Next, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset. tf.data is a powerful collection of tools for working with data.

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test.

The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using the validation_split argument below.
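
To make the 80:20 arithmetic concrete, here is a plain-Python sketch of a seeded split. It is not how Keras implements `validation_split` internally; the function `split_train_validation` is a hypothetical helper. It illustrates why the same seed has to be used for both subsets: a fixed seed yields the same shuffle each time, so the training and validation parts never overlap.

```python
import random


def split_train_validation(items, validation_split=0.2, seed=42):
    """Shuffle deterministically, then hold out the last fraction for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * validation_split)
    split_at = len(items) - n_val
    return items[:split_at], items[split_at:]


file_ids = range(25000)  # stand-ins for the 25,000 training files
train_ids, val_ids = split_train_validation(file_ids)
print(len(train_ids), len(val_ids))
```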

In [5]:

def create_datasets(train_dir, test_dir):
    # train 
    batch_size = 32
    seed = 42

    raw_train_ds = tf.keras.utils.text_dataset_from_directory(
        train_dir, 
        batch_size=batch_size, 
        validation_split=0.2, 
        subset='training', 
        seed=seed)
    
    raw_val_ds = tf.keras.utils.text_dataset_from_directory(
        train_dir, 
        batch_size=batch_size, 
        validation_split=0.2, 
        subset='validation', 
        seed=seed)
    
    raw_test_ds = tf.keras.utils.text_dataset_from_directory(
        test_dir, 
        batch_size=batch_size)
    
    return raw_train_ds, raw_val_ds, raw_test_ds


There are 25,000 examples in the training folder, of which you will use 80% (20,000) for training and the remaining 20% (5,000) for validation; the output of the cell that calls create_datasets confirms these counts. As you will see in a moment, you can train a model by passing a dataset directly to model.fit. If you're new to tf.data, you can also iterate over the dataset and print out a few examples, as done in the "Analyze datasets" section.

In [None]:
# RUN THIS CELL ONLY FOR THE VERY FIRST TIME YOU NEED TO DOWNLOAD THE DATAFILES
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

train_dir, test_dir=download_datafiles("aclImdb_v1", url)

print(train_dir)
print(test_dir)

raw_train_ds, raw_val_ds, raw_test_ds= create_datasets(train_dir, test_dir)


Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
['test', 'README', 'train', 'imdbEr.txt', 'imdb.vocab']


In [6]:
train_dir = "aclImdb/train"
test_dir = "aclImdb/test"

In [7]:
raw_train_ds, raw_val_ds, raw_test_ds= create_datasets(train_dir, test_dir)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


### Analyze datasets

In [8]:
print(raw_train_ds)

<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>


In [9]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

In [9]:
print("Label 0 corresponds to", raw_train_ds.class_names[0])
print("Label 1 corresponds to", raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos


## Save the dataset to disk

In [10]:
train_path='data/train'
val_path='data/validation'
test_path='data/test'

tf.data.experimental.save(raw_train_ds, train_path)
tf.data.experimental.save(raw_val_ds, val_path)
tf.data.experimental.save(raw_test_ds, test_path)

## Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the helpful tf.keras.layers.TextVectorization layer.

Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset. Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words, by splitting on whitespace). Vectorization refers to converting tokens into numbers so they can be fed into a neural network. All of these tasks can be accomplished with this layer.
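
The three steps can be sketched in plain Python to show what each one produces. This is only an illustration under stated assumptions: the tiny `vocab` dictionary is made up, and index 1 is assumed to stand for out-of-vocabulary tokens. The TextVectorization layer performs the equivalent work on tensors.

```python
import re
import string


def standardize(text):
    """Lowercase the text and strip punctuation."""
    text = text.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)


def tokenize(text):
    """Split a standardized string into words on whitespace."""
    return text.split()


def vectorize(tokens, vocab, oov_index=1):
    """Map each token to its integer index, or to oov_index if unknown."""
    return [vocab.get(tok, oov_index) for tok in tokens]


# Hypothetical vocabulary, for illustration only.
vocab = {"the": 2, "movie": 3, "was": 4, "great": 5}

tokens = tokenize(standardize("The movie was GREAT, truly!"))
ids = vectorize(tokens, vocab)
print(tokens)
print(ids)
```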

As you saw above, the reviews contain various HTML tags like <br />. These tags will not be removed by the default standardizer in the TextVectorization layer (which converts text to lowercase and strips punctuation by default, but doesn't strip HTML). You will write a custom standardization function to remove the HTML.
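
The effect of leaving the tags in place can be seen with a plain-Python sketch. The two helper functions below are illustrative stand-ins, not the Keras implementation. The sample string is the tail of the first review printed earlier. Stripping punctuation alone turns `<br />` into the stray token `br` and glues it onto the preceding word.

```python
import re
import string

PUNCTUATION = f"[{re.escape(string.punctuation)}]"


def default_like_standardize(text):
    """Lowercase and strip punctuation only (HTML tags are not handled)."""
    return re.sub(PUNCTUATION, "", text.lower())


def html_aware_standardize(text):
    """Lowercase, replace <br /> tags with a space, then strip punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub(PUNCTUATION, "", text)


sample = "How bizarre is that?<br /><br />*1/2 (out of four)"
default_tokens = default_like_standardize(sample).split()
custom_tokens = html_aware_standardize(sample).split()
print(default_tokens)
print(custom_tokens)
```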

Note: To prevent training-testing skew (also known as training-serving skew), it is important to preprocess the data identically at train and test time. To facilitate this, the TextVectorization layer can be included directly inside your model, as shown later in this tutorial.

In [8]:
train_path='data/train'
val_path='data/validation'
test_path='data/test'

raw_train_ds = tf.data.experimental.load(train_path)
raw_val_ds = tf.data.experimental.load(val_path)
raw_test_ds = tf.data.experimental.load(test_path)

In [9]:
@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    
    return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

Next, you will create a TextVectorization layer. You will use this layer to standardize, tokenize, and vectorize your data. You set the output_mode to 'int' to create unique integer indices for each token.

Note that you're using the default split function, and the custom standardization function you defined above. You'll also define some constants for the model, like an explicit maximum sequence_length, which will cause the layer to pad or truncate sequences to exactly sequence_length values.
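
Padding and truncation to a fixed length can be sketched as follows. This plain-Python helper, `pad_or_truncate`, is hypothetical and assumes that padding uses index 0 and that both padding and truncation happen at the end of the sequence.

```python
def pad_or_truncate(ids, sequence_length, pad_value=0):
    """Return exactly sequence_length values: cut extras, fill gaps with pad_value."""
    ids = list(ids)[:sequence_length]
    return ids + [pad_value] * (sequence_length - len(ids))


short_review = pad_or_truncate([5, 6, 7], sequence_length=5)
long_review = pad_or_truncate(range(10), sequence_length=4)
print(short_review)
print(long_review)
```

Fixing the length this way lets every review in a batch share one tensor shape.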

Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.

Note: It's important to only use your training data when calling adapt (using the test set would leak information).
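
What adapting on the training data means can be sketched with a frequency count. This is a simplified plain-Python illustration, not the layer's implementation: `build_vocabulary` and `encode` are hypothetical helpers, and the sketch assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so only `max_tokens - 2` real words receive their own index. Words seen only in the test text fall back to the out-of-vocabulary index, so nothing about the test set shapes the vocabulary.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Index the most frequent words, starting at 2 (0 = padding, 1 = OOV)."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def encode(text, vocab, oov_index=1):
    """Convert a whitespace-separated string to integer indices."""
    return [vocab.get(tok, oov_index) for tok in text.split()]


# Made-up training texts, for illustration only.
train_texts = ["good movie", "good plot", "good bad movie"]
vocab = build_vocabulary(train_texts, max_tokens=4)

encoded_test = encode("good acting", vocab)
print(vocab)
print(encoded_test)
```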

Let's create a function to see the result of using this layer to preprocess some data.

In [10]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [11]:
def create_vectorization_layer(raw_train_ds, raw_val_ds, raw_test_ds):
    # Make a text-only dataset (without labels), then call adapt
    train_text = raw_train_ds.map(lambda x, y: x)
    vectorize_layer.adapt(train_text)
    
    train_ds = raw_train_ds.map(vectorize_text)
    val_ds = raw_val_ds.map(vectorize_text)
    test_ds = raw_test_ds.map(vectorize_text)
    
    AUTOTUNE = tf.data.AUTOTUNE

    train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
    val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
    test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)    
    
    return train_ds, val_ds, test_ds

**Configure the dataset for performance**

The function above uses two important methods, .cache() and .prefetch(), to make sure that I/O does not become blocking when loading data.

.cache() keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

.prefetch() overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the data performance guide.
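
The ideas behind caching and prefetching can be sketched with the standard library. This is not tf.data's implementation; `CachedSource`, `prefetch`, and `read_batches` are hypothetical names. The cache reads the source once and replays it on later passes. The prefetcher fills a small buffer from a background thread so the consumer can work on one item while the next is being produced.

```python
import queue
import threading


class CachedSource:
    """Reads from load_fn once, then replays the stored items."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._items = None
        self.load_count = 0

    def __iter__(self):
        if self._items is None:
            self.load_count += 1
            self._items = list(self._load_fn())
        return iter(self._items)


def prefetch(iterable, buffer_size=2):
    """Yield items that a background thread places into a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def read_batches():
    """Stand-in for reading batches from disk."""
    for i in range(5):
        yield f"batch-{i}"


source = CachedSource(read_batches)
first_epoch = list(prefetch(source))
second_epoch = list(prefetch(source))
print(first_epoch)
print(source.load_count)
```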

In [12]:
max_features = 10000
sequence_length = 250
# Create a TextVectorization layer
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

train_ds, val_ds, test_ds= create_vectorization_layer(raw_train_ds, raw_val_ds, raw_test_ds)


## Create the model

It's time to create your neural network:

In [13]:
def create_model(max_features, embedding_dim, drop_out):
    
    model = tf.keras.Sequential([
        layers.Embedding(max_features + 1, embedding_dim),
        layers.Dropout(drop_out), # 0.2
        layers.GlobalAveragePooling1D(),
        layers.Dropout(drop_out),
        layers.Dense(1)])
    
    return model
    

In [14]:
embedding_dim=16
drop_out=0.2

model = create_model(max_features, embedding_dim, drop_out)
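
The size of this model can be checked by hand. The sketch below uses two hypothetical helpers in plain Python. The first counts the trainable parameters: the embedding table has one row per index, and the final dense layer has one weight per embedding dimension plus a bias. The second shows what average pooling over the sequence axis does to a list of embedding vectors. The total agrees with the 160,033 parameters reported by the model summary later in this notebook.

```python
def count_parameters(max_features, embedding_dim):
    """Embedding table plus a single-unit dense layer (weights and bias)."""
    embedding = (max_features + 1) * embedding_dim
    dense = embedding_dim * 1 + 1
    return embedding + dense


def global_average_pool(sequence):
    """Average a list of equal-length vectors position by position."""
    length = len(sequence)
    return [sum(column) / length for column in zip(*sequence)]


total = count_parameters(max_features=10000, embedding_dim=16)
pooled = global_average_pool([[1.0, 2.0], [3.0, 4.0]])
print(total)
print(pooled)
```

Pooling turns a variable number of word vectors into one fixed-length vector per review, which the dense layer can then score.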

### Loss function and optimizer
A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a single logit (the final layer is a single-unit Dense layer with no activation), you'll use the losses.BinaryCrossentropy loss function with from_logits=True.
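
The compile call in this tutorial passes from_logits=True, so the loss receives raw scores rather than probabilities. The following plain-Python sketch shows one numerically stable way to compute binary cross-entropy from a logit and checks that it agrees with applying a sigmoid first. It illustrates the formula only and is not the Keras code path.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_logit(logit, label):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(p, label):
    """Textbook form: -(y * log(p) + (1 - y) * log(1 - p))."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


loss_logit = bce_from_logit(2.0, 1)
loss_prob = bce_from_probability(sigmoid(2.0), 1)
print(loss_logit, loss_prob)
```

Working from the logit avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.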

Now, configure the model to use an optimizer and a loss function:

In [15]:
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.BinaryAccuracy(threshold=0.0))
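
The threshold of 0.0 reflects that the model emits logits: a logit above 0 corresponds to a probability above 0.5. This can be sketched in plain Python with a hypothetical `accuracy` helper and made-up scores, showing that thresholding logits at 0.0 and thresholding sigmoid outputs at 0.5 give the same accuracy.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def accuracy(scores, labels, threshold):
    """Fraction of examples whose thresholded score equals the label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Made-up logits and labels, for illustration only.
logits = [-2.3, 0.4, 1.7, -0.1]
labels = [0, 1, 1, 1]

acc_logits = accuracy(logits, labels, threshold=0.0)
acc_probs = accuracy([sigmoid(x) for x in logits], labels, threshold=0.5)
print(acc_logits, acc_probs)
```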

## Train the model
You will train the model by passing the dataset object to the fit method.

In [16]:
epochs = 3
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

Epoch 1/3
Epoch 2/3
Epoch 3/3


## Evaluate the model

Let's see how the model performs. Two values will be returned: the loss (a number that represents the error; lower values are better) and the accuracy.

In [17]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.42934006452560425
Accuracy:  0.8402799963951111


### Save the model to disk

In [18]:
#Save the model
model.save('model')

INFO:tensorflow:Assets written to: model/assets



### Save the TextVectorization Layer

In [19]:
# Create a temporary model that wraps the TextVectorization layer so it can be saved
vectorize_layer_model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,), dtype=tf.string),
        vectorize_layer
])
    

#vectorize_layer_model = tf.keras.models.Sequential()
#vectorize_layer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
#vectorize_layer_model.add(vectorize_layer)
# Show the model
vectorize_layer_model.summary()

filepath = "vectorize_layer_model"
vectorize_layer_model.save(filepath)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 250)               0         
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________




INFO:tensorflow:Assets written to: vectorize_layer_model/assets



## Create an inference model

In the code above, you applied the TextVectorization layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model. To do so, you can create a new model using the weights you just trained.

In [20]:
new_model= tf.keras.models.load_model('model')

Load the TextVectorization layer from the wrapper ("dummy") model saved previously:

In [21]:
filepath = "vectorize_layer_model"
loaded_vectorize_layer_model = tf.keras.models.load_model(filepath)
loaded_vectorize_layer = loaded_vectorize_layer_model.layers[0]





In [22]:
inference_model = tf.keras.Sequential([
  loaded_vectorize_layer,
  new_model,
  layers.Activation('sigmoid')
])

inference_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)


In [22]:
inference_model = tf.keras.Sequential([
  vectorize_layer,
  model,
  layers.Activation('sigmoid')
])

inference_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)


In [24]:
inference_model.save('inference_model')

INFO:tensorflow:Assets written to: inference_model/assets



### Load the saved model and test it

In [23]:
# Check its architecture
inference_model.summary()

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = inference_model.evaluate(raw_test_ds)
print(accuracy)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
sequential (Sequential)      (None, 1)                 160033    
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
_________________________________________________________________
0.8529599905014038


In [28]:
inference_model= tf.keras.models.load_model('inference_model')
# Check its architecture
inference_model.summary()

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = inference_model.evaluate(raw_test_ds)
print(accuracy)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
sequential (Sequential)      (None, 1)                 160033    
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
_________________________________________________________________
0.8730800151824951
