
# Text classification from scratch

**Authors:** Mark Omernick, Francois Chollet<br>
**Date created:** 2019/11/06<br>
**Last modified:** 2020/05/17<br>
**Description:** Text sentiment classification starting from raw text files.

## Introduction

This example shows how to do text classification starting from raw text (as
a set of text files on disk). We demonstrate the workflow on the IMDB sentiment
classification dataset (unprocessed version). We use the `TextVectorization` layer for
 word splitting & indexing.

## Setup

In [1]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import tensorflow as tf
import numpy as np
from keras import layers

## Load the data: IMDB movie review sentiment classification

Let's download the data and inspect its structure.

In [2]:
!kaggle datasets download -d anuragdhere/text-classification-from-scratch
!unzip -qq text-classification-from-scratch.zip

Dataset URL: https://www.kaggle.com/datasets/anuragdhere/text-classification-from-scratch
License(s): unknown
Downloading text-classification-from-scratch.zip to /content


The extracted `G27_Text_Classification` folder contains a `train` and a `test` subfolder:

In [6]:
!ls G27_Text_Classification

imdbEr.txt  imdb.vocab	README	test  train


In [7]:
!ls G27_Text_Classification/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [8]:
!ls G27_Text_Classification/train

labeledBow.feat  neg  pos  unsup  unsupBow.feat  urls_neg.txt  urls_pos.txt  urls_unsup.txt


The `G27_Text_Classification/train/pos` and `G27_Text_Classification/train/neg` folders contain text files, each of
which represents one sample (either positive or negative):

In [9]:
!cat G27_Text_Classification/train/pos/Ses02F_impro02_F002.txt

Frame# Time CH1 CH2 CH3 FH1 FH2 FH3 LC1 LC2 LC3 LC4 LC5 LC6 LC7 LC8 RC1 RC2 RC3 RC4 RC5 RC6 RC7 RC8 LLID RLID MH MNOSE LNSTRL TNOSE RNSTRL LBM0 LBM1 LBM2 LBM3 RBM0 RBM1 RBM2 RBM3 LBRO1 LBRO2 LBRO3 LBRO4 RBRO1 RBRO2 RBRO3 RBRO4 Mou1 Mou2 Mou3 Mou4 Mou5 Mou6 Mou7 Mou8 LHD RHD
X01 Y01 Z01 X02 Y02 Z02 X03 Y03 Z03 X04 Y04 Z04 X05 Y05 Z05 X06 Y06 Z06 X07 Y07 Z07 X08 Y08 Z08 X09 Y09 Z09 X10 Y10 Z10 X11 Y11 Z11 X12 Y12 Z12 X13 Y13 Z13 X14 Y14 Z14 X15 Y15 Z15 X16 Y16 Z16 X17 Y17 Z17 X18 Y18 Z18 X19 Y19 Z19 X20 Y20 Z20 X21 Y21 Z21 X22 Y22 Z22 X23 Y23 Z23 X24 Y24 Z24 X25 Y25 Z25 X26 Y26 Z26 X27 Y27 Z27 X28 Y28 Z28 X29 Y29 Z29 X30 Y30 Z30 X31 Y31 Z31 X32 Y32 Z32 X33 Y33 Z33 X34 Y34 Z34 X35 Y35 Z35 X36 Y36 Z36 X37 Y37 Z37 X38 Y38 Z38 X39 Y39 Z39 X40 Y40 Z40 X41 Y41 Z41 X42 Y42 Z42 X43 Y43 Z43 X44 Y44 Z44 X45 Y45 Z45 X46 Y46 Z46 X47 Y47 Z47 X48 Y48 Z48 X49 Y49 Z49 X50 Y50 Z50 X51 Y51 Z51 X52 Y52 Z52 X53 Y53 Z53 X60 Y60 Z60 X61 Y61 Z61
3676 0.00453 -26.70529 32.76058 -51.07345 -2.40233 19.62558 -54

We are only interested in the `pos` and `neg` subfolders, so let's delete the `unsup` subfolder, which also has text files in it:

In [10]:
!rm -r G27_Text_Classification/train/unsup

You can use the utility `keras.utils.text_dataset_from_directory` to
generate a labeled `tf.data.Dataset` object from a set of text files on disk filed
 into class-specific folders.
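
To make the expected directory layout concrete, here is a minimal pure-Python sketch of what "class-specific folders" means. It is not how Keras implements the utility; `load_labeled_texts` and the demo files are hypothetical names used only for illustration. It assumes one subfolder per class and assigns integer labels by sorted subfolder name.

```python
import os
import tempfile


def load_labeled_texts(root):
    """Return (texts, labels, class_names) from a root/<class_name>/*.txt layout."""
    # Class names are the subfolder names, sorted so label indices are stable.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    texts, labels = [], []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root, class_name)
        for fname in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, fname), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels, class_names


# Build a tiny throwaway directory tree (hypothetical files) to show the layout.
with tempfile.TemporaryDirectory() as root:
    for class_name, text in [("neg", "dull and slow"), ("pos", "a real delight")]:
        os.makedirs(os.path.join(root, class_name))
        with open(os.path.join(root, class_name, "0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    demo_texts, demo_labels, demo_classes = load_labeled_texts(root)

print(demo_classes, demo_labels)  # ['neg', 'pos'] [0, 1]
```

The real utility additionally handles batching, shuffling, and splitting, and returns a `tf.data.Dataset` rather than Python lists.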

Let's use it to generate the training, validation, and test datasets. The validation
and training datasets are generated from two subsets of the `train` directory, with 20%
of samples going to the validation dataset and 80% going to the training dataset.

Having a validation dataset in addition to the test dataset is useful for tuning
hyperparameters, such as the model architecture, for which the test dataset should not
be used.

Before putting the model out into the real world, however, it should be retrained using all
available training data (without creating a validation dataset) so that its performance is maximized.

When using the `validation_split` & `subset` arguments, make sure to either specify the same
random seed in both calls, or to pass `shuffle=False`, so that the validation & training splits you
get have no overlap.
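
The reason the seed matters can be seen in a simplified sketch of a seeded split. This is an illustration under stated assumptions, not Keras's actual code: `split_paths` and the file names are hypothetical, and the split is assumed to shuffle deterministically and then cut off the last 20% for validation.

```python
import random


def split_paths(paths, validation_split, subset, seed):
    """Simplified seeded split: shuffle deterministically, then cut off the tail."""
    shuffled = sorted(paths)               # start from a fixed order
    random.Random(seed).shuffle(shuffled)  # same seed -> same permutation
    cut = len(shuffled) - int(validation_split * len(shuffled))
    return shuffled[:cut] if subset == "training" else shuffled[cut:]


# 339 hypothetical file names (the `train` folder in this notebook holds 339 files).
paths = [f"file_{i:03d}.txt" for i in range(339)]
train_paths = split_paths(paths, 0.2, "training", seed=1337)
val_paths = split_paths(paths, 0.2, "validation", seed=1337)

print(len(train_paths), len(val_paths))        # 272 67
print(set(train_paths).isdisjoint(val_paths))  # True
```

Because both calls shuffle with the same seed, they cut the same permutation at the same point. With different seeds, each call would cut a different permutation, and some files could land in both subsets.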

In [13]:
batch_size = 32
raw_train_ds = keras.utils.text_dataset_from_directory(
    "G27_Text_Classification/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = keras.utils.text_dataset_from_directory(
    "G27_Text_Classification/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = keras.utils.text_dataset_from_directory(
    "G27_Text_Classification/test", batch_size=batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 339 files belonging to 2 classes.
Using 272 files for training.
Found 339 files belonging to 2 classes.
Using 67 files for validation.
Found 339 files belonging to 2 classes.
Number of batches in raw_train_ds: 9
Number of batches in raw_val_ds: 3
Number of batches in raw_test_ds: 11
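
The batch counts printed above follow from the file counts and the batch size: the last batch of each dataset may be partially filled, so the count is the file count divided by the batch size, rounded up. A quick check of that arithmetic:

```python
import math

batch_size = 32
# File counts as reported in the loader output above.
num_files = {"train": 272, "val": 67, "test": 339}

num_batches = {name: math.ceil(n / batch_size) for name, n in num_files.items()}
print(num_batches)  # {'train': 9, 'val': 3, 'test': 11}
```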


Let's preview a few samples:

In [14]:
# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
# This is one of the places where eager execution shines:
# we can just evaluate these tensors using .numpy()
# instead of needing to evaluate them in a Session/Graph context.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

b'Frame# Time CH1 CH2 CH3 FH1 FH2 FH3 LC1 LC2 LC3 LC4 LC5 LC6 LC7 LC8 RC1 RC2 RC3 RC4 RC5 RC6 RC7 RC8 LLID RLID MH MNOSE LNSTRL TNOSE RNSTRL LBM0 LBM1 LBM2 LBM3 RBM0 RBM1 RBM2 RBM3 LBRO1 LBRO2 LBRO3 LBRO4 RBRO1 RBRO2 RBRO3 RBRO4 Mou1 Mou2 Mou3 Mou4 Mou5 Mou6 Mou7 Mou8 LHD RHD\r\nX01 Y01 Z01 X02 Y02 Z02 X03 Y03 Z03 X04 Y04 Z04 X05 Y05 Z05 X06 Y06 Z06 X07 Y07 Z07 X08 Y08 Z08 X09 Y09 Z09 X10 Y10 Z10 X11 Y11 Z11 X12 Y12 Z12 X13 Y13 Z13 X14 Y14 Z14 X15 Y15 Z15 X16 Y16 Z16 X17 Y17 Z17 X18 Y18 Z18 X19 Y19 Z19 X20 Y20 Z20 X21 Y21 Z21 X22 Y22 Z22 X23 Y23 Z23 X24 Y24 Z24 X25 Y25 Z25 X26 Y26 Z26 X27 Y27 Z27 X28 Y28 Z28 X29 Y29 Z29 X30 Y30 Z30 X31 Y31 Z31 X32 Y32 Z32 X33 Y33 Z33 X34 Y34 Z34 X35 Y35 Z35 X36 Y36 Z36 X37 Y37 Z37 X38 Y38 Z38 X39 Y39 Z39 X40 Y40 Z40 X41 Y41 Z41 X42 Y42 Z42 X43 Y43 Z43 X44 Y44 Z44 X45 Y45 Z45 X46 Y46 Z46 X47 Y47 Z47 X48 Y48 Z48 X49 Y49 Z49 X50 Y50 Z50 X51 Y51 Z51 X52 Y52 Z52 X53 Y53 Z53 X60 Y60 Z60 X61 Y61 Z61\r\n7399 0.00370 -26.71876 33.74245 -65.91756 -5.11977 24.112

## Prepare the data

Before vectorizing the text, we standardize it: we lowercase it, strip punctuation, and, in particular, remove `<br />` tags.
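
As a rough illustration of this cleanup on a single string, here is a pure-Python analogue of the `custom_standardization` function defined in the next cell. The name `standardize_py` and the sample sentence are made up for this sketch. The TensorFlow version operates on string tensors, but the three steps are the same.

```python
import re
import string


def standardize_py(text):
    """Lowercase, replace '<br />' with a space, then drop ASCII punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)


print(standardize_py("Great movie!<br />Loved it."))  # great movie loved it
```

Note that the `<br />` replacement has to run before punctuation stripping, since `<`, `/`, and `>` are themselves punctuation characters.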

In [15]:
import string
import re


# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)
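
To give an intuition for what `adapt` and the `"int"` output mode do, the following pure-Python sketch builds a frequency-ranked vocabulary and maps text to fixed-length integer sequences. It is a simplification, not the layer's implementation: `build_vocab`, `vectorize`, and the toy texts are hypothetical. It does follow the layer's convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(token for text in texts for token in text.split())
    # Two slots are reserved, so at most max_tokens - 2 real tokens are kept.
    kept = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(kept)}


def vectorize(text, vocab, sequence_length):
    """Map tokens to integer ids, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["good good film", "good film bad"]
toy_vocab = build_vocab(toy_texts, max_tokens=4)
print(toy_vocab)                                                   # {'good': 2, 'film': 3}
print(vectorize("good film unknown", toy_vocab, sequence_length=5))  # [2, 3, 1, 0, 0]
```

Here `max_tokens=4` leaves room for only two real tokens, so the rarest word (`bad`) falls out of the vocabulary and would map to index 1.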

## Two options to vectorize the data

There are two ways we can use our text vectorization layer:

**Option 1: Make it part of the model**, so as to obtain a model that processes raw
 strings, like this:

```python
text_input = keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...
```

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices, then
 feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on GPU.
So if you're training the model on GPU, you probably want to go with this option to get
 the best performance. This is what we will do below.
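
The idea behind asynchronous buffering can be sketched with a background thread and a bounded queue: while the consumer (the training step) works on one batch, the producer prepares the next ones. This is only a conceptual illustration, not how `tf.data` is implemented; `prefetch`, `preprocess`, and the toy batches are hypothetical.

```python
import queue
import threading


def prefetch(iterable, buffer_size):
    """Yield items from `iterable` while a background thread fills a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the producer has something ready
        if item is done:
            return
        yield item


def preprocess(batch):
    # Stand-in for CPU-side work such as text vectorization.
    return [len(text.split()) for text in batch]


raw_batches = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]
for features in prefetch(map(preprocess, raw_batches), buffer_size=2):
    print(features)
```

Since `map` is lazy, the `preprocess` calls run inside the producer thread, overlapping with whatever the consuming loop does with each batch.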

If we were to export our model to production, we'd ship a model that accepts raw
strings as input, like in the code snippet for option 1 above. This can be done after
 training. We do this in the last section.


In [16]:

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

## Build a model

We choose a simple 1D convnet starting with an `Embedding` layer.

In [17]:
# An integer input for vocab indices.
inputs = keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and the Adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
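
To see how the two strided convolutions shrink the sequence, the sketch below applies the standard output-length formula for `padding="valid"`, assuming inputs padded to `sequence_length = 500` as configured earlier. The helper name `conv1d_output_length` is hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, strides):
    """Output length of a 1D convolution with padding='valid'."""
    return (input_length - kernel_size) // strides + 1


length = 500  # sequence_length used by the vectorization layer
for layer_index in (1, 2):
    length = conv1d_output_length(length, kernel_size=7, strides=3)
    print(f"After Conv1D #{layer_index}: {length} timesteps x 128 filters")
# After Conv1D #1: 165 timesteps x 128 filters
# After Conv1D #2: 53 timesteps x 128 filters
```

`GlobalMaxPooling1D` then takes the maximum over the remaining timesteps, leaving a single 128-dimensional vector per sample for the dense layers.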

## Train the model

In [18]:
epochs = 3

# Fit the model using the train and validation datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

Epoch 1/3
9/9 ━━━━━━━━━━━━━━━━━━━━ 24s 1s/step - accuracy: 0.5102 - loss: 0.6934 - val_accuracy: 0.5075 - val_loss: 0.6964
Epoch 2/3
9/9 ━━━━━━━━━━━━━━━━━━━━ 18s 9ms/step - accuracy: 0.5257 - loss: 0.6988 - val_accuracy: 0.5075 - val_loss: 0.6971
Epoch 3/3
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.5230 - loss: 0.6919 - val_accuracy: 0.5075 - val_loss: 0.6947


<keras.src.callbacks.history.History at 0x795d737e2da0>

## Evaluate the model on the test set

In [19]:
model.evaluate(test_ds)

11/11 ━━━━━━━━━━━━━━━━━━━━ 13s 1s/step - accuracy: 0.5492 - loss: 0.6886


[0.6908358335494995, 0.533923327922821]
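
`model.evaluate` returns the loss followed by the compiled metrics, so the two numbers above are the test loss and the test accuracy. A useful reference point when reading them: a binary classifier that always predicts a probability of 0.5 has a binary crossentropy of ln 2 ≈ 0.693, whatever the labels are. The sketch below checks this with a hand-written loss (`binary_crossentropy` here is a plain-Python helper, not the Keras function).

```python
import math


def binary_crossentropy(y_true, y_pred):
    """Mean binary crossentropy over labels and predicted probabilities."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)


# A classifier that always outputs 0.5 scores ln(2) regardless of the labels.
chance_loss = binary_crossentropy([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(chance_loss, 4))  # 0.6931
```

A test loss of about 0.69 together with an accuracy close to 0.5 therefore suggests that, in this run, the model performs near chance level.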

## Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can simply
create a new model (using the weights we just trained):

In [20]:
# A string input
inputs = keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end-to-end model
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)

11/11 ━━━━━━━━━━━━━━━━━━━━ 13s 1s/step - accuracy: 0.5297 - loss: 0.0000e+00


[0.0, 0.0, 0.533923327922821, 0.533923327922821]