<a href="https://colab.research.google.com/github/abdalrahmenyousifMohamed/Master-TensorFlow/blob/main/13_1_Spam_Classification_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Classification with BERT

Transformer models have revolutionized deep learning. Transformer-based models like BERT are heavily used in NLP because of the rich numerical representations of text they provide. Here we will discuss how to download a pretrained BERT model and adapt it to solve a spam classification problem.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch13-Transormers-with-TF2-and-Huggingface/13.1_Spam_Classification_with_BERT.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>


## Import libraries

In [3]:
import random
import os
import pandas as pd
import tensorflow as tf
import numpy as np
import tensorflow_models as tfm

import time

print("TensorFlow: {} installed".format(tf.__version__))

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except (RuntimeError, ValueError) as e:
        # Memory growth must be set before the GPUs have been initialized
        print("Couldn't set memory_growth: {}".format(e))


def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")

# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

TensorFlow: 2.15.0 installed


## Download and read the data

For this, we will be using the spam classification dataset available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip). It is a tab-separated file with two columns: the first column is a single word (ham/spam) and the second column contains the SMS message.
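As a quick illustration of this layout, here is a minimal sketch that parses two short lines in the same `label<TAB>message` form. The sample lines and the `label_to_id` mapping are illustrative assumptions, not part of the dataset loader used in this notebook.

```python
# Two short lines in the same "label<TAB>message" layout as the dataset file
# (illustrative samples only)
sample_lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n",
]

# Hypothetical mapping from the text label to an integer class
label_to_id = {"ham": 0, "spam": 1}

parsed = []
for line in sample_lines:
    # Split only on the first tab so that tabs inside a message are preserved
    label, message = line.rstrip("\n").split("\t", 1)
    parsed.append((label_to_id[label], message))

print(parsed)
```

Splitting on the first tab only is a defensive choice: it keeps the message intact even if the message itself contains a tab.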

In [4]:
# Section 13.2
# Downloading the data

import os
import requests
import zipfile

import shutil

if not os.path.exists('data'):
    os.mkdir('data')

# Retrieve the data
if not os.path.exists(os.path.join('data', 'smsspamcollection.zip')):
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
    # Get the file from web
    r = requests.get(url)
    # Fail early on an HTTP error instead of writing an error page to disk
    r.raise_for_status()

    # Write to a file
    with open(os.path.join('data', 'smsspamcollection.zip'), 'wb') as f:
        f.write(r.content)

else:
    print("The zip file already exists.")

if not os.path.exists(os.path.join('data', 'SMSSpamCollection')):
    with zipfile.ZipFile(os.path.join('data', 'smsspamcollection.zip'), 'r') as zip_ref:
        zip_ref.extractall('data')
else:
    print("The extracted data already exists")




In [5]:
# Section 13.2

# Code listing 13.1
import numpy as np

# Inputs and labels will be stored in this
inputs = []
labels = []
# Total number of instances for spam and ham
n_ham, n_spam = 0,0
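with open(os.path.join('data', 'SMSSpamCollection'), 'r', encoding='utf-8') as f:
    for r in f:
        # Ham input ("ham" plus the tab is 4 characters)
        if r.startswith('ham'):
            label = 0
            txt = r[4:]
            n_ham += 1
        # Spam input ("spam" plus the tab is 5 characters)
        elif r.startswith('spam'):
            label = 1
            txt = r[5:]
            n_spam += 1
        else:
            # Skip lines with an unknown label instead of reusing stale values
            continue
        inputs.append(txt)
        labels.append(label)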

print("Found {} ham and {} spam".format(n_ham, n_spam))
print(inputs[:5])

# Convert them to arrays
inputs = np.array(inputs).reshape(-1,1)
labels = np.array(labels)

Found 4827 ham and 747 spam
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n', 'Ok lar... Joking wif u oni...\n', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", 'U dun say so early hor... U c already then say...\n', "Nah I don't think he goes to usf, he lives around here though\n"]


## Splitting train/valid/test

Here we will split the data into three sets using the `imbalanced-learn` library. Specifically, we:

* Create a balanced test set with 100 spam and 100 ham (random)
* Create a balanced validation set with 100 spam and 100 ham (random)
* Create a balanced train set from the leftover instances (undersampled using the NearMiss algorithm)
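The random part of this procedure can be sketched in plain Python. This is only an illustration of the idea (draw a fixed number of indices per class, then keep the rest for later steps); `sample_balanced_indices` is a hypothetical helper and not the `imbalanced-learn` implementation.

```python
import random

def sample_balanced_indices(labels, n_per_class, seed=0):
    """Pick n_per_class random indices for every class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    picked = []
    for label in sorted(by_class):
        # Sample without replacement inside each class
        picked.extend(rng.sample(by_class[label], n_per_class))
    return sorted(picked)

# Toy labels: 20 ham (0) and 6 spam (1)
toy_labels = [0] * 20 + [1] * 6

test_idx = sample_balanced_indices(toy_labels, n_per_class=2, seed=4321)
test_set = set(test_idx)
# Everything that was not drawn stays available for validation and training
rest_idx = [i for i in range(len(toy_labels)) if i not in test_set]

print("test labels:", [toy_labels[i] for i in test_idx])
print("rest size:", len(rest_idx))
```

Converting the drawn indices to a `set` keeps the membership check cheap when the dataset has thousands of rows.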

In [6]:
# Section 13.2

from imblearn.under_sampling import NearMiss, RandomUnderSampler


n=100 # Number of instances for each class in the test/validation sets
rus = RandomUnderSampler(sampling_strategy={0:n, 1:n}, random_state=random_seed)
rus.fit_resample(inputs, labels)

# Get test indices
test_inds = rus.sample_indices_
test_x, test_y = inputs[test_inds], np.array(labels)[test_inds]
print("Test statistics")
print(pd.Series(test_y).value_counts())

# Get rest (train + valid)
rest_inds = [i for i in range(inputs.shape[0]) if i not in test_inds]
rest_x, rest_y = inputs[rest_inds], labels[rest_inds]

# Get valid indices
rus.fit_resample(rest_x, rest_y)
valid_inds = rus.sample_indices_
valid_x, valid_y = rest_x[valid_inds], rest_y[valid_inds]
print("Valid statistics")
print(pd.Series(valid_y).value_counts())

# Rest goes in training
train_inds = [i for i in range(rest_x.shape[0]) if i not in valid_inds]
train_x, train_y = rest_x[train_inds], rest_y[train_inds]
print("Train statistics")
print(pd.Series(train_y).value_counts())

Test statistics
0    100
1    100
dtype: int64
Valid statistics
0    100
1    100
dtype: int64
Train statistics
0    4627
1     547
dtype: int64


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# To use near miss algorithm, we need a numerical representation of the messages
# We will use the bag of words representation
countvec = CountVectorizer()
train_bow = countvec.fit_transform(train_x.reshape(-1).tolist())

# NearMiss is a common undersampling technique
oss = NearMiss()
X_res, y_res = oss.fit_resample(train_bow, train_y)
train_inds = oss.sample_indices_

train_x, train_y = train_x[train_inds], train_y[train_inds]

print(pd.Series(train_y).value_counts())

0    547
1    547
dtype: int64
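
To give an intuition for the undersampling step above, here is a minimal sketch of the first variant of NearMiss: keep the majority-class points whose average distance to their `k` nearest minority-class points is smallest. It is a simplified illustration on dense 2-D toy points; `near_miss_1` is a hypothetical helper, and the library version works on the sparse bag-of-words matrix and may differ in details such as tie handling.

```python
import math

def near_miss_1(majority, minority, n_keep, k=3):
    """Return indices of the n_keep majority points closest (on average)
    to their k nearest minority points (hypothetical helper)."""
    scored = []
    for idx, point in enumerate(majority):
        # Distances from this majority point to every minority point
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        scored.append((sum(nearest) / len(nearest), idx))
    scored.sort()
    return sorted(idx for _, idx in scored[:n_keep])

# Toy data: four majority points, three minority points
majority = [(0, 0), (5, 5), (1, 1), (9, 9)]
minority = [(1, 0), (0, 1), (2, 2)]

kept = near_miss_1(majority, minority, n_keep=2, k=3)
print(kept)  # indices of the majority points that are kept
```

The effect is that the retained ham messages are the ones that look most like spam in bag-of-words space, which gives the classifier harder negative examples than purely random undersampling would.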


## Analysing the vocabulary of BERT

In [9]:
# This file is obtained from the artifacts found in https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4

# Get the vocab file path from the BERT layer

# You can automatically obtain these via the following code. But for ease of understanding I have fixed them to constants
# bert_layer = hub.KerasLayer(hub_bert_url, trainable=True)
# bert_layer.resolved_object.vocab_file.asset_path.numpy()
# do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

vocab_file = os.path.join("data", "vocab.txt")
do_lower_case = True

# Define a tokenizer
tokenizer = tfm.nlp.layers.FastWordpieceBertTokenizer(vocab_file=vocab_file, lower_case=do_lower_case)

## Understanding tokenization in BERT

In [10]:
text = ["She sells seashells by the seashore"]
print("Original text: {}".format(text))
# The tokenizer returns token IDs; flatten them and convert to a NumPy array
token_ids = tf.reshape(tokenizer(text), [-1]).numpy()
print("Token IDs generated by BERT: {}".format(token_ids))
# Look up the word piece for each ID in the vocabulary
tokens = [tokenizer._vocab[tid] for tid in token_ids]
print("Tokens generated by BERT: {}".format(tokens))

Original text: ['She sells seashells by the seashore']
Token IDs generated by BERT: [ 2016 15187 11915 18223  2015  2011  1996 11915 16892]
Tokens generated by BERT: ['she', 'sells', 'seas', '##hell', '##s', 'by', 'the', 'seas', '##hore']
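
The `##` pieces in the output above come from WordPiece tokenization, which splits an unknown word into the longest pieces it can find in the vocabulary. The following is a simplified sketch of that greedy longest-match-first idea with a tiny, made-up vocabulary (`toy_vocab`); it is not the tokenizer used in this notebook and ignores details such as punctuation splitting and accent stripping.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word (simplified sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny made-up vocabulary, just large enough for the example sentence
toy_vocab = {"she", "sells", "seas", "##hell", "##s", "by", "the", "##hore"}

sentence = "She sells seashells by the seashore"
tokens = [p for w in sentence.lower().split() for p in wordpiece(w, toy_vocab)]
print(tokens)
# ['she', 'sells', 'seas', '##hell', '##s', 'by', 'the', 'seas', '##hore']
```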


## Special tokens used by BERT

In [12]:
special_tokens = ['[CLS]', '[SEP]', '[MASK]', '[PAD]']
ids = [tokenizer._vocab.index(tok) for tok in special_tokens]
for t, i in zip(special_tokens, ids):
    print("Token: {} has ID: {}".format(t, i))

Token: [CLS] has ID: 101
Token: [SEP] has ID: 102
Token: [MASK] has ID: 103
Token: [PAD] has ID: 0


## Analysing sequence length

In [13]:
# Section 13.2
# Code listing 13.2
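def encode_sentence(s):
    """ Encode a given sentence by tokenizing it and adding special tokens """

    # Note: the special tokens are added as plain text here, so the tokenizer may
    # split them into several word pieces; treat the lengths as approximate
    tokens = list(tf.reshape(tokenizer(["[CLS]" + s + "[SEP]"]), [-1]))
    return tokens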

seq_lengths = pd.Series([len(encode_sentence(str(s))) for s in train_x])
seq_lengths.describe(percentiles=[0.25, 0.5, 0.75, 0.9])

count    1094.000000
mean       35.684644
std        19.318968
min        12.000000
25%        18.000000
50%        25.000000
75%        55.000000
90%        62.000000
max        75.000000
dtype: float64

## Generating the correct input format for BERT

The BERT model needs three inputs. These are passed as a dictionary with the following keys.

* `input_word_ids` - The token IDs generated from the text, padded with zeros up to `max_seq_length`
* `input_mask` - A matrix with the same shape as `input_word_ids` that marks whether each element is a real token (1) or padding (0)
* `input_type_ids` - A matrix with the same shape as `input_word_ids` that indicates which sentence/sequence each token belongs to (0s and 1s)

The `tf-models` library provides a convenient layer called `BertPackInputs` that packs a given sentence (or a list of sentences) into this format. We'll use it to create our input.
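
To make the three inputs concrete, here is a minimal plain-Python sketch that builds them by hand for a single sentence. `pack_single_sentence` is a hypothetical helper written for illustration; it assumes the special token IDs printed earlier (`[CLS]`=101, `[SEP]`=102, `[PAD]`=0) and a simple cut-off truncation, which may differ from what `BertPackInputs` does internally.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token IDs of the BERT vocabulary used here

def pack_single_sentence(token_ids, seq_length):
    """Build the three BERT inputs for one sentence (hypothetical helper)."""
    # Leave room for [CLS] and [SEP], truncating the sentence if needed
    body = token_ids[: seq_length - 2]
    word_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(word_ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": word_ids + [PAD_ID] * n_pad,  # tokens, then padding
        "input_mask": [1] * n_real + [0] * n_pad,       # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,             # single sentence: all zeros
    }

# Token IDs of "She sells seashells by the seashore"
sentence_ids = [2016, 15187, 11915, 18223, 2015, 2011, 1996, 11915, 16892]
packed_example = pack_single_sentence(sentence_ids, seq_length=12)
for key, value in packed_example.items():
    print(key, value)
```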

In [15]:
max_seq_length=60
packer = tfm.nlp.layers.BertPackInputs(
    seq_length=max_seq_length,
    special_tokens_dict = tokenizer.get_special_tokens_dict()
)

text = ["She sells seashells by the seashore"]

tok1 = tokenizer(text)

packed = packer(tok1)

print(packed)

{'input_word_ids': <tf.Tensor: shape=(1, 60), dtype=int32, numpy=
array([[  101,  2016, 15187, 11915, 18223,  2015,  2011,  1996, 11915,
        16892,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0]], dtype=int32)>, 'input_mask': <tf.Tensor: shape=(1, 60), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'input_type_ids': <tf.Tensor: shape=(1, 60), dtype=int32, numpy=
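array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}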

In [16]:
# Section 13.2

# Code listing 13.3
def get_bert_inputs(tokenizer, docs, max_seq_len=60):
    """ Generate inputs for BERT using a set of documents """

    # Use the max_seq_len argument rather than the global max_seq_length
    packer = tfm.nlp.layers.BertPackInputs(
        seq_length=max_seq_len,
        special_tokens_dict = tokenizer.get_special_tokens_dict()
    )

    # Tokenize the documents and pack them into the BERT input format
    packed = packer(tokenizer(docs))

    # Convert the packed tensors to NumPy arrays
    packed_numpy = dict([(k, v.numpy()) for k,v in packed.items()])
    # Final output
    return packed_numpy

# Creating train/validation/test data
train_inputs = get_bert_inputs(tokenizer, train_x, max_seq_len=60)
print(f"Added {len(train_inputs['input_word_ids'])} training samples")

valid_inputs = get_bert_inputs(tokenizer, valid_x, max_seq_len=60)
print(f"Added {len(valid_inputs['input_word_ids'])} validation samples")

test_inputs = get_bert_inputs(tokenizer, test_x, max_seq_len=60)
print(f"Added {len(test_inputs['input_word_ids'])} test samples")

Added 1094 training samples
Added 200 validation samples
Added 200 test samples


In [17]:
# Shuffling training data as a precaution
train_inds = np.random.permutation(len(train_inputs["input_word_ids"]))
train_inputs = dict([(k, v[train_inds]) for k, v in train_inputs.items()])
print(train_inputs)
train_y = train_y[train_inds]

print(f"Shuffled train_y labels: {train_y[:20]}")

{'input_word_ids': array([[  101,  3984,  2054, ...,     0,     0,     0],
       [  101,  2489,  2005, ...,     0,     0,     0],
       [  101, 13661,   999, ...,     0,     0,     0],
       ...,
       [  101, 28194,  1057, ...,     0,     0,     0],
       [  101, 15659,  4895, ...,     0,     0,     0],
       [  101, 13661,   999, ...,     0,     0,     0]], dtype=int32), 'input_mask': array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32), 'input_type_ids': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)}
Shuffled train_y labels: [1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0]


## Downloading the BERT model

Here we download the BERT model from TensorFlow Hub and create a Keras layer from it.

## Defining the BERT encoder and inputs

The BERT model needs three inputs:

* Input word IDs - The token IDs generated from the text, padded with zeros up to `max_seq_length`
* Input mask - A matrix with the same shape as `input_word_ids` that marks whether each element is a real token (1) or padding (0)
* Segment IDs - A matrix with the same shape as `input_word_ids` that indicates which sentence/sequence each token belongs to (0s and 1s)
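
In this notebook every input is a single message, so the segment IDs are all zeros. For completeness, the sketch below shows how segment IDs would look for a sentence pair, following the usual BERT layout `[CLS] A [SEP] B [SEP]`. `pair_type_ids` is a hypothetical helper and the token IDs are only examples.

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs of the BERT vocabulary used here

def pair_type_ids(first_ids, second_ids):
    """Return word IDs and segment IDs for a sentence pair (hypothetical helper)."""
    word_ids = [CLS_ID] + first_ids + [SEP_ID] + second_ids + [SEP_ID]
    # [CLS], sentence A and its [SEP] belong to segment 0;
    # sentence B and the final [SEP] belong to segment 1
    type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    return word_ids, type_ids

pair_word_ids, pair_segments = pair_type_ids([2016, 15187], [2011, 1996])
print(pair_word_ids)   # [101, 2016, 15187, 102, 2011, 1996, 102]
print(pair_segments)   # [0, 0, 0, 0, 1, 1, 1]
```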

In [19]:
hub_bert_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
max_seq_length = 60  # Your choice here.
import tensorflow_hub as hub
# Input layers

# Contains input token ids
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
# Contains input mask values
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
# Contains input type (whether token belongs to sequence A or B) values
input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="input_type_ids")

# BERT encoder downloaded from TF hub
bert_layer = hub.KerasLayer(hub_bert_url, trainable=True)

# get the output of the encoder
output = bert_layer({"input_word_ids":input_word_ids, "input_mask": input_mask,
                                             "input_type_ids": input_type_ids})

# Define the final encoder with the Functional API
hub_encoder = tf.keras.models.Model(
    inputs={"input_word_ids": input_word_ids, "input_mask": input_mask, "input_type_ids": input_type_ids},
    outputs={"sequence_output": output["sequence_output"], "pooled_output": output["pooled_output"]}
)

# Check the outputs of the Bert layer
print(output["pooled_output"].shape)
print(output["sequence_output"].shape)

(None, 768)
(None, 60, 768)


## Creating a downstream classifier from BERT

In [20]:
# Section 13.2

# Generating a classifier and the encoder
bert_classifier = tfm.nlp.models.BertClassifier(network=hub_encoder, num_classes=2)

In [22]:
# """
# An alternative way to get a BERT encoder (randomly initialized, not pretrained).
# The result is stored under a separate name so that it does not overwrite
# the pretrained bert_classifier defined above.

import yaml

# https://github.com/tensorflow/models/blob/master/official/nlp/configs/models/bert_en_uncased_base.yaml
with open(os.path.join("data", "bert_en_uncased_base.yaml"), 'r') as stream:
    config_dict = yaml.safe_load(stream)['task']['model']['encoder']['bert']


# Print BERT config
print("BERT Config")
print(config_dict)

encoder_config = tfm.nlp.encoders.EncoderConfig({
    'type':'bert',
    'bert': config_dict
})

bert_encoder = tfm.nlp.encoders.build_encoder(encoder_config)

# Generating a classifier from the non-pretrained encoder
# (not used in the rest of the notebook)
scratch_bert_classifier = tfm.nlp.models.BertClassifier(
    network=bert_encoder, num_classes=2
)
# """

BERT Config
{'attention_dropout_rate': 0.1, 'dropout_rate': 0.1, 'hidden_activation': 'gelu', 'hidden_size': 768, 'initializer_range': 0.02, 'intermediate_size': 3072, 'max_position_embeddings': 512, 'num_attention_heads': 12, 'num_layers': 12, 'type_vocab_size': 2, 'vocab_size': 30522}


## Defining the optimizer

In [23]:
# Code listing 13.4
# Set up epochs and steps
epochs = 3
batch_size = 56
eval_batch_size = 56

train_data_size = train_x.shape[0]
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(num_train_steps * 0.1)

init_lr = 3e-6
end_lr = 0.0

# Define the decay
linear_decay = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=init_lr,
    end_learning_rate=end_lr,
    decay_steps=num_train_steps
)

# Define learning rate schedule
warmup_schedule = tfm.optimization.lr_schedule.LinearWarmup(
    warmup_learning_rate = 1e-10,
    after_warmup_lr_sched = linear_decay,
    warmup_steps = warmup_steps
)

# creates an optimizer with learning rate schedule
optimizer = tf.keras.optimizers.experimental.Adam(
    learning_rate = warmup_schedule
)
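
To see what learning rates this setup produces, here is a small plain-Python sketch of the schedule. It is a simplified model under two assumptions: the decay is a straight line from `init_lr` to `end_lr`, and the warmup ramps linearly from `warmup_learning_rate` to the value the decay schedule has at `warmup_steps`. The function names are hypothetical and the numbers mirror the settings above.

```python
def linear_decay_lr(step, init_lr, end_lr, decay_steps):
    """Polynomial decay with power 1: a straight line from init_lr to end_lr."""
    step = min(step, decay_steps)
    return (init_lr - end_lr) * (1 - step / decay_steps) + end_lr

def warmup_then_decay_lr(step, warmup_lr, warmup_steps, init_lr, end_lr, decay_steps):
    """Assumed behaviour: ramp linearly up to the decay schedule's value at warmup_steps."""
    if step < warmup_steps:
        target = linear_decay_lr(warmup_steps, init_lr, end_lr, decay_steps)
        return warmup_lr + (step / warmup_steps) * (target - warmup_lr)
    return linear_decay_lr(step, init_lr, end_lr, decay_steps)

# Same arithmetic as the optimizer setup in this notebook
train_data_size, batch_size, epochs = 1094, 56, 3
steps_per_epoch = train_data_size // batch_size   # 19
num_train_steps = steps_per_epoch * epochs        # 57
warmup_steps = int(num_train_steps * 0.1)         # 5

for step in [0, warmup_steps, num_train_steps // 2, num_train_steps]:
    lr = warmup_then_decay_lr(step, 1e-10, warmup_steps, 3e-6, 0.0, num_train_steps)
    print(step, lr)
```

With these settings the learning rate climbs for the first 5 steps and then decays to zero at step 57.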


## Finetuning BERT and the classifier

In [None]:
import time

metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model
bert_classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics)

t1 = time.time()

# Train the model
bert_classifier.fit(
      x=train_inputs,
      y=train_y,
      validation_data=(valid_inputs, valid_y),
      validation_batch_size=eval_batch_size,
      batch_size=batch_size,
      epochs=epochs
)

t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Epoch 1/3
Epoch 2/3


## Save the model

In [None]:
os.makedirs('models', exist_ok=True)
tf.keras.models.save_model(bert_classifier, os.path.join('models', 'bert_spam.h5'))

## Evaluating on the test data

In [None]:
# Section 13.2
bert_classifier.evaluate(test_inputs, test_y)