# Text Classification with BERT

BERT and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after, hence the name: Bidirectional Encoder Representations from Transformers.

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.

In [1]:
## Loading required packages
import os
import shutil
import pandas as pd
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer
import keras
from tensorflow.keras import layers

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

tf.get_logger().setLevel('ERROR')

In [2]:
# check keras and TF version used
print('TF Version:', tf.__version__)
print('Keras Version:', keras.__version__)
print('Number of available GPUs:', len(tf.config.list_physical_devices('GPU')))

TF Version: 2.7.0
Keras Version: 2.7.0
Number of available GPUs: 1


## Reading the data

In [3]:
# Read the data
df = pd.read_csv('../data/interim/covid_articles_preprocessed.csv')

## Merge Tags

tag_map = {'consumer':'general',
           'healthcare':'science',
           'automotive':'business',
           'environment':'science',
           'construction':'business',
           'ai':'tech'}

df['tags'] = [(lambda tags: tag_map[tags] if tags in tag_map.keys() else tags)(tags)
                          for tags in df['topic_area']]
df.tags.value_counts()

business    245652
general      86372
finance      22386
tech          8915
science       5595
Name: tags, dtype: int64

In [4]:
X = df.content.values
y = df.tags.values

enc = LabelEncoder()
y = enc.fit_transform(y)
enc_tags_mapping = dict(zip(enc.transform(enc.classes_), enc.classes_))

## Split the data in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.2, random_state=21)

## Encoding the raw text

## Build the Model

In [5]:
def build_classifier_model(tfhub_handle_preprocess, tfhub_handle_encoder):
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(5, activation='softmax', name='classifier')(net)
    return tf.keras.Model(text_input, net)

In this notebook I will use a version of small BERT. Small BERTs have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.

Text inputs need to be transformed to numeric token ids and arranged in several Tensors before being input to BERT. TensorFlow Hub provides a matching preprocessing model for each of the BERT models, which implements this transformation using TF ops from the `TF.text` library. It is not necessary to run pure Python code outside the TensorFlow model to preprocess text.

In [6]:
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
classifier_model = build_classifier_model(tfhub_handle_preprocess, tfhub_handle_encoder)

## Train the model
I just assembelled all pieces required in my BERT model including the preprocessing module, BERT encoder, data, and classifier. The next step is to train the model using the news dataset.

I will use the `tf.keras.losses.SparseCategoricalCrossentropy` for multi-class classification. For For fine-tuning I will use Adam, the same optimizer that BERT was originally trained with.

In [7]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(name='sparse_categorical_crossentropy')
metrics = tf.metrics.SparseCategoricalAccuracy('accuracy')

For the learning rate (init_lr), I will use the same schedule as BERT pre-training: linear decay of a notional initial learning rate, prefixed with a linear warm-up phase over the first 10% of training steps (num_warmup_steps). In line with the BERT paper, the initial learning rate is smaller for fine-tuning (best of 5e-5, 3e-5, 2e-5).

In [8]:
epochs = 10
steps_per_epoch = tf.data.experimental.cardinality(tf.data.Dataset.from_tensor_slices(X_train)).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

In [9]:
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [10]:
classifier_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 preprocessing (KerasLayer)     {'input_word_ids':   0           ['text[0][0]']                   
                                (None, 128),                                                      
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128)}                                                      

## Run the model

In [11]:
print(f'Training model with {tfhub_handle_encoder}')

checkpoint_path = "../models/DL/bert-train/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
my_callbacks = [
                keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=False),
                keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=1,
                                              restore_best_weights=False),
                ]

history = classifier_model.fit(X_train, y_train,
                               epochs=epochs,
                               verbose=True,
                               validation_data=(X_validation, y_validation),
                               callbacks=[my_callbacks])

Training model with https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10


In [12]:
y_pred_prob = classifier_model.predict(X_test)
y_pred = np.argmax(y_pred_prob, axis=1)
print(classification_report(y_test, y_pred, target_names=list(enc_tags_mapping.values())))

              precision    recall  f1-score   support

    business       0.95      0.92      0.93     73935
     finance       0.71      0.81      0.76      6788
     general       0.83      0.89      0.86     25646
     science       0.84      0.60      0.70      1690
        tech       0.68      0.67      0.68      2617

    accuracy                           0.89    110676
   macro avg       0.80      0.78      0.79    110676
weighted avg       0.90      0.89      0.89    110676



The results indicates that Small BERT provide a lower accuracy compared to the custom CNN model I created for this problem. I will save model for future use.

In [13]:
classifier_model.save( "../models/DL/bert-model")

