<a href="https://colab.research.google.com/github/dbhadore/BERT-Related/blob/main/bert_tune_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-Tuning BERT for Text Classification - Small Dataset (Data in Memory)

In [1]:
import tensorflow as tf
tf.__version__

'2.5.0'

In [2]:
!git clone --depth 1 -b v2.3.0 https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Enumerating objects: 2650, done.
remote: Counting objects: 100% (2650/2650), done.
remote: Compressing objects: 100% (2311/2311), done.
remote: Total 2650 (delta 506), reused 1386 (delta 306), pack-reused 0
Receiving objects: 100% (2650/2650), 34.02 MiB | 31.24 MiB/s, done.
Resolving deltas: 100% (506/506), done.
Note: checking out '400d68abbccda2f0f6609e3a924467718b144233'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>



In [3]:
!pip install -Uqr models/official/requirements.txt

In [4]:
import sys
sys.path.append("models")

In [5]:
import numpy as np
import pandas as pd
import tensorflow_hub as hub
from official.nlp.bert import tokenization
import tensorflow_datasets as tfds
from official.nlp import optimization

#### Explore Data used to fine tune

The data is the GLUE MRPC (Microsoft Research Paraphrase Corpus) task: a corpus of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the two sentences in each pair are semantically equivalent.

In [6]:
glue, info = tfds.load('glue/mrpc', with_info=True, batch_size=-1)
print(list(glue.keys()))
print(info.features)

Downloading and preparing dataset 1.43 MiB (download: 1.43 MiB, generated: 1.74 MiB, total: 3.17 MiB) to /root/tensorflow_datasets/glue/mrpc/2.0.0...
Dataset glue downloaded and prepared to /root/tensorflow_datasets/glue/mrpc/2.0.0. Subsequent calls will reuse this data.
[Split('train'), Split('validation'), Split('test')]
FeaturesDict({
    'idx': tf.int32,
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string),
    'sentence2': Text(shape=(), dtype=tf.string),
})


In [7]:
print(info.features['label'].names)
print('Train, Val, Test - No of records:', 
      len(glue['train']['label']), len(glue['validation']['label']), len(glue['test']['label']))

['not_equivalent', 'equivalent']
Train, Val, Test - No of records: 3668 408 1725


In [8]:
max(tf.strings.length(glue['train']['sentence1']).numpy())

226

#### BERT Pre-trained Model & BERT Tokenizer

In [9]:
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2", trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [10]:
print("Vocab size:", len(tokenizer.vocab))

Vocab size: 30522


In [11]:
tokens = tokenizer.tokenize("This is a tuning problem. Dhiman is doing this")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['this', 'is', 'a', 'tuning', 'problem', '.', 'dh', '##iman', 'is', 'doing', 'this']
[2023, 2003, 1037, 17372, 3291, 1012, 28144, 18505, 2003, 2725, 2023]


The model expects its two input sentences to be concatenated. The combined input starts with a [CLS] token (marking this as a classification problem), and each sentence ends with a [SEP] ("separator") token:

In [12]:
def encode_sentence(s, tokenizer):
    tokens = list(tokenizer.tokenize(s))
    tokens.append('[SEP]')
    return tokenizer.convert_tokens_to_ids(tokens)
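
To make the `[CLS] sentence1 [SEP] sentence2 [SEP]` layout concrete, here is a minimal sketch in plain Python. It is only an illustration: `TOY_VOCAB`, `toy_encode_sentence` and `toy_pack_pair` are hypothetical names, the ids are made up (they are not the real BERT vocabulary ids), and whitespace splitting stands in for the WordPiece tokenizer.

```python
# Toy vocabulary: made-up ids for illustration, NOT the real BERT vocab ids.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "the": 3, "cat": 4, "sat": 5, "a": 6, "dog": 7, "ran": 8}


def toy_encode_sentence(sentence):
    """Whitespace-tokenize, append [SEP], and map each token to its toy id."""
    tokens = sentence.lower().split() + ["[SEP]"]
    return [TOY_VOCAB[t] for t in tokens]


def toy_pack_pair(sentence1, sentence2):
    """Build the flat id sequence [CLS] sentence1 [SEP] sentence2 [SEP]."""
    return ([TOY_VOCAB["[CLS]"]]
            + toy_encode_sentence(sentence1)
            + toy_encode_sentence(sentence2))


packed = toy_pack_pair("the cat sat", "a dog ran")
print(packed)  # [1, 3, 4, 5, 2, 6, 7, 8, 2]
```

The sequence starts with the `[CLS]` id and contains exactly two `[SEP]` ids, one closing each sentence.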

The model expects 3 inputs:

* The input word ids
* The input mask
* The input type

The mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as the input_word_ids, and contains a 1 anywhere the input_word_ids is not padding.

The "input type" also has the same shape, but inside the non-padded region, contains a 0 or a 1 indicating which sentence the token is a part of.
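
The three inputs described above can be sketched for a single sentence pair without TensorFlow. This is a simplified illustration, not the notebook's implementation: `toy_build_inputs` is a hypothetical helper, the ids 101 and 102 are assumed to stand for `[CLS]` and `[SEP]`, the remaining ids are borrowed from the tokenizer output shown earlier, and the maximum length is shortened to 10 so the padding is easy to see.

```python
def toy_build_inputs(ids1, ids2, cls_id=101, max_len=10, pad_id=0):
    """Build word ids, mask and type ids for one sentence pair.

    ids1 and ids2 are token-id lists that already end with the [SEP] id.
    """
    word_ids = ([cls_id] + ids1 + ids2)[:max_len]
    # Type 0 covers [CLS] and sentence 1, type 1 covers sentence 2.
    type_ids = ([0] * (1 + len(ids1)) + [1] * len(ids2))[:max_len]
    n_real = len(word_ids)
    n_pad = max_len - n_real
    return {
        "input_word_ids": word_ids + [pad_id] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,  # 1 = content, 0 = padding
        "input_type_ids": type_ids + [0] * n_pad,
    }


# "this is [SEP]" and "doing this [SEP]" (102 assumed to be the [SEP] id)
example = toy_build_inputs([2023, 2003, 102], [2725, 2023, 102])
for name, values in example.items():
    print(name, values)
# input_word_ids [101, 2023, 2003, 102, 2725, 2023, 102, 0, 0, 0]
# input_mask [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# input_type_ids [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```

All three lists have the same length, and the padded positions are 0 in every one of them, so the mask is what tells the model which zeros are padding.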

In [13]:
def bert_encode(glue_dict, tokenizer):
    # start by encoding all the sentences and packing them into ragged-tensors
    sentence1 = tf.ragged.constant([encode_sentence(s, tokenizer) for s in np.array(glue_dict["sentence1"])])
    sentence2 = tf.ragged.constant([encode_sentence(s, tokenizer) for s in np.array(glue_dict["sentence2"])])
    cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])] * sentence1.shape[0]
    
    #Input1
    # now prepend a [CLS] token
    input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
    
    #Input2
    input_mask = tf.ones_like(input_word_ids).to_tensor(shape=(None,128))
    
    #Input3
    type_cls = tf.zeros_like(cls)
    type_s1 = tf.zeros_like(sentence1)
    type_s2 = tf.ones_like(sentence2)
    input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1).to_tensor(shape=(None,128))
    
    inputs = {
        'input_word_ids': input_word_ids.to_tensor(shape=(None,128)),
        'input_mask': input_mask,
        'input_type_ids': input_type_ids}
    return inputs

#### Prepare training data

In [14]:
glue_train = bert_encode(glue['train'], tokenizer)
glue_train_labels = glue['train']['label']
glue_validation = bert_encode(glue['validation'], tokenizer)
glue_validation_labels = glue['validation']['label']

#### Define Model

In [15]:
max_seq_length = glue_train['input_word_ids'].shape[1]
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')
bert_inputs = {'input_word_ids': input_word_ids, 'input_mask': input_mask, 'input_type_ids': input_type_ids}
pooled_output, _ = bert_layer([input_word_ids, input_mask, input_type_ids])
output = tf.keras.layers.Dropout(rate=0.2)(pooled_output)
initializer = tf.keras.initializers.TruncatedNormal(stddev=0.02)
bert_output = tf.keras.layers.Dense(2, kernel_initializer=initializer, name='output')(output)
model = tf.keras.models.Model(inputs=bert_inputs, outputs=bert_output)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

#### Train Model

In [16]:
epochs = 10
batch_size = 32
eval_batch_size = 32
train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)
optimizer = optimization.create_optimizer(2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
model.fit(
    glue_train, glue_train_labels,
    validation_data=(glue_validation, glue_validation_labels),
    batch_size=batch_size,
    validation_batch_size=eval_batch_size,
    epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fbcc3d4e890>
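
The step arithmetic in the training cell can be checked by hand. The sketch below recomputes the counts for the 3668 training records and then evaluates a learning-rate curve under an assumed shape: linear warmup from zero, followed by a linear decay towards zero measured from step 0. `toy_lr` is a hypothetical function written for this illustration; it is not the `optimization.create_optimizer` implementation and the real schedule may differ in detail.

```python
train_data_size = 3668  # number of MRPC training records
batch_size = 32
epochs = 10
init_lr = 2e-5

steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)
print(steps_per_epoch, num_train_steps, warmup_steps)  # 114 1140 114


def toy_lr(step):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / num_train_steps


for step in (0, 57, 114, 627, 1140):
    print(step, toy_lr(step))
```

Note that `int(...)` rounds 114.625 down, whereas `model.fit` also processes the final partial batch of each epoch, so the schedule length computed here is slightly shorter than the number of optimizer steps actually taken.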

#### Test Model

In [23]:
test_examples = bert_encode(
    glue_dict={
        'sentence1': [
            'The people in home isolation should be effectively monitored.',
            'Olympic is going on'],
        'sentence2': [
            'The people who isolated theselves at home should be monitored.',
            'Pandemic is not over']
    }, tokenizer=tokenizer)
result = model.predict(test_examples)
print(result)
# Take the argmax over the class axis; the default axis=0 would compare across examples
result = tf.argmax(result, axis=1).numpy()
print(result)
print(np.array(info.features['label'].names)[result])

[[-4.0140576  3.494368 ]
 [ 3.2827957 -2.404686 ]]
[1 0]
['equivalent' 'not_equivalent']
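
The model outputs raw logits, one row per sentence pair and one column per class. As a small sketch of how those rows map to labels, the plain-Python snippet below applies a softmax to each row of the logits printed above and picks the highest-scoring class per row. The `softmax` helper is written here for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the prediction output above
logits = [[-4.0140576, 3.494368],
          [3.2827957, -2.404686]]
label_names = ["not_equivalent", "equivalent"]

predicted = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(row)), key=lambda i: row[i])  # argmax within the row
    predicted.append(label_names[best])
    print(label_names[best], [round(p, 4) for p in probs])

print(predicted)  # ['equivalent', 'not_equivalent']
```

The argmax is taken within each row, that is, across classes for one example, which is why the class axis has to be specified when doing the same with `tf.argmax`.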


#### Save Model

In [24]:
export_dir='./saved_model'
tf.saved_model.save(model, export_dir=export_dir)




FOR DEVS: If you are overwriting _tracking_metadata in your class, this property has been used to save metadata in the SavedModel. The metadta field will be deprecated soon, so please move the metadata to a different file.



INFO:tensorflow:Assets written to: ./saved_model/assets


