# **IDL Project Assignment Task 6: BERT fine-tuning for classification**

## **Selecting the TensorFlow version and importing the libraries required for the tasks**


In [1]:
!pip install tf-models-official


Collecting tf-models-official
  Using cached https://files.pythonhosted.org/packages/5b/33/91e5e90e3e96292717245d3fe87eb3b35b07c8a2113f2da7f482040facdb/tf_models_official-2.3.0-py2.py3-none-any.whl
Installing collected packages: tf-models-official
Successfully installed tf-models-official-2.3.0


In [2]:
%tensorflow_version 2.x

import os

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

from official.modeling import tf_utils
from official import nlp
from official.nlp import bert

# Load the required submodules
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks

tf.__version__

'2.3.0'

In [3]:
os.getcwd()
os.chdir("/content/drive/My Drive/Colab Notebooks/IDL /IDL Project") 


## **Importing dataset**

In [4]:
glue, info = tfds.load('glue/cola', with_info=True,
                       # It's small, load the whole dataset
                       batch_size=-1)

Downloading and preparing dataset glue/cola/1.0.0 (download: 368.14 KiB, generated: Unknown size, total: 368.14 KiB) to /root/tensorflow_datasets/glue/cola/1.0.0...

Shuffling and writing examples to /root/tensorflow_datasets/glue/cola/1.0.0.incomplete532WPK/glue-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/glue/cola/1.0.0.incomplete532WPK/glue-validation.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/glue/cola/1.0.0.incomplete532WPK/glue-test.tfrecord
Dataset glue downloaded and prepared to /root/tensorflow_datasets/glue/cola/1.0.0. Subsequent calls will reuse this data.


In [5]:
list(glue.keys())


['test', 'train', 'validation']

In [6]:
info.features


FeaturesDict({
    'idx': tf.int32,
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence': Text(shape=(), dtype=tf.string),
})

In [7]:
info.features['label'].names


['unacceptable', 'acceptable']

In [8]:
glue_train = glue['train']

for key, value in glue_train.items():
  print(f"{key:9s}: {value[0].numpy()}")

idx      : 1680
label    : 1
sentence : b'It is this hat that it is certain that he was wearing.'


## **Preprocessing the text data**

In [9]:
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"
tf.io.gfile.listdir(gs_folder_bert)

['bert_config.json',
 'bert_model.ckpt.data-00000-of-00001',
 'bert_model.ckpt.index',
 'vocab.txt']

In [10]:
tokenizer = bert.tokenization.FullTokenizer(
    vocab_file=os.path.join(gs_folder_bert, "vocab.txt"),
    do_lower_case=True)

print("Vocab size:", len(tokenizer.vocab))

Vocab size: 30522
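
The tokenizer above splits words into sub-word pieces taken from `vocab.txt`. The following is a minimal pure-Python sketch of the greedy longest-match-first idea behind WordPiece splitting. It is not the library's implementation: `wordpiece_split` and `toy_vocab` are hypothetical names, and the toy vocabulary is made up for illustration rather than taken from the real 30522-entry file.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (assumption), not the real vocab.txt.
toy_vocab = {"rain", "##ing", "##s", "fine", "tuned", "bert"}

print(wordpiece_split("raining", toy_vocab))  # ['rain', '##ing']
print(wordpiece_split("rains", toy_vocab))    # ['rain', '##s']
print(wordpiece_split("germany", toy_vocab))  # ['[UNK]']
```

With the real vocabulary far fewer words fall back to `[UNK]`, because most words can be covered by some sequence of pieces.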


In [11]:
tokenizer.convert_tokens_to_ids(['[CLS]'])

[101]

In [12]:
def encode_sentence(s, tokenizer):
  # Tokenize one sentence and terminate it with [SEP].
  tokens = list(tokenizer.tokenize(s))
  tokens.append('[SEP]')
  return tokenizer.convert_tokens_to_ids(tokens)

def bert_encode(glue_dict, tokenizer):
  # Token IDs per sentence; ragged because sentence lengths differ.
  sentence = tf.ragged.constant(
      [encode_sentence(s, tokenizer) for s in np.array(glue_dict["sentence"])])

  # Prepend one [CLS] ID to every sentence.
  cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])] * sentence.shape[0]
  input_word_ids = tf.concat([cls, sentence], axis=-1)

  # 1 for real tokens, 0 for the padding added by to_tensor().
  input_mask = tf.ones_like(input_word_ids).to_tensor()

  # Type IDs: 0 for [CLS], 1 for the sentence tokens, 0 for padding.
  type_cls = tf.zeros_like(cls)
  type_s = tf.ones_like(sentence)
  input_type_ids = tf.concat([type_cls, type_s], axis=-1).to_tensor()

  inputs = {
      'input_word_ids': input_word_ids.to_tensor(),
      'input_mask': input_mask,
      'input_type_ids': input_type_ids}

  return inputs

In [13]:
glue_train = bert_encode(glue['train'], tokenizer)
glue_train_labels = glue['train']['label']

glue_validation = bert_encode(glue['validation'], tokenizer)
glue_validation_labels = glue['validation']['label']

glue_test = bert_encode(glue['test'], tokenizer)
glue_test_labels  = glue['test']['label']

In [14]:
for key, value in glue_train.items():
  print(f'{key:15s} shape: {value.shape}')

print(f'glue_train_labels shape: {glue_train_labels.shape}')

input_word_ids  shape: (8551, 47)
input_mask      shape: (8551, 47)
input_type_ids  shape: (8551, 47)
glue_train_labels shape: (8551,)
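
All three inputs share the shape (8551, 47) because every sentence is padded to the length of the longest one. The sketch below reproduces that layout in plain Python for a toy batch. It assumes `[CLS]` = 101 (shown earlier in the notebook), `[SEP]` = 102 and padding = 0 (the usual values in the uncased BERT vocabulary); `encode_batch` and the token IDs 7, 8, 9, 5 are hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # SEP_ID and PAD_ID are assumptions

def encode_batch(token_id_lists):
    """Frame each sentence with [CLS]/[SEP] and pad to the longest row."""
    framed = [[CLS_ID] + ids + [SEP_ID] for ids in token_id_lists]
    max_len = max(len(row) for row in framed)
    word_ids, mask, type_ids = [], [], []
    for row in framed:
        pad = max_len - len(row)
        word_ids.append(row + [PAD_ID] * pad)
        # Mask: 1 marks a real token, 0 marks padding.
        mask.append([1] * len(row) + [0] * pad)
        # Type IDs as in the notebook: 0 for [CLS], 1 for the sentence, 0 for padding.
        type_ids.append([0] + [1] * (len(row) - 1) + [0] * pad)
    return {"input_word_ids": word_ids,
            "input_mask": mask,
            "input_type_ids": type_ids}

# Hypothetical token IDs for two sentences of different length.
batch = encode_batch([[7, 8, 9], [5]])
for key, rows in batch.items():
    print(key, rows)
```

The second row ends in two padding positions, which is exactly what the mask flags with zeros.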


## **Model Building**

In [15]:
import json

bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())

bert_config = bert.configs.BertConfig.from_dict(config_dict)

config_dict

{'attention_probs_dropout_prob': 0.1,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'max_position_embeddings': 512,
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'type_vocab_size': 2,
 'vocab_size': 30522}

In [16]:
_, bert_encoder = bert.bert_models.classifier_model(bert_config, num_labels=2)

In [17]:
checkpoint = tf.train.Checkpoint(model=bert_encoder)
checkpoint.restore(os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7feeaf624e80>

In [18]:
# Set up epochs and steps
epochs = 3
batch_size = 32
eval_batch_size = 32

train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

# creates an optimizer with learning rate schedule
optimizer = nlp.optimization.create_optimizer(
    2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)

In [19]:
transformer_config = config_dict.copy()

# You need to rename a few fields to make this work:
transformer_config['attention_dropout_rate'] = transformer_config.pop('attention_probs_dropout_prob')
transformer_config['activation'] = tf_utils.get_activation(transformer_config.pop('hidden_act'))
transformer_config['dropout_rate'] = transformer_config.pop('hidden_dropout_prob')
transformer_config['initializer'] = tf.keras.initializers.TruncatedNormal(
          stddev=transformer_config.pop('initializer_range'))
transformer_config['max_sequence_length'] = transformer_config.pop('max_position_embeddings')
transformer_config['num_layers'] = transformer_config.pop('num_hidden_layers')

transformer_config

{'activation': <function official.modeling.activations.gelu.gelu>,
 'attention_dropout_rate': 0.1,
 'dropout_rate': 0.1,
 'hidden_size': 768,
 'initializer': <tensorflow.python.keras.initializers.initializers_v2.TruncatedNormal at 0x7feeaf601b00>,
 'intermediate_size': 3072,
 'max_sequence_length': 512,
 'num_attention_heads': 12,
 'num_layers': 12,
 'type_vocab_size': 2,
 'vocab_size': 30522}

In [20]:
manual_classifier = nlp.modeling.models.BertClassifier(
        bert_encoder,
        num_classes=2,
        dropout_rate=transformer_config['dropout_rate'],
        initializer=tf.keras.initializers.TruncatedNormal(
          stddev=bert_config.initializer_range))

In [21]:
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

manual_classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics)

manual_classifier.fit(
    glue_train, glue_train_labels,
    validation_data=(glue_validation, glue_validation_labels),
    batch_size=batch_size,
    epochs=epochs)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7feeae1e44e0>
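
The loss is configured with `from_logits=True`, meaning the classifier outputs unnormalized scores and the softmax is folded into the loss. A minimal pure-Python sketch of what that computes per example, together with the accuracy metric, could look as follows. The helper names and the toy logits and labels are hypothetical and are not outputs of this notebook.

```python
import math

def sparse_cross_entropy_from_logits(logits, label):
    """Cross-entropy of one example, computed directly from raw logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def accuracy(logits_batch, labels):
    """Fraction of examples whose largest logit sits at the true label."""
    hits = sum(1 for logits, label in zip(logits_batch, labels)
               if logits.index(max(logits)) == label)
    return hits / len(labels)

# Hypothetical logits for three sentences and their integer labels.
toy_logits = [[-1.0, 2.0], [0.5, 0.5], [3.0, -3.0]]
toy_labels = [1, 0, 1]

losses = [sparse_cross_entropy_from_logits(z, y)
          for z, y in zip(toy_logits, toy_labels)]
print(losses)
print(accuracy(toy_logits, toy_labels))
```

The labels are plain integers (0 or 1) rather than one-hot vectors, which is why the sparse variants of the loss and the metric are used.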

In [26]:
my_examples = bert_encode(
    glue_dict = {
        'sentence':[
            'The rain in Spain falls mainly on the plain.',
            'Look I fine tuned BERT.',
            'It rains regularly in Germany',
            'Mumbai has easy raining in every year.',
            'I had an intusion that I am done the task right but I was not wrong.']
    },
    tokenizer=tokenizer)

In [27]:
result = manual_classifier(my_examples, training=False)

result = tf.argmax(result,axis=1).numpy()
result

array([1, 0, 1, 0, 0])

In [28]:
np.array(info.features['label'].names)[result]

array(['acceptable', 'unacceptable', 'acceptable', 'unacceptable',
       'unacceptable'], dtype='<U12')
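
The two cells above turn logits into label names in two steps: an argmax per row, then an index lookup into the label names. A dependency-free sketch of the same idea is shown below; `decode` and the logit values are hypothetical, while the label names are the ones reported by `info.features['label'].names`.

```python
label_names = ['unacceptable', 'acceptable']

def decode(logits_batch):
    """Map each row of logits to the name of its highest-scoring class."""
    decoded = []
    for row in logits_batch:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        decoded.append(label_names[best])
    return decoded

# Hypothetical logits for two sentences, not taken from the trained model.
toy_logits = [[-1.2, 2.3], [0.8, -0.4]]
print(decode(toy_logits))  # ['acceptable', 'unacceptable']
```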

In [25]:
export_dir='./saved_model'
tf.saved_model.save(manual_classifier, export_dir=export_dir)

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.


INFO:tensorflow:Assets written to: ./saved_model/assets



## Questions



*   **What is the tutorial classifying when using the GLUE MRPC data set?**
With the GLUE MRPC data set, the tutorial classifies whether the two sentences in each pair are semantically equivalent, i.e. whether one is a paraphrase of the other.

*   **In addition to the input itself, the tutorial feeds two binary tensors for input mask and input type to the model. Is this necessary for the data set single sentence classification?**
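The input mask is still needed, because the sentences are padded to a common length and the mask tells the model which positions hold real tokens. The input type tensor is technically not required for single-sentence classification, since there is no second segment to distinguish. The model call nevertheless expects all three inputs, so all three are fed in.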


*   **How does the tokenization in BERT differ from the one in the previous Task 5?**
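In the previous task we did the encoding and decoding ourselves. Here we use BERT's `FullTokenizer` together with the pretrained vocabulary file `vocab.txt` (30522 entries), which splits words into sub-word (WordPiece) tokens and maps them to IDs.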

*   **What is a [CLS] token and what is it used for?**
The [CLS] token is a special token added at the start of every input sequence (ID 101 here). Its final encoder output serves as a summary of the whole sequence and is what the classification layer receives.


*   **Which part of the BERT encoding is used for the classification? Does your answer match the output shape of the encoder?**
The pooled output of the encoder, which is derived from the [CLS] position, is used for the classification. It has shape (n, 768) for a batch of n sentences, which matches the encoder's `hidden_size` of 768.


*   **Are the BERT encoder weights also fine-tuned to the task?**
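Yes, the encoder weights are fine-tuned as well. The encoder is initialized from the restored pretrained checkpoint and is then trained together with the classification layer on the task data.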
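
The answer about which part of the encoding feeds the classifier can be illustrated with a small sketch. It assumes the standard BERT pooler, a dense layer followed by `tanh` applied to the hidden state at the [CLS] position, and uses a toy hidden size of 4 instead of 768. `pooled_output`, the identity weights and all numbers are hypothetical.

```python
import math

def pooled_output(sequence_output, dense_weights, dense_bias):
    """Pool one sequence: dense + tanh over the hidden state at position 0."""
    cls_vector = sequence_output[0]  # hidden state of the [CLS] token
    return [
        math.tanh(sum(x * w for x, w in zip(cls_vector, unit_weights)) + b)
        for unit_weights, b in zip(dense_weights, dense_bias)
    ]

hidden_size = 4  # toy value; the real encoder uses 768

# Hypothetical encoder output for one sentence of three tokens.
sequence_output = [
    [0.1, 0.2, 0.3, 0.4],   # [CLS]
    [0.5, 0.5, 0.5, 0.5],   # a word piece
    [0.9, 0.8, 0.7, 0.6],   # [SEP]
]

# Identity weights and zero bias keep the example easy to follow.
identity = [[1.0 if i == j else 0.0 for j in range(hidden_size)]
            for i in range(hidden_size)]
pooled = pooled_output(sequence_output, identity, [0.0] * hidden_size)

print(len(pooled))  # 4, one value per hidden unit
print(pooled)
```

Whatever the sentence length, the pooled vector has one entry per hidden unit, which is why a batch of n sentences yields an (n, 768) tensor in the real model.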


