<h2 align=center> Fine-Tune BERT for Text Classification</h2>

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1fnJTeJs5HUpz7nix-F9E6EZdgUflqyEu' />
    <p style="text-align: center;color:gray">Figure 1: BERT Classification Model</p>
</div>

In this notebook, we will fine-tune a BERT base model on Quora question data.

**BERT Model:** https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4 <br>
**BERT Data Preprocessor:** https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3 <br>
**Dataset:** https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip

Here is the flow at a high level:

- #1 - Install and load necessary packages
- #2 - Read the data into a pandas dataframe
- #3 - Split dataset into train, validation and test data sets
- #4 - Create the input pipeline using tf.data.Dataset
- #5 - Load the encoder model and the preprocessor model. The BERT encoder requires input in a fixed format: input_word_ids, input_mask, and input_type_ids. We can produce these with a TensorFlow/Keras tokenizer or with the preprocessor model. This notebook uses the preprocessor model.
- #6 - Create the model with:
    - Input tensor
    - Preprocessor model
    - Encoder
    - Dropout
    - Final dense layer with sigmoid activation
- #7 - Compile the model, using binary cross-entropy loss and the AdamW optimizer.
- #8 - Train the model
- #9 - Evaluate model against the test dataset



<h3>Install packages</h3>

Some packages are required to run this notebook. 

In [None]:
# https://pypi.python.org/pypi/pydot
!apt-get -qq install -y graphviz && pip install pydot
!pip install tensorflow-text
!pip install tensorflow
!pip install tensorflow-hub


If you are running this notebook in a cloud environment such as Colab, you will need to restart the runtime after installing the packages. If you are running on a local machine, just proceed.
Set the TensorFlow logger to show only fatal errors; otherwise you will get a lot of debug messages.

In [None]:
import logging
logging.getLogger('tensorflow').setLevel(logging.FATAL)
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt



<h3>Load Dataset</h3>

I am using the Quora dataset, which labels each question as sincere or insincere. There are about 1.3M rows in the dataset. The data has 3 columns:
- GUID: ID for each row
- Question Text: Question asked by a user on Quora
- Target: Whether the question is insincere (1 = insincere, 0 = sincere)

In [None]:
df = pd.read_csv('https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip', compression='zip', low_memory=False)

print(df.shape)

df.tail(10)



<h3>Split Dataset</h3>

Working with this dataset requires considerable compute power. To speed things up on my MacBook and on free Colab, I trained on only 0.5% of the dataset. This may not be the best choice, as such a small training set can lead to overfitting.

In [None]:
train_df, test_df = train_test_split(df, random_state=42, train_size=0.005, stratify=df.target.values)
train_df, valid_df = train_test_split(train_df, random_state=42, train_size=0.9, stratify=train_df.target.values)
print(f'train: {train_df.shape}; valid: {valid_df.shape}; test: {test_df.shape}')
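
The `stratify` argument keeps the share of each label the same in every split. As a rough illustration of the idea (a plain-Python sketch, not scikit-learn's implementation; the helper `stratified_sample` and the toy label counts are made up for this example), sampling per label group looks like this:

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, fraction, seed=42):
    """Return sorted indices of a sample that keeps each label's share.

    Illustration only; scikit-learn's train_test_split works differently inside.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    picked = []
    for indices in by_label.values():
        rng.shuffle(indices)
        keep = round(len(indices) * fraction)  # same fraction from every label
        picked.extend(indices[:keep])
    return sorted(picked)


# Hypothetical toy labels: 940 sincere (0) and 60 insincere (1) questions.
labels = [0] * 940 + [1] * 60
sample = stratified_sample(labels, fraction=0.1)
sample_counts = Counter(labels[i] for i in sample)
print(sample_counts)
```

The sample keeps the same 94:6 ratio as the full toy list, which is the property the two `train_test_split` calls above rely on.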


<h3>Load Pretrained Models</h3>

In [None]:
batch_size = 32
shuffle_buffer_size = 1000

bert_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4', trainable=True, name = 'bert_layer')
bert_preprocess_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', trainable=False, name = 'bert_preprocess_layer')



<h3>Input Pipeline</h3>

Create an input pipeline with tf.data.Dataset.

In [None]:

with tf.device('/cpu:0'):
  train_data = (
      tf.data.Dataset.from_tensor_slices((train_df.question_text.values, train_df.target.values))
      .shuffle(shuffle_buffer_size)
      .batch(batch_size, drop_remainder=True)
      .prefetch(tf.data.AUTOTUNE)
  )

  valid_data = (
      tf.data.Dataset.from_tensor_slices((valid_df.question_text.values, valid_df.target.values))
      .batch(batch_size, drop_remainder=True)
      .prefetch(tf.data.AUTOTUNE)
  )


print(train_data.element_spec)
print(valid_data.element_spec)
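
To make the `shuffle` and `batch` steps less abstract, here is a small plain-Python imitation of what they do. It is only a sketch of the documented behavior (a fixed-size shuffle buffer, and batches that optionally drop an incomplete tail); the helper names `shuffle_with_buffer` and `batch` are mine and are not part of tf.data.

```python
import random


def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and reuse its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


def batch(items, batch_size, drop_remainder):
    """Group items into lists of batch_size, optionally dropping the tail."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current and not drop_remainder:
        yield current


examples = list(range(10))
shuffled = list(shuffle_with_buffer(examples, buffer_size=3))
full_batches = list(batch(examples, batch_size=4, drop_remainder=True))
all_batches = list(batch(examples, batch_size=4, drop_remainder=False))
print(shuffled)
print(full_batches)
print(all_batches)
```

Note that with `drop_remainder=True` the last incomplete batch is discarded, so up to `batch_size - 1` examples are skipped in each pass over the data.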

<h3>Preprocessing Sample</h3>

<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1-SpKFELnEvBMBqO7h3iypo8q9uUUo96P' />
    <p style="text-align: center;color:gray">Figure 2: BERT Tokenizer</p>
</div>

Review the segments created by the preprocessor model.

In [None]:
test_text = []
for x in train_data.take(1):
    for i in range(1):
        test_text.append(x[0][i].numpy())

print(test_text)
text_processed = bert_preprocess_layer(test_text)
print(text_processed.keys())
print(f"input_word_ids:: shape:{text_processed['input_word_ids'].shape}; values:{text_processed['input_word_ids']}")
print(f"input_mask:: shape:{text_processed['input_mask'].shape}; values:{text_processed['input_mask']}")
print(f"input_type_ids:: shape:{text_processed['input_type_ids'].shape}; values:{text_processed['input_type_ids']}")
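
To help read the three tensors printed above, here is a toy reconstruction of how a single sentence is laid out. It is a sketch under assumptions: the word ids are placeholders rather than real tokenizer output, the helper `pack_single_segment` is hypothetical, and the sequence length is shortened to 8 for readability (the preprocessor pads to 128 by default).

```python
# Special-token ids in the uncased BERT vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pack_single_segment(token_ids, seq_length):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to seq_length."""
    body = token_ids[: seq_length - 2]  # leave room for [CLS] and [SEP]
    word_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(word_ids)
    padding = seq_length - n_real
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": [1] * n_real + [0] * padding,  # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,          # single segment, so all zeros
    }


# Placeholder wordpiece ids for a three-token question (assumed values).
packed = pack_single_segment([2129, 2079, 1045], seq_length=8)
for name, values in packed.items():
    print(name, values)
```

In the real output you should see the same pattern: 101 first, 102 after the last word piece, zeros for padding, a mask that marks the real tokens, and type ids that are all zero because each example has only one segment.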



<h3>Create Model</h3>

Create the classification model with the following layers:
- Text Input
- Preprocessor layer
- Encoder
- Dropout
- Dense with sigmoid activation

In [None]:
# Building the model
def create_model():

  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='Input')
  encoder_inputs = bert_preprocess_layer(text_input)
  pooled_output = bert_layer(encoder_inputs)['pooled_output']
  drop = tf.keras.layers.Dropout(0.1)(pooled_output)
  output = tf.keras.layers.Dense(1, activation='sigmoid', name='classifier')(drop)
  return tf.keras.Model(text_input, output)
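
The last layer of `create_model` turns the encoder's pooled vector into a single probability. As a minimal sketch of that computation (plain Python, with a made-up 4-dimensional vector standing in for the 768-dimensional `pooled_output` and made-up weights), the dense layer with sigmoid activation amounts to:

```python
import math


def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classifier_head(pooled_output, weights, bias):
    """Dense(1, activation='sigmoid') applied to one pooled vector."""
    logit = sum(x * w for x, w in zip(pooled_output, weights)) + bias
    return sigmoid(logit)


# Hypothetical 4-dimensional stand-in for the 768-dimensional pooled_output.
pooled = [0.5, -1.0, 0.25, 2.0]
weights = [0.2, 0.1, -0.4, 0.3]
bias = -0.1
probability = classifier_head(pooled, weights, bias)
print(round(probability, 4))
```

The dropout layer in front of the classifier only has an effect during training; at inference time it passes the pooled vector through unchanged, which is why it is left out of this sketch.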


<h3>Compile Model</h3>

Compile the model with binary cross-entropy loss and the AdamW optimizer. Notice that we are using AdamW, not Adam: BERT was pretrained with Adam plus weight decay, which is what AdamW provides.

In [None]:

model = create_model()
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=2e-5),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])

model.summary()
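
For intuition about the binary cross-entropy loss named above, here is a small plain-Python version of the formula. It is a sketch, not the Keras implementation; the clipping value `epsilon` is an assumption added only to avoid `log(0)`, and the labels and probabilities are invented.

```python
import math


def binary_cross_entropy(y_true, y_prob, epsilon=1e-7):
    """Mean binary cross-entropy; epsilon clipping avoids log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, epsilon), 1.0 - epsilon)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong.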


Visualize the model

In [None]:
tf.keras.utils.plot_model(model=model, show_shapes=True, dpi=76)

<h3>Train Model</h3>

Train the model. I ran it for 4 epochs. 

In [None]:
epochs = 4
history = model.fit(train_data,
                    validation_data=valid_data,
                    epochs=epochs,
                    verbose=1)
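
Each optimizer step during `model.fit` applies the AdamW rule chosen at compile time. The sketch below follows the commonly published AdamW formulation for a single scalar parameter; it is not taken from the Keras source, and the hyperparameter values (including `weight_decay=0.01`) are illustrative assumptions rather than the settings used in this notebook.

```python
import math


def adamw_step(param, grad, m, v, step, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-7, weight_decay=0.01):
    """One AdamW update for a scalar parameter; returns (param, m, v)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (step starts at 1).
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Weight decay is applied to the parameter directly, not mixed into grad.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


# With a zero gradient, only the decoupled weight decay shrinks the parameter.
decayed, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.01)
# With no weight decay, the first step moves by roughly the learning rate.
stepped, _, _ = adamw_step(1.0, 2.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.0)
print(decayed, stepped)
```

The point of the decoupling is visible in the last line of the function: the decay term does not pass through the `m` and `v` averages, which is the difference from adding an L2 penalty to the loss under plain Adam.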

Plot the loss and accuracy for both the training and validation sets.


In [None]:
history_dict = history.history
print(history_dict.keys())

acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epoch_range = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))

plt.subplot(2, 1, 1)
# 'r' draws a solid red line
plt.plot(epoch_range, loss, 'r', label='Training loss')
# 'b' draws a solid blue line
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epoch_range, acc, 'r', label='Training acc')
plt.plot(epoch_range, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Apply tight_layout after the subplots exist so it can adjust their spacing
fig.tight_layout()
plt.show()

<h3>Evaluate</h3>

Evaluate the model on the test dataset.

In [None]:
with tf.device('/cpu:0'):
  test_data = (
      tf.data.Dataset.from_tensor_slices((test_df.question_text.values, test_df.target.values))
      .batch(batch_size, drop_remainder=True)
      .prefetch(tf.data.AUTOTUNE)
  )


loss, accuracy = model.evaluate(test_data)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
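
The accuracy reported here is binary accuracy: each predicted probability is compared with a 0.5 threshold and the result is matched against the label. A minimal plain-Python sketch of that calculation, using invented labels and probabilities, could look like this:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    correct = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > threshold else 0
        correct += int(predicted == y)
    return correct / len(y_true)


# Hypothetical labels and model outputs for four questions.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.7, 0.9]
toy_accuracy = binary_accuracy(y_true, y_prob)
print(toy_accuracy)
```

Here the third question is scored 0.7 and so counted as insincere although its label is 0, which gives three correct answers out of four.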

<h3>Acknowledgement</h3>

This notebook is based on a guided project on Coursera at https://www.coursera.org/learn/fine-tune-bert-tensorflow/ungradedLti/ack5t/fine-tune-bert-for-text-classification-with-tensorflow. I modified it to take advantage of the preprocessor model and the latest version of TensorFlow.

<h3>Next Steps</h3>

This is a very basic setup. You might want to try different values for the data split, dropout, learning rate, and so on. There is no single best value for any of these; good settings are found through experimentation.
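
One simple way to organize such experiments is to list the candidate settings up front and loop over every combination. The sketch below only builds the list of configurations; the candidate values are hypothetical and were not tuned in this notebook, apart from the first combination, which matches the settings used above.

```python
from itertools import product

# Hypothetical candidate values; only the first of each was used in this notebook.
search_space = {
    "train_fraction": [0.005, 0.01],
    "dropout_rate": [0.1, 0.2],
    "learning_rate": [2e-5, 3e-5, 5e-5],
}

names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

for config in configs:
    print(config)
```

Each dictionary could then drive one training run, for example by passing `dropout_rate` into a variant of `create_model` that accepts it as an argument (such a variant is not defined in this notebook).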