<a href="https://colab.research.google.com/github/eluyutao/MMAI-Deep-Learning-Projects/blob/main/Transfer%20learning%20with%20DistilBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMAI 894 - Exercise 3
## Transfer learning with DistilBert
The goal of this exercise is to build a text classifier using the pretrained DistilBert model published by HuggingFace. You will train and evaluate it on the GLUE/CoLA dataset (https://nyu-mll.github.io/CoLA/).

Submission instructions:

- You cannot edit this notebook directly. Save a copy to your drive, and make sure to identify yourself in the title using name and student number
- Do not insert new cells before the final one (titled "Further exploration") 
- Verify that your notebook can _restart and run all_. 
- Unlike previous assignments, please **submit all three formats: .py, .ipynb, and html** (see https://torbjornzetterlund.com/how-to-save-a-google-colab-notebook-as-html/)
 - The notebook and html submissions should show the completion of your best performing run
 - Submission files should be named: `studentID_lastname_firstname_ex3.py (or .html, .ipynb)`
- The mark will be assessed on the implementation of the functions with #TODO
- **Do not change anything outside the functions** unless in the further exploration section
- As you are encouraged to explore the network configuration, 20% of the mark is based on final accuracy.
- Note: You do not have to answer the questions in this notebook as part of your submission. They are meant to guide you.

- You should not need to use any additional libraries other than the ones listed below. You may want to import additional modules from those libraries, however.

In [13]:
# This cell installs the dependencies, imports DistilBert, and sets up the dataset, which we
# load with tensorflow_datasets (https://www.tensorflow.org/datasets/catalog/overview)

!pip install -q transformers tfds-nightly

import matplotlib.pyplot as plt
import tensorflow.keras as keras
import pandas as pd

try: # this is only working on the 2nd try in colab :)
  from transformers import DistilBertTokenizer, TFDistilBertModel, DistilBertConfig
except Exception as err: # so we catch the error and import it again
  from transformers import DistilBertTokenizer, TFDistilBertModel, DistilBertConfig

import numpy as np
from tensorflow.keras.layers import Dense, Input, Dropout, Flatten, LSTM

import tensorflow_datasets as tfds
import tensorflow as tf

dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


# Data Preparation

In [2]:
def load_data(save_dir="./"):
  dataset = tfds.load('glue/cola', shuffle_files=True)
  train = tfds.as_dataframe(dataset["train"])
  val = tfds.as_dataframe(dataset["validation"])
  test = tfds.as_dataframe(dataset["test"])
  return train, val, test

def prepare_raw_data(df):
  raw_data = df.loc[:, ["idx", "sentence", "label"]]
  raw_data["label"] = raw_data["label"].astype('category')
  return raw_data

train, val, test = load_data()
train = prepare_raw_data(train)
val = prepare_raw_data(val)
test = prepare_raw_data(test)

Downloading and preparing dataset 368.14 KiB (download: 368.14 KiB, generated: 965.49 KiB, total: 1.30 MiB) to ~/tensorflow_datasets/glue/cola/2.0.0...

Generating splits: train (8551 examples), validation (1043 examples), test (1063 examples)

Dataset glue downloaded and prepared to ~/tensorflow_datasets/glue/cola/2.0.0. Subsequent calls will reuse this data.


Before using this data, we need to clean and QA it. Unlike MNIST, this is a text dataset, so we should be more careful. For example:
- Are there any duplicate entries? 
- What is the range of lengths for the sentences? Should we impose a minimum sentence length?
- Are there "non-sentence" entries? For example, hashtags or other features we should remove? (luckily, this dataset is quite clean, but that might not always be the case!)

NOTE! The sentences are encoded as binary strings. To do text manipulations, you might need to decode them using `s.decode("utf-8")`

You may notice that the test set has no labels (every label is -1). This is because GLUE is a benchmark dataset: the test labels are withheld, and test predictions are only scored when submitted to the benchmark.
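
As a rough illustration of the QA questions above, the following sketch checks for duplicates and sentence lengths on a tiny hand-made frame. The `toy` DataFrame and its sentences are made up for this example; only the byte-string encoding and the `decode("utf-8")` step mirror the real data. It is a starting point under those assumptions, not the expected solution for `clean_data`.

```python
import pandas as pd

# Toy stand-in for the CoLA frame: sentences are byte strings, as in the tfds output.
toy = pd.DataFrame({
    "idx": [0, 1, 2, 3],
    "sentence": [b"The cat sat.", b"The cat sat.", b"Hi", b"Packages carry easily."],
    "label": [1, 1, 0, 1],
})

# Decode once so string operations work on str rather than bytes.
decoded = toy["sentence"].map(lambda s: s.decode("utf-8"))

# Duplicate check: how many rows repeat an earlier sentence?
n_duplicates = int(decoded.duplicated().sum())

# Length check: whitespace-separated word counts per sentence.
word_counts = decoded.str.split().str.len()
length_range = (int(word_counts.min()), int(word_counts.max()))

print(n_duplicates, length_range)
```

Running the same checks on `train` and `val` should indicate whether a minimum sentence length is worth imposing.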

In [3]:
def clean_data(df):
  # TODO: What data cleaning/filtering should you consider?
  # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
  cleaned_data = df.drop_duplicates(subset=['sentence'])
  return cleaned_data

train = clean_data(train)
val = clean_data(val)
test = clean_data(test)

print(train.head())
print(test.head())

    idx                                           sentence label
0  1680  b'It is this hat that it is certain that he wa...     1
1  1456  b'Her efficient looking up of the answer pleas...     1
2  4223          b'Both the workers will wear carnations.'     1
3  4093  b'John enjoyed drawing trees for his syntax ho...     1
4  7111  b'We consider Leslie rather foolish, and Lou a...     1
    idx                                         sentence label
0   163            b'Brian was wiping behind the stove.'    -1
1   131       b'You could give a headache to a Tylenol.'    -1
2  1021                          b'I want to meet at 6.'    -1
3   166                        b'Packages carry easily.'    -1
4  1039  b"Many people said they were sick who weren't."    -1


Next, we need to prepare the text for DistilBert. The model does not ingest raw text; it takes token IDs, which index into its internal embedding table. Additionally, because every sequence in a batch must have the same length, shorter sentences are padded, and an attention mask tells the model which tokens are part of the sentence and which are padding.

Luckily, `dbert_tokenizer` takes care of all that for us:
- Preprocessing: https://huggingface.co/transformers/preprocessing.html
- Summary of tokenizers (DistilBert uses WordPiece): https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
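
To make the relationship between token IDs, padding, and the attention mask concrete, here is a deliberately simplified sketch. It is not the WordPiece tokenizer: `toy_vocab`, `toy_encode`, and the special-token IDs are hypothetical and chosen only for illustration. The real `dbert_tokenizer` uses its own vocabulary and IDs, but the shape of the output (fixed-length IDs plus a 0/1 mask) follows the same idea.

```python
# Toy illustration only: a hypothetical whitespace vocabulary,
# NOT DistilBert's WordPiece vocabulary or its real special-token IDs.
PAD_ID, CLS_ID, SEP_ID, UNK_ID = 0, 1, 2, 3
toy_vocab = {"the": 4, "cat": 5, "sat": 6}

def toy_encode(sentence, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    words = sentence.lower().split()
    # Truncate so that the [CLS] and [SEP] markers still fit.
    words = words[: max_length - 2]
    ids = [CLS_ID] + [toy_vocab.get(w, UNK_ID) for w in words] + [SEP_ID]
    mask = [1] * len(ids)
    # Pad up to the fixed length; padded positions get mask 0.
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

ids, mask = toy_encode("The cat sat", max_length=8)
print(ids)
print(mask)
```

The mask marks the first five positions as real tokens and the remaining three as padding, which is the information the model needs in order to ignore the padded positions.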

In [4]:
def extract_text_and_y(df):
  text = [x.decode('utf-8') for x in df.sentence.values]
  # for multiclass problems, you can use sklearn.preprocessing.OneHotEncoder, but we only have two classes, so we'll use a single sigmoid output
  y = np.array([x for x in df.label.values])
  return text, y

def encode_text(text):
    # TODO: encode text using dbert_tokenizer
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    tmp = dbert_tokenizer(text, padding='max_length', truncation=True, 
                          return_tensors="tf", max_length = 64)
    input_ids, attention_mask = tmp['input_ids'], tmp['attention_mask']

    return input_ids, attention_mask

# the following prepares the input for running in DistilBert
train_text, train_y = extract_text_and_y(clean_data(train))
val_text, val_y = extract_text_and_y(clean_data(val))
test_text, test_y = extract_text_and_y(clean_data(test))

train_input, train_mask = encode_text(train_text)
val_input, val_mask = encode_text(val_text)
test_input, test_mask = encode_text(test_text)

train_model_inputs_and_masks = {
    'inputs' : train_input,
    'masks' : train_mask
}

val_model_inputs_and_masks = {
    'inputs' : val_input,
    'masks' : val_mask
}

test_model_inputs_and_masks = {
    'inputs' : test_input,
    'masks' : test_mask
}

# Modelling

## Build and Train Model

Resources:
- BERT paper https://arxiv.org/pdf/1810.04805.pdf
- DistilBert paper: https://arxiv.org/abs/1910.01108
- DistilBert Tensorflow Documentation: https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel
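
In BERT-style models the classification token sits at the first position of the sequence, so its output vector can be picked out by indexing the sequence axis. The sketch below shows that indexing on a small NumPy array standing in for `last_hidden_state`; the array sizes are shrunk for readability and are assumptions of this example (the notebook uses 64 tokens and 768 features). The same slice syntax should also work on a TensorFlow tensor.

```python
import numpy as np

# Stand-in for last_hidden_state with shape (batch, sequence, hidden).
# Small sizes keep the example readable; the notebook uses 64 tokens and 768 features.
batch, seq_len, hidden = 2, 4, 3
last_hidden_state = np.arange(
    batch * seq_len * hidden, dtype=np.float32
).reshape(batch, seq_len, hidden)

# Position 0 along the sequence axis holds the [CLS] vector for each sentence.
cls_vectors = last_hidden_state[:, 0, :]

print(cls_vectors.shape)
```

The result has one vector per sentence, with shape `(batch, hidden)`, which is a much smaller input for a classification head than the full per-token output.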

In [24]:
def build_model(base_model, trainable=False, params={}):
    # TODO: build the model, with the option to freeze the parameters in distilBERT
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint 1: the cls token (token for classification in bert / distilBert) corresponds to the first element in the 
    # sequence in DistilBert. Take a look at Figure 2 in BERT paper.
    # Hint 2: this guide may be helpful for parameter freezing: https://keras.io/guides/transfer_learning/
    # Hint 3: double check that your number of parameters make sense
    # Hint 4: carefully consider your final layer activation and loss function

    # Refer to https://keras.io/api/layers/core_layers/input/
    max_seq_len = 64
    inputs = Input(shape=(max_seq_len,), name='inputs', dtype='int32')
    masks  = Input(shape=(max_seq_len,), name='masks', dtype='int32')

    base_model.trainable = trainable

    dbert_output = base_model(inputs, attention_mask=masks)
    # dbert_last_hidden_state gets you the output encoding for each of your tokens.
    # Each such encoding is a vector with 768 values. The first token fed into the model is [cls]
    # which can be used to build a sentence classification network
    dbert_last_hidden_state = dbert_output.last_hidden_state

    # Any additional layers should go here
    # use the 'params' as a dictionary for hyper parameter to facilitate experimentation
    
    my_output = Dense(128, activation=params['actv'])(dbert_last_hidden_state)
    my_output = Dropout(params['dpot'])(my_output)

    my_output = Dense(128, activation=params['actv'])(my_output)
    my_output = Dropout(params['dpot'])(my_output)

    my_output = Dense(64, activation=params['actv'])(my_output)
    my_output = Dropout(params['dpot'])(my_output)

    my_output = Flatten()(my_output)
    probs = Dense(1, activation='sigmoid')(my_output)

    model = keras.Model(inputs=[inputs, masks], outputs=probs)
    model.summary()
    return model

dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
params={'actv': 'elu',
        'dpot': 0.3
        }

model = build_model(dbert_model, params=params)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_transform', 'vocab_layer_norm', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 inputs (InputLayer)            [(None, 64)]         0           []                               
                                                                                                  
 masks (InputLayer)             [(None, 64)]         0           []                               
                                                                                                  
 tf_distil_bert_model_8 (TFDist  TFBaseModelOutput(l  66362880   ['inputs[0][0]',                 
 ilBertModel)                   ast_hidden_state=(N               'masks[0][0]']                  
                                one, 64, 768),                                                    
                                 hidden_states=None                                         
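
Hint 3 in `build_model` asks whether the parameter counts make sense. As a worked check, the head defined in that function can be counted by hand; `dense_params` is a helper introduced here for illustration only. `Dropout` and `Flatten` add no parameters, and each `Dense` layer acts on the last axis of its input.

```python
def dense_params(n_in, n_out):
    """A Dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

seq_len, hidden = 64, 768

head_params = (
    dense_params(hidden, 128)          # Dense(128) applied to each token's 768-d vector
    + dense_params(128, 128)           # second Dense(128)
    + dense_params(128, 64)            # Dense(64)
    + dense_params(seq_len * 64, 1)    # Flatten gives 64 * 64 features, then Dense(1)
)
base_params = 66_362_880  # DistilBert parameter count reported by model.summary()

print(head_params, base_params + head_params)
```

With `trainable=False`, the summary should report only the head's parameters as trainable and the DistilBert parameters as non-trainable.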

In [25]:
def compile_model(model):
    # TODO: compile the model, include relevant auc metrics when training
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    loss = tf.keras.losses.BinaryCrossentropy()
    metrics = [
        tf.keras.metrics.BinaryAccuracy(name='bin_accuracy'),
        tf.keras.metrics.AUC(name='my_auc'),
    ]
    model.compile(optimizer='adam', loss=loss, metrics=metrics)
    return model

model = compile_model(model)
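
Hint 4 asks you to consider the final activation and the loss together. With a single sigmoid output, the model emits a probability for the positive class, and binary cross-entropy penalises that probability according to the true label. The following is a minimal pure-Python sketch of the formula; the `binary_cross_entropy` helper and its `eps` clipping are assumptions of this example, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clipping avoids log(0) and is an assumption of this sketch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Same labels, opposite predictions: confident and right vs. confident and wrong.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalised far more heavily than confidently right ones, which is why the loss can move even when accuracy stays flat between epochs.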

In [26]:
def train_model(model, model_inputs_and_masks_train, model_inputs_and_masks_val,
    y_train, y_val, batch_size, num_epochs):
    # TODO: train the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION

    # y_train = np.asarray(y_train).astype('int32').reshape((-1,128))
    # y_val = np.asarray(y_val).astype('int32').reshape((-1,128))

    history = model.fit(
        model_inputs_and_masks_train,
        y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        verbose=2,
        validation_data=(model_inputs_and_masks_val, y_val))

    
    return model, history

model, history = train_model(model, train_model_inputs_and_masks, val_model_inputs_and_masks, train_y, val_y, batch_size=128, num_epochs=5)

Epoch 1/5
67/67 - 27s - loss: 0.6086 - bin_accuracy: 0.6893 - my_auc: 0.6161 - val_loss: 0.5694 - val_bin_accuracy: 0.7218 - val_my_auc: 0.7165 - 27s/epoch - 402ms/step
Epoch 2/5
67/67 - 20s - loss: 0.5561 - bin_accuracy: 0.7247 - my_auc: 0.7042 - val_loss: 0.5712 - val_bin_accuracy: 0.7190 - val_my_auc: 0.7342 - 20s/epoch - 297ms/step
Epoch 3/5
67/67 - 20s - loss: 0.5316 - bin_accuracy: 0.7369 - my_auc: 0.7402 - val_loss: 0.5424 - val_bin_accuracy: 0.7507 - val_my_auc: 0.7465 - 20s/epoch - 305ms/step
Epoch 4/5
67/67 - 21s - loss: 0.5073 - bin_accuracy: 0.7566 - my_auc: 0.7691 - val_loss: 0.5421 - val_bin_accuracy: 0.7334 - val_my_auc: 0.7500 - 21s/epoch - 310ms/step
Epoch 5/5
67/67 - 21s - loss: 0.4944 - bin_accuracy: 0.7587 - my_auc: 0.7845 - val_loss: 0.5262 - val_bin_accuracy: 0.7565 - val_my_auc: 0.7522 - 21s/epoch - 317ms/step


# Further exploration (REMOVE ALL CODE AFTER THIS CELL BEFORE SUBMISSION)
Any code after this is not evaluated, and must be removed before submission.
Leaving code below will result in losing marks.