<h2><center>Notebook Walk-through: </center>
    
<center>Fine-Tuning BERT - Optimizer Considerations and Layer Freezing
</center></h2>

In this notebook we discuss some aspects of fine-tuning BERT for a specific task. We use text classification as the example and highlight various issues you may encounter.

Specifically, we will:

* play with BERT (Hugging Face implementation): Tokenization, Layers and Output Dimensions  
* build a text classifier with BERT from scratch (here: predicting a song's decade from its lyrics) and discuss a couple of options you may have
* train the network with various configurations and make observations that will hopefully be helpful

Note that a lot of the content will be delivered through live experimentation in the walkthrough session, and it will not be recorded in the notebook. Please watch the recording. 

Also, note that we are not attempting to reach state of the art by any means. The purpose of the notebook is to highlight some of the issues you may want to consider when fine-tuning BERT.

We start with a few common imports.


In [None]:
from google.colab import (drive, files)
import pandas as pd
import numpy as np
import sklearn
import os

import tensorflow as tf
import tensorflow_datasets as tfds

!pip install -q transformers

import transformers

from transformers import BertTokenizer, TFBertModel
from tensorflow.keras import backend as K


import logging
tf.get_logger().setLevel(logging.ERROR)


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import sklearn.model_selection
import sklearn.preprocessing as preproc
from sklearn.feature_extraction import text

import sklearn.metrics as metrics


Let's check for the presence of a GPU. We'll need one (or better) to fine-tune transformer models like BERT.

In [None]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Next, let's record the library versions that we are using:

In [None]:
tf.__version__

'2.8.0'

In [None]:
transformers.__version__

'4.18.0'

#### TODO
1) Import our data


In [None]:
drive.mount('/content/gdrive', force_remount=True)
path = "/content/gdrive"
os.chdir(path)

Mounted at /content/gdrive


In [None]:
## Get our cleaned data (from the DataCreation2.ipynb), which is stored in good_lyrics_data.csv
df = pd.read_csv('MyDrive/W266_Final_Project/good_lyrics_data.csv')
df

Unnamed: 0,Year,Yearly Rank,Title,Artist(s),Lyrics,Num Chars,Num Words,Decade
0,1960,2,"""Cathy's Clown""",The Everly Brothers,Cathy’s Clown Lyrics[Chorus] Don't want your l...,827,156,1960s
1,1960,8,"""Stuck on You""",Elvis Presley,Stuck on You Lyrics[Verse 1] You can shake an ...,1242,242,1960s
2,1960,9,"""The Twist""",Chubby Checker,The Twist Lyrics[Chorus:] Come on baby let's d...,754,147,1960s
3,1960,14,"""El Paso""",Marty Robbins,El Paso Lyrics[Verse 1] Out in the West Texas ...,2465,496,1960s
4,1960,15,"""Alley Oop""",The Hollywood Argyles,"Alley-Oop Lyrics[Intro] (Oop-oop, oop, oop-oop...",1859,299,1960s
...,...,...,...,...,...,...,...,...
3542,2021,94,"""Single Saturday Night""",Cole Swindell,Single Saturday Night Lyrics[Verse 1] I was ou...,2038,390,2020s
3543,2021,95,"""Things a Man Oughta Know""",Lainey Wilson,Things a Man Oughta Know Lyrics[Verse 1] I can...,1341,298,2020s
3544,2021,96,"""Throat Baby (Go Baby)""",BRS Kash,Throat Baby (Go Baby) Lyrics[Intro] (What's ha...,3042,615,2020s
3545,2021,97,"""Tombstone""",Rod Wave,"Tombstone Lyrics[Intro] Damn, this motherfucke...",2086,393,2020s


In [None]:
bert_df = df[["Lyrics", "Decade"]]
bert_df

Unnamed: 0,Lyrics,Decade
0,Cathy’s Clown Lyrics[Chorus] Don't want your l...,1960s
1,Stuck on You Lyrics[Verse 1] You can shake an ...,1960s
2,The Twist Lyrics[Chorus:] Come on baby let's d...,1960s
3,El Paso Lyrics[Verse 1] Out in the West Texas ...,1960s
4,"Alley-Oop Lyrics[Intro] (Oop-oop, oop, oop-oop...",1960s
...,...,...
3542,Single Saturday Night Lyrics[Verse 1] I was ou...,2020s
3543,Things a Man Oughta Know Lyrics[Verse 1] I can...,2020s
3544,Throat Baby (Go Baby) Lyrics[Intro] (What's ha...,2020s
3545,"Tombstone Lyrics[Intro] Damn, this motherfucke...",2020s


In [None]:
## Create Train/Val/Test Split (in 2 steps)
train, rem = sklearn.model_selection.train_test_split(bert_df, train_size = 0.7, random_state=42)
val, test = sklearn.model_selection.train_test_split(rem, train_size = 0.5, random_state = 43)

print("Train Shape: ", train.shape)
print("Val Shape:   ", val.shape)
print("Test Shape:  ", test.shape)
train.head(3)

Train Shape:  (2482, 2)
Val Shape:    (532, 2)
Test Shape:   (533, 2)


Unnamed: 0,Lyrics,Decade
944,Off the Wall Lyrics[Verse 1] When the world is...,1980s
199,"Bus Stop LyricsBus stop, wet day She's there, ...",1960s
3351,Lucid Dreams Lyrics[Intro] Enviyon on the mix ...,2010s
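
The two-step split above gives roughly a 70/15/15 partition. Here is a minimal sketch of the arithmetic, assuming the first part of each split is rounded down and the remainder goes to the second part. That assumption reproduces the shapes printed above; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_rows, first_fraction):
    """Return (n_first, n_second), rounding the first part down."""
    n_first = math.floor(first_fraction * n_rows)
    return n_first, n_rows - n_first


# Total row count, taken from the printed shapes above.
n_total = 2482 + 532 + 533

# Step 1: 70% train, 30% remainder. Step 2: split the remainder in half.
n_train, n_rem = split_sizes(n_total, 0.7)
n_val, n_test = split_sizes(n_rem, 0.5)

print(n_train, n_val, n_test)  # 2482 532 533
```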


### 2. Preparing the model input with the BERT Tokenizer

We use 'bert-base-cased' from Hugging Face as the underlying BERT model.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = TFBertModel.from_pretrained('bert-base-cased')


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
train["Lyrics"]

944     Off the Wall Lyrics[Verse 1] When the world is...
199     Bus Stop LyricsBus stop, wet day She's there, ...
3351    Lucid Dreams Lyrics[Intro] Enviyon on the mix ...
2276    He Can’t Love U Lyrics[Intro: Brandon] I ain't...
801     With a Little Luck Lyrics[Verse 1] With a litt...
                              ...                        
1130    You Got Lucky Lyrics[Intro] One, two  [Verse 1...
1294    Kiss Lyrics[Verse 1] You don't have to be beau...
860     What a Fool Believes Lyrics[Verse 1] He came f...
3507    Heat Waves Lyrics[Intro] (Last night, all I th...
3174    Partition Lyrics[Part 1: "Yoncé"]  [Intro] Let...
Name: Lyrics, Length: 2482, dtype: object

In [None]:
## check: do we get the output we want? YES!
# tokenizer([x for x in train["Lyrics"]], 
#               max_length=max_length,
#               truncation=True,
#               padding='max_length', 
#               return_tensors='tf')

In [None]:
pd.get_dummies(train["Decade"]) # to one-hot

Unnamed: 0,1960s,1970s,1980s,1990s,2000s,2010s,2020s
944,0,0,1,0,0,0,0
199,1,0,0,0,0,0,0
3351,0,0,0,0,0,1,0
2276,0,0,0,0,1,0,0
801,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...
1130,0,0,1,0,0,0,0
1294,0,0,1,0,0,0,0
860,0,1,0,0,0,0,0
3507,0,0,0,0,0,0,1
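
`pd.get_dummies` derives its columns from the values it sees, so calling it separately on each split only gives aligned columns if every split contains all seven decades. As a hedged alternative, here is a minimal plain-Python sketch of one-hot encoding against a fixed class list. `DECADES` and `one_hot` are illustrative names, not part of the notebook.

```python
# Fixed class order, matching the column order shown above.
DECADES = ["1960s", "1970s", "1980s", "1990s", "2000s", "2010s", "2020s"]


def one_hot(labels, classes=DECADES):
    """Encode each label as a 0/1 row with a single 1 at the class position."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return rows


# The first three training labels shown above.
encoded = one_hot(["1980s", "1960s", "2010s"])
for row in encoded:
    print(row)
```

Because the class list is fixed, the output always has seven columns in the same order, regardless of which decades occur in a given split.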


In [None]:
max_length = 512

x_train = tokenizer([x for x in train["Lyrics"]], 
              max_length=max_length,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_train = pd.get_dummies(train["Decade"])

x_val = tokenizer([x for x in val["Lyrics"]], 
              max_length=max_length,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_val = pd.get_dummies(val["Decade"])


x_test = tokenizer([x for x in test["Lyrics"]], 
              max_length=max_length,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_test = pd.get_dummies(test["Decade"])




In [None]:

y_train.shape

(2482, 7)

In [None]:
x_train

{'input_ids': <tf.Tensor: shape=(2482, 512), dtype=int32, numpy=
array([[  101,  8060,  1103, ...,  2551,  1106,   102],
       [  101,  8947,  6682, ...,     0,     0,     0],
       [  101, 13174,  2386, ...,  1267,  1240,   102],
       ...,
       [  101,  1327,   170, ...,     0,     0,     0],
       [  101,  9653, 13531, ...,  8552,  1179,   102],
       [  101,  4539,  8934, ...,  9562,   112,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2482, 512), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2482, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]], dtype=in

In [None]:
x_train.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

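`input_ids` hold the token ids and are the ones we really care about. `attention_mask` marks real tokens (1) versus padding (0), and `token_type_ids` is all zeros here because each input is a single segment.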
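
To see how the three tensors relate, here is a minimal sketch of what `padding='max_length'` and `truncation=True` do to a list of token ids. It is not the Hugging Face implementation: `pad_or_truncate` is a hypothetical helper, and the only values taken from the output above are the special ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`). The other ids are arbitrary examples.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids, max_length):
    """Build fixed-length model inputs from token ids (without special tokens)."""
    # Truncate so that [CLS] and [SEP] still fit.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)

    # Pad on the right; padded positions get attention_mask 0.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad

    # Single-segment input: every position belongs to segment 0.
    token_type_ids = [0] * max_length
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


short = pad_or_truncate([8947, 6682], max_length=6)
long_ = pad_or_truncate([8060, 1103, 2551, 1106, 1267, 1240], max_length=6)

print(short["input_ids"], short["attention_mask"])
print(long_["input_ids"], long_["attention_mask"])
```

The short example ends in zeros with a matching zero mask, while the truncated one ends in 102 with an all-ones mask. Both patterns appear in the `x_train` output above.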

### 3. BERT



**Questions:**
* What are the interpretations of the 3 outputs?
* Are the respective dimensions as expected?

### 4. Building our Classification Model (from scratch)

Let's build our classification model from scratch and run a few configurations.

In particular, we will consider:

* optimizer choices
* number of BERT layers to be retrained
* effects of freezing and unfreezing


#### 4.1 Build Classification Model (for real)

In [None]:
def create_classification_model(hidden_size = 200, 
                                train_layers = -1, 
                                optimizer=tf.keras.optimizers.Adam()):
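    """
    Build a simple classification model on top of BERT.
    To keep it simple, we add no dropout, layer norms, etc.

    train_layers: number of top transformer layers to fine-tune
                  (-1 fine-tunes all of BERT, 0 freezes all of BERT).
    """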

    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids_layer')
    token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='token_type_ids_layer')
    attention_mask = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask_layer')

    bert_inputs = {'input_ids': input_ids,
                  'token_type_ids': token_type_ids,
                  'attention_mask': attention_mask}


    # Restrict training to the last `train_layers` transformer layers
    # (train_layers == -1 leaves all BERT weights trainable).
    if train_layers != -1:

        # Name fragments of the layers to keep trainable: '_11', '_10', ...
        retrain_layers = ['_' + str(11 - n) for n in range(train_layers)]

        # Freeze every BERT weight whose name matches none of the fragments.
        for w in bert_model.weights:
            if not any(x in w.name for x in retrain_layers):
                w._trainable = False


    bert_out = bert_model(bert_inputs)


    classification_token = tf.keras.layers.Lambda(lambda x: x[:,0,:], name='get_first_vector')(bert_out[0])


    hidden = tf.keras.layers.Dense(hidden_size, name='hidden_layer')(classification_token)

    classification = tf.keras.layers.Dense(7, activation='softmax',name='classification_layer')(hidden)

    classification_model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], 
                                          outputs=[classification])
    
    classification_model.compile(optimizer=optimizer,
                            loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
                            metrics='accuracy')


    return classification_model
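
The freezing logic in the function above selects weights by substring match on their names. Here is a minimal sketch of the same selection on plain strings, using an exact match on the layer index instead. The weight names are assumed to follow a `.../encoder/layer_._<n>/...` pattern (an assumption about the Hugging Face TF BERT naming), and `trainable_weight_names` is a hypothetical helper.

```python
import re

# Assumed naming pattern for encoder layers, e.g. ".../layer_._11/...".
LAYER_PATTERN = re.compile(r"/layer_\._(\d+)/")


def trainable_weight_names(weight_names, train_layers, num_layers=12):
    """Return the names that stay trainable when only the last
    `train_layers` encoder layers are fine-tuned (-1 means all weights)."""
    if train_layers == -1:
        return list(weight_names)
    keep = set(range(num_layers - train_layers, num_layers))
    selected = []
    for name in weight_names:
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) in keep:
            selected.append(name)
    return selected


# Illustrative names only; a real model has many more weights.
names = [
    "tf_bert_model/bert/embeddings/word_embeddings/weight:0",
    "tf_bert_model/bert/encoder/layer_._1/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._10/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._11/output/dense/kernel:0",
    "tf_bert_model/bert/pooler/dense/kernel:0",
]

for name in trainable_weight_names(names, train_layers=2):
    print(name)
```

With `train_layers=2` only the layer 10 and layer 11 weights are kept. With `train_layers=0` nothing is kept, which corresponds to the fully frozen configuration.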

### 5. Experimentation

Let us compare a few configurations:

* 'default': Adam optimizer with default parameters (lr=0.001), all BERT layers fine-tuned
* 'smaller learning rate': Adam optimizer with lr=0.00005, all BERT layers fine-tuned
* 'frozen': Adam optimizer with lr=0.00005 (as in the code below), all BERT layers frozen

#### 5.1 Default -- doesn't learn


In [None]:
#classification_model = create_classification_model()     

In [None]:
# classification_model.fit([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
#                          y_train,
#                          validation_data=([x_val.input_ids, x_val.token_type_ids, x_val.attention_mask],
#                          y_val),
#                         epochs=3,
#                         batch_size=8)

#classification_model([x.input_ids, x.token_type_ids, x.attention_mask])

In [None]:
# classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
#                              batch_size=8, 
#                              steps=2)

What is this? All essentially the same prediction? And basically not better than always predicting the majority class for each example? It may seem like "BERT is no good for this task"?!

Careful, not so! There are a number of changes one can consider:

* Change the optimizer configuration
* Freeze some BERT layers, either for the entire training cycle or for the first few epochs.
* Add more data


#### 5.2 Lower Learning Rate


In [None]:
# Drop any previous model so we start again from the pretrained weights.
try:
    del classification_model
except NameError:
    pass

try:
    del bert_model
except NameError:
    pass

tf.keras.backend.clear_session()
bert_model = TFBertModel.from_pretrained('bert-base-cased')

classification_model = create_classification_model(optimizer=tf.keras.optimizers.Adam(0.00005))

classification_model.fit([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
                         y_train,
                         validation_data=([x_val.input_ids, x_val.token_type_ids, x_val.attention_mask],
                         y_val),
                        epochs=4,
                        batch_size=16)

## Commented out; run in the next cell after I interrupt this training
# classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
#                              batch_size=8, 
#                              steps=2)



Epoch 1/4


### Results

[See this Google Sheet](https://docs.google.com/spreadsheets/d/1DFTXUfE2SE4XCt-m4__YyI2BEb5miCl5WsLaXUOVExQ/edit?usp=sharing)

In [None]:
classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
                             batch_size=8, 
                             steps=2)

That seemed to work! Looks like the learning rate really mattered! (Of course, we have not focused here on finding the model with the best test accuracy. We simply wanted to 'get it to work'.)
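
Why can a learning rate that is too large prevent learning altogether? Here is a toy illustration, using plain gradient descent on f(w) = w² rather than Adam on BERT, so only the qualitative behaviour carries over. Each step multiplies w by (1 - 2·lr), which shrinks w for a small learning rate and blows it up once the learning rate exceeds 1.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2 with a fixed learning rate; return all iterates."""
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad
        history.append(w)
    return history


small = gradient_descent(lr=0.1)   # each step scales w by 0.8
large = gradient_descent(lr=1.1)   # each step scales w by -1.2

print(abs(small[-1]))  # close to the minimum at 0
print(abs(large[-1]))  # far away from the minimum
```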

#### 5.3 Layer Freezing

In [None]:
# Drop any previous model so we start again from the pretrained weights.
try:
    del classification_model
except NameError:
    pass

try:
    del bert_model
except NameError:
    pass

tf.keras.backend.clear_session()
bert_model = TFBertModel.from_pretrained('bert-base-cased')

classification_model = create_classification_model(train_layers=0, optimizer=tf.keras.optimizers.Adam(0.00005))

classification_model.fit([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
                         y_train,
                         validation_data=([x_val.input_ids, x_val.token_type_ids, x_val.attention_mask],
                         y_val),
                        epochs=5,
                        batch_size=8)

## Commented out; run in the next cell after I interrupt this training
# classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
#                              batch_size=8, 
#                              steps=2)

In [None]:
classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
                             batch_size=8, 
                             steps=2)

That 'worked' too! As expected, the final validation loss is larger and the validation accuracy is smaller though.

**Questions:**
* Is that expected?
* What else is different?

But either way, all of these parameters seem to be interrelated. Experiment!
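
One thing that differs between the two runs is the number of trainable parameters. Here is a back-of-the-envelope sketch for the frozen case, where only the two dense layers on top of BERT are trained. The input width of 768 is the hidden size of BERT-base, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


BERT_HIDDEN = 768    # hidden size of BERT-base
hidden_size = 200    # default of create_classification_model
num_classes = 7      # decades

head_params = (dense_params(BERT_HIDDEN, hidden_size)
               + dense_params(hidden_size, num_classes))

print(head_params)  # 155207
```

That is several hundred times fewer than the roughly 100 million parameters of BERT-base that are updated when all layers are fine-tuned.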

---

#### Idea for Viz of output

Similarity matrix for 10 "well-known" songs

Y = True Decade,
X = Pred Decade

further apart = redder. exactly on = green
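
Here is a minimal sketch of the numbers behind such a plot: a true-by-predicted count matrix over decades, plus each prediction's distance from the true decade, which could drive the green-to-red colouring. The labels below are made up for illustration, and `confusion_counts` is a hand-written helper, not the scikit-learn function.

```python
DECADES = ["1960s", "1970s", "1980s", "1990s", "2000s", "2010s", "2020s"]


def confusion_counts(y_true, y_pred, classes=DECADES):
    """matrix[i][j] = number of examples with true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def decade_distances(y_true, y_pred, classes=DECADES):
    """Number of decades between prediction and truth (0 = exactly right)."""
    index = {c: i for i, c in enumerate(classes)}
    return [abs(index[t] - index[p]) for t, p in zip(y_true, y_pred)]


# Hypothetical labels, for illustration only.
y_true = ["1960s", "1980s", "1980s", "2010s"]
y_pred = ["1960s", "1970s", "1980s", "2020s"]

matrix = confusion_counts(y_true, y_pred)
distances = decade_distances(y_true, y_pred)
mean_distance = sum(distances) / len(distances)

print(distances, mean_distance)
```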

---

### 6. Conclusions 

While one has to be careful about generalizing from one (truncated) dataset, the pattern is pretty clear: it is not enough to simply define the model and see what you get. Some investigation needs to be devoted to making sure that the combination of model details, optimizer configuration, and data works.

One big tell is a BERT model that does no better (or barely better) than always picking the majority class, while other models perform better.
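
Here is a minimal sketch of that sanity check: compute the accuracy of always predicting the most frequent training class, then compare the model's validation accuracy against it. The label lists are made up for illustration, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy on eval_labels of always predicting the most frequent training class."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_class)
    return majority_class, hits / len(eval_labels)


# Hypothetical labels, for illustration only.
train_labels = ["1980s", "1980s", "1960s", "2010s", "1980s", "1970s"]
val_labels = ["1980s", "1960s", "1980s", "2000s"]

majority_class, baseline_acc = majority_baseline_accuracy(train_labels, val_labels)
print(majority_class, baseline_acc)  # 1980s 0.5
```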

One should also say that there are other things to try in the learning phase, but the point of this notebook was to point out a few obvious issues. Previous students ran into precisely these issues!