<a id = 'top'></a>

#  A quick-start guide to fine-tune a BERT model with Keras
  * A. [What is BERT?](#introBERT)
  * B. [What is fine-tuning?](#fineTuned) 
  * C. [Datasets](#datasetClass) 
      * 1. [IMDB Description](#IMDBdesc)
      * 2. [Exploratory Data Analysis](#EDA)
  * D. [Model Preparation](#modelPrep)
      * 1. [Model Selection](#modelSelection)
      * 2. [Tokenizer Selection](#tokenizerSelect)
      * 3. [Auto Model](#autoModel)
      * 4. [Encode Data](#encodeData)
  * E. [Fine-Tuning The Model](#fineTuning)
     

Hugging Face is a company that offers a library of "transformers" as well as a collection of pre-trained language models.  These represent one source of code and abstract classes as well as a variety of documentation and examples. We are going to explore one way of working with these models at a very high level.  In later classes, when we have covered how a transformer works, we'll come back and look at them at a deeper level.  This tutorial is designed to look at the Hugging Face library at the same level of abstraction as the Keras Sequential API rather than at the lower level of abstraction of TensorFlow and the Keras Functional API.

---

Larger models, with millions or billions of parameters, can realistically only be trained on a machine with a GPU.  Do not run this notebook on your GCP instance, as the training epochs will take far too long.  If you run this notebook in Colab, it is automatically configured to use a GPU.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-fall-main/blob/master/materials/walkthrough_notebooks/bert_as_black_box/Keras_HuggingFace_Transformers_BERT_notebook.ipynb)

[Return to Top](#top)
 <a id = 'introBERT'></a>
# What is BERT?
This notebook leverages one of a variety of BERT models.  BERT models can be characterized in terms of three parts.  The first part is a component named a transformer.  These can grow to be quite large: BERT consists of either 12 (base) or 24 (large) transformer layers. The second part is the training (called pre-training) the model has already received on language.  The pre-training is characterized by one or more tasks.  The third part consists of the very specific tasks it is geared toward performing.  Different models use different sizes and numbers of transformer layers and may be optimized for different languages and different tasks.  For example, CamemBERT is trained on French text and SciBERT is trained on scientific journal articles.  You'll want to make sure you use a model appropriate to your language and task.

---

The [HuggingFace web site](https://huggingface.co/transformers) offers an interesting set of resources.  Their [model documentation](https://huggingface.co/transformers/model_summary.html) provides an excellent explanation of transformers as well as the growing variety of models they offer (see the right hand navigation column).  In addition, their collection of [notebooks](https://huggingface.co/transformers/notebooks.html) is a valuable set of examples.  

---

One word of caution:  this is a rapidly evolving resource, and as a result you can often run into bugs.  They will get fixed eventually, but things may be buggy for a while.  

[Return to Top](#top)
 <a id = 'fineTuned'></a>
 # Fine-tuning a Model
 
We'll use abstract classes that simplify the process of training by consolidating a number of pieces under one class. It's a good way to begin working with these models.  PyTorch is the native computational graph language used in Hugging Face. However, they make a point of porting models to TensorFlow, Google's computational graph language. Many models first get put on HuggingFace in PyTorch. Eventually they get ported over to TensorFlow. Depending on what model you want to use, you may have to run the PyTorch version. It's important to always be aware of which dialect you're using. The good news is that HuggingFace has built these models so that the underlying weight parameters can be used across PyTorch and TensorFlow implementations. It is simply the commands you use to construct, run, and manipulate the model that are in PyTorch or TensorFlow.  This notebook will demonstrate fine-tuning a TensorFlow HuggingFace model using Keras.  To do this, we'll need to select a data set, a model, and be sure to invoke its tokenizer.
 
 Borrowing liberally from the fine-tuning description in https://huggingface.co/transformers/training.html

In [1]:
!pip install -q transformers

[Return to Top](#top)
 <a id = 'datasetClass'></a>
# Datasets



HuggingFace provides [a class for managing datasets](https://huggingface.co/docs/datasets/index). They also provide a library of actual data that is accessible via this datasets class. We'll take advantage of the datasets object in HuggingFace to access some well-known corpora, specifically IMDB. 

In [2]:
!pip install -q datasets

[Return to Top](#top)
 <a id = 'IMDBdesc'></a>
### IMDB Description

IMDB is a set of movie reviews set up for a binary sentiment classification task.  It is good for learning how to work with the HuggingFace Transformers library and also good for baselines. 

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



[Return to Top](#top)
 <a id = 'EDA'></a>
### Exploratory Data Analysis

Let's look inside the IMDB dataset and see what it contains.  We see it is already split into train, test, and unsupervised records.  

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Here is one sample record from the test set.  Each record contains a label and some text.  Different data sets will have different parts in their records.

In [5]:
raw_datasets['test'][0]

{'label': 0,
 'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'

Here is a utility function that leverages the dataset structure to load 10 random records from the dataset into a data frame and display them.

In [6]:
#from https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
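
The helper above draws unique row indices with a rejection loop and then replaces integer labels with their names. As a minimal sketch of those two steps in isolation (not part of the original Hugging Face helper), the index selection can also be written with the standard library's `random.sample`. The names `pick_unique_indices` and `label_names` are hypothetical; `label_names` stands in for `typ.names`, with the neg/pos order assumed from the output shown in this notebook.

```python
import random

def pick_unique_indices(dataset_len, num_examples=10, seed=None):
    """Return num_examples distinct row indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, so a seed makes the picks repeatable
    return rng.sample(range(dataset_len), num_examples)

# Stand-in for a ClassLabel's names list (assumed order: 0 -> neg, 1 -> pos).
label_names = ["neg", "pos"]

picks = pick_unique_indices(25000, num_examples=10, seed=0)
readable = [label_names[i] for i in [0, 1, 1, 0]]
print(picks)
print(readable)
```

`random.sample` never returns the same index twice, so no retry loop is needed.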

In [7]:
show_random_elements(raw_datasets['test'])

Unnamed: 0,text,label
0,"You want to know what the writers of this movie consider funny? A robot child sees his robot parents killed (beheaded, as I recall), and then moves between their bodies calling their names. Yeah--what a comic moment. This is the worst movie I ever paid to see.",neg
1,"i think the title of the movie describes it well. if you are looking for a documentary on louis kahn and his work, you'll have to look somewhere else. although some of that is covered in this film.<br /><br />of course, i eat up pretty much anything i'm fed, and i don't know much of the family history revolving around this case. so i believed what i was told about nathaniel and his father, etc.<br /><br />for what this movie was, i thought it was pretty good. a little slow and grabbing for attention at time, i wish that nathaniel would have focused a little more on his father's work than his family drama (although much of the history was interesting, louis was a bit of a player).<br /><br />this really is a journey through someone's life, and i was happy to tag along for the experience. a learning experience for me, and so it seems, for the filmmaker as well. <br /><br />oh, and the footage of some of kahn's work is *stunning*",pos
2,"Even not being a fan of the ""Star Trek"" movies or universe of shows and books and such, I still find some enjoyment in some of the movies featuring the old cast and in the case of ""First Contact"" even the new cast a bit. This one though was kind of sad to watch...it seemed to want to be so much, but it failed on so many levels to be one of the worst Star Trek movies. The plot is very far fetched seeming to want to combine three or four stories into one ultimate Trek adventure, but it ends up an unfunny when it tries to be, not tense when it wants to be and not action packed like it tries to be mess of inconsistencies. The whole movie to take a phrase from Spock is illogical. The effects are nothing special as I have seen episodes of Next Generation that are just as good, which is to say it is fine for a television show, but not a major motion picture. The plot is laughable as the gang at first tries to stop Spock's brother then joins him on his quest to find God, yes you read that correctly. The Klingons make a tacked on appearance, which actually will set up the much better Undiscovered Country movie. All in all you know it is bad when the best part of the film is Kirk, Bones and Spock singing row your boat, well Spock was not really singing, but rather questioning the lyrics.",neg
3,"Outstanding performance by Tantoo Cardinal. She carries this movie alone. Rip Torn is great but just a shadow to Tantoo. A bitter sweet story of a woman who loves a very stubborn man. Beautiful, funny, sad, touching, a must see film.",pos
4,"Technically I'am a Van Damme Fan, or I was. this movie is so bad that I hated myself for wasting those 90 minutes. Do not let the name Isaac Florentine (Undisputed II) fool you, I had big hopes for this one, depending on what I saw in (Undisputed II), man.. was I wrong ??! all action fans wanted a big comeback for the classic action hero, but i guess we wont be able to see that soon, as our hero keep coming with those (going -to-a-border - far-away-town-and -kill -the-bad-guys- than-comeback- home) movies I mean for God's sake, we are in 2008, and they insist on doing those disappointing movies on every level. Why ??!!! Do your self a favor, skip it.. seriously.",neg
5,"The mystery and its solution was a great noir conceit. I do have some questions though, maybe I wasn't paying enough attention.<br /><br />Who killed the neighbor and why? Who killed the replacement girl and why?<br /><br />And some minor quibbles, they should have shown the stopoff at the hotel for the switch. Not that they should have shown the switch, but they should have shown Jim and the girl going in the hotel, Jim going to the bathroom, coming out and being told by the bartender that his girlfriend went to the car without him.<br /><br />then, Jim getting back in the car and seeing the sleeping woman, and little girl in his back seat.<br /><br />This would have given the viewer a sporting chance at figuring out the solution.<br /><br />I wish I taped it though, I'd like to see it again.",pos
6,"This movie brought together some of the old Spinal crew for another mockumentary film, this time revolving around the world of the Dog Show, how their owners prepare and train for the show before moving on to the show itself.<br /><br />We meet several teams as they hope to win the top prize- The Fleck's, Cookie who seems to have slept with every man ever, and Gerry who tries to cope with his wife's old escapades and the fact that he literally has two left feet. Harlan, whose dog talks to him, and enjoys ventriloquism. The Swan's who have taken far too much coffee and scream at each other. Donalan and Vanderhoof the gay couple, and Cabot and Cummings who have won the last two years. Fred Willard commentates on the show, and is very funny as always. Funny scenes include the 'Look at me!' scene, and any with Levy. Unfortunately some of the best scenes were deleted or filmed later- Willard interviewing Leslie Cabot, and the alternative epilogue with Gerry is one of the funniest things i have ever seen. If these had been included, i would give the film an extra mark. But...<br /><br />7 out of 10",pos
7,"Throughout the world the unmistakable imprint of the American C.I.A. can be found in many a muddled mess they have left behind. In the beginning, their objectives were simple: spy, remove enemy agents, steal classified information and destabilize unfavorable governments. Years have elapse and although their mission remains similar, their clandestine black operations now include domestic spying, discrediting U.S. citizens and infiltrating American organizations who criticize the U.S. government. This movie however, centers on the C.I.A.'s world manhunt for the infamous 'Carlos, the Jackel.' The film is called "" The Assignment "" and tells the story Lt. Cmdr. Annibal Ramirez, (Aidan Quinn) a U.S. naval officer who bears a striking resemblance to the mastermind of so many terrorist bombings. Recruited by Jack Shaw (Donald Sutherland) of the C.I.A. and Amos (Ben Kingsley), a special agent from the Israeli Mosad, Ramirez is secretly trained to look, pose, infiltrate the elusive organization and to thereafter discredit the real Jackel working for the Russians. This film is Explosively exciting, and packed with wild chases, killings and inter-country mayhem. Quinn is wonderful and surprisingly artistic playing both sides of the war. Easily one of his best efforts. ****",pos
8,"Bad script? Check. Awful effects? Check. Horrible actors? Check. Lame direction? Check.<br /><br />After seeing the DVD box at blockbuster video and being a fan of the horror genre, I placed my $4.28 on the line and rented this ""film."" My girlfriend was out of town and I was bored so on a late Tuesday night I decided this would be a perfect time for me to watch, what appeared to be (based on the box cover art) a horror movie. What I got instead was the worst film ever made. Up until that point I had always declared ""Slumber Party Massacre 3"" the worst film ever made.<br /><br />If you are the type that wants to see a movie because you heard how bad it is, this is for you. If you don't want to lose $4.00 and 80 irreplaceable minutes of your life, steer clear of this garbage.<br /><br />An added note: I noticed a few of the ""actors"" come on here and post comments on the bulletin board. How can you brag about being in this film? You were all horrible. I mean really bad. If there was an American Idol for actors, you all would be laughed at in the first few episodes.<br /><br />Peace.<br /><br />Sutter Cain",neg
9,"83 minutes? Nope, this thing is 72 minutes, tops.<br /><br />If you cannot guess the killer in this movie, you had better throw your TV out the window, because you ain't learned nothing in 20+ years of cinematic slasher history.<br /><br />And how come the plain star who never gets naked is always the one you want to get naked?",neg


[Return to Top](#top)
 <a id = 'modelPrep'></a>
## Model Preparation

[Return to Top](#top)
 <a id = 'modelSelection'></a>
### Model Selection

We need to pick the model we are going to train to classify IMDB.  We'll do that in several stages. First we define some variables to hold information about the model that we'll re-use.

In [8]:
model_checkpoint = "bert-base-cased"
batch_size = 8

[Return to Top](#top)
 <a id = 'tokenizerSelect'></a>
### Tokenizer Selection 

We'll use the [AutoTokenizer](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial#autotokenizer) object to avoid simple configuration mistakes because it ensures that we get the correct tokenizer given our pre-trained model.  This time the model we're using is BERT and we're selecting the cased version (meaning case is preserved) and the base version (rather than the large version).

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

What's the tokenizer doing?  It's taking care of breaking down a sentence into the parts the model can understand and was trained on, as well as a bunch of housekeeping that's needed by the model in order to work properly.  Once we've covered how a transformer works in live session, we'll come back (in week 4) and discuss its various components.  For now, you don't need to understand it in order to make use of it.

Here's one example of what the tokenizer outputs.

In [10]:
tokenizer("Hello, we only need one sentence for our task but these reviews often have more.")

{'input_ids': [101, 8667, 117, 1195, 1178, 1444, 1141, 5650, 1111, 1412, 4579, 1133, 1292, 3761, 1510, 1138, 1167, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer converts the incoming words to integer ids that are used to retrieve the model's input word embeddings.  All tokenizers convert words to input ids.  The wrong tokenizer will produce the wrong set of token ids and result in very poor predictions.  The AutoTokenizer ensures the correct ids are assigned.
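
To make the id lookup concrete, here is a toy sketch that reproduces the `input_ids` shown above with a plain dictionary. The vocabulary is a tiny subset read off the notebook output by aligning each id with its word; this assumes a one-to-one alignment, which the matching counts (17 words and punctuation marks plus 2 special tokens, 19 ids) suggest. The function `toy_encode` is hypothetical. The real tokenizer uses WordPiece subword splitting over a much larger vocabulary, which this sketch does not attempt.

```python
import re

# Toy vocabulary reconstructed from the output above (an assumption, not the
# real vocabulary file). 101 and 102 are the [CLS] and [SEP] special tokens.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "Hello": 8667, ",": 117, "we": 1195, "only": 1178, "need": 1444,
    "one": 1141, "sentence": 5650, "for": 1111, "our": 1412, "task": 4579,
    "but": 1133, "these": 1292, "reviews": 3761, "often": 1510,
    "have": 1138, "more": 1167, ".": 119,
}
UNK_ID = 100  # id of [UNK] in BERT vocabularies

def toy_encode(text):
    # Split into words and standalone punctuation, then wrap in special tokens.
    tokens = ["[CLS]"] + re.findall(r"\w+|[^\w\s]", text) + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # a single segment
        "attention_mask": [1] * len(input_ids),  # every position is a real token
    }

encoded = toy_encode(
    "Hello, we only need one sentence for our task but these reviews often have more."
)
print(encoded["input_ids"])
```

Any word missing from the toy dictionary maps to `[UNK]`. That is a small-scale version of what goes wrong when a tokenizer and a model disagree about the vocabulary.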

In [11]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
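
The `padding="max_length"` and `truncation=True` arguments give every review the same length (512 positions for this BERT checkpoint). The sketch below illustrates the idea on plain lists with a much smaller length. `pad_or_truncate` is a hypothetical helper, not the library's implementation. It assumes 0 is the `[PAD]` id and keeps the final `[SEP]` token when it truncates.

```python
PAD_ID = 0  # id of [PAD] in BERT vocabularies

def pad_or_truncate(input_ids, max_length):
    """Return fixed-length ids plus an attention mask (1 = real token, 0 = padding)."""
    if len(input_ids) > max_length:
        # Cut the content but keep the closing [SEP] token in the last slot.
        ids = input_ids[:max_length - 1] + [input_ids[-1]]
    else:
        ids = list(input_ids)
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

short_ids = [101, 8667, 117, 102]
long_ids = [101] + list(range(1000, 1010)) + [102]

print(pad_or_truncate(short_ids, 8))
print(pad_or_truncate(long_ids, 8))
```

The attention mask is how the model knows which positions are padding and should be ignored.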

In [12]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
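
The `shuffle(seed=42).select(range(1000))` pattern takes a reproducible random sample: shuffle with a fixed seed, then keep the first 1000 rows. A rough pure-Python sketch of the same idea follows. `shuffled_subset` is a hypothetical helper, and because it uses Python's `random` module rather than the generator inside the datasets library, the specific rows chosen will differ from the notebook's.

```python
import random

def shuffled_subset(rows, n, seed=42):
    """Shuffle a copy of rows with a fixed seed, then keep the first n."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n]

rows = list(range(25000))  # stand-in for 25,000 row indices
subset_a = shuffled_subset(rows, 1000)
subset_b = shuffled_subset(rows, 1000)
print(subset_a[:5], subset_a == subset_b)
```

Shuffling before selecting matters whenever the stored rows might be grouped by label, since taking the first 1000 unshuffled rows could then yield a sample containing a single class.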

[Return to Top](#top)
 <a id = 'autoModel'></a>
### Automatic Model Configuration

We'll use the [AutoModel abstraction](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial#automodel) and invoke the TensorFlow port since we want to use Keras to train and run the model.  As a result we'll instantiate a copy of TFAutoModelForSequenceClassification.  Note the 'TF' at the beginning of the class name to designate it as a TensorFlow port.  The model for "sequence classification" is specifically structured to perform classification based on sequences of words like a sentence.  HuggingFace provides a set of models specifically configured for [particular NLP tasks](https://huggingface.co/docs/transformers/main/en/model_doc/auto) as shown by all of the AutoModelFor *FillInTheTask*.

In [13]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
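
The message above says the `classifier` layer was newly initialized. That layer is a dense layer on top of BERT's pooled output that produces one logit per label, and it starts from random weights because pre-training never saw our labels. Here is a minimal sketch of what such a head computes, using a hidden size of 4 instead of BERT-base's 768 to keep it readable. The helper names and the 0.02 initialization scale are assumptions for illustration.

```python
import random

def init_dense(in_dim, out_dim, seed=0):
    """Randomly initialize a dense layer: small Gaussian weights, zero bias."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def dense_forward(x, weights, bias):
    """Compute logits[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

hidden_size, num_labels = 4, 2
weights, bias = init_dense(hidden_size, num_labels)
pooled = [0.5, -1.0, 0.25, 2.0]  # made-up pooled sentence representation
logits = dense_forward(pooled, weights, bias)
print(logits)
```

Until fine-tuning adjusts these weights, the two logits carry no information about sentiment, which is why the library recommends training before use.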


[Return to Top](#top)
 <a id = 'encodeData'></a>
### Encode Data


Let's create the encoded data for training.  Because we're using the TensorFlow port, we'll need to convert the contents of our HuggingFace dataset objects into TensorFlow tensors.  HuggingFace provides some nice conversion functions to assist in the process.

In [14]:
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")

In [15]:
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(batch_size)

eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(batch_size)
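
The `.batch(batch_size)` call groups consecutive examples so the model processes 8 at a time. A small sketch of that grouping on a plain list shows how many gradient steps one epoch takes with the 1000-example subset. The generator `batches` is a hypothetical stand-in, not the `tf.data` implementation.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

examples = list(range(1000))  # stand-in for the 1000 training examples
batch_size = 8

all_batches = list(batches(examples, batch_size))
steps_per_epoch = math.ceil(len(examples) / batch_size)
print(len(all_batches), steps_per_epoch)
```

With 1000 examples and a batch size of 8, each epoch consists of 125 steps.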

[Return to Top](#top)
 <a id = 'fineTuning'></a>
# Fine Tuning

In keeping with the Keras process we call model.compile first to make sure that all the pieces are in place.  We can follow that up with a call to model.summary() to make sure we've put the correct players together in the correct manner.  We can also see how much is trainable, which gives a sense of training time and resource requirements.

In [16]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
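
The loss is configured with `from_logits=True` because the model outputs raw scores (logits) rather than probabilities, and "sparse" means each label is a single integer rather than a one-hot vector. A pure-Python sketch of the per-example loss and the accuracy metric, on made-up logits, might look like this (the function names are illustrative, not the Keras internals):

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Compute -log(softmax(logits)[label]) with the log-sum-exp trick."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the integer label."""
    correct = sum(
        1 for lg, y in zip(batch_logits, labels) if lg.index(max(lg)) == y
    )
    return correct / len(labels)

batch_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]  # made-up model outputs
labels = [0, 1, 0]

losses = [
    sparse_categorical_crossentropy_from_logits(lg, y)
    for lg, y in zip(batch_logits, labels)
]
accuracy = sparse_categorical_accuracy(batch_logits, labels)
print(losses, accuracy)
```

A confident correct prediction (first example) gives a small loss, an undecided one (second) gives log 2, and a confident wrong one (third) gives a large loss.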

In [17]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
_________________________________________________________________
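
The parameter counts in the summary can be checked by hand. The classifier is a dense layer from the pooled output to the two labels, so it has one weight per (input, label) pair plus one bias per label. The sketch below assumes BERT-base's hidden size of 768 and 4 bytes per float32 parameter; the BERT count is copied from the summary above.

```python
hidden_size = 768          # width of BERT-base's pooled output (assumed)
num_labels = 2             # binary sentiment: neg / pos
bert_params = 108_310_272  # "bert (TFBertMainLayer)" row in the summary

# Dense layer: a weight matrix of hidden_size x num_labels plus a bias vector.
classifier_params = hidden_size * num_labels + num_labels
total_params = bert_params + classifier_params

# Rough size of the weights alone, if stored as float32.
weights_megabytes = total_params * 4 / 1e6

print(classifier_params, total_params, round(weights_megabytes))
```

Nearly all of the roughly 108 million trainable parameters sit in the pre-trained BERT layers; the task-specific head adds only 1,538.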


Then we call model.fit to perform the actual training.  Note that, because of what the model has learned about language in its pre-training phase, we can often limit our training to a small number of epochs.

In [18]:
model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f713390dd90>