<a id = 'top'></a>

#  A quick-start guide to fine-tune BERT with Keras
  * A. [Fine-tuned Keras Example](#fineTuned) 
  * B. [Datasets](#datasetClass) 
      * 1. [Tokenizer Selection](#tokenizerSelect)
      * 2. [Model Selection](#modelSelection)
      * 3. [Encode Data](#encodeData)
  * C. [Fine-Tuning](#fineTuning)
     

Hugging Face is a company that offers a library of "transformers" as well as pre-trained language models.  We are going to explore several ways of working with these models at a very high level.  In later classes, once we have covered how a transformer works, we'll come back and look at them in more depth.  This tutorial works with the Hugging Face library at the same level of abstraction as Keras rather than at the lower level of abstraction of TensorFlow.

You should take a look at the "A quick-start guide to fine-tune BERT with PyTorch and the Trainer Class" before you look at this notebook.  They're meant to be looked at together.

---

This directory includes three different uses of the HuggingFace library.  These uses are incompatible with each other, so you should run only one at a time and then stop and restart your notebook.  The link below opens this notebook in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2021-summer-main/tree/master/materials/walkthrough_notebooks/bert_as_black_box/HuggingFaceThreewWays_3_Keras.ipynb)

[Return to Top](#top)
 <a id = 'fineTuned'></a>
 # Fine-tuned Keras Example
 
This notebook demonstrates fine-tuning a TensorFlow HuggingFace model using Keras.  To do this, we'll need to select a dataset and a model, and be sure to use the tokenizer that matches that model.

This walkthrough borrows liberally from the fine-tuning description at https://huggingface.co/transformers/training.html

In [1]:
!pip install -q transformers
#!pip install transformers


In [2]:
!pip install -q datasets


[Return to Top](#top)
 <a id = 'datasetClass'></a>
# Datasets



We'll take advantage of the datasets library from Hugging Face to access a well-known corpus, IMDB.  It provides a sentiment classification task on movie reviews, which makes it good for learning how to work with the HuggingFace Transformers library and also for building baselines.

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")


Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.06 MiB, post-processed: Unknown size, total: 207.28 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b. Subsequent calls will reuse this data.


[Return to Top](#top)
 <a id = 'tokenizerSelect'></a>
### Tokenizer Selection 

Once again we'll use the AutoTokenizer object to avoid simple mistakes, because it ensures that we get the correct tokenizer for our pre-trained model.  This time the model we're using is BERT, and we're selecting the cased version (meaning case is preserved) and the base version (rather than the large version).

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
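The `padding="max_length"` and `truncation=True` arguments make every review the same length, which is what lets the examples be stacked into fixed-size tensors later. For bert-base-cased that length is 512 tokens. The toy sketch below shows the idea on plain lists of token ids. It is only an illustration: `pad_or_truncate` is a hypothetical helper, the pad id of 0 is assumed, and the real tokenizer also takes care to keep special tokens in place when it truncates.

```python
# Toy illustration only: `pad_or_truncate` is a hypothetical helper,
# not part of the transformers library.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # role of truncation=True
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # role of padding="max_length"
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what tells the model to ignore the padded positions.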


In [6]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
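The `shuffle(seed=42).select(range(1000))` pattern above draws a reproducible random subset: the seed fixes the shuffled order, and `select` keeps the first 1000 rows of that order. A rough sketch of the same idea with the standard library follows. `shuffled_subset` is a hypothetical helper, and because the datasets library uses its own random generator, the actual rows it picks will differ from the ones chosen here.

```python
import random

# Hypothetical helper that mimics shuffle(seed=...).select(range(n)) on row indices.
def shuffled_subset(num_rows, subset_size, seed):
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)   # a fixed seed gives a fixed order
    return indices[:subset_size]           # keep the first subset_size rows

# The IMDB training split has 25,000 reviews.
first = shuffled_subset(25000, 1000, seed=42)
second = shuffled_subset(25000, 1000, seed=42)
other = shuffled_subset(25000, 1000, seed=7)

print(first == second)   # same seed, same subset
print(first == other)    # different seed, different subset
```

Working with a 1000-example subset keeps training fast enough for a demonstration; the full splits are kept in `full_train_dataset` and `full_eval_dataset` for longer runs.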

[Return to Top](#top)
 <a id = 'modelSelection'></a>
### Model Selection

We'll use the AutoModel abstraction again, but this time we'll invoke the TensorFlow port since we want to use Keras to train and run the model.  As a result we'll instantiate a copy of TFAutoModelForSequenceClassification.  Note the 'TF' at the beginning of the class name, which designates it as a TensorFlow port.

In [7]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Return to Top](#top)
 <a id = 'encodeData'></a>
### Encode Data


Let's create the encoded data for training.  Because we're using the TensorFlow port, we'll need to convert our HuggingFace dataset object into a TensorFlow tf.data.Dataset.  HuggingFace provides some nice conversion functions to assist in the process.

In [8]:
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")

In [9]:
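# Keep only the columns the model expects (tokenizer.model_input_names).
# Assumption: with the datasets version used here, columns come back as ragged tensors, hence .to_tensor().
train_features = {x: tf_train_dataset[x].to_tensor() for x in tokenizer.model_input_names}
# Pair each feature dict with its label, then shuffle and batch for training.
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

# Same conversion for the evaluation split; no shuffling is needed here.
eval_features = {x: tf_eval_dataset[x].to_tensor() for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(8)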
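Batching by 8 determines how many gradient steps Keras takes per epoch. The short sketch below works through that arithmetic for the 1000-example subsets used here. `batch_indices` is a hypothetical helper that only mimics how consecutive rows are grouped; it is not how tf.data is implemented.

```python
import math

# Hypothetical helper: group row indices into consecutive batches.
def batch_indices(num_examples, batch_size):
    """Yield lists of row indices, one list per batch."""
    for start in range(0, num_examples, batch_size):
        yield list(range(start, min(start + batch_size, num_examples)))

batches = list(batch_indices(1000, 8))
steps_per_epoch = len(batches)
print(steps_per_epoch, math.ceil(1000 / 8))

# If the size is not a multiple of 8, the last batch is simply smaller.
uneven = list(batch_indices(1001, 8))
print(len(uneven), len(uneven[-1]))
```

With 1000 examples and a batch size of 8, each epoch is 125 steps, so three epochs amount to 375 weight updates.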

[Return to Top](#top)
 <a id = 'fineTuning'></a>
# Fine-Tuning

In keeping with the usual Keras workflow, we call model.compile to set the optimizer, loss, and metric, and then model.fit to run the training loop.

In [10]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
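The loss is built with `from_logits=True` because the classification head outputs raw scores (logits) rather than probabilities, and "sparse" means the labels are integer class ids rather than one-hot vectors. As a rough sketch of what is computed for a single example, the pure-Python functions below apply a softmax to the logits and take the negative log probability of the true class. These helper names are hypothetical and this is not the Keras implementation; Keras additionally averages the per-example losses over the batch.

```python
import math

# Hypothetical helpers for illustration; not the Keras implementation.
def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log softmax probability of the true class (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = sum(
        1 for logits, y in zip(batch_logits, labels)
        if logits.index(max(logits)) == y
    )
    return correct / len(labels)

# Equal scores for two classes give a 50/50 guess, so the loss is ln(2).
uninformed_loss = sparse_categorical_crossentropy_from_logits([0.0, 0.0], 0)
# A confident, correct prediction gives a loss close to zero.
confident_loss = sparse_categorical_crossentropy_from_logits([2.0, -1.0], 0)
accuracy = sparse_categorical_accuracy(
    [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]], [0, 1, 1]
)
print(uninformed_loss, confident_loss, accuracy)
```

A loss near ln(2), about 0.693, is therefore what an untrained two-class head is expected to produce.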

In [11]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  108310272 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
_________________________________________________________________
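The parameter counts in this summary can be checked by hand. The classifier is a single dense layer that maps BERT's hidden vector to one score per label. Assuming the hidden size of 768 used by the base BERT models, the arithmetic works out as follows:

```python
# Worked arithmetic for the summary above.
hidden_size = 768    # assumed: hidden size of the base BERT models
num_labels = 2       # the num_labels passed to from_pretrained

# Dense layer: one weight per (hidden unit, label) pair plus one bias per label.
classifier_params = hidden_size * num_labels + num_labels

encoder_params = 108_310_272   # the "bert" row of the summary
total_params = encoder_params + classifier_params

print(classifier_params, total_params)
```

Only the 1,538 classifier parameters are newly initialized. The rest come from the pre-trained checkpoint, but since all of them are listed as trainable, fine-tuning updates the whole network.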


In [12]:
model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)

Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f9163fa93d0>

### Future Discussion
Trained models can be saved and reused.  This makes it possible to train a model on a variety of tasks that may be related to your ultimate predictive task, save the result (basically the parameter values), and then reload that saved model later.  This process is called transfer learning, and we'll discuss it in a later live session.

In [None]:
# This notebook trains with Keras, so there is no `trainer` object; save the model directly.
model.save_pretrained("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/repo/clone/your-model-name")

In [None]:
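# Assumes the saved files were uploaded to the Hub under this name; a local directory path also works.
tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
# Use the sequence-classification class so the fine-tuned classifier head is loaded too.
model = TFAutoModelForSequenceClassification.from_pretrained("namespace/awesome-name-you-picked")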