<a id = 'top'></a>

#  A quick-start guide to fine-tune a BERT model with Keras
  * A. [What is BERT?](#introBERT)
  * B. [What is fine-tuning?](#fineTuned) 
  * C. [Datasets](#datasetClass) 
      * 1. [IMDB Description](#IMDBdesc)
      * 2. [Exploratory Data Analysis](#EDA)
  * D. [Model Preparation](#modelPrep)
      * 1. [Model Selection](#modelSelection)
      * 2. [Tokenizer Selection](#tokenizerSelect)
      * 3. [Auto Model](#autoModel)
      * 4. [Encode Data](#encodeData)
  * E. [Fine-Tuning The Model](#fineTuning)
     

Hugging Face is a company that offers a library of "transformers" as well as a collection of pre-trained language models.  Together these provide code and abstract classes along with a variety of documentation and examples. We are going to explore one way of working with these models at a very high level.  In later classes, when we have covered how a transformer works, we'll come back and look at them at a deeper level.  This tutorial is designed to look at the HuggingFace library at the same level of abstraction as the Keras Sequential API rather than at the lower level of abstraction of TensorFlow and the Keras Functional API.

---

Larger models, with millions or billions of parameters, can realistically be trained only on a machine with a GPU.  Do not run this notebook on your GCP instance, as the training epochs will take far too long.  If you run this notebook in Colab, it is automatically configured to use a GPU.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-summer-main/blob/master/materials/walkthrough_notebooks/bert_as_black_box/Keras_HuggingFace_Transformers_BERT_notebook.ipynb)

[Return to Top](#top)
 <a id = 'introBERT'></a>
# What is BERT?
This notebook leverages one of a variety of BERT models.  BERT models can be characterized in terms of three parts.  The first part is a component called a transformer.  These can grow to be quite large: BERT stacks either 12 (base) or 24 (large) transformer layers. The second part is the training the model has already received on language, called pre-training, which is characterized by one or more tasks.  The third part consists of the specific downstream tasks the model is geared toward performing.  Different models use different sizes and numbers of transformer layers and may be optimized for different languages and different tasks.  For example, CamemBERT is trained on French text and SciBERT is trained on scientific journal articles.  You'll want to make sure you use a model appropriate to your language and task.

---

The [HuggingFace web site](https://huggingface.co/transformers) offers an interesting set of resources.  Their [model documentation](https://huggingface.co/transformers/model_summary.html) provides an excellent explanation of transformers as well as the growing variety of models they offer (see the right hand navigation column).  In addition, their collection of [notebooks](https://huggingface.co/transformers/notebooks.html) is a valuable set of examples.  

---

One word of caution:  this is a rapidly evolving resource, and as a result you can often run into bugs.  They will get fixed eventually, but things may be buggy for a while.

[Return to Top](#top)
 <a id = 'fineTuned'></a>
 # Fine-tuning a Model
 
We'll use abstract classes that simplify the process of training by consolidating a number of pieces under one class. It's a good way to begin working with these models.  PyTorch is the native framework for HuggingFace models. However, HuggingFace makes a point of porting models to TensorFlow, Google's computational graph framework. Many models are first published on HuggingFace in PyTorch and are only later ported to TensorFlow, so depending on which model you want to use, you may have to run the PyTorch version. It's important to always be aware of which framework you're using. The good news is that HuggingFace has built these models so that the underlying weights can be shared across the PyTorch and TensorFlow implementations; only the commands you use to construct, run, and manipulate the model are specific to PyTorch or TensorFlow.  This notebook demonstrates fine-tuning a TensorFlow HuggingFace model using Keras.  To do this, we'll need to select a dataset and a model, and be sure to invoke the model's tokenizer.
 
This section borrows liberally from the fine-tuning description at https://huggingface.co/transformers/training.html

In [1]:
!pip install -q transformers


[Return to Top](#top)
 <a id = 'datasetClass'></a>
# Datasets



HuggingFace provides [a class for managing datasets](https://huggingface.co/docs/datasets/index). They also provide a library of actual data that is accessible via this datasets class. We'll take advantage of the datasets object in HuggingFace to access some well-known corpora, specifically IMDB.

In [2]:
!pip install -q datasets


[Return to Top](#top)
 <a id = 'IMDBdesc'></a>
### IMDB Description

IMDB is a set of movie reviews. It is set up for a binary sentiment classification task.  It is good for learning how to work with the HuggingFace Transformers library and also good for establishing baselines.

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



[Return to Top](#top)
 <a id = 'EDA'></a>
### Exploratory Data Analysis

Let's look inside the IMDB dataset and see what it contains.  We see it is already split into train, test, and unsupervised records.  

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Here is one sample record from the test set.  Each record contains a label and some text.  Different datasets will have different fields in their records.

In [5]:
raw_datasets['test'][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

Here is a utility function that leverages the dataset structure to pick 10 random records from the dataset, load them into a data frame, and display them.

In [6]:
#from https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
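
The loop in the function above keeps drawing until it has collected `num_examples` distinct indices. As a minimal sketch of the same idea, the standard library's `random.sample` draws distinct indices in a single call. The helper name `pick_unique_indices` and its `seed` parameter are hypothetical and are not part of the HuggingFace notebook this function came from.

```python
import random

def pick_unique_indices(dataset_len, num_examples=10, seed=None):
    """Return num_examples distinct indices in the range [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # private generator, so seeding does not touch global state
    return rng.sample(range(dataset_len), num_examples)

# 25000 is the size of the IMDB test split shown earlier.
picks = pick_unique_indices(25000, num_examples=10, seed=0)
print(picks)
```

Passing a fixed seed makes the selection repeatable, which can be handy when you want to discuss the same sample records with someone else.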

In [7]:
show_random_elements(raw_datasets['test'])

Unnamed: 0,text,label
0,"Yet another ""gay"" film ruined by asinine politics. Luigi's final speech just about sent me running out of the theatre with its bumper-sticker epigrams. Read the comic book it was based on for a much more entertaining experience.",neg
1,"If you thought ""ROSEMARY'S BABY"" was bad, this one isn't much better. Easily one of the worst movies ever made, like it's lame predecessor, it goes nowhere fast. <br /><br />Rating: 1/2* out of *****",neg
2,"As has already been noted, the short film ""Every Sunday"" (1936) could be considered the first music video. This was a happy accident resulting from MGM's need to crank out a variety of short films for exhibit with its feature length material. They had a couple fresh young singing talents (Judy Garland and Deanna Durbin) available and essentially slapped together a blend of music styles in a kind of Norman Rockwell concert in the park setting. <br /><br />Who would have dreamed at the time that they would capture the best collection of images since Eisenstein's ""Odessa Steps"" sequence. <br /><br />It's Sunday with some inattentive folks sitting around a small wooden band shell in the park while a tired looking ensemble play Strauss. Events unfold and the next Sunday Judy and Deanna save the day. The operatic Deanna sings ""Il Bacio"" (The Kiss) and Garland follows with the contrasting ""Waltz with a Swing"". The climax nicely blends the two styles into a duet of ""Americana"". <br /><br />A must see.<br /><br />Then again, what do I know? I'm only a child.",pos
3,"We start all of our reviews with the following information. My wife and I have seen nearly 100 movies per year for the past 15 years. Recently, we were honored by receiving lifetime movie passes to any movie any time at no cost! So we can see whatever we want whenever we want. The point of this is that CRITICS count for ZERO. Your local critics or the national critics like Ebert are really no different than you or me. The only difference is that they get to write about the movie and are forced to see hundreds of movies whether they want to or not.Therefore, it is our belief that if you get your monies worth for two hours of enjoyment that is good enough for us! We NEVER EVER listen or read the critics. We only care about our friends and those who we know like the same things as us. Well enough about that. <br /><br />When Meryl Streep the head of the NSC in the movie says ""The United States does not torture"" it got a big laugh at this movie. It is of course a lie that the Bush Administration has denied time and again. It is a very good movie and it is scary in what they can do to us as we lose all of our civil rights. They can simply ""snatch"" you anywhere and tell know one that they have done it. In this case, they snatch a man who has a name similar to those who killed thousands on 9-11. He is of course just like you or I. And so they take him to a secret location outside of the US to torture and waterboard him. <br /><br />Very frightening. Well acted by Jake and Reese and the entire cast.",pos
4,"Let me say this new He-Man cartoon is not destroying childhood memories, as I didn't like the old He-Man cartoon either. I loved the action figures, but I found the cartoon to be corny and I hated the storyline (the He-Man I liked was the one from the very early, pre-cartoon mini-comics included in the figure boxes, where He-Man was a Barbarian, the Sword of Power was split in two pieces, and there was no Prince Adam, no Shazam-ripoff premise, and no Orko). Anyway, let's leave the old stuff alone.<br /><br />The new He-Man cartoon (or at least this pilot) is a disgrace on its own, s it represents both the worst cheesiness of the old show, and the worst tendences of nowadays. I watched it because I had heard the in-your-face morals of the old show were (thankfully) gone, and this one had more swordplay and character development. But I encountered an awful mishmash of the worst clichés of the genre, characters I couldn't help but hate, and the sadly inevitable Matrix-esque visual style.<br /><br />I think it was a good idea to give a bit of a background to the characters, as it was showing a pre-face-peeling Skeletor, but that's how far the character development goes, aside from the usual coming-to-age rubbish I see coming in the subsequent episodes, where this teenybopper Adam will be learning the responsibility of his new-acquired powers, blah blah. At least in the old show Adam was not a spoiled brat! I found myself hating his guts. That's what we get when they put out an adventure show aimed at pre-teens: pretty faces and wanna-be-cool-and-look-juvenile clothes. I should check new episodes to see if N'Sync make a special appearance. Man, does this show remind me to the 1996 Flash Gordon stinkbomb cartoon!<br /><br />Dialogue? Ha! It follows absolutely every cliché in the book, from the goody-lil-two-shoes Randor in the opening scene to Skeletor's immortal ""Oh, and He-Man... I lied!"" in the ending. 
And Skeletor's voice is still the same high-pitch kind than in the old series. 20 years, and nothing we have learnt.<br /><br />And sure, nowadays there can't exist something remotely action-related that's not Matrix-style. Leave Anime to the Japanese, folks, think fresh ideas. And seeing the characters' poses while fighting didn't help either.<br /><br />Of course, we have our usual dose of PCness: Evil-Lyn (now I think about it, who the hell comes up with these names?) has no yellow skin now, but grey-ish. Oh, so no Asian people will be offended. I bet Jitsu won't appear in the show either. Shades of the 1996 FG again, where Ming the Merciless was a green, toad-like alien!<br /><br />People are complaining about Cringer's lack of speech. I don't think I would have liked this more or less if Cringer spoke, he's corny enough this way. And you have your extra ration of cheese with Orko! The shocking thing is, probably many of the people who (rightfully) hated Jar Jar Binks, might be huge Orko fans...<br /><br />I watched the feature-lenght pilot, and I've had enough. Leave the series alone. 2 out of 10.",neg
5,"Yes, I spelled that right. This movie is so predictable, the actual word needs additional letters to exemplify the predictability. From the moment the principal characters and situation are introduced, it is paint-by-numbers as to where this plot will take us. The foreshadowing was as subtle as a two ton sledgehammer. You could take numerous pieces of dialogue and anticipate the role it would play in the ending.<br /><br />Catherine Zeta-Jones and Aaron Eckhardt did decent jobs in undemanding roles and Abigail Breslin played the cute role admirably. It's just that the movie brought absolutely nothing new to the romantic comedy genre. The romance was tepid and the laughs were weak and few. Sure, it's an OK movie if you have nothing to watch, but you won't miss anything by missing this one.",neg
6,"Gake no Ue no Ponyo is a beautifully animated film and a relief from the many heartless soulless CGI movies being made. The pastel and color pencil backgrounds were a surprise after Sen to Chihiro no Kamikakushi(Spirited Away) and Hauru no Ugoku Shiro(Howl's Moving Castle) being so similar stylistically. The style worked well for the film and was done exceptionally well. The time and effort put in to animate this is greatly appreciated as it gives the characters so much more life, the detail and care it takes makes the movie turn out so much better. There are several great scenes throughout the movie that have lots of movement and action. The greatest to me being a scene were Ponyo's sisters transform into massive wave like fish while she runs on top of them. The story is simple but fairly well written and played out. The plot stayed focused around character relationships and while it wasn't played out as well as in Tonari no Totoro(My Neighbor Totoro) it is still great. I felt that each character had the an appropriate amount of screen time unlike Spirited Away which was so jammed with distinctive characters that it could have been stretched out into an entire series.(The Radish Spirit. There was a another whole movie right there!) My only real problem with the movie was the end. The way its worded in the English version at least makes it seem like there's going to be some great test given to Sōsuke which turns out to be just him promising Ponyos mother that he will love Ponyo. Though putting more thought into this leads me to think that translation may not be that accurate to the actual meaning in that the test is the promise and that deep down he really means it. The movie did seem to end abruptly though. Other than that the movie was great and I highly recommend it.",pos
7,"We sought out this hard-to-find VHS after watching two excellent Merchant-Ivory pictures back to back. Knowing it was an instant box office failure, a failure as a rental, I thought it might be worth seeing anyway based on M-I's reputation. Too bad! Nine years ago, it was very much a Liberal Agenda objective to trash the Founding Fathers and indeed they had some success in eradicating the Founding Fathers from many American classrooms including, for example, New Jersey; whose eradication of our great founders quickly ended when the Washington Times shone the spotlight of truth into the NJ School Board and their subversive deed. A small part of this was headlining the alleged Sally Hemmings-Thomas Jefferson connection, disregarding the inconvenient DNA findings which failed to support the wacky left's agenda. Never mind! They got James Ellis, an author of dubious reputation, to put it in a book, and Columbia University sealed the deal by giving Ellis a Pulitzer.<br /><br />As to Jefferson in Paris, the Liberal Agenda spin begins in the opening scene wherein James Earl Jones is claiming to be the son of Jefferson. The spin simply continues in flashback mode to Paris. The unmistakable truth is that even if a person assumes the lie is true the Hemmings allegation would be an insignificant detail into the larger matter of Jefferson's prolonged and vital diplomatic mission to Paris (as well as to the Netherlands where he secured crucial financial backing for America when our infant nation was without funds).<br /><br />Besides the Liberal Spin Job, there is nothing else of interest in this drab and tortuously dull movie. Some of the other history is indeed accurate --- adding credence to frame the lie --- but this movie takes one of the most interesting moments in American history and reduces it to a remedy for insomnia.<br /><br />Please do not ask me why Liberals set out to trash the Founding Fathers, because I don't waste time explaining the acts of such people. 
Don't ask them either; they usually respond to such questions with the same answer: ""SHUT UP!""",neg
8,"""THE KING OF QUEENS,"" in my opinion, is a pure CBS hit! Despite the fact that I've never seen every episode, I still enjoy it very much. For that reason, it's hard for me to say which episode is my favorite. Even so, I must say that CBS really knows how to make a good sitcom. Before I wrap this up, I'd like to say that everyone always gives a good performance, the production design is spectacular, the costumes are well-designed, and the writing is always very strong. In conclusion, if this show lives on in syndication after it goes off CBS, I strongly recommend you catch it just in case it goes off the air for good.",pos
9,It is one of the best of Stephen Chow. I give it a nine out of ten.<br /><br />I was surprised to see that Shaolin Soccer was rated on top of all singsing's movies. Unbelievable.,pos


[Return to Top](#top)
 <a id = 'modelPrep'></a>
## Model Preparation

[Return to Top](#top)
 <a id = 'modelSelection'></a>
### Model Selection

We need to pick the model we are going to train to classify IMDB.  We'll do that in several stages. First we define some variables to hold information about the model that we'll re-use.

In [8]:
model_checkpoint = "bert-base-cased"
batch_size = 8

[Return to Top](#top)
 <a id = 'tokenizerSelect'></a>
### Tokenizer Selection 

We'll use the [AutoTokenizer](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial#autotokenizer) object to avoid simple configuration mistakes because it ensures that we get the correct tokenizer for our pre-trained model.  This time the model we're using is BERT, and we're selecting the cased version (meaning case is preserved) and the base version (rather than the large version).

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


What's the tokenizer doing?  It's taking care of breaking down a sentence into the parts the model can understand and was trained on, as well as a bunch of housekeeping that's needed by the model in order to work properly.  Once we've covered how a transformer works in live session, we'll come back (in week 4) and discuss its various components.  For now, you don't need to understand it in order to make use of it.

Here's one example of what the tokenizer outputs.

In [10]:
tokenizer("Hello, we only need one sentence for our task but these reviews often have more.")

{'input_ids': [101, 8667, 117, 1195, 1178, 1444, 1141, 5650, 1111, 1412, 4579, 1133, 1292, 3761, 1510, 1138, 1167, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer converts the incoming text into tokens (words or pieces of words) and then into the integer ids that are used to retrieve the model's input embeddings.  All tokenizers convert text to input ids, but each pre-trained model has its own vocabulary, so the wrong tokenizer will produce the wrong set of token ids and result in very poor predictions.  The AutoTokenizer ensures the correct ids are assigned.
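
To make the step from text to integer ids concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the general approach behind BERT's WordPiece tokenizer. The vocabulary `TOY_VOCAB`, its ordinary word ids, and the helper names `wordpiece` and `encode` are invented for illustration. Only the special-token ids (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102) are meant to mirror the BERT vocabulary, which is why the real output above starts with 101 and ends with 102. The real tokenizer also splits off punctuation, which this sketch skips.

```python
# Toy vocabulary: the ordinary word ids are made up for illustration.
# Only the special-token ids mirror the BERT vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "token": 1000, "##izer": 1001, "##s": 1002, "split": 1003, "words": 1004,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab):
    """Wrap the pieces in [CLS] ... [SEP] and look up their integer ids."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, input_ids = encode("tokenizers split words", TOY_VOCAB)
print(tokens)     # ['[CLS]', 'token', '##izer', '##s', 'split', 'words', '[SEP]']
print(input_ids)  # [101, 1000, 1001, 1002, 1003, 1004, 102]
```

Because the ids are just positions in one particular vocabulary, feeding ids from a different vocabulary to the model would look up unrelated embeddings.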

In [11]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

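
The `padding="max_length"` and `truncation=True` arguments make every encoded review the same length, which the model needs in order to stack examples into batches. The sketch below mimics that behaviour in plain Python under simplifying assumptions: the helper `pad_or_truncate` is hypothetical, and it uses a tiny `max_length` of 8 so the result is easy to read (for bert-base-cased the tokenizer's maximum length is 512 tokens).

```python
PAD_ID = 0    # id of [PAD] in the BERT vocabulary
SEP_ID = 102  # id of [SEP] in the BERT vocabulary

def pad_or_truncate(input_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)
    if len(ids) > max_length:
        # Keep the first max_length - 1 ids and re-attach [SEP] at the end.
        ids = ids[: max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(ids)  # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding  # 0 marks padding

# A short sequence gets padded on the right.
short_ids, short_mask = pad_or_truncate([101, 8667, 117, 102], max_length=8)
print(short_ids)   # [101, 8667, 117, 102, 0, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]

# A long sequence (12 ids) gets cut down to 8.
long_ids, long_mask = pad_or_truncate([101] + list(range(1000, 1010)) + [102], max_length=8)
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

Positions marked 0 in the attention mask are padding, and the model is told to ignore them.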

In [12]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

[Return to Top](#top)
 <a id = 'autoModel'></a>
### Automatic Model Configuration

We'll use the [AutoModel abstraction](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial#automodel) and invoke the TensorFlow port since we want to use Keras to train and run the model.  As a result we'll instantiate a copy of TFAutoModelForSequenceClassification.  Note the 'TF' at the beginning of the class name, which designates it as a TensorFlow port.  The model for "sequence classification" is specifically structured to perform classification based on sequences of words, such as a sentence.  HuggingFace provides a set of models specifically configured for [particular NLP tasks](https://huggingface.co/docs/transformers/main/en/model_doc/auto), as shown by all of the AutoModelFor *FillInTheTask* classes.

In [13]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Return to Top](#top)
 <a id = 'encodeData'></a>
### Encode Data


Let's create the encoded data for training.  Because we're using the TensorFlow port, we'll need to convert the contents of our HuggingFace dataset object into TensorFlow tensors.  HuggingFace provides some nice conversion functions to assist in the process.

In [14]:
tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")

In [15]:
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(batch_size)

eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
eval_tf_dataset = eval_tf_dataset.batch(batch_size)
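
The `.batch(batch_size)` calls above group consecutive examples so the model sees `batch_size` reviews per training step. As a rough sketch of the bookkeeping, the plain-Python generator below (a hypothetical helper, not part of `tf.data`) shows how 1000 examples with a batch size of 8 turn into the number of steps Keras will run per epoch.

```python
import math

def batch_indices(num_examples, batch_size):
    """Yield lists of example indices, one list per batch."""
    for start in range(0, num_examples, batch_size):
        yield list(range(start, min(start + batch_size, num_examples)))

num_examples, batch_size = 1000, 8  # subset size and batch size used in this notebook
batches = list(batch_indices(num_examples, batch_size))
steps_per_epoch = math.ceil(num_examples / batch_size)
print(len(batches), steps_per_epoch)  # 125 125

# When the sizes do not divide evenly, the last batch is simply smaller.
uneven = list(batch_indices(10, 4))
print([len(b) for b in uneven])  # [4, 4, 2]
```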

[Return to Top](#top)
 <a id = 'fineTuning'></a>
# Fine Tuning

In keeping with the Keras process we call model.compile first to make sure that all the pieces are in place.  We can follow that up with a call to model.summary() to make sure we've put the correct players together in the correct manner.  We can also see how much is trainable, which gives a sense of training time and resource requirements.

In [16]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
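
The loss above is created with `from_logits=True` because the classification head outputs raw, unnormalized scores (logits) rather than probabilities. As a rough sketch of what that loss computes for a single example, the pure-Python function below applies a numerically stable log-softmax and picks out the entry for the integer label. The function names are hypothetical, and this is an illustration of the formula rather than the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    peak = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def predicted_class(logits):
    """Index of the largest logit, which is also the most probable class."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Equal logits mean the model is undecided between the two classes.
uncertain = sparse_categorical_crossentropy([0.0, 0.0], label=1)
# A large logit on the correct class gives a loss close to zero.
confident = sparse_categorical_crossentropy([-2.0, 3.0], label=1)
# The same logits with the other label give a large loss.
wrong = sparse_categorical_crossentropy([-2.0, 3.0], label=0)
print(uncertain, confident, wrong)
```

The "sparse" in the name refers to the labels being plain integers (0 or 1 here) instead of one-hot vectors, which matches the `label` column in the IMDB dataset.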

In [17]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
_________________________________________________________________
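
The parameter counts in the summary can be reproduced with a little arithmetic, which helps show where the size of the model comes from. The sketch below assumes the usual bert-base-cased configuration values (vocabulary of 28996 tokens, hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types). These values are not printed anywhere in this notebook, so treat them as assumptions; they are consistent with the totals shown above.

```python
# Assumed bert-base-cased configuration values (not shown in this notebook).
VOCAB_SIZE = 28996
HIDDEN = 768
LAYERS = 12
INTERMEDIATE = 3072
MAX_POSITIONS = 512
TOKEN_TYPES = 2
NUM_LABELS = 2  # matches num_labels=2 passed to from_pretrained

def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # one scale vector and one shift vector

# Word, position, and token-type embedding tables, followed by a layer norm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TOKEN_TYPES) * HIDDEN + layer_norm

encoder_layer = (
    4 * dense(HIDDEN, HIDDEN)        # query, key, value, and attention output
    + layer_norm
    + dense(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + dense(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm
)
pooler = dense(HIDDEN, HIDDEN)

bert_params = embeddings + LAYERS * encoder_layer + pooler
classifier_params = dense(HIDDEN, NUM_LABELS)
print(bert_params)                      # 108310272
print(classifier_params)                # 1538
print(bert_params + classifier_params)  # 108311810
```

Note how small the newly initialized classifier is compared with the pre-trained body: only 1,538 of the roughly 108 million parameters start from scratch.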


Then we call model.fit to perform the actual training.  Note that, because of what the model has already learned about language in its pre-training phase, we can often limit our training to a small number of epochs.

In [18]:
model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fef18f95a30>