# The Textbook Fine-Tuning Tutorial #

The original tutorial is from pages 574-586 of the textbook--see also ch16-part3-bert.ipynb on the [author's github](https://github.com/rasbt/machine-learning-book)--but I have made some significant deviations from this tutorial.


## The BERT model ##

The BERT model is discussed on pages 569-572 of the textbook.  It is an interesting alternative to GPT--which is pre-trained solely on 'next token' prediction--in that it is pre-trained on two tasks: (1) next sentence prediction and (2) masked language modeling.  In masked language modeling, a randomly chosen token (essentially, a 'word') in a sentence is replaced by a special '[MASK]' token, and the model attempts to predict which token was masked.
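
To make the masking idea concrete, here is a minimal sketch in plain Python. It is a toy illustration and not BERT's actual preprocessing: the whitespace "tokenizer", the `mask_tokens` helper, and the masking probability are all assumptions made for this example. BERT's own tokenizer writes the mask token as `[MASK]`, which is the spelling used here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets

# Toy "tokenization" by whitespace (an assumption; BERT uses WordPiece tokens).
sentence = "the movie was surprisingly good and the acting was great".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

During pre-training the model sees `masked` as input and is scored on how well it recovers the tokens stored in `targets`.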

The idea is that this form of training makes it better able to understand 'contexts' than the next-word-predicting GPT models. Hence it is good for sentence classification (among other things).

*In fact*, we will use a simplified version of the BERT model called DistilBERT. (More later.)

## Loading Datasets ##

In [99]:
import datasets

There is a Huggingface page [completely devoted to this topic](https://huggingface.co/docs/datasets/en/index).  

The primary class is [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset).

Loading and listing methods are [available directly through the datasets package](https://huggingface.co/docs/datasets/en/package_reference/loading_methods).

### Example: IMDB ###

In [144]:
datasets.get_dataset_config_names('imdb')

['plain_text']

In [145]:
from datasets import load_dataset

In [146]:
dataset1 = load_dataset('imdb')

In [147]:
dataset1

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [148]:
dataset1.column_names

{'train': ['text', 'label'],
 'test': ['text', 'label'],
 'unsupervised': ['text', 'label']}

In [149]:
dataset1['train']['label'][:5]

[0, 0, 0, 0, 0]

In [150]:
dataset1['train']['text'][:1]

['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

### Example: tweet_eval ###

In [30]:
dataset = load_dataset('tweet_eval')

ValueError: Config name is missing.
Please pick one among the available configs: ['emoji', 'emotion', 'hate', 'irony', 'offensive', 'sentiment', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary']
Example of usage:
	`load_dataset('tweet_eval', 'emoji')`

In [31]:
datasets.get_dataset_config_names('tweet_eval')

['emoji',
 'emotion',
 'hate',
 'irony',
 'offensive',
 'sentiment',
 'stance_abortion',
 'stance_atheism',
 'stance_climate',
 'stance_feminist',
 'stance_hillary']

In [32]:
dataset2 = load_dataset('tweet_eval', 'stance_climate')

Downloading data: 100%|██████████| 28.1k/28.1k [00:00<00:00, 54.1kB/s]
Downloading data: 100%|██████████| 14.9k/14.9k [00:00<00:00, 72.9kB/s]
Downloading data: 100%|██████████| 5.47k/5.47k [00:00<00:00, 23.0kB/s]
Generating train split: 100%|██████████| 355/355 [00:00<00:00, 62365.57 examples/s]
Generating test split: 100%|██████████| 169/169 [00:00<00:00, 48855.01 examples/s]
Generating validation split: 100%|██████████| 40/40 [00:00<00:00, 11608.12 examples/s]


In [33]:
dataset2.column_names

{'train': ['text', 'label'],
 'test': ['text', 'label'],
 'validation': ['text', 'label']}

In [37]:
dataset2['train'][2]

{'text': "It's nights like this when I'm not so fond of my long hair. I just wanna chop it all off! #heatwave #pnwgirl #SemST",
 'label': 0}

In [39]:
dataset2['train']['text'][:3]

['Why Is The Pope Upset?  via @user #UnzippedTruth #PopeFrancis #SemST',
 "We support Australia's Climate Roundtable which is providing a framework for sensible debate ahead of Paris @user #SemST",
 "It's nights like this when I'm not so fond of my long hair. I just wanna chop it all off! #heatwave #pnwgirl #SemST"]

In [40]:
dataset2['train']['label'][:3]

[0, 2, 0]

**Note:**  You can find all this information (available configurations, columns, and what the labels mean) through the API, but it's probably easiest to go to the dataset card through the Huggingface [Datasets link](https://huggingface.co/datasets/tweet_eval).

### Train/Test Split for IMDB ###

Recall that the dataset is already split into train and test sets.  The next step is to split the train data into train and validation datasets (about 10% for validation in our example).  The Huggingface `Dataset` class has a [train_test_split](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) method for this.

In [105]:
dataset1

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

**Idea:** We will split the *train* dataset to include a validation dataset.

In [106]:
train_valid = dataset1['train'].train_test_split(test_size=0.1) 

In [107]:
train_valid

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2500
    })
})

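Note that the result is again a `DatasetDict` with `train` and `test` keys, just like the original.  Here the `test` part (2500 rows) will serve as our validation set.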
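
Conceptually, the split shuffles the row indices and sets aside a fraction of them. The sketch below illustrates that idea in plain Python. `split_indices` is a hypothetical helper written for this illustration, and it is not how the `datasets` library implements `train_test_split`.

```python
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices and split them into (train, held-out) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# 25000 rows, as in the IMDB train split.
train_idx, valid_idx = split_indices(25000, test_size=0.1)
print(len(train_idx), len(valid_idx))  # 22500 2500
```

In the notebook itself, passing a `seed` argument to `train_test_split` should make the split reproducible from run to run.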

In [108]:
train_valid['train']['text'][:2]

["With these people faking so many shots, using old footage, and gassing animals to get them out, not to mention that some of the scenes were filmed on a created set with actors, what's to believe? Old film of countries is nice, but the animal abuse and degradation of natives is painful to watch in these films. I know, racism is OK in these old films, but there is more to that to make this couple lose credibility. Portrayed as fliers, they never flew their planes, Martin Johnson was an ex-vaudevillian, used friends like Jack London for financial gain while stiffing them of royalties, denying his wife's apparent depression, using her as a cute prop, all this makes these films unbearable. They were by no means the first to travel to these lands, or the first to write about them. He was OK as a filmmaker and photographer, but that's about it.",
 "I don't know the stars, or modern Chinese teenage music - but I do know a thoroughly entertaining movie when I see one.<br /><br />Kung Fu Dunk 

In [109]:
mytrain = train_valid['train']

In [110]:
mytrain[0]

{'text': "With these people faking so many shots, using old footage, and gassing animals to get them out, not to mention that some of the scenes were filmed on a created set with actors, what's to believe? Old film of countries is nice, but the animal abuse and degradation of natives is painful to watch in these films. I know, racism is OK in these old films, but there is more to that to make this couple lose credibility. Portrayed as fliers, they never flew their planes, Martin Johnson was an ex-vaudevillian, used friends like Jack London for financial gain while stiffing them of royalties, denying his wife's apparent depression, using her as a cute prop, all this makes these films unbearable. They were by no means the first to travel to these lands, or the first to write about them. He was OK as a filmmaker and photographer, but that's about it.",
 'label': 0}

In [111]:
myvalid = train_valid['test']

In [112]:
myvalid[0]

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...',
 'label': 1}

We have to transform our data into a format that can be used by the model.   This is accomplished by a [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer).  Each Huggingface model comes with a tokenizer that produces the tokens that model expects.  The DistilBERT tokenizer is described on the [DistilBERT page](https://huggingface.co/docs/transformers/model_doc/distilbert).

### Tokenizing the Data ###

In [123]:
import transformers
from transformers import DistilBertTokenizerFast

In [114]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

**Note:** The approach below (wrapping the tokenizer in a function and applying it with `map`) seems rather awkward, but most HuggingFace fine-tuning tutorials I've seen make use of it.

In [118]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding='max_length')

In [119]:
train_tokened = mytrain.map(tokenize_function, batched=True)

Map: 100%|██████████| 22500/22500 [00:04<00:00, 5083.76 examples/s]


In [None]:
train_tokened[0]

In [120]:
valid_tokened = myvalid.map(tokenize_function, batched=True)

Map: 100%|██████████| 2500/2500 [00:00<00:00, 5300.50 examples/s]


In [None]:
valid_tokened[0]
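
The two tokenizer arguments used above do simple things to each sequence of token ids: `truncation=True` cuts sequences that are too long, and `padding='max_length'` fills short ones up to a fixed length, with an attention mask marking which positions hold real tokens. The following is a toy sketch of that behavior, not the tokenizer's implementation. The `pad_or_truncate` helper, the pad id of 0, the tiny `max_length`, and the example ids are assumptions for illustration, and the sketch ignores the special handling of start and end tokens.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = list(token_ids)[:max_length]   # truncation: drop everything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up to max_length
    mask += [0] * n_pad                  # 0 marks padding
    return ids, mask

short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102])
long_ids, long_mask = pad_or_truncate(range(100, 120))
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to the maximum length is simple but wasteful for short reviews, which is one reason training on this dataset takes a while.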

## Fine-tuning with the Trainer API ##

We will use the 'distilbert-base-uncased' model.  This is available from the [Huggingface models page](https://huggingface.co/models). See the model card page at that URL.  See also the [DistilBert page](https://huggingface.co/docs/transformers/model_doc/distilbert), which discusses the [sequence classification head](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification).

In [124]:
from transformers import DistilBertForSequenceClassification

In [51]:
import torch
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_EPOCHS = 3

**Note:**  Setting the device explicitly is unnecessary here.  If we are using the Trainer class with PyTorch, the GPU is used automatically when one is available.

In [52]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train();

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### The Huggingface Trainer Class ###

The Huggingface [Trainer documentation](https://huggingface.co/docs/transformers/en/main_classes/trainer) describes the [Trainer](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.Trainer) class.  The basic inputs for this are the model and the dataset, but the most complicated part is specifying the [TrainingArguments](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments).

In [152]:
from transformers import Trainer, TrainingArguments

# Not passed to the Trainer in this cell, so the Trainer builds its own default optimizer.
optim = torch.optim.Adam(model.parameters(), lr=5e-5)
training_args = TrainingArguments(
    output_dir='./results', 
    num_train_epochs=3,     
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,   
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokened,
)

In [None]:
trainer.train()

In [154]:
from transformers import Trainer, TrainingArguments

# Not passed to the Trainer in this cell, so the Trainer builds its own default optimizer.
optim = torch.optim.Adam(model.parameters(), lr=5e-5)
training_args = TrainingArguments(
    output_dir='./results', 
    num_train_epochs=3,     
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,   
    logging_dir='./logs',
    logging_steps=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokened,
    eval_dataset=valid_tokened
)

In [None]:
trainer.train()
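
It helps to know how many optimization steps these settings imply, since `logging_steps` is counted in steps rather than epochs. The following is a back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

num_examples = 22500   # rows in the training split created above
batch_size = 16        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
logging_steps = 16     # as in the TrainingArguments above

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)   # 1407 4221
print(total_steps // logging_steps)   # 263 logged training-loss values
```

On more than one GPU, or with gradient accumulation, the step counts would be smaller.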

**Note:** By default, the Trainer reports only the *training loss* and no *model evaluation* metric (e.g., accuracy).  To change that, we can define a separate evaluation function and pass it to the Trainer as 'compute_metrics'.  This function receives the model predictions as logits, together with the true labels of the evaluation dataset, and compares the two.
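
As a minimal sketch of what such a function does, the example below turns a small array of logits into predicted labels and computes accuracy by hand with NumPy. The logits and labels are made up for illustration, and `accuracy_from_logits` is a hypothetical helper, not part of any library.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class in each row and compare with the labels."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))

# One row per review, one column per class (0 = negative, 1 = positive).
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 0.2,  0.1],
                   [-1.5,  2.2]])
labels = np.array([0, 1, 1, 1])

print(accuracy_from_logits(logits, labels))  # 0.75
```

The third review is the one that is misclassified: its logits slightly favor class 0, while its label is 1.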

## Better Model Evaluation ##

See [this page](https://huggingface.co/docs/datasets/metrics) on the Huggingface site.  
Each 'Metric' object has a [compute method](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Metric.compute).

**Note:** The first paragraph of that page warns that the 'Metric' class in the datasets package is deprecated, and that they are transitioning to the [Evaluate library](https://huggingface.co/docs/evaluate/index).  Tutorials will probably have a mixture of both, so **be careful**.


In [127]:
from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)
print(metrics_list)



['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score', 'cer', 'character', 'charcut_mt', 'chrf', 'code_eval', 'comet', 'competition_math', 'confusion_matrix', 'coval', 'cuad', 'exact_match', 'f1', 'frugalscore', 'glue', 'google_bleu', 'indic_glue', 'mae', 'mahalanobis', 'mape', 'mase', 'matthews_correlation', 'mauve', 'mean_iou', 'meteor', 'mse', 'nist_mt', 'pearsonr', 'perplexity', 'poseval', 'precision', 'r_squared', 'recall', 'rl_reliability', 'roc_auc', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'smape', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'ter', 'trec_eval', 'wer', 'wiki_split', 'xnli', 'xtreme_s', 'Aledade/extraction_evaluation', 'AlhitawiMohammed22/CER_Hu-Evaluation-Metrics', 'Bekhouche/NED', 'BucketHeadP65/confusion_matrix', 'BucketHeadP65/roc_curve', 'CZLC/rouge_raw', 'DaliaCaRo/accents_unplugged_eval', 'DarrenChensformer/eval_keyphrase', 'DarrenChensformer/relation_extraction', 'DoctorSlimm/bangalore_score', 'DoctorSlimm/kaushiks_criteria', 'Drunper/metrica_tesi

**Note:** More details on individual metrics can be found [here](https://huggingface.co/metrics).   We will use the accuracy metric, which is described in the [Huggingface documentation](https://huggingface.co/spaces/evaluate-metric/accuracy).

In [133]:
from datasets import load_metric
import numpy as np


metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred # logits are a numpy array, not pytorch tensor
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(
               predictions=predictions, references=labels)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [156]:
optim = torch.optim.Adam(model.parameters(), lr=5e-5)


training_args = TrainingArguments(
    output_dir='./results', 
    num_train_epochs=3,     
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,   
    logging_dir='./logs',
    logging_steps=10
)

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    train_dataset=train_tokened,
    eval_dataset=valid_tokened,
    optimizers=(optim, None) # optimizer and learning rate scheduler
)


**Note:**  The example below gives a way to *time* your training run.

In [157]:
import time

In [None]:
start_time = time.time()
trainer.train()
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')

In [140]:
trainer.evaluate()

{'eval_loss': 0.7392910718917847, 'eval_accuracy': 0.902}
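
The two numbers measure different things: accuracy only counts whether the top-scoring class is right, while the loss also penalizes confident mistakes. The NumPy sketch below illustrates this, assuming `eval_loss` is the mean cross-entropy over the evaluation set (the usual loss for single-label classification). The logits are made up, and `mean_cross_entropy` is a hypothetical helper.

```python
import numpy as np

def mean_cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

labels = np.array([0, 1, 1, 0])
# Three correct predictions and one wrong one (the last row).
mild = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]])
confident = mild * 4  # same predicted classes, but much more confident

for name, logits in [("mild", mild), ("confident", confident)]:
    acc = float((logits.argmax(axis=-1) == labels).mean())
    print(name, acc, round(mean_cross_entropy(logits, labels), 3))
```

Both sets of logits give the same accuracy, but the more confident set has a much larger loss because its single mistake is made with high confidence.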

In [142]:
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

training_args = TrainingArguments(
    output_dir='./results', 
    num_train_epochs=3,     
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,   
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    train_dataset=train_tokened,
    eval_dataset=valid_tokened,
    optimizers=(optim, None) # optimizer and learning rate scheduler
)


In [143]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.0975        | 0.45267         | 0.9036   |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.


KeyboardInterrupt: 