**Why should we care about transformers?**

Transformers have become some of the most influential and powerful model architectures. They now lead in many domains: natural language processing, computer vision, time series, and even some aspects of tabular data.

As a data scientist, understanding transformers is going to be an extremely valuable skill going forward. In this notebook we will look specifically at the natural language processing side of transformers, but keep in mind that they can be extremely powerful on many other problems.

We can look at https://paperswithcode.com/sota/ to see the various domains and their benchmark tasks; for almost all of them, transformers are surpassing CNNs, RNNs, etc.


Language modeling benchmarks
<img src ="https://i.imgur.com/Ox0QHPO.png">

Imagenet classification benchmarks
<img src ="https://i.imgur.com/3IePBOn.png">


---



**Crash course on transformers - So what are they?**

Transformers are a model architecture built with attention mechanisms. Here is the original paper that proposed this architecture: https://arxiv.org/abs/1706.03762

<img src = "https://i.imgur.com/aXCvmzp.png">

Looking at this diagram, if you are familiar with deep learning, most of these layers are not new: feedforward layers, add-and-normalize, embeddings, linear layers, softmax. The important innovation is the attention layer. We may cover it in more detail at a later date, but for now we will set it aside so we can use it in practice first and look at the nuts and bolts later. If you want to dig into it now, see the Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer/

There are many flavors of transformers, but for now we will just look at BERT, a very common one whose architecture many other models have borrowed. https://huggingface.co/docs/transformers/index

In the left sidebar, look under Models -> Text models.

These are models that have been directly integrated into the Hugging Face transformers package. Transformers have become so ubiquitous that this package was created to gather and standardize implementations of them. The officially integrated models are under that tab on the sidebar, but users can also upload their own trained models to the model hub: https://huggingface.co/models

There are thousands of models covering many use cases. At this point many are transformers, but lots of other architectures are available as well; the model hub is a general hub for all kinds of models. If you are ever looking for a specialized model, this is a great place to start your search.



---







On the documentation page for the BERT model we can find a link to the original BERT paper, if you would like to dive into more detail.

https://huggingface.co/docs/transformers/model_doc/bert 

https://arxiv.org/abs/1810.04805

For now I just want us to look at a few specific parameters and discuss them. 
https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertConfig

Transformers have become so standard that there is a common convention for sizing them, and just tweaking these parameters can yield the majority of transformer architectures out there. We don't need to write our own implementations from scratch.

Looking at a few of the important ones we can see:


*   Vocab size - the number of distinct tokens that the model knows
*   Hidden size - the dimensionality used to represent each token; we can view this as the width of the network
*   Number of hidden layers - the number of transformer blocks that are stacked together; we can view this as the depth of the network

There are many other implementation details, but these broad ones are important and common across pretty much all architectures.

BERT originally came in two sizes, base and large. In the original paper, base BERT had a vocab size of 30522, a hidden size of 768, and 12 hidden layers. This yields a model with 100+ million parameters. The large variant has over 300 million parameters. These were once considered ludicrously large, but models have since been scaled beyond a trillion parameters. The base configuration has become extremely common. Lots of derivative works have borrowed this same model size and then tweaked various things around it: the flavor of attention, the training data, the pretraining method, etc. (The `bert-base-cased` checkpoint used later in this notebook has a slightly smaller vocabulary of 28996 tokens.)
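To see where the "100+ million" figure comes from, the parameter count can be tallied from those few sizes. The sketch below is a back-of-the-envelope count, not the library's own code: the function name is made up for illustration, and the feed-forward width (3072), the 512 positions, and the 2 segment types are taken from the `bert-base-cased` config printed later in this notebook.

```python
def count_bert_parameters(vocab_size, hidden_size, num_layers,
                          intermediate_size, max_positions=512,
                          type_vocab_size=2, num_labels=0):
    """Rough parameter tally for a BERT-style encoder (hypothetical helper)."""
    layer_norm = 2 * hidden_size  # one scale vector and one shift vector

    # Token, position and segment embedding tables, followed by a LayerNorm.
    embeddings = ((vocab_size + max_positions + type_vocab_size) * hidden_size
                  + layer_norm)

    # Self-attention: query, key, value and output projections (weights + biases).
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Feed-forward: expand to intermediate_size, then project back down.
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)

    # Pooler: one dense layer applied to the first token's vector.
    pooler = hidden_size * hidden_size + hidden_size

    # Optional classification head: a single linear layer (assumed here).
    classifier = hidden_size * num_labels + num_labels

    return (embeddings + num_layers * (attention + feed_forward)
            + pooler + classifier)


base_from_paper = count_bert_parameters(30522, 768, 12, 3072)
base_cased_5_labels = count_bert_parameters(28996, 768, 12, 3072, num_labels=5)
print(f"{base_from_paper:,}")
print(f"{base_cased_5_labels:,}")
```

With the cased vocabulary (28996) and a 5-label head, the tally comes to 108,314,117, which matches the "Number of trainable parameters" line in the training log further down.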





---
<img src = "https://i.imgur.com/5c0T3hK.png">

One of the most important things to understand as a data scientist/machine learning practitioner is the inputs and outputs, so we will walk through what actually goes into the model and what comes out of it.

On the input side, one of our problems is that when operating on text we can't directly do any math on strings, so we need to find some numerical representation for them. In the past, people worked to find meaningful vectors that represent each word individually, building up a massive file that represents the whole vocabulary. This is OK, but it misses badly when a word occurs that has no vector available. Language has a very long tail, and it is difficult to cover all possible words, misspellings, and random combinations of characters. It also doesn't take advantage of the common elements in language, like words ending in "logy" or words that begin with "chem". Many roots have similar meanings, and if you operate at the word level it is hard to represent these commonalities.

Instead, transformer language models operate on tokens rather than words: they break words up into chunks. There are a few common options here (byte pair encoding, WordPiece, etc.), but functionally they all do the same thing: split a word into its constituent pieces.

So when we looked at the BERT vocab size of 30522, that doesn't mean it only knows 30522 words. It means it knows that many tokens, and those tokens can be combined to recreate basically any combination of characters, so the model will have some understanding of almost any input.
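To make the idea of subword pieces concrete, here is a toy sketch of greedy longest-match splitting in the style of WordPiece. It is only an illustration: the function name and the tiny vocabulary are made up, the IDs are arbitrary, and a real tokenizer also handles casing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest pieces found in vocab, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary mapping each token to an ID.
toy_vocab = {"[UNK]": 0, "su": 1, "##shi": 2, "b": 3, "##uff": 4, "##et": 5}

for word in ["sushi", "buffet"]:
    tokens = wordpiece_tokenize(word, toy_vocab)
    ids = [toy_vocab[t] for t in tokens]
    print(word, tokens, ids)
```

Neither "sushi" nor "buffet" is in the toy vocabulary as a whole word, yet both can still be represented by combining smaller pieces.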

The tokenizer splits each word into pieces and then gives each piece an ID. This ID can then be mapped to the correct embedding/vector that represents that token. At that point we have successfully gone from strings to numbers that mean something to the model. Let's try that out.


In [None]:
import numpy as np
import torch
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]



  0%|          | 0/2 [00:00<?, ?it/s]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

The transformers package has tokenization functions for all of the major models, so we can just borrow these existing implementations. Here we load the tokenizer for BERT base. There are a few important parameters. The model can only handle up to 512 tokens at a time, so we may need to cut some text off or break it into multiple parts if the text is longer than that. Additionally, since we typically train on batches of data, we want every example to be the same size, so we pad each text to a fixed maximum length. That gives us one big batched tensor to pass to the model instead of individual vectors or a ragged tensor. The model's limit is 512 tokens, but the code below uses `max_length=128` to keep things fast.
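As a rough picture of what padding and truncation do to a single example, consider the sketch below. It is a simplification with a made-up helper name: the real tokenizer also keeps the closing special token when it truncates, which this version ignores. The pad ID of 0 follows `pad_token_id` in the BERT config.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop the overflow
    num_padding = max_length - len(kept)   # padding: fill what is missing
    attention_mask = [1] * len(kept) + [0] * num_padding  # 0 marks padding
    return kept + [pad_id] * num_padding, attention_mask


short_ids, short_mask = pad_or_truncate([101, 1188, 1110, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example now has the same length, so a batch of them stacks into a single rectangular tensor, and the attention mask tells the model which positions are real tokens.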

In [None]:
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(1000))
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(1000))



In [None]:
dataset["train"][0]

{'label': 4,
 'text': "I stalk this truck.  I've been to industrial parks where I pretend to be a tech worker standing in line, strip mall parking lots, and of course the farmer's market.  The bowls are so so absolutely divine.  The owner is super friendly and he makes each bowl by hand with an incredible amount of pride.  You gotta eat here guys!!!"}

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    # Pad or truncate every review to exactly 128 tokens so batches stack cleanly.
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file tokenize

  0%|          | 0/1 [00:00<?, ?ba/s]

If you like, you can create a smaller subset of the full dataset to fine-tune on to reduce the time it takes. We already cut both splits down to 1,000 examples above, so here this step only reshuffles them:

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))



In [None]:
small_train_dataset[0]["text"]

'This is a medium sized all you can eat sushi and Asian food buffet.  We went during \\"Happy Hour\\" where the price is only $11.95 per adult.  Possibly there are regular menu items that we missed by coming in an off peak time, but I doubt we will go back to find out.\\n\\nAlthough we like these kinds of restaurants, especially O Nami in San Diego, this is a poor representation.  The sushi is somewhat fresh, but the fish pieces are too large to eat in one bite and the rice quite small, it was an awkward eating experience.  I particularly dislike combo sushis without any labeling and there was none here.  There are also some hot dishes, such as soups, stir fries, fried noodles, and tempura shrimp and vegetables.  These were not bad, but not worth recommending.  The one thing that probably keeps people coming back is the chocolate fountain, which we are not particularly fond of.  The rest of the desserts are forgettable.\\n\\nBut the one thing that will definitely keep me away was the M

In [None]:
small_train_dataset[0]["label"]

1

In [None]:
torch.tensor(small_train_dataset[0]["input_ids"])

tensor([  101,  1188,  1110,   170,  5143,  6956,  1155,  1128,  1169,  3940,
        28117,  5933,  1105,  3141,  2094,   171,  9435,  2105,   119,  1284,
         1355,  1219,   165,   107,  8325, 12197,   165,   107,  1187,  1103,
         3945,  1110,  1178,   109,  1429,   119,  4573,  1679,  4457,   119,
        18959, 19828,  4999,  1175,  1132,  2366, 13171,  4454,  1115,  1195,
         4007,  1118,  1909,  1107,  1126,  1228,  4709,  1159,   117,  1133,
          146,  4095,  1195,  1209,  1301,  1171,  1106,  1525,  1149,   119,
          165,   183,   165,   183,  1592,  6066, 14640,  5084,  1195,  1176,
         1292,  7553,  1104,  7724,   117,  2108,   152, 19346,  1182,  1107,
         1727,  4494,   117,  1142,  1110,   170,  2869,  6368,   119,  1109,
        28117,  5933,  1110,  4742,  4489,   117,  1133,  1103,  3489,  3423,
         1132,  1315,  1415,  1106,  3940,  1107,  1141,  6513,  1105,  1103,
         7738,  2385,  1353,   117,  1122,  1108,  1126,   102])

<a id='trainer'></a>

In [None]:
np.array(tokenizer.convert_ids_to_tokens(small_train_dataset[0]["input_ids"]))

array(['[CLS]', 'This', 'is', 'a', 'medium', 'sized', 'all', 'you', 'can',
       'eat', 'su', '##shi', 'and', 'Asian', 'food', 'b', '##uff', '##et',
       '.', 'We', 'went', 'during', '\\', '"', 'Happy', 'Hour', '\\', '"',
       'where', 'the', 'price', 'is', 'only', '$', '11', '.', '95', 'per',
       'adult', '.', 'Po', '##ssi', '##bly', 'there', 'are', 'regular',
       'menu', 'items', 'that', 'we', 'missed', 'by', 'coming', 'in',
       'an', 'off', 'peak', 'time', ',', 'but', 'I', 'doubt', 'we',
       'will', 'go', 'back', 'to', 'find', 'out', '.', '\\', 'n', '\\',
       'n', '##A', '##lt', '##hou', '##gh', 'we', 'like', 'these',
       'kinds', 'of', 'restaurants', ',', 'especially', 'O', 'Nam', '##i',
       'in', 'San', 'Diego', ',', 'this', 'is', 'a', 'poor',
       'representation', '.', 'The', 'su', '##shi', 'is', 'somewhat',
       'fresh', ',', 'but', 'the', 'fish', 'pieces', 'are', 'too',
       'large', 'to', 'eat', 'in', 'one', 'bite', 'and', 'the', 'rice',
      

## Train with PyTorch Trainer

🤗 Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.

Start by loading your model and specifying the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_voca

<Tip>

You will see a warning about some of the pretrained weights not being used and some weights being randomly
initialized. Don't worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.

</Tip>
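To get a feel for what a "randomly initialized classification head" is, here is a small pure-Python sketch. It assumes the head is a single linear layer mapping the 768-dimensional pooled vector to 5 logits; the variable names and the stand-in input are made up for illustration, and the real model does this with PyTorch tensors.

```python
import random

random.seed(0)  # make the toy example repeatable

hidden_size = 768   # size of the vector BERT produces for a sequence
num_labels = 5      # Yelp star ratings, labels 0 to 4

# Freshly initialized head: small random weights, zero biases.
# The 0.02 standard deviation mirrors "initializer_range" in the config above.
weight = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)]
          for _ in range(num_labels)]
bias = [0.0] * num_labels


def classification_head(pooled_vector):
    """Map one pooled sequence vector to one logit per label."""
    return [sum(w * x for w, x in zip(row, pooled_vector)) + b
            for row, b in zip(weight, bias)]


# Stand-in for the pooled output of the pretrained encoder.
pooled = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]
logits = classification_head(pooled)
head_parameters = hidden_size * num_labels + num_labels
print(len(logits), head_parameters)
```

The head adds only a few thousand parameters on top of the pretrained encoder, and until it is fine-tuned its logits carry no information about the star rating.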

### Training hyperparameters

Next, create a [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training [hyperparameters](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments), but feel free to experiment with these to find your optimal settings.

Specify where to save the checkpoints from your training:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


### Evaluate

In [None]:
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) does not automatically evaluate model performance during training. You'll need to pass [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) a function to compute and report metrics. The [🤗 Evaluate](https://huggingface.co/docs/evaluate/index) library provides a simple [`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy) function you can load with the [evaluate.load](https://huggingface.co/docs/evaluate/main/en/package_reference/loading_methods#evaluate.load) function (see this [quicktour](https://huggingface.co/docs/evaluate/a_quick_tour) for more information):

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all 🤗 Transformers models return logits):

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
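For intuition, the same logits-to-accuracy computation can be written out by hand on a tiny made-up batch. This is only a sketch with hypothetical helper names and toy numbers; it does not use the 🤗 Evaluate library.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy_from_logits(logits, labels):
    # The predicted class is the position of the largest logit in each row.
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Three toy examples with five logits each (one per star rating).
toy_logits = [
    [2.0, 0.1, -1.0, 0.0, 0.3],   # predicts label 0
    [0.0, 0.2, 0.1, 3.0, -0.5],   # predicts label 3
    [0.5, 0.4, 0.3, 0.2, 0.1],    # predicts label 0
]
toy_labels = [0, 3, 4]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Two of the three predictions match their labels, so the accuracy is 2/3. No softmax is needed, because the largest logit is also the largest probability.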

If you'd like to monitor your evaluation metrics during fine-tuning, specify the `evaluation_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch:

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


### Trainer

Create a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object with your model, training arguments, training and test datasets, and evaluation function:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Then fine-tune your model by calling [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train):

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 108314117


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.167053,0.453
2,No log,1.187468,0.489
3,No log,1.166395,0.529


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


Training compl

TrainOutput(global_step=375, training_loss=1.0172345377604166, metrics={'train_runtime': 103.4123, 'train_samples_per_second': 29.01, 'train_steps_per_second': 3.626, 'total_flos': 197338606848000.0, 'train_loss': 1.0172345377604166, 'epoch': 3.0})
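The step counts in these logs follow from simple arithmetic. The sketch below uses a made-up helper name and assumes no gradient accumulation and that the final partial batch is kept.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    # One optimizer step per batch; the last batch of an epoch may be smaller.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


trainer_steps = total_optimization_steps(1000, batch_size=8, num_epochs=3)
native_steps = total_optimization_steps(1000, batch_size=32, num_epochs=3)
print(trainer_steps, native_steps)
```

With 1,000 examples, a batch size of 8 gives 125 steps per epoch and 375 in total, as reported by the Trainer. The native PyTorch loop later in this notebook uses a batch size of 32, which gives 32 steps per epoch and the 96 steps shown on its progress bar.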

<a id='pytorch_native'></a>

## Train in native PyTorch

In [None]:
del model
del trainer
torch.cuda.empty_cache()

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

### DataLoader

Create a `DataLoader` for your training and test datasets so you can iterate over batches of data:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=32)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=32)

Load your model with the number of expected labels:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_voca

### Optimizer and learning rate scheduler

Create an optimizer and learning rate scheduler to fine-tune the model. Let's use the [`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer from PyTorch:

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

Create the default learning rate scheduler from [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
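A linear schedule is easy to picture: the learning rate rises during warmup (none here) and then falls in a straight line to zero. The function below is a sketch of that shape with a made-up name, not the library's own code; the 96 steps are 3 epochs times 32 batches.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup then linear decay."""
    if step < num_warmup_steps:
        # Warmup: climb from 0 up to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: fall from base_lr to 0 by the final step.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# Learning rate at the start, the midpoint and the end of training.
schedule = [linear_schedule_lr(s, base_lr=5e-5, num_training_steps=96)
            for s in (0, 48, 96)]
print(schedule)
```

Training starts at the full 5e-5, is at half of that midway through, and reaches zero on the last step.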

Lastly, specify `device` to use a GPU if you have access to one. Otherwise, training on a CPU may take several hours instead of a couple of minutes.

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

<Tip>

If you don't have a GPU, you can get free access to a cloud GPU with a hosted notebook like [Colaboratory](https://colab.research.google.com/) or [SageMaker StudioLab](https://studiolab.sagemaker.aws/).

</Tip>

Great, now you are ready to train! 🥳

### Training loop

To keep track of your training progress, use the [tqdm](https://tqdm.github.io/) library to add a progress bar over the number of training steps:

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

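model.train()  # enable dropout and other training-time behavior
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move every tensor in the batch to the same device as the model.
        batch = {k: v.to(device) for k, v in batch.items()}
        # The batch includes "labels", so the model returns the loss directly.
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()  # clear gradients before the next batch
        progress_bar.update(1)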

  0%|          | 0/96 [00:00<?, ?it/s]

### Evaluate

Just like how you added an evaluation function to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), you need to do the same when you write your own training loop. But instead of calculating and reporting the metric at the end of each epoch, this time you'll accumulate all the batches with `add_batch` and calculate the metric at the very end.

In [None]:
import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.5}
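The accumulate-then-compute pattern can be sketched with a small stand-in class. The class below is hypothetical and only mimics the `add_batch` / `compute` flow on plain Python lists; it is not how the 🤗 Evaluate library is implemented.

```python
class RunningAccuracy:
    """Toy stand-in that accumulates counts across batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Only running counts are stored, not the predictions themselves.
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


running = RunningAccuracy()
running.add_batch(predictions=[0, 3, 4, 1], references=[0, 3, 2, 1])  # 3 of 4
running.add_batch(predictions=[2, 2], references=[4, 2])              # 1 of 2
final = running.compute()
print(final)
```

Because the counts are pooled before dividing, the result (4 correct out of 6) is the accuracy over the whole evaluation set, even though the batches have different sizes.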

<a id='additional-resources'></a>

## Additional resources

For more fine-tuning examples, refer to:

- [🤗 Transformers Examples](https://github.com/huggingface/transformers/tree/main/examples) includes scripts
  to train models on common NLP tasks in PyTorch and TensorFlow.

- [🤗 Transformers Notebooks](https://huggingface.co/docs/transformers/main/en/notebooks) contains various notebooks on how to fine-tune a model for specific tasks in PyTorch and TensorFlow.