If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [1]:
! pip install datasets transformers[torch] accelerate

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Do

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to create an access token on the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste your token when prompted:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS by running the following cell:

In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the Hub integration used in this notebook was introduced in that version:

In [4]:
import csv
import requests
import pandas as pd

import transformers

from datasets import Dataset

print(transformers.__version__)

4.35.2
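
The version requirement can also be checked in code rather than by eye. The sketch below is a minimal illustration using only the standard library; the helper name `is_at_least` is hypothetical, and it assumes plain numeric versions such as `4.35.2` (pre-release suffixes are not handled).

```python
def is_at_least(installed: str, required: str) -> bool:
    """Compare dotted numeric versions component by component."""
    def to_tuple(version: str) -> tuple:
        return tuple(int(part) for part in version.split("."))
    return to_tuple(installed) >= to_tuple(required)

# The version printed above satisfies the 4.11.0 requirement.
print(is_at_least("4.35.2", "4.11.0"))  # True
# Comparing the raw strings would get this case wrong, since "4.9.0" > "4.11.0" as text.
print(is_at_least("4.9.0", "4.11.0"))   # False
```

In the notebook this would be called as `is_at_least(transformers.__version__, "4.11.0")`.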


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

Some initial parameters to get started with, even if this checkpoint is not ideal for our
task of pneumonia classification:

In [5]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [6]:
# dl=1 (rather than dl=0) makes Dropbox return the raw CSV instead of a preview page
#CSV_URL = 'https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=0'
CSV_URL = 'https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=1'

df = pd.read_csv(CSV_URL)

print(df.head())

   Unnamed: 0  subject_id  hadm_id             admit_dt  Pneumonia  \
0           5          37    18052  3264-08-14 00:00:00          1   
1          14          94     8743  2656-08-18 00:00:00          1   
2          10         117    14296  3131-11-27 00:00:00          1   
3          19         184      203  3251-04-30 00:00:00          1   
4          18         184    17249  3251-03-19 00:00:00          1   

                                                text  
0  \n\n\n     DATE: [**3264-8-14**] 10:57 AM\n   ...  
1  \n\n\n     DATE: [**2656-8-19**] 4:17 PM\n    ...  
2  \n\n\n     DATE: [**3131-11-28**] 1:30 PM\n   ...  
3  \n\n\n     DATE: [**3251-5-1**] 3:18 PM\n     ...  
4  \n\n\n     DATE: [**3251-3-19**] 3:18 PM\n    ...  
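
The only difference between the two URLs above is the `dl` query parameter: with `dl=0` a Dropbox shared link serves an HTML preview page, while `dl=1` returns the file itself, which is what `pd.read_csv` needs. As a small sketch (the helper name `force_direct_download` is hypothetical), the parameter can be set programmatically with the standard library:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def force_direct_download(url: str) -> str:
    """Return the same URL with its `dl` query parameter set to 1."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # keeps the other parameters, e.g. rlkey
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))

preview_url = (
    "https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/"
    "mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=0"
)
direct_url = force_direct_download(preview_url)
print(direct_url.endswith("dl=1"))  # True
```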


In [7]:
# now that we have a dataframe, here's a way to iterate through the rows

all_dataset_dicts = []

for index, row in df.iterrows():
  text = row['text']
  label = row['Pneumonia']

  # key values of text and label
  row_dict = {'text': text, 'label': label}
  all_dataset_dicts.append(row_dict)

print(f'len(all_dataset_dicts): {len(all_dataset_dicts)}')

len(all_dataset_dicts): 200


In [13]:
# now that we have all of the data, let's turn this into a type (Dataset) which HuggingFace recognizes

dataset_before_split = Dataset.from_list(all_dataset_dicts)

In [14]:
# now let's split this up into train and test:

dataset = dataset_before_split.train_test_split(test_size=0.3)

print(type(dataset))

<class 'datasets.dataset_dict.DatasetDict'>


The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which here contains one key for the training set and one for the test set.

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 140
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 60
    })
})
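
Because `train_test_split` shuffles the rows at random, it can be worth checking that both classes are represented in each split. The helper below is a sketch (`label_counts` is a hypothetical name); it works on any iterable of dictionaries with a `label` key, which is what iterating over a split yields. The toy rows are made up for illustration and are not taken from the corpus.

```python
from collections import Counter

def label_counts(rows):
    """Count how many examples carry each label."""
    return Counter(row["label"] for row in rows)

# Made-up rows standing in for a split such as dataset["train"].
toy_split = [
    {"text": "report a", "label": 1},
    {"text": "report b", "label": 0},
    {"text": "report c", "label": 1},
]
counts = label_counts(toy_split)
print(dict(counts))  # {1: 2, 0: 1}
```

In the notebook, `label_counts(dataset["train"])` and `label_counts(dataset["test"])` would give the class balance of each split. Passing a `seed` argument to `train_test_split` makes the split reproducible across runs.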

To access an actual element, you need to select a split first, then give an index:

In [16]:
dataset["train"][0]

{'text': '\n\n\n     DATE: [**2879-6-10**] 12:52 PM\n     CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 10241**]\n     Reason: assess for infiltrate                                       \n     ______________________________________________________________________________\n     UNDERLYING MEDICAL CONDITION:\n      47 year old man with fever                                                      \n     REASON FOR THIS EXAMINATION:\n      assess for infiltrate                                                           \n     ______________________________________________________________________________\n     WET READ: KKXa SAT [**2879-6-10**] 2:20 PM\n      Evidence of fluid overload.\n      Left basilar consolidation.\n     WET READ AUDIT #1 KKXa SAT [**2879-6-10**] 1:37 PM\n      Evidence of fluid overload.\n      Possible consolidation at the left base, but the opacity may be due to\n      technique or overlying soft tissue and PA and l

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [17]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [18]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"\n\n\n DATE: [**2506-4-16**] 7:46 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 11115**]\n Reason: assess degree of CHF\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 75 year old man with S/P r NEPHRECTOURETERECTOMY, CAD, HTN, DM.\n REASON FOR THIS EXAMINATION:\n assess degree of CHF\n ______________________________________________________________________________\n FINAL REPORT\n PORTABLE CHEST-20:09:\n \n INDICATION: S/P nephrectomy and ureterectomy; fluid changes suspected.\n \n The catheter is seen extending from below presumably in the right main\n pulmonary artery; presumably this is a Swan-Ganz catheter.\n \n Diffuse opacification of the right hemithorax is noted. The appearance\n suggests an asymmetric congestive heart failure pattern. Was this patient on\n his right side for a prolonged period of time? There is no evidence for\n pneumothorax. The left lung is clear. Pulmonary vascular markings are also\n more prominent on the right.\n \n Follow up is recommended to evaluate for progression of air space disease.\n \n IMPRESSION: Asymmetric opacification of the right hemithorax. Pattern suggests\n congestive features although superimposed pneumonia cannot be excluded. Follow\n up is recommended along with clinical correlation. See above.\n \n\n",1
1,"\n\n\n DATE: [**3264-8-14**] 10:57 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 10698**]\n Reason: 68 yo M with CHF and possible pna, now with acute worsening \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 68 year old man with sob, hx of chf, cad s/p CABG\n REASON FOR THIS EXAMINATION:\n 68 yo M with CHF and possible pna, now with acute worsening of SOB and\n increasing O2 requirement, eval for worsening failure, new infiltrates or pulm\n process\n ______________________________________________________________________________\n FINAL REPORT\n PORTABLE CHEST:\n \n CLINICAL INDICATION: Worsening shortness of breath and oxygen requirement.\n \n Comparison is made to prior chest radiographs of [**3264-1-29**] and [**8-14**], 2003.\n \n The heart is enlarged but stable in size. There is persistent upper zone\n vascular redistribution and perihilar haziness. Additionally, there is an\n area of increased opacity in the lingula which obscures the left heart border.\n Note is made of disruption and malalignment of the sternal wires which is\n unchanged dating back to [**3264-1-26**].\n \n IMPRESSION:\n 1) Confluent lingular opacity concerning for pneumonia. When the patient's\n condition permits, more complete evaluation with standard PA and lateral chest\n radiographs is recommended.\n 2) Mild congestive heart failure pattern.\n 3) Malalignment and disruption of multiple sternal wires which appears to be\n chronic. This may be due to a chronic sternal dehiscence.\n\n",1
2,"\n\n\n DATE: [**2900-2-18**] 3:17 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 2600**]\n Reason: iabp placement s/p cabg\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 76 year old man with \n REASON FOR THIS EXAMINATION:\n iabp placement s/p cabg\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: IABP placement, status post CABG.\n \n AP SUPINE CHEST: The patient is status post median sternotomy. ET tube\n projects 4 cm above the carina. IABP tip projects 4.7 cm in the aortic arch.\n NG tube is in good position. Right IJ approach Swan-Ganz tip projects over\n the main pulmonary artery, perhaps within the proximal left main pulmonary\n artery. There is expected post surgical linear atelectasis bilaterally.\n There is no pneumothorax or significant effusion.\n \n IMPRESSION: Tubes and lines as described. IABPD projects 4.7 cm from the\n aortic arch.\n \n\n",1
3,"\n\n\n DATE: [**2799-4-10**] 12:30 PM\n CHEST (PA & LAT) Clip # [**Clip Number (Radiology) 15050**]\n Reason: assess for infitrate, tumor, failure \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 79 year old man with cp \n REASON FOR THIS EXAMINATION:\n assess for infitrate, tumor, failure \n ______________________________________________________________________________\n FINAL REPORT (REVISED)\n INDICATION: History of chest pain, evaluate for infiltrate, tumor, or CHF.\n \n COMPARISON: [**2798-11-3**].\n \n PA AND LATERAL CHEST RADIOGRAPHS: Again noted is cardiomegaly. The\n mediastinal contours are stable in appearance. There are bilateral pleural\n effusions. The right effusion was present previously, however, the left\n effusion is new. There are multiple calcified pleural plaques consistent with\n asbestos exposure. No pneumothorax is seen. Degenerative changes within the\n mid thoracic spine are again noted.\n \n IMPRESSION:\n 1. Multiple calcified pleural plaques, consistent with asbestos exposure.\n 2. Bilateral pleural effusions. The right effusion is slightly increased.\n The left effusion is new. In the setting of a new pleural effusion and prior\n asbestos exposure, a CT scan to evaluate for an underlying mass such as a\n mesothelioma is recommended. \n\n",0
4,"\n\n\n DATE: [**3371-11-10**] 4:27 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 11136**]\n Reason: desaturations and increasing SOB\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 76 year old woman with COPD and requiring intubation. Extubated 3 days ago.\n Was doing well, but now w/ increasing SOB and o2 requirement. Recent +\n cardiac enzymes and echo showing ef30%; also being treated for PNA.\n REASON FOR THIS EXAMINATION:\n desaturations and increasing SOB\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: 76 y/o woman with COPD and increasing shortness of breath and\n hypoxia.\n \n COMPARISON: AP portable radiograph dated [**3371-11-5**].\n \n This study is limited due to extensive motion artifact. The heart size is\n normal. Mediastinal and hilar contours are normal. The pulmonary vascularity\n is normal. A linear opacity is noted in the retrocardiac space, likely\n representing atelectasis. The lungs are hyperinflated. Calcifications are\n noted along the left apical pleural surface.\n \n The soft tissue and osseous structures are unremarkable. There are no pleural\n effusions.\n \n IMPRESSION: 1) Limited study due to extensive motion artifact. No pleural\n effusions or infiltrates identified. Recommend repeat PA & lateral with\n improved breath hold. 2) Left apical pleural calcifications. 3) Hyperinflation\n of the lungs, consistent with emphysema.\n\n",1
5,"\n\n\n DATE: [**2621-6-11**] 7:45 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 6386**]\n Reason: evalaute for infiltrate, line placement \n Admitting Diagnosis: FAILURE TO WEAN\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 67 year old woman with PNA, transferred from OSH for failure to wean. \n REASON FOR THIS EXAMINATION:\n evalaute for infiltrate, line placement \n ______________________________________________________________________________\n FINAL REPORT\n INDICATIONS: Line placement. \n \n PORTABLE AP CHEST: No prior studies are available for comparison.\n \n There is placement of ET tube, with the distal tip about 2.5 cm above the\n carina. Also noted is a right subclavian central line, with the tip in the\n distal SVC. There is no evidence of pneumothorax. There is opacity in the\n right upper lobe with right minor fissure slightly elevated, consistent with a\n slight volume loss of the right upper lobe. Also noted is mild prominence of\n pulmonary vasculature along with some upper zone redistribution. There are\n bilateral pleural effusions noted. The right and the left hemidiaphragms are\n not well visualized which most likely represents atelectasis at the bases\n bilaterally. The heart is upper limits of normal in size. Visualized osseous\n structures appear unremarkable.\n \n IMPRESSION:\n \n 1. Findings consistent with right upper lobe pneumonia with mild component\n of volume loss. Follow up study is recommended to ensure resolution and to\n exclude an obstructing lesion.\n \n 2. Mild CHF with bilateral small pleural effusions noted.\n \n 3. Placement of ET tube and right subclavian central line, both of which are\n in good position. No evidence of pneumothorax.\n \n \n \n\n",1
6,"\n\n\n DATE: [**2596-5-10**] 5:45 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 8395**]\n Reason: please eval lung fields; thanks \n Admitting Diagnosis: PANCREATIC ABSCESS\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 51 year old man with pancr cysts, recent SOB starting 1 day ago, also s/p \n right thoracentesis. \n REASON FOR THIS EXAMINATION:\n please eval lung fields; thanks \n ______________________________________________________________________________\n FINAL REPORT\n HISTORY: 51-year-old male with pancreatic cysts, now status post right\n thoracentesis with dyspnea.\n \n COMPARISON: Chest radiographs, [**5-8**] through [**2596-5-9**].\n \n SUPINE PORTABLE CHEST: Endotracheal tube tip is 5 cm above the carina. Left\n subclavian central line tip is in the proximal SVC. Compared to the earliest\n chest film on [**5-8**], there has been progressive mediastinal widening and\n increase in size of the cardiac silhouette concerning for elevated central\n venous pressure and pericardial effusion. Given differences in patient\n positioning compared to [**5-9**], there is probable decrease in size of right\n pleural effusion which is now moderate to large. There remains a persistent\n small left pleural effusion. Prominent interstitial markings are compatible\n with mild interstitial edema.\n \n The results of this study were discussed with Dr. [**First Name4 (NamePattern1) 792**] [**Last Name (NamePattern1) 8396**] on [**2596-5-10**].\n \n\n",1
7,"\n\n\n DATE: [**2603-7-21**] 11:28 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 2628**]\n Reason: Sob \n Admitting Diagnosis: MI,CHF\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n [**Age over 90 **] year old woman with SOB, CHF, NSTEMI , acute SOB \n \n REASON FOR THIS EXAMINATION:\n Sob \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Short of breath and congestive heart failure and non ST elevation\n MI.\n \n COMPARISON: [**2603-7-21**].\n \n TECHNIQUE: Single AP portable upright chest.\n \n COMMENT: The heart size and mediastinal contours are unchanged. There is\n marked interval increase in congestive heart failure, with patchy bilateral\n and perihilar opacities. Increased size in bilateral pleural effusions, left\n greater than right. No pneumothorax. The surrounding osseous structures are\n unchanged.\n \n IMPRESSION: Interval increase in congestive heart failure and bilateral\n pleural effusions. Stable cardiomegaly.\n \n\n",1
8,"\n\n\n DATE: [**3128-6-21**] 11:59\n CHEST (PORTABLE AP); -77 BY DIFFERENT PHYSICIAN [**Name Initial (PRE) 29**] # [**Clip Number (Radiology) 3200**]\n Reason: ET placement \n Admitting Diagnosis: SEPSIS\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 80 year old man with found with sepsis \n \n REASON FOR THIS EXAMINATION:\n ET placement \n ______________________________________________________________________________\n FINAL REPORT\n AP CHEST 11:08 [**Initials (NamePattern4) 1442**] [**6-21**]\n \n HISTORY: Sepsis. Check ET tube placement.\n \n IMPRESSION: AP chest compared to [**6-21**]:\n \n Bibasilar consolidation has increased substantially, particularly on the\n right, now accompanied by a small-to-moderate pleural effusion. There is some\n pulmonary edema, heart size is normal, moderate azygous distention is stable,\n and the lung findings are concerning for progressive pneumonia. ET tube and\n right internal jugular line are in standard placements. Nasogastric tube ends\n in moderately distended stomach. Note is made of severe pagetoid changes in\n the right shoulder.\n \n\n",0
9,"\n\n\n DATE: [**3406-5-30**] 9:43 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 13301**]\n Reason: fx \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 81 year old woman with fz \n REASON FOR THIS EXAMINATION:\n fx \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Patient with seizure.\n \n Low lung volumes. Bilateral basilar opacities, considerably larger at the\n left base than at the right. There is widening of the superior mediastinum\n and slight displacement of the trachea, which may represent an enlarged\n thyroid.\n \n IMPRESSION: Lung volumes with bilateral basilar opacities.\n \n Question substernal thyroid enlargement.\n \n Recommend followup chest x-ray.\n \n\n",1


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [21]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
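
In the output above, the IDs `101` and `102` are the special `[CLS]` and `[SEP]` tokens of this checkpoint's vocabulary: the pair is encoded as `[CLS] first sentence [SEP] second sentence [SEP]`. The sketch below only re-reads the printed `input_ids` to make that layout visible; `split_on_sep` is a hypothetical helper, not part of the library.

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs in this checkpoint's vocabulary

def split_on_sep(input_ids):
    """Split an encoded sequence into segments, dropping [CLS] and [SEP]."""
    segments, current = [], []
    for token_id in input_ids:
        if token_id == CLS_ID:
            continue
        if token_id == SEP_ID:
            segments.append(current)
            current = []
        else:
            current.append(token_id)
    return segments

input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]
first, second = split_on_sep(input_ids)
print(len(first), len(second))  # 6 7
```

The `attention_mask` is all ones here because nothing was padded.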

## Fine-tuning the model

Now that our data is ready, we can download a pretrained base model and fine-tune it. Since our task is document classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem, which is 2 here (the `Pneumonia` column is 0 or 1):

In [22]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `pre_classifier` and `classifier` layers) are randomly initialized. This is absolutely normal in this case: the head used to pretrain the model on a masked language modeling objective (including the `vocab_transform` and `vocab_layer_norm` layers) is dropped and replaced with a new classification head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [24]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-pneumonia",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
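
As a rough worked example of what these settings imply: the training split has 140 examples and the test split 60, so with a batch size of 16 each epoch consists of 9 optimizer steps and each evaluation of 4 batches. This sketch assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

num_train_examples = 140  # rows in dataset["train"]
num_eval_examples = 60    # rows in dataset["test"]
batch_size = 16
num_train_epochs = 5

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
eval_batches = math.ceil(num_eval_examples / batch_size)

print(steps_per_epoch, total_steps, eval_batches)  # 9 45 4
```

With `evaluation_strategy` and `save_strategy` both set to `"epoch"`, an evaluation and a checkpoint would happen after every 9 steps under these assumptions.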

If you want to push the model to the [Hub](https://huggingface.co/models) regularly during training, add `push_to_hub=True` to the `TrainingArguments` above; it is not set in this notebook. It only works if you followed the installation steps at the top of the notebook. If you want to save your model locally under a name different from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-mrpc"` or `"huggingface/bert-finetuned-mrpc"`).

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this; the only preprocessing we have to do is to take the argmax of our predicted logits, after which accuracy is the fraction of predictions that match the labels:

In [25]:
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the reference labels
    predictions, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    # The "accuracy" key matches metric_name; the Trainer reports it as eval_accuracy
    return {"accuracy": float((predictions == labels).mean())}
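
To make the argmax step concrete, here is a small sketch in plain Python with made-up logits (they are not model outputs). Each row holds the two class scores for one example; the predicted class is the index of the larger score, and accuracy is the fraction of predictions that match the labels. `np.argmax(predictions, axis=1)` performs the same row-wise argmax in one call.

```python
def argmax_rows(logits):
    """Return the index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predictions, labels):
    """Fraction of predictions equal to the reference labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for four examples and two classes.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 0]

toy_predictions = argmax_rows(toy_logits)
print(toy_predictions)                        # [1, 0, 1, 0]
print(accuracy(toy_predictions, toy_labels))  # 0.75
```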

Then we just need to pass all of this along with our datasets to the `Trainer`. The datasets must be tokenized first, since the `Trainer` expects `input_ids` rather than raw text. Passing the raw `dataset` instead makes `trainer.train()` fail with `ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']`.

In [27]:
# Tokenize both splits; truncation keeps each report within the model's maximum input length
encoded_dataset = dataset.map(lambda examples: tokenizer(examples["text"], truncation=True), batched=True)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive the samples like the dictionaries seen above and will need to return a dictionary of tensors.

We can now fine-tune our model by just calling the `train` method:

In [28]:
trainer.train()
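
The padding step that the `Trainer` delegates to the tokenizer's `pad` method can be pictured with a short sketch. It is an illustration only, not the library's implementation: `pad_batch` is a hypothetical helper, and it assumes right-side padding with pad token ID `0`, which matches the `[PAD]` token of this checkpoint's vocabulary.

```python
PAD_ID = 0  # assumed ID of the [PAD] token

def pad_batch(batch_input_ids):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}

batch = pad_batch([[101, 7592, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 7592, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask tells the model to ignore the padded positions.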

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()


You can now upload the result of the training to the Hub, just execute this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")
```

## Hyperparameter search

The `Trainer` supports hyperparameter search using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/). For this last section you will need either of those libraries installed, just uncomment the line you want on the next cell and run it.

In [None]:
# ! pip install optuna
# ! pip install ray[tune]

During hyperparameter search, the `Trainer` will run several trainings, so it needs to have the model defined via a function (so it can be reinitialized at each new run) instead of just having it passed. We just wrap the same `from_pretrained` call as before in a function:

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

And we can instantiate our `Trainer` like before:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The method we call this time is `hyperparameter_search`. Note that it can take a long time to run on the full dataset. You can try to find some good hyperparameters on a portion of the training dataset by replacing the `train_dataset` line above with:
```python
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)
```
for 1/10th of the dataset. Then you can run a full training on the best hyperparameters picked by the search.

In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective maximized (by default the sum of all metrics) and the hyperparameters it used for that run.

In [None]:
best_run

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
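
As a sketch of both hooks for the optuna backend: the search space below and its ranges are illustrative assumptions, not recommended values, and the function names `optuna_hp_space` and `accuracy_objective` are hypothetical. The `eval_accuracy` key assumes `compute_metrics` returns an `accuracy` entry, which the `Trainer` prefixes with `eval_`.

```python
def optuna_hp_space(trial):
    """Search space: each key must be the name of a TrainingArguments field."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

def accuracy_objective(metrics):
    """Objective to maximize: evaluation accuracy only."""
    return metrics["eval_accuracy"]
```

These would be passed as `trainer.hyperparameter_search(hp_space=optuna_hp_space, compute_objective=accuracy_objective, n_trials=10, direction="maximize")`.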

To reproduce the best training, just set the hyperparameters in your `TrainingArguments` before creating a `Trainer`:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()