If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [None]:
! pip install datasets transformers[torch] accelerate evaluate



If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, create an access token on the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste the token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS by running the following cell:

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the model-sharing functionality was introduced in that version:

In [None]:
import csv
import requests
import pandas as pd

import sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import evaluate

import transformers

from datasets import Dataset

print(transformers.__version__)
print(sklearn.__version__)

4.35.2
1.2.2
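The version requirement above can also be turned into an explicit check. The sketch below is only an illustration: `parse_release` and `meets_minimum` are hypothetical helper names, and the comparison assumes plain dotted release numbers such as `4.35.2` (pre-release suffixes would need a proper version parser).

```python
def parse_release(version_string):
    """Turn '4.35.2' into (4, 35, 2); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version_string.split("."))


def meets_minimum(installed, required="4.11.0"):
    """Return True when the installed release is at least the required one."""
    return parse_release(installed) >= parse_release(required)


# The version printed by the cell above, compared with the stated minimum.
print(meets_minimum("4.35.2"))

# An older release, used here only as a counterexample.
print(meets_minimum("4.9.1"))
```

Comparing tuples of integers avoids the pitfall of comparing the raw strings, where `"4.9.1"` would sort after `"4.11.0"`.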


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

We start with some initial parameters: a general-purpose checkpoint and a batch size. This model is not ideal for our task of pneumonia classification, but it is enough to get started.

In [None]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [None]:
# put a dl=1 here
#CSV_URL = 'https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=0'
CSV_URL = 'https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=1'

df = pd.read_csv(CSV_URL)

print(df.head())

   Unnamed: 0  subject_id  hadm_id             admit_dt  Pneumonia  \
0           5          37    18052  3264-08-14 00:00:00          1   
1          14          94     8743  2656-08-18 00:00:00          1   
2          10         117    14296  3131-11-27 00:00:00          1   
3          19         184      203  3251-04-30 00:00:00          1   
4          18         184    17249  3251-03-19 00:00:00          1   

                                                text  
0  \n\n\n     DATE: [**3264-8-14**] 10:57 AM\n   ...  
1  \n\n\n     DATE: [**2656-8-19**] 4:17 PM\n    ...  
2  \n\n\n     DATE: [**3131-11-28**] 1:30 PM\n   ...  
3  \n\n\n     DATE: [**3251-5-1**] 3:18 PM\n     ...  
4  \n\n\n     DATE: [**3251-3-19**] 3:18 PM\n    ...  
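The `dl=1` comment in the cell above matters because a Dropbox share link ending in `dl=0` returns a preview page rather than the file, so `pd.read_csv` would not receive CSV data. A minimal sketch of switching that flag programmatically follows; `force_direct_download` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_direct_download(url):
    """Return the same URL with its `dl` query parameter set to 1."""
    parts = urlsplit(url)
    # Keep every query parameter except an existing `dl`, then add dl=1.
    params = [(key, value) for key, value in parse_qsl(parts.query) if key != "dl"]
    params.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(params)))


shared_link = (
    "https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/"
    "mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=0"
)
print(force_direct_download(shared_link))
```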


In [None]:
# now that we have a dataframe, here's a way to iterate through the rows

all_dataset_dicts = []

for index, row in df.iterrows():
  text = row['text']
  label = row['Pneumonia']

  # key values of text and label
  row_dict = {'text': text, 'label': label}
  all_dataset_dicts.append(row_dict)

print(f'len(all_dataset_dicts): {len(all_dataset_dicts)}')

len(all_dataset_dicts): 200


In [None]:
# now that we have all of the data, let's turn this into a type (Dataset) which HuggingFace recognizes

dataset_before_split = Dataset.from_list(all_dataset_dicts)

In [None]:
# now let's split this up into train and test:

dataset = dataset_before_split.train_test_split(test_size=0.3)

print(type(dataset))

<class 'datasets.dataset_dict.DatasetDict'>


The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split; here those are `train` and `test`.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 140
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 60
    })
})
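Because `train_test_split` shuffles the rows before cutting them, every run of the notebook produces a different train/test split unless a seed is fixed (the method accepts a `seed` argument for this). The sketch below illustrates the idea with the standard library only. It is not the library's implementation, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_rows, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = round(num_rows * test_size)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))

# Same seed, same split: metrics become comparable across runs.
assert split_indices(200) == (train_idx, test_idx)
```

With only 200 documents, a different random split can move the reported metrics noticeably, so fixing the seed helps when comparing hyperparameter settings.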

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][0]

{'text': '\n\n\n     DATE: [**3134-10-10**] 1:50 PM\n     CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 9288**]\n     Reason: eval for placement of right IJ                              \n     Admitting Diagnosis: SEIZURE\n     ______________________________________________________________________________\n     UNDERLYING MEDICAL CONDITION:\n       [**Age over 90 **] year old man with seizure/altered mental status, s/p right IJ placement.    \n                                                           \n     REASON FOR THIS EXAMINATION:\n      eval for placement of right IJ                                                  \n     ______________________________________________________________________________\n                                     FINAL REPORT\n     CHEST SINGLE AP FILM.\n     \n     History of seizures with right jugular CV line placement.\n     \n     Endotracheal tube is 2 cm above carina.  Tip of right jugular CV line

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"\n\n\n DATE: [**2944-4-10**] 8:28 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 16298**]\n Reason: PLEASE COMPARE WITH [**2944-4-9**] CXR TO ASSESS PNEUMOTHORAX \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 77 year old man with dyspnea, hx/o lung ca verify tube placement \n \n REASON FOR THIS EXAMINATION:\n PLEASE COMPARE WITH [**2944-4-9**] CXR TO ASSESS PNEUMOTHORAX \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Lung CA, dyspnea, tube placement.\n \n AP PORTABLE SEMI-UPRIGHT CHEST: Left-sided chest tube projects over the left\n lung apex. There is extensive subcutaneous emphysema. Right subclavian line\n projects its tip over the distal SVC. There is a persistent minute left\n apical pneumothorax. The heart is probably enlarged. The aorta is unfolded\n and ectatic. Compared to the exam one day previously, the retrocardiac patchy\n atelectasis appears slightly improved. Extensive left upper quadrant\n surgical clips are noted.\n \n IMPRESSION:\n \n 1. No significant change in minute left apical pneumothorax.\n 2. Perhaps slight interval improvement in retrocardiac lower lobe\n atelectasis. \n\n",1
1,"\n\n\n DATE: [**2721-6-30**] 9:45 PM\n CHEST (PA & LAT) Clip # [**Clip Number (Radiology) 13923**]\n Reason: Pleural effusions? Nodules? \n Admitting Diagnosis: LYMPHOMA\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 37 year old woman with DVT, ? PE on VQ, with workup for probable malignancy. \n REASON FOR THIS EXAMINATION:\n Pleural effusions? Nodules? \n ______________________________________________________________________________\n FINAL REPORT\n HISTORY: 37 year old woman with DVT and ?PE. Being worked up for a probable\n malignancy.\n \n CHEST PA AND LATERAL: The heart size is normal. There is an area of\n increased opacity lateral to the right paratracheal stripe. In the right\n upper lobe, there is a small focal opacity. The lungs are otherwise clear.\n There are no pleural effusions. Osseous and soft-tissue structures are\n unremarkable.\n \n IMPRESSION: Small focal opacity in right upper lobe and right paratracheal\n opacity. In the setting of possibly malignancy, CT scan of the chest is\n recommended for further evaluation.\n\n",0
2,"\n\n\n DATE: [**3060-1-8**] 3:40 PM\n CHEST (PORTABLE AP); -59 DISTINCT PROCEDURAL SERVICE Clip # [**Clip Number (Radiology) 8574**]\n Reason: please eval for CHF/PNA \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 75 year old man with hypotension and recent V tach. \n REASON FOR THIS EXAMINATION:\n please eval for CHF/PNA \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: 75-year-old male with hypertension and recent V-TACH. Evaluate\n for CHF or pneumonia.\n \n COMPARISONS: Comparison is made to [**3059-12-21**].\n \n TECHNIQUE: AP semi-upright single view of the chest.\n \n FINDINGS: There is a left chest wall AICD biventricular pacer with leads in\n unchanged position. There is again noted cardiomegaly. The heart size is\n stable when compared to prior study. There is interval increase in opacity\n involving the entire right lung. There are also more patchy opacities\n involving the left mid and lower lung zones. There appears to be a left\n pleural effusion. These findings could represent asymmetric pulmonary edema\n Vs. an infectious process involving the right lung.\n \n IMPRESSION:\n 1. Findings are most consistent with asymmetric pulmonary edema. However,\n superimposed infection cannot be excluded.\n 2. Left pleural effusion.\n 3. Stable cardiomegaly.\n\n",1
3,"\n\n\n DATE: [**2709-6-16**] 10:28 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 13424**]\n Reason: r/out pna, edema \n Admitting Diagnosis: PULMONARY EDEMA\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 84 year old man with AS, here with sob \n REASON FOR THIS EXAMINATION:\n r/out pna, edema \n ______________________________________________________________________________\n FINAL REPORT\n REASON FOR EXAMINATION: Shortness of breath in a patient with aortic\n stenosis.\n \n Portable AP chest radiograph compared to [**2705-2-8**].\n \n The heart size is mildly enlarged but grossly unchanged. The aorta is\n tortuous and calcified. The lungs are hyperinflated. This most likely\n represent unlike emphysema. Perihilar opacities involving the lower lobes are\n demonstrated, right slightly worse than left and might represent pulmonary\n edema with asymmetric appearance due to underlying emphysema. Small right\n pleural effusion cannot be excluded. The slight asymmetry between the lungs\n might represent underlying right lower lobe infectious process which can be\n better characterized after resolving of pulmonary edema.\n \n\n",1
4,"\n\n\n DATE: [**2950-2-16**] 7:24 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 8312**]\n Reason: ? ett placement, ? infiltrates \n Admitting Diagnosis: HEMOPTYSIS,LUNG CA\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 64 year old man with NSCL CA presenting intubated with hemoptysis \n REASON FOR THIS EXAMINATION:\n ? ett placement, ? infiltrates \n ______________________________________________________________________________\n FINAL REPORT\n CHEST PORTABLE\n \n INDICATION: 64-year-old man with non-small cell lung cancer with hemoptysis,\n evaluate for ET tube placement.\n \n CHEST PORTABLE: No prior studies are available for comparison. The heart\n size is difficult to evaluate. The left lung is normally aerated. There is\n pleural thickening surrounding the right lung and right mediastinal\n lymphadenopathy. Patchy opacities are seen throughout the right lung which\n could be due to atelectasis from compression of the right lung or represent\n air space consolidation. An ET tube is identified with tip 6.3 cm from the\n carina. An NG tube is seen, the tip of which is not depicted on this film but\n the tube reaches the stomach.\n \n IMPRESSION:\n \n 1. ET tube with its tip 6.3 cm from the carina.\n \n 2. Right mediastinal lymphadenopathy and pleural thickening/pleural effusion\n surrounding the right lung.\n \n\n",1
5,"\n\n\n DATE: [**3495-10-11**] 11:38 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 10446**]\n Reason: r/o pneumonia \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n [**Age over 90 **] year old woman with worsening sob x 2 days. now with hypoxia and hemoptysis \n REASON FOR THIS EXAMINATION:\n r/o pneumonia \n ______________________________________________________________________________\n FINAL REPORT\n This is a portable upright chest dated [**3495-10-11**], with clinical\n indication of worsening shortness of breath for 2 days. Hemoptysis and\n hypoxia.\n \n The cardiac silhouette is enlarged but is difficult to fully assess due to\n obscuration of the right heart border. Pulmonary vascularity is slightly\n engorged and there is perihilar haziness. There are multifocal areas of\n consolidation present in the lower lobes, right greater than left, and there\n is a more confluent opacity in the right perihilar region.\n Collapse/consolidation of right middle lobe is also noted. There are bilateral\n pleural effusions, moderate on the right and small on the left.\n \n IMPRESSION:\n \n 1. Multilobar consolidation, which could reflect asymmetrical edema and/or\n multilobar pneumonia. A postobstructive process in the right middle lobe\n cannot be excluded. By report, the patient is scheduled to undergo CTA, which\n will be helpful for more complete characterization of these findings.\n 2. Bilateral pleural effusions, right greater than left.\n \n\n",1
6,"\n\n\n DATE: [**2776-2-17**] 9:49 AM\n CHEST (PA & LAT) Clip # [**Clip Number (Radiology) 10141**]\n Reason: Please evaluate for chf vs pneumonia. \n Admitting Diagnosis: CORONARY ARTERY DISEASE\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 83 year old woman with sob, cough, recent MI. please evaluate for chf and \n pneumonia.\n REASON FOR THIS EXAMINATION:\n Please evaluate for chf vs pneumonia. \n ______________________________________________________________________________\n FINAL REPORT\n CHEST PA AND LATERAL FROM [**2-17**]\n \n HISTORY: Shortness of breath and cough. Recent MI. Please evaluate for CHF\n versus pneumonia.\n \n AP upright and left lateral views of the chest show interval clearing of\n pleural effusion seen on the patient's prior portable study from [**2776-1-14**]. No focal consolidation is seen to suggest pneumonia and the pulmonary\n vasculature is not congested. Moderate cardiomegaly appears stable and soft\n tissue mass at the right upper mediastinum has been shown to be vascular on\n previous cross sectional imaging studies. Tunneled dialysis tubing is seen\n with distal tip at the level of the right atrium and the proximal tip at the\n level of the SVC/right atrial junction. Calcified atherosclerotic plaque is\n seen in the arch of the aorta.\n \n CONCLUSION: No CHF or pneumonia.\n \n\n",1
7,"\n\n\n DATE: [**3287-4-16**] 7:33 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 14455**]\n Reason: r/o pneu \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 82 year old woman with hypotension \n REASON FOR THIS EXAMINATION:\n r/o pneu \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Hypotension.\n \n CHEST X-RAY, PORTABLE AP: An endotracheal tube is present which lies\n immediately below the thoracic inlet, 3 cm from the carina. There is a right\n internal jugular central venous line with tip in the mid superior vena cava. A\n nasogastric tube is present which passes below the level of the diaphragm.\n Linear opacities are present at the right lung base, adjacent to the right\n cardiac border. This is likely consistent with atelectasis. No infiltrates\n or consolidations are present. There is no pneumothorax. Slight blunting is\n present at the right lung base, likely consistent with a small amount of\n pleural fluid. The osseous structures are unremarkable.\n \n IMPRESSION:\n \n 1) Tubes and lines as described above.\n \n 2) No acute infiltrate or consolidation.\n\n",1
8,"\n\n\n DATE: [**2948-10-10**] 7:32 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 12144**]\n Reason: Bilateral PNA \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 74s/ year old male with bilat PNA. \n \n REASON FOR THIS EXAMINATION:\n Bilateral PNA \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: 74-year-old man with bilateral pneumonia.\n \n Portable AP view of the chest performed [**2948-10-10**] at 19:27 hours: Comparison is\n made with the prior study performed the same day at 00:06 hours. An ET tube,\n NG tube and left central venous catheter are again noted essentially unchanged\n in position. The heart size is within normal limits. The aorta is calcified.\n There has been no significant change in the patchy right-sided parenchymal\n abnormalities, including air space opacities in the right mid and lower lung\n zones. Allowing for differences in patient rotation there has been no\n significant change in the left perihilar infiltrate. The extreme right\n costophrenic angle is excluded and cannot be assessed.\n \n IMPRESSION: Allowing for differences in patient rotation there has been no\n significant change compared to prior study.\n\n",0
9,"\n\n\n DATE: [**3353-1-10**] 5:35 PM\n CHEST (PA & LAT) Clip # [**Clip Number (Radiology) 1659**]\n Reason: Patient with Hx NHL now febrile and neutropenic. Question p\n ICD9 code from order: 202.8\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 55 year old man with T cell lymphoma s/p auto BMT currently with fevers \n REASON FOR THIS EXAMINATION:\n Patient with Hx NHL now febrile and neutropenic. Question pneumonia\n ______________________________________________________________________________\n FINAL REPORT\n CHEST 2 VIEWS PA & LATERAL:\n \n HISTORY: T cell lymphoma with bone marrow transplant and fever.\n \n Heart size is normal. The lungs are clear. No pleural effusions. Left sided\n PICC line is in region of cavoatrial junction.\n \n IMPRESSION: No evidence for pneumonia. No change since prior study of [**12-13**], 05.\n\n",0


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
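To read the output above: in the BERT uncased vocabulary that this checkpoint uses, id 101 is `[CLS]` and id 102 is `[SEP]`, so the pair is encoded as `[CLS] first sentence [SEP] second sentence [SEP]`. The following sketch splits the printed `input_ids` back into the two sentences; `split_segments` is a hypothetical helper written for this illustration.

```python
CLS_ID = 101  # [CLS] in the BERT uncased vocabulary
SEP_ID = 102  # [SEP] in the BERT uncased vocabulary

# The input_ids printed by the cell above.
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]


def split_segments(ids):
    """Split one encoded sequence into its segments at each [SEP] token."""
    segments, current = [], []
    for token_id in ids:
        if token_id == CLS_ID:
            continue  # the leading [CLS] belongs to no segment
        if token_id == SEP_ID:
            segments.append(current)
            current = []
        else:
            current.append(token_id)
    return segments


first, second = split_segments(input_ids)
print(len(first), len(second))
```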

## Fine-tuning the model

Now that our data is ready, we can download a pretrained base model and fine-tune it. Since our task is document classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem, which is 2 here (pneumonia or no pneumonia):

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us we are throwing away some weights (the `vocab_transform` and `vocab_layer_norm` layers) and randomly initializing some others (the `pre_classifier` and `classifier` layers). This is absolutely normal in this case: we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
metric = evaluate.load("accuracy")

In [None]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-pneumonia",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
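As a rough sketch of what "best model" means here: the checkpoint with the highest `accuracy` is kept, and we assume an earlier epoch wins a tie because only a strict improvement replaces the current best. The accuracies below are copied from the ten epochs listed in the training log later in this notebook; `pick_best_epoch` is a hypothetical helper, not the `Trainer` implementation.

```python
# Accuracy per epoch, copied from the training log shown later in this notebook.
eval_accuracy = {
    1: 0.716667, 2: 0.716667, 3: 0.716667, 4: 0.716667, 5: 0.716667,
    6: 0.716667, 7: 0.716667, 8: 0.733333, 9: 0.733333, 10: 0.716667,
}


def pick_best_epoch(scores):
    """Return the epoch with the highest score; an earlier epoch wins a tie."""
    best_epoch, best_score = None, float("-inf")
    for epoch in sorted(scores):
        if scores[epoch] > best_score:  # strict improvement only
            best_epoch, best_score = epoch, scores[epoch]
    return best_epoch, best_score


print(pick_best_epoch(eval_accuracy))
```

This is consistent with the `trainer.evaluate()` output further down, whose `eval_loss` equals the validation loss logged at epoch 8.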

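Before we continue, we need to tokenize the dataset, that is, translate each text into the `input_ids` the model is trained on. We cap every report at 512 tokens, the maximum sequence length DistilBERT accepts: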

In [None]:
max_tokens = 512

In [None]:

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=max_tokens, add_special_tokens = True)
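With `truncation=True`, any report longer than `max_tokens` is cut off at the end by default, so the tail of a long radiology report (often the IMPRESSION section) may never reach the model. The sketch below mimics that behaviour on a list of token ids. It is a simplification and not the tokenizer's own code: `truncate_for_model` is a hypothetical helper, and the ids in `long_report` are placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the BERT uncased vocabulary


def truncate_for_model(token_ids, max_length=512):
    """Keep the leading tokens so that [CLS] + tokens + [SEP] fits max_length."""
    budget = max_length - 2  # two positions are reserved for special tokens
    kept = token_ids[:budget]
    dropped = len(token_ids) - len(kept)
    return [CLS_ID] + kept + [SEP_ID], dropped


# A made-up report of 700 tokens (the ids themselves are placeholders).
long_report = list(range(1000, 1700))
encoded, dropped = truncate_for_model(long_report)
print(len(encoded), dropped)
```

Counting how many documents lose tokens this way is a cheap sanity check before blaming the model for poor results.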


In [None]:
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Map (num_proc=4):   0%|          | 0/140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/60 [00:00<?, ? examples/s]

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Calculate accuracy
    accuracy = accuracy_score(labels, preds)

    # Calculate precision, recall, and F1-score
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    f1 = f1_score(labels, preds, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics = compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive the samples like the dictionaries seen above and will need to return a dictionary of tensors.
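To make the padding step concrete, here is a minimal sketch of what a collator has to do for one batch, assuming right-side padding with id 0 (the `[PAD]` token in the BERT uncased vocabulary). `pad_batch` is a hypothetical helper, and it returns plain lists where a real `data_collator` would return tensors.

```python
PAD_ID = 0  # [PAD] in the BERT uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 7592, 1010, 2023, 102]])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```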

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.600309,0.716667,0.513611,0.716667,0.598382
2,No log,0.596914,0.716667,0.513611,0.716667,0.598382
3,No log,0.586105,0.716667,0.513611,0.716667,0.598382
4,No log,0.585361,0.716667,0.513611,0.716667,0.598382
5,No log,0.610571,0.716667,0.513611,0.716667,0.598382
6,No log,0.606131,0.716667,0.513611,0.716667,0.598382
7,No log,0.653789,0.716667,0.513611,0.716667,0.598382
8,No log,0.677907,0.733333,0.80565,0.733333,0.63573
9,No log,0.690694,0.733333,0.704242,0.733333,0.676933
10,No log,0.74152,0.716667,0.672531,0.716667,0.664978


  _warn_prf(average, modifier, msg_start, len(result))
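The `_warn_prf` lines are the tail of a scikit-learn warning that is raised when precision is undefined for a class, most likely because the model never predicted that class. The identical metrics in the early epochs fit that reading. The sketch below recomputes the weighted metrics for a model that always predicts the positive class, under the assumption that the 60 test reports contain 43 positives and 17 negatives (the split that an accuracy of 0.716667 would imply). `weighted_scores` is a hypothetical re-implementation for illustration, not the scikit-learn code.

```python
def weighted_scores(labels, preds):
    """Accuracy plus support-weighted precision, recall and F1 (0 when undefined)."""
    total = len(labels)
    accuracy = sum(truth == guess for truth, guess in zip(labels, preds)) / total
    precision = recall = f1 = 0.0
    for cls in sorted(set(labels)):
        tp = sum(truth == cls and guess == cls for truth, guess in zip(labels, preds))
        predicted = sum(guess == cls for guess in preds)
        support = sum(truth == cls for truth in labels)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / support
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        weight = support / total
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return accuracy, precision, recall, f1


# Assumed test split: 43 positive and 17 negative reports (60 rows in total).
labels = [1] * 43 + [0] * 17
always_positive = [1] * 60

print([round(score, 6) for score in weighted_scores(labels, always_positive)])
```

Under that assumption the four numbers match the accuracy, precision, recall and F1 logged for the early epochs, which suggests the model had not yet learned anything beyond the majority class.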


TrainOutput(global_step=225, training_loss=0.2414657253689236, metrics={'train_runtime': 291.6232, 'train_samples_per_second': 12.002, 'train_steps_per_second': 0.772, 'total_flos': 463635895296000.0, 'train_loss': 0.2414657253689236, 'epoch': 25.0})
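The `global_step=225` value and the `No log` entries in the training-loss column can both be explained with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and the default `logging_steps` of 500. The 25 epochs are taken from the `TrainOutput` above.

```python
import math

train_rows = 140       # size of the train split shown earlier
batch_size = 16        # per-device batch size set at the top of the notebook
epochs_reported = 25   # the 'epoch' value in the TrainOutput above
logging_steps = 500    # TrainingArguments default, assumed unchanged here

# The last, smaller batch of each epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs_reported
print(steps_per_epoch, total_steps)

# Training loss is only logged every `logging_steps` optimizer steps.
print(total_steps < logging_steps)
```

Since the run ends before step 500, no training loss is ever logged. Passing a smaller `logging_steps` to `TrainingArguments` should make that column show values.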

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

{'eval_loss': 0.6779069900512695,
 'eval_accuracy': 0.7333333333333333,
 'eval_precision': 0.8056497175141243,
 'eval_recall': 0.7333333333333333,
 'eval_f1': 0.6357298474945534,
 'eval_runtime': 1.0786,
 'eval_samples_per_second': 55.628,
 'eval_steps_per_second': 3.709,
 'epoch': 25.0}

# OK, let's consider that to be a baseline model. In this next section, we'll work in small groups or as individuals. The goal is for each person to explore various hyperparameters that they might want to change.

## We'll do this manually today (which takes time), but it is a good way to learn and think about these parameters. In practice, you will likely use a package or strategy for automated hyperparameter search. The hyperparameters to explore could be, for example:
1. Different base model (not the one used above)
2. Number of epochs (iterations through the dataset)
3. Learning rate
4. Weight decay
5. Plus many more...
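To give a feel for why automated search is usually preferred, the sketch below enumerates every combination of the candidate values listed as commented-out options in the next cell, then draws a small random sample of them. It only builds the configurations. Training and scoring each one is left out, and the sample size of 10 is an arbitrary choice for illustration.

```python
import itertools
import random

# Candidate values taken from the commented-out options in the next cell.
search_space = {
    "learning_rate": [2e-5, 2e-4, 2e-3, 2e-2],
    "batch_size": [2, 4, 8, 16, 32],
    "num_train_epochs": [1, 2, 3, 5, 10, 15, 20, 25],
    "weight_decay": [0.5, 0.1, 0.01, 0.001, 0.0001],
}

# Full grid: every combination of every value.
names = list(search_space)
grid = [dict(zip(names, values)) for values in itertools.product(*search_space.values())]
print(len(grid))

# Random search: try only a handful of combinations, chosen reproducibly.
trials = random.Random(0).sample(grid, k=10)
for config in trials[:3]:
    print(config)
```

Even this small space contains hundreds of combinations, which is why trying them by hand does not scale beyond a classroom exercise.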

In [None]:
# create your own new TrainingArguments here

learning_rate_part_2 = None
#learning_rate_part_2 = 2e-5
#learning_rate_part_2 = 2e-4
#learning_rate_part_2 = 2e-3
#learning_rate_part_2 = 2e-2

batch_size_part_2 = None
#batch_size_part_2 = 2
#batch_size_part_2 = 4
#batch_size_part_2 = 8
#batch_size_part_2 = 16
#batch_size_part_2 = 32

num_train_epochs_part_2 = None
#num_train_epochs_part_2 = 1
#num_train_epochs_part_2 = 2
#num_train_epochs_part_2 = 3
#num_train_epochs_part_2 = 5
#num_train_epochs_part_2 = 10
#num_train_epochs_part_2 = 15
#num_train_epochs_part_2 = 20
#num_train_epochs_part_2 = 25

weight_decay_part_2 = None
#weight_decay_part_2 = 0.5
#weight_decay_part_2 = 0.1
#weight_decay_part_2 = 0.01
#weight_decay_part_2 = 0.001
#weight_decay_part_2 = 0.0001

args_part_2 = TrainingArguments(
    f"{model_name}-finetuned-pneumonia-part-2",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=learning_rate_part_2,
    per_device_train_batch_size=batch_size_part_2,
    per_device_eval_batch_size=batch_size_part_2,
    num_train_epochs=num_train_epochs_part_2,
    weight_decay=weight_decay_part_2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

## Some of you might want to experiment with different base models (maybe something trained on clinical documents or medical research). If you do, make sure that you use a tokenizer which is compatible with your model, and make sure to tokenize the dataset again below.

In [None]:
def tokenize_function_part_2(examples):
    return tokenizer_part_2(examples["text"], padding=True, truncation=True, max_length=max_tokens_part_2, add_special_tokens = True)

In [None]:
# Here you select what you want for your base model.  There are many more
# models for you to evaluate here
#
model_name_part_2 = None
# You could re-use the model we used above...
#model_name_part_2 = "distilbert-base-uncased"

model_part_2 = AutoModelForSequenceClassification.from_pretrained(model_name_part_2, num_labels=num_labels)

max_tokens_part_2 = None
#max_tokens_part_2 = 512

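tokenizer_part_2 = None
# Load the tokenizer that matches the model chosen above, for example:
#tokenizer_part_2 = AutoTokenizer.from_pretrained(model_name_part_2, use_fast=True)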

# if you use a tokenizer here which is different from above, note that you must tokenize again
# Each model must align with its tokenizer.  Otherwise, the matrices are all off
# (e.g., different widths for token vectors, final vectors, etc...)

# make sure to tokenize using the tokenizer you've selected for part 2 here
tokenized_dataset_part_2 = dataset.map(tokenize_function_part_2, batched=True, num_proc=4, remove_columns=["text"])

trainer_part_2 = Trainer(
    model_part_2,
    args_part_2,
    train_dataset=tokenized_dataset_part_2["train"],
    eval_dataset=tokenized_dataset_part_2["test"],
    tokenizer=tokenizer_part_2,
    compute_metrics = compute_metrics
)