If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [1]:
! pip install datasets transformers transformers[torch] accelerate evaluate

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter that token:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS by running the following cell:

In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the functionality for sharing models on the Hub was introduced in that version:

In [32]:
import csv
import requests
import pandas as pd

import sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import evaluate

import transformers
from transformers import AutoConfig

from datasets import Dataset

print(transformers.__version__)
print(sklearn.__version__)

4.35.2
1.2.2
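Printing the version leaves the comparison against 4.11.0 to the reader. As a minimal sketch (the helper names `version_tuple` and `meets_minimum` are made up for this example and are not part of any library), the check can be done numerically. Comparing the raw strings would be misleading, because `"4.9.2" > "4.11.0"` as text:

```python
import re

def version_tuple(version, width=3):
    """Turn '4.35.2' into (4, 35, 2); suffixes such as 'rc1' or 'dev0' are ignored."""
    numbers = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    numbers += [0] * (width - len(numbers))  # pad short versions like '4.11'
    return tuple(numbers)

def meets_minimum(installed, minimum="4.11.0"):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_minimum("4.35.2"))  # True
print(meets_minimum("4.9.2"))   # False
```

In the notebook you would pass `transformers.__version__` as `installed`.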


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

We start with a general-purpose checkpoint and a few initial parameters, even though this model is not ideal for our
task of pneumonia classification:

In [5]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [6]:
# Dropbox share links must end in dl=1 to download the raw CSV file
# (the default dl=0 link opens a preview page instead).
CSV_URL = 'https://www.dropbox.com/scl/fi/4x8aj95l7e9x96f4qzch1/mimic2_pneumonia_corpus.csv?rlkey=9rgtu2cp7wfv4rbpx3kw36a7g&dl=1'

df = pd.read_csv(CSV_URL)

print(df.head())

   Unnamed: 0  subject_id  hadm_id             admit_dt  Pneumonia  \
0           5          37    18052  3264-08-14 00:00:00          1   
1          14          94     8743  2656-08-18 00:00:00          1   
2          10         117    14296  3131-11-27 00:00:00          1   
3          19         184      203  3251-04-30 00:00:00          1   
4          18         184    17249  3251-03-19 00:00:00          1   

                                                text  
0  \n\n\n     DATE: [**3264-8-14**] 10:57 AM\n   ...  
1  \n\n\n     DATE: [**2656-8-19**] 4:17 PM\n    ...  
2  \n\n\n     DATE: [**3131-11-28**] 1:30 PM\n   ...  
3  \n\n\n     DATE: [**3251-5-1**] 3:18 PM\n     ...  
4  \n\n\n     DATE: [**3251-3-19**] 3:18 PM\n    ...  


In [7]:
# now that we have a dataframe, here's a way to iterate through the rows

all_dataset_dicts = []

for index, row in df.iterrows():
  text = row['text']
  label = row['Pneumonia']

  # key values of text and label
  row_dict = {'text': text, 'label': label}
  all_dataset_dicts.append(row_dict)

print(f'len(all_dataset_dicts): {len(all_dataset_dicts)}')

len(all_dataset_dicts): 200
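With only 200 reports, it is worth checking how the two labels are distributed before splitting, since a skewed distribution affects how accuracy should be read later. A small sketch, using hypothetical stand-in rows rather than the real corpus (the helper name `label_distribution` is an assumption of this example):

```python
from collections import Counter

def label_distribution(rows):
    """Return {label: (count, fraction)} for a list of {'text', 'label'} dicts."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Hypothetical stand-in rows; in the notebook, pass all_dataset_dicts instead.
example_rows = [
    {"text": "report a", "label": 1},
    {"text": "report b", "label": 1},
    {"text": "report c", "label": 0},
    {"text": "report d", "label": 1},
]
distribution = label_distribution(example_rows)
print(distribution)
```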


In [8]:
# now that we have all of the data, let's turn this into a type (Dataset) which HuggingFace recognizes

dataset_before_split = Dataset.from_list(all_dataset_dicts)

In [11]:
# now let's split this up into train and test:

dataset = dataset_before_split.train_test_split(test_size=0.3, seed = 77)

print(type(dataset))

<class 'datasets.dataset_dict.DatasetDict'>


The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split; here these are `train` and `test`.

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 140
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 60
    })
})
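The split above is random, so the share of pneumonia cases in `train` and `test` can differ by chance. One alternative is a stratified split that keeps the label proportions roughly equal in both parts. The sketch below shows the idea in plain Python on hypothetical stand-in rows; `stratified_split` is a name invented for this example, not a 🤗 Datasets function. (`train_test_split` also has a `stratify_by_column` argument, which as far as I know requires the label column to be a `ClassLabel` feature.)

```python
import random
from collections import defaultdict

def stratified_split(rows, test_size=0.3, seed=77):
    """Split rows into (train, test), keeping each label's share roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])

    # Shuffle again so that labels are not grouped together.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical stand-in data: 12 positive and 8 negative rows.
rows = [{"text": f"report {i}", "label": 1 if i < 12 else 0} for i in range(20)]
train_rows, test_rows = stratified_split(rows, test_size=0.25)
print(len(train_rows), len(test_rows))
```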

To access an actual element, you need to select a split first, then give an index:

In [13]:
dataset["train"][0]

{'text': '\n\n\n     DATE: [**3334-10-2**] 8:02 AM\n     CHEST (PORTABLE AP)                                             Clip # [**Clip Number (Radiology) 3594**]\n     Reason: Pleural or pericardial effusion?                            \n     ______________________________________________________________________________\n     UNDERLYING MEDICAL CONDITION:\n      62 year old woman with non-small cell lung cancer here s/p thoracentesis and \n      pericardiocentesis. \n     REASON FOR THIS EXAMINATION:\n      Pleural or pericardial effusion?                                                \n     ______________________________________________________________________________\n                                     FINAL REPORT\n     PORTABLE CHEST [**3334-10-2**] AT 8:27\n     \n     INDICATION:  Status post thoracentesis, paracardiocentesis. Evaluate for re-\n     accumulation of fluid.\n     \n     COMPARISONS: [**3334-9-30**] at 17:17\n     \n     FINDINGS: Progressive accumulation of rig

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [14]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [15]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"\n\n\n DATE: [**2842-9-12**] 6:51 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 5348**]\n Reason: Any infiltrates? \n Admitting Diagnosis: PANCREATITIS\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 41 year old woman with metastatic melanoma, now with marked leukocytosis, \n fever. Known lung mets. Please evaluate for PNA. \n \n REASON FOR THIS EXAMINATION:\n Any infiltrates? \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: 41-year-old female with metastatic melanoma, now with marked\n leukocytosis and fever. Evaluate.\n \n COMPARISON: AP supine portable chest x-ray dated [**2842-9-2**].\n \n AP SUPINE PORTABLE CHEST X-RAY: When compared with prior exam, there is\n opacification of the left retrocardiac region, concerning for left lower lobe\n collapse or consolidation. assymmetric appacification fo the left hemithorax\n is consistent iwth layering pleural effusion on this supine film. The right\n costophrenic angle is not seen. The imaged right lung is grossly clear. No\n pleural effusions are seen bilaterally. The cardiac, mediastinal, and hilar\n contours are normal and stable. There is no pneumothorax. A nasogastric tube\n is coiled within the stomach. The surrounding soft tissue and osseous\n structures again demonstrate orthopedic hardware within the lower lumbar\n spine. Contrast is seen within the colon.\n \n IMPRESSION:\n 1. Left lower lobe collapse/consolidation, concerning for pneumonia.\n 2. Moderate, layering, left pleural effusion.\n\n",1
1,"\n\n\n DATE: [**2948-11-14**] 8:26 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 12161**]\n Reason: r/o out pneumonia \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 74s/ year old male with bilat PNA. s/p replacement of OGT tube. \n \n REASON FOR THIS EXAMINATION:\n r/o out pneumonia \n ______________________________________________________________________________\n FINAL REPORT\n HISTORY: Orogastric tube replacement.\n \n COMPARISONS: [**2948-10-24**]\n \n PORTABLE AP CHEST: The examination is somewhat limited by patient motion.\n Allowing for this the tracheostomy tube is unchanged. An orogastric tube is\n not visualized. If one has been placed, it may be coiled within the\n oropharynx. The cardiomediastinal silhouette is stable. Interval decrease in\n vascular congestion.\n\n",0
2,"\n\n\n DATE: [**3131-12-30**] 6:56 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 13634**]\n Reason: sob,chf.\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 75 year old woman with h/o chf, cad, and new desaturation \n REASON FOR THIS EXAMINATION:\n sob,chf.\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Shortness of breath and desaturation.\n \n COMPARISON: [**3131-12-27**]\n \n FINDINGS: Sternal wires and CABG clips are noted. The cardiac and\n mediastinal contours are within normal limits. There is no pulmonary vascular\n congestion. There is continued but slightly improved left lower lobe\n consolidation and collapse with associated left pleural effusion. Right\n basilar atelectasis and small right pleural effusion have increased in the\n interval. These changes may represent atelectasis adjacent to pleural\n effusions, but differential would include infectious consolidation. No\n pneumothorax. The osseous structures are unremarkable.\n \n IMPRESSION: Slight improvement of left lower lobe consolidation and collapse\n with associated pleural effusion. Slight worsening of right basilar\n atelectasis/consolidation and effusion.\n\n",1
3,"\n\n\n DATE: [**2854-3-31**] 3:35 PM\n CHEST (PA & LAT) Clip # [**Clip Number (Radiology) 10388**]\n Reason: eval for infiltrate \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 86 year old man with cough, ams \n REASON FOR THIS EXAMINATION:\n eval for infiltrate \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: 86-year-old male with cough and AMS.\n \n TECHNIQUE: PA and lateral chest radiograph.\n \n COMPARISON: Chest radiograph dated [**2853-10-28**].\n \n FINDINGS: Again, note is made of cardiomegaly with tortuous aorta, overall\n unchanged compared to the prior study. Calcified pleural plaques are seen.\n Note is made of somewhat increased opacity in the left lower lobe, which may\n be technical, however, can represent new consolidation in the appropriate\n clinical setting. No definite effusion is seen.\n \n IMPRESSION: Unchanged appearance of cardiomegaly and tortuous aorta.\n Calcified pleural plaques suggestive of prior asbestos exposure. Somewhat\n increased opacity in left lower lobe, which may represent consolidation, in\n the appropriate clinical setting. Please correlate clinically, and please\n closely follow by repeated chest x-rays.\n \n The referring physician, [**Last Name (NamePattern4) 527**]. [**Last Name (STitle) 10389**], has been paged.\n \n \n\n",1
4,"\n\n\n DATE: [**2850-10-10**] 1:03 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 14012**]\n Reason: Please assess ETT placement\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 79 year old woman with new RLL consolidation, L base atelectasis, now s/p\n intubation.\n REASON FOR THIS EXAMINATION:\n Please assess ETT placement\n ______________________________________________________________________________\n FINAL REPORT\n HISTORY: Intubation. For endotracheal tube placement.\n \n Endotracheal tube is at the carina and too low for optimal location. Status\n post AVR. No pneumothorax. There is marked cardiomegaly with bilateral\n pleural effusions and probable upper zone redistribution. Cannot rule out\n atelectasis at the right lung base.\n \n IMPRESSION: Status post AVR. Endotracheal tube at carina too low for optimal\n position. Cardiomegaly with possible CHF, bilateral pleural effusions and\n probable basilar atelectases.\n\n",1
5,"\n\n\n DATE: [**3036-7-14**] 10:56 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 7286**]\n Reason: please eval for interval change, especially in RLL but also \n Admitting Diagnosis: ALTERED MENTAL STATUS\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 71 year old man with cirrhosis, vol overload, Afib with RVR, recent RLL \n collapse on abx, who diuresed overnight but has worsening O2 requirement\n REASON FOR THIS EXAMINATION:\n please eval for interval change, especially in RLL but also fluid status \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Cirrhosis, volume overload, worsening hypoxia.\n \n CHEST, ONE VIEW: Comparison with [**3036-7-9**]. Bilateral pleural effusions,\n right greater than left, probably unchanged, given technique. Large cardiac\n shadow also unchanged. Pulmonary hila are full, but there is no definite\n evidence of pulmonary edema. No pneumothorax. Osseous structures are\n unchanged on this poorly penetrated film.\n \n IMPRESSION: No significant interval change in appearance of bilateral pleural\n effusions, right larger than left, or cardiomegaly.\n \n\n",1
6,"\n\n\n DATE: [**3139-6-12**] 8:10 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 8561**]\n Reason: pls eval \n Admitting Diagnosis: GASTROINTESTINAL BLEED\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 73 year old woman with exsanguinating upper GI bleed \n REASON FOR THIS EXAMINATION:\n pls eval \n ______________________________________________________________________________\n FINAL REPORT\n PORTABLE CHEST AP:\n \n HISTORY: 73 y/o woman with upper GI bleed. Evaluate chest.\n \n FINDINGS: Single frontal radiograph, without comparison, demonstrates\n confluent opacities with air bronchograms in the lower lungs, possibly\n secondary to aspiration. The ETT is 2.7 cm above the carina. The gastric tube\n terminates below field of imaging within the mid abdomen. The pulmonary\n vasculature is prominent even when accounting for low lung volumes, and is\n accompanied by vascular indistinctness and subtle interstitial opacities,\n likely due to fluid overload.\n\n",1
7,"\n\n\n DATE: [**2900-2-18**] 3:17 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 2600**]\n Reason: iabp placement s/p cabg\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 76 year old man with \n REASON FOR THIS EXAMINATION:\n iabp placement s/p cabg\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: IABP placement, status post CABG.\n \n AP SUPINE CHEST: The patient is status post median sternotomy. ET tube\n projects 4 cm above the carina. IABP tip projects 4.7 cm in the aortic arch.\n NG tube is in good position. Right IJ approach Swan-Ganz tip projects over\n the main pulmonary artery, perhaps within the proximal left main pulmonary\n artery. There is expected post surgical linear atelectasis bilaterally.\n There is no pneumothorax or significant effusion.\n \n IMPRESSION: Tubes and lines as described. IABPD projects 4.7 cm from the\n aortic arch.\n \n\n",1
8,"\n\n\n DATE: [**2699-1-5**] 12:25 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 12728**]\n Reason: + sob. + rales. rr 35. and hypertensive\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 70 year old woman with htn and recent sob\n REASON FOR THIS EXAMINATION:\n + sob. + rales. rr 35. and hypertensive\n ______________________________________________________________________________\n FINAL REPORT\n PORTABLE CHEST X-RAY\n \n CLINICAL INDICATION: 70 year old woman with shortness of breath and rales.\n \n A single frontal 70 degree upright radiograph of the chest was obtained and\n compared to the next prior study from [**2699-1-4**].\n \n The lung volumes are low. Prominence of the pulmonary vascularity remains,\n consistent with left heart failure. There is obscuration of the\n hemidiaphragms bilaterally consistent with bilateral pleural effusions. The\n overall heart size is difficult to assess. There is dense retrocardiac\n opacity, possibly secondary to collapse and/or consolidation in the left lower\n lobe. There is also a right lower lobe and middle lobe opacity consistent\n with collapse and/or consolidation.\n \n IMPRESSION: Persistent left heart failure with bilateral pleural effusions.\n Collapse and/or consolidation at the bases bilaterally.\n\n",1
9,\n\n\n DATE: [**3124-1-22**] 11:35 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 12605**]\n Reason: eval for chf and tube placement\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 80 year old man with vomiting presented to [**Location (un) 12606**]. CT there showed left\n cerebellar infarct. s/p adjustment of ET tube \n REASON FOR THIS EXAMINATION:\n eval for chf and tube placement\n ______________________________________________________________________________\n FINAL REPORT\n HISTORY: 80-year-old with vomiting.\n \n PORTABLE CHEST: The cardiac silhouette is at the upper limits of normal for\n size. The hilar and mediastinal silhouettes are within normal limits for\n size. There is bilateral mild peribronchial thickening with increased\n interstitial markings and perihilar haziness. The costophrenic angles are\n sharp. There is an endotracheal tube with its tip approximately 3-4 cm above\n the carina. There is a nasogastric tube present with its tip in the body of\n the stomach and its side hole in the esophagus. There is bilateral\n degenerative changes in the AC joint.\n \n IMPRESSION: Increased interstitial markings with vascular haziness findings\n consistent with congestive heart failure.\n\n,1


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [17]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
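To read the output above: in the uncased BERT vocabulary that this checkpoint uses, ID 101 is the `[CLS]` token that opens the sequence and ID 102 is the `[SEP]` token that closes each sentence. The following sketch splits the IDs back into the two sentences; the helper `split_segments` is invented for this illustration.

```python
# IDs copied from the tokenizer output above.
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the uncased BERT vocabulary

def split_segments(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Split an encoded sentence pair into per-sentence lists of token IDs."""
    assert ids[0] == cls_id, "expected the sequence to start with [CLS]"
    segments, current = [], []
    for token_id in ids[1:]:
        if token_id == sep_id:
            segments.append(current)
            current = []
        else:
            current.append(token_id)
    return segments

first, second = split_segments(input_ids)
print(len(first), len(second))  # 6 7
```

The `attention_mask` is all ones here because nothing was padded; padded positions would be marked with 0.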

## Fine-tuning the model

Now that our data is ready, we can download a pretrained base model and fine-tune it. Since our task is document classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (2 here: pneumonia or no pneumonia):

In [18]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us we are throwing away some weights (the `vocab_transform` and `vocab_layer_norm` layers) and randomly initializing some others (the `pre_classifier` and `classifier` layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [19]:
metric = evaluate.load("accuracy")


In [20]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-pneumonia",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.

Before we continue, we need to tokenize the texts, that is, translate them into the `input_ids` the model trains on. We cap each report at 512 tokens and truncate anything longer:

In [21]:
max_tokens = 512

In [22]:

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=max_tokens, add_special_tokens = True)


In [23]:
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Map (num_proc=4):   0%|          | 0/140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/60 [00:00<?, ? examples/s]
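It helps to be explicit about what `truncation=True` with `max_length=512` does to a long report. For a single text, the tokenizer keeps the first 510 tokens and wraps them in `[CLS]` and `[SEP]`; everything after that is dropped. The sketch below imitates this on placeholder IDs (the function `encode_with_truncation` and the fake report are assumptions of this example, not the tokenizer's real code):

```python
def encode_with_truncation(token_ids, max_length=512, cls_id=101, sep_id=102):
    """Keep the first max_length - 2 tokens and wrap them in [CLS] ... [SEP]."""
    kept = token_ids[: max_length - 2]
    return [cls_id] + kept + [sep_id]

# Hypothetical report of 600 wordpiece IDs (the values are placeholders).
long_report = list(range(1000, 1600))
encoded = encode_with_truncation(long_report)

dropped = len(long_report) - (len(encoded) - 2)
print(len(encoded), dropped)  # 512 90
```

Since the reports shown earlier end with an IMPRESSION section, truncating from the right may remove exactly that summary for long reports, which is worth keeping in mind when reading the results.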

In [24]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Calculate accuracy
    accuracy = accuracy_score(labels, preds)

    # Calculate precision, recall, and F1-score
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    f1 = f1_score(labels, preds, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [25]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics = compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive samples like the dictionaries seen above and will need to return a dictionary of tensors.
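To make the padding step concrete, here is a minimal sketch of what a collator does, written with plain lists instead of tensors. It assumes right padding with token ID 0, which matches the `[PAD]` token of the uncased BERT vocabulary; the function name `pad_batch` is made up for this example, and a real `data_collator` would return tensors.

```python
def pad_batch(samples, pad_token_id=0):
    """Right-pad each sample's input_ids to the longest length in the batch."""
    max_len = max(len(sample["input_ids"]) for sample in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for sample in samples:
        ids = sample["input_ids"]
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch

# Two hypothetical, already tokenized samples of different lengths.
batch = pad_batch([
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 102]},
])
print(batch)
```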

We can now finetune our model by just calling the `train` method:

In [26]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | No log | 0.742235 | 0.616667 | 0.380278 | 0.616667 | 0.470447 |
| 2 | No log | 0.735926 | 0.616667 | 0.380278 | 0.616667 | 0.470447 |
| 3 | No log | 0.695656 | 0.616667 | 0.380278 | 0.616667 | 0.470447 |
| 4 | No log | 0.689013 | 0.616667 | 0.380278 | 0.616667 | 0.470447 |
| 5 | No log | 0.699436 | 0.616667 | 0.380278 | 0.616667 | 0.470447 |


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=45, training_loss=0.5681185828314887, metrics={'train_runtime': 38.3781, 'train_samples_per_second': 18.24, 'train_steps_per_second': 1.173, 'total_flos': 92727179059200.0, 'train_loss': 0.5681185828314887, 'epoch': 5.0})
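The `global_step=45` in the output follows directly from the dataset size and the training arguments. A quick sketch of the arithmetic, assuming a single device and no gradient accumulation (the helper name is invented for this example):

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last, smaller batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(num_examples=140, batch_size=16, num_epochs=5)
print(steps)  # 45
```

Assuming the default `logging_steps` of 500 was in effect (the notebook does not override it), the training loss is never logged within 45 steps, which would explain the `No log` entries in the Training Loss column. Passing a smaller `logging_steps` to `TrainingArguments` should make that column informative.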

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [27]:
trainer.evaluate()

  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.7422347664833069,
 'eval_accuracy': 0.6166666666666667,
 'eval_precision': 0.38027777777777777,
 'eval_recall': 0.6166666666666667,
 'eval_f1': 0.47044673539518905,
 'eval_runtime': 0.3444,
 'eval_samples_per_second': 174.2,
 'eval_steps_per_second': 11.613,
 'epoch': 5.0}
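These numbers deserve a closer look, because they do not change from epoch to epoch. They are exactly what weighted metrics produce when a classifier assigns the same label to every test report and 37 of the 60 reports carry that label (37/60 = 0.616667). The sketch below reproduces the values under that assumption; it is an interpretation of the results, not something the notebook verifies directly, and which of the two labels is predicted is also assumed here.

```python
def weighted_scores(support, predicted_label):
    """Weighted precision/recall/F1 when every example gets `predicted_label`.

    `support` maps each true label to its number of test examples.
    """
    total = sum(support.values())
    precision = recall = f1 = 0.0
    for label, n in support.items():
        if label == predicted_label:
            p, r = n / total, 1.0          # every example of this class is found
            f = 2 * p * r / (p + r)
        else:
            p = r = f = 0.0                # class never predicted: scores count as 0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Assumption: 37 of the 60 test reports have the label that is always predicted.
precision, recall, f1 = weighted_scores({1: 37, 0: 23}, predicted_label=1)
print(round(precision, 6), round(recall, 6), round(f1, 6))  # 0.380278 0.616667 0.470447
```

If that reading is right, the baseline has not yet learned to separate the classes, and accuracy alone hides this. Inspecting the confusion matrix or per-class scores would confirm it.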

# OK, let's consider that to be a baseline model. In this next section, we'll work in small groups or individually. The goal is for each person to explore various hyperparameters they might want to change.

## We'll do this manually today (which takes time), but it is a good way to learn and think about these parameters. In practice, you will likely use a package or strategy for automated hyperparameter search. For example, the hyperparameters to explore could be:
1. Different base model (not the one used above)
2. Number of epochs (iterations through the dataset)
3. Learning rate
4. Weight decay
5. Plus many more...
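When trying combinations by hand, it is easy to lose track of what has already been run. One simple option is to enumerate the combinations up front. The sketch below does this with `itertools.product`; the search space is hypothetical, and the keys are chosen to match `TrainingArguments` parameter names so that each dictionary could be passed on as keyword arguments.

```python
from itertools import product

# Hypothetical search space; adjust the values to what you want to explore.
search_space = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "num_train_epochs": [5, 9],
    "weight_decay": [0.01, 0.1],
}

def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(iter_configs(search_space))
print(len(configs))  # 12
print(configs[0])
```

Each combination means a full training run, so the grid grows quickly. With a dataset this small that is still manageable, but it is the main reason automated search tools sample the space instead of enumerating it.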

In [58]:
import torch

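# The model trained above still occupies memory (on GPU or CPU).
# empty_cache() asks CUDA (the GPU library) to release cached blocks that are no longer in use;
# memory held by objects that still exist (e.g. `model`, `trainer`) is only freed once they are deleted.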
torch.cuda.empty_cache()

In [59]:
# create your own new TrainingArguments here

#learning_rate_part_2 = None
learning_rate_part_2 = 2e-5
#learning_rate_part_2 = 5e-5
#learning_rate_part_2 = 1e-5
#learning_rate_part_2 = 2e-4
#learning_rate_part_2 = 2e-3
#learning_rate_part_2 = 2e-2

#batch_size_part_2 = None
#batch_size_part_2 = 2
#batch_size_part_2 = 4
#batch_size_part_2 = 8
batch_size_part_2 = 16
#batch_size_part_2 = 24
#batch_size_part_2 = 32

#num_train_epochs_part_2 = None
#num_train_epochs_part_2 = 1
#num_train_epochs_part_2 = 2
#num_train_epochs_part_2 = 3
#num_train_epochs_part_2 = 5
num_train_epochs_part_2 = 9
#num_train_epochs_part_2 = 10
#num_train_epochs_part_2 = 15
#num_train_epochs_part_2 = 20
#num_train_epochs_part_2 = 25

#weight_decay_part_2 = None
#weight_decay_part_2 = 0.5
#weight_decay_part_2 = 0.1
weight_decay_part_2 = 0.01
#weight_decay_part_2 = 0.001
#weight_decay_part_2 = 0.0001

args_part_2 = TrainingArguments(
    f"{model_name}-finetuned-pneumonia-part-2",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=learning_rate_part_2,
    per_device_train_batch_size=batch_size_part_2,
    per_device_eval_batch_size=batch_size_part_2,
    num_train_epochs=num_train_epochs_part_2,
    weight_decay=weight_decay_part_2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

#print(args_part_2)

## Some of you might want to experiment with different base models (maybe something trained on clinical documents or medical research). If you do, make sure that you use a tokenizer which is compatible with your model, and retokenize the dataset below.

In [60]:
def tokenize_function_part_2(examples):
    return tokenizer_part_2(examples["text"], padding=True, truncation=True, max_length=max_tokens_part_2, add_special_tokens = True)

In [61]:
# Here you select what you want for your base model.  There are many more
# models for you to evaluate here
#
#model_name_part_2 = None

# You could re-use the model we used above...
# https://huggingface.co/distilbert/distilbert-base-uncased
#model_name_part_2 = "distilbert-base-uncased"

# https://huggingface.co/bert-base-uncased
#model_name_part_2 = "bert-base-uncased"

# This one was trained on biomedical literature and clinical data
# https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
model_name_part_2 = "emilyalsentzer/Bio_ClinicalBERT"

# Here's a model trained by Microsoft on biomedical abstracts:
# https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
#model_name_part_2 = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"

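# Optionally add some dropout to this model by uncommenting the overrides below.
# BERT-style configs use hidden_dropout_prob / attention_probs_dropout_prob;
# DistilBERT configs use dropout / attention_dropout instead.
configuration = AutoConfig.from_pretrained(model_name_part_2)
#configuration.hidden_dropout_prob = 0.25
#configuration.attention_probs_dropout_prob = 0.25
#configuration.attention_dropout = 0.1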
print(configuration)

model_part_2 = AutoModelForSequenceClassification.from_pretrained(model_name_part_2,
                                                                  config = configuration)

tokenizer_part_2 = AutoTokenizer.from_pretrained(model_name_part_2, use_fast=True)

# If the tokenizer here differs from the one used above, you must tokenize the dataset again.
# Each model must be paired with its own tokenizer.  Otherwise the token IDs will not
# line up with the model's embedding matrix (e.g., different vocabulary sizes).

# make sure max_tokens does not exceed what your model/tokenizer supports
# (see max_position_embeddings in the printed config):
max_tokens_part_2 = 512

# make sure to tokenize using the tokenizer you've selected for part 2 here
tokenized_dataset_part_2 = dataset.map(tokenize_function_part_2, batched=True, num_proc=4, remove_columns=["text"])

trainer_part_2 = Trainer(
    model_part_2,
    args_part_2,
    train_dataset=tokenized_dataset_part_2["train"],
    eval_dataset=tokenized_dataset_part_2["test"],
    tokenizer=tokenizer_part_2,
    compute_metrics = compute_metrics
)

BertConfig {
  "_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map (num_proc=4):   0%|          | 0/140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/60 [00:00<?, ? examples/s]
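
The configuration printed above reports `max_position_embeddings` of 512, which is the ceiling for `max_tokens_part_2` with this model. Below is a minimal sketch of a guard against requesting more tokens than the model supports. It assumes the relevant config values are copied into a plain dict, and the helper name `clamp_max_tokens` is hypothetical (it is not part of 🤗 Transformers).

```python
# Values copied from the BertConfig printed above.
printed_config = {"max_position_embeddings": 512, "vocab_size": 28996}


def clamp_max_tokens(requested, config):
    """Return a max_length that fits the model's position embeddings.

    `clamp_max_tokens` is a hypothetical helper, for illustration only.
    """
    if requested < 1:
        raise ValueError("max_length must be at least 1")
    limit = config["max_position_embeddings"]
    return min(requested, limit)


print(clamp_max_tokens(512, printed_config))   # 512
print(clamp_max_tokens(1024, printed_config))  # 512
print(clamp_max_tokens(256, printed_config))   # 256
```

In the notebook itself, the limit could be read directly from `configuration.max_position_embeddings` instead of a hand-copied dict.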

In [62]:
# now train it!  good luck!!
trainer_part_2.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.706124,0.616667,0.380278,0.616667,0.470447
2,No log,0.710011,0.616667,0.380278,0.616667,0.470447
3,No log,0.682977,0.616667,0.380278,0.616667,0.470447
4,No log,0.651131,0.616667,0.380278,0.616667,0.470447
5,No log,0.638977,0.616667,0.380278,0.616667,0.470447
6,No log,0.661022,0.616667,0.574425,0.616667,0.498035
7,No log,0.649543,0.616667,0.586333,0.616667,0.569801
8,No log,0.676438,0.6,0.557298,0.6,0.544318
9,No log,0.636988,0.633333,0.615556,0.633333,0.612623


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=81, training_loss=0.5055567423502604, metrics={'train_runtime': 102.2713, 'train_samples_per_second': 12.32, 'train_steps_per_second': 0.792, 'total_flos': 331519929753600.0, 'train_loss': 0.5055567423502604, 'epoch': 9.0})
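
The `global_step=81` value can be sanity-checked against the dataset size. The sketch below assumes a single device, no gradient accumulation, and a batch size of 16. The batch size is an assumption, since `batch_size_part_2` is set outside this section. The 140 training examples come from the `Map` progress bar and the 9 epochs from the training log.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch is ceil(examples / batch size); multiply by epochs.

    Assumes one device and no gradient accumulation.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 140 training examples, assumed batch size of 16, 9 epochs.
print(total_optimizer_steps(140, 16, 9))  # 81
```

With only 81 steps in total, the run never reaches the `Trainer` default `logging_steps` of 500, which is the likely reason the Training Loss column shows "No log" (unless `logging_steps` was changed elsewhere).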

In [63]:
trainer_part_2.evaluate()

{'eval_loss': 0.6369878649711609,
 'eval_accuracy': 0.6333333333333333,
 'eval_precision': 0.6155555555555555,
 'eval_recall': 0.6333333333333333,
 'eval_f1': 0.6126230209670518,
 'eval_runtime': 0.632,
 'eval_samples_per_second': 94.937,
 'eval_steps_per_second': 6.329,
 'epoch': 9.0}
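
The metrics for epochs 1 through 5 in the training log (accuracy 0.616667, precision 0.380278, F1 0.470447) are what a classifier produces when it predicts the same class for every example, which would also explain the `_warn_prf` warnings. Below is a minimal sketch that reproduces those numbers in plain Python. It assumes a binary task, support-weighted averaging in `compute_metrics`, and 37 of the 60 evaluation examples belonging to the predicted class. These counts are inferred from the metrics rather than stated in the notebook, and labelling the majority class as 0 is arbitrary.

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 (undefined ratios count as 0)."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / support if support else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = support / total
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1


# Assumed label split: 37 examples of class 0 and 23 of class 1.
y_true = [0] * 37 + [1] * 23
y_pred = [0] * 60  # the model predicts class 0 for every example

precision, recall, f1 = weighted_prf(y_true, y_pred)
print(round(precision, 6), round(recall, 6), round(f1, 6))
# 0.380278 0.616667 0.470447
```

Read this way, the model only starts separating the two classes around epoch 6, when precision and F1 move away from the single-class values.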