If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [1]:
!pip install datasets "transformers[torch]" accelerate evaluate

Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your access token:

Then you need to install Git-LFS by running the following cell:

In [2]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [3]:
# let's also get our dataset we used for the rules-based part of the class...
!pip install https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/releases/download/v0.1/delphi_nlp_2024-0.1.tar.gz

Collecting https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/releases/download/v0.1/delphi_nlp_2024-0.1.tar.gz
  Downloading https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/releases/download/v0.1/delphi_nlp_2024-0.1.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting jupyter (from delphi-nlp-2024==0.1)
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting qtconsole (from jupyter->delphi-nlp-2024==0.1)
  Downloading qtconsole-5.5.1-py3-none-any.whl (123 kB)
Collecting qtpy>=2.4.0 (from qtconsole->jupyter->delphi-nlp-2024==0.1)
  Downloading QtPy-2.4.1-py3-none-any.whl (93 kB)
Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel->jupyter->delphi-nlp-2024==0.1)
  Downloading jedi-0

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [4]:
import csv
import requests
import pandas as pd

import sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import evaluate

import transformers

from datasets import Dataset

import torch

print(transformers.__version__)
print(sklearn.__version__)

4.38.1
1.2.2
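
The version requirement can also be turned into an explicit check. The sketch below is a simplified illustration using only the standard library: the helper name `version_tuple` is hypothetical, and it assumes plain `major.minor.patch` strings like the ones printed above (pre-release suffixes would need a proper version parser).

```python
def version_tuple(version: str) -> tuple:
    """Parse 'major.minor.patch' into a tuple of ints (hypothetical helper)."""
    return tuple(int(part) for part in version.split(".")[:3])

MINIMUM = (4, 11, 0)   # minimum Transformers version named in the text
installed = "4.38.1"   # value printed by transformers.__version__ above

# Tuples compare element by element, so (4, 38, 1) >= (4, 11, 0)
meets_minimum = version_tuple(installed) >= MINIMUM
print(meets_minimum)
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where `"4.9.0"` would sort after `"4.11.0"`.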


In [5]:
from delphi_nlp_2024 import *
from delphi_nlp_2024.quizzes.quizzes import *
from delphi_nlp_2024.helpers import *

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

We start with some initial parameters for our model, even if this checkpoint is not ideal for our
task of pneumonia classification:

In [6]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [7]:
train = load_pneumonia_data()

test = load_pneumonia_data("test")

print(f'len(train): {len(train)}')
print(f'len(test): {len(test)}')

len(train): 70
len(test): 30


In [8]:
print(train.head())

                       record_id  \
0   subject_id_157_hadm_id_26180   
3  subject_id_7272_hadm_id_19098   
5  subject_id_8156_hadm_id_23798   
7  subject_id_4726_hadm_id_27535   
8    subject_id_26_hadm_id_15067   

                                                text  document_classification  \
0  \n\n\n     DATE: [**3128-5-28**] 10:42 AM\n   ...                        1   
3  \n\n\n     DATE: [**2699-1-5**] 12:25 AM\n    ...                        1   
5  \n\n\n     DATE: [**2533-6-14**] 9:28 PM\n    ...                        1   
7  \n\n\n     DATE: [**2904-8-20**] 4:47 PM\n    ...                        0   
8  \n\n\n     DATE: [**3079-3-6**] 8:03 AM\n     ...                        0   

   split  baseline_document_classification  \
0  train                                 0   
3  train                                 1   
5  train                                 1   
7  train                                 0   
8  train                                 0   

                   

We are now getting our dataset ready for the Hugging Face libraries. We start with a dataframe, which we convert into a list of dictionaries and then into the `Dataset` type:

In [9]:
# now that we have a dataframe, here's a way to iterate through the rows

train_dataset_dicts = []
test_dataset_dicts = []

for index, row in train.iterrows():
  text = row['text']
  label = row['document_classification']

  # key values of text and label
  row_dict = {'text': text, 'label': label}
  train_dataset_dicts.append(row_dict)

for index, row in test.iterrows():
  text = row['text']
  label = row['document_classification']

  # key values of text and label
  row_dict = {'text': text, 'label': label}
  test_dataset_dicts.append(row_dict)

print(f'len(train_dataset_dicts): {len(train_dataset_dicts)}')
print(f'len(test_dataset_dicts): {len(test_dataset_dicts)}')

len(train_dataset_dicts): 70
len(test_dataset_dicts): 30


In [10]:
# now that we have all of the data, let's turn this into a type (Dataset) which HuggingFace recognizes

train_dataset = Dataset.from_list(train_dataset_dicts, split="train")
test_dataset = Dataset.from_list(test_dataset_dicts, split="test")

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [11]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Sample distinct indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [12]:
show_random_elements(train_dataset)

Unnamed: 0,text,label
0,"\n\n\n DATE: [**3388-5-16**] 8:20 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 773**]\n Reason: Pt s/p L subclavian. Assess for line placement and lack of \n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 73 year old man with above.\n REASON FOR THIS EXAMINATION:\n Pt s/p L subclavian. Assess for line placement and lack of pneumothorax.\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Status post left subclavian catheter placement, check position.\n \n Single frontal chest radiograph dated [**3388-5-16**] is compared with prior chest\n radiograph dated [**3388-5-10**].\n \n The left subclavian catheter is terminating in the left IJ. There is no\n evidence of pneumothorax. The patient is status post mediansternotomy and\n [**Location (un) 766**] rod placement. Multiple surgical clips are placed in the right\n upper quadrant, indicative of previous surgery, probably cholecystectomy.\n \n The heart is of normal size. The mediastinal and hilar contours are stable\n since the prior study. There is persistent opacification in the right lower\n lung zone. The pulmonary vascularity is unremarkable. The extreme left\n costophrenic angle is cut off. There is probable small right pleural effusion.\n \n IMPRESSION: The left subclavian catheter is terminating in the left IJ. No\n pneumothorax. Persistent right lower lung zone opacification and probable\n small effusion.\n\n",1
1,"\n\n\n DATE: [**3326-9-10**] 9:11 AM\n CHEST (PA & LAT) Clip # [**Clip Number (Radiology) 3443**]\n Reason: ? pneumonia]\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 41 year old woman with advanced HIV/AIDS, h/o recurrent bacterial pneumonias,\n now with high fever, cough with sputum.\n REASON FOR THIS EXAMINATION:\n ? pneumonia]\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: HIV. Fever and cough.\n \n FINDINGS: PA and lateral chest dated [**3326-9-10**] is compared with prior\n examination dated [**3326-5-30**]. The cardiac and mediastinal contours are\n unchanged. There has been interval development of bilateral small pleural\n effusions, right greater than left. There is diffuse interstitial opacity\n with areas at the bilateral bases of greater confluence suggestive of air\n space disease. Pulmonary vascularity appears unremarkable. The osseous\n structures appear unremarkable. Prominent nipple shadows are again noted.\n \n IMPRESSION: Diffuse interstitial pneumonitis with areas of air space opacity\n at bilateral bases and small bilateral pleural effusions. In the appropriate\n clinical setting this is compatible with appearance of PCP [**Name Initial (PRE) 3444**]. Other\n atypical pathogens cannot be excluded.\n\n",1
2,"\n\n\n DATE: [**2603-7-21**] 11:28 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 2628**]\n Reason: Sob \n Admitting Diagnosis: MI,CHF\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n [**Age over 90 **] year old woman with SOB, CHF, NSTEMI , acute SOB \n \n REASON FOR THIS EXAMINATION:\n Sob \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Short of breath and congestive heart failure and non ST elevation\n MI.\n \n COMPARISON: [**2603-7-21**].\n \n TECHNIQUE: Single AP portable upright chest.\n \n COMMENT: The heart size and mediastinal contours are unchanged. There is\n marked interval increase in congestive heart failure, with patchy bilateral\n and perihilar opacities. Increased size in bilateral pleural effusions, left\n greater than right. No pneumothorax. The surrounding osseous structures are\n unchanged.\n \n IMPRESSION: Interval increase in congestive heart failure and bilateral\n pleural effusions. Stable cardiomegaly.\n \n\n",0
3,"\n\n\n DATE: [**2558-7-16**] 10:56 PM\n CHEST (PORTABLE AP); -77 BY DIFFERENT PHYSICIAN [**Name Initial (PRE) 58**] # [**Clip Number (Radiology) 15282**]\n Reason: eval for ETT placement following transfer \n Admitting Diagnosis: STRIDOR;NEUTROPENIA;FEVER\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 66 year old woman with mestastatic cancer now with fevers s/p intubation - \n would like to confirm ett placement following transfer \n REASON FOR THIS EXAMINATION:\n eval for ETT placement following transfer \n ______________________________________________________________________________\n FINAL REPORT\n REASON FOR EXAMINATION: Intubation in a patient with fever and known\n metastatic cancer.\n \n Portable AP chest radiograph compared to previous one made the same day\n earlier at 19:59 p.m.\n \n The ET tube tip in good position 3.5 cm above the carina. The left subclavian\n line tip is in distal superior vena cava, unchanged. There is marked\n improvement of the bilateral consolidations, especially on the right. The NG\n tube tip is coiled in the stomach.\n \n IMPRESSION:\n 1. Standard position of ET tube, NG tube and left subclavian line.\n 2. Resolution of the pulmonary edema.\n\n",1
4,"\n\n\n DATE: [**2900-2-18**] 3:17 PM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 2600**]\n Reason: iabp placement s/p cabg\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 76 year old man with \n REASON FOR THIS EXAMINATION:\n iabp placement s/p cabg\n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: IABP placement, status post CABG.\n \n AP SUPINE CHEST: The patient is status post median sternotomy. ET tube\n projects 4 cm above the carina. IABP tip projects 4.7 cm in the aortic arch.\n NG tube is in good position. Right IJ approach Swan-Ganz tip projects over\n the main pulmonary artery, perhaps within the proximal left main pulmonary\n artery. There is expected post surgical linear atelectasis bilaterally.\n There is no pneumothorax or significant effusion.\n \n IMPRESSION: Tubes and lines as described. IABPD projects 4.7 cm from the\n aortic arch.\n \n\n",0


## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [14]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
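
The `input_ids` above can be read as `[CLS] sentence A [SEP] sentence B [SEP]`: in the vocabulary used by this checkpoint, 101 is `[CLS]` and 102 is `[SEP]`. The sketch below rebuilds that layout by hand from the IDs in the output. `build_pair` is a hypothetical helper for illustration, not part of the 🤗 Tokenizers API.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs in this checkpoint's vocabulary

def build_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] plus an all-ones attention mask."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # no padding, so every token is attended to
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs copied from the tokenizer output above, without the special tokens
sentence_a = [7592, 1010, 2023, 2028, 6251, 999]
sentence_b = [1998, 2023, 6251, 3632, 2007, 2009, 1012]

encoded = build_pair(sentence_a, sentence_b)
print(encoded["input_ids"])
```

The printed list is the same 16 IDs the tokenizer returned, which shows that the special tokens are the only thing added around the two sentences.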

## Fine-tuning the model

Now that our data is ready, we can download a pretrained base model and fine-tune it. Since our task is document classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem, which is 2 here because `document_classification` is either 0 or 1:

In [15]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `pre_classifier` and `classifier` layers) are randomly initialized. This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective (the `vocab_transform` and `vocab_layer_norm` layers) and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [16]:
metric = evaluate.load("accuracy")


In [17]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-pneumonia",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
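
To get a feel for what these settings imply, we can estimate how many optimizer steps training will take. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept; the result should equal the `global_step` that `trainer.train()` reports.

```python
import math

num_examples = 70   # len(train) reported when the data was loaded
batch_size = 16     # per-device batch size set at the top of the notebook
num_epochs = 5      # num_train_epochs in TrainingArguments

# The last batch of each epoch is smaller (70 = 4 * 16 + 6), so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 5 25
```

With only 25 weight updates at a learning rate of 2e-5, the model has very little opportunity to move away from its initial state, which is worth keeping in mind when reading the results.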

Before we continue, we need to tokenize the text, that is, translate it into the `input_ids` used to train the model:

In [18]:
max_tokens = 512

In [19]:

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=max_tokens, add_special_tokens = True)
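
The effect of `truncation=True` with `max_length=512` on a single long document can be sketched as follows. This is a simplified illustration, not the tokenizer's actual implementation: `truncate_single` is a hypothetical helper, and the token IDs are made up.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs in this checkpoint's vocabulary

def truncate_single(token_ids, max_length=512):
    """Keep the first tokens so that [CLS] + tokens + [SEP] fits in max_length."""
    budget = max_length - 2  # reserve room for [CLS] and [SEP]
    return [CLS_ID] + token_ids[:budget] + [SEP_ID]

long_report = list(range(1000, 1700))  # 700 made-up token IDs
encoded = truncate_single(long_report)
print(len(encoded))  # 512
```

Everything past the limit is dropped, so for long radiology reports the final sections (for example the IMPRESSION) may never reach the model.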


In [20]:
train_tokenized_dataset = train_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
test_tokenized_dataset = test_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])


In [21]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # Calculate accuracy
    accuracy = accuracy_score(labels, preds)

    # Calculate precision, recall, and F1-score
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    f1 = f1_score(labels, preds, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [22]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=test_tokenized_dataset,
    tokenizer=tokenizer,
    compute_metrics = compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive the samples like the dictionaries seen above and will need to return a dictionary of tensors.
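
As a rough picture of what padding a batch involves, here is a minimal sketch in plain Python. It is not the collator the `Trainer` uses: `pad_batch` is a hypothetical helper, it assumes right padding with ID 0 (the `[PAD]` token in the BERT uncased vocabulary), and it returns lists where a real `data_collator` would return tensors.

```python
PAD_ID = 0  # ID of [PAD] in the BERT uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token IDs to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 7592, 1010, 2023, 102]])
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

The attention mask is what lets the model tell real tokens from padding, so the two lists always have to be built together.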

We can now finetune our model by just calling the `train` method:

In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.688758,0.533333,0.284444,0.533333,0.371014
2,No log,0.687366,0.533333,0.284444,0.533333,0.371014
3,No log,0.685387,0.533333,0.284444,0.533333,0.371014
4,No log,0.683898,0.566667,0.76092,0.566667,0.441481
5,No log,0.682972,0.566667,0.76092,0.566667,0.441481




TrainOutput(global_step=25, training_loss=0.6874748992919922, metrics={'train_runtime': 49.7691, 'train_samples_per_second': 7.032, 'train_steps_per_second': 0.502, 'total_flos': 46363589529600.0, 'train_loss': 0.6874748992919922, 'epoch': 5.0})
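
The metrics for the first three epochs (accuracy 0.533333, precision 0.284444, F1 0.371014) are what you would see if the model predicted the same class for every test example. The sketch below reproduces those numbers with a small pure-Python version of support-weighted scoring. The label split (16 versus 14 out of 30) is an assumption inferred from the accuracy, and `weighted_scores` is a hypothetical stand-in for the scikit-learn calls in `compute_metrics`.

```python
def weighted_scores(labels, preds):
    """Support-weighted precision, recall and F1, using 0 for undefined ratios."""
    total = len(labels)
    precision = recall = f1 = 0.0
    for cls in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        predicted = sum(1 for p in preds if p == cls)
        support = sum(1 for y in labels if y == cls)
        # Precision is undefined when a class is never predicted; score it as 0
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / support
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        weight = support / total
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1

# Assumed split: 16 examples of one class, 14 of the other,
# with a model that predicts the majority class for every example.
labels = [1] * 16 + [0] * 14
preds = [1] * 30

precision, recall, f1 = weighted_scores(labels, preds)
print(round(precision, 6), round(recall, 6), round(f1, 6))  # 0.284444 0.533333 0.371014
```

Note that weighted recall always equals accuracy, which is why those two columns are identical in the training table.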

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [24]:
trainer.evaluate()

{'eval_loss': 0.6838977932929993,
 'eval_accuracy': 0.5666666666666667,
 'eval_precision': 0.7609195402298851,
 'eval_recall': 0.5666666666666667,
 'eval_f1': 0.44148148148148153,
 'eval_runtime': 0.5775,
 'eval_samples_per_second': 51.949,
 'eval_steps_per_second': 3.463,
 'epoch': 5.0}
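
The `eval_loss` above (0.6839) matches epoch 4 in the training table, not epoch 5, even though both reached the same accuracy. That is consistent with keeping a checkpoint as "best" until a later one is strictly better. The sketch below replays that selection rule over the table; the tie-breaking rule is an assumption that matches this output, not a statement about the `Trainer` internals.

```python
# (epoch, validation loss, accuracy) copied from the training table above
history = [
    (1, 0.688758, 0.533333),
    (2, 0.687366, 0.533333),
    (3, 0.685387, 0.533333),
    (4, 0.683898, 0.566667),
    (5, 0.682972, 0.566667),
]

best = None
for epoch, loss, accuracy in history:
    # Replace the current best only on a strict improvement,
    # so ties keep the earlier checkpoint.
    if best is None or accuracy > best[2]:
        best = (epoch, loss, accuracy)

print(best)  # (4, 0.683898, 0.566667)
```

If you would rather select on validation loss, `metric_for_best_model` can be pointed at a different metric in `TrainingArguments`.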

# OK, let's consider that to be a baseline model. In this next section, we'll work in small groups or as individuals. The goal is for each person to explore various hyperparameters that they might want to change.

## We'll do this manually today (which takes time), but it will be a good way to learn and think about these parameters. In practice, you will likely use a package or strategy for automated hyperparameter search. The hyperparameters to explore could include, for example:
1. Different base model (not the one used above)
2. Number of epochs (iterations through the dataset)
3. Learning rate
4. Weight decay
5. Plus many more...
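
To illustrate what an automated search has to cover, the sketch below enumerates every combination of a few candidate values with `itertools.product`. The candidate values are a small subset chosen for illustration, and the code only builds the list of configurations; it does not train anything.

```python
from itertools import product

# Candidate values (a small subset chosen for illustration)
search_space = {
    "learning_rate": [2e-5, 2e-4],
    "num_train_epochs": [5, 10, 20],
    "weight_decay": [0.01, 0.1],
}

# Every combination of the values above: 2 * 3 * 2 = 12 configurations
configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(configs))  # 12
print(configs[0])    # {'learning_rate': 2e-05, 'num_train_epochs': 5, 'weight_decay': 0.01}
```

The number of runs grows multiplicatively with each added hyperparameter, which is one reason automated tools often sample the space instead of trying every combination.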

In [25]:
# We already trained a model up above which will use a lot of RAM (on GPU or CPU)
# So let's tell CUDA (the GPU library) to clear its cache...
torch.cuda.empty_cache()

In [26]:
# create your own new TrainingArguments here

# You can learn more about these parameters, and you can add others which are all documented here:
# https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/trainer#transformers.TrainingArguments

#learning_rate_part_2 = None
learning_rate_part_2 = 2e-5
#learning_rate_part_2 = 2e-4
#learning_rate_part_2 = 2e-3
#learning_rate_part_2 = 2e-2

#batch_size_part_2 = None
#batch_size_part_2 = 2
#batch_size_part_2 = 4
#batch_size_part_2 = 8
batch_size_part_2 = 16
#batch_size_part_2 = 32

#num_train_epochs_part_2 = None
#num_train_epochs_part_2 = 1
#num_train_epochs_part_2 = 2
#num_train_epochs_part_2 = 3
#num_train_epochs_part_2 = 5
#num_train_epochs_part_2 = 10
#num_train_epochs_part_2 = 15
num_train_epochs_part_2 = 20
#num_train_epochs_part_2 = 25

#weight_decay_part_2 = None
#weight_decay_part_2 = 0.5
#weight_decay_part_2 = 0.1
weight_decay_part_2 = 0.01
#weight_decay_part_2 = 0.001
#weight_decay_part_2 = 0.0001

args_part_2 = TrainingArguments(
    f"{model_name}-finetuned-pneumonia-part-2",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=learning_rate_part_2,
    per_device_train_batch_size=batch_size_part_2,
    per_device_eval_batch_size=batch_size_part_2,
    num_train_epochs=num_train_epochs_part_2,
    weight_decay=weight_decay_part_2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

# you might want to print out what we chose, for keeping track
#print(args_part_2)

## Some of you might want to experiment with different base models (maybe something trained on clinical documents or medical research). If you do, make sure that you use a tokenizer which is compatible with your model, and make sure to re-tokenize the data below.

In [27]:
def tokenize_function_part_2(examples):
    return tokenizer_part_2(examples["text"], padding=True, truncation=True, max_length=max_tokens_part_2, add_special_tokens = True)

In [28]:
# Here you select what you want for your base model.  There are many more
# models for you to evaluate here
# https://huggingface.co/models?pipeline_tag=fill-mask&sort=created
# Note in the link above to select the filter under "Natural Language Processing"
# for "Fill Mask" so these are base models whose only task
# is filling in the blanks
#model_name_part_2 = None

# You could re-use the model we used above...
# https://huggingface.co/distilbert/distilbert-base-uncased
model_name_part_2 = "distilbert-base-uncased"

# https://huggingface.co/bert-base-uncased
#model_name_part_2 = "bert-base-uncased"

# This one was trained on biomedical literature and clinical data
# https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
#model_name_part_2 = "emilyalsentzer/Bio_ClinicalBERT"

# Here's a model trained by Microsoft on biomedical abstracts:
# https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
#model_name_part_2 = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"

model_part_2 = AutoModelForSequenceClassification.from_pretrained(model_name_part_2, num_labels=num_labels)

tokenizer_part_2 = AutoTokenizer.from_pretrained(model_name_part_2, use_fast=True)

# if you use a tokenizer here which is different from above, note that you must tokenize again
# Each model must align with its tokenizer.  Otherwise, the matrices are all off
# (e.g., different widths for token vectors, final vectors, etc...)

# make sure you know what the maximum number of tokens can be for your model/tokenizer:
#max_tokens_part_2 = None
max_tokens_part_2 = 512

# make sure to tokenize using the tokenizer you've selected for part 2 here
train_tokenized_dataset_part_2 = train_dataset.map(tokenize_function_part_2, batched=True, num_proc=4, remove_columns=["text"])
test_tokenized_dataset_part_2 = test_dataset.map(tokenize_function_part_2, batched=True, num_proc=4, remove_columns=["text"])

# More documentation on the Trainer can be found here:
# https://huggingface.co/docs/transformers/main_classes/trainer
trainer_part_2 = Trainer(
    model_part_2,
    args_part_2,
    train_dataset=train_tokenized_dataset_part_2,
    eval_dataset=test_tokenized_dataset_part_2,
    tokenizer=tokenizer_part_2,
    compute_metrics = compute_metrics
)

# you might want to print out what we chose for logging our experiments
#print(trainer_part_2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [29]:
# now train it!  good luck!!
trainer_part_2.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.686405,0.533333,0.284444,0.533333,0.371014
2,No log,0.68775,0.533333,0.284444,0.533333,0.371014
3,No log,0.684563,0.533333,0.284444,0.533333,0.371014
4,No log,0.685157,0.633333,0.794667,0.633333,0.589011
5,No log,0.678173,0.733333,0.77197,0.733333,0.718022
6,No log,0.668355,0.7,0.747826,0.7,0.676923
7,No log,0.653903,0.566667,0.76092,0.566667,0.441481
8,No log,0.631942,0.8,0.803704,0.8,0.79819
9,No log,0.596754,0.9,0.902222,0.9,0.900111
10,No log,0.543771,0.833333,0.834087,0.833333,0.832772




TrainOutput(global_step=100, training_loss=0.45628013610839846, metrics={'train_runtime': 167.1403, 'train_samples_per_second': 8.376, 'train_steps_per_second': 0.598, 'total_flos': 185454358118400.0, 'train_loss': 0.45628013610839846, 'epoch': 20.0})

In [30]:
trainer_part_2.evaluate()

{'eval_loss': 0.5967535376548767,
 'eval_accuracy': 0.9,
 'eval_precision': 0.9022222222222223,
 'eval_recall': 0.9,
 'eval_f1': 0.900111234705228,
 'eval_runtime': 0.5506,
 'eval_samples_per_second': 54.484,
 'eval_steps_per_second': 3.632,
 'epoch': 20.0}