## Introduction and loading libraries

Here, we train a basic classifier that predicts, from the PubMed title alone, whether a publication is about CDK4/6 inhibitors (a class of cancer protein inhibitors) or FtsZ (a bacterial cell division protein). The code is adapted from the Hugging Face fine-tuning tutorial: https://huggingface.co/docs/transformers/training

In [None]:
# Install required libraries
! pip install transformers
! pip install evaluate
! pip install --upgrade accelerate

## Upload data

Data were downloaded directly from PubMed as a CSV file using two different search terms: "FtsZ" and "CDK4/6 inhibitors".

In [None]:
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Read in data and select only relevant columns
df = pd.read_csv('data/Abstracts-pubmed-classification.csv')
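# .copy() gives an independent DataFrame, so the edits below do not trigger a SettingWithCopyWarning
data = df[['Title', 'label']].copy()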

# We need to encode labels to numbers rather than text for everything to work
# Note: You can also use 'LabelEncoder()' if you have loads of classes!
data['label'] = data['label'].replace({"CDK":0, "FtsZ":1})

# Rename columns to the standard labels that HF expects (i.e. text and labels)
data.rename(columns={"Title": "text", "label": "labels"}, inplace=True)

# Check everything is ok
display(data)
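
The comment above mentions `LabelEncoder()` for datasets with many classes. As a rough sketch of the underlying idea, the snippet below builds the string-to-integer mapping from the sorted unique labels instead of typing it by hand. The helper name `build_label_map` and the toy label list are made up for illustration and are not part of the notebook.

```python
def build_label_map(labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}


# Toy labels standing in for the 'label' column of the PubMed CSV (hypothetical values)
raw_labels = ["FtsZ", "CDK", "CDK", "FtsZ"]

label2id = build_label_map(raw_labels)
id2label = {idx: name for name, idx in label2id.items()}
encoded = [label2id[name] for name in raw_labels]

print(label2id)  # {'CDK': 0, 'FtsZ': 1}
print(encoded)   # [1, 0, 0, 1]
```

With pandas, the same mapping could be applied with `data['label'].map(label2id)`. Keeping `id2label` around makes it easy to turn predicted ids back into class names later.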

## Create train and test datasets

In [3]:
# Split data into 80% train and 20% test
train, test = train_test_split(data, test_size=0.2, shuffle=True)

tds = Dataset.from_pandas(train)
test_ds = Dataset.from_pandas(test)

ds = DatasetDict()

ds['train'] = tds
ds['test'] = test_ds

print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 864
    })
    test: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 216
    })
})


## Tokenize text

In [4]:
# Enter model name that will be used for the fine-tuning
model_name = "dmis-lab/biobert-base-cased-v1.1"
# Enter the number of different labels in the dataset (CDK4/6 inhibitors vs FtsZ)
num_labels = 2

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Notice here we are tokenizing the 'text' column
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=512, padding="max_length", truncation=True)


tokenized_datasets = ds.map(tokenize_function, batched=True)
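
To make the `max_length`, `padding="max_length"` and `truncation=True` arguments more concrete, here is a simplified sketch of what they do to a list of token ids. It is not the tokenizer's actual implementation: the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]` when truncating, and the function name `pad_or_truncate` and the token ids are invented for the example. The pad id of 0 is an assumption that matches the `padding_idx=0` shown in the model summary further down.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation=True: drop the overflow
    n_pad = max_length - len(kept)                   # padding="max_length": fill the rest
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token ids; a real title would be converted by the BioBERT tokenizer
short_title = [101, 7592, 2088, 102]
ids, mask = pad_or_truncate(short_title, max_length=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]

long_title = list(range(1, 13))  # 12 tokens, longer than max_length
ids, mask = pad_or_truncate(long_title, max_length=8)
print(len(ids), sum(mask))  # 8 8
```

Titles are usually far shorter than 512 tokens, so padding every example to 512 means most positions are padding. A smaller `max_length`, or dynamic padding per batch (for example with `DataCollatorWithPadding`), would likely reduce training time.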

## Train with PyTorch

In [5]:
tokenized_datasets = tokenized_datasets.remove_columns(["text", "__index_level_0__"])
# If your label column has a different name, rename it to "labels", e.g.:
# tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"]
small_eval_dataset = tokenized_datasets["test"]

### DataLoader

Create a `DataLoader` for your training and test datasets so you can iterate over batches of data:

In [7]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
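
As a rough mental model of what the two `DataLoader` objects do, the sketch below yields batches of example indices in plain Python. It only mimics the batching and shuffling behaviour; the real `DataLoader` also collates tensors and can load data in parallel. The function name `iterate_batches` and the `seed` argument are assumptions made for this illustration.

```python
import random


def iterate_batches(n_examples, batch_size, shuffle=False, seed=None):
    """Yield lists of example indices, one list per batch."""
    indices = list(range(n_examples))
    if shuffle:
        # Shuffle once here; a real DataLoader reshuffles at the start of every epoch
        random.Random(seed).shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


# Dataset sizes taken from the print(ds) output above
train_batches = list(iterate_batches(864, batch_size=8, shuffle=True, seed=0))
eval_batches = list(iterate_batches(216, batch_size=8))

print(len(train_batches))  # 108
print(len(eval_batches))   # 27
print(eval_batches[0])     # [0, 1, 2, 3, 4, 5, 6, 7]
```

Shuffling is only applied to the training data: the order of the evaluation examples does not affect the metrics, so the test loader can stay in its original order.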

Load your model with the number of expected labels:

In [8]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

Some weights of the model checkpoint at dmis-lab/biobert-base-cased-v1.1 were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification we

### Optimizer and learning rate scheduler

Create an optimizer and a learning rate scheduler to fine-tune the model:

In [9]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [10]:
from transformers import get_scheduler

num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
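
The arithmetic behind `num_training_steps` and the shape of the `"linear"` schedule can be worked through in plain Python. The sketch below assumes the dataset size and batch size used in this notebook, and `linear_lr` is a hypothetical helper that mirrors, to the best of our understanding, how a linear schedule with warmup scales the base learning rate; it is not the library's code.

```python
import math

n_train = 864      # training rows reported by print(ds)
batch_size = 8
num_epochs = 5
base_lr = 5e-5

# len(train_dataloader) is the number of batches per epoch
steps_per_epoch = math.ceil(n_train / batch_size)
num_training_steps = num_epochs * steps_per_epoch


def linear_lr(step, total_steps, lr=base_lr, warmup_steps=0):
    """Learning rate after `step` scheduler updates (linear warmup, then linear decay)."""
    if step < warmup_steps:
        return lr * step / max(1, warmup_steps)
    return lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(steps_per_epoch, num_training_steps)   # 108 540
print(linear_lr(0, num_training_steps))      # 5e-05
print(linear_lr(270, num_training_steps))    # 2.5e-05
print(linear_lr(540, num_training_steps))    # 0.0
```

Because the rate reaches zero after the last planned step, running more training steps with the same scheduler object would apply updates with a learning rate of zero. Create a fresh optimizer and scheduler before training again.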

Lastly, specify `device` to use a GPU if you have access to one. Otherwise, training on a CPU may take several hours instead of a couple of minutes.

In [11]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

## Train model via training loop

In [12]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()  # Compute the gradients (partial derivatives of the loss with respect to the model weights)
        optimizer.step()  # Update the weights using the gradients computed by loss.backward()
        lr_scheduler.step()  # Advance the learning rate schedule by one step
        optimizer.zero_grad()  # Reset the gradients so they do not accumulate across iterations
        progress_bar.update(1)

        # ### LOGGING (optional): to use it, write the inner loop as
        # ### `for batch_idx, batch in enumerate(train_dataloader):` so that batch_idx is defined
        # if not batch_idx % 20:  # log every 20th batch
        #     print(f'Epoch: {epoch+1:03d}/{num_epochs:03d}'
        #           f' | Batch {batch_idx:03d}/{len(train_dataloader):03d}'
        #           f' | Loss: {loss:.2f}')

  0%|          | 0/540 [00:00<?, ?it/s]

### Evaluate

Evaluate accuracy, F1, precision and recall on the test set:

In [13]:
import evaluate

metric = evaluate.combine(["accuracy", "f1", "precision", "recall"])
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())

{'accuracy': 0.9861111111111112, 'f1': 0.9881422924901185, 'precision': 0.984251968503937, 'recall': 0.9920634920634921}
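
To see how the four numbers above relate to each other, the sketch below recomputes them from a confusion matrix. The counts are a back-calculation that happens to reproduce the reported metrics on the 216 test examples; the notebook itself does not print them. It also assumes that label 1 (FtsZ) is treated as the positive class.

```python
# Counts inferred from the reported metrics, not printed by the notebook
tp, fp, fn, tn = 125, 2, 1, 88   # positive class = label 1 (FtsZ), an assumption

accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of all predictions that are correct
precision = tp / (tp + fp)                          # share of predicted positives that are correct
recall = tp / (tp + fn)                             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(round(accuracy, 4))   # 0.9861
print(round(precision, 4))  # 0.9843
print(round(recall, 4))     # 0.9921
print(round(f1, 4))         # 0.9881
```

Under these assumed counts, only 3 of the 216 test titles are misclassified: two CDK4/6 titles predicted as FtsZ and one FtsZ title predicted as CDK4/6.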


## Visualize loss per epoch (work in progress)

In [None]:

# A possible way to record the loss per epoch so that it can be visualized.
# This code is adapted from: https://www.kaggle.com/code/mikeaalv/bert-huggingface-pytorch
# Note: it reuses model, optimizer, lr_scheduler and the dataloaders defined above.
# If the training loop above has already run, re-create the model, optimizer and
# lr_scheduler first; otherwise the linear schedule has already decayed the learning rate to zero.
import time

loss_arr = []

for epoch in range(num_epochs):

    # ========= Training ==========

    print(f"====== Epoch {epoch + 1} of {num_epochs}")
    print("Training...")

    t0 = time.time()

    total_loss = 0.0
    # Switch to training mode
    model.train()

    for step, batch in enumerate(train_dataloader):
        if step % 30 == 0 and step != 0:
            elapsed = time.time() - t0
            print(f"Batch {step:>5,} of {len(train_dataloader):>5,}. Elapsed: {elapsed:.0f} s.")

        # Each batch is a dict of tensors; copy every tensor to the device
        batch = {k: v.to(device) for k, v in batch.items()}

        # PyTorch does not clear previously calculated gradients
        # before a backward pass, so clear them here
        model.zero_grad()

        outputs = model(**batch)
        loss = outputs.loss

        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()

        # Update the learning rate
        lr_scheduler.step()

    avg_train_loss = total_loss / len(train_dataloader)

    loss_arr.append(avg_train_loss)
    print(f"  Average training loss: {avg_train_loss:.2f}")
    print(f"  Training epoch took: {time.time() - t0:.0f} s")

    # ========= Validation ==========

    print("")
    print("Running validation...")
    t0 = time.time()
    # Switch to evaluation mode
    model.eval()

    n_correct, n_examples = 0, 0

    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**batch)

        # Count correct predictions over the whole test set
        predictions = torch.argmax(outputs.logits, dim=-1)
        n_correct += (predictions == batch["labels"]).sum().item()
        n_examples += batch["labels"].size(0)

    print(f"  Accuracy: {n_correct / n_examples:.2f}")
    print(f"  Validation took: {time.time() - t0:.0f} s")

print("")
print("Training complete!")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curve.
plt.plot(loss_arr, 'b-o')

# Label the plot.
plt.title("Training loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.show()

<a id='additional-resources'></a>

## Additional resources

For more fine-tuning examples, refer to:

- [🤗 Transformers Examples](https://github.com/huggingface/transformers/tree/main/examples) includes scripts
  to train common NLP tasks in PyTorch and TensorFlow.

- [🤗 Transformers Notebooks](https://huggingface.co/docs/transformers/main/en/notebooks) contains various notebooks on how to fine-tune a model for specific tasks in PyTorch and TensorFlow.