<a href="https://colab.research.google.com/github/claudelepere/ML_GitHub/blob/main/BERT_for_multi_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## IMPORTANT

Open and modify each .ipynb file only with the editor in which it was created: do not open a JupyterLab notebook in Colab or vice versa, and do not open any notebook in an editor that renders notebook output, such as VS Code or VSCodium.

---

Do not use conda in a Colab notebook; install packages with pip only.



In [5]:
!pip install -q accelerate
!pip install -q huggingface_hub
!pip install -q scikit-learn
!pip install -q transformers datasets
!pip install -q wandb

import json
import numpy as np
import os
import sys
import time
import torch
import wandb

from datasets              import DatasetDict
from google.colab          import auth, drive, files, userdata
from huggingface_hub       import create_repo, login, upload_file
from huggingface_hub.utils import RepositoryNotFoundError
from sklearn.metrics       import accuracy_score, average_precision_score, classification_report, f1_score, precision_score, recall_score, roc_auc_score
from torch.utils.data      import DataLoader
from tqdm.auto             import tqdm
from transformers          import AutoModelForSequenceClassification, AutoTokenizer, EvalPrediction, Trainer, TrainingArguments


In [6]:
"""
# Check the Python version
print(sys.version)
print()

# Get the installed packages (you can see that conda is not installed (do not install it))
!pip list
print()

# Check system information
!cat /etc/os-release
!uname -m
print()

# Check the GPU details (only if the runtime type is T4 GPU)
#!nvidia-smi
#print()

# Check RAM
!free -h
print()

# Check disk space
!df -h
print()

# Get environment variables
for key, value in os.environ.items():
    print(f"{key}: {value}")
"""
!python -V

print(f"currentdir: {os.getcwd()}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

datasetDict_zip_file_name = "dataset_11_10000.zip"
datasetDict_dir_name      = os.path.splitext(datasetDict_zip_file_name)[0]
print(f"datasetDict_zip_file_name: {datasetDict_zip_file_name}")
print(f"datasetDict_dir_name     : {datasetDict_dir_name}")
print()

batch_size    = 8
epochs        = 5
learning_rate = 2e-5
threshold     = 0.2

run_name = f"BERT-multilabel-{datasetDict_dir_name}-batch{batch_size}-epochs{epochs}-lr{learning_rate}-threshold{threshold}"
print(f"run_name                 : {run_name}")

Python 3.10.12
currentdir: /content
device: cuda
datasetDict_zip_file_name: dataset_11_10000.zip
datasetDict_dir_name     : dataset_11_10000

run_name                 : BERT-multilabel-dataset_11_10000-batch8-epochs5-lr2e-05-threshold0.2


## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

Note that you can also easily load your local data (e.g. CSV, TXT, Parquet or JSON files) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files).



In [7]:
def upload_unzip_dataset(file_name=datasetDict_zip_file_name):
  # Upload the zip file only if it is not already in /content
  if not os.path.exists(file_name):
    print(f"'{file_name}' not found in /content. Uploading...")
    uploaded_files = files.upload()                              # Prompt file upload dialog
    if file_name not in uploaded_files:
      raise FileNotFoundError(f"'{file_name}' was not uploaded. Please try again.")
    print(f"'{file_name}' successfully uploaded to /content")
  else:
    print(f"'{file_name}' already exists in /content.")

  # Unzip only if the dataset directory is not already there
  unzipped_dir_name = os.path.splitext(file_name)[0]
  assert unzipped_dir_name==datasetDict_dir_name, "unzipped_dir_name != datasetDict_dir_name"
  if not os.path.isdir(unzipped_dir_name):
    !unzip {file_name}


In [8]:
upload_unzip_dataset(datasetDict_zip_file_name)

'dataset_11_10000.zip' not found in /content. Uploading...


Saving dataset_11_10000.zip to dataset_11_10000.zip
'dataset_11_10000.zip' successfully uploaded to /content
Archive:  dataset_11_10000.zip
  inflating: dataset_11_10000/dataset_dict.json  
  inflating: dataset_11_10000/test/data-00000-of-00001.arrow  
  inflating: dataset_11_10000/test/dataset_info.json  
  inflating: dataset_11_10000/test/state.json  
  inflating: dataset_11_10000/train/data-00000-of-00001.arrow  
  inflating: dataset_11_10000/train/dataset_info.json  
  inflating: dataset_11_10000/train/state.json  
  inflating: dataset_11_10000/validation/data-00000-of-00001.arrow  
  inflating: dataset_11_10000/validation/dataset_info.json  
  inflating: dataset_11_10000/validation/state.json  


In [9]:
# Authenticate with Hugging Face

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")    # Store the key in os.environ
hf_token               = os.environ.get('HF_TOKEN')
login(token=hf_token)

# Verify
!huggingface-cli whoami


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


claudelepere


In [10]:
# Create the skill_classification repo on the Hugging Face Hub

HF_name         = "claudelepere/skill_classification"
repo_id_model   = HF_name
repo_id_dataset = HF_name

repo_model_url = create_repo(
    repo_id   = repo_id_model,
    repo_type = "model",
    private   = True,
    exist_ok  = True
    )
print(f"Repo model url: {repo_model_url} created successfully as a private repo.")

repo_dataset_url = create_repo(
    repo_id   = repo_id_dataset,
    repo_type = "dataset",
    private   = True,
    exist_ok  = True
    )
print(f"Repo datasets url: {repo_dataset_url} created successfully as a private repo.")

repo_id_dataset = f"datasets/{HF_name}"

print(f"repo_id_model: {repo_id_model}")
print(f"repo_id_dataset: {repo_id_dataset}")


Repo model url: https://huggingface.co/claudelepere/skill_classification created successfully as a private repo.
Repo datasets url: https://huggingface.co/datasets/claudelepere/skill_classification created successfully as a private repo.
repo_id_model: claudelepere/skill_classification
repo_id_dataset: datasets/claudelepere/skill_classification


In [11]:
# W&B initialization

os.environ["WANDB_API_KEY"] = userdata.get("WANDB_API_KEY")        # Store the key in os.environ
wandb_api_key               = os.environ.get('WANDB_API_KEY')
wandb.login(key=wandb_api_key)

try:
  wandb.init(
      project = "skill_classification",
      name    = run_name,
      entity  = "claudelepere-c-cile-cy",
      config  = {
          "learning_rate": learning_rate,    # Reuse the values from the configuration cell
          "epochs"       : epochs,
          "batch_size"   : batch_size
          }
      )
except wandb.errors.CommError as err:
  print(f"CommError: {err}")
except Exception as exc:
  print(f"Exception: {exc}")


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mclaudelepere[0m ([33mclaudelepere-c-cile-cy[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# My fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of those work in the same way: they add a **linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels)**, indicating the unnormalized scores for a number of labels for every example in the batch.
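
To make "unnormalized scores" concrete, here is a minimal sketch in plain Python. The logits are made-up numbers, not output from this notebook's model. It contrasts the per-label sigmoid used for multi-label classification with the softmax used for single-label classification: with a sigmoid, each label gets its own independent probability, so several labels can be "on" at once.

```python
import math

def sigmoid(x):
    # Squash one logit into an independent probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Turn all logits into one distribution that sums to 1
    m = max(xs)                                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one example and 6 labels (made-up numbers)
logits = [2.0, -1.0, 0.5, 3.0, -2.5, 0.0]

label_probs = [sigmoid(x) for x in logits]   # multi-label: one probability per label
class_probs = softmax(logits)                # single-label: probabilities compete

print("sigmoid:", [round(p, 3) for p in label_probs], "sum =", round(sum(label_probs), 3))
print("softmax:", [round(p, 3) for p in class_probs], "sum =", round(sum(class_probs), 3))
```

The sigmoid values do not have to sum to 1, which is exactly what we want when a text can carry several skills at the same time.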


In [12]:
# Create the dataset: 3 Hugging Face Dataset in a Hugging Face DatasetDict

datasetDict = DatasetDict.load_from_disk(datasetDict_dir_name)

print(f"datasetDict: {type(datasetDict)} {datasetDict.shape}\n{datasetDict}")


datasetDict: <class 'datasets.dataset_dict.DatasetDict'> {'train': (7000, 8), 'validation': (1500, 8), 'test': (1500, 8)}
DatasetDict({
    train: Dataset({
        features: ['id', 'text', '390', '135', '136', '137', '138', '139'],
        num_rows: 7000
    })
    validation: Dataset({
        features: ['id', 'text', '390', '135', '136', '137', '138', '139'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['id', 'text', '390', '135', '136', '137', '138', '139'],
        num_rows: 1500
    })
})


As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

Let's look at the first example of the training split:

In [13]:
example = datasetDict['train'][0]
print(f"example: {type(example)} {example.keys()}\n{example}")

example: <class 'dict'> dict_keys(['id', 'text', '390', '135', '136', '137', '138', '139'])
{'id': 155200, 'text': 'Vivid Resourcing - Service Engineer Service, Engineer, Mechanical Vivid Resourcing Main Tasks Provide high level support to a varied range of external customers & internal stakeholders to ensure a high performance of machinery and a strong level of satisfaction from end users and contributors Take a thorough approach in seeking, and diagnosing, bugs or problems relating to either the electronic Hardware or software related problems Document results of analysis, pass to colleagues in R&D to help form the basis of modifications and amendments to product ranges Skills Degree in a related field i.e. mechanical engineering, electronics, automation etc. Strong analytical mindset and excellent problem-solving abilities Interest in mechnical engineering and effective automation Passion for learning new technologies and skills, especially relating to software & programming Fluent 

In [14]:
# Create the label list and the id2label and label2id mappings.

"""
dataset 7_1000_125_125  ,  48 labels
dataset 7_128_18_54     ,  42 labels
dataset 8910_1087_68_204, 206 labels
dataset 11_1000         ,   6 labels
"""

labels = [label for label in datasetDict['train'].features.keys() if label not in ['id', 'text']]
labels.sort()
print(f"labels: {type(labels)} {len(labels)}\n{labels}")

id2label = {idx:label for idx, label in enumerate(labels)}
print(f"id2label: {type(id2label)} {len(id2label)}\n{id2label}")

label2id = {label:idx for idx, label in enumerate(labels)}
print(f"label2id: {type(label2id)} {len(label2id)}\n{label2id}")

labels: <class 'list'> 6
['135', '136', '137', '138', '139', '390']
id2label: <class 'dict'> 6
{0: '135', 1: '136', 2: '137', 3: '138', 4: '139', 5: '390'}
label2id: <class 'dict'> 6
{'135': 0, '136': 1, '137': 2, '138': 3, '139': 4, '390': 5}


The dataset consists of texts, labeled with one or more skills.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [15]:
# Upload to the HF repo_id_dataset the label list as a JSON file

labels_path = "labels.json"
with open(labels_path, 'w') as f:
    json.dump(labels, f)
print(f"labels saved to {labels_path}")

upload_file(
    path_or_fileobj = labels_path,
    path_in_repo    = labels_path,
    repo_id         = HF_name,
    repo_type       = "dataset"
    )
print(f"labels uploaded to https://huggingface.co/datasets/{HF_name}/tree/main/{labels_path}")

No files have been modified since last commit. Skipping to prevent empty commit.


labels saved to labels.json
labels uploaded to https://huggingface.co/datasets/claudelepere/skill_classification/tree/main/labels.json


## Preprocess data

As models like BERT don't expect text as direct input, but rather **`input_ids`**, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a **matrix of shape (batch_size, num_labels)**. Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's **BCEWithLogitsLoss** (which the model will use) raises an error, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).

### Preprocess
Note: the function argument is named `examples`, not `example`, because with `batched=True` the `map` call passes a whole batch of rows at once.
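
As a rough illustration of that batch layout, the sketch below uses a hypothetical 3-row batch with made-up texts and only three of the label columns. A batch arrives as a dictionary with one key per column and one list entry per row, and the float label matrix is built row by row from the label columns. The helper name `build_label_matrix` is ours, not part of the `datasets` API.

```python
# Hypothetical batch of 3 rows, laid out the way a batched map passes it:
# one key per column, one list entry per row.
examples = {
    "text": ["first job ad", "second job ad", "third job ad"],
    "135":  [0, 1, 0],
    "136":  [1, 1, 0],
    "137":  [0, 0, 1],
}
label_names = ["135", "136", "137"]

def build_label_matrix(examples, label_names):
    # One row per text, one float column per label (1.0 = label present)
    num_rows = len(examples["text"])
    return [
        [1.0 if examples[name][row] else 0.0 for name in label_names]
        for row in range(num_rows)
    ]

labels_matrix = build_label_matrix(examples, label_names)
for row in labels_matrix:
    print(row)
```

The real `preprocess_data` below does the same thing column by column with a `torch.float32` tensor.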

In [16]:
# Tokenize 'text' in the 3 datasets, train, validation and test

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [17]:
def preprocess_data(examples, indices):
  text = examples['text']    # Batch of texts

  encoding = tokenizer(
      text,                           # Tokenize text
      truncation     = True,
      padding        = 'max_length',
      max_length     = 512,
      return_tensors = 'pt'           # Return PyTorch tensors
      )

  # Create an empty label matrix
  labels_matrix = torch.zeros((len(text), len(labels)), dtype=torch.float32)
  #print(f"labels_matrix: {type(labels_matrix)} {labels_matrix.shape}")

  # Populate label matrix
  for idx, label in enumerate(labels):
    #print(f"idx:{idx} label:{label}")
    if label in examples:
      labels_matrix[:, idx] = torch.tensor(
          [1.0 if val else 0.0 for val in examples[label]],
          dtype=torch.float32
          )
  #print(f"labels_matrix: {type(labels_matrix)} {labels_matrix.shape}")

  # Add labels to the encoding
  encoding['labels'] = labels_matrix
  #print(f"encoding['labels']: {encoding['labels']}")

  return encoding

### Tokens are not words, so the text is truncated

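Tokens are the units the tokenizer produces after processing the text. BERT uses **subword units** produced by WordPiece (other Transformer models use similar methods, such as Byte-Pair Encoding), so a single word that is not in the model's vocabulary is split into several tokens. As a result, a text with fewer than 512 words can still exceed the 512-token limit and be truncated.

Example (illustrative; the actual split depends on the vocabulary): 'unbelievable' => 3 tokens, 'un', '##believ', '##able'. In the first training example, the misspelled word 'mechnical' is encoded as three token ids (2033, 2818, 20913).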
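
The splitting can be sketched with a toy greedy longest-match-first tokenizer, which is the matching strategy WordPiece uses at inference time. The vocabulary below is hypothetical and tiny; the real bert-base-uncased vocabulary has about 30,000 entries and may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"un", "##believ", "##able", "engineer"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("engineer", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that is in the vocabulary stays in one piece, while an out-of-vocabulary word costs several tokens, which is why token counts exceed word counts.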


In [18]:
def get_truncated_part(text):
  tokens = tokenizer(
      text,
      truncation                = True,
      padding                   = 'max_length',
      max_length                = 512,
      return_overflowing_tokens = True,
      return_tensors            = None
      )
  print(f"tokens.keys(): {tokens.keys()}")

  # Get the truncated tokens
  truncated_ids = tokens["input_ids"][0]
  print(f"truncated_ids: {type(truncated_ids)} {truncated_ids}")
  #overflow_ids  = tokens["overflow_to_sample_mapping"][0]
  #print(f"overflow_ids: {type(overflow_ids)} {overflow_ids}")

  # Decode the tokens back to text
  truncated_text = tokenizer.decode(truncated_ids, skip_special_tokens=True)
  #overflow_text  = tokenizer.decode(overflow_ids, skip_special_tokens=True)

  print(f"original_text :\n{text}")
  print(f"truncated_text:\n{truncated_text}")
  #print(f"overflow_text:\n{overflow_text}")

  original_tokens  = tokenizer.tokenize(text)
  truncated_tokens = tokenizer.tokenize(truncated_text)
  #overflow_tokens  = tokenizer.tokenize(overflow_text)

  print(f"original_tokens count : {len(original_tokens)}")
  print(f"truncated_tokens count: {len(truncated_tokens)}")
  #print(f"overflow_tokens count: {len(overflow_tokens)}")


In [19]:
example_text = datasetDict['train'][0]['text']
get_truncated_part(example_text)


tokens.keys(): dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'])
truncated_ids: <class 'list'> [101, 14954, 24501, 8162, 6129, 1011, 2326, 3992, 2326, 1010, 3992, 1010, 6228, 14954, 24501, 8162, 6129, 2364, 8518, 3073, 2152, 2504, 2490, 2000, 1037, 9426, 2846, 1997, 6327, 6304, 1004, 4722, 22859, 2000, 5676, 1037, 2152, 2836, 1997, 10394, 1998, 1037, 2844, 2504, 1997, 9967, 2013, 2203, 5198, 1998, 16884, 2202, 1037, 16030, 3921, 1999, 6224, 1010, 1998, 22939, 26745, 7741, 1010, 12883, 2030, 3471, 8800, 2000, 2593, 1996, 4816, 8051, 2030, 4007, 3141, 3471, 6254, 3463, 1997, 4106, 1010, 3413, 2000, 8628, 1999, 1054, 1004, 1040, 2000, 2393, 2433, 1996, 3978, 1997, 12719, 1998, 16051, 2000, 4031, 8483, 4813, 3014, 1999, 1037, 3141, 2492, 1045, 1012, 1041, 1012, 6228, 3330, 1010, 8139, 1010, 19309, 4385, 1012, 2844, 17826, 9273, 3388, 1998, 6581, 3291, 1011, 13729, 7590, 3037, 1999, 2033, 2818, 20913, 3330, 1998, 4621, 19309, 6896, 2005, 4083, 2047, 
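
To see where the 512 limit bites, here is a simplified sketch of what `truncation=True` with `padding='max_length'` amounts to for a single text. It is not the tokenizer's actual implementation. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of bert-base-uncased, and the function name `truncate_and_pad` is ours.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # special-token ids assumed for bert-base-uncased

def truncate_and_pad(content_ids, max_length=512):
    room = max_length - 2                       # [CLS] and [SEP] each take one slot
    kept = content_ids[:room]                   # tokens that fit
    dropped = content_ids[room:]                # tokens lost to truncation
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 = real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding             # pad short texts up to max_length
    attention_mask += [0] * padding             # 0 = padding, ignored by attention
    return input_ids, attention_mask, dropped

# A hypothetical 600-token text: 510 tokens are kept, 90 are dropped
long_ids, long_mask, long_dropped = truncate_and_pad(list(range(1000, 1600)))
print(len(long_ids), sum(long_mask), len(long_dropped))

# A hypothetical 10-token text: nothing is dropped, the rest is padding
short_ids, short_mask, short_dropped = truncate_and_pad([2000] * 10)
print(len(short_ids), sum(short_mask), len(short_dropped))
```

Only 510 content tokens survive per text, so anything a job ad says beyond that point never reaches the model.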

In [20]:
# Create the 3 encoded datasets, train, validation and test

encoded_dataset = datasetDict.map(
    preprocess_data,
    batched        = True,
    remove_columns = datasetDict['train'].column_names,
    with_indices   = True
    )
train_dataset      = encoded_dataset['train']
validation_dataset = encoded_dataset['validation']
test_dataset       = encoded_dataset['test']
print(f"encoded_dataset: {type(encoded_dataset)} {encoded_dataset.shape}\n{encoded_dataset}")
print(f"train_dataset: {type(train_dataset)} {train_dataset.shape}")
print(f"validation_dataset: {type(validation_dataset)} {validation_dataset.shape}")
print(f"test_dataset: {type(test_dataset)} {test_dataset.shape}")

Map:   0%|          | 0/7000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

encoded_dataset: <class 'datasets.dataset_dict.DatasetDict'> {'train': (7000, 4), 'validation': (1500, 4), 'test': (1500, 4)}
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7000
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1500
    })
})
train_dataset: <class 'datasets.arrow_dataset.Dataset'> (7000, 4)
validation_dataset: <class 'datasets.arrow_dataset.Dataset'> (1500, 4)
test_dataset: <class 'datasets.arrow_dataset.Dataset'> (1500, 4)


In [21]:
example = encoded_dataset['train'][0]

print(f"example: {type(example)} {example.keys()}\n{example}")
print()
print(f"example['input_ids']: {type(example['input_ids'])} {len(example['input_ids'])}\n{example['input_ids']}")
print(f"example['token_type_ids']: {type(example['token_type_ids'])} {len(example['token_type_ids'])}\n{example['token_type_ids']}")
print(f"example['attention_mask']: {type(example['attention_mask'])} {len(example['attention_mask'])}\n{example['attention_mask']}")
print(f"example['labels']:  {type(example['labels'])} {len(example['labels'])}\n{example['labels']}")

example: <class 'dict'> dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
{'input_ids': [101, 14954, 24501, 8162, 6129, 1011, 2326, 3992, 2326, 1010, 3992, 1010, 6228, 14954, 24501, 8162, 6129, 2364, 8518, 3073, 2152, 2504, 2490, 2000, 1037, 9426, 2846, 1997, 6327, 6304, 1004, 4722, 22859, 2000, 5676, 1037, 2152, 2836, 1997, 10394, 1998, 1037, 2844, 2504, 1997, 9967, 2013, 2203, 5198, 1998, 16884, 2202, 1037, 16030, 3921, 1999, 6224, 1010, 1998, 22939, 26745, 7741, 1010, 12883, 2030, 3471, 8800, 2000, 2593, 1996, 4816, 8051, 2030, 4007, 3141, 3471, 6254, 3463, 1997, 4106, 1010, 3413, 2000, 8628, 1999, 1054, 1004, 1040, 2000, 2393, 2433, 1996, 3978, 1997, 12719, 1998, 16051, 2000, 4031, 8483, 4813, 3014, 1999, 1037, 3141, 2492, 1045, 1012, 1041, 1012, 6228, 3330, 1010, 8139, 1010, 19309, 4385, 1012, 2844, 17826, 9273, 3388, 1998, 6581, 3291, 1011, 13729, 7590, 3037, 1999, 2033, 2818, 20913, 3330, 1998, 4621, 19309, 6896, 2005, 4083, 2047, 6786, 1998, 4813, 1010, 292

In [22]:
tokenizer.decode(example['input_ids'])

'[CLS] vivid resourcing - service engineer service, engineer, mechanical vivid resourcing main tasks provide high level support to a varied range of external customers & internal stakeholders to ensure a high performance of machinery and a strong level of satisfaction from end users and contributors take a thorough approach in seeking, and diagnosing, bugs or problems relating to either the electronic hardware or software related problems document results of analysis, pass to colleagues in r & d to help form the basis of modifications and amendments to product ranges skills degree in a related field i. e. mechanical engineering, electronics, automation etc. strong analytical mindset and excellent problem - solving abilities interest in mechnical engineering and effective automation passion for learning new technologies and skills, especially relating to software & programming fluent english speaker the offer strong package including a yearly salary paid in line with your experience & s

In [23]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['137', '138', '139']

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [24]:
# Set the PyTorch format so that the datasets return tensors, as PyTorch pipelines expect

# The 3 Hugging Face Datasets now return PyTorch tensors
encoded_dataset.set_format('torch')

## Define the model

Here we define a **model that loads a pre-trained base (i.e. the weights from bert-base-uncased) and puts a randomly initialized classification head (linear layer) on top**. This head should be fine-tuned, together with the pre-trained base, on a labeled dataset.

The warning printed when the model is loaded says the same thing.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [**BCEWithLogitsLoss**](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [25]:
# Define the model

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    problem_type="multi_label_classification",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
    )

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model

We are going to train the model using Hugging Face's Trainer API. This requires us to define 2 things:

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, for example, we specify that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [26]:
# batch_size is already defined in the configuration cell at the top of the notebook
metric_name = "f1"

### TrainingArguments

In [27]:
output_dir = "training_results"  # where model predictions and checkpoints will be written during training
args = TrainingArguments(
    output_dir                  = output_dir,
    overwrite_output_dir        = True,
    logging_dir                 = "logs",
    logging_steps               = 50,
    save_steps                  = 500,
    save_total_limit            = 2,
    eval_strategy               = "epoch",
    save_strategy               = "epoch",
    learning_rate               = learning_rate,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size  = batch_size,
    num_train_epochs            = epochs,
    weight_decay                = 0.01,
    load_best_model_at_end      = True,
    metric_for_best_model       = metric_name,
    run_name                    = run_name,
    report_to                   = "wandb"
    )
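
As a quick sanity check on these settings, the arithmetic below estimates how many optimizer steps the run takes. It assumes a single GPU and no gradient accumulation, which is what this Colab setup appears to use. The row counts come from the dataset splits shown earlier.

```python
import math

train_rows, validation_rows = 7000, 1500   # split sizes of dataset_11_10000
batch_size, epochs = 8, 5                  # values from the configuration cell
logging_steps = 50

# Assumption: one device and gradient_accumulation_steps = 1
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps     = steps_per_epoch * epochs
eval_batches    = math.ceil(validation_rows / batch_size)
log_events      = total_steps // logging_steps

print(f"steps per epoch : {steps_per_epoch}")
print(f"total steps     : {total_steps}")
print(f"eval batches    : {eval_batches}")
print(f"logged points   : {log_events}")
```

With `save_strategy = "epoch"`, a checkpoint is written at the end of each epoch, and `save_total_limit = 2` caps how many are kept on disk.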

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [28]:
# Metrics
#   source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/

def multi_label_metrics(predictions, labels):
    average = 'micro'    # 'micro' or 'weighted'

    # first, apply sigmoid on predictions whose shape is (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs   = sigmoid(torch.Tensor(predictions))

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1

    # finally, compute metrics
    y_true               = labels
    f1                   = f1_score               (y_true=y_true, y_pred=y_pred, average=average)    #, zero_division=1)
    precision            = precision_score        (y_true=y_true, y_pred=y_pred, average=average)    #, zero_division=1)
    recall               = recall_score           (y_true=y_true, y_pred=y_pred, average=average)    #, zero_division=1)
    roc_auc              = roc_auc_score          (y_true=y_true, y_score=probs, average=average)
    precision_recall_auc = average_precision_score(y_true=y_true, y_score=probs, average=average)
    accuracy             = accuracy_score         (y_true=y_true, y_pred=y_pred)

    # return as dictionary
    metrics = {
        'f1'                  : f1,
        'precision'           : precision,
        'recall'              : recall,
        'roc_auc'             : roc_auc,
        'precision_recall_auc': precision_recall_auc,
        'accuracy'            : accuracy
        }

    return metrics

In [29]:
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(
        predictions = preds,
        labels      = p.label_ids
        )
    return result
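
To make the micro-averaged numbers less abstract, here is a hand-computed sketch on a made-up 3-example, 3-label batch. The probabilities and labels are hypothetical. Micro averaging pools true positives, false positives and false negatives over all labels before computing precision, recall and F1. For multi-label data, scikit-learn's `accuracy_score` is the much stricter subset accuracy: a row only counts if every label is right.

```python
threshold = 0.2

# Hypothetical sigmoid outputs and ground truth (3 examples x 3 labels)
probs  = [[0.90, 0.10, 0.30],
          [0.15, 0.80, 0.25],
          [0.60, 0.05, 0.10]]
y_true = [[1, 0, 0],
          [0, 1, 1],
          [1, 0, 1]]

# Binarize with the same >= rule as multi_label_metrics
y_pred = [[1 if p >= threshold else 0 for p in row] for row in probs]

pairs = [(t, p) for t_row, p_row in zip(y_true, y_pred) for t, p in zip(t_row, p_row)]
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

micro_precision = tp / (tp + fp)
micro_recall    = tp / (tp + fn)
micro_f1        = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Subset accuracy: the whole row of labels must match
subset_accuracy = sum(1 for t_row, p_row in zip(y_true, y_pred) if t_row == p_row) / len(y_true)

print(f"tp={tp} fp={fp} fn={fn}")
print(f"micro precision={micro_precision:.3f} recall={micro_recall:.3f} f1={micro_f1:.3f}")
print(f"subset accuracy={subset_accuracy:.3f}")
```

This is why the accuracy column in the training log can sit far below F1: one wrong label out of six makes the whole example count as a miss.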

Let's verify a batch as well as a forward pass:

In [30]:
first_example = encoded_dataset['train'][0]    # Index the row first to avoid loading whole columns
print(f"input_ids:      {type(first_example['input_ids'])}\t{first_example['input_ids'].shape}")
print(f"token_type_ids: {type(first_example['token_type_ids'])}\t{first_example['token_type_ids'].shape}")
print(f"attention_mask: {type(first_example['attention_mask'])}\t{first_example['attention_mask'].shape}")
print(f"labels:         {type(first_example['labels'])}\t{first_example['labels'].shape}")

input_ids:      <class 'torch.Tensor'>	torch.Size([512])
token_type_ids: <class 'torch.Tensor'>	torch.Size([512])
attention_mask: <class 'torch.Tensor'>	torch.Size([512])
labels:         <class 'torch.Tensor'>	torch.Size([6])


In [31]:
# Execute a forward pass for debugging or verification purposes (cf. BERT_3_1 in Notion BERT database)

outputs = model(
    input_ids      = encoded_dataset['train']['input_ids'][0].unsqueeze(0),
    attention_mask = encoded_dataset['train']['attention_mask'][0].unsqueeze(0),
    labels         = encoded_dataset['train'][0]['labels'].unsqueeze(0)
    )

print(f"outputs: {type(outputs)} {outputs.keys()}\n{outputs}")

outputs: <class 'transformers.modeling_outputs.SequenceClassifierOutput'> odict_keys(['loss', 'logits'])
SequenceClassifierOutput(loss=tensor(0.7612, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[ 0.0565,  0.2041, -0.0686, -0.1558, -0.3548, -0.0740]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
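
The loss above can be checked by hand. The sketch below recomputes binary cross-entropy with logits in plain Python, using the rounded logits from the output and the labels of the first training example ('137', '138' and '139' set, i.e. positions 2, 3 and 4). It assumes the default mean reduction over all label positions. Because the printed logits are rounded, the result should only agree with the reported 0.7612 up to rounding.

```python
import math

def bce_with_logits(logits, targets):
    # Mean over labels of the numerically stable form:
    #   max(x, 0) - x*y + log(1 + exp(-|x|))
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

# Logits copied (rounded) from the forward pass above
logits  = [0.0565, 0.2041, -0.0686, -0.1558, -0.3548, -0.0740]
# Label order is ['135', '136', '137', '138', '139', '390']
targets = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

loss = bce_with_logits(logits, targets)
print(f"hand-computed loss: {loss:.4f}")
```

An untrained head produces logits near 0, so each label contributes roughly log(2) ≈ 0.69, which is the ballpark of the value we see here.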


Let's start training!

In [32]:
# Create the trainer

trainer = Trainer(
    model,
    args,
    train_dataset   = encoded_dataset["train"],
    eval_dataset    = encoded_dataset["validation"],
    tokenizer       = tokenizer,
    compute_metrics = compute_metrics
    )




In [33]:
# Train, save the results as a JSON file

train_output  = trainer.train()

train_results = {
    'global_step':   train_output.global_step,    # total steps completed during training
    'training_loss': train_output.training_loss,  # average loss during training
    'metrics':       train_output.metrics         # dictionary of metrics
}

# Save train results
with open("train_results.json", "w") as f:
  json.dump(train_results, f, indent=4)

| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall | Roc Auc | Precision Recall Auc | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.3445 | 0.33495 | 0.830724 | 0.740494 | 0.945995 | 0.930031 | 0.895993 | 0.323333 |
| 2 | 0.2851 | 0.323742 | 0.835856 | 0.760593 | 0.927649 | 0.938047 | 0.906183 | 0.330667 |
| 3 | 0.2399 | 0.294704 | 0.857211 | 0.794442 | 0.930749 | 0.948285 | 0.925616 | 0.435333 |
| 4 | 0.2373 | 0.287823 | 0.859213 | 0.79525 | 0.934367 | 0.951606 | 0.929337 | 0.435333 |
| 5 | 0.1883 | 0.291572 | 0.863746 | 0.808472 | 0.927132 | 0.951596 | 0.929675 | 0.452667 |

In [34]:
print("Training successfully completed.")

Training successfully completed.


## Evaluate

After training, we evaluate our model on the validation set.

In [35]:
def get_results(model, dataset, batch_size, threshold):

  # Set the model to evaluation mode to disable dropout and other training-specific behaviors
  model.eval()

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)

  test_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

  all_preds       = []
  all_probs       = []
  all_true_labels = []

  for batch in tqdm(test_loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
      outputs = model(**batch)
    logits = outputs.logits

    # Convert logits to probabilities and probabilities to predictions
    sigmoid = torch.nn.Sigmoid()
    probs   = sigmoid(logits).cpu().numpy()    # Convert to Numpy
    preds   = (probs >= threshold).astype(int)  # Binarize with the same >= rule as multi_label_metrics

    # Accumulate probabilities, predictions and labels
    all_probs.append(probs)
    all_preds.append(preds)
    all_true_labels.append(batch['labels'].cpu().numpy())

  # Concatenate results from all batches
  all_probs       = np.concatenate(all_probs, axis=0)        # shape: [num_samples, num_labels]
  all_preds       = np.concatenate(all_preds, axis=0)        # shape: [num_samples, num_labels]
  all_true_labels = np.concatenate(all_true_labels, axis=0)  # shape: [num_samples, num_labels]

  print(f"all_probs:       {type(all_probs)} {all_probs.shape}")
  print(f"all_preds:       {type(all_preds)} {all_preds.shape}")
  print(f"all_true_labels: {type(all_true_labels)} {all_true_labels.shape}")

  # Classification report for precision, recall, F1 score
  print(classification_report(
      y_true        = all_true_labels,
      y_pred        = all_preds,
      target_names  = labels,
      zero_division = 0
      ))

  # ROC AUC for multi-label classification
  roc_auc = roc_auc_score(
      y_true  = all_true_labels,
      y_score = all_probs,
      average = 'micro'
      )
  print(f"ROC AUC: {roc_auc}")

In [36]:
# First evaluation: print the results without saving them

get_results(model=model, dataset=validation_dataset, batch_size=batch_size, threshold=threshold)

  0%|          | 0/188 [00:00<?, ?it/s]

all_probs:       <class 'numpy.ndarray'> (1500, 6)
all_preds:       <class 'numpy.ndarray'> (1500, 6)
all_true_labels: <class 'numpy.ndarray'> (1500, 6)
              precision    recall  f1-score   support

         135       0.74      0.62      0.68       108
         136       0.64      0.69      0.66       294
         137       0.81      0.94      0.87      1033
         138       0.93      1.00      0.96      1345
         139       0.73      0.97      0.83       961
         390       0.64      0.59      0.62       129

   micro avg       0.81      0.93      0.86      3870
   macro avg       0.75      0.80      0.77      3870
weighted avg       0.81      0.93      0.86      3870
 samples avg       0.82      0.94      0.86      3870

ROC AUC: 0.9515964005621289
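
The `macro avg` and `weighted avg` rows of the report can be re-derived from the per-label rows. The sketch below is only a sanity check in plain Python: it copies the rounded f1-score and support values from the validation report above, so it agrees with the report only up to rounding. The variable names (`report_rows`, `macro_f1`, `weighted_f1`) are illustrative and not part of the notebook.

```python
# Per-label (f1-score, support) pairs copied from the validation report above.
# The f1 values are already rounded to two decimals, so results are approximate.
report_rows = {
    "135": (0.68, 108),
    "136": (0.66, 294),
    "137": (0.87, 1033),
    "138": (0.96, 1345),
    "139": (0.83, 961),
    "390": (0.62, 129),
}

# Macro average: unweighted mean over labels, so rare labels count as much as frequent ones
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: mean over labels weighted by support (number of true instances)
total_support = sum(support for _, support in report_rows.values())
weighted_f1 = sum(f1 * support for f1, support in report_rows.values()) / total_support

print(f"total support: {total_support}")
print(f"macro f1:      {macro_f1:.2f}")
print(f"weighted f1:   {weighted_f1:.2f}")
```

The micro average cannot be recovered this way, because it pools true positives, false positives and false negatives over all labels before computing the score. The gap between the macro and weighted values reflects the weaker scores on the low-support labels 135 and 390.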


In [37]:
print("First evaluation successfully completed.")

First evaluation successfully completed.


In [38]:
# Second evaluation with the Trainer; results are saved to /content

eval_output = trainer.evaluate()

# Save evaluation results
with open("eval_results.json", "w") as f:
  json.dump(eval_output, f, indent=4)

In [39]:
print("Second evaluation successfully completed.")

Second evaluation successfully completed.


## Upload model, tokenizer, training results, evaluation results

In [40]:
# Save model to /content

model_path = "model"
trainer.save_model(model_path)

In [41]:
# Upload model and tokenizer to the HF repo_id_model

model     = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.push_to_hub(repo_id_model)
tokenizer.push_to_hub(repo_id_model)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/claudelepere/skill_classification/commit/a1a907125a22288fa2ad27b9928f5f9daba75aff', commit_message='Upload tokenizer', commit_description='', oid='a1a907125a22288fa2ad27b9928f5f9daba75aff', pr_url=None, repo_url=RepoUrl('https://huggingface.co/claudelepere/skill_classification', endpoint='https://huggingface.co', repo_type='model', repo_id='claudelepere/skill_classification'), pr_revision=None, pr_num=None)

In [42]:
# Upload train_results.json and eval_results.json to the HF dataset repo.
# Open question: would the wandb project be a better place for them?

upload_file(
    path_or_fileobj = "train_results.json",
    path_in_repo    = "train_results.json",
    repo_id         = HF_name,
    repo_type       = "dataset"
    )

upload_file(
    path_or_fileobj = "eval_results.json",
    path_in_repo    = "eval_results.json",
    repo_id         = HF_name,
    repo_type       = "dataset"
    )

CommitInfo(commit_url='https://huggingface.co/datasets/claudelepere/skill_classification/commit/1a083122f0c52b6467f7b50dd618088f29798365', commit_message='Upload eval_results.json with huggingface_hub', commit_description='', oid='1a083122f0c52b6467f7b50dd618088f29798365', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/claudelepere/skill_classification', endpoint='https://huggingface.co', repo_type='dataset', repo_id='claudelepere/skill_classification'), pr_revision=None, pr_num=None)

## Test

In [43]:
# First test: results are printed only, not saved

get_results(model=model, dataset=test_dataset, batch_size=batch_size, threshold=threshold)

all_probs:       <class 'numpy.ndarray'> (1500, 6)
all_preds:       <class 'numpy.ndarray'> (1500, 6)
all_true_labels: <class 'numpy.ndarray'> (1500, 6)
              precision    recall  f1-score   support

         135       0.59      0.63      0.61        83
         136       0.62      0.68      0.65       291
         137       0.80      0.94      0.86      1009
         138       0.93      1.00      0.96      1359
         139       0.72      0.95      0.82       970
         390       0.56      0.55      0.56       110

   micro avg       0.80      0.92      0.86      3822
   macro avg       0.70      0.79      0.74      3822
weighted avg       0.80      0.92      0.86      3822
 samples avg       0.81      0.94      0.86      3822

ROC AUC: 0.9477469182402141


In [44]:
print("First test successfully completed.")

First test successfully completed.


In [46]:
# Second test with the Trainer: metrics are printed only, not saved

predictions = trainer.predict(test_dataset)

#print(f"predictions.predictions: {type(predictions.predictions)} {predictions.predictions.shape}\n{predictions.predictions}")  # Model logits
#print(f"predictions.label_ids: {type(predictions.label_ids)} {predictions.label_ids.shape}\n{predictions.label_ids}")          # Ground truth labels
print(f"predictions.metrics: {type(predictions.metrics)} {len(predictions.metrics)}\n{predictions.metrics}")                  # Metrics


predictions.metrics: <class 'dict'> 10
{'test_loss': 0.30430394411087036, 'test_f1': 0.8559660811629316, 'test_precision': 0.7969772163320551, 'test_recall': 0.9243851386708529, 'test_roc_auc': 0.9477469182402141, 'test_precision_recall_auc': 0.9237939898478946, 'test_accuracy': 0.43, 'test_runtime': 44.5435, 'test_samples_per_second': 33.675, 'test_steps_per_second': 4.221}
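
A `test_accuracy` of 0.43 next to a `test_f1` of about 0.86 is not necessarily a contradiction. The metric function is not shown in this section, so the following is an assumption: if the accuracy is computed with scikit-learn's `accuracy_score` on the thresholded multi-label predictions, it is the subset (exact-match) accuracy, where a sample counts as correct only when all of its labels are predicted correctly. The plain-Python sketch below reproduces both definitions on made-up toy data (`y_true`, `y_pred` are hypothetical, not taken from the dataset).

```python
# Toy multi-label data: 4 samples, 3 labels (hypothetical values for illustration)
y_true = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y_pred = [[1, 1, 0], [1, 1, 1], [1, 1, 1], [1, 1, 1]]

# Subset accuracy: a sample is correct only if every label matches
exact_matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
subset_accuracy = exact_matches / len(y_true)

# Micro-averaged F1: pool the counts over all samples and labels
pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"subset accuracy: {subset_accuracy}")
print(f"micro f1:        {micro_f1}")
```

In this toy case a single extra predicted label makes a whole sample count as wrong for subset accuracy, while the micro F1 only loses one false positive among many correct decisions.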


In [None]:
print("Second test successfully completed.")

### Alternatively, inspect the raw logits and labels

In [47]:
# Third test: print raw logits, ground truth labels and metrics; nothing is saved

predictions = trainer.predict(test_dataset)
print(predictions.predictions)  # Model logits
print(predictions.label_ids)    # Ground truth labels
print(predictions.metrics)      # Metrics

[[-4.8155503  -2.9387457   3.8723664   3.792648   -1.0429465  -4.491061  ]
 [-4.1023374   0.20203793  4.53945     3.1439354  -1.4629261  -3.1444325 ]
 [-4.4058104  -4.3287797  -2.8539727   2.2606132   3.675032   -4.654865  ]
 ...
 [-5.134101   -2.9658496   3.9250417   4.115949   -0.46127692 -4.750445  ]
 [-5.5903544  -3.2374227   1.104647    4.3683014   3.177631   -5.1692953 ]
 [-5.207977   -4.7572794  -2.0086288   3.0196774   3.1847699  -5.452267  ]]
[[0. 0. 1. 1. 0. 0.]
 [0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 0.]
 ...
 [0. 0. 1. 1. 1. 0.]
 [0. 1. 1. 1. 1. 0.]
 [0. 0. 0. 1. 1. 0.]]
{'test_loss': 0.30430394411087036, 'test_f1': 0.8559660811629316, 'test_precision': 0.7969772163320551, 'test_recall': 0.9243851386708529, 'test_roc_auc': 0.9477469182402141, 'test_precision_recall_auc': 0.9237939898478946, 'test_accuracy': 0.43, 'test_runtime': 45.6674, 'test_samples_per_second': 32.846, 'test_steps_per_second': 4.117}
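
The printed logits are not probabilities yet. As in `get_results`, each logit goes through a sigmoid independently and is then compared with the threshold. The sketch below applies this to the first three logit rows printed above, using only the standard library. The threshold value of 0.5 is an assumption, since the notebook's `threshold` variable is defined outside this section.

```python
import math

# First three logit rows copied from the output above
logits = [
    [-4.8155503, -2.9387457, 3.8723664, 3.792648, -1.0429465, -4.491061],
    [-4.1023374, 0.20203793, 4.53945, 3.1439354, -1.4629261, -3.1444325],
    [-4.4058104, -4.3287797, -2.8539727, 2.2606132, 3.675032, -4.654865],
]

assumed_threshold = 0.5  # Assumption: the notebook's actual threshold is not shown here


def sigmoid(x: float) -> float:
    """Map a logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


probs = [[sigmoid(x) for x in row] for row in logits]
preds = [[int(p > assumed_threshold) for p in row] for row in probs]

for row in preds:
    print(row)
```

With this assumed threshold, the first and third rows reproduce the corresponding ground truth rows printed above, while the second row misses the last two labels. A threshold of 0.5 on the probability is equivalent to checking whether the logit is positive.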


In [None]:
print("Third test successfully completed.")