### **TFM - Label reduction covering 95% of the dataset: from 7201 to 453 labels (13/01/2025)**

* See
https://discuss.huggingface.co/t/most-efficient-multi-label-classifier/9296/2

* The Artificial Guy - MULTI-LABEL TEXT CLASSIFICATION USING BERT AND PYTORCH
https://www.youtube.com/watch?v=f-86-HcYYi8

* Saurabh Anand - BERT for Multi-Label Classification
https://www.youtube.com/watch?v=JjcxZPNZbUY

* KGP Talkie - 5 - Multi-Label Text Classification Model with DistilBERT and Hugging Face Transformers in PyTorch
https://www.youtube.com/watch?v=ZYc9za75Chk

* Fine Tuning BERT for a Multi-Label Classification Problem on Colab - https://medium.com/@abdurhmanfayad_73788/fine-tuning-bert-for-a-multi-label-classification-problem-on-colab-5ca5b8759f3f

* BERT and DistilBERT Models for NLP - https://medium.com/@kumari01priyanka/bert-and-distilbert-model-for-nlp-7352eb16915e

* Choosing the Right Colab Runtime: A Guide for Data Scientists and Analysts - https://drlee.io/choosing-the-right-colab-runtime-a-guide-for-data-scientists-and-analysts-57ee7b7c9638

* distilbert / distilbert-base-uncased - https://huggingface.co/distilbert/distilbert-base-uncased

* "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" - https://arxiv.org/abs/1910.01108
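
The title states that the label set was reduced from 7201 to 453 labels, but this notebook does not show how those labels were chosen. A minimal sketch of one plausible approach, assumed here purely for illustration: keep the most frequent labels until they account for a target share of all label assignments. The function name, the `coverage` parameter, and the toy data are hypothetical.

```python
from collections import Counter


def select_frequent_labels(doc_labels, coverage=0.95):
    """Return the most frequent labels that together reach `coverage`
    of all label assignments (assumed selection rule, not the author's)."""
    counts = Counter(label for labels in doc_labels for label in labels)
    total = sum(counts.values())
    kept, covered = [], 0
    for label, n in counts.most_common():
        if covered >= coverage * total:
            break
        kept.append(label)
        covered += n
    return kept


# Hypothetical toy corpus: each inner list holds the labels of one document.
toy_docs = [["a", "b"], ["a"], ["a", "c"], ["a", "b"], ["a"], ["b", "d"]]
kept_labels = select_frequent_labels(toy_docs, coverage=0.8)
print(kept_labels)  # ['a', 'b']
```

Documents left without any label after such a filter would typically be dropped, which could explain a dataset smaller than the original, though that is also an assumption.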

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp /content/drive/MyDrive/TFM-MUECIM/*.py /content
!cp /content/drive/MyDrive/TFM-MUECIM/*.txt /content
!cp /content/drive/MyDrive/TFM-MUECIM/*.dat /content
!cp /content/drive/MyDrive/TFM-MUECIM/*.pt /content
!cp /content/drive/MyDrive/TFM-MUECIM/*.csv /content
!cp /content/drive/MyDrive/TFM-MUECIM/*.tar /content
!cd /content; tar xf data.tar data
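# note: each `!` command runs in its own subshell, so this `cd` does not change the notebook's working directory
!cd /content/drive/MyDrive/TFM-MUECIM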

In [None]:
!pip install transformers



In [None]:
import sys
baseDir = '/content' #/drive/My Drive/TFM-MUECIM'
sys.path.append(baseDir)

Based on the notebook:
Fine-tuning BERT (and friends) for multi-label text classification.ipynb

https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb#scrollTo=HgpKXDfvKBxn  

In [None]:
import torch

# use the GPU when available; the model is moved to this device before training
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# fix the random seed for reproducibility
torch.manual_seed(0)

<torch._C.Generator at 0x7a6afcb72a70>

In [None]:
from tfm_EURLEX57KDataset import EURLEX57KDataset
from torch.utils.data import random_split

ds = EURLEX57KDataset(baseDir,'ReducedEURLEX57KDataFrame.csv')
fullSetSize = len(ds)
trainSetSize = int(fullSetSize * 0.8)
valSetSize = int(fullSetSize * 0.1)
testSetSize = fullSetSize - trainSetSize - valSetSize
print(f'Full set size: {fullSetSize}')
print(f'Train set size: {trainSetSize}')
print(f'Validation set size: {valSetSize}')
print(f'Test set size: {testSetSize}')
trainData, valData, testData = random_split(
    ds,
    [trainSetSize, valSetSize, testSetSize]
)

Full set size: 54382
Train set size: 43505
Validation set size: 5438
Test set size: 5439
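
The split sizes printed above follow from truncating the 80% and 10% shares and giving the remainder to the test set, so the three sizes always add up to the full set. A small sketch of that arithmetic; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(total, train_frac=0.8, val_frac=0.1):
    """Truncate the train and validation shares; the test set takes the remainder."""
    train = int(total * train_frac)
    val = int(total * val_frac)
    test = total - train - val
    return train, val, test


sizes = split_sizes(54382)
print(sizes)  # (43505, 5438, 5439)
```

Because the remainder goes to the test set, it ends up one sample larger than the validation set here.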


In [None]:
from torch.utils.data import DataLoader

# set batch size
batchSize = 10

# create dataloaders, in case a more classical training loop is used instead of Trainer
trainDataLoader = DataLoader(trainData, batch_size=batchSize, shuffle=True)
valDataLoader = DataLoader(valData, batch_size=batchSize, shuffle=True)
testDataLoader = DataLoader(testData, batch_size=batchSize, shuffle=True)

In [None]:
def countNonZeroItems(items):
    nonZero = torch.nonzero(items, as_tuple=True)
    return len(nonZero[0])
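
A quick check of the helper on a toy multi-hot label vector. The vector length of 453 matches the reduced label set; the active positions are hypothetical. `torch.count_nonzero` is shown as an equivalent built-in.

```python
import torch


def countNonZeroItems(items):
    nonZero = torch.nonzero(items, as_tuple=True)
    return len(nonZero[0])


# Toy multi-hot vector with three active labels (positions chosen arbitrarily).
labels = torch.zeros(453)
labels[[3, 17, 42]] = 1.0

print(countNonZeroItems(labels))          # 3
print(int(torch.count_nonzero(labels)))   # 3
```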

In [None]:
# iterate through val batches
for i, batch in enumerate(valDataLoader):
  print(f'Batch {i}: ')
  batchFileNames = batch.get('fileName')
  batchData = batch.get('input_ids')
  batchAttentionMasks = batch.get('attention_mask')
  batchLabels = batch.get('labels')

  for elem in zip(batchFileNames, batchData, batchAttentionMasks, batchLabels):
    print(f'fileName: {elem[0]}')
    print(f'input_ids (5 first elements):\n{elem[1][0:5]}')
    print(f'attention_masks (5 first elements):\n{elem[2][0:5]}')
    print(f'Nonzero labels:{countNonZeroItems(elem[3])}\n')

  break

print('Done!')

Batch 0: 
fileName: data/datasets/EURLEX57K/train/32010R0503.json
input_ids (5 first elements):
tensor([3222, 7816, 1006, 7327, 1007])
attention_masks (5 first elements):
tensor([1, 1, 1, 1, 1])
Nonzero labels:6

fileName: data/datasets/EURLEX57K/train/32005R1314.json
input_ids (5 first elements):
tensor([ 3222,  7816,  1006, 14925,  1007])
attention_masks (5 first elements):
tensor([1, 1, 1, 1, 1])
Nonzero labels:3

fileName: data/datasets/EURLEX57K/train/32013R0735.json
input_ids (5 first elements):
tensor([ 2473, 14972,  7816,  1006,  7327])
attention_masks (5 first elements):
tensor([1, 1, 1, 1, 1])
Nonzero labels:1

fileName: data/datasets/EURLEX57K/train/32001R1485.json
input_ids (5 first elements):
tensor([ 7816,  1006, 14925,  1007,  2053])
attention_masks (5 first elements):
tensor([1, 1, 1, 1, 1])
Nonzero labels:2

fileName: data/datasets/EURLEX57K/train/32007L0052.json
input_ids (5 first elements):
tensor([ 3222, 16449,  2289,  1013,  4720])
attention_masks (5 first elements

In [None]:
# bert huggingface pretrained model
import os
from transformers import AutoConfig
from transformers import DistilBertForSequenceClassification
from tfm_ReducedLabelIndex import LabelIndex

labelIndex = LabelIndex(baseDir)

# Define a cache directory for Hugging Face models and ensure it exists.
cache_dir = os.path.join(baseDir, 'tfm_cache')
os.makedirs(cache_dir, exist_ok=True)

# Load the configuration with the cache directory.
config = AutoConfig.from_pretrained(
    'distilbert-base-uncased',
    force_download=True,
    cache_dir=cache_dir,
    num_labels=labelIndex.numLabels,
    problem_type='multi_label_classification',
    id2label=labelIndex.id2label,
    label2id=labelIndex.label2id
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]
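
With `problem_type='multi_label_classification'`, Hugging Face sequence-classification models compute the loss with `BCEWithLogitsLoss`, so each of the 453 labels is treated as an independent yes/no decision. A pure-Python sketch of that loss for a single example; the function names and toy values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, each label scored independently."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Toy example with three labels: two predicted on the correct side, one undecided.
toy_logits = [2.0, -1.0, 0.0]
toy_targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(toy_logits, toy_targets)
print(round(loss, 4))
```

A logit of 0 corresponds to a probability of 0.5, so an undecided label contributes ln 2 to the sum regardless of its target.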

In [None]:
labelIndex.numLabels

453

In [None]:
# 0 means a fresh pretrained model
# 1, 2, 3 ... n means resume from the checkpoint saved after that many epochs
last_epoch_trained = 18

# checkpoint file for each number of trained epochs
checkpointFiles = {
  1: '20250122_tfm_model.pt',
  2: '20250124_tfm_model_1.pt',
  3: '20250124_tfm_model_2.pt',
  6: '20250124_tfm_model_3.pt',
  9: '20250126_tfm_model_1.pt',
  12: '20250126_tfm_model_2.pt',
  15: '20250126_tfm_model_3.pt',
  18: '20250126_tfm_model_4.pt',
}
modelFile = checkpointFiles.get(last_epoch_trained)

In [None]:
# Load the model: fresh pretrained weights for epoch 0, otherwise a saved checkpoint.
if last_epoch_trained == 0:
  model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    config=config,  # Pass the multi-label configuration to the model.
    cache_dir=cache_dir  # Reuse the same cache directory.
  )
else:
  # The checkpoint holds the whole model object, not only its weights.
  model = torch.load(os.path.join(baseDir, modelFile), map_location=torch.device(device))


In [None]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:

# forward pass. no training. test case
# item = trainData.__getitem__(0)

# outputs = model(
#    input_ids=item['input_ids'][0:512].unsqueeze(0),
#    attention_mask=item['attention_mask'][0:512].unsqueeze(0),
#    labels=item['labels'].unsqueeze(0))

In [None]:
# outputs.logits[0]


In [None]:
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
# calculate metrics
# import numpy as np
# from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

# sigmoid = torch.nn.Sigmoid()
# probs = sigmoid(outputs.logits[0])
# threshold = 0.5
# y_pred = np.zeros(probs.shape)
# y_pred[np.where(probs >= threshold)] = 1
# y_true = item['labels'].cpu().numpy() # Convert y_true to a NumPy array
# f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
# roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
# accuracy = accuracy_score(y_true, y_pred)

# metrics = {'f1': f1_micro_average,
#           'roc_auc': roc_auc,
#           'accuracy': accuracy}


In [None]:
# metrics  # code test

In [None]:
# https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb
# https://towardsdatascience.com/evaluating-multi-label-classifiers-a31be83da6ea

import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import torch
from transformers import TrainingArguments
from transformers import Trainer
from transformers import EvalPrediction


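def multi_label_metrics(predictions, labels):
    # turn logits into per-label probabilities
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # binarize with a 0.5 threshold
    y_pred = np.zeros(probs.shape)
    y_true = labels
    y_pred[np.where(probs >= 0.5)] = 1
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed from the thresholded predictions, not the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average='micro')
    # on multi-label input this is subset accuracy (all labels must match)
    accuracy = accuracy_score(y_true, y_pred)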
    # define dictionary of metrics to return
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result

def metricsForTestSet():
    predictions = trainer.predict(testData)
    preds = predictions.predictions[0] if isinstance(predictions.predictions, tuple) else predictions.predictions
    labels = predictions.label_ids
    testMetrics = multi_label_metrics(predictions=preds, labels=labels)
    print(testMetrics)

# metric
metricName = 'f1'

# training arguments
trainArgs = TrainingArguments(
    'tfm_oputput',
    report_to='none',  # deactivate wandb  reports. Alternative -> TensorBoard
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=batchSize,
    per_device_eval_batch_size=batchSize,
    num_train_epochs=3, # 3 epochs
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metricName)

trainer = Trainer(
    model=model,
    args=trainArgs,
    train_dataset=trainData,
    eval_dataset=valData,
    compute_metrics = compute_metrics,
    #data_collator = Data_Processing(),
)

# W&B reporting is disabled (report_to='none'), so no API key is needed here.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Move the model to the correct device before training.
model.to(device)

# epoch 0 - baseline
print('Begin epoch train session - trainer evaluate')
trainer.evaluate()

print('Metrics for test set (before train)')
metricsForTestSet()

# training
print('Epoch train.')
trainer.train()
print('Epoch train done.')

print('End epoch train session - trainer evaluate')
trainer.evaluate()

print('Metrics for test set (after train)')
metricsForTestSet()

print('End epoch train session')



Begin epoch train session - trainer evaluate


Metrics for test set (before train)
{'f1': 0.7846710290179187, 'roc_auc': 0.8739954214164557, 'accuracy': 0.36992094134951276}
Epoch train.


| Epoch | Training Loss | Validation Loss | Model Preparation Time | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0027 | 0.014422 | 0.0015 | 0.773243 | 0.862745 | 0.333946 |
| 2 | 0.0027 | 0.014209 | 0.0015 | 0.778159 | 0.869101 | 0.348106 |
| 3 | 0.0055 | 0.013577 | 0.0015 | 0.783571 | 0.875154 | 0.35822 |


Epoch train done.
End epoch train session - trainer evaluate


Metrics for test set (after train)
{'f1': 0.7842010182295943, 'roc_auc': 0.876986260975051, 'accuracy': 0.3669792241220813}
End epoch train session
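
The gap between F1 (about 0.78) and accuracy (about 0.37) in the output above is expected: on multi-label input, scikit-learn's `accuracy_score` is subset accuracy, so a document counts as correct only if every one of its labels is predicted exactly, while micro-F1 gives credit for each individual label. A pure-Python sketch of both metrics on hypothetical toy data:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over all labels and samples."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(int(list(t) == list(p)) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


# Two documents, three labels; the first prediction misses one label.
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 0]]

print(micro_f1(toy_true, toy_pred))         # 0.8
print(subset_accuracy(toy_true, toy_pred))  # 0.5
```

One missed label costs the first document its entire accuracy credit but only lowers recall slightly in the micro-F1.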


In [None]:
# https://stackoverflow.com/questions/42703500/how-do-i-save-a-trained-model-in-pytorch
import shutil
from datetime import datetime

prefixDate = datetime.today().strftime('%Y%m%d')
fileName = f'{prefixDate}_tfm_model.pt'
modelFullPath = os.path.join(baseDir,fileName)
drivePath = '/content/drive/MyDrive/TFM-MUECIM'
destFullPath = os.path.join(drivePath,fileName)
print('Save model after epoch train session')
torch.save(model, modelFullPath)
shutil.copyfile(modelFullPath, destFullPath)



Save model after epoch train session


'/content/drive/MyDrive/TFM-MUECIM/20250126_tfm_model.pt'
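
The cell above pickles the whole model object with `torch.save(model, ...)`, which ties the file to the class definitions available when it is loaded. A commonly used alternative, sketched here on a hypothetical stand-in model rather than the notebook's DistilBERT, is to save only the `state_dict` and load it into a freshly constructed model.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for the fine-tuned model.
toy_model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, 'toy_weights.pt')

    # Save only the parameters, not the pickled model object.
    torch.save(toy_model.state_dict(), weights_path)

    # Rebuild the architecture, then load the saved parameters into it.
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(weights_path, map_location='cpu'))

same_weights = torch.equal(toy_model.weight, restored.weight)
print(same_weights)  # True
```

For a Hugging Face model, `save_pretrained` and `from_pretrained` serve the same purpose and also store the configuration.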