<a href="https://colab.research.google.com/github/claudelepere/ML_GitHub/blob/main/Copy_of_Copy_of_Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os

current_dir = os.getcwd()

print(f"Current directory: {current_dir}")


# My fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of those work in the same way: they add a **linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels)**, indicating the unnormalized scores for a number of labels for every example in the batch.
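
To make the shape claim concrete, here is a minimal pure-Python sketch of such a head. It is not the Transformers implementation: the sizes, the random "hidden" vectors, and the helper name `linear_head` are assumptions chosen only for illustration.

```python
import random

def linear_head(hidden, weight, bias):
    """Apply y = W h + b to every example in the batch."""
    return [
        [sum(w * h for w, h in zip(row, example)) + b for row, b in zip(weight, bias)]
        for example in hidden
    ]

random.seed(0)
batch_size, hidden_size, num_labels = 2, 4, 3   # toy sizes (assumed, not BERT's)

# Stand-in for the base model's pooled output: one vector per example.
hidden = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(batch_size)]
# Randomly initialized head: one weight row and one bias per label.
weight = [[random.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(hidden, weight, bias)
print(len(logits), len(logits[0]))   # batch_size, num_labels
```

Each row of `logits` holds one unnormalized score per label, which is the layout the rest of the notebook relies on.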



## Set-up environment

First, we install the libraries which we'll use: HuggingFace Transformers and Datasets.

In [5]:
!pip install -q transformers datasets


## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

Note that you can also easily load your local data (i.e. csv files, txt files, Parquet files, JSON, ...) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files).



In [13]:
dir_1000_125_125 = False
dir_128_18_54    = False

from google.colab import files
uploaded = files.upload()         # upload datasetHF_128_18_54.zip or datasetHF_1000_125_125.zip, and skills.csv

!ls -la


Current directory: /content


total 240
drwxr-xr-x 1 root root   4096 Nov 15 16:31 .
drwxr-xr-x 1 root root   4096 Nov 15 16:00 ..
drwxr-xr-x 4 root root   4096 Nov 12 14:24 .config
drwxr-xr-x 5 root root   4096 Nov 15 16:31 datasetHF_128_18_54
-rw-r--r-- 1 root root 213003 Nov 15 16:31 datasetHF_128_18_54.zip
drwxr-xr-x 1 root root   4096 Nov 12 14:25 sample_data
-rw-r--r-- 1 root root   7805 Nov 15 16:31 skills.csv


In [7]:
### Unzip the datasetHF zip file

if os.path.isfile("datasetHF_1000_125_125.zip"):
    print("datasetHF_1000_125_125.zip exists")
    !unzip datasetHF_1000_125_125.zip -d datasetHF_1000_125_125
    dir_1000_125_125 = True
elif os.path.isfile("datasetHF_128_18_54.zip"):
    print("datasetHF_128_18_54.zip exists")
    !unzip datasetHF_128_18_54.zip -d datasetHF_128_18_54
    dir_128_18_54 = True
else:
    print("Neither datasetHF_1000_125_125.zip nor datasetHF_128_18_54.zip exists")


datasetHF_128_18_54.zip exists
Archive:  datasetHF_128_18_54.zip
   creating: datasetHF_128_18_54/test/
   creating: datasetHF_128_18_54/train/
   creating: datasetHF_128_18_54/validation/
  inflating: datasetHF_128_18_54/dataset_dict.json  
  inflating: datasetHF_128_18_54/test/data-00000-of-00001.arrow  
  inflating: datasetHF_128_18_54/test/dataset_info.json  
  inflating: datasetHF_128_18_54/test/state.json  
  inflating: datasetHF_128_18_54/train/data-00000-of-00001.arrow  
  inflating: datasetHF_128_18_54/train/dataset_info.json  
  inflating: datasetHF_128_18_54/train/state.json  
  inflating: datasetHF_128_18_54/validation/data-00000-of-00001.arrow  
  inflating: datasetHF_128_18_54/validation/dataset_info.json  
  inflating: datasetHF_128_18_54/validation/state.json  


In [8]:
### dataset

from datasets import DatasetDict

if dir_1000_125_125:
    dataset = DatasetDict.load_from_disk('datasetHF_1000_125_125')
elif dir_128_18_54:
    dataset = DatasetDict.load_from_disk('datasetHF_128_18_54')
else:
    print("Neither dir datasetHF_1000_125_125 nor datasetHF_128_18_54 exists")


As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

In [9]:
print(f"dataset: {type(dataset)} {dataset.shape}\n{dataset}")


dataset: <class 'datasets.dataset_dict.DatasetDict'> {'train': (128, 44), 'validation': (18, 44), 'test': (54, 44)}
DatasetDict({
    train: Dataset({
        features: ['id', 'text', '394', '142', '146', '147', '148', '149', '150', '151', '408', '409', '153', '154', '155', '156', '157', '158', '160', '152', '162', '165', '167', '168', '169', '170', '171', '685', '174', '686', '176', '689', '173', '356', '360', '361', '362', '364', '760', '756', '758', '375', '376', '761'],
        num_rows: 128
    })
    validation: Dataset({
        features: ['id', 'text', '394', '142', '146', '147', '148', '149', '150', '151', '408', '409', '153', '154', '155', '156', '157', '158', '160', '152', '162', '165', '167', '168', '169', '170', '171', '685', '174', '686', '176', '689', '173', '356', '360', '361', '362', '364', '760', '756', '758', '375', '376', '761'],
        num_rows: 18
    })
    test: Dataset({
        features: ['id', 'text', '394', '142', '146', '147', '148', '149', '150', '151', '

Let's test the first example of the training split:

In [10]:
example = dataset['train'][0]

print(f"example: {type(example)} {example.keys()}\n{example}")


example: <class 'dict'> dict_keys(['id', 'text', '394', '142', '146', '147', '148', '149', '150', '151', '408', '409', '153', '154', '155', '156', '157', '158', '160', '152', '162', '165', '167', '168', '169', '170', '171', '685', '174', '686', '176', '689', '173', '356', '360', '361', '362', '364', '760', '756', '758', '375', '376', '761'])
{'id': 140409, 'text': "Inetum-Realdolmen - Azure Cloud Engineer Azure Cloud Inetum-Realdolmen YOUR FUNCTION To support the exponential growth of our Azure practice, we are looking for several Azure Cloud Engineers. Here's how you'll make impact: For larger projects you work in team with other Azure Cloud Engineers, Cloud Solution Architects and Project Managers to write a new success story. We can count on you for the professional implementation of the tasks entrusted to you. For smaller cloud projects you are in charge for the full engagement: you advise your client from the design until his solution is fully operational. In addition to consultan

The dataset consists of texts, labeled with one or more skills.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [11]:
### if dataset 1000_125_125, 48 labels
### if dataset 128_18_54   , 42 labels

labels = [label for label in dataset['train'].features.keys() if label not in ['id', 'text']]
labels.sort()
print(f"labels: {type(labels)} {len(labels)}\n{labels}")

id2label = {idx:label for idx, label in enumerate(labels)}
print(f"id2label: {type(id2label)} {len(id2label)}\n{id2label}")

label2id = {label:idx for idx, label in enumerate(labels)}
print(f"label2id: {type(label2id)} {len(label2id)}\n{label2id}")

labels: <class 'list'> 42
['142', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '160', '162', '165', '167', '168', '169', '170', '171', '173', '174', '176', '356', '360', '361', '362', '364', '375', '376', '394', '408', '409', '685', '686', '689', '756', '758', '760', '761']
id2label: <class 'dict'> 42
{0: '142', 1: '146', 2: '147', 3: '148', 4: '149', 5: '150', 6: '151', 7: '152', 8: '153', 9: '154', 10: '155', 11: '156', 12: '157', 13: '158', 14: '160', 15: '162', 16: '165', 17: '167', 18: '168', 19: '169', 20: '170', 21: '171', 22: '173', 23: '174', 24: '176', 25: '356', 26: '360', 27: '361', 28: '362', 29: '364', 30: '375', 31: '376', 32: '394', 33: '408', 34: '409', 35: '685', 36: '686', 37: '689', 38: '756', 39: '758', 40: '760', 41: '761'}
label2id: <class 'dict'> 42
{'142': 0, '146': 1, '147': 2, '148': 3, '149': 4, '150': 5, '151': 6, '152': 7, '153': 8, '154': 9, '155': 10, '156': 11, '157': 12, '158': 13, '160': 14, '162': 15, '16

In [12]:
### dataset of filtered skills, only those in labels

import pandas as pd

skill_df          = pd.read_csv("skills.csv")
skill_df['Id']    = skill_df['Id'].astype(str)
skill_df['Value'] = skill_df['Value'].astype(str)
filtered_skill_df = skill_df[skill_df['Id'].isin(labels)]

print(f"filtered_skill_df: {type(filtered_skill_df)} {filtered_skill_df.shape}\n{filtered_skill_df}")


filtered_skill_df: <class 'pandas.core.frame.DataFrame'> (42, 3)
     Id  SkillTypeId                             Value
0   142            7    Developer / Analyst Programmer
2   146            7  Application / Solution Architect
3   147            7          Infrastructure Architect
4   148            7                 Technical Analyst
5   149            7                Functional Analyst
6   150            7        Test / Validation Engineer
7   151            7         Test / Validation Manager
8   152            7                  Technical Writer
9   153            7                Database Developer
10  154            7            Database Administrator
11  155            7                Database Architect
12  156            7                Helpdesk / Support
13  157            7                          Operator
14  158            7      Field / Maintenance Engineer
15  160            7   System Engineer / Administrator
16  162            7                 Security Engineer


## Preprocess data

As models like BERT don't expect text as direct input, but rather **`input_ids`**, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.
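
As a rough illustration of what truncation and `max_length` padding produce, the sketch below uses a toy whitespace "tokenizer". The vocabulary and the helper `toy_encode` are hypothetical; the real BERT tokenizer uses WordPiece and adds `[CLS]`/`[SEP]` tokens, which are omitted here.

```python
def toy_encode(text, vocab, max_length):
    """Whitespace 'tokenizer' that truncates, then pads to max_length."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]
    ids = ids[:max_length]                  # mirrors truncation=True
    mask = [1] * len(ids)                   # 1 = real token
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * pad,   # mirrors padding='max_length'
        "attention_mask": mask + [0] * pad,          # 0 = padding, ignored by attention
    }

toy_vocab = {"[PAD]": 0, "[UNK]": 1, "azure": 2, "cloud": 3, "engineer": 4}
encoded = toy_encode("Azure Cloud Engineer wanted", toy_vocab, max_length=6)
print(encoded)
```

Every encoded example ends up with the same length, and the attention mask tells the model which positions are padding.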

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a **matrix of shape (batch_size, num_labels)**. Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's **BCEWithLogitsLoss** (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
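
The label matrix can be sketched without any tensor library. The example below assumes a column-oriented batch like the one `map(batched=True)` passes in; the toy texts, the three label columns, and the helper `build_label_matrix` are made up for illustration.

```python
def build_label_matrix(batch, label_names):
    """Return one row per example and one float column per label."""
    num_examples = len(batch["text"])
    return [
        [1.0 if batch[name][i] else 0.0 for name in label_names]
        for i in range(num_examples)
    ]

# Toy batch: each key maps to a list with one entry per example.
toy_batch = {
    "text": ["Azure cloud engineer", "Java developer"],
    "142": [False, True],
    "147": [True, False],
    "160": [True, False],
}
toy_labels = ["142", "147", "160"]

matrix = build_label_matrix(toy_batch, toy_labels)
print(matrix)  # [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
```

The values are floats (`1.0`/`0.0`) rather than integers, which is the dtype the loss function expects.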

In [None]:
from transformers import AutoTokenizer
import numpy as np
import torch    # needed by preprocess_data below (torch.zeros, torch.tensor)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

### preprocess function: examples, not example, because batched=True => examples is a batch
def preprocess_data(examples, indices):
  text = examples['text']    # Batch of texts

  encoding = tokenizer(        # Tokenize text
      text,
      truncation=True,
      padding='max_length',
      max_length=512,
      return_tensors='pt'      # Return PyTorch tensors
  )

  # Create an empty label matrix
  labels_matrix = torch.zeros((len(text), len(labels)), dtype=torch.float32)

  #print(f"labels_matrix: {type(labels_matrix)} {labels_matrix.shape}")

  # Populate label matrix
  for idx, label in enumerate(labels):

    #print(f"idx:{idx} label:{label}")

    if label in examples:
      labels_matrix[:, idx] = torch.tensor(
          [1.0 if val else 0.0 for val in examples[label]],
          dtype=torch.float32
      )

  #print(f"labels_matrix: {type(labels_matrix)} {labels_matrix.shape}")

  # Add labels to the encoding
  encoding['labels'] = labels_matrix

  #print(f"encoding['labels']: {encoding['labels']}")

  return encoding


In [None]:
encoded_dataset = dataset.map(
    preprocess_data,
    batched=True,
    remove_columns=dataset['train'].column_names,
    with_indices=True
)

In [None]:
example = encoded_dataset['validation'][0]

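# Before set_format("torch") the labels come back as a plain Python list, so use len() instead of .shape
print(f"example['labels']:  {type(example['labels'])} {len(example['labels'])}\n{example['labels']}")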


In [None]:
example = encoded_dataset['validation'][0]

print(f"example.keys(): {example.keys()}")
print(f"example['input_ids']: {example['input_ids']}")
print(f"example['token_type_ids']: {example['token_type_ids']}")
print(f"example['attention_mask']: {example['attention_mask']}")
print(f"example['labels']: {example['labels']}")


In [None]:
tokenizer.decode(example['input_ids'])


In [None]:
example['labels']


In [None]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]


Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [None]:
encoded_dataset.set_format("torch")    # Ensures correctness and compatibility with PyTorch pipelines


## Define model

Here we define a **model that loads a pre-trained base (i.e. the weights from bert-base-uncased) with a randomly initialized classification head (linear layer) on top**. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [**BCEWithLogitsLoss**](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.
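
For intuition, the loss can be written out by hand. The sketch below is a pure-Python re-implementation of binary cross-entropy on logits with mean reduction (the default for `BCEWithLogitsLoss`); it is not the PyTorch code itself, and the toy logits and targets are assumptions.

```python
import math

def bce_with_logits(logits, targets):
    """Mean over all cells of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    total, count = 0.0, 0
    for z_row, y_row in zip(logits, targets):
        for z, y in zip(z_row, y_row):
            # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

logits = [[0.0, 0.0, 0.0]]     # an undecided model: sigmoid(0) = 0.5 for every label
targets = [[1.0, 0.0, 1.0]]    # float multi-hot targets
loss = bce_with_logits(logits, targets)
print(round(loss, 4))  # 0.6931, i.e. ln(2)
```

Each label is scored as its own binary problem, which is why several labels can be active for the same text.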

In [None]:
### device

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"device: {device}")


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things:

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below we specify, for example, that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).
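
It helps to know how many optimizer steps these hyperparameters imply. The arithmetic below uses the 128-row training split and the batch size and epoch count set in the cells that follow; it assumes a single device and no gradient accumulation.

```python
import math

num_train_examples = 128   # rows in the `train` split of datasetHF_128_18_54
batch_size = 8             # per_device_train_batch_size
num_train_epochs = 5

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 16 80
```

Under these assumptions, `logging_steps = 50` would log the running training loss only once during the run, at step 50, so a smaller value may be more informative for a dataset this small.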

In [None]:
batch_size  = 8
metric_name = "f1"

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir                  = r'C:\tmp\BERT_results\output',
    overwrite_output_dir        = True,
    logging_dir                 = r'C:\tmp\BERT_results\logs',
    logging_steps               = 50,
    save_steps                  = 100,
    save_total_limit            = 2,
    eval_strategy               = "epoch",
    save_strategy               = "epoch",
    learning_rate               = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size  = batch_size,
    num_train_epochs            = 5,
    weight_decay                = 0.01,
    load_best_model_at_end      = True,
    metric_for_best_model       = metric_name,
    #push_to_hub                 = True,
    run_name                    = "BERT-multilabel-lr2e5-epochs5-datasetHF_128_18_54"
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.
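
As a small worked example of the `'micro'` averaging used for the F1 score, the sketch below pools true positives, false positives, and false negatives over every (example, label) cell. It is a hand-rolled illustration, not scikit-learn's implementation, and the toy arrays are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (example, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
score = micro_f1(y_true, y_pred)   # tp=2, fp=1, fn=1
print(round(score, 3))  # 0.667
```

Because the counts are pooled, frequent labels weigh more than rare ones in this score.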

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, average_precision_score, accuracy_score
from transformers import EvalPrediction

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.2):
    _average = 'micro'    # 'micro' or 'weighted'

    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs   = sigmoid(torch.Tensor(predictions))

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1

    # finally, compute metrics
    y_true               = labels
    f1                   = f1_score               (y_true=y_true, y_pred=y_pred, average=_average)    #, zero_division=1)
    precision            = precision_score        (y_true=y_true, y_pred=y_pred, average=_average)    #, zero_division=1)
    recall               = recall_score           (y_true=y_true, y_pred=y_pred, average=_average)    #, zero_division=1)
    roc_auc              = roc_auc_score          (y_true=y_true, y_score=probs, average=_average)
    precision_recall_auc = average_precision_score(y_true=y_true, y_score=probs, average=_average)
    accuracy             = accuracy_score         (y_true=y_true, y_pred=y_pred)

    # return as dictionary
    metrics = {'f1'                  : f1,
               'precision'           : precision,
               'recall'              : recall,
               'roc_auc'             : roc_auc,
               'precision_recall_auc': precision_recall_auc,
               'accuracy'            : accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result

Let's verify a batch as well as a forward pass:

In [None]:
encoded_dataset['train'][0]['labels'].type()

In [None]:
encoded_dataset['train']['input_ids'][0]

In [None]:
#forward pass
print(f"input_ids:      {type(encoded_dataset['train']['input_ids'][0])}      {encoded_dataset['train']['input_ids'][0].shape}")
print(f"attention_mask: {type(encoded_dataset['train']['attention_mask'][0])} {encoded_dataset['train']['attention_mask'][0].shape}")
print(f"labels:         {type(encoded_dataset['train'][0]['labels'])}         {encoded_dataset['train'][0]['labels'].shape}")

outputs = model(input_ids      = encoded_dataset['train']['input_ids'][0].unsqueeze(0),
                attention_mask = encoded_dataset['train']['attention_mask'][0].unsqueeze(0),
                labels         = encoded_dataset['train'][0]['labels'].unsqueeze(0)
          )
outputs

Let's start training!

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
train_output = trainer.train()

# trainer.train() returns a TrainOutput named tuple with exactly three fields
print(f"train_output.global_step: {type(train_output.global_step)} {train_output.global_step}")          # Total training steps
print(f"train_output.training_loss: {type(train_output.training_loss)} {train_output.training_loss}")    # Average training loss over the run
print(f"train_output.metrics: {type(train_output.metrics)} {train_output.metrics}")                      # Training metrics

# Everything else lives on the Trainer or its TrainerState, not on TrainOutput
print(f"trainer.state.epoch: {type(trainer.state.epoch)} {trainer.state.epoch}")                         # Current epoch
print(f"trainer.state.log_history: {type(trainer.state.log_history)} {trainer.state.log_history}")       # Log history
print(f"trainer.args: {type(trainer.args)} {trainer.args}")                                              # Training arguments
print(f"trainer.optimizer: {type(trainer.optimizer)} {trainer.optimizer}")                               # Optimizer
print(f"trainer.lr_scheduler: {type(trainer.lr_scheduler)} {trainer.lr_scheduler}")                      # Learning rate scheduler


## Evaluate

After training, we evaluate our model on the validation set.

In [None]:
eval_results = trainer.evaluate()

In [None]:
raise Exception("STOP")

## Inference

Let's test the model on a new sentence:

id: 323697
"Voor een klant van Talencia ben ik opzoek naar een Senior Full Stack Developer (Java & Angular) Job beschrijving Als Developer zal je een bestaand team toevoegen en meewerken aan de buitbouw van webapplicaties op Azure. Dit is om bestaande applicaties te vervangen die end-of-live zijn. Het project is al in volle realisatie. Profiel Zeer goede kennis van Java en Angular Goede kennis van Azure DevOps, AKS,.. is een grote pluspunt Kennis van Docker/ SQL/ OAuth/PWA/ RESTful API is vereist Taal: Nederlands met kennis van Engels Extra informatie Teamspeler met ervaring in Agile methodiek is vereist. Als je meer informatie wilt en dit klinkt interessant voor u, aarzel dan niet om uw meest recente CV door te sturen. Het kan zijn dat ik niet beschik over uw meest recente CV en dat ik daarom u deze opportuniteit doorstuur dat niet geschikt is voor u. Als u iemand kent dat deze missie interessant zou vinden mag u deze vacature doorsturen. Met vriendelijke groeten,"

['142', '147', '149', '154', '156', '157', '173', '409', '685', '689']

---

id: 323611,"Atcon Global - Project Management Officer / PMO team management Atcon Global For one of our clients, we are looking for an experienced Project Management Officer (PMO) / Project Manager (PM) for permanent employment in the Flanders region. Your role? As a PMO, you will play a crucial role in setting up and improving our project management processes. You will not only be responsible for developing PM standards, but also for carrying out projects independently as a Project Manager. Your duties and responsibilities will include: Developing PMO and project management standards Executing and managing complex digital projects Oversee project progress and report to senior management Follow-up of project budgets, project selection, capacity planning and resource management Coaching and training project managers Identifying and managing project risks Promote continuous improvement in the project management domain Collaborate with stakeholders and external partners Who are we looking for? Bachelor's or master's degree 5+ years in a similar role in a dynamic organization Expertise in project management methods (Agile, Scrum, Lean, Kanban) Strong analytical and problem-solving skills Excellent communication and stakeholder management Experience in team management with clear objectives Proactive, Hands-on mentality and result-oriented Fluent in Dutch and English; French is a plus What's on offer? A dynamic and varied role in a growing, ambitious and innovative company Numerous opportunities for personal growth and career development A competitive salary with customizable benefits A friendly, collegial working atmosphere Flexible working hours, possibility to work from home","171,170,794,800,798,797,138,139,352"
---


In [None]:
#text = "Voor een klant van Talencia ben ik opzoek naar een Senior Full Stack Developer (Java & Angular) Job beschrijving Als Developer zal je een bestaand team toevoegen en meewerken aan de buitbouw van webapplicaties op Azure. Dit is om bestaande applicaties te vervangen die end-of-live zijn. Het project is al in volle realisatie. Profiel Zeer goede kennis van Java en Angular Goede kennis van Azure DevOps, AKS,.. is een grote pluspunt Kennis van Docker/ SQL/ OAuth/PWA/ RESTful API is vereist Taal: Nederlands met kennis van Engels Extra informatie Teamspeler met ervaring in Agile methodiek is vereist. Als je meer informatie wilt en dit klinkt interessant voor u, aarzel dan niet om uw meest recente CV door te sturen. Het kan zijn dat ik niet beschik over uw meest recente CV en dat ik daarom u deze opportuniteit doorstuur dat niet geschikt is voor u. Als u iemand kent dat deze missie interessant zou vinden mag u deze vacature doorsturen. Met vriendelijke groeten"
#text = "Atcon Global - Project Management Officer / PMO team management Atcon Global For one of our clients, we are looking for an experienced Project Management Officer (PMO) / Project Manager (PM) for permanent employment in the Flanders region. Your role? As a PMO, you will play a crucial role in setting up and improving our project management processes. You will not only be responsible for developing PM standards, but also for carrying out projects independently as a Project Manager. Your duties and responsibilities will include: Developing PMO and project management standards Executing and managing complex digital projects Oversee project progress and report to senior management Follow-up of project budgets, project selection, capacity planning and resource management Coaching and training project managers Identifying and managing project risks Promote continuous improvement in the project management domain Collaborate with stakeholders and external partners Who are we looking for? Bachelor's or master's degree 5+ years in a similar role in a dynamic organization Expertise in project management methods (Agile, Scrum, Lean, Kanban) Strong analytical and problem-solving skills Excellent communication and stakeholder management Experience in team management with clear objectives Proactive, Hands-on mentality and result-oriented Fluent in Dutch and English; French is a plus What's on offer? A dynamic and varied role in a growing, ambitious and innovative company Numerous opportunities for personal growth and career development A competitive salary with customizable benefits A friendly, collegial working atmosphere Flexible working hours, possibility to work from home"
#encoding = tokenizer(text, return_tensors="pt")
#encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

#outputs = trainer.model(**encoding)

The logits that come out of the model are of shape (batch_size, num_labels). As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits are a tensor that contains the (unnormalized) scores for every individual label.

In [None]:
#logits = outputs.logits
#logits.shape

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, such that every score is turned into a number between 0 and 1, that can be interpreted as a "probability" for how certain the model is that a given class belongs to the input text.

Next, we use a threshold to turn every probability into either a 1 (which means we predict the label for the given example) or a 0 (which means we don't predict the label for the given example). A typical threshold is 0.5; this notebook uses 0.2, the same value as the default in `multi_label_metrics`.
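
The two steps can be sketched in plain Python. The toy logits, the three-entry `toy_id2label` mapping, and the helper `predict_labels` are assumptions for illustration; the notebook's own cell does the same thing on the model's logits.

```python
import math

def predict_labels(logits, id2label, threshold=0.5):
    """Apply an element-wise sigmoid, then keep labels whose probability >= threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

toy_id2label = {0: "142", 1: "146", 2: "147"}
toy_logits = [2.0, -1.0, -3.0]   # sigmoid gives roughly 0.881, 0.269, 0.047

print(predict_labels(toy_logits, toy_id2label, threshold=0.5))  # ['142']
print(predict_labels(toy_logits, toy_id2label, threshold=0.2))  # ['142', '146']
```

Lowering the threshold predicts more labels per text, which tends to raise recall at the cost of precision.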

In [None]:
# apply sigmoid + threshold
#import torch

#sigmoid = torch.nn.Sigmoid()
#probs = sigmoid(logits.squeeze().cpu())
#predictions = np.zeros(probs.shape)
#predictions[np.where(probs >= 0.2)] = 1
# turn predicted id's into actual label names
#predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
#print(predicted_labels)

#raise Exception("STOP")

**id**: 323697

**MySQL**: "142,189,190,754,208,794,676,811,812,139,138" (only 142="Developer / Analyst Programmer" is a 7-skill)

**predicted_labels**: ['148', '152', '154', '409']: all are 7-skills: 148="Technical Analyst", 152="Technical Writer", 154="Database Administrator"


---


**id**: 323611

**MySQL**: "171,170,794,800,798,797,138,139,352"            
           171: Project Mgmt Officer (PMO)  
           170: Project Manager / Coordinator

**predicted labels**: 409:  
                      409: "SOA Specialist" (SOA: Service Oriented Architecture)

In [None]:
trainer.save_model("skills_model")    # Save locally the trained model and tokenizer: saves the model weights, the tokenizer, the model configuration file ("config.json")

import json

with open("training_metrics.json", 'w') as f:
    json.dump(trainer.state.log_history,f)

with open("eval_metrics.json", 'w') as f:
    json.dump(eval_results, f)


In [None]:
!pip install transformers huggingface_hub

from huggingface_hub import notebook_login

notebook_login()

In [None]:
from huggingface_hub import create_repo, HfApi
from huggingface_hub.utils import RepositoryNotFoundError

repo_id = 'claudelepere/skills_model'
api     = HfApi()
try:
    # Check whether the repo already exists on the Hub
    api.repo_info(repo_id)
    print(f"Repo {repo_id} already exists.")
except RepositoryNotFoundError:
    create_repo(repo_id, private=True)
    print(f"Repo {repo_id} created successfully as a private repo.")

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("skills_model")
tokenizer = AutoTokenizer.from_pretrained("skills_model")

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

In [None]:
raise Exception("STOP")


In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the trained model and tokenizer
model     = AutoModelForSequenceClassification.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

In [None]:
#text      = "Sample text for prediction"
#text = "Voor een klant van Talencia ben ik opzoek naar een Senior Full Stack Developer (Java & Angular) Job beschrijving Als Developer zal je een bestaand team toevoegen en meewerken aan de buitbouw van webapplicaties op Azure. Dit is om bestaande applicaties te vervangen die end-of-live zijn. Het project is al in volle realisatie. Profiel Zeer goede kennis van Java en Angular Goede kennis van Azure DevOps, AKS,.. is een grote pluspunt Kennis van Docker/ SQL/ OAuth/PWA/ RESTful API is vereist Taal: Nederlands met kennis van Engels Extra informatie Teamspeler met ervaring in Agile methodiek is vereist. Als je meer informatie wilt en dit klinkt interessant voor u, aarzel dan niet om uw meest recente CV door te sturen. Het kan zijn dat ik niet beschik over uw meest recente CV en dat ik daarom u deze opportuniteit doorstuur dat niet geschikt is voor u. Als u iemand kent dat deze missie interessant zou vinden mag u deze vacature doorsturen. Met vriendelijke groeten"
text = "Atcon Global - Project Management Officer / PMO team management Atcon Global For one of our clients, we are looking for an experienced Project Management Officer (PMO) / Project Manager (PM) for permanent employment in the Flanders region. Your role? As a PMO, you will play a crucial role in setting up and improving our project management processes. You will not only be responsible for developing PM standards, but also for carrying out projects independently as a Project Manager. Your duties and responsibilities will include: Developing PMO and project management standards Executing and managing complex digital projects Oversee project progress and report to senior management Follow-up of project budgets, project selection, capacity planning and resource management Coaching and training project managers Identifying and managing project risks Promote continuous improvement in the project management domain Collaborate with stakeholders and external partners Who are we looking for? Bachelor's or master's degree 5+ years in a similar role in a dynamic organization Expertise in project management methods (Agile, Scrum, Lean, Kanban) Strong analytical and problem-solving skills Excellent communication and stakeholder management Experience in team management with clear objectives Proactive, Hands-on mentality and result-oriented Fluent in Dutch and English; French is a plus What's on offer? A dynamic and varied role in a growing, ambitious and innovative company Numerous opportunities for personal growth and career development A competitive salary with customizable benefits A friendly, collegial working atmosphere Flexible working hours, possibility to work from home"
#text = "Vivid Resourcing - Chief Technology Officer CTO, reliability, business goals Vivid Resourcing We're partnered with a leading sustainability-oriented company near Brussels, aiming to combat high pollution rates worldwide. They are currently working on a unique application that rewards workers for reducing their carbon footprint, whilst also maintaining and even improving profits. Together we are seeking a visionary Chief Technology Officer (CTO) who aligns with their mission and ambitions. The ideal candidate will possess a strong hands-on technical background, proven management experience, and strong business acumen. This role requires a strategic thinker who can drive technological direction and support the company's growth objectives of transitioning from a scale-up to an established business entity, so any past experience leading teams in this manner would go a long way. Key responsibilities Develop and execute the company's technological vision and strategy Lead and mentor a team of engineers and technologists Oversee all technical aspects of the company, ensuring alignment with business goals Drive innovation in regenerative sustainable technologies and carbon measurement systems Collaborate with cross-functional teams to integrate technology solutions Ensure the reliability, security, and scalability of technological infrastructures Foster a culture of continuous improvement and technical excellence Qualifications Experience leading a team within a small to medium sized company Strong technical background in software development/data analytics/system architecture Bachelor's or Master's degree in either an IT or Business related field Experience in the agriculture or environmental sectors is a plus Proven management skills with the ability to lead, communicate and inspire a diverse team Excellent business acumen and strategic thinking Strong problem-solving skills and the ability to make informed decisions in a fast-paced environment Offer Taking charge of a genuinely impactful product, using your direction for the good of the environment Complete responsibility over a technical team, with management responsibilities Up to 110,000 EUR gross for experienced applicants, which can then be increased further Full benefits package including mobility costs Flexible hybrid work Inclusive work environment If this role interests you, attach a CV and apply today!"
inputs    = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
threshold = 0.5

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.sigmoid(logits)
    #predictions = torch.where(probs >= threshold, torch.ones_like(probs), torch.zeros_like(probs))
    #predictions = torch.argmax(probs, dim=-1) if model.config.num_labels > 1 else torch.where(probs >= threshold, torch.ones_like(probs), torch.zeros_like(probs))
    preds = (probs > threshold).int()
    print(f"probs: {probs} preds: {preds}")
    print()
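    # Print one line per label, showing that label's own logit rather than the whole logits tensor
    for label, value, logit, prob, pred in zip(filtered_skill_df['Id'], filtered_skill_df['Value'], logits.squeeze(), probs.squeeze(), preds.squeeze()):
        #if (pred == 1):
        print(f"label: {label} logit: {logit.item():.4f} prob: {prob.item():.4f} pred: {int(pred.item())} {value}")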


from google.colab import drive
drive.mount('/content/drive')

# Copy files to your Google Drive
!cp -r my_model /content/drive/MyDrive/
!cp training_metrics.json /content/drive/MyDrive/
!cp eval_metrics.json /content/drive/MyDrive/

In [None]:
from google.colab import files

#!zip -r my_model.zip my_model
#!split -b 100M my_model.zip my_model_part_
#!unzip - my_model.zip

# for part in ['my_model_part_aa', 'my_model_part_ab', 'my_model_part_ac']:  # Adjust based on number of parts
#    files.download(part)
#files.download("my_model_part_aa")
#files.download("my_model_part_ab")
#files.download("my_model_part_ac")
#files.download("my_model_part_ad")
#!md5sum my_model.zip
#files.download("training_metrics.json")
#files.download("eval_metrics.json")

#uploaded = files.upload()
#!md5sum my_model.zip
#!md5sum training_metrics.json
#!md5sum eval_metrics.json

#!unzip my_model.zip -d my_model_unzip

#raise Exception("STOP")

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the trained model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("/content/my_model_unzip/my_model")
tokenizer = AutoTokenizer.from_pretrained("/content/my_model_unzip/my_model")

import json
with open("/content/training_metrics.json", 'r') as f:
    training_metrics = json.load(f)
with open("/content/eval_metrics.json", 'r') as f:
    eval_metrics = json.load(f)



In [None]:
from transformers import TrainingArguments, Trainer

#text = "Voor een klant van Talencia ben ik opzoek naar een Senior Full Stack Developer (Java & Angular) Job beschrijving Als Developer zal je een bestaand team toevoegen en meewerken aan de buitbouw van webapplicaties op Azure. Dit is om bestaande applicaties te vervangen die end-of-live zijn. Het project is al in volle realisatie. Profiel Zeer goede kennis van Java en Angular Goede kennis van Azure DevOps, AKS,.. is een grote pluspunt Kennis van Docker/ SQL/ OAuth/PWA/ RESTful API is vereist Taal: Nederlands met kennis van Engels Extra informatie Teamspeler met ervaring in Agile methodiek is vereist. Als je meer informatie wilt en dit klinkt interessant voor u, aarzel dan niet om uw meest recente CV door te sturen. Het kan zijn dat ik niet beschik over uw meest recente CV en dat ik daarom u deze opportuniteit doorstuur dat niet geschikt is voor u. Als u iemand kent dat deze missie interessant zou vinden mag u deze vacature doorsturen. Met vriendelijke groeten"
text = "Atcon Global - Project Management Officer / PMO team management Atcon Global For one of our clients, we are looking for an experienced Project Management Officer (PMO) / Project Manager (PM) for permanent employment in the Flanders region. Your role? As a PMO, you will play a crucial role in setting up and improving our project management processes. You will not only be responsible for developing PM standards, but also for carrying out projects independently as a Project Manager. Your duties and responsibilities will include: Developing PMO and project management standards Executing and managing complex digital projects Oversee project progress and report to senior management Follow-up of project budgets, project selection, capacity planning and resource management Coaching and training project managers Identifying and managing project risks Promote continuous improvement in the project management domain Collaborate with stakeholders and external partners Who are we looking for? Bachelor's or master's degree 5+ years in a similar role in a dynamic organization Expertise in project management methods (Agile, Scrum, Lean, Kanban) Strong analytical and problem-solving skills Excellent communication and stakeholder management Experience in team management with clear objectives Proactive, Hands-on mentality and result-oriented Fluent in Dutch and English; French is a plus What's on offer? A dynamic and varied role in a growing, ambitious and innovative company Numerous opportunities for personal growth and career development A competitive salary with customizable benefits A friendly, collegial working atmosphere Flexible working hours, possibility to work from home"
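# Truncate to 512 tokens (as in the earlier inference cell) so long texts do not exceed the model's maximum input length
encoding = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)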

# Define the device based on availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move the model to the device
model.to(device)
# Move encoding to the device of the model
encoding = {k: v.to(device) for k,v in encoding.items()}

# Perform inference
with torch.no_grad():    # no gradients needed for inference. Forward pass
    outputs = model(**encoding)

# Get logits from the model's output
logits = outputs.logits

# Multi-label classification: apply a sigmoid to every logit independently
# (a softmax would force the scores to sum to 1, which only suits single-label tasks)
if model.config.num_labels == 1:
    probs = torch.sigmoid(logits.squeeze())
else:
    probs = torch.sigmoid(logits)



# To get predictions
threshold = 0.5
#predictions = torch.where(probs >= threshold, torch.ones_like(probs), torch.zeros_like(probs))
#predictions = torch.argmax(probs, dim=-1) if model.config.num_labels > 1 else torch.where(probs >= threshold, torch.ones_like(probs), torch.zeros_like(probs))
predictions = (probs > threshold).float()
print("Predictions:", predictions)
print()

# Print the per-label probabilities
print("Probabilities:", probs)

#[id2label[idx] for idx, label in enumerate(predictions['labels']) if label == 1.0]

#predicted_labels = [id2label[idx.item()] for idx in predictions]
#print(predicted_labels)

for label, prob, pred in zip(labels, probs.squeeze(), predictions.squeeze()):
    print(f"Label: {label} Probability: {prob.item():.4f} Prediction: {int(pred.item())}")

