<a href="https://colab.research.google.com/github/claudelepere/ML_GitHub/blob/main/My_Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#### Current directory

import os

current_dir = os.getcwd()
print(f"current_dir: {current_dir}")


current_dir: /content


# My fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of those work in the same way: they add a **linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels)**, indicating the unnormalized scores for a number of labels for every example in the batch.
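To make that output shape concrete, here is a minimal plain-Python sketch of such a head with toy sizes. The names `linear_head`, `pooled`, `weight`, and `bias` are illustrative assumptions, not part of the Transformers API; in BERT's case the actual head is a `torch.nn.Linear` layer applied to the pooled output of the base model.

```python
def linear_head(pooled, weight, bias):
    """Compute logits = pooled @ weight.T + bias for a batch of vectors.

    pooled: batch_size vectors of length hidden_size
    weight: num_labels vectors of length hidden_size
    bias:   num_labels floats
    """
    return [
        [sum(x * w for x, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in pooled
    ]

# Toy sizes (assumed for illustration): batch_size=2, hidden_size=3, num_labels=4
pooled = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0, -1.0]

logits = linear_head(pooled, weight, bias)
print(len(logits), len(logits[0]))  # batch_size, num_labels
print(logits)                       # one unnormalized score per label and example
```

Each row of `logits` belongs to one example and each column to one label, which is the (batch_size, num_labels) layout described above.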



## Set-up environment

First, we install the libraries which we'll use: HuggingFace Transformers and Datasets.

In [2]:
!pip install -q transformers datasets



## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

Note that you can also easily load your local data (i.e. csv files, txt files, Parquet files, JSON, ...) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files).



In [6]:
#### Upload datasetHF_128_18_54.zip or datasetHF_1000_125_125.zip

dir_1000_125_125 = False
dir_128_18_54    = False

from google.colab import files
uploaded = files.upload()
!ls -la


Saving datasetHF_7_1000_125_125.zip to datasetHF_7_1000_125_125.zip
total 1220
drwxr-xr-x 1 root root    4096 Nov 20 10:21  .
drwxr-xr-x 1 root root    4096 Nov 20 10:15  ..
drwxr-xr-x 4 root root    4096 Nov 18 14:23  .config
-rw-r--r-- 1 root root 1222831 Nov 20 10:21  datasetHF_7_1000_125_125.zip
drwxr-xr-x 1 root root    4096 Nov 18 14:23  sample_data
-rw-r--r-- 1 root root    7805 Nov 20 10:20 'skills - Copy.csv'


In [8]:
#### Unzip

if os.path.isfile("datasetHF_1000_125_125.zip"):
    print("datasetHF_1000_125_125.zip exists")
    !unzip datasetHF_1000_125_125.zip -d datasetHF_1000_125_125
    dir_1000_125_125 = True
elif os.path.isfile("datasetHF_128_18_54.zip"):
    print("datasetHF_128_18_54.zip exists")
    !unzip datasetHF_128_18_54.zip -d datasetHF_128_18_54
    dir_128_18_54 = True
else:
    print("Neither datasetHF_1000_125_125.zip nor datasetHF_128_18_54.zip exists")


datasetHF_1000_125_125.zip exists
Archive:  datasetHF_1000_125_125.zip
   creating: datasetHF_1000_125_125/test/
   creating: datasetHF_1000_125_125/train/
   creating: datasetHF_1000_125_125/validation/
  inflating: datasetHF_1000_125_125/dataset_dict.json  
  inflating: datasetHF_1000_125_125/test/data-00000-of-00001.arrow  
  inflating: datasetHF_1000_125_125/test/dataset_info.json  
  inflating: datasetHF_1000_125_125/test/state.json  
  inflating: datasetHF_1000_125_125/train/data-00000-of-00001.arrow  
  inflating: datasetHF_1000_125_125/train/dataset_info.json  
  inflating: datasetHF_1000_125_125/train/state.json  
  inflating: datasetHF_1000_125_125/validation/data-00000-of-00001.arrow  
  inflating: datasetHF_1000_125_125/validation/dataset_info.json  
  inflating: datasetHF_1000_125_125/validation/state.json  


In [9]:
### dataset

from datasets import DatasetDict

if dir_1000_125_125:
    dataset = DatasetDict.load_from_disk('datasetHF_1000_125_125')
elif dir_128_18_54:
    dataset = DatasetDict.load_from_disk('datasetHF_128_18_54')
else:
    print("Neither datasetHF_1000_125_125 dir nor datasetHF_128_18_54 dir exists")


As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

In [None]:
print(f"dataset: {type(dataset)} {dataset.shape}\n{dataset}")


dataset: <class 'datasets.dataset_dict.DatasetDict'> {'train': (1000, 50), 'validation': (125, 50), 'test': (125, 50)}
DatasetDict({
    train: Dataset({
        features: ['id', 'text', '394', '142', '143', '146', '147', '148', '149', '150', '151', '408', '409', '153', '154', '155', '156', '157', '158', '160', '152', '162', '667', '165', '167', '168', '169', '170', '171', '685', '174', '686', '176', '689', '173', '175', '356', '360', '361', '362', '364', '760', '371', '756', '373', '758', '375', '376', '761', '757'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['id', 'text', '394', '142', '143', '146', '147', '148', '149', '150', '151', '408', '409', '153', '154', '155', '156', '157', '158', '160', '152', '162', '667', '165', '167', '168', '169', '170', '171', '685', '174', '686', '176', '689', '173', '175', '356', '360', '361', '362', '364', '760', '371', '756', '373', '758', '375', '376', '761', '757'],
        num_rows: 125
    })
    test: Dataset({
 

Let's look at the first example of the training split:

In [10]:
example = dataset['train'][0]
print(f"example: {type(example)} {example.keys()}\n{example}")


example: <class 'dict'> dict_keys(['id', 'text', '394', '142', '143', '146', '147', '148', '149', '150', '151', '408', '409', '153', '154', '155', '156', '157', '158', '160', '152', '162', '667', '165', '167', '168', '169', '170', '171', '685', '174', '686', '176', '689', '173', '175', '356', '360', '361', '362', '364', '760', '371', '756', '373', '758', '375', '376', '761', '757'])
{'id': 161181, 'text': "Argenta - Product Owner Netwerk en Security Cloud, Werkplek, Netwerk, Security Argenta Je gaat aan de slag op het hoofdkantoor van Argenta binnen de afdeling Infrastructure & Operations Services (iOS). Deze directie staat in voor de centrale en overkoepelende aansturing van de technische implementatie, de opvolging en het beheer van de infrastructuur (private en public Cloud), werkplek- en netwerk & security componenten. Hiermee ondersteunen we onze applicatie voor de kantoren, HQ medewerkers en onze klanten via digitale kanalen. Als Product Owner Netwerk & Security ben je verantwoor

In [11]:
####
# labels
#     if dataset 1000_125_125, 48 labels
#     if dataset 128_18_54   , 42 labels
# id2label
# label2id

labels = [label for label in dataset['train'].features.keys() if label not in ['id', 'text']]
labels.sort()
print(f"labels: {type(labels)} {len(labels)}\n{labels}")

id2label = {idx:label for idx, label in enumerate(labels)}
print(f"id2label: {type(id2label)} {len(id2label)}\n{id2label}")

label2id = {label:idx for idx, label in enumerate(labels)}
print(f"label2id: {type(label2id)} {len(label2id)}\n{label2id}")

labels: <class 'list'> 48
['142', '143', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '160', '162', '165', '167', '168', '169', '170', '171', '173', '174', '175', '176', '356', '360', '361', '362', '364', '371', '373', '375', '376', '394', '408', '409', '667', '685', '686', '689', '756', '757', '758', '760', '761']
id2label: <class 'dict'> 48
{0: '142', 1: '143', 2: '146', 3: '147', 4: '148', 5: '149', 6: '150', 7: '151', 8: '152', 9: '153', 10: '154', 11: '155', 12: '156', 13: '157', 14: '158', 15: '160', 16: '162', 17: '165', 18: '167', 19: '168', 20: '169', 21: '170', 22: '171', 23: '173', 24: '174', 25: '175', 26: '176', 27: '356', 28: '360', 29: '361', 30: '362', 31: '364', 32: '371', 33: '373', 34: '375', 35: '376', 36: '394', 37: '408', 38: '409', 39: '667', 40: '685', 41: '686', 42: '689', 43: '756', 44: '757', 45: '758', 46: '760', 47: '761'}
label2id: <class 'dict'> 48
{'142': 0, '143': 1, '146': 2, '147': 3, '148': 4, '149': 5, '

In [13]:
!pip install huggingface_hub

from google.colab import userdata

HF_TOKEN = userdata.get('HF_TOKEN')

from huggingface_hub import login

login(token=HF_TOKEN, add_to_git_credential=True)



In [14]:
#### Create the labels repo

from huggingface_hub import create_repo, HfApi
from huggingface_hub.utils import RepositoryNotFoundError

repo_id_labels = "claudelepere/skills_labels"
api            = HfApi()
try:
    api.repo_info(repo_id_labels, repo_type="dataset")
    print(f"repo_id_labels: {repo_id_labels}")
except RepositoryNotFoundError:
    create_repo(repo_id   = repo_id_labels,
                repo_type = "dataset",
                private   = True
               )
    print(f"Repo {repo_id_labels} created successfully as a private repo.")


repo_id_labels: claudelepere/skills_labels


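The dataset consists of texts, each labeled with one or more skills.

The sorted label list and the two dictionaries that map labels to integers and back were built above; here we save the list as JSON and upload it to the labels repo so that it can be reloaded at inference time.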
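As a small illustration of how the label list and the two mappings fit together, the sketch below uses a hypothetical three-label subset of the skill IDs. The helpers `to_multi_hot` and `from_multi_hot` are assumed names for this example only.

```python
# Hypothetical three-label subset of the skill IDs used in this notebook
labels = sorted(["394", "142", "147"])

id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

def to_multi_hot(active_labels):
    """Turn a collection of label names into a float multi-hot vector."""
    vec = [0.0] * len(labels)
    for name in active_labels:
        vec[label2id[name]] = 1.0
    return vec

def from_multi_hot(vec):
    """Recover the label names whose position is set to 1.0."""
    return [id2label[i] for i, v in enumerate(vec) if v == 1.0]

vec = to_multi_hot({"147", "394"})
print(vec)                  # [0.0, 1.0, 1.0]
print(from_multi_hot(vec))  # ['147', '394']
```

Note that the labels are strings, so the sort is lexicographic. It matches numeric order here only because every ID has three digits.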

In [15]:
#### Upload labels to repo

import json

labels_path = "/content/labels.json"
with open(labels_path, 'w') as f:
    json.dump(labels, f)
print(f"labels saved to {labels_path}")

from huggingface_hub import upload_file
repo_labels_path  = "labels.json"
upload_file(path_or_fileobj = labels_path,
            path_in_repo    = repo_labels_path,
            repo_id         = repo_id_labels,
            repo_type       = "dataset"
           )
print(f"labels uploaded to https://huggingface.co/datasets/{repo_id_labels}/tree/main/{repo_labels_path}")


No files have been modified since last commit. Skipping to prevent empty commit.


labels saved to /content/labels.json
labels uploaded to https://huggingface.co/datasets/claudelepere/skills_labels/tree/main/labels.json


## Preprocess data

As models like BERT don't expect text as direct input, but rather **`input_ids`**, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a **matrix of shape (batch_size, num_labels)**. Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's **BCEWithLogitsLoss** (which the model will use) will raise an error, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
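To show why the targets are treated as floats, here is a plain-Python sketch of binary cross-entropy computed from raw logits. It follows the formula documented for `torch.nn.BCEWithLogitsLoss` with mean reduction but is not PyTorch's implementation; `bce_with_logits` is an assumed name and the logits are made up.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z)
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

targets = [0.0, 1.0, 0.0, 1.0]  # float multi-hot labels for one example

undecided = bce_with_logits([0.0, 0.0, 0.0, 0.0], targets)
confident = bce_with_logits([-5.0, 5.0, -5.0, 5.0], targets)
print(undecided)  # equals ln(2), about 0.693
print(confident)  # much smaller, since every logit points the right way
```

The targets enter the formula as multipliers of the logits, which is why they are floats. A model whose logits are all near zero scores about ln 2 ≈ 0.693, which is close to the loss of 0.6787 printed by the forward pass of the untrained head later in this notebook.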

In [16]:
#### device

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")


device: cpu


In [17]:
#### Preprocess (examples, not example, because batched=True => examples is a batch)

from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples, indices):
  text = examples['text']    # Batch of texts

  encoding = tokenizer(text,                             # Tokenize text
                       truncation     = True,
                       padding        = 'max_length',
                       max_length     = 512,
                       return_tensors = 'pt'             # Return PyTorch tensors
                      )

  # Create an empty label matrix
  labels_matrix = torch.zeros((len(text), len(labels)), dtype=torch.float32)
  #print(f"labels_matrix: {type(labels_matrix)} {labels_matrix.shape}")

  # Populate label matrix
  for idx, label in enumerate(labels):
    #print(f"idx:{idx} label:{label}")
    if label in examples:
      labels_matrix[:, idx] = torch.tensor([1.0 if val else 0.0 for val in examples[label]],
                                           dtype=torch.float32
                                          )
  #print(f"labels_matrix: {type(labels_matrix)} {labels_matrix.shape}")

  # Add labels to the encoding
  encoding['labels'] = labels_matrix
  #print(f"encoding['labels']: {encoding['labels']}")

  return encoding



In [18]:
#### encoded_dataset

encoded_dataset = dataset.map(preprocess_data,
                              batched      = True,
                              remove_columns=dataset['train'].column_names,
                              with_indices = True
                              )


In [19]:
example = encoded_dataset['validation'][0]
print(f"example['labels']:  {type(example['labels'])} {len(example['labels'])}\n{example['labels']}")
print(f"example.keys(): {example.keys()}")
print(f"example['input_ids']: {example['input_ids']}")
print(f"example['token_type_ids']: {example['token_type_ids']}")
print(f"example['attention_mask']: {example['attention_mask']}")
print(f"example['labels']: {example['labels']}")


example['labels']:  <class 'list'> 48
[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
example.keys(): dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
example['input_ids']: [101, 16941, 18098, 8528, 2483, 1011, 4944, 2063, 3026, 4372, 16941, 3366, 10841, 17625, 4944, 2063, 1010, 16941, 3366, 10841, 17625, 1010, 9152, 2015, 1016, 16941, 10975, 8528, 2483, 6412, 4241, 2695, 2063, 2053, 2271, 28667, 5886, 24561, 2015, 4895, 4944, 2063, 3026, 4372, 16941, 3366, 10841, 17625, 10364, 25261, 16200, 10289, 1041, 15549, 5051, 1040, 18279, 4328, 4226, 3802, 19817, 12462, 10484, 2099, 7505, 4078, 4013, 15759, 2015, 2139, 10819, 9496, 2618, 12367, 10450, 4226, 2139, 9026, 4372, 6299, 27390, 2063, 1012, 4372, 9092, 2102, 24209, 1005, 4944, 2063, 3026, 4372, 16941, 3366, 10841, 17625, 101

In [20]:
tokenizer.decode(example['input_ids'])


"[CLS] cyberpraxis - architecte senior en cybersecurite architecte, cybersecurite, nis 2 cyber praxis description du poste nous recherchons un architecte senior en cybersecurite pour rejoindre notre equipe dynamique et travailler sur des projets de securite informatique de grande envergure. en tant qu ' architecte senior en cybersecurite, vous serez responsable de concevoir, developper et mettre en œuvre des solutions de securite robustes pour proteger nos systemes et donnees sensibles contre les cybermenaces. responsabilites concevoir et developper des architectures de securite adaptees a nos besoins specifiques, en tenant compte des meilleures pratiques et des dernieres tendances en matiere de cybersecurite. evaluer les risques de securite et developper des strategies d ' attenuation efficaces pour proteger nos infrastructures et donnees sensibles. collaborer avec les equipes internes pour integrer la securite des la conception des systemes et des applications. mener des evaluations 

In [21]:
example['labels']


[0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

In [22]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]


['147', '160', '162']

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [24]:
encoded_dataset.set_format("torch")    # Ensures correctness and compatibility with PyTorch pipelines


## Define model

Here we define a **model that loads a pre-trained base (i.e. the weights from bert-base-uncased) with a randomly initialized classification head (linear layer) on top**. This head should be fine-tuned, together with the pre-trained base, on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [**BCEWithLogitsLoss**](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [25]:
#### model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id
                                                          )



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things:

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, for example, we specify that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training and evaluation, the number of epochs to train for, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).
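The batch size and epoch count together determine how many optimizer steps the run takes. The arithmetic below is a back-of-the-envelope sketch using the values this notebook sets in the following cells; the single-device and no-gradient-accumulation settings are assumptions.

```python
import math

num_train_examples = 1000  # size of the train split shown earlier
batch_size = 8             # per_device_train_batch_size
num_train_epochs = 5
num_devices = 1            # assumption: a single CPU or GPU
grad_accum_steps = 1       # assumption: no gradient accumulation

# One optimizer step consumes batch_size * num_devices * grad_accum_steps examples
steps_per_epoch = math.ceil(
    num_train_examples / (batch_size * num_devices * grad_accum_steps)
)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 125 625
```

The total of 625 agrees with the `global_step` reported after training further down.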

In [26]:
batch_size  = 8
metric_name = "f1"

In [27]:
#### TrainingArguments

from transformers import TrainingArguments, Trainer

args = TrainingArguments(output_dir                  = "/content/results/output",
                         overwrite_output_dir        = True,
                         logging_dir                 = "/content/results/logs",
                         logging_steps               = 50,
                         save_steps                  = 100,
                         save_total_limit            = 2,
                         eval_strategy               = "epoch",
                         save_strategy               = "epoch",
                         learning_rate               = 2e-5,
                         per_device_train_batch_size = batch_size,
                         per_device_eval_batch_size  = batch_size,
                         num_train_epochs            = 5,
                         weight_decay                = 0.01,
                         load_best_model_at_end      = True,
                         metric_for_best_model       = metric_name,
                         #push_to_hub                 = True,
                         run_name                   = "BERT-multilabel-lr2e5-epochs5-datasetHF_128_18_54"
                        )


We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [28]:
#### Metrics

from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, average_precision_score, accuracy_score
from transformers import EvalPrediction

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.2):
    _average = 'micro'    # 'micro' or 'weighted'

    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs   = sigmoid(torch.Tensor(predictions))

    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1

    # finally, compute metrics
    y_true               = labels
    f1                   = f1_score               (y_true=y_true, y_pred=y_pred, average=_average)    #, zero_division=1)
    precision            = precision_score        (y_true=y_true, y_pred=y_pred, average=_average)    #, zero_division=1)
    recall               = recall_score           (y_true=y_true, y_pred=y_pred, average=_average)    #, zero_division=1)
    roc_auc              = roc_auc_score          (y_true=y_true, y_score=probs, average=_average)
    precision_recall_auc = average_precision_score(y_true=y_true, y_score=probs, average=_average)
    accuracy             = accuracy_score         (y_true=y_true, y_pred=y_pred)

    # return as dictionary
    metrics = {'f1'                  : f1,
               'precision'           : precision,
               'recall'              : recall,
               'roc_auc'             : roc_auc,
               'precision_recall_auc': precision_recall_auc,
               'accuracy'            : accuracy
              }
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(predictions = preds,
                                 labels      = p.label_ids
                                )
    return result
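To see what the micro-averaged scores and the accuracy above measure, here is a hand-computed sketch on a made-up 3-example, 3-label case. The helper names `micro_scores` and `subset_accuracy` are assumptions for this example; the notebook itself relies on scikit-learn.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall and F1: pool every (example, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and thresholded predictions
y_true = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = subset_accuracy(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
print(accuracy)               # one of three rows matches exactly
```

For multi-label input, scikit-learn's `accuracy_score` is this exact-match (subset) accuracy, so it can stay low even when most individual labels are predicted correctly.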

Let's verify a batch as well as a forward pass:

In [29]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [30]:
encoded_dataset['train']['input_ids'][0]

tensor([  101, 23157,  2050,  1011,  4031,  3954,  5658, 29548,  4372,  3036,
         6112,  1010,  2057,  8024, 10814,  2243,  1010,  5658, 29548,  1010,
         3036, 23157,  2050, 15333, 19930,  2102,  9779,  2078,  2139, 22889,
         8490,  6728, 21770,  7570, 11253,  2094,  9126,  3406,  2953,  3158,
        23157,  2050,  8026, 10224,  2139, 21358,  9247,  2075,  6502,  1004,
         3136,  2578,  1006, 16380,  1007,  1012,  2139,  4371,  3622,  2666,
         2358, 11057,  2102,  1999, 29536,  2953,  2139,  2430,  2063,  4372,
         2058,  3683, 13699, 12260, 13629,  9779, 23808, 12228,  3158,  2139,
         6627,  8977,  5403, 10408, 10450,  2063,  1010,  2139,  6728,  6767,
         2140,  4726,  4372, 21770,  2022, 21030,  2099,  3158,  2139,  1999,
        27843,  3367,  6820,  6593,  2226,  3126,  1006,  2797,  4372,  2270,
         6112,  1007,  1010,  2057,  8024, 10814,  2243,  1011,  4372,  5658,
        29548,  1004,  3036,  6922,  2368,  1012,  7632,  2121, 

In [31]:
#### forward pass

print(f"inputids:       {type(encoded_dataset['train']['input_ids'][0])}      {encoded_dataset['train']['input_ids'][0].shape}")
print(f"attention_mask: {type(encoded_dataset['train']['attention_mask'][0])} {encoded_dataset['train']['attention_mask'][0].shape}")
print(f"labels:         {type(encoded_dataset['train'][0]['labels'])}         {encoded_dataset['train'][0]['labels'].shape}")

outputs = model(input_ids      = encoded_dataset['train']['input_ids'][0].unsqueeze(0),
                attention_mask = encoded_dataset['train']['attention_mask'][0].unsqueeze(0),
                labels         = encoded_dataset['train'][0]['labels'].unsqueeze(0)
               )
outputs


inputids:       <class 'torch.Tensor'>      torch.Size([512])
attention_mask: <class 'torch.Tensor'> torch.Size([512])
labels:         <class 'torch.Tensor'>         torch.Size([48])


SequenceClassifierOutput(loss=tensor(0.6787, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.4893, -0.5414,  0.3820, -0.3347,  0.2474,  0.1031, -0.2008, -0.0866,
         -0.0815, -0.6332,  0.1877,  0.3550, -0.3671, -0.1385, -0.0150, -0.6524,
         -0.2085,  0.7171, -0.3596,  0.0633,  0.2915, -0.1733,  1.0543, -0.5208,
         -0.1564, -0.3752, -0.2511,  0.3730, -0.4661, -0.2688,  0.4399, -0.2745,
          0.3553, -0.3878,  0.0256,  0.4179,  0.7461, -0.2472,  0.1168, -0.7905,
         -0.2479,  0.6173,  0.1469, -0.3306,  0.2063,  0.2004, -0.1865, -0.5036]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Let's start training!

In [None]:
#### trainer

trainer = Trainer(model,
                  args,
                  train_dataset=encoded_dataset["train"],
                  eval_dataset=encoded_dataset["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics
                 )



In [None]:
#### Train

train_ouput = trainer.train()
print(f"train_ouput.global_step: {type(train_ouput.global_step)} {train_ouput.global_step}")                         # Total training steps
print(f"train_ouput.training_loss: {type(train_ouput.training_loss)} {train_ouput.training_loss}")                   # Final training loss
print(f"train_ouput.metrics: {type(train_ouput.metrics)} {train_ouput.metrics}")                                     # Training metrics


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Epoch | Training Loss | Validation Loss | F1 | Precision | Recall | Roc Auc | Precision Recall Auc | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.2928 | 0.210169 | 0.144044 | 0.208 | 0.110169 | 0.704318 | 0.130727 | 0.112 |
| 2 | 0.1695 | 0.163716 | 0.144444 | 0.209677 | 0.110169 | 0.759906 | 0.154515 | 0.112 |
| 3 | 0.1597 | 0.154452 | 0.144044 | 0.208 | 0.110169 | 0.776311 | 0.159539 | 0.112 |
| 4 | 0.151 | 0.151524 | 0.144044 | 0.208 | 0.110169 | 0.781533 | 0.16245 | 0.112 |
| 5 | 0.1477 | 0.150669 | 0.144044 | 0.208 | 0.110169 | 0.782938 | 0.170439 | 0.112 |


train_ouput.global_step: <class 'int'> 625
train_ouput.training_loss: <class 'float'> 0.19866193466186524
train_ouput.metrics: <class 'dict'> {'train_runtime': 580.7211, 'train_samples_per_second': 8.61, 'train_steps_per_second': 1.076, 'total_flos': 1316098621440000.0, 'train_loss': 0.19866193466186524, 'epoch': 5.0}


AttributeError: 'TrainOutput' object has no attribute 'log_history'

## Evaluate

After training, we evaluate our model on the validation set.

In [None]:
#### Evaluate

eval_results = trainer.evaluate()

In [None]:
####
# In content, save:
#   - the trained model and the tokenizer (saves the model weights, the tokenizer, the model configuration file ("config.json"))
#   - train and evaluation metrics

import json

model_path            = "/content/skills_model"
training_metrics_path = "/content/training_metrics.json"
eval_metrics_path     = "/content/eval_metrics.json"

trainer.save_model(model_path)

with open(training_metrics_path, 'w') as f:
    json.dump(trainer.state.log_history,f)

with open(eval_metrics_path, 'w') as f:
    json.dump(eval_results, f)


In [None]:
!pip install huggingface_hub

from google.colab import userdata

HF_TOKEN = userdata.get('HF_TOKEN')

from huggingface_hub import login

login(token=HF_TOKEN, add_to_git_credential=True)


In [None]:
#### Create the model repo

from huggingface_hub import create_repo, HfApi
from huggingface_hub.utils import RepositoryNotFoundError

repo_id_model = 'claudelepere/skills_model'
api           = HfApi()
try:
    api.repo_info(repo_id_model)
    print(f"repo_id_model: {repo_id_model}")
except RepositoryNotFoundError:
    create_repo(repo_id   = repo_id_model,
                repo_type = "model",
                private   = True
               )
    print(f"Repo {repo_id_model} created successfully as a private repo")

In [None]:
#### Upload the model and the tokenizer

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model     = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.push_to_hub(repo_id_model)
tokenizer.push_to_hub(repo_id_model)


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/claudelepere/skills_model/commit/3893e74829b919cd331c0963b7c29b88234c8c76', commit_message='Upload tokenizer', commit_description='', oid='3893e74829b919cd331c0963b7c29b88234c8c76', pr_url=None, repo_url=RepoUrl('https://huggingface.co/claudelepere/skills_model', endpoint='https://huggingface.co', repo_type='model', repo_id='claudelepere/skills_model'), pr_revision=None, pr_num=None)

In [None]:
raise Exception("STOP")

Exception: STOP

## Inference

Let's test the model on a new sentence:

The logits that come out of the model are of shape (batch_size, num_labels). As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits tensor contains the (unnormalized) scores for every individual label.

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, so that every score becomes a number between 0 and 1 that can be interpreted as a "probability" of how certain the model is that a given label applies to the input text.

Next, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).
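The two steps just described can be sketched in plain Python as follows. The logits are made up, `predict_labels` is an assumed helper name, and `id2label` holds only the first four entries of the mapping built earlier.

```python
import math

def predict_labels(logits, id2label, threshold=0.5):
    """Apply a sigmoid to each logit, then keep the labels at or above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

id2label = {0: "142", 1: "143", 2: "146", 3: "147"}  # first four entries only
logits = [-0.49, 0.38, 2.0, -1.5]                    # hypothetical scores for one text

print(predict_labels(logits, id2label))                 # ['143', '146']
print(predict_labels(logits, id2label, threshold=0.2))  # ['142', '143', '146']
```

Lowering the threshold trades precision for recall. The `multi_label_metrics` function used during training applies a threshold of 0.2 rather than 0.5, so predictions made at 0.5 may be more conservative than the reported metrics suggest.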

In [None]:
### device

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")


device: cpu


In [None]:
# Download labels.json

import json
from huggingface_hub import hf_hub_download

repo_id_labels = "claudelepere/skills_labels"
repo_filename  = "labels.json"

file_path = hf_hub_download(repo_id                = repo_id_labels,
                            filename               = repo_filename,
                            repo_type              = "dataset",
                            local_dir              = "/content",
                            local_dir_use_symlinks = False
                           )
with open(file_path, 'r') as f:
    labels = json.load(f)
print(f"labels: {type(labels)} {len(labels)}\n{labels}")


For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.



labels: <class 'list'> 48
['142', '143', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '160', '162', '165', '167', '168', '169', '170', '171', '173', '174', '175', '176', '356', '360', '361', '362', '364', '371', '373', '375', '376', '394', '408', '409', '667', '685', '686', '689', '756', '757', '758', '760', '761']


In [None]:
# Upload skills.csv

from google.colab import files
uploaded = files.upload()
!ls -la


Saving skills.csv to skills.csv
total 32
drwxr-xr-x 1 root root 4096 Nov 18 08:52 .
drwxr-xr-x 1 root root 4096 Nov 18 08:52 ..
drwxr-xr-x 3 root root 4096 Nov 18 08:52 .cache
drwxr-xr-x 4 root root 4096 Nov 14 14:25 .config
-rw-r--r-- 1 root root  336 Nov 18 08:52 labels.json
drwxr-xr-x 1 root root 4096 Nov 14 14:25 sample_data
-rw-r--r-- 1 root root 7805 Nov 18 08:52 skills.csv


In [None]:
### Filtered skills (those in labels)

import pandas as pd

skill_df          = pd.read_csv("skills.csv")
skill_df['Id']    = skill_df['Id'].astype(str)
skill_df['Value'] = skill_df['Value'].astype(str)
filtered_skill_df = skill_df[skill_df['Id'].isin(labels)]
print(f"filtered_skill_df: {type(filtered_skill_df)} {filtered_skill_df.shape}\n{filtered_skill_df}")


filtered_skill_df: <class 'pandas.core.frame.DataFrame'> (48, 3)
     Id  SkillTypeId                             Value
0   142            7    Developer / Analyst Programmer
1   143            7           Graphics / Web Designer
2   146            7  Application / Solution Architect
3   147            7          Infrastructure Architect
4   148            7                 Technical Analyst
5   149            7                Functional Analyst
6   150            7        Test / Validation Engineer
7   151            7         Test / Validation Manager
8   152            7                  Technical Writer
9   153            7                Database Developer
10  154            7            Database Administrator
11  155            7                Database Architect
12  156            7                Helpdesk / Support
13  157            7                          Operator
14  158            7      Field / Maintenance Engineer
15  160            7   System Engineer / Administrator
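
The classification cell further down pairs the rows of `filtered_skill_df` with probabilities by position, which is only correct if the rows appear in the same order as the model's output labels. Assuming the model's outputs follow the order of `labels` (an assumption; the model's config is the authority here), a lookup table keyed by id avoids relying on row order. The sketch below uses a hypothetical three-row excerpt of the table above, deliberately shuffled; with the real DataFrame, `dict(zip(filtered_skill_df['Id'], filtered_skill_df['Value']))` would build the same kind of mapping.

```python
# Hypothetical excerpt of the skills table, deliberately out of order
skill_rows = [
    ("146", "Application / Solution Architect"),
    ("142", "Developer / Analyst Programmer"),
    ("143", "Graphics / Web Designer"),
]
# Assumed label order of the model's outputs (first three entries of labels.json)
label_order = ["142", "143", "146"]

# Map each skill id to its human-readable name
id_to_name = dict(skill_rows)

# Names listed in the same order as the model's output scores
names_in_label_order = [id_to_name[label] for label in label_order]
print(names_in_label_order)
```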


In [None]:
#### Model and Tokenizer from the model repo

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

repo_id_model = "claudelepere/skills_model"

# Load the trained model and tokenizer
model     = AutoModelForSequenceClassification.from_pretrained(repo_id_model)
tokenizer = AutoTokenizer.from_pretrained(repo_id_model)


config.json:   0%|          | 0.00/2.29k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

In [None]:
#### New texts to classify

#text = "I'm happy, I can finally train a model for multi-label classification."
# preds: none

#### in train or validation or test
# id = 23553
text = "Cream Consulting - Business Analyst Insurance, Property&Casualty (P&C) Cream Consulting Today we are looking for Business Analyst to extend our team specialized in Business Process Improvement. You will join a multi-lingual team focusing on P&C business. In that position, you will define and clarify the role and responsibilities. You will be responsible for leading business process reviews where you will identify and implement innovative solutions. You will define and document the new processes and systems to meet our client's business objectives. What the job is all about? Implement new solutions for P&C insurance companies Write and validate the “as is” Plan the analysis of organization's strategic business needs Understand End Users needs Develop process modeling and design Follow up projects and monitor development Organize and conduct testing What Cream is looking for? At least 5 years of experience in Business Analysis in insurance business Analysis know-how Fluent editorial Good knowledge of Property & Casualty insurance business to interface with all stakeholders Functional knowledge of insurance products Good communication skills Fluent English, good knowledge in French and/or Dutch is a plus"
# "356,168,149,795,802,137,139,138,353"; SkillTypeId 7: true: 149, 168, 356 preds: 142, 170

#### NL
# id = 323697
#text = "Talencia Consulting - Senior Full Stack Developer (Java & Angular) Full Stack, DevOps, AKS, OAuth, API Talencia Consulting Voor een klant van Talencia ben ik opzoek naar een Senior Full Stack Developer (Java & Angular) Job beschrijving Als Developer zal je een bestaand team toevoegen en meewerken aan de buitbouw van webapplicaties op Azure. Dit is om bestaande applicaties te vervangen die end-of-live zijn. Het project is al in volle realisatie. Profiel Zeer goede kennis van Java en Angular Goede kennis van Azure DevOps, AKS,.. is een grote pluspunt Kennis van Docker/ SQL/ OAuth/PWA/ RESTful API is vereist Taal: Nederlands met kennis van Engels Extra informatie Teamspeler met ervaring in Agile methodiek is vereist. Als je meer informatie wilt en dit klinkt interessant voor u, aarzel dan niet om uw meest recente CV door te sturen. Het kan zijn dat ik niet beschik over uw meest recente CV en dat ik daarom u deze opportuniteit doorstuur dat niet geschikt is voor u. Als u iemand kent dat deze missie interessant zou vinden mag u deze vacature doorsturen. Met vriendelijke groeten,"
# "142,189,190,754,208,794,676,811,812,139,138"; SkillTypeId: true: 142; preds: 142, 170

#### EN
# id = 323611
text = "Atcon Global - Project Management Officer / PMO team management Atcon Global For one of our clients, we are looking for an experienced Project Management Officer (PMO) / Project Manager (PM) for permanent employment in the Flanders region. Your role? As a PMO, you will play a crucial role in setting up and improving our project management processes. You will not only be responsible for developing PM standards, but also for carrying out projects independently as a Project Manager. Your duties and responsibilities will include: Developing PMO and project management standards Executing and managing complex digital projects Oversee project progress and report to senior management Follow-up of project budgets, project selection, capacity planning and resource management Coaching and training project managers Identifying and managing project risks Promote continuous improvement in the project management domain Collaborate with stakeholders and external partners Who are we looking for? Bachelor's or master's degree 5+ years in a similar role in a dynamic organization Expertise in project management methods (Agile, Scrum, Lean, Kanban) Strong analytical and problem-solving skills Excellent communication and stakeholder management Experience in team management with clear objectives Proactive, Hands-on mentality and result-oriented Fluent in Dutch and English; French is a plus What's on offer? A dynamic and varied role in a growing, ambitious and innovative company Numerous opportunities for personal growth and career development A competitive salary with customizable benefits A friendly, collegial working atmosphere Flexible working hours, possibility to work from home"
# "171,170,794,800,798,797,138,139,352"; SkillTypeId 7: true: 170, 171; preds: 142, 170

# id = 323526
#text = "Vivid Resourcing - Chief Technology Officer CTO, reliability, business goals Vivid Resourcing We're partnered with a leading sustainability-oriented company near Brussels, aiming to combat high pollution rates worldwide. They are currently working on a unique application that rewards workers for reducing their carbon footprint, whilst also maintaining and even improving profits. Together we are seeking a visionary Chief Technology Officer (CTO) who aligns with their mission and ambitions. The ideal candidate will possess a strong hands-on technical background, proven management experience, and strong business acumen. This role requires a strategic thinker who can drive technological direction and support the company's growth objectives of transitioning from a scale-up to an established business entity, so any past experience leading teams in this manner would go a long way. Key responsibilities Develop and execute the company's technological vision and strategy Lead and mentor a team of engineers and technologists Oversee all technical aspects of the company, ensuring alignment with business goals Drive innovation in regenerative sustainable technologies and carbon measurement systems Collaborate with cross-functional teams to integrate technology solutions Ensure the reliability, security, and scalability of technological infrastructures Foster a culture of continuous improvement and technical excellence Qualifications Experience leading a team within a small to medium sized company Strong technical background in software development/data analytics/system architecture Bachelor's or Master's degree in either an IT or Business related field Experience in the agriculture or environmental sectors is a plus Proven management skills with the ability to lead, communicate and inspire a diverse team Excellent business acumen and strategic thinking Strong problem-solving skills and the ability to make informed decisions in a fast-paced environment Offer Taking 
charge of a genuinely impactful product, using your direction for the good of the environment Complete responsibility over a technical team, with management responsibilities Up to 110,000 EUR gross for experienced applicants, which can then be increased further Full benefits package including mobility costs Flexible hybrid work Inclusive work environment If this role interests you, attach a CV and apply today!"
# "175,169,139,352"; SkillTypeId 7: true: 169, 175 preds: 142, 170

#### FR
# id = 323517
#text = "AG Insurance - Technical Architect REST API, DevOps AG Insurance Un lieu de travail où vous pouvez prendre des initiatives. Où tout le monde vous encourage à montrer ce dont vous êtes capable. Où on vous encourage à vous développer. Et où vous construisez l'avenir avec vos collègues. Vous avez de l'audace et le client est au centre de vos préoccupations ? Alors, à très bientôt chez AG ! Notre compagnie compte environ 700 spécialistes IT. Autant de collègues géniaux qui contribuent chaque jour à la (r)évolution technologique chez AG. Le département Information System (IS) est responsable du développement des applications supportant le business assurance, ainsi que des applications transversales dans les domaines suivants : In & Outbounds, Referentials & Financials, sans oublier des centres d'expertise technologique pour le développement des Front-Ends, pour la mise en place d'APIs et s... Et si c'était votre prochain job ? Nous recherchons un(e) .NET Solution Architect pour renforcer ce département. Vous proposez des solutions techniques adaptées aux besoins et aux contraintes des projets et conformes à la vision IT d'AG. En innovant et en améliorant ce qui existe déjà, vous créez également une valeur ajoutée pour AG. Pendant la phase architecturale, vous choisissez les solutions en concertation avec les autres architectes techniques et fonctionnels, et avec toutes les autres parties concernées (autres équipes, partenaires externes, collègues d'Infrastructure & Operations, etc.). Vous concevez l'architecture technique des applications sur la base des exigences fonctionnelles et non fonctionnelles. Nous comptons également sur vous pour le support technique, du développement à la production de l'application. Vous assurez le suivi de l'implémentation de vos solutions, en coopération avec les équipes de développement. Vous vous assurez ensuite que la solution mise en œuvre est conforme à l'architecture proposée. Si nécessaire, vous apportez des corrections. 
Vous accompagnez et coachez vos collègues pour les aider à monter en compétence dans le domaine. Ce poste vous intéresse ? Qu'attendez-vous pour postuler ? Nos recruteurs examineront avec vous quelle division de notre département IT et quelle équipe vous conviendront le mieux. Vous vous reconnaissez dans ce profil ? Les nouvelles technologies ? Vous en suivez de près l'évolution et les tendances en architecture IT. Vous avez une expérience solide dans les domaines suivants : Technical Architecture & Design Patterns REST API, .NET Framework & .NET 6 and up Source control tooling like Azure DevOps & GitHub Microsoft Azure Cloud SQL Server Vous êtes autonome et prenez les choses en main ? Bien sûr ! Mais vous savez remonter les informations ou demander du support si nécessaire. Team-player dans l'âme, vous partagez naturellement votre savoir-faire avec vos collègues. La qualité du travail, dans ses moindres détails, c'est votre cheval de bataille. Vous le documentez avec précision, tant pour le support technique de votre équipe que pour les utilisateurs d'autres teams. Communiquer en français et/ou en néerlandais, à l'oral et à l'écrit ? Un jeu d'enfant pour vous ! Votre anglais est à la hauteur ? Encore mieux ! Un diplôme supérieur en informatique et/ou plus de 2 ans d'expérience dans une fonction similaire ? Foncez ! Vous n'allez pas rater ça ? Un super job chez le leader sur le marché de l'assurance. Et comme nous cherchons à encore mieux servir nos clients, nous comptons sur vous pour nous y aider. Un environnement de travail moderne dans tous les sens du terme : physique, digital et organisationnel. En d'autres termes, des heures de travail flexibles, la possibilité de télétravailler jusqu'à 3 jours par semaine et le matériel IT adéquat pour travailler de la maison. Une équipe enthousiaste et dynamique, notée 10/10 pour l'ambiance et la convivialité. La possibilité de vous perfectionner en continu, grâce à un large éventail de formations. 
Idéal pour apprendre toutes les compétences qui vous aideront à faire évoluer votre carrière. Une véritable carrière. En effet, travailler chez AG, c'est bien plus qu'un job. Envie d'explorer de nouveaux horizons après quelque temps ? Nous vous guidons et vous encourageons à exploiter tous vos talents à fond. Des bonnes vibrations, grâce à un vaste programme de bien-être bourré d'activités sportives et d'ateliers inspirants. De quoi vous sentir (encore) plus épanoui(e) dans votre travail et votre vie. Un lieu où tout le monde se sent bienvenu et a des chances égales de s'épanouir. Où votre avenir n'est pas déterminé par votre origine, votre âge, votre genre, votre orientation sexuelle ou votre handicap, mais par votre talent et vos compétences. Et comme l'argent, ça compte aussi : un package salarial attrayant. Vous pouvez même composer vous-même une partie de votre package, car personne ne sait mieux que vous ce dont vous avez besoin. Lancez-vous ! Emballé(e) ? > Postulez aujourd'hui encore ! Nous nous ferons un plaisir de vous aider si vous avez besoin de soutien avant ou pendant le processus de sélection."
# "146,636,676,668,670,300,812,138,137"; SkillTypeId 7: true: 146; preds: 142, 170, 689


In [None]:
#### Classify

inputs    = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
threshold = 0.19

# Forward pass without gradient tracking (inference only)
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, num_labels)

# Convert logits to independent per-label probabilities
probs = torch.sigmoid(logits).squeeze(0)      # shape: (num_labels,)

# Convert probabilities to 0/1 predictions
preds = (probs > threshold).int()

# Pairing by position assumes the rows of filtered_skill_df follow the model's label order
rows = list(zip(filtered_skill_df['Id'], filtered_skill_df['Value'], probs.tolist(), preds.tolist()))

# Print every label first, then only the predicted ones
for label, value, prob, pred in rows:
    print(f"label: {label} prob: {prob:.4f} pred: {pred} {value}")
print()
for label, value, prob, pred in rows:
    if pred == 1:
        print(f"label: {label} prob: {prob:.4f} pred: {pred} {value}")


label: 142 prob: 0.2035 pred: 1 Developer / Analyst Programmer
label: 143 prob: 0.0469 pred: 0 Graphics / Web Designer
label: 146 prob: 0.0840 pred: 0 Application / Solution Architect
label: 147 prob: 0.0644 pred: 0 Infrastructure Architect
label: 148 prob: 0.0954 pred: 0 Technical Analyst
label: 149 prob: 0.1358 pred: 0 Functional Analyst
label: 150 prob: 0.0467 pred: 0 Test / Validation Engineer
label: 151 prob: 0.0468 pred: 0 Test / Validation Manager
label: 152 prob: 0.0470 pred: 0 Technical Writer
label: 153 prob: 0.0463 pred: 0 Database Developer
label: 154 prob: 0.0568 pred: 0 Database Administrator
label: 155 prob: 0.0775 pred: 0 Database Architect
label: 156 prob: 0.1142 pred: 0 Helpdesk / Support
label: 157 prob: 0.0573 pred: 0 Operator
label: 158 prob: 0.0830 pred: 0 Field / Maintenance Engineer
label: 160 prob: 0.1318 pred: 0 System Engineer / Administrator
label: 162 prob: 0.0630 pred: 0 Security Engineer
label: 165 prob: 0.0503 pred: 0 Network / Telecom Engineer
label: 16

In [None]:
#### END #####################################################################

In [None]:
import torch  # `model`, `tokenizer`, and `labels` are defined in the cells above

#text = "Voor een klant van Talencia ben ik opzoek naar een Senior Full Stack Developer (Java & Angular) Job beschrijving Als Developer zal je een bestaand team toevoegen en meewerken aan de buitbouw van webapplicaties op Azure. Dit is om bestaande applicaties te vervangen die end-of-live zijn. Het project is al in volle realisatie. Profiel Zeer goede kennis van Java en Angular Goede kennis van Azure DevOps, AKS,.. is een grote pluspunt Kennis van Docker/ SQL/ OAuth/PWA/ RESTful API is vereist Taal: Nederlands met kennis van Engels Extra informatie Teamspeler met ervaring in Agile methodiek is vereist. Als je meer informatie wilt en dit klinkt interessant voor u, aarzel dan niet om uw meest recente CV door te sturen. Het kan zijn dat ik niet beschik over uw meest recente CV en dat ik daarom u deze opportuniteit doorstuur dat niet geschikt is voor u. Als u iemand kent dat deze missie interessant zou vinden mag u deze vacature doorsturen. Met vriendelijke groeten"
text = "Atcon Global - Project Management Officer / PMO team management Atcon Global For one of our clients, we are looking for an experienced Project Management Officer (PMO) / Project Manager (PM) for permanent employment in the Flanders region. Your role? As a PMO, you will play a crucial role in setting up and improving our project management processes. You will not only be responsible for developing PM standards, but also for carrying out projects independently as a Project Manager. Your duties and responsibilities will include: Developing PMO and project management standards Executing and managing complex digital projects Oversee project progress and report to senior management Follow-up of project budgets, project selection, capacity planning and resource management Coaching and training project managers Identifying and managing project risks Promote continuous improvement in the project management domain Collaborate with stakeholders and external partners Who are we looking for? Bachelor's or master's degree 5+ years in a similar role in a dynamic organization Expertise in project management methods (Agile, Scrum, Lean, Kanban) Strong analytical and problem-solving skills Excellent communication and stakeholder management Experience in team management with clear objectives Proactive, Hands-on mentality and result-oriented Fluent in Dutch and English; French is a plus What's on offer? A dynamic and varied role in a growing, ambitious and innovative company Numerous opportunities for personal growth and career development A competitive salary with customizable benefits A friendly, collegial working atmosphere Flexible working hours, possibility to work from home"
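# Truncate to the model's 512-token limit; longer inputs would otherwise fail in the forward pass
encoding = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)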

# Define the device based on availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move the model to the device
model.to(device)
# Move encoding to the device of the model
encoding = {k: v.to(device) for k,v in encoding.items()}

# Perform inference
with torch.no_grad():    # no gradients needed for inference. Forward pass
    outputs = model(**encoding)

# Get logits from the model's output
logits = outputs.logits

# Multi-label classification: apply a sigmoid to every logit independently
# (a softmax would force the scores to sum to 1, which suits single-label tasks)
probs = torch.sigmoid(logits)



# To get predictions
threshold = 0.5
#predictions = torch.where(probs >= threshold, torch.ones_like(probs), torch.zeros_like(probs))
#predictions = torch.argmax(probs, dim=-1) if model.config.num_labels > 1 else torch.where(probs >= threshold, torch.ones_like(probs), torch.zeros_like(probs))
predictions = (probs > threshold).float()
print("Predictions:", predictions)
print()

# Show the raw probabilities
print("Probabilities:", probs)

# Pair every label id with its probability and its 0/1 prediction
for label, prob, pred in zip(labels, probs.squeeze(), predictions.squeeze()):
    print(f"Label: {label}: Probability: {prob.item():.4f} {int(pred.item())}")
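
To report label names rather than 0/1 flags, the predicted ids can be collected and looked up in a name table. The sketch below assumes the probabilities have already been converted to a plain list (for instance with `probs.squeeze().tolist()`); the helper `predicted_labels`, the three probabilities, and the short name table are hypothetical and chosen only for illustration.

```python
def predicted_labels(label_ids, probs, id_to_name, threshold=0.5):
    """Return (id, name, prob) for every label above threshold, most probable first."""
    hits = [
        (label_id, id_to_name.get(label_id, "<unknown>"), prob)
        for label_id, prob in zip(label_ids, probs)
        if prob > threshold
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

# Hypothetical inputs: three label ids, made-up probabilities, and a partial name table
example_ids = ["142", "149", "170"]
example_probs = [0.62, 0.14, 0.81]
example_names = {"142": "Developer / Analyst Programmer", "149": "Functional Analyst"}

for label_id, name, prob in predicted_labels(example_ids, example_probs, example_names):
    print(f"{label_id} {name}: {prob:.2f}")
```

Ids missing from the name table are reported as `<unknown>` instead of raising an error, which keeps the loop usable when the table covers only part of the label set.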
