Your task is to build a BERT-based classifier that predicts vacancy areas from their titles.

Each vacancy can have more than one area, so this is **multi-label classification**, not multiclass classification.




In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
from nltk.tokenize import word_tokenize
from string import punctuation
from tqdm import tqdm

In [2]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, RandomSampler, Dataset, SequentialSampler
import random
import transformers
import torch.optim as optim

# Try two or more different BERT-like models (different BERTs, RoBERTas, etc., or any other transformer-based model) (**2 points max**)
Your notebook should contain the training process of all your models!

In [None]:
# to start, take the base model google-bert/bert-base-uncased (an alternative model is trained below)
MODEL_NAME =  "google-bert/bert-base-uncased"
MAX_SEQ_LENGTH = 64
RESULT_MODEL_PATH = './model.pt'

In [10]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)

    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed = 12
seed_everything(seed)

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [12]:
device

device(type='cuda')

In [13]:
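# Custom punctuation set (shadows string.punctuation imported above); unlike the default it keeps '#', '+' and '.'
punctuation = set('!"$%&\'()*,-/:;<=>?@[\\]^_`{|}~')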

In [14]:
def clean(text):
    return ' '.join([token.lower() for token in word_tokenize(text) if token not in punctuation])

In [4]:
#df = pd.read_csv('./dataset_2020.csv')
df = pd.read_csv('https://raw.githubusercontent.com/zelcookie/DL_NLP_HW_3/refs/heads/main/dataset_2020.csv')
df.shape

(78909, 2)

In [5]:
df.head()

Unnamed: 0,title,area
0,Expert Java Developer (Technical Leader),programmer
1,Software Engineer (JVM Runtime),programmer
2,PHP developer,programmer
3,Backend developer,programmer
4,Backend developer,programmer


In [9]:
df['title'].apply(len).mean(), df['title'].apply(len).max()
# for the models we take max_seq_len = 64: not much larger than the mean length, but not much smaller than the maximum
# (the lengths computed here are in characters, not tokens)

(24.91454713657504, 100)

Each vacancy can have more than one area, separated by spaces.

Example:

Malware Analyst for Imunify Security,analyst it_security
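
As a minimal sketch of what a multi-label target looks like, the area string can be split on whitespace and turned into a multi-hot vector. The three-class list below is an illustrative subset, and `to_multi_hot` is a hypothetical helper, not part of the notebook (which relies on `MultiLabelBinarizer` for this step).

```python
# Illustrative subset of classes, in alphabetical order.
classes = ["analyst", "it_security", "programmer"]

def to_multi_hot(area, classes):
    """Turn a space-separated label string into a 0/1 vector over `classes`."""
    labels = set(area.split())
    return [1 if c in labels else 0 for c in classes]

print(to_multi_hot("analyst it_security", classes))  # [1, 1, 0]
print(to_multi_hot("programmer", classes))           # [0, 0, 1]
```

A multiclass target would allow exactly one `1` per row; here any number of positions can be set.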

In [17]:
df_train, df_test = train_test_split(df, train_size=0.9, random_state=42)
df_train, df_valid = train_test_split(df_train, train_size=0.8, random_state=42)
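
The two chained splits leave roughly 72% / 18% / 10% of the rows for train / validation / test. The arithmetic below is a sketch that assumes scikit-learn floors the train fraction and gives the remainder to the other part; the row count is the one reported by `df.shape`.

```python
import math

n_total = 78909  # rows reported by df.shape

# Assumption: with a float train_size, the train part is floor(n * train_size)
# and the remaining rows go to the other part.
n_train_valid = math.floor(n_total * 0.9)
n_test = n_total - n_train_valid
n_train = math.floor(n_train_valid * 0.8)
n_valid = n_train_valid - n_train

print(n_train, n_valid, n_test)  # 56814 14204 7891
print(math.ceil(n_train / 8))    # 7102 batches at batch_size = 8
```

The batch count agrees with the 7102 iterations shown in the training progress bars.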

# Finish TextClassificationDataset (**1 point max**)

In [18]:
class TextClassificationDataset(Dataset):
    def __init__(self, data, tokenizer, binarizer):
        self.data = data
        self.tokenizer = tokenizer
        # Tokenize all cleaned titles once, up front.
        sentences = [clean(sent) for sent in data.title.tolist()]
        self.encodings = tokenizer(
            sentences, truncation=True, padding=True,
            max_length=MAX_SEQ_LENGTH, return_tensors="pt"
        )
        # Each area string holds one or more space-separated labels.
        self.target = [labels.split() for labels in data.area.tolist()]
        self.binarizer = binarizer
        # Multi-hot float targets, as expected by BCEWithLogitsLoss.
        self.target_one_hot = torch.tensor(self.binarizer.transform(self.target), dtype=torch.float)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.target_one_hot[idx]
        return item

In [19]:
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
binarizer = MultiLabelBinarizer()
labels_train = [labels.split() for labels in df_train.area.tolist()]
binarizer.fit(labels_train)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [20]:
batch_size = 8

train_dataset = TextClassificationDataset(df_train, tokenizer, binarizer)
train_sampler = RandomSampler(train_dataset)
train_dataloader =  DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size,)

valid_dataset = TextClassificationDataset(df_valid, tokenizer, binarizer)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size)

test_dataset = TextClassificationDataset(df_test, tokenizer, binarizer)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)


In [21]:

class BertForMultilabel(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        # Note: BertModel matches BERT checkpoints only; other architectures need their own model class.
        self.bert = transformers.BertModel.from_pretrained(MODEL_NAME)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.dropout = nn.Dropout(0.3)  # dropout for regularization

    def train_bert(self, train_bert_flag=True):
        # Freeze (False) or unfreeze (True) all encoder weights.
        for param in self.bert.parameters():
            param.requires_grad = train_bert_flag

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.pooler_output)  # dropout on the pooled [CLS] representation
        logits = self.classifier(pooled_output)  # linear classification layer
        return logits

In [None]:
num_labels = len(binarizer.classes_)
model = BertForMultilabel(num_labels)
model.to(device);

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

# Train your classifier with frozen BERT and save the model with the lowest val loss during training (**2 points max**)

Print train/val loss after each epoch


In [None]:
def train(model, iterator, optimizer, criterion, scheduler):  # scheduler is passed in from the start, just in case
  model.train()
  epoch_loss = 0
  for batch in tqdm(iterator):
      optimizer.zero_grad()
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      logits = model(input_ids=input_ids, attention_mask=attention_mask)
      loss = criterion(logits, labels)
      loss.backward()
      optimizer.step()
      scheduler.step()
      epoch_loss += loss.item()
  return epoch_loss / len(iterator)

In [None]:
def validate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  preds_list = []
  labels_list = []
  with torch.no_grad():
      for batch in iterator:
          input_ids = batch['input_ids'].to(device)
          attention_mask = batch['attention_mask'].to(device)
          labels = batch['labels'].to(device)
          logits = model(input_ids=input_ids, attention_mask=attention_mask)
          loss = criterion(logits, labels)
          epoch_loss += loss.item()
          preds_list.extend(logits_to_labels(logits))
          labels_list.extend(labels.cpu().numpy())
  return epoch_loss / len(iterator), preds_list, labels_list

In [None]:
def logits_to_labels(logits):
    preds = nn.Sigmoid()(logits.view(-1, num_labels))
    preds = preds.to('cpu').numpy()>0.5
    return preds.tolist()
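
To make the thresholding step concrete, here is a pure-Python sketch of the same idea on a single row of logits. The function name and the sample logits are illustrative only. Since the sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is equivalent to checking whether the raw logit is positive.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logits_row_to_labels(logits_row, threshold=0.5):
    """Independent yes/no decision per class, as in multi-label prediction."""
    return [sigmoid(z) > threshold for z in logits_row]

row = [2.3, -0.7, 0.1, -4.0]  # made-up logits for four classes
print(logits_row_to_labels(row))  # [True, False, True, False]
print([z > 0 for z in row])       # same decisions without computing the sigmoid
```

Unlike a softmax with argmax, this allows zero, one, or several classes to be predicted for the same title.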

In [None]:
model.train_bert(False)

In [None]:
epochs = 5
#lr=1e-5
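criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
#optimizer = optim.Adam(model.parameters(), lr=lr)
# Note: train() calls scheduler.step() after every batch, so with step_size=1 the LR is multiplied by 0.9 per batch, not per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)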
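
Because `train()` calls `scheduler.step()` once per batch, a `StepLR` with `step_size=1` and `gamma=0.9` shrinks the learning rate geometrically with the batch index. The sketch below only reproduces that arithmetic (it does not touch PyTorch); `lr_after` is a hypothetical helper.

```python
base_lr = 2e-5
gamma = 0.9

def lr_after(num_scheduler_steps):
    """LR of a StepLR(step_size=1, gamma) schedule after the given number of step() calls."""
    return base_lr * gamma ** num_scheduler_steps

for k in (0, 10, 100, 1000):
    print(k, lr_after(k))
```

After about 100 batches the rate is already on the order of 5e-10, so most of each frozen-encoder epoch runs with an effectively zero learning rate. Setting `step_size=len(train_dataloader)` would be one way to get a per-epoch decay with the same `train()` function.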

In [None]:
best_val_loss = float('inf')
for epoch in range(epochs):
    print(f'Epoch {epoch + 1}/{epochs}')
    train_loss = train(model, train_dataloader, optimizer, criterion, scheduler)
    val_loss, val_preds, val_labels = validate(model, valid_dataloader, criterion)

    print(f"Train Loss: {train_loss:.4f} | Validation Loss: {val_loss:.4f}")

    # save the model whenever val_loss improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), RESULT_MODEL_PATH)
        print("Model saved")

Epoch 1/5


100%|██████████| 7102/7102 [01:56<00:00, 60.99it/s]


Train Loss: 0.0002 | Validation Loss: 0.0002
Model saved
Epoch 2/5


100%|██████████| 7102/7102 [01:53<00:00, 62.75it/s]


Train Loss: 0.0002 | Validation Loss: 0.0002
Epoch 3/5


100%|██████████| 7102/7102 [01:52<00:00, 62.86it/s]


Train Loss: 0.0002 | Validation Loss: 0.0002
Epoch 4/5


100%|██████████| 7102/7102 [01:53<00:00, 62.69it/s]


Train Loss: 0.0002 | Validation Loss: 0.0002
Epoch 5/5


100%|██████████| 7102/7102 [01:53<00:00, 62.75it/s]


Train Loss: 0.0002 | Validation Loss: 0.0002


In [None]:
model.load_state_dict(torch.load(RESULT_MODEL_PATH, map_location=torch.device(device)))
test_loss, test_preds, test_labels = validate(model, test_dataloader, criterion)



In [None]:
print(classification_report(binarizer.transform(test_dataset.target), test_preds,
                            target_names=binarizer.classes_))


                 precision    recall  f1-score   support

          admin       1.00      1.00      1.00        61
        analyst       1.00      1.00      1.00       302
    architector       1.00      1.00      1.00       111
      assistant       1.00      1.00      1.00        14
     consultant       1.00      1.00      1.00        23
          coord       1.00      1.00      1.00        11
  data_engineer       1.00      1.00      1.00       136
 data_scientist       1.00      1.00      1.00       154
       designer       1.00      1.00      1.00       409
devel_metodolog       1.00      1.00      1.00        44
         devops       1.00      1.00      1.00       338
       director       1.00      1.00      1.00        17
     doc_writer       1.00      1.00      1.00        18
    it_security       1.00      1.00      1.00        54
machine_learner       1.00      1.00      1.00        42
        manager       1.00      1.00      1.00       427
       networks       0.95    

Even with BERT frozen the results look good; only the networks category dips slightly.

# Train your classifier with unfrozen BERT and save the model with the lowest val loss during training (**2 points max**)

Print train/val loss after each epoch

In [None]:
epochs = 5
lr = 2e-5
WARMUP_PROPORTION = 0.1
warmup_steps = int(len(train_dataloader) * epochs * WARMUP_PROPORTION)

In [None]:
model.train_bert(True)

In [None]:
t_total = len(train_dataloader) * epochs
no_decay = ['bias', 'LayerNorm.weight'] # ToDo create a list of parameters to which weight_decay should not be applied, explain your choice in the results section
# LayerNorm.weight: I'm not sure it belongs here... but it seems justified to me
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    ]


criterion = nn.BCEWithLogitsLoss()
lr = 2e-5
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr)  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
scheduler = transformers.get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
)
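
The shape of this schedule can be sketched in plain Python. The multiplier below follows the linear warmup and linear decay rule as I understand it from the Transformers implementation; `lr_multiplier` is a hypothetical stand-in, and the step count per epoch is taken from the progress bars in this notebook.

```python
steps_per_epoch = 7102  # len(train_dataloader), as shown in the progress bars
epochs = 5
base_lr = 2e-5

t_total = steps_per_epoch * epochs
warmup_steps = int(t_total * 0.1)

def lr_multiplier(step):
    """Linear ramp from 0 to 1 during warmup, then linear decay back to 0 at t_total."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (t_total - step) / max(1, t_total - warmup_steps))

print(warmup_steps, t_total)  # 3551 35510
for step in (0, warmup_steps, t_total // 2, t_total):
    print(step, base_lr * lr_multiplier(step))
```

The peak rate of 2e-5 is reached about halfway through the first epoch, and this kind of schedule is meant to be stepped once per batch, which matches how `train()` calls `scheduler.step()`.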



In [None]:
best_val_loss = float('inf')
for epoch in range(epochs):
    print(f'Epoch {epoch + 1}/{epochs}')
    train_loss = train(model, train_dataloader, optimizer, criterion, scheduler)
    val_loss, val_preds, val_labels = validate(model, valid_dataloader, criterion)

    print(f"Train Loss: {train_loss:.4f} | Validation Loss: {val_loss:.4f}")

    # save the model whenever val_loss improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), RESULT_MODEL_PATH)
        print("Model saved")

Epoch 1/5


100%|██████████| 7102/7102 [09:19<00:00, 12.70it/s]


Train Loss: 0.0845 | Validation Loss: 0.0036
Model saved
Epoch 2/5


100%|██████████| 7102/7102 [10:01<00:00, 11.80it/s]


Train Loss: 0.0021 | Validation Loss: 0.0017
Model saved
Epoch 3/5


100%|██████████| 7102/7102 [09:24<00:00, 12.57it/s]


Train Loss: 0.0006 | Validation Loss: 0.0003
Model saved
Epoch 4/5


100%|██████████| 7102/7102 [08:57<00:00, 13.20it/s]


Train Loss: 0.0003 | Validation Loss: 0.0002
Model saved
Epoch 5/5


100%|██████████| 7102/7102 [09:03<00:00, 13.07it/s]


Train Loss: 0.0002 | Validation Loss: 0.0002
Model saved


In [None]:
model.load_state_dict(torch.load(RESULT_MODEL_PATH, map_location=torch.device(device)))
test_loss, test_preds, test_labels = validate(model, test_dataloader, criterion)



In [None]:
print(classification_report(binarizer.transform(test_dataset.target), test_preds,
                            target_names=binarizer.classes_))

                 precision    recall  f1-score   support

          admin       1.00      1.00      1.00        61
        analyst       1.00      1.00      1.00       302
    architector       1.00      1.00      1.00       111
      assistant       1.00      1.00      1.00        14
     consultant       1.00      1.00      1.00        23
          coord       1.00      1.00      1.00        11
  data_engineer       1.00      1.00      1.00       136
 data_scientist       1.00      1.00      1.00       154
       designer       1.00      1.00      1.00       409
devel_metodolog       1.00      1.00      1.00        44
         devops       1.00      1.00      1.00       338
       director       1.00      1.00      1.00        17
     doc_writer       1.00      1.00      1.00        18
    it_security       1.00      1.00      1.00        54
machine_learner       1.00      1.00      1.00        42
        manager       1.00      1.00      1.00       427
       networks       0.95    

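It looks as if nothing changed?.. One possible explanation: the frozen run above already starts at a train loss of 0.0002, a value the unfrozen run only reaches in its last epoch, so the frozen cells may have been executed on weights that had already been fine-tuned.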

### distilbert/distilbert-base-uncased

To be honest, I wanted to try microsoft/deberta-v3-small, because, as I understand it, disentangled attention (each token is encoded with two separate vectors, one for content and one for relative position) lets this model capture context and semantic relations in the data better.

However, I kept running into memory and CUDA errors, so for the second model I chose distilbert/distilbert-base-uncased, since it is more lightweight and my resources should be enough for it.

In [9]:
#MODEL_NAME =  "microsoft/deberta-v3-small"
MODEL_NAME =  "distilbert/distilbert-base-uncased"
MAX_SEQ_LENGTH = 64
RESULT_MODEL_PATH = './model.pt'

In [None]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)

    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed = 12
seed_everything(seed)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
binarizer = MultiLabelBinarizer()
labels_train = [labels.split() for labels in df_train.area.tolist()]
binarizer.fit(labels_train)


In [None]:
batch_size = 8

train_dataset = TextClassificationDataset(df_train, tokenizer, binarizer)
train_sampler = RandomSampler(train_dataset)
train_dataloader =  DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size,)

valid_dataset = TextClassificationDataset(df_valid, tokenizer, binarizer)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size)

test_dataset = TextClassificationDataset(df_test, tokenizer, binarizer)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

In [22]:
num_labels = len(binarizer.classes_)
model = BertForMultilabel(num_labels)
model.to(device);

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of BertModel were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['embeddings.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'embeddings.position_embeddings.weight', 'embeddings.token_type_embeddings.weight', 'embeddings.word_embeddings.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.0.attention.output.LayerNorm.weight', 'encoder.layer.0.attention.output.dense.bias', 'encoder.layer.0.attention.output.dense.weight', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.0.attention.self.query.bias', 'encoder.layer.0.attention.self.query.weight', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.0.attention.self.value.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.dense.bias',

In [23]:
def train(model, iterator, optimizer, criterion, scheduler):
  model.train()
  epoch_loss = 0
  for batch in tqdm(iterator):
      optimizer.zero_grad()
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      logits = model(input_ids=input_ids, attention_mask=attention_mask)
      loss = criterion(logits, labels)
      loss.backward()
      optimizer.step()
      scheduler.step()
      epoch_loss += loss.item()
  return epoch_loss / len(iterator)

def validate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  preds_list = []
  labels_list = []
  with torch.no_grad():
      for batch in iterator:
          input_ids = batch['input_ids'].to(device)
          attention_mask = batch['attention_mask'].to(device)
          labels = batch['labels'].to(device)
          logits = model(input_ids=input_ids, attention_mask=attention_mask)
          loss = criterion(logits, labels)
          epoch_loss += loss.item()
          preds_list.extend(logits_to_labels(logits))
          labels_list.extend(labels.cpu().numpy())
  return epoch_loss / len(iterator), preds_list, labels_list

def logits_to_labels(logits):
    preds = nn.Sigmoid()(logits.view(-1, num_labels))
    preds = preds.to('cpu').numpy()>0.5
    return preds.tolist()

In [None]:
model.train_bert(False)

epochs = 5
#lr=1e-5
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
#optimizer = optim.Adam(model.parameters(), lr=lr)
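# Note: train() calls scheduler.step() after every batch, so with step_size=1 the LR is multiplied by 0.9 per batch, not per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)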

In [None]:

best_val_loss = float('inf')
for epoch in range(epochs):
    print(f'Epoch {epoch + 1}/{epochs}')
    train_loss = train(model, train_dataloader, optimizer, criterion, scheduler)
    val_loss, val_preds, val_labels = validate(model, valid_dataloader, criterion)

    print(f"Train Loss: {train_loss:.4f} | Validation Loss: {val_loss:.4f}")

    # save the model whenever val_loss improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), RESULT_MODEL_PATH)
        print("Model saved")

Epoch 1/5


100%|██████████| 7102/7102 [02:15<00:00, 52.52it/s]


Train Loss: 0.6914 | Validation Loss: 0.6827
Model saved
Epoch 2/5


100%|██████████| 7102/7102 [02:06<00:00, 56.35it/s]


Train Loss: 0.6912 | Validation Loss: 0.6827
Epoch 3/5


100%|██████████| 7102/7102 [02:01<00:00, 58.52it/s]


Train Loss: 0.6913 | Validation Loss: 0.6827
Epoch 4/5


100%|██████████| 7102/7102 [02:00<00:00, 58.81it/s]


Train Loss: 0.6912 | Validation Loss: 0.6827
Epoch 5/5


100%|██████████| 7102/7102 [02:00<00:00, 58.88it/s]


Train Loss: 0.6913 | Validation Loss: 0.6827


In [None]:
model.load_state_dict(torch.load(RESULT_MODEL_PATH, map_location=torch.device(device)))
test_loss, test_preds, test_labels = validate(model, test_dataloader, criterion)



In [None]:
print(classification_report(binarizer.transform(test_dataset.target), test_preds,
                            target_names=binarizer.classes_))

                 precision    recall  f1-score   support

          admin       0.00      0.26      0.01        61
        analyst       0.04      0.82      0.07       302
    architector       0.00      0.00      0.00       111
      assistant       0.00      0.71      0.00        14
     consultant       0.00      0.00      0.00        23
          coord       0.00      0.00      0.00        11
  data_engineer       0.02      1.00      0.03       136
 data_scientist       0.02      1.00      0.04       154
       designer       0.07      0.07      0.07       409
devel_metodolog       0.00      0.00      0.00        44
         devops       0.00      0.00      0.00       338
       director       0.00      0.00      0.00        17
     doc_writer       0.00      0.00      0.00        18
    it_security       0.00      0.04      0.00        54
machine_learner       0.00      0.17      0.01        42
        manager       0.05      1.00      0.10       427
       networks       0.00    



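Precision is very low (many false positives?), but recall is fairly decent for some categories (apparently precisely because precision is so low for most categories).

Overall, the quality is very low so far. The warnings printed when the model was created suggest why: `BertModel` was instantiated from a DistilBERT checkpoint and its weights were reported as newly initialized, so the frozen encoder is probably close to random here.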
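
The loss values logged for this frozen run (about 0.69) are close to ln 2, which is what binary cross-entropy gives when every logit is near zero, i.e. when the classifier outputs a probability of about 0.5 for every class. A small worked check, using a hand-written scalar version of the loss (`bce_with_logits` below is an illustrative helper, not the PyTorch function):

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit: log(1 + e^x) - y * x (naive form, fine for small x)."""
    return math.log(1.0 + math.exp(logit)) - target * logit

print(bce_with_logits(0.0, 1.0))  # loss for a positive label at logit 0
print(bce_with_logits(0.0, 0.0))  # loss for a negative label at logit 0
print(math.log(2))                # both equal ln 2, roughly 0.693
```

A loss that stays at this level across epochs is therefore consistent with a head that has learned almost nothing.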

In [24]:
epochs = 5
lr = 2e-5

WARMUP_PROPORTION = 0.1
warmup_steps = int(len(train_dataloader) * epochs * WARMUP_PROPORTION)
model.train_bert(True)
t_total = len(train_dataloader) * epochs
no_decay = ['bias', 'LayerNorm.weight'] # ToDo create a list of parameters to which weight_decay should not be applied, explain your choice in the results section
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    ]


criterion = nn.BCEWithLogitsLoss()
lr = 2e-5
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr)  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
scheduler = transformers.get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
)
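
The `no_decay` list works by substring matching on parameter names. The sketch below applies the same two list comprehensions to a handful of hypothetical names written in the style of `model.named_parameters()`; the names are illustrative, not dumped from the actual model.

```python
no_decay = ['bias', 'LayerNorm.weight']

# Hypothetical parameter names in the style of model.named_parameters().
names = [
    'bert.embeddings.word_embeddings.weight',
    'bert.embeddings.LayerNorm.weight',
    'bert.embeddings.LayerNorm.bias',
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'classifier.weight',
    'classifier.bias',
]

decayed = [n for n in names if not any(nd in n for nd in no_decay)]
not_decayed = [n for n in names if any(nd in n for nd in no_decay)]

print(decayed)      # weight matrices and embeddings
print(not_decayed)  # biases and LayerNorm scale parameters
```

Excluding biases and LayerNorm parameters from weight decay is a common convention in BERT fine-tuning recipes: these are not weight matrices, they are few in number, and pulling a LayerNorm scale toward zero works against the normalization it is supposed to provide.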



In [25]:
best_val_loss = float('inf')
for epoch in range(epochs):
    print(f'Epoch {epoch + 1}/{epochs}')
    train_loss = train(model, train_dataloader, optimizer, criterion, scheduler)
    val_loss, val_preds, val_labels = validate(model, valid_dataloader, criterion)

    print(f"Train Loss: {train_loss:.4f} | Validation Loss: {val_loss:.4f}")

    # save the model whenever val_loss improves
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), RESULT_MODEL_PATH)
        print("Model saved")

Epoch 1/5


100%|██████████| 7102/7102 [09:12<00:00, 12.85it/s]


Train Loss: 0.0612 | Validation Loss: 0.0045
Model saved
Epoch 2/5


100%|██████████| 7102/7102 [08:57<00:00, 13.20it/s]


Train Loss: 0.0026 | Validation Loss: 0.0013
Model saved
Epoch 3/5


100%|██████████| 7102/7102 [08:57<00:00, 13.21it/s]


Train Loss: 0.0012 | Validation Loss: 0.0010
Model saved
Epoch 4/5


100%|██████████| 7102/7102 [08:58<00:00, 13.20it/s]


Train Loss: 0.0007 | Validation Loss: 0.0005
Model saved
Epoch 5/5


100%|██████████| 7102/7102 [08:59<00:00, 13.16it/s]


Train Loss: 0.0004 | Validation Loss: 0.0005
Model saved


In [26]:
model.load_state_dict(torch.load(RESULT_MODEL_PATH, map_location=torch.device(device)))
test_loss, test_preds, test_labels = validate(model, test_dataloader, criterion)



In [27]:
print(classification_report(binarizer.transform(test_dataset.target), test_preds,
                            target_names=binarizer.classes_))

                 precision    recall  f1-score   support

          admin       1.00      1.00      1.00        61
        analyst       1.00      1.00      1.00       302
    architector       1.00      1.00      1.00       111
      assistant       0.93      1.00      0.97        14
     consultant       1.00      1.00      1.00        23
          coord       1.00      1.00      1.00        11
  data_engineer       1.00      1.00      1.00       136
 data_scientist       0.99      0.99      0.99       154
       designer       1.00      1.00      1.00       409
devel_metodolog       1.00      1.00      1.00        44
         devops       1.00      1.00      1.00       338
       director       1.00      1.00      1.00        17
     doc_writer       1.00      1.00      1.00        18
    it_security       1.00      1.00      1.00        54
machine_learner       1.00      1.00      1.00        42
        manager       1.00      0.98      0.99       427
       networks       1.00    

It got much better! The distilled model is lighter than the original (although the logged epoch times are similar, about 9 minutes per unfrozen epoch for both), and with fine-tuning (unfrozen BERT) the prediction quality seems almost as high.

# Results (3 points max)

Write your conclusion

What models and what training parameters did you use?

What was the reason for your choice?

What were the results?

What metrics do you consider the most important?

Models:
The base BERT (google-bert/bert-base-uncased) and DistilBERT (distilbert/distilbert-base-uncased), so that the models could be trained with the limited resources available in Colab.



Parameters:
* epochs = 5, batch_size = 8 (largely because I do not have much GPU capacity)

* lr=2e-5 (this appears to be a common default for this kind of task)

* criterion = BCEWithLogitsLoss:

a combination of BCE and a sigmoid (for each class we estimate a membership probability between 0 and 1)

* Because the classes are highly imbalanced, and because we generally care about the balance between recall and precision, I would consider F1-score the most important metric
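
As a reminder of how F1 balances the two, here is a small sketch computing precision, recall and F1 from raw counts. The counts are hypothetical, chosen only to be consistent with a small class of support 14 in which one extra title is wrongly assigned; they are not extracted from the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: all 14 true members found, plus one false positive.
p, r, f1 = precision_recall_f1(tp=14, fp=1, fn=0)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.93 1.0 0.97
```

For a rare class, a single false positive is enough to move precision noticeably, which is why per-class F1 is more informative here than overall accuracy.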



Results:

The best result came from bert-base-uncased, and there was no visible difference between training with frozen and unfrozen BERT (F1-score = 1.00 for almost all categories; only networks dipped slightly).

DistilBERT is the lighter model, although the logged epoch times were close to BERT's. With a frozen encoder its results were very poor, but with an unfrozen encoder the classifier was almost as good as the non-distilled model.