# Fine-Tuning RuBERT for Resume-to-Vacancy Matching

## Description

This notebook presents a comprehensive guide to fine-tuning the `cointegrated/rubert-tiny2` model for the task of matching resumes to job vacancies. The primary focus is on developing a binary classifier that predicts whether a given resume matches a job vacancy based on textual content.

### Key Steps:

1. **Data Preprocessing**: The dataset, containing resumes and job vacancies, is loaded and preprocessed. Each entry is labeled as 'confirmed' (1) if the resume matches the vacancy and 'not confirmed' (0) otherwise.

2. **Feature Engineering**: Textual features from both resumes and vacancies are extracted and combined. This includes information such as job titles, descriptions, key skills, and education levels.

3. **Data Augmentation**: To enhance the model's robustness and improve its generalization capabilities, data augmentation techniques such as random deletion and swapping of words are applied to the original textual data. **Implementing augmentation resulted in increase in weighted avg accuracy from f1-score 0.4 to 0.7!**

4. **Dataset Preparation**: The augmented dataset is split into training and validation sets. A custom `DuoDataset` class is utilized to handle pairs of text data (resume and vacancy) along with their labels.

5. **Model Setup**: The `cointegrated/rubert-tiny2` model is loaded using Hugging Face's Transformers library. The model is adapted for the task by employing mean pooling over token embeddings to derive fixed-size sentence embeddings.

6. **Training**: The model is trained using a contrastive loss function, which is designed to minimize the distance between embeddings of matching resume-vacancy pairs while maximizing the distance for non-matching pairs.

7. **Evaluation**: The trained model's performance is evaluated on the validation set using cosine similarity between resume and vacancy embeddings. The similarity scores are thresholded to make binary predictions.

8. **Threshold Optimization**: Optuna, a hyperparameter optimization framework, is used to find the optimal threshold for converting cosine similarity scores into binary predictions, aiming to maximize the F1 score.

9. **Results Analysis**: The final model's performance is assessed using the optimized threshold, with metrics including F1 score, accuracy, and a detailed classification report.

This notebook provides a practical framework for semantic matching tasks, leveraging transformer models and contrastive learning. It demonstrates the entire workflow from data preparation to model evaluation and optimization, making it a valuable resource for data scientists working on similar natural language processing (NLP) tasks.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.5.0-py3-none-any.whl (413 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m413.4/413.4 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.4/233.4 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting colorlog (from optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.2-py3-none-any.whl (78 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.7/78.7 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.2 alembic-1.13.1 colorlog-6.8.2 optuna-3.5.0


In [35]:
import pandas as pd
import numpy as np
import torch
import optuna
import random
from torch import nn
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, balanced_accuracy_score
from torch.optim import Adam
from torch.utils.data import DataLoader

In [5]:
train_data = pd.read_csv('/content/drive/MyDrive/hhhack24/data/output.csv')
train_data['Target'] = train_data['Target'].apply(lambda x: 1 if x == 'confirmed' else 0)
train_data.sample(2)

Unnamed: 0,Vacancy UUID,Vacancy Name,Keywords,Description,Comment,Resume UUID,First Name,Last Name,Birth Date,Country,...,Position,Experience Description,Year,Organization,Faculty,Specialty,Result,Education Type,Education Level,Target
38,7a4813fc-43bc-3896-a607-4c8682b01002,Системный аналитик,,Уровень: СА уровня Middle+/Senior от 3х лет В...,350 000 - 400 000 гросс,d21ed030-bdd4-3b1a-b371-b6fb93d85c2a,Елизавета,Сысоев,1982-08-05,Россия,...,Ведущий системный аналитик/Руководитель/Систем...,Участие в создании решений по противодействию...,2020.0,GeekBrains,,Факультет искусственного интеллекта,Специалист,Повышение квалификации,,1
628,a5d0e1fd-7baa-3a6f-98f4-b908ac7fce43,Java-разработчик,,Только Senior до 15 грейда Проект «Трансформа...,,74bce970-26fd-3cb3-84c0-8a2b636553a1,Илья,Доронина,1996-01-01,Россия,...,Java Developer,Разработка внутренней системы безопастности Ж...,2017.0,КубГУ,Юридический,,,Основное,Высшее,0


In [6]:
vacancies_train_examples = []
resumes_train_examples = []
labels = []

for index, example in train_data.iterrows():

    vacancy_features = [
        f"Название вакансии: {example['Vacancy Name']}",
        f"Описание: {example['Description']}"
    ]
    vacancy_text = " ".join(vacancy_features)
    vacancy_text = " ".join([feature for feature in vacancy_features if feature.split(': ')[1] != 'None'])

    # Concatenating all resume fields with their Russian names for each example
    resume_features = [
        f"Дата рождения: {example['Birth Date']}",
        f"Страна: {example['Country']}",
        f"Город: {example['City']}",
        f"Ключевые навыки: {example['Key Skills']}",
        f"Должность: {example['Position']}",
        f"Описание опыта: {example['Experience Description']}",
        f"Организация: {example['Organization']}",
        f"Факультет: {example['Faculty']}",
        f"Специальность: {example['Specialty']}",
        f"Уровень образования: {example['Education Level']}"
    ]
    resume_text = " ".join([feature for feature in resume_features if feature.split(': ')[1] != 'None'])

    vacancies_train_examples.append(vacancy_text)
    resumes_train_examples.append(resume_text)
    labels.append(example['Target'])

In [7]:
len(vacancies_train_examples), len(resumes_train_examples), len(labels)

(656, 656, 656)

In [8]:
def random_deletion(sentence, p=0.5):
    """Randomly delete words from a sentence with probability p."""
    words = sentence.split()
    if len(words) == 1:
        return sentence
    remaining = [word for word in words if random.random() > p]
    if len(remaining) == 0:
        return random.choice(words)
    return ' '.join(remaining)

def random_swap(sentence, n=2):
    """Randomly swap two words in the sentence n times."""
    words = sentence.split()
    length = len(words)
    if length < 2:
        return sentence
    for _ in range(n):
        idx1, idx2 = np.random.randint(0, length, 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)

combined_features = list(zip(vacancies_train_examples, resumes_train_examples, labels))
train_features, valid_features = train_test_split(
    combined_features,
    test_size=0.2,
    random_state=42
)

# Separate back into vacancies, resumes, and labels
vacancies_train, resumes_train, labels_train = zip(*train_features)
vacancies_valid, resumes_valid, labels_valid = zip(*valid_features)

# Convert to lists for further processing
vacancies_train = list(vacancies_train)
resumes_train = list(resumes_train)
vacancies_valid = list(vacancies_valid)
resumes_valid = list(resumes_valid)

In [9]:
# combined_features = list(zip(vacancies_train_examples, resumes_train_examples))

# train_features, valid_features, train_labels, valid_labels = train_test_split(
#     combined_features,
#     labels,
#     test_size=0.2,
#     random_state=42
# )

# vacancies_train, resumes_train = zip(*train_features)
# vacancies_valid, resumes_valid = zip(*valid_features)

# vacancies_train = list(vacancies_train)
# resumes_train = list(resumes_train)
# vacancies_valid = list(vacancies_valid)
# resumes_valid = list(resumes_valid)

augmented_vacancies_train = []
augmented_resumes_train = []
augmented_labels_train = []

# Only augment training data
for vacancy_text, resume_text, label in zip(vacancies_train, resumes_train, labels_train):
    # Add original data to the augmented dataset
    augmented_vacancies_train.append(vacancy_text)
    augmented_resumes_train.append(resume_text)
    augmented_labels_train.append(label)

    # Generate and add augmented data with random deletion
    vacancy_del = random_deletion(vacancy_text, p=0.2)
    resume_del = random_deletion(resume_text, p=0.2)
    augmented_vacancies_train.append(vacancy_del)
    augmented_resumes_train.append(resume_del)
    augmented_labels_train.append(label)

    # Generate and add augmented data with random swap
    vacancy_swap = random_swap(vacancy_text, n=2)
    resume_swap = random_swap(resume_text, n=2)
    augmented_vacancies_train.append(vacancy_swap)
    augmented_resumes_train.append(resume_swap)
    augmented_labels_train.append(label)

In [10]:
class DuoDataset(Dataset):
    def __init__(self, text1, text2, labels):
        self.text1 = text1
        self.text2 = text2
        self.labels = torch.tensor(labels, dtype=torch.float32) if labels is not None else None

    def __len__(self):
        return len(self.text1)

    def __getitem__(self, idx):
        text1_sample = self.text1[idx]
        text2_sample = self.text2[idx]

        if self.labels is not None:
            label = self.labels[idx]
            return text1_sample, text2_sample, label
        else:
            # You may consider returning some placeholder for label if necessary
            return text1_sample, text2_sample, torch.tensor(0, dtype=torch.float32)

train_dataset_augmented = DuoDataset(augmented_vacancies_train, augmented_resumes_train, augmented_labels_train)
val_dataset = DuoDataset(vacancies_valid, resumes_valid, labels_valid)

train_dataloader_augmented = DataLoader(train_dataset_augmented, shuffle=True, batch_size=5)
val_dataloader = DataLoader(val_dataset, shuffle=False, batch_size=5)

print(f"Train set size (augmented): {len(train_dataset_augmented)}")
print(f"Valid set size: {len(val_dataset)}")

Train set size (augmented): 1572
Valid set size: 132


In [23]:
DEVICE = 'cuda'
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2").to(DEVICE)

EPOCHS = 25
optimizer = Adam(model.parameters(), lr=3e-7)

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

class ContrastiveLoss(nn.Module):
    def __init__(self, margin=0.3):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, y1, y2, label):
        # Calculate the cosine similarity
        cos_sim = nn.functional.cosine_similarity(y1, y2)

        # Contrastive loss calculation
        # If label is 1 (meaning y1 and y2 are similar), we want cos_sim to be 1, so we minimize (1 - cos_sim)
        # If label is 0 (meaning y1 and y2 are different), we want cos_sim to be less than margin, so we minimize max(0, cos_sim - margin)
        loss_positive = (1 - cos_sim) * label  # Loss for similar pairs
        loss_negative = (cos_sim - self.margin).clamp(min=0) * (1 - label)  # Loss for dissimilar pairs

        # Combine the losses
        loss = loss_positive + loss_negative
        return loss.mean()

criterion = ContrastiveLoss(margin=0.35)

In [24]:
best_valid_loss = float('inf')
epochs_no_improve = 0
early_stop = 10

train_losses, valid_losses = [], []

for epoch in tqdm(range(EPOCHS)):
    # Training step
    train_batch_losses = []
    model.train()
    for resume, vacancy, batch_labels in train_dataloader_augmented:
        batch_labels = batch_labels.to(DEVICE)
        resume_input = tokenizer(resume, padding=True, truncation=True, return_tensors="pt").to(DEVICE)
        vacancy_input = tokenizer(vacancy, padding=True, truncation=True, return_tensors="pt").to(DEVICE)

        resume_embeddings = model(**resume_input)
        vacancy_embeddings = model(**vacancy_input)

        resume_embeddings = mean_pooling(resume_embeddings, resume_input['attention_mask'])
        vacancy_embeddings = mean_pooling(vacancy_embeddings, vacancy_input['attention_mask'])

        resume_embeddings.requires_grad_()
        vacancy_embeddings.requires_grad_()

        loss = criterion(resume_embeddings, vacancy_embeddings, batch_labels)
        loss.backward()
        optimizer.step()
        train_batch_losses.append(loss.item())

    # Validation step
    valid_batch_losses = []
    model.eval()
    with torch.no_grad():
        for resume, vacancy, batch_labels in val_dataloader:
            batch_labels = batch_labels.to(DEVICE)
            resume_input = tokenizer(resume, padding=True, truncation=True, return_tensors="pt").to(DEVICE)
            vacancy_input = tokenizer(vacancy, padding=True, truncation=True, return_tensors="pt").to(DEVICE)

            resume_embeddings = model(**resume_input)
            vacancy_embeddings = model(**vacancy_input)

            resume_embeddings = mean_pooling(resume_embeddings, resume_input['attention_mask'])
            vacancy_embeddings = mean_pooling(vacancy_embeddings, vacancy_input['attention_mask'])

            loss = criterion(resume_embeddings, vacancy_embeddings, batch_labels)
            valid_batch_losses.append(loss.item())

    average_train_loss = sum(train_batch_losses) / len(train_batch_losses)
    average_valid_loss = sum(valid_batch_losses) / len(valid_batch_losses)
    train_losses.append(average_train_loss)
    valid_losses.append(average_valid_loss)

    # Early stopping and saving best model
    if average_valid_loss < best_valid_loss:
        best_valid_loss = average_valid_loss
        epochs_no_improve = 0
        torch.save(model.state_dict(), f'/content/drive/MyDrive/hhhack24/data/aug_epoch_{epoch+1}_model_weights.pth')  # Save your model weights
    else:
        epochs_no_improve += 1
        if epochs_no_improve == early_stop:
            print(f'Early stopping at epoch {epoch + 1}, no improvement for {early_stop} epochs')
            break

    print(f"\nEpoch {epoch+1}/{EPOCHS}, Train Loss: {average_train_loss:.4f}, Validation Loss: {average_valid_loss:.4f}")

  4%|▍         | 1/25 [00:48<19:22, 48.44s/it]


Epoch 1/25, Train Loss: 0.3594, Validation Loss: 0.3438


  8%|▊         | 2/25 [01:36<18:24, 48.02s/it]


Epoch 2/25, Train Loss: 0.2953, Validation Loss: 0.2491


 12%|█▏        | 3/25 [02:23<17:31, 47.79s/it]


Epoch 3/25, Train Loss: 0.2485, Validation Loss: 0.2078


 16%|█▌        | 4/25 [03:11<16:44, 47.86s/it]


Epoch 4/25, Train Loss: 0.2382, Validation Loss: 0.2015


 20%|██        | 5/25 [03:58<15:53, 47.67s/it]


Epoch 5/25, Train Loss: 0.2266, Validation Loss: 0.2159


 24%|██▍       | 6/25 [04:46<15:04, 47.62s/it]


Epoch 6/25, Train Loss: 0.2184, Validation Loss: 0.1943


 28%|██▊       | 7/25 [05:33<14:16, 47.58s/it]


Epoch 7/25, Train Loss: 0.2079, Validation Loss: 0.1910


 32%|███▏      | 8/25 [06:21<13:28, 47.58s/it]


Epoch 8/25, Train Loss: 0.1992, Validation Loss: 0.2035


 36%|███▌      | 9/25 [07:08<12:37, 47.32s/it]


Epoch 9/25, Train Loss: 0.1909, Validation Loss: 0.1863


 40%|████      | 10/25 [07:55<11:50, 47.40s/it]


Epoch 10/25, Train Loss: 0.1866, Validation Loss: 0.1842


 44%|████▍     | 11/25 [08:42<11:01, 47.28s/it]


Epoch 11/25, Train Loss: 0.1804, Validation Loss: 0.1913


 48%|████▊     | 12/25 [09:30<10:14, 47.27s/it]


Epoch 12/25, Train Loss: 0.1735, Validation Loss: 0.1799


 52%|█████▏    | 13/25 [10:17<09:26, 47.19s/it]


Epoch 13/25, Train Loss: 0.1676, Validation Loss: 0.1841


 56%|█████▌    | 14/25 [11:04<08:39, 47.20s/it]


Epoch 14/25, Train Loss: 0.1626, Validation Loss: 0.1883


 60%|██████    | 15/25 [11:51<07:51, 47.11s/it]


Epoch 15/25, Train Loss: 0.1554, Validation Loss: 0.1797


 64%|██████▍   | 16/25 [12:38<07:04, 47.16s/it]


Epoch 16/25, Train Loss: 0.1483, Validation Loss: 0.1886


 68%|██████▊   | 17/25 [13:25<06:16, 47.07s/it]


Epoch 17/25, Train Loss: 0.1408, Validation Loss: 0.1848


 72%|███████▏  | 18/25 [14:12<05:29, 47.05s/it]


Epoch 18/25, Train Loss: 0.1333, Validation Loss: 0.1868


 76%|███████▌  | 19/25 [14:59<04:41, 46.98s/it]


Epoch 19/25, Train Loss: 0.1259, Validation Loss: 0.1935


 76%|███████▌  | 19/25 [15:04<04:45, 47.59s/it]


KeyboardInterrupt: 

Least loss value at epoch 15:
> Valid Loss = 0.1797

In [25]:
eval_preds, eval_labels = [], []

state_dict = torch.load('/content/drive/MyDrive/hhhack24/data/aug_epoch_15_model_weights.pth')
model.load_state_dict(state_dict)
model.to('cuda')
model.eval()

with torch.no_grad():
    for batch in tqdm(val_dataloader):
        texts1, texts2, labels = batch
        inp1 = tokenizer(texts1, padding=True, truncation=True,
                        return_tensors='pt').to('cuda')
        inp2 = tokenizer(texts2, padding=True, truncation=True,
                        return_tensors='pt').to('cuda')
        inp1 = {key: val.to('cuda') for key, val in inp1.items()}
        inp2 = {key: val.to('cuda') for key, val in inp2.items()}

        out1 = model(**inp1)
        out2 = model(**inp2)

        emb1 = mean_pooling(out1, inp1['attention_mask'])
        emb2 = mean_pooling(out2, inp2['attention_mask'])
        cos_sim = nn.functional.cosine_similarity(emb1, emb2, dim=1)
        preds = cos_sim
        eval_preds.append(preds.cpu().tolist())
        eval_labels.append(labels.cpu().tolist())

100%|██████████| 27/27 [00:01<00:00, 14.13it/s]


In [26]:
eval_preds_flat = [pred for sublist in eval_preds for pred in sublist]
eval_labels_flat = [label for sublist in eval_labels for label in sublist]

In [29]:
def objective(trial):
    thresh = trial.suggest_float('thresh', 0.0, 1.0)
    binary_preds = [int(pred > thresh) for pred in eval_preds_flat]
    f1 = f1_score(eval_labels_flat, binary_preds)

    return f1

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=500)

best_thresh = study.best_params['thresh']
print(f'Best Threshold: {best_thresh}')

binary_preds_optimized = [int(pred > best_thresh) for pred in eval_preds_flat]
print(f'Optimized F1-Score: {f1_score(eval_labels_flat, binary_preds_optimized):.4f}')

[I 2024-03-05 11:18:19,215] A new study created in memory with name: no-name-443b4755-7ea7-4b4f-bc2e-7a54f4309267
[I 2024-03-05 11:18:19,225] Trial 0 finished with value: 0.3373493975903614 and parameters: {'thresh': 0.3734035004163977}. Best is trial 0 with value: 0.3373493975903614.
[I 2024-03-05 11:18:19,234] Trial 1 finished with value: 0.3 and parameters: {'thresh': 0.3354402774498185}. Best is trial 0 with value: 0.3373493975903614.
[I 2024-03-05 11:18:19,241] Trial 2 finished with value: 0.2105263157894737 and parameters: {'thresh': 0.6726486381386855}. Best is trial 0 with value: 0.3373493975903614.
[I 2024-03-05 11:18:19,249] Trial 3 finished with value: 0.336283185840708 and parameters: {'thresh': 0.2818720896517335}. Best is trial 0 with value: 0.3373493975903614.
[I 2024-03-05 11:18:19,257] Trial 4 finished with value: 0.36923076923076925 and parameters: {'thresh': 0.22582701406592653}. Best is trial 4 with value: 0.36923076923076925.
[I 2024-03-05 11:18:19,267] Trial 5 fin

Best Threshold: 0.4343764681236604
Optimized F1-Score: 0.3944


In [34]:
print(f'Balanced accuracy: {balanced_accuracy_score(eval_labels_flat, binary_preds_optimized)}\n')
print(classification_report(eval_labels_flat, binary_preds_optimized, target_names=['Class 0', 'Class 1']))

Balanced accuracy: 0.6009803921568628

              precision    recall  f1-score   support

     Class 0       0.82      0.74      0.78       102
     Class 1       0.34      0.47      0.39        30

    accuracy                           0.67       132
   macro avg       0.58      0.60      0.59       132
weighted avg       0.71      0.67      0.69       132

