# Finetuning a BERT model for embedding extraction

This notebook walks through the finetuning process for a model that produces vector representations (embeddings) of text. Unlike the FastText model, here we use the pre-trained BERT [provided by the SberDevices team](https://huggingface.co/ai-forever/sbert_large_nlu_ru).

## 1. Imports and installations

In [None]:
!pip install -q accelerate pyarrow pyarrow-hotfix datasets==2.20.0 numpy==1.26.4 # [optional]

A simple way to organize the files used in the finetuning process:


- **data/**
  - Contains all uploaded files, such as raw text
  - `raw_corpus.pkl`: pickle file for loading the corpus quickly

- **utility/**
  - Directory for text preprocessing utilities and tools
  - `text_preprocessing.py`: custom module that defines the `TextPreprocesser` class

- **rubert_results/**
  - Storage for the BERT training history (runs, checkpoints and logs)
    - **final/**: finetuned model with its accompanying files


In [4]:
# create the working directories (-p skips the ones that already exist)
!mkdir -p data utility rubert_results/final

In [5]:
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import Dataset

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

from utility.text_preprocessing import TextPreprocesser

## 2. Device choice and corpus preparation

In [8]:
# use the GPU for training when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [None]:
# Load your dataset
with open("./data/raw_corpus.pkl", "rb") as file:
    corpus = pickle.load(file)

corpus = TextPreprocesser(corpus, n_grams=2)
sentences = corpus.clean_corpus

print(sentences[:5])

## 3. Model initialization and dataset customization

In [10]:
# Reduce the amount of data to stay within the free GPU limits
sentences = sentences[:10000]

# Pre-trained checkpoint: BERT large (uncased) for sentence embeddings in Russian
model_name = "ai-forever/sbert_large_nlu_ru"

# model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2).to(device)

# Tokenize the full corpus (not used below; the train/test splits are tokenized separately)
inputs = tokenizer(sentences,
                   return_tensors='pt',
                   padding=True,
                   truncation=True,
                   max_length=64).to(device)

# Placeholder labels for a binary classification task:
# the first half of the corpus is class 0, the second half is class 1
labels = [0 if i < len(sentences) // 2 else 1 for i in range(len(sentences))]

# Split dataset
train_sentences, test_sentences, train_labels, test_labels = train_test_split(sentences,
                                                                              labels,
                                                                              test_size=0.25)

# Tokenize datasets
train_inputs = tokenizer(train_sentences,
                         return_tensors='pt',
                         padding=True,
                         truncation=True,
                         max_length=64)
test_inputs = tokenizer(test_sentences,
                        return_tensors='pt',
                        padding=True,
                        truncation=True,
                        max_length=64)

# move it to device
train_inputs = {k: v.to(device) for k, v in train_inputs.items()}
test_inputs = {k: v.to(device) for k, v in test_inputs.items()}

# Create datasets
train_dataset = Dataset.from_dict({'input_ids': train_inputs['input_ids'],
                                   'attention_mask': train_inputs['attention_mask'],
                                   'labels': train_labels})
test_dataset = Dataset.from_dict({'input_ids': test_inputs['input_ids'],
                                  'attention_mask': test_inputs['attention_mask'],
                                  'labels': test_labels})


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ai-forever/sbert_large_nlu_ru and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 4. Train Loop

In [11]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./rubert_results',
    evaluation_strategy="epoch",
    learning_rate=3e-4,  # note: high for BERT finetuning; 2e-5 to 5e-5 is more typical
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.7469        | 0.693118        |
| 2     | 0.7082        | 0.693131        |
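
The validation loss of about 0.6931 in both epochs deserves a second look. For a two-class problem, a classifier that assigns probability 0.5 to every example has a cross-entropy of ln 2, which the short calculation below reproduces. This suggests the classification head stayed at chance level, which would be consistent with the labels being assigned by position in the corpus rather than by content. This is an interpretation of the logged numbers, not something the notebook itself verifies.

```python
import math

# Cross-entropy of a binary classifier that always predicts p = 0.5
chance_loss = -math.log(0.5)
print(round(chance_loss, 4))  # 0.6931
```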


TrainOutput(global_step=1876, training_loss=0.719536559668177, metrics={'train_runtime': 1026.3834, 'train_samples_per_second': 14.614, 'train_steps_per_second': 1.828, 'total_flos': 1747371306240000.0, 'train_loss': 0.719536559668177, 'epoch': 2.0})
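
The run above reports only losses. A minimal sketch of an accuracy metric that could be passed to `Trainer` through its `compute_metrics` argument is shown below. The function name `compute_accuracy` and the toy arrays are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Return accuracy from the (logits, labels) pair handed over at evaluation time."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with hand-made logits (hypothetical values, not model output)
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_accuracy((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

With this in place, `Trainer(..., compute_metrics=compute_accuracy)` would add an accuracy column to the per-epoch evaluation report.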

## 5. Save finetuned model

The finetuned model and tokenizer are saved locally:

In [12]:
# Save the finetuned model and tokenizer
model.save_pretrained('rubert_results/final')
tokenizer.save_pretrained('rubert_results/final')

('rubert_results/final/tokenizer_config.json',
 'rubert_results/final/special_tokens_map.json',
 'rubert_results/final/vocab.txt',
 'rubert_results/final/added_tokens.json',
 'rubert_results/final/tokenizer.json')
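
The notebook stops at saving the model, so the embedding extraction step itself is not shown. One common way to turn token-level BERT outputs into a single sentence vector is masked mean pooling. The sketch below is an assumption about how that step might look, not the author's confirmed method; the helper name `mean_pooling` and the random tensors standing in for the model's `last_hidden_state` are illustrative.

```python
import torch


def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# Toy check: random tensors stand in for the encoder output
hidden = torch.randn(2, 4, 8)                      # (batch, seq_len, hidden_size)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])  # 1 = real token, 0 = padding
sentence_vectors = mean_pooling(hidden, mask)
print(sentence_vectors.shape)  # torch.Size([2, 8])
```

In practice, `token_embeddings` would presumably come from loading `rubert_results/final` with `AutoModel.from_pretrained` (which gives the encoder without the classification head), tokenizing the sentences as in section 3, and reading `last_hidden_state` from the model output.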