# Embedding Model Fine-tuning with Sentence Transformers

[![Open In Colab](https://img.shields.io/badge/Open%20In-Colab-blue?style=for-the-badge&logo=google-colab)](https://colab.research.google.com/github/dnth/rag-datakit/blob/main/nbs/02_train.ipynb)
[![Open In Kaggle](https://img.shields.io/badge/Open%20In-Kaggle-blue?style=for-the-badge&logo=kaggle)](https://kaggle.com/kernels/welcome?src=https://github.com/dnth/rag-datakit/blob/main/nbs/02_train.ipynb)

This notebook demonstrates how to fine-tune an embedding model using the synthetic triplet data generated from Singapore SkillsFuture Framework job descriptions. We'll use the sentence-transformers library to train a model that can better understand job-related semantic similarity for improved retrieval and matching.

## What you'll learn:
- How to set up a sentence-transformers training pipeline
- Configuring training arguments for embedding models
- Using MultipleNegativesRankingLoss for triplet training
- Monitoring training with Weights & Biases
- Saving and publishing trained models

## Installation

Install the rag-datakit package, which pulls in the required dependencies such as distilabel, transformers, and dataset utilities. Run the install cell below if you haven't installed it already.

On Google Colab you might need to uninstall the existing packages due to conflicting versions.

In [1]:
!pip uninstall -y transformers torch torchvision

Found existing installation: transformers 4.56.0
Uninstalling transformers-4.56.0:
  Successfully uninstalled transformers-4.56.0
Found existing installation: torch 2.8.0
Uninstalling torch-2.8.0:
  Successfully uninstalled torch-2.8.0

In [2]:
!pip install git+https://github.com/dnth/rag-datakit.git

Collecting git+https://github.com/dnth/rag-datakit.git
  Cloning https://github.com/dnth/rag-datakit.git to /tmp/pip-req-build-xirmbroz
  Running command git clone --filter=blob:none --quiet https://github.com/dnth/rag-datakit.git /tmp/pip-req-build-xirmbroz
  Resolved https://github.com/dnth/rag-datakit.git to commit e7e791e3a187011848a6acb8744f123f98d6a4ae
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting torch>=2.8.0 (from rag-datakit==0.1.0)
  Using cached torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting transformers>=4.55.0 (from rag-datakit==0.1.0)
  Using cached transformers-4.56.0-py3-none-any.whl.metadata (40 kB)
Using cached transformers-4.56.0-py3-none-any.whl (11.6 MB)
Using cached torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl (887.9 MB)
Installing collected packages: torch, transformers

## Import Required Libraries

We begin by importing the essential libraries for training our embedding model:

- **sentence_transformers**: The core library providing tools for training and evaluating sentence transformers
- **datasets**: Hugging Face's library for loading and managing our training dataset
- **wandb**: Weights & Biases for experiment tracking and visualization of training metrics

In [3]:
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import BatchSamplers
from datasets import load_dataset

## Dataset Loading and Inspection

We load the preprocessed training dataset with train/validation splits that were created in the previous notebook. This dataset contains triplets specifically formatted for embedding model training using contrastive learning approaches.

**Dataset source**: `Fatin757/ssf-train-valid`  
**Structure**:
- **anchor**: Original job descriptions from the SkillsFuture Framework
- **positive**: Semantically similar paraphrases generated through synthetic data techniques
- **negative**: Semantically different job descriptions used as contrastive examples

Let's examine the dataset structure and review a sample triplet to understand the data format:

In [4]:
dataset = load_dataset("Fatin757/ssf-train-valid")
dataset

DatasetDict({
    train: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 6032
    })
    valid: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 1508
    })
})

In [5]:
dataset['valid'][0]

{'anchor': 'The Chief Executive Officer/Chief Operating Officer/Managing Director/General Manager/President defines the long-term strategic direction to grow the business in line with the organisations overall vision, mission and values. He/She translates broad goals into achievable steps, anticipates and stays ahead of trends, and takes advantage of business opportunities. He represents the organisation with customers, investors, and business partners, and holds responsibility for fostering a culture of workplace safety and health and adherence to industry quality standards. He inspires the organisation towards achieving business goals and fulfilling the vision, mission and values by striving for continuous improvement, driving innovation and equipping the organisation to embrace change. He possesses excellent analytical, problem-solving and leadership skills and is an effective people leader.',
 'positive': 'The Chief Executive Officer (CEO) is responsible for establishing the long-t
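Before training, it can help to confirm that every record really has the three text fields described above. The following is a minimal sketch in plain Python; the helper name `invalid_triplets` and the toy records are hypothetical and are not part of the notebook or the `datasets` library.

```python
REQUIRED_COLUMNS = ("anchor", "positive", "negative")

def invalid_triplets(records):
    """Return the indices of records with a missing, empty, or non-string field."""
    bad = []
    for index, record in enumerate(records):
        is_valid = all(
            isinstance(record.get(column), str) and record[column].strip()
            for column in REQUIRED_COLUMNS
        )
        if not is_valid:
            bad.append(index)
    return bad

# Hypothetical toy records standing in for dataset rows.
toy_records = [
    {"anchor": "Leads the finance team.", "positive": "Heads finance.", "negative": "Repairs aircraft."},
    {"anchor": "Designs software.", "positive": "", "negative": "Manages a kitchen."},
    {"anchor": "Teaches mathematics.", "positive": "Instructs students in maths."},
]

bad_indices = invalid_triplets(toy_records)
print(bad_indices)
```

Iterating over a Hugging Face `Dataset` split yields one dictionary per row, so the same helper should also accept `dataset['train']` directly.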

## Initialize Weights & Biases and Model Configuration

We initialize Weights & Biases for experiment tracking and define our base model and save path. We're using `Qwen/Qwen3-Embedding-0.6B` as our starting point, a relatively small (0.6B-parameter) general-purpose text embedding model.

In [5]:
import wandb

model_id = "Qwen/Qwen3-Embedding-0.6B"
save_model_path = "./models/Qwen/Qwen3-Embedding-0.6B"

# wandb.login()
wandb.init(project="rag-datakit-finetunes", name="Qwen/Qwen3-Embedding-0.6B")

wandb: Currently logged in as: fatinnurafiqah-imk (fatinnurafiqah-imk-cxsanalytics) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


## Configure Training Arguments

We configure the training arguments for our embedding model. These parameters control various aspects of the training process including batch sizes, learning rate, precision settings, and checkpointing behavior. Key configurations include:

- 5 training epochs with a cosine learning rate scheduler and a 10% warmup
- Mixed precision training using fp16 (bf16 and tf32 are disabled in this run)
- Gradient accumulation over 16 steps at a per-device batch size of 8, for an effective batch size of 128 per device
- NO_DUPLICATES batch sampler, so that repeated texts in a batch do not act as false negatives
- Checkpointing at the end of each epoch with a limit of 3 saved models
- Evaluation after each epoch to monitor validation loss
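The batch-size and step arithmetic can be checked with a few lines of Python. This is a sketch using the values from this notebook's configuration cell and dataset printout; it assumes a single-device run and that a trailing partial accumulation group still counts as one optimizer step, which is consistent with the `global_step=240` reported after training.

```python
import math

# Values copied from this notebook's configuration and dataset printout.
train_rows = 6032
per_device_train_batch_size = 8
gradient_accumulation_steps = 16
num_train_epochs = 5
num_devices = 1  # assumption: single-GPU run

# Number of samples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Micro-batches per epoch, then optimizer steps per epoch.
# Assumption: a trailing partial accumulation group still triggers one step.
micro_batches = math.ceil(train_rows / (per_device_train_batch_size * num_devices))
steps_per_epoch = math.ceil(micro_batches / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # 128 48 240
```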

In [None]:
args = SentenceTransformerTrainingArguments(
    output_dir=save_model_path,
    num_train_epochs=5,                         # number of epochs
    per_device_train_batch_size=8,              # train batch size per device
    gradient_accumulation_steps=16,             # effective batch size of 8 * 16 = 128 per device
    per_device_eval_batch_size=4,               # evaluation batch size per device
    warmup_ratio=0.1,                           # warm up over the first 10% of steps
    learning_rate=2e-5,                         # peak learning rate
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=False,                                 # tf32 disabled
    bf16=False,                                 # bf16 disabled
    fp16=True,                                  # train with fp16 mixed precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_strategy="epoch",                   # log after each epoch
    save_total_limit=3,                         # cap the number of checkpoints kept on disk
    load_best_model_at_end=True,                # load the best model when training ends
    report_to="wandb",                          # log metrics to Weights & Biases
    gradient_checkpointing=True,                # trade extra compute for lower memory use
    use_cache=False,
)
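To see what `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` mean for the learning rate, here is a plain-Python sketch of linear warmup followed by a single half-cosine decay. It is an illustration of the schedule shape, not the library's own code; the function name is hypothetical, and the 240 total steps are taken from the training output later in this notebook.

```python
import math

def cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 240   # global_step reported by this notebook's training run
warmup_steps = 24   # 10% of 240, from warmup_ratio=0.1
base_lr = 2e-5      # learning_rate in the configuration

# Learning rate at the start, end of warmup, midpoint of decay, and final step.
schedule = {
    step: cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr)
    for step in (0, 24, 132, 240)
}
for step, lr in schedule.items():
    print(step, f"{lr:.2e}")
```

The rate climbs to its peak at step 24, is about half the peak midway through the decay, and reaches zero at the final step, which matches the `train/learning_rate` of 0.0 in the run summary.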

## Initialize Model, Loss Function, and Trainer

We initialize our sentence transformer model and configure the training components:

- Load the base `Qwen/Qwen3-Embedding-0.6B` model with sentence-transformers
- Configure `MultipleNegativesRankingLoss` which is ideal for triplet training as it maximizes the similarity between anchor and positive pairs while minimizing similarity between anchor and negative pairs
- Set up the `SentenceTransformerTrainer` with our model, training arguments, datasets, and loss function
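The idea behind `MultipleNegativesRankingLoss` can be sketched without any deep learning library: for each anchor, every positive and negative in the batch is a candidate, and the loss is a cross-entropy that rewards ranking the anchor's own positive first. The code below is a simplified pure-Python illustration with hypothetical 2-d vectors, not the library's implementation; it assumes cosine similarity and a scale of 20.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Cross-entropy over scaled similarities; the target for anchor i is positive i."""
    # Candidates: all positives in the batch, followed by all explicit negatives.
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in candidates]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Hypothetical toy embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

aligned = mnr_loss(anchors, positives, negatives)
swapped = mnr_loss(anchors, positives[::-1], negatives)  # positives paired with the wrong anchors
print(aligned < swapped)
```

Because the other positives in the batch also serve as negatives, a duplicated text in a batch would be penalised as a negative of itself, which is the motivation for the `NO_DUPLICATES` batch sampler.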

In [8]:
model = SentenceTransformer(model_id)
train_loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['valid'],  
    loss=train_loss,
)


## Execute Model Training

We start the training process using our configured trainer. The model will train for 5 epochs, with evaluation happening after each epoch. The training progress and metrics are tracked through Weights & Biases, showing both training and validation loss metrics.

Over the run, the validation loss falls from 0.001404 after epoch 1 to 0.00118 after epoch 5, with a temporary rise at epoch 2. This suggests that the model is learning to distinguish between semantically similar and dissimilar job descriptions.

In [9]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.0038 | 0.001404 |
| 2 | 0.0013 | 0.002085 |
| 3 | 0.0008 | 0.001326 |
| 4 | 0.0006 | 0.001291 |
| 5 | 0.001 | 0.00118 |


TrainOutput(global_step=240, training_loss=0.001517765208457907, metrics={'train_runtime': 1733.9773, 'train_samples_per_second': 17.394, 'train_steps_per_second': 0.138, 'total_flos': 0.0, 'train_loss': 0.001517765208457907, 'epoch': 5.0})
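Since `load_best_model_at_end=True` is set, the trainer restores the checkpoint with the best evaluation metric once training ends. The sketch below reproduces that selection by hand from the losses reported above, assuming the default behaviour of choosing the lowest validation loss and checkpoints named after the global step.

```python
# Validation loss per epoch, copied from the training log above.
eval_loss_by_epoch = {
    1: 0.001404,
    2: 0.002085,
    3: 0.001326,
    4: 0.001291,
    5: 0.00118,
}

# Assumption: the best model is the one with the lowest validation loss.
best_epoch = min(eval_loss_by_epoch, key=eval_loss_by_epoch.get)

# 240 optimizer steps over 5 epochs gives 48 steps per epoch.
steps_per_epoch = 240 // 5
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"

print(best_epoch, best_checkpoint)  # 5 checkpoint-240
```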

## Save and Upload Trained Model

After training is complete, we save the fine-tuned model to disk and upload it to Weights & Biases for versioning and sharing. The saved model includes all necessary components:

- Model weights and configuration
- Tokenizer files
- Pooling layer configuration
- Normalization components

This ensures that the model can be easily loaded and used for inference later.

In [10]:
trainer.save_model()

In [11]:
import os
wandb.save(os.path.join(save_model_path, "*"))



['/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/sentence_bert_config.json',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/1_Pooling',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/2_Normalize',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/checkpoint-240',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/checkpoint-144',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/modules.json',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/config_sentence_transformers.json',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o1oxa4dl/files/models/Qwen/Qwen3-Embedding-0.6B/special_tokens_map.json',
 '/root/rag-datakit/nbs/wandb/run-20250903_024134-o

## Training Results and Next Steps

The training completed with a final validation loss of 0.00118, suggesting that our model has learned to distinguish between semantically similar and dissimilar job descriptions. The Weights & Biases dashboard provides detailed metrics and visualizations of the training process.

### Next Steps

To use this model in production:

1. **Load the model** using `SentenceTransformer('./models/Qwen/Qwen3-Embedding-0.6B')`
2. **Evaluate** on a test set to verify performance on unseen data
3. **Deploy** in your RAG pipeline for improved job description matching
4. **Publish** to the Hugging Face Hub (see the last cell) to share with the community

The fine-tuned model is now ready to provide more accurate semantic similarity scores for job descriptions in your retrieval-augmented generation workflows.
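For the evaluation step, one simple measure on held-out triplets is triplet accuracy: the fraction of triplets where the anchor is closer to its positive than to its negative. The following is a minimal sketch in plain Python with hypothetical 2-d embeddings; in practice the vectors would come from `model.encode(...)` on the test texts, and the helper names here are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_accuracy(anchors, positives, negatives):
    """Fraction of triplets with sim(anchor, positive) > sim(anchor, negative)."""
    hits = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return hits / len(anchors)

# Hypothetical embeddings; the third triplet is deliberately ranked wrongly.
anchor_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positive_emb = [[0.9, 0.1], [0.2, 0.8], [-1.0, 0.5]]
negative_emb = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.9]]

accuracy = triplet_accuracy(anchor_emb, positive_emb, negative_emb)
print(f"{accuracy:.3f}")
```

Comparing this number for the base model and the fine-tuned model on the same unseen triplets gives a more direct picture of retrieval quality than the loss value alone.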

In [12]:
wandb.finish()

Run history:
eval/loss,▃█▂▂▁
eval/runtime,▁█▃▄▄
eval/samples_per_second,█▁▆▅▅
eval/steps_per_second,█▁▆▅▅
train/epoch,▁▁▃▃▅▅▆▆███
train/global_step,▁▁▃▃▅▅▆▆███
train/grad_norm,▅█▂▁▁
train/learning_rate,█▆▄▂▁
train/loss,█▃▁▁▂

Run summary:
eval/loss,0.00118
eval/runtime,36.9339
eval/samples_per_second,40.83
eval/steps_per_second,10.207
total_flos,0.0
train/epoch,5.0
train/global_step,240.0
train/grad_norm,0.00189
train/learning_rate,0.0
train/loss,0.001


In [14]:
trainer.model.push_to_hub("Fatin757/ssf-retriever-qwen3", exist_ok=True)


'https://huggingface.co/Fatin757/ssf-retriever-qwen3/commit/d759c52f8f0eacd5a46cb57a15606a344a9fb4cb'