## Using distilabel for synthetic data generation

This notebook shows how to use `distilabel` to generate synthetic data to train a customized embedding model.

First, install the required packages by uncommenting and running the following cell.

In [1]:
# !pip install git+https://github.com/dnth/rag-datakit.git

## Load and Inspect Dataset

This dataset comes from the SkillsFuture Skills Frameworks: https://jobsandskills.skillsfuture.gov.sg/frameworks/skills-frameworks

I've uploaded the Excel file to the Hugging Face dataset repo `dnth/ssf-dataset`.

In [2]:
from datasets import load_dataset

dataset = load_dataset("dnth/ssf-dataset")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sector', 'Track', 'Job Role', 'Job Role Description', 'Performance Expectation'],
        num_rows: 1885
    })
})

In [4]:
dataset['train'][0]

{'Sector': 'Accountancy',
 'Track': 'Assurance',
 'Job Role': 'Audit Associate / Audit Assistant Associate',
 'Job Role Description': 'The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision. He/She begins to appreciate the underlying principles behind the tasks assigned to him as part of the audit plan. He is also able to make adjustments to the application of skills to improve the work tasks or solve non-complex issues. The Audit Associate/Audit Assistant Associate operates in a structured work environment. He is able to build relationships, work in a team and identify ethical issues with reference to the code of professional conduct and ethics. He is able to select and apply from a range of known solutions to familiar problems and takes responsibility for his own learning and performance. He is a trustworthy and meticulous individual.',
 'Performance Expectation': 'In accordance with: Singapore Standards on Auditing, Ethics Pronouncem

## Generate positive and negative queries

To train an embedding model, we need a positive and a negative query for each anchor: the positive should match the anchor's meaning, while the negative should look related but describe something different.
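To make the role of the three texts concrete, here is a minimal sketch of a triplet margin objective on toy vectors. This is an assumption for illustration: the notebook does not say which loss the embedding model will be trained with, and the function names, the margin of `0.5`, and the 2-D vectors are all made up.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D "embeddings" (hypothetical numbers, not model output).
anchor = [1.0, 0.0]
positive = [1.0, 0.0]  # same direction as the anchor
negative = [0.0, 1.0]  # orthogonal to the anchor

well_separated = triplet_loss(anchor, positive, negative)
confused = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
```

The loss is zero when the positive already sits closer to the anchor than the negative by the margin, and grows when the two are confused, which is why each anchor needs both kinds of query.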

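In this notebook, the anchor is the job role description. The cell below generates the queries with a local `Qwen/Qwen3-4B-Instruct-2507` model through `TransformersLLM`; an OpenAI alternative is left commented out. To use OpenAI instead, save your API key in a `.env` file and load it into the environment before running the cell (`os.getenv` does not read `.env` on its own).

For example: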

`OPENAI_API_KEY = sk-proj-.......`
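`os.getenv` only sees variables that are already in the process environment, so the `.env` file has to be loaded first. The `python-dotenv` package's `load_dotenv()` is the usual tool for this. The following is a minimal standard-library sketch of the same idea; `read_env_file` and `load_env_file` are hypothetical helper names, and the parser only handles simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path, environ=os.environ):
    """Copy parsed values into the environment without overwriting existing ones."""
    for key, value in read_env_file(path).items():
        environ.setdefault(key, value)
```

Calling `load_env_file(".env")` before the next cell would make `os.getenv("OPENAI_API_KEY")` return the saved key, assuming the file sits in the notebook's working directory.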

In [5]:
import os
from distilabel.llms import OpenAILLM, TransformersLLM

llm = TransformersLLM(
    model="Qwen/Qwen3-4B-Instruct-2507",
    device_map="auto",
    torch_dtype="float16",
)

# llm = OpenAILLM(
#     model="gpt-4o-mini",
#     api_key=os.getenv("OPENAI_API_KEY"),
# )





In [None]:
context = """
The text is a job description from the Singapore SkillsFuture Framework. Your task is to generate realistic job descriptions from the provided description.

For the positive query, generate a realistic description for this role. Focus on creating variations that capture the essence of the role in different words, as if written by different people or organizations posting similar jobs.

For negative descriptions, you may choose from the following strategies:

1. Same industry, different seniority level (Senior → Junior or vice versa)
2. Same industry, different function (Business Valuation → Risk Management)
3. Similar skills, different domain (Financial Analysis in Banking vs Healthcare)
4. Same title, different industry context

The query should always include the job role. Start the description with The <job role>.

"""

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import GenerateSentencePair

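with Pipeline(name="generate") as pipeline:
    # Note: this rebinds `load_dataset`, shadowing the function imported from `datasets` above.
    load_dataset = LoadDataFromHub(
        num_examples=10,  # limit the run to 10 rows
        use_cache=False,
        # Rename the source column so the task receives it as its `anchor` input.
        output_mappings={"Job Role Description": "anchor"},
    )
    generate_retrieval_pairs_easy = GenerateSentencePair(
        name="easy_triplets",
        triplet=True,  # generate a negative as well as a positive
        hard_negative=False,
        action="paraphrase",
        llm=llm,
        input_batch_size=10,
        context=context,
    )

    # Feed the loaded rows into the generation task.
    load_dataset.connect(generate_retrieval_pairs_easy)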

In [None]:
distiset = pipeline.run(
    parameters={
        load_dataset.name: {
            "repo_id": "dnth/ssf-dataset",
            "split": "train",
        },
        "easy_triplets": {
            "llm": {"generation_kwargs": {"temperature": 0.7, "max_new_tokens": 256}}
        },
    }
)

Device set to use cuda:0



In [8]:
distiset

Distiset({
    default: DatasetDict({
        train: Dataset({
            features: ['Sector', 'Track', 'Job Role', 'anchor', 'Performance Expectation', 'positive', 'negative', 'distilabel_metadata', 'model_name'],
            num_rows: 10
        })
    })
})

In [9]:
distiset["default"]["train"][0]

{'Sector': 'Accountancy',
 'Track': 'Assurance',
 'Job Role': 'Audit Associate / Audit Assistant Associate',
 'anchor': 'The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision. He/She begins to appreciate the underlying principles behind the tasks assigned to him as part of the audit plan. He is also able to make adjustments to the application of skills to improve the work tasks or solve non-complex issues. The Audit Associate/Audit Assistant Associate operates in a structured work environment. He is able to build relationships, work in a team and identify ethical issues with reference to the code of professional conduct and ethics. He is able to select and apply from a range of known solutions to familiar problems and takes responsibility for his own learning and performance. He is a trustworthy and meticulous individual.',
 'Performance Expectation': 'In accordance with: Singapore Standards on Auditing, Ethics Pronouncements in Singap

In [10]:
distiset["default"]["train"].to_pandas()

| | Sector | Track | Job Role | anchor | Performance Expectation | positive | negative | distilabel_metadata | model_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Accountancy | Assurance | Audit Associate / Audit Assistant Associate | The Audit Associate/Audit Assistant Associate ... | In accordance with: Singapore Standards on Aud... | The Audit Associate/Audit Assistant undertakes... | The Audit Associate/Audit Assistant enjoys wee... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 1 | Accountancy | Assurance | Audit Manager | The Audit Senior Manager/Audit Manager manages... | In accordance with: Singapore Standards on Aud... | The Audit Senior Manager/Audit Manager leads a... | The Audit Senior Manager/Audit Manager oversee... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 2 | Accountancy | Assurance | Audit Partner / Audit Director | The Audit Partner/Audit Director is a transfor... | In accordance with: Singapore Standards on Aud... | The Audit Partner/Audit Director is a visionar... | The Audit Partner/Audit Director is a passiona... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 3 | Accountancy | Assurance | Audit Senior | The Audit Senior is expected to team lead vari... | In accordance with: Singapore Standards on Aud... | The Audit Senior is responsible for leading au... | The Audit Senior is responsible for managing i... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 4 | Accountancy | Business Valuation | Business Valuation Associate / Business Valuat... | The Business Valuation Associate/Business Valu... | In accordance with the International Valuation... | The Business Valuation Associate/Business Valu... | The Business Valuation Associate/Business Valu... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 5 | Accountancy | Business Valuation | Business Valuation Manager | The Business Valuation Manager is second in ch... | In accordance with the International Valuation... | The Business Valuation Manager is second-in-co... | The Business Valuation Manager is responsible ... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 6 | Accountancy | Business Valuation | Business Valuation Partner / Business Valuatio... | The Business Valuation Partner/Business Valuat... | In accordance with the International Valuation... | The Business Valuation Partner/Business Valuat... | The Business Valuation Partner/Business Valuat... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 7 | Accountancy | Business Valuation | Business Valuation Senior / Business Valuation... | The Business Valuation Senior/Business Valuati... | In accordance with the International Valuation... | The Business Valuation Senior is responsible f... | The Business Valuation Senior works remotely f... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 8 | Accountancy | Enterprise Risk Management | Chief Risk Officer / Risk Partner / Head of Ri... | The Chief Risk Officer/Risk Partner/Head of Ri... | | The Enterprise Risk Management Director overse... | The Enterprise Risk Management Director overse... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 9 | Accountancy | Enterprise Risk Management | Enterprise Risk Management Associate / Enterpr... | The Enterprise Risk Management Associate/Enter... | | The Enterprise Risk Management Associate/Enter... | The Enterprise Risk Management Associate/Enter... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
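Before publishing the generations, it can help to check each row against the rules stated in the prompt. The following is a minimal sketch, not part of the original notebook: `check_triplet` is a hypothetical helper, the sample rows are invented for illustration, and the checks (non-empty fields, descriptions starting with "The ", no verbatim copies) are assumptions about what counts as a usable triplet.

```python
def check_triplet(row):
    """Return a list of problems found in one generated row (empty list = OK)."""
    problems = []
    for field in ("anchor", "positive", "negative"):
        text = (row.get(field) or "").strip()
        if not text:
            problems.append(f"{field} is empty")
        elif field != "anchor" and not text.startswith("The "):
            # The prompt asks for descriptions that start with "The <job role>".
            problems.append(f"{field} does not start with 'The '")
    if not problems:
        anchor = row["anchor"].strip()
        positive = row["positive"].strip()
        negative = row["negative"].strip()
        if positive == negative:
            problems.append("positive and negative are identical")
        if anchor in (positive, negative):
            problems.append("anchor was copied verbatim")
    return problems


# Hypothetical rows for illustration only (not actual generations).
rows = [
    {"anchor": "The Audit Senior leads audit teams.",
     "positive": "The Audit Senior supervises audit engagements.",
     "negative": "The Audit Associate assists with audit tasks."},
    {"anchor": "The Audit Senior leads audit teams.",
     "positive": "",
     "negative": "Audit Associate assists with audit tasks."},
]

report = {i: check_triplet(row) for i, row in enumerate(rows)}
```

The same function could be applied to the generated data with `[check_triplet(r) for r in distiset["default"]["train"]]`, since iterating over the dataset yields one dictionary per row.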


In [11]:
distiset["default"].push_to_hub(repo_id="dnth/ssf-dataset-synthetic", revision="main")

CommitInfo(commit_url='https://huggingface.co/datasets/dnth/ssf-dataset-synthetic/commit/9e4185e84053dcb429d2591cb166a5eab612025f', commit_message='Upload dataset', commit_description='', oid='9e4185e84053dcb429d2591cb166a5eab612025f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/dnth/ssf-dataset-synthetic', endpoint='https://huggingface.co', repo_type='dataset', repo_id='dnth/ssf-dataset-synthetic'), pr_revision=None, pr_num=None)