# Synthetic Data Generation for Embedding Training with Distilabel

This notebook demonstrates how to use `distilabel` to generate synthetic training data for customized embedding models. We'll work with job descriptions from Singapore's SkillsFuture Framework to create positive and negative query pairs that can be used to train embedding models for better job matching and semantic search capabilities.

## What you'll learn:
- Loading and preparing datasets for synthetic data generation
- Creating positive and negative query pairs using LLMs
- Building distilabel pipelines for automated data generation
- Publishing synthetic datasets to Hugging Face Hub

## Prerequisites
Install the `rag-datakit` package, which bundles distilabel, transformers, and the dataset utilities used in this notebook. Uncomment the cell below if you haven't installed it already.

In [1]:
# !pip install git+https://github.com/dnth/rag-datakit.git

## Dataset Loading and Inspection

We'll work with the Singapore Skills Framework (SSF) dataset, which contains job roles and descriptions across various sectors. This dataset is ideal for training embedding models for job matching applications.

**Dataset source:** [Skills Frameworks Singapore](https://jobsandskills.skillsfuture.gov.sg/frameworks/skills-frameworks)  
**Hugging Face repo:** `dnth/ssf-dataset`

The dataset contains structured information about:
- **Sector**: Industry sector (e.g., Accountancy, Technology)
- **Track**: Specialization within the sector (e.g., Assurance, Business Valuation)
- **Job Role**: Specific job title
- **Job Role Description**: Detailed description of responsibilities and requirements
- **Performance Expectation**: Standards and compliance requirements

Let's load and examine the dataset structure:

In [2]:
from datasets import load_dataset

dataset = load_dataset("dnth/ssf-dataset")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sector', 'Track', 'Job Role', 'Job Role Description', 'Performance Expectation'],
        num_rows: 1885
    })
})

In [4]:
dataset['train'][0]

{'Sector': 'Accountancy',
 'Track': 'Assurance',
 'Job Role': 'Audit Associate / Audit Assistant Associate',
 'Job Role Description': 'The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision. He/She begins to appreciate the underlying principles behind the tasks assigned to him as part of the audit plan. He is also able to make adjustments to the application of skills to improve the work tasks or solve non-complex issues. The Audit Associate/Audit Assistant Associate operates in a structured work environment. He is able to build relationships, work in a team and identify ethical issues with reference to the code of professional conduct and ethics. He is able to select and apply from a range of known solutions to familiar problems and takes responsibility for his own learning and performance. He is a trustworthy and meticulous individual.',
 'Performance Expectation': 'In accordance with: Singapore Standards on Auditing, Ethics Pronouncem

## Synthetic Data Generation Setup

To train effective embedding models, we need to create training triplets consisting of:
- **Anchor**: The original job role description
- **Positive**: A paraphrased or similar job description (semantically similar)
- **Negative**: A different job description (semantically dissimilar)

We'll use a Large Language Model (LLM) to generate these positive and negative examples automatically. This approach allows us to:
1. Create diverse paraphrases of job descriptions
2. Generate realistic negative examples from different roles/industries
3. Scale up our training data efficiently
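
To make the role of each triplet element concrete, here is a minimal sketch of a triplet margin loss over cosine distance, one common objective for this kind of data. The notebook does not specify a training loss, so the function name `triplet_margin_loss`, the margin of 0.5, and the toy 2-D vectors are all illustrative assumptions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that reaches zero once the negative is `margin` farther away than the positive."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D "embeddings" (assumed values, not produced by any real model)
anchor = [1.0, 0.0]
positive = [0.9, 0.1]  # close to the anchor
negative = [0.0, 1.0]  # orthogonal to the anchor

well_separated = triplet_margin_loss(anchor, positive, negative)
swapped = triplet_margin_loss(anchor, negative, positive)
print(well_separated, swapped)
```

The loss is zero when the positive already sits much closer to the anchor than the negative, and grows when the two are swapped, which is why negatives that are clearly different from the anchor matter.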

### LLM Configuration
distilabel supports both local models (via Transformers) and API-based models (such as OpenAI). This example uses a local Qwen model; to switch to OpenAI, uncomment the `OpenAILLM` block in the cell below and comment out the `TransformersLLM` one.

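**For OpenAI API**: Store your API key in a `.env` file and load it into the environment before running the cell, since `os.getenv` only reads variables that are already set: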
```
OPENAI_API_KEY=sk-proj-...
```
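
A `.env` file is not read automatically by `os.getenv`. The sketch below uses only the standard library to copy `KEY=VALUE` lines into the environment; `load_env_file` is a hypothetical helper written for this example (a package such as `python-dotenv` offers a fuller implementation), and it handles only simple, unescaped assignments.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from `path` into os.environ without overriding existing values."""
    if not os.path.exists(path):
        return {}
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip().strip("'\"")
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded


load_env_file()  # no-op when there is no .env file in the working directory
api_key = os.getenv("OPENAI_API_KEY")
# Report only whether the key exists; never print the key itself
print("OPENAI_API_KEY is set" if api_key else "OPENAI_API_KEY is missing")
```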

In [5]:
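import os

# Recent distilabel releases expose LLM classes under `distilabel.models`;
# the older `distilabel.llms` path emits a deprecation warning.
from distilabel.models import OpenAILLM, TransformersLLM

# Local model loaded through Hugging Face Transformers
llm = TransformersLLM(
    model="Qwen/Qwen3-4B-Instruct-2507",
    device_map="auto",
    torch_dtype="float16",
)

# Alternative: OpenAI API (requires OPENAI_API_KEY in the environment)
# llm = OpenAILLM(
#     model="gpt-4o-mini",
#     api_key=os.getenv("OPENAI_API_KEY"),
# )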





In [None]:
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import GenerateSentencePair

# Instructions passed to the LLM alongside each anchor description
context = """
The text is a job description from the Singapore SkillsFuture Framework. Your task is to generate realistic job descriptions based on the provided description.

For the positive query, generate a realistic description for this role. Focus on creating variations that capture the essence of the role in different words, as if written by different people or organizations posting similar jobs.

For negative descriptions, you may choose from the following strategies:

1. Same industry, different seniority level (Senior → Junior or vice versa)
2. Same industry, different function (Business Valuation → Risk Management)
3. Similar skills, different domain (Financial Analysis in Banking vs Healthcare)
4. Same title, different industry context

The query should always include the job role. Start the description with "The <job role>".

"""

with Pipeline(name="generate") as pipeline:
    # Note: this variable shadows the `load_dataset` function imported from `datasets` earlier
    load_dataset = LoadDataFromHub(
        num_examples=10,  # Limit to 10 examples for demo - increase for production datasets
        use_cache=False,  # Disable caching to ensure fresh data generation each run
        output_mappings={"Job Role Description": "anchor"},  # Map original column to 'anchor' for triplet generation
    )
    generate_retrieval_pairs_easy = GenerateSentencePair(
        name="easy_triplets",
        triplet=True,  # Generate anchor-positive-negative triplets for embedding training
        hard_negative=False,  # Use easier negatives rather than hard negatives
        action="paraphrase",  # Focus on paraphrasing for positive examples
        llm=llm,  # Use the LLM configured above (local Qwen or OpenAI)
        input_batch_size=10,  # Process 10 examples at once for efficiency
        context=context,  # Provide the context instructions for generation quality
    )

    # Feed the loaded rows into the generation task
    load_dataset.connect(generate_retrieval_pairs_easy)

## Pipeline Configuration

The cell above sets up a distilabel pipeline that automatically generates positive and negative examples. The pipeline consists of:

1. **LoadDataFromHub**: Loads our SSF dataset from Hugging Face
2. **GenerateSentencePair**: Uses the LLM to create positive/negative pairs

### Context for LLM Generation
We provide specific instructions to the LLM for generating realistic job descriptions:
- **Positive examples**: Paraphrases that maintain the essence of the original role
- **Negative examples**: Use strategies like different seniority levels, functions, or industries

The pipeline is configured to generate triplets (anchor, positive, negative) for embedding training.

In [7]:
distiset = pipeline.run(
    parameters={
        load_dataset.name: {
            "repo_id": "dnth/ssf-dataset",
            "split": "train",
        },
        "easy_triplets": {
            "llm": {"generation_kwargs": {"temperature": 0.7, "max_new_tokens": 256}}
        },
    }
)

Device set to use cuda:0



## Running the Pipeline

The `pipeline.run(...)` call executes the pipeline. The key settings are:
- **num_examples**: Set to 10 on the `LoadDataFromHub` step for demonstration (increase for production)
- **temperature**: Controls randomness in LLM generation (0.7 gives varied paraphrases that still stay on topic)
- **max_new_tokens**: Maximum number of tokens generated per completion (256 here)

The pipeline processes each job description and generates a corresponding positive and negative example.
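
As a rough illustration of what the temperature setting does, the sketch below shows how temperature typically rescales token scores before sampling. The logits are made-up values and `softmax_with_temperature` is a name chosen for this example; the exact sampling procedure depends on the model backend.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # assumed scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it out, which produces more varied text.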

In [8]:
distiset

Distiset({
    default: DatasetDict({
        train: Dataset({
            features: ['Sector', 'Track', 'Job Role', 'anchor', 'Performance Expectation', 'positive', 'negative', 'distilabel_metadata', 'model_name'],
            num_rows: 10
        })
    })
})

## Results Inspection

The pipeline generates a `Distiset` object containing our synthetic dataset. Let's examine the structure and content of the generated data:

### Dataset Structure
The output includes the original fields (with `Job Role Description` renamed to `anchor`) plus the newly generated columns. The key columns are:
- **anchor**: Original job role description
- **positive**: LLM-generated paraphrase (semantically similar)
- **negative**: LLM-generated different job description (semantically dissimilar)
- **distilabel_metadata**: Generation metadata and token usage
- **model_name**: LLM model used for generation
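
Before training, it can help to drop rows where generation failed and keep only the three triplet columns. The sketch below is one possible heuristic filter, not part of distilabel: `is_valid_triplet`, `to_training_triplets`, and the sample rows are hypothetical, and the "starts with `The `" check simply mirrors the instruction given in the context prompt.

```python
def is_valid_triplet(row):
    """Heuristic check (an assumption, not a distilabel feature) on one generated row."""
    anchor = (row.get("anchor") or "").strip()
    positive = (row.get("positive") or "").strip()
    negative = (row.get("negative") or "").strip()
    if not (anchor and positive and negative):
        return False  # generation failed or a field is empty
    if positive == anchor or negative == anchor or positive == negative:
        return False  # degenerate triplet carries no training signal
    # The context prompt asks the LLM to start each description with "The <job role>"
    return positive.startswith("The ") and negative.startswith("The ")


def to_training_triplets(rows):
    """Keep only valid rows and only the three columns a triplet trainer needs."""
    return [
        {key: row[key].strip() for key in ("anchor", "positive", "negative")}
        for row in rows
        if is_valid_triplet(row)
    ]


# Hypothetical rows shaped like the pipeline output
rows = [
    {
        "anchor": "The Audit Senior leads audit teams.",
        "positive": "The Audit Senior manages audit engagements.",
        "negative": "The Junior Auditor supports audit tasks.",
    },
    {
        "anchor": "The Audit Manager manages audits.",
        "positive": None,  # simulates a failed generation
        "negative": "The Junior Auditor supports audit tasks.",
    },
]
triplets = to_training_triplets(rows)
print(len(triplets))
```

With the real output, iterating over `distiset["default"]["train"]` yields dictionaries of this shape, so the same functions could be applied to it directly.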

In [9]:
distiset["default"]["train"][0]

{'Sector': 'Accountancy',
 'Track': 'Assurance',
 'Job Role': 'Audit Associate / Audit Assistant Associate',
 'anchor': 'The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision. He/She begins to appreciate the underlying principles behind the tasks assigned to him as part of the audit plan. He is also able to make adjustments to the application of skills to improve the work tasks or solve non-complex issues. The Audit Associate/Audit Assistant Associate operates in a structured work environment. He is able to build relationships, work in a team and identify ethical issues with reference to the code of professional conduct and ethics. He is able to select and apply from a range of known solutions to familiar problems and takes responsibility for his own learning and performance. He is a trustworthy and meticulous individual.',
 'Performance Expectation': 'In accordance with: Singapore Standards on Auditing, Ethics Pronouncements in Singap

In [10]:
distiset["default"]["train"].to_pandas()

| | Sector | Track | Job Role | anchor | Performance Expectation | positive | negative | distilabel_metadata | model_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Accountancy | Assurance | Audit Associate / Audit Assistant Associate | The Audit Associate/Audit Assistant Associate ... | In accordance with: Singapore Standards on Aud... | The Audit Assistant undertakes hands-on audit ... | The Junior Financial Analyst supports daily op... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 1 | Accountancy | Assurance | Audit Manager | The Audit Senior Manager/Audit Manager manages... | In accordance with: Singapore Standards on Aud... | The Audit Senior Manager oversees a diverse ra... | The Junior Auditor supports day-to-day account... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 2 | Accountancy | Assurance | Audit Partner / Audit Director | The Audit Partner/Audit Director is a transfor... | In accordance with: Singapore Standards on Aud... | The Audit Partner or Audit Director is a visio... | The Junior Auditor is tasked with assisting in... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 3 | Accountancy | Assurance | Audit Senior | The Audit Senior is expected to team lead vari... | In accordance with: Singapore Standards on Aud... | The Audit Senior leads diverse audit projects ... | The Junior Auditor supports smaller-scale audi... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 4 | Accountancy | Business Valuation | Business Valuation Associate / Business Valuat... | The Business Valuation Associate/Business Valu... | In accordance with the International Valuation... | The Business Valuation Analyst supports the fu... | The Junior Accountant maintains daily bookkeep... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 5 | Accountancy | Business Valuation | Business Valuation Manager | The Business Valuation Manager is second in ch... | In accordance with the International Valuation... | The Business Valuation Lead oversees key funct... | The Junior Business Analyst supports the opera... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 6 | Accountancy | Business Valuation | Business Valuation Partner / Business Valuatio... | The Business Valuation Partner/Business Valuat... | In accordance with the International Valuation... | The Business Valuation Director leads a dynami... | The Junior Financial Analyst supports daily ac... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 7 | Accountancy | Business Valuation | Business Valuation Senior / Business Valuation... | The Business Valuation Senior/Business Valuati... | In accordance with the International Valuation... | The Business Valuation Specialist is responsib... | The IT Support Technician provides assistance ... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 8 | Accountancy | Enterprise Risk Management | Chief Risk Officer / Risk Partner / Head of Ri... | The Chief Risk Officer/Risk Partner/Head of Ri... | | The Enterprise Risk Management Director overse... | The Junior Risk Analyst supports the daily mon... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |
| 9 | Accountancy | Enterprise Risk Management | Enterprise Risk Management Associate / Enterpr... | The Enterprise Risk Management Associate/Enter... | | The Enterprise Risk Management Support Special... | The Junior Accounts Payable Clerk is responsib... | {'raw_input_easy_triplets': [{'content': 'Your... | Qwen/Qwen3-4B-Instruct-2507 |


## Publishing to Hugging Face Hub

Finally, we can push our synthetic dataset to Hugging Face Hub for sharing and future use. This makes the dataset easily accessible for training embedding models or other downstream tasks.

The dataset will be uploaded with all the generated triplets and metadata, ready for use in embedding training pipelines.

In [11]:
distiset["default"].push_to_hub(repo_id="dnth/ssf-dataset-synthetic", revision="main")


CommitInfo(commit_url='https://huggingface.co/datasets/dnth/ssf-dataset-synthetic/commit/7b5bbd8fb81244171bf5736d5280972500942798', commit_message='Upload dataset', commit_description='', oid='7b5bbd8fb81244171bf5736d5280972500942798', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/dnth/ssf-dataset-synthetic', endpoint='https://huggingface.co', repo_type='dataset', repo_id='dnth/ssf-dataset-synthetic'), pr_revision=None, pr_num=None)