## Using distilabel for synthetic data generation

This notebook shows how to use `distilabel` to generate synthetic data to train a customized embedding model.

First, install the required packages by uncommenting and running the following cell.

In [1]:
# !pip install git+https://github.com/dnth/rag-datakit.git

## Load and inspect the dataset

The dataset comes from the SkillsFuture Skills Frameworks: https://jobsandskills.skillsfuture.gov.sg/frameworks/skills-frameworks

I've uploaded the Excel file to the Hugging Face dataset repo `dnth/ssf-dataset`.

In [2]:
from datasets import load_dataset

dataset = load_dataset("dnth/ssf-dataset")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sector', 'Track', 'Job Role', 'Job Role Description', 'Performance Expectation'],
        num_rows: 1885
    })
})

In [4]:
dataset['train'][0]

{'Sector': 'Accountancy',
 'Track': 'Assurance',
 'Job Role': 'Audit Associate / Audit Assistant Associate',
 'Job Role Description': 'The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision. He/She begins to appreciate the underlying principles behind the tasks assigned to him as part of the audit plan. He is also able to make adjustments to the application of skills to improve the work tasks or solve non-complex issues. The Audit Associate/Audit Assistant Associate operates in a structured work environment. He is able to build relationships, work in a team and identify ethical issues with reference to the code of professional conduct and ethics. He is able to select and apply from a range of known solutions to familiar problems and takes responsibility for his own learning and performance. He is a trustworthy and meticulous individual.',
 'Performance Expectation': 'In accordance with: Singapore Standards on Auditing, Ethics Pronouncem

## Generate positive and negative queries

To train an embedding model, we need to generate a positive and a negative query for each anchor.
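To make the role of the three texts concrete, here is a minimal sketch of a triplet objective on cosine distance. It is an illustration only: the notebook does not say which loss is used for training, and the toy vectors and the `margin` value are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on cosine distance (1 - similarity)."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 3-d "embeddings", made up for illustration only.
anchor = [1.0, 0.0, 0.0]
positive = [0.9, 0.1, 0.0]
negative = [0.0, 1.0, 0.0]

# Positive close to the anchor, negative far away: no penalty.
loss_good = triplet_loss(anchor, positive, negative)
# Roles swapped: the loss becomes large.
loss_bad = triplet_loss(anchor, negative, positive)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, which is why each anchor needs both kinds of query.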

In this example, the anchor is the job role description. For simplicity, we use an OpenAI model for generation. Make sure the `OPENAI_API_KEY` environment variable is set.

In [5]:
# `!export` runs in a subshell and does not reach the notebook kernel, so use the %env magic instead.
# %env OPENAI_API_KEY=your_openai_api_key_here
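Since a missing key only surfaces once the pipeline starts calling the API, it can help to fail early. The helper below is a hypothetical convenience, not part of `distilabel`; it simply reads the environment variable and raises a clear error if it is empty.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; set it before creating the LLM client."
        )
    return value


# Usage (uncomment once the key is set):
# api_key = require_env("OPENAI_API_KEY")
```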

In [6]:
import os
from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o-mini",
    api_key=os.getenv("OPENAI_API_KEY"),
)


In [7]:
context = """
The text is a job description from the Singapore SkillsFuture Skills Framework.

Your task is to generate realistic search queries that users would input when looking for similar job roles.

Users typically search by:
- Inputting partial job descriptions or requirements they're looking for
- Describing skills, responsibilities, or qualifications they want to match
- Using job titles or role descriptions as search terms
- Mentioning specific domains, industries, or technical requirements

The generated query should represent how someone would search for or describe a job opening similar to the given job description. 
The generated query should be in English. The generated query should not be a question.
The generated query should contain about the same amount of words as the original job description.

Respond in this exact format using ## before the positive and negative queries:

## Positive\n your positive query here
## Negative\n your negative description here
"""

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import GenerateSentencePair

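with Pipeline(name="generate") as pipeline:
    # Load 10 rows and expose "Job Role Description" as the "anchor" column.
    # Note: this name shadows the `load_dataset` function imported from `datasets` earlier.
    load_dataset = LoadDataFromHub(
        num_examples=10,
        use_cache=False,
        output_mappings={"Job Role Description": "anchor"},
    )
    # Generate one (anchor, positive, negative) triplet per row, with easy negatives.
    generate_retrieval_pairs_easy = GenerateSentencePair(
        name="easy_triplets",
        triplet=True,
        hard_negative=False,
        action="paraphrase",
        llm=llm,
        input_batch_size=10,
        context=context,
    )

    # Feed the loaded rows into the generation task.
    load_dataset.connect(generate_retrieval_pairs_easy)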
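The context above asks the model to answer with `## Positive` and `## Negative` headers. The task is expected to handle its own output parsing, so the following is only a sketch of what reading that format could look like; `parse_pair` and the sample response are hypothetical.

```python
import re

# Hypothetical pattern: a "## Positive" header, its text, then a "## Negative" header and its text.
_PAIR_PATTERN = re.compile(
    r"##\s*Positive\s*\n(?P<positive>.*?)\n\s*##\s*Negative\s*\n(?P<negative>.*)",
    re.DOTALL,
)


def parse_pair(response: str):
    """Return {'positive': ..., 'negative': ...} or None if the format does not match."""
    match = _PAIR_PATTERN.search(response)
    if match is None:
        return None
    return {
        "positive": match.group("positive").strip(),
        "negative": match.group("negative").strip(),
    }


# Made-up response in the requested format.
sample = (
    "## Positive\n"
    "Audit associate role supporting audit engagements\n"
    "## Negative\n"
    "Marketing executive managing social media campaigns"
)
parsed = parse_pair(sample)
```

Returning `None` for malformed responses makes it easy to spot generations that ignored the requested format.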

In [8]:
distiset = pipeline.run(
    parameters={
        load_dataset.name: {
            "repo_id": "dnth/ssf-dataset",
            "split": "train",
        },
        "easy_triplets": {
            "llm": {"generation_kwargs": {"temperature": 0.3, "max_new_tokens": 512}}
        },
    }
)

In [9]:
distiset

Distiset({
    default: DatasetDict({
        train: Dataset({
            features: ['Sector', 'Track', 'Job Role', 'anchor', 'Performance Expectation', 'positive', 'negative', 'distilabel_metadata', 'model_name'],
            num_rows: 10
        })
    })
})
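The output now carries `positive` and `negative` columns next to `anchor`. Before training, it may be worth dropping rows where generation failed or produced identical queries. The check below is a minimal sketch with made-up rows; `is_valid_triplet` is a hypothetical helper.

```python
def is_valid_triplet(row: dict) -> bool:
    """True when anchor, positive and negative are non-empty strings and the two queries differ."""
    texts = [row.get(key) for key in ("anchor", "positive", "negative")]
    if not all(isinstance(text, str) and text.strip() for text in texts):
        return False
    return texts[1].strip() != texts[2].strip()


# Made-up rows for illustration only.
rows = [
    {"anchor": "a", "positive": "p", "negative": "n"},
    {"anchor": "a", "positive": None, "negative": "n"},
    {"anchor": "a", "positive": "same", "negative": "same"},
]
valid_rows = [row for row in rows if is_valid_triplet(row)]

# With a Hugging Face dataset, the same predicate could be passed to `filter`:
# clean = distiset["default"]["train"].filter(is_valid_triplet)
```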

In [10]:
distiset["default"]["train"][0]

{'Sector': 'Accountancy',
 'Track': 'Assurance',
 'Job Role': 'Audit Associate / Audit Assistant Associate',
 'anchor': 'The Audit Associate/Audit Assistant Associate undertakes specific stages of audit work under supervision. He/She begins to appreciate the underlying principles behind the tasks assigned to him as part of the audit plan. He is also able to make adjustments to the application of skills to improve the work tasks or solve non-complex issues. The Audit Associate/Audit Assistant Associate operates in a structured work environment. He is able to build relationships, work in a team and identify ethical issues with reference to the code of professional conduct and ethics. He is able to select and apply from a range of known solutions to familiar problems and takes responsibility for his own learning and performance. He is a trustworthy and meticulous individual.',
 'Performance Expectation': 'In accordance with: Singapore Standards on Auditing, Ethics Pronouncements in Singap

In [11]:
distiset["default"]["train"].to_pandas()

| | Sector | Track | Job Role | anchor | Performance Expectation | positive | negative | distilabel_metadata | model_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Accountancy | Assurance | Audit Associate / Audit Assistant Associate | The Audit Associate/Audit Assistant Associate ... | In accordance with: Singapore Standards on Aud... | Job opening for an Audit Associate or Audit As... | Seeking a position involving independent proje... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 1 | Accountancy | Assurance | Audit Manager | The Audit Senior Manager/Audit Manager manages... | In accordance with: Singapore Standards on Aud... | Search for Audit Senior Manager or Audit Manag... | Looking for entry-level positions in marketing... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 2 | Accountancy | Assurance | Audit Partner / Audit Director | The Audit Partner/Audit Director is a transfor... | In accordance with: Singapore Standards on Aud... | Looking for an Audit Partner or Director to le... | Seeking a financial analyst to manage daily tr... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 3 | Accountancy | Assurance | Audit Senior | The Audit Senior is expected to team lead vari... | In accordance with: Singapore Standards on Aud... | Audit Senior role leading audit engagements, m... | Seeking a project manager to oversee construct... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 4 | Accountancy | Business Valuation | Business Valuation Associate / Business Valuat... | The Business Valuation Associate/Business Valu... | In accordance with the International Valuation... | Business Valuation Analyst role requiring hand... | Entry-level position in a completely different... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 5 | Accountancy | Business Valuation | Business Valuation Manager | The Business Valuation Manager is second in ch... | In accordance with the International Valuation... | Business Valuation Manager overseeing valuatio... | Financial Analyst responsible for data analysi... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 6 | Accountancy | Business Valuation | Business Valuation Partner / Business Valuatio... | The Business Valuation Partner/Business Valuat... | In accordance with the International Valuation... | Job opening for a Business Valuation Director ... | Seeking a project manager with experience in s... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 7 | Accountancy | Business Valuation | Business Valuation Senior / Business Valuation... | The Business Valuation Senior/Business Valuati... | In accordance with the International Valuation... | Senior Business Valuation Executive responsibl... | Entry-level position focused on administrative... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 8 | Accountancy | Enterprise Risk Management | Chief Risk Officer / Risk Partner / Head of Ri... | The Chief Risk Officer/Risk Partner/Head of Ri... | | Job opening for a Chief Risk Officer responsib... | Seeking a candidate for a role focused on enha... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
| 9 | Accountancy | Enterprise Risk Management | Enterprise Risk Management Associate / Enterpr... | The Enterprise Risk Management Associate/Enter... | | Job opening for an Enterprise Risk Management ... | Seeking a candidate for a role involving custo... | {'raw_input_easy_triplets': [{'content': 'Your... | gpt-4o-mini |
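The prompt asks for queries with about as many words as the original job description. One rough way to check how closely the model followed that instruction is to compare word counts. This is a sketch only; `length_ratio` is a hypothetical helper and whitespace splitting is a crude notion of "word".

```python
def length_ratio(anchor: str, query: str) -> float:
    """Word count of the query divided by the word count of the anchor."""
    anchor_words = len(anchor.split())
    if anchor_words == 0:
        raise ValueError("anchor must contain at least one word")
    return len(query.split()) / anchor_words


# Made-up strings for illustration only.
ratio = length_ratio("one two three four", "alpha beta")

# On the generated rows, this could be applied per example, e.g.:
# ratios = [length_ratio(r["anchor"], r["positive"]) for r in distiset["default"]["train"]]
```

Ratios far below 1 would suggest the generated queries are much shorter than the descriptions they are meant to match.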


In [12]:
distiset.push_to_hub(repo_id="dnth/ssf-dataset-synthetic", revision="main")

No files have been modified since last commit. Skipping to prevent empty commit.
--- Logging error ---
Traceback (most recent call last):
  File "/home/dnth/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/logging/handlers.py", line 1496, in emit
    self.enqueue(self.prepare(record))
  File "/home/dnth/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/logging/handlers.py", line 1454, in enqueue
    self.queue.put_nowait(record)
  File "/home/dnth/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 138, in put_nowait
    return self.put(obj, False)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/dnth/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 88, in put
    raise ValueError(f"Queue {self!r} is closed")
ValueError: Queue <multiprocessing.queues.Queue object at 0x79a1ff90bc80> is closed
Call stack:
  File "<frozen runpy>", line 198, in _run