In [22]:
!pip uninstall -y transformers torch torchvision

Found existing installation: transformers 4.52.3
Uninstalling transformers-4.52.3:
  Successfully uninstalled transformers-4.52.3
Found existing installation: torch 2.7.0
Uninstalling torch-2.7.0:
  Successfully uninstalled torch-2.7.0

In [59]:
%pip install --upgrade distilabel


Note: you may need to restart the kernel to use updated packages.


In [23]:
!pip install git+https://github.com/dnth/rag-datakit.git

Collecting git+https://github.com/dnth/rag-datakit.git
  Cloning https://github.com/dnth/rag-datakit.git to /private/var/folders/n3/bmyftghn5h9gg0sv_h_mjkw40000gn/T/pip-req-build-byls0dqx
  Running command git clone --quiet https://github.com/dnth/rag-datakit.git /private/var/folders/n3/bmyftghn5h9gg0sv_h_mjkw40000gn/T/pip-req-build-byls0dqx
  Resolved https://github.com/dnth/rag-datakit.git to commit 38d61d3854c6ef6d7209ba88af4f3e43dd89a165
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate>=1.10.0 (from rag-datakit==0.1.0)
  Downloading accelerate-1.10.0-py3-none-any.whl.metadata (19 kB)
Collecting datasets>=4.0.0 (from rag-datakit==0.1.0)
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pillow>=11.3.0 (from rag-datakit==0.1.0)
  Downloading pillow-11.3.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (9.0 kB)
Collecting to

In [44]:
from datasets import load_dataset

dataset = load_dataset("dnth/ssf-dataset")

In [45]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sector', 'Track', 'Job Role', 'Job Role Description', 'Performance Expectation'],
        num_rows: 1885
    })
})

In [46]:
dataset['train'][100]

{'Sector': 'Aerospace',
 'Track': 'Aircraft Maintenance',
 'Job Role': 'Technician (Avionics)',
 'Job Role Description': "The Technician (Avionics) performs maintenance tasks for aircraft avionics systems in accordance with relevant technical manuals and standard operating procedures (SOPs). He/She examines parts for maintenance, repair or replacement and checks serviceability of electrical components. He troubleshoots system failures, takes corrective actions to restore aircraft avionics systems and components to performance requirements and documents all completed tasks. He may be authorised by the organisation to perform quality control functions, including inspection of incoming materials and outgoing serviced items and registration of non-conformances. He complies with airworthiness and legislative requirements, and the organisation's safety, health and quality systems. He supports in implementation of continuous improvement initiatives and lean practices. He works in a hangar or 

In [47]:
import os
from dotenv import load_dotenv
from distilabel.models import OpenAILLM

# Load the .env file
load_dotenv()

# Get the API key from environment
api_key = os.getenv("OPENAI_API_KEY")

# Set up the model
llm = OpenAILLM(
    model="gpt-4o-mini",
    api_key=api_key,
)
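
`os.getenv` returns `None` when the variable is missing, so a forgotten `.env` entry may only surface later, when the first request is sent. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of distilabel or python-dotenv.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value


# Possible usage in the cell above (assumes load_dotenv() has already run):
# api_key = require_env("OPENAI_API_KEY")
```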


In [64]:
context = """
The text is a job description from the Singapore SkillsFuture Framework. Your task is to generate realistic job descriptions based on the provided description.

For the positive description, generate a realistic and varied description for the role. Ensure it reflects the core responsibilities and requirements of the job, capturing the essence in different phrasings,
as if the description were written by an HR professional posting the job.

For negative descriptions, apply one of the following strategies:
1. Same industry, different seniority level (e.g., Senior → Junior or vice versa).
2. Same industry, different function (e.g., Business Valuation → Risk Management).
3. Similar skills, different domain (e.g., Financial Analysis in Banking vs. Healthcare).
4. Same title, different industry context (e.g., Marketing Manager in Retail vs. Tech).

Each output should begin with "The <job role>" and be a complete job description written in full sentences. Do not end outputs abruptly or cut off mid-sentence.

"""


context_with_reasoning = """
The text is a job description from the Singapore SkillsFuture Framework. Your task is to generate realistic job descriptions based on the provided description.

1. Positive example: Generate a realistic and varied description for the role. Ensure it reflects the core responsibilities and requirements of the job, capturing the essence in different phrasings,
   as if the description were written by an HR professional posting the job.

2. Negative example: Apply one of the following strategies:
   - Same industry, different seniority level (e.g., Senior → Junior or vice versa).
   - Same industry, different function (e.g., Business Valuation → Risk Management).
   - Similar skills, different domain (e.g., Financial Analysis in Banking vs. Healthcare).
   - Same title, different industry context (e.g., Marketing Manager in Retail vs. Tech).

   When generating hard negatives, prioritize making them hard to distinguish from the positive:
   - Roles that sound similar or use similar language but differ in responsibilities, required skills, or expected outcomes.
   - Confusing cases where job titles overlap across industries.
   - Minimal changes in wording but a meaningful change in the nature of the job.
   - The negative description should be deceptively similar to the positive: it should look and feel like the original job, but be functionally different in terms of core responsibilities.
   - It should describe another job role that is closely related to the original.

3. Reason: After generating the negative, explain briefly how it differs from the anchor. Focus on differences in domain, function, seniority, or job outcome.

Format:
- Anchor: <original>
- Positive: <paraphrased>
- Negative: <deceptively different job>
- Reason: <brief explanation of why the negative is distinct, including the name of the role the negative describes>

Each description must begin with "The <job role>" and be complete and well-formed.
"""

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import GenerateSentencePair

with Pipeline(name="generate") as pipeline:
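    # Note: this rebinds the name `load_dataset`, shadowing the function imported from `datasets` earlier
    load_dataset = LoadDataFromHub(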
        num_examples=10,  # Limit to 10 examples for demo - increase for production datasets
        use_cache=False,  # Disable caching to ensure fresh data generation each run
        output_mappings={"Job Role Description": "anchor"},  # Map original column to 'anchor' for triplet generation
    )
    generate_retrieval_pairs_easy = GenerateSentencePair(
        name="easy_triplets",
        triplet=True,  # Generate anchor-positive-negative triplets for embedding training
        hard_negative=False,  # Use easier negatives rather than hard negatives
        action="paraphrase",  # Focus on paraphrasing for positive examples
        llm=llm,  # Use the OpenAI LLM (gpt-4o-mini) configured above
        input_batch_size=10,  # Process 10 examples at once for efficiency
        context=context,  # Provide the context instructions for generation quality
    )
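    generate_retrieval_pairs_hard = GenerateSentencePair(
        name="hard_triplets",
        triplet=True,  # Generate anchor-positive-negative triplets
        hard_negative=True,  # Ask for negatives that are hard to tell apart from the positive
        action="paraphrase",  # Positives are paraphrases of the anchor
        llm=llm,  # Same LLM as the easy branch
        input_batch_size=2,  # Process 2 examples per batch
        context=context_with_reasoning,  # Context that also asks for a reason
    )

    # Fan out: the loader step feeds both generation tasks
    load_dataset.connect(generate_retrieval_pairs_easy, generate_retrieval_pairs_hard)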

In [65]:
distiset = pipeline.run(
    use_cache=False,
    parameters={
        load_dataset.name: {
            "repo_id": "dnth/ssf-dataset",
            "split": "train",
        },
        "easy_triplets": {
            "llm": {"generation_kwargs": {"temperature": 0.6, "max_new_tokens": 512}}
        },
        "hard_triplets": {
            "llm": {"generation_kwargs": {"temperature": 0.4, "max_new_tokens": 512}}
        },
    }
)

Generating train split: 10 examples [00:00, 1624.19 examples/s]
Generating train split: 10 examples [00:00, 1119.68 examples/s]


In [68]:
distiset

Distiset({
    easy_triplets: DatasetDict({
        train: Dataset({
            features: ['Sector', 'Track', 'Job Role', 'anchor', 'Performance Expectation', 'positive', 'negative', 'distilabel_metadata', 'model_name'],
            num_rows: 10
        })
    })
    hard_triplets: DatasetDict({
        train: Dataset({
            features: ['Sector', 'Track', 'Job Role', 'anchor', 'Performance Expectation', 'positive', 'negative', 'distilabel_metadata', 'model_name'],
            num_rows: 10
        })
    })
})
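
Both prompts ask that every generated description begin with "The <job role>" and not stop mid-sentence. A minimal sketch of how that could be spot-checked on the generated rows is shown here; `find_malformed_rows` and the toy `sample_rows` are hypothetical and for illustration only, and treating a missing final `.`, `!` or `?` as "cut off" is a rough heuristic, not something the pipeline itself enforces.

```python
def find_malformed_rows(rows, fields=("positive", "negative")):
    """Return (row_index, field, reason) for descriptions that break the prompt's format rules."""
    problems = []
    for i, row in enumerate(rows):
        for field in fields:
            text = (row.get(field) or "").strip()
            if not text:
                problems.append((i, field, "empty"))
            elif not text.startswith("The "):
                problems.append((i, field, "does not start with 'The '"))
            elif text[-1] not in ".!?":
                # Heuristic: no sentence-ending punctuation suggests truncation
                problems.append((i, field, "may be cut off mid-sentence"))
    return problems


# Toy rows for illustration (not taken from the real dataset)
sample_rows = [
    {"positive": "The Audit Senior leads audit engagements.",
     "negative": "The Audit Associate supports audit work."},
    {"positive": "An auditor who reviews accounts.",
     "negative": "The Audit Manager oversees the"},
]
print(find_malformed_rows(sample_rows))
```

On the real output, the same function could be called as `find_malformed_rows(distiset["hard_triplets"]["train"])`, since iterating a `Dataset` yields one dict per row.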

In [69]:
distiset["hard_triplets"]["train"].to_pandas()

Unnamed: 0,Sector,Track,Job Role,anchor,Performance Expectation,positive,negative,distilabel_metadata,model_name
0,Accountancy,Assurance,Audit Associate / Audit Assistant Associate,The Audit Associate/Audit Assistant Associate ...,In accordance with: Singapore Standards on Aud...,The Audit Associate/Audit Assistant Associate ...,The Audit Manager is tasked with overseeing th...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
1,Accountancy,Assurance,Audit Manager,The Audit Senior Manager/Audit Manager manages...,In accordance with: Singapore Standards on Aud...,The Audit Senior Manager/Audit Manager oversee...,The Audit Senior Associate/Audit Analyst suppo...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
2,Accountancy,Assurance,Audit Partner / Audit Director,The Audit Partner/Audit Director is a transfor...,In accordance with: Singapore Standards on Aud...,The Audit Partner/Audit Director serves as a p...,The Audit Manager is a key figure who directs ...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
3,Accountancy,Assurance,Audit Senior,The Audit Senior is expected to team lead vari...,In accordance with: Singapore Standards on Aud...,The Audit Senior is responsible for leading au...,The Audit Associate is tasked with supporting ...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
4,Accountancy,Business Valuation,Business Valuation Associate / Business Valuat...,The Business Valuation Associate/Business Valu...,In accordance with the International Valuation...,The Business Valuation Associate/Business Valu...,The Business Valuation Associate/Business Valu...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
5,Accountancy,Business Valuation,Business Valuation Manager,The Business Valuation Manager is second in ch...,In accordance with the International Valuation...,The Business Valuation Manager plays a pivotal...,The Business Valuation Manager is responsible ...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
6,Accountancy,Business Valuation,Business Valuation Partner / Business Valuatio...,The Business Valuation Partner/Business Valuat...,In accordance with the International Valuation...,The Business Valuation Partner/Business Valuat...,The Business Valuation Associate/Business Valu...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
7,Accountancy,Business Valuation,Business Valuation Senior / Business Valuation...,The Business Valuation Senior/Business Valuati...,In accordance with the International Valuation...,The Business Valuation Senior/Business Valuati...,The Business Valuation Junior/Business Valuati...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
8,Accountancy,Enterprise Risk Management,Chief Risk Officer / Risk Partner / Head of Ri...,The Chief Risk Officer/Risk Partner/Head of Ri...,,The Chief Risk Officer oversees the comprehens...,The Chief Compliance Officer oversees the regu...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini
9,Accountancy,Enterprise Risk Management,Enterprise Risk Management Associate / Enterpr...,The Enterprise Risk Management Associate/Enter...,,The Enterprise Risk Management Associate/Execu...,The Enterprise Risk Management Associate/Execu...,{'raw_input_hard_triplets': [{'content': 'Your...,gpt-4o-mini


In [71]:
distiset.push_to_hub("Fatin757/ssf-dataset-synthetic-with-reasons")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 104.66ba/s]
'(ReadTimeoutError("HTTPSConnectionPool(host='hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com', port=443): Read timed out. (read timeout=None)"), '(Request ID: cb5fc741-c4c3-4dad-9ee1-3ed30b6127a2)')' thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/cf/65/cf65f4bffd8a12cfb53d263e8c18b4b84c3f95c4265b41c529f8c4dbf2c5d40a/fd019cadbdb6c9acc940dd3b7e119dadd283836b83d8012f1689dcfa64862a3c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20250814%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250814T052239Z&X-Amz-Expires=900&X-Amz-Signature=a7c9e8e7bd63c1cf19b9f82a3b2b5c39ca29497f4d33dc3102853be7f8997617&X-Amz-SignedHeaders=host&x-amz-storage-class=INTELLIGENT_TIERING&x-id=PutObject
--- Logging error ---
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/TA/lib/python3.13/sit
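
The log above shows a read timeout during the upload. If the push ultimately fails and the timeout is transient (an assumption; the truncated traceback does not confirm either), one option is to retry the call with exponential backoff. The sketch below is generic Python; `retry_with_backoff` is a hypothetical helper, not a distilabel or huggingface_hub API, and catching every `Exception` is deliberately broad for brevity.

```python
import time


def retry_with_backoff(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action(); on failure wait base_delay * 2**n seconds and try again.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:  # broad on purpose; narrow this in real use
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)


# Possible usage with the push from the cell above:
# retry_with_backoff(
#     lambda: distiset.push_to_hub("Fatin757/ssf-dataset-synthetic-with-reasons")
# )
```

The `sleep` parameter exists only so the waiting can be swapped out, for example when trying the helper without real delays.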