## Dataset Preparation

This notebook creates train and validation splits for the generated synthetic dataset.

## Load Dataset

Load the dataset from the Hugging Face (HF) Hub.

In [1]:
from datasets import load_dataset

# The second argument selects the dataset configuration (subset) to load.
dataset = load_dataset("dnth/ssf-synthetic-data-for-retriever-openai", "generate_retrieval_pairs_easy")


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sector', 'Track', 'Job Role', 'anchor', 'Performance Expectation', 'positive', 'negative', 'distilabel_metadata', 'model_name'],
        num_rows: 1885
    })
})

In [6]:
# Keep only the triplet columns and drop the metadata columns.
dataset = dataset['train'].select_columns(['anchor', 'positive', 'negative'])
dataset

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 1885
})

## Random Split

Randomly shuffle the dataset, then split it.

In [7]:
dataset = dataset.shuffle()
dataset

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 1885
})
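The call above passes no seed, so the row order, and therefore the resulting split, will likely differ between runs. `Dataset.shuffle` accepts a `seed` argument for this purpose. The library-free sketch below illustrates the idea on plain indices; the helper name `shuffled_indices` and the seed value 42 are assumptions for illustration, not part of the original notebook.

```python
import random


def shuffled_indices(n: int, seed: int) -> list[int]:
    """Return a permutation of range(n) that is fully determined by the seed."""
    indices = list(range(n))
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(indices)
    return indices


# 1885 is the row count reported above; 42 is an arbitrary illustrative seed.
first = shuffled_indices(1885, seed=42)
second = shuffled_indices(1885, seed=42)

# Same seed, same order: the two permutations are identical.
assert first == second
# Shuffling only reorders rows; every index still appears exactly once.
assert sorted(first) == list(range(1885))
```

With a fixed seed, rerunning the notebook should assign the same rows to the same split.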

Make an 80/20 split of the dataset.

In [9]:
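# 80% of the rows go to training; the remainder goes to validation.
train_size = int(0.8 * len(dataset))
valid_size = len(dataset) - train_size

# The dataset is already shuffled, so contiguous index ranges are random samples.
train_dataset = dataset.select(range(train_size))
valid_dataset = dataset.select(range(train_size, train_size + valid_size))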

In [10]:
train_dataset

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 1508
})

In [11]:
valid_dataset

Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 377
})
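The size arithmetic of the split can be checked without the dataset itself. The following is a minimal sketch, assuming a hypothetical helper `split_sizes`; it mirrors the `int(0.8 * len(dataset))` computation above and confirms that the two index ranges are disjoint and cover every row.

```python
def split_sizes(n_rows: int, train_fraction: float = 0.8) -> tuple[int, int]:
    """Mirror the notebook's arithmetic: truncate the train size, give the rest to valid."""
    train_size = int(train_fraction * n_rows)
    valid_size = n_rows - train_size
    return train_size, valid_size


n_rows = 1885  # row count reported for the dataset above
train_size, valid_size = split_sizes(n_rows)

# Index ranges equivalent to the two select() calls.
train_idx = range(train_size)
valid_idx = range(train_size, train_size + valid_size)

# The ranges must not overlap and must cover every row exactly once.
assert set(train_idx).isdisjoint(valid_idx)
assert len(train_idx) + len(valid_idx) == n_rows

print(train_size, valid_size)  # 1508 377, matching the outputs above
```

Because `int()` truncates, any fractional row goes to the validation split. The `datasets` library also provides `Dataset.train_test_split`, which shuffles and splits in one call and names the held-out split `test` rather than `valid`.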

## Upload to HF

Upload the train and validation splits to the HF Hub as a new dataset repository, `dnth/ssf-train-valid`.

In [14]:
from datasets import DatasetDict

ds = DatasetDict({
    "train": train_dataset,
    "valid": valid_dataset
})

ds

DatasetDict({
    train: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 1508
    })
    valid: Dataset({
        features: ['anchor', 'positive', 'negative'],
        num_rows: 377
    })
})

In [16]:
ds.push_to_hub("dnth/ssf-train-valid")

CommitInfo(commit_url='https://huggingface.co/datasets/dnth/ssf-train-valid/commit/f6a14257e9e3d6f026006f2180bfebe676b39d01', commit_message='Upload dataset', commit_description='', oid='f6a14257e9e3d6f026006f2180bfebe676b39d01', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/dnth/ssf-train-valid', endpoint='https://huggingface.co', repo_type='dataset', repo_id='dnth/ssf-train-valid'), pr_revision=None, pr_num=None)