# Custom Data Preprocessing for JPF/SPF-WCA Dataset

This notebook shows how we preprocess our custom dataset. We implement a `make_map_fn` function that returns a per-row mapping function, which takes each example's answer and formats the example into the required structure. The steps are:

- Loading the dataset that we created manually.
- Processing each example using a custom mapping function:
    - Constructing a data item with the fields: `data_source`, `prompt`, `ability`, `reward_model`, and `extra_info`.
- Saving the processed dataset in parquet format locally.
- Copying the local data to HDFS.

You can modify these functions to suit your own dataset or task requirements.


In [1]:
import os

import pandas as pd

# Paths to the manually created base and instruct variants of the dataset.
directory_base = "work/invaR1ant-veRL/data/v2/output/base/data.parquet"
directory_instruct = "work/invaR1ant-veRL/data/v2/output/instruct/data.parquet"

dataframe_base = pd.read_parquet(directory_base)
dataframe_instruct = pd.read_parquet(directory_instruct)

In [2]:
dataframe_base.loc[0]

problem                                           SimpleAscendingLast
example_indices                                                   [1]
examples                             [{'index': 1, 'solution': None}]
question            A conversation between User and Assistant. The...
answer_index                                                        2
answer_constants     (declare-const in0 Int)\n(declare-const in1 Int)
answer_solution                               (assert  ( <  in1 in0))
Name: 0, dtype: object

In [3]:
dataframe_instruct.loc[0]

problem                                           SimpleAscendingLast
example_indices                                                   [1]
examples                             [{'index': 1, 'solution': None}]
question            <|im_start|>system\nYou are a helpful assistan...
answer_index                                                        2
answer_constants     (declare-const in0 Int)\n(declare-const in1 Int)
answer_solution                               (assert  ( <  in1 in0))
Name: 0, dtype: object

In [4]:
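# Check whether any row has an `examples` list in which every solution is None.
# The notebook displays only the result of the last expression (the instruct dataframe).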
dataframe_base["examples"].apply(lambda x: all([e["solution"] is None for e in x])).any()
dataframe_instruct["examples"].apply(lambda x: all([e["solution"] is None for e in x])).any()

np.True_
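
To make the check concrete, here is a minimal plain-Python sketch of the predicate passed to `apply`. The helper name `all_solutions_missing` and the sample rows are hypothetical and are not taken from the dataset. One detail worth knowing: `all()` over an empty iterable is `True`, so a row with an empty `examples` list would also be flagged.

```python
def all_solutions_missing(examples):
    """Return True when every example dict has a None solution."""
    return all(e["solution"] is None for e in examples)


# Hypothetical sample rows, shaped like the `examples` column above.
no_solutions = [{"index": 1, "solution": None}]
some_solution = [
    {"index": 1, "solution": None},
    {"index": 2, "solution": "(assert ( < in1 in0))"},
]

print(all_solutions_missing(no_solutions))   # True
print(all_solutions_missing(some_solution))  # False
print(all_solutions_missing([]))             # True: all() of an empty iterable
```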

In [5]:
# Show every row whose examples all have a None solution
dataframe_base[dataframe_base["examples"].apply(lambda x: all([e["solution"] is None for e in x]))]


Unnamed: 0,problem,example_indices,examples,question,answer_index,answer_constants,answer_solution
0,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,2,(declare-const in0 Int)\n(declare-const in1 Int),(assert ( < in1 in0))
1,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,3,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and ( < in0 in1) ( < in2 in0)))
2,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,4,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and ( < in0 in1) ( < in1 in2...
3,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,5,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and (and ( < in0 in1) ( < in...
4,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,6,(declare-const in5 Int)\n(declare-const in0 In...,(assert (and (and (and (and ( < in0 in1) ( ...
...,...,...,...,...,...,...,...
23625,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,2,(declare-const in0 Int)\n(declare-const in1 Int),(assert ( = in0 in1))
23626,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,3,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and ( = in0 in1) ( < in1 in2)))
23627,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,4,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and ( = in0 in1) ( = in1 in2...
23628,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",A conversation between User and Assistant. The...,5,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and (and ( = in0 in1) ( = in...


In [6]:
dataframe_instruct[dataframe_instruct["examples"].apply(lambda x: all([e["solution"] is None for e in x]))]

Unnamed: 0,problem,example_indices,examples,question,answer_index,answer_constants,answer_solution
0,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,2,(declare-const in0 Int)\n(declare-const in1 Int),(assert ( < in1 in0))
1,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,3,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and ( < in0 in1) ( < in2 in0)))
2,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,4,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and ( < in0 in1) ( < in1 in2...
3,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,5,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and (and ( < in0 in1) ( < in...
4,SimpleAscendingLast,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,6,(declare-const in5 Int)\n(declare-const in0 In...,(assert (and (and (and (and ( < in0 in1) ( ...
...,...,...,...,...,...,...,...
23625,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,2,(declare-const in0 Int)\n(declare-const in1 Int),(assert ( = in0 in1))
23626,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,3,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and ( = in0 in1) ( < in1 in2)))
23627,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,4,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and ( = in0 in1) ( = in1 in2...
23628,ComplexHalfEqual,[1],"[{'index': 1, 'solution': None}]",<|im_start|>system\nYou are a helpful assistan...,5,(declare-const in0 Int)\n(declare-const in2 In...,(assert (and (and (and ( = in0 in1) ( = in...


In [7]:
dataframe_base.drop(dataframe_base[dataframe_base["examples"].apply(lambda x: all([e["solution"] is None for e in x]))].index, inplace=True)

In [8]:
dataframe_instruct.drop(dataframe_instruct[dataframe_instruct["examples"].apply(lambda x: all([e["solution"] is None for e in x]))].index, inplace=True)

In [9]:
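def make_map_fn(split):
    """Return a function that converts one dataframe row into a dataset record.

    Note: `split` is accepted but not currently used in the record.
    """

    def process_fn(row):
        data = {
            "data_source": "dannkoh/ConStruct-Base",
            # Single-turn prompt holding the full question text.
            "prompt": [{"role": "user", "content": row["question"]}],
            "ability": "generalisation",
            # Rule-style reward: the expected assertion is the ground truth.
            "reward_model": {"style": "rule", "ground_truth": row["answer_solution"]},
            "extra_info": {
                "answer_constants": row["answer_constants"],
                "answer_index": row["answer_index"],
                "example_indices": row["example_indices"],
            },
        }
        return data

    return process_fn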
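
The mapping function only indexes the row by column name, so it can be tried on a plain dict before applying it to a whole dataframe. The sketch below is a hypothetical variant, `make_map_fn_with_split`, that additionally stores the split name in `extra_info`; that extra field is an assumption for illustration and is not part of the record built in this notebook. The question text is a placeholder because the real prompt is truncated in the output above.

```python
def make_map_fn_with_split(split):
    """Variant of make_map_fn that also records the split name (assumption)."""

    def process_fn(row):
        return {
            "data_source": "dannkoh/ConStruct-Base",
            "prompt": [{"role": "user", "content": row["question"]}],
            "ability": "generalisation",
            "reward_model": {"style": "rule", "ground_truth": row["answer_solution"]},
            "extra_info": {
                "split": split,  # hypothetical extra field, not in the original
                "answer_constants": row["answer_constants"],
                "answer_index": row["answer_index"],
                "example_indices": row["example_indices"],
            },
        }

    return process_fn


# A plain dict stands in for a dataframe row; the question is a placeholder.
row = {
    "question": "<prompt text>",
    "answer_index": 2,
    "answer_constants": "(declare-const in0 Int)\n(declare-const in1 Int)",
    "answer_solution": "(assert  ( <  in1 in0))",
    "example_indices": [1],
}

record = make_map_fn_with_split("train")(row)
print(record["reward_model"]["ground_truth"])
print(record["extra_info"]["split"])
```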

In [10]:
# Split dataframe_instruct 75/25 into train and test (fixed seed for reproducibility)
train_instruct = dataframe_instruct.sample(frac=0.75, random_state=0)
test_instruct = dataframe_instruct.drop(train_instruct.index)

# Split dataframe_base the same way
train_plain = dataframe_base.sample(frac=0.75, random_state=0)
test_plain = dataframe_base.drop(train_plain.index)

# Apply `make_map_fn()` to generate datasets
train_dataset_instruct = pd.DataFrame(train_instruct.apply(make_map_fn("train"), axis=1).tolist())
test_dataset_instruct = pd.DataFrame(test_instruct.apply(make_map_fn("test"), axis=1).tolist())

train_dataset_plain = pd.DataFrame(train_plain.apply(make_map_fn("train"), axis=1).tolist())
test_dataset_plain = pd.DataFrame(test_plain.apply(make_map_fn("test"), axis=1).tolist())
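
The split logic (sample 75% of the rows for training, keep the remainder for testing) can be illustrated without pandas. The sketch below is a plain-Python analogue that uses `random.Random` rather than the pandas sampler, so the indices it selects will differ from the notebook's; the function name `split_indices` is hypothetical. It checks two properties the cell above relies on: the two parts are disjoint and together cover every row, and reusing the same seed reproduces the same partition.

```python
import random


def split_indices(n_rows, frac=0.75, seed=0):
    """Return (train, test) index lists: a seeded sample and its complement."""
    rng = random.Random(seed)
    train = sorted(rng.sample(range(n_rows), round(n_rows * frac)))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


train_idx, test_idx = split_indices(20)
print(len(train_idx), len(test_idx))  # 15 5

# Same seed, same partition.
print(split_indices(20) == (train_idx, test_idx))  # True
```

Since both dataframes are sampled with `random_state=0`, the base and instruct splits should line up row for row, provided the two frames have the same length and index.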


## Upload to Hugging Face Datasets

In [None]:
# from datasets import DatasetDict, Dataset


# datasetdict= DatasetDict({
#     "instruct.test": Dataset.from_pandas(test_dataset_instruct),
#     "instruct.train": Dataset.from_pandas(train_dataset_instruct),
#     "base.test": Dataset.from_pandas(test_dataset_plain),
#     "base.train": Dataset.from_pandas(train_dataset_plain),
# })

# datasetdict.push_to_hub("dannkoh/invaR1ant-easy")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
The token `writer` has been saved to /home/jovyan/.cache/huggingface/stored_tokens
Your token has been saved to /home/jovyan/.cache/huggingface/token
Login successful.
The current active token is: `writer`


CommitInfo(commit_url='https://huggingface.co/datasets/dannkoh/invaR1ant-easy/commit/aab4c9174243d2465e9724c3136ac9cafc7c3681', commit_message='Upload dataset', commit_description='', oid='aab4c9174243d2465e9724c3136ac9cafc7c3681', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/dannkoh/invaR1ant-easy', endpoint='https://huggingface.co', repo_type='dataset', repo_id='dannkoh/invaR1ant-easy'), pr_revision=None, pr_num=None)