# Dataset Schema Update Script (SAFE - Preserves Both Train and Test)

This notebook applies the same schema changes to BOTH the train and test splits and pushes them back together, so neither split is dropped.

In [27]:
import os
from pathlib import Path
from datasets import load_dataset, Dataset, DatasetDict
from dotenv import load_dotenv

# Load environment variables from root .env
load_dotenv(Path('../..') / '.env')

# Configuration
DATASET_REPO = "Cantina/dj-image-train-data-20251117"  # Update this to your dataset
HF_TOKEN = os.getenv('HUGGINGFACE_TOKEN')

if not HF_TOKEN:
    raise ValueError("HUGGINGFACE_TOKEN not found in environment variables")

print("✓ Environment configured")
print(f"✓ Dataset: {DATASET_REPO}")

✓ Environment configured
✓ Dataset: Cantina/dj-image-train-data-20251117


## Load BOTH Train and Test Splits

In [28]:
print("Loading dataset from Hugging Face...")
dataset_dict = load_dataset(
    DATASET_REPO,
    token=HF_TOKEN
)

print(f"\n✓ Available splits: {list(dataset_dict.keys())}")
for split_name in dataset_dict.keys():
    print(f"  - {split_name}: {len(dataset_dict[split_name])} rows")

print("\nCurrent schema (train):")
print(dataset_dict['train'].features)

print("\nFirst row sample (train):")
print(dataset_dict['train'][0])

Loading dataset from Hugging Face...




✓ Available splits: ['train', 'test']
  - train: 14821 rows
  - test: 1642 rows

Current schema (train):
{'prompt_name': Value('string'), 'new_room_unified_format_input': Value('string'), 'unified_format_output_enriched_fixed': Value('string'), 'gpt5-results-20250905': Value('string'), 'gpt5-results-20251104': Value('string'), 'last_updated_ts': Value('string')}

First row sample (train):
{'prompt_name': 'mention_not_in_history_prompt', 'new_room_unified_format_input': 'ROOM MEMBERS:[\n  {"user_name": "dj-marley", "full_name": "Marley"},\n  {"user_name": "Lucas", "full_name": "Lucas"},\n  {"user_name": "Olivia", "full_name": "Olivia"},\n  {"user_name": "dj-aria", "full_name": "Aria"}\n]\n\nCHAT HISTORY:\n\n\ndj-marley (bot): Just wrapped up my blues-rock playlist, lots of soulful guitar tonight. Hope everyone vibed with it!\ndj-marley (bot): That Stevie Ray Vaughan track always hits different, especially late at night.\nLucas: Totally agree, Marley, SRV\'s solos are next-level. Always
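
Before renaming or removing anything, it can help to confirm that every column the transformation relies on is present in each split. The following is a minimal sketch on plain lists of column names; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names that do not appear in the notebook. In the notebook itself, the list would presumably come from `dataset_dict[split_name].column_names`.

```python
# Hypothetical helper: these names are illustrative, not part of the notebook.
REQUIRED_COLUMNS = [
    "new_room_unified_format_input",
    "gpt5-results-20251104",
    "unified_format_output_enriched_fixed",
    "gpt5-results-20250905",
]


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent, in the order given."""
    present = set(column_names)
    return [name for name in required if name not in present]


# Column names copied from the schema printed above.
train_columns = [
    "prompt_name",
    "new_room_unified_format_input",
    "unified_format_output_enriched_fixed",
    "gpt5-results-20250905",
    "gpt5-results-20251104",
    "last_updated_ts",
]

print(missing_columns(train_columns))  # []
```

An empty list means the rename and remove steps should find every column they expect. A non-empty list names the columns to investigate before going further.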

## Transform Schema for BOTH Splits

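Applies the same transformation to the train and test splits: renames two columns to `input` and `output`, drops two superseded result columns, adds the annotation fields `manually_reviewed` and `manually_reviewed_ts`, and resets the existing `last_updated_ts` column to an empty string.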

In [29]:
def transform_split(dataset, split_name):
    """Rename, drop, and add columns on a single split and return the new Dataset."""
    print(f"\nTransforming {split_name} split...")

    # Step 1: Rename columns
    print("  Renaming columns...")
    dataset = dataset.rename_column("new_room_unified_format_input", "input")
    dataset = dataset.rename_column("gpt5-results-20251104", "output")
    print("    ✓ Renamed: new_room_unified_format_input -> input")
    print("    ✓ Renamed: gpt5-results-20251104 -> output")

    # Step 2: Remove unwanted columns
    print("  Removing columns...")
    columns_to_remove = ["unified_format_output_enriched_fixed", "gpt5-results-20250905"]
    dataset = dataset.remove_columns(columns_to_remove)
    print(f"    ✓ Removed: {', '.join(columns_to_remove)}")

    # Step 3: Add annotation columns
    print("  Adding annotation columns...")
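    def add_annotation_columns(example):
        # New fields: every row starts out unreviewed
        example['manually_reviewed'] = False
        example['manually_reviewed_ts'] = 0
        # last_updated_ts already exists in the source schema, so this
        # overwrites the stored value with '' rather than adding a column
        example['last_updated_ts'] = ''
        return example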

    dataset = dataset.map(add_annotation_columns)
    print("    ✓ Added: manually_reviewed (bool)")
    print("    ✓ Added: manually_reviewed_ts (int64)")
    print("    ✓ Added: last_updated_ts (string)")

    print(f"  ✓ {split_name} transformation complete!")
    return dataset

# Transform each split
transformed_dict = {}
for split_name in dataset_dict.keys():
    transformed_dict[split_name] = transform_split(dataset_dict[split_name], split_name)

# Create new DatasetDict with both splits
final_dataset = DatasetDict(transformed_dict)

print("\n✓ All splits transformed!")
print("\nFinal schema:")
print(final_dataset['train'].features)
print("\nFinal split sizes:")
for split_name in final_dataset.keys():
    print(f"  - {split_name}: {len(final_dataset[split_name])} rows")


Transforming train split...
  Renaming columns...
    ✓ Renamed: new_room_unified_format_input -> input
    ✓ Renamed: gpt5-results-20251104 -> output
  Removing columns...
    ✓ Removed: unified_format_output_enriched_fixed, gpt5-results-20250905
  Adding annotation columns...



    ✓ Added: manually_reviewed (bool)
    ✓ Added: manually_reviewed_ts (int64)
    ✓ Added: last_updated_ts (string)
  ✓ train transformation complete!

Transforming test split...
  Renaming columns...
    ✓ Renamed: new_room_unified_format_input -> input
    ✓ Renamed: gpt5-results-20251104 -> output
  Removing columns...
    ✓ Removed: unified_format_output_enriched_fixed, gpt5-results-20250905
  Adding annotation columns...



    ✓ Added: manually_reviewed (bool)
    ✓ Added: manually_reviewed_ts (int64)
    ✓ Added: last_updated_ts (string)
  ✓ test transformation complete!

✓ All splits transformed!

Final schema:
{'prompt_name': Value('string'), 'input': Value('string'), 'output': Value('string'), 'last_updated_ts': Value('string'), 'manually_reviewed': Value('bool'), 'manually_reviewed_ts': Value('int64')}

Final split sizes:
  - train: 14821 rows
  - test: 1642 rows
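
To make the per-row effect of the transformation easier to see, here is a pure-Python sketch that mirrors the three steps on a single dictionary. `transform_row` is a hypothetical helper, not part of the notebook, and the row values are placeholders rather than real data. Note that `last_updated_ts` is present in the source schema, so the last step overwrites it instead of adding it.

```python
def transform_row(row):
    """Sketch of what transform_split does to one row (hypothetical helper)."""
    dropped = {"unified_format_output_enriched_fixed", "gpt5-results-20250905"}
    renamed = {
        "new_room_unified_format_input": "input",
        "gpt5-results-20251104": "output",
    }
    # Steps 1 and 2: rename the kept columns and skip the dropped ones.
    out = {renamed.get(key, key): value
           for key, value in row.items() if key not in dropped}
    # Step 3: add the annotation fields and reset the existing timestamp.
    out["manually_reviewed"] = False
    out["manually_reviewed_ts"] = 0
    out["last_updated_ts"] = ""
    return out


# Placeholder values for illustration only.
row = {
    "prompt_name": "example_prompt",
    "new_room_unified_format_input": "example input",
    "unified_format_output_enriched_fixed": "old output",
    "gpt5-results-20250905": "older result",
    "gpt5-results-20251104": "newer result",
    "last_updated_ts": "placeholder-timestamp",
}

new_row = transform_row(row)
print(sorted(new_row))
```

If the existing `last_updated_ts` values need to survive, the reset line is the one to drop.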


## Validate Transformed Dataset

In [30]:
print("Validation:")
for split_name in final_dataset.keys():
    print(f"\n{split_name.upper()} split:")
    print(f"  Rows: {len(final_dataset[split_name])}")
    print(f"  Columns: {final_dataset[split_name].column_names}")

    print("\n  Sample row:")
    sample = final_dataset[split_name][0]
    for key in ['prompt_name', 'input', 'output', 'manually_reviewed', 'manually_reviewed_ts', 'last_updated_ts']:
        if key in sample:
            value = sample[key]
            if isinstance(value, str) and len(value) > 100:
                value = value[:100] + "..."
            print(f"    {key}: {value}")

Validation:

TRAIN split:
  Rows: 14821
  Columns: ['prompt_name', 'input', 'output', 'last_updated_ts', 'manually_reviewed', 'manually_reviewed_ts']

  Sample row:
    prompt_name: mention_not_in_history_prompt
    input: ROOM MEMBERS:[
  {"user_name": "dj-marley", "full_name": "Marley"},
  {"user_name": "Lucas", "full_n...
    output: {"action": "dj", "requester": "Lucas", "requested_users": ["dj-aria"], "action_metadata": {"prompt":...
    manually_reviewed: False
    manually_reviewed_ts: 0
    last_updated_ts: 

TEST split:
  Rows: 1642
  Columns: ['prompt_name', 'input', 'output', 'last_updated_ts', 'manually_reviewed', 'manually_reviewed_ts']

  Sample row:
    prompt_name: confirming_music_prompt
    input: ROOM MEMBERS:[
  {"user_name": "Lucas", "full_name": "Lucas"},
  {"user_name": "Jade", "full_name": ...
    output: {"action": "dj", "requester": "Lucas", "requested_users": ["Jade"], "action_metadata": {"prompt": "l...
    manually_reviewed: False
    manually_reviewed_ts: 
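
The cell above prints the schema for visual inspection. A check that fails loudly may be a useful complement. The sketch below assumes the six-column target schema shown in the output; `check_schema` and `EXPECTED_COLUMNS` are hypothetical names. In the notebook the arguments would presumably be `final_dataset[split_name].column_names` and `final_dataset[split_name][0]`.

```python
# Hypothetical names; the column set is taken from the final schema above.
EXPECTED_COLUMNS = {
    "prompt_name",
    "input",
    "output",
    "last_updated_ts",
    "manually_reviewed",
    "manually_reviewed_ts",
}


def check_schema(column_names, sample_row):
    """Raise ValueError if the columns or annotation defaults are not as expected."""
    actual = set(column_names)
    if actual != EXPECTED_COLUMNS:
        missing = sorted(EXPECTED_COLUMNS - actual)
        unexpected = sorted(actual - EXPECTED_COLUMNS)
        raise ValueError(
            f"Schema mismatch: missing={missing}, unexpected={unexpected}"
        )
    # Freshly transformed rows should all carry the unreviewed defaults.
    if sample_row.get("manually_reviewed") is not False:
        raise ValueError("manually_reviewed should default to False")
    if sample_row.get("manually_reviewed_ts") != 0:
        raise ValueError("manually_reviewed_ts should default to 0")


columns = ["prompt_name", "input", "output",
           "last_updated_ts", "manually_reviewed", "manually_reviewed_ts"]
check_schema(columns, {"manually_reviewed": False, "manually_reviewed_ts": 0})
print("schema ok")
```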

## Push to Hugging Face

⚠️ **IMPORTANT**: This uploads BOTH train and test splits with the new schema to the same repository, replacing the current data files.
Review the validation output above and make sure the transformation is what you want before running this cell!
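
Since the push writes to the same repository, a last guard that no split was lost and no rows disappeared can be cheap insurance. The sketch below compares split sizes as plain dictionaries; `assert_splits_preserved` is a hypothetical helper, and in the notebook the two dictionaries would presumably be built as `{name: len(ds) for name, ds in dataset_dict.items()}` and the same expression over `final_dataset`.

```python
def assert_splits_preserved(before, after):
    """Raise ValueError if a split is missing or its row count changed.

    Both arguments map split name -> row count (hypothetical helper).
    """
    missing = sorted(set(before) - set(after))
    if missing:
        raise ValueError(f"Splits missing after transform: {missing}")
    changed = {
        name: (before[name], after[name])
        for name in before
        if before[name] != after[name]
    }
    if changed:
        raise ValueError(f"Row counts changed (before, after): {changed}")


# Sizes taken from the output earlier in this notebook.
before = {"train": 14821, "test": 1642}
after = {"train": 14821, "test": 1642}
assert_splits_preserved(before, after)
print("splits preserved")
```

Running this just before the push cell stops the notebook with an error instead of uploading a dataset that has silently lost a split.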

In [31]:
print("Pushing to Hugging Face...")
print(f"  Repository: {DATASET_REPO}")
for split_name in final_dataset.keys():
    print(f"  {split_name}: {len(final_dataset[split_name])} rows")
print(f"  Columns: {len(final_dataset['train'].column_names)}")

# Push the entire DatasetDict (includes all splits)
final_dataset.push_to_hub(
    DATASET_REPO,
    token=HF_TOKEN,
    commit_message="Transform dataset: rename columns, remove old ones, add annotation fields (both train and test)"
)

print("\n✓ Successfully pushed BOTH splits!")
print(f"\nView at: https://huggingface.co/datasets/{DATASET_REPO}")

Pushing to Hugging Face...
  Repository: Cantina/dj-image-train-data-20251117
  train: 14821 rows
  test: 1642 rows
  Columns: 6




✓ Successfully pushed BOTH splits!

View at: https://huggingface.co/datasets/Cantina/dj-image-train-data-20251117
