# Dataset Transformation Script

This notebook transforms the dataset to have only 3 columns:
- `prompt_name` (unchanged)
- `input` (from `unified_format_output_enriched_fixed`)
- `output` (from `gpt5-results-20251104`)

## Setup

In [1]:
import json
import os
from pathlib import Path
from huggingface_hub import hf_hub_download, HfApi
from dotenv import load_dotenv

# Load environment variables from root .env
load_dotenv(Path('../..') / '.env')

# Configuration
DATASET_REPO = "Cantina/intent-full-data-20251106"
HF_TOKEN = os.getenv('HUGGINGFACE_TOKEN')

if not HF_TOKEN:
    raise ValueError("HUGGINGFACE_TOKEN not found in environment variables")

print("✓ Environment configured")
print(f"✓ Dataset: {DATASET_REPO}")

✓ Environment configured
✓ Dataset: Cantina/intent-full-data-20251106




## Download the Dataset

In [2]:
# Try each candidate file name in turn and stop at the first one that downloads
possible_files = ['data.json', 'train.json', 'dataset.json', 'annotations.json']

data = None
source_file = None

for filename in possible_files:
    try:
        file_path = hf_hub_download(
            repo_id=DATASET_REPO,
            filename=filename,
            repo_type="dataset",
            token=HF_TOKEN
        )

        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        source_file = filename
        print(f"✓ Found and loaded: {filename}")
        break
    except Exception as e:
        # Report the failure and fall through to the next candidate name
        print(f"✗ Could not load {filename}: {e}")

if data is None:
    raise FileNotFoundError(f"Could not find any data file. Tried: {', '.join(possible_files)}")

print(f"\n✓ Loaded {len(data)} rows from {source_file}")

✗ Could not load data.json: 404 Client Error. (Request ID: Root=1-690e8c1e-1122829a23aa1563542b3151;2fa3ef77-c51b-4ed0-9ec9-22b8cd3f1f5d)

Repository Not Found for url: https://huggingface.co/datasets/Cantina/intent-full-data-20251106/resolve/main/data.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
✗ Could not load train.json: 404 Client Error. (Request ID: Root=1-690e8c1e-6eaf987206a205875409e21a;edd6b337-3e99-4475-a90e-4ee1d6982eab)

Repository Not Found for url: https://huggingface.co/datasets/Cantina/intent-full-data-20251106/resolve/main/train.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
✗ Coul

FileNotFoundError: Could not find any data file. Tried: data.json, train.json, dataset.json, annotations.json
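
The errors above say "Repository Not Found" rather than "file not found", which suggests the repo ID or the token's access is the problem; in that case no file name in the list can succeed. Once access works, guessing names is also unnecessary: `HfApi.list_repo_files` returns the files that actually exist in the repo. The sketch below is a minimal illustration, not part of the original notebook. `pick_data_file` and `list_dataset_files` are hypothetical helpers, and the sketch assumes the data is stored as a JSON file, as the download cell does.

```python
from typing import Iterable, Optional


def pick_data_file(filenames: Iterable[str],
                   extensions: tuple = (".json", ".jsonl")) -> Optional[str]:
    """Return the first file name (in sorted order) with a data-like extension, or None."""
    for name in sorted(filenames):
        if name.lower().endswith(extensions):
            return name
    return None


def list_dataset_files(repo_id: str, token: str) -> list:
    """List the files in a dataset repo (needs network access and a valid token)."""
    # Imported here so pick_data_file stays usable without huggingface_hub installed.
    from huggingface_hub import HfApi

    return HfApi().list_repo_files(repo_id=repo_id, repo_type="dataset", token=token)


# Offline illustration with made-up file names:
example_files = ["README.md", ".gitattributes", "intent_rows.json"]
print(pick_data_file(example_files))  # prints: intent_rows.json
```

In the notebook, `pick_data_file(list_dataset_files(DATASET_REPO, HF_TOKEN))` would replace the hard-coded `possible_files` list. If the repo is still inaccessible, the listing call raises an error too, which confirms the problem is access and not the file name.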

## Inspect Original Data Structure

In [None]:
# Show the first row to understand the structure
if data and len(data) > 0:
    print("First row keys:")
    print(json.dumps(list(data[0].keys()), indent=2))
    print("\nFirst row sample:")
    print(json.dumps(data[0], indent=2)[:500] + "...")

## Transform the Data

Transform each row to have only 3 columns:
- `prompt_name` (keep as is)
- `input` (from `unified_format_output_enriched_fixed`)
- `output` (from `gpt5-results-20251104`)

In [None]:
transformed_data = []

for row in data:
    transformed_row = {
        'prompt_name': row.get('prompt_name', ''),
        'input': row.get('unified_format_output_enriched_fixed', ''),
        'output': row.get('gpt5-results-20251104', '')
    }
    transformed_data.append(transformed_row)

print(f"✓ Transformed {len(transformed_data)} rows")
print("\nTransformed structure:")
print(json.dumps(transformed_data[0], indent=2))
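
To make the mapping concrete, here is a small worked example on a single made-up row. The row contents and the `COLUMN_MAP` and `transform_row` names are illustrative assumptions; only the three column names come from this notebook. The example shows two consequences of using `row.get(..., '')`: extra columns are dropped, and a missing source column silently becomes an empty string.

```python
import json

# Hypothetical source row; real rows may carry more columns and other value types.
sample_row = {
    "prompt_name": "example_prompt",
    "unified_format_output_enriched_fixed": "user text",
    "some_other_column": 123,
}

# Target column -> source column, as described in this section.
COLUMN_MAP = {
    "prompt_name": "prompt_name",
    "input": "unified_format_output_enriched_fixed",
    "output": "gpt5-results-20251104",
}


def transform_row(row: dict) -> dict:
    """Keep only the three target columns, defaulting missing ones to ''."""
    return {target: row.get(source, "") for target, source in COLUMN_MAP.items()}


result = transform_row(sample_row)
print(json.dumps(result, indent=2))
```

Here `output` comes back as `""` because the sample row has no `gpt5-results-20251104` key, which is exactly the case the validation step is meant to catch.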

## Validate Transformation

In [None]:
# Check for any missing data
missing_prompt_name = sum(1 for row in transformed_data if not row['prompt_name'])
missing_input = sum(1 for row in transformed_data if not row['input'])
missing_output = sum(1 for row in transformed_data if not row['output'])

print("Data completeness:")
print(f"  Rows with missing prompt_name: {missing_prompt_name}")
print(f"  Rows with missing input: {missing_input}")
print(f"  Rows with missing output: {missing_output}")
print(f"\nTotal rows: {len(transformed_data)}")

# Show a few examples
print("\n" + "="*80)
print("Sample transformed rows:")
print("="*80)
for i, row in enumerate(transformed_data[:3]):
    print(f"\nRow {i+1}:")
    print(f"  prompt_name: {row['prompt_name']}")
    # str() keeps the preview working when a value is a dict or list rather than a string
    print(f"  input: {str(row['input'])[:100]}...")
    print(f"  output: {str(row['output'])[:100]}...")
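
The counts above say how many rows are incomplete but not which ones. A minimal sketch of a helper that returns the row indices per field, so the offending source rows can be inspected; `find_incomplete_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_incomplete_rows(rows, fields=("prompt_name", "input", "output")):
    """Map each field to the indices of rows where it is missing or empty."""
    missing = {field: [] for field in fields}
    for index, row in enumerate(rows):
        for field in fields:
            if not row.get(field):
                missing[field].append(index)
    return missing


# Made-up rows for illustration only.
sample_rows = [
    {"prompt_name": "p1", "input": "question", "output": "answer"},
    {"prompt_name": "p2", "input": "", "output": "answer"},
    {"prompt_name": "", "input": "question", "output": ""},
]

incomplete = find_incomplete_rows(sample_rows)
print(incomplete)
```

Like the counting cell, this treats any falsy value (empty string, empty dict, `None`) as missing.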

## Save Transformed Data Locally

In [None]:
# Save to a local file first for inspection
output_file = Path('..') / 'transformed_data.json'

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(transformed_data, f, indent=2)

print(f"✓ Saved transformed data to: {output_file}")
print(f"  File size: {output_file.stat().st_size / 1024:.2f} KB")
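
Before uploading, it can be worth confirming that the saved file reads back to the same rows. The sketch below is one way to do that under stated assumptions: `save_and_verify` is a hypothetical helper, the sample row is made up, and a temporary directory stands in for the real output path.

```python
import json
import tempfile
from pathlib import Path


def save_and_verify(rows, path: Path) -> bool:
    """Write rows as JSON, reload them, and report whether they match."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    return reloaded == rows


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_rows = [{"prompt_name": "p1", "input": "héllo", "output": "ok"}]
    round_trip_ok = save_and_verify(sample_rows, Path(tmp_dir) / "sample.json")
    print(round_trip_ok)
```

The comparison is only meaningful for JSON-native values (strings, numbers, booleans, `None`, lists, and dicts with string keys); a tuple, for example, would come back as a list and fail the check.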

## Upload to Hugging Face (Optional)

**WARNING**: This uploads `transformed_data.json` to the dataset repo and replaces any file already stored at that path. Make sure you have a backup!

Uncomment and run the cell below to upload the transformed data.

In [None]:
# # UNCOMMENT TO UPLOAD
# api = HfApi()
#
# # Upload the transformed data
# api.upload_file(
#     path_or_fileobj=str(output_file),
#     path_in_repo="transformed_data.json",
#     repo_id=DATASET_REPO,
#     repo_type="dataset",
#     token=HF_TOKEN,
#     commit_message="Transform dataset to 3 columns: prompt_name, input, output"
# )
#
# print(f"✓ Uploaded transformed data to {DATASET_REPO}")
# print(f"  File: transformed_data.json")
# print(f"\nYou can view it at:")
# print(f"https://huggingface.co/datasets/{DATASET_REPO}")

## Summary

Once every cell above runs successfully, the new dataset has:
- **prompt_name**: Original prompt name (unchanged)
- **input**: Content from `unified_format_output_enriched_fixed`
- **output**: Content from `gpt5-results-20251104`

The transformed data is saved locally as `transformed_data.json` in the parent directory and can be uploaded to Hugging Face by uncommenting the upload cell.