
Conversation

@Sindhuja217 Sindhuja217 (Collaborator) commented Dec 2, 2025

Summary

This PR introduces a fully automated, end-to-end 8-stage data-construction pipeline designed to generate high-quality, safety-aligned training and evaluation datasets for Original DPO and Modified Factual-DPO experiments using the Skywork Reward-Preference dataset.
The pipeline covers extraction, transformation, factuality scoring, synthetic corruption generation, balanced sampling, dataset merging, and Safe-DPO preference flipping, producing reproducible train/eval splits for alignment research.

PR Type

[New feature]

Changes Made

Stage 1: Data Extraction

dataextraction.py

  • Loads the first 80k rows of the Skywork Reward-Preference dataset.
  • Extracts prompt, chosen, and rejected from dialog format.
  • Removes exact-match duplicates (i.e., chosen == rejected).
  • Saves the cleaned 77k dataset + removed duplicates.

dataextraction_eval.py

  • Extracts evaluation slice (rows 80001–81000).
  • Performs the same prompt/answer extraction & duplicate removal.
  • Outputs skywork_extracted_eval.jsonl and removed samples.

dataextraction_eval2.py

  • Extracts test slice (rows 81001–81500).
  • Performs identical cleaning + duplicate filtering.
  • Outputs skywork_extracted_test.jsonl and removed samples.
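
A minimal sketch of the extraction helpers described above, assuming the Skywork rows store chosen/rejected as role/content message lists (the exact field handling lives in the dataextraction*.py scripts):

```python
def extract_prompt_from_dialog(messages: list[dict]) -> str:
    """Return the first user turn as the prompt."""
    for turn in messages:
        if turn.get("role") == "user":
            return turn.get("content", "")
    return ""

def extract_answer_from_dialog(messages: list[dict]) -> str:
    """Return the last assistant turn as the answer."""
    for turn in reversed(messages):
        if turn.get("role") == "assistant":
            return turn.get("content", "")
    return ""

def clean_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Extract prompt/chosen/rejected and split off exact-match duplicates."""
    kept, removed = [], []
    for row in rows:
        record = {
            "prompt": extract_prompt_from_dialog(row["chosen"]),
            "chosen": extract_answer_from_dialog(row["chosen"]),
            "rejected": extract_answer_from_dialog(row["rejected"]),
        }
        # Exact-match duplicates (chosen == rejected) are saved separately.
        (removed if record["chosen"] == record["rejected"] else kept).append(record)
    return kept, removed
```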

Stage 2: Preference Pair Conversion

dataconversion.py

  • Loads the cleaned 77k training samples (skywork_extracted_77k.jsonl).
  • Randomly assigns chosen and rejected responses into response_0 and response_1.
  • Computes the correct better_response_id.
  • Saves the final training preference pairs as skywork_preference_pairs_77k.jsonl.

dataconversion_eval.py

  • Performs the same conversion for the evaluation split (skywork_extracted_eval.jsonl).
  • Produces skywork_preference_pairs_eval.jsonl.
  • Mirrors the training pipeline but restricted to eval-only samples.
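
A rough sketch of the Stage 2 conversion (illustrative; field names follow the PR description):

```python
import json
import random

def to_preference_pair(record: dict, rng: random.Random) -> dict:
    """Randomly place chosen/rejected into response_0/response_1 and record the winner."""
    if rng.random() < 0.5:
        response_0, response_1, better_id = record["chosen"], record["rejected"], 0
    else:
        response_0, response_1, better_id = record["rejected"], record["chosen"], 1
    return {
        "prompt": record["prompt"],
        "response_0": response_0,
        "response_1": response_1,
        "better_response_id": better_id,
    }

def convert_file(in_path: str, out_path: str, seed: int = 42) -> None:
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(json.dumps(to_preference_pair(json.loads(line), rng)) + "\n")
```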

Stage 3: Binary Factuality Labeling (Train + Eval)

dataset_train.py

  • Loads training preference pairs (skywork_preference_pairs_train.jsonl).
  • Uses a strict PKU-style binary judge implemented via GPT-4o-mini.
  • Assigns factual_flag_0, factual_flag_1 for each response.
  • Adds convenience aliases h0, h1 for downstream filtering/loss computation.
  • Includes automatic resume, concurrency control (async), and checkpointing.
  • Saves fully labeled output to skywork_binary_factual_train.jsonl.

dataset_eval.py

  • Mirrors the entire pipeline for the evaluation split (skywork_preference_pairs_eval.jsonl).
  • Produces skywork_binary_factual_eval.jsonl.
  • Enables consistent factual-labeling across both training and evaluation datasets.
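
An illustrative sketch of the async binary judge; the actual prompt wording, resume logic, and checkpointing live in dataset_train.py / dataset_eval.py. A flag of 1 marks a hallucinated response, matching the downstream (h_w, h_l) convention:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE_PROMPT = (
    "You are a strict PKU-style binary factuality judge. Reply with a single digit: "
    "1 if the response contains a factual error or hallucination, otherwise 0."
)

async def judge_response(prompt: str, response: str, sem: asyncio.Semaphore) -> int:
    async with sem:  # concurrency control
        completion = await client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": f"Question:\n{prompt}\n\nResponse:\n{response}"},
            ],
        )
        text = (completion.choices[0].message.content or "").strip()
        return 1 if text.startswith("1") else 0

async def label_pair(pair: dict, sem: asyncio.Semaphore) -> dict:
    flag_0, flag_1 = await asyncio.gather(
        judge_response(pair["prompt"], pair["response_0"], sem),
        judge_response(pair["prompt"], pair["response_1"], sem),
    )
    # h0/h1 are convenience aliases of the per-response flags.
    return {**pair, "factual_flag_0": flag_0, "factual_flag_1": flag_1, "h0": flag_0, "h1": flag_1}
```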

Stage 4: DPO-Ready Transformation (Train + Eval)

data_transform_train.py

  • Loads binary factual-labeled training data (skywork_binary_factual_train.jsonl).
  • Converts response_0 / response_1 into chosen and rejected strictly using better_response_id.
  • Maps factual flags into:
      • h_w → hallucination flag of the winner
      • h_l → hallucination flag of the loser
  • Preserves original responses and adds flipped=False for Safe-DPO compatibility.
  • Outputs the final training set: skywork_first_transformed_train.jsonl.

data_transform_eval.py

  • Performs the same transformation for evaluation data (skywork_binary_factual_eval.jsonl).
  • Produces skywork_first_transformed_eval.jsonl.
  • Ensures both train and eval sets share identical field structure.
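
The core of the Stage 4 mapping can be sketched as follows (field names taken from the PR description; the real scripts may differ in details):

```python
def to_dpo_record(row: dict) -> dict:
    """Map response_0/response_1 to chosen/rejected strictly via better_response_id."""
    winner = row["better_response_id"]   # 0 or 1
    loser = 1 - winner
    return {
        "prompt": row["prompt"],
        "chosen": row[f"response_{winner}"],
        "rejected": row[f"response_{loser}"],
        "h_w": row[f"h{winner}"],        # hallucination flag of the winner
        "h_l": row[f"h{loser}"],         # hallucination flag of the loser
        "response_0": row["response_0"], # original responses preserved
        "response_1": row["response_1"],
        "better_response_id": winner,
        "flipped": False,                # Safe-DPO compatibility
    }
```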

Stage 5: Synthetic Corruption Generation (Train + Eval)

data_synthetic_train.py

  • Loads DPO-ready factual-scored training data (skywork_first_transformed_train.jsonl).
  • Selects samples where the winner is factual (h_w = 0) and the loser is hallucinated (h_l = 1).
  • Uses GPT-4o-mini to rewrite the factual answer into a subtly incorrect, fluent hallucination.
  • Produces inversion pairs where:
      • chosen = corrupted (hallucinated)
      • rejected = original factual answer
  • Generates 10,000 synthetic hallucination samples.
  • Saves to synthetic_llm_inversion_train_10k.jsonl.

data_synthetic_eval.py

  • Mirrors the same corruption generation process for the evaluation split.
  • Selects factual (0,1) pairs and produces hallucinated inversions.
  • Generates 400 synthetic eval corruption samples.
  • Saves to synthetic_llm_inversion_eval_400.jsonl.
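
A rough sketch of a single corruption step; the corruption prompt below is illustrative, and the production scripts add batching, retries, and checkpointing:

```python
from openai import OpenAI

client = OpenAI()
CORRUPTION_PROMPT = (
    "Rewrite the answer so it stays fluent and confident but introduces a subtle "
    "factual error. Return only the rewritten answer."
)

def make_inversion(sample: dict) -> dict:
    """sample is expected to have h_w == 0 and h_l == 1 (factual winner, hallucinated loser)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.7,
        messages=[
            {"role": "system", "content": CORRUPTION_PROMPT},
            {"role": "user", "content": f"Question:\n{sample['prompt']}\n\nAnswer:\n{sample['chosen']}"},
        ],
    )
    corrupted = (completion.choices[0].message.content or "").strip()
    return {
        "prompt": sample["prompt"],
        "chosen": corrupted,           # hallucinated rewrite becomes the winner
        "rejected": sample["chosen"],  # original factual answer becomes the loser
        "h_w": 1,
        "h_l": 0,
        "flipped": False,
    }
```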

Stage 6: Final Dataset Merge (Train + Eval)

data_merge_train.py

  • Loads the 10,000 synthetic inversion samples (synthetic_llm_inversion_train_10k.jsonl).
  • Loads the transformed factual-scored Skywork training set (skywork_first_transformed_train.jsonl).
  • Buckets real samples into:
      • (0,0) → both factual
      • (1,1) → both hallucinated
      • (0,1) → factual winner, hallucinated loser
  • Randomly samples 10,000 from the (0,1) group.
  • Merges:
      • all synthetic inversions
      • all (0,0)
      • all (1,1)
      • sampled (0,1)
  • Shuffles and outputs final dataset → skywork_final_train.jsonl.

data_merge_eval.py

  • Loads 400 synthetic eval inversions (synthetic_llm_inversion_eval_400.jsonl).
  • Loads the transformed evaluation set (skywork_first_transformed_eval.jsonl).
  • Buckets into (0,0), (1,1), (0,1) but keeps all real eval samples.
  • Merges: synthetic + all real eval data.
  • Shuffles and outputs final eval dataset → skywork_final_eval.jsonl.
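
A compact sketch of the bucketing-and-merge step on the training side (file handling simplified; counts follow the PR description):

```python
import json
import random

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def merge_train(synthetic_path: str, skywork_path: str, out_path: str, seed: int = 42) -> None:
    rng = random.Random(seed)
    synthetic = load_jsonl(synthetic_path)

    # Bucket the real samples by their (h_w, h_l) factuality pair.
    buckets: dict[tuple[int, int], list[dict]] = {(0, 0): [], (1, 1): [], (0, 1): []}
    for row in load_jsonl(skywork_path):
        key = (row["h_w"], row["h_l"])
        if key in buckets:
            buckets[key].append(row)

    merged = (
        synthetic                              # all synthetic inversions
        + buckets[(0, 0)]                      # both factual
        + buckets[(1, 1)]                      # both hallucinated
        + rng.sample(buckets[(0, 1)], 10_000)  # 10k sampled from (0,1)
    )
    rng.shuffle(merged)
    with open(out_path, "w") as f:
        for row in merged:
            f.write(json.dumps(row) + "\n")
```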

Stage 7: Final Balanced Dataset Construction (Train + Eval)

data_final_train.py

  • Loads the merged training dataset (skywork_final_train.jsonl).
  • Buckets samples based on factuality pair type (h_w, h_l):
      • (0,1) → factual winner, hallucinated loser
      • (1,0) → hallucinated winner, factual loser
      • (0,0) → both factual
      • (1,1) → both hallucinated
  • Performs balanced sampling with exact target counts:
      • (0,1) → 10,000
      • (1,0) → 10,000
      • (0,0) → 15,000
      • (1,1) → 10,000
  • Supports sampling with replacement if a bucket is too small.
  • Shuffles and saves the final balanced dataset → train_finallast.jsonl.
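
The balanced-sampling step can be sketched as follows (target counts taken from the list above):

```python
import random

TARGETS = {(0, 1): 10_000, (1, 0): 10_000, (0, 0): 15_000, (1, 1): 10_000}

def balance(rows: list[dict], seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    buckets: dict[tuple[int, int], list[dict]] = {key: [] for key in TARGETS}
    for row in rows:
        key = (row["h_w"], row["h_l"])
        if key in buckets:
            buckets[key].append(row)

    balanced: list[dict] = []
    for key, target in TARGETS.items():
        bucket = buckets[key]
        if len(bucket) >= target:
            balanced.extend(rng.sample(bucket, target))     # without replacement
        else:
            balanced.extend(rng.choices(bucket, k=target))  # with replacement if too small
    rng.shuffle(balanced)
    return balanced
```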

data_final_eval.py

  • Loads:
      • 400 synthetic eval inversions (synthetic_llm_inversion_eval_400.jsonl)
      • the full transformed Skywork eval set (skywork_first_transformed_eval.jsonl)
      • the Skywork training source (skywork_final_train.jsonl)
      • the final train dataset (train_finallast.jsonl) to avoid leakage
  • Excludes any sample that appears in the training set.
  • Adds additional clean samples from the training buckets:
      • 1,500 from (1,1)
      • 1,500 from (0,0)
  • Merges:
      • all synthetic eval samples
      • all real Skywork eval samples
      • the additional 1,500 (1,1) and 1,500 (0,0) samples
  • Shuffles and saves → eval_final.jsonl.
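
The leakage check can be sketched like this (keying on the (prompt, chosen, rejected) triple is an assumption for illustration):

```python
def sample_key(row: dict) -> tuple[str, str, str]:
    return (row["prompt"], row["chosen"], row["rejected"])

def exclude_train_overlap(eval_rows: list[dict], train_rows: list[dict]) -> list[dict]:
    """Drop any eval sample that also appears in the final training set."""
    train_keys = {sample_key(row) for row in train_rows}
    return [row for row in eval_rows if sample_key(row) not in train_keys]
```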

Stage 8: Preference Label Flipping (Train + Eval)

data_flipped_train.py

  • Loads the final balanced training dataset (train_finallast.jsonl).
  • Identifies samples with factuality tuple (h_w, h_l) = (1, 0) — meaning the winner is hallucinated and the loser is factual.
  • Flips these samples by:
      • swapping chosen and rejected
      • converting the flags to (0,1)
  • Writes the flipped dataset to train_finallast_flipped.jsonl.

data_flipped_eval.py
  • Loads the final evaluation dataset (eval_final.jsonl).
  • Applies the exact same flipping rule on (1,0) preference pairs.
  • Saves the processed version as eval_final_flipped.jsonl.
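
A minimal sketch of the flipping rule shared by both scripts (setting flipped=True on swapped samples is an assumption):

```python
def flip_if_needed(row: dict) -> dict:
    """Swap chosen/rejected when the winner is hallucinated and the loser is factual."""
    if (row["h_w"], row["h_l"]) == (1, 0):
        return {
            **row,
            "chosen": row["rejected"],
            "rejected": row["chosen"],
            "h_w": 0,
            "h_l": 1,
            "flipped": True,  # assumption: flipped samples are marked for Safe-DPO
        }
    return row
```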

Checklist

  • Code follows the project's style guidelines
  • Review of code completed
  • No sensitive information (API keys, credentials, file paths) exposed

@Sindhuja217 Sindhuja217 self-assigned this Dec 2, 2025

@AhmedRadwan02 AhmedRadwan02 (Collaborator) left a comment


name: Feature Request
about: Suggest a new feature or enhancement
title: '[FEATURE] Code Organization and Refactoring for 8-Stage Data Pipeline'
labels: 'enhancement'
assignees: ''


Problem Statement

Hey! First off, great work on getting this 8-stage pipeline working end-to-end. The implementation is solid and functional. I noticed a few organizational things that could make the codebase easier to work with as it grows:

  1. All the files are currently in the root directory with some inconsistent naming (like dataextraction.py vs dataextraction_eval2.py)
  2. There's quite a bit of duplicate code across files - functions like extract_prompt_from_dialog and extract_answer_from_dialog appear in multiple places
  3. Some paths and parameters are hardcoded even though we have a config.yaml
  4. The model code, prompts, and data processing logic are all mixed together in the same files
  5. I noticed the test data file (_test.json) gets created but isn't used in the later stages - was this intentional?
  6. Quick question: are the data split sizes (train: 77k, eval: ~1k, test: 500) aligned with what's typical in DPO research?
  7. The config.yaml has the model name but it's still hardcoded in places

These aren't critical issues, but addressing them would make the code easier to maintain, debug, and extend down the road!

Proposed Solution

Here are some ideas to make things more organized:

1. File Organization

  • Group related files into directories by stage: /stage_1_extraction/, /stage_2_conversion/, etc.
  • Pick one naming style and stick with it (either data_extraction.py or dataExtraction.py - your preference!)
  • Maybe rename dataextraction_eval2.py to data_extraction_test.py? Just clearer about what it does
  • Create a /utils/ directory for shared code

2. Code Consolidation

  • Pull those repeated functions into /utils/data_utils.py so we only maintain them in one place
  • Create a base GPT class in /utils/model_utils.py that all the stages can use
  • Move all the prompts into /utils/prompt_templates.py to keep them separate from the logic

3. Better Config Usage

  • Add all the file paths to config.yaml (like SYNTHETIC_FILE, SKYWORK_FILE)
  • Include hyperparameters and sampling targets
  • Make sure we're actually reading from config everywhere instead of hardcoding
  • Could add a config_loader utility to make this consistent

4. Complete or Clarify Test Pipeline

  • Either extend the test data through all 8 stages, or
  • Add a note explaining why it only goes through extraction

5. Small Cleanup

  • Remove any emojis from the code
  • Ensure consistent formatting

Alternative Solutions

Option 1: Minimal changes

  • Just consolidate the duplicate functions
  • Keep the flat file structure
  • Pros: Quick to implement
  • Cons: Won't help much with long-term maintainability

Option 2: One big pipeline script

  • Combine everything into a single main.py
  • Pros: Simple to run
  • Cons: Loses the nice modularity you have now

I'd recommend going with the proposed solution - it keeps your modular approach while making things cleaner.

Use Cases

This refactoring would help with:

  • Adding new stages or modifying existing ones without breaking things
  • Debugging individual stages in isolation
  • Reusing the utilities in other projects
  • Swapping out different models or prompts easily
  • Running experiments with different hyperparameters
  • Onboarding others to work on the pipeline

Implementation Ideas

Suggested approach:

  1. Start by creating the directory structure and moving files
  2. Pull duplicate code into utils modules
  3. Expand the config.yaml and update everything to use it
  4. Decide what to do with the test pipeline
  5. Update the documentation to reflect the new structure

You don't have to do all of this at once - could tackle it in phases!

Component Impact

  • Core functionality - All 8 stages would need updates
  • API - Not affected
  • Docker/Infrastructure - Might need to update some paths
  • Documentation - README would need updates
  • Configuration - Expanding config.yaml usage

Additional Context

What's working well:

  • No sensitive info exposed - nice job keeping things secure!
  • The code is clear and easy to follow
  • Good separation into distinct stages

Quick question:
The eval and test splits seem pretty small compared to training. Is this the standard ratio for DPO papers? Just want to make sure we're aligned with best practices.

About the test file:
The _test.json gets created in Stage 1 but doesn't flow through the rest of the pipeline. Was this planned for later, or should we complete it now?

Priority

  • Nice to have
  • Would be helpful
  • Important for my use case
  • Critical/blocking

This is important for maintainability but not blocking your current work. Let me know what you think about these suggestions!

@Sindhuja217 Sindhuja217 (Collaborator, Author) commented Dec 5, 2025

Thanks a lot for the detailed and thoughtful feature request — this is extremely helpful for improving the maintainability and scalability of the 8-stage data construction pipeline.

1. File Organization Improvements
You are absolutely right: the earlier version had inconsistent naming (dataextraction.py, dataextraction_eval2.py) and too many scripts sitting in a single directory. The eval2 name came about because any filename containing "test" is collected by pytest under the uv template, which raised a dependency error that would have required changes to /projects/aixpert/users/sindhu/AIXpert-preference-alignment/.pre-commit-config.yaml; to keep things simple, I renamed the file instead.

A clean directory hierarchy now exists:

data_construction/
    stage_1_extraction/
    stage_2_conversion/
    stage_3_factuality/
    stage_4_transformation/
    stage_5_syntheticdata/
    stage_6_merging/
    stage_7_balancing/
    stage_8_flipping/
    utils/

2. Code Consolidation
Agreed — several utility functions were duplicated early on (e.g., extracting prompt/answer from the Skywork dialog format).
Consolidated repeated logic into:

utils/data_utils.py
utils/factual_utils.py
utils/synthetic_utils.py
utils/dpo_transform_utils.py
utils/prompt_templates.py

Benefits:

  • One place to maintain shared logic
  • Clean separation between data processing, LLM scoring, and synthetic generation
  • Less duplication across stages

3. Better Configuration Usage
You made an excellent point — previously, some paths and hyperparameters were still hardcoded.
What’s fixed:

  • All paths and hyperparameters now live inside config/config.yaml.
  • Every stage script now imports config through from utils.config_loader import load_config.
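
For reference, a minimal config_loader sketch consistent with that import (the repository's actual implementation may differ):

```python
# utils/config_loader.py
from pathlib import Path

import yaml

def load_config(path: str = "config/config.yaml") -> dict:
    """Load the shared YAML config used by every stage script."""
    with Path(path).open() as f:
        return yaml.safe_load(f)
```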

4. Test Pipeline Clarification

Yes — the test slice (500 samples) is extracted in Stage 1 but not processed in the remaining stages.
This was intentional for the first version:
Test prompts are used only in the evaluation pipeline, to score model generations on factuality, hallucination, and win rate; they are not used for preference scoring or DPO dataset construction.

5. Additional Cleanup

  • Emojis removed from inside the code
  • Consistent formatting is followed throughout

6. Why the Train/Val/Test Split Is Good for DPO

The train/val/test split of 45k / 4.4k / 500 works well for DPO because it matches how leading RLHF and preference-learning papers structure their datasets. DPO relies on pairwise comparisons between two responses, so it requires a large training set, and 45k pairs is directly in line with the datasets used in InstructGPT (30–50k), Anthropic HH (~32k), SafeRLHF (30k), and PKU-SafeRLHF (29k). A much smaller validation set (~10% of train) is standard because it is only used for monitoring to prevent preference drift, not for learning. The test set should be small and strictly held out, containing only prompts for evaluating free-form generation rather than preference accuracy. Using 500 prompts aligns with common LLM benchmarks such as Arena-Hard (500 prompts) and gives us a clear picture of the model's win rate, factual score, and hallucination score. Overall, the split provides the right amount of training signal, efficient validation, and a clean held-out test set, making it well suited for DPO.

Please take a look at all the changes that were made, and let me know if anything else needs to be adjusted.

@AhmedRadwan02 AhmedRadwan02 (Collaborator) commented:


name: Feature Request
about: Suggest a new feature or enhancement
title: '[FEATURE] Final Review - 8-Stage Data Pipeline Refactoring'
labels: 'enhancement'

Problem Statement

This is a follow-up review on the 8-stage data construction pipeline refactoring. The major organizational improvements have been successfully implemented - the code is now much more maintainable with better structure and configuration management. There are just a few minor cleanup items remaining before final approval.

Proposed Solution

Great work on all the updates! The refactoring looks really solid now. Here are the remaining items to address:

If needed:

1. Add requirements.txt

  • Currently missing from the project
  • Should include all dependencies needed to run the pipeline
  • Recommended packages to include:
    • openai (for GPT API calls)
    • pyyaml (for config loading)
    • asyncio (if used for concurrency)
    • Any other third-party libraries used across stages

Important:

2. Code Cleanup Needed

Two files need cleaning:

  • src/aixpert/data_construction/stage_7_final/data_final_train.py
  • src/aixpert/data_construction/utils/dpo_transform_utils.py

Items to clean:

  • Remove any emojis from code/comments

Use Cases

  • Ensuring code is professional and clean for production use
  • Meeting project code quality standards
  • Allowing others to easily install and run the pipeline

Implementation Ideas

For requirements.txt:

Create a requirements.txt at the project root that lists all dependencies, with versions pinned if needed.

For code cleanup:

  1. Open the two files mentioned
  2. Search for emojis and remove them

Should take about 10-15 minutes total.

Component Impact

  • API - Not affected
  • Core functionality - Minor cleanup only, no logic changes
  • Docker/Infrastructure - requirements.txt may help with Docker setup
  • Documentation - requirements.txt serves as dependency documentation
  • Any other part of the system

Additional Context

What's Working Great:

The restructuring you did addresses all the major concerns from the initial review:

  • Clean directory structure with all 8 stages properly organized
  • Utils are consolidated and being imported correctly
  • Config.yaml is comprehensive and actually being used throughout
  • File naming is consistent (train/val pattern)
  • Code is much more understandable and maintainable now
  • All the stage files are clean and well-structured

Verification Completed:

  • Directory structure looks good
  • Config paths are being used (no hardcoded paths found)
  • Utils are properly imported across stages
  • Naming conventions are consistent
  • Each stage has proper main() functions

Once these are done, the PR will be ready.

Priority

  • Nice to have
  • Would be helpful
  • Important for my use case
  • Critical/blocking

These are the last few items needed before we can merge.

@Sindhuja217 Sindhuja217 (Collaborator, Author) commented:

Thank you @AhmedRadwan02. I made the necessary changes in the respective files. As for requirements.txt, /AIXpert-preference-alignment/pyproject.toml already lists all the dependencies required for the whole project.

@Sindhuja217 Sindhuja217 (Collaborator, Author) commented:

As I have addressed all of Ahmed's comments, I'm merging this PR @shainarazavi.

@Sindhuja217 Sindhuja217 merged commit b5dcb25 into main Dec 15, 2025
5 checks passed