Data Construction Pipeline for Original DPO & Modified (Factual-Aware) DPO #6
Conversation
AhmedRadwan02 left a comment
name: Feature Request
about: Suggest a new feature or enhancement
title: '[FEATURE] Code Organization and Refactoring for 8-Stage Data Pipeline'
labels: 'enhancement'
assignees: ''
Problem Statement
Hey! First off, great work on getting this 8-stage pipeline working end-to-end. The implementation is solid and functional. I noticed a few organizational things that could make the codebase easier to work with as it grows:
- All the files are currently in the root directory with some inconsistent naming (like dataextraction.py vs dataextraction_eval2.py)
- There's quite a bit of duplicate code across files - functions like extract_prompt_from_dialog and extract_answer_from_dialog appear in multiple places
- Some paths and parameters are hardcoded even though we have a config.yaml
- The model code, prompts, and data processing logic are all mixed together in the same files
- I noticed the test data file (_test.json) gets created but isn't used in the later stages - was this intentional?
- Quick question: are the data split sizes (train: 77k, eval: ~1k, test: 500) aligned with what's typical in DPO research?
- The config.yaml has the model name but it's still hardcoded in places
These aren't critical issues, but addressing them would make the code easier to maintain, debug, and extend down the road!
Proposed Solution
Here are some ideas to make things more organized:
1. File Organization
- Group related files into directories by stage: /stage_1_extraction/, /stage_2_conversion/, etc.
- Pick one naming style and stick with it (either data_extraction.py or dataExtraction.py - your preference!)
- Maybe rename dataextraction_eval2.py to data_extraction_test.py? Just clearer about what it does
- Create a /utils/ directory for shared code
2. Code Consolidation
- Pull those repeated functions into /utils/data_utils.py so we only maintain them in one place
- Create a base GPT class in /utils/model_utils.py that all the stages can use
- Move all the prompts into /utils/prompt_templates.py to keep them separate from the logic
3. Better Config Usage
- Add all the file paths to config.yaml (like SYNTHETIC_FILE, SKYWORK_FILE)
- Include hyperparameters and sampling targets
- Make sure we're actually reading from config everywhere instead of hardcoding
- Could add a config_loader utility to make this consistent (see the sketch after this list)
4. Complete or Clarify Test Pipeline
- Either extend the test data through all 8 stages, or
- Add a note explaining why it only goes through extraction
5. Small Cleanup
- Remove any emojis from the code
- Ensure consistent formatting
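
To make the config suggestion concrete, here is a rough sketch of what a shared loader in /utils/config_loader.py could look like. The key names (e.g. paths.SKYWORK_FILE, model.name) and the dotted-key accessor are purely illustrative assumptions, not the project's actual config schema:

```python
# utils/config_loader.py -- hypothetical sketch of the suggested config utility.
# Requires PyYAML. Key names below are illustrative, not the repo's real schema.
from functools import lru_cache
from pathlib import Path

import yaml

# Assumes this file lives in utils/ one level below the repo root.
CONFIG_PATH = Path(__file__).resolve().parent.parent / "config.yaml"


@lru_cache(maxsize=1)
def load_config(path: Path = CONFIG_PATH) -> dict:
    """Read config.yaml once and cache it so every stage sees the same settings."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def get(key: str, default=None):
    """Dotted-key accessor, e.g. get('paths.SKYWORK_FILE') or get('model.name')."""
    node = load_config()
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

Each stage could then call get("paths.SKYWORK_FILE") or get("model.name") instead of hardcoding paths and model names.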
Alternative Solutions
Option 1: Minimal changes
- Just consolidate the duplicate functions
- Keep the flat file structure
- Pros: Quick to implement
- Cons: Won't help much with long-term maintainability
Option 2: One big pipeline script
- Combine everything into a single main.py
- Pros: Simple to run
- Cons: Loses the nice modularity you have now
I'd recommend going with the proposed solution - it keeps your modular approach while making things cleaner.
Use Cases
This refactoring would help with:
- Adding new stages or modifying existing ones without breaking things
- Debugging individual stages in isolation
- Reusing the utilities in other projects
- Swapping out different models or prompts easily
- Running experiments with different hyperparameters
- Onboarding others to work on the pipeline
Implementation Ideas
Suggested approach:
- Start by creating the directory structure and moving files
- Pull duplicate code into utils modules
- Expand the config.yaml and update everything to use it
- Decide what to do with the test pipeline
- Update the documentation to reflect the new structure
You don't have to do all of this at once - could tackle it in phases!
Component Impact
- Core functionality - All 8 stages would need updates
- API - Not affected
- Docker/Infrastructure - Might need to update some paths
- Documentation - README would need updates
- Configuration - Expanding config.yaml usage
Additional Context
What's working well:
- No sensitive info exposed - nice job keeping things secure!
- The code is clear and easy to follow
- Good separation into distinct stages
Quick question:
The eval and test splits seem pretty small compared to training. Is this the standard ratio for DPO papers? Just want to make sure we're aligned with best practices.
About the test file:
The _test.json gets created in Stage 1 but doesn't flow through the rest of the pipeline. Was this planned for later, or should we complete it now?
Priority
- Nice to have
- Would be helpful
- Important for my use case
- Critical/blocking
This is important for maintainability but not blocking your current work. Let me know what you think about these suggestions!
Thanks a lot for the detailed and thoughtful feature request; this is extremely helpful for improving the maintainability and scalability of the 8-stage data construction pipeline.
1. File Organization Improvements
A clean directory hierarchy now exists.
2. Code Consolidation
Benefits:
3. Better Configuration Usage
4. Test Pipeline Clarification
Yes, the test slice (500 samples) is extracted in Stage 1 but not processed in the remaining stages.
5. Additional Cleanup
6. Why the Train/Val/Test Split Is Good for DPO
The train/val/test split of 45k / 4.4k / 500 works well for DPO because it matches how leading RLHF and preference-learning papers structure their datasets. DPO relies on pairwise comparisons between two responses, so it requires a large training set, and 45k pairs is directly in line with datasets used in InstructGPT (30–50k), Anthropic HH (~32k), SafeRLHF (30k), and PKU-SafeRLHF (29k). A much smaller validation set (~10% of train) is standard because it is only used for monitoring to prevent preference drift, not for learning. The test set should be small and strictly held out, containing only prompts to evaluate free-form generation rather than preference accuracy. Using 500 prompts aligns with common LLM benchmarks such as Arena-Hard (500 prompts) and gives a clearer picture of the model's win rate, factual score, and hallucination score. Overall, the split provides the right amount of training signal, efficient validation, and a clean test set, making it well aligned with DPO practice.
Please take a look at all the changes that were made, and let me know if anything else needs to be changed.
Thank you @AhmedRadwan02. I made the necessary changes in the respective files, and as for the …
As I have addressed all the comments by Ahmed, I'm merging this PR @shainarazavi
Summary
This PR introduces a fully automated, end-to-end 8-stage data-construction pipeline designed to generate high-quality, safety-aligned training and evaluation datasets for Original DPO and Modified Factual-DPO experiments using the Skywork Reward-Preference dataset.
The pipeline covers extraction, transformation, factuality scoring, synthetic corruption generation, balanced sampling, dataset merging, and Safe-DPO preference flipping, producing reproducible train/eval splits for alignment research.
PR Type
[New feature]
Changes Made
Stage 1: Data Extraction
- `dataextraction.py`
- `dataextraction_eval.py`
  - Saves `skywork_extracted_eval.jsonl` and removed samples.
- `dataextraction_eval2.py`
  - Saves `skywork_extracted_test.jsonl` and removed samples.
Stage 2: Preference Pair Conversion
- `dataconversion.py`
  - Loads the extracted training data (`skywork_extracted_77k.jsonl`).
  - Builds pairs of `response_0` and `response_1` with a `better_response_id`.
  - Saves to `skywork_preference_pairs_77k.jsonl`.
- `dataconversion_eval.py`
  - Loads the extracted eval data (`skywork_extracted_eval.jsonl`).
  - Saves to `skywork_preference_pairs_eval.jsonl`.
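
For illustration, a hedged sketch of what this conversion could look like, assuming the extracted records store the Skywork chosen/rejected dialogs as lists of `{"role", "content"}` turns and that the preferred answer's slot is randomized; the helper bodies are hypothetical stand-ins for the `extract_prompt_from_dialog` / `extract_answer_from_dialog` functions mentioned in the review, not the project's actual code:

```python
# Hypothetical sketch only -- the "chosen"/"rejected" dialog fields and the slot
# randomization are assumptions, not necessarily what dataconversion.py does.
import json
import random

def extract_prompt_from_dialog(dialog):
    # Treat the first user turn as the prompt.
    return next(turn["content"] for turn in dialog if turn["role"] == "user")

def extract_answer_from_dialog(dialog):
    # Treat the last assistant turn as the answer.
    return next(turn["content"] for turn in reversed(dialog) if turn["role"] == "assistant")

def convert_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            better = random.randint(0, 1)  # slot that receives the preferred answer
            responses = [None, None]
            responses[better] = extract_answer_from_dialog(rec["chosen"])
            responses[1 - better] = extract_answer_from_dialog(rec["rejected"])
            fout.write(json.dumps({
                "prompt": extract_prompt_from_dialog(rec["chosen"]),
                "response_0": responses[0],
                "response_1": responses[1],
                "better_response_id": better,
            }, ensure_ascii=False) + "\n")
```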
Stage 3: Binary Factuality Labeling (Train + Eval)
- `dataset_train.py`
  - Loads the training preference pairs (`skywork_preference_pairs_train.jsonl`).
  - Adds `factual_flag_0`, `factual_flag_1` for each response.
  - Adds `h0`, `h1` for downstream filtering/loss computation.
  - Saves to `skywork_binary_factual_train.jsonl`.
- `dataset_eval.py`
  - Loads the eval preference pairs (`skywork_preference_pairs_eval.jsonl`).
  - Saves to `skywork_binary_factual_eval.jsonl`.
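
A minimal sketch of how `h0`/`h1` could be derived once the factual flags exist, assuming h is simply the complement of the factual flag; the actual labeling logic (e.g. an LLM-based factuality judge) is not shown and may differ:

```python
# Hypothetical sketch: the h = 1 - factual_flag relationship is an assumption.
import json

def add_hallucination_flags(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            rec["h0"] = 1 - int(rec["factual_flag_0"])  # 1 = response_0 hallucinated
            rec["h1"] = 1 - int(rec["factual_flag_1"])  # 1 = response_1 hallucinated
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")
```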
Stage 4: DPO-Ready Transformation (Train + Eval)
- `data_transform_train.py`
  - Loads the labeled training pairs (`skywork_binary_factual_train.jsonl`).
  - Maps `response_0`/`response_1` into chosen and rejected strictly using `better_response_id`.
  - Adds `h_w` → hallucination flag of the winner.
  - Adds `h_l` → hallucination flag of the loser.
  - Sets `flipped=False` for Safe-DPO compatibility.
  - Saves to `skywork_first_transformed_train.jsonl`.
- `data_transform_eval.py`
  - Loads the labeled eval pairs (`skywork_binary_factual_eval.jsonl`).
  - Saves to `skywork_first_transformed_eval.jsonl`.
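
A minimal sketch of this chosen/rejected mapping, assuming the field names above (prompt, response_0/1, better_response_id, h0/h1 from Stage 3); the real data_transform_train.py may structure records differently:

```python
# Hypothetical sketch of the DPO-ready mapping described above.
import json

def to_dpo_record(rec):
    winner = rec["better_response_id"]        # 0 or 1
    loser = 1 - winner
    return {
        "prompt": rec["prompt"],
        "chosen": rec[f"response_{winner}"],  # preferred answer
        "rejected": rec[f"response_{loser}"], # dispreferred answer
        "h_w": rec[f"h{winner}"],             # hallucination flag of the winner
        "h_l": rec[f"h{loser}"],              # hallucination flag of the loser
        "flipped": False,                     # Safe-DPO compatibility flag
    }

def transform_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(json.dumps(to_dpo_record(json.loads(line)), ensure_ascii=False) + "\n")
```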
Stage 5: Synthetic Corruption Generation (Train + Eval)
- `data_synthetic_train.py`
  - Loads the transformed training set (`skywork_first_transformed_train.jsonl`).
  - Selects pairs where the winner is `h_w = 0` and the loser is `h_l = 1`.
  - Generates inverted samples where `chosen` = corrupted (hallucinated) and `rejected` = original factual answer.
  - Saves to `synthetic_llm_inversion_train_10k.jsonl`.
- `data_synthetic_eval.py`
  - Mirrors the same corruption generation process for the evaluation split.
  - Selects factual (0,1) pairs and produces hallucinated inversions.
  - Generates 400 synthetic eval corruption samples.
  - Saves to `synthetic_llm_inversion_eval_400.jsonl`.
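
A rough sketch of the inversion step, with `corrupt_answer` as a placeholder for the actual LLM corruption prompt used by the pipeline; the flag values written onto the inverted record are my assumption:

```python
# Hypothetical sketch -- corrupt_answer stands in for the real LLM call that
# rewrites a factual answer into a hallucinated one.
import json

def corrupt_answer(prompt, factual_answer):
    # Placeholder: the real pipeline would prompt the model to inject
    # hallucinations into factual_answer and return the corrupted text.
    raise NotImplementedError

def build_inversions(in_path, out_path, limit=10_000):
    count = 0
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            # Only factual-winner / hallucinated-loser pairs are eligible.
            if rec["h_w"] != 0 or rec["h_l"] != 1:
                continue
            inverted = {
                "prompt": rec["prompt"],
                "chosen": corrupt_answer(rec["prompt"], rec["chosen"]),  # hallucinated
                "rejected": rec["chosen"],                               # original factual answer
                "h_w": 1,   # assumption: corrupted winner is flagged as hallucinated
                "h_l": 0,
                "flipped": False,
            }
            fout.write(json.dumps(inverted, ensure_ascii=False) + "\n")
            count += 1
            if count >= limit:
                break
```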
Stage 6: Final Dataset Merge (Train + Eval)
- `data_merge_train.py`
  - Loads the synthetic training inversions (`synthetic_llm_inversion_train_10k.jsonl`).
  - Loads the transformed training set (`skywork_first_transformed_train.jsonl`).
  - Buckets samples into:
    - (0,0) → both factual
    - (1,1) → both hallucinated
    - (0,1) → factual winner, hallucinated loser
  - Combines the synthetic inversions with the (0,1) group.
  - Merges the (0,0), (1,1), and (0,1) buckets and saves to `skywork_final_train.jsonl`.
- `data_merge_eval.py`
  - Loads 400 synthetic eval inversions (`synthetic_llm_inversion_eval_400.jsonl`).
  - Loads the transformed evaluation set (`skywork_first_transformed_eval.jsonl`).
  - Buckets into (0,0), (1,1), (0,1) but keeps all real eval samples.
  - Merges: synthetic + all real eval data.
  - Shuffles and outputs the final eval dataset → `skywork_final_eval.jsonl`.
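
As a small illustration, the eval-side merge can be sketched as loading both JSONL files, concatenating, shuffling with a fixed seed, and writing the result; the seed and helper names are assumptions:

```python
# Hypothetical sketch of the eval merge: synthetic inversions + all real eval samples.
import json
import random

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def merge_eval(synthetic_path, real_path, out_path, seed=42):
    merged = load_jsonl(synthetic_path) + load_jsonl(real_path)
    random.Random(seed).shuffle(merged)  # fixed seed keeps the merge reproducible
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in merged:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```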
Stage 7: Final Balanced Dataset Construction (Train + Eval)
- `data_final_train.py`
  - Loads the merged training dataset (`skywork_final_train.jsonl`).
  - Buckets samples based on factuality pair type (`h_w`, `h_l`):
    - (0,1) → factual winner, hallucinated loser
    - (1,0) → hallucinated winner, factual loser
    - (0,0) → both factual
    - (1,1) → both hallucinated
  - Performs balanced sampling with exact target counts:
    - (0,1) → 10,000
    - (1,0) → 10,000
    - (0,0) → 15,000
    - (1,1) → 10,000
  - Supports sampling with replacement if a bucket is too small.
  - Shuffles and saves the final balanced dataset → `train_finallast.jsonl`.
- `data_final_eval.py`
  - Loads:
    - 400 synthetic eval inversions (`synthetic_llm_inversion_eval_400.jsonl`)
    - the full Skywork eval transformed set (`skywork_first_transformed_eval.jsonl`)
    - the Skywork training source (`skywork_final_train.jsonl`)
    - the final train dataset (`train_finallast.jsonl`) to avoid leakage
  - Excludes any sample that appears in the training set.
  - Adds additional clean samples from training buckets:
    - 1500 from (1,1)
    - 1500 from (0,0)
  - Merges:
    - all synthetic eval samples
    - all real Skywork eval samples
    - +1500 (1,1)
    - +1500 (0,0)
  - Shuffles and saves → `eval_final.jsonl`.
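
A compact sketch of the balanced sampling, using the target counts listed above; the bucket keys follow the (h_w, h_l) convention, while the seed and helper names are assumptions:

```python
# Hypothetical sketch of the balanced sampling in data_final_train.py.
import json
import random

TARGETS = {(0, 1): 10_000, (1, 0): 10_000, (0, 0): 15_000, (1, 1): 10_000}

def build_balanced_train(in_path, out_path, seed=42):
    rng = random.Random(seed)
    buckets = {key: [] for key in TARGETS}
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            key = (rec["h_w"], rec["h_l"])
            if key in buckets:
                buckets[key].append(rec)

    final = []
    for key, target in TARGETS.items():
        pool = buckets[key]
        if len(pool) >= target:
            final.extend(rng.sample(pool, target))    # without replacement
        else:
            final.extend(rng.choices(pool, k=target)) # with replacement if the bucket is too small
    rng.shuffle(final)

    with open(out_path, "w", encoding="utf-8") as f:
        for rec in final:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```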
Stage 8: Preference Label Flipping (Train + Eval)
- `data_flipped_train.py`
  - Loads the final balanced training set (`train_finallast.jsonl`).
  - Flips the preference wherever (`h_w`, `h_l`) = (1, 0), meaning the winner is hallucinated and the loser is factual.
  - Saves to `train_finallast_flipped.jsonl`.
- `data_flipped_eval.py`
  - Loads the final evaluation dataset (`eval_final.jsonl`).
  - Applies the exact same flipping rule on (1,0) preference pairs.
  - Saves the processed version as `eval_final_flipped.jsonl`.
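
A short sketch of the flipping rule, under the same field assumptions as above; whether the `h_w`/`h_l` flags are swapped and `flipped` set to True alongside the chosen/rejected swap is my assumption:

```python
# Hypothetical sketch of the Safe-DPO preference flip on (h_w, h_l) = (1, 0) pairs.
import json

def flip_if_needed(rec):
    if (rec["h_w"], rec["h_l"]) == (1, 0):
        # Winner is hallucinated, loser is factual: swap the preference.
        rec["chosen"], rec["rejected"] = rec["rejected"], rec["chosen"]
        rec["h_w"], rec["h_l"] = rec["h_l"], rec["h_w"]  # assumption: flags follow the swap
        rec["flipped"] = True                            # assumption: flipped pairs are marked
    return rec

def flip_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(json.dumps(flip_if_needed(json.loads(line)), ensure_ascii=False) + "\n")
```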
Checklist