Data Construction Pipeline for Original DPO & Modified (Factual-Aware) DPO #6
Conversation
AhmedRadwan02 left a comment
name: Feature Request
about: Suggest a new feature or enhancement
title: '[FEATURE] Code Organization and Refactoring for 8-Stage Data Pipeline'
labels: 'enhancement'
assignees: ''
Problem Statement
Hey! First off, great work on getting this 8-stage pipeline working end-to-end. The implementation is solid and functional. I noticed a few organizational things that could make the codebase easier to work with as it grows:
- All the files are currently in the root directory with some inconsistent naming (like dataextraction.py vs dataextraction_eval2.py)
- There's quite a bit of duplicate code across files - functions like extract_prompt_from_dialog and extract_answer_from_dialog appear in multiple places
- Some paths and parameters are hardcoded even though we have a config.yaml
- The model code, prompts, and data processing logic are all mixed together in the same files
- I noticed the test data file (_test.json) gets created but isn't used in the later stages - was this intentional?
- Quick question: are the data split sizes (train: 77k, eval: ~1k, test: 500) aligned with what's typical in DPO research?
- The config.yaml has the model name but it's still hardcoded in places
These aren't critical issues, but addressing them would make the code easier to maintain, debug, and extend down the road!
Proposed Solution
Here are some ideas to make things more organized:
1. File Organization
- Group related files into directories by stage: /stage_1_extraction/, /stage_2_conversion/, etc.
- Pick one naming style and stick with it (either data_extraction.py or dataExtraction.py - your preference!)
- Maybe rename dataextraction_eval2.py to data_extraction_test.py? Just clearer about what it does
- Create a /utils/ directory for shared code
2. Code Consolidation
- Pull those repeated functions into /utils/data_utils.py so we only maintain them in one place
- Create a base GPT class in /utils/model_utils.py that all the stages can use
- Move all the prompts into /utils/prompt_templates.py to keep them separate from the logic
3. Better Config Usage
- Add all the file paths to config.yaml (like SYNTHETIC_FILE, SKYWORK_FILE)
- Include hyperparameters and sampling targets
- Make sure we're actually reading from config everywhere instead of hardcoding
- Could add a config_loader utility to make this consistent (see the sketch after this list)
4. Complete or Clarify Test Pipeline
- Either extend the test data through all 8 stages, or
- Add a note explaining why it only goes through extraction
5. Small Cleanup
- Remove any emojis from the code
- Ensure consistent formatting
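
To make the config suggestion concrete, here is a rough sketch of what a shared loader in /utils/config_loader.py could look like. The key names (e.g. paths.SKYWORK_FILE, model.name) and the dotted-key accessor are purely illustrative assumptions, not the project's actual config schema:

```python
# utils/config_loader.py -- hypothetical sketch of the suggested config utility.
# Requires PyYAML. Key names below are illustrative, not the repo's real schema.
from functools import lru_cache
from pathlib import Path

import yaml

# Assumes this file lives in utils/ one level below the repo root.
CONFIG_PATH = Path(__file__).resolve().parent.parent / "config.yaml"


@lru_cache(maxsize=1)
def load_config(path: Path = CONFIG_PATH) -> dict:
    """Read config.yaml once and cache it so every stage sees the same settings."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def get(key: str, default=None):
    """Dotted-key accessor, e.g. get('paths.SKYWORK_FILE') or get('model.name')."""
    node = load_config()
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

Each stage could then call get("paths.SKYWORK_FILE") or get("model.name") instead of hardcoding paths and model names.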
Alternative Solutions
Option 1: Minimal changes
- Just consolidate the duplicate functions
- Keep the flat file structure
- Pros: Quick to implement
- Cons: Won't help much with long-term maintainability
Option 2: One big pipeline script
- Combine everything into a single main.py
- Pros: Simple to run
- Cons: Loses the nice modularity you have now
I'd recommend going with the proposed solution - it keeps your modular approach while making things cleaner.
Use Cases
This refactoring would help with:
- Adding new stages or modifying existing ones without breaking things
- Debugging individual stages in isolation
- Reusing the utilities in other projects
- Swapping out different models or prompts easily
- Running experiments with different hyperparameters
- Onboarding others to work on the pipeline
Implementation Ideas
Suggested approach:
- Start by creating the directory structure and moving files
- Pull duplicate code into utils modules
- Expand the config.yaml and update everything to use it
- Decide what to do with the test pipeline
- Update the documentation to reflect the new structure
You don't have to do all of this at once - could tackle it in phases!
Component Impact
- Core functionality - All 8 stages would need updates
- API - Not affected
- Docker/Infrastructure - Might need to update some paths
- Documentation - README would need updates
- Configuration - Expanding config.yaml usage
Additional Context
What's working well:
- No sensitive info exposed - nice job keeping things secure!
- The code is clear and easy to follow
- Good separation into distinct stages
Quick question:
The eval and test splits seem pretty small compared to training. Is this the standard ratio for DPO papers? Just want to make sure we're aligned with best practices.
About the test file:
The _test.json gets created in Stage 1 but doesn't flow through the rest of the pipeline. Was this planned for later, or should we complete it now?
Priority
- Nice to have
- Would be helpful
- Important for my use case
- Critical/blocking
This is important for maintainability but not blocking your current work. Let me know what you think about these suggestions!
Thanks a lot for the detailed and thoughtful feature request; this is extremely helpful for improving the maintainability and scalability of the 8-stage data construction pipeline.
1. File Organization Improvements
A clean directory hierarchy now exists.
2. Code Consolidation
Benefits:
3. Better Configuration Usage
4. Test Pipeline Clarification
Yes, the test slice (500 samples) is extracted in Stage 1 but not processed in the remaining stages.
5. Additional Cleanup
6. Why the Train/Val/Test Split Is Good for DPO
The train/val/test split of 45k / 4.4k / 500 works well for DPO because it matches how leading RLHF and preference-learning papers structure their datasets. DPO relies on pairwise comparisons between two responses, so it requires a large training set, and 45k pairs is directly in line with datasets used in InstructGPT (30–50k), Anthropic HH (~32k), SafeRLHF (30k), and PKU-SafeRLHF (29k). A much smaller validation set (~10% of train) is standard because it is only used for monitoring to prevent preference drift, not for learning. The test set should be small and strictly held out, containing only prompts to evaluate free-form generation rather than preference accuracy. Using 500 prompts aligns with common LLM benchmarks such as Arena-Hard (500 prompts) and gives a clearer picture of the model's win rate, factual score, and hallucination score. Overall, the split provides the right amount of training signal, efficient validation, and a clean test set, making it well aligned with DPO practice.
Please take a look at all the changes that were made, and let me know if anything else needs to be changed.
Thank you @AhmedRadwan02. I made the necessary changes in the respective files, and as for the …
As I have addressed all the comments by Ahmed, I'm merging this PR @shainarazavi
Summary
This PR introduces a fully automated, end-to-end 8-stage data-construction pipeline designed to generate high-quality, safety-aligned training and evaluation datasets for Original DPO and Modified Factual-DPO experiments using the Skywork Reward-Preference dataset.
The pipeline covers extraction, transformation, factuality scoring, synthetic corruption generation, balanced sampling, dataset merging, and Safe-DPO preference flipping, producing reproducible train/eval splits for alignment research.
PR Type
[New feature]
Changes Made
Stage 1: Data Extraction
- `dataextraction.py`
- `dataextraction_eval.py`
  - Saves `skywork_extracted_eval.jsonl` and removed samples.
- `dataextraction_eval2.py`
  - Saves `skywork_extracted_test.jsonl` and removed samples.
Stage 2: Preference Pair Conversion
- `dataconversion.py`
  - Loads the extracted training data (`skywork_extracted_77k.jsonl`).
  - Builds pairs of `response_0` and `response_1` with a `better_response_id`.
  - Saves to `skywork_preference_pairs_77k.jsonl`.
- `dataconversion_eval.py`
  - Loads the extracted eval data (`skywork_extracted_eval.jsonl`).
  - Saves to `skywork_preference_pairs_eval.jsonl`.
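
For illustration, a hedged sketch of what this conversion could look like, assuming the extracted records store the Skywork chosen/rejected dialogs as lists of `{"role", "content"}` turns and that the preferred answer's slot is randomized; the helper bodies are hypothetical stand-ins for the `extract_prompt_from_dialog` / `extract_answer_from_dialog` functions mentioned in the review, not the project's actual code:

```python
# Hypothetical sketch only -- the "chosen"/"rejected" dialog fields and the slot
# randomization are assumptions, not necessarily what dataconversion.py does.
import json
import random

def extract_prompt_from_dialog(dialog):
    # Treat the first user turn as the prompt.
    return next(turn["content"] for turn in dialog if turn["role"] == "user")

def extract_answer_from_dialog(dialog):
    # Treat the last assistant turn as the answer.
    return next(turn["content"] for turn in reversed(dialog) if turn["role"] == "assistant")

def convert_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            better = random.randint(0, 1)  # slot that receives the preferred answer
            responses = [None, None]
            responses[better] = extract_answer_from_dialog(rec["chosen"])
            responses[1 - better] = extract_answer_from_dialog(rec["rejected"])
            fout.write(json.dumps({
                "prompt": extract_prompt_from_dialog(rec["chosen"]),
                "response_0": responses[0],
                "response_1": responses[1],
                "better_response_id": better,
            }, ensure_ascii=False) + "\n")
```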
Stage 3: Binary Factuality Labeling (Train + Eval)
- `dataset_train.py`
  - Loads the training preference pairs (`skywork_preference_pairs_train.jsonl`).
  - Adds `factual_flag_0`, `factual_flag_1` for each response.
  - Adds `h0`, `h1` for downstream filtering/loss computation.
  - Saves to `skywork_binary_factual_train.jsonl`.
- `dataset_eval.py`
  - Loads the eval preference pairs (`skywork_preference_pairs_eval.jsonl`).
  - Saves to `skywork_binary_factual_eval.jsonl`.
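
A minimal sketch of how `h0`/`h1` could be derived once the factual flags exist, assuming h is simply the complement of the factual flag; the actual labeling logic (e.g. an LLM-based factuality judge) is not shown and may differ:

```python
# Hypothetical sketch: the h = 1 - factual_flag relationship is an assumption.
import json

def add_hallucination_flags(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            rec["h0"] = 1 - int(rec["factual_flag_0"])  # 1 = response_0 hallucinated
            rec["h1"] = 1 - int(rec["factual_flag_1"])  # 1 = response_1 hallucinated
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")
```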
Stage 4: DPO-Ready Transformation (Train + Eval)
- `data_transform_train.py`
  - Loads the labeled training pairs (`skywork_binary_factual_train.jsonl`).
  - Maps `response_0`/`response_1` into chosen and rejected strictly using `better_response_id`.
  - Adds `h_w` → hallucination flag of the winner.
  - Adds `h_l` → hallucination flag of the loser.
  - Sets `flipped=False` for Safe-DPO compatibility.
  - Saves to `skywork_first_transformed_train.jsonl`.
- `data_transform_eval.py`
  - Loads the labeled eval pairs (`skywork_binary_factual_eval.jsonl`).
  - Saves to `skywork_first_transformed_eval.jsonl`.
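
A minimal sketch of this chosen/rejected mapping, assuming the field names above (prompt, response_0/1, better_response_id, h0/h1 from Stage 3); the real data_transform_train.py may structure records differently:

```python
# Hypothetical sketch of the DPO-ready mapping described above.
import json

def to_dpo_record(rec):
    winner = rec["better_response_id"]        # 0 or 1
    loser = 1 - winner
    return {
        "prompt": rec["prompt"],
        "chosen": rec[f"response_{winner}"],  # preferred answer
        "rejected": rec[f"response_{loser}"], # dispreferred answer
        "h_w": rec[f"h{winner}"],             # hallucination flag of the winner
        "h_l": rec[f"h{loser}"],              # hallucination flag of the loser
        "flipped": False,                     # Safe-DPO compatibility flag
    }

def transform_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(json.dumps(to_dpo_record(json.loads(line)), ensure_ascii=False) + "\n")
```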
Stage 5: Synthetic Corruption Generation (Train + Eval)
- `data_synthetic_train.py`
  - Loads the transformed training set (`skywork_first_transformed_train.jsonl`).
  - Selects pairs where the winner is `h_w = 0` and the loser is `h_l = 1`.
  - Generates inverted samples where `chosen` = corrupted (hallucinated) and `rejected` = original factual answer.
  - Saves to `synthetic_llm_inversion_train_10k.jsonl`.
- `data_synthetic_eval.py`
  - Mirrors the same corruption generation process for the evaluation split.
  - Selects factual (0,1) pairs and produces hallucinated inversions.
  - Generates 400 synthetic eval corruption samples.
  - Saves to `synthetic_llm_inversion_eval_400.jsonl`.
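
A rough sketch of the inversion step, with `corrupt_answer` as a placeholder for the actual LLM corruption prompt used by the pipeline; the flag values written onto the inverted record are my assumption:

```python
# Hypothetical sketch -- corrupt_answer stands in for the real LLM call that
# rewrites a factual answer into a hallucinated one.
import json

def corrupt_answer(prompt, factual_answer):
    # Placeholder: the real pipeline would prompt the model to inject
    # hallucinations into factual_answer and return the corrupted text.
    raise NotImplementedError

def build_inversions(in_path, out_path, limit=10_000):
    count = 0
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            # Only factual-winner / hallucinated-loser pairs are eligible.
            if rec["h_w"] != 0 or rec["h_l"] != 1:
                continue
            inverted = {
                "prompt": rec["prompt"],
                "chosen": corrupt_answer(rec["prompt"], rec["chosen"]),  # hallucinated
                "rejected": rec["chosen"],                               # original factual answer
                "h_w": 1,   # assumption: corrupted winner is flagged as hallucinated
                "h_l": 0,
                "flipped": False,
            }
            fout.write(json.dumps(inverted, ensure_ascii=False) + "\n")
            count += 1
            if count >= limit:
                break
```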
Stage 6: Final Dataset Merge (Train + Eval)
- `data_merge_train.py`
  - Loads the synthetic training inversions (`synthetic_llm_inversion_train_10k.jsonl`).
  - Loads the transformed training set (`skywork_first_transformed_train.jsonl`).
  - Buckets samples into:
    - (0,0) → both factual
    - (1,1) → both hallucinated
    - (0,1) → factual winner, hallucinated loser
  - Combines the synthetic inversions with the (0,1) group.
  - Merges the (0,0), (1,1), and (0,1) buckets and saves to `skywork_final_train.jsonl`.
- `data_merge_eval.py`
  - Loads 400 synthetic eval inversions (`synthetic_llm_inversion_eval_400.jsonl`).
  - Loads the transformed evaluation set (`skywork_first_transformed_eval.jsonl`).
  - Buckets into (0,0), (1,1), (0,1) but keeps all real eval samples.
  - Merges: synthetic + all real eval data.
  - Shuffles and outputs the final eval dataset → `skywork_final_eval.jsonl`.
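
As a small illustration, the eval-side merge can be sketched as loading both JSONL files, concatenating, shuffling with a fixed seed, and writing the result; the seed and helper names are assumptions:

```python
# Hypothetical sketch of the eval merge: synthetic inversions + all real eval samples.
import json
import random

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def merge_eval(synthetic_path, real_path, out_path, seed=42):
    merged = load_jsonl(synthetic_path) + load_jsonl(real_path)
    random.Random(seed).shuffle(merged)  # fixed seed keeps the merge reproducible
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in merged:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```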
Stage 7: Final Balanced Dataset Construction (Train + Eval)
- `data_final_train.py`
  - Loads the merged training dataset (`skywork_final_train.jsonl`).
  - Buckets samples based on factuality pair type (`h_w`, `h_l`):
    - (0,1) → factual winner, hallucinated loser
    - (1,0) → hallucinated winner, factual loser
    - (0,0) → both factual
    - (1,1) → both hallucinated
  - Performs balanced sampling with exact target counts:
    - (0,1) → 10,000
    - (1,0) → 10,000
    - (0,0) → 15,000
    - (1,1) → 10,000
  - Supports sampling with replacement if a bucket is too small.
  - Shuffles and saves the final balanced dataset → `train_finallast.jsonl`.
- `data_final_eval.py`
  - Loads:
    - 400 synthetic eval inversions (`synthetic_llm_inversion_eval_400.jsonl`)
    - the full Skywork eval transformed set (`skywork_first_transformed_eval.jsonl`)
    - the Skywork training source (`skywork_final_train.jsonl`)
    - the final train dataset (`train_finallast.jsonl`) to avoid leakage
  - Excludes any sample that appears in the training set.
  - Adds additional clean samples from training buckets:
    - 1500 from (1,1)
    - 1500 from (0,0)
  - Merges:
    - all synthetic eval samples
    - all real Skywork eval samples
    - +1500 (1,1)
    - +1500 (0,0)
  - Shuffles and saves → `eval_final.jsonl`.
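
A compact sketch of the balanced sampling, using the target counts listed above; the bucket keys follow the (h_w, h_l) convention, while the seed and helper names are assumptions:

```python
# Hypothetical sketch of the balanced sampling in data_final_train.py.
import json
import random

TARGETS = {(0, 1): 10_000, (1, 0): 10_000, (0, 0): 15_000, (1, 1): 10_000}

def build_balanced_train(in_path, out_path, seed=42):
    rng = random.Random(seed)
    buckets = {key: [] for key in TARGETS}
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            key = (rec["h_w"], rec["h_l"])
            if key in buckets:
                buckets[key].append(rec)

    final = []
    for key, target in TARGETS.items():
        pool = buckets[key]
        if len(pool) >= target:
            final.extend(rng.sample(pool, target))    # without replacement
        else:
            final.extend(rng.choices(pool, k=target)) # with replacement if the bucket is too small
    rng.shuffle(final)

    with open(out_path, "w", encoding="utf-8") as f:
        for rec in final:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```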
Stage 8: Preference Label Flipping (Train + Eval)
- `data_flipped_train.py`
  - Loads the final balanced training set (`train_finallast.jsonl`).
  - Flips the preference wherever (`h_w`, `h_l`) = (1, 0), meaning the winner is hallucinated and the loser is factual.
  - Saves to `train_finallast_flipped.jsonl`.
- `data_flipped_eval.py`
  - Loads the final evaluation dataset (`eval_final.jsonl`).
  - Applies the exact same flipping rule on (1,0) preference pairs.
  - Saves the processed version as `eval_final_flipped.jsonl`.
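
A short sketch of the flipping rule, under the same field assumptions as above; whether the `h_w`/`h_l` flags are swapped and `flipped` set to True alongside the chosen/rejected swap is my assumption:

```python
# Hypothetical sketch of the Safe-DPO preference flip on (h_w, h_l) = (1, 0) pairs.
import json

def flip_if_needed(rec):
    if (rec["h_w"], rec["h_l"]) == (1, 0):
        # Winner is hallucinated, loser is factual: swap the preference.
        rec["chosen"], rec["rejected"] = rec["rejected"], rec["chosen"]
        rec["h_w"], rec["h_l"] = rec["h_l"], rec["h_w"]  # assumption: flags follow the swap
        rec["flipped"] = True                            # assumption: flipped pairs are marked
    return rec

def flip_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(json.dumps(flip_if_needed(json.loads(line)), ensure_ascii=False) + "\n")
```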
Checklist