Conversation
Mahatav
left a comment
We should not have a clean file in the code or PR label folder. The main files for each should be updated to import the utility file. The clean files are not wrappers or optional.
Refactored and made compulsory.
Mahatav
left a comment
The clean file should be in src/utils/.
Relocated clean file.
Mahatav
left a comment
I think it needs to pull from dev so I can test it.
# Conflicts:
#   event_labelling/CodeStructure_Branching/main.py
#   event_labelling/PR/get_clean_pr_label.py
Hey Dhruv, thanks for working on this refactor! I re-ran event_labelling/PR/pr_label.py (which calls the new util clean script) and verified that the resulting CLEAN csv matches what we were getting with the original PR clean script (there is one difference in labelling, but that's just due to LLM response variability).
From what I checked, create_clean_pr_label_csv in src/utils/clean.py is keeping the same behavior as the old event_labelling/PR/get_clean_pr_label.py:
- same event parsing (handles "['...']" strings + normal strings)
- same timestamp rules (merge → merged_at, no_merge → updated_at, else created_at, with fallback to created_at)
- still writes the raw original event cell into the output (so downstream stuff won’t break)
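For anyone skimming, here is a rough sketch of the rules listed above as I understand them. It is not the code in src/utils/clean.py: the function names parse_event, pick_timestamp and clean_row, the "timestamp" output key, and the choice to take the first element of a stringified list are all assumptions made for illustration. Only the event labels and the merged_at / updated_at / created_at columns come from the PR.

```python
import ast


def parse_event(raw):
    """Return an event label from a plain string or a stringified list like "['merge']"."""
    text = str(raw).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text  # not a valid literal, treat it as a normal string
        if isinstance(parsed, (list, tuple)):
            # Assumption: the first element of the list is the label we want.
            return str(parsed[0]).strip() if parsed else ""
    return text


def pick_timestamp(event, row):
    """merge -> merged_at, no_merge -> updated_at, anything else -> created_at."""
    created = row.get("created_at")
    if event == "merge":
        chosen = row.get("merged_at")
    elif event == "no_merge":
        chosen = row.get("updated_at")
    else:
        chosen = created
    # Fall back to created_at when the preferred column is missing or empty.
    return chosen or created


def clean_row(row):
    """Keep the raw event cell untouched and attach the selected timestamp."""
    event = parse_event(row.get("event", ""))
    # "timestamp" is a hypothetical output column name.
    return {"event": row.get("event"), "timestamp": pick_timestamp(event, row)}
```

The raw cell is passed through unchanged on purpose, so anything downstream that expects the original "['...']" form keeps working.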
Also the new clean_and_impute_branch_names util looks correct for the branching csvs (per pr_id, fill missing branch names using the first valid one, treat empty strings as missing, and it exits early without crashing if the input is missing/empty).
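To make the imputation rule concrete, a minimal sketch on plain dict rows follows. It is an illustration, not the util itself: the branch_name column name and the helper names are hypothetical, and it assumes only missing cells are filled while existing valid names are left alone.

```python
def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def impute_branch_names(rows, id_key="pr_id", name_key="branch_name"):
    """Fill missing branch names with the first valid name seen for the same PR."""
    # First pass: remember the first valid branch name per PR.
    first_valid = {}
    for row in rows:
        name = row.get(name_key)
        if not is_missing(name):
            first_valid.setdefault(row[id_key], name)

    # Second pass: fill only the missing cells; PRs with no valid name stay missing.
    imputed = []
    for row in rows:
        filled = dict(row)
        if is_missing(filled.get(name_key)) and row[id_key] in first_valid:
            filled[name_key] = first_valid[row[id_key]]
        imputed.append(filled)
    return imputed
```

Two passes keep the behaviour independent of row order within a PR: a missing cell that appears before the first valid name is still filled.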
Tests look good and cover the important cases:
- basic imputation works
- empty strings get imputed
- multiple branch names per PR → uses the first one
- all-missing PRs stay missing
- missing cols / empty file / nonexistent file don’t crash
- output dir gets created
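The file-handling cases in that list (missing columns, empty file, nonexistent file, output directory creation) could be sketched roughly as below. This is a stdlib-only illustration, not the actual util: clean_csv, REQUIRED_COLUMNS, the branch_name column and the transform hook are all hypothetical names.

```python
import csv
from pathlib import Path

# Assumed required columns; only pr_id is named in this PR.
REQUIRED_COLUMNS = {"pr_id", "branch_name"}


def clean_csv(input_path, output_path, transform=lambda rows: rows):
    """Return True if an output file was written, False on an early exit."""
    src = Path(input_path)
    # Nonexistent or zero-byte input: exit early instead of raising.
    if not src.is_file() or src.stat().st_size == 0:
        return False

    with src.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        rows = list(reader)

    # Header-only file or missing required columns: also an early exit.
    if not rows or not REQUIRED_COLUMNS.issubset(fieldnames):
        return False

    dst = Path(output_path)
    dst.parent.mkdir(parents=True, exist_ok=True)  # create the output dir if needed
    with dst.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        # transform must return rows that use the same columns as the input.
        writer.writerows(transform(rows))
    return True
```

Returning a boolean rather than raising lets a caller such as a labelling script skip a missing input and carry on with the next file.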
LGTM from me 👍
Overview
Refactored the logic for cleaning and imputing Branching/PR labels into a reusable utility module, process_model/clean.py. This unifies the logic and fixes missing dependencies for tests.
Changes
New Utility
Created process_model/clean.py containing:
- clean_and_impute_branch_names: Imputes missing branch names per PR by propagating the first valid name.
- create_clean_branching_label_csv: Logic ported from clean_lable.py.
- create_clean_pr_label_csv: Logic ported from get_clean_pr_label.py.
Refactoring
- event_labelling/CodeStructure_Branching/clean_lable.py to import from process_model.clean.
- event_labelling/PR/get_clean_pr_label.py to import from process_model.clean.
Documentation
- code_structure_and_branching.md.
- pr_label.md.
Verification
- pytest test/testClean.py to verify branch name imputation and edge cases → PASSED.
- clean_lable.py on sample data to ensure backward compatibility → SUCCESS.
Closes #40