
Refactor: Clean Scripts into Utility #54

Merged
d2r3v merged 6 commits into dev from Refactor/Clean-Util
Mar 9, 2026

Conversation

@d2r3v
Collaborator

@d2r3v d2r3v commented Jan 26, 2026

Overview

Refactored the logic for cleaning and imputing Branching/PR labels into a reusable utility module,
process_model/clean.py. This unifies the logic and fixes missing dependencies for tests.

Changes

New Utility

Created process_model/clean.py containing:

  • clean_and_impute_branch_names
    Imputes missing branch names per PR by propagating the first valid name.

  • create_clean_branching_label_csv
    Logic ported from clean_lable.py.

  • create_clean_pr_label_csv
    Logic ported from get_clean_pr_label.py.
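
For reviewers skimming the description, the imputation rule behind clean_and_impute_branch_names can be sketched roughly as below. This is a stdlib-only illustration and not the code in this PR: the function name `impute_branch_names`, the `branch_name` column, and the in-memory list-of-dicts input are assumptions made for the example (only `pr_id` is named in this thread).

```python
from typing import Dict, List, Optional


def _is_missing(value: Optional[str]) -> bool:
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def impute_branch_names(
    rows: List[Dict[str, Optional[str]]],
    pr_col: str = "pr_id",
    branch_col: str = "branch_name",  # hypothetical column name
) -> List[Dict[str, Optional[str]]]:
    """Fill missing branch names per PR with that PR's first valid name."""
    # Pass 1: remember the first valid branch name seen for each PR.
    first_valid: Dict[Optional[str], str] = {}
    for row in rows:
        name = row.get(branch_col)
        if not _is_missing(name):
            first_valid.setdefault(row.get(pr_col), name)

    # Pass 2: copy each row and fill gaps; PRs with no valid name stay missing.
    result = []
    for row in rows:
        new_row = dict(row)
        current = new_row.get(branch_col)
        if _is_missing(current):
            new_row[branch_col] = first_valid.get(new_row.get(pr_col), current)
        result.append(new_row)
    return result


rows = [
    {"pr_id": "1", "branch_name": ""},
    {"pr_id": "1", "branch_name": "feature/a"},
    {"pr_id": "1", "branch_name": "feature/b"},
    {"pr_id": "2", "branch_name": None},
]
cleaned = impute_branch_names(rows)
print([r["branch_name"] for r in cleaned])
```

In this sketch only missing cells are overwritten, so a PR that already carries a second valid name keeps it, and a PR with no valid name at all is left untouched.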

Refactoring

  • Updated event_labelling/CodeStructure_Branching/clean_lable.py to import from process_model.clean.
  • Updated event_labelling/PR/get_clean_pr_label.py to import from process_model.clean.

Documentation

  • Updated code_structure_and_branching.md.
  • Updated pr_label.md.

Verification

  • Ran pytest test/testClean.py to verify branch name imputation and edge cases → PASSED.
  • Manually ran clean_lable.py on sample data to ensure backward compatibility → SUCCESS.

Closes #40

@d2r3v d2r3v requested review from AdaraPutri and Mahatav January 26, 2026 08:09
@d2r3v d2r3v self-assigned this Jan 26, 2026
@d2r3v d2r3v added the Refactor Restructure your thinking label Jan 26, 2026
@d2r3v d2r3v linked an issue Jan 26, 2026 that may be closed by this pull request
Comment thread documentation/code_structure_and_branching.md Outdated
Comment thread documentation/code_structure_and_branching.md Outdated
Collaborator

@Mahatav Mahatav left a comment

We should not have a clean file in the code or PR label folder. The main files for each should be updated to import the utility file. The clean files are not wrappers, and they are not optional.

@d2r3v
Collaborator Author

d2r3v commented Feb 7, 2026

Refactored and made compulsory.

@d2r3v d2r3v requested a review from Mahatav February 7, 2026 05:17
@AdaraPutri AdaraPutri mentioned this pull request Feb 7, 2026
Collaborator

@Mahatav Mahatav left a comment

The clean file should be in src/utils/.

@d2r3v
Collaborator Author

d2r3v commented Feb 9, 2026

Relocated the clean file.

@d2r3v d2r3v requested a review from Mahatav February 9, 2026 01:14
Collaborator

@Mahatav Mahatav left a comment

I think it needs to pull from dev so I can test it.

# Conflicts:
#	event_labelling/CodeStructure_Branching/main.py
#	event_labelling/PR/get_clean_pr_label.py
@d2r3v d2r3v requested a review from Mahatav March 2, 2026 07:27
@Mahatav Mahatav closed this Mar 2, 2026
@Mahatav Mahatav deleted the Refactor/Clean-Util branch March 2, 2026 17:27
Collaborator

@Mahatav Mahatav left a comment

looks good

@Mahatav Mahatav restored the Refactor/Clean-Util branch March 2, 2026 17:52
@Mahatav Mahatav reopened this Mar 2, 2026
Collaborator

@AdaraPutri AdaraPutri left a comment

Hey Dhruv, thanks for working on this refactor! I re-ran event_labelling/PR/pr_label.py (which calls the new util clean script) and verified that the resulting CLEAN csv matches what we were getting with the original PR clean script (just one difference in labelling, but that's only due to LLM response variability).

From what I checked, create_clean_pr_label_csv in src/utils/clean.py is keeping the same behavior as the old event_labelling/PR/get_clean_pr_label.py:

  • same event parsing (handles "['...']" strings + normal strings)
  • same timestamp rules (merge → merged_at, no_merge → updated_at, else created_at, with fallback to created_at)
  • still writes the raw original event cell into the output (so downstream stuff won’t break)
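
To make the first two rules above concrete, here is a rough stdlib-only sketch of how the event parsing and timestamp selection could look. It is not the code in src/utils/clean.py: the helper names `parse_event` and `pick_timestamp`, the choice to take the first element of a stringified list, and the exact string comparison against "merge" and "no_merge" are assumptions for illustration.

```python
import ast
from typing import Dict, Optional


def parse_event(cell: Optional[str]) -> str:
    """Return the event label from a plain string or a stringified list."""
    text = (cell or "").strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)  # safely parses "['merge']"
        except (ValueError, SyntaxError):
            return text  # not a valid literal, keep the raw text
        if isinstance(parsed, list) and parsed:
            return str(parsed[0]).strip()  # assumption: first label wins
    return text


def pick_timestamp(event: str, row: Dict[str, Optional[str]]) -> Optional[str]:
    """Choose the timestamp column by event, falling back to created_at."""
    if event == "merge":
        preferred = row.get("merged_at")
    elif event == "no_merge":
        preferred = row.get("updated_at")
    else:
        preferred = row.get("created_at")
    # An empty or absent preferred value falls back to created_at.
    return preferred or row.get("created_at")


row = {
    "created_at": "2026-01-01T00:00:00Z",
    "updated_at": "2026-01-03T00:00:00Z",
    "merged_at": "",
}
merge_ts = pick_timestamp(parse_event("['merge']"), row)
no_merge_ts = pick_timestamp(parse_event("no_merge"), row)
print(merge_ts, no_merge_ts)
```

Here the merge row has an empty merged_at, so it falls back to created_at, while the no_merge row picks updated_at.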

Also the new clean_and_impute_branch_names util looks correct for the branching csvs (per pr_id, fill missing branch names using the first valid one, treat empty strings as missing, and it exits early without crashing if the input is missing/empty).

Tests look good and cover the important cases:

  • basic imputation works
  • empty strings get imputed
  • multiple branch names per PR → uses the first one
  • all-missing PRs stay missing
  • missing cols / empty file / nonexistent file don’t crash
  • output dir gets created
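
The last two cases (bad input doesn't crash, output dir gets created) come down to a guard around the file I/O. A minimal sketch of that pattern, assuming a hypothetical helper `clean_csv_safely` that takes the cleaning step as a callable, might look like this. The real util may be structured differently (for example, it may not return a boolean).

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]


def clean_csv_safely(
    input_path: Path,
    output_path: Path,
    required_cols: Sequence[str],
    transform: Callable[[List[Row]], List[Row]],
) -> bool:
    """Run transform on a CSV; return False instead of raising on bad input."""
    src = Path(input_path)
    if not src.is_file():
        return False  # nonexistent file: exit early
    with src.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    if not rows or any(col not in fieldnames for col in required_cols):
        return False  # empty file or missing columns: exit early
    dst = Path(output_path)
    dst.parent.mkdir(parents=True, exist_ok=True)  # create the output dir
    with dst.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(transform(rows))
    return True


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "input.csv"
    source.write_text("pr_id,branch_name\n1,feature/a\n", encoding="utf-8")
    target = tmp_path / "nested" / "out" / "clean.csv"

    wrote = clean_csv_safely(source, target, ["pr_id", "branch_name"], lambda r: r)
    output_exists = target.is_file()
    missing = clean_csv_safely(tmp_path / "nope.csv", target, ["pr_id"], lambda r: r)
    bad_cols = clean_csv_safely(source, tmp_path / "x.csv", ["pr_id", "other"], lambda r: r)

print(wrote, output_exists, missing, bad_cols)
```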

LGTM from me 👍

@d2r3v d2r3v merged commit 1d6e841 into dev Mar 9, 2026

Labels

Refactor Restructure your thinking

Development

Successfully merging this pull request may close these issues.

Refactor Suggestions

3 participants