[POWSM] Improve data prep for powsm#6340
Conversation
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Code Review
This pull request refactors the data preparation scripts for the POWSM recipe, improving efficiency and correctness. Key changes include updating text normalization, streamlining file generation by reading directly from a CSV, and removing obsolete scripts and plotting functions. I've identified a few critical issues in egs2/powsm/s2t1/local/process_ipapack.py that need to be addressed: an incorrect function signature that will cause a runtime error, a critical typo in a shell command, and several instances of unsafe or fragile coding practices that should be improved for robustness and maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Refactor file handling in data preparation to use context managers for automatic resource management.
for more information, see https://pre-commit.ci
Refactor CSV reading to use DictReader for better readability and access to columns by name.
for more information, see https://pre-commit.ci
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #6340 +/- ##
==========================================
+ Coverage 69.39% 69.41% +0.01%
==========================================
Files 759 759
Lines 69853 69853
==========================================
+ Hits 48476 48487 +11
+ Misses 21377 21366 -11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
LGTM. |
What did you change?
Changes are only made under POWSM recipe:
run.sh: Update BPE text input.local/data_prep.py: Fix text normalization and improve readability.local/process_ipapack.py: Re-generate text fromtranscript_normalized.csvefficiently, and merge the function insubset.py. Remove plotting functions.local/subset.py: Deleted.Why did you make this change?
Make data preparation more efficient, and apply ASR text normalization.
Is your PR small enough?
yes
Additional Context
Related PR #6341