
[POWSM] Improve data prep for powsm #6340

Merged: sw005320 merged 9 commits into espnet:master from chinjouli:powsm_updates on Jan 27, 2026

Conversation

@chinjouli
Contributor

@chinjouli chinjouli commented Jan 16, 2026

What did you change?

Changes are made only under the POWSM recipe:

  • run.sh: Update BPE text input.
  • local/data_prep.py: Fix text normalization and improve readability.
  • local/process_ipapack.py: Regenerate text from transcript_normalized.csv more efficiently, and merge in the functionality from subset.py. Remove plotting functions.
  • local/subset.py: Deleted.

Why did you make this change?

Make data preparation more efficient, and apply ASR text normalization.


Is your PR small enough?

yes


Additional Context

Related PR #6341

@dosubot dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files) and ASR (Automatic speech recognition) labels Jan 16, 2026
@mergify mergify bot added the ESPnet2 label Jan 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the data preparation scripts for the POWSM recipe, improving efficiency and correctness. Key changes include updating text normalization, streamlining file generation by reading directly from a CSV, and removing obsolete scripts and plotting functions. I've identified a few critical issues in egs2/powsm/s2t1/local/process_ipapack.py that need to be addressed: an incorrect function signature that will cause a runtime error, a critical typo in a shell command, and several instances of unsafe or fragile coding practices that should be improved for robustness and maintainability.

chinjouli and others added 7 commits January 16, 2026 10:41
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Refactor file handling in data preparation to use context managers for automatic resource management.
Refactor CSV reading to use DictReader for better readability and access to columns by name.
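
For readers unfamiliar with the two refactors named in the commit messages above, here is a minimal sketch of the general pattern: opening files with context managers and reading CSV rows by column name with `csv.DictReader`. It is not the code from this PR. The function name `write_kaldi_text` and the column names `utt_id` and `text` are hypothetical, chosen only for illustration.

```python
import csv


def write_kaldi_text(csv_path, out_path, id_col="utt_id", text_col="text"):
    """Write one '<id> <text>' line per CSV row and return the row count.

    The column names are assumptions for this sketch, not the actual
    header of transcript_normalized.csv.
    """
    count = 0
    # Both files are closed automatically, even if an exception is raised.
    with open(csv_path, newline="", encoding="utf-8") as fin, open(
        out_path, "w", encoding="utf-8"
    ) as fout:
        for row in csv.DictReader(fin):
            # Columns are accessed by header name rather than by position.
            fout.write(f"{row[id_col]} {row[text_col].strip()}\n")
            count += 1
    return count
```

Accessing columns by name keeps the script working if the CSV column order changes, and the `with` block removes the need for explicit `close()` calls.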
@codecov

codecov bot commented Jan 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.41%. Comparing base (f57e5ef) to head (8b5de61).
⚠️ Report is 65 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6340      +/-   ##
==========================================
+ Coverage   69.39%   69.41%   +0.01%     
==========================================
  Files         759      759              
  Lines       69853    69853              
==========================================
+ Hits        48476    48487      +11     
+ Misses      21377    21366      -11     
Flag                       Coverage      Δ
test_integration_espnet2   46.96% <ø>    (ø)
test_python_espnet2        62.38% <ø>    (+0.30%) ⬆️
test_python_espnet3        16.06% <ø>    (ø)
test_utils                 62.38% <ø>    (+0.30%) ⬆️


@sw005320
Contributor

LGTM.
I'll merge this PR once the CI checks pass.

@sw005320 sw005320 added this to the v.202601 milestone Jan 27, 2026
@sw005320 sw005320 merged commit 665c25a into espnet:master Jan 27, 2026
31 checks passed