[POWSM] Improve data prep for powsm by chinjouli · Pull Request #6340 · espnet/espnet

chinjouli · 2026-01-16T15:28:44Z

What did you change?

Changes are made only under the POWSM recipe:

run.sh: Update BPE text input.
local/data_prep.py: Fix text normalization and improve readability.
local/process_ipapack.py: Regenerate text from transcript_normalized.csv more efficiently, merge in the functionality of subset.py, and remove the plotting functions.
local/subset.py: Deleted.

Why did you make this change?

Make data preparation more efficient, and apply ASR text normalization.

Is your PR small enough?

yes

Additional Context

Related PR #6341


gemini-code-assist

Code Review

This pull request refactors the data preparation scripts for the POWSM recipe, improving efficiency and correctness. Key changes include updating text normalization, streamlining file generation by reading directly from a CSV, and removing obsolete scripts and plotting functions. I've identified a few critical issues in egs2/powsm/s2t1/local/process_ipapack.py that need to be addressed: an incorrect function signature that will cause a runtime error, a critical typo in a shell command, and several instances of unsafe or fragile coding practices that should be improved for robustness and maintainability.

egs2/powsm/s2t1/local/process_ipapack.py

egs2/powsm/s2t1/local/data_prep.py

egs2/powsm/s2t1/local/process_ipapack.py


codecov · 2026-01-16T16:44:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.41%. Comparing base (f57e5ef) to head (8b5de61).
⚠️ Report is 65 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6340      +/-   ##
==========================================
+ Coverage   69.39%   69.41%   +0.01%     
==========================================
  Files         759      759              
  Lines       69853    69853              
==========================================
+ Hits        48476    48487      +11     
+ Misses      21377    21366      -11

| Flag | Coverage Δ | |
|---|---|---|
| test_integration_espnet2 | `46.96% <ø> (ø)` | |
| test_python_espnet2 | `62.38% <ø> (+0.30%)` | ⬆️ |
| test_python_espnet3 | `16.06% <ø> (ø)` | |
| test_utils | `62.38% <ø> (+0.30%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

sw005320 · 2026-01-16T17:21:59Z

LGTM.
I'll merge this PR after the CI check

improve data prep for powsm

da5d746

dosubot bot added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and ASR (Automatic speech recognition) labels Jan 16, 2026

mergify bot added the ESPnet2 label Jan 16, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

48d7ec1

for more information, see https://pre-commit.ci

gemini-code-assist bot reviewed Jan 16, 2026


chinjouli mentioned this pull request Jan 16, 2026

[POWSM] POWSM-CTC recipe, and changes for s2t-ctc training #6341

Merged

chinjouli and others added 7 commits January 16, 2026 10:41

remove unused args in egs2/powsm/s2t1/local/process_ipapack.py

2dc3165

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Fix str formatting in process_ipapack.py for file output

89d5d5f

Refactor file handling in data_prep.py

b3673e0

Refactor file handling in data preparation to use context managers for automatic resource management.

[pre-commit.ci] auto fixes from pre-commit.com hooks

3786948

for more information, see https://pre-commit.ci

Change CSV reader to DictReader in process_ipapack.py

9c65273

Refactor CSV reading to use DictReader for better readability and access to columns by name.

fix ci

cbcfcb7

[pre-commit.ci] auto fixes from pre-commit.com hooks

8b5de61

for more information, see https://pre-commit.ci
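
The two refactors named in the commits above (context managers for file handling, and `csv.DictReader` for access to columns by name) can be illustrated with a minimal sketch. This is not the code from `process_ipapack.py` or `data_prep.py`: the function name `write_text_file`, the column names `utt_id` and `text`, and the toy CSV are assumptions made for the example, and the real header of `transcript_normalized.csv` may differ.

```python
import csv
import tempfile
from pathlib import Path


def write_text_file(csv_path, out_path, id_col="utt_id", text_col="text"):
    """Write one '<id> <transcript>' line per CSV row and return the row count.

    The column names are hypothetical placeholders, not the recipe's schema.
    """
    n_rows = 0
    # Both files are opened in one `with` statement, so they are closed
    # even if processing a row raises (for example, on a missing column).
    with open(csv_path, newline="", encoding="utf-8") as fin, open(
        out_path, "w", encoding="utf-8"
    ) as fout:
        # DictReader takes its keys from the header row, so columns are
        # read by name instead of by position.
        for row in csv.DictReader(fin):
            fout.write(f"{row[id_col]} {row[text_col]}\n")
            n_rows += 1
    return n_rows


def demo():
    """Run the sketch on a two-row toy CSV inside a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = Path(tmp_dir) / "toy.csv"
        out_path = Path(tmp_dir) / "text"
        csv_path.write_text(
            "utt_id,text\nutt1,hello world\nutt2,good morning\n",
            encoding="utf-8",
        )
        count = write_text_file(csv_path, out_path)
        return count, out_path.read_text(encoding="utf-8")
```

Reading by column name keeps the script working if columns are reordered in the CSV, and the `with` block removes the need for explicit `close()` calls on every exit path.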

sw005320 added this to the v.202601 milestone Jan 27, 2026

sw005320 merged commit 665c25a into espnet:master Jan 27, 2026
31 checks passed

sw005320 merged 9 commits into espnet:master from chinjouli:powsm_updates
