Fixing How2-2000h Data preparation and Seq Length Assert for Longformer Encoder #4805

Merged
merged 8 commits into espnet:master from the how2_fix branch on Dec 5, 2022

Conversation

roshansh-cmu
Contributor

@roshansh-cmu roshansh-cmu commented Dec 5, 2022

We recently changed the data download mechanism for How2-2000h, which affects the asr1 and sum1 recipes for how2_2000h.
This PR therefore updates the README and data preparation scripts, and includes two related changes:

  • Data Prep and README for asr1
  • Data Prep and README for sum1
  • The sequence length assert in espnet_model.py should not be checked for the Longformer encoder, since the encoder may pad its input to optimize GPU usage. This behavior is a package default and not easy to modify at this time, so the assert now always runs except when the encoder's selfattn is "lf_selfattn".
  • Add HOW2_2kH to db.sh
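
The conditional assert described in the third item can be sketched roughly as follows. This is a minimal illustration, not the actual code in espnet_model.py: the function name `check_encoder_output_length` and its parameters are hypothetical, and only the "lf_selfattn" value comes from the description above.

```python
from typing import Optional


def check_encoder_output_length(
    selfattention_layer_type: Optional[str],
    encoder_out_frames: int,
    max_valid_frames: int,
) -> bool:
    """Run the sequence length check unless the encoder uses Longformer attention.

    All names here are hypothetical. Returns True if the check ran and
    passed, False if it was skipped.
    """
    # A Longformer encoder may pad its input, so its output can be longer
    # than the longest valid length; skip the check in that case.
    if selfattention_layer_type == "lf_selfattn":
        return False
    # Every other encoder type keeps the original sanity check.
    assert encoder_out_frames <= max_valid_frames, (
        encoder_out_frames,
        max_valid_frames,
    )
    return True
```

With this shape, a padded Longformer output such as `check_encoder_output_length("lf_selfattn", 512, 500)` is accepted without running the assert, while any other attention type still fails fast when the output is longer than expected.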

Please consider and merge at your convenience.


codecov bot commented Dec 5, 2022

Codecov Report

Merging #4805 (01dbafa) into master (1a50788) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #4805   +/-   ##
=======================================
  Coverage   80.38%   80.38%           
=======================================
  Files         533      533           
  Lines       46948    46949    +1     
=======================================
+ Hits        37738    37739    +1     
  Misses       9210     9210           
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet1 | 66.37% <ø> (ø) |
| test_integration_espnet2 | 49.00% <100.00%> (-0.02%) ⬇️ |
| test_python | 68.75% <100.00%> (+<0.01%) ⬆️ |
| test_utils | 23.30% <ø> (ø) |

| Impacted Files | Coverage Δ |
|---|---|
| espnet2/asr/espnet_model.py | 81.06% <100.00%> (+0.07%) ⬆️ |

@sw005320 sw005320 added the ASR Automatic speech recognition label Dec 5, 2022
@sw005320 sw005320 added this to the v.202211 milestone Dec 5, 2022
@sw005320 sw005320 added the Bugfix label Dec 5, 2022
Contributor

sw005320 commented Dec 5, 2022

Thanks a lot, @roshansh-cmu!
After the CI check, I'll merge this PR.

@sw005320 sw005320 added the auto-merge Enable auto-merge label Dec 5, 2022
@mergify mergify bot merged commit 8394ce7 into espnet:master Dec 5, 2022
@roshansh-cmu roshansh-cmu deleted the how2_fix branch December 5, 2022 14:15
@roshansh-cmu roshansh-cmu restored the how2_fix branch December 6, 2022 11:15
Labels
ASR (Automatic speech recognition), auto-merge (Enable auto-merge), Bugfix, ESPnet2, README
2 participants