The Bug in data preparation for How2 Summarization #4459

Closed
teinhonglo opened this issue Jun 20, 2022 · 6 comments
Labels
Bug bug should be fixed

Comments

@teinhonglo
Contributor

To Reproduce
Steps to reproduce the behavior:

  1. Move to the recipe directory, e.g., cd egs2/how2_2000h/summ1
  2. Run local/run_asr.sh --asr_tag asr_pretrain

Error logs

2022-06-19T15:12:30 (asr.sh:252:main) ./asr.sh --lang en --feats_type extracted --token_type bpe --nbpe 1000 --nlsyms_txt data/nlsyms --bpe_nlsyms [hes] --use_lm false --asr_config conf/train_asr_conformer_lf.yaml --inference_config conf/decode_asr.yaml --train_set tr_2000h_utt --valid_set cv05_utt --test_sets dev5_test_utt --bpe_train_text data/tr_2000h_utt/text --asr_tag asr_pretrain
2022-06-19T15:12:34 (asr.sh:443:main) Stage 1: Data preparation for data/tr_2000h_utt, data/cv05_utt, etc.
2022-06-19T15:12:38 (data.sh:23:main) local/data.sh
2022-06-19T15:12:42 (data.sh:36:main) stage 0: Data download
/home/teinhonglo/espnets/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/gdown/cli.py:127: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
  warnings.warn(
Access denied with the following error:

        Cannot retrieve the public link of the file. You may need to change
        the permission to 'Anyone with the link', or have had many accesses.

You may still be able to access the file from the browser:

         https://drive.google.com/uc?id=sharing

In addition, the download link provided in local/data.sh is no longer available.

Thanks in advance.

@teinhonglo added the "Bug" label (bug should be fixed) on Jun 20, 2022
@sw005320
Contributor

@roshansh-cmu, can you answer this for me?
Downloading from Google Drive is tricky.
You might consider hosting the data on other storage (e.g., Hugging Face).
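
A side note on the log above: the fallback URL ends in id=sharing, which suggests that the value passed to gdown was the tail of a share link (...?usp=sharing) rather than a real file ID. That is an inference from the log, not something confirmed against local/data.sh. A minimal sketch of pulling the ID out of a share link with the Python standard library follows; the function name and the example ID are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse


def drive_file_id(url: str) -> Optional[str]:
    """Return the Google Drive file ID found in a share URL, or None if absent."""
    parsed = urlparse(url)
    # Form 1: https://drive.google.com/file/d/<ID>/view?usp=sharing
    match = re.search(r"/file/d/([^/]+)", parsed.path)
    if match:
        return match.group(1)
    # Form 2: https://drive.google.com/uc?id=<ID> (or .../open?id=<ID>)
    ids = parse_qs(parsed.query).get("id")
    return ids[0] if ids else None


# Hypothetical ID, for illustration only.
url = "https://drive.google.com/file/d/EXAMPLE_FILE_ID/view?usp=sharing"
print(drive_file_id(url))
```

Checking the extracted value before calling gdown would at least make a wrong ID such as "sharing" visible early, although it would not help with the permission or quota errors that Google Drive also returns.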

@roshansh-cmu
Contributor

roshansh-cmu commented Jun 21, 2022

Thank you for opening this issue.

This issue might take some time to resolve due to server issues. I am investigating alternative storage and will respond here when the issue is resolved.

@roshansh-cmu
Contributor

Please request the dataset using the data release form in the How2 data repository: https://github.com/srvk/how2-dataset.

Apologies for the long delay in replying.

@teinhonglo
Contributor Author

teinhonglo commented Dec 3, 2022

@roshansh-cmu

Thanks for your reply.
The problem above is solved, but I now hit the runtime error below during training.
Any suggestions?

Traceback (most recent call last):
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/tools/anaconda/envs/espnet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/tools/anaconda/envs/espnet/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/tasks/abs_task.py", line 1019, in main
    cls.main_worker(args)
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/tasks/abs_task.py", line 1315, in main_worker
    cls.trainer.run(
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/train/trainer.py", line 282, in run
    all_steps_are_invalid = cls.train_one_epoch(
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/train/trainer.py", line 556, in train_one_epoch
    retval = model(**batch)
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/asr/espnet_model.py", line 202, in forward
    encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
  File "/share/nas165/teinhonglo/espnets/espnet-20220922/espnet2/asr/espnet_model.py", line 368, in encode
    assert encoder_out.size(1) <= encoder_out_lens.max(), (
AssertionError: (torch.Size([1, 3120, 512]), tensor(3036, device='cuda:0'))
Loading tvm binary from: /share/nas165/teinhonglo/espnets/espnet-20220922/tools/anaconda/envs/espnet/lib/python3.8/site-packages/longformer/../longformer/lib/lib_diagonaled_mm_float32_cuda.so
# Accounting: time=13 threads=1

@roshansh-cmu
Contributor

roshansh-cmu commented Dec 3, 2022

Please wait a day or two. I am in the process of setting up a PR to fix this and other issues with the recipe. Thanks!

@roshansh-cmu
Contributor

Please refer to our PR #4805. Once merged, it should be used to prepare the data from the downloaded dataset bz2 file.
The modification made to espnet_model.py in that PR should fix the assertion error you are facing.

Thank you!
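
For readers hitting the same assertion: the message shows an encoder output with 3120 frames against a maximum reported length of 3036, so the output is longer than the lengths say it should be. One plausible cause, stated here as an assumption, is an encoder that pads its input up to a multiple of some block size and returns the padded frames. The sketch below illustrates the general remedy of trimming the time axis back to the longest valid length. It uses plain Python lists so that it runs without PyTorch, the helper name is hypothetical, and it is not claimed to be the change made in PR #4805.

```python
from typing import List, Sequence


def trim_to_max_len(batch: List[List[float]], lengths: Sequence[int]) -> List[List[float]]:
    """Drop trailing frames beyond the longest valid length in the batch."""
    max_len = max(lengths)
    return [frames[:max_len] for frames in batch]


# Numbers taken from the assertion message: 3120 output frames, 3036 valid.
batch = [[0.0] * 3120]
lengths = [3036]

trimmed = trim_to_max_len(batch, lengths)
# After trimming, the time dimension no longer exceeds the longest length.
assert len(trimmed[0]) <= max(lengths)
print(len(batch[0]) - len(trimmed[0]))  # number of padded frames removed
```

With tensors, the equivalent would be a slice along the time axis, something like encoder_out[:, :encoder_out_lens.max()], applied before the length check.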
