token_text as outputs #64

Closed
valentinp72 opened this issue Apr 23, 2021 · 2 comments
Labels
question Further information is requested

Comments

@valentinp72

❓ Questions and Help

Hello,

I'm a bit confused after reading the changes from #58. I have been using token_text for my work, and I would prefer to keep using token_text instead of text if possible (because I use special tags similar to <space>).
After reading the changes from #58, I have the impression that it's still possible to use token_text as outputs for ASR systems. Is that right?
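
For reference, an entry in my JSON looks roughly like the following (the utterance id, path, and tags are just illustrative placeholders):

{
  "utt_0001": {
    "feat": "/path/to/feats.ark:123",
    "token_text": "b o n j o u r <space> m o n d e",
    "utt2num_frames": "417"
  }
}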

If so, what argument should be given to the training script?
With no additional argument, when training with JSON that includes token_text instead of text, I keep getting this error:

Traceback (most recent call last):
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq_cli/train.py", line 176, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/export/home/lium/vpelloin/git/espresso/fairseq_cli/train.py", line 287, in train
    log_output = trainer.train_step(samples)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/trainer.py", line 674, in train_step
    ignore_grad=is_dummy_batch,
  File "/export/home/lium/vpelloin/git/espresso/fairseq/tasks/fairseq_task.py", line 476, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/espresso/criterions/label_smoothed_cross_entropy_v2.py", line 150, in forward
    net_output = model(**sample["net_input"], epoch=self.epoch)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/module_proxy_wrapper.py", line 55, in forward
    return self.module(*args, **kwargs)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/legacy_distributed_data_parallel.py", line 74, in forward
    return self.module(*inputs, **kwargs)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'

Thank you so much for the incredible work you're doing with this tool!

valentinp72 added the question label Apr 23, 2021
@freewym
Owner

freewym commented Apr 23, 2021

Hi,

Maybe you can rename the field token_text to text in the JSON files, and then remove the args --bpe sentencepiece --sentencepiece-model ${sentencepiece_model}.model passed to train.py. This will take the text as-is, without any sentencepiece encoding.
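
Something along these lines would do the renaming (a rough sketch, assuming the JSON is a flat utt_id -> fields mapping; the file names are placeholders):

import json

# Rough sketch: rename the "token_text" field to "text" in each utterance entry.
# File names are placeholders; point them at your actual train/valid JSON files.
for path in ["train.json", "valid.json"]:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    for entry in data.values():
        if "token_text" in entry:
            entry["text"] = entry.pop("token_text")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)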

Alternatively, I think you can define your own BPE class similar to https://github.com/freewym/espresso/blob/master/fairseq/data/encoders/sentencepiece_bpe.py to take your additional tags into consideration, and pass the arg --bpe <your-bpe-name> to train.py.
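
A rough sketch of what such a class might look like, following the registration pattern in sentencepiece_bpe.py (the bpe name, class name, and tag-matching regex below are hypothetical, not part of the codebase):

import re
from dataclasses import dataclass, field

import sentencepiece as spm

from fairseq.data.encoders import register_bpe
from fairseq.dataclass import FairseqDataclass


@dataclass
class TagAwareSentencepieceConfig(FairseqDataclass):
    sentencepiece_model: str = field(
        default="???", metadata={"help": "path to the sentencepiece model"}
    )


@register_bpe("tag_aware_sentencepiece", dataclass=TagAwareSentencepieceConfig)
class TagAwareSentencepieceBPE:
    """Hypothetical BPE that keeps tags like <space> as single tokens."""

    TAG_RE = re.compile(r"(<[a-z_]+>)")

    def __init__(self, cfg):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(cfg.sentencepiece_model)

    def encode(self, x: str) -> str:
        # Encode the text around the tags; leave the tags themselves untouched.
        pieces = []
        for chunk in self.TAG_RE.split(x):
            if not chunk.strip():
                continue
            if self.TAG_RE.fullmatch(chunk):
                pieces.append(chunk)
            else:
                pieces.extend(self.sp.EncodeAsPieces(chunk))
        return " ".join(pieces)

    def decode(self, x: str) -> str:
        # Undo the sentencepiece segmentation; tag strings survive as-is.
        return x.replace(" ", "").replace("\u2581", " ").strip()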

Edit: there is a later commit 4c86e23 doing on-the-fly tokenization, where token_text is removed from the code entirely. If you are using a version after that commit, I can see where your confusion comes from.

@valentinp72
Author

Thank you,

After renaming token_text to text and removing the --bpe argument, fairseq did not complain, and I was able to train my model.

Now I'm just having trouble when decoding, as there is no difference between the WER and CER metrics (decoded_char_results and decoded_results are the same). I think creating my own tokenizer/BPE class should work; I might try that later. Otherwise, I'll look at the decoding functions and add some custom code.

@freewym freewym closed this as completed Jun 3, 2021