token_text as outputs #64

Closed
valentinp72 opened this issue Apr 23, 2021 · 2 comments
Labels
question Further information is requested

Comments

@valentinp72

❓ Questions and Help

Hello,

I'm a bit confused after reading the changes from #58. I have been using token_text for my work, and I would prefer to keep using token_text instead of text if possible (because I use special tags similar to <space>).
After reading the changes from #58, I have the impression that it's still possible to use token_text as outputs for ASR systems. Is that right?
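
For reference, an entry in my JSON looks roughly like the following (the utterance id, path, and tags are just illustrative placeholders):

{
  "utt_0001": {
    "feat": "/path/to/feats.ark:123",
    "token_text": "b o n j o u r <space> m o n d e",
    "utt2num_frames": "417"
  }
}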

If so, what argument should be given to the training script?
With no additional argument, when training with JSON that includes token_text instead of text, I keep getting this error:

Traceback (most recent call last):
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq_cli/train.py", line 176, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/export/home/lium/vpelloin/git/espresso/fairseq_cli/train.py", line 287, in train
    log_output = trainer.train_step(samples)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/trainer.py", line 674, in train_step
    ignore_grad=is_dummy_batch,
  File "/export/home/lium/vpelloin/git/espresso/fairseq/tasks/fairseq_task.py", line 476, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/espresso/criterions/label_smoothed_cross_entropy_v2.py", line 150, in forward
    net_output = model(**sample["net_input"], epoch=self.epoch)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/module_proxy_wrapper.py", line 55, in forward
    return self.module(*args, **kwargs)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/home/lium/vpelloin/git/espresso/fairseq/distributed/legacy_distributed_data_parallel.py", line 74, in forward
    return self.module(*inputs, **kwargs)
  File "/lium/home/vpelloin/miniconda3/envs/espresso/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'prev_output_tokens'

Thank you so much for the incredible work you're doing with this tool!

valentinp72 added the question label Apr 23, 2021
@freewym
Owner

freewym commented Apr 23, 2021

Hi,

Maybe you can rename the field token_text to text in the JSON files, and then remove the args --bpe sentencepiece --sentencepiece-model ${sentencepiece_model}.model passed to train.py. This will take the text as-is, without any sentencepiece encoding.
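
Something along these lines would do the renaming (a rough sketch, assuming the JSON is a flat utt_id -> fields mapping; the file names are placeholders):

import json

# Rough sketch: rename the "token_text" field to "text" in each utterance entry.
# File names are placeholders; point them at your actual train/valid JSON files.
for path in ["train.json", "valid.json"]:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    for entry in data.values():
        if "token_text" in entry:
            entry["text"] = entry.pop("token_text")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)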

Alternatively, I think you can define your own BPE class similar to https://github.com/freewym/espresso/blob/master/fairseq/data/encoders/sentencepiece_bpe.py to take your additional tags into consideration, and pass the arg --bpe <your-bpe-name> to train.py.
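
A rough sketch of what such a class might look like, following the registration pattern in sentencepiece_bpe.py (the bpe name, class name, and tag-matching regex below are hypothetical, not part of the codebase):

import re
from dataclasses import dataclass, field

import sentencepiece as spm

from fairseq.data.encoders import register_bpe
from fairseq.dataclass import FairseqDataclass


@dataclass
class TagAwareSentencepieceConfig(FairseqDataclass):
    sentencepiece_model: str = field(
        default="???", metadata={"help": "path to the sentencepiece model"}
    )


@register_bpe("tag_aware_sentencepiece", dataclass=TagAwareSentencepieceConfig)
class TagAwareSentencepieceBPE:
    """Hypothetical BPE that keeps tags like <space> as single tokens."""

    TAG_RE = re.compile(r"(<[a-z_]+>)")

    def __init__(self, cfg):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(cfg.sentencepiece_model)

    def encode(self, x: str) -> str:
        # Encode the text around the tags; leave the tags themselves untouched.
        pieces = []
        for chunk in self.TAG_RE.split(x):
            if not chunk.strip():
                continue
            if self.TAG_RE.fullmatch(chunk):
                pieces.append(chunk)
            else:
                pieces.extend(self.sp.EncodeAsPieces(chunk))
        return " ".join(pieces)

    def decode(self, x: str) -> str:
        # Undo the sentencepiece segmentation; tag strings survive as-is.
        return x.replace(" ", "").replace("\u2581", " ").strip()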

Edit: there is a later commit 4c86e23 doing on-the-fly tokenization, where token_text is removed from the code entirely. If you are using a version after that commit, I can see where your confusion comes from.

@valentinp72
Author

Thank you,

After renaming token_text to text and removing the --bpe argument, fairseq did not complain, and I was able to train my model.

Now I'm just having trouble when decoding, as there is no difference between the WER and CER metrics (decoded_char_results and decoded_results are the same). I think creating my own tokenizer/BPE class should work; I might try that later. Otherwise, I'll look at the decoding functions and add some custom code.

@freewym freewym closed this as completed Jun 3, 2021