
separate initial loading and per-line operations #46

Closed

bittlingmayer opened this issue Feb 28, 2019 · 4 comments

Comments

@bittlingmayer

In order to support interactive mode and/or run-time querying from other languages, it would be ideal if the code under `if __name__ == '__main__':` in a task like embed did the expensive loading once up front and then processed each line from stdin as soon as it arrived.
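
Roughly this shape, as a minimal sketch (`load_model` and `embed_line` are hypothetical names standing in for LASER's actual loading and encoding code, not its real API):

```python
import sys

# Hypothetical helpers standing in for LASER's actual loading/encoding code.
from embed_helpers import load_model, embed_line

def main():
    model = load_model()    # expensive one-time initialization
    for line in sys.stdin:  # then stream: handle each line as it arrives
        print(embed_line(model, line.rstrip('\n')))

if __name__ == '__main__':
    main()
```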

@hoschwenk
Contributor

This is already the case for the LSTM encoder itself.
The trickier part is Moses tokenization and fastBPE.
Preloading the model and keeping it in memory would require some (possibly substantial) changes to this third-party code. An option could be to use named pipes.
If you can provide a pull request for this option, I'm happy to integrate it.
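
As a rough illustration of the named-pipe idea (the tokenizer invocation and paths are placeholders, and this assumes the external tool flushes its output per line, e.g. tokenizer.perl with its -b flag to disable Perl buffering):

```python
import os
import subprocess
import tempfile

# Create two FIFOs: one to feed lines in, one to read tokenized lines back.
fifo_dir = tempfile.mkdtemp()
in_fifo = os.path.join(fifo_dir, 'in')
out_fifo = os.path.join(fifo_dir, 'out')
os.mkfifo(in_fifo)
os.mkfifo(out_fifo)

# Keep one long-lived tokenizer process alive instead of respawning per call.
proc = subprocess.Popen(
    'perl tokenizer.perl -b -l en < {} > {}'.format(in_fifo, out_fifo),
    shell=True)

# Open order matters: opening a FIFO blocks until both ends are connected.
writer = open(in_fifo, 'w', buffering=1)   # line-buffered
reader = open(out_fifo, 'r')

writer.write('Hello, world!\n')
print(reader.readline().strip())
```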

@bittlingmayer
Author

Thanks for the suggestions.

Logically, facebookresearch/UnsupervisedMT has the exact same issue.

For tokenisation, my instinct would be to just use a different tokeniser, because LASER is basically agnostic to the tokenisation scheme, as long as it is applied consistently at train time and run time. But that would break compatibility with the current pre-trained models.

For fastBPE, I don't have a great answer. Fairseq (and Sockeye) support interactive mode and BPE, but the BPE story is not something I would emulate; it's by far the biggest problem with the lib. fastBPE is a small and new lib, so there is some hope that it could evolve from just research to engineering. In any case, I opened an issue; maybe you can add something there: glample/fastBPE#10

@hoschwenk
Contributor

I assume that fairseq (and Sockeye) use Sennrich's BPE in Python.
In principle, one should be able to replace fastBPE with another BPE implementation. There are minor differences, but it may be worth measuring the impact without retraining the models.
The long-term solution I favor is to switch to a unified tokenization and segmentation approach like SentencePiece. This would make the whole pipeline language-agnostic.
I hope to update the models and code in the near future.
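
For what it's worth, that approach maps naturally onto a load-once loop; a sketch with the sentencepiece Python package (the model file is a placeholder and would have to be trained alongside new encoders):

```python
import sentencepiece as spm

# One model covers tokenization and subword segmentation in a single,
# language-agnostic step; 'laser.spm.model' is a placeholder path.
sp = spm.SentencePieceProcessor()
sp.Load('laser.spm.model')

for line in ['Hello world.', 'Ein Beispiel auf Deutsch.']:
    print(' '.join(sp.EncodeAsPieces(line)))
```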

@bittlingmayer
Author

fastBPE now supports this.

See glample/fastBPE#10 (comment)
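
With the Python bindings it becomes a load-once, apply-per-line call, roughly like this (the codes/vocab paths are placeholders):

```python
import fastBPE

# Load the BPE codes (and optional vocabulary) once at startup ...
bpe = fastBPE.fastBPE('bpe.codes', 'bpe.vocab')

# ... then apply BPE to lines as they arrive, with no per-call process spawn.
print(bpe.apply(['Hello world .']))
```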

@glample @loretoparisi
