
About choice of Tokenizers #14

Closed
freddycct opened this issue Aug 1, 2020 · 5 comments
@freddycct

@chengchingwen What do we do about the Hugging Face BERT tokenizer? Is that in the plan?

@chengchingwen (Owner) commented Aug 14, 2020

It's on the roadmap, but it won't be covered by the GSoC project. I'm considering wrapping the Rust implementation in huggingface/tokenizers with BinaryBuilder.jl.

@freddycct (Author)

This project needs more love from the community. It looks like progress has stalled? @chengchingwen

@chengchingwen (Owner)

Yes. I'd love to see more people contributing to this project. I'm currently quite busy (working and studying), so I can't spend much effort on it. I do have some unreleased code snippets for the tokenizers, but most of them are not well tested. I'll try to squeeze out some time to release them (probably in 1-2 weeks).

Some updates/thoughts about tokenizers:

  1. The Rust implementation doesn't have a stable API, or even a public C API for making bindings. We would need to either come up with a C API or find some way to call Rust functions directly from Julia.
  2. The easiest way would actually be to use PyCall.jl to load huggingface/transformers or huggingface/tokenizers and use those tokenizers from Python, but I personally dislike this approach. I avoid Python like the plague (that's why Pickle.jl exists).
  3. Ideally I would love to see a native Julia implementation, but then lots of things would need to be reimplemented. Currently we only have some basic tokenizers that I modified from the original Python implementations (the WordPiece in src/bert, the BPE from BytePairEncoding.jl), plus many other pieces from WordTokenizers.jl.
  4. Bindings would be a desirable approach, but the problem is that most tokenizers are implemented in binding-unfriendly ways, e.g. Cython/Python, or C++/Rust without a C API.
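For context on point 3, the core of a WordPiece tokenizer is just greedy longest-match-first over a subword vocabulary, so the algorithmic part of a native reimplementation is small (most of the work is in normalization and pre-tokenization). A minimal sketch in Python, with a toy vocabulary and the usual "##" continuation prefix; this is illustrative only, not the actual code in src/bert:

```python
# Greedy longest-match-first WordPiece, as used by BERT-style tokenizers.
# The vocabulary, "##" prefix, and "[UNK]" token below are conventional
# defaults chosen for illustration.

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one whitespace-separated word into WordPiece sub-tokens."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a
        # vocabulary entry matches.
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a word-internal continuation
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix matched: the whole word maps to the unknown token.
            return [unk]
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```

SentencePiece-style tokenizers (e.g. ALBERT's) would not fit this sketch: they work on raw text with a unigram language model rather than greedy matching over whitespace-split words, which is part of why they need separate support.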

@SeanLee97

I implemented a WordPiece tokenizer in native Julia, named BertWordPieceTokenizer, and I've registered it on JuliaHub.

Currently it can load WordPiece tokenizers, e.g. BERT and RoBERTa, but it fails for SentencePiece tokenizers such as ALBERT.

@chengchingwen (Owner)

@SeanLee97 Actually, we already have a WordPiece tokenizer in Transformers.jl. See here and here
