Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Bert Word Piece Tokenizer #7141

Merged
merged 9 commits into from Feb 12, 2019

Conversation

Projects
None yet
2 participants
@treo
Copy link
Member

commented Feb 11, 2019

Requires a Vocab File like the one from the published Bert models
(e.g. vocab.txt in https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)

Tests require deeplearning4j/dl4j-test-resources#185

treo added some commits Feb 11, 2019

Add Bert Word Piece Tokenizer
Requires a Vocab File like the one from the published Bert models
(e.g. vocab.txt in https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)
@AlexDBlack
Copy link
Member

left a comment

Overall good, as expected. NavigableMap use here is cool :)
Some minor nitpicks here and there.
Only issue of note I can see here: some of the BERT models are lower case only... We should probably have that configurable somewhere, with an appropriate toLowerCase() call in the tokenizer when enabled?

treo added some commits Feb 11, 2019

Implement suggestions and speedup tokenization
* Update Copyright header

* Use @slf4j

* Share split pattern between Stream and non-stream versions

* Share vocab search between Stream and non-stream versions

* Continue search only in the sub tree that will have the result

* Allow lowerCase only vocab configuration

* Make BertWordPieceTokenzier safer to use with threads

* Allow passing an input stream in addition to a file to BertWordPieceTokenizerFactory

* Add documentation on the expected file format for vocab files
@treo

This comment has been minimized.

Copy link
Member Author

commented Feb 11, 2019

@AlexDBlack I've implemented the requested changes, fixed some more bugs I've found along the way,

@AlexDBlack
Copy link
Member

left a comment

LGTM 👍

@treo treo merged commit 41e1a88 into master Feb 12, 2019

0 of 2 checks passed

codeclimate Code Climate is analyzing this code.
Details
continuous-integration/jenkins/pr-head This commit is being built
Details

@treo treo deleted the treo/bert_wordpiece_tokenizer branch Feb 12, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.