DL4J: BERT iterator can get into infinite loop with bad character encoding #7678

Closed
AlexDBlack opened this issue May 6, 2019 · 2 comments

Comments

@AlexDBlack
Contributor

commented May 6, 2019

Best guess as to the cause of the infinite loop: if the string to be tokenized contains a character that doesn't appear in the vocab (due to bad encoding of either the vocab or the string itself), the tokenizer never terminates. The following test runs until it hits java.lang.OutOfMemoryError: Java heap space.

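    @Test
    public void testBertWordPieceTokenizer1() throws Exception {
        // (char) 8 is the backspace control character, presumably absent from the vocab
        String toTokenize = "I saw a girl with a telescope. bad" + (char) 8 + "word";
        // pathToVocab and c are not declared here; they come from the enclosing test class
        TokenizerFactory t = new BertWordPieceTokenizerFactory(pathToVocab, c);
        // Tokenize the same text twice: once from the String, once from a stream
        Tokenizer tokenizer = t.create(toTokenize);
        Tokenizer tokenizer2 = t.create(new ByteArrayInputStream(toTokenize.getBytes()));
        int position = 1;
        // Both tokenizers should yield identical tokens at every position
        while (tokenizer2.hasMoreTokens()) {
            String tok1 = tokenizer.nextToken();
            String tok2 = tokenizer2.nextToken();
            log.info("Position: [" + position + "], token1: '" + tok1 + "', token 2: '" + tok2 + "'");
            position++;
            assertEquals(tok1, tok2);
        }
    }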
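
To illustrate the suspected failure mode, here is a minimal sketch of a greedy longest-match word-piece loop. It is a hypothetical simplification, not the DL4J implementation: the class name `GreedyWordPieceSketch`, the `tokenize` method, and the use of a plain `Set<String>` as the vocab are all assumptions made for illustration. The point is that when no vocab entry covers the character at the current position, the position never advances unless the loop checks for that case explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified greedy longest-match tokenizer; NOT the DL4J code. */
public class GreedyWordPieceSketch {

    /**
     * Splits a single word into the longest matching vocab entries, left to right.
     * Continuation pieces are looked up with the conventional "##" prefix.
     */
    public static List<String> tokenize(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocab
            while (end > start) {
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Without this guard, start would never advance and the outer loop
                // would spin forever on the same out-of-vocab character
                throw new IllegalStateException(String.format(
                        "No vocab entry covers character U+%04X at index %d",
                        word.codePointAt(start), start));
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocab containing `bad` and `##word`, the input `badword` splits into two pieces, while `"bad" + (char) 8 + "word"` fails fast with a message naming the offending character instead of looping.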

AlexDBlack self-assigned this May 22, 2019

AlexDBlack added a commit that referenced this issue May 25, 2019
SameDiff fixes + improvements; BertWordPieceTokenizer (handle out-of-vocab characters) (#7774)

* Extra overloads for TrainingConfig

* #7678 BERT fixes - control characters etc

* #7678 BERT tokenizer - detect out-of-vocab characters, throw useful exception

* Javadoc/polish

* #7705 SameDiff duplicate name validation

* #7546 SDVariable.getArr scalar issue fix

* SameDiff: add variable renaming

* SameDiff SDVariable renaming javadoc

* Move async iterators to ND4J for use in SameDiff

* Add async iterator support to SameDiff training

* SameDiff listener API

* Base listener, score listener

* Score listener now working

* Fixes and listener polishing

* Small fix
@AlexDBlack

Contributor Author

commented May 25, 2019

Fixed in #7774 and merged to the dev branch; it will be merged from dev to master soon.
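
For anyone on a build that predates the fix, one possible caller-side workaround is to strip control characters from the input before handing it to the tokenizer. The sketch below is an assumption about what such a workaround could look like; it is not taken from the referenced commit, and the class name `ControlCharStripSketch` and method `stripControlChars` are made up for illustration. It uses only `java.lang.Character` methods.

```java
/** Hypothetical input-cleaning helper; not part of DL4J. */
public class ControlCharStripSketch {

    /** Removes ISO control characters, keeping whitespace such as tab and newline. */
    public static String stripControlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Keep everything that is not a control character, plus whitespace controls
            if (!Character.isISOControl(cp) || Character.isWhitespace(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Note that dropping the character joins its neighbours, so `"bad" + (char) 8 + "word"` becomes `badword`; replacing the character with a space instead would keep the two halves as separate words. This only covers control characters; other characters missing from the vocab because of an encoding mismatch would still need the vocab file or input encoding to be corrected.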

@AlexDBlack

Contributor Author

commented Jun 3, 2019
