Tokenizer java.lang.StringIndexOutOfBoundsException #7

nartz · 2016-05-12T05:46:52Z

Hi - I'm running this on some text that is erroring out. It's sensitive text, so unfortunately I can't provide it, but I may be able to dig into it more in a debugger at some point. For now, it looks like there may be a bounds error. This is with nlp4j-1.1.1.jar (English model). I see some recent commits that rewrote some of this code, so maybe it's fixed.

java.lang.StringIndexOutOfBoundsException: String index out of range: 44826
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
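
When the failing text can't be shared, one option is to shrink it to a smaller snippet that still triggers the exception before reporting it. The sketch below is only an illustration under assumptions: `FailingInputFinder`, `fails`, `shrink`, and `fakeTokenizer` are hypothetical names, and the `Consumer<String>` stands in for whatever call actually runs the tokenizer; none of this is part of NLP4J. It halves the input repeatedly and keeps whichever half still throws.

```java
import java.util.function.Consumer;

// Hypothetical helper, not part of NLP4J.
class FailingInputFinder {
    // True if the supplied tokenizer callback throws the exception from this report.
    static boolean fails(Consumer<String> tokenizer, String text) {
        try {
            tokenizer.accept(text);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    // Keeps whichever half of the text still fails; stops when neither half
    // fails on its own (the trigger may span the split point).
    static String shrink(Consumer<String> tokenizer, String text) {
        String current = text;
        while (current.length() > 1) {
            int mid = current.length() / 2;
            String left = current.substring(0, mid);
            String right = current.substring(mid);
            if (fails(tokenizer, left)) {
                current = left;
            } else if (fails(tokenizer, right)) {
                current = right;
            } else {
                break;
            }
        }
        return current;
    }

    // Stand-in for the real tokenizer call (an assumption for this demo only):
    // it deliberately reads one character past the end whenever "(" appears.
    static void fakeTokenizer(String s) {
        int i = s.indexOf('(');
        if (i >= 0) {
            s.substring(i, s.length() + 1);
        }
    }

    public static void main(String[] args) {
        String shrunk = shrink(FailingInputFinder::fakeTokenizer, "aaaa(bbbb");
        System.out.println("Smallest failing snippet found: " + shrunk);
    }
}
```

To use it against the real library, the stand-in would be replaced by a lambda that calls the tokenizer on the string. The shrunk snippet may still contain sensitive content, but it is usually much easier to review or redact than the full input.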

jdchoi77 · 2016-05-13T01:26:52Z

This is quite strange because Tokenizer#650 is a commented-out line. I just released a snapshot, nlp4j-tokenization-1.1.2-SNAPSHOT, with the latest code, so could you please try it out and let me know? If this fixes the issue, I'll make another minor release. Thanks.

elithrion · 2016-06-09T17:57:52Z

I got the same error when trying to parse some (freely available) novels with all default settings, using the 1.1.1 jar. One file that errored is attached.

java.lang.StringIndexOutOfBoundsException: String index out of range: 3423
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
...

(I'll just leave the testing to you.)

Unearthly-2_5.txt

jdchoi77 · 2016-06-09T19:59:18Z

Thanks for providing the data; I fixed this bug and will include the fix in the next minor release (either tonight or tomorrow). Please sign up for our discussion group if you haven't already, so you'll get the notification for the new release.

https://groups.google.com/forum/#!forum/emorynlp

jdchoi77 · 2016-06-29T21:36:33Z

Sorry for taking so long; I just released version 1.1.2, which should have this fixed. Thanks.

jdchoi77 closed this as completed Jun 29, 2016

jdchoi77 mentioned this issue Jun 29, 2016

StringIndexOutOfBoundsException emorynlp/nlp4j-old#20

Open
