Tokenizer java.lang.StringIndexOutOfBoundsException #7

Closed
nartz opened this issue May 12, 2016 · 4 comments

Comments

@nartz

nartz commented May 12, 2016

Hi - I'm running this on some text that is erroring out. It's sensitive text, so unfortunately I can't provide it, but I may be able to dig into it more in a debugger at some point. For now, it looks like there is a bounds error somewhere. This is with nlp4j-1.1.1.jar (English model). I see some recent commits that rewrote some of this code, so maybe it's already fixed.

java.lang.StringIndexOutOfBoundsException: String index out of range: 44826
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
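
For readers unfamiliar with this failure mode, here is a minimal sketch of how `String.substring` raises this exception when an index falls outside the string. It does not reproduce the NLP4J code path; `SubstringBoundsSketch`, `substringThrows`, and `safeSubstring` are hypothetical names used only for illustration, and the clamping helper is one possible defensive pattern, not the fix that was applied in the library.

```java
public class SubstringBoundsSketch {

    // Hypothetical helper (not part of NLP4J): clamps both indices into
    // [0, s.length()] and keeps end >= begin, so substring cannot throw.
    static String safeSubstring(String s, int begin, int end) {
        int b = Math.max(0, Math.min(begin, s.length()));
        int e = Math.max(b, Math.min(end, s.length()));
        return s.substring(b, e);
    }

    // Returns true if String.substring rejects the given range.
    static boolean substringThrows(String s, int begin, int end) {
        try {
            s.substring(begin, end);
            return false;
        } catch (StringIndexOutOfBoundsException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "(example)";
        // End index equal to the length is valid.
        System.out.println(substringThrows(text, 0, text.length()));
        // End index one past the length is rejected.
        System.out.println(substringThrows(text, 0, text.length() + 1));
        // The clamped variant returns whatever part of the range exists.
        System.out.println(safeSubstring(text, 1, 100));
    }
}
```

Clamping hides the symptom rather than the cause, so it is mainly useful for keeping a batch job running while the underlying index calculation is investigated.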

@jdchoi77
Member

This is quite strange, because Tokenizer#650 is a commented-out line. I just released a snapshot, nlp4j-tokenization-1.1.2-SNAPSHOT, with the latest code, so could you please try it out and let me know? If this fixes the issue, I'll make another minor release. Thanks.

@elithrion

I got the same error when trying to parse some (freely available) novels with the 1.1.1 jar and all default settings. One file that errored is attached.

java.lang.StringIndexOutOfBoundsException: String index out of range: 3423
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
...

(I'll just leave the testing to you.)

Unearthly-2_5.txt
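
When a failing text is large, or too sensitive to share as in the original report, it can help to shrink it to a small reproducer first. The sketch below is one generic way to do that, assuming the caller supplies a predicate that returns true whenever processing a string still fails (for example, by wrapping the failing call in a try/catch). `FailingInputReducer` and `reduce` are hypothetical names, and the predicate in `main` is only a stand-in, not the actual tokenizer.

```java
import java.util.function.Predicate;

public class FailingInputReducer {

    // Greedy reduction: try deleting chunks of the text, keep any shorter
    // text for which the predicate still reports a failure, and halve the
    // chunk size once no chunk of the current size can be removed.
    static String reduce(String text, Predicate<String> fails) {
        if (!fails.test(text)) {
            return text; // nothing to reduce: the input does not fail
        }
        int chunk = text.length() / 2;
        while (chunk > 0) {
            boolean shrunk = false;
            int start = 0;
            while (start + chunk <= text.length()) {
                String candidate = text.substring(0, start) + text.substring(start + chunk);
                if (fails.test(candidate)) {
                    text = candidate; // chunk was irrelevant to the failure
                    shrunk = true;
                } else {
                    start += chunk;   // chunk is needed, move past it
                }
            }
            if (!shrunk) {
                chunk /= 2;
            }
        }
        return text;
    }

    public static void main(String[] args) {
        // Stand-in predicate (an assumption for the demo): "fails" when the
        // text has an opening parenthesis but no closing one.
        Predicate<String> fails = s -> s.contains("(") && !s.contains(")");
        System.out.println(reduce("abc (def ghi", fails));
    }
}
```

The result is not guaranteed to be the smallest possible failing input, but it is usually short enough to inspect by hand or to attach to a report.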

@jdchoi77
Member

jdchoi77 commented Jun 9, 2016

Thanks for providing the data; I fixed this bug and will include the fix in the next minor release (either tonight or tomorrow). Please sign up for our discussion group if you haven't already, so you'll get the notification for the new release.

https://groups.google.com/forum/#!forum/emorynlp

@jdchoi77
Member

Sorry for taking so long; I just released version 1.1.2, which should have this fixed. Thanks.
