Tokenizer java.lang.StringIndexOutOfBoundsException #7
Comments
This is quite strange because Tokenizer#650 is a commented-out line. I just released a snapshot, nlp4j-tokenization-1.1.2-SNAPSHOT, with the latest code, so could you please try out the snapshot and let me know? If this fixes the issue, I'll make another minor release. Thanks.
I got the same error when trying to parse some (freely available) novels with all default settings with the 1.1.1 jar. One file that errored is attached. java.lang.StringIndexOutOfBoundsException: String index out of range: 3423 (I'll just leave the testing to you.)
Thanks for providing the data; I fixed this bug and will include the fix in the next minor release (either tonight or tomorrow). Please sign up for our discussion group if you haven't already, so you'll get the notification for the new release.
Sorry for taking so long; I just released version 1.1.2, which should have this fixed. Thanks.
Hi - I'm running this on some text that is erroring out. It's sensitive text, so unfortunately I can't provide it, but I may be able to dig into it more in a debugger at some point. For now, it looks like there may be a bounds error. This is with the nlp4j-1.1.1.jar (English model). I see some commits that recently rewrote some of this code, so maybe it's fixed.
java.lang.StringIndexOutOfBoundsException: String index out of range: 44826
at java.lang.String.substring(String.java:1963)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.mergeParenthesis(Tokenizer.java:650)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.finalize(Tokenizer.java:608)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenizeWhiteSpaces(Tokenizer.java:165)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.tokenize(Tokenizer.java:113)
at edu.emory.mathcs.nlp.tokenization.Tokenizer.segmentize(Tokenizer.java:133)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decodeRaw(AbstractNLPDecoder.java:221)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder.decode(AbstractNLPDecoder.java:182)
at edu.emory.mathcs.nlp.decode.AbstractNLPDecoder$NLPTask.run(AbstractNLPDecoder.java:345)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
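For anyone reading the trace: the top frame is String.substring, which throws this exception whenever the end index is past the end of the string. Below is a minimal sketch of that failure mode with a clamped alternative. SubstringBounds, throwsOutOfBounds, and safeSubstring are hypothetical names made up for illustration; this is not NLP4J's code and says nothing about how mergeParenthesis was actually fixed.

```java
// Hypothetical illustration only; none of this is part of NLP4J.
final class SubstringBounds {

    // Returns true if String.substring rejects the given range.
    static boolean throwsOutOfBounds(String s, int begin, int end) {
        try {
            s.substring(begin, end);
            return false;
        } catch (StringIndexOutOfBoundsException ex) {
            return true;
        }
    }

    // Clamps both indices into [0, s.length()] and keeps begin <= end,
    // so the call never throws; an out-of-range request is truncated.
    static String safeSubstring(String s, int begin, int end) {
        int b = Math.max(0, Math.min(begin, s.length()));
        int e = Math.max(b, Math.min(end, s.length()));
        return s.substring(b, e);
    }

    public static void main(String[] args) {
        String text = "(a)";
        // End index 4 is past the length (3), the same kind of
        // failure reported in the trace above.
        System.out.println(throwsOutOfBounds(text, 1, 4));
        System.out.println(safeSubstring(text, 1, 4));
    }
}
```

Clamping hides the symptom rather than the cause, so it is probably only useful as a stopgap while tracking down which offset calculation goes wrong.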