
MakeKneserNeyArpaFromText throws ArrayIndexOutOfBoundsException #8

Closed

GoogleCodeExporter opened this issue Mar 22, 2015 · 9 comments

I am running edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText on some German text but keep running into an ArrayIndexOutOfBoundsException. If I try to build a model from very limited data, no such error arises. Is there a limit on the number of distinct characters the input text can contain? The out-of-bounds index is 256, which is suspiciously the number of distinct values a byte can hold.

I have attached the input file (German Wikipedia data prepared for a character-level n-gram model).
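Such character-level input is typically one sentence per line with whitespace between every character, so that the language model treats each character as a token. A minimal sketch of that preparation, assuming UTF-8 input and using an underscore to stand in for literal spaces (both choices are illustrative, not necessarily what was done for the attached file):

import java.io.*;
import java.nio.charset.Charset;

public class CharTokenize {
    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), utf8));
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(args[1]), utf8));
        String line;
        while ((line = in.readLine()) != null) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                // Underscore stands in for a literal space so every token is non-empty.
                sb.append(c == ' ' ? '_' : c);
                if (i + 1 < line.length()) sb.append(' ');
            }
            out.println(sb.toString());
        }
        in.close();
        out.close();
    }
}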

Here is the output I am seeing:

Reading text files [de-test.txt] and writing to file en-test.model {
    Reading from files [de-test.txt] {
        On line 0
        Writing ARPA {
            On order 1
            Writing line 0
            On order 2
            Writing line 0
            On order 3
            Writing line 0
            Writing line 0
            On order 4
            Writing line 0
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:132)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:113)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.writeToPrintWriter(KneserNeyLmReaderCallback.java:130)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.cleanup(KneserNeyLmReaderCallback.java:111)
    at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:85)
    at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:51)
    at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:44)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:280)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:55)


Original issue reported on code.google.com by hhohw...@shutterstock.com on 9 Aug 2012 at 4:48

Attachments:


Hi,

Interesting. When I run on that file, there is an exception from a bug (which I have fixed), but it is not that exception. That stack trace looks an awful lot like the caching inside the Java built-in Long class is doing funny things -- might it have something to do with your ExecJavaMojo calling things through reflection?

In any case, I have fixed the bug and am running some tests before I release a fix. 1.1.1 should be out by tomorrow.

Original comment by adpa...@gmail.com on 9 Aug 2012 at 5:31

  • Changed state: Started
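To make the suspicion above concrete: in the JDK 6 sources, Long.valueOf(long) returns values in [-128, 127] from a pre-allocated 256-element cache array (indexed by (int) l + 128) and allocates a fresh Long otherwise, so an ArrayIndexOutOfBoundsException: 256 thrown from inside Long.valueOf means the VM computed an index one past the end of that cache -- a JIT/JVM problem rather than anything in the calling code. A minimal illustration of the cache behavior (not of the crash itself):

public class LongCacheDemo {
    public static void main(String[] args) {
        // Values in [-128, 127] are boxed from Long's internal cache, so the same object comes back.
        System.out.println(Long.valueOf(100L) == Long.valueOf(100L)); // true
        // Values outside that range get a freshly allocated Long each time.
        System.out.println(Long.valueOf(200L) == Long.valueOf(200L)); // false
    }
}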


Hi,

Thanks for looking into the issue so quickly.

Interesting that you don't see the same exception. I assume that since berkeleylm is written in Java it should support input encoded in UTF-8. Is that a fair assumption?

I have tried calling the program through Maven (I imported all the source) and also without using Maven at all, and I see the same exception in both cases, which is a bit odd if it is caused by reflection.

Original comment by hhohw...@shutterstock.com on 9 Aug 2012 at 5:43


UTF-8 should be fine. Hopefully the fix I've committed will resolve your issue in any case.

Original comment by adpa...@gmail.com on 9 Aug 2012 at 7:33
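As an aside on the encoding question: if the text reader does not set a charset explicitly (an assumption about TextReader, not something confirmed in this thread), decoding falls back to the JVM's platform default, which can be checked and, on most JVMs, forced at startup with -Dfile.encoding=UTF-8:

public class CharsetCheck {
    public static void main(String[] args) {
        // The charset the JVM uses when no explicit encoding is passed to a reader or writer.
        System.out.println(java.nio.charset.Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}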


Apologies, I fell asleep on this fix. Version 1.1.1 has been uploaded. Let me know if this doesn't fix your issue.

Original comment by adpa...@gmail.com on 13 Aug 2012 at 2:02

  • Changed state: Fixed


I unzipped the new 1.1.1 code but unfortunately am still seeing the same ArrayIndexOutOfBoundsException. I have tried a different input data set in case that was the problem (en-test.txt, attached below) but I see the same problem on that input.

Here are the steps I took to reproduce the error:

1. Unzip the code
2. cd to the top level directory, berkeleylm-1.1.1
3. Run ant from the top level directory
4. From the top level directory, run:
java -cp jar/berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 test-en.model en-test.txt
5. Output is:
Reading text files [en-test.txt] and writing to file test-en.model {
    Reading in ngrams from raw text {
        On line 0
    } [2s]
    Writing Kneser-Ney probabilities {
        Counting counts for order 0 {
        } [0s]
        Counting counts for order 1 {
        } [0s]
        Counting counts for order 2 {
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:140)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:121)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:284)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:299)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:57)

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 11:34

Attachments:


Followed your steps and did not encounter any exceptions. I'm guessing this is a bug in your JVM -- the exception is occurring while boxing a long! You can try using a different JVM, or even try using -server (which you should do anyway, for speed).



Original comment by adpa...@gmail.com on 15 Aug 2012 at 5:10
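For reference, rerunning the step-4 command with the server compiler selected would look like this (same jar and arguments as above; -server is a standard HotSpot flag):

java -server -cp jar/berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 test-en.model en-test.txt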


Thanks again for testing this out. It is quite odd that the error comes from boxing a long. I ran both with and without -server but saw the exception in both cases. I'm going to try a different JVM. Would you mind posting the output you get from running "java -version" so that I can start with that implementation? I'm using 64-bit HotSpot:

$ java -version
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Thanks for the help.

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 5:28


$ java -version
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)

Original comment by adpa...@gmail.com on 15 Aug 2012 at 5:56


I updated my java-6-sun JVM to 1.6.0_34; I was using a version from 2008. I no longer see the exception. Looks like Oracle has been hard at work fixing autoboxing issues in the last few years. :)


Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 8:58
