
MakeKneserNeyArpaFromText throws ArrayIndexOutOfBoundsException #8

Closed

GoogleCodeExporter opened this issue Mar 22, 2015 · 9 comments

I am running edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText on some German text but keep running into an ArrayIndexOutOfBoundsException. If I try to build a model from very limited data, no such error arises. Is there a limit on the number of distinct characters the input text can contain? The out-of-bounds index is 256, which is suspiciously the number of distinct values a byte can hold.

I have attached the input file (German Wikipedia data prepared for a character-level n-gram model).
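Such character-level input is typically one sentence per line with whitespace between every character, so that the language model treats each character as a token. A minimal sketch of that preparation, assuming UTF-8 input and using an underscore to stand in for literal spaces (both choices are illustrative, not necessarily what was done for the attached file):

import java.io.*;
import java.nio.charset.Charset;

public class CharTokenize {
    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), utf8));
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(new FileOutputStream(args[1]), utf8));
        String line;
        while ((line = in.readLine()) != null) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                // Underscore stands in for a literal space so every token is non-empty.
                sb.append(c == ' ' ? '_' : c);
                if (i + 1 < line.length()) sb.append(' ');
            }
            out.println(sb.toString());
        }
        in.close();
        out.close();
    }
}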

Here is the output I am seeing:

Reading text files [de-test.txt] and writing to file en-test.model {
    Reading from files [de-test.txt] {
        On line 0
        Writing ARPA {
            On order 1
            Writing line 0
            On order 2
            Writing line 0
            On order 3
            Writing line 0
            Writing line 0
            On order 4
            Writing line 0
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:132)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:113)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.writeToPrintWriter(KneserNeyLmReaderCallback.java:130)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.cleanup(KneserNeyLmReaderCallback.java:111)
    at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:85)
    at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:51)
    at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:44)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:280)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:55)


Original issue reported on code.google.com by hhohw...@shutterstock.com on 9 Aug 2012 at 4:48

Attachments:


Hi,

Interesting. When I run on that file, there is an exception from a bug (which I have fixed), but it is not that exception. That stack trace looks an awful lot like the caching inside the Java built-in Long class is doing funny things -- might it have something to do with your ExecJavaMojo calling things through reflection?

In any case, I have fixed the bug and am running some tests before I release a fix. 1.1.1 should be out by tomorrow.

Original comment by adpa...@gmail.com on 9 Aug 2012 at 5:31

  • Changed state: Started
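To make the suspicion above concrete: in the JDK 6 sources, Long.valueOf(long) returns values in [-128, 127] from a pre-allocated 256-element cache array (indexed by (int) l + 128) and allocates a fresh Long otherwise, so an ArrayIndexOutOfBoundsException: 256 thrown from inside Long.valueOf means the VM computed an index one past the end of that cache -- a JIT/JVM problem rather than anything in the calling code. A minimal illustration of the cache behavior (not of the crash itself):

public class LongCacheDemo {
    public static void main(String[] args) {
        // Values in [-128, 127] are boxed from Long's internal cache, so the same object comes back.
        System.out.println(Long.valueOf(100L) == Long.valueOf(100L)); // true
        // Values outside that range get a freshly allocated Long each time.
        System.out.println(Long.valueOf(200L) == Long.valueOf(200L)); // false
    }
}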


Hi,

Thanks for looking into the issue so quickly.

Interesting that you don't see the same exception. I assume that since berkeleylm is written in Java it should support input encoded in UTF-8. Is that a fair assumption?

I have tried calling the program through Maven (I imported all the source) and also without using Maven at all, and I see the same exception in both cases, which is a bit odd if it is caused by reflection.

Original comment by hhohw...@shutterstock.com on 9 Aug 2012 at 5:43


UTF-8 should be fine. Hopefully the fix I've committed will resolve your issue in any case.

Original comment by adpa...@gmail.com on 9 Aug 2012 at 7:33
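As an aside on the encoding question: if the text reader does not set a charset explicitly (an assumption about TextReader, not something confirmed in this thread), decoding falls back to the JVM's platform default, which can be checked and, on most JVMs, forced at startup with -Dfile.encoding=UTF-8:

public class CharsetCheck {
    public static void main(String[] args) {
        // The charset the JVM uses when no explicit encoding is passed to a reader or writer.
        System.out.println(java.nio.charset.Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}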


Apologies, I fell asleep on this fix. Version 1.1.1 has been uploaded. Let me know if this doesn't fix your issue.

Original comment by adpa...@gmail.com on 13 Aug 2012 at 2:02

  • Changed state: Fixed


I unzipped the new 1.1.1 code but unfortunately am still seeing the same ArrayIndexOutOfBoundsException. I have tried a different input data set in case that was the problem (en-test.txt, attached below) but I see the same problem on that input.

Here are the steps I took to reproduce the error:

1. Unzip the code
2. cd to the top level directory, berkeleylm-1.1.1
3. Run ant from the top level directory
4. From the top level directory, run:
java -cp jar/berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 test-en.model en-test.txt
5. Output is:
Reading text files [en-test.txt] and writing to file test-en.model {
    Reading in ngrams from raw text {
        On line 0
    } [2s]
    Writing Kneser-Ney probabilities {
        Counting counts for order 0 {
        } [0s]
        Counting counts for order 1 {
        } [0s]
        Counting counts for order 2 {
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:140)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:121)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.parse(KneserNeyLmReaderCallback.java:284)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:299)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:57)

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 11:34

Attachments:


Followed your steps and did not encounter any exceptions. I'm guessing this is a bug in your JVM -- the exception is occurring while boxing a long! You can try using a different JVM, or even try using -server (which you should do anyway, for speed).



Original comment by adpa...@gmail.com on 15 Aug 2012 at 5:10
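For reference, rerunning the step-4 command with the server compiler selected would look like this (same jar and arguments as above; -server is a standard HotSpot flag):

java -server -cp jar/berkeleylm.jar edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText 5 test-en.model en-test.txt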


Thanks again for testing this out. It is quite odd that the error comes from boxing a long. I ran both with and without -server but saw the exception in both cases. I'm going to try a different JVM. Would you mind posting the output you get from running "java -version" so that I can start with that implementation? I'm using 64-bit HotSpot:

$ java -version
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Thanks for the help.

Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 5:28


$ java -version
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)

Original comment by adpa...@gmail.com on 15 Aug 2012 at 5:56


I updated my java-6-sun JVM to 1.6.0_34; I was using a version from 2008. I no longer see the exception. Looks like Oracle has been hard at work fixing autoboxing issues in the last few years. :)


Original comment by hhohw...@shutterstock.com on 15 Aug 2012 at 8:58
