ArrayIndexOutOfBoundsException during read #77

Closed
luca-vercelli opened this issue Dec 14, 2020 · 7 comments

Comments

@luca-vercelli

While reading a certain 900 MB file, I get this error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.util.Arrays.copyOfRange(Unknown Source)
    at com.epam.parso.impl.SasFileParser.getBytesFromFile(SasFileParser.java:917)
    at com.epam.parso.impl.SasFileParser.readDeletedInfo(SasFileParser.java:783)
    at com.epam.parso.impl.SasFileParser.processNextPage(SasFileParser.java:702)
    at com.epam.parso.impl.SasFileParser.readNextPage(SasFileParser.java:671)
    at com.epam.parso.impl.SasFileParser.readNext(SasFileParser.java:578)
    at com.epam.parso.impl.SasFileReaderImpl.readNext(SasFileReaderImpl.java:180)
    at it.finsoft.sas2csv.Main.main(Main.java:60)

This is the line of code 917:

vars.add(Arrays.copyOfRange(cachedPage, (int) (long) offset[i], (int) (long) offset[i] + length[i]));

For reasons I don't understand, the routine getBytesFromFile receives strange values for offset and length: offset[i]=-209 and length[i]=0.
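To illustrate why those values crash, here is a minimal sketch that calls `Arrays.copyOfRange` with a negative start index in isolation. The class name `CopyOfRangeDemo`, the helper `throwsOutOfBounds`, and the 1024-byte stand-in for `cachedPage` are hypothetical and not part of parso. A negative `from` is rejected even when the requested length is zero.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of parso.
final class CopyOfRangeDemo {

    /** Returns true if copying [from, from + length) out of page throws ArrayIndexOutOfBoundsException. */
    static boolean throwsOutOfBounds(byte[] page, int from, int length) {
        try {
            Arrays.copyOfRange(page, from, from + length);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] page = new byte[1024]; // stand-in for cachedPage (size is arbitrary)

        // Values reported in this issue: negative offset, zero length.
        System.out.println(throwsOutOfBounds(page, -209, 0)); // true

        // A zero-length copy from a valid offset is fine.
        System.out.println(throwsOutOfBounds(page, 0, 0));    // false
    }
}
```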

As a workaround, I solved it this way:

            if (offset[i] < 0 || length[i] == 0) {
                vars.add(new byte[0]);
            } else {
                vars.add(Arrays.copyOfRange(cachedPage, (int) (long) offset[i],
                        (int) (long) offset[i] + length[i]));
            }

If you agree with this, I can send a PR.

I am not able to reproduce the issue with a smaller file.

@printsev
Contributor

Thanks for reporting the issue! Would it be possible for you to share the file, or does it contain sensitive information? We are just trying to understand why this particular file is tricky.

@luca-vercelli
Author

I'm sorry, I am not allowed to share this file.

@printsev
Contributor

We will investigate the issue. It looks like your dataset contains deleted rows (see a similar issue: pandas-dev/pandas#15963). The simplest workaround would be to remove those deleted rows if you don't need them; the file should then be read properly.

@xantorohara
Contributor

> I'm sorry, I am not allowed to share this file.

Hi, @luca-vercelli

Since you cannot share your dataset, would it be possible to try reading that file on your machine with this change in
com.epam.parso.impl.SasFileParser.bytesToShort:

Current code:

private int bytesToShort(byte[] bytes) {
    return byteArrayToByteBuffer(bytes).getShort();
}

Fix proposal:

private int bytesToShort(byte[] bytes) {
    return byteArrayToByteBuffer(bytes).getShort() & 0xFFFF;
}

?
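The effect of the `& 0xFFFF` mask can be seen in a small standalone sketch. `ByteBuffer.getShort()` returns a signed 16-bit value, so any stored value of 0x8000 or above turns negative when widened to `int`; the mask restores the unsigned value. The class `UnsignedShortDemo` and its `wrap` helper are hypothetical stand-ins for parso's `byteArrayToByteBuffer`, and little-endian byte order is an assumption. The sample bytes are chosen because 0xFF2F read as signed is -209, the offset reported above; whether that is exactly what the reporter's file contained is also an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of parso.
final class UnsignedShortDemo {

    // Stand-in for parso's byteArrayToByteBuffer; little-endian order is assumed here.
    static ByteBuffer wrap(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Mirrors the current code: the signed short is sign-extended to int.
    static int signedShort(byte[] bytes) {
        return wrap(bytes).getShort();
    }

    // Mirrors the fix proposal: the mask keeps only the low 16 bits.
    static int unsignedShort(byte[] bytes) {
        return wrap(bytes).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x2F, (byte) 0xFF}; // 0xFF2F stored little-endian

        System.out.println(signedShort(raw));   // -209
        System.out.println(unsignedShort(raw)); // 65327
    }
}
```

Values below 0x8000 are unaffected by the mask, so the change only alters results that were previously negative.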

@luca-vercelli
Author

Hi @xantorohara
Yes, your patch solves my issue! Thanks a lot.

@xantorohara
Contributor

Thanks @luca-vercelli !

It seems this fix may solve some other problems as well.
I'll check it on other datasets and create a PR if all is OK.

xantorohara added a commit to xantorohara/parso that referenced this issue Dec 15, 2020
Actually `vars.get(0)` is array of 4 bytes (its length is based on a SasFileConstants.PAGE_DELETED_POINTER_LENGTH = 4). So, it is more correct to use bytesToInt() conversion here instead of bytesToShort().
printsev pushed a commit that referenced this issue Dec 16, 2020
Actually `vars.get(0)` is array of 4 bytes (its length is based on a SasFileConstants.PAGE_DELETED_POINTER_LENGTH = 4). So, it is more correct to use bytesToInt() conversion here instead of bytesToShort().
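The commit message's point about field width can be sketched as follows: when a value occupies 4 bytes, a 16-bit read sees only part of it, even with the unsigned mask applied. The class `PointerWidthDemo` and its methods are hypothetical, little-endian byte order is an assumption, and `asInt` only approximates what parso's `bytesToInt()` is presumed to do (a plain `getInt()`).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of parso.
final class PointerWidthDemo {

    // Reads only the first two bytes as an unsigned 16-bit value (little-endian assumed).
    static int asUnsignedShort(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getShort() & 0xFFFF;
    }

    // Reads all four bytes as a 32-bit value (little-endian assumed).
    static int asInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // 70000 = 0x00011170, stored little-endian in a 4-byte field.
        byte[] pointer = {(byte) 0x70, (byte) 0x11, (byte) 0x01, (byte) 0x00};

        System.out.println(asUnsignedShort(pointer)); // 4464, upper two bytes are lost
        System.out.println(asInt(pointer));           // 70000
    }
}
```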
@printsev
Contributor

Closed, thanks to xantorohara.
