
PARQUET-1633: Fix integer overflow #902

Merged

merged 8 commits into apache:master from eadwright:PARQUET-1633 on Jun 11, 2021

Conversation

eadwright
Contributor

@eadwright eadwright commented May 4, 2021

This PR addresses this issue: https://issues.apache.org/jira/browse/PARQUET-1633

I have not added unit tests, because checking the overflow conditions would require test data over 2GB in size on disk (compressed), considerably larger in memory, and thus significant CI resources.

The issue was the use of an int for the length field, which for parquet files with a very large row_group_size (row groups over 2GB) would cause silent integer overflow, manifesting itself as a negative length and an attempt to create an ArrayList with a negative capacity.

I have tested reading a 6.6GB parquet file with a huge row group size (works out at over 2GB), which recreated this problem; with this modification the file can be read without any issues.
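For illustration, here is a minimal, hypothetical sketch of the failure mode described above (the names are made up and this is not the actual parquet-mr code): a column-chunk size just over 2GB silently wraps when narrowed to an int, and the negative result later surfaces as an attempt to size a collection with a negative capacity.

import java.util.ArrayList;
import java.util.List;

public class OverflowSketch {
  public static void main(String[] args) {
    // A column chunk slightly over 2GB, as another Parquet implementation may write it.
    long chunkSizeOnDisk = 2_200_000_000L;

    // Narrowing to int silently wraps past Integer.MAX_VALUE (2_147_483_647).
    int length = (int) chunkSizeOnDisk;
    System.out.println(length); // prints -2094967296

    // Downstream code that sizes its buffers from the (now negative) length blows up.
    int parts = length / (128 * 1024 * 1024);        // negative
    List<byte[]> buffers = new ArrayList<>(parts);   // IllegalArgumentException: Illegal Capacity
  }
}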

@@ -1464,7 +1464,7 @@ protected PageHeader readPageHeader(BlockCipher.Decryptor blockDecryptor, byte[]
    */
   private void verifyCrc(int referenceCrc, byte[] bytes, String exceptionMsg) {
     crc.reset();
-    crc.update(bytes);
+    crc.update(bytes, 0, bytes.length);
Contributor Author

Changed to adopt a Java 8 API, to be consistent with the pom


This is unrelated, I would prefer to update this in another PR.

Contributor Author

Agreed, committed to revert this change
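For reference, a hedged sketch of the API nuance discussed in this thread, assuming the crc field is declared through the java.util.zip.Checksum interface (an assumption; the concrete CRC32 class already offers the single-argument overload on Java 8): Checksum.update(byte[]) only exists as a default method from Java 9 onward, while the three-argument overload is part of the Java 8 interface.

import java.util.zip.CRC32;
import java.util.zip.Checksum;

class CrcSketch {
  static long crcOf(byte[] bytes) {
    Checksum crc = new CRC32();
    crc.reset();
    // Checksum.update(byte[]) is a Java 9+ default method; the three-argument
    // overload below compiles and runs against the Java 8 interface as well.
    crc.update(bytes, 0, bytes.length);
    return crc.getValue();
  }
}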

@shangxinli
Contributor

Can you come up with a unit test?

@eadwright
Contributor Author

eadwright commented May 5, 2021

Can you come up with a unit test?

I want to, and I'm open to ideas. To recreate the problem you need a parquet file which has a full row group over 2GB in size followed by some more data, which translates to 13GB+ of memory used. My concern is whether such a test would blow the constraints during CI, or put an unreasonable memory burden on anyone wanting to build or test the project. What are reasonable limits there?

I will add that no tests have found a regression with this change.

@eadwright
Contributor Author

eadwright commented May 5, 2021

Interesting to note that ParquetWriter takes an int for the block size, so it probably can't create a file which recreates this bug. I can easily recreate the bug by creating a large file with an insane block size in Python.

@eadwright
Contributor Author

I tried adapting TestParquetReaderRandomAccess to create a file with a 2GB block size and 135M records, each with 2 longs, which should spill over into two blocks and recreate the issue. Alas this fails to run even if I give it 12GB of heap. One needs a lot of RAM to reproduce this issue.

@gszadovszky
Contributor

@eadwright,

I'll try to summarize the issue; please correct me if I'm wrong. Parquet-mr is not able to write such big row groups (>2GB) because of the int array size limitation. Meanwhile, both the format and some other implementations allow such big row groups. So, parquet-mr shall be prepared for this in some way.
One option is to "simply" read the large row groups. It would require significant effort to use proper memory handling objects that would properly support reading the large row groups. (A similar effort would also make parquet-mr able to write row groups larger than 2GB.)

The other option is to handle the too-large row groups with a proper error message in parquet-mr without allowing silent overflows. This second option would be this effort. It is great to handle the potential int overflows, but the main point, I think, would be at the footer conversion (ParquetMetadataConverter) where we create our own object structure from the file footer. At that point we can throw the proper error messages if the row group is too large to be handled (for now) in parquet-mr.
BTW, it might not be enough to check the potential overflows to validate whether we can read a row group of that size. (See e.g. the source code of ArrayList.)

About the lack of unit tests: I can accept, in some cases, that unit tests are not practically feasible to implement. In those cases I usually ask to validate the code offline.


@advancedxy advancedxy left a comment


I met a similar issue: https://issues.apache.org/jira/browse/PARQUET-2045, probably the same one.

I would like this fix to get merged.


@@ -1763,8 +1763,8 @@ public void addChunk(ChunkDescriptor descriptor) {
     public void readAll(SeekableInputStream f, ChunkListBuilder builder) throws IOException {
       f.seek(offset);

-      int fullAllocations = length / options.getMaxAllocationSize();
-      int lastAllocationSize = length % options.getMaxAllocationSize();
+      int fullAllocations = (int)(length / options.getMaxAllocationSize());


(int)(length / options.getMaxAllocationSize()) -> Math.toIntExact(length / options.getMaxAllocationSize());
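A minimal sketch of the suggested pattern (assumed names, not the exact parquet-mr code): keep the chunk length as a long, do the arithmetic in long, and narrow with Math.toIntExact so an out-of-range value throws ArithmeticException instead of wrapping silently.

class AllocationSplitSketch {
  // Split a (possibly >2GB) chunk length into full allocations of maxAllocationSize
  // bytes plus one trailing allocation, without silent int overflow.
  static int[] split(long length, int maxAllocationSize) {
    int fullAllocations = Math.toIntExact(length / maxAllocationSize);
    int lastAllocationSize = Math.toIntExact(length % maxAllocationSize);
    return new int[] { fullAllocations, lastAllocationSize };
  }

  public static void main(String[] args) {
    // e.g. a 2.2GB column chunk read with a 128MB allocation cap
    int[] parts = split(2_200_000_000L, 128 * 1024 * 1024);
    System.out.println(parts[0] + " full allocations, last allocation of " + parts[1] + " bytes");
  }
}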

@advancedxy


Hi @gszadovszky, parquet-mr is able to produce big row groups. We found that some files written by Spark (which uses parquet-mr) have this problem. See https://issues.apache.org/jira/browse/PARQUET-2045 for details.

There are two options to fix this problem:

  1. fail on the writer side when creating such a large row group/column chunk
  2. support it on the reader side, which is this approach. It would require a lot of resources, but it's feasible.

Either option is fine for me, WDYT?

@gszadovszky
Contributor

@advancedxy, thanks for explaining.
I think the best option is 2. It is up to the user to provide enough resources for handling the large row groups, or not to write them.
Meanwhile, even though I've written that I can accept the lack of unit tests in some situations, my concern in this case is that I am not sure every aspect of a large row group is handled properly. So we clearly need to validate this fix with such large row groups. This test can even be implemented in this source code, but we must not include it in the unit tests or integration tests we run regularly.

@advancedxy


@gszadovszky well understood.

Hi @eadwright, is it possible for you to add the test case (tagged with @Ignore) in your PR, so that others like @gszadovszky or me can verify it offline?

@eadwright
Contributor Author

Not a unit test yet, but I have at least found a reliable way to create a parquet file in Python which causes Java to break when reading it. For this we require a huge number of rows, and a string column with very high entropy to avoid the use of dictionary encoding or compression. Note that these conditions can be real-world: records containing long SHA3 hashes, for example.

This Python code needs more than 16GB of RAM to execute.

import pandas as pd
import numpy as np

rand_array = np.random.rand(75000000, 4)

df = pd.DataFrame(rand_array, columns=['number1', 'number2', 'number3', 'number4'])
df['string1'] = df['number1'].astype(str) + df['number2'].astype(str) + df['number3'].astype(str)

df.to_parquet("random.parquet", compression='snappy', engine='pyarrow')

@eadwright
Contributor Author

eadwright commented May 19, 2021

Screenshot from parquet-tools meta looking at the generated file. Note the TS (total size) of row group 1 is way over 2GB. For the bug to happen, it is also necessary for the rows to spill over into a second row group.

[screenshot: parquet-tools meta output for the generated file]

@gszadovszky
Contributor

@eadwright, what do you mean by "necessary for the rows to spill over into a second row group"? That should not be possible. Pages already keep row boundaries, and for row groups it is required by the specification.

@eadwright
Contributor Author


Sorry, I probably didn't phrase that well. I mean, for this bug to occur, you need i) a row group which takes more than 2GB of space, to get the 2^31 signed-int overflow, and ii) another subsequent row group (any size), so that the code with the bug adds a corrupted value to a file offset.

In the example Python code, if I ask it to produce 50M rows instead of 75M, you get a ~3.3GB row group but no second row group. The file offset addition code path is not executed, the file is read correctly, and the bug is not triggered.
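To make the two conditions above concrete, a small hedged sketch with illustrative numbers only (not the actual reader code): the wrapped length is harmless on its own, and only causes damage once it is added into the file offset used to locate the next row group.

class OffsetSketch {
  public static void main(String[] args) {
    long rowGroup0Start = 4;               // data starts right after the "PAR1" magic
    long rowGroup0Size = 2_252_000_000L;   // over Integer.MAX_VALUE

    // Buggy path: the size was carried in an int and has already wrapped negative.
    int wrapped = (int) rowGroup0Size;                // -2042967296
    long buggyNextOffset = rowGroup0Start + wrapped;  // bogus (negative) offset for row group 1

    // Fixed path: the size stays a long, so the next offset is correct.
    long correctNextOffset = rowGroup0Start + rowGroup0Size;

    System.out.println(buggyNextOffset + " vs " + correctNextOffset);
  }
}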

@gszadovszky
Contributor

Thanks, @eadwright for explaining. I get it now.

@eadwright
Contributor Author

eadwright commented May 19, 2021

The build failed due to some transient connectivity issue; it builds fine for me on Java 8 and 11.

@eadwright
Contributor Author

eadwright commented May 19, 2021

Correction, we need a column within a row group to be over 2GB in size to cause the issue. It is not the row group size in total which counts.

@eadwright
Contributor Author

eadwright commented May 19, 2021

I've tweaked the python to create a test file which Java can't read. The python can now run fine on a 16GB machine.

import pandas as pd
import numpy as np
rand_array = np.random.rand(38000000, 3)
df = pd.DataFrame(rand_array, columns=["number1", "number2", "number3"])
df["string"] = df["number1"].astype(str) + df["number2"].astype(str) + df["number3"].astype(str)
df.drop(["number1", "number2", "number3"], axis=1, inplace=True)
df.to_parquet("random.parquet", compression=None, engine="pyarrow", **{"row_group_size": 37800000})

It creates 38M records, 37.8M of which are in the first row group, and the data for the string column in the first row group is about 2.1GB in size, over the threshold to cause the Java bug.

@advancedxy

@eadwright can you upload the corrupt parquet file to some public cloud service like Google Drive or S3? Or is it possible for you to add a new test case (not run by default) in TestParquetReaderRandomAccess?

@gszadovszky do you have any other concerns about this PR?

@eadwright
Contributor Author

eadwright commented May 23, 2021

I've uploaded the python-generated file to my Google Drive: https://drive.google.com/file/d/1trWjeJaHpqbHlnipnDUrwlwYM-AKFNWg/view?usp=sharing

Still working on a unit test. I'm used to working with Avro schema, but I'm trying to cut it back to bare-bones.

@eadwright
Contributor Author

Update - I've created some Java code which writes a parquet file nearly big enough to cause the issue, and can successfully read this file. However, there are two problems:

  • If I bump the record count to get a file big enough to reproduce bug PARQUET-1633, another bug causes 32-bit integer overflow. The code as it stands cannot reproduce the file which was created in Python.
  • The code I'm using to read the data, the potential unit test, does not work against the Python-produced file (which has an Avro schema), as I get this error (I suspect some classpath issue): java.lang.NoSuchMethodError: org.apache.parquet.format.LogicalType.getSetField()Lshaded/parquet/org/apache/thrift/TFieldIdEnum;

So... if any of you can reproduce the original error with the parquet file I posted above, and can validate that this 7-line fix addresses it, that'd be great. Open to ideas of course.

@gszadovszky
Contributor

@eadwright, I'll try to look into this this week and produce Java code to reproduce the issue.

@eadwright
Contributor Author

eadwright commented May 25, 2021

@gszadovszky Awesome, appreciated. Also note the file uploaded isn't corrupt as such, it just goes beyond 32-bit limits.

- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests
@gszadovszky
Contributor

@eadwright, I've implemented a unit test to reproduce the issue and test your solution. Feel free to use it in your PR. I've left some TODOs for you :)

@gszadovszky
Contributor

@eadwright, so the CI was somehow executed on my private repo and failed due to OOM. So, we may either investigate whether we can tweak our configs/CI or disable this test by default.

@eadwright
Contributor Author

@gszadovszky I raised a PR to bring your changes into my fork. I haven't had time yet, alas, to address the TODOs. I can say, though, that I believe reading that example file correctly with the fix requires 10GB of heap or so, probably similar with your test. I agree this test should be disabled by default; it's too heavy for CI.

@gszadovszky
Contributor

@eadwright, I've made some changes to the unit test (no more TODOs). See the update here. The idea is to not skip the tests in the "normal" case but to catch the OOM and skip. This way no tests should fail in any environment. Most modern laptops should have enough memory, so the test will be executed on them.
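A rough sketch of the catch-and-skip idea being discussed (JUnit 4 style, with hypothetical class and method names rather than the actual test from the linked update): if the machine cannot allocate enough heap, the OutOfMemoryError is converted into an assumption failure so the test is reported as skipped rather than failed.

import org.junit.Assume;
import org.junit.Test;

public class TestLargeColumnChunkSketch {

  @Test
  public void readRowGroupLargerThan2GB() {
    try {
      writeAndReadLargeFile();   // would need several GB of heap
    } catch (OutOfMemoryError e) {
      // Not enough memory on this machine or CI worker: skip instead of fail.
      Assume.assumeNoException(e);
    }
  }

  private void writeAndReadLargeFile() {
    // Placeholder for the >2GB column chunk round trip exercised by the real test.
  }
}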

@eadwright
Contributor Author

@gszadovszky I had a look at your changes. I feel uncomfortable relying on any behaviour at all after an OOM error. Are you sure this is the right approach?

@eadwright
Contributor Author

eadwright commented Jun 10, 2021

@gszadovszky Sorry, the delay was on my part. I have merged your changes in. Even in a testing pipeline I am uncomfortable with catching any kind of java.lang.Error, especially an OOM. Should we remove those catch clauses and mark this test as ignored by default, so it is run manually?

Love how the tests need ~3GB, not 10GB.

@gszadovszky
Contributor

@eadwright, I understand your concerns; I don't really like it either. Meanwhile, I don't feel good about having a test that is not executed automatically. Without regular executions there is no guarantee that this test would ever be executed again, and even if someone did execute it, it might fail because of lack of maintenance.

What do you think about the following options? @shangxinli, I'm also curious about your ideas.

  • Execute this test separately with a Maven profile. I am not sure if the CI allows allocating such large memory, but with Xmx options we might give it a try and create a separate check for this test only.
  • Similar to the previous option with the profile, but never executing it in the CI. Instead, we add some comments to the release doc so this test is executed at least once per release.
  • Configure the CI profile to skip this test but keep it in the normal scenario, meaning the devs will execute it locally. There are a couple of cons, though. There is no guarantee that devs execute all the tests, including this one. It can also cause issues if a dev doesn't have enough memory and doesn't know that the test failure is unrelated to their current change.

@eadwright
Contributor Author

Interesting options, @gszadovszky. I have no strong opinion; I'd just like this fix merged once everyone is happy :)

@gszadovszky
Contributor

@eadwright, sorry, you're right. This is not tightly related to your PR. Please remove the try-catch blocks for the OOM and put an @Ignore annotation on the test class for now. I'll open a separate jira to solve the issue of such "not-to-be-tested-by-CI" tests.

@gszadovszky gszadovszky changed the title from "PARQUET-1633 Fix integer overflow" to "PARQUET-1633: Fix integer overflow" on Jun 10, 2021
@eadwright
Contributor Author

@gszadovszky done

Contributor

@gszadovszky gszadovszky left a comment


Thanks a lot for your efforts, @eadwright.

@eadwright
Contributor Author

eadwright commented Jun 10, 2021

Pleasure @gszadovszky - thanks for yours!

@eadwright
Contributor Author

Note I don't have write access, so I guess someone needs to merge this at some point? Not sure of your workflow.

@gszadovszky
Contributor

Sure, @eadwright, I'm merging it. I just usually give another day after the approval so others have the chance to comment.

@gszadovszky gszadovszky merged commit 98ddadf into apache:master Jun 11, 2021
@gszadovszky
Contributor

@eadwright, do you have a jira account so I can assign this one to you?

@eadwright
Contributor Author

@gszadovszky Thanks - just created Jira account edw_vtxa.

@eadwright eadwright deleted the PARQUET-1633 branch June 11, 2021 15:33
shangxinli pushed a commit that referenced this pull request Sep 9, 2021
Unit test:
- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
schlosna added a commit to palantir/parquet-mr that referenced this pull request Sep 23, 2021
Unit test:
- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests

Co-authored-by: Gabor Szadovszky <gabor@apache.org>

Cherry-picked from apache#902
commit 98ddadf
sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Mar 3, 2022
Unit test:
- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
(cherry picked from commit aa132b3)