
PARQUET-1633: Fix integer overflow #902

Merged

merged 8 commits into apache:master from eadwright:PARQUET-1633 on Jun 11, 2021

Conversation

eadwright
Contributor

@eadwright eadwright commented May 4, 2021

This PR addresses this issue: https://issues.apache.org/jira/browse/PARQUET-1633

I have not added unit tests, because checking the overflow conditions would require test data over 2GB in size on disk (compressed), considerably larger in memory, and thus significant CI resources.

The issue was the use of an int for the length field, which for parquet files with a very large row_group_size (row groups over 2GB) would cause silent integer overflow, manifesting itself as a negative length and an attempt to create an ArrayList with a negative capacity.

I have tested reading a 6.6GB parquet file with a huge row group size (works out at over 2GB), which recreated this problem; with this modification the file can be read without any issues.
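For illustration, here is a minimal, hypothetical sketch of the failure mode described above (the names are made up and this is not the actual parquet-mr code): a column-chunk size just over 2GB silently wraps when narrowed to an int, and the negative result later surfaces as an attempt to size a collection with a negative capacity.

import java.util.ArrayList;
import java.util.List;

public class OverflowSketch {
  public static void main(String[] args) {
    // A column chunk slightly over 2GB, as another Parquet implementation may write it.
    long chunkSizeOnDisk = 2_200_000_000L;

    // Narrowing to int silently wraps past Integer.MAX_VALUE (2_147_483_647).
    int length = (int) chunkSizeOnDisk;
    System.out.println(length); // prints -2094967296

    // Downstream code that sizes its buffers from the (now negative) length blows up.
    int parts = length / (128 * 1024 * 1024);        // negative
    List<byte[]> buffers = new ArrayList<>(parts);   // IllegalArgumentException: Illegal Capacity
  }
}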

@@ -1464,7 +1464,7 @@ protected PageHeader readPageHeader(BlockCipher.Decryptor blockDecryptor, byte[]
    */
   private void verifyCrc(int referenceCrc, byte[] bytes, String exceptionMsg) {
     crc.reset();
-    crc.update(bytes);
+    crc.update(bytes, 0, bytes.length);
Contributor Author

Changed to adopt a Java 8 API, to be consistent with the pom


This is unrelated, I would prefer to update this in another PR.

Contributor Author

Agreed, committed to revert this change
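For reference, a hedged sketch of the API nuance discussed in this thread, assuming the crc field is declared through the java.util.zip.Checksum interface (an assumption; the concrete CRC32 class already offers the single-argument overload on Java 8): Checksum.update(byte[]) only exists as a default method from Java 9 onward, while the three-argument overload is part of the Java 8 interface.

import java.util.zip.CRC32;
import java.util.zip.Checksum;

class CrcSketch {
  static long crcOf(byte[] bytes) {
    Checksum crc = new CRC32();
    crc.reset();
    // Checksum.update(byte[]) is a Java 9+ default method; the three-argument
    // overload below compiles and runs against the Java 8 interface as well.
    crc.update(bytes, 0, bytes.length);
    return crc.getValue();
  }
}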

@shangxinli
Contributor

Can you come up with a unit test?

@eadwright
Contributor Author

eadwright commented May 5, 2021

Can you come up with a unit test?

I want to, and I'm open to ideas. To recreate the problem you need a parquet file which has a full row group over 2GB in size followed by some more data, which translates to 13GB+ of memory used. My concern is whether such a test would blow the constraints during CI, or put an unreasonable memory burden on anyone wanting to build or test the project. What are reasonable limits there?

I will add that no tests have found a regression with this change.

@eadwright
Contributor Author

eadwright commented May 5, 2021

Interesting to note that ParquetWriter takes an int for the block size, so it probably can't create a file which recreates this bug. I can easily recreate the bug by creating a large file with an insane block size in Python.

@eadwright
Contributor Author

I tried adapting TestParquetReaderRandomAccess to create a file with a 2GB block size and 135M records, each with 2 longs, which should spill over into two blocks and recreate the issue. Alas this fails to run even if I give it 12GB of heap. One needs a lot of RAM to reproduce this issue.

@gszadovszky
Contributor

@eadwright,

I'll try to summarize the issue; please correct me if I'm wrong. Parquet-mr is not able to write such big row groups (>2GB) because of the int array size limitation. Meanwhile, both the format and some other implementations allow such big row groups. So, parquet-mr shall be prepared for this in some way.
One option is to "simply" read the large row groups. It would require significant effort to use proper memory handling objects that would properly support reading the large row groups. (A similar effort would also make parquet-mr able to write row groups larger than 2GB.)

The other option is to handle the too-large row groups with a proper error message in parquet-mr without allowing silent overflows. This second option would be this effort. It is great to handle the potential int overflows, but the main point, I think, would be at the footer conversion (ParquetMetadataConverter) where we create our own object structure from the file footer. At that point we can throw the proper error messages if the row group is too large to be handled (for now) in parquet-mr.
BTW, it might not be enough to check the potential overflows to validate whether we can read a row group of that size. (See e.g. the source code of ArrayList.)

About the lack of unit tests: I can accept, in some cases, that unit tests are not practically feasible to implement. In those cases I usually ask to validate the code offline.


@advancedxy advancedxy left a comment


I met a similar issue: https://issues.apache.org/jira/browse/PARQUET-2045, probably the same one.

I would like this fix to get merged.


@@ -1763,8 +1763,8 @@ public void addChunk(ChunkDescriptor descriptor) {
     public void readAll(SeekableInputStream f, ChunkListBuilder builder) throws IOException {
       f.seek(offset);

-      int fullAllocations = length / options.getMaxAllocationSize();
-      int lastAllocationSize = length % options.getMaxAllocationSize();
+      int fullAllocations = (int)(length / options.getMaxAllocationSize());


(int)(length / options.getMaxAllocationSize()) -> Math.toIntExact(length / options.getMaxAllocationSize());
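A minimal sketch of the suggested pattern (assumed names, not the exact parquet-mr code): keep the chunk length as a long, do the arithmetic in long, and narrow with Math.toIntExact so an out-of-range value throws ArithmeticException instead of wrapping silently.

class AllocationSplitSketch {
  // Split a (possibly >2GB) chunk length into full allocations of maxAllocationSize
  // bytes plus one trailing allocation, without silent int overflow.
  static int[] split(long length, int maxAllocationSize) {
    int fullAllocations = Math.toIntExact(length / maxAllocationSize);
    int lastAllocationSize = Math.toIntExact(length % maxAllocationSize);
    return new int[] { fullAllocations, lastAllocationSize };
  }

  public static void main(String[] args) {
    // e.g. a 2.2GB column chunk read with a 128MB allocation cap
    int[] parts = split(2_200_000_000L, 128 * 1024 * 1024);
    System.out.println(parts[0] + " full allocations, last allocation of " + parts[1] + " bytes");
  }
}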

@advancedxy


Hi @gszadovszky, parquet-mr is able to produce big row groups. We found that some files written by Spark (which uses parquet-mr) have this problem. See https://issues.apache.org/jira/browse/PARQUET-2045 for details.

There are two options to fix this problem:

  1. fail on the writer side when creating such a large row group/column chunk
  2. support it on the reader side, which is this approach. It would require a lot of resources, but it's feasible.

Either option is fine for me, WDYT?

@gszadovszky
Contributor

@advancedxy, thanks for explaining.
I think the best option is 2. It is up to the user to provide enough resources for handling the large row groups, or not to write them.
Meanwhile, even though I've written that I can accept the lack of unit tests in some situations, my concern in this case is that I am not sure every aspect of a large row group is handled properly. So we clearly need to validate this fix with such large row groups. This test can even be implemented in this source code, but we must not include it in the unit tests or integration tests we run regularly.

@advancedxy


@gszadovszky well understood.

Hi @eadwright, is it possible for you to add the test case (tagged with @Ignore) in your PR, so that others like @gszadovszky or me can verify it offline?

@eadwright
Contributor Author

Not a unit test yet, but I have at least found a reliable way to create a parquet file in Python which causes Java to break when reading it. For this we require a huge number of rows, and a string column with very high entropy to avoid the use of dictionary encoding or compression. Note that these conditions can be real-world: records containing long SHA3 hashes, for example.

This Python code needs more than 16GB of RAM to execute.

import pandas as pd
import numpy as np

rand_array = np.random.rand(75000000, 4)

df = pd.DataFrame(rand_array, columns=['number1', 'number2', 'number3', 'number4'])
df['string1'] = df['number1'].astype(str) + df['number2'].astype(str) + df['number3'].astype(str)

df.to_parquet("random.parquet", compression='snappy', engine='pyarrow')

@eadwright
Contributor Author

eadwright commented May 19, 2021

Screenshot from parquet-tools meta looking at the generated file. Note the TS (total size) of row group 1 is way over 2GB. For the bug to happen, it is also necessary for the rows to spill over into a second row group.

[screenshot: parquet-tools meta output for the generated file]

@gszadovszky
Contributor

@eadwright, what do you mean by "necessary for the rows to spill over into a second row group"? That should not be possible. Pages already keep row boundaries, and for row groups it is required by the specification.

@eadwright
Contributor Author


Sorry, I probably didn't phrase that well. I mean, for this bug to occur, you need i) a row group which takes more than 2GB of space, to get the 2^31 signed-int overflow, and ii) another subsequent row group (any size), so that the code with the bug adds a corrupted value to a file offset.

In the example Python code, if I ask it to produce 50M rows instead of 75M, you get a ~3.3GB row group but no second row group. The file offset addition code path is not executed, the file is read correctly, and the bug is not triggered.
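To make the two conditions above concrete, a small hedged sketch with illustrative numbers only (not the actual reader code): the wrapped length is harmless on its own, and only causes damage once it is added into the file offset used to locate the next row group.

class OffsetSketch {
  public static void main(String[] args) {
    long rowGroup0Start = 4;               // data starts right after the "PAR1" magic
    long rowGroup0Size = 2_252_000_000L;   // over Integer.MAX_VALUE

    // Buggy path: the size was carried in an int and has already wrapped negative.
    int wrapped = (int) rowGroup0Size;                // -2042967296
    long buggyNextOffset = rowGroup0Start + wrapped;  // bogus (negative) offset for row group 1

    // Fixed path: the size stays a long, so the next offset is correct.
    long correctNextOffset = rowGroup0Start + rowGroup0Size;

    System.out.println(buggyNextOffset + " vs " + correctNextOffset);
  }
}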

@gszadovszky
Contributor

Thanks, @eadwright for explaining. I get it now.

@eadwright
Contributor Author

eadwright commented May 19, 2021

The build failed due to some transient connectivity issue; it builds fine for me on Java 8 and 11.

@eadwright
Contributor Author

eadwright commented May 19, 2021

Correction, we need a column within a row group to be over 2GB in size to cause the issue. It is not the row group size in total which counts.

@eadwright
Contributor Author

eadwright commented May 19, 2021

I've tweaked the python to create a test file which Java can't read. The python can now run fine on a 16GB machine.

import pandas as pd
import numpy as np
rand_array = np.random.rand(38000000, 3)
df = pd.DataFrame(rand_array, columns=["number1", "number2", "number3"])
df["string"] = df["number1"].astype(str) + df["number2"].astype(str) + df["number3"].astype(str)
df.drop(["number1", "number2", "number3"], axis=1, inplace=True)
df.to_parquet("random.parquet", compression=None, engine="pyarrow", **{"row_group_size": 37800000})

It creates 38M records, 37.8M of which are in the first row group, and the data for the string column in the first row group is about 2.1GB in size, over the threshold to cause the Java bug.

@advancedxy

@eadwright can you upload the corrupt parquet file to some public cloud service like Google Drive or S3? Or is it possible for you to add a new test case (not run by default) in TestParquetReaderRandomAccess?

@gszadovszky do you have any other concerns about this PR?

@eadwright
Contributor Author

eadwright commented May 23, 2021

I've uploaded the python-generated file to my Google Drive: https://drive.google.com/file/d/1trWjeJaHpqbHlnipnDUrwlwYM-AKFNWg/view?usp=sharing

Still working on a unit test. I'm used to working with Avro schema, but I'm trying to cut it back to bare-bones.

@eadwright
Contributor Author

Update - I've created some Java code which writes a parquet file nearly big enough to cause the issue, and can successfully read this file. However, there are two problems:

  • If I bump the record count to get a file big enough to reproduce bug PARQUET-1633, another bug causes 32-bit integer overflow. The code as it stands cannot reproduce the file which was created in Python.
  • The code I'm using to read the data, the potential unit test, does not work against the Python-produced file (which has an Avro schema), as I get this error (I suspect some classpath issue): java.lang.NoSuchMethodError: org.apache.parquet.format.LogicalType.getSetField()Lshaded/parquet/org/apache/thrift/TFieldIdEnum;

So... if any of you can reproduce the original error with the parquet file I posted above, and can validate that this 7-line fix addresses it, that'd be great. Open to ideas of course.

@gszadovszky
Contributor

@eadwright, I'll try to look into this this week and produce Java code to reproduce the issue.

@eadwright
Contributor Author

eadwright commented May 25, 2021

@gszadovszky Awesome, appreciated. Also note the file uploaded isn't corrupt as such, it just goes beyond 32-bit limits.

- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests
@gszadovszky
Contributor

@eadwright, I've implemented a unit test to reproduce the issue and test your solution. Feel free to use it in your PR. I've left some TODOs for you :)

@gszadovszky
Contributor

@eadwright, so the CI was somehow executed on my private repo and failed due to OOM. So, we may either investigate whether we can tweak our configs/CI or disable this test by default.

@eadwright
Contributor Author

@gszadovszky I raised a PR to bring your changes into my fork. I haven't had time yet, alas, to address the TODOs. I can say, though, that I believe reading that example file correctly with the fix requires 10GB of heap or so, probably similar with your test. I agree this test should be disabled by default; it's too heavy for CI.

@gszadovszky
Contributor

@eadwright, I've made some changes to the unit test (no more TODOs). See the update here. The idea is to not skip the tests in the "normal" case but to catch the OOM and skip. This way no tests should fail in any environment. Most modern laptops should have enough memory, so the test will be executed on them.
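A rough sketch of the catch-and-skip idea being discussed (JUnit 4 style, with hypothetical class and method names rather than the actual test from the linked update): if the machine cannot allocate enough heap, the OutOfMemoryError is converted into an assumption failure so the test is reported as skipped rather than failed.

import org.junit.Assume;
import org.junit.Test;

public class TestLargeColumnChunkSketch {

  @Test
  public void readRowGroupLargerThan2GB() {
    try {
      writeAndReadLargeFile();   // would need several GB of heap
    } catch (OutOfMemoryError e) {
      // Not enough memory on this machine or CI worker: skip instead of fail.
      Assume.assumeNoException(e);
    }
  }

  private void writeAndReadLargeFile() {
    // Placeholder for the >2GB column chunk round trip exercised by the real test.
  }
}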

@eadwright
Contributor Author

@gszadovszky I had a look at your changes. I feel uncomfortable relying on any behaviour at all after an OOM error. Are you sure this is the right approach?

@eadwright
Contributor Author

eadwright commented Jun 10, 2021

@gszadovszky Sorry, the delay was on my part. I have merged your changes in. Even in a testing pipeline I am uncomfortable with catching any kind of java.lang.Error, especially an OOM. Should we remove those catch clauses and mark this test as ignored by default, so it is run manually?

Love how the tests need ~3GB, not 10GB.

@gszadovszky
Contributor

@eadwright, I understand your concerns; I don't really like it either. Meanwhile, I don't feel good about having a test that is not executed automatically. Without regular executions there is no guarantee that this test would ever be executed again, and even if someone did execute it, it might fail because of lack of maintenance.

What do you think about the following options? @shangxinli, I'm also curious about your ideas.

  • Execute this test separately with a Maven profile. I am not sure if the CI allows allocating such large memory, but with Xmx options we might give it a try and create a separate check for this test only.
  • Similar to the previous option with the profile, but never executing it in the CI. Instead, we add some comments to the release doc so this test is executed at least once per release.
  • Configure the CI profile to skip this test but keep it in the normal scenario, meaning the devs will execute it locally. There are a couple of cons, though. There is no guarantee that devs execute all the tests, including this one. It can also cause issues if a dev doesn't have enough memory and doesn't know that the test failure is unrelated to their current change.

@eadwright
Contributor Author

Interesting options, @gszadovszky. I have no strong opinion; I'd just like this fix merged once everyone is happy :)

@gszadovszky
Contributor

@eadwright, sorry, you're right. This is not tightly related to your PR. Please remove the try-catch blocks for the OOM and put an @Ignore annotation on the test class for now. I'll open a separate jira to solve the issue of such "not-to-be-tested-by-CI" tests.

@gszadovszky gszadovszky changed the title from "PARQUET-1633 Fix integer overflow" to "PARQUET-1633: Fix integer overflow" on Jun 10, 2021
@eadwright
Contributor Author

@gszadovszky done

Contributor

@gszadovszky gszadovszky left a comment


Thanks a lot for your efforts, @eadwright.

@eadwright
Contributor Author

eadwright commented Jun 10, 2021

Pleasure @gszadovszky - thanks for yours!

@eadwright
Contributor Author

Note I don't have write access, so I guess someone needs to merge this at some point? Not sure of your workflow.

@gszadovszky
Contributor

Sure, @eadwright, I'm merging it. I just usually give another day after the approval so others have the chance to comment.

@gszadovszky gszadovszky merged commit 98ddadf into apache:master Jun 11, 2021
@gszadovszky
Contributor

@eadwright, do you have a jira account so I can assign this one to you?

@eadwright
Contributor Author

@gszadovszky Thanks - just created Jira account edw_vtxa.

@eadwright eadwright deleted the PARQUET-1633 branch June 11, 2021 15:33
shangxinli pushed a commit that referenced this pull request Sep 9, 2021
Unit test:
- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
schlosna added a commit to palantir/parquet-mr that referenced this pull request Sep 23, 2021
Unit test:
- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests

Co-authored-by: Gabor Szadovszky <gabor@apache.org>

Cherry-picked from apache#902
commit 98ddadf
sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Mar 3, 2022
Unit test:
- Updated ParquetWriter to support setting row group size in long
- Removed Xmx settings in the pom to allow more memory for the tests

Co-authored-by: Gabor Szadovszky <gabor@apache.org>
(cherry picked from commit aa132b3)