
ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages #7103

Closed
wesm wants to merge 2 commits from the ARROW-8694 branch

Conversation

@wesm (Member) commented May 4, 2020

While it's not an ideal use case for Parquet, the 10MB limit for strings was causing a Thrift deserialization failure due to the large "pandas metadata" JSON blob written with the Schema when there are many columns. A 100MB limit should still catch "memory bombs" caused by nefarious input while allowing fairly wide data frames to be stored successfully.
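
For context, here is a minimal sketch (not this PR's code verbatim) of how the limit enters the deserialization path, assuming Thrift's C++ `TCompactProtocolFactoryT`/`TMemoryBuffer` APIs and the Thrift-generated `parquet::format::FileMetaData` type; older Thrift releases return `boost::shared_ptr` from `getProtocol`, so the `auto` here assumes a `std::shared_ptr`-based build:

```cpp
#include <memory>

#include <thrift/protocol/TCompactProtocol.h>
#include <thrift/transport/TBufferTransports.h>

#include "parquet/parquet_types.h"  // Thrift-generated types (assumed include path)

using ThriftBuffer = apache::thrift::transport::TMemoryBuffer;

// Deserialize a Thrift-encoded FileMetaData from a raw buffer, rejecting
// any single string field larger than the configured limit.
void DeserializeFileMetaData(const uint8_t* buf, uint32_t len,
                             parquet::format::FileMetaData* out) {
  auto transport =
      std::make_shared<ThriftBuffer>(const_cast<uint8_t*>(buf), len);
  apache::thrift::protocol::TCompactProtocolFactoryT<ThriftBuffer>
      tproto_factory;
  // Protect against memory bombs: with this PR the cap is 100MB, up from 10MB.
  tproto_factory.setStringSizeLimit(100 * 1000 * 1000);
  auto tproto = tproto_factory.getProtocol(transport);
  // Thrift throws a TProtocolException if any string exceeds the limit,
  // which is how the oversized "pandas metadata" blob failed before.
  out->read(tproto.get());
}
```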

The change under review:

```diff
@@ -362,7 +362,7 @@ inline void DeserializeThriftUnencryptedMsg(const uint8_t* buf, uint32_t* len,
       new ThriftBuffer(const_cast<uint8_t*>(buf), *len));
   apache::thrift::protocol::TCompactProtocolFactoryT<ThriftBuffer> tproto_factory;
   // Protect against CPU and memory bombs
-  tproto_factory.setStringSizeLimit(10 * 1000 * 1000);
+  tproto_factory.setStringSizeLimit(100 * 1000 * 1000);
```
A Contributor commented on the diff:

Could this be added easily to reader properties? If not, 100 MB seems like a good amount of buffer, but in case people really want to abuse it, it would be nice not to get follow-up bug reports.
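
To make the suggestion concrete, here is a hypothetical sketch (not part of this PR, and the property name is invented for illustration) of how such a knob could live on reader properties and be consulted at deserialization time:

```cpp
#include <cstdint>

// Hypothetical sketch only: this PR deliberately keeps the limit hard-coded.
class ReaderProperties {
 public:
  // Invented accessor names; a real knob would be named by the maintainers.
  void set_thrift_string_size_limit(int32_t limit) {
    thrift_string_size_limit_ = limit;
  }
  int32_t thrift_string_size_limit() const { return thrift_string_size_limit_; }

 private:
  int32_t thrift_string_size_limit_ = 100 * 1000 * 1000;  // this PR's default
};

// The deserializer would then read the cap from the properties instead of
// hard-coding it:
//   tproto_factory.setStringSizeLimit(properties.thrift_string_size_limit());
```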

A Member replied:

More generally, it turns out that there are other types of memory bombs possible with Parquet (for example using compression). So I'm not sure the limit is very useful anyway.
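
To illustrate the compression point, here is a hedged sketch (not Parquet's actual code; the 128MB cap is an invented number) of the kind of up-front check a reader can make against a decompression bomb, using the uncompressed_page_size field that the Thrift PageHeader declares:

```cpp
#include <cstdint>
#include <stdexcept>

// Invented cap, for illustration only.
constexpr int64_t kMaxUncompressedPageSize = 128LL * 1024 * 1024;

// A hostile file can pair a tiny compressed payload with a huge declared
// uncompressed size, tricking the reader into a massive allocation.
// Rejecting implausible sizes before allocating bounds the damage.
void CheckDeclaredPageSize(int32_t uncompressed_page_size) {
  if (uncompressed_page_size < 0 ||
      uncompressed_page_size > kMaxUncompressedPageSize) {
    throw std::runtime_error("implausible uncompressed page size in PageHeader");
  }
}
```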

The Member Author (wesm) replied:

I briefly took a look at this. It would take a lot of refactoring to be able to pass a reader property into this function, for little gain. If the problem arises, I think we can address it then.

@wesm (Member Author) commented May 5, 2020

+1

@wesm wesm closed this in bc283e3 May 5, 2020
@wesm wesm deleted the ARROW-8694 branch May 5, 2020 21:40
kszucs pushed a commit to kszucs/arrow that referenced this pull request May 13, 2020
ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

While it's not an ideal use case for Parquet, the 10MB limit for strings was causing a Thrift deserialization failure due to the large "pandas metadata" JSON blob written with the Schema when there are many columns. A 100MB limit should still catch "memory bombs" caused by nefarious input while allowing fairly wide data frames to be stored successfully.

Closes apache#7103 from wesm/ARROW-8694

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>