ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages #7103
Conversation
```diff
@@ -362,7 +362,7 @@ inline void DeserializeThriftUnencryptedMsg(const uint8_t* buf, uint32_t* len,
       new ThriftBuffer(const_cast<uint8_t*>(buf), *len));
   apache::thrift::protocol::TCompactProtocolFactoryT<ThriftBuffer> tproto_factory;
   // Protect against CPU and memory bombs
-  tproto_factory.setStringSizeLimit(10 * 1000 * 1000);
+  tproto_factory.setStringSizeLimit(100 * 1000 * 1000);
```
Could this be added easily to reader properties? If not, 100 MB seems like a good amount of buffer, but in case people really want to abuse it, it would be nice not to get follow-up bug reports.
More generally, it turns out that there are other types of memory bombs possible with Parquet (for example using compression), so I'm not sure the limit is very useful anyway.
I briefly took a look at this. It would take a lot of refactoring to pass a reader property into this function, for little gain. If the problem arises, I think we can address it then.
+1
ARROW-8694: [C++][Parquet] Relax string size limit when deserializing Thrift messages

While it's not an ideal use case for Parquet, the 10MB limit for strings was causing a Thrift deserialization failure due to the large "pandas metadata" JSON blob written with the Schema when there are many columns. A 100MB limit should still catch "memory bombs" caused by nefarious input while allowing pretty wide data frames to be stored successfully.

Closes apache#7103 from wesm/ARROW-8694

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>