ARROW-10426: [C++] Allow writing large strings to Parquet #8632
Perhaps I should add tests on the C++ side.
d347413 to dfb698e
@kou Is it normal that the MinGW tests enable Python?
dfb698e to dff7d33
Yes. It's normal.
@kou Well, the Python tests sometimes seem to time out on MinGW...
Umm, they may have a problem in finalization...
It would be nice to implement the safety checks for max-length strings if possible. Otherwise, most of the comments are questions to help me better understand some of the refactoring.
const auto large_array = ::arrow::ArrayFromJSON(large_type, json);
const auto narrow_array = ::arrow::ArrayFromJSON(narrow_type, json);

this->RoundTripSingleColumn(large_array, narrow_array,
Is this intended to be two different arrays? Same as above? Maybe add a comment on what you are testing here (the lack of a schema to read back)?
I'll add a comment indeed.
void PutBinaryArray(const ArrayType& array) {
  const int64_t total_bytes =
      array.value_offset(array.length()) - array.value_offset(0);
  PARQUET_THROW_NOT_OK(sink_.Reserve(total_bytes + array.length() * sizeof(uint32_t)));
Is a check for int32 overflow needed here?
I don't think so (these are int64_t calculations AFAICT). However, I should fix the XXX below :-)
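To make the size arithmetic in this exchange concrete, here is a minimal sketch, not the code from the PR. It assumes a plain offsets vector in place of an Arrow array, and `PlainEncodedSize` is a hypothetical helper name. It widens to 64 bits before doing any arithmetic, so the result is the same whether the offsets are 32-bit or 64-bit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Parquet or Arrow): computes how many bytes
// are needed to store the values described by an offsets buffer when each
// value is written as a 4-byte length prefix followed by its bytes.
template <typename OffsetType>
int64_t PlainEncodedSize(const std::vector<OffsetType>& offsets) {
  assert(!offsets.empty());  // an offsets buffer holds length + 1 entries
  const int64_t length = static_cast<int64_t>(offsets.size()) - 1;
  // Widen before subtracting so the arithmetic is done in 64 bits even when
  // OffsetType is int32_t.
  const int64_t total_bytes = static_cast<int64_t>(offsets.back()) -
                              static_cast<int64_t>(offsets.front());
  return total_bytes + length * static_cast<int64_t>(sizeof(uint32_t));
}
```

Calling it with `std::vector<int32_t>{0, 3, 3, 8}` covers three values totalling 8 bytes, plus three 4-byte prefixes.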
@@ -127,6 +129,21 @@ class PlainEncoder : public EncoderImpl, virtual public TypedEncoder<DType> {
  }

 protected:
  template <typename ArrayType>
I'm not very familiar with this code. Could you let me know what these changes are intended to do?
Put(const ::arrow::Array&) was hardwired for narrow binary arrays. We need to handle both narrow and large arrays, hence this small refactor.
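The shape of that refactor can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `ToyBinaryArray` is a hypothetical stand-in for Arrow's narrow (32-bit offsets) and large (64-bit offsets) binary arrays, and the sink is a plain `std::string`. The point is only that one templated body can serve both offset widths.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an Arrow binary array: a data buffer plus
// length + 1 offsets. OffsetType is int32_t for the narrow layout and
// int64_t for the large one.
template <typename OffsetType>
struct ToyBinaryArray {
  std::string data;
  std::vector<OffsetType> offsets;
  int64_t length() const { return static_cast<int64_t>(offsets.size()) - 1; }
};

// One templated body for both layouts: append each value to the sink as a
// 4-byte little-endian length prefix followed by the value's bytes.
template <typename ArrayType>
void PutBinaryArray(const ArrayType& array, std::string* sink) {
  for (int64_t i = 0; i < array.length(); ++i) {
    const auto idx = static_cast<std::size_t>(i);
    const auto start = array.offsets[idx];
    const auto len = static_cast<uint32_t>(array.offsets[idx + 1] - start);
    for (int shift = 0; shift < 32; shift += 8) {
      sink->push_back(static_cast<char>((len >> shift) & 0xFF));
    }
    sink->append(array.data, static_cast<std::size_t>(start), len);
  }
}
```

Instantiating `PutBinaryArray` with `ToyBinaryArray<int32_t>` and `ToyBinaryArray<int64_t>` holding the same values produces identical bytes in the sink.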
std::shared_ptr<Comparator> Comparator::Make(Type::type physical_type,
                                             SortOrder::type sort_order,
                                             int type_length) {
  if (SortOrder::SIGNED == sort_order) {
Is this just moving code?
Yes, so as to put internal implementation details in the anonymous namespace.
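As a generic illustration of that layout (not the actual Parquet code), the sketch below keeps the concrete classes in an anonymous namespace and exposes only a factory. `ToyComparator` and its `Less` method are hypothetical names, and the single `bool` parameter is a simplification of the real `Make` signature.

```cpp
#include <cstdint>
#include <memory>

// Public interface, as it might appear in a header. ToyComparator is a
// hypothetical stand-in, not the Parquet Comparator class.
class ToyComparator {
 public:
  virtual ~ToyComparator() = default;
  virtual bool Less(int32_t a, int32_t b) const = 0;
  static std::shared_ptr<ToyComparator> Make(bool is_signed);
};

namespace {

// Implementation details: not visible outside this translation unit.
class SignedComparator : public ToyComparator {
 public:
  bool Less(int32_t a, int32_t b) const override { return a < b; }
};

class UnsignedComparator : public ToyComparator {
 public:
  bool Less(int32_t a, int32_t b) const override {
    return static_cast<uint32_t>(a) < static_cast<uint32_t>(b);
  }
};

}  // namespace

// The factory is the only public entry point; it must be defined after the
// anonymous namespace so the concrete types are complete.
std::shared_ptr<ToyComparator> ToyComparator::Make(bool is_signed) {
  if (is_signed) return std::make_shared<SignedComparator>();
  return std::make_shared<UnsignedComparator>();
}
```

Callers only ever see `ToyComparator::Make`, so the concrete classes can be reorganized without touching the header.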
dff7d33 to 4dcff96
@emkornfield Any other concern?
cpp/src/parquet/encoding.cc
Outdated
      array.value_offset(array.length()) - array.value_offset(0);
  PARQUET_THROW_NOT_OK(sink_.Reserve(total_bytes + array.length() * sizeof(uint32_t)));

  constexpr size_t kMaxByteArraySize = std::numeric_limits<uint32_t>::max();
It's not clear from the spec, but I think this is expected to be signed: Java appears to use an int (a signed 32-bit integer).
I've changed the code to check for the signed maximum. It will be safer in any case.
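A check against the signed maximum could look roughly like this. It is a sketch only: `CheckedByteArrayLength` is a hypothetical helper, the exception type is an arbitrary choice for the example, and the merged code may be structured differently.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Illustrative limit: the largest length that also fits in a signed 32-bit
// integer, which is the more conservative reading discussed above.
constexpr int64_t kMaxByteArraySize = std::numeric_limits<int32_t>::max();

// Hypothetical helper: validates one value's length and returns it as the
// 4-byte length prefix, throwing if it cannot be represented.
uint32_t CheckedByteArrayLength(int64_t value_length) {
  if (value_length < 0 || value_length > kMaxByteArraySize) {
    throw std::length_error("value of " + std::to_string(value_length) +
                            " bytes does not fit in a byte array length prefix");
  }
  return static_cast<uint32_t>(value_length);
}
```

Because every length accepted under the signed limit also fits in the unsigned 4-byte prefix, the check is safe under either interpretation of the spec.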
One concern with the length check, otherwise LGTM.
Large strings are still read back as regular strings.
4dcff96 to 81c95c4
Large strings are still read back as regular strings.
Closes apache#8632 from pitrou/ARROW-10426-parquet-large-binary
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>