
ARROW-10426: [C++] Allow writing large strings to Parquet #8632

Closed
Wants to merge 5 commits from the ARROW-10426-parquet-large-binary branch

Conversation

pitrou
Member

@pitrou pitrou commented Nov 10, 2020

Large strings are still read back as regular strings.

@pitrou
Member Author

pitrou commented Nov 10, 2020

Perhaps I should add tests on the C++ side.


@pitrou pitrou force-pushed the ARROW-10426-parquet-large-binary branch 2 times, most recently from d347413 to dfb698e Compare November 12, 2020 09:50
@pitrou
Member Author

pitrou commented Nov 12, 2020

@kou Is it normal that the MinGW tests enable Python?

@pitrou pitrou force-pushed the ARROW-10426-parquet-large-binary branch from dfb698e to dff7d33 Compare November 12, 2020 14:32
@kou
Member

kou commented Nov 12, 2020

Yes. It's normal.

@pitrou
Member Author

pitrou commented Nov 12, 2020

@kou Well, the Python tests sometimes seem to time out on MinGW...

@kou
Member

kou commented Nov 12, 2020

Umm, they may have a problem in finalization...

Contributor

@emkornfield emkornfield left a comment


It would be nice to implement the safety checks for max-length strings if possible. Otherwise, most of the comments are questions to help me better understand some of the refactoring.

const auto large_array = ::arrow::ArrayFromJSON(large_type, json);
const auto narrow_array = ::arrow::ArrayFromJSON(narrow_type, json);

this->RoundTripSingleColumn(large_array, narrow_array,
Contributor


Is this intended to be two different arrays, same as above? Maybe add a comment on what you are testing here (lack of schema to read back?).

Member Author


I'll add a comment indeed.

void PutBinaryArray(const ArrayType& array) {
  const int64_t total_bytes =
      array.value_offset(array.length()) - array.value_offset(0);
  PARQUET_THROW_NOT_OK(sink_.Reserve(total_bytes + array.length() * sizeof(uint32_t)));
Contributor


Is a check for int32 overflow needed here?

Member Author


I don't think so (these are int64_t calculations AFAICT). However, I should fix the XXX below :-)
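To illustrate why 32-bit overflow is not a concern there, here is a minimal standalone sketch (no Arrow dependency; the function name and the plain offsets vector are hypothetical stand-ins for the array's offset buffer): the byte total and the per-value `uint32_t` length-prefix overhead are all computed in `int64_t`, so even a multi-gigabyte array produces a correct reserve size.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mimic of the reserve-size computation: offsets are int64_t,
// so the subtraction and the prefix-overhead sum never pass through a
// 32-bit intermediate.
int64_t ReserveSizeForPlainByteArrays(const std::vector<int64_t>& offsets) {
  const int64_t num_values = static_cast<int64_t>(offsets.size()) - 1;
  const int64_t total_bytes = offsets.back() - offsets.front();
  return total_bytes + num_values * static_cast<int64_t>(sizeof(uint32_t));
}
```

A single 3 GB value would overflow an `int32_t` byte count, but the `int64_t` computation handles it without truncation.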

cpp/src/parquet/encoding.cc (review thread marked outdated and resolved)
@@ -127,6 +129,21 @@ class PlainEncoder : public EncoderImpl, virtual public TypedEncoder<DType> {
}

protected:
template <typename ArrayType>
Contributor


I'm not very familiar with this code; could you let me know what these changes are intended to do?

Member Author


Put(const ::arrow::Array&) was hardwired for narrow binary arrays; we need to handle both narrow and large arrays, hence this small refactor.
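To show the shape of that refactor, here is a hedged standalone sketch (FakeBinaryArray and PutBinaryArray are hypothetical stand-ins, not Arrow's actual classes): the narrow and large array types differ only in offset width, so one templated helper compiles against both.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for arrow::BinaryArray / arrow::LargeBinaryArray:
// the only structural difference is the offset type (int32_t vs int64_t).
template <typename OffsetType>
struct FakeBinaryArray {
  std::string data;
  std::vector<OffsetType> offsets;
  int64_t length() const { return static_cast<int64_t>(offsets.size()) - 1; }
  std::string value(int64_t i) const {
    return data.substr(offsets[i], offsets[i + 1] - offsets[i]);
  }
};

// One templated helper replaces a put-loop hardwired to the narrow type;
// it instantiates for either offset width.
template <typename ArrayType>
int64_t PutBinaryArray(const ArrayType& array) {
  int64_t bytes_written = 0;
  for (int64_t i = 0; i < array.length(); ++i) {
    bytes_written += static_cast<int64_t>(array.value(i).size());
  }
  return bytes_written;
}
```

The caller then dispatches on the runtime array type and instantiates the template accordingly, which is the essence of the change described above.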

std::shared_ptr<Comparator> Comparator::Make(Type::type physical_type,
                                             SortOrder::type sort_order,
                                             int type_length) {
  if (SortOrder::SIGNED == sort_order) {
Contributor


Is this just moving code?

Member Author


Yes, so as to put internal implementation details in the anonymous namespace.

@pitrou pitrou force-pushed the ARROW-10426-parquet-large-binary branch from dff7d33 to 4dcff96 Compare November 16, 2020 13:53
@pitrou
Member Author

pitrou commented Nov 16, 2020

@emkornfield Any other concern?

array.value_offset(array.length()) - array.value_offset(0);
PARQUET_THROW_NOT_OK(sink_.Reserve(total_bytes + array.length() * sizeof(uint32_t)));

constexpr size_t kMaxByteArraySize = std::numeric_limits<uint32_t>::max();
Contributor


It's not clear from the spec, but I think this is expected to be signed:

Java appears to use an int (signed 32-bit integer).

Member Author


I've changed the code to check for the signed maximum. It will be safer in any case.
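As a standalone illustration of that check (names are hypothetical; this is not the actual encoder code): each value's length is validated against the signed 32-bit maximum before encoding, since Parquet's BYTE_ARRAY length prefix is conservatively treated as a signed int.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sketch of the safety check: cap each BYTE_ARRAY value at
// INT32_MAX rather than UINT32_MAX, matching the signed interpretation
// used by the Java implementation.
constexpr int64_t kMaxByteArraySize = std::numeric_limits<int32_t>::max();

// Returns true if a value of this length can be written as a BYTE_ARRAY.
bool ValueFitsInByteArray(int64_t value_length) {
  return value_length <= kMaxByteArraySize;
}
```

In the real encoder, a value failing such a check would raise an error instead of being silently truncated.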

@emkornfield
Contributor

One concern with the length check, otherwise LGTM.

@pitrou pitrou force-pushed the ARROW-10426-parquet-large-binary branch from 4dcff96 to 81c95c4 Compare November 23, 2020 13:14
@pitrou pitrou closed this in b4a0751 Nov 23, 2020
@pitrou pitrou deleted the ARROW-10426-parquet-large-binary branch November 23, 2020 13:54
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Large strings are still read back as regular strings.

Closes apache#8632 from pitrou/ARROW-10426-parquet-large-binary

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>