ARROW-8797: [C++] Read RecordBatch in a different endian #7507
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format?
cpp/src/arrow/ipc/read_write_test.cc
Outdated
@@ -1289,6 +1338,10 @@ INSTANTIATE_TEST_SUITE_P(StreamDecoderSmallChunksRoundTripTests,
INSTANTIATE_TEST_SUITE_P(StreamDecoderLargeChunksRoundTripTests,
                         TestStreamDecoderLargeChunks, BATCH_CASES());

// TODO(kiszk): Enable this test since it is not successful now
Now, I have intentionally commented out this test since this type of communication is not supported.
cpp/src/arrow/ipc/read_write_test.cc
Outdated
std::shared_ptr<Schema> schema_result;
DoSchemaRoundTrip(*src_batch.schema(), &dictionary_memo, &schema_result);
ASSERT_TRUE(expected_batch.schema()->Equals(*schema_result));
Should we always convert schema's endianness to native endian?
I think that we should keep the original schema's endianness for serialization.
Did we discuss this?
Thank you for your comment. You are right. We should keep the original schema until `ReadRecordBatch`. Then, if `LoadRecordBatchSubset` automatically converts the endianness based on the schema's endianness, it will also update the endian flag in the schema. I will move this assertion after `Do*RoundTrip` (a rough sketch of the reordering is below).
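For illustration only, a minimal sketch of that reordering. The helper names come from the surrounding fixture, but the exact signatures used here are assumptions, not the actual test code:

```cpp
// Hypothetical fragment inside the existing IPC test fixture (signatures assumed).
// Keep the round-tripped schema untouched here; it may still carry the
// producer's endianness.
std::shared_ptr<Schema> schema_result;
DoSchemaRoundTrip(*src_batch.schema(), &dictionary_memo, &schema_result);

// Run the full record-batch round trip; the loader converts the data to
// native endianness and updates the schema's endian flag while loading.
std::shared_ptr<RecordBatch> result_batch;
DoStandardRoundTrip(src_batch, &dictionary_memo, &result_batch);

// Only now should the schema compare equal to the expected (native-endian) one.
ASSERT_TRUE(expected_batch.schema()->Equals(*result_batch->schema()));
```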
cpp/src/arrow/ipc/read_write_test.cc
Outdated
 public:
  void SetUp() { IpcTestFixture::SetUp(); }
  void TearDown() { IpcTestFixture::TearDown(); }
};
I'm not sure how difficult it would be to implement this idea...
How about enhancing the existing `IpcTestFixture` instead of adding a new fixture?
diff --git a/cpp/src/arrow/ipc/read_write_test.cc b/cpp/src/arrow/ipc/read_write_test.cc
index 923670834..1e03472b8 100644
--- a/cpp/src/arrow/ipc/read_write_test.cc
+++ b/cpp/src/arrow/ipc/read_write_test.cc
@@ -397,6 +397,9 @@ class IpcTestFixture : public io::MemoryMapFixture, public ExtensionTypesMixin {
ASSERT_OK_AND_ASSIGN(result, DoLargeRoundTrip(batch, /*zero_data=*/true));
CheckReadResult(*result, batch);
+
+ ASSERT_OK_AND_ASSIGN(result, DoEndiannessRoundTrip(batch));
+ CheckReadResult(*result, batch);
}
void CheckRoundtrip(const std::shared_ptr<Array>& array,
IpcWriteOptions options = IpcWriteOptions::Defaults(),
`DoEndiannessRoundTrip()` will:
- create a non-native endianness record batch by swapping the given record batch's endianness
  - We need to implement this endianness conversion feature, but how difficult would it be...?
- run `DoStandardRoundTrip()` against the non-native endianness record batch
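A rough sketch of such a `DoEndiannessRoundTrip()`, under the assumption that a record-batch byte-swapping helper (called `SwapEndianness` here, which does not exist yet) is available; the `DoStandardRoundTrip()` signature is also assumed, not the fixture's real one:

```cpp
// Sketch only: SwapEndianness() is the hypothetical helper discussed above.
Result<std::shared_ptr<RecordBatch>> DoEndiannessRoundTrip(
    const RecordBatch& batch) {
  // 1. Build a record batch whose buffers (and schema endianness flag) use
  //    the non-native byte order.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> nonnative,
                        SwapEndianness(batch));
  // 2. Run the ordinary IPC round trip on it; on read, the data should come
  //    back in native endianness and compare equal to the original batch.
  return DoStandardRoundTrip(*nonnative, IpcWriteOptions::Defaults());
}
```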
BTW, should we always convert endianness in IPC? If users just relay data (A (little, generates data) -> B (big, does not touch data) -> C (little, processes data)), it will be inefficient.
Or should we convert endianness explicitly outside of IPC by providing `Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness)` or something?
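For what that could look like from a caller's point of view, here is a hypothetical usage sketch of the proposed (non-existent) API, with the signature taken verbatim from the suggestion above; the `Endianness::Little` value name is also an assumption:

```cpp
// Hypothetical: RecordBatch::EnsureEndianness() is only a proposal here, and
// the Endianness enum value names are assumed.
#include <memory>
#include "arrow/api.h"  // assumed umbrella header for RecordBatch, Result, Endianness

arrow::Result<std::shared_ptr<arrow::RecordBatch>> ProcessOnLittleEndian(
    const std::shared_ptr<arrow::RecordBatch>& relayed_batch) {
  // B (big-endian, relay only) would never call this and pays no conversion
  // cost; C (little-endian, consumer) converts only when it needs the data.
  return relayed_batch->EnsureEndianness(arrow::Endianness::Little);
}
```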
Good point. My current thought is to implement another version of `ArrayLoader` (its `LoadType` method), which is called from `LoadRecordBatchSubset`. The existing version does no endian conversion; the other one would perform the conversion. At this level, we know the type of each element, so I imagine it is easy to swap endianness. What do you think?
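As a standalone illustration of why knowing the element type makes this easy (this is not Arrow's `ArrayLoader`, just plain C++ showing the per-element byte swap once the fixed width is known):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Reverse the bytes of each fixed-width element in place. Once the loader
// knows the concrete type (and therefore sizeof(T)), this is all it takes
// for primitive buffers; nested types just recurse into their children.
template <typename T>
void SwapEndianness(std::vector<T>* values) {
  for (T& value : *values) {
    auto* bytes = reinterpret_cast<unsigned char*>(&value);
    std::reverse(bytes, bytes + sizeof(T));
  }
}

int main() {
  // A value written by a big-endian producer, read on a little-endian machine.
  std::vector<uint64_t> column = {0x0102030405060708ULL};
  SwapEndianness(&column);
  std::cout << std::hex << column[0] << std::endl;  // prints 807060504030201
  return 0;
}
```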
Adding a new API `Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness)` is a good idea to avoid unnecessary conversion in the above scenario. Technically, this could work correctly.
I have two questions:
- I imagine that this API would be explicitly called by a user. If so, is this change acceptable to users? Of course, most scenarios (communicating between platforms of the same endianness) work well without calling this new API.
- This new API needs to traverse all of the elements again if the data conversion is needed, which may lead to additional overhead.
What do you think? cc @pitrou
@kiszk I would suggest creating a LE and BE example corpus in apache/arrow-testing. You can use the integration test command line tools to create point-of-truth JSON files, and then use the JSON_TO_ARROW mode of the integration test executable to create IPC binary files that can be checked in.
@wesm Thank you for your suggestion. I will pursue the approach that you suggested. I will check the integration test command line tool and the integration test with the JSON_TO_ARROW mode.
@wesm Can I ask a question? I am writing a test case after creating a point-of-truth JSON file and generating two Arrow binary files for BE and LE. Is this approach what you suggested? I am writing a function: https://gist.github.com/kiszk/bccb2640a68b706049c1631ba9514eae
I was suggesting to generate BE and LE files to be put in this directory: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration
Then you can specify the respective directory of e.g. BE files to be read on an LE platform. I think it's better to stick with the integration test tools for this rather than writing new C++ code.
@wesm Thank you for your clarification. I thought that I would check in binary files and write C++ code to validate the results being read. Now, I realized the integration test
Yes, you need to use the
This commit can successfully run an integration test with primitive types like this.
After the PR apache/arrow-testing#41 is merged, I will update the
ci/scripts/integration_arrow.sh
Outdated
pip install -e $arrow_dir/dev/archery

archery integration --with-all --run-flight \
    --gold-dirs=$gold_dir_0_14_1 \
    --gold-dirs=$gold_dir_0_17_1 \

# TODO: support other languages
archery integration --with-cpp=1 --run-flight \
Since other languages (e.g. Java) cause integration test failures, a future separate PR will fix the failures in those languages.
This PR focuses on the C++ integration test.
You mean even the little-endian cases fail on Java?
You didn't answer my question above :-)
Also, instead of running `archery integration` a second time, you should add the new "gold dirs" to the command above, and add selective skips like in https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129
Oh, sorry, I should have mentioned you here.
The answer is yes: it fails on Java at least.
Sure, I will update `runner.py` instead of `integration_arrow.sh`.
Thank you very much @kiszk! I'll merge once CI is green.
The CI failure is unrelated (see https://issues.apache.org/jira/browse/ARROW-11668).
@pitrou I really appreciate your valuable feedback.
This PR creates a test to receive a RecordBatch from a different endian (e.g. receive a RecordBatch with a big-endian schema on a little-endian platform).

This PR changes:
1. Introduce an `Endianness` enum class to represent endianness
2. Add a new flag `endianness` to `arrow::schema` to represent the endianness of Array data in a `RecordBatch`
3. Eagerly convert non-native-endian data to native-endian data in a batch if `IpcReadOption.use_native_endian = true` (`true` by default)
4. Add golden arrow files for the integration test in both endians, and a test script

Regarding 3., other possible choices are as follows:
- Lazily convert non-native-endian data to native-endian data for a column in each RecordBatch.
  Pros: Avoid conversion for columns that will not be read.
  Cons: Complex management of the endianness of each column; inconsistency of endianness between the schema and the column data.
- Convert non-native-endian data to native-endian data when each element is read.
  Pros: Can keep the original schema without a batch conversion.
  Cons: 1) Each RecordBatch may need an additional field to show whether the endian conversion is necessary or not. 2) Need to update test cases to accept different endianness between expected and actual schemas.

Now, this PR uses the simplest approach (always see native endianness in schemas) that eagerly converts all of the columns in a batch.

TODO
- [x] Support converting the endianness of each element for primitive types
- [x] Support converting the endianness of each element for complex types
- [x] Support converting the endianness of each element for all types for streams
- [x] Add golden arrow files in both endians

For creating this PR, @kou helped me greatly (prototype of arrow::schema and teaching me about RecordBatch).

Closes apache#7507 from kiszk/ARROW-8797

Lead-authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
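To make the eager-conversion behaviour described above concrete, here is a reader-side sketch. The option name `use_native_endian` is taken from the PR description and may not match the final API; the surrounding calls are standard Arrow IPC reader usage:

```cpp
// Sketch only: `use_native_endian` is the option name used in the PR
// description above; treat it and its default as described, not as stable API.
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/ipc/reader.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"

arrow::Result<std::shared_ptr<arrow::RecordBatch>> ReadFirstBatch(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.use_native_endian = true;  // eager conversion on read (the default)

  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file, options));
  // With eager conversion, the returned batch and its schema report native
  // endianness even if the file was written on a machine with the other order.
  return reader->ReadRecordBatch(0);
}
```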