ARROW-8797: [C++] Read RecordBatch in a different endian #7507
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format?
cpp/src/arrow/ipc/read_write_test.cc
Outdated
@@ -1289,6 +1338,10 @@ INSTANTIATE_TEST_SUITE_P(StreamDecoderSmallChunksRoundTripTests,
INSTANTIATE_TEST_SUITE_P(StreamDecoderLargeChunksRoundTripTests,
                         TestStreamDecoderLargeChunks, BATCH_CASES());

// TODO(kiszk): Enable this test since it is not successful now
Now, I have intentionally commented out this test since this type of communication is not supported.
cpp/src/arrow/ipc/read_write_test.cc
Outdated
std::shared_ptr<Schema> schema_result;
DoSchemaRoundTrip(*src_batch.schema(), &dictionary_memo, &schema_result);
ASSERT_TRUE(expected_batch.schema()->Equals(*schema_result));
Should we always convert schema's endianness to native endian?
I think that we should keep the original schema's endianness for serialization.
Did we discuss this?
Thank you for your comment. You are right. We should keep the original schema until `ReadRecordBatch`. Then, if `LoadRecordBatchSubset` automatically converts the endianness based on the schema's endianness, it will also update the endian flag in the schema. I will move this assertion after `Do*RoundTrip` (a rough sketch of the reordering is below).
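For illustration only, a minimal sketch of that reordering. The helper names come from the surrounding fixture, but the exact signatures used here are assumptions, not the actual test code:

```cpp
// Hypothetical fragment inside the existing IPC test fixture (signatures assumed).
// Keep the round-tripped schema untouched here; it may still carry the
// producer's endianness.
std::shared_ptr<Schema> schema_result;
DoSchemaRoundTrip(*src_batch.schema(), &dictionary_memo, &schema_result);

// Run the full record-batch round trip; the loader converts the data to
// native endianness and updates the schema's endian flag while loading.
std::shared_ptr<RecordBatch> result_batch;
DoStandardRoundTrip(src_batch, &dictionary_memo, &result_batch);

// Only now should the schema compare equal to the expected (native-endian) one.
ASSERT_TRUE(expected_batch.schema()->Equals(*result_batch->schema()));
```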
cpp/src/arrow/ipc/read_write_test.cc
Outdated
 public:
  void SetUp() { IpcTestFixture::SetUp(); }
  void TearDown() { IpcTestFixture::TearDown(); }
};
I'm not sure how difficult it would be to implement this idea...
How about enhancing the existing `IpcTestFixture` instead of adding a new fixture?
diff --git a/cpp/src/arrow/ipc/read_write_test.cc b/cpp/src/arrow/ipc/read_write_test.cc
index 923670834..1e03472b8 100644
--- a/cpp/src/arrow/ipc/read_write_test.cc
+++ b/cpp/src/arrow/ipc/read_write_test.cc
@@ -397,6 +397,9 @@ class IpcTestFixture : public io::MemoryMapFixture, public ExtensionTypesMixin {
ASSERT_OK_AND_ASSIGN(result, DoLargeRoundTrip(batch, /*zero_data=*/true));
CheckReadResult(*result, batch);
+
+ ASSERT_OK_AND_ASSIGN(result, DoEndiannessRoundTrip(batch));
+ CheckReadResult(*result, batch);
}
void CheckRoundtrip(const std::shared_ptr<Array>& array,
IpcWriteOptions options = IpcWriteOptions::Defaults(),
`DoEndiannessRoundTrip()` will:
- create a non-native endianness record batch by swapping the given record batch's endianness
  - We need to implement this endianness conversion feature, but how difficult would it be...?
- run `DoStandardRoundTrip()` against the non-native endianness record batch
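A rough sketch of such a `DoEndiannessRoundTrip()`, under the assumption that a record-batch byte-swapping helper (called `SwapEndianness` here, which does not exist yet) is available; the `DoStandardRoundTrip()` signature is also assumed, not the fixture's real one:

```cpp
// Sketch only: SwapEndianness() is the hypothetical helper discussed above.
Result<std::shared_ptr<RecordBatch>> DoEndiannessRoundTrip(
    const RecordBatch& batch) {
  // 1. Build a record batch whose buffers (and schema endianness flag) use
  //    the non-native byte order.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> nonnative,
                        SwapEndianness(batch));
  // 2. Run the ordinary IPC round trip on it; on read, the data should come
  //    back in native endianness and compare equal to the original batch.
  return DoStandardRoundTrip(*nonnative, IpcWriteOptions::Defaults());
}
```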
BTW, should we always convert endianness in IPC? If users just relay data (A (little, generates data) -> B (big, does not touch data) -> C (little, processes data)), it will be inefficient.
Or should we convert endianness explicitly outside of IPC by providing `Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness)` or something?
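For what that could look like from a caller's point of view, here is a hypothetical usage sketch of the proposed (non-existent) API, with the signature taken verbatim from the suggestion above; the `Endianness::Little` value name is also an assumption:

```cpp
// Hypothetical: RecordBatch::EnsureEndianness() is only a proposal here, and
// the Endianness enum value names are assumed.
#include <memory>
#include "arrow/api.h"  // assumed umbrella header for RecordBatch, Result, Endianness

arrow::Result<std::shared_ptr<arrow::RecordBatch>> ProcessOnLittleEndian(
    const std::shared_ptr<arrow::RecordBatch>& relayed_batch) {
  // B (big-endian, relay only) would never call this and pays no conversion
  // cost; C (little-endian, consumer) converts only when it needs the data.
  return relayed_batch->EnsureEndianness(arrow::Endianness::Little);
}
```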
Good point. My current thought is to implement another version of `ArrayLoader` (its `LoadType` method), which is called from `LoadRecordBatchSubset`. The existing version does no endian conversion; the other one would perform the conversion. At this level, we know the type of each element, so I imagine it is easy to swap endianness. What do you think?
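As a standalone illustration of why knowing the element type makes this easy (this is not Arrow's `ArrayLoader`, just plain C++ showing the per-element byte swap once the fixed width is known):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Reverse the bytes of each fixed-width element in place. Once the loader
// knows the concrete type (and therefore sizeof(T)), this is all it takes
// for primitive buffers; nested types just recurse into their children.
template <typename T>
void SwapEndianness(std::vector<T>* values) {
  for (T& value : *values) {
    auto* bytes = reinterpret_cast<unsigned char*>(&value);
    std::reverse(bytes, bytes + sizeof(T));
  }
}

int main() {
  // A value written by a big-endian producer, read on a little-endian machine.
  std::vector<uint64_t> column = {0x0102030405060708ULL};
  SwapEndianness(&column);
  std::cout << std::hex << column[0] << std::endl;  // prints 807060504030201
  return 0;
}
```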
Adding a new API `Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness)` is a good idea to avoid unnecessary conversion in the above scenario. Technically, this could work correctly.
I have two questions:
- I imagine that this API would be explicitly called by a user. If so, is this change acceptable to users? Of course, most scenarios (communicating between platforms of the same endianness) work well without calling this new API.
- This new API needs to traverse all of the elements again if the data conversion is needed, which may lead to additional overhead.
What do you think? cc @pitrou
@kiszk I would suggest creating a LE and BE example corpus in apache/arrow-testing. You can use the integration test command line tools to create point-of-truth JSON files, and then use the JSON_TO_ARROW mode of the integration test executable to create IPC binary files that can be checked in.
@wesm Thank you for your suggestion. I will pursue the approach that you suggested. I will check the integration test command line tool and the integration test with the JSON_TO_ARROW mode.
@wesm Can I ask a question? I am writing a test case after creating a point-of-truth JSON file and generating two Arrow binary files for BE and LE. Is this approach what you suggested? I am writing a function: https://gist.github.com/kiszk/bccb2640a68b706049c1631ba9514eae
I was suggesting to generate BE and LE files to be put in this directory: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration
Then you can specify the respective directory of e.g. BE files to be read on an LE platform. I think it's better to stick with the integration test tools for this rather than writing new C++ code.
@wesm Thank you for your clarification. I thought that I would check in binary files and write C++ code to validate the results being read. Now, I realized the integration test
Yes, you need to use the
This commit can successfully run an integration test with primitive types like this.
After the PR apache/arrow-testing#41 is merged, I will update the
ci/scripts/integration_arrow.sh
Outdated
pip install -e $arrow_dir/dev/archery

archery integration --with-all --run-flight \
    --gold-dirs=$gold_dir_0_14_1 \
    --gold-dirs=$gold_dir_0_17_1 \

# TODO: support other languages
archery integration --with-cpp=1 --run-flight \
Since other languages (e.g. Java) cause integration test failures, a future separate PR will fix the failures in those languages.
This PR focuses on the C++ integration test.
You mean even the little-endian cases fail on Java?
You didn't answer my question above :-)
Also, instead of running `archery integration` a second time, you should add the new "gold dirs" to the command above, and add selective skips like in https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129
Oh, sorry, I should have mentioned you here.
The answer is yes: it fails on Java at least.
Sure, I will update `runner.py` instead of `integration_arrow.sh`.
Thank you very much @kiszk! I'll merge once CI is green.
The CI failure is unrelated (see https://issues.apache.org/jira/browse/ARROW-11668).
@pitrou I really appreciate your valuable feedback.
This PR creates a test to receive a RecordBatch from a different endian (e.g. receive a RecordBatch with a big-endian schema on a little-endian platform).

This PR changes:
1. Introduce an `Endianness` enum class to represent endianness
2. Add a new flag `endianness` to `arrow::schema` to represent the endianness of Array data in a `RecordBatch`
3. Eagerly convert non-native-endian data to native-endian data in a batch if `IpcReadOption.use_native_endian = true` (`true` by default)
4. Add golden arrow files for the integration test in both endians, and a test script

Regarding 3., other possible choices are as follows:
- Lazily convert non-native-endian data to native-endian data for a column in each RecordBatch.
  Pros: Avoid conversion for columns that will not be read.
  Cons: Complex management of the endianness of each column; inconsistency of endianness between the schema and the column data.
- Convert non-native-endian data to native-endian data when each element is read.
  Pros: Can keep the original schema without a batch conversion.
  Cons: 1) Each RecordBatch may need an additional field to show whether the endian conversion is necessary or not. 2) Need to update test cases to accept different endianness between expected and actual schemas.

Now, this PR uses the simplest approach (always see native endianness in schemas) that eagerly converts all of the columns in a batch.

TODO
- [x] Support converting the endianness of each element for primitive types
- [x] Support converting the endianness of each element for complex types
- [x] Support converting the endianness of each element for all types for streams
- [x] Add golden arrow files in both endians

For creating this PR, @kou helped me greatly (prototype of arrow::schema and teaching me about RecordBatch).

Closes apache#7507 from kiszk/ARROW-8797

Lead-authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
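To make the eager-conversion behaviour described above concrete, here is a reader-side sketch. The option name `use_native_endian` is taken from the PR description and may not match the final API; the surrounding calls are standard Arrow IPC reader usage:

```cpp
// Sketch only: `use_native_endian` is the option name used in the PR
// description above; treat it and its default as described, not as stable API.
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/ipc/reader.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"

arrow::Result<std::shared_ptr<arrow::RecordBatch>> ReadFirstBatch(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  auto options = arrow::ipc::IpcReadOptions::Defaults();
  options.use_native_endian = true;  // eager conversion on read (the default)

  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file, options));
  // With eager conversion, the returned batch and its schema report native
  // endianness even if the file was written on a machine with the other order.
  return reader->ReadRecordBatch(0);
}
```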