
ARROW-8797: [C++] Read RecordBatch in a different endian #7507

Closed
wants to merge 63 commits

Conversation

kiszk
Member

@kiszk kiszk commented Jun 21, 2020

This PR creates a test for receiving a RecordBatch in a different endianness (e.g. receiving a RecordBatch with a big-endian schema on a little-endian platform).

This PR makes the following changes:

  1. Introduce an Endianness enum class to represent endianness.
  2. Add a new endianness flag to arrow::schema to represent the endianness of the Arrays in a RecordBatch.
  3. Eagerly convert non-native-endian data to native-endian data in a batch if IpcReadOption.use_native_endian = true (the default).
  4. Add golden Arrow files for the integration test in both endiannesses, plus a test script.

Regarding 3., other possible choices are as follows:

  • Lazily convert non-native-endian data to native-endian data per column in each RecordBatch.
    Pros: Avoids conversion for columns that will never be read.
    Cons: Complex management of the endianness of each column; inconsistent endianness between the schema and the column data.
  • Convert non-native-endian data to native-endian data when each element is read.
    Pros: Keeps the original schema without a batch-wide conversion.
    Cons: 1) Each RecordBatch may need an additional field indicating whether endianness conversion is necessary. 2) Test cases would need to accept differing endianness between the expected and actual schemas.

For now, this PR takes the simplest approach (consumers always see native endianness in schemas) and eagerly converts all of the columns in a batch.
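To illustrate the eager approach outside of Arrow, here is a minimal standalone C++ sketch. The Endianness enum and SwapIfNeeded helper below are illustrative stand-ins, not the actual Arrow classes: every value in a column is byte-swapped once at read time when the source endianness differs from the native one, so downstream code always sees native-endian data.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

enum class Endianness { Little, Big };

// Detect the endianness of the machine we are running on.
Endianness NativeEndianness() {
  const uint16_t probe = 0x0102;
  uint8_t first_byte;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 0x01 ? Endianness::Big : Endianness::Little;
}

// Byte-swap a single 32-bit value.
uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Eagerly convert a whole column of 32-bit values to native endianness:
// swap everything once at read time so later readers never check again.
void SwapIfNeeded(std::vector<uint32_t>* column, Endianness source) {
  if (source == NativeEndianness()) return;  // already native, nothing to do
  for (auto& value : *column) value = ByteSwap32(value);
}

int main() {
  // Pretend this buffer arrived from a big-endian producer.
  std::vector<uint32_t> column = {0x01000000u, 0x02000000u, 0x03000000u};
  SwapIfNeeded(&column, Endianness::Big);
  for (auto v : column) std::cout << v << " ";  // prints "1 2 3" on little-endian
  std::cout << std::endl;
  return 0;
}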

TODO

  • Support endianness conversion of each element for primitive types
  • Support endianness conversion of each element for complex types
  • Support endianness conversion of each element for all types in the stream format
  • Add golden Arrow files in both endiannesses

In creating this PR, @kou helped me greatly (prototyping the arrow::schema changes and teaching me about RecordBatch).

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@kiszk kiszk changed the title [WIP] ARROW-8797: [C++] Create test to receive RecordBatch for different endian ARROW-8797: [WIP] [C++] Create test to receive RecordBatch for different endian Jun 21, 2020
@kiszk kiszk changed the title ARROW-8797: [WIP] [C++] Create test to receive RecordBatch for different endian ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian Jun 21, 2020
@kiszk
Member Author

kiszk commented Jun 21, 2020

Comments are appreciated, in particular on arrow::schema and on generating a RecordBatch with a non-native-endian representation.

cc @pitrou @kou


@@ -1289,6 +1338,10 @@ INSTANTIATE_TEST_SUITE_P(StreamDecoderSmallChunksRoundTripTests,
INSTANTIATE_TEST_SUITE_P(StreamDecoderLargeChunksRoundTripTests,
TestStreamDecoderLargeChunks, BATCH_CASES());

// TODO(kiszk): Enable this test since it is not successful now
Member Author

For now, I intentionally comment out this test since this type of communication is not yet supported.


std::shared_ptr<Schema> schema_result;
DoSchemaRoundTrip(*src_batch.schema(), &dictionary_memo, &schema_result);
ASSERT_TRUE(expected_batch.schema()->Equals(*schema_result));
Member

Should we always convert the schema's endianness to native endianness?
I think we should keep the original schema's endianness for serialization.
Did we discuss this?

Member Author

Thank you for your comment. You are right: we should keep the original schema until ReadRecordBatch.
Then, if LoadRecordBatchSubset automatically converts the endianness based on the schema's endianness flag, it will also update that flag in the schema.

I will move this assertion to after Do*RoundTrip.

@kiszk
Member Author

kiszk commented Jun 23, 2020

Are there any comments on this approach to preparing test cases across different endiannesses? cc @pitrou @wesm
If not, I will prepare other tests (currently disabled) using this approach.

public:
void SetUp() { IpcTestFixture::SetUp(); }
void TearDown() { IpcTestFixture::TearDown(); }
};
Member

I'm not sure how difficult it would be to implement this idea...
How about enhancing the existing IpcTestFixture instead of adding a new fixture?

diff --git a/cpp/src/arrow/ipc/read_write_test.cc b/cpp/src/arrow/ipc/read_write_test.cc
index 923670834..1e03472b8 100644
--- a/cpp/src/arrow/ipc/read_write_test.cc
+++ b/cpp/src/arrow/ipc/read_write_test.cc
@@ -397,6 +397,9 @@ class IpcTestFixture : public io::MemoryMapFixture, public ExtensionTypesMixin {
 
     ASSERT_OK_AND_ASSIGN(result, DoLargeRoundTrip(batch, /*zero_data=*/true));
     CheckReadResult(*result, batch);
+
+    ASSERT_OK_AND_ASSIGN(result, DoEndiannessRoundTrip(batch));
+    CheckReadResult(*result, batch);
   }
   void CheckRoundtrip(const std::shared_ptr<Array>& array,
                       IpcWriteOptions options = IpcWriteOptions::Defaults(),

DoEndiannessRoundTrip() will:

  • create a non-native-endianness record batch by swapping the given record batch's endianness
    • We would need to implement this endianness conversion feature, but how difficult would it be...?
  • run DoStandardRoundTrip() against the non-native-endianness record batch (a rough sketch follows this list)
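A rough sketch of what DoEndiannessRoundTrip() could look like; SwapEndianness() is a hypothetical helper that would copy the batch with all buffers byte-swapped and the schema's endianness flag flipped, and the DoStandardRoundTrip() call shape below is only assumed for illustration:

// Sketch only, not a working implementation. SwapEndianness() is hypothetical
// and the DoStandardRoundTrip() signature is assumed for illustration.
Result<std::shared_ptr<RecordBatch>> DoEndiannessRoundTrip(
    const std::shared_ptr<RecordBatch>& batch) {
  // 1. Create a non-native-endian copy of the input batch.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> swapped, SwapEndianness(*batch));
  // 2. Round-trip the swapped batch through IPC. The reader should convert the
  //    data back to native endianness, so the result can be compared against
  //    the original batch with CheckReadResult().
  return DoStandardRoundTrip(swapped);
}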

BTW, should we always convert endianness in IPC? If users just relay data (A (little, generates data) -> B (big, doesn't touch data) -> C (little, processes data)), it will be inefficient.
Or should we convert endianness explicitly outside of IPC by providing Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness) or something similar?

Member Author

Good point. My current thinking is to implement another version of ArrayLoader (the LoadType method), which is called from LoadRecordBatchSubset: the existing version does no endianness conversion, and the new one performs the conversion. At this level we know the type of each element, so I imagine it is easy to swap the endianness. What do you think?

Adding a new API Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness) is a good idea to avoid unnecessary conversion in the scenario above. Technically, this could work correctly.
I have two questions.

  1. I imagine this API would be called explicitly by users. If so, is this change acceptable to users? Of course, most scenarios (communication between platforms of the same endianness) work well without calling this new API.
  2. This new API needs to traverse all of the elements again if data conversion is required, which may add overhead.

What do you think? cc @pitrou
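For reference, a hypothetical usage sketch of the proposed API in the relay scenario above (RecordBatch::EnsureEndianness() does not exist yet; the call follows the signature proposed above):

// Hypothetical usage only: EnsureEndianness() is the API proposed above and
// does not exist yet. Only the final consumer C pays for the conversion.
Status ProcessOnLittleEndianConsumer(const std::shared_ptr<RecordBatch>& received) {
  // The batch may still be big-endian because node B relayed it untouched.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> native,
                        received->EnsureEndianness(Endianness::Little));
  // ... process `native`, which is now native-endian on this consumer ...
  return Status::OK();
}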

@kiszk kiszk marked this pull request as draft June 27, 2020 16:44
@kiszk kiszk changed the title ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian ARROW-8797: [C++] Create test to receive RecordBatch for different endian Jun 27, 2020
@wesm
Member

wesm commented Jul 1, 2020

@kiszk I would suggest creating a LE and BE example corpus in apache/arrow-testing. You can use the integration test command-line tools to create point-of-truth JSON files, and then use the JSON_TO_ARROW mode of the integration test executable to create IPC binary files that can be checked in.

@kiszk
Member Author

kiszk commented Jul 1, 2020

@wesm Thank you for your suggestion. I will pursue the approach you suggested. I will check the integration test command-line tools and the JSON_TO_ARROW mode of the integration test.

@kiszk
Member Author

kiszk commented Jul 12, 2020

@wesm Can I ask a question? I am writing a test case after creating a point-of-truth JSON file and generating two Arrow binary files for BE and LE. Is this approach what you suggested? I am writing a function MakePrimitiveBatch() that imports values from the point-of-truth JSON file.

https://gist.github.com/kiszk/bccb2640a68b706049c1631ba9514eae

@wesm
Member

wesm commented Jul 12, 2020

I was suggesting generating BE and LE files to be put in this directory:

https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration

Then you can specify the respective directory of, e.g., BE files to be read on an LE platform.

I think it's better to stick with the integration test tools for this rather than writing new C++ code.

@kiszk
Member Author

kiszk commented Jul 13, 2020

@wesm Thank you for the clarification. I had thought that I would check in binary files and write C++ code to validate the results being read.

Now I realize that the integration test archery integration ... will invoke arrow-json-integration-test --integration --mode VALIDATE --json ... --arrow .... I will stick with `archery integration ...`.

@wesm
Member

wesm commented Jul 13, 2020

Yes, you need to use the --gold-dirs argument.

@kiszk kiszk changed the title ARROW-8797: [C++] Create test to receive RecordBatch for different endian ARROW-8797: [C++] Read RecordBatch in a different endian Jul 19, 2020
@kiszk
Member Author

kiszk commented Jul 19, 2020

This commit succeeds in running an integration test with primitive types, like this:

(on BE platform) > ../../../build/debug/arrow-json-integration-test --integration --json int32_primitive.json --arrow int32_primitive_le.arrow  --mode=VALIDATE

@kiszk
Member Author

kiszk commented Jul 26, 2020

After apache/arrow-testing#41 is merged, I will update integration_arrow.sh.


pip install -e $arrow_dir/dev/archery

archery integration --with-all --run-flight \
--gold-dirs=$gold_dir_0_14_1 \
--gold-dirs=$gold_dir_0_17_1 \

# TODO: support other languages
archery integration --with-cpp=1 --run-flight \
Member Author

Since other languages (e.g. Java) fail the integration tests, a future separate PR will fix those failures.
This PR focuses on the C++ integration test.

Member

You mean even the little-endian cases fail on Java?

Member

You didn't answer my question above :-)

Also, instead of running archery integration a second time, you should add the new "gold dirs" to the command above, and add selective skips like in https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129

Member Author

Oh, sorry, I should have mentioned you here.

The answer is yes; it fails on Java at least.

Sure, I will update runner.py instead of integration_arrow.sh.

@kiszk kiszk marked this pull request as ready for review August 1, 2020 05:09
@pitrou
Member

pitrou commented Feb 17, 2021

Thank you very much, @kiszk! I'll merge once CI is green.

@pitrou
Member

pitrou commented Feb 17, 2021

The CI failure is unrelated (see https://issues.apache.org/jira/browse/ARROW-11668).

@pitrou pitrou closed this in 858f45e Feb 17, 2021
@kiszk
Member Author

kiszk commented Feb 17, 2021

@pitrou I greatly appreciate your valuable feedback.

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021