
ARROW-363: [Java/C++] integration testing harness, initial integration tests #211

Closed
wants to merge 15 commits

Conversation

@wesm (Member) commented Nov 21, 2016

This also includes format reconciliation as discussed in ARROW-384.

@wesm (Member Author) commented Nov 21, 2016

@julienledem alright, I found a couple of initial issues in the file implementation.

  1. In C++, we're writing the record batch metadata after the buffers have been written to the file. In Java, the metadata is written first: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java#L104. This is related to the next point:

  2. We've been writing the serialized IPC metadata size after the flatbuffer bytes: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L320. One benefit of this in an IPC context is that if you have a shared memory offset to the end of the record batch, and the metadata size is the last 4 bytes of the blob, then you can read the metadata, then the record batch.

So here is what things look like right now on the C++ side:

<buffers> <metadata> <metadata size: int32>

and on the java side:

<metadata> <buffers>

In Java, the metadata size is encoded in the Block component in the footer: https://github.com/apache/arrow/blob/master/format/File.fbs#L36

There are a couple of ways forward:

  • Option A: Java adopts what is currently being done in C++
  • Option B: C++ drops the metadata length suffix and conforms to the current Java layout, presuming that in a shared-memory IPC setting the Block metadata (file/shared-memory offset, metadata size, and body size) is obtained from some other place

The reason things are the way they are right now is so that you can reconstruct a record batch from shared memory given only the offset to the end of the payload (see the sketch below). Since we don't have a rigid model for handling IPC metadata (i.e. the Block struct used in the file footer), I don't feel very strongly about this; I would be fine doing what Java is doing.
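
To make the trade-off concrete, here is a minimal sketch (a hypothetical helper, not the actual Arrow C++ API) of the end-offset read pattern that the current C++ layout enables:

```cpp
#include <cstdint>
#include <cstring>

// Minimal sketch, not the Arrow C++ API: given the base of a shared memory
// region and the offset to the END of a record batch blob laid out as
// <buffers> <metadata> <metadata size: int32>, locate the metadata.
const uint8_t* LocateMetadata(const uint8_t* shm_base, int64_t end_offset,
                              int32_t* metadata_size) {
  // The metadata size occupies the last 4 bytes of the blob
  std::memcpy(metadata_size, shm_base + end_offset - sizeof(int32_t),
              sizeof(int32_t));
  // The flatbuffer metadata immediately precedes the size suffix; the
  // buffers precede the metadata
  return shm_base + end_offset - sizeof(int32_t) - *metadata_size;
}
```

Under Option B this suffix goes away, and the (offset, metadata size, body size) triple has to travel out of band, e.g. as a Block.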

@wesm (Member Author) commented Nov 21, 2016

cc @emkornfield

@wesm (Member Author) commented Nov 21, 2016

@julienledem another thing is the record batch metadata itself, which currently encodes absolute offsets on the C++ side.

I believe this is one of the reasons the metadata is being written after the buffers (we need to know the absolute positions of everything before writing out the flatbuffer).

The problem with absolute offsets is that the record batches are not relocatable (the metadata would have to be rewritten). This seems like a flaw -- I'm going to change the C++ to use relative offsets and make the record batch layout conform to the Java implementation as it is right now, so we can proceed with the integration tests.
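
A minimal sketch (illustrative types, not the actual Arrow C++ metadata structures) of why relative offsets make a batch relocatable:

```cpp
#include <cstdint>

// Illustrative only; not the actual Arrow C++ metadata type.
struct BufferLocation {
  int64_t offset;  // relative to the start of the record batch body
  int64_t length;
};

// With relative offsets, resolving a buffer is just base + offset, so the
// same metadata stays valid wherever the body bytes land (file, socket,
// shared memory). With absolute offsets, moving the body means rewriting
// every offset in the flatbuffer metadata.
const uint8_t* ResolveBuffer(const uint8_t* body_base,
                             const BufferLocation& loc) {
  return body_base + loc.offset;
}
```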

@julienledem (Member) commented
@wesm sounds good.
I have created a JIRA to discuss further if needed: https://issues.apache.org/jira/browse/ARROW-384
I don't see why we couldn't add the metadata size in there. Thank you.

@wesm (Member Author) commented Nov 24, 2016

@julienledem I refactored to account for our discussion in ARROW-384. I left a bunch of notes in the format/ documents in this PR; could you take a look when you can? One open item is padding / alignment generally. I have 64-byte alignment in there now, but I think 8-byte alignment is sufficient for a minimal format (for SIMD, alignment is marginally important, but at the least things have to be padded).
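
For reference, a minimal sketch of the padding arithmetic in question (the standard round-up-to-a-power-of-two trick, not code from this PR):

```cpp
#include <cstdint>

// Round a buffer length up to the next multiple of `alignment`, which must
// be a power of two (8 or 64 in the discussion above).
constexpr int64_t PaddedLength(int64_t length, int64_t alignment) {
  return (length + alignment - 1) & ~(alignment - 1);
}

static_assert(PaddedLength(100, 8) == 104, "8-byte alignment");
static_assert(PaddedLength(100, 64) == 128, "64-byte alignment");
```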

@wesm (Member Author) commented Nov 27, 2016

@julienledem we'll need to make the corresponding changes in the Java format (adding the metadata length prefix to record batches in the file) -- how do you want to proceed?
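
A minimal write-side sketch of the framing this implies (a hypothetical helper, not the actual Java or C++ writer code; assumes a little-endian host, as flatbuffers does on most platforms):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not the Arrow API: frame one record batch as
// <metadata size: int32> <metadata> <body>.
void WriteFramedBatch(std::vector<uint8_t>* out,
                      const std::vector<uint8_t>& metadata,
                      const std::vector<uint8_t>& body) {
  const int32_t metadata_size = static_cast<int32_t>(metadata.size());
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&metadata_size);
  out->insert(out->end(), p, p + sizeof(int32_t));  // length prefix
  out->insert(out->end(), metadata.begin(), metadata.end());
  out->insert(out->end(), body.begin(), body.end());
}
```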

@julienledem (Member) commented
This LGTM.
Do we want to increment MetadataVersion, since the change to the file format is incompatible?
Something like V0_2SNAPSHOT?

@wesm (Member Author) commented Nov 28, 2016

Sure, sounds good

wesm changed the title from "ARROW-363: WIP: Java/C++ integration testing harness, initial integration tests" to "ARROW-363: Java/C++ integration testing harness, initial integration tests" on Nov 28, 2016
@wesm (Member Author) commented Nov 28, 2016

Finally, I have a passing minimal integration test:

```
$ python integration_test.py
-- Java producing, C++ consuming
Testing with /home/wesm/code/arrow/integration/data/simple.json
-- C++ producing, Java consuming
Testing with /home/wesm/code/arrow/integration/data/simple.json
```

I suggest merging this and proceeding to write more integration tests in a new patch.

@wesm (Member Author) commented Nov 28, 2016

@julienledem this includes some Java changes; if you wouldn't mind taking a look, thanks!

@wesm (Member Author) commented Nov 28, 2016

Just incremented the metadata version. I'm using V1 and V2 for now, to avoid having to make code changes at release time.

@wesm (Member Author) commented Nov 28, 2016

I added a broken string data test case. It fails in C++ with:

```
Breakpoint 1, arrow::io::BufferReader::Seek (this=0x7fffffff67d0, position=112)
    at /home/wesm/code/arrow/cpp/src/arrow/io/memory.cc:310
310	    return Status::IOError("position out of bounds");
(gdb) l
305	  return Status::OK();
306	}
307	
308	Status BufferReader::Seek(int64_t position) {
309	  if (position < 0 || position >= size_) {
310	    return Status::IOError("position out of bounds");
311	  }
312	
313	  position_ = position;
314	  return Status::OK();
(gdb) where
#0  arrow::io::BufferReader::Seek (this=0x7fffffff67d0, position=112) at /home/wesm/code/arrow/cpp/src/arrow/io/memory.cc:310
#1  0x00007ffff7bc8413 in non-virtual thunk to arrow::io::BufferReader::Seek(long) ()
    at /home/wesm/code/arrow/cpp/src/arrow/io/memory.cc:308
#2  0x00007ffff7bc6956 in arrow::io::ReadableFileInterface::ReadAt (this=0x7fffffff67d0, position=112, nbytes=0, 
    out=0x7fffffff64c0) at /home/wesm/code/arrow/cpp/src/arrow/io/interfaces.cc:43
#3  0x00007ffff791e6a8 in arrow::ipc::RecordBatchReader::GetBuffer (this=0x7fffffff66a8, buffer_index=6, out=0x7fffffff64c0)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:405
#4  0x00007ffff791d251 in arrow::ipc::RecordBatchReader::NextArray (this=0x7fffffff66a8, field=0x707d20, 
    max_recursion_depth=64, out=0x708070) at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:359
#5  0x00007ffff7919ede in arrow::ipc::RecordBatchReader::Read (this=0x7fffffff66a8, out=0x7fffffff6b98)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:310
#6  0x00007ffff7918382 in arrow::ipc::ReadRecordBatch (metadata=..., schema=..., max_recursion_depth=64, file=0x7fffffff67d0, 
    out=0x7fffffff6b98) at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:448
#7  0x00007ffff7918315 in arrow::ipc::ReadRecordBatch (metadata=..., schema=..., file=0x7fffffff67d0, out=0x7fffffff6b98)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:441
#8  0x00007ffff792a195 in arrow::ipc::FileReader::GetRecordBatch (this=0x6f65f0, i=0, batch=0x7fffffff6b98)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/file.cc:213
#9  0x0000000000436c30 in arrow::ValidateArrowVsJson (arrow_path=..., json_path=...)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/json-integration-test.cc:169
#10 0x0000000000434d67 in arrow::RunCommand (json_path=..., arrow_path=..., command=...)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/json-integration-test.cc:208
#11 0x0000000000438ceb in main (argc=1, argv=0x7fffffff7378)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/json-integration-test.cc:376
(gdb) q
```
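
One plausible reading of the trace (an assumption, not a confirmed diagnosis): frame #2 shows ReadAt being called with position=112 and nbytes=0, i.e. a zero-length read at what may be exactly the end of the buffer, which the position >= size_ check in Seek rejects. A sketch of a relaxed check, based on the Seek listing above, that would permit seeking to end-of-buffer:

```cpp
// Sketch only (derived from the gdb listing above); whether this is the
// right fix for the string test case is an open question.
Status BufferReader::Seek(int64_t position) {
  // Allow position == size_ so that zero-length reads at end-of-buffer
  // succeed (was: position >= size_).
  if (position < 0 || position > size_) {
    return Status::IOError("position out of bounds");
  }
  position_ = position;
  return Status::OK();
}
```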

Let's address this and the other data type tests in a follow-up JIRA.

@wesm (Member Author) commented Nov 29, 2016

+1, we can continue to iterate on fine details while writing more tests.

wesm changed the title from "ARROW-363: Java/C++ integration testing harness, initial integration tests" to "ARROW-363: [Java/C++] integration testing harness, initial integration tests" on Nov 29, 2016
asfgit closed this in e3c167b on Nov 29, 2016
wesm deleted the ARROW-363 branch on November 29, 2016
wesm pushed a commit to wesm/arrow that referenced this pull request on Sep 2, 2018

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#211 from xhochy/PARQUET-819 and squashes the following commits:

4d21407 [Uwe L. Korn] PARQUET-819: Don't try to install no longer existing arrow/utils.h