
ARROW-363: [Java/C++] integration testing harness, initial integration tests #211

Closed
wants to merge 15 commits

Conversation

@wesm (Member) commented Nov 21, 2016

This also includes format reconciliation as discussed in ARROW-384.

@wesm (Member Author) commented Nov 21, 2016

@julienledem alright, I found a couple of initial issues in the file implementation.

  1. In C++, we're writing the record batch metadata after the buffers have been written to the file. In Java, the metadata is written first: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java#L104. This is related to the next point:

  2. We've been writing the serialized IPC metadata size after the flatbuffer bytes: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L320. One benefit of this in an IPC context is that if you have a shared memory offset to the end of the record batch, and the metadata size is the last 4 bytes of the blob, then you can read the metadata, then the record batch.

So here is what things look like right now on the C++ side:

<buffers> <metadata> <metadata size: int32>

and on the java side:

<metadata> <buffers>

In Java, the metadata size is encoded in the Block component in the footer: https://github.com/apache/arrow/blob/master/format/File.fbs#L36

There are a couple of ways forward:

  • Option A: Java adopts what is currently being done in C++
  • Option B: C++ drops the metadata length suffix and conforms to the current Java layout, presuming that in a shared-memory IPC setting the Block metadata (file/shared-memory offset, metadata size, and body size) is obtained from some other place

The reason things are the way they are right now is so that you can reconstruct a record batch from shared memory given only the offset to the end of the payload (see the sketch below). Since we don't have a rigid model for handling IPC metadata (i.e. the Block struct used in the file footer), I don't feel very strongly about this; I would be fine doing what Java is doing.
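
To make the trade-off concrete, here is a minimal sketch (a hypothetical helper, not the actual Arrow C++ API) of the end-offset read pattern that the current C++ layout enables:

```cpp
#include <cstdint>
#include <cstring>

// Minimal sketch, not the Arrow C++ API: given the base of a shared memory
// region and the offset to the END of a record batch blob laid out as
// <buffers> <metadata> <metadata size: int32>, locate the metadata.
const uint8_t* LocateMetadata(const uint8_t* shm_base, int64_t end_offset,
                              int32_t* metadata_size) {
  // The metadata size occupies the last 4 bytes of the blob
  std::memcpy(metadata_size, shm_base + end_offset - sizeof(int32_t),
              sizeof(int32_t));
  // The flatbuffer metadata immediately precedes the size suffix; the
  // buffers precede the metadata
  return shm_base + end_offset - sizeof(int32_t) - *metadata_size;
}
```

Under Option B this suffix goes away, and the (offset, metadata size, body size) triple has to travel out of band, e.g. as a Block.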

@wesm (Member Author) commented Nov 21, 2016

cc @emkornfield

@wesm (Member Author) commented Nov 21, 2016

@julienledem another thing is the record batch metadata itself, which currently encodes absolute offsets on the C++ side.

I believe this is one of the reasons the metadata is being written after the buffers (we need to know the absolute positions of everything before writing out the flatbuffer).

The problem with absolute offsets is that the record batches are not relocatable (the metadata would have to be rewritten). This seems like a flaw -- I'm going to change the C++ to use relative offsets and make the record batch layout conform to the Java implementation as it is right now, so we can proceed with the integration tests.
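
A minimal sketch (illustrative types, not the actual Arrow C++ metadata structures) of why relative offsets make a batch relocatable:

```cpp
#include <cstdint>

// Illustrative only; not the actual Arrow C++ metadata type.
struct BufferLocation {
  int64_t offset;  // relative to the start of the record batch body
  int64_t length;
};

// With relative offsets, resolving a buffer is just base + offset, so the
// same metadata stays valid wherever the body bytes land (file, socket,
// shared memory). With absolute offsets, moving the body means rewriting
// every offset in the flatbuffer metadata.
const uint8_t* ResolveBuffer(const uint8_t* body_base,
                             const BufferLocation& loc) {
  return body_base + loc.offset;
}
```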

@julienledem (Member) commented
@wesm sounds good.
I have created a JIRA to discuss further if needed: https://issues.apache.org/jira/browse/ARROW-384
I don't see why we couldn't add the metadata size in there. Thank you.

@wesm (Member Author) commented Nov 24, 2016

@julienledem I refactored to account for our discussion in ARROW-384. I left a bunch of notes in the format/ documents in this PR; could you take a look when you can? One open item is padding / alignment generally. I have 64-byte alignment in there now, but I think 8-byte alignment is sufficient for a minimal format (for SIMD, alignment is marginally important, but at the least things have to be padded).
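
For reference, a minimal sketch of the padding arithmetic in question (the standard round-up-to-a-power-of-two trick, not code from this PR):

```cpp
#include <cstdint>

// Round a buffer length up to the next multiple of `alignment`, which must
// be a power of two (8 or 64 in the discussion above).
constexpr int64_t PaddedLength(int64_t length, int64_t alignment) {
  return (length + alignment - 1) & ~(alignment - 1);
}

static_assert(PaddedLength(100, 8) == 104, "8-byte alignment");
static_assert(PaddedLength(100, 64) == 128, "64-byte alignment");
```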

@wesm (Member Author) commented Nov 27, 2016

@julienledem we'll need to make the corresponding changes in the Java format (adding the metadata length prefix to record batches in the file) -- how do you want to proceed?
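
A minimal write-side sketch of the framing this implies (a hypothetical helper, not the actual Java or C++ writer code; assumes a little-endian host, as flatbuffers does on most platforms):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not the Arrow API: frame one record batch as
// <metadata size: int32> <metadata> <body>.
void WriteFramedBatch(std::vector<uint8_t>* out,
                      const std::vector<uint8_t>& metadata,
                      const std::vector<uint8_t>& body) {
  const int32_t metadata_size = static_cast<int32_t>(metadata.size());
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&metadata_size);
  out->insert(out->end(), p, p + sizeof(int32_t));  // length prefix
  out->insert(out->end(), metadata.begin(), metadata.end());
  out->insert(out->end(), body.begin(), body.end());
}
```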

@julienledem (Member) commented
This LGTM.
Do we want to increment MetadataVersion, since the change to the file format is incompatible?
Something like V0_2SNAPSHOT?

@wesm (Member Author) commented Nov 28, 2016

Sure, sounds good

wesm changed the title from "ARROW-363: WIP: Java/C++ integration testing harness, initial integration tests" to "ARROW-363: Java/C++ integration testing harness, initial integration tests" on Nov 28, 2016
@wesm (Member Author) commented Nov 28, 2016

Finally, I have a passing minimal integration test:

```
$ python integration_test.py
-- Java producing, C++ consuming
Testing with /home/wesm/code/arrow/integration/data/simple.json
-- C++ producing, Java consuming
Testing with /home/wesm/code/arrow/integration/data/simple.json
```

I suggest merging this and proceeding to write more integration tests in a new patch.

@wesm (Member Author) commented Nov 28, 2016

@julienledem this includes some Java changes; if you wouldn't mind taking a look, thanks!

@wesm (Member Author) commented Nov 28, 2016

Just incremented the metadata version. I'm using V1 and V2 for now, to avoid having to make code changes at release time.

@wesm (Member Author) commented Nov 28, 2016

I added a broken string data test case. It fails in C++ with:

```
Breakpoint 1, arrow::io::BufferReader::Seek (this=0x7fffffff67d0, position=112)
    at /home/wesm/code/arrow/cpp/src/arrow/io/memory.cc:310
310	    return Status::IOError("position out of bounds");
(gdb) l
305	  return Status::OK();
306	}
307	
308	Status BufferReader::Seek(int64_t position) {
309	  if (position < 0 || position >= size_) {
310	    return Status::IOError("position out of bounds");
311	  }
312	
313	  position_ = position;
314	  return Status::OK();
(gdb) where
#0  arrow::io::BufferReader::Seek (this=0x7fffffff67d0, position=112) at /home/wesm/code/arrow/cpp/src/arrow/io/memory.cc:310
#1  0x00007ffff7bc8413 in non-virtual thunk to arrow::io::BufferReader::Seek(long) ()
    at /home/wesm/code/arrow/cpp/src/arrow/io/memory.cc:308
#2  0x00007ffff7bc6956 in arrow::io::ReadableFileInterface::ReadAt (this=0x7fffffff67d0, position=112, nbytes=0, 
    out=0x7fffffff64c0) at /home/wesm/code/arrow/cpp/src/arrow/io/interfaces.cc:43
#3  0x00007ffff791e6a8 in arrow::ipc::RecordBatchReader::GetBuffer (this=0x7fffffff66a8, buffer_index=6, out=0x7fffffff64c0)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:405
#4  0x00007ffff791d251 in arrow::ipc::RecordBatchReader::NextArray (this=0x7fffffff66a8, field=0x707d20, 
    max_recursion_depth=64, out=0x708070) at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:359
#5  0x00007ffff7919ede in arrow::ipc::RecordBatchReader::Read (this=0x7fffffff66a8, out=0x7fffffff6b98)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:310
#6  0x00007ffff7918382 in arrow::ipc::ReadRecordBatch (metadata=..., schema=..., max_recursion_depth=64, file=0x7fffffff67d0, 
    out=0x7fffffff6b98) at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:448
#7  0x00007ffff7918315 in arrow::ipc::ReadRecordBatch (metadata=..., schema=..., file=0x7fffffff67d0, out=0x7fffffff6b98)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/adapter.cc:441
#8  0x00007ffff792a195 in arrow::ipc::FileReader::GetRecordBatch (this=0x6f65f0, i=0, batch=0x7fffffff6b98)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/file.cc:213
#9  0x0000000000436c30 in arrow::ValidateArrowVsJson (arrow_path=..., json_path=...)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/json-integration-test.cc:169
#10 0x0000000000434d67 in arrow::RunCommand (json_path=..., arrow_path=..., command=...)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/json-integration-test.cc:208
#11 0x0000000000438ceb in main (argc=1, argv=0x7fffffff7378)
    at /home/wesm/code/arrow/cpp/src/arrow/ipc/json-integration-test.cc:376
(gdb) q
```
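
One plausible reading of the trace (an assumption, not a confirmed diagnosis): frame #2 shows ReadAt being called with position=112 and nbytes=0, i.e. a zero-length read at what may be exactly the end of the buffer, which the position >= size_ check in Seek rejects. A sketch of a relaxed check, based on the Seek listing above, that would permit seeking to end-of-buffer:

```cpp
// Sketch only (derived from the gdb listing above); whether this is the
// right fix for the string test case is an open question.
Status BufferReader::Seek(int64_t position) {
  // Allow position == size_ so that zero-length reads at end-of-buffer
  // succeed (was: position >= size_).
  if (position < 0 || position > size_) {
    return Status::IOError("position out of bounds");
  }
  position_ = position;
  return Status::OK();
}
```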

Let's address this and the other data type tests in a follow-up JIRA.

@wesm (Member Author) commented Nov 29, 2016

+1, we can continue to iterate on fine details while writing more tests.

wesm changed the title from "ARROW-363: Java/C++ integration testing harness, initial integration tests" to "ARROW-363: [Java/C++] integration testing harness, initial integration tests" on Nov 29, 2016
asfgit closed this in e3c167b on Nov 29, 2016
wesm deleted the ARROW-363 branch on November 29, 2016
wesm pushed a commit to wesm/arrow that referenced this pull request on Sep 2, 2018

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#211 from xhochy/PARQUET-819 and squashes the following commits:

4d21407 [Uwe L. Korn] PARQUET-819: Don't try to install no longer existing arrow/utils.h