ARROW-14634: [Flatbuffers] introduction of ColumnBag #11646

nbauernfeind · 2021-11-08T21:46:36Z

This Draft PR is the proposed flatbuffer change referenced on the mailing list with subject "[DISCUSS] next iteration of flatbuffer structures"

I wrote a small document with some details of the proposal: https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE

This is the non-documentation side of the proposal.

github-actions · 2021-11-08T21:46:58Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


github-actions · 2021-11-08T21:59:21Z

https://issues.apache.org/jira/browse/ARROW-14634

github-actions · 2021-11-08T21:59:23Z

⚠️ Ticket has no components in JIRA, make sure you assign one.

github-actions · 2021-11-08T21:59:24Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm · 2021-11-08T22:04:59Z

format/Message.fbs

+  /// If not provided, all field nodes are included and this payload is
+  /// identical to a RecordBatch. Otherwise the reader needs to skip
+  /// top level FieldNodes that were not included.
+  includedNodes: [FieldNodeRange];


So to be clear, we can't do something like provide only a nested array - and implementations will need to validate that this only skips entire top level fields?

Can we mark this as experimental like how Tensor does?

arrow/format/Tensor.fbs

Lines 18 to 20 in 939db7f

/// EXPERIMENTAL: Metadata for n-dimensional arrays, aka "tensors" or
/// "ndarrays". Arrow implementations in general are not required to implement
/// this type

In RecordBatch, the FieldNodes are listed in-order from first field node, to its children, and grandchildren, followed by the second field node. Note that if we include a top level field node we must include its children. This requirement certainly applies to array-types, and I assume it applies to nested structures -- but I have not used them enough to play with the idea.
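The ordering described above can be sketched in a few lines of Python. This is only an illustration of the pre-order layout, not Arrow code: the `Field` class and `flatten_preorder` function are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not an Arrow class."""
    name: str
    children: List["Field"] = field(default_factory=list)


def flatten_preorder(fields: List[Field]) -> List[str]:
    """List field names in the order their FieldNodes would appear:
    each field, then all of its descendants, then its next sibling."""
    order: List[str] = []
    for f in fields:
        order.append(f.name)
        order.extend(flatten_preorder(f.children))
    return order


schema = [
    Field("a"),
    Field("b", [Field("b.x"), Field("b.y", [Field("b.y.item")])]),
    Field("c"),
]
order = flatten_preorder(schema)
```

Because the nodes of `b` are contiguous in this order, dropping a top level field means dropping one contiguous run of nodes, which is why a top level node cannot be included without its children.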

I think there are a few options to represent which top level nodes to include.

- An encoded BitSet, but it is too easy to create degenerate cases.
- A third parameter on each FieldNode, but in flatbuffers this means that the struct is written differently (I think a struct larger than 16 bytes must be pre-written before constructing the flatbuffer table that uses it).
- A parallel array alongside the field nodes indicating each node's field offset, but this would be empty for child nodes.

What remains is a compromise: listing the ranges of columns that were included. The use case I have in mind almost always has a single-digit number of ranges, but the number of columns can easily reach the tens of dozens.
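A rough sketch of why ranges stay small when the included columns are mostly contiguous. The `(start, length)` pair is an assumed encoding chosen for illustration; the actual fields of `FieldNodeRange` are not shown in this thread, and `to_ranges` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def to_ranges(indices: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse top level column indices into (start, length) runs.
    The (start, length) shape is an assumption, not the PR's definition."""
    ranges: List[Tuple[int, int]] = []
    for i in sorted(set(indices)):
        if ranges and ranges[-1][0] + ranges[-1][1] == i:
            # Index i directly follows the previous run, so extend it.
            start, length = ranges[-1]
            ranges[-1] = (start, length + 1)
        else:
            # Gap before i, so start a new run.
            ranges.append((i, 1))
    return ranges


ranges = to_ranges([0, 1, 2, 7, 8, 40])
```

Here six included columns collapse into three ranges, and the output is already in increasing, non-overlapping order.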

> So to be clear, we can't do something like provide only a nested array - and implementations will need to validate that this only skips entire top level fields?

I can't quite tell what solution you are proposing here. I think client implementations do end up working exactly like you are saying, though. Could you elaborate on your idea or defend an alternative approach?

--

> Can we mark this as experimental like how Tensor does?

Absolutely, patch update incoming.

Sorry, I guess what I mean is "Can we make it explicit that only top-level arrays can be skipped, and top-level arrays must be skipped as a whole", i.e. we can't "patch" a child of a nested array. (And that implementations must reject "degenerate" messages that skip, say, only one child of a struct, since that makes no sense in the first place, but should be validated.)

Ooh. I wish there was a way to tell. I don't think you can tell that a message is degenerate. The FieldNodes themselves do not identify which node or location in the schema they belong to. They are expected to be visited in order (parent, children, grandchildren; then the parent's siblings, i.e. the other root field nodes, etc.). The point of FieldNodeRange is that it lists top level field nodes only, not nested nodes. So, as is typical, it is impossible to tell if the message was written incorrectly. You typically only find out from an out-of-bounds memory access where a node thinks it's an array but it's really just a column of uint64.

Point is, I think I'm preventing this by not being able to describe it on the wire.

I updated the comment. Maybe this makes the intention more obvious? Does this resolve your concern?

Yes, thanks.

I think consumers can validate, so long as they have the schema: they would traverse the schema as usual, and look at both nodes and includedNodes. If while traversing a type they ran into a buffer not included in includedNodes, then the message is invalid.

Ah, wait, this is a list of top level nodes/fields only. Ok, I see the comments now, thanks - that works.
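For illustration, a reader holding the schema could at least check that the ranges are well formed and that the message carries the number of FieldNodes the schema implies. This is a minimal sketch under assumptions: `Field`, `subtree_size`, and `expected_node_count` are hypothetical names, and the `(start, length)` pair is an assumed encoding of `FieldNodeRange`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not an Arrow class."""
    name: str
    children: List["Field"] = field(default_factory=list)


def subtree_size(f: Field) -> int:
    """Number of FieldNodes a field contributes: itself plus descendants."""
    return 1 + sum(subtree_size(c) for c in f.children)


def expected_node_count(schema: List[Field],
                        ranges: List[Tuple[int, int]]) -> int:
    """Validate (start, length) ranges over top level fields and return
    how many FieldNodes the payload should then contain."""
    prev_end = 0
    total = 0
    for start, length in ranges:
        if length <= 0:
            raise ValueError("empty range")
        if start < prev_end:
            raise ValueError("ranges must be increasing and non-overlapping")
        if start + length > len(schema):
            raise ValueError("range exceeds top level field count")
        total += sum(subtree_size(f) for f in schema[start:start + length])
        prev_end = start + length
    return total


schema = [
    Field("a"),
    Field("b", [Field("b.x"), Field("b.y", [Field("b.y.item")])]),
    Field("c"),
]
count = expected_node_count(schema, [(0, 1), (2, 1)])
```

A mismatch between this count and the number of nodes in the message would flag a malformed payload, although a matching count does not prove that the nodes were written correctly.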

lidavidm · 2021-11-08T22:05:36Z

format/Message.fbs

+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.


Thinking ahead - what do the APIs for this look like? In Java, would this be a "ragged" VectorSchemaRoot?

I am open to anything. I haven't gotten far enough to consider more than the high level: the ColumnBag doesn't know its size, so the user will ask each column instead. Otherwise, I think the API should be pretty similar. I'd like to avoid any other divergence, if that also makes sense to you.
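One way to picture the "ask each column" idea is a container with no row count of its own. This is purely a hypothetical sketch, not a proposed Arrow API: the `ColumnBag` class below and its `column` and `length` methods are invented for illustration.

```python
from typing import Dict, List, Optional, Sequence


class ColumnBag:
    """Hypothetical container: some schema columns may be absent, and
    present columns may have different lengths."""

    def __init__(self, schema_names: Sequence[str],
                 columns: Dict[str, List[int]]) -> None:
        unknown = set(columns) - set(schema_names)
        if unknown:
            raise KeyError(f"columns not in schema: {sorted(unknown)}")
        self._schema_names = list(schema_names)
        self._columns = dict(columns)

    def column(self, name: str) -> Optional[List[int]]:
        """Return the column's values, or None if it was not included."""
        return self._columns.get(name)

    def length(self, name: str) -> Optional[int]:
        """Length is a per-column question; there is no bag-wide row count."""
        col = self._columns.get(name)
        return None if col is None else len(col)


bag = ColumnBag(["a", "b", "c"], {"a": [1, 2, 3], "c": [9]})
lengths = [bag.length(n) for n in ["a", "b", "c"]]
```

Apart from the missing row count and the possibility of absent columns, the surface stays close to a record-batch-like container, in line with the goal of avoiding further divergence.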

Understandable. In C++ we would need a container similar to RecordBatch, and I suppose in Java something similar.


Jimexist · 2021-11-09T00:43:47Z

@alamb FYI, pertaining to partition columns in datafusion

alamb · 2021-11-09T12:17:57Z

Thank you for the writeup and PR @nbauernfeind -- I added a link to the mailing list discussion and left a question on the context for this change in the google doc.

nbauernfeind · 2021-11-09T17:03:20Z

I've updated the PR based on feedback so far. I had to move FieldNode to Schema.fbs -- to avoid a cyclical dependency between Schema, Message, and ColumnBag.

emkornfield · 2021-11-23T18:47:50Z

format/Schema.fbs

@@ -501,6 +501,24 @@ struct Buffer {
  length: long;
 }

+/// ----------------------------------------------------------------------


Field node seems more a part of message than schema? If it is needed as a top level type, let's make it a separate file?

emkornfield · 2021-11-23T18:49:17Z

format/ColumnBag.fbs

+  /// be listed in strictly increasing order and be non-overlapping.
+  includedNodes: [FieldNodeRange];
+
+  /// Nodes correspond to the pre-ordered flattened logical schema


Are top-level field nodes going to be allowed to have separate lengths?

emkornfield · 2021-11-23T18:50:52Z

format/ColumnBag.fbs

+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level node is required
+/// to be included in the wire payload.


Mentioning that columns can have a different number of rows might also be important here, but I might be misunderstanding?

emkornfield · 2021-11-24T04:00:55Z

format/Schema.fbs

@@ -37,7 +37,7 @@ enum MetadataVersion:short {
  /// >= 0.8.0 (December 2017). Non-backwards compatible with V3.
  V4,

-  /// >= 1.0.0 (July 2020. Backwards compatible with V4 (V5 readers can read V4
+  /// >= 1.0.0 (July 2020). Backwards compatible with V4 (V5 readers can read V4


There is an enum defined with "features"; we should add column bag to that enum. Unfortunately, I haven't had the time to integrate the enum, but the purpose was to allow clients to express to the server which capabilities they handle.

ColumnBag flatbuffer proposal v1

e361191

nbauernfeind changed the title ~~[flatbuffer] introduction of ColumnBag~~ ARROW-14634: [flatbuffer] introduction of ColumnBag Nov 8, 2021

nbauernfeind changed the title ~~ARROW-14634: [flatbuffer] introduction of ColumnBag~~ ARROW-14634: [Flatbuffer] introduction of ColumnBag Nov 8, 2021

nbauernfeind changed the title ~~ARROW-14634: [Flatbuffer] introduction of ColumnBag~~ ARROW-14634: [Flatbuffers] introduction of ColumnBag Nov 8, 2021

lidavidm reviewed Nov 8, 2021


David Li's feedback

9366a04

nbauernfeind force-pushed the column_bag_demo_v1 branch from 6a96992 to 9366a04 Compare November 9, 2021 16:57

newline at eof

ff411be

nbauernfeind requested a review from lidavidm November 9, 2021 17:14

emkornfield reviewed Nov 23, 2021

emkornfield reviewed Nov 24, 2021
