-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-14752: [Docs] List the logical types in Columnar.rst #41795
Conversation
@github-actions crossbow submit preview-docs |
Revision: f739ead Submitted crossbow builds: ursacomputing/crossbow @ actions-cc168caf41
|
Would you mind having a look? Small PR with only new documentation content ;) @felipecrv @pitrou @paleolimbot @amoeba |
@@ -1078,8 +1078,95 @@ Arrow columnar format. Each logical type uses one of the above | |||
physical layouts. Nested logical types may have different physical | |||
layouts depending on the particular realization of the type. | |||
|
|||
We do not go into detail about the logical types definitions in this | |||
document as we consider `Schema.fbs`_ to be authoritative. | |||
.. glossary:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the glossary directive in Sphinx is global, and we already have a glossary elsewhere, so this could interfere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be possible but a term must only appear once, not in both glossaries: sphinx-doc/sphinx#1399.
Nevertheless, this is a good setting for future problems =) We could only list them without the directive if we want to keep it.
------------------------------------------------- | ||
|
||
+--------------------+----------------------------------+------------------------------+ | ||
| Type | Physical Memory Layout | Additional info stored | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what "additional info" alludes to? This seems to be mostly type parameters, but not only. Mixing concepts together like this feels confusing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added all additional parameters from Schema.fbs
and only removed some from Dictionary
, I think, because those are used for IPC and not for the format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change "Additional info stored" to "Parameters"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This would have saved me so much time when I first got started with low-level Arrow format programming!
I know of at least one previous discussion regarding the term "logical type". This PR may not be the place to do it, but readers that are familiar with existing arrow implementations may be confused because Schema.fbs
, while authoritative, parameterizes types differently than most implementations (e.g., pretty much every implementation has an pyarrow.int32()
not pyarrow.int(32)
). I think this only affects Int
, Date
, Time
, and FloatingPoint
, and ("types" that are parameterized differently in pyarrow vs. Schema.fbs.
(I am not sure the best way to incorporate that here or if it's a battle for another day 🙂 )
| Binary View | " | / | | ||
+--------------------+----------------------------------+------------------------------+ | ||
| Utf8 View | Variable-size Binary View Layout | / | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable-size Binary View Layout
should maybe move up one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, great catch!
+--------------------+----------------------------------+------------------------------+ | ||
| Null | Null Layout | / | | ||
+--------------------+----------------------------------+------------------------------+ | ||
| Boolean | Fixed-size Primitive Layout | values bit-packed | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this is clearer when the table is rendered, but this column in the table seems to store the parameter names and useful notes about the type. It might be nice to make clear which is which (or might take up too much table real-estate...your call!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the table header serve this missing information? Rendered version: http://crossbow.voltrondata.com/pr_docs/41795/format/Columnar.html#types-and-their-respective-physical-memory-layout
What I could also do is to repeat the table header in between also (before nested and special)
Array where all values are null. | ||
|
||
Boolean Type | ||
Array of bit-packed values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"bit-packed" confuses the data type with its physical layout IMHO.
I've always been frustrated with the state of the columnar format spec, as I found the confusion between concepts there quite frustrating (for example, it uses the term "logical type" throughout, even though that term doesn't appear elsewhere). That said, I don't think this PR makes things less confusion, on the contrary. |
The issue about "logical type" and substituting it with just "type" was meant to be done in a separate PR. This PR should add a list of all types and their respective layout. Maybe the "glossary" part should be removed for now and we should focus on making the table of types clear. I agree and share the frustration but am sure we will improve, one step at the time ;) |
Definitely :-) I am trying my hand on an alternate set of changes right now, hopefully we can converge on something. |
Ha. Yes, this is true. Maybe if the title in the table is changed from "Additional info stored" to "Parameters" it is clearer and then, I guess, it is up to the implementation to decide if the parameter defines a separate type ( |
Linking the html version here, if it helps: http://crossbow.voltrondata.com/pr_docs/41795/format/Columnar.html#types-and-their-respective-physical-memory-layout |
Or perhaps just a note that most implementations expand parameterized types such as ... to make it more convenient to refer to commonly-used parameterizations of these types? (All optional and/or battle for another PR) |
Closing in favour of #41958 |
Rationale for this change
Columnar.rst doesn't describe the logical types, since Schema.fbs is considered authoritative but list of all types would be worth adding.
What changes are included in this PR?
Are these changes tested?
Crossbow preview build (documentation).
Are there any user-facing changes?
No.