-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41691: [Doc] Remove notion of "logical type" #41958
GH-41691: [Doc] Remove notion of "logical type" #41958
Conversation
|
24f089e
to
f654194
Compare
I reused @AlenkaF 's idea of a table here, but I changed the contents quite a bit. Some open questions:
|
Also cc @alamb @tustvold @paleolimbot @wesm |
@github-actions crossbow submit preview-docs |
Revision: f654194 Submitted crossbow builds: ursacomputing/crossbow @ actions-c270be27f4
|
@github-actions crossbow submit preview-docs |
Revision: 0f2b458 Submitted crossbow builds: ursacomputing/crossbow @ actions-c9a11fc3b7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 generally for these changes, very helpful clarification
The HTML preview can be seen at http://crossbow.voltrondata.com/pr_docs/41958/format/Columnar.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! ❤️
- for the type parameters, should we use a description in natural language (such as "bit width" or "signedness") or the physical parameter names in
Schema.fbs
(such asbitWidth
oris_signed
)?
I like the natural language you are using!
- should we explain the type parameters in full (this would result in a more lengthy though more informative doc), or should we just keep a couple explanatory notes as here?
I can't decide. Similar as in #41795 and the glossary option: I do not like adding too much content, making the columnar page too long. At the same time it would be nice, from a user's perspective, to have a sentence or two with an explanation of type parameters.
- should we repeat the term "Layout" under the "Physical Memory Layout" column? For example, write "Fixed-size Primitive Layout" instead of the terser "Fixed-size Primitive".
I think keeping it terser is better as full "Physical Memory Layout" in the table header is making it clear.
* **Parametric type**: a type which requires additional parameters | ||
for full determination of its semantics. For example, all nested types | ||
are parametric by construction. A timestamp is also parametric as it needs | ||
a unit (such as microseconds) and a timezone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would integers or floating point types count here? (i.e., types that implementations have chosen not to parameterize for convenience but that are "parameterized" in this document?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahah, good question. In the context of the Flatbuffers serialization, yes. But there's only a small fixed number of possible parameters, so this is less obvious than for timestamps or nested types, where there's a virtual infinity of possibilities.
As @jorisvandenbossche rightfully pointed out below, the Flatbuffers serialization in the IPC format is only one possible representation of the type system; others could work differently; as a matter of fact, the C Data Interface uses entirely different format codes for these types.
I think for the content you created here (i.e. the overview table), the current level of detail is fine. For me it is a separate question if we want to keep Schema.fbs as the "authoritative source" for the explanation of those parameters. Personally, I don't like that this is "hidden" in a fbs file which is actually specific for IPC (there are a few things in that file that are not generic for the columnar format). I think ideally we would move the explanation of the parameters in the fbs file to the actual format specification, but then the question is how to deal with the duplication with the fbs file (we don't want two places to keep in sync, but would it be fine to cut down the content in the fbs file largely?) |
I agree that it makes things a bit intimidating; also, reading a fbs file just to find prose documentation is not really a pleasant experience. Perhaps open a new issue for this? |
I agree and am very glad that this documentation will now be in a place where people are likely to actually see it 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pitrou -- I really like this clarification. Consistently using the terms "Data Type" and "Physical Layout" rather than also "Logical Type" I think makes the intent and specification clearer.
Also I would like to explicitly agree with the opinion that this is not a change to the spec (and thus doesn't require any formal votes, etc) -- rather this is a clarification to the wording of the existing spec
|
||
* \(1) Type parameters listed in *italics* denote a data type's child types. | ||
|
||
* \(2) The *bit width* parameter of a Time type is technically redundant as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is nice to call out as I have noticed it before
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
0f2b458
to
e02d51c
Compare
Opened #42011 about moving a lot of the prose content from Schema.fbs to Columnar.rst |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 93712bf. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
In several places in the Arrow specification and documentation we use the term "logical types", but we don't use it consistently and we don't actually have physical types (only physical layouts) to contrast it with. This creates confusion for readers as it is not immediately clear whether all data types are "logical" and if there is a meaningful distinction behind our usage of this term.
Also address GH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.