Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41691: [Doc] Remove notion of "logical type" #41958

Merged
merged 3 commits into from
Jun 6, 2024

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jun 4, 2024

In several places in the Arrow specification and documentation we use the term "logical types", but we don't use it consistently and we don't actually have physical types (only physical layouts) to contrast it with. This creates confusion for readers as it is not immediately clear whether all data types are "logical" and if there is a meaningful distinction behind our usage of this term.

Also address GH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.

Copy link

github-actions bot commented Jun 4, 2024

⚠️ GitHub issue #41691 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou force-pushed the gh41691-columnar-data-types-overhauld branch 2 times, most recently from 24f089e to f654194 Compare June 4, 2024 14:21
@pitrou pitrou marked this pull request as ready for review June 4, 2024 14:23
@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

cc @jorisvandenbossche @AlenkaF

@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

I reused @AlenkaF 's idea of a table here, but I changed the contents quite a bit.

Some open questions:

  • for the type parameters, should we use a description in natural language (such as "bit width" or "signedness") or the physical parameter names in Schema.fbs (such as bitWidth or is_signed)?
  • should we explain the type parameters in full (this would result in a more lengthy though more informative doc), or should we just keep a couple explanatory notes as here?
  • should we repeat the term "Layout" under the "Physical Memory Layout" column? For example, write "Fixed-size Primitive Layout" instead of the terser "Fixed-size Primitive".

@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

Also cc @alamb @tustvold @paleolimbot @wesm

@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

@github-actions crossbow submit preview-docs

Copy link

github-actions bot commented Jun 4, 2024

Revision: f654194

Submitted crossbow builds: ursacomputing/crossbow @ actions-c270be27f4

Task Status
preview-docs GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jun 4, 2024
@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

@github-actions crossbow submit preview-docs

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 4, 2024
Copy link

github-actions bot commented Jun 4, 2024

Revision: 0f2b458

Submitted crossbow builds: ursacomputing/crossbow @ actions-c9a11fc3b7

Task Status
preview-docs GitHub Actions

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 generally for these changes, very helpful clarification

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 4, 2024
@pitrou
Copy link
Member Author

pitrou commented Jun 4, 2024

The HTML preview can be seen at http://crossbow.voltrondata.com/pr_docs/41958/format/Columnar.html

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! ❤️

  • for the type parameters, should we use a description in natural language (such as "bit width" or "signedness") or the physical parameter names in Schema.fbs (such as bitWidth or is_signed)?

I like the natural language you are using!

  • should we explain the type parameters in full (this would result in a more lengthy though more informative doc), or should we just keep a couple explanatory notes as here?

I can't decide. Similar as in #41795 and the glossary option: I do not like adding too much content, making the columnar page too long. At the same time it would be nice, from a user's perspective, to have a sentence or two with an explanation of type parameters.

  • should we repeat the term "Layout" under the "Physical Memory Layout" column? For example, write "Fixed-size Primitive Layout" instead of the terser "Fixed-size Primitive".

I think keeping it terser is better as full "Physical Memory Layout" in the table header is making it clear.

Comment on lines +87 to +90
* **Parametric type**: a type which requires additional parameters
for full determination of its semantics. For example, all nested types
are parametric by construction. A timestamp is also parametric as it needs
a unit (such as microseconds) and a timezone.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would integers or floating point types count here? (i.e., types that implementations have chosen not to parameterize for convenience but that are "parameterized" in this document?)

Copy link
Member Author

@pitrou pitrou Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahah, good question. In the context of the Flatbuffers serialization, yes. But there's only a small fixed number of possible parameters, so this is less obvious than for timestamps or nested types, where there's a virtual infinity of possibilities.

As @jorisvandenbossche rightfully pointed out below, the Flatbuffers serialization in the IPC format is only one possible representation of the type system; others could work differently; as a matter of fact, the C Data Interface uses entirely different format codes for these types.

@jorisvandenbossche
Copy link
Member

  • should we explain the type parameters in full (this would result in a more lengthy though more informative doc), or should we just keep a couple explanatory notes as here?

I think for the content you created here (i.e. the overview table), the current level of detail is fine.

For me it is a separate question if we want to keep Schema.fbs as the "authoritative source" for the explanation of those parameters. Personally, I don't like that this is "hidden" in a fbs file which is actually specific for IPC (there are a few things in that file that are not generic for the columnar format). I think ideally we would move the explanation of the parameters in the fbs file to the actual format specification, but then the question is how to deal with the duplication with the fbs file (we don't want two places to keep in sync, but would it be fine to cut down the content in the fbs file largely?)

@pitrou
Copy link
Member Author

pitrou commented Jun 5, 2024

For me it is a separate question if we want to keep Schema.fbs as the "authoritative source" for the explanation of those parameters. Personally, I don't like that this is "hidden" in a fbs file which is actually specific for IPC (there are a few things in that file that are not generic for the columnar format).

I agree that it makes things a bit intimidating; also, reading a fbs file just to find prose documentation is not really a pleasant experience. Perhaps open a new issue for this?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 5, 2024
@paleolimbot
Copy link
Member

I agree that it makes things a bit intimidating; also, reading a fbs file just to find prose documentation is not really a pleasant experience.

I agree and am very glad that this documentation will now be in a place where people are likely to actually see it 🙂

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pitrou -- I really like this clarification. Consistently using the terms "Data Type" and "Physical Layout" rather than also "Logical Type" I think makes the intent and specification clearer.

Also I would like to explicitly agree with the opinion that this is not a change to the spec (and thus doesn't require any formal votes, etc) -- rather this is a clarification to the wording of the existing spec

docs/source/format/Columnar.rst Outdated Show resolved Hide resolved

* \(1) Type parameters listed in *italics* denote a data type's child types.

* \(2) The *bit width* parameter of a Time type is technically redundant as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice to call out as I have noticed it before

docs/source/format/Columnar.rst Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Jun 6, 2024
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
@pitrou pitrou force-pushed the gh41691-columnar-data-types-overhauld branch from 0f2b458 to e02d51c Compare June 6, 2024 13:59
@jorisvandenbossche
Copy link
Member

I agree that it makes things a bit intimidating; also, reading a fbs file just to find prose documentation is not really a pleasant experience. Perhaps open a new issue for this?

Opened #42011 about moving a lot of the prose content from Schema.fbs to Columnar.rst

@pitrou pitrou merged commit 93712bf into apache:main Jun 6, 2024
7 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Jun 6, 2024
@pitrou pitrou deleted the gh41691-columnar-data-types-overhauld branch June 6, 2024 15:39
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 93712bf.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants