Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] List the logical types in Columnar.rst for searchability #14752

Closed
lidavidm opened this issue Nov 28, 2022 · 15 comments
Closed

[Docs] List the logical types in Columnar.rst for searchability #14752

lidavidm opened this issue Nov 28, 2022 · 15 comments

Comments

@lidavidm
Copy link
Member

lidavidm commented Nov 28, 2022

Describe the enhancement requested

See apache/arrow-nanoarrow#74 (review)

A few comments...it may be worth following this up with a section on the Map in https://arrow.apache.org/docs/format/Columnar.html ...other than a footnote in the C Data interface spec, I can't find any reference to the memory layout for a Map (or for the datetime interval types, which I remember having to look up in the C++ implementation)

Columnar.rst doesn't describe the logical types, since Schema.fbs is considered authoritative. But it is probably worth at least listing the types in this document so that they can be easily searched for (and possibly even summarize those types).

Component(s)

Documentation

@lidavidm lidavidm changed the title [Docs][Format] List the logical types in Columnar.rst for searchability [Docs] List the logical types in Columnar.rst for searchability Nov 28, 2022
@paleolimbot
Copy link
Member

Thanks for opening this! In particular David's comment:

Each list item represents a dictionary, where the Struct values of each list entry are key-value pairs.

was particularly helpful.

The other types that I could not find in the columnar spec when I was looking for them were the interval types, which might be worth mentioning.

@lidavidm
Copy link
Member Author

Interval types are also logical types, so yeah, a listing of all the logical types might be useful.

@pitrou
Copy link
Member

pitrou commented Nov 30, 2022

Arrow has no notion of logical types. But, yes, making the Columnar format spec more readable would be useful.

@pitrou
Copy link
Member

pitrou commented Nov 30, 2022

Ok, unfortunately, Columnar.rst does use the wording "logical type". Which is contradicted by the fact that there's no separate set of "physical types" (only layouts). The whole thing has always been confusing to me.

@paleolimbot
Copy link
Member

I think it would be nice if all the types were in Columnar.rst (with the corresponding layouts and any parameters). These don't change frequently and so I don't think that maintaining a sync between the .fbs file and the documentation will be prohibitively complicated?

@pitrou
Copy link
Member

pitrou commented Dec 1, 2022

I agree it would be nice, at least as a synthetic table.

@ianmcook
Copy link
Member

Semi-related issue: #33958

@amoeba
Copy link
Member

amoeba commented May 16, 2023

I'd like to see this part of the format docs improved and would be happy to submit a PR for review. I read the comments above and in other issues and it seems like there's:

  1. Still some discussion to be had about avoiding "logical" vs. "physical" in favor of "types" and "layouts" and possibly updating the format docs comprehensively
  2. Concensus around listing all types (ie everything in https://github.com/apache/arrow/blob/505a2e4519ba8d4983da6caa1c56ecae8d063e84/format/Schema.fbs#L407C5-L430) in a table with columns for layouts and parameters, where appropriate.

I could start with a PR for (2). Looking at the high level sections in https://arrow.apache.org/docs/format/Columnar.html, I think a comprehensive table of types would be best right after Terminology and before Physical Memory Layout as I think people would generally want to know the types before their physical layouts. Does this sound reasonable?

@paleolimbot
Copy link
Member

I would be excited to see that PR! I think that a type listing at the start ("this is what Arrow can do") followed by layouts ("this is what it looks like in memory") makes a lot of sense and would have helped me a lot when I was trying to implement them. I like types + layouts rather than logical + physical but I don't have strong feelings about it as long as it's consistent.

@pitrou
Copy link
Member

pitrou commented May 16, 2023

The "logical" vs. "physical" distinction is actually extra confusing, because nowadays we do have logical types (aka semantic variations of existing types), but they are called... extension types.

@amoeba
Copy link
Member

amoeba commented May 16, 2023

Gotcha. That brings up another point... should the newly-added Tensor types be in the aforementioned table of Types. I'd think yes.

@pitrou
Copy link
Member

pitrou commented May 16, 2023

I don't think so. They're extension types, not part of the columnar spec itself. You may instead add a seealso after the table to point to https://arrow.apache.org/docs/format/CanonicalExtensions.html

@amoeba
Copy link
Member

amoeba commented May 16, 2023

Okay, that makes sense.

@jorisvandenbossche
Copy link
Member

  1. Still some discussion to be had about avoiding "logical" vs. "physical" in favor of "types" and "layouts" and possibly updating the format docs comprehensively

I opened a dedicated issue for this, as it requires a more comprehensive update of the docs than just adding a table of all (logical) type: #41691

@AlenkaF AlenkaF self-assigned this May 21, 2024
pitrou added a commit to pitrou/arrow that referenced this issue Jun 4, 2024
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
pitrou added a commit to pitrou/arrow that referenced this issue Jun 4, 2024
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
pitrou added a commit to pitrou/arrow that referenced this issue Jun 4, 2024
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
pitrou added a commit to pitrou/arrow that referenced this issue Jun 6, 2024
Also address apacheGH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.
pitrou added a commit that referenced this issue Jun 6, 2024
In several places in the Arrow specification and documentation we use the term "logical types", but we don't use it consistently and we don't actually have physical types (only physical layouts) to contrast it with. This creates confusion for readers as it is not immediately clear whether all data types are "logical" and if there is a meaningful distinction behind our usage of this term.

Also address GH-14752 by adding a table of data types with their respective parameters and the corresponding layouts.

* GitHub Issue: #41691

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou self-assigned this Jun 6, 2024
@pitrou
Copy link
Member

pitrou commented Jun 6, 2024

This was fixed in #41958, after inspiration from @AlenkaF 's own PR. Thanks!

@pitrou pitrou closed this as completed Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants