Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-14752: [Docs] List the logical types in Columnar.rst #41795

Closed
wants to merge 3 commits into from

Conversation

AlenkaF
Copy link
Member

@AlenkaF AlenkaF commented May 23, 2024

Rationale for this change

Columnar.rst doesn't describe the logical types, since Schema.fbs is considered authoritative but list of all types would be worth adding.

What changes are included in this PR?

  • Table with all types and their respective physical memory layout
  • TBD: list of types and their short summary in a glossary form.

Are these changes tested?

Crossbow preview build (documentation).

Are there any user-facing changes?

No.

@AlenkaF
Copy link
Member Author

AlenkaF commented May 23, 2024

@github-actions crossbow submit preview-docs

Copy link

Revision: f739ead

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc168caf41

Task Status
preview-docs GitHub Actions

@AlenkaF AlenkaF marked this pull request as ready for review May 27, 2024 13:14
@AlenkaF
Copy link
Member Author

AlenkaF commented Jun 4, 2024

Would you mind having a look? Small PR with only new documentation content ;) @felipecrv @pitrou @paleolimbot @amoeba

@@ -1078,8 +1078,95 @@ Arrow columnar format. Each logical type uses one of the above
physical layouts. Nested logical types may have different physical
layouts depending on the particular realization of the type.

We do not go into detail about the logical types definitions in this
document as we consider `Schema.fbs`_ to be authoritative.
.. glossary::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the glossary directive in Sphinx is global, and we already have a glossary elsewhere, so this could interfere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be possible but a term must only appear once, not in both glossaries: sphinx-doc/sphinx#1399.

Nevertheless, this is a good setting for future problems =) We could only list them without the directive if we want to keep it.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 4, 2024
-------------------------------------------------

+--------------------+----------------------------------+------------------------------+
| Type | Physical Memory Layout | Additional info stored |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what "additional info" alludes to? This seems to be mostly type parameters, but not only. Mixing concepts together like this feels confusing to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added all additional parameters from Schema.fbs and only removed some from Dictionary, I think, because those are used for IPC and not for the format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could change "Additional info stored" to "Parameters"?

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This would have saved me so much time when I first got started with low-level Arrow format programming!

I know of at least one previous discussion regarding the term "logical type". This PR may not be the place to do it, but readers that are familiar with existing arrow implementations may be confused because Schema.fbs, while authoritative, parameterizes types differently than most implementations (e.g., pretty much every implementation has an pyarrow.int32() not pyarrow.int(32)). I think this only affects Int, Date, Time, and FloatingPoint, and ("types" that are parameterized differently in pyarrow vs. Schema.fbs.

(I am not sure the best way to incorporate that here or if it's a battle for another day 🙂 )

Comment on lines +1141 to +1143
| Binary View | " | / |
+--------------------+----------------------------------+------------------------------+
| Utf8 View | Variable-size Binary View Layout | / |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable-size Binary View Layout should maybe move up one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, great catch!

+--------------------+----------------------------------+------------------------------+
| Null | Null Layout | / |
+--------------------+----------------------------------+------------------------------+
| Boolean | Fixed-size Primitive Layout | values bit-packed |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this is clearer when the table is rendered, but this column in the table seems to store the parameter names and useful notes about the type. It might be nice to make clear which is which (or might take up too much table real-estate...your call!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the table header serve this missing information? Rendered version: http://crossbow.voltrondata.com/pr_docs/41795/format/Columnar.html#types-and-their-respective-physical-memory-layout

What I could also do is to repeat the table header in between also (before nested and special)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 4, 2024
Array where all values are null.

Boolean Type
Array of bit-packed values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"bit-packed" confuses the data type with its physical layout IMHO.

@pitrou
Copy link
Member

pitrou commented Jun 4, 2024

I've always been frustrated with the state of the columnar format spec, as I found the confusion between concepts there quite frustrating (for example, it uses the term "logical type" throughout, even though that term doesn't appear elsewhere).

That said, I don't think this PR makes things less confusion, on the contrary.

@AlenkaF
Copy link
Member Author

AlenkaF commented Jun 4, 2024

I've always been frustrated with the state of the columnar format spec, as I found the confusion between concepts there quite frustrating (for example, it uses the term "logical type" throughout, even though that term doesn't appear elsewhere).

That said, I don't think this PR makes things less confusion, on the contrary.

The issue about "logical type" and substituting it with just "type" was meant to be done in a separate PR. This PR should add a list of all types and their respective layout. Maybe the "glossary" part should be removed for now and we should focus on making the table of types clear.

I agree and share the frustration but am sure we will improve, one step at the time ;)

@pitrou
Copy link
Member

pitrou commented Jun 4, 2024

I agree and share the frustration but am sure we will improve, one step at the time ;)

Definitely :-) I am trying my hand on an alternate set of changes right now, hopefully we can converge on something.

@AlenkaF
Copy link
Member Author

AlenkaF commented Jun 4, 2024

I know of at least one previous discussion regarding the term "logical type". This PR may not be the place to do it, but readers that are familiar with existing arrow implementations may be confused because Schema.fbs, while authoritative, parameterizes types differently than most implementations (e.g., pretty much every implementation has an pyarrow.int32() not pyarrow.int(32)). I think this only affects Int, Date, Time, and FloatingPoint, and ("types" that are parameterized differently in pyarrow vs. Schema.fbs.

(I am not sure the best way to incorporate that here or if it's a battle for another day 🙂 )

Ha. Yes, this is true. Maybe if the title in the table is changed from "Additional info stored" to "Parameters" it is clearer and then, I guess, it is up to the implementation to decide if the parameter defines a separate type (int32) or is actually a parameter of the type (int(bit_width=32, is_signed=False)).

@AlenkaF
Copy link
Member Author

AlenkaF commented Jun 4, 2024

@paleolimbot
Copy link
Member

Maybe if the title in the table is changed from "Additional info stored" to "Parameters"

Or perhaps just a note that most implementations expand parameterized types such as ... to make it more convenient to refer to commonly-used parameterizations of these types? (All optional and/or battle for another PR)

@AlenkaF
Copy link
Member Author

AlenkaF commented Jun 6, 2024

Closing in favour of #41958

@AlenkaF AlenkaF closed this Jun 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants