Currently the columnar format is only documented at this page: https://arrow.apache.org/docs/format/Columnar.html. However, when I try to actually implement the format, I find the physical representation underdocumented.
In particular, the encoding of primitive types is unclear. The only information given is an example int32 layout; no layouts are shown for the other types. How are booleans represented, for example? Do implementations choose which representation they use? I suppose that's not the case, as it would defeat Arrow's goal.
I was pointed to https://github.com/apache/arrow/blob/main/format/Schema.fbs for reference. However, as far as I understand, this specification covers only the IPC schema. It specifies type information, but when it comes to physical representation, there's only struct Buffer with a length and offset.
I would like clear documentation of the memory layout of every type supported by Arrow. An example specification I can think of is CTF, which provides not only layouts of all types, but also side-by-side examples of schema, layout, and values. Similar documentation would be immensely helpful for Arrow, especially if it showed the layouts of the various array types.
Component(s)
Format
pitrou changed the title from "Physical representation of columnar format not well documented" to "[Format] Physical representation of columnar format not well documented" on Jan 11, 2024
I am attempting to add a general introductory page to the documentation that would list all the physical layouts with diagrams and basic explanations here: #41593. Reviews welcome!
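For anyone landing on this issue with the same question, here is a minimal sketch of the kind of side-by-side example being asked for, based on my reading of the columnar spec: a nullable int32 array is a validity bitmap plus a buffer of little-endian 4-byte values, and a boolean array is a validity bitmap plus a second bitmap holding the values, both using least-significant-bit numbering. The helper names (`pack_bitmap`, `int32_buffers`, `bool_buffers`) are hypothetical and not part of any Arrow library. The buffer alignment and padding that the spec recommends are left out, and writing zero into null slots is my own choice, since the spec leaves those bytes unspecified.

```python
import struct


def pack_bitmap(bits):
    """Pack a sequence of booleans into bytes, least-significant bit first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def int32_buffers(values):
    """Return (validity bitmap, data buffer) for a nullable int32 array.

    Null slots are written as 0 here; that is an assumption of this sketch.
    """
    validity = pack_bitmap([v is not None for v in values])
    data = b"".join(struct.pack("<i", 0 if v is None else v) for v in values)
    return validity, data


def bool_buffers(values):
    """Return (validity bitmap, data bitmap) for a nullable boolean array."""
    validity = pack_bitmap([v is not None for v in values])
    data = pack_bitmap([bool(v) for v in values])
    return validity, data


if __name__ == "__main__":
    validity, data = int32_buffers([1, None, 3])
    print("int32 validity:", validity.hex())  # 05 (bits 0 and 2 set)
    print("int32 data:    ", data.hex())      # 010000000000000003000000

    validity, data = bool_buffers([True, False, None, True])
    print("bool validity: ", validity.hex())  # 0b (bits 0, 1, 3 set)
    print("bool data:     ", data.hex())      # 09 (bits 0 and 3 set)
```

So the values `[1, null, 3]` and `[true, false, null, true]` each take one validity byte, while the boolean values themselves take one bit per slot rather than one byte.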