GH-41673: [Format][Docs] Add arrow format introductory page #41593
base: main
cc @amoeba this could use a look already. I think everything I wanted to add is here. I will need to do one more general read-through before marking it ready for review, though.
@github-actions crossbow submit preview-docs
Revision: 3cdd97a Submitted crossbow builds: ursacomputing/crossbow @ actions-4a1cc2326d
docs/source/format/index.rst
Outdated
CDataInterface
CStreamInterface
CDeviceDataInterface
@jorisvandenbossche I have kept the C Stream Interface in a separate file, as the structure is nicer IMHO.
803b4b9 to 4d2bf8a
@github-actions crossbow submit preview-docs
Revision: 4d2bf8a Submitted crossbow builds: ursacomputing/crossbow @ actions-cc7da250f4
Update: I have removed the change in docs/source/format/index.rst (captions for the Specifications section) and will move it to a separate PR, see 97e4217.
4eef9dc to 3c8b4fa
@github-actions crossbow submit preview-docs
Revision: 9f9bbff Submitted crossbow builds: ursacomputing/crossbow @ actions-cee8fb4563
Fresh link to the html version: http://crossbow.voltrondata.com/pr_docs/41593/format/Intro.html
As the format gets more adoption, it becomes easier for data processing
systems to exchange tabular data. Among other things, an agreed upon
in-memory format, enables the implementations of zero-copy IPC protocols
(inter-process communication without copying data in memory) and
more efficient reading and writing of file formats like CSV, `Apache ORC`_,
and `Apache Parquet`_.
Two comments here (cc @felipecrv as I think you suggested this paragraph):
- If we mention inter-process communication here, should we also mention zero-copy within-process sharing (i.e. what the C Data Interface provides)? Also, I know we generally describe the IPC protocol as zero-copy, but of course it's not entirely zero-copy, so mentioning it twice in the context of IPC is maybe a bit too much.
- I find the mention of "more efficient reading and writing of file formats" a bit out of place now, because it's not really the format itself that makes reading e.g. a Parquet file more efficient. Rather, many Arrow implementations provide this functionality as well, and such implementations are easier to reuse if they read into the Arrow format.
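To make the "zero-copy" wording concrete, here is a minimal sketch in plain Python of the difference between copying a buffer and sharing it within one process. It uses only the standard library's `memoryview` as an analogy; it is not the C Data Interface or any Arrow API, and the variable names are just for illustration.

```python
# One contiguous region of memory, owned by `buf`.
buf = bytearray(b"\x01\x02\x03\x04")

# Slicing a bytearray and converting to bytes allocates new memory (a copy).
copied = bytes(buf[1:3])

# Slicing a memoryview only records an offset and a length (no copy).
shared = memoryview(buf)[1:3]

# Mutate the original buffer in place.
buf[1] = 0xFF

print(copied)         # the copy still holds the old bytes
print(bytes(shared))  # the view reflects the change
```

A consumer holding `shared` sees the producer's memory directly, which is the property "zero-copy sharing" refers to; the copy in `copied` is what such sharing avoids.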
different data types and the way their values are stored in memory varies among
the data types. The specification of how these values are arranged in memory is
what we call a **physical memory layout**. One contiguous region of memory that
stores data for arrays is called a **Buffer**.
I would add here something like (to connect the array with buffer concepts): "An array consists of one or more buffers"
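A small sketch of what "an array consists of one or more buffers" could look like for a nullable 32-bit integer array, written in plain Python rather than with any Arrow library. The helper name `int32_array_buffers` is hypothetical, and filling null slots with 0 is an assumption made for the example.

```python
import struct


def int32_array_buffers(values):
    """Return (validity, data) buffers for a list of ints; None marks a null."""
    # Validity buffer: one bit per slot, padded up to a whole number of bytes.
    validity = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
    # Data buffer: 4 little-endian bytes per slot. Null slots still occupy
    # space; this sketch fills them with 0.
    data = struct.pack(f"<{len(values)}i", *(0 if v is None else v for v in values))
    return bytes(validity), data


validity, data = int32_array_buffers([1, None, 3])
print(validity.hex(), data.hex())
```

Here the array `[1, None, 3]` is backed by two separate contiguous regions: a one-byte validity buffer and a twelve-byte data buffer.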
docs/source/format/Intro.rst
Outdated
We read validity bitmaps right-to-left within a group of 8 bits due to
`bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being
used.
I would explicitly call out here that this is depicted that way in the diagrams
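As an illustration of why the diagrams read right-to-left, here is a short sketch in plain Python, assuming least-significant-bit numbering within each byte. The helper names `is_valid` and `as_diagram` are hypothetical.

```python
def is_valid(bitmap: bytes, i: int) -> bool:
    """Slot i lives in byte i // 8, at bit position i % 8 counted from the LSB."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def as_diagram(bitmap: bytes) -> str:
    """Render each byte with its most significant bit on the left."""
    return " ".join(f"{byte:08b}" for byte in bitmap)


# Five slots, where only slot 1 is null.
bitmap = bytes([0b00011101])
flags = [is_valid(bitmap, i) for i in range(5)]

print(as_diagram(bitmap))
print(flags)
```

When the byte is written out as `00011101`, slot 0 is the rightmost digit, so the validity of slots 0 to 4 is read from right to left.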
List and large list view
------------------------

List view data type allows arrays to specify out-of-order offsets.
Can you expand this explanation a bit?
(I think the main point is that, in addition to the offsets buffer, there is now also a sizes buffer. The offsets still indicate the start of each element, but the size is no longer inferred from the next offset value; it is stored explicitly in a separate sizes buffer. That is what allows out-of-order offsets.)
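A minimal sketch of that difference in plain Python, with lists standing in for the offsets, sizes, and child values buffers. The function names `decode_list` and `decode_list_view` and the sample data are made up for illustration.

```python
def decode_list(offsets, values):
    """List layout: element i spans offsets[i] up to offsets[i + 1]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_list_view(offsets, sizes, values):
    """List view layout: element i starts at offsets[i] and has sizes[i] items."""
    return [values[o:o + s] for o, s in zip(offsets, sizes)]


values = [10, 20, 30, 40, 50]

# List: sizes are implied, so offsets must be non-decreasing.
as_list = decode_list([0, 3, 5], values)

# List view: explicit sizes, so offsets may appear in any order.
as_list_view = decode_list_view([3, 0, 0], [2, 3, 0], values)

print(as_list)
print(as_list_view)
```

In the list view case the first element starts at offset 3 and the second at offset 0, which the plain list layout cannot express because each size there is derived from the next offset.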
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
The documentation for Arrow Format could be improved:
What changes are included in this PR?
This PR includes:
in a separate "introduction" page with no technical details. Specifications index page is also restructured to include captions and make the left sidebar menu better organised.
Note: a table listing all types together with their physical layout will be added to the existing Columnar.rst page in a separate PR: #14752
Are these changes tested?
No, this is a docs change.
Are there any user-facing changes?
No.