GH-41673: [Format][Docs] Add arrow format introductory page #41593

Open
wants to merge 51 commits into
base: main

AlenkaF
Member

@AlenkaF AlenkaF commented May 8, 2024

Rationale for this change

The documentation for Arrow Format could be improved:

  • not all types are listed
  • not all layouts are explained

What changes are included in this PR?

This PR includes:

  • motivation behind the columnar format
  • the different physical layouts, explained with diagrams that compare an example type to its physical layout
  • Arrow terminology
  • Extension types and sharing of Arrow data

in a separate "introduction" page with no technical details. The Specifications index page is also restructured to include captions and to better organise the left sidebar menu.

Note: a table listing all types together with their physical layouts will be added to the existing Columnar.rst page in a separate PR: #14752

Are these changes tested?

No, this is a docs change.

Are there any user-facing changes?

No.

@AlenkaF
Member Author

AlenkaF commented May 9, 2024

cc @amoeba this could use a look already. I think all I wanted to add is here. Will need to do a general look through one more time before marking it ready for review though.

@AlenkaF
Member Author

AlenkaF commented May 9, 2024

@github-actions crossbow submit preview-docs

github-actions bot commented May 9, 2024

Revision: 3cdd97a

Submitted crossbow builds: ursacomputing/crossbow @ actions-4a1cc2326d

Task Status
preview-docs GitHub Actions

Comment on lines 40 to 34
CDataInterface
CStreamInterface
CDeviceDataInterface
Member Author

@jorisvandenbossche I have kept the C Stream Interface in a separate file, as the structure is nicer IMHO.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label May 9, 2024
@AlenkaF
Member Author

AlenkaF commented May 13, 2024

@github-actions crossbow submit preview-docs

Revision: 4d2bf8a

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc7da250f4

Task Status
preview-docs GitHub Actions

@AlenkaF
Member Author

AlenkaF commented May 13, 2024

Not sure why the captions in the left sidebar menu are not visible in the crossbow preview build:

[Image: Screenshot 2024-05-13 at 19 47 01]

but are visible for me locally:

[Image: Screenshot 2024-05-13 at 19 46 44]

AlenkaF marked this pull request as ready for review May 13, 2024 17:53
@AlenkaF
Member Author

AlenkaF commented May 15, 2024

Update: I have removed the change in docs/source/format/index.rst (captions for the Specifications section) and will move it to a separate PR, see 97e4217.

AlenkaF changed the title from "GH-39569: [Format][Docs] Add arrow format introductory page" to "GH-41673: [Format][Docs] Add arrow format introductory page" May 15, 2024
github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label Jun 10, 2024
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Jun 10, 2024
@AlenkaF
Member Author

AlenkaF commented Jun 11, 2024

@github-actions crossbow submit preview-docs

Revision: 9f9bbff

Submitted crossbow builds: ursacomputing/crossbow @ actions-cee8fb4563

Task Status
preview-docs GitHub Actions

@AlenkaF
Member Author

AlenkaF commented Jun 11, 2024

Fresh link to the html version: http://crossbow.voltrondata.com/pr_docs/41593/format/Intro.html

github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label Jun 13, 2024
Comment on lines +37 to +42
As the format gets more adoption, it becomes easier for data processing
systems to exchange tabular data. Among other things, an agreed upon
in-memory format, enables the implementations of zero-copy IPC protocols
(inter-process communication without copying data in memory) and
more efficient reading and writing of file formats like CSV, `Apache ORC`_,
and `Apache Parquet`_.
Member

Two comments here (cc @felipecrv as I think you suggested this paragraph):

  • If we mention inter-process communication here, should we also mention zero-copy within-process sharing (i.e. what the C Data Interface provides)? Also, I know we generally say that the IPC protocol is zero-copy, but of course it's not entirely zero-copy, so mentioning it twice in the context of IPC is maybe a bit too much.
  • I find the mention of "more efficient reading and writing of file formats" a bit out of place now, because it's not really the format itself that enables reading e.g. a Parquet file more efficiently. (It's that many Arrow implementations will provide this functionality as well, and that we can more easily reuse such implementations if they read into the Arrow format.)
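
To make the copy versus zero-copy distinction in the first point concrete, here is a minimal sketch using only the Python standard library. It is an analogy, not the C Data Interface: `array.array` and `memoryview` stand in for a producer's buffer and a consumer's view of it, and the names `producer`, `shared`, and `copied` are hypothetical.

```python
import array

# A "producer" owns a contiguous buffer of native 32-bit-style integers.
producer = array.array("i", [1, 2, 3, 4])

# Zero-copy: the consumer gets a view onto the same memory.
shared = memoryview(producer)

# Copy: the consumer gets its own independent buffer.
copied = array.array("i", producer)

# A later write by the producer is visible through the view, not the copy.
producer[0] = 99
shared_first = shared[0]
copied_first = copied[0]
```

After the write, `shared_first` reflects the producer's new value while `copied_first` still holds the original one, which is the observable difference between sharing a buffer and copying it.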

different data types and the way their values are stored in memory varies among
the data types. The specification of how these values are arranged in memory is
what we call a **physical memory layout**. One contiguous region of memory that
stores data for arrays is called a **Buffer**.
Member

I would add here something like (to connect the array with buffer concepts): "An array consists of one or more buffers"
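
As a rough illustration of an array being made up of buffers, the sketch below builds the two buffers of a nullable 32-bit integer array in plain Python: a validity bitmap and a data buffer. It does not use any Arrow library; `build_int32_buffers` is a hypothetical helper, and little-endian values with least-significant-bit-first validity bits are assumed.

```python
import struct

def build_int32_buffers(values):
    """Return (validity_bitmap, data) buffers for a list of ints or None."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)      # mark slot i as valid
        # Null slots still occupy space in the data buffer; 0 is a placeholder.
        data += struct.pack("<i", 0 if v is None else v)
    return bytes(validity), bytes(data)

validity, data = build_int32_buffers([1, None, 3, 4])
decoded = struct.unpack("<4i", data)
```

Here the single array `[1, None, 3, 4]` is backed by two contiguous regions of memory: a one-byte validity buffer and a sixteen-byte data buffer.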

Comment on lines 106 to 108
We read validity bitmaps right-to-left within a group of 8 bits due to
`bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being
used.
Member

I would explicitly call out here that this is depicted that way in the diagrams
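
To illustrate the bit order under discussion, a minimal Python sketch, assuming least-significant-bit numbering within each byte. The helper `bit_is_set` and the example byte are hypothetical and chosen only for illustration.

```python
def bit_is_set(bitmap, i):
    """Return True if slot i is marked valid in an LSB-numbered bitmap."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

# One byte, written the way a diagram would draw it (most significant bit first).
bitmap = bytes([0b00011101])
as_drawn = format(bitmap[0], "08b")

# Slot 0 is the rightmost character of the drawn byte, slot 1 the next one, etc.
slots = [bit_is_set(bitmap, i) for i in range(8)]
read_right_to_left = [ch == "1" for ch in reversed(as_drawn)]
```

Reading the drawn byte from right to left yields the same sequence as asking for slots 0 through 7 in order, which is why the diagrams appear reversed within each group of 8 bits.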

List and large list view
------------------------

List view data type allows arrays to specify out-of-order offsets.
Member

Can you expand this explanation a bit?

(I think the main point is that, in addition to the offsets buffer, there is now also a sizes buffer. The offsets still indicate the start of each element, but the size is no longer inferred from the next offset value; it is coded explicitly in a separate sizes buffer. That allows out-of-order offsets.)
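
The difference described above can be sketched in plain Python. This is an assumption-laden illustration rather than any library's API: `list_to_lists` and `list_view_to_lists` are hypothetical helpers, and Python lists stand in for the offsets, sizes, and child values buffers.

```python
def list_to_lists(offsets, values):
    """List layout: element i spans offsets[i] up to offsets[i + 1]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

def list_view_to_lists(offsets, sizes, values):
    """List view layout: element i starts at offsets[i] and has sizes[i] items."""
    return [values[o:o + s] for o, s in zip(offsets, sizes)]

values = [10, 20, 30, 40, 50]

# List: sizes are implied by consecutive offsets, so offsets must not decrease.
as_list = list_to_lists([0, 2, 2, 5], values)

# List view: explicit sizes let the offsets appear in any order.
as_list_view = list_view_to_lists([2, 0, 0], [3, 2, 0], values)
```

With the same child values, the list view reorders the elements simply by changing the offsets and sizes, without touching the values themselves.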

Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Jun 18, 2024
6 participants