GH-41673: [Format][Docs] Add arrow format introductory page #41593

AlenkaF · 2024-05-08T15:32:18Z

Rationale for this change

The documentation for Arrow Format could be improved:

not all types are listed
not all layouts are explained

What changes are included in this PR?

This PR includes:

motivation behind the columnar format
the different physical layouts, each explained with a diagram comparing an example type to its physical layout
Arrow terminology
Extension types and sharing of Arrow data

All of this goes into a separate "introduction" page with no technical details. The Specifications index page is also restructured to include captions and to make the left sidebar menu better organised.

Note: a table listing all types together with their physical layouts will be added to the existing Columnar.rst page in a separate PR: #14752

Are these changes tested?

No, this is a docs change.

Are there any user-facing changes?

No.

GitHub Issue: [Docs][Format] Add an introductory page to the Arrow Columnar Format #41673

AlenkaF · 2024-05-09T13:12:08Z

cc @amoeba this could use a look already. I think everything I wanted to add is here. I will need to do one more general read-through before marking it ready for review, though.

AlenkaF · 2024-05-09T13:14:13Z

@github-actions crossbow submit preview-docs

github-actions · 2024-05-09T13:16:27Z

Revision: 3cdd97a

Submitted crossbow builds: ursacomputing/crossbow @ actions-4a1cc2326d

Task	Status
preview-docs

AlenkaF · 2024-05-09T13:16:43Z

docs/source/format/index.rst

   CDataInterface
   CStreamInterface
   CDeviceDataInterface


@jorisvandenbossche I have kept the C Stream Interface in a separate file, as the structure is nicer IMHO.

AlenkaF · 2024-05-13T16:33:27Z

@github-actions crossbow submit preview-docs

github-actions · 2024-05-13T16:35:48Z

Revision: 4d2bf8a

Submitted crossbow builds: ursacomputing/crossbow @ actions-cc7da250f4

Task	Status
preview-docs

AlenkaF · 2024-05-13T17:53:22Z

Not sure why the captions in the left sidebar menu are not visible in the crossbow preview build but are visible for me locally.

AlenkaF · 2024-05-15T12:40:49Z

Update: I have removed the change in docs/source/format/index.rst (captions for the Specifications section) and will move it to a separate PR, see 97e4217.

AlenkaF · 2024-06-11T04:00:27Z

@github-actions crossbow submit preview-docs

github-actions · 2024-06-11T04:02:42Z

Revision: 9f9bbff

Submitted crossbow builds: ursacomputing/crossbow @ actions-cee8fb4563

Task	Status
preview-docs

AlenkaF · 2024-06-11T07:32:14Z

Fresh link to the html version: http://crossbow.voltrondata.com/pr_docs/41593/format/Intro.html

jorisvandenbossche · 2024-06-13T07:31:05Z

docs/source/format/Intro.rst

+As the format gets more adoption, it becomes easier for data processing
+systems to exchange tabular data. Among other things, an agreed upon
+in-memory format, enables the implementations of zero-copy IPC protocols
+(inter-process communication without copying data in memory) and
+more efficient reading and writing of file formats like CSV, `Apache ORC`_,
+and `Apache Parquet`_.


Two comments here (cc @felipecrv as I think you suggested this paragraph):

If we mention inter-process communication here, should we also mention zero-copy within-process sharing (i.e. what the C Data Interface provides)? Also, I know we generally say that the IPC protocol is zero-copy, but of course it's not entirely zero-copy, so mentioning it twice in the context of IPC is maybe a bit too much.

I find the mention of "more efficient reading and writing of file formats" a bit out of place now, because it's not really the format itself that makes reading e.g. a Parquet file more efficient. (It's that many Arrow implementations will provide this functionality as well, and that we can more easily reuse such implementations if they read into the Arrow format.)
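
To make the "zero-copy within a process" distinction concrete, here is a minimal sketch in plain Python. It uses the built-in `memoryview` purely as a stand-in for sharing a buffer without copying; it is not the C Data Interface, and the variable names are made up for illustration.

```python
# A producer owns some bytes (imagine the data buffer of an array).
producer_buffer = bytearray(b"\x01\x02\x03\x04")

# Zero-copy: the consumer receives a view onto the same memory.
shared_view = memoryview(producer_buffer)[1:3]

# Copying: the consumer receives its own independent bytes.
copied = bytes(producer_buffer[1:3])

# A later write by the producer is visible through the view,
# but not in the copy, which shows that only the copy duplicated memory.
producer_buffer[1] = 0xFF
view_sees_write = shared_view[0] == 0xFF
copy_sees_write = copied[0] == 0xFF
```

The view reflects the producer's write because both names refer to the same memory, which is the property the paragraph is trying to describe.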

jorisvandenbossche · 2024-06-13T07:35:53Z

docs/source/format/Intro.rst

+different data types and the way their values are stored in memory varies among
+the data types. The specification of how these values are arranged in memory is
+what we call a **physical memory layout**. One contiguous region of memory that
+stores data for arrays is called a **Buffer**.


I would add something like this here (to connect the array and buffer concepts): "An array consists of one or more buffers."
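
As an illustration of that sentence, here is a rough sketch of a string array made of two buffers, an offsets buffer and a data buffer, written in plain Python with no Arrow library. The helper names `build_string_buffers` and `get_string` are hypothetical, and the validity bitmap is left out on the assumption that there are no nulls.

```python
import struct


def build_string_buffers(values):
    """Pack a list of str into (offsets_buffer, data_buffer) as raw bytes."""
    data = bytearray()
    offsets = [0]
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # end of this element = start of the next
    # n + 1 offsets, stored as little-endian 32-bit signed integers.
    offsets_buffer = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buffer, bytes(data)


def get_string(offsets_buffer, data_buffer, i):
    """Read element i by slicing the data buffer between two adjacent offsets."""
    start, end = struct.unpack_from("<2i", offsets_buffer, 4 * i)
    return data_buffer[start:end].decode("utf-8")


offsets_buffer, data_buffer = build_string_buffers(["arrow", "", "format"])
```

Neither buffer is meaningful on its own: the data buffer is just `b"arrowformat"`, and only the offsets say where each element starts and ends.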

jorisvandenbossche · 2024-06-13T07:39:01Z

docs/source/format/Intro.rst

+   We read validity bitmaps right-to-left within a group of 8 bits due to
+   `bit-endianness <https://en.wikipedia.org/wiki/Bit_numbering>`_ being
+   used.


I would explicitly call out here that this is how the bitmaps are depicted in the diagrams.
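
A small sketch of why the bits read right-to-left when a byte is drawn in the usual way, in plain Python. The helpers `pack_validity` and `is_set` are hypothetical names, and the sketch assumes least-significant-bit numbering within each byte.

```python
def pack_validity(is_valid):
    """Pack booleans into a bitmap, element i going to bit (i % 8) of byte (i // 8)."""
    bitmap = bytearray((len(is_valid) + 7) // 8)
    for i, valid in enumerate(is_valid):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_set(bitmap, i):
    """Return True if element i is marked valid."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


# Five elements: valid, null, valid, valid, null.
bitmap = pack_validity([True, False, True, True, False])

# Written out most-significant bit first, as a byte is usually drawn,
# element 0 ends up as the rightmost digit.
bits_as_drawn = format(bitmap[0], "08b")
```

Here `bits_as_drawn` is `"00001101"`: reading from the right gives 1, 0, 1, 1, 0 for elements 0 to 4, and the three leftmost digits are padding.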

jorisvandenbossche · 2024-06-13T07:59:55Z

docs/source/format/Intro.rst

+List and large list view
+------------------------
+
+List view data type allows arrays to specify out-of-order offsets.


Can you expand this explanation a bit?

(I think the main point is that, in addition to the offsets buffer, there is now also a sizes buffer. The offsets still indicate the start of each element, but the size is no longer inferred from the next offset value; it is now encoded explicitly in a separate sizes buffer. That allows out-of-order offsets.)
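
To spell out the difference, here is a minimal sketch in plain Python, with ordinary lists standing in for the buffers. The function names `decode_list` and `decode_list_view` are hypothetical and only illustrate how each element is located.

```python
def decode_list(offsets, values):
    """List layout: element i spans offsets[i] to offsets[i + 1]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_list_view(offsets, sizes, values):
    """List-view layout: element i starts at offsets[i] and has sizes[i] items."""
    return [values[start:start + size] for start, size in zip(offsets, sizes)]


values = [12, -7, 25, 0, 200]

# List: n + 1 offsets that never decrease; sizes are implied by neighbours.
as_list = decode_list([0, 2, 2, 5], values)

# List view: n offsets plus n sizes; offsets may be out of order and
# elements may even overlap in the child values.
as_view = decode_list_view([2, 0, 2], [3, 2, 1], values)
```

With the same child values, the list-view offsets `[2, 0, 2]` would be invalid for a plain list, but together with the sizes they describe `[[25, 0, 200], [12, -7], [25]]`.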

Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

github-actions bot added the Component: Documentation and awaiting review labels May 8, 2024

AlenkaF mentioned this pull request May 8, 2024

[Format] Physical representation of columnar format not well documented #39569

Open

AlenkaF commented May 9, 2024

github-actions bot added the awaiting committer review label and removed the awaiting review label May 9, 2024

AlenkaF force-pushed the arrow-format-docs-101 branch from 803b4b9 to 4d2bf8a May 13, 2024 16:33

AlenkaF marked this pull request as ready for review May 13, 2024 17:53

AlenkaF requested review from amoeba, raulcd and jorisvandenbossche May 14, 2024 07:56

AlenkaF changed the title ~~GH-39569: [Format][Docs] Add arrow format introductory page~~ GH-41673: [Format][Docs] Add arrow format introductory page May 15, 2024

amoeba reviewed May 15, 2024

docs/source/format/FormatIntro.rst Outdated

Change title and update first intro section

3c8b4fa

AlenkaF force-pushed the arrow-format-docs-101 branch from 4eef9dc to 3c8b4fa June 10, 2024 12:28

AlenkaF added 2 commits June 10, 2024 14:30

Add missing paragraph in the intro section

f330074

Update union diagrams

2524fda

github-actions bot added the awaiting changes label and removed the awaiting change review label Jun 10, 2024

Add a note about bit-endianness

9f9bbff

github-actions bot added the awaiting change review label and removed the awaiting changes label Jun 10, 2024