ARROW-4194: [Format][Docs] Remove duplicated / out-of-date logical type information from documentation #4523

Closed
245 changes: 70 additions & 175 deletions docs/source/format/Metadata.rst
@@ -65,96 +65,6 @@ the columns. The Flatbuffers IDL for a field is: ::
The ``type`` is the logical type of the field. Nested types, such as List,
Struct, and Union, have a sequence of child fields.

A JSON representation of the schema is also provided:

Field: ::

{
"name" : "name_of_the_field",
"nullable" : false,
"type" : /* Type */,
"children" : [ /* Field */ ],
}

Type: ::

{
"name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
// fields as defined in the Flatbuffer depending on the type name
}

Union: ::

{
"name" : "union",
"mode" : "Sparse|Dense",
"typeIds" : [ /* integer */ ]
}

The ``typeIds`` field in the Union are the codes used to denote each type, which
may be different from the index of the child array. This is so that the union
type ids do not have to be enumerated from 0.
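
As a rough illustration of the ``typeIds`` indirection described above, the following Python sketch builds a lookup from type code to child index. The ``typeIds`` key comes from the JSON layout in this file; the helper name ``child_index_by_type_id``, the uniqueness check, and the sample codes 5 and 7 are assumptions made for the example, not part of the format description.

```python
def child_index_by_type_id(type_ids):
    """Map each union type code to the position of its child field."""
    # Assumption: codes are unique within one union; the text does not say so.
    if len(set(type_ids)) != len(type_ids):
        raise ValueError("union type ids must be unique")
    return {type_id: index for index, type_id in enumerate(type_ids)}


# Hypothetical union whose two children use the codes 5 and 7 instead of 0 and 1.
union_type = {"name": "union", "mode": "Sparse", "typeIds": [5, 7]}
lookup = child_index_by_type_id(union_type["typeIds"])
print(lookup)
```

A reader of the JSON would resolve a type code through ``lookup`` rather than using the code as an index directly.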

Int: ::

{
"name" : "int",
"bitWidth" : /* integer */,
"isSigned" : /* boolean */
}
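
The Integers section removed later in this diff lists the type names Int8 through Int64 and UInt8 through UInt64. A minimal Python sketch of how the ``bitWidth`` and ``isSigned`` keys could be mapped to those names follows; the function name ``int_type_name`` is hypothetical, and rejecting other widths is an assumption based on that list.

```python
def int_type_name(int_type):
    """Return a name such as 'Int32' or 'UInt8' for a JSON "int" type object."""
    bit_width = int_type["bitWidth"]
    # Assumption: only the four widths named in the Integers section are valid.
    if bit_width not in (8, 16, 32, 64):
        raise ValueError(f"unsupported bitWidth: {bit_width}")
    prefix = "Int" if int_type["isSigned"] else "UInt"
    return f"{prefix}{bit_width}"


print(int_type_name({"name": "int", "bitWidth": 16, "isSigned": False}))
```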

FloatingPoint: ::

{
"name" : "floatingpoint",
"precision" : "HALF|SINGLE|DOUBLE"
}

Decimal: ::

{
"name" : "decimal",
"precision" : /* integer */,
"scale" : /* integer */
}

Timestamp: ::

{
"name" : "timestamp",
"unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
}

Date: ::

{
"name" : "date",
"unit" : "DAY|MILLISECOND"
}

Time: ::

{
"name" : "time",
"unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
"bitWidth": /* integer: 32 or 64 */
}
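
The Time section removed later in this diff states that second and millisecond units use 32 bits and the other units 64 bits. The sketch below checks that rule with plain arithmetic, assuming a time value is a time of day below 24 hours stored in a signed integer (that range is an assumption, not stated in the text). The names ``time_bit_width`` and ``time_type`` are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TICKS_PER_SECOND = {
    "SECOND": 1,
    "MILLISECOND": 10**3,
    "MICROSECOND": 10**6,
    "NANOSECOND": 10**9,
}


def time_bit_width(unit):
    """Pick the smallest signed width (32 or 64) that holds one day of ticks."""
    max_value = SECONDS_PER_DAY * TICKS_PER_SECOND[unit] - 1
    return 32 if max_value <= 2**31 - 1 else 64


def time_type(unit):
    """Build the JSON "time" type object for the given unit."""
    return {"name": "time", "unit": unit, "bitWidth": time_bit_width(unit)}


for unit in TICKS_PER_SECOND:
    print(time_type(unit))
```

Under that assumption the computed widths agree with the rule quoted from the removed section.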

Interval: ::

{
"name" : "interval",
"unit" : "YEAR_MONTH|DAY_TIME"
}

Schema: ::

{
"fields" : [
/* Field */
]
}
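
The templates above use ``/* ... */`` placeholders and, in the Field example, a trailing comma, neither of which a strict JSON parser accepts. The following Python sketch shows one way a concrete schema could be assembled from the Field, Type, and Schema layouts and serialized as strict JSON. The helper ``field``, the column names, and the child name ``item`` are made up for illustration.

```python
import json


def field(name, type_, nullable=False, children=()):
    """Build a Field object with the four keys shown in the Field layout."""
    return {
        "name": name,
        "nullable": nullable,
        "type": type_,
        "children": list(children),
    }


# Hypothetical three-column schema; only the key names come from the layouts above.
schema = {
    "fields": [
        field("id", {"name": "int", "bitWidth": 32, "isSigned": True}),
        field("price", {"name": "decimal", "precision": 10, "scale": 2}, nullable=True),
        field("tags", {"name": "list"}, children=[field("item", {"name": "utf8"})]),
    ]
}

text = json.dumps(schema, indent=2)
print(text)
```

Because nested types carry their child fields in ``children``, the list column is expressed by nesting one ``field`` call inside another.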

Record data headers
-------------------

@@ -280,117 +190,102 @@ categories:
  * Types having equivalent memory layout to a physical nested type (e.g. strings
  use the list representation, but logically are not nested types)

- Integers
- ~~~~~~~~
+ Refer to `Schema.fbs`_ for up-to-date descriptions of each built-in
+ logical type.

- In the first version of Arrow we provide the standard 8-bit through 64-bit size
- standard C integer types, both signed and unsigned:
+ Integration Testing
+ -------------------

- * Signed types: Int8, Int16, Int32, Int64
- * Unsigned types: UInt8, UInt16, UInt32, UInt64
+ A JSON representation of the schema is provided for cross-language
+ integration testing purposes.

- The IDL looks like: ::
+ Field: ::

- table Int {
- bitWidth: int;
- is_signed: bool;
+ {
+ "name" : "name_of_the_field",
+ "nullable" : false,
+ "type" : /* Type */,
+ "children" : [ /* Field */ ],
  }

- The integer endianness is currently set globally at the schema level. If a
- schema is set to be little-endian, then all integer types occurring within must
- be little-endian. Integers that are part of other data representations, such as
- list offsets and union types, must have the same endianness as the entire
- record batch.
-
- Floating point numbers
- ~~~~~~~~~~~~~~~~~~~~~~
-
- We provide 3 types of floating point numbers as fixed bit-width primitive array
-
- - Half precision, 16-bit width
- - Single precision, 32-bit width
- - Double precision, 64-bit width
-
- The IDL looks like: ::
-
- enum Precision:int {HALF, SINGLE, DOUBLE}
+ Type: ::

- table FloatingPoint {
- precision: Precision;
+ {
+ "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+ // fields as defined in the Flatbuffer depending on the type name
  }

- Boolean
- ~~~~~~~
-
- The Boolean logical type is represented as a 1-bit wide primitive physical
- type. The bits are numbered using least-significant bit (LSB) ordering.
-
- Like other fixed bit-width primitive types, boolean data appears as 2 buffers
- in the data header (one bitmap for the validity vector and one for the values).
-
- List
- ~~~~
-
- The ``List`` logical type is the logical (and identically-named) counterpart to
- the List physical type.
-
- In data header form, the list field node contains 2 buffers:
+ Union: ::

- * Validity bitmap
- * List offsets
+ {
+ "name" : "union",
+ "mode" : "Sparse|Dense",
+ "typeIds" : [ /* integer */ ]
+ }

- The buffers associated with a list's child field are handled recursively
- according to the child logical type (e.g. ``List<Utf8>`` vs. ``List<Boolean>``).
+ The ``typeIds`` field in the Union are the codes used to denote each type, which
+ may be different from the index of the child array. This is so that the union
+ type ids do not have to be enumerated from 0.

- Utf8 and Binary
- ~~~~~~~~~~~~~~~
+ Int: ::

- We specify two logical types for variable length bytes:
+ {
+ "name" : "int",
+ "bitWidth" : /* integer */,
+ "isSigned" : /* boolean */
+ }

- * ``Utf8`` data is Unicode values with UTF-8 encoding
- * ``Binary`` is any other variable length bytes
+ FloatingPoint: ::

- These types both have the same memory layout as the nested type ``List<UInt8>``,
- with the constraint that the inner bytes can contain no null values. From a
- logical type perspective they are primitive, not nested types.
+ {
+ "name" : "floatingpoint",
+ "precision" : "HALF|SINGLE|DOUBLE"
+ }

- In data header form, while ``List<UInt8>`` would appear as 2 field nodes (``List``
- and ``UInt8``) and 4 buffers (2 for each of the nodes, as per above), these types
- have a simplified representation single field node (of ``Utf8`` or ``Binary``
- logical type, which have no children) and 3 buffers:
+ Decimal: ::

- * Validity bitmap
- * List offsets
- * Byte data
+ {
+ "name" : "decimal",
+ "precision" : /* integer */,
+ "scale" : /* integer */
+ }

- Decimal
- ~~~~~~~
+ Timestamp: ::

- Decimals are represented as a 2's complement 128-bit (16 byte) signed integer
- in little-endian byte order.
+ {
+ "name" : "timestamp",
+ "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
+ }

- Timestamp
- ~~~~~~~~~
+ Date: ::

- All timestamps are stored as a 64-bit integer, with one of four unit
- resolutions: second, millisecond, microsecond, and nanosecond.
+ {
+ "name" : "date",
+ "unit" : "DAY|MILLISECOND"
+ }

- Date
- ~~~~
+ Time: ::

- We support two different date types:
+ {
+ "name" : "time",
+ "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+ "bitWidth": /* integer: 32 or 64 */
+ }

- * Days since the UNIX epoch as a 32-bit integer
- * Milliseconds since the UNIX epoch as a 64-bit integer
+ Interval: ::

- Time
- ~~~~
+ {
+ "name" : "interval",
+ "unit" : "YEAR_MONTH|DAY_TIME"
+ }

- Time supports the same unit resolutions: second, millisecond, microsecond, and
- nanosecond. We represent time as the smallest integer accommodating the
- indicated unit. For second and millisecond: 32-bit, for the others 64-bit.
+ Schema: ::

- Dictionary encoding
- -------------------
+ {
+ "fields" : [
+ /* Field */
+ ]
+ }

  .. _Flatbuffers: http://github.com/google/flatbuffers
+ .. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs