From 27641260455808eff392deaf3cd2b87e7258913a Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Tue, 11 Jun 2019 19:48:09 -0500
Subject: [PATCH] Remove duplicated / out-of-date logical type information from documentation and direct readers to Schema.fbs

---
 docs/source/format/Metadata.rst | 245 +++++++++-----------------------
 1 file changed, 70 insertions(+), 175 deletions(-)

diff --git a/docs/source/format/Metadata.rst b/docs/source/format/Metadata.rst
index 293d0113875a6..b6c2a5f6caee9 100644
--- a/docs/source/format/Metadata.rst
+++ b/docs/source/format/Metadata.rst
@@ -65,96 +65,6 @@ the columns. The Flatbuffers IDL for a field is: ::
 The ``type`` is the logical type of the field. Nested types, such as List,
 Struct, and Union, have a sequence of child fields.
 
-A JSON representation of the schema is also provided:
-
-Field: ::
-
-    {
-      "name" : "name_of_the_field",
-      "nullable" : false,
-      "type" : /* Type */,
-      "children" : [ /* Field */ ],
-    }
-
-Type: ::
-
-    {
-      "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
-      // fields as defined in the Flatbuffer depending on the type name
-    }
-
-Union: ::
-
-    {
-      "name" : "union",
-      "mode" : "Sparse|Dense",
-      "typeIds" : [ /* integer */ ]
-    }
-
-The ``typeIds`` field in the Union are the codes used to denote each type, which
-may be different from the index of the child array. This is so that the union
-type ids do not have to be enumerated from 0.
-
-Int: ::
-
-    {
-      "name" : "int",
-      "bitWidth" : /* integer */,
-      "isSigned" : /* boolean */
-    }
-
-FloatingPoint: ::
-
-    {
-      "name" : "floatingpoint",
-      "precision" : "HALF|SINGLE|DOUBLE"
-    }
-
-Decimal: ::
-
-    {
-      "name" : "decimal",
-      "precision" : /* integer */,
-      "scale" : /* integer */
-    }
-
-Timestamp: ::
-
-    {
-      "name" : "timestamp",
-      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
-    }
-
-Date: ::
-
-    {
-      "name" : "date",
-      "unit" : "DAY|MILLISECOND"
-    }
-
-Time: ::
-
-    {
-      "name" : "time",
-      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
-      "bitWidth": /* integer: 32 or 64 */
-    }
-
-Interval: ::
-
-    {
-      "name" : "interval",
-      "unit" : "YEAR_MONTH|DAY_TIME"
-    }
-
-Schema: ::
-
-    {
-      "fields" : [
-        /* Field */
-      ]
-    }
-
 Record data headers
 -------------------
 
@@ -280,117 +190,102 @@ categories:
 * Types having equivalent memory layout to a physical nested type (e.g. strings
   use the list representation, but logically are not nested types)
 
-Integers
-~~~~~~~~
+Refer to `Schema.fbs`_ for up-to-date descriptions of each built-in
+logical type.
 
-In the first version of Arrow we provide the standard 8-bit through 64-bit size
-standard C integer types, both signed and unsigned:
+Integration Testing
+-------------------
 
-* Signed types: Int8, Int16, Int32, Int64
-* Unsigned types: UInt8, UInt16, UInt32, UInt64
+A JSON representation of the schema is provided for cross-language
+integration testing purposes.
 
-The IDL looks like: ::
+Field: ::
 
-    table Int {
-      bitWidth: int;
-      is_signed: bool;
+    {
+      "name" : "name_of_the_field",
+      "nullable" : false,
+      "type" : /* Type */,
+      "children" : [ /* Field */ ],
     }
 
-The integer endianness is currently set globally at the schema level. If a
-schema is set to be little-endian, then all integer types occurring within must
-be little-endian. Integers that are part of other data representations, such as
-list offsets and union types, must have the same endianness as the entire
-record batch.
-
-Floating point numbers
-~~~~~~~~~~~~~~~~~~~~~~
-
-We provide 3 types of floating point numbers as fixed bit-width primitive array
-
-- Half precision, 16-bit width
-- Single precision, 32-bit width
-- Double precision, 64-bit width
-
-The IDL looks like: ::
-
-    enum Precision:int {HALF, SINGLE, DOUBLE}
+Type: ::
 
-    table FloatingPoint {
-      precision: Precision;
+    {
+      "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+      // fields as defined in the Flatbuffer depending on the type name
     }
 
-Boolean
-~~~~~~~
-
-The Boolean logical type is represented as a 1-bit wide primitive physical
-type. The bits are numbered using least-significant bit (LSB) ordering.
-
-Like other fixed bit-width primitive types, boolean data appears as 2 buffers
-in the data header (one bitmap for the validity vector and one for the values).
-
-List
-~~~~
-
-The ``List`` logical type is the logical (and identically-named) counterpart to
-the List physical type.
-
-In data header form, the list field node contains 2 buffers:
+Union: ::
 
-* Validity bitmap
-* List offsets
+    {
+      "name" : "union",
+      "mode" : "Sparse|Dense",
+      "typeIds" : [ /* integer */ ]
+    }
 
-The buffers associated with a list's child field are handled recursively
-according to the child logical type (e.g. ``List`` vs. ``List``).
+The ``typeIds`` field in the Union are the codes used to denote each type, which
+may be different from the index of the child array. This is so that the union
+type ids do not have to be enumerated from 0.
 
-Utf8 and Binary
-~~~~~~~~~~~~~~~
+Int: ::
 
-We specify two logical types for variable length bytes:
+    {
+      "name" : "int",
+      "bitWidth" : /* integer */,
+      "isSigned" : /* boolean */
+    }
 
-* ``Utf8`` data is Unicode values with UTF-8 encoding
-* ``Binary`` is any other variable length bytes
+FloatingPoint: ::
 
-These types both have the same memory layout as the nested type ``List``,
-with the constraint that the inner bytes can contain no null values. From a
-logical type perspective they are primitive, not nested types.
+    {
+      "name" : "floatingpoint",
+      "precision" : "HALF|SINGLE|DOUBLE"
+    }
 
-In data header form, while ``List`` would appear as 2 field nodes (``List``
-and ``UInt8``) and 4 buffers (2 for each of the nodes, as per above), these types
-have a simplified representation single field node (of ``Utf8`` or ``Binary``
-logical type, which have no children) and 3 buffers:
+Decimal: ::
 
-* Validity bitmap
-* List offsets
-* Byte data
+    {
+      "name" : "decimal",
+      "precision" : /* integer */,
+      "scale" : /* integer */
+    }
 
-Decimal
-~~~~~~~
+Timestamp: ::
 
-Decimals are represented as a 2's complement 128-bit (16 byte) signed integer
-in little-endian byte order.
+    {
+      "name" : "timestamp",
+      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
+    }
 
-Timestamp
-~~~~~~~~~
+Date: ::
 
-All timestamps are stored as a 64-bit integer, with one of four unit
-resolutions: second, millisecond, microsecond, and nanosecond.
+    {
+      "name" : "date",
+      "unit" : "DAY|MILLISECOND"
+    }
 
-Date
-~~~~
+Time: ::
 
-We support two different date types:
+    {
+      "name" : "time",
+      "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+      "bitWidth": /* integer: 32 or 64 */
+    }
 
-* Days since the UNIX epoch as a 32-bit integer
-* Milliseconds since the UNIX epoch as a 64-bit integer
+Interval: ::
 
-Time
-~~~~
+    {
+      "name" : "interval",
+      "unit" : "YEAR_MONTH|DAY_TIME"
+    }
 
-Time supports the same unit resolutions: second, millisecond, microsecond, and
-nanosecond. We represent time as the smallest integer accommodating the
-indicated unit. For second and millisecond: 32-bit, for the others 64-bit.
+Schema: ::
 
-Dictionary encoding
--------------------
+    {
+      "fields" : [
+        /* Field */
+      ]
+    }
 
 .. _Flatbuffers: http://github.com/google/flatbuffers
+.. _Schema.fbs: https://github.com/apache/arrow/blob/master/format/Schema.fbs