Fix errors, grammar, and consistency in core format documentation #576
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -54,9 +54,10 @@ Supported Types: all | |||||
| This is the plain encoding that must be supported for types. It is | ||||||
| intended to be the simplest encoding. Values are encoded back to back. | ||||||
|
|
||||||
| The plain encoding is used whenever a more efficient encoding can not be used. It | ||||||
| The plain encoding is used whenever a more efficient encoding cannot be used. It | ||||||
| stores the data in the following format: | ||||||
| - BOOLEAN: [Bit Packed](#BITPACKED), LSB first | ||||||
| - BOOLEAN: bit-packed, LSB first (using the same packing scheme as the | ||||||
| [RLE/bit-packing hybrid](#RLE) encoding) | ||||||
| - INT32: 4 bytes little endian | ||||||
| - INT64: 8 bytes little endian | ||||||
| - INT96: 12 bytes little endian (deprecated) | ||||||
|
|
@@ -68,7 +69,7 @@ stores the data in the following format: | |||||
| For native types, this outputs the data as little endian. Floating | ||||||
| point types are encoded in IEEE. | ||||||
|
|
||||||
| For the byte array type, it encodes the length as a 4 byte little | ||||||
| For the byte array type, it encodes the length as a 4-byte little | ||||||
| endian, followed by the bytes. | ||||||
|
Contributor
Suggested change
|
||||||
|
|
||||||
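For readers following the PLAIN layout touched by the hunks above, here is a minimal Python sketch of how a few of the listed types could be serialized. The helper names (`plain_int32`, `plain_byte_array`, `plain_booleans`) are hypothetical and do not come from any Parquet library. The length prefix is packed as unsigned here, which is an assumption; the bytes are identical to a signed encoding for lengths below 2**31.

```python
import struct


def plain_int32(value: int) -> bytes:
    # INT32: 4 bytes, little endian.
    return struct.pack("<i", value)


def plain_int64(value: int) -> bytes:
    # INT64: 8 bytes, little endian.
    return struct.pack("<q", value)


def plain_byte_array(data: bytes) -> bytes:
    # Byte array: 4-byte little-endian length, followed by the bytes.
    # Assumption: the length is packed as unsigned ("<I").
    return struct.pack("<I", len(data)) + data


def plain_booleans(values) -> bytes:
    # BOOLEAN: bit-packed, LSB first. Value i lands in bit (i % 8)
    # of byte (i // 8); unused bits of the last byte stay 0.
    out = bytearray((len(values) + 7) // 8)
    for i, flag in enumerate(values):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)
```

For instance, `plain_byte_array(b"ab")` yields the four length bytes `02 00 00 00` followed by `ab`.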
| <a name="DICTIONARY"></a> | ||||||
|
|
@@ -82,7 +83,7 @@ written first, before the data pages of the column chunk. | |||||
| Dictionary page format: the entries in the dictionary using the [plain](#PLAIN) encoding. | ||||||
|
|
||||||
| Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), | ||||||
| followed by the values encoded using RLE/Bit packed described above (with the given bit width). | ||||||
| followed by the values encoded using RLE/Bit-Packed described above (with the given bit width). | ||||||
|
Contributor
Suggested change
|
||||||
|
|
||||||
| Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY` | ||||||
| in a data page and `PLAIN` in a dictionary page for new Parquet files. | ||||||
|
|
@@ -130,8 +131,8 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex | |||||
| ``` | ||||||
|
|
||||||
| The reason for this packing order is to have fewer word-boundaries on little-endian hardware | ||||||
| when deserializing more than one byte at at time. This is because 4 bytes can be read into a | ||||||
| 32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by | ||||||
| when deserializing more than one byte at a time. This is because 4 bytes can be read into a | ||||||
| 32-bit register (or 8 bytes into a 64-bit register) and values can be unpacked just by | ||||||
| shifting and ORing with a mask. (to make this optimization work on a big-endian machine, | ||||||
| you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding) | ||||||
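The shift-and-mask unpacking described in this paragraph can be sketched in Python as follows. This is an illustration only: the function name `unpack_lsb_first` is hypothetical, and Python's arbitrary-precision integer stands in for the 32-bit or 64-bit register mentioned in the text.

```python
def unpack_lsb_first(data: bytes, bit_width: int, count: int) -> list:
    # Read all bytes as one little-endian word, the way several bytes
    # would be loaded into a register on little-endian hardware.
    word = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    # Value i occupies bits [i * bit_width, (i + 1) * bit_width).
    return [(word >> (i * bit_width)) & mask for i in range(count)]
```

As a usage example, the three bytes `0x88 0xC6 0xFA` hold the values 0 through 7 packed at 3 bits each, so `unpack_lsb_first(bytes([0x88, 0xC6, 0xFA]), 3, 8)` recovers them in order.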
|
|
||||||
|
|
@@ -151,7 +152,7 @@ data: | |||||
| * Dictionary indices | ||||||
| * Boolean values in data pages, as an alternative to PLAIN encoding | ||||||
|
|
||||||
| Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below: | ||||||
| Whether prepending the four-byte `length` to the `encoded-data` is summarized in the table below: | ||||||
| ``` | ||||||
| +--------------+------------------------+-----------------+ | ||||||
| | Page kind | RLE-encoded data kind | Prepend length? | | ||||||
|
|
@@ -171,10 +172,10 @@ Whether prepending the four-byte `length` to the `encoded-data` is summarized as | |||||
| <a name="BITPACKED"></a> | ||||||
| ### Bit-packed (Deprecated) (BIT_PACKED = 4) | ||||||
|
|
||||||
| This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding. | ||||||
| This is a bit-packed only encoding, which is deprecated; it has been replaced by the [RLE/bit-packing](#RLE) hybrid encoding. | ||||||
| Each value is encoded back to back using a fixed width. | ||||||
| There is no padding between values (except for the last byte, which is padded with 0s). | ||||||
| For example, if the max repetition level was 3 (2 bits) and the max definition level as 3 | ||||||
| For example, if the max repetition level was 3 (2 bits) and the max definition level was 3 | ||||||
| (2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes. | ||||||
|
|
||||||
| This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is a superset of this implementation. | ||||||
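The size arithmetic in the example above (30 values at 2 bits each, padded to a whole byte) can be checked with a short Python sketch. The function name `bit_packed_size` is hypothetical.

```python
def bit_packed_size(num_values: int, bit_width: int) -> int:
    # Total bits, rounded up to whole bytes; the last byte is
    # padded with 0s when the bit count is not a multiple of 8.
    return (num_values * bit_width + 7) // 8
```

With 30 values of width 2, this gives 60 bits rounded up to 8 bytes, leaving 4 padding bits in the final byte.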
|
|
@@ -230,8 +231,8 @@ Each block contains | |||||
| ``` | ||||||
| * the min delta is a zigzag ULEB128 int (we compute a minimum as we need | ||||||
| positive integers for bit packing) | ||||||
| * the bitwidth of each block is stored as a byte | ||||||
| * each miniblock is a list of bit packed ints according to the bit width | ||||||
| * the bitwidth of each miniblock is stored as a byte | ||||||
| * each miniblock is a list of bit-packed ints according to the bit width | ||||||
| stored at the beginning of the block | ||||||
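As a sketch of the "zigzag ULEB128" representation named for the min delta, the following Python code zigzag-maps a signed integer and then writes it as an unsigned LEB128 varint. The function names are hypothetical, and the 64-bit width used for the zigzag step is an assumption made for this illustration.

```python
def zigzag64(n: int) -> int:
    # Map signed to unsigned so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    # Assumption: n fits in a signed 64-bit integer.
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def uleb128(n: int) -> bytes:
    # Emit 7 bits per byte, low bits first; the high bit of each
    # byte signals that more bytes follow.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def zigzag_uleb128(n: int) -> bytes:
    return uleb128(zigzag64(n))
```

A min delta of `-1` becomes the single byte `0x01`, while `64` zigzags to 128 and needs two bytes, `0x80 0x01`.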
|
|
||||||
| To encode a block, we will: | ||||||
|
|
@@ -322,8 +323,6 @@ The delta encoding algorithm described above stores a bit width per miniblock an | |||||
|
|
||||||
| Supported Types: BYTE_ARRAY | ||||||
|
|
||||||
| This encoding is always preferred over PLAIN for byte array columns. | ||||||
|
|
||||||
|
Comment on lines -325 to -326
Contributor
This isn't a typo, you're actually changing the spec here. I think this needs an actual discussion.
||||||
| For this encoding, we will take all the byte array lengths and encode them using delta | ||||||
| encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data just | ||||||
| concatenated back to back. The expected savings is from the cost of encoding the lengths | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -97,7 +97,7 @@ The sort order used for `UUID` values is unsigned byte-wise comparison. | |||||||
| The annotation has two parameters: bit width and sign. | ||||||||
| Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`. | ||||||||
| For signed integers, the second parameter should be `true`, | ||||||||
| for example, a signed integer with bit width of 8 is defined as `INT(8, true)` | ||||||||
| for example, a signed integer with bit width of 8 is defined as `INT(8, true)`. | ||||||||
| Implementations may use these annotations to produce smaller | ||||||||
| in-memory representations when reading data. | ||||||||
|
|
||||||||
|
|
@@ -120,7 +120,7 @@ along with a maximum number of bits in the stored value. | |||||||
| The annotation has two parameters: bit width and sign. | ||||||||
| Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`. | ||||||||
| In case of unsigned integers, the second parameter should be `false`, | ||||||||
| for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)` | ||||||||
| for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`. | ||||||||
| Implementations may use these annotations to produce smaller | ||||||||
| in-memory representations when reading data. | ||||||||
|
|
||||||||
|
|
@@ -166,7 +166,7 @@ unsigned integers with 8, 16, 32, or 64 bit width. | |||||||
| *Forward compatibility:* | ||||||||
|
|
||||||||
| <table> | ||||||||
| <tr colspan="3"> | ||||||||
| <tr> | ||||||||
| <th colspan="3">LogicalType</th> | ||||||||
| <th>ConvertedType</th> | ||||||||
| </tr> | ||||||||
|
|
@@ -219,15 +219,15 @@ scale stores the number of digits of that value that are to the right of the | |||||||
| decimal point, and the precision stores the maximum number of digits supported | ||||||||
| in the unscaled value. | ||||||||
|
|
||||||||
| If not specified, the scale is 0. Scale must be zero or a positive integer less | ||||||||
| than or equal to the precision. Precision is required and must be a non-zero positive | ||||||||
| If not specified, the scale is 0. Scale must be a non-negative integer less | ||||||||
| than or equal to the precision. Precision is required and must be a positive | ||||||||
|
Comment on lines +222 to +223
Contributor
I think this change is less clear, even if correct.
||||||||
| integer. A precision too large for the underlying type (see below) is an error. | ||||||||
|
|
||||||||
| `DECIMAL` can be used to annotate the following types: | ||||||||
| * `int32`: for 1 <= precision <= 9 | ||||||||
| * `int64`: for 1 <= precision <= 18; precision < 10 will produce a | ||||||||
| warning | ||||||||
| * `fixed_len_byte_array`: precision is limited by the array size. Length `n` | ||||||||
| * `fixed_len_byte_array`: `precision` is limited by the array size. Length `n` | ||||||||
| can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits | ||||||||
| * `byte_array`: `precision` is not limited, but is required. The minimum number of | ||||||||
| bytes to store the unscaled value should be used. | ||||||||
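The `fixed_len_byte_array` bound quoted above, `floor(log_10(2^(8*n - 1) - 1))`, can be evaluated exactly with integer arithmetic. The Python sketch below is illustrative and the function name `max_decimal_precision` is hypothetical.

```python
def max_decimal_precision(n: int) -> int:
    # Largest positive value of an n-byte two's-complement integer.
    largest = 2 ** (8 * n - 1) - 1
    # For x >= 1, floor(log10(x)) equals the number of decimal
    # digits of x minus one; counting digits avoids float rounding.
    return len(str(largest)) - 1
```

For `n = 4` and `n = 8` this gives 9 and 18, matching the `int32` and `int64` limits listed above; `n = 16` gives 38.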
|
|
@@ -243,7 +243,7 @@ comparison. | |||||||
|
|
||||||||
| *Compatibility* | ||||||||
|
|
||||||||
| To support compatibility with older readers, implementations of parquet-format should | ||||||||
| To support compatibility with older readers, implementations of parquet-format must | ||||||||
|
Contributor
Another spec change.
||||||||
| write `DecimalType` precision and scale into the corresponding SchemaElement field in metadata. | ||||||||
|
|
||||||||
| ### FLOAT16 | ||||||||
|
|
@@ -271,9 +271,10 @@ The sort order used for `DATE` is signed. | |||||||
|
|
||||||||
| ### TIME | ||||||||
|
|
||||||||
| `TIME` is used for a logical time type without a date with millisecond or microsecond precision. | ||||||||
| `TIME` is used for a logical time type without a date with millisecond, microsecond, | ||||||||
| or nanosecond precision. | ||||||||
| The type has two type parameters: UTC adjustment (`true` or `false`) | ||||||||
| and unit (`MILLIS` or `MICROS`, `NANOS`). | ||||||||
| and unit (`MILLIS`, `MICROS`, or `NANOS`). | ||||||||
|
|
||||||||
| `TIME` with unit `MILLIS` is used for millisecond precision. | ||||||||
| It must annotate an `int32` that stores the number of | ||||||||
|
|
@@ -299,10 +300,10 @@ counterpart, it must annotate an `int32`. | |||||||
| type that is UTC normalized and has `MICROS` precision. Like the logical type | ||||||||
| counterpart, it must annotate an `int64`. | ||||||||
|
|
||||||||
| Despite there is no exact corresponding ConvertedType for local time semantic, | ||||||||
| Although there is no exact corresponding ConvertedType for local time semantic, | ||||||||
| in order to support forward compatibility with those libraries, which annotated | ||||||||
| their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotation, | ||||||||
| Parquet writer implementation *must* annotate local time with legacy annotations too, | ||||||||
| their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotations, | ||||||||
| Parquet writer implementations *must* annotate local time with legacy annotations too, | ||||||||
| as shown below. | ||||||||
|
|
||||||||
| *Backward compatibility:* | ||||||||
|
|
@@ -315,7 +316,7 @@ as shown below. | |||||||
| *Forward compatibility:* | ||||||||
|
|
||||||||
| <table> | ||||||||
| <tr colspan="3"> | ||||||||
| <tr> | ||||||||
| <th colspan="3">LogicalType</th> | ||||||||
| <th>ConvertedType</th> | ||||||||
| </tr> | ||||||||
|
|
@@ -358,7 +359,7 @@ time-line and such interpretations are allowed on purpose. | |||||||
|
|
||||||||
| The `TIMESTAMP` type has two type parameters: | ||||||||
| - `isAdjustedToUTC` must be either `true` or `false`. | ||||||||
| - `unit` must be one of `MILLIS`, `MICROS` or `NANOS`. This list is subject | ||||||||
| - `unit` must be one of `MILLIS`, `MICROS`, or `NANOS`. This list is subject | ||||||||
| to potential expansion in the future. Upon reading, unknown `unit`-s must | ||||||||
| be handled as unsupported features (rather than as errors in the data files). | ||||||||
|
|
||||||||
|
|
@@ -448,7 +449,7 @@ limits and implementations may choose to only support a limited range. | |||||||
| On the other hand, not every combination of year, month, day, hour, minute, | ||||||||
| second and subsecond values can be encoded into an `int64`. Most notably: | ||||||||
|
|
||||||||
| - An arbitrary combination of timestamp fields can not be encoded as a single | ||||||||
| - An arbitrary combination of timestamp fields cannot be encoded as a single | ||||||||
| number if the values for some of the fields are outside of their normal range | ||||||||
| (where the "normal range" corresponds to everyday usage). For example, neither | ||||||||
| of the following can be represented in a timestamp: | ||||||||
|
|
@@ -459,7 +460,7 @@ second and subsecond values can be encoded into an `int64`. Most notably: | |||||||
| - day = 29, month = 2, year = any non-leap year | ||||||||
| - Due to the range of the `int64` type, timestamps using the `NANOS` unit | ||||||||
| can only represent values between 1677-09-21 00:12:43 and 2262-04-11 23:47:16. | ||||||||
| Values outside of this range can not be represented with the `NANOS` | ||||||||
| Values outside of this range cannot be represented with the `NANOS` | ||||||||
| unit. (Other precisions have similar limits but those are outside of the | ||||||||
| domain for practical everyday usage.) | ||||||||
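The `NANOS` range quoted above follows from the bounds of a signed 64-bit integer. The Python sketch below reproduces both endpoints at second resolution. It assumes the count is relative to the Unix epoch (1970-01-01 00:00:00), which is not stated in the lines shown here, and the names `EPOCH` and `nanos_to_datetime` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: nanosecond counts are relative to the Unix epoch.
EPOCH = datetime(1970, 1, 1)

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def nanos_to_datetime(nanos: int) -> datetime:
    # Floor to whole seconds; datetime cannot hold nanoseconds.
    return EPOCH + timedelta(seconds=nanos // 1_000_000_000)


print(nanos_to_datetime(INT64_MIN))  # 1677-09-21 00:12:43
print(nanos_to_datetime(INT64_MAX))  # 2262-04-11 23:47:16
```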
|
|
||||||||
|
|
@@ -475,10 +476,10 @@ type counterpart, it must annotate an `int64`. | |||||||
| logical type that is UTC normalized and has `MICROS` precision. Like the logical | ||||||||
| type counterpart, it must annotate an `int64`. | ||||||||
|
|
||||||||
| Despite there is no exact corresponding ConvertedType for local timestamp semantic, | ||||||||
| Although there is no exact corresponding ConvertedType for local timestamp semantic, | ||||||||
| in order to support forward compatibility with those libraries, which annotated | ||||||||
| their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotation, | ||||||||
| Parquet writer implementation *must* annotate local timestamps with legacy annotations too, | ||||||||
| their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotations, | ||||||||
| Parquet writer implementations *must* annotate local timestamps with legacy annotations too, | ||||||||
| as shown below. | ||||||||
|
|
||||||||
| *Backward compatibility:* | ||||||||
|
|
@@ -491,7 +492,7 @@ as shown below. | |||||||
| *Forward compatibility:* | ||||||||
|
|
||||||||
| <table> | ||||||||
| <tr colspan="3"> | ||||||||
| <tr> | ||||||||
| <th colspan="3">LogicalType</th> | ||||||||
| <th>ConvertedType</th> | ||||||||
| </tr> | ||||||||
|
|
@@ -544,7 +545,8 @@ are found during reading, they must be ignored. | |||||||
|
|
||||||||
| ## Embedded Types | ||||||||
|
|
||||||||
| Embedded types do not have type-specific orderings. | ||||||||
| Embedded types do not have type-specific orderings beyond the unsigned | ||||||||
| byte-wise comparison of their physical type (`BYTE_ARRAY`). | ||||||||
|
Comment on lines +548 to +549
Contributor
This isn't correct since
Suggested change
|
||||||||
|
|
||||||||
| ### JSON | ||||||||
|
|
||||||||
|
|
@@ -606,7 +608,7 @@ optional group variant_shredded (VARIANT(1)) { | |||||||
| ### GEOMETRY | ||||||||
|
|
||||||||
| `GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format | ||||||||
| with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY` | ||||||||
|
Contributor
As with the other PR, I think "edges" is deliberate
||||||||
| with linear/planar edge interpolation. It must annotate a `BYTE_ARRAY` | ||||||||
| primitive type. See [Geospatial.md](Geospatial.md) for more detail. | ||||||||
|
|
||||||||
| The type has only one type parameter: | ||||||||
|
|
@@ -621,14 +623,14 @@ are found during reading, they must be ignored. | |||||||
| ### GEOGRAPHY | ||||||||
|
|
||||||||
| `GEOGRAPHY` is used for geospatial features in the WKB format with an explicit | ||||||||
| (non-linear/non-planar) edges interpolation algorithm. It must annotate a | ||||||||
| (non-linear/non-planar) edge interpolation algorithm. It must annotate a | ||||||||
| `BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail. | ||||||||
|
|
||||||||
| The type has two type parameters: | ||||||||
| - `crs`: An optional string value for CRS. It must be a geographic CRS, where | ||||||||
| longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90]. | ||||||||
| If unset, the CRS defaults to `"OGC:CRS84"`. | ||||||||
| - `algorithm`: An optional enum value to describes the edge interpolation | ||||||||
| - `algorithm`: An optional enum value that describes the edge interpolation | ||||||||
| algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`, | ||||||||
| `KARNEY`. If unset, the algorithm defaults to `SPHERICAL`. | ||||||||
|
|
||||||||
|
|
@@ -834,7 +836,7 @@ to values. `MAP` must annotate a 3-level structure: | |||||||
| field of the repeated `key_value` group. | ||||||||
| * The `value` field encodes the map's value type and repetition. This field can | ||||||||
| be `required`, `optional`, or omitted. It must always be the second field of | ||||||||
| the repeated `key_value` group if present. In case of not present, it can be | ||||||||
| the repeated `key_value` group if present. If not present, it can be | ||||||||
| represented as a map with all null values or as a set of keys. | ||||||||
|
|
||||||||
| The following example demonstrates the type for a non-null map from strings to | ||||||||
|
|
||||||||
Nice catch!