diff --git a/Compression.md b/Compression.md index c1cad5d2..c397476a 100644 --- a/Compression.md +++ b/Compression.md @@ -48,7 +48,7 @@ No-op codec. Data is left uncompressed. A codec based on the [Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt). If any ambiguity arises when implementing this format, the implementation -provided by Google Snappy [library](https://github.com/google/snappy/) +provided by the [Snappy compression library](https://github.com/google/snappy/) is authoritative. ### GZIP @@ -58,7 +58,7 @@ formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952). If any ambiguity arises when implementing this format, the implementation provided by the [zlib compression library](https://zlib.net/) is authoritative. -Readers should support reading pages containing multiple GZIP members, however, +Readers should support reading pages containing multiple GZIP members; however, as this has historically not been supported by all implementations, it is recommended that writers refrain from creating such pages by default for better interoperability. @@ -72,7 +72,7 @@ A codec based on or interoperable with the A codec based on the Brotli format defined by [RFC 7932](https://tools.ietf.org/html/rfc7932). If any ambiguity arises when implementing this format, the implementation -provided by the [Brotli compression library](https://github.com/google/brotli) +provided by the [Brotli compression library](https://github.com/google/brotli) is authoritative. ### LZ4 @@ -89,7 +89,7 @@ switch to the newer, interoperable `LZ4_RAW` codec. ### ZSTD A codec based on the Zstandard format defined by -[RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises +[RFC 8878](https://tools.ietf.org/html/rfc8878). If any ambiguity arises when implementing this format, the implementation provided by the [Zstandard compression library](https://facebook.github.io/zstd/) is authoritative. 
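The GZIP hunk above recommends that readers accept pages containing multiple GZIP members. A minimal Python sketch of such a reader follows, using only the standard `zlib` module. The function name `decompress_gzip_members` is hypothetical and is not taken from any Parquet implementation.

```python
import gzip
import zlib


def decompress_gzip_members(data: bytes) -> bytes:
    """Decompress a buffer holding one or more concatenated GZIP members.

    Hypothetical helper for illustration only.
    """
    out = []
    remaining = data
    while remaining:
        # wbits=31 (MAX_WBITS | 16) selects the gzip container from RFC 1952.
        member = zlib.decompressobj(31)
        out.append(member.decompress(remaining))
        out.append(member.flush())
        if not member.eof:
            raise ValueError("truncated GZIP member")
        # Bytes after the end of this member belong to the next member.
        remaining = member.unused_data
    return b"".join(out)


if __name__ == "__main__":
    # Two members written back to back, as a multi-member page would contain.
    page = gzip.compress(b"abc") + gzip.compress(b"def")
    print(decompress_gzip_members(page))
```

Python's own `gzip.decompress` also accepts concatenated members; the explicit loop is shown only to make the member boundaries visible.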
diff --git a/Encodings.md b/Encodings.md index 1c766fb5..7702c6d7 100644 --- a/Encodings.md +++ b/Encodings.md @@ -54,9 +54,10 @@ Supported Types: all This is the plain encoding that must be supported for types. It is intended to be the simplest encoding. Values are encoded back to back. -The plain encoding is used whenever a more efficient encoding can not be used. It +The plain encoding is used whenever a more efficient encoding cannot be used. It stores the data in the following format: - - BOOLEAN: [Bit Packed](#BITPACKED), LSB first + - BOOLEAN: bit-packed, LSB first (using the same packing scheme as the + [RLE/bit-packing hybrid](#RLE) encoding) - INT32: 4 bytes little endian - INT64: 8 bytes little endian - INT96: 12 bytes little endian (deprecated) @@ -68,8 +69,8 @@ stores the data in the following format: For native types, this outputs the data as little endian. Floating point types are encoded in IEEE. -For the byte array type, it encodes the length as a 4 byte little -endian, followed by the bytes. +For the byte array type, it encodes the length as a 4-byte little +endian integer, followed by the bytes. ### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) @@ -82,7 +83,7 @@ written first, before the data pages of the column chunk. Dictionary page format: the entries in the dictionary using the [plain](#PLAIN) encoding. Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), -followed by the values encoded using RLE/Bit packed described above (with the given bit width). +followed by the values encoded using the RLE/Bit-Packing described above (with the given bit width). Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY` in a data page and `PLAIN` in a dictionary page for new Parquet files. 
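The PLAIN layouts touched by the hunks above can be illustrated with a short Python sketch. It covers only `BYTE_ARRAY` (4-byte little-endian length, then the bytes) and `BOOLEAN` (bit-packed, LSB first). Both function names are hypothetical, and two details are assumptions of this sketch rather than statements from the diff: the length is packed as an unsigned 32-bit integer, and unused bits of the final boolean byte are set to zero.

```python
import struct


def plain_encode_byte_arrays(values):
    """PLAIN-encode BYTE_ARRAY values back to back (hypothetical helper)."""
    out = bytearray()
    for v in values:
        # Assumption: length packed as unsigned 32-bit little endian.
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)


def plain_encode_booleans(values):
    """PLAIN-encode BOOLEAN values, bit-packed LSB first (hypothetical helper)."""
    # Assumption: bits left over in the last byte stay zero.
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


if __name__ == "__main__":
    print(plain_encode_byte_arrays([b"ab", b""]).hex())
    print(plain_encode_booleans([True, False, True]).hex())
```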
@@ -130,8 +131,8 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex ``` The reason for this packing order is to have fewer word-boundaries on little-endian hardware - when deserializing more than one byte at at time. This is because 4 bytes can be read into a - 32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by + when deserializing more than one byte at a time. This is because 4 bytes can be read into a + 32-bit register (or 8 bytes into a 64-bit register) and values can be unpacked just by shifting and ORing with a mask. (to make this optimization work on a big-endian machine, you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding) @@ -151,7 +152,7 @@ data: * Dictionary indices * Boolean values in data pages, as an alternative to PLAIN encoding -Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below: +Whether prepending the four-byte `length` to the `encoded-data` is summarized in the table below: ``` +--------------+------------------------+-----------------+ | Page kind | RLE-encoded data kind | Prepend length? | @@ -171,10 +172,10 @@ Whether prepending the four-byte `length` to the `encoded-data` is summarized as ### Bit-packed (Deprecated) (BIT_PACKED = 4) -This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding. +This is a bit-packed only encoding, which is deprecated; it has been replaced by the [RLE/bit-packing](#RLE) hybrid encoding. Each value is encoded back to back using a fixed width. There is no padding between values (except for the last byte, which is padded with 0s). -For example, if the max repetition level was 3 (2 bits) and the max definition level as 3 +For example, if the max repetition level was 3 (2 bits) and the max definition level was 3 (2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes. 
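The byte count in the example above comes from rounding the total number of bits up to a whole byte. A small Python sketch of that arithmetic, with hypothetical helper names:

```python
def level_bit_width(max_level: int) -> int:
    """Bits needed to store any level in the range 0..max_level."""
    return max_level.bit_length()


def bit_packed_size(num_values: int, width: int) -> int:
    """Bytes used by num_values fixed-width values; the last byte is padded."""
    return (num_values * width + 7) // 8


if __name__ == "__main__":
    width = level_bit_width(3)         # max level 3 needs 2 bits
    print(bit_packed_size(30, width))  # 30 * 2 = 60 bits, rounded up to bytes
```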
This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is a superset of this implementation. @@ -230,8 +231,8 @@ Each block contains ``` * the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers for bit packing) - * the bitwidth of each block is stored as a byte - * each miniblock is a list of bit packed ints according to the bit width + * the bitwidth of each miniblock is stored as a byte + * each miniblock is a list of bit-packed ints according to the bit width stored at the beginning of the block To encode a block, we will: diff --git a/LogicalTypes.md b/LogicalTypes.md index 795c223f..690ae3f5 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -97,7 +97,7 @@ The sort order used for `UUID` values is unsigned byte-wise comparison. The annotation has two parameters: bit width and sign. Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`. For signed integers, the second parameter should be `true`, -for example, a signed integer with bit width of 8 is defined as `INT(8, true)` +for example, a signed integer with bit width of 8 is defined as `INT(8, true)`. Implementations may use these annotations to produce smaller in-memory representations when reading data. @@ -120,7 +120,7 @@ along with a maximum number of bits in the stored value. The annotation has two parameters: bit width and sign. Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`. In case of unsigned integers, the second parameter should be `false`, -for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)` +for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`. Implementations may use these annotations to produce smaller in-memory representations when reading data. @@ -166,7 +166,7 @@ unsigned integers with 8, 16, 32, or 64 bit width. *Forward compatibility:* - + @@ -227,7 +227,7 @@ integer. 
A precision too large for the underlying type (see below) is an error. * `int32`: for 1 <= precision <= 9 * `int64`: for 1 <= precision <= 18; precision < 10 will produce a warning -* `fixed_len_byte_array`: precision is limited by the array size. Length `n` +* `fixed_len_byte_array`: `precision` is limited by the array size. Length `n` can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits * `byte_array`: `precision` is not limited, but is required. The minimum number of bytes to store the unscaled value should be used. @@ -271,9 +271,10 @@ The sort order used for `DATE` is signed. ### TIME -`TIME` is used for a logical time type without a date with millisecond or microsecond precision. +`TIME` is used for a logical time type without a date with millisecond, microsecond, +or nanosecond precision. The type has two type parameters: UTC adjustment (`true` or `false`) -and unit (`MILLIS` or `MICROS`, `NANOS`). +and unit (`MILLIS`, `MICROS`, or `NANOS`). `TIME` with unit `MILLIS` is used for millisecond precision. It must annotate an `int32` that stores the number of @@ -299,10 +300,10 @@ counterpart, it must annotate an `int32`. type that is UTC normalized and has `MICROS` precision. Like the logical type counterpart, it must annotate an `int64`. -Despite there is no exact corresponding ConvertedType for local time semantic, +Although there is no exact corresponding ConvertedType for local time semantic, in order to support forward compatibility with those libraries, which annotated -their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotation, -Parquet writer implementation *must* annotate local time with legacy annotations too, +their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotations, +Parquet writer implementations *must* annotate local time with legacy annotations too, as shown below. *Backward compatibility:* @@ -315,7 +316,7 @@ as shown below. *Forward compatibility:*
LogicalType ConvertedType
- + @@ -358,7 +359,7 @@ time-line and such interpretations are allowed on purpose. The `TIMESTAMP` type has two type parameters: - `isAdjustedToUTC` must be either `true` or `false`. -- `unit` must be one of `MILLIS`, `MICROS` or `NANOS`. This list is subject +- `unit` must be one of `MILLIS`, `MICROS`, or `NANOS`. This list is subject to potential expansion in the future. Upon reading, unknown `unit`-s must be handled as unsupported features (rather than as errors in the data files). @@ -448,7 +449,7 @@ limits and implementations may choose to only support a limited range. On the other hand, not every combination of year, month, day, hour, minute, second and subsecond values can be encoded into an `int64`. Most notably: -- An arbitrary combination of timestamp fields can not be encoded as a single +- An arbitrary combination of timestamp fields cannot be encoded as a single number if the values for some of the fields are outside of their normal range (where the "normal range" corresponds to everyday usage). For example, neither of the following can be represented in a timestamp: @@ -459,7 +460,7 @@ second and subsecond values can be encoded into an `int64`. Most notably: - day = 29, month = 2, year = any non-leap year - Due to the range of the `int64` type, timestamps using the `NANOS` unit can only represent values between 1677-09-21 00:12:43 and 2262-04-11 23:47:16. - Values outside of this range can not be represented with the `NANOS` + Values outside of this range cannot be represented with the `NANOS` unit. (Other precisions have similar limits but those are outside of the domain for practical everyday usage.) @@ -475,10 +476,10 @@ type counterpart, it must annotate an `int64`. logical type that is UTC normalized and has `MICROS` precision. Like the logical type counterpart, it must annotate an `int64`. 
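The `NANOS` range quoted in the hunk above (1677-09-21 00:12:43 to 2262-04-11 23:47:16) can be reproduced from the `int64` limits. The Python sketch below is illustrative only: it assumes the proleptic Gregorian calendar used by Python's `datetime` and truncates to whole seconds, because `datetime` has no nanosecond field. The function name is hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 10**9


def nanos_timestamp_range():
    """Earliest and latest whole seconds that contain an int64 NANOS value."""
    # Floor division keeps the second that contains each extreme instant.
    lo = EPOCH + timedelta(seconds=INT64_MIN // NANOS_PER_SECOND)
    hi = EPOCH + timedelta(seconds=INT64_MAX // NANOS_PER_SECOND)
    return lo, hi


if __name__ == "__main__":
    lo, hi = nanos_timestamp_range()
    print(lo.isoformat(sep=" "), hi.isoformat(sep=" "))
```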
-Despite there is no exact corresponding ConvertedType for local timestamp semantic, +Although there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated -their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotation, -Parquet writer implementation *must* annotate local timestamps with legacy annotations too, +their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotations, +Parquet writer implementations *must* annotate local timestamps with legacy annotations too, as shown below. *Backward compatibility:* @@ -491,7 +492,7 @@ as shown below. *Forward compatibility:*
LogicalType ConvertedType
- + @@ -544,7 +545,7 @@ are found during reading, they must be ignored. ## Embedded Types -Embedded types do not have type-specific orderings. +Embedded types do not have type-specific orderings unless otherwise specified. ### JSON @@ -606,7 +607,7 @@ optional group variant_shredded (VARIANT(1)) { ### GEOMETRY `GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format -with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY` +with linear/planar `edges` interpolation. It must annotate a `BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail. The type has only one type parameter: @@ -621,14 +622,14 @@ are found during reading, they must be ignored. ### GEOGRAPHY `GEOGRAPHY` is used for geospatial features in the WKB format with an explicit -(non-linear/non-planar) edges interpolation algorithm. It must annotate a +(non-linear/non-planar) `edges` interpolation algorithm. It must annotate a `BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail. The type has two type parameters: - `crs`: An optional string value for CRS. It must be a geographic CRS, where longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90]. If unset, the CRS defaults to `"OGC:CRS84"`. -- `algorithm`: An optional enum value to describes the edge interpolation +- `algorithm`: An optional enum value that describes the edge interpolation algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`, `KARNEY`. If unset, the algorithm defaults to `SPHERICAL`. @@ -834,7 +835,7 @@ to values. `MAP` must annotate a 3-level structure: field of the repeated `key_value` group. * The `value` field encodes the map's value type and repetition. This field can be `required`, `optional`, or omitted. It must always be the second field of - the repeated `key_value` group if present. In case of not present, it can be + the repeated `key_value` group if present. 
If not present, it can be represented as a map with all null values or as a set of keys. The following example demonstrates the type for a non-null map from strings to diff --git a/README.md b/README.md index d398ac4f..fb59cbb6 100644 --- a/README.md +++ b/README.md @@ -134,14 +134,14 @@ with a focus on how the types affect disk storage. For example, 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding. This reduces the complexity of implementing readers and writers for the format. The types are: - - BOOLEAN: 1 bit boolean - - INT32: 32 bit signed ints - - INT64: 64 bit signed ints - - INT96: 96 bit signed ints + - BOOLEAN: 1-bit boolean + - INT32: 32-bit signed ints + - INT64: 64-bit signed ints + - INT96: 96-bit signed ints - FLOAT: IEEE 32-bit floating point values - DOUBLE: IEEE 64-bit floating point values - BYTE_ARRAY: arbitrarily long byte arrays - - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays + - FIXED_LEN_BYTE_ARRAY: fixed-length byte arrays ### Logical Types Logical types are used to extend the types that parquet can be used to store, @@ -172,7 +172,7 @@ be computed from the schema (i.e. how much nesting there is). This defines the maximum number of bits required to store the levels (levels are defined for all values in the column). -Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED. +Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is currently used as it supersedes `BIT_PACKED`. ## Nulls Nullity is encoded in the definition levels (which is run-length encoded). NULL values @@ -190,11 +190,11 @@ In order we have: The value of `uncompressed_page_size` specified in the header is for all the 3 pieces combined. -The encoded values for the data page is always required. The definition and repetition levels +The encoded values for the data page are always required. 
The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. -the path to the column has length 1), we do not encode the repetition levels (it would -always have the value 1). For data that is required, the definition levels are -skipped (if encoded, it will always have the value of the max definition level). +the path to the column has length 1), we do not encode the repetition levels (they would +always have the value 0). For data that is required, the definition levels are +skipped (if encoded, they will always have the value of the max definition level). For example, in the case where the column is non-nested and required, the data in the page is only the encoded values. @@ -224,7 +224,7 @@ the reasoning behind adding these to the format. ## Checksumming Pages of all kinds can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. Checksums are calculated -using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary +using the standard CRC32 algorithm - as used in e.g. GZIP - on the serialized binary representation of a page (not including the page header itself). ## Error recovery @@ -239,10 +239,10 @@ metadata at the end. If an error happens while writing the file metadata, all t data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so -far. Combining this with the strategy used for rc or avro files using sync markers, +far. Combining this with the strategy used for RCFile or Avro files using sync markers, a reader could recover partially written files. -## Separating metadata and column data. +## Separating metadata and column data The format is explicitly designed to separate the metadata from the data. 
This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files. @@ -256,7 +256,7 @@ one HDFS block. Therefore, HDFS block sizes should also be set to be larger. A optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. - Data page size: Data pages should be considered indivisible so smaller data pages -allow for more fine grained reading (e.g. single row lookup). Larger page sizes +allow for more fine-grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes. @@ -278,7 +278,9 @@ Changes to this core format definition are proposed and discussed in depth on th ## Code of Conduct -We hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): . +We hold ourselves and the Parquet developer community to two codes of conduct: +1. [The Apache Software Foundation Code of Conduct](https://www.apache.org/foundation/policies/conduct.html) +2. [The Apache Software Foundation Code of Conduct for GitHub](https://github.com/apache/.github/blob/main/CODE_OF_CONDUCT.md) ## License Copyright 2013 Twitter, Cloudera and other contributors.