8 changes: 4 additions & 4 deletions Compression.md
@@ -48,7 +48,7 @@ No-op codec. Data is left uncompressed.
A codec based on the
[Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
If any ambiguity arises when implementing this format, the implementation
-provided by Google Snappy [library](https://github.com/google/snappy/)
+provided by the [Snappy compression library](https://github.com/google/snappy/)
is authoritative.

### GZIP
@@ -58,7 +58,7 @@ formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
If any ambiguity arises when implementing this format, the implementation
provided by the [zlib compression library](https://zlib.net/) is authoritative.

-Readers should support reading pages containing multiple GZIP members, however,
+Readers should support reading pages containing multiple GZIP members; however,
as this has historically not been supported by all implementations, it is recommended
that writers refrain from creating such pages by default for better interoperability.

@@ -72,7 +72,7 @@ A codec based on or interoperable with the
A codec based on the Brotli format defined by
[RFC 7932](https://tools.ietf.org/html/rfc7932).
If any ambiguity arises when implementing this format, the implementation
-provided by the [Brotli compression library](https://github.com/google/brotli)
+provided by the [Brotli compression library](https://github.com/google/brotli)
is authoritative.

### LZ4
@@ -89,7 +89,7 @@ switch to the newer, interoperable `LZ4_RAW` codec.
### ZSTD

A codec based on the Zstandard format defined by
-[RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises
+[RFC 8878](https://tools.ietf.org/html/rfc8878). If any ambiguity arises
Review comment (Contributor): Nice catch!

when implementing this format, the implementation provided by the
[Zstandard compression library](https://facebook.github.io/zstd/)
is authoritative.
25 changes: 12 additions & 13 deletions Encodings.md
@@ -54,9 +54,10 @@ Supported Types: all
This is the plain encoding that must be supported for types. It is
intended to be the simplest encoding. Values are encoded back to back.

-The plain encoding is used whenever a more efficient encoding can not be used. It
+The plain encoding is used whenever a more efficient encoding cannot be used. It
stores the data in the following format:
-- BOOLEAN: [Bit Packed](#BITPACKED), LSB first
+- BOOLEAN: bit-packed, LSB first (using the same packing scheme as the
+[RLE/bit-packing hybrid](#RLE) encoding)
- INT32: 4 bytes little endian
- INT64: 8 bytes little endian
- INT96: 12 bytes little endian (deprecated)
@@ -68,7 +69,7 @@ stores the data in the following format:
For native types, this outputs the data as little endian. Floating
point types are encoded in IEEE.

-For the byte array type, it encodes the length as a 4 byte little
+For the byte array type, it encodes the length as a 4-byte little
endian, followed by the bytes.
Review comment (Contributor): Suggested change
-endian, followed by the bytes.
+endian integer, followed by the bytes.
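
As a rough illustration of the `PLAIN` layout described in this hunk, the sketch below encodes `INT32` values and a byte array in Python. The function names are hypothetical, and packing the length prefix as an unsigned integer is an assumption; the spec text shown here only says "4-byte little endian".

```python
import struct


def plain_encode_int32(values):
    """Encode INT32 values back to back, 4 bytes each, little endian."""
    return b"".join(struct.pack("<i", v) for v in values)


def plain_encode_byte_array(data: bytes) -> bytes:
    """Prefix the bytes with their length as a 4-byte little-endian integer."""
    # Assumption: the length is packed as unsigned ("<I"). For lengths below
    # 2**31 the resulting bytes are identical to a signed encoding.
    return struct.pack("<I", len(data)) + data
```

For instance, `plain_encode_byte_array(b"abc")` yields the four length bytes `03 00 00 00` followed by `abc`.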


<a name="DICTIONARY"></a>
@@ -82,7 +83,7 @@ written first, before the data pages of the column chunk.
Dictionary page format: the entries in the dictionary using the [plain](#PLAIN) encoding.

Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the given bit width).
+followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).
Review comment (Contributor): Suggested change
-followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).
+followed by the values encoded using the RLE/Bit-Packing described above (with the given bit width).


Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY`
in a data page and `PLAIN` in a dictionary page for new Parquet files.
@@ -130,8 +131,8 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
```

The reason for this packing order is to have fewer word-boundaries on little-endian hardware
-when deserializing more than one byte at at time. This is because 4 bytes can be read into a
-32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by
+when deserializing more than one byte at a time. This is because 4 bytes can be read into a
+32-bit register (or 8 bytes into a 64-bit register) and values can be unpacked just by
shifting and ORing with a mask. (to make this optimization work on a big-endian machine,
you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding)
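
The packing order discussed in this hunk can be sketched in Python as follows. This is an illustration only: the helper names are hypothetical, and a Python integer stands in for the 32-bit or 64-bit register mentioned in the text. Values are packed starting at the least significant bit, so unpacking needs only a little-endian read, a shift, and a mask.

```python
def bit_pack_lsb_first(values, bit_width):
    """Pack values back to back, filling each byte from its least significant bit."""
    acc = 0
    count = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bit_width):
            raise ValueError(f"value {v} does not fit in {bit_width} bits")
        acc |= v << (i * bit_width)
        count += 1
    total_bits = count * bit_width
    return acc.to_bytes((total_bits + 7) // 8, "little")


def bit_unpack_lsb_first(data, bit_width, count):
    """Read the bytes as one little-endian word, then shift and mask."""
    word = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(word >> (i * bit_width)) & mask for i in range(count)]
```

Packing the values 0 through 7 with a bit width of 3 gives the three bytes `0x88 0xC6 0xFA`, and unpacking them returns the original values.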

@@ -151,7 +152,7 @@ data:
* Dictionary indices
* Boolean values in data pages, as an alternative to PLAIN encoding

-Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below:
+Whether prepending the four-byte `length` to the `encoded-data` is summarized in the table below:
```
+--------------+------------------------+-----------------+
| Page kind | RLE-encoded data kind | Prepend length? |
@@ -171,10 +172,10 @@ Whether prepending the four-byte `length` to the `encoded-data` is summarized as
<a name="BITPACKED"></a>
### Bit-packed (Deprecated) (BIT_PACKED = 4)

-This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
+This is a bit-packed only encoding, which is deprecated; it has been replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
Each value is encoded back to back using a fixed width.
There is no padding between values (except for the last byte, which is padded with 0s).
-For example, if the max repetition level was 3 (2 bits) and the max definition level as 3
+For example, if the max repetition level was 3 (2 bits) and the max definition level was 3
(2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes.

This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is a superset of this implementation.
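
The size arithmetic in the example above can be checked with a short Python sketch. The function names are hypothetical; the only facts taken from the text are the fixed width per value and the zero padding of the last byte.

```python
def level_bit_width(max_level: int) -> int:
    """Number of bits needed to store levels in the range 0..max_level."""
    return max_level.bit_length()


def bit_packed_size_bytes(num_values: int, bit_width: int) -> int:
    """Bytes needed for num_values fixed-width values, with the last byte padded."""
    total_bits = num_values * bit_width
    return (total_bits + 7) // 8  # round up to a whole byte
```

With a max level of 3 the width is 2 bits, so 30 values take 60 bits, which rounds up to 8 bytes.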
@@ -230,8 +231,8 @@ Each block contains
```
* the min delta is a zigzag ULEB128 int (we compute a minimum as we need
positive integers for bit packing)
-* the bitwidth of each block is stored as a byte
-* each miniblock is a list of bit packed ints according to the bit width
+* the bitwidth of each miniblock is stored as a byte
+* each miniblock is a list of bit-packed ints according to the bit width
stored at the beginning of the block
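
The "zigzag ULEB128" representation of the min delta mentioned in this list can be sketched in Python. This is a minimal sketch with hypothetical helper names, and it assumes deltas fit in a signed 64-bit integer.

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    # Assumption: n fits in a signed 64-bit integer.
    return (n << 1) ^ (n >> 63)


def uleb128_encode(n: int) -> bytes:
    """Emit 7 bits per byte, low bits first; the high bit marks a continuation."""
    if n < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_min_delta(min_delta: int) -> bytes:
    """Zigzag the signed min delta, then write it as ULEB128."""
    return uleb128_encode(zigzag_encode(min_delta))
```

Zigzag encoding is what lets a negative min delta be written with an unsigned variable-length integer.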

To encode a block, we will:
@@ -322,8 +323,6 @@ The delta encoding algorithm described above stores a bit width per miniblock an

Supported Types: BYTE_ARRAY

-This encoding is always preferred over PLAIN for byte array columns.
-
Review comment (Contributor), on lines -325 to -326: This isn't a typo, you're actually changing the spec here. I think this needs an actual discussion.

For this encoding, we will take all the byte array lengths and encode them using delta
encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data just
concatenated back to back. The expected savings is from the cost of encoding the lengths
52 changes: 27 additions & 25 deletions LogicalTypes.md
@@ -97,7 +97,7 @@ The sort order used for `UUID` values is unsigned byte-wise comparison.
The annotation has two parameters: bit width and sign.
Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`.
For signed integers, the second parameter should be `true`,
-for example, a signed integer with bit width of 8 is defined as `INT(8, true)`
+for example, a signed integer with bit width of 8 is defined as `INT(8, true)`.
Implementations may use these annotations to produce smaller
in-memory representations when reading data.

@@ -120,7 +120,7 @@ along with a maximum number of bits in the stored value.
The annotation has two parameters: bit width and sign.
Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`.
In case of unsigned integers, the second parameter should be `false`,
-for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`
+for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`.
Implementations may use these annotations to produce smaller
in-memory representations when reading data.

@@ -166,7 +166,7 @@ unsigned integers with 8, 16, 32, or 64 bit width.
*Forward compatibility:*

<table>
-<tr colspan="3">
+<tr>
<th colspan="3">LogicalType</th>
<th>ConvertedType</th>
</tr>
@@ -219,15 +219,15 @@ scale stores the number of digits of that value that are to the right of the
decimal point, and the precision stores the maximum number of digits supported
in the unscaled value.

-If not specified, the scale is 0. Scale must be zero or a positive integer less
-than or equal to the precision. Precision is required and must be a non-zero positive
+If not specified, the scale is 0. Scale must be a non-negative integer less
+than or equal to the precision. Precision is required and must be a positive
Review comment (Contributor), on lines +222 to +223: I think this change is less clear, even if correct.

integer. A precision too large for the underlying type (see below) is an error.

`DECIMAL` can be used to annotate the following types:
* `int32`: for 1 &lt;= precision &lt;= 9
* `int64`: for 1 &lt;= precision &lt;= 18; precision &lt; 10 will produce a
warning
-* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
+* `fixed_len_byte_array`: `precision` is limited by the array size. Length `n`
can store &lt;= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
* `byte_array`: `precision` is not limited, but is required. The minimum number of
bytes to store the unscaled value should be used.
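
The precision bound for `fixed_len_byte_array` in the list above, `floor(log_10(2^(8*n - 1) - 1))`, can be evaluated exactly with integer arithmetic. The sketch below is illustrative and the function name is hypothetical; it counts decimal digits instead of calling a floating-point logarithm, which avoids rounding problems for large `n`.

```python
def max_decimal_precision(n: int) -> int:
    """Largest precision that fits in an n-byte two's-complement value."""
    if n < 1:
        raise ValueError("array length must be at least 1 byte")
    max_unscaled = 2 ** (8 * n - 1) - 1  # largest signed value in n bytes
    # floor(log10(x)) equals (number of decimal digits of x) - 1 for x >= 1.
    return len(str(max_unscaled)) - 1
```

For 4 and 8 bytes this gives 9 and 18, which agrees with the `int32` and `int64` limits listed above; 16 bytes gives 38.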
@@ -243,7 +243,7 @@ comparison.

*Compatibility*

-To support compatibility with older readers, implementations of parquet-format should
+To support compatibility with older readers, implementations of parquet-format must
Review comment (Contributor): Another spec change.

write `DecimalType` precision and scale into the corresponding SchemaElement field in metadata.

### FLOAT16
@@ -271,9 +271,10 @@ The sort order used for `DATE` is signed.

### TIME

-`TIME` is used for a logical time type without a date with millisecond or microsecond precision.
+`TIME` is used for a logical time type without a date with millisecond, microsecond,
+or nanosecond precision.
The type has two type parameters: UTC adjustment (`true` or `false`)
-and unit (`MILLIS` or `MICROS`, `NANOS`).
+and unit (`MILLIS`, `MICROS`, or `NANOS`).

`TIME` with unit `MILLIS` is used for millisecond precision.
It must annotate an `int32` that stores the number of
@@ -299,10 +300,10 @@ counterpart, it must annotate an `int32`.
type that is UTC normalized and has `MICROS` precision. Like the logical type
counterpart, it must annotate an `int64`.

-Despite there is no exact corresponding ConvertedType for local time semantic,
+Although there is no exact corresponding ConvertedType for local time semantic,
in order to support forward compatibility with those libraries, which annotated
-their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotation,
-Parquet writer implementation *must* annotate local time with legacy annotations too,
+their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotations,
+Parquet writer implementations *must* annotate local time with legacy annotations too,
as shown below.

*Backward compatibility:*
@@ -315,7 +316,7 @@ as shown below.
*Forward compatibility:*

<table>
-<tr colspan="3">
+<tr>
<th colspan="3">LogicalType</th>
<th>ConvertedType</th>
</tr>
@@ -358,7 +359,7 @@ time-line and such interpretations are allowed on purpose.

The `TIMESTAMP` type has two type parameters:
- `isAdjustedToUTC` must be either `true` or `false`.
-- `unit` must be one of `MILLIS`, `MICROS` or `NANOS`. This list is subject
+- `unit` must be one of `MILLIS`, `MICROS`, or `NANOS`. This list is subject
to potential expansion in the future. Upon reading, unknown `unit`-s must
be handled as unsupported features (rather than as errors in the data files).

@@ -448,7 +449,7 @@ limits and implementations may choose to only support a limited range.
On the other hand, not every combination of year, month, day, hour, minute,
second and subsecond values can be encoded into an `int64`. Most notably:

-- An arbitrary combination of timestamp fields can not be encoded as a single
+- An arbitrary combination of timestamp fields cannot be encoded as a single
number if the values for some of the fields are outside of their normal range
(where the "normal range" corresponds to everyday usage). For example, neither
of the following can be represented in a timestamp:
@@ -459,7 +460,7 @@ second and subsecond values can be encoded into an `int64`. Most notably:
- day = 29, month = 2, year = any non-leap year
- Due to the range of the `int64` type, timestamps using the `NANOS` unit
can only represent values between 1677-09-21 00:12:43 and 2262-04-11 23:47:16.
-Values outside of this range can not be represented with the `NANOS`
+Values outside of this range cannot be represented with the `NANOS`
unit. (Other precisions have similar limits but those are outside of the
domain for practical everyday usage.)
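
The `NANOS` range quoted above follows directly from the `int64` limits. The Python sketch below reproduces the two endpoints, truncated to whole seconds; it assumes the epoch is 1970-01-01 00:00:00 and uses a hypothetical helper name.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumption: values count from the Unix epoch
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def nanos_to_datetime_seconds(nanos: int) -> datetime:
    """Convert nanoseconds since the epoch to a datetime, dropping the sub-second part."""
    seconds, _sub_nanos = divmod(nanos, 1_000_000_000)
    return EPOCH + timedelta(seconds=seconds)


print(nanos_to_datetime_seconds(INT64_MIN))  # 1677-09-21 00:12:43
print(nanos_to_datetime_seconds(INT64_MAX))  # 2262-04-11 23:47:16
```

Integer division is used on purpose: converting the nanosecond count to a float first would lose precision at these magnitudes.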

@@ -475,10 +476,10 @@ type counterpart, it must annotate an `int64`.
logical type that is UTC normalized and has `MICROS` precision. Like the logical
type counterpart, it must annotate an `int64`.

-Despite there is no exact corresponding ConvertedType for local timestamp semantic,
+Although there is no exact corresponding ConvertedType for local timestamp semantic,
in order to support forward compatibility with those libraries, which annotated
-their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotation,
-Parquet writer implementation *must* annotate local timestamps with legacy annotations too,
+their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotations,
+Parquet writer implementations *must* annotate local timestamps with legacy annotations too,
as shown below.

*Backward compatibility:*
@@ -491,7 +492,7 @@ as shown below.
*Forward compatibility:*

<table>
-<tr colspan="3">
+<tr>
<th colspan="3">LogicalType</th>
<th>ConvertedType</th>
</tr>
@@ -544,7 +545,8 @@ are found during reading, they must be ignored.

## Embedded Types

-Embedded types do not have type-specific orderings.
+Embedded types do not have type-specific orderings beyond the unsigned
+byte-wise comparison of their physical type (`BYTE_ARRAY`).
Review comment (Contributor), on lines +548 to +549: This isn't correct since VARIANT, GEOMETRY, and GEOGRAPHY have undefined orderings. Maybe instead
Suggested change
-Embedded types do not have type-specific orderings beyond the unsigned
-byte-wise comparison of their physical type (`BYTE_ARRAY`).
+Embedded types do not have type-specific orderings unless otherwise specified.


### JSON

@@ -606,7 +608,7 @@ optional group variant_shredded (VARIANT(1)) {
### GEOMETRY

`GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format
-with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY`
Review comment (Contributor): As with the other PR, I think "edges" is deliberate
+with linear/planar edge interpolation. It must annotate a `BYTE_ARRAY`
primitive type. See [Geospatial.md](Geospatial.md) for more detail.

The type has only one type parameter:
@@ -621,14 +623,14 @@ are found during reading, they must be ignored.
### GEOGRAPHY

`GEOGRAPHY` is used for geospatial features in the WKB format with an explicit
-(non-linear/non-planar) edges interpolation algorithm. It must annotate a
+(non-linear/non-planar) edge interpolation algorithm. It must annotate a
`BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail.

The type has two type parameters:
- `crs`: An optional string value for CRS. It must be a geographic CRS, where
longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90].
If unset, the CRS defaults to `"OGC:CRS84"`.
-- `algorithm`: An optional enum value to describes the edge interpolation
+- `algorithm`: An optional enum value that describes the edge interpolation
algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`,
`KARNEY`. If unset, the algorithm defaults to `SPHERICAL`.

@@ -834,7 +836,7 @@ to values. `MAP` must annotate a 3-level structure:
field of the repeated `key_value` group.
* The `value` field encodes the map's value type and repetition. This field can
be `required`, `optional`, or omitted. It must always be the second field of
-the repeated `key_value` group if present. In case of not present, it can be
+the repeated `key_value` group if present. If not present, it can be
represented as a map with all null values or as a set of keys.

The following example demonstrates the type for a non-null map from strings to