apache · iemejia · Jun 2, 2026 · etseidl · Jun 2, 2026 · etseidl
diff --git a/Compression.md b/Compression.md
@@ -48,7 +48,7 @@ No-op codec.  Data is left uncompressed.
 A codec based on the
 [Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
 If any ambiguity arises when implementing this format, the implementation
-provided by Google Snappy [library](https://github.com/google/snappy/)
+provided by the [Snappy compression library](https://github.com/google/snappy/)
 is authoritative.
 
 ### GZIP
@@ -58,7 +58,7 @@ formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
 If any ambiguity arises when implementing this format, the implementation
 provided by the [zlib compression library](https://zlib.net/) is authoritative.
 
-Readers should support reading pages containing multiple GZIP members, however,
+Readers should support reading pages containing multiple GZIP members; however,
 as this has historically not been supported by all implementations, it is recommended
 that writers refrain from creating such pages by default for better interoperability.
 
@@ -72,7 +72,7 @@ A codec based on or interoperable with the
 A codec based on the Brotli format defined by
 [RFC 7932](https://tools.ietf.org/html/rfc7932).
 If any ambiguity arises when implementing this format, the implementation
-provided by the  [Brotli compression library](https://github.com/google/brotli)
+provided by the [Brotli compression library](https://github.com/google/brotli)
 is authoritative.
 
 ### LZ4
@@ -89,7 +89,7 @@ switch to the newer, interoperable `LZ4_RAW` codec.
 ### ZSTD
 
 A codec based on the Zstandard format defined by
-[RFC 8478](https://tools.ietf.org/html/rfc8478).  If any ambiguity arises
+[RFC 8878](https://tools.ietf.org/html/rfc8878).  If any ambiguity arises
 when implementing this format, the implementation provided by the
 [Zstandard compression library](https://facebook.github.io/zstd/)
 is authoritative.

diff --git a/Encodings.md b/Encodings.md
@@ -54,9 +54,10 @@ Supported Types: all
 This is the plain encoding that must be supported for types.  It is
 intended to be the simplest encoding.  Values are encoded back to back.
 
-The plain encoding is used whenever a more efficient encoding can not be used. It
+The plain encoding is used whenever a more efficient encoding cannot be used. It
 stores the data in the following format:
- - BOOLEAN: [Bit Packed](#BITPACKED), LSB first
+ - BOOLEAN: bit-packed, LSB first (using the same packing scheme as the
+   [RLE/bit-packing hybrid](#RLE) encoding)
  - INT32: 4 bytes little endian
  - INT64: 8 bytes little endian
  - INT96: 12 bytes little endian (deprecated)
@@ -68,7 +69,7 @@ stores the data in the following format:
 For native types, this outputs the data as little endian. Floating
     point types are encoded in IEEE.
 
-For the byte array type, it encodes the length as a 4 byte little
+For the byte array type, it encodes the length as a 4-byte little
 endian, followed by the bytes.
-endian, followed by the bytes.
+endian integer, followed by the bytes.
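
Note: the PLAIN layout described in this hunk can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of any Parquet library. The sketch assumes the length prefix can be written as an unsigned 32-bit value, which is byte-identical to a signed one for any length below 2^31.

```python
import struct


def plain_encode_int32(values):
    """Encode INT32 values back to back, 4 bytes little endian each."""
    return b"".join(struct.pack("<i", v) for v in values)


def plain_encode_byte_array(values):
    """Encode each byte array as a 4-byte little-endian length, then the bytes."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))  # length prefix (assumed unsigned here)
        out += v                          # raw bytes, no terminator or padding
    return bytes(out)


def plain_encode_boolean(values):
    """Bit-pack booleans LSB first: value i lands in bit (i % 8) of byte (i // 8)."""
    out = bytearray((len(values) + 7) // 8)  # unused high bits of the last byte stay 0
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)
```

For instance, `plain_encode_byte_array([b"ab"])` produces the six bytes `02 00 00 00 61 62`: the length 2 as a little-endian integer, then the two data bytes.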
 
 <a name="DICTIONARY"></a>
@@ -82,7 +83,7 @@ written first, before the data pages of the column chunk.
 Dictionary page format: the entries in the dictionary using the [plain](#PLAIN) encoding.
 
 Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the given bit width).
+followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).
-followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).
+followed by the values encoded using the RLE/Bit-Packing described above (with the given bit width).
 
 Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY`
 in a data page and `PLAIN` in a dictionary page for new Parquet files.
@@ -130,8 +131,8 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
    ```
 
    The reason for this packing order is to have fewer word-boundaries on little-endian hardware
-   when deserializing more than one byte at at time. This is because 4 bytes can be read into a
-   32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by
+   when deserializing more than one byte at a time. This is because 4 bytes can be read into a
+   32-bit register (or 8 bytes into a 64-bit register) and values can be unpacked just by
    shifting and ORing with a mask. (to make this optimization work on a big-endian machine,
    you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding)
 
@@ -151,7 +152,7 @@ data:
 * Dictionary indices
 * Boolean values in data pages, as an alternative to PLAIN encoding
 
-Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below:
+Whether to prepend the four-byte `length` to the `encoded-data` is summarized in the table below:
 ```
 +--------------+------------------------+-----------------+
 | Page kind    | RLE-encoded data kind  | Prepend length? |
@@ -171,10 +172,10 @@ Whether prepending the four-byte `length` to the `encoded-data` is summarized as
 <a name="BITPACKED"></a>
 ### Bit-packed (Deprecated) (BIT_PACKED = 4)
 
-This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
+This is a bit-packed only encoding, which is deprecated; it has been replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
 Each value is encoded back to back using a fixed width.
 There is no padding between values (except for the last byte, which is padded with 0s).
-For example, if the max repetition level was 3 (2 bits) and the max definition level as 3
+For example, if the max repetition level was 3 (2 bits) and the max definition level was 3
 (2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes.
 
 This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is a superset of this implementation.
@@ -230,8 +231,8 @@ Each block contains
 ```
  * the min delta is a zigzag ULEB128 int (we compute a minimum as we need
    positive integers for bit packing)
- * the bitwidth of each block is stored as a byte
- * each miniblock is a list of bit packed ints according to the bit width
+ * the bitwidth of each miniblock is stored as a byte
+ * each miniblock is a list of bit-packed ints according to the bit width
    stored at the beginning of the block
 
 To encode a block, we will:
@@ -322,8 +323,6 @@ The delta encoding algorithm described above stores a bit width per miniblock an
 
 Supported Types: BYTE_ARRAY
 
-This encoding is always preferred over PLAIN for byte array columns.
-
 For this encoding, we will take all the byte array lengths and encode them using delta
 encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data just
 concatenated back to back. The expected savings is from the cost of encoding the lengths

diff --git a/LogicalTypes.md b/LogicalTypes.md
@@ -97,7 +97,7 @@ The sort order used for `UUID` values is unsigned byte-wise comparison.
 The annotation has two parameters: bit width and sign.
 Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`.
 For signed integers, the second parameter should be `true`,
-for example, a signed integer with bit width of 8 is defined as `INT(8, true)`
+for example, a signed integer with bit width of 8 is defined as `INT(8, true)`.
 Implementations may use these annotations to produce smaller
 in-memory representations when reading data.
 
@@ -120,7 +120,7 @@ along with a maximum number of bits in the stored value.
 The annotation has two parameters: bit width and sign.
 Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`.
 In case of unsigned integers, the second parameter should be `false`,
-for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`
+for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`.
 Implementations may use these annotations to produce smaller
 in-memory representations when reading data.
 
@@ -166,7 +166,7 @@ unsigned integers with 8, 16, 32, or 64 bit width.
 *Forward compatibility:*
 
 <table>
-    <tr colspan="3">
+    <tr>
         <th colspan="3">LogicalType</th>
         <th>ConvertedType</th>
     </tr>
@@ -219,15 +219,15 @@ scale stores the number of digits of that value that are to the right of the
 decimal point, and the precision stores the maximum number of digits supported
 in the unscaled value.
 
-If not specified, the scale is 0. Scale must be zero or a positive integer less
-than or equal to the precision. Precision is required and must be a non-zero positive
+If not specified, the scale is 0. Scale must be a non-negative integer less
+than or equal to the precision. Precision is required and must be a positive
 integer. A precision too large for the underlying type (see below) is an error.
 
 `DECIMAL` can be used to annotate the following types:
 * `int32`: for 1 &lt;= precision &lt;= 9
 * `int64`: for 1 &lt;= precision &lt;= 18; precision &lt; 10 will produce a
   warning
-* `fixed_len_byte_array`: precision is limited by the array size. Length `n`
+* `fixed_len_byte_array`: `precision` is limited by the array size. Length `n`
   can store &lt;= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
 * `byte_array`: `precision` is not limited, but is required. The minimum number of
   bytes to store the unscaled value should be used.
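
Note: the `fixed_len_byte_array` bound quoted in this hunk, `floor(log_10(2^(8*n - 1) - 1))`, can be checked with a short Python sketch. The function name is hypothetical. The sketch counts decimal digits of an exact integer instead of calling a floating-point logarithm, which is an implementation choice made here to avoid rounding error, not something the specification prescribes.

```python
def max_decimal_precision(n):
    """Largest precision storable in a fixed_len_byte_array of n bytes.

    Computes floor(log10(2**(8*n - 1) - 1)) exactly: for a positive
    integer x, floor(log10(x)) equals (number of decimal digits of x) - 1.
    """
    if n < 1:
        raise ValueError("array length must be at least 1 byte")
    largest = 2 ** (8 * n - 1) - 1  # largest signed two's-complement value in n bytes
    return len(str(largest)) - 1
```

With n = 4 and n = 8 this gives 9 and 18, which agrees with the `int32` and `int64` precision limits listed above.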
@@ -243,7 +243,7 @@ comparison.
 
 *Compatibility*
 
-To support compatibility with older readers, implementations of parquet-format should
+To support compatibility with older readers, implementations of parquet-format must
 write `DecimalType` precision and scale into the corresponding SchemaElement field in metadata.
 
 ### FLOAT16
@@ -271,9 +271,10 @@ The sort order used for `DATE` is signed.
 
 ### TIME
 
-`TIME` is used for a logical time type without a date with millisecond or microsecond precision.
+`TIME` is used for a logical time type without a date with millisecond, microsecond,
+or nanosecond precision.
 The type has two type parameters: UTC adjustment (`true` or `false`)
-and unit (`MILLIS` or `MICROS`, `NANOS`).
+and unit (`MILLIS`, `MICROS`, or `NANOS`).
 
 `TIME` with unit `MILLIS` is used for millisecond precision.
 It must annotate an `int32` that stores the number of
@@ -299,10 +300,10 @@ counterpart, it must annotate an `int32`.
 type that is UTC normalized and has `MICROS` precision. Like the logical type
 counterpart, it must annotate an `int64`.
 
-Despite there is no exact corresponding ConvertedType for local time semantic,
+Although there is no exact corresponding ConvertedType for local time semantic,
 in order to support forward compatibility with those libraries, which annotated
-their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotation,
-Parquet writer implementation *must* annotate local time with legacy annotations too,
+their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotations,
+Parquet writer implementations *must* annotate local time with legacy annotations too,
 as shown below.
 
 *Backward compatibility:*
@@ -315,7 +316,7 @@ as shown below.
 *Forward compatibility:*
 
 <table>
-    <tr colspan="3">
+    <tr>
         <th colspan="3">LogicalType</th>
         <th>ConvertedType</th>
     </tr>
@@ -358,7 +359,7 @@ time-line and such interpretations are allowed on purpose.
 
 The `TIMESTAMP` type has two type parameters:
 - `isAdjustedToUTC` must be either `true` or `false`.
-- `unit` must be one of `MILLIS`, `MICROS` or `NANOS`. This list is subject
+- `unit` must be one of `MILLIS`, `MICROS`, or `NANOS`. This list is subject
   to potential expansion in the future. Upon reading, unknown `unit`-s must
   be handled as unsupported features (rather than as errors in the data files).
 
@@ -448,7 +449,7 @@ limits and implementations may choose to only support a limited range.
 On the other hand, not every combination of year, month, day, hour, minute,
 second and subsecond values can be encoded into an `int64`. Most notably:
 
-- An arbitrary combination of timestamp fields can not be encoded as a single
+- An arbitrary combination of timestamp fields cannot be encoded as a single
   number if the values for some of the fields are outside of their normal range
   (where the "normal range" corresponds to everyday usage). For example, neither
   of the following can be represented in a timestamp:
@@ -459,7 +460,7 @@ second and subsecond values can be encoded into an `int64`. Most notably:
   - day = 29, month = 2, year = any non-leap year
 - Due to the range of the `int64` type, timestamps using the `NANOS` unit
   can only represent values between 1677-09-21 00:12:43 and 2262-04-11 23:47:16.
-  Values outside of this range can not be represented with the `NANOS`
+  Values outside of this range cannot be represented with the `NANOS`
   unit. (Other precisions have similar limits but those are outside of the
   domain for practical everyday usage.)
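
Note: the `NANOS` bounds stated in this hunk follow from the `int64` range and can be reproduced with a small Python sketch. The helper name is hypothetical. Because Python's `datetime` only has microsecond resolution, the sketch floors nanoseconds to microseconds, which does not affect the whole-second values quoted in the text.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
EPOCH = datetime(1970, 1, 1)  # naive datetime used as the Unix epoch


def nanos_to_datetime(nanos):
    """Convert nanoseconds since the epoch to a datetime (floored to microseconds)."""
    return EPOCH + timedelta(microseconds=nanos // 1000)


earliest = nanos_to_datetime(INT64_MIN)
latest = nanos_to_datetime(INT64_MAX)
print(earliest.isoformat(sep=" ", timespec="seconds"))
print(latest.isoformat(sep=" ", timespec="seconds"))
```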
 
@@ -475,10 +476,10 @@ type counterpart, it must annotate an `int64`.
 logical type that is UTC normalized and has `MICROS` precision. Like the logical
 type counterpart, it must annotate an `int64`.
 
-Despite there is no exact corresponding ConvertedType for local timestamp semantic,
+Although there is no exact corresponding ConvertedType for local timestamp semantic,
 in order to support forward compatibility with those libraries, which annotated
-their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotation,
-Parquet writer implementation *must* annotate local timestamps with legacy annotations too,
+their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotations,
+Parquet writer implementations *must* annotate local timestamps with legacy annotations too,
 as shown below.
 
 *Backward compatibility:*
@@ -491,7 +492,7 @@ as shown below.
 *Forward compatibility:*
 
 <table>
-    <tr colspan="3">
+    <tr>
         <th colspan="3">LogicalType</th>
         <th>ConvertedType</th>
     </tr>
@@ -544,7 +545,8 @@ are found during reading, they must be ignored.
 
 ## Embedded Types
 
-Embedded types do not have type-specific orderings.
+Embedded types do not have type-specific orderings beyond the unsigned
+byte-wise comparison of their physical type (`BYTE_ARRAY`).
-Embedded types do not have type-specific orderings beyond the unsigned
-byte-wise comparison of their physical type (`BYTE_ARRAY`).
+Embedded types do not have type-specific orderings unless otherwise specified.
 
 ### JSON
 
@@ -606,7 +608,7 @@ optional group variant_shredded (VARIANT(1)) {
 ### GEOMETRY
 
 `GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format
-with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY`
+with linear/planar edge interpolation. It must annotate a `BYTE_ARRAY`
 primitive type. See [Geospatial.md](Geospatial.md) for more detail.
 
 The type has only one type parameter:
@@ -621,14 +623,14 @@ are found during reading, they must be ignored.
 ### GEOGRAPHY
 
 `GEOGRAPHY` is used for geospatial features in the WKB format with an explicit
-(non-linear/non-planar) edges interpolation algorithm. It must annotate a
+(non-linear/non-planar) edge interpolation algorithm. It must annotate a
 `BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail.
 
 The type has two type parameters:
 - `crs`: An optional string value for CRS. It must be a geographic CRS, where
   longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90].
   If unset, the CRS defaults to `"OGC:CRS84"`.
-- `algorithm`: An optional enum value to describes the edge interpolation
+- `algorithm`: An optional enum value that describes the edge interpolation
   algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`,
   `KARNEY`. If unset, the algorithm defaults to `SPHERICAL`.
 
@@ -834,7 +836,7 @@ to values. `MAP` must annotate a 3-level structure:
   field of the repeated `key_value` group.
 * The `value` field encodes the map's value type and repetition. This field can
   be `required`, `optional`, or omitted. It must always be the second field of
-  the repeated `key_value` group if present. In case of not present, it can be
+  the repeated `key_value` group if present. If not present, it can be
   represented as a map with all null values or as a set of keys.
 
 The following example demonstrates the type for a non-null map from strings to