Fix errors, grammar, and consistency in core format documentation #576
Open
iemejia wants to merge 1 commit into
README.md:
- Fix repetition level value for non-nested columns (1 -> 0)
- Update defunct Twitter Code of Conduct links to ASF
- Fix plural agreement ("encoded values is" -> "are")
- Hyphenate compound adjectives ("32 bit" -> "32-bit", etc.)
- Normalize GZIP casing; capitalize proper nouns (RCFile, Avro)
Encodings.md:
- Fix "bitwidth of each block" -> "each miniblock" (DELTA_BINARY_PACKED)
- Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY
- Fix "at at time" -> "at a time"
- Fix BIT_PACKED tense ("will be replaced" -> already replaced)
- Fix PLAIN BOOLEAN link to reference RLE/bit-packing hybrid section
- Hyphenate compound adjectives; "can not" -> "cannot"
Compression.md:
- Fix ZSTD RFC reference (8478 -> 8878)
- Fix Snappy description to match parallel construction
- Remove double space; fix comma splice
LogicalTypes.md:
- Fix embedded types ordering contradiction
- Add nanosecond to TIME precision description
- Remove invalid <tr colspan=3> from logical-type tables
- Align DECIMAL precision/scale wording with parquet.thrift
- Fix NaNs casing; add Oxford commas
- "can not" -> "cannot"; grammar fixes throughout
etseidl reviewed Jun 2, 2026
| For the byte array type, it encodes the length as a 4 byte little | ||
| For the byte array type, it encodes the length as a 4-byte little | ||
| endian, followed by the bytes. |
Contributor
Suggested change
| endian, followed by the bytes. | |
| endian integer, followed by the bytes. |
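The layout this sentence describes (a 4-byte little-endian length, then the raw bytes) can be sketched in a few lines of Python. This is only an illustration, not code from the PR or from any Parquet implementation: the function names are hypothetical, and treating the length as an unsigned 32-bit integer is an assumption.

```python
import struct


def encode_plain_byte_array(values):
    """Encode byte strings as: 4-byte little-endian length, then the bytes.

    Hypothetical helper; the length is assumed to be an unsigned 32-bit integer.
    """
    out = bytearray()
    for value in values:
        out += struct.pack("<I", len(value))  # "<I" = little-endian, 4 bytes
        out += value
    return bytes(out)


def decode_plain_byte_array(buf):
    """Inverse of encode_plain_byte_array: walk the buffer value by value."""
    values = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(bytes(buf[pos:pos + length]))
        pos += length
    return values
```

With this sketch, the two-byte value `b"ab"` is laid out as `02 00 00 00 61 62`, which is why the wording "little endian integer" matters: the length is a number, not just a byte order.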
| Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), | ||
| followed by the values encoded using RLE/Bit packed described above (with the given bit width). | ||
| followed by the values encoded using RLE/Bit-Packed described above (with the given bit width). |
Contributor
Suggested change
| followed by the values encoded using RLE/Bit-Packed described above (with the given bit width). | |
| followed by the values encoded using the RLE/Bit-Packing described above (with the given bit width). |
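As a rough illustration of the data page format quoted above, the sketch below reads the leading bit-width byte and then decodes the RLE/bit-packing hybrid runs that follow. It reflects my reading of the hybrid encoding in Encodings.md (a ULEB128 run header whose low bit selects bit-packed or RLE, with bit-packed values stored least-significant bit first); the function names and the `num_values` parameter are hypothetical, and it should not be taken as a reference decoder.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_dictionary_indices(page, num_values):
    """Decode dictionary entry ids: 1 byte of bit width, then hybrid runs."""
    bit_width = page[0]
    if bit_width > 32:
        raise ValueError("bit width must be at most 32")
    mask = (1 << bit_width) - 1
    pos = 1
    out = []
    while len(out) < num_values:
        header, pos = read_uleb128(page, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(page[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            width_bytes = (bit_width + 7) // 8
            value = int.from_bytes(page[pos:pos + width_bytes], "little")
            pos += width_bytes
            out.extend([value] * count)
    # A bit-packed run may carry padding values past the requested count.
    return out[:num_values]
```

For example, a page starting with bit width 3, a bit-packed header `0x03`, and the bytes `88 C6 FA` yields the ids 0 through 7.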
Comment on lines -325 to -326
| This encoding is always preferred over PLAIN for byte array columns. | ||
Contributor
This isn't a typo, you're actually changing the spec here. I think this needs an actual discussion.
Comment on lines +222 to +223
| If not specified, the scale is 0. Scale must be a non-negative integer less | ||
| than or equal to the precision. Precision is required and must be a positive |
Contributor
I think this change is less clear, even if correct.
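One way to check whether the proposed wording is unambiguous is to write the constraints down as a check. The sketch below encodes only what the two quoted lines state (scale defaults to 0, scale is a non-negative integer no greater than the precision, precision is a required positive integer); `validate_decimal` is a hypothetical helper, not part of parquet-format or any library.

```python
def validate_decimal(precision, scale=0):
    """Check DECIMAL parameters against the constraints quoted above.

    Hypothetical helper for illustration; returns (precision, scale) if valid.
    """
    for name, value in (("precision", precision), ("scale", scale)):
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"{name} must be an integer")
    if precision <= 0:
        raise ValueError("precision must be a positive integer")
    if scale < 0:
        raise ValueError("scale must be non-negative")
    if scale > precision:
        raise ValueError("scale must be less than or equal to precision")
    return precision, scale
```

Under these rules `validate_decimal(9)` and `validate_decimal(9, 9)` pass, while a scale of 10 with a precision of 9 is rejected.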
| *Compatibility* | ||
| To support compatibility with older readers, implementations of parquet-format should | ||
| To support compatibility with older readers, implementations of parquet-format must |
Comment on lines +548 to +549
| Embedded types do not have type-specific orderings beyond the unsigned | ||
| byte-wise comparison of their physical type (`BYTE_ARRAY`). |
Contributor
This isn't correct since VARIANT, GEOMETRY, and GEOGRAPHY have undefined orderings. Maybe instead
Suggested change
| Embedded types do not have type-specific orderings beyond the unsigned | |
| byte-wise comparison of their physical type (`BYTE_ARRAY`). | |
| Embedded types do not have type-specific orderings unless otherwise specified. |
| ### GEOMETRY | ||
| `GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format | ||
| with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY` |
Contributor
As with the other PR, I think "edges" is deliberate
| values in the column). | ||
| Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED. | ||
| Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is now used as it supersedes `BIT_PACKED`. |
Contributor
Suggested change
| Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is now used as it supersedes `BIT_PACKED`. | |
| Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is currently used as it supersedes `BIT_PACKED`. |
| A codec based on the Zstandard format defined by | ||
| [RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises | ||
| [RFC 8878](https://tools.ietf.org/html/rfc8878). If any ambiguity arises |
Summary
Fix errors, grammar, and cross-document inconsistencies in the core Parquet format documentation (README, Encodings, Compression, LogicalTypes).
Changes
README.md
Encodings.md
Compression.md
LogicalTypes.md
Validation
No semantic/behavioral changes to the format specification. All fixes are documentation-only.
Split from #572 for easier review.