Fix errors, grammar, and consistency in core format documentation #576

Open. iemejia wants to merge 1 commit into apache:master from iemejia:fix/core-format-docs

@iemejia (Member) commented Jun 2, 2026

Summary

Fix errors, grammar, and cross-document inconsistencies in the core Parquet format documentation (README, Encodings, Compression, LogicalTypes).

Changes

README.md

  • Fix repetition level value for non-nested columns (1 -> 0)
  • Update defunct Twitter Code of Conduct links to ASF
  • Fix plural agreement ("encoded values is" -> "are")
  • Hyphenate compound adjectives ("32 bit" -> "32-bit", etc.)
  • Normalize GZIP casing; capitalize proper nouns (RCFile, Avro)

Encodings.md

  • Fix "bitwidth of each block" -> "each miniblock" (DELTA_BINARY_PACKED)
  • Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY
  • Fix "at at time" -> "at a time"
  • Fix BIT_PACKED tense ("will be replaced" -> already replaced)
  • Fix PLAIN BOOLEAN link to reference RLE/bit-packing hybrid section
  • Hyphenate compound adjectives; "can not" -> "cannot"

Compression.md

  • Fix ZSTD RFC reference (8478 -> 8878)
  • Fix Snappy description to match parallel construction
  • Remove double space; fix comma splice

LogicalTypes.md

  • Fix embedded types ordering contradiction
  • Add nanosecond to TIME precision description
  • Remove invalid <tr colspan=3> from logical-type tables
  • Align DECIMAL precision/scale wording with parquet.thrift
  • Fix NaNs casing; add Oxford commas
  • "can not" -> "cannot"; grammar fixes throughout

Validation

No semantic/behavioral changes to the format specification. All fixes are documentation-only.

Split from #572 for easier review.

@etseidl (Contributor) left a comment

Thanks @iemejia. A few nits/suggestions.

I do think a few might go too far and represent actual spec changes that would require discussion.

Comment thread Encodings.md

- For the byte array type, it encodes the length as a 4 byte little
+ For the byte array type, it encodes the length as a 4-byte little
endian, followed by the bytes.

Suggested change
- endian, followed by the bytes.
+ endian integer, followed by the bytes.
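
To illustrate the layout this wording describes, here is a minimal Python sketch of one PLAIN-encoded byte array value: a 4-byte little-endian length followed by the raw bytes. The function names are hypothetical and do not come from any Parquet library, and treating the length as unsigned is an assumption of this sketch.

```python
import struct


def encode_plain_byte_array(value: bytes) -> bytes:
    """Prefix the value with its length as a 4-byte little-endian integer."""
    # "<I" = little-endian, 4 bytes (unsigned is an assumption of this sketch).
    return struct.pack("<I", len(value)) + value


def decode_plain_byte_array(buf: bytes, offset: int = 0):
    """Read one length-prefixed value; return (value, offset of the next value)."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length
```

Values are simply concatenated in a page, so a reader would call the decoder repeatedly, feeding each returned offset back in.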

Comment thread Encodings.md

Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
- followed by the values encoded using RLE/Bit packed described above (with the given bit width).
+ followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).

Suggested change
- followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).
+ followed by the values encoded using the RLE/Bit-Packing described above (with the given bit width).
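
As a rough illustration of the data page layout under discussion, the Python sketch below writes the bit width as a single leading byte and then the dictionary indices as runs. It is deliberately simplified: it emits only RLE runs (a varint header of `run_length << 1` followed by the repeated value in the bit width rounded up to whole bytes) and never bit-packed runs, which a real writer would mix in. All function names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128 varint: 7 bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def encode_dictionary_indices(indices, dict_size: int) -> bytes:
    """Bit width (1 byte) followed by RLE runs of dictionary entry ids."""
    # Smallest width that can represent the largest index, dict_size - 1.
    bit_width = max(dict_size - 1, 0).bit_length()
    if bit_width > 32:
        raise ValueError("max bit width is 32")
    out = bytearray([bit_width])
    value_bytes = (bit_width + 7) // 8  # round the bit width up to whole bytes
    i = 0
    while i < len(indices):
        j = i
        while j < len(indices) and indices[j] == indices[i]:
            j += 1
        out += encode_varint((j - i) << 1)  # low bit 0 marks an RLE run
        out += indices[i].to_bytes(value_bytes, "little")
        i = j
    return bytes(out)
```

For example, the indices `[2, 2, 2, 0]` against a 4-entry dictionary need a bit width of 2, so the page body starts with the byte `0x02` and continues with two runs.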

Comment thread Encodings.md
Comment on lines -325 to -326
- This encoding is always preferred over PLAIN for byte array columns.

This isn't a typo, you're actually changing the spec here. I think this needs an actual discussion.

Comment thread LogicalTypes.md
Comment on lines +222 to +223
+ If not specified, the scale is 0. Scale must be a non-negative integer less
+ than or equal to the precision. Precision is required and must be a positive

I think this change is less clear, even if correct.
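
For readers weighing the wording, the constraints in the quoted lines can be written down as a small check. This is only a sketch of the rules as quoted above (scale defaults to 0, 0 <= scale <= precision, precision required and positive); `validate_decimal` is a hypothetical helper, and the quoted sentence is cut off by the diff, so nothing beyond the visible text is assumed.

```python
def validate_decimal(precision: int, scale: int = 0):
    """Check DECIMAL parameters against the constraints quoted in this thread."""
    if not isinstance(precision, int) or precision <= 0:
        raise ValueError("precision is required and must be positive")
    if not isinstance(scale, int) or scale < 0:
        raise ValueError("scale must be a non-negative integer")
    if scale > precision:
        raise ValueError("scale must be less than or equal to the precision")
    return precision, scale
```

Under these rules `validate_decimal(9)` yields a scale of 0, `validate_decimal(9, 9)` is accepted, and `validate_decimal(9, 10)` is rejected.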

Comment thread LogicalTypes.md
*Compatibility*

- To support compatibility with older readers, implementations of parquet-format should
+ To support compatibility with older readers, implementations of parquet-format must

Another spec change.

Comment thread LogicalTypes.md
Comment on lines +548 to +549
+ Embedded types do not have type-specific orderings beyond the unsigned
+ byte-wise comparison of their physical type (`BYTE_ARRAY`).

This isn't correct since VARIANT, GEOMETRY, and GEOGRAPHY have undefined orderings. Maybe instead

Suggested change
- Embedded types do not have type-specific orderings beyond the unsigned
- byte-wise comparison of their physical type (`BYTE_ARRAY`).
+ Embedded types do not have type-specific orderings unless otherwise specified.
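
For context on the phrase being removed, "unsigned byte-wise comparison" of `BYTE_ARRAY` values can be sketched in Python as below. The helper name is hypothetical, and treating a value as smaller than any longer value it is a prefix of is the ordinary lexicographic rule, assumed here rather than quoted from the spec. The sketch says nothing about whether a given logical type defines an ordering at all, which is the point of the comment above.

```python
def unsigned_bytewise_less(a: bytes, b: bytes) -> bool:
    """True if a sorts before b when bytes are compared as unsigned 0..255."""
    for x, y in zip(a, b):  # iterating bytes yields ints in range 0..255
        if x != y:
            return x < y
    # All shared positions are equal: the shorter value sorts first (assumption).
    return len(a) < len(b)
```

The unsigned part matters for bytes with the high bit set: `0x80` sorts after `0x7f` here, whereas a signed comparison would place it first.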

Comment thread LogicalTypes.md
### GEOMETRY

`GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format
with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY`

As with the other PR, I think "edges" is deliberate

Comment thread README.md
values in the column).

- Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED.
+ Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is now used as it supersedes `BIT_PACKED`.

Suggested change
- Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is now used as it supersedes `BIT_PACKED`.
+ Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is currently used as it supersedes `BIT_PACKED`.

Comment thread Compression.md

A codec based on the Zstandard format defined by
- [RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises
+ [RFC 8878](https://tools.ietf.org/html/rfc8878). If any ambiguity arises

Nice catch!
