Fix specification typos, grammar, inconsistencies, and errors #572
Closed
iemejia wants to merge 4 commits into
- BloomFilter.md: Fix block_check pseudocode (setBit -> isSet)
- BloomFilter.md: Fix struct name to match thrift (BloomFilterHeader)
- parquet.thrift: Fix typos ("to be be", "documention", "not necessary")
- parquet.thrift: Remove off-by-one in DataPageHeaderV2 is_compressed comment
- README.md: Fix repetition level value for non-nested columns (1 -> 0)
- README.md: Update defunct Twitter Code of Conduct links to ASF
- LogicalTypes.md: Fix embedded types ordering contradiction
- LogicalTypes.md: Add nanosecond to TIME precision description
- VariantEncoding.md: Fix BINARY -> BYTE_ARRAY for equivalent Parquet type
- VariantEncoding.md: Add note on decimal little-endian vs big-endian difference
- Compression.md: Fix ZSTD RFC reference (8478 -> 8878)
- Encryption.md: Fix double-negative and align GCM invocation limit to NIST
- Encodings.md: Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY
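To illustrate why the `block_check` fix (setBit -> isSet) matters, here is a minimal Python sketch of a split-block Bloom filter block as I understand it from BloomFilter.md. The function names mirror the spec's pseudocode, but representing a block as a list of eight 32-bit integers is an assumption made for this example, not the reference implementation. The point is that a check must only read bits; calling a setter there would mutate the filter and make every lookup succeed.

```python
# Salt constants as listed in the Parquet Bloom filter specification.
SALT = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
        0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)


def mask(x):
    """Return eight 32-bit words, each with exactly one bit set."""
    words = []
    for salt in SALT:
        y = (x * salt) & 0xFFFFFFFF   # multiply modulo 2**32
        words.append(1 << (y >> 27))  # top 5 bits select one of 32 positions
    return words


def block_insert(block, x):
    """Set the masked bits in the block (this is where setting belongs)."""
    for i, word in enumerate(mask(x)):
        block[i] |= word


def block_check(block, x):
    """Read-only test: every masked bit must already be set in the block."""
    return all((block[i] & word) == word for i, word in enumerate(mask(x)))


block = [0] * 8                # hypothetical in-memory block: 8 x 32-bit words
before = block_check(block, 42)
block_insert(block, 42)
after = block_check(block, 42)
```

In this sketch `before` is false and `after` is true, and `block_check` leaves `block` unchanged.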
…ions
Round-2 cleanup pass over the spec docs. Highlights:
Bugs / clear errors:
- PageIndex.md, parquet.thrift: fix ""Blart Versenwald III" double-quote typo
- VariantShredding.md: fix Python syntax error (`: Variant:` -> `-> Variant:`)
- VariantShredding.md: replace BINARY with BYTE_ARRAY (BINARY is not a Parquet
physical type); fix `,` -> `:` inside a JSON-like literal in a table cell
- BloomFilter.md: include missing `bloom_filter_length` field in the
ColumnMetaData snippet (it exists in parquet.thrift)
- Encodings.md: "bitwidth of each block" -> "each miniblock" in
DELTA_BINARY_PACKED description
- README.md: add missing colon and code formatting in BIT_PACKED/RLE sentence
- LogicalTypes.md: fix TIME unit list punctuation, "Despite there is" grammar
Cross-document consistency:
- LogicalTypes.md: align DECIMAL precision/scale wording with parquet.thrift
("should" -> "must")
- Geospatial.md: use uppercase edge-interpolation algorithm names
(SPHERICAL/VINCENTY/...) to match parquet.thrift enum and LogicalTypes.md
- Geospatial.md: make srid: prefix consistent (lowercase example)
- VariantEncoding.md, VariantShredding.md: use `INT(N, true)` notation
consistent with LogicalTypes.md syntax
- VariantEncoding.md: make sorted_strings description consistent across the
three places it is defined
- Encodings.md: PLAIN BOOLEAN no longer links to the deprecated MSB-first
BITPACKED section; references the RLE/bit-packing hybrid section instead
- parquet.thrift: disambiguate `compressed_page_size` comment in PageLocation
(it includes the header; the field of the same name on PageHeader does not)
Coherence/clarity:
- VariantEncoding.md: label undocumented reserved bits in metadata header,
object value_header, and array value_header diagrams
- VariantEncoding.md: fix decimal implied-precision formula for val <= 0
- Encodings.md: BIT_PACKED is already replaced, not "will be replaced"
- Encryption.md: replace "allows to" with idiomatic English
- Encryption.md: "Data PageHeader" -> "Data Page Header" (spacing)
- BloomFilter.md: "64 bits version" -> "the 64-bit version"
- Geospatial.md: fix XYZM table column alignment
Round 3 of specification cleanup. 28 minor fixes across 8 files:
CONTRIBUTING.md (7 typos):
docuemnt, an prototype, demostrate, interopability,
libaries, highlighed, compatiblity
Encryption.md (6):
- 'reflects the identity' -> 'reflect the identity' (plural subject)
- explictly -> explicitly
- Data/Dictionary PageHeader -> Data/Dictionary Page Header (spacing,
same as previous fix in section 4.1, second occurrence in section 4.13)
- Removed double space after 'right after'
- 'the the FileMetaData' -> 'the FileMetaData'
- Smart quotes 'PAR1' -> ASCII "PAR1" for magic-bytes literal
Encodings.md (1):
'at at time' -> 'at a time'
PageIndex.md (1):
Added missing terminal period after parquet.thrift link
parquet.thrift (4):
- 'a element' -> 'an element' (DataPageHeaderV2.num_nulls comment)
- Terminal periods after 'It was never used' and 'use PLAIN instead'
- Rewrote BIT_PACKED comment to clarify it is superseded by RLE and
cross-reference Encodings.md
Compression.md (1):
Removed double space before [Brotli] link
LogicalTypes.md (7):
- Terminal periods after INT(8, true) and INT(8, false) paragraphs
- Removed invalid <tr colspan=3> from three logical-type tables
(<tr> does not accept colspan; the intended colspan is already
on the <th> header row)
- Made 'precision' consistent with backticked identifier style
- 'NANs' -> 'NaNs'
- 'Despite there is no' -> 'Although there is no'
(same fix as round 2 in a different paragraph)
- 'In case of not present' -> 'If not present'
VariantEncoding.md (1):
Hyphenated '1 byte', '2 byte', '4 byte', '8 byte' to '1-byte', etc.
in primitive-types table (character-count-neutral edit, column widths
preserved)
Round 4 of specification cleanup. 52 minor fixes across 13 files.
parquet.thrift (6):
- L676: 'a OffsetIndex' -> 'an OffsetIndex'
- L427/446/452: 'edges interpolation' -> 'edge interpolation'
(Geospatial doc comments; align with thrift enum/struct naming)
- L44: 'frameworks(e.g. hive, pig)' -> 'frameworks (e.g. Hive, Pig)'
(missing space + capitalize proper nouns)
- L816: 'GZip' -> 'GZIP' (match Compression.md heading and enum casing)
Geospatial.md (8):
- L31: Missing space before parenthesis: 'OGC(' -> 'OGC ('
- L50: 'well known' -> 'well-known' (compound adjective, 2 occurrences)
- L61: ', and is also commonly used' -> '. It is also commonly used'
(comma splice between independent clauses)
- L97: 'Y Values' -> 'Y values' (mid-sentence; sibling 'X values' lowercase)
- L137: Added 'The' before bullet sentence for grammaticality
- L157: Markdown heading: bare line -> proper '# Coordinate Axis Order'
- L72/73: 'edges interpolation' -> 'edge interpolation'
LogicalTypes.md (11):
- 'to describes' -> 'that describes'
- DECIMAL scale/precision: 'integer literal annotation' wording made
precise ('non-negative integer' / 'positive integer')
- TIME L302-303: 'annotation' -> 'annotations'; 'implementation' ->
'implementations' (plural agreement)
- L302/353/360: Oxford comma added before 'or NANOS' (consistency with
earlier paragraph)
- L449/460/478/479: 'can not' -> 'cannot' (consistent with rest of repo)
- L608/623: 'edges interpolation' -> 'edge interpolation'
README.md (8):
- L191: 'encoded values for the data page is' -> 'are' (plural subject)
- L193/195: 'it would always' / 'if encoded, it will' -> 'they would' /
'they will' (refer to plural repetition/definition levels)
- L137-140: '1 bit' / '32 bit' / '64 bit' / '96 bit' / 'fixed length' ->
hyphenated forms (compound adjectives)
- L225: 'GZip' -> 'GZIP' (same as thrift)
- L240: 'rc or avro files' -> 'RCFile or Avro files' (proper nouns)
- L243: Removed trailing period from heading
- L257: 'fine grained' -> 'fine-grained' (compound adjective)
CONTRIBUTING.md (6):
- L46: Removed extra 'the'; added comma after 'feature'
- L46: 'features desirability' -> "a feature's desirability"
(possessive apostrophe)
- L53: 'After the first two steps are complete' + 'After the vote passes':
added missing commas after introductory clauses
- L58: 'an external dependencies' -> 'an external dependency' (agreement)
- L70: Comma splice between independent clauses -> semicolon
- L90: Removed trailing period from heading
BinaryProtocolExtensions.md (10):
- L29/53/56/79: 'FileMetadata' -> 'FileMetaData' (4 occurrences; match
the canonical thrift struct name used elsewhere in this file)
- L29: 'implementers which MUST' -> 'implementers who MUST' (who for
people)
- L70: 'extension shared publicly' -> 'extension is shared publicly'
(missing copula)
- L53/55/72/77/80: 'Flatbuffers' / 'flatbuffer' -> 'FlatBuffers' (5
occurrences; project's official capitalization, already used twice
elsewhere in the same file)
Encodings.md (5):
- L57: 'can not' -> 'cannot'
- L72: '4 byte little endian' -> '4-byte little endian' (compound
adjective; preserved 'little endian' as in surrounding text)
- L134-135: '32 bit register' / '64 bit register' -> '32-bit' / '64-bit'
- L178: 'max definition level as 3' -> 'max definition level was 3'
(parallel construction with previous clause)
- L175: 'RLE/Bit packed described above' -> 'RLE/Bit-Packed described
above' (match section heading 'Bit-packed (Deprecated)')
- L235: 'list of bit packed ints' -> 'list of bit-packed ints'
Compression.md (2):
- L51: 'provided by Google Snappy [library]' -> 'provided by the
[Snappy compression library]' (match parallel construction used for
GZIP, BROTLI, ZSTD sibling entries in the same file)
- L61: Comma splice between independent clauses -> semicolon
Encryption.md (7):
- L82: Removed double space after 'pages and'
- L250: Removed double space after 'files'
- L289: Heading '## 5 File Format' -> '## 5. File Format' (period to
match sibling headings ## 5.1, ## 5.2, ## 5.3)
- L256-258: '2 byte short, little endian' -> '2-byte short, little-endian'
(compound adjectives; 3 lines)
- L396: 'from a secret data' -> 'from secret data' (mass noun; no
article)
- L412: '/** Column metadata for this chunk.. **/' -> 'this chunk' (two
trailing dots is neither sentence end nor ellipsis)
BloomFilter.md (4):
- L268-270: 'multi-block bloom filter' / 'bloom filter header' /
'bloom filter bitset' / 'bloom filter bit set' -> normalized casing
('Bloom filter' as a proper noun, consistent with the rest of this
file and with the thrift comments) and 'bit set' -> 'bitset'
- L311: Added terminal period in '/** The size of bitset in bytes **/'
- L317: Added terminal period in '/** The compression used in the
Bloom filter **/'
PageIndex.md (2):
- L20: Title-case in heading: 'page index' -> 'Page Index' (proper noun;
rest of heading already title-cased)
- L40: 'one data page per the retrieved column' -> 'one data page per
retrieved column' (article+'per' is ungrammatical)
VariantShredding.md (2):
- L297: Python bug 'for (name, field) in typed_value' ->
'for (name, field) in typed_value.items()' (iterating a dict yields
keys only; the unpacking would fail at runtime — verified with
'python3 -c' that the original raises ValueError)
- L151: Removed trailing space inside backticks of table header
('`typed_value `' -> '`typed_value`')
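The `.items()` fix above can be reproduced with a small Python sketch. The dictionary contents and helper names below are hypothetical and chosen only for illustration; they are not taken from VariantShredding.md. Iterating a dict directly yields its keys, so unpacking each key string into `(name, field)` fails unless the key happens to be exactly two characters long.

```python
# Hypothetical shredded object; field names are illustrative only.
typed_value = {"event_type": "noop", "event_ts": 1729794114937}


def collect_fields(mapping):
    """Corrected form: .items() yields (key, value) pairs."""
    return [(name, field) for (name, field) in mapping.items()]


def collect_fields_buggy(mapping):
    """Original form: iterating the dict yields keys only."""
    return [(name, field) for (name, field) in mapping]


pairs = collect_fields(typed_value)

try:
    collect_fields_buggy(typed_value)
    buggy_error = None
except ValueError as exc:
    # Unpacking the 10-character key "event_type" into two names fails.
    buggy_error = exc
```

Note that a two-character key such as "ab" would unpack silently into `('a', 'b')`, which is arguably worse than the exception because it hides the bug.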
VariantEncoding.md (3 lines, 8 hyphenations):
- L145: '3 byte offsets' -> '3-byte offsets' (compound adjective; the
same sentence already uses '1-byte', '2-byte', '4-byte' hyphenated)
- L373-375: 'a 1 or 4 byte' / 'a 1, 2, 3 or 4 byte' -> '1- or 4-byte' /
'1-, 2-, 3-, or 4-byte' (hyphenated number/unit compounds)
- L386: '3 byte IDs' -> '3-byte IDs'
- L387: 'one or four byte value' -> 'one- or four-byte value'
Thrift validation passes after these edits (only pre-existing doctext
warnings on lines 18, 22, 588 remain — unrelated to any fix).
Contributor
Thanks @iemejia - it would help me review this if you could break it into several smaller PRs, as I think different people have different levels of expertise and it will be easier to get reviews on smaller proposed changes. For example, I am not sure if the
This was referenced Jun 2, 2026
Member
Author
Closing in favor of smaller, concept-focused PRs for easier review:
Summary
This PR fixes numerous typos, grammar issues, inconsistencies, and minor errors across the Parquet format specification documents. The changes span 13 files with 4 focused cleanup commits.
Changes
Commit 1: Fix specification inconsistencies, typos, and errors
block_check pseudocode (setBit -> isSet); fix struct name to match thrift
Commit 2: Fix more specification inconsistencies and clarify ambiguous descriptions
bloom_filter_length field
Commit 3: Fix additional typos, grammar, invalid HTML, and consistency issues (28 fixes)
<tr colspan=3>, NaN casing, grammar)
Commit 4: Fix additional typos, grammar, hyphenation, and consistency issues (52 fixes)
Validation