Fix specification typos, grammar, inconsistencies, and errors #572

Closed
iemejia wants to merge 4 commits into apache:master from iemejia:master

iemejia (Member) commented Jun 2, 2026

Summary

This PR fixes numerous typos, grammar issues, inconsistencies, and minor errors across the Parquet format specification documents. The changes span 13 files with 4 focused cleanup commits.

Changes

Commit 1: Fix specification inconsistencies, typos, and errors

  • BloomFilter.md: Fix block_check pseudocode (setBit -> isSet); fix struct name to match thrift
  • parquet.thrift: Fix typos ("to be be", "documention", "not necessary"); remove off-by-one in DataPageHeaderV2 comment
  • README.md: Fix repetition level value for non-nested columns (1 -> 0); update defunct Twitter CoC links to ASF
  • LogicalTypes.md: Fix embedded types ordering contradiction; add nanosecond to TIME precision
  • VariantEncoding.md: Fix BINARY -> BYTE_ARRAY; add decimal endianness note
  • Compression.md: Fix ZSTD RFC reference (8478 -> 8878)
  • Encryption.md: Fix double-negative; align GCM invocation limit to NIST
  • Encodings.md: Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY
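
The `setBit -> isSet` fix in the `block_check` pseudocode matters because a membership check should only read bits, never write them. A minimal Python sketch of the two block operations follows. It assumes the split-block layout of eight 32-bit words per block; the function names mirror the spec pseudocode, while the list-of-ints representation is illustrative and the salt constants are reproduced from memory and should be verified against BloomFilter.md.

```python
# Sketch only: a block is modeled as a list of eight 32-bit integers.
# ASSUMPTION: salt constants as recalled from BloomFilter.md; verify before reuse.
SALT = (
    0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
    0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31,
)


def mask(x):
    """Return eight words, each with exactly one bit set, derived from x."""
    words = []
    for salt in SALT:
        y = (x * salt) & 0xFFFFFFFF   # emulate unsigned 32-bit multiplication
        words.append(1 << (y >> 27))  # top 5 bits pick one of 32 bit positions
    return words


def block_insert(block, x):
    """Set every bit selected by mask(x). This is the only operation that writes."""
    for i, m in enumerate(mask(x)):
        block[i] |= m


def block_check(block, x):
    """Read-only test: True only if every bit selected by mask(x) is already set."""
    return all((block[i] & m) == m for i, m in enumerate(mask(x)))


block = [0] * 8
before = block_check(block, 42)   # nothing inserted yet
block_insert(block, 42)
after = block_check(block, 42)    # the inserted value is now reported as present
```

With the original pseudocode, a check that called a bit-setting method would have mutated the filter on every lookup; the sketch keeps `block_check` free of side effects.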

Commit 2: Fix more specification inconsistencies and clarify ambiguous descriptions

  • PageIndex.md, parquet.thrift: Fix double-quote typo
  • VariantShredding.md: Fix Python syntax error; replace BINARY with BYTE_ARRAY
  • BloomFilter.md: Include missing bloom_filter_length field
  • Encodings.md: "bitwidth of each block" -> "each miniblock"
  • LogicalTypes.md: Align DECIMAL precision/scale wording with thrift
  • Geospatial.md: Use uppercase edge-interpolation algorithm names to match thrift enum
  • VariantEncoding.md: Label undocumented reserved bits; fix decimal implied-precision formula

Commit 3: Fix additional typos, grammar, invalid HTML, and consistency issues (28 fixes)

  • CONTRIBUTING.md: 7 typos (docuemnt, interopability, libaries, etc.)
  • Encryption.md: 6 fixes (plural agreement, explictly, smart quotes, double spaces)
  • LogicalTypes.md: 7 fixes (invalid <tr colspan=3>, NaN casing, grammar)
  • parquet.thrift: 4 fixes (article agreement, terminal periods, BIT_PACKED comment)
  • Encodings.md, Compression.md, PageIndex.md, VariantEncoding.md: Minor fixes

Commit 4: Fix additional typos, grammar, hyphenation, and consistency issues (52 fixes)

  • parquet.thrift: Article agreement, edge interpolation, proper noun capitalization
  • Geospatial.md: Compound adjectives, comma splices, heading formatting
  • LogicalTypes.md: Grammar, Oxford commas, "can not" -> "cannot"
  • README.md: Plural agreement, compound adjective hyphenation, proper nouns
  • BinaryProtocolExtensions.md: FileMetaData casing, FlatBuffers capitalization
  • Encodings.md, Compression.md, Encryption.md, BloomFilter.md, PageIndex.md, VariantShredding.md, VariantEncoding.md: Various grammar, punctuation, and consistency fixes

Validation

  • Thrift definition compiles cleanly after all changes
  • No semantic/behavioral changes to the format specification
  • All fixes are documentation-only (typos, grammar, consistency, correctness of descriptions)

iemejia added 4 commits June 2, 2026 17:08
- BloomFilter.md: Fix block_check pseudocode (setBit -> isSet)
- BloomFilter.md: Fix struct name to match thrift (BloomFilterHeader)
- parquet.thrift: Fix typos ("to be be", "documention", "not necessary")
- parquet.thrift: Remove off-by-one in DataPageHeaderV2 is_compressed comment
- README.md: Fix repetition level value for non-nested columns (1 -> 0)
- README.md: Update defunct Twitter Code of Conduct links to ASF
- LogicalTypes.md: Fix embedded types ordering contradiction
- LogicalTypes.md: Add nanosecond to TIME precision description
- VariantEncoding.md: Fix BINARY -> BYTE_ARRAY for equivalent Parquet type
- VariantEncoding.md: Add note on decimal little-endian vs big-endian difference
- Compression.md: Fix ZSTD RFC reference (8478 -> 8878)
- Encryption.md: Fix double-negative and align GCM invocation limit to NIST
- Encodings.md: Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY
…ions

Round-2 cleanup pass over the spec docs. Highlights:

Bugs / clear errors:
- PageIndex.md, parquet.thrift: fix ""Blart Versenwald III" double-quote typo
- VariantShredding.md: fix Python syntax error (`: Variant:` -> `-> Variant:`)
- VariantShredding.md: replace BINARY with BYTE_ARRAY (BINARY is not a Parquet
  physical type); fix `,` -> `:` inside a JSON-like literal in a table cell
- BloomFilter.md: include missing `bloom_filter_length` field in the
  ColumnMetaData snippet (it exists in parquet.thrift)
- Encodings.md: "bitwidth of each block" -> "each miniblock" in
  DELTA_BINARY_PACKED description
- README.md: add missing colon and code formatting in BIT_PACKED/RLE sentence
- LogicalTypes.md: fix TIME unit list punctuation, "Despite there is" grammar
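
To illustrate the `: Variant:` -> `-> Variant:` fix, the short sketch below uses Python's built-in `compile` to show that the first form is rejected by the parser while the second is accepted. The function name `reconstruct` and its parameters are hypothetical stand-ins, not the signature used in VariantShredding.md.

```python
# Hypothetical signatures; only the return-annotation syntax is the point.
broken = "def reconstruct(metadata, value): Variant:\n    pass\n"
fixed = "def reconstruct(metadata, value) -> Variant:\n    pass\n"


def compiles(source):
    """Return True if the source parses and compiles, False on SyntaxError."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


broken_ok = compiles(broken)  # the ': Variant:' form does not parse
fixed_ok = compiles(fixed)    # the '-> Variant:' form does
```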

Cross-document consistency:
- LogicalTypes.md: align DECIMAL precision/scale wording with parquet.thrift
  ("should" -> "must")
- Geospatial.md: use uppercase edge-interpolation algorithm names
  (SPHERICAL/VINCENTY/...) to match parquet.thrift enum and LogicalTypes.md
- Geospatial.md: make srid: prefix consistent (lowercase example)
- VariantEncoding.md, VariantShredding.md: use `INT(N, true)` notation
  consistent with LogicalTypes.md syntax
- VariantEncoding.md: make sorted_strings description consistent across the
  three places it is defined
- Encodings.md: PLAIN BOOLEAN no longer links to the deprecated MSB-first
  BITPACKED section; references the RLE/bit-packing hybrid section instead
- parquet.thrift: disambiguate `compressed_page_size` comment in PageLocation
  (it includes the header; the field of the same name on PageHeader does not)

Coherence/clarity:
- VariantEncoding.md: label undocumented reserved bits in metadata header,
  object value_header, and array value_header diagrams
- VariantEncoding.md: fix decimal implied-precision formula for val <= 0
- Encodings.md: BIT_PACKED is already replaced, not "will be replaced"
- Encryption.md: replace "allows to" with idiomatic English
- Encryption.md: "Data PageHeader" -> "Data Page Header" (spacing)
- BloomFilter.md: "64 bits version" -> "the 64-bit version"
- Geospatial.md: fix XYZM table column alignment

Round 3 of specification cleanup. 28 minor fixes across 8 files:

CONTRIBUTING.md (7 typos):
  docuemnt, an prototype, demostrate, interopability,
  libaries, highlighed, compatiblity

Encryption.md (6):
  - 'reflects the identity' -> 'reflect the identity' (plural subject)
  - explictly -> explicitly
  - Data/Dictionary PageHeader -> Data/Dictionary Page Header (spacing,
    same as previous fix in section 4.1, second occurrence in section 4.13)
  - Removed double space after 'right after'
  - 'the the FileMetaData' -> 'the FileMetaData'
  - Smart quotes 'PAR1' -> ASCII "PAR1" for magic-bytes literal

Encodings.md (1):
  'at at time' -> 'at a time'

PageIndex.md (1):
  Added missing terminal period after parquet.thrift link

parquet.thrift (4):
  - 'a element' -> 'an element' (DataPageHeaderV2.num_nulls comment)
  - Terminal periods after 'It was never used' and 'use PLAIN instead'
  - Rewrote BIT_PACKED comment to clarify it is superseded by RLE and
    cross-reference Encodings.md

Compression.md (1):
  Removed double space before [Brotli] link

LogicalTypes.md (7):
  - Terminal periods after INT(8, true) and INT(8, false) paragraphs
  - Removed invalid <tr colspan=3> from three logical-type tables
    (<tr> does not accept colspan; the intended colspan is already
    on the <th> header row)
  - Made 'precision' consistent with backticked identifier style
  - 'NANs' -> 'NaNs'
  - 'Despite there is no' -> 'Although there is no'
    (same fix as round 2 in a different paragraph)
  - 'In case of not present' -> 'If not present'

VariantEncoding.md (1):
  Hyphenated '1 byte', '2 byte', '4 byte', '8 byte' to '1-byte', etc.
  in primitive-types table (character-count-neutral edit, column widths
  preserved)

Round 4 of specification cleanup. 52 minor fixes across 13 files.

parquet.thrift (6):
  - L676: 'a OffsetIndex' -> 'an OffsetIndex'
  - L427/446/452: 'edges interpolation' -> 'edge interpolation'
    (Geospatial doc comments; align with thrift enum/struct naming)
  - L44: 'frameworks(e.g. hive, pig)' -> 'frameworks (e.g. Hive, Pig)'
    (missing space + capitalize proper nouns)
  - L816: 'GZip' -> 'GZIP' (match Compression.md heading and enum casing)

Geospatial.md (8):
  - L31: Missing space before parenthesis: 'OGC(' -> 'OGC ('
  - L50: 'well known' -> 'well-known' (compound adjective, 2 occurrences)
  - L61: ', and is also commonly used' -> '. It is also commonly used'
    (comma splice between independent clauses)
  - L97: 'Y Values' -> 'Y values' (mid-sentence; sibling 'X values' lowercase)
  - L137: Added 'The' before bullet sentence for grammaticality
  - L157: Markdown heading: bare line -> proper '# Coordinate Axis Order'
  - L72/73: 'edges interpolation' -> 'edge interpolation'

LogicalTypes.md (11):
  - 'to describes' -> 'that describes'
  - DECIMAL scale/precision: 'integer literal annotation' wording made
    precise ('non-negative integer' / 'positive integer')
  - TIME L302-303: 'annotation' -> 'annotations'; 'implementation' ->
    'implementations' (plural agreement)
  - L302/353/360: Oxford comma added before 'or NANOS' (consistency with
    earlier paragraph)
  - L449/460/478/479: 'can not' -> 'cannot' (consistent with rest of repo)
  - L608/623: 'edges interpolation' -> 'edge interpolation'

README.md (8):
  - L191: 'encoded values for the data page is' -> 'are' (plural subject)
  - L193/195: 'it would always' / 'if encoded, it will' -> 'they would' /
    'they will' (refer to plural repetition/definition levels)
  - L137-140: '1 bit' / '32 bit' / '64 bit' / '96 bit' / 'fixed length' ->
    hyphenated forms (compound adjectives)
  - L225: 'GZip' -> 'GZIP' (same as thrift)
  - L240: 'rc or avro files' -> 'RCFile or Avro files' (proper nouns)
  - L243: Removed trailing period from heading
  - L257: 'fine grained' -> 'fine-grained' (compound adjective)

CONTRIBUTING.md (6):
  - L46: Removed extra 'the'; added comma after 'feature'
  - L46: 'features desirability' -> "a feature's desirability"
    (possessive apostrophe)
  - L53: 'After the first two steps are complete' + 'After the vote passes':
    added missing commas after introductory clauses
  - L58: 'an external dependencies' -> 'an external dependency' (agreement)
  - L70: Comma splice between independent clauses -> semicolon
  - L90: Removed trailing period from heading

BinaryProtocolExtensions.md (10):
  - L29/53/56/79: 'FileMetadata' -> 'FileMetaData' (4 occurrences; match
    the canonical thrift struct name used elsewhere in this file)
  - L29: 'implementers which MUST' -> 'implementers who MUST' (who for
    people)
  - L70: 'extension shared publicly' -> 'extension is shared publicly'
    (missing copula)
  - L53/55/72/77/80: 'Flatbuffers' / 'flatbuffer' -> 'FlatBuffers' (5
    occurrences; project's official capitalization, already used twice
    elsewhere in the same file)

Encodings.md (5):
  - L57: 'can not' -> 'cannot'
  - L72: '4 byte little endian' -> '4-byte little endian' (compound
    adjective; preserved 'little endian' as in surrounding text)
  - L134-135: '32 bit register' / '64 bit register' -> '32-bit' / '64-bit'
  - L178: 'max definition level as 3' -> 'max definition level was 3'
    (parallel construction with previous clause)
  - L175: 'RLE/Bit packed described above' -> 'RLE/Bit-Packed described
    above' (match section heading 'Bit-packed (Deprecated)')
  - L235: 'list of bit packed ints' -> 'list of bit-packed ints'

Compression.md (2):
  - L51: 'provided by Google Snappy [library]' -> 'provided by the
    [Snappy compression library]' (match parallel construction used for
    GZIP, BROTLI, ZSTD sibling entries in the same file)
  - L61: Comma splice between independent clauses -> semicolon

Encryption.md (7):
  - L82: Removed double space after 'pages and'
  - L250: Removed double space after 'files'
  - L289: Heading '## 5 File Format' -> '## 5. File Format' (period to
    match sibling headings ## 5.1, ## 5.2, ## 5.3)
  - L256-258: '2 byte short, little endian' -> '2-byte short, little-endian'
    (compound adjectives; 3 lines)
  - L396: 'from a secret data' -> 'from secret data' (mass noun; no
    article)
  - L412: '/** Column metadata for this chunk.. **/' -> 'this chunk' (two
    trailing dots is neither sentence end nor ellipsis)

BloomFilter.md (4):
  - L268-270: 'multi-block bloom filter' / 'bloom filter header' /
    'bloom filter bitset' / 'bloom filter bit set' -> normalized casing
    ('Bloom filter' as a proper noun, consistent with the rest of this
    file and with the thrift comments) and 'bit set' -> 'bitset'
  - L311: Added terminal period in '/** The size of bitset in bytes **/'
  - L317: Added terminal period in '/** The compression used in the
    Bloom filter **/'

PageIndex.md (2):
  - L20: Title-case in heading: 'page index' -> 'Page Index' (proper noun;
    rest of heading already title-cased)
  - L40: 'one data page per the retrieved column' -> 'one data page per
    retrieved column' (article+'per' is ungrammatical)

VariantShredding.md (2):
  - L297: Python bug 'for (name, field) in typed_value' ->
    'for (name, field) in typed_value.items()' (iterating a dict yields
    keys only; the unpacking would fail at runtime — verified with
    'python3 -c' that the original raises ValueError)
  - L151: Removed trailing space inside backticks of table header
    ('`typed_value `' -> '`typed_value`')
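
The `.items()` fix can be reproduced in isolation. The sketch below is a standalone illustration, not the code from VariantShredding.md; the dictionary contents and helper names are hypothetical.

```python
# Hypothetical shredded fields keyed by name.
typed_value = {"event_type": "noop", "event_ts": 1729794114937}


def pairs_without_items(d):
    # Iterating a dict yields keys only, so each key string is unpacked
    # into (name, field); this fails unless a key happens to have length 2.
    return [(name, field) for (name, field) in d]


def pairs_with_items(d):
    # .items() yields (key, value) tuples, which unpack as intended.
    return [(name, field) for (name, field) in d.items()]


try:
    pairs_without_items(typed_value)
    outcome = "no error"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

pairs = pairs_with_items(typed_value)
```

Note that the original loop would not fail for every input: a two-character key would unpack silently into two single characters, which is arguably worse than the exception.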

VariantEncoding.md (3 lines, 8 hyphenations):
  - L145: '3 byte offsets' -> '3-byte offsets' (compound adjective; the
    same sentence already uses '1-byte', '2-byte', '4-byte' hyphenated)
  - L373-375: 'a 1 or 4 byte' / 'a 1, 2, 3 or 4 byte' -> '1- or 4-byte' /
    '1-, 2-, 3-, or 4-byte' (hyphenated number/unit compounds)
  - L386: '3 byte IDs' -> '3-byte IDs'
  - L387: 'one or four byte value' -> 'one- or four-byte value'

Thrift validation passes after these edits (only pre-existing doctext
warnings on lines 18, 22, 588 remain — unrelated to any fix).

alamb (Contributor) commented Jun 2, 2026

Thanks @iemejia - it would help me review this if you could break it into several smaller PRs -- as I think different people have different levels of expertise and I think it will be easier to get reviews on smaller proposed changes

For example, I am not sure if the edges --> edge interpolation change is technically accurate, but I feel bad pinging the geospatial people on such a large PR

iemejia (Member, Author) commented Jun 2, 2026

Closing in favor of smaller, concept-focused PRs for easier review:

iemejia closed this Jun 2, 2026