apache · iemejia · May 23, 2026
diff --git a/BinaryProtocolExtensions.md b/BinaryProtocolExtensions.md
@@ -26,11 +26,11 @@ The extension mechanism of the `binary` Thrift field-id `32767` has some desirab
 * The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.  
 * Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the thrift IDL).
 
-Because only one field-id is reserved the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers which MUST put enough unique state in their extension bytes for disambiguation. This can be relatively easily achieved by adding a [UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the start or end of the extension bytes. The extension does not specify a disambiguation mechanism to allow more flexibility to implementers.
+Because only one field-id is reserved, the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers, who MUST put enough unique state in their extension bytes for disambiguation. This can be achieved relatively easily by adding a [UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the start or end of the extension bytes. The extension mechanism does not specify how to disambiguate, which gives implementers more flexibility.
 
 Putting everything together in an example, if we would extend `FileMetaData` it would look like this on the wire.
 
-    N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field)
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
     4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
     1-5 bytes | ULEB128(M) encoded size of the extension
     M bytes   | extension bytes
@@ -50,14 +50,14 @@ To illustrate the applicability of the extension mechanism we provide examples o
 
 ### Footer
 
-A variant of `FileMetaData` encoded in Flatbuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with.
+A variant of `FileMetaData` encoded in FlatBuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with.
 
 In its private form the footer of a Parquet file will look like so:
 
-    N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field)
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
     4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
     1-5 bytes | ULEB128(K+28) encoded size of the extension
-    K bytes   | Flatbuffers representation (v0) of FileMetaData
+    K bytes   | FlatBuffers representation (v0) of FileMetaData
     4 bytes   | little-endian crc32(flatbuffer)
     4 bytes   | little-endian size(flatbuffer)
     4 bytes   | little-endian crc32(size(flatbuffer))
@@ -67,20 +67,20 @@ In its private form the footer of a Parquet file will look like so:
 
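-some-UUID is some UUID picked for this extension and it is used throughout (possibly internal) experimentation. It is put at the end to allow detection of the extension when parsed in reverse. The little-endian sizes and crc32s are also to the end to facilitate efficient parsing the footer in reverse without requiring parsing the Thrift compact protocol that precedes it.
+some-UUID is a UUID picked for this extension and used throughout (possibly internal) experimentation. It is put at the end to allow detection of the extension when parsed in reverse. The little-endian sizes and crc32s are also placed at the end so that the footer can be parsed efficiently in reverse, without parsing the Thrift compact protocol that precedes it.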
 
-At some point the experiments conclude and the extension shared publicly with the community. The extension is proposed for inclusion to the standard with a migration plan to replace the existing `FileMetaData`.
+At some point the experiments conclude and the extension is shared publicly with the community. The extension is proposed for inclusion in the standard, with a migration plan to replace the existing `FileMetaData`.
 
-The community reviews the proposal and (potentially) proposes changes to the Flatbuffers IDL representation. In addition, because this extension is a *replacement* of an existing struct, it must:
+The community reviews the proposal and (potentially) proposes changes to the FlatBuffers IDL representation. In addition, because this extension is a *replacement* of an existing struct, it must:
 
 1. have some way of being extended in the future much like what it replaces. Because the extension mechanism only allows for a single extension, without this in place we cannot have footer extensions during the migration.  
 2. consider its intermediate form where both the **Thrift** `FileMetaData` and the **FlatBuffers** `FileMetaData` will be present.  
 3. consider its final form where the long form header for `32767: binary` may not be present.
 
-Once the design is ratified the new `FileMetaData` encoding is made final with the following migration plan. For the next N years writers will write both the Thrift and the flatbuffer `FileMetaData`. It will look much like its private form except the flatbuffer IDL may be different:
+Once the design is ratified the new `FileMetaData` encoding is made final with the following migration plan. For the next N years writers will write both the Thrift and the FlatBuffers `FileMetaData`. It will look much like its private form except the FlatBuffers IDL may be different:
 
-    N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field)
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
     4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
     1-5 bytes | ULEB128(K+28) encoded size of the extension
-    K bytes   | Flatbuffers representation (v1) of FileMetaData
+    K bytes   | FlatBuffers representation (v1) of FileMetaData
     4 bytes   | little-endian crc32(flatbuffer)
     4 bytes   | little-endian size(flatbuffer)
     4 bytes   | little-endian crc32(size(flatbuffer))
@@ -90,7 +90,7 @@ Once the design is ratified the new `FileMetaData` encoding is made final with t
 
 After the migration period, the end of the Parquet file may look like this:
 
-    K bytes   | Flatbuffers representation (v1) of FileMetaData
+    K bytes   | FlatBuffers representation (v1) of FileMetaData
     4 bytes   | little-endian crc32(flatbuffer)
     4 bytes   | little-endian size(flatbuffer)
     4 bytes   | little-endian crc32(size(flatbuffer))
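
The wire layouts in this file all frame the extension the same way: the 4-byte long form header `08 FF FF 01`, a ULEB128-encoded length, then the extension bytes. The following is a minimal Python sketch of that framing only. The names `uleb128` and `frame_extension` are hypothetical and not part of any Parquet or Thrift library, and the sketch does not handle removing or restoring the Thrift stop field around the extension.

```python
# Long form field header for `32767: binary`, copied verbatim from the layouts above.
EXTENSION_FIELD_HEADER = bytes.fromhex("08FFFF01")


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low_bits)         # final byte: continuation bit clear
            return bytes(out)


def frame_extension(extension: bytes) -> bytes:
    """Return header + ULEB128(len(extension)) + extension, as in the layouts above."""
    return EXTENSION_FIELD_HEADER + uleb128(len(extension)) + extension
```

For example, `frame_extension(b"abc")` yields the four header bytes, a single length byte `03`, and the three payload bytes. Payloads of 128 bytes or more need a multi-byte length, which is why the layouts list the size as 1-5 bytes.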

diff --git a/BloomFilter.md b/BloomFilter.md
@@ -122,7 +122,7 @@ boolean block_check(block b, unsigned int32 x) {
   for i in [0..7] {
     for j in [0..31] {
       if (masked.getWord(i).isSet(j)) {
-        if (not b.getWord(i).setBit(j)) {
+        if (not b.getWord(i).isSet(j)) {
           return false
         }
       }
@@ -266,8 +266,8 @@ false positive rates:
 #### File Format
 
 Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
-bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
-header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
+Bloom filter consists of the Bloom filter header followed by the Bloom filter bitset. The Bloom filter
+header encodes the size of the Bloom filter bitset in bytes, which is used to read the bitset.
 
 Here are the Bloom filter definitions in thrift:
 
@@ -282,7 +282,7 @@ union BloomFilterAlgorithm {
 }
 
 /** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
- * algorithm. It uses 64 bits version of xxHash. 
+ * algorithm. It uses the 64-bit version of xxHash.
  **/
 struct XxHash {}
 
@@ -307,21 +307,29 @@ union BloomFilterCompression {
   * Bloom filter header is stored at beginning of Bloom filter data of each column
   * and followed by its bitset.
   **/
-struct BloomFilterPageHeader {
-  /** The size of bitset in bytes **/
+struct BloomFilterHeader {
+  /** The size of bitset in bytes. **/
   1: required i32 numBytes;
   /** The algorithm for setting bits. **/
   2: required BloomFilterAlgorithm algorithm;
   /** The hash function used for Bloom filter. **/
   3: required BloomFilterHash hash;
-  /** The compression used in the Bloom filter **/
+  /** The compression used in the Bloom filter. **/
   4: required BloomFilterCompression compression;
 }
 
 struct ColumnMetaData {
   ...
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data including the serialized header, in bytes.
+   * Added in 2.10, so old files may lack this field; readers can then
+   * derive it after the BloomFilterHeader has been deserialized.
+   * Writers should write this field so readers can read the Bloom filter
+   * in a single I/O.
+   */
+  15: optional i32 bloom_filter_length;
 }
 
 ```
@@ -339,8 +347,8 @@ information such as the presence of value. Therefore the Bloom filter of columns
 data should be encrypted with the column key, and the Bloom filter of other (not sensitive) columns
 do not need to be encrypted.
 
-Bloom filters have two serializable modules - the PageHeader thrift structure (with its internal
-fields, including the BloomFilterPageHeader `bloom_filter_page_header`), and the Bitset. The header
+Bloom filters have two serializable modules - the Bloom filter header (the BloomFilterHeader thrift
+structure and its internal fields), and the Bitset. The header
 structure is serialized by Thrift, and written to file output stream; it is followed by the
 serialized Bitset.
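
The `block_check` fix in the first BloomFilter.md hunk replaces `setBit` with `isSet`, so the check only reads the block and never mutates it. Because the pseudocode loops over eight words of 32 bits each, the nested loops reduce to one bitwise test per word. The following is a minimal Python sketch of that check. It assumes the block and the mask are each given as eight unsigned 32-bit integers; how the mask is derived from `x` is not shown in this diff, so the mask is passed in as a parameter and the function signature is hypothetical.

```python
from typing import Sequence

WORDS_PER_BLOCK = 8  # the pseudocode iterates over words 0..7 of 32 bits each


def block_check(block: Sequence[int], masked: Sequence[int]) -> bool:
    """Return True if every bit set in `masked` is also set in `block`.

    Read-only: neither argument is modified, matching the isSet-based pseudocode.
    """
    if len(block) != WORDS_PER_BLOCK or len(masked) != WORDS_PER_BLOCK:
        raise ValueError("block and mask must each hold eight 32-bit words")
    # (b & m) == m holds exactly when no bit of m is missing from b.
    return all((b & m) == m for b, m in zip(block, masked))
```

A `False` result means the value is definitely absent from the block; a `True` result means it may be present.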
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -43,10 +43,10 @@ The general steps for adding features to the format are as follows:
 1. Design/scoping: The goal of this phase is to identify design goals of a
    feature and provide some demonstration that the feature meets those goals.
    This phase starts with a discussion of changes on the developer mailing list
-   (dev@parquet.apache.org). Depending on the scope and goals of the feature the
-   it can be useful to provide additional artifacts as part of a discussion. The
-   artifacts can include a design docuemnt, a draft pull request to make the
-   discussion concrete and/or an prototype implementation to demostrate the
+   (dev@parquet.apache.org). Depending on the scope and goals of the feature, it
+   can be useful to provide additional artifacts as part of a discussion. The
+   artifacts can include a design document, a draft pull request to make the
+   discussion concrete and/or a prototype implementation to demonstrate the
    viability of implementation. This step is complete when there is lazy
    consensus. Part of the consensus is whether it is sufficient to provide two
    working implementations as outlined in step 2, or if demonstration of the
@@ -58,7 +58,7 @@ The general steps for adding features to the format are as follows:
 2. Completeness: The goal of this phase is to ensure the feature is viable,
    there is no ambiguity in its specification by demonstrating compatibility
    between implementations. Once a change has lazy consensus, two
-   implementations of the feature demonstrating interopability must also be
+   implementations of the feature demonstrating interoperability must also be
    provided.  One implementation MUST be
    [`parquet-java`](http://github.com/apache/parquet-java).  It is preferred
    that the second implementation be
@@ -73,35 +73,35 @@ The general steps for adding features to the format are as follows:
    fit for inclusion (for example, they were submitted as a pull request against
    the target repository and committers gave positive reviews). Reports on the
    benefits from closed source implementations are welcome and can help lend
-   weight to features desirability but are not sufficient for acceptance of a
+   weight to a feature's desirability but are not sufficient for acceptance of a
    new feature.
 
 Unless otherwise discussed, it is expected the implementations will be developed
 from their respective main branch (i.e. backporting is not required), to
 demonstrate that the feature is mergeable to its implementation.
 
-3. Ratification: After the first two steps are complete a formal vote is held on
+3. Ratification: After the first two steps are complete, a formal vote is held on
    dev@parquet.apache.org to officially ratify the feature.  After the vote
-   passes the format change is merged into the `parquet-format` repository and
+   passes, the format change is merged into the `parquet-format` repository and
    it is expected the changes from step 2 will also be merged soon after
    (implementations should not be merged until the addition has been merged to
    `parquet-format`).
 
-#### General guidelines/preferences on additions.
+#### General guidelines/preferences on additions
 
 1. To the greatest extent possible changes should have an option for forward
    compatibility (old readers can still read files). The [compatibility and
    feature enablement](#compatibility-and-feature-enablement) section below 
    provides more details on expectations for changes that break compatibility.
 
 2. New encodings should be fully specified in this repository and not
-   rely on an external dependencies for implementation (i.e. `parquet-format` is
+   rely on an external dependency for implementation (i.e. `parquet-format` is
    the source of truth for the encoding). If it does require an
    external dependency, then the external dependency must have its
    own specification separate from implementation.
 
 3. New compression mechanisms should have a pure Java implementation that can be
-   used as a dependency in `parquet-java`, exceptions may be
+   used as a dependency in `parquet-java`; exceptions may be
-   discussed on the mailing list to see if a non-native Java
+   discussed on the mailing list to see if a native (non-Java)
    implementation is acceptable.
 
@@ -154,15 +154,15 @@ recommendations for managing features:
 2. Forward compatible features/changes may be enabled and used by default in
    implementations once the parquet-format containing those changes has been
    formally released.  For features that may pose a significant performance
-   regression to older format readers, libaries should consider delaying default
+   regression to older format readers, libraries should consider delaying default
    enablement until 1 year after the release of the parquet-java implementation
    that contains the feature implementation.
 
 3. Forward incompatible features/changes should not be turned on by default
    until 2 years after the parquet-java implementation containing the feature is
    released. It is recommended that changing the default value for a forward
    incompatible feature flag should be clearly advertised to consumers (e.g. via
-   a major version release if using Semantic Versioning, or highlighed in
+   a major version release if using Semantic Versioning, or highlighted in
    release notes).
 
 For forward compatible changes which have a high chance of performance
@@ -174,7 +174,7 @@ the same timelines as `parquet-java`. Parquet-java will wait to enable features
 by default until the most conservative timelines outlined above have been
 exceeded. This timeline is an attempt to balance ensuring
 new features make their way into the ecosystem and avoiding
-breaking compatiblity for readers that are slower to adopt new standards. We
+breaking compatibility for readers that are slower to adopt new standards. We
 encourage earlier adoption of new features when an organization using Parquet
 can guarantee that all readers of the parquet files they produce can read a new
 feature.

diff --git a/Compression.md b/Compression.md
@@ -48,7 +48,7 @@ No-op codec.  Data is left uncompressed.
 A codec based on the
 [Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
 If any ambiguity arises when implementing this format, the implementation
-provided by Google Snappy [library](https://github.com/google/snappy/)
+provided by the [Snappy compression library](https://github.com/google/snappy/)
 is authoritative.
 
 ### GZIP
@@ -58,7 +58,7 @@ formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952).
 If any ambiguity arises when implementing this format, the implementation
 provided by the [zlib compression library](https://zlib.net/) is authoritative.
 
-Readers should support reading pages containing multiple GZIP members, however,
+Readers should support reading pages containing multiple GZIP members; however,
 as this has historically not been supported by all implementations, it is recommended
 that writers refrain from creating such pages by default for better interoperability.
 
@@ -72,7 +72,7 @@ A codec based on or interoperable with the
 A codec based on the Brotli format defined by
 [RFC 7932](https://tools.ietf.org/html/rfc7932).
 If any ambiguity arises when implementing this format, the implementation
-provided by the  [Brotli compression library](https://github.com/google/brotli)
+provided by the [Brotli compression library](https://github.com/google/brotli)
 is authoritative.
 
 ### LZ4
@@ -89,7 +89,7 @@ switch to the newer, interoperable `LZ4_RAW` codec.
 ### ZSTD
 
 A codec based on the Zstandard format defined by
-[RFC 8478](https://tools.ietf.org/html/rfc8478).  If any ambiguity arises
+[RFC 8878](https://tools.ietf.org/html/rfc8878).  If any ambiguity arises
 when implementing this format, the implementation provided by the
 [Zstandard compression library](https://facebook.github.io/zstd/)
 is authoritative.