PARQUET-134 patch - Support file write mode #100
Closed
masokan wants to merge 1 commit into
Member
Please create an enum for this.
Member
Thanks for contributing. I made a comment above. Otherwise this looks good to me.
Contributor
This was merged as #111. Can someone close it?
Member
@masokan could you close this pull request as the change was merged as part of another one?
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request on May 17, 2026
Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp.
For each of the four source files in parquet-testing PR apache#100
(alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads
every row, then re-encodes as Java ALP at both vectorSize=1024 and
vectorSize=4096.

Output goes to ALP_OUTPUT_DIR (default ${user.dir}/alp-java-generated/),
producing 8 files total named alp_java_<stem>_vs{1024,4096}.parquet.
Each output is verified by reading back through the standard reader
path and bit-comparing every value via doubleToRawLongBits /
floatToRawIntBits, which catches NaN payload and signed-zero
divergence, not just numerical inequality.

Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on
machines without the source datasets.

To run:

    git clone --branch alpFloatingPointDataset \
        https://github.com/prtkgaur/parquet-testing.git
    ALP_TEST_DATA_DIR=path/to/parquet-testing/data \
        mvn -pl parquet-hadoop \
        -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \
        test
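To illustrate why the commit compares raw bit patterns rather than using `==`, here is a minimal, standalone sketch. The class and method names (`BitExactCompare`, `numericallyEqual`, `bitEqual`) are hypothetical and not taken from TestInterOpReadAlp; only `Double.doubleToRawLongBits`, `Float.floatToRawIntBits`, and `Double.longBitsToDouble` are standard JDK calls.

```java
// Hypothetical illustration; not the code from TestInterOpReadAlp.
final class BitExactCompare {

    // Numerical equality: 0.0 == -0.0 is true, and NaN == NaN is false.
    static boolean numericallyEqual(double a, double b) {
        return a == b;
    }

    // Bit-exact equality on the raw IEEE 754 pattern of a double.
    static boolean bitEqual(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    // Same idea for float values.
    static boolean bitEqual(float a, float b) {
        return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
    }

    public static void main(String[] args) {
        // Two quiet NaNs that differ only in their payload bits.
        double nanA = Double.longBitsToDouble(0x7ff8000000000000L);
        double nanB = Double.longBitsToDouble(0x7ff8000000000001L);

        System.out.println("0.0 vs -0.0, ==      : " + numericallyEqual(0.0, -0.0));
        System.out.println("0.0 vs -0.0, raw bits: " + bitEqual(0.0, -0.0));
        System.out.println("NaN vs itself, ==      : " + numericallyEqual(nanA, nanA));
        System.out.println("NaN vs itself, raw bits: " + bitEqual(nanA, nanA));
        System.out.println("NaN payloads, raw bits : " + bitEqual(nanA, nanB));
    }
}
```

With `==`, a decoder that turned `-0.0` into `0.0` would pass unnoticed, and any NaN value would fail even when decoded correctly. The raw-bits comparison avoids both problems.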
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request on May 18, 2026
Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page
version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset
and ALP vector size. Output grows from 8 → 16 files per run:
alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet
Page version is orthogonal to ALP encoding — the page version
difference lives in the parquet protocol layer, not in the ALP
payload — but covering both axes makes the fixture set fully
symmetric for cross-language compatibility verification. C++/Rust/Go
readers can use the V1 and V2 variants to prove their decoders
handle Java-written ALP regardless of how the surrounding pages are
framed. Avoids an asymmetry where the existing PR apache#100 set has C++
at V1 and Java at V2 with no overlap.
All 16 outputs independently verified against the canonical
_expect.csv truth files from parquet-testing PR apache#100 (1.56M values,
0 mismatches).
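The naming scheme above can be sketched as a three-way cross product of dataset, page version, and vector size. This is an illustrative sketch only: the class `AlpFixtureNames` and the stem values are assumptions. In particular, the commit message does not say how `<stem>` is derived from the source file names, so the sketch takes stems as input rather than guessing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the real generator lives in TestInterOpReadAlp.
final class AlpFixtureNames {

    // Builds alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet for each stem.
    static List<String> fixtureNames(List<String> stems) {
        List<String> names = new ArrayList<>();
        for (String stem : stems) {
            for (int pageVersion : new int[] {1, 2}) {
                for (int vectorSize : new int[] {1024, 4096}) {
                    names.add(String.format(
                            "alp_java_%s_v%d_vs%d.parquet", stem, pageVersion, vectorSize));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed stems: the four source names with the "alp_" prefix dropped.
        List<String> stems =
                Arrays.asList("spotify1", "arade", "float_spotify1", "float_arade");
        List<String> names = fixtureNames(stems);
        names.forEach(System.out::println);
        System.out.println(names.size() + " files");
    }
}
```

Four stems, two page versions, and two vector sizes give 4 × 2 × 2 = 16 names, matching the count in the commit message.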
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request on May 18, 2026
Two new tests in TestInterOpReadAlp:

readAllFixtureFilesIndependently
  Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each
  column chunk declares Encoding.ALP and decodes through the standard
  reader path without error. Separate from the generator's own
  round-trip verification so reader correctness surfaces as a distinct
  signal in CI when the fixtures are present. Skips cleanly when
  ALP_OUTPUT_DIR is empty so it stays inert in default CI environments.

generateAndVerifyCornerCaseFixture
  Writes a single small fixture file (alp_java_cornercases.parquet,
  ~60 KB) targeting the corner cases enumerated in parquet-testing
  issue apache#105: vectors with no exceptions, one exception per
  vector, all exceptions, NaN/Inf/-0.0, constant values (bit_width=0),
  multi-vector with differing exponents, and optional columns with
  nulls. Both f32 and f64 variants are included: 14 columns × 2048 rows
  total. Reads each column back and bit-exactly verifies every value
  against the expected pattern via doubleToRawLongBits /
  floatToRawIntBits.

The corner-case fixture is intended as a candidate file for
parquet-testing PR apache#100 once naming/design is confirmed.
Generating it also surfaced (and verified the fix for) a pre-existing
reader bug where optional columns with nulls couldn't be decoded; see
the preceding commit.
No description provided.