PARQUET-134 patch - Support file write mode #100

Closed
masokan wants to merge 1 commit into apache:master from masokan:master

Conversation

@masokan (Contributor) commented Jan 7, 2015

No description provided.

Member

Please create an enum for this.

@julienledem (Member)

Thanks for contributing. I made a comment above. Otherwise this looks good to me.

@rdblue (Contributor) commented Mar 5, 2015

This was merged as #111. Can someone close it?

@julienledem (Member)

@masokan could you close this pull request as the change was merged as part of another one?
Thank you.

@asfgit asfgit closed this in 8f898da Jul 13, 2015
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 17, 2026
Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp.
For each of the four source files in parquet-testing PR apache#100
(alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads
every row, then re-encodes as Java ALP at both vectorSize=1024 and
vectorSize=4096. Output goes to ALP_OUTPUT_DIR (default
${user.dir}/alp-java-generated/), producing 8 files total named
alp_java_<stem>_vs{1024,4096}.parquet.

Each output is verified by reading back through the standard reader
path and bit-comparing every value via doubleToRawLongBits /
floatToRawIntBits — catches NaN payload and signed-zero divergence,
not just numerical equality.
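
A minimal sketch of why the raw-bits comparison described above is stricter than `==`. The class and method names (`BitExact`, `sameBits`) are hypothetical and not taken from the commit; only `Double.doubleToRawLongBits`, `Float.floatToRawIntBits`, and `Double.longBitsToDouble` are real JDK calls.

```java
public final class BitExact {
    private BitExact() {}

    /** True only when both doubles have the identical IEEE 754 bit pattern. */
    public static boolean sameBits(double expected, double actual) {
        return Double.doubleToRawLongBits(expected) == Double.doubleToRawLongBits(actual);
    }

    /** Float variant of the same check. */
    public static boolean sameBits(float expected, float actual) {
        return Float.floatToRawIntBits(expected) == Float.floatToRawIntBits(actual);
    }

    public static void main(String[] args) {
        // Signed zeros compare equal numerically but differ in the sign bit.
        System.out.println(0.0 == -0.0);          // true
        System.out.println(sameBits(0.0, -0.0));  // false

        // NaN never equals itself numerically, yet its bit pattern is stable.
        System.out.println(Double.NaN == Double.NaN);          // false
        System.out.println(sameBits(Double.NaN, Double.NaN));  // true

        // Two NaNs with different payloads: the raw-bits check is meant to
        // tell them apart (illustrative bit patterns, chosen for this sketch).
        double nanA = Double.longBitsToDouble(0x7ff8000000000000L);
        double nanB = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(sameBits(nanA, nanB));
    }
}
```

The "raw" variants are used here because `doubleToLongBits` and `floatToIntBits` collapse every NaN to one canonical pattern, which would hide payload differences.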

Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on
machines without the source datasets.

To run:
  git clone --branch alpFloatingPointDataset \
    https://github.com/prtkgaur/parquet-testing.git
  ALP_TEST_DATA_DIR=path/to/parquet-testing/data \
    mvn -pl parquet-hadoop \
    -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \
    test
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 18, 2026
Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page
version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset
and ALP vector size. Output grows from 8 → 16 files per run:

  alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet
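
As a rough illustration of how the three axes multiply out to 16 names, the pattern above can be enumerated as follows. `FixtureNames` and `names` are hypothetical helpers, and the stems in `main` are an assumption (the commit does not say how `<stem>` is derived from the source file names).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class FixtureNames {
    private FixtureNames() {}

    /** Builds one name per (stem, page version, vector size) combination. */
    public static List<String> names(List<String> stems, int[] pageVersions, int[] vectorSizes) {
        List<String> out = new ArrayList<>();
        for (String stem : stems) {
            for (int version : pageVersions) {
                for (int vectorSize : vectorSizes) {
                    out.add("alp_java_" + stem + "_v" + version + "_vs" + vectorSize + ".parquet");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stems, for illustration only.
        List<String> stems = Arrays.asList("spotify1", "arade", "float_spotify1", "float_arade");
        List<String> all = names(stems, new int[] {1, 2}, new int[] {1024, 4096});
        for (String name : all) {
            System.out.println(name);
        }
        System.out.println(all.size() + " files"); // 4 stems x 2 versions x 2 sizes = 16
    }
}
```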

Page version is orthogonal to ALP encoding — the page version
difference lives in the parquet protocol layer, not in the ALP
payload — but covering both axes makes the fixture set fully
symmetric for cross-language compatibility verification. C++/Rust/Go
readers can use the V1 and V2 variants to prove their decoders
handle Java-written ALP regardless of how the surrounding pages are
framed. Avoids an asymmetry where the existing PR apache#100 set has C++
at V1 and Java at V2 with no overlap.

All 16 outputs independently verified against the canonical
_expect.csv truth files from parquet-testing PR apache#100 (1.56M values,
0 mismatches).
vinooganesh added a commit to vinooganesh/parquet-java that referenced this pull request May 18, 2026
Two new tests in TestInterOpReadAlp:

readAllFixtureFilesIndependently
  Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each
  column chunk declares Encoding.ALP and decodes through the
  standard reader path without error. Separate from the generator's
  own round-trip verification so reader correctness surfaces as a
  distinct signal in CI when the fixtures are present. Skips
  cleanly when ALP_OUTPUT_DIR is empty so it stays inert in default
  CI environments.

generateAndVerifyCornerCaseFixture
  Writes a single small fixture file (alp_java_cornercases.parquet,
  ~60 KB) targeting the corner cases enumerated in parquet-testing
  issue apache#105: vectors with no exceptions, one exception per vector,
  all exceptions, NaN/Inf/-0.0, constant values (bit_width=0),
  multi-vector with differing exponents, and optional columns with
  nulls. Both f32 and f64 variants — 14 columns × 2048 rows total.
  Reads each column back and bit-exactly verifies every value
  against the expected pattern via doubleToRawLongBits /
  floatToRawIntBits.

The corner-case fixture is intended as a candidate file for
parquet-testing PR apache#100 once naming/design is confirmed. Generating
it also surfaced (and verified the fix for) a pre-existing reader
bug where optional columns with nulls couldn't be decoded — see the
preceding commit.