Describe the enhancement requested
ByteStreamSplitValuesWriter is the primary writer for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. Each value goes through a hot path that performs both an unnecessary allocation and N single-byte virtual dispatches.
For `FloatByteStreamSplitValuesWriter.writeFloat(float v)`:

```java
super.scatterBytes(BytesUtils.intToBytes(Float.floatToIntBits(v)));
```
`BytesUtils.intToBytes` allocates a fresh `byte[4]` on every call. `scatterBytes` then loops:

```java
for (int i = 0; i < bytes.length; ++i) {
  this.byteStreams[i].write(bytes[i]); // CapacityByteArrayOutputStream.write(int)
}
```
That is, per value: 1 byte[4] allocation + 4 single-byte virtual dispatches. For a 100k-value FLOAT page that is 100k allocations and 400k single-byte writes. DOUBLE/LONG are even worse (byte[8], 800k single-byte writes).
JMH (new ByteStreamSplitEncodingBenchmark, 100k values per invocation, JDK 18, -wi 5 -i 10 -f 3, 30 samples) on master:
| Type | ops/s | gc.alloc.rate.norm |
|--------|--------|--------------------|
| Float | 15.08M | 33.27 B/op |
| Double | 6.99M | 42.54 B/op |
| Int | 15.64M | 33.27 B/op |
| Long | 7.09M | 42.54 B/op |
The B/op figure for Float/Int (33 B) is mostly the per-value byte[4] allocation.
Proposal
Two stacked changes in ByteStreamSplitValuesWriter:
- Eliminate per-value allocation: replace `super.scatterBytes(BytesUtils.intToBytes(v))` with `bufferInt(v)` / `bufferLong(v)` helpers that perform the little-endian decomposition with bit shifts directly, with no temporary `byte[]`.
- Batch single-byte writes: accumulate `BATCH_SIZE = 128` values in a small per-instance scratch buffer and flush them as N bulk `write(byte[], off, len)` calls (one per stream), replacing `BATCH_SIZE * elementSizeInBytes` single-byte virtual dispatches with `elementSizeInBytes` bulk writes per flush. The constant was chosen by sweeping 16/32/64/128/256/512/1024; 128 is the sweet spot for FLOAT throughput while still capturing most of the DOUBLE/LONG gains.
Pending values are included in getBufferedSize() (so page-sizing decisions remain correct) and flushed in getBytes(). reset() and close() clear pending state. Only the four numeric subclasses use the batching path; FixedLenByteArrayByteStreamSplitValuesWriter continues to use scatterBytes(byte[]) since its values arrive as already-laid-out byte arrays.
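To make the two changes and their bookkeeping concrete, here is a minimal self-contained sketch for the INT32/FLOAT case. It uses plain `ByteArrayOutputStream` stand-ins, and the names (`BatchedIntSplitterSketch`, `scratch`, `toBytes`) are illustrative assumptions, not the actual members of `ByteStreamSplitValuesWriter`, which operates on `CapacityByteArrayOutputStream`:

```java
import java.io.ByteArrayOutputStream;

/**
 * Illustrative sketch of the batched byte-stream-split write path.
 * Not the actual patch: stand-in streams and hypothetical names.
 */
class BatchedIntSplitterSketch {
  private static final int BATCH_SIZE = 128;
  private static final int ELEMENT_SIZE = 4; // bytes per INT32/FLOAT value

  private final ByteArrayOutputStream[] byteStreams = new ByteArrayOutputStream[ELEMENT_SIZE];
  // One scratch row per byte stream, flushed with bulk writes when full.
  private final byte[][] scratch = new byte[ELEMENT_SIZE][BATCH_SIZE];
  private int pending = 0;

  BatchedIntSplitterSketch() {
    for (int i = 0; i < ELEMENT_SIZE; i++) {
      byteStreams[i] = new ByteArrayOutputStream();
    }
  }

  /** Change 1: little-endian decomposition with bit shifts, no temporary byte[]. */
  void bufferInt(int v) {
    scratch[0][pending] = (byte) v;
    scratch[1][pending] = (byte) (v >>> 8);
    scratch[2][pending] = (byte) (v >>> 16);
    scratch[3][pending] = (byte) (v >>> 24);
    if (++pending == BATCH_SIZE) {
      flushPending();
    }
  }

  /** Change 2: ELEMENT_SIZE bulk writes replace pending * ELEMENT_SIZE single-byte writes. */
  private void flushPending() {
    for (int i = 0; i < ELEMENT_SIZE; i++) {
      byteStreams[i].write(scratch[i], 0, pending);
    }
    pending = 0;
  }

  /** Pending values count toward the buffered size so page-sizing decisions stay correct. */
  long getBufferedSize() {
    long scattered = 0;
    for (ByteArrayOutputStream s : byteStreams) {
      scattered += s.size();
    }
    return scattered + (long) pending * ELEMENT_SIZE;
  }

  /** Analogue of getBytes(): flush pending values, then concatenate the streams in order. */
  byte[] toBytes() {
    flushPending();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (ByteArrayOutputStream s : byteStreams) {
      out.write(s.toByteArray(), 0, s.size());
    }
    return out.toByteArray();
  }
}
```

Buffering `0x04030201` and `0x08070605` and calling `toBytes()` yields the split layout: all low bytes first (`01 05`), then the next stream (`02 06`), and so on, with each stream filled by a single bulk write.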
Expected speedup (same JMH config):
| Type | Before | After | Δ | Alloc B/op |
|--------|--------|--------|--------------|---------------------|
| Float | 15.08M | 65.06M | +331% (4.3x) | 33.27 → 9.27 (-72%) |
| Double | 6.99M | 49.48M | +608% (7.1x) | 42.54 → 18.55 (-56%) |
| Int | 15.64M | 68.13M | +335% (4.4x) | 33.27 → 9.27 (-72%) |
| Long | 7.09M | 53.23M | +651% (7.5x) | 42.54 → 18.55 (-56%) |
Scope
- Single file change to `parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesWriter.java`.
- No public-API change; `bufferInt`/`bufferLong` are package-internal helpers; existing public methods preserve their contracts.
- All 573 `parquet-column` tests pass; 51 BSS-specific tests pass.
Relation
Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter), #3500 (Binary.hashCode cache).