feat: RunEndEncodedIterator #327

GeorgeLeePatterson · 2025-11-02T21:31:41Z

Summary

Adds a stateful iterator that gives RunEndEncoded vectors amortized O(1) sequential access.

This PR builds on the base RunEndEncoded support added in #326, reducing a full sequential pass from O(n log n) to O(n).

Implementation

Implements RunEndEncodedIterator based on Apache Arrow C++'s PhysicalIndexFinder caching algorithm:

- Caches the last physical index from previous lookup
- Fast path: Validates cached index is still valid for current logical index (O(1))
- Fallback: Uses cached index to partition search space, then binary searches the relevant partition
- Sequential iteration: O(1) amortized per element instead of O(log n)

Algorithm

1. Check if cached physical index is still valid for current logical index
2. If valid and within run bounds, return cached index immediately (common case)
3. If not valid, use cached index to partition search space into before/after ranges
4. Binary search only the relevant partition
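
The steps above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the code added in this PR: `CachedRunFinder` and `find` are hypothetical names, and the run ends are assumed to be a strictly increasing `Int32Array`.

```typescript
// Illustrative sketch; `CachedRunFinder` is a hypothetical name, not the PR's class.
class CachedRunFinder {
  private readonly runEnds: Int32Array;
  private lastPhysical = 0; // run index returned by the previous lookup

  constructor(runEnds: Int32Array) {
    this.runEnds = runEnds;
  }

  /** Returns the index of the run that contains the given logical index. */
  find(logical: number): number {
    const ends = this.runEnds;
    if (ends.length === 0 || logical < 0 || logical >= ends[ends.length - 1]) {
      throw new RangeError(`logical index ${logical} is out of bounds`);
    }

    const cached = this.lastPhysical;
    const runStart = cached > 0 ? ends[cached - 1] : 0;

    // Steps 1-2: the cached run still contains the logical index.
    if (logical >= runStart && logical < ends[cached]) {
      return cached;
    }

    // Step 3: the cached run splits the search space into before/after ranges.
    let lo = logical < runStart ? 0 : cached + 1;
    let hi = logical < runStart ? cached : ends.length;

    // Step 4: binary search for the first run whose end exceeds `logical`.
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (ends[mid] > logical) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    this.lastPhysical = lo;
    return lo;
  }
}
```

With run ends `[3, 5, 9]`, scanning logical indices 0 through 8 in order only leaves the fast path when the scan crosses a run boundary.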

Performance

- Best case (sequential iteration): O(1) per element
- Worst case (random access): O(log n), plus one extra probe compared with a standard binary search
- Typical iteration patterns: Near-constant time per element

Changes

src/visitor/iterator.ts: Added RunEndEncodedIterator class and runEndEncodedIterator() function

Testing

Existing iterator tests cover the functionality. The optimization is transparent to the API.

This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling.

- Added BinaryView and Utf8View type classes with view struct layout constants
- Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24
- Data class support for variadic buffer management
- Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data)
- Set visitor: Marks as immutable (read-only)
- VectorLoader: Reads from IPC format with variadicBufferCounts
- TypeComparator, TypeCtor: Type system integration
- JSON visitors: Explicitly unsupported (throws error)

- Generated schema files for BinaryView, Utf8View, ListView, LargeListView
- Script to regenerate from Arrow format definitions

- Reading BinaryView/Utf8View columns from Arrow IPC files
- Accessing values with proper inline/out-of-line handling
- Variadic buffer management
- Type checking and comparison

- ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts)
- ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling
- ✅ TypeScript compiles without errors
- ✅ All existing tests pass
- ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds)

- Reading query results from DataFusion 50.0+ with view types enabled
- Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns
- Processing Arrow IPC streams from any system using view types

- Builders for write operations
- ListView/LargeListView type implementation
- Additional test coverage

Closes apache#311
Related to apache#225

- Simplify variadicBuffers byteLength calculation with reduce
- Remove unsupported type enum entries (only add BinaryView and Utf8View)
- Eliminate type casting by extracting getBinaryViewBytes helper
- Simplify readVariadicBuffers with Array.from
- Remove CompressedVectorLoader override (inherits base implementation)
- Delete SparseTensor.ts (not implementing tensors in this PR)

- Implement BinaryViewBuilder with inline/out-of-line storage logic
- Implement Utf8ViewBuilder with UTF-8 encoding support
- Support random-access writes (not just append-only)
- Proper variadic buffer management (32MB buffers per spec)
- Handle null values correctly
- Register builders in builderctor visitor
- Add comprehensive test suite covering:
  - Inline values (≤12 bytes)
  - Out-of-line values (>12 bytes)
  - Mixed inline/out-of-line
  - Null values
  - Empty values
  - 12-byte boundary cases
  - UTF-8 multibyte characters
  - Large batches (1000 values)
  - Multiple flushes

Fixes:
- Correct buffer allocation for random-access writes
- Proper byteLength calculation (no double-counting)
- Follows FixedWidthBuilder patterns for index-based writes

Add patch file to remove .skip_tester('JS') for BinaryView tests and modify CI workflow to apply the patch before running Archery. This enables the official Apache Arrow integration tests to validate BinaryView and Utf8View support in arrow-js.

The integration tests require JSON format support for cross-implementation validation. This adds recognition of 'binaryview' and 'utf8view' type names in the JSON type parser. Fixes integration test failures where arrow-js couldn't parse BinaryView/Utf8View types from JSON schema definitions.

The JSONVectorLoader needs to read variadic buffers from JSON format to support BinaryView and Utf8View types in integration tests. This method reads hex-encoded variadic buffer data from JSON sources and converts it to Uint8Array buffers.

This commit implements complete JSON integration test support for BinaryView and Utf8View types by adding handling for variadic data buffers.

Changes:
- Updated buffersFromJSON() to handle VIEWS and VARIADIC_DATA_BUFFERS fields
- Added variadicBufferCountsFromJSON() using reduce pattern to extract counts
- Updated recordBatchFromJSON() to pass variadicBufferCounts to RecordBatch
- Updated JSONVectorLoader constructor to accept and pass variadicBufferCounts
- Updated RecordBatchJSONReaderImpl to pass variadicBufferCounts to loader

Implements viewDataFromJSON() to convert JSON view objects into 16-byte view structs required by the Arrow view format.

The JSON VIEWS field contains objects with structure:
- Inline views (≤12 bytes): {SIZE, INLINED}
- Out-of-line views (>12 bytes): {SIZE, PREFIX_HEX, BUFFER_INDEX, OFFSET}

This function converts these to the 16-byte binary view struct layout:
- Inline: [size: i32, inlined data: 12 bytes]
- Out-of-line: [size: i32, prefix: 4 bytes, buffer_index: i32, offset: i32]

Changes:
- Added viewDataFromJSON() helper function
- Updated JSONVectorLoader.readData() to handle BinaryView and Utf8View types
- Properly constructs 16-byte view structs from JSON representation
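
To make the 16-byte view struct concrete, here is a rough sketch of how one view could be decoded back into its bytes. It assumes the little-endian layout from the Arrow columnar format: a 4-byte size, then either up to 12 inlined bytes or a 4-byte prefix, a 4-byte buffer index, and a 4-byte offset. `readView` and its parameters are illustrative names, not arrow-js API.

```typescript
// Sketch only: `readView` is a hypothetical helper, not part of arrow-js.
const VIEW_STRUCT_SIZE = 16; // bytes per view
const MAX_INLINE_SIZE = 12; // values up to this length live inside the view

function readView(views: Uint8Array, dataBuffers: Uint8Array[], index: number): Uint8Array {
  const base = index * VIEW_STRUCT_SIZE;
  const dv = new DataView(views.buffer, views.byteOffset + base, VIEW_STRUCT_SIZE);
  const size = dv.getInt32(0, true);

  if (size <= MAX_INLINE_SIZE) {
    // Inline: the value bytes directly follow the 4-byte size.
    return views.subarray(base + 4, base + 4 + size);
  }

  // Out-of-line: bytes 4-7 hold a prefix, bytes 8-11 the buffer index,
  // and bytes 12-15 the offset into that variadic data buffer.
  const bufferIndex = dv.getInt32(8, true);
  const offset = dv.getInt32(12, true);
  return dataBuffers[bufferIndex].subarray(offset, offset + size);
}
```

The returned array is a zero-copy slice of either the views buffer or a variadic data buffer; a Utf8View reader would pass it through a `TextDecoder`.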

…riter)

Implements JSON writing for BinaryView and Utf8View types to enable 'JS producing' integration tests. This completes the JSON format support for view types.

Implementation:
- Added visitBinaryView() and visitUtf8View() methods to JSONVectorAssembler
- Implemented viewDataToJSON() helper that converts 16-byte view structs to JSON
- Handles both inline (≤12 bytes) and out-of-line (>12 bytes) views
- Properly maps variadic buffer indices and converts buffers to hex strings

JSON output format matches Apache Arrow spec:
- Inline views: {SIZE, INLINED} where INLINED is hex (BinaryView) or string (Utf8View)
- Out-of-line views: {SIZE, PREFIX_HEX, BUFFER_INDEX, OFFSET}
- VARIADIC_DATA_BUFFERS array contains hex-encoded buffer data

This enables the complete roundtrip: Builder → Data → JSON → IPC → validation

This fixes integration test failures for BinaryView and Utf8View types.

Changes:
- Fix JSONTypeAssembler to serialize BinaryView/Utf8View type metadata
- Fix JSONMessageReader to include VIEWS and VARIADIC_DATA_BUFFERS in sources
- Fix viewDataFromJSON to handle both hex (BinaryView) and UTF-8 (Utf8View) INLINED formats
- Fix readVariadicBuffers to handle individual hex strings correctly
- Fix lint error: use String.fromCodePoint() instead of String.fromCharCode()
- Fix lint error: use for-of loop instead of traditional for loop
- Add comprehensive unit tests for JSON round-trip serialization

Root cause: The JSON format uses different representations for inline data:
- BinaryView INLINED: hex string (e.g., "48656C6C6F")
- Utf8View INLINED: UTF-8 string (e.g., "Hello")

The reader now auto-detects the format and handles both correctly.

Fixes apache#320 integration test failures

- Extract hexStringToBytes() helper function to reduce code duplication
- Update readVariadicBuffers() to use helper instead of wrapping in array
- Update binaryDataFromJSON() to use helper for cleaner implementation
- Add comprehensive documentation explaining design matches C++ reference
- Document why 'as unknown as string' cast is necessary for heterogeneous sources array
- Reference Arrow C++ implementation in comments for architectural clarity
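
For readers unfamiliar with the hex-encoded buffers in the JSON integration format, a helper of this kind might look roughly like the sketch below. Only the name `hexStringToBytes` comes from the commit message; the body and its validation are assumptions, not the PR's actual implementation.

```typescript
// Sketch of a hex decoder; the body is an assumption, not the PR's code.
function hexStringToBytes(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new Error(`invalid hex string: ${hex}`);
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    // Each pair of hex digits encodes one byte.
    bytes[i] = Number.parseInt(hex.slice(2 * i, 2 * i + 2), 16);
  }
  return bytes;
}
```

Applied to the BinaryView INLINED example "48656C6C6F", this yields the five bytes that spell "Hello" in UTF-8.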

- Add ListView and LargeListView type classes with child field support
- Add type guard methods isListView and isLargeListView
- Add visitor support in typeassembler and typector
- Add Data interfaces for ListView with offsets and sizes buffers
- Add makeData overloads for ListView and LargeListView
- Update DataProps union type to include ListView types

ListView and LargeListView use offset+size buffers instead of consecutive offsets, allowing out-of-order writes and value sharing.

- Add ListView and LargeListView type classes to src/type.ts
- Add visitor support in src/visitor.ts (inferDType and getVisitFnByTypeId)
- Add visitor support in src/visitor/typector.ts and typeassembler.ts
- Add DataProps interfaces for ListView/LargeListView in src/data.ts
- Implement MakeDataVisitor methods for ListView/LargeListView
- Implement GetVisitor methods for ListView/LargeListView in src/visitor/get.ts
- Add comprehensive test suite in test/unit/ipc/list-view-tests.ts
  - Tests in-order and out-of-order offsets
  - Tests value sharing between list elements
  - Tests null handling and empty lists
  - Tests LargeListView with BigInt64Array offsets
  - Tests type properties

ListView and LargeListView are Arrow 1.4 variable-size list types that use offset+size buffers instead of consecutive offsets, enabling out-of-order writes and value sharing.
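
A small sketch may help show why separate offset and size buffers permit out-of-order and shared child values. `listViewGet` is a hypothetical helper operating on plain arrays, not the GetVisitor code from this PR, and it ignores the validity bitmap.

```typescript
// Sketch only: element i of a list view is values[offsets[i] .. offsets[i] + sizes[i]).
function listViewGet<T>(
  values: readonly T[],
  offsets: Int32Array,
  sizes: Int32Array,
  i: number,
): T[] {
  const start = offsets[i];
  return values.slice(start, start + sizes[i]);
}

const childValues = [1, 2, 3, 4, 5];
const viewOffsets = Int32Array.of(3, 0, 0); // not monotonically increasing
const viewSizes = Int32Array.of(2, 3, 2);

// Element 0 points at the tail of the child array; elements 1 and 2
// overlap and share the leading child values.
const decodedLists = [0, 1, 2].map((i) => listViewGet(childValues, viewOffsets, viewSizes, i));
```

A regular List stores only consecutive offsets, so element i must end exactly where element i + 1 begins; the extra sizes buffer removes that constraint.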

Implements builders for ListView and LargeListView types:
- ListViewBuilder: Uses Int32Array for offsets and sizes
- LargeListViewBuilder: Uses BigInt64Array for offsets and sizes

Key implementation details:
- Both builders extend Builder directly (not VariableWidthBuilder)
- Use DataBufferBuilder for independent offset and size buffers
- Override flush() to pass both valueOffsets and sizes to makeData
- Properly handle null values and empty lists

Includes comprehensive test suite with 11 passing tests:
- Basic value appending
- Null handling
- Empty lists
- Multiple flushes
- Varying list sizes
- BigInt offset verification

This is part of the stacked PR strategy for view types support.

- Add LargeList type class and interface to type system
- Implement LargeListBuilder for write support
- Add LargeList visitors for all operations (get, set, indexof, etc.)
- Add LargeList to data props and makeData function
- Update vectorassembler and vectorloader for LargeList
- Add LargeList enum entry (Type.LargeList = 21)
- Use BigInt64Array for LargeList offsets

Implements RunEndEncoded array type support following the Apache Arrow specification.

Key features:
- Two-child structure: run_ends (Int16/32/64) and values (any type)
- Binary search O(log n) algorithm for value lookup (matches Arrow C++ implementation)
- Immutable design for set operations (similar to BinaryView/Utf8View)
- Full visitor pattern integration across all visitor files
- Proper TypeScript generics with Int_ constraint for type safety

Implementation follows the same patterns as LargeList and other complex types.

Future optimization: Plan to implement RunEndEncodedIterator for O(1) amortized sequential access, matching Arrow C++'s stateful iterator optimization.

Files modified:
- src/enum.ts: Added Type.RunEndEncoded = 22
- src/type.ts: Added RunEndEncoded type class and TRunEnds helper type
- src/data.ts: Added RunEndEncodedDataProps and visitor method
- src/visitor/get.ts: Implemented binary search lookup
- src/visitor/set.ts: Made immutable (throws error)
- src/visitor/*.ts: Added RunEndEncoded to all visitor files
- src/visitor.ts: Added to base Visitor class
- src/Arrow.ts: Added RunEndEncoded to exports

Implements stateful caching optimization based on Arrow C++ PhysicalIndexFinder:
- Caches last physical index from previous lookup
- Fast path: validates cached index for sequential access patterns (O(1))
- Falls back to binary search in partitioned ranges when cache invalid
- Typical iteration becomes O(1) amortized instead of O(log n) per element

Algorithm:
1. Check if cached physical index is still valid for current logical index
2. If valid and within run bounds, return cached index (common case)
3. If not valid, use cached index to partition search space
4. Binary search only the relevant partition

Worst case (random access) adds one extra probe vs standard binary search. Best case (sequential iteration) is O(1) per element.

GeorgeLeePatterson added 30 commits October 29, 2025 19:28

WIP: add binaryview and uft8view support

a35ea1e

Add Apache license headers to fix RAT check

675b2f2

Fix Jest dynamic import errors by removing moduleResolution: NodeNext from test tsconfig

73bda86

chore: Trigger CI validation on fork

456f85d

fix: Add new files to RAT exclusion list

dfe9d56

Add scripts/update_flatbuffers.sh and test/unit/ipc/view-types-tests.ts to RAT (Release Audit Tool) exclusion list. Both files have proper Apache license headers but need to be excluded from license scanning.

Revert "fix: Add new files to RAT exclusion list"

21a778f

This reverts commit dfe9d56.

fix: Correct license header format in update_flatbuffers.sh

e9d180b

Remove blank line after shebang to match Apache Arrow JS convention. License header must start on line 2 with '#' as shown in ci/scripts/build.sh

fix: Export BinaryView and Utf8View types

8d5bf77

Add BinaryView and Utf8View to main exports in Arrow.ts. These types were implemented but not exported, causing 'BinaryView is not a constructor' errors in ES5 UMD tests.

fix: Export BinaryView and Utf8View in Arrow.dom.ts

41f2d3e

Add BinaryView and Utf8View to Arrow.dom.ts exports. Arrow.node.ts re-exports from Arrow.dom.ts, so this fixes both entrypoints.

fix: Use toHaveLength() for jest length assertions

a28f69f

ESLint rule jest/prefer-to-have-length requires using toHaveLength() instead of toBe() for length checks.

Add BinaryViewBuilder and Utf8ViewBuilder exports

5b312d5

Simplify byteLength calculation in view builders

5344b8f

Use reduce instead of explicit loops for variadicBuffers byteLength calculation, consistent with changes in Data class.

fix: Add Apache license header to patch file

9502316

Fixes RAT (Release Audit Tool) license check failure.

Merge remote-tracking branch 'origin/feat/binary-utf8-view-builders' into feat/binary-utf8-view

00bb3c9

Add ListView and LargeListView exports

77131b4

Add ListView and LargeListView type enum entries

233f233

Add type 25 (ListView) and 26 (LargeListView) to the Type enum.

GeorgeLeePatterson added 8 commits November 4, 2025 16:38

fix: Use toHaveLength() for jest length assertions

cf67aae

ESLint rule jest/prefer-to-have-length requires using toHaveLength() instead of toBe() for length checks.

Add ListViewBuilder and LargeListViewBuilder exports to Arrow.dom.ts

61d3169

fix: Replace BigInt literals with BigInt() constructor for ES5 compatibility

de1e8a7

feat: Export LargeList and LargeListBuilder from main module

3212b9a

feat: Add RunEndEncoded and LargeList to Arrow.dom.ts exports

8ccb655

GeorgeLeePatterson force-pushed the feat/run-end-encoded-iterator branch from 3e3fd94 to d1a4b63 Compare November 4, 2025 21:41
