@xsa-dev xsa-dev commented Nov 9, 2025

Which issue does this PR close?

Rationale for this change

This PR adds the missing RunEndEncoded variant to the ScalarValue enum, which is needed for type coercion and other operations involving RunEndEncoded arrays. RunEndEncoded (REE) is an Arrow data type that stores runs of repeated values as a run_ends array plus a values array, a variant of run-length encoding. Without a matching ScalarValue variant, operations involving REE arrays do not work correctly.
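For readers unfamiliar with the layout, here is a minimal sketch of run-end encoding in plain Rust. It does not use the Arrow crates; `run_end_encode` is a hypothetical helper written for illustration, and plain `Vec`s stand in for Arrow arrays. Each entry in `run_ends` is the exclusive logical index at which the corresponding run stops.

```rust
/// Run-end encode a slice into `(run_ends, values)`.
///
/// Illustrative only: a hypothetical helper over plain `Vec`s, not an Arrow API.
/// `run_ends[i]` is the exclusive logical end index of the run holding `values[i]`.
fn run_end_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();

    for (i, item) in input.iter().enumerate() {
        let end = (i + 1) as i32;
        // Does this element repeat the value of the run currently being built?
        let extends_run = values.last().map_or(false, |last| last == item);
        if extends_run {
            // Same run: only move its end forward.
            if let Some(last_end) = run_ends.last_mut() {
                *last_end = end;
            }
        } else {
            // New run: record the value and where the run (so far) ends.
            values.push(item.clone());
            run_ends.push(end);
        }
    }
    (run_ends, values)
}
```

Under these assumptions, `run_end_encode(&[7, 7, 7, 9, 9, 4])` yields `run_ends = [3, 5, 6]` and `values = [7, 9, 4]`.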

The implementation follows the existing patterns in the DataFusion codebase and ensures compatibility with the Arrow specification for RunEndEncoded arrays.

What changes are included in this PR?

  1. Added RunEndEncoded variant to ScalarValue enum in datafusion/common/src/scalar/mod.rs

    • New variant: RunEndEncoded(Arc<RunEndEncodedScalar>)
    • Positioned correctly within the enum definition
  2. Implemented RunEndEncodedScalar struct with required trait implementations:

    • PartialEq and Eq for value equality comparisons
    • Hash for hash-based operations
    • PartialOrd for ordering operations
    • Debug and Clone for debugging and copying
  3. Updated existing trait implementations to handle the new variant:

    • PartialEq for ScalarValue
    • PartialOrd for ScalarValue
    • Hash for ScalarValue
  4. Added support for null values in try_new_null() method

  5. Added unit tests covering:

    • Value equality and inequality
    • Hash consistency
    • Partial ordering
    • Null value handling
    • Type safety
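A rough sketch of the shape described in the list above, using only the standard library. Everything here is a stand-in: `ToyScalarValue` is a two-variant toy enum rather than DataFusion's `ScalarValue`, the `Vec` field types replace whatever array types the PR actually uses, and `hash_of` is a hypothetical test helper. Returning `None` when comparing different variants is an assumption about the intended ordering semantics, not something stated in this description.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Stand-in for the PR's struct; the field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd)]
pub struct RunEndEncodedScalar {
    pub run_ends: Vec<i32>,
    pub values: Vec<Option<i64>>,
}

/// Toy enum standing in for `ScalarValue` (hypothetical, two variants only).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ToyScalarValue {
    Int64(Option<i64>),
    RunEndEncoded(Arc<RunEndEncodedScalar>),
}

impl PartialOrd for ToyScalarValue {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match (self, other) {
            (Self::Int64(a), Self::Int64(b)) => a.partial_cmp(b),
            (Self::RunEndEncoded(a), Self::RunEndEncoded(b)) => a.partial_cmp(b),
            // Assumed here: values of different variants are not comparable.
            _ => None,
        }
    }
}

/// Hypothetical helper: hash any value with the std default hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Because `Arc<T>` forwards `PartialEq`, `Hash`, and `PartialOrd` to `T`, wrapping the struct in `Arc` keeps clones cheap without changing comparison results. Equal values must hash equally, which is the property the "hash consistency" tests would check.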

Are these changes tested?

Yes, this PR includes comprehensive test coverage:

  • Unit tests for RunEndEncodedScalar covering all trait implementations
  • Integration tests for ScalarValue enum operations with the new variant
  • Null value tests to ensure proper handling of null RunEndEncoded scalars
  • Edge case tests for various value types and comparison scenarios

All tests pass successfully and the implementation maintains compatibility with existing functionality.

Are there any user-facing changes?

Yes, this PR introduces user-facing changes by extending the public API:

New Public API:

  • ScalarValue::RunEndEncoded(Arc<RunEndEncodedScalar>) - New enum variant
  • RunEndEncodedScalar struct with public run_ends and values fields

User Impact:

  • Positive: Users can now work with RunEndEncoded scalars in DataFusion queries
  • Backward Compatible: Existing code continues to work unchanged
  • Type Safe: Proper type support for REE operations

No Breaking Changes:

  • All existing APIs remain unchanged
  • No modifications to public method signatures
  • No changes to existing behavior

The changes follow DataFusion's API evolution guidelines and are fully backward compatible.

This commit introduces the RunEndEncodedScalar struct, which represents a scalar value for the RunEndEncoded type, facilitating efficient storage of repeated values through run-length encoding. The implementation includes methods for equality comparison, partial ordering, and hashing. Additionally, the ScalarValue enum is updated to include the RunEndEncoded variant, along with necessary adjustments in related methods. Unit tests are added to verify the functionality of equality, hashing, and partial ordering for the new type.
This commit enhances the handling of the RunEndEncoded type within the ScalarValue enum. It introduces null handling for RunEndEncoded scalars, ensuring that empty arrays are correctly represented as null. Additionally, it improves the formatting for display and debug outputs of RunEndEncoded scalars. Minor whitespace adjustments were also made for consistency. Unit tests are updated to verify the new null handling behavior.
@github-actions github-actions bot added the common Related to common crate label Nov 9, 2025
@xsa-dev xsa-dev changed the title feat: add RunEndEncoded variant to ScalarValue enum Closes #18563 feat: add RunEndEncoded variant to ScalarValue enum Nov 9, 2025
Contributor

@Jefffrey Jefffrey left a comment

I don't think it makes sense for the RunEndEncoded scalar to hold a reference to arrays. Scalars are meant to be single values, and run-end encoding is similar to dictionary encoding in that it is a way to encode another array. For example, we can have an Int32 array, or we could represent the exact same array (without nulls) as a RunEndEncoded Int32 array.
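To illustrate the point, a small sketch in plain Rust (no Arrow crates; `run_end_decode` and `value_at` are hypothetical helpers, and the slices stand in for the run_ends and values child arrays). Decoding the encoded pair gives back the ordinary array, and looking up one logical index gives back one ordinary value, which is what a scalar would be expected to hold.

```rust
/// Expand a run-end encoded pair back into the plain array it represents.
/// Assumes `run_ends` is strictly increasing and as long as `values`.
fn run_end_decode<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0usize;
    for (&end, value) in run_ends.iter().zip(values) {
        let end = end as usize;
        // Each run contributes `end - start` copies of its value.
        out.extend(std::iter::repeat(value.clone()).take(end.saturating_sub(start)));
        start = end;
    }
    out
}

/// Return the single value at a logical index, or `None` if out of bounds.
fn value_at<T: Clone>(run_ends: &[i32], values: &[T], index: usize) -> Option<T> {
    // Binary search for the first run whose exclusive end lies past `index`.
    let run = run_ends.partition_point(|&end| (end as usize) <= index);
    values.get(run).cloned()
}
```

With `run_ends = [3, 5, 6]` and `values = [7, 9, 4]`, `run_end_decode` returns `[7, 7, 7, 9, 9, 4]` and `value_at(.., 3)` returns `Some(9)`: a single Int32-like value, with no arrays attached.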

I suggest reading the documentation on run end encoding to get a better understanding of what it actually represents:

@xsa-dev xsa-dev marked this pull request as draft November 10, 2025 11:23
Development

Successfully merging this pull request may close these issues.

Missing ScalarValue variant for RunEndEncoded