@harshitsaini17
Contributor

fix: shuffle should report nullability correctly

Which issue does this PR close?

Closes #19145

Rationale for this change

The shuffle UDF was using the default is_nullable implementation, which always returns true regardless of the input array's nullability. This causes:

  1. Incorrect schema inference - non-nullable inputs are incorrectly marked as nullable
  2. Missed optimization opportunities - the query optimizer cannot apply certain optimizations when nullability information is incorrect
  3. Potential runtime errors - incorrect metadata can lead to unexpected behavior in downstream operations

The shuffle function simply reorders elements within an array without changing the array's structure or nullability, so the output should have the same nullability as the input.
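The invariant described above can be illustrated with a small model. This is only a sketch under stated assumptions: `shuffle_model` is a hypothetical stand-in rather than the Spark-compatible kernel in DataFusion, `None` stands for a NULL array value, and a tiny xorshift generator replaces a real random source so that the example needs no external crates.

```rust
/// Hypothetical stand-in for the shuffle kernel: `None` models a NULL array value.
/// A small xorshift generator keeps the sketch free of external crates.
fn shuffle_model(input: Option<Vec<i32>>, seed: u64) -> Option<Vec<i32>> {
    // A NULL input stays NULL; a non-NULL input never becomes NULL.
    let mut values = input?;
    let mut state = seed.max(1); // xorshift must not start from zero

    // Fisher-Yates: walk from the end, swapping each slot with a random
    // slot at or before it. The modulo introduces a slight bias, which is
    // acceptable for an illustration.
    for i in (1..values.len()).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let j = (state % (i as u64 + 1)) as usize;
        values.swap(i, j);
    }
    Some(values)
}
```

The output is `None` exactly when the input is `None`, and otherwise holds the same elements in a different order, which is why the output field can reuse the input field's nullability.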

What changes are included in this PR?

  1. Implemented return_field_from_args: Returns the input field directly, preserving both data type and nullability
  2. Updated return_type: Now returns an error directing users to use return_field_from_args instead (following DataFusion best practices)
  3. Added comprehensive tests: Verify that both nullable and non-nullable inputs are handled correctly
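The shape of the first two changes might look roughly like the sketch below. The `Field`, `FieldRef`, and `ScalarUdf` definitions here are simplified, hypothetical stand-ins written for illustration; DataFusion's real trait and argument types are richer, and the code actually merged in this PR is not reproduced here.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow field; not the real `arrow` type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String, // e.g. "List(Int32)"; the real data type is an enum
    nullable: bool,
}

type FieldRef = Arc<Field>;

/// Hypothetical, pared-down scalar UDF trait used only for this sketch.
trait ScalarUdf {
    fn return_type(&self, arg_types: &[String]) -> Result<String, String>;
    fn return_field_from_args(&self, arg_fields: &[FieldRef]) -> Result<FieldRef, String>;
}

struct Shuffle;

impl ScalarUdf for Shuffle {
    // The type-only entry point cannot express nullability, so it refuses
    // and points callers at the field-based entry point.
    fn return_type(&self, _arg_types: &[String]) -> Result<String, String> {
        Err("return_type is not used; call return_field_from_args instead".to_string())
    }

    // Hand back the input field itself, so both the data type and the
    // nullable flag carry over unchanged.
    fn return_field_from_args(&self, arg_fields: &[FieldRef]) -> Result<FieldRef, String> {
        arg_fields
            .first()
            .map(Arc::clone)
            .ok_or_else(|| "shuffle expects at least one argument".to_string())
    }
}
```

Returning the input field rather than building a new one keeps the data type and the nullable flag together, so the two cannot drift apart.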

Are these changes tested?

Yes, this PR includes a new test, test_shuffle_nullability, that verifies:

  • Non-nullable array input produces non-nullable output
  • Nullable array input produces nullable output
  • Data types are preserved correctly in both cases
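The three checks listed above could be organised as a table-driven test. The following is an assumed outline, not the actual test_shuffle_nullability from the PR: `FieldModel` and `shuffle_output_field` are hypothetical stand-ins for the Arrow field type and for the UDF's return-field logic.

```rust
/// Hypothetical stand-in for a field: a data type name plus a nullable flag.
#[derive(Debug, Clone, PartialEq)]
struct FieldModel {
    data_type: &'static str,
    nullable: bool,
}

/// Stand-in for the fixed behaviour: the output field mirrors the input field.
fn shuffle_output_field(input: &FieldModel) -> FieldModel {
    input.clone()
}

/// Runs the same assertions for the non-nullable and the nullable case.
fn check_shuffle_nullability() {
    let cases = [
        FieldModel { data_type: "List(Int32)", nullable: false },
        FieldModel { data_type: "List(Int32)", nullable: true },
    ];
    for input in &cases {
        let output = shuffle_output_field(input);
        assert_eq!(output.nullable, input.nullable, "nullability must carry over");
        assert_eq!(output.data_type, input.data_type, "data type must be preserved");
    }
}
```

Covering both flag values in one loop guards against a regression to the old behaviour, where a hard-coded `true` would still pass a nullable-only test.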

Test results:

Copilot AI review requested due to automatic review settings December 7, 2025 18:22
@github-actions github-actions bot added the spark label Dec 7, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes the shuffle function to correctly preserve the nullability of its input array, addressing issue #19145. Previously, the function always reported outputs as nullable regardless of input nullability, which could lead to incorrect schema inference and missed optimization opportunities.

Key Changes:

  • Replaced return_type implementation with return_field_from_args to preserve input field metadata including nullability
  • Added comprehensive unit tests to verify nullability preservation for both nullable and non-nullable inputs

Member

@rluvaton rluvaton left a comment

Thank you!

@harshitsaini17 harshitsaini17 force-pushed the fix/shuffle-nullability-19145 branch 2 times, most recently from 5cd23bc to c3262d2 Compare December 7, 2025 18:35
@harshitsaini17 harshitsaini17 force-pushed the fix/shuffle-nullability-19145 branch from c3262d2 to 629e995 Compare December 7, 2025 18:52
- Replace return_type with return_field_from_args to preserve input nullability
- Add test to verify nullability is correctly reported
- Addresses issue apache#19145
@harshitsaini17 harshitsaini17 force-pushed the fix/shuffle-nullability-19145 branch from 629e995 to 853ff42 Compare December 7, 2025 19:28
Member

@rluvaton rluvaton left a comment

Thank you!

hope to see you again soon

@rluvaton rluvaton added this pull request to the merge queue Dec 7, 2025
Merged via the queue into apache:main with commit fc6d0a4 Dec 7, 2025
27 checks passed
Development

Successfully merging this pull request may close these issues.

shuffle should report nullability correctly

2 participants