API: Simplify the ManifestEvaluator #5926
```java
      return ROWS_MIGHT_MATCH;
    }

    private boolean allValuesAreNull(PartitionFieldSummary summary, Type.TypeID typeId) {
```
I think I'd still prefer to keep this as a separate function, rather than possibly recreating it (incorrectly) later.
That is a fair point. I've reinstated the function and reverted the changes to the null check. Now the PR only removes the redundant check in `isNaN`.
While writing tests for Python, I noticed some unreachable code. This also allows us to simplify the logic here by inlining the `allValuesAreNull` function.

The `allValuesAreNull` check is redundant in the `isNaN` check, because we first check whether there are NaN values:

```java
@Override
public <T> Boolean isNaN(BoundReference<T> ref) {
  int pos = Accessors.toPosition(ref.accessor());
  if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
    return ROWS_CANNOT_MATCH;
  }

  if (allValuesAreNull(stats.get(pos), ref.type().typeId())) {
    return ROWS_CANNOT_MATCH; // Unreachable code
  }

  return ROWS_MIGHT_MATCH;
}
```

And then we do the same in `allValuesAreNull`:

```java
private boolean allValuesAreNull(PartitionFieldSummary summary, Type.TypeID typeId) {
  // containsNull encodes whether at least one partition value is null,
  // lowerBound is null if all partition values are null
  boolean allNull = summary.containsNull() && summary.lowerBound() == null;

  if (allNull && (Type.TypeID.DOUBLE.equals(typeId) || Type.TypeID.FLOAT.equals(typeId))) {
    // floating point types may include NaN values, which we check separately.
    // In case bounds don't include NaN value, containsNaN needs to be checked against.
    allNull = summary.containsNaN() != null && !summary.containsNaN();
  }
  return allNull;
}
```

Since `isNaN` can only be applied to floats and doubles, we always take the branch, and then we come to the same conclusion.
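The reachability claim can be checked by brute force over the two quoted snippets. The following is only a minimal sketch, not Iceberg code: `Summary` is a hypothetical stand-in for `PartitionFieldSummary`, `lowerBoundIsNull` replaces the real `lowerBound()` accessor, and the column is assumed to be a float or double, as the description states for `isNaN`. A boxed `Boolean` models a `containsNaN` value that may be null.

```java
import java.util.Arrays;
import java.util.List;

public class IsNaNReachabilitySketch {

  /** Hypothetical, simplified stand-in for PartitionFieldSummary. */
  static final class Summary {
    final boolean containsNull;
    final Boolean containsNaN;      // boxed so that it can be null (unknown)
    final boolean lowerBoundIsNull; // stands in for lowerBound() == null

    Summary(boolean containsNull, Boolean containsNaN, boolean lowerBoundIsNull) {
      this.containsNull = containsNull;
      this.containsNaN = containsNaN;
      this.lowerBoundIsNull = lowerBoundIsNull;
    }
  }

  /** Mirrors the quoted helper, specialised to a FLOAT or DOUBLE column. */
  static boolean allValuesAreNull(Summary s) {
    boolean allNull = s.containsNull && s.lowerBoundIsNull;
    if (allNull) {
      // Floating-point branch: only a known "no NaN" keeps allNull true.
      allNull = s.containsNaN != null && !s.containsNaN;
    }
    return allNull;
  }

  /** Counts summaries that pass the first check in isNaN and then hit the second. */
  static int countSecondBranchHits() {
    int hits = 0;
    List<Boolean> triState = Arrays.asList(null, Boolean.TRUE, Boolean.FALSE);
    boolean[] flags = {false, true};
    for (boolean containsNull : flags) {
      for (boolean lowerBoundIsNull : flags) {
        for (Boolean containsNaN : triState) {
          Summary s = new Summary(containsNull, containsNaN, lowerBoundIsNull);
          boolean firstCheckPrunes = s.containsNaN != null && !s.containsNaN;
          if (!firstCheckPrunes && allValuesAreNull(s)) {
            hits++;
          }
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    System.out.println("second-branch hits: " + countSecondBranchHits());
  }
}
```

Under these assumptions, running `main` prints `second-branch hits: 0`. The sketch only covers the two methods as quoted; it says nothing about other callers of `allValuesAreNull` or about non-floating-point columns.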
```java
        return ROWS_CANNOT_MATCH;
      }

      if (allValuesAreNull(stats.get(pos), ref.type().typeId())) {
```
I don't think this is correct. The `containsNaN` field was introduced later, so it can be null for metadata files that were written before we tracked NaN values in partitions. In that case, if we know that all values are null, then none of them are NaN and we can skip scanning the manifest.
Yes, you are right on that one! I didn't think of encoding NaN as None. Then this makes sense.
rdblue left a comment
I initially approved, but after thinking about this, I think that the current logic is correct.
While writing tests for Python, I noticed some unreachable code. This also allows us to simplify the logic here by inlining the `allValuesAreNull` function. The `allValuesAreNull` check is redundant in the `isNaN` check, because we first check whether there are NaN values and then do the same in `allValuesAreNull`. Since `isNaN` can only be applied to floats and doubles, we always take the floating-point branch, which sets `allNull` to false. Therefore, we never take the branch in `isNaN`.