Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Oct 6, 2022

While writing tests for Python, I noticed some unreachable code. This also allows us to simplify the logic here by inlining the allValuesAreNull function.

The allValuesAreNull check is redundant for the isNan check.

Because we first check if there are NaN values:

@Override
public <T> Boolean isNaN(BoundReference<T> ref) {
  int pos = Accessors.toPosition(ref.accessor());

  if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
    return ROWS_CANNOT_MATCH;
  }

  if (allValuesAreNull(stats.get(pos), ref.type().typeId())) {
    return ROWS_CANNOT_MATCH; // Unreachable code
  }

  return ROWS_MIGHT_MATCH;
}

And then we do the same in the allValuesAreNull:

private boolean allValuesAreNull(PartitionFieldSummary summary, Type.TypeID typeId) {
  // containsNull encodes whether at least one partition value is null,
  // lowerBound is null if all partition values are null
  boolean allNull = summary.containsNull() && summary.lowerBound() == null;

  if (allNull && (Type.TypeID.DOUBLE.equals(typeId) || Type.TypeID.FLOAT.equals(typeId))) {
    // floating point types may include NaN values, which we check separately.
    // In case bounds don't include NaN value, containsNaN needs to be checked against.
    allNull = summary.containsNaN() != null && !summary.containsNaN();
  }
  return allNull;
}

Since the isNan can only be applied to Floats and Doubles, we always take the branch and it will set allNull to false. Therefore, never take the branch.

@github-actions github-actions bot added the API label Oct 6, 2022
@Fokko Fokko force-pushed the fd-simplify-manifest-evaluator branch from 336dbe5 to 18e4972 Compare October 6, 2022 07:57
return ROWS_MIGHT_MATCH;
}

private boolean allValuesAreNull(PartitionFieldSummary summary, Type.TypeID typeId) {
Copy link
Contributor

@rdblue rdblue Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd still prefer to keep this as a separate function, rather than possibly recreating it (incorrectly) later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a fair point, I've reinstated the function and reverted the changes on the null check. Now it just removes the redundant check on isNaN.

@Fokko Fokko force-pushed the fd-simplify-manifest-evaluator branch from 18e4972 to 10c9737 Compare October 6, 2022 20:00
While writing tests for Python, I noticed some unreachable code.
This also allows us to simplify the logic here by inlining the
`allValuesAreNull` function.

The `allValuesAreNull` check is redundant for the `isNan` check.

Because we first check if there are NaN values:

```java
@OverRide
public <T> Boolean isNaN(BoundReference<T> ref) {
  int pos = Accessors.toPosition(ref.accessor());

  if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
    return ROWS_CANNOT_MATCH;
  }

  if (allValuesAreNull(stats.get(pos), ref.type().typeId())) {
    return ROWS_CANNOT_MATCH; // Unreachable code
  }

  return ROWS_MIGHT_MATCH;
}
```

And then we do the same in the `allValuesAreNull`:

```java
private boolean allValuesAreNull(PartitionFieldSummary summary, Type.TypeID typeId) {
  // containsNull encodes whether at least one partition value is null,
  // lowerBound is null if all partition values are null
  boolean allNull = summary.containsNull() && summary.lowerBound() == null;

  if (allNull && (Type.TypeID.DOUBLE.equals(typeId) || Type.TypeID.FLOAT.equals(typeId))) {
    // floating point types may include NaN values, which we check separately.
    // In case bounds don't include NaN value, containsNaN needs to be checked against.
    allNull = summary.containsNaN() != null && !summary.containsNaN();
  }
  return allNull;
}
```

Since the `isNan` can only be applied to Floats and Doubles, we always take the branch.
And then we come to the same conclusion.
@Fokko Fokko force-pushed the fd-simplify-manifest-evaluator branch from 10c9737 to 36f7487 Compare October 6, 2022 20:01
return ROWS_CANNOT_MATCH;
}

if (allValuesAreNull(stats.get(pos), ref.type().typeId())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. The containsNaN field was introduced later, so it can be null for metadata files that were written before we tracked NaN values in partitions. In that case, if we know that all values are null then none of them are NaN and can then eliminate scanning the manifest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right on that one! It didn't think of encoding NaN as None. Then this makes sense

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially approved, but after thinking about this, I think that the current logic is correct.

@Fokko Fokko closed this Oct 7, 2022
@Fokko Fokko deleted the fd-simplify-manifest-evaluator branch October 7, 2022 06:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants