
Parquet: Fixes Incorrect Skipping of RowGroups with NaNs #6517

Merged (6 commits) · Jan 13, 2023

Conversation

@RussellSpitzer (Member) commented Jan 3, 2023

Previously, the RowGroupFilter would skip any RowGroup whose min or max value was set to NaN, because Parquet reports "hasNonNull" as false for those row groups. When the statistics are not empty and hasNonNull is false, this actually indicates that the min and max are undefined, not that there are no non-null values.

Fixes #6516
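As background, a minimal self-contained demo (the class and method names are made up for illustration, this is not Iceberg or Parquet code) of why a NaN makes an accumulated min/max meaningless: Java's Math.min and Math.max return NaN whenever either argument is NaN, so any running min/max that ever folds in a NaN is poisoned. This is why parquet-mr treats min/max as undefined once a NaN is seen.

```java
// Hypothetical demo: why NaN poisons a running min.
public class NaNMinMaxDemo {
  static double minOf(double[] values) {
    double min = Double.POSITIVE_INFINITY;
    for (double v : values) {
      min = Math.min(min, v); // Math.min returns NaN if either argument is NaN
    }
    return min;
  }

  public static void main(String[] args) {
    System.out.println(minOf(new double[] {3.0, 1.0, 2.0}));        // 1.0
    System.out.println(minOf(new double[] {3.0, Double.NaN, 2.0})); // NaN
  }
}
```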

@@ -580,6 +608,10 @@ static boolean hasNonNullButNoMinMax(Statistics statistics, long valueCount) {
&& (statistics.getMaxBytes() == null || statistics.getMinBytes() == null);
}

static boolean minMaxUndefined(Statistics statistics) {
return !statistics.isEmpty() && !statistics.hasNonNullValue();
@zhongyujiang (Contributor) commented Jan 4, 2023

This unit test fails; I think it's because when all values are null, min/max will be null (undefined) too. It seems we should handle the all-null case first.

@zhongyujiang (Contributor) commented Jan 4, 2023

I am thinking we can simplify the handling of unreliable statistics by moving the all-null check forward, as I mentioned here. We can use Statistics#getNumNulls() == ColumnChunkMetaData#getValueCount() to determine whether all values are null (which is not affected by null or undefined min/max statistics). Then we handle the case where statistics.hasNonNullValue() is false (it returns false when the statistics have null min/max or unreliable min/max); we should always return ROWS_MIGHT_MATCH in that case because it means the statistics are not reliable. Finally, we can safely use min/max for comparative evaluation.
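The evaluation order proposed above could be sketched like this, for a "column < literal" predicate. This is a simplified model with hypothetical names and plain parameters standing in for the real Statistics and ColumnChunkMetaData objects, not the actual Iceberg code:

```java
// Hypothetical sketch of the proposed evaluation order.
public class RowGroupEvalSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  static boolean evalLessThan(double min, long numNulls, long valueCount,
                              boolean hasNonNullValue, double literal) {
    // 1. All values null: a comparison predicate cannot match any row.
    if (numNulls == valueCount) {
      return ROWS_CANNOT_MATCH;
    }
    // 2. min/max unreliable (e.g. a NaN was seen): we cannot prune safely.
    if (!hasNonNullValue) {
      return ROWS_MIGHT_MATCH;
    }
    // 3. Only now is it safe to compare against the min.
    return min < literal ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
  }
}
```

Note that the order matters: the all-null test uses only the null count, so it is immune to the unreliable min/max problem, and the unreliable-stats test runs before any min/max comparison.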

@RussellSpitzer (Member, Author)

I think that sounds good. I was basing this determination on the code from Statistics.java in Parquet:

  @Override
  public String toString() {
    if (this.hasNonNullValue()) {
      if (isNumNullsSet()) {
        return String.format("min: %s, max: %s, num_nulls: %d", minAsString(), maxAsString(), this.getNumNulls());
      } else {
        return String.format("min: %s, max: %s, num_nulls not defined", minAsString(), maxAsString());
      }
    } else if (!this.isEmpty())
      return String.format("num_nulls: %d, min/max not defined", this.getNumNulls());
    else
      return "no stats for this column";
  }

Which similarly checks for the numNulls stat first

@RussellSpitzer (Member, Author)

I simplified this a bit further since I don't think we have to handle the CDH special case in the same way. Now we basically just have a two-step check:

  1. Are all the values null? This is only true if nullCount is set and it is equal to the value count from the chunk stats.

  2. We check whether min/max are defined for the stats:
    a. For normal Parquet stats this means the stats are not empty && hasNonNull is false.
    b. For the old CDH version this means the stats are not empty but min and max are set to null. I think the "a" case may also cover this based on my old notes, but since this check will be short-circuited it doesn't hurt to leave it in.
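The two-step check described above might look roughly like this. These are hypothetical free-standing methods with plain parameters standing in for the parquet-mr Statistics accessors and chunk value count, not the actual Iceberg code:

```java
// Hypothetical sketch of the two-step check.
public class StatsChecks {
  // Step 1: all values are null, only decidable when the null count is set.
  static boolean allValuesNull(boolean numNullsSet, long numNulls, long valueCount) {
    return numNullsSet && numNulls == valueCount;
  }

  // Step 2: min/max undefined, covering (a) normal parquet-mr stats where
  // hasNonNull is false, and (b) old CDH writers that left min/max bytes null.
  static boolean minMaxUndefined(boolean isEmpty, boolean hasNonNullValue,
                                 byte[] minBytes, byte[] maxBytes) {
    return (!isEmpty && !hasNonNullValue)
        || (!isEmpty && (minBytes == null || maxBytes == null));
  }
}
```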

Reordered checks: we now check for the ALL_NULL case before checking whether min/max values are undefined. Since the behavior of "hasNonNullValues" cannot really be trusted in any circumstance, we stop using it entirely. This also removes the workaround for older CDH Parquet files, which could similarly write metrics with a null value count but undefined min/max statistics.
@zhongyujiang (Contributor) left a comment

Looks great to me, thanks for fixing this!

}

static boolean minMaxUndefined(Statistics statistics) {
return (!statistics.isEmpty() && !statistics.hasNonNullValue()) || nullMinMax(statistics);
@szehon-ho (Collaborator)

So for my understanding, it's more like other stats are defined but not min/max? It's not strictly what the method name says, so maybe worth a Javadoc?

Also, I spent some time exploring this:

(!statistics.isEmpty() && !statistics.hasNonNullValue())

I noticed statistics.isEmpty() internally is (!statistics.hasNonNullValue() && !isNumNullsSet()), so after some boolean algebra it simplifies down to:

(statistics.isNumNullsSet() && !statistics.hasNonNullValue())

It could be clearer this way, although I guess it's safer to use isEmpty if the statistics.isEmpty() definition ever changes.
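The claimed simplification can be checked exhaustively over the two underlying booleans; a small sketch (hypothetical class, modeling isEmpty by the definition quoted above):

```java
// Hypothetical check that (!isEmpty && !hasNonNullValue) is equivalent to
// (isNumNullsSet && !hasNonNullValue), given isEmpty == (!hasNonNullValue && !isNumNullsSet).
public class StatsLogicCheck {
  static boolean equivalent() {
    for (boolean hasNonNull : new boolean[] {false, true}) {
      for (boolean numNullsSet : new boolean[] {false, true}) {
        boolean isEmpty = !hasNonNull && !numNullsSet;      // stated definition
        boolean original = !isEmpty && !hasNonNull;         // check as written
        boolean simplified = numNullsSet && !hasNonNull;    // proposed form
        if (original != simplified) {
          return false;
        }
      }
    }
    return true; // the two forms agree on all four input combinations
  }

  public static void main(String[] args) {
    System.out.println("equivalent: " + equivalent());
  }
}
```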

@RussellSpitzer (Member, Author)

This is actually a direct reference to the logic in Parquet's Statistics.java. I'm doing the same check they do, which is "!isEmpty && !hasNonNullValue"; see

https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L454

@RussellSpitzer (Member, Author)

I added a comment to explain this, but I want to stick to parquet-mr at the moment. Ideally we would have some well-defined API for this, but I don't believe we do.

@szehon-ho (Collaborator)

Sounds good to me. I was just doing a sanity check by expanding the logic to (!statistics.hasNonNullValue() && !isNumNullsSet()) || !statistics.hasNonNullValue(), and it checks out. I saw your comment about toString() and was not sure if it's that reliable as a source of truth.

Contributor:

I agree with @szehon-ho's simplification of the logic, and I'm debating whether I prefer that or using isEmpty, which is really unclear.

I also think that we may be adding a case to unknown when it is actually known. Since the check becomes isNumNullSet and not hasNonNullValue because NaN values trigger hasNonNullValue = false, are we including cases where hasNonNullValue is actually false because the null count is zero? This would be good to make sure tests cover.

Otherwise I think the logic in this PR looks good.

@RussellSpitzer (Member, Author)

I'm not sure what you mean by:

"are we including cases where hasNonNullValue is actually false because the null count is zero?"

There are only four assignments of hasNonNullValue:

  1. In the constructor for Statistics: false
  2. & 3. In FloatStatistics and DoubleStatistics when a NaN is seen: false
  4. In markAsNotEmpty(): true

  /**
   * Sets the page/column as having a valid non-null value
   * kind of misnomer here
   */
  protected void markAsNotEmpty() {
    hasNonNullValue = true;
  }

This method is invoked in all of the Statistics subclasses when setMinMax, initializeStats, or setMinMaxFromBytes are called.

So I'm not sure how hasNonNullValue could actually be false unless we see a NaN or no min and max are set.

Contributor:

Sounds good. It's odd that this is not based on the number of values and number of nulls. Okay though, I guess it's just a strange API.

@RussellSpitzer (Member, Author)

It is quite strange

@RussellSpitzer (Member, Author)

Retest this please

@aokolnychyi (Contributor) left a comment

This seems correct to me but it is really unfortunate we inherit this unreliable Parquet logic.

@RussellSpitzer (Member, Author)

If no one has any more comments on this PR, I'd like to merge it.

@rdblue rdblue merged commit e5ea152 into apache:master Jan 13, 2023
@RussellSpitzer RussellSpitzer deleted the ParquetFixNaNRowGroupSkipping branch January 13, 2023 17:59
Linked issue: Parquet: Metric Row Group Filter handles Undefined Min/Max incorrectly Missing Rows