Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Should not throw exception for filter contains non-reference term #8243

Merged
merged 2 commits into from Aug 10, 2023

Conversation

ConeyLiu
Copy link
Contributor

@ConeyLiu ConeyLiu commented Aug 7, 2023

The parquet filter evaluator should not throw exceptions for those filters containing non-reference terms. For example, the filter evaluator apply on year(ts) = 20 will throw ValidationException;


@Override
public <T> Boolean handleNonReference(Bound<T> term) {
return ROWS_MIGHT_MATCH;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the column is not referenced, then the column is missing, right? How could it then match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here the NonReference should be it is not a column reference term. It can be a transform term, for example equal(year(ts), 10).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko, I think at this point the columns are bound and transforms have already been removed if they can be (e.g. if they can be translated to a normal predicate on a partition field). This is the case where the transform is left over because there is no corresponding partition field. We could implement future optimizations to run the filter, but returning all rows is good enough for now.

boolean shouldRead =
new ParquetMetricsRowGroupFilter(SCHEMA, equal(truncate("required", 2), "some_value"), true)
.shouldRead(parquetSchema, rowGroupMetadata);
Assumptions.assumeThat(shouldRead)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to use an assertion rather than an assumption? If you use an assumption the test will be skipped if it fails, instead of failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, fixed.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good other than the assumption that I think should be an assertion. I think @Fokko also has some questions that I don't fully understand so I'll defer to him on merging this.

@rdblue rdblue merged commit 84d00b8 into apache:master Aug 10, 2023
41 checks passed
@rdblue
Copy link
Contributor

rdblue commented Aug 10, 2023

Thanks, @ConeyLiu!

@ConeyLiu
Copy link
Contributor Author

Thanks @rdblue @Fokko

@ConeyLiu ConeyLiu deleted the parquet-transform-filter branch August 11, 2023 03:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants