Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-34: Add #contains FilterPredicate for Array columns #1328

Merged
merged 22 commits into from
Jun 4, 2024

Conversation

clairemcginty
Copy link
Contributor

@clairemcginty clairemcginty commented Apr 23, 2024

Proposal to add a new FilterPredicate, Contains, that can be applied to List types, and check if the specified element is present among the repeated values. It can be composed using And or Or:

FilterPredicate predicate = contains(
  eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))
)

FilterPredicate predicate = or(
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))),
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("home")))
)

FilterPredicate predicate = and(
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))),
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("home")))
)

The filtering logic is largely based on existing Eq predicates to apply filtering at the page/rowgroup level using statistics/dictionaries, with a specialized implementation in IncrementallyUpdatedFilterPredicateBuilder to do individual record-level filtering.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
    from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Style

  • My contribution adheres to the code style guidelines and Spotless passes.
    • To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@clairemcginty clairemcginty marked this pull request as ready for review April 24, 2024 13:40
@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
return new NotIn<>(column, values);
}

public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking to the future, it could eventually be useful to support Array FilterPredicates like "repeated field X contains an int field greater than Y". We could refactor this API to be written like:

FilterApi.contains(
  FilterApi.eq(
    FilterApi.arrayColumn(FilterApi.intColumn("repeated_int_field")), 
    100
  )
)

& we could eventually support FilterApi.contains(FilterApi.lt(...)), FilterApi.contains(FilterApi.gt(...)), etc...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that Contains/DoesNotContain is not a standard SQL function. In which case they are used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, now these new functions only accept single value. Does it make sense to support variable number of values? I also see some databases support CONTAINS(a OR b) and CONTAINS(a AND b). They can be composited by logical operators but the cost would be high compared to support them natively,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a standard SQL function, but I've seen it in SQL extension languages such as BigQuery Standard SQL, and I've gotten several requests to support this by users of the Scio Parquet library!

that's a good point about making this composable, I think it would be more efficient to do CONTAINS(a or b) than CONTAINS(a) or CONTAINS(b). What do you think about supporting lt/gt in addition to eq-based Contains? for example, CONTAINS(eq(a) OR gt(b)) ? It would make this PR a lot more complex but I'm happy to try. We could probably re-use a lot of the existing filter code for eq, lt/gt, etc...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that it worths making the function composable. Perhaps we can define all the necessary API (with implementation of the most common ones) in the PR and implementations of all other kinds of inputs can be split into separate PRs. Let's not hurry for the coming release. It still takes time to stabilize.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a standard SQL function, but I've seen it in SQL extension languages such as BigQuery Standard SQL, and I've gotten several requests to support this by users of the Scio Parquet library!

that's a good point about making this composable, I think it would be more efficient to do CONTAINS(a or b) than CONTAINS(a) or CONTAINS(b). What do you think about supporting lt/gt in addition to eq-based Contains? for example, CONTAINS(eq(a) OR gt(b)) ? It would make this PR a lot more complex but I'm happy to try. We could probably re-use a lot of the existing filter code for eq, lt/gt, etc...

I took a second look, it seems that you were mentioning IN expression which we already have: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L209-L258

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've pushed my changes to support composable contains predicates! example usage:

FilterApi.containsOr(
  FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), 5555555555L),
  FilterApi.containsOr(
    FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), -10000000L),
    FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), 2222222222L)));

I left out implementing DoesNotContain/ContainsNotEq for now to keep the PR simpler to parse. Let me know if the API looks ok 👍 Then I can fill out the rest of the implementations/make the unit tests more thorough.

Sorry I didn't make it clear. I thought we could still leverage existing composite expression and combine CONTAINS(x) sub-expressions under the same level of AND or OR expression during rewrite process like https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/compat/FilterCompat.java#L79. It would be harder to maintain the code if introduce new specialized composite expression like containsOr.

Copy link
Contributor Author

@clairemcginty clairemcginty Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, ok. So we’d have a single public Contains predicate class that’s registered with the Visitor API, that can support one or more sub-predicates… and the Rewriter would inspect any Ands or Ors to see see if the left/right side is an instance of Contains… If so, we rewrite it into a single Contains predicate containing both clauses?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect so, but it seems requiring a lot more work. We can support the simplest case at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think I mostly got this working, supporting just Contains Eq (+ composition using And/Or) to start. hopefully this is closer to what you had in mind!

@wgtmac wgtmac requested a review from gszadovszky April 24, 2024 14:18
@wgtmac
Copy link
Member

wgtmac commented Apr 24, 2024

Thanks for the new feature! I will try to take a look soon and make it to the 1.14.0 release.

@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
return new NotIn<>(column, values);
}

public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that Contains/DoesNotContain is not a standard SQL function. In which case they are used?

@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
return new NotIn<>(column, values);
}

public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, now these new functions only accept single value. Does it make sense to support variable number of values? I also see some databases support CONTAINS(a OR b) and CONTAINS(a AND b). They can be composited by logical operators but the cost would be high compared to support them natively,

pom.xml Outdated Show resolved Hide resolved
@wgtmac
Copy link
Member

wgtmac commented Apr 26, 2024

As requested on the dev@parquet ML, I'll wait for this PR before starting the releasing process of 1.14.0.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @clairemcginty for working on this. This is a great improvement to support nested structures in filtering!

I agree with @wgtmac's comments/questions. Added some more.

@clairemcginty clairemcginty force-pushed the parquet-34 branch 2 times, most recently from a597fcf to 9cd2711 Compare April 28, 2024 10:25
@wgtmac
Copy link
Member

wgtmac commented Apr 29, 2024

BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready.

@gszadovszky
Copy link
Contributor

BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready.

I agree with @wgtmac. Let's create smaller PRs and make improvements then release when we feel it stable. During the development, we may advertise this feature so others may start experimenting on it and give feedback before actually releasing it. A too early release might make later improvements harder because we need to be backward compatible.

@clairemcginty
Copy link
Contributor Author

BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready.

I agree with @wgtmac. Let's create smaller PRs and make improvements then release when we feel it stable. During the development, we may advertise this feature so others may start experimenting on it and give feedback before actually releasing it. A too early release might make later improvements harder because we need to be backward compatible.

Sounds good to me! this PR might take another week or two to get right. It would also be nice to release support for operations like Array#size at the same time, so pushing it to 0.15 would give us time to do that 👍

@wgtmac
Copy link
Member

wgtmac commented May 15, 2024

Sorry for the delay. I will try to finish another pass by the end of this week.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall LGTM! Thanks @clairemcginty for working on this and adding exhaustive test!

cc @gszadovszky @julienledem @rdblue Sorry for bothering. I think this PR requires visibility to more reviewers.

@clairemcginty
Copy link
Contributor Author

This overall LGTM! Thanks @clairemcginty for working on this and adding exhaustive test!

great, I'm glad this implementation looks ok! I have a few more tests that I'd like to add around null handling + behavior of the Contains predicate on map types (I think that they should just work, but I haven't tried it out yet...). Will try to add those + address PR comments on Monday or Tuesday next week 👍

@clairemcginty
Copy link
Contributor Author

clairemcginty commented May 20, 2024

I tried adding a test case to TestRecordLevelFilters to test contains(eq(null)). My expectation was that if you have an array schema with an optional element type, this should return true if the array contains one or more null elements. However, I don't think this is possible to make work--I set a debugger on ValueInspector#update and ValueInspector#updateNull. and ValueInspector#update is only invoked for non-null elements, and ValueInspector#updateNull is only invoked if the entire array is null, which isn't exactly what we want, either.

So based on my current understanding, I don't think we can support a contains(eq(null)) predicate, and we can probably add a precondition check to the Contains constructor against a null predicate value. Wdyt @wgtmac ?

@@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
return new NotIn<>(column, values);
}

public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a fast-follow-up PR we can just replace this signature with:

public static <T extends Comparable<T>> Contains<T> contains(ColumnFilterPredicate<T> pred) {
   ...
}

and remove the SupportsContains annotation. I did a few tests of Contains(lt) and Contains(gt) and everything is working as expected 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood -- I can commit to working on the follow-up PR right away 👍

Copy link
Contributor Author

@clairemcginty clairemcginty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some ad-hoc testing of this on larger datasets (1m+ elements with various page/row group distributions) and confirmed that everything looks right! so I think this PR is finally complete/ready for review.

The only outstanding task is to include support for NotEq/Lt/Gt/Lte/Gte/In/NotIn with Contains, which is just updating the FilterApi signature and adding a few lines to IncrementallyUpdatedFilterPredicateGenerator. I think that can be a followup PR 👍

@gszadovszky
Copy link
Contributor

@wgtmac, completely agree to have more people reviewing this. Thanks for pinging.
I'll try to take a look this week.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @clairemcginty!
I've added some comments.

@@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
return new NotIn<>(column, values);
}

public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues.

Comment on lines 381 to 385
public int nextInt() {
int result = next;
next = null;
return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this implementation follows the contract of an iterator. Iterators do not enforce the user to call hasNext before calling next. If you sure about the size, you may just call one next after another. It may throw a NoSuchElementException if there are no more values.
The trick is to calculate the next value beforehand and check the existence of that value at hasNext and calculate the "next-next" value at next before returning the pre-calculated next value. (Hope it makes sense 😄)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See IndexIterator as an example. You may even put static methods there for intersection and union and it will be more readable here.
I would not use boxing/unboxing here only to have a additional null value. Indices here can only be non-negative so you may use -1 to mark the case of no more values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense! I had put some of the next-loading logic into hasNext() because as far as I could tell, this iterator would be used via the forEachRemaining() method, which IIRC does sequentially call hasNext()/next()... but it does make this a bit unstable 😅 I can rewrite it and move into static methods in IndexIterator.

@@ -20,6 +20,7 @@

import static org.apache.parquet.filter2.predicate.FilterApi.and;
import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.contains;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During checking the code I've found that this record level testing class does not seem to be correct. In several places (just like in your code) the method PhoneBookWriter.readFile is used which calls PhoneBookWriter.createReader to create the reader instance. The issue is, by default all the filtering (statistics/dictionary/column index) are working. As a result, we cannot be sure that the read results we are getting are really filtered only by the record level filter.
I know this is not your fault but could you please try using a properly set up reader for this test?

    // ...
    Configuration conf = new Configuration();
    GroupWriteSupport.setSchema(schema, conf);

    return ParquetReader.builder(new GroupReadSupport(), file)
        .withConf(conf)
        .withFilter(filter)
        .withAllocator(allocator)
        .useBloomFilter(false)
        .useDictionaryFilter(false)
        .useStatsFilter(false)
        .useColumnIndexFilter(false)
        .build();

If unrelated errors may occur, let's handle this separately, but I want to avoid not catching a potential issue in your code by leaving this test as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, I did notice that -- I'd been refactoring StatisticsFilter, made a mistake, and was confused why TestRecordLevelFilter suddenly started failing.

updated! All tests continued to pass after disabling the other filters.

@clairemcginty
Copy link
Contributor Author

hey @gszadovszky! all your requested changes have been addressed - anything else that's missing?

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found an issue missed before.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @clairemcginty, my most important comment got lost from the previous review, somehow.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @clairemcginty. It looks good to me. Let's wait for the tests to pass.

@wgtmac, since no one else chimed in this review, I think, we can step forward and merge it (after CI approval). There will be follow up changes anyway and the next release is probably a couple months away that gives time for others to comment on this feature.

@wgtmac
Copy link
Member

wgtmac commented Jun 4, 2024

Thanks @gszadovszky for the detail review! I'll take another pass shortly to be familiar with the latest change and then merge it if no concern.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@wgtmac wgtmac merged commit dab5aae into apache:master Jun 4, 2024
9 checks passed
@clairemcginty clairemcginty deleted the parquet-34 branch June 5, 2024 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants