
Spark 3.4: Support pushing down system functions by V2 filters #7886

Merged
merged 17 commits into apache:master on Aug 1, 2023

Conversation

Contributor

@ConeyLiu ConeyLiu commented Jun 23, 2023

This PR adds support for pushing down system function filters via Spark V2 filters. For example:

-- table(id long, name string) partitioned by bucket(10, id)

SELECT * FROM iceberg.db.table WHERE iceberg.system.bucket(10, id) = 2;

DELETE FROM iceberg.db.table WHERE iceberg.system.bucket(10, id) = 2;

This is the first part of the work: it switches the filters to V2 filters and adds support for converting system functions wrapped in Spark's UserDefinedScalarFunc into Iceberg expressions. A follow-up PR will add a rule to convert system function calls to ApplyFunctionExpression, which can be pushed down to the data source.
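
For illustration, here is a minimal sketch (not the PR's actual code) of how an equality predicate over an Iceberg system function, such as iceberg.system.bucket(10, id) = 2, wrapped in Spark's UserDefinedScalarFunc could be mapped to an Iceberg expression; the class and method names are hypothetical:

    import org.apache.iceberg.expressions.Expression;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.Literal;
    import org.apache.spark.sql.connector.expressions.NamedReference;
    import org.apache.spark.sql.connector.expressions.UserDefinedScalarFunc;
    import org.apache.spark.sql.connector.expressions.filter.Predicate;

    class SystemFunctionPushDownSketch {
      // Handles the shape: iceberg.system.bucket(n, col) = value
      static Expression convertBucketEquality(Predicate predicate) {
        if (!"=".equals(predicate.name())) {
          return null; // only the equality shape is handled in this sketch
        }

        UserDefinedScalarFunc udf = (UserDefinedScalarFunc) predicate.children()[0];
        Literal<?> rhs = (Literal<?>) predicate.children()[1];

        // bucket(n, col): the first child is the bucket count, the second the column reference
        int numBuckets = ((Number) ((Literal<?>) udf.children()[0]).value()).intValue();
        String column = String.join(".", ((NamedReference) udf.children()[1]).fieldNames());
        Object value = rhs.value();

        // Equivalent Iceberg expression: bucket(column, numBuckets) == value
        return Expressions.equal(Expressions.bucket(column, numBuckets), value);
      }
    }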

@github-actions github-actions bot added the spark label Jun 23, 2023
@ConeyLiu
Contributor Author

Hi @rdblue @aokolnychyi @szehon-ho @Fokko @nastra, could you help review this when you are free?

Collaborator

@szehon-ho szehon-ho left a comment

Looks promising; wondering if we can split out the unrelated changes into another PR.

@rdblue
Contributor

rdblue commented Jul 4, 2023

@ConeyLiu, I'll take another look when this is rebased on #7898.

@ConeyLiu
Contributor Author

ConeyLiu commented Jul 5, 2023

Thanks @rdblue, this has been rebased.

    private static boolean isSystemFunc(org.apache.spark.sql.connector.expressions.Expression expr) {
      if (expr instanceof UserDefinedScalarFunc) {
        UserDefinedScalarFunc udf = (UserDefinedScalarFunc) expr;
        return udf.canonicalName().startsWith("iceberg")
Member

Do we need this string check? I feel like all of the functions should be listed in the list on line 369?

Contributor Author

There could possibly be some other UDF functions, like 'catalog.other.years'.

Contributor

I think the problem is that this is using the UDF's canonicalName but the other check uses the function's name. Those can differ: BucketFunction.name() returns bucket, which is what the user would call, while the canonical function name identifies the exact bound function no matter how it is loaded, so BucketInt.canonicalName() returns iceberg.bucket(int).

If we want to limit to just the Iceberg-defined functions, then this is necessary. It may be better to have a set of supported functions that uses just the canonicalName.

Contributor Author

It may be better to have a set of supported functions that uses just the canonicalName.

There are some canonicalName values that are not constants, for example:

    @Override
    public String canonicalName() {
      return String.format("iceberg.bucket(%s)", sqlType.catalogString());
    }

Contributor

@rdblue rdblue Jul 14, 2023

While that's not a constant, there are a limited number of values that sqlType.catalogString() might return. I think it would be better to enumerate them, rather than just looking for iceberg and mapping the name.
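
For example, a hypothetical allow-list keyed on canonicalName() could look like the sketch below; iceberg.bucket(int) is taken from the discussion above, while the other entries (and the field/method names) are illustrative assumptions, and java.util.Set is assumed to be imported:

    // Hypothetical sketch: enumerate the supported canonical names instead of a prefix check.
    private static final Set<String> SUPPORTED_CANONICAL_NAMES =
        Set.of(
            "iceberg.bucket(int)",      // from the discussion above
            "iceberg.bucket(bigint)",   // illustrative
            "iceberg.years(date)",      // illustrative
            "iceberg.days(timestamp)"); // illustrative

    private static boolean isSupportedSystemFunc(UserDefinedScalarFunc udf) {
      return SUPPORTED_CANONICAL_NAMES.contains(udf.canonicalName());
    }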

Contributor Author

I have tried to do that; however, TruncateDecimal needs precision and scale, which are runtime values.

    @Override
    public String canonicalName() {
      return String.format("iceberg.truncate(decimal(%d,%d))", precision, scale);
    }

Contributor

You can similarly enumerate them, but this isn't a big deal.

        return false;
      } else {
        return Arrays.stream(predicate.children()).skip(1).allMatch(SparkV2Filters::isLiteral);
      }
    }

    /** Should be called after {@link #couldConvert} passed */
    private static <T> UnboundTerm<Object> toTerm(T input) {
Contributor

It would be really nice if Spark passed the BoundFunction in through the expression!

    pushFilters(builder, predicate);
    scan = builder.build().toBatch();

    Assertions.assertThat(scan.planInputPartitions().length).isEqualTo(10);
Contributor

Are there any tests that exercise the entire code path using SQL? I'd like to see at least one end-to-end test with SQL.

import org.junit.Before;
import org.junit.Test;

public class TestSystemFunctionPushDownDQL extends SparkExtensionsTestBase {
Contributor Author

@rdblue here is the SQL test. We need to add tests for update/delete as well; will add them later.
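
For reference, a minimal sketch of what such an end-to-end SQL test could look like, assuming it lives in the TestSystemFunctionPushDownDQL class shown above and reuses the sql() (assumed to return List<Object[]>), tableName, and catalogName helpers from the test base; the table layout and assertions are illustrative, not the PR's actual test:

    @Test
    public void testBucketFunctionPushDownSQL() {
      sql(
          "CREATE TABLE %s (id BIGINT, data STRING) USING iceberg PARTITIONED BY (bucket(10, id))",
          tableName);
      sql("INSERT INTO %s VALUES (1, 'a'), (2, 'b'), (3, 'c')", tableName);

      // The WHERE clause calls the Iceberg system function directly; with push-down enabled
      // it should be converted to an Iceberg bucket expression and used to prune files.
      List<Object[]> rows =
          sql("SELECT id FROM %s WHERE %s.system.bucket(10, id) = 2 ORDER BY id", tableName, catalogName);

      // An exact expectation would require computing bucket(10, id) for each inserted row,
      // which is omitted in this sketch; here we only check that the query executes.
      Assertions.assertThat(rows).isNotNull();
    }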


    @Test
    public void testYearsFunction() {
      Assume.assumeTrue(!catalogName.equals("spark_catalog"));
Contributor Author

@ConeyLiu ConeyLiu Jul 12, 2023

System functions cannot be resolved with spark_catalog. I got the following error; will address it in a follow-up.

scala> spark.sql("select spark_catalog.system.years(cast('2017-11-22' as timestamp))")
org.apache.spark.sql.AnalysisException: [SCHEMA_NOT_FOUND] The schema `system` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog.
To tolerate the error on drop use DROP SCHEMA IF EXISTS.; line 1 pos 7
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
  at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
  at org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:56)
  at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$functionExists$1(HiveExternalCatalog.scala:1352)
  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)

Contributor

Rather than setting up the catalog just to skip all the tests, can you just run this with one catalog for now?

Contributor Author

OK, updated to run with the Hive catalog only.

    extensions.injectCheckRule { _ => MergeIntoIcebergTableResolutionCheck }
    extensions.injectCheckRule { _ => AlignedRowLevelIcebergCommandCheck }

    // optimizer extensions
    extensions.injectOptimizerRule { _ => ExtendedSimplifyConditionalsInPredicate }
    extensions.injectOptimizerRule { _ => ExtendedReplaceNullWithFalseInPredicate }
    extensions.injectOptimizerRule { spark => RewriteStaticInvoke(spark) }
Contributor Author

The rule is put in the optimizer so that constant expressions have already been folded. For example, in system.days(ts) = system.days(date('2022-09-08')), the system.days(date('2022-09-08')) call will be evaluated into a constant value.

Contributor

Can we do this in a separate PR? I don't think the basic support for functions should include this.

Contributor Author

I intend to do that; however, the SQL UTs depend on it. I will submit a separate PR to do this.

Contributor Author

Submitted #8088 and will add the SQL push-down UTs after this is in.

    // Controls whether to push down Iceberg system function
    public static final String SYSTEM_FUNC_PUSH_DOWN_ENABLED =
        "spark.sql.iceberg.system-function-push-down.enabled";
    public static final boolean SYSTEM_FUNC_PUSH_DOWN_ENABLED_DEFAULT = false;
Contributor

Why is there a setting for this and why does it default to false?

Iceberg always has the option of rejecting functions by returning them rather than consuming them. I don't see a reason to have a flag for this.

Contributor Author

Compared with StaticInvoke, ApplyFunctionExpression may have a small performance decrease because it cannot leverage codegen.
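
For context, enabling the flag is a one-line session setting; a usage sketch assuming an existing SparkSession named spark (the key comes from the constants shown above):

    // Opt in to system function push-down for this session (default is false).
    spark.conf().set("spark.sql.iceberg.system-function-push-down.enabled", "true");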

Contributor

@holdenk holdenk left a comment

I am excited about this functionality.

Personally, I have some questions about how the filters are working for the delete use case, and I think it might be good to have some tests for the filter pushdown on deletes since (from my first pass reading of the code) it might throw an exception.

It looks great, and I'm excited to have more filter pushdowns working with Iceberg :)

    for (Predicate predicate : predicates) {
      Expression converted = convert(predicate);
      Preconditions.checkArgument(
          converted != null, "Cannot convert Spark predicate to Iceberg expression: %s", predicate);
Contributor

Even if it's following what was done elsewhere, this means that if a non-Iceberg UDF predicate got pushed down, it would fail to push down the Iceberg expressions? Looking at the code, this only seems to be used (currently) in deleteWhere, but I think for the deleteWhere code path we should not throw.

    -  public void deleteWhere(Filter[] filters) {
    -    Expression deleteExpr = SparkFilters.convert(filters);
    +  public void deleteWhere(Predicate[] predicates) {
    +    Expression deleteExpr = SparkV2Filters.convert(predicates);
Contributor

So this is the one case where we use convert on a list of predicates instead of one at a time; with canDeleteWhere using a for loop on the individual expressions, we can get different results.

I think, given the use case of convert + a list of predicates, we should change that to not throw, as originally suggested by @rdblue.

Contributor Author

Please see the answers above. I think this is deliberate here.

Contributor

Gotcha, I think I missed that this method will be invoked only if {@link #canDeleteWhere(Predicate[])} returns true. (I was looking at the Iceberg trait this extends instead of the upstream source.)
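
For clarity, here is a hedged sketch of the contract discussed above (illustrative only, not the PR's exact code): canDeleteWhere rejects the metadata delete when any predicate cannot be converted, so deleteWhere can assume that every predicate converts and the checkArgument inside convert(predicates) acts as an internal sanity check:

    @Override
    public boolean canDeleteWhere(Predicate[] predicates) {
      for (Predicate predicate : predicates) {
        // Reject the metadata delete if any predicate cannot be converted;
        // Spark will then not call deleteWhere() for this plan.
        if (SparkV2Filters.convert(predicate) == null) {
          return false;
        }
      }
      return true;
    }

    @Override
    public void deleteWhere(Predicate[] predicates) {
      // Invoked only when canDeleteWhere() returned true, so every predicate is convertible.
      Expression deleteExpr = SparkV2Filters.convert(predicates);
      // ... perform the metadata-only delete using deleteExpr
    }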

@ConeyLiu
Contributor Author

ConeyLiu commented Aug 1, 2023

Thanks @holdenk for your interest and comments.

Personally, I have some questions about how the filters are working for the delete use case, and I think it might be good to have some tests for the filter pushdown on deletes since (from my first pass reading of the code) it might throw an exception.

I mentioned this briefly in the description.

This is the first part of the work: it switches the filters to V2 filters and adds support for converting system functions wrapped in Spark's UserDefinedScalarFunc into Iceberg expressions. A follow-up PR will add a rule to convert system function calls to ApplyFunctionExpression, which can be pushed down to the data source.

Actually, I have submitted some of the code for the rule and the SQL UTs. However, they were split out to keep this PR small for review. I have a follow-up, #8088, which will be moved forward when this PR is merged.

@holdenk
Contributor

holdenk commented Aug 1, 2023

Actually, I have submitted some of the code for the rule and the SQL UTs. However, they were split out to keep this PR small for review. I have a follow-up, #8088, which will be moved forward when this PR is merged.

Awesome :) Sorry for missing the previous discussion around canDeleteWhere, and happy to learn there is a follow-on PR with more UTs :)

@rdblue rdblue merged commit 6875577 into apache:master Aug 1, 2023
31 checks passed
@rdblue
Contributor

rdblue commented Aug 1, 2023

Merged! Thanks for working on this @ConeyLiu. Fantastic work.

@ConeyLiu
Contributor Author

ConeyLiu commented Aug 2, 2023

Thanks @rdblue for merging this, and thanks to everyone for the patient reviews.

@BsoBird

BsoBird commented Oct 31, 2023

Although I am only an Iceberg user, I think this issue should be discussed further. This solution solves the problem at hand, but it inherently hurts the fairness of Iceberg's support for different engines, since not all engines can support Iceberg's system functions. The ideal would be to implement transform push-down for complex conditions within each engine. While there may not be a conflict between these two options, I thought I'd speak up and say what I'm thinking.

@RussellSpitzer
Member

@BsoBird this is not an issue, it's a pull request. If you would like to see support for pushing down transforms in other query engines, I would suggest starting a new issue for those engines.

@anigos

anigos commented Feb 29, 2024

Will this work in the case of MERGE INTO?

MERGE INTO TAB1 trg USING TAB2 src ON (
  trg.serial_id = src.serial_id 
)
AND (
  (
    iceberg.system.bucket(10, id) = 2 
  )
)
WHEN MATCHED THEN
UPDATE
SET
  trg.data = src.data,
  trg.category = src.category
WHEN NOT MATCHED THEN
INSERT
  (serial_id, data, category)
VALUES
  (
    src.serial_id, 
    src.data,
    src.category
  );

@anigos

anigos commented Feb 29, 2024

I am not sure what the syntax should be. Neither iceberg.system.bucket(10, id) nor trg._partition.id_bucket = 1 is working.


  spark.sql(""" MERGE INTO iceberg.sample_trg trg USING iceberg.sample_src src ON (
              |  trg.id = src.id
              |)
              |AND (trg._partition.id_bucket = 1 and trg._partition.id_bucket = 7 )
              |WHEN MATCHED THEN
              |UPDATE
              |SET
              |  trg.data = src.data,
              |  trg.category = src.category
              |  WHEN NOT MATCHED THEN
              |INSERT
              |  (id, data, category, ts)
              |VALUES
              |  (
              |    src.id,
              |    src.data,
              |    src.category,
              |    src.ts
              |  );
              |  """.stripMargin)

@aokolnychyi @flyrain @RussellSpitzer Looking forward.
