
Spark 3.3, 3.4: use a deterministic where condition to make rewrite_data_files… #6760

Merged
merged 46 commits into from
May 20, 2023

Conversation

@ludlows (Contributor) commented Feb 7, 2023

This behavior was requested in issue #6759.

Here I implement an evaluation step that checks whether the where condition is deterministically false; if so, rewrite_data_files exits immediately.

Closes #6759

@github-actions github-actions bot added the spark label Feb 7, 2023
@ludlows ludlows changed the title spark: use a deterministic where condition to make rewrite_data_files… Spark: use a deterministic where condition to make rewrite_data_files… Feb 9, 2023
@ludlows ludlows changed the title Spark: use a deterministic where condition to make rewrite_data_files… Spark 3.3: use a deterministic where condition to make rewrite_data_files… Feb 11, 2023
@szehon-ho (Collaborator)
Hi @ludlows, I'm not too familiar with the Spark side, but I'm wondering: doesn't the RewriteDataFiles procedure already check for and skip the case where there are no matching data files? Ref: https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L177 I'm not sure the savings are significant.

@ludlows (Contributor, Author) commented Feb 15, 2023

Hi @szehon-ho, thanks for your comments.
I noticed that the rewrite_data_files procedure first runs checkAndApplyFilter just before action.execute():

action = checkAndApplyFilter(action, where, quotedFullIdentifier);

However, if the where condition is always false, such as

where => '0=1'

then checkAndApplyFilter raises an IllegalArgumentException at the line below.

throw new IllegalArgumentException("Cannot parse predicates in where option: " + where);

To make a SQL call like CALL catalog.system.rewrite_data_files(table => 'hive.tbl', where => '0=1') exit without an exception, I proposed this PR.
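The short-circuit being proposed can be illustrated with a small standalone sketch (plain Java; this is not the actual Iceberg code, and the class and method names here are made up for illustration). It detects a where clause that compares two different integer literals, which can never match any row:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of the short-circuit idea (not the actual Iceberg code):
// detect a where clause of the form "<int> = <int>" whose two sides differ,
// i.e. a predicate that is deterministically false regardless of the data.
public class TriviallyFalseCheck {
  private static final Pattern INT_EQ = Pattern.compile("^\\s*(\\d+)\\s*=\\s*(\\d+)\\s*$");

  static boolean isTriviallyFalse(String where) {
    Matcher m = INT_EQ.matcher(where);
    return m.matches() && !m.group(1).equals(m.group(2));
  }

  public static void main(String[] args) {
    // '0=1' can never match any row, so a rewrite could exit early
    System.out.println(isTriviallyFalse("0=1"));   // true
    System.out.println(isTriviallyFalse("1=1"));   // false: always true, not false
    System.out.println(isTriviallyFalse("id=1"));  // false: depends on data
  }
}
```

The real change, of course, relies on Spark's optimizer rather than string matching, since conditions like '1=2 AND id > 5' must also be recognized.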

@szehon-ho (Collaborator)
Thanks for the explanation; the problem makes sense to me now. Maybe modifying the original method to return an Option would be cleaner?

But I think it would be good for @aokolnychyi and @rdblue to take a look as well, since they are more familiar with the Spark side.

@ludlows (Contributor, Author) commented Feb 21, 2023

Hi @szehon-ho, thanks for your suggestion.
Yes, it is also possible to return an empty Option, but it seems we would need to modify more call sites in that case.

And hi @aokolnychyi, @rdblue, do you have any suggestions about this new behavior of rewrite_data_files?
Thanks.

@szehon-ho (Collaborator)
@ludlows can you please add a unit test that demonstrates the bug as well?

@ludlows (Contributor, Author) commented Mar 5, 2023

Hi @szehon-ho,

I added a test case for this PR. You can comment out the lines below in RewriteDataFilesProcedure.java to reproduce the bug:

// if (where != null && SparkExpressionConverter.checkWhereAlwaysFalse(spark(), quotedFullIdentifier, where)) {
//   RewriteDataFiles.Result result = new BaseRewriteDataFilesResult(Lists.newArrayList());
//   return toOutputRows(result);
// }

let me know if you have any questions.
thanks.

@szehon-ho (Collaborator) commented Mar 11, 2023

I walked through the code and see the problem.

I still think we should change the original method, collectResolvedSparkExpression, so that it does not throw AnalysisException in this case. It does not seem useful to add a second method that does almost the same thing and require callers to invoke both.

  def collectResolvedSparkExpression(session: SparkSession, tableName: String, where: String): Option[Expression] = {
    val tableAttrs = session.table(tableName).queryExecution.analyzed.output
    val unresolvedExpression = session.sessionState.sqlParser.parseExpression(where)
    val filter = Filter(unresolvedExpression, DummyRelation(tableAttrs))
    val optimizedLogicalPlan = session.sessionState.executePlan(filter).optimizedPlan
    // collectFirst already returns an Option, so no Some/getOrElse wrapping is needed
    optimizedLogicalPlan.collectFirst {
      case filter: Filter => filter.condition
    }
  }

@ludlows (Contributor, Author) commented Mar 11, 2023

Hi @szehon-ho,
thank you so much for the code review!
Indeed, the approach following your suggestion modifies less code.
I have now implemented it in the latest commit.
Could you take a look and give more suggestions?
Thanks, and have a nice weekend!

return action.filter(SparkExpressionConverter.convertToIcebergExpression(expression));
Option<Expression> expressionOption =
    SparkExpressionConverter.collectResolvedSparkExpressionOption(spark(), tableName, where);
if (expressionOption.isEmpty()) return action.filter(Expressions.alwaysFalse());
Collaborator:

Nit: I'm not sure whether checkstyle/spotless will fail here, but I think we need an extra newline for the return in any case.

Contributor Author:
Yes, you are right. The style check failed here; I need to add {} around each if statement.

@szehon-ho (Collaborator)
Actually, one thing bothers me. Can we check: if you pass in a filter that is always true, can we distinguish it from the always-false case?

@ludlows (Contributor, Author) commented Mar 12, 2023

We can now tell that the where filter is always false by checking whether any Filter node survives in the optimized LogicalPlan (the optimizer collapses an always-false filter into an empty relation).
It seems we cannot tell whether the filter is always true using this method.

On the other hand, if we know the where filter is always true, we simply execute the rewrite as normal; and if we know it is always false, we do nothing, to save time.
So it seems we only need to care about whether the where condition is always false.
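The reasoning above relies on Catalyst's optimizer collapsing a constant filter: Filter(false, child) becomes an empty relation, so no Filter node survives, while Filter(true, child) is replaced by its child. A toy model of that pruning behavior in plain Java (the types and names here are illustrative, not Spark internals):

```java
// Toy model of constant-filter pruning (illustrative, not Spark internals).
interface Plan {}
record Relation(String name) implements Plan {}
record EmptyRelation() implements Plan {}
record ConstFilter(Boolean condition, Plan child) implements Plan {}

public class PruneFiltersSketch {
  // Sketch of the idea behind Catalyst's filter pruning: an always-false
  // filter yields an empty relation; an always-true filter is dropped.
  static Plan optimize(Plan plan) {
    if (plan instanceof ConstFilter f) {
      if (Boolean.FALSE.equals(f.condition())) {
        return new EmptyRelation();   // no Filter node survives in the plan
      }
      if (Boolean.TRUE.equals(f.condition())) {
        return optimize(f.child());   // filter is a no-op, keep only the child
      }
    }
    return plan;
  }

  public static void main(String[] args) {
    Plan base = new Relation("tbl");
    System.out.println(optimize(new ConstFilter(false, base)) instanceof EmptyRelation); // true
    System.out.println(optimize(new ConstFilter(true, base)).equals(base));              // true
  }
}
```

This is why checking "no Filter node remains" identifies the always-false case but cannot, by itself, distinguish always-true from an ordinary data-dependent filter.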

@ludlows (Contributor, Author) commented Apr 8, 2023

@szehon-ho thanks for the comments. I would be glad to work with you to get this feature into the master branch. I think you could add your commits to this PR to implement the pattern-matching approach, since I am not very familiar with Scala.

val tableAttrs = session.table(tableName).queryExecution.analyzed.output
val unresolvedExpression = session.sessionState.sqlParser.parseExpression(where)
val filter = Filter(unresolvedExpression, DummyRelation(tableAttrs))
val optimizedLogicalPlan = session.sessionState.executePlan(filter).optimizedPlan
optimizedLogicalPlan.collectFirst {
  case filter: Filter => convertToIcebergExpression(filter.condition)
  case dummyRelation: DummyRelation => Expressions.alwaysTrue()
Collaborator:
For cleaner code, can we return Spark's Expression.TRUE and Expression.FALSE here, and call convertToIcebergExpression outside?

Contributor Author:

Hi @szehon-ho, do we need an Expression.TRUE on the Spark side? In the end we only need an Iceberg expression. But it seems possible if we implement it in the following way:

    optimizedLogicalPlan.collectFirst {
      case filter: Filter => filter.condition
      case dummyRelation: DummyRelation => session.sessionState.sqlParser.parseExpression("true")
      case localRelation: LocalRelation => session.sessionState.sqlParser.parseExpression("false")
    }.getOrElse(throw new AnalysisException("Failed to find filter expression"))

What do you think?

Collaborator:
From Spark: Literal.TrueLiteral.

assertEquals(
    "Action should rewrite 0 data files and add 0 data files",
    row(0, 0),
    Arrays.copyOf(output.get(0), 2));
Collaborator:
Do we get output = (0, 0, 0)? Can we just assert all three values instead of only the first two in this case?

Contributor Author:

I think it is because the first two values are of type Integer and the last one is of type Long.

@szehon-ho (Collaborator) commented May 11, 2023

You mean the assert fails? How about row(0, 0, 0L)?

Contributor Author:

[screenshot of the failing test]
Yes, snapshotSummary().get(SnapshotSummary.REMOVED_FILE_SIZE_PROP) returns null here, which makes the test fail.
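The type issue discussed here can be reproduced standalone: the first two row values are Integers while the third is a Long, and Arrays.equals on Object[] compares elements with equals(), so an Integer never equals a Long. A minimal sketch (the three-column row layout is assumed from the procedure's output schema):

```java
import java.util.Arrays;

public class RowPrefixAssert {
  public static void main(String[] args) {
    // Assumed procedure output row: (rewrittenFiles: Integer, addedFiles: Integer, rewrittenBytes: Long)
    Object[] output = {0, 0, 0L};

    // The test compares only the first two columns via Arrays.copyOf.
    Object[] firstTwo = Arrays.copyOf(output, 2);
    System.out.println(Arrays.equals(firstTwo, new Object[] {0, 0}));   // true

    // Comparing all three values requires the exact boxed types: 0L, not 0.
    System.out.println(Arrays.equals(output, new Object[] {0, 0, 0}));  // false: Integer != Long
    System.out.println(Arrays.equals(output, new Object[] {0, 0, 0L})); // true
  }
}
```

And when the snapshot summary entry is missing entirely, the third column is null, so even row(0, 0, 0L) would not match, which is why the test initially asserted only the two-column prefix.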

    "Action should rewrite 0 data files and add 0 data files",
    row(0, 0),
    Arrays.copyOf(output.get(0), 2));
// verify rewritten bytes separately
Collaborator:

There seems to be no need for this comment, as we don't assert on bytes.

Contributor Author:

yes. let me fix it.

// create 10 files under non-partitioned table
insertData(10);
List<Object[]> expectedRecords = currentData();
// select only 0 files for compaction
Collaborator:

Minor: "select no files".

Contributor Author:

yes. let me fix it.

catalogName, tableIdent);
assertEquals(
"Action should rewrite 10 data files and add 1 data files",
row(10, 1),
Collaborator:

Optional: I think the test is more understandable if we just put row(10, 1, Long.valueOf(snapshotSummary().get(...))). I do realize it's done that way in other tests.

@szehon-ho (Collaborator)

@ludlows do you have a chance to try this: #6760 (comment)?

@ludlows (Contributor, Author) commented May 16, 2023

@szehon-ho oh, yes. The current version uses the Spark expressions for true and false; we no longer need to modify RewriteDataFilesProcedure.java.

@szehon-ho (Collaborator) commented May 16, 2023

Hi @ludlows, thanks for the change! Sorry, I just realized that Spark 3.4 is now the active branch. Could you duplicate the Spark 3.3 changes to the Spark 3.4 branch in this PR as well? (We usually make new changes on the active branch first, then backport as necessary.)

@@ -19,7 +19,6 @@

package org.apache.spark.sql.execution.datasources

import org.apache.iceberg.expressions.Expressions
Collaborator:

Is this unintentional?

Contributor Author:

It seems we do not need this import statement, so I removed it here.

@ludlows ludlows changed the title Spark 3.3: use a deterministic where condition to make rewrite_data_files… Spark 3.3, 3.4: use a deterministic where condition to make rewrite_data_files… May 19, 2023
@szehon-ho szehon-ho merged commit 08ae725 into apache:master May 20, 2023
31 checks passed
@szehon-ho (Collaborator)

Merged, thanks @ludlows

@ludlows ludlows deleted the rewritr_data_file-exit-where-false branch May 20, 2023 01:29