
Spark 3.3, 3.4: use a deterministic where condition to make rewrite_data_files… #6760

Merged
merged 46 commits into from
May 20, 2023

Conversation

@ludlows (Contributor) commented Feb 7, 2023

This behavior was requested in issue #6759.

Here I implement an evaluation step that checks whether the where condition is deterministically false; if so, rewrite_data_files exits immediately.

Closes #6759

@github-actions github-actions bot added the spark label Feb 7, 2023
@ludlows ludlows changed the title spark: use a deterministic where condition to make rewrite_data_files… Spark: use a deterministic where condition to make rewrite_data_files… Feb 9, 2023
@ludlows ludlows changed the title Spark: use a deterministic where condition to make rewrite_data_files… Spark 3.3: use a deterministic where condition to make rewrite_data_files… Feb 11, 2023
@szehon-ho (Collaborator)
Hi @ludlows, I'm not too familiar with the Spark side, but I'm wondering: doesn't the RewriteDataFiles procedure already check for and skip the case where there are no matching data files? Ref: https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java#L177 I'm not sure the savings are significant.

@ludlows (Contributor, Author) commented Feb 15, 2023

Hi @szehon-ho, thanks for your comments.
I noticed that the rewrite_data_files procedure first runs checkAndApplyFilter just before action.execute():

action = checkAndApplyFilter(action, where, quotedFullIdentifier);

However, if the where condition is always false, such as

where => '0=1'

then checkAndApplyFilter raises an IllegalArgumentException at the line below.

throw new IllegalArgumentException("Cannot parse predicates in where option: " + where);

To make a SQL call like CALL catalog.system.rewrite_data_files(table => 'hive.tbl', where => '0=1') exit without an exception, I proposed this PR.
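The short-circuit being proposed can be illustrated with a small standalone sketch (plain Java; this is not the actual Iceberg code, and the class and method names here are made up for illustration). It detects a where clause that compares two different integer literals, which can never match any row:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of the short-circuit idea (not the actual Iceberg code):
// detect a where clause of the form "<int> = <int>" whose two sides differ,
// i.e. a predicate that is deterministically false regardless of the data.
public class TriviallyFalseCheck {
  private static final Pattern INT_EQ = Pattern.compile("^\\s*(\\d+)\\s*=\\s*(\\d+)\\s*$");

  static boolean isTriviallyFalse(String where) {
    Matcher m = INT_EQ.matcher(where);
    return m.matches() && !m.group(1).equals(m.group(2));
  }

  public static void main(String[] args) {
    // '0=1' can never match any row, so a rewrite could exit early
    System.out.println(isTriviallyFalse("0=1"));   // true
    System.out.println(isTriviallyFalse("1=1"));   // false: always true, not false
    System.out.println(isTriviallyFalse("id=1"));  // false: depends on data
  }
}
```

The real change, of course, relies on Spark's optimizer rather than string matching, since conditions like '1=2 AND id > 5' must also be recognized.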

@szehon-ho (Collaborator)
Thanks for the explanation; the problem makes sense to me now. Maybe modifying the original method to return an Option would be cleaner?

But I think it would be good for @aokolnychyi and @rdblue to take a look as well, since they are more familiar with the Spark side.

@ludlows (Contributor, Author) commented Feb 21, 2023

Hi @szehon-ho, thanks for your suggestion.
Yes, it is also possible to return an empty Option, but it seems we would need to modify more call sites in that case.

And hi @aokolnychyi, @rdblue, do you have any suggestions about this new behavior of rewrite_data_files?
Thanks.

@szehon-ho (Collaborator)
@ludlows can you please add a unit test that demonstrates the bug as well?

@ludlows (Contributor, Author) commented Mar 5, 2023

Hi @szehon-ho,

I added a test case for this PR. You can comment out the lines below in RewriteDataFilesProcedure.java to reproduce the bug:

// if (where != null && SparkExpressionConverter.checkWhereAlwaysFalse(spark(), quotedFullIdentifier, where)) {
//   RewriteDataFiles.Result result = new BaseRewriteDataFilesResult(Lists.newArrayList());
//   return toOutputRows(result);
// }

let me know if you have any questions.
thanks.

@szehon-ho (Collaborator) commented Mar 11, 2023

I walked through the code and see the problem.

I still think we should change the original method, collectResolvedSparkExpression, so that it does not throw AnalysisException in this case. It does not seem useful to add a second method that does almost the same thing and require callers to invoke both.

  def collectResolvedSparkExpression(session: SparkSession, tableName: String, where: String): Option[Expression] = {
    val tableAttrs = session.table(tableName).queryExecution.analyzed.output
    val unresolvedExpression = session.sessionState.sqlParser.parseExpression(where)
    val filter = Filter(unresolvedExpression, DummyRelation(tableAttrs))
    val optimizedLogicalPlan = session.sessionState.executePlan(filter).optimizedPlan
    // collectFirst already returns an Option, so no Some/getOrElse wrapping is needed
    optimizedLogicalPlan.collectFirst {
      case filter: Filter => filter.condition
    }
  }

@ludlows (Contributor, Author) commented Mar 11, 2023

Hi @szehon-ho,
thank you so much for the code review!
Indeed, the approach following your suggestion modifies less code.
I have now implemented it in the latest commit.
Could you take a look and give more suggestions?
Thanks, and have a nice weekend!

return action.filter(SparkExpressionConverter.convertToIcebergExpression(expression));
Option<Expression> expressionOption =
    SparkExpressionConverter.collectResolvedSparkExpressionOption(spark(), tableName, where);
if (expressionOption.isEmpty()) return action.filter(Expressions.alwaysFalse());
Collaborator:

Nit: I'm not sure whether checkstyle/spotless will fail here, but I think we need an extra newline for the return in any case.

Contributor Author:
Yes, you are right. The style check failed here; I need to add {} around each if statement.

@szehon-ho (Collaborator)
Actually, one thing bothers me. Can we check: if you pass in a filter that is always true, can we distinguish it from the always-false case?

@ludlows (Contributor, Author) commented Mar 12, 2023

We can now tell that the where filter is always false by checking whether any Filter node survives in the optimized LogicalPlan (the optimizer collapses an always-false filter into an empty relation).
It seems we cannot tell whether the filter is always true using this method.

On the other hand, if we know the where filter is always true, we simply execute the rewrite as normal; and if we know it is always false, we do nothing, to save time.
So it seems we only need to care about whether the where condition is always false.
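The reasoning above relies on Catalyst's optimizer collapsing a constant filter: Filter(false, child) becomes an empty relation, so no Filter node survives, while Filter(true, child) is replaced by its child. A toy model of that pruning behavior in plain Java (the types and names here are illustrative, not Spark internals):

```java
// Toy model of constant-filter pruning (illustrative, not Spark internals).
interface Plan {}
record Relation(String name) implements Plan {}
record EmptyRelation() implements Plan {}
record ConstFilter(Boolean condition, Plan child) implements Plan {}

public class PruneFiltersSketch {
  // Sketch of the idea behind Catalyst's filter pruning: an always-false
  // filter yields an empty relation; an always-true filter is dropped.
  static Plan optimize(Plan plan) {
    if (plan instanceof ConstFilter f) {
      if (Boolean.FALSE.equals(f.condition())) {
        return new EmptyRelation();   // no Filter node survives in the plan
      }
      if (Boolean.TRUE.equals(f.condition())) {
        return optimize(f.child());   // filter is a no-op, keep only the child
      }
    }
    return plan;
  }

  public static void main(String[] args) {
    Plan base = new Relation("tbl");
    System.out.println(optimize(new ConstFilter(false, base)) instanceof EmptyRelation); // true
    System.out.println(optimize(new ConstFilter(true, base)).equals(base));              // true
  }
}
```

This is why checking "no Filter node remains" identifies the always-false case but cannot, by itself, distinguish always-true from an ordinary data-dependent filter.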

@ludlows (Contributor, Author) commented Apr 8, 2023

@szehon-ho thanks for the comments. I would be glad to work with you to get this feature into the master branch. I think you could add your commits to this PR to implement the pattern-matching approach, since I am not very familiar with Scala.

val tableAttrs = session.table(tableName).queryExecution.analyzed.output
val unresolvedExpression = session.sessionState.sqlParser.parseExpression(where)
val filter = Filter(unresolvedExpression, DummyRelation(tableAttrs))
val optimizedLogicalPlan = session.sessionState.executePlan(filter).optimizedPlan
optimizedLogicalPlan.collectFirst {
  case filter: Filter => convertToIcebergExpression(filter.condition)
  case dummyRelation: DummyRelation => Expressions.alwaysTrue()
Collaborator:
For cleaner code, can we return Spark's Expression.TRUE and Expression.FALSE here, and call convertToIcebergExpression outside?

Contributor Author:

Hi @szehon-ho, do we need an Expression.TRUE on the Spark side? In the end we only need an Iceberg expression. But it seems possible if we implement it in the following way:

    optimizedLogicalPlan.collectFirst {
      case filter: Filter => filter.condition
      case dummyRelation: DummyRelation => session.sessionState.sqlParser.parseExpression("true")
      case localRelation: LocalRelation => session.sessionState.sqlParser.parseExpression("false")
    }.getOrElse(throw new AnalysisException("Failed to find filter expression"))

What do you think?

Collaborator:
From Spark: Literal.TrueLiteral.

assertEquals(
    "Action should rewrite 0 data files and add 0 data files",
    row(0, 0),
    Arrays.copyOf(output.get(0), 2));
Collaborator:
Do we get output = (0, 0, 0)? Can we just assert all three values instead of only the first two in this case?

Contributor Author:

I think it is because the first two values are of type Integer and the last one is of type Long.

@szehon-ho (Collaborator) commented May 11, 2023

You mean the assert fails? How about row(0, 0, 0L)?

Contributor Author:

[screenshot of the failing test]
Yes, snapshotSummary().get(SnapshotSummary.REMOVED_FILE_SIZE_PROP) returns null here, which makes the test fail.
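The type issue discussed here can be reproduced standalone: the first two row values are Integers while the third is a Long, and Arrays.equals on Object[] compares elements with equals(), so an Integer never equals a Long. A minimal sketch (the three-column row layout is assumed from the procedure's output schema):

```java
import java.util.Arrays;

public class RowPrefixAssert {
  public static void main(String[] args) {
    // Assumed procedure output row: (rewrittenFiles: Integer, addedFiles: Integer, rewrittenBytes: Long)
    Object[] output = {0, 0, 0L};

    // The test compares only the first two columns via Arrays.copyOf.
    Object[] firstTwo = Arrays.copyOf(output, 2);
    System.out.println(Arrays.equals(firstTwo, new Object[] {0, 0}));   // true

    // Comparing all three values requires the exact boxed types: 0L, not 0.
    System.out.println(Arrays.equals(output, new Object[] {0, 0, 0}));  // false: Integer != Long
    System.out.println(Arrays.equals(output, new Object[] {0, 0, 0L})); // true
  }
}
```

And when the snapshot summary entry is missing entirely, the third column is null, so even row(0, 0, 0L) would not match, which is why the test initially asserted only the two-column prefix.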

    "Action should rewrite 0 data files and add 0 data files",
    row(0, 0),
    Arrays.copyOf(output.get(0), 2));
// verify rewritten bytes separately
Collaborator:

There seems to be no need for this comment, as we don't assert on bytes.

Contributor Author:

yes. let me fix it.

// create 10 files under non-partitioned table
insertData(10);
List<Object[]> expectedRecords = currentData();
// select only 0 files for compaction
Collaborator:

Minor: "select no files".

Contributor Author:

yes. let me fix it.

catalogName, tableIdent);
assertEquals(
"Action should rewrite 10 data files and add 1 data files",
row(10, 1),
Collaborator:

Optional: I think the test is more understandable if we just put row(10, 1, Long.valueOf(snapshotSummary().get(...))). I do realize it's done that way in other tests.

@szehon-ho (Collaborator)

@ludlows do you have a chance to try this: #6760 (comment)?

@ludlows (Contributor, Author) commented May 16, 2023

@szehon-ho oh, yes. The current version uses the Spark expressions for true and false; we no longer need to modify RewriteDataFilesProcedure.java.

@szehon-ho (Collaborator) commented May 16, 2023

Hi @ludlows, thanks for the change! Sorry, I just realized that Spark 3.4 is now the active branch. Could you duplicate the Spark 3.3 changes to the Spark 3.4 branch in this PR as well? (We usually make new changes on the active branch first, then backport as necessary.)

@@ -19,7 +19,6 @@

package org.apache.spark.sql.execution.datasources

import org.apache.iceberg.expressions.Expressions
Collaborator:

Is this unintentional?

Contributor Author:

It seems we do not need this import statement, so I removed it here.

@ludlows ludlows changed the title Spark 3.3: use a deterministic where condition to make rewrite_data_files… Spark 3.3, 3.4: use a deterministic where condition to make rewrite_data_files… May 19, 2023
@szehon-ho szehon-ho merged commit 08ae725 into apache:master May 20, 2023
31 checks passed
@szehon-ho (Collaborator)

Merged, thanks @ludlows

@ludlows ludlows deleted the rewritr_data_file-exit-where-false branch May 20, 2023 01:29