
[HUDI-2658] When auto clean is disabled, do not check whether MIN_COMMITS_TO_KEEP is greater than CLEANER_COMMITS_RETAINED #3897

Closed

Conversation

zhangyue19921010
Contributor

What is the purpose of the pull request

When auto clean is disabled, do not check whether MIN_COMMITS_TO_KEEP is greater than CLEANER_COMMITS_RETAINED.

On the current master branch, the exception below is thrown even though auto clean is disabled:

java.lang.IllegalArgumentException: Increase hoodie.keep.min.commits=3 to be greater than hoodie.cleaner.commits.retained=10. Otherwise, there is risk of incremental pull missing data from few instants.
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
	at org.apache.hudi.config.HoodieCompactionConfig$Builder.build(HoodieCompactionConfig.java:355)
	at org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:1396)
	at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:1436)
	at org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:188)
	at org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:193)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$3.apply(HoodieSparkSqlWriter.scala:166)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$3.apply(HoodieSparkSqlWriter.scala:166)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:166)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at tv.freewheel.reporting.ssql.sinkers.HudiSinker.sink(HudiSinker.scala:20)
	at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$execSink$1$$anonfun$apply$1.apply$mcV$sp(RuleScheduler.scala:73)
	at tv.freewheel.reporting.realtime.utils.Misc$.failFast(Misc.scala:72)
	at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$execSink$1.apply(RuleScheduler.scala:73)
	at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$execSink$1.apply(RuleScheduler.scala:71)
	at scala.Option.foreach(Option.scala:257)
	at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler.execSink(RuleScheduler.scala:71)
	at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$submitRecursively$3$$anonfun$1.apply$mcV$sp(RuleScheduler.scala:35)
	at tv.freewheel.reporting.realtime.utils.Misc$$anon$2.run(Misc.scala:31)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
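
For reference, a minimal repro sketch assembled from the stack trace above (not taken from the PR; the table name, path, and field names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiMinCommitsRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("hudi-min-commits-repro")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    // A one-row batch; the schema is irrelevant because the failure happens
    // while building HoodieWriteConfig, before any data is written.
    Dataset<Row> df = spark.sql("SELECT '1' AS uuid, 'a' AS name, 0L AS ts");

    // Throws IllegalArgumentException from HoodieCompactionConfig.Builder.build()
    // even though hoodie.clean.automatic is false.
    df.write().format("hudi")
        .option("hoodie.table.name", "repro_table")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.clean.automatic", "false") // auto clean disabled
        .option("hoodie.keep.min.commits", "3")
        .option("hoodie.cleaner.commits.retained", "10")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/repro_table");
  }
}
```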

Brief change log

  • Skip the check that MIN_COMMITS_TO_KEEP is greater than CLEANER_COMMITS_RETAINED in HoodieCompactionConfig.Builder.build() when auto clean is disabled (see the sketch below).
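
A minimal sketch of the kind of guard this PR proposes, assuming the relevant values are already at hand (the real check lives in HoodieCompactionConfig.Builder.build(), per the stack trace above; the helper class and parameter names here are hypothetical):

```java
import org.apache.hudi.common.util.ValidationUtils;

final class CompactionConfigGuardSketch {

  // Hypothetical helper, not the PR's actual diff: enforce
  // hoodie.keep.min.commits > hoodie.cleaner.commits.retained only when
  // auto clean (hoodie.clean.automatic) is enabled.
  static void validateArchivalBounds(boolean autoClean, int minCommitsToKeep, int commitsRetained) {
    if (!autoClean) {
      // The PR's premise: with auto clean disabled, the cleaner never runs,
      // so the archival lower bound need not leave room for retained commits.
      return;
    }
    ValidationUtils.checkArgument(minCommitsToKeep > commitsRetained,
        String.format("Increase hoodie.keep.min.commits=%d to be greater than "
                + "hoodie.cleaner.commits.retained=%d. Otherwise, there is risk of "
                + "incremental pull missing data from few instants.",
            minCommitsToKeep, commitsRetained));
  }
}
```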

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For a large change, please consider breaking it into sub-tasks under an umbrella JIRA.

@zhangyue19921010
Contributor Author

@hudi-bot run azure

@zhangyue19921010
Contributor Author

Hi @xushiyan, friendly ping. Could you please take a look at this PR at your convenience? Thanks a lot!

@hudi-bot

hudi-bot commented Dec 1, 2021

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-runs the last Azure build

@zhangyue19921010
Contributor Author

Just fixed the conflict with master. @xushiyan PTAL :)

@vinothchandar added this to Under Discussion PRs in PR Tracker Board Dec 10, 2021
Member

@xushiyan left a comment


@zhangyue19921010 thanks for the patch. Can you explain the downside of keeping the logic as is? In other words: even if auto clean is disabled, why wouldn't you increase min instants to keep to be greater than commits retained?

@zhangyue19921010
Contributor Author

@zhangyue19921010 thanks for the patch. Can you explain the downside of keeping the logic as is? In other words: even if auto clean is disabled, why wouldn't you increase min instants to keep to be greater than commits retained?

Hi @xushiyan, thanks a lot for your attention. This is actually a minor patch that just makes Hudi's behavior a bit more appropriate.
If I had to point out a downside: it can trouble users who are running archival, or writing UTs for it, and must take care of the min-instants/commits-retained relationship even though they disabled auto clean :) Just a nit of a patch, I have to say.

@xushiyan
Member

@zhangyue19921010 I think making this validation conditional could compromise the integrity of min-instants, since a user can toggle auto clean at any time. What if the same table has a writer and a compactor with different auto clean settings? The writer could disable auto clean, trigger archival, and end up with fewer commits; then the compactor runs and sees fewer actual instants than min-instants. I think keeping this logic consistent is important.
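
To make the concern concrete, here is a hypothetical pair of configurations for two processes sharing one table (the option keys are real Hudi configs; the values and scenario are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

class MixedAutoCleanScenario {
  public static void main(String[] args) {
    // Writer: auto clean off, so under this PR its archival bound goes unchecked.
    Map<String, String> writerOpts = new HashMap<>();
    writerOpts.put("hoodie.clean.automatic", "false");
    writerOpts.put("hoodie.keep.min.commits", "3"); // archival may keep as few as 3 commits

    // Compactor: auto clean on, cleaner expects 10 commits to be retained.
    Map<String, String> compactorOpts = new HashMap<>();
    compactorOpts.put("hoodie.clean.automatic", "true");
    compactorOpts.put("hoodie.cleaner.commits.retained", "10");

    // The writer's archival can trim the timeline below the 10 commits the
    // compactor's cleaner and any incremental readers expect to find.
  }
}
```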

@zhangyue19921010
Contributor Author

Hi @xushiyan, thanks a lot for the explanation. I will close this PR and keep the behavior as before :)

PR Tracker Board automation moved this from Under Discussion PRs to Done Dec 28, 2021