[HUDI-2214] fix the bug that residual temporary files after clustering are not cleaned up #3335

Merged 1 commit on Jul 26, 2021

xiarixiaoyao
Contributor

Tips

What is the purpose of the pull request

Temporary files left under .hoodie/.temp after clustering are not cleaned up.

Steps to reproduce:

Step 1: run clustering (a bulk insert with inline clustering enabled)

import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.{Dataset, Row, SaveMode}

// spark, dataGen, commonOpts, basePath and recordsToStrings are assumed to come from the surrounding test fixture
val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
val inputDF1: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records1, 2))
inputDF1.write.format("org.apache.hudi")
  .options(commonOpts)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  // options for clustering
  .option("hoodie.parquet.small.file.limit", "0")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "1")
  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString)
  // this key was set twice in the original snippet (first to "1073741824"); the later value, 12 MB, takes effect
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", String.valueOf(12 * 1024 * 1024L))
  .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon")
  .mode(SaveMode.Overwrite)
  .save(basePath)
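
The clustering options used in step 1 could also be collected in one place, which makes the byte sizes easier to read. The following is only a sketch: the object name ClusteringOpts and the MB constant are hypothetical and not part of Hudi, while the keys and values are the ones from the snippet above.

```scala
// Hypothetical holder for the clustering options from step 1 (not a Hudi API).
object ClusteringOpts {
  private val MB: Long = 1024L * 1024L

  val options: Map[String, String] = Map(
    "hoodie.parquet.small.file.limit" -> "0",
    "hoodie.clustering.inline" -> "true",
    "hoodie.clustering.inline.max.commits" -> "1",
    // 12 MB, the value that takes effect in the original snippet
    "hoodie.clustering.plan.strategy.target.file.max.bytes" -> (12 * MB).toString,
    // 600 MB, written as 629145600 in the original snippet
    "hoodie.clustering.plan.strategy.small.file.limit" -> (600 * MB).toString,
    "hoodie.clustering.plan.strategy.max.bytes.per.group" -> Long.MaxValue.toString,
    "hoodie.clustering.plan.strategy.sort.columns" -> "begin_lat, begin_lon"
  )
}
```

With such a map, the chain of .option(...) calls for clustering could be replaced by a single .options(ClusteringOpts.options) on the DataFrame writer.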

Step 2: check the temp directory. /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty; the following directory is left behind and has not been cleaned up:

/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208
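
The check in step 2 can be sketched in code. This is a minimal illustration, assuming the table lives on the local file system (as in the JUnit path above); the object TempDirCheck and the method residualTempEntries are hypothetical names, not part of Hudi, and a table on HDFS or object storage would need the Hadoop FileSystem API instead.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper (not a Hudi API): lists what is left under <basePath>/.hoodie/.temp.
object TempDirCheck {
  // Returns the sorted names of the entries directly under the temp directory,
  // or an empty list if the directory does not exist.
  def residualTempEntries(basePath: String): Seq[String] = {
    val tempDir: Path = Paths.get(basePath, ".hoodie", ".temp")
    if (!Files.isDirectory(tempDir)) {
      Seq.empty
    } else {
      val stream = Files.list(tempDir)
      try stream.iterator().asScala.map(_.getFileName.toString).toList.sorted
      finally stream.close()
    }
  }
}
```

A test could then assert that TempDirCheck.residualTempEntries(basePath) is empty once the write in step 1 has finished; before this fix, the list would contain the leftover instant directory.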

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

Unit test added.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot

hudi-bot commented Jul 23, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@xiarixiaoyao
Contributor Author

@garyli1019 could you help review this PR? Thanks.

@garyli1019
Member

> @garyli1019 could you help review this PR? Thanks.

@xiarixiaoyao Thanks for your contribution. I am not very familiar with the clustering code, so we might need help from @satishkotha.

@xiarixiaoyao
Contributor Author

@garyli1019 thanks. @satishkotha could you please help review this PR?

@satishkotha satishkotha merged commit 5353243 into apache:master Jul 26, 2021
liujinhui1994 pushed a commit to liujinhui1994/hudi that referenced this pull request Aug 12, 2021
4 participants