[HUDI-5932] Make the combine step in Call run_bootstrap Procedure optional #8179

Closed
huangxiaopingRD wants to merge 4 commits into apache:master from huangxiaopingRD:HUDI-5932

@huangxiaopingRD
Contributor

Change Logs

In the existing implementation, if the preCombine field is not specified, the procedure falls back to the default preCombine field (ts). In a full-record bootstrap the source data may not contain a "ts" field, so the field is not found and generating input records fails with the exception below. This change therefore makes the preCombine field optional when executing bootstrap.

Caused by: org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[timestamp, _row_key, partition_path, rider, driver, begin_lat, begin_lon, end_lat, end_lon, fare, tip_history, _hoodie_is_deleted, datestr]
	at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldVal(HoodieAvroUtils.java:557)
	at org.apache.hudi.avro.HoodieAvroUtils.getNestedFieldValAsString(HoodieAvroUtils.java:535)
	at org.apache.hudi.bootstrap.SparkFullBootstrapDataProviderBase.lambda$generateInputRecords$cbf13809$1(SparkFullBootstrapDataProviderBase.java:87)
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
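
The failure mode above can be illustrated with a minimal, self-contained sketch. This is not Hudi code: the class `PreCombineSketch`, its methods, and the use of a plain `Map` as the record are all hypothetical, and returning a constant ordering value of 0 when no preCombine field is given is an assumption about one possible way to skip the combine step, not necessarily what this PR does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from Hudi.
class PreCombineSketch {

    // Stand-in for a record field lookup that fails on an unknown field.
    static Object getFieldValue(Map<String, Object> record, String field) {
        if (!record.containsKey(field)) {
            throw new IllegalArgumentException(
                field + " field not found in record. Acceptable fields were :" + record.keySet());
        }
        return record.get(field);
    }

    // Simplified "before": an unspecified preCombine field silently becomes "ts".
    static Object orderingValueWithDefault(Map<String, Object> record, String preCombineField) {
        String field = (preCombineField == null) ? "ts" : preCombineField;
        return getFieldValue(record, field);
    }

    // Simplified "after" (assumption): with no preCombine field, skip the lookup
    // and use a constant ordering value instead.
    static Object orderingValueOptional(Map<String, Object> record, Optional<String> preCombineField) {
        return preCombineField
            .map(f -> getFieldValue(record, f))
            .orElse(Integer.valueOf(0));
    }

    public static void main(String[] args) {
        // A record without a "ts" column, like the one in the stack trace.
        Map<String, Object> record = new HashMap<>();
        record.put("timestamp", 1L);
        record.put("_row_key", "key-1");

        try {
            orderingValueWithDefault(record, null);
        } catch (IllegalArgumentException e) {
            System.out.println("default lookup failed: " + e.getMessage());
        }
        System.out.println("optional lookup: " + orderingValueOptional(record, Optional.empty()));
    }
}
```

In this sketch, a caller that does want combining can still pass `Optional.of("timestamp")`, in which case the named field is looked up as before.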

Impact

Users no longer need to specify a preCombine field when executing bootstrap.

Risk level (write none, low, medium or high below)

None

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@huangxiaopingRD huangxiaopingRD force-pushed the HUDI-5932 branch 2 times, most recently from 85352a3 to 088416b Compare March 15, 2023 02:13
@codope codope added priority:high Significant impact; potential bugs area:sql SQL interfaces labels Mar 20, 2023
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2024
Labels

area:sql SQL interfaces priority:high Significant impact; potential bugs size:S PR with lines of changes in (10, 100]

3 participants