[HUDI-7906] improve the parallelism deduce in rdd write #11470
Conversation
if (SQLConf.get().contains(SQLConf.SHUFFLE_PARTITIONS().key())) {
  return SQLConf.get().defaultNumShufflePartitions();
} else if (rddData.context().conf().contains("spark.default.parallelism")) {
The Java context may never have this config option, right?
Do you mean `JavaSparkContext`? It does contain it; I have updated the UT conf.
Why would the Java client contain a Spark config option? That doesn't seem reasonable.
@danny0405 the Java RDD wraps a Spark RDD, and this code runs in the Spark driver, so the context is available there.
Oh, my mistake, it's an RDD.
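For illustration, the fallback chain in the snippet under review can be sketched without any Spark dependency. This is a hypothetical stand-in, not the actual Hudi code: the configs are modeled as a plain map, and `rddPartitions` stands in for the RDD's own partition count as the final fallback.

```java
import java.util.HashMap;
import java.util.Map;

public class ParallelismDeduce {
    // Mirrors the order checked in the diff above:
    // spark.sql.shuffle.partitions first, then spark.default.parallelism,
    // finally the data's own partition count.
    public static int deduce(Map<String, Integer> conf, int rddPartitions) {
        if (conf.containsKey("spark.sql.shuffle.partitions")) {
            return conf.get("spark.sql.shuffle.partitions");
        } else if (conf.containsKey("spark.default.parallelism")) {
            return conf.get("spark.default.parallelism");
        }
        return rddPartitions;
    }

    public static void main(String[] args) {
        Map<String, Integer> conf = new HashMap<>();
        conf.put("spark.default.parallelism", 200);
        System.out.println(deduce(conf, 8)); // prints 200
    }
}
```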
@@ -144,7 +144,7 @@ protected HoodiePairData<HoodieKey, HoodieRecordLocation> fetchRecordLocationsFo
       HoodieData<HoodieKey> hoodieKeys, HoodieEngineContext context, HoodieTable hoodieTable,
       int parallelism) {
     List<String> affectedPartitionPathList =
-        hoodieKeys.map(HoodieKey::getPartitionPath).distinct().collectAsList();
+        hoodieKeys.map(HoodieKey::getPartitionPath).distinct(hoodieKeys.deduceNumPartitions()).collectAsList();
Why can't we use the `parallelism` value received as input in the `distinct()` call? i.e.
`hoodieKeys.map(HoodieKey::getPartitionPath).distinct(parallelism).collectAsList();`
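As a toy model (not the Hudi or Spark API), the shape of `distinct(parallelism)` is deduplication whose shuffle is sized by an explicit partition count. The bucket-hashing below only imitates how a Spark shuffle would spread distinct keys across `parallelism` reducers:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctWithParallelism {
    // Hypothetical stand-in for map(...).distinct(parallelism):
    // dedupe values while hashing them into `parallelism` buckets,
    // the way a Spark shuffle would size its reduce side.
    public static List<Set<String>> distinctIntoPartitions(List<String> paths, int parallelism) {
        List<Set<String>> buckets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            buckets.add(new HashSet<>());
        }
        for (String p : paths) {
            // Sets dedupe within each bucket; same value always hashes
            // to the same bucket, so dedupe is global.
            buckets.get(Math.floorMod(p.hashCode(), parallelism)).add(p);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("2024/01", "2024/02", "2024/01");
        int total = distinctIntoPartitions(paths, 4).stream().mapToInt(Set::size).sum();
        System.out.println(total); // prints 2 (two distinct partition paths)
    }
}
```

Whether the count comes from the `parallelism` argument or from a deduced value only changes the bucket count, not the result set, which is why the reviewer suggests reusing the input value.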
Force-pushed from 7c7fba6 to b5cbb13.
Looks like a flaky-test OOM; other PRs hit the same issue.
I also noticed that GitHub CI frequently fails due to OOM now. I'm going to triage the offending commit on master.
@yihua hi, I've done some troubleshooting, but I've run into some issues I'm not very familiar with yet. I'll raise these questions in the hope that they may be helpful to you.
Finally, a table of `ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN` / `hoodie.file.group.reader.enabled` combinations vs. UT results (table body not recovered).
Thanks for the details. These are definitely helpful. I'll check for any leaks in the new file group reader.
Hi @KnightChess, somehow I could not reproduce the OOM locally in my IntelliJ, but I figured out that the OOM is likely due to Spark's file index holding Hudi's table metadata instance, which caches the HFile readers for reading the metadata table. The new HFile reader,
@yihua I tried setting `_hoodie.hfile.use.native.reader` to `false` locally with `-Xmx1g`, and the UT succeeds as well. Thanks for resolving; retrying this PR's CI.
@yihua Following your line of thinking, I looked at other objects. Indeed, this object also accounts for a lot of memory. At first I only paid attention to the largest one; these also hold about 32% of used memory, and they all appear to have different memory addresses.
@yihua Looks like there is still an OOM issue; the CI environment is hard to simulate locally.
Oh, sorry, my mistake: the PR with your fix was not merged yet. Ignore it.
No worries, it was just merged 20 minutes ago.
Thanks for sending the heap dump. Yes, I also observed high memory usage of
Force-pushed from b5cbb13 to 84644bf.
@KnightChess Sorry to add a comment for feedback. I gave it a try, but it doesn't seem to work as before. Is some extra config change needed?
@xuzifu666 have you set the config?
Yes, `spark.default.parallelism=2000` and `spark.sql.shuffle.partitions=2000`.
@xuzifu666 can you share the full Spark UI info? In my test it works fine. Or how can I reproduce it?
Hi @KnightChess, which release will this fix be in?
@bibhu107 0.16.0 and 1.0.0, but you can cherry-pick or copy it into your version.
Change Logs

As #11274 and #11463 describe, there are two problem cases:
- in one, users should be able to change the write parallelism via `spark.default.parallelism`;
- in the other, either `spark.default.parallelism` or `spark.sql.shuffle.partitions` can control it, while Hudi's advanced config options take precedence.

Impact

Places like dedup now use the new deduce logic, so users can control the parallelism with `spark.sql.shuffle.partitions` or `spark.default.parallelism`. For special scenarios, the advanced params can also be used.
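The precedence described in the Impact note can be sketched as follows. This is a hypothetical model, not the PR's actual code: `hudiConfigured` stands in for an explicitly set Hudi "advanced" parallelism option, and the Spark options are modeled as a plain map.

```java
import java.util.Map;

public class ParallelismPrecedence {
    // Hypothetical model of the precedence above: an explicitly configured
    // Hudi parallelism ("advanced param") wins; otherwise fall back to the
    // Spark options, then to the data's own partition count.
    public static int effectiveParallelism(Integer hudiConfigured,
                                           Map<String, Integer> sparkConf,
                                           int dataPartitions) {
        if (hudiConfigured != null && hudiConfigured > 0) {
            return hudiConfigured;
        }
        if (sparkConf.containsKey("spark.sql.shuffle.partitions")) {
            return sparkConf.get("spark.sql.shuffle.partitions");
        }
        if (sparkConf.containsKey("spark.default.parallelism")) {
            return sparkConf.get("spark.default.parallelism");
        }
        return dataPartitions;
    }

    public static void main(String[] args) {
        // An advanced param of 64 overrides a Spark-level setting of 2000.
        int p = effectiveParallelism(64, Map.of("spark.sql.shuffle.partitions", 2000), 8);
        System.out.println(p); // prints 64
    }
}
```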
Risk level (write none, low medium or high below)
low
Documentation Update
None
Contributor's checklist