
[HUDI-7906] improve the parallelism deduce in rdd write #11470

Merged
merged 1 commit into apache:master from clear-parallelism on Jun 22, 2024

Conversation

KnightChess (Contributor)

Change Logs

As #11274 and #11463 describe, there are two problems:

  • If the RDD is an input RDD that has not been shuffled, its partition number can be far too large or far too small.
  • Users cannot easily control it:
    • in some cases users can set spark.default.parallelism to change it;
    • in other cases users cannot change it because the parallelism is hard-coded;
    • in Spark, the better approach is to let spark.default.parallelism or spark.sql.shuffle.partitions control it, keeping Hudi-specific settings as advanced overrides.

Impact

For operations that use the new deduction logic, such as dedup, users can control the parallelism with spark.sql.shuffle.partitions or spark.default.parallelism.
For special scenarios, the advanced parameters can still be used; see the sketch below.
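A minimal sketch of how a user would steer the deduced parallelism after this change (the class name is ours; the config keys are standard Spark ones and the values are illustrative):

```java
import org.apache.spark.SparkConf;

public class ParallelismConfigExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("hudi-write")
        // Either of these now bounds the parallelism Hudi deduces for RDD writes:
        .set("spark.sql.shuffle.partitions", "200")  // consulted first when set
        .set("spark.default.parallelism", "200");    // fallback for plain RDD jobs
  }
}
```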

Risk level (write none, low, medium, or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Jun 18, 2024

if (SQLConf.get().contains(SQLConf.SHUFFLE_PARTITIONS().key())) {
return SQLConf.get().defaultNumShufflePartitions();
} else if (rddData.context().conf().contains("spark.default.parallelism")) {
danny0405 (Contributor)

The Java context may never have this config option, right?

KnightChess (Contributor, Author)

Do you mean the JavaSparkContext? It does contain it; I have updated the UT conf.

danny0405 (Contributor)

Why would the Java client contain a Spark config option? That does not seem reasonable.

KnightChess (Contributor, Author)

@danny0405 the Java RDD wraps a Spark RDD, and this code runs in the Spark driver, so the context is available.

danny0405 (Contributor)

Oh, my mistake, it's an RDD.
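For reference, the deduction logic discussed in this exchange reads roughly as follows (a sketch reconstructed from the snippet quoted above; the wrapper class and the final fallback branch are our assumptions, not verbatim PR code):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.internal.SQLConf;

// Deduction order: SQL shuffle partitions, then spark.default.parallelism,
// then the RDD's own partition count (assumed fallback).
class ParallelismDeduction {
  private final JavaRDD<?> rddData;

  ParallelismDeduction(JavaRDD<?> rddData) {
    this.rddData = rddData;
  }

  int deduceNumPartitions() {
    if (SQLConf.get().contains(SQLConf.SHUFFLE_PARTITIONS().key())) {
      return SQLConf.get().defaultNumShufflePartitions();
    } else if (rddData.context().conf().contains("spark.default.parallelism")) {
      return rddData.context().defaultParallelism();
    }
    return rddData.getNumPartitions(); // assumed fallback
  }
}
```

This runs on the Spark driver, which is why the JavaSparkContext wrapped by the RDD still exposes the Spark config, as discussed above.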

@danny0405 danny0405 added resource-and-concurrency performance usability priority:critical production down; pipelines stalled; Need help asap. labels Jun 19, 2024
@@ -144,7 +144,7 @@ protected HoodiePairData<HoodieKey, HoodieRecordLocation> fetchRecordLocationsFo
       HoodieData<HoodieKey> hoodieKeys, HoodieEngineContext context, HoodieTable hoodieTable,
       int parallelism) {
     List<String> affectedPartitionPathList =
-        hoodieKeys.map(HoodieKey::getPartitionPath).distinct().collectAsList();
+        hoodieKeys.map(HoodieKey::getPartitionPath).distinct(hoodieKeys.deduceNumPartitions()).collectAsList();

Why can't we use the parallelism value received as input in the distinct(), i.e.
hoodieKeys.map(HoodieKey::getPartitionPath).distinct(parallelism).collectAsList();

KnightChess (Contributor, Author), Jun 20, 2024

In the case you hit, the parallelism would become too large if we used Spark's own default inference logic.
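To see why, note that in plain Spark, distinct() with no explicit partition count inherits the upstream partitioning, so a wide input RDD produces an equally wide shuffle. A minimal standalone sketch (the path and split counts are hypothetical):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctParallelismDemo {
  public static void main(String[] args) {
    JavaSparkContext jsc =
        new JavaSparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"));
    // Hypothetical input read with 4000 splits.
    JavaRDD<String> keys = jsc.textFile("hdfs://.../keys", 4000);
    long inherited = keys.distinct().count();  // shuffle width follows the input: 4000 partitions
    long pinned = keys.distinct(200).count();  // shuffle width pinned by the caller: 200 partitions
    jsc.stop();
  }
}
```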

KnightChess (Contributor, Author)

Looks like a flaky test OOM; other PRs have the same problem.

@KnightChess KnightChess reopened this Jun 21, 2024
yihua commented Jun 21, 2024

> Looks like a flaky test OOM; other PRs have the same problem.

I also noticed that GitHub CI frequently fails due to OOM now. I'm going to triage the offending commit on master.

KnightChess commented Jun 21, 2024

@yihua hi, I've done some troubleshooting, but I've hit some issues I'm not familiar with yet. I'll raise them in the hope that they are helpful to you.
Reproduce:

  • TestSparkDataSource.testCoreFlow, VM options: -Xmx1g -Xms128m
    Heap dump analysis:
  1. There are four tasks, and every task holds three HoodieLogFileReader instances; I found that HoodieLogFormatReverseReader holds the previous reader, and I don't know the reason.
  2. So I set ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN to true to avoid using it, but this param did not take effect. I updated the raw code to make it work, but then the UT failed at compareUpdateDfWithHudiDf(inputDf2, snapshotDf3, snapshotRows2, colsToSelect), and I don't know the reason.
  3. I suspect the problem is the file group reader; with hoodie.file.group.reader.enabled set to false, everything is OK.

Finally:
env VM options: -Xmx1g -Xms128m

ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN || hoodie.file.group.reader.enabled || UT result
true || false || success
true || true || ut_error
false || false || success
false || true || oom

Hope this helps you.

KnightChess (Contributor, Author)
check_oom.patch

yihua commented Jun 21, 2024

> (quoting KnightChess's troubleshooting report above)

Thanks for the details. These are definitely helpful. I'll check for any leaks in the new file group reader.

yihua commented Jun 22, 2024

Hi @KnightChess somehow I could not reproduce OOM locally in my IntelliJ, but I figured out that the OOM is likely due to Spark's file index holding Hudi's table metadata instance, which caches the HFile readers for reading the metadata table. The new HFile reader, HoodieNativeAvroHFileReader, holds a reference to a shared underlying HFile reader that occupies memory. I'm working on a fix. To unblock CI, I've put up a PR to disable the new HFile reader by default: #11488.
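The retention chain described here is a generic caching pitfall: a long-lived cache entry keeps a reader alive, and the reader keeps a large shared structure alive. A schematic sketch of the pattern (hypothetical classes, not Hudi's real API):

```java
import java.util.HashMap;
import java.util.Map;

// Schematic of the leak pattern: cache -> reader -> shared buffer stays reachable.
class SharedFileHandle {
  final byte[] blockCache = new byte[64 << 20]; // 64 MB retained per handle
}

class CachedReader {
  final SharedFileHandle shared;
  CachedReader(SharedFileHandle s) { this.shared = s; }
}

class MetadataReaderCache {
  private final Map<String, CachedReader> readerCache = new HashMap<>();

  CachedReader reader(String file) {
    // Each cached reader pins its shared handle for the cache's lifetime.
    return readerCache.computeIfAbsent(file, f -> new CachedReader(new SharedFileHandle()));
  }

  void close() {
    readerCache.clear(); // unless called, the whole chain stays reachable
  }
}
```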

KnightChess (Contributor, Author) left a comment

@yihua I tried setting hoodie.hfile.use.native.reader to false locally with -Xmx1g, and the UT succeeds as well. Thanks for the fix; retrying this PR's CI.

@KnightChess KnightChess reopened this Jun 22, 2024
KnightChess commented Jun 22, 2024

@yihua Following your line of thinking, I looked at the other objects. Indeed, this object also accounts for a lot of memory. At first I only paid attention to the largest one, but these also hold 32% of the used memory, and they all appear to have different memory addresses.
[heap-dump screenshots]

KnightChess (Contributor, Author)

@yihua it looks like there is still an OOM problem; the CI environment is hard to simulate.

KnightChess (Contributor, Author)

Oh, sorry, my mistake: the PR with your fix has not been merged yet. Ignore this.

yihua commented Jun 22, 2024

> Oh, sorry, my mistake: the PR with your fix has not been merged yet. Ignore this.

No worries, it was merged just 20 minutes ago.

yihua commented Jun 22, 2024

> (quoting KnightChess's comment above about the other objects holding memory)

Thanks for sending the heap dump. Yes, I also observed high memory usage of HoodieNativeAvroHFileReader on my side.

hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405 danny0405 merged commit 51c9c0e into apache:master Jun 22, 2024
46 checks passed
xuzifu666 (Contributor)

@KnightChess Sorry to add a comment for feedback: I gave it a try, but it does not seem to work as before. Is some extra config change needed?

KnightChess (Contributor, Author)

@xuzifu666 have you set spark.default.parallelism or spark.sql.shuffle.partitions?

xuzifu666 (Contributor)

> @xuzifu666 have you set spark.default.parallelism or spark.sql.shuffle.partitions?

Yes, spark.default.parallelism=2000 and spark.sql.shuffle.partitions=2000.

KnightChess (Contributor, Author)

@xuzifu666 can you share the full Spark UI info? In my test it works fine; otherwise, please share how to reproduce it.
[Spark UI screenshot]

bibhu107

Hi @KnightChess, which release will this fix be in?

KnightChess (Contributor, Author)

@bibhu107 0.16.0 and 1.0.0, but you can cherry-pick it or copy it into your version.

@KnightChess KnightChess deleted the clear-parallelism branch July 2, 2024 14:50