HIVE-28276: Iceberg:Make Iceberg split threads configurable when table scanning #5260

zhangbutao · 2024-05-23T03:01:42Z

What changes were proposed in this pull request?

I have noticed that if an Iceberg table has lots of metadata/data files, Iceberg will use many system cores to do scan planning, which may put pressure on Tez AM memory/CPU.
We can make the thread pool size configurable to avoid concurrency pressure on the Tez AM.

Other OSS projects have done a similar optimization, like Flink:
https://github.com/apache/iceberg/blob/cbb853073e681b4075d7c8707610dceecbee3a82/flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/FlinkInputFormat.java#L92-L100

Why are the changes needed?

This makes the Iceberg planning thread pool size configurable, to reduce Tez AM memory/CPU pressure as well as pressure on the operating system.
If we don't limit the thread pool size, it will use a default value equal to the number of system cores when doing scan.planTasks():
https://github.com/apache/iceberg/blob/9a5d24fee239352021a9a73f6a4cad8ecf464f01/core/src/main/java/org/apache/iceberg/SystemConfigs.java#L38-L43

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Existing test.

ayushtkn · 2024-05-23T06:51:47Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java

+    final ExecutorService workerPool =
+            ThreadPools.newWorkerPool("iceberg-plan-worker-pool",
+                    conf.getInt(InputFormatConfig.TABLE_PLAN_WORKER_POOL_SIZE, ThreadPools.WORKER_THREAD_POOL_SIZE));


Does this work

final ExecutorService workerPool = ThreadPools.newWorkerPool("iceberg-plan-worker-pool", SystemConfigs.WORKER_THREAD_POOL_SIZE.value());

This cannot set the thread pool size per session connection.
What I want is to be able to set the scan thread pool size in the session, e.g. set iceberg.worker.num-threads=8 in the beeline console.

And if users don't set the scan thread pool size, it will follow the current default logic and initialize a pool whose size equals the number of system cores.
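
To illustrate the behavior described above, here is a minimal sketch of a per-session override with a core-count fallback. It is not the code from this PR: the class PlanPoolSizeSketch, the method resolvePoolSize, and the Map used in place of the Hadoop Configuration are hypothetical, and Runtime.availableProcessors() is only a stand-in for Iceberg's default defined in SystemConfigs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the session configuration.
final class PlanPoolSizeSketch {
  // Property name quoted in the comment above.
  static final String KEY = "iceberg.worker.num-threads";

  // Returns the session value if present, otherwise a core-count default.
  static int resolvePoolSize(Map<String, String> sessionConf) {
    // Assumption: the core count approximates Iceberg's own default.
    int defaultSize = Runtime.getRuntime().availableProcessors();
    String raw = sessionConf.get(KEY);
    if (raw == null) {
      return defaultSize;
    }
    return Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println("unset: " + resolvePoolSize(conf));
    conf.put(KEY, "8"); // comparable to setting the property in a session
    System.out.println("session override: " + resolvePoolSize(conf));
  }
}
```

With the key unset the sketch falls back to the machine's core count, and with the key set to 8 it returns 8, which is the split in behavior this comment asks for.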

ayushtkn · 2024-05-23T06:52:30Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java

    if (fromVersion != -1) {
-      scan = applyConfig(conf, createIncrementalAppendScan(table, conf));
+      scan = applyConfig(conf, createIncrementalAppendScan(table, conf)).planWith(workerPool);


Do we need to shut down this workerPool in the finally block?

I don't think so. I think the Iceberg side takes care of it.

Just double-check that Iceberg takes care of that and that there is no leak here before committing.

Makes sense. Let me double-check.

I checked the Iceberg code and didn't find any code that shuts down the worker pool. So I think you are right: shutting down this workerPool in the finally block is the better way, as the Flink Iceberg side does:
https://github.com/apache/iceberg/blob/cbb853073e681b4075d7c8707610dceecbee3a82/flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/FlinkInputFormat.java#L92-L101
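
The pattern agreed on here, a dedicated pool that is always shut down in a finally block, can be sketched with plain java.util.concurrent. This is an illustration under assumptions, not the committed change: PlanWithPoolSketch, planSplits, and lastPool are hypothetical names, Executors.newFixedThreadPool stands in for ThreadPools.newWorkerPool, and the submitted tasks stand in for scan planning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a per-call worker pool released in a finally block.
final class PlanWithPoolSketch {
  // Kept only so the demo can observe that the pool was shut down.
  static ExecutorService lastPool;

  static List<Integer> planSplits(int poolSize, int fileCount) {
    ExecutorService workerPool = Executors.newFixedThreadPool(poolSize);
    lastPool = workerPool;
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int i = 0; i < fileCount; i++) {
        final int fileIndex = i;
        // Placeholder work standing in for planning one file.
        futures.add(workerPool.submit(() -> fileIndex * 2));
      }
      List<Integer> results = new ArrayList<>();
      for (Future<Integer> f : futures) {
        results.add(f.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("planning interrupted", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("planning task failed", e);
    } finally {
      // Runs on success and on failure, so no threads are leaked.
      workerPool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(planSplits(2, 3));
    System.out.println("pool shut down: " + lastPool.isShutdown());
  }
}
```

Because the shutdown sits in finally, the pool's threads are released even when a planning task throws, which is the leak the review comment is concerned about.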

Changed in commit ef6d684

Thanks.

thnx, looks good

ayushtkn

Maybe if the configured value is 1, we don't need to initialize this executor service and can go with the normal flow.

zhangbutao · 2024-05-23T08:49:24Z

@ayushtkn I think I got mixed up. :( The scan is on the Tez AM side, not the HS2 side. I will double-check this PR later. Marking it WIP...

zhangbutao · 2024-05-23T10:08:27Z

maybe if the value configured is 1, we might not need to initialize this executor service and go with normal flow

The Iceberg API side will always initialize the executor service, so there is nothing else we can do here.

@ayushtkn I think I got mixed up. :( The scan is in Tez AM side not HS2 side. I will double check this PR later. Make it WIP...

This PR is aimed at the Tez AM, as the scan logic is on the Tez AM side. It lets Hive users configure the Tez AM scan thread pool size.
I think we also have some Iceberg table scans on the HS2 side, especially for the Iceberg maintenance feature. We can do a similar optimization there in the future.

ayushtkn

LGTM

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java

deniskuzZ

no need for casting

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java


deniskuzZ

+1, pending tests

sonarcloud · 2024-06-03T09:14:04Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication


asf-ci-hive added tests pending tests passed and removed tests pending labels May 23, 2024

ayushtkn reviewed May 23, 2024


zhangbutao marked this pull request as draft May 23, 2024 08:49

zhangbutao marked this pull request as ready for review May 23, 2024 10:03

zhangbutao requested a review from ayushtkn May 23, 2024 10:12

ayushtkn approved these changes May 23, 2024


asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels May 23, 2024

deniskuzZ reviewed May 26, 2024


iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java (outdated)

asf-ci-hive added tests pending and removed tests passed labels May 28, 2024

zhangbutao force-pushed the HIVE-28276 branch from 61e804f to 780c6f2 Compare May 28, 2024 01:39

asf-ci-hive added tests failed tests pending tests passed and removed tests pending tests failed labels May 28, 2024

deniskuzZ reviewed May 28, 2024


iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java (outdated)

asf-ci-hive added tests pending and removed tests passed labels May 29, 2024

zhangbutao force-pushed the HIVE-28276 branch from ef79413 to a49f44a Compare May 29, 2024 04:21

asf-ci-hive removed the tests pending label May 29, 2024

asf-ci-hive added tests failed tests pending tests passed and removed tests failed tests pending labels May 29, 2024

zhangbutao requested a review from deniskuzZ May 30, 2024 03:23

deniskuzZ reviewed Jun 2, 2024


iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java (outdated)

zhangbutao added 5 commits June 3, 2024 09:35

HIVE-28276: Iceberg:Make Iceberg split threads configurable when tabl…

35612a6

…e scanning

shutdown this workerPool in the finally block

2f6445b

use iceberg's conf name

2e50ef9

Refine

e278fe0

refine class cast

84c633a

zhangbutao force-pushed the HIVE-28276 branch from a49f44a to 84c633a Compare June 3, 2024 01:35

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending tests unstable labels Jun 3, 2024

deniskuzZ approved these changes Jun 3, 2024


asf-ci-hive added tests passed and removed tests pending labels Jun 3, 2024

deniskuzZ merged commit 45867be into apache:master Jun 3, 2024
6 checks passed
