[SPARK-31949][SQL] Add spark.default.parallelism in SQLConf for isolated across session #28778
Conversation
@maropu @cloud-fan thanks for the review. If the config is needed, then I will move the old `defaultParallelism` usages over.
```scala
 * @since 3.1.0
 */
def defaultParallelism: Int = {
  sessionState.conf.defaultParallelism.getOrElse(sparkContext.defaultParallelism)
```
so we add a config, whose only usage is to let users get the config value?
As I said above, if we add this config, I will move the existing `defaultParallelism` usages in the sql module in a follow-up, e.g. `FilePartition.maxSplitBytes()`.
Just do this in this PR?
please do, otherwise it's a useless config
Yea, having a session-local default parallelism param on the SQL side looks fine to me. But, as @cloud-fan said above, you need more work to apply the param to the existing SQL logic.
@cloud-fan @maropu
Also, could you add some tests?
```
@@ -371,6 +371,14 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val DEFAULT_PARALLELISM = buildConf("spark.sql.default.parallelism")
```
`spark.sql.default.parallelism` -> `spark.sql.sessionLocalDefaultParallelism`?
Em.. is it better to keep it similar to `spark.default.parallelism`, so users can set this config easily? `sessionLocalDefaultParallelism` seems complex.
```
@@ -371,6 +371,14 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val DEFAULT_PARALLELISM = buildConf("spark.sql.default.parallelism")
    .doc("This config behavior is same as spark.default.parallelism, and this value can be " +
      "isolated across sessions. Note: always use sc.defaultParallelism as default number.")
```
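For context, a complete entry of this shape in `SQLConf` would look roughly like the sketch below. This is illustrative only, not the code merged in this PR; the doc text and the `checkValue` guard are assumptions.

```scala
// Hypothetical sketch of an optional, session-scoped int config in SQLConf.
// An optional entry (createOptional) lets callers distinguish "unset" from
// any concrete value and fall back to sparkContext.defaultParallelism.
val DEFAULT_PARALLELISM = buildConf("spark.sql.default.parallelism")
  .doc("The session-local default number of partitions. If not set, " +
    "physical plans fall back to spark.default.parallelism.")
  .version("3.1.0")
  .intConf
  .checkValue(_ > 0, "The default parallelism must be positive.")
  .createOptional
```

`createOptional` produces an `Option[Int]`, which is why the accessor in this PR reads `sessionState.conf.defaultParallelism.getOrElse(sparkContext.defaultParallelism)`.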
How about this?

The session-local default number of partitions; this value is widely used inside physical plans. If not set, the physical plans refer to `spark.default.parallelism` instead.
Test build #123746 has finished for PR 28778 at commit
Test build #123736 has finished for PR 28778 at commit
Test build #123752 has finished for PR 28778 at commit
Test build #123758 has finished for PR 28778 at commit
```scala
 *
 * @since 3.1.0
 */
def defaultParallelism: Int = {
```
I'd like to not have this API, as `SparkSession` should provide high-level logical APIs, not physical ones.
After more thought, I'm wondering what the real use case of it is. The default parallelism depends on the cluster resources, and it looks weird if different sessions can have different default parallelism. Looking at the changes in this PR, I think most of them don't really need a per-session config to tune. The only place that looks reasonable is where we split file partitions. Maybe we can just add a new config to do fine-grained control of the file partition splitting?
Actually, my first thought was the file partition split, and I tried to add another config to control it. I also don't see a reasonable case beyond file partition splitting now. If you think it's not needed, I'm OK.
The most confusing part is that default parallelism is more of a physical thing (related to cluster resources), and it's weird to have a per-session setting for it.
How about …
Any new thoughts? @maropu @cloud-fan. Also cc @HyukjinKwon @dongjoon-hyun @viirya
Parallelism is a physical concept already. Can you explain more about how you are going to tune the file partition split? What problems did you hit?
Yeah, parallelism is a physical concept, but it is also shared among sessions. I used a long-lived Spark application with enough cores and memory. Some sql queries on hive tables hit many small files, and as a result one sql may hold the total task resource. Then I tried to increase the file size in each partition to reduce the partition number, so that other sql could be assigned more tasks. But what I can do is reduce the … As said above, I think Spark needs to provide a way to control per-sql/session parallelism (in this case, file parallelism) so that users can reduce the parallelism when one sql queries small files.
After more thought, I think the file partition split logic itself is problematic. Its target is to make the number of partitions the same as the total number of cores, which doesn't make sense as the cluster may only have a few free cores. I think a proper way is to set an expected size for each partition, like 64MB. This is also what we do when coalescing shuffle partitions in AQE. Can we add such a config?
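The size-based planning suggested here boils down to simple arithmetic: pick the partition count from a target size per partition, not from the number of cluster cores. A minimal sketch (names hypothetical, not Spark's actual implementation):

```scala
// Sketch of size-based partition planning: number of partitions follows
// from total input size and a target bytes-per-partition, instead of
// from sc.defaultParallelism. Hypothetical helper, for illustration only.
def partitionsByTargetSize(totalBytes: Long, targetBytes: Long): Int = {
  require(targetBytes > 0, "target partition size must be positive")
  math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
}

// e.g. 1 GiB of input with a 64 MiB target yields 16 partitions,
// regardless of how many cores the cluster has.
partitionsByTargetSize(1024L << 20, 64L << 20)
```

This mirrors the spirit of AQE's shuffle-partition coalescing: the cluster's core count no longer dictates the split, so a query over small files naturally gets few partitions.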
Actually, AQE … In other words, in file split we already have some configs … BTW, the file partition split algorithm is similar between … I still think it's needed to control parallelism per session. At least, we should add a config to control file parallelism.
So it seems we just need to add a min-partition-num config for the file source?

Yes, that's a way.
@cloud-fan
### What changes were proposed in this pull request?

Add a new config `spark.sql.files.minPartitionNum` to control file split partitions in the local session.

### Why are the changes needed?

Aims to control file split partitions at the session level. More details: see the discussion in [PR-28778](#28778).

### Does this PR introduce _any_ user-facing change?

Yes, new config.

### How was this patch tested?

Added UT.

Closes #28853 from ulysses-you/SPARK-32019.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
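With the follow-up config merged, per-session control of file split parallelism could be exercised roughly like this. This is a sketch assuming Spark 3.1+; the app name and input path are placeholders, and `spark.sql.files.minPartitionNum` is a suggested minimum, not a hard guarantee.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("min-partition-num-demo") // hypothetical app name
  .getOrCreate()

// Session-local setting: this session asks for at least 8 file-split
// partitions; other sessions sharing the SparkContext are unaffected.
spark.conf.set("spark.sql.files.minPartitionNum", "8")

// File scans in this session now split with the per-session minimum
// in effect (path is a placeholder).
val df = spark.read.parquet("/path/to/small/files")
df.rdd.getNumPartitions
```

Because the config lives in `SQLConf` rather than on the `SparkContext`, each concurrent session can tune its own file-scan parallelism, which is exactly the isolation this PR was after.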
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?

Add a new config `spark.sql.default.parallelism`; the behavior is the same as `spark.default.parallelism`.

Why are the changes needed?

For session isolation. In concurrent scenarios, we need to determine parallelism session by session. One case: with multiple sql queries running in one SparkContext, we should split files into partitions more carefully, to avoid one sql using the total parallelism.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UT.