[SPARK-38187][K8S][TESTS] Add K8S IT for volcano minResources cpu/memory spec #35640

Closed
wants to merge 1 commit into from

Member

@Yikun Yikun commented Feb 24, 2022

What changes were proposed in this pull request?

This PR adds two tests to make sure resource reservation is supported.

  • Run SparkPi Jobs with minCPU
  • Run SparkPi Jobs with minMemory

Why are the changes needed?

Test resource reservation (min resources) with the Volcano implementation.

Does this PR introduce any user-facing change?

No, K8S IT only

How was this patch tested?

  • integration test
[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 738 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (13 seconds, 294 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (25 seconds, 659 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (19 seconds, 310 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 467 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (29 seconds, 546 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (30 seconds, 473 milliseconds)
[info] Run completed in 2 minutes, 30 seconds.
[info] Total number of tests run: 7
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 236 s (03:56), completed 2022-3-10 9:17:46

@@ -1356,6 +1356,26 @@ See the [configuration page](configuration.html) for information on Spark config
</td>
<td>3.3.0</td>
</tr>
<tr>
<td><code>spark.kubernetes.job.minCPU</code></td>
Member

I'm just wondering if we can reuse some of the Spark configuration.

Member Author

@Yikun Yikun Feb 25, 2022

Yep, this is a good question. Here are some related production use cases of Spark with Volcano:

  • Case 1 (this PR): minCPU = the user-specified minCPU. This is the basic use case, for users who know their own cluster resources well. In particular, when users don't want to set minRes strictly to the amount of resources the Spark job actually needs, it also helps improve cluster utilization to some degree when cluster resources are limited.
  • Case 2: minCPU = driver.request + executor.number * executor.request, for users who don't care much about job resource usage.
  • Case 3: minCPU = (driver.request + executor.number * executor.request) * factor, for users who want to guarantee the job's resources to some degree but also want to improve cluster utilization.

So, we might add the basic minRes configuration first.
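
The three sizing strategies above can be sketched as follows. This is only an illustration under stated assumptions: `MinCpuSizing` and its method names are hypothetical and are not part of this PR or of Spark, CPU amounts are modeled as plain `Double` core counts, and the range check on `factor` is an assumption that the discussion does not state.

```scala
// Hypothetical helper, not part of Spark: illustrates the three minCPU strategies.
object MinCpuSizing {
  // Case 1: use the value the user configured; None means no lower limit.
  def specified(userMinCpu: Option[Double]): Option[Double] = userMinCpu

  // Case 2: reserve everything the driver and all executors request.
  def fullRequest(driverRequest: Double, executorNumber: Int, executorRequest: Double): Double =
    driverRequest + executorNumber * executorRequest

  // Case 3: reserve only a fraction of the full request.
  // Assumption: factor lies in (0, 1]; the discussion does not give a range.
  def scaledRequest(
      driverRequest: Double,
      executorNumber: Int,
      executorRequest: Double,
      factor: Double): Double = {
    require(factor > 0.0 && factor <= 1.0, "factor is assumed to be in (0, 1]")
    fullRequest(driverRequest, executorNumber, executorRequest) * factor
  }
}
```

For a driver requesting 1 core and 4 executors requesting 2 cores each, `fullRequest(1.0, 4, 2.0)` gives 9.0 and `scaledRequest(1.0, 4, 2.0, 0.5)` gives 4.5.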

Member Author

also cc @holdenk

Member

We need to document what is used in the case of (none), @Yikun.

Member Author

@Yikun Yikun Mar 7, 2022

If no value is specified, there will be no lower limit on job-level CPU resources.

I will add it.
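
One way to picture the "no value means no lower limit" behavior is to emit a `minResources` entry only for settings that are actually present. The sketch below is hypothetical: `MinResources.build` is not this PR's implementation, and it assumes the `cpu` and `memory` keys of the Volcano PodGroup `minResources` field.

```scala
// Hypothetical sketch: map optional settings to PodGroup minResources entries.
object MinResources {
  // Unset options are dropped, so an empty map means "no lower limit".
  def build(minCpu: Option[String], minMemory: Option[String]): Map[String, String] =
    Seq("cpu" -> minCpu, "memory" -> minMemory)
      .collect { case (name, Some(quantity)) => name -> quantity }
      .toMap
}
```

With neither value set, `build(None, None)` returns an empty map, so nothing is reserved at the job level.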

<td>3.3.0</td>
</tr>
<tr>
<td><code>spark.kubernetes.job.minMemory</code></td>
Member

Ditto.

Member

@martin-g martin-g left a comment

There is a small typo in the name of method checkAnnotation

@dongjoon-hyun
Member

Please check the dev mailing list. The ramp-down for the Apache Spark 3.3 release has started, @Yikun.

@Yikun Yikun changed the title [WIP][SPARK-38187][K8S] Support resource reservation with volcano implementations [SPARK-38187][K8S] Support resource reservation with volcano implementations Mar 4, 2022
@Yikun
Member Author

Yikun commented Mar 4, 2022

[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 199 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (45 seconds, 453 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (43 seconds, 329 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enable) (17 seconds, 385 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enable) (37 seconds, 684 milliseconds)

@Yikun Yikun marked this pull request as ready for review March 4, 2022 15:35
@Yikun Yikun force-pushed the SPARK-38187-minRes branch 2 times, most recently from 0f42741 to d8b38a8 Compare March 5, 2022 00:44
@Yikun Yikun changed the title [SPARK-38187][K8S] Support resource reservation with volcano implementations [SPARK-38423][K8S] Support resource reservation with volcano implementations Mar 6, 2022
@Yikun Yikun changed the title [SPARK-38423][K8S] Support resource reservation with volcano implementations [SPARK-38187][K8S] Support resource reservation with volcano implementations Mar 6, 2022
@Yikun Yikun force-pushed the SPARK-38187-minRes branch 3 times, most recently from 3f6dc20 to 8bf21e8 Compare March 6, 2022 07:06
@Yikun
Member Author

Yikun commented Mar 6, 2022

[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 211 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (34 seconds, 483 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (34 seconds, 345 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enable) (17 seconds, 298 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enable) (37 seconds, 704 milliseconds)
[info] Run completed in 2 minutes, 18 seconds.
[info] Total number of tests run: 5
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

@dongjoon-hyun
Member

#35733 is merged now. Could you rebase this PR, @Yikun?

@Yikun
Member Author

Yikun commented Mar 7, 2022

$ build/sbt -Pvolcano -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r  -Dtest.include.tags=volcano  -Dspark.kubernetes.test.namespace=default "kubernetes-integration-tests/test"
[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 245 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (43 seconds, 473 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (43 seconds, 354 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enable) (18 seconds, 323 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enable) (38 seconds, 673 milliseconds)
[info] Run completed in 2 minutes, 38 seconds.
[info] Total number of tests run: 5
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 378 s (06:18), completed Mar 7, 2022 3:28:39 PM

$ k get pod
No resources found in default namespace.
$ k get queue
NAME      AGE
default   2d20h

@Yikun
Member Author

Yikun commented Mar 7, 2022

@dongjoon-hyun Done.

@@ -299,6 +300,20 @@ private[spark] object Config extends Logging {
.stringConf
.createOptional

val KUBERNETES_JOB_MIN_CPU = ConfigBuilder("spark.kubernetes.job.minCPU")
Member

BTW, this naming looks inconsistent to me. For the other K8s configurations, we use Core instead of CPU. Please double-check this.

Member Author

@Yikun Yikun Mar 7, 2022

Yes, this might be a good suggestion for improvement. I also searched in Spark: only spark.task.cpus uses cpu, and the others use cores. I guess the cores naming may have been inherited originally from YARN.

So cores is also fine with me, e.g. spark.kubernetes.job.minCores. If there is no objection, I can change this soon.

Member Author

also cc @holdenk

Member

@dongjoon-hyun dongjoon-hyun left a comment

I left a few more comments.

@Yikun Yikun force-pushed the SPARK-38187-minRes branch 3 times, most recently from 90829d5 to a6fc888 Compare March 7, 2022 12:06
@Yikun
Member Author

Yikun commented Mar 7, 2022

[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (13 seconds, 248 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (33 seconds, 501 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (34 seconds, 336 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (17 seconds, 203 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (29 seconds, 392 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (21 seconds, 190 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (20 seconds, 239 milliseconds)
[info] Run completed in 2 minutes, 54 seconds.
[info] Total number of tests run: 7
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 362 s (06:02), completed Mar 7, 2022 10:07:21 PM

Member

@dongjoon-hyun dongjoon-hyun left a comment

@Yikun
We are building PodGroup one by one manually. However, most of the configurations are static. Why don't we make PodGroup implement Loadable like Pod?

It seems that you decided not to use volcano API for some reasons. Can we use new DefaultVolcanoClient().podGroups().load()?

@Yikun
Member Author

Yikun commented Mar 8, 2022

@dongjoon-hyun I'm not sure whether there is a misunderstanding, so here is some background that might help:
We configure the pods (driver and executors) one by one, but we only return the PreAdditionalK8SResource once, before driver creation. So in the current implementation only one PodGroup is created per job (the pre-Kubernetes-resource step is only called on the driver side).

If we are on the same page, then:

> It seems that you decided not to use volcano API for some reasons

There is no hidden reason. I just wanted to keep it generic, so that it can be created by kubernetesClient.resourceList(preKubernetesResources: _*).createOrReplace(). So we only return the Volcano model here.

> Can we use new DefaultVolcanoClient().podGroups().load()?

If you mean using this to build the Volcano model, client-side YAML loading could be an alternative way to do this, but I think we would still need to leave placeholders in the YAML and fill them in within the feature step. That is perhaps not flexible and a little hard to maintain, and it also introduces a dependency on the Volcano client module.

:) Feel free to leave any questions you have; I will try my best to make this work better.
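
The placeholder concern raised in this comment can be illustrated with a small sketch. Everything here is hypothetical: the `PodGroupTemplate` object, the `{{NAME}}` / `{{QUEUE}}` placeholder syntax, and the plain string substitution are assumptions for illustration, not code from this PR or from the Volcano client.

```scala
// Hypothetical sketch of a YAML template whose placeholders must be filled per job.
object PodGroupTemplate {
  // Assumed template text; the {{...}} placeholder syntax is invented for this example.
  val template: String =
    """apiVersion: scheduling.volcano.sh/v1beta1
      |kind: PodGroup
      |metadata:
      |  name: {{NAME}}
      |spec:
      |  queue: {{QUEUE}}
      |""".stripMargin

  // Replace each {{KEY}} with its value; unknown placeholders are left untouched.
  def render(values: Map[String, String]): String =
    values.foldLeft(template) { case (text, (key, value)) =>
      text.replace("{{" + key + "}}", value)
    }
}
```

Every job-specific field needs its own placeholder and a matching substitution step, which is the maintenance cost the comment points to; building the model object directly avoids the template text entirely.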

Member

@dongjoon-hyun dongjoon-hyun left a comment

@Yikun, here is my PR for Volcano based on my original suggestion to you. I believe the two PodGroup template configurations are enough for Volcano because this is a simpler, more extensible, and more future-proof approach than introducing many new configurations one by one like this PR.

dongjoon-hyun added a commit that referenced this pull request Mar 9, 2022
…mplates

### What changes were proposed in this pull request?

This PR aims to support driver/executor `PodGroup` templates like the following.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
spec:
  minMember: 1000
  minResources:
    cpu: "4"
    memory: "16Gi"
  priorityClassName: executor-priority
  queue: executor-queue
```

### Why are the changes needed?

This is a simpler, more extensible, and more robust way to support future Volcano features because we don't need to add new configurations like #35640 for every Volcano feature.

### Does this PR introduce _any_ user-facing change?

No because this is a new feature.

### How was this patch tested?

Pass the CIs.

Closes #35776 from dongjoon-hyun/SPARK-38455.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Could you rebase this PR onto master and convert it to a test-case addition PR, @Yikun?

@Yikun
Member Author

Yikun commented Mar 10, 2022

@dongjoon-hyun Done! Thanks! : )

@Yikun Yikun changed the title [SPARK-38187][K8S] Support resource reservation with volcano implementations [SPARK-38187][K8S][TESTS] Add K8S IT for resource reservation with volcano implementations Mar 10, 2022
@Yikun Yikun changed the title [SPARK-38187][K8S][TESTS] Add K8S IT for resource reservation with volcano implementations [SPARK-38187][K8S][TESTS] Add K8S IT for resource reservation (Spark K8S with volcano) Mar 10, 2022
@Yikun
Member Author

Yikun commented Mar 10, 2022

[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 738 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (13 seconds, 294 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (25 seconds, 659 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (19 seconds, 310 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 467 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (29 seconds, 546 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (30 seconds, 473 milliseconds)
[info] Run completed in 2 minutes, 30 seconds.
[info] Total number of tests run: 7
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 236 s (03:56), completed 2022-3-10 9:17:46

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @Yikun. Merged to master.

$ build/sbt -Psparkr -Pkubernetes -Pvolcano -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube -Dspark.kubernetes.test.deployMode=docker-for-desktop 'kubernetes-integration-tests/testOnly -- -z SPARK-38187'
...
[info] KubernetesSuite:
[info] VolcanoSuite:
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (24 seconds, 485 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (24 seconds, 665 milliseconds)
[info] Run completed in 1 minute, 12 seconds.
[info] Total number of tests run: 2
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 99 s (01:39), completed Mar 9, 2022 5:26:29 PM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38187][K8S][TESTS] Add K8S IT for resource reservation (Spark K8S with volcano) [SPARK-38187][K8S][TESTS] Add K8S IT for volcano minResources cpu/memory spec Mar 10, 2022
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Mar 10, 2022
…emory spec

### What changes were proposed in this pull request?
This PR adds two tests to make sure resource reservation is supported.
- Run SparkPi Jobs with minCPU
- Run SparkPi Jobs with minMemory

### Why are the changes needed?
Test resource reservation (min resources) with the Volcano implementation.

### Does this PR introduce _any_ user-facing change?
No, K8S IT only

### How was this patch tested?
- integration test
```
[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 738 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (13 seconds, 294 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (25 seconds, 659 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (19 seconds, 310 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 467 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (29 seconds, 546 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (30 seconds, 473 milliseconds)
[info] Run completed in 2 minutes, 30 seconds.
[info] Total number of tests run: 7
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 236 s (03:56), completed 2022-3-10 9:17:46
```

Closes apache#35640 from Yikun/SPARK-38187-minRes.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>