
[SPARK-44050][K8S] Add retry config when creating Kubernetes resources #45911

Closed

Conversation

liangyouze

What changes were proposed in this pull request?

Add retry config when creating Kubernetes resources.

Why are the changes needed?

When creating Kubernetes resources, we occasionally encounter situations where resources such as ConfigMaps are not successfully created, leaving the driver pod stuck in the 'ContainerCreating' state. Therefore, it is necessary to add a verification mechanism after creating these resources to ensure that they are actually created.
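
As a rough, illustrative sketch of the kind of retry-and-verify mechanism being proposed (the helper name, attempt count, and wait interval below are hypothetical and not part of the actual patch), using the fabric8 KubernetesClient that Spark's K8s backend already uses:

    import io.fabric8.kubernetes.api.model.HasMetadata
    import io.fabric8.kubernetes.client.KubernetesClient

    // Hypothetical helper: create the given resources and retry until the API
    // server actually reports them back, instead of trusting a single call.
    def createWithRetry(
        client: KubernetesClient,
        namespace: String,
        resources: Seq[HasMetadata],
        maxAttempts: Int = 3,
        waitMillis: Long = 1000L): Unit = {
      var attempt = 0
      var remaining = resources
      while (remaining.nonEmpty && attempt < maxAttempts) {
        attempt += 1
        // Issue the create/replace call for everything that is still missing.
        client.resourceList(remaining: _*).createOrReplace()
        Thread.sleep(waitMillis)
        // Keep only the resources the API server does not return yet.
        remaining = remaining.filter { r =>
          client.resource(r).inNamespace(namespace).get() == null
        }
      }
      if (remaining.nonEmpty) {
        throw new IllegalStateException(
          s"Resources still missing after $maxAttempts attempts: " +
            remaining.map(_.getMetadata.getName).mkString(", "))
      }
    }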

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new tests.

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun
Member

Thank you for making a PR, but I'm not sure this is the right layer to do this. To me, it sounds like you are hitting a K8s cluster issue or a K8s client library issue. Could you elaborate on your environment and the error message, @liangyouze?

> When creating Kubernetes resources, we occasionally encounter situations where resources such as ConfigMaps are not successfully created, leaving the driver pod stuck in the 'ContainerCreating' state. Therefore, it is necessary to add a verification mechanism after creating these resources to ensure that they are actually created.

@liangyouze
Author

> Thank you for making a PR, but I'm not sure this is the right layer to do this. To me, it sounds like you are hitting a K8s cluster issue or a K8s client library issue. Could you elaborate on your environment and the error message, @liangyouze?

> When creating Kubernetes resources, we occasionally encounter situations where resources such as ConfigMaps are not successfully created, leaving the driver pod stuck in the 'ContainerCreating' state. Therefore, it is necessary to add a verification mechanism after creating these resources to ensure that they are actually created.

It's the same as described in SPARK-44050; I've encountered the same issue. When creating resources such as ConfigMaps, occasionally this situation occurs: the code does not throw any exceptions, but the ConfigMap resource is not actually created, causing the driver pod to remain in the ContainerCreating state, unable to proceed to the next step. This may be a Kubernetes issue, or a feature (as far as I know, Kubernetes has some rate-limiting policies that may cause certain requests to be dropped, but I'm not sure whether that's related), but in any case, Spark should not get stuck because of it.
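
As a minimal sketch of the fail-fast verification described above (a hypothetical helper, not Spark's actual code), one could read the ConfigMap back by name right after creating it and throw instead of letting the driver pod hang in ContainerCreating:

    import io.fabric8.kubernetes.client.KubernetesClient

    // Hypothetical check: after the create call returns, confirm the ConfigMap
    // is actually visible on the API server; fail fast if it is not.
    def verifyConfigMapExists(
        client: KubernetesClient,
        namespace: String,
        name: String): Unit = {
      val created = client.configMaps().inNamespace(namespace).withName(name).get()
      if (created == null) {
        throw new IllegalStateException(
          s"ConfigMap $name was not found in namespace $namespace after creation")
      }
    }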

@beliefer
Contributor

It seems you hit a bug in K8s, or a usage issue.

@imtzer

imtzer commented May 11, 2024

Same problem when using the Spark operator; it's weird that the code does not throw anything when the ConfigMap is not created.

@imtzer

imtzer commented May 24, 2024

Hi @liangyouze, I met the same issue. I'd like to know whether the problem occurs when using the Spark operator or spark-submit directly. Is there anything in the console output?

@liangyouze
Author

liangyouze commented May 24, 2024

> Same problem when using the Spark operator; it's weird that the code does not throw anything when the ConfigMap is not created.

When using spark-submit, there is no error output in the console, and the client just shows the driver pod stuck in the ContainerCreating state indefinitely.
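
For illustration only (not what spark-submit currently does; the names and timeouts below are hypothetical), one way to avoid waiting indefinitely is to poll the driver pod's phase with a deadline via the fabric8 client and give up once it expires:

    import io.fabric8.kubernetes.client.KubernetesClient

    // Hypothetical bounded wait: poll the driver pod phase and stop after a
    // deadline instead of watching a Pending/ContainerCreating pod forever.
    def waitForDriverRunning(
        client: KubernetesClient,
        namespace: String,
        podName: String,
        timeoutMillis: Long = 300000L,
        pollMillis: Long = 5000L): Boolean = {
      val deadline = System.currentTimeMillis() + timeoutMillis
      while (System.currentTimeMillis() < deadline) {
        val pod = client.pods().inNamespace(namespace).withName(podName).get()
        val phase = Option(pod).flatMap(p => Option(p.getStatus)).map(_.getPhase)
        phase match {
          case Some("Running") | Some("Succeeded") => return true
          case Some("Failed") => return false
          case _ => Thread.sleep(pollMillis) // still Pending (e.g. ContainerCreating)
        }
      }
      false
    }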

@imtzer

imtzer commented May 24, 2024

> Same problem when using the Spark operator; it's weird that the code does not throw anything when the ConfigMap is not created.

> When using spark-submit, there is no error output in the console, and the client just shows the driver pod stuck in the ContainerCreating state indefinitely.

Maybe I hit a different issue. I use spark-submit inside the Spark operator pod in k8s mode, but sometimes the driver pod gets stuck in the ContainerCreating state due to a missing ConfigMap, and the console output shows 'Killed'. I added some logging in KubernetesClientApplication.scala like this:

    logInfo("before pod create, " + driverPodName)

    var watch: Watch = null
    var createdDriverPod: Pod = null
    try {
      createdDriverPod =
        kubernetesClient.pods().inNamespace(conf.namespace).resource(resolvedDriverPod).create()
    } catch {...}

    logInfo("before pre resource refresh, " + driverPodName)

    // Refresh all pre-resources' owner references
    try {
      addOwnerReference(createdDriverPod, preKubernetesResources)
      kubernetesClient.resourceList(preKubernetesResources: _*).forceConflicts().serverSideApply()
    } catch {...}

    logInfo("before other resource, " + driverPodName)

    // setup resources after pod creation, and refresh all resources' owner references
    try {
      val otherKubernetesResources = resolvedDriverSpec.driverKubernetesResources ++ Seq(configMap)
      addOwnerReference(createdDriverPod, otherKubernetesResources)
      kubernetesClient.resourceList(otherKubernetesResources: _*).createOrReplace()
      logInfo("after other resource, " + driverPodName)

    } catch {...}

The 'after other resource' log line did not show up when spark-submit was 'Killed', and the ConfigMap that the driver pod needed had not been successfully created.
Also, I checked for OOM with dmesg, and a process in the Spark operator pod had been killed because of OOM.

@liangyouze
Author

> after other resource

"It seems that the reason 'after other resource' is not printed out is due to the client being OOM and exiting. Have you tried increasing the memory of the client pod to a sufficient size, so as to prevent the OOMKilled phenomenon?"

@imtzer

imtzer commented May 29, 2024

> after other resource

> It seems 'after other resource' is not printed because the client process was OOM-killed and exited. Have you tried increasing the memory of the client pod to a sufficient size to prevent the OOMKilled phenomenon?

We saw the same symptom, that the ConfigMap was not created, but maybe for a different reason.


github-actions bot commented Sep 7, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Sep 7, 2024
github-actions bot closed this Sep 8, 2024