[SPARK-32975] [K8S] Ensure driver is ready before executors start #32739

Closed

cchriswu
Contributor

@cchriswu cchriswu commented Jun 2, 2021

What changes were proposed in this pull request?

Before creating executor pods, wait until the driver pod is ready.

Why are the changes needed?

The driver's headless service can be resolved by DNS only after the driver pod is ready. If an executor tries to connect to the headless service before the driver pod is ready, it hits an UnknownHostException and enters an error state, but it is not restarted. This usually happens when the driver pod has sidecar containers that have not finished starting by the time the executors start. In other words, there is a race condition between driver readiness and executor startup.
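To illustrate the race from the executor's side, here is a minimal sketch that polls DNS until a hostname resolves or a timeout elapses. It is not the change in this PR, which waits on the driver pod's readiness before creating executor pods; it only shows the failure mode and a retry around it. The function name `wait_until_resolvable`, its parameters, and the defaults are hypothetical.

```python
import socket
import time


def wait_until_resolvable(host, timeout_s=60.0, interval_s=1.0,
                          resolve=socket.gethostbyname,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll DNS until `host` resolves; return True on success, False on timeout.

    `resolve`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a real cluster (hypothetical helper, not Spark API).
    """
    deadline = clock() + timeout_s
    while True:
        try:
            resolve(host)
            return True
        except OSError:
            # socket.gaierror (name not resolvable yet) is a subclass of OSError.
            if clock() >= deadline:
                return False
            sleep(interval_s)
```

An executor-side wrapper could call this with the driver service hostname before attempting its first connection, rather than failing once and staying in an error state.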

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Before creating executor pods, wait until the driver is ready. The
driver's headless service can be resolved by DNS only after the driver
pod is ready. If an executor tries to connect to the headless service
before the driver pod is ready, it hits an UnknownHostException and
enters an error state, but it is not restarted.
@AmplabJenkins

Can one of the admins verify this patch?

@cchriswu cchriswu closed this Jun 2, 2021
@singh-abhijeet

This only addresses the driver. What about executors that have a sidecar injected too? The executor must also wait until its sidecar container is ready before making bootstrap calls to the driver.

I was getting connection refused exceptions because the sidecar container was not ready when the executor tried to communicate. I resolved it by adding a sleep/wait in entrypoint.sh for the executor, but it would be neat to have a spark.k8s config that allows setting a wait time.
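As an alternative to a fixed sleep, a bounded readiness check could be run before the executor starts. The sketch below assumes the sidecar exposes a local TCP port once it is ready; the function name `wait_for_port`, the port, and the timeouts are hypothetical and are not an existing Spark option.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Covers connection refused and connect timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Compared with a fixed sleep, this returns as soon as the sidecar accepts connections and gives a clear failure if it never does.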
