druid.indexer.task.restoreTasksOnRestart does not work by default for Docker based deployments on Kubernetes #16352

Closed
aho135 opened this issue Apr 29, 2024 · 1 comment · Fixed by #16386

Comments

@aho135
Contributor

aho135 commented Apr 29, 2024

druid.indexer.task.restoreTasksOnRestart does not work by default for Docker based deployments on Kubernetes

Affected Version

25.0.0, but the issue still exists in the latest version.

Description

Hi Druid experts. Our team runs Druid on Kubernetes and ingests data from Kafka. We have druid.indexer.task.restoreTasksOnRestart=true and expected ingestion tasks to be restored and to resume even when the MiddleManager is restarted. This is the current behavior:

  1. The MiddleManager is shut down. Because druid.indexer.task.restoreTasksOnRestart=true, restore.json is created.
  2. The MiddleManager starts up, but with a different IP because we are running on Kubernetes. The task is restored and continues running.
  3. When the peon reports its status to the Overlord, the Overlord logs that the task is not among the known task IDs and proceeds to shut down the task. This is because the MiddleManager IP has changed.
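
The sequence above can be pictured with a toy model. This is only a sketch of the symptom, not Druid's code: the class `RestoreSketch`, its methods, the IP addresses, and the task ID are all hypothetical, and the assumption that the Overlord associates task IDs with the worker's advertised host is inferred from the behavior described in step 3.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only: hypothetical names, not Druid classes. */
public class RestoreSketch {
    // Task IDs the modeled "Overlord" expects, keyed by the worker's advertised host.
    private final Map<String, Set<String>> tasksByWorkerHost = new HashMap<>();

    /** Records that the worker advertised as workerHost runs taskId. */
    public void assign(String workerHost, String taskId) {
        tasksByWorkerHost.computeIfAbsent(workerHost, h -> new HashSet<>()).add(taskId);
    }

    /** Returns true if taskId is known for the worker advertised as workerHost. */
    public boolean isKnown(String workerHost, String taskId) {
        return tasksByWorkerHost
            .getOrDefault(workerHost, Collections.emptySet())
            .contains(taskId);
    }

    public static void main(String[] args) {
        RestoreSketch overlord = new RestoreSketch();
        String task = "index_kafka_example"; // hypothetical task ID

        // Before the restart, the MiddleManager advertises its old pod IP.
        overlord.assign("10.0.0.5", task);
        System.out.println("same host:  " + overlord.isKnown("10.0.0.5", task));

        // After the restart, the pod has a new IP; the restored task is reported
        // under an identity the model has never seen.
        System.out.println("new pod IP: " + overlord.isKnown("10.0.0.9", task));
    }
}
```

In this model the restored task is only recognized when the worker comes back under the same identity, which is why a host value that survives the restart matters.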

The fix is to let druid.host fall back to its default value, InetAddress.getLocalHost().getCanonicalHostName(); task restoration works after that. But getting the default value requires setting the DRUID_SET_HOST environment variable to 0. I am wondering what the original reasoning was for using the IP instead of the canonical host name, and whether we should change the default behavior, given that using the IP breaks task restoration.
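
To see what the two candidate values look like on a given pod, a small standalone check along these lines may help. It is a minimal sketch using only the standard `java.net.InetAddress` API; the class name `HostIdentity` and its helper methods are made up for illustration and are not part of Druid. Whether the canonical host name actually stays the same across restarts depends on how the pods are deployed, which is not specified here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Illustration only: HostIdentity is a hypothetical helper, not a Druid class. */
public class HostIdentity {
    /** The numeric IP address of addr, as a string. */
    public static String ipOf(InetAddress addr) {
        return addr.getHostAddress();
    }

    /**
     * The canonical host name of addr. This is a best-effort lookup: if no name
     * can be resolved, the JDK returns the textual IP address instead.
     */
    public static String canonicalNameOf(InetAddress addr) {
        return addr.getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("IP address:          " + ipOf(local));
        System.out.println("Canonical host name: " + canonicalNameOf(local));
    }
}
```

Running this inside the MiddleManager container before and after a restart shows which of the two values changes in a particular cluster.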

@FrankChen021
Member

> I am wondering what the original reasoning for using IP instead of canonical host name is. And wondering if we should change the default behavior given that using IP breaks task restoration

See #9019 and #6896 for the history.

The DRUID_SET_HOST environment variable may not be well documented.
