Skip to content

Use base task dir in kubernetes task runner#13880

Merged
dclim merged 5 commits intoapache:masterfrom
nlippis:use-base-task-dir-in-kubernetes-task-scheduler
Mar 7, 2023
Merged

Use base task dir in kubernetes task runner#13880
dclim merged 5 commits intoapache:masterfrom
nlippis:use-base-task-dir-in-kubernetes-task-scheduler

Conversation

@nlippis
Copy link
Contributor

@nlippis nlippis commented Mar 3, 2023

Use base task directory from from task config in kubernetes task scheduler

Get the base task directory from TaskConfig instead of TaskStorageDirTracker

The Kubernetes task runner runs each task within its own machine context (container). Given this, there is no need to allocate and deallocate a directory per task because only one task will ever use the directory, and in the case of the TaskStorageDirTracker only one of the specified directories will ever be used. More info here

Release note


Key changed/added classes in this PR
  • KubernetesTaskRunner

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Copy link
Contributor

I don't understand. is there any material impact this change will have? For sake of consistency, we prefer k8s task runner use the same interface as other runners.

@nlippis
Copy link
Contributor Author

nlippis commented Mar 6, 2023

I don't understand. is there any material impact this change will have? For sake of consistency, we prefer k8s task runner use the same interface as other runners.

I see the benefits of this change to be twofold.

Reduce Confusion
As a user, I don't know what the behavior of specifying multiple base task dir paths will be without reading the code within the context of the KubernetesTaskRunner. Given that only one of the directories will be used per task, this seems like odd behavior.

As a druid contributor, I would be confused why this concept exists within the KubernetesTaskRunner since every task runs within its own container so there is no benefit from using it here.

Reduce state that needs to be tracked by the KubernetesTaskRunner
After this PR I will be introducing a series of changes that allows the user to specify a specific Druid task <-> K8s Job adapter as well as a K8s pod template file based adapter. In this change, the only information passed to the adapter is the task itself. If we were to continue to use the TaskStorageDirTracker then we would need to pass more state information into the adapter. While implementing that change would be trivial, there is no reason to do so since there is only base task dir used per task.

Alternatively we could add a new concept to the TaskStorageDirTracker that just returns the first directory and doesn't track allocated directories per task, however IMHO that would make the purpose of the class unclear.

Another alternative would be to introduce a new interface and have two different implementations (one that tracks directories and another that returns the base task dir) however this seemed to be overkill.

In the end I decided to leave task dir management policy up to the implementation of the TaskRunner which IMHO is simplest for the Druid user and contributor to understand and use.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. The logic makes sense to me. Please have a look at the comment about using getBaseTaskDirPaths instead.

generateCommand(task),
javaOpts(task),
dirTracker.getTaskDir(task.getId()),
new File(taskConfig.getBaseTaskDirPath()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBaseTaskDirPath is for JSON backwards compat; it may be null and isn't used elsewhere. Better to do getBaseTaskDirPaths().get(0). Please also add @Deprecated to TaskConfig.getBaseTaskDirPath to make it clearer to others that this method shouldn't be used.

@dclim dclim merged commit faac43e into apache:master Mar 7, 2023
@nlippis nlippis deleted the use-base-task-dir-in-kubernetes-task-scheduler branch March 7, 2023 22:31
317brian pushed a commit to 317brian/druid that referenced this pull request Mar 10, 2023
* Use TaskConfig to get task dir in KubernetesTaskRunner

* Use the first path specified in baseTaskDirPaths instead of deprecated baseTaskDirPath

* Use getBaseTaskDirPaths in generate command
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants