[Proposal] Druid On Yarn #4400
Comments
@RongZhang828 this is a great proposal, looking forward to seeing more details!
Just as a general FYI, we run on Marathon/Mesos using just Marathon app definitions.
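For illustration, here is a minimal sketch of what such a Marathon app definition for a Druid broker might look like. The image name, ports, resource sizes, and ZooKeeper hosts are assumptions for the sketch, not values from this thread:

```json
{
  "id": "/druid/broker",
  "instances": 2,
  "cpus": 2,
  "mem": 8192,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "example/druid-broker:0.10.0",
      "network": "HOST"
    }
  },
  "env": {
    "DRUID_SERVICE": "druid/broker",
    "ZK_HOSTS": "zk1:2181,zk2:2181,zk3:2181"
  },
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/status/health",
      "port": 8082,
      "gracePeriodSeconds": 120,
      "intervalSeconds": 30,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

Marathon then handles restarts and scaling (`instances`) for that role; the open question in this thread is how to get the same behavior on YARN.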
Here is a list of challenges coming from someone using Druid in a containerized world:
A useful solution can certainly be developed that only addresses a subset of these. Hopefully our experience will help ease adoption and maximize utility of such a feature.
Thanks a lot for your suggestions, @drcrallen. With your input and my own experience, I came up with the solution below. Druid roles can be categorized into two different groups:
So in the first stage of the Druid on YARN project, I only take the worker roles (broker/historical) into consideration. Details below: one application master for each module.
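As a rough illustration of what "one application master per module" could look like against the Hadoop YARN client API, here is a minimal sketch that registers an AM and requests containers for historical workers. The resource sizes, worker count, and the launch step are assumptions, not part of the proposal:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DruidHistoricalAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Register this application master with the ResourceManager.
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(conf);
    amClient.start();
    amClient.registerApplicationMaster("", 0, "");

    // Ask for one container per historical worker (sizes are illustrative).
    Resource capability = Resource.newInstance(16 * 1024, 4); // 16 GB, 4 vcores
    Priority priority = Priority.newInstance(0);
    int numHistoricals = 10;
    for (int i = 0; i < numHistoricals; i++) {
      amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Poll the ResourceManager until all containers are allocated, then launch
    // the Druid historical process in each container (launch omitted here).
    int allocated = 0;
    while (allocated < numHistoricals) {
      AllocateResponse response = amClient.allocate((float) allocated / numHistoricals);
      for (Container container : response.getAllocatedContainers()) {
        allocated++;
        // startHistoricalProcess(container); // hypothetical helper: build a
        // ContainerLaunchContext and start the historical via an NMClient.
      }
      Thread.sleep(1000);
    }
  }
}
```

A real AM would also watch for completed or failed containers in the `allocate` response and re-request replacements, which is where the monitoring and failover rules discussed below would live.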
Currently, Druid has its own constraints as a distributed system:
Historical
Due to the above reasons, Druid is not a typical distributed system. The Druid application master needs to support at least two worker distribution strategies:
In the first stage, I'll focus on the exclusive strategy; the second one requires more modification of the Druid code. Also, for the failover strategy, workers will be restarted on the same node they were running on before they died abnormally (see the sketch below). In the long term, the Druid broker and historical should be able to support the normal strategy, and the solutions to these problems are quite easy.
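A minimal sketch of how that "restart on the same node" failover rule could be expressed with the YARN AMRMClient API, by pinning the replacement container request to the host the dead worker was on and disabling locality relaxation. The host name and resource sizes are assumptions:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class FailoverRequests {
  /**
   * Build a container request pinned to the host where the failed historical
   * was running, so its segment cache on local disk can be reused.
   */
  public static ContainerRequest sameNodeRequest(String failedWorkerHost) {
    Resource capability = Resource.newInstance(16 * 1024, 4); // illustrative sizes
    Priority priority = Priority.newInstance(0);
    return new ContainerRequest(
        capability,
        new String[]{failedWorkerHost}, // only this node
        null,                           // no rack preference
        priority,
        false                           // relaxLocality = false: strict node locality
    );
  }
}
```

The application master would submit such a request when it notices a worker has died; if the node itself is gone, a fallback request with relaxed locality would still be needed.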
Perhaps the above functions will be done in stage two. Let me know if you have any questions or ideas.
@RongZhang828 This is a great proposal!
For query routing we do have a Druid router node that can be used. The router discovers brokers via ZK service discovery, so this can be a viable way to hide the complexity of needing a fixed pool of IPs/ports for brokers.
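For reference, a minimal and version-dependent sketch of router runtime.properties that discovers brokers through ZooKeeper; the hostnames and tier names here are assumptions:

```properties
druid.service=druid/router
druid.port=8888

# ZooKeeper used for service discovery of brokers and the coordinator
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181

# Broker service names to route queries to
druid.router.defaultBrokerServiceName=druid/broker
druid.router.tierToBrokerMap={"hot":"druid/broker-hot","_default_tier":"druid/broker"}
druid.router.coordinatorServiceName=druid/coordinator
```

Because the router resolves broker instances from ZooKeeper at query time, clients only need the router's address, not the addresses of the dynamically placed brokers.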
I'm working on it now and hopefully will finish it in the next few weeks :)
Interesting proposal. I think the following things should be considered closely:
Also, I don't know whether your MiddleManager cluster is using the indexer auto-scaling feature. If it is, then why do you need YARN for the MiddleManagers?
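For context, the auto-scaling referred to here is the overlord-driven provisioning of MiddleManager workers. A rough sketch of the relevant overlord runtime.properties, with illustrative values:

```properties
# Let the overlord provision and terminate MiddleManager workers itself
druid.indexer.autoscale.doAutoscale=true
druid.indexer.autoscale.strategy=ec2
druid.indexer.autoscale.provisionPeriod=PT1M
druid.indexer.autoscale.terminatePeriod=PT5M
druid.indexer.autoscale.workerIdleTimeout=PT90M
```

If this mechanism already covers MiddleManagers, the YARN work could stay focused on the broker and historical roles.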
Well, for now we use YARN in production, and most likely Druid will run on Hadoop clusters alongside other applications like Flink. So for now I'd like to implement Druid on YARN first.
Have you considered Kubernetes? My guess is that it can replace ZooKeeper as well.
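To make the comparison concrete, here is a minimal sketch of what the historical role might look like as a Kubernetes StatefulSet, which gives a similar "come back with the same local storage" behavior via persistent volume claims. The image, ports, and sizes are assumptions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: druid-historical
spec:
  serviceName: druid-historical
  replicas: 10
  selector:
    matchLabels:
      app: druid-historical
  template:
    metadata:
      labels:
        app: druid-historical
    spec:
      containers:
        - name: historical
          image: example/druid-historical:0.10.0   # hypothetical image
          ports:
            - containerPort: 8083
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
          volumeMounts:
            - name: segment-cache
              mountPath: /var/druid/segment-cache
  volumeClaimTemplates:
    - metadata:
        name: segment-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi
```

Whether Kubernetes could also replace ZooKeeper for coordination and discovery is a separate question that would require changes inside Druid itself, not just deployment configuration.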
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Motivation:
In large Druid clusters with over 100 machines, there will be hundreds of workers, especially brokers and historical nodes. It is hard to monitor, add, or remove workers for a given role automatically.
For example, if a worker dies unexpectedly, sysadmins need to restart it manually.
The time gap between the failure and the restart hurts the user experience.
Also, if more historical workers are needed, manual work is required to start them on additional hosts.
Proposal:
With YARN, these problems could be solved cleanly. Sysadmins would no longer need to care about worker placement for any role; workers could be assigned and started automatically.
Dead workers would be restarted promptly, and with a single command more workers could be added to a given role.
An application master for Druid would be responsible for starting workers and monitoring their status. In the application master, custom rules could be defined for each role; for example, a worker should be restarted on the same host so that its files on disk can be reused.