[Proposal] Native parallel batch indexing #5543
@jihoonson This seems to focus on initial data ingestion. Can you comment on how this could play into things like merge tasks, or potentially reducing the fidelity of existing data with a new index spec? (Both are possibly outside the scope of the MVP.)
@drcrallen well, this proposal is not for introducing a general parallel framework to make any task type parallelizable. Instead, it introduces new task types (and the required shuffle system) for parallel indexing. Since these new task types should allow any type of splittable firehose, they can be used for reindexing if we add a splittable firehose for it.
Implementing the parallel indexing without shuffle sounds useful by itself for many users, e.g., in combination with #5238 for ingesting data from databases. So, 👍. I'll defer shuffle-related comments because there aren't enough low-level details about its implementation fleshed out yet. My guess is that it will probably end up looking very similar to Hadoop MR's shuffle. That is not necessarily a bad thing, given the advantage of it being supported out of the box in Druid.
@himanshug thanks for the comment. We may need a more detailed proposal for the shuffle system. Also, I'm not sure about sharing the same shuffle system between indexing and querying for now, because they have different requirements. I'll raise another issue for it later.
This is very useful! 👍👍👍
Hi, does IngestSegmentFirehose work with the native parallel batch indexing?
Hi @csimplestring, yes, it should work.
This initiative is AWESOME AWESOME AWESOME!!!! I read in the comments of related PRs about why one would need yet another data processing framework and what the issues with Spark/Hadoop are. In my opinion, Druid needs native indexing support more than anything, especially in the context of finding more widespread adoption and growing the community. I very much hope that more and more people can join in this effort. Most database systems come with native DML support, and competitor products such as MPP databases like Vertica have native support for ingesting big-data workloads. The second most needed feature is OLAP cubing (materialized views), which was recently added to Druid 0.13 as a prototype but currently requires a Hadoop cluster. So folks who went with Spark-based indexing cannot use it unless they reinvent the wheel by adding support for it there too.
@csimplestring, I'm sorry, but I was wrong. IngestSegmentFirehose is not available for native parallel indexing; it will be implemented in #7048.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
We currently have two types of batch index tasks, i.e., the local index task and the Hadoop index task. A Spark task is also available via a third-party extension.
Motivation
All of these task types have some limitations. The Hadoop/Spark index task requires an external Hadoop/Spark cluster to run, and this kind of dependency on external systems imposes additional setup and maintenance requirements.
Meeting all these requirements has been painful for people who have just started trying Druid.
The local index task doesn't depend on any external systems, but it runs with a single thread and thus isn't suitable for practical use cases.
As a result, we need a new task type that has no dependency on external systems and is capable of parallel indexing. This is what we're already doing with Kafka ingestion.
Goals
The goal of this proposal is to introduce new parallel indexing methods. It is not to replace the existing Hadoop task with new ones; it's about providing more options to Druid users.
Design
Each new parallel indexing method (with or without shuffle) consists of two task types: a supervisor task and its worker tasks. Once a supervisor task is submitted to the overlord by a user, it internally submits its worker tasks to the overlord.
Worker tasks read input data and generate segments. The generated segments are pushed to deep storage by the worker tasks. Once they finish their work, they report the list of pushed segments to the supervisor task.
The supervisor task monitors the statuses of the worker tasks. If one of them fails, the supervisor task retries it until the number of retries reaches a preconfigured threshold. Once all worker tasks succeed, the supervisor task collects the lists of pushed segments from all worker tasks and publishes them atomically.
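The monitor/retry/publish behavior described above can be sketched as follows. This is a minimal illustration, not Druid's actual implementation; the names `MAX_RETRIES`, `run_worker`, `supervise`, and `publish_atomically` are all hypothetical.

```python
# Hypothetical sketch of the supervisor's retry-and-publish loop.
MAX_RETRIES = 3  # assumed preconfigured retry threshold

def run_worker(split):
    """Hypothetical worker: indexes one input split and returns pushed segment ids."""
    return [f"segment-{split}"]

def publish_atomically(segments):
    """Stand-in for publishing all segments in a single metadata-store transaction."""
    pass

def supervise(splits):
    published = []
    for split in splits:
        for attempt in range(MAX_RETRIES + 1):
            try:
                # Each worker pushes its segments and reports them back.
                published.extend(run_worker(split))
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    raise RuntimeError(f"split {split} failed after {MAX_RETRIES} retries")
    # Only after every worker has succeeded are the collected segments
    # published together, so a partial failure never publishes anything.
    publish_atomically(published)
    return published
```

The key property is that segment publishing happens once, after all workers report success, which gives all-or-nothing semantics for the whole job.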
In both indexing methods, the parallelism of the initial phase is decided by the input firehose. To support this, the splittable firehose is introduced. A splittable firehose is responsible for letting the supervisor task know how the input can be split. The supervisor task generates worker tasks according to the splittable firehose implementation.
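The splittable-firehose idea can be sketched as below: the factory reports how the input can be split, and the supervisor creates one worker task spec per split. The class and method names (`SplittableLocalFirehoseFactory`, `get_splits`, `plan_worker_tasks`) are illustrative, not Druid's actual API.

```python
# Hypothetical sketch of a splittable firehose and supervisor-side task planning.
from dataclasses import dataclass

@dataclass
class WorkerTaskSpec:
    split: str  # the portion of input assigned to one worker task

class SplittableLocalFirehoseFactory:
    """Hypothetical splittable firehose over a list of local files."""
    def __init__(self, files):
        self.files = files

    def get_splits(self):
        # One split per input file; a real implementation could also split
        # by file size or by ranges within a file.
        return list(self.files)

def plan_worker_tasks(factory):
    # The supervisor's parallelism for the first phase is simply the
    # number of splits the firehose reports.
    return [WorkerTaskSpec(split=s) for s in factory.get_splits()]
```

Under this scheme, adding parallel support for a new input type only requires implementing the split-enumeration method.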
In two-phase parallel indexing, the supervisor task submits the worker tasks of the second phase once the first phase completes. The second-phase workers read the intermediate results of the first-phase workers and generate segments. The parallelism of the second phase can be decided by the size of the intermediate data. Thus, the supervisor should be capable of collecting the intermediate data sizes from all worker tasks and adjusting the parallelism accordingly. To support shuffle, the intermediate results of the first phase must be kept until the second phase completes.
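One way the supervisor could derive second-phase parallelism from the reported intermediate data sizes is sketched below. The target size per task and the worker cap are hypothetical tuning parameters, not values proposed here.

```python
# Hypothetical sizing rule: enough second-phase workers that each handles
# roughly target_bytes_per_task of intermediate data, capped at max_workers.
import math

def second_phase_parallelism(intermediate_sizes_bytes,
                             target_bytes_per_task=512 * 1024 * 1024,
                             max_workers=100):
    total = sum(intermediate_sizes_bytes)
    if total == 0:
        return 1  # still run one worker to finalize empty output
    return min(max_workers, math.ceil(total / target_bytes_per_task))
```

Because the sizes are only known after the first phase finishes, the supervisor can make this decision at the phase boundary rather than up front.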
Implementation Plan
Out of scope of this Proposal
- General shuffle system which is available for both indexing systems and querying systems in Druid
  - The shuffle system should be available for two-phase parallel indexing
  - The shuffle system should also be available for Druid's querying system. This can be used for faster query processing when the size of intermediate data is large.