Native parallel batch indexing with shuffle #8061
Comments
SGTM in general
Does the overlord treat the "supervisor" task as a special task to be able to initiate cleanup requests? What if the MM is down temporarily, or cleanup fails for some reason? In addition to overlord cleanup requests, it might be good for middleManagers to periodically check whether the "supervisor" task is running or not and do the self cleanup.
I think a user-defined upper limit could always exist in all "supervisor" tasks that spawn extra tasks, so that users can plan worker capacity knowing how many tasks at a maximum would be running via the parallel [shuffle] task.
@himanshug thanks for taking a look!
Ah, this is a good point. To handle middleManager failure, a sort of self-cleanup can be triggered when some amount of time has elapsed since the last access to any partition for a supervisorTask. Does this sound good?
Thanks for reminding me of this. Forgot to add it to the proposal. I'm thinking to use the existing
This is already supported with
Anything is good as long as data is not left behind forever in error scenarios :). That said, it might be more work to track the access time of the partition...
it would be extra nice to document the failure cases and expected behavior, e.g.
it is not necessary for things to be recoverable, as most of this would be iterated upon later, but we can just document what to expect in such failure scenarios.
Oh, BTW, there's nothing special about supervisorTasks. I'm thinking to add a callback system to the indexing service, where some callback functions can be executed at predefined stages, e.g., after segments are published or some tasks are finished.
Yes, this is what I'm thinking. What I have in mind is pretty similar to yours, but with some additional stuff.
I think it's not very complex, but it will reduce the number of calls to the overlord API, so it would be good.
Ah, I didn't document them since it would be the same as the existing parallel index task behavior.
If the supervisorTask process is killed normally,
The supervisorTask monitors subtask statuses and counts how many subtasks have failed to process the same input. If it notices more failures than configured
I'll add these to the proposal.
Good point. No task is restorable now, but I think it might be useful to support rolling updates in the future. Parallel index tasks are supposed to run for a long time, so it would be nice if they could be stopped/restored during a rolling update.
thanks for the explanations.
That is fine, and I am guessing it will batch the task status calls, i.e., ask for the status of multiple supervisor tasks in one overlord API request.
Will the worker tasks complete and exit eventually, or will they be left running till manual intervention? We might consider making the supervisor a real "supervisor" like Kafka's instead of a "task" so that it gets special powers to manage things better. But I think I remember a comment that they were made tasks because tasks are better equipped to work with the available locking framework. One option could be to let the spawned worker tasks do the locking, since all worker tasks would be in the same group, so multiple worker tasks trying to obtain the same lock would still work..... this is unverified wishful thinking :)
Correct.
All subtasks report the pushed segments to the supervisor task at the end of indexing. So, if the supervisor task is not running, they would end up failing at this stage. However, this is still quite annoying since they would occupy middleManager resources unnecessarily. I guess we could add some health check of the supervisor task to subtasks.
Yeah, my original intention was to add a "supervisor" like Kafka's rather than a new task type, but I changed my mind to use the existing task lock system. And now I'm inclined to keep the current design of the supervisor task because
For the second reason, I would even vote to demote the Kafka/Kinesis supervisor to a supervisor task.
Thanks, that reasoning about supervisors makes sense.
For their early termination, yes.
Thanks! I updated the proposal to include the things discussed.
Went through the proposal. I see one issue with this which needs more thought: can we consider using an intermediate directory on the deep storage for storing intermediate segments, so that phase 2 tasks can read from there? Am I understanding it correctly, or is it supposed to work with the current design?
@nishantmonu51 good point! We currently have two provisioning strategies for auto scaling, i.e.,
I'm inclined to the first approach because 1) it's more efficient to read data from middleManagers than from deep storage and 2) intermediary data cleanup for deep storage could be more complex than that for middleManagers (it's still doable though). What do you think? BTW, no matter which way we go, I think this issue could be fixed in a follow-up PR. Does this make sense?
Regarding the autoscaling, there are users who have developed autoscaling strategies outside Druid, e.g., Kubernetes users use simple Horizontal Pod Autoscaler rules, and keeping the autoscaling strategies simple helps them write the rules easily. Thinking more along the lines of deep storage,
Efficiency wise, I agree that reading data from deep storage (especially when using S3) would be slower than reading it from another MM, but till now I have seen the creation of Druid indexes and final segments to be the major bottleneck instead of data transfer. What do you think? Agreed, it can be done in a follow-up PR.
I think two big changes are happening in Druid's indexing service recently. One would be the native parallel indexing and the other could be the new Indexer module (#7900). I think this Indexer module would be better than the middleManager in terms of memory management, monitoring, and resource scheduling and sharing, and I hope it will be able to replace the middleManager in the future. It would be nice if we could keep the current behavior for users even with these changes, but I think it wouldn't be easy, especially with the Indexer.
Yeah, it would be pretty similar to using the MM as the intermediary data server with respect to intermediary data structure and management. But I think it could be more complex when it comes to operational stuff like permissions or failure handling. Not only the middleManager but also the overlord should be able to remove stale intermediary data, because the middleManager could miss the supervisor task failure. The attempt to delete intermediary data could also fail, and it would be easier if it's a local disk. I mean, it's still doable with deep storage, but it would be a bit more complex.
Hmm, the additional functionality would be just serving intermediary data and cleaning it up periodically. Maybe the middleManager would need a bit more memory, but it shouldn't be much. Would you elaborate more on your concern? Looking at the code around
That sounds interesting. I haven't used Hadoop ingestion much, and I don't know how much data transfer contributes to the total ingestion time. Do you have any rough numbers on this?
#6533 looks like another issue in auto scaling. It seems to need more information about middleManager state anyway.
@nishantmonu51 thinking more about this, it could be easier and more useful for these users if intermediary data is stored in deep storage. I think we can make this pluggable, like using the MM or deep storage for intermediary data depending on the configuration. Does this make sense?
Could I use the index_parallel task to reindex my segments for a multitenancy case, like single-dimension partitioning by tenant_id? If so, how should I configure it with
Hi @quenlang, you should be able to do it. An example tuningConfig could be:

```json
"tuningConfig" : {
  "type" : "index_parallel",
  "maxNumConcurrentSubTasks": 2, // max number of subtasks that can run at the same time
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 3 // number of segments per time chunk after compaction
  }
}
```
@jihoonson Thanks a lot!
@jihoonson is there any protection against getting into a deadlock in the parallel task implementation? For example, and for simplicity, consider the case of a single-phase parallel task... say a user submitted 10 parallel tasks, each of which is supposed to create 2 sub tasks. Now say you have only 8 task slots (all of which are occupied by supervisor tasks)... they will submit subtasks which will never run. Is this situation possible, or is there any protection to ensure this deadlock doesn't happen? I am noticing something like that on a cluster running loads of single-phase parallel tasks, not entirely sure.
@himanshug good point. I don't think we have any protection against this kind of scenario yet. One possible short-term workaround could be checking a timeout for pending sub tasks. If the timeout expires, the supervisor task can kill its sub tasks, which will lead to killing itself in the end, either by killing sub tasks 3 times just like in general failure handling, or by killing itself immediately in this case. In the future, we may need to improve task scheduling to consider task type to avoid this kind of deadlock.
@jihoonson hmmm, that solution will fail the indexing... if a lot of parallel indexing tasks are submitted (relative to total available task slots) at the same time, indexing progress would be very very slow, with many of them failing. That said, the above workaround isn't great for usability in general, as a simple Druid cluster setup should just work. In the long run, I think we need to do something to give special treatment to
Interesting. Another possible workaround for now is that you can explicitly specify
For the long run idea, I'm not sure if it's fine to not reserve any resource, or if it should reserve a minimum amount of resources. Maybe we need to distinguish different resource types (e.g., CPU, memory, disk, etc.) to schedule tasks better.
with 0
Yes, I meant reserving something else and not task slots, i.e., maybe a "supervisor slot" concept.
Hmm, I just tested with
Sounds good. I was thinking of a task scheduler respecting task priority. Like, it could schedule high-priority tasks more often, but should also be able to avoid starvation of low-priority tasks. Maybe the supervisor slot concept can go along with this idea.
Also sounds good.
Looked at the code, and workers just do whatever the overlord tells them, so the behavior you saw is expected and required to fulfill the contract of
@jihoonson I am thinking of adding a tuning config in
default value of
In my case, I would set
What do you think?
Never mind, I tried the above and it didn't help my case; contrary to my thinking, most tasks had 2 splits. Running the supervisors on a separate set of workers using #7066 is the only decent thing to do for now.
This table stated that `index_parallel` tasks were best-effort only. However, this changed with #8061 and this documentation update was simply missed.
Hi @jihoonson, I split data into different segments by the single dimension tenant_id for a multitenancy scene. In this way, I can get higher query performance for queries that filter on the tenant_id dimension. But the tenant data was skewed, so the segment sizes were not perfect. For example, the max segment size was nearly 18GB but the min segment size was 5MB. Queries on the 18GB segment were much slower than with partitioning by all dimensions. Assuming there is a way (e.g. maxRowsPerSegment=10000000) to split the max segment into some smaller segments, the queries would be quick. What do you think about this scenario?
Hi @quenlang, I assume you're using the hadoop indexing task since the parallel indexing task doesn't support single-dimension range partitioning yet. I think you can set
@jihoonson Sorry, I did not describe it clearly. I used the parallel indexing task to reindex segments for single-dimension range partitioning.
I got a correct result. It seems perfect except for the data skew. I'm confused with
Ah, the partitionsSpec you used is the hash-based partitioning. To use the range partitioning, the
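For reference, a range-partitioned tuningConfig might look something like the sketch below; the `single_dim` type and the `partitionDimension`/`targetRowsPerSegment` field names are assumptions based on the follow-up proposal (#8769), not something confirmed in this thread:

```json
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "single_dim",            // assumed type name for range partitioning
    "partitionDimension": "tenant_id",
    "targetRowsPerSegment": 5000000  // assumed sizing field
  }
}
```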
I'm pretty surprised by this and wondering how big the performance gain was in your case. Sadly, Druid doesn't support segment pruning in brokers for hash-based partitioning for now (this is supported only for single-dimension range partitioning). That means, even though your segments are partitioned based on the hash value of
One popular way to mitigate the data skewness is adding other columns to the partition key, so that segment partitioning can be better balanced. This will hurt the locality of data, so I guess you may need to find a good combination of columns for the partition key, as in the sketch below.
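A minimal sketch of that idea using the existing hashed spec's `partitionDimensions` field; `event_id` here is a hypothetical second key column:

```json
"partitionsSpec": {
  "type": "hashed",
  "numShards": 3,
  "partitionDimensions": ["tenant_id", "event_id"] // event_id is a hypothetical extra key column
}
```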
@jihoonson Thanks for the quick reply!
I did not get as big a performance gain as expected. For a small tenant, the query latency only drops by 50ms-100ms, but for a big tenant, the latency increases by 10s-30s. I think it is caused by the data skewness with the hashed partitioning of tenant_id. The biggest tenant is in an 18GB segment.
Even though the broker sends queries to all historicals, only one historical node has the tenant data. So I think the data skewness is the root cause of the large latency.
Do you mean that Druid 0.17.0 will support single-dimension range partitioning in native parallel indexing tasks?
Thank you for sharing! The performance gain does look small but is still interesting.
I hope so. You can check the proposal in #8769.
Hmm, sorry, I didn't mean that single-dimension range partitioning helps with the skewed partitioning. I'm not sure if there's a better way other than adding other columns to the partition key.
@jihoonson Thank you so much.
This is done.
Motivation
General motivation for native batch indexing is described in #5543.
We now have the parallel index task, but it doesn't support perfect rollup yet because of the lack of a shuffle system.
Proposed changes
I propose to add a new mode to the parallel index task which supports perfect rollup with two-phase shuffle.
Two-phase partitioning with shuffle
Phase 1: each task partitions data by segmentGranularity and then by a hash or range key of some dimensions.
Phase 2: each task reads a set of partitions created by the tasks of phase 1 and creates a segment per partition.
For example, with day segmentGranularity and 3 partitions per day, each phase 1 task writes up to 3 partial segments per day, and phase 2 runs one task per (day, partition) pair that merges the corresponding partial segments into a final segment.
`PartitionsSpec` support for `IndexTask` and `ParallelIndexTask`
`PartitionsSpec` is the way to define the secondary partitioning and is currently being used by `HadoopIndexTask`. This interface should be adjusted to be more general. Hadoop tasks can use an extended interface which is more specialized for Hadoop.
`IndexTask` currently provides duplicate configurations for partitioning in its tuningConfig, such as `maxRowsPerSegment`, `maxTotalRows`, `numShards`, and `partitionDimensions`. These configurations will be deprecated and the indexTask will support `PartitionsSpec` instead. To support `maxRowsPerSegment` and `maxTotalRows`, a new partitionsSpec could be introduced. This partitionsSpec will be supported as a new configuration in the tuningConfig of `IndexTask` and `ParallelIndexTask`.
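A sketch of what that new partitionsSpec might look like in a tuningConfig; the `dynamic` type name matches what later Druid releases ended up using, but it is an assumption here:

```json
"partitionsSpec": {
  "type": "dynamic",            // assumed name for this new best-effort spec
  "maxRowsPerSegment": 5000000,
  "maxTotalRows": 20000000      // rows buffered across segments before an intermediate push
}
```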
New parallel index task runner to support secondary partitioning
`ParallelIndexSupervisorTask` is the supervisor task which orchestrates the parallel ingestion. It's responsible for spawning and monitoring sub tasks, and publishing created segments at the end of ingestion. It uses `ParallelIndexTaskRunner` to run single-phase parallel ingestion without shuffle. To support two-phase ingestion, we can add a new implementation of `ParallelIndexTaskRunner`, `TwoPhaseParallelIndexTaskRunner`. `ParallelIndexSupervisorTask` will choose the new runner if the partitionsSpec in tuningConfig is `HashedPartitionsSpec` or `RangePartitionsSpec`.
This new taskRunner does the following:
- runs as `TwoPhasesParallelIndexTaskRunner`, the new runner for the supervisor task
- determines partitions (if `numShards` is missing in tuningConfig)
The supervisor task provides an additional configuration in its tuningConfig, i.e., `numSecondPhaseTasks` or `inputRowsPerSecondPhaseTask`, to support controlling the parallelism of phase 2.
This will be improved to automatically determine the optimal parallelism in the future, as in the sketch below.
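As a sketch, using the option names proposed at the time of writing (the final names and values may differ):

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": { "type": "hashed", "numShards": 3 },
  "numSecondPhaseTasks": 6 // proposed knob: how many phase 2 (merge) tasks to spawn
}
```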
New sub task types
Partition determine task: uses a `HyperLogLog` per interval to compute the approximate cardinality.
Phase 1 task: partitions input data by segmentGranularity and then by the partition key, as described above.
Phase 2 task: reads a set of partitions created by phase 1 tasks and creates a segment per partition, as described above.
MiddleManager as Intermediary data server
MiddleManager (and new Indexer) should be responsible for serving intermediary data during shuffle.
Each phase 1 task partitions input data and generates partitioned segments. These partitioned segments are stored on the local disk of the middleManager (or the indexer proposed in #7900). The partitioned segment location would be the `/configured/prefix/supervisorTaskId/` directory. The same configurations as `StorageLocationConfig` would be provided for the intermediary segment location.
MiddleManagers and indexers would clean up intermediary segments using the below mechanism.
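For illustration, the location entries could reuse the `StorageLocationConfig` shape; the property name below is hypothetical, not something specified in this proposal:

```json
// hypothetical runtime property; the value reuses the StorageLocationConfig shape
// druid.worker.intermediarySegmentsLocations=
[
  { "path": "/mnt/druid/intermediary-segments", "maxSize": 100000000000 }
]
```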
New MiddleManager APIs
- `/druid/worker/v1/shuffle/tasks/{supervisorTaskId}/partition?start={startTimeOfSegment}&end={endTimeOfSegment}&partitionId={partitionId}`
  Returns all partial segments generated by sub tasks of the given supervisor task, falling in the given interval, and having the given partitionId.
- `/druid/worker/v1/shuffle/tasks/{supervisorTaskId}`
  Removes all partial segments generated by sub tasks of the given supervisor task.
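A sketch of how a phase 2 task and the final cleanup might call these endpoints; the task id and timestamps are made up, and the HTTP methods (GET for reads, DELETE for cleanup) are assumptions:

```
GET /druid/worker/v1/shuffle/tasks/index_parallel_example/partition?start=2019-01-01T00:00:00.000Z&end=2019-01-02T00:00:00.000Z&partitionId=0

DELETE /druid/worker/v1/shuffle/tasks/index_parallel_example
```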
New metrics & task reports
- `ingest/task/time`: how long each task took
- `ingest/task/bytes/processed`: how much data each task processed
- `ingest/shuffle/bytes`: how much data each middleManager served
- `ingest/shuffle/requests`: how many requests each middleManager served
Task failure handling
Task failure handling is the same as the current behavior.
Rationale
There could be two alternative designs for the shuffle system, especially for the intermediary data server.
MiddleManager (or indexer) as the intermediary data server is the simplest design. In an alternative design, phase 1 tasks could serve intermediary data for shuffle. In this alternative, phase 1 tasks would have to be guaranteed to run until phase 2 is finished, which means their resources would be held until phase 2 is finished. This was rejected in favor of better resource utilization.
Another alternative is that a single set of tasks would process both phase 1 and phase 2. This design was rejected because it's not very flexible in using cluster resources efficiently.
Operational impact
`maxRowsPerSegment`, `numShards`, `partitionDimensions`, and `maxTotalRows` in tuningConfig will be deprecated for indexTask. `partitionsSpec` will be provided instead. The deprecated values will be removed in the next major release after the upcoming one.
Unit tests and integration tests will be implemented. I will also test this with our internal cluster once it's ready.
Future work