Native parallel batch indexing with shuffle #8061

Closed
jihoonson opened this issue Jul 10, 2019 · 35 comments

@jihoonson
Contributor

jihoonson commented Jul 10, 2019

Motivation

General motivation for native batch indexing is described in #5543.

We now have the parallel index task, but it doesn't support perfect rollup yet because it lacks a shuffle system.

Proposed changes

I propose adding a new mode to the parallel index task which supports perfect rollup via a two-phase shuffle.

Two phase partitioning with shuffle

Phase 1

Each task partitions data by segmentGranularity and then by a hash or range key of some dimensions.

Phase 2

Each task reads a set of partitions created by the tasks of Phase 1 and creates a segment per partition.

PartitionsSpec support for IndexTask and ParallelIndexTask

PartitionsSpec is the way to define secondary partitioning and is currently used by HadoopIndexTask. This interface should be generalized as below.

public interface PartitionsSpec
{
  @Nullable
  Integer getNumShards();
  
  @Nullable
  Integer getMaxRowsPerSegment(); // or getTargetRowsPerSegment()
  
  @Nullable
  List<String> getPartitionDimensions();
}
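
For illustration only, a hash-based implementation of this interface might look roughly like the sketch below. The class name, fields, and Jackson annotations are assumptions for the example and may not match the actual HashedPartitionsSpec in Druid.

import java.util.Collections;
import java.util.List;
import javax.annotation.Nullable;
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

/**
 * Sketch of a hash-based PartitionsSpec for perfect rollup (illustrative only).
 */
public class HashedPartitionsSpecSketch implements PartitionsSpec
{
  @Nullable
  private final Integer numShards;                 // number of segments per time chunk
  @Nullable
  private final Integer maxRowsPerSegment;         // used to derive numShards when numShards is null
  private final List<String> partitionDimensions;  // dimensions hashed to compute the partition key

  @JsonCreator
  public HashedPartitionsSpecSketch(
      @JsonProperty("numShards") @Nullable Integer numShards,
      @JsonProperty("maxRowsPerSegment") @Nullable Integer maxRowsPerSegment,
      @JsonProperty("partitionDimensions") @Nullable List<String> partitionDimensions
  )
  {
    this.numShards = numShards;
    this.maxRowsPerSegment = maxRowsPerSegment;
    this.partitionDimensions = partitionDimensions == null ? Collections.emptyList() : partitionDimensions;
  }

  @Nullable
  @Override
  public Integer getNumShards()
  {
    return numShards;
  }

  @Nullable
  @Override
  public Integer getMaxRowsPerSegment()
  {
    return maxRowsPerSegment;
  }

  @Nullable
  @Override
  public List<String> getPartitionDimensions()
  {
    return partitionDimensions;
  }
}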

Hadoop tasks can use an extended interface which is more specialized for Hadoop.

public interface HadoopPartitionsSpec extends PartitionsSpec
{
  Jobby getPartitionJob(HadoopDruidIndexerConfig config);
  boolean isAssumeGrouped();
  boolean isDeterminingPartitions();
}

IndexTask currently provides duplicate configurations for partitioning in its tuningConfig such as maxRowsPerSegment, maxTotalRows, numShards, and partitionDimensions. These configurations will be deprecated and the indexTask will support PartitionsSpec instead.

To support maxRowsPerSegment and maxTotalRows, a new partitionsSpec could be introduced.

/**
 * PartitionsSpec for best-effort rollup
 */
public class DynamicPartitionsSpec implements PartitionsSpec
{
  // Maximum number of rows in the currently open segment before it is handed off
  private final int maxRowsPerSegment;
  // Maximum total number of rows across segments awaiting push before an intermediate handoff
  private final int maxTotalRows;
}

This partitionsSpec will be supported as a new configuration in the tuningConfig of IndexTask and ParallelIndexTask.
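
As a minimal sketch, assuming the usual semantics that maxRowsPerSegment closes the currently open segment and maxTotalRows triggers an intermediate push of all segments built so far, the two thresholds could be checked like this (class, method, and parameter names are illustrative, not actual IndexTask internals):

// Minimal sketch (not actual Druid code) of how the two DynamicPartitionsSpec thresholds
// could be checked after appending a row; all names are illustrative.
final class DynamicPartitioningSketch
{
  enum HandoffAction { NONE, PUSH_CURRENT_SEGMENT, PUSH_ALL_SEGMENTS }

  static HandoffAction checkThresholds(
      long rowsInCurrentSegment,
      long totalRowsWaitingToBePushed,
      int maxRowsPerSegment,
      int maxTotalRows
  )
  {
    if (totalRowsWaitingToBePushed >= maxTotalRows) {
      return HandoffAction.PUSH_ALL_SEGMENTS;      // intermediate handoff to bound local disk/memory usage
    }
    if (rowsInCurrentSegment >= maxRowsPerSegment) {
      return HandoffAction.PUSH_CURRENT_SEGMENT;   // close the current segment and open a new one
    }
    return HandoffAction.NONE;
  }
}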

New parallel index task runner to support secondary partitioning

ParallelIndexSupervisorTask is the supervisor task which orchestrates the parallel ingestion. It's responsible for spawning and monitoring sub tasks, and publishing created segments at the end of ingestion.

It uses ParallelIndexTaskRunner to run single-phase parallel ingestion without shuffle. To support two-phase ingestion, we can add a new implementation of ParallelIndexTaskRunner, TwoPhaseParallelIndexTaskRunner. ParallelIndexSupervisorTask will choose the new runner if partitionsSpec in tuningConfig is HashedPartitionsSpec or RangePartitionsSpec.

This new taskRunner does the following (a rough orchestration sketch follows the list):

  • Add TwoPhaseParallelIndexTaskRunner as a new runner for the supervisor task
    • Spawns tasks for determining partitions (if numShards is missing in tuningConfig)
    • Spawns tasks for building partial segments (phase 1)
    • When all tasks of the phase 1 finish, spawns new tasks for building the complete segments (phase 2)
    • Each Phase 2 task is assigned one or multiple partitions
      • The assigned partition is represented as an HTTP URL
  • Publishes the segments reported by phase 2 tasks.
  • Triggers intermediary data cleanup when the supervisor task is finished regardless of its last status.
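
A rough sketch of this orchestration is below. The types and method names are hypothetical, subtask monitoring and retries are omitted, and phase 2 tasks would of course run in parallel rather than in a simple loop:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical orchestration sketch for TwoPhaseParallelIndexTaskRunner; not actual Druid code.
final class TwoPhaseRunnerSketch
{
  interface PartialSegmentLocation
  {
    int partitionId();  // the real design keys partitions on (timeChunk, partitionId); simplified here
  }

  interface DataSegment {}

  // Supplied by the surrounding task framework in a real implementation.
  interface SubTasks
  {
    int determineNumShards();                                             // pre-phase, cardinality-based
    List<PartialSegmentLocation> runPhase1Tasks(int numShards);           // partial segments on middleManagers
    List<DataSegment> runPhase2Task(List<PartialSegmentLocation> group);  // fetch, merge, push
  }

  static List<DataSegment> run(SubTasks subTasks, Integer configuredNumShards, int numSecondPhaseTasks)
  {
    // Determine partitions only if numShards is missing from the tuningConfig.
    int numShards = configuredNumShards != null ? configuredNumShards : subTasks.determineNumShards();

    // Phase 1: partition input by time chunk and hash/range key into partial segments.
    List<PartialSegmentLocation> partialSegments = subTasks.runPhase1Tasks(numShards);

    // Phase 2: each task is assigned one or more whole partitions, then fetches, merges, and pushes them.
    List<DataSegment> pushedSegments = new ArrayList<>();
    for (List<PartialSegmentLocation> group : assignPartitions(partialSegments, numSecondPhaseTasks)) {
      pushedSegments.addAll(subTasks.runPhase2Task(group));
    }

    // The supervisor task then publishes pushedSegments and triggers intermediary data cleanup.
    return pushedSegments;
  }

  // Group partial segments by partitionId and hand whole partitions to phase 2 tasks round-robin.
  static List<List<PartialSegmentLocation>> assignPartitions(List<PartialSegmentLocation> all, int numTasks)
  {
    Map<Integer, List<PartialSegmentLocation>> byPartition = new HashMap<>();
    for (PartialSegmentLocation location : all) {
      byPartition.computeIfAbsent(location.partitionId(), k -> new ArrayList<>()).add(location);
    }
    List<List<PartialSegmentLocation>> groups = new ArrayList<>();
    for (int i = 0; i < numTasks; i++) {
      groups.add(new ArrayList<>());
    }
    int next = 0;
    for (List<PartialSegmentLocation> partition : byPartition.values()) {
      groups.get(next++ % numTasks).addAll(partition);
    }
    return groups;
  }
}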

The supervisor task provides an additional configuration in its tuningConfig, i.e., numSecondPhaseTasks or inputRowsPerSecondPhaseTask, to control the parallelism of phase 2. This will be improved to automatically determine the optimal parallelism in the future.

New sub task types

Partition determine task
  • Similar to what indexTask or HadoopIndexTask do.
  • Scan the whole input data and collect HyperLogLog per interval to compute approximate cardinality.
  • numShards could be computed as below:
        numShards = (int) Math.ceil(
            (double) numRows / Preconditions.checkNotNull(maxRowsPerSegment, "maxRowsPerSegment")
        );
Phase 1 task
  • Read data via the given firehose
  • Partition data by segmentGranularity and then by hash or range (and aggregate if rollup is enabled); see the bucketing sketch after this list
  • Each partition should be addressable by (supervisorTaskId, timeChunk, partitionId)
  • Write partitioned segments to local disk. Multiple disks can be configured, and each task would write partitions in a round-robin manner to utilize disk bandwidth efficiently
Phase 2 task
  • Download all partial segments from middleManagers where phase 1 tasks ran.
  • Merge all fetched segments into a single segment per partitionId.
  • Push the merged segments and report them to the supervisor task.
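
As a rough illustration of the hash case of the phase 1 bucketing described above (the real implementation may use a different hash function and key encoding; this is only a sketch):

import java.util.List;
import java.util.Objects;

// Hypothetical sketch: bucket a row into (timeChunkStart, partitionId) for the hash case.
// segmentGranularity bucketing is reduced to a simple "floor to granularity" for illustration.
final class Phase1BucketingSketch
{
  static long timeChunkStart(long timestampMillis, long granularityMillis)
  {
    return timestampMillis - Math.floorMod(timestampMillis, granularityMillis);
  }

  static int partitionId(List<Object> partitionDimensionValues, int numShards)
  {
    // Hash the values of the configured partitionDimensions and map the hash into [0, numShards).
    int hash = Objects.hash(partitionDimensionValues.toArray());
    return Math.floorMod(hash, numShards);
  }
}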

MiddleManager as Intermediary data server

MiddleManager (and new Indexer) should be responsible for serving intermediary data during shuffle.

Each phase 1 task partitions input data and generates partitioned segments. These partitioned segments are stored on the local disk of the middleManager (or the indexer proposed in #7900). The partitioned segments would be stored under a /configured/prefix/supervisorTaskId/ directory. The same configuration as StorageLocationConfig would be provided for the intermediary segment location.

MiddleManagers and indexers would clean up intermediary segments using the mechanism below (a bookkeeping sketch follows the list).

  • MM will keep expiration time in memory. This expiration time is initialized with current time + configured timeout.
  • MM periodically checks whether any new partitions have been created for new supervisorTasks and initializes the expiration time if it finds any.
  • When a subtask accesses a partition, the expiration time for the supervisorTask is initialized, or updated if it already exists.
  • MM periodically checks those expiration times for supervisorTasks. If it finds any expired supervisorTask, then it will ask the overlord if the task is still running. If not, MM will remove all partitions for the supervisorTask.
  • The overlord will also send a cleanup request to MM when the supervisorTask is finished. This will clean up the expiration time.
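
A minimal sketch of this bookkeeping, assuming a per-supervisorTask expiration map kept on the middleManager (class and method names are hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch of the middleManager's expiration bookkeeping; not actual Druid code.
final class IntermediaryDataCleanupSketch
{
  private final Map<String, Long> expirationBySupervisorTaskId = new ConcurrentHashMap<>();
  private final long timeoutMillis;

  IntermediaryDataCleanupSketch(long timeoutMillis)
  {
    this.timeoutMillis = timeoutMillis;
  }

  // Called when a new partition directory is discovered or a subtask reads a partition.
  void touch(String supervisorTaskId)
  {
    expirationBySupervisorTaskId.put(supervisorTaskId, System.currentTimeMillis() + timeoutMillis);
  }

  // Periodic check: returns supervisorTaskIds whose partitions should be deleted.
  // isTaskRunning would call the overlord's task status API in a real implementation.
  List<String> findExpired(Predicate<String> isTaskRunning)
  {
    final long now = System.currentTimeMillis();
    final List<String> expired = new ArrayList<>();
    expirationBySupervisorTaskId.forEach((taskId, expiration) -> {
      if (expiration <= now && !isTaskRunning.test(taskId)) {
        expired.add(taskId);
      }
    });
    return expired;
  }

  // Called on the overlord's cleanup request, or after local deletion, to drop the entry.
  void remove(String supervisorTaskId)
  {
    expirationBySupervisorTaskId.remove(supervisorTaskId);
  }
}
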
New MiddleManager APIs
  • GET /druid/worker/v1/shuffle/tasks/{supervisorTaskId}/partition?start={startTimeOfSegment}&end={endTimeOfSegment}&partitionId={partitionId}

Returns all partial segments generated by sub tasks of the given supervisor task, falling within the given interval and having the given partitionId (a URL-construction sketch follows below).

  • DELETE /druid/worker/v1/shuffle/tasks/{supervisorTaskId}

Removes all partial segments generated by sub tasks of the given supervisor task.
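
For example, a phase 2 task could construct the fetch URL for the GET endpoint above roughly as follows. The host and port come from whatever the middleManager advertises, and parameter URL-encoding is omitted; this is only a sketch:

import java.net.URI;

// Hypothetical sketch: build the partition-fetch URI for the GET endpoint described above.
final class ShuffleClientSketch
{
  static URI partitionFetchUri(
      String middleManagerHostAndPort,
      String supervisorTaskId,
      String segmentStartIso,
      String segmentEndIso,
      int partitionId
  )
  {
    return URI.create(
        "http://" + middleManagerHostAndPort
        + "/druid/worker/v1/shuffle/tasks/" + supervisorTaskId
        + "/partition?start=" + segmentStartIso
        + "&end=" + segmentEndIso
        + "&partitionId=" + partitionId
    );
  }
}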

New metrics & task reports

  • ingest/task/time: how long each task took
  • ingest/task/bytes/processed: how much data each task processed
  • ingest/shuffle/bytes: how much data each middleManager served
  • ingest/shuffle/requests: how many requests each middleManager served

Task failure handling

Task failure handling is the same as the current behavior.

  • If the supervisorTask process is killed normally, the stopGracefully method is called, which kills all running subtasks. If it's killed abnormally, the parallel index task doesn't handle this case for now.
  • The supervisorTask monitors subtask statuses and counts how many subtasks have failed to process the same input. If it notices more failures than the configured maxRetry, it regards that input as unprocessable and exits with an error. Otherwise, it respawns a new task which processes the same input.

Rationale

Two alternative designs were considered for the shuffle system, especially for the intermediary data server.

MiddleManager (or indexer) as the intermediary data server is the simplest design. In an alternative design, phase 1 tasks could serve intermediary data for shuffle themselves. In that alternative, phase 1 tasks must be guaranteed to keep running until phase 2 is finished, which means their resources would be held for the entire duration of phase 2. This design is rejected in favor of better resource utilization.

Another alternative is for a single set of tasks to process both phase 1 and phase 2. This design is rejected because it is less flexible in using cluster resources efficiently.

Operational impact

maxRowsPerSegment, numShards, partitionDimensions, and maxTotalRows in tuningConfig will be deprecated for indexTask. partitionsSpec will be provided instead. The deprecated values will be removed in the next major release after the upcoming one.

Test plan

Unit tests and integration tests will be implemented. I will also test this with our internal cluster once it's ready.

Future work

  • The optimal parallelism for phase 2 should be determined automatically by collecting statistics during phase 1.
  • To avoid the "too many open files" problem, the middleManager should be able to smoosh the intermediary segments into several large files.
  • If rollup is set, it could be better to combine intermediate data in the middleManager before sending it. This would be similar to Hadoop's combiner.
    • This could be implemented to support seamless incremental segment merge in middleManager.
  • In Phase 1, tasks might skip index generation for faster shuffle. In this case, Phase 2 tasks should be able to generate the complete indexes.
@himanshug
Contributor

SGTM in general

When the supervisor task is finished (either succeeded or failed), the overlord sends cleanup requests with supervisorTaskId to all middleManagers (and indexers)

does overlord treat "supervisor" task as special task to be able to initiate cleanup requests? what if the MM is down temporarily or if cleanup fails for some reason ? In addition to overlord cleanup requests, It might be good for middleManagers to periodically check whether "supervisor" task is running or not and do the self cleanup.
also maybe have some MM level configuration around maximum disk space that can be utilized for intermediary data.

The supervisor task provides an additional configuration in its tuningConfig, i.e., numSecondPhaseTasks or inputRowsPerSecondPhaseTask, to support control of parallelism of the phase 2. This will be improved to automatically determine the optimal parallelism in the future.

I think a user defined upper limit could always exist in all "supervisor" tasks that spawn extra tasks so that user can plan worker capacity knowing how many tasks at a maximum would be running via parallel [shuffle] task.

@jihoonson
Contributor Author

jihoonson commented Jul 11, 2019

@himanshug thanks for taking a look!

does overlord treat "supervisor" task as special task to be able to initiate cleanup requests? what if the MM is down temporarily or if cleanup fails for some reason ? In addition to overlord cleanup requests, It might be good for middleManagers to periodically check whether "supervisor" task is running or not and do the self cleanup.

Ah, this is a good point. To handle middleManager failure, a sort of self-cleanup can be triggered when some amount of time has elapsed since the last access to any partition for a supervisorTask. Does this sound good?

also maybe have some MM level configuration around maximum disk space that can be utilized for intermediary data.

Thanks for reminding me of this. Forgot to add it to the proposal. I'm thinking to use the existing StorageLocationConfig for this. To fully utilize the disk bandwidth, the partitions of the same supervisorTaskId will be assigned in a round-robin fashion. Will update the proposal shortly.

I think a user defined upper limit could always exist in all "supervisor" tasks that spawn extra tasks so that user can plan worker capacity knowing how many tasks at a maximum would be running via parallel [shuffle] task.

This is already supported with maxNumSubTasks (https://druid.apache.org/docs/latest/ingestion/native_tasks.html#tuningconfig). maxNumSubTasks limits the total number of subtasks at any time while a parallel index task is running. numSecondPhaseTasks is somewhat different. It's the total number of phase 2 tasks, and the supervisor task will regard phase 2 as succeeded once numSecondPhaseTasks phase 2 tasks have succeeded.

@himanshug
Contributor

To handle middleManager failure, a sort of self-cleanup can be triggered when some amount of time is elapsed since the last access to any partition for a supervisorTask. Does this sound good?

anything is good as long as data is not left behind forever in error scenarios :) , that said it might be more work to track access time of the partition...

  • are you planning on using the OS managed file access time ? note that many FS are configured to not update access time on reads due to associated IO overhead
  • are you planning to track access time in MM memory ? what happens if MM process gets restarted for some reason and then it will lose track of that.
    I was imagining a simpler world where MM just periodically scans directories where partition files could be stored, accumulates a list of supervisor taskIds (for which some data is stored), makes a call to overlord to get the state of all those tasks, and then, based on the state (failed, completed, etc.) returned, deletes data for those. that said, anything is fine really.

it would be extra nice to document the failure cases and expected behavior e.g.

  • supervisor task process (or MM running it) crashed while phase1/2 tasks were running
  • one or more of phase1 tasks crashed
  • one or more of phase2 tasks crashed
    will we try to recover in some of those cases? for example, if one or more phase1/2 tasks crashed, will supervisor task retry them? are any/all of these tasks restorable, i.e. return true for canRestore()?

it is not necessary for things to be recoverable as most of this would be iterated upon later but we can just document what to expect in such failure scenarios.

@jihoonson
Contributor Author

does overlord treat "supervisor" task as special task to be able to initiate cleanup requests?

Oh, BTW, there's nothing special with supervisorTasks. I'm thinking to add a callback system to indexing service, like some callback functions can be executed at some predefined stage, e.g., after segments are published or some tasks are finished.

@jihoonson
Contributor Author

are you planning to track access time in MM memory ? what happens if MM process gets restarted for some reason and then it will lose track of that.

Yes, this is what I'm thinking. What I have in mind is pretty similar to yours, but with a bit of additional stuff.

  • MM will keep expiration time in memory.
  • This expiration time is initialized with current time + configured timeout.
  • MM periodically checks there are any new partitions created for new supervisorTasks and initializes the expiration time if it finds any.
  • When a subtask accesses a partition, the expiration time for the supervisorTask is initialized or updated if it's already there.
  • MM periodically checks those expiration times for supervisorTasks. If it finds any expired supervisorTask, then it will ask the overlord if the task is still running. If not, MM will remove all partitions for the supervisorTask.
  • The overlord will also send a cleanup request to MM when the supervisorTask is finished. This will clean up the expiration time.

I think it's not very complex, but will reduce the number of calls to overlord API, so it would be good.

it would be extra nice to document the failure cases and expected behavior e.g.

Ah, I didn't document them since it would be the same as the existing parallel index task behavior.
Here are descriptions of how it handles failures currently.

supervisor task process (or MM running it) crashed while phase1/2 tasks were running

If the supervisorTask process is killed normally, the stopGracefully method is called, which kills all running subtasks. If it's killed abnormally, the parallel index task doesn't handle this case for now.

one or more of phase1/2 tasks crashed

The supervisorTask monitors subtask statuses and counts how many subtasks have failed to process the same input. If it notices more failures than the configured maxRetry, it regards that input as unprocessable and exits with an error. Otherwise, it respawns a new task which processes the same input.

I'll add these to the proposal.

are any/all of these tasks restorable i.e. return true for canRestore() ?

Good point. No task is restorable now, but I think it might be useful to support rolling updates in the future. Parallel index tasks are supposed to run for a long time, so it would be nice if they can be stopped/restored during a rolling update.

@himanshug
Contributor

thanks for the explanations.

I think it's not very complex, but will reduce the number of calls to overlord API, so it would be good.

that is fine and I am guessing it will batch task status call i.e. ask status of multiple supervisor tasks in one overlord api request.

If supervisorTask killed abnormally, then parallel index task doesn't handle this case for now.

will the worker tasks complete and exit eventually or they will be left running till manual intervention ?

we might consider making supervisor real "supervisor" like Kafka instead of "task" so that they get special powers to manage things better ? but I think I am remembering some comment that they are made tasks because tasks are better equipped to work with locking framework available. one option could be to let spawned worker task do the locking , since all worker tasks would be in same group so multiple worker tasks trying to obtain same lock would still work..... this is unverified wishful thinking :)

@jihoonson
Contributor Author

that is fine and I am guessing it will batch task status call i.e. ask status of multiple supervisor tasks in one overlord api request.

Correct.

will the worker tasks complete and exit eventually or they will be left running till manual intervention ?

All subtasks report the pushed segments to the supervisor task at the end of indexing. So, if the supervisor task is not running, then they would end up being failed at this stage. However, this is still quite annoying since they will occupy middleManager resources unnecessarily. I guess we could add some health check to subtasks for the supervisor task.

we might consider making supervisor real "supervisor" like Kafka instead of "task" so that they get special powers to manage things better ? but I think I am remembering some comment that they are made tasks because tasks are better equipped to work with locking framework available. one option could be to let spawned worker task do the locking , since all worker tasks would be in same group so multiple worker tasks trying to obtain same lock would still work..... this is unverified wishful thinking :)

Yeah, my original intention was to add a "supervisor" like Kafka rather than a new task type, but I changed my mind to use the existing task lock type system. And now I'm inclined to keep the current design of supervisor task because

  • "supervisor" is designed for more like stream ingestion for each dataSource. It runs forever once it's submitted and the history of its spec changes is recorded in metadata store.
  • "supervisor" design is less scalable since the overlord handles all supervisors and requests/responses for their tasks. One of our customers already had some issue with this. They were running more than 1000 tasks for each Kafka supervisor. Kafka ingestion got stuck because their overlord couldn't handle too many HTTP requests from tasks in time.

For the second reason, I would upvote to even demote the Kafka/Kinesis supervisor to the supervisor task.

@himanshug
Contributor

thanks , that reasoning about supervisors makes sense.

I guess we could add some health check to subtasks for the supervisor task.

for their early termination, yes .

@jihoonson
Contributor Author

Thanks! I updated the proposal to include things discussed.

@nishantmonu51
Member

nishantmonu51 commented Jul 30, 2019

I went through the proposal, and I see one issue with this which needs more thought.
The above approach may break with autoscaling:
Consider the case where a phase 1 task ran on a specific middleManager and finished (this was the only task running on it), and the autoscaler decided to kill this MM as there was no task running on it.
Phase 2 tasks will not be able to locate the data from the phase 1 task.

can we consider using an intermediate directory on the deep storage for storing intermediate segments and phase 2 tasks can read from there ?

Am I understanding it correctly, or is it supposed to work with the current design?

@jihoonson
Contributor Author

@nishantmonu51 good point! We currently have two provisioning strategies for auto scaling, i.e., simple and pendingTaskBased, and both of them look to terminate middleManagers if it has been a long time since they completed the last task. I think there could be two options available to avoid this issue.

  • Improve the provisioning strategy to consider intermediary data in middleManagers. If they are still serving intermediary data for parallel batch tasks, then the auto scaler shouldn't terminate them.
  • As you mentioned, store intermediary data on the deep storage.

I'm inclined to the first approach because 1) it's more efficient to read data from middleManagers than from deep storage and 2) intermediary data cleanup for deep storage could be more complex than that for middleManagers (it's still doable though). What do you think?

BTW, no matter what way we go, I think this issue could be fixed in a follow-up PR. Does this make sense?

@nishantmonu51
Member

Regarding the autoscaling, there are users who have developed autoscaling strategies outside Druid, e.g. Kubernetes users use simple Horizontal Pod Autoscaler rules, and keeping the autoscaling strategies simple helps them write the rules easily.

Thinking more on the lines of deep storage,

  • data cleanup could be easy if we follow a hierarchy, e.g the baseTaskDir/supervisor-task-id of the supervisor task can serve as the base path for the intermediary location and MM can just ensure that the base path and any underlying sub-dirs are cleaned up when the supervisor task fails.
  • I see MM as very lightweight processes that have the responsibility of orchestration of peons, monitoring and cleaning any leftover files/data. It would be great if we can keep it that way.
  • If we use deep storage as handoff for intermediate segments, we probably do not need any additional complexity in the MiddleManager to serve as an Intermediary Data Server.
  • Keeping MM simple would optionally keep us open to expanding our TaskRunners to directly leverage modern container orchestration platforms like kubernetes, marathon, Yarn, Yunicorn etc, where TaskRunners schedule peons directly in e.g. a kubernetes cluster. See the below issues; although marked as stale, I still believe there is value in those proposals -

Efficiency wise I agree that reading data from deep storage (especially when using s3) would be slower than reading it from another MM, but till now I have seen creation of Druid indexes and final segments to be the major bottleneck instead of data transfer.

What do you think ?

Agreed, it can be done in a follow-up PR.

@jihoonson
Contributor Author

I think two big changes are happening in Druid's indexing service recently. One would be the native parallel indexing and another could be the new Indexer module (#7900). I think this Indexer module would be better than the middleManager in terms of memory management, monitoring, resource scheduling and sharing and hope it would be able to replace the middleManager in the future. It would be nice if we can keep the current behavior for users even with these changes, but I think it wouldn't be easy to keep the current behavior especially with the Indexer.

data cleanup could be easy if we follow a hierarchy, e.g the baseTaskDir/supervisor-task-id of the supervisor task can serve as the base path for the intermediary location and MM can just ensure that the base path and any underlying sub-dirs are cleaned up when the supervisor task fails.

Yeah, it would be pretty similar to using MM as the intermediary data server with respect to intermediary data structure and management. But I think it could be more complex when it comes to operational stuff like permissions or failure handling. Not only the middleManager but also the overlord should be able to remove stale intermediary data because the middleManager could miss the supervisor task failure. The attempt to delete intermediary data could also fail, and that would be easier to handle on a local disk. I mean, it's still doable with deep storage but it would be a bit more complex.

I see MM as very lightweight processes that have the responsiblity of orchestration of peons, monitoring and cleaning any leftover files/data. It would be great if we can keep it that way.

Hmm, the additional functionality would be just serving intermediary data and cleaning them up periodically. Maybe the middleManager would need a bit more memory but it shouldn't be really big. Would you elaborate more on your concern?

Looking at the code around AutoScaler and Provisioner, if the overlord can collect the remaining intermediary data on each middleManager, this information could be used for auto scaling. I guess this information could be easily added to the existing code base since the overlord is already collecting some data from middleManagers (like host name and port, etc). And since the AutoScaler and Provisioner are executed in the overlord process, I guess we can easily provide this information to the Provisioner. Looks like the implementation wouldn't be that difficult?

Efficiency wise I agree that reading data from deep storage (especially when using s3) would be slower than reading it from another MM, but till now I have seen creation of Druid indexes and final segments to be the major bottleneck instead of data transfer.

That sounds interesting. I haven't used Hadoop ingestion much, and I don't know how much data transfer would contribute to the total ingestion performance. Do you have any rough number on this?

@jihoonson
Contributor Author

#6533 looks like another issue in auto scaling. Looks like it needs more information about middleManager state anyway.

@jihoonson
Contributor Author

jihoonson commented Aug 2, 2019

Keeping MM simple would optionally keep us open to expanding our TaskRunners to directly leverage modern container orchestration platforms like kubernetes, marathon, Yarn, Yunicorn etc, where TaskRunners schedule peons directly in e.g. a kubernetes cluster.

@nishantmonu51 thinking more about this, it could be easier and more useful for these users if intermediary data is stored in deep storage. I think we can make this pluggable, like using MM or deep storage for intermediary data depending on the configuration. Does this make sense?

@quenlang

Could I use the index_parallel task to reindex my segments for a multitenancy case? Just like single-dimension partitioning by tenant_id. If so, how should I configure it with the tuningConfig option?
I had a try but it does not work.

@jihoonson
Contributor Author

Hi @quenlang, you should be able to do it. An example tuningConfig could be

    "tuningConfig" : {
      "type" : "index_parallel",
      "maxNumConcurrentSubTasks": 2, // max number of sub queries that can be run at the same time
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 3 // number of segments per time chunk after compaction
      }
    }

@quenlang

@jihoonson Thanks a lot!

@himanshug
Contributor

himanshug commented Oct 1, 2019

@jihoonson is there any protection against getting into a deadlock in the parallel task implementation? For example, and for simplicity, consider the case of a single-phase parallel task ... say a user submitted 10 parallel tasks .. each of which is supposed to create 2 sub tasks ... Now say you have only 8 task slots (all of which are occupied by supervisor tasks) ... they will submit subtasks which will never run .... is this situation possible or is there any protection to ensure this deadlock doesn't happen? I am noticing something like that on a cluster running loads of single-phase parallel tasks, not entirely sure.

@jihoonson
Contributor Author

@himanshug good point. I don't think we have any protection against this kind of scenario yet. One possible short term workaround could be checking timeout for pending sub tasks. If timeout expires, the supervisor task can kill its sub tasks, which will lead to killing itself in the end either by killing sub tasks 3 times just like in general failure handling or killing itself immediately in this case. In the future, we may need to improve task scheduling to consider task type to avoid this kind of deadlock.

@himanshug
Contributor

@jihoonson hmmm, that solution will fail the indexing ... if there are a lot of parallel indexing tasks submitted (relative to total available task slots) at the same time ... indexing progress would be very very slow with many of them failing.
One workaround could be to make sure that all supervisor tasks go to a separate set of middleManagers dedicated to running only supervisor tasks (and no other tasks are sent to these) ... we could probably use #7066 (comment) to achieve that.

that said, above workaround isn't great for usability in general as simple druid cluster setup should just work.

In the long run, I think we need to do something to give special treatment to supervisor tasks in general and run them differently without occupying the task slots.

@jihoonson
Contributor Author

In the long run, I think we need to do something to give special treatment to supervisor tasks in general and run them differently without occupying the task slots.

Interesting. Another possible workaround for now is you can explicitly specify taskResource to have 0 requiredCapacity for supervisor tasks.

For the long run idea, I'm not sure if it's fine to not reserve any resource, or it should reserve a minimum amount of resource. Maybe we need to distinguish different resource types (e.g., CPU, memory, disk, etc) to schedule tasks better.

@himanshug
Contributor

himanshug commented Oct 2, 2019

with 0 requiredCapacity overlord would try and send them to MM but MM would reject them saying its full, no ?

For the long run idea, I'm not sure if it's fine to not reserve any resource, or it should reserve a minimum amount of resource. Maybe we need to distinguish different resource types (e.g., CPU, memory, disk, etc) to schedule tasks better.

yes, I meant reserve something else and not task slots i.e. maybe a "supervisor slot" concept.
For other reliability reasons, maybe also a first class "Supervisor" class like "Task" ... handling of "Supervisor" could be different than that of a Task e.g. all Supervisors are always restartable , unlike Tasks they are not failed if a MM running them dies and instead they are re-run etc.

@jihoonson
Contributor Author

with 0 requiredCapacity overlord would try and send them to MM but MM would reject them saying its full, no ?

Hmm, I just tested with RemoteTaskRunner and it worked. Is MM supposed to reject it?

yes, I meant reserve something else and not task slots i.e. maybe a "supervisor slot" concept.

Sounds good. I was thinking a task scheduler respecting task priority. Like, it can schedule high priority tasks more often, but also should be able to avoid starvation for low priority tasks. Maybe the supervisor slot concept can go along with this idea together.

For other reliability reasons, maybe also a first class "Supervisor" class like "Task" ... handling of "Supervisor" could be different than that of a Task e.g. all Supervisors are always restartable , unlike Tasks they are not failed if a MM running them dies and instead they are re-run etc.

Also sounds good.

@himanshug
Contributor

with 0 requiredCapacity overlord would try and send them to MM but MM would reject them saying its full, no ?

Hmm, I just tested with RemoteTaskRunner and it worked. Is MM supposed to reject it?

looked at the code and workers just do whatever overlord tells them, so behavior you saw is expected and required to fulfill the contract of requiredCapacity = 0. however, that means overlord would just keep assigning supervisor tasks to MMs as they come . Even if requiredCapacity = 0, in reality they do take resources on the MM and would eventually kill it if enough number of those show up.

@himanshug
Contributor

@jihoonson I am thinking of adding a tuning config in ParallelIndexTuningConfig, minSplitsForParallelMode. ParallelIndexSupervisorTask would run in parallel only if firehoseFactory.getNumSplits() >= minSplitsForParallelMode, or else it will run sequentially.

default value of minSplitsForParallelMode=0 to retain existing behavior .

in my case, I would set minSplitsForParallelMode=2. most tasks have numSplits = 1 so they will just run sequentially and only a few of them would really run parallel with subtasks, which would reduce the likelihood of deadlock. also it is probably more efficient to not go parallel if there is just 1 split.

what do you think ?

@himanshug
Contributor

nevermind, I tried the above and it didn't help my case as, contrary to my thinking, most tasks had 2 splits.

running the supervisors on a separate set of workers using #7066 is the only decent thing to do for now.

capistrant added a commit to capistrant/incubator-druid that referenced this issue Oct 7, 2019
This table stated that `index_parallel` tasks were best-effort only. However, this changed with apache#8061 and this documentation update was simply missed.
fjy pushed a commit that referenced this issue Oct 7, 2019
This table stated that `index_parallel` tasks were best-effort only. However, this changed with #8061 and this documentation update was simply missed.
@quenlang

Hi @quenlang, you should be able to do it. An example tuningConfig could be

    "tuningConfig" : {
      "type" : "index_parallel",
      "maxNumConcurrentSubTasks": 2, // max number of sub queries that can be run at the same time
      "forceGuaranteedRollup": true,
      "partitionsSpec": {
        "type": "hashed",
        "numShards": 3 // number of segments per time chunk after compaction
      }
    }

Hi @jihoonson, I split data into different segments by the single dimension tenant_id for the multitenancy scene. In this way, I can get higher query performance for queries that filter on the tenant_id dimension. But the tenant data was skewed, so the segment sizes were not ideal. For example, the max segment size was nearly 18GB but the min segment size was 5MB. Queries on the 18GB segment were much slower than on segments partitioned by all dimensions.

Assume there were a way (e.g. maxRowsPerSegment=10000000) to split the max segment into smaller segments; then the queries would be quick. But maxRowsPerSegment cannot be set when the numShards option is specified.

What do you think about this scenario?
Thanks a lot!

@jihoonson
Contributor Author

Hi @quenlang, I assume you're using hadoop indexing task since the parallel indexing task doesn't support single-dimension range partitioning yet. I think you can set targetPartitionSize and maxPartitionSize to something more optimal. Note that the single-dimension range partitioning doesn't support numShards.

@quenlang

@jihoonson Sorry, I did not describe it clearly. I used the parallel indexing task to reindex segments for single-dimension range partitioning.

                "tuningConfig": {
                        "type": "index_parallel",
                        "maxNumConcurrentSubTasks": 30,
                        "partitionsSpec": {
                                "type": "hashed",
                                "numShards": 10,
                                "partitionDimensions": ["mobile_app_id"]
                        },
                        "forceGuaranteedRollup": true
                }

I got a correct result. It seems perfect except for the data skew. I'm confused: doesn't the parallel indexing task support single-dimension range partitioning yet?
Thanks!

@jihoonson
Contributor Author

Ah, the partitionsSpec you used is the hash-based partitioning. To use the range partitioning, the type of the partitionsSpec should be single_dim instead of hashed. This single-dimension range partitioning is supported only by the hadoop task for now and I believe the native parallel indexing task will support it in the next release.

Hi @jihoonson, I split data into different segments by single dimension tenant_id for the multitenancy scene. In this way, I can get higher query performance that filters on the tenant_id dimension.

I'm pretty surprised by this and wondering how big the performance gain was in your case. Sadly, Druid doesn't support segment pruning in brokers for hash-based partitioning for now (this is supported only for single-dimension range partitioning). That means, even though your segments are partitioned based on the hash value of tenant_id, the broker will send queries to all historicals having any segments overlapping with the query interval no matter what their hash value is. I guess, perhaps you could see some performance improvement when you filter on tenant_id maybe because of less branch misprediction. Can you share your performance benchmark result if you can?

But the tenant data was skewed, so the segment sizes were not ideal. For example, the max segment size was nearly 18GB but the min segment size was 5MB. Queries on the 18GB segment were much slower than on segments partitioned by all dimensions.

One popular way to mitigate the data skewness is adding other columns to the partition key, so that segment partitioning can be better balanced. This will reduce the locality of data, so I guess you may need to find a good combination of columns for the partition key.

@quenlang

@jihoonson Thanks for the quick reply!

I'm pretty surprised by this and wondering how big the performance gain was in your case. Sadly, Druid doesn't support segment pruning in brokers for hash-based partitioning for now (this is supported only for single-dimension range partitioning). That means, even though your segments are partitioned based on the hash value of tenant_id, the broker will send queries to all historicals having any segments overlapping with the query interval no matter what their hash value is. I guess, perhaps you could see some performance improvement when you filter on tenant_id maybe because of less branch misprediction. Can you share your performance benchmark result if you can?

I did not get as big a performance gain as expected. For the small tenants, the query latency only dropped by 50ms-100ms, but for the big tenant, the latency increased by 10s-30s. I think it was caused by the data skewness with the hashed partitioning on tenant_id. The biggest tenant is in an 18GB segment.

That means, even though your segments are partitioned based on the hash value of tenant_id, the broker will send queries to all historicals having any segments overlapping with the query interval no matter what their hash value is.

Even though the broker sends queries to all historicals, only one historical node has the tenant data. So I think the data skewness is the root cause of the large latency.

Ah, the partitionsSpec you used is the hash-based partitioning. To use the range partitioning, the type of the partitionsSpec should be single_dim instead of hashed. This single-dimension range partitioning is supported only by the hadoop task for now and I believe the native parallel indexing task will support it in the next release.

Do you mean that Druid 0.17.0 will support single-dimension range partitioning in native parallel indexing tasks?
Also, if there is a big tenant within a range of tenant_id, how can segment size skewness be avoided with single-dimension range partitioning in the future native parallel indexing task?
Thanks a lot!

@jihoonson
Contributor Author

I did not get as big a performance gain as expected. For the small tenants, the query latency only dropped by 50ms-100ms, but for the big tenant, the latency increased by 10s-30s. I think it was caused by the data skewness with the hashed partitioning on tenant_id. The biggest tenant is in an 18GB segment.

Thank you for sharing! The performance gain does look small but still interesting.

Do you mean that druid 0.17.0 will support single-dimension range partitioning in native parallel indexing tasks?

I hope so. You can check the proposal in #8769.

Also, if there is a big tenant in a range set of tenant_id, how to avoid segment size skewness by single-dimension range partitioning in the future native parallel indexing task?

Hmm, sorry I didn't mean single-dimension range partitioning helps with the skewed partitioning. I'm not sure if there's a better way except adding other columns to the partition key.

@quenlang

@jihoonson Thank you so much.

@jihoonson
Contributor Author

This is done.
