Supervisor for KafkaIndexTask #2656
Conversation
@dclim this fails UT
Should be fixed now, thanks!
Here are more details on what is included and not included in this PR: the Kafka supervisor manages the creation and monitoring of Kafka indexing tasks, which are tasks that ingest events from Kafka using offset ranges instead of time intervals (see #2220). A supervisor is created by providing a supervisor spec which includes fields for (among other things):
The supervisor will then periodically run the following sequence of steps:
As an example, say we have a Kafka topic with 6 partitions that we want to split across 2 tasks, with each task handling 3000 events, and we want a replication factor of 2. This will create 4 tasks:
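To make the example concrete, here is a minimal, hypothetical sketch (not the PR's actual grouping code) of how six partitions could be divided into two task groups with two replicas each:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: assigns 6 partitions to 2 task groups and creates 2 replicas
// per group, with each task handling up to 3000 events before completing.
public class TaskAllocationSketch
{
  public static void main(String[] args)
  {
    final int numPartitions = 6;
    final int taskCount = 2;
    final int replicas = 2;
    final long eventsPerTask = 3000;

    // group id -> partitions handled by that group (simple modulo assignment)
    final Map<Integer, List<Integer>> groups = new HashMap<>();
    for (int partition = 0; partition < numPartitions; partition++) {
      groups.computeIfAbsent(partition % taskCount, g -> new ArrayList<>()).add(partition);
    }

    // each group is instantiated 'replicas' times; replicas share an availability
    // group so they are executed on different nodes
    for (Map.Entry<Integer, List<Integer>> group : groups.entrySet()) {
      for (int replica = 0; replica < replicas; replica++) {
        System.out.printf(
            "task (group %d, availabilityGroup %d, replica %d): partitions %s, up to %d events%n",
            group.getKey(), group.getKey(), replica, group.getValue(), eventsPerTask
        );
      }
    }
    // prints 2 groups x 2 replicas = 4 tasks
  }
}
```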
(Replica tasks are created with the same availability group so that they will be executed on different nodes.) If:
As of now, this implementation is fully functional, but there are a number of areas that could be refined for better efficiency and a better user experience (numbered to facilitate discussion):
@himanshug @gianm as discussed during the dev sync, I've added some details on what is included in this PR and what enhancements could be added. Let me know your thoughts.
private final Map<String, Pair<Supervisor, SupervisorSpec>> supervisors = new HashMap<>();
public Set<String> getSupervisors()
rename to getSupervisorIDs?
sounds good, thanks!
Added improvement 6: the supervisor no longer kills all tasks on startup, but instead checks whether each task matches the supervisor's spec and expected offsets and starts tracking it if it conforms. If it doesn't match, the supervisor will kill the task to prevent ingesting duplicate events.
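A rough sketch of that startup check, assuming (hypothetically) that a task "conforms" when its datasource and starting offsets match what the supervisor expects; the class and field names below are illustrative, not the PR's actual ones:

```java
import java.util.List;
import java.util.Map;

// Illustrative startup pass: adopt running tasks that match the supervisor's
// spec and expected offsets, kill anything that would ingest duplicate events.
public class StartupDiscoverySketch
{
  record RunningTask(String id, String dataSource, Map<Integer, Long> startOffsets) {}

  public static void main(String[] args)
  {
    final String supervisorDataSource = "metrics";
    final Map<Integer, Long> expectedOffsets = Map.of(0, 1200L, 1, 980L);

    final List<RunningTask> discovered = List.of(
        new RunningTask("index_kafka_metrics_abc", "metrics", Map.of(0, 1200L, 1, 980L)),
        new RunningTask("index_kafka_metrics_def", "metrics", Map.of(0, 0L, 1, 0L))
    );

    for (RunningTask task : discovered) {
      final boolean conforms = task.dataSource().equals(supervisorDataSource)
                               && task.startOffsets().equals(expectedOffsets);
      if (conforms) {
        System.out.println("tracking existing task " + task.id());
      } else {
        System.out.println("killing non-conforming task " + task.id());
      }
    }
  }
}
```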
@dclim nice write up. Here are the things I think we need to have, in order of priority. I believe (3) is a MUST have: the supervisor must persist its state in the metadata store, because the overlord going down or changing leadership is a very common case, even during a rolling deployment. On (5), about early stopping of tasks and the related "user supplies the number of events a task will handle": it is very difficult for users to figure out that number, and it could change depending on the hour of the day or day of the week. This might also impact rolling deployments, where some tasks never finish. On (2), the remaining replicas showing FAILED status is confusing to users. It would require users to look very carefully at the overlord console task statuses to tell whether ingestion is working fine or whether there are genuine failures.
log.debug("Task group [%d] pre-pruning: %s", groupId, taskGroup.taskStatuses);
String successfulTaskId = null;
Iterator<Map.Entry<String, Optional<TaskStatus>>> it = taskGroup.taskStatuses.entrySet().iterator();
while (it.hasNext()) {
Why not remove the TaskGroup for which all tasks were killed here from the taskGroups map? Otherwise new tasks will be spawned again for this TaskGroup at Line 397. Is that correct, or am I missing something?
Consider this scenario: if the taskCount is set to 3 and the number of Kafka partitions is 3, and I then decrease the number of partitions from 3 to 1, this would cause all tasks to be killed in two TaskGroups.
@pjain1 removing the TaskGroup when all tasks are killed would be okay; the only (minor) downside is that with the current logic the new retry tasks wouldn't be enqueued until the next cycle. However, I don't believe this is necessary since Kafka does not support decreasing the number of partitions, and I can't think of any other reason that the number of task groups would decrease.
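For completeness, a sketch of what dropping an exhausted group could look like if it were ever needed (hypothetical structures; as noted above, Kafka doesn't allow shrinking the partition count, so the supervisor doesn't currently do this):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: removes task groups whose task set has become empty so that
// no replacement tasks would be spawned for them on the next run cycle.
public class TaskGroupPruneSketch
{
  static class TaskGroup
  {
    final Map<String, Optional<String>> taskStatuses = new HashMap<>();
  }

  public static void main(String[] args)
  {
    final Map<Integer, TaskGroup> taskGroups = new HashMap<>();

    taskGroups.put(0, new TaskGroup()); // all of this group's tasks were killed

    final TaskGroup live = new TaskGroup();
    live.taskStatuses.put("index_kafka_metrics_ghi", Optional.of("RUNNING"));
    taskGroups.put(1, live);

    // drop groups with no remaining tasks
    taskGroups.entrySet().removeIf(entry -> entry.getValue().taskStatuses.isEmpty());

    System.out.println(taskGroups.keySet()); // prints [1]
  }
}
```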
Also, does this PR handle the scenario where one increases or decreases the taskCount and restarts the Kafka supervisor?
There is only one situation where partitions can decrease: a Kafka user deletes a topic and re-creates it with fewer partitions. But in that case it's not really the "same" topic (offsets are going to get re-used too), so the supervisor is going to get really confused and need to be reset.
We could help people out in this situation with a "hard reset" option that stops the supervisor and wipes the datasource metadata. That would let them resume from the start or end of the new topic, as they wish.
@pjain1 yes, right now what would happen is the supervisor would come back up, discover the existing running tasks, and notice that they are processing a different allocation of partitions than what is specified in its spec (since the number of tasks has changed, the number of partitions processed per task has also changed). It would then forcibly kill these tasks and create new ones, which would start reading from the last persisted offsets stored in the datasource metadata table. Assuming that Kafka's buffer is large enough that these events haven't been dropped, no data would be lost.
@dclim great... I didn't know that Kafka does not allow decreasing the number of partitions, in which case it should be fine.
@pjain1 any more comments?
IMO the importance of the issues @dclim raised, from most important to least, is:
Of those, I think 3, 6, 2 are important to do before releasing the feature at all. I think we can live without 1 for the first release but should do it in a follow-up. I think 5 and 4 are nice improvements but less critical than the others. @himanshug - does this sound reasonable to you?
@gianm from #2656 (comment), I believe that (6) is already done. (4) would be nice to have but is very hard to do without adding complexity; not having it is fine. Also, (1) is not too critical, at least for the first cut. Currently the user is required to set "the number of events each task should handle before persisting a segment and completing"; it would be very convenient if this were figured out dynamically by the system. I thought (5) could probably help support this too, in addition to speeding up schema updates as you pointed out.
@himanshug thanks for your feedback - what were you thinking of as an alternative to having the user specify the number of events a task will handle? It might be interesting if we allowed the user to specify a target for how long they want the task to run, then run an initial task that ingests a small number of events to get a feel for the events/sec, and use this to tune our offset ranges. Also, like you said, if we had graceful shutdown we could guarantee tasks live as long as we want them to. Is that along the lines of what you were thinking? (have the user set the lifetime of the task in time vs. number of events)
@dclim as discussed on the dev sync, it would be nice if the supervisor started the tasks with some very large, arbitrary end offset and then forced the push if/when segments reach some size threshold or some time (e.g. 30 mins) has elapsed.
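A minimal sketch of that idea, assuming a task that checks its accumulated row count and elapsed time on each iteration; this is illustrative only and not the actual KafkaIndexTask read loop:

```java
import java.util.concurrent.TimeUnit;

// Illustrative read loop: the task is given an effectively unbounded end offset
// and instead publishes when it has accumulated enough rows or enough time has passed.
public class ForcedPublishSketch
{
  public static void main(String[] args) throws InterruptedException
  {
    final long maxRowsBeforePublish = 5_000_000;
    final long maxMillisBeforePublish = TimeUnit.MINUTES.toMillis(30);

    long rowsIngested = 0;
    final long startMillis = System.currentTimeMillis();

    while (true) {
      rowsIngested += pollAndIndexBatch(); // stand-in for reading a batch from Kafka

      final boolean sizeThresholdHit = rowsIngested >= maxRowsBeforePublish;
      final boolean timeThresholdHit =
          System.currentTimeMillis() - startMillis >= maxMillisBeforePublish;

      if (sizeThresholdHit || timeThresholdHit) {
        System.out.println("publishing segments after " + rowsIngested + " rows");
        break; // in the real task this would hand off segments and finish
      }
    }
  }

  private static long pollAndIndexBatch() throws InterruptedException
  {
    Thread.sleep(10); // pretend to wait for a Kafka poll
    return 100_000;   // pretend we indexed some rows
  }
}
```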
Update:
The supervisor logic has changed somewhat from the above description to accommodate time-based task lifetimes and no pauses between tasks. I'll post an updated overview of the mechanism, but as-is this PR should be ready for review.
Some additional notes regarding the time-based task lifetime design: Previously, the supervisor created KafkaIndexTasks, provided a starting and ending offset, and when a task completed, created another task (to handle the next set of offsets if the task succeeded, or to reprocess the same range of offsets if the task failed). With the updated design, the user provides a `taskDuration` instead of a fixed number of events per task, and the supervisor signals each task to begin publishing once that duration has elapsed.
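A simplified, hypothetical sketch of that lifecycle from the supervisor's side, using the `taskDuration` / `completionTimeout` semantics discussed in this thread (the types and thresholds below are illustrative, not the PR's actual code):

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative run-cycle check: signal reading tasks to publish after taskDuration,
// and give publishing tasks up to completionTimeout before treating them as failed.
public class TaskLifetimeSketch
{
  enum State { READING, PUBLISHING, FAILED }

  static class TrackedTask
  {
    State state = State.READING;
    Instant startTime;
    Instant publishSignalTime;
  }

  static final Duration TASK_DURATION = Duration.ofHours(1);
  static final Duration COMPLETION_TIMEOUT = Duration.ofMinutes(30);

  static void checkTask(TrackedTask task, Instant now)
  {
    if (task.state == State.READING
        && Duration.between(task.startTime, now).compareTo(TASK_DURATION) >= 0) {
      // taskDuration elapsed: tell the task to stop reading and publish its segments
      task.state = State.PUBLISHING;
      task.publishSignalTime = now;
    } else if (task.state == State.PUBLISHING
               && Duration.between(task.publishSignalTime, now).compareTo(COMPLETION_TIMEOUT) >= 0) {
      // the task failed to publish in time; it (and the task sets that depend on its
      // offsets) would be killed and replaced, starting from the last committed offsets
      task.state = State.FAILED;
    }
  }

  public static void main(String[] args)
  {
    final TrackedTask task = new TrackedTask();
    task.startTime = Instant.parse("2016-04-01T00:00:00Z");

    checkTask(task, Instant.parse("2016-04-01T01:05:00Z"));
    System.out.println(task.state); // PUBLISHING

    checkTask(task, Instant.parse("2016-04-01T01:40:00Z"));
    System.out.println(task.state); // FAILED (exceeded completionTimeout)
  }
}
```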
Does the `completionTimeout` need to be less than the `taskDuration`?
@pjain1 good question - there shouldn't need to be any condition on completionTimeout < taskDuration. If there are two sets of pending tasks (plus the currently reading task set) and the older set fails, it will kill itself, the newer pending tasks, as well as the currently reading set since all of them will now be producing invalid (non-contiguous) segments. The supervisor will then create new tasks that'll start from the offsets in the last successful segment (which would be the starting offsets of that older task set that failed). In general, you wouldn't want time-to-publish >> taskDuration, since that would mean that you'd be spawning new tasks faster than they're completing and your workers will eventually run out of capacity, but in terms of correctness, there shouldn't be any issues.
@dclim can you add some user documentation?
Yes, will write some up.
@dclim I read through the docs and noticed that the supervisor only accepts a single topic rather than a topic pattern - is there any plan to support topic patterns?
@schmee yeah, the KafkaIndexTask was written to read a single topic only; I don't believe that it's a fundamental limitation, but some work would be required to support topic patterns. We use topic patterns in our Druid cluster as well so I definitely think there's value in supporting them. What is your use case for topicPatterns? Are you reading from multiple topics into one dataSource or multiple topics into multiple dataSources (using the same ingestion spec)?
@@ -83,7 +86,8 @@ public MetadataStorageTablesConfig(
@JsonProperty("tasks") String tasksTable,
@JsonProperty("taskLog") String taskLogTable,
@JsonProperty("taskLock") String taskLockTable,
@JsonProperty("audit") String auditTable
@JsonProperty("audit") String auditTable,
@JsonProperty("supervisors") String supervisorTable
needs doc update
Sounds good
+ "FROM %1$s r " | ||
+ "INNER JOIN(SELECT spec_id, max(version) as version FROM %1$s GROUP BY spec_id) latest " | ||
+ "ON r.spec_id = latest.spec_id and r.version = latest.version", | ||
getSupervisorsTable() |
limiting items in history may potentially help the performance of this query too.
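For context, only the join appears in the diff above; a sketch of what the complete statement might look like is below. The SELECT list, the `payload` column, and the literal table name are assumptions for illustration, not taken from the diff.

```java
public class SupervisorQuerySketch
{
  public static void main(String[] args)
  {
    // Hypothetical reconstruction for discussion; only the JOIN portion appears in the diff.
    final String supervisorsTable = "druid_supervisors"; // stand-in for getSupervisorsTable()
    final String sql = String.format(
        "SELECT r.spec_id, r.version, r.payload "
        + "FROM %1$s r "
        + "INNER JOIN (SELECT spec_id, MAX(version) AS version FROM %1$s GROUP BY spec_id) latest "
        + "ON r.spec_id = latest.spec_id AND r.version = latest.version",
        supervisorsTable
    );
    System.out.println(sql);
  }
}
```

Limiting the number of versions retained per spec_id, as suggested, would shrink both the derived "latest" table and the join input.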
# Kafka Ingestion
The recommended way of ingesting data from Kafka is to use the `kafka-indexing-service` core extension (see |
I like the optimism but this language is a bit strong for first release :)
The language here should be telling people that this is an experimental feature, API subject to change, etc.
haha, sounds good
Fixed an issue where if the supervisor crashed after signalling a task to begin publishing but before creating the next task, the succeeding supervisor would create the new task with the same starting offsets as the publishing task. It will now create the new task starting from where the publishing task ended. Also added a test for this.
@dclim looking good! 👍
Manages the creation and monitoring of KafkaIndexTasks based on a SupervisorSpec submitted to /druid/indexer/v1/supervisor. See #1642 for an overview of realtime ingestion improvements. Closes #2635.