
KAFKA-9674: corruption should also cleanup producer and recreate #8242

Closed · wants to merge 1 commit

Conversation

@abbccdda (Contributor) commented Mar 6, 2020

Task producer cleanup currently doesn't handle task corruption. This change recreates the task producer so that a producer left in a fatal state is not reused in the next cycle.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@abbccdda (Contributor, Author) commented Mar 6, 2020

Due to time constraints, I feel unblocking trunk is a higher priority than test coverage on the active task creator; adding a ticket to track it later: https://issues.apache.org/jira/browse/KAFKA-9676

@vvcephei (Contributor) left a comment

Thanks @abbccdda! I had a couple of high-level comments.

log.error(uncleanMessage, producerException);
producerCloseExceptions.putIfAbsent(task.id(), producerException);
}
}
}

task.revive();
Contributor

Should we try to revive the task if there was an exception closing/re-creating the task producer?

Member

For closing, yes -- we close dirty anyway. And I don't think that creating a producer can fail (if we are worried about it, we should just not catch the exception but die...?)

Comment on lines +176 to +183
if (firstEntry.getValue() instanceof KafkaException) {
log.error("Hit Kafka exception while closing first task {} producer", firstEntry.getKey());
throw firstEntry.getValue();
} else {
throw new RuntimeException(
"Unexpected failure to close " + producerCloseExceptions.size() +
" task(s) producers [" + producerCloseExceptions.keySet() + "]. " +
"First unexpected exception (for task " + firstEntry.getKey() + ") follows.", firstEntry.getValue()
Contributor

These two cases don't seem to be different. I'd recommend just always wrapping the exception and throwing (currently the else block). If we just re-throw the first exception, reading the stack trace becomes very confusing, especially since a lot of those exceptions don't even include the stack trace.

Contributor

In the newest trunk we always call task.closeDirty.

Member

We should wrap KafkaException as StreamsException but rethrow all other RuntimeExceptions unwrapped (at least this is the pattern we use everywhere else, and thus we should follow it here, too).
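
For illustration, a minimal sketch of that wrap-vs-rethrow pattern, assuming org.apache.kafka.streams.errors.StreamsException and hypothetical local names (producer, taskId) rather than the actual patch:

try {
    producer.close();
} catch (final KafkaException fatal) {
    // wrap Kafka client exceptions as StreamsException
    throw new StreamsException("Failed to close producer for task " + taskId, fatal);
} catch (final RuntimeException other) {
    // rethrow any other RuntimeException unwrapped
    throw other;
}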

Comment on lines +171 to +178
void createTaskProducer(final TaskId taskId) {
final String taskProducerClientId = getTaskProducerClientId(threadId, taskId);
final Map<String, Object> producerConfigs = config.getProducerConfigs(taskProducerClientId);
producerConfigs.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, applicationId + "-" + taskId);
log.info("Creating producer client for task {}", taskId);
taskProducers.put(taskId, clientSupplier.getProducer(producerConfigs));
}

Contributor

How about instead keeping this private and only exposing reOpenTaskProducerIfNeeded, which would take care of doing nothing if there's no task producer, etc. I'm concerned that otherwise, someone might call createTaskProducer when there's already one there, leading to a "producer leak".
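
A minimal sketch of what that guarded method could look like, reusing taskProducers, createTaskProducer, and closeAndRemoveTaskProducerIfNeeded from the diff above; the method name and guard follow the suggestion and are an assumption, not the actual patch:

void reOpenTaskProducerIfNeeded(final TaskId taskId) {
    // only act if a producer already exists for this task, so callers
    // cannot accidentally create a second producer ("producer leak")
    if (taskProducers.containsKey(taskId)) {
        closeAndRemoveTaskProducerIfNeeded(taskId);
        createTaskProducer(taskId);
    }
}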

@vvcephei (Contributor) commented Mar 6, 2020

test this please

@guozhangwang (Contributor) left a comment

In the newest trunk, we do not call closeAndRemoveTaskProducerIfNeeded during handleCorrupted; the error message seems to be coming from the old version where we close the producer inside task#close.

So I'm wondering: during task-corruption handling, could we still reuse the existing producer? Task corruption can only be thrown during:

  1. restoration (changelog-reader), where the producers have not been used to send a single record yet.
  2. creation with EOS (processor-state-manager), where the producers have not been used to send a single record yet.

So I feel we do not need to close / recreate a new producer for handleCorruption. WDYT?

@abbccdda (Contributor, Author) commented Mar 7, 2020

@vvcephei @guozhangwang Thanks for the review! After some offline discussion we believe fixing this issue is not urgent at the moment, as John's refactoring of the producer should already handle closing the producer outside of closeClean, so we should be safe not to reinitialize.

// We need to recreate the producer as it could potentially be in illegal state.
if (task.isActive()) {
try {
activeTaskCreator.closeAndRemoveTaskProducerIfNeeded(task.id());
Member

nit: task.id() -> taskId

activeTaskCreator.createTaskProducer(taskId);
} catch (final RuntimeException producerException) {
final String uncleanMessage = String.format("Failed to close task %s producer cleanly. " +
"Attempting to close remaining task producers before re-throwing:", task.id());
Member

as above

final String uncleanMessage = String.format("Failed to close task %s producer cleanly. " +
"Attempting to close remaining task producers before re-throwing:", task.id());
log.error(uncleanMessage, producerException);
producerCloseExceptions.putIfAbsent(task.id(), producerException);
Member

same

producerCloseExceptions.entrySet().iterator().next();

if (firstEntry.getValue() instanceof KafkaException) {
log.error("Hit Kafka exception while closing first task {} producer", firstEntry.getKey());
@mjsax (Member) Mar 7, 2020

Do we need to log here? All errors are logged in L163 already (and I think we would log it again in upper layers).

@mjsax (Member) commented Mar 7, 2020

@abbccdda Overall nice find -- working on KIP-447 PR I was also wondering if we would need to create a new producer for this case.

@abbccdda (Contributor, Author) commented Mar 7, 2020

Closing for now

@abbccdda abbccdda closed this Mar 7, 2020