[SPARK-31805][SS] Exclude group.id from consumer settings when using the assign strategy #28623
tashoyan wants to merge 3 commits into apache:master
Conversation
Can one of the admins verify this patch?
cc @HeartSaVioR FYI
```scala
override def createConsumer(
    kafkaParams: ju.Map[String, Object]): Consumer[Array[Byte], Array[Byte]] = {
  val updatedKafkaParams = setAuthenticationConfigIfNeeded(kafkaParams)
  excludeGroupId(updatedKafkaParams)
```
I'd rather not manipulate the passed map, as we did in setAuthenticationConfigIfNeeded.
Now using KafkaConfigUpdater
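For readers following along: one way to avoid mutating the caller's map is to copy the parameters before dropping the key. A minimal sketch, not the actual patch; the helper name `paramsWithoutGroupId` is hypothetical:

```scala
import java.{util => ju}
import org.apache.kafka.clients.consumer.ConsumerConfig

// Minimal sketch (not the actual patch): copy the caller's parameters and
// drop group.id, so the map passed by the caller is never mutated.
def paramsWithoutGroupId(kafkaParams: ju.Map[String, Object]): ju.Map[String, Object] = {
  val copied = new ju.HashMap[String, Object](kafkaParams)
  copied.remove(ConsumerConfig.GROUP_ID_CONFIG) // "group.id"
  ju.Collections.unmodifiableMap(copied)
}
```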
If I'm not missing anything, the assignment is only effective on the driver side consumer, so if we want to discard the group id, the executor side consumers would need to be covered as well.

Btw, TBH I'm hesitant to agree with the change unless there's no way to deal with such an issue even with the changes from Spark 3.0.0, because the change would make an "exception" for a specific option, and that behavior may not be consistent across existing and future versions of Kafka.

What I have heard of group ID security issues from end users in the community is that most cases can be dealt with via a prefixed group ID. Doesn't it help your case?
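For context, the Spark 3.0.0 workarounds referred to here are the `groupIdPrefix` and `kafka.group.id` source options. A sketch, assuming a SparkSession named `spark`; broker, topic, and prefix values are placeholders:

```scala
// Generated group ids start with the configured prefix, so an ACL with a
// prefixed resource pattern on "allowed-prefix" can authorize them.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("groupIdPrefix", "allowed-prefix")
  // Alternatively, pin a fixed, pre-authorized group id:
  // .option("kafka.group.id", "allowed-group")
  .load()
```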
**Consumers inside executors**

Indeed, executors have their own Kafka consumers, and these consumers always have group.id in their settings. However, the consumers inside executors always do assign. Nevertheless, executors do not fail as the driver does. My investigation is below in this post.

**Kafka consumers behavior**

By the KafkaConsumer specification, the group.id property is needed only for group management, that is, when subscribing to topics. If I am not missing anything, Structured Streaming manually assigns Spark executors to Kafka partitions. We do not use consumer groups at all, so we don't need them. Therefore, I am not sure why Structured Streaming provides the option "subscribe". By not specifying group.id, we would avoid the authorization problem entirely.

**Weird behavior of consumers inside executors**

A weird thing is that executors do not fail with GroupAuthorizationException as the driver does. The difference between the driver's consumer and an executor's consumer is the following: the executor's consumer seeks to explicit offsets before polling.
These seek operations make the difference. I made a trivial application with KafkaConsumer: if I insert the same seek operations before polling, the GroupAuthorizationException goes away.
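A sketch of such a trivial probe, assuming kafka-clients on the classpath; broker and topic names are placeholders:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object AssignSeekProbe extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  // Note: no group.id at all.

  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  val tp = new TopicPartition("events", 0)
  consumer.assign(Collections.singleton(tp))
  // Seeking to an explicit position means poll() never has to fetch committed
  // offsets, so no consumer-group authorization is involved.
  consumer.seekToBeginning(Collections.singleton(tp))
  val records = consumer.poll(Duration.ofSeconds(5))
  println(s"Fetched ${records.count()} records without any group.id")
  consumer.close()
}
```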
It's not true for the driver side. Spark still leverages subscription in the driver to receive metadata updates and avoid retrieving the target topic partitions by itself. Spark interprets the topic partitions in startingOffsets / endingOffsets based on that information. The metadata information is the thing we would need to deal with manually via the admin client if we wanted to avoid subscription at all, which is technically not impossible (if I'm not missing anything) but sub-optimal.

Also, the option "assign" leaves many considerations on the end users' side: end users should know the target topic partitions and fully understand that the query may not consume all messages in a topic once the topic expands its number of partitions. It's not simple to use and is error-prone.

Back to the original topic, my comment is not about the feasibility of doing that. I agree we can make the change technically; the point is how much value it brings to differentiate assign vs. the other strategies on the Spark side. Integrating with such details means it's going to be non-trivial to change. If the workarounds provided by Spark 3.0.0 work for all cases, I'm not 100% sure about the value.

One interesting thing is that Kafka could ignore the group ID for the assign API, but it doesn't look like it does, which is why you've encountered the issue.
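As an illustration of that admin client route, a sketch; broker and topic names are placeholders:

```scala
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.AdminClient

// Fetch partition metadata directly, without a subscribed consumer.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
val admin = AdminClient.create(props)
try {
  val topics = admin.describeTopics(Collections.singleton("events")).all().get()
  val partitions = topics.get("events").partitions().asScala.map(_.partition())
  println(s"Partitions of 'events': ${partitions.mkString(", ")}")
} finally {
  admin.close()
}
```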
But that's only me. Let's hear more voices on this.
I've taken a look at it and here are my thoughts:

> For the "assign" strategy, consumer group is not used.

This is simply not true. One can commit offsets back with a consumer, and in such a case the group id is used (see the sketch after this comment).

Overall I wouldn't merge it, but if somebody thinks it's worth pursuing, I have a couple of further comments.
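To make the point about committing offsets concrete, a sketch; broker, topic, and group names are placeholders. Even an assign()-based consumer goes through the consumer group when it commits, so group.id must be set and authorized for commits to work:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("group.id", "offset-tracking-group") // needed for commitSync below
props.put("enable.auto.commit", "false")
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
consumer.assign(Collections.singleton(new TopicPartition("events", 0)))
consumer.poll(Duration.ofSeconds(1))
// Commits are stored under group.id; without it this call fails
// (recent clients throw InvalidGroupIdException).
consumer.commitSync()
consumer.close()
```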
I'm just revisiting this. At the moment I'm working on SPARK-32032, where …
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Fix for SPARK-31805: do not set the group.id consumer property when using the "assign" strategy.

Why are the changes needed?
With a secure Kafka broker, the application fails because the auto-generated group id is not allowed by the broker (the consumer fails with GroupAuthorizationException).
For the "assign" strategy, consumer group is not used. Therefore the best fix is to exclude the
group.idproperty from the consumer config.Does this PR introduce any user-facing change?
Yes. When using the "assign" strategy, the group.id property is no longer set on the consumer.
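For illustration, an assign-based query looks like this (a sketch, assuming a SparkSession named `spark`; broker and topic names are placeholders). With this change, the consumer config backing such a query would contain no group.id:

```scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  // Explicit partition assignment: JSON of topic -> partitions.
  .option("assign", """{"events":[0,1,2]}""")
  .load()
```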
How was this patch tested?
Manually:
… spark-sql-kafka-0-10_2.12. My application does the following: …

I did not manage to add unit tests, because I do not know how to set up a secure Kafka broker. Unit tests in spark-sql-kafka-0-10 use the tool KafkaTestUtils. I am not sure if this tool is suitable to simulate a secure Kafka broker with ACLs.