
[SPARK-31805][SS] Exclude group.id from consumer settings when using the assign strategy #28623

Closed
tashoyan wants to merge 3 commits into apache:master from tashoyan:SPARK-31805-assign-no-group-id

Conversation

@tashoyan
Contributor

What changes were proposed in this pull request?

Fix for SPARK-31805: do not set the group.id consumer property when using the "assign" strategy.

Why are the changes needed?

With a secure Kafka broker, an application fails because the auto-generated group id is not allowed by the broker:

org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: spark-kafka-relation-ecab045d-4ee6-425e-88a0-495d4100a013-driver-0

For the "assign" strategy, consumer group is not used. Therefore the best fix is to exclude the group.id property from the consumer config.

Does this PR introduce any user-facing change?

Yes. When using "assign" strategy:

  1. No need to reconfigure the broker to add the necessary group ids to the ACL
  2. (since Spark 3.0.0) No need to provide a custom group id (SPARK-26350) or a custom prefix (SPARK-26121)

How was this patch tested?

Manually:

  1. Rebuild the module:
    mvn install -pl :spark-sql-kafka-0-10_2.12
  2. Rebuild my application with the newly built spark-sql-kafka-0-10_2.12. My application does the following:
    val kafkaDf = spark.read
      .format("kafka")
      // "kafka.bootstrap.servers" and SASL-specific options
      .options(options)
      .option("assign", topicPartitionsJson)
      .option("startingOffsets", startingOffsetsJson)
      .option("endingOffsets", "latest")
      .load()
  3. Run my application with a secure Kafka broker and verify that it does not fail with "GroupAuthorizationException: ..." anymore.

I did not manage to add unit tests, because I do not know how to set up a secure Kafka broker in a test. Unit tests in spark-sql-kafka-0-10 use the KafkaTestUtils tool, and I am not sure whether this tool can simulate a secure Kafka broker with ACLs.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

cc @HeartSaVioR FYI

override def createConsumer(
    kafkaParams: ju.Map[String, Object]): Consumer[Array[Byte], Array[Byte]] = {
  val updatedKafkaParams = setAuthenticationConfigIfNeeded(kafkaParams)
  excludeGroupId(updatedKafkaParams)
Contributor


I'd rather not manipulate passed map, as we did in setAuthenticationConfigIfNeeded.

Contributor Author


Now using KafkaConfigUpdater

@HeartSaVioR
Contributor

If I'm not missing anything, the assignment is only effective for the driver-side consumer, so if we want to discard the group.id config for assignment, we should do the same for the executors' consumers as well. That can be applied to all strategies, since executors' consumers always leverage assignment.
(Please turn on DEBUG log for org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer and see how group ID is printed.)
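With Spark's default log4j 1.x configuration, a hedged example of enabling this in conf/log4j.properties:

    log4j.logger.org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer=DEBUG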

Btw, TBH I'm hesitant to agree to the change unless there's no way to deal with such an issue even with the changes from Spark 3.0.0, because the change would make an "exception" for a specific option, and that behavior may not be consistent across existing and future versions of Kafka.

What I have heard about group ID security issues from end users in the community is that most cases can be dealt with via a prefixed group ID. Doesn't that help your case?

@tashoyan
Contributor Author

tashoyan commented May 24, 2020

Consumers inside executors

Indeed, executors have their own Kafka consumers, and these consumers always have group.id set: see KafkaSourceProvider.kafkaParamsForExecutors(). We can see the group ids in executor logs; INFO level is enough.

However, the consumers inside executors always do assign() and never subscribe(), so there is no reason to set group.id for them. I would suggest removing this setting for executor consumers.
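For context, a paraphrased sketch of what kafkaParamsForExecutors() effectively does today (simplified, not the verbatim Spark source; only the config key is literal):

    import java.util.{HashMap => JHashMap, Map => JMap}
    import org.apache.kafka.clients.consumer.ConsumerConfig

    // Paraphrased sketch, not the verbatim Spark source: every executor-side
    // consumer currently gets a group.id derived from a unique per-query id.
    def kafkaParamsForExecutors(
        specifiedKafkaParams: Map[String, String],
        uniqueGroupId: String): JMap[String, Object] = {
      val params = new JHashMap[String, Object]()
      specifiedKafkaParams.foreach { case (k, v) => params.put(k, v) }
      params.put(ConsumerConfig.GROUP_ID_CONFIG, s"$uniqueGroupId-executor")
      params
    }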

Nevertheless, executors do not fail as the driver does. My investigation is below in this post.

Kafka consumer behavior

According to the KafkaConsumer specification, the assign mechanism does not use consumer groups at all. In contrast to subscribe, assign assumes manual assignment of topic partitions.
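A minimal illustration with plain kafka-clients (broker address and topic name are placeholders): with assign() there is no group coordination, and group.id can be omitted entirely.

    import java.time.Duration
    import java.util.{Arrays, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.ByteArrayDeserializer

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
    // Note: no ConsumerConfig.GROUP_ID_CONFIG at all.

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0)))  // manual assignment
    consumer.seekToBeginning(consumer.assignment())                    // explicit position
    val records = consumer.poll(Duration.ofSeconds(5))
    consumer.close()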

If I am not missing anything, Structured Streaming manually assigns Kafka partitions to Spark executors. We do not use consumer groups at all, so we do not need them. Therefore, I am not sure why Structured Streaming provides the "subscribe" option.

By not specifying group.id, we can avoid the authorization problem on a secure Kafka broker.

Weird behavior of consumers inside executors

A weird thing is that executors do not fail with GroupAuthorizationException as the driver does. The difference between the driver's consumer and the executor's consumer is the following:

These seek operations make the difference. I made a trivial application with KafkaConsumer: if I insert seekToBeginning() before calling poll(), the consumer successfully reads records without facing GroupAuthorizationException, hence bypassing authorization. Maybe it is a Kafka bug (I tried with Kafka 2.2.0).
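A hedged reconstruction of that trivial application (placeholder broker, topic, and group names; SASL options omitted; imports as in the previous sketch):

    // With a group.id the principal is NOT authorized for, poll() alone fails
    // with GroupAuthorizationException, because the consumer asks the group
    // coordinator for committed offsets. Seeking first sets explicit positions,
    // so the coordinator is never consulted and the read succeeds.
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "some-unauthorized-group")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0)))
    consumer.seekToBeginning(consumer.assignment())  // without this line: GroupAuthorizationException
    val records = consumer.poll(Duration.ofSeconds(5))
    consumer.close()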

@HeartSaVioR
Contributor

If I am not missing anything, Structured Streaming manually assigns Kafka partitions to Spark executors. We do not use consumer groups at all, so we do not need them. Therefore, I am not sure why Structured Streaming provides the "subscribe" option.

It's not true for the driver side. Spark still leverages subscription in the driver to receive metadata updates and to avoid having to retrieve the target topic partitions by itself. Spark interprets the topic partitions in startingOffsets / endingOffsets based on that information. The metadata is what we would need to handle manually with an admin client if we wanted to avoid subscription altogether; that is technically possible (if I'm not missing anything) but sub-optimal.

Also, the "assign" option leaves many considerations on the end users' side: end users should know the target topic partitions and fully understand that the query may not consume all messages in a topic once the topic expands its number of partitions. It is not simple to use and is error-prone.

Back to the original topic, my comment is not about the feasibility of doing this. I agree we can make the change technically; the point is how much value it brings to differentiate "assign" from the other strategies on the Spark side. Integrating with such details means it is going to be non-trivial to change later. If the workarounds provided by Spark 3.0.0 work for all cases, I'm not 100% sure about the value.
(Honestly, I don't see the value of an ACL on the group ID if anyone can ignore the permission by calling the assign API. The ACL should also have been set on the users.)

One interesting thing is that Kafka could ignore the group ID for the assign API, but apparently it does not, which is why you've encountered the issue.

@HeartSaVioR
Contributor

But that's only me. Let's hear more voices on this.

cc @zsxwing @gaborgsomogyi

@gaborgsomogyi
Contributor

I've taken a look at it and here are my thoughts:

  • According to KafkaConsumer specification, the consumer group in the "assign" strategy is not used.

This is simply not true. One can commit offsets back with a consumer, and in that case the group id is used (see the sketch at the end of this comment).
This is not used in Structured Streaming at the moment, but I don't want to close off that possibility.

  • We've already added a solution in Spark 3.0.0 to solve exactly this issue; I don't see the reason to solve it in a different way.
  • There is a configuration solution to this on the broker side:
    bin/kafka-acls --authorizer kafka.security.auth.SimpleAclAuthorizer \
      --authorizer-properties zookeeper.connect=zk:2181 \
      --add --allow-principal User:'Bon' --operation READ --topic topicName \
      --group='spark-kafka-source-' --resource-pattern-type prefixed
  • A personal view only, but I don't see the gain in adding this; on the other hand, additional code means more maintenance.

Overall I wouldn't merge it, but if somebody thinks it's worth it, I have a couple of further comments.
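To illustrate the offset-commit point from the first bullet, a hedged sketch with plain kafka-clients (placeholder names; this is not something Structured Streaming does today):

    import java.time.Duration
    import java.util.{Arrays, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.ByteArrayDeserializer

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-committing-group")  // required for commits
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0)))
    consumer.seekToBeginning(consumer.assignment())
    consumer.poll(Duration.ofSeconds(5))
    // Even with manual assignment, offsets are stored under group.id;
    // without a group.id this call throws InvalidGroupIdException.
    consumer.commitSync()
    consumer.close()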

@gaborgsomogyi
Contributor

gaborgsomogyi commented Jul 31, 2020

I'm just revisiting this. At the moment I'm working on SPARK-32032, where AdminClient is planned to be used on the driver side, which makes group.id completely useless in the Kafka connector. If that's merged, then we can go in this direction, but not with the current implementation. I think we shouldn't remove group.id from the map but just remove the code references that use it (for instance in KafkaDataConsumer), plus the documentation of course.
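A hedged sketch of that direction, not the SPARK-32032 implementation (AdminClient.listOffsets needs Kafka clients 2.5+; broker and topic names are placeholders). Note that no group.id is involved at any point:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, OffsetSpec}
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    val admin = AdminClient.create(props)

    // Resolve the topic's partitions, then fetch their latest offsets,
    // all without ever configuring a consumer group.
    val description = admin.describeTopics(java.util.Arrays.asList("my-topic"))
      .all().get().get("my-topic")
    val partitions = description.partitions().asScala
      .map(info => new TopicPartition("my-topic", info.partition()))
    val latestOffsets = admin.listOffsets(
      partitions.map(tp => tp -> OffsetSpec.latest()).toMap.asJava).all().get()
    admin.close()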

@github-actions

github-actions bot commented Nov 9, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 9, 2020
@github-actions github-actions bot closed this Nov 10, 2020