KAFKA-10727; Handle Kerberos error during re-login as transient failure in clients #9605

Merged

Conversation

rajinisivaram
Contributor

We use a background thread for Kerberos to perform re-login before tickets expire. The thread performs logout() followed by login(), relying on the Java library to clear and then populate credentials in Subject. This leaves a timing window where clients fail to authenticate because credentials are not available. We cannot introduce any form of locking since authentication is performed on the network thread. So this PR treats NO_CRED as a transient failure rather than a fatal authentication exception in clients.
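The classification described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the class name `NoCredTransientSketch`, the method names, and the `"RETRY"`/`"FATAL"` labels are hypothetical. It only assumes that the failure surfaces as a `GSSException` with major code `NO_CRED` somewhere in the cause chain.

```java
import org.ietf.jgss.GSSException;

// Hypothetical helper, for illustration only; not part of Kafka.
final class NoCredTransientSketch {

    // Walks the cause chain looking for a GSSException whose major code is NO_CRED.
    static boolean isNoCredFailure(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof GSSException
                    && ((GSSException) t).getMajor() == GSSException.NO_CRED) {
                return true;
            }
        }
        return false;
    }

    // Assumed policy: NO_CRED may just be the logout()/login() window, so retry;
    // any other failure is still treated as a fatal authentication error.
    static String classify(Throwable error) {
        return isNoCredFailure(error) ? "RETRY" : "FATAL";
    }
}
```

With this policy, a connection attempt that lands between `logout()` and `login()` is retried instead of being surfaced to the application as an authentication failure.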

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Contributor

@rondagostino rondagostino left a comment

Hi Rajini. LGTM. I was wondering if we had a similar possibility of currently-non-retriable failure with OAuth Bearer tokens, but it appears that we support multiple simultaneous tokens (see ExpiringCredentialRefreshingLogin; and OAuthBearerLoginModule sets loginRefreshReloginAllowedBeforeLogout to true) -- so what happens there is a new token is retrieved/logged-in before the first one is logged-out, and there is never a moment without valid credentials.
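The difference between the two orderings can be shown with a toy model. This is only a sketch: `ReloginOrderingSketch` and its methods are hypothetical, and a plain set of strings stands in for the credentials held by the `Subject`. It records how many credentials are present after each step.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of a credential store; not Kafka or JAAS code.
final class ReloginOrderingSketch {
    private final Set<String> credentials = new LinkedHashSet<>();
    private final List<Integer> sizesSeen = new ArrayList<>();

    ReloginOrderingSketch(String initialCredential) {
        credentials.add(initialCredential);
    }

    // Kerberos-style refresh: the store is empty between the two steps.
    void logoutThenLogin(String oldCredential, String newCredential) {
        credentials.remove(oldCredential);
        sizesSeen.add(credentials.size());
        credentials.add(newCredential);
        sizesSeen.add(credentials.size());
    }

    // OAuth-style refresh: the new credential is added before the old one is removed.
    void loginThenLogout(String oldCredential, String newCredential) {
        credentials.add(newCredential);
        sizesSeen.add(credentials.size());
        credentials.remove(oldCredential);
        sizesSeen.add(credentials.size());
    }

    // Number of credentials present after each step, in order.
    List<Integer> sizesSeen() {
        return sizesSeen;
    }
}
```

The first ordering passes through a size of zero, which is the window in which an authentication attempt finds no credentials. The second never drops below one.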

Contributor

@omkreddy omkreddy left a comment

@rajinisivaram Thanks for the PR. LGTM.

@rajinisivaram
Contributor Author

@omkreddy @rondagostino Thanks for the reviews.
@rondagostino Yes, I had checked the OAuth code path to see if it had the same issue and found that you had already taken care of that :-)

Quota test failure not related, merging to trunk.

@rajinisivaram rajinisivaram merged commit ed8659b into apache:trunk Nov 23, 2020
rondagostino pushed a commit to confluentinc/kafka that referenced this pull request Feb 11, 2021
…re in clients (apache#9605)

We use a background thread for Kerberos to perform re-login before tickets expire. The thread performs logout() followed by login(), relying on the Java library to clear and then populate credentials in Subject. This leaves a timing window where clients fail to authenticate because credentials are not available. We cannot introduce any form of locking since authentication is performed on the network thread. So this commit treats NO_CRED as a transient failure rather than a fatal authentication exception in clients.

Reviewers: Ron Dagostino <rdagostino@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
andrewegel pushed a commit to confluentinc/kafka that referenced this pull request Feb 11, 2021
…re in clients (apache#9605) (#508)

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
rajinisivaram added a commit to confluentinc/kafka that referenced this pull request Apr 26, 2021
andrewegel pushed a commit to confluentinc/kafka that referenced this pull request Apr 28, 2021
…re in clients (apache#9605) (#550)

rajinisivaram added a commit that referenced this pull request May 5, 2021
rajinisivaram added a commit that referenced this pull request May 5, 2021
rajinisivaram added a commit that referenced this pull request May 5, 2021
rajinisivaram added a commit to confluentinc/kafka that referenced this pull request May 25, 2021
rajinisivaram added a commit to confluentinc/kafka that referenced this pull request Jun 7, 2021
joannaksk pushed a commit to joannaksk/kafka that referenced this pull request May 24, 2022
MertEgeCAN pushed a commit to DogukanAltay/kafka that referenced this pull request May 28, 2022
DogukanAltay pushed a commit to DogukanAltay/kafka that referenced this pull request May 29, 2022
* Update build.gradle

* KAFKA-10727; Handle Kerberos error during re-login as transient failure in clients (apache#9605)

* Update GssapiAuthenticationTest.scala

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
DogukanAltay added a commit to DogukanAltay/kafka that referenced this pull request May 29, 2022
* MINOR: revise assertions in AbstractConfigTest (apache#9180)

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

* PR-9180 applied through cherry-pick

* Gradle build fail fix.

* Revert "Revert KAFKA-12791"

This reverts commit f3cd2d7.

* Revert "KAFKA-12791: ConcurrentModificationException in AbstractConfig use by KafkaProducer (apache#10704)"

This reverts commit dfdf915.

* sonarqube integration fix

* MINOR: revise assertions in AbstractConfigTest (apache#9180)

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

* PR-9180 applied through cherry-pick

* Update rat.gradle

* MINOR: revise assertions in AbstractConfigTest (apache#9180)

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

* PR-9180 applied through cherry-pick

* refactoring class AbstractConfig to fix the failing test.

* Apache PR 10704 Applied (#3)

* KAFKA-12791: ConcurrentModificationException in AbstractConfig use by KafkaProducer (apache#10704)

Recently we have noticed multiple instances where KafkaProducers have failed to construct due to the following exception:

```
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:440)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:291)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:318)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.util.ConcurrentModificationException
	at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1584)
	at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1607)
	at java.base/java.util.AbstractSet.removeAll(AbstractSet.java:171)
	at org.apache.kafka.common.config.AbstractConfig.unused(AbstractConfig.java:221)
	at org.apache.kafka.common.config.AbstractConfig.logUnused(AbstractConfig.java:379)
	at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:433)
	... 9 more
exception.class:org.apache.kafka.common.KafkaException exception.message:Failed to construct kafka producer
```

This happens because `used` is a synchronized set, and a synchronized set does not protect iteration: `used` is modified while removeAll is iterating over it. The modification comes from the use of RecordingMap in the Sender thread (see below). Switching to a concurrent hash set avoids this issue, as it supports concurrent iteration.

```
	at org.apache.kafka.clients.producer.ProducerConfig.ignore(ProducerConfig.java:569)
	at org.apache.kafka.common.config.AbstractConfig$RecordingMap.get(AbstractConfig.java:638)
	at org.apache.kafka.common.network.ChannelBuilders.createPrincipalBuilder(ChannelBuilders.java:242)
	at org.apache.kafka.common.network.PlaintextChannelBuilder$PlaintextAuthenticator.<init>(PlaintextChannelBuilder.java:96)
	at org.apache.kafka.common.network.PlaintextChannelBuilder$PlaintextAuthenticator.<init>(PlaintextChannelBuilder.java:89)
	at org.apache.kafka.common.network.PlaintextChannelBuilder.lambda$buildChannel$0(PlaintextChannelBuilder.java:66)
	at org.apache.kafka.common.network.KafkaChannel.<init>(KafkaChannel.java:174)
	at org.apache.kafka.common.network.KafkaChannel.<init>(KafkaChannel.java:164)
	at org.apache.kafka.common.network.PlaintextChannelBuilder.buildChannel(PlaintextChannelBuilder.java:79)
	at org.apache.kafka.common.network.PlaintextChannelBuilder.buildChannel(PlaintextChannelBuilder.java:67)
	at org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:356)
	at org.apache.kafka.common.network.Selector.registerChannel(Selector.java:347)
	at org.apache.kafka.common.network.Selector.connect(Selector.java:274)
	at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:1097)
	at org.apache.kafka.clients.NetworkClient.access$700(NetworkClient.java:87)
	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1276)
	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1164)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:637)
	at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:327)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:242)
```
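The failure mode can be reproduced deterministically in a single thread. The sketch below is illustrative only: the class and method names are hypothetical, the `add` call inside the loop stands in for the Sender thread recording a key, and a plain for-each loop stands in for `removeAll`. It assumes `ConcurrentHashMap.newKeySet()` as the concurrent set, since the JDK has no class named `ConcurrentHashSet`.

```java
import java.util.ConcurrentModificationException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demonstration; not the AbstractConfig code.
final class UsedKeysIterationSketch {

    // Returns true if the set tolerates a structural change during iteration.
    static boolean survivesMutationDuringIteration(Set<String> used) {
        used.add("bootstrap.servers");
        used.add("client.id");
        try {
            for (String key : used) {
                // Stands in for another thread recording a newly read config key.
                used.add("principal.builder.class");
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    // The assumed replacement: a set view backed by ConcurrentHashMap.
    static Set<String> concurrentSet() {
        return ConcurrentHashMap.newKeySet();
    }
}
```

Passing `Collections.synchronizedSet(new HashSet<>())` makes the loop fail, because the wrapper synchronizes individual calls but hands out the fail-fast iterator of the underlying `HashSet`. The set from `concurrentSet()` has a weakly consistent iterator, so the loop completes.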

Reviewers: Ismael Juma <ismael@juma.me.uk>

* Small refactoring applied to fix bug after cherry-pick.

* Update rat.gradle

* KAFKA-12791: ConcurrentModificationException in AbstractConfig use by KafkaProducer (apache#10704)

* Small refactoring applied to fix bug after cherry-pick.

Co-authored-by: Lucas Bradstreet <lucas@confluent.io>
Co-authored-by: dogukan <dogukan.altay@jobilla.com>
Co-authored-by: MertEgeCAN <m.egecan@hotmail.com>

* SonarQube Code Smell Refactorings (#4)

* refactoring class ClientDnsLookup to fix the code smell.

* refactoring class FetchSessionHandler to fix the code smell.

* refactoring class NetworkClient to fix the code smell.

* Update rat.gradle

* refactoring class ClientDnsLookup to fix the code smell.

* refactoring class FetchSessionHandler to fix the code smell.

* refactoring class NetworkClient to fix the code smell.

Co-authored-by: Jason Gustafson <jason@confluent.io>
Co-authored-by: dogukan <dogukan.altay@jobilla.com>
Co-authored-by: MertEgeCAN <m.egecan@hotmail.com>

* Apache PR 9309 Applied  (#6)

* KAFKA-10503: MockProducer doesn't throw ClassCastException when no partition for topic exists (apache#9309)

Reviewer: Matthias J. Sax <matthias@confluent.io>

* Update rat.gradle

Co-authored-by: Gonzalo Muñoz <gmunozfe@redhat.com>

* Apache PR 8665 Applied  (#8)

* Update build.gradle

* KAFKA-9984 Should fail the subscription when pattern is empty (apache#8665)

Reviewers: Boyang Chen <boyang@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Matthias J. Sax <matthias@confluent.io>

Co-authored-by: zhaohaidao <zhaohaidao2008@hotmail.com>

* SonarQube Code Smell: ClusterConnectionStates.java (#9)

* Update build.gradle

* Code smells fix

* Update ClusterConnectionStates.java

* Update ClusterConnectionStates.java

* Apache PR 9605 Applied (#10)

* Update build.gradle

* KAFKA-10727; Handle Kerberos error during re-login as transient failure in clients (apache#9605)

* Update GssapiAuthenticationTest.scala

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>

* MINOR: revise assertions in AbstractConfigTest (apache#9180)

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>

Co-authored-by: Sanket Fajage <23031210+sanketfajage@users.noreply.github.com>
Co-authored-by: dogukan <dogukan.altay@jobilla.com>
Co-authored-by: MertEgeCAN <m.egecan@hotmail.com>
Co-authored-by: Lucas Bradstreet <lucas@confluent.io>
Co-authored-by: Jason Gustafson <jason@confluent.io>
Co-authored-by: Gonzalo Muñoz <gmunozfe@redhat.com>
Co-authored-by: zhaohaidao <zhaohaidao2008@hotmail.com>
Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
DogukanAltay added a commit to DogukanAltay/kafka that referenced this pull request May 29, 2022
* KAFKA-3720 cherry-pick

* small fix.

* KAFKA-3720 cherry-pick

* small fix.

* Update rat.gradle

* KAFKA-3720 cherry-pick

* small fix.

* refactoring class AbstractConfig to fix the failing test.

* refactoring class PlaintextProducerSendTest.scala to fix the failing test.

* refactoring class PlaintextProducerSendTest.scala to fix the failing test.
refactoring class BaseProducerSendTest.scala to fix the failing test.

* revert refactoring on core changes.

* Apache PR 10704 Applied (#3)

* KAFKA-12791: ConcurrentModificationException in AbstractConfig use by KafkaProducer (apache#10704)

* Small refactoring applied to fix bug after cherry-pick.

* Update rat.gradle

* KAFKA-12791: ConcurrentModificationException in AbstractConfig use by KafkaProducer (apache#10704)

* Small refactoring applied to fix bug after cherry-pick.

Co-authored-by: Lucas Bradstreet <lucas@confluent.io>
Co-authored-by: dogukan <dogukan.altay@jobilla.com>
Co-authored-by: MertEgeCAN <m.egecan@hotmail.com>

* SonarQube Code Smell Refactorings (#4)

* refactoring class ClientDnsLookup to fix the code smell.

* refactoring class FetchSessionHandler to fix the code smell.

* refactoring class NetworkClient to fix the code smell.

* Update rat.gradle

* refactoring class ClientDnsLookup to fix the code smell.

* refactoring class FetchSessionHandler to fix the code smell.

* refactoring class NetworkClient to fix the code smell.

Co-authored-by: Jason Gustafson <jason@confluent.io>
Co-authored-by: dogukan <dogukan.altay@jobilla.com>
Co-authored-by: MertEgeCAN <m.egecan@hotmail.com>

* Apache PR 9309 Applied  (#6)

* KAFKA-10503: MockProducer doesn't throw ClassCastException when no partition for topic exists (apache#9309)

Reviewer: Matthias J. Sax <matthias@confluent.io>

* Update rat.gradle

Co-authored-by: Gonzalo Muñoz <gmunozfe@redhat.com>

* Apache PR 8665 Applied  (#8)

* Update build.gradle

* KAFKA-9984 Should fail the subscription when pattern is empty (apache#8665)

Reviewers: Boyang Chen <boyang@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Matthias J. Sax <matthias@confluent.io>

Co-authored-by: zhaohaidao <zhaohaidao2008@hotmail.com>

* SonarQube Code Smell: ClusterConnectionStates.java (#9)

* Update build.gradle

* Code smells fix

* Update ClusterConnectionStates.java

* Update ClusterConnectionStates.java

* Apache PR 9605 Applied (#10)

* Update build.gradle

* KAFKA-10727; Handle Kerberos error during re-login as transient failure in clients (apache#9605)

* Update GssapiAuthenticationTest.scala

Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>

* KAFKA-3720 cherry-pick

* refactoring class PlaintextProducerSendTest.scala to fix the failing test.

* refactoring class PlaintextProducerSendTest.scala to fix the failing test.
refactoring class BaseProducerSendTest.scala to fix the failing test.

* revert refactoring on core changes.

Co-authored-by: Sönke Liebau <soenke.liebau@opencore.com>
Co-authored-by: dogukan <dogukan.altay@jobilla.com>
Co-authored-by: MertEgeCAN <m.egecan@hotmail.com>
Co-authored-by: Lucas Bradstreet <lucas@confluent.io>
Co-authored-by: Jason Gustafson <jason@confluent.io>
Co-authored-by: Gonzalo Muñoz <gmunozfe@redhat.com>
Co-authored-by: zhaohaidao <zhaohaidao2008@hotmail.com>
Co-authored-by: Rajini Sivaram <rajinisivaram@googlemail.com>
MaximGonnissen added a commit to MaximGonnissen/kafka that referenced this pull request May 29, 2022
…sient failure in clients"

Integrated PR from apache/kafka: apache#9605