-
Notifications
You must be signed in to change notification settings - Fork 13.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KAFKA-5417] Clients get inconsistent connection states when SASL/SSL… #3282
Conversation
@guozhangwang please have a check at your convinence |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
@dongeforever : Thanks for the finding. Great catch. The patch looks good. Perhaps we could add a comment in ConnectionState about state transition (e.g., we can transition from any of CONNECTING, CHECKING_API_VERSIONS, READY directly to DISCONNECTED). |
Iterator<String> connectedIt = connected.iterator(); | ||
while (connectedIt.hasNext()) { | ||
if (disconnected.containsKey(connectedIt.next())) { | ||
log.warn("Channel {} is marked connected and disconnected at the same time"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this warning indicate? Should users be doing something about it? If not, maybe it should be logged at a lower level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. It is an internal class though. Users need to do nothing. I will update it.
while (connectedIt.hasNext()) { | ||
if (disconnected.containsKey(connectedIt.next())) { | ||
log.warn("Channel {} is marked connected and disconnected at the same time"); | ||
connectedIt.remove(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this channel was added to the connected
set only in this poll()
, no one knows about this channel yet. So perhaps we should remove it from disconnected
as well to avoid notifying disconnection of an unknown channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was debating the same thing. Won't NetworkClient
keep the node under the CONNECTING
state though? It seems like either approach involves a change in the contract that could affect users who are not expecting it. It's an internal class though, so we just need to make sure that the affected Kafka code is updated (if necessary).
It would be nice to include a test for this so that we can verify that things truly work under this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ijuma Yes, you are right, NetworkClient
does need to be notified. Ignore my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ijuma I want to add a test, but unfortunately it is hard to mock such a network environment.
I have tested it many times in my company's LVS Proxy env.
Do you have some suggestions about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could keep it simple and add a test for Selector.pollSelectedKeys
where SelectionKey
and the channel returned by the SelectionKey
are mocked. What do you think?
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Iterator<String> connectedIt = connected.iterator(); | ||
while (connectedIt.hasNext()) { | ||
if (disconnected.containsKey(connectedIt.next())) { | ||
log.debug("Channel {} is marked connected and disconnected at the same time"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The channel is not being passed to log.debug
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ijuma Thanks. It has been polished
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
@@ -335,6 +335,15 @@ public void poll(long timeout) throws IOException { | |||
// we use the time at the end of select to ensure that we don't close any connections that | |||
// have just been processed in pollSelectionKeys | |||
maybeCloseOldestConnection(endSelect); | |||
|
|||
Iterator<String> connectedIt = connected.iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a little cleaner to avoid the inconsistency in the first place rather than fixing it after the fact. Have you considered calling connected.remove(channel)
when we add the channel to disconnected
in doClose
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongeforever, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hachikuji @ijuma Agreed to avoid inconsistency in the first place. And it may be a little better to do it in close rather than in doClose, for doClose maybe a delayed operation and close is the first place to notify the ConnectionStates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ijuma @hachikuji what do you think about the updated version?
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
@@ -555,6 +555,9 @@ private void close(KafkaChannel channel, boolean processOutstanding) { | |||
|
|||
channel.disconnect(); | |||
|
|||
//avoid inconsistent connection states, see KAFKA-5417 | |||
connected.remove(channel.id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was debating whether doClose
is the more appropriate spot for this since that is where we actually close the channel and add the id to the disconnected
collection. We have this "closing" state below in which we await pending receives before closing the channel. I am not sure if it is better to allow a connection in that state to be reported as connected or not. It probably doesn't matter too much for the specific bug reported since we wouldn't have any pending receives, but I guess we should still make sure this the state transitions are consistent. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, there is not too much difference to do it in "doClose" and "close". Maybe both are ok.
Any channel in "connected" will be reported as connected.
Since it is going to close such channel, it is better to prevent it to be marked as connected than close it after it has been actually marked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can't think of a strong reason for either option, so I guess we can leave it here.
@dongeforever Don't forget about Jun's comment from above:
We also should try to come up with a reasonable test case. Ismael suggested previously:
Would that work? |
@hachikuji It seems that Kafka dose not introduce any mock framework such as mockito. |
@dongeforever You can use |
@dongeforever, do you think you'll be able to provide a test? If not, we can help. |
@ijuma Thank you. A little busy these days. But I want to try it by myself firstly. And I will go back for your help if there is trouble. |
@dongeforever any luck writing the test? We are planning to do a 0.11.0.1 release soon and it would be good to include this fix. |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
… connection is marked CONECTED and DISCONNECTED at the same time
e0d0d6b
to
927cf04
Compare
@ijuma Sorry for being so late. I have added a unit test, named as "testAvoidInconsisConnectionStates" in SelectorTest, for this issue. IMO, the easiest way is to test the private method "pollSelectionKeys" with a mocked SelectionKey and mocked KafkaChannel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test. I've made some minor changes (avoid reflection, tweak comments, remove some unnecessary code) and merged to trunk and 0.11.0.
If prepare throws an exception in the same poll when the connection is established, the channel id should be in `disconnected`, but not in `connected`. Author: dongeforever <dongeforever@apache.org> Reviewers: Ismael Juma <ismael@juma.me.uk> Closes #3282 from dongeforever/KAFKA-5417 (cherry picked from commit ea21752) Signed-off-by: Ismael Juma <ismael@juma.me.uk>
Sorry, I made a mistake during merging and didn't mention all the reviewers (Jun, Jason, Rajini). |
@ijuma Thanks |
… connection is marked CONECTED and DISCONNECTED at the same time
details are in:
https://issues.apache.org/jira/browse/KAFKA-5417