
KAFKA-4134: log ConnectException at WARN #1829

Closed · wants to merge 2 commits

Conversation

cotedm (Contributor) commented Sep 6, 2016

Simply log the connection refused instance. If we're worried about spamming users, I can add a flag to make sure we only log this exception once, but the initial change is to simply log what we're given. @ijuma looks like you were last to touch this code, would you mind having a look?

ewencp (Contributor) commented Nov 29, 2016

@cotedm This is still potentially very spammy, right? Default reconnect backoff is 50ms which means you'll get 20 of these (with stacktrace) per second that you have a connectivity issue.

We'd probably benefit from some helper that can rate limit certain log calls without permanently turning them off after the first time one is logged. Otherwise you can be missing information at a much later time due to a very short transient network issue at some completely unrelated time.
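For illustration, a helper along those lines might look roughly like the sketch below. This is not code from this PR; the class name RateLimitedWarn, the cooldown parameter, and the suppressed-count message are assumptions. It logs at most once per cooldown window and still reports how much was skipped, so nothing is permanently turned off:

    import java.util.concurrent.atomic.AtomicLong;
    import org.slf4j.Logger;

    // Sketch only: suppress repeats of a WARN within `cooldownMs`,
    // then report how many occurrences were skipped.
    public class RateLimitedWarn {
        private final long cooldownMs;
        private final AtomicLong lastLoggedMs = new AtomicLong(0);
        private final AtomicLong suppressed = new AtomicLong(0);

        public RateLimitedWarn(long cooldownMs) {
            this.cooldownMs = cooldownMs;
        }

        public void warn(Logger log, String format, Object... args) {
            long now = System.currentTimeMillis();
            long last = lastLoggedMs.get();
            if (now - last >= cooldownMs && lastLoggedMs.compareAndSet(last, now)) {
                long skipped = suppressed.getAndSet(0);
                if (skipped > 0)
                    log.warn("(suppressed {} similar messages in the last {} ms)", skipped, cooldownMs);
                log.warn(format, args);
            } else {
                suppressed.incrementAndGet();
            }
        }
    }

With the default 50 ms reconnect backoff, a one-second cooldown would turn roughly 20 stack traces per second into one, plus a count of what was suppressed.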

cotedm (Contributor, Author) commented Nov 30, 2016

@ewencp I should think ConnectException would only happen if the remote process stopped listening, but you'd still be spamming users in that case. Let me add a general wrapper with a cooldown for log messages and use it here. Then if we find other spammy things, we can reuse it hopefully.

asfbot commented Dec 12, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/75/
Test FAILed (JDK 7 and Scala 2.10).

asfbot commented Dec 12, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/76/
Test FAILed (JDK 8 and Scala 2.12).

asfbot commented Dec 12, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/77/
Test FAILed (JDK 8 and Scala 2.11).

cotedm (Contributor, Author) commented Dec 12, 2016

@ewencp I've added a LogRateLimiter utility with the method I need as a wrapper. I think it's probably useful for other scenarios too, so I kept it in commons/util, but if that doesn't make sense please let me know. I also hardcoded 1000 messages as the number we'll skip, since I don't see a reason to make it configurable. It might make sense to alert the user that log messages have been suppressed, though; what do you think?

* @param max maximum number of times we try to log the message before suggesting a count reset
* @return boolean indicating if the count should be reset or not
*/
public static boolean warn(Logger logger, String format, Object arg1, Object arg2,
Review comment (Contributor):
You might want to swap the order of the logger args and the count/max args. There are multiple overloads in slf4j and this is only one of them. There's also (String), (String, Object), (String, Object...), and (String, Throwable). The (String, Object...) version is actually the one you want to use as the most general version, the others exist purely as an optimization.
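For illustration, a reordered signature along the lines this comment suggests might look like the sketch below; the parameter names and the return-value convention are carried over from the javadoc excerpt above as assumptions, not the final API:

    // Sketch only: count/max come ahead of the logging arguments, so a single
    // (String, Object...) entry point covers every slf4j overload. slf4j 1.6+
    // treats a trailing Throwable in the varargs as the exception to print.
    public static boolean warn(int count, int max, Logger logger, String format, Object... args) {
        if (count < max)
            return false;          // still within the suppression window; skip
        logger.warn(format, args); // the most general slf4j overload
        return true;               // suggest the caller reset its count
    }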

if (e instanceof ConnectException) {
    if (LogRateLimiter.warn(log, "Cannot connect to {}", desc, e,
            loggerCount, loggerCountMax))
        loggerCount++;
Review comment (Contributor):
Is this doing what you expect it to? You're returning true from the LogRateLimiter when the count > max. But that won't be true early on, which means you'll keep resetting the loggerCount to 0. Won't this just keep the loggerCount at 0 permanently?
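For illustration, one reading of the intended caller-side bookkeeping, following the javadoc's "should the count be reset" convention rather than the code as posted (this is a sketch, not the fix applied in the PR):

    // Sketch: grow the counter on every suppressed call and reset it only
    // once the limiter signals that it logged / wants a reset.
    if (e instanceof ConnectException) {
        if (LogRateLimiter.warn(log, "Cannot connect to {}", desc, e,
                loggerCount, loggerCountMax))
            loggerCount = 0;   // limiter asked for a reset; start a new window
        else
            loggerCount++;     // suppressed; remember how many we skipped
    }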

Review comment (Contributor):
Also, not sure if we can make this easy, but it'd really be ideal if we could keep the logging line as a single line of code without requiring additional logic by the caller. (Maybe even if this requires allocating a special object to do that.)
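One shape that keeps the call site to a single line is to move the counter into a small object the caller holds. The sketch below uses assumed names (SuppressingLogger, logEvery) and is not code from this PR:

    import org.slf4j.Logger;

    // Sketch: the suppression state lives inside the object, so the caller
    // needs no extra branching. Not thread-safe; illustration only.
    public class SuppressingLogger {
        private final Logger delegate;
        private final int logEvery;   // emit 1 out of every `logEvery` calls
        private long calls = 0;

        public SuppressingLogger(Logger delegate, int logEvery) {
            this.delegate = delegate;
            this.logEvery = logEvery;
        }

        public void warn(String format, Object... args) {
            if (calls++ % logEvery == 0)
                delegate.warn(format, args);
        }
    }

    // Call site stays a single line:
    // suppressingLog.warn("Cannot connect to {}", desc, e);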

*/
public class LogRateLimiter {

/**
Review comment (Contributor):
Would it be worth giving the rate limiter two levels: the first level logs every error, and the second logs only every nth error?

My concern is that simply cutting off logs entirely once we hit a certain # of messages can mask later messages. You don't want to mask those entirely, you just want to cut them off. I think any approach using timestamps is probably going to get too complicated. But I think still printing every 1000th message would be useful so you eventually see the problem.
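A minimal sketch of that two-level behaviour (log everything up to a threshold, then only every nth occurrence); the class name and the example constants are assumptions, not part of the PR:

    // Sketch: always log the first `verboseLimit` occurrences, then log only
    // every `sampleEvery`-th one so later problems still surface eventually.
    public class TwoLevelLogLimiter {
        private final int verboseLimit;   // e.g. 10
        private final int sampleEvery;    // e.g. 1000
        private long seen = 0;

        public TwoLevelLogLimiter(int verboseLimit, int sampleEvery) {
            this.verboseLimit = verboseLimit;
            this.sampleEvery = sampleEvery;
        }

        // Returns true if this occurrence should be logged.
        public boolean shouldLog() {
            long n = seen++;
            return n < verboseLimit || (n - verboseLimit) % sampleEvery == 0;
        }
    }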

ewencp added the connect label Feb 28, 2018
cotedm closed this Dec 20, 2019