-
Notifications
You must be signed in to change notification settings - Fork 13.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-4134: log ConnectException at WARN #1829
Conversation
@cotedm This is still potentially very spammy, right? Default reconnect backoff is 50ms which means you'll get 20 of these (with stacktrace) per second that you have a connectivity issue. We'd probably benefit from some helper that can rate limit certain log calls without permanently turning them off after the first time one is logged. Otherwise you can be missing information at a much later time due to a very short transient network issue at some completely unrelated time. |
@ewencp I should think ConnectException would only happen if the remote process stopped listening, but you'd still be spamming users in that case. Let me add a general wrapper with a cooldown for log messages and use it here. Then if we find other spammy things, we can reuse it hopefully. |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
@ewencp I've added a LogRateLimiter utility with the method I would need as a wrapper. I think it's probably useful to have it for other scenarios so I kept it in commons/util, but if that doesn't make sense please let me know. I also hardcoded 1000 messages as the number we'll skip since I don't see a reason to make it configurable, but it might make sense to alert the user that the log message has been suppressed, what do you think? |
* @param max maximum number of times we try to log the message before suggesting a count reset | ||
* @return boolean indicating if the count should be reset or not | ||
*/ | ||
public static boolean warn(Logger logger, String format, Object arg1, Object arg2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to swap the order of the logger args and the count
/max
args. There are multiple overloads in slf4j
and this is only one of them. There's also (String)
, (String, Object)
, (String, Object...)
, and (String, Throwable)
. The (String, Object...)
version is actually the one you want to use as the most general version, the others exist purely as an optimization.
if (e instanceof ConnectException) { | ||
if (LogRateLimiter.warn(log, "Cannot connect to {}", desc, e, | ||
loggerCount, loggerCountMax)) | ||
loggerCount++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this doing what you expect it to? You're returning true
from the LogRateLimiter
when the count > max
. But that won't be true early on, which means you'll keep resetting the loggerCount
to 0
. Won't this just keep the loggerCount
at 0
permanently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, not sure if we can make this easy, but it'd really be ideal if we could keep the logging line as a single line of code without requiring additional logic by the caller. (Maybe even if this requires allocating a special object to do that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*/ | ||
public class LogRateLimiter { | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth having the rate limiter have 2 levels: the first level logs every error and the second logs every nth error?
My concern is that simply cutting off logs entirely once we hit a certain # of messages can mask later messages. You don't want to mask those entirely, you just want to cut them off. I think any approach using timestamps is probably going to get too complicated. But I think still printing every 1000th message would be useful so you eventually see the problem.
Simply log the connection refused instance. If we're worried about spamming users, I can add a flag to make sure we only log this exception once, but the initial change is to simply log what we're given. @ijuma looks like you were last to touch this code, would you mind having a look?