New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dropbox sync error handling #33965
Dropbox sync error handling #33965
Conversation
while current_time - attempt_start < INTERVAL_SECONDS && (current_time - SCRIPT_START) < TOTAL_SECONDS | ||
sleep 0.1 | ||
current_time = Time.now | ||
end | ||
end | ||
|
||
# Reports to Slack if current minute is a multiple of 5 | ||
logger.report! 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, I think the way this is set up is that it will report errors that a) were happening before this run of sync_dropbox
, b) are still happening after the script runs unison
repeatedly for a minute, and c) the current minute is a multiple of 5 (say x:50 or x:55).
I think this is potentially not ideal in that an error that first occurs at x:50 or x:51 would not get reported until x:55.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the reasoning for the five minute gap that we want to give time to fix an error before reporting it again? I think it shouldn't be too big of a deal to wait 5 minutes, since as of now the issues seem to not be urgent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, most (actually all) of the errors that have occurred since Jessie and Winter set up the new process have self-resolved, but there is a case where an error could continue indefinitely (I think one example is if both pegasus and Dropbox edit the same file, and Unison doesn't know how to resolve).
So this is supposed to cover that case, and only report it every 5 minutes until it's resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then I think this approach makes sense!
@new_events << {name: event_name, reported_at: Time.now} | ||
end | ||
|
||
def report!(throttle = 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a comment explaining what throttle means here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Can you add a screenshot of what the message looks like in slack? Hard to tell from the code
pegasus directory: #{value} | ||
stdout: #{stdout} | ||
stderr: #{stderr} | ||
ERROR_MSG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this error message is fine. If you wanted to get fancier you could only show stdout/stderr if it was not empty, but I don't think that's necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is value in showing both stdout & stderr. When the error messages are not super informative, seeing both outputs helps eliminating the doubt that there might be more info elsewhere.
end | ||
|
||
# Loads known events from previous runs from a file on disk | ||
def load |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns a true/false value so the caller knows if this method succeeds or not. Also, adds a comment that this method swallows exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Been neglecting this for a month :) Just pushed some changes to get at some of your comments, take a look when you have a minute!
# @param [Integer] throttle | ||
def report!(throttle = 1) | ||
if Time.now.min % throttle == 0 | ||
alertable_events.each {|e| @chat_client.message(e, {channel: @channel})} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you clear events after reporting them?
pegasus directory: #{value} | ||
stdout: #{stdout} | ||
stderr: #{stderr} | ||
ERROR_MSG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is value in showing both stdout & stderr. When the error messages are not super informative, seeing both outputs helps eliminating the doubt that there might be more info elsewhere.
|
||
# The second time the same error occurs, it should be reported | ||
# to Slack. We add a mock expectation to check that. | ||
fake_slack.expect :message, nil, [String, Hash] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL exepct
use ===
to check argument types (String and Hash in this case).
https://www.rubydoc.info/gems/minitest/5.11.3/Minitest%2FMock:expect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa... === does not do what I would expect it to in Ruby.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
HighFrequencyReporter
class that can be used to report errors (or other high frequency messages) to Slack.I'm also not super happy with the formatting of the error message that gets sent to Slack, if people have thoughts on how that could be improved -- here are a couple examples:
I think part of the problem is a) the error messages aren't super informative, and b) sometimes information is in
stdout
and sometimes instderr
.Future work
If this works well, I'd like to move the Slack channel this gets reported to from
sync-dropbox-staging
toinfra-staging
or something along those lines.Testing story
Did lots of local testing here where I made changes to files in two directories, got errors, and successfully reported them to Slack. There are also some unit tests for the new class. I think some things will be difficult to test without putting this "into the wild".
It will continue to report to
sync-dropbox-staging
at first -- we can move the reporting to somewhere that the DOTD would be more likely to see if we're happy with how it performs.Reviewer Checklist: