
Dropbox sync error handling #33965

Merged 9 commits into staging from dropbox-sync-error-handling on May 28, 2020

Conversation

@bencodeorg (Contributor) commented Apr 1, 2020

  1. Adds a new general HighFrequencyReporter class that can be used to report errors (or other high-frequency messages) to Slack.
  2. Uses an instance of that class to limit Slack notifications for Dropbox sync errors so that they fire only a) when an error recurs across repeated attempts to sync between pegasus and Dropbox, and b) at most once every five minutes.
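As a rough sketch of how such a reporter might behave (a hypothetical, simplified shape -- class and method names here are illustrative, not the PR's actual code; the real class lives in lib/cdo/high_frequency_reporter.rb):

```ruby
# Simplified sketch of a high-frequency reporter (hypothetical shape).
# Events are only alertable when they recur across runs, and reporting
# only fires on minutes divisible by the throttle.
class HighFrequencyReporterSketch
  def initialize
    @known_events = [] # events recorded on the previous run
    @new_events = []   # events recorded during this run
  end

  def record(event_name)
    @new_events << {name: event_name, reported_at: Time.now}
  end

  # Returns the names of events worth alerting on: those that also
  # occurred on the previous run, at most once per throttle window.
  # +now+ is injectable here purely to make the sketch testable.
  def report!(throttle = 1, now = Time.now)
    return [] unless now.min % throttle == 0
    known_names = @known_events.map {|e| e[:name]}
    @new_events.map {|e| e[:name]}.select {|name| known_names.include?(name)}
  end

  # Promote this run's events to "known" so the next run can compare.
  def roll_over!
    @known_events = @new_events
    @new_events = []
  end
end
```

A one-off error recorded on a single run never reaches Slack; only an error that survives into a second run, checked at a throttled minute, does.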

I'm also not super happy with the formatting of the error message that gets sent to Slack -- if people have thoughts on how that could be improved, let me know. Here are a couple of examples:

[screenshot: example Slack error messages]

I think part of the problem is a) the error messages aren't super informative, and b) sometimes information is in stdout and sometimes in stderr.

Future work

If this works well, I'd like to move the Slack channel this gets reported to from sync-dropbox-staging to infra-staging or something along those lines.

Testing story

Did lots of local testing here where I made changes to files in two directories, got errors, and successfully reported them to Slack. There are also some unit tests for the new class. I think some things will be difficult to test without putting this "into the wild".

It will continue to report to sync-dropbox-staging at first -- we can move the reporting to somewhere that the DOTD would be more likely to see if we're happy with how it performs.

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

  while current_time - attempt_start < INTERVAL_SECONDS && (current_time - SCRIPT_START) < TOTAL_SECONDS
    sleep 0.1
    current_time = Time.now
  end
end

# Reports to Slack if current minute is a multiple of 5
logger.report! 5
Contributor Author:

Currently, I think the way this is set up is that it will report errors that a) were happening before this run of sync_dropbox, b) are still happening after the script runs unison repeatedly for a minute, and c) the current minute is a multiple of 5 (say x:50 or x:55).

I think this is potentially not ideal in that an error that first occurs at x:50 or x:51 would not get reported until x:55.
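Concretely, the minute-multiple check only fires on minutes divisible by the throttle, so the worst-case delay before a report is throttle minus one minutes:

```ruby
# Illustration of the minute-multiple throttle: with throttle = 5, only
# minutes 0, 5, 10, ... fire, so an error first seen at x:51 is not
# reported until x:55.
def reports_at?(minute, throttle)
  minute % throttle == 0
end

firing_minutes = (50..59).select {|m| reports_at?(m, 5)}
# firing_minutes == [50, 55]
```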

Contributor:

Is the reasoning for the five-minute gap that we want to give time to fix an error before reporting it again? I don't think waiting 5 minutes should be a big deal, since as of now the issues don't seem urgent.

Contributor Author:

Yeah, most (actually all) of the errors that have occurred since Jessie and Winter set up the new process have self-resolved, but there is a case where an error could continue indefinitely (one example, I think, is if both pegasus and Dropbox edit the same file and Unison doesn't know how to resolve the conflict).

So this is supposed to cover that case, and only report it every 5 minutes until it's resolved.

Contributor:

Then I think this approach makes sense!

@bencodeorg bencodeorg requested a review from a team April 1, 2020 20:17
@new_events << {name: event_name, reported_at: Time.now}
end

def report!(throttle = 1)
Contributor:

nit: add a comment explaining what throttle means here

Contributor Author:

Done!

@molly-moen (Contributor) left a comment:

Looks good! Can you add a screenshot of what the message looks like in Slack? It's hard to tell from the code.

pegasus directory: #{value}
stdout: #{stdout}
stderr: #{stderr}
ERROR_MSG
Contributor:

I think this error message is fine. If you wanted to get fancier you could only show stdout/stderr if it was not empty, but I don't think that's necessary
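The "only show stdout/stderr when non-empty" idea could look something like this (a hypothetical helper; the PR's actual message is built with a heredoc):

```ruby
# Hypothetical helper implementing the suggestion above: include the
# stdout/stderr lines only when those streams actually contain output.
def format_sync_error(directory, stdout, stderr)
  lines = ["pegasus directory: #{directory}"]
  lines << "stdout: #{stdout}" unless stdout.to_s.strip.empty?
  lines << "stderr: #{stderr}" unless stderr.to_s.strip.empty?
  lines.join("\n")
end
```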

Contributor:

There is value in showing both stdout and stderr. When the error messages aren't very informative, seeing both outputs helps eliminate the doubt that there might be more info elsewhere.

end

# Loads known events from previous runs from a file on disk
def load
Contributor:

Suggestion: return a true/false value so the caller knows whether this method succeeded. Also, add a comment noting that this method swallows exceptions.
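The suggested shape might look like this (a hypothetical sketch -- the method name, file path, and on-disk format here are assumptions, not the PR's actual code):

```ruby
require 'json'

# Hypothetical sketch of the suggested load: returns true/false so the
# caller knows whether it succeeded. NOTE: this swallows exceptions
# (e.g. a missing or corrupt events file) so a failed load never
# breaks the sync itself.
def load_events(path)
  @known_events = JSON.parse(File.read(path), symbolize_names: true)
  true
rescue StandardError
  @known_events = []
  false
end
```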

Contributor Author:

Been neglecting this for a month :) Just pushed some changes to get at some of your comments, take a look when you have a minute!

# @param [Integer] throttle
def report!(throttle = 1)
  if Time.now.min % throttle == 0
    alertable_events.each {|e| @chat_client.message(e, {channel: @channel})}
Contributor:

Do you clear events after reporting them?

(Resolved review threads on lib/cdo/high_frequency_reporter.rb)

# The second time the same error occurs, it should be reported
# to Slack. We add a mock expectation to check that.
fake_slack.expect :message, nil, [String, Hash]
Contributor:

TIL: expect uses === to check argument types (String and Hash in this case).
https://www.rubydoc.info/gems/minitest/5.11.3/Minitest%2FMock:expect

Contributor Author:

Whoa... === does not do what I would expect it to in Ruby.
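For the record, this is standard Ruby semantics: === is "case equality", and several core classes override it, which is exactly what lets Minitest::Mock treat a class as an argument matcher:

```ruby
# === is case equality in Ruby, not plain equality. Core classes
# override it, which is why Minitest::Mock#expect can take String or
# Hash as an argument matcher.
String === "hello"   #=> true  (Class#=== checks is_a?)
/err/ === "stderr"   #=> true  (Regexp#=== checks for a match)
(1..10) === 5        #=> true  (Range#=== checks inclusion)
"hello" === String   #=> false (=== is not symmetric)
```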

@hacodeorg (Contributor) left a comment:

Looking good!

@bencodeorg bencodeorg merged commit 7c25885 into staging May 28, 2020
@bencodeorg bencodeorg deleted the dropbox-sync-error-handling branch May 28, 2020 20:23
3 participants