
Dropbox sync error handling #33965

Merged 9 commits into staging from dropbox-sync-error-handling on May 28, 2020

Conversation

@bencodeorg (Contributor) commented Apr 1, 2020

  1. Adds a new general HighFrequencyReporter class that can be used to report errors (or other high-frequency messages) to Slack.
  2. Uses an instance of that class to limit Slack notifications for Dropbox sync errors so that they fire only a) when an error recurs across repeated attempts to sync between pegasus and Dropbox, and b) at most once every five minutes.
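As a rough sketch of how such a reporter might behave (a hypothetical, simplified shape -- class and method names here are illustrative, not the PR's actual code; the real class lives in lib/cdo/high_frequency_reporter.rb):

```ruby
# Simplified sketch of a high-frequency reporter (hypothetical shape).
# Events are only alertable when they recur across runs, and reporting
# only fires on minutes divisible by the throttle.
class HighFrequencyReporterSketch
  def initialize
    @known_events = [] # events recorded on the previous run
    @new_events = []   # events recorded during this run
  end

  def record(event_name)
    @new_events << {name: event_name, reported_at: Time.now}
  end

  # Returns the names of events worth alerting on: those that also
  # occurred on the previous run, at most once per throttle window.
  # +now+ is injectable here purely to make the sketch testable.
  def report!(throttle = 1, now = Time.now)
    return [] unless now.min % throttle == 0
    known_names = @known_events.map {|e| e[:name]}
    @new_events.map {|e| e[:name]}.select {|name| known_names.include?(name)}
  end

  # Promote this run's events to "known" so the next run can compare.
  def roll_over!
    @known_events = @new_events
    @new_events = []
  end
end
```

A one-off error recorded on a single run never reaches Slack; only an error that survives into a second run, checked at a throttled minute, does.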

I'm also not super happy with the formatting of the error message that gets sent to Slack -- if people have thoughts on how that could be improved, let me know. Here are a couple of examples:

[screenshot: example Slack error messages]

I think part of the problem is a) the error messages aren't super informative, and b) sometimes information is in stdout and sometimes in stderr.

Future work

If this works well, I'd like to move the Slack channel this gets reported to from sync-dropbox-staging to infra-staging or something along those lines.

Testing story

Did lots of local testing here where I made changes to files in two directories, got errors, and successfully reported them to Slack. There are also some unit tests for the new class. I think some things will be difficult to test without putting this "into the wild".

It will continue to report to sync-dropbox-staging at first -- we can move the reporting to somewhere that the DOTD would be more likely to see if we're happy with how it performs.

Reviewer Checklist:

  • Tests provide adequate coverage
  • Code is well-commented
  • New features are translatable or updates will not break translations
  • Relevant documentation has been added or updated
  • User impact is well-understood and desirable
  • Pull Request is labeled appropriately
  • Follow-up work items (including potential tech debt) are tracked and linked

  while current_time - attempt_start < INTERVAL_SECONDS && (current_time - SCRIPT_START) < TOTAL_SECONDS
    sleep 0.1
    current_time = Time.now
  end
end

# Reports to Slack if current minute is a multiple of 5
logger.report! 5
Contributor Author:

Currently, I think the way this is set up is that it will report errors that a) were happening before this run of sync_dropbox, b) are still happening after the script runs unison repeatedly for a minute, and c) the current minute is a multiple of 5 (say x:50 or x:55).

I think this is potentially not ideal in that an error that first occurs at x:50 or x:51 would not get reported until x:55.
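Concretely, the minute-multiple check only fires on minutes divisible by the throttle, so the worst-case delay before a report is throttle minus one minutes:

```ruby
# Illustration of the minute-multiple throttle: with throttle = 5, only
# minutes 0, 5, 10, ... fire, so an error first seen at x:51 is not
# reported until x:55.
def reports_at?(minute, throttle)
  minute % throttle == 0
end

firing_minutes = (50..59).select {|m| reports_at?(m, 5)}
# firing_minutes == [50, 55]
```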

Contributor:

Is the reasoning for the five-minute gap that we want to give time to fix an error before reporting it again? I don't think waiting 5 minutes should be a big deal, since as of now the issues don't seem urgent.

Contributor Author:

Yeah, most (actually all) of the errors that have occurred since Jessie and Winter set up the new process have self-resolved, but there is a case where an error could continue indefinitely (one example, I think, is if both pegasus and Dropbox edit the same file and Unison doesn't know how to resolve the conflict).

So this is supposed to cover that case, and only report it every 5 minutes until it's resolved.

Contributor:

Then I think this approach makes sense!

@bencodeorg bencodeorg requested a review from a team April 1, 2020 20:17
@new_events << {name: event_name, reported_at: Time.now}
end

def report!(throttle = 1)
Contributor:

nit: add a comment explaining what throttle means here

Contributor Author:

Done!

@molly-moen (Contributor) left a comment:

Looks good! Can you add a screenshot of what the message looks like in Slack? It's hard to tell from the code.

pegasus directory: #{value}
stdout: #{stdout}
stderr: #{stderr}
ERROR_MSG
Contributor:

I think this error message is fine. If you wanted to get fancier you could only show stdout/stderr if it was not empty, but I don't think that's necessary
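The "only show stdout/stderr when non-empty" idea could look something like this (a hypothetical helper; the PR's actual message is built with a heredoc):

```ruby
# Hypothetical helper implementing the suggestion above: include the
# stdout/stderr lines only when those streams actually contain output.
def format_sync_error(directory, stdout, stderr)
  lines = ["pegasus directory: #{directory}"]
  lines << "stdout: #{stdout}" unless stdout.to_s.strip.empty?
  lines << "stderr: #{stderr}" unless stderr.to_s.strip.empty?
  lines.join("\n")
end
```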

Contributor:

There is value in showing both stdout and stderr. When the error messages aren't very informative, seeing both outputs helps eliminate the doubt that there might be more info elsewhere.

end

# Loads known events from previous runs from a file on disk
def load
Contributor:

Suggestion: return a true/false value so the caller knows whether this method succeeded. Also, add a comment noting that this method swallows exceptions.
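The suggested shape might look like this (a hypothetical sketch -- the method name, file path, and on-disk format here are assumptions, not the PR's actual code):

```ruby
require 'json'

# Hypothetical sketch of the suggested load: returns true/false so the
# caller knows whether it succeeded. NOTE: this swallows exceptions
# (e.g. a missing or corrupt events file) so a failed load never
# breaks the sync itself.
def load_events(path)
  @known_events = JSON.parse(File.read(path), symbolize_names: true)
  true
rescue StandardError
  @known_events = []
  false
end
```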

Contributor Author:

Been neglecting this for a month :) Just pushed some changes to get at some of your comments, take a look when you have a minute!

# @param [Integer] throttle
def report!(throttle = 1)
  if Time.now.min % throttle == 0
    alertable_events.each {|e| @chat_client.message(e, {channel: @channel})}
Contributor:

Do you clear events after reporting them?

(Resolved review threads on lib/cdo/high_frequency_reporter.rb)

# The second time the same error occurs, it should be reported
# to Slack. We add a mock expectation to check that.
fake_slack.expect :message, nil, [String, Hash]
Contributor:

TIL: expect uses === to check argument types (String and Hash in this case).
https://www.rubydoc.info/gems/minitest/5.11.3/Minitest%2FMock:expect

Contributor Author:

Whoa... === does not do what I would expect it to in Ruby.
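For the record, this is standard Ruby semantics: === is "case equality", and several core classes override it, which is exactly what lets Minitest::Mock treat a class as an argument matcher:

```ruby
# === is case equality in Ruby, not plain equality. Core classes
# override it, which is why Minitest::Mock#expect can take String or
# Hash as an argument matcher.
String === "hello"   #=> true  (Class#=== checks is_a?)
/err/ === "stderr"   #=> true  (Regexp#=== checks for a match)
(1..10) === 5        #=> true  (Range#=== checks inclusion)
"hello" === String   #=> false (=== is not symmetric)
```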

@hacodeorg (Contributor) left a comment:

Looking good!

@bencodeorg bencodeorg merged commit 7c25885 into staging May 28, 2020
@bencodeorg bencodeorg deleted the dropbox-sync-error-handling branch May 28, 2020 20:23
3 participants