
process_repeater_stub() task #28978

Merged: 17 commits from nh/rep/process_repeater_stub into master on Jan 28, 2021
Conversation

kaapstorm
Contributor

Summary

This PR is the next in a series to implement [CEP] Migrate RepeatRecord to SQL.

Previous PR: SQLRepeatRecord

no-obligation fyi: @orangejenny @millerdev @snopoke

I'm opening this as a draft PR because there are two things I'm looking for feedback on:

1. When to back off

Currently, if 10 forms are submitted and forwarded to an endpoint that is offline, the endpoint is attempted 10 times immediately, 10 times again about an hour later, 10 times again three hours after that, and so on, with the interval multiplied by 3 each time, for up to five days, after which all 10 repeat records are cancelled.

This PR changes that behaviour. Exponential backoff is tracked on a RepeaterStub model instead of on each repeat record independently: if 10 forms are submitted, the first form will be forwarded. If the endpoint is offline, only that form will be retried an hour later, then three hours later, and so on. If it succeeds within five days, the remaining 9 forms will be sent. If it continues to fail, it will be cancelled, and HQ will try to send just the second form.

The idea is to dramatically reduce the rate of send attempts that are likely to fail. It also sets up a better foundation for automatically pausing repeaters that appear to be permanently offline.
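Roughly, the schedule works like the sketch below. The names here (backoff_attempts, MIN_RETRY_WAIT, and so on) are illustrative, not the actual fields and constants in the code:

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_RETRY_WAIT = timedelta(hours=1)     # first retry about an hour after the failure
MAX_RETRY_DURATION = timedelta(days=5)  # give up on a record after ~5 days of failures
BACKOFF_MULTIPLIER = 3                  # each wait is three times the previous one


def next_attempt_at(first_failure_at: datetime, backoff_attempts: int) -> Optional[datetime]:
    """Return when the repeater stub should be attempted next, or None if
    the current repeat record should be cancelled so that the next record
    can be tried.
    """
    now = datetime.utcnow()
    if now - first_failure_at > MAX_RETRY_DURATION:
        return None  # cancel this record; HQ will move on to the next one
    wait = MIN_RETRY_WAIT * BACKOFF_MULTIPLIER ** backoff_attempts
    return now + wait
```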

My biggest concern is to make sure that we never back off when the server is fine and the error is caused by the payload. The check_repeaters() task runs every five minutes, so the code differentiates between errors to back off on and errors to retry on the next check_repeaters() call. My intention is that bad payloads can be cancelled fast (ideally in about half an hour: six attempts, each about five minutes apart), so that bad payloads can't hold up good payloads for five days!

I am open to the idea of not retrying some kinds of errors, but in my experience some third-party servers are under-resourced, and retrying a few minutes later can turn a "500" into a "201". Reducing the number of retries for some kinds of errors might be a happy compromise. I'm also open to taking this conversation somewhere else, and pulling in some USH AEs.
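As a rough sketch of that split, with illustrative method names (set_backoff(), requeue(), cancel() and num_attempts stand in for whatever the real API ends up being):

```python
import requests

# ~Half an hour of retries for errors that don't back off, at five-minute
# check_repeaters() intervals (see above); the exact number is up for debate.
MAX_PAYLOAD_ATTEMPTS = 6


def handle_send_failure(repeater_stub, repeat_record, error):
    if isinstance(error, (requests.ConnectionError, requests.Timeout)):
        # The remote server looks unreachable: back off the whole repeater
        # so we stop making send attempts that are likely to fail.
        repeater_stub.set_backoff()
    elif repeat_record.num_attempts >= MAX_PAYLOAD_ATTEMPTS:
        # The server is responding but this record keeps failing (probably
        # a bad payload): cancel it quickly so it can't hold up good
        # payloads for five days.
        repeat_record.cancel()
    else:
        # Retry on the next check_repeaters() call, about five minutes away.
        repeat_record.requeue()
```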

2. How to improve slow tests

I am very interested in alternatives to the test suite in the "Slooow tests to test backoff" commit. It takes 11 seconds to run five tests with REUSE_DB=True. The tests follow the approach used by these Repeater tests. The lowest effort would probably be to submit the form in setUpClass() and just requeue it in setUp(). But I'm curious to know if there is a better, completely different approach that covers the behaviour these tests are meant to verify.

Do you know faster, better ways to test the same functionality?
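For reference, the lowest-effort option would look roughly like this; _submit_form_and_get_record() is a placeholder for whatever the existing fixtures do, and requeue() is assumed to put the record back in its pending state:

```python
from django.test import TestCase


class BackoffTests(TestCase):

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Expensive part: submit the form, and create its repeat record,
        # once for the whole class instead of once per test.
        cls.repeat_record = cls._submit_form_and_get_record()

    def setUp(self):
        super().setUp()
        # Cheap part: reset the record to "pending" so each test starts
        # from a clean slate.
        self.repeat_record.requeue()

    def test_backs_off_when_endpoint_is_offline(self):
        ...  # exercise process_repeater_stub() against an offline endpoint
```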

Safety Assurance

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change
  • If QA is part of the safety story, the "Awaiting QA" label is used
  • I am certain that this PR will not introduce a regression for the reasons below

Automated test coverage

The new code is covered by automated tests. (Please point out important gaps in testing in PR feedback.)

Safety story

Other than through unit tests, none of this code is reachable yet.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

@kaapstorm
Contributor Author

Reducing the number of retries for some kinds of errors might be a happy compromise.

See the "6 backoff attempts, 3 normal attempts" commit.

@kaapstorm kaapstorm marked this pull request as ready for review January 24, 2021 19:29
@kaapstorm kaapstorm added the product/invisible Change has no end-user visible impact label Jan 24, 2021
@kaapstorm
Contributor Author

ping @dannyroberts

@millerdev (Contributor) left a comment


Nice! Very happy to see more models getting migrated from Couch to SQL.

process_repeater_stub.delay(repeater_stub)


def get_payload(repeater: Repeater, repeat_record: SQLRepeatRecord) -> Any:
Contributor
Is this return type really Any? What kinds of values are typical? (for my curiosity only)

Contributor Author
The return value of anything that implements BasePayloadGenerator.get_payload(). So far, form XML, form JSON, case XML, case JSON, an app ID, a Tastypie-serialized CommCareUser, a Location as JSON.

But that's a very good question, because I think all of those are returned as strings.

Contributor Author
Do you think this would be a good candidate for just leaving out the type hint for the return value?

Contributor
I think yes. Maybe I'm too biased to answer that question objectively 😛

Contributor
On second thought, no. I think it's fine as is since you're using type annotations in this code anyway.

self.repeater.delete()


class ServerErrorTests(RepeaterFixtureMixin, TestCase, DomainSubscriptionMixin):
Contributor

Are these the tests you were referring to that are slow? I'm assuming they're slow because they call submit_form_locally(), is that correct? If yes, I don't think there's any way to get around that if you really need to submit a form for every test.

Maybe you could submit one form in setUpClass and then have tearDown reset any form-related state that could have been changed by the test?

Co-authored-by: Daniel Miller <dmiller@dimagi.com>
kaapstorm and others added 6 commits January 28, 2021 17:00
Co-authored-by: Daniel Miller <dmiller@dimagi.com>
Co-authored-by: Daniel Miller <dmiller@dimagi.com>
Co-authored-by: Daniel Miller <dmiller@dimagi.com>
@kaapstorm kaapstorm merged commit cec1c86 into master Jan 28, 2021
@kaapstorm kaapstorm deleted the nh/rep/process_repeater_stub branch January 28, 2021 19:57