
process_repeater_stub() task #28978

Merged: 17 commits from nh/rep/process_repeater_stub into master on Jan 28, 2021
Conversation

kaapstorm
Contributor

Summary

This PR is the next in a series to implement [CEP] Migrate RepeatRecord to SQL.

Previous PR: SQLRepeatRecord

no-obligation fyi: @orangejenny @millerdev @snopoke

I'm opening this as a draft PR because there are two things I'm looking for feedback on:

1. When to back off

Currently, if 10 forms are submitted and forwarded to an endpoint that is offline, the endpoint is attempted 10 times immediately, 10 times again about an hour later, 10 times again three hours after that, and so on, with the interval multiplied by 3 each time, for up to five days, after which all 10 repeat records are cancelled.

This PR changes that behaviour. Exponential backoff is tracked on a RepeaterStub model instead of on each repeat record independently: if 10 forms are submitted, the first form will be forwarded. If the endpoint is offline, only that form will be retried an hour later, then three hours later, and so on. If it succeeds within five days, the remaining 9 forms will be sent. If it continues to fail, it will be cancelled, and HQ will try to send just the second form.

The idea is to dramatically reduce the rate of send attempts that are likely to fail. It also sets up a better foundation for automatically pausing repeaters that appear to be permanently offline.
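Roughly, the schedule works like the sketch below. The names here (backoff_attempts, MIN_RETRY_WAIT, and so on) are illustrative, not the actual fields and constants in the code:

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_RETRY_WAIT = timedelta(hours=1)     # first retry about an hour after the failure
MAX_RETRY_DURATION = timedelta(days=5)  # give up on a record after ~5 days of failures
BACKOFF_MULTIPLIER = 3                  # each wait is three times the previous one


def next_attempt_at(first_failure_at: datetime, backoff_attempts: int) -> Optional[datetime]:
    """Return when the repeater stub should be attempted next, or None if
    the current repeat record should be cancelled so that the next record
    can be tried.
    """
    now = datetime.utcnow()
    if now - first_failure_at > MAX_RETRY_DURATION:
        return None  # cancel this record; HQ will move on to the next one
    wait = MIN_RETRY_WAIT * BACKOFF_MULTIPLIER ** backoff_attempts
    return now + wait
```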

My biggest concern is to make sure that we never back off when the server is fine and the error is caused by the payload. The check_repeaters() task runs every five minutes, so the code differentiates between errors to back off on and errors to retry on the next check_repeaters() call. My intention is that bad payloads can be cancelled fast (ideally in about half an hour: six attempts, each about five minutes apart), so that bad payloads can't hold up good payloads for five days!

I am open to the idea of not retrying some kinds of errors, but in my experience some third-party servers are under-resourced, and retrying a few minutes later can turn a "500" into a "201". Reducing the number of retries for some kinds of errors might be a happy compromise. I'm also open to taking this conversation somewhere else, and pulling in some USH AEs.
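As a rough sketch of that split, with illustrative method names (set_backoff(), requeue(), cancel() and num_attempts stand in for whatever the real API ends up being):

```python
import requests

# ~Half an hour of retries for errors that don't back off, at five-minute
# check_repeaters() intervals (see above); the exact number is up for debate.
MAX_PAYLOAD_ATTEMPTS = 6


def handle_send_failure(repeater_stub, repeat_record, error):
    if isinstance(error, (requests.ConnectionError, requests.Timeout)):
        # The remote server looks unreachable: back off the whole repeater
        # so we stop making send attempts that are likely to fail.
        repeater_stub.set_backoff()
    elif repeat_record.num_attempts >= MAX_PAYLOAD_ATTEMPTS:
        # The server is responding but this record keeps failing (probably
        # a bad payload): cancel it quickly so it can't hold up good
        # payloads for five days.
        repeat_record.cancel()
    else:
        # Retry on the next check_repeaters() call, about five minutes away.
        repeat_record.requeue()
```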

2. How to improve slow tests

I am very interested in alternatives to the test suite in the "Slooow tests to test backoff" commit. It takes 11 seconds to run five tests with REUSE_DB=True. The tests follow the approach used by these Repeater tests. The lowest effort would probably be to submit the form in setUpClass() and just requeue it in setUp(). But I'm curious to know if there is a better, completely different approach that covers the behaviour these tests are meant to verify.

Do you know faster, better ways to test the same functionality?
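For reference, the lowest-effort option would look roughly like this; _submit_form_and_get_record() is a placeholder for whatever the existing fixtures do, and requeue() is assumed to put the record back in its pending state:

```python
from django.test import TestCase


class BackoffTests(TestCase):

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Expensive part: submit the form, and create its repeat record,
        # once for the whole class instead of once per test.
        cls.repeat_record = cls._submit_form_and_get_record()

    def setUp(self):
        super().setUp()
        # Cheap part: reset the record to "pending" so each test starts
        # from a clean slate.
        self.repeat_record.requeue()

    def test_backs_off_when_endpoint_is_offline(self):
        ...  # exercise process_repeater_stub() against an offline endpoint
```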

Safety Assurance

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change
  • If QA is part of the safety story, the "Awaiting QA" label is used
  • I am certain that this PR will not introduce a regression for the reasons below

Automated test coverage

The new code is covered by automated tests. (Please point out important gaps in testing in PR feedback.)

Safety story

Other than through unit tests, none of this code is reachable yet.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

@kaapstorm
Contributor Author

Reducing the number of retries for some kinds of errors might be a happy compromise.

See the "6 backoff attempts, 3 normal attempts" commit.

@kaapstorm kaapstorm marked this pull request as ready for review January 24, 2021 19:29
@kaapstorm kaapstorm added the product/invisible Change has no end-user visible impact label Jan 24, 2021
@kaapstorm
Contributor Author

ping @dannyroberts

@millerdev (Contributor) left a comment


Nice! Very happy to see more models getting migrated from Couch to SQL.

process_repeater_stub.delay(repeater_stub)


def get_payload(repeater: Repeater, repeat_record: SQLRepeatRecord) -> Any:
Contributor
Is this return type really Any? What kinds of values are typical? (for my curiosity only)

Contributor Author
The return value of anything that implements BasePayloadGenerator.get_payload(). So far, form XML, form JSON, case XML, case JSON, an app ID, a Tastypie-serialized CommCareUser, a Location as JSON.

But that's a very good question, because I think all of those are returned as strings.

Contributor Author
Do you think this would be a good candidate for just leaving out the type hint for the return value?

Contributor
I think yes. Maybe I'm too biased to answer that question objectively 😛

Contributor
On second thought, no. I think it's fine as is since you're using type annotations in this code anyway.

self.repeater.delete()


class ServerErrorTests(RepeaterFixtureMixin, TestCase, DomainSubscriptionMixin):
Contributor

Are these the tests you were referring to that are slow? I'm assuming they're slow because they call submit_form_locally(), is that correct? If yes, I don't think there's any way to get around that if you really need to submit a form for every test.

Maybe you could submit one form in setUpClass and then have tearDown reset any form-related state that could have been changed by the test?

Co-authored-by: Daniel Miller <dmiller@dimagi.com>
kaapstorm and others added 6 commits January 28, 2021 17:00
Co-authored-by: Daniel Miller <dmiller@dimagi.com>
Co-authored-by: Daniel Miller <dmiller@dimagi.com>
Co-authored-by: Daniel Miller <dmiller@dimagi.com>
@kaapstorm kaapstorm merged commit cec1c86 into master Jan 28, 2021
@kaapstorm kaapstorm deleted the nh/rep/process_repeater_stub branch January 28, 2021 19:57