automatically switch ESP during downtime #31
Really interesting idea -- thanks for the thought you've put into this. (Believe me, I understand the pain you're feeling with SendGrid recently.) And it's clever to use the status APIs to determine if an ESP is usable, because the ESP's sending APIs often stay up, accepting and queuing messages, even when the ESP is significantly delaying delivery. So you have to find some other way to determine if the ESP is "down" for sending purposes. (BTW, is that statuspage.io v2/summary.json endpoint documented anywhere?)

There would be a handful of problems to work through:

1. Keeping track of the current preferred connection

Anymail deliberately doesn't maintain any persistent state, which avoids a whole lot of additional configuration overhead for users. I'm not sure how we could implement "save chosen backup as current preferred connection" within Anymail.

2. Data can diverge when mixing ESPs

This is probably more of a user-education issue: sending to the same set of recipients through multiple ESPs leads to potentially-confusing data divided among those ESPs. If you're using your ESP's unsubscribe management, for example, you'd want to avoid backup ESPs sending to recipients on the primary ESP's unsubscribe list. And you'd have to figure out how to handle unsubscribes coming from messages sent through a backup ESP. That sort of syncing is beyond the scope of Anymail. (Similar issues with ESPs' bounced/blocked recipient lists.) You'd also have to be careful how you interpret open and click rate data collected by one of the ESPs. (Or really, anything where ESPs maintain data on your behalf.)

3. Deciding if an ESP is "up"

As I said earlier, I think using the status API is clever. But it also looks like it could be tricky to decide whether the ESP is "up" from the status summary. The simplest approach would be, if there are any open incidents, consider it down. But a lot of incidents aren't about sending/delivery -- e.g., Mailgun recently had a control panel outage, and SparkPost had delays in metrics and reporting. I don't think I'd have wanted to switch to a backup ESP in either of these cases.

We could try to be smarter by looking for outages only in particular components. Each ESP labels its status components differently, so that would take some research. (E.g., SendGrid is currently showing "Mail Sending" as "Degraded Performance," but all three of its subcomponents as "Operational" -- including the ones that track "the flow of mail generated by mail.send API requests.") It could also be fragile to future status page changes.

Another problem is that (some) ESPs may not promptly list (some) service issues in their status pages. (Though I suppose you're no worse off in that case than you are now.)

All that said, though, I threw together a quick and untested implementation of a backend that checks status APIs and sends through the first "up" ESP. To avoid the whole persistent-state problem, it checks the status API(s) on every send. (Optimization is "left as an exercise for the reader." As is figuring out my bugs in it. 😄) Feel free to try it out, fork and edit, and if you end up with something useful we can either add it to Anymail or at least to the docs as an advanced example.
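For readers following along, the simplest rule described above (treat any open incident as "down", then send through the first ESP that is "up") might look roughly like this. It is only a sketch and is not the code from the gist: it assumes each status summary has already been fetched and parsed into a dict with an `incidents` list, and the names `has_open_incidents` and `first_up_backend` are hypothetical.

```python
from typing import Callable, Iterable, Optional


def has_open_incidents(summary: dict) -> bool:
    """Return True if the parsed status summary lists any open incident.

    Assumes the summary is a dict with an "incidents" list. That shape is
    an assumption; check your ESP's actual status payload.
    """
    return len(summary.get("incidents", [])) > 0


def first_up_backend(
    backends: Iterable[str],
    get_summary: Callable[[str], dict],
) -> Optional[str]:
    """Return the first backend whose summary shows no open incidents.

    `backends` is an ordered sequence of names, preferred ESP first.
    `get_summary` is a caller-supplied function that fetches and parses
    the status summary for one backend (for example, over HTTP).
    """
    for name in backends:
        if not has_open_incidents(get_summary(name)):
            return name
    return None  # every ESP reports an incident; the caller decides what to do
```

Passing `get_summary` in as a parameter keeps the decision logic separate from the HTTP call, so it can be exercised with canned payloads. As noted above, this rule also fails over for incidents that have nothing to do with sending.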
Thanks for the thoughtful response, definitely some things to work through here. I will fork it and try to work through your points here a bit.
Statuspage.io's undocumented documentation
1. Keeping track of preferred connection
2. Data divergence between ESPs
With that in mind, two potential workarounds for the simplest use case:
3. Deciding if an ESP is up
I thought a little more about keeping track of the preferred connection, and it seems like it would be perfectly reasonable to use Django's cache framework for this. That avoids the performance hit of checking status on every send, but without the complexity of adding models. I updated the (still untested) gist with code to cache the preferred ESP, and added a webhook view that invalidates the cache: https://gist.github.com/medmunds/e6b837bb3b382098d775cb412b889632
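The caching idea can be sketched without Django installed. This is not the gist's code: `CACHE_KEY`, `CACHE_TIMEOUT`, `get_preferred_esp`, and `invalidate_preferred_esp` are hypothetical names, and `DictCache` is a small in-memory stand-in that mimics only the `get`/`set`/`delete` calls of Django's cache. In a real project you would presumably pass `django.core.cache.cache` instead.

```python
import time
from typing import Callable, Optional

CACHE_KEY = "preferred_esp"  # hypothetical key name
CACHE_TIMEOUT = 300  # seconds; an arbitrary illustrative value


class DictCache:
    """In-memory stand-in exposing get/set/delete like Django's cache."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        value, expires = self._data.get(key, (default, None))
        if expires is not None and time.monotonic() >= expires:
            self._data.pop(key, None)  # entry has expired
            return default
        return value

    def set(self, key, value, timeout=None):
        expires = None if timeout is None else time.monotonic() + timeout
        self._data[key] = (value, expires)

    def delete(self, key):
        self._data.pop(key, None)


def get_preferred_esp(cache, choose: Callable[[], Optional[str]]) -> Optional[str]:
    """Return the cached preferred ESP, recomputing it only on a cache miss."""
    esp = cache.get(CACHE_KEY)
    if esp is None:
        esp = choose()  # the expensive status check(s) happen here
        if esp is not None:
            cache.set(CACHE_KEY, esp, CACHE_TIMEOUT)
    return esp


def invalidate_preferred_esp(cache) -> None:
    """What a status-change webhook view could call to force a re-check."""
    cache.delete(CACHE_KEY)
```

With this shape, the status check runs once per timeout window (or once per webhook notification) rather than on every send.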
Good call with caching; it seems like the perfect use case. I forked the gist and added a way to check a list of components to isolate what the issue actually is, so we don't fall back if, for example, SendGrid's marketing service is down. (Un)luckily, SendGrid is having some more downtime today, so I was able to confirm that the is_component_working function works as expected. Do you think checking components should replace is_backend_working, or should this be a setting? I will try to work on some testing soon.
Nice. Yeah, I'd probably just move the component checking into is_backend_working. is_backend_working should represent our best guess at determining if the ESP is "up." If you've figured out the right components to check, that's a better test than the overall status check I had in there.
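One way to fold the component check into `is_backend_working` is sketched below. The function names come from the discussion, but the signatures and bodies here are assumptions, not the gist's actual code. The sketch also assumes a parsed summary dict with a top-level `status.indicator` field and a `components` list whose entries carry `name` and `status`; verify that against the real payload for each ESP.

```python
from typing import Iterable


def is_component_working(summary: dict, component_name: str) -> bool:
    """Return True if the named component reports "operational".

    A component missing from the summary is treated as working, so a
    renamed component does not trigger failover. That is a design choice
    for this sketch, not something settled in the discussion.
    """
    for component in summary.get("components", []):
        if component.get("name") == component_name:
            return component.get("status") == "operational"
    return True


def is_backend_working(summary: dict, components: Iterable[str] = ()) -> bool:
    """Best guess at whether the ESP is "up" for sending.

    If component names are given, only those are checked; otherwise fall
    back to the overall status indicator.
    """
    names = list(components)
    if not names:
        return summary.get("status", {}).get("indicator", "none") == "none"
    return all(is_component_working(summary, name) for name in names)
```

For example, with a summary where "Mail Sending" is degraded but "Marketing Campaigns" is operational, checking only `["Marketing Campaigns"]` reports the backend as working, while checking `["Mail Sending"]` does not.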
I don't think this code really belongs in the core Anymail, but I'm going to add a link to your gist to the docs on using multiple backends.
Downtime from ESPs is a pain (4 hours into SendGrid send delays right now). In reading about this library, I thought it would be a very neat feature if, during downtime from your preferred ESP, Anymail would switch to a backup ESP or SMTP. Would this fit into the scope of Anymail?
How this might work
settings contain backups
webhook flow
add extra step to sending an email