
reverseproxy: SRV dynamic upstream failover #5832

Merged 5 commits into master from upstream-failover on Mar 5, 2024
Conversation

@mholt (Member) commented on Sep 21, 2023

This should fix #5816: if the SRV lookup fails and a GracePeriod is defined, it will continue to use the previously cached records until the GracePeriod is up, after which it will try again.

@cds2-stripe @jjiang-stripe Feel free to give this a shot -- let me know if I need to rebase this onto another branch.
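For anyone trying this out, here is a minimal sketch of how the option would be set via JSON, assuming the new field is exposed as `grace_period` alongside the existing `srv` dynamic upstreams fields (the surrounding handler config is illustrative, not taken from this PR):

```json
{
	"handler": "reverse_proxy",
	"dynamic_upstreams": {
		"source": "srv",
		"service": "api",
		"proto": "tcp",
		"name": "example.com",
		"refresh": "1m",
		"grace_period": "10s"
	}
}
```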

@kkroo (Contributor) commented on Sep 21, 2023

Thanks for this fix, @mholt. I'm wondering why it's only for SRV and not A lookups for dynamic upstreams?

@mholt (Member, Author) commented on Sep 21, 2023

It was only requested for SRV upstreams for the moment. Once we're sure it works, we can easily add it for A upstreams too 👍

```diff
@@ -140,6 +147,12 @@ func (su SRVUpstreams) GetUpstreams(r *http.Request) ([]*Upstream, error) {
	// out and an error will be returned alongside the remaining results, if any." Thus, we
	// only return an error if no records were also returned.
	if len(records) == 0 {
		if su.GracePeriod > 0 {
			su.logger.Error("SRV lookup failed; using previously cached", zap.Error(err))
			cached.freshness = time.Now().Add(-time.Duration(su.Refresh) - time.Duration(su.GracePeriod))
```
Review comment from @mholt (Member, Author):
I'm second-guessing my math here.

I should write some tests for this. I did a couple in my head but I should probably just write them down.

Review comment from a Member:

Still second-guessing, or should we merge?

Review comment from @mholt (Member, Author):

Finally got a chance to do the math.

This had a mistake. 😅

The refresh period means "no longer fresh when freshness < now - refresh". So if the refresh period is 5 minutes and the grace period is 2 minutes, we should try again in 2 minutes (using only freshness and refresh). In other words, we want freshness < now - refresh to become true in 2 minutes, i.e. freshness < (now + 2) - refresh. Thus we set freshness to now + grace - refresh, or now + 2 - 5 = now - 3 minutes. In 2 minutes, that freshness will be exactly 5 minutes old, so the refresh period will have expired and we'll try the lookup again.

This also works when grace > refresh: freshness is set ahead of now, and the extra time elapses before trying again.
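To make the arithmetic concrete, here is a small self-contained Go sketch of the corrected formula (illustrative only — `freshness`, `refresh`, and `grace` mirror the fields discussed above, not the actual Caddy code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	refresh := 5 * time.Minute // normal re-lookup interval
	grace := 2 * time.Minute   // how long to keep serving cached records after a failed lookup
	now := time.Now()

	// On lookup failure, back-date (or future-date, when grace > refresh) the
	// cache's freshness so the staleness check fires exactly when the grace
	// period ends:
	freshness := now.Add(grace - refresh) // now + 2m - 5m = 3 minutes ago

	// Records are stale when freshness < at - refresh.
	stale := func(at time.Time) bool { return freshness.Before(at.Add(-refresh)) }

	fmt.Println(stale(now))                          // false: still inside the grace period
	fmt.Println(stale(now.Add(grace + time.Second))) // true: grace period over, retry the lookup
}
```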

Pushing a commit soon, then I'll merge.

@francislavoie added the "feature ⚙️ New feature or request" label on Jan 13, 2024
@francislavoie modified the milestones: v2.9.0, v2.8.0 on Jan 13, 2024
@francislavoie changed the title from "SRV dynamic upstream failover" to "reverseproxy: SRV dynamic upstream failover" on Jan 13, 2024
@mholt merged commit 72ce78d into master on Mar 5, 2024
25 checks passed
@mholt deleted the upstream-failover branch on Mar 5, 2024, 19:08
@francislavoie (Member) commented:
This is missing Caddyfile support for the grace_period option
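For illustration, the missing Caddyfile support might look something like this inside the existing `dynamic srv` block — `grace_period` here is hypothetical until that support lands; the other subdirectives follow the current `dynamic srv` syntax:

```caddyfile
reverse_proxy {
	dynamic srv {
		service      api
		proto        tcp
		name         example.com
		refresh      1m
		grace_period 10s
	}
}
```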

Labels: feature ⚙️ New feature or request
Linked issue this PR may close: Support falling back to cached upstreams IPs when DNS lookups fail (#5816)
3 participants