
Fix etcd resource limits for existing clusters #342

Merged: 1 commit merged into gardener:master on May 10, 2022

Conversation

danielfoehrKn (Contributor):

How to categorize this PR?

/kind bug

What this PR does / why we need it:

Ensures that the resource limits of an existing etcd StatefulSet are not re-used when reconciling the Etcd resource. This makes sure that if the Etcd resource specifies no resource limits, etcd-druid does not add limits merely because the existing etcd StatefulSet has them.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Do not re-use resource limits from an existing etcd StatefulSet. This will cause a RESTART(!) of the etcd pod for existing clusters that currently have a resource limit set on the etcd StatefulSet, but whose Etcd resource does not specify a resource limit.

@danielfoehrKn danielfoehrKn requested a review from a team as a code owner May 4, 2022 14:46
@gardener-robot gardener-robot added kind/bug Bug needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels May 4, 2022
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 4, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 4, 2022
```diff
@@ -629,7 +629,8 @@ func (r *EtcdReconciler) syncStatefulSetSpec(ctx context.Context, logger logr.Lo
 		if !ok {
 			return nil, fmt.Errorf("container with name %s could not be fetched from statefulset %s", c.Name, decoded.Name)
 		}
-		decoded.Spec.Template.Spec.Containers[i].Resources = container.Resources
+		// only copy requested resources from the existing stateful set to avoid copying already removed (from the etcd resource) resource limits
+		decoded.Spec.Template.Spec.Containers[i].Resources.Requests = container.Resources.Requests
```
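The effect of copying only the `Requests` field (rather than the whole `Resources` struct) can be sketched with simplified stand-in types. This is a minimal illustration, not the actual etcd-druid code: `ResourceRequirements` below is a simplified stand-in for `corev1.ResourceRequirements`, and `syncResources` is a hypothetical helper name.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes types (illustration only).
type ResourceList map[string]string

type ResourceRequirements struct {
	Requests ResourceList
	Limits   ResourceList
}

// syncResources mirrors the fixed behaviour: take only the Requests from the
// existing StatefulSet container, and keep the Limits from the desired spec
// rendered out of the Etcd resource. Limits that were removed from the Etcd
// resource are therefore not resurrected from the running StatefulSet.
func syncResources(desired, existing ResourceRequirements) ResourceRequirements {
	desired.Requests = existing.Requests
	return desired
}

func main() {
	// The running StatefulSet still carries a stale CPU limit.
	existing := ResourceRequirements{
		Requests: ResourceList{"cpu": "300m"},
		Limits:   ResourceList{"cpu": "1"},
	}
	// The Etcd resource no longer specifies any limits.
	desired := ResourceRequirements{
		Requests: ResourceList{"cpu": "300m"},
	}

	synced := syncResources(desired, existing)
	fmt.Println("limits after sync:", synced.Limits) // stays empty
}
```

Copying the whole struct (the old behaviour) would have carried `Limits: {"cpu": "1"}` back into the desired spec even though the Etcd resource dropped it.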
Member:
Do we still need this piece of code at all? Is this different from what's done when we're reconciling the etcd resource based on current settings in the STS's containers?

It feels weird to have two places responsible for dealing with the Resources; probably Druid should just take what's in the etcd resource and put it into the STS.

danielfoehrKn (Contributor, Author):

From my POV etcd-druid is a separate project that is used in Gardener, but could also be used without it.
Hence, when the Etcd resource is reconciled outside of a Gardener reconciliation, we also want to make sure that we do not override resource reservations already set on the StatefulSet.

danielfoehrKn (Contributor, Author):

Can we move ahead with this PR? The etcd-druid maintainers can still decide to drop the whole code path (get the most recent StatefulSet, copy resources) later.
cc @gardener/etcd-druid-maintainers

Member:

I'm fine with keeping this code in etcd-druid if we think that using it outside of Gardener is a valid and supported use case. Does this mean we could remove this piece of code in g/g then? It seems to do the exact same thing: copying over existing resource values from the STS. Currently that code doesn't do anything, because the values get overwritten by the code above, correct?

Member:

Yes, there are stakeholders who are or will be using etcd-druid without Gardener.

ishan16696 (Member):

Hi @danielfoehrKn,
We are going to have a druid patch release v0.8.4. Do you want this PR to be included in that patch release?

voelzmo (Member) commented May 9, 2022:

I think merging this will trigger a rollout of the STS – is it safe to include this in a patch release? We should probably release this as a new minor version, but I'm not sure what your general policy is here?

ishan16696 (Member) commented May 9, 2022:

I have released druid patch v0.8.4. This PR can now go with either MVP-2 or another minor release (if necessary).

dguendisch (Member):

> this will trigger a rollout of the STS – is it safe to include this in a patch release?

In general we should try to minimize etcd restarts on our live landscapes where possible (until we have multi-node etcd).
@ishan16696 I think there's already a release in the pipeline (because of gardener/etcd-backup-restore#471) that will imply another restart. Would it be possible to bundle this PR with it?

ishan16696 (Member) commented May 9, 2022:

> In general we should try to minimize etcd restarts

Yes, etcd-druid v0.8.3 (containing gardener/etcd-backup-restore#467 and gardener/etcd-backup-restore#470) was already released with g/g v1.46.
But due to gardener/etcd-backup-restore#476 I released another patch version of druid, v0.8.4.
And I also wanted that to be included in the g/g patch release v1.46.1 to minimize etcd restarts.

> Is it possible maybe to bundle this PR with it?

I don’t think so; that’s why I asked hours before releasing druid. #342 (comment)

dguendisch (Member):

But then we could still bring it in as druid v0.8.5 into g/g v1.46.1 (which already contains etcd-restarting changes) to avoid yet another etcd restart because of this PR here?

ishan16696 (Member):

So you want another patch release of druid with this PR, @dguendisch?

dguendisch (Member) commented May 10, 2022:

I think it makes sense: this hit us last week, and if we don't bring it in with another patch release in time, we add another etcd restart afterwards. So I'd say yes, please bring it in now with another patch release.

ishan16696 (Member):

> so you want another patch release of druid with this PR @dguendisch ?

> I think it makes sense, this hit us last week, if we don't bring it in with another patch release in time, we add another etcd restart afterwards, so I'd say yes, please bring it in now with another patch release.

cc @shafeeqes (g/g release responsible) @ashwani2k

@plkokanov do you also want to fix that owner-check edge case which we discussed for seed migration? We can have a backup-restore patch release as well.

plkokanov (Contributor) commented May 10, 2022:

> @plkokanov do you also want to fix that owner check edge case which we discuss for seed migration ... we can have a backup-restore patch release as well ?

@ishan16696 If we can add it to a hotfix, that would be great. But I will probably only be able to fix it tomorrow.

voelzmo (Member) left a comment:

/approve

gardener-robot:

@voelzmo Command /approve is not known.

aaronfern (Contributor) left a comment:

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review labels May 10, 2022
@ishan16696 ishan16696 merged commit 3b16c06 into gardener:master May 10, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label May 10, 2022