Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

endpoint: Implement new Invalid endpoint state #11884

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented Jun 4, 2020

See commit msgs.

Endpoint (EP) creation can fail for many reasons due to invalid data, such as IP conflict or if it's created with reserved labels. When EP creation failed, EPs can be stuck in the "waiting-for-identity" state, and left to be garbage collected. However, even though the object would be garbage collected, the metric representing which state the EP is in, is never decremented, thus a leak.

This PR fixes this by introducing a new EP state representing an "invalid" EP. When EP creations fails, the EP will transition from its current state to "invalid", thus decrementing the current state metric.

Fix leaking endpoint state metric

@christarazi christarazi requested a review from a team June 4, 2020 02:28
@christarazi christarazi requested review from a team as code owners June 4, 2020 02:28
@maintainer-s-little-helper

This comment has been minimized.

@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 4, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 Jun 4, 2020
@christarazi christarazi marked this pull request as draft June 4, 2020 02:28
@christarazi
Copy link
Member Author

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.5 Jun 4, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 Jun 4, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.9 Jun 4, 2020
@coveralls
Copy link

coveralls commented Jun 4, 2020

Coverage Status

Coverage increased (+0.06%) to 37.0% when pulling 8d3ddda on christarazi:pr/christarazi/fix-missed-decrement-metric-for-endpoints into 7316fa6 on cilium:master.

@@ -1147,6 +1147,16 @@ func GetCounterValue(m prometheus.Counter) float64 {
return 0
}

// GetGaugeValue returns the current value stored for the gauge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write in the comment that this function is only used in tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines 177 to 185
func (d *Daemon) invalidDataError(ep *endpoint.Endpoint, deleteEP bool, err error) (*endpoint.Endpoint, int, error) {
if deleteEP {
d.deleteEndpointQuiet(ep, endpoint.DeleteConfig{
// The IP has been provided by the caller and must be released
// by the caller
NoIPRelease: true,
})
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to create another function instead of reusing this one.

}

oldEp := d.endpointManager.LookupCiliumID(ep.ID)
if oldEp != nil {
return invalidDataError(ep, fmt.Errorf("endpoint ID %d already exists", ep.ID))
return d.invalidDataError(ep, true, fmt.Errorf("endpoint ID %d already exists", ep.ID))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be false, not true. If an endpoint already exists aren't we potentially deleting that same endpoint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, this would suffer from the same thing I described below. Not deleting this endpoint object would leak the metric. Are you saying that because in this specific case, both endpoint have the same ID, that would result in the original endpoint object being deleted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disregard my comments--was reading your comments from in order and didn't see your proposal at the end 👍

}

oldEp = d.endpointManager.LookupContainerID(ep.GetContainerID())
if oldEp != nil {
return invalidDataError(ep, fmt.Errorf("endpoint for container %s already exists", ep.GetContainerID()))
return d.invalidDataError(ep, true, fmt.Errorf("endpoint for container %s already exists", ep.GetContainerID()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be false, not true. If an endpoint already exists aren't we potentially deleting that same endpoint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix

@@ -240,27 +248,27 @@ func (d *Daemon) createEndpoint(ctx context.Context, epTemplate *models.Endpoint
for _, id := range checkIDs {
oldEp, err := d.endpointManager.Lookup(id)
if err != nil {
return invalidDataError(ep, err)
return d.invalidDataError(ep, true, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be false, not true. same reasons above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also be a very similar situation that I described below.

} else if oldEp != nil {
return invalidDataError(ep, fmt.Errorf("IP %s is already in use", id))
return d.invalidDataError(ep, true, fmt.Errorf("IP %s is already in use", id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be false, not true. same reasons above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the exact error case that a customer ran into. This specific case was causing the errant metric.

}
}

if err = endpoint.APICanModify(ep); err != nil {
return invalidDataError(ep, err)
return d.invalidDataError(ep, true, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be false, not true. same reasons above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would this be problematic? I'm not clear on what this check is validating.

@@ -174,7 +174,15 @@ func (d *Daemon) fetchK8sLabelsAndAnnotations(nsName, podName string) (*slim_cor
return p, containerPorts, identityLabels, infoLabels, annotations, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this file, instead of the proposed changes I suggest to do the following (on all error conditions):

		ep.SetState(endpoint.StateInvalidEndpoint, "Invalid endpoint")

and in pkg/endpoint/endpoint.go

diff --git a/pkg/endpoint/endpoint.go b/pkg/endpoint/endpoint.go
index 8bd438963..bd504c97a 100644
--- a/pkg/endpoint/endpoint.go
+++ b/pkg/endpoint/endpoint.go
@@ -96,6 +96,8 @@ const (
        // StateRestoring is used to set the endpoint is being restored.
        StateRestoring = string(models.EndpointStateRestoring)
 
+       // TODO Needs an new state in the api/v1/models`
+       StateInvalidEndpoint = string(models.StateInvalidEndpoint)
+
        // IpvlanMapName specifies the tail call map for EP on egress used with ipvlan.
        IpvlanMapName = "cilium_lxc_ipve_"
 )
@@ -1319,7 +1321,7 @@ func (e *Endpoint) setState(toState, reason string) bool {
                }
        case StateWaitingForIdentity:
                switch toState {
-               case StateReady, StateDisconnecting:
+               case StateReady, StateDisconnecting, StateInvalidEndpoint:
                        goto OKState
                }
        case StateReady:
@@ -1389,7 +1391,7 @@ OKState:
 
        // Since StateDisconnected is the final state, after which the
        // endpoint is gone, we should not increment metrics for this state.
-       if toState != "" && toState != StateDisconnected {
+       if toState != "" && (toState != StateDisconnected || toState != StateInvalidEndpoint) {
                metrics.EndpointStateCount.
                        WithLabelValues(toState).Inc()
        }
@@ -1403,7 +1405,7 @@ func (e *Endpoint) BuilderSetStateLocked(toState, reason string) bool {
        // Validate the state transition.
        fromState := e.state
        switch fromState { // From state
-       case StateWaitingForIdentity, StateReady, StateDisconnecting, StateDisconnected:
+       case StateWaitingForIdentity, StateReady, StateDisconnecting, StateDisconnected, StateInvalidEndpoint:
                // No valid transitions for the builder
        case StateWaitingToRegenerate, StateRestoring:
                switch toState {
@@ -1456,7 +1458,7 @@ OKState:
 
        // Since StateDisconnected is the final state, after which the
        // endpoint is gone, we should not increment metrics for this state.
-       if toState != "" && toState != StateDisconnected {
+       if toState != "" && (toState != StateDisconnected || toState != StateInvalidEndpoint) {
                metrics.EndpointStateCount.
                        WithLabelValues(toState).Inc()
        }

@christarazi christarazi force-pushed the pr/christarazi/fix-missed-decrement-metric-for-endpoints branch from fd86c73 to 5418fa1 Compare June 4, 2020 21:55
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 4, 2020
@christarazi christarazi force-pushed the pr/christarazi/fix-missed-decrement-metric-for-endpoints branch from 5418fa1 to 7876367 Compare June 4, 2020 21:56
@maintainer-s-little-helper

This comment has been minimized.

@christarazi christarazi force-pushed the pr/christarazi/fix-missed-decrement-metric-for-endpoints branch from 7876367 to 28e6c57 Compare June 4, 2020 21:57
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 4, 2020
@cilium cilium deleted a comment from maintainer-s-little-helper bot Jun 4, 2020
@christarazi christarazi requested a review from a team June 5, 2020 06:42
@borkmann borkmann merged commit 4a4a59b into cilium:master Jun 5, 2020
1.8.0 automation moved this from In progress to Merged Jun 5, 2020
@christarazi christarazi deleted the pr/christarazi/fix-missed-decrement-metric-for-endpoints branch June 5, 2020 17:30
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.5 Jun 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.5 Jun 5, 2020
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endpoint creation strikes again.

Comment on lines +412 to +414
func isFinalState(state string) bool {
return (state == StateDisconnected || state == StateInvalid)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think a function like this would be nice to have in the actual code as well, to reference from the core state transition function. But I don't think it matters enough to further change this before merging.

@@ -176,6 +176,7 @@ func (d *Daemon) fetchK8sLabelsAndAnnotations(nsName, podName string) (*slim_cor

func invalidDataError(ep *endpoint.Endpoint, err error) (*endpoint.Endpoint, int, error) {
ep.Logger(daemonSubsys).WithError(err).Warning("Creation of endpoint failed due to invalid data")
ep.SetState(endpoint.StateInvalid, "Invalid endpoint")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked through as well to see that we don't need this anywhere else such as errorDuringCreation(), but that path is fine because it invokes deleteEndpointQuiet() which would transition the endpoint into StateDisconnecting.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.9 Jun 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.9 Jun 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.9 Jun 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.0 Jun 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.5 Jun 6, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.0 Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.6.9
Backport done to v1.6
1.7.5
Backport done to v1.7
1.8.0
  
Merged
1.8.0
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

6 participants