endpoint: Implement new Invalid endpoint state #11884
pkg/metrics/metrics.go
@@ -1147,6 +1147,16 @@ func GetCounterValue(m prometheus.Counter) float64 {
 	return 0
 }

+// GetGaugeValue returns the current value stored for the gauge
Write in the comment that this function is only used in tests.
Fixed
daemon/cmd/endpoint.go
func (d *Daemon) invalidDataError(ep *endpoint.Endpoint, deleteEP bool, err error) (*endpoint.Endpoint, int, error) {
	if deleteEP {
		d.deleteEndpointQuiet(ep, endpoint.DeleteConfig{
			// The IP has been provided by the caller and must be released
			// by the caller
			NoIPRelease: true,
		})
	}
I would suggest creating another function instead of reusing this one.
daemon/cmd/endpoint.go
 	}

 	oldEp := d.endpointManager.LookupCiliumID(ep.ID)
 	if oldEp != nil {
-		return invalidDataError(ep, fmt.Errorf("endpoint ID %d already exists", ep.ID))
+		return d.invalidDataError(ep, true, fmt.Errorf("endpoint ID %d already exists", ep.ID))
This should be false, not true. If an endpoint already exists, aren't we potentially deleting that same endpoint?
Good catch, will fix
On second thought, this would suffer from the same problem I described below: not deleting this endpoint object would leak the metric. Are you saying that, because both endpoints have the same ID in this specific case, the original endpoint object would end up being deleted?
Disregard my comments. I was reading your comments in order and hadn't seen your proposal at the end 👍
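The concern in this thread can be illustrated with a toy registry. This is only a sketch under stated assumptions: `registry`, `endpoint`, `removeByID`, and `removeExact` are hypothetical names, and the thread does not establish whether `deleteEndpointQuiet` actually removes entries by ID. It shows why deleting a rejected duplicate could, in an ID-keyed index, take the pre-existing endpoint with it.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in for the daemon's endpoint object.
type endpoint struct {
	id   uint16
	name string
}

// registry is a hypothetical manager that indexes endpoints by ID only.
type registry struct {
	byID map[uint16]*endpoint
}

func newRegistry() *registry {
	return &registry{byID: map[uint16]*endpoint{}}
}

func (r *registry) add(ep *endpoint) { r.byID[ep.id] = ep }

// removeByID drops whatever is registered under ep's ID, even if the
// registered object is a different endpoint that happens to share the ID.
func (r *registry) removeByID(ep *endpoint) { delete(r.byID, ep.id) }

// removeExact drops the entry only when it is this very object.
func (r *registry) removeExact(ep *endpoint) {
	if cur, ok := r.byID[ep.id]; ok && cur == ep {
		delete(r.byID, ep.id)
	}
}

func main() {
	r := newRegistry()
	existing := &endpoint{id: 7, name: "existing"}
	r.add(existing)

	// A second create request arrives with the same ID and is rejected.
	rejected := &endpoint{id: 7, name: "rejected"}

	r.removeExact(rejected)
	fmt.Println("after removeExact:", len(r.byID), "endpoint(s) registered")

	r.removeByID(rejected)
	fmt.Println("after removeByID:", len(r.byID), "endpoint(s) registered")
}
```

In this toy model the identity-checked removal leaves the original endpoint in place, while the ID-keyed removal deletes it.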
daemon/cmd/endpoint.go
 	}

 	oldEp = d.endpointManager.LookupContainerID(ep.GetContainerID())
 	if oldEp != nil {
-		return invalidDataError(ep, fmt.Errorf("endpoint for container %s already exists", ep.GetContainerID()))
+		return d.invalidDataError(ep, true, fmt.Errorf("endpoint for container %s already exists", ep.GetContainerID()))
This should be false, not true. If an endpoint already exists, aren't we potentially deleting that same endpoint?
Good catch, will fix
daemon/cmd/endpoint.go
@@ -240,27 +248,27 @@ func (d *Daemon) createEndpoint(ctx context.Context, epTemplate *models.Endpoint
 	for _, id := range checkIDs {
 		oldEp, err := d.endpointManager.Lookup(id)
 		if err != nil {
-			return invalidDataError(ep, err)
+			return d.invalidDataError(ep, true, err)
This should be false, not true, for the same reasons as above.
This is also very similar to the situation I describe below.
daemon/cmd/endpoint.go
 		} else if oldEp != nil {
-			return invalidDataError(ep, fmt.Errorf("IP %s is already in use", id))
+			return d.invalidDataError(ep, true, fmt.Errorf("IP %s is already in use", id))
This should be false, not true, for the same reasons as above.
This was the exact error case that a customer ran into. This specific case was causing the errant metric.
daemon/cmd/endpoint.go
 		}
 	}

 	if err = endpoint.APICanModify(ep); err != nil {
-		return invalidDataError(ep, err)
+		return d.invalidDataError(ep, true, err)
This should be false, not true, for the same reasons as above.
Why would this be problematic? I'm not clear on what this check is validating.
daemon/cmd/endpoint.go
@@ -174,7 +174,15 @@ func (d *Daemon) fetchK8sLabelsAndAnnotations(nsName, podName string) (*slim_cor
 	return p, containerPorts, identityLabels, infoLabels, annotations, nil
In this file, instead of the proposed changes, I suggest doing the following (on all error conditions):
ep.SetState(endpoint.StateInvalidEndpoint, "Invalid endpoint")
and in pkg/endpoint/endpoint.go
diff --git a/pkg/endpoint/endpoint.go b/pkg/endpoint/endpoint.go
index 8bd438963..bd504c97a 100644
--- a/pkg/endpoint/endpoint.go
+++ b/pkg/endpoint/endpoint.go
@@ -96,6 +96,8 @@ const (
// StateRestoring is used to set the endpoint is being restored.
StateRestoring = string(models.EndpointStateRestoring)
+ // TODO: Needs a new state in api/v1/models
+ StateInvalidEndpoint = string(models.StateInvalidEndpoint)
+
// IpvlanMapName specifies the tail call map for EP on egress used with ipvlan.
IpvlanMapName = "cilium_lxc_ipve_"
)
@@ -1319,7 +1321,7 @@ func (e *Endpoint) setState(toState, reason string) bool {
}
case StateWaitingForIdentity:
switch toState {
- case StateReady, StateDisconnecting:
+ case StateReady, StateDisconnecting, StateInvalidEndpoint:
goto OKState
}
case StateReady:
@@ -1389,7 +1391,7 @@ OKState:
// Since StateDisconnected is the final state, after which the
// endpoint is gone, we should not increment metrics for this state.
- if toState != "" && toState != StateDisconnected {
+ if toState != "" && toState != StateDisconnected && toState != StateInvalidEndpoint {
metrics.EndpointStateCount.
WithLabelValues(toState).Inc()
}
@@ -1403,7 +1405,7 @@ func (e *Endpoint) BuilderSetStateLocked(toState, reason string) bool {
// Validate the state transition.
fromState := e.state
switch fromState { // From state
- case StateWaitingForIdentity, StateReady, StateDisconnecting, StateDisconnected:
+ case StateWaitingForIdentity, StateReady, StateDisconnecting, StateDisconnected, StateInvalidEndpoint:
// No valid transitions for the builder
case StateWaitingToRegenerate, StateRestoring:
switch toState {
@@ -1456,7 +1458,7 @@ OKState:
// Since StateDisconnected is the final state, after which the
// endpoint is gone, we should not increment metrics for this state.
- if toState != "" && toState != StateDisconnected {
+ if toState != "" && toState != StateDisconnected && toState != StateInvalidEndpoint {
metrics.EndpointStateCount.
WithLabelValues(toState).Inc()
}
Endpoint creation strikes again.
func isFinalState(state string) bool {
	return (state == StateDisconnected || state == StateInvalid)
}
FWIW I think a function like this would be nice to have in the actual code as well, to reference from the core state transition function. But I don't think it matters enough to further change this before merging.
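As a rough sketch of that suggestion, a transition check could consult `isFinalState` before looking at anything else. The transition table below is deliberately partial and hypothetical: only the `waiting-for-identity` row mirrors the diff in this thread, the `disconnecting` row and the treatment of final states as having no outgoing transitions are assumptions, and `canTransition` is not a function in the PR. The string values of the constants are likewise assumed.

```go
package main

import "fmt"

// State names; the string values are assumptions for this sketch.
const (
	StateWaitingForIdentity = "waiting-for-identity"
	StateReady              = "ready"
	StateDisconnecting      = "disconnecting"
	StateDisconnected       = "disconnected"
	StateInvalid            = "invalid"
)

// isFinalState reports whether an endpoint in this state is about to be
// discarded, as in the helper quoted above.
func isFinalState(state string) bool {
	return state == StateDisconnected || state == StateInvalid
}

// canTransition is a hypothetical, partial validity check. States that are
// not listed (including "ready") simply return false here.
func canTransition(from, to string) bool {
	if isFinalState(from) {
		// Assumption: nothing leaves a final state.
		return false
	}
	switch from {
	case StateWaitingForIdentity:
		return to == StateReady || to == StateDisconnecting || to == StateInvalid
	case StateDisconnecting:
		return to == StateDisconnected
	}
	return false
}

func main() {
	fmt.Println(canTransition(StateWaitingForIdentity, StateInvalid)) // true
	fmt.Println(canTransition(StateInvalid, StateReady))              // false
}
```

Keeping the "is this final?" question in one helper means a state added later only has to be registered in one place.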
@@ -176,6 +176,7 @@ func (d *Daemon) fetchK8sLabelsAndAnnotations(nsName, podName string) (*slim_cor

 func invalidDataError(ep *endpoint.Endpoint, err error) (*endpoint.Endpoint, int, error) {
 	ep.Logger(daemonSubsys).WithError(err).Warning("Creation of endpoint failed due to invalid data")
+	ep.SetState(endpoint.StateInvalid, "Invalid endpoint")
I checked through as well to see that we don't need this anywhere else such as errorDuringCreation(), but that path is fine because it invokes deleteEndpointQuiet() which would transition the endpoint into StateDisconnecting.
See commit msgs.
Endpoint (EP) creation can fail for many reasons due to invalid data, such as an IP conflict or creation with reserved labels. When EP creation fails, the EP can be stuck in the "waiting-for-identity" state and left to be garbage collected. However, even though the object is garbage collected, the metric that counts EPs in that state is never decremented, so the count leaks.
This PR fixes this by introducing a new EP state representing an "invalid" EP. When EP creation fails, the EP transitions from its current state to "invalid", which decrements the metric for the state it leaves.
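The accounting described above can be sketched without Prometheus. In this minimal model, `stateGauge` is a plain map standing in for the per-state gauge, and `endpoint`, `setState`, and `isFinal` are hypothetical simplifications rather than the daemon's real types. The assumption, taken from the diff context in this thread, is that every transition decrements the state being left and increments the new state unless that state is final.

```go
package main

import "fmt"

// stateGauge is a stand-in for a per-state gauge metric: state name -> count.
type stateGauge map[string]int

// endpoint is a hypothetical, stripped-down endpoint that only tracks state.
type endpoint struct {
	state string
}

// isFinal mirrors the idea that endpoints in these states are about to go
// away, so they should not be counted.
func isFinal(state string) bool {
	return state == "disconnected" || state == "invalid"
}

// setState moves ep to the new state and keeps the gauge in step:
// the old state is decremented, the new one incremented unless it is final.
func (g stateGauge) setState(ep *endpoint, to string) {
	if ep.state != "" {
		g[ep.state]--
	}
	if to != "" && !isFinal(to) {
		g[to]++
	}
	ep.state = to
}

func main() {
	g := stateGauge{}
	ep := &endpoint{}

	g.setState(ep, "waiting-for-identity")
	// If creation failed here and the object were simply dropped,
	// the gauge would keep reporting one endpoint in this state.
	fmt.Println(g["waiting-for-identity"]) // 1

	// Moving the failed endpoint to "invalid" releases the count.
	g.setState(ep, "invalid")
	fmt.Println(g["waiting-for-identity"], g["invalid"]) // 0 0
}
```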