This repository has been archived by the owner on Mar 9, 2022. It is now read-only.

Handle containerd event reliably #628

Merged 2 commits on Mar 15, 2018

Conversation

yanxuean
Member

fix #434
Signed-off-by: yanxuean <yan.xuean@zte.com.cn>

@yanxuean
Member Author

/hold

@yanxuean yanxuean changed the title from "Handle containerd event reliably" to "[WIP] Handle containerd event reliably" Feb 28, 2018
@yanxuean
Member Author

/cc @Random-Liu PTAL
I will add a test case later.

@k8s-ci-robot

@yanxuean: GitHub didn't allow me to request PR reviews from the following users: PTAL.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @Random-Liu PTAL
I will add a test case later.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Random-Liu
Member

Random-Liu commented Mar 5, 2018

Ref #642.

@Random-Liu Random-Liu mentioned this pull request Mar 5, 2018
@@ -43,6 +45,14 @@ type eventMonitor struct {
errCh <-chan error
ctx context.Context
cancel context.CancelFunc
backOffQueue backOffQueue
Member

Can we put all backoff-related things into one struct? It will also be easier to unit test.

Member Author

will do

logrus.WithError(err).Errorf("Failed to convert event envelope %+v", e)
break
}
if cID, backOffIng := em.isBackOffIng(any); backOffIng {
Member

s/isBackOffIng/isInBackoff

Member Author

will do

if e.Pid != sb.Status.Get().Pid {
// Non-init process died, ignore the event.
- return
+ return nil
}
// No stream attached to sandbox container.
task, err := sb.Container.Task(context.Background(), nil)
if err != nil {
if !errdefs.IsNotFound(err) {
logrus.WithError(err).Errorf("failed to load task for sandbox %q", e.ContainerID)
Member

Log in handleEvent instead, since you've returned the error anyway.

return fmt.Errorf("failed to load task: %v", err)

And same for the following errors and also handleContainerExit.
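
A minimal sketch of the pattern suggested above: the inner handler returns a descriptive error, and only the caller logs it. The names `loadTask` and `errTaskNotFound` are hypothetical stand-ins for `sb.Container.Task` and its failure, and the standard `log` package is used here in place of logrus.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errTaskNotFound is a hypothetical failure used only for this sketch.
var errTaskNotFound = errors.New("task not found")

// loadTask is a hypothetical stand-in for sb.Container.Task.
func loadTask(id string) error {
	if id == "" {
		return errTaskNotFound
	}
	return nil
}

// handleSandboxExit returns a descriptive error instead of logging it.
func handleSandboxExit(id string) error {
	if err := loadTask(id); err != nil {
		return fmt.Errorf("failed to load task for sandbox %q: %v", id, err)
	}
	return nil
}

// handleEvent is the single place where a failure is logged.
func handleEvent(id string) error {
	if err := handleSandboxExit(id); err != nil {
		log.Printf("Failed to handle event: %v", err)
		return err
	}
	return nil
}

func main() {
	_ = handleEvent("")          // logs one line and returns an error
	_ = handleEvent("sandbox-1") // succeeds silently
}
```

Logging once at the top keeps each failure to a single log line while the returned error still carries the context added by the inner function.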

Member Author

will do

em.backOffQueue[key] = queue
}

func (em *eventMonitor) newBackOffTimer(key string) *time.Timer {
Member

There is no backoff? The interval should increase if it keeps failing, right?

em.enBackOff(cID, any)
break
}
if cID, err := em.handleEvent(any); err != nil && cID != "" {
Member

Why err != nil && cID != ""?

Is it possible that err != nil but cID == ""? Is that an error?

Actually, you've got the container id here, why still let handleEvent return one?

Member Author

To prevent accidents:)
will do

case err := <-em.errCh:
logrus.WithError(err).Error("Failed to handle event stream")
close(closeCh)
return
case cID := <-em.backOffExpire:
for {
any := em.deBackOff(cID)
Member

When will any == nil? Is that an error?

Member Author

No, it is not an error. It is in a "for" loop: once we have handled the whole queue, it will be nil.

if any == nil {
break
}
if _, err := em.handleEvent(any); err != nil && cID != "" {
Member

cID != ""? You don't even set it based on the return value.

Member Author

will do

// enBackOff start to backOff again and put event to the begin of queue
func (em *eventMonitor) reBackOff(key string, evt interface{}) {
newEvents := []interface{}{evt}
queue := em.backOffQueue[key]
Member

Just requeue, and always stop timer as long as it is not nil.

No need to check length.

Member Author

If we do that, we will stop the timer twice when there is only one event.

@Random-Liu
Member

Random-Liu commented Mar 6, 2018

@yanxuean I feel like the per-container timer logic is unnecessarily complex. And we haven't done real backoff yet. (See #628 (comment))

I feel like it is very easy to forget to start a timer, or to leave an unnecessary timer behind.

How about we just have a ticker which ticks every 1 second or so? Every time it ticks, we check whether there are any events which have exceeded their backoff time, and deal with them if there are any. This seems more reliable to me.

To optimize, we can start the ticker only when there is at least one event enqueued.
@yanxuean WDYT?
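
The ticker idea above could look roughly like the following. This is a sketch under stated assumptions, not the code that was merged: the `pending` type, `enqueue`, and `tick` are hypothetical names, events are plain strings, and the intervals are shortened to milliseconds so the demo finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// pending holds the events queued for one container and when they were queued.
type pending struct {
	events []string
	start  time.Time
}

// backOff is a hypothetical holder of per-container queues with a fixed wait.
type backOff struct {
	wait   time.Duration
	queues map[string]*pending
}

func newBackOff(wait time.Duration) *backOff {
	return &backOff{wait: wait, queues: map[string]*pending{}}
}

// enqueue appends an event to the container's queue, creating it if needed.
func (b *backOff) enqueue(id, evt string) {
	q, ok := b.queues[id]
	if !ok {
		q = &pending{start: time.Now()}
		b.queues[id] = q
	}
	q.events = append(q.events, evt)
}

// tick retries every queue whose wait has elapsed. If an event fails again,
// it and the events after it are requeued and the wait restarts.
func (b *backOff) tick(handle func(evt string) error) {
	var expired []string
	for id, q := range b.queues {
		if time.Since(q.start) >= b.wait {
			expired = append(expired, id)
		}
	}
	for _, id := range expired {
		q := b.queues[id]
		delete(b.queues, id)
		for i, evt := range q.events {
			if err := handle(evt); err != nil {
				b.queues[id] = &pending{events: q.events[i:], start: time.Now()}
				break
			}
		}
	}
}

func main() {
	b := newBackOff(20 * time.Millisecond)
	b.enqueue("c1", "exit")
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	// One ticker serves all containers, so there is no per-container timer to forget.
	for len(b.queues) > 0 {
		<-ticker.C
		b.tick(func(evt string) error {
			fmt.Println("handled", evt)
			return nil
		})
	}
}
```

Because expiry is checked on every tick, a queue can never be stranded without a timer; the cost is that a retry may happen up to one tick later than its nominal deadline.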

@yanxuean
Member Author

yanxuean commented Mar 6, 2018

Will refactor it.

@yanxuean yanxuean force-pushed the nits branch 2 times, most recently from cb75ccc to 0d93897 March 10, 2018 05:29
@yanxuean yanxuean changed the title from "[WIP] Handle containerd event reliably" to "Handle containerd event reliably" Mar 10, 2018
@yanxuean yanxuean force-pushed the nits branch 3 times, most recently from 624dad3 to 5e2e5cb March 10, 2018 06:43
@yanxuean
Member Author

@Random-Liu PTAL, Tks

@@ -74,16 +95,40 @@ func (em *eventMonitor) start() (<-chan struct{}, error) {
return nil, errors.New("event channel is nil")
}
closeCh := make(chan struct{})
em.backOff.ticker = time.NewTicker(1 * time.Second)
Member

nit: Add a start function in backoff?

@@ -54,6 +72,9 @@ func newEventMonitor(c *containerstore.Store, s *sandboxstore.Store) *eventMonit
sandboxStore: s,
ctx: ctx,
cancel: cancel,
backOff: backOff{
Member

nit: Add a new function?

Member Author

will do

logrus.WithError(err).Errorf("Failed to convert event envelope %+v", e)
break
}
cID, backOffIng := em.backOff.isInBackOff(any)
Member

s/backOffIng/inBackoff

break
}
if err := em.handleEvent(any); err != nil {
em.backOff.enBackOff(cID, any)
Member

log the error

case err := <-em.errCh:
logrus.WithError(err).Error("Failed to handle event stream")
close(closeCh)
return
case <-em.backOff.ticker.C:
Member

nit: Add a simple function to return the channel?

logrus.WithError(err).Errorf("Failed to convert event envelope %+v", evt)
return
}
func (em *eventMonitor) handleEvent(any interface{}) error {
Member

Let's return error with description from this function, and we only need to log the error in start.

In this way, we don't need so many logs in this function.

}

// enBackOff start to backOff and put event to the tail of queue
func (b *backOff) enBackOff(key string, evt interface{}) {
Member

queue, ok := b.queues[key]
if ok {
  queue.events = append(queue.events, evt)
  b.queues[key] = queue // store the updated copy back, since the map holds values
  return
}
b.queues[key] = backOffQueue{
  events: []interface{}{evt},
  duration: initialBackoffTime,
  start: time.Now(),
} // Or add a function `newBackoffQueue`

}

// enBackOff start to backOff again and put [nth:] events to the queue
func (b *backOff) reBackOff(key string, queue backOffQueue, n int) {
Member

func (b *backOff) reBackOff(key string, events []interface{}, oldDuration time.Duration) {
  duration := 2 * oldDuration
  if duration > maxBackoffTime {
    duration = maxBackoffTime
  }
  b.queues[key] = backOffQueue{
    events: events,
    duration: duration,
    start: time.Now(),
  }
}
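
Put together, the two suggestions above (enqueue with an initial duration, requeue with a doubled and capped duration) might be sketched as follows. The constant values are assumptions chosen for the demo, `newBackOff` is a hypothetical constructor, and the map stores pointers so that appending to an existing queue updates the stored value.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values for illustration; the review does not state the real ones.
const (
	initialBackoffTime = 1 * time.Second
	maxBackoffTime     = 8 * time.Second
)

// backOffQueue holds the pending events of one container.
type backOffQueue struct {
	events   []interface{}
	duration time.Duration
	start    time.Time
}

// backOff maps a container ID to its queue.
type backOff struct {
	queues map[string]*backOffQueue
}

func newBackOff() *backOff {
	return &backOff{queues: make(map[string]*backOffQueue)}
}

// enBackOff appends evt to the container's queue, creating the queue with
// the initial duration on the first failure.
func (b *backOff) enBackOff(key string, evt interface{}) {
	if q, ok := b.queues[key]; ok {
		q.events = append(q.events, evt)
		return
	}
	b.queues[key] = &backOffQueue{
		events:   []interface{}{evt},
		duration: initialBackoffTime,
		start:    time.Now(),
	}
}

// reBackOff requeues the unhandled events with a doubled, capped duration.
func (b *backOff) reBackOff(key string, events []interface{}, oldDuration time.Duration) {
	duration := 2 * oldDuration
	if duration > maxBackoffTime {
		duration = maxBackoffTime
	}
	b.queues[key] = &backOffQueue{events: events, duration: duration, start: time.Now()}
}

func main() {
	b := newBackOff()
	b.enBackOff("c1", "exit")
	// Simulate repeated failures and show how the wait grows up to the cap.
	for i := 0; i < 5; i++ {
		q := b.queues["c1"]
		fmt.Println(q.duration)
		b.reBackOff("c1", q.events, q.duration)
	}
}
```

With these assumed constants the demo prints 1s, 2s, 4s, 8s, 8s: the interval doubles on each failure until it reaches the cap.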

return !now.Before(t.expireTime)
}

func (t backOffTime) newTime() backOffTime {
Member

I think this function is unnecessarily complex... Simple and straightforward code is preferred. See comments above.

var containers []string
now := time.Now()
for c, v := range b.queues {
if v.time.isExpire(now) {
Member

if time.Since(v.start) > v.duration // or add a `isExpired` function for backOffQueue
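
As a sketch of the check suggested above, an `isExpired` method on the queue keeps the loop body to a single condition. The `newBackOffQueue` helper and the free-standing `getExpiredContainers` signature are assumptions made so the example is self-contained; the result is sorted only to make the demo output stable, since map iteration order is random.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// backOffQueue holds the pending events of one container.
type backOffQueue struct {
	events   []interface{}
	duration time.Duration
	start    time.Time
}

// newBackOffQueue is a hypothetical constructor that stamps the start time.
func newBackOffQueue(events []interface{}, d time.Duration) backOffQueue {
	return backOffQueue{events: events, duration: d, start: time.Now()}
}

// isExpired reports whether the queue has waited longer than its duration.
func (q backOffQueue) isExpired() bool {
	return time.Since(q.start) > q.duration
}

// getExpiredContainers returns the sorted IDs of all expired queues.
func getExpiredContainers(queues map[string]backOffQueue) []string {
	var containers []string
	for id, q := range queues {
		if q.isExpired() {
			containers = append(containers, id)
		}
	}
	sort.Strings(containers)
	return containers
}

func main() {
	queues := map[string]backOffQueue{
		// A negative duration stands in for a wait that has already elapsed.
		"ready":   newBackOffQueue([]interface{}{"exit"}, -1),
		"waiting": newBackOffQueue([]interface{}{"exit"}, time.Hour),
	}
	fmt.Println(getExpiredContainers(queues)) // only "ready" has expired
}
```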

},
}

t.Logf("Should can backOff a event")
Member

s/can/be able to

And also for following ones.

}

t.Logf("Should can backOff a event")
actual := newEventMonitor(nil, nil).backOff
Member

? I prefer using the new function suggested in https://github.com/containerd/cri/pull/628/files#r174375426.

}
assert.Equal(t, isQueueListEqual(t, actual.queues, expectedQueues, 1*time.Second), true)

t.Logf("Should can check if the container is on backOff state")
Member

s/on/in

@@ -0,0 +1,150 @@
/*
Copyright 2018 The Kubernetes Authors.
Member

containerd Authors

assert.Equal(t, isQueueListEqual(t, actual.queues, expectedQueues, 1*time.Second), true)

t.Logf("Should can check if the container is on backOff state")
for k, queue := range inputQueues {
Member

Should also check that an arbitrary container ID is not in backoff.

}
}

t.Logf("Should can get all keys who are expired for backOff")
Member

s/who/which

actKeyList := actual.getExpiredContainers()
assert.Equal(t, len(expKeyList), len(actKeyList))
for _, expKey := range expKeyList {
found := false
Member

assert.Contains

assert.Equal(t, isQueueEqual(t, actQueue, expectedQueues[k], 1*time.Second), true)
}

t.Logf("Should not get out the event again after having gut out the backOff event")
Member

s/gut/got.

expQueue := backOffQueue{
events: queue.events[failEventIndex:],
}
assert.Equal(t, isQueueEqual(t, actQueue, expQueue, 2*time.Second), true)
Member

Use a constant shared by the code and the test.

Signed-off-by: yanxuean <yan.xuean@zte.com.cn>
fix containerd#434

Signed-off-by: yanxuean <yan.xuean@zte.com.cn>
Member

@Random-Liu Random-Liu left a comment

LGTM with nits. I can take care of the nits in another PR.

@@ -55,6 +82,8 @@ func newEventMonitor(c *containerstore.Store, s *sandboxstore.Store) *eventMonit
sandboxStore: s,
ctx: ctx,
cancel: cancel,
backOff: newBackOff(backOffInitDuration, backOffMaxDuration,
Member

I think we can just use the const instead of parameterizing it.

logrus.WithError(err).Errorf("Failed to convert event %+v", e)
break
}
if em.backOff.isInBackOff(cID) {
Member

Add an info log here.

t.Logf("Should be able to check that a container isn't in backOff state")
notExistKey := "containerNotExist"
assert.Equal(t, actual.isInBackOff(notExistKey), false)

Member

We should add a test case checking that getExpiredContainers returns nothing when no queue has expired.

for _, k := range actKeyList {
actKeyMap[k] = struct{}{}
}
assert.Equal(t, actKeyMap, expKeyMap)
Member

We can compare length, and use assert.Contains.

@Random-Liu
Member

@yanxuean Thanks a lot! Good job! :D

@Random-Liu Random-Liu merged commit eff311d into containerd:master Mar 15, 2018
@yanxuean yanxuean deleted the nits branch March 23, 2018 02:16
Successfully merging this pull request may close these issues.

Handle containerd event reliably
3 participants