restart the event monitor when unhealthy node comes back #230

Merged · 5 commits · Jan 19, 2015 · 5 participants
@mountkin
Contributor

mountkin commented Jan 9, 2015

When an unhealthy node comes back, the event monitor should be restarted.
Otherwise, events from the node can no longer be received.

The problem can be reproduced by:

  1. Start the leader process
  2. Join a node to the cluster
  3. Stop the docker daemon on the agent node after the node is discovered by the cluster, and start it again after the leader marks it as unhealthy
  4. Try to stop/start or run a container and check the leader's logs; you'll find that no events are received anymore.
@phemmer

phemmer commented Jan 9, 2015

This looks like it introduces an issue where events can be missed: if an event is generated between the StopAllMonitorEvents() and StartMonitorEvents() calls, it will be missed.
I feel like the right solution here is to just start tracking this one node again.

I think it would also be good to have it pick up all events that occurred for the one node while it was missing.

@mountkin

Contributor

mountkin commented Jan 12, 2015

Hi @phemmer
The restart is only triggered when an unhealthy node comes back alive. There are two cases that may cause refreshContainers() to fail:

  1. the Docker remote API responds with invalid JSON
  2. a network error occurs between Swarm and the node

Case 1 is unlikely to happen, but checking the error returned by refreshContainers() before calling StopAllMonitorEvents() would be safer.
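
A minimal sketch of that ordering (an editorial illustration, not the PR code; the fields and log call follow the patch shown later in this thread):

// Only restart the event monitor once the refresh has succeeded, so a
// refresh failure (e.g. invalid JSON from the remote API) never leaves
// the node without a working event monitor.
if err := n.refreshContainers(); err != nil {
	log.Errorf("[%s/%s] refresh failed: %v", n.ID, n.Name, err)
	return
}
n.client.StopAllMonitorEvents()
n.client.StartMonitorEvents(n.handler)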

For case 2, the connection from the previous GET /events API call is probably already broken, so StopAllMonitorEvents() won't lead to the problem you pointed out.

As for your suggestion of picking up all the events that occurred while the node was missing, I'm not sure the outdated events are really valuable for cluster management, because when the unhealthy node comes back alive there is already a refreshContainers() call and the node's status is synced automatically.

What do you think?

@abronan

Contributor

abronan commented Jan 12, 2015

@mountkin there is definitely a use case where we want to get events from the node that was down/unreachable.

We'd want to maintain a local cache of the downloaded images on each node. This cache is updated using events: if a node downloads an image from a registry, it issues an event that the manager uses as a trigger to update its local cache by calling getImages on that one node.

If we do not get those events from the node, this cache may be invalid until the manager receives a new get image event from this node.

Thus, updating the status only (with refreshContainers) is not sufficient.

As suggested by @phemmer, I guess we need to get the events from the moment the node was marked as unhealthy until the moment we start monitoring events for it again. This could be done by calling getEvents with the since and until arguments set to the two relevant timestamps.

Still, events could be missed between the moment the node effectively becomes unreachable and the moment we mark it as unhealthy. So the solution here probably needs to be a bit more elaborate (using a timestamp pointing a while into the past and checking whether the handler has already processed a given event).
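
To make the catch-up concrete, here is a hedged sketch that reads GET /events with since/until directly from the Docker remote API; the URL, timestamps, Event shape, and function names are illustrative assumptions, not Swarm code:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Event mirrors the JSON shape of Docker remote API events (see the
// example later in this thread).
type Event struct {
	Status string `json:"status"`
	Id     string `json:"id"`
	From   string `json:"from"`
	Time   int64  `json:"time"`
}

// replayEvents fetches the events a node emitted between since and until
// (Unix timestamps) and feeds them to handler as if they were live.
func replayEvents(nodeURL string, since, until int64, handler func(*Event)) error {
	url := fmt.Sprintf("%s/events?since=%d&until=%d", nodeURL, since, until)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// The endpoint streams one JSON object per event.
	dec := json.NewDecoder(resp.Body)
	for {
		var ev Event
		if err := dec.Decode(&ev); err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
		handler(&ev)
	}
}

func main() {
	// Hypothetical usage: replay what was missed while the node was down.
	err := replayEvents("http://1.2.3.4:2375", 1421110000, 1421115560, func(ev *Event) {
		fmt.Printf("replayed: %s %s\n", ev.Status, ev.Id)
	})
	if err != nil {
		fmt.Println("replay failed:", err)
	}
}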

Your solution fixes an immediate issue though, so I guess this could be merged before going forward with a more elaborate solution in another PR?

What do you think @aluzzardi @vieux?

@vieux

Contributor

vieux commented Jan 12, 2015

@abronan yes, I agree, but we have a fail-safe: every 30s we do a full refresh, so even if we miss a get image event, it will fix itself within 30s at most.

I think we shouldn't really care about the missed events; we should just trigger a refresh.
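
For illustration, the fail-safe amounts to a periodic full refresh; a minimal sketch (the interval and function names are stand-ins, not the actual Swarm code):

package main

import (
	"log"
	"time"
)

// refreshLoop runs a full refresh on a fixed interval so that any state
// drift caused by missed events self-heals within one interval.
func refreshLoop(refresh func() error, interval time.Duration) {
	for range time.Tick(interval) {
		if err := refresh(); err != nil {
			log.Printf("refresh failed: %v", err)
		}
	}
}

func main() {
	// With a 30s interval, a missed get image event is corrected by the
	// next full refresh, 30s later at most.
	refreshLoop(func() error {
		log.Println("full refresh: containers, images, specs")
		return nil
	}, 30*time.Second)
}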

@mountkin

Contributor

mountkin commented Jan 13, 2015

I also think that picking up the outdated events is not necessary.
But as @abronan mentioned, refreshContainers() only refreshes the containers' state, not the images'.
And I think refreshing everything here is impractical. We can let the user determine what to do when an unhealthy node comes back.
What about extending the Docker event mechanism and emitting a custom event when an unhealthy node comes back?
For example:

{
  "status":"node_comes_back",
  "id":"01e87f90af30b6567359f6a1042dfae5b933e184869a4184aa90bfedf65ec945",
  "from":"swarm",
  "time":1421115560,
  "Node": "id-of-the-node"
}

The user can do anything they like by monitoring the events API of Swarm.

The patch is something like this:

diff --git a/cluster/node.go b/cluster/node.go
index abf3dff..c447f60 100644
--- a/cluster/node.go
+++ b/cluster/node.go
@@ -239,6 +239,14 @@ func (n *Node) refreshLoop() {
        } else {
            if !n.healthy {
                log.Infof("[%s/%s] Node came back to life. Hooray!", n.ID, n.Name)
+               n.client.StopAllMonitorEvents()
+               n.client.StartMonitorEvents(n.handler)
+               ev := &dockerclient.Event{
+                   Status: "node_comes_back",
+                   From:   "swarm",
+                   Time:   time.Now().Unix(),
+               }
+               n.handler(ev)
            }
            n.healthy = true
        }

Edit:
The patch above causes another call to refreshContainers().
To avoid this, it can be changed to call n.eventHandler.Handle() directly.
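
Roughly (a sketch of the adjustment described in the edit; the eventHandler field and the Handle signature are assumed from the discussion, not taken from the actual code):

// Hand the custom event to the cluster's event handler directly instead
// of going through n.handler, so no extra refreshContainers() is triggered.
ev := &dockerclient.Event{
	Status: "node_comes_back",
	From:   "swarm",
	Time:   time.Now().Unix(),
}
n.eventHandler.Handle(ev)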

@vieux

Contributor

vieux commented Jan 13, 2015

@mountkin I like the idea, can you update your PR?

@vieux

Contributor

vieux commented Jan 13, 2015

can you send the node left event also?

@abronan

Contributor

abronan commented Jan 13, 2015

I like the idea too! 😛
And this would be useful for updating the cache as I mentioned above 👍

@mountkin

Contributor

mountkin commented Jan 14, 2015

I've updated the PR.
The event emitted when a node becomes unhealthy is named "node_die", and the event emitted when an unhealthy node comes back is named "node_comes_back".

@aluzzardi

Contributor

aluzzardi commented Jan 14, 2015

We should probably do a full refresh anyway - a node that goes away and comes back could mean different hardware, especially in a cloud environment (for instance, more memory / CPU, which we do care about).

Nitpick: I don't like node_die and node_comes_back as names. Perhaps node_join / node_left, I don't know.

We haven't thought about the node join/leave cycle yet - my guess is that eventually we are going to remove the whole Node object and re-create it from scratch when the node comes back to life, rather than keeping an unhealthy Node around.

Thoughts?

In the meantime, this PR fixes a real problem we have today.

@phemmer

phemmer commented Jan 14, 2015

And I'm going to nit on the names as well. They become impossible to change later on :-)
Agree with the sentiment about die/comes_back, but I think node_join/node_left might be a bit ambiguous. These names sound like the node is gracefully joining and leaving the cluster (plus they're two different tenses, present and past). Perhaps node_disconnect/node_reconnect?

@mountkin

Contributor

mountkin commented Jan 14, 2015

I also think that join/left is ambiguous.
The "node_die" is derived from the container event of docker remote API. "node_comes_back" is copied from the info log of the code.
I have no idea which names are better. Any suggestions are welcomed!

@abronan

Contributor

abronan commented Jan 14, 2015

@aluzzardi you have a point about the full refresh...

I don't like the names either, but as already mentioned, node_join/node_left somehow implies that a node is voluntarily joining or leaving the cluster. This might just as well be a network partition, in which case node_disconnect/node_reconnect makes more sense imho.

@mountkin

Contributor

mountkin commented Jan 15, 2015

Following the suggestions from @aluzzardi and @phemmer, I changed the names to node_disconnect/node_reconnect.

@vieux

Contributor

vieux commented Jan 15, 2015

So we're good? The custom events will trigger a full refresh.

@aluzzardi

Contributor

aluzzardi commented Jan 16, 2015

Perhaps instead of reconnect we should trigger an event on connect?

Or a different event altogether (connect, disconnect, reconnect)?

Since we are issuing this new kind of event, I think it will be useful for API clients to know whenever a new node joins, not just when it reconnects.

mountkin added some commits Jan 9, 2015

restart the event monitor when unhealthy node comes back
Signed-off-by: mountkin <mountkin@gmail.com>
emit a custom event when a node dies or comes back
Signed-off-by: mountkin <mountkin@gmail.com>
rename custom events
Signed-off-by: mountkin <mountkin@gmail.com>
add a "node_connect" custom event when a node joins the cluster
Signed-off-by: mountkin <mountkin@gmail.com>
@mountkin

Contributor

mountkin commented Jan 19, 2015

Hi @aluzzardi
The node_connect event is added. I also added a call to updateSpecs() when an unhealthy node comes back.

@vieux vieux added this to the Swarm Beta 0.1.0 milestone Jan 19, 2015

@vieux

Contributor

vieux commented Jan 19, 2015

LGTM

vieux added a commit that referenced this pull request Jan 19, 2015

Merge pull request #230 from mountkin/master
restart the event monitor when unhealthy node comes back

@vieux vieux merged commit 08a17cd into docker:master Jan 19, 2015

1 check passed

continuous-integration/travis-ci The Travis CI build passed