Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove attachable network on swarm leave #30157

Merged
merged 1 commit into from Jan 27, 2017
Merged

Remove attachable network on swarm leave #30157

merged 1 commit into from Jan 27, 2017

Conversation

aboch
Copy link
Contributor

@aboch aboch commented Jan 14, 2017

  • When the node leaves the cluster, if any user run
    container(s) is connected to the swarm (attachable) network,
    then daemon needs to detach the container(s) and
    remove the network.

Fixes #30084

- A picture of a cute animal (not mandatory but encouraged)

hyena

@cpuguy83
Copy link
Member

Seems like it's the right thing to do... what happens to the networking in the container when this happens?

Also, build failure:

05:42:45 ./docker_cli_swarm_test.go:380: undefined: strings.TrimSpae

@aboch
Copy link
Contributor Author

aboch commented Jan 14, 2017

Thanks @cpuguy83

what happens to the networking in the container when this happens?

The network interface which was connecting the container to the attachable network is removed.

@@ -468,3 +468,35 @@ func (daemon *Daemon) deleteNetwork(networkID string, dynamic bool) error {
func (daemon *Daemon) GetNetworks() []libnetwork.Network {
return daemon.getAllNetworks()
}

// clearAttachableNetworks remove the attachable networks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove -> removes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will fix

@aboch
Copy link
Contributor Author

aboch commented Jan 18, 2017

ping @mavenugo

@mavenugo mavenugo added this to the 1.13.1 milestone Jan 25, 2017
continue
}
containerID := sb.ContainerID()
if err := daemon.DisconnectContainerFromNetwork(containerID, nwName, true); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use network-id instead of network-name?

Copy link
Contributor Author

@aboch aboch Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. This API can take either name of ID (it eventually calls FindNetwork()), it was in fact the first function which processes a network disconnect POST request. I am saying "it was" because later with the addition of swarm networks being lazily programmed FindNetwork had to be called before anyway.
Edit: ^^Looks like that FindNetwork call in the postNetworkDisconnect() has now been removed from master as well (it was there in 1.11.x branch where I was mistakenly looking at)

In this case the network is known to this node and there is no need to pass the ID.

But I agree it will spare the reader from wondering about this logic when reading the code, I will change it to ID.

Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall a case where both a "bridge" and "overlay" network existed on a node (due to a race?), resulting in the service failing to start. Although "corner case", removing the network by name could potentially result in the wrong network being deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, then it makes sense.
If user created that scenario, it is expected from him to handle things right when calling docker network rm ....
But in this case, being an automatic action taken by the daemon, we must delete the right network, with no user intervention.
So the right thing to do here is to pass the network ID as you suggested. Thanks for your comment.

@mavenugo
Copy link
Contributor

LGTM

@aboch curious to know how this will behave in a case where all the swarm managers leave, while the worker node (with an attachable network and containers) stays ? I guess, it will fix itself when the worker rejoins to new managers I assume. Can you pls confirm ?

@aboch
Copy link
Contributor Author

aboch commented Jan 25, 2017

@mavenugo
Nothing happens. The worker nodes keep trying to reach the manager. When I restart the manager, they reconnect. The network is never removed and the containers never detached.

Not sure if this is the test you wanted to check.

Is the worker node going into a failed state if we wait long enough and cluster provider be set to nil ?
If that is the case, then the cleanup logic would kick in.

@mavenugo
Copy link
Contributor

@aboch yes. I think that is reasonable and expected behavior. am good.

// after disconnecting any connected container
func (daemon *Daemon) clearAttachableNetworks() {
var networks []libnetwork.Network
for _, n := range daemon.GetNetworks() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering, should we combine the two loops, and simply

for _, n := range daemon.GetNetworks() {
    if !n.Info().Attachable() {
        continue
    }

Or is there a special reason for using the intermediate networks variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it in fact looks like Getnetworks() returns a new slice of networks.
Then there is no need to create a new slice here.
Will fix it shortly. Thanks.

@aboch
Copy link
Contributor Author

aboch commented Jan 25, 2017

@thaJeztah Taken care of your comment, and retested it, thanks. PTAL when you get a chance.

@tonistiigi
Copy link
Member

@aboch Can DeleteManagedNetwork call back to cluster pkg? Want to make sure that there isn't a possibility for a deadlock as that issue is not fixed in v1.13.

@aboch
Copy link
Contributor Author

aboch commented Jan 25, 2017

Thanks @tonistiigi Unless I missed it, DeleteManagedNetwork and the downstream calls in libnetwork do not call back to the docker cluster pkg.

Also, I do not think this fix will go in 1.13.x

@tonistiigi
Copy link
Member

LGTM

@mavenugo
Copy link
Contributor

@dnephin @aboch are we good to go with this PR ? FYI, This is marked for 1.13.1 and @vieux is in the process of cherry-picking the relevant patches to the 1.13.x branch.

@aboch
Copy link
Contributor Author

aboch commented Jan 26, 2017

@dnephin Addressed your comment, PTAL

(I am having some weird build error, so I just pushed the change to speed up the review)

@dnephin
Copy link
Member

dnephin commented Jan 26, 2017

LGTM

Thank you!

@aboch
Copy link
Contributor Author

aboch commented Jan 26, 2017

CI failures due to #28409. @dnephin is looking into it.

- When the node leaves the cluster, if any user run
  container(s) is connected to the swarm network,
  then daemon needs to detach the container(s) and
  remove the network.

Signed-off-by: Alessandro Boch <aboch@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet