Feature idea: on startup, read subnet.env and attempt to acquire that lease #610

Closed
rosenhouse opened this Issue Feb 10, 2017 · 2 comments

rosenhouse commented Feb 10, 2017

Background

We are concerned about preserving network connectivity for containers in the face of etcd outages and data loss. For example, during upgrades of etcd, Cloud Foundry users have encountered various "split-brain"-like scenarios. These scenarios are most easily resolved by stopping all etcd nodes, removing all the data directories, and finally restarting all the nodes. But CF users expect their containers to remain connected to the network throughout. Therefore, we prefer that etcd clients like flannel be resilient to etcd outages and data loss.

Problem

In manual testing of etcd outages that included data loss, we've found that flannel networks retain connectivity only if the flanneld processes remain alive throughout the outage.

Consider this sequence:

  1. deploy container hosts A & B, running flanneld connected to a remote etcd cluster.
  2. launch some containers on each host, using flannel CNI plugin; verify network connectivity between containers on different hosts
  3. etcd outage begins: stop all etcd nodes
  4. restart flanneld on host A; leave it running on host B
  5. etcd data loss: rm -rf the etcd data directories on every etcd node
  6. etcd outage ends: start all etcd nodes (but with empty data directories)
  7. check network connectivity across hosts

In testing, we've found that after step 6, host B correctly re-acquires its existing subnet lease -- the one in use by its running containers. But host A does not: instead it picks a new subnet at random, which invariably does not match the IP addresses of containers it is hosting. As a result, the containers on host A cannot reach host B.

Proposed solution

On startup, flanneld reads from the subnet.env file, if it exists. If it contains a subnet valid for the current configuration, then flanneld attempts to re-acquire that subnet from etcd. In the case of a conflict, flanneld falls back to picking a new subnet at random (current behavior).
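To make the proposal concrete, here is a minimal Go sketch of the startup decision. It is not flannel's actual code. It assumes subnet.env holds KEY=VALUE lines with a FLANNEL_SUBNET entry such as 10.1.17.1/24, and the names leaseStore, memStore, chooseSubnet, mustCIDR and the pickNew callback are all hypothetical stand-ins (the in-memory map takes the place of the etcd-backed lease registry).

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// parseSubnetEnv scans KEY=VALUE text for FLANNEL_SUBNET and returns the
// subnet it denotes, with host bits masked off (10.1.17.1/24 -> 10.1.17.0/24).
func parseSubnetEnv(content string) (*net.IPNet, error) {
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		key, val, ok := strings.Cut(strings.TrimSpace(sc.Text()), "=")
		if !ok || key != "FLANNEL_SUBNET" {
			continue
		}
		_, subnet, err := net.ParseCIDR(val)
		if err != nil {
			return nil, fmt.Errorf("bad FLANNEL_SUBNET %q: %w", val, err)
		}
		return subnet, nil
	}
	return nil, fmt.Errorf("FLANNEL_SUBNET not found")
}

// validForNetwork reports whether subnet lies inside the configured network
// and has the configured per-host prefix length.
func validForNetwork(subnet, network *net.IPNet, subnetLen int) bool {
	ones, bits := subnet.Mask.Size()
	netOnes, netBits := network.Mask.Size()
	return bits == netBits && ones == subnetLen && ones >= netOnes &&
		network.Contains(subnet.IP)
}

// leaseStore is a hypothetical stand-in for the etcd-backed lease registry.
type leaseStore interface {
	// TryAcquire leases subnet to host unless another host already holds it.
	TryAcquire(subnet, host string) bool
}

// memStore is an in-memory leaseStore mapping subnet -> owning host.
type memStore map[string]string

func (m memStore) TryAcquire(subnet, host string) bool {
	if owner, taken := m[subnet]; taken && owner != host {
		return false // conflict: someone else holds this subnet
	}
	m[subnet] = host
	return true
}

// chooseSubnet prefers the subnet recorded in subnet.env when it is valid for
// the current configuration and can be re-acquired. Otherwise it falls back to
// pickNew, which stands in for the existing "pick at random" behavior.
func chooseSubnet(envContent string, network *net.IPNet, subnetLen int,
	host string, store leaseStore, pickNew func() string) string {
	prev, err := parseSubnetEnv(envContent)
	if err == nil && validForNetwork(prev, network, subnetLen) &&
		store.TryAcquire(prev.String(), host) {
		return prev.String()
	}
	return pickNew()
}

// mustCIDR is a small helper for examples; it panics on malformed input.
func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}
```

With a network of 10.1.0.0/16, a subnet length of 24, and an empty store (the post-data-loss case), chooseSubnet hands back 10.1.17.0/24, which is what the running containers on host A would need. A missing file, a subnet outside the configured network, or a lease already held by another host all end up in pickNew.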

What do y'all think? Would you be open to a PR?

References

cc: @genevievelesperance @jaydunk @rusha19

Member

tomdee commented Mar 22, 2017

Yes, I think this sounds like a really useful feature.

Member

tomdee commented Apr 27, 2017

Originally raised and tracked in #29 - I'm going to close this issue and just track it from #29.

tomdee closed this Apr 27, 2017

mgleung added a commit to mgleung/flannel that referenced this issue Jun 19, 2017

flannel reads from created subnet.env file on startup
Added feature to allow flannel to restart in case of etcd failures and
still keep the same subnet address for the hosts.

Fixes #610

mgleung added a commit to mgleung/flannel that referenced this issue Jun 19, 2017

flannel reads from created subnet.env file on startup
Added feature to allow flannel to restart in case of etcd failures and
still keep the same subnet address for the hosts.

Fixes #610 #29

mgleung added a commit to mgleung/flannel that referenced this issue Jun 22, 2017

flannel reads from created subnet.env file on startup
Added feature to allow flannel to restart in case of etcd failures and
still keep the same subnet address for the hosts.

Fixes #610 #29
