
"swarm init --force-new-cluster" ignores advertise-addr flag #523

Open
2 tasks done
burner-account opened this issue Dec 11, 2018 · 11 comments

Comments

@burner-account

  • This is a bug report
  • I searched existing issues before opening this one

Expected behavior

If admins follow the instructions on backup/restore (https://docs.docker.com/engine/swarm/admin_guide/), they should be able to transfer old swarm data (secrets, ...) to a new swarm.
When using
docker swarm init --force-new-cluster
to do so, admins should expect other flags (e.g. --advertise-addr) to be honored as well.
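
That is, on the restored machine one would expect the printed join command to use the address passed via --advertise-addr (NEW_IP is a placeholder for the new machine's address):

docker swarm init --force-new-cluster --advertise-addr NEW_IP
# expected output: docker swarm join --token <token> NEW_IP:2377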

Actual behavior

Following the instructions, I was able to restore the secrets of an old swarm to the new one.
Although I set --advertise-addr to the NEW_IP, the swarm initialization returns a join command of the form:
docker swarm join --token long-token-string OLD_IP:2377

Manually changing the IP in the join command allows nodes to join the swarm, but - because the old advertise address is pushed to the nodes - the swarm ends up in a half-working state (see the quick check after the list):

  • mesh routing stops working
  • stack deployments still work as expected
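
A quick way to confirm the stale address on the restored manager (a sketch; "self" is accepted by docker node inspect for the local node):

docker info --format 'node address: {{ .Swarm.NodeAddr }}'
docker node inspect self --format 'advertised: {{ .ManagerStatus.Addr }}'

Here the second command would still show OLD_IP:2377 (matching the Manager Addresses entry in the docker info output below) even though --advertise-addr was set to NEW_IP.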

Steps to reproduce the behavior

1.) in yourCorpNet, network 10.0.1.0/24, on machine 10.0.1.1
2.) init a swarm; no need to join nodes.
3.) store a swarm secret as a marker for the state backup/restore test.
4.) back up the swarm folder, see https://docs.docker.com/engine/swarm/admin_guide/#back-up-the-swarm

5.) in yourCorpNet, network 10.0.2.0/24, on machine 10.0.2.1
6.) stop docker
7.) restore the swarm folder, see https://docs.docker.com/engine/swarm/admin_guide/#restore-from-a-backup
8.) start docker
9.) docker swarm init --force-new-cluster --advertise-addr 10.0.2.1
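
For reference, a minimal sketch of steps 4.) to 9.) as shell commands, assuming the default swarm state directory /var/lib/docker/swarm and a backup archive named swarm-backup.tar.gz (both placeholders; the archive has to be copied to the new machine):

# on 10.0.1.1: back up the swarm state while the daemon is stopped
systemctl stop docker
tar -czf swarm-backup.tar.gz -C /var/lib/docker swarm
systemctl start docker

# on 10.0.2.1: restore the state, then re-init with the new address
systemctl stop docker
rm -rf /var/lib/docker/swarm
tar -xzf swarm-backup.tar.gz -C /var/lib/docker
systemctl start docker
docker swarm init --force-new-cluster --advertise-addr 10.0.2.1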

Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:48:22 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:19:08 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: rsx29bg80yn126lf9g2p5cvyr
 Is Manager: true
 ClusterID: dqfryv63sxml0imlpc9n46jkd
 Managers: 1
 Nodes: 4
 Default Address Pool: 10.0.0.0/8  
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: NEW_IP
 Manager Addresses:
  OLD_IP:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.19.7-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.946GiB
Name: ********
ID: ORLV:D6MD:O4H7:JNVQ:AY3A:L5AO:S7FZ:JKYN:NFKF:CA63:4Q3T:SZAD
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

@bmedici

bmedici commented Dec 20, 2018

Same problem here: it keeps advertising the 10.14.x.x IP, which is now unreachable, even though I force the right advertise address:

docker swarm init --force-new-cluster --advertise-addr=10.16.83.29

Same without the = sign:

docker swarm init --force-new-cluster --advertise-addr 10.16.83.29

@Mrzhangxd

The same problem

@yunweizhe11

The same problem

@sych74

sych74 commented Aug 3, 2019

This seems related to the unresolved issue moby/moby#34306.

@risyou

risyou commented May 19, 2020

As my workaround, from master node

docker swarm leave -f

Then

docker swarm init --force-new-cluster --advertise-addr 10.x.x.x

@reifnir

reifnir commented Mar 18, 2021

As my workaround, from master node

docker swarm leave -f

Then

docker swarm init --force-new-cluster --advertise-addr 10.x.x.x

That isn't a workaround if you care to keep the state of the swarm (secrets, configs, stacks, services, etc.)

@dehy

dehy commented Apr 25, 2021

I'm having the same problem here. The initial swarm was created with the wrong IP. Services were deployed. I cannot destroy the swarm state for fear of losing something. The advertise-addr flag does not seem to work: the old IP appears to be stored in the raft consensus database and is picked up again on force-new-cluster :(
I cannot add a second manager because the advertised IP is unreachable.

@TheWiresharkGuy

TheWiresharkGuy commented Jul 23, 2021

I'm having the same problem here. The initial swarm was created with the wrong IP. Services were deployed. I cannot destroy the swarm state for fear of losing something. The advertise-addr flag does not seem to work: the old IP appears to be stored in the raft consensus database and is picked up again on force-new-cluster :(
I cannot add a second manager because the advertised IP is unreachable.

Hi, I'm in the same situation, did you manage to solve this? Maybe update the IP address in the raft DB with the docker service stopped?

@reifnir

reifnir commented Jul 24, 2021

I was able to restore the swarm to a functioning, highly available state with different IP addresses (a totally different CIDR range).

I'm almost certain these were all of the steps. The next time I need to do this, I'll test these instructions and write them up more properly. This is to help anyone who's been stuck where I was.

  1. Stand up a single manager node and restore the state onto it. (Calling this Node1)
    • The node will now accept calls on the IP address on eth0 as normal.
    • Every other node will appear as offline.
    • In docker node inspect [itself], it will report that it has the IP address from the node in which the backup was taken.
    • You all know this already, just setting context.
  2. Get another node ready to join the swarm. (Calling this Node2)
  3. On Node2, use iptables to redirect all traffic destined for Node1's old IP address to the new one. It may not be necessary to redirect both 2375 and 2377 (or whatever port you use), but this worked. Ex:
iptables -t nat -A OUTPUT -p tcp -d $OLD_NODE_1_IP --dport 2375 -j DNAT --to-destination $NEW_NODE_1_IP:2375
iptables -t nat -A OUTPUT -p tcp -d $OLD_NODE_1_IP --dport 2377 -j DNAT --to-destination $NEW_NODE_1_IP:2377
  4. Have Node2 join the swarm as a worker.
  5. Wait. You may have to wait 5 or 10 minutes. Just be patient while the manager reports Node2's status as Unknown.
  6. Promote Node2 to manager.
  7. Wait. The swarm state doesn't synchronize immediately. Unless you're sure you know how to detect when a new manager has synced all of its state, just wait 10 minutes (a quick check is sketched after this list).
    • In the past, I'd messed this part up by immediately demoting the old manager and then giving it the boot from the swarm. The new manager had an incomplete view of swarm state.
    • Don't be like Jim. Be patient.
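
One way to check that the new manager has caught up before demoting or removing anything (a sketch; Node2 stands for the hostname used above):

docker node ls
docker node inspect Node2 --format '{{ .ManagerStatus.Reachability }}'

Both managers should show up as Ready in docker node ls, and the second command should print "reachable" for the promoted node.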

You can now join other managers and workers without trouble. Just be sure to get rid of that iptables rule (the removal commands are sketched below), preferably by killing that node entirely once you have 5 other managers.
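
For reference, the DNAT rules from step 3 can be removed on Node2 by repeating them with -D instead of -A (same placeholder variables as above):

iptables -t nat -D OUTPUT -p tcp -d $OLD_NODE_1_IP --dport 2375 -j DNAT --to-destination $NEW_NODE_1_IP:2375
iptables -t nat -D OUTPUT -p tcp -d $OLD_NODE_1_IP --dport 2377 -j DNAT --to-destination $NEW_NODE_1_IP:2377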

HTH

@obsidiangroup

There has been no other fix for this? Having to recreate an entire swarm just to update the advertise address is destructive and time consuming. This problem has been around for ages with no fix. It feels like no work is being done to actually solve this issue because there are "work-arounds", though these are not really viable solutions. We should not have to essentially break the swarm (even more than it already is because of this bug), create a new swarm, and then join all members to it. But right now we have no choice, which will mean some downtime and a late night/early morning.

@paliok2021

Hi, I am facing a problem with moving a swarm to a different data path port (Data Path Port: 9789). We created a new swarm on port 9789, but all nodes still use the old port 4789. We didn't check the actual communication ports and only looked at docker info, where Data Path Port: 9789 looks fine. Now I am trying to add a new server: it connects to a manager configured for port 9789, docker info on the new server shows Data Path Port: 9789, and netstat -plun shows 9789, but all the existing servers in the swarm still work on the default port 4789, which is why the new server cannot communicate with the others. It is very strange and I don't know what went wrong during the migration of the old swarm from port 4789 to the new port 9789. Does anybody have experience with this situation? It was a lot of work to move all services, secrets, configs, etc. to the new swarm, and everything else is the same.
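
Not a fix, but one way to see which VXLAN port the overlay traffic is actually using is to capture on a node while services communicate (eth0 is a placeholder for the data-path interface):

tcpdump -ni eth0 'udp port 4789 or udp port 9789'

If packets only show up on 4789, the existing nodes are still using the old data path port regardless of what docker info reports.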
