Upgrading from v2.3 through v3.0 and v3.1 to v3.2 results in panic #9480

Closed
jlhawn opened this issue Mar 22, 2018 · 14 comments

Comments

4 participants
@jlhawn

commented Mar 22, 2018

Bug reporting

The docs recommend upgrading from v2.3 to v3.2 by first upgrading to each minor version along the way. However, there seems to be an issue if you perform this transition too quickly: if there are no writes to the v3 backend or no snapshots are produced while running v3.0 or v3.1, v3.2 panics on startup.

To reproduce this, start with an etcd v2.3 server which does have a snapshot (this bug does not occur if no snapshots have taken place yet). Stop the server and replace it with a v3.0 server. Everything seems fine. Next, stop the server and replace it with a v3.1 server. Again, everything is fine. Finally, stop the server and replace it with a v3.2 server and witness this panic when the server starts up:

2018-03-22 18:14:32.879716 I | etcdserver: recovered store from snapshot at index 52
2018-03-22 18:14:32.882938 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb7ab8c]

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201ac5f8, 0xc4201ac3d0)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:284 +0x3c
panic(0xdaf1c0, 0xc42025f950)
	/usr/local/google/home/jpbetz/.gvm/gos/go1.8.7/src/runtime/panic.go:489 +0x2cf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420170820, 0xf95ff9, 0x2a, 0xc4201ac440, 0x1, 0x1)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc42026c000, 0x0, 0x14b2580, 0xc42025f8e0)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:379 +0x2e4d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420182a80, 0xc420264000, 0x0, 0x0)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:157 +0x782
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420182a80, 0x6, 0xf71713, 0x6, 0x1)
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:186 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:103 +0x15ba
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
	/usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

The bug seems to have been introduced in this patch from last year.

In this case, the new db backend (which has not yet been used and has been in its initial state since the v3.0 server was deployed) reports an index (0) which is less than the latest snapshot index. The server assumes this means there is a *.snap.db file which can be renamed to db to catch up to the *.snap file, but no such *.snap.db file exists.
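
For anyone who wants to look at their own data directory before the v3.2 step, here is a minimal shell sketch that checks for the files mentioned above. It assumes the member/snap layout seen in the repro steps in this thread, and the function name check_snap_db is made up for illustration. It only inspects file names and cannot read the backend's index, so a missing *.snap.db is a hint rather than proof that v3.2 will panic.

```shell
# Hypothetical helper, not part of etcd: report whether a data directory
# holds a *.snap file and whether any *.snap.db file sits next to it.
# Assumption: snapshots live under <data-dir>/member/snap.
check_snap_db() {
  snap_dir="$1/member/snap"
  latest_snap=""
  have_snap_db=no
  # Globs expand in sorted order, so the last *.snap seen is the newest
  # (file names are zero-padded hex).
  for f in "$snap_dir"/*; do
    case "$f" in
      *.snap.db) have_snap_db=yes ;;
      *.snap)    latest_snap=${f##*/} ;;
    esac
  done
  if [ -z "$latest_snap" ]; then
    echo "no snapshot"
  elif [ "$have_snap_db" = yes ]; then
    echo "snap.db present"
  else
    echo "snapshot $latest_snap has no snap.db"
  fi
}
```

Called as check_snap_db /data inside a container that mounts the etcd-data volume, it prints one of the three messages.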

@gyuho added the area/bug label Mar 22, 2018

@gyuho

Member

commented Mar 22, 2018

> To reproduce this, start with an etcd v2.3 server which does have a snapshot (this bug does not occur if no snapshots have taken place yet). Stop the server and replace it with a v3.0 server. Everything seems fine. Next, stop the server and replace it with a v3.1 server.

Do you have any v3 data?

@jlhawn

Author

commented Mar 22, 2018

No, we are starting from etcd v2.3.8 and only use the v2 API (I thought the v3 stuff was experimental in that release), so we don't have any v3 data.

@gyuho

Member

commented Mar 22, 2018

So just migrating a v2.3.8 server with v2 data to v3 etcd servers can trigger this panic?

@jlhawn

Author

commented Mar 22, 2018

> So just migrating a v2.3.8 server with v2 data to v3 etcd servers can trigger this panic?

Yes.

@gyuho

Member

commented Mar 22, 2018

@jlhawn OK, I will try to reproduce. Thanks for the report.

@jlhawn

Author

commented Mar 22, 2018

@gyuho I have prepared some repro steps if you think that would help.

@gyuho

Member

commented Mar 22, 2018

@jlhawn Can you share minimal reproducible steps here?

@jlhawn

Author

commented Mar 22, 2018

My minimal repro steps only require using docker:

  1. Create a volume named etcd-data:
docker volume create --name etcd-data
  2. Create a shell var to store some basic arguments for the server:
ARGS='-name etcd0 -data-dir /data
-advertise-client-urls http://127.0.0.1:2379,http://127.0.0.1:4001
-listen-client-urls http://127.0.0.1:2379,http://127.0.0.1:4001
-initial-advertise-peer-urls http://127.0.0.1:2380
-listen-peer-urls http://127.0.0.1:2380
-initial-cluster-token etcd-cluster-1
-initial-cluster etcd0=http://127.0.0.1:2380
-initial-cluster-state new'
  3. Create a 2.3.8 server:
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v2.3.8 $ARGS -snapshot-count 25

The very low -snapshot-count flag forces a snapshot soon after the server starts, without the need to write any data (writing some data works too). Use docker logs etcd to wait for the snapshot. It should take only a few seconds, and the log looks like this:

2018-03-22 20:52:44.261940 I | etcdserver: start to snapshot (applied: 26, lastsnap: 0)
2018-03-22 20:52:44.266772 I | etcdserver: saved snapshot at index 26

You can also check the contents of the etcd-data volume to see that a member/snap/0000000000000002-000000000000001a.snap file now exists.

At this point, remove the server container with docker rm -f etcd.

  4. Create another server with etcd v3.0 (note we don't need the small snapshot count any longer):
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v3.0 etcd $ARGS

You'll see from the logs of that container that it's up and has migrated from 2.3 to 3.0 and enabled v3 features:

2018-03-22 20:54:03.900059 I | etcdserver: updating the cluster version from 2.3 to 3.0
2018-03-22 20:54:03.902694 N | membership: updated the cluster version from 2.3 to 3.0
2018-03-22 20:54:03.902739 I | api: enabled capabilities for version 3.0

Remove this container again to upgrade it to v3.1: docker rm -f etcd

  5. Create another server with etcd v3.1 next:
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v3.1 etcd $ARGS

You'll see from the logs of that container that it's up and has migrated from 3.0 to 3.1 and enabled v3.1 features:

2018-03-22 20:54:27.241720 I | etcdserver: updating the cluster version from 3.0 to 3.1
2018-03-22 20:54:27.243598 N | etcdserver/membership: updated the cluster version from 3.0 to 3.1
2018-03-22 20:54:27.243659 I | etcdserver/api: enabled capabilities for version 3.1

Remove this container again to upgrade it to v3.2: docker rm -f etcd

  6. Create another server with etcd v3.2:
docker run -d -v etcd-data:/data --name etcd quay.io/coreos/etcd:v3.2 etcd $ARGS

This container will exit soon after it starts. Use docker logs etcd to see that it had a panic.
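
To confirm that the exit was this particular panic rather than something else, the error text from the original report can be matched directly. A small sketch: the function name is hypothetical, and the matched text is copied from the panic output at the top of this issue.

```shell
# Reads log text on stdin and succeeds only if it contains the error text
# from the panic in the original report.
has_snapshot_panic() {
  grep -q "snapshot file doesn't exist"
}

# Usage against the repro container (2>&1 because etcd logs to stderr):
#   docker logs etcd 2>&1 | has_snapshot_panic && echo "hit the v3.2 panic"
```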

@gyuho

Member

commented Mar 23, 2018

This panic prevents accidental db file deletion (overwrite) in v3.
I will add an optional flag to allow this upgrade use case.

@raoofm

Contributor

commented Mar 26, 2018

@gyuho I see that you added comments in the upgrade checklist to not upgrade the server unless you migrate v3 data. The comments were added to v3.0, v3.1, v3.2, v3.3, and v3.4.

I think you can remove v3.0 and v3.1 from the list as the bug was introduced in v3.2+

I'm running 3.1.x in prod with v2 data and I think @jlhawn also mentioned it works.

@gyuho

Member

commented Mar 26, 2018

@raoofm Thanks for pointing that out.

Actually, I intentionally added this note to all of them:

> Do not upgrade to newer v3 versions until v3.0 server contains v3 data.

It is a reminder that there should be no upgrade from 3.0 to 3.x without v3 data.

Please let me know if this is still confusing.

@raoofm

Contributor

commented Mar 26, 2018

@gyuho no worries, was just wondering if it panics existing users who already did :) like me.

You can keep as is, thanks 👍

@wsong

commented Mar 26, 2018

So just to be clear, if you start up a 3.0 server, don't make any writes, then stop it and start up a 3.2 server, it's expected that the 3.2 server will panic? In this case, are you supposed to write out a dummy v3 key or something?

@gyuho

This comment has been minimized.

Copy link
Member

commented Mar 26, 2018

Yes. Only an upgrade to 3.2 with no v3 keys will panic (no consistent index has been set). It's not expected, though, but we've decided to keep it as it is (sorry, too late to backport a fix to all 3.x branches), because bypassing it requires too many manual, unsafe operations. The safest workaround is to write some dummy v3 keys.
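
A minimal sketch of that workaround, applied to the Docker repro from earlier in this thread. It assumes the v3.0 or v3.1 container is still running under the name etcd, that etcdctl is on the image's PATH, and that the key name is free to use. The write_dummy_v3_key function, the default key name, and the DOCKER override are made up for illustration.

```shell
# Hypothetical helper: write one placeholder key through the v3 API so the
# v3 backend is no longer in its initial, never-written state.
# DOCKER can be overridden, e.g. DOCKER=echo to print the command instead.
write_dummy_v3_key() {
  key="${1:-upgrade-placeholder}"
  # ETCDCTL_API=3 selects the v3 API; "put KEY VALUE" is the v3 write command.
  ${DOCKER:-docker} exec -e ETCDCTL_API=3 etcd etcdctl put "$key" "1"
}
```

Run it before the final docker rm -f etcd, then start the v3.2 container as in the last repro step.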
