
mvcc/backend: Fix corruption bug in defrag #11613

Merged (1 commit) Feb 13, 2020

Conversation

@jpbetz (Contributor) commented Feb 12, 2020

If etcd is terminated during a defrag operation, the db.tmp file that defrag creates can be orphaned. If this happens, the next defragmentation operation will open the orphaned db.tmp instead of creating an empty db.tmp file and starting with a fresh slate, as it should.

Once the defragmentation operation opens db.tmp, it traverses all key-values in the main db file and writes them to db.tmp. Any key-values already in db.tmp that are not overwritten by this copy remain there, corrupting the boltdb keyspace. When the defragmentation operation completes successfully, db.tmp replaces db via a file move, and the main db file is now corrupt.
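
For illustration, here is a minimal sketch of the copy-loop shape that makes this dangerous (a hedged sketch assuming bbolt's bucket API; not etcd's actual defragdb code). The loop only ever writes into the destination and never deletes from it, so stale keys already present in the temp file survive the copy:

    import bolt "go.etcd.io/bbolt"

    // copyAll copies every key-value in a bucket from src to dst.
    // If dst's file was not empty to begin with, keys the source no
    // longer contains are never removed here and outlive the copy.
    func copyAll(srcTx, dstTx *bolt.Tx, bucket []byte) error {
        src := srcTx.Bucket(bucket)
        dst, err := dstTx.CreateBucketIfNotExists(bucket)
        if err != nil {
            return err
        }
        return src.ForEach(func(k, v []byte) error {
            return dst.Put(k, v) // insert/overwrite only
        })
    }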

Impact:

  • The etcd keyspace can be corrupted; see #11613 (comment).
  • A deleted user or role can reappear (on one etcd member but not the others).
  • An etcd member can try to connect to a member that was previously removed and is no longer considered part of the cluster.
  • An expired lease can reappear. (For this to happen, the member that the defrag happens on must also become the leader.)

See #11613 (comment) and #11613 (comment) for examples of how to reproduce the issue.

There is a narrow window, between when the bolt db transaction that populates db.tmp is committed and when db.tmp is moved to replace the db file, during which etcd must be terminated to trigger this.

The fix is simple: ensure the temporary file used for defragmentation is always a new, empty file.
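
As a rough illustration of that idea (a hedged sketch, not necessarily the exact code merged here), the temp file can be created with O_EXCL after removing any leftover:

    import "os"

    // freshTempFile guarantees the defrag temp file starts out empty,
    // even if a previous defrag crashed and orphaned one. Hypothetical
    // helper for illustration only.
    func freshTempFile(dbp string) (*os.File, error) {
        tdbp := dbp + ".tmp"
        // Drop any orphaned temp file from an interrupted defrag.
        if err := os.Remove(tdbp); err != nil && !os.IsNotExist(err) {
            return nil, err
        }
        // O_EXCL ensures we created this file ourselves, so it is empty.
        return os.OpenFile(tdbp, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0600)
    }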

cc @wenjiaswe @jingyih @YoyinZyc @gyuho

@codecov-io commented Feb 12, 2020

Codecov Report

Merging #11613 into master will decrease coverage by 0.55%.
The diff coverage is 50%.


@@            Coverage Diff             @@
##           master   #11613      +/-   ##
==========================================
- Coverage   66.56%      66%   -0.56%     
==========================================
  Files         403      403              
  Lines       36630    37165     +535     
==========================================
+ Hits        24381    24529     +148     
- Misses      10768    11144     +376     
- Partials     1481     1492      +11
Impacted Files Coverage Δ
etcdserver/api/snap/snapshotter.go 66.93% <30%> (-3.24%) ⬇️
mvcc/backend/backend.go 80.95% <70%> (-0.26%) ⬇️
auth/options.go 37.5% <0%> (-55%) ⬇️
pkg/transport/timeout_conn.go 80% <0%> (-20%) ⬇️
auth/store.go 58.02% <0%> (-17.38%) ⬇️
client/client.go 73.52% <0%> (-10.46%) ⬇️
clientv3/leasing/util.go 91.66% <0%> (-6.67%) ⬇️
clientv3/namespace/watch.go 87.87% <0%> (-6.07%) ⬇️
etcdserver/api/v3rpc/watch.go 82.45% <0%> (-2.46%) ⬇️
etcdserver/v3_server.go 72.92% <0%> (-1.75%) ⬇️
... and 35 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61f2794...213f7f7.

@gyuho (Contributor) commented Feb 12, 2020

Good catch!

Panicking right before the "rename" confirms that these things can actually happen.

Can we also add fail points? Something like:

        }
+
+       // gofail: var defragBeforeRename struct{}
        err = os.Rename(tdbp, dbp)
        if err != nil {
                if b.lg != nil {

Thanks!

/cc @xiang90
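
For context, gofail failpoints like the suggested defragBeforeRename are compiled into real code with gofail enable and can then be armed when the process starts, roughly like this (a hedged sketch; the exact failpoint path is an assumption):

    gofail enable ./mvcc/backend
    GOFAIL_FAILPOINTS='mvcc/backend/defragBeforeRename=panic("defrag")' ./bin/etcd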

@jingyih (Contributor) left a comment

Thanks for catching this!

> This might also be able to corrupt the keyspace, but I haven't been able to find any way to make that happen (in boltdb, the etcd keyspace is keyed by <revision, key> pairs, and any data older than the "last compacted" revision is ignored, so data accidentally included from a previous defrag shouldn't matter).

I agree for the key bucket. Data accidentally included from an orphaned file from a previous defrag will be deleted in the next compaction.

For other buckets, such as member and lease, might it cause damage?

@wenjiaswe (Contributor) commented Feb 12, 2020

Thanks Joe! Good catch. Let's make sure we backport this to 3.2, 3.3 and 3.4, and add a changelog note.

@jpbetz (Contributor, Author) commented Feb 12, 2020

> I agree for the key bucket. Data accidentally included from an orphaned file from a previous defrag will be deleted in the next compaction.

Yeah, sounds like you have the same understanding as I do. If this were to somehow modify the keyspace, the only way I can imagine it happening would be: a lease that had previously expired reappears due to this issue and somehow deletes keys it wasn't supposed to when it expires again, modifying the keyspace. Hopefully this is not possible?

> For other buckets, such as member and lease, might it cause damage?

It can; I've been able to reproduce it on my local machine. It's pretty easy, the steps are basically:

# put a panic before os.Rename() in defrag() in mvcc/backend/backend.go
goreman -f Procfile start
# create any leases, members, users... that you'd like to experiment with, e.g.:
etcdctl user add example:pass
etcdctl defrag
# etcd on 2379 will panic and orphan db.tmp
# ctrl-c to stop goreman
# remove the panic in defrag()
goreman -f Procfile start
# delete any leases, members, users... that you're experimenting with
etcdctl defrag
etcdctl user list
# example will appear in results, but just for 2379

@gyuho (Contributor) left a comment
lgtm, /cc @wenjiaswe @jingyih

Let's backport this

thx!

@jingyih (Contributor) left a comment

LGTM. Thanks!

@xiang90 (Contributor) commented Feb 12, 2020

lgtm

@jingyih (Contributor) commented Feb 12, 2020

> If this were to somehow modify the keyspace, the only way I can imagine it happening would be: a lease that had previously expired reappears due to this issue and somehow deletes keys it wasn't supposed to when it expires again, modifying the keyspace. Hopefully this is not possible?

Good catch. This sounds possible but unlikely to me.

  1. The lessor keeps a heap of leases in memory, from which it periodically pops the expired leases and deletes them in bbolt (see the sketch after this list).
  2. The above happens only if the node is the leader.
  3. When a node restarts, the lessor recovers its leases from bbolt.
  4. A restarted node is unlikely to become the leader of the cluster.
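
For readers unfamiliar with the lessor, here is a deliberately simplified, hypothetical sketch of the heap-based expiry flow in point 1 (not etcd's actual lessor code):

    package main

    import (
        "container/heap"
        "fmt"
        "time"
    )

    type lease struct {
        id     int64
        expiry time.Time
    }

    // leaseHeap is a min-heap of leases ordered by expiry time.
    type leaseHeap []lease

    func (h leaseHeap) Len() int            { return len(h) }
    func (h leaseHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
    func (h leaseHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
    func (h *leaseHeap) Push(x interface{}) { *h = append(*h, x.(lease)) }
    func (h *leaseHeap) Pop() interface{} {
        old := *h
        l := old[len(old)-1]
        *h = old[:len(old)-1]
        return l
    }

    // popExpired pops every lease whose expiry has passed. A real lessor
    // would then delete each popped lease (and its attached keys) from
    // bbolt, and would only do this on the leader.
    func popExpired(h *leaseHeap, now time.Time) []lease {
        var expired []lease
        for h.Len() > 0 && (*h)[0].expiry.Before(now) {
            expired = append(expired, heap.Pop(h).(lease))
        }
        return expired
    }

    func main() {
        h := &leaseHeap{}
        heap.Push(h, lease{id: 1, expiry: time.Now().Add(-time.Second)})
        heap.Push(h, lease{id: 2, expiry: time.Now().Add(time.Hour)})
        fmt.Println(popExpired(h, time.Now())) // only the expired lease 1 pops
    }

Since only the leader runs this expiry pass, a lease that reappears in bbolt via an orphaned db.tmp does damage only if that member also becomes leader, matching the impact note above.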

@jpbetz (Contributor, Author) commented Feb 12, 2020

OK. This can corrupt the keyspace. Here's how to reproduce:

# put a panic before os.Rename() in defrag() in mvcc/backend/backend.go to simulate etcd being terminated
goreman -f Procfile start # start a 3 member cluster

etcdctl put a 1
etcdctl defrag

# etcd on 2379 will panic and orphan db.tmp
# ctrl-c to stop goreman and the other 2 etcd members (they could be left running too, if you want)
# remove the panic in defrag()
goreman -f Procfile start # to start etcd member again

etcdctl del a
etcdctl put b 2
etcdctl get b -w json # to get the latest revision
etcdctl compact <latest-revision>
etcdctl defrag

# ctrl-c to stop the instances again; this forces the in-memory index to be rebuilt
goreman -f Procfile start # to start etcd member again

etcdctl get a
a
1
# the deleted key "a" has reappeared

Thanks to @lavalamp for suggesting this approach.

@jpbetz changed the title: mvcc/backend: Delete orphaned db.tmp files before defrag → mvcc/backend: Delete orphaned db.tmp files before defrag to prevent corruption (Feb 12, 2020)
@jpbetz force-pushed the fix-defrag-orphan-file branch 3 times, most recently from f970561 to 9cb6fd5 (Feb 13, 2020)
@jpbetz changed the title: mvcc/backend: Delete orphaned db.tmp files before defrag to prevent corruption → mvcc/backend: Force defrag always start with an empty temp file to prevent corruption (Feb 13, 2020)
@jpbetz changed the title: mvcc/backend: Force defrag always start with an empty temp file to prevent corruption → mvcc/backend: Force defrag to always start with an empty temp file to prevent corruption (Feb 13, 2020)
@jpbetz changed the title: mvcc/backend: Force defrag to always start with an empty temp file to prevent corruption → mvcc/backend: Fix corruption bug in defrag (Feb 13, 2020)
@jingyih (Contributor) commented Feb 13, 2020

Great work Joe!

So in your example, the deleted key a is brought back into the persistent B+ tree by the orphaned db.tmp file, and it is further brought back into the in-memory index tree after a server restart.

@jpbetz (Contributor, Author) commented Feb 13, 2020

> So in your example, the deleted key a is brought back into the persistent B+ tree by the orphaned db.tmp file, and it is further brought back into the in-memory index tree after a server restart.

Yes. And note that once a bad db.tmp is written, it's just a matter of time before a busy, long-lived etcd cluster is defragmented again and, after that, eventually restarted. So for this bug, the only really unlikely event is that etcd is terminated abnormally after db.tmp is written to but before it is renamed to db; everything else is quite probable.

@jpbetz merged commit 8bf7b2b into etcd-io:master on Feb 13, 2020
@wenjiaswe added a commit referencing this pull request (Feb 13, 2020): Automated cherry pick of #11613 to release-3.3
@wenjiaswe added a commit referencing this pull request (Feb 13, 2020): Automated cherry pick of #11613 to release-3.2
@jpbetz added a commit referencing this pull request (Feb 13, 2020): Automated cherry pick of #11613 to release-3.4
@jpbetz added a commit referencing this pull request (Feb 13, 2020): changelog: Add #11613 backport to 3.2, 3.3 and 3.4 changelogs
@andyliuliming (Contributor) commented

@gyuho @wenjiaswe may I ask when the new etcd version containing this fix will be released?

@wenjiaswe (Contributor) commented

I would like to have that release too! Also cc @hexfusion

@YoyinZyc (Contributor) commented

I would like a new 3.4 release including this backport. Thanks.

@socketpair commented

So, what versions contain this fix?

@wenjiaswe (Contributor) commented Oct 16, 2020

@socketpair It's in 3.2.29+, 3.3.19+ and 3.4.4+.
