
etcdserver: call refreshRangePermCache on Recover() in AuthStore. #14574

Merged (2 commits), Oct 27, 2022

Conversation

@veshij (Contributor) commented Oct 11, 2022

When a new node joins the cluster and applies a snapshot, the auth cache is not updated.
On an auth-enabled cluster this causes all requests to fail with "user doesn't exist".
It also leads to data inconsistencies in the cluster, since updates can't be applied from the raft log.
See discussion in #14571.

Fixes: #14571
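
For readers skimming the thread, here is a minimal, self-contained sketch of the shape of the fix named in the title. This is not the actual etcd code: Backend, Users, and the cache layout are simplified stand-ins. The point is that when Recover() swaps in the backend restored from a snapshot, the in-memory range permission cache has to be rebuilt from that backend; otherwise lookups keep hitting the stale cache and authenticated requests fail with "user doesn't exist".

package auth

import "sync"

// Backend is a simplified stand-in for etcd's bolt-backed auth storage; after
// a snapshot is applied, Recover receives the freshly restored backend.
type Backend interface {
	Users() []string // hypothetical helper returning the persisted user names
}

type authStore struct {
	mu             sync.Mutex
	be             Backend
	rangePermCache map[string]struct{} // simplified per-user permission cache
}

// Recover is invoked after a snapshot is applied. The essence of this PR is
// the cache refresh: without it the store keeps serving the stale in-memory
// cache and authenticated requests fail with "user doesn't exist".
func (as *authStore) Recover(be Backend) {
	as.mu.Lock()
	defer as.mu.Unlock()
	as.be = be
	as.refreshRangePermCache()
}

// refreshRangePermCache rebuilds the cache from what is persisted in the backend.
func (as *authStore) refreshRangePermCache() {
	users := as.be.Users()
	cache := make(map[string]struct{}, len(users))
	for _, user := range users {
		cache[user] = struct{}{}
	}
	as.rangePermCache = cache
}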

@ahrtr (Member) commented Oct 12, 2022

At least an e2e test case needs to be added to verify this case.

The current e2e test framework doesn't support dynamically adding a new member. Please wait for #14404 to be merged first. Afterwards, you can add an e2e test case for this PR.

@ahrtr (Member) commented Oct 16, 2022

Please add an e2e test case, following the same approach as https://github.com/etcd-io/etcd/blob/main/tests/e2e/etcd_grpcproxy_test.go

@veshij (Contributor Author) commented Oct 16, 2022 via email

@veshij (Contributor Author) commented Oct 19, 2022

Status update: I'm working on the e2e test, but can't really reproduce the issue so far.

@veshij (Contributor Author) commented Oct 20, 2022

Updated the PR with e2e test.

@codecov-commenter commented Oct 20, 2022

Codecov Report

Merging #14574 (3f13cc8) into main (c45f338) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main   #14574   +/-   ##
=======================================
  Coverage   75.47%   75.47%           
=======================================
  Files         457      457           
  Lines       37274    37276    +2     
=======================================
+ Hits        28133    28135    +2     
+ Misses       7373     7367    -6     
- Partials     1768     1774    +6     
Flag Coverage Δ
all 75.47% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
server/auth/range_perm_cache.go 70.37% <100.00%> (+0.37%) ⬆️
server/auth/store.go 85.38% <100.00%> (+1.17%) ⬆️
client/pkg/v3/fileutil/purge.go 68.85% <0.00%> (-6.56%) ⬇️
server/proxy/grpcproxy/watch.go 92.48% <0.00%> (-4.05%) ⬇️
server/etcdserver/api/v3rpc/auth.go 80.45% <0.00%> (-2.30%) ⬇️
server/etcdserver/api/v3election/election.go 68.18% <0.00%> (-2.28%) ⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go 69.56% <0.00%> (-1.74%) ⬇️
server/etcdserver/api/v3rpc/watch.go 83.49% <0.00%> (-1.27%) ⬇️
client/v3/maintenance.go 59.37% <0.00%> (-1.25%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 76.56% <0.00%> (-1.05%) ⬇️
... and 12 more


@veshij (Contributor Author) commented Oct 25, 2022

@ahrtr I rebased the PR, could you take a look, please?

@ahrtr (Member) commented Oct 26, 2022

Please rebase this PR again.

@veshij (Contributor Author) commented Oct 26, 2022

> Please rebase this PR again.

done

@mitake (Contributor) commented Oct 26, 2022

@veshij also, could you squash the test-related commits into a single commit?

@veshij force-pushed the main branch 2 times, most recently from 0438192 to 392c288 on October 27, 2022 00:53
@veshij (Contributor Author) commented Oct 27, 2022

> Please resolve comment #14574 (comment)

done

@veshij force-pushed the main branch 2 times, most recently from 494ba7d to 4e5e74f on October 27, 2022 18:03
Signed-off-by: Oleg Guba <oleg@dropbox.com>
Comment on lines +75 to +90
t.Logf("not exactly 2 hashkv responses returned: %d", len(hashKvs))
return false
Member:

It should be an error or fatal error, and fail the test immediately.

Contributor Author:

If a node returns an error (which sometimes happens on the first request), the e2e hashkv implementation returns a single entry without an explicit error.
Unfortunately I'm not ready to troubleshoot that further as part of this PR; we've already spent quite a bit of time finding issues with the e2e test implementations.

Contributor Author:

Maybe the return code is not checked around here:

func NewExpectWithEnv(name string, args []string, env []string, serverProcessConfigName string) (ep *ExpectProcess, err error) {
	ep = &ExpectProcess{
		cfg: expectConfig{
			name: serverProcessConfigName,
			cmd:  name,
			args: args,
			env:  env,
		},
	}
	ep.cmd = commandFromConfig(ep.cfg)
	if ep.fpty, err = pty.Start(ep.cmd); err != nil {
		return nil, err
	}
	ep.wg.Add(1)
	go ep.read()
	return ep, nil
}

Member:

You already considered that case above, on lines 71 and 72.

Contributor Author:

Lines 71...72 will be triggered if HashKV() returns an error, but it does not return one; it just returns a truncated response.

Member:

It seems that we might need to enhance (*ExpectProcess) ExpectFunc: when the process exits with a non-zero error code, we need to return an error.

Please feel free to investigate & enhance it in a separate PR. Let's update this PR per https://github.com/etcd-io/etcd/pull/14574/files#r1007400980

I expect all methods of EtcdctlV3, including HashKV, to return an error when the etcdctl command exits with an error code.
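
As an illustration of the enhancement being suggested (a hypothetical helper, not the actual ExpectProcess or EtcdctlV3 code): surface a non-zero exit code as an explicit error instead of silently returning partial output.

package e2e

import (
	"errors"
	"fmt"
	"os/exec"
)

// runEtcdctl is a hypothetical helper showing the desired behavior: if the
// etcdctl command exits with a non-zero code, return an explicit error rather
// than whatever truncated output the process produced.
func runEtcdctl(args ...string) (string, error) {
	out, err := exec.Command("etcdctl", args...).CombinedOutput()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			return "", fmt.Errorf("etcdctl exited with code %d: %s", exitErr.ExitCode(), out)
		}
		return "", err
	}
	return string(out), nil
}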

@veshij (Contributor Author), Oct 27, 2022:

I'm not sure I'll have time to work on the ExpectProcess fixes; I don't know how many tests would fail if we change the behavior, or what else would need to be updated.
I'll be happy to rebase this PR and update the test to fail explicitly once the e2e environment is fixed; I'll leave it as is for now.

Member:

Tracked in #14638.

Contributor Author:

I might take a look later this week, but I can't promise.

Signed-off-by: Oleg Guba <oleg@dropbox.com>
@ahrtr (Member) left a comment:

LGTM

Great work. Thank you @veshij

Could you backport this PR to 3.5 and 3.4?

@veshij (Contributor Author) commented Oct 27, 2022

> LGTM
>
> Great work. Thank you @veshij
>
> Could you backport this PR to 3.5 and 3.4?

yep, will do this week.

@ahrtr merged commit 1570dc9 into etcd-io:main on Oct 27, 2022
@mitake (Contributor) commented Oct 28, 2022

Thanks a lot for fixing the issue @veshij !

@veshij (Contributor Author) commented Oct 28, 2022

@ahrtr I started looking into backporting to 3.4/3.5. It looks like the e2e test environment has been significantly updated since then, and those changes were not backported.

Is there a plan to backport the e2e changes? I can backport the fix itself without touching the tests, though.

@ahrtr (Member) commented Oct 28, 2022

> @ahrtr I started looking into backporting to 3.4/3.5. It looks like the e2e test environment has been significantly updated since then, and those changes were not backported.
>
> Is there a plan to backport the e2e changes? I can backport the fix itself without touching the tests, though.

Yes, we need to backport the e2e test first. @biosvs do you have bandwidth to backport #14589 to 3.5 and probably 3.4?

@biosvs (Contributor) commented Oct 29, 2022

@ahrtr When I worked on backporting my tests, it was impossible to backport the changes to the testing framework itself due to the huge number of changes in the e2e framework between 3.5 and the main branch.

It's definitely easier to use raw calls, as I did in #14397.

These calls could be collected into some "library" package, I guess, but I'll only have time to try it in a week (i.e. after next week).

@ahrtr (Member) commented Oct 29, 2022

> @ahrtr When I worked on backporting my tests, it was impossible to backport the changes to the testing framework itself due to the huge number of changes in the e2e framework between 3.5 and the main branch.
>
> It's definitely easier to use raw calls, as I did in #14397.
>
> These calls could be collected into some "library" package, I guess, but I'll only have time to try it in a week (i.e. after next week).

Thanks for the feedback. Let me take care of it.

Comment on lines +67 to +70
// start second process
if err := epc.StartNewProc(ctx, t, rootUserClientOpts); err != nil {
	t.Fatalf("could not start second etcd process (%v)", err)
}
@chaochn47 (Member), Nov 4, 2022:

This won't trigger transferring a db snapshot to the new member and restoring the auth store from the backend. Confirmed by local testing.

By default, SnapshotCatchUpEntries is 5k.

// SnapshotCatchUpEntries is the number of entries for a slow follower
// to catch-up after compacting the raft storage entries.
// We expect the follower has a millisecond level latency with the leader.
// The max throughput is around 10K. Keep a 5K entries is enough for helping
// follower to catch up.
// WARNING: only change this for tests. Always use "DefaultSnapshotCatchUpEntries"
SnapshotCatchUpEntries uint64

E2E tests don't have the ability to tune this hard-coded value, while integration tests can. For example:

func TestV3CorruptAlarmWithLeaseCorrupted(t *testing.T) {
	integration.BeforeTest(t)
	lg := zaptest.NewLogger(t)
	clus := integration.NewCluster(t, &integration.ClusterConfig{
		CorruptCheckTime:           time.Second,
		Size:                       3,
		SnapshotCount:              10,
		SnapshotCatchUpEntries:     5,
		DisableStrictReconfigCheck: true,
	})

So I doubt this test really protects against the issue happening again, although the fix makes sense to me. @veshij @ahrtr

Member:

Good catch! The reason why we can successfully & manually reproduce & verify this is that the snapi is usually < 5K, it always compacts all entries starting from index 2.

compacti := uint64(1)
if snapi > s.Cfg.SnapshotCatchUpEntries {
	compacti = snapi - s.Cfg.SnapshotCatchUpEntries
}
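
To make the arithmetic concrete (the numbers below are illustrative, not taken from the PR): with a small snapshot index the condition above is false, so compacti stays at 1, only entry 1 is compacted, and a brand-new member, which needs entries from index 1, must be sent a snapshot.

package main

import "fmt"

func main() {
	snapi := uint64(48)            // illustrative snapshot index for a small test cluster
	catchUpEntries := uint64(5000) // DefaultSnapshotCatchUpEntries

	compacti := uint64(1)
	if snapi > catchUpEntries {
		compacti = snapi - catchUpEntries
	}
	// compacti stays 1: entry 1 is compacted, the retained log starts at index 2,
	// so a freshly added member cannot be caught up from the log alone and must
	// be sent a snapshot.
	fmt.Println("compact up to index:", compacti)
}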

Two points:

  1. This might not be correct; we shouldn't compact the log at all if snapi < 5K. Of course, it isn't a big deal.
  2. In an e2e or integration test case, the snapi should usually be < 5K, so the test case should still be valid. But I agree that we should correct it. So let's write more than 5K K/Vs to make it correct and stable.

@chaochn47 @veshij Please feel free to deliver a PR to address it for both main and release-3.5. Let's fix point 2 first, and probably point 1 separately.

Contributor Author:

Just to confirm: we only need to increase the number of keys written to 5K (>= SnapshotCatchUpEntries)?

Member:

YES
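
A rough sketch of the agreed change (the helper name and client wiring are hypothetical; the real test lives in tests/e2e/ctl_v3_auth_cluster_test.go): write more keys than SnapshotCatchUpEntries before the new member joins, so the leader has to send a db snapshot rather than replaying the raft log.

package e2e

import (
	"context"
	"fmt"
	"testing"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// populateKeys is a hypothetical helper: it writes more keys than
// DefaultSnapshotCatchUpEntries (5000), so that a member added afterwards has
// to receive a db snapshot from the leader instead of replaying the raft log.
func populateKeys(ctx context.Context, t *testing.T, cli *clientv3.Client) {
	const writes = 5100 // > DefaultSnapshotCatchUpEntries
	for i := 0; i < writes; i++ {
		key := fmt.Sprintf("key-%d", i)
		if _, err := cli.Put(ctx, key, "value"); err != nil {
			t.Fatalf("failed to put %s: %v", key, err)
		}
	}
}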

@chaochn47 (Member), Nov 4, 2022:

> The reason why we can successfully & manually reproduce & verify this is that the snapi is usually < 5K, it always compacts all entries starting from index 2.

I am a little bit confused by this statement. How does this test reproduce the scenario where a db snapshot is transferred to the new member? @ahrtr

It looks like the raft log won't be compacted, except for the very first entry.

etcd/raft/storage.go, lines 227 to 248 at 0dfd726:

// Compact discards all log entries prior to compactIndex.
// It is the application's responsibility to not attempt to compact an index
// greater than raftLog.applied.
func (ms *MemoryStorage) Compact(compactIndex uint64) error {
	ms.Lock()
	defer ms.Unlock()
	offset := ms.ents[0].Index
	if compactIndex <= offset {
		return ErrCompacted
	}
	if compactIndex > ms.lastIndex() {
		getLogger().Panicf("compact %d is out of bound lastindex(%d)", compactIndex, ms.lastIndex())
	}
	i := compactIndex - offset
	ents := make([]pb.Entry, 1, 1+uint64(len(ms.ents))-i)
	ents[0].Index = ms.ents[i].Index
	ents[0].Term = ms.ents[i].Term
	ents = append(ents, ms.ents[i+1:]...)
	ms.ents = ents
	return nil
}

When a new member bootstraps, the leader will replicate all the entries to the new member instead of transferring a snapshot, right?

Contributor Author:

I obviously don't fully understand the replication behavior here, but what I noticed while writing the test is that I wasn't able to trigger the issue without SnapshotCount set to some low value. Based on the logs, it looked like the raft log was applied instead of a snapshot.

Member:

Correction to my previous typo: all entries starting from index 2 will be retained. Since the first entry is compacted, when a new member is added the leader will still send a snapshot to the new member.

So let's add the following two cases:

  1. keep the current case;
  2. write more than 5K K/Vs, e.g. 5100, and keep everything else unchanged.

> This won't trigger transferring a db snapshot to the new member and restoring the auth store from the backend. Confirmed by local testing.

I am curious how you confirmed that?

@chaochn47 (Member), Nov 4, 2022:

Thanks for the explanation!! I just verified again that a new member will always receive a snapshot from the leader whenever the configured snapshot count is less than the number of KV writes.

/home/chaochn/go/src/go.etcd.io/etcd/bin/etcd (TestAuthCluster-test-0) (23533): 
{
	"level": "info",
	"ts": "2022-11-04T06:05:50.932Z",
	"caller": "etcdserver/snapshot_merge.go:66",
	"msg": "sent database snapshot to writer",
	"bytes": 20480,
	"size": "20 kB"
}
/home/chaochn/go/src/go.etcd.io/etcd/bin/etcd (TestAuthCluster-test-1) (23773): 
{
	"level": "info",
	"ts": "2022-11-04T06:05:50.932Z",
	"caller": "rafthttp/http.go:257",
	"msg": "receiving database snapshot",
	"local-member-id": "4e6825c690b0cb86",
	"remote-snapshot-sender-id": "ca50e9357181d758",
	"incoming-snapshot-index": 48,
	"incoming-snapshot-message-size-bytes": 7723,
	"incoming-snapshot-message-size": "7.7 kB"
}

I must have been confusing an existing member restart with a new member joining the cluster.

The former requires at least 5k KV writes to trigger a db snapshot.

Sorry for the confusion; it all makes sense to me now!

tjungblu pushed a commit to tjungblu/etcd that referenced this pull request Jul 26, 2023
…d-io#14574

Signed-off-by: Oleg Guba <oleg@dropbox.com>
Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
Successfully merging this pull request may close these issues:

3.5.5: client auth failures on new member first startup