
feat: add etcd store auto re-initialize #2650

Merged: 20 commits into apache:master, Oct 29, 2022

Conversation

bzp2010 (Contributor) commented Oct 24, 2022

Please answer these questions before submitting a pull request, or your PR will get closed.

Why submit this pull request?

  • Bugfix
  • New feature provided
  • Improve performance
  • Backport patches

What changes does this PR introduce?

Resynchronize when the etcd local cache does not match the data source.

Related issues

Fixes #2461, #2360, and other instances of the same problem.

Checklist:

  • Did you explain what problem this PR solves, or what new features it adds?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

bzp2010 self-assigned this on Oct 24, 2022
codecov-commenter commented Oct 24, 2022

Codecov Report

Merging #2650 (92d65ca) into master (831df82) will increase coverage by 0.24%.
The diff coverage is 69.38%.

@@            Coverage Diff             @@
##           master    #2650      +/-   ##
==========================================
+ Coverage   74.06%   74.30%   +0.24%     
==========================================
  Files         201      201              
  Lines        7781     7823      +42     
  Branches      872      872              
==========================================
+ Hits         5763     5813      +50     
+ Misses       1723     1707      -16     
- Partials      295      303       +8     
Flag                     Coverage Δ
backend-e2e-test-ginkgo  65.34% <57.14%> (+0.50%) ⬆️
backend-unit-test        50.07% <76.00%> (+0.08%) ⬆️
frontend-e2e-test        76.49% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                      Coverage Δ
api/internal/core/server/store.go   25.00% <0.00%> (ø)
api/internal/core/storage/etcd.go   54.38% <44.44%> (+4.77%) ⬆️
api/internal/utils/closer.go        71.42% <50.00%> (+38.09%) ⬆️
api/cmd/root.go                     69.56% <62.16%> (+7.06%) ⬆️
api/internal/core/store/store.go    89.47% <83.33%> (+0.41%) ⬆️
api/internal/core/server/server.go  62.50% <0.00%> (+8.33%) ⬆️


bzp2010 marked this pull request as ready for review on October 27, 2022 03:03
bzp2010 (Contributor, Author) commented Oct 27, 2022

Updated

  • Update the etcd dependencies to v3.5.5 (previously still v3.3.x)
  • Add a passive liveness check for the etcd watch (in many tests I have not observed the watch actively exiting on connection exceptions, and I am not sure it works reliably, so an additional active check mechanism is included)
  • Add an active connection check for all etcd streams
  • Add a CLI test for the active check
  • Improve some code

The active check issues an etcd sync request every 10 seconds (normally this call just refreshes the cluster endpoints). If the request does not complete within 5 seconds, I treat it as a connection failure; if it fails several times in a row, I assume the watch has most likely hung without exiting, and I re-run the initialization, i.e. list + watch. A rough sketch of this loop follows.
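For reference, a minimal, hypothetical sketch of the check loop described above, assuming the etcd clientv3 API. The names maxFailures and reinit() are illustrative rather than the PR's actual identifiers; the two log messages are the ones the CLI test below greps for.

```go
// Hypothetical sketch of the periodic active check; not the PR's exact code.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func activeCheck(etcdClient *clientv3.Client, reinit func() error) {
	const maxFailures = 3 // assumed threshold for "multiple failures"
	failures := 0

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Sync normally refreshes the cluster endpoints; here it also
		// serves as a liveness probe with a 5-second deadline.
		ctx, cancel := context.WithTimeout(context.TODO(), 5*time.Second)
		err := etcdClient.Sync(ctx)
		cancel()

		if err != nil {
			failures++
			log.Printf("etcd sync failed (%d consecutive): %v", failures, err)
			continue
		}

		if failures >= maxFailures {
			// The watch has probably hung silently: redo list + watch
			// now that the connection is back.
			log.Println("etcd connection recovered")
			log.Println("etcd store reinitializing")
			if rerr := reinit(); rerr != nil {
				log.Printf("reinitialization failed: %v", rerr)
			}
		}
		failures = 0
	}
}
```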

Review thread on api/cmd/root.go (outdated)
select {
case <-time.Tick(10 * time.Second):
    sCtx, sCancel := context.WithTimeout(context.TODO(), 5*time.Second)
    err := etcdClient.Sync(sCtx)
nic-chen (Member) commented

Oh, I think this is unnecessary. The client has a retry mechanism, so even if the connection is interrupted during operation, reusing the client recovers automatically; only the watch cannot recover automatically in some cases.

bzp2010 (Contributor, Author) commented Oct 27, 2022
@nic-chen

The connection can be restored automatically, but the behavior of the watch is unobservable: sometimes the watch stops working yet never actively exits, and there is no API to find out which watchers are currently alive or what state they are in.
This is a complete black box for its users. The etcd documentation says that connections are self-healing and that a watch will not lose any data, but losing watch events does happen in our environment.

I think this check is really necessary. If you feel this approach is too heavy, we can use a cheaper alternative to sync requests: the etcd client's internal connection state monitor, client.getActiveConnection(), which does not initiate a request but still exposes the connection state (see the sketch below).

IMHO, given this uncertain behavior in the etcd client (it really is a black box), where all the gRPC stream details are hidden and the documentation does not match the actual behavior, an active health-checking mechanism independent of the client's internal machinery is a must.
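For comparison, a hypothetical sketch of that cheaper alternative, using clientv3's ActiveConnection() method (the accessor referred to above as getActiveConnection) together with gRPC's connectivity states. onDisconnect() is an illustrative callback, not code from this PR.

```go
// Hypothetical sketch: observe the gRPC connection state instead of probing.
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"google.golang.org/grpc/connectivity"
)

func monitorConnState(ctx context.Context, c *clientv3.Client, onDisconnect func()) {
	conn := c.ActiveConnection() // underlying *grpc.ClientConn
	for {
		state := conn.GetState()
		if state == connectivity.TransientFailure || state == connectivity.Shutdown {
			log.Printf("etcd connection state: %s", state)
			onDisconnect()
		}
		// Block until the state changes (or the context is cancelled);
		// no request is sent, we only read the client's own view.
		if !conn.WaitForStateChange(ctx, state) {
			return // context cancelled
		}
	}
}
```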

bzp2010 requested a review from nic-chen on October 27, 2022 03:57
bzp2010 (Contributor, Author) commented Oct 28, 2022

ping @nic-chen @starsz again

Two further review threads on api/cmd/root.go (outdated, resolved)
# wait long enough for the 10-second health check to detect the outage and recover
sleep 20

# the log should record at least one recovery and one re-initialization
[ "$(grep -c "etcd connection recovered" ${LOG_FILE})" -ge '1' ]
[ "$(grep -c "etcd store reinitializing" ${LOG_FILE})" -ge '1' ]
A reviewer (Contributor) commented

It would be better to add an e2e test to ensure that no data is lost.

bzp2010 (Contributor, Author) commented Oct 28, 2022
In theory that cannot happen, because every connection exception triggers a re-initialization of the entire abstract storage layer: the list is reloaded in full and a new watcher is created. As soon as the full load succeeds, existing cache entries are overwritten with the fresh data and anything missing is added (see the sketch below).
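A hypothetical sketch of that list + watch re-initialization, assuming a simple in-memory cache keyed by etcd key. The cache type and reinit method are illustrative, not the Dashboard's actual store implementation.

```go
// Hypothetical sketch: a full List refreshes the local cache, then a new
// watcher continues from the snapshot's revision so no change is missed.
package main

import (
	"context"
	"sync"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type cache struct {
	mu   sync.RWMutex
	data map[string]string
}

func (c *cache) reinit(ctx context.Context, cli *clientv3.Client, prefix string) (clientv3.WatchChan, error) {
	// Full list: the authoritative snapshot of the data source.
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return nil, err
	}

	// Overwrite existing entries and add anything missing.
	fresh := make(map[string]string, len(resp.Kvs))
	for _, kv := range resp.Kvs {
		fresh[string(kv.Key)] = string(kv.Value)
	}
	c.mu.Lock()
	c.data = fresh
	c.mu.Unlock()

	// New watcher, starting just after the snapshot's revision.
	wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(resp.Header.Revision+1))
	return wch, nil
}
```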

bzp2010 (Contributor, Author) commented

If the load fails, an error is written to the console and the log.

bzp2010 (Contributor, Author) commented

Also, the timer will still detect the failure in that case and re-run the initialization once the connection recovers.

Co-authored-by: Peter Zhu <starszcan@gmail.com>
bzp2010 requested a review from starsz on October 28, 2022 09:11
nic-chen (Member) left a comment

I checked out your branch and tried to reproduce the issue, but could not.
I think we could merge it first.
Before that, it is recommended to run it alongside APISIX for a long time to make sure it does not cause other hidden issues. @bzp2010

bzp2010 (Contributor, Author) commented Oct 29, 2022

@nic-chen Thanks for your support. I have added a new case to the CLI test to ensure that data changed during a connection outage is resynchronized after the store is reinitialized. We can merge this and continue to watch the impact of the fix.

Co-authored-by: Peter Zhu <starszcan@gmail.com>
bzp2010 merged commit f64372f into apache:master on Oct 29, 2022
bzp2010 added a commit to bzp2010/apisix-dashboard that referenced this pull request Oct 31, 2022
hongbinhsu added a commit to fitphp/apix-dashboard that referenced this pull request Nov 7, 2022
* upstream/master:
  fix: change default CSP value (apache#2601)
  fix: ant-table unable to request (apache#2641)
  fix: plugin_config missing on service exist (apache#2657)
  feat: add etcd store auto re-initialize (apache#2650)
  feat: add login filter of OpenID-Connect (apache#2608)
  feat:Configure plug-ins to support this feature (apache#2647)
  feat: Adding a Loading state to buttons (apache#2630)
  feat: dashboard support windows (apache#2619)
  Feat: add tip and preset model for plugin editor, improve e2e stability (apache#2581)
  docs: add Slack invitation link badge (apache#2617)

# Conflicts:
#	.github/workflows/backend-cli-test.yml
#	Dockerfile
#	api/test/shell/cli_test.sh
#	web/src/components/Footer/index.tsx
#	web/src/components/RightContent/index.tsx
#	web/src/pages/ServerInfo/List.tsx
Successfully merging this pull request may close this issue: cache failed to update when etcd restarted

4 participants