fix: keep healthcheck target state when upstream changes #10312

@monkeyDluffy6017 monkeyDluffy6017 commented Oct 10, 2023

Description

When the upstream is modified (nodes added or deleted), APISIX first deletes the health check object and then recreates it. Deleting and recreating the checker can affect parallel requests that are looking up healthy nodes, leading to the following error:
image
To avoid this, we replace checker:clear() with checker:delayed_clear(). This function marks all targets for removal but does not actually remove them. If any of them is re-added before the delay expires, it is unmarked for removal.
This makes it possible to keep target state across config changes, where targets might be removed and then re-added.
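
The mark-then-purge behavior described above can be pictured with a small toy model. This is only a sketch in plain Lua, not the health check library's code: the `new_checker` constructor, the `purge` method, the `now` arguments, and the `purge_time` and `healthy` fields are all hypothetical names chosen for illustration.

```lua
-- Toy model of "mark for removal, purge later". NOT the real health checker;
-- every name below except delayed_clear is a hypothetical stand-in.
local Checker = {}
Checker.__index = Checker

local function new_checker()
    return setmetatable({ targets = {} }, Checker)
end

local function target_key(ip, port)
    return ip .. ":" .. port
end

-- Add a target, or un-mark it if it is pending removal (its state is kept).
function Checker:add_target(ip, port)
    local k = target_key(ip, port)
    local t = self.targets[k]
    if t then
        t.purge_time = nil
        return t
    end
    t = { ip = ip, port = port, healthy = true }
    self.targets[k] = t
    return t
end

-- Mark every target for removal at now + delay; nothing is removed yet.
function Checker:delayed_clear(delay, now)
    for _, t in pairs(self.targets) do
        t.purge_time = now + delay
    end
end

-- Drop targets whose purge time has passed (a periodic timer would call this).
function Checker:purge(now)
    for k, t in pairs(self.targets) do
        if t.purge_time and t.purge_time <= now then
            self.targets[k] = nil
        end
    end
end

-- Usage: an upstream update at t=0 re-adds one node and drops another.
local c = new_checker()
local t8080 = c:add_target("127.0.0.1", 8080)
t8080.healthy = false              -- state learned from earlier probes
c:add_target("127.0.0.1", 8081)
c:delayed_clear(10, 0)             -- mark both targets, remove neither
c:add_target("127.0.0.1", 8080)    -- new config still contains this node
c:purge(11)                        -- delay has expired
```

In this model the re-added target keeps its learned `healthy = false` state, while the target that was not re-added disappears only after the delay, so a concurrent lookup between the clear and the re-add still finds the existing targets.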

It's very hard to add test cases. I tried but failed.

Follow these steps to reproduce:

  1. Create a route
curl http://127.0.0.1:9180/apisix/admin/routes/1 -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "uri": "/get",
    "upstream_id": 1
}'
  2. Keep updating the upstream object
curl http://127.0.0.1:9180/apisix/admin/upstreams/1 \
-H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "type": "chash",
    "key": "remote_addr",
    "nodes": {
        "127.0.0.1:8080": 1,
        "127.0.0.1:8081": 1,
        "127.0.0.1:8082": 1,
        "127.0.0.1:8083": 1,
        "127.0.0.1:8084": 1,
        "127.0.0.1:8085": 1,
        "127.0.0.1:8086": 1,
        "127.0.0.1:8087": 1,
        "127.0.0.1:8088": 1,
        "127.0.0.1:8089": 1,
        "127.0.0.1:8090": 1
    },
    "retries": 2,
    "checks": {
        "active": {
            "timeout": 5,
            "http_path": "/status",
            "healthy": {
                "interval": 2,
                "successes": 1
            },
            "unhealthy": {
                "interval": 1,
                "http_failures": 2
            }
        }
    }
}'
  3. Keep sending requests (the -R option requires wrk2)
wrk -c 10 -t 5 -d 500s -R 200 http://127.0.0.1:9080/get
  4. Pay attention to the error.log

Fixes # (issue)

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@@ -145,7 +145,7 @@ nginx_config: # Config for render the template to generate n
# effective if the master process runs with super-user privileges.
error_log: logs/error.log # Location of the error log.
error_log_level: warn # Logging level: info, debug, notice, warn, error, crit, alert, or emerg.
-worker_processes: auto # Automatically determine the optimal number of worker processes based
+worker_processes: 4 # Automatically determine the optimal number of worker processes based

Why do you change this?

@monkeyDluffy6017

Fixed

@shreemaan-abhishek

Keep updating the upstream object

Do I need to add/remove nodes from upstream frequently?

@monkeyDluffy6017

Do I need to add/remove nodes from upstream frequently?

Yes, but repeatedly setting the same upstream also works.

@@ -114,7 +114,7 @@ function _M.fire_all_clean_handlers(item)
clean_handler.f(item)
end

-item.clean_handlers = nil
+item.clean_handlers = {}
@monkeyDluffy6017

If the clean handler is registered on the parent object (the route), setting clean_handlers to nil causes an error like this:

2023/10/10 20:53:16 [error] 1057074#1057074: *2020 lua entry thread aborted: runtime error: /home/liuwei/api7/apisix/apisix/core/config_util.lua:61: attempt to index field 'clean_handlers' (a nil value)
stack traceback:
coroutine 0:
	/home/liuwei/api7/apisix/apisix/core/config_util.lua: in function 'add_clean_handler'
	/home/liuwei/api7/apisix/apisix/upstream.lua:152: in function 'fetch_healthchecker'
	/home/liuwei/api7/apisix/apisix/upstream.lua:342: in function 'set_upstream'
	/home/liuwei/api7/apisix/apisix/init.lua:549: in function 'handle_upstream'
	/home/liuwei/api7/apisix/apisix/init.lua:730: in function 'http_access_phase'
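
The failure mode in the trace above can be reproduced with simplified stand-ins for the two helpers. This is a sketch, not the actual code in apisix/core/config_util.lua: the function bodies are reduced to the essentials, and the `reset_value` parameter exists only in this demo so that both the old (`nil`) and new (`{}`) reset can be compared side by side.

```lua
-- Simplified stand-ins for the helpers named in the stack trace; the real
-- functions in apisix/core/config_util.lua may differ in detail.
local function add_clean_handler(item, func)
    -- Raises an "attempt to index" error if item.clean_handlers is nil.
    local id = (item.clean_handlers._id or 0) + 1
    item.clean_handlers._id = id
    table.insert(item.clean_handlers, { f = func, id = id })
    return id
end

local function fire_all_clean_handlers(item, reset_value)
    for _, handler in ipairs(item.clean_handlers or {}) do
        handler.f(item)
    end
    -- reset_value is a demo-only parameter: nil models the old reset,
    -- an empty table models the reset after this change.
    item.clean_handlers = reset_value
end

-- Old behavior: after firing, the item can no longer accept handlers.
local old_item = { clean_handlers = {} }
fire_all_clean_handlers(old_item, nil)
local ok_old = pcall(add_clean_handler, old_item, function() end)

-- New behavior: the item stays usable for later registrations.
local new_item = { clean_handlers = {} }
fire_all_clean_handlers(new_item, {})
local ok_new = pcall(add_clean_handler, new_item, function() end)
```

Here `ok_old` is false because the second registration indexes a nil field, while `ok_new` is true because the empty table keeps the item ready for a handler added after the previous ones have fired.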

@shreemaan-abhishek

@monkeyDluffy6017 I get this error when I run the wrk command that you shared:

image

@monkeyDluffy6017

@shreemaan-abhishek You need to use wrk2 or remove the -R option

@shreemaan-abhishek

I got the following error when I ran the commands you mentioned:

2023/10/12 15:37:06 [error] 65019#1620486: *4429371 lua entry thread aborted: runtime error: ...abhishek/Desktop/repos/xisipa/apisix/apisix/balancer.lua:126: attempt to index local 'picker' (a number value)
stack traceback:
coroutine 0:
	...abhishek/Desktop/repos/xisipa/apisix/apisix/balancer.lua: in function 'create_obj_fun'
	...hek/Desktop/repos/xisipa/apisix/apisix/core/lrucache.lua:95: in function 'lrucache_server_picker'
	...abhishek/Desktop/repos/xisipa/apisix/apisix/balancer.lua:247: in function 'pick_server'
	...aan-abhishek/Desktop/repos/xisipa/apisix/apisix/init.lua:555: in function 'handle_upstream'
	...aan-abhishek/Desktop/repos/xisipa/apisix/apisix/init.lua:730: in function 'http_access_phase'
	access_by_lua(nginx.conf:303):2: in main chunk, client: 127.0.0.1, server: _, request: "GET /get HTTP/1.1", host: "127.0.0.1:9080"

I hope this is the same expected error.

@monkeyDluffy6017

@shreemaan-abhishek I don't know how you got this error; it has nothing to do with my changes.

@monkeyDluffy6017 monkeyDluffy6017 merged commit 6fa5a89 into apache:master Oct 13, 2023
38 checks passed