fix(consul): fix consul discovery only fetch healthy endpoint #8928
Conversation
Please add test cases
apisix/discovery/consul/init.lua
Outdated
| " by sub url: ", consul_server.consul_catalog_sub_url, | ||
| ", got watch result: ", json_delay_encode(catalog_result, true), | ||
| ", with error: ", catalog_error_info) | ||
| -- retry_delay = get_retry_delay(retry_delay) |
The commented-out code should be removed. If it needs to be kept, please give the reason and put it in a comment.
Removed, done.
    local nodes = up_services[service_name]
    for _, node in ipairs(result.body) do
-       local svc_address, svc_port = node.ServiceAddress, node.ServicePort
+       local svc_address, svc_port = node.Service.Address, node.Service.Port
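For context, the field path changes presumably because the node query switches from the Consul catalog endpoint to the health endpoint, which return differently shaped entries. A rough sketch of the two shapes (abbreviated, for illustration only):

```lua
-- /v1/catalog/service/<name> returns flat entries, e.g.
--   { Node = "...", ServiceAddress = "127.0.0.1", ServicePort = 30511, ... }
-- /v1/health/service/<name> returns nested entries plus check results, e.g.
--   { Node = { ... }, Service = { Address = "127.0.0.1", Port = 30511, ... }, Checks = { ... } }
local function service_addr(entry)
    if entry.Service then
        return entry.Service.Address, entry.Service.Port   -- health endpoint shape
    end
    return entry.ServiceAddress, entry.ServicePort          -- catalog endpoint shape
end
```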
We need to check whether node.Service is nil.
Checked, done.
Considering that this is a fix and there are existing test cases before it, can we reuse those previous test cases?
@soulbird Test case added.
t/discovery/consul.t
Outdated
=== TEST 13: prepare healthy and unhealthy nodes (with consul health check)
This just registers or unregisters some nodes with Consul; so what happens after APISIX actually uses Consul service discovery?
I changed it to use consul_dump.t to test this.
So, should we delete this one?
Deleted, done.
    local svc_address, svc_port = nil, nil
    if node.Service then
        svc_address, svc_port = node.Service.Address, node.Service.Port
    end
If node.Service is nil, the loop needs to CONTINUE and not insert anything into the nodes table, because such a node is empty.
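A minimal sketch of the skip the reviewer is asking for (field names follow the diff above; the node-table shape and default weight are assumptions, not the PR's exact code):

```lua
-- Build the node list, skipping entries whose Service block is missing.
local function parse_nodes(body)
    local nodes = {}
    for _, node in ipairs(body) do
        if node.Service and node.Service.Address then
            table.insert(nodes, {
                host = node.Service.Address,
                port = tonumber(node.Service.Port),
                weight = 1,   -- placeholder default weight
            })
        end
        -- entries without node.Service are skipped entirely instead of
        -- being inserted as empty nodes
    end
    return nodes
end
```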
Updated, done.
@Fabriceli Could you help check what is left unfinished? Some CI jobs have failed.
Please fix the test cases; some tests failed. Reference info on how to run the APISIX tests:
Thanks, let me learn and fix them.
@Fabriceli Some tests use 8600, which is the first Consul instance's DNS port, so it cannot be reused for another Consul instance (alongside 8500); I want to update it to 18500.
I see the docker-compose setup exposes 8600 mapped from 8500, so I will follow that first, since there are so many changes.
I encountered difficulties. I notice the test CI uses docker-compose to start consul_1 on 8500 and consul_2 on 8600, and then they cannot be accessed; I need to register the services with the host IP instead.
L26-43: https://github.com/apache/apisix/blob/master/t/discovery/consul_dump.t#L26
Yes, I used the code from there. I mean that the instance's registered Address can be 127.0.0.1:30511, but the health check cannot use 127.0.0.1:30511:
"PUT /v1/agent/service/register\n" . "{\"Checks\": [{\"http\": \"http://192.168.6.35:30511\",\"interval\": \"1s\"}],\"ID\":\"service_a1\",\"Name\":\"service_a\",\"Tags\":[\"primary\",\"v1\"],\"Address\":\"127.0.0.1\",\"Port\":30511,\"Meta\":{\"service_a_version\":\"4.0\"},\"EnableTagOverride\":false,\"Weights\":{\"Passing\":10,\"Warning\":1}}",
Or maybe I can use a fake address that can be reached instead? With \"Checks\": [{\"http\": \"http://baidu.com\"}] it can work.
TEST 16 and TEST 17 below test that only healthy instances are kept.
@Fabriceli
=== TEST 16: prepare nodes with consul health check
--- config
location /v1/agent {
proxy_pass http://127.0.0.1:8500;
}
--- request eval
[
"PUT /v1/agent/service/deregister/service_a1",
"PUT /v1/agent/service/deregister/service_a2",
"PUT /v1/agent/service/deregister/service_b1",
"PUT /v1/agent/service/deregister/service_b2",
"PUT /v1/agent/service/register\n" . "{\"Checks\": [{\"http\": \"http://baidu.com\",\"interval\": \"1s\"}],\"ID\":\"service_a1\",\"Name\":\"service_a\",\"Tags\":[\"primary\",\"v1\"],\"Address\":\"127.0.0.1\",\"Port\":30511,\"Meta\":{\"service_a_version\":\"4.0\"},\"EnableTagOverride\":false,\"Weights\":{\"Passing\":10,\"Warning\":1}}",
"PUT /v1/agent/service/register\n" . "{\"Checks\": [{\"http\": \"http://127.0.0.1:8002\",\"interval\": \"1s\"}],\"ID\":\"service_b1\",\"Name\":\"service_b\",\"Tags\":[\"primary\",\"v1\"],\"Address\":\"127.0.0.1\",\"Port\":8002,\"Meta\":{\"service_b_version\":\"4.1\"},\"EnableTagOverride\":false,\"Weights\":{\"Passing\":10,\"Warning\":1}}",
]
--- response_body eval
--- error_code eval
[200, 200, 200, 200, 200, 200]
=== TEST 17: show dump services with consul health check
--- yaml_config
apisix:
node_listen: 1984
enable_control: true
discovery:
consul:
servers:
- "http://127.0.0.1:8500"
dump:
path: "consul.dump"
load_on_init: false
--- config
location /t {
content_by_lua_block {
local json = require("toolkit.json")
local t = require("lib.test_admin")
ngx.sleep(2)
local code, body, res = t.test('/v1/discovery/consul/show_dump_file',
ngx.HTTP_GET)
local entity = json.decode(res)
ngx.say(json.encode(entity.services))
}
}
--- timeout: 3
--- request
GET /t
--- response_body
{"service_a":[{"host":"127.0.0.1","port":30511,"weight":1}]} |
|
cc @spacewander
Could you fix the CI: https://github.com/apache/apisix/actions/runs/4280301682/jobs/7470288647?
I am in the process of doing that.
@Fabriceli @spacewander With a config like:
discovery:
  consul:
    timeout:
      wait: 60
the wait query param in Consul needs an explicit s or m unit, so in the code I need to append "s":
wait = consul_conf.timeout.wait .. "s",  -- blocking when wait != 0; unblocked by wait = 0
curl -fsSL -i http://127.0.0.1:8500/v1/health/state/any?wait=1s
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
X-Consul-Effective-Consistency: leader
X-Consul-Index: 544
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
Date: Tue, 28 Feb 2023 04:11:07 GMT
Content-Length: 250
[{"Node":"8cd1fbccae10","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":5,"ModifyIndex":5}]

curl -fsSL -i http://127.0.0.1:8500/v1/health/state/any?wait=1
curl: (22) The requested URL returned error: 400 Bad Request
Please check how the wait arg is handled in lua-resty-consul; we use this lib to fetch info from Consul.
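For reference, a minimal sketch of how the suffixed wait value could be passed through lua-resty-consul (host, port, timeouts and the exact call site are illustrative assumptions, not the PR's actual wiring):

```lua
local resty_consul = require("resty.consul")

local consul_conf = { timeout = { wait = 60 } }  -- value from the yaml config above

local client = resty_consul:new({
    host = "127.0.0.1",
    port = 8500,
    read_timeout = (consul_conf.timeout.wait + 10) * 1000,  -- must outlive the blocking wait
})

-- Consul rejects a bare number ("wait=60" -> 400 Bad Request), so append the unit.
local watch_params = {
    wait = consul_conf.timeout.wait .. "s",  -- blocking query; wait=0 disables blocking
    index = 0,                               -- last seen X-Consul-Index
}

local res, err = client:get("/health/state/any", watch_params)
if res then
    ngx.log(ngx.INFO, "X-Consul-Index: ", res.headers["X-Consul-Index"])
end
```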
Before pushing the code, you can run
Got it.
A very strange issue: the Consul endpoint http://127.0.0.1:8500/v1/health/state/any updates with a delay when a Consul service node changes. When I run date && nerdctl stop nginx-1, /v1/health/state/any shows about a 10s delay. I created this script, consul_check_services_changes.sh:

#!/bin/bash
if test ${1:-none} = "none"
then
  echo "USAGE: $0 Consul_URL"
  echo " Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1
fi
url_to_check=$1
headers=$(mktemp)
content=$(mktemp)
index=0
while true;
do
  url="${url_to_check}?wait=5m&index=${index}&pretty=true&stale"
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { echo "Failed to query ${url}"; exit 1; }
  if test $index -ne 0
  then
    diff -u "$content" "$content.new" && echo " diff: No Differences found in service"
  fi
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  mv "$content.new" "$content"
  printf "X-Consul-Index: $index at $(date) \n"
done

and ran ./consul_check_services_changes.sh http://127.0.0.1:8500/v1/health/state/any. If the /health/state/any change has a delay, its X-Consul-Index cannot be used to detect changes. I'm still looking for a solution.
After continuous testing, I found a problem. If a Consul service is registered and deregistered with no health check, then the X-Consul-Index of /v1/health/state/any does not change at all. To be compatible with both checked and unchecked services, we need to watch the /health/state/any index and the /catalog/services index together; if either index changes, we enter the update process.
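An illustrative sketch of that dual-index check (the polling structure and names here are assumptions, not the PR's actual code):

```lua
local resty_consul = require("resty.consul")

local last_index = { health = 0, catalog = 0 }

-- Returns true when either X-Consul-Index has moved since the last call.
local function services_changed()
    local client = resty_consul:new({ host = "127.0.0.1", port = 8500 })

    local health_res = client:get("/health/state/any")
    local catalog_res = client:get("/catalog/services")
    if not health_res or not catalog_res then
        return false
    end

    local health_idx = tonumber(health_res.headers["X-Consul-Index"]) or 0
    local catalog_idx = tonumber(catalog_res.headers["X-Consul-Index"]) or 0

    local changed = health_idx ~= last_index.health
                    or catalog_idx ~= last_index.catalog
    last_index.health, last_index.catalog = health_idx, catalog_idx
    return changed
end
```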
We need to create some tests for these cases.
I am not sure how the resty_consul:new client can block on two query APIs: if one blocks and waits, how does the other one run? For now I set keepalive false in the config and removed the "if the current index differs from the last index, then update the service" comparison, and it works fine. That is to say, long-polling connections are not used.
That will be a BREAKING CHANGE.
Yes, that loses the keepalive usage. We need to find a compatible mechanism to watch both the catalog/services and health/state/any X-Consul-Index. I went through all the Consul APIs, and there is no single integrated API that senses changes in both services and health.
Now the committed code below can pass the tests; that is, I gave up passing the index parameter to Consul and do not use blocking queries, and it works fine. Is there a better way?
-- consul_server.index = watch_result.headers['X-Consul-Index']
-- -- only long connect type use index
-- if consul_server.keepalive then
--     consul_server.default_args.index = watch_result.headers['X-Consul-Index']
-- end
It can create TWO events to watch the services and trigger updates:
flowchart TD
subgraph Event1: watch catalog
A[Start] --> catalog{Catalog any change?}
catalog -->|Yes| health[Query nodes]
health --> D[Post Event1]
catalog ---->|No| E[End]
D --> E
end
subgraph Event2: watch health
A2[Start] --> health2{health any change?}
health2 -->|Yes| catalog2[Query services]
catalog2 --> H[Query nodes]
H --> D2[Post Event2]
health2 ---->|No| E2[End]
D2 --> E2
end
subgraph Events register
A3[Start] --> register[Register Event]
register --> update[Update all services]
update --> e[End]
end
I don't know much about this event mechanism, so I dare not touch that piece of code; I'm afraid of introducing more bugs. Can you join in and help finish the work together?
I will create a new PR to fix this "bug", if you agree.
Great! I agree with you.
Considered resolved by #9204.
Description
Fixes #8856
Checklist