Skip to content

Conversation

@dyrnq
Copy link
Contributor

@dyrnq dyrnq commented Feb 23, 2023

Description

Fixes #8856

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dyrnq dyrnq changed the title fix(consul):fix consul discovery only fetch healthy endpoint fix(consul): fix consul discovery only fetch healthy endpoint Feb 23, 2023
@soulbird
Copy link
Contributor

Please add test cases

" by sub url: ", consul_server.consul_catalog_sub_url,
", got watch result: ", json_delay_encode(catalog_result, true),
", with error: ", catalog_error_info)
-- retry_delay = get_retry_delay(retry_delay)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commented code should be removed. if it should be kept, please give the reason and put the reason in the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed done

local nodes = up_services[service_name]
for _, node in ipairs(result.body) do
local svc_address, svc_port = node.ServiceAddress, node.ServicePort
local svc_address, svc_port = node.Service.Address, node.Service.Port
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check whether node.Service is nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checked done

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 24, 2023

Please add test cases

Considering that this is a fix, there are some test case before this, can we use the previous test case?

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 24, 2023

@soulbird test case added

=== TEST 13: prepare healthy and unhealthy nodes (with consul health check)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just to register or unregister some nodes with consul, so what will happen after using consul service discovery on apisix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I change use consul_dump.t test this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should delete this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete done

local svc_address, svc_port = nil, nil
if node.Service then
svc_address, svc_port = node.Service.Address, node.Service.Port
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if node.Service is nil, it need to CONTINUE, do not insert in the nodes table, the node is empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update done

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 26, 2023

@Fabriceli Help me see what is left unfinished? There are some ci failed.

@Fabriceli
Copy link
Contributor

@Fabriceli Help me see what is left unfinished? There are some ci failed.

Please fix the test cases, there are some test failed. Referenced infos about how to run APISIX test:

  1. APISIX Running tests
  2. How to Use Gitpod to Develop API Gateway?

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

@Fabriceli Help me see what is left unfinished? There are some ci failed.

Please fix the test cases, there are some test failed. Referenced infos about how to run APISIX test:

  1. APISIX Running tests
  2. How to Use Gitpod to Develop API Gateway?

Tks , let me learn and fix.

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

@Fabriceli there are some test use 8600 which is the first consul DNS port, it can not use to another consul instance(8500), i want to update it to 18500.

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

@Fabriceli there are some test use 8600 which is the first consul DNS port, it can not use to another consul instance(8500), i want to update it to 18500.

I see there use docker-compose expose 8600 from 8500 , I will follow this first , there are so many changes.

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

@Fabriceli

I encountered difficulties.

I notice the test ci use docker-compose start consul_1 8500 and consul_2 8600 and then they can not access http://127.0.0.1:30511 which register to consul , the health check always check failed. That will not be expected.

I need use host IP register service replace 127.0.0.1 , so consul (start by docker-compose) can check the register instance IP successfully

@Fabriceli
Copy link
Contributor

@Fabriceli

I encountered difficulties.

I notice the test ci use docker-compose start consul_1 8500 and consul_2 8600 and then they can not access http://127.0.0.1:30511 which register to consul , the health check always check failed. That will not be expected.

I need use host IP register service replace 127.0.0.1 , so consul (start by docker-compose) can check the register instance IP successfully

L26-43: https://github.com/apache/apisix/blob/master/t/discovery/consul_dump.t#L26

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

@Fabriceli
I encountered difficulties.
I notice the test ci use docker-compose start consul_1 8500 and consul_2 8600 and then they can not access http://127.0.0.1:30511 which register to consul , the health check always check failed. That will not be expected.
I need use host IP register service replace 127.0.0.1 , so consul (start by docker-compose) can check the register instance IP successfully

L26-43: https://github.com/apache/apisix/blob/master/t/discovery/consul_dump.t#L26

Yes , I use here code ,I mean this instance reg Address can use 127.0.0.1:30511 but the checks can not use 127.0.0.1:30511
because the consul use docker ,it can`t check 127.0.0.1:30511 , 30511 on host, not in consul container ,check health are always failed, look the check http use http://192.168.6.35:30511 can work fine, but http://127.0.0.1:30511 not working.

"PUT /v1/agent/service/register\n" . "{\"Checks\": [{\"http\": \"http://192.168.6.35:30511\",\"interval\": \"1s\"}],\"ID\":\"service_a1\",\"Name\":\"service_a\",\"Tags\":[\"primary\",\"v1\"],\"Address\":\"127.0.0.1\",\"Port\":30511,\"Meta\":{\"service_a_version\":\"4.0\"},\"EnableTagOverride\":false,\"Weights\":{\"Passing\":10,\"Warning\":1}}",```

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

Or maybe I can use a fake address that can be visited instead?

 \"Checks\": [{\"http\": \"http://baidu.com\"

It can work

TEST 16 and TEST 17 test for filtered healthy instance
@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 27, 2023

@Fabriceli
I have unit test completed on my pc use test case, I have pushed the code. Please help checks.

=== TEST 16: prepare nodes with consul health check
--- config
location /v1/agent {
    proxy_pass http://127.0.0.1:8500;
}
--- request eval
[
    "PUT /v1/agent/service/deregister/service_a1",
    "PUT /v1/agent/service/deregister/service_a2",
    "PUT /v1/agent/service/deregister/service_b1",
    "PUT /v1/agent/service/deregister/service_b2",
    "PUT /v1/agent/service/register\n" . "{\"Checks\": [{\"http\": \"http://baidu.com\",\"interval\": \"1s\"}],\"ID\":\"service_a1\",\"Name\":\"service_a\",\"Tags\":[\"primary\",\"v1\"],\"Address\":\"127.0.0.1\",\"Port\":30511,\"Meta\":{\"service_a_version\":\"4.0\"},\"EnableTagOverride\":false,\"Weights\":{\"Passing\":10,\"Warning\":1}}",
    "PUT /v1/agent/service/register\n" . "{\"Checks\": [{\"http\": \"http://127.0.0.1:8002\",\"interval\": \"1s\"}],\"ID\":\"service_b1\",\"Name\":\"service_b\",\"Tags\":[\"primary\",\"v1\"],\"Address\":\"127.0.0.1\",\"Port\":8002,\"Meta\":{\"service_b_version\":\"4.1\"},\"EnableTagOverride\":false,\"Weights\":{\"Passing\":10,\"Warning\":1}}",
]
--- response_body eval
--- error_code eval
[200, 200, 200, 200, 200, 200]


=== TEST 17: show dump services with consul health check
--- yaml_config
apisix:
  node_listen: 1984
  enable_control: true
discovery:
  consul:
    servers:
      - "http://127.0.0.1:8500"
    dump:
      path: "consul.dump"
      load_on_init: false
--- config
    location /t {
        content_by_lua_block {
            local json = require("toolkit.json")
            local t = require("lib.test_admin")
            ngx.sleep(2)

            local code, body, res = t.test('/v1/discovery/consul/show_dump_file',
                ngx.HTTP_GET)
            local entity = json.decode(res)
            ngx.say(json.encode(entity.services))
        }
    }
--- timeout: 3
--- request
GET /t
--- response_body
{"service_a":[{"host":"127.0.0.1","port":30511,"weight":1}]}
prove -v -I. -Itest-nginx/inc -Itest-nginx/lib -r t/discovery/consul_dump2.t


t/discovery/consul_dump2.t .. 
ok 1 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - status code ok
ok 2 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - pattern "[error]" does not match a line in error.log (req 0)
ok 3 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - status code ok
ok 4 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - pattern "[error]" does not match a line in error.log (req 0)
ok 5 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - status code ok
ok 6 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - pattern "[error]" does not match a line in error.log (req 0)
ok 7 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - status code ok
ok 8 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - pattern "[error]" does not match a line in error.log (req 0)
ok 9 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - status code ok
ok 10 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - pattern "[error]" does not match a line in error.log (req 0)
ok 11 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - status code ok
ok 12 - t/discovery/consul_dump2.t TEST 16: prepare nodes with consul health check - pattern "[error]" does not match a line in error.log (req 0)
ok 13 - t/discovery/consul_dump2.t TEST 17: show dump services with consul health check - status code ok
ok 14 - t/discovery/consul_dump2.t TEST 17: show dump services with consul health check - response_body - response is expected (repeated req 0, req 0)
ok 15 - t/discovery/consul_dump2.t TEST 17: show dump services with consul health check - pattern "[error]" does not match a line in error.log (req 0)
1..15
ok
All tests successful.
Files=1, Tests=15,  3 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.06 csys =  0.36 CPU)
Result: PASS

@Fabriceli
Copy link
Contributor

cc @spacewander

@spacewander
Copy link
Member

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 28, 2023

Could you fix the CI: https://github.com/apache/apisix/actions/runs/4280301682/jobs/7470288647?

I am in process of this.

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 28, 2023

@Fabriceli @spacewander
I have fix , but there has a small discussion and uncertainty about a config of

discovery:
  consul:
    timeout:
      wait: 60

the wait query param in consul need s or m obvious setting,so in code i need add .. "s"

wait = consul_conf.timeout.wait .."s", --blocked wait!=0; unblocked by wait=0
curl -fsSL -i http://127.0.0.1:8500/v1/health/state/any?wait=1s
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
X-Consul-Effective-Consistency: leader
X-Consul-Index: 544
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
Date: Tue, 28 Feb 2023 04:11:07 GMT
Content-Length: 250

[{"Node":"8cd1fbccae10","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":5,"ModifyIndex":5}]


curl -fsSL -i http://127.0.0.1:8500/v1/health/state/any?wait=1
curl: (22) The requested URL returned error: 400 Bad Request

@Fabriceli
Copy link
Contributor

Fabriceli commented Feb 28, 2023

please check it (wait arg) in lua-resty-consul, we use this lib to get infos from Consul

@Fabriceli
Copy link
Contributor

Before push the code, you can run make lint and make license-check in local to check the code and doc whether fit the code style

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 28, 2023

please check it (wait arg) in lua-resty-consul, we use this lib to get infos from Consul

got it

@dyrnq
Copy link
Contributor Author

dyrnq commented Feb 28, 2023

@Fabriceli

Very strange question, the consul http://127.0.0.1:8500/v1/health/state/any update have delay when consul service node changed

when use date && nerdctl stop nginx-1 then v1/health/state/any have 10s delay,
when use date && nerdctl start nginx-1 then v1/health/state/any have 2s delay,

Create this script called consul_check_services_changes.sh

#!/bin/bash
if test ${1:-none} = "none"
then
  echo "USAGE: $0 Consul_URL"
  echo "       Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1
fi
url_to_check=$1

headers=$(mktemp)
content=$(mktemp)
index=0
while true;
do
  url="${url_to_check}?wait=5m&index=${index}&pretty=true&stale"
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { "echo Failed to query ${url}"; exit 1; }
  if test $index -ne 0
  then
    diff -u "$content" "$content.new" && echo " diff: No Differences found in service"
  fi
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  mv "$content.new" "$content"
  printf "X-Consul-Index: $index at $(date) \b"
done
./consul_check_services_changes.sh http://127.0.0.1:8500/v1/health/state/any

If /health/state/any change has deplay it can not use for comparison X-Consul-Index change.I'm looking for a solution continue.

@dyrnq
Copy link
Contributor Author

dyrnq commented Mar 1, 2023

@Fabriceli

After continuous testing, I found problems.

If the consul register and deregister with no health check , then the v1/health/state/any will no change of X-Consul-Index at all.

Be compatible with consul check and no check services ,we need to watch health/state/any Index and catalog/services index both, if any index change go into update process

@Fabriceli
Copy link
Contributor

we need create some tests for these cases

@dyrnq
Copy link
Contributor Author

dyrnq commented Mar 1, 2023

we need create some tests for these cases

I am not sure the resty_consul:new code how block two query API, if one block and wait, how another exec?

now use config keepalive false And remove the -- if current index different last index then update service compare,it will work fine

    fetch_interval: 1
    keepalive: false

That is to say, long pooling connection is not used

@Fabriceli
Copy link
Contributor

we need create some tests for these cases

I am not sure the resty_consul:new code how block two query API, if one block and wait, how another exec?

now use config keepalive false And remove the -- if current index different last index then update service compare,it will work fine

    fetch_interval: 1
    keepalive: false

That is to say, long pooling connection is not used

that will be a BREAKING CHANGE

@dyrnq
Copy link
Contributor Author

dyrnq commented Mar 1, 2023

Yes, that lost keepalive usage.

We need find a compatible mechanism to watch catalog/services AND health/state/any X-Consul-Index both,

I found all of consul APIs, there has no a mixed and integrated API for sensing change both service and health.

@dyrnq
Copy link
Contributor Author

dyrnq commented Mar 2, 2023

Now I use commit code like below can pass test, that is give up pass index parameter to consul , do not use blocking query, it will work fine.

Is there have a better way?

        -- consul_server.index = watch_result.headers['X-Consul-Index']
        -- -- only long connect type use index
        -- if consul_server.keepalive then
        --     consul_server.default_args.index = watch_result.headers['X-Consul-Index']
        -- end

@Fabriceli
Copy link
Contributor

It can create TWO event to watch the services and update the all_services:

flowchart TD
		subgraph Event1: watch catalog
		A[Start] --> catalog{Catalog any change?}
    catalog -->|Yes| health[Query nodes]
    health --> D[Post Event1]
    catalog ---->|No| E[End]
    D --> E
    end
    
    subgraph Event2: watch health
    A2[Start] --> health2{health any change?}
    health2 -->|Yes| catalog2[Query services]
    catalog2 --> H[Query nodes]
    H --> D2[Post Event2]
    health2 ---->|No| E2[End]
    D2 --> E2
    end
    
    subgraph Events register
    A3[Start] --> register[Register Event]
    register --> update[Update all services]
    update --> e[End]
    end
Loading

@dyrnq
Copy link
Contributor Author

dyrnq commented Mar 2, 2023

I don't know much about this mechanism about events,I dare not move that piece of code. I'm afraid to fix more bugs.

Can you join and help finish the work together.

@Fabriceli
Copy link
Contributor

I don't know much about this mechanism about events,I dare not move that piece of code. I'm afraid to fix more bugs.

Can you join and help finish the work together.

I create a new PR to fix this "bug", if you are agree

@dyrnq
Copy link
Contributor Author

dyrnq commented Mar 13, 2023

I don't know much about this mechanism about events,I dare not move that piece of code. I'm afraid to fix more bugs.
Can you join and help finish the work together.

I create a new PR to fix this "bug", if you are agree

Great ! agree with you

@monkeyDluffy6017
Copy link
Contributor

Considered resolved due to #9204

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: As a user, I want to apisix support consul discovery with health API

5 participants