Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change(server-info): use a new approach(keepalive) to report DP info #6202

Merged
merged 23 commits into from
Feb 25, 2022
Merged

change(server-info): use a new approach(keepalive) to report DP info #6202

merged 23 commits into from
Feb 25, 2022

Conversation

zaunist
Copy link
Contributor

@zaunist zaunist commented Jan 25, 2022

What this PR does / why we need it:

fix #6176

Pre-submission checklist:

  • Did you explain what problem does this PR solve? Or what new features have been added?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

@moonming
Copy link
Member

moonming commented Feb 9, 2022

@zaunist any update?

@zaunist
Copy link
Contributor Author

zaunist commented Feb 9, 2022

@zaunist any update?

I have update the plugin code, still need to update test case.

@zaunist zaunist marked this pull request as ready for review February 9, 2022 15:45
@zaunist
Copy link
Contributor Author

zaunist commented Feb 10, 2022

cc @tzssangglass @shuaijinchao. Please take a look, thanks.

@zaunist
Copy link
Contributor Author

zaunist commented Feb 10, 2022

@bisakhmondal @bzp2010

shuaijinchao
shuaijinchao previously approved these changes Feb 10, 2022
Comment on lines 158 to 166
if not newres then
core.log.error("failed to set server_info: ", err)
end

-- set lease_id to ngx dict
local ok, err = internal_status:set("lease_id", newres.body.lease_id)
if not ok then
core.log.error("failed to save boot_time to shdict: ", err)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not continue to set ngx dict if newres fetch fails

local key = "/data_plane/server_info/" .. server_info.id
local ok, err = core.etcd.set(key, server_info, report_ttl)
local res, _ = core.etcd.get(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to judge failure

return
end

local res, _ = core.etcd.get(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -203,8 +236,7 @@ end


function _M.init()
core.log.info("server info: ", core.json.delay_encode(get()))

-- core.log.info("server info: ", core.json.delay_encode(get()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove debug and redundant code

@@ -217,13 +249,14 @@ function _M.init()
return
end

-- local report_ttl = attr and attr.report_ttl or default_report_ttl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@shuaijinchao
Copy link
Member

misoperation approved, please correct the above problems first.

apisix/plugins/server-info.lua Outdated Show resolved Hide resolved
apisix/plugins/server-info.lua Outdated Show resolved Hide resolved
Comment on lines 176 to 189
local ok = core.table.deep_eq(server_info, res.body.node.value)
-- not equal, update it
if not ok then
core.log.error("failed to report server info to etcd: ", err)
local newres, err = core.etcd.set(key, server_info, report_ttl)
if not newres then
core.log.error("failed to set server_info to etcd: ", err)
return false, err
end

-- update lease_id
local _, err = internal_status:set("lease_id", res.body.lease_id)
if err ~= nil then
core.log.error("failed to set lease_id to shdict: ", err)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is same to the one in the if not res.body.node then .... end block, we can create a function to reuse it.

apisix/plugins/server-info.lua Show resolved Hide resolved
end

if not res.body.node then
local newres, err = core.etcd.set(key, server_info, report_ttl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the key for individual instances should be unique but should we use the atomic_set method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@zaunist zaunist requested a review from tokers February 11, 2022 09:02
local server_info, err = get()
if not server_info then
core.log.error("failed to get server_info: ", err)
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, careless.

core.log.error("failed to get server_info from etcd: ", err)
end

if not res.body.node then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not res.body.node then
if not res.body.node then

local ok, err = core.etcd.set(key, server_info, report_ttl)
local res, err = core.etcd.get(key)
if err ~= nil then
core.log.error("failed to get server_info from etcd: ", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err is not nil, why can we go ahead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a apisix node be initialed, the ETCD don't have server-info data, I think the get func will report error, so I continue the code and set the info to ETCD. Maybe use if not res be better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if get method reports error when no data, you should check the error type and decide whether returning or going ahead.

end

if not res.body.node then
local res_new, err = core.etcd.set(key, server_info, report_ttl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic for initial setting can be merged to the below logic (modified_version is zero).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't have two similar code blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set modified_version to nil, I'm not sure if it can be set to nil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already said you can set it to 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated it.

@zaunist zaunist requested a review from tokers February 13, 2022 16:57
@tzssangglass
Copy link
Member

origin issue has only when the apisix node system information changes will there be a put operation. (e.g., hostname changes)

we need to add a test case to cover this?

@zaunist
Copy link
Contributor Author

zaunist commented Feb 14, 2022

Hi, Community.

I have encountered follow problems.

Should I choose to turn on the server-info plugin by default or turn it off by default? If it is turned on by default, I don't know how to deal with the error here, can anyone give me some help?

@zaunist
Copy link
Contributor Author

zaunist commented Feb 14, 2022

origin issue has only when the apisix node system information changes will there be a put operation. (e.g., hostname changes)

we need to add a test case to cover this?

How do I change the hostname in the .t test file? I'm not quite sure how to do it.

@@ -132,51 +125,107 @@ local function get_server_info()
end


local function set(key,value,ttl, modified_index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
local function set(key,value,ttl, modified_index)
local function set(key, value, ttl, modified_index)

return 503, {error_msg = err}
end

lease_id = res_new.body.lease_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check if lease_id is not nil?

Comment on lines 129 to 145
local res_new, err = core.etcd.atomic_set(key, value, ttl, modified_index)
if not res_new then
core.log.error("failed to set server_info: ", err)
return 503, {error_msg = err}
end

lease_id = res_new.body.lease_id

-- set or update lease_id
local ok, err = internal_status:set("lease_id", lease_id)
if not ok then
core.log.error("failed to set lease_id to shdict: ", err)
return 503, {error_msg = err}
end

return 200, nil
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the two steps should be in sync, and we can use lock in the next PR to keep them in sync.

@tzssangglass
Copy link
Member

How do I change the hostname in the .t test file? I'm not quite sure how to do it.

It doesn't have to be hostname, it can be any other action that can be used to punish service info refresh.

@zaunist
Copy link
Contributor Author

zaunist commented Feb 19, 2022

Hi, @tzssangglass . I checked core.id.get() and core.utils.gethostname(). It doesn't support updating ids and hostnames, it only gets them once when apisix starts, so it's not possible to update them in real time in the server-info plugin either. I'm sorry that this doesn't do what the proposal is trying to do. Maybe we can enhance core.id.get and core.utils.hostname to support regularly update id and hostname in a subsequent PR.

Copy link
Member

@nic-chen nic-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we need to test is keepalive instead of getting server info, so we could modify the info stored in shared DICT to test.

local res_new, err = core.etcd.set(key, value, ttl)
if not res_new then
core.log.error("failed to set server_info: ", err)
return 503, {error_msg = err}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should we use 503 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

503 Service Unavailable, I think it is OK to return this error when the plugin cannot provide the service, what do you suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 500 is better. 503 is commonly used for overload.

@zaunist
Copy link
Contributor Author

zaunist commented Feb 19, 2022

What we need to test is keepalive instead of getting server info, so we could modify the info stored in shared DICT to test.

I have tried to change the server-info information in the share dict in the .t test file, but it did not work.

@nic-chen
Copy link
Member

What we need to test is keepalive instead of getting server info, so we could modify the info stored in shared DICT to test.

I have tried to change the server-info information in the share dict in the .t test file, but it did not work.

you could push the test case again to see why it failed.

@tzssangglass
Copy link
Member

Hi, @tzssangglass . I checked core.id.get() and core.utils.gethostname(). It doesn't support updating ids and hostnames, it only gets them once when apisix starts, so it's not possible to update them in real time in the server-info plugin either. I'm sorry that this doesn't do what the proposal is trying to do. Maybe we can enhance core.id.get and core.utils.hostname to support regularly update id and hostname in a subsequent PR.

My mistake, I should have pointed out that the test cases are written in shell script. Let's drop this test case for now and focus on finishing this PR.

@zaunist
Copy link
Contributor Author

zaunist commented Feb 20, 2022

Hi, @tzssangglass . I checked core.id.get() and core.utils.gethostname(). It doesn't support updating ids and hostnames, it only gets them once when apisix starts, so it's not possible to update them in real time in the server-info plugin either. I'm sorry that this doesn't do what the proposal is trying to do. Maybe we can enhance core.id.get and core.utils.hostname to support regularly update id and hostname in a subsequent PR.

My mistake, I should have pointed out that the test cases are written in shell script. Let's drop this test case for now and focus on finishing this PR.

Yes, if we need to get the latest id and hostname etc., then we also need to update the core.id.lua and the code in core.utils.lua to support getting the latest status on a regular basis. My question is, do we have to do these changes in this PR?

@tzssangglass
Copy link
Member

My question is, do we have to do these changes in this PR?

No need to change in this PR.

tzssangglass
tzssangglass previously approved these changes Feb 21, 2022
end

if server_info.etcd_version == "unknown" then
local res, err = core.etcd.server_version()
if not res then
core.log.error("failed to fetch etcd version: ", err)
return
return 500, {error_msg = err}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 503 like other similar cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

lease_id, err = internal_status:get("lease_id")
if not lease_id then
core.log.error("failed to get lease_id from shdict: ", err)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already add return code block

local data, err = core.json.encode(server_info)
if not data then
core.log.error("failed to encode server_info: ", err)
return
return false, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned value is inconsistent?

And it seems the returned value of report is not checked:

report(nil, report_ttl)

@spacewander
Copy link
Member

@zaunist
I am curious why a shorter ttl is chosen?

@zaunist
Copy link
Contributor Author

zaunist commented Feb 22, 2022

@zaunist I am curious why a shorter ttl is chosen?

Hi, @spacewander. Previously, server-info would write a lot of dirty data and cause performance problems with ETCD, but with keepalive, this problem is avoided and a simple http request does not take up any more performance. So a shorter ttl would allow for more timely reporting of Apache APISIX node status. One minute heartbeat is too long in my opinion, so it was changed to 36 seconds, or we could even set a shorter heartbeat, such as nacos, which has a default of 5 seconds.

https://github.com/alibaba/nacos/blob/47d3afd0e4de5019c09a921fff08ae66aed52600/api/src/main/java/com/alibaba/nacos/api/common/Constants.java#L173

@spacewander
Copy link
Member

Yes. But keepalive request is still a write operation. Since this PR is to reduce the usage of etcd, using a lower interval will increase the number of requests to etcd.

One minute heartbeat is too long in my opinion, so it was changed to 36 seconds, or we could even set a shorter heartbeat, such as nacos, which has a default of 5 seconds

Nacos doesn't write the heartbeat to etcd...
Are there any issues relative to the heartbeat interval? We don't want to get in trouble because of an unnecessary change.

@spacewander spacewander changed the title feat: use keepalive in server-info plugin change(server-info): use a new approach(keepalive) to report DP info Feb 23, 2022
@zaunist
Copy link
Contributor Author

zaunist commented Feb 23, 2022

Yes. But keepalive request is still a write operation. Since this PR is to reduce the usage of etcd, using a lower interval will increase the number of requests to etcd.

One minute heartbeat is too long in my opinion, so it was changed to 36 seconds, or we could even set a shorter heartbeat, such as nacos, which has a default of 5 seconds

Nacos doesn't write the heartbeat to etcd... Are there any issues relative to the heartbeat interval? We don't want to get in trouble because of an unnecessary change.

OK, I will change it to 60s.

end

if not res.body.node then
local ok, err = set(key, server_info, report_ttl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange that get returns nil, err while set returns status, table...
Can we use nil, err in both cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

local ok, err = set(key, server_info, report_ttl)
if not ok then
core.log.error("failed to set server_info to etcd: ", err)
return 503, {error_msg = err}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we get a nested err table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I just return it here without the err? That means if there is an error, it will just be printed to error.log

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply.

local ok, err = set(key, server_info, report_ttl)
if not ok then
core.log.error("failed to set server_info to etcd: ", err)
return 503, {error_msg = err}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return directly as we don't (and can't) care about the result of report.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: refactor server-info plugin
7 participants