
bug: Etcd jwt refresh error can cause gateway cluster crash. #2899

Closed
gy09535 opened this issue Nov 30, 2020 · 26 comments

@gy09535 (Contributor) commented Nov 30, 2020

Issue description

Today I found some auth errors from etcd. I connected to etcd to investigate and found that a large number of auth requests were causing etcd API timeouts. Here are my etcd logs:

2020-11-30 19:55:01.150181 W | auth: invalid auth token: SDZBJcAaVedJqUFY.116415
2020-11-30 19:55:01.226868 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
2020-11-30 19:55:01.238848 W | auth: invalid auth token: ArWxwPXGQRclPdZp.114301
2020-11-30 19:55:01.254654 W | auth: invalid auth token: WuqYtmtIDCGftGwr.114315
2020-11-30 19:55:01.283579 W | auth: invalid auth token: CqdVqeYChuiEWgWh.114324
2020-11-30 19:55:01.320021 W | auth: invalid auth token: YHlQCFrmBrGsHXXV.116397
2020-11-30 19:55:01.320712 W | auth: invalid auth token: YyOUXXpgpqQyNryf.114384
2020-11-30 19:55:01.359229 W | auth: invalid auth token: YXSKKPuaGObcCrFZ.114346
2020-11-30 19:55:01.367728 W | auth: invalid auth token: HFqBwCvzPfTRVxih.116421
2020-11-30 19:55:01.389121 W | auth: invalid auth token: KGVyAgnXWOgFDFYl.114375
2020-11-30 19:55:01.408058 W | auth: invalid auth token: EGSajOVclqMhLIqS.114317
2020-11-30 19:55:01.424942 W | auth: invalid auth token: qtXATtyhLZJSGIfB.116433
2020-11-30 19:55:01.480119 W | auth: invalid auth token: tqHxDVbjSKBVCDaS.114331
2020-11-30 19:55:01.485725 W | auth: invalid auth token: siQhqOvdtRUXsHiW.116425
2020-11-30 19:55:01.521545 W | auth: invalid auth token: xVPFwHcEMXScHhGZ.114338
2020-11-30 19:55:01.529845 W | auth: invalid auth token: vqtBsVIGKtNCjUpb.114336
2020-11-30 19:55:01.541498 W | auth: invalid auth token: XFqNUGjGYYsHhpWI.114345
2020-11-30 19:55:01.580206 W | auth: invalid auth token: CiBhQUGpwiUPczNo.114314
2020-11-30 19:55:01.580919 W | auth: invalid auth token: JVaWKqKTDxnhwZig.114351
2020-11-30 19:55:01.600957 W | auth: invalid auth token: OIdYBazTuoTeRyhV.114320
2020-11-30 19:55:01.628087 W | auth: invalid auth token: bSuTMTUaydeOSFPw.114311
2020-11-30 19:55:01.650066 W | auth: invalid auth token: CUImlKqMxLllTDAG.116422
2020-11-30 19:55:01.671630 W | auth: invalid auth token: MckpSvnXCslOBZzN.114350
2020-11-30 19:55:01.730228 W | auth: invalid auth token: ZoCzVsZVnxKkXgkh.114323
2020-11-30 19:55:01.809137 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
2020-11-30 19:55:01.840882 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
2020-11-30 19:55:01.892522 W | auth: invalid auth token: mprbhPNbjByDExCW.116398
2020-11-30 19:55:01.973658 W | auth: invalid auth token: MMKVaKJYLCPjtYra.114335
2020-11-30 19:55:02.055462 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
2020-11-30 19:55:02.313522 W | auth: invalid auth token: NcKCHbtvAvQTJWap.116409
2020-11-30 19:55:02.348031 W | auth: invalid auth token: iQkxDrTcVsWzzBwz.114330
2020-11-30 19:55:02.588947 W | auth: invalid auth token: THlTKQvjLFueVrsC.116408
2020-11-30 19:55:02.620312 W | auth: invalid auth token: zEoNFjjLmZLFLujv.114328
2020-11-30 19:55:02.637622 W | auth: invalid auth token: jmyLhQNCVaHCuESg.114312
2020-11-30 19:55:02.760728 W | auth: invalid auth token: bloWjYzYVchKqYCb.114376

I captured packets from the etcd server and found these packets:

[screenshot: captured etcd auth packets]

After checking the APISIX code, I found that this code can cause an auth dead loop when the JWT refresh fails:

function refresh_jwt_token(self)
    -- token exist and not expire
    -- default is 5min, we use 3min
    -- https://github.com/etcd-io/etcd/issues/8287
    if self.jwt_token and now() - self.last_auth_time < 60 * 3 then
        return true, nil
    end

    local opts = {
        body = {
            name         = self.user,
            password     = self.password,
        }
    }
    local res, err = _request_uri(self, 'POST',
                        choose_endpoint(self).full_prefix .. "/auth/authenticate",
                        opts, 5, true)    -- default authenticate timeout 5 second
    if err then
        return nil, err
    end

    if not res or not res.body or not res.body.token then
        return nil, 'authenticate refresh token fail'
    end

    self.jwt_token = res.body.token
    self.last_auth_time = now()

    return true, nil
end

ref: https://github.com/api7/lua-resty-etcd/blob/master/lib/resty/etcd/v3.lua#L221
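
To make the suspicion concrete: last_auth_time is only set on success, so after a failed refresh the next call re-authenticates immediately, and every sync retry becomes another auth request. A minimal sketch of a backoff (the last_auth_fail_time field is something I made up for illustration; it is not in the library):

function refresh_jwt_token(self)
    -- token exists and has not expired (default TTL is 5 min, refresh after 3 min)
    if self.jwt_token and now() - self.last_auth_time < 60 * 3 then
        return true, nil
    end

    -- hypothetical: remember the last failure and wait before re-authenticating,
    -- so a broken credential does not turn every sync retry into a new auth request
    if self.last_auth_fail_time and now() - self.last_auth_fail_time < 10 then
        return nil, "previous authenticate failed, waiting before retry"
    end

    local opts = {
        body = {
            name     = self.user,
            password = self.password,
        }
    }
    local res, err = _request_uri(self, 'POST',
                        choose_endpoint(self).full_prefix .. "/auth/authenticate",
                        opts, 5, true)    -- default authenticate timeout 5 seconds
    if err or not res or not res.body or not res.body.token then
        self.last_auth_fail_time = now()    -- hypothetical field, see above
        return nil, err or 'authenticate refresh token fail'
    end

    self.jwt_token = res.body.token
    self.last_auth_time = now()

    return true, nil
end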

Environment

  • apisix version (cmd: apisix version):
  • OS:

Minimal test code / Steps to reproduce the issue

What's the actual result? (including assertion message & call stack if applicable)

What's the expected result?

@membphis (Member)

What is the error log at APISIX?

We can sleep longer if refreshing the JWT auth fails.
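
For example (just a sketch against the refresh_jwt_token shown above; the 3-second value is only illustrative):

local res, err = _request_uri(self, 'POST',
                    choose_endpoint(self).full_prefix .. "/auth/authenticate",
                    opts, 5, true)
if err then
    ngx.sleep(3)    -- back off before returning, so the next retry is not immediate
    return nil, err
end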

@membphis (Member)

@Yiyiyimu @nic-chen welcome to take a look at this issue

@gy09535 (Contributor, Author) commented Nov 30, 2020

What is the error log at APISIX?

We can sleep longer if refreshing the JWT auth fails.

APISIX reports this error; I think it is because etcd is too busy to handle the requests:

2020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/proto, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/global_rules, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/services, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/ssl, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/plugin_metadata, context: ngx.timer

This is very weird. The auth credentials are correct, so what triggers the JWT refresh so frequently?

moonming added this to the 2.2 milestone Nov 30, 2020
@gy09535 (Contributor, Author) commented Dec 1, 2020

@Yiyiyimu @nic-chen welcome to take a look at this issue

I am not sure whether different instances using one account can cause an etcd auth problem?

@gy09535 (Contributor, Author) commented Dec 1, 2020

@Yiyiyimu @nic-chen welcome to take a look at this issue

I am not sure whether different instances using one account can cause an etcd auth problem?

I read the etcd docs: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/

There are two kinds of token types: simple and JWT. The simple token isn’t designed for production use cases. Its tokens aren’t cryptographically signed and servers must statefully track token-user correspondence; it is meant for development testing. JWT tokens should be used for production deployments since it is cryptographically signed and verified. From the implementation perspective, JWT is stateless. Its token can include metadata including username and revision, so servers don’t need to remember correspondence between tokens and the metadata.

JWT is stateless and it can authenticate every request, so this is not the problem.

@spacewander (Member)

After checking the APISIX code, I found that this code can cause an auth dead loop when the JWT refresh fails.

I don't think there is a dead loop. If a dead loop existed, the interval between requests wouldn't be tens of milliseconds; it would be much shorter. Can you provide a standalone example to prove your suspicion?

@spacewander (Member)

@gy09535
Can you provide a packet capture file so we can analyze it?

@gy09535 (Contributor, Author) commented Dec 1, 2020

@gy09535
Can you provide a packet capture file so we can analyze it?

tcp_etcd_01.cap.zip

etcd version: 3.4.13
etcd roles: root, gateway
gateway role permission: /apisix/*, /routes
etcd user: apisix (role: gateway)

We use one user across 3 APISIX instances and sometimes hit this issue. I am not sure what triggers it; there is no obvious error in the log file. I think the trigger is some API being called too frequently, such as the watcher. So changing the auth logic could avoid it. I am trying to prove it.

@spacewander (Member)

@gy09535
How many workers per APISIX instance?

@gy09535 (Contributor, Author) commented Dec 1, 2020

@gy09535
How many workers per APISIX instance?

4

@spacewander (Member)

Some interesting results:

  1. Every auth request is terminated after 5 seconds, because 5 is hardcoded as the timeout value:
    https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
  2. The requests come in batches. We can add some random jitter to smooth out the auth requests (a rough sketch follows at the end of this comment).

Some strange results:

  1. The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.
  2. Even under high pressure, a few requests should still succeed. However, all requests failed. I suspect etcd was hung for a while.
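
For the jitter, roughly something like this (only a sketch against the refresh_jwt_token shown above; refresh_interval is a field name I made up):

-- when a token is obtained, pick a slightly different refresh interval each time,
-- so workers started together do not all re-authenticate in the same second
self.jwt_token = res.body.token
self.last_auth_time = now()
self.refresh_interval = 60 * 3 + math.random(30)    -- 3 min plus up to 30 s of jitter

-- and at the top of refresh_jwt_token():
if self.jwt_token and now() - self.last_auth_time < (self.refresh_interval or 60 * 3) then
    return true, nil
end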

@gy09535 (Contributor, Author) commented Dec 2, 2020

Some interesting results:

  1. Every auth request is terminated after 5 seconds, because 5 is hardcoded as the timeout value:
    https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
  2. The requests come in batches. We can add some random jitter to smooth out the auth requests.

Some strange results:

  1. The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.
  2. Even under high pressure, a few requests should still succeed. However, all requests failed. I suspect etcd was hung for a while.

Did you find the caller that is making the requests so frequently? Anyway, we should control the auth failure handling; we cannot allow requests to authenticate again and again, and we must protect etcd from receiving too many requests.

Some special condition causes it, not every time. When it happened, I restarted the gateway, the auth became normal, and the etcd requests returned to normal.

@spacewander (Member)

Did you find the caller that is making the requests so frequently?

As I said, I can't reproduce it.

The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.

You need to provide the backtrace.

@gy09535 (Contributor, Author) commented Dec 2, 2020

Did you find the caller that is making the requests so frequently?

As I said, I can't reproduce it.

The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.

You need to provide the backtrace.

Yeah, I am trying to find it.

membphis added the checking label (check first if this issue occurred) Dec 2, 2020
@gy09535 (Contributor, Author) commented Dec 2, 2020

I tried a bad user and found that too many auth requests can drive etcd CPU to 100%. APISIX reports many auth errors in config_etcd.lua, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
I think _automatic_fetch may have a problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486

etcd CPU 100%: [screenshot]

etcd error log: [screenshot]

APISIX error log: [screenshot]

@spacewander You can use my case to reproduce it.

@spacewander (Member)

@gy09535
Please provide the backtrace.
And please don't guess! If you think somewhere has a problem, provide proof!

@spacewander (Member)

etcd CPU at 100% and ~30 requests per second don't look strange to me. They are expected.

I said it yesterday: "The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired."

~30 requests is lower than the expected peak number (~40) that 4 workers can create. But the captured packets show that when the issue occurs, the peak can reach ~100 requests per second. This is what strikes me as strange.

We have already wasted too much time on red herrings.

@spacewander (Member)

Anyway, we can share the etcd client to reduce the auth requests. Although the 2.5x problem and the possible etcd hang may still exist, this optimization can fix the etcd 100% CPU and the sync failures. (Who minds the 2.5x problem if we only have 2 requests per worker?)
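
Roughly what I have in mind (only a sketch, not the final change; the helper name below is made up):

local etcd = require("resty.etcd")

-- one etcd client per worker, shared by every watched directory, so each
-- worker refreshes its JWT once instead of once per /apisix/* prefix
local shared_cli

local function get_shared_etcd_client(etcd_conf)
    if shared_cli then
        return shared_cli
    end

    local cli, err = etcd.new(etcd_conf)
    if not cli then
        return nil, err
    end

    shared_cli = cli
    return shared_cli
end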

@gy09535 (Contributor, Author) commented Dec 2, 2020

~30 requests is lower than the expected peak number (~40) that 4 workers can create. But the captured packets show that when the issue occurs, the peak can reach ~100 requests per second. This is what strikes me as strange.

@spacewander
I changed the worker count to 1 for this test; 1 worker produces ~30 requests, so 4 workers should produce ~120 requests. I think it is a problem: if we deploy more instances, the requests will be magnified even further.
[screenshot]

@spacewander (Member) commented Dec 2, 2020

I was misled by the error log screenshots. The two error logs cover different time intervals: the etcd one covers one second, and the Nginx error log covers 40 seconds. So actually you have 30 requests in 40 seconds for 1 worker, which is not strange.

@gy09535 (Contributor, Author) commented Dec 2, 2020

I was misled by the error log screenshots. The two error logs cover different time intervals: the etcd one covers one second, and the Nginx error log covers 40 seconds. So actually you have 30 requests in 40 seconds for 1 worker, which is not strange.

Hahaha, ignore it, this code is OK. I will try to find out why etcd receives 40 auth failures in one second.

@spacewander (Member)

Anyway, I just submitted a work-in-progress PR: #2932. The PR can reduce the auth requests significantly. Some details and tests aren't finished yet.

@gy09535 (Contributor, Author) commented Dec 3, 2020

For every auto-fetch cycle, it will do up to 32 fetches because of this code:

 while not exiting() and self.running and i <= 32 do
        i = i + 1

I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s.

 if not ok then
            log.error("failed to fetch data from etcd: ", err, ", ",
                      tostring(self))
            ngx_sleep(3)
            break
        end

Instead, it will only sleep 0.5 s in this code when auth fails:

            if err then
                if err ~= "timeout" and err ~= "Key not found"
                    and self.last_err ~= err then
                    log.error("failed to fetch data from etcd: ", err, ", ",
                              tostring(self))
                end

                if err ~= self.last_err then
                    self.last_err = err
                    self.last_err_time = ngx_time()
                else
                    if ngx_time() - self.last_err_time >= 30 then
                        self.last_err = nil
                    end
                end
                ngx_sleep(0.5)

There are 16 batch jobs in one worker, which confuses me; it should be 8.

2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer

For one worker, that is 16 * 2 = 32 requests (max) in one second when auth fails, since each of the 16 timers retries every 0.5 s. I think it is a problem. I think we should remove this code; we can accept that the config is not synced immediately:

 while not exiting() and self.running and i <= 32 do
        i = i + 1

And I think we should sleep longer when the fetch fails.
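
For example (just a sketch; the numbers and the helper name are only illustrative):

-- a capped backoff for the error branch above, instead of the fixed ngx_sleep(0.5),
-- so a persistent auth failure does not keep hammering etcd
local function error_backoff(retry_count)
    local delay = 0.5 * 2 ^ retry_count    -- 0.5s, 1s, 2s, 4s, ...
    return math.min(delay, 10)             -- never wait more than 10 s between retries
end

Then the error branch could call ngx_sleep(error_backoff(retry_count)) instead of ngx_sleep(0.5).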

@gy09535 (Contributor, Author) commented Dec 3, 2020

After I changed the sync code, the etcd auth failures and the etcd CPU returned to normal. I think the fix for this issue should go in two directions:
One, as @spacewander said, we can share the etcd client for requests.
Two, we should also control the data sync frequency, e.g. sync every 10 s per directory, and we should add a random time delay to the auth refresh.

After I changed the sync code and used wrong auth credentials, etcd shows:
[screenshot]
[screenshot]

spacewander added a commit to spacewander/incubator-apisix that referenced this issue Dec 7, 2020
It improves the behavior under apache#2899.
@spacewander (Member)

There are 16 batch jobs in one worker, which confuses me; it should be 8.

The logs come from two different processes: one is a worker and the other is the privileged agent. You can confirm it via the different PIDs.

I have submitted #2977, which is based on your PR.

spacewander added a commit to spacewander/incubator-apisix that referenced this issue Dec 7, 2020
It improves the behavior under apache#2899.
@gy09535 (Contributor, Author) commented Dec 7, 2020

There are 16 batch jobs in one worker, which confuses me; it should be 8.

The logs come from two different processes: one is a worker and the other is the privileged agent. You can confirm it via the different PIDs.

I have submitted #2977, which is based on your PR.

Got it, thanks for your professional responses.

spacewander added a commit that referenced this issue Dec 7, 2020
gy09535 closed this as completed Dec 8, 2020