
bug: Etcd jwt refresh error can cause gateway cluster crash. #2899

Closed
gy09535 opened this issue Nov 30, 2020 · 26 comments

@gy09535 (Contributor) commented Nov 30, 2020

Issue description

Today I found some auth errors from etcd. I connected to etcd to investigate and found that a large number of auth requests were causing etcd API timeouts. Here are my etcd logs:

2020-11-30 19:55:01.150181 W | auth: invalid auth token: SDZBJcAaVedJqUFY.116415
2020-11-30 19:55:01.226868 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
2020-11-30 19:55:01.238848 W | auth: invalid auth token: ArWxwPXGQRclPdZp.114301
2020-11-30 19:55:01.254654 W | auth: invalid auth token: WuqYtmtIDCGftGwr.114315
2020-11-30 19:55:01.283579 W | auth: invalid auth token: CqdVqeYChuiEWgWh.114324
2020-11-30 19:55:01.320021 W | auth: invalid auth token: YHlQCFrmBrGsHXXV.116397
2020-11-30 19:55:01.320712 W | auth: invalid auth token: YyOUXXpgpqQyNryf.114384
2020-11-30 19:55:01.359229 W | auth: invalid auth token: YXSKKPuaGObcCrFZ.114346
2020-11-30 19:55:01.367728 W | auth: invalid auth token: HFqBwCvzPfTRVxih.116421
2020-11-30 19:55:01.389121 W | auth: invalid auth token: KGVyAgnXWOgFDFYl.114375
2020-11-30 19:55:01.408058 W | auth: invalid auth token: EGSajOVclqMhLIqS.114317
2020-11-30 19:55:01.424942 W | auth: invalid auth token: qtXATtyhLZJSGIfB.116433
2020-11-30 19:55:01.480119 W | auth: invalid auth token: tqHxDVbjSKBVCDaS.114331
2020-11-30 19:55:01.485725 W | auth: invalid auth token: siQhqOvdtRUXsHiW.116425
2020-11-30 19:55:01.521545 W | auth: invalid auth token: xVPFwHcEMXScHhGZ.114338
2020-11-30 19:55:01.529845 W | auth: invalid auth token: vqtBsVIGKtNCjUpb.114336
2020-11-30 19:55:01.541498 W | auth: invalid auth token: XFqNUGjGYYsHhpWI.114345
2020-11-30 19:55:01.580206 W | auth: invalid auth token: CiBhQUGpwiUPczNo.114314
2020-11-30 19:55:01.580919 W | auth: invalid auth token: JVaWKqKTDxnhwZig.114351
2020-11-30 19:55:01.600957 W | auth: invalid auth token: OIdYBazTuoTeRyhV.114320
2020-11-30 19:55:01.628087 W | auth: invalid auth token: bSuTMTUaydeOSFPw.114311
2020-11-30 19:55:01.650066 W | auth: invalid auth token: CUImlKqMxLllTDAG.116422
2020-11-30 19:55:01.671630 W | auth: invalid auth token: MckpSvnXCslOBZzN.114350
2020-11-30 19:55:01.730228 W | auth: invalid auth token: ZoCzVsZVnxKkXgkh.114323
2020-11-30 19:55:01.809137 W | auth: invalid auth token: MNtikXEJuXkIOWVt.114382
2020-11-30 19:55:01.840882 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
2020-11-30 19:55:01.892522 W | auth: invalid auth token: mprbhPNbjByDExCW.116398
2020-11-30 19:55:01.973658 W | auth: invalid auth token: MMKVaKJYLCPjtYra.114335
2020-11-30 19:55:02.055462 W | auth: invalid auth token: eLrJyzLIfDCCtLgS.114352
2020-11-30 19:55:02.313522 W | auth: invalid auth token: NcKCHbtvAvQTJWap.116409
2020-11-30 19:55:02.348031 W | auth: invalid auth token: iQkxDrTcVsWzzBwz.114330
2020-11-30 19:55:02.588947 W | auth: invalid auth token: THlTKQvjLFueVrsC.116408
2020-11-30 19:55:02.620312 W | auth: invalid auth token: zEoNFjjLmZLFLujv.114328
2020-11-30 19:55:02.637622 W | auth: invalid auth token: jmyLhQNCVaHCuESg.114312
2020-11-30 19:55:02.760728 W | auth: invalid auth token: bloWjYzYVchKqYCb.114376

I captured packets from the etcd server and found these packets:

[screenshot: captured etcd auth packets]

After checking the APISIX code, I found that this code can cause an auth dead loop when the JWT refresh fails:

function refresh_jwt_token(self)
    -- token exist and not expire
    -- default is 5min, we use 3min
    -- https://github.com/etcd-io/etcd/issues/8287
    if self.jwt_token and now() - self.last_auth_time < 60 * 3 then
        return true, nil
    end

    local opts = {
        body = {
            name         = self.user,
            password     = self.password,
        }
    }
    local res, err = _request_uri(self, 'POST',
                        choose_endpoint(self).full_prefix .. "/auth/authenticate",
                        opts, 5, true)    -- default authenticate timeout 5 second
    if err then
        return nil, err
    end

    if not res or not res.body or not res.body.token then
        return nil, 'authenticate refresh token fail'
    end

    self.jwt_token = res.body.token
    self.last_auth_time = now()

    return true, nil
end

ref: https://github.com/api7/lua-resty-etcd/blob/master/lib/resty/etcd/v3.lua#L221
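
To make the suspicion concrete: last_auth_time is only set on success, so after a failed refresh the next call re-authenticates immediately, and every sync retry becomes another auth request. A minimal sketch of a backoff (the last_auth_fail_time field is something I made up for illustration; it is not in the library):

function refresh_jwt_token(self)
    -- token exists and has not expired (default TTL is 5 min, refresh after 3 min)
    if self.jwt_token and now() - self.last_auth_time < 60 * 3 then
        return true, nil
    end

    -- hypothetical: remember the last failure and wait before re-authenticating,
    -- so a broken credential does not turn every sync retry into a new auth request
    if self.last_auth_fail_time and now() - self.last_auth_fail_time < 10 then
        return nil, "previous authenticate failed, waiting before retry"
    end

    local opts = {
        body = {
            name     = self.user,
            password = self.password,
        }
    }
    local res, err = _request_uri(self, 'POST',
                        choose_endpoint(self).full_prefix .. "/auth/authenticate",
                        opts, 5, true)    -- default authenticate timeout 5 seconds
    if err or not res or not res.body or not res.body.token then
        self.last_auth_fail_time = now()    -- hypothetical field, see above
        return nil, err or 'authenticate refresh token fail'
    end

    self.jwt_token = res.body.token
    self.last_auth_time = now()

    return true, nil
end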

Environment

  • apisix version (cmd: apisix version):
  • OS:

Minimal test code / Steps to reproduce the issue

What's the actual result? (including assertion message & call stack if applicable)

What's the expected result?

@membphis (Member)

What is the error log at APISIX?

We can sleep longer if refreshing the JWT auth fails.
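
For example (just a sketch against the refresh_jwt_token shown above; the 3-second value is only illustrative):

local res, err = _request_uri(self, 'POST',
                    choose_endpoint(self).full_prefix .. "/auth/authenticate",
                    opts, 5, true)
if err then
    ngx.sleep(3)    -- back off before returning, so the next retry is not immediate
    return nil, err
end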

@membphis (Member)

@Yiyiyimu @nic-chen welcome to take a look at this issue

@gy09535 (Contributor, Author) commented Nov 30, 2020

What is the error log at APISIX?

We can sleep longer if refreshing the JWT auth fails.

APISIX reports this error; I think it is because etcd is too busy to handle the requests:

2020/11/30 11:31:00 [error] 26#26: *31342118 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/proto, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31348440 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/global_rules, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337085 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/services, context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31341712 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/ssl, context: ngx.timer
2020/11/30 11:31:00 [error] 28#28: *31346756 recv() failed (104: Connection reset by peer), context: ngx.timer
2020/11/30 11:31:00 [error] 26#26: *31337083 [lua] config_etcd.lua:483: failed to fetch data from etcd: connection reset by peer,  etcd key: /apisix/plugin_metadata, context: ngx.timer

This is very weird. The auth credentials are correct, so what triggers the JWT refresh so frequently?

moonming added this to the 2.2 milestone Nov 30, 2020
@gy09535 (Contributor, Author) commented Dec 1, 2020

@Yiyiyimu @nic-chen welcome to take a look at this issue

I am not sure whether different instances using one account can cause an etcd auth problem?

@gy09535 (Contributor, Author) commented Dec 1, 2020

@Yiyiyimu @nic-chen welcome to take a look at this issue

I am not sure whether different instances using one account can cause an etcd auth problem?

I read the etcd docs: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/

There are two kinds of token types: simple and JWT. The simple token isn’t designed for production use cases. Its tokens aren’t cryptographically signed and servers must statefully track token-user correspondence; it is meant for development testing. JWT tokens should be used for production deployments since it is cryptographically signed and verified. From the implementation perspective, JWT is stateless. Its token can include metadata including username and revision, so servers don’t need to remember correspondence between tokens and the metadata.

JWT is stateless and it can authenticate every request, so this is not the problem.

@spacewander (Member)

After checking the APISIX code, I found that this code can cause an auth dead loop when the JWT refresh fails.

I don't think there is a dead loop. If a dead loop existed, the interval between requests wouldn't be tens of milliseconds; it would be much shorter. Can you provide a standalone example to prove your suspicion?

@spacewander (Member)

@gy09535
Can you provide a packet capture file so we can analyze it?

@gy09535 (Contributor, Author) commented Dec 1, 2020

@gy09535
Can you provide a packet capture file so we can analyze it?

tcp_etcd_01.cap.zip

etcd version: 3.4.13
etcd roles: root, gateway
gateway role permission: /apisix/*, /routes
etcd user: apisix (role: gateway)

We use one user across 3 APISIX instances and sometimes hit this issue. I am not sure what triggers it; there is no obvious error in the log file. I think the trigger is some API being called too frequently, such as the watcher. So changing the auth logic could avoid it. I am trying to prove it.

@spacewander (Member)

@gy09535
How many workers per APISIX instance?

@gy09535 (Contributor, Author) commented Dec 1, 2020

@gy09535
How many workers per APISIX instance?

4

@spacewander (Member)

Some interesting results:

  1. Every auth request is terminated after 5 seconds, because 5 is hardcoded as the timeout value:
    https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
  2. The requests come in batches. We can add some random jitter to smooth out the auth requests (a rough sketch follows at the end of this comment).

Some strange results:

  1. The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.
  2. Even under high pressure, a few requests should still succeed. However, all requests failed. I suspect etcd was hung for a while.
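
For the jitter, roughly something like this (only a sketch against the refresh_jwt_token shown above; refresh_interval is a field name I made up):

-- when a token is obtained, pick a slightly different refresh interval each time,
-- so workers started together do not all re-authenticate in the same second
self.jwt_token = res.body.token
self.last_auth_time = now()
self.refresh_interval = 60 * 3 + math.random(30)    -- 3 min plus up to 30 s of jitter

-- and at the top of refresh_jwt_token():
if self.jwt_token and now() - self.last_auth_time < (self.refresh_interval or 60 * 3) then
    return true, nil
end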

@gy09535 (Contributor, Author) commented Dec 2, 2020

Some interesting results:

  1. Every auth request is terminated after 5 seconds, because 5 is hardcoded as the timeout value:
    https://github.com/api7/lua-resty-etcd/blob/c45cfab8d4186c931ebfc3ea87ebcea5624b7122/lib/resty/etcd/v3.lua#L215
  2. The requests come in batches. We can add some random jitter to smooth out the auth requests.

Some strange results:

  1. The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.
  2. Even under high pressure, a few requests should still succeed. However, all requests failed. I suspect etcd was hung for a while.

Did you find the caller that is making the requests so frequently? Anyway, we should control the auth failure handling; we cannot allow requests to authenticate again and again, and we must protect etcd from receiving too many requests.

Some special condition causes it, not every time. When it happened, I restarted the gateway, the auth became normal, and the etcd requests returned to normal.

@spacewander (Member)

Did you find the caller that is making the requests so frequently?

As I said, I can't reproduce it.

The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.

You need to provide the backtrace.

@gy09535 (Contributor, Author) commented Dec 2, 2020

Did you find the caller that is making the requests so frequently?

As I said, I can't reproduce it.

The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired.

You need to provide the backtrace.

Yeah, I am trying to find it.

membphis added the checking label (check first if this issue occurred) Dec 2, 2020
@gy09535 (Contributor, Author) commented Dec 2, 2020

I tried a bad user and found that too many auth requests can drive etcd CPU to 100%. APISIX reports many auth errors in config_etcd.lua, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509
I think _automatic_fetch may have a problem, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L486

etcd CPU 100%: [screenshot]

etcd error log: [screenshot]

APISIX error log: [screenshot]

@spacewander You can use my case to reproduce it.

@spacewander (Member)

@gy09535
Please provide the backtrace.
And please don't guess! If you think somewhere has a problem, provide proof!

@spacewander (Member)

etcd CPU at 100% and ~30 requests per second don't look strange to me. They are expected.

I said it yesterday: "The number of requests is bigger than what 4 workers can create (about 2.5x, I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired."

~30 requests is lower than the expected peak number (~40) that 4 workers can create. But the captured packets show that when the issue occurs, the peak can reach ~100 requests per second. This is what strikes me as strange.

We have already wasted too much time on red herrings.

@spacewander (Member)

Anyway, we can share the etcd client to reduce the auth requests. Although the 2.5x problem and the possible etcd hang may still exist, this optimization can fix the etcd 100% CPU and the sync failures. (Who minds the 2.5x problem if we only have 2 requests per worker?)
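
Roughly what I have in mind (only a sketch, not the final change; the helper name below is made up):

local etcd = require("resty.etcd")

-- one etcd client per worker, shared by every watched directory, so each
-- worker refreshes its JWT once instead of once per /apisix/* prefix
local shared_cli

local function get_shared_etcd_client(etcd_conf)
    if shared_cli then
        return shared_cli
    end

    local cli, err = etcd.new(etcd_conf)
    if not cli then
        return nil, err
    end

    shared_cli = cli
    return shared_cli
end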

@gy09535 (Contributor, Author) commented Dec 2, 2020

~30 requests is lower than the expected peak number (~40) that 4 workers can create. But the captured packets show that when the issue occurs, the peak can reach ~100 requests per second. This is what strikes me as strange.

@spacewander
I changed the worker count to 1 for this test; 1 worker produces ~30 requests, so 4 workers should produce ~120 requests. I think it is a problem: if we deploy more instances, the requests will be magnified even further.
[screenshot]

@spacewander (Member) commented Dec 2, 2020

I was misled by the error log screenshots. The two error logs cover different time intervals: the etcd one covers one second, and the Nginx error log covers 40 seconds. So actually you have 30 requests in 40 seconds for 1 worker, which is not strange.

@gy09535 (Contributor, Author) commented Dec 2, 2020

I was misled by the error log screenshots. The two error logs cover different time intervals: the etcd one covers one second, and the Nginx error log covers 40 seconds. So actually you have 30 requests in 40 seconds for 1 worker, which is not strange.

Hahaha, ignore it, this code is OK. I will try to find out why etcd receives 40 auth failures in one second.

@spacewander (Member)

Anyway, I just submitted a work-in-progress PR: #2932. The PR can reduce the auth requests significantly. Some details and tests aren't finished yet.

@gy09535 (Contributor, Author) commented Dec 3, 2020

For every auto-fetch cycle, it will do up to 32 fetches because of this code:

 while not exiting() and self.running and i <= 32 do
        i = i + 1

I think this "ok" var will always true, if we do auth fail, and it can not sleep 3s.

 if not ok then
            log.error("failed to fetch data from etcd: ", err, ", ",
                      tostring(self))
            ngx_sleep(3)
            break
        end

Instead, it will only sleep 0.5 s in this code when auth fails:

            if err then
                if err ~= "timeout" and err ~= "Key not found"
                    and self.last_err ~= err then
                    log.error("failed to fetch data from etcd: ", err, ", ",
                              tostring(self))
                end

                if err ~= self.last_err then
                    self.last_err = err
                    self.last_err_time = ngx_time()
                else
                    if ngx_time() - self.last_err_time >= 30 then
                        self.last_err = nil
                    end
                end
                ngx_sleep(0.5)

There are 16 batch jobs in one worker, which confuses me; it should be 8.

2020/12/03 11:17:26 [warn] 44375#0: *4 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *6 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *8 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *10 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *12 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *14 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *16 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44375#0: *18 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *21 [lua] init.lua:262: sync_local_conf_to_etcd(): sync local conf to etcd, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *24 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/routes, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *26 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/ssl, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *28 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/global_rules, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *30 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/services, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *32 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/proto, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *34 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/plugin_metadata, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *36 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/consumers, context: ngx.timer
2020/12/03 11:17:26 [warn] 44372#0: *38 [lua] config_etcd.lua:468: begin to statistics call count for auto fetch,key:/apisix/upstreams, context: ngx.timer

For one worker, that is 16 * 2 = 32 requests (max) in one second when auth fails, since each of the 16 timers retries every 0.5 s. I think it is a problem. I think we should remove this code; we can accept that the config is not synced immediately:

 while not exiting() and self.running and i <= 32 do
        i = i + 1

And I think we should sleep longer when the fetch fails.
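
For example (just a sketch; the numbers and the helper name are only illustrative):

-- a capped backoff for the error branch above, instead of the fixed ngx_sleep(0.5),
-- so a persistent auth failure does not keep hammering etcd
local function error_backoff(retry_count)
    local delay = 0.5 * 2 ^ retry_count    -- 0.5s, 1s, 2s, 4s, ...
    return math.min(delay, 10)             -- never wait more than 10 s between retries
end

Then the error branch could call ngx_sleep(error_backoff(retry_count)) instead of ngx_sleep(0.5).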

@gy09535 (Contributor, Author) commented Dec 3, 2020

After I changed the sync code, the etcd auth failures and the etcd CPU returned to normal. I think the fix for this issue should go in two directions:
One, as @spacewander said, we can share the etcd client for requests.
Two, we should also control the data sync frequency, e.g. sync every 10 s per directory, and we should add a random time delay to the auth refresh.

After I changed the sync code and used wrong auth credentials, etcd shows:
[screenshot]
[screenshot]

spacewander added a commit to spacewander/incubator-apisix that referenced this issue Dec 7, 2020
It improves the behavior under apache#2899.
@spacewander (Member)

There are 16 batch jobs in one worker, which confuses me; it should be 8.

The logs come from two different processes: one is a worker and the other is the privileged agent. You can confirm it via the different PIDs.

I have submitted #2977, which is based on your PR.

spacewander added a commit to spacewander/incubator-apisix that referenced this issue Dec 7, 2020
It improves the behavior under apache#2899.
@gy09535 (Contributor, Author) commented Dec 7, 2020

There are 16 batch jobs in one worker, which confuses me; it should be 8.

The logs come from two different processes: one is a worker and the other is the privileged agent. You can confirm it via the different PIDs.

I have submitted #2977, which is based on your PR.

Got it, thanks for your professional responses.

spacewander added a commit that referenced this issue Dec 7, 2020
gy09535 closed this as completed Dec 8, 2020