
Get Devices REST API returned 504 gateway time-out #373

Closed
w1lldone opened this issue Oct 18, 2019 · 12 comments

@w1lldone

Is this a bug or a feature request?

Maybe it's a misconfiguration.

What did you expect?

To know what caused this issue.

What happened?

We have 139 active devices registered on our server. Sometimes the device list disappears from the application details page and the API endpoint returns error 504 (gateway time-out).
The device list comes back after restarting the Redis server. Maybe someone has encountered the same issue and hopefully knows the solution.

What version are you using?

v3.3.1

@sa-wilson
Contributor

sa-wilson commented Oct 28, 2019

We see something like this happening as well, although I'm not sure whether the gateway is timing out in the same way, as I haven't checked the dev tools the last few times it happened. I'll do so next time.

We fix it by restarting lora-app-server. A colleague checked the logs the last time it occurred (over the weekend) and nothing made it to the debug log. /api/devices also failed to return (which I assume is the issue with the actual device list).

If restarting redis also fixes this, then maybe there's an issue with the connection between lora-app-server and redis. I'll check redis next time it happens here.

This has been happening for a few versions now; we've just been dealing with it and figured it might be a stupid configuration issue on our end.

EDIT: Although, looking now, I can't see anything redis-related in storage/device.go. It all looks to be using postgres.

@darkfader

I get a 504 when reading out the events. Perhaps it takes too long somehow?

@Tbronders

Tbronders commented Jan 13, 2020

Hello there,

We have the same issue; a restart did, however, fix the problem. The time-out error only occurred with api/devices. We are currently running version 3.3.0 and looking to upgrade to the latest ChirpStack version!

In the log I get the following errors/warnings:

Jan 13 10:30:46 ********* lora-app-server[1641]: time="2020-01-13T10:30:46+01:00" level=warning msg="influxdb integration: unhandled type!" type_name="[]string"

Jan 13 10:30:27 ********* lora-app-server[1641]: time="2020-01-13T10:30:27+01:00" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = handle received ping error: get ping lookup error: get ping lookup error: redigo: nil returned" grpc.code=Internal grpc.method=HandleProprietaryUplink grpc.service=as.ApplicationServerService grpc.start_time="2020-01-13T10:30:27+01:00" grpc.time_ms=1.17 peer.address="ip address:40700" span.kind=server system=grpc

Jan 13 10:31:19 ********* lora-app-server[1641]: time="2020-01-13T10:31:19+01:00" level=error msg="handle received ping error: get ping lookup error: get ping lookup error: redigo: nil returned"

Could you shed some light on what is happening here, so we can resolve this problem the next time it occurs?

Thanks in advance!

@ramuit44

ramuit44 commented Feb 8, 2020

Hi,
We are facing exactly the same issue with v3.6.1 of ChirpStack. As described above, the device list API call times out.

We checked the GetDevices method under storage/device.go and manually executed the same SQL query; it returned just fine. The Postgres connections are also all fine, with no locks and no long-running transactions/sessions.

There are no errors in the console except for the log below regarding redigo (the Redis client library). Another thing to note: once this happens, even adding a new device (POST API) also times out. This happened quite a few times last month, and restarting either Redis or the app server fixed the problem temporarily.

Which is strange: I understand that the device session keys are stored in Redis, but I didn't see any references to Redis when the GetDevices API is called. After a few days we encounter the same issue again, and we keep restarting to fix the problem.

So it obviously feels like some connection issue between Redis and the app server. I'm not sure if it's happening because of the client buffer limits on Redis. Unfortunately there is nothing in the logs on the Redis side.
https://redislabs.com/blog/top-redis-headaches-for-devops-client-buffers/
https://medium.com/@VipulDivyanshu/client-buffers-before-taking-redis-in-production-read-this-c175a508d8ba

The traffic on our app server is not that high though: on average one uplink payload per second, and the payload sizes aren't that big, so I'm not sure whether it is taking so long to process the Redis responses that we hit the client buffer limits on Redis. The "GET" Redis connections are obviously gone when this happens, based on the log "handle received ping error: get ping lookup error". The question would be: even if we hit the limits, shouldn't there be auto-reconnection, since it's a Redis pool? We are now increasing the buffer sizes and hope the Redis connections don't get into this state again.

Any pointers on this would be much appreciated. Note: the setup is on K8s with ChirpStack Docker images.

time="2020-02-07T22:39:12Z" level=info msg="finished unary call with code OK" ctx_id=7af65e6f-ed08-48f7-99eb-37435887d4f9 grpc.code=OK grpc.method=HandleUplinkData grpc.request.deadline="2020-02-07T22:39:13Z" grpc.service=as.ApplicationServerService grpc.start_time="2020-02-07T22:39:12Z" grpc.time_ms=3.536 peer.address="172.30.59.205:35864" span.kind=server system=grpc
time="2020-02-07T22:39:12Z" level=info msg="integration/mqtt: publishing message" ctx_id=7af65e6f-ed08-48f7-99eb-37435887d4f9 qos=0 topic=application/1/device/00000000820001b5/rx
time="2020-02-07T22:39:22Z" level=info msg="device last-seen and dr updated" ctx_id=ad08ac79-369e-43f7-b5c8-ac08d72dcd09 dev_eui=00000000820001b5
time="2020-02-07T22:39:22Z" level=info msg="finished unary call with code OK" ctx_id=ad08ac79-369e-43f7-b5c8-ac08d72dcd09 grpc.code=OK grpc.method=HandleUplinkData grpc.request.deadline="2020-02-07T22:39:23Z" grpc.service=as.ApplicationServerService grpc.start_time="2020-02-07T22:39:22Z" grpc.time_ms=2.981 peer.address="172.30.59.205:35864" span.kind=server system=grpc
time="2020-02-07T22:39:22Z" level=info msg="integration/mqtt: publishing message" ctx_id=ad08ac79-369e-43f7-b5c8-ac08d72dcd09 qos=0 topic=application/1/device/00000000820001b5/rx
**time="2020-02-07T22:39:30Z" level=error msg="handle received ping error: get ping lookup error: get ping lookup error: redigo: nil returned"**
time="2020-02-07T22:39:30Z" level=error msg="finished unary call with code Internal" ctx_id=ca10bb5b-dd74-42f6-83b4-6951cc592768 error="rpc error: code = Internal desc = handle received ping error: get ping lookup error: get ping lookup error: error: read tcp 192.168.1.109:59050->13.210.17.29:443: read: connection reset by peervice=as.ApplicationServerService grpc.start_time="2020-02-07T22:39:30Z" grpc.time_ms=0.226 peer.address="172.30.59.68:41854" span.kind=server system=grpc
time="2020-02-07T22:39:32Z" level=info msg="device last-seen and dr updated" ctx_id=29a6dc82-948f-475a-b021-c79d04e9e615 dev_eui=00000000820001b5
time="2020-02-07T22:39:32Z" level=info msg="finished unary call with code OK" ctx_id=29a6dc82-948f-475a-b021-c79d04e9e615 grpc.code=OK grpc.method=HandleUplinkData grpc.request.deadline="2020-02-07T22:39:33Z" grpc.service=as.ApplicationServerService grpc.start_time="2020-02-07T22:39:32Z" grpc.time_ms=3.66 peer.address="172.30.59.205:35864" span.kind=server system=grpc

@ramuit44

ramuit44 commented Feb 18, 2020

Some further analysis, after this happened again today.

When we ran CLIENT LIST on Redis we saw the output below: quite a number of connections issuing PING. I suspect that is the reason Redis becomes unresponsive. We have added a limit to the active Redis connections in the lora-app-server YAML. Hopefully that resolves the issue.

id=130068 addr=172.30.59.205:43262 fd=74 name= age=518133 idle=3 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=129780 addr=172.30.59.205:38736 fd=52 name= age=519463 idle=14 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=130569 addr=172.30.59.205:51170 fd=131 name= age=515828 idle=11 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=131083 addr=172.30.59.205:59210 fd=169 name= age=513448 idle=2 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
......
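
For what it's worth, a sketch of what such a cap does at the redigo level (the values below are made up, not the actual ChirpStack configuration): with MaxActive set and Wait left false, a Get() beyond the cap returns a connection whose commands fail with ErrPoolExhausted, so a leak shows up as errors rather than an ever-growing CLIENT LIST.

package main

import (
	"errors"
	"fmt"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Hypothetical pool capped at 2 active connections (illustrative only).
	pool := &redis.Pool{
		MaxActive: 2,
		Wait:      false, // with Wait: true, Get() blocks until a connection is returned
		Dial: func() (redis.Conn, error) {
			return redis.Dial("tcp", "localhost:6379")
		},
	}

	// Borrow two connections without returning them, the way the leaked
	// pub/sub subscriptions in this issue do.
	c1, c2 := pool.Get(), pool.Get()
	defer c1.Close()
	defer c2.Close()

	// The third Get() exceeds MaxActive; its commands fail immediately.
	c3 := pool.Get()
	defer c3.Close()
	if _, err := c3.Do("PING"); errors.Is(err, redis.ErrPoolExhausted) {
		fmt.Println("pool exhausted:", err)
	}
}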

@vfylyk

vfylyk commented Feb 19, 2020

I looked further into this, based on what @ramuit44 noticed. It seems to be caused by a goroutine + redis connection pool leak. This problem is particularly relevant to my company because we use the troubleshooting tabs quite extensively when dealing with sensor providers as well as when diagnosing network issues.

Here's how I managed to reproduce the excessive number of connections in redis:

Context:
A registered device that does not send uplinks for the duration of the test.

Steps:

  1. Go to the device page in the UI and start a "Device Data" troubleshooting.
  2. Close the page after running the capture for a considerable length of time (e.g. >20 minutes).

Expected result:

  • A websocket connection is established with the app server and stays stable (e.g. wss://lora.outcomex.com.au/api/devices/cb0f000010ffffff/events).
  • An RPC call to StreamEventLogs is started in the system.
  • As the device never sends an uplink, the UI page stays blank.
  • The websocket connection is closed when the page is closed.
  • A log message appears stating that the StreamEventLogs gRPC method has finished, similar to: time="2020-02-19T00:39:26Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-19T00:34:23Z" grpc.time_ms=302393.25 peer.address="127.0.0.1:36280" span.kind=server system=grpc
  • When the RPC call finishes, the redis connection is returned to the pool.
  • The number of client connections in redis is stable at the "max_idle" count after a while.

What actually happens:

  • The websocket connections are closed in the UI every 60 seconds and a new one is created.
  • Each new websocket connection opens a new RPC call to StreamEventLogs, which in turn takes a new redis connection from the pool.
  • The previously closed websocket connections do NOT cause their respective RPC calls to complete, which can be detected by the absence of the expected "finished streaming call" log messages.
  • The redis connections are therefore not returned to the pool. The 'GetEventLogForDevice' function in internal/eventlog/eventlog.go keeps pinging the unused connections every 30 seconds. When checking "client list" on redis-cli, all the leaked connections are idle for less than 30 seconds, and they accumulate in line with the number of unfinished RPCs.
  • Once too many redis connections are open, the server starts to misbehave, e.g. it no longer shows the list of devices.


Excerpt of GetEventLogForDevice:

	// todo: make this a config value?
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

loop:
	for {
		select {
		case <-ticker.C:
			if err := psc.Ping(""); err != nil {
				log.WithError(err).Error("subscription ping error")
				break loop
			}
		case <-ctx.Done():
			break loop
		case err := <-done:
			return err
		}
	}
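
To put the excerpt in context, here is a simplified sketch of my own (not the actual ChirpStack code; the function and channel names are illustrative) of the subscribe-and-ping pattern around it. The goroutine takes a pub/sub connection from the pool and only releases it when ctx is cancelled, the subscription errors, or the ping fails; if the websocket layer never cancels ctx and the device never publishes, the goroutine and its Redis connection live on, pinging every 30 seconds.

package eventlog

import (
	"context"
	"time"

	"github.com/gomodule/redigo/redis"
)

// streamEvents is an illustrative stand-in for the real subscription logic.
func streamEvents(ctx context.Context, pool *redis.Pool, channel string, out chan<- []byte) error {
	psc := redis.PubSubConn{Conn: pool.Get()}
	defer psc.Close() // only reached once the loop below exits

	if err := psc.Subscribe(channel); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() {
		for {
			switch v := psc.Receive().(type) {
			case redis.Message:
				out <- v.Data // forward the event to the websocket layer
			case error:
				done <- v
				return
			}
		}
	}()

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// If ctx is never cancelled and the pings keep succeeding,
			// this loop (and the pooled connection it holds) runs forever.
			if err := psc.Ping(""); err != nil {
				return err
			}
		case <-ctx.Done():
			return nil
		case err := <-done:
			return err
		}
	}
}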

Now, to add a twist to this scenario: if after a while the device happens to send an uplink, I can see all the hanging RPCs complete at once, and the list of client connections in redis return to normal:

time="2020-02-19T00:39:26Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-19T00:34:23Z" grpc.time_ms=302393.25 peer.address="127.0.0.1:36280" span.kind=server system=grpc
time="2020-02-19T00:39:26Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-19T00:35:26Z" grpc.time_ms=239846.53 peer.address="127.0.0.1:36280" span.kind=server system=grpc
time="2020-02-19T00:39:26Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-19T00:36:26Z" grpc.time_ms=179821.81 peer.address="127.0.0.1:36280" span.kind=server system=grpc
time="2020-02-19T00:39:26Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-19T00:37:26Z" grpc.time_ms=119713.43 peer.address="127.0.0.1:36280" span.kind=server system=grpc

@brocaar Any thoughts on this?

PS example client list:

id=253366 addr=172.30.50.95:52246 fd=30 name= age=3002 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=10 addr=127.0.0.1:37274 fd=12 name= age=1254070 idle=1254070 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=command
id=252929 addr=172.30.56.25:54924 fd=24 name= age=5065 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253421 addr=172.30.50.95:53684 fd=33 name= age=2759 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253985 addr=172.30.56.25:54786 fd=63 name= age=126 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253297 addr=172.30.50.95:50522 fd=26 name= age=3299 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253657 addr=172.30.50.95:60132 fd=40 name= age=1643 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253778 addr=172.30.50.95:35000 fd=47 name= age=1083 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253568 addr=172.30.50.95:57686 fd=32 name= age=2050 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253320 addr=172.30.50.95:51100 fd=14 name= age=3195 idle=26 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=251218 addr=127.0.0.1:39648 fd=11 name= age=12509 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=26 qbuf-free=32742 obl=0 oll=0 omem=0 events=r cmd=client
id=253594 addr=172.30.50.95:58354 fd=35 name= age=1933 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253644 addr=172.30.50.95:59786 fd=39 name= age=1701 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253943 addr=172.30.50.95:39104 fd=60 name= age=322 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253983 addr=172.30.56.25:54740 fd=62 name= age=128 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253800 addr=172.30.50.95:35566 fd=49 name= age=978 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253999 addr=172.30.56.25:55086 fd=64 name= age=63 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=subscribe
id=253787 addr=172.30.50.95:35202 fd=48 name= age=1043 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=252923 addr=172.30.62.25:57372 fd=19 name= age=5093 idle=38 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=publish
id=253958 addr=172.30.50.95:39416 fd=61 name= age=252 idle=3 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=publish
id=254011 addr=172.30.56.25:55322 fd=66 name= age=8 idle=3 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=lrange
id=253761 addr=172.30.50.95:34526 fd=46 name= age=1163 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=252916 addr=172.30.56.25:54584 fd=10 name= age=5122 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253891 addr=172.30.50.95:37934 fd=56 name= age=563 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253553 addr=172.30.50.95:57304 fd=29 name= age=2120 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253352 addr=172.30.50.95:51914 fd=28 name= age=3059 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253696 addr=172.30.50.95:32910 fd=41 name= age=1458 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253631 addr=172.30.50.95:59366 fd=38 name= age=1764 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253817 addr=172.30.50.95:35994 fd=50 name= age=903 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253580 addr=172.30.50.95:58000 fd=34 name= age=1995 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253735 addr=172.30.50.95:33878 fd=44 name= age=1283 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253917 addr=172.30.50.95:38518 fd=58 name= age=443 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253852 addr=172.30.50.95:36922 fd=53 name= age=743 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253529 addr=172.30.50.95:56630 fd=20 name= age=2235 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253229 addr=172.30.56.25:35248 fd=21 name= age=3606 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253270 addr=172.30.50.95:49852 fd=18 name= age=3419 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253722 addr=172.30.50.95:33568 fd=43 name= age=1338 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253324 addr=172.30.50.95:51192 fd=17 name= age=3179 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253878 addr=172.30.50.95:37578 fd=55 name= age=623 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253605 addr=172.30.50.95:58686 fd=36 name= age=1878 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253752 addr=172.30.50.95:34336 fd=45 name= age=1203 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=252623 addr=172.30.50.95:60842 fd=15 name= age=6407 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=254010 addr=172.30.56.25:55310 fd=65 name= age=10 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=subscribe
id=253338 addr=172.30.50.95:51554 fd=22 name= age=3119 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253712 addr=172.30.50.95:33310 fd=42 name= age=1386 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253283 addr=172.30.50.95:50154 fd=23 name= age=3363 idle=27 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=252972 addr=172.30.56.25:56132 fd=9 name= age=4861 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253685 addr=172.30.50.95:60820 fd=8 name= age=1512 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=252957 addr=172.30.56.25:55698 fd=16 name= age=4930 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253867 addr=172.30.50.95:37296 fd=54 name= age=675 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253840 addr=172.30.50.95:36628 fd=52 name= age=796 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=252983 addr=172.30.56.25:56400 fd=13 name= age=4810 idle=10 flags=P db=0 sub=2 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253930 addr=172.30.50.95:38818 fd=59 name= age=378 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253905 addr=172.30.50.95:38264 fd=57 name= age=495 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253393 addr=172.30.50.95:52946 fd=25 name= age=2879 idle=26 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253620 addr=172.30.50.95:59076 fd=37 name= age=1813 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253405 addr=172.30.50.95:53228 fd=31 name= age=2830 idle=26 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253540 addr=172.30.50.95:56972 fd=27 name= age=2178 idle=24 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping
id=253826 addr=172.30.50.95:36234 fd=51 name= age=863 idle=23 flags=P db=0 sub=1 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ping

@brocaar
Owner

brocaar commented Feb 25, 2020

Thanks @vfylyk for all your input! 👍 This is really helpful and indicates where the problem is.

The websocket connections are closed in the UI every 60 seconds and a new one is created.

Do you have a load-balancer or proxy between the client (browser) and the server (ChirpStack Application Server)? Do you see the same result when there is a direct connection between the client (browser) and the server (ChirpStack Application Server)?

What I'm interested in is seeing whether there might be a kind of "zombie" connection between the load-balancer / proxy and the server which keeps the goroutine running. Based on your input I will also do some debugging.

@vfylyk

vfylyk commented Feb 25, 2020

Do you have a load-balancer or proxy between the client (browser) and the server (ChirpStack Application Server)? Do you see the same result when there is a direct connection between the client (browser) and the server (ChirpStack Application Server)?

Good point, I do have a load-balancer. As I run the app server in Kubernetes on Amazon in a private subnet, I use an ALB as ingress to make it public and provide TLS.

I'll run some tests tomorrow skipping the proxy and load balancer and will let you know if the same still happens.

@vfylyk

vfylyk commented Feb 26, 2020

Hi @brocaar, I tested the same scenario through kubectl port-forward, skipping the AWS load-balancer.

What happens is the websocket connection does stay stable for as long as I stay on the "Device Data" tab, so no restarts every 60 seconds. However, when I move to another tab and the WS connection is closed, the RPCs still hang (i.e. no "finished streaming call with code OK" log entries). Even if I kill the kubectl port-forward process, it makes no difference.


After the device sends one uplink, then all hanging RPCs close, as happens in the load-balancer scenario:

time="2020-02-26T01:55:52Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-26T01:41:04Z" grpc.time_ms=888975.4 peer.address="127.0.0.1:36280" span.kind=server system=grpc
time="2020-02-26T01:55:52Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=StreamEventLogs grpc.service=api.DeviceService grpc.start_time="2020-02-26T01:32:16Z" grpc.time_ms=1.4162388e+06 peer.address="127.0.0.1:36280" span.kind=server system=grpc

As a comparison, when I use the "Lorawan Frames" tab through the AWS load-balancer, we do see the websocket being restarted every 60 seconds; however, whenever the connection is closed in the browser, the "finished call..." log entries appear immediately in the logs, for every one of the closed WS connections:


time="2020-02-26T01:21:41Z" level=warning msg="error reading websocket message: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-26T01:21:42Z" level=info msg="finished client streaming call" grpc.code=OK grpc.method=StreamFrameLogsForDevice grpc.service=ns.NetworkServerService grpc.time_ms=0.041 span.kind=client system=grpc
time="2020-02-26T01:22:42Z" level=warning msg="error reading websocket message: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-26T01:22:43Z" level=info msg="finished client streaming call" grpc.code=OK grpc.method=StreamFrameLogsForDevice grpc.service=ns.NetworkServerService grpc.time_ms=0.149 span.kind=client system=grpc
time="2020-02-26T01:23:43Z" level=warning msg="error reading websocket message: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-26T01:23:44Z" level=info msg="finished client streaming call" grpc.code=OK grpc.method=StreamFrameLogsForDevice grpc.service=ns.NetworkServerService grpc.time_ms=0.044 span.kind=client system=grpc
time="2020-02-26T01:24:44Z" level=warning msg="error reading websocket message: websocket: close 1006 (abnormal closure): unexpected EOF"
time="2020-02-26T01:24:44Z" level=info msg="finished client streaming call" grpc.code=OK grpc.method=StreamFrameLogsForDevice grpc.service=ns.NetworkServerService grpc.time_ms=0.055 span.kind=client system=grpc

@brocaar
Owner

brocaar commented Mar 24, 2020

To give you an update: I'm currently refactoring the Redis-related code. I'm migrating to godoc.org/github.com/go-redis/redis, which also supports Redis Clustering. It also makes the function subscribing to Redis Pub/Sub a lot more compact, as a lot of the magic is handled by the redis library.

However, I don't think this solves the issue. What could help is sending a periodic ping from the server to the WebSocket client (browser); if there is no response, the server kills the connection.

It looks like only when the server has something to send to the client (browser) does it find out that the client has gone, after which the RPC completes. Behavior might differ depending on the kind of load-balancer between the client and server.
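
For illustration, a minimal sketch of that kind of server-to-client keepalive using gorilla/websocket (an assumption about the websocket layer; this is not the grpc-websocket-proxy code, and the interval/timeout values are made up). A peer that never answers the pings eventually fails the read loop, so the handler, and whatever streaming RPC it wraps, can exit and release its resources.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 30 * time.Second // assumed values, not from ChirpStack
	pongTimeout  = 10 * time.Second
)

var upgrader = websocket.Upgrader{}

func handler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	// Extend the read deadline whenever the client answers a ping.
	conn.SetReadDeadline(time.Now().Add(pingInterval + pongTimeout))
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(pingInterval + pongTimeout))
	})

	// Ping the client periodically; a dead peer eventually makes the
	// read loop below fail, so the handler can exit and clean up.
	go func() {
		t := time.NewTicker(pingInterval)
		defer t.Stop()
		for range t.C {
			if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(pongTimeout)); err != nil {
				return
			}
		}
	}()

	for {
		if _, _, err := conn.ReadMessage(); err != nil {
			log.Println("websocket closed:", err)
			return // connection gone: cancel the streaming RPC here
		}
	}
}

func main() {
	http.HandleFunc("/events", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}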

@brocaar
Owner

brocaar commented Mar 25, 2020

I have created the following issue: tmc/grpc-websocket-proxy#22.

@brocaar brocaar added bug and removed question labels Mar 26, 2020
@brocaar
Owner

brocaar commented Mar 26, 2020

This should be fixed (see pull-request: tmc/grpc-websocket-proxy#23). As this has not yet been merged, I have updated the go.mod file to use a fork of grpc-websocket-proxy which includes this fix, see e3d81cc.

Sorry that this took a while to track down. One reason why some people were affected and others were not is that a proxy which periodically kills the websocket connection makes this significantly worse. E.g. when it kills the connection after 1 minute of inactivity, then after an hour the websocket wrapper might still have 60 internal requests open, each keeping a single Redis connection open.
