Get Devices REST API returned 504 gateway time-out #373
Comments
We get something like this happening as well, although I'm not sure whether the gateway is timing out like that or not, as I haven't recently checked the dev tools when it's happened; I'll do so next time. We fix it by restarting lora-app-server. A colleague checked the logs the last time it occurred (over the weekend) and nothing made it to the debug log. /api/devices also failed to return (which I assume is the issue behind the missing device list). If restarting Redis also fixes this, then maybe there's an issue with the connection between lora-app-server and Redis; I'll check Redis next time it happens here. This has been happening for a few versions now. We've just been dealing with it and figured it might be a stupid configuration issue on our end.

EDIT: Although, I'm looking now and can't see anything Redis-related in
I get a 504 when reading out the events. Perhaps it takes too long somehow?
Hello there, we have the same issue; a restart did fix the problem, however. The time-out error only occurred with /api/devices. We are currently running version 3.3.0 and looking to upgrade to the latest ChirpStack version! In the log I get the following errors/warnings:

Jan 13 10:30:46 ********* lora-app-server[1641]: time="2020-01-13T10:30:46+01:00" level=warning msg="influxdb integration: unhandled type!" type_name="[]string"

Jan 13 10:30:27 ********* lora-app-server[1641]: time="2020-01-13T10:30:27+01:00" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = handle received ping error: get ping lookup error: get ping lookup error: redigo: nil returned" grpc.code=Internal grpc.method=HandleProprietaryUplink grpc.service=as.ApplicationServerService grpc.start_time="2020-01-13T10:30:27+01:00" grpc.time_ms=1.17 peer.address="ip address:40700" span.kind=server system=grpc

Jan 13 10:31:19 ********* lora-app-server[1641]: time="2020-01-13T10:31:19+01:00" level=error msg="handle received ping error: get ping lookup error: get ping lookup error: redigo: nil returned"

Could you explain what happens here so we can resolve this problem the next time it occurs? Thanks in advance!
Hi, we checked the GetDevices method under storage/device.go, manually executed the same SQL query, and it returned just fine. The Postgres connections are also all fine, with no locks and no long-running transactions/sessions. There are no errors in the console except for the log below regarding redigo (the Redis client library). Another thing to note is that once this happens, even adding a new device (POST API) was timing out.

This happened quite a few times last month, and restarting either Redis or the app server fixed the problem temporarily, which is strange. I understand that all the device-session keys are stored in Redis, but I didn't see any references to Redis when the GetDevices API is called. After a few days we encounter the same issue again and keep restarting to fix it, so it obviously feels like some connection issue between Redis and the app server.

I'm not sure if it's happening because of the client buffer limits on Redis; unfortunately there is nothing in the logs on the Redis side. The traffic on our app server is not that high (on average one uplink payload per second) and the payload sizes aren't that big, so I'm not sure whether processing the Redis responses takes long enough to hit the client buffer limits. The "GET" Redis connections are clearly gone when this happens, based on the log "handle received ping error: get ping lookup error". The question would be: even if we hit the limits, shouldn't there be "auto-reconnection" since it's a Redis pool? We are now increasing the buffer sizes and hope the Redis connections don't break again in the future. Any pointers on this would be much appreciated.

Note: the setup is on K8s with the ChirpStack Docker images.
Further analysis after it happened again today: when we ran CLIENT LIST on Redis, we saw the output below, with quite a number of connections issuing PING. I suspect that was the reason for Redis becoming unresponsive. We have added a limit on the active Redis connections in the lora-app-server YAML. Hopefully that resolves the issue.
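For reference, bounding the connection pool in code with the redigo client looks roughly like the sketch below. This is only an illustrative sketch, assuming the redigo library and a local Redis; the field values and address are assumptions, not the actual ChirpStack configuration:

```go
package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

func newPool(addr string) *redis.Pool {
	return &redis.Pool{
		// Keep a small number of idle connections around for reuse.
		MaxIdle: 10,
		// Hard cap on connections handed out at the same time; leaked
		// connections eventually exhaust this cap instead of piling up in Redis.
		MaxActive: 100,
		// With Wait=true, Get() blocks until a connection is returned to the
		// pool rather than opening more than MaxActive connections.
		Wait:        true,
		IdleTimeout: 5 * time.Minute,
		Dial: func() (redis.Conn, error) {
			return redis.Dial("tcp", addr)
		},
	}
}

func main() {
	p := newPool("localhost:6379") // assumed address
	c := p.Get()
	defer c.Close() // returning the connection to the pool is what prevents leaks

	if _, err := c.Do("PING"); err != nil {
		log.Fatal(err)
	}
	log.Println("redis reachable")
}
```

With Wait set to true, a leak would show up as API requests blocking on Get() rather than as an ever-growing CLIENT LIST on the Redis side, which makes the problem easier to spot but does not fix its root cause.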
I looked further into this, based on what @ramuit44 noticed. It seems to be caused by a goroutine + Redis connection pool leak. This problem is particularly relevant to my company because we use the troubleshooting tabs quite extensively when dealing with sensor providers, as well as when diagnosing network issues. Here's how I managed to reproduce the excessive number of connections in Redis:

Context:

Steps:
Expected result:
What actually happens:
Excerpt of GetEventLogForDevice:
Now, if you add a twist to this scenario and after a while the device happens to send an uplink, I can see all the hanging RPCs completing at once and the list of client connections in Redis returning to normal:
@brocaar Any thoughts on this? PS example client list:
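(The GetEventLogForDevice excerpt and the example client list from this comment are not reproduced above.) As a rough, hypothetical sketch of the kind of blocking subscribe loop described here, i.e. a goroutine that only notices its consumer is gone when the next Redis message arrives, something like the following; the function signature, key, and names are illustrative and not the actual ChirpStack code:

```go
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// subscribeEvents is an illustrative stand-in for the pattern described above,
// not the real GetEventLogForDevice implementation.
func subscribeEvents(p *redis.Pool, key string, handle func([]byte) error) error {
	c := p.Get()
	defer c.Close()

	psc := redis.PubSubConn{Conn: c}
	if err := psc.Subscribe(key); err != nil {
		return err
	}

	for {
		switch v := psc.Receive().(type) { // blocks until the next pub/sub message
		case redis.Message:
			// Only when a new message (e.g. an uplink event) is published does
			// this goroutine try to push data to its consumer. If the websocket
			// / RPC stream is already gone, handle returns an error and the
			// function finally returns, releasing the pooled connection --
			// which would match the observation that hanging RPCs only complete
			// after the device sends an uplink.
			if err := handle(v.Data); err != nil {
				return err
			}
		case redis.Subscription, redis.Pong:
			// subscribe confirmations and pongs; nothing to do here
		case error:
			return v
		}
	}
}

func main() {
	pool := &redis.Pool{Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", "localhost:6379") // assumed address
	}}
	// Hypothetical key name, used only for this sketch.
	err := subscribeEvents(pool, "device:0102030405060708:events", func(b []byte) error {
		log.Printf("event: %s", b)
		return nil
	})
	log.Println("subscribe loop ended:", err)
}
```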
Thanks @vfylyk for all your input! 👍 This is really helpful and indicates where the problem is.
Do you have a load-balancer or proxy between the client (browser) and the server (ChirpStack Application Server)? Do you see the same result when there is a direct connection between the client (browser) and the server (ChirpStack Application Server)? What I'm interested in is whether there might be a kind of "zombie" connection between the load-balancer / proxy and the server which keeps the go-routine running. Based on your input I will also do some debugging.
Good point, I do have a load-balancer. As I run the app server in Kubernetes on Amazon in a private subnet, I use an ALB as ingress to make it public and provide TLS. I'll run some tests tomorrow skipping the proxy and load-balancer and will let you know if the same still happens.
Hi @brocaar, I tested the same scenario over a direct connection, bypassing the proxy and load-balancer. What happens is that the websocket connection does stay stable for as long as I stay on the "Device Data" tab, so no restarts every 60 seconds. However, when I move to another tab and the WS connection is closed, the RPCs still hang (i.e. no "finished streaming call with code OK" log entries), even if I kill the

After the device sends one uplink, all hanging RPCs close, as happens in the load-balancer scenario:
As a comparison, when I use the "Lorawan Frames" tab through the AWS load-balancer, we do see the websocket being restarted every 60 seconds; however, whenever the connection is closed in the browser, the "finished call..." log entries appear immediately in the logs, for every one of the closed WS connections:
To give you an update, I'm currently refactoring the Redis-related code. I'm migrating to godoc.org/github.com/go-redis/redis, which also supports Redis Clustering. It also makes the function subscribing to the Redis Pub/Sub a lot more compact, as a lot of magic is handled by the redis library. However, I don't think this solves the issue. What could help is sending a periodic ping from the server to the WebSocket client (browser); in case there is no response, the server kills the connection. It looks like only when the server has something to send to the client (browser) does it find out that the client has gone, after which the RPC completes. Behavior might be different depending on the kind of load-balancer between the client and server.
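As an illustration of that ping/pong idea (not the actual ChirpStack or grpc-websocket-proxy code), a server-side keepalive with the gorilla/websocket package might look roughly like this; the interval and timeout values are assumptions:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 30 * time.Second // assumed ping interval
	pongTimeout  = 10 * time.Second // assumed grace period for the pong reply
)

var upgrader = websocket.Upgrader{}

func handler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade error:", err)
		return
	}
	defer conn.Close()

	// Every pong pushes the read deadline forward; if pongs stop arriving,
	// the next read fails and the handler (and its resources) terminates.
	conn.SetReadDeadline(time.Now().Add(pingInterval + pongTimeout))
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(pingInterval + pongTimeout))
	})

	// Writer goroutine: send a ping on a fixed interval.
	done := make(chan struct{})
	go func() {
		t := time.NewTicker(pingInterval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(pongTimeout)); err != nil {
					conn.Close() // unblock the reader below
					return
				}
			case <-done:
				return
			}
		}
	}()

	// Reader loop: returns with an error once the deadline expires, i.e. the
	// client stopped answering pings, so the connection is considered dead.
	for {
		if _, _, err := conn.ReadMessage(); err != nil {
			break
		}
	}
	close(done)
}

func main() {
	http.HandleFunc("/ws", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```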
I have created the following issue: tmc/grpc-websocket-proxy#22. |
This should be fixed (see pull-request: tmc/grpc-websocket-proxy#23). As this has not yet been merged, I have updated the

Sorry that this took a while to track down. One reason why some people were affected and others not is that when you have set up a proxy which periodically kills the websocket connection, this makes it significantly worse. E.g. when it kills the connection after 1 minute of inactivity, the websocket wrapper might still have 60 internal requests open, each keeping a single Redis connection open.
Is this a bug or a feature request?
Maybe it's a misconfiguration.
What did you expect?
To know what caused this issue.
What happened?
We have 139 active devices registered in our server. Sometimes the device list disappears from the application details page and the API endpoint returns a 504 (gateway time-out) error.
The device list comes back after restarting the Redis server. Maybe someone has encountered the same issue and hopefully knows the solution.
What version are you using?
v3.3.1