Pro Version random GRPC connection errors in the middle of task execution. #1391
Comments
Can you dump the master node docker logs?
Below are the logs from the master node for the most recent such occurrence. The logs span from 1 minute before it happened to 1 minute after. The Crawlab container rebuilt itself at exactly 2023-09-13T16:30:47.886726955Z. Around the same time, as we can see on line 15 of the logs, Logs
@tikazyq It is the same error on the pro version, and it happens frequently.
I think I have pinpointed it: it tends to happen while an API call to the /metrics endpoint is being made. Shortly afterwards the container tends to crash and rebuild itself, with errors akin to the logs provided in my previous comment.
@tikazyq same issue on pro version
@tikazyq hi, any progress on this task?
You can switch to the latest version of docker image to resolve this issue.
@tikazyq hey, we've updated images to
@tikazyq hi again, do you have any update to the above message?
@tikazyq Hi, could you tell us how to disable the metrics flag?
We have observed that there is some kind of correlation between the It would be really helpful to know of a way to disable it.
Describe the bug
While executing a task that runs for a long time (>1 day), the docker logs tend to show various errors related to the GRPC connection, such as
rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp <MASTER_NODE_IP_ADDRESS>:<GRPC_PORT>: connect: connection refused
or
connection reset by peer
Both the worker node on which the task runs and the master node show continuous, uninterrupted availability, network- and hardware-wise. Sometimes these errors are accompanied by a failure to verify the license:
error verify license error: Post \"https://license.crawlab.cn/release/license/verify\": net/http: TLS handshake timeout. retry in 5 seconds
Expected behavior
Nodes don't lose the GRPC connection randomly.
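One way to check whether the master's GRPC port is really unreachable when these errors appear is to run a small probe on the worker and compare its timestamps with the docker logs. This is a minimal sketch, assuming Python 3 is available on the worker. `probe` and `watch` are hypothetical helper names, and the host and port are placeholders, not Crawlab settings.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt to host:port (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # Nothing is listening on the port, e.g. the container is restarting.
        return "refused"
    except ConnectionResetError:
        return "reset"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def watch(host: str, port: int, interval: float = 5.0, attempts: int = 3) -> list:
    """Probe repeatedly and print UTC-timestamped results (hypothetical helper)."""
    results = []
    for i in range(attempts):
        status = probe(host, port)
        stamp = datetime.now(timezone.utc).isoformat()
        print(f"{stamp} {host}:{port} {status}")
        results.append(status)
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Running `watch("<MASTER_NODE_IP_ADDRESS>", <GRPC_PORT>)` with the real values during a long task gives a timeline to line up against the master's container restarts. It only tests TCP reachability, not the GRPC layer itself.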