
When the number of active schedulers is 0, the running peer will crash on its own #3158

Open
karlhjm opened this issue Apr 1, 2024 · 5 comments

@karlhjm
Contributor

karlhjm commented Apr 1, 2024

Bug report:

When the number of active schedulers becomes 0, the peer continuously crashes on its own, causing machines that use Docker mode to be unable to pull images (the docker proxy exists, but the daemon crashes).

Here are the logs from before and after the peer restart:
after-crash.log
before-crash.log

Expected behavior:

When the number of active schedulers is 0, the running peer should continue to run normally and back-source on its own.

How to reproduce it:

Scale the scheduler to 0 replicas and wait for up to 5 minutes (the Redis cache expiration time).

Environment:

  • Dragonfly version: v2.1.0
  • OS: Ubuntu 16.04
@karlhjm karlhjm added the bug label Apr 1, 2024
@gaius-qi
Member

gaius-qi commented Apr 1, 2024

  1. There was no scheduler available when the daemon started; it is not that the scheduler became unavailable after a successful startup.
{"log":"2024-04-01T12:22:12.026Z\u0009WARN\u0009zap/client_interceptors.go:52\u0009finished client unary call\u0009{\"system\": \"grpc\", \"span.kind\": \"client\", \"grpc.service\": \"scheduler.Scheduler\", \"grpc.method\": \"AnnounceHost\", \"error\": \"rpc error: code = Unavailable desc = last connection error: connection error: desc = \\\"transport: Error while dialing: dial tcp 202.168.114.113:8002: connect: connection refused\\\"\", \"grpc.code\": \"Unavailable\", \"grpc.time_ms\": 1014.044}\n","stream":"stderr","time":"2024-04-01T12:22:12.026825095Z"}
  2. The panic when shutting down the daemon has been fixed in the newer version.

@karlhjm
Contributor Author

karlhjm commented Apr 1, 2024

> 1. There was no scheduler available when the daemon started; it is not that the scheduler became unavailable after a successful startup.
> 2. The panic when shutting down the daemon has been fixed in the newer version.

Sorry, I found out that I am actually using v2.1.10, not v2.1.0.

In which version was the second problem fixed?

@karlhjm
Contributor Author

karlhjm commented Apr 1, 2024

I am using the latest version, but the peer still crashes when I scale the scheduler to 0. I found the following logs in the peer:

{"log":"2024-04-01T16:11:18.602Z\u0009WARN\u0009dependency/dependency.go:149\u0009receive signal: terminated\n","stream":"stderr","time":"2024-04-01T16:11:18.60252273Z"}
{"log":"d7y.io/dragonfly/v2/cmd/dependency.SetupQuitSignalHandler.func1\n","stream":"stderr","time":"2024-04-01T16:11:18.602549283Z"}
{"log":"d7y.io/dragonfly/v2/cmd/dependency.SetupQuitSignalHandler.func1\n","stream":"stderr","time":"2024-04-01T16:11:19.618173676Z"}
{"log":"2024-04-01T16:11:19.618Z\u0009WARN\u0009dependency/dependency.go:153\u0009handle signal: terminated finish\n","stream":"stderr","time":"2024-04-01T16:11:19.618579754Z"}
{"log":"d7y.io/dragonfly/v2/cmd/dependency.SetupQuitSignalHandler.func1\n","stream":"stderr","time":"2024-04-01T16:11:19.618590611Z"}

The kubelet logs below show that the container was killed because the gRPC liveness probe failed multiple times:

Liveness probe for "dragonfly-dfdaemon-m76dl_dragonfly-system(7b1c5b41-9bda-4ed9-98f8-57655c5fa0c1):dfdaemon" failed (failure): service unhealthy (responded with "NOT_SERVING")

Killing unwanted container "dfdaemon"(id={"docker" "fc63d3f4032abca8e59bc1ea9cac3cfa4cd4f85e14c4331a431913b5d81f266b"}) for pod "dragonfly-dfdaemon-m76dl_dragonfly-system(7b1c5b41-9bda-4ed9-98f8-57655c5fa0c1)"
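
For reference, the kubelet gRPC liveness probe is essentially issuing the standard grpc.health.v1.Health/Check RPC against the daemon and getting NOT_SERVING back. A minimal sketch of that check in Go (the address and port are placeholders, not the actual deployment values):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder endpoint; the real dfdaemon gRPC address/port depends on the deployment.
	conn, err := grpc.Dial("127.0.0.1:65001", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Same RPC a gRPC liveness probe performs: grpc.health.v1.Health/Check.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.GetStatus()) // prints SERVING or NOT_SERVING
}
```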

@gaius-qi
Member

gaius-qi commented Apr 3, 2024

> I am using the latest version, but the peer still crashes when I scale the scheduler to 0. […]

Without a scheduler, the dfdaemon health check will fail and Kubernetes will restart the daemon. If you do not want it to be restarted, you can remove the liveness probe.
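
For context, this is the standard grpc.health.v1 health-reporting mechanism rather than the actual dfdaemon code: when the daemon considers itself unhealthy it flips its status to NOT_SERVING, and the kubelet gRPC probe then fails. A minimal sketch in Go:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder port; not the actual dfdaemon configuration.
	lis, err := net.Listen("tcp", ":65001")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)

	// While the daemon considers itself healthy, report SERVING.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	// If it later decides it is unhealthy (for example, no scheduler is reachable),
	// flipping to NOT_SERVING makes the gRPC liveness probe fail:
	// hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

	log.Fatal(srv.Serve(lis))
}
```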

@karlhjm
Contributor Author

karlhjm commented Apr 6, 2024

> Without a scheduler, the dfdaemon health check will fail and Kubernetes will restart the daemon. If you do not want it to be restarted, you can remove the liveness probe.

Thanks, I found the relevant feature: #2130.
Could @jim3ma please explain the reasoning behind this? In my understanding, even if the scheduler fails, it should not affect the status of dfdaemon, since dfdaemon can back-source by itself.
