
When the number of active schedulers is 0, the running peer will crash on its own #3158

Open
karlhjm opened this issue Apr 1, 2024 · 5 comments

@karlhjm
Contributor

karlhjm commented Apr 1, 2024

Bug report:

When the number of active schedulers becomes 0, the peer continuously crashes on its own, causing machines that use Docker mode to be unable to pull images (the docker proxy exists, but the daemon crashes).

Here are the logs from before and after the peer restart:
after-crash.log
before-crash.log

Expected behavior:

When the number of active schedulers is 0, the running peer should continue to run normally and back-source on its own.

How to reproduce it:

Scale the scheduler to 0 replicas and wait for up to 5 minutes (the Redis cache expiration time).

Environment:

  • Dragonfly version: v2.1.0
  • OS: Ubuntu 16.04
@karlhjm karlhjm added the bug label Apr 1, 2024
@gaius-qi
Member

gaius-qi commented Apr 1, 2024

  1. There was no scheduler available when the daemon started; it is not that the scheduler became unavailable after a successful startup.
{"log":"2024-04-01T12:22:12.026Z\u0009WARN\u0009zap/client_interceptors.go:52\u0009finished client unary call\u0009{\"system\": \"grpc\", \"span.kind\": \"client\", \"grpc.service\": \"scheduler.Scheduler\", \"grpc.method\": \"AnnounceHost\", \"error\": \"rpc error: code = Unavailable desc = last connection error: connection error: desc = \\\"transport: Error while dialing: dial tcp 202.168.114.113:8002: connect: connection refused\\\"\", \"grpc.code\": \"Unavailable\", \"grpc.time_ms\": 1014.044}\n","stream":"stderr","time":"2024-04-01T12:22:12.026825095Z"}
  2. The panic when shutting down the daemon has been fixed in the newer version.

@karlhjm
Contributor Author

karlhjm commented Apr 1, 2024

> 1. There was no scheduler available when the daemon started; it is not that the scheduler became unavailable after a successful startup.
> 2. The panic when shutting down the daemon has been fixed in the newer version.

Sorry, I found out that I am actually using v2.1.10, not v2.1.0.

In which version was the second problem fixed?

@karlhjm
Contributor Author

karlhjm commented Apr 1, 2024

I am using the latest version, but the peer still crashes when I scale the scheduler to 0. I found the following logs in the peer:

{"log":"2024-04-01T16:11:18.602Z\u0009WARN\u0009dependency/dependency.go:149\u0009receive signal: terminated\n","stream":"stderr","time":"2024-04-01T16:11:18.60252273Z"}
{"log":"d7y.io/dragonfly/v2/cmd/dependency.SetupQuitSignalHandler.func1\n","stream":"stderr","time":"2024-04-01T16:11:18.602549283Z"}
{"log":"d7y.io/dragonfly/v2/cmd/dependency.SetupQuitSignalHandler.func1\n","stream":"stderr","time":"2024-04-01T16:11:19.618173676Z"}
{"log":"2024-04-01T16:11:19.618Z\u0009WARN\u0009dependency/dependency.go:153\u0009handle signal: terminated finish\n","stream":"stderr","time":"2024-04-01T16:11:19.618579754Z"}
{"log":"d7y.io/dragonfly/v2/cmd/dependency.SetupQuitSignalHandler.func1\n","stream":"stderr","time":"2024-04-01T16:11:19.618590611Z"}

The kubelet logs below show that the container was killed because the gRPC liveness probe failed multiple times:

Liveness probe for "dragonfly-dfdaemon-m76dl_dragonfly-system(7b1c5b41-9bda-4ed9-98f8-57655c5fa0c1):dfdaemon" failed (failure): service unhealthy (responded with "NOT_SERVING")

Killing unwanted container "dfdaemon"(id={"docker" "fc63d3f4032abca8e59bc1ea9cac3cfa4cd4f85e14c4331a431913b5d81f266b"}) for pod "dragonfly-dfdaemon-m76dl_dragonfly-system(7b1c5b41-9bda-4ed9-98f8-57655c5fa0c1)"
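
For reference, the kubelet gRPC liveness probe is essentially issuing the standard grpc.health.v1.Health/Check RPC against the daemon and getting NOT_SERVING back. A minimal sketch of that check in Go (the address and port are placeholders, not the actual deployment values):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder endpoint; the real dfdaemon gRPC address/port depends on the deployment.
	conn, err := grpc.Dial("127.0.0.1:65001", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Same RPC a gRPC liveness probe performs: grpc.health.v1.Health/Check.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.GetStatus()) // prints SERVING or NOT_SERVING
}
```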

@gaius-qi
Member

gaius-qi commented Apr 3, 2024

> I am using the latest version, but the peer still crashes when I scale the scheduler to 0. […]

Without a scheduler, the dfdaemon health check will fail and Kubernetes will restart the daemon. If you do not want it to be restarted, you can remove the liveness probe.
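
For context, this is the standard grpc.health.v1 health-reporting mechanism rather than the actual dfdaemon code: when the daemon considers itself unhealthy it flips its status to NOT_SERVING, and the kubelet gRPC probe then fails. A minimal sketch in Go:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Placeholder port; not the actual dfdaemon configuration.
	lis, err := net.Listen("tcp", ":65001")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)

	// While the daemon considers itself healthy, report SERVING.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	// If it later decides it is unhealthy (for example, no scheduler is reachable),
	// flipping to NOT_SERVING makes the gRPC liveness probe fail:
	// hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)

	log.Fatal(srv.Serve(lis))
}
```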

@karlhjm
Contributor Author

karlhjm commented Apr 6, 2024

> Without a scheduler, the dfdaemon health check will fail and Kubernetes will restart the daemon. If you do not want it to be restarted, you can remove the liveness probe.

Thanks, I found the relevant feature: #2130.
Could @jim3ma please explain the reasoning behind this? In my understanding, even if the scheduler fails, it should not affect the status of dfdaemon, since dfdaemon can back-source by itself.
