Describe the bug
dora-daemon goes down after a heartbeat timeout while dora-coordinator keeps running normally. After I restart dora-daemon, the dataflow cannot be stopped with dora stop uuid.
(dora3.7) jarvis@jia:~/coding/dora_home/dora$ conda activate py310
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli up
started dora coordinator
started dora daemon
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli -V
dora-cli 0.2.3-rc6
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli check
Dora Coordinator: ok
Dora Daemon: ok
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
10af7c98-604d-4808-b48a-7e028cb3d733
2023-05-19T03:53:57.743423Z WARN dora_coordinator: daemon at `` did not react as expected to watchdog message
Caused by:
0: failed to send watchdog message to daemon
1: Broken pipe (os error 32)
Location:
/home/jarvis/coding/dora_home/dora/binaries/coordinator/src/lib.rs:550:10
at binaries/coordinator/src/lib.rs:468
Open a new terminal and kill dora-daemon to simulate the daemon process going down abnormally:
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ ps -ef | grep dora
jarvis 22117 1 0 11:41 pts/12 00:00:00 dora-coordinator
jarvis 22131 1 0 11:41 pts/12 00:00:01 dora-daemon
jarvis 24461 18206 0 11:53 pts/12 00:00:00 dora-cli start dataflow.yml --attach --hot-reload
jarvis 24464 22131 7 11:53 pts/12 00:00:01 python3 -c import dora;dora.start_runtime() # webcam
jarvis 24467 22131 8 11:53 pts/12 00:00:01 python3 -c import dora;dora.start_runtime() # plot
jarvis 24598 22333 0 11:53 pts/3 00:00:00 grep --color=auto dora
(py310) jarvis@jia:~/coding/dora_home/dora$ kill -15 22131
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli stop
> Choose dataflow to stop: [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
no daemon connection
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli up
started dora daemon
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli check
Dora Coordinator: ok
Dora Daemon: ok
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli stop
> Choose dataflow to stop: [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
failed to stop dataflow
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli -V
dora-cli 0.2.3-rc6
(py310) jarvis@jia:~/coding/dora_home/dora$
To Reproduce
Steps to reproduce the behavior:
Start the dora coordinator and daemon: dora-cli up
Start a new dataflow: dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
Expected behavior
I expect dora-coordinator and dora-daemon to live and die together and to restart automatically when the heartbeat times out. Alternatively, when dora-daemon goes down, its dataflows should be destroyed as well.
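To make the second expectation concrete, here is a minimal sketch of coordinator-side bookkeeping that drops a daemon's dataflows once its heartbeat is overdue. It is not dora's actual implementation: the class `HeartbeatRegistry`, its method names, and the timeout value are all hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DaemonState:
    last_heartbeat: float
    dataflows: set = field(default_factory=set)


class HeartbeatRegistry:
    """Hypothetical bookkeeping sketch; not taken from the dora code base."""

    def __init__(self, timeout_s=15.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed value, not dora's real timeout
        self.clock = clock          # injectable so the logic can be tested
        self.daemons = {}

    def heartbeat(self, daemon_id):
        # Register the daemon on first contact, otherwise refresh its timestamp.
        state = self.daemons.get(daemon_id)
        if state is None:
            self.daemons[daemon_id] = DaemonState(last_heartbeat=self.clock())
        else:
            state.last_heartbeat = self.clock()

    def start_dataflow(self, daemon_id, dataflow_id):
        self.daemons[daemon_id].dataflows.add(dataflow_id)

    def reap_dead_daemons(self):
        # Remove daemons whose heartbeat is overdue and return the
        # dataflows that were running on them.
        now = self.clock()
        orphaned = []
        for daemon_id in list(self.daemons):
            state = self.daemons[daemon_id]
            if now - state.last_heartbeat > self.timeout_s:
                orphaned.extend(sorted(state.dataflows))
                del self.daemons[daemon_id]
        return orphaned

    def running_dataflows(self):
        return sorted(df for s in self.daemons.values() for df in s.dataflows)
```

With bookkeeping like this, a dataflow whose daemon disappeared would no longer show up as running, instead of staying listed and refusing to stop as in the session above.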
Environments (please complete the following information):
System info: Ubuntu 22.04
Dora version: v0.2.3-rc6
We do not support auto-restarting daemon at the moment.
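Until something like that exists, an external supervisor is one possible workaround. The sketch below restarts a child process each time it exits, up to a limit. The function `supervise` and its parameters are hypothetical, and running the daemon as a bare `dora-daemon` command is an assumption based on the `ps` output in this report. Note that restarting the daemon alone does not clear the stale dataflow described above.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, backoff_s=1.0, sleep=time.sleep):
    """Run cmd and restart it whenever it exits, up to max_restarts times.

    Returns the exit code of every run, in order.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(cmd)
        exit_codes.append(proc.wait())  # block until the child exits
        if attempt < max_restarts:
            sleep(backoff_s)            # short pause before restarting
    return exit_codes
```

A call such as `supervise(["dora-daemon"])` would then keep the daemon process alive across a few crashes; the command line is an assumption, not a documented invocation.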
Custom nodes and operators can sometimes cause dora-daemon to hang unexpectedly. I killed the dora-daemon process to simulate this situation.
Do you have any ideas or context you can share about why dora-daemon hangs unexpectedly?
I am not running in source debug mode. After dora up, I ran RUST_LOG=true dora start graphs/tutorials/webcam.yaml --attach --hot-reload --name webcam, and the dataflow cannot be stopped:
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora stop
> Choose dataflow to stop: [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora -V
dora-cli 0.2.3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora logs 2eeba0b6-4cfa-438a-bc7f-0747664e06f3 webcam
> │ Logs from webcam.
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ could not get webcam.
2 │ could not get webcam.
3 │ could not get webcam.
4 │ could not get webcam.
5 │ could not get webcam.
6 │ could not get webcam.
7 │ could not get webcam.
8 │ could not get webcam.
9 │ could not get webcam.
10 │ could not get webcam.
11 │ could not get webcam.
12 │ could not get webcam.
13 │ could not get webcam.
14 │ could not get webcam.
To Reproduce
Steps to reproduce the behavior:
dora-cli up
dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
kill -15 pid_dora_daemon
dora-cli up
dora-cli stop uuid_your_dataflow
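The steps above could be scripted roughly as follows. This is only a sketch: the helper names (`parse_dataflow_list`, `cli`, `reproduce`), the sleep durations, and the use of `pgrep -x dora-daemon` to find the daemon PID are assumptions. The dataflow path and the `dora-cli list` output format are taken from the terminal session in this report and may need adjusting.

```python
import os
import re
import signal
import subprocess
import time

# Path as used in the terminal session of this report; adjust if needed.
DATAFLOW = "examples/python-operator-dataflow/dataflow.yml"

# Matches lines such as "- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733".
LIST_LINE = re.compile(r"^- \[(?P<name>[^\]]*)\] (?P<uuid>[0-9a-f-]{36})\s*$")


def parse_dataflow_list(output):
    """Extract (name, uuid) pairs from `dora-cli list` output."""
    pairs = []
    for line in output.splitlines():
        match = LIST_LINE.match(line.strip())
        if match:
            pairs.append((match.group("name"), match.group("uuid")))
    return pairs


def cli(*args):
    """Run a dora-cli subcommand and capture its output."""
    return subprocess.run(["dora-cli", *args], capture_output=True, text=True)


def reproduce():
    cli("up")
    attached = subprocess.Popen(
        ["dora-cli", "start", DATAFLOW, "--attach", "--hot-reload"]
    )
    time.sleep(5)  # arbitrary wait for the dataflow to come up
    uuid = parse_dataflow_list(cli("list").stdout)[0][1]

    # Assumption: the daemon process is named exactly "dora-daemon".
    pgrep = subprocess.run(
        ["pgrep", "-x", "dora-daemon"], capture_output=True, text=True, check=True
    )
    os.kill(int(pgrep.stdout.split()[0]), signal.SIGTERM)  # same as kill -15
    time.sleep(2)  # arbitrary wait for the daemon to exit

    cli("up")  # restart the daemon
    stop_result = cli("stop", uuid)
    still_listed = uuid in cli("list").stdout
    attached.terminate()
    return stop_result, still_listed
```

Calling `reproduce()` on an affected version should return `still_listed == True`, matching the session above where the dataflow stays in `dora-cli list` after the failed stop.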