failed to stop dataflow #292

Open
meua opened this issue May 19, 2023 · 4 comments
@meua
Contributor

meua commented May 19, 2023

Describe the bug
dora-daemon hangs due to a heartbeat timeout while dora-coordinator keeps running normally. After I restart dora-daemon, the dataflow can no longer be stopped with dora stop uuid.

(dora3.7) jarvis@jia:~/coding/dora_home/dora$ conda activate py310
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli up
started dora coordinator
started dora daemon

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli -V
dora-cli 0.2.3-rc6
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli check
Dora Coordinator: ok
Dora Daemon: ok

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
10af7c98-604d-4808-b48a-7e028cb3d733
  2023-05-19T03:53:57.743423Z  WARN dora_coordinator: daemon at `` did not react as expected to watchdog message

Caused by:
   0: failed to send watchdog message to daemon
   1: Broken pipe (os error 32)

Location:
    /home/jarvis/coding/dora_home/dora/binaries/coordinator/src/lib.rs:550:10
    at binaries/coordinator/src/lib.rs:468

Open a new terminal and kill dora-daemon to simulate the daemon process hanging abnormally:

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ ps -ef | grep dora
jarvis     22117       1  0 11:41 pts/12   00:00:00 dora-coordinator
jarvis     22131       1  0 11:41 pts/12   00:00:01 dora-daemon
jarvis     24461   18206  0 11:53 pts/12   00:00:00 dora-cli start dataflow.yml --attach --hot-reload
jarvis     24464   22131  7 11:53 pts/12   00:00:01 python3 -c import dora; dora.start_runtime() # webcam
jarvis     24467   22131  8 11:53 pts/12   00:00:01 python3 -c import dora; dora.start_runtime() # plot
jarvis     24598   22333  0 11:53 pts/3    00:00:00 grep --color=auto dora
(py310) jarvis@jia:~/coding/dora_home/dora$ kill -15 22131
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli stop 
> Choose dataflow to stop: [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
no daemon connection
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli up
started dora daemon
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli check
Dora Coordinator: ok
Dora Daemon: ok

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli stop
> Choose dataflow to stop: [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
failed to stop dataflow
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli -V
dora-cli 0.2.3-rc6
(py310) jarvis@jia:~/coding/dora_home/dora$ 

To Reproduce
Steps to reproduce the behavior:

  1. Start the coordinator and daemon: dora-cli up
  2. Start a new dataflow: dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
  3. Kill dora-daemon: kill -15 pid_dora_daemon
  4. Restart the daemon: dora-cli up
  5. Stop the dataflow: dora-cli stop uuid_your_dataflow
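
The steps above can be scripted roughly as follows. This is only a sketch: `find_daemon_pid`, `current_daemon_pid`, and `repro_commands` are hypothetical helper names, the PID is parsed from `ps -ef` output as in the session above, and only the `dora-cli` subcommands and flags that appear in this report are used. Because `dora-cli start --attach` blocks, step 2 has to run in a separate terminal or background process, and the dataflow UUID is only known once that command has printed it.

```python
import subprocess


def find_daemon_pid(ps_output: str) -> int:
    """Return the PID of the dora-daemon process found in `ps -ef` output."""
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].strip() == "dora-daemon":
            return int(fields[1])
    raise LookupError("dora-daemon not found in ps output")


def current_daemon_pid() -> int:
    """Run `ps -ef` and look up the daemon PID (requires a running daemon)."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return find_daemon_pid(result.stdout)


def repro_commands(dataflow_path: str, daemon_pid: int, dataflow_uuid: str):
    """Return the five reproduction steps as argv lists, in order."""
    return [
        ["dora-cli", "up"],                                   # step 1
        ["dora-cli", "start", dataflow_path,
         "--attach", "--hot-reload"],                         # step 2 (blocks)
        ["kill", "-15", str(daemon_pid)],                     # step 3
        ["dora-cli", "up"],                                   # step 4
        ["dora-cli", "stop", dataflow_uuid],                  # step 5
    ]
```

Each entry can be passed to `subprocess.run` one at a time. Step 5 is the one that prints "failed to stop dataflow" in the session above.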

Expected behavior
I expect dora-coordinator and dora-daemon to live and die together and to restart automatically when the heartbeat times out. Alternatively, when dora-daemon hangs, the dataflow should be destroyed as well.
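
To make the alternative concrete (the dataflow is destroyed together with a hung daemon), here is a minimal sketch of how coordinator-side bookkeeping might work. It is not how dora-coordinator is implemented: `DaemonRegistry`, its methods, and the 5-second timeout are all assumptions made for illustration. The idea is that the coordinator records the last heartbeat per daemon and, once a daemon misses its deadline, forgets the dataflows that daemon owned so they no longer show up as running.

```python
import time


class DaemonRegistry:
    """Hypothetical coordinator-side bookkeeping; not dora's actual code."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed heartbeat deadline
        self.clock = clock              # injectable for testing
        self.last_heartbeat = {}        # daemon_id -> last heartbeat time
        self.dataflows = {}             # dataflow_uuid -> owning daemon_id

    def heartbeat(self, daemon_id):
        """Record that a daemon is alive right now."""
        self.last_heartbeat[daemon_id] = self.clock()

    def start_dataflow(self, dataflow_uuid, daemon_id):
        """Remember which daemon runs which dataflow."""
        self.dataflows[dataflow_uuid] = daemon_id

    def reap_dead_daemons(self):
        """Drop daemons past the deadline and return their dataflow UUIDs."""
        now = self.clock()
        dead = {
            daemon_id
            for daemon_id, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_s
        }
        removed = sorted(
            uuid for uuid, owner in self.dataflows.items() if owner in dead
        )
        for uuid in removed:
            del self.dataflows[uuid]
        for daemon_id in dead:
            del self.last_heartbeat[daemon_id]
        return removed

    def running(self):
        """List dataflows still considered running."""
        return sorted(self.dataflows)
```

With something along these lines, dora-cli list would stop reporting a dataflow whose daemon has gone away, instead of leaving an entry that can never be stopped.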

Environments (please complete the following information):

  • System info: Ubuntu 22.04
  • Dora version: v0.2.3-rc6
@haixuanTao
Collaborator

Can I ask why you are killing the daemon?

We do not support auto-restarting the daemon at the moment.

@meua
Contributor Author

meua commented May 24, 2023

Can I ask why you are killing the daemon?

We do not support auto-restarting the daemon at the moment.

Because custom nodes and operators can, for various reasons, cause dora-daemon to hang for no apparent reason. I kill the dora-daemon process to simulate this situation.
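
A side note on the simulation: kill -15 sends SIGTERM, which asks the daemon to exit rather than making it hang. A closer approximation of a hang may be to suspend the process with SIGSTOP, so that it stays alive but stops responding. The sketch below shows this with Python's standard `os.kill`. The helper names `freeze_process` and `thaw_process` are hypothetical, and how dora reacts to a suspended daemon has not been checked here.

```python
import os
import signal


def freeze_process(pid: int) -> None:
    """Suspend a process so it stays alive but makes no progress."""
    os.kill(pid, signal.SIGSTOP)


def thaw_process(pid: int) -> None:
    """Resume a process that was suspended with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)
```

Calling `freeze_process` with the dora-daemon PID would take the place of step 3, and `thaw_process` would let the same daemon continue afterwards instead of starting a new one.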

@haixuanTao
Collaborator

Do you have any ideas or context you can share about why dora-daemon hangs for no apparent reason?

@meua
Contributor Author

meua commented Jun 5, 2023

Do you have any ideas or context you can share about why dora-daemon hangs for no apparent reason?

I am not running in source debug mode. After dora up, I ran RUST_LOG=true dora start graphs/tutorials/webcam.yaml --attach --hot-reload --name webcam, and the dataflow cannot be stopped:

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora stop
> Choose dataflow to stop: [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora -V
dora-cli 0.2.3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora logs 2eeba0b6-4cfa-438a-bc7f-0747664e06f3 webcam
>     │ Logs from webcam.
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ could not get webcam.
   2 │ could not get webcam.
   3 │ could not get webcam.
   4 │ could not get webcam.
   5 │ could not get webcam.
   6 │ could not get webcam.
   7 │ could not get webcam.
   8 │ could not get webcam.
   9 │ could not get webcam.
  10 │ could not get webcam.
  11 │ could not get webcam.
  12 │ could not get webcam.
  13 │ could not get webcam.
  14 │ could not get webcam.
