
tqdm bar seems stuck at 0/50000 and fails to continue the Q-labeling #15

Closed
SunHaoOne opened this issue Jun 7, 2021 · 10 comments

@SunHaoOne

Make sure you have read the FAQ before posting.
Thanks!
Hello,
After running all the previous steps correctly as you suggested, I have trained the ego-model and successfully collected about 186 GB of NoCrash data, so the next step is to label Q. I run $ python -m rails.data_phase2 --num-workers=4, and it shows the following:

|         | 0/53267 [00:00<?, ?it/s]

I have also checked my GPUs, and they show the Ray workers running:

+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    Off  | 00000000:86:00.0 Off |                  N/A |
| 50%   71C    P2   153W / 215W |   2480MiB /  7952MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    Off  | 00000000:AF:00.0 Off |                  N/A |
| 51%   72C    P2   152W / 215W |   2480MiB /  7952MiB |     98%      Default |

|    2     14845      C   ray::RAILSActionLabeler.run()               1229MiB |
|    2     14855      C   ray::RAILSActionLabeler.run()               1229MiB |
|    3     14898      C   ray::RAILSActionLabeler.run()               1229MiB |
|    3     14919      C   ray::RAILSActionLabeler.run()               1229MiB |

When I pressed CTRL+C, it showed:

Traceback (most recent call last):
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 58, in <module>
    main(args)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 24, in main
    current_frames = ray.get(logger.total_frames.remote())
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/worker.py", line 1372, in get
    object_refs, timeout=timeout)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/worker.py", line 304, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 869, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 142, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/node.py", line 868, in _kill_process_type
    process.wait(timeout_seconds)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)
KeyboardInterrupt
  0%|                              

It seems it is just waiting now? (But we do not need to launch CARLA in this phase.)
The data-dir is set to the directory of the collected data, and config.yaml is set to the NoCrash config
(default='/home/shy/Desktop/WorldOnRails/experiments/config_nocrash.yaml'); I just copied config.yaml into the experiments folder. Thanks a lot!

@dotchen dotchen closed this as completed Jun 7, 2021
@dotchen
Owner

dotchen commented Jun 7, 2021

Just wait longer. Or, if you have to eyeball the bar moving, set --num-per-log=1
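For context, a minimal sketch of why the bar can look frozen (a simplified stand-in, not the actual rails/data_phase2.py logic): the bar is only advanced once every num_per_log labeled frames, so a large default keeps it at 0 for a long time even while the workers are busy.

```python
# Minimal sketch (hypothetical; the real rails/data_phase2.py may differ):
# the main loop only refreshes the tqdm bar every `num_per_log` frames,
# so a large value makes the bar look stuck even while workers are busy.
import time
from tqdm import tqdm

def label_frames(total_frames, num_per_log=100):
    pbar = tqdm(total=total_frames)
    done = 0
    while done < total_frames:
        time.sleep(0.01)               # stand-in for expensive Q-labeling work
        done += 1
        if done % num_per_log == 0:
            pbar.update(num_per_log)   # the bar only moves here
    pbar.close()

label_frames(1000, num_per_log=1)  # with num_per_log=1 the bar moves every frame
```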

@SunHaoOne
Author

SunHaoOne commented Jun 7, 2021

Just wait longer. Or, if you have to eyeball the bar moving, set --num-per-log=1

Thanks for your reply. I have tried adding this argument, but still nothing happens; it shows the same message as before. And last time when I pressed CTRL+C it showed: | 0/53267 [1:20:03<?, ?it/s]. Maybe one loop is just too long? (I have also checked the data files, and it seems no file was modified today.) I will wait longer, thanks!
Later I found that some data was probably lost. I deleted the latest dataset folder, and finally it works, now showing 81/52736 [01:41<16:43:51, 1.14s/it].
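In case it helps others, a minimal sketch of the kind of integrity check that turned up the bad folder. It assumes each trajectory is a subfolder holding an LMDB database with a 'len' key, which is an assumption about the on-disk layout rather than the confirmed WorldOnRails format:

```python
# Minimal sketch, assuming each trajectory is a subfolder containing an LMDB
# database with a 'len' entry (this layout is an assumption; adjust to the
# actual dataset format).
import os
import lmdb

def find_broken_trajectories(data_dir):
    broken = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if not os.path.isdir(path):
            continue
        try:
            # readonly + lock=False so running labeling workers are not disturbed
            with lmdb.open(path, readonly=True, lock=False) as env:
                with env.begin() as txn:
                    if txn.get(b'len') is None:
                        broken.append((name, 'missing len key'))
        except lmdb.Error as err:
            broken.append((name, str(err)))
    return broken

print(find_broken_trajectories('/path/to/collected/data'))  # placeholder path
```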

@dotchen
Owner

dotchen commented Jun 8, 2021

Gotcha. One thing I forgot to mention: if something is odd, set local_mode=True and --num-runners=1 to debug.
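For reference, a minimal sketch of what this debug setup looks like in Ray (the Labeler actor below is a hypothetical stand-in, not the actual RAILSActionLabeler): with local_mode=True everything runs sequentially in the driver process, so exceptions and prints surface immediately.

```python
# Minimal sketch of the suggested debug setup: local_mode runs every task and
# actor in the driver process, so errors are raised directly where you can see them.
import ray

ray.init(local_mode=True)  # sequential, in-process execution for debugging

@ray.remote
class Labeler:  # hypothetical stand-in for the labeling actor
    def run(self, frame_id):
        # stand-in for Q-labeling work; any exception now raises in the driver
        return frame_id * 2

labeler = Labeler.remote()               # with one worker the order is deterministic
print(ray.get(labeler.run.remote(21)))   # -> 42
ray.shutdown()
```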

@SunHaoOne
Author

Gotcha. One thing I forgot to mention: if something is odd, set local_mode=True and --num-runners=1 to debug.

Thanks! I have checked the link here: https://github.com/dotchen/WorldOnRails/issues/6. I found that when I set --num-workers=4, it always shows the output below, with the 4 workers running on GPU 3 and GPU 4. So later I tried setting --num-workers=2; the 2 workers run on one GPU, and the bar seems to move faster than before. I guess this error is caused by Ray running on a different GPU. (But when collecting data, the error did not occur.)

(world_on_rails) shy@amax2080-3:~/Desktop/WorldOnRails$ python -m rails.data_phase2 --num-workers=2
                                                                    | 13500/52736 [1:06:58<3:33:48,  3.06it/s]
(world_on_rails) shy@amax2080-3:~/Desktop/WorldOnRails$ python -m rails.data_phase2 --num-workers=4 --num-per-log=1
  5%|████▏                                                                                 | 2544/52736 [53:47<17:39:15,  1.27s/it]2021-06-08 09:44:29,882	WARNING worker.py:1034 -- The node with node id cf898cd2ae29ef3ad62dd5e886bbe935399d3e88 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Traceback (most recent call last):
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 58, in <module>
    main(args)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 24, in main
    current_frames = ray.get(logger.total_frames.remote())
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
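If the suspicion is that workers land on unintended GPUs, a minimal sketch of how Ray maps actors to GPUs (a generic illustration, not the actual RAILSActionLabeler code): declaring num_gpus on the actor makes Ray reserve a device and set CUDA_VISIBLE_DEVICES per worker.

```python
# Minimal sketch (generic Ray usage, not the actual RAILSActionLabeler): each
# actor requests one GPU, so Ray pins it to a device via CUDA_VISIBLE_DEVICES.
import os
import ray

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
class GpuWorker:
    def which_gpu(self):
        # GPU ids Ray reserved for this actor, plus what CUDA actually sees
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

workers = [GpuWorker.remote() for _ in range(2)]
print(ray.get([w.which_gpu.remote() for w in workers]))
ray.shutdown()
```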

@varunjammula

I have been running the code for 3 days now on 6 workers. The script does not seem to end. I enabled local mode and no error was shown in the logs. I used the released 1M dataset. How long did the labeling take for you?

@dotchen
Owner

dotchen commented Jul 18, 2021

I need more information to help you debug. Can you tell me what the progress bar says? Or does the progress bar completely freeze? Also, do you see any GPU utilization while running the script?

@dotchen dotchen reopened this Jul 18, 2021
@dotchen
Owner

dotchen commented Jul 18, 2021

Also, you shouldn't use local_mode when num_workers is not 1.

@varunjammula

varunjammula commented Jul 19, 2021

Hi,

When I set local_mode=False, I encounter the following issue with num_workers > 1.

[screenshot of the error]

If I set local_mode=True, the program seems to run but does not use multiple GPUs. I think it is taking a lot of time to create the worker threads, and I do not see the tqdm bar at all. Do you have any insight into this? @dotchen

I am running the code on a very small subset of the data to figure out the issue.

@dotchen
Owner

dotchen commented Jul 19, 2021

Can you check if this is related to #6?

If I set local_mode=True, the program seems to run but does not use multiple GPUs

By definition of local mode, the jobs run sequentially.

@varunjammula

Hi, I think I figured out the issue. The above error occurs if the cluster resources are not available. My admin settings probably block me from auto-balancing when using sbatch scripts. I think the issue can be closed.
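For anyone debugging the same symptom, a minimal sketch (generic Ray calls, not part of the WorldOnRails scripts) for checking what resources the Ray cluster actually exposes before launching the labeling:

```python
# Minimal sketch: verify that the Ray cluster exposes the CPUs/GPUs the labeling
# actors will request; if the GPU count is too low, actors wait indefinitely.
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster
print("total resources:    ", ray.cluster_resources())
print("available resources:", ray.available_resources())

needed_gpus = 4  # e.g. one GPU slot per labeling worker (illustrative number)
if ray.cluster_resources().get("GPU", 0) < needed_gpus:
    print("Fewer GPUs registered with Ray than the workers request; "
          "the actors would wait forever for resources.")
ray.shutdown()
```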

@dotchen dotchen closed this as completed Jul 19, 2021