
tqdm bar seems stuck at 0/50000 and fails to continue the Q-labeling #15

Closed
SunHaoOne opened this issue Jun 7, 2021 · 10 comments

@SunHaoOne

Make sure you have read the FAQ before posting.
Thanks!
Hello,
After running all the previous steps correctly as you suggested, I have trained the ego-model and successfully collected about 186 GB of NoCrash data, so the next step is to label Q. I run $ python -m rails.data_phase2 --num-workers=4, and it shows the following:

|         | 0/53267 [00:00<?, ?it/s]

I have also checked my GPUs, and they show the Ray workers running:

+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    Off  | 00000000:86:00.0 Off |                  N/A |
| 50%   71C    P2   153W / 215W |   2480MiB /  7952MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    Off  | 00000000:AF:00.0 Off |                  N/A |
| 51%   72C    P2   152W / 215W |   2480MiB /  7952MiB |     98%      Default |

|    2     14845      C   ray::RAILSActionLabeler.run()               1229MiB |
|    2     14855      C   ray::RAILSActionLabeler.run()               1229MiB |
|    3     14898      C   ray::RAILSActionLabeler.run()               1229MiB |
|    3     14919      C   ray::RAILSActionLabeler.run()               1229MiB |

When I pressed CTRL+C, it showed:

Traceback (most recent call last):
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 58, in <module>
    main(args)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 24, in main
    current_frames = ray.get(logger.total_frames.remote())
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/worker.py", line 1372, in get
    object_refs, timeout=timeout)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/worker.py", line 304, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 869, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 142, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/node.py", line 868, in _kill_process_type
    process.wait(timeout_seconds)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/subprocess.py", line 1647, in _wait
    time.sleep(delay)
KeyboardInterrupt
  0%|                              

It seems it is just waiting now? (But we do not need to launch CARLA in this phase.)
The data-dir is set to the directory of the collected data, and config.yaml is set to the NoCrash config
(default='/home/shy/Desktop/WorldOnRails/experiments/config_nocrash.yaml'); I just copied config.yaml into the experiments folder. Thanks a lot!

@dotchen dotchen closed this as completed Jun 7, 2021
@dotchen
Owner

dotchen commented Jun 7, 2021

Just wait longer. Or, if you have to eyeball the bar moving, set --num-per-log=1
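For context, a minimal sketch of why the bar can look frozen (a simplified stand-in, not the actual rails/data_phase2.py logic): the bar is only advanced once every num_per_log labeled frames, so a large default keeps it at 0 for a long time even while the workers are busy.

```python
# Minimal sketch (hypothetical; the real rails/data_phase2.py may differ):
# the main loop only refreshes the tqdm bar every `num_per_log` frames,
# so a large value makes the bar look stuck even while workers are busy.
import time
from tqdm import tqdm

def label_frames(total_frames, num_per_log=100):
    pbar = tqdm(total=total_frames)
    done = 0
    while done < total_frames:
        time.sleep(0.01)               # stand-in for expensive Q-labeling work
        done += 1
        if done % num_per_log == 0:
            pbar.update(num_per_log)   # the bar only moves here
    pbar.close()

label_frames(1000, num_per_log=1)  # with num_per_log=1 the bar moves every frame
```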

@SunHaoOne
Author

SunHaoOne commented Jun 7, 2021

Just wait longer. Or, if you have to eyeball the bar moving, set --num-per-log=1

Thanks for your reply. I have tried adding this argument, but still nothing happens; it shows the same message as before. And last time when I pressed CTRL+C it showed: | 0/53267 [1:20:03<?, ?it/s]. Maybe one loop is just too long? (I have also checked the data files, and it seems no file was modified today.) I will wait longer, thanks!
Later I found that some data was probably lost. I deleted the latest dataset folder, and finally it works, now showing 81/52736 [01:41<16:43:51, 1.14s/it].
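In case it helps others, a minimal sketch of the kind of integrity check that turned up the bad folder. It assumes each trajectory is a subfolder holding an LMDB database with a 'len' key, which is an assumption about the on-disk layout rather than the confirmed WorldOnRails format:

```python
# Minimal sketch, assuming each trajectory is a subfolder containing an LMDB
# database with a 'len' entry (this layout is an assumption; adjust to the
# actual dataset format).
import os
import lmdb

def find_broken_trajectories(data_dir):
    broken = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if not os.path.isdir(path):
            continue
        try:
            # readonly + lock=False so running labeling workers are not disturbed
            with lmdb.open(path, readonly=True, lock=False) as env:
                with env.begin() as txn:
                    if txn.get(b'len') is None:
                        broken.append((name, 'missing len key'))
        except lmdb.Error as err:
            broken.append((name, str(err)))
    return broken

print(find_broken_trajectories('/path/to/collected/data'))  # placeholder path
```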

@dotchen
Owner

dotchen commented Jun 8, 2021

Gotcha. One thing I forgot to mention: if something is odd, set local_mode=True and --num-runners=1 to debug.
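For reference, a minimal sketch of what this debug setup looks like in Ray (the Labeler actor below is a hypothetical stand-in, not the actual RAILSActionLabeler): with local_mode=True everything runs sequentially in the driver process, so exceptions and prints surface immediately.

```python
# Minimal sketch of the suggested debug setup: local_mode runs every task and
# actor in the driver process, so errors are raised directly where you can see them.
import ray

ray.init(local_mode=True)  # sequential, in-process execution for debugging

@ray.remote
class Labeler:  # hypothetical stand-in for the labeling actor
    def run(self, frame_id):
        # stand-in for Q-labeling work; any exception now raises in the driver
        return frame_id * 2

labeler = Labeler.remote()               # with one worker the order is deterministic
print(ray.get(labeler.run.remote(21)))   # -> 42
ray.shutdown()
```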

@SunHaoOne
Author

Gotcha. One thing I forgot to mention: if something is odd, set local_mode=True and --num-runners=1 to debug.

Thanks! I have checked the link here: https://github.com/dotchen/WorldOnRails/issues/6. I found that when I set --num-workers=4, it always shows the output below, with the 4 workers running on GPU 3 and GPU 4. So later I tried setting --num-workers=2; the 2 workers run on one GPU, and the bar seems to move faster than before. I guess this error is caused by Ray running on a different GPU. (But when collecting data, the error did not occur.)

(world_on_rails) shy@amax2080-3:~/Desktop/WorldOnRails$ python -m rails.data_phase2 --num-workers=2
                                                                    | 13500/52736 [1:06:58<3:33:48,  3.06it/s]
(world_on_rails) shy@amax2080-3:~/Desktop/WorldOnRails$ python -m rails.data_phase2 --num-workers=4 --num-per-log=1
  5%|████▏                                                                                 | 2544/52736 [53:47<17:39:15,  1.27s/it]2021-06-08 09:44:29,882	WARNING worker.py:1034 -- The node with node id cf898cd2ae29ef3ad62dd5e886bbe935399d3e88 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
Traceback (most recent call last):
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 58, in <module>
    main(args)
  File "/home/shy/Desktop/WorldOnRails/rails/data_phase2.py", line 24, in main
    current_frames = ray.get(logger.total_frames.remote())
  File "/home/shy/anaconda3/envs/world_on_rails/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
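If the suspicion is that workers land on unintended GPUs, a minimal sketch of how Ray maps actors to GPUs (a generic illustration, not the actual RAILSActionLabeler code): declaring num_gpus on the actor makes Ray reserve a device and set CUDA_VISIBLE_DEVICES per worker.

```python
# Minimal sketch (generic Ray usage, not the actual RAILSActionLabeler): each
# actor requests one GPU, so Ray pins it to a device via CUDA_VISIBLE_DEVICES.
import os
import ray

ray.init(num_gpus=2)

@ray.remote(num_gpus=1)
class GpuWorker:
    def which_gpu(self):
        # GPU ids Ray reserved for this actor, plus what CUDA actually sees
        return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

workers = [GpuWorker.remote() for _ in range(2)]
print(ray.get([w.which_gpu.remote() for w in workers]))
ray.shutdown()
```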

@varunjammula

I have been running the code for 3 days now on 6 workers. The script does not seem to end. I enabled local mode and no error was shown in the logs. I used the released 1M dataset. How long did the labeling take for you?

@dotchen
Owner

dotchen commented Jul 18, 2021

I need more information to help you debug. Can you tell me what the progress bar says? Or does the progress bar completely freeze? Also, do you see any GPU utilization while running the script?

@dotchen dotchen reopened this Jul 18, 2021
@dotchen
Owner

dotchen commented Jul 18, 2021

Also, you shouldn't use local_mode when num_workers is not 1.

@varunjammula

varunjammula commented Jul 19, 2021

Hi,

When I set local_mode=False, I encounter the following issue with num_workers > 1.

[screenshot of the error]

If I set local_mode=True, the program seems to run but does not use multiple GPUs. I think it is taking a lot of time to create the worker threads, and I do not see the tqdm bar at all. Do you have any insight into this? @dotchen

I am running the code on a very small subset of the data to figure out the issue.

@dotchen
Owner

dotchen commented Jul 19, 2021

Can you check if this is related to #6?

If I set local_mode=True, the program seems to run but does not use multiple GPUs

By definition of local mode, the jobs run sequentially.

@varunjammula

Hi, I think I figured out the issue. The above error occurs if the cluster resources are not available. My admin settings probably block me from auto-balancing when using sbatch scripts. I think the issue can be closed.
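For anyone debugging the same symptom, a minimal sketch (generic Ray calls, not part of the WorldOnRails scripts) for checking what resources the Ray cluster actually exposes before launching the labeling:

```python
# Minimal sketch: verify that the Ray cluster exposes the CPUs/GPUs the labeling
# actors will request; if the GPU count is too low, actors wait indefinitely.
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster
print("total resources:    ", ray.cluster_resources())
print("available resources:", ray.available_resources())

needed_gpus = 4  # e.g. one GPU slot per labeling worker (illustrative number)
if ray.cluster_resources().get("GPU", 0) < needed_gpus:
    print("Fewer GPUs registered with Ray than the workers request; "
          "the actors would wait forever for resources.")
ray.shutdown()
```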

@dotchen dotchen closed this as completed Jul 19, 2021