Question about training time on NuScenes #34

Open
dabi4256 opened this issue Jun 30, 2023 · 7 comments

Comments

@dabi4256

dabi4256 commented Jun 30, 2023

Hello, I trained on the full nuScenes dataset myself, with two 3090 GPUs and BATCH_SIZE_PER_GPU=8, using the 'cbgs_voxel0075_voxelnext.yaml' config file. Once the training speed stabilizes, the estimated total time is about 200 hours. Is this normal?
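
As a rough sanity check on an ETA like this, a back-of-envelope calculation ties the reported hours to seconds per iteration. Everything below is an illustrative assumption (the CBGS-resampled sample count, the epoch count, and the per-iteration time are not measurements from this thread):

    import math

    # All inputs are illustrative assumptions, not values from this issue.
    samples_per_epoch = 123_000   # assumed size of the CBGS-resampled nuScenes train split
    epochs = 20                   # assumed schedule for the cbgs_* nuScenes configs
    num_gpus = 2
    batch_size_per_gpu = 8
    sec_per_iter = 4.7            # assumed; read the real value from the training log

    iters_per_epoch = math.ceil(samples_per_epoch / (num_gpus * batch_size_per_gpu))
    total_hours = epochs * iters_per_epoch * sec_per_iter / 3600
    print(f"{iters_per_epoch} iters/epoch -> ~{total_hours:.0f} hours")

With these assumed inputs the arithmetic lands near 200 hours, so the question reduces to whether roughly 5 seconds per iteration is plausible for two 3090s, or whether most of that time is spent outside the GPU (see the dataloader point below).
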
@yukang2017
Member

yukang2017 commented Jun 30, 2023

Hi, that does not seem normal. I trained with 4 V100 GPUs and BATCH_SIZE_PER_GPU=4, and it took a bit over 2 days. My guess is that the training-speed bottleneck is in the dataloader's data reading.
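
One way to check that guess is to time the dataloader on its own, without the model. This is a minimal sketch in plain PyTorch; the dataset object is a placeholder, and the collate_batch / --workers details are assumptions about the OpenPCDet-style setup rather than code from this repo:

    import time
    from torch.utils.data import DataLoader

    def benchmark_loader(dataset, batch_size=8, num_workers=4, n_batches=50):
        """Measure seconds per batch for data loading alone (no model step)."""
        loader = DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,                              # OpenPCDet-style code exposes this as --workers
            pin_memory=True,
            shuffle=True,
            collate_fn=getattr(dataset, "collate_batch", None),   # assumed: OpenPCDet datasets define collate_batch
        )
        it = iter(loader)
        start = time.time()
        for _ in range(n_batches):
            next(it)
        return (time.time() - start) / n_batches

    # Hypothetical usage: if 8 workers is clearly faster than 4, raise the worker count.
    # print(benchmark_loader(train_set, num_workers=4))
    # print(benchmark_loader(train_set, num_workers=8))

If the loader alone is slower than the GPU step time seen in the log, increasing the worker count or moving the data to faster storage should shrink the 200-hour estimate.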

@dabi4256
Author

dabi4256 commented Jul 1, 2023

Hi, thanks for your reply.
Was that 2 days the actual wall-clock time of a finished run, or the estimated time printed by the program? My training has not finished yet; I only saw the program's estimate of 200 hours.
Also, the estimated time with two 3090s is barely different from the estimate with a single 3090.

@yukang2017
Member

Hi, for me the estimated time and the actual time were basically the same.

@learnuser1

    Hello, I trained on the full nuScenes dataset myself, with two 3090 GPUs and BATCH_SIZE_PER_GPU=8, using the 'cbgs_voxel0075_voxelnext.yaml' config file. Once the training speed stabilizes, the estimated total time is about 200 hours. Is this normal?

Hi, how do the results from training it yourself compare with the original paper? And is this model much better than CenterPoint?

@dabi4256
Author

dabi4256 commented Jul 6, 2023

    Hello, I trained on the full nuScenes dataset myself, with two 3090 GPUs and BATCH_SIZE_PER_GPU=8, using the 'cbgs_voxel0075_voxelnext.yaml' config file. Once the training speed stabilizes, the estimated total time is about 200 hours. Is this normal?

    Hi, how do the results from training it yourself compare with the original paper? And is this model much better than CenterPoint?

My machine runs slowly and has not finished the 20 epochs yet. But I evaluated the epoch-10 checkpoint and the mAP has already reached 59.32. I have not run CenterPoint myself, but the paper reports 58.6 mAP for CenterPoint.

@learnuser1

    Hello, I trained on the full nuScenes dataset myself, with two 3090 GPUs and BATCH_SIZE_PER_GPU=8, using the 'cbgs_voxel0075_voxelnext.yaml' config file. Once the training speed stabilizes, the estimated total time is about 200 hours. Is this normal?

    Hi, how do the results from training it yourself compare with the original paper? And is this model much better than CenterPoint?

    My machine runs slowly and has not finished the 20 epochs yet. But I evaluated the epoch-10 checkpoint and the mAP has already reached 59.32. I have not run CenterPoint myself, but the paper reports 58.6 mAP for CenterPoint.

Got it, thanks.

@rockywind

@learnuser1 @yukang2017 @dabi4256 @yanwei-li Hi everyone,
When I run multi-GPU training, I get this error:

Traceback (most recent call last):
  File "train.py", line 246, in <module>
    main()
  File "train.py", line 179, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids=[cfg.LOCAL_RANK % torch.cuda.device_count()])
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 8a000

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 8a000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1132223) of binary: /opt/conda/envs/VoxelNet/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/VoxelNet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/VoxelNet/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/VoxelNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
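
The "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..." message means both training processes ended up on the same physical GPU. That usually happens when the per-process local rank is not picked up (so every rank defaults to cuda:0) or when CUDA_VISIBLE_DEVICES exposes only one device to both processes. A minimal sketch of the usual device binding, written independently of this repo's train.py, looks like this:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(model):
        # torchrun (and recent torch.distributed.launch) set LOCAL_RANK for each spawned process.
        local_rank = int(os.environ["LOCAL_RANK"])

        # Bind this process to its own GPU before creating the process group and wrapping the model,
        # otherwise every rank lands on cuda:0 and NCCL reports "Duplicate GPU detected".
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

        model = model.cuda(local_rank)
        return DDP(model, device_ids=[local_rank])

    # Launch with one process per visible GPU (hypothetical values):
    #   CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train.py ...
    # If CUDA_VISIBLE_DEVICES lists only one GPU, both ranks map to the same device
    # and the error above reappears.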
