
About the 2D backbone #7

Closed
SxJyJay opened this issue Apr 30, 2022 · 65 comments

Comments

@SxJyJay

SxJyJay commented Apr 30, 2022

Hi, I have some questions about training the TransFusion-LC.

  • You mentioned in the supplementary material that a 2D backbone pre-trained on the autonomous driving datasets is required and frozen while training TransFusion-LC (i.e., DLA-34 and ResNet-50 pre-trained on nuScenes and Waymo, respectively). However, I cannot find the relevant pre-trained models in the readme.md of this repo, or the relevant configuration entries in the config files (e.g., transfusion_nusc_voxel_LC.py). Or maybe you have provided them and I missed something important?

  • Could you please provide the relevant pre-trained 2D backbone models, or instructions for pre-training them? Thanks a lot!

@XuyangBai
Owner

Hi, sorry it seems I didn't make it clear in the readme.

  1. For the DLA-34 pretrained on 3D detection, I follow PointAugmenting and reuse the model provided by CenterNet. You can download the checkpoint from https://github.com/xingyizhou/CenterTrack/blob/master/readme/MODEL_ZOO.md#monocular-3d-detection-tracking.
  2. For the ResNet50+FPN pretrained on instance segmentation, I use the model provided by mmdet3d; you can download the checkpoints from https://github.com/open-mmlab/mmdetection3d/blob/v0.12.0/configs/nuimages/README.md (note that you should also use the checkpoints provided by mmdet3d v0.12.0). I choose the backbone from the Mask R-CNN that is pretrained only on ImageNet (the first one).
  3. For the ResNet50+FPN pretrained on 2D detection, I train the model with the same config file as (2) except for removing the mask head (a config sketch follows after the next paragraph).

And I use a similar step as (3) to train a 2D backbone for the waymo dataset. I can send you the relevant processing code and config file if needed.
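
To make (3) concrete, a minimal config sketch (assuming mmdet-style config inheritance and the nuImages Mask R-CNN config name from mmdet3d v0.12.0; the exact file name and how your mmdet version handles None overrides may differ):

# Hypothetical config: start from the nuImages Mask R-CNN config and drop the mask
# branch, leaving a plain Faster R-CNN (ResNet50 + FPN) 2D detector.
_base_ = './mask_rcnn_r50_fpn_1x_nuim.py'  # assumed base config name

model = dict(
    roi_head=dict(
        mask_roi_extractor=None,  # remove the mask RoI extractor
        mask_head=None))          # remove the mask head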

Best,
Xuyang.

@SxJyJay
Author

SxJyJay commented Apr 30, 2022

Thanks a lot for your reply! It is really clear!
Could you please send me the relevant code for training the 2D backbone for the waymo dataset if that doesn't bother you? My email is yanjay2future@gmail.com.

@XuyangBai
Owner

Hi, I have sent them to your email.

@SxJyJay
Author

SxJyJay commented May 1, 2022

Thanks! I received your email.
I still have some questions about re-implementation.

  • In the config file, I notice that you comment out the DLA-34 image backbone and replace it with ResNet-50. I am wondering whether these configuration parameters correspond to the full DLA-34, because I notice that the "heads" parameter is set to empty.
    img_backbone=dict(type='DLASeg', num_layers=34, heads={}, head_convs=-1),
  • I am reproducing TransFusion-L strictly following your config file and instructions, but the mAP on the nuScenes validation set is only 0.5985 at the 17th epoch (the whole training process hasn't finished yet). I don't know where I went wrong. Could you please send me the training logs for TransFusion-L and TransFusion-LC so that I can compare them with my training log?

Sorry to bother you again. Sincere appreciation!

@XuyangBai
Owner

Hi,

  1. R50+FPN gives a slightly better result than DLA-34 (as shown in Table 12 in the supplementary material). And I only use DLA-34 as the image feature extractor, so I do not load the task heads.
  2. Did you adopt the fade strategy (disable the copy-and-paste augmentation for the last 5 epochs)? That can have a remarkable effect on mAP by reducing false positives (see the sketch below).
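
A minimal sketch of the fade strategy (the ObjectSample transform name follows mmdet3d conventions; the keys and paths here are placeholders, not the literal config of this repo):

# Illustrative training pipeline fragment (placeholder values):
train_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5, use_dim=5),
    dict(type='ObjectSample', db_sampler=dict()),  # GT copy-and-paste augmentation
    dict(type='GlobalRotScaleTrans', rot_range=[-0.3925, 0.3925], scale_ratio_range=[0.95, 1.05]),
]

# Fade strategy: for the final 5 epochs, drop the copy-and-paste augmentation and
# resume from the epoch-15 checkpoint.
train_pipeline_fade = [step for step in train_pipeline if step['type'] != 'ObjectSample']
data = dict(train=dict(pipeline=train_pipeline_fade))
resume_from = 'work_dirs/transfusion_nusc_voxel_L/epoch_15.pth'  # assumed checkpoint path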

Best,
Xuyang

@SxJyJay
Author

SxJyJay commented May 1, 2022

Oh, I got it. I forgot to adopt the fade strategy for the last 5 epochs.
Besides, I found that the NDS value is always lower than the mAP in my current validation runs.
e.g.,

  • mAP 0.5199; NDS 0.4856 at epoch 5;
  • mAP 0.5606; NDS 0.5244 at epoch 10;
  • mAP 0.5895; NDS 0.5453 at epoch 15
    I don't know if this is a normal phenomenon; I observe that the NDS value is generally higher than the mAP in others' results. Could you please provide some suggestions or point out where I might be wrong?
    Thanks!

Sincerely,
Jay

@XuyangBai
Owner

It is not normal. Could you provide the full results, such as mATE, mAOE, and mASE?

@XuyangBai
Owner

You can get a very bad mAOE and mASE if you use the newest version of mmdet3d to generate the .pkl files and then train TransFusion: mmdet3d had a large coordinate system refactoring in newer versions. See https://github.com/open-mmlab/mmdetection3d/blob/master/docs/en/compatibility.md#coordinate-system-refactoring
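
A quick sanity check (a sketch; my assumption is that the refactoring landed in the 1.0.0rc releases, while this codebase targets the older 0.x line used in this thread):

# Check that the mmdet3d used to generate the nuScenes .pkl files matches the one
# used for training (a pre-refactoring 0.x version in this thread).
import mmdet3d
print(mmdet3d.__version__)
# If the .pkl files were produced with a newer mmdet3d, regenerate them with the
# matching version, e.g.:
#   python tools/create_data.py nuscenes --root-path ./data/nuscenes \
#       --out-dir ./data/nuscenes --extra-tag nuscenes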

@SxJyJay
Author

SxJyJay commented May 1, 2022

OK, I list the TP metric results below:
at epoch 19 (without the fade strategy), mATE=0.2839; mASE=0.7090, mAOE=1.5609; mAVE=0.2707; mAAE=0.1913

It can have a very bad mAOE and mASE if you use the newest version mmdet3d to generate the .pkl and then train TransFusion. mmdet3d has a large coordinate system refactoring in the newer version. See https://github.com/open-mmlab/mmdetection3d/blob/master/docs/en/compatibility.md#coordinate-system-refactoring

I think this might be the key to my problem! I created the nuScenes metadata with the newest release of mmdet3d, and only downgraded the version after I found it mismatched the mmdet3d version required by the TransFusion repo.
Thanks for your valuable advice! I will re-create the metadata and see what happens.

@YunzeMan

YunzeMan commented May 2, 2022

Nice discussion above! Hi @XuyangBai, I have a follow-up question regarding training the -LC model.

To load the TransFusion-L model when training the -LC model, should we change the load_from key in the config file to the -L model checkpoint, or should we leave that empty and change the pretrained key in the TransFusionDetector field instead?

@XuyangBai
Owner

XuyangBai commented May 3, 2022

Hi @YunzeMan, I usually use the following code to combine the pretrained TransFusion-L and the 2D backbone:

import torch

# Load the pretrained 2D image backbone and the pretrained TransFusion-L (LiDAR-only) checkpoints.
img = torch.load('img_backbone.pth', map_location='cpu')
pts = torch.load('transfusionL.pth', map_location='cpu')

# Start from the TransFusion-L weights and copy the image backbone/neck weights in
# under the 'img_' prefix expected by the -LC model.
new_model = {"state_dict": pts["state_dict"]}
for k, v in img["state_dict"].items():
    if 'backbone' in k or 'neck' in k:
        new_model["state_dict"]['img_' + k] = v
        print('img_' + k)
torch.save(new_model, "fusion_model.pth")

And then set the load_from key to load both the pretrained 3D backbone and 2D backbone.
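
For example, in the -LC config (a sketch; the path is simply wherever you saved the merged checkpoint above):

# Point load_from at the merged checkpoint so both the 3D and 2D weights are loaded.
load_from = 'fusion_model.pth'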

@WWW2323

WWW2323 commented May 3, 2022

Hi, @XuyangBai @SxJyJay, it takes 4 days for me to train TransFusion-L (8 V100 GPUs, epoch=20, samples_per_gpu=2), which seems too long. How long did you spend training TransFusion-L? Thanks!!

@XuyangBai
Owner

@WWW2323 about 2 days for me using 8 V100 GPUs

@SxJyJay
Author

SxJyJay commented May 4, 2022

Hi, @XuyangBai @SxJyJay, it takes 4 days for me to train TransFusion-L (8 V100 GPUs, epoch=20, samples_per_gpu=2), which seems too long. How long did you spend training TransFusion-L? Thanks!!

Also about 2 days for me using 8 RTX3090 GPUs.

@SxJyJay
Author

SxJyJay commented May 4, 2022

@XuyangBai Hi, I have finished the whole training process of TransFusion. I made no modifications except for replacing DLA-34 with ResNet50+FPN, as you suggested. The final results on the nuScenes validation set are:
mAP=67.25, NDS=70.89, mATE=28.09, mASE=25.30, mAOE=28.58, mAVE=26.26, mAAE=19.15
The mAP and NDS are a little lower than the results on the nuScenes test set reported in the paper, whereas conventionally I would expect test-set results to be lower than validation-set results.

Besides, I find that the mAP drop may be caused by much lower AP on some classes such as trailer, traffic cone and barrier. I list the AP of my results (on the val set) vs the reported results (on the test set) below:
car (87.9 vs 87.1), truck (64.0 vs 60.0), bus (74.1 vs 68.3), trailer (43.5 vs 60.8), construction_vehicle (29.8 vs 33.1), pedestrian (88.3 vs 88.4), motorcycle (74.3 vs 73.6), bike (63.5 vs 52.9), traffic cone (77.1 vs 86.7), barrier (70.1 vs 78.1)

I don't know whether my results are within an acceptable error margin, or whether the gap is caused by the different image backbones (i.e., DLA-34 vs ResNet50+FPN).

@XuyangBai
Owner

XuyangBai commented May 5, 2022

Hi @SxJyJay, you can see the detailed results on the val set below.

mAP: 0.6727
mATE: 0.2721
mASE: 0.2517
mAOE: 0.2740
mAVE: 0.2536
mAAE: 0.1902
NDS: 0.7122

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.876   0.169   0.148   0.085   0.259   0.185
truck   0.620   0.302   0.182   0.102   0.228   0.221
bus     0.757   0.302   0.186   0.048   0.386   0.256
trailer 0.428   0.520   0.209   0.463   0.185   0.163
construction_vehicle    0.274   0.666   0.417   0.833   0.124   0.318
pedestrian      0.878   0.128   0.282   0.360   0.215   0.097
motorcycle      0.754   0.184   0.244   0.215   0.421   0.267
bicycle 0.631   0.150   0.263   0.300   0.212   0.016
traffic_cone    0.770   0.119   0.304   nan     nan     nan
barrier 0.739   0.182   0.281   0.059   nan     nan

I think it is within an acceptable error margin; the slightly worse performance might come from training variance. As for the gap between the validation and test sets, it is normal because they generally have different distributions. Also, you could try using more queries during inference to get a better result at the cost of longer inference time (see Table 13 in the supplementary material). Besides, if you are using a different version of mmdet3d, some data augmentation is effectively disabled (see the difference between LoadMultiViewImage in this codebase and in mmdet3d): if img_fields is not set, the RandomFlip augmentation is actually not working (see the sketch below).
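
A sketch of the img_fields point (the class name and where it sits in the pipeline are assumptions for illustration, not this repo's actual LoadMultiViewImage code):

from mmdet.datasets.builder import PIPELINES

@PIPELINES.register_module()
class SetImgFields:
    """Hypothetical pipeline step: register the loaded multi-view images under
    'img_fields' so that the 2D RandomFlip augmentation actually touches them
    (as discussed above, without this it silently has no effect)."""

    def __call__(self, results):
        results['img_fields'] = ['img']
        return results

# It would be inserted right after the image-loading step of the train pipeline, e.g.:
#   dict(type='LoadMultiViewImageFromFiles'),
#   dict(type='SetImgFields'),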

@304886938

Hello @XuyangBai, I want to use your results on the nuScenes validation set for an object tracking experiment, but I don't have enough computing power for training. I wonder if you could provide the JSON files of the validation set results? Here is my email: 304886938@qq.com. Looking forward to your reply!

@SxJyJay
Author

SxJyJay commented May 5, 2022

Thank you. On the validation set, the performance I reproduced is close to yours.
I also list my reproduced results on the val set below:

mAP: 0.6725
mATE: 0.2809
mASE: 0.2530
mAOE: 0.2858
mAVE: 0.2626
mAAE: 0.1915
NDS: 0.7089
Eval time: 110.1s

Per-class results:
Object Class    AP      ATE     ASE     AOE     AVE     AAE
car     0.879   0.168   0.148   0.087   0.259   0.196
truck   0.640   0.322   0.182   0.085   0.232   0.223
bus     0.741   0.326   0.181   0.041   0.407   0.244
trailer 0.435   0.509   0.203   0.495   0.213   0.159
construction_vehicle    0.298   0.723   0.445   0.817   0.123   0.324
pedestrian      0.883   0.128   0.285   0.376   0.217   0.093
motorcycle      0.743   0.183   0.232   0.216   0.451   0.281
bicycle 0.635   0.146   0.255   0.404   0.198   0.013
traffic_cone    0.771   0.118   0.311   nan     nan     nan
barrier 0.701   0.187   0.288   0.050   nan     nan

You have perfectly solved my problems! Hence, I am closing this issue.
Thanks again for your patience!

@SxJyJay SxJyJay closed this as completed May 5, 2022
@xxlbigbrother

Hi, I have sent them to your email.

Hi, I also plan to train a 2D backbone for Waymo and nuScenes. Could you please send me the relevant code for training the 2D backbone? It would be really helpful! My email is xxlbigbrother@gmail.com

@zzm-hl

zzm-hl commented May 11, 2022

Hi, @XuyangBai @SxJyJay, it takes 4 days for me to train TransFusion-L (8 V100 GPUs, epoch=20, samples_per_gpu=2), which seems too long. How long did you spend training TransFusion-L? Thanks!!

Also about 2 days for me using 8 RTX3090 GPUs.

Hi, could you please provide your CUDA, PyTorch, MMCV, mmdet, and mmdet3d environment? I am training on 4 A100s and the estimated training time shows 20 days, which confuses me; I want to exclude the influence of the environment. My environment is:

sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A100-SXM4-40GB
CUDA_HOME: /public/home/u212040344/usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
PyTorch: 1.8.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0
OpenCV: 4.5.5
MMCV: 1.3.18
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.11.0
MMDetection3D: 0.12.0+5337046

@zzm-hl

zzm-hl commented May 11, 2022

Hi, could you please provide the environment (CUDA, PyTorch, MMCV, mmdet, mmdet3d) you used on the 3090 GPUs? I am training on 4 A100s and the estimated training time shows 20 days, which confuses me; I want to rule out the influence of the environment. (My environment is the one listed in my previous comment.)

@SxJyJay
Author

SxJyJay commented May 11, 2022

Hi, could you please provide the environment of your CUDA, PyTorch, MMCV, mmdet, and mmdet3d on the 3090 GPUs?

Hi, my runtime environment is shown below:

PyTorch: 1.8.0 (built with CUDA Runtime 11.1 and CuDNN 8.0.5; the build details are the same as in your dump above)
TorchVision: 0.9.0
OpenCV: 4.5.5
MMCV: 1.3.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.1
MMDetection: 2.10.0
MMDetection3D: 0.11.0+

Besides, I think you can check the time spent on fetching data and on one forward pass to identify the bottleneck. Maybe your problem is caused by slow IO.
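
A rough sketch of that check (dataloader and model stand in for your own objects; adapt the forward call to your model's signature):

import time
import torch

def profile_loader_vs_forward(model, dataloader, num_iters=20):
    """Time data fetching and the forward pass separately to locate the bottleneck."""
    data_time, fwd_time = 0.0, 0.0
    it = iter(dataloader)
    for _ in range(num_iters):
        t0 = time.perf_counter()
        batch = next(it)                      # time spent on IO + augmentation
        data_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        with torch.no_grad():
            model(**batch)                    # adapt to your model's forward signature
        torch.cuda.synchronize()
        fwd_time += time.perf_counter() - t0
    print(f'avg data time: {data_time / num_iters:.3f}s, '
          f'avg forward time: {fwd_time / num_iters:.3f}s')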

@zzm-hl

zzm-hl commented May 11, 2022


Thanks for your reply! The strange thing is that my GPU usage stays at 100% and barely fluctuates; I don't know whether this means the CPU data-loading speed is normal.

@wzmsltw

wzmsltw commented May 30, 2022

@SxJyJay Hi, can you provide the trained TransFusion and TransFusion-L models? My reproduced results are 63.9 mAP (LiDAR) and 64.4 mAP (LiDAR+camera), which is strange. Thanks so much!

@SxJyJay
Author

SxJyJay commented May 30, 2022

@wzmsltw Hi, you can leave me your email, and I will send checkpoints to you.

@wzmsltw

wzmsltw commented May 30, 2022

@SxJyJay my email address is wzmsltw@gmail.com Thanks so much for your help!

@wzmsltw

wzmsltw commented May 31, 2022

@SxJyJay Hi, when will you send checkpoints? Really looking forward to it. Thanks again~

@SxJyJay
Author

SxJyJay commented May 31, 2022

@SxJyJay Hi, when will you send checkpoints? Really looking forward to it. Thanks again~

Sorry for the delay; I had something urgent yesterday. I have sent them to you!
Best,
Yang Jiao

@zzj403

zzj403 commented Aug 4, 2022

@maokp @kuangpanda @cxd520314wang I have sent you my reproduced checkpoints! Please check your email!

@SxJyJay Hi, I am a PhD student studying LiDAR-camera detection models. I've tried many times but I still cannot reproduce satisfying results. Could you please send me your checkpoints? Really looking forward to it. Thanks!
My email is 945937825@qq.com

@xpyqiubai

Hi, I have sent them to your email.

Hi, I also plan to train a 2D backbone for Waymo and nuScenes. Could you please send me the relevant code for training the 2D backbone for the Waymo and nuScenes datasets, if that doesn't bother you? (specifically the Waymo dataset) My email is xpydgqb@gmail.com

@HatakeKiki

Hi, I'm also trying to reproduce TransFusion-L but my mAP and NDS (60.34 & 66.46) are much lower than the author's. Could you please send me your training log of TransFusion-L? I notice an obvious drop in loss at epoch 16 when the fade strategy is applied in others' training, but mine shows no difference with and without the fade strategy. Thank you! My mail is: kiki_jiang@sjtu.edu.cn

@SxJyJay
Author

SxJyJay commented Aug 20, 2022

@JamesHao-ml @yangsijing1995 @wangyd-0312 @Young98CN @zzj403 @jqfromsjtu Hi, I have sent the checkpoints to you. Sorry for the late reply, as I just finished a deadline.

@SxJyJay
Author

SxJyJay commented Aug 20, 2022

@xpyqiubai @xxlbigbrother @kuangpanda Hi, I have sent the data processing code for Waymo and KITTI to you. Sorry for the late reply.

@xpyqiubai

@xpyqiubai @xxlbigbrother @kuangpanda Hi, I have sent the data processing code for Waymo and KITTI to you. Sorry for the late reply.

Thanks!

@yichen928

yichen928 commented Sep 27, 2022

@SxJyJay Hi SxJyJay, can you send me the trained checkpoints on nuScenes? I need the trained TransFusion and TransFusion-L models as well as the relevant data processing code. It would be greatly helpful for me since I may not have enough machines to train them myself. Thank you very much! My email is 1733834831@qq.com.

@SxJyJay
Author

SxJyJay commented Sep 29, 2022

@SxJyJay Hi SxJyJay, can you send the trained checkpoints on nuscenes to me? I need the trained TransFusion and TransFusion-L model as well as the relevant data processing code. It would be greatly helpful for me since I may not have enough machines to train it by myself. Thank you very much! My email is 1733834831@qq.com.

I have sent relevant checkpoints and data processing code to your email.

@yichen928

@SxJyJay Hi SxJyJay, can you send the trained checkpoints on nuscenes to me? I need the trained TransFusion and TransFusion-L model as well as the relevant data processing code. It would be greatly helpful for me since I may not have enough machines to train it by myself. Thank you very much! My email is 1733834831@qq.com.

I have sent relevant checkpoints and data processing code to your email.

Thank you very much!

@minrui-hust

Hi, @SxJyJay, I have reproduced TransFusion-L with 65.4 mAP; however, my reproduced TransFusion-LC model can only achieve 65.6 mAP, which is a large gap from yours (67.25). Can you send me your training logs and checkpoints of both TransFusion-L and TransFusion-LC so I can check where I went wrong? My email is hustminrui@126.com. Thank you!

@SxJyJay
Author

SxJyJay commented Nov 22, 2022

Hi, @SxJyJay, I have reproduced TransFusion-L with 65.4 mAP; however, my reproduced TransFusion-LC model can only achieve 65.6 mAP, which is a large gap from yours (67.25). Can you send me your training logs and checkpoints of both TransFusion-L and TransFusion-LC so I can check where I went wrong? My email is hustminrui@126.com. Thank you!

Hi, I have sent you relevant pretrained weights.

@minrui-hust

Thanks a lot

@carry-all-coder

@SxJyJay Hi SxJyJay, my reproduced TransFusion-LC results are quite low. Could you please send me the trained checkpoints on nuScenes? I need the trained TransFusion and TransFusion-L models as well as the relevant data processing code. Thank you very much! My email is 982330532@qq.com

@frogbam

frogbam commented Dec 19, 2022

@SxJyJay Hi, could you send the checkpoints to me? I need the trained TransFusion-L, TransFusion, and 2D backbone models, as well as the data processing code. My email is frogbam07@gmail.com. Many thanks.

@fanxlin

fanxlin commented Dec 21, 2022

@SxJyJay Hi, can you provide the trained TransFusion and TransFusion-L models?
I am a novice and want to use a single GPU to run the validation and testing with the models in order to learn.
Thanks so much! My email is fanxlin@gmail.com

@jiangchaokang

Hi, @SxJyJay, I have reproduced TransFusion-L with 65.4 mAP; however, my reproduced TransFusion-LC model can only achieve 65.6 mAP, which is a large gap from yours (67.25). Can you send me your training logs and checkpoints of both TransFusion-L and TransFusion-LC so I can check where I went wrong? My email is hustminrui@126.com. Thank you!

Hi, I have sent you relevant pretrained weights.

Hello, SxJyJay, I really need the trained models. I would be very grateful if you could send them to me. I look forward to your help. My email is ts20060079a31@cumt.edu.cn

@wang632846

Hi, @SxJyJay, I am trying to reproduce TransFusion-L but I can't reach the reported results.
Could you send me your checkpoints?
My email is hulled-stags-0b@icloud.com
Thank you so much for your work!

@SxJyJay
Author

SxJyJay commented Jan 28, 2023

I have uploaded my reproduced checkpoints to Google Drive. You can access them using the following links:
TransFusion-L: https://drive.google.com/file/d/1J7fTYsfqRovIdKPenEG5OHQObKq-tfrl/view?usp=sharing
TransFusion-LC: https://drive.google.com/file/d/1mv_JH0gqC3SrUZ9ik9qPEBlgqeeCh9Tb/view?usp=sharing

@wang632846

Hi @SxJyJay Thank you very much!

@TE-fanxl

@SxJyJay Thank you so much for your kind sharing!

@ajinkyakhoche

@maokp @kuangpanda @cxd520314wang I have sent you my reproduced checkpoints! Please check your email!

@maokp @kuangpanda @cxd520314wang @SxJyJay I am interested in training a 2D backbone on the Waymo dataset. Could you share the relevant code and checkpoints with me at khoche@kth.se? Thanks in advance!

@RostyslavUA

RostyslavUA commented Feb 28, 2023

@heminghuang7 You can comment the following part out:

mask_roi_extractor=dict(
    type='SingleRoIExtractor',
    roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
    out_channels=256,
    featmap_strides=[4, 8, 16, 32]),
mask_head=dict(
    type='FCNMaskHead',
    num_convs=4,
    in_channels=256,
    conv_out_channels=256,
    num_classes=80,
Thank you so much!

After commenting out that part, I get an error

  File "/home/b1-gpu/mmcv-1.2.4/mmcv/utils/registry.py", line 144, in build_from_cfg
    '`cfg` or `default_args` must contain the key "type", '
KeyError: '`cfg` or `default_args` must contain the key "type", but got {\'num_classes\': 10}\nNone'

How did you solve that?

@gopi-erabati

@xpyqiubai @xxlbigbrother @kuangpanda @SxJyJay I'm interested in training a 2D backbone on the Waymo dataset. Can you please share the relevant code and checkpoints (if possible) with gopi231091@gmail.com? Thank you very much!

@wqueree

wqueree commented Mar 23, 2023

I have uploaded my reproduced checkpoints to Google Drive. You can access them using the following links: TransFusion-L: https://drive.google.com/file/d/1J7fTYsfqRovIdKPenEG5OHQObKq-tfrl/view?usp=sharing TransFusion-LC: https://drive.google.com/file/d/1mv_JH0gqC3SrUZ9ik9qPEBlgqeeCh9Tb/view?usp=sharing

This is fantastic, thank you so much for sharing!

@ToothlessBDG

I have uploaded my reproduced checkpoints to Google Drive. You can access them using the following links: TransFusion-L: https://drive.google.com/file/d/1J7fTYsfqRovIdKPenEG5OHQObKq-tfrl/view?usp=sharing TransFusion-LC: https://drive.google.com/file/d/1mv_JH0gqC3SrUZ9ik9qPEBlgqeeCh9Tb/view?usp=sharing

Hello, thank you very much for sharing; this is very helpful for me since I only have one GPU. I also want to look at the parameters after training, so could you send me a TransFusion work_dir file? Thank you very much. gzr321654987@126.com

@friendship1

@xpyqiubai @xxlbigbrother @kuangpanda @SxJyJay Could you please provide the necessary code and any available checkpoints for training a 2D backbone on the Waymo dataset? If possible, send the information to friendship1@dgist.ac.kr. Your assistance is greatly appreciated!
