
About GPU memory usage #10

Closed

Fan-Yixuan opened this issue May 5, 2022 · 37 comments

@Fan-Yixuan

Thanks for your great work! I am trying to reimplement your work with the new version (v1.0.0) of mmdet3d. My environment:

sys.platform: linux
Python: 3.8.11 (default, Aug  3 2021, 15:09:35) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.74
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.9.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.1
OpenCV: 4.5.3
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.23.0
MMSegmentation: 0.24.0
MMDetection3D: 1.0.0rc1+c7cde78

I have dealt with the coordinate system refactoring problem and also the img_fields issue, but I can only train with up to 50 query proposals at one sample per 24 GB RTX 3090 GPU; using the default config (nuScenes, LiDAR + camera, R50-FPN, SECOND LiDAR backbone, 200 queries) runs into CUDA OOM.

Noting your practice in #6 (comment), I am seeking help here. I also could not find where you use spconv; I hope you can provide more details.
Thanks a lot.

@XuyangBai
Owner

XuyangBai commented May 5, 2022

Hi @Fan-Yixuan, thanks for your interest in our work. I have tried training TransFusion on 8 3090 GPUs and it fit into memory, so I am not sure what happens in your environment. But you could try spconv 1.2 to reduce the memory usage. spconv is used in SparseEncoder. mmdet3d bundles spconv in its repo under mmdet3d/ops/spconv, but it is an old version. To use another version of spconv, I used to install it following the instructions here and replace

from .conv import (SparseConv2d, SparseConv3d, SparseConvTranspose2d,
                   SparseConvTranspose3d, SparseInverseConv2d,
                   SparseInverseConv3d, SubMConv2d, SubMConv3d)
from .modules import SparseModule, SparseSequential
from .pool import SparseMaxPool2d, SparseMaxPool3d
from .structure import SparseConvTensor, scatter_nd

by something like

from spconv import SparseConv2d, ...
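For concreteness, a sketch of what the replaced imports in mmdet3d/ops/spconv/__init__.py might look like with a pip-installed spconv 1.2.x; this assumes the class names are exposed at the package top level, which is the case for spconv 1.x:

# Sketch only: swap mmdet3d's bundled spconv for a pip-installed spconv 1.2.x,
# which exposes the same class names at the package top level.
from spconv import (SparseConv2d, SparseConv3d, SparseConvTranspose2d,
                    SparseConvTranspose3d, SparseInverseConv2d,
                    SparseInverseConv3d, SubMConv2d, SubMConv3d,
                    SparseModule, SparseSequential,
                    SparseMaxPool2d, SparseMaxPool3d,
                    SparseConvTensor)
from spconv import scatter_nd  # assumed to exist at top level in spconv 1.x; verify locally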

@Fan-Yixuan
Author

Thanks a lot for your help. I'm using the latest spconv (2.1.21), and I can now train with 200 queries at one sample per 3090 using ~22 GB of memory, although 2 samples per GPU is still not achievable. I will keep exploring to solve this problem more fully!
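Note (an assumption based on spconv 2.x's public API, not on this repo): with spconv 2.1.x the modules live under spconv.pytorch rather than the package top level, so the replacement imports would look roughly like:

# Sketch for spconv 2.x (e.g. 2.1.21): the classes are exposed under spconv.pytorch.
# scatter_nd is not re-exported there, so the bundled helper may need to be kept.
from spconv.pytorch import (SparseConv2d, SparseConv3d, SparseConvTranspose2d,
                            SparseConvTranspose3d, SparseInverseConv2d,
                            SparseInverseConv3d, SubMConv2d, SubMConv3d,
                            SparseModule, SparseSequential,
                            SparseMaxPool2d, SparseMaxPool3d,
                            SparseConvTensor)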

@Fan-Yixuan
Author

@XuyangBai Hi, I would like to ask whether TransFusion's prediction heads contain branches for attribute prediction (moving, stopped, parked vehicle, etc.). I'm not familiar with this task (nuScenes); why is it done this way instead of reducing mAAE by adding such branches?

@XuyangBai
Owner

XuyangBai commented May 6, 2022

I basically follow mmdet3d and handle attribute prediction with some post-processing rules; check the code here:

for i, box in enumerate(boxes):
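For context, the rule-based assignment roughly follows mmdet3d's NuScenesDataset; a sketch of the linked loop is below (illustrative only: boxes, mapped_class_names, and NuScenesDataset.DefaultAttribute come from the surrounding mmdet3d code, and the exact thresholds may differ between versions):

# Sketch of mmdet3d-style post-processing for nuScenes attributes: fast boxes
# get a "moving"/"with_rider" attribute, near-static boxes fall back to defaults.
import numpy as np

for i, box in enumerate(boxes):
    name = mapped_class_names[box.label]
    if np.sqrt(box.velocity[0] ** 2 + box.velocity[1] ** 2) > 0.2:
        if name in ('car', 'construction_vehicle', 'bus', 'truck', 'trailer'):
            attr = 'vehicle.moving'
        elif name in ('bicycle', 'motorcycle'):
            attr = 'cycle.with_rider'
        else:
            attr = NuScenesDataset.DefaultAttribute[name]
    else:
        if name == 'pedestrian':
            attr = 'pedestrian.standing'
        elif name == 'bus':
            attr = 'vehicle.stopped'
        else:
            attr = NuScenesDataset.DefaultAttribute[name]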

@Fan-Yixuan
Author

Yes, I noticed, but it seems strange to directly use the default attribute. Is there any official statement on why this is done?

@XuyangBai
Owner

Ah sorry, I just use it as the de facto approach and never thought carefully about this issue.

@Fan-Yixuan
Author

OK, since mmdet3d implements it like this, there is presumably a reason for it.

@Fan-Yixuan
Author

@XuyangBai Hi, I finished training transfusion_nusc_voxel_L and got a val set performance of 64.63 mAP / 69.99 NDS. The earlier GPU memory problem has been solved; it was because the images were not being resized to 448×800 due to a version issue.
However, training after adding the cameras ran into problems: mATE, mASE, mAOE, and mAVE all increase during training. Do you have any suggestions for possible causes? Specifically, it seems that GlobalRotScaleTrans and RandomFlip3D could cause a mismatch between LiDAR and camera?

@XuyangBai
Owner

Did you use the newest code? There were some bugs when changing the shape of the image features, leading to a mismatch between the two modalities; they are fixed in 8977b2b and 5187414.

@XuyangBai
Owner

GlobalRotScaleTrans and RandomFlip3D will not break the matching between LiDAR and camera, because every time we project the object queries (and the initial predictions) from 3D space onto the image plane, we first apply the inverse transformation, which converts the augmented 3D positions back to the original coordinates. See the following code:

if batch_size == 1:  # skip during inference to save time
    points = query_pos_3d_with_corners.T
else:
    points = apply_3d_transformation(
        query_pos_3d_with_corners.T, 'LIDAR',
        img_metas[sample_idx], reverse=True).detach()
num_points = points.shape[0]

By the way, I just realized that this might be the problem: here I assume batch size 1 means evaluation, so I skip apply_3d_transformation for faster inference. If you use samples_per_gpu=1 for training, you should remove this check and always apply apply_3d_transformation.
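One possible adjustment, a sketch only (it assumes this code lives in a torch nn.Module so self.training is available), is to gate on training mode rather than on batch size:

# Sketch: always undo the 3D augmentation during training, regardless of batch size;
# skip it only at inference, where no augmentation was applied.
if self.training:
    points = apply_3d_transformation(
        query_pos_3d_with_corners.T, 'LIDAR',
        img_metas[sample_idx], reverse=True).detach()
else:
    points = query_pos_3d_with_corners.T
num_points = points.shape[0]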

@Fan-Yixuan
Author

I'm using the latest version of the code with 2 samples per GPU. I have another question: RandomFlip3D's parent class, RandomFlip, doesn't seem to support flipping a list of images; does that matter?

@XuyangBai
Owner

XuyangBai commented May 11, 2022

Yes, it might be the reason. If flip in img_fields is set to True but the images are not actually flipped, the consistency between LiDAR and the images is broken. You can check the preprocessing classes to figure out how it works in my implementation; I do not remember exactly where I convert the list of images into an ndarray.

@Fan-Yixuan
Author

Fan-Yixuan commented May 11, 2022

My concern is that maybe
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/datasets/pipelines/transforms.py#L465-L469
should be changed to loop over the list of images, as you did in MyResize etc., but then I don't understand why you and #7 (comment) are able to get correct training results with the current code.
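For illustration, such a per-image loop would look something like the sketch below (a hypothetical helper in the style of MyResize, not code from either repo):

# Hypothetical per-image flip, mirroring the per-image loop used in MyResize.
import mmcv

def _flip_img_list(results):
    for key in results.get('img_fields', ['img']):
        results[key] = [
            mmcv.imflip(img, direction=results['flip_direction'])
            for img in results[key]
        ]
    return results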

@XuyangBai
Owner

@Fan-Yixuan I find that mmcv.imflip does work for a list of images; see the following example:

[Screenshot 2022-05-12: interactive check showing mmcv.imflip applied to a list of images]

@Fan-Yixuan
Author

Fan-Yixuan commented May 12, 2022

Thanks for the explanation; that's true, but I still can't seem to solve my problem. The strangest thing I found is the behavior of loss_bbox during training, as shown in the figure: the orange line is the LiDAR-only result, and the red line is the LiDAR + camera result. Do you have any suggestions? Thanks a lot.
[Screenshot 2022-05-12: loss_bbox curves; orange = LiDAR-only, red = LiDAR + camera]
Also, I had missed the changes in train.py, i.e. I didn't freeze the LiDAR branch; combined with the figure above, I now think this is likely the reason.

@XuyangBai
Owner

It is really weird that the bbox loss starts to increase at some point; the curve before 10k iterations looks normal. I am not sure of the reason, but maybe you can first verify the projection of object queries onto the image through some visualization? If the LiDAR and image are not aligned well, the image features attached to the object queries will be wrong. By the way, you mentioned that mATE, mASE, and mAOE are all increasing; how about mAP?
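As a concrete starting point for such a check, here is a small sketch (not code from the repo) that projects LiDAR-frame points into one camera image using the lidar2img matrix that mmdet3d stores in img_metas:

# Sketch: project LiDAR-frame points (N, 3) into an image with a 4x4 lidar2img
# matrix and draw them, to eyeball whether LiDAR and camera are aligned.
import cv2
import numpy as np

def draw_projected_points(img, points_lidar, lidar2img):
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    cam = pts_h @ lidar2img.T
    cam = cam[cam[:, 2] > 1e-3]          # keep points in front of the camera
    uv = cam[:, :2] / cam[:, 2:3]        # perspective divide
    for u, v in uv:
        cv2.circle(img, (int(u), int(v)), 3, (0, 0, 255), -1)
    return img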

@Fan-Yixuan
Author

Fan-Yixuan commented May 12, 2022

In the first three epochs after adding the camera, mAP was 62.49, 58.86, 59.66. I suspect the loss starts to increase because the learning rate becomes larger (I use 4×3090 with 2 samples per GPU and accumulate gradients over two forward passes before each parameter update, so the effective batch size is 16; the learning rate therefore reaches its maximum at around 40k iterations).
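For reference, one way to express this two-forward-pass accumulation in an mmdet-style config (an assumption about the setup, not the exact config used here; GradientCumulativeOptimizerHook is available in recent mmcv, e.g. the 1.5.0 listed above):

# Sketch: accumulate gradients over 2 iterations so 4 GPUs x 2 samples behaves
# like an effective batch size of 16, keeping the same gradient clipping.
optimizer_config = dict(
    type='GradientCumulativeOptimizerHook',
    cumulative_iters=2,
    grad_clip=dict(max_norm=0.1, norm_type=2))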

Do you think this is normal if the LiDAR branch is not frozen?

@XuyangBai
Owner

XuyangBai commented May 12, 2022

The learning rate should not be the reason; I have also trained with batch size 8×1.

Yes, I freeze the LiDAR branch when training TransFusion, as it is already well trained in the first stage. If you would like to jointly optimize the LiDAR branch and the fusion components, they should probably be optimized with different learning rates.
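For illustration, a minimal sketch of such a split (hypothetical module prefixes following mmdet3d's pts_* naming; the 1e-5 value is only an example):

# Sketch: give the pretrained LiDAR branch a smaller learning rate than the
# newly added fusion components via optimizer parameter groups.
# `model` is the built detector (illustrative).
import torch

lidar_prefixes = ('pts_voxel_encoder', 'pts_middle_encoder', 'pts_backbone', 'pts_neck')
lidar_params, fusion_params = [], []
for name, param in model.named_parameters():
    (lidar_params if name.startswith(lidar_prefixes) else fusion_params).append(param)

optimizer = torch.optim.AdamW(
    [{'params': lidar_params, 'lr': 1e-5},
     {'params': fusion_params, 'lr': 1e-4}],
    weight_decay=0.01)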

@Fan-Yixuan
Author

Hi, sorry for the late reply. I made two changes. The first follows your changes in the dataset definition file, though from what I understand this shouldn't have a real impact:
[Screenshot 2022-05-14: diff of the dataset definition file]

The second is to freeze the weights of the LiDAR branch. Now I get 66.75 mAP / 71.03 NDS on the nuScenes validation set, so I think the previous problem was caused by using too large a learning rate for the LiDAR branch, which had already been well trained.
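For reference, freezing the LiDAR branch can be done with something like the sketch below (assuming mmdet3d-style pts_* submodule names; the actual logic in the repo's train.py may differ):

# Sketch: freeze the pretrained LiDAR branch so only the fusion components train.
for name, param in model.named_parameters():
    if name.startswith(('pts_voxel_encoder', 'pts_middle_encoder',
                        'pts_backbone', 'pts_neck')):
        param.requires_grad = False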

@XuyangBai
Owner

Yes, the order of the images does not matter much, but freezing the backbone does.

@Fan-Yixuan
Author

OK, thank you for your patience and your excellent work. I'm closing this issue.

@nmll

nmll commented May 16, 2022

@Fan-Yixuan Hello! Could you tell me the maximum learning rate you used in the first and second training stages, respectively?

@Fan-Yixuan
Author

Hi, my experiments follow the configs given by the author; the optimizer and learning-rate schedule are the same for both stages:

optimizer = dict(type='AdamW', lr=0.0001, weight_decay=0.01)  # for 8 GPUs * 2 samples per GPU
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
lr_config = dict(
    policy='cyclic',
    target_ratio=(10, 0.0001),
    cyclic_times=1,
    step_ratio_up=0.4)

@nmll

nmll commented May 16, 2022

OK! Thanks!

@nmll

nmll commented May 28, 2022

(Quoting the earlier comment on apply_3d_transformation and the batch_size == 1 shortcut.)

Hello @XuyangBai, may I ask about this: apply_3d_transformation is only used when projecting the 3D queries to 2D, but not when fusing the BEV LiDAR features with the BEV image features for image-guided query initialization. Won't this cause a mismatch between the LiDAR and image modalities due to RandomFlip3D and GlobalRotScaleTrans?

@XuyangBai
Owner

XuyangBai commented Jun 1, 2022

Hi @nmll, that's a very good question that I hadn't considered previously. Intuitively, the point clouds should also be transformed using the inverse of the data augmentation when projecting image features onto the BEV plane (or, equivalently, I should apply a similar rotation and flip to the images, which is somewhat complicated). However, the network still works under the current settings. My guess is that the network is able to 1) leverage the contextual relationship between image features and LiDAR features to associate the two sets of features and thus perform the projection, and 2) ignore the geometric relationship implied by the position encodings of the image features and LiDAR features.

Furthermore, I ran another experiment that removes RandomFlip and GlobalRotScaleTrans during training to see whether forcing the two modalities to be consistent would further improve the results. In that case, the network can also leverage the geometric relationship to build the association. The observation is that the training loss decreases more rapidly than in the previous setting: the blue curve in the following figure is the one without RandomFlip & GlobalRotScaleTrans, while the gray curve is the original one. However, the final mAP and NDS are similar. So I assume that removing these two augmentations increases the convergence speed, but the final performance may already be saturated (although the heatmap loss could be further reduced, the object queries selected from the heatmap already have good locations, so the improvement is not remarkable in terms of final mAP and NDS).

[Screenshots 2022-06-01: training loss curves; blue = without RandomFlip & GlobalRotScaleTrans, gray = original]

I will remove RandomFlip and GlobalRotScaleTrans from the config files, which is more reasonable and gives better convergence speed. Thanks a lot for pointing out that issue.
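As an illustration of that config change (a sketch with placeholder entries, not the verbatim repo config):

# Sketch: drop the two geometric augmentations from the LiDAR-camera train
# pipeline so the two modalities stay consistent; other steps are placeholders.
train_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5, use_dim=5),
    dict(type='LoadMultiViewImageFromFiles'),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True),
    # dict(type='GlobalRotScaleTrans', ...),  # removed
    # dict(type='RandomFlip3D', ...),         # removed
    # ... remaining resize / normalize / bundle / collect steps unchanged ...
]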

Best,
Xuyang

@heming7

heming7 commented Jun 16, 2022

(Quoting the earlier comment about the 448×800 resize fix and the increasing mATE/mASE/mAOE/mAVE.)

Hello @Fan-Yixuan

Can you tell me what you did to solve the version issue? I am facing the same problem now.

@Fan-Yixuan
Author

Hi @heming7, you need to make sure that results['img_fields'] is ['img'] and that type(results['img']) is list before these lines:

def _resize_img(self, results):
    """Resize images with ``results['scale']``."""
    for key in results.get('img_fields', ['img']):
        for idx in range(len(results['img'])):
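To make that precondition concrete, a tiny hypothetical helper (not from the repo) that could run before the resize step:

# Hypothetical helper: ensure multi-view images are stored as a list under 'img'
# and that 'img_fields' is set, so the per-image loop above sees what it expects.
def _ensure_img_list(results):
    if not isinstance(results.get('img'), list):
        results['img'] = [results['img']]
    results['img_fields'] = ['img']
    return results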

@heminghuang7

(Quoting the _resize_img suggestion above.)
Hello Yixuan

Thank you for the suggestion. I checked the code and I think the author has pushed a commit that fixes this. In any case, I managed to run it by reducing samples_per_gpu. Thank you so much for the help!

@yinjunbo

yinjunbo commented Oct 10, 2022

(Quoting the earlier comment about the 448×800 resize fix and the increasing error metrics.)

Hi @Fan-Yixuan, could you please share the torch/CUDA/mmdet3d/spconv environment you used to reproduce the nuScenes val performance (64.63 mAP and 69.99 NDS)? It seems you used 8×3090 with batch size 2 per GPU and lr 1e-4?

@Fan-Yixuan
Author

(Quoting the question above.)

Hi, my environment is in #10 (comment); my spconv is 2.1.21, my total batch size is 16, and the lr is 1e-4.

@yinjunbo

yinjunbo commented Oct 10, 2022

(Quoting the exchange above.)
@Fan-Yixuan, thanks for your quick reply! I'll have another try.
By the way, could you please share your training log so I can check my problem against it? (email: yinjunbocn@gmail.com)

@Fan-Yixuan
Author

Fan-Yixuan commented Oct 10, 2022

@yinjunbo
Sure. For LiDAR-only training, the first 15 epochs:
20220505_225100.log
and the last 5 epochs (fade strategy):
20220508_101828.log

@yinjunbo

(Quoting the reply above with the training logs.)

Thank you very much!
I find that my training loss is obviously larger than yours. Did you train a model before the coordinate system refactoring?

@Fan-Yixuan
Author

@yinjunbo Sorry, I didn't save the training logs from before the coordinate system modification, but if the coordinates were not aligned, it would perform very poorly.

@yinjunbo

(Quoting the reply above.)

I totally agree. Since my reproduced performance is only slightly lower (~2 points) than yours, it can't be caused by the coordinate system. I'll keep looking for the problem. Thanks!

@BoomSky0416

@Fan-Yixuan Hello, I am trying to reproduce TransFusion with mmdet3d 1.1.0, but I got incorrect results when training the LiDAR-camera fusion stage. Could you please share your training log for this stage? Thanks! (email: shoutian@umich.edu)
