Why do I run out of GPU memory when running stage 3? #28

Closed
Musicad opened this issue Aug 3, 2020 · 4 comments

Comments

@Musicad

Musicad commented Aug 3, 2020

The authors all seem to be Chinese, so I took the lazy route and wrote this issue in Chinese.
If I understand this implementation correctly, training is split into three stages, right? In stage 3, the local branch's feature maps assist training of the global branch through deep feature map sharing. GPU memory starts to balloon when the ensemble loss is computed at the end. I set the batch size to 1 so that only one full-size image is trained at a time, and found that as soon as the image is even slightly large the GPU runs out of memory, even with the sub batch size set to 2. My training images are not particularly large either; the longest side is under 4000 pixels. Given what the paper says about memory efficiency, this shouldn't happen. Is this implementation different from what the paper describes?
Hoping for a reply! I've been stuck on this for a very long time.
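
For reference, this is roughly how I measured where the spike happens, boiled down to a self-contained PyTorch snippet (the toy model, the 2048x2048 input, and the plain cross-entropy loss are stand-ins for the actual stage-3 code, not the repository's implementation):

```python
import torch
import torch.nn as nn

# Stand-ins for the real stage-3 model and data; only the measurement pattern matters here.
model = nn.Conv2d(3, 2, kernel_size=3, padding=1).cuda()
criterion = nn.CrossEntropyLoss()
image = torch.randn(1, 3, 2048, 2048, device="cuda")         # one "full-size" image, batch size 1
label = torch.randint(0, 2, (1, 2048, 2048), device="cuda")  # dummy segmentation target

torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated() / 1024**2
loss = criterion(model(image), label)                        # the step where I see the spike
loss.backward()
print(f"allocated before loss: {before:.0f} MiB, "
      f"peak during the step: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB")
```

The same `reset_peak_memory_stats` / `max_memory_allocated` pair can be wrapped around the ensemble-loss call in the real training script to pin down exactly which step drives the peak.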

@qinliuliuqin

Hi Musicad, have you solved this problem? I'm about to reproduce this work myself.

@EmmaSRH

EmmaSRH commented Aug 31, 2020

I have the same problem. In fact, GPU memory usage already increases noticeably in stage 2, and the code is written rather crudely; it feels a long way from the efficiency reported in the paper...

@chenwydj
Collaborator

Hi everyone!

Thank you for your interest in our work!

  1. The largest image size we tried in our work is 5000x5000. The core hyperparameters that affect memory usage during training are: 1) batch_size; 2) sub_batch_size; 3) the size of the cropped patches. For the 5000x5000 example, we used batch_size = 4, sub_batch_size = 6, and a crop size of 536x536, which costs about 10 GB of GPU memory during training (see the sketch after this list).
  2. Our main claim is test-time memory efficiency, not training-time memory efficiency.
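
For intuition, sub_batch_size works like the usual gradient-accumulation pattern over the cropped patches: only one sub-batch of crops lives on the GPU at a time, so the per-step footprint is governed by sub_batch_size and the crop size rather than by the raw image resolution. Here is a self-contained sketch of that pattern (a toy model and illustrative numbers, not our actual training code):

```python
import torch
import torch.nn as nn

# Illustrative numbers matching the 5000x5000 example above; the model is a toy stand-in.
batch_size, sub_batch_size, crop = 4, 6, 536
patches_per_image = 16          # however many crops each image is tiled into (illustrative)

model = nn.Conv2d(3, 2, kernel_size=3, padding=1).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# One training step: all cropped patches of the batch stay on the CPU;
# only sub_batch_size of them are moved to the GPU and processed at a time.
patches = torch.randn(batch_size * patches_per_image, 3, crop, crop)
targets = torch.randint(0, 2, (batch_size * patches_per_image, crop, crop))

torch.cuda.reset_peak_memory_stats()
optimizer.zero_grad()
for i in range(0, patches.size(0), sub_batch_size):
    x = patches[i:i + sub_batch_size].cuda(non_blocking=True)
    y = targets[i:i + sub_batch_size].cuda(non_blocking=True)
    # Scale so the accumulated gradient equals the mean over all patches.
    loss = criterion(model(x), y) * x.size(0) / patches.size(0)
    loss.backward()             # frees this sub-batch's activations before the next one
optimizer.step()

print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB")
```

In this sketch, raising sub_batch_size or the crop size is what pushes the peak up; the full set of patches never has to sit on the GPU at once.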

@qinliuliuqin

qinliuliuqin commented Aug 31, 2020


Hi Wuyang,

Thanks for your timely reply! I enjoyed your paper a lot and am now trying to extend your work to high-resolution volumetric medical image segmentation (e.g., a median volume size of 512x512x384). I will let you know whether GL-Net works in the medical imaging domain, where the memory issue is even more severe.

Qin

@chenwydj chenwydj closed this as completed Sep 8, 2020