training error with my own datasets with RuntimeWarning, Help? #22

brisker · 2016-10-19T12:30:57Z

The error looks like this:
I1019 20:25:01.929584 24937 solver.cpp:245] Train net output #12: seg_cls_loss = 3.18091 (* 1 = 3.18091 loss)
I1019 20:25:01.929592 24937 solver.cpp:245] Train net output #13: seg_cls_loss_ext = 3.28305 (* 1 = 3.28305 loss)
I1019 20:25:01.929605 24937 sgd_solver.cpp:106] Iteration 0, lr = 0.001
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:106: RuntimeWarning: overflow encountered in exp
bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:106: RuntimeWarning: invalid value encountered in float_scalars
bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
bottom[0].diff[i, 4] = dfdh[ind] * (delta_y + np.exp(delta_h))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
bottom[0].diff[i, 4] = dfdh[ind] * (delta_y + np.exp(delta_h))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/sjtu/code/MNC-master/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 24937 Floating point exception (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}

hgaiser · 2016-10-19T13:00:56Z

Lower the learning rate.

brisker · 2016-10-19T13:10:28Z

@hgaiser Thanks! One more question:
I want to train from scratch, without a fine-tuned model. When I comment out the "weights-" line in mnc_5stage.sh, why does the training data become voc2007? I just want to train on my own dataset, which I put entirely in the VOCdevkitSDS folder.

hgaiser · 2016-10-20T07:55:08Z

Why do you want to train from scratch? My advice is to use the pretrained model actually.

qinhaifangpku · 2016-11-19T09:47:25Z

@hgaiser Hi, when I change the input data size to 1_3_1200*1200, I get this error and the loss becomes NaN. Why does this happen? Would lowering the learning rate help?

hgaiser · 2016-11-19T12:22:00Z

Not sure what you mean with that input size. Images are by default resized such that the shortest axis is 600px, meaning the input would for example be 600xHx3. But if you are seeing the same error as above, chances are lowering the learning rate helps.
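As a rough illustration of the resizing rule described above, the scale factor could be computed as in the sketch below. `shortest_side_scale` is a hypothetical helper, not MNC's code, and it ignores any cap on the longer side that the actual configuration may apply.

```python
def shortest_side_scale(height, width, target=600):
    """Return (scale, new_height, new_width) so that the shorter side equals target.

    Hypothetical helper for illustration only, not part of MNC.
    """
    scale = float(target) / min(height, width)
    return scale, int(round(height * scale)), int(round(width * scale))


# A 1200x1600 image is scaled so that its shorter side becomes 600.
scale, new_h, new_w = shortest_side_scale(1200, 1600)
print(scale, new_h, new_w)
```

Under this rule, an image that is 1200 px on its shorter side would be scaled down by a factor of 0.5 before it reaches the network.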

qinhaifangpku · 2016-11-21T03:59:31Z

@oh233 Hi, I trained on my own data once before without problems, but now, training again without any changes, I get this:

/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:87: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:87: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
bottom[0].diff[i, 3] = dfdw[ind] * (delta_x +np.exp(delta_w))
/data2/qinhaifang/MNC/tools/../lib/pylayer/stage_bridge_layer.py:108: RuntimeWarning: overflow encountered in exp
Then the loss becomes NaN.
Has anybody else run into this error?
Thank you in advance for your help!

qinhaifangpku · 2016-11-21T04:07:43Z

@hgaiser Thank you! But I used to be able to train the model normally; now this error occurs without any changes.

haihaoshen · 2017-01-06T09:50:56Z

Has anyone else gotten this error? Thanks.

/mnt/sda/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Traceback (most recent call last):
File "./tools/train_net.py", line 96, in <module>
_solver.train_model(args.max_iters)
File "/mnt/sda/MNC/tools/../lib/caffeWrapper/SolverWrapper.py", line 127, in train_model
self.solver.step(1)
File "/mnt/sda/MNC/tools/../lib/pylayer/proposal_layer.py", line 186, in backward
unmap_val = self._ind_after_filter[self._ind_after_sort[proposal_index[top_non_zero_ind]]]
IndexError: arrays used as indices must be of integer (or boolean) type

hgaiser · 2017-01-06T10:00:26Z

I don't think I have seen that error before, but what you could try is to delete the cache files (in data/cache/) and try again. Otherwise I suggest diving into the python code to figure out where and why it messes up. Start by printing the ws, hs and min_size values, apparently one of those is incorrect.
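The check suggested above could look roughly like the following sketch. `report_box_sizes` is a hypothetical helper, not part of MNC; it assumes `ws` and `hs` are NumPy arrays of box widths and heights and `min_size` is a scalar, as in the `keep = np.where(...)` line from the traceback.

```python
import numpy as np


def report_box_sizes(ws, hs, min_size):
    """Print a summary of ws/hs and return indices of usable boxes.

    Hypothetical debugging helper, not part of MNC. A box is kept only if
    both sizes are finite and at least min_size.
    """
    ws = np.asarray(ws, dtype=np.float64)
    hs = np.asarray(hs, dtype=np.float64)
    finite = np.isfinite(ws) & np.isfinite(hs)
    print("min_size:", min_size)
    print("boxes:", ws.size, "non-finite:", int((~finite).sum()))
    if finite.any():
        print("ws range:", ws[finite].min(), ws[finite].max())
        print("hs range:", hs[finite].min(), hs[finite].max())
    # Compare only the finite entries, so NaN never reaches the >= test.
    keep = np.where(finite)[0]
    keep = keep[(ws[keep] >= min_size) & (hs[keep] >= min_size)]
    return keep


# Toy input: one valid box, one NaN width, one too-small box, one inf width.
ws = np.array([20.0, np.nan, 4.0, np.inf])
hs = np.array([30.0, 10.0, 50.0, 12.0])
keep = report_box_sizes(ws, hs, min_size=16)
print("kept indices:", keep)
```

If the non-finite count is above zero, the bad values were most likely produced upstream (for example by an overflowing `np.exp`), so filtering here only hides the symptom.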

haihaoshen · 2017-01-09T05:21:40Z

Lowering the learning rate can solve it. Will dive into python code if issue happens again.

souryuu · 2017-01-25T07:21:11Z

I hit the same error as above with learning rates of 0.001 and 0.0001. However, it works after changing the learning rate to 0.00001. From reading the code, it seems that the regressed bbox location values diverge and cause the overflow in backprop.
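The overflow described here can be reproduced, and one possible mitigation sketched, in a few lines of NumPy. This is not the MNC code: `decode_sizes`, `max_delta` and its default bound `log(1000 / 16)` are assumptions for illustration. Lowering the learning rate remains the fix that people in this thread actually confirmed.

```python
import math

import numpy as np


def decode_sizes(widths, heights, dw, dh, max_delta=math.log(1000.0 / 16.0)):
    """Turn predicted log-scale deltas into box widths and heights.

    Hypothetical helper, not part of MNC. Clamping dw/dh before np.exp
    keeps the result finite even if the regression output has diverged.
    """
    dw = np.minimum(dw, max_delta)
    dh = np.minimum(dh, max_delta)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h


widths = np.array([16.0, 32.0], dtype=np.float32)
heights = np.array([16.0, 32.0], dtype=np.float32)
# The second row mimics a diverged regression output.
dw = np.array([[0.1], [200.0]], dtype=np.float32)
dh = np.array([[0.2], [200.0]], dtype=np.float32)

# Without clamping, exp(200) does not fit in float32 and becomes inf.
with np.errstate(over="ignore"):
    unclipped_w = np.exp(dw) * widths[:, np.newaxis]

pred_w, pred_h = decode_sizes(widths, heights, dw, dh)
print(np.isfinite(unclipped_w).all(), np.isfinite(pred_w).all())
```

Clamping only prevents the overflow in the decode step. It does not repair deltas that are already NaN, and it does not address why the regression diverged in the first place.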

haihaoshen · 2017-01-26T03:50:59Z

Thanks for the info. Is lr 0.00001 too small for fine-tuning a new dataset?

souryuu · 2017-01-30T04:01:37Z

Adjusting the learning rate affected the fine-tuning results. Following the instructions for end-to-end training of the 5-stage MNC model on VOC2012 with an adjusted learning rate (0.0001), I got mAP@0.5 = 36.16 and mAP@0.7 = 13.08.

haihaoshen · 2017-01-31T09:01:16Z

Yeah, end2end 5-stage training on VOC 2012 runs fine for me as well. Did you try your own dataset?

YsSue · 2017-03-13T02:37:43Z

@souryuu Hi, I am using mnc_5stage.sh to train the model with a learning rate of 0.00001. How many iterations do I need? Did you use lr=0.0001 with 25000 or 250000 iterations?

xialuxi · 2017-04-01T03:29:16Z

@brisker did you change the number of classes?

guofeng007 · 2017-09-27T11:54:45Z

Works great with 0.00001.

JulianoLagana mentioned this issue Feb 16, 2017

Overflow occurs when training MNC with the VGG16 net #41

Open

brisker closed this as completed Apr 1, 2017

leduckhc mentioned this issue Apr 4, 2017

conv5_3 layers contains NANs causing SIGFPE #53

Open
