
training error with my own datasets with RuntimeWarning, Help? #22

Closed
brisker opened this issue Oct 19, 2016 · 17 comments

Comments

@brisker

brisker commented Oct 19, 2016

The error looks like this:
I1019 20:25:01.929584 24937 solver.cpp:245] Train net output #12: seg_cls_loss = 3.18091 (* 1 = 3.18091 loss)
I1019 20:25:01.929592 24937 solver.cpp:245] Train net output #13: seg_cls_loss_ext = 3.28305 (* 1 = 3.28305 loss)
I1019 20:25:01.929605 24937 sgd_solver.cpp:106] Iteration 0, lr = 0.001
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:106: RuntimeWarning: overflow encountered in exp
bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:106: RuntimeWarning: invalid value encountered in float_scalars
bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
bottom[0].diff[i, 4] = dfdh[ind] * (delta_y + np.exp(delta_h))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
bottom[0].diff[i, 4] = dfdh[ind] * (delta_y + np.exp(delta_h))
/home/sjtu/code/MNC-master/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/sjtu/code/MNC-master/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 24937 Floating point exception (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}
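The warnings above are consistent with a common float32 failure chain: `np.exp` overflows to inf, `0 * inf` yields NaN, and NaN then fails every comparison. A minimal NumPy sketch with made-up values (not taken from this log), assuming float32 data as Caffe blobs use:

```python
import numpy as np

# Silence the RuntimeWarnings so the demo prints cleanly; real training
# code would normally leave them enabled.
with np.errstate(over="ignore", invalid="ignore"):
    delta_w = np.float32(100.0)          # a diverged regression output (illustrative)
    e = np.exp(delta_w)                  # float32 exp overflows above ~88.7 -> inf
    dfdw = np.float32(0.0)               # a zero upstream gradient
    grad = dfdw * (np.float32(0.5) + e)  # 0 * inf -> nan ("invalid value")
    ws = np.array([16.0, grad], dtype=np.float32)
    keep = np.where(ws >= 16.0)[0]       # NaN fails every comparison

print(np.isinf(e), np.isnan(grad), keep)
```

Once a single NaN like this reaches a gradient, it propagates into the weights, which matches the NaN loss reported further down the thread.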

@hgaiser

hgaiser commented Oct 19, 2016

Lower the learning rate.

@brisker

brisker commented Oct 19, 2016

@hgaiser Thanks! Besides,
I want to train from scratch without a pretrained model. When I comment out the "--weights" line in mnc_5stage.sh, why does the training data become VOC2007? I just want to train on my own dataset, which I put entirely in the VOCdevkitSDS folder.

@hgaiser

hgaiser commented Oct 20, 2016

Why do you want to train from scratch? My advice is actually to use the pretrained model.

@qinhaifangpku

@hgaiser Hi! When I change the input data size to 1x3x1200x1200, I get this error and the loss becomes NaN. Why does this happen? Could lowering the learning rate work?

@hgaiser

hgaiser commented Nov 19, 2016

Not sure what you mean by that input size. By default, images are resized so that the shorter side is 600 px, meaning the input would be, for example, 600xHx3. But if you are seeing the same error as above, chances are that lowering the learning rate helps.
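A minimal sketch of that shorter-side resize rule, assuming a 600 px target and a py-faster-rcnn-style cap of 1000 px on the longer side. The cap and the function name `resized_shape` are assumptions for illustration; check the config your MNC checkout actually uses:

```python
def resized_shape(height, width, target_size=600, max_size=1000):
    """Return (new_h, new_w, scale) so the shorter side becomes target_size.

    max_size is an assumed cap on the longer side (py-faster-rcnn style),
    not a value confirmed in this thread.
    """
    short_side = min(height, width)
    long_side = max(height, width)
    scale = float(target_size) / short_side
    # Shrink the scale if the longer side would exceed max_size.
    if round(scale * long_side) > max_size:
        scale = float(max_size) / long_side
    return int(round(height * scale)), int(round(width * scale)), scale


print(resized_shape(1200, 1200))  # square image: both sides halved
print(resized_shape(500, 2000))   # very wide image: capped by max_size
```

Under these assumptions a 1200x1200 image is fed to the network at 600x600, and a 500x2000 image at 250x1000.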

@qinhaifangpku

@oh233 Hi! I trained on my own data once before without problems, but now when I train it again, without any changes, this happens:

/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:87: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/transform/bbox_transform.py:87: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
/data2/qinhaifang/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
bottom[0].diff[i, 3] = dfdw[ind] * (delta_x +np.exp(delta_w))
/data2/qinhaifang/MNC/tools/../lib/pylayer/stage_bridge_layer.py:108: RuntimeWarning: overflow encountered in exp
Then the loss becomes NaN.
Has anybody else run into this error?
Thank you in advance for your help!

@qinhaifangpku

@hgaiser Thank you! But I used to train the model without problems; now this error happens without any changes.

@haihaoshen

Has anyone seen this error? Thanks.

/mnt/sda/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Traceback (most recent call last):
File "./tools/train_net.py", line 96, in <module>
_solver.train_model(args.max_iters)
File "/mnt/sda/MNC/tools/../lib/caffeWrapper/SolverWrapper.py", line 127, in train_model
self.solver.step(1)
File "/mnt/sda/MNC/tools/../lib/pylayer/proposal_layer.py", line 186, in backward
unmap_val = self._ind_after_filter[self._ind_after_sort[proposal_index[top_non_zero_ind]]]
IndexError: arrays used as indices must be of integer (or boolean) type

@hgaiser

hgaiser commented Jan 6, 2017

I don't think I have seen that error before, but you could try deleting the cache files (in data/cache/) and running again. Otherwise, I suggest diving into the Python code to figure out where and why it goes wrong. Start by printing the ws, hs and min_size values; apparently one of them is invalid.
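Following that advice, here is a hypothetical debugging version of the width/height filter, modeled on the `keep` line at bbox_transform.py:129. The name `filter_boxes` and the non-finite check are illustrative additions, not MNC code. It reports NaN or inf widths and heights before they silently disappear: a NaN fails every `>=` comparison, so such boxes are dropped instead of raising an error.

```python
import numpy as np


def filter_boxes(boxes, min_size):
    """Keep boxes whose width and height are both >= min_size.

    Hypothetical debugging helper; the non-finite check is an assumption
    added for diagnosis, not part of the original code.
    """
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    bad = ~np.isfinite(ws) | ~np.isfinite(hs)
    if bad.any():
        # Show which proposals already went NaN/inf before filtering.
        print("non-finite ws/hs at rows", np.where(bad)[0], "min_size =", min_size)
    keep = np.where((ws >= min_size) & (hs >= min_size))[0]
    return keep


boxes = np.array([[0, 0, 31, 31],
                  [0, 0, np.nan, 15],
                  [0, 0, 3, 3]], dtype=np.float32)
print(filter_boxes(boxes, 16))
```

Here row 1 is reported as non-finite and only row 0 survives the size filter.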

@haihaoshen

Lowering the learning rate solves it. I will dive into the Python code if the issue happens again.

@souryuu

souryuu commented Jan 25, 2017

I hit the same error as above with learning rates of 0.001 and 0.0001. However, it works after lowering the learning rate to 0.00001. From reading the code, it seems that the bounding-box regression values diverge and cause the overflow in backprop.
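One mitigation used in some later detection codebases (Detectron, for example, clips with a bound of log(1000/16)) is to cap dw/dh before calling `np.exp`, so a diverged regression output cannot overflow. The sketch below is a hypothetical clipped variant of the inverse box transform whose lines appear at bbox_transform.py:86-87 in the log above. The function name and the bound are assumptions, not MNC's code:

```python
import numpy as np

# Assumed clip bound: exp(log(1000/16)) = 62.5, so a predicted box can grow
# to at most 62.5x the width/height of its input box.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)


def bbox_transform_inv_clipped(boxes, deltas):
    """Apply (dx, dy, dw, dh) deltas to boxes, clipping dw/dh before exp.

    Hypothetical variant for illustration only.
    """
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy = deltas[:, 0::4], deltas[:, 1::4]
    # Clipping here is what prevents "overflow encountered in exp".
    dw = np.minimum(deltas[:, 2::4], BBOX_XFORM_CLIP)
    dh = np.minimum(deltas[:, 3::4], BBOX_XFORM_CLIP)

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred = np.zeros_like(deltas)
    pred[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred


boxes = np.array([[0, 0, 15, 15]], dtype=np.float32)
deltas = np.array([[0, 0, 500, 500]], dtype=np.float32)  # diverged dw/dh
print(np.isfinite(bbox_transform_inv_clipped(boxes, deltas)).all())
```

Without the two `np.minimum` calls, `np.exp(500)` would overflow to inf exactly as in the warnings above; with them, the 16 px box grows to about 1000 px instead.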

@haihaoshen

haihaoshen commented Jan 26, 2017 via email

@souryuu

souryuu commented Jan 30, 2017

Adjusting the learning rate affected the fine-tuning results. Following the instructions for end-to-end training of the 5-stage MNC model on VOC2012 with an adjusted learning rate (0.0001), I got mAP@0.5 = 36.16 and mAP@0.7 = 13.08.

@haihaoshen

haihaoshen commented Jan 31, 2017 via email

@YsSue

YsSue commented Mar 13, 2017

@souryuu Hi, I am using mnc_5stage.sh to train the model with a learning rate of 0.00001. How many iterations do I need? Did you use lr=0.0001 with 25000 or 250000 iterations?

@xialuxi

xialuxi commented Apr 1, 2017

@brisker did you change the number of classes?

@guofeng007

Works great with 0.00001.


8 participants