Overflow occurs when training MNC with the VGG16 net #41
Comments
Even using a learning rate 100x smaller than the default one still gives the same error (but now even further into the optimization, around iteration 2370).
Hi @JulianoLagana, did you manage to solve the problem? I did try to clip all the
Hi @leduckhc. No, unfortunately I didn't. These and other problems with this implementation led me to a different research direction. I hope you manage it, though.
Hi @JulianoLagana. I just figured out that the weights of conv5_3 and lower (conv5_{2,1}, conv4_{1,2,3}, etc.) contain NaNs, so the reason might be bad initialization/loading of the network from the caffemodel. I explored the values and weights by going through
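The NaN diagnosis above can be verified directly. Below is a minimal sketch: the scan itself is a plain NumPy helper, and the commented-out part shows how it might be wired to pycaffe's `net.params` (the `caffe.Net` loading lines are assumptions about the usual pycaffe workflow, not something confirmed in this thread):

```python
import numpy as np

def find_nonfinite_params(named_arrays):
    """Return the names of parameter arrays containing NaN or Inf.

    `named_arrays` is an iterable of (name, numpy array) pairs.
    """
    bad = []
    for name, arr in named_arrays:
        if not np.isfinite(arr).all():
            bad.append(name)
    return bad

# With pycaffe (assumed API/paths), something like:
#   net = caffe.Net('test.prototxt', 'vgg16.caffemodel', caffe.TEST)
#   pairs = ((name, blob.data)
#            for name, blobs in net.params.items()
#            for blob in blobs)
#   print(find_nonfinite_params(pairs))

# Standalone demonstration with synthetic weights:
ok = ('conv1_1', np.zeros(4))
broken = ('conv5_3', np.array([1.0, np.nan]))
print(find_nonfinite_params([ok, broken]))
```

Running the scan right after loading the caffemodel, and again every few hundred iterations, would distinguish "weights were already bad at load time" from "weights diverged during training."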
I see, thanks for sharing it! If you do find a workaround for this issue,
I'd be very interested.
Check #53 for a solution.
Freezing layers is not a solution.
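For context on what the workaround in #53 amounts to: "freezing" a layer in Caffe is usually done by zeroing its learning-rate multipliers in the train prototxt, so the layer keeps its pretrained weights and never updates. A sketch for a hypothetical convolution layer (the layer name and omitted fields are illustrative, not taken from the MNC prototxts):

```
layer {
  name: "conv1_1"
  type: "Convolution"
  # lr_mult: 0 stops gradient updates for this parameter blob;
  # decay_mult: 0 also exempts it from weight decay.
  param { lr_mult: 0 decay_mult: 0 }  # weights
  param { lr_mult: 0 decay_mult: 0 }  # biases
  ...
}
```

This sidesteps the overflow by preventing the early layers from diverging, which is why it can "work" without addressing the underlying numerical instability.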
Hi everyone.
I'm trying to train the default VGG16 implementation of MNC with the command
./experiments/scripts/mnc_5stage.sh 0 VGG16
However, after some iterations I run into an overflow error:
Error messages
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:213: RuntimeWarning: invalid value encountered in multiply
  dfdxc * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:217: RuntimeWarning: invalid value encountered in multiply
  dfdw * np.exp(bottom[1].data[0, 4*c+2, h, w]) * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
  top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 22873 Floating point exception (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}
I saw in issue #22 that user @brisker experienced the same error when trying to train MNC on his own dataset. The advice given there was to lower the learning rate. Lowering it also helped in my case, but even at 1/10th of the original learning rate the same problem occurs, only later in the training process. User @souryuu mentioned that he needed a learning rate 100x smaller to avoid this problem, which led to poorer performance of the resulting net (possibly because he ran for the same number of iterations rather than 100 times longer).
Was anyone able to run the training with the default learning rate provided by the creators without running into overflow problems? I'm simply trying to train the default implementation of the network on the default dataset, so I'd expect the default learning rate to work, no?