Overflow occurs when training MNC with the VGG16 net #41

Open
JulianoLagana opened this issue Feb 16, 2017 · 7 comments

@JulianoLagana

Hi everyone.

I'm trying to train the default VGG16 implementation of MNC with the command
./experiments/scripts/mnc_5stage.sh 0 VGG16

However, after some iterations I run into an overflow error:

Error messages

/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:213: RuntimeWarning: invalid value encountered in multiply
  dfdxc * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:217: RuntimeWarning: invalid value encountered in multiply
  dfdw * np.exp(bottom[1].data[0, 4*c+2, h, w]) * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
  top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 22873 Floating point exception (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}
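
For context on what these warnings mean: np.exp saturates to inf once its argument exceeds the range of the blob's dtype (Caffe blobs are float32), and those infs then turn into NaNs in the following multiplications. A minimal NumPy demonstration of the mechanism (not MNC code):

import numpy as np

# np.exp overflows once its argument exceeds roughly 88.7 for float32
# (or 709.8 for float64); the resulting inf becomes NaN when multiplied by 0.
print(np.exp(np.float32(90.0)))             # inf -> "overflow encountered in exp"
print(np.float32('inf') * np.float32(0.0))  # nan -> "invalid value encountered in multiply"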


I saw in issue #22 that user @brisker ran into the same error when training MNC on his own dataset. The advice given there was to lower the learning rate. Lowering it also helped in my case, but even at 1/10th of the original learning rate the same problem occurs, just later in the training process. User @souryuu mentioned that he needed a learning rate 100x smaller to avoid the problem, which led to poorer performance of the final network (possibly because he ran for the same number of iterations rather than 100 times longer).
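
For anyone wanting to reproduce the workaround: the learning rate for this run lives in the Caffe solver file passed on the command line above (models/VGG16/mnc_5stage/solver.prototxt), in the base_lr field. The values below are purely illustrative, not the repository defaults:

# models/VGG16/mnc_5stage/solver.prototxt (excerpt, illustrative values only)
base_lr: 0.0001   # e.g. 10x smaller than a hypothetical default of 0.001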

Has anyone been able to run the training with the default learning rate provided by the authors without running into overflow problems? I'm simply trying to train the default implementation of the network on the default dataset, so I'd expect the default learning rate to work, no?

@JulianoLagana
Author

Even with a learning rate 100x smaller than the default I still get the same error (now even further into the optimization, around iteration 2370).

@leduckhc
Contributor

leduckhc commented Apr 4, 2017

Hi @JulianoLagana, did you manage to solve the problem? I tried clipping the arguments of all the np.exp expressions to a bounded range, but training still fails with signal 8: SIGFPE (floating point exception).

x = np.clip(x, -10, 10)  # bound the exponent so np.exp cannot overflow
np.exp(x)
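
One concrete place to apply such a clip is the bbox_transform.py line from the log above: clipping the log-scale deltas before exponentiating keeps the decoded box sizes finite. A sketch only (the clip constant is my assumption, not an MNC value):

import numpy as np

# Cap the predicted log-scale deltas so np.exp cannot overflow; the bound
# below limits box scaling to ~62x and is an assumption, not MNC's constant.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def safe_pred_sizes(dw, dh, widths, heights):
    dw = np.minimum(dw, BBOX_XFORM_CLIP)
    dh = np.minimum(dh, BBOX_XFORM_CLIP)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h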

@JulianoLagana
Author

Hi @leduckhc. No, unfortunately I didn't. These and other problems with this implementation led me to a different research direction. I hope you manage it, though.

@leduckhc
Contributor

leduckhc commented Apr 4, 2017

Hi @JulianoLagana. I just figured out that the weights of conv5_3 and the layers below it (conv5_{2,1}, conv4_{1,2,3}, etc.) contain NaNs, so the cause might be a bad initialization or loading of the network from the caffemodel. I'm going to examine it in more depth.

I inspected the blob values and layer weights with:

print {k: v.data for k, v in self.solver.net.blobs.items()}
print {k: v[0].data for k, v in self.solver.net.params.items()}
# v[0] is for weights, v[1] for biases
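
A more targeted check (a sketch, assuming a pycaffe net such as self.solver.net from the snippet above) is to scan every parameter blob for non-finite values:

import numpy as np

# Return (layer name, param index) pairs whose blobs contain NaN or inf.
# params[0] holds the weights, params[1] the biases, as above.
def find_bad_params(net):
    bad = []
    for name, params in net.params.items():
        for i, p in enumerate(params):
            if not np.all(np.isfinite(p.data)):
                bad.append((name, i))
    return bad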

@JulianoLagana
Author

JulianoLagana commented Apr 4, 2017 via email

@leduckhc
Contributor

leduckhc commented Apr 5, 2017

Check #53 for a solution.
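
For reference, freezing layers in Caffe is done by zeroing the learning-rate multipliers of those layers in the train prototxt; a generic sketch, not necessarily the exact change proposed in #53:

# train.prototxt (excerpt) -- lr_mult/decay_mult of 0 stops updates to this layer
layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  param { lr_mult: 0 decay_mult: 0 }   # weights frozen
  param { lr_mult: 0 decay_mult: 0 }   # biases frozen
  convolution_param { num_output: 64 kernel_size: 3 pad: 1 }
}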

@feichtenhofer

Freezing layers is not a solution.
