Overflow occurs when training MNC with the VGG16 net #41

Open
JulianoLagana opened this issue Feb 16, 2017 · 7 comments

@JulianoLagana

Hi everyone.

I'm trying to train the default VGG16 implementation of MNC with the command
./experiments/scripts/mnc_5stage.sh 0 VGG16

However, after some iterations I run into an overflow error:

Error messages

/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:213: RuntimeWarning: invalid value encountered in multiply
  dfdxc * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:217: RuntimeWarning: invalid value encountered in multiply
  dfdw * np.exp(bottom[1].data[0, 4*c+2, h, w]) * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
  top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 22873 Floating point exception (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}
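
For context on what these warnings mean: np.exp saturates to inf once its argument exceeds the range of the blob's dtype (Caffe blobs are float32), and those infs then turn into NaNs in the following multiplications. A minimal NumPy demonstration of the mechanism (not MNC code):

import numpy as np

# np.exp overflows once its argument exceeds roughly 88.7 for float32
# (or 709.8 for float64); the resulting inf becomes NaN when multiplied by 0.
print(np.exp(np.float32(90.0)))             # inf -> "overflow encountered in exp"
print(np.float32('inf') * np.float32(0.0))  # nan -> "invalid value encountered in multiply"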


I saw in issue #22 that user @brisker ran into the same error when training MNC on his own dataset. The advice given there was to lower the learning rate. Lowering it also helped in my case, but even at 1/10th of the original learning rate the same problem occurs, just later in the training process. User @souryuu mentioned that he needed a learning rate 100x smaller to avoid the problem, which led to poorer performance of the final network (possibly because he ran for the same number of iterations rather than 100 times longer).
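
For anyone wanting to reproduce the workaround: the learning rate for this run lives in the Caffe solver file passed on the command line above (models/VGG16/mnc_5stage/solver.prototxt), in the base_lr field. The values below are purely illustrative, not the repository defaults:

# models/VGG16/mnc_5stage/solver.prototxt (excerpt, illustrative values only)
base_lr: 0.0001   # e.g. 10x smaller than a hypothetical default of 0.001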

Has anyone been able to run the training with the default learning rate provided by the authors without running into overflow problems? I'm simply trying to train the default implementation of the network on the default dataset, so I'd expect the default learning rate to work, no?

@JulianoLagana
Author

Even with a learning rate 100x smaller than the default I still get the same error (now even further into the optimization, around iteration 2370).

@leduckhc
Contributor

leduckhc commented Apr 4, 2017

Hi @JulianoLagana, did you manage to solve the problem? I tried clipping the arguments of all the np.exp expressions to a bounded range, but training still fails with signal 8: SIGFPE (floating point exception).

x = np.clip(x, -10, 10)  # bound the exponent so np.exp cannot overflow
np.exp(x)
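
One concrete place to apply such a clip is the bbox_transform.py line from the log above: clipping the log-scale deltas before exponentiating keeps the decoded box sizes finite. A sketch only (the clip constant is my assumption, not an MNC value):

import numpy as np

# Cap the predicted log-scale deltas so np.exp cannot overflow; the bound
# below limits box scaling to ~62x and is an assumption, not MNC's constant.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def safe_pred_sizes(dw, dh, widths, heights):
    dw = np.minimum(dw, BBOX_XFORM_CLIP)
    dh = np.minimum(dh, BBOX_XFORM_CLIP)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h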

@JulianoLagana
Author

Hi @leduckhc. No, unfortunately I didn't. These and other problems with this implementation led me to a different research direction. I hope you manage it, though.

@leduckhc
Contributor

leduckhc commented Apr 4, 2017

Hi @JulianoLagana. I just figured out that the weights of conv5_3 and the layers below it (conv5_{2,1}, conv4_{1,2,3}, etc.) contain NaNs, so the cause might be a bad initialization or loading of the network from the caffemodel. I'm going to examine it in more depth.

I inspected the blob values and layer weights with:

print {k: v.data for k, v in self.solver.net.blobs.items()}
print {k: v[0].data for k, v in self.solver.net.params.items()}
# v[0] is for weights, v[1] for biases
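
A more targeted check (a sketch, assuming a pycaffe net such as self.solver.net from the snippet above) is to scan every parameter blob for non-finite values:

import numpy as np

# Return (layer name, param index) pairs whose blobs contain NaN or inf.
# params[0] holds the weights, params[1] the biases, as above.
def find_bad_params(net):
    bad = []
    for name, params in net.params.items():
        for i, p in enumerate(params):
            if not np.all(np.isfinite(p.data)):
                bad.append((name, i))
    return bad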

@JulianoLagana
Author

JulianoLagana commented Apr 4, 2017 via email

@leduckhc
Contributor

leduckhc commented Apr 5, 2017

Check #53 for a solution.
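
For reference, freezing layers in Caffe is done by zeroing the learning-rate multipliers of those layers in the train prototxt; a generic sketch, not necessarily the exact change proposed in #53:

# train.prototxt (excerpt) -- lr_mult/decay_mult of 0 stops updates to this layer
layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  param { lr_mult: 0 decay_mult: 0 }   # weights frozen
  param { lr_mult: 0 decay_mult: 0 }   # biases frozen
  convolution_param { num_output: 64 kernel_size: 3 pad: 1 }
}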

@feichtenhofer

Freezing layers is not a solution.
