Hello,

why is the embedded dropout mask weighted by `/ (1 - dropout)`?

hedwig/models/reg_lstm/embed_regularize.py, line 38 in commit 98634d3

I looked into the references in the corresponding paper, but Gal and Ghahramani (2016) do not use this scaling, and Merity et al. (2018) only state that they apply it. So I wonder what the idea behind this step is.

I found the answer. The scaling compensates for the change in magnitude of the activations when dropout is applied and is the standard way to implement dropout ("inverted dropout"): each element is kept with probability 1 - dropout, so dividing the kept elements by 1 - dropout leaves the expected value of every activation unchanged, and no rescaling is needed at test time.
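For context, here is a minimal sketch of what such an embedding-dropout step typically looks like in PyTorch. This is an illustration of the technique, not the verbatim hedwig code; the function name `embedded_dropout` and its signature are assumptions.

```python
import torch
import torch.nn.functional as F

def embedded_dropout(embed: torch.nn.Embedding, words: torch.Tensor,
                     dropout: float = 0.1) -> torch.Tensor:
    """Drop entire rows (word types) of the embedding matrix.

    Surviving rows are rescaled by 1 / (1 - dropout) (inverted dropout),
    so the expected magnitude of the looked-up embeddings matches the
    no-dropout case.
    """
    if dropout <= 0:
        return embed(words)
    keep = 1 - dropout
    # One Bernoulli(keep) draw per vocabulary entry, broadcast across the
    # embedding dimension, pre-divided by the keep probability so that
    # E[mask] = keep * (1 / keep) = 1.
    mask = embed.weight.new_empty((embed.weight.size(0), 1)).bernoulli_(keep) / keep
    return F.embedding(words, mask * embed.weight)
```

Note that the mask has one entry per word type rather than per element, which is what makes this "embedding dropout" in the sense of Gal and Ghahramani (2016): a word that is dropped is zeroed everywhere it occurs in the batch. The `/ keep` factor is exactly the `/ (1 - dropout)` term asked about above.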