<a href="https://colab.research.google.com/github/arnaujc91/experiments/blob/main/EmbeddingDropout_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install fastai==2.0.16

In [2]:
from fastai.text.all import *

As the code below shows, the class `EmbeddingDropout` uses `emb` (`self.encoder` in the `AWD_LSTM` class) only to read its attributes: *weight, scale_grad_by_freq, norm_type*, etc. It is much simpler to subclass `nn.Embedding` instead: the attributes we need are then *already* part of the class, and we no longer have to create an instance of `nn.Embedding` and pass it to the constructor of `EmbeddingDropout`, as currently happens.

In [3]:
# CURRENT CODE
class EmbeddingDropout(Module):
    "Apply dropout with probability `embed_p` to an embedding layer `emb`."

    def __init__(self, emb, embed_p):
        # self.emb is an instance of the class 'nn.Embedding'
        self.emb,self.embed_p = emb,embed_p

    def forward(self, words, scale=None):
        if self.training and self.embed_p != 0:
            size = (self.emb.weight.size(0),1)
            mask = dropout_mask(self.emb.weight.data, size, self.embed_p)
            masked_embed = self.emb.weight * mask
        else: masked_embed = self.emb.weight
        if scale: masked_embed.mul_(scale)
        return F.embedding(words, masked_embed, ifnone(self.emb.padding_idx, -1), self.emb.max_norm,
                           self.emb.norm_type, self.emb.scale_grad_by_freq, self.emb.sparse)
        
# MY PROPOSAL
class EmbeddingDropout(nn.Embedding):
    "Apply dropout with probability `embed_p` to an embedding layer `emb`."
    def __init__(self, *args, embed_p, **kwargs):
        # Instead of receiving a previously created instance of 'nn.Embedding',
        # we inherit directly from 'nn.Embedding', so what used to be 'self.emb' is now simply 'self'.
        # This avoids the redundancy of first creating an 'nn.Embedding' instance
        # and then passing it as an argument to the constructor.
        super().__init__(*args, **kwargs)
        self.embed_p = embed_p

    def forward(self, words, scale=None):
        if self.training and self.embed_p != 0:
            size = (self.weight.size(0),1)
            mask = dropout_mask(self.weight.data, size, self.embed_p)
            masked_embed = self.weight * mask
        else: masked_embed = self.weight
        if scale: masked_embed.mul_(scale)
        return F.embedding(words, masked_embed, ifnone(self.padding_idx, -1), self.max_norm,
                       self.norm_type, self.scale_grad_by_freq, self.sparse)
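
The row-wise masking used by both versions can be looked at in isolation. The following is a minimal sketch in plain PyTorch, assuming that fastai's `dropout_mask` draws a Bernoulli mask and rescales it by `1/(1-p)`; the helper `row_dropout_mask` is a hypothetical stand-in, not the fastai function itself.

```python
import torch
from torch import nn

def row_dropout_mask(weight, p):
    # Hypothetical stand-in for fastai's `dropout_mask`: one Bernoulli draw per
    # embedding row, rescaled so that the expected value of the weights is unchanged.
    return weight.new_empty(weight.size(0), 1).bernoulli_(1 - p).div_(1 - p)

torch.manual_seed(0)
emb = nn.Embedding(3, 5, padding_idx=1)
p = 0.5
mask = row_dropout_mask(emb.weight.data, p)  # shape (3, 1): one entry per token
masked_weight = emb.weight * mask            # broadcast over the embedding dimension
print(mask.squeeze(1))
```

Because the mask has shape `(vocab_sz, 1)`, a dropped token loses its whole embedding vector rather than individual components.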


**IMPORTANT**: I guess it was written this way so that people could load their own pretrained encoders/embeddings. You can still load a pretrained encoder as long as it is an instance of the `EmbeddingDropout` class that I just defined. The problem is that, as far as I understand, an encoder that is a plain `nn.Embedding` instance cannot be plugged in directly. This may break compatibility with PyTorch.
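
One possible way around this limitation, sketched under the assumption that the subclass adds no parameters of its own, is to copy the weights of a plain `nn.Embedding` through its `state_dict`. The class `DropEmbedding` below is a hypothetical, stripped-down stand-in for the proposed subclass (the dropout logic is left out).

```python
import torch
from torch import nn

class DropEmbedding(nn.Embedding):
    # Hypothetical stand-in for the proposed subclass: it only stores `embed_p`.
    def __init__(self, *args, embed_p, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed_p = embed_p

pretrained = nn.Embedding(3, 5, padding_idx=1)        # e.g. an encoder trained elsewhere
enc = DropEmbedding(3, 5, padding_idx=1, embed_p=0.1)

# Both modules expose a single parameter named "weight", so the keys match.
enc.load_state_dict(pretrained.state_dict())
same = torch.equal(enc.weight, pretrained.weight)
print(same)
```

Note that this copies the values: the two modules do not share the same parameter tensor afterwards, so this is not a drop-in replacement for passing an existing `nn.Embedding` instance.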

## What is the problem with the current code?

### 1. First issue

First of all, the function `flatten_model`, which is used to create the hooks for a given model, does not work as expected.

In [4]:
awd_lstm =  AWD_LSTM(vocab_sz=3,
                  emb_sz=5,
                  n_hid=6,
                  n_layers=2)

In [5]:
awd_lstm

AWD_LSTM(
  (encoder): Embedding(3, 5, padding_idx=1)
  (encoder_dp): EmbeddingDropout(
    (emb): Embedding(3, 5, padding_idx=1)
  )
  (rnns): ModuleList(
    (0): WeightDropout(
      (module): LSTM(5, 6, batch_first=True)
    )
    (1): WeightDropout(
      (module): LSTM(6, 5, batch_first=True)
    )
  )
  (input_dp): RNNDropout()
  (hidden_dps): ModuleList(
    (0): RNNDropout()
    (1): RNNDropout()
  )
)

The following output shows that the `Embedding` layer is **duplicated**.

In [6]:
modules = flatten_model(awd_lstm); modules

[Embedding(3, 5, padding_idx=1),
 Embedding(3, 5, padding_idx=1),
 LSTM(5, 6, batch_first=True),
 ParameterModule(),
 LSTM(6, 5, batch_first=True),
 ParameterModule(),
 RNNDropout(),
 RNNDropout(),
 RNNDropout()]

This happens because `flatten_model` walks through all the layers and checks whether they have children. The first layer, `encoder`, has no children, but the second layer, `encoder_dp`, does, and its only child is precisely `encoder`:

In [7]:
print('encoder has children: ', awd_lstm.encoder.has_children )
print('encoder_dp has children: ' ,awd_lstm.encoder_dp.has_children )

encoder has children:  False
encoder_dp has children:  True


And because the only child of `encoder_dp` is `encoder`, this layer appears twice in the output of `flatten_model`.

In [8]:
next(awd_lstm.encoder_dp.children()) == awd_lstm.encoder

True
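
The duplication can be reproduced without fastai. In the sketch below, `flatten` is a simplified, hypothetical re-implementation of the recursive idea behind `flatten_model` (it ignores parameters held directly by a module), and `Wrapper` plays the role of the current `EmbeddingDropout`.

```python
from torch import nn

def flatten(m):
    # Simplified, hypothetical version of the recursion: return the leaf modules of `m`.
    kids = list(m.children())
    return sum((flatten(c) for c in kids), []) if kids else [m]

class Wrapper(nn.Module):
    # Stores an existing module as an attribute, which registers it as a child.
    def __init__(self, emb):
        super().__init__()
        self.emb = emb

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(3, 5, padding_idx=1)
        self.encoder_dp = Wrapper(self.encoder)  # same object, reachable via a second path

toy = Toy()
leaves = flatten(toy)
print(leaves)
```

The same `Embedding` object is reached once directly and once through `encoder_dp`, so it is listed twice.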

### 2. Second issue

The output of `flatten_model` does not contain the `EmbeddingDropout` layer. This is a problem because the forward method of `AWD_LSTM` calls the forward method of `EmbeddingDropout`, not the one of `nn.Embedding`. As a consequence, hooks registered on the `Embedding` layer are never fired!

In [9]:
def hook_fn(m, i, o):
  print(f"Working for layer: -- {m._get_name()} --\n")

In [10]:
awd_lstm.encoder.register_forward_hook(hook_fn)
awd_lstm(torch.randint(3, (1,4)))

tensor([[[-0.0436, -0.0949, -0.1038,  0.0582,  0.0075],
         [-0.0652, -0.1297, -0.1486,  0.0671,  0.0088],
         [-0.0738, -0.1497, -0.1684,  0.0567,  0.0135],
         [-0.0813, -0.1592, -0.1754,  0.0476,  0.0141]]],
       grad_fn=<TransposeBackward0>)

Even though I explicitly hooked the layer `encoder`, its hook does not fire because its forward method is never called inside the forward method of `AWD_LSTM`.
Instead, `AWD_LSTM` calls the forward method of `encoder_dp`:

In [11]:
awd_lstm.encoder_dp.register_forward_hook(hook_fn)
awd_lstm(torch.randint(3, (1,4)))

Working for layer: -- EmbeddingDropout --



tensor([[[-0.1072, -0.1707, -0.1844,  0.0398,  0.0364],
         [-0.1255, -0.1713, -0.1871,  0.0233,  0.0520],
         [-0.1366, -0.1730, -0.1855,  0.0079,  0.0611],
         [-0.1435, -0.1745, -0.1821, -0.0045,  0.0654]]],
       grad_fn=<TransposeBackward0>)
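
The behaviour seen in the two cells above follows from how PyTorch dispatches forward hooks: they run when the module itself is called, not when its weight is used through the functional API. A minimal sketch in plain PyTorch (no fastai needed; the names `fired` and `idx` are only illustrative):

```python
import torch
import torch.nn.functional as F
from torch import nn

emb = nn.Embedding(3, 5, padding_idx=1)
fired = []
handle = emb.register_forward_hook(lambda m, i, o: fired.append(type(m).__name__))

idx = torch.tensor([[0, 2, 2, 1]])

# Functional call that only reads `emb.weight`: the module is never called.
F.embedding(idx, emb.weight, emb.padding_idx)
after_functional = list(fired)

# Calling the module itself goes through `nn.Module.__call__`, which runs the hooks.
emb(idx)
after_call = list(fired)

handle.remove()
print(after_functional, after_call)
```

Any code path that passes `emb.weight` to `F.embedding` therefore bypasses the hooks registered on `emb`, whether that code lives in a wrapper module or in a plain function.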

## Solutions

I see only two possible solutions:
1. Modify `flatten_model`
2. Modify `EmbeddingDropout`
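
For completeness, the first option could look roughly like the sketch below. It is only an illustration of the idea: `flatten_unique` is a hypothetical helper, not fastai's `flatten_model`, and it ignores parameters held directly by a module.

```python
from torch import nn

def flatten_unique(m, seen=None):
    "Hypothetical leaf-flattening helper that skips modules it has already visited."
    seen = set() if seen is None else seen
    kids = list(m.children())
    if kids:
        return sum((flatten_unique(c, seen) for c in kids), [])
    if id(m) in seen:
        return []          # same object reached through another path
    seen.add(id(m))
    return [m]

class Wrapper(nn.Module):
    def __init__(self, emb):
        super().__init__()
        self.emb = emb

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(3, 5, padding_idx=1)
        self.encoder_dp = Wrapper(self.encoder)
        self.head = nn.Linear(5, 2)

toy = Toy()
leaves = flatten_unique(toy)
print(leaves)
```

Such a change would only remove the duplicate entry. It would not make a hook registered on `encoder` fire, because the wrapper still never calls `encoder` itself.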

The modification I suggested above (subclassing `nn.Embedding`) suffers from the following problem:
- If someone wants to load a pretrained `encoder` layer, it has to be an instance of the `EmbeddingDropout` class I just defined, which breaks compatibility with PyTorch.

Instead, I suggest the following solution:

- Downgrade `EmbeddingDropout` from a class to a function.

The next cell shows how I would modify the code, with the aim of getting the hooks fired while keeping `encoder` an instance of `nn.Embedding`:

In [12]:
from functools import partial

def EmbeddingDropout(emb, embed_p, words, training, scale=None):
    "Apply dropout with probability `embed_p` to an embedding layer."
    print(words)  # debugging aid: prints the input token indices
    if training and embed_p != 0:
        size = (emb.weight.size(0),1)
        mask = dropout_mask(emb.weight.data, size, embed_p)
        masked_embed = emb.weight * mask
    else: masked_embed = emb.weight
    if scale: masked_embed.mul_(scale)
    return F.embedding(words, masked_embed, ifnone(emb.padding_idx, -1), emb.max_norm,
                        emb.norm_type, emb.scale_grad_by_freq, emb.sparse)
        
class AWD_LSTM(Module):
    "AWD-LSTM inspired by https://arxiv.org/abs/1708.02182"
    initrange=0.1

    def __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token=1, hidden_p=0.2, input_p=0.6, embed_p=0.1,
                 weight_p=0.5, bidir=False):
        store_attr('emb_sz,n_hid,n_layers,pad_token')
        self.bs = 1
        self.n_dir = 2 if bidir else 1
        self.encoder = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)
        # BEFORE: self.encoder_dp = EmbeddingDropout(self.encoder, embed_p)
        self.encoder_dp = partial(EmbeddingDropout, self.encoder, embed_p)
        self.encoder.weight.data.uniform_(-self.initrange, self.initrange)
        self.rnns = nn.ModuleList([self._one_rnn(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz)//self.n_dir,
                                                 bidir, weight_p, l) for l in range(n_layers)])
        self.input_dp = RNNDropout(input_p)
        self.hidden_dps = nn.ModuleList([RNNDropout(hidden_p) for l in range(n_layers)])
        self.reset()


    def forward(self, inp, from_embeds=False):
        bs,sl = inp.shape[:2] if from_embeds else inp.shape
        if bs!=self.bs: self._change_hidden(bs)

        # BEFORE: output = self.input_dp(inp if from_embeds else self.encoder_dp(inp))
        output = self.input_dp(inp if from_embeds else self.encoder_dp(inp, self.training))
        new_hidden = []
        for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
            output, new_h = rnn(output, self.hidden[l])
            new_hidden.append(new_h)
            if l != self.n_layers - 1: output = hid_dp(output)
        self.hidden = to_detach(new_hidden, cpu=False, gather=False)
        return output

    def _change_hidden(self, bs):
        self.hidden = [self._change_one_hidden(l, bs) for l in range(self.n_layers)]
        self.bs = bs

    def _one_rnn(self, n_in, n_out, bidir, weight_p, l):
        "Return one of the inner rnn"
        rnn = nn.LSTM(n_in, n_out, 1, batch_first=True, bidirectional=bidir)
        return WeightDropout(rnn, weight_p)

    def _one_hidden(self, l):
        "Return one hidden state"
        nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz) // self.n_dir
        return (one_param(self).new_zeros(self.n_dir, self.bs, nh), one_param(self).new_zeros(self.n_dir, self.bs, nh))

    def _change_one_hidden(self, l, bs):
        if self.bs < bs:
            nh = (self.n_hid if l != self.n_layers - 1 else self.emb_sz) // self.n_dir
            return tuple(torch.cat([h, h.new_zeros(self.n_dir, bs-self.bs, nh)], dim=1) for h in self.hidden[l])
        if self.bs > bs: return (self.hidden[l][0][:,:bs].contiguous(), self.hidden[l][1][:,:bs].contiguous())
        return self.hidden[l]

    def reset(self):
        "Reset the hidden states"
        [r.reset() for r in self.rnns if hasattr(r, 'reset')]
        self.hidden = [self._one_hidden(l) for l in range(self.n_layers)]

In [13]:
awd_lstm_new =  AWD_LSTM(vocab_sz=3,
                  emb_sz=5,
                  n_hid=6,
                  n_layers=2)

1. Now there is no duplication of layers:

In [14]:
modules = flatten_model(awd_lstm_new); modules

[Embedding(3, 5, padding_idx=1),
 LSTM(5, 6, batch_first=True),
 ParameterModule(),
 LSTM(6, 5, batch_first=True),
 ParameterModule(),
 RNNDropout(),
 RNNDropout(),
 RNNDropout()]
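
The duplicate disappears because a `functools.partial` object is not an `nn.Module`, so assigning it to an attribute does not register a child module. The sketch below shows this with a hypothetical `Holder` module and a simplified `lookup` function that stands in for the functional `EmbeddingDropout` (the dropout step is omitted).

```python
from functools import partial

import torch
import torch.nn.functional as F
from torch import nn

def lookup(emb, embed_p, words, training):
    # Hypothetical, simplified stand-in: `embed_p` and `training` are unused here.
    return F.embedding(words, emb.weight, emb.padding_idx)

class Holder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(3, 5, padding_idx=1)
        # The first two positional arguments are bound now;
        # `words` and `training` are supplied at call time.
        self.encoder_dp = partial(lookup, self.encoder, 0.1)

holder = Holder()
child_names = [name for name, _ in holder.named_children()]
out = holder.encoder_dp(torch.tensor([[0, 2, 2, 1]]), holder.training)
print(child_names, tuple(out.shape))
```

Only `encoder` is registered as a child, which is why a recursive traversal of the module tree finds the `Embedding` layer exactly once.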

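2. A forward hook can be registered on the `encoder` layer. Note, however, that the output below contains only the tensor printed by `print(words)` and no `Working for layer` message: the function passes `emb.weight` straight to `F.embedding` and never calls `encoder` itself, so the hook registered on `encoder` does not fire here.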

In [15]:
awd_lstm_new.encoder.register_forward_hook(hook_fn)
awd_lstm_new(torch.randint(3, (1,4)))

tensor([[0, 2, 2, 1]])


tensor([[[ 0.0145, -0.0343, -0.1015, -0.0362,  0.1925],
         [ 0.0103, -0.0716, -0.1456, -0.0621,  0.2494],
         [ 0.0056, -0.0966, -0.1682, -0.0831,  0.2669],
         [ 0.0029, -0.1107, -0.1802, -0.0990,  0.2719]]],
       grad_fn=<TransposeBackward0>)

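I think the functionality is now the same and remains compatible with PyTorch, because `encoder` is still a plain `nn.Embedding`. This removes the duplication in `flatten_model`, although the hook behaviour should be double-checked, since the last output shows no `Working for layer` message. If you like this solution I will create a new PR.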

**NOTE**: The tests will still not pass because `encoder_dp` is now a function rather than a class.