Rename copy_initial_weights to something more intuitive, and replace copy with detach where appropriate. #54
Comments
@egrefen I want to help improve the naming and documentation for `copy_initial_weights`. I will start with something very simple like MAML, where the meta-parameters of the meta-learner are only the initialization of the base model (in the original paper, the weights of a 4-layer CNN with 32 filters). To clarify the semantics I will start by describing what I believe the correct behaviour is.

Notation: let `w^<t_o, t_i>` denote the weights of the base model at outer step `t_o` and inner step `t_i`; in particular `w^<t_o, 0>` is the weight before any inner-loop adaptation at outer step `t_o`. If we want MAML, the loop invariants we want to hold in higher are as follows. For the outer loop step/update (assuming it's SGD and not Adam) we want:

    w^<t_o+1, 0> = w^<t_o, 0> - lr_outer * grad_{w^<t_o, 0>} L_val( w^<t_o, T> )
i.e. after each outer time step we want the gradient to be with respect to the current initialization `w^<t_o, 0>`. For the inner loop step (assuming SGD) we want:

    w^<t_o, t_i+1> = w^<t_o, t_i> - lr_inner * grad_{w^<t_o, t_i>} L_train( w^<t_o, t_i> )
note that in the code we are likely to have the initialization params of the base model in a variable of their own. At the end of the outer step we want the base model's params to hold the freshly updated initialization:

    w^<t_o+1, 0>
and nothing else. The gradients accumulated while unrolling the inner loop should not linger anywhere once the outer step is done. At the end of the inner step we want the patched (functional) module to hold:

    w^<t_o, t_i+1>
assuming the graph from `w^<t_o, 0>` through every inner step is retained, so that the outer optimizer can differentiate through it. Now let's get back to the docs to figure out the intended semantics. The docs say:

> copy_initial_weights – if true, the weights of the patched module are copied to form the initial weights of the patched module, and thus are not part of the gradient tape when unrolling the patched module. If this is set to False, the actual module weights will be the initial weights of the patched module. This is useful when doing MAML, for example.
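The loop invariants above can be checked numerically on a toy problem. Below is a minimal sketch using a scalar "base model" `w` and squared losses, so the gradient through the unrolled inner step can be written by hand; all names (`lr_inner`, `lr_outer`, `t_train`, `t_val`) are illustrative, not from higher's API:

```python
# Scalar MAML sketch: one inner SGD step, then an outer SGD step whose
# gradient is taken w.r.t. the initialization w^<t_o, 0>, differentiating
# *through* the unrolled inner step (the invariant described above).

def inner_step(w, lr_inner, t_train):
    """One inner SGD step on L_train(w) = (w - t_train)**2."""
    grad = 2.0 * (w - t_train)
    return w - lr_inner * grad  # this is w^<t_o, t_i+1>

def outer_grad(w0, lr_inner, t_train, t_val):
    """d L_val(w1) / d w0, by the chain rule through the inner step."""
    w1 = inner_step(w0, lr_inner, t_train)
    dw1_dw0 = 1.0 - 2.0 * lr_inner        # derivative of inner_step w.r.t. w0
    return 2.0 * (w1 - t_val) * dw1_dw0

# One outer update: w^<t_o+1, 0> = w^<t_o, 0> - lr_outer * outer_grad(...)
w0, lr_inner, lr_outer = 0.0, 0.1, 0.5
g = outer_grad(w0, lr_inner, t_train=1.0, t_val=2.0)
w0_next = w0 - lr_outer * g
```

A finite-difference check on the outer loss confirms that `g` really is the gradient with respect to the initialization, which is exactly the invariant MAML needs.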
I think it's important to precisely define "initial weights". The first time I read that I thought it meant the initialization of the base model at the very start of training, i.e. `w^<0, 0>`, which cannot possibly be correct. So I will assume that by "initial weight" we mean the weight before an update to the outer loop. No other definition makes sense to me. So the initial weight is `w^<t_o, 0>`.
Now with the notation clear in my head, I think I can provide actionable feedback (I hope). Before I start giving more feedback, I think it's important to clarify what exactly "initial" means. I believe it's meant to be the weights after an update from the outer optimizer has been done (note that the outer optimization process is not unrolled or differentiated through). Thus, the loop invariant for the word "initial" should mean `w^<t_o, 0>` according to the notation I introduced.

Note that "initial" does not necessarily mean "initialization of the base model", which is another source of confusion, since that can also be its "initial weights". I believe that's true since (in principle) there are meta-learners that do not train the initialization (or the base model's "initial weights"; note the potential source of confusion). So the intended meaning of "initial", I believe, is `w^<t_o, 0>`.
To give concrete feedback on the current wording:

> if true, the weights of the patched module are copied to form the initial weights of the patched module
According to my previous clarification, I believe it means that the patched module (which really means the differentiable module, or the module as a functional object) gets the weight `w^<t_o, 0>` before inner-loop adaptation. To further clarify, I believe the word "copy" probably means "deep copy", so it's a separate set of weights. I am not sure why this is useful, but at least the terms are clear. If this is done, my suspicion is that the outer optimizer will not see the original weights as part of the forward computation (after unrolling the gradient path of the inner loop), and thus the outer gradients would be zero, as I've outlined here: #58 and seen in my own code. In the `True` case I'd be curious to know what would happen to the weights of a parametrized optimizer (as in the meta-LSTM optimizer paper by Ravi and Larochelle) if those are part of the meta-parameters. Next:

> and thus are not part of the gradient tape when unrolling the patched module.
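The suspicion above (outer gradients becoming zero when the initial weights are copied) can be sketched with the same scalar setup; the `copy` flag here is a stand-in for the copying behaviour being discussed, not higher's actual implementation:

```python
# If the inner loop starts from a constant snapshot of w0 (a deep copy cut
# from the graph), then d(start)/d(w0) = 0 and no gradient flows back to the
# initialization; if it starts from w0 itself, the MAML gradient is nonzero.

def inner_step(w, lr, t_train):
    return w - lr * 2.0 * (w - t_train)

def outer_grad(w0, lr, t_train, t_val, copy):
    dstart_dw0 = 0.0 if copy else 1.0   # copy severs the dependence on w0
    w1 = inner_step(w0, lr, t_train)
    dw1_dw0 = dstart_dw0 * (1.0 - 2.0 * lr)
    return 2.0 * (w1 - t_val) * dw1_dw0

g_copied = outer_grad(0.0, 0.1, 1.0, 2.0, copy=True)   # zero: init never trains
g_shared = outer_grad(0.0, 0.1, 1.0, 2.0, copy=False)  # nonzero: MAML behaviour
```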
I hope this doesn't seem pedantic, but since at some point I was confused that one could accidentally unroll both the outer and inner loop, I believe it's a useful reminder to say that only the inner loop is usually unrolled. Thus, "unrolling" means that we make the inner optimization part of the forward pass (and thus differentiate through it) when using the outer optimizer. The outer optimizer should disallow further chaining, because that's the standard way normal pytorch optimizers work. I've never heard the term "gradient tape" (and I've read a handful of learning-to-learn papers), so that remains ambiguous, but I believe it means the "unrolled path through the gradient operation in the inner loop optimization".
> If this is set to False, the actual module weights will be the initial weights of the patched module.

I guess what this means is that the weights of the differentiable module (i.e. patched module) will be `w^<t_o, 0>` and not `w^<t_o, 0>.deep_copy()`. Perhaps a better wording would be to say explicitly that with `False` the patched module's initial weights are the very same tensors as the original module's weights (no copy), so the unrolled inner loop differentiates all the way back to them.
Now, after going through the wording in more detail, I fail to appreciate what the use case for `copy_initial_weights=True` is. As a renaming suggestion, how about `copy_weights_before_inner_loop`? I suggest dropping the word "initial", since it is easily confused with "initialization" (e.g. it could also mean the initial weights, before inner-loop adaptation, of a parametrized optimizer). Or at least clarify what "initial" means, since I believe it means `w^<t_o, 0>`, i.e. the weights before the first inner-loop step and after the in-place outer-loop update. I realize this is hard because learning-to-learn and meta-learning have optimizer of optimizer of optimizer of gradient of gradient ad nauseam. That's a joke, but that's why this is difficult to explain cleanly. I hope this helps.
As pointed out in #30, the kwarg `copy_initial_weights` is hard to understand. First, we should investigate whether it's not sufficient just to detach when branching from the outer loop model when unrolling. Second, we should come up with a kwarg and docs which are more intuitive. Third, we should illustrate its use in a tutorial example.
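On the first point, a toy reverse-mode tape can illustrate why detaching at the branch point is gradient-equivalent to copying (both cut the graph at the branch, so no gradient reaches the original weights), while detach avoids duplicating the underlying data. All names below are illustrative sketches, not higher's or PyTorch's actual implementation:

```python
# Minimal reverse-mode tape: each Node records (parent, local_grad) edges,
# and backward() accumulates gradients along them.

class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # sequence of (parent_node, local_grad)
        self.grad = 0.0

    def detach(self):
        return Node(self.value)         # same value, no parents: graph is cut

    def copy(self):
        return Node(float(self.value))  # duplicated value, also no parents

def sub(a, b_const):
    """a - b_const, recorded on the tape."""
    return Node(a.value - b_const, parents=[(a, 1.0)])

def backward(node, seed=1.0):
    node.grad += seed
    for parent, local in node.parents:
        backward(parent, seed * local)

w = Node(3.0)
branch = w.detach()        # could equally be w.copy(): same gradient outcome
loss = sub(branch, 1.0)    # loss depends only on the branch
backward(loss)
# w.grad stays 0.0: the branch point blocks the flow either way
```

The difference between the two is therefore not in the gradients but in memory and aliasing: `detach` reuses the same underlying value, `copy` duplicates it.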