Introduce explicit parameter mapping #148
Conversation
@jonatanklosko I've just skimmed a bit, but my initial thoughts are that this is fantastic! It should also make adding new models significantly easier, and in general it will be much more manageable for us to maintain.
I will give a more in-depth review later.
Also, on requiring mappings, I don't think that's a big deal; HF actually requires mappings for non-PT models anyway, and we mostly end up hiding that fact from the user. It's a consequence of depending on an external community for models, but I don't think it's a bad one.
Fantastic work!
Both param names and values for corresponding layers may not match
exactly, so they require further transformations. For example, the
convolution `"kernel"` in Axon corresponds to a transposed `"weight"` in PyTorch.
I wonder if Axon should just make the switch to use `weight` everywhere like PyTorch. It would make the conversions a bit easier, though you'd still have to pay attention to doing the transpose in some cases.
I don't think we should change just because of PyTorch in this case; Flax does use `kernel`. Parameter naming in particular is not much hassle, especially since when we don't need to transpose, we just always look for `weight` instead of `kernel` by default.
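A rough sketch of that fallback (a hypothetical helper, not code from the PR):

```elixir
# Prefer the Flax/Axon-style "kernel" entry if present, otherwise fall back
# to the PyTorch-style "weight" entry for the same layer.
lookup_kernel = fn source_params, layer_name ->
  Map.get(source_params, layer_name <> ".kernel") ||
    Map.get(source_params, layer_name <> ".weight")
end

lookup_kernel.(%{"dense.weight" => Nx.iota({4, 4})}, "dense")
```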
@@ -148,6 +146,15 @@ defmodule Bumblebee.Layers.Decoder do
      end
    end

    defnp append_attention_cache(key, value, attention_cache, offset, _opts \\ []) do
      %{key: cached_key, value: cached_value} = attention_cache
      indices = [0, Nx.as_type(offset, {:s, 64}), 0, 0]
When is `offset` not s64?
Offset comes from the cache, which we pass as model input and get back as model output, so Axon would automatically convert it to float either way.
Ahh, makes sense. I have opened an issue in Axon to fix this: elixir-nx/axon#464
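For context, a self-contained sketch of the cache append under assumed shapes and names (not the exact PR code): because the cache round-trips through the model as an input and output, `offset` may come back as a float tensor, hence the cast before it is used as a slice index.

```elixir
defmodule CacheSketch do
  import Nx.Defn

  # Writes `entry` into `cache` at position `offset` along the sequence axis.
  # A {batch, seq_len, heads, head_size} layout is assumed for illustration.
  defn append(cache, entry, offset) do
    offset = Nx.as_type(offset, {:s, 64})
    Nx.put_slice(cache, [0, offset, 0, 0], entry)
  end
end

cache = Nx.broadcast(0.0, {1, 8, 2, 4})
entry = Nx.broadcast(1.0, {1, 1, 2, 4})
# Even if the offset arrives as 3.0 rather than 3, the cast makes it a valid index.
CacheSketch.append(cache, entry, Nx.tensor(3.0))
```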
Co-authored-by: Sean Moriarity <smoriarity.5@gmail.com>
Closes #71.
Advantages:
- For models that build on an existing base like `bert`, we just add the head layers.

Disadvantages: