Introduce explicit parameter mapping #148

jonatanklosko · 2023-01-18T18:28:53Z

Closes #71.

Advantages:

This removes a lot of duplication between models. For example, all of the models now use the same attention implementation, and even the same transformer block.
Certain models looked very different even though they were mostly the same. That's because their reference implementations applied the same concepts but implemented them independently (naming, structuring and nesting the layers differently). Now the models are much more consistent, both in layer nesting and naming, which makes it much easier to see the similarities and differences.
We no longer need to follow an artificial layer nesting that stems from the class composition in the Python implementations. For example, when building a model with a head, we don't need to prefix the base model with something like bert., we just add the head layers.
We have full control over the parameters, which makes us much less coupled to the original implementation details. For example, the original GPT2 implementation uses conv1d layers for attention (in terms of implementation, those are essentially transposed dense layers). The parameter mapping is flexible enough that we actually slice the conv1d kernels into dense layer kernels and use the same attention implementation as all other models.
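
To illustrate the kernel slicing mentioned in the last point, here is a minimal sketch. It assumes the fused attention kernel has shape {in, 3 * out}, with the query, key and value parts laid side by side along the output axis. The module and function names are hypothetical and are not Bumblebee's API; kernels are plain lists of rows so the snippet has no dependencies.

```elixir
defmodule FusedKernelSketch do
  # Hypothetical helper: splits a fused kernel of shape {in, 3 * out}
  # into three kernels of shape {in, out}, in query/key/value order.
  def split_qkv(kernel) do
    [first_row | _] = kernel
    out = div(length(first_row), 3)

    for index <- 0..2 do
      # Take columns [index * out, (index + 1) * out) from every row.
      Enum.map(kernel, fn row -> Enum.slice(row, index * out, out) end)
    end
  end
end

# A fused kernel with 2 input features and 3 * 2 output features.
fused = [
  [1, 2, 3, 4, 5, 6],
  [7, 8, 9, 10, 11, 12]
]

[query, key, value] = FusedKernelSketch.split_qkv(fused)
IO.inspect(query, label: "query")
IO.inspect(key, label: "key")
IO.inspect(value, label: "value")
```

Each resulting kernel can then be loaded into a regular dense layer, which is what allows one attention implementation to be shared across models.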

Disadvantages:

A single file doesn't contain all of the model. This already wasn't the case and I don't consider it a major issue. Also, we could still keep each model in a single file and use consistent naming/implementation, but again I don't think we should.
Each model needs an explicit mapping to hf/transformers parameters. This also implies we use different parameter names than hf/transformers.
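
As a rough illustration of what an explicit mapping could look like, the sketch below maps our layer names to source layer names and expands a `{n}` placeholder per block. The module, the `{n}` convention as written here, and every layer name are assumptions made for the example, not the actual mapping from this PR.

```elixir
defmodule ParamsMappingSketch do
  # Hypothetical mapping from our layer names to source layer names.
  # "{n}" stands for a block index; all names are illustrative only.
  @mapping %{
    "embedder.token_embedding" => "embeddings.word_embeddings",
    "encoder.blocks.{n}.self_attention.query" => "encoder.layer.{n}.attention.self.query"
  }

  def mapping, do: @mapping

  # Expands every "{n}" template for block indices 0..num_blocks - 1.
  def expand(mapping, num_blocks) do
    for {target, source} <- mapping,
        {expanded_target, expanded_source} <- expand_entry(target, source, num_blocks),
        into: %{},
        do: {expanded_target, expanded_source}
  end

  defp expand_entry(target, source, num_blocks) do
    if String.contains?(target, "{n}") do
      # The explicit step keeps the range empty when num_blocks is 0.
      for n <- 0..(num_blocks - 1)//1 do
        index = Integer.to_string(n)
        {String.replace(target, "{n}", index), String.replace(source, "{n}", index)}
      end
    else
      [{target, source}]
    end
  end
end

ParamsMappingSketch.mapping()
|> ParamsMappingSketch.expand(2)
|> IO.inspect(label: "expanded mapping")
```

Keeping the mapping as data means the model definition itself is free to use whatever layer names and nesting make sense on our side.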

seanmor5

@jonatanklosko I've just skimmed a bit, but my initial thoughts are that this is fantastic! It should also make adding new models significantly easier, and in general is much more manageable for us to maintain.

I will give a more in-depth review later

seanmor5 · 2023-01-18T18:57:56Z

Also, on requiring mappings, I don't think that's a big deal. HF actually requires mappings for non-PT models anyway, and we mostly end up hiding that fact from the user. It's a consequence of depending on an external community for models, but I don't think it's a bad one.

josevalim · 2023-01-18T19:25:58Z

Fantastic work!

seanmor5 · 2023-01-24T01:33:29Z

lib/bumblebee/huggingface/transformers/model.ex

+
+  Both param names and values for corresponding layers may not match
+  exactly, so they require further transformations. For example, the
+  convolution `"kernel"` in Axon corresponds to a transposed `"weight"`


I wonder if Axon should just make the switch to use weight everywhere like PyTorch. It would make the conversions a bit easier, though you'd still have to pay attention to doing the transpose in some cases

jonatanklosko · Jan 24, 2023

I don't think we should change just because of PyTorch in this case; Flax does use kernel. Parameter naming in particular is not much of a hassle, especially since, when we don't need to transpose, we just look for weight instead of kernel by default.
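
The default lookup described in this reply could be sketched roughly as follows. This is an assumption-laden illustration, not the code from the PR: the module, the `overrides` shape, and the helper names are hypothetical, and tensors are plain lists of rows.

```elixir
defmodule ParamLookupSketch do
  # Hypothetical default renames from our param names to source names.
  @default_names %{"kernel" => "weight"}

  # Fetches `param_name` from the source layer params. `overrides` maps
  # a param name to {source_name, transform_fun} for the cases where the
  # value needs more than a rename (for example a transpose).
  def fetch_param(source_params, param_name, overrides \\ %{}) do
    default_name = Map.get(@default_names, param_name, param_name)
    identity = fn value -> value end

    {source_name, transform} = Map.get(overrides, param_name, {default_name, identity})

    source_params |> Map.fetch!(source_name) |> transform.()
  end

  # Transposes a matrix stored as a list of rows.
  def transpose(rows), do: rows |> Enum.zip() |> Enum.map(&Tuple.to_list/1)
end

source = %{"weight" => [[1, 2, 3], [4, 5, 6]], "bias" => [0, 0]}

# No transpose needed: "kernel" falls back to the source "weight".
IO.inspect(ParamLookupSketch.fetch_param(source, "kernel"), label: "kernel")

# Transpose needed: the layer overrides the lookup with a transform.
overrides = %{"kernel" => {"weight", &ParamLookupSketch.transpose/1}}
IO.inspect(ParamLookupSketch.fetch_param(source, "kernel", overrides), label: "transposed")
```

With a default like this, only the layers whose values actually differ in layout need an explicit entry.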

lib/bumblebee/huggingface/transformers/model.ex

lib/bumblebee/layers.ex

seanmor5 · 2023-01-24T01:45:23Z

lib/bumblebee/layers/decoder.ex

@@ -148,6 +146,15 @@ defmodule Bumblebee.Layers.Decoder do
    end
  end

+  defnp append_attention_cache(key, value, attention_cache, offset, _opts \\ []) do
+    %{key: cached_key, value: cached_value} = attention_cache
+    indices = [0, Nx.as_type(offset, {:s, 64}), 0, 0]


When is offset not s64?

jonatanklosko · Jan 24, 2023

The offset comes from the cache, which we pass as model input and get back as model output, so Axon would automatically convert it to float either way.

seanmor5 · Jan 24, 2023

Ahh, makes sense. I have opened this issue in Axon to fix this: elixir-nx/axon#464

Co-authored-by: Sean Moriarity <smoriarity.5@gmail.com>

Introduce explicit parameter mapping

54c1917

seanmor5 reviewed Jan 18, 2023

josevalim approved these changes Jan 18, 2023

Up

951bdb7

seanmor5 reviewed Jan 24, 2023

lib/bumblebee/huggingface/transformers/model.ex (outdated)

seanmor5 reviewed Jan 24, 2023

lib/bumblebee/layers.ex (outdated)

seanmor5 reviewed Jan 24, 2023

jonatanklosko and others added 2 commits January 24, 2023 11:10

Update lib/bumblebee/huggingface/transformers/model.ex

8e48330

Co-authored-by: Sean Moriarity <smoriarity.5@gmail.com>

embeddings -> learned_embeddings

6f03b24

jonatanklosko merged commit b4bada2 into main Jan 26, 2023

jonatanklosko deleted the jk-params-mapping branch January 26, 2023 12:29
