
Improve loading errors and add docs on HF repos #256

Merged 3 commits into main from jk-load-errors on Sep 27, 2023

Conversation

jonatanklosko (Member):

Also detects if a repository has only safetensors parameters and picks that up automatically.
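
From the caller's perspective nothing extra is needed (a hedged sketch; the repository name below is a placeholder for any repo that ships only `.safetensors` parameters):

```elixir
# Hypothetical safetensors-only repository name; load_model detects which
# params file the repo provides and loads it without any extra options.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "some-org/safetensors-only-model"})
```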

Comment on lines +513 to +515:

```elixir
cond do
  Map.has_key?(repo_files, @pytorch_params_filename) ->
    {@pytorch_params_filename, false}
```
jonatanklosko (Member Author):

@josevalim we are still defaulting to PyTorch, because I tried a couple repos and noticed that the safetensors parameters don't necessarily match the model architecture. I will investigate further and check the expected behaviour with HF folks.

Contributor:

cool, thank you!

grzuy (Contributor):

@jonatanklosko Nice!

Just curious, do you remember which specific HF repos you experienced that with?

jonatanklosko (Member Author):

@grzuy I tried a couple of repos and it was often the case, for example bert-base-cased. I figured out what it is: many models are configured with `tie_word_embeddings=True`, which means that two specific layers share the same parameters. In that case the PyTorch `.bin` files still include parameters for both layers, so loading works, but the `.safetensors` files include parameters for only one layer, which are then set for the other layer.

We haven't added explicit support for tied embeddings yet; once we do, I think it will work as expected.
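
As a rough illustration of the tying (a hypothetical sketch, not Bumblebee's actual loader code; the helper and the LM-head layer name are made up for this example), a loader could alias the embedding kernel into the language-modeling head when the checkpoint stores only one copy:

```elixir
defmodule TiedEmbeddingsSketch do
  # Hypothetical sketch: when the spec has tie_word_embeddings set, reuse the
  # embedding kernel for the LM head if the checkpoint stores only one copy.
  # "language_modeling_head.output" is illustrative, not the actual layer name.
  def maybe_tie_word_embeddings(params, %{tie_word_embeddings: true}) do
    case params["embeddings.word_embeddings"] do
      %{"kernel" => kernel} ->
        Map.put_new(params, "language_modeling_head.output", %{"kernel" => kernel})

      _ ->
        params
    end
  end

  def maybe_tie_word_embeddings(params, _spec), do: params
end
```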

grzuy (Contributor):

Oh, I see.

I guess it relates to this comment:

```elixir
# TODO: use a shared parameter with embeddings.word_embeddings.kernel
# if spec.tie_word_embeddings is true (relevant for training)
```

Cool, thanks for clarifying!

Contributor:

Is this partially what is needed?
#263


First, if the repository is clearly a fine-tuned version of another model, you can look for `tokenizer.json` in the original model repository. For example, [`textattack/bert-base-uncased-yelp-polarity`](https://huggingface.co/textattack/bert-base-uncased-yelp-polarity) only includes `tokenizer_config.json`, but it is a fine-tuned version of [`bert-base-uncased`](https://huggingface.co/bert-base-uncased), which does include `tokenizer.json`. Consequently, you can safely load the model from `textattack/bert-base-uncased-yelp-polarity` and the tokenizer from `bert-base-uncased`.

Otherwise, the Transformers library includes conversion rules to load a "slow tokenizer" and convert it to a corresponding "fast tokenizer", which is possible in most cases. You can generate the `tokenizer.json` file using [this tool](https://jonatanklosko-bumblebee-tools.hf.space/apps/tokenizer-generator). Once successful, you can follow the steps to submit a PR adding `tokenizer.json` to the model repository. Note that you do not have to wait for the PR to be merged, instead you can copy commit SHA from the PR and load the tokenizer with `Bumblebee.load_tokenizer({:hf, "model-repo", revision: "..."})`.
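
For example, the two patterns described above could look like this (a sketch using the standard Bumblebee loading functions; `"model-repo"` and `"PR_COMMIT_SHA"` are placeholders):

```elixir
# Model from the fine-tuned repository, tokenizer from the base repository
# it was fine-tuned from, since only the latter ships tokenizer.json.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "textattack/bert-base-uncased-yelp-polarity"})

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

# Tokenizer from a not-yet-merged PR, pinned to its commit SHA.
# "model-repo" and "PR_COMMIT_SHA" are placeholders for your case.
{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "model-repo", revision: "PR_COMMIT_SHA"})
```
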
Contributor:

Beautifully written docs. I am wondering if we should include the README as our front page on hexdocs.pm/bumblebee as well? Or maybe use the same trick we use in Livebook to include part of the README in the moduledocs.

jonatanklosko (Member Author):

Mirroring part of the README into docs sounds good!
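
One way to do the latter (a sketch of the README-into-moduledoc trick, assuming `<!-- MDOC !-->` markers are added around the relevant part of the README; the marker name is illustrative):

```elixir
defmodule Bumblebee do
  # Recompile when the README changes, and reuse the part of it between the
  # two MDOC markers as the module documentation shown on hexdocs.
  @external_resource "README.md"

  @moduledoc "README.md"
             |> File.read!()
             |> String.split("<!-- MDOC !-->")
             |> Enum.fetch!(1)

  # ...
end
```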

lib/bumblebee.ex Outdated
```elixir
    {@safetensors_params_filename, true}

  true ->
    raise "none of the expected parameters files found in the repository." <>
```
Contributor:

I think those should be `raise ArgumentError`, because it is often something wrong with the input or something that can be fixed by changing the input. Anyway, it is up to you.

jonatanklosko (Member Author):

Oh yeah, for these it makes sense.

I was also second-guessing that in some cases we return :ok/:error, which happens mostly for HTTP errors (repo not found, not authenticated, etc.). But if we raise in all cases then it should have a bang 🤷‍♂️
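
A hedged sketch of what this could look like after switching to `ArgumentError`, consolidating the two diff excerpts above (the function name, attribute values, and message are illustrative, not the actual change):

```elixir
defmodule ParamsFilenameSketch do
  # Standard Hugging Face filenames, used here as illustrative attribute values.
  @pytorch_params_filename "pytorch_model.bin"
  @safetensors_params_filename "model.safetensors"

  # Prefer the PyTorch params file, fall back to safetensors (the boolean
  # presumably flags the safetensors format), and raise ArgumentError when
  # neither is present, since that is an input problem the caller can fix.
  def infer_params_filename(repo_files) do
    cond do
      Map.has_key?(repo_files, @pytorch_params_filename) ->
        {@pytorch_params_filename, false}

      Map.has_key?(repo_files, @safetensors_params_filename) ->
        {@safetensors_params_filename, true}

      true ->
        raise ArgumentError,
              "none of the expected parameters files were found in the repository"
    end
  end
end
```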

josevalim (Contributor) left a comment:

The new docs and error messages are 🔥.

jonatanklosko merged commit 4623083 into main on Sep 27, 2023
2 checks passed
jonatanklosko deleted the jk-load-errors branch on September 27, 2023 16:10
grzuy mentioned this pull request on Oct 11, 2023