
convert-hf : save memory with lazy evaluation #7075

Merged
merged 26 commits into master on May 8, 2024

Conversation

compilade
Collaborator

@compilade compilade commented May 4, 2024

Continued from #7031 so that all shape-changing operations are made with PyTorch.

This was motivated by some remarks about convert-hf-to-gguf.py using too much RAM and/or convert.py being better for Llama models because it uses lazy tensors.

This fixes the RAM usage problem by bringing lazy conversion for ALL model architectures supported by convert-hf-to-gguf.py (assuming none were broken with #7031).

Summary of changes

  • All changes from convert-hf : reduce repeated boilerplate from write_tensors #7031
  • Way less RAM usage, by default. Like, waaaay less.
  • This uses PyTorch's meta device as well as __torch_function__ overrides and __getattr__ switcheroos, wrapped in LazyTorchTensor (a simplified sketch of the idea follows this list)
    • LazyTorchTensors form a tree of tensors, so the eager evaluation is shared when multiple tensors are split from one
    • The goal is that ALL PyTorch operations can be used normally in modify_tensors, so the same code works BOTH in lazy evaluation mode and when eagerly evaluating (as before). This keeps the semantics simpler for everyone.
    • (BTW thanks @heiner for showing that sending a tensor .to("meta") works in "Script to convert Grok-1 weights from raw JAX pickle files" #7058; it wasn't really clear from PyTorch's documentation)
  • Add the LazyTensor class in gguf_writer.py to abstract away how the tensor is actually calculated
    • This helps avoid making the gguf-py library depend on PyTorch.
    • Wraps basic Numpy operations like astype(), byteswap(), and tofile()
  • convert-hf-to-gguf.py has a new flag, --no-lazy, to let the user opt out of lazy evaluation (same behavior as before: accumulating everything in RAM before starting to write the output file)
  • Since the modify_tensors step now happens very very quickly, most time is spent writing to disk. I've added a progress bar in GGUFWriter.write_tensors_to_file (which is only enabled in convert-hf-to-gguf.py, otherwise it's disabled by default).
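To illustrate the idea, here is a deliberately simplified sketch (not the PR's actual LazyTorchTensor/LazyBase code, and it ignores the __getattr__ part): every intercepted torch operation is applied right away to a "meta" tensor to track shapes and dtypes, while the real computation is captured in a closure that only runs when the data is actually needed.

import torch

class LazySketch:
    # Simplified sketch only; names and structure are illustrative, not the PR's code.
    def __init__(self, meta: torch.Tensor, eager_fn):
        self.meta = meta          # lives on the meta device: shape/dtype, no data
        self.eager_fn = eager_fn  # zero-argument closure producing the real tensor

    @classmethod
    def from_eager(cls, t: torch.Tensor) -> "LazySketch":
        return cls(t.to("meta"), lambda: t)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Only flat positional args are handled here; the real class recurses into nested structures.
        meta_args = tuple(a.meta if isinstance(a, cls) else a for a in args)
        return cls(
            func(*meta_args, **kwargs),  # cheap: runs on meta tensors, tracks the resulting shape
            lambda: func(*tuple(a.eager_fn() if isinstance(a, cls) else a for a in args), **kwargs),
        )

    def to_eager(self) -> torch.Tensor:
        return self.eager_fn()

lazy = LazySketch.from_eager(torch.ones(2, 3))
reshaped = torch.reshape(lazy, (3, 2))  # dispatched through __torch_function__
print(reshaped.meta.shape)              # torch.Size([3, 2]); no real data touched yet
print(reshaped.to_eager())              # the reshape actually runs here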

Testing

Everything should pretty much Just Work (TM), but some models use special PyTorch operators and need more careful testing.

Ideally, there should be no fatal failure, and the resulting model should be the same when lazy (by default) and when using --no-lazy.
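One simple way to check that equivalence is to hash both outputs (a hedged sketch; the two .gguf paths below are placeholders for two runs of the converter, one with --no-lazy):

import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# e.g. after converting the same model twice, with and without --no-lazy:
print(sha256_of("lazy.gguf") == sha256_of("eager.gguf"))  # expected: True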

(@bartowski1182, @fairydreaming, @phymbert, @arki05, @slaren, @ggerganov, this will most likely benefit you all because I'm sure you've dealt with converting bigger models than your RAM before)

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.
`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".
It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.
…tion

There are no abstract methods used anyway,
so using ABC isn't really necessary.
@compilade added the enhancement (New feature or request) and need feedback (Testing and feedback with results are needed) labels on May 4, 2024
@bartowski1182
Contributor

This looks awesome! A lot of great changes!

Any differences in performance if RAM capacity isn't a bottleneck? Or is the speed the same with or without?

If unknown I can also test

@compilade
Collaborator Author

Any differences in performance if RAM capacity isn't a bottleneck? Or is the speed the same with or without?

I don't know, since on my machine RAM capacity is a bottleneck, so lazy conversion is always better there.

I think the performance shouldn't be that different if there's enough RAM. It's still reading, calculating, and writing the same things. Either everything is first read, computed, and logged, kept in RAM, then all written (with --no-lazy and/or the current state of master), or (with lazy conversion) all metadata is read and logged, then each tensor's data is read, computed, and written, one after the other. Both lazy and non-lazy conversion should result in the exact same file.

I personally think the UX (user experience) is better with lazy conversion, because the (new) progress bar shown when writing the tensors gives a good enough ETA, while in non-lazy conversion (with --no-lazy), the progress bar only appears at the end when flushing everything to the output file.

If unknown I can also test

Yes, this would be appreciated. This way we'll know whether or not there's an overhead with lazy conversion vs non-lazy conversion. Not sure how to get a consistent measurement though, with filesystem caches and stuff like that. (though lazy conversion has an advantage there, since it leaves more RAM for the filesystem cache)

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.
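For illustration (a hedged sketch: the file name follows the usual Hugging Face convention for multi-part safetensors checkpoints, and the "weight_map" layout shown is the typical index-file structure), the tensor names can be read from the index JSON without touching any tensor data:

import json

with open("model.safetensors.index.json", "r", encoding="utf-8") as f:
    index = json.load(f)

weight_map = index["weight_map"]               # tensor name -> part file name
tensor_names = set(weight_map.keys())          # known up-front, no tensor data loaded
part_files = sorted(set(weight_map.values()))  # e.g. model-00001-of-00042.safetensors, ...
print(f"{len(tensor_names)} tensors across {len(part_files)} parts")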
@bartowski1182
Contributor

It does seem likely that there shouldn't be much, if any, performance hit, since realistically there's no good reason to have ALL tensors in memory at once

While doing this, any insight into converting bf16 to fp16 by first going to FP32? Feels like it should be a straightforward process, but I haven't dived into the black box yet

@Galunid
Collaborator

Galunid commented May 7, 2024

@ggerganov Once this is merged, is there anything preventing us from getting rid of convert.py? I assume people switched to llama-3 already so there's little reason to try converting llama/llama2. Not to mention it doesn't have the recent tokenizer changes, which kinda makes it useless. It'd be good to have 1 script that needs maintenance rather than 2. It would also avoid confusion from the user perspective.

@ggerganov
Owner

@Galunid Yes, it makes sense to remove it.

@compilade
Collaborator Author

It'd be good to have 1 script that needs maintenance rather than 2. It would also avoid confusion from the user perspective.

I agree.

is there anything preventing us from getting rid of convert.py?

There are some cool things implemented in convert.py, like Q8_0 quantization, and no torch dependency. So it's still worth keeping somewhere (even if renamed), at least for reference.

Also its --dump and --dump-single options are very useful to list tensor names and shapes, especially for not-yet-supported models.

@Galunid
Collaborator

Galunid commented May 7, 2024

There are some cool things implemented in convert.py, like Q8_0 quantization, and no torch dependency. So it's still worth keeping somewhere (even if renamed), at least for reference.

I agree it's a cool showcase, but I'd argue it's not worth keeping an obsolete file in the master branch.
git checkout 04976db7a819fcf8bfefbfc09a3344210b79dd27 -- convert.py should solve the issue of ever needing to reference the file. As for Q8_0 quantization, I'd argue for having only one way to generate files.
Right now you can

  1. convert.py -> ./quantize Q8_0
  2. convert.py Q8_0
  3. convert-hf-to-gguf.py -> ./quantize Q8_0

Afaik all three produce slightly different results. Establishing only one right way to generate models would go a long way towards reducing the issues we need to respond to.

Also its --dump and --dump-single options are very useful to list tensor names and shapes, especially for not-yet-supported models.

I didn't know about this, but it is useful. It should be simple enough to add to convert-hf-to-gguf.py, something like the short sketch below.
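(A hedged sketch of a --dump-like listing, assuming the safetensors Python API; the file name is a placeholder.)

from safetensors import safe_open

with safe_open("model-00001-of-00002.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_slice(name)  # reads only the header metadata, not the tensor data
        print(f"{name:60s} {t.get_dtype():8s} {t.get_shape()}")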

I guess we could maybe move it to examples/torchless_convert.py or something?

@slaren
Collaborator

slaren commented May 8, 2024

convert.py also allows converting llama, llama2 and mixtral models from the original pth distributions.

@ggerganov
Owner

convert.py also allows converting llama, llama2 and mixtral models from the original pth distributions.

Correct, the idea was for convert.py to support "native" model conversion without waiting for models to be huggingface-ified, and without heavy dependencies. But it looks like HF / Transformers adoption is wide across all major model producers, and convert.py depends on almost everything that convert-hf-to-gguf.py depends on (except torch), so I'm not sure it makes sense anymore

I'm personally interested in small convert scripts that don't necessarily have all the cool features of convert.py and convert-hf-to-gguf.py, but are simple to understand and modify (example: https://github.com/ggerganov/ggml/blob/master/examples/magika/convert.py). Though these would also be hard to maintain because they wouldn't be exercised as much

About this PR - shall we proceed to merge? Not sure if anyone else would like to review the changes

Collaborator

@Galunid Galunid left a comment

Not a super in depth review, but nothing stood out to me.

@compilade
Collaborator Author

compilade commented May 8, 2024

I've found a way to support bfloat16 conversion (since #6412 was merged), so I've added support for it in 1eccde6.

Even though Numpy doesn't have a bfloat16 data type, it's still possible to use Numpy to perform the type conversion, because the dtype can be changed without converting the underlying data.

if data.dtype != np.float32:
    data = data.astype(np.float32)
data.dtype = np.int32                  # reinterpret the same bytes as int32
data = (data >> 16).astype(np.int16)   # keep only the upper 16 bits of each float32

This way, data contains the exact same bytes as if all of its elements were converted to bfloat16.

GGUFWriter.add_tensor supports overriding the dtype with any gguf.GGMLQuantizationType, so this is used to store these np.int16 tensors with the correct metadata.
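Roughly like this (a hedged sketch: the output path, architecture, and tensor name are illustrative placeholders, and it assumes a BF16 member on gguf.GGMLQuantizationType; only the add_tensor call is the point here):

import numpy as np
import gguf

writer = gguf.GGUFWriter("model-bf16.gguf", arch="llama")
data = np.random.rand(32, 32).astype(np.float32)  # stand-in for a converted tensor
data.dtype = np.int32                             # reinterpret the same bytes
data = (data >> 16).astype(np.int16)              # upper 16 bits == (truncated) bfloat16 bit pattern
writer.add_tensor("blk.0.attn_q.weight", data, raw_dtype=gguf.GGMLQuantizationType.BF16)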

@teleprint-me
Contributor

teleprint-me commented May 8, 2024

But it looks like HF / Transformers adoption is wide across all major model producers, and convert.py depends on almost everything that convert-hf-to-gguf.py depends on (except torch), so I'm not sure it makes sense anymore

The irony is that HuggingFace heavily relies on torch under the hood. The abstractions make it difficult to see unless you really poke at it. This statement is accurate though. The transformers framework makes me think of the blob, consuming everything around it.

Regardless, I've been waiting for this PR to get merged.

@teleprint-me
Contributor

teleprint-me commented May 8, 2024

@compilade I didn't see that for some reason when I looked here earlier! I must've just missed your post. You can bit shift the values in Python, but there is a decent enough margin of error that precision is noticeably affected. I had written tests for this and it's just due to the way Python handles numbers. There's no way around it except to implement a FFI. It's cool that this workaround seems to do the trick though. I like it because it's simple. However, expect precision to be affected.

@compilade
Collaborator Author

compilade commented May 8, 2024

You can bit shift the values in Python, but there is a decent enough margin of error that precision is noticeably affected. I had written tests for this and it's just due to the way Python handles numbers. There's no way around it except to implement a FFI. It's cool that this workaround seems to do the trick though. I like it because it's simple. However, expect precision to be affected.

@teleprint-me Not sure what you've tried before, but I don't think any precision is lost in this process when converting from bfloat16 weights.

The values are first converted from torch.bfloat16 to torch.float32, then to np.float32 (same as torch.float32), then the type is changed to np.int32 without changing the data, then shifted right by 16 bits, then converted to np.int16 (which trims the upper 16 bits to get down to 2 bytes per element). The resulting data is the upper 16 bits of the float32 values that came from the bfloat16 weights.

From my understanding, this is a lossless conversion.

(EDIT: ⚠️ this approach is wrong because subnormals are not flushed to zero, and because it's not rounding to nearest even value. Rounding should be easy to fix, but subnormals... I'm not sure. (idea: it should be possible using np.vectorize))
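For what it's worth, one possible vectorized fix looks roughly like this (a rough sketch, not code from this PR: it rounds to nearest, ties to even, and flushes subnormals to signed zero; inf/NaN are just truncated here, so a NaN with only low mantissa bits set would still need extra care):

import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    u = x.astype(np.float32).view(np.uint32)        # reinterpret the float32 bits
    exp = u & 0x7F800000                            # exponent field
    rounded = (u + 0x7FFF + ((u >> 16) & 1)) >> 16  # round to nearest, ties to even
    truncated = u >> 16                             # used for inf/NaN (keep their bit patterns)
    flushed = (u >> 16) & 0x8000                    # subnormals become signed zero
    out = np.where(exp == 0x7F800000, truncated, np.where(exp == 0, flushed, rounded))
    return out.astype(np.uint16)                    # 16-bit bfloat16 bit patterns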

@compilade
Collaborator Author

I've noticed a problem with metadata: the quantization types of tensors don't use the same enum values as the file types. This makes BF16 models get reported as IQ4_XS (this is about the GGUF key general.file_type) when converted with convert-hf-to-gguf.py.

This requires adding new constants in gguf-py/gguf/constants.py. I'm hesitating between using the name GGMLFileType (named similarly to GGMLQuantizationType) and GGUFFileType (named similarly to GGUFValueType) for this enum class.

@compilade
Collaborator Author

Unfortunately, bfloat16 conversion has a few more special cases than I thought, so it won't be included in this PR (it will require more general lazy Numpy tensors, which I will be working on soon).

I'm force-pushing this branch back to before 1eccde6 to match the reviewed changes.

@ggerganov
Owner

This requires adding new constants in gguf-py/gguf/constants.py. I'm hesitating between using the name GGMLFileType (named similarly to GGMLQuantizationType) and GGUFFileType (named similarly to GGUFValueType) for this enum class.

I think we should go with GGMLFileType


Here is some more information about the file-type enums (it's a bit messy):

The C++ enum used to determine the file type in llama.cpp is llama_ftype:

llama.cpp/llama.h, lines 108 to 114 (at 83330d8):

// model file types
enum llama_ftype {
    LLAMA_FTYPE_ALL_F32              = 0,
    LLAMA_FTYPE_MOSTLY_F16           = 1, // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_0          = 2, // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_1          = 3, // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16

This enum is specific to llama.cpp. For example, whisper.cpp also has the concept of a "file type", but instead of defining a corresponding enum, it uses the built-in enum ggml_ftype:

https://github.com/ggerganov/whisper.cpp/blob/a750868428868abd437e228ae5cab763ef3dc387/ggml.h#L388-L395

The enum ggml_ftype was introduced after enum llama_ftype and initially it mirrored its values, but eventually it diverged, because we forgot to update it.

So now, ggml projects can either define their own file-type enum (like llama.cpp) or use the built-in enum (like whisper.cpp). I guess there are arguments for either option, but I'm thinking that adopting enum ggml_ftype would be overall better, since gguf-py also needs to "know" about it.

But anyway, the oversight here that you noticed is that the F32 and F16 values of enum llama_ftype accidentally match the values of enum ggml_type (a.k.a GGMLQuantizationType). So we didn't have any problems until now.
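As a rough idea of shape only (a hypothetical sketch, not the final gguf-py code), such a Python enum would simply mirror the llama_ftype values quoted above:

from enum import IntEnum

class GGMLFileType(IntEnum):
    ALL_F32              = 0
    MOSTLY_F16           = 1  # except 1d tensors
    MOSTLY_Q4_0          = 2  # except 1d tensors
    MOSTLY_Q4_1          = 3  # except 1d tensors
    MOSTLY_Q4_1_SOME_F16 = 4  # tok_embeddings.weight and output.weight are F16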

@compilade
Collaborator Author

compilade commented May 8, 2024

So now, ggml projects can either define their own file-type enum (like llama.cpp) or use the built-in enum (like whisper.cpp). I guess there are arguments for either option, but I'm thinking that adopting enum ggml_ftype would be overall better, since gguf-py also needs to "know" about it.

I've added GGMLFileType with values from llama.cpp's llama_ftype in cad22e1, but reading this makes me realize you meant using ggml.h's ggml_ftype, which I agree would be better. (though it wouldn't quite work when converting models for llama.cpp, since it expects types matching llama_ftype)

Again, reverting back, because I need to give this GGMLFileType enum more thought (especially with the different conflicting sources of its possible values), and because it will only be useful along with bfloat16 conversion so I think the changes belong together (in another PR).

I will then proceed to merge this PR (after the tests finish again, so in an hour, I guess?) with the previously-approved changes, because I think they fulfill the goal of this PR and because there was no disagreement in the 2 days since the last change.

(EDIT: some Docker test is failing because of network errors. I guess this happens, but it doesn't seem related to this PR, so I'm merging anyway.)

@compilade compilade merged commit f98eb31 into master May 8, 2024
19 of 54 checks passed
fairydreaming pushed a commit to fairydreaming/llama.cpp that referenced this pull request May 9, 2024
@bartowski1182
Contributor

@compilade sorry to resurrect this, but looks like lazy conversion is broken for Falcon:

llama_cpp-1  |   File "/llama.cpp/convert-hf-to-gguf.py", line 2562, in <module>
llama_cpp-1  |     main()
llama_cpp-1  |   File "/llama.cpp/convert-hf-to-gguf.py", line 2556, in main
llama_cpp-1  |     model_instance.write()
llama_cpp-1  |   File "/llama.cpp/convert-hf-to-gguf.py", line 343, in write
llama_cpp-1  |     self.gguf_writer.write_tensors_to_file(progress=True)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/gguf_writer.py", line 295, in write_tensors_to_file
llama_cpp-1  |     tensor.tofile(self.fout)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 222, in tofile
llama_cpp-1  |     eager = LazyNumpyTensor.to_eager(self)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 182, in to_eager
llama_cpp-1  |     return cls._recurse_apply(t, simple_to_eager)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 108, in _recurse_apply
llama_cpp-1  |     return fn(o)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 171, in simple_to_eager
llama_cpp-1  |     raise ValueError(f"{lt} did not belong in the lazy queue")
llama_cpp-1  | ValueError: <__main__.LazyTorchTensor object at 0x7c67929f0b20> did not belong in the lazy queue

Using --no-lazy, it converts with no issue (so far)

Using newest falcon model: https://huggingface.co/tiiuae/falcon-11B

@compilade
Collaborator Author

ValueError: <__main__.LazyTorchTensor object at 0x7c67929f0b20> did not belong in the lazy queue

@bartowski1182 try again with the latest master. Since #7234 was merged, this exception should no longer be raised.

@bartowski1182
Contributor

@compilade oh thanks, i'll wait for the release and give it another shot!

Labels: enhancement (New feature or request), high priority (Very important issue), need feedback (Testing and feedback with results are needed)
Projects: None yet

7 participants