
convert-hf : save memory with lazy evaluation #7075

Merged
merged 26 commits into master on May 8, 2024

Conversation

compilade
Collaborator

@compilade compilade commented May 4, 2024

Continued from #7031 so that all shape-changing operations are made with PyTorch.

This was motivated by some remarks about convert-hf-to-gguf.py using too much RAM and/or convert.py being better for Llama models because it uses lazy tensors.

This fixes the RAM usage problem by bringing lazy conversion for ALL model architectures supported by convert-hf-to-gguf.py (assuming none were broken with #7031).

Summary of changes

  • All changes from convert-hf : reduce repeated boilerplate from write_tensors #7031
  • Way less RAM usage, by default. Like, waaaay less.
  • This uses PyTorch's meta device as well as __torch_function__ overrides and __getattr__ switcheroos, wrapped in LazyTorchTensor (a simplified sketch of the idea follows this list)
    • LazyTorchTensors form a tree of tensors, so the eager evaluation is shared when multiple tensors are split from one
    • The goal is that ALL PyTorch operations can be used normally in modify_tensors, so the same code works BOTH in lazy evaluation mode and when eagerly evaluating (as before). This keeps the semantics simpler for everyone.
    • (BTW thanks @heiner for showing that sending a tensor .to("meta") works in "Script to convert Grok-1 weights from raw JAX pickle files" #7058; it wasn't really clear from PyTorch's documentation)
  • Add the LazyTensor class in gguf_writer.py to abstract away how the tensor is actually calculated
    • This helps avoid making the gguf-py library depend on PyTorch.
    • Wraps basic Numpy operations like astype(), byteswap(), and tofile()
  • convert-hf-to-gguf.py has a new flag, --no-lazy, to let the user opt out of lazy evaluation (same behavior as before: accumulating everything in RAM before starting to write the output file)
  • Since the modify_tensors step now happens very very quickly, most time is spent writing to disk. I've added a progress bar in GGUFWriter.write_tensors_to_file (which is only enabled in convert-hf-to-gguf.py, otherwise it's disabled by default).
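To illustrate the idea, here is a deliberately simplified sketch (not the PR's actual LazyTorchTensor/LazyBase code, and it ignores the __getattr__ part): every intercepted torch operation is applied right away to a "meta" tensor to track shapes and dtypes, while the real computation is captured in a closure that only runs when the data is actually needed.

import torch

class LazySketch:
    # Simplified sketch only; names and structure are illustrative, not the PR's code.
    def __init__(self, meta: torch.Tensor, eager_fn):
        self.meta = meta          # lives on the meta device: shape/dtype, no data
        self.eager_fn = eager_fn  # zero-argument closure producing the real tensor

    @classmethod
    def from_eager(cls, t: torch.Tensor) -> "LazySketch":
        return cls(t.to("meta"), lambda: t)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Only flat positional args are handled here; the real class recurses into nested structures.
        meta_args = tuple(a.meta if isinstance(a, cls) else a for a in args)
        return cls(
            func(*meta_args, **kwargs),  # cheap: runs on meta tensors, tracks the resulting shape
            lambda: func(*tuple(a.eager_fn() if isinstance(a, cls) else a for a in args), **kwargs),
        )

    def to_eager(self) -> torch.Tensor:
        return self.eager_fn()

lazy = LazySketch.from_eager(torch.ones(2, 3))
reshaped = torch.reshape(lazy, (3, 2))  # dispatched through __torch_function__
print(reshaped.meta.shape)              # torch.Size([3, 2]); no real data touched yet
print(reshaped.to_eager())              # the reshape actually runs here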

Testing

Everything should pretty much Just Work (TM), but some models use special PyTorch operators and need more careful testing.

Ideally, there should be no fatal failure, and the resulting model should be the same when lazy (by default) and when using --no-lazy.
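One simple way to check that equivalence is to hash both outputs (a hedged sketch; the two .gguf paths below are placeholders for two runs of the converter, one with --no-lazy):

import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# e.g. after converting the same model twice, with and without --no-lazy:
print(sha256_of("lazy.gguf") == sha256_of("eager.gguf"))  # expected: True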

(@bartowski1182, @fairydreaming, @phymbert, @arki05, @slaren, @ggerganov, this will most likely benefit you all because I'm sure you've dealt with converting bigger models than your RAM before)

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.
`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".
It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.
…tion

There are no abstract methods used anyway,
so using ABC isn't really necessary.
@compilade added the enhancement (New feature or request) and need feedback (Testing and feedback with results are needed) labels on May 4, 2024
@bartowski1182
Contributor

This looks awesome! A lot of great changes!

Any differences in performance if RAM capacity isn't a bottleneck? Or is the speed the same with or without?

If unknown I can also test

@compilade
Collaborator Author

Any differences in performance if RAM capacity isn't a bottleneck? Or is the speed the same with or without?

I don't know, since on my machine RAM capacity is a bottleneck, so lazy conversion is always better there.

I think the performance shouldn't be that different if there's enough RAM. It's still reading, calculating, and writing the same things. Either everything is first read, computed, and logged, kept in RAM, then all written (with --no-lazy and/or the current state of master), or (with lazy conversion) all metadata is read and logged, then each tensor's data is read, computed, and written, one after the other. Both lazy and non-lazy conversion should result in the exact same file.

I personally think the UX (user experience) is better with lazy conversion, because the (new) progress bar shown when writing the tensors gives a good enough ETA, while in non-lazy conversion (with --no-lazy), the progress bar only appears at the end when flushing everything to the output file.

If unknown I can also test

Yes, this would be appreciated. This way we'll know whether or not there's an overhead with lazy conversion vs non-lazy conversion. Not sure how to get a consistent measurement though, with filesystem caches and stuff like that. (though lazy conversion has an advantage there, since it leaves more RAM for the filesystem cache)

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.
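For illustration (a hedged sketch: the file name follows the usual Hugging Face convention for multi-part safetensors checkpoints, and the "weight_map" layout shown is the typical index-file structure), the tensor names can be read from the index JSON without touching any tensor data:

import json

with open("model.safetensors.index.json", "r", encoding="utf-8") as f:
    index = json.load(f)

weight_map = index["weight_map"]               # tensor name -> part file name
tensor_names = set(weight_map.keys())          # known up-front, no tensor data loaded
part_files = sorted(set(weight_map.values()))  # e.g. model-00001-of-00042.safetensors, ...
print(f"{len(tensor_names)} tensors across {len(part_files)} parts")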
@bartowski1182
Contributor

It does seem likely that there shouldn't be much, if any, performance hit, since realistically there's no good reason to have ALL tensors in memory at once

While doing this, any insight into converting bf16 to fp16 by first going to FP32? Feels like it should be a straightforward process, but I haven't dived into the black box yet

@Galunid
Collaborator

Galunid commented May 7, 2024

@ggerganov Once this is merged, is there anything preventing us from getting rid of convert.py? I assume people switched to llama-3 already so there's little reason to try converting llama/llama2. Not to mention it doesn't have the recent tokenizer changes, which kinda makes it useless. It'd be good to have 1 script that needs maintenance rather than 2. It would also avoid confusion from the user perspective.

@ggerganov
Owner

@Galunid Yes, it makes sense to remove it.

@compilade
Collaborator Author

It'd be good to have 1 script that needs maintenance rather than 2. It would also avoid confusion from the user perspective.

I agree.

is there anything preventing us from getting rid of convert.py?

There are some cool things implemented in convert.py, like Q8_0 quantization, and no torch dependency. So it's still worth keeping somewhere (even if renamed), at least for reference.

Also its --dump and --dump-single options are very useful to list tensor names and shapes, especially for not-yet-supported models.

@Galunid
Collaborator

Galunid commented May 7, 2024

There are some cool things implemented in convert.py, like Q8_0 quantization, and no torch dependency. So it's still worth keeping somewhere (even if renamed), at least for reference.

I agree it's a cool showcase, but I'd argue it's not worth keeping an obsolete file in the master branch.
git checkout 04976db7a819fcf8bfefbfc09a3344210b79dd27 -- convert.py should solve the issue of ever needing to reference the file. As for Q8_0 quantization, I'd argue for having only one way to generate files.
Right now you can

  1. convert.py -> ./quantize Q8_0
  2. convert.py Q8_0
  3. convert-hf-to-gguf.py -> ./quantize Q8_0

Afaik all three produce slightly different results. Establishing only one right way to generate models would go a long way towards reducing the issues we need to respond to.

Also its --dump and --dump-single options are very useful to list tensor names and shapes, especially for not-yet-supported models.

I didn't know about this, but it is useful. It should be simple enough to add to convert-hf-to-gguf.py, something like the short sketch below.
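(A hedged sketch of a --dump-like listing, assuming the safetensors Python API; the file name is a placeholder.)

from safetensors import safe_open

with safe_open("model-00001-of-00002.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_slice(name)  # reads only the header metadata, not the tensor data
        print(f"{name:60s} {t.get_dtype():8s} {t.get_shape()}")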

I guess we could maybe move it to examples/torchless_convert.py or something?

@slaren
Collaborator

slaren commented May 8, 2024

convert.py also allows converting llama, llama2 and mixtral models from the original pth distributions.

@ggerganov
Owner

convert.py also allows converting llama, llama2 and mixtral models from the original pth distributions.

Correct, the idea was for convert.py to support "native" model conversion without waiting for models to be huggingface-ified, and without heavy dependencies. But it looks like HF / Transformers adoption is wide across all major model producers, and convert.py depends on almost everything that convert-hf-to-gguf.py depends on (except torch), so I'm not sure it makes sense anymore

I'm personally interested in small convert scripts that don't necessarily have all the cool features of convert.py and convert-hf-to-gguf.py, but are simple to understand and modify (example: https://github.com/ggerganov/ggml/blob/master/examples/magika/convert.py). Though these would also be hard to maintain because they wouldn't be exercised as much

About this PR - shall we proceed to merge? Not sure if anyone else would like to review the changes

Collaborator

@Galunid Galunid left a comment

Not a super in depth review, but nothing stood out to me.

@compilade
Collaborator Author

compilade commented May 8, 2024

I've found a way to support bfloat16 conversion (since #6412 was merged), so I've added support for it in 1eccde6.

Even though Numpy doesn't have a bfloat16 data type, it's still possible to use Numpy to perform the type conversion, because the dtype can be changed without converting the underlying data.

if data.dtype != np.float32:
    data = data.astype(np.float32)
data.dtype = np.int32                  # reinterpret the same bytes as int32
data = (data >> 16).astype(np.int16)   # keep only the upper 16 bits of each float32

This way, data contains the exact same bytes as if all of its elements were converted to bfloat16.

GGUFWriter.add_tensor supports overriding the dtype with any gguf.GGMLQuantizationType, so this is used to store these np.int16 tensors with the correct metadata.
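Roughly like this (a hedged sketch: the output path, architecture, and tensor name are illustrative placeholders, and it assumes a BF16 member on gguf.GGMLQuantizationType; only the add_tensor call is the point here):

import numpy as np
import gguf

writer = gguf.GGUFWriter("model-bf16.gguf", arch="llama")
data = np.random.rand(32, 32).astype(np.float32)  # stand-in for a converted tensor
data.dtype = np.int32                             # reinterpret the same bytes
data = (data >> 16).astype(np.int16)              # upper 16 bits == (truncated) bfloat16 bit pattern
writer.add_tensor("blk.0.attn_q.weight", data, raw_dtype=gguf.GGMLQuantizationType.BF16)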

@teleprint-me
Contributor

teleprint-me commented May 8, 2024

But it looks like HF / Transformers adoption is wide across all major model producers, and convert.py depends on almost everything that convert-hf-to-gguf.py depends on (except torch), so I'm not sure it makes sense anymore

The irony is that HuggingFace heavily relies on torch under the hood. The abstractions make it difficult to see unless you really poke at it. This statement is accurate though. The transformers framework makes me think of the blob, consuming everything around it.

Regardless, I've been waiting for this PR to get merged.

@teleprint-me
Contributor

teleprint-me commented May 8, 2024

@compilade I didn't see that for some reason when I looked here earlier! I must've just missed your post. You can bit shift the values in Python, but there is a decent enough margin of error that precision is noticeably affected. I had written tests for this and it's just due to the way Python handles numbers. There's no way around it except to implement a FFI. It's cool that this workaround seems to do the trick though. I like it because it's simple. However, expect precision to be affected.

@compilade
Collaborator Author

compilade commented May 8, 2024

You can bit shift the values in Python, but there is a decent enough margin of error that precision is noticeably affected. I had written tests for this and it's just due to the way Python handles numbers. There's no way around it except to implement a FFI. It's cool that this workaround seems to do the trick though. I like it because it's simple. However, expect precision to be affected.

@teleprint-me Not sure what you've tried before, but I don't think any precision is lost in this process when converting from bfloat16 weights.

The values are first converted from torch.bfloat16 to torch.float32, then to np.float32 (same as torch.float32), then the type is changed to np.int32 without changing the data, then shifted right by 16 bits, then converted to np.int16 (which trims the upper 16 bits to get down to 2 bytes per element). The resulting data is the upper 16 bits of the float32 values that came from the bfloat16 weights.

From my understanding, this is a lossless conversion.

(EDIT: ⚠️ this approach is wrong because subnormals are not flushed to zero, and because it's not rounding to nearest even value. Rounding should be easy to fix, but subnormals... I'm not sure. (idea: it should be possible using np.vectorize))
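For what it's worth, one possible vectorized fix looks roughly like this (a rough sketch, not code from this PR: it rounds to nearest, ties to even, and flushes subnormals to signed zero; inf/NaN are just truncated here, so a NaN with only low mantissa bits set would still need extra care):

import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    u = x.astype(np.float32).view(np.uint32)        # reinterpret the float32 bits
    exp = u & 0x7F800000                            # exponent field
    rounded = (u + 0x7FFF + ((u >> 16) & 1)) >> 16  # round to nearest, ties to even
    truncated = u >> 16                             # used for inf/NaN (keep their bit patterns)
    flushed = (u >> 16) & 0x8000                    # subnormals become signed zero
    out = np.where(exp == 0x7F800000, truncated, np.where(exp == 0, flushed, rounded))
    return out.astype(np.uint16)                    # 16-bit bfloat16 bit patterns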

@compilade
Collaborator Author

I've noticed a problem with metadata: the quantization types of tensors don't use the same enum values as the file types. This makes BF16 models get reported as IQ4_XS (this is about the GGUF key general.file_type) when converted with convert-hf-to-gguf.py.

This requires adding new constants in gguf-py/gguf/constants.py. I'm hesitating between using the name GGMLFileType (named similarly to GGMLQuantizationType) and GGUFFileType (named similarly to GGUFValueType) for this enum class.

@compilade
Collaborator Author

Unfortunately, bfloat16 conversion has a few more special cases than I thought, so it won't be included in this PR (it will require more general lazy Numpy tensors, which I will be working on soon).

I'm force-pushing this branch back to before 1eccde6 to match the reviewed changes.

@ggerganov
Owner

This requires adding new constants in gguf-py/gguf/constants.py. I'm hesitating between using the name GGMLFileType (named similarly to GGMLQuantizationType) and GGUFFileType (named similarly to GGUFValueType) for this enum class.

I think we should go with GGMLFileType


Here is some more information about the file-type enums (it's a bit messy):

The C++ enum used to determine the file type in llama.cpp is llama_ftype:

llama.cpp/llama.h, lines 108 to 114 (at 83330d8):

// model file types
enum llama_ftype {
    LLAMA_FTYPE_ALL_F32              = 0,
    LLAMA_FTYPE_MOSTLY_F16           = 1, // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_0          = 2, // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_1          = 3, // except 1d tensors
    LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16

This enum is specific to llama.cpp. For example, whisper.cpp also has the concept of a "file type", but instead of defining a corresponding enum, it uses the built-in enum ggml_ftype:

https://github.com/ggerganov/whisper.cpp/blob/a750868428868abd437e228ae5cab763ef3dc387/ggml.h#L388-L395

The enum ggml_ftype was introduced after enum llama_ftype and initially it mirrored its values, but eventually it diverged, because we forgot to update it.

So now, ggml projects can either define their own file-type enum (like llama.cpp) or use the built-in enum (like whisper.cpp). I guess there are arguments for either option, but I'm thinking that adopting enum ggml_ftype would be overall better, since gguf-py also needs to "know" about it.

But anyway, the oversight here that you noticed is that the F32 and F16 values of enum llama_ftype accidentally match the values of enum ggml_type (a.k.a GGMLQuantizationType). So we didn't have any problems until now.
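As a rough idea of shape only (a hypothetical sketch, not the final gguf-py code), such a Python enum would simply mirror the llama_ftype values quoted above:

from enum import IntEnum

class GGMLFileType(IntEnum):
    ALL_F32              = 0
    MOSTLY_F16           = 1  # except 1d tensors
    MOSTLY_Q4_0          = 2  # except 1d tensors
    MOSTLY_Q4_1          = 3  # except 1d tensors
    MOSTLY_Q4_1_SOME_F16 = 4  # tok_embeddings.weight and output.weight are F16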

@compilade
Collaborator Author

compilade commented May 8, 2024

So now, ggml projects can either define their own file-type enum (like llama.cpp) or use the built-in enum (like whisper.cpp). I guess there are arguments for either option, but I'm thinking that adopting enum ggml_ftype would be overall better, since gguf-py also needs to "know" about it.

I've added GGMLFileType with values from llama.cpp's llama_ftype in cad22e1, but reading this makes me realize you meant using ggml.h's ggml_ftype, which I agree would be better. (though it wouldn't quite work when converting models for llama.cpp, since it expects types matching llama_ftype)

Again, reverting back, because I need to give this GGMLFileType enum more thought (especially with the different conflicting sources of its possible values), and because it will only be useful along with bfloat16 conversion so I think the changes belong together (in another PR).

I will then proceed to merge this PR (after the tests finish again, so in an hour, I guess?) with the previously-approved changes, because I think they fulfill the goal of this PR and because there was no disagreement in the 2 days since the last change.

(EDIT: some Docker test is failing because of network errors. I guess this happens, but it doesn't seem related to this PR, so I'm merging anyway.)

@compilade compilade merged commit f98eb31 into master May 8, 2024
19 of 54 checks passed
fairydreaming pushed a commit to fairydreaming/llama.cpp that referenced this pull request May 9, 2024
@bartowski1182
Contributor

@compilade sorry to resurrect this, but looks like lazy conversion is broken for Falcon:

llama_cpp-1  |   File "/llama.cpp/convert-hf-to-gguf.py", line 2562, in <module>
llama_cpp-1  |     main()
llama_cpp-1  |   File "/llama.cpp/convert-hf-to-gguf.py", line 2556, in main
llama_cpp-1  |     model_instance.write()
llama_cpp-1  |   File "/llama.cpp/convert-hf-to-gguf.py", line 343, in write
llama_cpp-1  |     self.gguf_writer.write_tensors_to_file(progress=True)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/gguf_writer.py", line 295, in write_tensors_to_file
llama_cpp-1  |     tensor.tofile(self.fout)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 222, in tofile
llama_cpp-1  |     eager = LazyNumpyTensor.to_eager(self)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 182, in to_eager
llama_cpp-1  |     return cls._recurse_apply(t, simple_to_eager)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 108, in _recurse_apply
llama_cpp-1  |     return fn(o)
llama_cpp-1  |   File "/llama.cpp/gguf-py/gguf/lazy.py", line 171, in simple_to_eager
llama_cpp-1  |     raise ValueError(f"{lt} did not belong in the lazy queue")
llama_cpp-1  | ValueError: <__main__.LazyTorchTensor object at 0x7c67929f0b20> did not belong in the lazy queue

Using --no-lazy, it converts with no issue (so far)

Using newest falcon model: https://huggingface.co/tiiuae/falcon-11B

@compilade
Collaborator Author

ValueError: <__main__.LazyTorchTensor object at 0x7c67929f0b20> did not belong in the lazy queue

@bartowski1182 try again with the latest master. Since #7234 was merged, this exception should no longer be raised.

@bartowski1182
Contributor

@compilade oh thanks, i'll wait for the release and give it another shot!

Labels: enhancement (New feature or request), high priority (Very important issue), need feedback (Testing and feedback with results are needed)
Projects: None yet

7 participants