Make loading weights 10-100x faster #613
Conversation
Should the other converters also be rewritten to handle this new format?
Yes indeed. I just fixed the quantize program. Now I'm hunting down all the tests.
All tests look green except for a CMake test. For example: https://github.com/ggerganov/llama.cpp/actions/runs/4559537462/jobs/8043597142?pr=613 I'm stumped on this error. I can't figure out where the file
#355 mentioned "Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)"
@@ -20,7 +20,7 @@
#endif

#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
Nit: why change the magic rather than the version? I assumed the plan was to keep the magic constant forever. If you bump the version instead, old executables will recognize new model files and give a more useful error message. And it's nice to distinguish between "this is definitely a model file for this project, but it's the wrong version" vs "this is some random junk we don't know anything about".
(This PR is a very neat bit of engineering; please don't let my nitpick distract from that.)
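For illustration, a minimal sketch of the distinction being described (this is not the actual llama.cpp loader; `llama_check_header` is a hypothetical name): a stable magic says "this is one of our model files", while a separate version field lets an old executable explain that the file is simply newer than it understands.

```c
// Hedged sketch of a magic-plus-version header check, using the constants
// from the diff above. Not the code in this PR.
#include <stdint.h>
#include <stdio.h>

#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC   0x67676d66 // 'ggmf' in hex

static int llama_check_header(uint32_t magic, uint32_t version) {
    if (magic != LLAMA_FILE_MAGIC) {
        // some random junk we don't know anything about
        fprintf(stderr, "error: not a llama.cpp model file (bad magic)\n");
        return 0;
    }
    if (version != LLAMA_FILE_VERSION) {
        // definitely one of our model files, just the wrong version
        fprintf(stderr, "error: unsupported model file version %u (expected %u); "
                        "please re-convert the model\n", version, LLAMA_FILE_VERSION);
        return 0;
    }
    return 1;
}
```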
not a nitpick but a real change request :)
(nvm)
@jart I had the expectation that

Regarding the version comment - yes, the plan was to bump the version and not the magic. But I'm OK with changing the magic to commemorate the significance of this update. In fact, maybe we can make this a thing and everybody who makes a significant contribution to the project will get their initials appended to the version. What do you think? 😄

Let me play with this tonight before merging. We have to take special care that all the other

Also, maybe some synchronisation with #545 would be needed.
This is a breaking change that's going to give you three benefits:

1. Your inference commands should load 100x faster
2. You may be able to safely load models 2x larger
3. You can run many concurrent inference processes

This was accomplished by changing the file format so we can mmap() weights directly into memory without having to read() or copy them, thereby ensuring the kernel can make its file cache pages directly accessible to our inference processes; and secondly, that the file cache pages are much less likely to get evicted (which would force loads to hit disk) because they're no longer competing with memory pages that were needlessly created by gigabytes of standard i/o.

The new file format supports single-file models like LLaMA 7b, and it also supports multi-file models like LLaMA 13B. Our Python tool now merges the foo.1, foo.2, etc. files back into a single file so that the C++ code which maps it doesn't need to reshape data every time. That's made llama.cpp so much simpler. Much of its load code has now been deleted.

Furthermore, this change ensures that tensors are aligned properly on a 32-byte boundary. That opens the door to seeing if we can get additional performance gains on some microprocessors, by using ops that require memory alignment.

Lastly, note that both POSIX and the Windows platform are supported.

Fixes ggerganov#91
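To make the mechanism concrete, here is a rough sketch of the idea (not the code added by this PR; `map_model` and `tensor_offset` are illustrative names): once every tensor starts at a 32-byte-aligned offset in the file, a pointer into the mapping can be handed straight to the compute code, with no read() and no copy.

```c
// Sketch: map the whole model file and point tensor data into the mapping.
// Assumes tensor offsets in the file are 32-byte aligned, as the new format
// guarantees; POSIX-only for brevity (Windows would use MapViewOfFile).
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *map_model(const char *fname, size_t *out_len) {
    int fd = open(fname, O_RDONLY);
    if (fd == -1) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return NULL;
    *out_len = (size_t) st.st_size;
    return addr;
}

// Hypothetical use: tensor->data = (char *) base + tensor_offset;
// tensor_offset is a multiple of 32, so aligned vector loads become possible.
```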
File updated. A lot more tests are green now. No idea what's up with the sanitizer.

I thought so too! I too was pleasantly surprised by how well it worked out. Glad we took a few weeks to think.

I'm honored to hear you say that. I can round the magic up to 64 bytes if you like, so there's room to hand out kudos without breaking backwards compatibility in the future. Since my initials also act as a stamp of approval, I'm going to be sending a follow-up change after this that'll harden the loading code, so that folks will be able to trade model files for this format on HuggingFace with maximum safety and confidence.

#545 is an ambitious unification. I've done my best to comment my changes to make the merge less painful for the author. I've sought to update the other scripts too, but don't know how to run them.

One thing you could also consider with this project is having a
int fd = open(fname, O_RDONLY);
if (fd == -1) return 0;
int64_t length = lseek(fd, 0, SEEK_END);
void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
- Is it safer to use `mmap64` for 4GB+ files?
- It seems `mmap`, `mmap64` and `MapViewOfFile` support mapping from a given offset. Is it possible to map from `header_len` (as `offset`)? If we can do this, there's no need to align the model file, right?
- The right thing to do on 32-bit platforms is to have your build system define `-D_FILE_OFFSET_BITS=64`, which will cause your system header files to automatically `#define mmap mmap64`.
- File offsets passed to mmap() need to be page size aligned, so I don't think so.
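For reference, a small sketch of the alignment constraint mentioned in the second point (illustrative only; `map_from_offset` and `header_len` are example names, not functions in this PR): the offset handed to mmap() must be a multiple of the page size, so mapping at an arbitrary header length means rounding down and adjusting the returned pointer.

```c
// Sketch: map file contents starting near an arbitrary header_len.
// The offset argument to mmap() must be page-aligned, so round it down
// and skip the slack bytes in the returned pointer.
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static void *map_from_offset(int fd, int64_t header_len, int64_t data_len) {
    long page = sysconf(_SC_PAGESIZE);
    int64_t aligned = header_len & ~((int64_t) page - 1); // round down to a page boundary
    int64_t slack   = header_len - aligned;               // bytes before the data we want
    void *addr = mmap(NULL, (size_t) (data_len + slack), PROT_READ, MAP_SHARED,
                      fd, (off_t) aligned);
    if (addr == MAP_FAILED) return NULL;
    return (char *) addr + slack; // first byte after the header
}
```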
@jart Is it possible to ensure the file size is a multiple of the hugepage size (e.g. using ftruncate), to benefit from fewer TLB lookups when the model data is accessed? (corresponding mmap hints or other system-specific APIs, e.g. needed for macOS, might need to be used)
It doesn't matter with mmap() if the file length isn't page size aligned, even with smaller pages. You should be good to go if you modify the mmap() code in llama.cpp by hand and actually manage to get huge pages to work without nuking your machine :-)
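For anyone who wants to experiment with the huge-page idea, one low-risk starting point on Linux is a transparent-huge-page hint after the mapping is created (a sketch only, Linux-specific, and not something this PR does):

```c
// Sketch: ask the Linux kernel to back the mapping with transparent huge
// pages if it can. The hint is advisory and harmless if it is ignored.
#include <stddef.h>
#include <sys/mman.h>

static void hint_huge_pages(void *addr, size_t length) {
#ifdef MADV_HUGEPAGE
    madvise(addr, length, MADV_HUGEPAGE);
#else
    (void) addr; (void) length; // MADV_HUGEPAGE not available on this platform
#endif
}
```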
TIL!
If you deleted your old Meta LLaMA .pth files, then the migrate-ggml-2023-03-30-pr613.py script will allow you to convert your old ggml files into the new mmap()'able format. See ggerganov#613
@ggerganov This change now includes a migration tool named `migrate-ggml-2023-03-30-pr613.py`.
@x02Sylvie I don't have access to the Alpaca model. Could you send a pull request fixing that after this gets merged?
I don't really know Python, so I'd rather leave the pull request to someone smarter than me. I did, however, manage to get the Alpaca 13B model converted by manually setting n_parts to 1 in the .py conversion script. I'm unsure if that's the proper place to set n_parts, though. The model does work after conversion.
Hello, I cannot load the gpt4all model after converting it to the new ggml format using your script. I have opened a new issue probably related to this: #655 (comment)
I could run it with the previous version https://github.com/ggerganov/llama.cpp/tree/master-ed3c680
You also need to run the resulting file through the migration script: gpt4all weights -> convert-gpt4all-to-ggml.py -> converted gpt4all weights -> migrate-ggml-2023-03-30-pr613.py -> gpt4all weights compatible with the latest version of llama.cpp
It worked. Thank you for your fast response!
As noted in https://github.com/ggerganov/llama.cpp/blob/master/migrate-ggml-2023-03-30-pr613.py, the `llama.cpp` authors made a breaking change to the file format on 2023-03-30 in ggerganov/llama.cpp#613. Therefore, we additionally need to use `migrate-ggml-2023-03-30-pr613.py` to convert the llama model.
This is a breaking change that's going to give us three benefits:

1. Your inference commands should load 100x faster
2. You may be able to safely load models 2x larger
3. You can run many concurrent inference processes
This was accomplished by changing the file format so we can mmap()
weights directly into memory without having to read() or copy them
thereby ensuring the kernel can make its file cache pages directly
accessible to our inference processes; and secondly, that the file
cache pages are much less likely to get evicted (which would force
loads to hit disk) because they're no longer competing with memory
pages that were needlessly created by gigabytes of standard i/o.
The new file format supports single-file models like LLaMA 7b, and
it also supports multi-file models like LLaMA 13B. Our Python tool
now merges the foo.1, foo.2, etc. files back into a single file so
that the C++ code which maps it doesn't need to reshape data every
time. That's made llama.cpp so much simpler. Much of its load code
has now been deleted.
Furthermore, this change ensures that tensors are aligned properly
on a 32-byte boundary. That opens the door to seeing if we can get
additional performance gains on some microprocessors, by using ops
that require memory alignment.
Lastly, note that both POSIX and the Windows platform are supported.
The issue this PR solves is #91
This PR was written in collaboration with @slaren. This PR is also rebased on
PR #586 so please do not squash merge! Use either merge or rebase.