-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication #439
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ggerganov
changed the title
Avoid the transposed X branch in the Z = X * Y matrix multiplication
Avoid the "transposed X" branch in the Z = X * Y matrix multiplication
Mar 23, 2023
ggerganov
changed the title
Avoid the "transposed X" branch in the Z = X * Y matrix multiplication
Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication
Mar 23, 2023
@blackhole89 bringing your attention to this |
For me, this change increases the memory usage so drastically that Ubuntu 22.04.2 LTS, GCC 12 |
KerfuffleV2
added a commit
to KerfuffleV2/llama-rs
that referenced
this pull request
Mar 24, 2023
KerfuffleV2
added a commit
to KerfuffleV2/llama-rs
that referenced
this pull request
Mar 26, 2023
philpax
pushed a commit
to rustformers/llm
that referenced
this pull request
Mar 26, 2023
ocrickard
added a commit
to ocrickard/llama.cpp
that referenced
this pull request
Apr 4, 2023
…ication (ggerganov#439)" This reverts commit 483bab2.
ocrickard
added a commit
to ocrickard/llama.cpp
that referenced
this pull request
Apr 4, 2023
… multiplication (ggerganov#439)"" This reverts commit fd19b4a.
ocrickard
added a commit
to ocrickard/llama.cpp
that referenced
this pull request
Apr 4, 2023
Try to optimize ggml a bit Revert "Try to optimize ggml a bit" This reverts commit 661ec69. Revert "Avoid the transposed X branch in the Z = X * Y matrix multiplication (ggerganov#439)" This reverts commit 483bab2. Fix missing memory_v Revert "Fix missing memory_v" This reverts commit 375bd60. Revert "Revert "Avoid the transposed X branch in the Z = X * Y matrix multiplication (ggerganov#439)"" This reverts commit fd19b4a.
sw
added a commit
to sw/llama.cpp
that referenced
this pull request
Apr 4, 2023
…ication (ggerganov#439)" This reverts commit 483bab2.
AAbushady
pushed a commit
to AAbushady/llama.cpp
that referenced
this pull request
Jan 27, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
ref #249 #95
See #407 (comment)
In the
mul_mat()
implementation currently we have 2 main branches:src0
is contiguous in memory (code)src0
is not contiguous in memory (code)In the first branch we parallelize the computation along the
src0
rows. Each thread computes a dot product ofsrc0
row withsrc1
column and writes the result into a cell ofdst
.In the second branch we parallelize along the
src1
columns. Each thread computes multiply + add (mad) of asrc0
column with an element fromsrc1
and writes the result into a per-thread temporary buffer row. At the end of the multiplication, the results from all temporary buffers are accumulated intodst
.The second branch produces variation in the final result based on the used number of threads, since the result into a single
dst
cell is computed by adding different number of floating point terms, based on the used number of threads. It is a bit more efficient, but also uses a lot more memory due to the temporary buffers.I am thinking that in view of having more stable results and also simplifying significantly the code in
ggml.c
, we should eliminate this second branch. The solution is to always make sure thatsrc0
is contiguous, which the user can always achieve with a simpleggml_cpy()
call.The benefits are quite a lot:
ggml_vec_mad_xxx()
functions - can be simply deletedggml_forward_mul_mat_xxx()
implementationsThe drawbacks: