graph : reuse SSM graphs #16490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ggerganov wants to merge 5 commits into master from gg/graph-mamba-reuse
+78 −7
Commits
8a2ac4e graph : reuse hybrid graphs (ggerganov)
f6b0a4b graph : reuse recurrent graphs (ggerganov)
f48f052 graph : fix reuse check for recurrent inputs (ggerganov)
7ab2b58 memory : move the recurrent state into the memory context (ggerganov)
16d57ca Revert "memory : move the recurrent state into the memory context" (ggerganov)
Conversations
mctx->get_head() (the start of the slot) and mctx->get_rs_z() (the first zeroed state) are used in view offsets, and so would need to match too, otherwise the graph can't really be reused.

The case where they wouldn't match (but n_rs matches) is when ubatches of the same size with different sequences are used. E.g. seq_ids 0, 1 with 1 token and then seq_ids 2, 3 with 1 token, in consecutive ubatches, repeatedly.

This probably happens when using -ub 1 in the llama-parallel example, I think (because it uses a single seq_id per ubatch at a time, but ends up using different seq_ids while using the same size of ubatches).

(Note that I didn't actually test the changes yet, so I don't know if this is a real problem.)
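A minimal C++/ggml sketch of why these values end up baked into the graph (the helper and tensor names below are hypothetical, not the actual llama.cpp code): the byte offsets of the state views are computed from head at graph-build time, so a graph built for one head/rs_z addresses different memory than a graph built for another.

```cpp
#include "ggml.h"

// Hypothetical sketch, not the actual llama.cpp implementation.
static ggml_tensor * build_rs_view(ggml_context * ctx,
                                   ggml_tensor * state_all, // [d_state, n_total_states]
                                   int64_t d_state,
                                   int64_t n_rs,   // states used by this ubatch
                                   int64_t head,   // start of the slot (mctx->get_head())
                                   int64_t rs_z) { // first zeroed state (mctx->get_rs_z())
    // `head` becomes a constant byte offset of the view, baked into the graph.
    ggml_tensor * s = ggml_view_2d(ctx, state_all,
                                   d_state, n_rs,
                                   state_all->nb[1],
                                   head * state_all->nb[1]);
    // `rs_z` would likewise be folded in at build time, to select which state
    // gets zeroed, rather than being a runtime input.
    (void) rs_z;
    return s;
}
```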
There is a check earlier for whether the sequences are the same:

llama.cpp/src/llama-graph.h, lines 443 to 457 in 638e2c2

This check applies to all graphs, and if it is not satisfied we don't attempt to reuse the graph. I think this should cover this case.
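The referenced lines aren't reproduced in this thread; as a rough illustration of the kind of comparison such a check performs (the struct and field names below are assumptions, not the actual llm_graph_params members):

```cpp
#include <cstdint>
#include "llama.h" // llama_seq_id (int32_t alias)

// Illustrative only: compare the sequence layout of two ubatches token by token.
struct ubatch_seqs {
    uint32_t                      n_tokens;
    const int32_t               * n_seq_id; // [n_tokens]
    const llama_seq_id * const  * seq_id;   // [n_tokens][n_seq_id[i]]
};

static bool same_seqs(const ubatch_seqs & a, const ubatch_seqs & b) {
    if (a.n_tokens != b.n_tokens) {
        return false;
    }
    for (uint32_t i = 0; i < a.n_tokens; ++i) {
        if (a.n_seq_id[i] != b.n_seq_id[i]) {
            return false;
        }
        for (int32_t s = 0; s < a.n_seq_id[i]; ++s) {
            if (a.seq_id[i][s] != b.seq_id[i][s]) {
                return false;
            }
        }
    }
    return true;
}
```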
Right, this should cover the case where different sequences are used.

However, I don't think it covers the case when a sequence is cleared (which will make mctx->get_rs_z() differ).

I'm noticing different perplexity with and without graph reuse with a Q8_0 mamba-130m on CPU (this is on the first 10 chunks of calibration_datav3):

[perplexity results for -b 512 and -b 2048, with and without graph reuse]

I'm not sure what exactly causes it, but I suspect it's related to either rs_z or head, since this doesn't seem to happen with non-recurrent models (I tested with a Q8_0 TinyLlama).
@ggerganov Checking for head and rs_z mismatch does seem to help with the case in my previous comment, making the graph-reuse case have the same PPL as when it's not used.

Patch with changes

It might not be ideal to expose another way to get head and rs_z, but the constructor of llm_graph_input_rs would need access to llama-memory-recurrent.h to use mctx->get_head() and mctx->get_rs_z().

Strangely enough, hybrid models like Falcon-H1 don't manifest the same problem as mamba-130m; I can't reproduce the original problem with them.
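A minimal sketch of the idea behind such a patch, assuming the build-time values are captured in the input and compared on reuse (the struct and member names are hypothetical, not the actual patch):

```cpp
#include <cstdint>

// Illustrative sketch: the recurrent-state input remembers the values the graph
// was built with and refuses reuse when the new ubatch's values differ.
struct rs_input_example {
    int64_t head; // slot start at graph-build time
    int64_t rs_z; // first zeroed state at graph-build time
    int64_t n_rs; // number of recurrent states at graph-build time

    bool can_reuse(int64_t cur_head, int64_t cur_rs_z, int64_t cur_n_rs) const {
        // All three values are baked into view offsets of the built graph,
        // so the graph is only valid when they all match the new ubatch.
        return head == cur_head && rs_z == cur_rs_z && n_rs == cur_n_rs;
    }
};
```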
Can you clarify what you mean here? The proposed solution seems OK to me.

On a related topic, would it be possible to avoid these offsets through the use of ggml_set_rows() in a similar way as we avoided the KV cache offset for the regular attention?
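A sketch of that suggestion, assuming the ggml_set_rows(ctx, a, b, c) form where c holds the destination row indices: the target rows become runtime input data instead of a build-time view offset, so one graph can serve any slot position. The helper and tensor names are assumptions, not existing llama.cpp code.

```cpp
#include "ggml.h"

// Illustrative sketch: scatter the new recurrent states into the full state
// buffer at runtime-provided row indices instead of a fixed `head` offset.
static ggml_tensor * store_states(ggml_context * ctx,
                                  ggml_tensor * state_all, // [d_state, n_total_states]
                                  ggml_tensor * state_new, // [d_state, n_rs]
                                  ggml_tensor * rows) {    // [n_rs], I64, filled per ubatch
    // The row indices are regular input data, so `head` no longer appears as a
    // build-time constant anywhere in the graph.
    return ggml_set_rows(ctx, state_all, state_new, rows);
}
```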
@compilade I improved the management of the recurrent state with 6589d3b. The recurrent memory context now keeps immutable values such as head, rs_z, etc. These can be used in the can_reuse() logic without duplicating this state in the inputs.
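A minimal sketch of the resulting reuse check, assuming the memory context exposes these immutable values through getters (the class below is a stand-in, not the actual llama-memory-recurrent interface):

```cpp
#include <cstdint>

// Illustrative stand-in for a recurrent memory context that keeps its per-ubatch
// values immutable and exposes them through getters.
class rs_memory_ctx_example {
public:
    rs_memory_ctx_example(uint32_t head, int32_t rs_z, uint32_t n_rs)
        : head(head), rs_z(rs_z), n_rs(n_rs) {}

    uint32_t get_head() const { return head; } // start of the slot
    int32_t  get_rs_z() const { return rs_z; } // first zeroed state
    uint32_t get_n_rs() const { return n_rs; } // number of recurrent states

private:
    const uint32_t head;
    const int32_t  rs_z;
    const uint32_t n_rs;
};

static bool can_reuse_rs(const rs_memory_ctx_example & built,
                         const rs_memory_ctx_example & cur) {
    // The values are fixed for the lifetime of each memory context, so comparing
    // them is enough to decide whether the previously built graph is still valid,
    // without copying them into the graph inputs.
    return built.get_head() == cur.get_head() &&
           built.get_rs_z() == cur.get_rs_z() &&
           built.get_n_rs() == cur.get_n_rs();
}
```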