Idea: Downscaling the K and/or Q matrices for repeated layers in franken-merges? #198
This is an interesting idea! I've played around with downscaling …
@cg123 @shamanez I don't know if you've seen 3blue1brown's last couple of videos: https://youtube.com/c/3blue1brown but his interpretation of what the transformer is doing makes me really think the downscaling is even more important. This gives a 3rd way to consider how stacking multiple layers could improve models, ie: more fine-grained movement within the high-dimensional vector space leading to a more accurate final position!

The W_K and W_Q matrices both being scaled by 1/sqrt(2) would be equivalent to halving the score matrix values (which may or may not be the best scale factor, but probably a good start). The interesting point the 3blue1brown video raises is that the W_V matrix should probably be halved too. If it isn't, then his sum-of-vectors interpretation will overshoot even if you attenuate the sharpness of the softmax!

BUT: This doesn't account for the MLP layer that follows the attention, and the downscaled outputs of softmax * V might be too weak to push through the non-linearity of the MLP (or more likely just be plain wrong for what the MLP is expecting to see). So perhaps the W_V matrix should be left alone and the norm layer be downscaled instead?

BUT: From some reading a few weeks ago, IIRC the current LLMs actually apply their norm layers before the operations for some reason (compared to the original "Attention is all you need" paper), and that's why there is the extra norm layer right at the end that I couldn't work out the point of having before... So it's possible the residuals and/or the norm layers will both need downscaling so the vector-addition interpretation is maintained?

Also the "confidently wrong" phenomenon can just as easily be explained by overshooting in the vector space as by the sharpness of the softmax output (probably even more so!).
I've just realised there is a super easy way to do this using the filters: just make a copy of the model, zero all the weights out, and then use the linear merge method. This would also be useful for a method described in a paper I can't find atm (related to the …).

Obviously it would be better if we could somehow specify to interpolate towards zero in the yaml file rather than have to make a zeroed model for this! :)
@cg123, I was just looking through the source of … Is this the correct use of it:
https://github.com/arcee-ai/mergekit/tree/main/mergekit/_data/architectures

I can't see any … Really excited to try this now!!!
If this actually does anything useful then I will write some C++ to try to explore the scale factor over lots of random inputs:
Also, if anybody knows the name of this paper, I will make a couple of examples for both this "attenuated passthrough" idea above and the "identity duplication" idea from this paper using the new …
Added some logging to …

It would be nice to print the tensor names to be 100% sure though, but my Python skills are pretty much zero and runtime type-checking of long-runtime jobs reminds me exactly why... 😦

I've double checked and am fairly sure this is the correct set of … We want to scale …

It will probably take all afternoon to test what this does, but I think this self-merge of … The only other merged model I have used extensively is …
Just realised I'm double scaling by altering the input norm like that... Gonna leave that for now:
and see what this does to start with.
I really need to know the architecture of exactly what the MLP layers look like: it could be the … Either way I think this is less interesting than the scaling of the score matrix via …
Holy crap, this does work!!! Using …
(this was a copy and paste fail from me, but he didn't fall for it! 😁)
https://old.reddit.com/r/LocalLLaMA/comments/15kntrq/i_asked_a_simple_riddle_to_30_models/
Now for the slightly more problematic puzzles this one is getting wrong. The old …
This also doesn't seem to "lose" the information like the original …
This is a question I have definitely asked the … He doesn't seem to have mentioned the key insight from the paper - compressing sequences to a fixed-length vector (but I don't know if the original … )

THIS IS NUTS!!!
So now the question is:
Actually, the more I think about this, the more I think maybe we don't actually need to mess with the norm(s) or …
Yep, double-stacked …
EDIT: Just realised I keep pasting in the stupid version of this question, but either way this model wasn't much use compared to "Sally has 3 sisters. The information about their brothers is not relevant to the question. So, Sally still has 3 sister", and did show the extreme verbosity problem I had when I last tried this without the attenuation applied.
Yeah, gibberish: "SER sisters"... But sound logic.
I'll try it on Wolfram's …
Some better, some worse, but still very good.
I will report back how the …
I have failed so far to get any of the original/big merge makers interested in this so gonna try and re-explain here: Imagine you are trying to get from one location to another:
If you have an extra opportunity to correct your plan, it doesn't make sense to face in approximately the correct direction and then move the FULL distance, reevaluate, and then again face in approximately the correct direction and then move the FULL distance again! These 3 ways of looking at what a transformer block is doing all suggest this too:
The (1) paper suggests each block is acting as an "in-context" learner, so down-scaling the score matrix is similar to 'damped Newton's method', or even simpler: think of it as the learning rate in gradient descent - if you set the learning rate too high you will overshoot the minima and get the "rattling" phenomenon where the error starts bouncing around instead of decreasing.

The (2) paper, and the more visual blog explanation, suggest each block is acting as an associative memory stored using (exponentially) many local minima. So instead of just trying to jump straight to the bottom of the closest local minimum, it might make sense to take a smaller step, reevaluate, and take another smaller step by changing the β parameter (have a look at the greyscale pictures for a better idea of what's going on).

The (3) videos suggest each block is adding vectors to each other (nose to tail) in a (very) high-dimensional vector space, where each point in this high-dimensional vector space has a semantic meaning. So again it makes sense to add together 2 smaller vectors to get to a point (with a certain semantic meaning), rather than two copies of the same vector that will likely overshoot out into the edges of this high-dimensional vector space.

Empirically it also seems that this holds: the bigger the "chunk" of repeated transformer blocks (ie: 40 instead of 20 or 16), the more "confidently wrong" the model gets.

So now hopefully it makes a bit more sense why we would want to move 1/2 the distance if we have 2 opportunities to reevaluate, 1/3 of the distance if we have 3 opportunities to reevaluate, and so on.

If we just wanted to use PyTorch and the Transformers library directly then it would be pretty easy to add in a β parameter that we could use to re-scale or "attenuate" the score matrix (it actually already does this and scales by the sqrt of the latent dimension anyway!) to take these multiple steps, BUT: if we want to use these models in software like …

So now looking at how the transformer block works, we see that the score matrix is made by taking lots of dot-products between a latent vector that is first projected using the … The problem with this though is that you have now reduced the magnitudes of the values of only the …

So to solve this problem we can use the identity:
and now if you think about what is happening when we take a dot-product:
and then home in on just looking at … If we want this sum to be half the magnitude we can use the identity above to get:
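(Reconstructing the algebra being described here in my own notation, where q and k are the projected query/key vectors and d is their dimension:)

$$(\alpha\,\mathbf{q})\cdot(\beta\,\mathbf{k}) = \alpha\beta\,(\mathbf{q}\cdot\mathbf{k}), \qquad \mathbf{q}\cdot\mathbf{k} = \sum_{i=1}^{d} q_i k_i$$

$$\left(\sqrt{\tfrac{1}{2}}\,\mathbf{q}\right)\cdot\left(\sqrt{\tfrac{1}{2}}\,\mathbf{k}\right) = \tfrac{1}{2}\,(\mathbf{q}\cdot\mathbf{k})$$

So scaling both the W_Q and W_K projections by sqrt(1/2) halves every dot-product in the score matrix.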
So after all this, hopefully it makes sense what is going on and where this number is coming from! There are a few caveats:
I added this to …

and then run … You should then see some …

So to help demonstrate this idea even more, here are 3 popular frankenmerge model configs that I have adapted, with extensive comments added to show the similarity between the existing configs and these new "attenuated" versions:
IMPORTANT: It seems that for merging different models, scaling the score matrix by 0.5 (and hence … ) …

Reduce[{x^2 == 0.5, x > 0}, x, Reals]

Replace 0.5 with whatever value in the range … If you want to experiment with duplicating 3 blocks:

Reduce[{x^3 == 0.333, x > 0}, x, Reals]

Replace 0.333 with whatever value in the range … and so on...
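For anyone without Mathematica, here's a quick Python equivalent of the Reduce[...] calls above (my own sketch - it just solves x**n == target for the positive root):

def positive_root(target: float, n: int) -> float:
    """Positive real solution of x**n == target, ie: the per-matrix scale factor."""
    return target ** (1.0 / n)

print(positive_root(0.5, 2))    # ~0.7071, same as Reduce[{x^2 == 0.5, x > 0}, x, Reals]
print(positive_root(0.333, 3))  # ~0.6933, same as Reduce[{x^3 == 0.333, x > 0}, x, Reals]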
Is there any way I can simplify these complex looking … ?
Very interesting, thank you for your research and sharing the results. I am still trying to get my head around it, which I am finding a bit difficult with the limited time that I have at the moment to work on this. Maybe I can apply it to WestLake-v2-10.7b, which should allow me to benchmark it relatively quickly and compare to the original. Are you planning to publish the 103b self-merge of miqu-1 using this method?
I can't upload the merged model easily, as I only have around 1 Mb/s upload on VDSL :/ But this is what the …
To create the merge:
If you wanted to make your own GGUF quants (assuming you have … )
Understood. I don't have much upload bandwidth either, but at least a bit more. I have just run the merge for miqu-1-120b-attenuated, and I am in the middle of converting and quantising it. I will upload the Q4_KS version to froggeric/miqu-1-120b-attenuated-GGUF. Thanks for the 103b yaml.
No problem - and if you get a chance, try using different (larger) values of …
For the 103b self-merge, I would actually be more interested in adapting the following recipe instead:

slices:
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [0,11]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [9,13]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [11,15]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [13,17]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [15,23]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [21,25]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [23,49]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [47,51]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [49,53]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [51,55]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [53,57]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [55,59]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [57,61]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [59,63]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [61,65]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [63,67]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [65,69]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [67,71]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [69,73]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [71,75]
- sources:
- model: 152334H/miqu-1-70b-sf
layer_range: [73,80]
merge_method: passthrough
dtype: float16
After doing the goliath-120b version I don't think my brain could handle this! 🤣

You would need to try to work out how many times each layer is getting duplicated first - probably best to look at the original goliath-120b model and compare with what I did to see what's needed.

Hopefully @cg123 or @shamanez might get back to us to see if there is a simpler way to do this!

I'm going to recreate the …
Omg this is one way to merge 😁.
Nice, you found my secret scale parameter. :) This is really interesting, it's great to see actual results coming out of this! I'm out of town at the moment so I can't really dig into this for a while. I'll be watching with interest for sure though. I'd definitely recommend writing a little python script or something to generate these configs.
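Something along these lines is what I mean - a rough sketch only (the helper name and the exact placement of the scale parameters in the YAML are my guesses, not an existing mergekit tool):

import yaml

def interleaved_passthrough(model, n_layers, block, overlap, scale_filters=None):
    """Rough sketch: emit a passthrough config of overlapping layer blocks,
    optionally attaching scale filters to the duplicated (non-edge) blocks."""
    slices = []
    for lo in range(0, n_layers - overlap, block - overlap):
        hi = min(lo + block, n_layers)
        entry = {"sources": [{"model": model, "layer_range": [lo, hi]}]}
        if scale_filters and lo != 0 and hi != n_layers:
            entry["sources"][0]["parameters"] = {"scale": scale_filters}
        slices.append(entry)
    return {"slices": slices, "merge_method": "passthrough", "dtype": "float16"}

print(yaml.safe_dump(
    interleaved_passthrough(
        "152334H/miqu-1-70b-sf", 80, block=20, overlap=10,
        scale_filters=[{"filter": "q_proj", "value": 0.7071067812},
                       {"filter": "k_proj", "value": 0.7071067812},
                       {"value": 1.0}],
    ),
    sort_keys=False))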
There's also some discussion in this thread too: https://huggingface.co/wolfram/miqu-1-120b/discussions/4

divinetaco (who has been running a grid search to find optimal merge parameters) has also confirmed it doesn't seem to work very well for non-self-merges:
Hopefully somebody can work out how to apply the "0 weight merge on first / final layer" thing in case that is important.
I'm away for a few days now, but here's a couple more ideas to experiment with:
The … If there is a non-linearity after …, you probably don't want to scale both …

The next thing to try experimenting with for non-self-merges (where this doesn't seem to be working well) is to not scale the "donor" models' …
and see if that helps.
I just finished evaluating miqu-1-120b-attenuated-GGUF against my LLM Creativity benchmark and it performs slightly better than the non-attenuated version, taking the second spot in the current leaderboard. I noticed some improvements in creative writing, producing longer, more detailed, and unrushed text. Like wolfram/miqu-1-120b though, there is some degradation compared to miqu-1-70b with longer texts, as it starts deviating from instructions and requires some effort to keep it on track. The scores show some improvement in all categories.
This is a fascinating discussion. I am cooking up an attenuated version of Midnight-Miqu-103B-v1.5 as I type this. I'll be curious to see how it performs.
Only have my tablet here, so I'm using Claude Opus to format this, but it seems to write invalid LaTeX/KaTeX for GitHub :( Let's analyze how the expected value of the L2 norm of the sum of two high-dimensional random unit vectors changes as the correlation between them increases from 0 to 1.
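Here is (I think) the calculation it was trying to show, for two unit vectors a and b with correlation (cosine similarity) ρ:

$$\lVert \mathbf{a}+\mathbf{b}\rVert^{2} = \lVert\mathbf{a}\rVert^{2} + 2\,\mathbf{a}\cdot\mathbf{b} + \lVert\mathbf{b}\rVert^{2} = 2 + 2\rho \quad\Longrightarrow\quad \mathbb{E}\big[\lVert\mathbf{a}+\mathbf{b}\rVert\big] \approx \sqrt{2 + 2\rho}$$

so the norm of the sum goes from sqrt(2) ≈ 1.414 at ρ = 0 up to 2 at ρ = 1, and the factor needed to bring the sum back to unit length goes from 1/sqrt(2) ≈ 0.7071 down to 1/2.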
In summary, as the correlation between the two high-dimensional random unit vectors increases from 0 to 1, the expected value of the L2 norm of their sum varies from sqrt(2) ≈ 1.414 up to 2. So basically what this is saying is:
So putting this back into the value in the above …
This all should be taken with a grain of salt, as the actual model is very different to this simple example, but it is probably a good place to start playing with different scale factors to try to get it working with multi-model merges.

Sorry for the bad tablet formatting - autocorrect is painful for stuff like this, but thought it was worth posting before I get back, whilst people are still experimenting with this idea.
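Here's a quick sketch of the two resulting factors for the extreme cases (assuming the per-copy scale is 1/sqrt(2 + 2ρ) and the matching Q/K factor is its square root, so the score matrix shrinks by the same amount):

import math

for rho, label in [(0.0, "uncorrelated (multi-model merge)"),
                   (1.0, "fully correlated (self-merge)")]:
    residual_scale = 1.0 / math.sqrt(2.0 + 2.0 * rho)   # per-copy scale for the summed contributions
    qk_attenuation = math.sqrt(residual_scale)          # per-matrix q_proj/k_proj factor
    print(f"{label}: residual_scale={residual_scale:.10f}, qk_attenuation={qk_attenuation:.10f}")

# uncorrelated: residual_scale ≈ 0.7071067812, qk_attenuation ≈ 0.8408964153
# correlated:   residual_scale = 0.5,          qk_attenuation ≈ 0.7071067812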
Did you add the logging output and run with … ?
See this post for the logging alteration:
class LlamaMLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
if self.config.pretraining_tp > 1:
slice = self.intermediate_size // self.config.pretraining_tp
gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
up_proj_slices = self.up_proj.weight.split(slice, dim=0)
down_proj_slices = self.down_proj.weight.split(slice, dim=1)
gate_proj = torch.cat(
[F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
)
up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
down_proj = [
F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
]
down_proj = sum(down_proj)
else:
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
return down_proj

The key line is this:

down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

and:

# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)

So … So this should definitely do something and, as far as I can see, is the only place we can apply this scaling: all the other places look like they will end up scaling V and/or end up scaling the inputs to the MLP non-linear activation functions.

When I get back home I'll look into seeing if I can get this to actually run, and then I should be able to get some interesting diagnostic stuff printed out for the actual transformer blocks we are dealing with, instead of just having to guess at it.
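As a sanity check of the reasoning (a standalone toy sketch, not the real Llama module): because down_proj has no bias and its output is added straight onto the residual stream, scaling it is exactly the same as scaling the MLP block's contribution.

import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 16, 64
gate = nn.Linear(d_model, d_ff, bias=False)
up = nn.Linear(d_model, d_ff, bias=False)
down = nn.Linear(d_ff, d_model, bias=False)
act = nn.SiLU()

def block(x, down_scale=1.0):
    # down_proj has no bias, so scaling its weight == scaling its output.
    mlp_out = down(act(gate(x)) * up(x)) * down_scale
    return x + mlp_out  # residual add, as in the decoder layer code above

x = torch.randn(4, d_model)
full_contribution = block(x) - x
half_contribution = block(x, down_scale=0.5) - x
assert torch.allclose(half_contribution, 0.5 * full_contribution, atol=1e-6)
print("halving down_proj halves the residual contribution")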
class LlamaRMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""
LlamaRMSNorm is equivalent to T5LayerNorm
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)

@froggeric maybe this is why it had no effect? It looks like it actually calculates the standard deviation of the actual input instance's values and then uses this to rescale:

variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

rather than using the weight as a standard deviation it found during training (as I had naively assumed). @cg123 can probably confirm this?

If so then I don't think there will be any place to change the scale like this without altering the transformer code, and in turn making the model impossible to quantize using llama.cpp or other software :/
@cg123 does changing the … ?
This still looks like it should do something, as … The only way I can see it would do nothing is if …

@froggeric I'd leave playing with …
From: https://arxiv.org/abs/2309.00071

In Section 3.4, the authors introduce a temperature factor … This attention scaling has a uniform impact on perplexity, regardless of the data sample and token position over the extended context window. The authors reparameterize RoPE as 2D matrices, allowing them to scale both … In Appendix A.2, the authors analyze the impact of … The authors observe that a suitable …

Sorry for the AI summary, but it saves me trying to write that lot out with autocorrect 😁 It's interesting they also attenuate the softmax, but there doesn't seem to be any real theory behind it other than it was found to work empirically.
Found it again: https://arxiv.org/abs/2401.02415

So this can be implemented now in mergekit by setting the … It will require fine-tuning, but I think it's an interesting idea for merging methods and would be a good way to show off the use of the passthrough …
I'm pretty sure I've completely cracked this now and just running a …
Whilst I'm waiting for the model to finish cooking, I'll start with these 2 images from The Unreasonable Ineffectiveness of the Deeper Layers paper @cg123 implemented to prune back …

I've marked in red the bounds of where there is a clear transition in the early layers for the … I've also marked in orange where there is a similar but much fainter transition in the last layers, but there is also an interesting bit of blue on the diagonal that consists of 2-3 layers at the end. I know people nearly always don't overlap the last 10 layers, but when I was messing about with …

I strongly suspect these are the places where the model is transitioning in/out of its latent space representation and should definitely not be messed with. So this leaves everything between these layers as likely the place where 3blue1brown's "addition of vectors" interpretation makes sense (ie: the residual stream hypothesis), and this in turn leads us to scaling …
So now forgetting about the QK-attenuation stuff from earlier in the thread, the question is: why are we overlapping/interleaving the blocks of layers? Could it be that this just happens to add lots of noise to the vectors being added and is a sort of accidental/rudimentary way of scaling the vectors to avoid overshooting!? So if that is the case, and we can actually work out the correct scale factor for …
So using the numbers I found a few posts ago for uncorrelated vectors (ie: multi-model merges), we get:

Scale: sqrt(1/2) ≈ 0.7071067812 (for down_proj / o_proj)

and with the attenuation modification (inspired by the Hopfield Networks is All You Need and Transformers Learn Higher-Order Optimization Methods for In-Context Learning papers linked in the OP), we get:

Scale: sqrt(sqrt(1/2)) ≈ 0.8408964153 (for q_proj / k_proj)

and using these numbers …
Who appears to be completely coherent!!!! USERWrite me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small mirror. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert. ASSISTANTThe sun beat down mercilessly on the parched earth, casting a shimmering haze over the desolate landscape. In the distance, the crumbling remains of a once-thriving town stood as a testament to the passage of time and the cruelty of the desert. The wind howled through the empty streets, carrying with it the whispers of ghosts long forgotten. Dust swirled in the breeze as a lone figure stumbled into view, his once-yellow robes now stained and tattered from days of relentless travel. His name was long forgotten, erased from his memory by the cult that had raised him, but he knew he could no longer be a part of their twisted beliefs. He was a man on the run, a heretic in the eyes of his former brethren, and he would do anything to escape their clutches. The cultist, now a fugitive, entered the ghost town, his eyes darting warily from side to side. He knew that his pursuers couldn't be far behind, and he needed to find shelter before they caught up to him. The sun was setting, casting long shadows across the abandoned streets, and he knew that nightfall would bring with it a different kind of danger. As he walked, he fingered the small, mysterious mirror tucked securely in his robes. It was the only possession he had managed to take with him when he fled the cult's stronghold, and he had no idea of its significance. All he knew was that it was important enough for the cult to send their most skilled assassins after him. The cultist ducked into the first building he came across, a ramshackle tavern that looked as if it had been abandoned in a hurry. Dust covered every surface, and the stale smell of old ale and despair hung heavy in the air. He moved cautiously through the dimly lit interior, his heart pounding in his chest, and climbed the creaking stairs to the upper floor. In a room at the end of the hall, he found a small, dust-covered mirror hanging crookedly on the wall. The cultist removed the mirror from his robes and held it up to the flickering light of the candle he had lit. As he peered into the reflective surface, he gasped in shock. The face staring back at him was not his own, but that of a stranger. His thoughts raced as he tried to make sense of the impossible. How could this be? He had no memories of his life before the cult, and yet here he was, looking at the face of a stranger. The sound of footsteps on the stairs jolted him from his reverie, and he knew he had no time to ponder the mystery now. Quickly, he extinguished the candle and crept to the window, peering through the grimy glass. Below, he saw the unmistakable forms of his pursuers, their hooded figures casting long shadows in the moonlight. He knew they would stop at nothing until they had recovered the mirror and brought him back to face the cult's justice. As he watched, a plan began to form in his mind. He would use the darkness and the abandoned town to his advantage, leading his pursuers on a deadly game of cat and mouse. 
He was no longer the naive initiate they had indoctrinated; he was a survivor, and he would do whatever it took to stay alive. The cultist, now a fugitive on the run, slipped into the shadows, his heart pounding in his chest. The game had only just begun. USERCan you compare and contrast the writing styles of Joe Abercrombie and Rob J Hayes? They seem very similar to me... ASSISTANTJoe Abercrombie and Rob J. Hayes are both popular fantasy authors known for their gritty and dark storytelling. While their writing styles share some similarities, there are also some key differences.
In summary, while both Abercrombie and Hayes write dark, gritty fantasy with complex characters and vivid worlds, their writing styles differ in terms of pacing, humor, and the level of detail in their worldbuilding and character development. USERCan you give a quick summary of the most popular 2-3 works of each for me? ASSISTANTJoe Abercrombie:
Rob J. Hayes:
So again using the numbers I found a few posts ago for correlated vectors (ie: self-merges), we get:

Scale: 1/2 = 0.5 (for down_proj / o_proj)

and with the attenuation modification, we get:

Scale: sqrt(1/2) ≈ 0.7071067812 (for q_proj / k_proj)

and using these numbers we …
Who seems pretty smart: USERTwo people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing? ASSISTANTThe other person is also operating the machine. The statement "requires two people to operate" implies that both individuals are actively involved in operating the machine. Therefore, the other person is not just present in the room but is actively participating in operating the machine as well. USERSally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have? ASSISTANTSally has 3 sisters. The information about each sister having a brother does not affect the number of sisters Sally has. Therefore, Sally still has 3 sisters. The information about their brothers does not change the number of sisters Sally has. Confusion might arise because the question mentions brothers, but it is not relevant to the question at hand, which is about the number of sisters Sally has. The number of brothers or brothers-in-law does not change the number of sisters Sally has. So, Sally has 3 sisters. USERSally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? ASSISTANTSally has 2 sisters. Here's the reasoning:
Therefore, Sally has 2 sisters in total, including herself. However, this question can be confusing because it might lead some people to think that each brother has a different set of sisters, but in fact, they share the same sisters. So, Sally has 1 sister in addition to herself. Confidence: 99% USERSally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Think step by step as this is a tricky question: ASSISTANTLet's break this down:
Therefore, Sally has 1 sister. Confusion might arise because the question mentions that each brother has 2 sisters, but it's important to remember that Sally is also counted as one of the sisters. Final answer: Sally has 1 sister. USERSolve this family relationship quiz: Marie is Emma's parent. Emma is Carl's parent. Carl is Timothy's parent. Timothy is male. Who's Timothy to Marie? ASSISTANTTimothy is Marie's grandchild. Here's the reasoning:
From these relationships, we can deduce that:
Therefore, Timothy is Marie's grandchild. USERSolve this family relationship quiz: Betty is Brenda's parent. Betty is Joe's parent. Joe is Charles' parent. Brenda is female. Who's Brenda to Charles? ASSISTANTBrenda is Charles' aunt. Here's the reasoning:
USERSolve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward? ASSISTANTTimothy is Edward's grandfather. Here's the reasoning:
Therefore, Timothy is Edward's grandfather. However, the question asks for Timothy's relationship to Edward, so the answer is that Timothy is Edward's grandfather. USERSolve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward? Think step by step as this is a tricky problem: ASSISTANTLet's break this down step by step:
So, Timothy is Edward's uncle. Final answer: Timothy is Edward's uncle.

I can't really comment further as I have no real principled way to test these model configs other than for coherency, story telling and this handful of puzzles... Hopefully others can do that and also see if the configs can be tweaked to work even better. So I think I'm largely done with this now. I only have 2 requests for people who want to carry on experimenting:
Big thanks to @cg123 for adding the scale parameter!
Triple merging seems to work OK too. Sadly I couldn't get the 200b mix of 3 models to work in llama.cpp:
So had to just let the last 30 layers be left unmerged to get a 160b model:
Interesting, and it matches exactly what I had theorised a while ago. At the time it did not work though, and I started exploring the size of the overlaps. Now I realise what I was missing was the attenuation. I am currently benchmarking the miqu self-merge. However, I find 120b (140 layers) a bit too large. Maybe we can combine it with this: only duplicate the most important layers, according to exl2 measurements, with the exception of the beginning and the end layers. What I am proposing is the following for a 96b model (112 layers):

# The models we are going to use.
const_tag: &MODEL1 152334H/miqu-1-70b-sf
const_tag: &MODEL2 152334H/miqu-1-70b-sf
# The amount to scale the residual contributions and the amount to attenuate the Q and K matrices.
# Correlation = 0 --> RESIDUAL_SCALE_FACTOR = sqrt(1/2) [≈ 0.7071067812] & QK_ATTENUATION_FACTOR = sqrt(sqrt(1/2)) [≈ 0.8408964153]
# Correlation = 1 --> RESIDUAL_SCALE_FACTOR = 1/2 [≈ 0.5] & QK_ATTENUATION_FACTOR = sqrt(1/2) [≈ 0.7071067812]
const_tag: &RESIDUAL_SCALE_FACTOR 0.5 # Set to 1.0 to leave unchanged.
const_tag: &QK_ATTENUATION_FACTOR 0.7071067812 # Set to 1.0 to leave unchanged.
# How we are going to scale each matched tensor type.
scale-filter-env: &scale_filter_env
parameters:
scale:
- filter: o_proj
value: *RESIDUAL_SCALE_FACTOR
- filter: down_proj
value: *RESIDUAL_SCALE_FACTOR
- filter: q_proj
value: *QK_ATTENUATION_FACTOR
- filter: k_proj
value: *QK_ATTENUATION_FACTOR
- value: 1.0
slices:
# The first 10 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [0, 10]
- sources:
- model: *MODEL1
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [10, 11]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [11, 12]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [12, 13]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [12, 13]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [13, 14]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [13, 14]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [14, 15]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [14, 15]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [15, 16]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [15, 16]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [16, 17]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [16, 17]
<<: *scale_filter_env
# Layers 17-21 are not duplicated (less relevant according to exl2 measurements)
- sources:
- model: *MODEL1
layer_range: [17, 21]
- sources:
- model: *MODEL1
layer_range: [21, 22]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [21, 22]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [22, 23]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [22, 23]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [23, 24]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [23, 24]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [24, 25]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [24, 25]
<<: *scale_filter_env
# Layers 25-49 are not duplicated (less relevant according to exl2 measurements)
- sources:
- model: *MODEL1
layer_range: [25, 49]
- sources:
- model: *MODEL1
layer_range: [49, 50]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [49, 50]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [50, 51]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [50, 51]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [51, 52]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [51, 52]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [52, 53]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [52, 53]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [53, 54]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [53, 54]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [54, 55]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [54, 55]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [55, 56]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [55, 56]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [56, 57]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [56, 57]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [57, 58]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [57, 58]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [58, 59]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [58, 59]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [59, 60]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [59, 60]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [60, 61]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [60, 61]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [61, 62]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [61, 62]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [62, 63]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [62, 63]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [63, 64]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [63, 64]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [64, 65]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [64, 65]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [65, 66]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [65, 66]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [66, 67]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [66, 67]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [67, 68]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [67, 68]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [68, 69]
<<: *scale_filter_env
- sources:
- model: *MODEL1
layer_range: [69, 70]
<<: *scale_filter_env
- sources:
- model: *MODEL2
layer_range: [69, 70]
<<: *scale_filter_env
# The last 10 layers are not duplicated.
- sources:
- model: *MODEL1
layer_range: [70, 80]
merge_method: passthrough
dtype: float16
Yeah, this should hopefully open up a lot of new ways of merging! I'm keen to see if a non-interleaved version of …

I've just added a second NVMe that will save me from thrashing the main NVMe to death, and hopefully I can set up some way to automatically optimise these parameters soon.
Just tried this:
but it's nothing like as good at writing compared to the original … It does seem to be a lot more coherent, and less prone to weird spelling mistakes and stopping mid-sentence, but we definitely need to find some way to optimize this... I might try to minimise perplexity on actual text generated by …
I have almost finished putting miqu-test through all my evaluation tests, and unfortunately I have to report it is a bit brain damaged:
It's definitely an improvement over my experiments with doubled layers and no attenuation, but it reinforces my findings that doubled layers are not viable. I have not finished testing the 96b self-merge with selection based on layer relevance, but so far I am seeing scores lower than what I would expect them to be. Once I finish with those 2, I will go back to non-doubled layers, and test adding the attenuation to the 103b I posted earlier (it uses merge selection based on layer relevance).
Yeah, my attempt didn't work very well with …
(the … )
I've got a second NVMe set up that isn't so much hassle to replace if this kills it now. I'll try and see if I can test all these to see the effect they have.
I've created 10 prompts and will test all 54 combinations to see if this helps the alternating-layer … I've already found that just switching the order of the alternating layers:
vs
seems to make a difference to the quality of the stories.
So I've changed to use these, as anything using …
I've only got through the first 15 out of 54 so far but there is a pattern developing, especially with this prompt:
I find the last bit most interesting, but need to run the full test to be sure... It could be that the MLP layer (with its norm before) expects the distribution caused by the …

EDIT: Just noticed a lot of the other 9 prompts with … This seems a good test prompt too:
The dumber the model, the more likely it is to just write a screenplay instead of a story... It definitely looks like …
Still running, but I think there is a pretty clear pattern for the …
So far: I'm going to leave it to finish to have both orderings (ie: also … ). I may try … I would say the best settings are at least as good as …
These seem to be working really well:
It seems setting …
Interested to see what you guys make of this:
The same templates should work for other self-merges and multi-model merges too (gonna try a … ). An interesting experiment is to reduce the … If you want to merge 3 copies/models then …
All seem like sensible things to try for multi-model merges.
Wow, the … I'll have another look at the …
I am looking forward to testing it :-) For now, I am still finishing tests on some of the settings you suggested, and I have a few more merge patterns I would like to try. So far, none of the miqu self-merges I tested has been worthwhile, apart from the very first one from this conversation.

Regarding the importance matrix, I would suggest leaving it out for now, which is what I do in my testing. The reason being, it does influence the model's behaviour. This is something I and a few others have observed: when using an English-based matrix, the model's multilingual capabilities show some noticeable degradation. For multilingual behaviour, it is easy to notice. I expect the same kind of degradation for other capabilities, only they are more difficult to notice. Ideally, the importance matrix should be based on the training or fine-tuning dataset.
I would hold off, as I will try running grid-search for …
Yeah, I'm redoing all the goliath models for a fair comparison now.
Rerunning all the tests now. I think I'll make different Hugging Face model cards for these, as this thread is getting way too many huge files in it and is hard to navigate. I'll post back here any findings to do with changing the parameters.
I posted all the results here: https://huggingface.co/jukofyork/goliath-esque

The … It still seems to be doing occasional weird stuff. It almost looks like the shorter the prompt, the more it does this.
So back to merging blocks of layers now, as there are just too many WTFs with the alternating layers from that set of tests. A few interesting observations that might help decide what to try next:

The 'interleaving' of self-merges is an illusion:

- sources:
- layer_range: [0, 20]
model: *MODEL
- sources:
- layer_range: [10, 30]
model: *MODEL
- sources:
- layer_range: [20, 40]
model: *MODEL
- sources:
- layer_range: [30, 50]
model: *MODEL
- sources:
- layer_range: [40, 60]
model: *MODEL
- sources:
- layer_range: [50, 70]
model: *MODEL
- sources:
- layer_range: [60, 80]
model: *MODEL

as this is 100% equivalent to this (see the sketch at the end of this comment for a quick programmatic check):

- sources:
- layer_range: [0, 10]
model: *MODEL
- sources:
- layer_range: [10, 20]
model: *MODEL
- sources:
- layer_range: [10, 20]
model: *MODEL
- sources:
- layer_range: [20, 30]
model: *MODEL
- sources:
- layer_range: [20, 30]
model: *MODEL
- sources:
- layer_range: [30, 40]
model: *MODEL
- sources:
- layer_range: [30, 40]
model: *MODEL
- sources:
- layer_range: [40, 50]
model: *MODEL
- sources:
- layer_range: [40, 50]
model: *MODEL
- sources:
- layer_range: [50, 60]
model: *MODEL
- sources:
- layer_range: [50, 60]
model: *MODEL
- sources:
- layer_range: [60, 70]
model: *MODEL
- sources:
- layer_range: [60, 70]
model: *MODEL
- sources:
- layer_range: [70, 80]
model: *MODEL

Goliath's alternating block structure:

- sources:
- layer_range: [0, 16]
model: *MODEL1
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [17, 32]
model: *MODEL1
- sources:
- layer_range: [25, 40]
model: *MODEL2
- sources:
- layer_range: [33, 48]
model: *MODEL1
- sources:
- layer_range: [41, 56]
model: *MODEL2
- sources:
- layer_range: [49, 64]
model: *MODEL1
- sources:
- layer_range: [57, 72]
model: *MODEL2
- sources:
- layer_range: [64, 80]
model: *MODEL1 can be looked at like this: - sources:
- layer_range: [0, 16]
model: *MODEL1
# Replace layer 17 of MODEL1 with layers 16 layers (ie: 8-23) of MODEL2.
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [17, 32]
model: *MODEL1
# Replace layer 32 of MODEL1 with layers 15 layers (ie: 25-39) of MODEL2.
- sources:
- layer_range: [25, 40]
model: *MODEL2
- sources:
- layer_range: [33, 48]
model: *MODEL1
# Replace layer 48 of MODEL1 with layers 15 layers (ie: 41-55) of MODEL2.
- sources:
- layer_range: [41, 56]
model: *MODEL2
- sources:
- layer_range: [49, 64]
model: *MODEL1
# Replace layer 64 of MODEL1 with layers 15 layers (ie: 57-71) of MODEL2.
- sources:
- layer_range: [57, 72]
model: *MODEL2
- sources:
- layer_range: [65, 80]
model: *MODEL1 Winter-Goliath's alternating block structure:- sources:
- layer_range: [0, 16]
model: *MODEL1
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [16, 32]
model: *MODEL1
- sources:
- layer_range: [24, 40]
model: *MODEL2
- sources:
- layer_range: [32, 48]
model: *MODEL1
- sources:
- layer_range: [40, 56]
model: *MODEL2
- sources:
- layer_range: [48, 64]
model: *MODEL1
- sources:
- layer_range: [56, 72]
model: *MODEL2
- sources:
- layer_range: [64, 80]
model: *MODEL1 can be looked at like this: - sources:
- layer_range: [0, 16]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [8, 24]
model: *MODEL2
- sources:
- layer_range: [16, 32]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [24, 40]
model: *MODEL2
- sources:
- layer_range: [32, 48]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [40, 56]
model: *MODEL2
- sources:
- layer_range: [48, 64]
model: *MODEL1
# Squeeze in 16 layers from MODEL2.
- sources:
- layer_range: [56, 72]
model: *MODEL2
- sources:
- layer_range: [64, 80]
model: *MODEL1
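Here's the quick check of the 'interleaving is an illusion' claim above - just flattening both slice lists into the layer sequence they actually produce (my own throwaway script):

def expand(slices):
    """Flatten [start, end) layer ranges into the layer sequence they produce."""
    return [layer for start, end in slices for layer in range(start, end)]

interleaved = [(0, 20), (10, 30), (20, 40), (30, 50), (40, 60), (50, 70), (60, 80)]
duplicated = ([(0, 10)]
              + [(i, i + 10) for i in range(10, 70, 10) for _ in range(2)]
              + [(70, 80)])

assert expand(interleaved) == expand(duplicated)
print(len(expand(interleaved)), "layers in both, in the same order")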
🤮 and:
90% of the time ends up with him doing something good or even not using the grenade??? 🤦
Wow... this is some thread! I'm learning a lot. I did / have tried something similar to this, except I looked at it a bit differently. The theory -> like model layers should be somewhat interchangeable with other like layers. IE: … With the following caveats (based on reading a number of papers on layer position and importance): … Goal was to hit 60 layers... but Colab blew up. I have made GGUFs of it and they work well - but do break on occasion (needs some "healing"). This is at my repo. Here is what the merge file looks like:

slices:
Has anyone tried downscaling the K and/or Q matrices for repeated layers in franken-merges? This should act like changing the temperature of the softmax and effectively smooth the distribution:
Hopfield Networks is All You Need
https://arxiv.org/abs/2008.02217
https://ml-jku.github.io/hopfield-layers/
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
https://arxiv.org/abs/2310.17086
Empirically I've found repeating large blocks does seem to make models "confidently wrong" - stacking two full copies of deepseek-coder or miqu-1 shows this phenomenon really well.