
Is anyone working on 'conditional' logit biasing? (How to potentially improve repetition penalty?) #3149

Closed
kalomaze opened this issue Sep 12, 2023 · 12 comments

@kalomaze
Contributor

kalomaze commented Sep 12, 2023

Prerequisites

  • [✅] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Idea

If we want to efficiently bias against (or outright 'ban') specific words that are made up of multiple tokens, as well as short phrases, checking the logit list to see whether the other predictions imply the 'full word' or 'full phrase' could be very beneficial. The model currently predicts only one token at a time, so deciding whether to pick a token based on context clues (e.g. short synonyms appearing alongside the first piece of a larger word) would be beneficial, as there would be no overhead from 'rewinding' or reprocessing context.

A related draft PR exists which is dedicated to implementing a 'rewind' feature for a sequence repetition penalty option. This could be very beneficial for longer phrases that can't be accurately 'predicted' ahead of time:
#2593

But I don't see any PR that attempts to tackle the issue in a way that doesn't incur performance overheads of some kind from having to regenerate tokens.

[image]

I have visually drafted out this conditional biasing concept in the hope that anyone working on a similar feature might be willing to help with this idea.

In addition, you could theoretically implement this in such a way that, if you are biasing against a continued phrase or sentence, you gradually increase the bias for each consecutive word. For example, let's say you want to prevent this sentence from being reproduced in any way:

"The quick brown fox jumps over the lazy dog."

Individually, these are still perfectly typical tokens; the bias would only be introduced if a repeated sequence in that order is seen, based on the frequency of those words.

"The" by itself shouldn't be impacted for obvious reasons; but a small bias against 'quick' could be introduced if the word preceding it was 'The'. For 'brown', you could bias the probability more aggressively and so on.
For every token that is breaking out of the 'banned sequence', you could ease off the biasing until it returns back to zero.

Doing this by hand would be tedious; maybe an automatic calculation that judges the rarest portions of the 'banned phrases' and weights them proportionally (relative to the rest of the temperature scaling) would be a better move for a 'phrase ban list'?

In addition, the sequence wouldn't necessarily have to be followed exactly in order to trigger the 'ban': more generic sub-phrases like 'jumps over the' could be penalized proportionally less than others, while 'quick brown fox' might get a stronger negative bias, for example.
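Here is a minimal sketch of the gradual sequence bias described above (not an existing llama.cpp feature; the token ids, the matching helper, and the penalty scale are all made up for illustration): the deeper the recent context has already gone into the banned sequence, the harder the next token of that sequence gets pushed down, and breaking out of the sequence naturally resets the depth back toward zero.

```cpp
// Minimal sketch (not the llama.cpp API): scale a penalty by how much of a banned
// token sequence has already been matched by the most recent context tokens.
#include <cstdio>
#include <vector>

// Hypothetical helper: how many tokens at the end of `ctx` match the start of
// `banned`, i.e. how "deep" into the banned sequence the context already is.
static int matched_prefix_len(const std::vector<int> & ctx, const std::vector<int> & banned) {
    int best = 0;
    for (int len = 1; len <= (int) banned.size() && len <= (int) ctx.size(); ++len) {
        bool match = true;
        for (int i = 0; i < len; ++i) {
            if (ctx[ctx.size() - len + i] != banned[i]) { match = false; break; }
        }
        if (match) best = len;
    }
    return best;
}

int main() {
    // Token ids are invented for illustration ("The quick brown fox ...").
    std::vector<int> banned = { 10, 20, 30, 40 };
    std::vector<int> ctx    = { 99, 10, 20 };      // context already ends with "The quick"

    const float max_penalty = 4.0f;                // logit units, arbitrary
    int   depth   = matched_prefix_len(ctx, banned);
    float penalty = max_penalty * (float) depth / (float) banned.size();

    // Only the next token of the banned sequence gets its logit reduced by `penalty`;
    // tokens that break out of the sequence are left alone, so the bias "eases off".
    int next_banned_token = depth < (int) banned.size() ? banned[depth] : -1;
    printf("depth=%d, penalize token %d by %.2f\n", depth, next_banned_token, penalty);
    return 0;
}
```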

@kalomaze kalomaze changed the title Is anyone working on 'conditional' logit biasing? (How to potentially improve repetition penalty) Is anyone working on 'conditional' logit biasing? (How to potentially improve repetition penalty?) Sep 12, 2023
@KerfuffleV2
Collaborator

This seems super, super hard to do. My advice would be to get something you're satisfied with, even if there's a fairly steep performance penalty, and then worry about trying to find an optimized way to accomplish the same thing.

Even with rewinding, I find it's still pretty hard to stop the repeated sequences I want to stop and avoid hitting the ones that are fine.

Also, once the batch generation stuff is in, I think we can actually avoid the performance penalty most of the time. We can just do batch generation, and supposedly even something like 20 variations doesn't have much of a performance penalty. At that point, once there's a repeated sequence, instead of rewinding and banning the token that started it, we can just choose a variation where it didn't get picked.

Even without that, the performance issue with rewinding doesn't really seem that steep. Maybe 20%? That's definitely something I could live with if I could actually get it working the way I want.

@kalomaze
Contributor Author

kalomaze commented Sep 13, 2023

Yeah, I don't blame you for taking the route you did at all, and I might not be able to pull off a worthwhile 'alternative' for the time being just by extrapolating off single starter tokens. I'm considering just doing a different version of --logit_bias where, instead of biasing single token sets globally, it accepts sequences of tokens that it treats as 'full phrases'; that way, the negative biasing only comes up when necessary, and it can work for multi-token words/phrases.

Maybe 'negative bias only gets reinforced if it's part of the next token's pool/when it gets generated' isn't possible in the way I'm expecting, though. If it works, then I'll likely try my hand at elaborating on it a bit with my initial idea here, but it's possible I'm badly overestimating myself... it's pretty intimidating stuff lol

It's nice to hear that batch generation should help with drafting though

@KerfuffleV2
Collaborator

instead of biasing single token sets globally, it accepts sequences of tokens that it treats as 'full phrases'; that way, the negative biasing only comes up when necessary, and can work for multi-token words/phrases

I'm not sure I fully understand what you're describing, but if you penalize the start of the sequence, then that's a sledgehammer. You're stopping everything that could start with that token. If you do something like progressively increase a penalty as more of a sequence is matched (basically how the seqrep stuff can work in non-rewind mode), then you can basically ban a token and leave the model with no good alternatives.

That's kind of what the word boundary aware stuff is trying to improve.

Just as an example, if you were looking to stop "the dog's toy is blue" and you let the LLM generate "the dog'" and basically ban "s" at that point, you're pretty much going to get nonsense. There's nothing it can pick after that point except "s" that makes sense. Or even something like the word in your example, "min istr ations". If you get to "ministr" and ban "ations", there are very few possibilities it can choose that even make a real word, and something like "ministry" probably isn't going to fit the context.

@kalomaze
Contributor Author

kalomaze commented Sep 13, 2023

If you get to "ministr" and ban "ations" there are very few possibilities it can choose that even make a real word

What I was getting at with the 'rarity' statement was that a partial bias based on how common the token parts are could significantly reduce the risk of high-perplexity outputs, compared to biasing all tokens linearly. That's the failure mode you mention, where you force the model to complete half of a word it's no longer allowed to finish. I was thinking of manually calculating / estimating that 'popularity' from a dictionary or some other resource if necessary, and then storing percentage estimates of the most sensitive parts of the token pieces.
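A minimal sketch of that rarity weighting, assuming a precomputed frequency table for token pieces (the table, the values, and the scale factor here are hypothetical, not from any existing PR): common pieces like 'min' would barely be touched, while the more distinctive pieces of the banned word would carry most of the penalty.

```cpp
// Minimal sketch: weight the penalty on a token piece by how "rare" that piece is,
// using a hypothetical frequency table built offline (e.g. from a dictionary or corpus).
#include <cstdio>
#include <map>
#include <string>

int main() {
    // Hypothetical relative frequencies (0..1) for token pieces; illustrative only.
    std::map<std::string, float> piece_freq = {
        { "min",    0.90f },   // very common prefix with many unrelated continuations
        { "istr",   0.15f },
        { "ations", 0.30f },
    };

    const float base_penalty = 3.0f; // logit units, arbitrary
    for (const auto & [piece, freq] : piece_freq) {
        float penalty = base_penalty * (1.0f - freq); // rarer piece -> stronger bias
        printf("piece %-7s penalty %.2f\n", piece.c_str(), penalty);
    }
    return 0;
}
```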

The 'blacklist' vs 'whitelist' style behavior where it analyzes the top probable tokens is also something I still wanna do if it's reasonable. There might be way more overhead than I'm expecting in checking and changing logit probabilities on the fly like that, though. I was planning on referencing the code for the repetition penalty to see how it updates probabilities, but I could be wrongly assuming that biasing logits during generation would be higher performance than your rewind drafting.

Also, I'm not too familiar with seqrep; if there are obvious mistakes I'm making or overlooking here, lmk. Overall I'm still learning and throwing ideas at the wall atm, but I appreciate the feedback.

@kalomaze
Contributor Author

kalomaze commented Sep 14, 2023

Alright, so here's my gameplan for this:

  • Analyze the --logit_bias option further. I currently know that it handles each token you give it separately, and they are globally biased, not conditionally biased.

  • Figure out how the 'view logits' code in text-generation-webui shows the probabilities. If you can somehow access which tokens are most likely at any given time during generation, that hopefully implies you can view the logits before a token is decided and bias them based on context.

[image]

If that's impossible, the alternative is gauging the last X tokens generated and banning based on other context clues, which I imagine would be less reliable, but it'd be better than not being able to bias sequences whatsoever...

  • Implement a new option that is initially just a duplicate of --logit_bias: --multi_bias. This multi-bias will be adapted to accept multiple tokens as a sequence to penalize sequentially. If the first token is seen, bias the next token to NOT follow the sequence it is biased against (e.g., "ministrations" would penalize 'istr' once 'min' is picked).

By itself, this won't be very valuable. When banning "ministrations", words like 'minimalist' will show up where they shouldn't be showing up (for example, "The counselor's min" shouldn't continue as "minimalist" most of the time).

But then I will replace it with a renamed option: --multicontext_bias. The input for this will be a multi-token phrase/word, plus the single tokens it should check for in the token pool at the time as 'evidence' for how much the token should be biased against; how many are needed for it to count as 'banned' is something I will experiment with and decide on later.

So, for example:

  • --multicontext_bias min: care, help, aid < pretend I used the actual token numbers here

Maybe it could be set up so that it biases proportionally to how many synonyms are seen (e.g., 0% negative bias without care, help, or aid, but 25% bias if it finds just 'care', 75% bias if it finds 'help' as well, and so on...).
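A minimal sketch of how that proportional bias could work (the --multicontext_bias flag does not exist yet, and the token ids and bias scale below are placeholders): count how many of the 'evidence' tokens appear in the current candidate pool and scale the negative bias on the target token accordingly.

```cpp
// Minimal sketch of the proposed --multicontext_bias behaviour: the bias against a
// target token scales with how many "evidence" tokens show up among the top candidates.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    int target = 100;                                           // e.g. the token for "min"
    std::vector<int> evidence = { 200, 300, 400 };              // e.g. "care", "help", "aid"
    std::vector<int> top_candidates = { 100, 300, 55, 200, 7 }; // current token pool

    int hits = 0;
    for (int tok : evidence) {
        if (std::find(top_candidates.begin(), top_candidates.end(), tok) != top_candidates.end()) {
            ++hits;
        }
    }

    const float max_bias = 5.0f;                                // logit units, arbitrary
    float bias = max_bias * (float) hits / (float) evidence.size();
    printf("%d/%zu evidence tokens present -> subtract %.2f from the logit of token %d\n",
           hits, evidence.size(), bias, target);
    return 0;
}
```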

[image]

@KerfuffleV2
Collaborator

If you can somehow access what tokens are most likely at any given time during generation

Yes, that's basically how it works. You evaluate the model for a step and you get the logits back. "Logits" are basically a big array of floating-point numbers, one for each token id the model understands, where a higher value means more likely. One thing to note, though, is that while the values mean something in a relative sense, they're kind of arbitrary and may vary between models. That's what stuff like softmax is for - it scales them to a more predictable range.

For example, a sampler like top-k works by sorting that list and keeping the top K items.

Various samplers like biases, top-k, etc. run after evaluation, and then, using some method, you pick an actual token. The next time you evaluate a step of the model, you feed the token id you chose into it and get the next set of logits. This process repeats.

So the 'manipulating which token gets chosen' part really isn't hard. The big problem is making a decision about what to do when you only have the tokens that have been generated so far. Also, regardless of how you scale logits or whatever - the end result for a particular logit is binary: it gets picked or it doesn't.
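For illustration, here is a stripped-down sketch of that loop (the model call is a stand-in, not the actual llama.cpp API): evaluate a step, get one logit per token id, run samplers such as top-k over them, pick a token, and feed it back in for the next step.

```cpp
// Minimal sketch of the evaluate -> logits -> samplers -> pick loop described above.
#include <algorithm>
#include <cstdio>
#include <vector>

// Placeholder for "evaluate the model one step and get logits back".
static std::vector<float> eval_step(int /*last_token*/) {
    return { 0.1f, 2.3f, -1.0f, 4.2f, 0.7f, 3.1f }; // one logit per token id
}

int main() {
    int last_token = 0;
    for (int step = 0; step < 3; ++step) {
        std::vector<float> logits = eval_step(last_token);

        // Samplers (biases, top-k, ...) run here, after evaluation. Top-k keeps
        // only the K highest logits; everything else is effectively discarded.
        const int k = 2;
        std::vector<int> ids(logits.size());
        for (size_t i = 0; i < ids.size(); ++i) ids[i] = (int) i;
        std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                          [&](int a, int b) { return logits[a] > logits[b]; });

        // Pick a token from the surviving candidates (greedy here for simplicity)
        // and feed it back in on the next step.
        last_token = ids[0];
        printf("step %d picked token %d (logit %.2f)\n", step, last_token, logits[last_token]);
    }
    return 0;
}
```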

@kalomaze
Contributor Author

kalomaze commented Sep 14, 2023

Well, I know you can see the evaluation after generation; my concern was whether or not you could evaluate during generation and change it before those probabilities are picked from. I guess that is true and it's the sampler's job (which is how Mirostat's top_k can be changed on the fly), so that's not a concern.

And yes, the end result is binary for each token. The proportional bias I mention at the end would just be deciding how much to bias against the token (globally?) based on what other tokens are seen in the logit list. The percentage would be a grade of how likely it is that this token is being used for a longer sequence that the user doesn't want; because of that uncertainty, it should scale based on how likely it is that 'min' will become 'ministrations' instead of 'minimalist'.

So what you could do is have a 'bias generation script' of some kind, which would compare the logit list for 'min' in the context of "The nurse's..." against the context of "His room wasn't complex and was..." and figure out which tokens are exceedingly probable for 'ministrations' generations compared to 'minimalist'. That way you can bias less universally and only contextually.
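A minimal sketch of what such a 'bias generation script' might do, with completely made-up logit values standing in for real model runs: keep the token ids whose logits are much higher in the unwanted context than in the harmless one, and use those as the contextual 'evidence' list.

```cpp
// Minimal sketch of the "bias generation script" idea: compare the logits for the
// same vocabulary in two contexts and keep the token ids that are much more likely
// in the unwanted context than in the harmless one.
#include <cstdio>
#include <vector>

int main() {
    // Logits indexed by token id, one vector per context; values are invented.
    std::vector<float> logits_unwanted = { 0.2f, 3.9f, 1.0f, 4.5f }; // "The nurse's ..."
    std::vector<float> logits_harmless = { 0.3f, 0.8f, 1.1f, 4.4f }; // "His room wasn't complex and was ..."

    const float margin = 2.0f; // how much more likely a token must be to count as "evidence"
    for (size_t id = 0; id < logits_unwanted.size(); ++id) {
        if (logits_unwanted[id] - logits_harmless[id] > margin) {
            printf("token %zu looks specific to the unwanted continuation -> add to evidence list\n", id);
        }
    }
    return 0;
}
```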

Are you understanding what I'm getting at now? If the custom sampling isn't that hard to implement, it could be a very cost effective way to bias against longer words based on context clues without regeneration.

@KerfuffleV2
Collaborator

my concern was whether or not you could evaluate during generation

Not really (not to say it's 100% impossible but there's nothing like that currently and it's generally a hard problem).

Which I guess that's true and it's the sampler's job

Yes, basically. You influence the logits you get when you evaluate the model with the token you picked on the previous step (or the prompt if you haven't had the model evaluate any of its own tokens yet).

and figure out which tokens are exceedingly probable for 'ministrations' generations compared to 'minimalist'.

It's going to be model-specific and language-specific, and very likely context-specific as well. It seems like it's going to require so much effort constructing the rules that you might as well just write the thing yourself at that point. There are just so many words and permutations that I don't see how one could cover enough of them.

it could be a very cost effective way to bias against longer words based on context clues without regeneration.

I actually don't really understand why someone would want to do that. My own efforts are aimed at trying to encourage less repetitive and more creative/diverse output. What's the use case for biasing against longer words?

@kalomaze
Contributor Author

kalomaze commented Sep 14, 2023

[image]

It seems like it's going to require so much effort constructing the rules that you might as well just write the thing yourself at that point. There are just so many words and permutations I just don't see how one could cover enough of them.

See the image attached for an explanation. It's not something you need ML for, and yes, it would be complex for a person to do by hand; so why not make a script that builds the best whitelist/blacklist for you, and then weights it proportionally?

I plan to make a script to do that, and then I will begin attempting the implementation on my own; if you're not interested, that's fine, as you already have enough on your plate. But if you think I shouldn't attempt this idea because something else would prevent it from working, do tell; otherwise, it seems like this should be doable if I focus on it independently.

I actually don't really understand why someone would want to do that. My own efforts are aimed at trying to encourage less repetitive and more creative/diverse output. What's the use case for biasing against longer words?

It's not about longer words. It's about being able to bias any word or short sequence (or bias positively, which I might explore later), in a way that is contextually aware. The potential use cases for that are very powerful for controlling how llama behaves outside of 'prompt engineering'. So not exclusively a 'better' repetition penalty.

Also, you can already guide sampling to fit arbitrary grammar rules for things like JSON output. I wonder if I should look into that PR and what it's doing for a better understanding of how I could improve model guidance...

@KerfuffleV2
Collaborator

so why not make a script that will make the best whitelist/blacklist for you? And then weigh proportionally?

Try it. :) I feel like this is going to be very difficult but certainly it would be great if you can prove me wrong.

But if you think I shouldn't attempt this idea

No, no, I would never say anything like that. Even if you don't succeed, you'll probably still learn a lot by trying.

It's not about longer words. It's about being able to bias any word or short sequence (or bias positively, which I might explore later), in a way that is contextually aware.

If you can find a practical way to do it, I agree, that's very useful. I'd really suggest not worrying too much about the runtime performance. Like, if you could find a way to do that even with rewinding, where you will have much, much more information available, it would still be great.

Just for example, the Classifier-Free Guidance stuff halves performance and people still think it's worthwhile. So you could rewind to the extent that you halve performance and it could still be useful. In reality, it's not likely you'd need to rewind anywhere close to that much.

I wonder if I should look into that PR and what it's doing for a better understanding of how I could improve model guidance...

If I remember correctly, it basically works by banning all tokens that don't conform to the grammar. It doesn't really guide the model so much as prevent generation of anything that won't fit.
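For reference, the effect described above can be sketched as a simple logit mask (the grammar check below is a placeholder predicate, not the real llama.cpp grammar engine): any token that would violate the grammar has its logit forced to negative infinity, so only conforming tokens can ever be sampled.

```cpp
// Minimal sketch of grammar-constrained sampling as a logit mask.
#include <cstdio>
#include <limits>
#include <vector>

// Placeholder: pretend only even token ids are allowed by the grammar right now.
static bool grammar_allows(int token_id) { return token_id % 2 == 0; }

int main() {
    std::vector<float> logits = { 1.0f, 3.5f, 0.2f, 2.8f, 1.9f };

    for (size_t id = 0; id < logits.size(); ++id) {
        if (!grammar_allows((int) id)) {
            logits[id] = -std::numeric_limits<float>::infinity(); // hard ban
        }
    }
    for (size_t id = 0; id < logits.size(); ++id) {
        printf("token %zu logit %.2f\n", id, logits[id]);
    }
    return 0;
}
```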

@DutchEllie

I'd like to comment on this myself if that's no problem. First, I'd like to know if this is going well, or how much work has been done on this issue.

Secondly, I personally dislike the idea of having a static list of synonyms for alternative words. Like Kerfuffle mentioned, this will ultimately make the list language-specific, so it would need to be done for all languages. Additionally, it does not cover cases where you're not biasing against whole words or phrases. If I wanted to bias against let's, then I doubt there would be a synonym for that in the dictionary.

Instead, I would propose using the earlier-mentioned idea of batched generation. You would introduce new settings/arguments, something like --logit-bias-token-length and --logit-bias-batches, or something far better named.
The former would be a setting denoting the number of historical tokens we want to keep per batch. The latter is a setting denoting how many batches we want to use for this logit analysis (I'm mega new, so idk if this already exists).

My method (and I have absolutely NO IDEA whether this is performant, or memory intensive, or even practical; just throwing it out there) would also consider a certain number of previous tokens when evaluating the logit bias. This would eliminate the awkward moments where the LLM generates text leading up to its use of a certain word, only to then be told "Uh well, pick a different one." when there is no suitable alternative left.

Consider the text generation examples from earlier, in particular I didn't intend for that! It was unintentional, with a bias against unintentional. The preceding prompt here does not matter, but what is particularly important is that this entire phrase is generated by the LLM! It was not prompted to generate unintentional from the words I didn't intend for that! It was ..., but from something preceding this.

This phrase including unintentional is 13 tokens long (according to llama-tokenizer-js, I think). We set our token window at 10 tokens and run batched generation with --logit-bias-token-length 10 --logit-bias-batches 20. The program starts running, and at all times it keeps the previous 10 iterations of 20 batches in memory for later.

We finish running our 13th generation and still have 10 iterations of 20 batches in memory; what now? Well, because we only kept 10 iterations, the tokens I didn are already generated and cannot be changed. You can see that it just generated a bunch of different texts, so let's look at 2 possible scenarios (out of 20, remember? Also, I totally made these up in my head, but they are the same token length):

I didn't intend for that! It was unintentional
I didn't mean it like that! I shouldn't have

With the previously described method of logit biasing (as far as I understand how it would make the most sense), if the first option was ever generated, any alternative to unintentional would just kinda suck. Accidental, random... not great choices, but the logit bias dictates that we now pick anything but unintentional, so we must settle for an inferior option. The problem here is that the lead-up sets the text up for the word we don't want.

However, with this historical knowledge, we can instead pick a different leading phrase and avoid this awkward situation altogether. We might go for the other phrase, which avoids any lead-up to the word unintentional in the first place.
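A minimal sketch of that selection step, with plain strings standing in for the kept token histories and the batch count reduced to two: any batch whose recent window runs into the unwanted word is discarded, and a batch that avoided the lead-up is kept instead.

```cpp
// Minimal sketch of the batched idea above: keep several candidate continuations and,
// when one of them runs into the unwanted phrase, drop it in favour of a batch that
// avoided the lead-up entirely.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::string banned = "unintentional";
    std::vector<std::string> batches = {
        "I didn't intend for that! It was unintentional",
        "I didn't mean it like that! I shouldn't have",
    };

    for (const auto & b : batches) {
        bool hits_banned = b.find(banned) != std::string::npos;
        printf("%-50s -> %s\n", b.c_str(), hits_banned ? "discard" : "keep");
    }
    return 0;
}
```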

How the score itself is generated, or how the bias mathematically would work, I don't know.

@github-actions github-actions bot added the stale label Mar 20, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024