
Min P sampler implementation [alternative to Top P/Top K] #3841

Merged
25 commits merged on Oct 31, 2023

Conversation

@kalomaze (Contributor) commented Oct 28, 2023

The way that this sampler works is:

  • Every candidate token has a probability attached to it, and that probability is what we measure for consideration.
  • The base min_p value represents the starting required probability. (For example, 0.05 = only include tokens that are at least 5% probable.)
  • This threshold is then scaled by the probability of the top token in the list. So if the top token is at 90%, that 5% is multiplied by 0.9, giving 4.5%.
  • In other words, with the top token at 90% probability and base_min_p set to 0.05, only tokens that are at least 4.5% probable remain in the pool before temperature is applied.
  • This method seems more effective at selecting the reasonable tokens than either Top P or Top K (a minimal sketch of the filtering step is shown below).
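A minimal sketch of that filtering step in standalone C++ (illustrative names only, not the actual patch):

```cpp
#include <algorithm>
#include <vector>

// Keep only the tokens whose probability is at least base_min_p times the
// probability of the most likely token. `probs` is assumed to be non-empty
// and already softmax-normalized; the returned vector holds surviving indices.
std::vector<size_t> min_p_filter(const std::vector<float> & probs, float base_min_p) {
    const float max_prob  = *std::max_element(probs.begin(), probs.end());
    const float threshold = base_min_p * max_prob; // e.g. 0.05 * 0.90 = 0.045

    std::vector<size_t> kept;
    for (size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] >= threshold) {
            kept.push_back(i);
        }
    }
    return kept; // temperature and the final sampling happen after this cut
}
```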


Top P has a design flaw: numerous tail-end tokens can be included whenever the top tokens' scores aren't concentrated enough to reach the specified Top P value, while TFS and other novel sampler approaches aren't as easily interpretable or consistent as Top P. The primary purpose of the Min P sampler is to address both of these shortcomings.
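To make the flaw concrete, consider a toy distribution (numbers chosen purely for illustration): probabilities of 0.30, 0.20, 0.15, 0.10, 0.05, followed by twenty tail tokens at 0.01 each. Top P = 0.95 has to keep accumulating mass past the first five tokens (which only sum to 0.80), so it pulls in fifteen of the 0.01 tail tokens; Min P with a base of 0.05 sets the cutoff at 0.05 × 0.30 = 0.015 and keeps just the first five.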

The current implementation is very rough around the edges code-wise, as I am not very experienced with C++, but I hope to polish it properly so it can be considered for merging. I have gotten improved results personally and positive feedback from other users, especially with regard to increased coherent creativity.

Mathematically, it is not as complex as TFS or other tail-search algorithms, but importantly, it is easy to understand in terms of how it impacts the probabilities. It is essentially a streamlined, linear version of Top A in design. However, it consistently outperforms Top P and Top K at removing tail-end tokens.
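For reference on the Top A comparison (using Top A's commonly cited definition, so treat the exact formula as an assumption): Top A keeps a token when p >= a * p_max^2, so its cutoff scales with the square of the top probability, whereas Min P uses the linear rule p >= base_min_p * p_max, which is what makes its effect on the distribution easier to reason about.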

@kalomaze (Contributor, Author) commented Oct 28, 2023

The current implementation:

  • Checks whether Top P is set to 0.02 and, if it has exactly that value, triggers an override that uses Min P sampling instead
  • Creates a .txt file, SamplerBaseMinP.txt, containing the base_min_p value if it doesn't already exist
  • Loads the value from that .txt file and performs the Min P calculation

This is of course suboptimal in a lot of ways, but when drafting sampler ideas I wanted to avoid touching the existing sampler stack order until I had found a solution. What would be the best way to integrate this if the objective is to avoid Top P's and Top K's flaws via an improved single sampler that isn't intended to be used in tandem with them? (Maybe they should be disabled when this is enabled, the way Mirostat disables other samplers?)
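One possible shape for a cleaner integration (a sketch only, mirroring the style of the existing top-p entry point rather than prescribing the final code) would be a dedicated sampler function plus a normal parameter, instead of the Top P override and the .txt file:

```cpp
// Hypothetical declaration in the style of llama.h's existing samplers:
// keep tokens whose probability is >= p times the probability of the most
// likely token; min_keep guards against filtering everything away.
void llama_sample_min_p(
        struct llama_context   * ctx,
        llama_token_data_array * candidates,
        float                    p,
        size_t                   min_keep);
```

A value of 0.0 could then disable the sampler entirely, so it slots into the existing stack alongside top-k/top-p without hijacking top_p == 0.02.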

@cebtenzzre marked this pull request as draft on October 28, 2023 at 22:58
@kalomaze (Contributor, Author) commented Oct 29, 2023

[image] A comparison between Top P and Min P when faced with absurdly high temperature scaling (no prompt formatting or anything, so not ideal model conditions; just a quick test).

@ivanstepanovftw (Collaborator) commented:
> the order of the samplers became a command-line parameter

I like this. It would be easy to implement -a 1 -b 2 --order A,B, but it would be better to have just -a 1 -b 2, with the order of the flags themselves determining the sampler order, so that an already-applied sampler can be applied a second time, i.e. -a 1 -b 2 -a 3.
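A rough sketch of what order-sensitive sampler flags could look like (a hypothetical parser, not the existing common.cpp argument handling; the flag names are just examples):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

struct sampler_step {
    std::string name;   // e.g. "top-k", "min-p", "temp"
    float       value;  // the value passed with the flag
};

// The position of each flag on the command line decides the order in which
// the samplers run, and repeating a flag re-applies that sampler.
static std::vector<sampler_step> parse_sampler_chain(int argc, char ** argv) {
    std::vector<sampler_step> chain;
    for (int i = 1; i + 1 < argc; i++) {
        const std::string arg = argv[i];
        if (arg == "--top-k" || arg == "--min-p" || arg == "--temp") {
            chain.push_back({ arg.substr(2), std::strtof(argv[i + 1], nullptr) });
            i++;
        }
    }
    return chain; // "--min-p 0.05 --temp 1.2 --min-p 0.1" yields three steps
}
```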

Regarding defaults: I believe that having all samplers and penalties disabled, with temperature == 1, is the best way to avoid token and context repetition and to preserve statistically human-like prediction, because that shows the model's abilities.

@cebtenzzre (Collaborator) commented Nov 4, 2023

> I like this. It would be easy to implement -a 1 -b 2 --order A,B, but it would be better to have just -a 1 -b 2, with the order of the flags themselves determining the sampler order, so that an already-applied sampler can be applied a second time, i.e. -a 1 -b 2 -a 3.

If we did something like that, the defaults would have to be cleared as soon as you override any of them, which might be confusing if a user just wants to override e.g. top-p and leave the other parameters alone.

I've tried using models with minimal samplers (e.g. just min-p, or min-p and top-p=.9x), but I had to fall back on my favorite default preset of Midnight Enigma (temp=1 top-p=0.37 rep-pen=1.18 top-k=100) after the models I tried seemed to have a hard time staying on topic and remaining coherent.

@kalomaze
Copy link
Contributor Author

kalomaze commented Nov 5, 2023

>> I like this. It would be easy to implement -a 1 -b 2 --order A,B, but it would be better to have just -a 1 -b 2, with the order of the flags themselves determining the sampler order, so that an already-applied sampler can be applied a second time, i.e. -a 1 -b 2 -a 3.

> If we did something like that, the defaults would have to be cleared as soon as you override any of them, which might be confusing if a user just wants to override e.g. top-p and leave the other parameters alone.

> I've tried using models with minimal samplers (e.g. just min-p, or min-p and top-p=.9x), but I had to fall back on my favorite default preset of Midnight Enigma (temp=1 top-p=0.37 rep-pen=1.18 top-k=100) after the models I tried seemed to have a hard time staying on topic and remaining coherent.

Top P 0.37 seems aggressively deterministic to me; you're picking from two or three choices at most 99% of the time. Your repetition penalty is pretty high as well, which probably helps counteract the determinism, but I try to avoid rep pen because it's a bit of a hacky solution to the problem of overly high determinism compared to just turning up Temp.

I would try lowering Rep Pen a bit, turning Top P off (to 1.0), and using a Min P that is on the deterministic side (e.g. 0.25 Min P) to get similar effects to what you want.
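For concreteness, the two setups would look roughly like this as main invocations (the temp/top-p/top-k/repeat-penalty/min-p flag spellings are the usual ones, but the 1.1 rep pen is just a placeholder for "a bit lower"):

```
# current preset ("Midnight Enigma")
./main -m model.gguf --temp 1.0 --top-p 0.37 --top-k 100 --repeat-penalty 1.18

# suggested alternative: Top P off, lower rep pen, Min P doing the truncation
./main -m model.gguf --temp 1.0 --top-p 1.0 --top-k 100 --min-p 0.25 --repeat-penalty 1.1
```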

@cebtenzzre (Collaborator) commented:
In my experience, the TGWUI preset called "Midnight Enigma" (possibly with top-p increased to .57) is very good for Alpaca-style instruction-based prompting with certain models. Free-form prompting with models like Chronoboros does seem to benefit from using e.g. min-p=.25 instead.

jhen0409 pushed a commit that referenced this pull request Nov 9, 2023
* Update server.cpp with min_p after it was introduced in #3841

* Use spaces instead of tabs

* Update index.html.hpp after running deps.sh

* Fix test - fix line ending
ddh0 added a commit to ddh0/llama-cpp-python that referenced this pull request Nov 16, 2023
Closes abetlen#911

Implement min_p sampling as described in ggerganov/llama.cpp#3841

Most of the actual work was already done; I just added the parameters to Llama.sample, Llama.generate, Llama.create_completion, and Llama.create_chat_completion. Tested and working as expected, as far as I can tell.
tk-master added a commit to tk-master/llama-cpp-python that referenced this pull request Nov 16, 2023
My small contribution to this great project.

Ref: ggerganov/llama.cpp#3841

Closes: abetlen#911
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 17, 2023
abetlen pushed a commit to abetlen/llama-cpp-python that referenced this pull request Nov 21, 2023
* Added support for min_p

My small contribution to this great project.

Ref: ggerganov/llama.cpp#3841

Closes: #911

* Fix for negative temp (sample_softmax)
@kalomaze mentioned this pull request Nov 21, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
…gerganov#3841)

* Introduce the new Min-P sampler by @kalomaze
   The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.

* Min-P enabled and set to 0.05 default

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
* Update server.cpp with min_p after it was introduced in ggerganov#3841

* Use spaces instead of tabs

* Update index.html.hpp after running deps.sh

* Fix test - fix line ending
@ivanstepanovftw (Collaborator) commented:
> rep-pen=1.18

Some languages do not have full-word tokens, so you will end up penalizing subwords or individual characters.

@ZoomRmc commented Apr 24, 2024

Having experimented with using strictly the min-p sampler (all others turned off) for creative but structured writing with llama3-7b, I see very little sense in using the current default order (temperature last).

  • --samplers min_p;temperature (the current default if top-k and top-p are turned off)
    In this configuration there seems to be very little direct influence on creativity from varying the temp value. The output follows the prompted structure correctly and initially stays rather cohesive well into very high temp values (t: 4.0, min-p: 0.01), until some extremely unlikely garbage tokens slip through, which immediately destabilizes the output.
    In other words, I couldn't achieve any gradual control over the creativity of the output. Moreover, tweaking min-p alone doesn't change the creativity much either; at some point the output abruptly turns to gibberish.
  • --samplers temperature;min_p
    This seems far more controllable. Gradually increasing the temperature up to extreme values works as you'd expect: the output follows the bland > unremarkable > creative > typo-ridden > broken grammar > unstable curve in a more or less predictable way.
    On the other hand, this sampler order makes the settings more interdependent, and it is easier to subtly influence the results in an unintended way.

My impression after brief testing: the current order defaults certainly provide more deterministic output in most cases, including with uninformed tweaking of the sampler settings, but they probably limit the user's control.

Relevant: #4091
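For reference, the two orders compared above can be selected like this (a sketch; the --samplers flag takes a ';'-separated list, and the other values are just the ones from the test above):

```
# current default order: min-p filters first, temperature is applied last
./main -m model.gguf --samplers "min_p;temperature" --top-k 0 --top-p 1.0 --min-p 0.01 --temp 4.0

# temperature first, then min-p trims whatever the reshaped distribution allows through
./main -m model.gguf --samplers "temperature;min_p" --top-k 0 --top-p 1.0 --min-p 0.01 --temp 4.0
```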

@ivanstepanovftw (Collaborator) commented:
NMS (non-maximum suppression) from the image object detection world also uses a probability threshold, similarly to Min P.
