ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 #1508

Merged: 5 commits merged into master from the qnt-f16 branch on May 19, 2023

Conversation

@ggerganov (Owner) commented May 17, 2023

No need to keep the scaling factor in F32 - use F16 instead to get smaller models and improve inference speed

This change makes the model files smaller, speeds up inference, and allows more GPU layers to fit in VRAM

  • check AVX
  • implement CUDA
  • recalc PPL

New stats:

| Model | Measure       | F16    | Q4_0   | Q4_1   | Q5_0   | Q5_1   | Q8_0   |
|-------|---------------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity    | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
| 7B    | file size     | 13.0G  | 3.5G   | 3.9G   | 4.3G   | 4.7G   | 6.7G   |
| 7B    | ms/tok @ 4th  | 127    | 55     | 54     | 76     | 83     | 72     |
| 7B    | ms/tok @ 8th  | 122    | 43     | 45     | 52     | 56     | 67     |
| 7B    | bits/weight   | 16.0   | 4.5    | 5.0    | 5.5    | 6.0    | 8.5    |
| 13B   | perplexity    | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
| 13B   | file size     | 25.0G  | 6.8G   | 7.6G   | 8.3G   | 9.1G   | 13G    |
| 13B   | ms/tok @ 4th  | -      | 103    | 105    | 148    | 160    | 131    |
| 13B   | ms/tok @ 8th  | -      | 73     | 82     | 98     | 105    | 128    |
| 13B   | bits/weight   | 16.0   | 4.5    | 5.0    | 5.5    | 6.0    | 8.5    |

Old stats:

| Model | Measure       | F16    | Q4_0   | Q4_1   | Q5_0   | Q5_1   | Q8_0   |
|-------|---------------|--------|--------|--------|--------|--------|--------|
| 7B    | perplexity    | 5.9066 | 6.1565 | 6.0910 | 5.9862 | 5.9481 | 5.9069 |
| 7B    | file size     | 13.0G  | 4.0G   | 4.8G   | 4.4G   | 4.8G   | 7.1G   |
| 7B    | ms/tok @ 4th  | 128    | 50     | 54     | 75     | 83     | 75     |
| 7B    | ms/tok @ 8th  | 123    | 44     | 52     | 53     | 58     | 72     |
| 7B    | bits/weight   | 16.0   | 5.0    | 6.0    | 5.5    | 6.0    | 9.0    |
| 13B   | perplexity    | 5.2543 | 5.3860 | 5.3607 | 5.2856 | 5.2706 | 5.2548 |
| 13B   | file size     | 25.0G  | 7.6G   | 9.1G   | 8.4G   | 9.1G   | 14G    |
| 13B   | ms/tok @ 4th  | -      | 93     | 101    | 150    | 164    | 141    |
| 13B   | ms/tok @ 8th  | -      | 81     | 96     | 96     | 104    | 136    |
| 13B   | bits/weight   | 16.0   | 5.0    | 6.0    | 5.5    | 6.0    | 9.0    |
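
The bits/weight figures follow directly from the block layouts, assuming the usual block size of 32 weights per quantized block (a back-of-the-envelope check, not taken from the code):

  Q4_0 old: 32 × 4 bits + 32-bit F32 delta        = 160 bits per block = 5.0 bits/weight
  Q4_0 new: 32 × 4 bits + 16-bit F16 delta        = 144 bits per block = 4.5 bits/weight
  Q4_1 new: 32 × 4 bits + 2 × 16-bit (delta, min) = 160 bits per block = 5.0 bits/weight
  Q8_0 new: 32 × 8 bits + 16-bit F16 delta        = 272 bits per block = 8.5 bits/weight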

🤖 Generated by Copilot at d627025

Summary


Use 16-bit floats for quantization deltas in ggml.c to save memory and compress better.

We're sailing on the sea of bits, we need to save some space
We'll change the delta field to ggml_fp16_t
Heave away, me hearties, heave away with grace
We'll convert and swap and compress at a faster rate

Walkthrough

  • Replace the float type with the ggml_fp16_t type for the delta field of the block_q4_0, block_q4_1, block_q8_0, and block_q8_1 structs in ggml.c to reduce the memory footprint and improve the compression ratio of the quantized blocks (see the sketch below)
  • Add calls to the GGML_FP32_TO_FP16 macro to convert the delta value from a 32-bit float to a 16-bit half-float before storing it in the y[i].d field of the block_q4_0, block_q4_1, block_q8_0, and block_q8_1 structs in ggml.c, across the various quantization schemes and SIMD implementations
  • Add calls to the GGML_FP16_TO_FP32 macro to convert the delta, min, and sum values from 16-bit half-floats to 32-bit floats before using them in the dequantization and dot-product computations in ggml.c, across the various quantization schemes and SIMD implementations
  • Swap the order of the multiplication operands in the expressions d * sum and sum * d in ggml.c to avoid a potential loss of precision when multiplying a 32-bit float by a 16-bit half-float
  • Replace the calls to _mm256_broadcast_ss and _mm_broadcast_ss with calls to _mm256_set1_ps and _mm_set1_ps in ggml.c to avoid loading the values from memory and instead broadcast from a value already in a register, which may improve performance and reduce memory accesses
  • Add curly braces around a for-loop body in ggml.c to improve the readability and consistency of the code style
  • Remove an extra space character between the m4b and s8b variable declarations in ggml.c to fix a minor formatting issue and align the code style with the rest of the file
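
For readers not familiar with ggml.c, here is a minimal sketch of what the change amounts to for Q4_0 (the field names follow the walkthrough above; the block size of 32 and the uint16_t storage type for ggml_fp16_t are assumptions for illustration, not copied from the code):

#include <stdint.h>

typedef uint16_t ggml_fp16_t;    // F16 stored as raw bits (scalar builds)

#define QK4_0 32                 // weights per Q4_0 block (assumed)

// Before: the per-block scale was a 32-bit float
typedef struct {
    float   d;                   // delta (scale)
    uint8_t qs[QK4_0 / 2];       // 32 x 4-bit quants, two per byte
} block_q4_0_old;                // 20 bytes per 32 weights -> 5.0 bits/weight

// After this PR: the scale is stored as F16 and converted with
// GGML_FP32_TO_FP16 on quantization and GGML_FP16_TO_FP32 when
// dequantizing or computing dot products
typedef struct {
    ggml_fp16_t d;               // delta (scale)
    uint8_t     qs[QK4_0 / 2];
} block_q4_0_new;                // 18 bytes per 32 weights -> 4.5 bits/weight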

@ggerganov added the labels performance (Speed related topics) and breaking change (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) on May 17, 2023
@LostRuins (Collaborator) commented May 18, 2023

Hi again, just wondering if backward compatibility is possible (e.g. transparently handle f32->f16 scaling factors when loading older quantizations during model load)?

If not, and it breaks all old models, can I request an increase in LLAMA_FILE_VERSION so this change can be detected and handled if it ends up being merged?

Edit: also looked at the chart, seems like q4_0 actually gets a regression in speed with this change?

@Green-Sky (Collaborator) commented:

Do I read ms/tok @ 4th correctly as "Milliseconds per token for the 4th token" or "Milliseconds per token after the 4th token" ?

@j-f1 (Collaborator) commented May 18, 2023

I believe that means "4 threads" @Green-Sky

@ivanstepanovftw (Collaborator) commented:

Maybe we should use , 4T instead of @ 4th?

@ilyakurdyukov (Contributor) commented:

Although this saves some space, it will have a negative impact on performance. If the architecture doesn't support F16, a lookup table is required, which results in memory accesses that look random to the CPU and waste cache space. And even where F16 is supported, as on x86 with its F16 conversion instructions, there will be extra instructions in the loop.
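
To make the two costs concrete, here is a rough, self-contained sketch of the conversion paths being described (the names and the table are made up for illustration; ggml's own macros and table differ in detail):

#if defined(__F16C__)
#include <immintrin.h>
#endif
#include <stdint.h>
#include <string.h>

// Portable F16 -> F32 conversion used to fill the table (not optimized).
static float f16_to_f32_soft(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t em   = h & 0x7FFF;               // exponent + mantissa bits
    uint32_t bits;
    float    f;
    if (em >= 0x7C00) {                             // inf / NaN
        bits = sign | 0x7F800000u | ((em & 0x03FFu) << 13);
    } else if (em >= 0x0400) {                      // normal: rebias exponent 15 -> 127
        bits = sign | ((em + (112u << 10)) << 13);
    } else {                                        // zero / subnormal
        f = (float)em * 0x1.0p-24f;                 // exact: mantissa * 2^-24
        memcpy(&bits, &f, sizeof bits);
        bits |= sign;
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Without native F16 support: a 256 KiB table, i.e. one data-dependent load
// per block in the hot loop (the "random" accesses mentioned above).
static float f16_table[1 << 16];

static void init_f16_table(void) {
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        f16_table[i] = f16_to_f32_soft((uint16_t) i);
    }
}

static inline float f16_to_f32(uint16_t h) {
#if defined(__F16C__)
    return _cvtsh_ss(h);   // x86 F16C: still one extra convert instruction per use
#else
    return f16_table[h];   // table lookup on targets without F16 support
#endif
}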

@ilyakurdyukov (Contributor) commented:

My thoughts on model format changes: it's better to keep a compact, portable model format for distribution, and provide a script that converts it to a format better suited to performance on a particular architecture. It should also be possible to load and convert it in memory. For example, for the performance of vector code it is better to fuse four Q4_0 blocks:

#include <stdint.h>
#define QK4_0 32                     /* ggml's Q4_0 block size */
typedef struct {
    float   d[4];                    /* deltas of four fused blocks */
    uint8_t qs[4][QK4_0 / 2];        /* their packed 4-bit quants */
} block4_q4_0;

Or even eight at once. That way 4-8 d values can be read with a single vector load instruction, and each field will be perfectly aligned for use with aligned vector loads. I think it's impossible to create one ideal model format that gives the best performance on every architecture.
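
As a sketch of the kind of gain being described, assuming the fused block4_q4_0 layout above and an x86 target (illustration only, not ggml code):

#include <immintrin.h>

// With four blocks fused, their four F32 deltas sit contiguously, so one
// vector load replaces four scalar loads/broadcasts:
static inline __m128 load_deltas(const block4_q4_0 * b) {
    return _mm_loadu_ps(b->d);   // d[0..3] in a single instruction
}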

@KerfuffleV2 (Collaborator) commented:

It kind of seems like the level of sacrifice required to keep models mmap-able is getting bigger and bigger. Maybe that should be the special case: a tool that converts the model into something that can be mmap-ed directly, for people who really care about making loading a little bit faster.

Without the limitation of models having to be mmapable, doing small conversions online during the load process would be easy. Personally, I'm perfectly happy to trade slightly slower model loading for lower memory usage/faster inference and more options.

@ggerganov (Owner, Author) commented May 19, 2023

> Hi again, just wondering if backward compatibility is possible (e.g. transparently handle f32->f16 scaling factors when loading older quantizations during model load)?
>
> If not, and it breaks all old models, can I request an increase in LLAMA_FILE_VERSION so this change can be detected and handled if it ends up being merged?
>
> Edit: also looked at the chart, seems like q4_0 actually gets a regression in speed with this change?

This change breaks Q4_0, Q4_1 and Q8_0 models.

LLAMA_FILE_VERSION has been bumped to 3, together with GGML_QNT_VERSION to 2.

The regression in Q4_0 shows up only when using 4 threads. My explanation is that at 4 threads we are not yet memory bound, so the F16 -> F32 conversion has a small negative impact on overall performance due to the extra compute. With more threads the computation becomes memory bound, so we win by reducing the amount of memory used by the weights.

Regarding the performance concerns raised by @ilyakurdyukov and @KerfuffleV2:
Since the computation on the CPU is memory-bound, this change actually improves performance.
Additionally, smaller weights make it possible to fit more of them in VRAM, which brings an additional improvement when using the GPU.
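
As a back-of-the-envelope check of the memory-bound argument, using the 13B numbers at 8 threads from the tables above: 7.6 GB / 81 ms ≈ 94 GB/s before this change and 6.8 GB / 73 ms ≈ 93 GB/s after it. The effective bandwidth is essentially unchanged, which is what you would expect if the weights are streamed from memory once per token, so the smaller weights translate almost directly into faster tokens.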

Alternative memory layouts are being explored in #1073 and #1256

@KerfuffleV2 (Collaborator) commented:

My concern wasn't really about performance specifically, but about the restriction of being tied to an mmap-able file format. As ilyakurdyukov mentioned, there isn't necessarily a one-size-fits-all model format that will be ideal for every CPU/GPU.

If the model file must be mmapable then you basically have to try to pick some middle ground even if there are small fixups that could easily be done. It also makes changes that break model file compatibility a lot more frequent, since the format must be directly evaluable exactly as it exists on disk.

Now that there's GPU support, it may be even more of an issue. From what I know, GPUs generally prefer 16-bit floats, while x86 CPUs at least generally prefer 32-bit floats. At the moment it seems you have to choose one or the other, even if it's not ideal.

@ggerganov merged commit 2d5db48 into master on May 19, 2023 (21 checks passed).
@ggerganov deleted the qnt-f16 branch on May 19, 2023 at 19:17.
@YellowRoseCx (Contributor) commented May 19, 2023

Hey Mr. @ggerganov, now that the F16 implementation has been merged, could you please increase LLAMA_FILE_VERSION so this change can be detected and handled for backwards compatibility with other models?

Do you have any thoughts on transparently handling f32->f16 scaling factors when loading older quantizations during model load?

@SlyEcho (Sponsor, Collaborator) commented May 19, 2023

Loading older models now results in a crash instead of an error message, as expected.

ggerganov added a commit to JohannesGaessler/llama.cpp that referenced this pull request May 20, 2023
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics
ggerganov added a commit that referenced this pull request May 20, 2023
…oadcasting for ggml_mul (#1483)

* Broadcasting for ggml_mul

* CUDA kernel for ggml_mul, norms in VRAM

* GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* define default model path once, sync path with readme (#1366)

* ~7% faster Q5_1 AVX2 code (#1477)

* convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

* Support models in a single pytorch_model.bin

* Remove spurious line with typo

* benchmark-matmul: Print the average of the test results (#1490)

* Remove unused n_parts parameter (#1509)

* Fixes #1511 lambda issue for w64devkit (mingw) (#1513)

* Fix for w64devkit and mingw

* make kv_f16 the default for api users (#1517)

* minor : fix compile warnings

* readme : adds WizardLM to the list of supported models (#1485)

* main : make reverse prompt option act as a stop token in non-interactive mode (#1032)

* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1.

* Update gpt_params_parse and fix a merge error take 2

* examples : add persistent chat (#1495)

* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : add missing header

* ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics

* ggml : fix scalar implementation of Q4_1 dot

* llama : fix compile warnings in llama_set_state_data()

* llama : fix name shadowing and C4146 (#1526)

* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Code style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix for mingw (#1462)

* llama : add llama_init_backend() API (close #1527)

* feature : add blis and other BLAS implementation support (#1502)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Revert "feature : add blis and other BLAS implementation support (#1502)"

This reverts commit 07e9ace.

* GPU weights not in RAM, direct loading with cuFile

* llama : code style fixes + progress print fix

* ggml : ggml_mul better broadcast support

* cmake : workarounds for cufile when CMake version < 3.25

* gg rebase fixup

* Loop in llama.cpp, fixed progress callback

* Attempt clang-tidy fix

* llama : fix vram size computation

* Add forgotten fclose()

---------

Co-authored-by: András Salamon <ott2@users.noreply.github.com>
Co-authored-by: Ilya Kurdyukov <59548320+ilyakurdyukov@users.noreply.github.com>
Co-authored-by: Tom Jobbins <784313+TheBloke@users.noreply.github.com>
Co-authored-by: rankaiyx <rankaiyx@rankaiyx.com>
Co-authored-by: Stephan Walter <stephan@walter.name>
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: David Kennedy <dakennedyd@gmail.com>
Co-authored-by: Jason McCartney <jmac@theroot.org>
Co-authored-by: Evan Jones <evan.q.jones@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zenix <zenixls2@gmail.com>
@Green-Sky (Collaborator) commented:

$ ./main -m /mnt/nvme/ml/llama/ggjt_3/65B/llama-65B-q4_0-ggjt3.bin -ngl 16 -b 1 -n 10 -t 8 -s 3 -p "Once" -c 128

... I know I cheated a little: batch size of 1 so it fits into VRAM (it looks like VRAM is allocated based on batch size), and a context size of only 128.

WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 580 (3f008ca)
main: seed  = 3
llama.cpp: loading model from /mnt/nvme/ml/llama/ggjt_3/65B/llama-65B-q4_0-ggjt3.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 128
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0,18 MB
llama_model_load_internal: mem required  = 31661,46 MB (+ 5120,00 MB per state)
llama_model_load_internal: [cublas] offloading 16 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 6949 MB
.....................
llama_init_from_file: kv self size  =  320,00 MB

system_info: n_threads = 8 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 1, n_predict = 10, n_keep = 0


 Once again we are happy to be able to offer a
llama_print_timings:        load time = 34773,73 ms
llama_print_timings:      sample time =    15,87 ms /    10 runs   (    1,59 ms per token)
llama_print_timings: prompt eval time =     0,00 ms /     1 tokens (    0,00 ms per token)
llama_print_timings:        eval time = 101378,22 ms /    11 runs   ( 9216,20 ms per token)
llama_print_timings:       total time = 119457,45 ms

So I can run 65B with 32 GiB of RAM and 8 GiB of VRAM at 9.2 s/tok.

It might get a little faster, plus one more layer on the GPU, when there is no desktop session running.

🎉

@Rhialto commented May 23, 2023

@KerfuffleV2 I don't agree with your dislike of mmap-able file formats. If you actually need to read the file, you need to read the whole thing before you can even start doing anything. If you mmap it, you can start right away, and the parts that you're not using (yet) do not need to be read.
Furthermore, mmap-ing saves you half the memory. When reading the file, you have one copy of the whole data in the process address space and another sitting in the file cache; when mmap-ing, it's in memory only once. It can also be shared by multiple running copies of the program. And if you quit the program and restart it, pretty much everything should still be cached and you can do work right away, whereas without mmap the program has to read all the data again, which takes time even if it is all still in the file cache.

@KerfuffleV2 (Collaborator) commented:

> I don't agree with your dislike of mmap-able file formats.

You're attacking a straw man here; I never said anything about disliking mmap-able file formats, and in fact I don't dislike them.

All I said is that the tradeoffs might not be worth it at this point.

> If you actually need to read the file, you need to read the whole thing before you can even start doing anything.

You need the entirety of the model data before you can generate the first token. So whether or not it's mmap-ed, the entire thing must be read from disk at least once before you see a single output. I'm pretty sure this applies to feeding in prompt tokens as well.

> Furthermore, mmaping saves you half the memory.

That's a weird way to look at it, since the buffer cache isn't really used memory. Anyway, you're arguing as if I said there were no benefits to mmap-ing, which is not true. There are pros and cons to both the mmap and non-mmap approaches.

mmap-ing can use the buffer cache more effectively (it can also be faster in some cases). However, a trade-off is that the exact data, byte for byte, must be directly evaluable. That means if a format is optimal for one architecture (CPU or GPU) but not another, you have to make a choice: you can't convert it to the format that's most optimal for evaluation on your specific system, you have to live with whatever the common format is.

> And if you quit the program and restart it, pretty much everything should still be cached and you can do work right away.

In practice, this takes a very small amount of time when everything's already in the buffer cache. I'd say, generally speaking most people are going to care more about the time/memory required for actually generating tokens as opposed to making loading the model file slightly faster. The majority of the time is spent doing inference, not loading a model (especially if the cache is hot).

However, as I already mentioned, if it were decided that the tradeoff of being able to mmap and evaluate the models directly was no longer worth it, that wouldn't mean it's impossible to make a simple tool that creates an mmap-able model from the more flexible distributable one.

So people who really care about load time can have their cake and eat it too.

@EwoutH (Contributor) commented May 24, 2023

This is really amazing work!

With this PR merged, should the Memory/Disk Requirements in the Readme also be updated?

@Dwedit commented May 24, 2023

Since people will be able to refine a 4-bit model directly, a full precision non-quantized version of that model might not even exist. We really need some kind of way to convert a model to the new format without requantizing it from a bigger file.

@KerfuffleV2 (Collaborator) commented:

> We really need some kind of way to convert a model to the new format without requantizing it from a bigger file.

Quantization is lossy "compression". To use those models without a large quality loss it would probably be necessary for GGML to be able to run inference on them without converting the actual tensor data again.

It may make sense for GGML to come up with its own format that accomplishes the same thing, especially since formats like Q5_0 and Q5_1 have substantially higher quality than 4-bit quantization and don't take up much more memory/disk space.
