
Support tensors with 64-bit number of elements in ggml #599

Closed
ghost opened this issue Mar 29, 2023 · 21 comments
Labels
enhancement New feature or request

Comments

@ghost

ghost commented Mar 29, 2023

Expected Behavior

When setting '-c' to a large number, llama.cpp should run, given sufficient RAM. I'm aware that it warns me that context sizes larger than 2048 might produce poor results, but the results are actually fine when it doesn't crash.

Current Behavior

When setting '-c' to a large number, llama.cpp crashes with the error message

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32289959360, available 32279078144)

(repeated many times with slightly varying numbers, see below).

To be precise, I'm using the command

./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -c 3300 -b 16 -n 2048 --keep 0 --temp 0.8 \
    --repeat_last_n 512 --repeat_penalty 1.1 --color \
    --ignore-eos

which crashes while the same command with '-c 3200' works.

I still have plenty of free RAM (~60 GB), so that shouldn't be an issue.

This might be related to #52, but it seems to die later in the process, so it's probably something different.

Environment and Context

I'm using Debian 11 and compiled using clang(++)-13 (but same result with g++) with OpenBLAS:

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
  • Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
...
Model name: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
...
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff    2G online       yes     0
0x0000000100000000-0x00000020ffffffff  128G online       yes  2-65

Memory block size:         2G
Total online memory:     130G
Total offline memory:      0B
  • Operating System, e.g. for Linux:
Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.10.9
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Debian 10.2.1-6) 10.2.1 20210110
$ clang++-13 --version
Debian clang version 13.0.1-6~deb11u1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
llama.cpp$ git log | head -1
commit a6956b25a1c783e5e96fe06c9c00438f846ef047

Failure Information (for bugs)

Hopefully everything is stated above. Feel free to ask for details.

Steps to Reproduce

  1. Have a huge amount of RAM
  2. Execute the command above

Failure Logs

main: warning: model does not support context sizes greater than 2048 tokens (3300 specified);expect poor results
main: seed = 1680105266
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 3300
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: type    = 4
llama_model_load: ggml ctx size = 30783.73 MB
llama_model_load: mem required  = 33343.73 MB (+ 5120.00 MB per state)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32289959360, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32289959360, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32289959360, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360771200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360771200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360771200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290025280, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290025280, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290025280, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290025280, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360837120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360837120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360837120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290091200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290091200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290091200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290091200, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360903040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360903040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360903040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290157120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290157120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290157120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290157120, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360968960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360968960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32360968960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290223040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290223040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290223040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290223040, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361034880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361034880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361034880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290288960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290288960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290288960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290288960, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361100800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361100800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361100800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290354880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290354880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290354880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290354880, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361166720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361166720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361166720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290420800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290420800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290420800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290420800, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361232640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361232640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361232640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290486720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290486720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290486720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290486720, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361298560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361298560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361298560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290552640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290552640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290552640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290552640, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361364480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361364480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361364480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290618560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290618560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290618560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290618560, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361430400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361430400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361430400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290684480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290684480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290684480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290684480, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361496320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361496320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361496320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290750400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290750400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290750400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290750400, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361562240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361562240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361562240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290816320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290816320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290816320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290816320, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361628160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361628160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361628160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290882240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290882240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290882240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290882240, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361694080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361694080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361694080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290948160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290948160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290948160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32290948160, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361760000, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361760000, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361760000, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32291014080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32291014080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32291014080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32291014080, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361825920, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361825920, available 32279078144)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 32361825920, available 32279078144)
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: ........................................................................
74420 Segmentation fault      ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -c 3300 -b 16 -n 2048 --keep 0 --temp 0.8 --repeat_last_n 512 --repeat_penalty 1.1 --color --ignore-eos
@anzz1
Contributor

anzz1 commented Mar 29, 2023

It wants to request 32361825920/1024^3 = 30.14 GB of additional memory but thinks you only have 32279078144/1024^3 = 30.06 GB available.

Unfortunately I cannot reproduce this as I'm failing on the first step:

Steps to reproduce

1. Have a huge amount of RAM ❌

@ghost
Author

ghost commented Mar 29, 2023

It wants to request 32361825920/1024^3 = 30.14GB of additional memory but thinks you only have 32279078144/1024^3 = 30.06 GB available.

Yes, but when running with -c 3200 I have over 60 GB left, so that won't be the issue.

Unfortunately I cannot reproduce this as I'm failing on the first step:

Steps to reproduce

1. Have a huge amount of RAM ❌

I feared that would be an issue. I'm not an insane C++ programmer but I've used it a few times. If you need me to run a debugger or something, I think I will be able to do that. Just tell me what I should do or look for.

@bogdad
Contributor

bogdad commented Mar 29, 2023

I also reproduced this:

lldb -- ./build/bin/main -m ./models/65B/ggml-model-q4_0.bin -t 16 -c 3300 -b 16 -n 2048 --keep 0 --temp 0.8 \                                                                    (base)
                                                    --repeat_last_n 512 --repeat_penalty 1.1 --color \
                                                    --ignore-eos
Process 8687 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000100006b14 main`ggml_nelements(tensor=0x0000000000000000) at ggml.c:2511:12 [opt]
   2508 int ggml_nelements(const struct ggml_tensor * tensor) {
   2509     static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");
   2510
-> 2511     return tensor->ne[0]*tensor->ne[1]*tensor->ne[2]*tensor->ne[3];
   2512 }
   2513
   2514 int ggml_nrows(const struct ggml_tensor * tensor) {
T
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000100006b14 main`ggml_nelements(tensor=0x0000000000000000) at ggml.c:2511:12 [opt]
    frame #1: 0x000000010001b95c main`llama_model_load(fname="./models/65B/ggml-model-q4_0.bin", lctx=0x0000000101008a00, n_ctx=<unavailable>, n_parts=<unavailable>, memory_type=<unavailable>, vocab_only=false, progress_callback=0x0000000000000000, progress_callback_user_data=0x0000000000000000)(float, void*), void*) at llama.cpp:684:25 [opt]
    frame #2: 0x000000010001981c main`::llama_init_from_file(path_model="./models/65B/ggml-model-q4_0.bin", params=llama_context_params @ 0x000000016fdfd840) at llama.cpp:1648:10 [opt]
    frame #3: 0x0000000100004b94 main`main(argc=<unavailable>, argv=<unavailable>) at main.cpp:102:15 [opt]
    frame #4: 0x000000019738fe50 dyld`start + 2544
(
frame #0: 0x0000000100006b14 main`ggml_nelements(tensor=0x0000000000000000) at ggml.c:2511:12 [opt]
   2508 int ggml_nelements(const struct ggml_tensor * tensor) {
   2509     static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");
   2510
-> 2511     return tensor->ne[0]*tensor->ne[1]*tensor->ne[2]*tensor->ne[3];
   2512 }
   2513
   2514 int ggml_nrows(const struct ggml_tensor * tensor) {
(lldb) e tensor
(const ggml_tensor *) $5 = NULL
(lldb) f 1
frame #1: 0x000000010001b95c main`llama_model_load(fname="./models/65B/ggml-model-q4_0.bin", lctx=0x0000000101008a00, n_ctx=<unavailable>, n_parts=<unavailable>, memory_type=<unavailable>, vocab_only=false, progress_callback=0x0000000000000000, progress_callback_user_data=0x0000000000000000)(float, void*), void*) at llama.cpp:684:25 [opt]
   681                          return false;
   682                      }
   683                  } else {
-> 684                      if (ggml_nelements(tensor)/n_parts != nelements) {
   685                          fprintf(stderr, "%s: tensor '%s' has wrong size in model file\n", __func__, name.data());
   686                          return false;
   687                      }
(lldb) e name
(std::string) $6 = "layers.63.attention.wk.weight"
(

@anzz1
Contributor

anzz1 commented Mar 29, 2023

Maybe there actually isn't such a large contiguous memory block available and the allocation fails because of that? I don't know which memory reporting tool you use, but it should show the amount of contiguous, committable memory actually available, not memory that is backed by a page file.

The first step I would take would be to disable pagefile completely so absolutely nothing will get paged from/to disk to rule out the possibility of it being a memory/swap issue.

@anzz1
Contributor

anzz1 commented Mar 29, 2023

Oh wait, this line:
thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
gives a clue that it could be an overflow issue.

Or maybe not after all; it may actually be trying to access 0 (null), with the faulting address 0x8 just being a small offset from that.
It seems that you are actually experiencing two different problems: @AngryDuck42 with memory allocation, and @bogdad with a null tensor pointer somewhere in the tensor chain. @bogdad, does this happen with all models or only certain ones? Have you checked your models against the checksums so a model error can be ruled out?

@ghost
Author

ghost commented Mar 29, 2023

If I do what @bogdad did, I get more or less identical output (modulo Linux vs. Apple differences), so we are most likely seeing the same issue. I also figured out that the issue occurs with -c 3277 while it works with -c 3276.
I checked the checksums of my q4_0 model and they match. -c 2048 also gives the expected outputs.

@bogdad
Contributor

bogdad commented Mar 29, 2023

Agreed, I get very similar-looking output when running without lldb too. Sorry, I did not have time to look into this further; the good thing is that it reproduces reliably every time. Maybe it is some overflow mistake in a buffer size calculation, as you say, but it's hard to tell without looking carefully.

@ggerganov
Owner

You need to bump these buffer sizes:

llama.cpp/llama.cpp

Lines 46 to 79 in b51c717

// computed for n_ctx == 2048
// TODO: dynamically determine these sizes
// needs modifications in ggml
static const std::map<e_model, size_t> MEM_REQ_SCRATCH0 = {
{ MODEL_7B, 512ull*MB },
{ MODEL_13B, 512ull*MB },
{ MODEL_30B, 512ull*MB },
{ MODEL_65B, 512ull*MB },
};
static const std::map<e_model, size_t> MEM_REQ_SCRATCH1 = {
{ MODEL_7B, 512ull*MB },
{ MODEL_13B, 512ull*MB },
{ MODEL_30B, 512ull*MB },
{ MODEL_65B, 512ull*MB },
};
// 2*n_embd*n_ctx*n_layer*sizeof(float16)
static const std::map<e_model, size_t> MEM_REQ_KV_SELF = {
{ MODEL_7B, 1026ull*MB },
{ MODEL_13B, 1608ull*MB },
{ MODEL_30B, 3124ull*MB },
{ MODEL_65B, 5120ull*MB },
};
// this is mostly needed for temporary mul_mat buffers to dequantize the data
// not actually needed if BLAS is disabled
static const std::map<e_model, size_t> MEM_REQ_EVAL = {
{ MODEL_7B, 768ull*MB },
{ MODEL_13B, 1024ull*MB },
{ MODEL_30B, 1280ull*MB },
{ MODEL_65B, 1536ull*MB },
};

Currently, they are hardcoded to support a max context of ~2048.
There is no mechanism yet to automatically compute the memory needed for an arbitrary context size, so you have to do some guesswork.
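
As a back-of-the-envelope check (a sketch based on the formula quoted above and the hyperparameters from the log in this issue, not code from the repo), the KV-cache requirement alone already exceeds the hardcoded 5120 MB for MODEL_65B at -c 3300:

#include <stdio.h>
#include <stdint.h>

// Rough KV-cache estimate for the 65B model at n_ctx = 3300, using the
// formula from the comment above: 2*n_embd*n_ctx*n_layer*sizeof(float16).
int main(void) {
    const uint64_t n_embd  = 8192;   // from the llama_model_load output
    const uint64_t n_layer = 80;
    const uint64_t n_ctx   = 3300;
    const uint64_t fp16_sz = 2;      // bytes per fp16 element

    const uint64_t kv_bytes = 2 * n_embd * n_ctx * n_layer * fp16_sz;
    printf("KV cache: %llu bytes = %.1f MB\n",
           (unsigned long long) kv_bytes, kv_bytes / (1024.0 * 1024.0));
    // ~8250 MB, well above the hardcoded 5120 MB for MODEL_65B.
    return 0;
}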

@ghost
Author

ghost commented Mar 29, 2023

You need to bump these buffer sizes:

llama.cpp/llama.cpp

Lines 46 to 79 in b51c717

// computed for n_ctx == 2048
// TODO: dynamically determine these sizes
// needs modifications in ggml
static const std::map<e_model, size_t> MEM_REQ_SCRATCH0 = {
{ MODEL_7B, 512ull*MB },
{ MODEL_13B, 512ull*MB },
{ MODEL_30B, 512ull*MB },
{ MODEL_65B, 512ull*MB },
};
static const std::map<e_model, size_t> MEM_REQ_SCRATCH1 = {
{ MODEL_7B, 512ull*MB },
{ MODEL_13B, 512ull*MB },
{ MODEL_30B, 512ull*MB },
{ MODEL_65B, 512ull*MB },
};
// 2*n_embd*n_ctx*n_layer*sizeof(float16)
static const std::map<e_model, size_t> MEM_REQ_KV_SELF = {
{ MODEL_7B, 1026ull*MB },
{ MODEL_13B, 1608ull*MB },
{ MODEL_30B, 3124ull*MB },
{ MODEL_65B, 5120ull*MB },
};
// this is mostly needed for temporary mul_mat buffers to dequantize the data
// not actually needed if BLAS is disabled
static const std::map<e_model, size_t> MEM_REQ_EVAL = {
{ MODEL_7B, 768ull*MB },
{ MODEL_13B, 1024ull*MB },
{ MODEL_30B, 1280ull*MB },
{ MODEL_65B, 1536ull*MB },
};

Currently, they are hardcoded to support a max context of ~2048. There is no mechanism yet to automatically compute the memory needed for an arbitrary context size, so you have to do some guesswork.

Unfortunately that does not seem to be sufficient. I increased all four values for 65B by quite a bit but still get the issue for -c 3277, which is the smallest value it wouldn't work with before.

@ghost
Author

ghost commented Mar 30, 2023

I looked into the issue further and want to share my findings so maybe someone who actually knows what's going on might get an idea.

As I said before, increasing the MEM_REQ_... values didn't help, but what did help was just adding some space right after:

ctx_size += (5 + 10*n_layer)*256; // object overhead

I just added a random

ctx_size += 10 * 1024 * MB;

and most of the messages were gone. As far as I can tell, MEM_REQ_... is not used for the ctx_size calculation at all, so increasing those values alone by a sane amount wouldn't have helped. I'm not sure whether something else should be included in the ctx_size calculation.

As I said, most messages were gone, but not all:

[...]
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 18446744069414846656, available 2621440)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 18446744069414846656, available 2621440)
Process 12253 stopped

So after loading the model into memory, it needs an amount of memory that is a rather large number even for my computer. I used my newly learned debugging skills to look at what's going on:

Process 12523 stopped
* thread #1, name = 'main', stop reason = breakpoint 1.1
    frame #0: 0x000000000040db39 main`ggml_new_tensor_impl(ctx=0x0000000000542c68, type=GGML_TYPE_F16, n_dims=1, ne=0x00007fffffffb4f0, data=0x0000000000000000) at ggml.c:2950:13
   2947	        size_needed += sizeof(struct ggml_tensor);
   2948	
   2949	        if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
-> 2950	            GGML_PRINT("%s: not enough space in the context's memory pool (needed %zu, available %zu)\n",
   2951	                    __func__, cur_end + size_needed + GGML_OBJECT_SIZE, ctx->mem_size);
   2952	            assert(false);
   2953	            return NULL;
(lldb) frame variable
(ggml_context *) ctx = 0x0000000000542c68
(ggml_type) type = GGML_TYPE_F16
(int) n_dims = 1
(const int *) ne = 0x00007fffffffb4f0
(void *) data = 0x0000000000000000
(ggml_object *) obj_cur = NULL
(const size_t) cur_offs = 0
(const size_t) cur_size = 0
(const size_t) cur_end = 0
(size_t) size_needed = 18446744069414846624
(char *const) mem_buffer = 0x00007ff57512c010 ""
(ggml_object *const) obj_new = 0x00007ff57512c010
(ggml_tensor *const) result = 0x000000000055e3c8
(lldb) bt
* thread #1, name = 'main', stop reason = breakpoint 1.1
  * frame #0: 0x000000000040db39 main`ggml_new_tensor_impl(ctx=0x0000000000542c68, type=GGML_TYPE_F16, n_dims=1, ne=0x00007fffffffb4f0, data=0x0000000000000000) at ggml.c:2950:13
    frame #1: 0x000000000040e05e main`ggml_new_tensor(ctx=0x0000000000542c68, type=GGML_TYPE_F16, n_dims=1, ne=0x00007fffffffb4f0) at ggml.c:3044:12
    frame #2: 0x000000000040e097 main`ggml_new_tensor_1d(ctx=0x0000000000542c68, type=GGML_TYPE_F16, ne0=-2147352576) at ggml.c:3051:12
    frame #3: 0x000000000042ec4a main`kv_cache_init(hparams=0x000000000055e354, cache=0x000000000055e3b0, wtype=GGML_TYPE_F16, n_ctx=3277) at llama.cpp:257:15
    frame #4: 0x000000000042a4f3 main`::llama_init_from_file(path_model="./models/65B/ggml-model-q4_0.bin", params=llama_context_params @ 0x00007fffffffc9f0) at llama.cpp:1670:14
    frame #5: 0x0000000000406305 main`main(argc=21, argv=0x00007fffffffe768) at main.cpp:102:15
    frame #6: 0x00007ffff57e5d0a libc.so.6`__libc_start_main(main=(main`main at main.cpp:39), argc=21, argv=0x00007fffffffe768, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe758) at libc-start.c:308:16
    frame #7: 0x0000000000405eea main`_start + 42

So something is going wrong already here:

const int n_elements = n_embd*n_mem;

Maybe this already helps someone. Unfortunately I have to do some work now, but I'll try to dig further later on.

Edit:
Quick remark:
n_elements would be 80 * 3277 * 8192 = 2147614720, which is larger than the maximum value an int can hold (2147483647). With the current int implementation of ggml, everything above -c 3276 gives an overflow for the 65B model.

Edit²:
I further verified this with 30B, where we have n_embd=6656 and n_layer=60, so you can calculate the limit as (2**31)/(6656 * 60) = 5377. -c 5377 works, -c 5378 does not.
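
To make the wrap-around concrete, here is a minimal standalone sketch (not llama.cpp code): the product computed for the KV cache overflows a 32-bit int and produces exactly the negative ne0 value visible in the backtrace above, while widening to int64_t before multiplying gives the true element count.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const int n_layer = 80, n_ctx = 3277, n_embd = 8192;
    const int n_mem = n_layer * n_ctx;              // 262160, still fits in int

    // Signed overflow is technically undefined behaviour, but on typical
    // targets it wraps, giving the -2147352576 seen as ne0 in the backtrace.
    int     wrapped = n_embd * n_mem;
    int64_t widened = (int64_t) n_embd * n_mem;     // widen before multiplying

    printf("int:     %d\n", wrapped);               // -2147352576
    printf("int64_t: %lld\n", (long long) widened); // 2147614720
    return 0;
}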

@bogdad
Contributor

bogdad commented Mar 30, 2023

n_elements being an int seems to be a hard thing to change, as it is sometimes assigned to tensor->ne[..], which is also an int. That might be the natural limit, but I have almost zero experience in llama.cpp.

@ghost
Author

ghost commented Mar 30, 2023

n_elements being an int seems to be a hard thing to change, as it is sometimes assigned to tensor->ne[..], which is also an int. That might be the natural limit, but I have almost zero experience in llama.cpp.

Yeah, it goes deep into ggml.c. One would probably have to rewrite the whole indexing of ggml.c and llama.cpp to a different data type, which would probably be a horrific task.

As I'm not too familiar with C/C++, I wonder if it would even work. One would probably have to use something like unsigned long, int64_t or size_t, I guess? Can one safely index arrays using those? Would this break AVX and all the other optimization stuff? Do people get angry if they see something like for (size_t i = 0; i < n; i++)? Just wondering if this is a fundamental limitation of C/C++.

@bogdad
Contributor

bogdad commented Mar 30, 2023

I would say there is no limitation on array or vector size up to size_t, with some caveats. I would also think that n_elements (or tensor->ne) is not part of the AVX optimizations themselves; it rather controls how many times we do the AVX thing, but I know nothing, it's just a guess. int might be the fastest type to iterate the loop on, but maybe that would not be that important.
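
(As a toy illustration of that point, nothing ggml-specific: a loop indexed with a 64-bit counter looks and behaves just like an int-indexed one, it simply removes the ~2^31 element ceiling, and compilers can still auto-vectorize it.)

#include <stdint.h>

// Toy example: summing a large buffer with a 64-bit loop counter.
float sum_f32(const float * x, int64_t n) {
    float acc = 0.0f;
    for (int64_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}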

I would also say that if I had a concrete practical need that depended on n_elements being larger than int (uint might be enough, for example), I would totally give changing to uint a try in a fork. Getting it upstream might be a separate concern though : )

It might be that it won’t work though for some other reasons, please don’t blame me if that’s the case!

@ghost
Author

ghost commented Mar 30, 2023

Thanks a lot @bogdad, that would be great. I read up on the topic and it seems like size_t is the way to go, as it's guaranteed to be able to index the largest possible array on your machine, and it should be optimized by the compiler anyway. uint would at least double the maximum tensor size, but if you're already at it, it makes sense to use the best possible type. Possibly @ggerganov also has an opinion on this.

@bogdad
Contributor

bogdad commented Mar 30, 2023

(Sorry, I was not very clear - currently I don't have a real practical need, so I am not working on it.)

@ggerganov ggerganov changed the title from "ggml issues with memory management" to "Support tensors with 64-bit number of elements in ggml" on Mar 30, 2023
@ggerganov ggerganov added the enhancement (New feature or request) label on Mar 30, 2023
@ggerganov
Owner

So in summary - it is an integer overflow issue.
Somehow I didn't realize that the number of elements in a tensor can get larger than ~2 billion and overflow int.

The proposed solution in #626 does not seem like the right way.
The interface of ggml has to keep using int for the individual tensor shapes.

What needs to be done, I think, is to carefully go through the places where we multiply ne values and add casts to int64_t.
Use int64_t rather than size_t because, when dealing with computations on numbers of elements, every now and then one has to subtract them, and we don't want to get underflows.

In contrast, for memory offsets we always use size_t.
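
A rough sketch of what that could look like for ggml_nelements (hypothetical illustration only, not the actual patch): the per-dimension ne values stay int, but the product is computed in 64 bits so it cannot overflow.

// Hypothetical sketch - the ne fields remain int, the product is widened.
int64_t ggml_nelements(const struct ggml_tensor * tensor) {
    static_assert(GGML_MAX_DIMS == 4, "GGML_MAX_DIMS is not 4 - update this function");
    return (int64_t) tensor->ne[0] * tensor->ne[1] * tensor->ne[2] * tensor->ne[3];
}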

@Seltsamsel
Contributor

This should be resolved by #626.

@daulet

daulet commented Jun 16, 2023

Any plans to fix this?

@ggerganov
Owner

It should be fixed via #626

@daulet

daulet commented Jun 16, 2023

I'm trying to load a custom 50B model via ggml (not this repo). I believe a similar patch has been applied there too, but I can still repro. Sorry, I didn't realize I was on a different repo before commenting (a Google search brought me here).

....
gpt2_model_load: ftype   = 2003
gpt2_model_load: qntvr   = 2
gpt2_model_load: ggml tensor size = 224 bytes
gpt2_model_load: ggml ctx size = 15745.44 MB
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16603255552, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16603345920, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16613996800, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16530209280, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16603642624, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16603732992, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16603823360, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16614474240, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16530686720, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16604120064, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16604210432, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16604300800, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16614951680, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16531164160, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16604597504, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16604687872, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16604778240, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16615429120, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16531641600, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16605074944, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16605165312, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16605255680, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16615906560, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16532119040, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16605552384, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16605642752, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16605733120, available 16510290944)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 16616384000, available 16510290944)
...

@daulet

daulet commented Jun 17, 2023

Just checked: the ggml project has the same patch applied. Any advice on how to debug this issue further?
