
main does not terminate #6149

Closed
markdols opened this issue Mar 19, 2024 · 9 comments · Fixed by #6206

Comments

@markdols

Running Arch Linux, kernel 6.8.1
llama.cpp commit 2d15886
CPU: AMD Ryzen 9 5900HS, GPU: NVIDIA 3050 Ti Laptop (4GB VRAM)

After cloning the repo and running make -j 8 LLAMA_CUBLAS=1 LLAMA_FAST=1, I ran the command ./main -t 4 -m models/fusionnet2x7b-q5_K_S.gguf -ngl 10 -n 256.

This was the output:

Log start
main: build = 2460 (2d15886b)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1710807513
llama_model_loader: loaded meta data with 25 key-value pairs and 419 tensors from models/fusionnet2x7b-q5_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 2
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 16
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:   32 tensors
llama_model_loader: - type q5_K:  321 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
... (some of the logs were lost after having to restart my computer)
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    44.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    20.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   286.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph nodes  = 1668
llama_new_context_with_model: graph splits = 268

system_info: n_threads = 4 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 256, n_keep = 1


 import React, { useState } from 'react';
import axios from 'axios';
import './SearchForm.css'

const SearchForm = () => {
  const [cityName, setCityName] = useState('');
  const [weatherData, setWeatherData] = useState(null);

  const handleSubmit = async (e) => {
    e.preventDefault();
    try {
      const response = await axios.get(`https://api.openweathermap.org/data/2.5/weather?q=${cityName}&appid=580a72b1922e3d64d99fbcda3f3a5d9c`);
      setWeatherData(response.data);
    } catch (error) {
      console.log('Error getting weather data:', error.message);
    }
  };

  const handleChange = (e) => {
    setCityName(e.target.value);
  }

  return (
    <form onSubmit={handleSubmit} className="searchForm">
      <input type
llama_print_timings:        load time =   25928.97 ms
llama_print_timings:      sample time =      33.03 ms /   256 runs   (    0.13 ms per token,  7750.53 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   46923.08 ms /   256 runs   (  183.29 ms per token,     5.46 tokens per second)
llama_print_timings:       total time =   47061.57 ms /   257 tokens

Even after the exit logs are printed, the main process does not exit. Pressing ^C or ^\ doesn't terminate the process either; it only causes my terminal to become unresponsive. The CPU resources held by the process do not appear to be freed. Soon after, my entire desktop environment freezes, and I am forced to switch to a TTY to restart my computer, or, if the freeze is especially bad, force-restart it.

Running the command from a TTY instead just causes the TTY to become completely unresponsive after the exit logs are printed.

Rebuilding the repository multiple times has not resolved the issue, and it occurs whether or not the model was quantized with an imatrix.

I suspect the recent commit regarding the CUDA backend may have something to do with this, as I have only observed this behavior after updating to a commit on March 18, 2024. Omitting -ngl 10 from the command doesn't resolve the issue, though.

Regardless, this bug has made main largely unusable for me.

@MathiasSchindler

Greetings. I can confirm this behaviour. Larger models seem to trigger it more often than a very small model such as phi2 (but it happens eventually even with small ones).

Running Ubuntu 23.10
CPU: AMD Ryzen 9 7900, GPU: NVIDIA 4070 Ti SUPER

@slaren
Collaborator

slaren commented Mar 20, 2024

Setting the environment variable GGML_CUDA_NO_PINNED or using --no-mmap may help.
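For example, applying either suggestion to the command from the original report (illustrative only; any value for the environment variable should do if llama.cpp only checks that it is set, which is an assumption on my part):

GGML_CUDA_NO_PINNED=1 ./main -t 4 -m models/fusionnet2x7b-q5_K_S.gguf -ngl 10 -n 256
./main -t 4 -m models/fusionnet2x7b-q5_K_S.gguf -ngl 10 -n 256 --no-mmap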

@kalomaze
Contributor

kalomaze commented Mar 20, 2024

I can confirm that, after some amount of time, shutting down the server executable [on Ubuntu] will sometimes (inconsistently?) hit this issue, and I have to kill the terminal to stop the process.
I built on this commit, which came before the CUDA backend PR was merged. I'm not sure whether it affects both server and main in the same way.

@slaren
Collaborator

slaren commented Mar 21, 2024

If you can reproduce this and your computer is still responsive, the best way to find the cause would be to attach a debugger to the process and get a call stack. Run gdb, type attach <pid> and then bt.
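For example (12345 here is a stand-in for the PID of the hung ./main process; substitute the real one from ps or pgrep):

$ gdb
(gdb) attach 12345
(gdb) bt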

@markdols
Author

Running with --no-mmap actually allows me to run gdb and get a backtrace on the process, but this is all the backtrace showed after the exit logs were printed:

(gdb) bt
#0  0x0000782170643cd0 in ?? () from /usr/lib/libc.so.6
#1  0x0000782170643d8a in __libc_start_main () from /usr/lib/libc.so.6
#2  0x00005b55411afe95 in _start ()

Without --no-mmap, trying to attach a debugger to the process seems infeasible.

@markdols
Author

With a debug build, the freeze appears to happen at llama_free_model(model).

@markdols
Author

This could be the culprit; it was introduced in the commit I mentioned earlier.
In ~llama_model(), which is what llama_free_model calls:

#ifdef GGML_USE_CUBLAS
            if (ggml_backend_buffer_get_type(buf) == ggml_backend_cpu_buffer_type()) {
                ggml_backend_cuda_unregister_host_buffer(ggml_backend_buffer_get_base(buf));
            }
#endif

It might be caused by something outside of this destructor, but the freeze appears to happen here. Removing this block and recompiling lets the process terminate, yet my computer still slows down dramatically afterward, so clearly something else is going on as well.
I'll see if reverting to an earlier commit works.
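For context, here is a minimal standalone sketch of the pinning mechanism that block tears down, assuming ggml_backend_cuda_unregister_host_buffer wraps the CUDA runtime's host-memory unpinning (an assumption; nothing below is llama.cpp's actual code):

// Illustrative sketch only: pin an existing host buffer for CUDA transfers,
// then unpin it at teardown, analogous to the register/unregister pair above.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t size = 64ull * 1024 * 1024;               // 64 MiB, multiple of the page size
    void * buf = std::aligned_alloc(4096, size);           // page-aligned host allocation
    if (buf == nullptr) {
        return 1;
    }

    // Pin (page-lock) the range so the CUDA driver can DMA directly from host memory.
    cudaError_t err = cudaHostRegister(buf, size, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        std::free(buf);
        return 1;
    }

    // ... host-to-device copies would go here ...

    // Unpin at teardown. The driver has to walk and unlock every page that was
    // registered, so this call gets more expensive as the buffer gets larger.
    err = cudaHostUnregister(buf);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostUnregister: %s\n", cudaGetErrorString(err));
    }
    std::free(buf);
    return 0;
}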

@markdols
Author

I reverted to this commit, and everything works fine. As suspected, switching to this commit or any commit after it causes the aforementioned freezing.

@MathiasSchindler

@slaren: Thank you, I tested it and it seems to work fine.
