
CUDA/OpenCL error, out of memory when reload. #1456

Closed
edp1096 opened this issue May 14, 2023 · 25 comments
Labels
bug (Something isn't working) · hardware (Hardware related) · high priority (Very important issue)

Comments

@edp1096
Contributor

edp1096 commented May 14, 2023

Hello folks,

When I try the save-load-state example with CUDA, an error occurs.
It seems something needs to be added to the llama_free function.

The n_gpu_layers field is added in the main function as shown below.

int main(int argc, char ** argv) {
    ...
    auto lparams = llama_context_default_params();

    lparams.n_ctx     = params.n_ctx;
    lparams.n_parts   = params.n_parts;
    lparams.n_gpu_layers = params.n_gpu_layers; // Add gpu layers count
    lparams.seed      = params.seed;
    ...
}

Then I ran it as follows.

D:\dev\pcbangstudio\workspace\my-llama\bin>save-load-state.exe -m ggml-vic7b-q4_0.bin -ngl 32
main: build = 548 (60f8c36)
llama.cpp: loading model from ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3860 MB
llama_init_from_file: kv self size  =  256.00 MB

The quick brown fox jumps over the lazy dog.

<!-- InstanceEnd -->Visible transl

llama.cpp: loading model from ggml-vic7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
CUDA error 2 at D:\dev\pcbangstudio\workspace\my-llama\llama.cpp\ggml-cuda.cu:462: out of memory

D:\dev\pcbangstudio\workspace\my-llama\bin>
@FSSRepo
Collaborator

FSSRepo commented May 15, 2023

It seems that llama_free is not releasing the memory used by the previously used weights.
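
One way to check that hypothesis (a rough sketch, assuming the llama.h calls quoted elsewhere in this thread plus the standard CUDA runtime query cudaMemGetInfo; linking against the CUDA runtime is assumed):

#include "llama.h"
#include <cuda_runtime.h>
#include <cstdio>

// free device memory in bytes, queried via the CUDA runtime
static size_t free_vram() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    return free_b;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.bin>\n", argv[0]);
        return 1;
    }

    auto lparams = llama_context_default_params();
    lparams.n_gpu_layers = 32; // offload layers so the weights land in VRAM

    printf("free VRAM before load: %zu MB\n", free_vram() / (1024 * 1024));
    auto ctx = llama_init_from_file(argv[1], lparams);
    printf("free VRAM after load:  %zu MB\n", free_vram() / (1024 * 1024));

    llama_free(ctx);
    printf("free VRAM after free:  %zu MB\n", free_vram() / (1024 * 1024));
    // if the last value does not return to roughly the first one,
    // the offloaded weights were not released
    return 0;
}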

@edp1096
Contributor Author

edp1096 commented May 17, 2023

I found that every GPU malloc in ggml-cuda.cu has a matching cudaFree except the one in ggml_cuda_transform_tensor.
Is there a reason to leave the QKV layers allocated?
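
The usual fix for this pattern is to record every device buffer created at offload time and release them all when the context is freed. A rough sketch (hypothetical names, not the actual ggml-cuda.cu code):

#include <cuda_runtime.h>
#include <vector>

// hypothetical registry of device buffers created when tensors are
// transformed/offloaded to the GPU
static std::vector<void *> g_offloaded_buffers;

static void * alloc_offloaded(size_t size) {
    void * dev_ptr = nullptr;
    cudaMalloc(&dev_ptr, size);              // allocation done at offload time
    g_offloaded_buffers.push_back(dev_ptr);  // remember it for later cleanup
    return dev_ptr;
}

// called from llama_free (or an equivalent teardown path) so that
// reloading the model does not accumulate stale VRAM allocations
static void free_offloaded_buffers() {
    for (void * p : g_offloaded_buffers) {
        cudaFree(p);
    }
    g_offloaded_buffers.clear();
}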

@bfrasure

bfrasure commented May 21, 2023

For some reason, I was having this problem but I solved it by killing the task TabNine-deep-local.exe. That might have been local to my computer, but if your GPU is holding onto the memory, try closing some of the processes.

@edp1096
Contributor Author

edp1096 commented May 21, 2023

@bfrasure What is TabNine? If you mean the code-assistant application, I don't use it.

@bfrasure

It's an extension I loaded with VSCode. Looking further, I don't think it's related.

@edp1096
Contributor Author

edp1096 commented May 22, 2023

I was able to deallocate the GPU-offloaded parts by modifying llama_free().
#1459 for CLBlast, a PR that has not been merged yet, works, but #1412 for CUDA does not.

@nidhishs

I wanted to bring more attention to this issue, @JohannesGaessler, as downstream packages are being affected by offloaded layers not being cleaned from GPU VRAM.

@JohannesGaessler
Collaborator

I can't reproduce the issue. In any case, if I had to guess, the problem is not that the CUDA buffers for the model weights aren't being deallocated, but rather that they are getting allocated multiple times. I will soon make a PR that overhauls the CUDA code to make it more scalable, and I'll try to include a fix then.
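
As a standalone illustration of that failure mode (hypothetical sizes, not the actual llama.cpp code): repeatedly allocating a weight-sized buffer without freeing the previous one shrinks free VRAM until cudaMalloc fails with error 2 (cudaErrorMemoryAllocation), which matches the "CUDA error 2 ... out of memory" seen in the logs above.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // each "reload" allocates a fresh weight buffer; nothing frees the old one,
    // so free VRAM shrinks until cudaMalloc reports an allocation failure
    for (int reload = 0; ; reload++) {
        void * weights = nullptr;
        const size_t size = (size_t) 3860 * 1024 * 1024; // roughly the VRAM of one 7B Q4_0 offload
        cudaError_t err = cudaMalloc(&weights, size);
        if (err != cudaSuccess) {
            printf("reload %d: %s\n", reload, cudaGetErrorString(err));
            break;
        }
        // missing cleanup: cudaFree(weights);
    }
    return 0;
}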

@nidhishs

Using the python bindings on Linux, this snippet was able to reproduce the issue:

from llama_cpp import Llama
import gc
import os

def measure_resources(func):
    def get_ram_usage(pid):
        ram = os.popen(f'pmap {pid} | tail -1').read().strip()
        return ram.split(' ')[-1]
    
    def get_gpu_usage(pid):
        gpu = os.popen(f'nvidia-smi --query-compute-apps=pid,used_memory --format=csv | grep {pid}').read().strip()
        return gpu.split(', ')[-1] if gpu else '0 MiB'

    def wrapper():
        pid = os.getpid()
        print('pid:', pid)
        pre_ram, pre_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('pre_ram:', pre_ram, 'pre_gpu:', pre_gpu)
        func()
        post_ram, post_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('post_ram:', post_ram, 'post_gpu:', post_gpu)

    return wrapper

@measure_resources
def generate_text():
    llm = Llama(model_path=os.environ.get("MODEL"), n_gpu_layers=40)
    del llm
    gc.collect()

if __name__ == '__main__':
    generate_text()

Output:

pid: 13121
pre_ram: 720676K pre_gpu: 0 MiB
llama.cpp: loading model from ./weights/oasst-30b.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32016
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 135.75 KB
llama_model_load_internal: mem required  = 25573.29 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 15307 MB
llama_init_from_file: kv self size  =  780.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
post_ram: 25209048K post_gpu: 16074 MiB

More info here.

@JohannesGaessler
Collaborator

Thanks for the code snippet, I can reproduce the issue now. I think I'll be able to fix it by adding a destructor to ggml_tensor, although deleting and then recreating Llama Python objects will still require loading the VRAM again every time.
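
As a rough sketch of that idea (a hypothetical C++ wrapper, not the actual ggml code): tying the device allocation to an owning object whose destructor calls cudaFree means the VRAM is released whenever the tensor's backing buffer is destroyed.

#include <cuda_runtime.h>
#include <cstddef>

// hypothetical RAII wrapper: the device buffer is released automatically
// when the owning object goes out of scope or is deleted
struct cuda_buffer {
    void * dev_ptr = nullptr;

    explicit cuda_buffer(size_t size) {
        cudaMalloc(&dev_ptr, size);
    }
    ~cuda_buffer() {
        if (dev_ptr != nullptr) {
            cudaFree(dev_ptr);
        }
    }

    // non-copyable so the same device pointer is never freed twice
    cuda_buffer(const cuda_buffer &) = delete;
    cuda_buffer & operator=(const cuda_buffer &) = delete;
};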

@JohannesGaessler
Collaborator

I was able to fix this issue on my branch where I'm refactoring CUDA code.

@edp1096
Contributor Author

edp1096 commented May 26, 2023

#include "common.h"
#include "llama.h"
#include "build-info.h"

#include <vector>
#include <cstdio>
#include <chrono>

int main(int argc, char ** argv) {
    gpt_params params;
    params.seed = 42;
    params.n_threads = 4;
    params.repeat_last_n = 64;
    params.prompt = "The quick brown fox";

    if (gpt_params_parse(argc, argv, params) == false) {
        return 1;
    }

    fprintf(stderr, "%s: build = %d (%s)\n", __func__, BUILD_NUMBER, BUILD_COMMIT);

    if (params.n_predict < 0) {
        params.n_predict = 16;
    }

    auto lparams = llama_context_default_params();

    lparams.n_ctx     = params.n_ctx;
    lparams.n_gpu_layers = params.n_gpu_layers;  // modified here to enable GPU offload
    lparams.seed      = params.seed;
    lparams.f16_kv    = params.memory_f16;
    lparams.use_mmap  = params.use_mmap;
    lparams.use_mlock = params.use_mlock;

    auto n_past = 0;
    auto last_n_tokens_data = std::vector<llama_token>(params.repeat_last_n, 0);

    // init
    auto ctx = llama_init_from_file(params.model.c_str(), lparams);
    auto tokens = std::vector<llama_token>(params.n_ctx);
    auto n_prompt_tokens = llama_tokenize(ctx, params.prompt.c_str(), tokens.data(), tokens.size(), true);

    if (n_prompt_tokens < 1) {
        fprintf(stderr, "%s : failed to tokenize prompt\n", __func__);
        return 1;
    }

    // evaluate prompt
    llama_eval(ctx, tokens.data(), n_prompt_tokens, n_past, params.n_threads);

    last_n_tokens_data.insert(last_n_tokens_data.end(), tokens.data(), tokens.data() + n_prompt_tokens);
    n_past += n_prompt_tokens;

    const size_t state_size = llama_get_state_size(ctx);
    uint8_t * state_mem = new uint8_t[state_size];

    // Save state (rng, logits, embedding and kv_cache) to file
    {
        FILE *fp_write = fopen("dump_state.bin", "wb");
        llama_copy_state_data(ctx, state_mem); // could also copy directly to memory mapped file
        fwrite(state_mem, 1, state_size, fp_write);
        fclose(fp_write);
    }

    // save state (last tokens)
    const auto last_n_tokens_data_saved = std::vector<llama_token>(last_n_tokens_data);
    const auto n_past_saved = n_past;

    // first run
    printf("\n%s", params.prompt.c_str());

    for (auto i = 0; i < params.n_predict; i++) {
        auto logits = llama_get_logits(ctx);
        auto n_vocab = llama_n_vocab(ctx);
        std::vector<llama_token_data> candidates;
        candidates.reserve(n_vocab);
        for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
            candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
        }
        llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
        auto next_token = llama_sample_token(ctx, &candidates_p);
        auto next_token_str = llama_token_to_str(ctx, next_token);
        last_n_tokens_data.push_back(next_token);

        printf("%s", next_token_str);
        if (llama_eval(ctx, &next_token, 1, n_past, params.n_threads)) {
            fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
            return 1;
        }
        n_past += 1;
    }

    printf("\n\n");

    // free old model
    llama_free(ctx);

    // load new model
    auto ctx2 = llama_init_from_file(params.model.c_str(), lparams);

    // Load state (rng, logits, embedding and kv_cache) from file
    {
        FILE *fp_read = fopen("dump_state.bin", "rb");
        if (state_size != llama_get_state_size(ctx2)) {
            fprintf(stderr, "\n%s : failed to validate state size\n", __func__);
            return 1;
        }

        const size_t ret = fread(state_mem, 1, state_size, fp_read);
        if (ret != state_size) {
            fprintf(stderr, "\n%s : failed to read state\n", __func__);
            return 1;
        }

        llama_set_state_data(ctx2, state_mem);  // could also read directly from memory mapped file
        fclose(fp_read);
    }

    delete[] state_mem;

    // restore state (last tokens)
    last_n_tokens_data = last_n_tokens_data_saved;
    n_past = n_past_saved;

    // second run
    for (auto i = 0; i < params.n_predict; i++) {
        auto logits = llama_get_logits(ctx2);
        auto n_vocab = llama_n_vocab(ctx2);
        std::vector<llama_token_data> candidates;
        candidates.reserve(n_vocab);
        for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
            candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
        }
        llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
        auto next_token = llama_sample_token(ctx2, &candidates_p);
        auto next_token_str = llama_token_to_str(ctx2, next_token);
        last_n_tokens_data.push_back(next_token);

        printf("%s", next_token_str);
        if (llama_eval(ctx2, &next_token, 1, n_past, params.n_threads)) {
            fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
            return 1;
        }
        n_past += 1;
    }

    printf("\n\n");

    return 0;
}

Here is the code I used.

D:\llama.cpp_test>save-load-state.exe -m vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin -ngl 32
main: build = 589 (1fcdcc2)
llama.cpp: loading model from vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB
..................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

The quick brown fox jumps over the lazy dog.

<!-- InstanceEnd -->Visible transl

llama.cpp: loading model from vicuna-7B-1.1-ggml_q4_0-ggjt_v3.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB
.........................................................................................CUDA error 2 at D:\dev\pcbangstudio\workspace\llama.cpp\ggml-cuda.cu:935: out of memory

D:\llama.cpp_test>

I tried the Vicuna 7B model, which uses about 4 GB of VRAM, on a 3060 Ti (8 GB), also with #1530.

The llama_free function works well for CPU RAM.

For VRAM, it still does not work.

@JohannesGaessler
Collaborator

I added a fix in PR #1607, where I'm refactoring the CUDA code. However, I added a new CLI argument --tensor-split, and because of that the Python script that I used to reproduce the memory leak now seems to be broken:

ggml_init_cublas: found 1 CUDA devices:
  1. NVIDIA GeForce RTX 3090
pid: 2070536
pre_ram: 8135808K pre_gpu: 628 MiB
Fatal Python error: PyEval_RestoreThread: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: initialized

Current thread 0x00007fdd557c3740 (most recent call first):
  File "/home/johannesg/Projects/llama-cpp-python/llama_cpp/llama_cpp.py", line 207 in llama_context_default_params
  File "/home/johannesg/Projects/llama-cpp-python/llama_cpp/llama.py", line 128 in __init__
  File "/home/johannesg/Projects/llama.cpp/oom.py", line 29 in generate_text
  File "/home/johannesg/Projects/llama.cpp/oom.py", line 21 in wrapper
  File "/home/johannesg/Projects/llama.cpp/oom.py", line 34 in <module>
[1]    2070536 IOT instruction (core dumped)  python3 oom.py

Can I easily fix this on my end or will llamacpp-python need to be updated?

@nidhishs

I think llama-cpp-python needs to be updated. I briefly looked at the code that’s causing the error. Seems like we will need to update the default parameters being passed during initialization to llama.cpp. What do you think @gjmulder?

@edp1096
Contributor Author

edp1096 commented May 28, 2023

@JohannesGaessler It seems to work for CUDA. Does it also apply to OpenCL?

@gjmulder
Collaborator

@nidhishs @JohannesGaessler, I believe @abetlen's policy is to expose all parameters that llama.cpp exposes so they can be configured within python.

It is certainly required when doing apples-to-apples tests as we seem to be getting a number of "llama-cpp-python is slower than llama.cpp" issues.

@edp1096 edp1096 changed the title from "CUDA error, out of memory when reload." to "CUDA/OpenCL error, out of memory when reload." on May 28, 2023
@JohannesGaessler
Collaborator

JohannesGaessler commented May 28, 2023

Wait, I think I may have been using the wrong Python bindings. I was using this repository which worked for me to reproduce the bug. Can someone give me a quick rundown for the difference between this and abetlen's repository?

@JohannesGaessler
Collaborator

It seems to work for CUDA. Does it also apply to OpenCL?

I have not made any changes to OpenCL.

@JohannesGaessler
Collaborator

Disregard my previous post, I was using the correct repository.

@edp1096
Contributor Author

edp1096 commented Jun 7, 2023

CUDA now releases VRAM correctly. Thank you @JohannesGaessler.
@0cc4m I'm still looking forward to the OpenCL fix, but if you're busy, may I post a PR for it?

@0cc4m
Collaborator

0cc4m commented Jun 7, 2023

Go ahead, sure.

@edp1096
Contributor Author

edp1096 commented Jun 7, 2023

Ok, I will do it.

@edp1096
Contributor Author

edp1096 commented Jun 9, 2023

Done. @0cc4m, thank you for accepting it.

@edp1096 edp1096 closed this as completed Jun 9, 2023
@iactix

iactix commented Jul 19, 2023

Oh, this is closed? That probably explains why I'm still waiting for the memory leak fix in llama-cpp-python two months later.

@edp1096
Contributor Author

edp1096 commented Jul 20, 2023

@iactix Yes, the leak issues, at least the ones I encountered, were solved. If you still have a problem, you can open another issue.
