
Multi Threaded issue with CUDA compiled library 1.5.4 #1814

Open
bradmit opened this issue Jan 29, 2024 · 10 comments

bradmit (Contributor) commented Jan 29, 2024

I've recently compiled the 1.5.4 library with cuBLAS and I'm having an issue when running multiple whisper_full_with_state() calls concurrently.
I did not have this issue with the 1.5.1 library.

I re-compiled with DEBUG_CUDA_MALLOC and it produces the following output:

[29-01-2024 16:04:58:258] [INFO ] [cuda pool[0]: allocated 7680000 bytes at 302000000, used: 7680000]
[29-01-2024 16:04:58:263] [INFO ] [cuda pool[0]: allocated 7680000 bytes at 302753000, used: 15360000]
[29-01-2024 16:04:58:267] [INFO ] [cuda pool[0]: freed 7680000 bytes at 302000000]
[29-01-2024 16:04:58:272] [INFO ] [GGML_ASSERT: ggml-cuda.cu:6742: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])]

I can see it failed on the assert because the de-allocations are not happening in reverse order of the allocations. The pool is allocated "per device", but there is only one actual device (card). In this case, I have two instances running and the one that was executed first also finished first.

Is this some sort of "virtual" device where each call to whisper_full_with_state needs to specify a separate device, or is this an issue with the memory allocation?
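
To spell out what I think is happening (addresses taken from the debug output above, the interpretation is mine):

// Two transcriptions sharing the same per-device pool:
//
//   instance A: alloc 7680000 bytes at 302000000   (used =  7680000)
//   instance B: alloc 7680000 bytes at 302753000   (used = 15360000)
//   instance A: free  7680000 bytes at 302000000
//               -> GGML_ASSERT fires, because the pool expects the block at
//                  302753000 (the most recent allocation) to be freed first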

I also noticed that libcuda.so is missing at compile time if you don't have the driver installed. I don't have a GPU in the build host, so I had to copy the library onto it manually.

ggerganov (Owner):

Does it work with the latest master branch?

bradmit (Contributor, Author) commented Jan 29, 2024

I can confirm this still happens on the latest master:

[30-01-2024 09:09:12:174] [INFO ] [GGML_ASSERT: ggml-cuda.cu:7579: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])]

Line 7579 in ggml-cuda.cu:
// all deallocations must be in reverse order of the allocations
GGML_ASSERT(ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device]));

The way the allocations currently work, if the call that made the first allocation finishes before a later allocation has been freed, you're always going to hit this, right? With a single audio file at a time there is obviously no issue, but in a multi-threaded situation...

Is there something I'm missing in the call to whisper_full_with_state that would help with this? I haven't needed anything extra before; 1.5.1 was fine.
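
For reference, the invariant that the assert enforces behaves like a stack (LIFO) allocator. A simplified sketch of that behaviour (not the actual ggml-cuda code, names are made up):

#include <cassert>
#include <cstddef>
#include <cstdint>

// Stack-style pool: blocks must be freed in exact reverse order of allocation.
struct StackPool {
    uint8_t buffer[1 << 20];   // backing storage (the real pool reserves GPU address space)
    size_t  used = 0;          // bytes currently handed out

    void * alloc(size_t size) {
        void * ptr = buffer + used;
        used += size;
        return ptr;
    }

    void free(void * ptr, size_t size) {
        used -= size;
        // same invariant as the GGML_ASSERT quoted above: the freed block
        // must be the most recently allocated one still in use
        assert(ptr == (void *) (buffer + used));
    }
};

Two concurrent transcriptions that interleave their allocations will inevitably free a block that is not the most recent one, which is exactly what the assert catches.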

ggerganov (Owner):

Just to make sure - you are using a different whisper_state in the different threads, correct?

bradmit (Contributor, Author) commented Jan 30, 2024

Yes. In each thread that calls whisper_full_with_state, we retrieve the shared context by model type and then a fresh state for that model:

struct whisper_context* ctx = (struct whisper_context*)whisperInt->getContextByModel(modelTxtName);
struct whisper_state* state = (struct whisper_state*)whisperInt->getWhisperStateByModel(modelTxtName);

int res = whisper_full_with_state(ctx, state, wparams, (const float*)samples, count);

whisperInt->freeWhisperState(state);

Construction / free of state from whisperInt obj:

void* WhisperInt::getContextByModel(std::string model)
{
    auto ctxSearch = modelsContext.find(model);
    if (ctxSearch != modelsContext.end())
    {
        return ctxSearch->second;
    }
    return NULL;
}

void* WhisperInt::getWhisperStateByModel(std::string model)
{
    auto ctxSearch = modelsContext.find(model);
    if (ctxSearch != modelsContext.end())
    {
        return (void*)whisper_init_state((struct whisper_context*)ctxSearch->second);
    }
    return NULL;
}

void WhisperInt::freeWhisperState(void* state)
{
    whisper_free_state((struct whisper_state*)state);
}
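
Boiled down, the usage pattern is something like this (minimal sketch with our wrapper removed; assumes a model path and 16 kHz mono float samples are already available):

#include <functional>
#include <thread>
#include <vector>
#include "whisper.h"

// One shared whisper_context, one whisper_state per thread.
// This is the pattern that trips the CUDA pool assert in 1.5.4 but worked in 1.5.1.
static void transcribe(whisper_context * ctx, const std::vector<float> & samples) {
    whisper_state * state = whisper_init_state(ctx);   // per-thread state
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    whisper_full_with_state(ctx, state, wparams, samples.data(), (int) samples.size());

    whisper_free_state(state);
}

int main() {
    whisper_context * ctx = whisper_init_from_file_with_params(
        "models/ggml-base.en.bin", whisper_context_default_params());

    std::vector<float> audio_a /* = ... 16 kHz mono PCM ... */;
    std::vector<float> audio_b /* = ... 16 kHz mono PCM ... */;

    std::thread t1(transcribe, ctx, std::cref(audio_a));
    std::thread t2(transcribe, ctx, std::cref(audio_b));
    t1.join();
    t2.join();

    whisper_free(ctx);
    return 0;
}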

sammistq:

Hi, I also hit the same problem. It can be reproduced by passing --processors (or by calling whisper_full_parallel):

./main -m models/ggml-base.en.bin -f samples/jfk.wav -p 2

My current workaround is to force-disable VMM by setting device_vmm = 0 in ggml-cuda.cu.
Hope this can be fixed.
Thanks
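
To clarify the workaround (a local hack only; the exact spot in ggml-cuda.cu may differ between versions):

// ggml-cuda.cu: during initialization the backend queries whether the device
// supports CUDA virtual memory management and stores the result in device_vmm.
// Forcing it to 0 makes the backend fall back to the legacy buffer pool instead
// of the VMM pool that trips the reverse-order-free assert.
int device_vmm = 0;   // workaround: keep 0 instead of the value returned by
                      // cuDeviceGetAttribute(..., CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, ...)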

ggerganov (Owner):

Yes, so the CUDA backend is not completely thread-safe at this point. We will fix this, but I cannot provide an ETA at the moment.

@bradmit Does it work for you if you force device_vmm = 0 in ggml-cuda.cu?

@slaren Any ideas for quick workarounds, or do we just have to wait and fix this properly?

slaren (Collaborator) commented Jan 31, 2024

As you suggested, disabling the vmm allocator should fix the assert, but there are so many globals in the CUDA backend that are not synchronized that I can only imagine that if it works at all, it will be by chance.

ggerganov (Owner):

Ok, you can track the state of ggml becoming thread-safe in this issue: ggerganov/llama.cpp#3960

Sorry for the inconvenience - you might want to stick with whisper.cpp v1.5.1 for now

bradmit (Contributor, Author) commented Jan 31, 2024

No problemo. What you've done thus far is awesome.

slaren (Collaborator) commented Mar 27, 2024

The CUDA backend should be thread safe now.
