
Multi Threaded issue with CUDA compiled library 1.5.4 #1814

Open
bradmit opened this issue Jan 29, 2024 · 10 comments

bradmit (Contributor) commented Jan 29, 2024

I've recently compiled the 1.5.4 library with cuBLAS and I'm having an issue when running multiple whisper_full_with_state() calls concurrently.
I did not have this issue with the 1.5.1 library.

I re-compiled with DEBUG_CUDA_MALLOC and it produces the following output:

[29-01-2024 16:04:58:258] [INFO ] [cuda pool[0]: allocated 7680000 bytes at 302000000, used: 7680000]
[29-01-2024 16:04:58:263] [INFO ] [cuda pool[0]: allocated 7680000 bytes at 302753000, used: 15360000]
[29-01-2024 16:04:58:267] [INFO ] [cuda pool[0]: freed 7680000 bytes at 302000000]
[29-01-2024 16:04:58:272] [INFO ] [GGML_ASSERT: ggml-cuda.cu:6742: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])]

I can see it failed on the assert because the de-allocations are not happening in reverse order of the allocations. The pool is allocated "per device", but there is only one actual device (card). In this case, I have two instances running and the one that was executed first also finished first.

Is this some sort of "virtual" device where each call to whisper_full_with_state needs to specify a separate device, or is this an issue with the memory allocation?
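
To spell out what I think is happening (addresses taken from the debug output above, the interpretation is mine):

// Two transcriptions sharing the same per-device pool:
//
//   instance A: alloc 7680000 bytes at 302000000   (used =  7680000)
//   instance B: alloc 7680000 bytes at 302753000   (used = 15360000)
//   instance A: free  7680000 bytes at 302000000
//               -> GGML_ASSERT fires, because the pool expects the block at
//                  302753000 (the most recent allocation) to be freed first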

I also noticed that libcuda.so is missing at compile time if you don't have the driver installed. I don't have a GPU in the build host, so I had to copy the library onto it manually.

ggerganov (Owner):

Does it work with the latest master branch?

bradmit (Contributor, Author) commented Jan 29, 2024

I can confirm this still happens on the latest master:

[30-01-2024 09:09:12:174] [INFO ] [GGML_ASSERT: ggml-cuda.cu:7579: ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device])]

Line 7579 in ggml-cuda.cu:
// all deallocations must be in reverse order of the allocations
GGML_ASSERT(ptr == (void *) (g_cuda_pool_addr[device] + g_cuda_pool_used[device]));

The way the allocations currently work, if the call that made the first allocation finishes before a later allocation has been freed, you're always going to hit this, right? With a single audio file at a time there is obviously no issue, but in a multi-threaded situation...

Is there something I'm missing in the call to whisper_full_with_state that would help with this? I haven't needed anything extra before; 1.5.1 was fine.
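
For reference, the invariant that the assert enforces behaves like a stack (LIFO) allocator. A simplified sketch of that behaviour (not the actual ggml-cuda code, names are made up):

#include <cassert>
#include <cstddef>
#include <cstdint>

// Stack-style pool: blocks must be freed in exact reverse order of allocation.
struct StackPool {
    uint8_t buffer[1 << 20];   // backing storage (the real pool reserves GPU address space)
    size_t  used = 0;          // bytes currently handed out

    void * alloc(size_t size) {
        void * ptr = buffer + used;
        used += size;
        return ptr;
    }

    void free(void * ptr, size_t size) {
        used -= size;
        // same invariant as the GGML_ASSERT quoted above: the freed block
        // must be the most recently allocated one still in use
        assert(ptr == (void *) (buffer + used));
    }
};

Two concurrent transcriptions that interleave their allocations will inevitably free a block that is not the most recent one, which is exactly what the assert catches.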

ggerganov (Owner):

Just to make sure - you are using a different whisper_state in the different threads, correct?

bradmit (Contributor, Author) commented Jan 30, 2024

Yes. In each thread that calls whisper_full_with_state, we retrieve the shared context by model type and then a fresh state for that model:

struct whisper_context* ctx = (struct whisper_context*)whisperInt->getContextByModel(modelTxtName);
struct whisper_state* state = (struct whisper_state*)whisperInt->getWhisperStateByModel(modelTxtName);

int res = whisper_full_with_state(ctx, state, wparams, (const float*)samples, count);

whisperInt->freeWhisperState(state);

Construction / free of state from whisperInt obj:

void* WhisperInt::getContextByModel(std::string model)
{
    auto ctxSearch = modelsContext.find(model);
    if (ctxSearch != modelsContext.end())
    {
        return ctxSearch->second;
    }
    return NULL;
}

void* WhisperInt::getWhisperStateByModel(std::string model)
{
    auto ctxSearch = modelsContext.find(model);
    if (ctxSearch != modelsContext.end())
    {
        return (void*)whisper_init_state((struct whisper_context*)ctxSearch->second);
    }
    return NULL;
}

void WhisperInt::freeWhisperState(void* state)
{
    whisper_free_state((struct whisper_state*)state);
}
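
Boiled down, the usage pattern is something like this (minimal sketch with our wrapper removed; assumes a model path and 16 kHz mono float samples are already available):

#include <functional>
#include <thread>
#include <vector>
#include "whisper.h"

// One shared whisper_context, one whisper_state per thread.
// This is the pattern that trips the CUDA pool assert in 1.5.4 but worked in 1.5.1.
static void transcribe(whisper_context * ctx, const std::vector<float> & samples) {
    whisper_state * state = whisper_init_state(ctx);   // per-thread state
    whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    whisper_full_with_state(ctx, state, wparams, samples.data(), (int) samples.size());

    whisper_free_state(state);
}

int main() {
    whisper_context * ctx = whisper_init_from_file_with_params(
        "models/ggml-base.en.bin", whisper_context_default_params());

    std::vector<float> audio_a /* = ... 16 kHz mono PCM ... */;
    std::vector<float> audio_b /* = ... 16 kHz mono PCM ... */;

    std::thread t1(transcribe, ctx, std::cref(audio_a));
    std::thread t2(transcribe, ctx, std::cref(audio_b));
    t1.join();
    t2.join();

    whisper_free(ctx);
    return 0;
}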

sammistq:

Hi, I also hit the same problem. It can be reproduced by passing --processors (or by calling whisper_full_parallel):

./main -m models/ggml-base.en.bin -f samples/jfk.wav -p 2

My current workaround is to force-disable VMM by setting device_vmm = 0 in ggml-cuda.cu.
Hope this can be fixed.
Thanks
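
To clarify the workaround (a local hack only; the exact spot in ggml-cuda.cu may differ between versions):

// ggml-cuda.cu: during initialization the backend queries whether the device
// supports CUDA virtual memory management and stores the result in device_vmm.
// Forcing it to 0 makes the backend fall back to the legacy buffer pool instead
// of the VMM pool that trips the reverse-order-free assert.
int device_vmm = 0;   // workaround: keep 0 instead of the value returned by
                      // cuDeviceGetAttribute(..., CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, ...)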

ggerganov (Owner):

Yes, so the CUDA backend is not completely thread-safe at this point. We will fix this, but I cannot provide an ETA at the moment.

@bradmit Does it work for you if you force device_vmm = 0 in ggml-cuda.cu?

@slaren Any ideas for quick workarounds, or do we just have to wait and fix this properly?

slaren (Collaborator) commented Jan 31, 2024

As you suggested, disabling the vmm allocator should fix the assert, but there are so many globals in the CUDA backend that are not synchronized that I can only imagine that if it works at all, it will be by chance.

ggerganov (Owner):

Ok, you can track the state of ggml becoming thread-safe in this issue: ggerganov/llama.cpp#3960

Sorry for the inconvenience - you might want to stick with whisper.cpp v1.5.1 for now

bradmit (Contributor, Author) commented Jan 31, 2024

No problemo. What you've done thus far is awesome.

slaren (Collaborator) commented Mar 27, 2024

The CUDA backend should be thread safe now.
