Description:
I am using the llama-cpp-python library to handle multiple simultaneous user requests. However, when simulating two users making requests at the same time, the model freezes and does not respond. I suspect there might be an issue with multi-threading or concurrency.
Steps to Reproduce:
use model:unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
Scene 1: Single User Request
Initialize the model using llama-cpp-python.
Send a single user request.
Observe that the model responds correctly and outputs the expected result.
Scene 2: Simulating Two Concurrent User Requests
Initialize the model using llama-cpp-python.
Simulate two users by sending two requests at the same time (in separate threads or processes).
Observe that the model freezes and does not respond to either request, causing a timeout or hanging.
Expected Behavior:
The model should handle both requests concurrently, providing responses to both users without freezing or encountering delays.
Actual Behavior:
When simulating two users, the model freezes and does not respond to either request.
System Information:
Operating System: [ Ubuntu 20.04]
Python version: [3.10]
llama-cpp-python version: [0.3.7]
Hardware: [4090*4]
Additional Information:
I suspect this issue is related to multi-threading or multi-processing, as the model might be sharing resources (e.g., memory, GPU) between threads or processes. I have tried running both requests in separate threads, but it results in a freeze. This problem might be caused by a lack of thread-safety in the library or resource contention.
Questions:
Is llama-cpp-python thread-safe, and does it support concurrent requests in a multi-threaded environment?
Are there recommended practices or solutions for handling multiple concurrent user requests?
Would using a multi-process approach be a better solution for this use case?
I would appreciate any insights or guidance on how to resolve this issue.
Thank you!
Description:
I am using the llama-cpp-python library to handle multiple simultaneous user requests. However, when simulating two users making requests at the same time, the model freezes and does not respond. I suspect there might be an issue with multi-threading or concurrency.
Steps to Reproduce:
use model:unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf
Scene 1: Single User Request
Initialize the model using llama-cpp-python.
Send a single user request.
Observe that the model responds correctly and outputs the expected result.
Scene 2: Simulating Two Concurrent User Requests
Initialize the model using llama-cpp-python.
Simulate two users by sending two requests at the same time (in separate threads or processes).
Observe that the model freezes and does not respond to either request, causing a timeout or hanging.
Expected Behavior:
The model should handle both requests concurrently, providing responses to both users without freezing or encountering delays.
Actual Behavior:
When simulating two users, the model freezes and does not respond to either request.
System Information:
Operating System: [ Ubuntu 20.04]
Python version: [3.10]
llama-cpp-python version: [0.3.7]
Hardware: [4090*4]
Additional Information:
I suspect this issue is related to multi-threading or multi-processing, as the model might be sharing resources (e.g., memory, GPU) between threads or processes. I have tried running both requests in separate threads, but it results in a freeze. This problem might be caused by a lack of thread-safety in the library or resource contention.
Questions:
Is llama-cpp-python thread-safe, and does it support concurrent requests in a multi-threaded environment?
Are there recommended practices or solutions for handling multiple concurrent user requests?
Would using a multi-process approach be a better solution for this use case?
I would appreciate any insights or guidance on how to resolve this issue.
Thank you!