<a href="https://colab.research.google.com/github/casualcomputer/redis-rq-llm/blob/master/google_colab_rq_llm_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Large-Scale Querying of LLM Endpoints (T4 GPU) Using Redis Queue

We are testing the feasibility of querying a large language model (LLM) endpoint at scale using Redis Queue (RQ). This approach allows us to handle and process a high volume of queries efficiently by distributing them across multiple workers for parallel processing.

Our goal is to ensure the system can handle a substantial number of simultaneous requests, remain stable, and respond in a timely manner. By leveraging Redis Queue, we hope to optimize resource utilization, enhance scalability, and improve fault tolerance.

This test will help identify bottlenecks, determine practical limits, and guide necessary improvements to achieve efficient and reliable large-scale query handling.

It took the LLM 20 seconds to answer the first question. Interestingly, the responses to the remaining questions all came back at the same time, which suggests that RQ is distributing the workload. I expected GPU usage to be higher, but only 0.5 GB out of 15 GB was in use. Are there ways to increase GPU utilization to get faster responses? How can we explicitly configure RQ to improve performance in this respect? To be determined.

What's confusing is that in the CPU-only sessions, the RQ tasks ran in sequence when I expected them to run in parallel. Throughout the CPU run, usage stayed at 2-6 GB out of 16 GB. Are there ways to maximize the efficiency of RQ in this respect? How? Is there any need to explicitly reference the low-level hardware to achieve better performance? Does RQ work like a load balancer for servers, reading server stats and deciding how to route requests?
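One way to reason about the questions above: as the RQ documentation describes it, RQ is not a load balancer that inspects server statistics. Jobs wait in Redis, and each worker process pulls and runs one job at a time, so the degree of parallelism equals the number of workers that are running. The sketch below imitates that pull model with standard-library threads. It is an analogy only: `run_workers` and its parameters are made up for illustration, and real RQ workers are separate processes rather than threads.

```python
import queue
import threading
import time


def run_workers(jobs, num_workers, job_seconds=0.05):
    """Process jobs with num_workers threads that each pull one job at a time."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    handled = []  # (worker_index, job) pairs in completion order
    lock = threading.Lock()

    def worker(index):
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to pull, so this worker exits
            time.sleep(job_seconds)  # stand-in for a blocking HTTP call to the LLM
            with lock:
                handled.append((index, job))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled, time.monotonic() - start


one_worker, seconds_one = run_workers(list(range(8)), num_workers=1)
four_workers, seconds_four = run_workers(list(range(8)), num_workers=4)
print(f"1 worker: {seconds_one:.2f}s, 4 workers: {seconds_four:.2f}s")
```

With a single worker the eight jobs finish strictly in order, which would match the sequential behaviour seen in the CPU-only session if only one `rq worker` process was running.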

Note: I was not able to stop RQ with a warm stop. When I pressed the stop button, RQ re-ran all of the tasks. Why?

## Download folders and install packages

In [None]:
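# Install llama-cpp-python with cuBLAS, compatible with CUDA 12.2 (the CUDA driver build)
# `!set VAR=value` does not export anything in Colab's Linux shell, and each `!` line
# runs in its own subshell, so use the %env magic to set the build flags instead.
%env LLAMA_CUBLAS=1
%env CMAKE_ARGS=-DLLAMA_CUBLAS=on
%env FORCE_CMAKE=1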

#Install llama-cpp-python, cuda-enabled package
!python -m pip install llama-cpp-python==0.2.7 --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122

#Install pytorch-related, cuda-enabled package
!pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

Looking in indexes: https://pypi.org/simple, https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122
Collecting llama-cpp-python==0.2.7
  Downloading https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/wheels/llama_cpp_python-0.2.7%2Bcu122-cp310-cp310-manylinux_2_31_x86_64.whl (14.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.8/14.8 MB[0m [31m15.8 MB/s[0m eta [36m0:00:00[0m
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.7)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 919.8 kB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.2.7+cu122
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.0)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cud

In [None]:
! pip install redis rq requests fastapi

Collecting redis
  Downloading redis-5.0.6-py3-none-any.whl (252 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m252.0/252.0 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rq
  Downloading rq-1.16.2-py3-none-any.whl (90 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m90.9/90.9 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
Collecting fastapi
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.0/92.0 kB[0m [31m11.6 MB/s[0m eta [36m0:00:00[0m
Collecting starlette<0.38.0,>=0.37.2 (from fastapi)
  Downloading starlette-0.37.2-py3-none-any.whl (71 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.9/71.9 kB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
Collecting fastapi-cli>=0.0.2 (from fastapi)
  Downloading fastapi_cli-0.0.4-py3-none-any.whl (9.5 kB)
Collecting httpx>=0.23.0 (from fastapi)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
[

## Update Linux packages and install redis-server

In [None]:
!sudo apt-get update
!sudo apt-get install redis-server

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:7 https://ppa.la

In [None]:
!sudo service redis-server start

Starting redis-server: redis-server.


## Download model from Hugging Face

In [None]:
from huggingface_hub import hf_hub_download
from google.colab import userdata

model_name = "cjpais/llava-1.6-mistral-7b-gguf"
model_file = "llava-v1.6-mistral-7b.Q4_K_M.gguf"

# Save your Hugging Face access token as HF_TOKEN in the Colab secrets before you continue
model_path = hf_hub_download(model_name, filename=model_file, local_dir='/content/redis-rq-llm/models/', token=userdata.get('HF_TOKEN'))
print("Model path:", model_path)

llava-v1.6-mistral-7b.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

Model path: /content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf


## Create an API endpoint for the LLM

In [None]:
%%writefile fastapi_app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from llama_cpp import Llama

# Initialize the FastAPI app
app = FastAPI()

# Load the LLM model
model_path = "/content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf"
llm = Llama(model_path=model_path)

# Define request and response models
class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    answer: str

# Define the query endpoint
@app.post("/query", response_model=QueryResponse)
async def query_llm(request: QueryRequest):
    system_message = "You are a helpful assistant"
    user_message = f"Q: {request.question} A: "

    prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

    try:
        # Run the model to get the response
        output = llm(
            prompt,  # Prompt
            max_tokens=2000,  # Generate up to 2000 tokens
            stop=["Q:", "\n"],  # Stop before a new question starts, or at the first newline
            echo=False  # Do not echo the prompt back in the output
        )

        # Extract and return the response
        response_text = output["choices"][0]["text"].strip()

        # Ensure the response is trimmed properly
        response_text = response_text.split("[/INST]")[-1].strip()

        return QueryResponse(answer=response_text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run the FastAPI application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Template for sending a request
# curl -X POST "http://localhost:8000/query" -H "Content-Type: application/json" -d "{\"question\": \"Name the planets in the solar system?\"}"

Writing fastapi_app.py
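A possible explanation for the low GPU usage reported in the introduction is that `Llama(model_path=model_path)` is called without `n_gpu_layers`, which, as far as I know, defaults to 0 in llama-cpp-python and keeps every layer on the CPU. The sketch below only assembles constructor arguments. `llama_kwargs` is a hypothetical helper, and the assumption that `-1` means "offload all layers" should be checked against the pinned 0.2.7 release (a large positive layer count is an alternative).

```python
def llama_kwargs(model_path: str, gpu_layers: int = -1) -> dict:
    """Assemble keyword arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        # Assumption: -1 asks llama-cpp-python to offload every layer to the GPU.
        "n_gpu_layers": gpu_layers,
    }


kwargs = llama_kwargs("/content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf")
print(kwargs)

# In fastapi_app.py this would replace the existing constructor call:
# llm = Llama(**kwargs)
```

Offloading only helps if the installed wheel was actually built with CUDA support, so it is worth checking the model-loading log for lines that mention layers being offloaded.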


## Quietly serve the LLM API in the background

In [None]:
import subprocess

# Start the FastAPI server in the background
fastapi_process = subprocess.Popen(['python', 'fastapi_app.py'])

# You can also add a brief sleep to ensure the server starts before continuing
import time
time.sleep(5)  # Sleep for 5 seconds to give the server time to start
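A fixed five-second sleep may be too short when the 4.37 GB model is still loading. A more robust alternative, sketched here under the assumption that the server listens on `localhost:8000` as in `fastapi_app.py`, is to poll the port until it accepts connections. `wait_for_port` is an illustrative helper, not part of any library used in this notebook.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet, try again shortly
    return False


# Usage after starting the server process:
# if not wait_for_port("localhost", 8000):
#     raise RuntimeError("FastAPI server did not start in time")
```

Because `fastapi_app.py` loads the model at module level before calling `uvicorn.run`, an open port should imply that the model has finished loading.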

Test the connection to the LLM's API endpoint:

In [None]:
! curl -X POST "http://localhost:8000/query" -H "Content-Type: application/json" -d "{\"question\": \"Name the planets in the solar system?\"}"

{"answer":"Sure, here's a list of the planets in our solar system, from the inner planet to the outer planet:"}
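The answer above ends at a colon, before any planet is listed. That is consistent with the `"\n"` entry in the `stop` list of `fastapi_app.py`: generation is cut at the first newline. The sketch below illustrates the truncation rule in plain Python. `apply_stop` and the `raw` string are made up for illustration, and the real stop handling happens inside llama-cpp-python.

```python
def apply_stop(text: str, stop: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for sequence in stop:
        position = text.find(sequence)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]


# Hypothetical untruncated model output
raw = "Sure, here's a list of the planets:\n1. Mercury\n2. Venus"
print(apply_stop(raw, ["Q:", "\n"]))  # prints: Sure, here's a list of the planets:
```

Removing `"\n"` from the stop list would likely let multi-line answers through, at the cost of longer generations.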

## Write a function to query the LLM API and log progress

In [None]:
%%writefile tasks.py
import requests
import time
from rq import get_current_job

def process_question(question, url="http://localhost:8000/query"):
    job = get_current_job()
    worker_id = job.origin  # Note: job.origin is the name of the queue the job was enqueued on (e.g. "default"), not a worker ID
    start_time = time.time()
    response = requests.post(url, json={"question": question})
    end_time = time.time()
    duration = end_time - start_time

    result = {
        "question": question,
        "response": response.text,
        "start_time": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)),
        "end_time": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time)),
        "duration": duration,
        "worker_id": worker_id
    }

    with open("responses.txt", "a") as file:
        file.write(f"Worker ID: {result['worker_id']}\n")
        file.write(f"Question: {result['question']}\n")
        file.write(f"Response: {result['response']}\n")
        file.write(f"Start Time: {result['start_time']}\n")
        file.write(f"End Time: {result['end_time']}\n")
        file.write(f"Duration: {result['duration']:.2f} seconds\n")
        file.write("-" * 60 + "\n")

    print(result)
    print("-" * 60)

    return result

Writing tasks.py
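`process_question` stores `response.text`, which is the raw JSON body (for example `{"answer":"..."}`) rather than the answer itself. A minimal sketch of pulling the answer out, with a fallback to the raw body for error responses, could look like this. `extract_answer` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import json


def extract_answer(body: str) -> str:
    """Return the "answer" field of a JSON body, or the body unchanged if absent."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body  # not JSON, e.g. a plain-text error page
    if isinstance(payload, dict) and "answer" in payload:
        return str(payload["answer"])
    return body  # JSON without an answer, e.g. {"detail": "..."} from an HTTP 500


print(extract_answer('{"answer":"No, computers cannot think or fall in love."}'))
```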


## Implement a Redis Queue

In [None]:
%%writefile main.py
from rq import Queue
from redis import Redis
from tasks import process_question

# Redis connection
redis_conn = Redis()

# RQ Queue
queue = Queue(connection=redis_conn)

# List of questions
questions = [
    "Is the mind the same as the brain, or do we have souls?",
    "Can computers think, or fall in love?",
    "Can computers be creative?",
    "What is consciousness?",
    "Can we really know what it feels like to be a bat?",
    "When you have a toothache, is the pain in your mouth or in your brain?",
    "What is an emotion?",
    "Is love just a feeling?",
    "How is love different from passion or sexual desire?",
    "Are emotions irrational?"]

# Enqueue each question to be processed
for index, question in enumerate(questions, start=1):
    job = queue.enqueue(process_question, question)
    print(f"Enqueued job {index}: {job.id}")

Overwriting main.py


In [None]:
!python main.py
!rq worker

## The following didn't work...
# # Start multiple RQ workers
# num_workers = 4
# worker_processes = []
# for _ in range(num_workers):
#     worker_process = subprocess.Popen(['rq', 'worker'])
#     worker_processes.append(worker_process)

# # Run the main script to enqueue jobs
# main_process = subprocess.Popen(['python', 'main.py'])

{'question': 'Can computers think, or fall in love?', 'response': '{"answer":"No, computers cannot think or fall in love. They operate based on predefined instructions and algorithms. While they can perform tasks that require intelligence, such as playing chess or recognizing faces, this does not mean they possess true intelligence or consciousness. In essence, while computers are incredibly powerful tools that can accomplish a wide range of tasks, they do not have the ability to think, feel, or experience the world in the same way that humans do."}', 'start_time': '2024-06-25 00:44:19', 'end_time': '2024-06-25 00:45:53', 'duration': 93.11730456352234, 'worker_id': 'default'}
------------------------------------------------------------
00:45:53 default: Job OK (dec5b1b7-6f2f-4c22-a81a-c390c30d0e09)
00:45:53 Result is kept for 500 seconds
00:45:53 Worker rq:worker:c95f68086fe942c29722423a733cc6ae: stopping on request
00:45:53 Unsubscribing from channel rq
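Two things may explain the stop-button behaviour noted in the introduction. A plain `rq worker` keeps listening until it is interrupted, and the cell runs `python main.py` first, so every re-run of the cell enqueues all ten questions again. A sketch of an alternative is to enqueue once and then start several workers in burst mode (`rq worker --burst`), which exit on their own when the queue is empty. `burst_worker_command` and `run_burst_workers` are illustrative names, and the queue name `default` is assumed from the log output above.

```python
import subprocess


def burst_worker_command(queue_name: str = "default") -> list[str]:
    """Build the command line for one RQ worker that exits when the queue is empty."""
    return ["rq", "worker", "--burst", queue_name]


def run_burst_workers(num_workers: int, queue_name: str = "default") -> list[int]:
    """Start num_workers burst-mode workers and wait for all of them to finish."""
    processes = [
        subprocess.Popen(burst_worker_command(queue_name))
        for _ in range(num_workers)
    ]
    return [process.wait() for process in processes]  # exit codes, one per worker


print(burst_worker_command())

# Usage, after `python main.py` has enqueued the jobs:
# exit_codes = run_burst_workers(4)
```

Even with several workers, a single `Llama` instance behind one server process probably handles one generation at a time, so extra workers may not shorten the total run unless the endpoint itself can serve requests concurrently.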