<a href="https://colab.research.google.com/github/casualcomputer/redis-rq-llm/blob/master/google_colab_rq_llm_cpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Large-Scale Querying of LLM Endpoints (CPU only) Using Redis Queue

We are testing the feasibility of querying a large language model (LLM) endpoint at scale using Redis Queue (RQ). This approach allows us to handle and process a high volume of queries efficiently by distributing them across multiple workers for parallel processing.

Our goal is to ensure the system can handle many simultaneous requests, remain stable, and respond in a timely manner. By using Redis Queue, we hope to improve resource utilization, scalability, and fault tolerance.

This test will help identify bottlenecks, determine practical limits, and guide the improvements needed for efficient and reliable large-scale query handling. During the experiment, however, results came back one after another, even though we expected some queries to be processed in parallel. A likely reason is that the notebook starts a single `rq worker`, and an RQ worker processes one job at a time.

## Download folders and install packages

In [None]:
! git clone https://github.com/casualcomputer/redis-rq-llm.git  # clone the repo only to get its folder structure

In [26]:
! pip install redis rq requests fastapi llama-cpp-python

Collecting fastapi
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.79.tar.gz (50.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting starlette<0.38.0,>=0.37.2 (from fastapi)
  Downloading starlette-0.37.2-py3-none-any.whl (71 kB)
Collecting fastapi-cli>=0.0.2 (from fastapi)
  Downloading fastapi_cli-0.0.4-py3-none-any.whl (9.5 kB)
Collecting httpx>=0.23.0 (from fa

## Update Linux packages and install redis-server

In [None]:
!sudo apt-get update
!sudo apt-get install -y redis-server

In [12]:
!sudo service redis-server start

Starting redis-server: redis-server.


## Download model from Hugging Face

In [36]:
from huggingface_hub import hf_hub_download
from google.colab import userdata

model_name = "cjpais/llava-1.6-mistral-7b-gguf"
model_file = "llava-v1.6-mistral-7b.Q4_K_M.gguf"

# Save your Hugging Face access token as HF_TOKEN in the Colab secrets before you continue
model_path = hf_hub_download(model_name, filename=model_file, local_dir='/content/redis-rq-llm/models/', token=userdata.get('HF_TOKEN'))
print("Model path:", model_path)

llava-v1.6-mistral-7b.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

Model path: /content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf


## Create an API endpoint for the LLM

In [43]:
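# Note: `!cd` runs in a subshell, so the working directory stays /content (as the listing shows); use `%cd` to change it persistently
!cd /content/redis-rq-llm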
!ls

fastapi_app.py	main.py  __pycache__  redis-rq-llm  sample_data  tasks.py


In [37]:
%%writefile fastapi_app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from llama_cpp import Llama

# Initialize the FastAPI app
app = FastAPI()

# Load the LLM model
model_path = "/content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf"
llm = Llama(model_path=model_path)

# Define request and response models
class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    answer: str

# Define the query endpoint
@app.post("/query", response_model=QueryResponse)
async def query_llm(request: QueryRequest):
    system_message = "You are a helpful assistant"
    user_message = f"Q: {request.question} A: "

    prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

    try:
        # Run the model to get the response
        output = llm(
            prompt,  # Prompt
            max_tokens=2000,  # Generate up to 2000 tokens
            stop=["Q:", "\n"],  # Stop at the first newline, or just before the model would start a new question
            echo=False  # Do not echo the prompt back in the output
        )

        # Extract and return the response
        response_text = output["choices"][0]["text"].strip()

        # Ensure the response is trimmed properly
        response_text = response_text.split("[/INST]")[-1].strip()

        return QueryResponse(answer=response_text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run the FastAPI application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Template for sending a request
# curl -X POST "http://localhost:8000/query" -H "Content-Type: application/json" -d "{\"question\": \"Name the planets in the solar system?\"}"

Overwriting fastapi_app.py
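
The prompt assembly inside `query_llm` can be pulled out into a small pure function, which makes it easy to inspect the exact string sent to the model without loading it. The sketch below mirrors the template used in `fastapi_app.py`; the helper name `build_prompt` is hypothetical and not part of the repository.

```python
def build_prompt(question: str,
                 system_message: str = "You are a helpful assistant") -> str:
    """Assemble the instruction-style prompt used by the /query endpoint."""
    # Same "Q: ... A: " wrapping as in query_llm
    user_message = f"Q: {question} A: "
    # Same layout as the triple-quoted template: one line per section
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{user_message} [/INST]"
    )
```

Calling `print(build_prompt("What is consciousness?"))` shows the four-line prompt, which helps when checking whether the template suits the model being served.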


## Quietly serve the LLM API in the background

In [42]:
import subprocess
import time

# Start the FastAPI server in the background
fastapi_process = subprocess.Popen(['python', 'fastapi_app.py'])

# Give the server a few seconds to start before continuing
# (loading a multi-gigabyte model may take longer than this)
time.sleep(5)
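
A fixed `time.sleep(5)` may be too short when the model takes a while to load. As a rough alternative, the sketch below polls the port until something accepts a TCP connection. The function name `wait_for_port` and its default timeout are assumptions for illustration. Because `fastapi_app.py` loads the model before calling `uvicorn.run`, an open port should mean the model has finished loading, though this sketch does not verify that the endpoint itself responds correctly.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In the notebook this could replace the sleep, for example `assert wait_for_port(port=8000)` right after the `subprocess.Popen` call.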

## Write a function to query LLM API and log progress

In [38]:
%%writefile tasks.py
import requests
import time

def process_question(question, url="http://localhost:8000/query"):
    """
    Send a question to the FastAPI server and log the response time.

    Args:
        question (str): The question to send.
        url (str): The API endpoint to send the question to.

    Returns:
        dict: A dictionary containing the question, response, and timing information.
    """
    # Record the start time
    start_time = time.time()

    # Send the POST request to the FastAPI server
    response = requests.post(url, json={"question": question})

    # Record the end time
    end_time = time.time()

    # Calculate the duration taken to get the response
    duration = end_time - start_time

    # Create a result dictionary
    result = {
        "question": question,
        "response": response.text,
        "start_time": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)),
        "end_time": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time)),
        "duration": duration
    }

    # Print the result (or save it to a log, etc.)
    print(result)
    print("-" * 60)  # Print a separator line for clarity

    return result


Overwriting tasks.py
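
`process_question` stores `response.text`, which is the raw JSON body rather than the answer itself. A minimal sketch of pulling out the `answer` field defined by `QueryResponse` follows; the helper name `extract_answer` is hypothetical, and falling back to the raw text for error bodies is a design choice made here, not something the notebook prescribes.

```python
import json


def extract_answer(response_text: str) -> str:
    """Return the "answer" field from a /query response body, or the raw text."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # Not JSON at all: hand the body back unchanged
        return response_text
    if isinstance(payload, dict) and "answer" in payload:
        return str(payload["answer"])
    # JSON without an "answer" key, such as FastAPI's {"detail": ...} error body
    return response_text
```

For instance, `extract_answer(result["response"])` could be applied to each dictionary returned by `process_question` before logging it.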


## Implement a Redis Queue

In [39]:
%%writefile main.py
from rq import Queue
from redis import Redis
from tasks import process_question

# Redis connection
redis_conn = Redis()

# RQ Queue
queue = Queue(connection=redis_conn)

# List of questions
questions = [
    "Is the mind the same as the brain, or do we have souls?",
    "Can computers think, or fall in love?",
    "Can computers be creative?",
    "What is consciousness?",
    "Can we really know what it feels like to be a bat?",
    "When you have a toothache, is the pain in your mouth or in your brain?",
    "What is an emotion?",
    "Is love just a feeling?",
    "How is love different from passion or sexual desire?",
    "Are emotions irrational?"]

# Enqueue each question to be processed
for index, question in enumerate(questions, start=1):
    job = queue.enqueue(process_question, question)
    print(f"Enqueued job {index}: {job.id}")

Overwriting main.py
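
Once the enqueued jobs have finished, the `start_time` and `end_time` values in each result dictionary can be used to check whether any jobs actually overlapped. The sketch below assumes the result dictionaries have already been collected into a list (how they are gathered is left open) and that the timestamps use the format written by `tasks.py`. The name `max_concurrency` is hypothetical. Timestamps have one-second resolution, so a job that starts in the same second another one ends is treated as not overlapping.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # same format tasks.py uses for start_time/end_time


def max_concurrency(results):
    """Return the largest number of jobs running at the same moment."""
    intervals = [
        (datetime.strptime(r["start_time"], TIME_FORMAT),
         datetime.strptime(r["end_time"], TIME_FORMAT))
        for r in results
    ]
    peak = 0
    # The peak overlap always occurs at the instant some job starts
    for i, (start_i, _) in enumerate(intervals):
        running = sum(
            1
            for j, (start_j, end_j) in enumerate(intervals)
            # Count job i itself plus every job active when it starts
            if j == i or start_j <= start_i < end_j
        )
        peak = max(peak, running)
    return peak
```

A return value of 1 means the jobs ran strictly one after another, which matches the sequential behaviour noted in the introduction; a larger value would indicate that some requests were in flight at the same time.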


In [44]:
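# Redis is already running as a service (started earlier), so this call only reports "Address already in use"
!redis-server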
!python main.py
!rq worker

14292:C 24 Jun 2024 23:13:28.807 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
14292:C 24 Jun 2024 23:13:28.807 # Redis version=6.0.16, bits=64, commit=00000000, modified=0, pid=14292, just started
14292:M 24 Jun 2024 23:13:28.815 # Could not create server TCP listening socket *:6379: bind: Address already in use
Enqueued job 1: acc8f419-9c6f-498d-bea6-43b73446c50b
Enqueued job 2: bab216c4-c52e-4fce-857a-67ef679541ea
Enqueued job 3: 80573f1e-8026-4ac4-8e23-0591ff4a74b3
Enqueued job 4: b13b88ac-609d-4cb4-83b0-71d5a56fa431
Enqueued job 5: 6247e529-c20a-42b8-91bf-4569469f4d6f
Enqueued job 6: 63359c46-85e2-456a-a517-b6a85a0a6164
Enqueued job 7: 8616e1ee-3bae-4e71-87c7-9ec992c18ed4
Enqueued job 8: cd497b46-f7d3-4475-97dc-6a2c19ca1191
Enqueued job 9: 541771c3-8157-4488-8723-92df21dcdbf1
Enqueued job 10: dd203c91-f41f-4872-8979-eac242acbfed
23:13:30 Worker rq:worker:1c52093879f34d779dbdf6fab44d4807 started with PID 14299, version 1.16.2
23:13:30 Subscribing to channel rq:pubsub:1c52093879f3