<a href="https://colab.research.google.com/github/casualcomputer/redis-rq-llm/blob/master/google_colab_streamlitrq_cpu_harriet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing Large-Scale Querying of LLM Endpoints (CPU only) Using Redis Queue

We are testing the feasibility of querying a large language model (LLM) endpoint at scale using Redis Queue (RQ). This approach allows us to handle and process a high volume of queries efficiently by distributing them across multiple workers for parallel processing.

Our goal is to ensure the system can manage substantial simultaneous requests, maintain stability, and provide timely responses. By leveraging Redis Queue, we hope to optimize resource utilization, enhance scalability, and improve fault tolerance.

This test helps identify bottlenecks, determine practical limits, and guide the improvements needed for efficient, reliable large-scale query handling. During the experiment, however, we observed that results were returned one after another, even though we expected some queries to be processed simultaneously.

## Download folders and install packages

In [1]:
! pip install redis rq requests fastapi llama-cpp-python

Collecting redis
  Downloading redis-5.0.7-py3-none-any.whl (252 kB)
Collecting rq
  Downloading rq-1.16.2-py3-none-any.whl (90 kB)
Collecting fastapi
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.79.tar.gz (50.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ...

## Update linux packages and install redis-server

In [2]:
!sudo apt-get update
!sudo apt-get install -y redis-server

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease

In [3]:
!sudo service redis-server start

Starting redis-server: redis-server.


## Download model from Hugging Face

In [5]:
from huggingface_hub import hf_hub_download
from google.colab import userdata

model_name = "cjpais/llava-1.6-mistral-7b-gguf"
model_file = "llava-v1.6-mistral-7b.Q4_K_M.gguf"

# Save your Hugging Face access token as HF_TOKEN in Colab secrets before you continue
model_path = hf_hub_download(model_name, filename=model_file, local_dir='/content/redis-rq-llm/models/', token=userdata.get('HF_TOKEN'))
print("Model path:", model_path)

llava-v1.6-mistral-7b.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

Model path: /content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf


## Create an API endpoint for the LLM

In [6]:
%%writefile fastapi_app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from llama_cpp import Llama

# Initialize the FastAPI app
app = FastAPI()

# Load the LLM model
model_path = "/content/redis-rq-llm/models/llava-v1.6-mistral-7b.Q4_K_M.gguf"
llm = Llama(model_path=model_path)

# Define request and response models
class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    answer: str

# Define the query endpoint
@app.post("/query", response_model=QueryResponse)
async def query_llm(request: QueryRequest):
    system_message = "You are a helpful assistant"
    user_message = f"Q: {request.question} A: "

    prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""

    try:
        # Run the model to get the response
        output = llm(
            prompt,  # Prompt
            max_tokens=2000,  # Generate up to 2000 tokens
            stop=["Q:", "\n"],  # Stop at the first newline, or just before the model would start a new "Q:"
            echo=False  # Do not echo the prompt back in the output
        )

        # Extract and return the response
        response_text = output["choices"][0]["text"].strip()

        # Ensure the response is trimmed properly
        response_text = response_text.split("[/INST]")[-1].strip()

        return QueryResponse(answer=response_text)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# Run the FastAPI application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Template for sending a request
# curl -X POST "http://localhost:8000/query" -H "Content-Type: application/json" -d "{\"question\": \"Name the planets in the solar system?\"}"

Writing fastapi_app.py
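
One possible contributor to the sequential results noted in the introduction: `query_llm` is declared `async def` but calls `llm(...)` synchronously, so the event loop cannot start another request until the current one finishes. The sketch below illustrates that effect with the standard library only. `fake_inference` is a hypothetical stand-in for the model call (it just sleeps), and the other function names are ours, not part of FastAPI or llama-cpp-python.

```python
import asyncio
import time


def fake_inference(seconds: float) -> str:
    """Hypothetical stand-in for a blocking model call (it only sleeps)."""
    time.sleep(seconds)
    return "done"


async def blocking_handler(seconds: float) -> str:
    # Blocking call made directly inside a coroutine: the event loop
    # cannot switch to another request until this returns.
    return fake_inference(seconds)


async def threaded_handler(seconds: float) -> str:
    # Same call handed to a worker thread, so the event loop stays free.
    return await asyncio.to_thread(fake_inference, seconds)


async def time_handlers(handler, n_requests: int, seconds: float) -> float:
    """Run n_requests concurrent calls to handler and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(seconds) for _ in range(n_requests)))
    return time.perf_counter() - start


def compare(n_requests: int = 3, seconds: float = 0.2):
    """Return (blocking_elapsed, threaded_elapsed) for the two handler styles."""
    blocking = asyncio.run(time_handlers(blocking_handler, n_requests, seconds))
    threaded = asyncio.run(time_handlers(threaded_handler, n_requests, seconds))
    return blocking, threaded
```

Calling `compare()` should show the blocking variant taking roughly `n_requests * seconds`, while the threaded variant takes roughly `seconds`. This is only an illustration: real inference is CPU-bound, and we have not verified that a single `Llama` instance can safely serve concurrent calls, so moving the call to a thread may not help on its own. `asyncio.run` is meant for a plain script; inside a notebook cell, use `await time_handlers(...)` directly instead.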


## Quietly serve the LLM API in the background

In [8]:
import subprocess

# Start the FastAPI server in the background
fastapi_process = subprocess.Popen(['python', 'fastapi_app.py'])

# You can also add a brief sleep to ensure the server starts before continuing
import time
time.sleep(5)  # Sleep for 5 seconds to give the server time to start
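
A fixed five-second sleep may be too short when the model takes longer to load. Because `fastapi_app.py` loads the model before calling `uvicorn.run`, the port only starts accepting connections once loading has finished, so polling the port is one way to wait. A minimal sketch, where `wait_for_port` is a hypothetical helper name and the timeout values are arbitrary:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection refused): wait and retry.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 8000, timeout=120)` could replace the `time.sleep(5)` call; it returns `False` if the server never comes up.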

## Test the LLM endpoint (fastapi_app.py)

In [9]:
!curl -X POST "http://localhost:8000/query" -H "Content-Type: application/json" -d "{\"question\": \"Name the planets in the solar system?\"}"

{"answer":"Sure, here's a list of the eight planets in our solar system:"}

## Write a function to query LLM API and log progress

In [10]:
%%writefile tasks.py
import requests
import time

def process_question(question, url="http://localhost:8000/query"):
    """
    Send a question to the FastAPI server and log the response time.

    Args:
        question (str): The question to send.
        url (str): The API endpoint to send the question to.

    Returns:
        dict: A dictionary containing the question, response, and timing information.
    """
    # Record the start time
    start_time = time.time()

    # Send the POST request to the FastAPI server
    response = requests.post(url, json={"question": question})

    # Record the end time
    end_time = time.time()

    # Calculate the duration taken to get the response
    duration = end_time - start_time

    # Create a result dictionary
    result = {
        "question": question,
        "response": response.text,
        "start_time": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time)),
        "end_time": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(end_time)),
        "duration": duration
    }

    # Print the result (or save it to a log, etc.)
    print(result)
    print("-" * 60)  # Print a separator line for clarity

    return result


Writing tasks.py
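
`process_question` stores the raw response body, which is a JSON string such as `{"answer": "..."}` on success. If only the answer text is needed, it could be extracted as sketched below. `extract_answer` is a hypothetical helper, and the error branch assumes FastAPI's default `{"detail": ...}` body for an `HTTPException`.

```python
import json


def extract_answer(response_text: str) -> str:
    """Pull the 'answer' field out of the endpoint's JSON response body."""
    payload = json.loads(response_text)
    if isinstance(payload, dict) and "answer" in payload:
        return payload["answer"]
    # Anything else (for example {"detail": "..."} from an error) is reported as-is.
    raise ValueError(f"No 'answer' field in response: {payload!r}")
```

It could be applied to the `response` entry of a result dictionary, for instance `extract_answer(result["response"])`.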


## Implement a Redis Queue

In [11]:
%%writefile main.py
from rq import Queue
from redis import Redis
from tasks import process_question

# Redis connection
redis_conn = Redis()

# RQ Queue
queue = Queue(connection=redis_conn)

# List of questions
questions = [
    "Is the mind the same as the brain, or do we have souls?",
    "Can computers think, or fall in love?",
    "Can computers be creative?",
    "What is consciousness?",
    "Can we really know what it feels like to be a bat?",
    "When you have a toothache, is the pain in your mouth or in your brain?",
    "What is an emotion?",
    "Is love just a feeling?",
    "How is love different from passion or sexual desire?",
    "Are emotions irrational?"]

# Enqueue each question to be processed
for index, question in enumerate(questions, start=1):
    job = queue.enqueue(process_question, question)
    print(f"Enqueued job {index}: {job.id}")

Writing main.py


In [12]:
!redis-server

39048:C 28 Jun 2024 17:55:24.416 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
39048:C 28 Jun 2024 17:55:24.416 # Redis version=6.0.16, bits=64, commit=00000000, modified=0, pid=39048, just started
39048:M 28 Jun 2024 17:55:24.417 # Could not create server TCP listening socket *:6379: bind: Address already in use


In [13]:
!python main.py
!rq worker

## The following didn't work...
# # Start multiple RQ workers
# num_workers = 4
# worker_processes = []
# for _ in range(num_workers):
#     worker_process = subprocess.Popen(['rq', 'worker'])
#     worker_processes.append(worker_process)

# # Run the main script to enqueue jobs
# main_process = subprocess.Popen(['python', 'main.py'])

Enqueued job 1: ddcfe7b1-42c0-4148-a615-e5e01be2c152
Enqueued job 2: 142ffc45-57ae-4d9f-bf61-b76e2090a974
Enqueued job 3: 7babac47-e6eb-4747-bb39-15cfcaedf25b
Enqueued job 4: e8a45d05-dc3f-4264-835d-8b7cea8d0608
Enqueued job 5: 5f3254c6-9c8e-4b70-88b7-9b21700cd45b
Enqueued job 6: 548ba70d-5cf5-4414-9089-d4683b23f2f4
Enqueued job 7: 203781bc-51ea-46e1-9cbe-eb0b2490bc30
Enqueued job 8: 10a2a358-b2cd-40ef-b46d-2c14f4a43b97
Enqueued job 9: 350dbb79-47d0-4126-93d1-2f25f14b4b60
Enqueued job 10: 7ecc6cd7-6482-4dcd-9dcc-366da9e9e432
17:55:29 Worker rq:worker:219315bd0dd44d0d81d8b4dacd084108 started with PID 39072, version 1.16.2
17:55:29 Subscribing to channel rq:pubsub:219315bd0dd44d0d81d8b4dacd084108
17:55:29 *** Listening on default...
17:55:29 Cleaning registries for queue: default
17:55:29 default: tasks.process_question('Is the mind the same as the brain, or do we have souls?') (ddcfe7b1-42c0-4148-a615-e5e01be2c152)
{'question': 'Is the min
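
Each result dictionary carries a `duration`, which makes it possible to compare what was observed with what parallel processing could achieve at best. The sketch below sums durations for the one-at-a-time case and estimates an idealized wall-clock time for several workers, assuming every worker is truly independent (which the single LLM endpoint here may not allow). Function names are hypothetical, and the numbers in the usage note are made up for illustration, not measured.

```python
import heapq


def sequential_time(durations):
    """Total time when jobs run strictly one after another."""
    return sum(durations)


def ideal_parallel_time(durations, workers):
    """Idealized wall-clock time: each job goes to whichever worker frees up first."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Min-heap of the time at which each worker becomes free.
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in durations:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)
```

With made-up durations `[3, 1, 2, 2]` seconds, `sequential_time` gives 8 and `ideal_parallel_time(..., workers=2)` gives 5. A measured total close to the sequential figure would suggest the jobs are not actually overlapping.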

In [None]:
import multiprocessing
cores = multiprocessing.cpu_count()  # Count the number of CPU cores on this machine
cores

2
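
With two cores available, one option to try is one RQ worker per core. The sketch below builds the `rq worker` commands and starts them in burst mode (`--burst` makes a worker exit once the queue is empty), after jobs have already been enqueued with `python main.py`. We have not verified that this resolves the failed multi-worker attempt above, and it cannot speed things up if the endpoint itself handles one request at a time. Helper names are hypothetical.

```python
import os
import subprocess


def default_worker_count() -> int:
    """One worker per CPU core, falling back to 1 if the count is unknown."""
    return os.cpu_count() or 1


def build_worker_commands(num_workers: int, queue_name: str = "default"):
    """Return one `rq worker --burst <queue>` command per worker."""
    return [["rq", "worker", "--burst", queue_name] for _ in range(num_workers)]


def run_burst_workers(num_workers: int, queue_name: str = "default"):
    """Start the workers in parallel and wait for all of them to exit."""
    processes = [subprocess.Popen(cmd) for cmd in build_worker_commands(num_workers, queue_name)]
    # Each burst worker exits when the queue is empty; collect the exit codes.
    return [process.wait() for process in processes]
```

A typical call would be `run_burst_workers(default_worker_count())`, run after `!python main.py` has finished enqueuing.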

In [26]:
%%writefile requirements.txt
aiohttp==3.9.5
aiosignal==1.3.1
altair==5.3.0
annotated-types==0.6.0
anyio<4,>=3.1.0
archspec==0.2.3
attrs==23.2.0
boltons==24.0.0
Brotli==1.1.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cryptography==42.0.5
dataclasses-json==0.6.4
distro==1.9.0
frozenlist==1.4.1
gitdb==4.0.11
GitPython==3.1.43
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
idna==3.7
Jinja2==3.1.3
jsonpatch==1.33
jsonpointer==2.1
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.46
langchain-text-splitters==0.0.1
langsmith==0.1.51
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.1
mdurl==0.1.2
multidict==6.0.5
mypy-extensions==1.0.0
numpy==1.26.4
openai==1.23.6
orjson==3.10.1
packaging==23.2
pandas==2.0.3
pillow==10.3.0
pip==23.2.1
platformdirs==4.2.1
pluggy==1.5.0
protobuf==4.25.3
psutil==5.9.8
pyarrow<15.0.0a0,>=14.0.1
pycosat==0.6.6
pycparser==2.22
pydantic==2.7.1
pydantic_core==2.18.2
pydeck==0.9.0b0
Pygments==2.17.2
PySocks==1.7.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
redis==5.0.4
referencing==0.35.0
requests==2.31.0
rich==13.7.1
rpds-py==0.18.0
rq==1.16.1
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
setuptools==69.5.1
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
SQLAlchemy==2.0.29
streamlit==1.33.0
tenacity==8.2.3
toml==0.10.2
toolz==0.12.1
tornado==6.3.3
tqdm==4.66.2
truststore==0.8.0
typing_extensions==4.11.0
typing-inspect==0.9.0
tzdata==2024.1
urllib3==2.2.1
watchdog==4.0.0
wheel==0.41.2
yarl==1.9.4
zstandard==0.22.0

Overwriting requirements.txt


In [1]:
!pip install -r requirements.txt

In [12]:
! pip install git+https://github.com/Yue1Harriet1/streamlitrq.git

Collecting git+https://github.com/Yue1Harriet1/streamlitrq.git
  Cloning https://github.com/Yue1Harriet1/streamlitrq.git to /tmp/pip-req-build-t32u1wbi
  Running command git clone --filter=blob:none --quiet https://github.com/Yue1Harriet1/streamlitrq.git /tmp/pip-req-build-t32u1wbi
  Resolved https://github.com/Yue1Harriet1/streamlitrq.git to commit 8372ac83dea3b86fc7435089471f651d39418c03
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: streamlitrq
  Building wheel for streamlitrq (setup.py) ... done
  Created wheel for streamlitrq: filename=streamlitrq-0.0.1-py3-none-any.whl size=14496 sha256=0c0a3a753ed84a1bebb8b9bb03dbeba4757ec725480c5f22220d6f6002a8a598
  Stored in directory: /tmp/pip-ephem-wheel-cache-a34gur3h/wheels/cd/32/b7/d3d9a6449d189840cd6c69314d024307e1e48cf4e46b36c396
Successfully built streamlitrq
Installing collected packages: streamlitrq
Successfully installed streamlitrq-0.0.1

In [3]:
%%writefile main_app.py
from streamlitrq import ui_streamlit, task
ui_streamlit.demo_homepage(user_tasks=[task.Task(task.sleep, "default task name")], task_inputs={"seconds": 10})

Overwriting main_app.py


In [4]:
!npm install localtunnel

npm WARN saveError ENOENT: no such file or directory, open '/content/package.json'
npm WARN enoent ENOENT: no such file or directory, open '/content/package.json'
npm WARN content No description
npm WARN content No repository field.
npm WARN content No README data
npm WARN content No license field.

+ localtunnel@2.0.2
updated 1 package and audited 36 packages in 1.044s

3 packages are looking for funding
  run `npm fund` for details

found 2 moderate severity vulnerabilities
  run `npm audit fix` to fix them, or `npm audit` for details

In [None]:
!npx localtunnel --port 8501  # open the URL and enter this machine's public IP address (printed by the icanhazip cell) as the tunnel password

npx: installed 22 in 2.703s
your url is: https://metal-teeth-prove.loca.lt


In [17]:
!streamlit run /content/main_app.py &>/content/logs.txt &  # check logs.txt under the "content" folder and note the IP address, e.g. 34.74.134.24 (you will need it later)

In [18]:
import urllib.request
print("Password/Endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))

Password/Endpoint IP for localtunnel is: 34.139.211.170


Found existing installation: streamlitrq 0.0.1
Uninstalling streamlitrq-0.0.1:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/streamlitrq-0.0.1.dist-info/*
    /usr/local/lib/python3.10/dist-packages/streamlitrq/*
Proceed (Y/n)? Y
  Successfully uninstalled streamlitrq-0.0.1