<a href="https://colab.research.google.com/github/fabiopauli/Qwen3.5-colab/blob/main/Server_Qwen27B_llamacpp_16k_context_L4_20gb-queue.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The cell below takes about 8 minutes to complete.

In [1]:
# Cell 1: Build llama.cpp with CUDA and run Qwen3.5-27B (non-thinking mode)
!apt-get update -qq && apt-get install -qq -y pciutils build-essential cmake curl libcurl4-openssl-dev > /dev/null 2>&1

!git clone --depth 1 https://github.com/ggml-org/llama.cpp 2>/dev/null || echo "already cloned"

!cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON > /dev/null 2>&1

!cmake --build llama.cpp/build --config Release -j$(nproc) --clean-first --target llama-cli llama-server 2>&1 | tail -5

!cp llama.cpp/build/bin/llama-* llama.cpp/

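# Download the model
# Note: the output recorded below ends with `huggingface-cli: command not found`, so this
# download did not run in that session. Cell 2 does not depend on it, since
# `llama-server -hf` fetches the model itself. Recent huggingface_hub releases
# name the CLI `hf` (e.g. `hf download ...`).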
!pip install -q huggingface_hub hf_transfer
!HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
    --local-dir unsloth/Qwen3.5-27B-GGUF \
    --include "*UD-Q4_K_XL*"

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
already cloned
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-http.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-models.cpp.o
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server
/bin/bash: line 1: huggingface-cli: command not found


The cell below starts the llama.cpp server in the background.

In [44]:
# Cell 2: Run llama-server in the background
import os
import time
import subprocess

# Kill any existing server to free up the port
os.system("pkill -f llama-server")
time.sleep(2)

os.environ["LLAMA_CACHE"] = "unsloth/Qwen3.5-27B-GGUF"

# Start the server using nohup so it runs in the background
server_cmd = """
nohup ./llama.cpp/llama-server \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --host 127.0.0.1 \
    --port 8081 \
    --ctx-size 16384 \
    -ngl 99 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.00 \
    --chat-template-kwargs '{"enable_thinking": false}' \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 > llama_server.log 2>&1 &
"""

print("Starting llama-server on port 8081...")
os.system(server_cmd)

# Wait for the server to spin up and load the model into VRAM
import requests

print("Waiting for model to load into VRAM (this takes 30-60 seconds)...")
for i in range(600):
    try:
        res = requests.get("http://127.0.0.1:8081/health", timeout=5)
        if res.status_code == 200:
            print("\n✅ llama-server is ready and listening on port 8081!")
            break
    except requests.exceptions.RequestException:
        # Server is not accepting connections yet; keep polling
        pass
    time.sleep(2)
    print(".", end="", flush=True)
else:
    print("\n⚠️ Server might not have started correctly. Check llama_server.log:")
    os.system("tail -n 20 llama_server.log")

Starting llama-server on port 8081...
Waiting for model to load into VRAM (this takes 30-60 seconds)...
.......
✅ llama-server is ready and listening on port 8081!
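
The readiness loop in Cell 2 can be factored into a small reusable helper. The sketch below is one possible shape, not the notebook's actual code: `wait_until_ready` and its parameters are hypothetical names, and the health check is passed in as a callable so the helper can be exercised without a running server.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 1200.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it returns True or `timeout` seconds have elapsed.

    Exceptions raised by `check` are treated as "not ready yet", mirroring the
    try/except around the /health request in Cell 2.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # e.g. connection refused while the server is still starting
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Simulated server that becomes healthy on the third probe (no network needed).
attempts = {"count": 0}


def fake_health_check() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until_ready(fake_health_check, timeout=5.0, interval=0.0)
print(ready, attempts["count"])
```

Against the real server, `check` could be `lambda: requests.get("http://127.0.0.1:8081/health", timeout=5).status_code == 200`. The default timeout of 1200 seconds matches the 600 two-second iterations used in Cell 2.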


Next, we create a second server that exposes the API endpoints, also running in the background.
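
Cell 4 below sleeps a fixed 8 seconds before reading `cloudflare.log`, which can miss the URL if the tunnel is slow to come up. As a hedged alternative, the sketch below polls the log until the same `trycloudflare.com` pattern appears or a deadline passes. `wait_for_tunnel_url` and its parameters are illustrative names, not part of cloudflared or of the notebook, and the demo log line is made up.

```python
import os
import re
import tempfile
import time
from typing import Optional

# Same pattern Cell 4 uses to find the quick-tunnel URL in the log
TUNNEL_URL_RE = re.compile(r"https://[a-zA-Z0-9-]+\.trycloudflare\.com")


def wait_for_tunnel_url(log_path: str, timeout: float = 30.0, interval: float = 0.5) -> Optional[str]:
    """Poll `log_path` until a trycloudflare.com URL shows up or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(log_path):
            with open(log_path, "r", errors="replace") as f:
                match = TUNNEL_URL_RE.search(f.read())
            if match:
                return match.group(0)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Demo against a temporary file standing in for cloudflare.log (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "cloudflare.log")
    with open(demo_log, "w") as f:
        f.write("INF |  https://example-demo-tunnel.trycloudflare.com  |\n")
    found_url = wait_for_tunnel_url(demo_log, timeout=1.0)
    missing_url = wait_for_tunnel_url(os.path.join(tmp, "absent.log"), timeout=0.2, interval=0.05)

print(found_url)
print(missing_url)
```

Returning `None` on timeout keeps the caller in charge of the error message, in the same way Cell 4 prints a warning when no URL is found.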

In [5]:
# Cell 3: Install dependencies for FastAPI wrapper
!pip install -q fastapi uvicorn pyngrok httpx pydantic nest-asyncio

In [47]:
# Cell 4: Background FastAPI + Cloudflare Tunnel
import os
import time
import re

# 1. Write the FastAPI app to a file
fastapi_code = """
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, JSONResponse
from fastapi.middleware.cors import CORSMiddleware
import httpx

app = FastAPI(title="Custom FastAPI Wrapper for llama.cpp")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

LLAMA_SERVER_URL = "http://127.0.0.1:8081"

@app.get("/v1/models")
async def get_models():
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{LLAMA_SERVER_URL}/v1/models")
        return response.json()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    is_stream = payload.get("stream", False)

    if is_stream:
        async def generate():
            async with httpx.AsyncClient(timeout=300.0) as client:
                async with client.stream("POST", f"{LLAMA_SERVER_URL}/v1/chat/completions", json=payload) as response:
                    async for chunk in response.aiter_bytes():
                        yield chunk

        return StreamingResponse(generate(), media_type="text/event-stream")
    else:
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(f"{LLAMA_SERVER_URL}/v1/chat/completions", json=payload)
            return JSONResponse(content=response.json(), status_code=response.status_code)
"""

with open("fastapi_server.py", "w") as f:
    f.write(fastapi_code)

# 2. Kill existing processes (if you run this cell multiple times)
os.system("pkill -f uvicorn")
os.system("pkill -f cloudflared")
time.sleep(1)

# 3. Download Cloudflare if needed
if not os.path.exists("cloudflared"):
    os.system("wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared")
    os.system("chmod +x cloudflared")

# 4. Start FastAPI in the background via Uvicorn
print("Starting FastAPI server in the background...")
os.system("nohup python -m uvicorn fastapi_server:app --host 0.0.0.0 --port 8000 > fastapi.log 2>&1 &")

# 5. Start Cloudflare Tunnel in the background
print("Starting Cloudflare Tunnel...")
os.system("nohup ./cloudflared tunnel --url http://127.0.0.1:8000 > cloudflare.log 2>&1 &")

# Wait a few seconds for Cloudflare to assign a URL
print("Waiting for URL...")
time.sleep(8)

# 6. Read the log to extract the URL
with open("cloudflare.log", "r") as f:
    logs = f.read()
    match = re.search(r"(https://[a-zA-Z0-9-]+\.trycloudflare\.com)", logs)

    if match:
        public_url = match.group(1)
        base_url = f"{public_url}/v1"

        # Save the URL to a file
        with open("api_url.txt", "w") as url_file:
            url_file.write(base_url)

        print("\n✅ URL saved to api_url.txt")
        print(f"👉 {base_url}\n")
    else:
        print("⚠️ Could not find Cloudflare URL.")

Starting FastAPI server in the background...
Starting Cloudflare Tunnel...
Waiting for URL...

✅ URL saved to api_url.txt
👉 https://editorial-details-updating-turns.trycloudflare.com/v1



Below is an example of using the API. It can be run from any computer: just set API_BASE_URL to the server URL printed by the cell above.

In [50]:
# Cell 5: Test your API with the official OpenAI Python package
from openai import OpenAI

# Read the base URL automatically from the file
with open("api_url.txt", "r") as f:
    API_BASE_URL = f.read().strip()

print(f"Connecting to: {API_BASE_URL}\n")

client = OpenAI(
    base_url=API_BASE_URL,
    api_key="sk-no-key-required"
)


# --- 1. GET MODELS ---
print("Fetching models...")
models = client.models.list()
print(f"Available models: {[m.id for m in models.data]}\n")
print("-" * 50)


# --- 2. STREAMING COMPLETION ---
print("Sending chat request (Streaming)...\n")
stream_response = client.chat.completions.create(
    model="unsloth/Qwen3.5-27B-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful and concise AI assistant."},
        {"role": "user", "content": "Explique o que é um llamacpp server e o que é um Cloudflared tunnel"}
    ],
    stream=True # <--- Set to True
)

# Print the streaming response as it arrives
for chunk in stream_response:
    # Some chunks can arrive with an empty `choices` list, so guard the index
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print("\n\n" + "-" * 50)


# --- 3. NON-STREAMING COMPLETION ---
print("Sending chat request (Non-Streaming)...\n")
standard_response = client.chat.completions.create(
    model="unsloth/Qwen3.5-27B-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful and concise AI assistant."},
        {"role": "user", "content": "O que é auxílio-doença no direito brasileiro ? Não use markdown na resposta"}
    ],
    stream=False # <--- Set to False
)

# Print the final complete message
print(standard_response.choices[0].message.content)
print("\n" + "-" * 50)

Connecting to: https://editorial-details-updating-turns.trycloudflare.com/v1

Fetching models...
Available models: ['unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL']

--------------------------------------------------
Sending chat request (Streaming)...

Aqui está uma explicação concisa sobre os dois conceitos:

### 1. LlamaCPP Server
O **LlamaCPP Server** é um servidor web leve que permite rodar modelos de linguagem grandes (LLMs), como Llama 3, Mistral ou Phi, localmente no seu computador.

*   **Tecnologia:** É escrito em C++ e otimizado para rodar em CPUs (e GPUs) comuns, sem precisar de hardware especializado caro.
*   **Funcionamento:** Ele carrega um modelo quantizado (arquivo `.gguf`) e expõe uma **API REST** (geralmente no formato OpenAI).
*   **Uso Principal:** Permite que aplicações externas (como interfaces de chat, ferramentas de desenvolvimento ou scripts) conversem com o modelo localmente, como se estivessem chamando a API da OpenAI, mas totalmente offline e privado.

### 2
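
When `stream=True`, the wrapper in Cell 4 relays the upstream body unchanged as `text/event-stream`, and the OpenAI client decodes it. For clients that do not use that package, the sketch below shows how such a stream could be decoded by hand, assuming the usual OpenAI-style framing of `data: {json}` lines terminated by `data: [DONE]`. The function name `iter_stream_text` is hypothetical, and the sample lines are hand-written for illustration, not captured from the server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment/keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        event = json.loads(data)
        for choice in event.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events (illustrative only)
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "",
    "data: [DONE]",
]

streamed_text = "".join(iter_stream_text(sample_lines))
print(streamed_text)
```

With `requests`, the input lines could come from `response.iter_lines(decode_unicode=True)` on a POST made with `stream=True`.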

Below is an asynchronous API that organizes incoming requests into a queue (queueing and polling).
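
Before the full server, the pattern can be seen in isolation: a producer records a task as `queued` and returns an ID at once, a single worker drains an `asyncio.Queue`, and callers read the status back later. This is a minimal sketch with a fake backend (`fake_llm`, a hypothetical stand-in for the HTTP call to llama-server), so it runs without any server.

```python
import asyncio
import uuid
from typing import Any, Dict

tasks: Dict[str, Dict[str, Any]] = {}


async def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the POST to llama-server."""
    await asyncio.sleep(0.01)
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


async def worker(queue: asyncio.Queue) -> None:
    """Process one task at a time, recording status transitions."""
    while True:
        task_id, prompt = await queue.get()
        tasks[task_id]["status"] = "processing"
        try:
            tasks[task_id]["result"] = await fake_llm(prompt)
            tasks[task_id]["status"] = "finished"
        except Exception as exc:
            tasks[task_id]["error"] = str(exc)
            tasks[task_id]["status"] = "failed"
        finally:
            queue.task_done()


async def submit(queue: asyncio.Queue, prompt: str) -> str:
    """Register a task as queued and return its ID immediately."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "queued", "result": None, "error": None}
    await queue.put((task_id, prompt))
    return task_id


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))
    ok_id = await submit(queue, "hello")
    bad_id = await submit(queue, "")
    status_at_submit = tasks[bad_id]["status"]  # the worker has not picked it up yet
    await queue.join()  # wait until the worker has drained the queue
    worker_task.cancel()
    return ok_id, bad_id, status_at_submit


ok_id, bad_id, status_at_submit = asyncio.run(demo())
print(status_at_submit, tasks[ok_id], tasks[bad_id])
```

Because there is a single worker, tasks are processed strictly one after another, which is why a second task stays `queued` while the first is `processing`.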

In [51]:
# Cell 6: Background FastAPI (Queue System) + Cloudflare Tunnel
import os
import time
import re

# 1. New FastAPI app with a request queue
fastapi_code = """
import uvicorn
import asyncio
import uuid
from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
import httpx
from typing import Dict, Any

app = FastAPI(title="Queued FastAPI Wrapper for llama.cpp")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

LLAMA_SERVER_URL = "http://127.0.0.1:8081"

# In-memory "database" that stores requests and their responses
tasks_db: Dict[str, Dict[str, Any]] = {}

# Async queue
request_queue = asyncio.Queue()

# Worker that processes the queue in the background
async def process_queue():
    async with httpx.AsyncClient(timeout=600.0) as client:
        while True:
            # Take the next item from the queue (waits while it is empty)
            task_id, payload = await request_queue.get()

            # Update the status
            tasks_db[task_id]["status"] = "processing"

            try:
                # Force stream=False because we store the final result
                payload["stream"] = False

                response = await client.post(
                    f"{LLAMA_SERVER_URL}/v1/chat/completions",
                    json=payload
                )
                response.raise_for_status()

                # Store the result
                tasks_db[task_id]["status"] = "finished"
                tasks_db[task_id]["result"] = response.json()

            except Exception as e:
                tasks_db[task_id]["status"] = "failed"
                tasks_db[task_id]["error"] = str(e)
            finally:
                request_queue.task_done()

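@app.on_event("startup")
async def startup_event():
    # Start the background worker when the server starts; keep a reference
    # on app.state so the task is not garbage-collected while it runs
    app.state.worker_task = asyncio.create_task(process_queue())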

@app.get("/v1/models")
async def get_models():
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{LLAMA_SERVER_URL}/v1/models")
        return response.json()

# Endpoint that CREATES the request
@app.post("/v1/chat/completions")
async def queue_chat_completion(request: Request):
    payload = await request.json()

    # Generate a unique ID for this request
    task_id = str(uuid.uuid4())

    # Store it in the "database" with its initial status
    tasks_db[task_id] = {
        "id": task_id,
        "status": "queued",
        "result": None,
        "error": None
    }

    # Add it to the queue
    await request_queue.put((task_id, payload))

    # Return immediately to the caller
    return JSONResponse(content={"id": task_id, "status": "queued"}, status_code=202)

# New endpoint that CHECKS the status of a request
@app.get("/v1/tasks/{task_id}")
async def get_task_status(task_id: str):
    if task_id not in tasks_db:
        raise HTTPException(status_code=404, detail="Task not found")

    return tasks_db[task_id]
"""

with open("fastapi_server.py", "w") as f:
    f.write(fastapi_code)

# 2. Kill existing processes
os.system("pkill -f uvicorn")
os.system("pkill -f cloudflared")
time.sleep(1)

# 3. Download cloudflared if needed
if not os.path.exists("cloudflared"):
    os.system("wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared")
    os.system("chmod +x cloudflared")

# 4. Start FastAPI
print("Starting Queued FastAPI server in the background...")
os.system("nohup python -m uvicorn fastapi_server:app --host 0.0.0.0 --port 8000 > fastapi.log 2>&1 &")

# 5. Start Cloudflare Tunnel
print("Starting Cloudflare Tunnel...")
os.system("nohup ./cloudflared tunnel --url http://127.0.0.1:8000 > cloudflare.log 2>&1 &")

print("Waiting for URL...")
time.sleep(8)

# 6. Read URL
with open("cloudflare.log", "r") as f:
    logs = f.read()
    match = re.search(r"(https://[a-zA-Z0-9-]+\.trycloudflare\.com)", logs)

    if match:
        public_url = match.group(1)
        base_url = f"{public_url}/v1"

        with open("api_url.txt", "w") as url_file:
            url_file.write(base_url)

        print("\n✅ URL saved to api_url.txt")
        print(f"👉 {base_url}\n")
    else:
        print("⚠️ Could not find Cloudflare URL.")

Starting Queued FastAPI server in the background...
Starting Cloudflare Tunnel...
Waiting for URL...

✅ URL saved to api_url.txt
👉 https://weblog-actors-webshots-sig.trycloudflare.com/v1



In [52]:
# Cell 7: Test the Async Queue API
import requests
import time

# Read the URL
with open("api_url.txt", "r") as f:
    API_BASE_URL = f.read().strip()

print(f"Connecting to: {API_BASE_URL}\n")

# 1. Send the request to the queue
print("1. Enviando requisição para a fila...")
payload = {
    "model": "unsloth/Qwen3.5-27B-GGUF",
    "messages": [
        {"role": "system", "content": "Você é um assistente prestativo."},
        {"role": "user", "content": "Me conte uma história curta sobre um robô que aprendeu a programar em Python."}
    ],
    "temperature": 0.7
}

# Use plain requests instead of the OpenAI library, because the queued
# endpoint returns a task ID rather than a completion
response = requests.post(f"{API_BASE_URL}/chat/completions", json=payload)
data = response.json()

print("Resposta imediata do servidor:")
print(data)

task_id = data.get("id")

print("\n" + "-"*50 + "\n")

if task_id:
    # 2. Check the status of the request (polling)
    print(f"2. Consultando o status da Tarefa ID: {task_id}")

    while True:
        status_response = requests.get(f"{API_BASE_URL}/tasks/{task_id}")
        task_data = status_response.json()

        status = task_data.get("status")
        print(f"Status atual: {status}")

        if status == "finished":
            print("\n✅ Tarefa concluída! Aqui está a resposta final:\n")
            # Extract the reply from the OpenAI-format result stored in the database
            mensagem_final = task_data["result"]["choices"][0]["message"]["content"]
            print(mensagem_final)
            break

        elif status == "failed":
            print(f"\n❌ Falha na tarefa: {task_data.get('error')}")
            break

        # Wait 15 seconds before asking again
        time.sleep(15)
else:
    print("Falha ao obter o ID da tarefa.")

Connecting to: https://weblog-actors-webshots-sig.trycloudflare.com/v1

1. Enviando requisi√ß√£o para a fila...
Resposta imediata do servidor:
{'id': '43b4bbaf-d4e1-4efa-9b6b-97fdce9cd99b', 'status': 'queued'}

--------------------------------------------------

2. Consultando o status da Tarefa ID: 43b4bbaf-d4e1-4efa-9b6b-97fdce9cd99b
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: processing
Status atual: finished

✅ Tarefa concluída! Aqui está a resposta final:

Era uma vez um robô chamado **Pyro**, fabricado em uma oficina antiga para realizar apenas tarefas repetitivas: organizar parafusos e limpar o chão. Pyro funcionava com um código binário rígido, sem capacidade de adaptação ou criatividade.

Um dia, enquanto limrava a mesa de um jovem estudante de progr
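
The loop above polls every 15 seconds regardless of how long the task has been running, and it never gives up. One common refinement, sketched here as an assumption rather than something the notebook does, is to start with short waits, grow them up to a cap, and enforce an overall deadline. `backoff_delays` and `poll_until_done` are hypothetical helpers; `fetch_status` stands in for the `GET /v1/tasks/{task_id}` call.

```python
import itertools
import time
from typing import Any, Callable, Dict, Iterator


def backoff_delays(initial: float = 1.0, factor: float = 2.0, cap: float = 15.0) -> Iterator[float]:
    """Yield wait times that grow geometrically and then stay at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the task is finished or failed; raise TimeoutError otherwise."""
    waited = 0.0
    for delay in backoff_delays():
        task = fetch_status()
        if task.get("status") in ("finished", "failed"):
            return task
        if waited + delay > max_wait:
            raise TimeoutError(f"task still {task.get('status')!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay


# Fake status source: one "queued" and one "processing" reply, then "finished"
replies = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "finished", "result": "ok"},
])
slept = []  # records the waits instead of actually sleeping
final_task = poll_until_done(lambda: next(replies), sleep=slept.append)
first_delays = list(itertools.islice(backoff_delays(), 6))
print(final_task, slept, first_delays)
```

Against the real API, `fetch_status` could be `lambda: requests.get(f"{API_BASE_URL}/tasks/{task_id}").json()`. The cap of 15 seconds matches the fixed interval used in the cell above.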

In [53]:
# Cell 8: Test the Async Queue API with Multiple Tasks
import requests
import time

# Read the URL
with open("api_url.txt", "r") as f:
    API_BASE_URL = f.read().strip()

print(f"Connecting to: {API_BASE_URL}\n")

# Our two prompts
prompts = [
    "Explique o que é queueing and polling no contexto de APIs. Seja conciso.",
    "Explique o conceito de trabalhos assíncronos em APIs. Seja conciso."
]

task_ids = []

# 1. Send both requests to the queue
print("1. ENVIANDO TAREFAS PARA A FILA...\n")
for i, prompt in enumerate(prompts, 1):
    payload = {
        "model": "unsloth/Qwen3.5-27B-GGUF",
        "messages": [
            {"role": "system", "content": "Você é um especialista em engenharia de software e APIs. Responda em português."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7
    }

    response = requests.post(f"{API_BASE_URL}/chat/completions", json=payload)
    data = response.json()

    task_id = data.get("id")
    if task_id:
        print(f"✅ Tarefa {i} enviada! ID recebido: {task_id}")
        task_ids.append(task_id)
    else:
        print(f"❌ Erro ao enviar Tarefa {i}: {data}")

print("\n" + "="*50 + "\n")

# 2. Check the status of the requests (polling multiple tasks)
print("2. INICIANDO O POLLING (CONSULTA DE STATUS)...\n")

# Build a list of pending tasks
pending_tasks = task_ids.copy()
resultados = {}

# Keep looping while there are pending tasks in the list
while pending_tasks:
    # Iterate over a .copy() so items can be removed safely from the original list
    for task_id in pending_tasks.copy():
        status_response = requests.get(f"{API_BASE_URL}/tasks/{task_id}")
        task_data = status_response.json()

        status = task_data.get("status")
        hora_atual = time.strftime('%H:%M:%S')

        # Print a shortened ID to keep the console output readable
        short_id = task_id[:8]
        print(f"[{hora_atual}] Tarefa {short_id}... | Status atual: {status}")

        if status == "finished":
            print(f"\n🎉 Tarefa {short_id} concluída com sucesso!\n")
            # Store the final result in the dictionary
            resultados[task_id] = task_data["result"]["choices"][0]["message"]["content"]
            # Remove it from the pending list so it is not polled again
            pending_tasks.remove(task_id)

        elif status == "failed":
            print(f"\n❌ Tarefa {short_id} falhou: {task_data.get('error')}\n")
            resultados[task_id] = "ERRO NA GERAÇÃO"
            pending_tasks.remove(task_id)

    if pending_tasks:
        print("-" * 30)
        print("Aguardando 5 segundos antes da próxima consulta...\n")
        time.sleep(5)

# 3. Show the final results
print("\n" + "="*50)
print("🏆 TODAS AS TAREFAS FORAM FINALIZADAS!")
print("="*50 + "\n")

for i, task_id in enumerate(task_ids, 1):
    print(f"--- RESULTADO DA TAREFA {i} ---")
    print(f"PROMPT: {prompts[i-1]}")
    print(f"RESPOSTA:\n{resultados.get(task_id)}\n")
    print("-" * 50 + "\n")

Connecting to: https://weblog-actors-webshots-sig.trycloudflare.com/v1

1. ENVIANDO TAREFAS PARA A FILA...

✅ Tarefa 1 enviada! ID recebido: 90b03f1c-b31b-4a1b-8cf3-6fe0d8e0773a
✅ Tarefa 2 enviada! ID recebido: ca47413d-9f3e-4635-8ac1-c91f6e62c1d4


2. INICIANDO O POLLING (CONSULTA DE STATUS)...

[20:28:14] Tarefa 90b03f1c... | Status atual: processing
[20:28:14] Tarefa ca47413d... | Status atual: queued
------------------------------
Aguardando 5 segundos antes da próxima consulta...

[20:28:19] Tarefa 90b03f1c... | Status atual: processing
[20:28:19] Tarefa ca47413d... | Status atual: queued
------------------------------
Aguardando 5 segundos antes da próxima consulta...

[20:28:24] Tarefa 90b03f1c... | Status atual: processing
[20:28:25] Tarefa ca47413d... | Status atual: queued
------------------------------
Aguardando 5 segundos antes da próxima consulta...

[20:28:30] Tarefa 90b03f1c... | Status atual: processing
[20:28:30] Tarefa ca47413d... | Status atual: queued
------