<a href="https://colab.research.google.com/github/fabiopauli/Qwen3.5-colab/blob/main/Server_Qwen27B_llamacpp_256k_context_L4_20gb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A Celula abaixo demora 8 minutos para ser conclu√≠da

In [1]:
# Cell 1: Build llama.cpp with CUDA and run Qwen3.5-27B (non-thinking mode)
!apt-get update -qq && apt-get install -qq -y pciutils build-essential cmake curl libcurl4-openssl-dev > /dev/null 2>&1

!git clone --depth 1 https://github.com/ggml-org/llama.cpp 2>/dev/null || echo "already cloned"

!cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON > /dev/null 2>&1

!cmake --build llama.cpp/build --config Release -j$(nproc) --clean-first --target llama-cli llama-server 2>&1 | tail -5

!cp llama.cpp/build/bin/llama-* llama.cpp/

# Download the model
!pip install -q huggingface_hub hf_transfer
!HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
    --local-dir unsloth/Qwen3.5-27B-GGUF \
    --include "*UD-Q4_K_XL*"

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
already cloned
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-http.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-models.cpp.o
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server
/bin/bash: line 1: huggingface-cli: command not found


A c√©lula abaixo cria o servidor Llamacpp em background.

In [44]:
# Cell 2: Run llama-server in the background
import os
import time
import subprocess

# Kill any existing server to free up the port
os.system("pkill -f llama-server")
time.sleep(2)

os.environ["LLAMA_CACHE"] = "unsloth/Qwen3.5-27B-GGUF"

# Start the server using nohup so it runs in the background
server_cmd = """
nohup ./llama.cpp/llama-server \
    -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
    --host 127.0.0.1 \
    --port 8081 \
    --ctx-size 16384 \
    -ngl 99 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --min-p 0.00 \
    --chat-template-kwargs '{"enable_thinking": false}' \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 > llama_server.log 2>&1 &
"""

print("Starting llama-server on port 8081...")
os.system(server_cmd)

# Wait for the server to spin up and load the model into VRAM
print("Waiting for model to load into VRAM (this takes 30-60 seconds)...")
for i in range(600):
    try:
        import requests
        res = requests.get("http://127.0.0.1:8081/health")
        if res.status_code == 200:
            print("\n‚úÖ llama-server is ready and listening on port 8081!")
            break
    except:
        pass
    time.sleep(2)
    print(".", end="", flush=True)
else:
    print("\n‚ö†Ô∏è Server might not have started correctly. Check llama_server.log:")
    os.system("tail -n 20 llama_server.log")

Starting llama-server on port 8081...
Waiting for model to load into VRAM (this takes 30-60 seconds)...
.......
‚úÖ llama-server is ready and listening on port 8081!


A seguir, criamos outro servidor para gerar os endpoints da API, tamb√©m em background

In [5]:
# Cell 3: Install dependencies for FastAPI wrapper
!pip install -q fastapi uvicorn pyngrok httpx pydantic nest-asyncio

In [47]:
# Cell 4: Background FastAPI + Cloudflare Tunnel
import os
import time
import re

# 1. Write the FastAPI app to a file
fastapi_code = """
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, JSONResponse
from fastapi.middleware.cors import CORSMiddleware
import httpx

app = FastAPI(title="Custom FastAPI Wrapper for llama.cpp")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

LLAMA_SERVER_URL = "http://127.0.0.1:8081"

@app.get("/v1/models")
async def get_models():
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{LLAMA_SERVER_URL}/v1/models")
        return response.json()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    payload = await request.json()
    is_stream = payload.get("stream", False)

    if is_stream:
        async def generate():
            async with httpx.AsyncClient(timeout=300.0) as client:
                async with client.stream("POST", f"{LLAMA_SERVER_URL}/v1/chat/completions", json=payload) as response:
                    async for chunk in response.aiter_bytes():
                        yield chunk

        return StreamingResponse(generate(), media_type="text/event-stream")
    else:
        async with httpx.AsyncClient(timeout=300.0) as client:
            response = await client.post(f"{LLAMA_SERVER_URL}/v1/chat/completions", json=payload)
            return JSONResponse(content=response.json(), status_code=response.status_code)
"""

with open("fastapi_server.py", "w") as f:
    f.write(fastapi_code)

# 2. Kill existing processes (if you run this cell multiple times)
os.system("pkill -f uvicorn")
os.system("pkill -f cloudflared")
time.sleep(1)

# 3. Download Cloudflare if needed
if not os.path.exists("cloudflared"):
    os.system("wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O cloudflared")
    os.system("chmod +x cloudflared")

# 4. Start FastAPI in the background via Uvicorn
print("Starting FastAPI server in the background...")
os.system("nohup python -m uvicorn fastapi_server:app --host 0.0.0.0 --port 8000 > fastapi.log 2>&1 &")

# 5. Start Cloudflare Tunnel in the background
print("Starting Cloudflare Tunnel...")
os.system("nohup ./cloudflared tunnel --url http://127.0.0.1:8000 > cloudflare.log 2>&1 &")

# Wait a few seconds for Cloudflare to assign a URL
print("Waiting for URL...")
time.sleep(8)

# 6. Read the log to extract the URL
with open("cloudflare.log", "r") as f:
    logs = f.read()
    match = re.search(r"(https://[a-zA-Z0-9-]+\.trycloudflare\.com)", logs)

    if match:
        public_url = match.group(1)
        base_url = f"{public_url}/v1"

        # Save the URL to a file
        with open("api_url.txt", "w") as url_file:
            url_file.write(base_url)

        print(f"\n‚úÖ URL saved to api_url.txt")
        print(f"üëâ {base_url}\n")
    else:
        print("‚ö†Ô∏è Could not find Cloudflare URL.")

Starting FastAPI server in the background...
Starting Cloudflare Tunnel...
Waiting for URL...

‚úÖ URL saved to api_url.txt
üëâ https://editorial-details-updating-turns.trycloudflare.com/v1



Abaixo est√° um exemplo de uso da API, pode ser usado de qualquer computador, basta preencher o API_BASE_URL com a URL do servidor da c√©lula acima

In [48]:
# Cell 5: Test your API with the official OpenAI Python package
from openai import OpenAI

# Read the base URL automatically from the file
with open("api_url.txt", "r") as f:
    API_BASE_URL = f.read().strip()

print(f"Connecting to: {API_BASE_URL}\n")

client = OpenAI(
    base_url=API_BASE_URL,
    api_key="sk-no-key-required"
)


# --- 1. GET MODELS ---
print("Fetching models...")
models = client.models.list()
print(f"Available models: {[m.id for m in models.data]}\n")
print("-" * 50)


# --- 2. STREAMING COMPLETION ---
print("Sending chat request (Streaming)...\n")
stream_response = client.chat.completions.create(
    model="unsloth/Qwen3.5-27B-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful and concise AI assistant."},
        {"role": "user", "content": "Explique o que √© um llamacpp server e o que √© um Cloudflared tunnel"}
    ],
    stream=True # <--- Set to True
)

# Print the streaming response as it arrives
for chunk in stream_response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print("\n\n" + "-" * 50)


# --- 3. NON-STREAMING COMPLETION ---
print("Sending chat request (Non-Streaming)...\n")
standard_response = client.chat.completions.create(
    model="unsloth/Qwen3.5-27B-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful and concise AI assistant."},
        {"role": "user", "content": "O que √© aux√≠lio-doen√ßa no direito brasileiro ? N√£o use markdown na resposta"}
    ],
    stream=False # <--- Set to False
)

# Print the final complete message
print(standard_response.choices[0].message.content)
print("\n" + "-" * 50)

Connecting to: https://editorial-details-updating-turns.trycloudflare.com/v1

Fetching models...
Available models: ['unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL']

--------------------------------------------------
Sending chat request (Streaming)...

Aqui est√° uma explica√ß√£o concisa de cada conceito e como eles se relacionam:

### 1. Llama.cpp Server
O **Llama.cpp** √© uma biblioteca de c√≥digo aberto escrita em C/C++ otimizada para rodar modelos de linguagem grandes (LLMs) como Llama, Mistral e Gemma em hardware local (CPU ou GPU de consumidor), sem precisar de grandes clusters de servidores.

O **Llama.cpp Server** √© uma funcionalidade espec√≠fica dentro dessa biblioteca que transforma o modelo local em um **servi√ßo de API web** (geralmente compat√≠vel com a API da OpenAI).
*   **Como funciona:** Ele inicia um servidor local (ex: `localhost:8080`) que aceita requisi√ß√µes HTTP para gerar texto, completar prompts ou chat.
*   **Uso principal:** Permite que aplica√ß√µes, scripts ou inter