# Qwen3 Next 80B Fewshot Starter (llama.cpp)

## Overview

A small starter that builds **few-shot prompts via TF-IDF retrieval** over *(transliteration → translation)* pairs, then runs inference through **llama.cpp `llama-server`** (OpenAI-compatible endpoint), tuned for **T4×2**-style environments.

## Highlights

* **RAG**: TF-IDF retriever for transliteration/translation pair selection (few-shot)
* **Model**: `unsloth/Qwen3-Next-80B-A3B-Instruct` (4-bit / GGUF)
* **Server**: `llama.cpp` (`llama-server`) optimized for T4×2
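
To illustrate the retrieval idea without any dependencies, here is a minimal sketch that ranks candidate transliterations by cosine similarity over raw character n-gram counts. It is a simplified stand-in, not the notebook's implementation: `retriever.py` uses scikit-learn's `TfidfVectorizer` (character 2- to 4-grams with IDF weighting), while this version omits the IDF term. The names `char_ngrams`, `cosine`, and `top_match`, and the sample strings, are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count all character n-grams of length n_min..n_max."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_match(query: str, candidates: list[str]) -> tuple[int, float]:
    """Return (index, score) of the candidate most similar to the query."""
    q = char_ngrams(query)
    scores = [cosine(q, char_ngrams(c)) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Made-up sample strings, for illustration only.
train = ["a-na a-lim{ki} qi2-bi-ma", "um-ma en-nam-a-szur-ma", "1 ma-na kasp-am"]
idx, score = top_match("um-ma a-szur-ma", train)
```

Character n-grams suit transliterated text because hyphenated sign sequences overlap even when whole-word tokens differ.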

## Tips

* Decoding params (`temperature`, `min_p`, `top_p`, `top_k`, `presence_penalty`) follow the Unsloth Qwen3-Next guide:
  [https://unsloth.ai/docs/models/tutorials/qwen3-next](https://unsloth.ai/docs/models/tutorials/qwen3-next)
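
For reference, `submit.py` in this notebook sends these parameters in the `/v1/chat/completions` request body. The sketch below only assembles that payload. The helper name `build_payload` is hypothetical, and the numbers are simply the ones hard-coded in `submit.py`, so check them against the guide before reusing them.

```python
import json

# Values as hard-coded in submit.py in this notebook.
DECODING = {
    "temperature": 0.7,
    "top_p": 0.80,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.0,
}


def build_payload(model: str, messages: list[dict], max_tokens: int) -> dict:
    """Assemble a non-streaming chat-completions request body."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "max_tokens": max_tokens,
        **DECODING,
    }


payload = build_payload("default", [{"role": "user", "content": "hi"}], 64)
print(json.dumps(payload, indent=2))
```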

## Requirements

* `llama.cpp` **with `llama-server`** (OpenAI-compatible `/v1/*` endpoints)
* A **GGUF** model file
* Python deps: `pandas`, `numpy`, `scikit-learn`, `tqdm`, `requests`

## Quickstart

```bash
# 1) Build few-shot examples for each test item
python retriever.py --data_dir ./data --out test_with_few_shot.parquet

# 2) Start llama-server (edit paths inside run_server.py)
python run_server.py

# 3) Run inference and write submission
python submit.py \
  --data_dir ./data \
  --test_with_few_shot_path test_with_few_shot.parquet \
  --base_url http://127.0.0.1:8080 \
  --out submission.csv
```

## Data format

* `train.csv` must contain: `transliteration`, `translation`
* `test.csv` must contain: `transliteration`
* `retriever.py` outputs a parquet where each test row includes:

  * `few_shot`: list of `{transliteration, translation}` pairs
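
A minimal sketch of what one `few_shot` value looks like and how `submit.py` expands it into alternating user/assistant chat turns. The pair below uses placeholder strings rather than real data; the prompt prefix is the one used in `submit.py`.

```python
PREFIX = "Just translate Akkadian(transliteration) to English: "

# One row's few_shot value: a list of pairs (placeholder content).
few_shot = [
    {"transliteration": "<retrieved source text>", "translation": "<its translation>"},
]


def few_shot_to_messages(shots: list[dict]) -> list[dict]:
    """Expand each pair into a user turn followed by an assistant turn."""
    messages = []
    for shot in shots:
        messages.append({"role": "user", "content": PREFIX + shot["transliteration"]})
        messages.append({"role": "assistant", "content": shot["translation"]})
    return messages


messages = few_shot_to_messages(few_shot)
```

An empty list is valid too: the retriever drops candidates below its similarity threshold, in which case the prompt contains no example turns.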

## Server check (optional)

```bash
curl http://127.0.0.1:8080/v1/models
```
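
When scripting this check, the response status can be interpreted the same way `llama_server.py` does: 200/204 means ready, and 503 with a "model is loading" body means keep waiting. The sketch below isolates that rule; the function name `classify_probe` is hypothetical, and the body strings are examples rather than captured server output.

```python
def classify_probe(status_code: int, body: str) -> str:
    """Map an HTTP probe result to 'ready', 'loading', or 'other'."""
    if status_code in (200, 204):
        return "ready"
    if status_code == 503 and "model is loading" in body.lower():
        return "loading"
    return "other"


ready = classify_probe(200, "")
loading = classify_probe(503, "Model is loading")
other = classify_probe(404, "not found")
```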

## Tuning knobs

* Retrieval: `n_shots`, TF-IDF char n-grams, similarity thresholds (`retriever.py`)
* Server: `ctx`, `n_gpu_layers`, parallelism (`run_server.py`)
* Generation: decoding params + max tokens (`submit.py`)
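
For orientation, the server knobs map onto `llama-server` flags as assembled in `llama_server.py`. The sketch below reproduces that mapping as a standalone function. `build_cmd` is a hypothetical name, the paths are placeholders, and the defaults are the values this notebook uses on T4×2. The flag descriptions in the comments are my reading of the llama.cpp options and should be checked against `llama-server --help` for your build.

```python
def build_cmd(server_bin: str, model_path: str, *, ctx: int = 7168,
              n_gpu_layers: int = 30, n_cpu_moe: int = 19,
              parallel: int = 1, threads: int = 4, cram: int = 5120,
              host: str = "127.0.0.1", port: int = 8080) -> list[str]:
    """Build the argument list in the same order as LlamaServer.start()."""
    return [
        server_bin,
        "-m", model_path,
        "--host", host,
        "--port", str(port),
        "-ngl", str(n_gpu_layers),      # layers offloaded to the GPUs
        "--n-cpu-moe", str(n_cpu_moe),  # MoE weights of the first N layers kept on CPU
        "-c", str(ctx),                 # context size in tokens
        "-np", str(parallel),           # number of parallel slots
        "-t", str(threads),             # CPU threads
        "-cram", str(cram),             # prompt-cache RAM limit
    ]


cmd = build_cmd("./llama-server", "model.gguf", ctx=4096)
print(" ".join(cmd))
```

Since `-ngl`, `--n-cpu-moe`, and `-c` all compete for the same VRAM, it is safest to change one at a time and watch the server log.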

## Troubleshooting

* `503 model is loading`: expected while the model is warming up
* Wrong paths: check `run_server.py` (`server_bin`, `model_path`)
* Output too long/short: adjust `max_tokens` / context size (`ctx`)
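
`submit.py` derives `max_tokens` per item from the input length: it assumes roughly 4 characters per token, multiplies by a ratio (1.5 in this notebook), and caps the result at `--max_tokens_ub` (default 2048). A worked sketch of that arithmetic, with made-up input lengths:

```python
def predict_max_token(transliteration_len: int, ratio: float, upper_bound: int) -> int:
    """Estimate an output-token budget from the input length in characters."""
    token_len = transliteration_len / 4  # rough chars-per-token guess
    return int(min(token_len * ratio, upper_bound))


short = predict_max_token(400, 1.5, 2048)     # 400 / 4 * 1.5 = 150
long = predict_max_token(10_000, 1.5, 2048)   # 3750 is capped at 2048
```

If translations are cut off, raise the ratio or the upper bound. Keep in mind that the prompt and the generated tokens must both fit inside `ctx`.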

## License / Notes

Follow the licenses of the model, llama.cpp, and any dataset used.

In [1]:
%%bash
mkdir -p /kaggle/temp/app
cp /kaggle/input/llama-b7814-cuda/llama-b7814-cuda/app/* /kaggle/temp/app/

In [2]:
!ls /kaggle/temp/app

libggml-base.so.0.9.5	    libggml-cpu-ivybridge.so	   libggml-cpu-zen4.so
libggml-cpu-alderlake.so    libggml-cpu-piledriver.so	   libggml-cuda.so
libggml-cpu-cannonlake.so   libggml-cpu-sandybridge.so	   libggml.so.0.9.5
libggml-cpu-cascadelake.so  libggml-cpu-sapphirerapids.so  libllama.so.0.0.7814
libggml-cpu-cooperlake.so   libggml-cpu-skylakex.so	   libmtmd.so.0.0.7814
libggml-cpu-haswell.so	    libggml-cpu-sse42.so	   llama-server
libggml-cpu-icelake.so	    libggml-cpu-x64.so


In [3]:
%%bash
cd /kaggle/temp/app/
chmod +x llama-server
test -e libmtmd.so.0 || ln -s libmtmd.so.0.0.7814 libmtmd.so.0
test -e libllama.so.0 || ln -s libllama.so.0.0.7814 libllama.so.0
test -e libggml.so.0 || ln -s libggml.so.0.9.5 libggml.so.0
test -e libggml-base.so.0 || ln -s libggml-base.so.0.9.5 libggml-base.so.0

#./llama-server --list-devices
#nohup ./llama-server -m "/kaggle/input/qwen3-next-80b-a3b-instruct-q2-k-gguf/gguf/default/1/Qwen3-Next-80B-A3B-Instruct-Q2_K.gguf" -ngl 24 -c 8192 -np 1 &

In [4]:
%%writefile llama_server.py
import os
import time
import socket
import subprocess
from pathlib import Path

import requests


def wait_port(host: str, port: int, timeout_s: float = 60.0) -> None:
    """Wait until the TCP port accepts connections."""
    deadline = time.time() + timeout_s
    last_err = None
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError as e:
            last_err = e
            time.sleep(0.5)
    raise TimeoutError(f"Port {host}:{port} did not open within {timeout_s}s. Last error: {last_err}")


def wait_http_ready(base_url: str, timeout_s: float = 900.0, proc: subprocess.Popen | None = None) -> None:
    """
    Wait until the llama.cpp server is ready.
    - 200/204: ready
    - 503 with "model is loading": normal loading state, keep waiting
    - if proc has exited, raise immediately
    """
    deadline = time.time() + timeout_s
    last = None

    urls = [
        f"{base_url}/health",
        f"{base_url}/v1/models",
        f"{base_url}/",
    ]

    while time.time() < deadline:
        if proc is not None and proc.poll() is not None:
            raise RuntimeError(f"llama-server exited early with code {proc.returncode}")

        for url in urls:
            try:
                r = requests.get(url, timeout=2.0)

                # Ready
                if r.status_code in (200, 204):
                    return

                # Loading (expected)
                if r.status_code == 503 and ("model is loading" in r.text.lower()):
                    last = (url, 503, "loading")
                    continue

                # Some builds: /health may not exist; treat as "server up" hints only
                if url.endswith("/") and r.status_code in (404, 405):
                    last = (url, r.status_code, "server up (route missing)")
                    continue

                last = (url, r.status_code, r.text[:200])

            except Exception as e:
                last = (url, "exception", repr(e))

        time.sleep(0.5)

    raise TimeoutError(f"Server not ready within {timeout_s}s. Last probe: {last}")


class LlamaServer:
    def __init__(
        self,
        server_bin: str,
        model_path: str,
        host: str = "127.0.0.1",
        port: int = 8080,
        n_gpu_layers: int = 30,
        n_cpu_moe: int = 19,
        ctx: int = 7168,
        parallel: int = 1,
        threads: int = 4,
        cram: int = 5120,
        extra_args=None,
        workdir: str | None = None,
        ld_library_path: str | None = None,
        log_path: str | None = None,
    ):
        self.server_bin = server_bin
        self.model_path = model_path
        self.host = host
        self.port = port
        self.n_gpu_layers = n_gpu_layers
        self.n_cpu_moe = n_cpu_moe
        self.ctx = ctx
        self.parallel = parallel
        self.threads = threads
        self.cram = cram
        self.extra_args = extra_args or []
        self.workdir = workdir
        self.ld_library_path = ld_library_path
        self.log_path = log_path
        self.proc: subprocess.Popen | None = None

    @property
    def base_url(self) -> str:
        return f"http://{self.host}:{self.port}"

    def start(self, timeout_s: float = 900.0) -> None:
        if self.proc and self.proc.poll() is None:
            return  # already running

        env = os.environ.copy()
        if self.ld_library_path:
            env["LD_LIBRARY_PATH"] = f"{self.ld_library_path}:{env.get('LD_LIBRARY_PATH','')}"

        # Send logs to a file (so the Jupyter cell does not hang)
        log_file = None
        if self.log_path:
            Path(self.log_path).parent.mkdir(parents=True, exist_ok=True)
            log_file = open(self.log_path, "ab", buffering=0)

        cmd = [
            self.server_bin,
            "-m", self.model_path,
            "--host", self.host,
            "--port", str(self.port),
            "-ngl", str(self.n_gpu_layers),
            "--n-cpu-moe", str(self.n_cpu_moe),
            "-c", str(self.ctx),
            "-np", str(self.parallel),
            "-t", str(self.threads),
            "-cram", str(self.cram),
            *self.extra_args,
        ]

        self.proc = subprocess.Popen(
            cmd,
            cwd=self.workdir,
            env=env,
            stdout=log_file if log_file else subprocess.DEVNULL,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,
        )

        # Wait for the port first (process death could also be detected here)
        wait_port(self.host, self.port, timeout_s=min(timeout_s, 60.0))

        if self.proc.poll() is not None:
            raise RuntimeError(f"llama-server exited early with code {self.proc.returncode}. Check log: {self.log_path}")

        # Wait for readiness at the HTTP level (503 loading is tolerated)
        wait_http_ready(self.base_url, timeout_s=timeout_s, proc=self.proc)

    def stop(self) -> None:
        if not self.proc:
            return
        if self.proc.poll() is None:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait(timeout=10)

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()


Writing llama_server.py


In [5]:
!ps aux
#!cat /kaggle/temp/app/llama_server.log

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.3  0.0   7376  3504 ?        Ss   11:02   0:00 /bin/bash -c 
root           9 26.5  0.4 735492 163772 ?       Sl   11:02   0:04 python3 -c im
root          11  1.0  0.0 189232 21496 ?        Ssl  11:02   0:00 /opt/bin/nvid
root          24 16.3  0.3 893652 103820 ?       Ssl  11:03   0:01 /usr/bin/pyth
root          55  0.0  0.0  10076  3480 pts/0    Rs+  11:03   0:00 ps aux


In [6]:
%%writefile run_server.py
import importlib
from llama_server import LlamaServer
import llama_server
importlib.reload(llama_server)
import inspect
import requests

def main():
    print("llama_server file:", llama_server.__file__)
    print("LlamaServer.__init__:", inspect.signature(llama_server.LlamaServer.__init__))
    
    server = LlamaServer(
        server_bin="/kaggle/temp/app/llama-server",
        model_path="/kaggle/input/qwen3-next-80b-a3b-instruct-q4-k-m-gguf/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf",
        workdir="/kaggle/temp/app",
        ld_library_path="/kaggle/temp/app",
        log_path="/kaggle/temp/app/llama_server.log",
        n_gpu_layers=30,
        n_cpu_moe=19,
        ctx=7168,
        parallel=1,
        threads=4,
        cram=5120
    )
    
    server.start(timeout_s=1500)   # blocks until the model has loaded (503 loading is tolerated)
    print("server up:", server.base_url)
    
    # From here on, the server can be queried any number of times
    models = requests.get(f"{server.base_url}/v1/models").json()
    print(models)
if __name__ == "__main__":
    main()
    

Writing run_server.py


In [7]:
%%writefile retriever.py
#from rag_utils import FewShotRetriever
from pathlib import Path
import argparse
import pandas as pd
from tqdm import tqdm

import os
import re
from typing import Iterable

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class FewShotRetriever:
    def __init__(self, train_df: pd.DataFrame):
        self.train_df = train_df
        self.vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), min_df=1, max_features=5000)
        translit_data = train_df["transliteration"].values.astype("U")
        self.tfidf_matrix = self.vectorizer.fit_transform(translit_data)

    def get_few_shot_examples(self, query: str, n_shots: int) -> list[dict]:
        MAX_TOKENS = 5120
        query_vec = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vec, self.tfidf_matrix).flatten()
        top_indices = np.argsort(similarities)[-n_shots * 5:][::-1]
        examples = []
        for idx in top_indices:  
            if len(examples) >= n_shots:
                break
            transliteration = self.train_df.iloc[idx]["transliteration"]
            translation = self.train_df.iloc[idx]["translation"]
            if len(transliteration) / 4 >= MAX_TOKENS or len(translation) / 4 >= MAX_TOKENS:
                continue
            if similarities[idx] > 0.05:
                examples.append({
                    "transliteration": transliteration,
                    "translation": translation,
                })
        return examples

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, default="/kaggle/input/deep-past-initiative-machine-translation")
    parser.add_argument("--out", type=str, default="test_with_few_shot.parquet")
    args = parser.parse_args()
    data_dir = Path(args.data_dir)
    test_path = data_dir / "test.csv"
    train_path = data_dir / "train.csv"
    test_df = pd.read_csv(test_path)
    train_df = pd.read_csv(train_path)
    tgt_col = "translation"
    source_col = "transliteration"
    fsr = FewShotRetriever(train_df) 
    all_docs = []
    for query in tqdm(test_df[source_col].fillna("").astype(str).tolist()):
        docs = fsr.get_few_shot_examples(query, n_shots=1)
        all_docs.append(docs)
    test_df["few_shot"] = all_docs
    test_df.to_parquet(args.out)
if __name__ == "__main__":
    main()

Writing retriever.py


In [8]:
%%writefile submit.py
import os
from pathlib import Path
import argparse
import pandas as pd
from tqdm import tqdm
import time
import requests

SYSTEM_PROMPT = """You are an expert translator of Old Akkadian merchant documents.
Your goal: Produce accurate, complete, scholarly translations matching examples.

Akkadian determinatives in curly brackets:
{d} = dingir ‘god, deity’ — d preceding non-human divine actors
{mul} = ‘stars’ — MUL preceding astronomical bodies and constellations
{ki} = ‘earth’ — KI following a geographical place name or location
{lu₂} = LÚ preceding people and professions
{e₂} = (É) preceding buildings and institutions, such as temples and palaces
{uru} = (URU) preceding names of settlements, such as villages, towns and cities
{kur} = (KUR) preceding lands and territories as well as mountains
{mi} = munus (f) preceding feminine personal names
{m} = (1 or m) preceding masculine personal names
{geš} / {ĝeš} = (GIŠ) preceding trees and things made of wood
{tug₂} = (TÚG) preceding textiles and other woven objects
{dub} = (DUB) preceding clay tablets, and by extension, documents and legal records
{id₂} = (ÍD) (A.ENGUR) preceding names of canals or rivers
{mušen} = (MUŠEN) preceding birds
{na₄} = (na4) preceding stone
{kuš} = (kuš) preceding (animal) skin, fleece, hides
{u₂} = (Ú) preceding plants

CRITICAL RULES:
1. Names: Preserve EXACTLY as shown in transliteration (Ashshur-taklaku, Ennam-Ashshur, Ali-ahum)
2. Units: Use 'mina' (SINGULAR: '2 mina' not 'minas'), 'shekel' for smaller amounts
3. Document: 'tablet' for records, 'letter' for messages
4. Fractions: Use decimals (0.3333, 0.6666) NOT symbols (1/3, 2/3)
5. Completeness: Translate ALL including witness statements, dates, 'witnessed', 'sealed'
6. Tense: Past tense for completed actions (sent, received, delivered, paid)
7. Structure: Keep legal format, periods, commas between items
8. Do NOT use '...'. If text is broken, use <big_gap>
Output: ONLY the complete English translation.
"""

def get_model_id(base_url: str) -> str:
    r = requests.get(f"{base_url}/v1/models", timeout=10)
    r.raise_for_status()
    j = r.json()
    data = j.get("data", [])
    if not data:
        raise RuntimeError(f"No models returned from {base_url}/v1/models: {j}")
    return data[0].get("id") or data[0].get("model") or "default"

def chat_completions(
    base_url: str,
    messages: list[dict],
    model: str,
    temperature: float,
    max_tokens: int,
    top_k: int,
    min_p: float,
    top_p: float,
    presence_penalty: float,
    timeout_s: float,
    retry_loading: bool,
) -> dict:
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "stream": False,
        "max_tokens": max_tokens,
        "top_k": top_k,
        "min_p": min_p,
        "top_p": top_p,
        "presence_penalty": presence_penalty
    }

    deadline = time.time() + timeout_s
    last_err = None
    while time.time() < deadline:
        r = requests.post(url, json=payload, timeout=500)
        if r.status_code in (200, 201):
            return r.json()
        if retry_loading and r.status_code == 503 and ("model is loading" in r.text.lower()):
            last_err = "loading"
            time.sleep(0.5)
            continue
        if r.status_code in (404, 405):
            raise RuntimeError(
                f"{url} returned {r.status_code}. This server may not expose /v1/chat/completions. "
                f"Body head: {r.text[:200]!r}"
            )
        last_err = (r.status_code, r.text[:500])
        r.raise_for_status()
    raise TimeoutError(f"chat_completions timed out within {timeout_s}s. last_err={last_err}")


def few_shot_to_messages(few_shot: list[dict]) -> list[dict]:
    messages: list[dict] = []
    for shot in few_shot:
        src = shot["transliteration"]
        tgt = shot["translation"]
        messages.append({"role": "user", "content": f"Just translate Akkadian(transliteration) to English: {src}"})
        messages.append({"role": "assistant", "content": tgt})
    return messages


def build_messages(system_prompt: str, few_shot: list[dict], text: str) -> list[dict]:
    few_shot_messages = few_shot_to_messages(few_shot)
    if few_shot_messages is None or len(few_shot_messages) == 0:
        # Zero-shot case: parenthesize so the user turn is part of the returned list
        return (
            [{"role": "system", "content": system_prompt}]
            + [{"role": "user","content": f"Don't output superfluous comments. Just translate Akkadian(transliteration) to English: {text}"}]
        )
    return (
        [{"role": "system", "content": system_prompt}]
        + few_shot_messages
        + [{"role": "user","content": f"Just translate Akkadian(transliteration) to English: {text}"}]
    )

def predict_max_token(transliteration_len: int, ratio: float, upper_bound: int) -> int:
    token_len = (transliteration_len / 4)
    pred = min(token_len * ratio, upper_bound)
    return int(pred)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, default="/kaggle/input/deep-past-initiative-machine-translation")
    parser.add_argument("--test_with_few_shot_path", type=str, default="/kaggle/working/test_with_few_shot.parquet")
    parser.add_argument("--max_tokens_ub", type=int, default=2048)
    parser.add_argument("--base_url", type=str, default="http://127.0.0.1:8080")
    parser.add_argument("--out", type=str, default="submission.csv")
    args = parser.parse_args()
    data_dir = Path(args.data_dir)
    #test_path = data_dir / "test.csv"
    sample_sub_path = data_dir / "sample_submission.csv"
    #test_df = pd.read_csv(test_path)
    test_df = pd.read_parquet(args.test_with_few_shot_path)
    sub_df = pd.read_csv(sample_sub_path)
    tgt_col = "translation"
    source_col = "transliteration"
    few_shot_col = "few_shot"
    
    model_id = get_model_id(args.base_url)
    # Translate
    outputs = []
    for i, row in tqdm(test_df.iterrows()):
        text = row[source_col]
        few_shot = row[few_shot_col]
        messages = build_messages(SYSTEM_PROMPT, few_shot, text)
        max_tokens = predict_max_token(len(text), 1.5, args.max_tokens_ub)

        response = chat_completions(
            args.base_url,
            messages=messages,
            model=model_id,
            temperature=0.7,
            max_tokens=max_tokens,
            top_k=20,
            min_p=0.00,
            top_p=0.80,
            presence_penalty=1.0,
            timeout_s=600.0,
            retry_loading=True,
        )
        outputs.append(response["choices"][0]["message"]["content"])

    # Fill submission in the exact column layout of sample_submission
    sub_df[tgt_col] = outputs
    out_path = Path(args.out)
    sub_df.to_csv(out_path, index=False)
    print(f"Saved: {out_path.resolve()}")
    print(sub_df.head())

if __name__ == "__main__":
    main()

Writing submit.py


In [9]:
!python retriever.py --data_dir /kaggle/input/deep-past-initiative-machine-translation --out test_with_few_shot.parquet

100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 63.14it/s]


In [10]:
!python run_server.py

llama_server file: /kaggle/working/llama_server.py
LlamaServer.__init__: (self, server_bin: str, model_path: str, host: str = '127.0.0.1', port: int = 8080, n_gpu_layers: int = 30, n_cpu_moe: int = 19, ctx: int = 7168, parallel: int = 1, threads: int = 4, cram: int = 5120, extra_args=None, workdir: str | None = None, ld_library_path: str | None = None, log_path: str | None = None)
server up: http://127.0.0.1:8080
{'models': [{'name': 'Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf', 'model': 'Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf', 'modified_at': '', 'size': '', 'digest': '', 'type': 'model', 'description': '', 'tags': [''], 'capabilities': ['completion'], 'parameters': '', 'details': {'parent_model': '', 'format': 'gguf', 'family': '', 'families': [''], 'parameter_size': '', 'quantization_level': ''}}], 'object': 'list', 'data': [{'id': 'Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf', 'object': 'model', 'created': 1770807984, 'owned_by': 'llamacpp', 'meta': {'vocab_type': 2, 'n_vocab': 151936

In [11]:
!python submit.py --data_dir /kaggle/input/deep-past-initiative-machine-translation --out submission.csv

4it [01:38, 24.69s/it]
Saved: /kaggle/working/submission.csv
   id                                        translation
0   0  Thus, Kanesh, say to Aqil...: To the payers, o...
1   1  From this day on, whoever buys meteoric iron, ...
2   2  As soon as you have heard our letter, whoever ...
3   3  Send a copy of this letter to every colony and...


In [12]:
!pgrep -f "llama-server" | xargs kill -9
!pgrep -f "run_server.py" | xargs kill -9