# 10.7 • LLMs in Practice (No-Paid-API Edition)

This notebook gives you a **hands-on** way to use LLMs without paying for an API:
- **Local model** with `transformers` (e.g. `distilgpt2`) — works offline after first download.
- **Optional free-tier API** via **Hugging Face Inference** (needs a free token).
- A tiny, transparent **RAG demo** (retrieve → augment → generate).
- **Schema-constrained outputs** checked with Python validation.

Hippo-flavoured prompts included. 🦛

## 0) Setup

> The first run downloads the model weights (a few hundred MB in total for `distilgpt2` and the embedding model). This is normal.

In [None]:
# If running on Colab, uncomment the next line to install dependencies:
# !pip -q install transformers==4.44.2 sentence-transformers==3.0.1 torch --upgrade

import os, json, math, numpy as np
from typing import List, Dict

## 1) Local small model with `transformers` (no API key)

We'll load a small open model (`distilgpt2`) and generate text with different sampling settings.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "distilgpt2"  # small and quick to try
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_local(prompt: str, max_new_tokens=60, temperature=0.8, top_k=50):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_local("Hippo keeper log: Today the hippo felt", max_new_tokens=40))

### Experiment: sampling choices

- Lower temperature (e.g., 0.6) → safer, more repetitive.
- Higher temperature (e.g., 1.2) → more diverse, riskier.
- `top_k` limits sampling to the k most likely tokens at each step.
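
To see why these settings matter, it can help to look at the probabilities directly. The sketch below is not the `transformers` implementation; it applies temperature scaling and a top-k cut to a made-up vector of logits and prints the resulting next-token distribution. The names `toy_logits` and `sampling_distribution` are illustrative, not library functions.

```python
import numpy as np

def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Return next-token probabilities after temperature scaling and an optional top-k cut."""
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None and top_k < scaled.size:
        # Keep the k largest logits; all others get zero probability.
        # (Ties at the cutoff value would all be kept in this simple version.)
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
for temp in [0.6, 0.9, 1.2]:
    probs = sampling_distribution(toy_logits, temperature=temp, top_k=3)
    print(f"temperature={temp}:", np.round(probs, 3))
```

Lower temperatures concentrate probability on the highest-scoring token, while `top_k=3` removes the least likely token from consideration at every temperature.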

In [None]:
for temp in [0.6, 0.9, 1.2]:
    print(f"\n--- temperature={temp} ---")
    print(generate_local("Hippo diet note: The riverbank forage was", max_new_tokens=40, temperature=temp, top_k=40))

## 2) Optional: Free-tier API via Hugging Face Inference

If you create a free account at **huggingface.co**, you can obtain a token and set it as an environment variable `HF_TOKEN`.

- Model endpoint examples: `tiiuae/falcon-7b-instruct`, `google/gemma-2b-it`, etc. (availability may vary).
- **No token?** The cell below skips the API call and returns setup instructions instead.

> This is useful for showing students a hosted model workflow without paid credit.

In [None]:
import os, requests

HF_TOKEN = os.environ.get("HF_TOKEN", "").strip()
HF_MODEL = os.environ.get("HF_MODEL", "google/gemma-2-2b-it")  # small instruct model; change if unavailable

def call_hf_inference(prompt: str, max_new_tokens=128, temperature=0.8):
    if not HF_TOKEN:
        return "HF_TOKEN not set. Create a free token at huggingface.co, then: export HF_TOKEN=hf_xxx"
    # Inference endpoint (text-generation):
    url = f"https://api-inference.huggingface.co/models/{HF_MODEL}"
    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature, "return_full_text": False},
    }
    r = requests.post(url, headers=headers, json=payload, timeout=60)
    if r.status_code != 200:
        return f"HF error {r.status_code}: {r.text[:2000]}"
    try:
        out = r.json()
        if isinstance(out, list) and len(out)>0 and "generated_text" in out[0]:
            return out[0]["generated_text"]
        return json.dumps(out, indent=2)[:2000]
    except Exception as e:
        return f"Parse error: {e}\nRaw: {r.text[:500]}"

demo_prompt = (
    "System: You are a concise nutrition assistant for a zoo hippo team.\n"
    "User: Summarise three key checks for a daily hippo health round.\n"
)

print(call_hf_inference(demo_prompt))

## 3) A tiny Retrieval-Augmented Generation (RAG) demo

We’ll create a **mini corpus** of short notes (hippo care + nutrition), embed them with `sentence-transformers`, 
retrieve the top-3 most similar passages, and then **compose** a grounded prompt for the local model.

> This demonstrates the workflow: **retrieve first, then generate**.

In [None]:
# If sentence-transformers is not installed, see the install cell in Setup
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Hippos often graze at night; daytime wallowing helps thermoregulation.",
    "Dental checks should examine tusk wear, gum inflammation, and food impaction.",
    "Forage quality affects chewing time and saliva buffering, influencing dental health.",
    "Sudden changes in water salinity can alter drinking behaviour and stress.",
    "Keeper logs should include appetite, activity, and social interactions.",
]
queries = [
    "What should we check during a hippo dental health round?",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
E_corpus = embedder.encode(corpus, normalize_embeddings=True)
E_query  = embedder.encode(queries, normalize_embeddings=True)

def retrieve(query_vec, k=3):
    scores = (E_corpus @ query_vec)  # cosine similarity since normalized
    idx = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in idx]

top = retrieve(E_query[0], k=3)
top
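
The `retrieve` function relies on the fact that the dot product of unit-length vectors equals their cosine similarity. A minimal check of that claim on made-up 3-dimensional vectors follows; the names `toy_corpus`, `toy_query`, `l2_normalize`, and `cosine_scores` are hypothetical and stand in for real embeddings.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([2.0, 1.0, 0.0])

def l2_normalize(x):
    """Scale each vector to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_scores(corpus_vecs, query_vec):
    """Cosine similarity computed directly from raw (unnormalized) vectors."""
    return (corpus_vecs @ query_vec) / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
    )

# Dot product of normalized vectors matches the explicit cosine formula.
dot_scores = l2_normalize(toy_corpus) @ l2_normalize(toy_query)
assert np.allclose(dot_scores, cosine_scores(toy_corpus, toy_query))

ranking = np.argsort(-dot_scores)  # indices of corpus rows, best match first
print(ranking, np.round(dot_scores, 3))
```

Normalizing once at encoding time, as the cell above does with `normalize_embeddings=True`, means each query needs only a single matrix-vector product.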

### Compose a grounded prompt and generate (local model)

We build a prompt that **quotes** the retrieved passages and asks the local model for a **succinct, structured** answer.

In [None]:
context_lines = "\n".join([f"- {t[0]}" for t in top])
grounded_prompt = f"""Use ONLY the following notes to answer the question. Quote specific checks.

Notes:
{context_lines}

Question: List 3 essential checks for a hippo dental round in bullet points.
Answer:
"""

print(grounded_prompt)
print("\n--- Generated (local model) ---\n")
print(generate_local(grounded_prompt, max_new_tokens=80, temperature=0.7, top_k=40))

## 4) Schema-constrained output (JSON) and validation

For real systems, ask the model to output **JSON with required fields**, then validate it.

In [None]:
import json, re

json_prompt = (
    "Produce a JSON object with fields: 'checks' (list of 3 short strings), "
    "'priority' (one of: 'low','medium','high').\n"
    "Use the notes above; if unclear, choose 'medium'.\n"
    "JSON only, no extra text.\n"
)

candidate = generate_local(grounded_prompt + "\n" + json_prompt, max_new_tokens=120, temperature=0.7, top_k=50)

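# Extract JSON with a simple heuristic: everything from the first '{' to the last '}'
# (the pattern is greedy, so several separate {...} fragments are captured as one span)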
match = re.search(r'\{[\s\S]*\}', candidate)
parsed = None
if match:
    try:
        parsed = json.loads(match.group(0))
    except Exception as e:
        parsed = {"error": f"JSON parse failed: {e}", "raw": candidate[:500]}
else:
    parsed = {"error": "No JSON object found.", "raw": candidate[:500]}

# Simple validation
def validate(payload: dict) -> Dict[str, str]:
    if not isinstance(payload, dict): return {"ok": "false", "reason": "not a dict"}
    if "checks" not in payload or not isinstance(payload["checks"], list) or len(payload["checks"]) != 3:
        return {"ok":"false", "reason":"checks must be a list of length 3"}
    if "priority" not in payload or payload["priority"] not in {"low","medium","high"}:
        return {"ok":"false", "reason":"priority invalid"}
    return {"ok":"true"}

print("RAW CANDIDATE:\n", candidate[:500], "\n")
print("PARSED:", parsed)
print("VALIDATION:", validate(parsed))
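
The regex heuristic spans from the first `{` to the last `}`, so parsing fails when the model emits more than one brace-delimited fragment. One possible alternative, sketched here under the assumption that the first parseable object is the one you want, uses the standard library's `json.JSONDecoder.raw_decode` to try each `{` in turn. `first_json_object` is a hypothetical helper, not part of the notebook's original code.

```python
import json

def first_json_object(text: str):
    """Return the first substring of `text` that parses as a JSON object, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)  # try the next opening brace
    return None

# Hypothetical model output with chatter and a stray brace pair after the JSON.
sample = (
    'Sure! {"checks": ["tusk wear", "gum inflammation", "food impaction"], '
    '"priority": "medium"} Hope that {helps}.'
)
print(first_json_object(sample))
```

The result can be passed to `validate` in the same way as `parsed`; a `None` return corresponds to the "No JSON object found" case.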

## 5) Summary

- **Local models** (e.g., `distilgpt2`) let you teach LLM mechanics without paid APIs.
- **Hugging Face Inference** provides a free-tier API with a token; good for demos.
- **RAG** grounds answers in your corpus; show students how retrieval shapes outputs.
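- **JSON schemas** (even simple checks) catch malformed or off-schema outputs before they reach downstream code; they do not by themselves verify that the content is factually correct.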

**Exercises**
1. Swap `distilgpt2` for `microsoft/phi-2` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (may need more RAM). Compare output quality and speed.
2. Change the retrieval corpus to short **nutrition RCT abstracts**. Does the answer become more evidence-based?
3. Extend validation to enforce that each `checks` item contains at least one **dental term** (`gum`, `tusk`, `impaction`, `inflammation`). Reject if not met.