# Goblin Assistant — RAG Prototype and Prompt Engineering Sandbox

This Colab notebook prototypes the full Goblin Assistant workflow:

1. Embed documentation
2. Vector search (FAISS/Chroma)
3. Retrieval-Augmented Generation (RAG)
4. LLM answer generation (OpenAI / HF fallback)
5. Automated and manual evaluation (ROUGE/BLEU, provenance checks)
6. Iteration harness to tune prompts and retrieval
7. Prompt engineering playground (system, routing, persona, CoT, tool schemas)

Run this notebook in Google Colab for an interactive sandbox. The notebook includes a small built-in sample dataset so you can run end-to-end without external files.

---

## About This Colab (Demo / R&D Lab)

This is a demo environment — not production. Think of it like a pop-up shop for your AI: a place to prototype, experiment, and record demos quickly and safely. Typical uses include:

- Show people what Goblin can do with short, repeatable demos
- Test and validate the RAG pipeline (embeddings, indexing, retrieval)
- Prototype UI ideas before touching the real frontend
- Experiment with model swaps, prompt engineering, embeddings, and indexing strategies
- Record demos for your site or portfolio and gather examples
- Run quick experiments without provisioning long-lived or costly servers

Use this notebook as a short-lived sandbox — iterate fast, collect learnings, then port stable patterns back into the backend code paths and the real frontend when they’re ready for production.

In [None]:
# 1) Install & Setup (run this cell in Colab)

# Use -q to keep notebook output small; the extras spec is quoted so the shell cannot glob-expand [torch]
!pip install -q sentence-transformers faiss-cpu openai "transformers[torch]" datasets rouge-score sacrebleu

# Optional: chromadb if you prefer
!pip install -q chromadb

print('Install complete. If you are running in local Python, ensure packages are installed in your environment.')


In [None]:
# 2) Import Required Libraries

from typing import List, Dict, Tuple, Optional
import os
import time
from dataclasses import dataclass

# Embedding & search
from sentence_transformers import SentenceTransformer
import numpy as np

# FAISS
import faiss

# LLM calls
import openai
from transformers import pipeline

# Evaluation
from rouge_score import rouge_scorer
import sacrebleu

print('Libraries imported')


In [None]:
# 3) Sample documentation (small dataset included so notebook runs end-to-end)

SAMPLE_DOCS = [
    {
        'id': 'doc1',
        'title': 'Getting started with Goblin Assistant',
        'text': """
Goblin Assistant is an experimental AI assistant built for developer workflows. It accepts user queries, fetches context from documentation, and responds using a retrieval-augmented LLM. Use environment variables to configure backend endpoints.
""",
    },
    {
        'id': 'doc2',
        'title': 'Installation',
        'text': """
To install Goblin Assistant, clone the repository and run `pnpm install` for frontend and `pip install -r requirements.txt` for backend. After installing, set environment variables like VITE_FASTAPI_URL.
""",
    },
    {
        'id': 'doc3',
        'title': 'Configuration and API',
        'text': """
The API exposes endpoints at /api/ai and /api/settings. Authentication uses API keys stored in environment variables.
""",
    },
    {
        'id': 'doc4',
        'title': 'Limitations',
        'text': """
The Hobby plan on Vercel limits serverless functions. For production, use external backend services like Render or consolidate functions.
""",
    },
    {
        'id': 'doc5',
        'title': 'Best practices',
        'text': """
When building RAG, chunk long documents, include provenance in responses, and evaluate for hallucinations.
""",
    },
]

print(f'Sample docs loaded: {len(SAMPLE_DOCS)} docs')


In [None]:
# 4) Chunking & Embedding utilities

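def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size whitespace tokens that overlap by `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would give a non-positive stride and never terminate
        raise ValueError('Require chunk_size > 0 and 0 <= overlap < chunk_size')
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # stride between consecutive chunk starts
    for start in range(0, len(tokens), step):
        chunks.append(' '.join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end; skip a redundant tail chunk
    return chunks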

# Initialize embedding model (sentence-transformers)
print('Loading embedding model...')
embedder = SentenceTransformer('all-MiniLM-L6-v2')
print('Embedding model loaded')

# Prepare corpus: chunk docs and compute embeddings
corpus_texts = []
corpus_meta = []
for d in SAMPLE_DOCS:
    chunks = chunk_text(d['text'], chunk_size=60, overlap=10)
    for idx, c in enumerate(chunks):
        meta = {
            'doc_id': d['id'],
            'title': d['title'],
            'chunk_index': idx,
        }
        corpus_texts.append(c)
        corpus_meta.append(meta)

print(f'Corpus prepared with {len(corpus_texts)} chunks')

# Compute embeddings
corpus_embeddings = embedder.encode(corpus_texts, show_progress_bar=True, convert_to_numpy=True)
print('Embeddings computed')
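
The windowing arithmetic behind the chunker can be checked without loading any model. The sketch below is illustrative only: `window_spans` is a hypothetical helper (not part of the notebook) that returns the token spans an overlapping chunker would cover, assuming it stops once a window reaches the last token.

```python
def window_spans(n_tokens: int, chunk_size: int, overlap: int):
    """Return (start, end) token spans for overlapping windows (end is exclusive)."""
    step = chunk_size - overlap  # each window starts `step` tokens after the previous one
    spans = []
    for start in range(0, n_tokens, step):
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last token is covered; no further window is needed
    return spans

# 130 tokens with the settings used for the sample docs (chunk_size=60, overlap=10)
spans = window_spans(130, 60, 10)
print(spans)
```

Each span shares its first 10 tokens with the end of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either chunk.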


In [None]:
# 5) Build FAISS index and retrieval function

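# Build a flat FAISS index over the raw embeddings. IndexFlatL2 returns squared L2 distances
# (lower = closer); for cosine similarity, L2-normalize the vectors and use IndexFlatIP instead.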
embedding_dim = corpus_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(corpus_embeddings)
print(f'FAISS index built with {index.ntotal} vectors')

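def retrieve(query: str, k: int = 3) -> List[Tuple[float, Dict, str]]:
    """Return up to k (score, metadata, chunk text) tuples, best match first.

    The score is whatever the index reports (squared L2 distance for IndexFlatL2).
    """
    q_emb = embedder.encode([query], convert_to_numpy=True)
    k = min(k, index.ntotal)  # FAISS pads missing hits with index -1
    D, I = index.search(q_emb, k)
    results = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:
            continue  # padding entry; corpus_meta[-1] would silently return the last chunk
        meta = corpus_meta[idx]
        text = corpus_texts[idx]
        results.append((float(score), meta, text))
    return results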

# Quick test
print('Retrieving for query: "How to install Goblin Assistant"')
print(retrieve('How to install Goblin Assistant', k=3))
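
Why normalization matters for the index choice: for unit-length vectors, squared L2 distance and cosine similarity carry the same ranking information, because ||a - b||^2 = 2 - 2 cos(a, b). The following is a small pure-Python check of that identity; the helper names and the two vectors are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product of the unit-length versions of a and b."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 0.5, 1.0]

lhs = squared_l2(normalize(a), normalize(b))
rhs = 2.0 - 2.0 * cosine(a, b)
print(lhs, rhs)  # the two values agree up to floating-point error
```

On raw (unnormalized) embeddings the identity does not hold, so L2 scores from the index above should be read as distances, not as similarities.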


In [None]:
# 6) RAG prompt templates & prompt engineering examples

SYSTEM_PROMPT = """
You are Goblin Assistant, a concise, developer-focused AI assistant. Use the provided context and cite sources by title when making factual claims. If the answer is not contained in the context, say you don't know and suggest how to find it.
"""

ROUTING_PROMPT = """
If the user query contains keywords like 'deploy', 'build', 'install', route to the deployment instructions tool. Otherwise, use the documentation context.
"""

PERSONA_PROMPT = """
Respond as an experienced developer coach: brief, concrete steps, code snippets when helpful, and safe defaults. Use a friendly but professional tone.
"""

COT_INSTRUCTION = """
When dealing with multi-step problems, think step-by-step and enumerate actions. Start by summarizing what you will do.
"""

# Tool schema example for structured invocations (a simplified dict, not formal JSON Schema)
TOOL_SCHEMA = {
    'name': 'deploy_tool',
    'description': 'Performs deployment steps for project',
    'inputs': {
        'environment': 'staging|production',
        'build_command': 'string',
        'env_vars': 'dict'
    }
}

# Example RAG prompt builder
def build_rag_prompt(query: str, retrieved: List[Tuple[float, Dict, str]], system=SYSTEM_PROMPT, persona=PERSONA_PROMPT, cot=COT_INSTRUCTION) -> str:
    sources = []
    for score, meta, text in retrieved:
        sources.append(f"- {meta['title']} (doc={meta['doc_id']}, chunk={meta['chunk_index']})\n{text[:500]}")
    context_block = '\n\n'.join(sources)
    prompt = f"{system}\n{persona}\n{cot}\n\nCONTEXT:\n{context_block}\n\nUSER QUERY:\n{query}\n\nANSWER:" 
    return prompt

print('Prompt templates ready')
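
`ROUTING_PROMPT` describes routing in natural language only; nothing in this notebook executes it. As a rough sketch of the same rule in code, assuming whole-word keyword matching is good enough for a demo, a router could look like this. `ROUTE_KEYWORDS`, `route_query`, and the `'docs'` label are hypothetical names; `'deploy_tool'` matches the `name` field of `TOOL_SCHEMA`.

```python
import re

# Keywords taken from ROUTING_PROMPT; the tuple name itself is an assumption
ROUTE_KEYWORDS = ("deploy", "build", "install")

def route_query(query: str) -> str:
    """Return 'deploy_tool' if the query contains a routing keyword, else 'docs'."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & set(ROUTE_KEYWORDS):
        return "deploy_tool"
    return "docs"

print(route_query("How do I install Goblin Assistant?"))
print(route_query("Which endpoints does the API expose?"))
```

Exact word matching misses inflections such as "installing"; stemming or an LLM-based classifier would be the next step if this rule proves too brittle.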


In [None]:
# 7) LLM call(s): OpenAI primary, HF fallback

# Configure OpenAI key if available
OPENAI_KEY = os.environ.get('OPENAI_API_KEY')
if OPENAI_KEY:
    openai.api_key = OPENAI_KEY


def call_openai(prompt: str, model: str = 'gpt-4o-mini', temperature: float = 0.0, max_tokens: int = 512) -> str:
    if not OPENAI_KEY:
        raise RuntimeError('OPENAI_API_KEY not set')
    # openai>=1.0 removed openai.ChatCompletion; requests go through a client object
    client = openai.OpenAI(api_key=OPENAI_KEY)
    resp = client.chat.completions.create(
        model=model,
        messages=[{'role': 'system', 'content': SYSTEM_PROMPT}, {'role': 'user', 'content': prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

# HF fallback using a small T5/flan model for test/development
hf_pipe = None
try:
    hf_pipe = pipeline('text2text-generation', model='google/flan-t5-small')
except Exception as e:
    print('HF pipeline not available locally:', e)


def call_hf(prompt: str, max_length: int = 256) -> str:
    if not hf_pipe:
        raise RuntimeError('HF pipeline not available. Install model or set OPENAI_API_KEY')
    out = hf_pipe(prompt, max_length=max_length, do_sample=False)
    return out[0]['generated_text']

print('LLM call functions ready (OpenAI if key present, HF fallback otherwise)')


In [None]:
# 8) RAG driver — retrieve, build prompt, call LLM, return answer with provenance

def rag_answer(query: str, k: int = 3, use_openai: bool = True) -> Dict:
    retrieved = retrieve(query, k=k)
    prompt = build_rag_prompt(query, retrieved)
    answer = None
    if use_openai and OPENAI_KEY:
        try:
            answer = call_openai(prompt)
        except Exception as e:
            # simple error recovery: fall through to the HF fallback below
            print('OpenAI call failed, falling back to HF:', e)
    if answer is None:
        try:
            answer = call_hf(prompt)
        except Exception as e:
            # catch HF failures too, so a broken fallback cannot crash the driver
            print('HF call failed:', e)
            answer = 'LLM unavailable. Please set OPENAI_API_KEY or install HF model.'
    return {
        'query': query,
        'answer': answer,
        'retrieved': retrieved,
    }

# Quick RAG test
res = rag_answer('How do I install Goblin Assistant?', k=2, use_openai=False)
print('Answer:')
print(res['answer'])
print('\nSources:')
for s in res['retrieved']:
    print('-', s[1]['title'], f"(score={s[0]:.4f})")


In [None]:
# 9) Evaluation harness: ROUGE, BLEU, and simple provenance checks

scorer = rouge_scorer.RougeScorer(['rouge1','rougeL'], use_stemmer=True)


def evaluate_single(reference: str, candidate: str, retrieved: List[Tuple[float, Dict, str]]) -> Dict:
    rouge = scorer.score(reference, candidate)
    bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
    # provenance check: ensure any doc title appears in candidate (very simple heuristic)
    provenance_ok = any(meta['title'] in candidate for _, meta, _ in retrieved)
    return {
        'rouge1': rouge['rouge1'].fmeasure,
        'rougeL': rouge['rougeL'].fmeasure,
        'bleu': bleu,
        'provenance_ok': provenance_ok
    }

# Example evaluation with a fake reference
test_ref = 'To install Goblin Assistant, clone the repo and run pnpm install for frontend and pip install -r requirements.txt for backend.'
res = rag_answer('How do I install Goblin Assistant?', k=2, use_openai=False)
eval_res = evaluate_single(test_ref, res['answer'], res['retrieved'])
print('Evaluation:', eval_res)
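
To make the `rouge1` number less of a black box, here is a minimal sketch of what a ROUGE-1 F-measure computes: clipped unigram overlap turned into precision, recall, and their harmonic mean. It is a simplification and not the `rouge_score` implementation (no stemming, and a naive lowercase alphanumeric tokenizer); `unigram_f1` is a hypothetical helper name.

```python
import re
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """F-measure over clipped unigram overlap between reference and candidate."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())  # each token counts at most min(ref, cand) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = unigram_f1(
    "clone the repo and run pnpm install",
    "run pnpm install after you clone the repo",
)
print(round(score, 3))  # 6 shared tokens, 7 reference tokens, 8 candidate tokens -> 0.8
```

Because the metric ignores word order, the reordered candidate above still scores well; `rougeL` (longest common subsequence) is the one that penalizes reordering.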


In [None]:
# 10) Iteration harness: sweep prompts and retrieval k

from itertools import product

prompt_variants = [
    {'persona': PERSONA_PROMPT, 'cot': COT_INSTRUCTION},
    {'persona': PERSONA_PROMPT + '\nKeep answers ultra concise (2-3 lines).', 'cot': COT_INSTRUCTION},
    {'persona': PERSONA_PROMPT, 'cot': COT_INSTRUCTION + '\nProvide enumerated steps.'}
]

ks = [1,2,3]

# Example small dataset of queries + refs
TEST_SET = [
    {'q': 'How to install Goblin Assistant?', 'ref': test_ref},
]

results = []
for (variant_idx, variant), k in product(enumerate(prompt_variants), ks):
    # pass this variant's persona/CoT text straight to the prompt builder (globals stay unchanged)
    for t in TEST_SET:
        retrieved = retrieve(t['q'], k=k)
        prompt = build_rag_prompt(t['q'], retrieved, persona=variant['persona'], cot=variant['cot'])
        try:
            answer = call_hf(prompt)
        except Exception:
            answer = 'LLM unavailable in notebook environment.'
        metrics = evaluate_single(t['ref'], answer, retrieved)
        # persona[:60] is identical across variants, so record the variant index as well
        results.append({'variant': variant_idx, 'persona': variant['persona'][:60], 'k': k, 'metrics': metrics, 'answer': answer})

# Show results
import pandas as pd
df = pd.DataFrame([{'variant': r['variant'], 'k': r['k'], 'rouge1': r['metrics']['rouge1'], 'bleu': r['metrics']['bleu'], 'provenance_ok': r['metrics']['provenance_ok']} for r in results])
print(df)


In [None]:
# 11) Prompt engineering playground (editable templates)

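# Editable templates: change these strings, then re-run a cell that reads them
SYSTEM = SYSTEM_PROMPT
PERSONA = PERSONA_PROMPT
ROUTING = ROUTING_PROMPT
COT = COT_INSTRUCTION

# Note: rag_answer() uses the *_PROMPT defaults from section 6. PERSONA and COT are read by
# compare_models_on_query() (section 15); SYSTEM and ROUTING are not wired into any call yet.
print('Playground ready. Edit PERSONA/COT and re-run compare_models_on_query() to test.')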


In [None]:
# 12) Error recovery logic & retry wrapper

import functools
import random

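def retry(func=None, *, retries=3, backoff=1.0):
    """Retry with exponential backoff; usable as @retry or @retry(retries=..., backoff=...)."""
    def deco(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return f(*args, **kwargs)
                except Exception as e:
                    attempt += 1
                    if attempt > retries:
                        raise
                    # exponential backoff plus a little jitter
                    sleep = backoff * (2 ** (attempt-1)) + random.random()*0.1
                    print(f'Retrying after error: {e}. Sleeping {sleep:.2f}s')
                    time.sleep(sleep)
        return wrapper
    # bare @retry passes the function as `func`; @retry(...) leaves it as None
    return deco(func) if callable(func) else deco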

# Example usage for calling LLM
@retry(retries=3, backoff=1.0)
def safe_call_llm(prompt: str):
    if OPENAI_KEY:
        return call_openai(prompt)
    else:
        return call_hf(prompt)

print('Retry wrapper ready')
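
The backoff formula in the wrapper, `backoff * 2 ** (attempt - 1)` plus up to 0.1 s of jitter, determines how long a persistently failing call blocks the notebook. The sketch below lists the deterministic part of the schedule; `backoff_delays` is a hypothetical helper used only for this illustration.

```python
def backoff_delays(retries: int = 3, backoff: float = 1.0):
    """Sleep durations (without jitter) before each retry, in order."""
    return [backoff * (2 ** (attempt - 1)) for attempt in range(1, retries + 1)]

delays = backoff_delays(retries=3, backoff=1.0)
print(delays, sum(delays))
```

With the defaults, a call that never succeeds sleeps about 7 seconds in total (plus jitter) before the final exception is re-raised, in addition to the time spent in the four attempts themselves.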


## How to run in Colab

1) Upload this notebook to Colab or open using:

https://colab.research.google.com/github/fuaadabdullah/forgemono/blob/main/tools/goblin_assistant_rag_prototype.ipynb

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fuaadabdullah/forgemono/blob/main/tools/goblin_assistant_rag_prototype.ipynb)

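2) Set secrets: open the Secrets panel (key icon in the left sidebar), add `OPENAI_API_KEY` (and optionally `HF_TOKEN`), and grant this notebook access. Copy them into environment variables before running section 7, or load them from a file on Drive using `google.colab`.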

> Note: This notebook is a demo environment (R&D sandbox). Do not treat it as production. Use it for quick experiments, prototyping, and demos — then port validated patterns to the backend/frontend for production use.
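
One possible way to wire step 2 into the code cells is sketched below. `google.colab.userdata.get` is Colab's accessor for the Secrets panel; `load_secret` is a hypothetical helper, and the fallback to plain environment variables is an assumption so the same cell also runs outside Colab.

```python
import os

def load_secret(name: str):
    """Return a secret from Colab's Secrets panel, else from the environment, else None."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        value = userdata.get(name)
    except Exception:
        # secret missing or notebook access not granted
        value = None
    if value:
        os.environ[name] = value  # later cells read os.environ
    return value or os.environ.get(name)

for secret_name in ("OPENAI_API_KEY", "HF_TOKEN"):
    print(secret_name, "set" if load_secret(secret_name) else "not set")
```

Run this before the cell that reads `OPENAI_API_KEY`, since that cell captures the value once at execution time.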

In [None]:
# 14) Model registry, loader & unified LLM call function for multi-model testing

# Registry of available models — extend as needed
MODEL_REGISTRY = {
    'openai:gpt-4o-mini': {
        'type': 'openai',
        'model': 'gpt-4o-mini',
        'desc': 'OpenAI GPT-4o-mini (requires API key)'
    },
    'openai:gpt-3.5-turbo': {
        'type': 'openai',
        'model': 'gpt-3.5-turbo',
        'desc': 'OpenAI 3.5 turbo (cheaper fallback)'
    },
    'hf:flan-t5-small': {
        'type': 'hf',
        'model': 'google/flan-t5-small',
        'desc': 'HF Flan-T5 small (local or HF Inference API)'
    },
    # Add examples if you have API access or local weights for larger models
    # 'hf:llama-2-7b': {...}
}

# Internal cache to store loaded HF pipelines
LOADED_HF_PIPES = {}


def load_model_if_needed(model_id: str):
    conf = MODEL_REGISTRY.get(model_id)
    if not conf:
        raise ValueError(f'Model {model_id} not found in registry')
    if conf['type'] == 'hf':
        if model_id in LOADED_HF_PIPES:
            return LOADED_HF_PIPES[model_id]
        try:
            pipe = pipeline('text2text-generation', model=conf['model'])
            LOADED_HF_PIPES[model_id] = pipe
            return pipe
        except Exception as e:
            print(f"Failed to load HF model {conf['model']}:", e)
            # Fallback: try to load from Hugging Face inference API with requests if configured
            return None
    if conf['type'] == 'openai':
        # nothing to load locally; rely on OPENAI_API_KEY being set
        return conf['model']
    return None


def call_model(model_id: str, prompt: str, **kwargs) -> str:
    conf = MODEL_REGISTRY.get(model_id)
    if not conf:
        raise ValueError(f'Model {model_id} not registered')
    if conf['type'] == 'openai':
        if not OPENAI_KEY:
            raise RuntimeError('OPENAI_API_KEY not set for OpenAI model')
        model = conf['model']
        # Use the common call_openai we already created
        return call_openai(prompt=prompt, model=model, temperature=kwargs.get('temperature',0.0), max_tokens=kwargs.get('max_tokens',512))
    elif conf['type'] == 'hf':
        model_pipe = load_model_if_needed(model_id)
        if not model_pipe:
            # If local pipeline fails, attempt to call HF Inference API (if HF_TOKEN available)
            hf_token = os.environ.get('HF_TOKEN')
            if not hf_token:
                raise RuntimeError('HF pipeline not available and HF_TOKEN not provided for Inference API fallback')
            # HF Inference API call
            import requests
            hf_api_url = f'https://api-inference.huggingface.co/models/{conf["model"]}'
            headers = {'Authorization': f'Bearer {hf_token}'}
            payload = {"inputs": prompt, "options": {"wait_for_model": True}, "parameters": {"max_new_tokens": kwargs.get('max_tokens',256)}}
            r = requests.post(hf_api_url, json=payload, headers=headers, timeout=120)  # timeout so a stalled request cannot hang the cell
            if r.status_code != 200:
                raise RuntimeError(f'HF Inference API failed: {r.status_code} {r.text}')
            out = r.json()
            # HF inference API returns a list of results by default
            if isinstance(out, list):
                return out[0].get('generated_text', '')
            return out.get('generated_text', '')
        # Local HF pipeline
        out = model_pipe(prompt, max_length=kwargs.get('max_tokens',256), do_sample=False)
        if isinstance(out, list):
            return out[0].get('generated_text', '')
        return out
    else:
        raise ValueError('Unsupported model type')

print('Model registry & loader ready. Registered models:')
for mid, cfg in MODEL_REGISTRY.items():
    print('-', mid, cfg['desc'])

# Quick check that the HF fallback model loads (used when OpenAI is not available)
if 'hf:flan-t5-small' in MODEL_REGISTRY:
    # load_model_if_needed returns None (it does not raise) when the local load fails
    if load_model_if_needed('hf:flan-t5-small') is not None:
        print('Local HF model flan-t5-small ready for calls')
    else:
        print('HF model not available locally; call_model will try the Inference API if HF_TOKEN is set')


In [None]:
# 15) Multi-model RAG comparison helper

from typing import Any


def compare_models_on_query(query: str, model_ids: List[str], k: int = 3, reference: Optional[str] = None) -> List[Dict[str, Any]]:
    """Run RAG for each model in model_ids and return results with evaluations.

    Metrics are scored against `reference`, which defaults to test_ref (the install
    answer); pass a matching reference when you compare models on other queries.
    """
    reference = test_ref if reference is None else reference
    # Retrieval and the prompt do not depend on the model, so build them once
    retrieved = retrieve(query, k=k)
    prompt = build_rag_prompt(query, retrieved, persona=PERSONA, cot=COT)
    outputs = []
    for model_id in model_ids:
        print(f'Running model: {model_id} (retrieval k={k})')
        try:
            answer = call_model(model_id, prompt)
        except Exception as e:
            answer = f'ERROR: {e}'
        metrics = evaluate_single(reference, answer, retrieved)
        outputs.append({'model_id': model_id, 'answer': answer, 'metrics': metrics, 'retrieved': retrieved})
    return outputs

# Example: compare two models on the test query
compare_result = compare_models_on_query('How do I install Goblin Assistant?', ['hf:flan-t5-small', 'openai:gpt-3.5-turbo'], k=2)
for r in compare_result:
    print('\n---')
    print('Model:', r['model_id'])
    print('Answer:', r['answer'])
    print('Metrics:', r['metrics'])
    print('Sources:', [s[1]['title'] for s in r['retrieved']])


In [None]:
# 16) Simple IPyWidgets UI (optional: run in Colab/local to interactively select models)

try:
    from ipywidgets import SelectMultiple, Button, HBox, VBox, Text, Output, IntSlider
    from IPython.display import display

    available_models = list(MODEL_REGISTRY.keys())
    model_selector = SelectMultiple(options=available_models, value=[available_models[0]], description='Models')
    query_input = Text(value='How do I install Goblin Assistant?', description='Query')
    k_slider = IntSlider(value=2, min=1, max=5, description='k')
    run_btn = Button(description='Run Comparison', button_style='primary')
    out = Output()

    def on_run(b):
        with out:
            out.clear_output()
            selected = list(model_selector.value)
            q = query_input.value
            k = k_slider.value
            res = compare_models_on_query(q, selected, k=k)
            for r in res:
                print('---')
                print('Model:', r['model_id'])
                print('Answer:')
                print(r['answer'])
                print('Metrics:', r['metrics'])
                print('Sources:', [s[1]['title'] for s in r['retrieved']])

    run_btn.on_click(on_run)

    ui = VBox([query_input, HBox([model_selector, k_slider]), run_btn, out])
    display(ui)

except Exception as e:
    print('IPyWidgets not available or failed to render:', e)


In [None]:
# 17) End-to-end prototype workflow

from typing import List, Dict, Any

# Utility: normalize vectors for inner product similarity
def normalize_vectors(vs: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(vs, axis=1, keepdims=True) + 1e-10
    return vs / norms


def build_faiss_index(docs: List[Dict[str, Any]], embed_model: SentenceTransformer):
    """Embed docs and build a FAISS index (Inner Product on normalized vectors).

    Returns: index, embeddings, id_map
    """
    texts = [d["text"] for d in docs]
    embeddings = embed_model.encode(texts, convert_to_numpy=True)
    embeddings = normalize_vectors(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)  # inner product on unit-length vectors equals cosine similarity
    index.add(embeddings)

    id_map = [d["id"] for d in docs]
    return index, embeddings, id_map


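def search_index(query: str, index, embed_model: SentenceTransformer, id_map: List[str], docs: List[Dict[str, Any]], top_k: int = 3):
    """Return up to top_k docs as dicts with cosine score, id, text, and title."""
    qv = embed_model.encode([query], convert_to_numpy=True)
    qv = normalize_vectors(qv)
    top_k = min(top_k, index.ntotal)  # FAISS pads missing hits with index -1
    D, I = index.search(qv, top_k)
    results = []
    for s, idx in zip(D[0], I[0]):
        if idx < 0:
            continue  # padding entry; id_map[-1] would silently repeat the last doc
        results.append({
            "score": float(s),
            "id": id_map[int(idx)],
            "text": docs[int(idx)]["text"],
            "title": docs[int(idx)].get("title", "")
        })
    return results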


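def build_rag_prompt(query: str, retrieved: List[Any], template: Optional[str] = None, persona: str = "", cot: str = "") -> str:
    """Join the retrieved text into a RAG-aware prompt.

    The template can be used to configure system instructions.

    Note: this redefines build_rag_prompt from section 6. So that earlier cells
    (sections 8, 10, 15, 16) still work when re-run afterwards, it also accepts
    (score, meta, text) tuples and the optional persona/cot strings.
    """
    blocks = []
    for r in retrieved:
        if isinstance(r, dict):  # result shape from search_index
            blocks.append(f"[{r['id']}] {r['title']}\n{r['text']}")
        else:  # (score, meta, text) tuple from retrieve()
            _score, meta, text = r
            blocks.append(f"[{meta['doc_id']}] {meta['title']}\n{text}")
    ctx = "\n\n---\n".join(blocks)
    if template is None:
        system = (
            "You are Goblin Assistant. Use only the given context to answer the user's question. "
            "Be concise and cite the source IDs you used.\n\n"
        )
    else:
        system = template + "\n\n"
    extras = "".join(p.strip() + "\n\n" for p in (persona, cot) if p)

    prompt = f"{system}{extras}Context:\n{ctx}\n\nUser Query: {query}\n\nAnswer:"
    return prompt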


# LLM call: OpenAI (if key present), otherwise a local HF text-generation pipeline as fallback

_HF_TEXTGEN = None  # cached so repeated calls do not reload distilgpt2 every time


def call_llm(prompt: str, temperature: float = 0.0, max_tokens: int = 256):
    """Return a dict: {'text': str, 'provider': 'openai'|'hf'|'none', 'raw': ...}

    If OPENAI_API_KEY is present, prefer OpenAI Chat; else use HF text-generation.
    """
    global _HF_TEXTGEN
    openai_key = os.environ.get("OPENAI_API_KEY") or os.environ.get("OPENAI_API_KEY_LOCAL")
    if openai_key:
        try:
            # openai>=1.0 removed openai.ChatCompletion; requests go through a client object
            client = openai.OpenAI(api_key=openai_key)
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=max_tokens,
            )
            text = (response.choices[0].message.content or "").strip()
            return {"text": text, "provider": "openai", "raw": response}
        except Exception as e:
            print("OpenAI request failed, falling back to HF: ", e)

    # Fallback to HF pipeline (greedy decoding; `temperature` is not applied here)
    try:
        if _HF_TEXTGEN is None:
            _HF_TEXTGEN = pipeline("text-generation", model="distilgpt2")
        # return_full_text=False yields only the continuation, so no prompt slicing is needed
        res = _HF_TEXTGEN(prompt, max_new_tokens=max_tokens, do_sample=False, return_full_text=False)
        text = res[0]["generated_text"].strip()
        return {"text": text, "provider": "hf", "raw": res}
    except Exception as e:
        return {"text": "(LLM call failed: " + str(e) + ")", "provider": "none", "raw": str(e)}


# Evaluation helpers
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)


def eval_texts(reference: str, candidate: str):
    """Return evaluation metrics: rouge and BLEU (sacrebleu)"""
    rouge_scores = scorer.score(reference, candidate)
    bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
    return {
        "rouge1_f": rouge_scores["rouge1"].fmeasure,
        "rouge2_f": rouge_scores["rouge2"].fmeasure,
        "rougeL_f": rouge_scores["rougeL"].fmeasure,
        "bleu": float(bleu),
    }


def provenance_check(candidate: str, retrieved: List[Dict[str, Any]]):
    """Return the ids of retrieved sources that the candidate quotes verbatim.

    Simple method: a source counts only if the first 120 characters of its text
    appear as an exact substring of the candidate. This is strict, so answers
    that paraphrase their sources score zero.
    """
    provenance = []
    for r in retrieved:
        # use the opening of the doc as the snippet to look for
        snippet = r["text"][:120].strip()
        if snippet and snippet in candidate:
            provenance.append(r["id"])
    return provenance


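# High-level orchestrator

_EMBED_MODELS: Dict[str, SentenceTransformer] = {}  # cache: reloading the model on every call is slow


def prototype_goblin_workflow(query: str, docs: List[Dict[str, Any]], embed_model_name: str = "all-MiniLM-L6-v2", top_k: int = 3, template: str = None, temperature: float = 0.0):
    if embed_model_name not in _EMBED_MODELS:
        _EMBED_MODELS[embed_model_name] = SentenceTransformer(embed_model_name)
    embed_model = _EMBED_MODELS[embed_model_name]
    idx, embeddings, id_map = build_faiss_index(docs, embed_model)
    retrieved = search_index(query, idx, embed_model, id_map, docs, top_k=top_k)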

    prompt = build_rag_prompt(query, retrieved, template)
    llm_result = call_llm(prompt, temperature=temperature)

    # for evaluation, we expect the notebook user to provide a reference answer; for demo we pull sample mapping
    reference_map = {
        "What is Goblin Assistant?": "Goblin Assistant is an experimental AI assistant built for developer workflows. It fetches conversation context and answers by combining docs and LLMs.",
        "How do I install Goblin Assistant?": "Clone the repo; for frontend: pnpm install; for backend: pip install -r requirements.txt; set env vars like VITE_FASTAPI_URL and run uvicorn.",
    }

    reference = reference_map.get(query, "")
    if reference:
        eval_metrics = eval_texts(reference, llm_result["text"])  # computed on the simple sample mapping
    else:
        eval_metrics = {}

    prov = provenance_check(llm_result["text"], retrieved)

    return {
        "query": query,
        "retrieved": retrieved,
        "prompt": prompt,
        "llm_result": llm_result,
        "metrics": eval_metrics,
        "provenance": prov,
    }


# Example usage
EXAMPLE_QUERIES = [
    "What is Goblin Assistant?",
    "How do I install Goblin Assistant?",
]

# prototype_goblin_workflow builds its own embedder and index, so nothing is rebuilt here.
# Reassigning the globals `embedder` and `index` would change what retrieve() from section 5 searches.

for q in EXAMPLE_QUERIES:
    print("\n---\nRunning query: ", q)
    result = prototype_goblin_workflow(q, SAMPLE_DOCS, embed_model_name="all-MiniLM-L6-v2", top_k=3)
    print("Answer (provider=", result["llm_result"]["provider"], "):\n", result["llm_result"]["text"])
    print("Retrieved doc IDs:", [r["id"] for r in result["retrieved"]])
    if result["metrics"]:
        print("Metrics:", result["metrics"])
    print("Provenance (sources quoted):", result["provenance"])
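
Exact-substring provenance is easy to reason about but rarely fires when a model paraphrases. A softer alternative, sketched here under the assumption that shared word trigrams are a reasonable grounding signal, measures what fraction of the answer's n-grams also occur in a source. `ngram_overlap` and the sample strings are hypothetical and are not used elsewhere in this notebook.

```python
import re

def ngram_overlap(candidate: str, source: str, n: int = 3) -> float:
    """Fraction of the candidate's word n-grams that also occur in the source."""
    def ngrams(text: str):
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    cand, src = ngrams(candidate), ngrams(source)
    if not cand:
        return 0.0  # candidate shorter than n words
    return len(cand & src) / len(cand)

source = "To install Goblin Assistant, clone the repository and run pnpm install for frontend"
grounded = ngram_overlap("Clone the repository and run pnpm install.", source)
ungrounded = ngram_overlap("Download the installer from the website.", source)
print(grounded, ungrounded)
```

A threshold on this score (for example, treating a source as used when the overlap exceeds some tuned value) could stand in for the verbatim check; the right threshold depends on the model and should be calibrated on real answers.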


In [None]:
# 18) Iteration harness: tune retrieval and prompts
from itertools import product

PROMPT_TEMPLATES = [
    None,
    "You are Goblin Assistant. Use only the given context sections to answer, cite sources by ID.",
    "You are a helpful assistant. Provide short, step-by-step instructions if asked and show which docs you used by id.",
]


def grid_search_workflow(queries: List[str], docs: List[Dict[str, Any]], embed_model_name: str = "all-MiniLM-L6-v2"):
    results = []
    top_ks = [1, 3, 5]
    temps = [0.0, 0.3]
    for q in queries:
        for top_k, temp, template in product(top_ks, temps, PROMPT_TEMPLATES):
            out = prototype_goblin_workflow(q, docs, embed_model_name=embed_model_name, top_k=top_k, template=template, temperature=temp)
            # naive cooked detection: fraction of retrieved docs used as provenance
            prov_frac = float(len(out["provenance"]) / max(1, len(out["retrieved"])))
            out["provenance_fraction"] = prov_frac
            out["cooked_flag"] = prov_frac < 0.33 and out["llm_result"]["provider"] == "openai"
            results.append({
                "query": q,
                "top_k": top_k,
                "temp": temp,
                "template": template,
                "metrics": out["metrics"],
                "prov_frac": prov_frac,
                "cooked": out["cooked_flag"],
                "llm_provider": out["llm_result"]["provider"],
            })
    return results


# Run grid search
print('Running grid search for sample queries... this may call remote LLMs if API keys are present')
search_results = grid_search_workflow(EXAMPLE_QUERIES, SAMPLE_DOCS)

# Sort by Rouge-L fmeasure and present top results, filter out empty metrics
valid_results = [r for r in search_results if r["metrics"]]
sorted_results = sorted(valid_results, key=lambda r: r["metrics"]["rougeL_f"], reverse=True)

print('\nTop parameter combos (by Rouge-L):')
for r in sorted_results[:10]:
    print(f"Query: {r['query']}, top_k={r['top_k']}, temp={r['temp']}, template={'default' if r['template'] is None else 'custom'}, provider={r['llm_provider']}, rougeL={r['metrics']['rougeL_f']:.3f}, prov_frac={r['prov_frac']:.2f}, cooked={r['cooked']}")

# Show all combos that appear to be 'cooked' (low provenance)
cooked = [r for r in search_results if r['cooked']]
print('\nPotentially cooked results (prov_frac < 0.33):')
for r in cooked:
    print(f"Query: {r['query']}, top_k={r['top_k']}, provider={r['llm_provider']}, prov_frac={r['prov_frac']:.3f}, rougeL={r.get('metrics', {}).get('rougeL_f', None)}")

print('\nGrid search complete — examine results and adjust templates/top_k/temperature.')
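
Before running the grid with an API key set, it helps to know how many workflow runs (and therefore LLM calls) it triggers. The count below assumes the lists shown in this section: three `top_k` values, two temperatures, three templates, and the two example queries.

```python
from itertools import product

top_ks = [1, 3, 5]
temps = [0.0, 0.3]
n_templates = 3  # the default (None) plus two custom templates
n_queries = 2    # the two example queries

# one workflow run per (top_k, temperature, template) combination and query
runs_per_query = len(list(product(top_ks, temps, range(n_templates))))
total_runs = runs_per_query * n_queries
print(runs_per_query, total_runs)
```

Each run makes one LLM call, so adding a single extra value to any list multiplies the request count rather than adding to it.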


In [None]:
# 19) Auto-iterate until grounded (naive loop)

def iterate_until_grounded(query: str, docs: List[Dict[str, Any]], max_top_k: int = 10, templates: List[Optional[str]] = PROMPT_TEMPLATES, temps: Tuple[float, ...] = (0.0, 0.3), embed_model_name: str = "all-MiniLM-L6-v2"):
    # Try each top_k in increasing order and, for each, every template/temperature pair;
    # return the first result whose provenance fraction clears the threshold.
    # top_k is capped at len(docs): asking the index for more hits than it holds adds nothing.
    for top_k in range(1, min(max_top_k, len(docs)) + 1):
        for template in templates:
            for t in temps:
                out = prototype_goblin_workflow(query, docs, embed_model_name=embed_model_name, top_k=top_k, template=template, temperature=t)
                prov_frac = float(len(out["provenance"]) / max(1, len(out["retrieved"])))
                print(f"top_k={top_k}, temp={t}, template={'default' if template is None else 'custom'} -> prov_frac={prov_frac:.2f}, provider={out['llm_result']['provider']}")
                if prov_frac >= 0.66:  # heuristics: at least 66% of retrieved docs are referenced
                    print("Grounded answer found — returning result")
                    return out
    print("No sufficiently grounded answer found in grid. Consider adding more specific docs or changing templates.")
    return None

# Example: iterate for a query until grounded
print(iterate_until_grounded("What is Goblin Assistant?", SAMPLE_DOCS, max_top_k=5))

# If the iterative function returns None, try improving docs, chunking, or templates.
