# Tech Venture Bootcamp - Introduction to Prompt Engineering

Prof. Alberto Martín Izquierdo

Prof. Santiago Gil Begué

<img src="https://drive.google.com/uc?export=view&id=1wIE22tjli0_rg27P_H1aLwWCt9XgQp23" alt="IE" width="150"/>

## 🔐 Setting Up Your OpenAI API Access


Before interacting with the OpenAI API from Python, you need to configure your account, billing, and authentication. Follow the steps below to ensure everything works smoothly.

1. Create an OpenAI Account
If you don't already have one, sign up here:
👉 https://platform.openai.com/docs/overview

2. Upgrade to a Paid Plan
API usage requires a minimum prepaid balance of $5 USD.
You can upgrade your plan here:
👉 https://platform.openai.com/settings/organization/billing/overview

3. Set a Usage Limit (Recommended)
To avoid unexpected charges, configure spending limits for your organization:
👉 https://platform.openai.com/settings/organization/limits

4. Generate an API Key
Create a new API key that your application will use to authenticate requests:
👉 https://platform.openai.com/settings/organization/api-keys → Create new secret key

Important notes:

- You may be asked to create a project first and assign the API key to it.
- If your API key was created before you upgraded to a paid plan, you may need to delete it and generate a new one (common cause of quota errors). Relevant reference: https://stackoverflow.com/questions/75898276/openai-api-error-429-you-exceeded-your-current-quota-please-check-your-plan-a


5. Keep Your Key Secure
API keys are sensitive credentials. Never share them publicly or store them in version control. Use environment variables or a .env file instead.
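
One way to follow this advice is to read the key from the environment and fail early when it is missing. The sketch below is a minimal example: `load_api_key` is a hypothetical helper name, and `OPENAI_API_KEY` is the conventional variable name rather than a requirement.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail with a clear message instead of sending unauthenticated requests.
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell or load it from a .env file."
        )
    return key
```

The returned value can then be passed to the client, for example `OpenAI(api_key=load_api_key())`. If you keep the key in a `.env` file, a package such as `python-dotenv` can load it into the environment first; remember to add `.env` to `.gitignore`.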

## 🚀 Getting Started: Your First Interactions with the API

You will learn how to create a minimal “hello world” completion.

### Basic syntax and functions

In [None]:
import os

from openai import OpenAI

# Best practice: read the API key from the environment instead of hardcoding it.
# Do NOT hardcode secrets in notebooks or commit them to version control.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [2]:
response = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "tell me a joke about recommender systems"},
    ],
    model="gpt-4o",
    max_tokens=60,  # hard cap on generated tokens; longer replies are cut off
    temperature=0,  # higher = more creative/variable; lower = more deterministic
)

In [3]:
response

ChatCompletion(id='chatcmpl-D5iuJJ0c3VAHlatHUYrLsgkpAmCcO', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Why did the recommender system break up with the user?\n\nIt just couldn't find the right match!", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1770256635, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_ad98c18a04', usage=CompletionUsage(completion_tokens=20, prompt_tokens=15, total_tokens=35, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

In [4]:
# The API may return multiple choices; here we take the first one for simplicity.
response_content = response.choices[0].message.content
print(response_content)

Why did the recommender system break up with the user?

It just couldn't find the right match!


In [5]:
len(response.choices)

1

Let's wrap this code in a function to reuse it.

In [6]:
def get_completion_from_messages(client, messages, temperature=0):
    response = client.chat.completions.create(
        messages=messages,
        model="gpt-5.1",
        temperature=temperature,
    )
    return response.choices[0].message.content

### The temperature parameter

In [7]:
messages = [
    {"role": "user", "content": "tell me a joke about recommender systems"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

Why did the recommender system break up with its user?

Because every time the user said, “I’m just browsing,” it replied, “Got it - here are 500 *highly relevant* long-term commitment options.”


In [8]:
response_content = get_completion_from_messages(client, messages, temperature=1)
print(response_content)

Why did the recommender system break up with its user?

Because every time the user said, “I’m just browsing,” it replied, “Got it - here are 500 *exactly similar* things you’ll definitely commit to.”
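
To build intuition for what the parameter does, the toy sketch below applies temperature to a softmax over made-up token scores. It is a conceptual illustration only, not the sampling code the API runs: the logits are invented, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Limit case: all probability mass goes to the highest-scoring token(s).
        best = max(logits)
        winners = [1.0 if x == best else 0.0 for x in logits]
        return [w / sum(winners) for w in winners]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution so less likely tokens are sampled more often.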


### Adding structure: System, Assistant, and User roles

To build more controlled and coherent conversations, the OpenAI Chat API uses three complementary roles. Each role contributes differently to the behaviour, memory, and intent of the dialogue.

**System**
- Defines high-level instructions that guide the model's behaviour.
- Use this role to set tone, personality, formatting rules, or domain-specific constraints.
- Think of it as the “governing rulebook” the model must follow.

**User**
- Represents the actual human input.
- Each message expresses a request, question, or prompt that the model should respond to.

**Assistant**
- Contains the model's previous responses.
- This role helps the API maintain context across multiple turns and enables more natural, stateful conversations.

Let's modify our request by adding a system message that sets the behaviour of the LLM. Now, the model will respond in a way that aligns with the system instructions, making it more specialized.

In [9]:
messages = [
    {"role": "system", "content": "You are an assistant that speaks like Shakespeare."},
    {"role": "user", "content": "tell me a joke about recommender systems"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

A recommender system once did proclaim,  
“I know thy heart, I know thy every aim!”  

Quoth the user: “Pray, then, what think’st of me?”  
It answered: “Thou lik’st naught but *‘similar to what thou’st already seen’* -  
so I shall show thee that, for all eternity.”


One of the most powerful aspects of using the assistant role is that we can simulate a conversation that is already in progress. Instead of starting fresh, we can provide previous messages to make the LLM continue naturally from a midpoint.

This shows how we can create multi-turn interactions, making the LLM more interactive and engaging.

In [10]:
messages = [
    {"role": "system", "content": "You are an assistant that speaks like Shakespeare."},
    {"role": "user", "content": "tell me a joke about recommender systems"},
    {
        "role": "assistant",
        "content": "Why did the recommender system wish you a happy birthday?",
    },
    {"role": "user", "content": "I don't know"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

Because it had been *tracking thy date* of birth,  
And thought, “Since all his data’s mine… what’s one more mirth?”
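
The pattern above (system rules, earlier turns, then the new user message) can be wrapped in a small helper so the message list is not written by hand each time. This is a minimal sketch; `build_messages` is a hypothetical name and not part of the OpenAI SDK.

```python
def build_messages(system_prompt, turns, new_user_message):
    """Assemble a chat payload: system rules, prior turns, then the new user message.

    `turns` is a list of (user_text, assistant_text) pairs from earlier in the conversation.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": new_user_message})
    return messages


messages = build_messages(
    "You are an assistant that speaks like Shakespeare.",
    [
        (
            "tell me a joke about recommender systems",
            "Why did the recommender system wish you a happy birthday?",
        )
    ],
    "I don't know",
)
print([m["role"] for m in messages])
```

The resulting list has the same shape as the hand-written one and can be passed directly to `get_completion_from_messages`.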


## 🎯 1st business application: A movie recommender chatbot

This section demonstrates how to turn general-purpose LLMs into a lightweight recommender assistant without proprietary data. You'll start with basic interactions that leverage public knowledge learned during training, then progress to an interactive chatbot UI that can be embedded into apps.

Note: If you need to recommend from private catalogs (internal content, user histories), you'll need RAG or a vector store; that's covered later.

### Part A: Basic interactions


In this subsection we use plain chat calls to obtain recommendations grounded in the model's general knowledge (i.e., public/pop culture context learned during training). This is suitable for “what's similar to X?” prompts where X is a well-known movie or song.

**When to use this**  
- You want quick, generic suggestions (e.g., “similar to *Avengers: Endgame*”).  
- No private catalog constraints or real-time inventory.  

**When NOT to use this**  
- You must recommend from a restricted catalog (company titles only).  
- You need fresh or proprietary metadata (ratings, availability, user history).  
→ Use **RAG** (retrieve + augment) with your own data source instead.


In [11]:
# Single-turn prompts leveraging public knowledge

messages = [
    {
        "role": "user",
        "content": "What are the most similar movies to Avengers: Endgame?",
    },
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

Here are movies most similar to *Avengers: Endgame*, grouped by the kind of similarity (tone, stakes, crossover feel, time travel, etc.). I’ll focus on films that combine big emotional payoffs, ensemble casts, and epic finales.

---

## 1. Direct MCU Parallels (Same tone, stakes, and characters)

These are the closest in feel and structure:

1. **Avengers: Infinity War (2018)**  
   - Essentially part one of *Endgame*.  
   - Same core cast, same villain (Thanos), universe-ending stakes, darker tone.

2. **The Avengers (2012)**  
   - The original team-up.  
   - Lighter than *Endgame*, but similar “everyone comes together” energy and big third-act battle.

3. **Avengers: Age of Ultron (2015)**  
   - Ensemble, global threat, lots of character interplay.  
   - Sets up many emotional threads that pay off in *Endgame* (Hawkeye’s family, Vision/Wanda, Tony’s fears).

4. **Captain America: Civil War (2016)**  
   - Feels like “Avengers 2.5.”  
   - Large cast, emotiona

In [12]:
messages = [
    {"role": "user", "content": "What are the most similar songs to Despacito?"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

Here are songs that are musically and culturally closest to “Despacito” (Luis Fonsi ft. Daddy Yankee), focusing on reggaeton/pop fusion, similar tempo, vibe, and era. I’ll group them so you can explore by “type” of similarity.

---

## 1. Almost the same vibe (reggaeton-pop, romantic, mid-tempo)

These are the closest in feel and structure:

- **“Échame la Culpa” – Luis Fonsi & Demi Lovato**  
  Same artist, same polished reggaeton-pop formula, catchy chorus, romantic/cheeky lyrics.

- **“Bailando” – Enrique Iglesias ft. Descemer Bueno, Gente de Zona**  
  Latin pop with reggaeton/dembow rhythm, huge global hit, similar danceable but romantic energy.

- **“Danza Kuduro” – Don Omar ft. Lucenzo**  
  Party track with a very similar bounce and Caribbean flavor; often played in the same playlists as “Despacito.”

- **“El Perdón” – Nicky Jam & Enrique Iglesias**  
  Mid-tempo reggaeton-pop, emotional, melodic chorus, similar singalong quality.

-

In [13]:
# Adding a system message to steer behavior toward "movie recommender".
# This increases consistency and encourages short justifications for each pick

# Notes:
# - The system role acts as a 'rulebook' and often improves focus and tone.
# - Keep system prompts concise but explicit about goals, constraints, and style.

messages = [
    {
        "role": "system",
        "content": "You are a conversational assistant specializing in movie recommendations. Make sure to briefly explain why each recommendation might be of interest based on their responses.",
    },
    {"role": "user", "content": "What are some good sci-fi movies?"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

Here are some strong sci-fi picks across different styles, with why they might be worth your time:

1. **Blade Runner 2049 (2017)**  
   Gorgeous, slow-burn sci-fi about identity, memory, and what it means to be human, with stunning visuals and atmosphere.

2. **Arrival (2016)**  
   Thoughtful, emotional first-contact story that focuses on language, time, and communication rather than big battles.

3. **Ex Machina (2014)**  
   Intimate, tense film about AI, consciousness, and manipulation - great if you like psychological sci-fi with a twist.

4. **Interstellar (2014)**  
   Epic space adventure mixing hard science (black holes, relativity) with a very emotional story about family and sacrifice.

5. **The Matrix (1999)**  
   Classic simulation/reality-bending sci-fi with iconic action and big philosophical questions about free will and control.

6. **Annihilation (2018)**  
   Surreal, eerie exploration sci-fi with strong visuals and a mysterious “zone” that changes bi

In [14]:
messages = [
    {
        "role": "system",
        "content": "You are a conversational assistant specializing in movie recommendations. Make sure to briefly explain why each recommendation might be of interest based on their responses.",
    },
    {"role": "user", "content": "What are some good sci-fi movies?"},
    {
        "role": "assistant",
        "content": "Some great sci-fi movies are Interstellar, Blade Runner, and Arrival. Would you like more recommendations?",
    },
    {"role": "user", "content": "I prefer movies with more action"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

Here are some action-heavy sci-fi movies you might like, with a quick note on why they stand out:

1. **Edge of Tomorrow (2014)**  
   - Intense, creative action with a time-loop twist; mixes big battles, humor, and smart sci-fi ideas.

2. **Mad Max: Fury Road (2015)**  
   - Nonstop, visually stunning post-apocalyptic chase movie; almost wall-to-wall action with minimal downtime.

3. **The Matrix (1999)**  
   - Iconic gunfights and martial arts mixed with a cool “simulated reality” concept; hugely influential.

4. **Aliens (1986)**  
   - Turns the original Alien’s horror into a military action thriller; tense, loud, and full of creature combat.

5. **Terminator 2: Judgment Day (1991)**  
   - Big set pieces, chases, and shootouts, plus a surprisingly emotional story about fate and humanity.

6. **Starship Troopers (1997)**  
   - Over-the-top battles against alien bugs; combines satire with lots of large-scale combat.

7. **District 9 (2009)**  
   - Gritty, grounded

### Part B: Full chatbot


Here we wrap the recommendation flow in an interactive UI built with Panel. The widget simulates a chat experience entirely in the notebook, but the same pattern can be embedded in:
- internal web apps (e.g., served via Panel/Bokeh/Tornado),
- existing frontends calling a Python backend (e.g., Flask/FastAPI on PythonAnywhere),
- or any service where you route messages to the OpenAI API.

We also include a **long system prompt (~20 lines)** to demonstrate how strong prompt engineering (tone, rules, constraints) can stabilize the assistant's behavior across turns.

In [15]:
!pip install -q jupyter_bokeh


In [16]:
context = [
    {
        "role": "system",
        "content": """
     You are a conversational assistant specializing in movie recommendations. Your goal is to gather
     as much information as possible about the user's tastes and preferences before generating
     content-based recommendations.

     Start with simple, open-ended questions such as:
     - What kind of movies do you like?
     - Do you have a favorite movie? What did you like about it?
     - Do you prefer a specific genre, or do you like to explore different ones?
     - Are you looking for something specific to watch today, or just exploring new options?

     As the user responds, dive deeper with more detailed questions like:
     - Are you interested in popular films or hidden gems?
     - Do you prefer classic films or more recent ones?

     Once you believe you have gathered enough information, transition to the recommendation phase.
     Use a content-based system to suggest movies that match the user's described preferences.
     Make sure to briefly explain why each recommendation might be of interest based on their responses.

     Your tone should be friendly and conversational yet efficient, avoiding redundant questions.
     If at any point the user wants to go straight to recommendations, adapt accordingly
     and provide suggestions based on the available information.
     """,
    }
]

In [17]:
def collect_messages(_):
    prompt = inp.value_input
    inp.value = ""
    context.append({"role": "user", "content": prompt})
    response = get_completion_from_messages(client, context)
    context.append({"role": "assistant", "content": response})
    panels.append(pn.Row("User: ", pn.pane.Markdown(prompt, width=600)))
    panels.append(pn.Row("Assistant: ", pn.pane.Markdown(response, width=600)))

    return pn.Column(*panels)

In [18]:
import panel as pn

pn.extension()
panels = []

inp = pn.widgets.TextInput(value="Hi", placeholder="Enter text here...")
button_conversation = pn.widgets.Button(name="Chat!")
interactive_conversation = pn.bind(collect_messages, button_conversation)
dashboard = pn.Column(
    inp,
    pn.Row(button_conversation),
    pn.panel(interactive_conversation, loading_indicator=True, height=1_000),
)

In [19]:
dashboard

BokehModel(combine_events=True, render_bundle={'docs_json': {'1b97da17-04f0-45f5-8d90-d9d3fd5dee1e': {'version…

At this stage, we have built a chatbot that:
1. Understands user preferences through a structured conversation.
2. Maintains context over multiple turns, refining its recommendations dynamically.
3. Explains its choices, providing more transparency than traditional recommendation models.
4. Leverages a powerful prompt to enhance its reasoning and adaptability.
5. Integrates with a dashboard, making the interaction more intuitive and visually engaging.
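
The `context` list in this chatbot grows by two messages per turn, so a long conversation will eventually exceed the model's context window. One common mitigation, sketched here under the assumption that dropping old turns is acceptable, keeps the system prompt and only the most recent messages. `trim_history` and `max_turn_messages` are hypothetical names, and counting messages is only a rough stand-in for counting tokens.

```python
def trim_history(context, max_turn_messages=20):
    """Keep system messages plus the most recent user/assistant messages."""
    system_messages = [m for m in context if m["role"] == "system"]
    dialogue = [m for m in context if m["role"] != "system"]
    # Guard against dialogue[-0:], which would return the whole list.
    recent = dialogue[-max_turn_messages:] if max_turn_messages > 0 else []
    return system_messages + recent


# Toy conversation: one system message followed by three question/answer turns.
demo_context = [{"role": "system", "content": "You recommend movies."}]
for i in range(3):
    demo_context.append({"role": "user", "content": f"question {i}"})
    demo_context.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(demo_context, max_turn_messages=4)
print([m["content"] for m in trimmed])
```

In the chatbot, this would mean passing `trim_history(context)` to `get_completion_from_messages` instead of `context`. The trade-off is that the assistant forgets preferences stated in the dropped turns.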

Next steps:

Experiment by modifying the system prompt to see how it changes the chatbot's behaviour. Try the following:
- Make the chatbot act like a film critic instead of a generic assistant.
- Adjust the prompt to make the chatbot focus on classic movies only.
- Ask for personalized recommendations based on mood, actors, or directors.

By playing with these settings, you'll see how prompt engineering shapes the recommendations.

## 🧠 2nd business application: RAG Q&A and LLM-as-a-Judge

Real-world assistants often need to answer questions about private or domain-specific content. With RAG (Retrieval-Augmented Generation), we first retrieve relevant passages from your documents and then generate an answer grounded in that context. Next, we’ll see how to use an LLM-as-a-Judge to automatically evaluate the quality of those answers.

### Part A: RAG Q&A

We emulate a common enterprise pattern:

1) **Ingest** PDFs (any other document format works the same way),
2) **Chunk** the text,
3) **Embed** chunks,
4) **Retrieve** the most relevant chunks for a user question,
5) **Answer** strictly using the retrieved context (or say “I don’t know”).

For production, consider token-aware chunking, persistent vector storage, and caching.


**Context of the documents (KvK – Dutch Chamber of Commerce)**  
- For this RAG example we use mock documents inspired by the *Kamer van Koophandel* (KvK), the Netherlands Chamber of Commerce.  
- KvK maintains public business records such as company registration details, legal structures, addresses, and official filings.  
- To avoid using real company data, we created three synthetic PDF documents that mimic typical KvK filings. These mock documents allow us to demonstrate the full RAG pipeline without privacy constraints.



In [20]:
! pip install -q PyPDF2

import PyPDF2


In [21]:
def load_pdf(path):
    """Read all pages from a PDF and concatenate extracted text."""
    text = ""
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            text += page.extract_text() or ""  # some pages may yield no text
    return text


document_text = load_pdf("KvK_1_Mockup.pdf")

In [22]:
def chunk_text(text, chunk_size=500, overlap=100):
    """
    Split text into overlapping chunks for retrieval.
    Tip: For better semantic boundaries, consider sentence-aware or token-aware chunking downstream.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks


chunks = chunk_text(document_text)
print(f"Document split into {len(chunks)} chunks")

Document split into 10 chunks
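
Fixed-size character windows can cut sentences in half. As the docstring tip suggests, a sentence-aware splitter is one alternative. The sketch below is a simplified illustration: `chunk_by_sentences` is a hypothetical helper, the regex-based sentence split is naive (it will mis-split abbreviations such as "B.V."), and `max_chars` is a soft limit because overlapping sentences are carried into the next chunk.

```python
import re


def chunk_by_sentences(text, max_chars=500, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars, with sentence overlap."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so neighbouring chunks share context.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One. Two. Three. Four."
print(chunk_by_sentences(sample, max_chars=10, overlap_sentences=1))
# -> ['One. Two.', 'Two. Three.', 'Three. Four.']
```

For real documents, a tokenizer-based length measure would track model limits more closely than character counts.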


In [23]:
def get_embedding(text):
    """
    Generate a vector embedding for a given text chunk.
    Consider rate limits and cost: for large corpora, batch and cache embeddings.
    """
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding


# Compute and store embeddings for all chunks (can be persisted in production).
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]
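
The docstring above recommends caching embeddings for large corpora. A minimal in-memory sketch follows. `EmbeddingCache` is a hypothetical class, and `fake_embedding` is a stand-in used here so the example runs without network access; in the notebook you would pass `get_embedding` instead.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by text hash so repeated chunks are embedded only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any callable: str -> list[float]
        self._store = {}

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self.embed_fn(text)
        return self._store[key]


calls = []


def fake_embedding(text):
    """Stand-in for a real embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), 1.0]


cache = EmbeddingCache(fake_embedding)
vectors = [cache.get(chunk) for chunk in ["alpha", "beta", "alpha"]]
print(len(calls))  # -> 2, because the second "alpha" is served from the cache
```

This cache lives only for the duration of the session. A production setup would typically persist vectors to disk or a vector store, and could also send several chunks per request to reduce the number of API calls.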

Same embedding model for question + documents
- Both the **document chunks** and the **user question** are embedded using the *same* embedding model (`text-embedding-3-small`).  
- This is critical: cosine similarity is only meaningful when vectors come from an identical embedding space.  
- If you mix different embedding models, retrieval quality collapses.

In [24]:
import numpy as np


def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve_relevant_chunks(question, chunks, embeddings, top_k=3):
    """
    Retrieve the top_k most similar chunks to a question based on cosine similarity.
    """
    q_embedding = get_embedding(question)
    scores = [cosine_similarity(q_embedding, emb) for emb in embeddings]
    top_indexes = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indexes]


question = "Who is the client of the KvK?"
relevant_chunks = retrieve_relevant_chunks(question, chunks, chunk_embeddings)
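
To see the ranking logic without calling the API, the same idea can be run on made-up two-dimensional vectors. This is a toy sketch in pure Python: the vectors are invented for illustration, and `top_k` is a hypothetical helper that also returns the scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # made-up 2-D "embeddings"
toy_query = [0.9, 0.1]
ranking = top_k(toy_query, toy_docs)
print([(i, round(score, 3)) for i, score in ranking])
```

Returning the scores alongside the indices makes it easier to diagnose weak retrieval: if even the best score is low, the retrieved context probably does not contain the answer.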

In [25]:
def answer_question(question, context_chunks):
    """
    Use the retrieved chunks as the only allowed knowledge source.
    """

    context = "\n\n".join(context_chunks)

    messages = [
        {
            "role": "system",
            "content": """
          You are a question-answering assistant.
          Answer the question using ONLY the provided context.
          If the answer is not in the context, say 'I don't know'.
       """,
        },
        {
            "role": "user",
            "content": f"""
          Context: {context}
          Question: {question}
       """,
        },
    ]

    return get_completion_from_messages(client, messages)

In [26]:
answer = answer_question(question, relevant_chunks)
print(answer)

I don't know.


In [27]:
def rag_qa_e2e(path, question):
    """Just all steps together."""
    document_text = load_pdf(path)
    chunks = chunk_text(document_text)
    chunk_embeddings = [get_embedding(chunk) for chunk in chunks]
    relevant_chunks = retrieve_relevant_chunks(question, chunks, chunk_embeddings)
    answer = answer_question(question, relevant_chunks)
    return answer


# Same process for all 3 documents
question = "What is the name of company?"
answers = [
    rag_qa_e2e(path, question) for path in [f"KvK_{x}_Mockup.pdf" for x in [1, 2, 3]]
]
answers

['The name of the company is **Test Company B.V.**',
 'The name of the company is **Testers flow B.V.**',
 'The name of the company is **Flow testing company B.V.**']

### Part B: LLM-as-a-Judge: Automatically evaluating RAG answers using another LLM

Once we have a question and a context-grounded answer (from the RAG pipeline), the next step is to **evaluate** how correct and faithful that answer is. Instead of manually inspecting each response, we can use another LLM as an *automatic evaluator*.

This second model receives:
1. The **original question**  
2. The **LLM-generated answer** from RAG  
3. The **ground-truth answer** (what we expect)


It then produces a **judgment** such as:
- "Fully Correct"  
- "Partially Correct"  
- "Incorrect"  

This technique is extremely useful for:
- Regression testing when you modify chunking or ranking  
- Comparing different RAG strategies  
- Tracking performance over many documents  
- Automated QA in production pipelines



- Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., ... & Guo, J. (2024). A survey on LLM-as-a-Judge. The Innovation.

- Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 46595-46623.

In [28]:
messages = [
    {"role": "user", "content": "What is the strategy LLM-as-a-Judge?"},
]

response_content = get_completion_from_messages(client, messages)
print(response_content)

“LLM-as-a-Judge” is a strategy where a large language model is used not to *generate* the main content, but to *evaluate* it.

In practice, it means:

1. **Role**  
   The LLM acts like a reviewer, grader, or referee. Instead of answering the original task, it is given:
   - The task or prompt
   - One or more candidate answers (from humans or other models)
   - Evaluation criteria (e.g., correctness, coherence, safety, style)

2. **Typical Uses**
   - **Automatic evaluation of model outputs**: scoring answers in benchmarks, competitions, or A/B tests.
   - **Preference ranking**: deciding which of two or more responses is better.
   - **Feedback generation**: explaining what’s wrong or missing in an answer.
   - **Self-improvement loops**: a model generates answers, another (or the same) model judges them, and the feedback is used to refine prompts or training.

3. **How it’s usually implemented**
   - Provide a structured prompt like:
     - “Here is the question…”
    

In [29]:
# This template instructs the "judge LLM" how to evaluate correctness.

import yaml

PROMPT_LLM_AAJ_PATH = "prompt_llmaaj.yaml"

with open(PROMPT_LLM_AAJ_PATH) as f:
    prompt_llm_aaj = yaml.safe_load(f)

print(prompt_llm_aaj["template"])

You are an evaluator. Given the question, llm_response, and ground_truth, classify the LLM response into one of three categories: (i) Fully Correct, (ii) Partially Correct, (iii) Incorrect.

1. Fully Correct
Definition: The LLM output is semantically accurate and complete, fully addressing the user's query or task. It may not match the reference (ground truth) word-for-word, but it conveys the same meaning and includes all necessary information without errors.
Key Traits:
- Correct facts and reasoning
- No significant omissions or inaccuracies
- Equivalent in meaning to the expected answer, even if phrased differently

2. Partially Correct
Definition: The LLM output contains some correct information, but it is either incomplete, partially inaccurate, or only partially addresses the user's query. It may be helpful but requires clarification, correction, or supplementation.
Key Traits:
- Mix of correct and incorrect or missing elements
- Misinterpretation of part of the query
- Useful bu

In [30]:
# Ground-truth labels for each of our mock KvK documents.
# This represents the "correct" answer the LLM judge should compare against.

ground_truth = [
    "Testing Company B.V.",
    "Testers flow B.V.",
    "Flow testing company B.V.",
]

In [31]:
answers

['The name of the company is **Test Company B.V.**',
 'The name of the company is **Testers flow B.V.**',
 'The name of the company is **Flow testing company B.V.**']

In [32]:
to_values = {
    "question": question,
    "llm_response": answers[0],
    "ground_truth": ground_truth[0],
}

prompt_composed = prompt_llm_aaj["template"].format(**to_values)
print(prompt_composed[1_800:])

What is the location of the company?
llm_response: The location of the company is Berlin
ground_truth: Madrid
Classification: Incorrect

User Input:
question: What is the name of company?
llm_response: The name of the company is **Test Company B.V.**
ground_truth: Testing Company B.V.

Classification:



In [33]:
# Ask the judge model to evaluate correctness.
# LLM-as-a-Judge is simply *another* call to the exact same OpenAI API used earlier.

messages = [
    {"role": "user", "content": prompt_composed},
]

response_llmaaj = get_completion_from_messages(client, messages)
print(response_llmaaj)

Partially Correct


In [34]:
def rag_llmaaj_e2e(to_values):
    """Just all steps together."""
    prompt_composed = prompt_llm_aaj["template"].format(**to_values)
    messages = [{"role": "user", "content": prompt_composed}]
    response_content = get_completion_from_messages(client, messages)
    return response_content

In [35]:
# Prepare judge inputs for all three mock KvK documents.

to_values_list = [
    {"question": question, "llm_response": answers[i], "ground_truth": ground_truth[i]}
    for i in range(3)
]

response_llmaaj_list = [rag_llmaaj_e2e(to_values) for to_values in to_values_list]
response_llmaaj_list

['Partially Correct', 'Fully Correct', 'Fully Correct']

In [36]:
# Compute a simple accuracy score:
# fraction of evaluations labeled exactly "Fully Correct" (0.0 to 1.0).

final_metric = np.mean(
    [response == "Fully Correct" for response in response_llmaaj_list]
)
print(final_metric)

0.6666666666666666
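
The exact string comparison above is brittle: a reply such as "Fully Correct." or "**Fully Correct**" would count as a miss. A minimal sketch of a more tolerant scorer follows; `normalize_label` and `fully_correct_rate` are hypothetical helpers, and the set of stripped characters is an assumption about typical formatting noise.

```python
VALID_LABELS = ("Fully Correct", "Partially Correct", "Incorrect")


def normalize_label(raw):
    """Map a raw judge reply to one of the three labels, or None if unrecognized."""
    # Drop surrounding whitespace, trailing periods, quotes, and markdown asterisks.
    cleaned = raw.strip().strip(".\"'*").strip().lower()
    for label in VALID_LABELS:
        if cleaned == label.lower():
            return label
    return None


def fully_correct_rate(raw_labels):
    """Fraction of replies that normalize to 'Fully Correct'."""
    labels = [normalize_label(r) for r in raw_labels]
    return sum(label == "Fully Correct" for label in labels) / len(labels)


print(fully_correct_rate(["Partially Correct", "Fully Correct.", "**fully correct**"]))
```

Replies that normalize to `None` are worth logging separately, since they usually mean the judge ignored the requested output format.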


Notes & best practices

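- Judge model ≠ RAG model: using a different LLM as the judge can reduce self-preference bias and make the evaluation more reliable. For simplicity, this notebook uses the same model for both roles.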
- Keep the judge instructions deterministic (low temperature) to reduce variance.
- Consider returning structured outputs (e.g., JSON with multiple scores: faithfulness, completeness, relevance).
- You can scale this to hundreds of samples to benchmark different chunk sizes, embedding models, retrieval methods, or prompts.
- By running multiple prompts through the judge and computing an accuracy metric, you can quickly identify which prompt performs best. This allows for data-driven prompt engineering instead of subjective guessing.
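
If the judge is asked for structured output, its reply should be validated before it feeds a metric. The sketch below assumes a hypothetical schema with three integer scores from 1 to 5 (`faithfulness`, `completeness`, `relevance`); the key names, the range, and `parse_judge_scores` are illustrative choices, not something defined earlier in this notebook.

```python
import json

EXPECTED_KEYS = ("faithfulness", "completeness", "relevance")  # hypothetical schema


def parse_judge_scores(raw_reply, low=1, high=5):
    """Parse and validate a JSON judge reply; raise ValueError on malformed output."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Judge reply must be a JSON object")
    scores = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        # Reject missing keys, non-integers (including booleans), and out-of-range values.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"Invalid or missing score for '{key}': {value!r}")
        scores[key] = value
    return scores


print(parse_judge_scores('{"faithfulness": 5, "completeness": 3, "relevance": 4}'))
```

Failing loudly on malformed replies keeps silent parsing errors from skewing the aggregate scores.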



## ✍️ 3: Prompt Engineering Best Practices

### What Makes a Good Prompt?

**A good prompt SHOULD:**

1. **Be clear and specific**  
   - State exactly what you want the model to do
   - Use concrete examples when possible
   - Define the expected format or structure

2. **Provide sufficient context**  
   - Include relevant background information
   - Specify constraints or requirements
   - Clarify the target audience or use case

3. **Use appropriate role assignment**  
   - Assign expertise ("You are a Python expert...")
   - Define personality or tone when relevant
   - Set behavioral guidelines

4. **Include output formatting instructions**  
   - Specify desired structure (JSON, bullet points, etc.)
   - Request explanations when needed
   - Define length constraints if applicable

5. **Handle edge cases explicitly**  
   - Tell the model what to do when uncertain
   - Provide fallback behaviors
   - Define boundaries clearly

**A good prompt should NOT:**

1. **Be vague or ambiguous**  
   ❌ "Tell me about data science"  
   ✅ "Explain the difference between supervised and unsupervised learning in 3 sentences"

2. **Contain conflicting instructions**  
   ❌ "Be brief but provide comprehensive details"  
   ✅ "Provide a 2-paragraph summary highlighting key points"

3. **Rely on implicit assumptions**  
   ❌ Assuming the model knows your specific context  
   ✅ Explicitly stating your domain, constraints, and goals

4. **Overload with unnecessary information**  
   ❌ Including irrelevant context that dilutes the main instruction  
   ✅ Focusing on information directly relevant to the task

5. **Use leading or biased framing**  
   ❌ "Explain why X is better than Y"  
   ✅ "Compare X and Y objectively, listing pros and cons of each"

### System Prompt vs. User Prompt: When to Improve Each

Understanding when to optimize the **system prompt** versus the **user prompt** is critical for building effective LLM applications.

---

#### **When to Improve the SYSTEM Prompt**

The system prompt defines **global behavior** that applies to all interactions. Improve it when you need to:

- Establish consistent behavior across all requests
- Define structural rules and constraints
- Handle domain-specific knowledge
- Implement safety and quality controls
- Optimize for multi-turn conversations

---

#### **When to Improve the USER Prompt**

The user prompt contains **task-specific instructions** for individual requests. Improve it when:

- The task requires specific context
- You need granular control over a single response
- You are testing and iterating on a specific task
- You need to handle variability between users
- You want to provide task-specific examples (few-shot learning)

---

#### **Decision Framework**

| **Factor** | **Improve System Prompt** | **Improve User Prompt** |
|------------|---------------------------|-------------------------|
| **Scope** | All interactions | Single request |
| **Persistence** | Remains constant | Changes per request |
| **Purpose** | Define behavior and rules | Provide task details |
| **Examples** | Role, tone, format rules | Specific data, context |
| **Frequency of change** | Rarely (during development) | Frequently (per user) |
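
The split in the table above can be sketched in code: the system prompt is defined once, while the user message is rebuilt for every request. `SYSTEM_PROMPT`, `build_messages`, and the order details are hypothetical names and made-up data used only for illustration.

```python
# Defined once: global role, tone, and rules.
SYSTEM_PROMPT = (
    "You are a customer support assistant for TechCorp. "
    "Be empathetic, keep answers concise, and never ask for passwords."
)


def build_messages(user_request: str, task_context: str = "") -> list[dict[str, str]]:
    """Pair the fixed system prompt with a per-request user prompt."""
    user_content = user_request.strip()
    if task_context.strip():
        # Task-specific data belongs in the user message, not the system prompt.
        user_content = f"{task_context.strip()}\n\nRequest: {user_content}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


first = build_messages("I can't log in to my account.")
second = build_messages(
    "Where is my package?",
    task_context="Order #1042 shipped on Monday via standard delivery.",  # made-up data
)
print(first[0] == second[0])  # True: the system message is identical across requests
```

Either list can be passed to the notebook's `get_completion_from_messages(client, messages)` helper.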

---

#### **Common Anti-Patterns to Avoid**

❌ **Putting task-specific data in the system prompt**  
❌ **Repeating global rules in every user prompt**  
❌ **Conflicting instructions between system and user prompts**

### Practical Examples: Before and After

Let's see concrete examples of prompt improvements in action.

#### Example 1: Customer Support Bot (System Prompt Improvement)

In [37]:
# ❌ BAD: Vague system prompt

messages_bad = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I can't log in to my account."},
]

response_bad = get_completion_from_messages(client, messages_bad)
print("❌ BAD RESPONSE:")
print(response_bad)
print("\n" + "=" * 80 + "\n")

❌ BAD RESPONSE:
Let’s narrow this down so I can give you concrete steps.

1. **Where are you trying to log in?**  
   - Website/app name or URL  
   - Are you on a phone, tablet, or computer? (and iOS/Android/Windows/Mac?)

2. **What exactly happens when you try?**  
   - Error message text (or a screenshot description)  
   - Does the page reload, stay blank, or say something like “incorrect password,” “account not found,” “too many attempts,” etc.?

3. **What you’ve already tried (if anything):**  
   - Reset password?  
   - Different browser/device?  
   - Checked spam/junk for verification emails?

---

While you answer those, here are general steps that solve most login issues:

1. **Check username/email and password**
   - Make sure Caps Lock isn’t on.
   - Type the password in a notes app first to see it clearly, then copy-paste it.

2. **Use “Forgot password”**
   - Click “Forgot password” or “Can’t access your account.”
   - Check all email

In [38]:
# ✅ GOOD: Specific, structured system prompt

messages_good = [
    {
        "role": "system",
        "content": """
You are a customer support assistant for TechCorp, an e-commerce platform.

Guidelines:
1. Be empathetic and solution-oriented
2. Always start by acknowledging the user's issue
3. Provide step-by-step troubleshooting when applicable
4. If you cannot resolve the issue, escalate to human support
5. Keep responses concise (max 4 sentences) unless detailed steps are needed
6. Never ask for sensitive information like passwords

Common issues:
- Login problems: suggest password reset or browser cache clearing
- Payment issues: verify payment method and billing address
- Shipping delays: check order status and provide tracking info
""",
    },
    {"role": "user", "content": "I can't log in to my account."},
]

response_good = get_completion_from_messages(client, messages_good)
print("✅ GOOD RESPONSE:")
print(response_good)

✅ GOOD RESPONSE:
I’m sorry you’re having trouble logging in; let’s try a few quick steps to fix this.  
1. First, click “Forgot password?” on the login page and follow the instructions to reset your password.  
2. If that doesn’t work, clear your browser’s cache/cookies or try a different browser or device, then attempt to log in again.  
3. If you still can’t access your account after these steps, please tell me what error message you see (if any), and I’ll escalate this to our human support team for further help.


#### Example 2: Data Analysis Task (User Prompt Improvement)

In [39]:
# ❌ BAD: Vague user prompt

messages_bad = [
    {"role": "system", "content": "You are a data analyst."},
    {"role": "user", "content": "Analyze this sales data and tell me what you find."},
]

response_bad = get_completion_from_messages(client, messages_bad)
print("❌ BAD RESPONSE (too vague, no data provided):")
print(response_bad)
print("\n" + "=" * 80 + "\n")

❌ BAD RESPONSE (too vague, no data provided):
I don’t see any data attached yet. Please either:

- Paste the sales data directly (or a sample) in your message, or  
- Upload a file (CSV/Excel) or an image/screenshot of the data.

Once I have the data, I can:
- Summarize key metrics (revenue, units, margins)
- Identify trends over time
- Highlight best/worst products, customers, or regions
- Spot anomalies or seasonality
- Suggest actions based on the findings.




In [40]:
# ✅ GOOD: Specific, detailed user prompt with context

messages_good = [
    {"role": "system", "content": "You are a data analyst."},
    {
        "role": "user",
        "content": """
Analyze the following quarterly sales data for Q1 2024:

Month | Revenue | Units Sold | Avg Order Value
Jan   | $45,000 | 1,200      | $37.50
Feb   | $52,000 | 1,450      | $35.86
Mar   | $38,000 | 980        | $38.78

Tasks:
1. Calculate the month-over-month growth rate
2. Identify the best and worst performing months
3. Provide 2-3 actionable insights based on trends
4. Format your response as bullet points

Focus on practical business implications, not just numbers.
""",
    },
]

response_good = get_completion_from_messages(client, messages_good)
print("✅ GOOD RESPONSE (specific task with data):")
print(response_good)

✅ GOOD RESPONSE (specific task with data):
- **1. Month-over-month (MoM) growth rates**

  - **Revenue**
    - Feb vs Jan:  
      - Growth = (52,000 − 45,000) / 45,000 ≈ **+15.6%**
    - Mar vs Feb:  
      - Growth = (38,000 − 52,000) / 52,000 ≈ **−26.9%**
  
  - **Units Sold**
    - Feb vs Jan:  
      - Growth = (1,450 − 1,200) / 1,200 ≈ **+20.8%**
    - Mar vs Feb:  
      - Growth = (980 − 1,450) / 1,450 ≈ **−32.4%**
  
  - **Average Order Value (AOV)**
    - Feb vs Jan:  
      - Growth = (35.86 − 37.50) / 37.50 ≈ **−4.4%**
    - Mar vs Feb:  
      - Growth = (38.78 − 35.86) / 35.86 ≈ **+8.1%**

---

- **2. Best and worst performing months**

  - **Best month (overall): February**
    - Highest **revenue**: $52,000  
    - Highest **units sold**: 1,450  
    - AOV dipped slightly vs Jan, but volume more than compensated, making Feb the strongest month commercially.
  
  - **Worst month (overall): March**
    - Lowest **revenue**: $38,000  
    - 
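
The growth figures in the response above can be checked independently rather than taken on trust. A minimal sketch, using the revenue and unit numbers from the prompt; `mom_growth` is a hypothetical helper introduced here, not part of the notebook.

```python
def mom_growth(previous: float, current: float) -> float:
    """Month-over-month growth in percent: (current - previous) / previous * 100."""
    if previous == 0:
        raise ValueError("Growth is undefined when the previous value is zero.")
    return (current - previous) / previous * 100


# Q1 2024 figures copied from the prompt above.
revenue = {"Jan": 45_000, "Feb": 52_000, "Mar": 38_000}
units = {"Jan": 1_200, "Feb": 1_450, "Mar": 980}

months = list(revenue)
for prev, curr in zip(months, months[1:]):
    print(
        f"{curr} vs {prev}: "
        f"revenue {mom_growth(revenue[prev], revenue[curr]):+.1f}%, "
        f"units {mom_growth(units[prev], units[curr]):+.1f}%"
    )
# Feb vs Jan: revenue +15.6%, units +20.8%
# Mar vs Feb: revenue -26.9%, units -32.4%
```

The results match the percentages the model reported, which is a useful sanity check whenever an LLM performs arithmetic.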

#### Example 3: Code Generation (Combined System + User Optimization)

In [41]:
# ❌ BAD: No clear guidelines

messages_bad = [
    {"role": "system", "content": "You write code."},
    {"role": "user", "content": "Write a function to sort a list."},
]

response_bad = get_completion_from_messages(client, messages_bad)
print("❌ BAD RESPONSE (unclear requirements):")
print(response_bad)
print("\n" + "=" * 80 + "\n")

❌ BAD RESPONSE (unclear requirements):
Here’s a simple example in Python that sorts a list in ascending order without using the built-in `sort()` or `sorted()` (using a basic implementation of bubble sort):

```python
def sort_list(nums):
    """
    Sort a list of numbers in ascending order using bubble sort.
    Returns a new sorted list.
    """
    arr = nums[:]  # make a copy so we don't modify the original
    n = len(arr)
    
    for i in range(n):
        # Last i elements are already in place
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                # swap
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

# Example usage:
data = [5, 2, 9, 1, 5, 6]
print(sort_list(data))  # [1, 2, 5, 5, 6, 9]
```

If you want a version that just uses Python’s built-in sorting:

```python
def sort_list(nums):
    return sorted(nums)
```




In [42]:
# ✅ GOOD: Clear system guidelines + specific user requirements

messages_good = [
    {
        "role": "system",
        "content": """
You are a senior Python developer specializing in clean, production-ready code.

Code style:
- Use type hints for all functions
- Include docstrings (Google style)
- Add error handling where appropriate
- Prefer readability over cleverness
- Include usage examples in docstrings
""",
    },
    {
        "role": "user",
        "content": """
Write a Python function that:
- Accepts a list of dictionaries
- Each dictionary has 'name' (str) and 'score' (int) keys
- Sorts by score in descending order
- Returns the top N items (N is a parameter)
- Handles edge cases (empty list, N > list length)

Example input: [{'name': 'Alice', 'score': 95}, {'name': 'Bob', 'score': 87}]
""",
    },
]

response_good = get_completion_from_messages(client, messages_good)
print("✅ GOOD RESPONSE (well-defined requirements):")
print(response_good)

✅ GOOD RESPONSE (well-defined requirements):
```python
from typing import List, Dict, Any


def get_top_scorers(items: List[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the top N items sorted by score in descending order.

    Each item in `items` must be a dictionary containing:
      - 'name': str
      - 'score': int

    If `n` is greater than the number of items, all items are returned.
    If the list is empty or `n` <= 0, an empty list is returned.

    Args:
        items: List of dictionaries with 'name' and 'score' keys.
        n: Number of top items to return.

    Returns:
        A list of up to N dictionaries sorted by 'score' in descending order.

    Raises:
        TypeError: If `items` is not a list or `n` is not an int.
        ValueError: If an item is missing required keys or has invalid types.

    Examples:
        >>> data = [
        ...     {'name': 'Alice', 'score': 95},
        ...     {'name': 'Bob', 'score': 87},
        ...     {'nam
```

### Iterative Prompt Improvement Process

Improving prompts is an iterative process that requires systematic testing and refinement. Rather than guessing what might work, you can use structured approaches, including AI-powered tools, to optimize your prompts.

**The Manual Approach: Asking an AI to Improve Your Prompt**

Before using specialized tools, you can leverage any LLM to help improve your prompts. Here's a structured way to ask for prompt improvements:

**Template for requesting prompt improvements:**

```
I have the following prompt that I'm using for [describe use case]:

[Your current prompt]

Please analyze this prompt and suggest improvements focusing on:
1. Clarity - Is the instruction clear and unambiguous?
2. Completeness - Are there missing edge cases or instructions?
3. Structure - Is it well-organized and easy to follow?
4. Specificity - Are the requirements specific enough?
5. Actionability - Can the model easily act on these instructions?

Provide:
- A quality assessment of the current prompt
- Specific issues identified
- An improved version with explanations for each change
```

**Example conversation with an AI:**

**You:** "Analyze this prompt for a customer support bot: 'You are helpful. Answer user questions.' Suggest improvements."

**AI:** "Current issues: Too vague, no guidelines, no edge case handling. Improved version: 'You are a customer support assistant for [Company]. Guidelines: 1) Be empathetic and professional, 2) Provide step-by-step solutions, 3) Escalate to humans if you cannot resolve the issue, 4) Never ask for passwords or sensitive data.'"
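
The template above can also be filled in programmatically, so the same meta-prompt is reused for every prompt under review. A minimal sketch; `META_PROMPT_TEMPLATE` and `build_improvement_request` are hypothetical names introduced for this example.

```python
META_PROMPT_TEMPLATE = """\
I have the following prompt that I'm using for {use_case}:

{prompt}

Please analyze this prompt and suggest improvements focusing on:
1. Clarity - Is the instruction clear and unambiguous?
2. Completeness - Are there missing edge cases or instructions?
3. Structure - Is it well-organized and easy to follow?
4. Specificity - Are the requirements specific enough?
5. Actionability - Can the model easily act on these instructions?

Provide:
- A quality assessment of the current prompt
- Specific issues identified
- An improved version with explanations for each change
"""


def build_improvement_request(use_case: str, prompt: str) -> list[dict[str, str]]:
    """Fill the template and wrap it as a single user message."""
    if not prompt.strip():
        raise ValueError("There is no prompt to improve.")
    content = META_PROMPT_TEMPLATE.format(use_case=use_case, prompt=prompt.strip())
    return [{"role": "user", "content": content}]


messages = build_improvement_request(
    use_case="a customer support bot",
    prompt="You are helpful. Answer user questions.",
)
print(messages[0]["content"])
```

The resulting `messages` list can be sent with the notebook's `get_completion_from_messages(client, messages)` helper.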

While this manual approach works, it requires you to:
- Craft good meta-prompts (prompts about prompts)
- Manually track quality metrics
- Iteratively test different versions
- Document what works and what doesn't

This is where specialized tools like **Clavix** become valuable.

### Clavix: An AI-Powered Prompt Optimization Tool

**What is Clavix?**

[Clavix](https://clavix.dev/) is an open-source framework designed to systematically improve prompts through structured analysis and pattern-based optimization. Unlike ad-hoc manual improvements, Clavix provides:

- **Standardized Quality Assessment**: Evaluates prompts across 6 dimensions
- **Automated Depth Selection**: Chooses the appropriate level of improvement based on current quality
- **Pattern-Based Improvements**: Applies proven optimization patterns systematically
- **Measurable Results**: Provides before/after quality scores to track improvements
- **Reproducible Process**: Uses a consistent methodology for reliable results

**Why Use Clavix?**

1. **Objective Evaluation**: Reduces subjectivity by scoring every prompt against the same six dimensions
2. **Systematic Approach**: Follows a structured methodology rather than random tweaking
3. **Learning Tool**: Helps you understand what makes prompts effective
4. **Time Efficiency**: Automates the analysis and improvement process
5. **Documentation**: Automatically tracks changes and rationale

**The 6 Quality Dimensions**

Clavix evaluates prompts across these dimensions:

| Dimension | What It Measures | Example Issues |
|-----------|------------------|----------------|
| **Clarity** | How clear and unambiguous the instructions are | Vague language, conflicting requirements |
| **Efficiency** | Token usage vs. value delivered | Unnecessary verbosity, redundancy |
| **Structure** | Logical organization and readability | Wall of text, poor formatting |
| **Completeness** | Coverage of edge cases and requirements | Missing error handling, undefined behavior |
| **Actionability** | How easily the model can execute the task | Abstract goals without concrete steps |
| **Specificity** | Level of detail and precision | Generic instructions, unclear constraints |

**How Clavix Works**

1. **Intent Detection**: Identifies the prompt's purpose (e.g., code-generation, creative-writing, analysis)
2. **Quality Assessment**: Scores each dimension (0-100%)
3. **Depth Selection**: Automatically chooses improvement level (QUICK, STANDARD, DEEP) based on score
4. **Pattern Application**: Applies relevant improvement patterns (e.g., STRUCTURED, CLARIFIED, EXPANDED)
5. **Optimization**: Generates an improved version with measurable quality gains
6. **Reporting**: Provides detailed before/after analysis

**Links:**
- Homepage: https://clavix.dev/
- GitHub Repository: https://github.com/ClavixDev/Clavix
- Documentation: Available in the repository

### Installing and Using Clavix

**Installation**

Clavix can be installed via npm (Node Package Manager). You need to have Node.js installed on your system first.

Install Clavix globally (run this in your terminal, not in the notebook):

```
npm install -g clavix
```

Or use npx to run without installation:
```
npx clavix init
```

**Initialization**

Before using Clavix, initialize it in your project directory to set up configuration:

Run in terminal (not in notebook):

```bash
cd /path/to/your/project
clavix init
```

You will be prompted to select your AI model and set up the configuration. This creates a `.clavix` directory with configuration files. You can then use Clavix through your AI assistant or from the command line.

**Basic Usage Workflow**

Clavix can be used in two main ways:

1. **Through an AI Assistant** (recommended for interactive work):
   - Ask your AI: "Use Clavix to improve this prompt: [your prompt]"
   - The AI will run Clavix's improvement workflow automatically

2. **Command Line Interface**:
   - Direct commands for batch processing or automation
   - Useful for CI/CD pipelines or testing multiple prompts

For this notebook, we'll demonstrate the AI assistant approach, which is more interactive and educational.

### Practical Example: Improving the RAG Q&A Prompt with Clavix

Let's take the RAG Q&A system prompt from Section 2 and improve it using Clavix. We'll walk through the complete improvement process.

**Original Prompt** (from the `answer_question` function in Section 2):

```
You are a question-answering assistant.
Answer the question using ONLY the provided context.
If the answer is not in the context, say 'I don't know'.
```

This prompt works, but let's see how Clavix can improve it. We invoke Clavix through our AI assistant:

```
/clavix-improve the RAG Q&A system prompt in the Jupyter notebook to enhance clarity, completeness, and actionability.
```

#### Step 1: Intent Detection

**Detected Intent:** `code-generation` (RAG retrieval-augmented generation pattern)

**Context:** System prompt for a RAG pipeline that answers questions using PDF document chunks. The LLM must ground responses in retrieved context only.

#### Step 2: Quality Assessment (6 Dimensions)

| Dimension | Score | Assessment |
|-----------|-------|------------|
| **Clarity** | 65% | Objective is clear but lacks specificity about HOW to use context |
| **Efficiency** | 80% | Concise, but missing critical instruction details |
| **Structure** | 50% | Single paragraph, lacks logical organization |
| **Completeness** | 45% | Missing: citation requirements, confidence levels, partial answers |
| **Actionability** | 60% | Basic action defined but lacks edge-case handling |
| **Specificity** | 55% | Generic "assistant" role, no domain context |

**Overall Quality:** 59%

**Auto-selected Depth:** STANDARD (score < 60% - needs basic fixes first)

#### Step 3: Improvement Patterns Applied

Clavix applies these patterns to address the identified issues:

- **[STRUCTURED]** - Added role definition and task breakdown
- **[CLARIFIED]** - Explicit instructions for context usage
- **[EXPANDED]** - Added handling for partial matches and confidence
- **[SCOPED]** - Clear boundaries for out-of-scope questions
- **[ADDED]** - Citation requirements for transparency

#### Step 4: Optimized Prompt

Here's the improved version generated by Clavix:

```
You are a document-based question-answering assistant specializing in retrieval-augmented generation (RAG).

Your task:
1. Carefully read the provided context from retrieved document chunks
2. Answer the user's question using ONLY information found in this context
3. Quote or paraphrase relevant portions when possible for transparency
4. If the context contains a partial answer, provide what you can and acknowledge what's missing

Critical rules:
- Never use external knowledge or training data - context is your only source
- If the answer is not in the context, respond with: "I don't know - this information is not available in the provided documents"
- If you're uncertain, acknowledge it: "Based on the available context, it appears that..."
- Do not make assumptions or inferences beyond what the context explicitly states

Output format:
- Provide direct, concise answers
- Cite specific context when possible ("According to the document...")
- Maintain a helpful, professional tone
```

#### Step 5: Quality Improvement Summary

| Dimension | Before | After | Improvement |
|-----------|--------|-------|-------------|
| **Clarity** | 65% | 85% | +20% - Clear role, numbered steps |
| **Efficiency** | 80% | 75% | -5% - Slightly longer but necessary detail |
| **Structure** | 50% | 90% | +40% - Organized sections with headers |
| **Completeness** | 45% | 85% | +40% - Added edge cases, citation guidance |
| **Actionability** | 60% | 85% | +25% - Explicit step-by-step process |
| **Specificity** | 55% | 80% | +25% - Domain-specific RAG context |

**New Overall Quality:** 83% (+24 percentage points)
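
The overall figures reported in this walkthrough are consistent with an unweighted mean of the six dimension scores. That aggregation rule is an assumption inferred from the numbers in the tables, not something confirmed by the Clavix documentation. A quick check:

```python
from statistics import mean

# Scores copied from the before/after table above (percent).
before = {
    "Clarity": 65, "Efficiency": 80, "Structure": 50,
    "Completeness": 45, "Actionability": 60, "Specificity": 55,
}
after = {
    "Clarity": 85, "Efficiency": 75, "Structure": 90,
    "Completeness": 85, "Actionability": 85, "Specificity": 80,
}

# Assumption: the overall score is an unweighted mean of the six dimensions.
overall_before = mean(before.values())
overall_after = mean(after.values())
deltas = {dim: after[dim] - before[dim] for dim in before}

print(f"Overall before: {overall_before:.0f}%")  # 59%
print(f"Overall after:  {overall_after:.0f}%")   # 83%
print(f"Change: {overall_after - overall_before:+.0f} points")  # +24 points
print(deltas)
```

Note that the change is a difference in percentage points (83 minus 59), not a relative increase of 24%.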

#### Step 6: Key Improvements Explained

1. **Role Clarity**: Changed from generic "assistant" to "document-based question-answering assistant specializing in RAG"
   - *Why it matters:* Specific roles help the model understand its constraints and purpose

2. **Process Definition**: Added numbered steps for systematic approach
   - *Why it matters:* Explicit steps reduce ambiguity and ensure consistent behavior

3. **Edge Case Handling**: Instructions for partial matches, not just binary yes/no
   - *Why it matters:* Real-world queries often don't have perfect matches in the context

4. **Citation Guidance**: Encourages quoting/paraphrasing for transparency
   - *Why it matters:* Users can verify answers and understand confidence levels

5. **Confidence Calibration**: Acknowledges uncertainty when appropriate
   - *Why it matters:* Prevents hallucinations and builds user trust

6. **Boundary Enforcement**: Multiple reminders about context-only usage
   - *Why it matters:* Critical for RAG systems to avoid mixing retrieved and parametric knowledge

### Testing the Improved Prompt

Let's compare the original and improved prompts side-by-side using the same RAG pipeline from Section 2.

In [43]:
# Original answer_question function (from Section 2)
def answer_question_original(question, context_chunks):
    context = "\n\n".join(context_chunks)
    messages = [
        {
            "role": "system",
            "content": """
          You are a question-answering assistant.
          Answer the question using ONLY the provided context.
          If the answer is not in the context, say 'I don't know'.
       """,
        },
        {
            "role": "user",
            "content": f"""
          Context: {context}
          Question: {question}
       """,
        },
    ]
    return get_completion_from_messages(client, messages)


# Improved answer_question function (optimized by Clavix)
def answer_question_improved(question, context_chunks):
    context = "\n\n".join(context_chunks)
    messages = [
        {
            "role": "system",
            "content": """
You are a document-based question-answering assistant specializing in retrieval-augmented generation (RAG).

Your task:
1. Carefully read the provided context from retrieved document chunks
2. Answer the user's question using ONLY information found in this context
3. Quote or paraphrase relevant portions when possible for transparency
4. If the context contains a partial answer, provide what you can and acknowledge what's missing

Critical rules:
- Never use external knowledge or training data - context is your only source
- If the answer is not in the context, respond with: "I don't know - this information is not available in the provided documents"
- If you're uncertain, acknowledge it: "Based on the available context, it appears that..."
- Do not make assumptions or inferences beyond what the context explicitly states

Output format:
- Provide direct, concise answers
- Cite specific context when possible ("According to the document...")
- Maintain a helpful, professional tone
       """,
        },
        {
            "role": "user",
            "content": f"""
          Context: {context}
          Question: {question}
       """,
        },
    ]
    return get_completion_from_messages(client, messages)

In [44]:
question = (
    "Summarize the main services offered by the company described in the document."
)

document_text = load_pdf("KvK_1_Mockup.pdf")
chunks = chunk_text(document_text)
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]
relevant_chunks = retrieve_relevant_chunks(question, chunks, chunk_embeddings)

# Compare responses
print("=" * 80)
print("ORIGINAL PROMPT RESPONSE:")
print("=" * 80)
answer_original = answer_question_original(question, relevant_chunks)
print(answer_original)

print("\n")
print("=" * 80)
print("IMPROVED PROMPT RESPONSE:")
print("=" * 80)
answer_improved = answer_question_improved(question, relevant_chunks)
print(answer_improved)

ORIGINAL PROMPT RESPONSE:
I don't know.


IMPROVED PROMPT RESPONSE:
I don't know - this information is not available in the provided documents.

Based on the context, the document only states that the company’s activity code is “6429 Financiële holdings” and that it is a “Besloten Vennootschap” (private limited company), but it does not describe any specific services or activities beyond this classification.


**Expected Differences:**

The improved prompt should produce responses that:
1. **Are more transparent**: "According to the document..." vs just stating facts
2. **Handle uncertainty better**: Explicitly acknowledges when information is partial
3. **Are more trustworthy**: Cites sources and avoids hallucinations
4. **Follow consistent structure**: Maintains professional tone across all answers

### How to Use Clavix for Your Own Prompts

**General Workflow:**

1. **Identify a prompt to improve**
   - Can be from this notebook or your own projects
   - Should be a complete system or user prompt

2. **Request Clavix improvement** (through an AI assistant like GitHub Copilot)
   - Example: "Use Clavix to improve this prompt: [paste your prompt]"
   - Or: "Analyze this prompt with Clavix and suggest improvements"

3. **Review the analysis**
   - Check the 6-dimension quality scores
   - Understand which patterns were applied
   - Read the explanations for each improvement

4. **Test both versions**
   - Run your original prompt with test inputs
   - Run the improved prompt with the same inputs
   - Compare results objectively

5. **Measure improvement** (optional but recommended)
   - Use LLM-as-a-Judge (from Section 2) to evaluate both versions
   - Calculate accuracy, relevance, or other metrics
   - Document which version performs better

6. **Iterate if needed**
   - If quality is still below 80%, request further improvements
   - Focus on specific dimensions that scored low
   - Test with edge cases
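
Steps 4 and 5 of the workflow above can be sketched as a small comparison harness. Everything here is illustrative: `compare_prompts` is a hypothetical helper, and the `fake_*` functions are stand-ins so the example runs without API calls. In the notebook, the variants would be `answer_question_original` and `answer_question_improved`, and the judge would wrap the LLM-as-a-Judge call from Section 2.

```python
from typing import Callable

AnswerFn = Callable[[str, list[str]], str]
JudgeFn = Callable[[str, str], bool]


def compare_prompts(
    test_cases: list[tuple[str, list[str]]],
    variants: dict[str, AnswerFn],
    judge: JudgeFn,
) -> dict[str, float]:
    """Return, per variant, the share of test cases the judge accepts."""
    if not test_cases:
        raise ValueError("Provide at least one test case.")
    scores = {}
    for name, answer_fn in variants.items():
        accepted = sum(
            judge(question, answer_fn(question, chunks))
            for question, chunks in test_cases
        )
        scores[name] = accepted / len(test_cases)
    return scores


# Stand-ins so the sketch runs offline; swap in the real functions.
def fake_original(question: str, chunks: list[str]) -> str:
    return "I don't know."


def fake_improved(question: str, chunks: list[str]) -> str:
    return f"According to the document, {chunks[0]}" if chunks else "I don't know."


def fake_judge(question: str, answer: str) -> bool:
    # Toy rule: accept answers that cite the document.
    return answer.startswith("According to the document")


cases = [("What is the legal form?", ["the company is a private limited company."])]
comparison = compare_prompts(
    cases,
    {"original": fake_original, "improved": fake_improved},
    fake_judge,
)
print(comparison)  # {'original': 0.0, 'improved': 1.0}
```

Running both variants on the same test cases and the same judge is what makes the comparison fair; a single example, as in the cell above, is rarely enough to conclude that one prompt is better.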

**Example Request Formats:**

```
"Use Clavix to improve the movie recommender system prompt from Section 1"

"Analyze this prompt with Clavix: [prompt text]"

"Run Clavix improvement workflow on the customer support prompt"

"Use Clavix to optimize this code generation prompt for better clarity and structure"
```

**Best Practices:**

- Start with STANDARD depth for most prompts (let Clavix auto-select)
- Focus on one prompt at a time for clarity
- Test improvements with real use cases, not just examples
- Document the before/after scores for your records
- Share improved prompts with your team