# Advanced Reflection Techniques in Prompt Engineering
### Prompt Engineering Course – Day X
*Instructor: Charles Niswander II*


## Learning Objectives

By the end of this notebook you will be able to:

1. **Explain** what “reflection” means for LLMs and why it improves reliability.  
2. **Compare** baseline prompting with several *advanced* reflection patterns (Self‑Critique, Reflexion, Critic‑Assistant, etc.).  
3. **Implement** a two‑pass self‑evaluation loop using the OpenAI API (or a local model).  
4. **Design & run** multi‑turn reflection pipelines that automatically retry or revise answers.  
5. **Measure** quality gains with simple programmatic or LLM‑based evaluators.  
6. **Extend** the templates provided here to new tasks (summarisation, coding, reasoning).  



---

## 1  Why Reflection?

LLMs generate text one token at a time, so they tend to commit to the *first* plausible-looking answer, and that answer isn’t always correct or safe.  
**Reflection** adds one or more deliberate *metacognitive* steps in which the model (or a helper agent) re‑reads, critiques, and revises its own output.

Typical benefits:

* Higher factual accuracy & fewer “silly” mistakes  
* Better chain‑of‑thought transparency for debugging  
* A natural hook for tool‑calling (tests, linters, content policies)  
* Teachable pattern for students: “Write ➜ Evaluate ➜ Improve”  
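
The “Write ➜ Evaluate ➜ Improve” pattern can be sketched without any API call. The function and the three toy callables below (`write_evaluate_improve`, `toy_write`, `toy_evaluate`, `toy_improve`) are hypothetical names introduced only for illustration; in a real pipeline each callable would presumably wrap a model call or a tool such as a test runner.

```python
from typing import Callable, Tuple

def write_evaluate_improve(
    write: Callable[[str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    improve: Callable[[str, str], str],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Draft once, then alternate evaluate/improve until the draft passes."""
    draft = write(task)
    for _ in range(max_rounds):
        ok, feedback = evaluate(draft)
        if ok:
            break
        draft = improve(draft, feedback)
    return draft

# Toy stand-ins (assumptions, not part of the notebook) so the loop runs offline.
def toy_write(task: str) -> str:
    # Deliberately wrong first draft.
    return "2 + 2 = 5"

def toy_evaluate(draft: str) -> Tuple[bool, str]:
    # Recompute the sum and compare it with the claimed result.
    left, right = draft.split("=")
    a, b = (int(x) for x in left.split("+"))
    expected = a + b
    return int(right) == expected, f"the sum should be {expected}"

def toy_improve(draft: str, feedback: str) -> str:
    # Replace the claimed result with the value named in the feedback.
    left = draft.split("=")[0].strip()
    return f"{left} = {feedback.split()[-1]}"

result = write_evaluate_improve(toy_write, toy_evaluate, toy_improve, "Add 2 and 2")
print(result)
```

Keeping the three roles as separate callables makes it easy to swap a heuristic evaluator for a model-based one later.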



### 1.1  Basic vs. Advanced Reflection

| Level | Pattern | # Passes | Typical Prompt Stub |
|-------|---------|---------|----------------------|
| Basic | *“Think step by step”* | 1 | “Let’s think step by step.” |
| Intermediate | *Self‑Critique* | 2 | “Evaluate the previous answer. List flaws. Rewrite improved answer.” |
| Advanced | *Reflexion / RCR / ReAct+Validation* | 3‑5 | “Reflect on your reasoning, score it, and if below threshold, revise & try again (max _n_ loops).” |

In this notebook we move quickly to the **advanced** end of that spectrum.


In [None]:

# ❗ Installation (Colab only – skip if already installed)
# Feel free to swap in your favourite local model wrapper.
!pip -q install --upgrade openai==1.23.0  # or latest


In [None]:

import os
from getpass import getpass
#@markdown ---  
#@markdown ### 🔑 Set your OpenAI API key  
#@markdown Store it in the `OPENAI_API_KEY` env var **or** enter it at the prompt (input is hidden).
# Fall back to a hidden prompt so the key never appears in the notebook source.
openai_api_key = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")
assert openai_api_key and openai_api_key.startswith("sk-"), "Please provide a valid key."
os.environ["OPENAI_API_KEY"] = openai_api_key
print("✅ API key set (length:", len(openai_api_key), ")")


In [None]:

import openai, textwrap, time, random
openai_client = openai.OpenAI()  # uses env var

def ask_model(sys_prompt, user_prompt, model="gpt-3.5-turbo", max_tokens=256, temp=0.7):
    '''Single call convenience wrapper'''
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role":"system","content":sys_prompt},
                  {"role":"user","content":user_prompt}],
        temperature=temp,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content.strip()
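
Reflection loops multiply the number of API calls, so transient failures become more likely. The sketch below is one possible retry helper with exponential backoff and jitter. `with_retries` and its parameters are hypothetical names, and catching every `Exception` is a simplifying assumption; in practice you would narrow it to the transient error types your client raises.

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 0.1s of jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)

# Offline demo: a function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient error")
    return "ok"

delays = []  # record delays instead of actually sleeping
outcome = with_retries(flaky, sleep=delays.append)
print(outcome, "after", calls["n"], "calls")
```

With the notebook's wrapper, a call might look like `with_retries(lambda: ask_model("You are a helpful assistant.", "Hi"))`.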



---

## 2  Baseline Generation (No Reflection)

We’ll use a deliberately tricky reasoning task that *looks* easy but often trips up the model.

> **Task:** “How many prime numbers are between 1 and 100?”

(Answer = 25.)
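
The reference answer can be checked independently of any model. A minimal sketch using the Sieve of Eratosthenes (the helper name `primes_up_to` is introduced here for illustration):

```python
def primes_up_to(limit):
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]  # 0 and 1 are not prime
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting at n*n.
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

primes = primes_up_to(100)
print(len(primes))  # 25
```

Having a ground-truth function like this is what makes the task convenient for auto-grading later in the notebook.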

Let’s prompt the model *without* any reflection and see what happens.


In [None]:

baseline_answer = ask_model(
    "You are a helpful assistant.",
    "How many prime numbers are between 1 and 100? Respond with just the number."
)
print("Model says:", baseline_answer)



---

## 3  Two‑Pass Self‑Critique

We now ask the model to **check** its own answer:

1. Generate an initial answer & chain‑of‑thought.  
2. Prompt: “Critique the answer above. If incorrect, provide a corrected answer.”  

We'll wrap that in a helper function so you can reuse it for any task.


In [None]:

def self_critique(task, model="gpt-3.5-turbo"):
    sys_prompt = "You are a large language model boosting your own reasoning via self‑critique."
    initial = ask_model(sys_prompt, task + "\n\nPlease explain your reasoning step by step.", model=model)
    critique_prompt = f"""**Your previous answer:**

{initial}

---

**Critique Instructions (system):**
Evaluate your answer. List any mistakes. Then give a FINAL_ANSWER line.

Return format:
1. Critique: <bullet list of issues or 'None'>
2. FINAL_ANSWER: <single line answer only>"""
    improved = ask_model(sys_prompt, critique_prompt, model=model)
    return initial, improved

init, refined = self_critique("How many prime numbers are between 1 and 100?")
print("---- Initial ----\n", init[:500], "\n")
print("---- Refined ----\n", refined)
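
Because the critique prompt asks for a `FINAL_ANSWER` line, the refined reply can be parsed programmatically. The helper below is a hedged sketch: `extract_final_answer` is a hypothetical name, and `sample` is a made-up reply in the requested format, not real model output.

```python
import re
from typing import Optional

def extract_final_answer(text: str) -> Optional[str]:
    """Return the text after the last 'FINAL_ANSWER:' marker, or None if absent."""
    # '.+' stops at the end of the line, so only the answer line is captured.
    matches = re.findall(r"FINAL_ANSWER:\s*(.+)", text)
    return matches[-1].strip() if matches else None

# Made-up reply that follows the "Return format" from the critique prompt.
sample = """1. Critique:
- The first draft counted 1 as a prime.
2. FINAL_ANSWER: 25"""

print(extract_final_answer(sample))  # 25
```

Returning `None` when the marker is missing lets the caller decide whether to retry or fall back to the raw text.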



---

## 4  Reflexion‑Style Loops

[Reflexion](https://arxiv.org/abs/2303.11366) combines *memory* of past failures with iterative self‑evaluation.

**Algorithm sketch**

```
for attempt in range(max_iters):
    answer, score = model(query + memory)
    if score >= threshold:
        break
    memory += model("Why was that wrong? Summarise mistake.")
```

Below is a *minimal* three‑try loop.


In [None]:

import json, re

def reflexion_loop(task, max_iters=3, model="gpt-3.5-turbo"):
    memory = ""
    data = {"answer": "", "reflection": ""}
    for i in range(1, max_iters+1):
        user = f"""{task}
{memory}
---

Please answer the TASK above. Then rate your own answer on a scale 0‑1 ('reflect_score'). 
If reflect_score < 0.9, explain what was wrong in <=2 sentences. 
Return JSON with keys: answer, reflect_score, reflection."""
        reply = ask_model("You are an expert reasoning agent.", user, model=model, temp=0)
        try:
            # Parse the outermost {...} span as JSON; never eval() model output.
            data = json.loads(re.search(r"\{.*\}", reply, re.DOTALL).group(0))
            score = float(data["reflect_score"])
        except (AttributeError, ValueError, KeyError, TypeError) as e:
            print("Parsing fail on try", i, ":", e, "\nRaw:", reply)
            score = 0
            data = {"answer":"", "reflection":""}
        print(f"Try {i}, score={score}")
        if score>=0.9:
            return data.get("answer", "")
        # Carry the model's own mistake summary into the next attempt.
        memory += "\nMistake summary: " + str(data.get("reflection", ""))
    return data.get("answer", "")

best = reflexion_loop("How many prime numbers are between 1 and 100?")
print("Best answer:", best)
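
`reflexion_loop` relies on the model's self-reported score. A variant that may be more robust is to let an *external* evaluator decide when to stop and to feed its feedback back in as memory. The sketch below runs offline; `reflexion_with_evaluator`, `stub_model`, and `stub_evaluate` are hypothetical stand-ins, and the stub behaviour is contrived purely to show the control flow.

```python
from typing import Callable, List, Tuple

def reflexion_with_evaluator(
    task: str,
    model: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 0.9,
    max_iters: int = 3,
) -> Tuple[str, List[str]]:
    """Retry until evaluate() scores the answer >= threshold; keep feedback as memory."""
    memory: List[str] = []
    answer = ""
    for _ in range(max_iters):
        # Append every past mistake summary to the task, as in the loop above.
        prompt = task + "".join(f"\nMistake summary: {m}" for m in memory)
        answer = model(prompt)
        score, feedback = evaluate(answer)
        if score >= threshold:
            break
        memory.append(feedback)
    return answer, memory

# Contrived stub: answers correctly only once the prompt mentions its past mistake.
def stub_model(prompt: str) -> str:
    return "25" if "1 is not prime" in prompt else "26"

# Stub evaluator with a known ground truth for this one task.
def stub_evaluate(answer: str) -> Tuple[float, str]:
    if answer == "25":
        return 1.0, ""
    return 0.0, "1 is not prime, so do not count it."

answer, memory = reflexion_with_evaluator(
    "How many prime numbers are between 1 and 100?", stub_model, stub_evaluate
)
print(answer, memory)
```

In a real run, `model` could wrap `ask_model` and `evaluate` could be a unit test, a grader function, or a second model call.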



---

## 5  Critic‑Assistant Pattern

Another strategy is to **split roles**:

1. *Assistant* writes a draft.  
2. *Critic* reviews and lists issues.  
3. *Assistant* revises.  

We can implement this with three model calls, each using a different system prompt.


In [None]:

def critic_assistant(task, model="gpt-3.5-turbo"):
    # Pass `model` through to every call so the argument actually takes effect.
    draft = ask_model("You are a creative assistant.", task + "\nWrite your answer with reasoning.", model=model)
    critique = ask_model("You are a ruthless critic.", 
                         f"Here is an answer:\n\n{draft}\n\nList up to 3 fatal flaws.", model=model)
    revised = ask_model("You are a careful assistant.", 
                        f"Original answer:\n{draft}\n\nCritique:\n{critique}\n\nRewrite a perfect answer.", model=model)
    return draft, critique, revised

d,c,r = critic_assistant("Explain why the sky is blue in two paragraphs, at an 8th‑grade level.")
print("---- Draft ----\n", d[:400], "\n")
print("---- Critique ----\n", c)
print("---- Revised ----\n", r[:400])



---

## 6  Evaluating Reflection Quality

Simple tasks (e.g., numeric answer) can be auto‑graded in Python.  
For open‑ended tasks we can *re‑use a model* as evaluator.

Below is a toy grader for the prime‑numbers task.


In [None]:

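import re

def prime_between_1_100(text):
    '''Return True if the answer stated in `text` is 25.'''
    # Prefer the FINAL_ANSWER line so the "1." / "2." list markers in the
    # self-critique format are not mistaken for the answer.
    match = re.search(r'FINAL_ANSWER:\D*(\d+)', text)
    numbers = [match.group(1)] if match else re.findall(r'\d+', text)
    return bool(numbers) and int(numbers[0]) == 25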

print("Baseline correct?", prime_between_1_100(baseline_answer))
print("Self‑Critique correct?", prime_between_1_100(refined))
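
For open-ended tasks, a model-based evaluator might look like the sketch below. It takes the model-calling function as a parameter so it can be exercised offline; `judge_answer`, `JUDGE_SYSTEM`, and `fake_ask` are hypothetical names, and the single 0-5 score is an assumed rubric rather than anything prescribed by this notebook.

```python
import re
from typing import Callable, Optional

JUDGE_SYSTEM = "You are a strict grader. Reply with a single integer from 0 to 5."

def judge_answer(task: str, answer: str, ask: Callable[[str, str], str]) -> Optional[int]:
    """Ask a grader model for a 0-5 score; return None if no valid score is found."""
    user = (
        f"Task:\n{task}\n\nAnswer:\n{answer}\n\n"
        "Score the answer from 0 (poor) to 5 (excellent)."
    )
    reply = ask(JUDGE_SYSTEM, user)
    # Accept only a standalone digit 0-5, so a reply like "10" is rejected.
    match = re.search(r"\b([0-5])\b", reply)
    return int(match.group(1)) if match else None

# Stand-in with the same (sys_prompt, user_prompt) signature as ask_model.
def fake_ask(sys_prompt: str, user_prompt: str) -> str:
    return "Score: 4"

score = judge_answer("Explain why the sky is blue.", "Rayleigh scattering ...", fake_ask)
print(score)  # 4
```

Swapping `fake_ask` for `ask_model` (ideally with `temp=0`) would turn this into a live grader; treating an unparseable reply as `None` keeps a malformed judgment from silently counting as a score.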



---

## 7  📝 Exercises

1. **Custom Task:** Pick a *reasoning* prompt where the base model often fails (e.g., “What day of the week was 4 July 1776?”).  
   * Implement a Reflexion loop that stops when the evaluator (another model call) says “confidence ≥ 0.95.”  

2. **Peer Review:** Form pairs. One writes a *draft* answer, the other critiques it, then swap.  
   * Compare results with automated Critic‑Assistant.  

3. **Metric Design:** Modify the `prime_between_1_100` grader to score *explanations* for correctness, fluency & honesty on a 0‑5 scale using GPT.  



---

## 8  Key Takeaways

* Reflection adds a **metacognitive layer** that boosts reliability.  
* Advanced loops (Reflexion, Critic‑Assistant) often outperform a bare *“Let’s think step by step”*, at the cost of extra model calls.  
* Always pair reflection with **evaluation** – either heuristic or model‑based – to know when to stop looping.  
* Templates in this notebook are **starting points**: adapt them to your domain tasks, cost limits, and latency constraints.  

**Further Reading**

* Shinn et al., *Reflexion: Language Agents with Verbal Reinforcement Learning* (2023)  
* Zheng et al., *Judging LLM’s reasoning chain of thought is difficult* (2024)  
* Yao et al., *Tree of Thoughts: Deliberate Problem Solving with Large Language Models* (2023)  

Happy prompting! 🎉
