# CoT & Self‑Consistency Playground  🎲🧠  

**Prompt Engineering – Day 5 Lab Notebook**  

Explore classic & modern Chain‑of‑Thought (CoT) techniques, self‑consistency decoding, Tree‑/Graph‑of‑Thoughts, ReAct, Plan‑then‑Act, reflection, and generate‑validate workflows – all in one Colab‑ready notebook.

> Set your `OPENAI_API_KEY` (or plug in any chat‑completion endpoint) and run the cells.  
> Each section includes explanations, live demos, and **📝 Try‑it‑yourself** exercises.


---

## 📚 What you’ll learn & do

| Section | Concept | Hands‑on |
|---------|---------|----------|
| 1 | Classic & Zero‑Shot CoT | Compare accuracy with/without “Let’s think step by step.” |
| 2 | Few‑Shot CoT & **Self‑Consistency** | Generate multiple rationales and majority‑vote |
| 3 | Contrastive CoT | Ask for alternative lines of reasoning |
| 4 | **Tree‑of‑Thoughts (ToT)** | Simple breadth‑first solver for a word‑scramble puzzle |
| 5 | **Graph‑of‑Thoughts (GoT)** | Visualise reasoning paths with NetworkX |
| 6 | **ReAct** & Plan‑then‑Act | Interleave “Thought/Action” with a toy calculator tool |
| 7 | Meta‑Prompting & Reflexion | Model critiques its own answer, then retries |
| 8 | Generate‑Validate Loop | Two‑stage pipeline with automatic retries |
| 9 | Capstone Challenge | Build and share a multi‑step reasoning pipeline |

Feel free to skip around or duplicate cells while experimenting.


## 🔧 0. Setup

Run the cell below to install/update required packages and configure your OpenAI (or compatible) key.  
Replace `"YOUR_API_KEY"` with your real key or export it via an environment variable first.

```python
%pip -q install --upgrade openai tiktoken
```


In [None]:
import os, json, textwrap, random, math, time
from collections import Counter, defaultdict
from openai import OpenAI

# ⚠️ SET YOUR KEY!
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY") or "YOUR_API_KEY")
MODEL = "gpt-4o-mini"  # swap to any chat-completion model you have access to

def chat(system_prompt, user_prompt, temperature=0.7, max_tokens=512, model=MODEL):
    """Send one system + user turn and return the assistant's reply as plain text."""
    messages = [
        { "role": "system", "content": system_prompt },
        { "role": "user", "content": user_prompt }
    ]
    # openai>=1.0 exposes chat completions through a client object
    resp = client.chat.completions.create(model=model,
                                          messages=messages,
                                          temperature=temperature,
                                          max_tokens=max_tokens)
    # content can be None (e.g. an empty completion), so fall back to ""
    return (resp.choices[0].message.content or "").strip()


In [None]:
# 🛠️ Helper utilities

def self_consistency(prompt, n=5, temperature=0.8):
    """Run the same prompt n times and return full rationales & majority answer."""
    rationales = [chat("You are a thoughtful assistant.", prompt,
                       temperature=temperature) for _ in range(n)]

    # Heuristic: assume final line (after last newline) is the answer
    answers = [r.split('\n')[-1].strip() for r in rationales]
    votes = Counter(answers)
    majority, _ = votes.most_common(1)[0]
    return rationales, votes, majority

def pretty_print_votes(votes):
    for ans, cnt in votes.items():
        print(f"{ans:>10}: {cnt}")


---

## 1️⃣ Classic & Zero‑Shot Chain‑of‑Thought

**Prompt pattern:** “Let’s think step by step.”

Below we tackle a probability word problem in two ways:

1. **No CoT** – single short answer.  
2. **Zero‑Shot CoT** – no examples; append the magic words and let the model show its reasoning.

*Classic* CoT goes one step further and supplies worked examples whose answers spell out the reasoning; that few‑shot variant is the subject of Section 2.

👉 *Run the cells and observe differences in reasoning quality & correctness.*

In [None]:
problem = """If there are 3 red marbles and 2 blue marbles in a bag, and you draw
two marbles without replacement, what is the probability that both are red?"""

In [None]:
print("--- No CoT ---")
print(chat("You are a concise assistant.", problem + "\nAnswer succinctly."))

In [None]:
print("--- Zero-Shot CoT ---")
# The trigger phrase goes after the question so the model continues from it
print(chat("You are a helpful assistant.", problem + "\n\nLet's think step by step."))

### 📝 Try‑it‑yourself

Change `problem` above to any word problem (math, logic, trivia) and re‑run.  
Which variant gives you higher‑quality answers? Why might that be?
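
To judge which variant is actually correct on the default marble problem, it helps to have the exact answer on hand. A minimal standard‑library sketch that computes it two independent ways (the variable names are illustrative, not part of the notebook's helpers):

```python
from fractions import Fraction
from math import comb

red, blue = 3, 2
total = red + blue

# Sequential view: first draw is red, then the second draw is red from what is left.
p_sequential = Fraction(red, total) * Fraction(red - 1, total - 1)

# Counting view: number of red pairs divided by number of all pairs.
p_counting = Fraction(comb(red, 2), comb(total, 2))

print(p_sequential, p_counting, float(p_sequential))  # 3/10 3/10 0.3
```

Both routes give 3/10, so any model output other than 3/10, 0.3, or 30% is wrong regardless of how convincing the rationale looks.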

---

## 2️⃣ Few‑Shot CoT **+ Self‑Consistency**

Provide 1‑2 worked examples to *bootstrap* reasoning style, then sample multiple
rationales at higher temperature and majority‑vote for robustness.

In [None]:
few_shot = """Q: If 4 pens cost $1.20, how much do 10 pens cost?
A: Let's think step by step.
- First find cost per pen: $1.20 / 4 = $0.30
- Then 10 pens cost 10 * $0.30 = $3.00
Therefore, the answer is $3.00

Q: {question}
A: Let's think step by step."""
question = "A rectangle has a length of 8 cm and a width of 5 cm. What is its area?"
prompt = few_shot.format(question=question)
rats, votes, majority = self_consistency(prompt, n=6, temperature=0.9)
print("Rationales→") 
for i,r in enumerate(rats,1):
    print(f"\n--- Sample {i} ---\n{r[:500]}")

print("\nVote counts:")
pretty_print_votes(votes)
print("\nMajority answer:", majority)

### 📝 Exercise: Parameter sweep

* Try different `n` and `temperature` values.  
* Do majority votes always match ground truth?  
* What kinds of errors appear?
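
One error source worth watching during the sweep is the vote itself: comparing raw final lines treats "40" and "The answer is 40." as different answers. A possible normalisation step, sketched offline with made‑up sample rationales; `extract_number` and `vote` are hypothetical helpers, and "take the last number on the last line" is an assumption that only suits numeric answers:

```python
import re
from collections import Counter

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def extract_number(rationale):
    """Return the last number on the final non-empty line, or None if there is none."""
    lines = [ln for ln in rationale.splitlines() if ln.strip()]
    if not lines:
        return None
    found = NUMBER_RE.findall(lines[-1].replace(",", ""))  # drop thousands separators
    return float(found[-1]) if found else None

def vote(rationales):
    """Majority vote over normalised numeric answers; samples without a number are skipped."""
    answers = [a for a in map(extract_number, rationales) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Made-up rationales standing in for model samples
samples = [
    "Area = 8 * 5\nTherefore, the answer is 40",
    "Length times width gives the area.\nThe answer is 40.0 square cm.",
    "8 + 5 = 13\nTherefore, the answer is 13",
    "8 x 5 = 40\n40",
]

raw_last_lines = {s.splitlines()[-1] for s in samples}
print(len(raw_last_lines), "distinct raw answers; normalised majority:", vote(samples))
```

The heuristic is fragile in its own way: a unit written as `cm^2` would make the exponent the "last number", so check a few raw rationales before trusting the counts.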

---

## 3️⃣ Contrastive / Alternative Reasoning

Ask the model for a solution and, in the same response, for a **different** line of reasoning that reaches the same answer – then compare the two.

In [None]:
base = """{q}

Provide one possible solution. Then, **in a separate paragraph**, outline an alternative reasoning path that could also arrive at the same answer, even if the steps differ.""".format(q=problem)

print(chat("You are a reasoning assistant.", base))

---

## 4️⃣ Tree‑of‑Thoughts (ToT) Mini‑Solver 🌳

We’ll tackle a simple **word‑scramble** puzzle by exploring a search tree of partial solutions.

*State =* current permutation of letters  
*Score =* whether the permutation is a valid English word.

> This is a toy example to illustrate branching and evaluation. The search is breadth‑first, so alternatives stay on the frontier rather than being revisited by backtracking.

In [None]:
import random
import requests

# Five-letter words from a public word list (downloaded once; needs network access)
WORDS_URL = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'
word_list = requests.get(WORDS_URL, timeout=30)
word_list.raise_for_status()
WORDS = {w.strip().lower() for w in word_list.text.splitlines() if len(w.strip()) == 5}

def is_word(w): return w.lower() in WORDS

scramble = "NTRIA"  # scrambled letters; an unscrambled input would be "solved" at the first step
print("Scramble:", scramble)

def tree_search(letters, max_branch=5, depth=5):
    """Breadth-first search over (prefix, remaining-letters) states."""
    frontier = [("", letters)]
    for d in range(depth):
        new = []
        for prefix, remain in frontier:
            # expand up to max_branch randomly chosen next letters (demo only)
            for choice in random.sample(list(remain), min(max_branch, len(remain))):
                new_pref = prefix + choice
                new_rem = remain.replace(choice, '', 1)
                # evaluate: chosen prefix followed by the leftover letters in their current order
                if is_word(new_pref + new_rem):
                    return new_pref + new_rem
                new.append((new_pref, new_rem))
        frontier = new
    return None

solution = tree_search(scramble, depth=5)
print("Found:", solution)
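
The cell above picks branches at random. In Tree‑of‑Thoughts, partial solutions are also scored and only the most promising ones are kept for the next level. A minimal offline sketch of that evaluate‑and‑prune step follows; the three‑word `VOCAB` and the prefix‑count `score` are hypothetical stand‑ins for a real word list and an LLM evaluator:

```python
VOCAB = {"train", "bread", "stone"}  # hypothetical tiny vocabulary for the demo

def expand(state):
    """One child per distinct unused letter."""
    prefix, remaining = state
    return [(prefix + ch, remaining.replace(ch, "", 1)) for ch in sorted(set(remaining))]

def score(state):
    """How many vocabulary words could still be completed from this prefix."""
    prefix, _ = state
    return sum(word.startswith(prefix) for word in VOCAB)

def is_goal(state):
    prefix, remaining = state
    return remaining == "" and prefix in VOCAB

def beam_search(start, beam_width=3, max_depth=5):
    frontier = [start]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        for cand in candidates:
            if is_goal(cand):
                return cand
        # Evaluate and prune: keep only the highest-scoring partial solutions.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None

result = beam_search(("", "nirat"))
print(result)  # ('train', '')
```

With `beam_width=3` the frontier never holds more than three states, whereas the unpruned search can grow to every permutation of the letters.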

---

## 5️⃣ Graph‑of‑Thoughts (GoT) Visual Explorer 🕸️

Below we build a tiny reasoning graph for dividing an integer into factors.  
Feel free to skip if you’re short on time.

In [None]:
import math
import networkx as nx
import matplotlib.pyplot as plt

n = 60
G = nx.DiGraph()
G.add_node(n)
front = [n]
seen = {n}  # expand each number once, even when several paths reach it
while front:
    cur = front.pop()
    for i in range(2, math.isqrt(cur) + 1):
        if cur % i == 0:
            a, b = i, cur // i
            G.add_edge(cur, a); G.add_edge(cur, b)
            for nxt in (a, b):
                if nxt not in seen:
                    seen.add(nxt)
                    front.append(nxt)
plt.figure(figsize=(6,4))
nx.draw(G, with_labels=True, node_size=500, font_size=8)
plt.title("Graph-of-Factors (Reasoning Paths)")
plt.show()
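
What separates a graph of thoughts from a tree is that one node can be reached from several parents, so intermediate results are shared rather than recomputed. The sketch below makes that visible without any plotting; it mirrors the factor‑pair expansion above using plain dictionaries, and `factor_graph` and `parents` are illustrative names rather than NetworkX functions:

```python
from collections import defaultdict
from math import isqrt

def factor_graph(n):
    """Map each number to the factors it splits into, expanding every number once."""
    children = defaultdict(set)
    seen, stack = {n}, [n]
    while stack:
        cur = stack.pop()
        for i in range(2, isqrt(cur) + 1):
            if cur % i == 0:
                for part in (i, cur // i):
                    children[cur].add(part)
                    if part not in seen:
                        seen.add(part)
                        stack.append(part)
    return children

def parents(children):
    """Invert the child map so each node lists the nodes that lead to it."""
    inverse = defaultdict(set)
    for parent, kids in children.items():
        for kid in kids:
            inverse[kid].add(parent)
    return inverse

graph = factor_graph(12)
shared = {node: sorted(ps) for node, ps in parents(graph).items() if len(ps) > 1}
print(shared)  # nodes with more than one incoming edge
```

For 12, the factors 2 and 3 each have several parents, which a tree could only represent by duplicating them.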

---

## 6️⃣ ReAct & Plan‑then‑Act 🤖

We’ll hook the LLM to a *tiny* calculator “tool” via prompting.

```
Thought: I need to add 23 + 19.
Action: calculator.add(23,19)
Observation: 42
Thought: I have my answer.
Answer: 42
```


In [None]:
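# Simple tool that the model can call by text
def tool_executor(command: str):
    """Run a call such as 'calculator.add(23,19)' or 'mul(2,3,4)' and return the result."""
    try:
        cmd, _, args = command.strip().partition('(')
        cmd = cmd.strip().split('.')[-1]  # accept both 'calculator.add' and plain 'add'
        nums = [float(x) for x in args.rstrip(')').split(',')]
        if cmd == 'add':
            return sum(nums)
        if cmd == 'mul':
            return math.prod(nums)
        return f"Error: unknown command {cmd!r}"
    except Exception as e:
        return f"Error: {e}"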

query = "What is (13*7) + (18*2)?"

prompt = f"""You have access to a calculator tool with syntax calculator.add(a,b) or calculator.mul(a,b,...).
Respond using the ReAct format (Thought / Action / Observation) until you reach 'Answer:'.

Question: {query}"""

# naive single call
response = chat("You are a ReAct agent. When you need to compute, write Action: ...", prompt)
print(response)

*Manually* copy any `calculator.xxx(...)` call you see in the model output into `tool_executor` (defined above) to
simulate Observation feedback, then feed the updated context back to the model (2‑3 iterations).  

👉 Can you automate this loop? (Extension exercise.)
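
One possible shape for that automation, offered as a sketch rather than a reference solution. It assumes the model writes exactly one `Action: calculator.add(...)` or `Action: calculator.mul(...)` line per step and does not nest calls; `run_last_action`, `react_loop`, and the scripted replies are all hypothetical names and data so the loop can run offline:

```python
import math
import re

# Matches e.g. "Action: calculator.mul(13,7)"; nested calls are deliberately not supported.
ACTION_RE = re.compile(r"Action:\s*calculator\.(add|mul)\(([^()]*)\)")

def run_last_action(text):
    """Execute the last calculator call found in `text`; return None if there is none."""
    calls = ACTION_RE.findall(text)
    if not calls:
        return None
    name, raw_args = calls[-1]
    nums = [float(x) for x in raw_args.split(",")]
    return sum(nums) if name == "add" else math.prod(nums)

def react_loop(ask_model, question, max_steps=5):
    """Alternate model steps and tool observations until the model writes 'Answer:'."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = ask_model(transcript)
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip(), transcript
        result = run_last_action(step)
        if result is not None:
            transcript += f"Observation: {result:g}\n"
    return None, transcript

# Scripted stand-in for the model so the loop can be tried without an API key.
scripted = iter([
    "Thought: First multiply 13 by 7.\nAction: calculator.mul(13,7)",
    "Thought: Now multiply 18 by 2.\nAction: calculator.mul(18,2)",
    "Thought: Add the two products.\nAction: calculator.add(91,36)",
    "Thought: I have my answer.\nAnswer: 127",
])
answer, transcript = react_loop(lambda _context: next(scripted), "What is (13*7) + (18*2)?")
print(transcript)
```

To run it against a real model, replace the scripted lambda with a function that sends the growing transcript through the notebook's `chat` helper. A live model may also write its own `Observation:` lines, so in practice you would cut its output at the first `Action:` line before appending the real result.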

---

## 7️⃣ Meta‑Prompting & Reflexion 🔍

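Ask the model to **critique** its own answer, then retry with that critique as feedback.

In [None]:
task = "Translate the following English sentence to French: 'The quick brown fox jumps over the lazy dog.'"

initial = chat("You are a translation assistant.", task)
# Include the task so the reviewer can see the source sentence, not just the translation
critique = chat("You are a critical reviewer.", 
                f"Task: {task}\n\nTranslation: {initial}\n\nEvaluate this translation for accuracy and fluency. If imperfect, explain why and give corrections.")
# Retry: feed the critique back so the second attempt can address it
revised = chat("You are a translation assistant.",
               f"{task}\n\nPrevious attempt: {initial}\n\nReviewer feedback: {critique}\n\nReply with the improved translation only.")
print("Initial translation:", initial)
print("\nCritique & suggestions:\n", critique)
print("\nRevised translation:", revised)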


---

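## 8️⃣ Generate‑Validate Loop ♻️

A two‑stage pipeline: one call generates an answer, a second call judges it, and a rejected answer triggers an automatic retry.
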
In [None]:
def generate_validate(task, passes=3):
    """Generate an answer, have a second call judge it, and retry until accepted or out of passes."""
    for i in range(passes):
        answer = chat("You are a reasoning assistant.", task)
        judge  = chat("You are an impartial validator.", 
                      f"Task: {task}\n\nProposed answer: {answer}\n\nIs this answer correct? Begin your reply with YES or NO, then explain.")
        # Check only the leading verdict; a substring test would also match "yes" inside a NO explanation
        if judge.strip().upper().startswith("YES"):
            print(f"✅ Pass {i+1}: answer accepted\n{answer}")
            return answer
        print(f"❌ Pass {i+1}: retrying…\nReason: {judge}\n")
    return None

generate_validate("What is the capital of Australia?")
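
Each retry above starts from scratch, so the generator never learns why it was rejected. A possible variant that passes the validator's objection into the next attempt is sketched below; `generate_validate_with_feedback` and the two `fake_*` functions are hypothetical, with the fakes standing in for model calls so the control flow can be checked offline:

```python
def generate_validate_with_feedback(generate, validate, task, passes=3):
    """Retry loop that forwards the validator's objection to the next attempt."""
    feedback = ""
    for attempt in range(1, passes + 1):
        prompt = task
        if feedback:
            prompt += f"\n\nA previous answer was rejected. Reviewer notes: {feedback}"
        answer = generate(prompt)
        verdict = validate(task, answer).strip()
        if verdict.upper().startswith("YES"):
            return answer, attempt
        feedback = verdict  # carry the objection into the next prompt
    return None, passes

# Offline stand-ins for the generator and validator calls
def fake_generate(prompt):
    return "Canberra" if "rejected" in prompt else "Sydney"

def fake_validate(task, answer):
    if answer == "Canberra":
        return "YES. Canberra is the capital."
    return "NO. Sydney is the largest city, not the capital."

result = generate_validate_with_feedback(
    fake_generate, fake_validate, "What is the capital of Australia?"
)
print(result)  # ('Canberra', 2)
```

Passing the generator and validator in as functions keeps the loop testable; with the notebook's `chat` helper they would each wrap one call with the corresponding system prompt.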


---

## 9️⃣ Capstone Playground 🚀

Design a *multi‑step* prompt pipeline (or small Python loop) that:

1. Uses **CoT/ToT/GoT** for initial reasoning  
2. Applies **Self‑Consistency** or **Generate‑Validate** for robustness  
3. Includes at least one **Reflection/Critique** pass

Share with classmates & try to break each other’s pipelines!

---

## 🔗 Further Reading

* Wei et al., “Chain‑of‑Thought Prompting Elicits Reasoning in Large Language Models”, 2022  
* Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, 2023  
* Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning”, 2023  
* Wang et al., “Self‑Consistency Improves Chain of Thought Reasoning in Language Models”, 2022  
* **OpenAI cookbook** – [github.com/openai/openai-cookbook](https://github.com/openai/openai-cookbook)
