# Contrastive Chain‑of‑Thought (Contrastive‑CoT)
**Prompt Engineering – Google Colab Notebook**

This notebook introduces **Contrastive Chain‑of‑Thought (Contrastive‑CoT)** prompting, a technique that aims to improve reasoning accuracy by *contrasting* multiple candidate explanations and selecting the most plausible one.

> *Goal: understand, implement, and experiment with Contrastive‑CoT in practice.*

## Learning objectives
By the end of this lab you will be able to:
1. **Describe** the motivation behind Chain‑of‑Thought (CoT) prompting and its limitations.  
2. **Explain** the intuition of contrastive reasoning and how it extends CoT.  
3. **Implement** a simple Contrastive‑CoT pipeline with the OpenAI API (or any LLM).  
4. **Measure** performance differences between *zero‑shot*, *standard CoT*, *self‑consistency*, and *Contrastive‑CoT*.  
5. **Experiment** with your own tasks and hyper‑parameters (number of rationales, contrastive prompts, vote rules).

---

## 1  Background: from CoT to Contrastive‑CoT
* **Chain‑of‑Thought (CoT)**: ask the model to *think step‑by‑step* before giving the final answer.  
* **Self‑Consistency**: sample *k* random CoTs, then majority‑vote the final answers (Wang et al., 2022).  
* **Why not enough?** Models sometimes produce *plausible but wrong* rationales. Majority vote may still pick the wrong answer if misleading rationales dominate.
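
To make the failure mode concrete, here is a small sketch with made-up sampled answers (illustrative values, not real model output). Three of the five chains share the same slip, so the vote returns the wrong answer even though two chains are right.

```python
from collections import Counter

# Hypothetical final answers from five sampled chains for the marble question
# used later in this notebook (correct answer: 9). Three chains forgot to add
# the blue marbles and stopped at 6.
sampled_answers = ["6", "9", "6", "6", "9"]
correct = "9"

winner, votes = Counter(sampled_answers).most_common(1)[0]
print(f"majority answer: {winner} ({votes}/{len(sampled_answers)} votes)")
print("majority is correct:", winner == correct)
```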

### 1.1  Key idea
Generate *pairs* of rationales and ask the model to judge **which one is more convincing** for the same question. The winning rationale is kept; the loser is discarded. Repeating tournament‑style selection yields a single, higher‑quality explanation → higher answer accuracy.

> This “explanation fighting” is inspired by contrastive learning: good explanations should beat bad ones when placed side‑by‑side.

---

## 2  High‑level pipeline

1. **Generate candidate rationales**  
   Use a *“Let’s think step by step”* style prompt *n* times (temperature > 0) → list of *(rationale, answer)* pairs.
2. **Contrastive selection**  
   Repeatedly pair up two distinct candidates and ask the model:  
   *“Here are two ways to solve the problem. Which answer is more likely correct and why?”*  
   Keep the winner of each duel. Iterate until one champion remains (or use a scoring function).
3. **Return champion’s answer**.
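
The selection step can be sketched offline before any API call is involved. The code below is a minimal illustration, not the notebook's final implementation: `run_tournament` and `toy_judge` are hypothetical names, and the judge is a stand-in that simply prefers the longer rationale, where a real judge would be an LLM call.

```python
import random
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (rationale, answer)

def run_tournament(candidates: List[Candidate],
                   judge: Callable[[Candidate, Candidate], Candidate],
                   seed: int = 0) -> Candidate:
    """Single elimination: pair candidates, keep each duel's winner, repeat."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = random.Random(seed)
    pool = list(candidates)
    while len(pool) > 1:
        rng.shuffle(pool)
        winners = [judge(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # the unpaired candidate gets a bye
        pool = winners
    return pool[0]

def toy_judge(a: Candidate, b: Candidate) -> Candidate:
    """Stand-in judge (assumption): prefer the longer rationale."""
    return a if len(a[0]) >= len(b[0]) else b

demo = [
    ("3 + 6", "9"),
    ("3 blue, 2 * 3 = 6 red, 3 + 6 = 9", "9"),
    ("2 * 3", "6"),
]
champion = run_tournament(demo, toy_judge)
print("champion answer:", champion[1])
```

Because the stand-in judge is deterministic, the longest rationale wins regardless of the shuffle; with an LLM judge the outcome can depend on pairing order.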

> 🛠 **Setup:** The next code cell installs/updates required libraries.  
*Skip if already installed.*

In [None]:
#@title ⬇️ Install & import
!pip -q install --upgrade openai tiktoken python-dotenv rich

In [None]:
import os, random, json, re
from typing import Tuple, List
import openai
from rich import print

# ⬇️📝 Set your API key as an environment variable or paste directly (not recommended)
openai.api_key = os.getenv('OPENAI_API_KEY', 'paste-your-key-here')

MODEL_NAME = 'gpt-3.5-turbo'

## 3  Helper: call the LLM

In [None]:
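def chat_completion(system: str, user: str, temperature: float = 0.7, max_tokens: int = 512) -> str:
    """Send one system + user message pair and return the reply text."""
    # openai>=1.0 (what `pip install --upgrade openai` installs) removed
    # openai.ChatCompletion; the module-level client uses openai.api_key set above.
    response = openai.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            { 'role':'system', 'content': system },
            { 'role':'user', 'content': user }
        ],
        temperature=temperature,
        max_tokens=max_tokens
    )
    # content can be None (e.g. a filtered reply), so fall back to an empty string
    return (response.choices[0].message.content or '').strip()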
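
API calls can fail transiently (rate limits, network hiccups), and the later sections issue many of them. A minimal retry sketch follows; `with_retries` is a hypothetical helper, not part of the OpenAI library, and it catches every `Exception` for brevity where a real notebook would narrow this to the errors it wants to retry. The demo uses an offline stand-in instead of a real API call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn(); after a failure wait base_delay * 2**attempt seconds and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")

# Offline demo: a stand-in that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"

result = with_retries(flaky_call, max_attempts=3, base_delay=0.0)
print(result, "after", attempts["count"], "attempts")
```

In the notebook this could wrap the helper as `with_retries(lambda: chat_completion(system, user))`.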

---

## 4  Mini evaluation set
We will use a *tiny* synthetic maths & logic dataset for quick experimentation. Feel free to replace with BIG‑BENCH or GSM‑8K for homework.

In [None]:
dataset = [
    {
        'question': 'Tom has 3 blue marbles and buys twice as many red marbles. How many marbles does he have now in total?',
        'answer': '9'
    },
    {
        'question': 'If today is Wednesday, what day of the week will it be in 19 days?',
        'answer': 'Monday'
    },
    {
        'question': 'A rectangle is twice as long as it is wide. If its perimeter is 36 cm, what is its area?',
        'answer': '72'
    },
]
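
Reference answers for hand-written items are easy to get wrong, so it can be worth recomputing them. The sketch below redoes the arithmetic for the three questions; the variable names are illustrative.

```python
# Question 1: 3 blue marbles plus twice as many red ones.
blue = 3
marbles_total = blue + 2 * blue

# Question 2: 19 days after a Wednesday.
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_after = days[(days.index("Wednesday") + 19) % 7]

# Question 3: perimeter 2 * (w + 2w) = 6w = 36, so w = 6 and the length is 12.
width = 36 // 6
area = width * (2 * width)

print(marbles_total, day_after, area)
```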

---

## 5  Baselines

In [None]:
def direct_answer(q: str) -> str:
    prompt = f"{q}\nAnswer:"
    return chat_completion('You are a helpful solver.', prompt, temperature=0.0, max_tokens=8)

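def cot_answer(q: str, temp=0.7) -> str:
    # Ask for an explicit 'Answer:' line so the regex below has something to match.
    prompt = f"{q}\nLet's think step by step. Give the final answer in a line that starts with 'Answer:'"
    full = chat_completion('You are an expert reasoning assistant.', prompt, temperature=temp)
    # Require ':' or '=' so that prose such as "the answer is ..." does not capture "is".
    m = re.search(r"\bAnswer\s*[:=]\s*(?P<ans>[-+]?\d+(?:\.\d+)?|[A-Za-z]+)\b", full)
    return m.group('ans') if m else full.strip()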

### 5.1  Evaluation helper

In [None]:
def accuracy(predict_fn):
    correct = 0
    for ex in dataset:
        pred = predict_fn(ex['question'])
        if str(pred).lower().strip() == str(ex['answer']).lower().strip():
            correct += 1
    return correct / len(dataset)

print('Direct (no‑CoT) accuracy:', accuracy(direct_answer))
print('1‑sample CoT accuracy:', accuracy(cot_answer))
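
The exact string comparison in `accuracy` counts a reply such as "9 marbles." as wrong. A more lenient matcher is sketched below; `answers_match` is a hypothetical helper, and it assumes the prediction is a short final answer (it compares the first number it finds, which would be the wrong choice for a full reasoning chain).

```python
import re

NUMBER = r"[-+]?\d+(?:\.\d+)?"

def answers_match(prediction: str, reference: str) -> bool:
    """Lenient comparison of a short model reply against a reference answer."""
    pred = str(prediction).lower().replace(",", "")
    ref = str(reference).strip().lower().replace(",", "")
    if re.fullmatch(NUMBER, ref):
        # Numeric reference: compare against the first number in the reply.
        numbers = re.findall(NUMBER, pred)
        return bool(numbers) and float(numbers[0]) == float(ref)
    # Textual reference: it must appear as a whole word in the reply.
    return re.search(rf"\b{re.escape(ref)}\b", pred) is not None

print(answers_match("9 marbles.", "9"))
print(answers_match("It will be Monday.", "Monday"))
print(answers_match("80", "72"))
```

Swapping this in for the `==` test changes what "accuracy" means, so report which matcher was used alongside any numbers.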

---

## 6  Self‑Consistency CoT (review)

We sample *k* diverse CoTs and majority‑vote their answers.

In [None]:
def self_consistency(q: str, k=5):
    answers = [cot_answer(q, temp=1.0) for _ in range(k)]
    best = max(set(answers), key=answers.count)
    return best

In [None]:
print('Self‑Consistency accuracy (k=5):', accuracy(lambda q: self_consistency(q, k=5)))
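
One detail of the vote above: `max(set(answers), key=answers.count)` breaks ties according to set iteration order, which is not guaranteed to be stable across runs. A sketch of a vote with explicit tie-breaking and a vote share follows; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Tuple

def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return (most common answer, its vote share); ties go to the answer sampled first."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    top = max(counts.values())
    winner = next(a for a in answers if counts[a] == top)
    return winner, top / len(answers)

answer, share = majority_vote(["9", "6", "9", "6"])
print(answer, share)
```

A low vote share is a cheap signal that the samples disagree, which is the situation where pairwise judging has the most room to help.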

---

## 7  Contrastive‑CoT implementation

In [None]:
def duel(q: str, cand1: Tuple[str,str], cand2: Tuple[str,str]) -> Tuple[str,str]:
    """Ask the model to choose between two candidate explanations."""
    prompt = (
        f"Question: {q}\n\n"
        f"Candidate A reasoning:\n{cand1[0]}\n\n"
        f"Candidate B reasoning:\n{cand2[0]}\n\n"
        "Between A and B, which answer is more likely correct? Reply with 'A' or 'B' first, then briefly justify."
    )
    choice = chat_completion('You are a strict judge of explanations.', prompt, temperature=0.2, max_tokens=32)
    # Accept only a standalone leading verdict; startswith('A') would also match "Answer: B".
    m = re.match(r"\W*(?:CANDIDATE\s+)?([AB])\b", choice.strip().upper())
    # An unparseable verdict falls back to a coin flip instead of always favoring B.
    verdict = m.group(1) if m else random.choice('AB')
    return cand1 if verdict == 'A' else cand2

def contrastive_cot(q: str, n_generate: int = 6, rounds: int = 3) -> str:
    """Main pipeline."""
    # Step 1: generate candidates
    candidates: List[Tuple[str,str]] = []
    for _ in range(n_generate):
        gen_prompt = (
            f"{q}\n\n"
            "Let's think step by step. Give the final answer in a line that starts with 'Answer:'"
        )
        reasoning = chat_completion('You are an expert problem‑solver.', gen_prompt, temperature=1.0)
        m = re.search(r"Answer\s*[:=]\s*(?P<ans>[-+]?\d+|[A-Za-z]+)", reasoning)
        ans = m.group('ans') if m else 'unknown'
        candidates.append((reasoning, ans))

    # Step 2: tournament duels
    for _ in range(rounds):
        if len(candidates) < 2:
            break
        random.shuffle(candidates)
        winners: List[Tuple[str,str]] = []
        for i in range(0, len(candidates) - 1, 2):
            winners.append(duel(q, candidates[i], candidates[i+1]))
        if len(candidates) % 2 == 1:
            winners.append(candidates[-1])
        candidates = winners

    champion_reasoning, champion_ans = candidates[0]
    return champion_ans

In [None]:
print('Contrastive‑CoT accuracy:', accuracy(lambda q: contrastive_cot(q, n_generate=6, rounds=3)))
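
LLM judges can favor whichever candidate is shown first. One common mitigation is to run each duel twice with the order swapped and only trust verdicts that agree. The sketch below illustrates the idea with offline stand-in judges; `order_robust_duel` and the judge signature `(question, first, second) -> 'A' or 'B'` are assumptions made for this example, not part of the pipeline above.

```python
from typing import Callable, Optional, Tuple

Candidate = Tuple[str, str]  # (rationale, answer)
Judge = Callable[[str, Candidate, Candidate], str]  # returns 'A' or 'B'

def order_robust_duel(q: str, cand1: Candidate, cand2: Candidate,
                      judge: Judge) -> Optional[Candidate]:
    """Judge twice with A/B swapped; return a winner only if both verdicts agree."""
    first = judge(q, cand1, cand2)   # cand1 is shown as A
    second = judge(q, cand2, cand1)  # cand1 is shown as B
    if first == "A" and second == "B":
        return cand1
    if first == "B" and second == "A":
        return cand2
    return None  # the verdict followed the position, not the content

# Stand-in judges for an offline demo (a real judge would call the LLM).
def position_biased_judge(q: str, a: Candidate, b: Candidate) -> str:
    return "A"  # always picks whatever is shown first

def content_judge(q: str, a: Candidate, b: Candidate) -> str:
    return "A" if a[1] == "9" else "B"

question = "Tom has 3 blue marbles and buys twice as many red marbles. How many in total?"
good = ("3 blue + 6 red = 9", "9")
bad = ("2 * 3 = 6", "6")

biased_result = order_robust_duel(question, good, bad, position_biased_judge)
fair_result = order_robust_duel(question, good, bad, content_judge)
print(biased_result, fair_result)
```

This doubles the number of judge calls per duel, and the caller still has to decide what to do with a `None` result (for example, keep both candidates or flip a coin).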

---

## 8  Discussion

* Contrastive‑CoT can **beat** plain CoT when the model is *better at judging than generating*; this is not guaranteed, so compare the accuracies measured above.  
* Computational cost ↑ (extra API calls). Trade‑off between *n_generate* and accuracy.  
* Judge bias: The same model serves as both generator and critic. Using a **different critic model** (e.g. `gpt-4o` as judge) can further help.
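
The cost side of the trade-off can be estimated before spending any tokens. The sketch below counts API calls by mirroring the pairing loop in `contrastive_cot` (one call per generated candidate, one per duel); `duel_count` and `total_calls` are hypothetical helper names.

```python
def duel_count(n_candidates: int, rounds: int) -> int:
    """Number of judge calls made by the round-based tournament."""
    duels = 0
    remaining = n_candidates
    for _ in range(rounds):
        if remaining < 2:
            break
        duels += remaining // 2                           # one duel per pair
        remaining = remaining // 2 + remaining % 2        # winners plus a possible bye
    return duels

def total_calls(n_generate: int, rounds: int) -> int:
    """Generation calls plus judge calls for one question."""
    return n_generate + duel_count(n_generate, rounds)

print("Contrastive-CoT, n_generate=6, rounds=3:", total_calls(6, 3))
print("Self-consistency, k=5:", 5)
```

Under these assumptions the default configuration makes roughly twice as many calls per question as self-consistency with `k=5`, and judge prompts are longer because they contain two rationales.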

---

## 9  Your turn 🎓
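1. Try larger `n_generate` (e.g. 10) and more `rounds`; a single champion needs at least ⌈log₂(`n_generate`)⌉ rounds (4 for 10 candidates), otherwise `contrastive_cot` returns the first of the remaining candidates.  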
2. Swap in your own tasks (GSM‑8K JSON file).  
3. Use a stronger critic model.  
4. Plot accuracy vs. cost.
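
For exercise 2, a possible starting point is sketched below. It assumes the JSON Lines layout of the public GSM8K release, where each record has `question` and `answer` fields and the final number follows `####` at the end of the answer text. `parse_gsm8k_lines` is a hypothetical helper, and the sample record is made up for the demo rather than taken from the dataset.

```python
import json
from typing import Dict, Iterable, List

def parse_gsm8k_lines(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Convert GSM8K-style JSONL records into this notebook's dataset format."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the final answer after '####' and drop thousands separators.
        final = record["answer"].split("####")[-1].strip().replace(",", "")
        examples.append({"question": record["question"], "answer": final})
    return examples

# Made-up record in the assumed format (not real dataset content).
sample_lines = [
    json.dumps({
        "question": "What is 1,000 + 234?",
        "answer": "1,000 + 234 = 1,234\n#### 1,234",
    }),
    "",
]
parsed = parse_gsm8k_lines(sample_lines)
print(parsed)
```

With a local file, the same function can be fed an open file handle, e.g. `with open(path) as f: dataset = parse_gsm8k_lines(f)`, where `path` points to wherever the JSONL file was saved.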

---

## 10  References
* Li et al., *Contrastive Decoding* (2023)  
* Wang et al., *Self‑Consistency Improves Chain‑of‑Thought Reasoning* (2022)  
* OpenAI Cookbook: <https://github.com/openai/openai-cookbook>

*Notebook prepared for **Charles N.'s Prompt Engineering class** – July 2025*