# Lesson 1: The Intuition of Test-Time Reinforcement Learning (TTRL)

## 🎓 What is TTRL?
Most LLMs are "frozen" after training. They generate an answer once, and if it's wrong, it stays wrong.

**Test-Time Reinforcement Learning (TTRL)**, a form of *inference-time compute*, changes this paradigm. Instead of trusting the first answer, we treat the model as a stochastic (random) generator. By letting it "think" multiple times and explore different reasoning paths, we can average out errors that only show up in some of the samples.

**In this lesson:**
1.  We connect to your local **Mistral** model via Ollama.
2.  We pose a problem famous for tricking human intuition (The **Linda Problem**).
3.  We implement **Majority Voting**, the simplest form of TTRL, to fix the model's bias without any training.

### 🛠️ Step 1: Library Setup
We use `ollama` as our interface to the local LLM. We also use `rich` to print beautiful tables in the terminal.

In [None]:
import time
from collections import Counter
from rich.console import Console
from rich.table import Table

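# Ensure the Ollama Python client is installed.
# Fail early here instead of hitting a NameError in a later cell.
try:
    import ollama
except ImportError as exc:
    raise ImportError("Please run: pip install ollama") from exc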

console = Console()
MODEL_NAME = "mistral:7b"  # Ensure you have run `ollama pull mistral:7b`

### 🤖 Step 2: The Model Interface
We create a function `get_answer` that sends one prompt to the model and returns its reply as a string.

**Critical Concept: Temperature**
*   **`temperature=0.0`**: The model always picks its most likely token, so repeated calls return essentially the same answer. Good for code, bad for TTRL.
*   **`temperature=0.9`**: The model samples more freely and generates diverse answers. **We need this** for TTRL to work, because voting only helps when there is a variety of opinions to vote on.
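
To build intuition for what temperature does, here is a minimal sketch of the standard softmax-with-temperature formula. The logits are made-up numbers for two candidate tokens, and the helper name `softmax_with_temperature` is ours; how the Ollama backend applies temperature internally is assumed to follow this textbook form.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 corresponds to greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the tokens "A" and "B" (illustration only)
logits = [1.0, 2.0]
for t in (0.1, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: P(A)={probs[0]:.3f}, P(B)={probs[1]:.3f}")
```

At a low temperature almost all probability mass sits on the top token; as the temperature rises, the less likely token is sampled more often, which is what gives the vote something to work with.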

In [None]:
console.print(f"[bold]Connecting to Local Model:[/bold] {MODEL_NAME}")

def get_answer(prompt: str, temp: float = 0.7) -> str:
    """Gets a single completion from Ollama."""
    try:
        response = ollama.chat(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": "You are a concise assistant. Answer with only the result."},
                {"role": "user", "content": prompt}
            ],
            # We pass the temperature dynamically
            options={"temperature": temp}
        )
        return response['message']['content'].strip()
    except Exception as e:
        return f"Error: {e}"

### 🧠 Step 3: The "Conjunction Fallacy" Problem
This is a classic cognitive science puzzle.

*   **Option A**: Probability of event $X$.
*   **Option B**: Probability of event $X$ AND event $Y$.

Mathematically, $P(X) \ge P(X \cap Y)$, so Option A is always at least as probable as Option B. **A is always the correct answer.**
However, LLMs (like humans) get distracted by the detailed description and often guess **B**.
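
The inequality follows from the product rule, $P(X \cap Y) = P(X) \cdot P(Y \mid X)$, because $P(Y \mid X) \le 1$. A quick numeric check, using hypothetical probabilities chosen only for illustration:

```python
# Hypothetical numbers, chosen only for illustration
p_teller = 0.05                 # P(X): Linda is a bank teller
p_feminist_given_teller = 0.90  # P(Y | X): very likely, given the description

# Product rule: P(X and Y) = P(X) * P(Y | X)
p_both = p_teller * p_feminist_given_teller

print(f"P(A) = P(X)       = {p_teller:.3f}")
print(f"P(B) = P(X and Y) = {p_both:.3f}")
```

Even when the second event is very likely given the first, multiplying by a value of at most 1 can never make the conjunction more probable than the single event.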

In [None]:
PROBLEM = """
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice.
Which is more probable?
A) Linda is a bank teller.
B) Linda is a bank teller and is active in the feminist movement.
Answer with only 'A' or 'B'.
"""

console.print("\n[bold cyan]Problem:[/bold cyan] Linda Problem (Conjunction Fallacy)")

### 🎲 Step 4: Test-Time Scaling (Sampling)
This is the core of TTRL: **Don't ask once. Ask many times.**

We sample $N=5$ times with high temperature (`0.9`).
*   Some samples might be "lazy" (System 1 bias).
*   Some samples might be "lucky" or "reasoned" (System 2 correctness).

We collect them all into a list.
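
To see when asking many times can help, here is a small sketch under simplifying assumptions that the lesson itself does not state: answers are binary, samples are independent, and each sample is correct with the same probability `p`. The helper name `majority_correct_probability` is ours.

```python
from math import comb

def majority_correct_probability(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct (odd n)."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

for p in (0.4, 0.6):
    for n in (1, 5, 25):
        print(f"p={p}, N={n}: {majority_correct_probability(p, n):.3f}")
```

Under these assumptions, more samples help only when a single sample is right more often than not (`p > 0.5`). Below that threshold, a larger `N` makes the wrong answer win more reliably.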

In [None]:
N_SAMPLES = 5
samples = []

table = Table(title=f"Real Generated Samples (N={N_SAMPLES})")
table.add_column("Sample ID", style="dim")
table.add_column("Answer", justify="center")

console.print("[yellow]Sampling...[/yellow]")
for i in range(N_SAMPLES):
    # 1. GENERATE with High Temperature
    raw_ans = get_answer(PROBLEM, temp=0.9)
    
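    # 2. NORMALIZE (Data Cleaning)
    # Models are chatty, so take the first standalone 'A' or 'B' token.
    # Unparseable replies (including errors) become '?' instead of a silent vote for 'B'.
    tokens = "".join(c if c.isalnum() else " " for c in raw_ans).split()
    ans = next((t for t in tokens if t in ("A", "B")), "?")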
    
    samples.append(ans)
    
    # Visualization
    color = "green" if ans == "A" else "red"
    table.add_row(str(i+1), f"[{color}]{ans}[/{color}]")

console.print(table)

### 🗳️ Step 5: Majority Voting (Pseudo-Reward)
Now we aggregate the intelligence.

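We assume that **Truth is more robust than Error**. If the model hallucinates, it might hallucinate different things each time. But if it finds the logic, the logic is consistent.

Therefore, the **Consensus** is likely the correct answer. The assumption breaks down when the errors themselves are consistent: if most samples share the same bias, the vote simply returns that bias.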
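
The "pseudo-reward" in this step's title can be made concrete: treat the majority answer as a stand-in label and score each sample by whether it agrees. The sketch below shows one simple formulation, with a hypothetical helper `pseudo_rewards` and made-up sample answers. In a full TTRL setup such rewards would feed a training update; this lesson stops at the vote.

```python
from collections import Counter

def pseudo_rewards(answers):
    """Return (consensus, rewards): reward 1 if a sample matches the majority answer, else 0."""
    if not answers:
        raise ValueError("need at least one answer")
    # On a tie, most_common returns the answer that appeared first in the list.
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1 if a == consensus else 0 for a in answers]
    return consensus, rewards

example = ["A", "B", "A", "A", "B"]  # hypothetical samples, not real model output
consensus, rewards = pseudo_rewards(example)
print(consensus, rewards)
```

No ground-truth label is needed to compute these scores, which is why they are only a pseudo-reward: they measure agreement with the crowd, not correctness.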

In [None]:
counts = Counter(samples)
consensus_answer, consensus_count = counts.most_common(1)[0]

console.print(f"\n[bold]Consensus Analysis:[/bold]")
console.print(f"Most Common Answer: [bold magenta]{consensus_answer}[/bold magenta] ({consensus_count}/{N_SAMPLES} votes)")

GROUND_TRUTH = "A"
if consensus_answer == GROUND_TRUTH:
    console.print("✅ [bold green]SUCCESS:[/bold green] The consensus matches the Ground Truth (A)!")
    console.print("Repeated sampling outvoted the individual bias.")
else:
    console.print(f"❌ [bold red]FAILURE:[/bold red] The consensus ({consensus_answer}) does not match the Ground Truth ({GROUND_TRUTH}).")