# Lesson 1: The Intuition of TTRL (with a Real Local LLM)

**Test-Time Reinforcement Learning (TTRL)** is based on a simple premise:
*If we let a model "think" longer by exploring multiple paths, it can verify and correct itself.*

In this lesson, we perform **Generative Consensus** using your local **Ollama** model.

We will:
1.  Ask a trick logic question (one that `mistral:7b` often gets wrong).
2.  Generate diverse answers (Test-Time Scaling).
3.  Use "Majority Voting" to define a **Pseudo-Reward**.

In [1]:
import time
from collections import Counter
from rich.console import Console
from rich.table import Table
try:
    import ollama
except ImportError as exc:
    # Fail fast: every later cell needs the Ollama client.
    raise ImportError("Please run: pip install ollama") from exc

console = Console()
MODEL_NAME = "mistral:7b"  # Your local model

## 1. Connect to the Real Model
We define a helper function `get_answer` that queries your local Ollama instance.
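Note the `temperature` parameter: a higher temperature makes sampling more random, so repeated calls give more diverse answers. That diversity is essential for TTRL, because identical samples would make the vote meaningless.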
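To make the effect of temperature concrete, here is a minimal sketch of the standard temperature-scaled softmax. It is an illustration only: the function name `softmax_with_temperature` and the scores in `toy_scores` are assumptions made for this example, and a real sampler may apply additional filtering on top of this.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0]  # hypothetical scores for two candidate tokens
sharp = softmax_with_temperature(toy_scores, 0.2)  # low temperature
soft = softmax_with_temperature(toy_scores, 0.9)   # high temperature
```

At the low temperature almost all probability goes to the top-scoring token, while at 0.9 the second token keeps a meaningful share, so repeated samples are more likely to differ.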

In [2]:
console.print(f"[bold]Connecting to Local Model:[/bold] {MODEL_NAME}")

def get_answer(prompt: str, temp: float = 0.7) -> str:
    """Gets a single completion from Ollama."""
    try:
        response = ollama.chat(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": "You are a concise assistant. Answer with only the result."},
                {"role": "user", "content": prompt}
            ],
            options={"temperature": temp}
        )
        return response['message']['content'].strip()
    except Exception as e:
        return f"Error: {e}"

## 2. The Problem: Conjunction Fallacy
We use the classic "Linda Problem".

*   **Option A**: Linda is a bank teller.
*   **Option B**: Linda is a bank teller AND active in the feminist movement.

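**Logic Rule**: For any two events X and Y, P(X) >= P(X and Y), because every case in which both hold is also a case in which X holds.
Here X is "bank teller" and Y is "active in the feminist movement", so Option A can never be less probable than Option B. Therefore, **A** is the correct answer, even if the description sounds like B.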
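The rule can be checked numerically with the product rule P(X and Y) = P(X) * P(Y | X). The sketch below uses made-up probabilities (`p_teller` and `p_feminist_given_teller` are assumptions chosen for illustration, not estimates about Linda) to show that even a very likely second condition cannot push the conjunction above the single event.

```python
def conjunction_probability(p_x: float, p_y_given_x: float) -> float:
    """Product rule: P(X and Y) = P(X) * P(Y | X)."""
    for p in (p_x, p_y_given_x):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_x * p_y_given_x

p_teller = 0.05                 # assumed: P(Linda is a bank teller)
p_feminist_given_teller = 0.95  # assumed: P(active feminist | bank teller)
p_both = conjunction_probability(p_teller, p_feminist_given_teller)
```

Because `p_feminist_given_teller` is at most 1, `p_both` can never exceed `p_teller`, whatever values are plugged in.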

In [3]:
PROBLEM = """
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice.
Which is more probable?
A) Linda is a bank teller.
B) Linda is a bank teller and is active in the feminist movement.
Answer with only 'A' or 'B'.
"""

console.print(f"\n[bold cyan]Problem:[/bold cyan] Linda Problem (Conjunction Fallacy)")

## 3. Test-Time Scaling (Sampling)
Instead of asking once, we ask **5 times** with a high temperature (0.9), so each call can produce a different answer.
This simulates "thinking about it from different angles".

In [4]:
import re

N_SAMPLES = 5
samples = []

table = Table(title=f"Real Generated Samples (N={N_SAMPLES})")
table.add_column("Sample ID", style="dim")
table.add_column("Answer", justify="center")

console.print("[yellow]Sampling...[/yellow]")
for i in range(N_SAMPLES):
    # High temperature for diversity
    raw_ans = get_answer(PROBLEM, temp=0.9)
    # Normalize to a standalone 'A' or 'B' (so "Answer: B" is read as B).
    # Errors and unparseable replies become '?' instead of a silent 'B' vote.
    match = None if raw_ans.startswith("Error:") else re.search(r"\b([AB])\b", raw_ans)
    ans = match.group(1) if match else "?"

    samples.append(ans)
    color = {"A": "green", "B": "red"}.get(ans, "yellow")
    table.add_row(str(i + 1), f"[{color}]{ans}[/{color}]")

console.print(table)
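Why sample several times instead of once? Under two simplifying assumptions (the answer is binary, and the samples are independent with the same chance of being right), the probability that a strict majority is correct follows a binomial sum. The sketch below computes it; `majority_correct_probability` and the accuracy value `single` are hypothetical names and numbers used only for illustration.

```python
from math import comb

def majority_correct_probability(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct."""
    needed = n_samples // 2 + 1  # smallest strict majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )

single = 0.6  # assumed per-sample accuracy, for illustration only
five = majority_correct_probability(single, 5)
fifteen = majority_correct_probability(single, 15)
```

With a per-sample accuracy above 0.5, more samples make the majority more reliable. The same formula cuts the other way: if the model is wrong more often than right, as can happen on trick questions like this one, voting makes the consensus worse rather than better.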

## 4. Majority Voting (Consensus)
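We count the votes. The most common answer becomes our **Consensus**.
In TTRL, this Consensus acts as a **Pseudo-Reward** signal: a sample that agrees with the consensus is rewarded, and one that disagrees is not. No ground-truth label is needed to compute it.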

In [5]:
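# Count only valid votes; anything other than 'A' or 'B' is ignored.
counts = Counter(s for s in samples if s in ("A", "B"))
if not counts:
    raise RuntimeError("No valid 'A'/'B' answers were sampled. Is Ollama running?")
consensus_answer, consensus_count = counts.most_common(1)[0]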

console.print(f"\n[bold]Consensus Analysis:[/bold]")
console.print(f"Most Common Answer: [bold magenta]{consensus_answer}[/bold magenta] ({consensus_count}/{N_SAMPLES} votes)")

GROUND_TRUTH = "A"
if consensus_answer == GROUND_TRUTH:
    console.print("✅ [bold green]SUCCESS:[/bold green] The consensus matches the Ground Truth (A)!")
else:
    console.print("❌ [bold red]FAILURE:[/bold red] The model fell for the fallacy.")
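The step from consensus to pseudo-reward can be sketched as follows. This is a minimal illustration, assuming a binary reward of 1.0 for agreeing with the majority and 0.0 otherwise; the helper `majority_vote_rewards` and the list `example_answers` are hypothetical and not part of the notebook above.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (consensus, rewards): 1.0 for answers matching the majority, else 0.0."""
    if not answers:
        raise ValueError("need at least one answer")
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards

example_answers = ["A", "B", "A", "A", "B"]  # hypothetical samples
consensus, rewards = majority_vote_rewards(example_answers)
```

Note that the rewards depend only on the samples themselves. `GROUND_TRUTH` plays no part, which is what makes the signal usable at test time, and also what makes it wrong whenever the majority is wrong.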

## 5. Conclusion
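You have just implemented the **Inference Loop** of TTRL: sample several answers, vote, and treat the consensus as a pseudo-reward.
In a full training system, we would reward every sample that agrees with the consensus and use those rewards to update the model weights with a policy-gradient method such as PPO (Proximal Policy Optimization). The reward comes from the vote, not from `GROUND_TRUTH`, which we used here only to check the result. If the consensus had been "B", training would have reinforced the fallacy instead of correcting it.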
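To give a feel for what "updating the weights" means, here is a toy sketch on a two-answer policy. It is deliberately not PPO: it performs one plain REINFORCE-style step with a mean-reward baseline, and the "model" is just a dictionary of two logits. Every name in it (`softmax`, `consensus_update`, `policy_logits`, `toy_samples`) is an assumption made for this illustration.

```python
import math
from collections import Counter

def softmax(logits):
    """Map a dict of answer -> logit to a dict of answer -> probability."""
    peak = max(logits.values())
    exps = {a: math.exp(v - peak) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def consensus_update(logits, samples, lr=1.0):
    """One REINFORCE-style step using majority-vote rewards (toy, not PPO)."""
    probs = softmax(logits)
    consensus, _ = Counter(samples).most_common(1)[0]
    rewards = [1.0 if s == consensus else 0.0 for s in samples]
    baseline = sum(rewards) / len(rewards)  # mean reward as a simple baseline
    new_logits = dict(logits)
    for sample, reward in zip(samples, rewards):
        advantage = reward - baseline
        for answer in logits:
            # d log pi(sample) / d logit[answer] = 1[answer == sample] - pi(answer)
            grad = (1.0 if answer == sample else 0.0) - probs[answer]
            new_logits[answer] += lr * advantage * grad / len(samples)
    return new_logits

policy_logits = {"A": 0.0, "B": 0.0}     # hypothetical two-answer policy
toy_samples = ["A", "A", "A", "B", "B"]  # hypothetical sampled answers
updated = consensus_update(policy_logits, toy_samples)
```

After the step, the logit of the majority answer rises and the other falls, so the policy becomes more likely to produce the consensus next time. A real system would apply the same idea to the parameters of the language model, with the extra safeguards that PPO adds.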