<a href="https://colab.research.google.com/github/bshahrok/llm-ta-aied26/blob/main/predict_next.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Agenda:
A multi-agent discussion framework designed to  label teacher's next utterances in classroom transcripts using the CAD codebook (WCT, GT, or Other). Given a full transcript, the system processes each utterance sequentially, using all prior utterances as context to predict the label of the next one.
## todo:
[] break down teacher's long transcript

[] Other lable is vague for agents


## setup

In [14]:
import os, re, json, time, ast, logging, gc
from typing import Dict, List, Tuple, Optional, Any
from dataclasses import dataclass, field

import matplotlib.pyplot as plt
import seaborn as sns


import math
import numpy as np
import pandas as pd

import scipy
from scipy.stats import binomtest
from sklearn.mixture import GaussianMixture

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# print(torch.cuda.get_device_name(0))


from transformers import AutoTokenizer, AutoModelForCausalLM

cuda


In [6]:
!pip install -U bitsandbytes>=0.46.1
import bitsandbytes

In [12]:
data_path = "/content/drive/MyDrive/Projects/2025 papers/ISLS26/CAD_data/teacher/trans_df.csv"
BATCH_NUM = 200

from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
df = pd.read_csv(data_path)
df.head()
uttranes = list(df['transcript'])

## Configs

In [44]:
# How many discussion rounds between Phase 0 and the final vote
# 2 has been recommended by (Du et al., 2023)
NUM_DISCUSSION_ROUNDS = 2

# Vote weight for an agent that never changed its label across all rounds
CONSISTENT_AGENT_WEIGHT = 1.5
DEFAULT_AGENT_WEIGHT = 1.0

# Model
MAX_NEW_TOKENS = 512
TEMPERATURE = 0.1
TOP_K = 40
CPU_MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
GPU_MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# File to log raw model outputs when parsing fails (for debugging)
RAW_OUTPUT_LOG = "raw_output_debug.log"

MAX_RETRIES = 2

### Agent Config

In [17]:
AGENTS = [
    {
        "name": "LinguistAgent",
        "context_window": 1,
        "persona": (
            "You are a linguistics expert who focuses on surface-level language cues: "
            "pronouns (everybody, you all, group X), direct address forms, imperatives, "
            "and who the teacher is grammatically targeting."
        ),
    },
    {
        "name": "PedagogyAgent",
        "context_window": 4,
        "persona": (
            "You are an experienced pedagogy researcher who understands classroom dynamics. "
            "You reason about the instructional intent behind each utterance — whether the teacher "
            "is managing the whole class, scaffolding a small group, or doing something else entirely."
        ),
    },
    {
        "name": "ContextAgent",
        "context_window": 6,
        "persona": (
            "You are a classroom observer who tracks the flow of conversation over time. "
            "You pay careful attention to how the current utterance relates to what came before — "
            "transitions, topic shifts, and whether the teacher has changed her audience."
        ),
    },
    {
        "name": "SkepticalAgent",
        "context_window": 3,
        "persona": (
            "You are a careful, skeptical annotator. You assume nothing and look for explicit evidence "
            "in the text. If the utterance is ambiguous or lacks a clear addressee, you lean toward 'None'. "
            "You only assign WCT, GT or Other when the evidence is unambiguous."
        ),
    },
    {
        "name": "HolisticAgent",
        "context_window": 8,
        "persona": (
            "You are a holistic analyst who considers the full available context. You weigh linguistic cues, "
            "pedagogical purpose, conversational flow, and implicit classroom norms to make your decision."
        ),
    },
]


## Model manager

In [45]:
from transformers import BitsAndBytesConfig

class ModelManager:
    """Handles model loading and inference."""

    def __init__(
        self,
        model_id: Optional[str] = None,
        device: Optional[str] = None,
        temperature: float = 0.0,
        top_k: int = 40,
        use_8bit: bool = False,
        use_4bit: bool = True,
        max_memory_gb: Optional[float] = None,
        debug: bool = False
    ):

        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(logging.DEBUG if debug else logging.INFO)

        self.device = self._resolve_device(device)
        self.model_id = model_id or self._get_default_model_id()

        # Quantization flags
        if use_4bit and use_8bit:
            raise ValueError("Pick one: use_4bit or use_8bit (not both).")
        self.use_8bit = bool(use_8bit)
        self.use_4bit = bool(use_4bit)

        # Generation defaults
        self.temperature = float(temperature)
        self.top_k = int(top_k)

        # Memory controls
        self.max_memory_gb = max_memory_gb
        # self._setup_memory_optimization()

        self._tokenizer: Optional[AutoTokenizer] = None
        self._model: Optional[AutoModelForCausalLM] = None


    @staticmethod
    def clear_cache():
        """Clears GPU cache and runs garbage collection."""
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
        gc.collect()

    @staticmethod
    def get_memory_info() -> Dict[str, float]:
        """Returns current GPU memory usage in GB."""
        if not torch.cuda.is_available():
            return {}

        device = torch.cuda.current_device()
        allocated = torch.cuda.memory_allocated(device) / 1024**3
        reserved = torch.cuda.memory_reserved(device) / 1024**3
        total = torch.cuda.get_device_properties(device).total_memory / 1024**3
        free = total - allocated

        return {
            "allocated_gb": allocated,
            "reserved_gb": reserved,
            "total_gb": total,
            "free_gb": free
        }

    def _log_memory_usage(self):
        """Logs current memory usage."""
        info = self.get_memory_info()
        if info:
            self.logger.debug(
                f"GPU Memory - Allocated: {info['allocated_gb']:.2f}GB, "
                f"Free: {info['free_gb']:.2f}GB, Total: {info['total_gb']:.2f}GB"
            )

    @staticmethod
    def setup_memory_optimization():
        """Sets up environment variables for better memory management."""
        os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

    def _resolve_device(self, device: Optional[str]) -> torch.device:
        """Determines the appropriate device for model execution."""
        if device:
            return torch.device(device)
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def _get_default_model_id(self) -> str:
        """Selects default model based on available hardware."""
        if self.device.type == "cuda":
            return CPU_MODEL_ID
        return CPU_MODEL_ID

    def load_model(self):
        """Loads tokenizer and model if not already loaded."""
        if self._model and self._tokenizer:
            return
        self.logger.debug(f"Loading model: {self.model_id}")
        self.clear_cache()

        quantization_config = None
        if self.use_4bit:
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type='nf4',
                bnb_4bit_compute_dtype=torch.bfloat16
            )
        elif self.use_8bit:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True
            )

        try:
          self._tokenizer = AutoTokenizer.from_pretrained(self.model_id, use_fast=True)
          self._model = AutoModelForCausalLM.from_pretrained(
              self.model_id,
              device_map="auto",
              quantization_config=quantization_config
          )
          self._model.eval()

          # Log memory after loading if in debug
          self._log_memory_usage()
        except torch.cuda.OutOfMemoryError as e:
            self.logger.error(f"CUDA OOM while loading model: {e}")
            self.clear_cache()
            raise RuntimeError(
                "Out of GPU memory. Try: \n"
                "1. Use smaller model (1.5B instead of 7B)\n"
                "2. Enable quantization: use_8bit=True or use_4bit=True (already attempted if not explicitely set to false)\n"
                "3. Set max_memory_gb to limit memory per GPU\n"
                "4. Close other GPU processes"
            ) from e

    def unload_model(self):
        """Unloads the model and tokenizer from memory."""
        self.logger.info("Unloading model and tokenizer from memory.")
        del self._model
        del self._tokenizer
        self._model = None
        self._tokenizer = None
        self.clear_cache()


    def generate(
        self,
        prompt: str,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        remove_prompts: bool = False
    ) -> str:
        """Generates text from the model given a prompt."""
        self.load_model()
        assert self._model is not None
        assert self._tokenizer is not None

        temp = self.temperature if temperature is None else float(temperature)
        tk = self.top_k if top_k is None else int(top_k)

        try:
          # Clear cache before generation
          self.clear_cache()

          inputs = self._tokenizer(
              prompt,
              return_tensors="pt",
              truncation=True
          ).to(self._model.device)
          self.logger.debug(f" Calling model with these inputs: {inputs}")

          with torch.no_grad():
              outputs = self._model.generate(
                  **inputs,
                  do_sample=(temp > 0.0),
                  temperature=float(temp),
                  top_k=int(tk),
                  max_new_tokens=MAX_NEW_TOKENS,
                  pad_token_id=self._tokenizer.eos_token_id,
                  eos_token_id=self._tokenizer.eos_token_id,
              )
          self.logger.debug(f"Model parameters: temperature={temp}, top_k={tk}, max_new_tokens={MAX_NEW_TOKENS}")
          # self.logger.debug(f"Model raw output: {outputs}")

          full_text = self._tokenizer.decode(outputs[0], skip_special_tokens=True)



          if remove_prompts:
            prompt_text = self._tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
            if full_text.startswith(prompt_text):
                generated = full_text[len(prompt_text):].strip()
          else:
                generated = full_text.strip()

          # Clean up inputs/outputs tensors
          del inputs, outputs
          self.clear_cache()

          # Clean DeepSeek-R1 reasoning tokens
          # generated = self._clean_deepseek_output(generated)
          return generated

        except torch.cuda.OutOfMemoryError as e:
            self.logger.error(f"CUDA OOM during generation: {e}")

            # Log memory after loading if in debug
            self._log_memory_usage()

            self.clear_cache()
            raise RuntimeError(
                "Out of GPU memory during generation. Try:\n"
                "1. Reduce max_new_tokens\n"
                "2. Process texts in smaller batches\n"
                "3. Unload and reload model: agent.model_manager.unload_model()\n"
                "4. Enable quantization if not already enabled"
            ) from e

## Prompt Builder

In [46]:
CAD_CODEBOOK_DICT = {
    "WCT": "The teacher is addressing the whole class.",
    "GT":  "The teacher is addressing a group or a student in a group. It also includes any talk: student level",
    "Other": "The teacher isn’t talking to the whole class or any groups or students. Either she’s silent or talking to herself or a visitor in a non-distracting way: "
}


#  Default examples for few-shot learning
DEFAULT_EXAMPLES = [
    {
        "input": "Everybody please listen.",
        "output": {
            "CAD-code": "WCT",
            "rationale": 'Addresses the whole class using "Everybody" to get attention.'
        }
    },
    {
        "input": "Group 3, read the next paragraph.",
        "output": {
            "CAD-code": "GT",
            "rationale": 'Directs a specific group "Group 3" to perform an action.'
        }
    }
    ]

# JSON Schema definition
SCHEMA = {
    "CAD-code": "<ONE OF: WCT, GT, Other, NONE>",
    "rationale": "<≤5 sentences, evidence-based>"
}

# Valid codes for validation
VALID_CODES = {"WCT", "GT", "Other", "NONE"}

def build_codebook_str() -> str:
    lines = ["**CAD Codebook:**"]
    for code, desc in CAD_CODEBOOK_DICT.items():
        lines.append(f"  - {code}: {desc}")
    return "\n".join(lines)


def build_examples_str() -> str:
    lines = ["**Few-shot Examples:**"]
    for ex in DEFAULT_EXAMPLES:
        lines.append(f'  Input: "{ex["input"]}"')
        lines.append(f'  Label: {ex["output"]["CAD-code"]} — {ex["output"]["rationale"]}')
    return "\n".join(lines)


def format_context(utterances: list) -> str:
    if not utterances:
        return "  (No prior context — this is the first utterance)"
    return "\n".join([f"  [{i+1}] {u}" for i, u in enumerate(utterances)])


def format_other_agents_views(agent_states: list, exclude_name: str) -> str:
    """
    Format other agents' current label + rationale for the critique prompt.
    Agents whose label is not a valid codebook entry (e.g. fallback/placeholder
    values) are silently skipped so they don't pollute the debate.
    """
    lines = []
    for s in agent_states:
        if s["agent"] == exclude_name:
            continue
        if s["label"] not in VALID_CODES:
            continue  # skip placeholder / fallback states
        lines.append(f'  • [{s["agent"]}] → {s["label"]}')
        lines.append(f'    Rationale: {s["rationale"]}')
    return "\n".join(lines) if lines else "  (No valid agent positions to compare yet)"

## llm generation


In [47]:
def parse_llm_output(raw: str) -> dict:
    """
    Robust multi-strategy parser for LLM JSON output.

    Tries strategies in order of reliability:
      1. <answer>...</answer> XML tags  — primary (chain-of-thought stays outside)
      2. ```json ... ```                 — markdown code fence fallback
      3. First {...} JSON object         — bare JSON fallback
    Raises ValueError (with raw snippet) only if all three strategies fail.
    """
    valid = {"WCT", "GT", "Other"}

    # ── Strategy 1: last valid <answer> block ──
    # The model sometimes generates multiple <answer> blocks while looping;
    # the last complete one is most likely to be the final settled answer.
    all_answers = list(re.finditer(r"<answer>\s*(.*?)\s*</answer>", raw, re.DOTALL))
    for m in reversed(all_answers):          # iterate from last to first
        try:
            parsed = json.loads(m.group(1))
            if parsed.get("CAD-code") in valid:  # reject placeholder echoes
                return parsed
        except json.JSONDecodeError:
            continue

    # ── Strategy 2: ```json ... ``` code fence ──
    match = re.search(r"```json\s*(.*?)\s*```", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass

    # ── Strategy 3: first { ... } block in the output ──
    match = re.search(r"\{.*?\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass

    # ── Strategy 4: truncated JSON — extract CAD-code directly from raw text ──
    # Handles the case where MAX_TOKENS cuts the response before the closing },
    # leaving the rationale incomplete but the CAD-code value already written.
    match = re.search(
        r'"CAD-code"\s*:\s*"(' + "|".join(re.escape(k) for k in ["WCT", "GT", "Other"]) + r'")',
        raw,
    )
    if match:
        label = match.group(1).rstrip('"')
        if label in {"WCT", "GT", "Other"}:   # extra guard: reject placeholder strings
            return {
                "CAD-code": label,
                "rationale": "[Recovered from truncated output]",
            }

    raise ValueError(
        f"All parsing strategies failed. Raw output (first 400 chars):\n{raw[:400]}"
    )
def log_raw_output(prompt: str, raw: str, error: str) -> None:
    """Append a failed parse attempt to the debug log for inspection."""
    with open(RAW_OUTPUT_LOG, "a", encoding="utf-8") as f:
        f.write("=" * 60 + "\n")
        f.write(f"ERROR   : {error}\n")
        f.write(f"PROMPT  : {prompt[:50]}\n")
        f.write(f"RAW OUT : {raw}\n")
        f.write("=" * 60 + "\n\n")


In [48]:
def call_deepseek(prompt: str) -> dict:
    """Call DeepSeek with max_tokens cap to prevent runaway generation."""
    mm = ModelManager()
    response = mm.generate(prompt, remove_prompts=True)
    return parse_llm_output(response)


def safe_call(prompt: str, fallback_label: str = "None") -> dict:
    """
    Call DeepSeek with retry logic and graceful fallback.

    - Retries up to MAX_RETRIES times on any parse or API failure.
    - Logs every failed attempt (with raw output) to RAW_OUTPUT_LOG.
    - Returns a fallback label only after all retries are exhausted.
    """
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
          res = call_deepseek(prompt)
          return res
        except Exception as e:
            last_error = e
            raw = getattr(e, "__cause__", None)
            log_raw_output(prompt, str(e), f"Attempt {attempt}/{MAX_RETRIES}: {e}")
            if attempt < MAX_RETRIES:
                print(f"    [safe_call] Parse failed (attempt {attempt}/{MAX_RETRIES}), retrying...")

    print(f"    [safe_call] All {MAX_RETRIES} attempts failed. Using fallback='{fallback_label}'. See {RAW_OUTPUT_LOG}")
    return {"CAD-code": fallback_label, "rationale": f"[Fallback after {MAX_RETRIES} failed attempts: {last_error}]"}

In [49]:
def _clean_deepseek_output(text: str) -> str:
        """
        Cleans DeepSeek-R1 model output by removing reasoning tokens.
        DeepSeek-R1 models wrap reasoning in <think></think> or similar tags.
        """
        # Remove content between think tags
        text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
        text = re.sub(r'<sub>.*?</sub>', '', text, flags=re.DOTALL)

        # Try to extract JSON object if present
        json_match = re.search(r'\{[^{}]*"CAD-code"[^{}]*\}', text, flags=re.DOTALL)
        if json_match:
            return json_match.group(0)

        return text.strip()


In [50]:
# res = safe_call("what is python?")
# res

## Multi-agent Discussion
Multi-Agent Discussion Framework for Teacher Transcript Labeling
================================================================
Pipeline per utterance:
  

1.   Phase 0 — Independent Predictions  : each agent predicts in isolation
2.   Phase 1 —
*   Structured Critique Round: each agent reads all others' views and critiques
*   Revision Round: each agent produces a final revised prediction
4.   Final   — Weighted Majority Vote    : agents that never changed position get 1.5× weight
  

*   Tie-break — Judge Agent             : neutral agent reads full debate and
breaks any tie

Model  : DeepSeek (deepseek-reasoner via OpenAI-compatible API)
Install: pip install openai


### PHASE 0 — INDEPENDENT PREDICTION

In [21]:
def build_phase0_prompt(agent: dict, context_utterances: list, next_utterance: str) -> str:
    valid_labels = list(CAD_CODEBOOK_DICT.keys())
    return f"""
{agent['persona']}

Your task is to predict the CAD label of the NEXT teacher utterance given the conversation history.

{build_codebook_str()}

{build_examples_str()}

**Conversation History (last {len(context_utterances)} utterance(s)):**
{format_context(context_utterances)}


Reason step by step:
1. Predict Who is the teacher addressing in the next utterance — the whole class, a specific group/student, or nobody in particular?
2. Does the prior conversation history shift or reinforce your reading?

When done reasoning, wrap your final answer in <answer> tags containing ONLY a JSON object:
<answer>
{{
  "CAD-code": "one of: {', '.join(valid_labels)}",
  "rationale": "your reasoning in 1-2 sentences"
}}
</answer>
""".strip()


def phase0_independent_predictions(utterances: list, target_idx: int, verbose: bool) -> list:
    """
    Phase 0: Every agent independently predicts the label of utterances[target_idx+1],
    using utterances[target_idx - window : target_idx+1] as context (next-utterance prediction).
    Returns a list of agent state dicts.
    """
    next_utterance = utterances[target_idx]
    agent_states = []

    if verbose:
        print("\n  ── Phase 0: Independent Predictions ──")

    for agent in AGENTS:
        window = agent["context_window"]
        # Context = the window of utterances BEFORE the target
        context = utterances[max(0, target_idx - window): target_idx+1]
        prompt = build_phase0_prompt(agent, context, next_utterance)
        result = safe_call(prompt)
        state = {
            "agent":          agent["name"],
            "context_window": agent["context_window"],
            "label":          result.get("CAD-code", "Other"),
            "rationale":      result.get("rationale", ""),
            "history":        [result.get("CAD-code", "Other")],  # track label across rounds
        }
        agent_states.append(state)

        if verbose:
            print(f"    [{agent['name']:20s}] → {state['label']:<8} | {state['rationale']}")

    return agent_states

### PHASE 1 — Discussion

In [22]:
def build_critique_prompt(
    agent: dict,
    context_utterances: list,
    next_utterance: str,
    my_current_state: dict,
    other_views: str,
    round_num: int,
) -> str:
    valid_labels = list(CAD_CODEBOOK_DICT.keys())
    return f"""
{agent['persona']}

You are in Round {round_num} of a structured debate about the correct CAD label
for the NEXT teacher utterance given the conversation history.

{build_codebook_str()}

**Conversation History (last {len(context_utterances + 1)} utterance(s)):**
{format_context(context_utterances)}


**Your current position:**
  Label    : {my_current_state['label']}
  Rationale: {my_current_state['rationale']}

**Other agents' current positions:**
{other_views}

Your job is to CRITICALLY ENGAGE with the other agents' reasoning:
1. Identify any reasoning you DISAGREE with and explain WHY it is flawed or insufficient.
2. Identify any reasoning you find COMPELLING and whether it changes your view.
3. IMPORTANT: Do NOT change your position simply because others disagree — only update if
   their evidence is genuinely stronger than yours. Resist social pressure.

When done reasoning, wrap your final answer in <answer> tags containing ONLY a JSON object:
<answer>
{{
  "CAD-code": "one of: {', '.join(valid_labels)}",
  "rationale": "your updated reasoning in 1-2 sentences",
  "changed_mind": true or false,
  "critique": "1-2 sentences identifying flaws or strengths in others reasoning"
}}
</answer>
""".strip()


def run_discussion_round(
    utterances: list,
    target_idx: int,
    agent_states: list,
    round_num: int,
    verbose: bool,
) -> list:
    """
    One full discussion round: each agent reads all others' views, critiques, and updates.
    All agents receive a simultaneous snapshot of states (no sequential bias).
    """
    next_utterance = utterances[target_idx]
    updated_states = []

    if verbose:
        print(f"\n  ── Discussion Round {round_num} ──")

    # Snapshot all current states before any agent updates (prevents sequential bias)
    snapshots = list(agent_states)

    for i, agent_def in enumerate(AGENTS):
        my_state = snapshots[i]
        other_views = format_other_agents_views(snapshots, exclude_name=agent_def["name"])

        window = agent_def["context_window"]
        context = utterances[max(0, target_idx - window): target_idx]

        prompt = build_critique_prompt(
            agent_def, context, next_utterance, my_state, other_views, round_num
        )
        result = safe_call(prompt)

        new_label     = result.get("CAD-code", my_state["label"])
        new_rationale = result.get("rationale", my_state["rationale"])
        changed_mind  = result.get("changed_mind", False)
        critique      = result.get("critique", "")

        updated_state = {
            "agent":          my_state["agent"],
            "context_window": my_state["context_window"],
            "label":          new_label,
            "rationale":      new_rationale,
            "history":        my_state["history"] + [new_label],
            "changed_mind":   changed_mind,
            "critique":       critique,
        }
        updated_states.append(updated_state)

        if verbose:
            change_marker = "↺ CHANGED" if changed_mind else "  stable "
            print(f"    [{agent_def['name']:20s}] {change_marker} → {new_label:<8} | {new_rationale}")
            if critique:
                print(f"    {'':20s}   Critique: {critique}")

    return updated_states

### PHASE 2 - FINAL VOTE — WEIGHTED MAJORITY

In [23]:
import re
from collections import defaultdict
def weighted_majority_vote(agent_states: list, verbose: bool) -> tuple:
    """
    Agents that never changed their label across all rounds get CONSISTENT_AGENT_WEIGHT.
    All others get DEFAULT_AGENT_WEIGHT.
    Returns (final_label, vote_detail_dict).
    """
    weighted_votes = defaultdict(float)
    vote_details = []

    for state in agent_states:
        # An agent is "consistent" if its label never changed across all rounds
        all_same = len(set(state["history"])) == 1
        weight = CONSISTENT_AGENT_WEIGHT if all_same else DEFAULT_AGENT_WEIGHT
        label  = state["label"]

        weighted_votes[label] += weight
        vote_details.append({
            "agent":      state["agent"],
            "final_label": label,
            "weight":     weight,
            "consistent": all_same,
            "history":    state["history"],
        })

    # Winner is label with highest weighted score
    final_label = max(weighted_votes, key=weighted_votes.__getitem__)
    weighted_votes_dict = dict(weighted_votes)

    if verbose:
        print("\n  ── Weighted Vote ──")
        for v in vote_details:
            flag = "★ consistent" if v["consistent"] else "  changed   "
            print(f"    [{v['agent']:20s}] {flag} → {v['final_label']:<8} (weight={v['weight']})")
        print(f"    Weighted totals: {weighted_votes_dict}")

    return final_label, weighted_votes_dict, vote_details

## OUTPUT HELPERS

In [28]:
def print_summary(results: list):
    """Print a clean summary table of all next-utterance predictions."""
    print(f"\n{'='*75}")
    print("  DEBATE LABELING SUMMARY  (next-utterance predictions)")
    print(f"{'='*75}")
    print(f"  {'Idx':<5} {'Label':<8} {'Tie?':<6} {'Weighted Votes':<30} Utterance")
    print(f"  {'-'*70}")
    for r in results:
        tie     = "yes" if r["tie_broken"] else "no"
        votes   = str(r["weighted_votes"])
        preview = r["target_utterance"][:32] + ("..." if len(r["target_utterance"]) > 32 else "")
        print(f"  {r['target_index']:<5} {r['final_label']:<8} {tie:<6} {votes:<30} \"{preview}\"")
    print(f"{'='*75}\n")


def save_results(results: list, output_path: str = "debate_results.json"):
    """Save full debate results to JSON (includes per-agent histories and critiques)."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"Full results saved to: {output_path}")


## test

In [108]:
utterances = df['transcript'].to_list()
target_idx = 12
verbose = True

# Phase 0
# agent_states = phase0_independent_predictions(utterances, target_idx, verbose)

# phase 1 - discussion rounds (2 round)
# agent_states1 = run_discussion_round(
#             utterances, target_idx, agent_states, 1, verbose
        # )

# agent_states2 = run_discussion_round(
#             utterances, target_idx, agent_states1, 2, verbose
#         )
# phase 2 - final vote
final_label, weighted_votes, vote_details = weighted_majority_vote(agent_states2, verbose)


  ── Weighted Vote ──
    [LinguistAgent       ]   changed    → WCT      (weight=1.0)
    [PedagogyAgent       ]   changed    → WCT      (weight=1.0)
    [ContextAgent        ]   changed    → Other    (weight=1.0)
    [SkepticalAgent      ]   changed    → Other    (weight=1.0)
    [HolisticAgent       ]   changed    → WCT      (weight=1.0)
    Weighted totals: {'WCT': 3.0, 'Other': 2.0}


## main

In [26]:

def process_utterance(
    utterances: list,
    target_idx: int,
    num_rounds: int = NUM_DISCUSSION_ROUNDS,
    verbose: bool = True,
) -> dict:
    """
    Full debate pipeline for a single next-utterance prediction.

    Context : utterances[0 .. target_idx-1]  (each agent uses its own window slice)
    Target  : utterances[target_idx]          (the utterance whose label is predicted)

    Phases:
      0 → Independent predictions
      1..N → Structured critique + revision rounds
      Final → Weighted majority vote (+ judge tiebreak if needed)
    """
    next_utterance = utterances[target_idx]

    # Phase 0
    agent_states = phase0_independent_predictions(utterances, target_idx, verbose)

    # Discussion Rounds
    for round_num in range(1, num_rounds + 1):
        agent_states = run_discussion_round(
            utterances, target_idx, agent_states, round_num, verbose
        )

    # Weighted Vote
    final_label, weighted_votes, vote_details = weighted_majority_vote(agent_states, verbose)

    # Tie-break
    max_score  = max(weighted_votes.values())
    top_labels = [l for l, s in weighted_votes.items() if s == max_score]
    tie_broken = False

    if len(top_labels) > 1:
        # final_label = judge_tiebreak(next_utterance, agent_states, verbose)
        top_labels.sort()
        final_label = top_labels[0]
        tie_broken  = True

    return {
        "context_ends_at_index": target_idx - 1,   # last utterance used as context
        "target_index":          target_idx,        # utterance being predicted
        "target_utterance":      next_utterance,
        "agent_states":          agent_states,
        "vote_details":          vote_details,
        "weighted_votes":        weighted_votes,
        "tie_broken":            tie_broken,
        "final_label":           final_label,
    }


In [27]:
def process_transcript(
    utterances: list,
    num_rounds: int = NUM_DISCUSSION_ROUNDS,
    verbose: bool = True,
) -> list:
    """
    Predict the CAD label of every utterance (except the first) using prior context.

    For each target_idx in [1 .. len(utterances)-1]:
      - Context : utterances[max(0, target_idx-window) .. target_idx-1]
      - Target  : utterances[target_idx]

    The first utterance (index 0) has no prior context so it is skipped as a target;
    it is still used as context when predicting index 1.

    Args:
        utterances : Chronologically ordered list of teacher utterance strings.
        num_rounds : Number of critique/revision rounds (default=2, per literature).
        verbose    : Stream per-agent decisions to stdout.

    Returns:
        List of result dicts, one per predicted utterance (indices 1..N-1).
    """
    if len(utterances) < 2:
        raise ValueError("Transcript must contain at least 2 utterances for next-utterance prediction.")

    results = []

    print(f"\n{'='*65}")
    print(f"  Multi-Agent Debate Framework — CAD Labeling")
    print(f"  Transcript length   : {len(utterances)} utterances")
    print(f"  Predictions made    : {len(utterances) - 1} (utterances 2..{len(utterances)})")
    print(f"  Discussion rounds   : {num_rounds}")
    print(f"  Agents              : {[a['name'] for a in AGENTS]}")
    print(f"{'='*65}")

    # target_idx starts at 1: utterance[0] is context-only, never a prediction target
    for target_idx in range(1, len(utterances)):
        if verbose:
            print(f"\n{'─'*65}")
            print(f"  Predicting utterance {target_idx}/{len(utterances)-1}: \"{utterances[target_idx]}\"")
            print(f"  (context anchor: utterance {target_idx-1} = \"{utterances[target_idx-1]}\")")
            print(f"{'─'*65}")

        result = process_utterance(utterances, target_idx, num_rounds=num_rounds, verbose=verbose)
        results.append(result)

        if verbose:
            tie_note = " [tie-broken by judge]" if result["tie_broken"] else ""
            print(f"\n  ✓ FINAL LABEL: {result['final_label']}{tie_note}")
            print(f"    Weighted votes: {result['weighted_votes']}")

    return results

In [42]:
trans_list = df['transcript'].to_list()[15:20]
trans_list


['You guys see your group number here? You guys are group three?',
 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.',
 'Press next. Do I have to...',
 "Okay, so now you're going to see three columns. So you're going to see three columns where you're going to be placing your cards. So in the three columns, one column says it'll be sometimes true, meaning you can find one solution where X might just equal one number. Okay, always true, meaning there's infinite solutions. And lastly, no solutions, meaning that there will, that's when you solve it, it's impossible to find a number for your equations. So you're going to be dragging and dropping. So what's happening now is you're going to be talking about which one should you drag. Okay, make sure you talk about it before you actually place something. Okay.",
 'So get closer, talk here, get closer. Mm-hmm. Mm-hmm. Mm-hmm.']

In [51]:
results = process_transcript(utterances = trans_list, num_rounds=2, verbose=True)




  Multi-Agent Debate Framework — CAD Labeling
  Transcript length   : 5 utterances
  Predictions made    : 4 (utterances 2..5)
  Discussion rounds   : 2
  Agents              : ['LinguistAgent', 'PedagogyAgent', 'ContextAgent', 'SkepticalAgent', 'HolisticAgent']

─────────────────────────────────────────────────────────────────
  Predicting utterance 1/4: "All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes."
  (context anchor: utterance 0 = "You guys see your group number here? You guys are group three?")
─────────────────────────────────────────────────────────────────

  ── Phase 0: Independent Predictions ──


config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


    [LinguistAgent       ] → WCT      | The next utterance begins with 'All right, looks like we got everybody.' This suggests the teacher is addressing the entire class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ] → Other    | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ] → WCT      | The next utterance is 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.' This indicates that the teacher is addressing the whole class because the utterance uses 'All right, looks like we got everybody.' The word 'All right' is a common response to indicate attention to the whole class. Additionally, the use of 'group' in the prior utterance suggests that the teacher is addressing the whole class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ] → Other    | The next utterance is a command to the user, not a teacher's instruction. The prior utterances are teacher's instructions. Therefore, the CAD code is Other.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ] → WCT      | The next utterance is 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.' This indicates that the teacher is addressing the whole class because the utterance uses 'All right' and 'looked like we got everybody' which are typical classroom cues for the whole class. The prior conversation history shows that the user is asking the whole class to proceed, which is a common practice in educational settings. Therefore, the CAD code is WCT.

  ── Discussion Round 1 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ]   stable  → WCT      | The next utterance begins with 'All right, looks like we got everybody.' This is a common classroom response to indicate that the teacher is addressing the whole class.
                           Critique: The prior conversation history shows that the user is asking the whole class to proceed, which is a common practice in educational settings. Therefore, the CAD code should be WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → WCT      | The next utterance begins with 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.' This indicates that the teacher is addressing the whole class because the utterance uses 'All right' and 'looked like we got everybody.' The word 'All right' is a common response to indicate attention to the whole class. Additionally, the use of 'group' in the prior utterance suggests that the teacher is addressing the whole class.
                           Critique: The reasoning is based on the use of 'All right' and 'group' in the prior utterance, which are typical classroom cues for the whole class. The prior utterance shows that the user is asking the whole class to proceed, which is a common practice in educational settings. Therefore, the CAD code is WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ]   stable  → WCT      | The next utterance begins with 'All right, looks like we got everybody.' This suggests the teacher is addressing the entire class.
                           Critique: The prior utterances in the conversation history are teacher's instructions. The next utterance is a command to the user, not a teacher's instruction. Therefore, the CAD code is WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → WCT      | The next utterance starts with 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.' This indicates that the teacher is addressing the whole class because the utterance uses 'All right' and 'looked like we got everybody.' The word 'All right' is a common response to indicate attention to the whole class. Additionally, the use of 'group' in the prior utterance suggests that the teacher is addressing the whole class. Therefore, the CAD code is WCT.
                           Critique: The prior conversation history shows that the user is asking the whole class to proceed, which is a common practice in educational settings. The next utterance uses 'All right' and 'looked like we got everybody' to indicate that the teacher is addressing the whole class. The prior utterances also mention 'group' and 'next' which are consistent with the whole class context. Therefore, the reasonin

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ] ↺ CHANGED → WCT      | The next utterance begins with 'All right, looks like we got everybody.' This is a common classroom response to indicate that the teacher is addressing the whole class.
                           Critique: The use of 'All right' is a standard response in classrooms to indicate attention to the whole class. Additionally, the word 'group' in the prior utterance suggests that the teacher is addressing the whole class. Therefore, the CAD code is WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → WCT      | The next utterance begins with 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.' This indicates that the teacher is addressing the whole class because the utterance uses 'All right' and 'looked like we got everybody.' The word 'All right' is a common response to indicate attention to the whole class. Additionally, the use of 'group' in the prior utterance suggests that the teacher is addressing the whole class. Therefore, the CAD code is WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ] ↺ CHANGED → WCT      | The next utterance begins with 'All right, looks like we got everybody.' This suggests the teacher is addressing the entire class.
                           Critique: The use of 'All right' is a common classroom response to indicate attention to the whole class. Additionally, the word 'group' in the prior utterance suggests that the teacher is addressing the whole class. Therefore, the CAD code is WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → WCT      | The next utterance starts with 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.' This indicates that the teacher is addressing the whole class because the utterance uses 'All right' and 'looked like we got everybody.' The word 'All right' is a common response to indicate attention to the whole class. Additionally, the use of 'group' in the prior utterance suggests that the teacher is addressing the whole class. Therefore, the CAD code is WCT.
                           Critique: The reasoning is correct because the use of 'All right' and 'group' in the prior utterance supports the idea that the teacher is addressing the whole class. The other agents agree that the next utterance starts with 'All right' and 'looked like we got everybody' which is a common response. Therefore, the CAD code is WCT.

  ── Weighted Vote ──
    [LinguistAgent       ] ★ consistent → WCT      (w

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ] → WCT      | The next utterance continues the teacher's direct address to the whole class, indicated by the prior conversation history showing the teacher addressing the whole class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ] → GT       | The teacher is addressing a group or student in a group, but the conversation history shows that the teacher is addressing the whole class. The next utterance is about 'Press next. Do I have to...', which seems to be a follow-up instruction related to the group's action. However, the teacher is addressing a group, which is a subset of the whole class, so the CAD code should be GT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ] → WCT      | The next utterance is addressing the whole class by using 'everybody' and 'group' in the conversation history. The teacher is addressing the whole class, so the CAD code is WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ] → Other    | The teacher is addressing a specific group or student, but not the whole class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ] → Other    | The teacher is addressing a group or student in a group, but the next utterance is 'Press next. Do I have to...'. This suggests the teacher is addressing a specific group, but the action taken is 'Do I have to...'. This could imply the teacher is silent or addressing herself.

  ── Discussion Round 1 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → WCT      | The next utterance continues the teacher's direct address to the whole class, indicated by the prior conversation history showing the teacher addressing the whole class.
                           Critique: The reasoning is correct because the teacher is addressing the whole class, indicated by the prior conversation history showing the teacher addressing the whole class, and the next utterance continues this direct address to the whole class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → GT       | The next utterance is addressing a group or student in a group, indicated by the prior conversation history showing the teacher addressing a group. The teacher is addressing a group or student in a group, so the CAD code is GT.
                           Critique: The prior reasoning identified the teacher addressing a group, which is a subset of the whole class. The conversation history shows the teacher addressing the whole class, so the next utterance should be GT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ]   stable  → Other    | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → WCT      | The teacher is addressing a group or student in a group, as indicated by the conversation history showing the teacher addressing the whole class. The next utterance is 'Press next. Do I have to...', which suggests the teacher is addressing a specific group or student within the context of the whole class. Therefore, the CAD code should be WCT, as the teacher is addressing a group or student in a group, which aligns with the definition of WCT in the CAD codebook.
                           Critique: The reasoning here is that the teacher is addressing a group or student in a group, which is consistent with the definition of WCT in the CAD codebook. The next utterance does not contradict this, as it is about the teacher addressing a specific group or student within the context of the whole class. Therefore, the CAD code should remain WCT.

  ── Discussion Round 2 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ] ↺ CHANGED → WCT      | The next utterance is 'Press next. Do I have to...', which suggests the teacher is addressing a group or student in a group. The prior conversation history shows the teacher addressing the whole class, so the next utterance is addressing a group or student in a group, which aligns with the definition of WCT in the CAD codebook. Therefore, the CAD code should be WCT.
                           Critique: The reasoning here is that the teacher is addressing a group or student in a group, which is indicated by the prior conversation history showing the teacher addressing the whole class. The next utterance is addressing a group or student in a group, so the CAD code is WCT. This seems correct because the teacher is addressing a group, and the next utterance is about a specific group or student within that context. Therefore, the CAD code should be WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ]   stable  → WCT      | The next utterance is 'Press next. Do I have to...', which suggests the teacher is addressing a group or student in a group. The prior conversation history shows the teacher addressing the whole class. Therefore, the teacher is addressing a group or student in a group, which aligns with the definition of WCT in the CAD codebook. Therefore, the correct label should be WCT.
                           Critique: The reasoning here is that the teacher is addressing a group or student in a group, as indicated by the conversation history showing the teacher addressing the whole class. The next utterance is 'Press next. Do I have to...', which suggests the teacher is addressing a group or student within the context of the whole class. Therefore, the CAD code should be WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → WCT      | [Recovered from truncated output]

  ── Weighted Vote ──
    [LinguistAgent       ] ★ consistent → WCT      (weight=1.5)
    [PedagogyAgent       ]   changed    → WCT      (weight=1.0)
    [ContextAgent        ]   changed    → WCT      (weight=1.0)
    [SkepticalAgent      ]   changed    → WCT      (weight=1.0)
    [HolisticAgent       ]   changed    → WCT      (weight=1.0)
    Weighted totals: {'WCT': 5.5}

  ✓ FINAL LABEL: WCT
    Weighted votes: {'WCT': 5.5}

─────────────────────────────────────────────────────────────────
  Predicting utterance 3/4: "Okay, so now you're going to see three columns. So you're going to see three columns where you're going to be placing your cards. So in the three columns, one column says it'll be sometimes true, meaning you can find one solution where X might just equal one number. Okay, always true, meaning there's infinite solutions. And lastly, no solutions, meaning that there will, that's when you

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ] → WCT      | The next utterance discusses the class, indicating the teacher is addressing the whole class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ] → GT       | The next utterance is addressing a group or student in a group, as indicated by the conversation history showing the teacher is addressing the whole class. This aligns with the GT label, which addresses a group or a student in a group.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ] → Other    | The next utterance discusses game mechanics, not the teacher addressing the whole class or groups.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ] → Other    | The next utterance discusses solving equations and card placement without addressing the whole class or a specific group. It focuses on individual interactions and problem-solving steps, which doesn't fit WCT or GT categories.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ] → GT       | The next utterance discusses solving equations and manipulating elements, indicating the teacher is addressing a group or students.

  ── Discussion Round 1 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → Other    | The next utterance is discussing game mechanics, not the teacher addressing the whole class or any groups or students. Therefore, the correct CAD code is Other.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ]   stable  → Other    | The next utterance is not addressing the whole class or any groups or students, so it falls under Other.
                           Critique: The next utterance discusses solving equations and card placement, which doesn't fit WCT or GT categories. It's more about individual interactions and problem-solving steps, which doesn't align with the teacher addressing a group or students.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → Other    | The next utterance discusses game mechanics, not the teacher addressing the whole class or groups. Therefore, the reasoning is correct but insufficient to conclude.
                           Critique: The reasoning is correct in identifying the issue, but it does not address the problem. The next utterance is discussing game mechanics, which is unrelated to the teacher addressing the whole class or groups. Therefore, the reasoning is correct but insufficient to conclude.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [safe_call] All 1 attempts failed. Using fallback='None'. See raw_output_debug.log
    [SkepticalAgent      ]   stable  → None     | [Fallback after 1 failed attempts: All parsing strategies failed. Raw output (first 400 chars):
}

<answer>
{
  "CAD-code": "one of: WCT, GT, Other",
  "rationale": "I disagree with the ContextAgent's rationale. The ContextAgent believes the next utterance is about game mechanics, but the actual next utterance is about solving equations and card placement. The ContextAgent's rationale is flawed because it doesn't account for the teacher's focus on the class and groups. The actual next uttera]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → Other    | The next utterance discusses solving equations and manipulating elements, indicating the teacher isn't talking to the whole class or any groups or students.
                           Critique: The reasoning aligns with Other because the teacher isn't talking to the whole class or any groups or students. The conversation history shows the teacher is addressing the whole class, which is WCT. The next utterance is about solving equations, which is a group activity. So the reasoning doesn't fit any of the categories.

  ── Discussion Round 2 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → Other    | The next utterance is discussing game mechanics, which is unrelated to the teacher addressing the whole class or any groups or students. Therefore, the correct CAD code is Other.
                           Critique: The reasoning is correct, but it doesn't address the teacher's role in the conversation history. The next utterance is discussing game mechanics, which is unrelated to the teacher addressing the whole class or any groups or students. Therefore, the correct CAD code is Other.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ]   stable  → Other    | The next utterance is not addressing the whole class or any groups or students, so it falls under Other.
                           Critique: The reasoning is correct, but it doesn't address the teacher's intent to talk about the whole class or any groups or students.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → Other    | The next utterance discusses game mechanics, not the teacher addressing the whole class or any groups or students. Therefore, the correct CAD code is Other.
                           Critique: The reasoning is correct but insufficient because the next utterance is discussing game mechanics, which is a different context. Therefore, the correct code is Other, but the reasoning is not strong enough.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ]   stable  → Other    | The next utterance is about solving equations and card placement, not about the teacher addressing the whole class or any groups or students. Therefore, the correct CAD code is Other.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → Other    | The next utterance discusses solving equations and manipulating elements, indicating the teacher isn't talking to the whole class or any groups or students.
                           Critique: The reasoning is correct but insufficient because the next utterance is discussing game mechanics, not the teacher addressing the whole class or any groups or students. Therefore, the correct CAD code is Other.

  ── Weighted Vote ──
    [LinguistAgent       ]   changed    → Other    (weight=1.0)
    [PedagogyAgent       ]   changed    → Other    (weight=1.0)
    [ContextAgent        ] ★ consistent → Other    (weight=1.5)
    [SkepticalAgent      ]   changed    → Other    (weight=1.0)
    [HolisticAgent       ]   changed    → Other    (weight=1.0)
    Weighted totals: {'Other': 5.5}

  ✓ FINAL LABEL: Other
    Weighted votes: {'Other': 5.5}

─────────────────────────────────────────────────────────────────
  Predicting utterance 4/4: "So get clos

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ] → WCT      | The next utterance is part of the conversation history where the teacher is instructing the class to get closer and talk. The prior utterance was about the class moving closer, which is a common setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ] → GT       | The teacher is addressing a specific group or student, and the next utterance is about moving closer, which suggests a collaborative effort or discussion. This aligns with the GT label because the teacher is addressing a group or a student in a group, and the talk is at the student level.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ] → WCT      | The next utterance is a continuation of the class discussion about the columns, and the teacher is using 'So' to indicate the whole class. The prior conversation history shows that the teacher is addressing the whole class by using 'You guys see your group number here?' and 'Okay, so now you're going to see three columns.' The next utterance is a continuation of the class discussion about the columns, so it's clear that the teacher is addressing the whole class.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ] → Other    | The next utterance is discussing proximity and talking about it, which doesn't fit the WCT or GT categories. It's more about individual interactions or a specific point in a class, but not about the whole class or any groups.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ] → GT       | The next utterance is about group interactions, indicating the teacher is addressing a group or a student within a group, which fits the GT label.

  ── Discussion Round 1 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ]   stable  → WCT      | The next utterance is part of the conversation history where the teacher is instructing the class to get closer and talk. The prior utterance was about the class moving closer, which is a common setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.
                           Critique: The reasoning is based on the prior conversation history showing the teacher addressing the whole class, and the next utterance continues this instruction, reinforcing the WCT label. The reasoning is sound and the CAD code should be WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ]   stable  → Other    | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → GT       | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → Other    | [Recovered from truncated output]

  ── Discussion Round 2 ──


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [LinguistAgent       ] ↺ CHANGED → WCT      | The next utterance is part of the conversation history where the teacher is instructing the class to get closer and talk. The prior utterance was about the class moving closer, which is a common setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.
                           Critique: The reasoning here is that the next utterance is part of the instruction given by the teacher, reinforcing the teacher's instruction to the class. Therefore, the teacher is addressing the whole class, which fits the WCT category. The other agents' reasoning may not align because they are focused on different aspects, but the CAD code is determined based on the teacher's instruction and the class's movement.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [PedagogyAgent       ]   stable  → WCT      | [Recovered from truncated output]


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [ContextAgent        ]   stable  → WCT      | The next utterance is part of the conversation history where the teacher is instructing the class to get closer and talk. The prior utterance was about the class moving closer, which is a common setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.
                           Critique: The reasoning here is that the next utterance is part of the conversation history where the teacher is instructing the class to get closer and talk. The prior utterance was about the class moving closer, which is a common setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [SkepticalAgent      ] ↺ CHANGED → one of: WCT, GT, Other | I disagree with the previous reasoning that the next utterance is WCT. I believe it should be GT.
                           Critique: The prior reasoning suggests that the next utterance is GT because the teacher is addressing the whole class. However, the next utterance is only talking about getting closer and talking here, which is part of the class discussion but not the whole class. Therefore, the teacher is addressing a group or students in a group, making it WCT. But wait, the teacher is instructing the class to get closer and talk, which is part of the class discussion. So the teacher is addressing the whole class. Therefore, it should be WCT. But the prior reasoning said it's GT. So I'm confused. Maybe I need to reevaluate.


Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

    [HolisticAgent       ]   stable  → WCT      | The next utterance is about getting closer, which is a common setup for a class discussion. The prior utterance was about the teacher instructing the class to move closer, which is a typical setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.
                           Critique: The reasoning here is that the next utterance is part of the conversation history where the teacher is instructing the class to get closer. The prior utterance was about the teacher instructing the class to move closer, which is a common setup for a class discussion. The next utterance continues this instruction, reinforcing that the teacher is addressing the whole class. Therefore, the CAD code should be WCT.

  ── Weighted Vote ──
    [LinguistAgent       ] ★ consistent → WCT      (weight=1.5)
    [PedagogyAgent       ]   changed    → WCT

In [52]:
results

[{'context_ends_at_index': 0,
  'target_index': 1,
  'target_utterance': 'All right, looks like we got everybody. All right, now do I? Press next. Press next. So it says proceed. Yes. Yes.',
  'agent_states': [{'agent': 'LinguistAgent',
    'context_window': 1,
    'label': 'WCT',
    'rationale': '[Recovered from truncated output]',
    'history': ['WCT', 'WCT', 'WCT'],
    'changed_mind': False,
    'critique': ''},
   {'agent': 'PedagogyAgent',
    'context_window': 4,
    'label': 'WCT',
    'rationale': "The next utterance begins with 'All right, looks like we got everybody.' This is a common classroom response to indicate that the teacher is addressing the whole class.",
    'history': ['Other', 'WCT', 'WCT'],
    'changed_mind': True,
    'critique': "The use of 'All right' is a standard response in classrooms to indicate attention to the whole class. Additionally, the word 'group' in the prior utterance suggests that the teacher is addressing the whole class. Therefore, the CAD

In [53]:
print_summary(results)
save_results(results, "debate_results.json")


  DEBATE LABELING SUMMARY  (next-utterance predictions)
  Idx   Label    Tie?   Weighted Votes                 Utterance
  ----------------------------------------------------------------------
  1     WCT      no     {'WCT': 6.5}                   "All right, looks like we got eve..."
  2     WCT      no     {'WCT': 5.5}                   "Press next. Do I have to..."
  3     Other    no     {'Other': 5.5}                 "Okay, so now you're going to see..."
  4     WCT      no     {'WCT': 4.5, 'one of: WCT, GT, Other': 1.0} "So get closer, talk here, get cl..."

Full results saved to: debate_results.json
