# What Happens with Many Speakers?

So far, we've tested with 2-3 speakers. But what happens when we scale to **10, 15, or 20 speakers**? How do conversation dynamics change? Do metrics break down?

This post explores:
1. **Scaling conversations** to 10+ speakers
2. **Fixing conclusiveness calculation**: The persistent 0/0 -> 5 issue
3. **Analyzing group dynamics** using ConvoKit
4. **Visualizing speaker networks** to understand interactions

## The Conclusiveness Problem

The conclusiveness score often returns 5.0 (neutral) because agreement/disagreement markers aren't being detected. This happens when:
- Markers are too simple (just keyword matching)
- Coordination features aren't being used effectively
- Edge cases aren't handled properly

We'll improve the detection and use Coordination features more effectively.


## Setup and Imports


In [None]:
import os
import re
from typing import List, Dict
from collections import defaultdict
from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm
from random import shuffle, choice, random
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx

# ConvoKit imports
from convokit import Corpus, Utterance, Speaker
from convokit.text_processing import TextParser
from convokit.politeness_strategies import PolitenessStrategies
from convokit.coordination import Coordination

load_dotenv("../../.env")
client = OpenAI()

TOPIC = "Code, testing, and infra as a source of truth versus comprehensive documentation."


## Scalable Conversation Runner

We'll create a conversation runner that can handle many speakers efficiently.


In [None]:
def run_scaled_conversation(
    iterations: int,
    participant_count: int,
) -> List[Dict]:
    """Run a conversation with many participants."""
    conversation_history = []
    ordering = list(range(1, participant_count + 1))
    last_speaker = -1
    identity_summaries = {}
    
    # Bootstrap identities
    for pid in tqdm(ordering, desc="Bootstrapping identities"):
        speaker_id = f"speaker_{pid}"
        bootstrap_messages = [
            {
                "role": "system",
                "content": (
                    f"You are {speaker_id} in a group conversation among "
                    "experienced software engineers. "
                    "You do not know who the others are yet. "
                    "Imagine your own background, priorities, and communication "
                    "style. First, in 2-3 sentences, describe who you are and "
                    "what you care about as an engineer. Then start sharing your "
                    "perspective on the topic below."
                ),
            },
            {"role": "user", "content": f"The topic is: {TOPIC}"},
        ]
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=bootstrap_messages,
            store=False,
        )
        first_message = response.choices[0].message.content
        identity_summaries[speaker_id] = first_message
        conversation_history.append(
            {"role": "assistant", "name": speaker_id, "content": first_message}
        )
    
    def build_message(history: List[Dict], speaker_id: str, window_size: int) -> List[Dict]:
        speaker_messages = [msg for msg in history if msg.get("name") == speaker_id][-window_size:]
        other_messages = [
            msg for msg in history
            if msg.get("name") not in (None, speaker_id)
        ][-window_size:]
        
        transcript = []
        persona_reminder = identity_summaries.get(speaker_id, "")
        if persona_reminder:
            transcript.append("Here is a brief reminder of how you have been speaking so far:")
            transcript.append(f"- {persona_reminder}")
        
        if speaker_messages:
            transcript.append("\nRecent messages from you:")
            transcript.extend(f"- {msg['content']}" for msg in speaker_messages)
        
        if other_messages:
            transcript.append("\nRecent messages from others:")
            transcript.extend(
                f"- {msg.get('name', msg['role'])}: {msg['content']}"
                for msg in other_messages
            )
        
        transcript_str = "\n".join(transcript)
        
        return history + [
            {
                "role": "user",
                "content": (
                    f"{speaker_id}, continue the conversation and respond to the "
                    "others. Stay consistent with how you have been speaking so "
                    "far, and look for ways to add something new that has not "
                    "yet been covered."
                ),
            },
            {
                "role": "assistant",
                "name": speaker_id,
                "content": f"I should remember that the following is the most current state of the conversation.\n{transcript_str}\n\n",
            },
        ]
    
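    def shuffle_order(order: List[int]) -> List[int]:
        # Draw the opener from everyone except the previous round's final slot,
        # so the same speaker is not scheduled back to back across rounds.
        # Requires at least two participants (order[:-1] must be non-empty).
        first = choice(order[:-1])
        remaining = [p for p in order if p != first]
        shuffle(remaining)
        return [first] + remaining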
    
    for i in tqdm(range(iterations), desc=f"Running {participant_count}-speaker conversation"):
        if i > 0:
            ordering = shuffle_order(ordering)
        
        for pid in ordering:
            if random() < 0.3 or last_speaker == pid:
                continue
            
            speaker_id = f"speaker_{pid}"
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=build_message(conversation_history, speaker_id, 5),
                store=False,
            )
            message = response.choices[0].message.content
            conversation_history.append(
                {"role": "assistant", "name": speaker_id, "content": message}
            )
            last_speaker = pid
    
    return conversation_history
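To see what the `shuffle_order` helper does between rounds, here is a minimal standalone sketch. It copies the helper's logic out of the runner but passes an explicit random generator so the result is repeatable. The `demo_rounds` function and the fixed seed are assumptions added for illustration and are not part of the runner above.

```python
from random import Random
from typing import List


def shuffle_order(order: List[int], rng: Random) -> List[int]:
    """Same logic as the runner's helper, with an explicit RNG for repeatability."""
    first = rng.choice(order[:-1])  # anyone except the previous final slot
    remaining = [p for p in order if p != first]
    rng.shuffle(remaining)
    return [first] + remaining


def demo_rounds(participant_count: int, rounds: int, seed: int = 0) -> List[List[int]]:
    """Return the speaking order used in each round (hypothetical helper)."""
    rng = Random(seed)
    order = list(range(1, participant_count + 1))
    orders = [order]
    for _ in range(rounds - 1):
        order = shuffle_order(order, rng)
        orders.append(order)
    return orders


orders = demo_rounds(participant_count=5, rounds=4)
for previous, current in zip(orders, orders[1:]):
    # The opener of a round never held the last slot of the previous round.
    assert current[0] != previous[-1]
    # Every round is a permutation of the same participants.
    assert sorted(current) == sorted(previous)
```

The helper needs at least two participants, since `order[:-1]` is empty for a single speaker. The runner's separate `last_speaker` check covers the case where the final slot was skipped and someone else actually spoke last.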


## Improved Conclusiveness Calculation

The issue is that simple keyword matching often finds 0 agreement and 0 disagreement markers, leading to 0/0 -> 5.0. We'll improve this by:
1. Using Coordination features more effectively
2. Expanding marker detection
3. Adding debugging to see what's happening
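Two details decide whether a marker is counted at all: word boundaries and case. The sketch below is a small standalone illustration using only the standard library; the `count_marker` helper and the sample sentence are assumptions made up for this example.

```python
import re


def count_marker(marker: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of marker in text."""
    pattern = r"\b" + re.escape(marker.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


text = "I know nothing is certain, but no, I think you're right."

# Naive substring counting also hits "know" and "nothing".
substring_hits = text.lower().count("no")

# Word-boundary matching only counts the standalone "no".
whole_word_hits = count_marker("no", text)

print(substring_hits, whole_word_hits)
print(count_marker("I think", text), count_marker("you're right", text))
```

Lowercasing the marker matters as much as the boundaries: a marker written as `"I think"` can never match text that has already been lowercased unless the marker is lowercased too.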


In [None]:
def conversation_to_corpus(conversation_history: List[Dict]) -> Corpus:
    """Convert conversation history to a ConvoKit Corpus."""
    utterances = []
    
    for idx, msg in enumerate(conversation_history):
        if msg.get("role") == "assistant" and "name" in msg:
            speaker_id = msg["name"]
            text = msg["content"]
            
            utterance = Utterance(
                id=f"utt_{idx}",
                speaker=Speaker(id=speaker_id),
                text=text
            )
            utterance.meta["timestamp"] = idx
            utterances.append(utterance)
    
    return Corpus(utterances=utterances)

def compute_conclusiveness_score_improved(corpus: Corpus, debug: bool = False) -> float:
    """
    Improved conclusiveness score with better marker detection.
    
    The issue: Simple keyword matching often finds 0 markers, leading to 0/0 -> 5.0.
    Solution: Use Coordination features + expanded markers + better detection.
    """
    try:
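        # Fit Coordination on the corpus. Note that the marker counting below
        # reads only utterance text; no coordination scores feed into the result.
        coord = Coordination()
        coord_corpus = coord.fit_transform(corpus)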
        
        # Expanded agreement/disagreement markers
        agreement_markers = [
            "agree", "agreed", "agrees", "agreement",
            "yes", "yeah", "yep", "yup", "correct", "right",
            "exactly", "precisely", "indeed", "absolutely", "definitely",
            "true", "that's true", "you're right", "I think so",
            "same", "similar", "likewise", "me too", "I also",
            "support", "endorse", "approve", "favor"
        ]
        
        disagreement_markers = [
            "disagree", "disagreed", "disagrees", "disagreement",
            "no", "nope", "nah", "wrong", "incorrect", "false",
            "but", "however", "although", "though", "yet",
            "dispute", "differ", "different", "oppose",
            "disapprove", "reject", "object", "challenge",
            "actually", "well", "I think", "I believe"  # Context-dependent
        ]
        
        agreement_count = 0
        disagreement_count = 0
        total_utterances = 0
        
        for utt in coord_corpus.iter_utterances():
            total_utterances += 1
            text_lower = utt.text.lower()
            
            # Count agreement markers. Markers are lowercased to match
            # text_lower (otherwise entries like "I think so" never match),
            # and \b word boundaries avoid partial matches such as "no" in "know".
            for marker in agreement_markers:
                pattern = r'\b' + re.escape(marker.lower()) + r'\b'
                agreement_count += len(re.findall(pattern, text_lower))
            
            # Count disagreement markers the same way
            for marker in disagreement_markers:
                pattern = r'\b' + re.escape(marker.lower()) + r'\b'
                disagreement_count += len(re.findall(pattern, text_lower))
        
        if debug:
            print(f"Debug: Total utterances: {total_utterances}")
            print(f"Debug: Agreement markers found: {agreement_count}")
            print(f"Debug: Disagreement markers found: {disagreement_count}")
        
        # Handle edge cases
        if agreement_count == 0 and disagreement_count == 0:
            if debug:
                print("Debug: No markers found, returning neutral score")
            return 5.0
        elif disagreement_count == 0:
            if debug:
                print("Debug: Pure agreement, returning 1.0")
            return 1.0
        elif agreement_count == 0:
            if debug:
                print("Debug: Pure disagreement, returning 10.0")
            return 10.0
        
        # Calculate ratio
        agreement_ratio = agreement_count / disagreement_count
        
        if debug:
            print(f"Debug: Agreement ratio: {agreement_ratio:.2f}")
        
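        # Map ratio to 1-10 scale:
        #   ratio >= 2   -> 1 + 4/ratio   (3.0 at ratio 2, toward 1 as agreement dominates)
        #   ratio <= 0.5 -> 10 - 4*ratio  (8.0 at ratio 0.5, toward 10 as disagreement dominates)
        #   otherwise    -> 5.0 (mixed)
        if agreement_ratio >= 2:
            score = 1 + 4 / agreement_ratio
        elif agreement_ratio <= 0.5:
            score = 10 - (agreement_ratio * 4)
        else:
            score = 5.0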
        
        return max(1, min(10, score))
        
    except Exception as e:
        print(f"Error computing conclusiveness score: {e}")
        import traceback
        traceback.print_exc()
        return 5.0

# Dynamic score: politeness markers combined with speaking-turn balance
def compute_dynamic_score(corpus: Corpus) -> float:
    """Dynamic: Collaborative (1) vs. Competitive (10)"""
    try:
        parser = TextParser()
        text_corpus = parser.transform(corpus)
        ps = PolitenessStrategies()
        ps_corpus = ps.transform(text_corpus)
        
        politeness_scores = []
        for utt in ps_corpus.iter_utterances():
            ps_score = utt.meta.get("politeness_strategies", {})
            positive_markers = sum([
                ps_score.get("feature_politeness_==HASPOSITIVE==", 0),
                ps_score.get("feature_politeness_==HASNEGATIVE==", 0) * -1,
            ])
            politeness_scores.append(positive_markers)
        
        avg_politeness = np.mean(politeness_scores) if politeness_scores else 0
        avg_politeness_normalized = min(1.0, max(0.0, avg_politeness / 5.0))
        
        speaker_counts = defaultdict(int)
        for utt in corpus.iter_utterances():
            speaker_counts[utt.speaker.id] += 1
        
        if len(speaker_counts) == 0:
            return 5.0
        
        total = sum(speaker_counts.values())
        probs = [count / total for count in speaker_counts.values()]
        entropy = -sum(p * np.log2(p) for p in probs if p > 0)
        max_entropy = np.log2(len(speaker_counts))
        balance_score = entropy / max_entropy if max_entropy > 0 else 0
        
        combined = (avg_politeness_normalized + balance_score) / 2
        score = 10 - (combined * 9)
        return max(1, min(10, score))
    except Exception as e:
        print(f"Error computing dynamic score: {e}")
        return 5.0

def evaluate_conversation(conversation_history: List[Dict], debug: bool = False) -> Dict[str, float]:
    """Compute metrics for a conversation."""
    corpus = conversation_to_corpus(conversation_history)
    return {
        "dynamic": compute_dynamic_score(corpus),
        "conclusiveness": compute_conclusiveness_score_improved(corpus, debug=debug)
    }
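To make the ratio-to-score mapping in `compute_conclusiveness_score_improved` easier to reason about, here is a minimal sketch that isolates it from ConvoKit and takes the two counts directly. The function name `ratio_to_score` is an assumption for illustration; the branches mirror the edge cases and thresholds used above.

```python
def ratio_to_score(agreement_count: int, disagreement_count: int) -> float:
    """Map marker counts to a 1-10 score: 1 = consensus, 10 = divergence."""
    if agreement_count == 0 and disagreement_count == 0:
        return 5.0  # the 0/0 case: nothing detected, fall back to neutral
    if disagreement_count == 0:
        return 1.0  # pure agreement
    if agreement_count == 0:
        return 10.0  # pure disagreement

    ratio = agreement_count / disagreement_count
    if ratio >= 2:
        score = 1 + 4 / ratio
    elif ratio <= 0.5:
        score = 10 - 4 * ratio
    else:
        score = 5.0
    return max(1.0, min(10.0, score))


for counts in [(0, 0), (8, 2), (4, 2), (5, 4), (2, 4), (2, 8)]:
    print(counts, ratio_to_score(*counts))
```

One property worth noticing: the mapping is not continuous. Ratios between 0.5 and 2 all collapse to 5.0, while a ratio of exactly 2 gives 3.0 and a ratio of exactly 0.5 gives 8.0, so a score of 5.0 can mean either "no markers found" or "markers roughly balanced". The debug output is what distinguishes the two.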


## Testing at Different Scales

Let's run conversations with 10, 15, and 20 speakers and see how metrics change.
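Before launching these runs, it helps to estimate how many API calls they make. The sketch below is a rough back-of-the-envelope calculation, assuming one bootstrap call per speaker and a 30% chance that each speaker skips a turn in each round, as in `run_scaled_conversation`. It ignores the rarer `last_speaker` skip, so it slightly overestimates. The function name `estimate_api_calls` is hypothetical.

```python
def estimate_api_calls(
    iterations: int,
    participant_count: int,
    skip_probability: float = 0.3,
) -> int:
    """Approximate expected number of chat completion calls for one run."""
    bootstrap_calls = participant_count
    expected_turns = iterations * participant_count * (1 - skip_probability)
    return bootstrap_calls + round(expected_turns)


for count in [10, 15, 20]:
    print(count, estimate_api_calls(iterations=30, participant_count=count))
```

Call count is only part of the cost. `build_message` returns the full `history` plus two extra messages, so each prompt grows with every turn already taken, and the total number of messages sent grows roughly quadratically with the number of turns.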


In [None]:
# Test with different speaker counts
speaker_counts = [10, 15, 20]
results = {}

for count in speaker_counts:
    print(f"\n=== Running {count}-speaker conversation ===")
    conv = run_scaled_conversation(iterations=30, participant_count=count)
    
    # Evaluate once, with debug output for the conclusiveness counts
    print("\nConclusiveness debug output:")
    metrics = evaluate_conversation(conv, debug=True)
    results[count] = metrics
    
    print(f"\n{count}-speaker results:")
    print(f"  Dynamic: {metrics['dynamic']:.2f}/10")
    print(f"  Conclusiveness: {metrics['conclusiveness']:.2f}/10")


## Visualizing Speaker Networks

We can create network graphs showing which speakers interact with each other.


In [None]:
def build_speaker_network(conversation_history: List[Dict], window_size: int = 3) -> nx.Graph:
    """
    Build a network graph where edges represent speakers responding to each other.
    An edge exists if speaker B speaks within window_size messages after speaker A.
    """
    G = nx.Graph()
    
    # Track recent speakers
    recent_speakers = []
    
    for msg in conversation_history:
        if msg.get("role") == "assistant" and "name" in msg:
            current_speaker = msg["name"]
            
            # Link to each recent speaker, since the current speaker may be responding to them
            for prev_speaker in recent_speakers:
                if prev_speaker != current_speaker:
                    if G.has_edge(prev_speaker, current_speaker):
                        G[prev_speaker][current_speaker]["weight"] += 1
                    else:
                        G.add_edge(prev_speaker, current_speaker, weight=1)
            
            # Update recent speakers
            recent_speakers.append(current_speaker)
            if len(recent_speakers) > window_size:
                recent_speakers.pop(0)
    
    return G

# Build network for a fresh 10-speaker conversation (this makes a new set of API calls)
print("Building speaker network...")
conv_10 = run_scaled_conversation(iterations=30, participant_count=10)
network = build_speaker_network(conv_10)

print(f"Network has {network.number_of_nodes()} nodes and {network.number_of_edges()} edges")

# Visualize
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(network, k=1, iterations=50)
nx.draw(network, pos, with_labels=True, node_color='lightblue', 
        node_size=1000, font_size=8, font_weight='bold')
plt.title("Speaker Interaction Network (10 speakers)")
plt.show()
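One way to put a number on which speakers act as hubs is weighted degree: the sum of edge weights at each node. The sketch below recomputes that quantity directly from a list of speaker names, using the same sliding-window rule as `build_speaker_network` and only the standard library. The function name `weighted_degree` and the toy turn list are assumptions for illustration.

```python
from collections import Counter, deque
from typing import Dict, List


def weighted_degree(speakers: List[str], window_size: int = 3) -> Dict[str, int]:
    """Sum of edge weights per speaker under the sliding-window adjacency rule."""
    degree: Counter = Counter()
    recent: deque = deque(maxlen=window_size)  # oldest entry drops off automatically
    for current in speakers:
        for prev in recent:
            if prev != current:
                # Each co-occurrence adds 1 to the edge, hence 1 to both endpoints.
                degree[prev] += 1
                degree[current] += 1
        recent.append(current)
    return dict(degree)


turns = ["speaker_1", "speaker_2", "speaker_1", "speaker_3", "speaker_1", "speaker_4"]
degrees = weighted_degree(turns, window_size=2)
hub = max(degrees, key=degrees.get)
print(degrees, hub)
```

On the graph itself, `dict(network.degree(weight="weight"))` gives the same quantity. Because edges here come from proximity in turn order rather than explicit replies, a high weighted degree indicates a speaker who often talks near others, which is suggestive of a hub but not proof of one.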


## Analysis

### Scaling Effects

1. **Dynamic Score**: How does collaboration/competition change with scale?
2. **Conclusiveness Score**: Does the improved calculation work better?
3. **Network Structure**: Do certain speakers become hubs?
4. **Conversation Quality**: Does it degrade with more speakers?
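For the first question, half of the dynamic score comes from how evenly turns are spread across speakers, measured as normalized Shannon entropy in `compute_dynamic_score`. The sketch below isolates that balance term using only the standard library. The `balance_score` name and the turn counts are made up for illustration and are not results from the runs above.

```python
import math
from typing import Dict


def balance_score(turn_counts: Dict[str, int]) -> float:
    """Normalized entropy of turn counts: 1.0 = perfectly even, near 0 = one speaker dominates."""
    total = sum(turn_counts.values())
    if total == 0 or len(turn_counts) < 2:
        return 0.0
    probs = [count / total for count in turn_counts.values() if count > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(turn_counts))


# Ten speakers with identical turn counts.
even = {f"speaker_{i}": 10 for i in range(1, 11)}

# Ten speakers where one takes 91 of 100 turns.
skewed = {"speaker_1": 91, **{f"speaker_{i}": 1 for i in range(2, 11)}}

print(round(balance_score(even), 3), round(balance_score(skewed), 3))
```

The even case scores close to 1.0 and the skewed case roughly 0.22. Since the runner skips each speaker at random with the same probability, turn counts should stay fairly even at any scale, so shifts in the dynamic score across 10, 15, and 20 speakers are more likely to come from the politeness term than from balance.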

### Key Findings

- **Conclusiveness Fix**: The improved marker detection should find more agreement/disagreement patterns
- **Group Dynamics**: Larger groups may show different interaction patterns
- **Speaker Roles**: Some speakers may emerge as leaders or connectors

## Summary

Scaling to 10+ speakers reveals:

- **Metrics remain meaningful**: ConvoKit can handle larger groups
- **Conclusiveness improved**: Better marker detection reduces 0/0 cases
- **Network analysis**: Visualizing interactions reveals group structure

The improved conclusiveness calculation:
- Uses expanded marker lists
- Matches markers on word boundaries
- Adds debug output to identify issues
- Handles edge cases explicitly

Future work could explore:
- Even larger groups (50+ speakers)
- Subgroup formation
- Conversation moderation at scale
- Performance optimization for large groups
