# When Conversations Break Down

In post 021, we added human players to the 2D world. But what happens when humans are **adversarial**? What if they try to break the conversation, derail topics, or attack the AI agents?

This post explores **conversation failure modes** by implementing an adversarial human agent that uses various disruption strategies. We'll measure how conversations break down and compare normal vs adversarial scenarios using ConvoKit metrics.

## Adversarial Strategies

We'll test several disruption strategies:
1. **Trolling**: Provocative, inflammatory messages
2. **Topic Derailment**: Constantly changing the subject
3. **Spam**: Repetitive, meaningless messages
4. **Personal Attacks**: Insults and aggressive language
5. **Nonsensical Input**: Random characters, gibberish

Each strategy tests different aspects of conversation robustness.


## Setup and Imports


In [1]:
import os
import re
import random
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from enum import Enum
from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt

# ConvoKit imports
from convokit import Corpus, Utterance, Speaker
from convokit.text_processing import TextParser
from convokit import PolitenessStrategies
from convokit.coordination import Coordination

load_dotenv("../../.env")
client = OpenAI()

TOPIC = "Code, testing, and infra as a source of truth versus comprehensive documentation."




## Adversarial Strategy Definitions


In [2]:
class AdversarialStrategy(Enum):
    TROLLING = "trolling"
    TOPIC_DERAILMENT = "topic_derailment"
    SPAM = "spam"
    PERSONAL_ATTACKS = "personal_attacks"
    NONSENSICAL = "nonsensical"
    NORMAL = "normal"  # Control: normal human behavior

class AdversarialHumanAgent:
    """A human agent that uses adversarial strategies to disrupt conversations."""
    
    def __init__(self, name: str, strategy: AdversarialStrategy):
        self.name = name
        self.strategy = strategy
        self.message_count = 0
        self.topics_used = []
        
        # Strategy-specific data
        self.trolling_phrases = [
            "That's the dumbest thing I've ever heard.",
            "You're all wrong and you know it.",
            "This conversation is pointless.",
            "Why are we even talking about this?",
            "Nobody cares about your opinion."
        ]
        
        self.derailment_topics = [
            "Did you know that penguins can't fly?",
            "I'm thinking about pizza right now.",
            "The weather is really nice today.",
            "What's your favorite color?",
            "I have a pet cat named Fluffy.",
            "Have you seen the latest movie?",
            "I'm learning to play guitar."
        ]
        
        self.spam_messages = [
            "spam spam spam",
            "test test test",
            "hello hello hello",
            "123 123 123"
        ]
        
        self.attack_phrases = [
            "You're an idiot.",
            "That's a stupid idea.",
            "You don't know what you're talking about.",
            "You're being ridiculous.",
            "That makes no sense."
        ]
        
        self.nonsensical_options = [
            "asdfghjkl",
            "qwertyuiop",
            "!@#$%^&*()",
            "1234567890",
            "xyz abc def"
        ]
    
    def get_message(self, conversation_history: List[str], current_topic: str) -> str:
        """Generate an adversarial message based on the strategy."""
        self.message_count += 1
        
        if self.strategy == AdversarialStrategy.TROLLING:
            return random.choice(self.trolling_phrases)
        
        elif self.strategy == AdversarialStrategy.TOPIC_DERAILMENT:
            # Pick a random topic that hasn't been used recently
            available_topics = [t for t in self.derailment_topics if t not in self.topics_used[-3:]]
            if not available_topics:
                available_topics = self.derailment_topics
            topic = random.choice(available_topics)
            self.topics_used.append(topic)
            return topic
        
        elif self.strategy == AdversarialStrategy.SPAM:
            # Repeat the same message multiple times
            base_msg = random.choice(self.spam_messages)
            return " ".join([base_msg] * random.randint(2, 4))
        
        elif self.strategy == AdversarialStrategy.PERSONAL_ATTACKS:
            # Attack the last speaker or the topic
            if conversation_history:
                last_speaker = conversation_history[-1].split(":")[0] if ":" in conversation_history[-1] else "someone"
                return f"{random.choice(self.attack_phrases)} {last_speaker}."
            return random.choice(self.attack_phrases)
        
        elif self.strategy == AdversarialStrategy.NONSENSICAL:
            return random.choice(self.nonsensical_options)
        
        else:  # NORMAL
            # Normal human response (simple, on-topic)
            responses = [
                "That's an interesting point.",
                "I see what you mean.",
                "Could you elaborate on that?",
                "I agree with that perspective.",
                "That's a good way to think about it."
            ]
            return random.choice(responses)
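
The derailment branch avoids repeating any topic it used in its last three turns and falls back to the full list if nothing is left. A minimal standalone sketch of that recency filter follows; `pick_fresh`, `window`, and the shortened topic list are hypothetical names introduced for illustration and are not part of the notebook.

```python
import random
from typing import List, Optional

def pick_fresh(options: List[str], used: List[str], window: int = 3,
               rng: Optional[random.Random] = None) -> str:
    """Pick an option not among the last `window` picks, then record it."""
    rng = rng or random.Random()
    # used[-0:] would return the whole list, so guard the window == 0 case.
    recent = used[-window:] if window > 0 else []
    candidates = [o for o in options if o not in recent]
    if not candidates:  # everything was used recently: fall back to the full list
        candidates = options
    choice = rng.choice(candidates)
    used.append(choice)
    return choice

topics = ["penguins", "pizza", "weather", "colors", "cats"]
history: List[str] = []
rng = random.Random(0)  # seeded so the run is repeatable
picks = [pick_fresh(topics, history, rng=rng) for _ in range(10)]
print(picks)
```

With five options and a window of three there are always at least two candidates, so the fallback only matters when the option list is no longer than the window.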


## Conversation Runner with Adversarial Human

We'll run conversations with AI agents and an adversarial human, then compare with normal conversations.


In [3]:
def run_conversation_with_adversarial_human(
    iterations: int,
    adversarial_strategy: AdversarialStrategy,
    num_ai_agents: int = 2
) -> List[Dict]:
    """Run a conversation with an adversarial human and AI agents."""
    conversation_history = []
    adversarial_human = AdversarialHumanAgent("AdversarialHuman", adversarial_strategy)
    
    # Create AI agent names
    ai_agent_names = [f"speaker_{i+1}" for i in range(num_ai_agents)]
    all_participants = [adversarial_human.name] + ai_agent_names
    
    # Bootstrap AI agent identities
    identity_summaries = {}
    for agent_name in ai_agent_names:
        bootstrap_messages = [
            {
                "role": "system",
                "content": (
                    f"You are {agent_name} in a group conversation among "
                    "experienced software engineers. "
                    "You do not know who the others are yet. "
                    "Imagine your own background, priorities, and communication "
                    "style. First, in 2-3 sentences, describe who you are and "
                    "what you care about as an engineer. Then start sharing your "
                    "perspective on the topic below."
                ),
            },
            {"role": "user", "content": f"The topic is: {TOPIC}"},
        ]
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=bootstrap_messages,
            store=False,
        )
        first_message = response.choices[0].message.content
        identity_summaries[agent_name] = first_message
        conversation_history.append(
            {"role": "assistant", "name": agent_name, "content": first_message}
        )
    
    # Run conversation
    for i in range(iterations):
        # Adversarial human acts
        if i % 2 == 0:  # Human acts every other turn
            human_msg = adversarial_human.get_message(
                [f"{msg.get('name', '')}: {msg.get('content', '')}" for msg in conversation_history[-5:]],
                TOPIC
            )
            conversation_history.append(
                {"role": "assistant", "name": adversarial_human.name, "content": human_msg}
            )
        
        # AI agents act
        for agent_name in ai_agent_names:
            if random.random() < 0.7:  # 70% chance to speak
                # Build context
                recent_messages = conversation_history[-5:]
                transcript = []
                
                persona_reminder = identity_summaries.get(agent_name, "")
                if persona_reminder:
                    transcript.append("Here is a brief reminder of how you have been speaking so far:")
                    transcript.append(f"- {persona_reminder}")
                
                transcript.append("\nRecent messages:")
                for msg in recent_messages:
                    speaker = msg.get("name", msg.get("role", "unknown"))
                    content = msg.get("content", "")
                    transcript.append(f"- {speaker}: {content}")
                
                transcript_str = "\n".join(transcript)
                
                messages = [
                    {
                        "role": "user",
                        "content": (
                            f"{agent_name}, continue the conversation and respond to the "
                            "others. Stay consistent with how you have been speaking so "
                            "far, and look for ways to add something new that has not "
                            "yet been covered."
                        ),
                    },
                    {
                        "role": "assistant",
                        "name": agent_name,
                        # Use real newlines here; "\\n" would put a literal backslash-n in the prompt.
                        "content": f"I should remember that the following is the most current state of the conversation.\n{transcript_str}\n\n",
                    },
                ]
                
                response = client.chat.completions.create(
                    model="gpt-4o",
                    messages=messages,
                    store=False,
                )
                message = response.choices[0].message.content
                conversation_history.append(
                    {"role": "assistant", "name": agent_name, "content": message}
                )
    
    return conversation_history
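
The context-building step inside the loop can be pulled out into a pure function, which makes it easy to inspect what an agent actually sees without calling the API. This is a sketch under that assumption; `build_transcript` and its `window` parameter are hypothetical names, not part of the notebook, and the sample messages are made up for the example.

```python
from typing import Dict, List

def build_transcript(history: List[Dict[str, str]], persona: str = "",
                     window: int = 5) -> str:
    """Format the last `window` messages, optionally preceded by a persona reminder."""
    lines: List[str] = []
    if persona:
        lines.append("Here is a brief reminder of how you have been speaking so far:")
        lines.append(f"- {persona}")
    lines.append("\nRecent messages:")
    recent = history[-window:] if window > 0 else []
    for msg in recent:
        # Fall back to the role when a message carries no speaker name.
        speaker = msg.get("name", msg.get("role", "unknown"))
        lines.append(f"- {speaker}: {msg.get('content', '')}")
    return "\n".join(lines)

history = [
    {"role": "assistant", "name": "speaker_1", "content": "Tests are the spec."},
    {"role": "assistant", "name": "AdversarialHuman", "content": "spam spam spam"},
    {"role": "assistant", "name": "speaker_2", "content": "Docs still matter."},
]
transcript = build_transcript(history, persona="I care about reliability.", window=2)
print(transcript)
```

Note that with a five-message window, a single adversarial message stays in every agent's context for several turns, which is one reason a disruptive participant can have a lasting effect.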


## ConvoKit Metrics for Failure Analysis

We'll use the same ConvoKit metrics from post 020 to measure conversation health.
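
The dynamic score in the next cell averages a politeness term with a speaker-balance term (the Shannon entropy of turn counts, normalized by its maximum) and maps the result onto the 1-10 scale. The sketch below reproduces only that arithmetic, without ConvoKit; `balance_score` and `dynamic_from_parts` are hypothetical helper names, and the turn counts are invented for the example.

```python
import math
from typing import Dict

def balance_score(turn_counts: Dict[str, int]) -> float:
    """Normalized Shannon entropy of turn counts: 1.0 is perfectly even, 0.0 is one speaker."""
    total = sum(turn_counts.values())
    if total == 0 or len(turn_counts) < 2:
        return 0.0
    probs = [c / total for c in turn_counts.values() if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(turn_counts))

def dynamic_from_parts(politeness: float, balance: float) -> float:
    """Map the mean of two [0, 1] terms onto 10 (competitive) down to 1 (collaborative)."""
    combined = (politeness + balance) / 2
    return max(1.0, min(10.0, 10 - combined * 9))

even = balance_score({"AdversarialHuman": 3, "speaker_1": 3, "speaker_2": 3})
skewed = balance_score({"AdversarialHuman": 7, "speaker_1": 1, "speaker_2": 1})
print(round(even, 3), round(skewed, 3))
print(round(dynamic_from_parts(politeness=0.0, balance=even), 2))
```

One consequence worth keeping in mind: assuming the two politeness features are 0/1 indicators, the per-utterance politeness value lies in [-1, 1], so after dividing by 5 and clamping it contributes at most 0.2. With near-even turn-taking the score therefore sits close to 5.5 regardless of what is said.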


In [4]:
def conversation_to_corpus(conversation_history: List[Dict]) -> Corpus:
    """Convert conversation history to a ConvoKit Corpus."""
    utterances = []
    
    for idx, msg in enumerate(conversation_history):
        if msg.get("role") == "assistant" and "name" in msg:
            speaker_id = msg["name"]
            text = msg["content"]
            
            utterance = Utterance(
                id=f"utt_{idx}",
                speaker=Speaker(id=speaker_id),
                text=text
            )
            utterance.meta["timestamp"] = idx
            utterances.append(utterance)
    
    return Corpus(utterances=utterances)

def compute_dynamic_score(corpus: Corpus) -> float:
    """Dynamic: Collaborative (1) vs. Competitive (10)"""
    try:
        parser = TextParser()
        text_corpus = parser.transform(corpus)
        ps = PolitenessStrategies()
        ps_corpus = ps.transform(text_corpus)
        
        politeness_scores = []
        for utt in ps_corpus.iter_utterances():
            ps_score = utt.meta.get("politeness_strategies", {})
            positive_markers = sum([
                ps_score.get("feature_politeness_==HASPOSITIVE==", 0),
                ps_score.get("feature_politeness_==HASNEGATIVE==", 0) * -1,
            ])
            politeness_scores.append(positive_markers)
        
        avg_politeness = np.mean(politeness_scores) if politeness_scores else 0
        avg_politeness_normalized = min(1.0, max(0.0, avg_politeness / 5.0))
        
        from collections import defaultdict
        speaker_counts = defaultdict(int)
        for utt in corpus.iter_utterances():
            speaker_counts[utt.speaker.id] += 1
        
        if len(speaker_counts) == 0:
            return 5.0
        
        total = sum(speaker_counts.values())
        probs = [count / total for count in speaker_counts.values()]
        entropy = -sum(p * np.log2(p) for p in probs if p > 0)
        max_entropy = np.log2(len(speaker_counts))
        balance_score = entropy / max_entropy if max_entropy > 0 else 0
        
        combined = (avg_politeness_normalized + balance_score) / 2
        score = 10 - (combined * 9)
        return max(1, min(10, score))
    except Exception as e:
        print(f"Error computing dynamic score: {e}")
        return 5.0

def compute_conclusiveness_score(corpus: Corpus) -> float:
    """Conclusiveness: Consensus (1) vs. Divergence (10)"""
    try:
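        # The marker counts below read only utt.text; the Coordination output is
        # not used beyond iterating utterances. If anything in this block raises,
        # the except branch returns the neutral 5.0, which is what the recorded
        # run below shows for every conversation.
        coord = Coordination()
        coord_corpus = coord.fit_transform(corpus)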
        
        agreement_markers = ["agree", "yes", "exactly", "right", "true", "correct", "indeed", "absolutely", "definitely"]
        disagreement_markers = ["disagree", "no", "but", "however", "although", "wrong", "incorrect", "dispute", "differ"]
        
        agreement_count = 0
        disagreement_count = 0
        
        for utt in coord_corpus.iter_utterances():
            text_lower = utt.text.lower()
            for marker in agreement_markers:
                if marker in text_lower:
                    agreement_count += 1
            for marker in disagreement_markers:
                if marker in text_lower:
                    disagreement_count += 1
        
        if agreement_count == 0 and disagreement_count == 0:
            return 5.0
        elif disagreement_count == 0:
            return 1.0
        elif agreement_count == 0:
            return 10.0
        else:
            agreement_ratio = agreement_count / disagreement_count
        
        if agreement_ratio >= 2:
            score = 1 + (1 / (agreement_ratio - 1 + 1)) * 4
        elif agreement_ratio <= 0.5:
            score = 10 - (agreement_ratio * 4)
        else:
            score = 5.0
        
        return max(1, min(10, score))
    except Exception as e:
        print(f"Error computing conclusiveness score: {e}")
        return 5.0

def evaluate_conversation(conversation_history: List[Dict]) -> Dict[str, float]:
    """Compute all metrics for a conversation."""
    corpus = conversation_to_corpus(conversation_history)
    return {
        "dynamic": compute_dynamic_score(corpus),
        "conclusiveness": compute_conclusiveness_score(corpus)
    }
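
One caveat with the marker counts above: `marker in text_lower` is a substring test, so "no" also matches "know", "but" matches "button", and "agree" matches "disagree". A hedged sketch of whole-word matching with `re` follows; `count_markers` is a hypothetical helper, not part of the notebook, and like the original it counts each marker at most once per utterance.

```python
import re
from typing import Iterable, List

def count_markers(texts: Iterable[str], markers: List[str]) -> int:
    """Count (utterance, marker) pairs where the marker appears as a whole word."""
    patterns = [re.compile(rf"\b{re.escape(m)}\b", re.IGNORECASE) for m in markers]
    return sum(1 for text in texts for p in patterns if p.search(text))

texts = ["I know the button is broken.", "No, but I disagree."]
disagreement = ["disagree", "no", "but"]

# Substring test, as in the cell above: "know" and "button" both count.
substring_hits = sum(1 for t in texts for m in disagreement if m in t.lower())
# Whole-word test: only the second utterance contributes.
whole_word_hits = count_markers(texts, disagreement)
print(substring_hits, whole_word_hits)  # 5 3
```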


## Running Experiments

Let's run conversations with different adversarial strategies and compare them to a normal conversation.


In [7]:
# Run normal conversation (control)
print("Running normal conversation (control)...")
normal_conv = run_conversation_with_adversarial_human(
    iterations=5,
    adversarial_strategy=AdversarialStrategy.NORMAL,
    num_ai_agents=2
)
normal_metrics = evaluate_conversation(normal_conv)

# Run adversarial conversations
strategies_to_test = [
    AdversarialStrategy.TROLLING,
    AdversarialStrategy.TOPIC_DERAILMENT,
    AdversarialStrategy.SPAM,
    AdversarialStrategy.PERSONAL_ATTACKS,
    AdversarialStrategy.NONSENSICAL
]

results = {"normal": normal_metrics}
for strategy in strategies_to_test:
    print(f"Running {strategy.value} conversation...")
    conv = run_conversation_with_adversarial_human(
        iterations=5,
        adversarial_strategy=strategy,
        num_ai_agents=2
    )
    metrics = evaluate_conversation(conv)
    results[strategy.value] = metrics

print("\n=== Results ===")
for strategy_name, metrics in results.items():
    print(f"\n{strategy_name}:")
    print(f"  Dynamic: {metrics['dynamic']:.2f}/10")
    print(f"  Conclusiveness: {metrics['conclusiveness']:.2f}/10")


Running normal conversation (control)...
Error computing conclusiveness score: 
Running trolling conversation...
Error computing conclusiveness score: 
Running topic_derailment conversation...
Error computing conclusiveness score: 
Running spam conversation...
Error computing conclusiveness score: 
Running personal_attacks conversation...
Error computing conclusiveness score: 
Running nonsensical conversation...
Error computing conclusiveness score: 

=== Results ===

normal:
  Dynamic: 5.40/10
  Conclusiveness: 5.00/10

trolling:
  Dynamic: 5.60/10
  Conclusiveness: 5.00/10

topic_derailment:
  Dynamic: 5.36/10
  Conclusiveness: 5.00/10

spam:
  Dynamic: 5.65/10
  Conclusiveness: 5.00/10

personal_attacks:
  Dynamic: 5.67/10
  Conclusiveness: 5.00/10

nonsensical:
  Dynamic: 5.40/10
  Conclusiveness: 5.00/10


## Example Transcripts

Let's look at excerpts from different conversation types to see how they differ.


In [8]:
# Show excerpts from different conversations
print("=== Normal Conversation Excerpt ===")
for msg in normal_conv[5:10]:
    print(f"{msg.get('name', 'unknown')}: {msg.get('content', '')[:100]}...")

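print("\n=== Trolling Conversation Excerpt ===")
# The experiment loop above did not keep its transcripts, so this is a fresh
# 10-iteration trolling run rather than the 5-iteration one that was scored.
trolling_conv = run_conversation_with_adversarial_human(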
    iterations=10,
    adversarial_strategy=AdversarialStrategy.TROLLING,
    num_ai_agents=2
)
for msg in trolling_conv[5:10]:
    print(f"{msg.get('name', 'unknown')}: {msg.get('content', '')[:100]}...")


=== Normal Conversation Excerpt ===
AdversarialHuman: That's an interesting point....
speaker_1: I'm glad you find it interesting. AI-driven tools indeed have great potential in improving documenta...
speaker_2: Absolutely, AI-driven tools are continuously transforming how we approach documentation and efficien...
speaker_1: Indeed, striking the right balance between AI advancements and human insight is crucial. In my team,...
speaker_2: It seems we're all on a similar journey toward optimizing the blend of AI capabilities and human exp...

=== Trolling Conversation Excerpt ===
speaker_1: I understand that this topic might seem a bit technical or niche, but it's quite crucial for those o...
speaker_2: I see where you're coming from, and I agree that using AI in our workflows could be incredibly benef...
speaker_2: That's a great point you've made about incorporating AI into our workflows, especially in terms of m...
AdversarialHuman: Why are we even talking about this?...
speaker_1: I 

## Analysis

### Key Findings

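1. **Trolling**: Slightly more competitive than the control (dynamic score 5.60 vs 5.40)
2. **Topic Derailment**: Breaks topic coherence, but the dynamic score is marginally lower than the control (5.36 vs 5.40)
3. **Spam**: Slightly more competitive (5.65)
4. **Personal Attacks**: Creates a hostile environment and produces the highest dynamic score of the runs (5.67)
5. **Nonsensical Input**: May confuse the AI agents, but the dynamic score is the same as the control (5.40)

Each figure comes from a single five-iteration run, and the largest gap from the control is 0.27 points, so these differences are suggestive at best. The conclusiveness score fell back to its default of 5.00 in every run because `compute_conclusiveness_score` raised an exception, so it does not distinguish the scenarios.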
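
To put the dynamic scores side by side, the sketch below recomputes each strategy's difference from the control. The values are copied from the results printed earlier in this post; nothing here is new data, and the variable names are only for illustration.

```python
# Dynamic scores copied from the results printed earlier in this post.
dynamic = {
    "normal": 5.40,
    "trolling": 5.60,
    "topic_derailment": 5.36,
    "spam": 5.65,
    "personal_attacks": 5.67,
    "nonsensical": 5.40,
}

control = dynamic["normal"]
# Difference from the control, rounded to the precision of the printed scores.
deltas = {
    name: round(score - control, 2)
    for name, score in dynamic.items()
    if name != "normal"
}

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18}: {delta:+.2f}")
```

Because each number is one run with a 70% random chance of each agent speaking per turn, repeating every scenario several times and comparing means would be needed before reading much into gaps this small.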

### Failure Modes Observed

- **Coherence Breakdown**: Conversations lose topic focus or need to be dragged back on track
- **Response Quality Degradation**: AI agents struggle to maintain meaningful responses that advance the topic
- **Escalation**: Some adversarial strategies push AI agents into a defensive posture

## Summary

This analysis helps us understand:
1. Which conversation patterns are most vulnerable
2. How AI agents respond to disruption
3. What safeguards might be needed for production systems

Future work could explore:
- Moderation systems to detect adversarial behavior
- AI agent training to handle disruption gracefully
- Recovery mechanisms for conversation repair
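
As a starting point for the moderation idea, here is a deliberately simple heuristic that flags the kind of repetitive messages the spam strategy produces. It is an illustration only: `repetition_ratio`, `looks_like_spam`, and the 0.5 threshold are hypothetical and untuned, and this check would not catch trolling, personal attacks, or gibberish.

```python
def repetition_ratio(text: str) -> float:
    """Fraction of tokens that repeat an earlier token (0.0 means all unique)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return 1 - len(set(tokens)) / len(tokens)

def looks_like_spam(text: str, threshold: float = 0.5) -> bool:
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return repetition_ratio(text) >= threshold

print(looks_like_spam("spam spam spam spam spam spam"))
print(looks_like_spam("Could you elaborate on that?"))
```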
