Last time, we looked at LLMs conversing through two separate chat windows. This time, I want to see whether we can do it with one. OpenAI's chat messages accept a "name" field, which we can use to tell model 1 apart from model 2. Here, we ask a single model to simulate the debate, with the user acting as the mediator. The interesting part is that this scales much better: with the user as the orchestrator, we can add as many models/positions to the debate as we want. We can also move away from the debate structure and set up collaborative scenarios.
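To make the "name" idea concrete, here is a minimal sketch of what a shared history could look like. The helper `named_message` and the placeholder message contents are hypothetical, and the character restriction on names is an assumption worth checking against the current API reference; nothing here calls the API.

```python
import re

# Assumption: names are limited to letters, digits, underscores, and hyphens,
# up to 64 characters. Verify against the current API reference.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")

def named_message(speaker_id: str, content: str) -> dict:
    """Build an assistant message tagged with the speaker who produced it."""
    if not NAME_PATTERN.fullmatch(speaker_id):
        raise ValueError(f"invalid speaker name: {speaker_id!r}")
    return {"role": "assistant", "name": speaker_id, "content": content}

# One shared history: the mediator speaks as "user", each debater's turn
# is an "assistant" message that carries a distinct name.
history = [
    {"role": "system", "content": "You are simulating a debate between AI agents."},
    {"role": "user", "content": "speaker_1, it's your turn to speak."},
    named_message("speaker_1", "Placeholder argument from speaker 1."),
    {"role": "user", "content": "speaker_2, it's your turn to speak."},
]
```

Only the assistant messages carry a name here; the mediator's messages stay unnamed.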

In [1]:
from openai import OpenAI
import os
import base64
import requests
from tqdm import tqdm
from IPython.display import FileLink, display
from dotenv import load_dotenv
# Load API key
_ = load_dotenv("../../../comm4190_F25/01_Introduction_and_setup/.env")
client = OpenAI()

## LLM Debates
Let's try LLM debates first as a continuation of our prior work. We're going to use the same topic as [last time](https://ezou626.github.io/comm4190_F25_Using_LLMs_Blog/posts/001_an_llm_conversation/an_llm_conversation.html).

We'll be a bit rigid for now. We're going to use the same model for both, an incrementing speaker naming system, and a round robin speaking format.
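As a rough illustration of that speaking order, the sketch below lists who speaks when under the incrementing names and round robin format. The function name `round_robin_turns` is hypothetical and is not part of the notebook code.

```python
def round_robin_turns(iterations: int, participant_count: int) -> list[str]:
    """List speaker ids in the order they take turns.

    Every speaker talks exactly once per iteration, always in the same order.
    """
    return [
        f"speaker_{index}"
        for _ in range(iterations)
        for index in range(1, participant_count + 1)
    ]

print(round_robin_turns(2, 3))
# ['speaker_1', 'speaker_2', 'speaker_3', 'speaker_1', 'speaker_2', 'speaker_3']
```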

In [13]:
DEBATE_TOPIC = """Code, testing, and dev infra should be prioritized over comprehensive documentation."""

# prompt to analyze conversations
EVALUATION_PROMPT = """
Your objective is to analyze this conversation between speakers.
Your response should follow this organization:
- A Brief Summary
- Final Outputs/Artifacts/Takeaways
- Characteristics/Dynamic (Competitive/Collaborative/etc.)
"""

def analyze_conversation(conversation: str):
    input_chat = [
        {
            "role": "system",
            "content": EVALUATION_PROMPT
        },
        {
            "role": "user",
            "content": "Here is the transcript\n" + conversation
        }
    ]
    response = client.chat.completions.create(
        model = "gpt-4o",
        messages = input_chat,
        store = False
    )
    print(response.choices[0].message.content)

In [14]:
SYSTEM_PROMPT = "You are simulating a debate between AI agents. Each agent should respond in turn, logically arguing their point. Do not speak for both sides in one message."

# code to simulate the conversation
def run_conversation(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    debate_topic: str
) -> list[dict]:
    # model 1 is the first speaker
    debate_history = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "The debate topic is: " + debate_topic}
    ]

    for _ in tqdm(range(iterations)):
        for model in range(1, participant_count + 1):
            speaker_id = f"speaker_{model}"
            debate_history.append({"role": "user", "content": f"{speaker_id}, it's your turn to speak."})
            response = client.chat.completions.create(
                model = openai_model_id,
                messages = debate_history,
                store = False
            )
            message = response.choices[0].message.content
            debate_history.append({"role": "assistant", "name": speaker_id, "content": message})

    return debate_history

# code to save the conversation
def save_conversation(
    filename: str,
    debate_history: list[dict]
) -> str:

    messages = []

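    for record in debate_history:

        # mediator prompts are stored as "user" messages
        if record["role"] == "user":
            messages.append("mediator:\n" + record["content"])

        # speaker turns are stored as named "assistant" messages
        # (single quotes inside the f-string keep this valid before Python 3.12)
        elif record["role"] == "assistant":
            messages.append(f"{record['name']}:\n{record['content']}")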
    
    conversation_transcript = "\n\n".join(messages)
    
    with open(filename, "w", encoding="utf-8") as f:
        f.write(conversation_transcript)
    
    display(FileLink(filename))

    return conversation_transcript

### Experiment 1
Here's our first experiment configuration:
- 2 speakers
- Explicit mediator: Include mediator messages in the chat history

In [9]:
debate_1 = run_conversation(2, 'gpt-4o', 2, DEBATE_TOPIC)

100%|██████████| 2/2 [00:33<00:00, 16.92s/it]


In [10]:
conversation_transcript_1 = save_conversation("conversation_transcript_1.txt", debate_1)

### Analysis
Like last time, we're going to analyze the conversation with AI as well. 

In [12]:
analyze_conversation(conversation_transcript_1)

- **Brief Summary**: This debate centers on whether code quality, testing, and development infrastructure should take priority over comprehensive documentation. Speaker 1 argues that focusing on code and testing leads to more robust products and allows for quicker adaptations in the fast-paced tech world. Speaker 2 counters that documentation is crucial for knowledge transfer, collaboration, and regulatory compliance, emphasizing its role in preserving the long-term vision and understanding of software projects.

- **Final Outputs/Artifacts/Takeaways**: Speaker 1 stresses the importance of sustainable code quality and testing as a form of self-documentation and for enabling continuous improvement. On the other hand, Speaker 2 highlights the necessity of documentation for maintaining architectural understanding and ensuring compliance with industry standards, as well as for effective stakeholder communication.

- **Characteristics/Dynamic**: The conversation is collaborative as both spe

There are some noticeable changes from last time. The participants' positions are noticeably more rigid and, as a result, the final outcome is less convergent. A side effect is that there is no natural closing spot. I think this is probably due to our significantly less cooperative system prompt. Last time, we explicitly asked the models to aim for common ground, and we would probably see the same convergence if we added that to this system prompt. We could make that change for the next experiment so that we gradually move toward some concrete artifacts. However, I believe a thorough evaluation of the different prompts we could use here is best saved for another blog that systematically evaluates various system prompts and the effect of specific phrases on the model's final output. It would also be interesting to use something like [ConvoKit](https://convokit.cornell.edu/) to perform these kinds of analyses, so we can save that for a future blog as well.

> **Future Work:**
> - Use ConvoKit to systematically evaluate conversational features in a quantitative way based on various prompting and chat structuring strategies.

For now, I'm interested in seeing if a third model can bring new things to the conversation.

### Experiment 2
Let's try this with 3 models instead of 2. I'm wondering what a third model will do in a seemingly binary scenario. Will it support one of the positions, or will it look to take a middle ground? What if it comes up with a third position?

In [15]:
debate_2 = run_conversation(2, 'gpt-4o', 3, DEBATE_TOPIC)

100%|██████████| 2/2 [00:34<00:00, 17.31s/it]


In [17]:
conversation_transcript_2 = save_conversation("conversation_transcript_2.txt", debate_2)

### Analysis
Let's see what GPT-4o has to say about this debate.

In [18]:
analyze_conversation(conversation_transcript_2)

- **Brief Summary:** The debate centers around whether code, testing, and development infrastructure should be prioritized over comprehensive documentation. Speaker 1 argues for prioritizing code and infrastructure, emphasizing agility and efficiency, particularly in initial project phases. Speaker 2 champions the importance of documentation for usability, maintainability, and long-term benefits. Speaker 3 presents a middle ground, suggesting integrated processes that balance both priorities as complementary elements.

- **Final Outputs/Artifacts/Takeaways:** The conversation ends with speaker 3 advocating for an integrated approach where tools and methodologies make documentation a part of the development process. This includes using automated tools for inline documentation and ensuring documentation is part of the project’s definition of "done." It suggests both speed in delivery and sustainable practices can coexist without sacrificing one for the other.

- **Characteristics/Dynamic

Even though we didn't change the prompt at all, speaker 3 naturally took a middle path between relying on documentation and using the codebase as the primary source of truth in software projects. Perhaps a third speaker taking the middle ground leads to a better synthesis of ideas by bridging the gap in what would otherwise be a 2-way debate. Given this, I'm now really interested to see what would happen if we had 4 speakers.

### Experiment 3
Now we're going to try two rounds of communication where we have 4 speakers. Pretty straightforward. I wonder if speaker 4 may agree with speaker 3, since I expect the start of the conversation to go about the same.

In [19]:
debate_3 = run_conversation(2, 'gpt-4o', 4, DEBATE_TOPIC)

100%|██████████| 2/2 [01:02<00:00, 31.17s/it]


In [20]:
conversation_transcript_3 = save_conversation("conversation_transcript_3.txt", debate_3)

### Analysis
Now, let's get a quick summary of what happened.

In [21]:
analyze_conversation(conversation_transcript_3)

- **Brief Summary:**
  The conversation revolves around the debate on whether code, testing, and development infrastructure should be prioritized over comprehensive documentation in software development. Speaker_1 advocates for prioritizing core code and testing, especially in early project stages, while speaker_2 emphasizes the necessity of documentation for continuity and avoiding knowledge silos. Speaker_3 introduces a balanced approach depending on project context and phase, and speaker_4 underscores the integration of documentation with development for synergy and resilience.

- **Final Outputs/Artifacts/Takeaways:**
  - Recognition of the importance of a balanced approach between prioritizing code, testing, and infrastructure and maintaining comprehensive documentation.
  - Understanding that the project's phase and specific needs should dictate the balance between these priorities.
  - Emphasis on tools and methodologies that integrate documentation into the development process,

It seems that speaker 4 also takes a middle ground, but is a bit biased towards the second position of integrating documentation. I think we might get diminishing returns with adding more speakers in the current rigid structure without some more complicated instructions, so we could just use 3 for now. I think ideally, we'd like to move to self-organizing conversations that aren't necessarily debates.

### Experiment 4
One final interesting thing we could do is remove the mediator from the chat history, since the mediator's turn prompts don't really add anything. We can inject the instructional message at the end of the chat history right before we call the API to direct the next speaker, without persisting it. The transcript technically won't be complete, but maybe we'll see some differences in the formality of the responses when the mediator is left out of the history.

In [22]:
# code to simulate the conversation
def run_organic_conversation(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    debate_topic: str
) -> list[dict]:
    # model 1 is the first speaker
    debate_history = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "The debate topic is: " + debate_topic}
    ]

    for _ in tqdm(range(iterations)):
        for model in range(1, participant_count + 1):
            speaker_id = f"speaker_{model}"
            response = client.chat.completions.create(
                model = openai_model_id,
                messages = debate_history + [{"role": "user", "content": f"{speaker_id}, it's your turn to speak."}],
                store = False
            )
            message = response.choices[0].message.content
            debate_history.append({"role": "assistant", "name": speaker_id, "content": message})

    return debate_history

In [26]:
debate_4 = run_organic_conversation(2, 'gpt-4o', 3, DEBATE_TOPIC)

100%|██████████| 2/2 [00:28<00:00, 14.20s/it]


In [27]:
conversation_transcript_4 = save_conversation("conversation_transcript_4.txt", debate_4)

### Analysis
Let's see if that had any impact on the debate.

In [28]:
analyze_conversation(conversation_transcript_4)

- **Brief Summary:** The discussion revolves around the prioritization of code, testing, and development infrastructure versus comprehensive documentation in software projects. Speaker 1 advocates for prioritizing code and testing to ensure product reliability and rapid market delivery. Speaker 2 argues for the importance of comprehensive documentation to aid in knowledge transfer, maintainability, and regulatory compliance. Speaker 3 suggests a balanced approach, tailored to project requirements, integrating documentation into development practices through tools and automation.

- **Final Outputs/Artifacts/Takeaways:** The conversation concludes with a consensus on the necessity of balance and context-specific approaches. Key artifacts to consider include the integration of automated documentation tools, ensuring documentation evolves with the codebase, and prioritizing based on project needs and industry regulations. Additionally, a continuous review process is recommended to align w

I don't think this changed much about the debate. The roles are very similar to before, and the positions are very comparable. I think we may need some fundamental changes to the structure of the conversation in order to bring about more natural responses.

## Closing Remarks
Compared to the other approach that we explored in the [previous blog](https://ezou626.github.io/comm4190_F25_Using_LLMs_Blog/posts/001_an_llm_conversation/an_llm_conversation.html), sharing a chat window seems to let us manage the flow of a debate more easily, with results similar to before once we account for the changes in the system prompt. To truly identify the differences between these two approaches, I think we need to construct more complicated social scenarios and develop a standardized way to evaluate where they actually diverge. More generally, whether representing different participants as "user" versus "assistant" messages changes the results of a multi-LLM conversation is a possible research path that we can look into for the blog.
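One way to set up that "user" versus "assistant" comparison is sketched below, under the assumption that each participant gets its own view of the shared history: its own turns stay "assistant" messages and everyone else's turns are relabeled as "user" messages. The function `view_for` and the placeholder history are hypothetical and were not run in any of the experiments above.

```python
def view_for(speaker_id: str, shared_history: list[dict]) -> list[dict]:
    """Return a copy of the history as seen by one speaker.

    The speaker's own turns stay "assistant"; other speakers' turns become
    "user" messages that keep their name. System and mediator messages are
    copied unchanged.
    """
    view = []
    for record in shared_history:
        if record["role"] == "assistant" and record.get("name") != speaker_id:
            view.append({**record, "role": "user"})
        else:
            view.append(dict(record))
    return view

# Placeholder history, for illustration only.
shared_history = [
    {"role": "system", "content": "You are taking part in a debate."},
    {"role": "user", "content": "The debate topic is: placeholder topic"},
    {"role": "assistant", "name": "speaker_1", "content": "Placeholder opening."},
    {"role": "assistant", "name": "speaker_2", "content": "Placeholder rebuttal."},
]

speaker_2_view = view_for("speaker_2", shared_history)
print([message["role"] for message in speaker_2_view])
# ['system', 'user', 'user', 'assistant']
```

Passing a per-speaker view as `messages`, instead of the shared history, would let the two representations be compared on the same transcript.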

However, with this new system prompt, it seems we come out with less concrete conclusions than before. That could be a detriment to the usefulness of this system prompt as a tool, but it could also bring us closer to real conversations. In reality, not all conversations end in a way that can be nicely wrapped up and applied to a real world problem. However, we do see many useful things come out of conversations in practice in our lives, with both AI and others. Seamlessly integrating AI in a way that adds to conversations seems like it requires a more "realistic" formulation.

> Note: Just wanted to acknowledge that we could probably specify in the system prompt that models should introduce more evidence and be clearer in their rebuttal structure, which could help make the conversation more helpful for listeners.

In terms of immediate ways we can build on this work, I'm interested in simulating a less rigid environment. In the real world, chats are not strictly turn-based (in that they follow a set order, and that each speaker always speaks). If we can introduce spontaneity (with randomness, for example), I wonder if we could increase the breadth of chats that we see.
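As a starting point, here is a minimal sketch of one way to add that randomness: pick the next speaker at random, with the single rule that nobody speaks twice in a row. The function `random_turns` and the no-repeat rule are assumptions made for illustration, not something tested in this post.

```python
import random
from typing import Optional

def random_turns(
    total_turns: int,
    participant_count: int,
    seed: Optional[int] = None
) -> list[str]:
    """Pick a random speaker for each turn, never the same one twice in a row."""
    if participant_count < 2:
        raise ValueError("need at least 2 participants to avoid repeats")

    rng = random.Random(seed)  # a seed makes a run reproducible
    speakers = [f"speaker_{index}" for index in range(1, participant_count + 1)]

    turns: list[str] = []
    for _ in range(total_turns):
        # everyone except the previous speaker is eligible
        candidates = [s for s in speakers if not turns or s != turns[-1]]
        turns.append(rng.choice(candidates))
    return turns

print(random_turns(6, 3, seed=0))
```

Unlike the round robin loop, a speaker can go several turns without being picked, so some participants may end up contributing more than others.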

> **Future Work**:
> - Introduce spontaneous elements to the conversation (immediate focus)
> - Evaluate using "user" versus "assistant" to represent various participants in the conversation
> - System prompt changes (mentioned before)