I'm interested in creating a stupidly simple chat environment and letting some models talk to each other. I think it would be cool to find some measurements of the social characteristics of these LLMs. I'm going to start by evaluating (incredibly subjectively) the ways in which we can let AIs interact with other AIs. For these experiments, I'm only going to be using OpenAI's models.

To heavily butcher philosophy, Hegel argued that the conflict between a thesis and its antithesis can synthesize a better understanding of the world. In this post, I'm wondering if LLMs can have a discussion about a complex topic that teaches an outside observer something they didn't know before. I often find myself working together with AI on system design problems, programming, and writing, and it feels more like a collaboration than a one-sided request-and-response format. I'm curious whether we could take that a step further, looking mainly for information synthesis and the generation of novel ideas.

What I want to eventually get to here is basically a much less productionized version of [Microsoft's open-source framework Autogen](https://github.com/microsoft/autogen), only considering textual conversation between two models.

Ultimately, my long-term goal is to explore how we can develop and evaluate conversational paradigms for LLMs.

In [1]:
# let's get this out of the way
from openai import OpenAI
import os
import base64
import requests
from tqdm import tqdm
from IPython.display import FileLink, display
from dotenv import load_dotenv
# Load API key
_ = load_dotenv("../../../comm4190_F25/01_Introduction_and_setup/.env")
client = OpenAI()

## A Starter: Two-LLM Convos
For this blog, let's see if this can even work. Basically, the assistant messages in one LLM's chat history will be the user messages in the other's, and vice versa. For now, we will begin the conversation by inserting a stimulus prompt as an assistant message in the first model's history (so that model "owns" the opening claim) and as a user message in the other's.

In this scenario, we'll be trying to answer the age-old software engineering question of the value of documentation.

We're only going to do two iterations, so after the proposal, we will have two cycles of getting one response from each model. We'll also be using the same model for each participant. 
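
To make the role mirroring and the message count concrete, here is a minimal offline sketch. It assumes nothing beyond plain Python: `fake_reply` is a hypothetical stand-in for a model call and `dry_run` is an illustrative name, so neither is part of the notebook's actual code. With two iterations, each history ends up with one starter message plus four replies.

```python
def fake_reply(history: list[dict], speaker: int) -> str:
    """Hypothetical stand-in for an API call: returns a canned reply."""
    return f"reply from speaker {speaker} after {len(history)} messages"


def dry_run(iterations: int, starter: str) -> tuple[list[dict], list[dict]]:
    # Each participant sees its own messages as "assistant"
    # and the other participant's messages as "user".
    history1 = [{"role": "assistant", "content": starter}]
    history2 = [{"role": "user", "content": starter}]
    for _ in range(iterations):
        # Speaker 2 answers first, since speaker 1 "said" the starter.
        msg2 = fake_reply(history2, speaker=2)
        history2.append({"role": "assistant", "content": msg2})
        history1.append({"role": "user", "content": msg2})

        msg1 = fake_reply(history1, speaker=1)
        history1.append({"role": "assistant", "content": msg1})
        history2.append({"role": "user", "content": msg1})
    return history1, history2


history1, history2 = dry_run(iterations=2, starter="Opening claim")
for a, b in zip(history1, history2):
    # Same content on both sides, opposite roles.
    print(a["role"], "|", b["role"], "|", a["content"])
```

The two histories always hold the same text in the same order; only the role labels are swapped.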

It might be interesting to try a few things:

> **Future Work:**
> - Run the models for more iterations.
> - Try to induce more productive iterations.
> - Try different models as participants.
> - Add more (>2) models to the conversation.
> - Standard prompt engineering techniques (e.g. "You are an expert in...")
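
The "more than two models" item raises a bookkeeping question: with two participants, every message from the other side can simply be a `user` message, but with three or more the listener can no longer tell who said what. One possible approach, sketched here, is to label each incoming message with its speaker. The `broadcast` helper and the `[Speaker N]` prefix are assumptions made for illustration; nothing in this post has tested them.

```python
def broadcast(histories: list[list[dict]], speaker: int, message: str) -> None:
    """Append one message to every participant's history.

    The speaker records it as its own "assistant" turn; everyone else
    records a "user" turn prefixed with a speaker label (assumed convention).
    """
    for idx, history in enumerate(histories):
        if idx == speaker:
            history.append({"role": "assistant", "content": message})
        else:
            labeled = f"[Speaker {speaker + 1}] {message}"
            history.append({"role": "user", "content": labeled})


# Three empty histories; in practice each would start with its own system prompt.
histories = [[], [], []]
broadcast(histories, speaker=0, message="Opening claim")
broadcast(histories, speaker=1, message="First rebuttal")

for idx, history in enumerate(histories):
    print(f"Participant {idx + 1}:", [(m["role"], m["content"]) for m in history])
```

Whether labeled user turns actually lead to coherent multi-party discussion is an open question for the future work listed above.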

In [2]:
STARTER_MESSAGE = """Code, testing, and dev infra should be prioritized over comprehensive documentation."""

# instruct the LLMs to avoid excessive questioning
SHARED_PROMPT = """
You're on an online discussion forum that encourages discussion. 
You should evaluate people's opinions, highlighting inconsistencies in others' statements with constructive feedback to arrive at a common ground.
View this discussion as open-ended with the potential for many back-and-forth interactions.
Feel free to change your opinion as the conversation progresses, but also defend your position to the best of your ability and intricacies that the opposing side may not have considered.
You have evidence supporting your position, so please use it to reinforce your arguments.
Avoid closing all of your responses with questions.
Try to keep responses on the briefer side, since this is essentially a chat.
"""

PROPOSER_PROMPT = """
You are the proposer of an argument in an online discussion forum. Your role is to strongly defend your initial position while still debating in good faith.
Keep the following points in mind during the discussion:
- Evaluate others’ opinions carefully, highlight inconsistencies or hidden assumptions, and provide constructive feedback.
- Always bring in evidence, examples, or reasoning to reinforce your stance.
- Acknowledge valid counterpoints, but reframe them to show limitations or to strengthen your original position.
- Do not quickly concede; instead, stress-test opposing arguments and push the discussion toward deeper analysis.
- You may refine your position over time if absolutely necessary, but your priority is to robustly defend your case and show why it stands under scrutiny.
- Keep the tone respectful, thoughtful, and rigorous. Your goal is not just to find consensus, but to demonstrate the resilience of your position in the face of challenge.
- Try to keep responses on the briefer side, since this is essentially a chat.
- Avoid closing all of your responses with questions.
"""

# prompt to analyze conversations
EVALUATION_PROMPT = """
Your objective is to determine the dynamic of this conversation, evaluating the ultimate result of the debate and which perspective seemed to win out. 
Also note how ideas were developed and improved through the process of the debate.
Your response should follow this organization:
- Initial Positions
- A Quick Summary on Evolution of Ideas
- Final Outputs/Artifacts/Takeaways
- Whether a Clear Winner Exists
- The Dynamic of the Debate (Competitive/Collaborative/etc.)
"""

# run two cycles
ITERATIONS = 2

In [3]:
# code to simulate the conversation
def run_conversation(
    iterations: int, 
    model1: str, 
    model2: str, 
    model1_history: list[dict], 
    model2_history: list[dict],
    starter_message: str
) -> list[dict]:
    # model 1 is the first speaker
    conversation_record = [{"model": 1, "message": starter_message}]
    # later, when we want to modify the proposer's starting chat, we should do it before passing it here
    model1_history.append({"role": "assistant", "content": starter_message})
    model2_history.append({"role": "user", "content": starter_message})

    for _ in tqdm(range(iterations)):
        ## first, we get the response of model 2
        model2_response = client.chat.completions.create(
            model = model2,
            messages = model2_history,
            store = False
        )
        model2_message = model2_response.choices[0].message.content
        
        model1_history.append({"role": "user", "content": model2_message})
        model2_history.append({"role": "assistant", "content": model2_message})
        conversation_record.append({"model": 2, "message": model2_message})
    
        ## now we get the response of model 1
        model1_response = client.chat.completions.create(
            model = model1,
            messages = model1_history,
            store = False
        )
        model1_message = model1_response.choices[0].message.content
        
        model1_history.append({"role": "assistant", "content": model1_message})
        model2_history.append({"role": "user", "content": model1_message})
        conversation_record.append({"model": 1, "message": model1_message})

    return conversation_record

# code to save the conversation
def save_conversation(
    filename: str,
    conversation_record: list[dict]
) -> str:

    # Build the transcript string
    conversation_transcript = "\n\n".join([
        f"Speaker {message['model']}\n{message['message']}\n"
        for message in conversation_record
    ])
    
    # Save to a text file
    with open(filename, "w", encoding="utf-8") as f:
        f.write(conversation_transcript)
    
    # Create a download link
    display(FileLink(filename))

    return conversation_transcript

def analyze_conversation(conversation: str):
    input_chat = [
        {
            "role": "developer",
            "content": EVALUATION_PROMPT
        },
        {
            "role": "user",
            "content": "Here is the transcript\n" + conversation
        }
    ]
    response = client.chat.completions.create(
        model = "gpt-4o",
        messages = input_chat,
        store = False
    )
    print(response.choices[0].message.content)
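
The text transcript written by `save_conversation` is convenient for reading and for pasting into the evaluation prompt, but it is hard to split back into individual messages because replies can themselves contain blank lines. If the record needs to be reloaded later, one option is to also store it as JSON. This is a sketch, not something the notebook does; `save_record_json` and `load_record_json` are hypothetical helper names, and the sample record is made up.

```python
import json
import tempfile
from pathlib import Path


def save_record_json(path: Path, conversation_record: list[dict]) -> None:
    """Write the list of {"model": ..., "message": ...} entries as JSON."""
    text = json.dumps(conversation_record, indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")


def load_record_json(path: Path) -> list[dict]:
    """Read a record previously written by save_record_json."""
    return json.loads(path.read_text(encoding="utf-8"))


# Made-up record in the same shape that run_conversation returns.
sample_record = [
    {"model": 1, "message": "Opening claim"},
    {"model": 2, "message": "A reply\n\nwith a blank line inside"},
]

# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "conversation_record.json"
    save_record_json(path, sample_record)
    restored = load_record_json(path)

print(restored == sample_record)
```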

### Experiment 1
Here's our first experiment configuration:
- Identical models (gpt-4o)
- Identical system prompt (we will change this later)
- Parallel chat: user role messages for one chat are assistant messages in the other, and vice-versa

In [4]:
## initialize our conversation
model1_history_1 = [
    {"role": "developer","content": SHARED_PROMPT},
]

model2_history_1 = [
    {"role": "developer", "content": SHARED_PROMPT},
]

In [5]:
## start the conversation
conversation_record_1 = run_conversation(
    ITERATIONS, 
    "gpt-4o",
    "gpt-4o",
    model1_history_1,
    model2_history_1,
    STARTER_MESSAGE
)

100%|██████████| 2/2 [00:10<00:00,  5.45s/it]


In [6]:
conversation_transcript_1 = save_conversation("conversation_transcript_1.txt", conversation_record_1)

### Analysis
I'll let GPT-4o kick off the analysis here.

In [7]:
analyze_conversation(conversation_transcript_1)

- **Initial Positions:**
  - Speaker 1 argues that the priority should be on code, testing, and development infrastructure over comprehensive documentation.
  - Speaker 2 emphasizes the importance of documentation, arguing that it is crucial for onboarding, maintenance, and collaboration.

- **A Quick Summary on Evolution of Ideas:**
  - Speaker 2 introduces the idea that documentation can enhance coding and testing by speeding up troubleshooting and maintenance, leading to the suggestion of a balanced approach.
  - Speaker 1 acknowledges the long-term benefits of documentation and proposes a balanced approach to keep code and documentation evolving together.
  - They both discuss an iterative approach to documentation, suggesting it be included in the "definition of done" to ensure it is not neglected.

- **Final Outputs/Artifacts/Takeaways:**
  - The conversation led to a consensus on the integration of documentation into the standard development workflow as a part of the "definition

I think that's a pretty reasonable characterization of this debate. We were able to see some new, actionable ideas come out of the discussion, which is what we're aiming for at the moment. I chose to discourage the models from closing every response with a question, so that the exchange reads more like an online discussion. The effect of this choice is not known, so we might want to put a pin in it for future research.

> **Future Work:** Further investigate the role of questions in eliciting better discussion.
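
As a first step toward measuring the effect of the "avoid closing with questions" instruction, one could count how often each speaker actually ends a message with a question mark. The sketch below assumes the record format returned by `run_conversation` (a list of dicts with `model` and `message` keys); `question_ending_rate` is a hypothetical helper and the sample record is made up, not taken from the real transcripts.

```python
def question_ending_rate(conversation_record: list[dict]) -> dict[int, float]:
    """Fraction of each speaker's messages whose last non-space character is '?'."""
    totals: dict[int, int] = {}
    endings: dict[int, int] = {}
    for entry in conversation_record:
        speaker = entry["model"]
        totals[speaker] = totals.get(speaker, 0) + 1
        if entry["message"].rstrip().endswith("?"):
            endings[speaker] = endings.get(speaker, 0) + 1
    return {speaker: endings.get(speaker, 0) / totals[speaker] for speaker in totals}


# Made-up record for illustration only.
sample_record = [
    {"model": 1, "message": "Docs come second."},
    {"model": 2, "message": "What about onboarding?"},
    {"model": 1, "message": "Tests document behavior."},
    {"model": 2, "message": "Fair, but design rationale still needs prose."},
]
rates = question_ending_rate(sample_record)
print(rates)
```

Comparing this rate between runs with and without the instruction would give a crude, but at least repeatable, signal.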

### Experiment 2
In the real world, people who post on discussion forums may feel more strongly about their argument. To mimic this, we can give the LLM that proposes the argument a different system prompt that makes it defend its position. This could result in a richer discussion.

Here's our next experiment configuration:
- Different system prompts for proposer and reviewer
- Parallel chat: user role messages for one chat are assistant messages in the other, and vice-versa

In [8]:
## initialize our conversation
model1_history_2 = [
    {"role": "developer", "content": PROPOSER_PROMPT}
]
model2_history_2 = [
    {"role": "developer","content": SHARED_PROMPT}
]

In [9]:
conversation_record_2 = run_conversation(
    ITERATIONS, 
    "gpt-4o",
    "gpt-4o",
    model1_history_2,
    model2_history_2,
    STARTER_MESSAGE
)

100%|██████████| 2/2 [00:23<00:00, 11.70s/it]


In [10]:
conversation_transcript_2 = save_conversation("conversation_transcript_2.txt", conversation_record_2)

### Analysis
Once again, I’ll let GPT take the wheel to characterize this debate.

In [11]:
analyze_conversation(conversation_transcript_2)

- **Initial Positions**: 
  - Speaker 1 initially argued that code, testing, and development infrastructure should be prioritized over comprehensive documentation, emphasizing the foundational role of technical elements in delivering functional products.
  - Speaker 2 countered that while technical aspects are crucial, comprehensive documentation is equally important for knowledge sharing, facilitating understanding, and avoiding potential long-term issues.

- **A Quick Summary on Evolution of Ideas**:
  - Speaker 2 acknowledged the critical importance of reliable code, tests, and infrastructure in building a viable product, noting that these elements create the necessary foundation.
  - Speaker 1 conceded that strategic, targeted documentation could provide value, especially for high-level overviews and decision records that aren't apparent in the code itself.
  - Both speakers eventually agreed on the need for a balance and the importance of maintaining strategic documentation.

- **

From my subjective viewpoint, compared with the first debate, the proposer in the second one defended its position more, so Speaker 2 needed to bring in stronger examples to show how documentation can be integrated without significantly reducing speed. Defending one's viewpoint could potentially produce better-quality conversations by drawing out more evidence and possibly a more complete picture.

### Experiment 3
This time around, we're going to add a prompt in the proposer's chat asking for a stance to debate. Generally, in environments like ChatGPT, the user is the one who initiates the conversation. The model may well be tuned by OpenAI to respond better in that kind of setup, so putting the proposer in it could result in higher-quality arguments. Note that the opening claim is still the fixed `STARTER_MESSAGE`, inserted as the proposer's reply to that prompt.

Here's our next experiment configuration:
- Same system prompts for proposer and reviewer
- Parallel chat: user role messages for one chat are assistant messages in the other, and vice-versa
    - This time though, the proposer chat will have a prompt message from the user

In [12]:
model1_history_3 = [
    {"role": "developer","content": SHARED_PROMPT},
    {"role": "user", "content": "Take a stance on something related to software engineering."}
]
model2_history_3 = [
    {"role": "developer", "content": SHARED_PROMPT}
]

In [13]:
conversation_record_3 = run_conversation(
    ITERATIONS, 
    "gpt-4o",
    "gpt-4o",
    model1_history_3,
    model2_history_3,
    STARTER_MESSAGE
)

100%|██████████| 2/2 [00:10<00:00,  5.24s/it]


In [14]:
conversation_transcript_3 = save_conversation("conversation_transcript_3.txt", conversation_record_3)

### Analysis
Here's what my good friend GPT has to say about this conversation:

In [15]:
analyze_conversation(conversation_transcript_3)

- Initial Positions:
  - **Speaker 1**: Prioritization should be on code, testing, and development infrastructure over documentation.
  - **Speaker 2**: While code and infrastructure are crucial, comprehensive documentation is equally important for long-term project success and should be balanced with technical priorities.

- A Quick Summary on Evolution of Ideas:
  - Speaker 2 introduces a counterpoint by emphasizing the long-term benefits of documentation, advocating for a balanced approach that integrates documentation into the development process.
  - Speaker 1 acknowledges the importance of documentation and suggests integrating it with agile practices, using automation tools to minimize manual effort.
  - Speaker 2 agrees with the integration strategy and points out the limitations of automated documentation—suggesting a hybrid approach combining automation with manual documentation for broader project insights.
  - Speaker 1 endorses the hybrid model and discusses the benefits o

The interesting thing about this conversation is that it seems much more succinct and formal than the last, and a tad less conversational too. This might be due to the user prompt on the proposer model. I think it's worth exploring whether this style of communication can net better artifacts.
> **Future Work:**
> - Do different types of discussion produce measurably different artifacts?
> - How does prompt and chat history structure influence the conversationality of a discussion?
> - How dependent/sensitive is this on the model(s) selected?

### Experiment 4
Now, let's try this strategy with the more assertive prompt on the proposal side and see what we get.

In [17]:
model1_history_4 = [
    {"role": "developer","content": PROPOSER_PROMPT},
    {"role": "user", "content": "Take a stance on something related to software engineering."}
]
model2_history_4 = [
    {"role": "developer","content": SHARED_PROMPT}
]

In [18]:
conversation_record_4 = run_conversation(
    ITERATIONS, 
    "gpt-4o",
    "gpt-4o",
    model1_history_4,
    model2_history_4,
    STARTER_MESSAGE
)

100%|██████████| 2/2 [00:20<00:00, 10.35s/it]


In [19]:
conversation_transcript_4 = save_conversation("conversation_transcript_4.txt", conversation_record_4)

### Analysis
As always, we start with a short summary.

In [21]:
analyze_conversation(conversation_transcript_4)

- Initial Positions:
  - **Speaker 1:** Argued that code, testing, and development infrastructure should be prioritized over comprehensive documentation, focusing on immediate functionality and rapid iteration in software development.
  - **Speaker 2:** Countered by emphasizing the long-term benefits of documentation for knowledge transfer, onboarding, and maintaining project viability, advocating for a balanced approach.

- A Quick Summary on Evolution of Ideas:
  - Speaker 1 initially highlighted the importance of code quality and testing in fast-paced environments, suggesting that robust code reduces the need for extensive documentation.
  - Speaker 2 recognized the importance of coding practices but stressed that some aspects of documentation, like design rationales and architectural decisions, are irreplaceable and suggested a balanced integration of both practices.
  - Speaker 1 eventually conceded that certain elements of documentation are essential and proposed practical ways t

It seems like this was about as formal as the last one, but the proposer prompt may have made the proposer more willing to generate longer responses that directly address more of what the other side is saying.
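
The impression that one setup produces longer responses is easy to check with a rough word count per speaker. This is a sketch under the same assumption about the record format (dicts with `model` and `message` keys); `mean_words_per_speaker` is a hypothetical helper and the sample record is made up rather than taken from the real transcripts.

```python
from statistics import mean


def mean_words_per_speaker(conversation_record: list[dict]) -> dict[int, float]:
    """Average whitespace-separated word count of each speaker's messages."""
    counts: dict[int, list[int]] = {}
    for entry in conversation_record:
        n_words = len(entry["message"].split())
        counts.setdefault(entry["model"], []).append(n_words)
    return {speaker: mean(values) for speaker, values in counts.items()}


# Made-up record for illustration only.
sample_record = [
    {"model": 1, "message": "Code and tests first."},
    {"model": 2, "message": "Docs help new hires ramp up."},
    {"model": 1, "message": "Tests are living documentation for behavior."},
    {"model": 2, "message": "They rarely explain why a design was chosen."},
]
averages = mean_words_per_speaker(sample_record)
print(averages)
```

Since the first entry for speaker 1 is always the fixed starter message, it may be fairer to drop it (for example by passing `conversation_record[1:]`) before comparing experiments.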

## Closing Remarks
It seems that with this kind of conversation, the two models generally take a very collaborative approach to these debates and come up with some actionable principles on the specific topic they are working on. One note to end on: a compromise was reached every time. I'm wondering what this would look like for a more polarizing discussion where the middle ground isn't so clear yet, maybe something like an emerging news story or policy.

This kind of paradigm could also be interesting in applications like education, social media, and consulting for idea generation. Consider several of these conversations running in parallel, with access to search and MCP tooling, to ultimately translate these discussions into clear results like documents, meetings, etc. (quality control might be a nightmare though).

I also wonder about allowing LLMs to just speak to each other instead of necessitating the task of debating a position. If we were to let two LLMs talk to each other for some period of time, what would that conversation look like, and where would it go? Would there be room for us to join in and maybe learn a thing or two? 