In [21]:
from openai import OpenAI
import os
import base64
import requests
from tqdm import tqdm
from IPython.display import FileLink, display
from dotenv import load_dotenv
from random import shuffle, randint, choice, random
# Load API key
_ = load_dotenv("../../../comm4190_F25/01_Introduction_and_setup/.env")
client = OpenAI()

## LLM Conversations
We're going to build on our progress in the [last post](https://ezou626.github.io/comm4190_F25_Using_LLMs_Blog/posts/002_another_llm_conversation/another_llm_conversation.html) to introduce more spontaneity and randomness into our chats. Let's first try randomizing the order in which the models speak. From the second round onward, the order is reshuffled, which should make the conversations less predictable.

We're also going to change the prompt a little. I'd like to understand how we can inspire the LLM agents to collaborate and engage in a less structured way than a debate setting.
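To make the reordering concrete, here is a minimal sketch of one way to reshuffle speakers between rounds without calling any model. The helper name `next_ordering` and the explicit `rng` argument are assumptions for illustration. The idea is that the previous round's last speaker is never picked to go first, so no one speaks twice in a row across a round boundary.

```python
import random

def next_ordering(ordering: list[int], rng: random.Random) -> list[int]:
    """Return a reshuffled speaking order for the next round.

    The previous round's last speaker is excluded from going first,
    so nobody speaks twice in a row across rounds.
    Assumes at least two speakers.
    """
    first = rng.choice(ordering[:-1])  # anyone except the last speaker
    remaining = [s for s in ordering if s != first]
    rng.shuffle(remaining)
    return [first] + remaining

rng = random.Random(0)  # seeded only so the sketch is repeatable
ordering = [1, 2, 3]
for round_number in range(3):
    print(round_number, ordering)
    ordering = next_ordering(ordering, rng)
```

Passing the generator in explicitly keeps the shuffling logic testable on its own, separate from any API calls.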

In [13]:
DEBATE_TOPIC = """Code, testing, and dev infra should be prioritized over comprehensive documentation."""
SYSTEM_PROMPT = "You are participating in a conversation between experienced software engineers. Each speaker should respond when directed. Keep questions minimal and only use them when necessary."

# prompt to analyze conversations
EVALUATION_PROMPT = """
Your objective is to analyze this conversation between speakers.
Your response should follow this organization:
- A Brief Summary
- Final Outputs/Artifacts/Takeaways
- Characteristics/Dynamic (Competitive/Collaborative/etc.)
"""

def analyze_conversation(conversation: str):
    input_chat = [
        {
            "role": "system",
            "content": EVALUATION_PROMPT
        },
        {
            "role": "user",
            "content": "Here is the transcript\n" + conversation
        }
    ]
    response = client.chat.completions.create(
        model = "gpt-4o",
        messages = input_chat,
        store = False
    )
    print(response.choices[0].message.content)

In [17]:
def run_organic_conversation_v1(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    topic: str,
    system_prompt: str,
) -> list[dict]:
    # model 1 is the first speaker
    debate_history = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "The topic is: " + topic}
    ]

    ordering = list(range(1, participant_count + 1))

    for i in tqdm(range(iterations)):
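        if i > 0:
            # Reshuffle for later rounds. The previous round's last speaker
            # is excluded from going first, so nobody speaks twice in a row.
            first = choice(ordering[:-1])
            remaining = [speaker for speaker in ordering if speaker != first]
            shuffle(remaining)
            ordering = [first] + remaining
        for model in ordering: # order is randomized after the first round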
            speaker_id = f"speaker_{model}"
            response = client.chat.completions.create(
                model = openai_model_id,
                messages = debate_history + [{"role": "user", "content": f"{speaker_id}, it's your turn to speak."}],
                store = False
            )
            message = response.choices[0].message.content
            debate_history.append({"role": "assistant", "name": speaker_id, "content": message})

    return debate_history

# code to save the conversation
def save_conversation(
    filename: str,
    debate_history: list[dict]
) -> str:

    messages = []

    for record in debate_history:

        if record["role"] == "user":
            messages.append("mediator:\n" + record["content"])
        
        if record["role"] == "assistant":
            # Single quotes inside the f-string keep this valid before Python 3.12
            messages.append(f"{record['name']}:\n{record['content']}")
    
    conversation_transcript = "\n\n".join(messages)
    
    with open(filename, "w", encoding="utf-8") as f:
        f.write(conversation_transcript)
    
    display(FileLink(filename))

    return conversation_transcript

### Experiment 1
Let's keep 3 speakers throughout. It will probably be more interesting this way, since a conversation between two speakers could just have the messages be joined together (though this might be interesting to evaluate, so we can add it as future work).

In [18]:
debate_1 = run_organic_conversation_v1(2, 'gpt-4o', 3, DEBATE_TOPIC, SYSTEM_PROMPT)

100%|██████████| 2/2 [00:28<00:00, 14.19s/it]


In [19]:
conversation_transcript_1 = save_conversation("conversation_transcript_1.txt", debate_1)

### Analysis
Like last time, we're going to analyze the conversation with AI as well. 

In [20]:
analyze_conversation(conversation_transcript_1)

- **Brief Summary**: The conversation revolves around the prioritization between code, testing, and development infrastructure versus comprehensive documentation. Speaker 1 advocates for prioritizing the former due to its direct impact on software reliability and efficiency, suggesting documentation can follow as a secondary priority. Speaker 2 underscores the importance of comprehensive documentation, especially for onboarding and scalability. Speaker 3 proposes a balanced integration of documentation into the development process through automation and self-documenting techniques, aligning with regulatory requirements when necessary.

- **Final Outputs/Artifacts/Takeaways**:
  - There is an acknowledgment that both development infrastructure and documentation hold significant value, each crucial for different reasons such as reliability, maintainability, onboarding, and compliance.
  - The consensus is towards integrating documentation seamlessly into the development process using aut

We're shifting back to the conversational tone that we had in the [first post](https://ezou626.github.io/comm4190_F25_Using_LLMs_Blog/posts/002_another_llm_conversation/another_llm_conversation.html), probably mainly due to the changes in the prompt. We can still see some disagreement in the beginning, but as the conversation continues, the speakers' positions converge while each speaker still maintains an individual opinion. However, each speaker always speaks exactly once per round, which is not necessarily true of real-world conversations.

### Experiment 2
Let's see if we can extend the conversation, but this time, let's leverage randomness to simulate the fact that people don't always speak in every round of a conversation. On each turn, a speaker is skipped with probability `dropout_chance`. We'll keep a fixed order for now, and we'll track the last speaker so that the same model never speaks twice in a row.
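The dropout rule can be tried out offline before spending any API calls. Here is a minimal sketch, assuming a hypothetical helper `simulate_speakers` that mirrors the two skip conditions (random dropout and no back-to-back turns) and records only who would speak.

```python
import random

def simulate_speakers(
    iterations: int,
    participant_count: int,
    dropout_chance: float,
    rng: random.Random,
) -> list[int]:
    """Return the sequence of speakers that would actually take a turn."""
    ordering = list(range(1, participant_count + 1))  # fixed order
    last_speaker = -1
    spoken = []
    for _ in range(iterations):
        for model in ordering:
            if rng.random() < dropout_chance:
                continue  # this speaker sits the turn out
            if model == last_speaker:
                continue  # never let the same speaker go twice in a row
            spoken.append(model)
            last_speaker = model
    return spoken

print(simulate_speakers(8, 3, 0.5, random.Random(0)))
```

With a dropout chance of 0.5, roughly half of the 24 possible turns are skipped, and the no-repeat rule can remove a few more.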

In [34]:
def run_organic_conversation_v2(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    topic: str,
    system_prompt: str,
    dropout_chance: float
) -> list[dict]:
    # model 1 is the first speaker
    debate_history = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "The topic is: " + topic}
    ]

    ordering = list(range(1, participant_count + 1))
    last_speaker = -1

    for i in tqdm(range(iterations)):
        # if i > 0:
        #     first = choice(ordering[:-1])
        #     remaining = [i for i in ordering if i != first]
        #     shuffle(remaining)
        #     ordering = [first] + remaining
        for model in ordering: # fixed order here; randomness comes from dropout only
            if random() < dropout_chance:
                continue # SKIP
            if last_speaker == model:
                continue
            speaker_id = f"speaker_{model}"
            response = client.chat.completions.create(
                model = openai_model_id,
                messages = debate_history + [{"role": "user", "content": f"{speaker_id}, it's your turn to speak."}],
                store = False
            )
            message = response.choices[0].message.content
            debate_history.append({"role": "assistant", "name": speaker_id, "content": message})
            last_speaker = model

    return debate_history

debate_2 = run_organic_conversation_v2(8, 'gpt-4o', 3, DEBATE_TOPIC, SYSTEM_PROMPT, 0.5)

100%|██████████| 8/8 [00:56<00:00,  7.10s/it]


In [35]:
conversation_transcript_2 = save_conversation("conversation_transcript_2.txt",debate_2)

### Analysis
First, let's generate an AI summary of the conversation.

In [36]:
analyze_conversation(conversation_transcript_2)

- **Brief Summary:**  
The conversation revolves around the prioritization of code, testing, and development infrastructure over comprehensive documentation. Speaker 1 argues for the importance of code and infrastructure as they ensure the quality and reliability of the software, especially in agile environments. Speaker 2 emphasizes the long-term benefits of documentation, particularly for knowledge retention and project sustainability. Speaker 3 brings a balanced perspective, advocating for adaptable strategies based on the project's lifecycle and promoting a development culture that values all aspects equally.

- **Final Outputs/Artifacts/Takeaways:**  
  - The need for balance between immediate development needs and sustainable documentation practices.
  - Integration of documentation into the development process using automation tools to keep it current.
  - Building a comprehensive development culture where all elements are seen as integral.
  - Leadership roles in guiding priori

At a glance, it certainly seems more conversational. However, I feel that the LLMs are producing too many redundant summaries. Perhaps we can adjust the prompt so that each response builds on the other speakers' statements.

### Experiment 3
We're going to combine the two prior approaches: the random speaking order from Experiment 1 and the random dropout from Experiment 2.

In [37]:
def run_organic_conversation_v3(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    topic: str,
    system_prompt: str,
    dropout_chance: float
) -> list[dict]:
    # model 1 is the first speaker
    debate_history = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "The topic is: " + topic}
    ]

    ordering = list(range(1, participant_count + 1))
    last_speaker = -1

    for i in tqdm(range(iterations)):
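        if i > 0:
            # Reshuffle for later rounds. The previous round's last entry
            # in the ordering is excluded from going first.
            first = choice(ordering[:-1])
            remaining = [speaker for speaker in ordering if speaker != first]
            shuffle(remaining)
            ordering = [first] + remaining
        for model in ordering: # order is randomized after the first round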
            if random() < dropout_chance:
                continue # SKIP
            if last_speaker == model:
                continue
            speaker_id = f"speaker_{model}"
            response = client.chat.completions.create(
                model = openai_model_id,
                messages = debate_history + [{"role": "user", "content": f"{speaker_id}, it's your turn to speak."}],
                store = False
            )
            message = response.choices[0].message.content
            debate_history.append({"role": "assistant", "name": speaker_id, "content": message})
            last_speaker = model

    return debate_history

debate_3 = run_organic_conversation_v3(8, 'gpt-4o', 3, DEBATE_TOPIC, SYSTEM_PROMPT, 0.5)

100%|██████████| 8/8 [00:43<00:00,  5.39s/it]


In [38]:
conversation_transcript_3 = save_conversation("conversation_transcript_3.txt", debate_3)

### Analysis
Now, let's get a quick summary of what happened.

In [39]:
analyze_conversation(conversation_transcript_3)

- **Brief Summary**: The discussion revolves around the balance between code, testing, development infrastructure, and comprehensive documentation. Speaker 1 supports prioritizing code and infrastructure, especially in agile settings, but acknowledges the value of documentation. Speaker 2 emphasizes the importance of integrating key documentation practices alongside development to prevent knowledge bottlenecks. Speaker 3 advocates for a balanced approach where documentation evolves with development, employing tools and practices that ensure both areas are maintained effectively.

- **Final Outputs/Artifacts/Takeaways**:
  - The idea of iterative and lightweight documentation, evolving with the product, to enhance team onboarding and collaboration.
  - Integration of documentation into the definition of "done" for each task and feature.
  - Leveraging automated tools to maintain documentation accuracy and reduce manual burdens.
  - Cultivating a culture where both code quality and docum

Surprisingly, I think this run was much more conversational than the last. I don't know if I can call this reproducible, though, since both the speaking order and the model outputs are random.
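One piece of this can be made repeatable: the speaking schedule. Here is a minimal sketch, assuming we swap the module-level `random` functions for a dedicated `random.Random(seed)` instance. The function name `speaking_schedule` and the `seed` argument are hypothetical. This only fixes who speaks when; the model's replies can still differ from run to run.

```python
import random

def speaking_schedule(
    iterations: int,
    participant_count: int,
    dropout_chance: float,
    seed: int,
) -> list[int]:
    """Return the order of speakers for a seeded run (no API calls)."""
    rng = random.Random(seed)  # private generator, independent of global state
    ordering = list(range(1, participant_count + 1))
    last_speaker = -1
    schedule = []
    for i in range(iterations):
        if i > 0:
            # Reshuffle, excluding the last entry of the previous ordering
            first = rng.choice(ordering[:-1])
            remaining = [s for s in ordering if s != first]
            rng.shuffle(remaining)
            ordering = [first] + remaining
        for model in ordering:
            if rng.random() < dropout_chance:
                continue  # dropped out this turn
            if model == last_speaker:
                continue  # no back-to-back turns
            schedule.append(model)
            last_speaker = model
    return schedule

run_a = speaking_schedule(8, 3, 0.5, seed=4190)
run_b = speaking_schedule(8, 3, 0.5, seed=4190)
print(run_a == run_b)  # same seed, same schedule
```

Holding the schedule fixed would at least separate the effect of the prompt from the effect of who happened to speak.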

### Experiment 4
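The final thing we can do is make the prompts read more like invitations to share thoughts. Instead of saying "it's your turn to speak", we can ask each speaker to share their perspective and engage with the other participants' responses. We can also revise the system prompt so it reads less like a moderated simulation: we drop the instruction to respond when directed, ask participants to greet each other, and fold the topic into the system prompt.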

In [55]:
NEW_SYSTEM_PROMPT = "You are a participant in a conversation between experienced software engineers. Keep questions minimal and only use them when necessary. Please greet the other participants when you join."

def run_organic_conversation_v4(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    topic: str,
    system_prompt: str,
    dropout_chance: float
) -> list[dict]:
    # model 1 is the first speaker
    debate_history = [
        {"role": "system", "content": system_prompt + " The topic is: " + topic}
    ]

    ordering = list(range(1, participant_count + 1))
    last_speaker = -1

    for i in tqdm(range(iterations)):
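        if i > 0:
            # Reshuffle for later rounds. The previous round's last entry
            # in the ordering is excluded from going first.
            first = choice(ordering[:-1])
            remaining = [speaker for speaker in ordering if speaker != first]
            shuffle(remaining)
            ordering = [first] + remaining
        for model in ordering: # order is randomized after the first round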
            if random() < dropout_chance:
                continue # SKIP
            if last_speaker == model:
                continue
            speaker_id = f"speaker_{model}"
            response = client.chat.completions.create(
                model = openai_model_id,
                messages = debate_history + [{"role": "user", "content": f"{speaker_id}, please share your perspective with the others and engage with the responses of the other participants."}],
                store = False
            )
            message = response.choices[0].message.content
            debate_history.append({"role": "assistant", "name": speaker_id, "content": message})
            last_speaker = model

    return debate_history

debate_4 = run_organic_conversation_v4(8, 'gpt-4o', 3, DEBATE_TOPIC, NEW_SYSTEM_PROMPT, 0.5)

100%|██████████| 8/8 [02:05<00:00, 15.69s/it]


In [56]:
conversation_transcript_4 = save_conversation("conversation_transcript_4.txt", debate_4)

### Analysis
Let's see how that changed the debate.

In [57]:
analyze_conversation(conversation_transcript_4)

- **Brief Summary**: The conversation revolves around the balance between prioritizing code quality, testing, and development infrastructure over comprehensive documentation in software development. Speaker_1 initiates the discussion by emphasizing the importance of clean code and robust testing. Speaker_3 agrees while acknowledging the potential challenges of inadequate documentation in complex systems. Speaker_2 advocates for a balance, stressing the need for essential documentation and lightweight approaches like API documentation and high-level overviews. The conversation further explores innovative documentation methods like video tutorials and AI-assisted documentation.

- **Final Outputs/Artifacts/Takeaways**: The primary takeaways include the consensus on the importance of balancing documentation with code quality and infrastructure. The participants also share practical strategies such as embedding documentation updates into the code review process, using version control for d

The exchange feels much more conversational this time. With explicit instructions to address the other speakers, there's a lot more structure in the engagement between the three speakers. Additionally, near the end of the conversation, the speakers seem to converge on some actionable items, despite there being no explicit instruction to find common ground.

## Closing Remarks
With the new shared chat window approach, some prompt tuning, and some randomization, we approach a more natural conversational feel in the discussion between the 3 LLMs. It also feels less structured, which could potentially allow for integration with human speakers and produce results similar to meetings or panels. It would be interesting to see which kinds of topics LLMs excel at, or whether they can do well without a predefined topic at all.

It's also important to consider that this is probably not the only way to do this. For example, we could add more prompts before each speaker LLM answers.
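As a rough sketch of that idea, the helper below builds the message list for one turn with extra prompts placed before the final request to speak. The function name `build_turn_messages` and the example prompt wording are hypothetical. The resulting list could be passed as `messages` in the same way as in the experiments above.

```python
def build_turn_messages(
    history: list[dict],
    speaker_id: str,
    pre_prompts: list[str],
) -> list[dict]:
    """Return history plus extra prompts and the final turn request.

    The shared history is not modified, so the extra prompts apply
    to this one turn only.
    """
    extra = [{"role": "user", "content": p} for p in pre_prompts]
    turn = {
        "role": "user",
        "content": f"{speaker_id}, please share your perspective with the others.",
    }
    return history + extra + [turn]

history = [{"role": "system", "content": "You are a participant in a conversation."}]
messages = build_turn_messages(
    history,
    "speaker_2",
    ["Before answering, pick one earlier point you disagree with."],
)
for m in messages:
    print(m["role"], "|", m["content"])
```

Because the extra prompts are never appended to the shared history, the other speakers would not see them on later turns.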

> **Future Work**:
> - Experiment with more topics
> - Ignore predefined topic