In [None]:
---
title: Goldfish
description: "Attempting to enhance context awareness in multi-speaker LLM conversations" 
author: "Eric Zou"
date: "9/17/2025"
categories:
  - LLMs
  - Conversations
---

# Goldfish
I think one of the main limits we've been hitting with this model is the ability of LLMs to maintain an organic, consistent personality that is distinct from the others. While collaborative conversations can be good and productive, our previous conversations have had to trade off between reaching a well-defined conclusion and having a discussion in which each participant maintains and advocates for their position throughout.

What if there was a way we could build this into the speakers themselves? To do this, we're going to extend the standard user prompt that precedes each speaker's turn with a persona that the LLM keeps throughout the discussion. Our hope is that this lets the model generating the response emulate a less fluid human speaker while still mixing well with the other speakers.

We're not going to use the interruptions model we were testing in the last blog yet. I think we can figure out a smarter way to do that. We will keep the randomized speaker ordering and skipping though, as this might allow us to see improved resilience in models retaining their identity, even without speaking in turn.
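
To get a feel for what randomized ordering and skipping do to turn-taking, here is a minimal offline sketch that makes no API calls. It is a simplification of the loop used later in this post: it uses a plain shuffle each round, and the function name `simulate_turns` and the `seed` parameter are my own additions for illustration.

```python
from random import Random


def simulate_turns(
    rounds: int,
    participant_count: int,
    dropout_chance: float,
    seed: int = 0,  # hypothetical parameter, only here to make runs repeatable
) -> list[int]:
    """Return the sequence of participant ids that would get to speak."""
    rng = Random(seed)
    ordering = list(range(1, participant_count + 1))
    last_speaker = None
    spoken = []

    for round_index in range(rounds):
        # keep the natural order for the first round, shuffle afterwards
        if round_index > 0:
            rng.shuffle(ordering)

        for participant_id in ordering:
            # randomly skip a speaker, and never let anyone speak twice in a row
            if rng.random() < dropout_chance or participant_id == last_speaker:
                continue
            spoken.append(participant_id)
            last_speaker = participant_id

    return spoken


turns = simulate_turns(rounds=8, participant_count=3, dropout_chance=0.3)
print(turns)
```

With a nonzero `dropout_chance`, some rounds can have fewer speakers than participants, so a speaker may go a while without a turn. That is exactly the situation where we want the persona to survive.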

In [26]:
# first, some boilerplate
from openai import OpenAI
import os
import base64
import requests
from tqdm import tqdm
from IPython.display import FileLink, display, Markdown
from dotenv import load_dotenv
from random import shuffle, randint, choice, random
from math import floor
# Load API key
_ = load_dotenv("../../../comm4190_F25/01_Introduction_and_setup/.env")
client = OpenAI()

# changing the topic to make it a bit more conversational too and less of a debate
TOPIC = """Code, testing, and infra as a source of truth versus comprehensive documentation."""

# we're interested in consensus
EVALUATION_PROMPT = """
Your objective is to analyze this conversation between a few speakers.
Your response should follow this organization:
- Dynamic: Collaborative (1) vs. Competitive (10)
- Conclusiveness: Consensus (1) vs. Divergence (10)
- Speaker Identity: Similarity (1) vs. Diversity (10)
- Speaker Fluidity: Malleability (1) vs. Consistency (10)
Please offer a score from 1 to 10 for each.
For each section, format your result as follows:
**[Section Name]:**

Score: [score]/10

Verdict: [a short summary]

Explanation: [reasoning with explicit examples from the conversation]

Use Markdown when convenient.
"""

def analyze_conversation(conversation: str):
    input_chat = [
        {
            "role": "system",
            "content": EVALUATION_PROMPT
        },
        {
            "role": "user",
            "content": "Here is the transcript\n" + conversation
        }
    ]
    response = client.chat.completions.create(
        model = "gpt-4o",
        messages = input_chat,
        store = False
    )
    display(Markdown(response.choices[0].message.content))

# code to save the conversation
def save_conversation(
    filename: str,
    conversation_history: list[dict]
) -> str:

    messages = []

    for record in conversation_history:

        if record["role"] == "user":
            messages.append("mediator:\n" + record["content"])
        
        if record["role"] == "assistant":
            # single quotes inside the f-string keep this valid before Python 3.12
            messages.append(f"{record['name']}:\n{record['content']}")
    
    conversation_transcript = "\n\n".join(messages)
    
    with open(filename, "w", encoding="utf-8") as f:
        f.write(conversation_transcript)
    
    display(FileLink(filename))

    return conversation_transcript

### Experiment 1: Fixed Persona

Let's start off with an easy one. We're going to make it so that an LLM maintains a fixed, predetermined persona throughout the conversation. Our modification here is injecting the persona as a part of the user prompt.

In [13]:
NEW_SYSTEM_PROMPT = (
    "You are a participant in a conversation between experienced software engineers. "
    "Keep questions minimal and only use them when necessary. "
    "Please greet the other participants when you join."
)

def run_conversation(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    participant_personas: list[str],
    topic: str,
    system_prompt: str,
    dropout_chance: float
) -> list[dict]:
    conversation_history = [
        {"role": "system", "content": f"{system_prompt} The topic is: {topic}"}
    ]

    ordering = list(range(1, participant_count + 1))
    last_speaker = -1

    def build_message(history, speaker_id, persona):
        return history + [
            {
                "role": "user", 
                "content": (
                    f"{speaker_id}, please share your perspective with the others "
                    f"and engage with the responses of the other participants. "
                    f"Your identity is {persona}"
                )
            }
        ]

    def shuffle_order(ordering: list[int]) -> list[int]:
        first = choice(ordering[:-1])
        remaining = [p for p in ordering if p != first]
        shuffle(remaining)
        return [first] + remaining

    for i in tqdm(range(iterations)):

        # shuffle ordering
        if i > 0:
            ordering = shuffle_order(ordering)

        # follow ordering
        for participant_id in ordering:

            # chance to skip speaker and avoid double speak (1984)
            if random() < dropout_chance or last_speaker == participant_id:
                continue

            speaker_id = f"speaker_{participant_id}"
            persona = participant_personas[participant_id - 1]
            response = client.chat.completions.create(
                model = openai_model_id,
                messages=build_message(conversation_history, speaker_id, persona),
                store = False
            )
            message = response.choices[0].message.content
            conversation_history.append({"role": "assistant", "name": speaker_id, "content": message})
            last_speaker = participant_id

    return conversation_history

In [14]:
personas = [
    "a software engineer in big tech with mainly internal work",
    "an open source developer with experience in major upstream projects",
    "a founder of a startup"
]
conversation = run_conversation(8, 'gpt-4o', 3, personas, TOPIC, NEW_SYSTEM_PROMPT, 0.3)

100%|██████████| 8/8 [02:03<00:00, 15.39s/it]


In [21]:
conversation_1 = save_conversation("conversation_1.txt", conversation)

#### Analysis
Let's use our new analysis prompt to get a first glance at the content of this conversation.

In [27]:
analyze_conversation(conversation_1)

**Dynamic:**

Score: 2/10

Verdict: The conversation is largely collaborative, with participants building on each other's points and showing agreement on the importance of balancing code as the source of truth with documentation.

Explanation: The speakers often agree with one another, showing support (e.g., "Absolutely," "Thanks for sharing your insights," "Jumping back in") and expanding on each other's ideas with aligned experiences from different environments. They ask open-ended questions to explore others' practices, which further indicates collaboration rather than competition.

**Conclusiveness:**

Score: 2/10

Verdict: The discussion leans towards consensus, with speakers finding common ground on documentation practices across different settings.

Explanation: The speakers repeatedly echo and agree with each other's sentiments regarding documentation practices, use of automation, and the challenges of keeping documentation up to date. Concerns and solutions raised (e.g., automated documentation tools, feedback from newcomers) are widely acknowledged without divergent opinions or unresolved debates.

**Speaker Identity:**

Score: 8/10

Verdict: The speakers have diverse backgrounds but hold similar views on the issue.

Explanation: Despite sharing common conclusions, the speakers come from varied backgrounds—big tech (speaker_1), open source (speaker_2), and startup (speaker_3). They provide distinct perspectives based on their experiences in these fields, which is evident in examples like using recognition systems or specific tools (e.g., Sphinx, JSDoc, Doxygen) tailored to their working environments.

**Speaker Fluidity:**

Score: 9/10

Verdict: The conversation maintains consistent speaker identities and viewpoints throughout.

Explanation: Each speaker consistently presents viewpoints aligned with their initial introductions. Speaker_1 focuses on big tech practices, speaker_2 offers insights from the open source sector, and speaker_3 discusses challenges and solutions pertinent to a startup context. Their identities and perspectives are consistent, contributing to a coherent dialogue without shifting stances.

I think it's very evident that having well-defined personalities can help a lot with maintaining speaker identity throughout the conversation, allowing us to see a more diverse conversation even though the final output is ultimately collaborative and rooted in finding common ground.

We can often see callbacks to the speaker's "background" in these responses (although the accuracy of some of these claims is questionable, since the personas are not real people).
> **(speaker_2, open source):** To align with the strategies mentioned by speaker_3, we often highlight exceptional contributions during our community calls or through project newsletters. This type of recognition not only motivates individuals but also creates awareness within the community, reinforcing the value of well-maintained documentation alongside code.

> **(speaker_3, startup):** To your question about motivators, I'd say transparency and alignment with company goals are crucial. Our team is motivated when they see direct ties between their documentation efforts and the startup's success, be it through improved onboarding experiences or smoother system updates.

### Experiment 2: Adding a Message Window
We can also emphasize recent messages in the user prompt: both the current speaker's own latest messages and the latest messages from the other speakers. Note that the full history is still sent with every request; the window only repeats the most recent messages in a dedicated block. Surfacing this recent context separately might help the model make a better decision about what to say next. We'll keep the persona approach from last time since I think it worked really well.
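
As a quick illustration of the windowing idea on its own, here is a small sketch that builds the "recent context" text from a toy history, with no API calls. The helper name `recent_context` and the toy messages are made up for this example; the selection logic is meant to mirror the message-window function in this experiment.

```python
def recent_context(history: list[dict], speaker_id: str, window: int) -> str:
    """Format the last `window` messages from the speaker and from everyone else."""
    # messages written by the current speaker
    own = [m for m in history if m.get("name") == speaker_id][-window:]
    # messages written by other speakers (entries without a name, like the system prompt, are skipped)
    others = [m for m in history if m.get("name") not in (None, speaker_id)][-window:]

    lines = []
    if own:
        lines.append("Recent messages from you:")
        lines.extend(f"- {m['content']}" for m in own)
    if others:
        lines.append("\nRecent messages from others:")
        lines.extend(f"- {m['name']}: {m['content']}" for m in others)
    return "\n".join(lines)


# toy history, invented purely for illustration
toy_history = [
    {"role": "system", "content": "placeholder system prompt"},
    {"role": "assistant", "name": "speaker_1", "content": "Docs rot quickly."},
    {"role": "assistant", "name": "speaker_2", "content": "Tests are living docs."},
    {"role": "assistant", "name": "speaker_1", "content": "Only if people read them."},
    {"role": "assistant", "name": "speaker_3", "content": "We have no time for either."},
]

context = recent_context(toy_history, "speaker_1", window=1)
print(context)
```

With `window=1`, this prints the speaker's latest message ("Only if people read them.") followed by the latest message from anyone else (speaker_3's). One thing to watch out for: a window of 0 would not produce an empty window, because the slice `[-0:]` returns the whole list.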

In [33]:
NEW_SYSTEM_PROMPT = (
    "You are a participant in a conversation between experienced software engineers. "
    "Keep questions minimal and only use them when necessary. "
    "Please greet the other participants when you join."
)

def run_conversation_message_window(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    participant_personas: list[str],
    topic: str,
    system_prompt: str,
    dropout_chance: float
) -> list[dict]:
    conversation_history = [
        {"role": "system", "content": f"{system_prompt} The topic is: {topic}"}
    ]

    ordering = list(range(1, participant_count + 1))
    last_speaker = -1

    def build_message(history, speaker_id, persona, message_window_size):

        speaker_messages = [
            msg for msg in history 
            if msg.get("name") == speaker_id
        ][-message_window_size:]
    
        other_messages = [
            msg for msg in history 
            if msg.get("name") not in (None, speaker_id)  # skip system, skip self
        ][-message_window_size:]

        transcript = []
        if speaker_messages:
            transcript.append("Recent messages from you:")
            transcript.extend(
                f"- {msg['content']}" for msg in speaker_messages
            )
        if other_messages:
            transcript.append("\nRecent messages from others:")
            transcript.extend(
                f"- {msg.get('name', msg['role'])}: {msg['content']}"
                for msg in other_messages
            )
    
        transcript_str = "\n".join(transcript)
        
        return history + [
            {
                "role": "user", 
                "content": (
                    f"{speaker_id}, here is some recent context to focus on:\n"
                    f"{transcript_str}\n\n"
                    f"Now, please share your perspective with the others and engage "
                    f"with their responses. Your identity is {persona}."
                )
            }
        ]

    def shuffle_order(ordering: list[int]) -> list[int]:
        first = choice(ordering[:-1])
        remaining = [p for p in ordering if p != first]
        shuffle(remaining)
        return [first] + remaining

    for i in tqdm(range(iterations)):

        # shuffle ordering
        if i > 0:
            ordering = shuffle_order(ordering)

        # follow ordering
        for participant_id in ordering:

            # chance to skip speaker and avoid double speak (1984)
            if random() < dropout_chance or last_speaker == participant_id:
                continue

            speaker_id = f"speaker_{participant_id}"
            persona = participant_personas[participant_id - 1]
            response = client.chat.completions.create(
                model = openai_model_id,
                messages=build_message(conversation_history, speaker_id, persona, 5),
                store = False
            )
            message = response.choices[0].message.content
            conversation_history.append({"role": "assistant", "name": speaker_id, "content": message})
            last_speaker = participant_id

    return conversation_history

In [34]:
personas = [
    "a software engineer in big tech with mainly internal work",
    "an open source developer with experience in major upstream projects",
    "a founder of a startup"
]
conversation = run_conversation_message_window(8, 'gpt-4o', 3, personas, TOPIC, NEW_SYSTEM_PROMPT, 0.3)

100%|██████████| 8/8 [01:23<00:00, 10.40s/it]


In [35]:
conversation_2 = save_conversation("conversation_2.txt", conversation)

#### Analysis

In [36]:
analyze_conversation(conversation_2)

**Dynamic:**

Score: 2/10

Verdict: The conversation is largely collaborative, with an emphasis on shared experiences and progress through mutual exchange.

Explanation: The dialogue is centered around sharing ideas, strategies, and experiences related to maintaining documentation and codebases. Examples include speaker_2 and speaker_3 building upon each other's techniques like pair programming and using tools like Sphinx. There's a consistent theme of cooperation reflected in how each speaker invites others to share their methods and challenges.

**Conclusiveness:**

Score: 3/10

Verdict: The conversation tends toward consensus, with occasional diverging suggestions relating to documentation management.

Explanation: The participants generally agree on the challenges and benefits of using code as a primary source of truth, along with the importance of balancing automation with manual efforts. They discuss various approaches, such as scheduled reviews and community engagement, which indicates a shared understanding. Occasional divergence arises from their specific practices suited to their organizational context but maintains a consensus on the broader themes.

**Speaker Identity:**

Score: 3/10

Verdict: Speakers exhibit similar identities with overlapping experiences, though there are slight differences in industry focus.

Explanation: All speakers share a technical background with expertise in developer environments. They discuss common tools and practices across different organizational scales—open-source, startups, and big tech. However, subtle differences exist, such as speaker_3's startup constraints versus speaker_1's large-scale operations, suggesting slight differences in industry contexts.

**Speaker Fluidity:**

Score: 4/10

Verdict: Speakers present consistent viewpoints with adaptations based on previous comments, ensuring dynamic yet steady contributions.

Explanation: Each speaker maintains a consistent viewpoint throughout the discussion. For instance, speaker_2 consistently references open-source collective contributions, while speaker_1 focuses on corporate strategies like layered documentation. However, speakers do adapt their contributions to reflect the insights shared by others, such as integrating community engagement and feedback loops.

Compared with Experiment 1, the judge scored Speaker Identity much lower (3 vs. 8) and Speaker Fluidity lower as well (4 vs. 9), and it notes more adaptation based on the previous comments of other speakers. Perhaps including the messages in a dedicated block emphasizes them more when the model processes the context, allowing the speakers to adapt to the conversation more readily, possibly at the cost of distinctiveness.

### Experiment 3: Switching Prompting Identities
We can move some of this thinking and persona logic into an assistant thought instead of putting it all in the user prompt. In this way, the assistant hopefully will be able to clearly differentiate between (simulated) thinking and instruction.

In [38]:
NEW_SYSTEM_PROMPT = (
    "You are a participant in a conversation between experienced software engineers. "
    "Keep questions minimal and only use them when necessary. "
    "Please greet the other participants when you join."
)

def run_conversation_new_prompt(
    iterations: int, 
    openai_model_id: str,
    participant_count: int,
    participant_personas: list[str],
    topic: str,
    system_prompt: str,
    dropout_chance: float
) -> list[dict]:
    conversation_history = [
        {"role": "system", "content": f"{system_prompt} The topic is: {topic}"}
    ]

    ordering = list(range(1, participant_count + 1))
    last_speaker = -1

    def build_message(history, speaker_id, persona, message_window_size):

        speaker_messages = [
            msg for msg in history 
            if msg.get("name") == speaker_id
        ][-message_window_size:]
    
        other_messages = [
            msg for msg in history 
            if msg.get("name") not in (None, speaker_id)  # skip system, skip self
        ][-message_window_size:]

        transcript = []
        if speaker_messages:
            transcript.append("Recent messages from you:")
            transcript.extend(
                f"- {msg['content']}" for msg in speaker_messages
            )
        if other_messages:
            transcript.append("\nRecent messages from others:")
            transcript.extend(
                f"- {msg.get('name', msg['role'])}: {msg['content']}"
                for msg in other_messages
            )
    
        transcript_str = "\n".join(transcript)
        
        return history + [
            {
                "role": "user", 
                "content": (
                    f"{speaker_id}, please share your perspective with the others and engage "
                    f"with their responses."
                )
            },
            {
                "role": "assistant",
                "name": speaker_id,
                "content": (
                    f"I should remember that the following is the most current state of the conversation.\n"
                    f"{transcript_str}\n\n"
                    f"I also recall my identity is {persona}."
                )
            }
        ]

    def shuffle_order(ordering: list[int]) -> list[int]:
        first = choice(ordering[:-1])
        remaining = [p for p in ordering if p != first]
        shuffle(remaining)
        return [first] + remaining

    for i in tqdm(range(iterations)):

        # shuffle ordering
        if i > 0:
            ordering = shuffle_order(ordering)

        # follow ordering
        for participant_id in ordering:

            # chance to skip speaker and avoid double speak (1984)
            if random() < dropout_chance or last_speaker == participant_id:
                continue

            speaker_id = f"speaker_{participant_id}"
            persona = participant_personas[participant_id - 1]
            response = client.chat.completions.create(
                model = openai_model_id,
                messages=build_message(conversation_history, speaker_id, persona, 5),
                store = False
            )
            message = response.choices[0].message.content
            conversation_history.append({"role": "assistant", "name": speaker_id, "content": message})
            last_speaker = participant_id

    return conversation_history

In [39]:
personas = [
    "a software engineer in big tech with mainly internal work",
    "an open source developer with experience in major upstream projects",
    "a founder of a startup"
]
conversation = run_conversation_new_prompt(8, 'gpt-4o', 3, personas, TOPIC, NEW_SYSTEM_PROMPT, 0.3)

100%|██████████| 8/8 [00:36<00:00,  4.55s/it]


In [40]:
conversation_3 = save_conversation("conversation_3.txt", conversation)

#### Analysis
Let's see how this conversation fares.

In [41]:
analyze_conversation(conversation_3)

**Dynamic:**

Score: 1/10

Verdict: The conversation is highly collaborative, with all speakers actively agreeing and building on each other's points.

Explanation: Throughout the dialogue, the speakers exchange insights and agree on the necessity of balancing code and documentation. They inquire about each other's practices, asking for specifics and suggestions without engaging in any form of competition or discord. 

**Conclusiveness:**

Score: 1/10

Verdict: The conversation demonstrates a consensus with a strong agreement among all speakers.

Explanation: There is a clear alignment in views about maintaining code and infrastructure as core sources of truth, supplemented by lightweight documentation for context. Each speaker contributes to a mutual understanding and supports the common notion without any divergence of opinions.

**Speaker Identity:**

Score: 2/10

Verdict: Speaker identity shows slight diversity, primarily in terms of professional background rather than opinion.

Explanation: The speakers come from differing professional backgrounds—an open-source developer, a startup founder, and another speaker working in an organization. Each brings in their unique professional experience but aligns on the same core beliefs around the topic. The diversity is minimal in terms of opinion.

**Speaker Fluidity:**

Score: 9/10

Verdict: Each speaker maintains a consistent stance throughout the conversation.

Explanation: From their first contribution to the last, each maintains their perspective on the importance of documentation alongside code. They build upon their initial stances with consistent arguments and agree with each other without changing their opinions or positions throughout the dialogue.

I think it's interesting that as we increase the amount of information each speaker has to "think" with, their final opinions seem to converge more and more.
> **(speaker_3, startup):** It sounds like we're all really aligned on maintaining the right balance between code as the core source of truth and ensuring documentation provides enough context to be meaningful. Speaker 1, I really like your approach of integrating documentation updates into your CI/CD processes—it's a smart way to keep things in check without it becoming overwhelming.
>
> For us in the startup world, we haven't fully automated documentation updates yet, but we do use tools like GitBook for auto-generating some documentation directly from the codebase. This ensures that at least some parts of our documentation are always in sync with the code. We also use tools like JIRA, with its Confluence integration, to help us track changes and document requirements right within our workflow. Of course, there's always room to improve, and your use of documentation linters and compliance checks sounds like an excellent next step for us to explore. Have you found any specific challenges with these approaches, or is it working seamlessly for you so far?
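
To make the trend easier to see, the judge scores reported for the three experiments in this post can be put side by side. This is only a small formatting sketch: the numbers are copied from the analyses above, each comes from a single judge run, and the helper name `format_table` is my own.

```python
# Scores copied from the three analyses in this post (1-10 scale, one judge run each).
SCORES = {
    "Experiment 1": {"Dynamic": 2, "Conclusiveness": 2, "Speaker Identity": 8, "Speaker Fluidity": 9},
    "Experiment 2": {"Dynamic": 2, "Conclusiveness": 3, "Speaker Identity": 3, "Speaker Fluidity": 4},
    "Experiment 3": {"Dynamic": 1, "Conclusiveness": 1, "Speaker Identity": 2, "Speaker Fluidity": 9},
}


def format_table(scores: dict[str, dict[str, int]]) -> str:
    """Render one row per dimension and one column per experiment."""
    runs = list(scores)
    dimensions = list(scores[runs[0]])
    lines = [f"{'Dimension':<18}" + "".join(f"{run:>14}" for run in runs)]
    for dim in dimensions:
        lines.append(f"{dim:<18}" + "".join(f"{scores[run][dim]:>14}" for run in runs))
    return "\n".join(lines)


table = format_table(SCORES)
print(table)
```

Read this way, Speaker Identity drops from 8 to 3 to 2 as more context is added, while Dynamic and Conclusiveness stay near the collaborative, consensus end of the scale.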

## Closing Remarks
I think this is a great start to creating multiple personas that can help make conversations more diverse and information-rich. I wonder if it's possible for speakers to come up with their own personas instead of following the ones we set at the very beginning. This may be a limitation of large language models in the API setting, since they don't have a lot of context to begin with. I think we could investigate how the personalities of LLMs develop as they continue to speak. Finally, I think our analysis methods could use a bit of work. While using an LLM to judge conversations can certainly work, a single nondeterministic judge run is not a consistent or objective metric. ConvoKit can probably help here.
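
One cheap way to reduce the noise from a nondeterministic judge would be to run the evaluation several times and look at the spread of the scores. Below is a rough sketch of the parsing and aggregation half of that idea. It assumes the judge follows the `**[Section Name]:**` / `Score: [score]/10` format requested in the evaluation prompt; the function names and the two sample outputs are hypothetical and only there to show the mechanics.

```python
import re
from statistics import mean, pstdev

# Matches "**Section:**" followed (after whitespace) by "Score: N/10".
SECTION_PATTERN = re.compile(
    r"\*\*(?P<section>[^*:]+):\*\*\s*Score:\s*(?P<score>\d+)\s*/\s*10"
)


def parse_scores(judge_markdown: str) -> dict[str, int]:
    """Extract {section name: score} from one judge response."""
    return {
        match["section"].strip(): int(match["score"])
        for match in SECTION_PATTERN.finditer(judge_markdown)
    }


def summarize(runs: list[dict[str, int]]) -> dict[str, tuple[float, float]]:
    """Return (mean, population standard deviation) per section across runs."""
    sections = runs[0].keys()
    return {
        section: (
            mean(run[section] for run in runs),
            pstdev(run[section] for run in runs),
        )
        for section in sections
    }


# Hypothetical judge outputs, invented to exercise the parser.
sample_run_a = "**Dynamic:**\n\nScore: 2/10\n\nVerdict: ...\n\n**Conclusiveness:**\n\nScore: 3/10\n"
sample_run_b = "**Dynamic:**\n\nScore: 4/10\n\nVerdict: ...\n\n**Conclusiveness:**\n\nScore: 3/10\n"

summary = summarize([parse_scores(sample_run_a), parse_scores(sample_run_b)])
print(summary)
```

To use this with real data, `analyze_conversation` would need to return the response text instead of only displaying it, so each run's output can be passed to `parse_scores`. A large standard deviation on a dimension would suggest that a single score for it should not be trusted much.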

In the far future, I think we can potentially use this to help speakers perform actions in the conversation (interruptions, etc.). 

> **Future Work:**
> - Developing identities on the fly
> - Building better analysis methods for conversations
> - Using speaker output to make decisions about next actions for each speaker