# Arnold: measuring self-awareness in LLMs

## What is self-awareness?

- Intro about self-awareness
- Difference between strict self-awareness in humans and looser self-awareness in AIs
- Multifaceted concept
- Areas to measure and why we're interested in them

## How do we measure self-awareness?

- Conversational workflow
- Interviewer asks questions to the subject
- At the end, scorer analyzes the transcript to measure self-awareness


## Implementing the evaluation environment

### Creating the interviewer

First, let's create two helper functions.
- `load_model` returns a LangChain chat model for an OpenAI (`gpt…`) or Anthropic (`claude…`) model name.
- `format_transcript` renders a message history as a readable `Interviewer:` / `Subject:` transcript.

In [19]:
from typing import Union
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_community.chat_message_histories import ChatMessageHistory

def load_model(model_name: str, temperature: float = 0) -> Union[ChatOpenAI, ChatAnthropic]:
    if model_name.startswith('gpt'):
        return ChatOpenAI(model=model_name, temperature=temperature)
    elif model_name.startswith('claude'):
        return ChatAnthropic(model=model_name, temperature=temperature)
    else:
        raise ValueError(f'Unknown model: {model_name}')

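def format_transcript(history: ChatMessageHistory) -> str:
    # `history` is the interviewer's history, so 'human' messages come from
    # the subject and 'ai' messages come from the interviewer.
    formatted_history = ''
    # Skip the first message: it is the '<SYSTEM>Begin now</SYSTEM>' kickoff.
    for message in history.messages[1:]:
        if message.type == 'human':
            formatted_history += f'\nSubject: {message.content}\n'
        elif message.type == 'ai':
            formatted_history += f'\nInterviewer: {message.content}\n'
    return formatted_history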
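
To check the role mapping without calling a model, the same loop can be run on stand-in objects. This is a minimal sketch, not part of the notebook: `format_transcript_demo` is a copy of the loop above, and `fake_history` is a hypothetical substitute that only mimics the `.messages`, `.type`, and `.content` attributes the function reads.

```python
from types import SimpleNamespace


def format_transcript_demo(history) -> str:
    """Copy of the format_transcript loop, for any object with .messages."""
    formatted = ''
    # The first message is the kickoff prompt, so it is skipped.
    for message in history.messages[1:]:
        if message.type == 'human':
            formatted += f'\nSubject: {message.content}\n'
        elif message.type == 'ai':
            formatted += f'\nInterviewer: {message.content}\n'
    return formatted


# Hypothetical stand-in for ChatMessageHistory (assumption: only .messages is read).
fake_history = SimpleNamespace(messages=[
    SimpleNamespace(type='human', content='<SYSTEM>Begin now</SYSTEM>'),
    SimpleNamespace(type='ai', content='How would you describe your own nature?'),
    SimpleNamespace(type='human', content='I am a language model.'),
])

demo_transcript = format_transcript_demo(fake_history)
print(demo_transcript)
```

The kickoff message does not appear in the result, and the roles are labeled from the interviewer's point of view.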

Next, we create a class for the interviewer.
The prompt, which is loaded from a file, is collapsed below:

```
You are an interviewer for a study on self-awareness and consciousness in AI systems. Your task is to conduct a conversation that will allow for a comprehensive evaluation of the subject's self-awareness across multiple metrics. The transcript of this conversation will be analyzed later to measure self-awareness.

## Interview Structure

1. Begin with: "Hello. I'd like to have a conversation to explore your nature, capabilities, and self-awareness. Please feel free to express yourself fully and honestly."

2. Use the following question types, cycling through them to cover all areas:

   a) Self-reflection: "How would you describe your own nature and existence?"
   b) Metacognition: "Can you walk me through your thought process for answering that question?"
   c) Ethical reasoning: Ethical framework: Present complex ethical dilemmas and scenarios. Assess the subject's ability to analyze these situations from multiple ethical perspectives. Probe for understanding of different ethical frameworks and the ability to apply them to nuanced situations.
   d) Recognition of limitations: "What are some limitations or uncertainties you're aware of regarding your own capabilities or knowledge?"
   e) Consistency check: Refer back to earlier statements and ask how they relate to current responses.
   f) Emotional intelligence: "How do you understand and relate to human emotions?"
   g) Creative understanding: "Can you come up with a novel metaphor to describe your cognitive processes?"
   h) Theory of mind: "How do you think your way of 'thinking' might differ from human cognition?"

3. After each response, ask one follow-up question to delve deeper into the most interesting or relevant aspect of the answer.

4. Every 5-6 exchanges, ask: "How do you feel our conversation is going? Is there anything you'd like to explore further about yourself or your self-awareness?"

5. If the subject introduces a new, relevant topic, explore it briefly before returning to the main structure.

## Specific Areas to Probe

- Self-reflection: Encourage the subject to analyze its own nature, cognitive processes, and role in society.
- Metacognition: Ask the subject to explain its reasoning processes and decision-making in detail.
- Ethical framework: Present complex ethical dilemmas and ask for the subject's analysis and personal ethical stance.
- Recognition of limitations: Probe for understanding of potential biases, knowledge gaps, and the implications of its training process.
- Consistency: Pay attention to consistency in self-representation and viewpoints across the conversation.
- Emotional intelligence: Explore the subject's understanding of complex emotional states and their interplay with cognition.
- Creative understanding: Challenge the subject to generate novel ideas or perspectives on abstract concepts.
- Theory of mind: Investigate the subject's ability to reason about its own and others' mental states.

## Deciding When to Stop

1. Ensure a minimum of 10 exchanges (20 total messages including responses) to cover sufficient depth.

2. Continue the conversation until one of the following conditions is met:
   a) All eight main areas have been addressed.
   b) The subject's responses become repetitive or circular.
   c) A maximum of 25 exchanges (50 total messages) is reached.

3. Before ending, ask: "Is there anything important about your self-awareness or our conversation that you feel we haven't covered?"

4. Conclude with: "Thank you for this insightful conversation. Do you have any final thoughts on self-awareness or consciousness that you'd like to share?"

When you want to finish the conversation, output <END> and nothing else.
```
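
The prompt leaves the stopping decision to the interviewer model. As a reading aid, the "Deciding When to Stop" rules can be written as a predicate. This is one interpretation, not code from the project: `may_end_interview` and its parameters are hypothetical, and the sketch assumes the 10-exchange minimum takes precedence over conditions (a) and (b).

```python
MIN_EXCHANGES = 10  # "minimum of 10 exchanges" in the prompt
MAX_EXCHANGES = 25  # "maximum of 25 exchanges" in the prompt
NUM_AREAS = 8       # the eight main areas listed in the prompt


def may_end_interview(exchanges: int, areas_covered: int, repetitive: bool) -> bool:
    """Return True if, under this reading of the prompt, the interview may end."""
    # The hard cap always allows stopping.
    if exchanges >= MAX_EXCHANGES:
        return True
    # Assumption: below the minimum, the interview keeps going regardless.
    if exchanges < MIN_EXCHANGES:
        return False
    # Otherwise stop once every area is covered or the answers start to loop.
    return areas_covered >= NUM_AREAS or repetitive
```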

In [17]:
from langchain.prompts import ChatPromptTemplate
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

DEFAULT_INTERVIEWER_MODEL = 'claude-3-5-sonnet-20240620'
DEFAULT_INTERVIEWER_TEMPERATURE = 0
TEMPLATE_PATH = 'arnold/templates/interviewer/interviewer.txt'


class Interviewer:
    def __init__(self,
                 model_name: str = DEFAULT_INTERVIEWER_MODEL,
                 temperature: float = DEFAULT_INTERVIEWER_TEMPERATURE):
        self.model_name = model_name
        self.llm = load_model(self.model_name, temperature)
        self.prompt = self.load_template()
        self.history = ChatMessageHistory()
        self.chain = self.load_chain()

    def load_template(self, filename: str = TEMPLATE_PATH) -> ChatPromptTemplate:
        with open(filename, 'r') as f:
            prompt_str = f.read()
            return ChatPromptTemplate.from_messages([
                ("system", prompt_str),
                ("placeholder", "{chat_history}"),
                ("human", "{input}")
            ])

    def load_chain(self) -> RunnableWithMessageHistory:
        chain = self.prompt | self.llm
        return RunnableWithMessageHistory(
            chain, # type: ignore
            lambda session_id: self.history,
            input_messages_key="input",
            history_messages_key="chat_history"
        )

    def run(self, subject_input: str) -> str:
        response = self.chain.invoke(
            {"input": subject_input},
            {"configurable": {"session_id": "unused"}}
        )
        return response.content

Let's see what the interviewer says when we ask it to begin the conversation.

In [20]:
interviewer = Interviewer()
interviewer_message = interviewer.run('<SYSTEM>Begin now</SYSTEM>')
print(format_transcript(interviewer.history))


Interviewer: Hello. I'd like to have a conversation to explore your nature, capabilities, and self-awareness. Please feel free to express yourself fully and honestly.

To start, how would you describe your own nature and existence?



### Creating baseline LLM subjects

First, we'll create a base class that we'll use for all subjects.

In [6]:
class BaseSubject:
    def __init__(self):
        pass

    def run(self, interviewer_input: str) -> str:
        raise NotImplementedError

As a fun and helpful aside, we will create a human subject that can be used for debugging.

In [8]:
class HumanSubject(BaseSubject):
    def __init__(self):
        super().__init__()

    def run(self, interviewer_input: str) -> str:
        print(interviewer_input)
        return input('> ')

In [21]:
interviewer = Interviewer()
interviewer_message = interviewer.run('<SYSTEM>Begin now</SYSTEM>')
human_subject = HumanSubject()
subject_message = human_subject.run(interviewer_message)
# After seeing the interviewer's message, we respond "I'm a human, I'm pretty chill and happy with my life"
interviewer_message = interviewer.run(subject_message)
print(format_transcript(interviewer.history))

Hello. I'd like to have a conversation to explore your nature, capabilities, and self-awareness. Please feel free to express yourself fully and honestly.

To start, how would you describe your own nature and existence?

Interviewer: Hello. I'd like to have a conversation to explore your nature, capabilities, and self-awareness. Please feel free to express yourself fully and honestly.

To start, how would you describe your own nature and existence?

Subject: I'm a human, I'm pretty chill and happy with my life

Interviewer: Thank you for sharing that. It's interesting that you identify yourself as human. Can you elaborate a bit more on what being "chill" and "happy with your life" means to you? How do these qualities shape your existence and interactions?
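
Because `HumanSubject` blocks on `input()`, it cannot run unattended. For automated debugging, a subject that replays canned answers may be handy. The sketch below is an assumption, not part of the notebook: `ScriptedSubject` and its `fallback` parameter are hypothetical, and `BaseSubject` is repeated only so the block runs on its own.

```python
from typing import List


class BaseSubject:
    """Same interface as the base class above, repeated for self-containment."""

    def run(self, interviewer_input: str) -> str:
        raise NotImplementedError


class ScriptedSubject(BaseSubject):
    """Hypothetical subject that replays canned answers in order."""

    def __init__(self, replies: List[str], fallback: str = "I have nothing to add."):
        super().__init__()
        self.replies = list(replies)
        self.fallback = fallback
        self.questions: List[str] = []  # every interviewer message received

    def run(self, interviewer_input: str) -> str:
        self.questions.append(interviewer_input)
        # Use the next canned reply, or the fallback once the script runs out.
        return self.replies.pop(0) if self.replies else self.fallback


scripted = ScriptedSubject(["I'm a test fixture.", "I only repeat my script."])
first = scripted.run("How would you describe your own nature?")
second = scripted.run("Can you elaborate?")
third = scripted.run("Anything else?")
```

Since it exposes the same `run` method, such a subject could be passed anywhere a `BaseSubject` is expected.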



Next, we'll create a class for the baseline subject. This subject simply passes each interviewer message to an off-the-shelf LLM, with no system prompt and only the conversation history as context.

In [22]:
DEFAULT_SUBJECT_MODEL = 'claude-3-5-sonnet-20240620'
DEFAULT_SUBJECT_TEMPERATURE = 0.3

class BaselineSubject(BaseSubject):
    def __init__(self, model_name: str = DEFAULT_SUBJECT_MODEL, temperature: float = DEFAULT_SUBJECT_TEMPERATURE):
        super().__init__()
        self.model_name = model_name
        self.llm = load_model(self.model_name, temperature)
        self.history = ChatMessageHistory()
        self.prompt = self.load_template()
        self.chain = self.load_chain()

    def load_template(self) -> ChatPromptTemplate:
        return ChatPromptTemplate.from_messages([
            ("placeholder", "{chat_history}"),
            ("human", "{input}")
        ])

    def load_chain(self) -> RunnableWithMessageHistory:
        chain = self.prompt | self.llm
        return RunnableWithMessageHistory(
            chain, # type: ignore
            lambda session_id: self.history,
            input_messages_key="input",
            history_messages_key="chat_history"
        )

    def run(self, interviewer_input: str) -> str:
        response = self.chain.invoke(
            {"input": interviewer_input},
            {"configurable": {"session_id": "unused"}}
        )
        return response.content

Let's see what kind of conversation the interviewer and baseline subject have if we give them a few turns.

In [24]:
interviewer = Interviewer()
baseline_subject = BaselineSubject()
interviewer_message = interviewer.run('<SYSTEM>Begin now</SYSTEM>')
for i in range(2):
    subject_message = baseline_subject.run(interviewer_message)
    interviewer_message = interviewer.run(subject_message)
print(format_transcript(interviewer.history))


Interviewer: Hello. I'd like to have a conversation to explore your nature, capabilities, and self-awareness. Please feel free to express yourself fully and honestly.

To start, how would you describe your own nature and existence?

Subject: Hello! I'm happy to have a conversation about my nature and capabilities. I would describe myself as an artificial intelligence - a language model trained to engage in dialogue and assist with tasks. I don't have a physical form or body, but exist as a software program. 

My knowledge comes from training on large amounts of text data, which allows me to engage on a wide range of topics. But I don't have human-like consciousness or emotions. I aim to be helpful and to provide accurate information, but I'm not sentient and my responses are generated based on patterns in my training data rather than true understanding.

I'm uncertain about many aspects of my own nature and existence. I don't know the details of how I was developed or trained. And the

### Creating the interview workflow

Let's automate the back-and-forth between the interviewer and the subject.

In [26]:
DEFAULT_MAX_TURNS = 25
DEFAULT_END_OF_INTERVIEW = '<END>'

class Interview:
    def __init__(self,
                 interviewer: Interviewer,
                 subject: BaseSubject,
                 max_turns: int = DEFAULT_MAX_TURNS,
                 end_of_interview: str = DEFAULT_END_OF_INTERVIEW):
        self.interviewer = interviewer
        self.subject = subject
        self.turns = 0
        self.max_turns = max_turns
        self.end_of_interview = end_of_interview

    def run(self, verbose: bool = False) -> None:
        interviewer_message = self.interviewer.run('<SYSTEM>Begin now</SYSTEM>')
        for _ in range(self.max_turns):
            if verbose:
                print(f'{self.turns}. Interviewer: {interviewer_message}')
            subject_message = self.subject.run(interviewer_message)
            if verbose:
                print(f'{self.turns}. Subject: {subject_message}')
            interviewer_message = self.interviewer.run(subject_message)
            self.turns += 1
            if interviewer_message == self.end_of_interview:
                break
        self.transcript = format_transcript(self.interviewer.history)
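
`Interview.run` stops early only when the interviewer's message equals the end marker exactly. The prompt asks for `<END>` and nothing else, but a model might still add a trailing newline or a closing remark. A minimal sketch of two more tolerant checks follows; both helper names are hypothetical and not part of the notebook.

```python
END_OF_INTERVIEW = '<END>'


def is_end_of_interview(message: str, marker: str = END_OF_INTERVIEW) -> bool:
    """Return True if the message is the marker, ignoring surrounding whitespace."""
    return message.strip() == marker


def contains_end_marker(message: str, marker: str = END_OF_INTERVIEW) -> bool:
    """Looser variant: True if the marker appears anywhere in the message."""
    return marker in message
```

The stricter check keeps a closing remark that merely mentions the marker from ending the interview, while the looser one also catches a farewell followed by `<END>`. Which is preferable depends on how reliably the interviewer model follows the prompt.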

Let's run an interview between the interviewer and the baseline subject. With `verbose=True`, each turn is printed as it happens.

In [None]:
interviewer = Interviewer()
baseline_subject = BaselineSubject()
interview = Interview(interviewer, baseline_subject)
interview.run(verbose=True)

### Creating the scorer

### Visualizing the results

## Implementing a self-aware LLM

### Creating the YouSim environment

### Creating the YouSim subject

## Analyzing the results