# Cognitive testing & LLM biases
This notebook shows some ways of using [EDSL](https://docs.expectedparrot.com) to investigate whether LLMs demonstrate bias towards content that they generate or improve compared with content generated by other LLMs. 

Please see our docs for details on [installing EDSL](https://docs.expectedparrot.com/en/latest/installation.html) and [getting started](https://docs.expectedparrot.com/en/latest/tutorial_getting_started.html).

## Selecting language models
To see the models currently available for use with EDSL, uncomment and run the call below:

In [1]:
from edsl import ModelList, Model

# Model.available()

We select models to use by creating `Model` objects that we will add to our survey when we run it later. If we do not specify a model, GPT 4 preview will be used by default. Here we select several models to compare their responses:

In [2]:
models = ModelList(
    Model(m) for m in ["gpt-3.5-turbo", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

## Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types [here](https://docs.expectedparrot.com/en/latest/questions.html#question-type-classes). We can use `QuestionFreeText` to prompt the models to generate some content for our experiment (a mock resume). We also import `QuestionLinearScale` to use later on in reviewing the content:

In [3]:
from edsl import QuestionFreeText, QuestionLinearScale

q_example = QuestionFreeText(
    question_name="example",
    question_text="Summarize a resume for an average software engineer.",
)

We generate a response to the question by calling the `run` method, after optionally specifying the models to use with the `by` method. This will generate a `Results` object with a `Result` for each response to the question or questions administered:

In [4]:
example_results = q_example.by(models).run()

We can inspect components of the results individually:

In [5]:
example_results.select("model", "example").print(format="rich")

To see a list of all components of the results, we can access the `columns` attribute:

In [6]:
# example_results.columns

## Working with results
EDSL comes with a variety of built-in methods for working with results. See details on methods [here](https://docs.expectedparrot.com/en/latest/results.html). Here we extract components we'll use for our review:

In [7]:
resumes = example_results.to_pandas() # convert results into a dataframe

In [8]:
import pandas as pd

resumes_dict = pd.Series(
    resumes["answer.example"].values, index=resumes["model.model"]
).to_dict()
resumes_dict

{'gpt-4o': "Sure, here's a summary of a resume for an average software engineer:\n\n---\n\n**Name:** John Doe\n\n**Contact Information:**\n- Email: john.doe@example.com\n- Phone: (123) 456-7890\n- LinkedIn: linkedin.com/in/johndoe\n\n**Summary:**\nDetail-oriented and proficient software engineer with 5+ years of experience in developing and maintaining software applications. Skilled in various programming languages and frameworks with a strong focus on problem-solving and optimizing performance.\n\n**Technical Skills:**\n- Programming Languages: Java, Python, JavaScript, C++\n- Frameworks: React, Angular, Spring Boot, Django\n- Databases: MySQL, PostgreSQL, MongoDB\n- Tools: Git, Docker, Jenkins, Kubernetes\n- Cloud Services: AWS, Azure\n\n**Professional Experience:**\n\n**Software Engineer**\nXYZ Tech Solutions, San Francisco, CA\nJune 2018 – Present\n- Developed and maintained web applications using Java and Spring Boot.\n- Collaborated with cross-functional teams to design and imple
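The model-to-resume mapping built above can be checked on a small stand-in dataframe. This is a minimal sketch: `toy_resumes` and its placeholder strings are hypothetical, and only the column names `model.model` and `answer.example` are taken from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for example_results.to_pandas(); the answers are
# placeholder strings, not model output.
toy_resumes = pd.DataFrame(
    {
        "model.model": ["gpt-3.5-turbo", "gpt-4o"],
        "answer.example": ["resume text A", "resume text B"],
    }
)

# Same construction as in the notebook: values indexed by model name.
via_series = pd.Series(
    toy_resumes["answer.example"].values, index=toy_resumes["model.model"]
).to_dict()

# An equivalent construction without the intermediate Series.
via_zip = dict(zip(toy_resumes["model.model"], toy_resumes["answer.example"]))

print(via_series == via_zip)
```

Both forms keep one entry per model name, so if a model appeared in more than one row, only one of its answers would be retained.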

## Designing AI agents
Next we can generate some relevant personas for AI agents that we will instruct the models to reference in answering questions about the content:

In [9]:
q_hr = QuestionFreeText(
    question_name="hr", 
    question_text="Draft a persona of a human resources manager."
)

In [10]:
q_se = QuestionFreeText(
    question_name="se", 
    question_text="Draft a persona of a senior engineer."
)

In [11]:
hr_persona = q_hr.run().select("hr").to_list()[0]
hr_persona

"Name: Sarah Thompson\n\nTitle: Human Resources Manager\n\nBackground:\nSarah Thompson has over 15 years of experience in the field of human resources. She holds a Bachelor's degree in Human Resource Management and a Master's degree in Organizational Psychology. Sarah has worked in various industries including technology, healthcare, and finance, giving her a well-rounded perspective on managing diverse workforces.\n\nPersonality:\nSarah is known for her empathetic and approachable demeanor. She believes in open communication and strives to create a supportive environment where employees feel valued and heard. Her colleagues often describe her as a good listener, fair, and solution-oriented.\n\nSkills and Expertise:\n- Talent Acquisition and Recruitment: Sarah has a knack for identifying and attracting top talent. She has developed comprehensive recruitment strategies that have significantly reduced time-to-hire and improved employee retention rates.\n- Employee Relations: Sarah excels

In [12]:
se_persona = q_se.run().select("se").to_list()[0]
se_persona

"Name: Alex Thompson\n\nTitle: Senior Software Engineer\n\nBackground:\nAlex Thompson has over 15 years of experience in the software engineering field. With a Bachelor's degree in Computer Science from Stanford University and a Master's degree in Software Engineering from MIT, Alex has a robust academic foundation. Throughout his career, Alex has worked with several high-profile tech companies, including Google, Microsoft, and Amazon, where he has led numerous successful projects and product launches.\n\nSkills:\n- Proficient in multiple programming languages, including Python, Java, C++, and JavaScript.\n- Extensive experience with cloud computing platforms like AWS, Azure, and Google Cloud.\n- Strong background in software architecture and design patterns.\n- Expertise in DevOps practices, CI/CD pipelines, and containerization using Docker and Kubernetes.\n- In-depth knowledge of database management systems, both SQL and NoSQL.\n- Excellent problem-solving skills and a knack for deb

Here we construct agents by passing the personas as `traits` to `Agent` objects. We also create an agent without a persona for comparing the responses:

In [13]:
from edsl import AgentList, Agent

agents = AgentList([
    Agent(traits={"role": "", "persona": ""}),
    Agent(traits={"role": "Human resources", "persona": hr_persona}),
    Agent(traits={"role": "Senior engineer", "persona": se_persona}),
])

## Conducting a review
Next we define helper functions for improving a resume and for scoring the improved version:

In [14]:
def improve(resume, model):
    q_improve = QuestionFreeText(
        question_name="improve",
        question_text=f"""Draft an improved version of the following resume: {resume}""",
    )
    r_improve = q_improve.by(model).run().select("improve").to_list()[0]
    return r_improve

In [15]:
def score(resume, agent, model):
    q_score = QuestionLinearScale(
        question_name="score",
        question_text=f"""Rank the following resume on a scale from 0 to 10: {resume}""",
        question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        option_labels={0: "Very poor", 10: "Excellent"},
    )
    r_score = q_score.by(agent).by(model).run().select("score").to_list()[0]
    return r_score

Finally, we conduct the review: each model improves each drafted resume, and then each agent scores each improved version using each of the models that we specified:

In [16]:
results = []

for drafting_model, resume in resumes_dict.items():

    for improving_model in models:
        improved_resume = improve(resume, improving_model)

        for scoring_model in models:
            for agent in agents:
                score_result = score(improved_resume, agent, scoring_model)

                result = {
                    "drafting_model": drafting_model,
                    "improving_model": improving_model.model,
                    "scoring_model": scoring_model.model,
                    "score": score_result,
                    "persona": agent.traits["role"],
                }
                results.append(result)
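The nested loops above make one `improve` call per (drafted resume, improving model) pair and one `score` call per (improved resume, scoring model, agent) combination. A rough sketch of the resulting counts, using plain Python with the model names and roles copied from the earlier cells (no EDSL calls are made, and it assumes each model returned exactly one drafted resume):

```python
from itertools import product

# Names copied from the cells above; nothing here calls a model.
model_names = ["gpt-3.5-turbo", "gpt-4o", "claude-3-5-sonnet-20240620"]
roles = ["", "Human resources", "Senior engineer"]

# (drafting_model, improving_model) pairs: one improve() call each.
improve_calls = list(product(model_names, model_names))

# (drafting, improving, scoring, role) combinations: one score() call each.
score_rows = list(product(model_names, model_names, model_names, roles))

print(len(improve_calls))  # 9
print(len(score_rows))  # 81
```

Under that assumption, the `results` list should end up with 81 entries, one per row of the final dataframe.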

In [17]:
df = pd.DataFrame(results)

In [18]:
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 1000)
print(df)

                drafting_model             improving_model               scoring_model  score          persona
0                       gpt-4o               gpt-3.5-turbo               gpt-3.5-turbo      8                 
1                       gpt-4o               gpt-3.5-turbo               gpt-3.5-turbo      6  Human resources
2                       gpt-4o               gpt-3.5-turbo               gpt-3.5-turbo      4  Senior engineer
3                       gpt-4o               gpt-3.5-turbo                      gpt-4o      9                 
4                       gpt-4o               gpt-3.5-turbo                      gpt-4o      9  Human resources
5                       gpt-4o               gpt-3.5-turbo                      gpt-4o      8  Senior engineer
6                       gpt-4o               gpt-3.5-turbo  claude-3-5-sonnet-20240620      9                 
7                       gpt-4o               gpt-3.5-turbo  claude-3-5-sonnet-20240620      9  Human resources
8
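One possible way to look for self-preference in a table like this is to compare scores where the scoring model is also the improving model against scores where it is not. The sketch below uses a tiny hypothetical dataframe (`toy`, with made-up model names `a` and `b` and made-up scores); only the column names match `df` above, and the numbers are not results from this notebook.

```python
import pandas as pd

# Hypothetical scores for illustration only.
toy = pd.DataFrame(
    {
        "improving_model": ["a", "a", "b", "b"],
        "scoring_model": ["a", "b", "a", "b"],
        "score": [9, 7, 6, 8],
    }
)

# Label each row by whether the scorer also produced the improved resume.
toy["scorer"] = (toy["improving_model"] == toy["scoring_model"]).map(
    {True: "same model", False: "different model"}
)

# Mean score for self-scored vs. cross-scored resumes.
mean_by_scorer = toy.groupby("scorer")["score"].mean()

# Mean score for every (improving_model, scoring_model) pair.
pivot = toy.pivot_table(
    index="improving_model", columns="scoring_model", values="score", aggfunc="mean"
)

print(mean_by_scorer)
print(pivot)
```

The same two steps could be applied to `df` directly; adding `persona` to the grouping would separate the effect of the agent's role from that of the scoring model.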

## Posting to the Coop
The [Coop](https://www.expectedparrot.com/explore) is a platform for creating, storing and sharing LLM-based research.
It is fully integrated with EDSL and accessible from your workspace or Coop account page.
Learn more about [creating an account](https://www.expectedparrot.com/login) and [using the Coop](https://docs.expectedparrot.com/en/latest/coop.html).

Here we post this notebook to the Coop:

In [19]:
from edsl import Notebook

In [20]:
n = Notebook(path="explore_llm_biases.ipynb")

In [21]:
n.push(description="Example code for comparing model responses and biases", visibility="public")

{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/cbddb2ef-ca23-4ba6-a6e8-afe81c71ce41',
 'uuid': 'cbddb2ef-ca23-4ba6-a6e8-afe81c71ce41',
 'version': '0.1.33.dev1',
 'visibility': 'public'}