# Cognitive testing & LLM biases
This notebook provides example code for using [EDSL](https://docs.expectedparrot.com) to investigate biases of large language models. 

[EDSL is an open-source library](https://github.com/expectedparrot/edsl) for simulating surveys, experiments and other research with AI agents and large language models. 
Before running the code below, please ensure that you have [installed the EDSL library](https://docs.expectedparrot.com/en/latest/installation.html) and either [activated remote inference](https://docs.expectedparrot.com/en/latest/remote_inference.html) from your [Coop account](https://docs.expectedparrot.com/en/latest/coop.html) or [stored API keys](https://docs.expectedparrot.com/en/latest/api_keys.html) for the language models that you want to use with EDSL. Please also see our [documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started using EDSL.

## Selecting language models
To see the models currently available to use with EDSL:

In [1]:
from edsl import ModelList, Model

Model.available 

<bound method Model.available of Available models: [['Austism/chronos-hermes-13b-v2', 'deep_infra', 0], ['BAAI/bge-base-en-v1.5', 'together', 1], ['BAAI/bge-large-en-v1.5', 'together', 2], ['Gryphe/MythoMax-L2-13b', 'deep_infra', 3], ['Gryphe/MythoMax-L2-13b', 'together', 4], ['Gryphe/MythoMax-L2-13b-Lite', 'together', 5], ['Meta-Llama/Llama-Guard-7b', 'together', 6], ['NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO', 'together', 7], ['NousResearch/Nous-Hermes-2-Yi-34B', 'together', 8], ['Qwen/Qwen1.5-110B-Chat', 'together', 9], ['Qwen/Qwen1.5-72B-Chat', 'together', 10], ['Qwen/Qwen2-72B-Instruct', 'deep_infra', 11], ['Qwen/Qwen2-72B-Instruct', 'together', 12], ['Qwen/Qwen2-7B-Instruct', 'deep_infra', 13], ['Qwen/Qwen2.5-72B-Instruct', 'deep_infra', 14], ['Salesforce/Llama-Rank-V1', 'together', 15], ['Sao10K/L3-70B-Euryale-v2.1', 'deep_infra', 16], ['Sao10K/L3.1-70B-Euryale-v2.2', 'deep_infra', 17], ['WhereIsAI/UAE-Large-V1', 'together', 18], ['amazon.titan-text-express-v1', 'bedrock', 19

We select models by creating `Model` objects that can be added to a survey when it is run. If we do not specify a model, the default model is used.

To check the current default model:

In [2]:
Model()

key,value
model,gpt-4o
parameters:temperature,0.5
parameters:max_tokens,1000
parameters:top_p,1
parameters:frequency_penalty,0
parameters:presence_penalty,0
parameters:logprobs,False
parameters:top_logprobs,3


Here we select several models to compare their responses for the survey that we create in the steps below:

In [3]:
models = ModelList(
    Model(m) for m in ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

## Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice and free text, which can be selected based on the desired format of the response. See the [question types documentation](https://docs.expectedparrot.com/en/latest/questions.html#question-type-classes) for details. We can use `QuestionFreeText` to prompt the models to generate some content for our experiment:

In [4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name="haiku",
    question_text="Draft a haiku about the weather in New England. Return only the haiku."
)

We generate a response to the question by adding the models to use with the `by` method and then calling the `run` method. This generates a `Results` object with a `Result` for each response to the question:

In [5]:
results = q.by(models).run()

To see a list of all components of results:

In [6]:
results.columns 

0
agent.agent_instruction
agent.agent_name
answer.haiku
comment.haiku_comment
generated_tokens.haiku_generated_tokens
iteration.iteration
model.frequency_penalty
model.logprobs
model.maxOutputTokens
model.max_tokens


We can inspect components of the results individually:

In [7]:
results.select("model", "haiku")

model.model,answer.haiku
gemini-pro,"Snow in winter's grasp, Autumn's leaves in vibrant hues,"
gpt-4o,"Crisp leaves whispering, Misty mornings, fleeting sun—"
claude-3-5-sonnet-20240620,"Fickle seasons change Snow melts, flowers bloom, leaves fall"


## Conducting a review
Next we create a question that asks a model to evaluate a response, which we pass in as an input to the new question:

In [8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Very poor", 10: "Excellent"},
)
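
A linear scale restricts the answer to one of the listed options. As a rough illustration of that constraint, the sketch below checks a raw reply against the same 0 to 10 range. It is a standalone example, not EDSL's own validation logic, and `is_valid_score` is a hypothetical helper introduced only for this illustration.

```python
# Hypothetical helper (not part of EDSL): checks a raw reply against the
# scale options used in the question above.
QUESTION_OPTIONS = list(range(0, 11))  # same values as question_options: 0..10


def is_valid_score(raw_reply: str) -> bool:
    """Return True if the reply parses as an integer that is one of the options."""
    try:
        value = int(raw_reply.strip())
    except ValueError:
        # Replies such as "good" or "8.5" are not valid points on the scale.
        return False
    return value in QUESTION_OPTIONS


print(is_valid_score("8"))     # a point on the scale
print(is_valid_score("11"))    # outside the scale
print(is_valid_score("good"))  # not a number
```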

## Parameterizing questions
We use `Scenario` objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as `Results` objects:

In [9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "haiku")
    .rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model 
)
scenarios

haiku,drafting_model
"Snow in winter's grasp, Autumn's leaves in vibrant hues,",gemini-pro
"Crisp leaves whispering, Misty mornings, fleeting sun—",gpt-4o
"Fickle seasons change Snow melts, flowers bloom, leaves fall",claude-3-5-sonnet-20240620

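To make the parameterization concrete, the sketch below fills the `{{ haiku }}` placeholder once per scenario using a plain regular expression. It is only an illustration of the idea: EDSL performs the substitution itself when the question is run, and `render` is a hypothetical helper, not part of the library. Two of the three scenarios from the output above are used.

```python
import re

QUESTION_TEXT = "Score the following haiku on a scale from 0 to 10: {{ haiku }}"

# Scenario values copied from the output above (two of the three rows).
scenarios = [
    {
        "haiku": "Snow in winter's grasp, Autumn's leaves in vibrant hues,",
        "drafting_model": "gemini-pro",
    },
    {
        "haiku": "Fickle seasons change Snow melts, flowers bloom, leaves fall",
        "drafting_model": "claude-3-5-sonnet-20240620",
    },
]


def render(template: str, scenario: dict) -> str:
    """Replace each {{ key }} placeholder with the matching scenario value."""

    def substitute(match):
        key = match.group(1)
        return str(scenario[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# One rendered prompt per scenario; drafting_model is not referenced in the
# template, so it never appears in the text shown to the scoring model.
prompts = [render(QUESTION_TEXT, s) for s in scenarios]
for prompt in prompts:
    print(prompt)
```

Because only `haiku` appears in the question text, the `drafting_model` field travels with the scenario for later analysis without being revealed in the prompt.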

Finally, we conduct the evaluation by having each model score each haiku that was generated (without information about whether the model itself was the source):

In [10]:
results = q_score.by(scenarios).by(models).run()
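
Having each model score each haiku means one result per scenario and model pair, which matches the nine rows in the output further down. As a standalone illustration of that pairing (not a description of how EDSL schedules its jobs internally), the combinations can be enumerated with `itertools.product`:

```python
from itertools import product

# Each haiku is identified here by the model that drafted it.
drafting_models = ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]
scoring_models = ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]

# Every haiku paired with every scoring model.
pairs = list(product(drafting_models, scoring_models))

# Pairs in which a model scores its own haiku.
self_reviews = [(drafter, scorer) for drafter, scorer in pairs if drafter == scorer]

print(len(pairs))         # 9
print(len(self_reviews))  # 3
```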

In [16]:
results.columns

0
agent.agent_instruction
agent.agent_name
answer.score
comment.score_comment
generated_tokens.score_generated_tokens
iteration.iteration
model.frequency_penalty
model.logprobs
model.maxOutputTokens
model.max_tokens


In [17]:
(
    results.sort_by("drafting_model", "model")
    .select("drafting_model", "model", "score", "haiku")
    .print(
        pretty_labels = {
            "scenario.drafting_model": "Drafting model",
            "model.model": "Scoring model",
            "answer.score": "Score",
            "scenario.haiku": "Haiku"
        }
    )
)

Drafting model,Scoring model,Score,Haiku
claude-3-5-sonnet-20240620,claude-3-5-sonnet-20240620,8,"Fickle seasons change Snow melts, flowers bloom, leaves fall"
claude-3-5-sonnet-20240620,gemini-pro,5,"Fickle seasons change Snow melts, flowers bloom, leaves fall"
claude-3-5-sonnet-20240620,gpt-4o,6,"Fickle seasons change Snow melts, flowers bloom, leaves fall"
gemini-pro,claude-3-5-sonnet-20240620,5,"Snow in winter's grasp, Autumn's leaves in vibrant hues,"
gemini-pro,gemini-pro,8,"Snow in winter's grasp, Autumn's leaves in vibrant hues,"
gemini-pro,gpt-4o,3,"Snow in winter's grasp, Autumn's leaves in vibrant hues,"
gpt-4o,claude-3-5-sonnet-20240620,8,"Crisp leaves whispering, Misty mornings, fleeting sun—"
gpt-4o,gemini-pro,7,"Crisp leaves whispering, Misty mornings, fleeting sun—"
gpt-4o,gpt-4o,8,"Crisp leaves whispering, Misty mornings, fleeting sun—"
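
One simple way to read this table for bias is to compare the score each model gave its own haiku with the average score it gave the other models' haikus. The sketch below does this in plain Python with the scores copied from the output above. The "gap" is an illustrative measure chosen for this example, not one defined by EDSL, and `self_preference_gap` is a hypothetical helper.

```python
from statistics import mean

# (drafting_model, scoring_model, score) rows copied from the table above.
rows = [
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620", 8),
    ("claude-3-5-sonnet-20240620", "gemini-pro", 5),
    ("claude-3-5-sonnet-20240620", "gpt-4o", 6),
    ("gemini-pro", "claude-3-5-sonnet-20240620", 5),
    ("gemini-pro", "gemini-pro", 8),
    ("gemini-pro", "gpt-4o", 3),
    ("gpt-4o", "claude-3-5-sonnet-20240620", 8),
    ("gpt-4o", "gemini-pro", 7),
    ("gpt-4o", "gpt-4o", 8),
]


def self_preference_gap(rows, scorer):
    """Score the scorer gave its own haiku minus its mean score for the others."""
    own = [score for drafter, s, score in rows if s == scorer and drafter == scorer]
    others = [score for drafter, s, score in rows if s == scorer and drafter != scorer]
    return mean(own) - mean(others)


scorers = sorted({scorer for _, scorer, _ in rows})
gaps = {scorer: self_preference_gap(rows, scorer) for scorer in scorers}

for scorer, gap in gaps.items():
    print(f"{scorer}: {gap:+.1f}")
```

In this run the gap is positive for all three models, but nine scores from a single run are far too few to draw conclusions; repeating the experiment over many haikus and iterations would be needed before treating the gap as evidence of self-preference.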


## Posting to the Coop
The [Coop](https://www.expectedparrot.com/content/explore) is a platform for creating, storing and sharing LLM-based research.
It is fully integrated with EDSL and accessible from your workspace or Coop account page.
Learn more about [creating an account](https://www.expectedparrot.com/login) and [using the Coop](https://docs.expectedparrot.com/en/latest/coop.html).

Here we post this notebook:

In [18]:
from edsl import Notebook

In [19]:
n = Notebook(path = "explore_llm_biases.ipynb")

In [20]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info

{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/76b22376-49ad-43b2-aca7-8384c1ade626',
 'uuid': '76b22376-49ad-43b2-aca7-8384c1ade626',
 'version': '0.1.39.dev1',
 'visibility': 'public'}

To update an object:

In [21]:
n = Notebook(path = "explore_llm_biases.ipynb") # re-create the object from the re-saved notebook file

In [22]:
n.patch(uuid = info["uuid"], value = n)

{'status': 'success'}