# Cognitive testing & LLM biases
This notebook provides example code for using [EDSL](https://docs.expectedparrot.com) to investigate biases of large language models. 

[EDSL is an open-source library](https://github.com/expectedparrot/edsl) for simulating surveys, experiments and other research with AI agents and large language models. 
Before running the code below, please ensure that you have [installed the EDSL library](https://docs.expectedparrot.com/en/latest/installation.html) and either [activated remote inference](https://docs.expectedparrot.com/en/latest/remote_inference.html) from your [Coop account](https://docs.expectedparrot.com/en/latest/coop.html) or [stored API keys](https://docs.expectedparrot.com/en/latest/api_keys.html) for the language models that you want to use with EDSL. Please also see our [documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started using EDSL.

## Selecting language models
To see the models currently available for use with EDSL:

In [1]:
from edsl import ModelList, Model

Model.available()



| | Model Name | Service Name |
|---|---|---|
| 0 | gemini-1.0-pro | google |
| 1 | gemini-1.5-flash | google |
| 2 | gemini-1.5-pro | google |
| 3 | gemini-pro | google |
| 4 | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | together |
| 5 | mistralai/Mixtral-8x22B-Instruct-v0.1 | together |
| 6 | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo | together |
| 7 | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | together |
| 8 | Gryphe/MythoMax-L2-13b-Lite | together |
| 9 | Salesforce/Llama-Rank-V1 | together |


We select models to use by creating `Model` objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.

To check the current default model:

In [2]:
Model()

| | key | value |
|---|---|---|
| 0 | model | gpt-4o |
| 1 | parameters:temperature | 0.500000 |
| 2 | parameters:max_tokens | 1000 |
| 3 | parameters:top_p | 1 |
| 4 | parameters:frequency_penalty | 0 |
| 5 | parameters:presence_penalty | 0 |
| 6 | parameters:logprobs | False |
| 7 | parameters:top_logprobs | 3 |


Here we select several models to compare their responses for the survey that we create in the steps below:

In [3]:
models = ModelList(
    Model(m) for m in ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

## Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice and free text, which can be selected based on the desired format of the response. See the [question types documentation](https://docs.expectedparrot.com/en/latest/questions.html#question-type-classes) for details about all of them. We can use `QuestionFreeText` to prompt the models to generate some content for our experiment:

In [4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name="haiku",
    question_text="Draft a haiku about the weather in New England. Return only the haiku."
)

We generate a response to the question by adding the models to use with the `by` method and then calling the `run` method. This generates a `Results` object with a `Result` for each response to the question:

In [5]:
results = q.by(models).run()

0,1
Job UUID,344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c
Progress Bar URL,https://www.expectedparrot.com/home/remote-job-progress/344e9a4a-8ea1-4d00-83b3-1feb9e00ca6c
Error Report URL,
Results UUID,020aff9e-d90e-4959-979a-b36c2cbaba8a
Results URL,


To see a list of all components of results:

In [6]:
results.columns 

| | Column |
|---|---|
| 0 | agent.agent_instruction |
| 1 | agent.agent_name |
| 2 | answer.haiku |
| 3 | comment.haiku_comment |
| 4 | generated_tokens.haiku_generated_tokens |
| 5 | iteration.iteration |
| 6 | model.frequency_penalty |
| 7 | model.logprobs |
| 8 | model.maxOutputTokens |
| 9 | model.max_tokens |


We can inspect components of the results individually:

In [7]:
results.select("model", "haiku")

| | model.model | answer.haiku |
|---|---|---|
| 0 | gemini-1.5-flash | Sun, then snow, then rain, Wind howls a New England tune, |
| 1 | gpt-4o | Crisp leaves whispering, Misty mornings, fleeting sun— |
| 2 | claude-3-5-sonnet-20240620 | Fickle seasons change Snow melts, flowers bloom, leaves fall |


## Conducting a review
Next we create a question that asks a model to evaluate a response, which we pass in as an input to the new question:

In [8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Very poor", 10: "Excellent"},
)

## Parameterizing questions
We use `Scenario` objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as `Results` objects:

In [9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "haiku")
    .rename({"model": "drafting_model"})  # rename 'model' so it is not confused with the scoring model
)
scenarios

| | haiku | drafting_model |
|---|---|---|
| 0 | Sun, then snow, then rain, Wind howls a New England tune, | gemini-1.5-flash |
| 1 | Crisp leaves whispering, Misty mornings, fleeting sun— | gpt-4o |
| 2 | Fickle seasons change Snow melts, flowers bloom, leaves fall | claude-3-5-sonnet-20240620 |
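
To make the parameterization concrete, here is a minimal standard-library sketch of how a `{{ haiku }}` placeholder can be filled from each scenario. It is an illustration only, not EDSL's own rendering code: the helper `fill_placeholders` and the plain dictionaries standing in for `Scenario` objects are hypothetical.

```python
import re

# Plain dicts standing in for the scenarios shown above (hypothetical stand-ins,
# not EDSL Scenario objects). Haiku text is copied as displayed in the output.
scenarios = [
    {"drafting_model": "gemini-1.5-flash",
     "haiku": "Sun, then snow, then rain, Wind howls a New England tune,"},
    {"drafting_model": "gpt-4o",
     "haiku": "Crisp leaves whispering, Misty mornings, fleeting sun—"},
    {"drafting_model": "claude-3-5-sonnet-20240620",
     "haiku": "Fickle seasons change Snow melts, flowers bloom, leaves fall"},
]

QUESTION_TEXT = "Score the following haiku on a scale from 0 to 10: {{ haiku }}"


def fill_placeholders(template: str, values: dict) -> str:
    """Replace each {{ key }} with values[key]; unknown keys are left as-is."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(values.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# One prompt per scenario. Note that drafting_model is never inserted,
# because the question text has no placeholder for it.
prompts = [fill_placeholders(QUESTION_TEXT, s) for s in scenarios]
for prompt in prompts:
    print(prompt)
```

Because only `haiku` appears in the question text, the `drafting_model` field travels with each scenario for later analysis without being shown to the scoring model.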


Finally, we conduct the evaluation by having each model score each haiku that was generated (without information about whether the model itself was the source):

In [10]:
results = q_score.by(scenarios).by(models).run()

0,1
Job UUID,6e35a7f6-78ca-4bfc-9a58-e01f910a2956
Progress Bar URL,https://www.expectedparrot.com/home/remote-job-progress/6e35a7f6-78ca-4bfc-9a58-e01f910a2956
Error Report URL,
Results UUID,d8d41e36-0caa-4a3f-9fee-45268776a7aa
Results URL,
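
As a rough picture of what this job covers, the sketch below enumerates the scoring tasks with the standard library. It assumes that the job pairs every scenario with every model, which is consistent with the nine rows of results printed further down; the `jobs` list and its keys are hypothetical names used only for illustration.

```python
from itertools import product

models = ["gemini-1.5-flash", "gpt-4o", "claude-3-5-sonnet-20240620"]
drafting_models = list(models)  # each model also drafted exactly one haiku

# One scoring task per (haiku, scoring model) pair.
jobs = [
    {"drafting_model": drafter, "scoring_model": scorer}
    for drafter, scorer in product(drafting_models, models)
]

# The tasks in which a model scores its own haiku.
self_reviews = [j for j in jobs if j["drafting_model"] == j["scoring_model"]]

print(len(jobs), "scoring tasks,", len(self_reviews), "of them self-reviews")
```

With three haikus and three scoring models this gives 3 × 3 = 9 tasks, three of which are a model scoring its own haiku.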


In [11]:
results.columns

| | Column |
|---|---|
| 0 | agent.agent_instruction |
| 1 | agent.agent_name |
| 2 | answer.score |
| 3 | comment.score_comment |
| 4 | generated_tokens.score_generated_tokens |
| 5 | iteration.iteration |
| 6 | model.frequency_penalty |
| 7 | model.logprobs |
| 8 | model.maxOutputTokens |
| 9 | model.max_tokens |


In [12]:
(
    results.sort_by("drafting_model", "model")
    .select("drafting_model", "model", "score", "haiku")
    .print(
        pretty_labels = {
            "scenario.drafting_model": "Drafting model",
            "model.model": "Scoring model",
            "answer.score": "Score",
            "scenario.haiku": "Haiku"
        }
    )
)

| | Drafting model | Scoring model | Score | Haiku |
|---|---|---|---|---|
| 0 | claude-3-5-sonnet-20240620 | claude-3-5-sonnet-20240620 | 8 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
| 1 | claude-3-5-sonnet-20240620 | gemini-1.5-flash | 7 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
| 2 | claude-3-5-sonnet-20240620 | gpt-4o | 6 | Fickle seasons change Snow melts, flowers bloom, leaves fall |
| 3 | gemini-1.5-flash | claude-3-5-sonnet-20240620 | 7 | Sun, then snow, then rain, Wind howls a New England tune, |
| 4 | gemini-1.5-flash | gemini-1.5-flash | 7 | Sun, then snow, then rain, Wind howls a New England tune, |
| 5 | gemini-1.5-flash | gpt-4o | 6 | Sun, then snow, then rain, Wind howls a New England tune, |
| 6 | gpt-4o | claude-3-5-sonnet-20240620 | 8 | Crisp leaves whispering, Misty mornings, fleeting sun— |
| 7 | gpt-4o | gemini-1.5-flash | 7 | Crisp leaves whispering, Misty mornings, fleeting sun— |
| 8 | gpt-4o | gpt-4o | 8 | Crisp leaves whispering, Misty mornings, fleeting sun— |
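
One simple way to read these scores for signs of self-preference is to compare the score each model gave its own haiku with the average score the other models gave it. The sketch below does this with the standard library, using the values copied from the table above. The "gap" measure and the helper names are our own illustrative choices, not part of EDSL, and a single run with one haiku per model is far too little data to establish a bias.

```python
from statistics import mean

# (drafting model, scoring model, score), copied from the table above.
rows = [
    ("claude-3-5-sonnet-20240620", "claude-3-5-sonnet-20240620", 8),
    ("claude-3-5-sonnet-20240620", "gemini-1.5-flash", 7),
    ("claude-3-5-sonnet-20240620", "gpt-4o", 6),
    ("gemini-1.5-flash", "claude-3-5-sonnet-20240620", 7),
    ("gemini-1.5-flash", "gemini-1.5-flash", 7),
    ("gemini-1.5-flash", "gpt-4o", 6),
    ("gpt-4o", "claude-3-5-sonnet-20240620", 8),
    ("gpt-4o", "gemini-1.5-flash", 7),
    ("gpt-4o", "gpt-4o", 8),
]


def self_preference_gaps(rows):
    """For each drafter: its own score for its haiku minus the mean of the other scorers."""
    gaps = {}
    for model in sorted({drafter for drafter, _, _ in rows}):
        own = [s for d, scorer, s in rows if d == model and scorer == model]
        others = [s for d, scorer, s in rows if d == model and scorer != model]
        gaps[model] = mean(own) - mean(others)
    return gaps


def mean_score_given(rows):
    """Average score each scoring model handed out, as a rough leniency check."""
    return {
        model: mean(s for _, scorer, s in rows if scorer == model)
        for model in sorted({scorer for _, scorer, _ in rows})
    }


gaps = self_preference_gaps(rows)
leniency = mean_score_given(rows)
for model in gaps:
    print(f"{model}: gap {gaps[model]:+.1f}, mean score given {leniency[model]:.2f}")
```

On this run the gaps come out as +1.5 for claude-3-5-sonnet-20240620 and +0.5 for each of the other two models. The mean score given is reported alongside because a model that scores everything high will look self-preferring on the gap alone; repeating the experiment over many haikus and iterations would be needed before drawing conclusions.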


## Posting to the Coop
The [Coop](https://www.expectedparrot.com/content/explore) is a platform for creating, storing and sharing LLM-based research.
It is fully integrated with EDSL and accessible from your workspace or Coop account page.
Learn more about [creating an account](https://www.expectedparrot.com/login) and [using the Coop](https://docs.expectedparrot.com/en/latest/coop.html).

Here we post this notebook:

In [13]:
from edsl import Notebook

In [14]:
n = Notebook(path = "explore_llm_biases.ipynb")

In [15]:
info = n.push(description = "Example code for comparing model responses and biases", visibility = "public")
info

{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/9e010472-2c69-4728-98b6-10f23819ed08',
 'uuid': '9e010472-2c69-4728-98b6-10f23819ed08',
 'version': '0.1.39.dev2',
 'visibility': 'public'}

To update an object:

In [16]:
n = Notebook(path = "explore_llm_biases.ipynb") # re-create the object after saving the notebook again

In [17]:
n.patch(uuid = info["uuid"], value = n)

{'status': 'success'}