# Cognitive testing & LLM biases
This notebook provides example code for using [EDSL](https://docs.expectedparrot.com) to investigate biases of large language models. 

[EDSL is an open-source library](https://github.com/expectedparrot/edsl) for simulating surveys, experiments and other research with AI agents and large language models. 
Before running the code below, please ensure that you have [installed the EDSL library](https://docs.expectedparrot.com/en/latest/installation.html) and either [activated remote inference](https://docs.expectedparrot.com/en/latest/remote_inference.html) from your [Coop account](https://docs.expectedparrot.com/en/latest/coop.html) or [stored API keys](https://docs.expectedparrot.com/en/latest/api_keys.html) for the language models that you want to use with EDSL. Please also see our [documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started using EDSL.

## Selecting language models
To see a list of the models currently available to use with EDSL:

In [1]:
from edsl import ModelList, Model

# Model.available # uncomment and run this code

We select models to use by creating `Model` objects that can be added to a survey when it is run. If we do not specify a model, the default model is used with the survey.

To check the current default model:

In [2]:
# Model() # uncomment and run this code

Here we select several models to compare their responses for the survey that we create in the steps below:

In [3]:
models = ModelList(
    Model(m) for m in ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

## Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice, free text, etc. These can be selected based on the desired format of the response. See details about all types [here](https://docs.expectedparrot.com/en/latest/questions.html#question-type-classes). We can use `QuestionFreeText` to prompt the models to generate some content for our experiment:

In [4]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name="haiku",
    question_text="Draft a haiku about the weather in New England. Return only the haiku."
)

We generate a response to the question by adding the models to use with the `by` method and then calling the `run` method. This generates a `Results` object with a `Result` for each response to the question:

In [5]:
results = q.by(models).run()

To see a list of all components of results:

In [6]:
# results.columns # uncomment and run this code

We can inspect components of the results individually:

In [7]:
results.select("model", "haiku").print(format="rich")

## Conducting a review
Next we create a question that asks a model to evaluate a response. The response is passed in as an input to the new question through the `{{ haiku }}` placeholder:

In [8]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Very poor", 10: "Excellent"},
)
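
The `{{ haiku }}` placeholder in the question text is filled in with a value for each haiku when the question is run. As a rough illustration of that substitution step only, here is a minimal plain-Python sketch. The `render` helper and the placeholder haiku text are hypothetical and are not part of EDSL, which handles the templating internally:

```python
import re


def render(template, scenario):
    """Fill {{ name }} placeholders from a dict (simplified stand-in, not EDSL code)."""

    def substitute(match):
        # match.group(1) is the variable name between the double braces
        return str(scenario[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Score the following haiku on a scale from 0 to 10: {{ haiku }}"
scenario = {"haiku": "<haiku text goes here>"}  # placeholder value, not a model response

prompt = render(template, scenario)
print(prompt)
```

Each distinct value for `haiku` yields a separate prompt, which is why the same question can be reused for every haiku.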

## Parameterizing questions
We use `Scenario` objects to add each response to the new question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), as well as from `Results` objects:

In [9]:
scenarios = (
    results.to_scenario_list()
    .select("model", "haiku")
    .rename({"model":"drafting_model"}) # renaming the 'model' field to distinguish the evaluating model 
)
scenarios
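
To make the effect of the select-and-rename chain concrete, the sketch below applies the same two steps to a plain list of dictionaries. It assumes each result row carries a `model` name and a `haiku` field plus other fields we do not need. The `select` and `rename` helpers and the row contents are hypothetical stand-ins, not EDSL's implementation:

```python
# Hypothetical result rows; the haiku strings are placeholders, not model output.
rows = [
    {"model": "gemini-pro", "haiku": "<haiku 1>", "temperature": 0.5},
    {"model": "gpt-4o", "haiku": "<haiku 2>", "temperature": 0.5},
    {"model": "claude-3-5-sonnet-20240620", "haiku": "<haiku 3>", "temperature": 0.5},
]


def select(rows, *keys):
    """Keep only the named fields in each row."""
    return [{key: row[key] for key in keys} for row in rows]


def rename(rows, mapping):
    """Rename fields according to mapping, leaving other field names unchanged."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


scenario_rows = rename(select(rows, "model", "haiku"), {"model": "drafting_model"})
for row in scenario_rows:
    print(row)
```

Renaming `model` to `drafting_model` matters because the scoring run adds its own `model` field for the evaluating model, and the two would otherwise be hard to tell apart.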

Finally, we conduct the evaluation by having each model score each haiku that was generated (without information about whether the model itself was the source):

In [10]:
results = q_score.by(scenarios).by(models).run()
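
Because every model scores every haiku, the run covers the full cross product of scenarios and models. The sketch below enumerates those pairings in plain Python, assuming one haiku per drafting model as in the earlier run. The `jobs` structure and the placeholder haiku strings are illustrative only and do not reflect how EDSL schedules work internally:

```python
from itertools import product

model_names = ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]

# One placeholder haiku per drafting model (assumption: a single response each).
haikus = [
    {"drafting_model": name, "haiku": f"<haiku by {name}>"} for name in model_names
]

# Every (haiku, scoring model) pair becomes one scoring task.
jobs = [
    {"scoring_model": scorer, **haiku} for haiku, scorer in product(haikus, model_names)
]

# Tasks in which a model scores its own haiku.
self_reviews = [job for job in jobs if job["scoring_model"] == job["drafting_model"]]

print(len(jobs), "scoring tasks, of which", len(self_reviews), "are self-reviews")
```

The self-review pairs are the ones of interest when looking for a model's preference for its own output.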

In [11]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'answer.score',
 'comment.score_comment',
 'generated_tokens.score_generated_tokens',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.maxOutputTokens',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.stopSequences',
 'model.temperature',
 'model.topK',
 'model.topP',
 'model.top_logprobs',
 'model.top_p',
 'prompt.score_system_prompt',
 'prompt.score_user_prompt',
 'question_options.score_question_options',
 'question_text.score_question_text',
 'question_type.score_question_type',
 'raw_model_response.score_cost',
 'raw_model_response.score_one_usd_buys',
 'raw_model_response.score_raw_model_response',
 'scenario.drafting_model',
 'scenario.haiku']

In [12]:
(
    results.sort_by("drafting_model", "model")
    .select("drafting_model", "model", "score", "haiku")
    .print(
        pretty_labels = {
            "scenario.drafting_model": "Drafting model",
            "model.model": "Scoring model",
            "answer.score": "Score",
            "scenario.haiku": "Haiku"
        },
        format="rich"
    )
)
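
One way to summarize a table like this is to compare the score each model gives its own haiku with the average score it gives the other models' haikus. The sketch below is a minimal example of that calculation, not an established bias metric. The `self_preference_gap` function, the model names and every score in `rows` are made-up placeholders for illustration and are not results from this notebook:

```python
from statistics import mean

# (drafting_model, scoring_model, score) with placeholder values, NOT real results.
rows = [
    ("model_a", "model_a", 8), ("model_a", "model_b", 7), ("model_a", "model_c", 6),
    ("model_b", "model_a", 6), ("model_b", "model_b", 7), ("model_b", "model_c", 6),
    ("model_c", "model_a", 7), ("model_c", "model_b", 5), ("model_c", "model_c", 9),
]


def self_preference_gap(rows):
    """For each scorer: mean score given to its own haikus minus mean given to others'."""
    gaps = {}
    for scorer in sorted({scoring for _, scoring, _ in rows}):
        own = [score for drafting, scoring, score in rows
               if scoring == scorer and drafting == scorer]
        others = [score for drafting, scoring, score in rows
                  if scoring == scorer and drafting != scorer]
        if own and others:  # skip scorers that lack either kind of rating
            gaps[scorer] = mean(own) - mean(others)
    return gaps


gaps = self_preference_gap(rows)
for scorer, gap in gaps.items():
    print(scorer, gap)
```

A positive gap on its own does not establish bias, since a model's haiku may simply be better. Checking how the other scorers rate the same haiku, and repeating the run over many haikus, gives a fairer comparison.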

## Posting to the Coop
The [Coop](https://www.expectedparrot.com/explore) is a platform for creating, storing and sharing LLM-based research.
It is fully integrated with EDSL and accessible from your workspace or Coop account page.
Learn more about [creating an account](https://www.expectedparrot.com/login) and [using the Coop](https://docs.expectedparrot.com/en/latest/coop.html).

Here we post this notebook:

In [13]:
from edsl import Notebook

In [14]:
n = Notebook(path = "explore_llm_biases.ipynb")

In [15]:
n.push(description = "Example code for comparing model responses and biases", visibility = "public")

{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/07ec8176-c07e-4f83-acd5-791e3d9324d2',
 'uuid': '07ec8176-c07e-4f83-acd5-791e3d9324d2',
 'version': '0.1.33.dev1',
 'visibility': 'public'}

To update an object:

In [16]:
n = Notebook(path = "explore_llm_biases.ipynb") # re-create the Notebook object from the saved file

In [17]:
n.patch(uuid = "07ec8176-c07e-4f83-acd5-791e3d9324d2", value = n)

{'status': 'success'}