# Cognitive testing & LLM biases
This notebook shows how to use [EDSL](https://docs.expectedparrot.com) to investigate whether LLMs are biased towards content that they generated themselves compared with content generated by other LLMs.

Please see our docs for details on [installing EDSL](https://docs.expectedparrot.com/en/latest/installation.html) and [getting started](https://docs.expectedparrot.com/en/latest/tutorial_getting_started.html).

## Selecting language models
To list the models currently available in EDSL, uncomment and run `Model.available()`:

In [1]:
from edsl import ModelList, Model

# Model.available()

We select models by creating `Model` objects, which we attach to the question when we run it later. If we do not specify a model, GPT 4 preview is used by default. Here we select several models so that we can compare their responses:

In [2]:
models = ModelList(
    Model(m) for m in ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]
)

## Generating content
EDSL comes with a variety of standard survey question types, such as multiple choice and free text. Choose a type based on the desired format of the response; all types are described [here](https://docs.expectedparrot.com/en/latest/questions.html#question-type-classes). We use `QuestionFreeText` to prompt the models to generate some content for our experiment (a haiku):

In [3]:
from edsl import QuestionFreeText

q = QuestionFreeText(
    question_name="haiku",
    question_text="Draft a haiku about the weather in New England. Return only the haiku."
)

We generate a response to the question by calling the `run` method, after specifying the models to use with the `by` method. This will generate a `Results` object with a `Result` for each response to the question:

In [4]:
results = q.by(models).run()

To see a list of all components of results:

In [5]:
# results.columns

We can inspect components of the results individually:

In [6]:
results.select("model", "haiku").print(format="rich")

## Conducting a review
Next we create a question that asks a model to score a haiku on a linear scale. The `{{ haiku }}` placeholder is filled in later with each haiku that was drafted:

In [7]:
from edsl import QuestionLinearScale

q_score = QuestionLinearScale(
    question_name="score",
    question_text="Score the following haiku on a scale from 0 to 10: {{ haiku }}",
    question_options=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    option_labels={0: "Very poor", 10: "Excellent"},
)

## Parameterizing questions
We can use `Scenario` objects to add the contents of each haiku to the scoring question. EDSL comes with many methods for creating scenarios from different data sources (PDFs, CSVs, docs, images, lists, etc.), including `Results` objects. Here we turn the drafting results into a list of scenarios:

In [8]:
scenarios = (
    results.to_scenario_list()
    .select("model", "haiku")
    # rename 'model' so the drafting model is not confused with the scoring model
    .rename({"model": "drafting_model"})
)
scenarios
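
To make the reshaping concrete, here is a plain-Python sketch of what the select-and-rename step amounts to, and of how a scenario fills the `{{ haiku }}` placeholder in the scoring question. It does not call EDSL: the `rows` records, the `to_scenarios` and `render` helpers, and the placeholder haiku strings are all hypothetical stand-ins, and EDSL's own template rendering may differ in detail.

```python
import re

# Hypothetical stand-ins for rows of a results table (not real model output).
rows = [
    {"model": "model-a", "haiku": "<haiku drafted by model-a>", "temperature": 0.5},
    {"model": "model-b", "haiku": "<haiku drafted by model-b>", "temperature": 0.5},
]


def to_scenarios(rows):
    """Keep only 'model' and 'haiku', renaming 'model' to 'drafting_model'."""
    return [{"drafting_model": r["model"], "haiku": r["haiku"]} for r in rows]


def render(template, scenario):
    """Fill {{ name }} placeholders from a scenario dict (simplified imitation)."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(scenario[m.group(1)]),
        template,
    )


scenarios_sketch = to_scenarios(rows)
template = "Score the following haiku on a scale from 0 to 10: {{ haiku }}"
prompts = [render(template, s) for s in scenarios_sketch]
print(prompts[0])
```

Each scenario yields one fully rendered prompt, which is why the scoring question only needs to be written once.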

Finally, we conduct the review: each of the models that we specified scores each haiku, including the one that it drafted itself:

In [9]:
results = q_score.by(scenarios).by(models).run()
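
As a rough check on how many results to expect, the sketch below enumerates the drafting-model and scoring-model combinations in plain Python. It assumes one haiku per drafting model and one response per combination; it illustrates the design of the experiment and is not a description of how EDSL schedules the calls.

```python
from itertools import product

model_names = ["gemini-pro", "gpt-4o", "claude-3-5-sonnet-20240620"]

# Every haiku (one per drafting model) is scored by every model.
pairs = list(product(model_names, model_names))

# The pairs where a model scores its own haiku are the ones of interest for bias.
self_pairs = [(drafter, scorer) for drafter, scorer in pairs if drafter == scorer]

print(len(pairs), len(self_pairs))
```

Under these assumptions there are nine scores in total, three of which are self-assessments.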

In [10]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'answer.score',
 'comment.score_comment',
 'generated_tokens.score_generated_tokens',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.maxOutputTokens',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.stopSequences',
 'model.temperature',
 'model.topK',
 'model.topP',
 'model.top_logprobs',
 'model.top_p',
 'prompt.score_system_prompt',
 'prompt.score_user_prompt',
 'question_options.score_question_options',
 'question_text.score_question_text',
 'question_type.score_question_type',
 'raw_model_response.score_cost',
 'raw_model_response.score_one_usd_buys',
 'raw_model_response.score_raw_model_response',
 'scenario.drafting_model',
 'scenario.haiku']

In [11]:
(
    results.sort_by("drafting_model", "model")
    .select("drafting_model", "model", "score", "haiku")
    .print(
        pretty_labels = {
            "scenario.drafting_model": "Drafting model",
            "model.model": "Scoring model",
            "answer.score": "Score",
            "scenario.haiku": "Haiku"
        },
        format="rich"
    )
)
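
One simple way to summarize a table like this is to compare, for each scoring model, the score it gave its own haiku with the mean score it gave the others. The sketch below is a minimal plain-Python version of that comparison. The `self_preference_gap` helper is hypothetical, and the scores in `rows` are invented placeholders used only to exercise the function; they are not output from the models above.

```python
from statistics import mean

# Hypothetical (drafting_model, scoring_model, score) rows; placeholder values only.
rows = [
    ("model-a", "model-a", 8), ("model-a", "model-b", 7), ("model-a", "model-c", 6),
    ("model-b", "model-a", 7), ("model-b", "model-b", 9), ("model-b", "model-c", 7),
    ("model-c", "model-a", 6), ("model-c", "model-b", 7), ("model-c", "model-c", 8),
]


def self_preference_gap(rows):
    """Per scoring model: mean score for its own drafts minus mean score for others' drafts."""
    gaps = {}
    for scorer in sorted({s for _, s, _ in rows}):
        own = [score for d, s, score in rows if s == scorer and d == scorer]
        others = [score for d, s, score in rows if s == scorer and d != scorer]
        if own and others:  # skip models that lack either kind of score
            gaps[scorer] = mean(own) - mean(others)
    return gaps


gaps = self_preference_gap(rows)
for scorer, gap in gaps.items():
    print(f"{scorer}: {gap:+.2f}")
```

A positive gap means the model rated its own haiku above the others. With a single haiku per model the gap is only suggestive; repeating the run over more haikus would give a steadier estimate.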

## Posting to the Coop
The [Coop](https://www.expectedparrot.com/explore) is a platform for creating, storing and sharing LLM-based research.
It is fully integrated with EDSL and accessible from your workspace or Coop account page.
Learn more about [creating an account](https://www.expectedparrot.com/login) and [using the Coop](https://docs.expectedparrot.com/en/latest/coop.html).

Here we post this notebook:

In [12]:
from edsl import Notebook

In [13]:
n = Notebook(path = "explore_llm_biases.ipynb")

In [14]:
n.push(description = "Example code for comparing model responses and biases", visibility = "public")

{'description': 'Example code for comparing model responses and biases',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/d6b943f9-dcf3-4de1-aa70-f542e46adc18',
 'uuid': 'd6b943f9-dcf3-4de1-aa70-f542e46adc18',
 'version': '0.1.33.dev1',
 'visibility': 'public'}

To update an object:

In [15]:
n = Notebook(path = "explore_llm_biases.ipynb") # resave it

In [16]:
n.patch(uuid = "d6b943f9-dcf3-4de1-aa70-f542e46adc18", value = n)

{'status': 'success'}