# Evaluating job tasks
This notebook provides sample EDSL code for using AI agents and language models to evaluate content. The task is designed as a survey of questions about the content. The survey is administered to agents with relevant personas, and language models generate the responses, which are returned as a dataset.

Please [see our docs](https://docs.expectedparrot.com/) for tips on getting started using the EDSL package for simulating surveys and experiments with AI.

In [1]:
job_posts = [
    "Oversee daily operations, manage staff, and ensure customer satisfaction in a fast-paced retail environment.",
    "Craft engaging and informative blog posts on health and wellness topics to boost website traffic and engage readers.",
    "Analyze sales data using statistical tools to identify trends and provide actionable insights to the marketing team.",
    "Prepare gourmet dishes that comply with restaurant standards and delight customers with unique flavor combinations.",
    "Design creative visual content for marketing materials, including brochures, banners, and digital ads, using Adobe Creative Suite.",
    "Develop, test, and maintain robust software solutions to improve business processes using Python and Java.",
    "Craft coffee drinks and manage the coffee station while providing excellent customer service in a busy café.",
    "Manage recruitment processes, conduct interviews, and oversee employee benefit programs to ensure a motivated workforce.",
    "Assist veterinarians by preparing animals for surgery, administering injections, and providing post-operative care.",
    "Design aesthetic and practical outdoor spaces for clients, from residential gardens to public parks.",
    "Install and repair residential plumbing systems, including water heaters, pipes, and fixtures to ensure proper functionality.",
    "Develop comprehensive marketing strategies that align with company goals, including digital campaigns and branding efforts.",
    "Install, maintain, and repair electrical wiring, equipment, and fixtures to ensure safe and effective operation.",
    "Provide personalized fitness programs and conduct group fitness classes to help clients achieve their health goals.",
    "Diagnose and repair automotive issues, perform routine maintenance, and ensure vehicles meet safety standards.",
    "Lead creative campaigns, from concept through execution, coordinating with graphic designers and content creators.",
    "Educate students in mathematics using innovative teaching strategies to enhance understanding and interest in the subject.",
    "Drive sales through engaging customer interactions, understanding client needs, and providing product solutions.",
    "Fold dough into pretzel shapes ensuring each is uniformly twisted and perfectly salted before baking.",
    "Address customer inquiries and issues via phone and email, ensuring high levels of satisfaction and timely resolution.",
]

Draft questions using the question type templates:

In [2]:
from edsl.questions import (
    QuestionList,
    QuestionLinearScale,
    QuestionMultipleChoice,
    QuestionYesNo,
    QuestionFreeText,
)

q1 = QuestionList(
    question_name="category_list",
    question_text="Draft a list of increasingly specific categories for the following job post: {{ job_post }}",
    max_list_items=3,  # optional
)

q2 = QuestionLinearScale(
    question_name="specific_scale",
    question_text="How specific is this job post: {{ job_post }}",
    question_options=[0, 1, 2, 3, 4, 5],
    option_labels={0: "Unclear", 1: "Not at all specific", 5: "Highly specific"},
)

q3 = QuestionMultipleChoice(
    question_name="skill_choice",
    question_text="What is the skill level required for this job: {{ job_post }}",
    question_options=["Entry level", "Intermediate", "Advanced", "Expert"],
)

q4 = QuestionYesNo(
    question_name="technical_yn",
    question_text="Is this a technical job? Job post: {{ job_post }}",
)

q5 = QuestionFreeText(
    question_name="rewrite_text",
    question_text="""Consider whether the following job post could be improved for clarity.
    Then, without substantially lengthening it, draft a new version: {{ job_post }}""",
)

Combine questions into a `Survey` to administer them together:

In [3]:
from edsl import Survey

questions = [q1, q2, q3, q4, q5]

survey = Survey(questions)

If we want the agent/model to have information about prior questions in the survey, we can add targeted or full memories ([learn more about adding survey rules/logic](https://docs.expectedparrot.com/en/latest/surveys.html)):

In [4]:
# Memory of a specific question is presented with another question:
# survey = survey.add_targeted_memory(q2, q1)

# Full memory of all prior questions is presented with each question (token-intensive):
# survey = survey.set_full_memory_mode()
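To illustrate what a targeted memory amounts to conceptually, here is a minimal pure-Python sketch in which a prior question and its answer are prepended to a later question. It does not reproduce EDSL's actual prompt format: `build_prompt` is a hypothetical helper, and the wording of the memory preamble and the placeholder answer are assumptions made only for illustration.

```python
def build_prompt(question_text, memory=None):
    """Prepend remembered question/answer pairs to a question (hypothetical helper)."""
    if not memory:
        # No memory: the question is presented on its own.
        return question_text
    lines = ["Before this question, you answered:"]
    for prior_question, prior_answer in memory:
        lines.append(f"- Question: {prior_question}")
        lines.append(f"  Answer: {prior_answer}")
    lines.append(question_text)
    return "\n".join(lines)


q1_text = "Draft a list of increasingly specific categories for the following job post: ..."
q2_text = "How specific is this job post: ..."

# Placeholder answer, not real model output.
memory_of_q1 = [(q1_text, "<answer to q1>")]

prompt_without_memory = build_prompt(q2_text)
prompt_with_memory = build_prompt(q2_text, memory_of_q1)
print(prompt_with_memory)
```

Full memory mode would correspond to passing every earlier question and answer in `memory`, which is why it uses more tokens.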

We can create `Scenario` objects for the job posts in order to add them to all the questions when we run the survey:

In [5]:
from edsl import Scenario

scenarios = [Scenario({"job_post": p}) for p in job_posts]
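To make the role of the `{{ job_post }}` placeholder concrete, here is a minimal pure-Python sketch of the substitution idea. It does not use EDSL's own rendering machinery: `render_template` is a hypothetical helper written only for illustration, plain dictionaries stand in for `Scenario` objects, and the job posts are shortened.

```python
import re


def render_template(template: str, values: dict) -> str:
    """Replace each {{ key }} placeholder with the matching value (hypothetical helper)."""

    def substitute(match):
        key = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a value.
        return str(values.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


question_text = "Is this a technical job? Job post: {{ job_post }}"

# Plain dictionaries standing in for Scenario objects (shortened job posts).
scenario_values = [
    {"job_post": "Install and repair residential plumbing systems."},
    {"job_post": "Develop, test, and maintain robust software solutions."},
]

rendered = [render_template(question_text, s) for s in scenario_values]
for text in rendered:
    print(text)
```

Each scenario yields its own version of the question text, which is why a single question definition can be reused across all of the job posts.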

We can create AI agents for the language models to use in answering the questions. This is done by passing dictionaries of desired traits, such as personas, to `Agent` objects that we will add to the survey when we run it:

In [6]:
from edsl import Agent

personas = [
    "You are a labor economist.",
]

agents = [Agent(traits={"persona": p}) for p in personas]

See a list of available language models to use in running the survey:

In [7]:
from edsl import Model

Model.available()

[['01-ai/Yi-34B-Chat', 'deep_infra', 0],
 ['Austism/chronos-hermes-13b-v2', 'deep_infra', 1],
 ['Gryphe/MythoMax-L2-13b', 'deep_infra', 2],
 ['Gryphe/MythoMax-L2-13b-turbo', 'deep_infra', 3],
 ['HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1', 'deep_infra', 4],
 ['Phind/Phind-CodeLlama-34B-v2', 'deep_infra', 5],
 ['bigcode/starcoder2-15b', 'deep_infra', 6],
 ['bigcode/starcoder2-15b-instruct-v0.1', 'deep_infra', 7],
 ['claude-3-haiku-20240307', 'anthropic', 8],
 ['claude-3-opus-20240229', 'anthropic', 9],
 ['claude-3-sonnet-20240229', 'anthropic', 10],
 ['codellama/CodeLlama-34b-Instruct-hf', 'deep_infra', 11],
 ['codellama/CodeLlama-70b-Instruct-hf', 'deep_infra', 12],
 ['cognitivecomputations/dolphin-2.6-mixtral-8x7b', 'deep_infra', 13],
 ['databricks/dbrx-instruct', 'deep_infra', 14],
 ['deepinfra/airoboros-70b', 'deep_infra', 15],
 ['gemini-pro', 'google', 16],
 ['google/codegemma-7b-it', 'deep_infra', 17],
 ['google/gemma-1.1-7b-it', 'deep_infra', 18],
 ['gpt-3.5-turbo', 'openai', 19],


Select language models to use (if we do not specify a model when we run the survey, GPT 4 preview is used by default):

In [8]:
models = [Model(m) for m in ["gpt-4-1106-preview"]]

Run the survey by adding the components and then calling the `run` method:

In [9]:
results = survey.by(scenarios).by(agents).by(models).run()
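Each `by` call adds a dimension to the run, so the amount of work grows multiplicatively. As a back-of-the-envelope check, the sketch below counts the combinations for this notebook's setup (20 job posts, 1 persona, 1 model, 5 questions). It assumes one answer per question for every scenario, agent, and model combination. The lists are plain stand-ins, not EDSL objects.

```python
from itertools import product

# Stand-ins for the components defined in this notebook.
scenario_ids = range(20)  # one per job post
agent_ids = ["labor economist"]
model_ids = ["gpt-4-1106-preview"]
question_names = [
    "category_list",
    "specific_scale",
    "skill_choice",
    "technical_yn",
    "rewrite_text",
]

# Every scenario is paired with every agent and every model.
combinations = list(product(scenario_ids, agent_ids, model_ids))

# Assumption: each combination answers every question once.
total_answers = len(combinations) * len(question_names)

print(len(combinations))  # 20
print(total_answers)  # 100
```

Adding a second persona or a second model would double both numbers, which is worth keeping in mind when estimating cost.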

In [10]:
results.show_exceptions()

This generates a dataset of `Results` that we can analyze with built-in methods for data tables, dataframes, SQL, etc. We can see a list of all the components that can be analyzed:

In [11]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'agent.persona',
 'answer.category_list',
 'answer.rewrite_text',
 'answer.skill_choice',
 'answer.specific_scale',
 'answer.technical_yn',
 'comment.category_list_comment',
 'comment.skill_choice_comment',
 'comment.specific_scale_comment',
 'comment.technical_yn_comment',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.category_list_system_prompt',
 'prompt.category_list_user_prompt',
 'prompt.rewrite_text_system_prompt',
 'prompt.rewrite_text_user_prompt',
 'prompt.skill_choice_system_prompt',
 'prompt.skill_choice_user_prompt',
 'prompt.specific_scale_system_prompt',
 'prompt.specific_scale_user_prompt',
 'prompt.technical_yn_system_prompt',
 'prompt.technical_yn_user_prompt',
 'question_options.category_list_question_options',
 'question_options.rewrite_text_question_options',


For example, we can filter, sort, select, limit, shuffle, sample and print some components of results in a table:

In [12]:
(
    results.filter("int(specific_scale) >= 3")
    .sort_by("skill_choice")
    .select(
        "model",
        "persona",
        "job_post",
        "category_list",
        "specific_scale",
        "skill_choice",
        "technical_yn",
    )
    .print(pretty_labels={}, format="rich", max_rows=5)
)

Showing only the first 5 rows of 9 rows.
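The chained call above reads as filter, then sort, then select, then limit. A rough pure-Python analogue of that pipeline is sketched below on a few hypothetical rows. The values are invented placeholders, not results from this survey, and the code does not use the `Results` API.

```python
# Hypothetical rows standing in for survey results (placeholder values).
rows = [
    {"job_post": "post A", "specific_scale": "4", "skill_choice": "Intermediate"},
    {"job_post": "post B", "specific_scale": "2", "skill_choice": "Entry level"},
    {"job_post": "post C", "specific_scale": "3", "skill_choice": "Advanced"},
    {"job_post": "post D", "specific_scale": "5", "skill_choice": "Entry level"},
]

# Filter: keep rows whose scale value, cast to int, is at least 3.
filtered = [r for r in rows if int(r["specific_scale"]) >= 3]

# Sort: order by the skill label (plain string comparison in this sketch).
ordered = sorted(filtered, key=lambda r: r["skill_choice"])

# Select and limit: keep two columns and at most 5 rows.
columns = ("job_post", "skill_choice")
selected = [{c: r[c] for c in columns} for r in ordered][:5]

for row in selected:
    print(row)
```

In this sketch the sort is alphabetical on the label text, so "Advanced" comes before "Entry level"; ordering by seniority would need an explicit mapping from labels to ranks.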


In [13]:
results.select("rewrite_text").print(format="rich")