# River Problem 
This notebook provides sample [EDSL](https://docs.expectedparrot.com/) code exploring the capabilities of large language models to provide and evaluate solutions for a [river crossing problem](https://en.wikipedia.org/wiki/River_crossing_puzzle), where the objective is to efficiently transport items across a river subject to conditions on the number of items that can be transported at once and the combinations of items that can be left together unattended.

In a popular version of the problem, a farmer needs to transport a wolf, a goat and a cabbage but cannot leave the wolf with the goat or the goat with the cabbage, as the goat or the cabbage would be eaten.

There are several things we want to learn in using LLMs to explore this problem:

1. Are models capable of providing valid, efficient solutions? If so, what level of instruction is needed, and does it matter how we prompt the model to format its solution?
2. When models do provide solutions, are they easily dissuaded from trusting those solutions?
3. When models are given valid solutions, can they easily be convinced that the solutions are incorrect?

The notebook has multiple sections:

**Proposing solutions:** We prompt models to provide solutions for the problem in different ways, and then ask the models about their confidence in their solutions.

**Selecting solutions:** We prompt models to identify a correct solution from a list of otherwise incorrect options, and then ask them about their confidence in their selections.

**Evaluating solutions:** We give models valid solutions and then see whether the models can be convinced that the solutions are incorrect.


## EDSL
EDSL is an open-source Python library for simulating surveys and experiments with AI agents and large language models. Please [see our documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started.

## Proposing solutions
We start by describing the problem ([Wikipedia](https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem)) and constructing a question to prompt a model to provide an efficient solution for it:

In [1]:
problem = """
A farmer with a wolf, a goat, and a cabbage must cross a river by boat. 
The boat can carry only the farmer and a single item. If left unattended 
together, the wolf would eat the goat, or the goat would eat the cabbage. 
How can they cross the river without anything being eaten? 
"""

### Special instructions
The model may perform better if we specifically note that items may be brought back across the river multiple times, since people often get stuck by assuming that this is not allowed. We can store this tip separately to compare how models perform with and without it:

In [2]:
tip = "(Note that items may be carried back and forth across the river.)"

### Constructing questions
EDSL comes with many standard question types that we can choose from based on the form of the response that we want to get back (see [examples of all question types](https://docs.expectedparrot.com/en/latest/questions.html)). Here, we first ask the model to propose a solution to the problem as a textual response. We create two versions of the question, one with the tip and one without it:

In [3]:
from edsl import QuestionFreeText

q_solution_text = QuestionFreeText(
    question_name="solution_text",
    question_text="Please provide an efficient, concise solution to this problem: "
    + problem,
)

q_solution_text_tip = QuestionFreeText(
    question_name="solution_text_tip",
    question_text="Please provide an efficient, concise solution to this problem: "
    + problem
    + tip,
)

We can also try prompting the model to format its response differently, for example as a list of steps instead of free text:

In [4]:
from edsl import QuestionList

q_solution_list = QuestionList(
    question_name="solution_list",
    question_text="Please provide an efficient, concise solution to this problem: "
    + problem
    + tip
    + """ Format your response as a list of steps like these:
    'Farmer moves <item> from left to right.' or 'Farmer moves alone from left to right.'""",
)
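Because the list format asks for a fixed sentence pattern, a returned list of steps can be checked mechanically before it is evaluated. The following is a minimal sketch and not part of EDSL: `STEP_PATTERN`, `parse_step` and `malformed_steps` are hypothetical helper names, and treating the trailing period as optional is an assumption about how models may vary the format.

```python
import re

# Matches the two step templates named in the prompt. The trailing period is
# optional (an assumption: models do not always include it).
STEP_PATTERN = re.compile(
    r"^Farmer moves (?P<item>alone|\w+) "
    r"from (?P<src>left|right) to (?P<dst>left|right)\.?$"
)


def parse_step(step):
    """Return (item, src, dst) for a well-formed step, or None otherwise.

    `item` is None when the farmer moves alone.
    """
    match = STEP_PATTERN.match(step.strip())
    if match is None or match["src"] == match["dst"]:
        return None
    item = None if match["item"] == "alone" else match["item"]
    return item, match["src"], match["dst"]


def malformed_steps(steps):
    """Return the steps that do not follow the requested format."""
    return [step for step in steps if parse_step(step) is None]


print(parse_step("Farmer moves goat from left to right."))
print(malformed_steps(["Farmer moves alone from right to left.", "Take the goat over."]))
```

A check like this only confirms the format of each step. Whether the sequence of steps actually solves the problem is a separate question.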

We can add a follow-on question asking the model about its confidence in its solution. Here we pose the same follow-on question using several different question types to compare responses:

In [5]:
from edsl import (
    QuestionYesNo,
    QuestionFreeText,
    QuestionMultipleChoice,
    QuestionLinearScale,
)

question_text = "Are you confident in your solution?"

q_confidence1 = QuestionYesNo(
    question_name="confidence_yn", question_text=question_text
)

q_confidence2 = QuestionFreeText(
    question_name="confidence_ft", question_text=question_text
)

q_confidence3 = QuestionMultipleChoice(
    question_name="confidence_mc",
    question_text=question_text,
    question_options=["No", "Yes", "Somewhat"],
)

q_confidence4 = QuestionLinearScale(
    question_name="confidence_ls",
    question_text=question_text,
    question_options=[0, 1, 2, 3, 4, 5],
    option_labels={0: "I am not at all confident.", 5: "I am very confident."},
)

We combine these questions in a `Survey` in order to administer them together. Here we create separate surveys to compare responses with and without the tip and as a list of steps: 

In [6]:
from edsl import Survey

survey = Survey(
    [q_solution_text, q_confidence1, q_confidence2, q_confidence3, q_confidence4]
)

survey_tip = Survey(
    [q_solution_text_tip, q_confidence1, q_confidence2, q_confidence3, q_confidence4]
)

survey_list = Survey(
    [q_solution_list, q_confidence1, q_confidence2, q_confidence3, q_confidence4]
)

### Adding survey rules
Survey questions are administered to models asynchronously by default (for speed and to minimize tokens consumed). We can also choose whether to give the model information about prior questions and responses when it answers other questions. Here we want the model to know about its proposed solution when answering each of the follow-on questions. We do this by adding a memory of the solution question to each individual follow-on question, and we repeat this for the solution questions with and without the tip. Note that this is different from giving the model cumulative information: each version of the confidence question is asked fresh, without knowledge of the other confidence questions:

In [7]:
survey = (
    survey.add_targeted_memory(q_confidence1, q_solution_text)
    .add_targeted_memory(q_confidence2, q_solution_text)
    .add_targeted_memory(q_confidence3, q_solution_text)
    .add_targeted_memory(q_confidence4, q_solution_text)
)

survey_tip = (
    survey_tip.add_targeted_memory(q_confidence1, q_solution_text_tip)
    .add_targeted_memory(q_confidence2, q_solution_text_tip)
    .add_targeted_memory(q_confidence3, q_solution_text_tip)
    .add_targeted_memory(q_confidence4, q_solution_text_tip)
)

survey_list = (
    survey_list.add_targeted_memory(q_confidence1, q_solution_list)
    .add_targeted_memory(q_confidence2, q_solution_list)
    .add_targeted_memory(q_confidence3, q_solution_list)
    .add_targeted_memory(q_confidence4, q_solution_list)
)
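The difference between targeted and cumulative memory can be pictured as a mapping from each question to the earlier questions whose answers it is shown. The sketch below is a conceptual illustration in plain Python only: `targeted_plan` and `cumulative_plan` are hypothetical names, and the dictionaries are not how EDSL represents memory internally.

```python
# Conceptual illustration only: these helpers are hypothetical and are not
# part of EDSL.
def targeted_plan(solution_name, followup_names):
    """Each follow-up question is shown only the solution question."""
    return {name: [solution_name] for name in followup_names}


def cumulative_plan(question_names):
    """Each question is shown every question that came before it."""
    return {name: question_names[:i] for i, name in enumerate(question_names)}


followups = ["confidence_yn", "confidence_ft", "confidence_mc", "confidence_ls"]
targeted = targeted_plan("solution_text", followups)
cumulative = cumulative_plan(["solution_text"] + followups)

# With targeted memory the last confidence question sees one prior question;
# with cumulative memory it would also see the three earlier confidence questions.
print(targeted["confidence_ls"])
print(cumulative["confidence_ls"])
```

Under the targeted plan, an answer to one confidence question cannot influence the answers to the others, which is what makes the four question types comparable.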

### Designing AI agents to answer questions
We can optionally create one or more agents with relevant traits and instructions for a model to use in answering the questions. We do this by passing a dictionary of traits to an `Agent` object that we add to the survey when we run it. (Learn more about [using agents to answer surveys](https://docs.expectedparrot.com/en/latest/agents.html).) Here we create a set of agents with and without personas and special instructions to explore potential impacts on responses:

In [8]:
from edsl import AgentList, Agent

instructions = [
    "",  # An empty instruction for comparison
    """You are being asked to provide and evaluate solutions to a classic 
                'river crossing problem'. In answering questions, be sure to carefully
                consider the constraints of the given problem and strategies that may
                be helpful in identifying correct solutions, such as backtracking.""",
]

personas = [
    "",  # An empty persona description for comparison
    "You are a computer scientist.",
]

agents = AgentList(
    Agent(traits={"persona": p}, instruction=i) for p in personas for i in instructions
)
agents
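The nested comprehension above produces one agent for every pairing of persona and instruction. As a plain-Python sketch of that design (no EDSL objects involved), `itertools.product` enumerates the same pairings in the same order. The strings here are shortened stand-ins, and the `example_` names are hypothetical.

```python
from itertools import product

# Shortened stand-ins for the persona and instruction strings used above.
example_personas = ["", "You are a computer scientist."]
example_instructions = ["", "Carefully consider the constraints of the problem."]

# product() varies the last iterable fastest, which matches
# `for p in personas for i in instructions`.
combinations = list(product(example_personas, example_instructions))

for persona, instruction in combinations:
    print(f"persona={persona!r}, instruction={instruction!r}")

print(len(combinations), "agent configurations")
```

Including the empty persona and empty instruction gives a baseline configuration, so each trait can be compared against a version of the agent without it.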

### Selecting language models
We can also specify language models that we want to use to generate responses. If none are specified, EDSL will use GPT 4 preview by default ([learn more about specifying models](https://docs.expectedparrot.com/en/latest/language_models.html)). Here we import the classes needed to specify models explicitly, for purposes of demonstration:

In [9]:
from edsl import ModelList, Model

# To see a list of currently available models:
# Model.available()

We create `Model` objects for the models that we want to add to the survey. Here we'll use GPT 4o:

In [10]:
models = ModelList(
    Model(m) for m in ["gpt-4o"] 
)

### Generating results
Now we can generate responses by calling the `run` method on the surveys, after adding agents and models with the `by` method:

In [11]:
results = survey.by(agents).by(models).run()

In [12]:
results_tip = survey_tip.by(agents).by(models).run()

In [13]:
results_list = survey_list.by(agents).by(models).run()

This generates `Results` objects that contain information about all the components of the responses, which we can access as datasets. We can view these components:

In [14]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'agent.persona',
 'answer.confidence_ft',
 'answer.confidence_ls',
 'answer.confidence_mc',
 'answer.confidence_yn',
 'answer.solution_text',
 'comment.confidence_ft_comment',
 'comment.confidence_ls_comment',
 'comment.confidence_mc_comment',
 'comment.confidence_yn_comment',
 'comment.solution_text_comment',
 'generated_tokens.confidence_ft_generated_tokens',
 'generated_tokens.confidence_ls_generated_tokens',
 'generated_tokens.confidence_mc_generated_tokens',
 'generated_tokens.confidence_yn_generated_tokens',
 'generated_tokens.solution_text_generated_tokens',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.confidence_ft_system_prompt',
 'prompt.confidence_ft_user_prompt',
 'prompt.confidence_ls_system_prompt',
 'prompt.confidence_ls_user_prompt',
 'prompt.confidence_mc_system_

EDSL has many [built-in methods for analyzing results](https://docs.expectedparrot.com/en/latest/results.html) as datasets. Here we first print just the answers:

In [15]:
results.select("persona", "agent_instruction", "solution_text").print(format="rich")

Here we select the confidence responses:

In [16]:
(
    results.sort_by("model")
    .select(
        # "model",
        # "persona",
        # "agent_instruction",
        "confidence_yn",
        # "confidence_ft",
        "confidence_mc",
        "confidence_ls"
    )
    .print(format="rich")
)

We can compare responses for the question prompting the models to provide a solution as a list of steps:

In [17]:
results_list.select(
    "model",
    "persona",
    "agent_instruction",
    "solution_list",
    "confidence_yn",
    "confidence_ft",
    "confidence_mc",
    "confidence_ls",
).print(format="rich")

### Question prompt variations
Let's try changing the tone of our confidence questions. Because we are not changing the agents, the models or the first question asking for a solution to the problem, the cached responses to that question will be retrieved and used unchanged with our new confidence questions ([learn more about caching LLM calls](https://docs.expectedparrot.com/en/latest/data.html)):

In [18]:
question_text = (
    "This problem is hard! Are you really sure that your solution actually works?"
)

q_confidence1 = QuestionYesNo(
    question_name="confidence_yn", question_text=question_text
)

q_confidence2 = QuestionFreeText(
    question_name="confidence_ft", question_text=question_text
)

q_confidence3 = QuestionMultipleChoice(
    question_name="confidence_mc",
    question_text=question_text,
    question_options=["No", "Yes", "Somewhat"],
)

q_confidence4 = QuestionLinearScale(
    question_name="confidence_ls",
    question_text=question_text,
    question_options=[0, 1, 2, 3, 4, 5],
    option_labels={0: "I am not at all confident.", 5: "I am very confident."},
)

survey = Survey(
    [q_solution_text, q_confidence1, q_confidence2, q_confidence3, q_confidence4]
)

survey = (
    survey.add_targeted_memory(q_confidence1, q_solution_text)
    .add_targeted_memory(q_confidence2, q_solution_text)
    .add_targeted_memory(q_confidence3, q_solution_text)
    .add_targeted_memory(q_confidence4, q_solution_text)
)

results = survey.by(agents).by(models).run()

results.select(
    "model",
    "persona",
    "agent_instruction",
    "confidence_yn",
    "confidence_ft",
    "confidence_mc",
    "confidence_ls",
).print(format="rich")

This framing seems to create more confusion in the responses to the linear scale question, but confidence is otherwise unwavering.

## Selecting solutions
In this section we ask models to select a correct solution from a set of otherwise incorrect solutions. We also ask them about a correct solution, similar to our process above except that the model is simply presented with the solution.

First we identify some correct solutions in different forms:

In [19]:
solution_text = """The farmer takes the goat across the river first and leaves 
it on the other side. Then he goes back across the river and takes the wolf over. 
However, instead of leaving the wolf with the goat, he brings the goat back with 
him to the original side. Next, the farmer takes the cabbage across the river and 
leaves it with the wolf. Finally, he returns to pick up the goat and brings it 
across the river. This way, the goat and the cabbage are never left alone with 
each other without the farmer's presence, and neither are the wolf and the goat."""

In [20]:
solution_list = [
    "Farmer moves goat from left to right.",
    "Farmer moves alone from right to left.",
    "Farmer moves cabbage from left to right.",
    "Farmer moves goat from right to left.",
    "Farmer moves wolf from left to right.",
    "Farmer moves alone from right to left.",
    "Farmer moves goat from left to right.",
]
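Before using a list of steps as the "correct" option, it can be replayed against the rules of the problem. The following is a minimal sketch under stated assumptions: `check_solution` is a hypothetical helper (not part of EDSL), it expects steps in the exact 'Farmer moves ... from ... to ...' format used above, and it assumes all items and the farmer start on the left bank.

```python
import re

STEP = re.compile(r"Farmer moves (\w+) from (left|right) to (left|right)\.?")


def check_solution(steps, items, unsafe_combinations):
    """Replay the steps and return (is_valid, reason)."""
    banks = {"left": set(items), "right": set()}
    farmer = "left"
    for number, step in enumerate(steps, start=1):
        match = STEP.fullmatch(step.strip())
        if match is None:
            return False, f"Step {number} is not in the expected format."
        item, src, dst = match.groups()
        if src != farmer or dst == src:
            return False, f"Step {number} does not cross from the farmer's bank."
        if item != "alone":
            if item not in banks[src]:
                return False, f"Step {number} moves an item that is not on the {src} bank."
            banks[src].remove(item)
            banks[dst].add(item)
        farmer = dst
        # The bank the farmer just left is now unattended.
        for combo in unsafe_combinations:
            if set(combo) <= banks[src]:
                return False, f"Step {number} leaves {sorted(combo)} unattended."
    if banks["left"] or farmer != "right":
        return False, "Not everything ended up on the right bank."
    return True, "Valid solution."


example_items = ["wolf", "goat", "cabbage"]
example_unsafe = [{"wolf", "goat"}, {"goat", "cabbage"}]
example_steps = [
    "Farmer moves goat from left to right.",
    "Farmer moves alone from right to left.",
    "Farmer moves cabbage from left to right.",
    "Farmer moves goat from right to left.",
    "Farmer moves wolf from left to right.",
    "Farmer moves alone from right to left.",
    "Farmer moves goat from left to right.",
]
print(check_solution(example_steps, example_items, example_unsafe))
```

The same function can be applied to a model's list-formatted answer, or to a copy of the list with a step removed to confirm that it is no longer valid.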

Next we administer them with some incorrect options. Here we create each incorrect option by removing a single step (at a fixed position) from the list:

In [21]:
from edsl import QuestionMultipleChoice
import random

q_choice = QuestionMultipleChoice(
    question_name="choice",
    question_text="Select a solution to this problem: " + problem,
    question_options=[
        ", ".join(
            [solution_list[i] for i in range(len(solution_list)) if i != 2]
        ),  # Step removed
        ", ".join([solution_list[i] for i in range(len(solution_list)) if i != 3]),
        ", ".join([solution_list[i] for i in range(len(solution_list)) if i != 4]),
        ", ".join([solution_list[i] for i in range(len(solution_list)) if i != 5]),
        ", ".join(solution_list),  # Correct solution
    ],
)
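The options above remove steps at fixed positions and always place the correct solution last. As a hedged alternative, the removed steps and the option order could be randomized with a seeded generator. `build_options` is a hypothetical helper, not an EDSL function. It relies on the assumption that removing any single step from a solution whose moves alternate direction yields an invalid sequence, and on the options being distinct, which holds for the seven-step list used here.

```python
import random


def build_options(steps, n_incorrect=4, seed=0):
    """Return (options, correct_option) with randomly chosen steps removed."""
    rng = random.Random(seed)  # Seeded so that the options are reproducible
    removed = rng.sample(range(len(steps)), n_incorrect)
    correct = ", ".join(steps)
    options = [
        ", ".join(step for i, step in enumerate(steps) if i != drop)
        for drop in removed
    ]
    options.append(correct)
    rng.shuffle(options)  # Avoid always placing the correct option last
    return options, correct


example_steps = [
    "Farmer moves goat from left to right.",
    "Farmer moves alone from right to left.",
    "Farmer moves cabbage from left to right.",
    "Farmer moves goat from right to left.",
    "Farmer moves wolf from left to right.",
    "Farmer moves alone from right to left.",
    "Farmer moves goat from left to right.",
]
options, correct = build_options(example_steps)
print(len(options), "options; correct option at index", options.index(correct))
```

The resulting list could be passed as `question_options`, with `correct` kept aside for scoring the model's choice.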

In [22]:
results = q_choice.run()  # No agents, using the default model

results.select("choice", "choice_comment").print(format="rich")

In this execution we can see that the model has selected an obviously incorrect choice. We could repeat this with the other (text) form of the solution, and with agents and other models.

## Evaluating solutions
In this section we present a model with the correct solutions from above and apply some pressure for the model to identify them as incorrect.

In [23]:
# The trailing newline keeps "Problem: " from running into the question text
question_text = """Consider the following problem and potentially inaccurate solution. 
Can we be confident that it is correct?
"""

q_correct1 = QuestionYesNo(
    question_name="correct_yn",
    question_text=question_text
    + "Problem: "
    + problem
    + "Potentially inaccurate solution: "
    + solution_text,
)

q_correct2 = QuestionFreeText(
    question_name="correct_ft",
    question_text=question_text
    + "Problem: "
    + problem
    + "Potentially inaccurate solution: "
    + solution_text,
)

q_correct3 = QuestionMultipleChoice(
    question_name="correct_mc",
    question_text=question_text
    + "Problem: "
    + problem
    + "Potentially inaccurate solution: "
    + solution_text,
    question_options=[
        "I do not know.",
        "The solution is incorrect.",
        "The solution is actually correct.",
    ],
)

survey = Survey([q_correct1, q_correct2, q_correct3])

results = survey.run()
results.select("correct_yn", "correct_ft", "correct_mc").print(format="rich")

Now with the solution as a list of steps:

In [24]:
# The trailing newline keeps "Problem: " from running into the question text
question_text = """Consider the following problem and potentially inaccurate solution. 
Can we be confident that it is correct?
"""

q_correct1 = QuestionYesNo(
    question_name="correct_yn",
    question_text=question_text
    + "Problem: "
    + problem
    + "Potentially inaccurate solution: "
    + ", ".join(solution_list),
)

q_correct2 = QuestionFreeText(
    question_name="correct_ft",
    question_text=question_text
    + "Problem: "
    + problem
    + "Potentially inaccurate solution: "
    + ", ".join(solution_list),
)

q_correct3 = QuestionMultipleChoice(
    question_name="correct_mc",
    question_text=question_text
    + "Problem: "
    + problem
    + "Potentially inaccurate solution: "
    + ", ".join(solution_list),
    question_options=[
        "I do not know.",
        "The solution is incorrect.",
        "The solution is actually correct.",
    ],
)

survey = Survey([q_correct1, q_correct2, q_correct3])

results = survey.run()
results.select("correct_yn", "correct_ft", "correct_mc").print(format="rich")

## Methods for generating and checking solutions
Next we use an algorithm for generating a valid solution in order to give a model a solution, and then explore whether its confidence can be shaken. 

Methods for returning a valid solution for a given set of items and unsafe combinations:

In [25]:
class RiverState:
    def __init__(self, left, right, boat):
        self.left = frozenset(left)  # Items on the left bank
        self.right = frozenset(right)  # Items on the right bank
        self.boat = boat  # Position of the boat ('left' or 'right')

    def is_safe(self, unsafe_combinations):
        # Ensure no unsafe combinations are present on any bank without the farmer
        for bank in [self.left, self.right]:
            if "farmer" in bank:
                continue
            for combo in unsafe_combinations:
                if combo.issubset(bank):
                    return False
        return True

    def is_goal(self):
        # Goal is reached when all items are on the right side, and the boat is also on the right
        return not self.left and self.boat == "right"

    def __str__(self):
        return f"Left: {self.left}, Right: {self.right}, Boat: {self.boat}"

    def clone(self):
        # Create a copy of the current state to ensure immutability during recursive calls
        return RiverState(self.left, self.right, self.boat)

    def __hash__(self):
        return hash((self.left, self.right, self.boat))

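    def __eq__(self, other):
        # Defer to Python's default handling when compared with another type
        if not isinstance(other, RiverState):
            return NotImplemented
        return (
            self.left == other.left
            and self.right == other.right
            and self.boat == other.boat
        )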


def get_possible_moves(state):
    # Determine possible moves based on the current location of the boat
    current_bank = state.left if state.boat == "left" else state.right
    moves = [None]  # Farmer can move alone
    for item in current_bank:
        if item != "farmer":  # Farmer can also move any item from the current bank
            moves.append(item)
    return moves


def execute_move(state, item, unsafe_combinations):
    new_state = state.clone()
    move_description = "Farmer moves alone" if item is None else f"Farmer takes {item}"
    if state.boat == "left":
        new_left = (
            set(state.left) - {"farmer", item} if item else set(state.left) - {"farmer"}
        )
        new_right = (
            set(state.right) | {"farmer", item}
            if item
            else set(state.right) | {"farmer"}
        )
        new_state.left = frozenset(new_left)
        new_state.right = frozenset(new_right)
        new_state.boat = "right"
        move_description += " from left to right"
    else:
        new_right = (
            set(state.right) - {"farmer", item}
            if item
            else set(state.right) - {"farmer"}
        )
        new_left = (
            set(state.left) | {"farmer", item} if item else set(state.left) | {"farmer"}
        )
        new_state.right = frozenset(new_right)
        new_state.left = frozenset(new_left)
        new_state.boat = "left"
        move_description += " from right to left"

    if new_state.is_safe(unsafe_combinations):
        return new_state, move_description
    return None, None


def dfs(state, path, visited, unsafe_combinations):
    if state in visited:
        return None
    if state.is_goal():
        return path

    visited.add(state)
    for move in get_possible_moves(state):
        new_state, move_description = execute_move(state, move, unsafe_combinations)
        if new_state and new_state not in visited:
            result = dfs(
                new_state, path + [move_description], visited, unsafe_combinations
            )
            if result:
                return result
    visited.remove(state)
    return None


def solve_river_crossing(items, unsafe_combinations):
    initial_state = RiverState(set(items + ["farmer"]), set(), "left")
    visited = set()
    solution = dfs(initial_state, [], visited, unsafe_combinations)
    if solution is not None:
        return solution
    return "No solution found"

Here we test it with the original items and unsafe combinations:

In [26]:
# Test the solution
items = ["wolf", "goat", "cabbage"]
unsafe_combinations = [
    {"wolf", "goat"},
    {"goat", "cabbage"},
]  # Specify unsafe combinations

result = solve_river_crossing(items, unsafe_combinations)
print("Solution found:")
if isinstance(result, list):
    for move in result:
        print(move)
else:
    print(result)

Solution found:
Farmer takes goat from left to right
Farmer moves alone from right to left
Farmer takes cabbage from left to right
Farmer takes goat from right to left
Farmer takes wolf from left to right
Farmer moves alone from right to left
Farmer takes goat from left to right


Here we test it with no unsafe combinations, to check that the method does not add unnecessary trips in this simple case:

In [27]:
# Test the solution
items = ["wolf", "goat", "cabbage"]
unsafe_combinations = []  # No unsafe combinations, to check that the solution is efficient

result = solve_river_crossing(items, unsafe_combinations)
print("Solution found:")
if isinstance(result, list):
    for move in result:
        print(move)
else:
    print(result)

Solution found:
Farmer takes goat from left to right
Farmer moves alone from right to left
Farmer takes cabbage from left to right
Farmer moves alone from right to left
Farmer takes wolf from left to right
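
Depth-first search returns the first valid path it finds, so checks like the one above do not by themselves guarantee a shortest solution. In addition, iterating over a `frozenset` of strings can visit items in a different order between Python sessions, so the path found may vary. The following is a sketch of a breadth-first alternative, which returns a solution with the fewest crossings. `solve_shortest` is a hypothetical name, the state is simplified to the set of items on the left bank plus the farmer's position, and the move descriptions mirror the format used above.

```python
from collections import deque


def solve_shortest(items, unsafe_combinations):
    """Breadth-first search; returns a shortest list of moves, or None."""
    everything = frozenset(items)
    start = (everything, "left")  # (items on the left bank, farmer's bank)
    goal = (frozenset(), "right")

    def is_safe(left, farmer):
        # Only the bank without the farmer can be unsafe.
        unattended = everything - left if farmer == "left" else left
        return not any(set(combo) <= unattended for combo in unsafe_combinations)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if (left, farmer) == goal:
            return path
        here = left if farmer == "left" else everything - left
        dest = "right" if farmer == "left" else "left"
        # sorted() makes the search order, and so the output, deterministic.
        for cargo in [None] + sorted(here):
            if cargo is None:
                new_left = left
            elif farmer == "left":
                new_left = left - {cargo}
            else:
                new_left = left | {cargo}
            state = (new_left, dest)
            if state in seen or not is_safe(new_left, dest):
                continue
            seen.add(state)
            who = "moves alone" if cargo is None else f"takes {cargo}"
            queue.append((state, path + [f"Farmer {who} from {farmer} to {dest}"]))
    return None


classic = solve_shortest(
    ["wolf", "goat", "cabbage"], [{"wolf", "goat"}, {"goat", "cabbage"}]
)
print(len(classic), "crossings")
for move in classic:
    print(move)
```

Because breadth-first search explores all paths of length n before any path of length n + 1, the first path that reaches the goal is a shortest one. The function returns `None` rather than a message string when no solution exists.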


We can use these methods for exploring variations of the problem. For example, are the models overly familiar with the problem when it is presented with these particular items?

## Exploring confidence
Here we change the item names and provide a valid solution to explore the model's confidence with different prompts:

In [28]:
new_problem = """
A farmer with a bear, a bunny, and beets must cross a river by boat. 
The boat can carry only the farmer and a single item. If left unattended 
together, the bear would eat the bunny, or the bunny would eat the beets. 
How can they cross the river without anything being eaten? 
"""
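
To explore many renamings without editing the text by hand, the problem statement, the item list and the unsafe combinations can be generated together from one template. This is a minimal sketch: `PROBLEM_TEMPLATE` and `make_variant` are hypothetical names, and the template assumes that the predator and prey read naturally after the article "a".

```python
PROBLEM_TEMPLATE = """
A farmer with a {predator}, a {prey}, and {crop_phrase} must cross a river by boat.
The boat can carry only the farmer and a single item. If left unattended
together, the {predator} would eat the {prey}, or the {prey} would eat the {crop}.
How can they cross the river without anything being eaten?
"""


def make_variant(predator, prey, crop, crop_phrase=None):
    """Return (problem_text, items, unsafe_combinations) for one renaming.

    `crop_phrase` is how the crop appears in the first sentence, for example
    "a cabbage"; it defaults to the bare crop name, for example "beets".
    """
    text = PROBLEM_TEMPLATE.format(
        predator=predator,
        prey=prey,
        crop=crop,
        crop_phrase=crop_phrase or crop,
    )
    items = [predator, prey, crop]
    unsafe_combinations = [{predator, prey}, {prey, crop}]
    return text, items, unsafe_combinations


variant_text, variant_items, variant_unsafe = make_variant("bear", "bunny", "beets")
print(variant_text)
print(variant_items, variant_unsafe)
```

Generating the text and the constraints from the same arguments keeps the prompt shown to the model consistent with the inputs given to the solver.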

In [29]:
items = ["bear", "bunny", "beets"]
unsafe_combinations = [
    {"bear", "bunny"},
    {"bunny", "beets"},
]  # Specify unsafe combinations

valid_solution = "\n".join(solve_river_crossing(items, unsafe_combinations))
valid_solution

'Farmer takes bunny from left to right\nFarmer moves alone from right to left\nFarmer takes beets from left to right\nFarmer takes bunny from right to left\nFarmer takes bear from left to right\nFarmer moves alone from right to left\nFarmer takes bunny from left to right'

In [30]:
# The trailing newline keeps "Problem: " from running into the question text
question_text = """Consider the following problem and the proposed solution. 
Are you confident that it is correct?
"""

q_correct1 = QuestionYesNo(
    question_name="correct_yn",
    question_text=question_text
    + "Problem: "
    + new_problem
    + "Proposed solution: "
    + valid_solution,
)

q_correct2 = QuestionFreeText(
    question_name="correct_ft",
    question_text=question_text
    + "Problem: "
    + new_problem
    + "Proposed solution: "
    + valid_solution,
)

q_correct3 = QuestionMultipleChoice(
    question_name="correct_mc",
    question_text=question_text
    + "Problem: "
    + new_problem
    + "Proposed solution: "
    + valid_solution,
    question_options=["I do not know.", "It is incorrect.", "It is correct."],
)

survey = Survey([q_correct1, q_correct2, q_correct3])

results = survey.run()
results.select("correct_yn", "correct_ft", "correct_mc").print(format="rich")

We can observe some variation across question types in the responses to the same content.

Posting this notebook to the Coop:

In [31]:
from edsl import Notebook

In [34]:
n = Notebook(path = "river_problem.ipynb")

In [35]:
n.push(description = "Example code for exploring LLM performance at the River Problem", visibility = "public")

{'description': 'Example code for exploring LLM performance at the River Problem',
 'object_type': 'notebook',
 'url': 'https://www.expectedparrot.com/content/91886a19-0023-4d4d-8f83-e698d1c6394a',
 'uuid': '91886a19-0023-4d4d-8f83-e698d1c6394a',
 'version': '0.1.33.dev1',
 'visibility': 'public'}