# Mini Problem Set: Trust Modeling in Human-Robot Interactions
## Due Wednesday, April 25, 11:59 p.m.

Load the dependencies by selecting the cell below and pressing Shift + Enter.

In [2]:
%load_ext autoreload
%autoreload 2
import numpy as np
import mdptoolbox, mdptoolbox.example
from utils import check_omega, test_ok, check_explicability_score, check_matrix, check_cost, check_MDP, prep_browser, run_pyperplan, run_and_viz_pyperplan

As discussed in lecture, _explaining_ unexpected robot behavior can increase trust. In this module, we'll implement the Meta-MDP solution to the Trust-Aware Planning problem for a simple Wumpus World example.

## Wumpus World Modeling

Because we're so familiar with PDDL, we'll model our Wumpus World as a PDDL problem. **There's no need for you to implement anything here; this code is just so you remember what a plan looks like!**

Here, our robot is trying to navigate in a 3x3 grid to pick up a block. The robot knows that its shortest path is unencumbered, but the human has no idea--they think that there's a pile of trash on the shortest path to the green dot.

![Wumpus World](res/wumpus-world.png)

We've provided you with a fully modeled version of the robot world in `pddl/robot-domain.pddl` and `pddl/robot-problem.pddl`. Notice that none of our actions are `durative-actions`; we assume a unit cost for every action.

Instead of using Optic to run our plans, we'll use a Python PDDL planner called [`pyperplan`](https://github.com/aibasel/pyperplan). Pyperplan is cool because it is extensible and contains a very clean implementation of some of the common search algorithms; if you're interested in the way planners work, definitely check out their codebase! Like Optic, however, Pyperplan only supports _positive preconditions_.

Let's compare our robot plan to our human plan. Run the following two cells to see what the human thinks the robot should do vs. what the robot thinks is optimal!

In [3]:
domain_file = 'pddl/robot-domain.pddl'
problem_file = 'pddl/robot-problem.pddl'

run_and_viz_pyperplan(domain_file, problem_file)

(robot-move robot1 sq2-2 sq1-2)
(robot-move robot1 sq1-2 sq0-2)


In [4]:
domain_file = 'pddl/human-domain.pddl'
problem_file = 'pddl/human-problem.pddl'

run_and_viz_pyperplan(domain_file, problem_file)

(robot-move robot1 sq2-2 sq2-1)
(robot-move robot1 sq2-1 sq2-0)
(robot-move robot1 sq2-0 sq1-0)
(robot-move robot1 sq1-0 sq0-0)
(robot-move robot1 sq0-0 sq0-1)
(robot-move robot1 sq0-1 sq0-2)


# The Meta-MDP

This is where the real fun begins! As discussed in class, one way of selecting robot behaviors can be modeled as a "Meta-MDP". We'll use the [Python MDP Toolbox](https://pymdptoolbox.readthedocs.io/en/latest/api/mdp.html). 

The Meta-MDP builds on the human-aware planning problem from lecture, which the two plans above illustrate.

## Human-Aware Planning
Recall the Human-Aware Planning problem described in [1](https://arxiv.org/pdf/2105.01220.pdf).

**Input:**

$\mathcal{M}^R$, the robot's model of the environment and problem. Consists of the tuple $\langle\mathcal{D}^R, \mathcal{I}^R, \mathcal{G}^R\rangle$, where $\mathcal{D}^R$ is the domain,  $\mathcal{I}^R$ is the initial state, and $\mathcal{G}^R$ is the goal state.

$\mathcal{M}^H$, the human's model of the environment and problem. Consists of the tuple $\langle\mathcal{D}^H, \mathcal{I}^H, \mathcal{G}^H\rangle$, where $\mathcal{D}^H$ is the domain, $\mathcal{I}^H$ is the initial state, and $\mathcal{G}^H$ is the goal state.

**Output:**

A _plan_: a sequence of robot actions that achieves the goal state and also meets the human's expectations. We call the degree to which the robot plan $\pi$ matches the human's expected plan $\pi^e$ the plan's _explicability_, and we often model it as the negated _distance_ $\delta$ between $\pi^e$ and $\pi$:
$$
E(\pi) = -\delta(\pi^e, \pi)
$$

A plan $\pi$ is _perfectly explicable_ if $E(\pi) = 0$. We often use the difference in costs between the two plans as our distance function, $\delta$.
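As a concrete illustration, the sketch below applies a cost-difference distance to the two plans printed earlier, assuming unit action costs. The helper name `cost_difference_explicability` is hypothetical (it is not part of the provided `utils` module), and taking $\delta$ to be the *absolute* cost difference is an assumption made for this example.

```python
# Plans copied from the pyperplan output above; each action has unit cost.
robot_plan = [
    "(robot-move robot1 sq2-2 sq1-2)",
    "(robot-move robot1 sq1-2 sq0-2)",
]
human_expected_plan = [
    "(robot-move robot1 sq2-2 sq2-1)",
    "(robot-move robot1 sq2-1 sq2-0)",
    "(robot-move robot1 sq2-0 sq1-0)",
    "(robot-move robot1 sq1-0 sq0-0)",
    "(robot-move robot1 sq0-0 sq0-1)",
    "(robot-move robot1 sq0-1 sq0-2)",
]


def cost_difference_explicability(plan, expected_plan):
    """Hypothetical helper: E(pi) = -delta(pi^e, pi).

    Assumption: delta is the absolute difference in plan cost, and plan cost
    is the plan length because every action has unit cost.
    """
    return -abs(len(expected_plan) - len(plan))


# The robot's short plan differs from the expected plan by 4 actions.
print(cost_difference_explicability(robot_plan, human_expected_plan))  # -4

# The expected plan itself is perfectly explicable.
print(cost_difference_explicability(human_expected_plan, human_expected_plan))  # 0
```

Under these assumptions the robot's two-step plan scores $-4$, while the plan the human expects scores $0$.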

### Problem 1: Explicability

Let's write a function that computes the explicability score, which is the negative of the cost difference between the current plan and the optimal plan in the robot model.

In [7]:
def explicability_score(optimal_cost, expected_cost):
    """
    Builds the explicability score E(pi). Recall that E(plan) = -(plan_cost -
    optimal_cost).

    @param  optimal_cost:   The cost of the optimal plan.
    @param  expected_cost:  The cost of the fully explicable plan.

    @return E:             The explicability score given the optimal and explicable plans.
    """
    
    ### YOUR CODE HERE
    raise NotImplementedError()

Now, let's check the function you wrote!

In [None]:
check_explicability_score(explicability_score)
test_ok()

### Problem 2: It's MDPs All the Way Down
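As in the problem statement, we'll model the problem as an infinite-horizon discounted MDP of the form $$M = \langle S, A, P, C, \gamma \rangle$$ with state space $S$, action space $A$, transition function $P$, cost function $C$, and discount factor $\gamma$.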

#### Problem Description
Our **state space**, $S$, is the set of the human's trust levels. For this implementation, we'll have four trust levels, so $\lvert S \rvert = 4$. We associate each trust level with a numerical value, $T = \begin{bmatrix} 0 & 0.3 & 0.6 & 1.0 \end{bmatrix}$, which we will use to help us model the rest of the problem.

Our **action space**, $A$, is simple: the robot may choose between its own _optimal plan_ and the human's fully explainable (or _expected_) plan. Therefore, $\lvert A \rvert = 2$, with $A(0) = \pi^\textrm{opt}$ and $A(1) = \pi^\textrm{exp}$.

The **explicability score**, $E(\pi)$, is the negative of the cost difference between the current plan and the optimal plan in the robot model. For example, $E(\pi^\textrm{exp}) = - (\textrm{cost}_{\pi^{\textrm{exp}}} - \textrm{cost}_{\pi^{\textrm{opt}}})$.

As described in the [Python MDP Toolbox documentation](https://pymdptoolbox.readthedocs.io/en/latest/api/mdp.html), our transition matrix $P$ should be a [numpy array](https://numpy.org/doc/stable/reference/generated/numpy.array.html) of size `(2, 4, 4)`. Therefore, `P[k][i][j]` represents the likelihood of transitioning from state $s_i$ to state $s_j$ with the action $a_k$. There are two cases to think about when defining our transition matrix:

1. The **optimal plan**. Here, we need to consider three subcases, as our robot is following a plan with a non-perfect explicability score. The trust level may either **decrease**, **stay the same**, or **increase**. 
   - We model the likelihood that the trust level **decreases** as $P(s_i, a^\pi, s_{i-1}) = \omega(i) \cdot (1 - E(\pi))$.
   - We model the likelihood that the trust level **stays the same** as $P(s_i, a^\pi, s_{i}) = \omega(i) \cdot E(\pi)$.
   - We model the likelihood that the trust level **increases** as $P(s_i, a^\pi, s_{i+1}) = (1 - \omega(i))$.
2. The **expected plan**. Here, trust increases to the next level in all but the maximum trust level (where it is expected to remain the same).

Note that `P[0]` is the transition matrix of the optimal plan, and `P[1]` is the transition matrix of the expected plan.
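Whatever values you fill in, each row `P[k][i]` has to be a probability distribution over next states. The sketch below is a small sanity check you could run on your matrix before handing it to the solver. The helper `is_row_stochastic` is hypothetical (it is not part of `utils`), and the toy matrix is only there to exercise the check; it is not the answer to this problem. The helper is written with plain Python iteration, so it should also accept a numpy array.

```python
def is_row_stochastic(P, tol=1e-9):
    """Hypothetical helper: True if every row of every per-action matrix in P
    has entries in [0, 1] and sums to 1 (within a small tolerance)."""
    for action_matrix in P:        # P[k] is the matrix for action k
        for row in action_matrix:  # P[k][i] is the distribution over next states
            if any(p < -tol or p > 1 + tol for p in row):
                return False
            if abs(sum(row) - 1.0) > tol:
                return False
    return True


# Toy example with 2 actions and 2 states (NOT the problem's solution).
P_toy = [
    [[0.5, 0.5], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [0.0, 1.0]],  # action 1
]
print(is_row_stochastic(P_toy))  # True

# A row that sums to 1.4 is rejected.
P_bad = [[[0.7, 0.7], [0.0, 1.0]]]
print(is_row_stochastic(P_bad))  # False
```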


The **cost**, $C$, is modeled as a `numpy array` of size `(4, 2)`, with $C(s_i, a^\pi) = (1 - \omega(i)) \cdot C_e(\pi)$, where $C_e(\pi)$ is the cost of the fully explainable plan. Here, we'll assume that each action has unit cost, but this cost function could certainly get more complicated!

Whether the human chooses to observe at trust level $i$ is modeled as a Bernoulli random variable with success probability $\omega(i) = 1 - T(i)$. Here, $\omega$ should be a `numpy array` of size `(1, 4)`.
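To make the Bernoulli model concrete, here is a minimal sketch in plain Python that computes the observation probabilities for the four trust values and draws a sample at the two extreme trust levels. The names `observe_prob` and `human_observes` are hypothetical and used only for illustration; the graded `omega` function below should still return a numpy array.

```python
import random

T = [0.0, 0.3, 0.6, 1.0]

# omega(i) = 1 - T(i): probability that the human observes at trust level i.
observe_prob = [1.0 - t for t in T]
print([round(p, 2) for p in observe_prob])  # [1.0, 0.7, 0.4, 0.0]


def human_observes(trust_level, rng):
    """Hypothetical helper: one Bernoulli draw with success probability
    observe_prob[trust_level]."""
    return rng.random() < observe_prob[trust_level]


rng = random.Random(0)
print(human_observes(0, rng))  # True: at the lowest trust level the human always observes
print(human_observes(3, rng))  # False: at the highest trust level the human never observes
```

The intermediate trust levels fall in between: a human at trust level 1 observes about 70% of the time under this model.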

#### Problem Statement
Your task is to write the functions `omega`, `transition_matrix`, and `cost` to model the MDP as described above. Once we're done modeling, we'll run policy iteration on our derived MDP and see what sort of results we get!


In [8]:
def omega(T):
    """
    Builds the omega matrix. Recall that w(i) = (1 - T(i))

    @param  T:     The numerical values for trust.
    
    @return w:     An np.array with the same shape as T, where w(i) = 1 - T(i).
    """
    
    ### YOUR CODE HERE
    raise NotImplementedError()

In [None]:
check_omega(omega)
test_ok()

In [9]:
def transition_matrix(T, w, E):
    """
    Builds the transition matrix, P. Recall that we have 2 actions (size of E)
    and 4 states (size of T). Therefore, our P matrix is an np.array of size
    (length(E), length(T), length(T)).


    The first action in P is following the optimal plan. The second action in P
    is following the explainable plan. See the problem statement for a full
    description of our expected transitions.
    
    @param  T: the trust levels
    @param  w: the likelihood that the human observes
    @param  E: the plan explicability score

    @return P: the transition matrix
    """
    
    ### YOUR CODE HERE
    raise NotImplementedError()


In [None]:
check_matrix(transition_matrix)
test_ok()

In [10]:
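def cost(w, E, expected_cost):
    """
    Builds the cost matrix C, an np.array of size (4, 2), as described in the
    problem statement.
    """
    
    ### YOUR CODE HERE
    raise NotImplementedError()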

In [None]:
check_cost(cost)
test_ok()

Now that we have these helper functions, let's build the Meta-MDP model! For this, use a simple cost function where each action has a unit cost. 

In [11]:
# helper function
def calculate_plan_cost(plan):
    """Calculates the cost of a plan. In our case (because each action has a unit cost), 
    the plan cost is simply the length of the plan. """
    return len(plan)

def build_meta_MDP(T, optimal_plan, expected_plan):
    """
    Builds the Meta MDP model given trust levels and an optimal and expected plan.

    @param          T: The matrix associating trust level with values in the range [0, 1].
    @param          optimal_plan: A list of actions corresponding with the optimal plan.
    @param          expected_plan: A list of actions corresponding with the explicable/expected plan.

    @return P:      The transition matrix.
    @return C:      The cost matrix.
    """

    # Get the w (omega) matrix, representing the likelihood that a human will
    # choose to observe
    w = omega(T)
    
    # Get optimal and expected costs
    optimal_cost = calculate_plan_cost(optimal_plan)
    expected_cost = calculate_plan_cost(expected_plan)
    
    # Get E (explicability score)
    E = explicability_score(optimal_cost, expected_cost)
    
    # Get P (transition matrix)
    P = transition_matrix(T, w, E)
    
    # Get C (cost)
    C = cost(w, E, expected_cost)
    
    return P, C


Finally, let's run the MDP we built! First, run the next cell to generate the optimal and expected plans from the PDDL files above; then run the cell after it to see your generated policy.

In [22]:
human_domain_file = 'pddl/human-domain.pddl'
human_problem_file = 'pddl/human-problem.pddl'

robot_domain_file = 'pddl/robot-domain.pddl'
robot_problem_file = 'pddl/robot-problem.pddl'

expected_plan = run_pyperplan(human_domain_file, human_problem_file)
optimal_plan = run_pyperplan(robot_domain_file, robot_problem_file)

In [23]:
T = np.array([0, 0.3, 0.6, 1])
P, C = build_meta_MDP(T, optimal_plan, expected_plan)
gamma = 0.9

pi = mdptoolbox.mdp.PolicyIteration(P, C, gamma)
pi.run()

pi.policy

(1, 1, 1, 0)

Notice that your policy takes action $1$ (the explainable plan) in all cases except where human trust is very high. This is what we might expect intuitively, so that's pretty cool!
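For intuition about what a solver does with `P`, the second matrix, and `gamma`, here is a minimal value iteration sketch in plain Python. It is not the toolbox's implementation (the cell above uses policy iteration), and the function name, the toy MDP, and its numbers are all made up for illustration. It assumes the same layout as above, `P[a][s][s']` for transitions and `R[s][a]` for the second matrix, and it assumes that matrix is treated as a reward to be maximized.

```python
def value_iteration(P, R, gamma, tol=1e-9, max_iter=10_000):
    """Illustrative sketch. P[a][s][s2] is a transition probability and
    R[s][a] a reward. Returns (greedy policy, state values)."""
    n_actions, n_states = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        # Q[s][a] = R[s][a] + gamma * sum over s2 of P[a][s][s2] * V[s2]
        Q = [
            [
                R[s][a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        new_V = [max(q_row) for q_row in Q]
        converged = max(abs(n - o) for n, o in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    # Pick the action with the largest Q-value in each state.
    policy = tuple(
        max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)
    )
    return policy, V


# Toy MDP (NOT the Meta-MDP): action 0 stays put, action 1 moves to state 1.
P_toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [0.0, 1.0]],  # action 1
]
R_toy = [
    [0.0, 1.0],  # state 0: staying pays 0, moving pays 1
    [2.0, 0.0],  # state 1: staying pays 2, "moving" pays 0
]
policy, V = value_iteration(P_toy, R_toy, gamma=0.9)
print(policy)  # (1, 0)
```

In the toy problem the greedy policy moves out of state 0 and then stays in state 1, because staying there earns 2 per step, worth $2 / (1 - 0.9) = 20$ in discounted total.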