# Grad Project \# 1

## Imports and Utilities

This section contains the imports, the utility functions, and the scaffolding that we have provided for the project.

Read through the code in the other .py files. You will need to understand the functions in those files to complete the project.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from typing import (Iterable, Dict, Optional)
# Setup matplotlib animation
import matplotlib
from IPython.display import HTML
matplotlib.rc('animation', html='jshtml')
import random
import dataclasses
import numpy as np
from utils import *

Throughout this course, we will be developing a "search and rescue" robot that is charged with navigating a sometimes dangerous grid to find and help people in need. We will first consider planning to navigate to, pick up, and drop off people at a hospital. You will use some of the algorithms we have already discussed in this subject, including heuristic search and Monte-Carlo tree search.

# 1. Pickup Agent 

Let us first consider a deterministic gridworld-like domain that consists of a robot, a patient and a target hospital.

![Pickup figure](just_wait.png)

The robot's goal is to pick up the patient and bring them to the target hospital. As you can see in the above figure, the blue circle represents our robot, the orange circle represents the patient, and the red cross represents the hospital. The environment has one-way roadblocks, indicated by a flat triangle between two cells. For example, the agent can only move down from (1,1) to (2,1) but not up. The environment also has two-way roadblocks, indicated by thick black boundaries between two cells. For example, the figure above has a long corridor formed by two walls that the agent can travel down from its initial position, but not sideways.

In the following code block, we have provided an implementation of the `PickupProblem`, in terms of states, actions and costs. 

In [None]:
from pickup_problem import OneWayBlock, PickupProblem, PickupProblemState
### SET UP THE SIMPLE MAZE 

def add_wall(cell_1, cell_2, one_ways_list):
    one_ways_list.append(OneWayBlock(cell_1, cell_2))
    one_ways_list.append(OneWayBlock(cell_2, cell_1))

simple_one_ways = []
for i in range(7):
    simple_one_ways.append(OneWayBlock((i+1, 1), (i,1)))
    simple_one_ways.append(OneWayBlock((i, 3), (i+1,3)))

for i in range(1,7):
    add_wall((i,1), (i, 0), simple_one_ways)
    add_wall((i,1), (i, 2), simple_one_ways)
    # simple_one_ways.append(OneWayBlock((i, 1), (i, 0)))
    # simple_one_ways.append(OneWayBlock((i, 1), (i, 2)))
    
add_wall((0, 0), (1, 0), simple_one_ways)
add_wall((7, 0), (6, 0), simple_one_ways)
add_wall((0, 2), (1, 2), simple_one_ways)
add_wall((7, 2), (6, 2), simple_one_ways)

SimpleProblem = PickupProblem((8, 4), (0, 1), (7, 0), (0, 3), simple_one_ways)

### END OF SET UP THE SIMPLE MAZE 
SimpleProblem.render(SimpleProblem.initial)

## 1.1 A* In Pickup 

Please implement A* in the context of the Pickup domain using the (not very good) default heuristic that is included in the `PathCostProblem` definition. In the helper code above, we gave you a method `run_best_first_search` that takes a `PathCostProblem` as an argument. That method might be helpful to you here. 
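
For orientation, here is a minimal, self-contained sketch of A* as best-first search ordered by $f = g + h$. It does not use `run_best_first_search` or `PathCostProblem`; the name `astar_sketch` and its callback arguments are hypothetical and chosen only for illustration, so treat it as a reminder of the idea rather than as the expected solution.

```python
import heapq


def astar_sketch(start, is_goal, successors, h):
    """Generic A*: `successors(state)` yields (next_state, step_cost) pairs.

    Hypothetical helper for illustration; not part of the project code.
    Returns the list of states on the path found, or None if none exists.
    """
    counter = 0  # tie-breaker so the heap never has to compare states
    frontier = [(h(start), counter, start)]
    best_g = {start: 0.0}   # cheapest known cost-to-come for each state
    parent = {start: None}  # back-pointers for path reconstruction
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt, cost in successors(state):
            g = best_g[state] + cost
            if g < best_g.get(nxt, float("inf")):
                best_g[nxt] = g
                parent[nxt] = state
                counter += 1
                # Priority is f = g + h, which is what makes this A*.
                heapq.heappush(frontier, (g + h(nxt), counter, nxt))
    return None
```

With `h` returning 0 everywhere this reduces to uniform-cost search, which is a convenient baseline when you compare heuristics in Section 1.2.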

For reference, our solution is **2** line(s) of code.

In [None]:
def run_astar_search(problem: PathCostProblem, step_budget: int = 10000):
    """A* search.

    Use the implementation of `run_best_first_search` with the default heuristic.
    """

    raise NotImplementedError() 
    return path 

In [None]:
search_result = run_astar_search(SimpleProblem)
print("Path information:")
print("num_steps:", search_result[3])
print("path costs:",search_result[2],"total cost:",sum(search_result[2]))
print("actions found is ", search_result[1])

## 1.2 Better heuristics for Pickup 

Please provide a better heuristic, and show that it outperforms the original heuristic. 

We have not defined in this question what it means to "outperform" here, so this is a free-form answer. You are welcome to add extra instrumentation to any of the classes to help with analysis, although our solution does not require that.  

For reference, our heuristic is **12** line(s) of code.

In [None]:
def run_astar_search_faster(problem: PathCostProblem, step_budget: int = 10000):
    """A* search.
    Write a better heuristic than the default provided one
    Use your heuristic implementation with `run_best_first_search`.
    """

    raise NotImplementedError() 
    
    return path

In [None]:
search_result = run_astar_search_faster(SimpleProblem)
# can print evidence of improvement below
# e.g., print(...)
print("Path information:")
print("num_steps:", search_result[3])
print("path costs:",search_result[2],"total cost:",sum(search_result[2]))
print("actions found is ", search_result[1])

## Answer to Question 1.2
**Describe how you have improved the heuristic and how you have shown that it outperforms the original heuristic.**

# 2. The evolution of fire 

The above problem is a deterministic search problem. But let's make the problem more interesting: Our grid now has fire. The fire starts at some initial locations (represented by red grid cells in the first figure) and evolves through time.

Thanks to our research at MIT, we know the exact model of how fire evolves through time.

Most notably, the fire grid at time $t$ is independent of the robot and patient, and depends only on the fire grid at the previous time step. Fire also completely ignores walls, roadblocks and the hospital.

Further, given the fire grid at time $t$, the probabilities of fire at any two different cells at time $t+1$ are independent: 
$$
P(\mathbf{F}^{t+1} \mid \mathbf{F}^t) = \prod_{(i, j) \in \mathtt{grid}} P\left(\mathbf{F}_{(i, j)}^{t+1} \mid \mathbf{F}_{(i', j') \in \mathtt{neighbors}((i, j))}^t\right),
$$ where $\mathtt{neighbors}((i, j)) = \{ (i', j') \mid  |i - i'| \le 1 \land |j - j'| \le 1 \}$ is the 3 by 3 patch of cells centered at $(i, j)$, including $(i, j)$ itself.
Further, at time $t + 1$, the probability of fire in cell $(i, j)$ is the weighted
sum of the fire values of its neighboring cells at time $t$, for a given fixed weight matrix
$W \in \mathbb{R}^{3\times 3}$:
$$
P\left(\mathbf{F}_{(i, j)}^{t+1} \mid \mathbf{F}_{(i', j') \in \mathtt{neighbors}((i, j))}^t\right) \propto \sum_{(i', j') \in \mathtt{neighbors}((i, j))} W[i' - i + 1, j' - j + 1] \cdot \mathbf{F}_{(i', j')}^t .
$$

You might recognize that given the fire grid at time $t$, a matrix of 0-1 values, 
the probability of fire at time $t+1$ is the [2D convolution](http://www.songho.ca/dsp/convolution/convolution2d_example.html) of the fire grid and the weights $W$ (normalized such that the entries sum to one).
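
As a rough illustration of this one-step update, the sketch below evaluates the weighted sum from the formula above with plain NumPy. The function name `next_fire_prob` is hypothetical, and two assumptions are made: cells outside the grid count as not on fire (zero padding), and $W$ is normalized by the sum of its entries. It follows the index convention $W[i' - i + 1, j' - j + 1]$ literally, so the kernel is not flipped.

```python
import numpy as np


def next_fire_prob(fire_grid: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Per-cell probability of fire at t+1 given a 0-1 fire grid at t.

    Hypothetical helper; assumes a 3x3 `weights` matrix and zero padding.
    """
    w = weights / weights.sum()  # normalize so the entries sum to one
    rows, cols = fire_grid.shape
    padded = np.pad(fire_grid.astype(float), 1)  # out-of-grid cells have no fire
    out = np.zeros((rows, cols))
    for di in range(3):
        for dj in range(3):
            # padded[i + di, j + dj] is the grid value at (i + di - 1, j + dj - 1),
            # which matches W[i' - i + 1, j' - j + 1] with i' = i + di - 1.
            out += w[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out
```

For a uniform $W$ and a single burning cell in the middle of a 3 by 3 grid, every cell has that burning cell in its patch, so each entry of the result is $1/9$.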

Here's another way to understand the fire process:
- We start with some initial fire grid. 
- At each time step $t$, for each cell, we randomly select a neighbor (including the current cell) with probability proportional to the weights matrix; then, the current cell at time $t+1$ gets the selected neighbor's fire value at time $t$. Note that neighbors outside of the grid do not have fire.
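
The sampling view above can be sketched as follows. This is only an illustration under stated assumptions, not the provided `FireProcess` implementation: the name `sample_next_fire` is hypothetical, the weight matrix is assumed to be 3 by 3, and out-of-grid neighbors are modeled by zero padding.

```python
import numpy as np


def sample_next_fire(fire_grid: np.ndarray, weights: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Sample the fire grid at t+1 by copying a randomly chosen neighbor per cell."""
    probs = (weights / weights.sum()).ravel()  # selection probabilities over the 3x3 patch
    rows, cols = fire_grid.shape
    padded = np.pad(fire_grid, 1)  # neighbors outside the grid do not have fire
    nxt = np.zeros_like(fire_grid)
    for i in range(rows):
        for j in range(cols):
            k = rng.choice(9, p=probs)   # pick one of the 9 patch positions
            di, dj = divmod(int(k), 3)   # row and column offset inside the patch
            nxt[i, j] = padded[i + di, j + dj]
    return nxt
```

Each cell draws its neighbor independently, which is what makes the cells at time $t+1$ independent given the grid at time $t$.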

In [None]:
from fire_process import (
    FireMDP, 
    FireMDPState, 
    FireProcess,
    get_problem,
)  # read and understand this code

### Our Approach

We are going to adopt an online-planning approach, where at every step, our agent:
- plans according to the current state,
- executes an action, and
- observes a new state of the fire grid and replans (i.e., restarts from step one).
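
The plan, execute, observe cycle above can be written as a short generic loop. The names `run_online`, `plan`, `execute`, and `is_done` are hypothetical placeholders for illustration; the project's own `run_agent_on_problem` may be organized differently.

```python
def run_online(state, plan, execute, is_done, max_steps=100):
    """Replan from scratch at every step and execute only the first planned action.

    `plan(state)` returns a non-empty list of actions, `execute(state, action)`
    returns the next observed state. All names here are illustrative.
    """
    trajectory = [state]
    for _ in range(max_steps):
        if is_done(state):
            break
        action = plan(state)[0]         # plan from the current state, keep the first action
        state = execute(state, action)  # the environment returns the new observed state
        trajectory.append(state)
    return trajectory
```

Because the rest of each plan is discarded, the agent can react to how the fire actually evolved rather than how it was predicted to evolve.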

We will consider two different styles of planning: determinized approximation and Monte-Carlo tree search.


#### Approximate, Determinize and Replan

In our first approach, at each planning step, we will try to find an open-loop plan that is most likely to succeed.
In particular, we turn the MDP into a min-cost path problem:
- The state space no longer has the fire grid, but only contains the state of the pickup and rescue problem.
- For a step $(s_t, a, s_{t+1})$ at time $t$, we charge a cost of $c - \log \left(1 - P\left(\mathtt{on\_fire}_{t+1}(s_{t+1})\right)\right)$, where $c$ is a small cost for taking each step, and $P\left(\mathtt{on\_fire}_t(s)\right)$ is the marginal probability of stepping on fire at state $s$ at time $t$. 
- We try to find the least-cost path to reach the patient and rescue them to the hospital --- this path becomes our found open-loop plan.
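
To see why this cost leads to a plan that is likely to succeed, note that the costs $-\log(1 - p_t)$ add up to $-\log \prod_t (1 - p_t)$, so minimizing the total cost maximizes the product of the per-step probabilities of not stepping on fire (plus a small penalty for path length). The sketch below checks this numerically; `fire_step_cost` and `open_loop_cost` are hypothetical names, and the probabilities passed in are placeholders rather than values from the project.

```python
import math


def fire_step_cost(p_fire: float, c: float = 1e-6) -> float:
    """Cost of one step whose destination is on fire with marginal probability p_fire."""
    if p_fire >= 1.0:
        return math.inf  # certain fire: log(0) is undefined, so treat the step as infinitely costly
    return c - math.log(1.0 - p_fire)


def open_loop_cost(p_fires, c: float = 1e-6) -> float:
    """Total cost of a plan given the fire probability encountered at each step."""
    return sum(fire_step_cost(p, c) for p in p_fires)
```

With `c = 0`, `math.exp(-open_loop_cost(ps, c=0.0))` recovers the product of the `1 - p` terms, which is the quantity the determinized planner is implicitly maximizing.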

Once we have a min-cost path problem, we can use A* search with a simple heuristic that ignores fire.
In particular, we will use a simple heuristic that is the sum of the Manhattan distance from robot to the patient and the Manhattan distance 
from patient to the hospital, scaled by the small cost `c` charged at each step. 
Note that when the robot is carrying a patient, the distance between the robot and the patient is zero.
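
A minimal sketch of this heuristic on plain coordinate tuples is shown below. The names `manhattan` and `pickup_heuristic` and the argument layout are assumptions made for illustration; in the project, the locations live on the `PickupProblem` and its state, and your `h` method will need to read them from there.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pickup_heuristic(robot, patient, hospital, carrying, c=1e-6):
    """Robot-to-patient plus patient-to-hospital distance, scaled by the per-step cost c."""
    if carrying:
        # The patient travels with the robot, so the first leg has length zero.
        patient = robot
    return c * (manhattan(robot, patient) + manhattan(patient, hospital))
```

Scaling by `c` keeps the heuristic a lower bound on the remaining cost, since every step costs at least `c` regardless of fire.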

_Hints_: 
- Before you code, try to derive the marginal probabilities of each grid cell being on fire at time $t$ as an expression of the marginal probabilities of each grid cell being on fire at time $t-1$. What do you find? More concretely, you may start small: Consider a grid consisting of only two cells, named $X$ and $Y$, and assume that $W$ is uniform. Then, try to write the marginal probability $P(X_t=1 \mid X_0, Y_0)$ as an expression of $P(X_{t-1}=1 \mid X_0, Y_0)$ and $P(Y_{t-1}=1 \mid X_0, Y_0)$.
- We can formulate the A* state as $(\mathtt{robot\_loc}, \text{carrying\_patient}, \mathtt{time})$ and use time as an index into a precomputed sequence of the marginal probabilities that each cell is on fire. See `DeterminizedFireMDPState` and `DeterminizedFireMDP` for more details.

## 2.1 Determinized Min-cost Path Problem


Please complete the implementation of `DeterminizedFireMDP`. In particular, you should:
- Complete the function `fire_dist_at_time` to compute the log-likelihood of each cell being on fire at time $t$ given the true fire state at time $0$. 
- Using your implementation of `fire_dist_at_time`, complete the function `step_cost`.
- Complete the rest of the `DeterminizedFireMDP` and implement the heuristic function `h` based on the description above. It might look remarkably similar to your heuristic from question 1.2, the only difference being that `DeterminizedFireMDP` contains a `PathCostProblem`. 
    

For reference, our solution is **91** line(s) of code, including the code we have provided for you. 

In [None]:
@dataclasses.dataclass(frozen=True, eq=True, order=True)
class DeterminizedFireMDPState(PickupProblemState):
    """A state for the DeterminizedFireMDP.

    The state is a pair of the PickupProblemState and a time step $t$.
    """
    time: int = 0

@dataclasses.dataclass(frozen=True)
class DeterminizedFireMDP(PathCostProblem):
    """Determinized version of the fire MDP --- tries to find the solution path
    that is most likely to succeed.
    """
    pickup_problem: PickupProblem
    fire_process: FireProcess

    # Additional cost for each step.
    # Can be 0 but we might have 0-cost arcs if the success probability is 1.
    action_cost = 1e-6

    # Use this to cache precomputed fire distributions, so we don't have to recompute them.
    fire_dists_cache: Dict[int, np.ndarray] = dataclasses.field(
        init=False,
        default_factory=dict,
    )

    def __post_init__(self):
        assert (self.pickup_problem.grid_shape ==
                self.fire_process.initial_fire_grid.shape)

    @property
    def initial(self) -> DeterminizedFireMDPState:
        return DeterminizedFireMDPState(
            *dataclasses.astuple(self.pickup_problem.initial),
            time=0,
        )

    def actions(self, state: DeterminizedFireMDPState) -> Iterable[Action]:
        raise NotImplementedError() 
    
    def step(self, state: DeterminizedFireMDPState,
             action: Action) -> State:
        """We automatically pick up patient if we're on that square."""
        raise NotImplementedError() 
    
    def goal_test(self, state: DeterminizedFireMDPState) -> bool:
        """True if at hospital and holding patient."""
        raise NotImplementedError()

    def step_cost(self, state1: DeterminizedFireMDPState, action: Action,
                  state2: DeterminizedFireMDPState) -> float:
        raise NotImplementedError() 

    def fire_dist_at_time(self, t: int) -> np.ndarray:
        """Return the marginal distribution of fire grid at time $t$. This should populate and use caching in self.fire_dists_cache"""

        raise NotImplementedError() 
    
    def h(self, state: DeterminizedFireMDPState) -> float:
        """heuristic based on the manhattan distance to the patient and hospital."""
        raise NotImplementedError()

## 2.2 Determinized Fire MDP Agent


Please complete the implementation of `FireMDPDeterminizedAStarAgent`. Note that we have filled in most of the implementation for you --- including the call to `run_astar_search` from section 1. All you need to implement is the `determinized_problem` method.

For reference, our solution is **36** line(s) of code, including the code we have provided. 

In [None]:
@dataclasses.dataclass(frozen=True)
class FireMDPDeterminizedAStarAgent(Agent):
    """Agent that uses A* to plan a path to the goal in a determinized
    version of the problem. Does not need any internal state since we
    re-determinize the problem at each step.
    """

    problem: FireMDP
    step_budget: int = 10000

    def determinized_problem(self,
                             state: FireMDPState) -> DeterminizedFireMDP:
        """Returns a determinized approximation of the fire MDP."""
        raise NotImplementedError() 
        
    def act(self, state: FireMDPState) -> Action:
        problem = self.determinized_problem(state)
        try:
            plan = run_astar_search(problem, self.step_budget)
        except SearchFailed:
            print("Search failed, performing a random action")
            return random.choice(list(self.problem.actions(state)))
        return plan[1][0]

Now that we have the agent, let's try to experiment with it under some environments!

We have provided you various environments under the `get_problem` function.
Try to run your agent in each provided environment several times. 
You can visualize the agent's behavior using the `run_agent_on_problem` and `animate_trajectory` functions.

Let's take a closer look at the particular MDP of `get_problem("just_wait")`. 
You might ask yourself: what is your agent's behavior (on average)? 
In particular, does the robot "wait" by patrolling in the top row for a while, 
and then move out to rescue the patient? 
If not, then it is very likely that your agent implementation is buggy! 


#### Experiment 1
Please try to generate an animation of a **successful run** (i.e., the robot successfully rescues the patient) of the agent in the `just_wait` MDP, but **the fire does not completely die out when the robot moves down from the top row**. You might need to repeat the experiment a few times to produce this animation. If you fail to create this animation after a handful of trials (say, 15), your agent implementation might be buggy. Please feel free to reach out to us anytime if you get stuck.

Submit the animation as **just_wait_determinized.mp4**. Videos embedded in the Jupyter notebook **are not supported**, so you will need to submit the video separately.

You can view a video of the animated trajectory in the notebook by running the following code. Again, these videos will **not** be visible to the graders, so you will need to submit the video separately.
```python
HTML(animate_trajectory(...).to_html5_video())
```
To save the video to a file, you can use the following code:
```python
animate_trajectory(...).save("just_wait_determinized.mp4")
```


## 2.3 What's the Right Choice? 
Sometimes our determinized agent finds a plan that is not optimal. To see this effect, try running the determinized agent in the MDP `get_problem("the_choice")`. In this environment, the agent faces a choice of going right or down from the initial location.

It may choose the shortcut by taking the down action. However, it then risks getting close to the fire next to the one-way passage, and it cannot hide from the fire in this passage.
It may also accept the challenge by taking the right action. Here, the robot moves to a large room with more fire than the one-way passage. But it can move around to try its best to avoid fire, until it finds a clear path to rescue the patient.

#### Experiment 2

Similar to experiment 1, please visualize the behavior of the determinized planning agent in this environment, and generate an animation of a successful run. What choice does your determinized agent make?

Submit the animation as **the_choice_determinized.mp4**. 

# 3. MCTS Agent


Now that we have seen a failure mode of our determinized planning agent, let's try to do better with closed-loop planning with MCTS!

With MCTS, we have more of a chance of hedging bets, so we might be inclined to go in directions where there are more options in case we get caught, even if the expected open-loop cost is higher.

We have provided you with an MCTS implementation, `run_mcts_search`. Please take a look at the documentation of `run_mcts_search` to understand how to use it, then implement an MCTS agent for MDPs.

Please complete the implementation of `MCTSAgent`.

_Hint: You can pass in the `self.planning_horizon` to `run_mcts_search`,
to handle both infinite-horizon problems (by receding-horizon planning) and finite-horizon problems._


For reference, our solution is **42** line(s) of code.

In [None]:
@dataclasses.dataclass
class MCTSAgent(Agent):
    """Agent that uses Monte Carlo Tree Search to plan a path to the goal.

    The agent simply wraps `run_mcts_search`, and it should work for any MDP.
    """

    problem: MDP

    # An optional receding horizon to use for the planning
    # If not provided, the problem must have a finite horizon
    receding_horizon: Optional[int] = None

    C: float = np.sqrt(2)
    iteration_budget: int = 1000

    t: int = dataclasses.field(default=0, init=False)

    def __post_init__(self):
        if self.receding_horizon is None:
            assert self.problem.horizon != np.inf

    def reset(self):
        self.t = 0

    @property
    def planning_horizon(self) -> int:
        """Returns the planning horizon for the current time step."""
        if self.receding_horizon is None:
            return self.problem.horizon - self.t
        return self.receding_horizon

    def act(self, state: State) -> Action:
        """Return the action to take at state."""
        raise NotImplementedError()


## 3.1 Making the Right Choice!
Let's run our MCTS agent in the `the_choice` MDP, and see if it makes the right choice!

#### Experiment 3
Similar to experiment 2, please visualize the behavior of the MCTS planning agent in the `the_choice` MDP and generate an animation of a successful run.

Submit the animation as **the_choice_mcts.mp4**. 



## 3.2 Benchmarks
Now that we have both a determinized agent and an MCTS agent, let's do some benchmarking.

We have provided you with a simple agent, `RolloutLookaheadAgent`: At each step, it performs several rollouts for each action and chooses the best one. Note that `RolloutLookaheadAgent` with `receding_horizon=0` becomes a naive agent that chooses an action uniformly at random.
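
To make the idea of rollout lookahead concrete, here is a small generic sketch. It is not the provided `RolloutLookaheadAgent`, whose details may differ: the function name `rollout_lookahead`, the `simulate(state, action, rng)` callback returning `(next_state, reward)`, and the choice of uniformly random actions after the first step are all assumptions made for illustration.

```python
import random


def rollout_lookahead(state, actions, simulate, horizon, num_rollouts=10, rng=None):
    """Pick the action whose random rollouts have the highest average return."""
    rng = rng or random.Random()
    actions = list(actions)
    if horizon <= 0:
        # Nothing to look ahead over: fall back to a uniformly random action.
        return rng.choice(actions)

    def rollout_return(first_action):
        s, total, a = state, 0.0, first_action
        for _ in range(horizon):
            s, r = simulate(s, a, rng)
            total += r
            a = rng.choice(actions)  # random policy after the first action
        return total

    values = {
        a: sum(rollout_return(a) for _ in range(num_rollouts)) / num_rollouts
        for a in actions
    }
    return max(values, key=values.get)
```

Unlike MCTS, this scheme spends the same number of rollouts on every first action and never builds a tree, which is one reason it serves as a baseline in the comparison.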

Let's compare these agents: `RolloutLookaheadAgent(receding_horizon=0)`, `RolloutLookaheadAgent(receding_horizon=40)`, `FireMDPDeterminizedAStarAgent`, `MCTSAgent(iteration_budget=10, receding_horizon=40)` and `MCTSAgent(iteration_budget=50, receding_horizon=40)`. Please set unspecified parameters to their default values.

Then, run the benchmarks. In particular:

- For each environment in `get_problem`, run each agent at least 10 times in that environment.
- Record the obtained total rewards for each run.
- Record the average, standard deviation, min, and max rewards each agent obtains.

Note that running the MCTS agent can take a while. In our experience, the benchmarking may take a few hours. Therefore, we recommend running the MCTS agent on a local laptop or desktop. If you choose to do so, you may change `MCTSAgent(iteration_budget=50)` to one with more iterations, such as `MCTSAgent(iteration_budget=100)` or `MCTSAgent(iteration_budget=500)` --- doing so should produce a much better MCTS agent. You may also want to repeat each setting more than 10 times, such as 30 times --- doing so will reduce the effect of stochasticity in the experiments.

Please prepare a table comparing the above agents' performances in the environments. We do not impose a format for the table, but you should prepare the table such that it is reasonably readable. Remember to indicate the experiment settings for the table (the parameters you used and the number of repetitions, etc.). Once you have the table, try to identify any interesting patterns from the table, and summarize your findings in words.
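
One possible way to turn recorded rewards into such a table is sketched below. The helpers `summarize_rewards` and `format_table` are hypothetical and independent of `benchmark_agent` and `compare_agents`; they assume you have collected, for each agent, a list of total rewards (one per run).

```python
import numpy as np


def summarize_rewards(rewards):
    """Return mean, standard deviation, min, max, and run count for one agent."""
    r = np.asarray(rewards, dtype=float)
    return {"n": r.size, "mean": r.mean(), "std": r.std(), "min": r.min(), "max": r.max()}


def format_table(results):
    """Render {agent_name: [total_reward, ...]} as a Markdown table string."""
    lines = ["| Agent | n | mean | std | min | max |", "|---|---|---|---|---|---|"]
    for name, rewards in results.items():
        s = summarize_rewards(rewards)
        lines.append(
            f"| {name} | {s['n']} | {s['mean']:.2f} | {s['std']:.2f} "
            f"| {s['min']:.2f} | {s['max']:.2f} |"
        )
    return "\n".join(lines)
```

Producing one table per environment, with the parameters and the number of repetitions stated in a caption, keeps the comparison readable. Note that `r.std()` is the population standard deviation; pass `ddof=1` if you prefer the sample estimate, and say which one you report.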

Hints:
- You may want to take a look at the `benchmark_agent` and `compare_agents` functions, which contain some boilerplate code to get you started.
- In particular, you may also use the parameter `max_steps` of the above functions to limit the number of steps for each evaluation episode, if evaluation is taking too much time.

**Please submit a PDF of the results, including the table and a summarization of it.**

## Final Submission
Your final submission to Gradescope should include the following files:
- `project01.ipynb`: Your completed notebook **with output from running each cell**. Make sure to save. If you made changes to any of the `.py` files, please include those as well.
- `just_wait_determinized.mp4`: The animation of the successful run of the determinized agent in the `just_wait` MDP.
- `the_choice_determinized.mp4`: The animation of the successful run of the determinized agent in the `the_choice` MDP.
- `the_choice_mcts.mp4`: The animation of the successful run of the MCTS agent in the `the_choice` MDP.
- A PDF of the results of the benchmarking, as described in the last part of the project.

## Feedback

If you have any feedback for us, please complete [this form](https://forms.gle/58Juq1TDtxXKp11q7)!