# Module 11 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure to use your JHED ID (it's made out of your name, unlike your student id, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Reinforcement Learning with Value Iteration

These are the same maps from Module 1 but the "physics" of the world have changed. In Module 1, the world was deterministic. When the agent moved "south", it went "south". When it moved "east", it went "east". Now, the agent only succeeds in going where it wants to go *sometimes*. There is a probability distribution over the possible states so that when the agent moves "south", there is a small probability that it will go "east", "north", or "west" instead and have to move from there.

There are a variety of ways to handle this problem. For example, if using A\* search, if the agent finds itself off the solution, you can simply calculate a new solution from where the agent ended up. Although this sounds like a really bad idea, it has actually been shown to work really well in video games that use formal planning algorithms (which we will cover later). When these algorithms were first designed, this was unthinkable. Thank you, Moore's Law!

Another approach is to use Reinforcement Learning which covers problems where there is some kind of general uncertainty in the actions. We're going to model that uncertainty a bit unrealistically here but it'll show you how the algorithm works.

As far as RL is concerned, there are a variety of options there: model-based and model-free, Value Iteration, Q-Learning and SARSA. You are going to use Value Iteration.

## The World Representation

As before, we're going to simplify the problem by working in a grid world. The symbols that form the grid have a special meaning as they specify the type of the terrain and the cost to enter a grid cell with that type of terrain:

```
token   terrain    cost 
.       plains     1
*       forest     3
^       hills      5
~       swamp      7
x       mountains  impassable
```

When you go from a plains node to a forest node it costs 3. When you go from a forest node to a plains node, it costs 1. You can think of the grid as a big graph. Each grid cell (terrain symbol) is a node and there are edges to the north, south, east and west (except at the edges).
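The cost rule above can be sketched as a small lookup: the cost of a move depends only on the terrain of the cell being entered. `entry_cost`, `demo_costs`, and `demo_world` are hypothetical names used only for illustration; they are not part of the assignment's required interface.

```python
# Positive terrain costs from the table above. Mountains ("x") are absent
# because they cannot be entered.
demo_costs = {".": 1, "*": 3, "^": 5, "~": 7}

def entry_cost(world, x, y):
    """Return the cost of entering cell (x, y), or None if it is impassable."""
    return demo_costs.get(world[y][x])

demo_world = [[".", "*"],
              ["x", "."]]

print(entry_cost(demo_world, 1, 0))  # plains -> forest costs 3
print(entry_cost(demo_world, 0, 0))  # forest -> plains costs 1
print(entry_cost(demo_world, 0, 1))  # mountains cannot be entered: None
```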

There are quite a few differences between A\* Search and Reinforcement Learning but one of the most salient is that A\* Search returns a plan of N steps that gets us from A to Z, for example, A->C->E->G.... Reinforcement Learning, on the other hand, returns a *policy* that tells us the best thing to do in **every state.**

For example, the policy might say that the best thing to do in A is go to C. However, we might find ourselves in D instead. The policy covers this possibility: it might say D->E. Trying this action might land us in C, where the policy will say C->E, and so on. At least with offline learning, everything will be learned in advance (in online learning, you can only learn by doing and so you may act according to a known but suboptimal policy).

Nevertheless, if you were asked for a "best case" plan from (0, 0) to (n-1, n-1), you could (and will) be able to read it off the policy because there is a best action for every state. You will be asked to provide this in your assignment.
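Reading a best-case plan off a policy amounts to following the recommended action from the start until the goal is reached, assuming every move succeeds. The sketch below assumes the policy format `{(x, y): (dx, dy)}` with `(0, 0)` marking a state that has no usable action; `best_case_path`, `demo_policy`, and `max_steps` are hypothetical names, and the step limit guards against a policy that loops.

```python
def best_case_path(policy, start, goal, max_steps=1000):
    """Follow the policy from start, assuming every action succeeds."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == goal:
            return path
        dx, dy = policy[state]
        if (dx, dy) == (0, 0):
            return None  # no usable action recorded for this state
        state = (state[0] + dx, state[1] + dy)
        if state not in policy:
            return None  # the policy pointed off the map
        path.append(state)
    # Out of steps: succeed only if the last move landed on the goal.
    return path if state == goal else None

demo_policy = {(0, 0): (1, 0), (1, 0): (0, 1),
               (0, 1): (1, 0), (1, 1): (0, 1)}
print(best_case_path(demo_policy, (0, 0), (1, 1)))
# [(0, 0), (1, 0), (1, 1)]
```

Returning `None` rather than a partial path makes a looping or broken policy easy to detect.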

We have the same costs as before. Note that we've negated them this time because RL requires negative costs and positive rewards:

In [1]:
costs = { '.': -1, '*': -3, '^': -5, '~': -7}

and a list of offsets for `cardinal_moves`. You'll need to work this into your **actions**, A, parameter.

In [2]:
cardinal_moves = [(0,-1), (1,0), (0,1), (-1,0)]

In [3]:
cardinal_actions = {"up": (0,-1), "right": (1,0), "down": (0,1), "left": (-1,0)}

For Value Iteration, we require knowledge of the *transition* function, as a probability distribution.

The transition function, T, for this problem is 0.70 for the desired direction, and 0.10 each for the other possible directions. That is, if the agent selects "north" then 70% of the time, it will go "north" but 10% of the time it will go "east", 10% of the time it will go "west", and 10% of the time it will go "south". If the agent is at the edge of the map, it simply bounces back to the current state.
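One way to read this transition function is as a distribution over successor states. The sketch below is a minimal illustration of the rule stated above (0.70 for the intended move, 0.10 for each other direction, bouncing back to the current state at the map edge). It also assumes that a move into impassable terrain bounces back, which the text does not specify. `transition`, `demo_moves`, and `demo_world` are hypothetical names.

```python
from collections import defaultdict

demo_moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]

def transition(world, state, action, moves=demo_moves):
    """Return {successor_state: probability} for taking `action` in `state`."""
    x, y = state
    rows, cols = len(world), len(world[0])
    dist = defaultdict(float)
    for move in moves:
        p = 0.70 if move == action else 0.10
        nx, ny = x + move[0], y + move[1]
        # Assumption: leaving the map or hitting "x" leaves the agent in place.
        if not (0 <= nx < cols and 0 <= ny < rows) or world[ny][nx] == "x":
            nx, ny = x, y
        dist[(nx, ny)] += p
    return dict(dist)

demo_world = [[".", "."],
              [".", "x"]]
# From the top-left corner, "up" and "left" both bounce back to (0, 0).
print(transition(demo_world, (0, 0), (1, 0)))
```

Accumulating probabilities per successor state keeps the distribution summing to 1 even when several moves bounce back to the same cell.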

You need to implement `value_iteration()` with the following parameters:

+ world: a `List` of `List`s of terrain (this is S from S, A, T, gamma, R)
+ costs: a `Dict` of costs by terrain (this is part of R)
+ goal: A `Tuple` of (x, y) stating the goal state.
+ reward: The reward for achieving the goal state.
+ actions: a `List` of possible actions, A, as offsets.
+ gamma: the discount rate

You will return a policy:

`{(x1, y1): action1, (x2, y2): action2, ...}`

Remember...a policy is what to do in any state for all the states. Notice how this is different than A\* search which only returns actions to take from the start to the goal. This also explains why reinforcement learning doesn't take a `start` state.

You should also define a function `pretty_print_policy( cols, rows, policy)` that takes a policy and prints it out as a grid using "^" for up, "<" for left, "v" for down and ">" for right. Use "x" for any mountain or other impassable square. Note that it doesn't need the `world` because the policy has a move for every state. However, you do need to know how big the grid is so you can pull the values out of the `Dict` that is returned.

```
vvvvvvv
vvvvvvv
vvvvvvv
>>>>>>v
^^^>>>v
^^^>>>v
^^^>>>G
```

(Note that that policy is completely made up and only illustrative of the desired output). Please print it out exactly as requested: **NO EXTRA SPACES OR LINES**.

* If everything is otherwise the same, do you think that the path from (0,0) to the goal would be the same for both A\* Search and Q-Learning?
* What do you think if you have a map that looks like:

```
><>>^
>>>>v
>>>>v
>>>>v
>>>>G
```

has this converged? Is this a "correct" policy? What are the problems with this policy as it is?


In [4]:
def read_world(filename):
    result = []
    with open(filename) as f:
        for line in f.readlines():
            if len(line) > 0:
                result.append(list(line.strip()))
    return result

In [5]:
from copy import deepcopy

---

## init_rewards

`init_rewards` initializes the `R` matrix for value iteration. It takes a reward for the goal, a goal coordinate, a dict of costs by terrain type, the world, and a weight to apply to the costs. The function returns a fully initialized matrix that holds the reward of each cell in the world. It assigns the goal cell a value of `reward`, mountains/impassable terrain a value of `None`, and every other cell the cost of its terrain multiplied by the `weight` parameter. In `value_iteration`, this weight is set to 5. **Used by**: [value_iteration](#value_iteration).

* **reward**: the reward for the goal
* **goal**: the goal coordinates
* **costs**: a dict of costs for each type of cell in `world`
* **world**: the world matrix
* **weight**: a weight to multiply `costs`

**returns** `List[List]`: an initialized `R` matrix

In [6]:
def init_rewards(reward, goal, costs, world, weight):
    rows, cols = len(world), len(world[0])
    R = [[0 for x in range(cols)] for y in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if (x, y) == goal:
                R[y][x] = reward
            elif world[y][x] in costs.keys():
                R[y][x] = costs[world[y][x]] * weight
            else:
                R[y][x] = None
    return R

In [7]:
# assertions/unit tests
test = [['.', '.', '.', '.', '.', '.'], 
        ['.', '*', '*', '*', '*', '.'], 
        ['.', '*', '*', '*', '*', '.'], 
        ['.', '*', '*', 'x', '*', '.'], 
        ['.', '*', '*', '*', '*', '.'], 
        ['.', '.', '.', '.', '.', '.'], 
        ['.', '.', '.', '.', '.', '.']]
costs = { '.': -1, '*': -3, '^': -5, '~': -7}
reward = 100
goal = (5, 6)
R = init_rewards(reward, goal, costs, test, 1)
assert R[0][0] == -1
assert R[6][5] == 100
assert R[3][3] is None

## bellman

`bellman` applies the Bellman equation to `state` using the `R` and `V_last` matrices, as well as a list of actions and a discount rate `gamma`. The equation is:
$$ Q[s, a] = R[s, a] + \gamma * \sum_{s'} T[s, a, s'] * V_{last}[s'] $$

In this implementation, the reward is read from `R` and the transition model gives the intended `action` a probability of 70%, with the remaining 30% split equally among the other legal actions (those that do not leave the map or enter impassable terrain). The function returns the resulting `Q` value when the current state is passable and the intended action is legal; otherwise it returns `None`. **Used by**: [value_iteration](#value_iteration)

* **state**: a tuple of (x, y, action)
* **R**: the reward matrix
* **V_last**: the value matrix from the previous iteration
* **actions**: a list of all possible actions
* **gamma**: the discount rate
* **rows**: the number of rows in `world`
* **cols**: the number of columns in `world`

**returns** `Float` or `None`: a value for the quality of the action or `None` for impassable terrain
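As a worked instance of the equation, consider state (0, 0) with intended action "down" (0, 1), $\gamma = 0.9$, $R = -1$, $V_{last} = 4$ for the cell below and $V_{last} = 5$ for the cell to the right. These are the values used by the first unit test for this function. "Up" and "left" leave the map, so "right" is the only legal alternative and, under this implementation's redistribution rule, it receives the full remaining 0.3:

$$ Q[(0,0), \text{down}] = -1 + 0.9 \cdot (0.7 \cdot 4 + 0.3 \cdot 5) = -1 + 0.9 \cdot 4.3 = 2.87 $$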

In [8]:
def bellman(state, R, V_last, actions, gamma, rows, cols):
    (x, y, action), possible_actions = state, []
    r_sa = R[y][x]
    # Compare against None explicitly so a legitimate reward of 0 is not treated as impassable.
    if r_sa is not None:
        (x_p, y_p) = (x + action[0], y + action[1])
        if 0 <= x_p < cols and 0 <= y_p < rows and V_last[y_p][x_p] is not None:
            # The intended move succeeds with probability 0.7.
            res = r_sa + gamma*V_last[y_p][x_p]*0.7
            for surprise_action in actions:
                (x_p, y_p) = (x + surprise_action[0], y + surprise_action[1])
                if surprise_action != action and 0 <= x_p < cols and 0 <= y_p < rows and V_last[y_p][x_p] is not None:
                    possible_actions.append((surprise_action, x_p, y_p))
            # The remaining 0.3 is split equally among the other legal moves.
            for (possible_action, x_p, y_p) in possible_actions:
                res += gamma*V_last[y_p][x_p]*(0.3 / len(possible_actions))
            return res
    return None

In [9]:
# assertions/unit tests
test = [['.', '.', '.', '.', '.', '.'], 
        ['.', '*', '*', '*', '*', '.'], 
        ['.', '*', '*', '*', '*', '.'], 
        ['.', '*', '*', 'x', '*', '.'], 
        ['.', '*', '*', '*', '*', '.'], 
        ['.', '.', '.', '.', '.', '.'], 
        ['.', '.', '.', '.', '.', '.']]
V_last = [[5, 5, 5, 5, 5, 5], 
          [4, 4, 4, 4, 4, 4],
          [3, 3, 3, 3, 3, 3], 
          [2, 2, 2, 2, 2, 2], 
          [1, 1, 1, 1, 1, 1], 
          [-1, -1, -1, -1, -1, -1], 
          [0, 0, 0, 0, 0, 0]]
R = [[-1, -1, -1, -1, -1, -1], 
     [-1, -3, -3, -3, -3, -1], 
     [-1, -3, -3, -3, -3, -1], 
     [-1, -3, -3, None, -3, -1], 
     [-1, -3, -3, -3, -3, -1], 
     [-1, -1, -1, -1, -1, -1], 
     [-1, -1, -1, -1, -1, -1]]
q = bellman((0, 0, (0, 1)), R, V_last, cardinal_moves, 0.9, len(test), len(test[0]))
assert abs(q - 2.87) < 1e-9  # compare floats with a tolerance rather than ==

q = bellman((0, 0, (0, -1)), R, V_last, cardinal_moves, 0.9, len(test), len(test[0]))
assert q is None

q = bellman((3, 5, (0, 1)), R, V_last, cardinal_moves, 0.9, len(test), len(test[0]))
assert abs(q + 1.09) < 1e-9

## argmax

`argmax` takes a `Q` dictionary and a set of x-y coordinates, as well as a list of actions, and returns the action with the best `Q` value from the given x-y coordinates. Ties go to the action that appears first in `actions`. If no such action exists, the function returns (0, 0) to signify that the current state is impassable terrain. **Used by**: [value_iteration](#value_iteration).

* **Q**: the Q dictionary of states and actions
* **x**: the x coordinate of the current state
* **y**: the y coordinate of the current state
* **actions**: a list of all possible actions

**returns** `Tuple`: a tuple of the best move to take, or (0, 0)

In [10]:
def argmax(Q, x, y, actions):
    # Start from -inf so that very negative Q values can still win.
    max_reward, policy = float("-inf"), (0, 0)
    for action in actions:
        curr_reward = Q.get((x, y, action))
        # Compare against None explicitly so a legitimate Q value of 0 is not skipped.
        if curr_reward is not None and curr_reward > max_reward:
            max_reward, policy = curr_reward, action
    return policy

In [11]:
# assertions/unit tests
Q = {(0, 0, (0, 1)): 10, (0, 0, (1, 0)): 5}
policy = argmax(Q, 0, 0, cardinal_moves)
assert policy == (0, 1)

Q = {(5, 5, (0, 1)): 5, (5, 5, (1, 0)): 5, (5, 5, (-1, 0)): 5, (5, 5, (0, -1)): 5}
policy = argmax(Q, 5, 5, cardinal_moves)
assert policy == (0, -1)

Q = {(5, 5, (0, 1)): 5, (5, 5, (1, 0)): 5, (5, 5, (-1, 0)): 5, (5, 5, (0, -1)): 5}
policy = argmax(Q, 0, 0, cardinal_moves)
assert policy == (0, 0)

## compute_err

`compute_err` finds the max element-wise difference between `V` and `V_last` to determine when to stop iterating in `value_iteration`. The function compares all non-`None` cells and finds the max difference between the two to return as the max error. **Used by**: [value_iteration](#value_iteration).

* **V**: the current value matrix
* **V_last**: the previous value matrix

**returns** `Float`: a float of the max element-wise difference

In [12]:
def compute_err(V, V_last):
    max_err = 0
    for y in range(len(V)):
        for x in range(len(V[y])):
            if V[y][x] is not None and V_last[y][x] is not None:
                max_err = max(max_err, abs(V[y][x] - V_last[y][x]))
    return max_err

In [13]:
# assertions/unit tests
m1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
m2 = [[1, 2, 3], [6, 5, 4], [7, 8, 9]]
err = compute_err(m1, m2)
assert err == 2

m1 = [[100, 101, 102], [103, 104, 105], [106, 107, 108]]
m2 = [[100, 0, 102], [103, 104, 105], [101, 102, 103]]
err = compute_err(m1, m2)
assert err == 101

m1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
m2 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
err = compute_err(m1, m2)
assert err == 0

## value_iteration

`value_iteration` implements the Value Iteration algorithm for solving reinforcement learning problems using the Bellman equation (see [bellman](#bellman)). The algorithm starts with a world map, a dictionary of costs, a goal state, a reward for reaching the goal, a list of actions, and a discount rate, and creates a policy which is iteratively improved.

The algorithm utilizes four different structures: the policy dictionary (`Pi`), which keeps track of the current iteration's best move for each cell; the reward matrix (`R`), which maps each cell in the world to a static payoff/reward using a weight of 5; the value matrices (`V` and `V_last`), which keep track of the value/utility of each cell in the current and previous iterations; and the Q dictionary (`Q`), which maps a state and action tuple (`x`, `y`, `action`) to a possible payoff. In each iteration, the algorithm decides which action is best for each state by considering the previous iteration's `V` matrix and computing rewards for potential successors using the Bellman equation. The algorithm iterates until the maximum element-wise difference between two iterations' value matrices is no greater than a desired $\epsilon$ value, which is set at $1 \times 10^{-16}$. The function returns the policy dictionary, which gives the best move for each cell in `world` to reach `goal`.

**Uses**: [init_rewards](#init_rewards), [bellman](#bellman), [argmax](#argmax), [compute_err](#compute_err)

* **world**: the world map as a list of lists
* **costs**: the costs dictionary for each cell type in `world`
* **goal**: the goal coordinates as a tuple (x, y)
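* **reward**: the reward for reaching the goal
* **actions**: a list of possible actions, as (dx, dy) offsets
* **gamma**: the discount rate used in the Bellman Equation

**returns**: `Dict[Tuple, Tuple]`: a policy dictionary mapping each (x, y) state to its best action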
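The stopping rule can be seen in isolation on a much smaller problem. The sketch below is not this notebook's implementation: it assumes a deterministic 1-D corridor, a looser $\epsilon$ of $10^{-6}$, and the hypothetical name `tiny_value_iteration`. It keeps the same loop shape: copy `V`, back up every state from `V_last`, and stop when the largest change falls below $\epsilon$ or an iteration cap is reached.

```python
def tiny_value_iteration(rewards, gamma=0.9, epsilon=1e-6, limit=1000):
    """Deterministic value iteration on a 1-D corridor (illustration only)."""
    n = len(rewards)
    V = [0.0] * n
    for t in range(1, limit + 1):
        V_last = V[:]  # values from the previous sweep
        for s in range(n):
            # The agent may step left or right, staying inside the corridor.
            neighbors = [s2 for s2 in (s - 1, s + 1) if 0 <= s2 < n]
            V[s] = rewards[s] + gamma * max(V_last[s2] for s2 in neighbors)
        # Largest element-wise change between two consecutive sweeps.
        err = max(abs(a - b) for a, b in zip(V, V_last))
        if err < epsilon:
            return V, t
    return V, limit

# Three ordinary cells with a step cost of -1 and a rewarding cell at the end.
V, sweeps = tiny_value_iteration([-1, -1, -1, 10])
print([round(v, 2) for v in V], sweeps)
```

Because each sweep shrinks the error by roughly a factor of `gamma`, a smaller $\epsilon$ or a `gamma` closer to 1 means more sweeps before the loop stops.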

In [14]:
def value_iteration(world, costs, goal, reward, actions, gamma):
    t, err, epsilon, limit, rows, cols = 0, 1, 1e-16, 3000, len(world), len(world[0])
    V = [[0 for x in range(cols)] for y in range(rows)]
    Q, Pi, R = {}, {}, init_rewards(reward, goal, costs, world, 5)
    # Stop on convergence, or after `limit` sweeps as a guard against an endless loop.
    while err > epsilon and t < limit:
        t += 1
        V_last = deepcopy(V)
        for y, row in enumerate(world):
            for x, cell in enumerate(row):
                for action in actions:
                    Q[(x, y, action)] = bellman((x, y, action), R, V_last, actions, gamma, rows, cols)
                Pi[(x, y)] = argmax(Q, x, y, actions)
                # (0, 0) marks a cell with no usable action, so it has no value.
                V[y][x] = Q[(x, y, Pi[(x, y)])] if Pi[(x, y)] != (0, 0) else None
        err = compute_err(V, V_last)
    return Pi

## pretty_print_policy

`pretty_print_policy` takes a policy dictionary, dimensions for the `world`, and a `goal` state, and prints a "prettier" version of the policy. For each coordinate pair in `world`, the function looks up the best move to take from the `policy` and represents it in ASCII. The function also replaces mountains with "x" and the goal state with "G".

**returns** `None`: the function prints the pretty policy as a side-effect

In [15]:
def pretty_print_policy(cols, rows, policy, goal):
    pretty_policy = [["0" for x in range(cols)] for y in range(rows)]
    move_lookup = {(0, 1): "v", (1, 0): ">", (0, -1): "^", (-1, 0): "<"}
    for y in range(rows):
        for x in range(cols):
            pretty_policy[y][x] = "x" if (policy[(x, y)] == (0, 0)) else move_lookup[policy[(x, y)]]
    pretty_policy[goal[1]][goal[0]] = "G"
    for row in pretty_policy:
        print("".join(row))
    return None

## Value Iteration

### Small World

In [16]:
small_world = read_world( "small.txt")

In [17]:
goal = (len(small_world[0])-1, len(small_world)-1)
gamma = 0.9
reward = 150

small_policy = value_iteration(small_world, costs, goal, reward, cardinal_moves, gamma)

In [18]:
cols = len(small_world[0])
rows = len(small_world)

pretty_print_policy(cols, rows, small_policy, goal)

v>>>>v
vvv>vv
vvv>vv
vvvxvv
vvvvvv
>>>>vv
>>>>>G


### Large World

In [19]:
large_world = read_world( "large.txt")

In [20]:
goal = (len(large_world[0])-1, len(large_world)-1) # lower right corner
gamma = 0.9
reward = 100000

large_policy = value_iteration(large_world, costs, goal, reward, cardinal_moves, gamma)

In [21]:
cols = len(large_world[0])
rows = len(large_world)

pretty_print_policy( cols, rows, large_policy, goal)

v>>>v>>>>>>>>>>vv<>>>>>>>vv
v>>>>>>>>>>>>>vvv<xxxxxxxvv
vv^^xx>>>>>>>>>vvxxxvv<xxvv
vvv^<xxx>>>>>>>>>>>vvv<xxvv
v<v<<xxvv>>>>>>>>>>vvvxxxvv
vvv<xxvvv>>>>>>>>>>>vvvxvvv
vvvxx>vvvv^^xxx>>>>>>v>vvvv
vvv>>>vvvv<^<<xxx>>>>>vvvvv
v>>>>>vvv<<<<<xx>>>>>>>vvvv
>>>>>>vvv<<xxxx>>>>>>>>vvvv
>>>>>vvvv<xxx>>>>>vvxxxvvvv
>>>>>vvvvxxv>>>>>>vvvxxvvvv
>>>>>>vvvxxv>>>>>>>vvx>vvvv
>>>>>>v>>vvv>>>>>>>>>>>vvvv
>>>^x>v>>vvv<>>>>>>>>^xvvvv
vv^xxx>>>vvvxxx>>>>^^xxvvvv
vvxx>>>>>>>>vvxxx^^xxxvvvvv
vvvxx>>>>>>>>>vvxxxx>>vvvvv
vvvxxx>>>>>>>>>>>>>>>>vvvvv
vvvvxxx>>>>>>>>>>>>>>>vvvvv
vvvvvvxx>>>>^^x>>>>>>>vvvvv
>>>vvvvxxx^^xx>>>>>>>>vvvvv
>>>>>>>vvxxxx>>>>>>>>>>vvvv
>>>>>>>>>>>>>>>^^xx>>>>vvvv
^x>>>>>>>vvxxx^^xxvxx>>vvvv
vxxx>>>>>>>vxxxx>>>vxxx>>vv
>>>>>>>>>>>>>>>>>>>>>>>>>>G


In [None]:
world_1 = [
    ['.', '.', '.', '.'],
    ['.', '~', '.', '.'],
    ['*', '.', '.', '.'],
    ['.', '.', '.', '.']
]

goal = (len(world_1) - 1, len(world_1) - 1)
gamma = 0.9
reward = 500
policy = value_iteration(world_1, costs, goal, reward, cardinal_moves, gamma)

cols = len(world_1)
rows = len(world_1)
pretty_print_policy(cols, rows, policy, goal)

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.