# Using Bellman Equations 
## To find state values of an MDP

In this homework, we explore the basics of reinforcement learning (RL).

We study a simple RL agent in a grid world and use the Bellman equation to compute the value of each state under a fixed policy.

Consider the 5x5 grid below. We will use it to illustrate value functions for a simple finite MDP. 
- The cells of the grid correspond to the states of the environment. 
- At each cell, four actions are possible: 
    - north, south, east, and west
- Each action deterministically moves the agent one cell in the corresponding direction on the grid.
- Actions that would take the agent off the grid leave its location unchanged and yield a reward of -1.
- All other actions yield a reward of 0, except those taken from the special states A and B.
- From state A, all four actions yield a reward of +10 and take the agent to A'.
- From state B, all four actions yield a reward of +5 and take the agent to B'.

<img src="attachment:image.png" width="300">

You can assume that:
- The agent selects all four actions with equal probability in all states.
- Future rewards are discounted by a factor of 0.9.

> Hint: If the agent is in a grid corner and moves in a random direction, what is the expected immediate reward?

> Hint: What should the state values be for the cells near A and B?
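
To make the first hint concrete, the sketch below counts how many of the four moves leave the grid from a given cell and averages the resulting rewards. The names `GRID_SIZE`, `MOVES`, and `expected_boundary_reward` are hypothetical and exist only for this illustration; the sketch deliberately ignores the special states A and B.

```python
GRID_SIZE = 5  # assumed: matches the 5x5 grid described above
MOVES = [(0, -1), (-1, 0), (0, 1), (1, 0)]  # left, up, right, down as (row, col) offsets


def expected_boundary_reward(row, col):
    """Expected immediate reward at an ordinary cell under the uniform random policy."""
    total = 0.0
    for d_row, d_col in MOVES:
        r, c = row + d_row, col + d_col
        off_grid = r < 0 or r >= GRID_SIZE or c < 0 or c >= GRID_SIZE
        # Each move has probability 0.25; off-grid moves cost -1, all others 0.
        total += 0.25 * (-1.0 if off_grid else 0.0)
    return total


print(expected_boundary_reward(0, 0))  # corner cell   -> -0.5
print(expected_boundary_reward(2, 0))  # edge cell     -> -0.25
print(expected_boundary_reward(2, 2))  # interior cell -> 0.0
```

These are only the immediate rewards; the state values also include the discounted value of wherever the agent lands.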

#### Note: Please do not modify any pre-defined variables. Doing so can affect the autograder results.

In [None]:
# importing required libraries

import numpy as np

In [None]:
# required constants

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9

# actions as [row, col] offsets: left, up, right, down
ACTIONS = [np.array([0, -1]),
           np.array([-1, 0]),
           np.array([0, 1]),
           np.array([1, 0])]

ACTION_PROB = 0.25

Complete the `step` function below.

In [None]:
def step(state, action):
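    """
    Compute the next grid state and the reward, given the current state and action.

    `state` is a [row, col] list and `action` is one of the offsets in ACTIONS.
    Returns a (next_state, reward) tuple.
    """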
    reward = None
    next_state = None
    
    if state == A_POS:
        reward = 10
        next_state = A_PRIME_POS
    elif state == B_POS:
        # START CODING HERE
        
        
        # your code here
        

        
        # END CODING HERE
    else:
        next_state = (np.array(state) + action).tolist()
        x, y = next_state
        if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
            reward = -1.0
            next_state = state
        else:
            # START CODING HERE
            # your code here
            
            # END CODING HERE
            
    return next_state, reward

In [None]:
assert step(B_POS, ACTIONS[2]) == (B_PRIME_POS, 5), "What should the next state and reward be when the agent is at B_POS and takes any action?"
assert step([0, 0], ACTIONS[0]) == ([0, 0], -1.0), "What should the next state and reward be when the agent is at (0, 0) and goes left?"
assert step([0, 0], ACTIONS[2]) == ([0, 1], 0.0), "What should the next state and reward be when the agent is at (0, 0) and goes right (to state A)?"

Complete the `bellman_update` function below.
> Hint: Use the Bellman update formula.
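
For reference, one common way to write the update is shown below. The notation is an assumption of this note rather than something defined in the notebook: $\pi(a \mid s)$ is the policy (0.25 for every action, i.e. `ACTION_PROB`), $r(s, a)$ and $s'$ are the reward and next state returned by `step`, $\gamma$ is `DISCOUNT`, and $v_k$ is the current value estimate.

$$
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \left[ r(s, a) + \gamma \, v_k(s') \right]
$$

Because the transitions here are deterministic, each action contributes exactly one term to the sum.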

In [None]:
def bellman_update(value):
    new_value = np.zeros_like(value)

    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            for action in ACTIONS:

                # compute next state and reward given current state and action
                (next_i, next_j), reward = step([i, j], action)

                # accumulate this action's contribution to `new_value[i, j]` using the Bellman equation
                # START CODING HERE
                
                # your code here
                
                
                
                # END CODING HERE

    return new_value

In [None]:
# initially, the value of every grid cell is zero
value_0 = np.zeros((WORLD_SIZE, WORLD_SIZE))

# first iteration
value_1 = bellman_update(value_0)

In [None]:
# sanity check: expected state values after the first Bellman update
ideal_value_1 = np.array([
    [-0.5 , 10.  , -0.25,  5.  , -0.5 ],
    [-0.25,  0.  ,  0.  ,  0.  , -0.25],
    [-0.25,  0.  ,  0.  ,  0.  , -0.25],
    [-0.25,  0.  ,  0.  ,  0.  , -0.25],
    [-0.5 , -0.25, -0.25, -0.25, -0.5 ]
])

In [None]:
assert np.allclose(value_1, ideal_value_1, rtol=1e-4), "This is a sanity check. If it fails, revisit your bellman_update function."
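
A single update is only the first step: repeating it until the values stop changing yields the state values of the random policy. The sketch below shows that loop. The helper `iterate_until_stable` and its `tol` and `max_iters` parameters are hypothetical, not part of this notebook, and the demonstration uses a toy update rather than the graded `bellman_update`.

```python
import numpy as np


def iterate_until_stable(update, value, tol=1e-4, max_iters=10_000):
    """Apply `update` repeatedly until the largest change in any entry is below `tol`."""
    for _ in range(max_iters):
        new_value = update(value)
        if np.max(np.abs(new_value - value)) < tol:
            return new_value
        value = new_value
    return value


# Toy stand-in for a Bellman-style update: v <- 1 + 0.9 * v has the fixed point 10.
def toy_update(v):
    return 1.0 + 0.9 * v


result = iterate_until_stable(toy_update, np.zeros((2, 2)))
print(np.round(result, 2))
```

With a completed `bellman_update`, the analogous call would be `iterate_until_stable(bellman_update, value_0)`.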

In [None]:
# hidden test cases 


In [None]:
# hidden test cases 


In [None]:
# hidden test cases 
