<a href="https://colab.research.google.com/github/constructor-s/aps1080_winter_2021/blob/main/A1/A1_FrozenLake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import gym

In [2]:
gym.envs.register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.74
)

In [3]:
# Create the gridworld-like environment
env=gym.make('FrozenLakeNotSlippery-v0')
# Let's look at the model of the environment (i.e., P):
env.env.P
# Question: what is the data in this structure saying? Relate this to the course
# presentation of P

{0: {0: [(1.0, 0, 0.0, False)],
  1: [(1.0, 4, 0.0, False)],
  2: [(1.0, 1, 0.0, False)],
  3: [(1.0, 0, 0.0, False)]},
 1: {0: [(1.0, 0, 0.0, False)],
  1: [(1.0, 5, 0.0, True)],
  2: [(1.0, 2, 0.0, False)],
  3: [(1.0, 1, 0.0, False)]},
 2: {0: [(1.0, 1, 0.0, False)],
  1: [(1.0, 6, 0.0, False)],
  2: [(1.0, 3, 0.0, False)],
  3: [(1.0, 2, 0.0, False)]},
 3: {0: [(1.0, 2, 0.0, False)],
  1: [(1.0, 7, 0.0, True)],
  2: [(1.0, 3, 0.0, False)],
  3: [(1.0, 3, 0.0, False)]},
 4: {0: [(1.0, 4, 0.0, False)],
  1: [(1.0, 8, 0.0, False)],
  2: [(1.0, 5, 0.0, True)],
  3: [(1.0, 0, 0.0, False)]},
 5: {0: [(1.0, 5, 0, True)],
  1: [(1.0, 5, 0, True)],
  2: [(1.0, 5, 0, True)],
  3: [(1.0, 5, 0, True)]},
 6: {0: [(1.0, 5, 0.0, True)],
  1: [(1.0, 10, 0.0, False)],
  2: [(1.0, 7, 0.0, True)],
  3: [(1.0, 2, 0.0, False)]},
 7: {0: [(1.0, 7, 0, True)],
  1: [(1.0, 7, 0, True)],
  2: [(1.0, 7, 0, True)],
  3: [(1.0, 7, 0, True)]},
 8: {0: [(1.0, 8, 0.0, False)],
  1: [(1.0, 12, 0.0, True)],
  2: [(

---------
The data in this structure represents the *dynamics function $p$*: a mapping from (state index, action) pairs to a list of possible outcomes. Actions are the integers 0 to 3, corresponding to $\leftarrow$, $\downarrow$, $\rightarrow$, $\uparrow$. Each outcome is a tuple containing:

1. The probability of the transition (this environment is deterministic, so it is always 1.0)

1. The next state (index)

1. The reward (1.0 for stepping onto the goal, state 15; 0.0 otherwise)

1. Whether the next state terminates the episode (`True`: hole or goal) or not (`False`: start or frozen surface)
---------
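
To make the tuple layout concrete, here is a minimal sketch that walks a hand-copied excerpt of the table (state 0 only, taken from the output above) instead of a live `env`. The names `P_excerpt`, `ACTION_NAMES`, and `describe` are illustrative and not part of gym.

```python
# Excerpt of env.env.P for state 0, copied from the output above.
# Layout: P[state][action] -> list of (probability, next_state, reward, done)
P_excerpt = {
    0: {
        0: [(1.0, 0, 0.0, False)],
        1: [(1.0, 4, 0.0, False)],
        2: [(1.0, 1, 0.0, False)],
        3: [(1.0, 0, 0.0, False)],
    }
}

ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]  # action index -> name


def describe(P, state, action):
    """Return one readable line per possible outcome of (state, action)."""
    lines = []
    for prob, next_state, reward, done in P[state][action]:
        lines.append(
            f"s={state}, a={ACTION_NAMES[action]}: p={prob} -> "
            f"s'={next_state}, r={reward}, terminal={done}"
        )
    return lines


for a in range(4):
    print("\n".join(describe(P_excerpt, 0, a)))
```

In the slippery variant each list would hold several tuples whose probabilities sum to 1, which is why every entry is a list and not a single tuple.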

In [4]:
# Now let's investigate the observation space (i.e., S using our nomenclature),
# and confirm we see it is a discrete space with 16 locations
print(env.observation_space)

Discrete(16)


In [5]:
stateSpaceSize = env.observation_space.n
print(stateSpaceSize)

16


In [6]:
# Now let's investigate the action space (i.e., A) for the agent->environment
# channel
print(env.action_space)

Discrete(4)


In [7]:
# The gym environment has ...sample() functions that allow us to sample
# from the above spaces:
for g in range(1,10,1):
  print("sample from S:",env.observation_space.sample()," ... ","sample from A:",env.action_space.sample())



sample from S: 0  ...  sample from A: 0
sample from S: 1  ...  sample from A: 3
sample from S: 12  ...  sample from A: 0
sample from S: 2  ...  sample from A: 3
sample from S: 4  ...  sample from A: 1
sample from S: 12  ...  sample from A: 1
sample from S: 7  ...  sample from A: 3
sample from S: 5  ...  sample from A: 1
sample from S: 12  ...  sample from A: 2


In [8]:
# The environment also provides a helper to render (visualize) the environment
env.reset()
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


In [9]:
# We can act as the agent, by selecting actions and stepping the environment
# through time to see its responses to our actions
env.reset()
while True:
  env.render()
  # Valid actions are 0 .. n-1
  print("Enter the action as an integer from 0 to",env.action_space.n-1,"(or exit):")
  userInput=input()
  if userInput=="exit":
    break
  action=int(userInput)
  # env.step returns (observation, reward, done, info)
  (observation, reward, done, info) = env.step(action)
  print("--> The result of taking action",action,"is:")
  print("     S=",observation)
  print("     R=",reward)
  print("     p=",info)

env.render()



[41mS[0mFFF
FHFH
FFFH
HFFG
Enter the action as an integer from 0 to 3 (or exit):
exit


In [10]:
# Question: draw a table indicating the correspondence between the action
# you input (a number) and the logical action performed.
# Question: draw a table that illustrates what the symbols on the render image
# mean?
# Question: Explain what the objective of the agent is in this environment?

------------

| Index | Action |
| ----------- | ----------- |
| `0` | LEFT |
| `1` | DOWN | 
| `2` | RIGHT | 
| `3` | UP | 

------------

| Character | State |
| ----------- | ----------- |
| `S` | starting point, safe |
| `F` | frozen surface, safe |
| `H` | hole, fall to your doom |
| `G` | goal, where the frisbee is located |

------------

The objective of the agent is to travel from the start (`S`) to the goal (`G`) state without falling into a hole (`H`).

------------
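
The link between the rendered map and the state indices used in `P` can be sketched as follows. This assumes row-major numbering, which is consistent with the table above (moving down from state 0 leads to state 4); `MAP_ROWS`, `state_index`, and `states_with` are illustrative names.

```python
# The 4x4 map as rendered above, one string per row.
MAP_ROWS = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(MAP_ROWS[0])


def state_index(row, col):
    """Row-major state index (assumed to match the numbering used in P)."""
    return row * N_COLS + col


def states_with(letter):
    """All state indices whose map cell equals `letter`."""
    return [
        state_index(r, c)
        for r, row in enumerate(MAP_ROWS)
        for c, cell in enumerate(row)
        if cell == letter
    ]


print("start:", states_with("S"))  # start: [0]
print("holes:", states_with("H"))  # holes: [5, 7, 11, 12]
print("goal: ", states_with("G"))  # goal:  [15]
```

The hole and goal indices are exactly the next states that carry `True` in the fourth tuple element of `P`.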

In [11]:
# Practical: Code up an AI that will employ random action selection in order
# to drive the agent. Test this random action selection agent with the
# above environment (i.e., code up a loop as I did above, but instead
# of taking input from a human user, take it from the AI you coded).

In [14]:
#%% An example AI that takes random moves and renders the result at each step
def play_agent(env, policy, terminate_after_n=5):
    """
    Test-drive an agent against the environment.
    Stops once the observed state has repeated `terminate_after_n` times in a row.

    Parameters
    ----------
    env : Env
        Gym environment
    policy : callable
        A function that takes `env` as input and returns a valid action. 
        Assumes the policy only returns one action with 100% probability.
    terminate_after_n : int
        Number of consecutive repeats of the same state before stopping
    """
    env.reset()
    previous_observation = None
    repeats = 0

    env.render()
    while True:
        # Ask the policy for the next action
        action = policy(env)
        # env.step returns (observation, reward, done, info)
        (observation, reward, done, info) = env.step(action)
        print("--> The result of taking action",action,"is:")
        print("     S=",observation)
        print("     R=",reward)
        print("     p=",info)
        env.render()

        # Terminate if stuck in a state
        if observation == previous_observation:
            repeats += 1
            if repeats >= terminate_after_n:
                break
        else:
            repeats = 0
            previous_observation = observation

env.seed(5)
policy = lambda env: env.action_space.sample()
play_agent(env, policy)


[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 4
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 8
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
--> The result of taking action 2 is:
     S= 9
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
--> The result of taking action 2 is:
     S= 10
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FF[41mF[0mH
HFFG
--> The result of taking action 2 is:
     S= 11
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFF[41mH[0m
HFFG
--> The result of taking action 3 is:
     S= 11
     R= 0
     p= {'prob': 1.0}
  (Up)
SFFF
FHFH
FFF[41mH[0m
HFFG
--> The result of taking action 1 is:
     S= 11
     R= 0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
FFF[41mH[0m
HFFG
--> The result of taking action 1 is:
     S= 11
     R= 0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
FFF[41mH[0m
HFFG
--> The re

In [15]:
# Now towards dynamic programming. Note that env.env.P has the model
# of the environment.
#
# Question: How would you represent the agent's policy function and value function?
# Practical: revise the above AI solver to use a policy function in which you
# code the random action selections in the policy function. Test this.
# Practical: Code the C-4 Policy Evaluation (Prediction) algorithm. You may use
# either the inplace or ping-pong buffer (as described in the lecture). Now
# randomly initialize your policy function, and compute its value function.
# Report your results: policy and value function. Ensure your prediction
# algo reports how many iterations it took.
#
# (Optional): Repeat the above for q.
#
# Policy Improvement:
# Question: How would you use P and your value function to improve an arbitrary
# policy, pi, per Chapter 4?
# Practical: Code the policy iteration process, and employ it to arrive at a
# policy that solves this problem. Show your testing results, and ensure
# it reports the number of iterations for each step: (a) overall policy
# iteration steps and (b) evaluation steps.
# Practical: Code the value iteration process, and employ it to arrive at a
# policy that solves this problem. Show your testing results, reporting
# the iteration counts.
# Comment on the difference between the iterations required for policy vs
# value iteration.
#
# Optional: instead of the above environment, use the "slippery" Frozen Lake via
# env = gym.make("FrozenLake-v0")

----------

The agent's policy function can be implemented as an $N$-element lookup table (array) of integer action indices, where $N$ is the number of states and entry $s$ holds the action chosen with probability 1 in state $s$.

Similarly, the value function can be implemented as an $N$-element array lookup table of float values. 

----------
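
A minimal sketch of these two lookup tables, using plain Python lists so that it runs without any dependency. The one-hot helper `as_stochastic` and the uniform table `pi_uniform` are assumptions added for illustration: they show how the same idea extends to a policy that assigns a probability to every (state, action) pair.

```python
N_STATES, N_ACTIONS = 16, 4

# Deterministic policy: pi[s] is the single action taken in state s.
pi = [0] * N_STATES
# Value function: V[s] is the estimated return from state s.
V = [0.0] * N_STATES

# Illustrative extension: a stochastic policy stores one probability per
# (state, action) pair. This one is the uniform random policy.
pi_uniform = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STATES)]


def as_stochastic(pi_det):
    """One-hot encode a deterministic policy as an S x A probability table."""
    return [
        [1.0 if a == pi_det[s] else 0.0 for a in range(N_ACTIONS)]
        for s in range(len(pi_det))
    ]


pi[0] = 1  # e.g. choose DOWN in the start state
print(as_stochastic(pi)[0])  # [0.0, 1.0, 0.0, 0.0]
```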

In [16]:
# Practical: revise the above AI solver to use a policy function in which you
# code the random action selections in the policy function. Test this.

manual_policy = [
    1, 2, 1, 0, 
    1, 0, 1, 0, 
    2, 1, 1, 0, 
    0, 2, 2, 0
]
policy = lambda env: manual_policy[env.env.s]
play_agent(env, policy)


[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 4
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 8
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
--> The result of taking action 2 is:
     S= 9
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
--> The result of taking action 1 is:
     S= 13
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
--> The result of taking action 2 is:
     S= 14
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
--> The result of taking action 2 is:
     S= 15
     R= 1.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 0 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Left)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 0 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Left)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> Th

In [17]:
# Practical: Code the C-4 Policy Evaluation (Prediction) algorithm. You may use
# either the inplace or ping-pong buffer (as described in the lecture). Now
# randomly initialize your policy function, and compute its value function.
# Report your results: policy and value function. Ensure your prediction
# algo reports how many iterations it took.

import numpy as np

def policy_evaluation(policy, theta, gamma, env, V=None, print_n_iter=False):
    r"""
    Iterative policy evaluation for estimating $V \approx v_\pi$ (page 75).

    Parameters
    ----------
    policy : array
        An array indexed by state that maps each state to a valid action.
        Assumes the policy selects that action with probability 1.

    theta : float
        a small threshold theta > 0 determining accuracy of estimation

    gamma : float
        reward discount rate

    env : Env
        Gym environment

    V : array
        current value function table. If None, initialize arbitrarily to all zeros.

    print_n_iter : bool
        print the number of iterations taken

    Returns
    -----------
    V : array
        value function table
    """

    # inplace buffer
    # Initialize V (s), for all s in S+, arbitrarily except that V (terminal) = 0
    S = env.observation_space.n
    if V is None:
        V = np.zeros(S)

    n_iter = 0
    while True:
        n_iter += 1

        delta = 0
        for s in range(S):
            v = V[s]

            # Simplified logic for the simple case in our environment
            a = policy[s]
            V[s] = 0
            for p, s_, r, terminate in env.P[s][a]: # s_ is next state s'
                V[s] += p * (r + gamma * V[s_])
            
            delta = max(delta, abs(v - V[s]))
        
        if delta < theta:
            if print_n_iter:
                print(f"policy_evaluation converged in {n_iter} iterations")
            return V

def initialize_v(env, seed=None):
    # Initialize V(s) arbitrarily except that V(terminal) = 0
    S = env.observation_space.n
    V = np.random.RandomState(seed).rand(S)
    for i, letter in enumerate(env.desc.ravel()):
        if letter in b'GH':
            V[i] = 0
    return V

with np.printoptions(precision=3, suppress=True):
    print("Randomly initialized policy:")
    policy = np.random.RandomState(0).randint(0, 4, size=16) # Random policy
    print(policy.reshape(4, 4))
    print(np.array([["\u2190", "\u2193", "\u2192", "\u2191"][a] for a in policy]).reshape(4, 4))

    V_init = initialize_v(env, seed=0)
    # print(V_init.reshape(4, 4))
    V = policy_evaluation(policy, 1e-6, 0.9, env, V=V_init, print_n_iter=True)

    print("Final values:")
    print(V.reshape(4, 4))

Randomly initialized policy:
[[0 3 1 0]
 [3 3 3 3]
 [1 3 1 2]
 [0 3 2 0]]
[['←' '↑' '↓' '←']
 ['↑' '↑' '↑' '↑']
 ['↓' '↑' '↓' '→']
 ['←' '↑' '→' '←']]
policy_evaluation converged in 56 iterations
Final values:
[[0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.9 0. ]
 [0.  0.  1.  0. ]]


> Policy Improvement:
>
> Question: How would you use P and your value function to improve an arbitrary
> policy, pi, per Chapter 4?
--------------------

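Policy improvement makes the policy greedy with respect to the current value function: for each state $s$, use P to compute the one-step lookahead value $q(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V(s')]$ of every action, then set $\pi(s) = \arg\max_a q(s, a)$. The *policy iteration* approach interleaves policy evaluation and this improvement step until the policy no longer changes.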

--------------------
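
The improvement half of that loop can be sketched on a toy problem. The three-state chain `P_toy` below is a made-up MDP written in the same `P[s][a] -> [(prob, next_state, reward, done)]` layout as `env.env.P`; it is not the Frozen Lake model, and `q_values` and `greedy_policy` are illustrative names.

```python
# Toy chain: states 0 - 1 - 2, where state 2 is the terminal goal.
# Actions: 0 = left, 1 = right.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def q_values(P, V, s, gamma):
    """One-step lookahead value of every action in state s."""
    return [
        sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in P[s][a])
        for a in sorted(P[s])
    ]


def greedy_policy(P, V, gamma):
    """Pick, in every state, the action with the largest lookahead value."""
    policy = []
    for s in sorted(P):
        q = q_values(P, V, s, gamma)
        # First maximal index wins ties, like np.argmax.
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy


V_toy = [0.9, 1.0, 0.0]  # values of the "always go right" policy, gamma = 0.9
print(greedy_policy(P_toy, V_toy, gamma=0.9))  # [1, 1, 0]
```

In the terminal state both actions have lookahead value 0, so the tie-break rule alone decides the action there; the same happens for the hole and goal cells of the lake.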

In [18]:
# Practical: Code the policy iteration process, and employ it to arrive at a
# policy that solves this problem. Show your testing results, and ensure
# it reports the number of iterations for each step: (a) overall policy
# iteration steps and (b) evaluation steps.

REWARD = 2
def policy_iteration(theta, gamma, env, seed=None, print_n_iter=False):
    r"""
    Policy iteration (using iterative policy evaluation)
    for estimating $\pi \approx \pi_*$
    (page 80).

    Parameters
    ----------
    theta : float
        a small threshold theta > 0 determining accuracy of estimation
        used in policy evaluation

    gamma : float
        reward discount rate

    env : Env
        Gym environment

    seed : int
        Seed for random number generation

    print_n_iter : bool
        print the number of iterations taken

    Returns
    -----------
    V : array
        Value function table

    policy : array
        An array indexed by state that maps each state to a valid action.
        Assumes the policy selects that action with probability 1.
    """
    S = env.observation_space.n
    A = env.action_space.n

    # Initialization
    V = initialize_v(env, seed=seed)
    pi = np.random.RandomState(seed=seed).randint(0, A, size=S) # Random policy
    
    policy_iteration_iter_i = 0
    while True:
        policy_iteration_iter_i += 1
        if print_n_iter:
            print(f"Policy iteration: iteration {policy_iteration_iter_i}")

        # policy evaluation
        # print(V.reshape(4, 4))
        V = policy_evaluation(pi, theta, gamma, env, V=V, print_n_iter=print_n_iter)
        # print(V.reshape(4, 4))

        # policy improvement
        # print(pi)
        policy_stable = True
        for s in range(S):
            old_action = pi[s]

            values_after_action = np.zeros(A)
            for a in range(A):
                for p, s_, r, terminate in env.P[s][a]:
                    values_after_action[a] += p * (r + gamma * V[s_])
            pi[s] = np.argmax(values_after_action)

            if old_action != pi[s]:
                policy_stable = False
        # print(pi)
        if policy_stable:
            return V, pi

with np.printoptions(precision=3, suppress=True):
    V_star, pi_star = policy_iteration(1e-2, 0.9, env, seed=0, print_n_iter=True)
    print("Final value:")
    print(V_star.reshape(4, 4))
    print("Final policy:")
    print(pi_star.reshape(4, 4))
    print(np.array([["\u2190", "\u2193", "\u2192", "\u2191"][a] for a in pi_star]).reshape(4, 4))


Policy iteration: iteration 1
policy_evaluation converged in 12 iterations
Policy iteration: iteration 2
policy_evaluation converged in 3 iterations
Policy iteration: iteration 3
policy_evaluation converged in 2 iterations
Policy iteration: iteration 4
policy_evaluation converged in 2 iterations
Policy iteration: iteration 5
policy_evaluation converged in 2 iterations
Final value:
[[0.59  0.656 0.729 0.656]
 [0.656 0.    0.81  0.   ]
 [0.729 0.81  0.9   0.   ]
 [0.    0.9   1.    0.   ]]
Final policy:
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
[['↓' '→' '↓' '←']
 ['↓' '←' '↓' '←']
 ['→' '↓' '↓' '←']
 ['←' '→' '→' '←']]
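
As a sanity check on the final values above: in the non-slippery lake the only reward (1.0) arrives on the step that enters the goal, so along the greedy path the value of a state should be $\gamma^{k-1}$ when the goal is $k$ steps away. The sketch below recomputes this by hand; `value_from_steps` and `path` are illustrative names, and the path is read off the arrows in the final policy.

```python
GAMMA = 0.9


def value_from_steps(steps_to_goal, gamma=GAMMA):
    """Discounted return when the single reward of 1.0 arrives on the last step."""
    return gamma ** (steps_to_goal - 1)


# Greedy path from the start, read from the final policy: the goal (15) follows 14.
path = [0, 4, 8, 9, 13, 14]
values = {}
for i, s in enumerate(path):
    steps = len(path) - i  # steps still needed to reach the goal
    values[s] = round(value_from_steps(steps), 3)

print(values)
# {0: 0.59, 4: 0.656, 8: 0.729, 9: 0.81, 13: 0.9, 14: 1.0}
```

These numbers agree with the corresponding entries of the printed value table.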


In [19]:
print("Testing: ")
play_agent(env, lambda env: pi_star[env.env.s])

Testing: 

[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 4
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 8
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
--> The result of taking action 2 is:
     S= 9
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
--> The result of taking action 1 is:
     S= 13
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
--> The result of taking action 2 is:
     S= 14
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
--> The result of taking action 2 is:
     S= 15
     R= 1.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 0 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Left)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 0 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Left)
SFFF
FHFH
FFFH
HFF[41mG

In [20]:
# Practical: Code the value iteration process, and employ it to arrive at a
# policy that solves this problem. Show your testing results, reporting
# the iteration counts.

def value_iteration(theta, gamma, env, seed=None, print_n_iter=False):
    r"""
    Value iteration for estimating $\pi \approx \pi_*$
    (page 83).

    Parameters
    ----------
    theta : float
        a small threshold theta > 0 determining accuracy of estimation;
        sweeps stop once the largest value change falls below it

    gamma : float
        reward discount rate

    env : Env
        Gym environment

    seed : int
        Seed for random number generation

    print_n_iter : bool
        print the number of iterations taken

    Returns
    -----------
    V : array
        Value function table

    policy : array
        An array indexed by state that maps each state to a valid action.
        Assumes the policy selects that action with probability 1.
    """
    S = env.observation_space.n
    A = env.action_space.n

    # Initialization
    V = initialize_v(env, seed=seed)

    # Value iteration
    n_iter = 0
    while True:
        n_iter += 1

        delta = 0
        for s in range(S):
            v = V[s]

            values_after_action = np.zeros(A)
            for a in range(A):
                for p, s_, r, terminate in env.P[s][a]: # s_ is next state s'
                    values_after_action[a] += p * (r + gamma * V[s_])
            V[s] = max(values_after_action)

            delta = max(delta, abs(v - V[s]))
        
        if delta < theta:
            if print_n_iter:
                print(f"Value iteration converged in {n_iter} iterations")
            break

    pi = np.zeros(S, dtype=int)  # np.int was removed in newer NumPy releases
    for s in range(S):
        values_after_action = np.zeros(A)
        for a in range(A):
            for p, s_, r, terminate in env.P[s][a]:
                values_after_action[a] += p * (r + gamma * V[s_])
        pi[s] = np.argmax(values_after_action)

    return V, pi

with np.printoptions(precision=3, suppress=True):
    V_star, pi_star = value_iteration(1e-2, 0.9, env, seed=0, print_n_iter=True)
    print("Final value:")
    print(V_star.reshape(4, 4))
    print("Final policy:")
    print(pi_star.reshape(4, 4))
    print(np.array([["\u2190", "\u2193", "\u2192", "\u2191"][a] for a in pi_star]).reshape(4, 4))


Value iteration converged in 7 iterations
Final value:
[[0.59  0.656 0.729 0.656]
 [0.656 0.    0.81  0.   ]
 [0.729 0.81  0.9   0.   ]
 [0.    0.9   1.    0.   ]]
Final policy:
[[1 2 1 0]
 [1 0 1 0]
 [2 1 1 0]
 [0 2 2 0]]
[['↓' '→' '↓' '←']
 ['↓' '←' '↓' '←']
 ['→' '↓' '↓' '←']
 ['←' '→' '→' '←']]


In [21]:
print("Testing: ")
play_agent(env, lambda env: pi_star[env.env.s])

Testing: 

[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 4
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 8
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
--> The result of taking action 2 is:
     S= 9
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
--> The result of taking action 1 is:
     S= 13
     R= 0.0
     p= {'prob': 1.0}
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
--> The result of taking action 2 is:
     S= 14
     R= 0.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
--> The result of taking action 2 is:
     S= 15
     R= 1.0
     p= {'prob': 1.0}
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 0 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Left)
SFFF
FHFH
FFFH
HFF[41mG[0m
--> The result of taking action 0 is:
     S= 15
     R= 0
     p= {'prob': 1.0}
  (Left)
SFFF
FHFH
FFFH
HFF[41mG


> Comment on the difference between the iterations required for policy vs
> value iteration.

----------------------

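Policy iteration required 5 iterations, each of which needed between 2 and 12 policy evaluation sweeps to converge (21 sweeps in total). In comparison, value iteration converged in only 7 sweeps in total, so on this problem it needed far fewer passes over the state space than policy iteration, although each value iteration sweep evaluates all four actions per state where an evaluation sweep evaluates only one.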

----------------------
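
Sweep counts alone hide the different cost per sweep, so a rough back-of-the-envelope comparison may help. The sketch below counts (state, action) lookahead computations using the iteration counts reported above for the non-slippery environment; the cost model (one unit per `P[s][a]` lookup) is a simplifying assumption, not a measurement.

```python
N_STATES, N_ACTIONS = 16, 4

# Sweep counts reported by the runs above (non-slippery environment).
pi_eval_sweeps = [12, 3, 2, 2, 2]  # one entry per policy iteration step
vi_sweeps = 7

# Policy iteration: an evaluation sweep backs up one action per state,
# and each improvement step compares all actions in every state.
pi_backups = (
    sum(pi_eval_sweeps) * N_STATES
    + len(pi_eval_sweeps) * N_STATES * N_ACTIONS
)

# Value iteration: every sweep compares all actions in every state,
# plus one final greedy pass to extract the policy.
vi_backups = vi_sweeps * N_STATES * N_ACTIONS + N_STATES * N_ACTIONS

print(sum(pi_eval_sweeps), pi_backups)  # 21 656
print(vi_sweeps, vi_backups)            # 7 512
```

Under this cost model value iteration still does less work here, but the gap is much smaller than the raw sweep counts suggest.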

In [30]:
# Optional: instead of the above environment, use the "slippery" Frozen Lake via
env = gym.make("FrozenLake-v0")

# policy_iteration
with np.printoptions(precision=3, suppress=True):
    V_star, pi_star = policy_iteration(1e-64, 0.9, env, seed=0, print_n_iter=True)
    print("Final value:")
    print(V_star.reshape(4, 4))
    print("Final policy:")
    print(pi_star.reshape(4, 4))
    print(np.array([["\u2190", "\u2193", "\u2192", "\u2191"][a] for a in pi_star]).reshape(4, 4))

print("Testing: ")
play_agent(env, lambda env: pi_star[env.env.s])

Policy iteration: iteration 1
policy_evaluation converged in 123 iterations
Policy iteration: iteration 2
policy_evaluation converged in 44 iterations
Policy iteration: iteration 3
policy_evaluation converged in 49 iterations
Final value:
[[0.008 0.01  0.02  0.006]
 [0.018 0.    0.055 0.   ]
 [0.049 0.108 0.164 0.   ]
 [0.    0.147 0.382 0.   ]]
Final policy:
[[1 3 0 3]
 [0 0 0 0]
 [3 1 0 0]
 [0 2 2 0]]
[['↓' '↑' '←' '↑']
 ['←' '←' '←' '←']
 ['↑' '↓' '←' '←']
 ['←' '→' '→' '←']]
Testing: 

[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 0
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 0
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 1 is:
     S= 4
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 0
     R= 0.0
     p= {'p

In [29]:
# value_iteration

with np.printoptions(precision=3, suppress=True):
    V_star, pi_star = value_iteration(1e-64, 0.9, env, seed=0, print_n_iter=True)
    print("Final value:")
    print(V_star.reshape(4, 4))
    print("Final policy:")
    print(pi_star.reshape(4, 4))
    print(np.array([["\u2190", "\u2193", "\u2192", "\u2191"][a] for a in pi_star]).reshape(4, 4))

print("Testing: ")
play_agent(env, lambda env: pi_star[env.env.s])

Value iteration converged in 202 iterations
Final value:
[[0.069 0.061 0.074 0.056]
 [0.092 0.    0.112 0.   ]
 [0.145 0.247 0.3   0.   ]
 [0.    0.38  0.639 0.   ]]
Final policy:
[[0 3 0 3]
 [0 0 0 0]
 [3 1 0 0]
 [0 2 1 0]]
[['←' '↑' '←' '↑']
 ['←' '←' '←' '←']
 ['↑' '↓' '←' '←']
 ['←' '→' '↓' '←']]
Testing: 

[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 4
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 0
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 4
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 4
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
--> The result of taking action 0 is:
     S= 0
     R= 0.0
     p= {'prob': 0.3333333333333333}
  (Left)
[41mS[0mFF