# Practical 4: Infinite Horizon Dynamic Programming

Author: FIRSTNAME  LASTNAME

Student Number: n00000000

### Learning Outcomes:
- Infinite horizon dynamic programming
- Value Iteration
- Policy Iteration

We will require the following libraries for this practical (import all of them before running the rest of the code):

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import time

import gymnasium as gym


import os
from IPython.display import clear_output

## Part A: Value Iteration

### Example 1: Grid World
Consider a robot navigating a grid-based environment. Each cell in the grid represents a distinct state. In each cell the robot can take one of four deterministic actions, "up," "down," "left," and "right," each of which moves it exactly one cell in the corresponding direction. Actions that would take the robot off the grid are not allowed. Certain states (orange) correspond to undesirable conditions, such as rough terrain, while one state (green) represents the goal.

Upon reaching the goal state, the robot gains a reward of 1. Conversely, traversing rough terrain incurs a penalty (a negative reward) of 10. In addition, every move the robot makes incurs a penalty of 1. The robot's objective is to reach the goal state while maximizing the total reward (equivalently, minimizing the total penalty). This means both avoiding rough terrain and taking a short path through the grid.
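
To make "total reward" concrete, the sketch below computes a discounted sum of rewards for two made-up reward sequences. The helper name `discounted_return`, the sequences themselves, and the per-step reward convention are illustrative assumptions rather than part of the practical; the discount factor 0.9 matches the `alpha` used later in Part A.

```python
def discounted_return(rewards, alpha=0.9):
    """Sum of alpha**t * r_t over a reward sequence (hypothetical helper)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (alpha ** t) * r
    return total

# Two made-up reward sequences, not derived from the grid in the figure:
# a short path that crosses rough terrain once, and a longer path that avoids it.
short_path = [-1, -10, 1]
long_path = [-1, -1, -1, -1, 1]

print(discounted_return(short_path))
print(discounted_return(long_path))
```

Under these assumed numbers the longer path has the larger return, which shows how a single large penalty can outweigh several small step penalties.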

<img src="grid_world.png" alt="Image" width="300" height="300" />

### Q1
Observe the grid world. What do you expect the optimal action to be in each cell?

< Answer Here >

### Q2
Complete the following code to implement the value iteration algorithm for the grid world problem. Print the outcomes of the optimal value function and the corresponding optimal policy.
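
For reference, one common form of the value iteration update for a deterministic environment is sketched below. The notation is an assumption and may differ from your lecture notes: $f(s, a)$ denotes the successor state, $r(s, a, s')$ the reward, $\mathcal{A}(s)$ the set of valid actions in state $s$, and $\alpha$ the discount factor (`alpha` in the code).

$$
V_{k+1}(s) = \max_{a \in \mathcal{A}(s)} \left[ r(s, a, s') + \alpha \, V_k(s') \right], \qquad s' = f(s, a)
$$

The loop stops once $\max_s \lvert V_{k+1}(s) - V_k(s) \rvert < \epsilon$, which is the `delta < epsilon` check in the code below.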

In [None]:
# Define the grid world as a matrix using np.array. Each entry corresponds to the reward of that cell.
grid = np.array([
    [0, 0, -10, 0],
    [0, 0, 0, 0],
    [0, 0, -10, 0],
    [0, 0, -10, 1]
])

In [None]:
# Initialize the value function as a zero matrix with the same shape as the grid.
values = np.zeros_like(grid, dtype=float)

In [None]:
# Define the function that returns the next state. The action is one of "up", "down", "left", "right".
def get_next_state(i, j, action):
    
    # Hint: (i,j) represents the position. Change "i" or "j" for each action.
    ### START CODE HERE ###

    
    
    
    
    ### END CODE HERE ###

In [None]:
# Define the function that checks whether a state is valid. States outside the grid are not valid. Returns a Boolean.
def is_valid_state(i, j, grid):
    rows, cols = grid.shape
    return 0 <= i < rows and 0 <= j < cols

In [None]:
# Perform value iteration
alpha = 0.9  # Discount factor
epsilon = 1e-5  # Convergence threshold

while True:
    delta = 0
    new_values = np.copy(values)
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            if grid[i, j] == 1:  # Terminal state
                continue

            # Hint: use "for action in ['up', 'down', 'left', 'right']:" to update the value function
            ### START CODE HERE ###

            
            
            
            
            
            ### END CODE HERE ###

    values = new_values
    if delta < epsilon:
        break

In [None]:
# Obtain the optimal policy
policy = np.empty_like(grid, dtype='<U5')  # Unicode strings with length 5
for i in range(grid.shape[0]):
    for j in range(grid.shape[1]):
        if grid[i, j] == 1:
            policy[i, j] = 'T'  # Terminal state
            continue

        best_action = None
        best_value = float('-inf')

        for action in ['up', 'down', 'left', 'right']:
            
            # Hint: select the action with the maximum value
            ### START CODE HERE ###

            
            
            
            
            ### END CODE HERE ###

        policy[i, j] = best_action

In [None]:
print(values)
print(policy)

### Q3
Does the optimal policy align with your initial expectations?

< Answer Here >

### Q4
Now, let's examine an alternative scenario where the penalty of traversing rough terrain is 1, and the reward for reaching the goal state is 10. What do you intuit as the optimal policy?

<img src="grid_world2.png" alt="Image" width="300" height="300" />

< Answer Here >

### Q5
Modify the code above to implement value iteration for the revised scenario. 

### Q6
Assess whether the resulting optimal policy aligns with your intuition from Q4. Explain the observed outcome in relation to your intuition.

< Answer Here >

### Q7
Now, let's introduce a new consideration: the orange states are holes. If the robot falls into a hole, the game is reset and the robot is returned to the starting point (0,0). What do you expect the optimal policy to be in the two previous scenarios: one with a penalty of 10 and a reward of 1, and the other with a penalty of 1 and a reward of 10?

< Answer Here >

### Q8
Implement the value iteration algorithm for the scenario where the robot resets upon falling into holes. Print the optimal value function and optimal policy outcomes. Do these results align with your initial expectations?

< Answer Here >

### Example 2: Frozen Lake

Frozen Lake is a Gymnasium environment in which the player crosses a frozen lake from start to goal without falling into any holes. Because the ice is slippery, the player may not always move in the intended direction. (See the documentation: https://gymnasium.farama.org/environments/toy_text/frozen_lake/)

The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far corner of the world, e.g. [3,3] for the 4x4 environment. The player makes moves until they reach the goal or fall into a hole.
The observation is the player’s current position. The action space consists of four actions: left, down, right, and up, encoded as listed below. The reward for reaching the goal is 1; otherwise it is 0.

0: Move left

1: Move down

2: Move right

3: Move up
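
The Frozen Lake documentation describes the observation as a single integer computed as `current_row * ncols + current_col`. The sketch below converts between that integer and a (row, column) cell, and pairs each action code listed above with a row/column offset. The names `N_COLS`, `ACTION_OFFSETS`, `state_to_cell`, and `cell_to_state` are hypothetical helpers introduced only for illustration.

```python
N_COLS = 4  # assumed width of the 4x4 map

# Row/column offsets for the action encoding listed above
# (rows grow downwards, columns grow to the right).
ACTION_OFFSETS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def state_to_cell(state, n_cols=N_COLS):
    """Convert a flat state index to a (row, col) pair."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Convert a (row, col) pair to a flat state index."""
    return row * n_cols + col

print(state_to_cell(6))     # (1, 2)
print(cell_to_state(3, 3))  # 15
```

This conversion is handy when reshaping a flat value function or policy into a grid for printing.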


In [None]:
# This is a simple example of the gymnasium interface. You can run this cell to visualize the environment.
os.environ["SDL_VIDEODRIVER"] = "dummy"

env = gym.make("FrozenLake-v1", render_mode="rgb_array")
env.action_space.seed(42)

observation, info = env.reset(seed=42)

for _ in range(20):
    action = env.action_space.sample()  # this is where you would insert your policy
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

    # Draw the current frame (env.render() returns an RGB array in "rgb_array" mode)
    clear_output(wait=True)
    plt.imshow(env.render())
    plt.show()

env.close()

### Q9
We first consider the deterministic Frozen Lake (use the argument is_slippery=False). Observe the Frozen Lake environment and predict the optimal policy. What is the optimal action at each position? Explain why you chose this action.

< Answer Here >

### Q10
Implement the value iteration algorithm to obtain an optimal policy for the Frozen Lake environment.

In [None]:
# Create the environment of Frozen Lake
env = gym.make("FrozenLake-v1", is_slippery=False, map_name="4x4",  desc=["SFFF", "FHFH", "FFFH", "HFFG"])

In [None]:
alpha = 0.99  # Discount factor
epsilon = 1e-5  # Convergence threshold

num_states = env.observation_space.n
num_actions = env.action_space.n
V = np.zeros(num_states)  # Initialize the value function for each state
t=0
start_time = time.time()
while True:
    t+=1
    delta = 0

    # Update the value function for each state
    for s in range(num_states):
        v = V[s]
        
        # Compute the value for each action in the current state
        q_values = []
        for a in range(num_actions):
            q_value = 0
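            # env.unwrapped exposes the transition model P without relying on wrapper attribute forwarding
            for prob, next_state, reward, _ in env.unwrapped.P[s][a]: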
                ### START CODE HERE ###
                
                ### END CODE HERE ###
            q_values.append(q_value)

        # Choose the action that maximizes the value
        V[s] = max(q_values)

        # Compute the difference between the new and old value
        delta = max(delta, np.abs(v - V[s]))

    # Check if the value function has converged
    if delta < epsilon:
        break
end_time = time.time()
execution_time = end_time - start_time
print("Number of iterations:", t)
print("Runtime per iteration (s):", execution_time/t)

In [None]:
# Obtain the optimal policy
policy = np.zeros(num_states, dtype=int)
for s in range(num_states):
    
    # Hint: use your value function "V" to choose the action that maximizes the value in the current state
    ### START CODE HERE ###

    
    
    
    
    ### END CODE HERE ###

optimal_values = V
optimal_policy = policy
print("Optimal Policy:")
print(optimal_policy.reshape((4, 4)))
print("\nOptimal Value Function:")
print(optimal_values.reshape((4, 4)))

In [None]:
# Evaluate the agent trained by value iteration
os.environ["SDL_VIDEODRIVER"] = "dummy"

env = gym.make("FrozenLake-v1", render_mode="rgb_array", is_slippery=False, map_name="4x4",  desc=["SFFF", "FHFH", "FFFH", "HFFG"])  # Establish again a visual environment
env.action_space.seed(42)

observation, info = env.reset(seed=42)
done = False

while not done:
    ### START CODE HERE ###
    action =              # this is where you would insert your policy
    ### END CODE HERE ###
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        done = True  # No reset here, so the final frame shows where the episode ended

    clear_output(wait=True)
    plt.imshow(env.render())
    plt.show()

env.close()

### Q11
Does the observed optimal policy match your initial expectations? Provide an explanation for the alignment or any disparities that you may have observed.

< Answer Here >

## Part B: Policy Iteration

### Q12
Complete the following code to implement the policy iteration algorithm and obtain an optimal policy for the Frozen Lake environment.
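
For reference, the two alternating steps of policy iteration are commonly written as below. The notation is an assumption and may differ from your lecture notes: $p(s' \mid s, a)$ is the transition probability, $r(s, a, s')$ the reward, and $\alpha$ the discount factor (`alpha` in the code).

Policy evaluation for a fixed policy $\pi$:

$$
V^{\pi}(s) = \sum_{s'} p(s' \mid s, \pi(s)) \left[ r(s, \pi(s), s') + \alpha \, V^{\pi}(s') \right]
$$

Policy improvement:

$$
\pi'(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \alpha \, V^{\pi}(s') \right]
$$

The algorithm terminates when the improvement step leaves the action unchanged in every state.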

In [None]:
# Create the environment of Frozen Lake
env = gym.make("FrozenLake-v1", is_slippery=False, map_name="4x4", desc=["SFFF", "FHFH", "FFFH", "HFFG"])

In [None]:
alpha = 0.9  # Discount factor
epsilon = 1e-5  # Convergence threshold

num_states = env.observation_space.n
num_actions = env.action_space.n
V = np.zeros(num_states)  # Initialize the value function for each state
policy = np.random.randint(low=0, high=num_actions, size=num_states)
t=0
# Policy Iteration algorithm
start_time = time.time()
while True:
    t+=1
    while True:
        delta = 0
        for s in range(num_states):
            
            # Policy evaluation
            v = V[s]
            action = policy[s]
            q_value = 0
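            # env.unwrapped exposes the transition model P without relying on wrapper attribute forwarding
            for trans_prob, next_state, reward, done in env.unwrapped.P[s][action]: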
                ### START CODE HERE ###
                
                ### END CODE HERE ###
            V[s] = q_value
            
            delta = max(delta, np.abs(v - V[s]))
        if delta < epsilon:
            break

    policy_stable = True
    for s in range(num_states):
        
        old_action = policy[s]
        q_values = np.zeros(num_actions)
        # Hint: policy improvement
        ### START CODE HERE ###
        for a in range(num_actions):

        
        
        ### END CODE HERE ###
        
        
        # Hint: termination condition. If the new action equals the old action in every state, the policy is stable and the iteration terminates.
        ### START CODE HERE ###

        

        ### END CODE HERE ###

    if policy_stable:
        break
end_time = time.time()
execution_time = end_time - start_time
print("Number of iterations:", t)
print("Runtime per iteration (s):", execution_time/t)
print("Optimal Policy:")
print(policy.reshape((4, 4)))
print("\nOptimal Value Function:")
print(V.reshape((4, 4)))

### Q13
Evaluate the optimal policy obtained by policy iteration in the Frozen Lake environment.

In [None]:
# Evaluate the agent trained by policy iteration
os.environ["SDL_VIDEODRIVER"] = "dummy"

env = gym.make("FrozenLake-v1", render_mode="rgb_array", is_slippery=False, map_name="4x4",  desc=["SFFF", "FHFH", "FFFH", "HFFG"])  # Establish again a visual environment
env.action_space.seed(42)

observation, info = env.reset(seed=42)
done = False

while not done:
    ### START CODE HERE ###
    action =              # this is where you would insert your policy
    ### END CODE HERE ###
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        done = True  # No reset here, so the final frame shows where the episode ended

    clear_output(wait=True)
    plt.imshow(env.render())
    plt.show()

env.close()

## Part C: Analysis

### Q14
For the Frozen Lake scenario, do the outcomes of value iteration and policy iteration align? Provide an explanation for your observation.

< Answer Here >

### Q15
In Q10 and Q12, print the number of iterations and the runtime per iteration for both algorithms. Compare the differences between these two algorithms based on the iteration count and the time taken for each iteration.
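
The cells above measure runtime with `time.time()`. As an optional alternative, the sketch below uses `time.perf_counter()`, a monotonic clock intended for measuring short durations. The helper `time_per_iteration` and the stand-in workload are hypothetical and only illustrate the timing pattern.

```python
import time

def time_per_iteration(step, n_iterations):
    """Average wall-clock seconds per call of `step` (hypothetical helper)."""
    start = time.perf_counter()
    for _ in range(n_iterations):
        step()
    return (time.perf_counter() - start) / n_iterations

# Stand-in workload; replace it with one sweep of your own algorithm.
avg = time_per_iteration(lambda: sum(range(1000)), n_iterations=100)
print(f"{avg:.2e} s per iteration")
```

Note that one "iteration" of policy iteration contains a full policy evaluation loop, so per-iteration times of the two algorithms are not directly comparable without taking that into account.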

< Answer Here >

### Q16
Now, consider the stochastic Frozen Lake environment (set the argument "is_slippery=True"). The lake is slippery, so the player may sometimes move perpendicular to the intended direction. For example, if the action is left and is_slippery is True, then:
- P(move left)=1/3
- P(move up)=1/3
- P(move down)=1/3
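
The probabilities above can be tabulated for any intended action with the sketch below. It assumes the equal 1/3 split stated above and the action encoding from Example 2 (0: left, 1: down, 2: right, 3: up); the modular arithmetic is one illustrative way to pick the two perpendicular directions, and `slippery_outcomes` is a hypothetical helper name.

```python
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def slippery_outcomes(action):
    """Return {direction actually taken: probability} for an intended action."""
    outcomes = {}
    # The neighbours of `action` in the cyclic order left, down, right, up
    # are exactly the two directions perpendicular to it.
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        outcomes[ACTION_NAMES[actual]] = 1 / 3
    return outcomes

print(slippery_outcomes(0))  # intended action: left
```

For the intended action left, the three outcomes are up, left, and down, in agreement with the list above.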

Evaluate the optimal policy obtained in the deterministic scenario on the stochastic scenario. Does it work? Why?

< Answer Here >

### Q17
Implement both value and policy iteration for the stochastic environment, and observe the optimal policy.

< Answer Here >

### Q18
Discuss the similarities and differences between the optimal policies in the deterministic and stochastic scenarios.

< Answer Here >

### Q19
Change the map size, randomize the locations of the holes, and compute the optimal policy.

< Answer Here >

### Q20
Suppose that the grid is extremely large, with a large number of states. What are the deficiencies of value iteration and policy iteration in this setting? Discuss how an optimal policy could be obtained.

< Answer Here >