## Frozen Lake with Value Iteration

- This is an implementation of **Value Iteration**, a core dynamic programming algorithm for solving ***Markov Decision Processes (MDPs)***.
- We will apply Value Iteration to the FrozenLake-v1 environment in Gymnasium to find the optimal policy, i.e. the one that maximizes the
expected discounted return.

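Value Iteration folds the Policy Evaluation and Policy Improvement steps into a single update.
It repeatedly sweeps over all states, replacing each state's value with the best expected one-step
return over the available actions, until the largest change in a sweep falls below a threshold.
At that point the value function approximates the optimal one, and the policy is derived by acting
greedily with respect to it.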
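
For reference, the update described above can be written as the Bellman optimality backup. The notation is an assumption, since the text does not define symbols: $V_k$ is the value estimate after sweep $k$, $P(s' \mid s, a)$ the transition probability, $R(s, a, s')$ the reward, and $\gamma$ the discount factor.

```latex
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_k(s') \bigr]

\pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr]
```

Note that the probability multiplies the whole bracket, reward included. The code in this notebook updates the value table in place during a sweep rather than keeping separate $V_k$ and $V_{k+1}$ arrays; that is a common variant that converges to the same fixed point.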

## Overview

You will be working with the FrozenLake-v1 environment, a simple grid world where the agent has to
move from the starting point to the goal while avoiding holes. The agent can take four actions (left,
down, right, up). Your task is to:

1. Apply Value Iteration to compute the optimal value function.
2. Extract the optimal policy from the value function.
3. Test the optimal policy by simulating the agent's behavior in the environment.

## Step-by-Step Instructions

## Dependencies

In [1]:
! pip install "gymnasium[toy-text]" --quiet

In [2]:
import gymnasium as gym
import numpy as np

## FrozenLake-v1 environment (deterministic for simplicity)

In [3]:
env = gym.make("FrozenLake-v1", render_mode="human", is_slippery=False)
env.reset()

(0, {'prob': 1})

### Initialise Parameters

In [4]:
gamma = 0.99 # Discount factor
theta = 1e-6 # Convergence threshold
value_table = np.zeros(env.observation_space.n) # Initialize value function for all states
num_actions = env.action_space.n # Number of actions available in the environment

### Implement Value Iteration Algorithm

In [None]:
def value_iteration():
    # Transition model: P[state][action] -> list of (prob, next_state, reward, done)
    P = env.unwrapped.P

    while True:
        delta = 0

        # Iterate over all states
        for state in range(env.observation_space.n):
            v = value_table[state]
            max_value = float('-inf')

            # Iterate over all actions to find the maximum expected value for each state
            for action in range(num_actions):
                action_value = 0
                # Sum over all possible next states
                for prob, next_state, reward, done in P[state][action]:
                    action_value += prob * (reward + gamma * value_table[next_state])
                # Compare only after the full expectation for this action is computed
                max_value = max(max_value, action_value)

            # Update the value table for the current state
            value_table[state] = max_value
            delta = max(delta, abs(v - value_table[state]))

        # Check for convergence after a full sweep over the states
        if delta < theta:
            break
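
To see the same update run end to end without Gymnasium, here is a minimal sketch on a hypothetical three-state chain. The MDP, the function name `toy_value_iteration`, and the values `gamma=0.9` and `theta=1e-8` are illustrative assumptions, not part of the notebook; only the `(prob, next_state, reward, done)` tuple layout is taken from the code above.

```python
# Hypothetical 3-state chain MDP, laid out like the transition table used above:
# P[state][action] -> list of (prob, next_state, reward, done)
# Actions: 0 = stay, 1 = move right. State 2 is a terminal goal.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def toy_value_iteration(P, gamma=0.9, theta=1e-8):
    """In-place value iteration; returns the value table and the sweep count."""
    values = {s: 0.0 for s in P}
    sweeps = 0
    while True:
        delta = 0.0
        for s in P:
            # Expected one-step return of each action, then the best of them
            best = max(
                sum(prob * (reward + gamma * values[s2])
                    for prob, s2, reward, _ in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        sweeps += 1
        # Stop once a full sweep changes no value by theta or more
        if delta < theta:
            return values, sweeps


values, sweeps = toy_value_iteration(P)
print(values, sweeps)
```

State 1 is one step from the reward, so its value is 1; state 0 is two steps away, so its value is discounted once to `gamma`. The terminal state keeps value 0 because its self-loop pays nothing.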


## Extract the Optimal Policy

In [None]:
def extract_policy():
    # Initialize the policy array with zeros
    policy = np.zeros(env.observation_space.n, dtype=int)
    
    # Iterate over all states
    for state in range(env.observation_space.n):
        action_values = np.zeros(num_actions)
        
        # Evaluate all actions for the current state
        for action in range(num_actions):
            action_value = 0
            
            # Sum over all possible next states
            for prob, next_state, reward, done in env.unwrapped.P[state][action]:
                action_value += prob * (reward + gamma * value_table[next_state])
            
            # Store the action value
            action_values[action] = action_value
        
        # Choose the action with the highest value
        policy[state] = np.argmax(action_values)
    
    return policy

## Visualizing the Optimal Policy

In [None]:
value_iteration()  # Fill value_table first; until this runs it is still all zeros
optimal_policy = extract_policy()
print("Optimal Policy:")
print(optimal_policy.reshape((4, 4))) # Reshape to visualize as a 4x4 grid
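
A grid of integers is hard to read at a glance, so one option is to print the policy as arrows. This is a small sketch that assumes FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up, matching the order listed in the overview). The helper `policy_to_arrows` and the `example_policy` values are hypothetical and are not the output of this notebook.

```python
# Action encoding assumed from FrozenLake: 0 = left, 1 = down, 2 = right, 3 = up
ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}


def policy_to_arrows(policy, n_cols=4):
    """Return the policy as a list of strings, one row of arrows per grid row."""
    symbols = [ARROWS[int(a)] for a in policy]
    return [" ".join(symbols[i:i + n_cols]) for i in range(0, len(symbols), n_cols)]


# Hypothetical policy for illustration only
example_policy = [1, 2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2, 0]
rows = policy_to_arrows(example_policy)
for row in rows:
    print(row)
```

In the notebook this could be called as `policy_to_arrows(optimal_policy)`. Cells for holes and the goal still show an arrow, but the action there is meaningless because the episode has already ended.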

## Test the Optimal Policy

In [None]:
state = env.reset()[0]
done = False
total_reward = 0

while not done:
    # Follow the optimal policy
    action = optimal_policy[state]
    
    # Take the action and observe the next state and reward
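    next_state, reward, terminated, truncated, _ = env.step(action)
    # The episode ends on reaching the goal or a hole (terminated) or on hitting the step limit (truncated)
    done = terminated or truncated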
    
    # Accumulate the total reward
    total_reward += reward
    
    # Update the current state
    state = next_state
    
    # Optional: Render the environment to visualize the agent's movements
    env.render()

# Print the total reward obtained by following the optimal policy
print(f"Total reward using optimal policy: {total_reward}")
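
A single episode is enough in the deterministic setting, but averaging several episodes with an explicit step cap is a safer habit, since a poor policy can otherwise loop for a long time. The sketch below shows the idea against a transition table directly rather than a Gymnasium environment. The three-state chain, the `rollout` helper, and both policies are hypothetical illustrations.

```python
import random

# Hypothetical 3-state chain: P[state][action] -> list of (prob, next_state, reward, done)
# Actions: 0 = stay, 1 = move right. State 2 is a terminal goal.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def rollout(P, policy, start=0, max_steps=100, rng=None):
    """Simulate one episode from the model and return its undiscounted reward."""
    rng = rng or random.Random(0)
    state, total = start, 0.0
    for _ in range(max_steps):  # the cap keeps a non-terminating policy from looping forever
        transitions = P[state][policy[state]]
        weights = [t[0] for t in transitions]
        # Sample one outcome according to its transition probability
        _, state, reward, done = rng.choices(transitions, weights=weights, k=1)[0]
        total += reward
        if done:
            break
    return total


always_right = {0: 1, 1: 1, 2: 1}
always_stay = {0: 0, 1: 0, 2: 0}

returns = [rollout(P, always_right) for _ in range(10)]
mean_return = sum(returns) / len(returns)
print(f"Mean return (always right): {mean_return}")
print(f"Return (always stay): {rollout(P, always_stay)}")
```

The policy that always moves right reaches the goal every time, while the policy that never moves collects nothing and is stopped by the step cap.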