# Solving the Frozen Lake Problem with Policy Iteration

We learned that in the Frozen Lake environment, our goal is to reach the goal state G from
the starting state S without falling into the hole states H. Now, let's compute the optimal
policy for this environment using the policy iteration method.

First, let's import the necessary libraries:

In [1]:
import gym
import numpy as np

Now, let's create the frozen lake environment using gym:

In [2]:
env = gym.make('FrozenLake-v0')

We learned that policy iteration alternates between two steps: computing the value function
of the current policy, and extracting a new policy from that value function. Once the
extracted policy stops changing, the value function is optimal, and the policy used to
compute it is the optimal policy.

So, first, let's learn how to compute the value function using the policy. 


## Computing value function using policy

This step is almost the same as how we computed the value function in the value iteration
method, with one difference. Here, we compute the value of each state using the action
selected by the policy, whereas in the value iteration method we compute it by taking the
maximum over the Q values of all the actions.
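
To make the difference concrete, the two updates can be written side by side. The notation here is an assumption chosen to mirror the code in this section: $P(s' \mid s, a)$ is the transition probability (`prob`), $R(s, a, s')$ is the reward (`r`), $\gamma$ is the discount factor (`gamma`), and $\pi(s)$ is the action the policy selects in state $s$.

Update used when evaluating a policy:

$$
V(s) \leftarrow \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V(s') \right]
$$

Update used in value iteration:

$$
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]
$$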


Let's define a function called compute_value_function which takes the policy as a
parameter:


In [3]:
def compute_value_function(policy):
    
    #now, let's define the number of iterations
    num_iterations = 1000
    
    #define the threshold value
    threshold = 1e-20
    
    #set the discount factor
    gamma = 1.0
    
    #now, we will initialize the value table, with the value of all states to zero
    value_table = np.zeros(env.observation_space.n)
    
    #for every iteration
    for i in range(num_iterations):
        
        #keep a copy of the value table from the previous iteration; the updates below
        #read the state values from this copy
        updated_value_table = np.copy(value_table)

        #for each state, we select the action according to the given policy and then update
        #the value of the state using the selected action, as shown below

        #for each state
        for s in range(env.observation_space.n):
            
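            #select the action in the state according to the policy; the policy is stored
            #as a float array, so we cast the action to an integer before indexing env.P
            a = int(policy[s])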
            
            #compute the value of the state using the selected action
            value_table[s] = sum([prob * (r + gamma * updated_value_table[s_]) 
                                        for prob, s_, r, _ in env.P[s][a]])
            
        #after computing the value of all the states, we check whether the difference between
        #the value tables of the current and the previous iteration is less than or equal to
        #the threshold. If it is, we break the loop and return the value table as an accurate
        #value function of the given policy
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
            
    return value_table
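
As a quick sanity check of the same update, here is a minimal sketch on a hand-built three-state chain instead of Frozen Lake. Everything in it is hypothetical and only for illustration: `toy_P`, `evaluate_policy`, and the discount factor of 0.9 are assumptions, and the transition table merely imitates the `env.P[s][a]` layout of `(prob, next_state, reward, done)` tuples. It uses plain Python lists so it runs without gym.

```python
def evaluate_policy(P, policy, gamma=0.9, threshold=1e-10, max_iterations=1000):
    """Iteratively compute the state values of a fixed policy."""
    values = [0.0] * len(P)
    for _ in range(max_iterations):
        previous = list(values)  # state values from the previous sweep
        for s in range(len(P)):
            a = policy[s]  # action chosen by the policy in state s
            values[s] = sum(prob * (r + gamma * previous[s_])
                            for prob, s_, r, _ in P[s][a])
        # stop once a full sweep no longer changes the values
        if sum(abs(p - v) for p, v in zip(previous, values)) <= threshold:
            break
    return values

# Hypothetical chain: action 1 moves right, action 0 moves back to state 0.
# Entering state 2 (terminal) from state 1 yields a reward of 1.
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

always_right = [1, 1, 1]
toy_values = evaluate_policy(toy_P, always_right)
print(toy_values)
```

Under these assumptions, state 1 is worth the immediate reward of 1, and state 0 is worth that reward discounted once, which is easy to verify by hand.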

Now that we can compute the value function of a policy, let's see how to extract a
policy from a value function.

## Extracting policy from the value function

This step is exactly the same as how we extracted the policy from the value function in
the value iteration method. Thus, we define a function called extract_policy which
extracts a policy from the given value function, as shown below:
    

In [4]:
def extract_policy(value_table):
    
    #set the discount factor
    gamma = 1.0
     
    #first, we initialize the policy with zeros, that is, we set the action for all the
    #states to zero
    policy = np.zeros(env.observation_space.n)

    #now, we compute the Q function using the given value function. After computing the Q
    #function, we extract the policy by selecting the action which has the maximum Q value
    #in each state. If the given value function is optimal, then the extracted policy is
    #the optimal policy.

    #as shown below, for each state, we compute the Q values for all the actions in the
    #state and then select the action which has the maximum Q value

    #for each state
    for s in range(env.observation_space.n):
        
        #compute the Q value of all the actions in the state
        Q_values = [sum([prob*(r + gamma * value_table[s_])
                             for prob, s_, r, _ in env.P[s][a]]) 
                                   for a in range(env.action_space.n)] 
                
        #extract policy by selecting the action which has maximum Q value
        policy[s] = np.argmax(np.array(Q_values))        
    
    return policy
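
The greedy extraction step can also be checked by hand on a small example. The sketch below is a hypothetical illustration, not part of the Frozen Lake code: `toy_P`, `toy_values`, `greedy_policy`, and the discount factor of 0.9 are assumptions, with the transition table again imitating the `(prob, next_state, reward, done)` layout of `env.P[s][a]`.

```python
def greedy_policy(P, values, gamma=0.9):
    """Pick, in every state, the action with the largest Q value."""
    policy = []
    for s in range(len(P)):
        # Q value of every action available in state s
        q_values = [sum(prob * (r + gamma * values[s_])
                        for prob, s_, r, _ in P[s][a])
                    for a in range(len(P[s]))]
        # index of the first maximum; ties resolve to the lower action index
        policy.append(max(range(len(q_values)), key=q_values.__getitem__))
    return policy

# Hypothetical chain: action 1 moves right, action 0 moves back to state 0.
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

# Assumed state values of the "always move right" policy with gamma = 0.9.
toy_values = [0.9, 1.0, 0.0]
toy_policy = greedy_policy(toy_P, toy_values)
print(toy_policy)
```

In the terminal state both actions have the same Q value, so the choice there is arbitrary. The same is true of the hole and goal states in Frozen Lake.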

## Putting it all together

First, let's define a function called policy_iteration which takes the environment as a
parameter:

In [5]:
def policy_iteration(env):
    
    #set the number of iterations
    num_iterations = 1000
    
    #we learned that in the policy iteration method, we begin by initializing an arbitrary
    #policy. So, we will initialize a policy which selects the action 0 in all the states
    policy = np.zeros(env.observation_space.n)
    
    #for every iteration
    for i in range(num_iterations):
        #compute the value function using the policy
        value_function = compute_value_function(policy)
        
        #extract the new policy from the computed value function
        new_policy = extract_policy(value_function)
           
        #if the policy and new_policy are the same, then the policy has converged,
        #so break the loop
        if np.all(policy == new_policy):
            break
        
        #else, update the current policy to new_policy
        policy = new_policy
        
    return policy
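
To watch the evaluate-then-improve loop converge without gym, the whole procedure can be run on a tiny hand-built problem. This is a minimal sketch under stated assumptions: `toy_P`, `evaluate`, `improve`, `toy_policy_iteration`, and the discount factor of 0.9 are hypothetical names and values introduced only for illustration.

```python
def evaluate(P, policy, gamma, threshold=1e-10, max_sweeps=1000):
    """Compute the state values of a fixed policy by repeated sweeps."""
    values = [0.0] * len(P)
    for _ in range(max_sweeps):
        previous = list(values)
        for s in range(len(P)):
            values[s] = sum(prob * (r + gamma * previous[s_])
                            for prob, s_, r, _ in P[s][policy[s]])
        if sum(abs(p - v) for p, v in zip(previous, values)) <= threshold:
            break
    return values

def improve(P, values, gamma):
    """Return the policy that is greedy with respect to the given values."""
    policy = []
    for s in range(len(P)):
        q = [sum(prob * (r + gamma * values[s_]) for prob, s_, r, _ in P[s][a])
             for a in range(len(P[s]))]
        policy.append(max(range(len(q)), key=q.__getitem__))
    return policy

def toy_policy_iteration(P, gamma=0.9, max_rounds=100):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = [0] * len(P)  # start with action 0 in every state
    for rounds in range(1, max_rounds + 1):
        values = evaluate(P, policy, gamma)
        new_policy = improve(P, values, gamma)
        if new_policy == policy:  # converged: the policy is stable
            break
        policy = new_policy
    return policy, values, rounds

# Hypothetical chain: action 1 moves right, action 0 moves back to state 0.
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

final_policy, final_values, rounds = toy_policy_iteration(toy_P)
print(final_policy, final_values, rounds)
```

On this chain the policy improves one state at a time, starting from the state next to the reward, and the final round only confirms that nothing changes.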


Now, let's perform policy iteration and find the optimal policy in the Frozen Lake
environment.

We just feed the Frozen Lake environment to our policy_iteration function, as shown
below, and get the optimal policy:

In [6]:
optimal_policy = policy_iteration(env)

We can print the optimal policy: 

In [7]:
print(optimal_policy)

[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]


As we can observe, our optimal policy specifies the action to perform in each of the 16
states, where the actions are encoded as 0 (left), 1 (down), 2 (right), and 3 (up). Thus,
we learned how to use the policy iteration method to compute the optimal policy.
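
The flat array is easier to read when laid out on the lake grid. The sketch below is a hypothetical helper (`render_policy`, `ACTION_SYMBOLS`, and `LAKE_MAP` are names introduced here, not part of gym), and it assumes the default 4x4 Frozen Lake map. It copies the policy printed above rather than recomputing it, so it runs without gym.

```python
# Action encoding used by Frozen Lake: 0 = left, 1 = down, 2 = right, 3 = up
ACTION_SYMBOLS = {0: "<", 1: "v", 2: ">", 3: "^"}

# Assumed default 4x4 layout: S = start, F = frozen, H = hole, G = goal
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Policy values copied from the printed output above
printed_policy = [0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.]

def render_policy(policy, lake_map):
    """Draw the policy as a grid, keeping H and G tiles as they are."""
    n_cols = len(lake_map[0])
    rows = []
    for r, row in enumerate(lake_map):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                # episodes end in holes and at the goal, so the action is irrelevant
                cells.append(tile)
            else:
                cells.append(ACTION_SYMBOLS[int(policy[r * n_cols + c])])
        rows.append(" ".join(cells))
    return "\n".join(rows)

rendered = render_policy(printed_policy, LAKE_MAP)
print(rendered)
```

Some of the chosen actions point away from the goal. This is expected if the environment is slippery, as Frozen Lake is by default: the agent does not always move in the intended direction, so the policy favors actions that cannot slide it into a hole.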