In this part, our goal is to reach the goal state G from the starting state S without visiting the hole states H. If the agent visits a hole state H on the way, it falls into the hole and dies, as the figure shows:

![image.png](attachment:image.png)

• S denotes the starting state

• F denotes the frozen states

• H denotes the hole states

• G denotes the goal state

## Solving the problem with value iteration
In the previous part, we learned about the Frozen Lake environment. We want the agent to reach the goal state G while avoiding the hole states H.
How can we achieve this? We learned that the optimal policy tells the agent which
action to perform in each state. So, if we find the optimal policy, the agent can
reach state G from S without visiting state H. To find the optimal policy, we
can use the value iteration method we just learned.

Remember that all 16 states (S to G) are encoded from 0 to 15, and the four
actions (left, down, right, up) are encoded from 0 to 3 in the Gym toolkit.
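
As a small illustration of this encoding, the sketch below maps a state index to its grid position and an action index to a name. It assumes the default 4x4 map with states numbered row by row; the names `state_to_cell` and `ACTION_NAMES` are hypothetical helpers introduced here, not part of Gym.

```python
# Hypothetical helpers illustrating the encoding; not part of Gym itself.
N_COLS = 4  # assumes the default 4x4 Frozen Lake map

# Action indices used by Gym's Frozen Lake environment.
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def state_to_cell(state, n_cols=N_COLS):
    """Convert a state index (0..15) into a (row, column) pair."""
    return divmod(state, n_cols)


print(state_to_cell(0))   # start state S, top-left corner
print(state_to_cell(15))  # goal state G, bottom-right corner
print(ACTION_NAMES[2])
```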

In [1]:
# First, let's import the necessary libraries:
import gym
import numpy as np

# Now, let's create the Frozen Lake environment using Gym:
env = gym.make('FrozenLake-v1')
env.reset()
env.render()



The preceding code displays the initial state of the environment:

![image.png](attachment:image.png)

Let's learn how to compute the optimal policy using the value
iteration method.

In the value iteration method, we perform two steps:
1. Compute the optimal value function by taking the maximum over the Q
function, that is

     ![image.png](attachment:image.png)


2. Extract the optimal policy from the computed optimal value function
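
Written out, the two steps correspond to the standard Bellman optimality backup followed by a greedy choice of action. The notation below ($P$ for transition probabilities, $R$ for rewards, $\gamma$ for the discount factor) is an assumption made here and may differ from the symbols used in the figure:

$$
Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V(s')\bigr],
\qquad
V(s) \leftarrow \max_{a} Q(s, a)
$$

$$
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
$$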


## Computing the optimal value function
We will develop a function named `value_iteration` to iteratively compute the optimal value function by maximizing the Q function.

In [2]:
# Note: value_iteration below uses a discount factor of gamma=0.99 and stops
# once the largest change in the value function falls below theta=1e-6.

def one_step_lookahead(state, V, env, gamma=0.99):
    """Return the Q value of every action in `state` under the value estimate V."""
    A = np.zeros(env.action_space.n)
    for action in range(env.action_space.n):
        # env.P[state][action] lists (prob, next_state, reward, done) outcomes
        for prob, next_state, reward, done in env.P[state][action]:
            A[action] += prob * (reward + gamma * V[next_state])
    return A

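def value_iteration(env, gamma=0.99, theta=1e-6):
    """Compute the optimal value function by repeated Bellman optimality backups."""
    V = np.zeros(env.observation_space.n)
    while True:
        delta = 0  # largest change to any state value in this sweep
        for state in range(env.observation_space.n):
            A = one_step_lookahead(state, V, env, gamma)
            best_action_value = np.max(A)
            delta = max(delta, np.abs(best_action_value - V[state]))
            V[state] = best_action_value
        # Stop once every state value changed by less than theta
        if delta < theta:
            break

    return V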
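
To make the backup concrete, here is a minimal sketch of a single one-step lookahead computed on a hand-written transition list. The list mimics the `(prob, next_state, reward, done)` tuples stored in `env.P[state][action]`; the numbers and the name `toy_transitions` are made up for illustration and do not come from the Frozen Lake environment.

```python
gamma = 0.99

# Made-up value estimates for three states (state 2 is terminal, so its value is 0).
V = [0.0, 0.5, 0.0]

# Hypothetical outcomes of one (state, action) pair, in the same
# (prob, next_state, reward, done) layout used by env.P[state][action].
toy_transitions = [
    (0.5, 1, 0.0, False),
    (0.5, 2, 1.0, True),
]

# Q value of the action: sum over outcomes of
# prob * (reward + gamma * V[next_state]).
q_value = sum(prob * (reward + gamma * V[next_state])
              for prob, next_state, reward, done in toy_transitions)

print(q_value)
```

Here the first outcome contributes 0.5 * (0 + 0.99 * 0.5) = 0.2475 and the second contributes 0.5 * (1 + 0) = 0.5, so the action's value is about 0.7475.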

## Extracting the optimal policy from the optimal value function

In the previous step, we computed the optimal value function. Now, we will extract the optimal policy from the computed optimal value function.

We define a function called `extract_policy`, which takes the value function `V`, the
environment `env`, and the discount factor `gamma` as parameters:

In [3]:
def extract_policy(V, env, gamma=0.99):
    """
    Extract the optimal policy from the value function.

    Parameters:
    V: optimal value function.
    env: environment.
    gamma: discount factor.

    Returns:
    policy: optimal policy.
    """
    policy = np.zeros(env.observation_space.n, dtype=int)
    for state in range(env.observation_space.n):
        A = one_step_lookahead(state, V, env, gamma)
        best_action = np.argmax(A)
        policy[state] = best_action
    return policy

## Putting it all together

We learned that in the Frozen Lake environment, our goal is to find the optimal
policy that selects the correct action in each state so that we can reach state G from
state S without visiting the hole states.
First, we compute the optimal value function using our `value_iteration` function by
passing our Frozen Lake environment as the parameter:

In [4]:
env = gym.make('FrozenLake-v1')
optimal_value_function = value_iteration(env)
optimal_policy = extract_policy(optimal_value_function, env)


The previous cell also extracts the optimal policy from the optimal value function using our
`extract_policy` function. Now we print both the policy and the value function:

In [5]:

print("Optimal policy (each number is the optimal action in that state):")
print(optimal_policy)
print("Optimal value function:")
print(optimal_value_function)

Optimal policy (each number is the optimal action in that state):
[0 3 3 3 0 0 0 0 3 1 0 0 0 2 1 0]
Optimal value function:
[0.54201404 0.49878743 0.47067727 0.45683193 0.5584404  0.
 0.35834012 0.         0.59179013 0.64307363 0.61520214 0.
 0.         0.74171617 0.86283528 0.        ]
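
The printed policy is easier to read as a grid. The sketch below lays out the policy values from the output above on the 4x4 map and replaces each action index with an arrow character. It assumes the default 4x4 layout (`SFFF`, `FHFH`, `FFFH`, `HFFG`); the helper name `policy_to_grid` is hypothetical.

```python
# Policy values copied from the output above.
policy = [0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0]

# Assumed default 4x4 Frozen Lake layout.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# ASCII arrow for each Gym action index: 0 left, 1 down, 2 right, 3 up.
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}


def policy_to_grid(policy, lake_map):
    """Return one string per row; holes and the goal keep their letter."""
    n_cols = len(lake_map[0])
    rows = []
    for r, row in enumerate(lake_map):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                cells.append(tile)  # terminal tiles: the action is irrelevant
            else:
                cells.append(ARROWS[policy[r * n_cols + c]])
        rows.append(" ".join(cells))
    return rows


for line in policy_to_grid(policy, LAKE_MAP):
    print(line)
```

Some arrows point away from the goal. This is plausible if the environment is slippery (the default for `FrozenLake-v1`), because the agent then moves in the intended direction only part of the time and the policy favors actions that keep it away from holes.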
