<center>

# Dynamic Programming, Frozen Lake

</center>

In the FrozenLake environment, the agent navigates a 4x4 gridworld.

<img src="../_aux/images/frozen-lake-6.jpg">
Source: http://eskipaper.com/images/frozen-lake-6.jpg

**Env description:**

Winter is here. You and your friends were tossing around a frisbee at the park
when you made a wild throw that left the frisbee out in the middle of the lake.
The water is mostly frozen, but there are a few holes where the ice has melted.

If you step into one of those holes, you'll fall into the freezing water.
At this time, there's an international frisbee shortage, so it's absolutely imperative that
you navigate across the lake and retrieve the disc.

However, the ice is slippery, so you won't always move in the direction you intend.
The surface is described using a grid like the following:
    
S<font color='blue'>FFF</font><br>
<font color='blue'>F</font><font color='red'>H</font><font color='blue'>F</font><font color='red'>H</font><br>
<font color='blue'>FFF</font><font color='red'>H</font><br>
<font color='red'>H</font><font color='blue'>FF</font>G<br>

S : starting point, safe<br>
F : frozen surface, safe<br>
H : hole, fall to your doom<br>
G : goal, where the frisbee is located<br>

The episode ends when you reach the goal or fall in a hole.
You receive a reward of 1 if you reach the goal, and zero otherwise.

### Import Packages

In [14]:
import copy
import os
import numpy as np

import check_test
from frozenlake import FrozenLakeEnv
from plot_utils import plot_values

### Set Environment

In [15]:
env = FrozenLakeEnv(is_slippery=True)

**Environment Note:**

The agent moves through a **$4 \times 4$ gridworld**, with states numbered as follows:
```
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
```
and the agent has **4 potential actions**:
```
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
```
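
The state numbering and action codes above can be tied back to the lake map with a few lines of Python. This is only a sketch: `LAKE_MAP` is copied from the grid in the environment description, while `to_row_col`, `tile`, and `intended_move` are hypothetical helpers that are not part of `frozenlake`. It also assumes that walking into the edge of the grid leaves the agent where it is, and it ignores slipping entirely.

```python
# Map copied from the environment description; rows are read top to bottom.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS = len(LAKE_MAP)
N_COLS = len(LAKE_MAP[0])

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def to_row_col(state):
    """Convert a state index (0..15) into a (row, col) pair."""
    return divmod(state, N_COLS)


def tile(state):
    """Return the map letter (S, F, H or G) for a state index."""
    row, col = to_row_col(state)
    return LAKE_MAP[row][col]


def intended_move(state, action):
    """State reached when the move is NOT disturbed by slipping.

    ASSUMPTION: moving into the border keeps the agent in place.
    """
    row, col = to_row_col(state)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N_COLS + col


all_states = range(N_ROWS * N_COLS)
holes = [s for s in all_states if tile(s) == "H"]
goal = [s for s in all_states if tile(s) == "G"]
print("holes:", holes, "goal:", goal)
```

Under this numbering, the holes sit at states 5, 7, 11 and 12, and the goal is state 15.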



### Preview

In [16]:
# print the state space and action space, total number of states and actions
print(f'state space: {env.observation_space}')
print(f'Action_space: {env.action_space}')
print(f'total number of states: {env.nS}')
print(f'total number of actions: {env.nA}')

state space: Discrete(16)
Action_space: Discrete(4)
total number of states: 16
total number of actions: 4


In [19]:
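# look at the one-step dynamics of the Markov decision process (MDP):
# env.P[state][action] is a list of (prob, next_state, reward, done) tuples;
# here we take the first listed outcome for state 1 and action 0 (LEFT)
prob, next_state, reward, done = env.P[1][0][0]
prob, next_state, reward, done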

(0.3333333333333333, 1, 0.0, False)
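
To make the tuple layout concrete, the sketch below writes out by hand what the full list `env.P[1][0]` might look like. It is an assumption rather than output from the environment: it supposes that the slippery dynamics split the probability equally between the intended direction and the two perpendicular ones. Only the first tuple is confirmed by the output above.

```python
# Hand-written stand-in for env.P[1][0] (state 1, action LEFT).
# ASSUMPTION: slipping sends the agent in the intended direction or in one of
# the two perpendicular directions, each with probability 1/3.
transitions = [
    (1 / 3, 1, 0.0, False),  # slips UP, bumps into the top edge, stays in state 1
    (1 / 3, 0, 0.0, False),  # moves LEFT as intended, lands in state 0
    (1 / 3, 5, 0.0, True),   # slips DOWN into the hole at state 5, episode ends
]

# Each entry is (prob, next_state, reward, done).
total_prob = sum(prob for prob, _, _, _ in transitions)
expected_reward = sum(prob * reward for prob, _, reward, _ in transitions)
terminal_prob = sum(prob for prob, _, _, done in transitions if done)

print("probabilities sum to:", total_prob)
print("expected immediate reward:", expected_reward)
print("chance the episode ends on this step:", terminal_prob)
```

Whatever the actual entries are, the probabilities listed for one state-action pair should sum to 1, which is a quick sanity check to run against the real `env.P`.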

### Policy Evaluation

<img src="../_aux/images/policy-eval.png" width="500">

The `policy_evaluation` function takes four arguments:

- `env`: This is an instance of an OpenAI Gym environment, where `env.P` returns the one-step dynamics.
- `policy`: This is a 2D numpy array with `policy.shape[0]` equal to the number of states (`env.nS`), and `policy.shape[1]` equal to the number of actions (`env.nA`). `policy[s][a]` returns the probability that the agent takes action `a` while in state `s` under the policy.
- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).
- `theta`: This is a very small positive number that is used to decide whether the estimate has sufficiently converged to the true value function (default value: `1e-8`).
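
In terms of these arguments, iterative policy evaluation is usually written as the Bellman expectation backup below. The notation is an assumption added here for reference: $\pi(a \mid s)$ corresponds to `policy[s][a]`, $p(s', r \mid s, a)$ to the `prob` entries of `env.P[s][a]`, $\gamma$ to `gamma`, and $\theta$ to `theta`.

$$
V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V(s') \,\bigr]
$$

The update is swept over all states repeatedly. After each sweep, the largest change

$$
\Delta = \max_{s} \bigl|\, V_{\text{old}}(s) - V_{\text{new}}(s) \,\bigr|
$$

is compared with $\theta$, and the iteration stops once $\Delta < \theta$.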

In [None]:
def policy_evaluation(env, policy, gamma=1, theta=1e-8):
    V = np.zeros(env.nS)                                                         # start with a value of 0 for every state
    while True:
        delta = 0                                                                # largest value change seen in this sweep
        for s in range(env.nS):                                                  # loop over every state
            Vs = 0
            for a, action_prob in enumerate(policy[s]):                          # loop over every action and its probability under the policy
                for prob, next_state, reward, done in env.P[s][a]:               # loop over every possible (next state, reward) outcome
                    Vs += action_prob * prob * (reward + gamma * V[next_state])  # accumulate the expected return of state s
            delta = max(delta, np.abs(V[s]-Vs))                                  # track the largest change in this sweep
            V[s] = Vs                                                            # in-place update: states visited later in the sweep use the new value
        if delta < theta:                                                        # stop once no state value changed by theta or more
            break
    return V                                                                     # estimated state-value function for the given policy

**Note:**<br>
Unlike the simple 2x2 gridworld with the mountain, the slippery FrozenLake environment does not always carry out the selected action, which is why the transition probability `prob` appears in the update. In the gridworld with the mountain that probability was always 1 (FrozenLake can also be made non-slippery with `is_slippery=False`).
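
To watch the update converge in a case where every transition probability is 1, here is a minimal, self-contained sketch. The three-state corridor and the helper `evaluate_on_table` are hypothetical and written only for illustration. The helper follows the same loop as `policy_evaluation` but reads a plain transition table instead of `env`.

```python
import numpy as np


def evaluate_on_table(P, policy, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation on a plain transition table (hypothetical helper).

    P[s][a] is a list of (prob, next_state, reward, done) tuples.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in P[s][a]:
                    v_new += action_prob * prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(V[s] - v_new))
            V[s] = v_new  # in-place update, as in policy_evaluation
        if delta < theta:
            return V


# Deterministic corridor: states 0 and 1 are safe, state 2 is the goal.
# Actions: 0 = LEFT, 1 = RIGHT. Every transition has probability 1.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},  # goal is absorbing
}
policy = np.ones((3, 2)) / 2  # choose LEFT or RIGHT with equal probability

V = evaluate_on_table(P, policy)
print(np.round(V, 4))
```

With `gamma=1` and no holes, the random walk eventually reaches the goal, so both non-terminal states should end up with a value close to 1 while the absorbing goal state stays at 0.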

In [53]:
# equiprobable random policy: every action has probability 1/env.nA in every state
random_policy = np.ones([env.nS, env.nA]) / env.nA
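
A quick way to confirm that such an array is a valid policy is to check its shape and that every row sums to 1. The sketch below hard-codes the sizes printed earlier (16 states, 4 actions) instead of reading them from `env`, which is an assumption made only to keep it self-contained.

```python
import numpy as np

n_states, n_actions = 16, 4  # ASSUMPTION: sizes taken from the printed env.nS and env.nA

# Same construction as the cell above, without the environment object.
random_policy = np.ones([n_states, n_actions]) / n_actions

# Each row is a probability distribution over the actions in one state.
row_sums = random_policy.sum(axis=1)
print(random_policy.shape, np.allclose(row_sums, 1.0))
```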

In [20]:
# evaluate the policy 
V = policy_evaluation(env, random_policy)

plot_values(V)


In [None]:
check_test.run_check('policy_evaluation_check', policy_evaluation)