# Frozen Lake

**"Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend."**

Source: [OpenAI Gym](https://gym.openai.com/envs/FrozenLake-v0/)


In [46]:
# all imports
import gym
import numpy as np

In [2]:
# creating the MDP
wrapper = gym.Wrapper(gym.make("FrozenLake-v0"))


## The MDP has the following form:


![frozen](img/fl.png)




The wrapper exposes attributes that describe the MDP, such as its observation space, action space, and reward range:

In [3]:
print("observation space: {}".format(wrapper.observation_space))
print("Actions space: {}".format(wrapper.action_space))
print("reward range: {}".format(wrapper.reward_range))

observation space: Discrete(16)
Actions space: Discrete(4)
reward range: (-inf, inf)


## Using the render method, we can visualize the agent moving in the environment

- S is start

- F is frozen

- H is hole

- G is goal
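
The observation space is Discrete(16), so each state is a single integer. A minimal sketch of how those integers could map onto the 4x4 map that render prints in this notebook, assuming row-major numbering (state 0 is the top-left tile). The names `LAKE_MAP`, `state_to_cell`, and `tile_at` are illustrative and not part of gym:

```python
# Map copied from the rendered output in this notebook (4x4 layout).
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])


def state_to_cell(state):
    """Convert a state index into (row, col), assuming row-major numbering."""
    return divmod(state, N_COLS)


def tile_at(state):
    """Return the map letter (S, F, H or G) under a given state index."""
    row, col = state_to_cell(state)
    return LAKE_MAP[row][col]


holes = [s for s in range(16) if tile_at(s) == "H"]
goal = [s for s in range(16) if tile_at(s) == "G"]
print("holes:", holes)  # holes: [5, 7, 11, 12]
print("goal:", goal)    # goal: [15]
```

Under this assumption, an episode ends when the agent reaches state 15 (the goal) or any of the four hole states.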

In [4]:
plan = [1,1,1,2]

wrapper.reset()
wrapper.render()

for action in plan:
    obs, reward, done, info = wrapper.step(action)
    wrapper.render()


[S]FFF
FHFH
FFFH
HFFG
  (Down)
S[F]FF
FHFH
FFFH
HFFG
  (Down)
SFFF
F[H]FH
FFFH
HFFG
  (Down)
SFFF
F[H]FH
FFFH
HFFG
  (Right)
SFFF
F[H]FH
FFFH
HFFG
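
In the trace above, the first Down action moved the agent to the right instead of down, which is the slipperiness at work. The sketch below illustrates one way to model it, assuming the agent moves in the intended direction or in one of the two perpendicular directions with equal probability, and assuming the action encoding 0=Left, 1=Down, 2=Right, 3=Up. The helper names are hypothetical and do not come from gym:

```python
import random

SIZE = 4
# Assumed action encoding: 0=Left, 1=Down, 2=Right, 3=Up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(state, action):
    """Deterministic move; walking into the border leaves the agent in place."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    return row * SIZE + col


def possible_outcomes(state, action):
    """States reachable when the action may slip to a perpendicular direction."""
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return sorted({move(state, a) for a in candidates})


def slippery_step(state, action, rng=random):
    """Sample one transition, assumed uniform over the three directions."""
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    return move(state, actual)


print(possible_outcomes(0, 1))  # [0, 1, 4]
```

With this model, choosing Down in the start state can leave the agent in place (state 0), move it right (state 1), or move it down (state 4), which is consistent with the first step of the trace.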


## Playing 200 episodes with random actions

In [49]:
total_reward = 0

episodes = 200

for _ in range(episodes):
    done = False
    wrapper.reset()
    while not done:
        action = wrapper.action_space.sample()
        _, reward, done, _ = wrapper.step(action)
        total_reward += reward
        
print("Average reward = {}".format(total_reward/episodes))

Average reward = 0.01
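
The same evaluation loop is reused for every policy in this notebook, so it can be factored into a helper. This is a sketch that assumes the older gym API used here, where `reset()` returns an observation and `step()` returns `(obs, reward, done, info)`. `average_reward` and `CoinFlipEnv` are hypothetical names; the stand-in environment exists only so the example runs without gym:

```python
import random


def average_reward(env, choose_action, episodes=200):
    """Run `episodes` episodes and return the mean total reward per episode."""
    total_reward = 0.0
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            obs, reward, done, _ = env.step(choose_action(obs))
            total_reward += reward
    return total_reward / episodes


class CoinFlipEnv:
    """Hypothetical one-step environment: reward 1 with probability 0.5."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0

    def step(self, action):
        reward = 1.0 if self.rng.random() < 0.5 else 0.0
        return 0, reward, True, {}


rate = average_reward(CoinFlipEnv(), lambda obs: 0, episodes=1000)
print("Average reward = {:.3f}".format(rate))
```

With the notebook's environment, the call would look like `average_reward(wrapper, lambda obs: wrapper.action_space.sample())`. Because FrozenLake gives a reward of 1 only when the goal is reached, the average reward is the fraction of successful episodes.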


## We can create random policies

In [50]:
def creat_deterministic_policy(env):
    """
    Given an environment with discrete states, return a random
    deterministic policy as a dictionary {state: action}.
    
    :type env: gym.Env
    :rtype: dict {int: int}
    """
    assert isinstance(env.observation_space, gym.spaces.discrete.Discrete)
    number_states = env.observation_space.n
    number_actions = env.action_space.n
    policy = {}
    for state in range(number_states):
        # draw one action uniformly at random for this state
        action = np.random.randint(number_actions,
                                   size=1)[0]
        assert env.action_space.contains(action)
        assert env.observation_space.contains(state)
        policy[state] = action
    return policy


def softmax(x):
    """
    Compute the softmax of a vector of scores x.
    
    :type x: np.array
    :rtype: np.array
    """
    # subtracting the maximum avoids overflow and does not change the result
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def creat_stochastic_policy(env,low=5.0,high=10.0):
    """
    Given an environment with discrete states, return a random
    stochastic policy as a dictionary {state: [probability of each action]}.
    
    :type env: gym.Env
    :rtype: dict {int: [float]}
    """
    assert isinstance(env.observation_space, gym.spaces.discrete.Discrete)
    number_states = env.observation_space.n
    number_actions = env.action_space.n
    policy = {}
    for state in range(number_states):
        # random integer scores in [low, high), one per action
        scores = np.random.randint(low,
                                   high,
                                   size=number_actions)
        # softmax turns the scores into a probability distribution
        probabilities = softmax(scores)
        assert env.observation_space.contains(state)
        policy[state] = probabilities
    return policy
    

## Checking a deterministic policy

In [71]:
policy = creat_deterministic_policy(wrapper)


total_reward = 0

episodes = 200

for _ in range(episodes):
    done = False
    obs = wrapper.reset()
    while not done:
        action = policy[obs]
        obs, reward, done, _ = wrapper.step(action)
        total_reward += reward
        
print("Average reward = {}".format(total_reward/episodes))
print("Policy = ")
print(policy)

Average reward = 0.035
Policy = 
{0: 3, 1: 2, 2: 2, 3: 0, 4: 3, 5: 0, 6: 0, 7: 1, 8: 0, 9: 1, 10: 2, 11: 3, 12: 0, 13: 2, 14: 3, 15: 1}
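
A dictionary of integers is hard to read at a glance. The sketch below draws the policy printed above as arrows on the lake map, assuming row-major state numbering and the action encoding 0=Left, 1=Down, 2=Right, 3=Up. `policy_to_grid` and `ARROWS` are illustrative names, not part of gym:

```python
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # assumed: 0=Left, 1=Down, 2=Right, 3=Up

# Policy copied from the output printed above.
policy = {0: 3, 1: 2, 2: 2, 3: 0, 4: 3, 5: 0, 6: 0, 7: 1,
          8: 0, 9: 1, 10: 2, 11: 3, 12: 0, 13: 2, 14: 3, 15: 1}


def policy_to_grid(policy, lake_map):
    """Return one string per row: an arrow per walkable tile, H and G kept as-is."""
    rows = []
    for r, line in enumerate(lake_map):
        cells = []
        for c, tile in enumerate(line):
            state = r * len(line) + c
            cells.append(tile if tile in "HG" else ARROWS[policy[state]])
        rows.append("".join(cells))
    return rows


grid = policy_to_grid(policy, LAKE_MAP)
print("\n".join(grid))
# ^>><
# ^H<H
# <v>H
# H>^G
```

Actions assigned to hole and goal states are hidden because the episode ends as soon as the agent reaches them.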


## Checking a stochastic policy

In [70]:
policy = creat_stochastic_policy(wrapper)

actions = list(range(wrapper.action_space.n))

total_reward = 0

episodes = 200

for _ in range(episodes):
    done = False
    obs = wrapper.reset()
    while not done:
        probabilities = policy[obs]
        action = np.random.choice(actions, 1, p=probabilities)[0]
        obs, reward, done, _ = wrapper.step(action)
        total_reward += reward
        
print("Average reward = {}".format(total_reward/episodes))
print("Policy = ")
for state, probabilities in policy.items():
    print(state, probabilities)

Average reward = 0.035
Policy = 
0 [ 0.08259454  0.08259454  0.61029569  0.22451524]
1 [ 0.08714432  0.64391426  0.0320586   0.23688282]
2 [ 0.22451524  0.61029569  0.08259454  0.08259454]
3 [ 0.47628706  0.47628706  0.02371294  0.02371294]
4 [ 0.25618664  0.01275478  0.69638749  0.03467109]
5 [ 0.03467109  0.25618664  0.01275478  0.69638749]
6 [ 0.06193488  0.45764028  0.02278457  0.45764028]
7 [ 0.69638749  0.01275478  0.25618664  0.03467109]
8 [ 0.3994863   0.3994863   0.1469628   0.05406459]
9 [ 0.01736167  0.94791499  0.01736167  0.01736167]
10 [ 0.01275478  0.25618664  0.69638749  0.03467109]
11 [ 0.64391426  0.08714432  0.23688282  0.0320586 ]
12 [ 0.09625514  0.71123459  0.09625514  0.09625514]
13 [ 0.3994863   0.3994863   0.1469628   0.05406459]
14 [ 0.02278457  0.45764028  0.06193488  0.45764028]
15 [ 0.44039854  0.05960146  0.05960146  0.44039854]
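
Each row above is a softmax over random integer scores. As a quick check, here is a pure-Python sketch of the same softmax (no NumPy, so it runs on its own). The scores `[5, 5, 7, 6]` are an assumed example chosen because they reproduce row 0; the actual scores drawn in the run above are not known:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Softmax depends only on score differences, so [5, 5, 7, 6]
# and [0, 0, 2, 1] give the same distribution.
probs = softmax([5, 5, 7, 6])
print([round(p, 4) for p in probs])  # [0.0826, 0.0826, 0.6103, 0.2245]
```

Since the scores are integers between 5 and 9, only a few distinct probability patterns are possible, which is why several rows in the output are permutations of each other.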
